Just found a gem on Cray arithmetic, which (rightly) incurred
The Wrath of Kahan:
https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf
"Pessimism comes less from the error-analyst's dour personality--- Synchronet 3.21a-Linux NewsLink 1.2
than from his mental model of computer arithmetic."
I also had to look up "equipollent".
I assume many people in this group know this, but for those who
don't, it is well worth reading.
On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:
Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
of Kahan:
While the arithmetic on the Cray I was bad enough, this document seems
to focus on some later models in the Cray line, which, like the IBM
System/360 when it first came out, before an urgent retrofit, lacked a
guard digit!
The concluding part of that article had a postscript which said that,
while Cray accepted the importance of fixing the deficiencies in future models, there would be no retrofit to existing ones.
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:
Concerning implementing only a part of FP in hardware, and throwing
the rest over the wall to software, Alpha is probably the
best-known example (denormal support only in software) ...
It took many years to figure it out for *DEC* hardware designers.
The hardware designers took many years -- right through the 1990s, I
think -- to be persuaded that IEEE754 really was worth implementing
in its entirety, that the "too hard" or "too obscure" parts were there
for an important reason, to make programming that much easier, and
should not be skipped.
For many non-obvious parts of 754 it's true. For many other parts, esp.
You'll notice that Kahan mentioned Apple more than once, as seemingly
his favourite example of a company that took IEEE754 to heart and
implemented it completely in software, where their hardware vendor of
choice at the time (Motorola), skimped a bit on hardware support.
According to my understanding, Motorola suffered from being early
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
- anton
On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:
Concerning implementing only a part of FP in hardware, and throwing
the rest over the wall to software, Alpha is probably the
best-known example (denormal support only in software) ...
The hardware designers took many years -- right through the 1990s, I think -- to be persuaded that IEEE754 really was worth implementing in its entirety, that the "too hard" or "too obscure" parts were there for an
important reason, to make programming that much easier, and should not be skipped.
You'll notice that Kahan mentioned Apple more than once, as seemingly his favourite example of a company that took IEEE754 to heart and implemented
it completely in software, where their hardware vendor of choice at the
time (Motorola), skimped a bit on hardware support.
Lawrence D'Oliveiro <ldo@nz.invalid> posted:
On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:
Concerning implementing only a part of FP in hardware, and throwing
the rest over the wall to software, Alpha is probably the
best-known example (denormal support only in software) ...
The hardware designers took many years -- right through the 1990s, I think
-- to be persuaded that IEEE754 really was worth implementing in its
entirety, that the "too hard" or "too obscure" parts were there for an
important reason, to make programming that much easier, and should not be
skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
did make it easier--but NaNs, infinities, Underflow at the Denorm level
went in the other direction.
You'll notice that Kahan mentioned Apple more than once, as seemingly his
favourite example of a company that took IEEE754 to heart and implemented
it completely in software, where their hardware vendor of choice at the
time (Motorola), skimped a bit on hardware support.
On 10/13/2025 2:39 AM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the
pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
From what I can gather, the MIPS chip in the N64 also only did a
partial implementation in hardware, with optional software traps for the rest.
Apparently it can be a problem because modern FPUs don't exactly
recreate N64 behavior, and a lot of the games ran without the traps, so
a lot of the N64 games suffer drift and other issues over time (as the programmers had compensated for the MIPS issues in code rather than via traps).
Though, reading some stuff, implies a predecessor chip (the R4000) had a more functionally complete FPU. So, I guess it is also possible that the R4300 had a more limited FPU to make it cheaper for the embedded market.
Well, in any case, my recent efforts in these areas have been mostly:
Trying to hunt down some remaining bugs involving RVC in the CPU core;
RVC is seemingly "the gift that keeps on giving" in this area.
(The more dog-chewed the encoding, the harder it is to find bugs)
Going from just:
"Doing weak/crappy FP in hardware"
To:
"Trying to do less crappy FPU via software traps".
A "mostly traps only" implementation of Binary128.
Doesn't exactly match the 'Q' extension, but that is OK.
I sorta suspect not many people are going to implement Q either.
As I see it though, if the overall cost of the traps remains below 1%,
it is mostly OK.
Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
enough to justify turning them into a syscall like handler. Though, in
this case would likely overlap it with the Page-Fault handler (fallback
path for the TLB Miss handler, which is also being used here for FPU emulation).
Partial issue is mostly that one doesn't want to remain in an interrupt handler for too long because this blocks any other interrupts,
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
The hardware designers took many years -- right through the 1990s,
I think -- to be persuaded that IEEE754 really was worth
implementing in its entirety, that the "too hard" or "too obscure"
parts were there for an important reason,
It took many years to figure it out for *DEC* hardware designers.
Was there any other general-purpose RISC vendor that suffered from
similar denseness?
You'll notice that Kahan mentioned Apple more than once, as
seemingly his favourite example of a company that took IEEE754 to
heart and implemented it completely in software, where their
hardware vendor of choice at the time (Motorola), skimped a bit on
hardware support.
According to my understanding, Motorola suffered from being early
adopters, similarly to Intel. They implemented 754 before the
standard was finished and later on were in the difficult position of
a conflict between compatibility with the standard vs. compatibility
with previous generations. Moto is less forgivable than Intel, because
they were early adopters but not nearly as early.
BGB <cr88192@gmail.com> posted:
On 10/13/2025 2:39 AM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the
pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
From what I can gather, the MIPS chip in the N64 also only did a
partial implementation in hardware, with optional software traps for the
rest.
Apparently it can be a problem because modern FPUs don't exactly
recreate N64 behavior, and a lot of the games ran without the traps, so
a lot of the N64 games suffer drift and other issues over time (as the
programmers had compensated for the MIPS issues in code rather than via
traps).
And this is why FP wants a high-quality implementation.
Though, reading some stuff, implies a predecessor chip (the R4000) had a
more functionally complete FPU. So, I guess it is also possible that the
R4300 had a more limited FPU to make it cheaper for the embedded market.
Well, in any case, my recent efforts in these areas have been mostly:
Trying to hunt down some remaining bugs involving RVC in the CPU core;
RVC is seemingly "the gift that keeps on giving" in this area.
(The more dog-chewed the encoding, the harder it is to find bugs)
Going from just:
"Doing weak/crappy FP in hardware"
To:
"Trying to do less crappy FPU via software traps".
A "mostly traps only" implementation of Binary128.
Doesn't exactly match the 'Q' extension, but that is OK.
I sorta suspect not many people are going to implement Q either.
Do it right or don't do it at all.
As I see it though, if the overall cost of the traps remains below 1%,
it is mostly OK.
While I can agree with the sentiment, the emulation overhead makes this
very hard to achieve indeed.
Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
enough to justify turning them into a syscall like handler. Though, in
this case would likely overlap it with the Page-Fault handler (fallback
path for the TLB Miss handler, which is also being used here for FPU
emulation).
Partial issue is mostly that one doesn't want to remain in an interrupt
handler for too long because this blocks any other interrupts,
At the time of control arrival, interrupts are already reentrant in
My 66000. A higher priority interrupt will take control from the
lower priority interrupt.
On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
The hardware designers took many years -- right through the 1990s,
I think -- to be persuaded that IEEE754 really was worth
implementing in its entirety, that the "too hard" or "too obscure"
parts were there for an important reason,
It took many years to figure it out for *DEC* hardware designers.
Was there any other general-purpose RISC vendor that suffered from
similar denseness?
I thought they all did, just about.
You'll notice that Kahan mentioned Apple more than once, as
seemingly his favourite example of a company that took IEEE754 to
heart and implemented it completely in software, where their
hardware vendor of choice at the time (Motorola), skimped a bit on
hardware support.
According to my understanding, Motorola suffered from being early
adopters, similarly to Intel. They implemented 754 before the
standard was finished and later on were in the difficult position of
a conflict between compatibility with the standard vs. compatibility
with previous generations. Moto is less forgivable than Intel, because
they were early adopters but not nearly as early.
Let's see, the Motorola 68881 came out in 1984
<https://en.wikipedia.org/wiki/Motorola_68881>, while the first
release of IEEE754 dates from two years before
<https://en.wikipedia.org/wiki/IEEE_754>.
On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:
The hardware designers took many years -- right through the 1990s, I
think -- to be persuaded that IEEE754 really was worth implementing in
its entirety, that the "too hard" or "too obscure" parts were there for
an important reason, to make programming that much easier, and should
not be skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
As a programmer, I count all that under my definition of "easier".
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
make it easier--but NaNs, infinities, Underflow at the Denorm level went
in the other direction.
NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.
Denormals -- aren't they called "subnormals" now? -- are also about making
things easier. Providing graceful underflow means a gradual loss of
precision as you get too close to zero, instead of losing all the bits at
once and going straight to zero. It's about the principle of least surprise.
Again, all that helps to make things easier for programmers --
particularly those of us whose expertise of numerics is not on a level
with Prof Kahan.
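A minimal C sketch of both points (the values are arbitrary): a NaN or infinity, once produced, carries through subsequent arithmetic, and dividing the smallest normal double down by two steps gradually through the subnormal range instead of snapping to zero.

#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    double zero = 0.0;

    /* Pathological results propagate instead of aborting the calculation. */
    double bad = 0.0 / zero;           /* NaN  */
    double big = 1.0 / zero;           /* +Inf */
    printf("bad + 1 = %g,  big * 2 = %g\n", bad + 1.0, big * 2.0);

    /* Gradual underflow: halving below DBL_MIN loses precision bit by bit
       through the subnormal range rather than collapsing straight to zero. */
    double x = DBL_MIN;                /* smallest normal double */
    for (int i = 0; i < 5; i++) {
        x /= 2.0;
        printf("x = %g  subnormal? %d\n", x, fpclassify(x) == FP_SUBNORMAL);
    }
    return 0;
}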
Circa 1981 there were the Weitek chips. Wikipedia doesn't say if the
early ones were 754 compatible, but later chips from 1986 intended for
the 386 were compatible, and they seem to have been used by many
(Motorola, Intel, Sun, PA-RISC, ...)
David Brown <david.brown@hesbynett.no> writes:
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
Unfortunately, there are no NaFs (not a flag), and if there were, how
would an IF behave? As a consequence, a<b can have a different result
than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.
Would it be better to trap if a NaN is compared with an ordinary
comparison operator, and to use special NaN-aware comparison operators
when that is actually intended?
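A small C illustration of that pitfall (a and b here are arbitrary): with a NaN operand every ordered comparison is false, so a<b and !(a>=b) are not interchangeable, and an if/else silently falls into the else branch.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a = NAN, b = 1.0;

    printf("a <  b : %d\n", a < b);        /* 0 */
    printf("!(a>=b): %d\n", !(a >= b));    /* 1 -- not the same thing     */
    printf("a == a : %d\n", a == a);       /* 0 -- NaN is not even equal
                                                   to itself              */
    if (a < b)
        printf("then-branch\n");
    else
        printf("else-branch (silently taken on NaN)\n");

    /* isnan()/isunordered() are the NaN-aware tests that C does provide. */
    printf("isunordered(a,b): %d\n", isunordered(a, b));
    return 0;
}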
And what are you doing where it is acceptable to lose some precision
with those numbers, but not to give up and say things have gone badly
wrong (a NaN or infinity, or underflow signal)?
The usual alternative to denormals is not NaN or Infinity (of course
not), or a trap (I assume that's what you mean with "signal"), but 0.
I have a lot of
difficulty imagining a situation where denormals would be helpful and
you haven't got a major design issue with your code
The classical example is the assumption that a<b is equivalent to
a-b<0. It holds if denormals are implemented and fails on
flush-to-zero.
Basically, with denormals more of the usual assumptions hold.
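A sketch of that classical example in C, assuming an x86-64 target where float arithmetic goes through SSE, so the MXCSR flush-to-zero bit (via _MM_SET_FLUSH_ZERO_MODE from <xmmintrin.h>) can stand in for a flush-to-zero FPU; the volatile qualifiers just keep the subtraction from being folded at compile time.

#include <stdio.h>
#include <float.h>
#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (x86 SSE) */

static void check(float a, float b)
{
    printf("a<b: %d   a-b<0: %d   (a-b = %g)\n",
           a < b, (a - b) < 0.0f, (double)(a - b));
}

int main(void)
{
    /* Two normal floats whose difference is subnormal. */
    volatile float a = FLT_MIN;          /* smallest normal float        */
    volatile float b = 1.5f * FLT_MIN;   /* a < b, a-b = -0.5*FLT_MIN    */

    check(a, b);   /* gradual underflow: a<b and a-b<0 agree (1 and 1)   */

    /* Model a flush-to-zero FPU by setting the SSE FTZ bit (x86-specific). */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

    check(a, b);   /* a<b is still 1, but a-b now flushes to -0,
                      so a-b<0 has become 0 */
    return 0;
}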
perhaps
calculations should be re-arranged, algorithms changed, or you should be
using an arithmetic format with greater range (switch from single to
double, double to quad, or use something more advanced).
The first two require more knowledge about FP than many programmers
have, all just to avoid some hardware cost. Not a good idea in any
area where the software crisis* is relevant. The last increases the
resource usage much more than proper support for denormals.
* The Wikipedia article on the software crisis does not give a useful
definition for deciding whether there is a software crisis or not,
and it does not even mention the symptom that was mentioned first
when I learned about the software crisis (in 1986): The cost of
software exceeds the cost of hardware. So that's my decision
criterion: If the software cost is higher than the hardware cost,
the software crisis is relevant; and in the present context, it
means that expending hardware to reduce the cost of software is
justified. Denormal numbers are such a feature.
- anton
On 14/10/2025 09:51, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
Unfortunately, there are no NaFs (not a flag), and if there were, how
would an IF behave? As a consequence, a<b can have a different result
than !(a>=b) if a or b can be a NaN. That's quite contrary to what
programmers tend to expect. So NaNs have their pitfalls.
I entirely agree. If you have a type that has some kind of non-value,
and it might contain that representation, you have to take that into
account in your code. It's much the same thing as having a pointer that
could be a null pointer.
The usual alternative to denormals is not NaN or Infinity (of course
not), or a trap (I assume that's what you mean with "signal"), but 0.
Sure. My thoughts with NaN are that it might be appropriate for a
floating point model (not IEEE) to return a NaN in circumstances where
IEEE says the result is a denormal - I think that might have been a more
useful result.
And my mention of infinity is because often when people
have a very small value but are very keen on it not being zero, it is
because they intend to divide by it and want to avoid division by zero
(and thus infinity).
The classical example is the assumption that a<b is equivalent to
a-b<0. It holds if denormals are implemented and fails on
flush-to-zero.
Basically, with denormals more of the usual assumptions hold.
OK. (I like that aspect of signed integer overflow being UB - more of
your usual assumptions hold.)
However, if "a" or "b" could be a NaN or an infinity, does that
equivalence still hold?
Are you thinking of this equivalence as something the compiler would do
in optimisation, or something programmers would use when writing their code?
I fully agree on both these points. However, I can't help feeling that
if you are seeing denormals, you are unlikely to be getting results from
your code that are as accurate as you had expected - your calculations
are numerically unstable. Denormals might give you slightly more leeway
before everything falls apart, but only a tiny amount.
David Brown <david.brown@hesbynett.no> writes:
On 14/10/2025 09:51, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
Unfortunately, there are no NaFs (not a flag), and if there were, how
would an IF behave? As a consequence, a<b can have a different result
than !(a>=b) if a or b can be a NaN. That's quite contrary to what
programmers tend to expect. So NaNs have their pitfalls.
I entirely agree. If you have a type that has some kind of non-value,
and it might contain that representation, you have to take that into
account in your code. It's much the same thing as having a pointer that
could be a null pointer.
Not really:
* Null pointers don't materialize spontaneously as results of
arithmetic operations. They are stored explicitly by the
programmer, making the programmer much more aware of their
existence.
* Programmers are trained to check for null pointers. And if they
forget such a check, the result usually is that the program traps,
usually soon after the place where the check should have been. With
a NaN you just silently execute the wrong branch of an IF, and later
you wonder what happened.
* The most common use for null pointers is terminating a linked list
or other recursive data structure. Programmers are trained to deal
with the terminating case in their code.
The usual alternative to denormals is not NaN or Infinity (of course
not), or a trap (I assume that's what you mean with "signal"), but 0.
Sure. My thoughts with NaN are that it might be appropriate for a
floating point model (not IEEE) to return a NaN in circumstances where
IEEE says the result is a denormal - I think that might have been a more
useful result.
When a denormal is generated, an underflow "exception" happens (IEEE "exceptions" are not traps). You can set your FPU to trap on a
certain kind of exception. Maybe you can also set it up such that it produces a NaN instead. I doubt that many people would find that
useful, however.
And my mention of infinity is because often when people
have a very small value but are very keen on it not being zero, it is
because they intend to divide by it and want to avoid division by zero
(and thus infinity).
Denormals don't help much here. IEEE doubles cannot represent 2^1024,
but denormals allow representing positive numbers down to 2^-1074.
So, with denormal numbers, the absolute value of your divisor must be
less than 2^-50 to produce a non-infinite result where flush-to-zero
would have produced an infinity.
The classical example is the assumption that a<b is equivalent to
a-b<0. It holds if denormals are implemented and fails on
flush-to-zero.
Basically, with denormals more of the usual assumptions hold.
OK. (I like that aspect of signed integer overflow being UB - more of
your usual assumptions hold.)
Not mine. An assumption that I like is that the associative law
holds. It holds with -fwrapv, but not with overflow-is-undefined.
I fail to see how declaring any condition undefined behaviour would
increase any guarantees.
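A tiny illustration of the associativity point, assuming gcc or clang with -fwrapv (the file name in the comment is only an example): with wrapping defined, both groupings agree even though the intermediate sum overflows; without -fwrapv the left-hand grouping is undefined behaviour.

/* Compile with e.g.: gcc -fwrapv assoc.c */
#include <stdio.h>
#include <limits.h>

int main(void)
{
    int a = INT_MAX, b = 1, c = -1;

    /* With -fwrapv, a+b wraps to INT_MIN and the wrap cancels out again,
       so (a+b)+c == a+(b+c) == INT_MAX.  Without -fwrapv the left-hand
       grouping overflows and the behaviour is undefined. */
    printf("(a+b)+c = %d\n", (a + b) + c);
    printf("a+(b+c) = %d\n", a + (b + c));
    return 0;
}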
However, if "a" or "b" could be a NaN or an infinity, does that
equivalence still hold?
Yes.
If any of them is a NaN, the result is false for either comparison
(because a-b would be NaN, and because the result of any comparison
with a NaN is false).
For infinity there are a number of cases
1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
5) inf<inf (false) vs. inf-inf=NaN<0 (false)
6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
7) inf<-inf (false) vs. inf--inf=inf<0 (false)
8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)
The most interesting case here is 5), because it means that a<=b is
not equivalent to a-b<=0, even with denormal numbers.
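For anyone who wants to run those cases rather than read them, a small C check of the same table (my own enumeration, numbered as above), including the a<=b vs a-b<=0 exception in case 5:

#include <stdio.h>
#include <math.h>

static void cmp(double a, double b)
{
    printf("a=%6g b=%6g | a<b:%d  a-b<0:%d | a<=b:%d  a-b<=0:%d\n",
           a, b, a < b, (a - b) < 0.0, a <= b, (a - b) <= 0.0);
}

int main(void)
{
    double inf = INFINITY, x = 1.0;

    cmp( inf,    x);   /* case 1 */
    cmp(-inf,    x);   /* case 2 */
    cmp(   x,  inf);   /* case 3 */
    cmp(   x, -inf);   /* case 4 */
    cmp( inf,  inf);   /* case 5: a<=b is 1, but inf-inf is NaN,
                          so a-b<=0 is 0 -- the exception noted above */
    cmp(-inf, -inf);   /* case 6 */
    cmp( inf, -inf);   /* case 7 */
    cmp(-inf,  inf);   /* case 8 */
    cmp( NAN,    x);   /* NaN operand: both forms give 0 */
    return 0;
}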
Are you thinking of this equivalence as something the compiler would do
in optimisation, or something programmers would use when writing their code?
I was thinking about what programmers might use when writing their
code. For compilers, having that equivalence may occasionally be
helpful for producing better code, but if it does not hold, the
compiler will just not use such an equivalence (once the compiler is debugged).
This is an example from Kahan that stuck in my mind, because it
appeals to me as a programmer. He has also given other examples that
don't do that for me, but may appeal to a mathematician, physicist or chemist.
I fully agree on both these points. However, I can't help feeling that
if you are seeing denormals, you are unlikely to be getting results from
your code that are as accurate as you had expected - your calculations
are numerically unstable. Denormals might give you slightly more leeway
before everything falls apart, but only a tiny amount.
I think the nicer properties (such as the equivalence mentioned above)
is the more important benefit. And if you take a different branch of
an IF-statement if you have a flush-to-zero FPU, you can easily get a completely bogus result when the denormal case would still have had
enough accuracy by far.
On 10/13/2025 4:53 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 10/13/2025 2:39 AM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the
pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
From what I can gather, the MIPS chip in the N64 also only did a
partial implementation in hardware, with optional software traps for the
rest.
Apparently it can be a problem because modern FPUs don't exactly
recreate N64 behavior, and a lot of the games ran without the traps, so
a lot of the N64 games suffer drift and other issues over time (as the
programmers had compensated for the MIPS issues in code rather than via
traps).
And this is why FP wants a high-quality implementation.
From what I gather, it was a combination of Binary32 with DAZ/FTZ and truncate rounding. Then, with emulators running instead on hardware with denormals and RNE.
But, the result was that the games would work correctly on the original hardware, but in the emulators things would drift; like things like
moving platforms gradually creeping away from the origin, etc.
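A toy reproduction of that kind of drift in C, with nothing N64-specific about it: FE_TOWARDZERO stands in for truncating hardware, the step size is invented, and the volatile accumulator keeps the additions at run time so the rounding mode actually applies (link with -lm if needed).

#include <stdio.h>
#include <fenv.h>

static float accumulate(int rounding_mode)
{
    fesetround(rounding_mode);
    volatile float pos = 0.0f;        /* volatile: do each add at run time */
    for (int i = 0; i < 100000; i++)
        pos += 0.0123f;               /* e.g. a per-frame platform step */
    fesetround(FE_TONEAREST);
    return pos;
}

int main(void)
{
    float nearest  = accumulate(FE_TONEAREST);   /* what an emulator does */
    float truncate = accumulate(FE_TOWARDZERO);  /* stand-in for truncation */
    printf("round-to-nearest:  %f\n", nearest);
    printf("round-toward-zero: %f\n", truncate);
    printf("drift:             %f\n", nearest - truncate);
    return 0;
}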
Though, reading some stuff, implies a predecessor chip (the R4000) had a
more functionally complete FPU. So, I guess it is also possible that the
R4300 had a more limited FPU to make it cheaper for the embedded market.
Well, in any case, my recent efforts in these areas have been mostly:
Trying to hunt down some remaining bugs involving RVC in the CPU core;
RVC is seemingly "the gift that keeps on giving" in this area.
(The more dog-chewed the encoding, the harder it is to find bugs)
Going from just:
"Doing weak/crappy FP in hardware"
To:
"Trying to do less crappy FPU via software traps".
A "mostly traps only" implementation of Binary128.
Doesn't exactly match the 'Q' extension, but that is OK.
I sorta suspect not many people are going to implement Q either.
Do it right or don't do it at all.
?...
The traps route sorta worked OK in a lot of the MIPS era CPUs.
But, it will be opt-in via an FPSCR flag.
If the flag is not set, it will not trap.
Or, is the argument here that sticking with a weaker not-quite-IEEE FPU is preferable to using trap handlers?
For Binary128, real HW support is not likely to happen. The main reason
to consider trap-only Binary128 is more because it has less code
footprint than using runtime calls.
On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:
The hardware designers took many years -- right through the 1990s, I
think -- to be persuaded that IEEE754 really was worth implementing in
its entirety, that the "too hard" or "too obscure" parts were there for
an important reason, to make programming that much easier, and should
not be skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
As a programmer, I count all that under my definition of "easier".
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
make it easier--but NaNs, infinities, Underflow at the Denorm level went
in the other direction.
NaNs and infinities allow you to propagate certain kinds of pathological
results right through to the end of the calculation, in a mathematically
consistent way.
Denormals -- aren't they called "subnormals" now? -- are also about making
things easier. Providing graceful underflow means a gradual loss of
precision as you get too close to zero, instead of losing all the bits at
once and going straight to zero. It's about the principle of least
surprise.
Again, all that helps to make things easier for programmers --
particularly those of us whose expertise of numerics is not on a level
with Prof Kahan.
I see the benefits of NaNs - sometimes you have bad data, and it can be useful to have a representation for that. The defined "viral" nature of NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have checks and conditionals in the middle of your calculations.
But I find it harder to understand why denormals or subnormals are going
to be useful.
Ultimately, your floating point code is approximating arithmetic on real numbers.
Where are you getting your real numbers,
and what calculations are you doing on them, that mean you are getting results that have such a dynamic range that you are using denormals?
And what are you doing where it is acceptable to lose some precision
with those numbers, but not to give up and say things have gone badly
wrong (a NaN or infinity, or underflow signal)? I have a lot of
difficulty imagining a situation where denormals would be helpful and
you haven't got a major design issue with your code - perhaps
calculations should be re-arranged, algorithms changed, or you should be using an arithmetic format with greater range (switch from single to
double, double to quad, or use something more advanced).
David Brown <david.brown@hesbynett.no> posted:
Ultimately, your floating point code is approximating
arithmetic on real numbers.
Don't make me laugh.
The associative law holds fine with UB on overflow,
Well, I think that if your values are getting that small enough to make
denormal results, your code is at least questionable.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:[...]
Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
semantics.
Would it be better to trap if a NaN is compared with an ordinary
comparison operator, and to use special NaN-aware comparison operators
when that is actually intended?
You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}
The first two require more knowledge about FP than many programmers
have,
Don't allow THOSE programmers to program FP codes !!
Get ones that understand the nuances.
BGB <cr88192@gmail.com> posted:
On 10/13/2025 4:53 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 10/13/2025 2:39 AM, Anton Ertl wrote:
John Savard <quadibloc@invalid.invalid> writes:
After reading that article, I looked for more information on other
processors with poor arithmetic, and I found that the Intel i860 also had
a branch delay slot, as well as using traps to implement some portions of
the IEEE 754 standard... thus, presumably, being one of the architectures
to inspire the piece about bad architectures from Linus Torvalds recently
quoted here.
There never was a Linux port to the i860. There are lots of
architectures with Linux ports that have the properties that Linus
Torvalds mentions. Concerning implementing only a part of FP in
hardware, and throwing the rest over the wall to software, Alpha is
probably the best-known example (denormal support only in software),
and Linus Torvalds worked on it personally. Concerning exposing the
pipeline, MIPS-I has not just branch-delay slots, but also other
limitations. SPARC and HPPA have branch delay slots.
From what I can gather, the MIPS chip in the N64 also only did a
partial implementation in hardware, with optional software traps for the
rest.
Apparently it can be a problem because modern FPUs don't exactly
recreate N64 behavior, and a lot of the games ran without the traps, so
a lot of the N64 games suffer drift and other issues over time (as the
programmers had compensated for the MIPS issues in code rather than via
traps).
And this is why FP wants a high-quality implementation.
From what I gather, it was a combination of Binary32 with DAZ/FTZ and
truncate rounding. Then, with emulators running instead on hardware with
denormals and RNE.
In the above sentence I was talking about your FPU not getting
an infinitely correct result and then rounding to container size.
Not about the other "other" anomalies, many of which can be dealt
with in SW.
But, the result was that the games would work correctly on the original
hardware, but in the emulators things would drift; like things like
moving platforms gradually creeping away from the origin, etc.
Though, reading some stuff, implies a predecessor chip (the R4000) had a >>>> more functionally complete FPU. So, I guess it is also possible that the >>>> R4300 had a more limited FPU to make it cheaper for the embedded market. >>>>Do it right or don't do it at all.
Well, in any case, my recent efforts in these areas have been mostly:
Trying to hunt down some remaining bugs involving RVC in the CPU core;
RVC is seemingly "the gift that keeps on giving" in this area.
(The more dog-chewed the encoding, the harder it is to find bugs)
Going from just:
"Doing weak/crappy FP in hardware"
To:
"Trying to do less crappy FPU via software traps".
A "mostly traps only" implementation of Binary128.
Doesn't exactly match the 'Q' extension, but that is OK.
I sorta suspect not many people are going to implement Q either.
?...
The traps route sorta worked OK in a lot of the MIPS era CPUs.
But, it will be opt-in via an FPSCR flag.
If the flag is not set, it will not trap.
But their combination of HW+SW gets the right answer.
Your multiply does not.
Or, is the argument here that sticking with weaker not-quite IEEE FPU is
preferable to using trap handlers.
The 5-bang instructions as used by HW+SW have to compute the result
to infinite precision and then round to container size.
The paper illustrates CRAY 1,... FP was fast but inaccurate enough
to fund an army of numerical analysts to see if the program was
delivering acceptable results.
IEEE 754 got rid of the army of Numerical Analysts.
But now, nobody remembers how bad it was/can be.
For Binary128, real HW support is not likely to happen. The main reason
to consider trap-only Binary128 is more because it has less code
footprint than using runtime calls.
Nobody is asking for that.
On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:
* The Wikipedia article on the software crisis does not give a useful
definition for deciding whether there is a software crisis or not,
and it does not even mention the symptom that was mentioned first
when I learned about the software crisis (in 1986): The cost of
software exceeds the cost of hardware.
The "crisis" was supposed to do with the shortage of programs to write all >the programs that were needed to solve business and user needs.
By that definition, I donrCOt think the "crisis" exists any more. It went >away with the rise of very-high-level languages, from about the time of >those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:
* The Wikipedia article on the software crisis does not give a
useful definition for deciding whether there is a software crisis
or not, and it does not even mention the symptom that was
mentioned first when I learned about the software crisis (in
1986): The cost of software exceeds the cost of hardware.
The "crisis" was supposed to do with the shortage of programs to
write all the programs that were needed to solve business and user
needs.
I never heard that one. The software project failures, deadline
misses, and cost overruns, and their increasing number was a symptom
that is reflected in the Wikipedia article.
By that definition, I don't think the "crisis" exists any more. It
went away with the rise of very-high-level languages, from about the
time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
Better tools certainly help. One interesting aspect here is that all
the languages you mention are only dynamically typechecked. There has
been quite a bit of work on adding static typechecking to some of
these languages in the last decade or so, and the motivation given for
that is difficulties in large software projects using these languages.
In any case, even with these languages there are still software
projects that fail, miss their deadlines and have overrun their
budget; and to come back to the criterion I mentioned, where software
cost is higher than hardware cost.
Anyway, the relevance for comp.arch is how to evaluate certain
hardware features: If we have a way to make the programmers' jobs
easier at a certain hardware cost, when is it justified to add the
hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
many of them. Let's look at some cases:
"Closing the semantic gap" by providing instructions like EDIT: Even
with assembly-language programmers, calling a subroutine is hardly
harder. With higher-level languages, such instructions buy nothing.
Denormal numbers: It affects lots of code that deals with FP, and
where many programmers are not well-educated (and even the educated
ones have a harder time when they have to work around their absence).
Hardware without Spectre (e.g., with invisible speculation): There are
two takes here:
1) If you consider Spectre to be a realistically exploitable
vulnerability, you need to protect at least the secret keys against
extraction with Spectre; then you either need such hardware, or you
need to use software mitigations agains all Spectre variants in all
software that runs in processes that have secret keys in their
address space; the latter would be a huge cost that easily
justifies the cost of adding invisible speculation to the hardware.
2) The other take is that Spectre is too hard to exploit to be a
realistic threat and that we do not need to eliminate it or
mitigate it. That's similar to the mainstream opinion on
cache-timing attacks on AES before Dan Bernstein demonstrated that
such attacks can be performed. Except that for Spectre we already
have demonstrations.
- anton
What demonstrations?
David Brown <david.brown@hesbynett.no> writes:
Well, I think that if your values are getting that small enough to make
denormal results, your code is at least questionable.
As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of course
you can terminate the loop while you are still far from the solution,
but that's not going to improve the accuracy of the results.
David Brown <david.brown@hesbynett.no> posted:
On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:
On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:
The hardware designers took many years -- right through the 1990s, I
think -- to be persuaded that IEEE754 really was worth implementing in
its entirety, that the "too hard" or "too obscure" parts were there for
an important reason, to make programming that much easier, and should
not be skipped.
I disagree:: full compliance with IEEE 754-whenever is to make programs
more reliable (more numerically stable) and to give the programmer a
constant programming model (not easier).
As a programmer, I count all that under my definition of "easier".
You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
make it easier--but NaNs, infinities, Underflow at the Denorm level went
in the other direction.
NaNs and infinities allow you to propagate certain kinds of pathological
results right through to the end of the calculation, in a mathematically
consistent way.
Denormals -- aren't they called "subnormals" now? -- are also about making
things easier. Providing graceful underflow means a gradual loss of
precision as you get too close to zero, instead of losing all the bits at
once and going straight to zero. It's about the principle of least
surprise.
Again, all that helps to make things easier for programmers --
particularly those of us whose expertise of numerics is not on a level
with Prof Kahan.
I see the benefits of NaNs - sometimes you have bad data, and it can be
useful to have a representation for that. The defined "viral" nature of
NaNs means that you can write your code in a streamlined fashion,
knowing that if a NaN goes in, a NaN comes out - you don't have to have
checks and conditionals in the middle of your calculations.
That was true under 754-2008 but we fixed it for 2019: All NaNs
MAX( x, NaN ) is x.
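For quiet NaNs, C's fmax() has behaved that way since C99 (it returns the non-NaN operand), so the rule is easy to check:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double x = 3.0;
    printf("fmax(x, NAN)   = %g\n", fmax(x, NAN));     /* prints 3         */
    printf("fmax(NAN, x)   = %g\n", fmax(NAN, x));     /* prints 3         */
    printf("fmax(NAN, NAN) = %g\n", fmax(NAN, NAN));   /* only now a NaN   */
    return 0;
}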
On 14/10/2025 18:46, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
Well, I think that if your values are getting that small enough to
make denormal results, your code is at least questionable.
As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of
course you can terminate the loop while you are still far from the solution, but that's not going to improve the accuracy of the
results.
Feel free to correct me if what I write below is wrong - you, Terje,
and others here know a lot more about this stuff than I do.
When you write an expression like "x + y" with floating point,
ignoring NaNs and infinities, you can imagine the calculation being
done by first getting the mathematical real values from x and y.
Then - again in the mathematical real domain - the operation is
carried out. Then the result is truncated or rounded to fit back
within the mantissa and exponent format of the floating point type.
Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^
-308 to 10 ^ +308, or 616 orders of magnitude. (For comparison, the
size of the universe measured in Planck lengths is only about 61
orders of magnitude.)
Denormals let you squeeze a bit more at the lower end here - another
16 orders of magnitude - at the cost of rapidly decreasing precision.
They don't stop the inevitable approximation to zero, they just
delay it a little.
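Those limits can be read straight out of <float.h>; a short C printout (DBL_TRUE_MIN is C11):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("DBL_MAX      = %g\n", DBL_MAX);      /* ~1.8e308                  */
    printf("DBL_MIN      = %g\n", DBL_MIN);      /* ~2.2e-308, smallest normal */
    printf("DBL_TRUE_MIN = %g\n", DBL_TRUE_MIN); /* ~4.9e-324, smallest        */
                                                 /* subnormal                  */
    printf("DBL_EPSILON  = %g\n", DBL_EPSILON);  /* ~2.2e-16, 53-bit mantissa  */
    return 0;
}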
I am still at a loss to understand how this is going to be useful -
when will that small extra margin near zero actually make a
difference, in the real world, with real values? When you are using
your Newton-Raphson iteration to find your function's zeros, what are
the circumstances in which you can get a more useful end result if
you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
especially when these smaller numbers have lower precision?
I realise there are plenty of numerical calculations in which errors
"build up", such as simulating non-linear systems over time, and
there you are looking to get as high an accuracy as you can in the intermediary steps so that you can continue for longer. But even
there, denormals are not going to give you more than a tiny amount
extra.
(There are, of course, mathematical problems which deal with values
or precisions far outside anything of relevance to the physical
world, but if you are dealing with those kinds of tasks then IEEE
floating point is not going to do the job anyway.)
David Brown wrote:
On 14/10/2025 18:46, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
Well, I think that if your values are getting that small enough
to make denormal results, your code is at least questionable.
As Terje Mathiesen wrote, getting close to 0 is standard fare for
approximation algorithms, such as Newton-Raphson iteration. Of
course you can terminate the loop while you are still far from the
solution, but that's not going to improve the accuracy of the
results.
Feel free to correct me if what I write below is wrong - you,
Terje, and others here know a lot more about this stuff than I do.
When you write an expression like "x + y" with floating point,
ignoring NaNs and infinities, you can imagine the calculation being
done by first getting the mathematical real values from x and y.
Then - again in the mathematical real domain - the operation is
carried out. Then the result is truncated or rounded to fit back
within the mantissa and exponent format of the floating point type.
Double precision IEEE format has 53 bits of mantissa and 11 bits of
exponent. For normal floating point values, that covers from 10 ^
-308 to 10 ^ +308, or 616 orders of magnitude. (For comparison,
the size of the universe measured in Planck lengths is only about
61 orders of magnitude.)
Denormals let you squeeze a bit more at the lower end here -
another 16 orders of magnitude - at the cost of rapidly decreasing precision.a They don't stop the inevitable approximation to zero,
they just delay it a little.
I am still at a loss to understand how this is going to be useful -
when will that small extra margin near zero actually make a
difference, in the real world, with real values? When you are
using your Newton-Raphson iteration to find your function's zeros,
what are the circumstances in which you can get a more useful end
result if you continue to 10 ^ -324 instead of treating 10 ^ -308
as zero - especially when these smaller numbers have lower
precision?
Please note that I have NOT personally observed this, but I have been
told from people I trust (on the 754 working group) that at least
some zero-seeking algorithms will stabilize on an exact value, if and
only if you have subnormals, otherwise it is possible to wobble back
& forth between two neighboring results.
I.e. they differ by exactly one ulp.
As I noted, I have not been bitten by this particular issue, one of
the reasons being that I tend not to write infinite loops inside
functions; instead I'll pre-calculate how many (typically NR)
iterations should be needed.
Terje
On Wed, 15 Oct 2025 12:36:17 +0200
David Brown <david.brown@hesbynett.no> wrote:
On 14/10/2025 18:46, Anton Ertl wrote:
David Brown <david.brown@hesbynett.no> writes:
Well, I think that if your values are getting that small enough to
make denormal results, your code is at least questionable.
As Terje Mathiesen wrote, getting close to 0 is standard fare for
approximation algorithms, such as Newton-Raphson iteration. Of
course you can terminate the loop while you are still far from the
solution, but that's not going to improve the accuracy of the
results.
Feel free to correct me if what I write below is wrong - you, Terje,
and others here know a lot more about this stuff than I do.
When you write an expression like "x + y" with floating point,
ignoring NaNs and infinities, you can imagine the calculation being
done by first getting the mathematical real values from x and y.
Then - again in the mathematical real domain - the operation is
carried out. Then the result is truncated or rounded to fit back
within the mantissa and exponent format of the floating point type.
Double precision IEEE format has 53 bits of mantissa and 11 bits of
exponent. For normal floating point values, that covers from 10 ^
-308 to 10 ^ +308, or 616 orders of magnitude. (For comparison, the
size of the universe measured in Planck lengths is only about 61
orders of magnitude.)
Denormals let you squeeze a bit more at the lower end here - another
16 orders of magnitude - at the cost of rapidly decreasing precision.
They don't stop the inevitable approximation to zero, they just
delay it a little.
I am still at a loss to understand how this is going to be useful -
when will that small extra margin near zero actually make a
difference, in the real world, with real values? When you are using
your Newton-Raphson iteration to find your function's zeros, what are
the circumstances in which you can get a more useful end result if
you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
especially when these smaller numbers have lower precision?
I realise there are plenty of numerical calculations in which errors
"build up", such as simulating non-linear systems over time, and
there you are looking to get as high an accuracy as you can in the
intermediary steps so that you can continue for longer. But even
there, denormals are not going to give you more than a tiny amount
extra.
(There are, of course, mathematical problems which deal with values
or precisions far outside anything of relevance to the physical
world, but if you are dealing with those kinds of tasks then IEEE
floating point is not going to do the job anyway.)
I don't think that I agree with Anton's point, at least as formulated.
Yes, subnormals improve precision of Newton-Raphson and such*, but only
when the numbers involved in calculations are below 2**-971, which does
not happen very often. What is more important is that *when* it happens,
naively written implementations of such algorithms still converge.
Without subnormals (or without expert provisions) there is a big chance
that they would not converge at all. That happens mostly because
IEEE-754 preserves the following intuitive invariant:
When x > y then x - y > 0
Without subnormals, e.g. with VAX float formats that are otherwise
pretty good, this invariant does not hold.
* - I personally prefer to illustrate it with the chord-and-tangent
root-finding algorithm, which can be used for any type of function as
long as you have proved that on the section of interest there is no
change of sign of its first and second derivatives. Maybe because I
was taught this algorithm at the age of 15. This algorithm can be
called half-Newton.
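A small C demonstration of the "x > y implies x - y > 0" invariant above (my sketch, not Michael S's; the second half assumes an x86-64 machine, because standard C has no portable switch for subnormals and _MM_SET_FLUSH_ZERO_MODE comes from SSE's xmmintrin.h):

#include <stdio.h>
#include <float.h>
#include <xmmintrin.h>

int main(void)
{
    volatile double x = 1.5 * DBL_MIN;   /* a normal number         */
    volatile double y = 1.0 * DBL_MIN;   /* a smaller normal, y < x */

    /* With gradual underflow the difference is a subnormal, and x > y
       really does give x - y > 0. */
    printf("default:       x - y = %a\n", x - y);   /* 0x1p-1023, nonzero */

    /* With flush-to-zero the subnormal result is squashed to 0, so two
       numbers that compare unequal have a difference of exactly zero. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    printf("flush-to-zero: x - y = %a\n", x - y);   /* 0x0p+0 */
    return 0;
}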
On Wed, 15 Oct 2025 05:55:40 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:
* The Wikipedia article on the software crisis does not give a
useful definition for deciding whether there is a software crisis
or not, and it does not even mention the symptom that was
mentioned first when I learned about the software crisis (in
1986): The cost of software exceeds the cost of hardware.
The "crisis" was supposed to do with the shortage of programmers to
write all the programs that were needed to solve business and user
needs.
I never heard that one. The software project failures, deadline
misses, and cost overruns, and their increasing number was a symptom
that is reflected in the Wikipedia article.
By that definition, I don't think the "crisis" exists any more. It
went away with the rise of very-high-level languages, from about the
time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
Better tools certainly help. One interesting aspect here is that all
the languages you mention are only dynamically typechecked. There has
been quite a bit of work on adding static typechecking to some of
these languages in the last decade or so, and the motivation given for
that is difficulties in large software projects using these languages.
In any case, even with these languages there are still software
projects that fail, miss their deadlines and have overrun their
budget; and to come back to the criterion I mentioned, where software
cost is higher than hardware cost.
Anyway, the relevance for comp.arch is how to evaluate certain
hardware features: If we have a way to make the programmers' jobs
easier at a certain hardware cost, when is it justified to add the
hardware cost? When it affects many programmers and especially if the
difficulty that would otherwise be added is outside the expertise of
many of them. Let's look at some cases:
"Closing the semantic gap" by providing instructions like EDIT: Even
with assembly-language programmers, calling a subroutine is hardly
harder. With higher-level languages, such instructions buy nothing.
Denormal numbers: It affects lots of code that deals with FP, and
where many programmers are not well-educated (and even the educated
ones have a harder time when they have to work around their absence).
Hardware without Spectre (e.g., with invisible speculation): There are
two takes here:
1) If you consider Spectre to be a realistically exploitable
vulnerability, you need to protect at least the secret keys against
extraction with Spectre; then you either need such hardware, or you
need to use software mitigations against all Spectre variants in all
software that runs in processes that have secret keys in their
address space; the latter would be a huge cost that easily
justifies the cost of adding invisible speculation to the hardware.
2) The other take is that Spectre is too hard to exploit to be a
realistic threat and that we do not need to eliminate it or
mitigate it. That's similar to the mainstream opinion on
cache-timing attacks on AES before Dan Bernstein demonstrated that
such attacks can be performed. Except that for Spectre we already
have demonstrations.
- anton
What demonstrations?
The demonstration that I would consider realistic should be from JS
running in a browser released after 2018-01-28.
I'm of the strong opinion that at least Spectre Variant 1 (Bound Check
Bypass) should not be mitigated in hardware.
W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
is, I would rather it not be mitigated on the hardware that I use,
because I am sure that in no situation is it a realistic threat for me.
However, it is harder to prove that it is not a realistic threat to
anybody. And since HW mitigation has a smaller performance impact than
that of Variant 1, if some CPU vendors decide to mitigate Variant 2, I
would not call them spineless idiots because of it. I'd call them
"slick businessmen", which in my book is less derogatory.
I had an idea on how to eliminate Bound Check Bypass.
I intend to have range-check-and-fault instructions like
CHKLTU value_Rs1, limit_Rs2
value_Rs1, #limit_imm
throws an overflow fault exception if value register >= unsigned limit.
(The unsigned >= check also catches negative signed integer values).
It can be used to check an array index before use in a LD/ST, e.g.
CHKLTU index_Rs, limit_Rs
LD Rd, [base_Rs, index_Rs*scale]
The problem is that there is no guarantee that an OoO cpu will execute
the CHKLTU instruction before using the index register in the LD/ST.
My idea is for the CHKcc instruction to copy the test value to a dest register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
and blocks the subsequent LD from using the index until validated.
CHKLTU index_R2, index_R1, limit_R3
LD R4, [base_R5, index_R2*scale]
Because there is no branch, there is no way to speculate around the check (but load value speculation could negate this fix).
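For comparison, today's software mitigations use the same trick of turning the control dependency into a data dependency. A rough C sketch (mine, loosely modelled on the masking idea behind the Linux kernel's array_index_nospec; real implementations add compiler barriers so the optimizer cannot turn the mask back into a branch):

#include <stddef.h>

/* Clamp idx with a branch-free data dependency so that a speculatively
   executed load cannot see an out-of-bounds index: mask is all ones
   when idx < size and all zeros otherwise. */
static inline size_t index_nospec(size_t idx, size_t size)
{
    size_t mask = (size_t)0 - (size_t)(idx < size);
    return idx & mask;
}

int load_checked(const int *table, size_t size, size_t idx)
{
    if (idx >= size)
        return -1;                          /* the architectural bounds check */
    return table[index_nospec(idx, size)];  /* index now data-depends on the check */
}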
Most people would say:: "When it adds performance" AND the compiler can
use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.
I might note that SIMD obeys none of the 3 conditions.
On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D'Oliveiro wrote:
The "crisis" was supposed to do with the shortage of programmers to write
all the programs that were needed to solve business and user needs.
By that definition, I don't think the "crisis" exists any more. It went
away with the rise of very-high-level languages, from about the time of
those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
Better tools certainly help. One interesting aspect here is that all
the languages you mention are only dynamically typechecked.
There has been quite a bit of work on adding static typechecking to some
of these languages in the last decade or so, and the motivation given
for that is difficulties in large software projects using these
languages.
In any case, even with these languages there are still software projects
that fail, miss their deadlines and have overrun their budget ...
On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:
Most people would say:: "When it adds performance" AND the compiler can
use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.
I might note that SIMD obeys none of the 3 conditions.
I believe GCC can do auto-vectorization in some situations.
But the RISC-V folks still think Cray-style long vectors are better than
SIMD, if only because it preserves the "R" in "RISC".
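On the auto-vectorization point: the canonical case compilers handle is a plain elementwise loop. Something like this (my example, not from the thread) is typically vectorized by GCC at -O3, or at -O2 combined with -ftree-vectorize; inspect the generated assembly to confirm on a given target.

/* gcc -O3 -march=native -S vec.c   and look for SIMD instructions */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}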
On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:
On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D'Oliveiro wrote:
The "crisis" was supposed to do with the shortage of programmers to write
all the programs that were needed to solve business and user needs.
By that definition, I don't think the "crisis" exists any more. It went
away with the rise of very-high-level languages, from about the time of
those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
Better tools certainly help. One interesting aspect here is that all
the languages you mention are only dynamically typechecked.
Correct. That does seem to be a key part of what "very-high-level" means.
There has been quite a bit of work on adding static typechecking to some
of these languages in the last decade or so, and the motivation given
for that is difficulties in large software projects using these
languages.
What we're seeing here is a downward creep, as those very-high-level
languages (Python and JavaScript, particularly) are encroaching into the
territory of the lower levels. Clearly they must still have some
advantages over those languages that already inhabit the lower levels,
otherwise we might as well use the latter.
In any case, even with these languages there are still software projects
that fail, miss their deadlines and have overrun their budget ...
I'm not aware of such; feel free to give an example of some large Python
project, for example, which has exceeded its time and/or budget. The key
point about using such a very-high-level language is that you can do a
lot in just a few lines of code.
But the RISC-V folks still think Cray-style long vectors are better
than SIMD, if only because it preserves the "R" in "RISC".
The R in RISC-V comes from "student _R_esearch".
Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
vice versa)--they simply represent different ways of shooting yourself
in the foot.
No ISA with more than 200 instructions deserves the RISC mantra.
On Wed, 15 Oct 2025 21:42:32 -0000 (UTC), Lawrence D'Oliveiro wrote:
What we're seeing here is a downward creep, as those very-high-level
languages (Python and JavaScript, particularly) are encroaching into
the territory of the lower levels. Clearly they must still have some
advantages over those languages that already inhabit the lower levels,
otherwise we might as well use the latter.
There is a pernicious trap:: once an application written in a VHLL is acclaimed by the masses--it instantly falls into the trap where "users
want more performance":: something the VHLL cannot provide until they.........
45 years ago it was LISP, you wrote the application in LISP to figure
out the required algorithms and once you got it working, you rewrote it
in a high-performance language (FORTRAN or C) so it was usably fast.
On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:
But the RISC-V folks still think Cray-style long vectors are better
than SIMD, if only because it preserves the "R" in "RISC".
The R in RISC-V comes from "student _R_esearch".
"Reduced Instruction Set Computing". That was what every single primer
on the subject said, right from the 1980s onwards.
Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
vice versa)--they simply represent different ways of shooting yourself
in the foot.
The primary design criterion, as I understood it, was to avoid filling up
the instruction opcode space with a combinatorial explosion. (Or sequence
of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)
Also there might be some pipeline benefits in having longer vector
operands ... I'll bow to your opinion on that.
No ISA with more than 200 instructions deserves the RISC mantra.
There you go ... agreeing with me about what the "R" stands for.
Terje Mathisen <terje.mathisen@tmsw.no> posted:
----------------------
Please note that I have NOT personally observed this, but I have been
told from people I trust (on the 754 working group) that at least some
zero-seeking algorithms will stabilize on an exact value, if and only if
you have subnormals, otherwise it is possible to wobble back & forth
between two neighboring results.
I know of several Newton-Raphson-iterations that converge faster and
more accurately using reciprocal-SQRT() than the equivalent algorithm
using SQRT() directly in NR-iteration.
I.e. they differ by exactly one ulp.
In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.
EricP <ThatWouldBeTelling@thevillage.com> posted:
---------------------------
What demonstrations?
The demonstration that I would consider realistic should be from JS
running on browser released after 2018-01-28.
I'm of strong opinion that at least Spectre Variant 1 (Bound Check
Bypass) should not be mitigated in hardware.
W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
is, I rather prefer it not mitigated on the hardware that I use,
because I am sure that in no situation it is a realistic threat for me.
However it is harder to prove that it is not a realistic threat to
anybody. And since HW mitigation has smaller performance impact than of
Variant 1, so if some CPU vendors decide to mitigate Variant 2, I would
not call them spineless idiots because of it. I'd call them "slick
businessmen" which in my book is less derogatory.
I had an idea on how to eliminate Bound Check Bypass.
I intend to have range-check-and-fault instructions like
CHKLTU value_Rs1, limit_Rs2
value_Rs1, #limit_imm
throws an overflow fault exception if value register >= unsigned limit.
(The unsigned >= check also catches negative signed integer values).
It can be used to check an array index before use in a LD/ST, e.g.
CHKLTU index_Rs, limit_Rs
LD Rd, [base_Rs, index_Rs*scale]
The problem is that there is no guarantee that an OoO cpu will execute
the CHKLTU instruction before using the index register in the LD/ST.
Yes, order in OoO is sanity-impairing.
But, what you do know is that CHKx will be performed before LD can
retire. _AND_ if your µA does not update µA state prior to retire,
you can be as OoO as you like and still not be Spectré sensitive.
One of the things recently put into My 66000 is that AGEN detects
overflow and raises PageFault.
My idea is for the CHKcc instruction to copy the test value to a dest
register when the check is successful. This makes the dest value register
write-dependent on successfully passing the range check,
and blocks the subsequent LD from using the index until validated.
CHKLTU index_R2, index_R1, limit_R3
LD R4, [base_R5, index_R2*scale]
If you follow my rule above this is unnecessary, but it may be less
painful than holding back state update until retire.
Because there is no branch, there is no way to speculate around the check
(but load value speculation could negate this fix).
x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
and µfaults when shift count == 0 and prevents setting of CFLAGS.
You "COULD" do something similar at µA level.
Michael S <already5chosen@yahoo.com> writes:
The demonstration that I would consider realistic should be from JS
running on browser released after 2018-01-28.
You apparently only consider attacks through the browser as relevant.
NetSpectre demonstrates a completely remote attack, i.e., without a
browser.
As for the browsers, AFAIK they tried to make Spectre leak less by
making the clock less precise. That does not stop Spectre, it only
makes data extraction using the clock slower. Moreover, there are
ways to work around that by running a timing loop, i.e., instead of
the clock you use the current count of the counted loop.
I'm of strong opinion that at least Spectre Variant 1 (Bound Check
Bypass) should not be mitigated in hardware.
What do you mean with "mitigated in hardware"? The answers to
hardware vulnerabilities are to either fix the hardware (for Spectre "invisible speculation" looks the most promising to me), or to leave
the hardware vulnerable and mitigate the vulnerability in software
(possibly supported by hardware or firmware changes that do not fix
the vulnerability).
So do you not want it to be fixed in hardware, or not mitigated in
software? As long as the hardware is not fixed, you may not have a
choice in the latter, unless you use an OS you write yourself. AFAIK
you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations disabled.
So if you are against hardware fixes, you will pay for software
mitigations, in development cost (possibly indirectly) and in
performance.
More info on the topic:
Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf
- anton
On 16/10/2025 07:44, Lawrence D'Oliveiro wrote:
On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:
But the RISC-V folks still think Cray-style long vectors are better
than SIMD, if only because it preserves the "R" in "RISC".
The R in RISC-V comes from "student _R_esearch".
"Reduced Instruction Set Computing". That was what every single primer on
the subject said, right from the 1980s onwards.
Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
vice versa)--they simply represent different ways of shooting yourself
in the foot.
The primary design criterion, as I understood it, was to avoid filling up
the instruction opcode space with a combinatorial explosion. (Or sequence
of combinatorial explosions, when you look at the wave after wave of SIMD
extensions in x86 and elsewhere.)
I believe another aim is to have the same instructions work on different
hardware. With SIMD, you need different code if your processor can add
4 ints at a time, or 8 ints, or 16 ints - it's all different
instructions using different SIMD registers. With the vector-style
instructions in RISC-V, the actual SIMD registers and implementation are
not exposed to the ISA, and you have the same code no matter how wide the
actual execution units are. I have no experience with this (or much
experience with SIMD), but that seems like a big win to my mind. It is
akin to letting the processor hardware handle multiple instructions in
parallel in superscalar CPUs, rather than Itanium EPIC coding.
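A sketch of that vector-length-agnostic style (mine, not David Brown's; it uses the RVV C intrinsics as I understand them from riscv_vector.h, so treat the exact names as an assumption). The same source runs unchanged whether a hardware vector register holds 4, 8 or 64 ints, because vsetvl asks the machine how many elements it will take per trip:

#include <riscv_vector.h>
#include <stddef.h>
#include <stdint.h>

void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m1(n);           /* hardware picks the chunk size */
        vint32m1_t va = __riscv_vle32_v_i32m1(a, vl);  /* load vl elements              */
        vint32m1_t vb = __riscv_vle32_v_i32m1(b, vl);
        vint32m1_t vc = __riscv_vadd_vv_i32m1(va, vb, vl);
        __riscv_vse32_v_i32m1(dst, vc, vl);            /* store vl elements             */
        a += vl; b += vl; dst += vl; n -= vl;
    }
}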
Also there might be some pipeline benefits in having longer vector
operands ... I'll bow to your opinion on that.
No ISA with more than 200 instructions deserves the RISC mantra.
There you go ... agreeing with me about what the "R" stands for.
I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
be fewer instructions.
MitchAlsup wrote:
EricP <ThatWouldBeTelling@thevillage.com> posted:
---------------------------
What demonstrations?
The demonstration that I would consider realistic should be from JS
running on browser released after 2018-01-28.
I'm of strong opinion that at least Spectre Variant 1 (Bound Check
Bypass) should not be mitigated in hardware.
W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
is, I rather prefer it not mitigated on the hardware that I use,
because I am sure that in no situation it is a realistic threat for me.
However it is harder to prove that it is not a realistic threat to
anybody. And since HW mitigation has smaller performance impact than of
Variant 1, so if some CPU vendors decide to mitigate Variant 2, I would
not call them spineless idiots because of it. I'd call them "slick
businessmen" which in my book is less derogatory.
I had an idea on how to eliminate Bound Check Bypass.
I intend to have range-check-and-fault instructions like
CHKLTU value_Rs1, limit_Rs2
value_Rs1, #limit_imm
throws an overflow fault exception if value register >= unsigned limit.
(The unsigned >= check also catches negative signed integer values).
It can be used to check an array index before use in a LD/ST, e.g.
CHKLTU index_Rs, limit_Rs
LD Rd, [base_Rs, index_Rs*scale]
The problem is that there is no guarantee that an OoO cpu will execute
the CHKLTU instruction before using the index register in the LD/ST.
Yes, order in OoO is sanity-impairing.
But, what you do know is that CHKx will be performed before LD can
retire. _AND_ if your µA does not update µA state prior to retire,
you can be as OoO as you like and still not be Spectré sensitive.
One of the things recently put into My 66000 is that AGEN detects
overflow and raises PageFault.
My idea is for the CHKcc instruction to copy the test value to a dest
register when the check is successful. This makes the dest value register
write-dependent on successfully passing the range check,
and blocks the subsequent LD from using the index until validated.
CHKLTU index_R2, index_R1, limit_R3
LD R4, [base_R5, index_R2*scale]
If you follow my rule above this is unnecessary, but it may be less
painful than holding back state update until retire.
My idea is the same as a SUB instruction with overflow detect,
which I would already have. I like cheap solutions.
But the core idea here, to eliminate a control flow race condition by changing it to a data flow dependency, may be applicable in other areas.
Because there is no branch, there is no way to speculate around the check
(but load value speculation could negate this fix).
On second thought, no, load value speculation would not negate this fix.
x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
and µfaults when shift count == 0 and prevents setting of CFLAGS.
You "COULD" do something similar at µA level.
I'd prefer not to step in that cow pie to begin with.
Then I won't have to spend time cleaning my shoes afterwards.
MitchAlsup wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
----------------------
Please note that I have NOT personally observed this, but I have been
told from people I trust (on the 754 working group) that at least some
zero-seeking algorithms will stabilize on an exact value, if and only if
you have subnormals, otherwise it is possible to wobble back & forth
between two neighboring results.
I know of several Newton-Raphson-iterations that converge faster and
more accurately using reciprocal-SQRT() than the equivalent algorithm
using SQRT() directly in NR-iteration.
I.e. they differ by exactly one ulp.
In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.
Interesting! I have also found rsqrt() to be a very good building block,
to the point where if I can only have one helper function (approximate lookup to start the NR), it would be rsqrt, and I would use it for all
of sqrt, fdiv and rsqrt.
Terje
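For readers who have not seen it, the Newton step for rsqrt that Mitch and Terje are discussing looks roughly like this (my sketch; the seed y0 is assumed to come from a small table or an instruction such as x86's rsqrtss, and the fixed iteration count follows Terje's stated preference over convergence-testing loops):

/* One Newton-Raphson step for y ~= 1/sqrt(x):
       y' = y * (1.5 - 0.5 * x * y * y)
   Each step roughly doubles the number of correct bits. */
static double nr_step(double x, double y)
{
    return y * (1.5 - 0.5 * x * y * y);
}

double my_rsqrt(double x, double y0)   /* y0: rough seed for 1/sqrt(x) */
{
    double y = y0;
    for (int i = 0; i < 4; i++)        /* fixed count, no convergence test */
        y = nr_step(x, y);
    return y;
}

/* sqrt(x) can then be had as x * my_rsqrt(x, y0), and for b > 0 a
   reciprocal as my_rsqrt(b, y0) squared, which is one way rsqrt can
   serve as the single helper behind sqrt, fdiv and rsqrt. */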