• On Cray arithmetic

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Oct 11 10:32:22 2025
    From Newsgroup: comp.arch

    Just found a gem on Cray arithmetic, which (rightly) incurred
    The Wrath of Kahan:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    "Pessimism comes less from the error-analyst's dour personality
    than from his mental model of computer arithmetic."

    I also had to look up "equipollent".

    I assume many people in this group know this, but for those who
    don't, it is well worth reading.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Oct 11 19:36:44 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Just found a gem on Cray arithmetic, which (rightly) incurred
    The Wrath of Kahan:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    I hope BGB reads this and takes it to heart.

    "Pessimism comes less from the error-analyst's dour personality
    than from his mental model of computer arithmetic."

    I also had to look up "equipollent".

    I assume many people in this group know this, but for those who
    don't, it is well worth reading.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 12 00:28:16 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 -0000 (UTC), Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan ...

    No harm in reminding everyone of his legendary foreword to the
    Standard Apple Numerics manual, 2nd ed, of 1988. He had something
    suitably acerbic to say about a great number of different vendors'
    idea of floating-point arithmetic (including Cray).

    I posted one instance here
    <http://groups.google.com/group/comp.lang.python/msg/5aaf5dd86cb00651?hl=en>.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 12 01:15:23 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 -0000 (UTC), Thomas Koenig wrote:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    Anybody curious about what's on pages 62-5 of the Apple Numerics Manual
    2nd ed can find a copy here <https://vintageapple.org/inside_o/>.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Oct 12 04:04:46 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan:

    While the arithmetic on the Cray I was bad enough, this document seems
    to focus on some later models in the Cray line, which, like the IBM
    System/360 when it first came out, before an urgent retrofit, lacked a
    guard digit!
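
    (For readers who have not met the term: a guard digit is an extra
    digit carried during operand alignment for subtraction. A toy decimal
    illustration, not from the post: in 3-digit arithmetic, 1.00 - 0.999
    requires shifting 0.999 to align exponents; with no guard digit it is
    truncated to 0.99 and the result is 0.01, ten times the true
    difference of 0.001, while keeping a single guard digit preserves the
    trailing 9 and yields the exact answer.)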

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 12 06:06:35 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 04:04:46 -0000 (UTC), John Savard wrote:

    On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan:

    While the arithmetic on the Cray I was bad enough, this document seems
    to focus on some later models in the Cray line, which, like the IBM
    System/360 when it first came out, before an urgent retrofit, lacked a
    guard digit!

    The concluding part of that article had a postscript which said that,
    while Cray accepted the importance of fixing the deficiencies in future
    models, there would be no retrofit to existing ones.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Mon Oct 13 07:23:21 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 06:06:35 +0000, Lawrence D'Oliveiro wrote:

    The concluding part of that article had a postscript which said that,
    while Cray accepted the importance of fixing the deficiencies in future
    models, there would be no retrofit to existing ones.

    That is a pity.

    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently quoted here.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Oct 13 07:39:11 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I has not just branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Mon Oct 13 09:05:18 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I think
    -- to be persuaded that IEEE754 really was worth implementing in its
    entirety, that the "too hard" or "too obscure" parts were there for an
    important reason, to make programming that much easier, and should not be
    skipped.

    You'll notice that Kahan mentioned Apple more than once, as seemingly his
    favourite example of a company that took IEEE754 to heart and implemented
    it completely in software, where their hardware vendor of choice at the
    time (Motorola), skimped a bit on hardware support.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Oct 13 13:12:12 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing
    in its entirety, that the "too hard" or "too obscure" parts were
    there for an important reason,
    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?
    to make programming that much easier,
    and should not be skipped.
    For many non-obvious parts of 754 it's true. For many other parts, esp.
    related to exceptions, it's false.
    That is, they should not be skipped, but the only reason for that is
    ease of documentation (just write "754" and you are done) and access to
    test vectors. These parts are not well thought out, do not make
    application programming any easier, and do not fit well into
    programming languages.

    You'll notice that Kahan mentioned Apple more than once, as seemingly
    his favourite example of a company that took IEEE754 to heart and
    implemented it completely in software, where their hardware vendor of
    choice at the time (Motorola), skimped a bit on hardware support.
    According to my understanding, Motorola suffered from being an early
    adopter, similarly to Intel. They implemented 754 before the standard
    was finished and later on were in the difficult position of a conflict
    between compatibility with the standard and compatibility with previous
    generations.
    Moto is less forgivable than Intel, because although they were also
    early adopters, they were not nearly as early.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 13 12:30:33 2025
    From Newsgroup: comp.arch

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I has not just branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the
    rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    Though, from what I have read, a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.



    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall-like handler. Though, in
    this case it would likely overlap with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt
    handler for too long because this blocks any other interrupts, so for
    longer-running operations it is better to switch to a handler that can
    deal with interrupts (and, ATM, FDIV.Q and FSQRT.Q are kinda horridly
    slow; so, less like a TLB miss, and more like a page-fault...).

    The TestKern-related code is getting a little behind in my GitHub repo;
    the idea is that these parts will be posted when they are done.


    I had found/fixed one RVC bug since the last upload of the CPU core to
    GitHub, but more bugs remain and are still being hunted down.


    Progress is slow...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Oct 13 17:33:32 2025
    From Newsgroup: comp.arch


    Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in its
    entirety, that the "too hard" or "too obscure" parts were there for an
    important reason, to make programming that much easier, and should not be
    skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
    did make it easier--but NaNs, infinities, Underflow at the Denorm level
    went in the other direction.

    You'll notice that Kahan mentioned Apple more than once, as seemingly his
    favourite example of a company that took IEEE754 to heart and implemented
    it completely in software, where their hardware vendor of choice at the
    time (Motorola), skimped a bit on hardware support.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 13 21:08:56 2025
    From Newsgroup: comp.arch

    On 13/10/2025 19:33, MitchAlsup wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in its
    entirety, that the "too hard" or "too obscure" parts were there for an
    important reason, to make programming that much easier, and should not be
    skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    It does not make the programs more reliable - it makes them more
    consistent, predictable and portable. It does not make things easier
    for most code (support for NaNs and infinities can make some code
    easier, if mathematically nonsensical results are a real possibility). But
    since consistency, predictability and portability are often very useful
    characteristics, full IEEE 754 compliance is a good thing for
    general-purpose processors.

    However, there are plenty of more niche situations where these are not
    vital, and where cost (die space, design costs, run-time power, etc.) is
    more important. Thus on small microcontrollers, it can be a better
    choice to skip support for the "obscure" stuff, and maybe even cut
    corners on things like rounding behaviour. The same applies for
    software floating point routines for devices that don't have hardware
    floating point at all.




    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
    did make it easier--but NaNs, infinities, Underflow at the Denorm level
    went in the other direction.

    You'll notice that Kahan mentioned Apple more than once, as seemingly his
    favourite example of a company that took IEEE754 to heart and implemented
    it completely in software, where their hardware vendor of choice at the
    time (Motorola), skimped a bit on hardware support.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Oct 13 21:53:33 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I has not just branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants a high-quality implementation.

    Though, from what I have read, a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.

    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    While I can agree with the sentiment, the emulation overhead makes this
    very hard to achieve indeed.

    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall-like handler. Though, in
    this case it would likely overlap with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt
    handler for too long because this blocks any other interrupts,

    At the time of control arrival, interrupts are already reentrant in
    My 66000. A higher priority interrupt will take control from the
    lower priority interrupt.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 14 02:27:46 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at
    once and going straight to zero. It's about the principle of least
    surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 14 02:36:50 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    The hardware designers took many years -- right through the 1990s,
    I think -- to be persuaded that IEEE754 really was worth
    implementing in its entirety, that the "too hard" or "too obscure"
    parts were there for an important reason,

    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    I thought they all did, just about.

    You'll notice that Kahan mentioned Apple more than once, as
    seemingly his favourite example of a company that took IEEE754 to
    heart and implemented it completely in software, where their
    hardware vendor of choice at the time (Motorola), skimped a bit on
    hardware support.

    According to my understanding, Motorola suffered from being an early
    adopter, similarly to Intel. They implemented 754 before the
    standard was finished and later on were in the difficult position of
    a conflict between compatibility with the standard and compatibility
    with previous generations. Moto is less forgivable than Intel, because
    although they were also early adopters, they were not nearly as early.

    Let's see, the Motorola 68881 came out in 1984
    <https://en.wikipedia.org/wiki/Motorola_68881>, while the first
    release of IEEE754 dates from two years before
    <https://en.wikipedia.org/wiki/IEEE_754>.

    I would say Motorola had plenty of time to read the spec and get it
    right. But they didn't. So Apple had to patch things up in its
    software implementation, introducing a mode where for example those
    last few inaccurate bits in transcendentals were fixed up in software,
    sacrificing some speed over the raw hardware to ensure consistent
    results with the (even slower) pure-software implementation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 13 22:38:18 2025
    From Newsgroup: comp.arch

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I has not just branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the
    rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and
    truncate rounding, with the emulators then running instead on hardware
    with denormals and RNE.

    But, the result was that the games would work correctly on the original
    hardware, but in the emulators things would drift; things like moving
    platforms gradually creeping away from the origin, etc.
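
    As a rough illustration of the rounding-mode half of that (a sketch,
    not N64 or emulator code; it assumes a C99 compiler that honours
    fesetround, e.g. no -ffast-math and -frounding-math where needed), the
    same accumulation loop drifts apart under truncating rounding versus
    round-to-nearest-even:

    #include <stdio.h>
    #include <fenv.h>

    #pragma STDC FENV_ACCESS ON   /* tell the compiler we change FP modes */

    static float accumulate(int mode)
    {
        /* one million additions of an inexact constant under the given
           dynamic rounding mode */
        fesetround(mode);
        float pos = 0.0f;
        for (int i = 0; i < 1000000; i++)
            pos += 0.1f;              /* 0.1 is not exact in Binary32 */
        fesetround(FE_TONEAREST);
        return pos;
    }

    int main(void)
    {
        float rne = accumulate(FE_TONEAREST);   /* round-to-nearest-even */
        float rtz = accumulate(FE_TOWARDZERO);  /* truncating arithmetic */
        printf("RNE: %.3f  RTZ: %.3f  drift: %.3f\n", rne, rtz, rne - rtz);
        return 0;
    }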





    Though, from what I have read, a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS-era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    Or, is the argument here that sticking with a weaker, not-quite-IEEE FPU
    is preferable to using trap handlers?



    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is that it has a smaller code footprint
    than using runtime calls.

    Also on RISC-V, it is more expensive to implement 128-bit arithmetic, so
    the actual cost might be lower.

    The main deviation from the Q extension is that it will use register
    pairs rather than 128-bit registers. I suspect that 128-bit registers
    would likely cause more problems for software built to assume RV64G
    than the problems resulting from breaking the spec and using pairs.

    Or, if the proper Q extension were supported, it would make more sense in
    the context of RV128, so that XLEN==FLEN. Otherwise, Q on RV64 would break
    the ability to move values between FPRs and GPRs (in the RV spec, they
    note the assumption that in this configuration, moves between FPRs
    and GPRs would be done via memory loads and stores). This would suck,
    and actively make the FPU worse than sticking primarily with the D
    extension and doing something nonstandard.


    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    While I can agree with the sentiment, the emulation overhead makes this
    very hard to achieve indeed.


    Will have to test this more to find out.

    But, at least in the case of Binary128, the operations themselves are
    likely to be slow enough to partly offset the trap-handling and
    instruction decoding overheads.



    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall-like handler. Though, in
    this case it would likely overlap with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt
    handler for too long because this blocks any other interrupts,

    At the time of control arrival, interrupts are already reentrant in
    My 66000. A higher priority interrupt will take control from the
    lower priority interrupt.

    Yeah, no re-entrant interrupts here.

    For a longer-running operation, it is mostly necessary to handle things
    with a context switch into supervisor mode. Can't use the normal SYSCALL
    handler though, as it itself may have been the source of the trap. So,
    Page-Fault needs its own handler task.


    It is likely that re-entrant interrupts would require a different and
    more complex mechanism.

    Well, and/or rework things at the compiler level so that the ISR proper
    is only used to implement a transition into supervisor mode (or from
    supervisor mode back to usermode), and then fake something more like the
    x86-style interrupt handling.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Oct 14 01:53:17 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    The hardware designers took many years -- right through the 1990s,
    I think -- to be persuaded that IEEE754 really was worth
    implementing in its entirety, that the "too hard" or "too obscure"
    parts were there for an important reason,
    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    I thought they all did, just about.

    You'll notice that Kahan mentioned Apple more than once, as
    seemingly his favourite example of a company that took IEEE754 to
    heart and implemented it completely in software, where their
    hardware vendor of choice at the time (Motorola), skimped a bit on
    hardware support.
    According to my understanding, Motorola suffered from being an early
    adopter, similarly to Intel. They implemented 754 before the
    standard was finished and later on were in the difficult position of
    a conflict between compatibility with the standard and compatibility
    with previous generations. Moto is less forgivable than Intel, because
    although they were also early adopters, they were not nearly as early.

    Let's see, the Motorola 68881 came out in 1984
    <https://en.wikipedia.org/wiki/Motorola_68881>, while the first
    release of IEEE754 dates from two years before
    <https://en.wikipedia.org/wiki/IEEE_754>.

    Circa 1981 there were the Weitek chips. Wikipedia doesn't say if the
    early ones were 754 compatible, but later chips from 1986 intended
    for the 386 were compatible, and they seem to have been used by many
    (Motorola, Intel, Sun, PA-RISC, ...).

    https://en.wikipedia.org/wiki/Weitek

    Unfortunately not all the chip documents are on bitsavers

    http://www.bitsavers.org/components/weitek/dataSheets/

    but the WTL-1164_1165 PDF from 1986 says

    FULL 32-BIT AND 64-BIT FLOATING POINT
    FORMAT AND OPERATIONS, CONFORMING TO
    THE IEEE STANDARD FOR FLOATING POINT ARITHMETIC

    2.38 MFlops (420 ns) 32-bit add/subtract/convert and compare
    1.85 MFlops (540 ns) 64-bit add/subtract/convert and compare
    2.38 MFlops (420 ns) 32-bit multiply
    1.67 MFlops (600 ns) 64-bit multiply
    0.52 MFlops (1.92 µs) 32-bit divide
    0.26 MFlops (3.78 µs) 64-bit divide
    Up to 3.33 MFlops (300 ns) for pipelined operations
    Up to 3.33 MFlops (300 ns) for chained operations
    32-bit data input or 32-bit data output operation every 60 ns


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 14 08:30:44 2025
    From Newsgroup: comp.arch

    On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at
    once and going straight to zero. It's about the principle of least
    surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.
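
    A minimal sketch of that style (the names here are made up for
    illustration, not from the post): mark the bad input with a NaN, let it
    flow through the arithmetic, and test once at the end:

    #include <math.h>
    #include <stdio.h>

    /* any NaN input propagates through to the output */
    static double blend(double a, double b, double t)
    {
        return a + t * (b - a);
    }

    int main(void)
    {
        double missing = nan("");    /* marker for bad/missing data */
        double r = blend(blend(1.0, missing, 0.5), 4.0, 0.25);
        if (isnan(r))
            printf("some input upstream was bad\n");
        else
            printf("result = %f\n", r);
        return 0;
    }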

    But I find it harder to understand why denormals or subnormals are going
    to be useful. Ultimately, your floating point code is approximating arithmetic on real numbers. Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting
    results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 14 06:56:46 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 01:53:17 -0400, EricP wrote:

    Circa 1981 there was the Weitek chips. Wikipedia doesn't say if the
    early ones were 754 compatible, but later chips from 1986 intended for
    the 386 were compatible, and they seem to have been used by many
    (Motorola, Intel, Sun, PA-RISC, ...)

    Weitek add-on cards, I think mainly the early ones, were popular with more
    hard-core power users of Lotus 1-2-3. Remember, that was the "killer app"
    that prompted a lot of people to buy the IBM PC (and compatibles) in the
    first place. Some of them must have been doing some serious number-
    crunching, such that floating-point speed became a real issue.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 14 07:51:09 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.
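
    A small sketch of the pitfall (not from the post):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = nan(""), b = 1.0;
        printf("a < b    : %d\n", a < b);      /* 0: any comparison with NaN is false */
        printf("!(a >= b): %d\n", !(a >= b));  /* 1: the "equivalent" rewrite disagrees */
        return 0;
    }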

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.
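
    A sketch of that classical example (assuming ordinary IEEE doubles):

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = DBL_MIN;                   /* smallest normal double */
        double b = nextafter(DBL_MIN, 1.0);   /* the next double up */
        double d = a - b;                     /* -0x1p-1074, a subnormal */
        printf("a < b     : %d\n", a < b);    /* 1 */
        printf("a - b < 0 : %d\n", d < 0.0);  /* 1 here; 0 on a flush-to-zero FPU,
                                                 where the difference is flushed */
        return 0;
    }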

    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have, all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant. The last increases the
    resource usage much more than proper support for denormals.

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware. So that's my decision
    criterion: If the software cost is higher than the hardware cost,
    the software crisis is relevant; and in the present context, it
    means that expending hardware to reduce the cost of software is
    justified. Denormal numbers are such a feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 14 10:47:56 2025
    From Newsgroup: comp.arch

    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer. But as long as you are aware of the
    possibility and consequences of NaNs, they can be useful.


    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?


    I'm sure there are a number of interesting ways to model this kind of
    thing, in a programming language that supported it. NaNs in floating
    point are somewhat akin to error values in C++ std::expected<>, or empty
    std::optional<> types, or like "result" types found in many languages.

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result. And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold? I do not know the details here - it is simply
    not something that turns up in the kind of coding I do. (In my line of
    work, floating point values and expression results are always "normal",
    if that is the correct term. I can always use gcc's "-ffast-math", and
    I think a lot of real-world floating point code could do so - but I
    fully appreciate that does not apply to all code.)

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?


    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have, all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant. The last increases the
    resource usage much more than proper support for denormals.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount. Doing it right
    is going to cost you, in development time or runtime efficiency, but
    that's better than getting the wrong answers quickly!



    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware. So that's my decision
    criterion: If the software cost is higher than the hardware cost,
    the software crisis is relevant; and in the present context, it
    means that expending hardware to reduce the cost of software is
    justified. Denormal numbers are such a feature.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 14 11:26:10 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer.

    Not really:

    * Null pointers don't materialize spontaneously as results of
    arithmetic operations. They are stored explicitly by the
    programmer, making the programmer much more aware of their
    existence.

    * Programmers are trained to check for null pointers. And if they
    forget such a check, the result usually is that the program traps,
    usually soon after the place where the check should have been. With
    a NaN you just silently execute the wrong branch of an IF, and later
    you wonder what happened.

    * The most common use for null pointers is terminating a linked list
    or other recursive data structure. Programmers are trained to deal
    with the terminating case in their code.

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result.

    When a denormal is generated, an underflow "exception" happens (IEEE
    "exceptions" are not traps). You can set your FPU to trap on a
    certain kind of exception. Maybe you can also set it up such that it
    produces a NaN instead. I doubt that many people would find that
    useful, however.
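
    A small sketch of the flag (as opposed to trap) behaviour, using the
    C99 <fenv.h> interface and assuming the compiler respects FENV_ACCESS
    (no -ffast-math):

    #include <fenv.h>
    #include <float.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    int main(void)
    {
        feclearexcept(FE_ALL_EXCEPT);
        volatile double tiny = DBL_MIN;    /* smallest normal double */
        volatile double sub = tiny / 3.0;  /* tiny and inexact result: subnormal */
        if (fetestexcept(FE_UNDERFLOW))
            printf("underflow flag raised, result = %g\n", sub);
        return 0;
    }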


    And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    Denormals don't help much here. IEEE doubles cannot represent 2^1024,
    but denormals allow representing positive numbers down to 2^-1074.
    So, with denormal numbers, the absolute value of your divisor must be
    less than 2^-50 to produce a non-infinite result where flush-to-zero
    would have produced an infinity.

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    Not mine. An assumption that I like is that the associative law
    holds. It holds with -fwrapv, but not with overflow-is-undefined.

    I fail to see how declaring any condition undefined behaviour would
    increase any guarantees.

    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold?

    Yes.

    If any of them is a NaN, the result is false for either comparison
    (because a-b would be NaN, and because the result of any comparison
    with a NaN is false).

    For infinity there are a number of cases

    1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
    2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
    3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
    4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
    5) inf<inf (false) vs. inf-inf=NaN<0 (false)
    6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
    7) inf<-inf (false) vs. inf--inf=inf<0 (false)
    8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)

    The most interesting case here is 5), because it means that a<=b is
    not equivalent to a-b<=0, even with denormal numbers.
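
    A short sketch that walks the table above (and shows case 5 breaking
    the <= variant):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double vals[] = { -INFINITY, -1.0, 1.0, INFINITY, NAN };
        int n = sizeof vals / sizeof vals[0];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double a = vals[i], b = vals[j];
                if ((a < b) != (a - b < 0.0))
                    printf("'<'  mismatch: a=%g b=%g\n", a, b);  /* never triggers here */
                if ((a <= b) != (a - b <= 0.0))
                    printf("'<=' mismatch: a=%g b=%g\n", a, b);  /* inf,inf and -inf,-inf */
            }
        return 0;
    }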

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?

    I was thinking about what programmers might use when writing their
    code. For compilers, having that equivalence may occasionally be
    helpful for producing better code, but if it does not hold, the
    compiler will just not use such an equivalence (once the compiler is
    debugged).

    This is an example from Kahan that stuck in my mind, because it
    appeals to me as a programmer. He has also given other examples that
    don't do that for me, but may appeal to a mathematician, physicist or
    chemist.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount.

    I think the nicer properties (such as the equivalence mentioned above)
    are the more important benefit. And if you take a different branch of
    an IF-statement because you have a flush-to-zero FPU, you can easily
    get a completely bogus result when the denormal case would still have
    had enough accuracy by far.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Oct 14 15:37:10 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about
    making things easier. Providing graceful underflow means a gradual loss
    of precision as you get too close to zero, instead of losing all the
    bits at once and going straight to zero. It's about the principle of
    least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    But I find it harder to understand why denormals or subnormals are going
    to be useful. Ultimately, your floating point code is approximating
    arithmetic on real numbers. Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting
    results that have such a dynamic range that you are using denormals? And
    what are you doing where it is acceptable to lose some precision with
    those numbers, but not to give up and say things have gone badly wrong
    (a NaN or infinity, or underflow signal)? I have a lot of difficulty
    imagining a situation where denormals would be helpful and you haven't
    got a major design issue with your code - perhaps calculations should be
    re-arranged, algorithms changed, or you should be using an arithmetic
    format with greater range (switch from single to double, double to quad,
    or use something more advanced).
    Subnormals are critical for the stability of zero-seeking algorithms,
    i.e. a lot of standard algorithmic building blocks.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Oct 14 15:42:45 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    You have just named the only common pitfall, where all comparisons
    against NaN shall return false.

    You can in fact define your own

    bool IsNan(f64 x)
    {
        return ((x < 0.0) | (x >= 0.0)) == false;  /* any NaN compares false both ways */
    }

    but this depends on the compiler/optimizer not messing up.
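
    As a quick check (a sketch, assuming f64 is simply a typedef for
    double): the hand-rolled test agrees with the standard isnan() from
    <math.h>, again provided the compiler is not given -ffast-math, which
    lets it assume NaNs never occur:

    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef double f64;

    static bool IsNan(f64 x)
    {
        return ((x < 0.0) | (x >= 0.0)) == false;
    }

    int main(void)
    {
        f64 tests[] = { 0.0, -1.5, INFINITY, NAN };
        for (int i = 0; i < 4; i++)
            printf("%g: IsNan=%d isnan=%d\n",
                   tests[i], (int)IsNan(tests[i]), !!isnan(tests[i]));
        return 0;
    }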

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 14 17:29:40 2025
    From Newsgroup: comp.arch

    On 14/10/2025 13:26, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer.

    Not really:

    * Null pointers don't materialize spontaneously as results of
    arithmetic operations. They are stored explicitly by the
    programmer, making the programmer much more aware of their
    existence.


    NaNs don't materialise spontaneously either. They can be the result of intentionally using NaNs for missing data, or when your code is buggy
    and failing to calculate something reasonable. In either case, the
    surprise happens when someone passes the non-value to code that was not expecting to have to deal with it.

    * Programmers are trained to check for null pointers. And if they
    forget such a check, the result usually is that the program traps,
    usually soon after the place where the check should have been. With
    a NaN you just silently execute the wrong branch of an IF, and later
    you wonder what happened.

    Fair enough.


    * The most common use for null pointers is terminating a linked list
    or other recursive data structure. Programmers are trained to deal
    with the terminating case in their code.

    I would disagree that this is the most common use for null pointers.
    But it certainly is /one/ use, and programmers should handle that usage correctly.

    So to sum up, there is a certain similarity, but there are also
    significant differences.


    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result.

    When a denormal is generated, an underflow "exception" happens (IEEE "exceptions" are not traps). You can set your FPU to trap on a
    certain kind of exception. Maybe you can also set it up such that it produces a NaN instead. I doubt that many people would find that
    useful, however.


    And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    Denormals don't help much here. IEEE doubles cannot represent 2^1024,
    but denormals allow representing positive numbers down to 2^-1074.
    So, with denormal numbers, the absolute value of your divisor must be
    less than 2^-50 to produce a non-infinite result where flush-to-zero
    would have produced an infinity.
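
    As a concrete illustration of that difference (a hedged C sketch;
    DBL_MIN/2 is a subnormal divisor for which flush-to-zero would give an
    infinity while gradual underflow keeps the quotient finite):

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        double tiny = DBL_MIN / 2.0;           /* 2^-1023, a subnormal */

        printf("tiny     = %g\n", tiny);       /* ~1.11e-308 */
        printf("1.0/tiny = %g\n", 1.0 / tiny); /* 2^1023 ~ 8.99e307, finite */
        return 0;
    }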


    OK.

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    Not mine. An assumption that I like is that the associative law
    holds. It holds with -fwrapv, but not with overflow-is-undefined.

    I fail to see how declaring any condition undefined behaviour would
    increase any guarantees.

    The associative law holds fine with UB on overflow, as do things like
    "adding a positive number to an integer makes it bigger". But this is all straying from the discussion on floating point, and I suspect that we'd
    just re-hash old disagreements rather than starting new and interesting
    ones :-)


    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold?

    Yes.

    If any of them is a NaN, the result is false for either comparison
    (because a-b would be NaN, and because the result of any comparison
    with a NaN is false).

    For infinity there are a number of cases

    1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
    2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
    3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
    4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
    5) inf<inf (false) vs. inf-inf=NaN<0 (false)
    6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
    7) inf<-inf (false) vs. inf--inf=inf<0 (false)
    8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)

    The most interesting case here is 5), because it means that a<=b is
    not equivalent to a-b<=0, even with denormal numbers.
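
    The eight cases are easy to machine-check; an illustrative C sketch
    (the helper name and the finite value 1.0 are arbitrary choices):

    #include <math.h>
    #include <stdio.h>

    static void check(double a, double b)
    {
        /* compare a<b with a-b<0, and a<=b with a-b<=0 */
        printf("a=%6g b=%6g : a<b=%d a-b<0=%d | a<=b=%d a-b<=0=%d\n",
               a, b, a < b, a - b < 0.0, a <= b, a - b <= 0.0);
    }

    int main(void)
    {
        double inf = INFINITY;

        check( inf,  1.0);   /* case 1 */
        check(-inf,  1.0);   /* case 2 */
        check( 1.0,  inf);   /* case 3 */
        check( 1.0, -inf);   /* case 4 */
        check( inf,  inf);   /* case 5: a<=b is 1, but a-b<=0 is 0 (NaN) */
        check(-inf, -inf);   /* case 6: likewise for <= */
        check( inf, -inf);   /* case 7 */
        check(-inf,  inf);   /* case 8 */
        return 0;
    }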


    Any kind of arithmetic with infinities is going to be awkward in some way!

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?

    I was thinking about what programmers might use when writing their
    code. For compilers, having that equivalence may occasionally be
    helpful for producing better code, but if it does not hold, the
    compiler will just not use such an equivalence (once the compiler is debugged).


    Sure.

    This is an example from Kahan that stuck in my mind, because it
    appeals to me as a programmer. He has also given other examples that
    don't do that for me, but may appeal to a mathematician, physicist or chemist.


    Fair enough.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount.

    I think the nicer properties (such as the equivalence mentioned above)
    is the more important benefit. And if you take a different branch of
    an IF-statement if you have a flush-to-zero FPU, you can easily get a completely bogus result when the denormal case would still have had
    enough accuracy by far.


    Well, I think that if your values are getting that small enough to make denormal results, your code is at least questionable. I am not
    convinced that the equivalency you mentioned above is enough to make
    denormals worth the effort, but that may be just the kind of code I
    write. (And while I did study some of this stuff - numerical stability
    - in my mathematics degree, it was quite a long time ago.)

    Thanks for the comprehensive and educational information here. It is appreciated.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 14 15:31:00 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha ist
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and truncate rounding. Then, with emulators running instead on hardware with denormals and RNE.

    In the above sentence I was talking about your FPU not getting
    an infinitely correct result and then rounding to container size.
    Not about the "other" anomalies, many of which can be dealt
    with in SW.

    But, the result was that the games would work correctly on the original hardware, but in the emulators things would drift; things like
    moving platforms gradually creeping away from the origin, etc.





    Though, reading some stuff implies a predecessor chip (the R4000) had a more functionally complete FPU. So, I guess it is also possible that the R4300 had a more limited FPU to make it cheaper for the embedded market.

    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core; RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    But their combination of HW+SW gets the right answer.
    Your multiply does not.

    Or, is the argument here that sticking with weaker not-quite IEEE FPU is preferable to using trap handlers.

    The 5-bang instructions as used by HW+SW have to compute the result
    to infinite precision and then round to container size.

    The paper illustrates CRAY 1,... FP was fast but inaccurate enough
    to fund an army of numerical analysts to see if the program was
    delivering acceptable results.

    IEEE 754 got rid of the army of Numerical Analysts.
    But now, nobody remembers how bad it was/can be.

    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Nobody is asking for that.

    <snip>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 14 15:34:23 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It's about the principle of least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be useful to have a representation for that. The defined "viral" nature of NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have checks and conditionals in the middle of your calculations.

    MAX( x, NaN ) is x.

    But I find it harder to understand why denormals or subnormals are going
    to be useful.

    1/Big_Num does not underflow .............. completely.

    Ultimately, your floating point code is approximating arithmetic on real numbers.

    Don't make me laugh.

    Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 14 15:47:20 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be useful to have a representation for that. The defined "viral" nature of NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.

    Many ISAs and many programs have trouble in getting NaNs into the
    ELSE-clause. One cannot use deMorgan's Law to invert conditions in
    the presence of NaNs.

    We (Brain, Thomas and I) went to great pain to have FCMP deliver a
    bit pattern where one could invert the condition AND still deliver
    the NaN to the expected Clause. We threw in Ordered and Totally-
    Ordered at the same time, along with OpenCL FP CLASS() function.

    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.

    The worst of all possible results is no information whatsoever.

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    perhaps
    calculations should be re-arranged, algorithms changed, or you should be using an arithmetic format with greater range (switch from single to double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have,

    Don't allow THOSE programmers to program FP codes !!
    Get ones that understand the nuances.

    all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant.

    Windows 7 and Office 2003 were good enough. That would have allowed
    zillions of programmers to go address the software crisis after being
    freed from projects that had become good enough not to need continual
    work.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Oct 14 16:48:50 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    David Brown <david.brown@hesbynett.no> posted:

    Ultimately, your floating point code is approximating
    arithmetic on real numbers.

    Don't make me laugh.

    Somebody (not me) recently added the following to the gcc bugzilla
    quip file:

    The "real" type in fortran is called "real" because the
    mathematician should not notice that it has finite decimal places
    and forget that one needs lengthy adaptions of the proofs for
    that....
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 14 16:46:03 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    The associative law holds fine with UB on overflow,

    With 32-bit ints:

    The result of (2000000000+2000000000)+(-2000000000) is undefined.

    The result of 2000000000+(2000000000+(-2000000000)) is 2000000000.

    So, the associative law does not hold.

    With -fwrapv both are defined to produce 2000000000, and the
    associative law holds because modulo arithmetic is associative.
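
    A minimal sketch of the same example in C (assuming 32-bit int; compile
    once with and once without -fwrapv to compare):

    #include <stdio.h>

    int main(void)
    {
        int a = 2000000000, b = 2000000000, c = -2000000000;

        /* (a+b) overflows int: UB without -fwrapv, wraps modulo 2^32 with it */
        printf("%d\n", (a + b) + c);   /* 2000000000 under -fwrapv */
        printf("%d\n", a + (b + c));   /* 2000000000 either way */
        return 0;
    }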

    Well, I think that if your values are getting that small enough to make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 14 17:26:16 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [...]
    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    That may be a good idea. You can write it in current languages as
    follows:

    if (a<b) {
      ...
    } else if (a>=b) {
      ...
    } else {
      ... NaN case ...
    }

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}

    I don't think anything about FCMP. What I wrote above is about
    programming languages. I.e., a<b would trap if a or b is a NaN, while lt_or_nan(a,b) would be true if a or b is a NaN, and
    lt_and_not_nan(a,b) would be false if a or b is a NaN. I think the
    IEEE754 people have better names for these comparisons, but am too
    lazy to look them up.
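
    One possible spelling of those two hypothetical predicates in terms of the
    C99 quiet comparison macros (a sketch only; the names lt_or_nan and
    lt_and_not_nan are the ones invented above, not standard ones):

    #include <math.h>
    #include <stdbool.h>

    /* The raw operator a<b raises the IEEE "invalid" exception when either
       operand is a NaN (and would trap if trapping on invalid is enabled);
       the <math.h> macros below stay quiet. */
    static bool lt_and_not_nan(double a, double b)
    {
        return isless(a, b);             /* false if a or b is a NaN */
    }

    static bool lt_or_nan(double a, double b)
    {
        return !isgreaterequal(a, b);    /* true if a<b, or if a or b is a NaN */
    }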

    The first two require more knowledge about FP than many programmers
    have,

    Don't allow THOSE programmers to program FP codes !!
    Get ones that understand the nuances.

    We can all wish for Kahan writing all FP code, but that only deepens
    the software crisis. Educating programmers is certainly a worthy
    undertaking, but providing a good foundation for them to build on
    helps those programmers as well as those that are less educated.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Oct 14 12:45:08 2025
    From Newsgroup: comp.arch

    On 10/14/2025 10:31 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha ist >>>>> probably the best-known example (denormal support only in software), >>>>> and Linus Torvalds worked on it personally. Concerning exposing the >>>>> pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so a lot of the N64 games suffer drift and other issues over time (as the programmers had compensated for the MIPS issues in code rather than via traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and
    truncate rounding. Then, with emulators running instead on hardware with
    denormals and RNE.

    In the above sentence I was talking about your FPU not getting
    an infinitely correct result and then rounding to container size.
    Not about the "other" anomalies, many of which can be dealt
    with in SW.


    This mostly applies to FMUL, but:
    I had already added a trap case for this as well.

    In the cases where all the low-order bits of either input are 0,
    then the low-order results would also be 0 and so are N/A (the final
    result would be the same either way).

    If both sets of low-order bits are non-zero, it can trap.
    This does mean that the software emulation will need to provide a full
    width result though.

    Checking for non-zero here is more cost-effective than actually doing
    a full-width multiply.
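
    A rough C model of that decision (purely illustrative, not BGB's actual
    logic; LOW_BITS is a placeholder for however many low product bits the
    narrow hardware multiplier drops):

    #include <stdbool.h>
    #include <stdint.h>

    #define LOW_BITS 16   /* placeholder width, not from the original post */

    /* Per the reasoning above: if either mantissa is all zero in its low
       LOW_BITS, the truncated hardware result already matches the full-width
       one; otherwise punt to the software trap handler. */
    static bool fmul_needs_trap(uint64_t mant_a, uint64_t mant_b)
    {
        uint64_t low_mask = (UINT64_C(1) << LOW_BITS) - 1;

        return (mant_a & low_mask) != 0 && (mant_b & low_mask) != 0;
    }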


    Also, RISC-V FMADD.D and similar are sorta also going to end up as traps
    due to the lack of single-rounded FMA (though had debated whether to
    have a separate control-flag for this to still allow non-slow FMADD.D
    and similar; but as-is, these will trap).



    For FADD:
    The shifted-right bits that fall off the bottom (of the slightly-wider internal mantissa) don't matter, since they were always being added to
    0, which can't generate any carry.

    For FSUB, it may matter, but more in the sense that one can check
    whether the "fell off the bottom" part had non-zero bits and use this to adjust the carry-in part of the subtractor (since non-zero bits would
    absorb the carry-propagation of adding 1 to the bottom of a
    theoretically arbitrarily wide twos complement negation).

    So, in theory, can be dealt with in hardware to still give an exact result.


    There are still some sub-ULP bits, so the complaints about the lack of
    a guard bit don't really apply.


    Also apparently the Cray used a non-normalized floating point format (no hidden bit), which was odd (and could create its own issues).

    Though, potentially a non-normalized format with lax normalization could
    allow for cheaper re-normalization (even if it could require
    re-normalization logic for FMUL). Though, for such a format, there is
    the possibility that someone could make re-normalization be its own instruction (allowing for an FPU with less latency).


    But, the result was that the games would work correctly on the original
    hardware, but in the emulators things would drift; things like
    moving platforms gradually creeping away from the origin, etc.





    Though, reading some stuff implies a predecessor chip (the R4000) had a more functionally complete FPU. So, I guess it is also possible that the R4300 had a more limited FPU to make it cheaper for the embedded market.

    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.
    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    But their combination of HW+SW gets the right answer.
    Your multiply does not.


    As noted above, I was already working on this.


    Or, is the argument here that sticking with weaker not-quite IEEE FPU is
    preferable to using trap handlers.

    The 5-bang instructions as used by HW+SW have to compute the result
    to infinite precision and then round to container size.

    The paper illustrates CRAY 1,... FP was fast but inaccurate enough
    to fund an army of numerical analysts to see if the program was
    delivering acceptable results.

    IEEE 754 got rid of the army of Numerical Analysts.
    But now, nobody remembers how bad it was/can be.



    OK.

    As can be noted, for scalar operations I consider there to be a limit as
    to how bad is "acceptable".

    For SIMD operations, it is a little looser.
    For example, the ability to operate on integer values and get exact
    results is basically required for scalar operations, but optional for SIMD.

    Though, in this case it is a case of both Quake and also some JavaScript
    VMs relying on the ability to express integer values as floating-point
    numbers and use them in calculations as such (so, for example, if the operations don't give exact results then the programs break).


    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Nobody is asking for that.


    OK.


    Can note that in my looking, it seems like:
    Pretty much none of the ASIC implementations support the Q extension;
    It is not required in any of the mainline profiles;
    Implementing Q proper would have non-zero impact on RV64G:
    The differences between F+D and F+D+Q being non-zero.
    Whereas, "fudging it" can retain strict compatibility with D.
    Where, people actually use 'D'.

    There is a non-zero amount of code using "long double", but in this case
    the bigger issue is more the code footprint of the associated
    long-double math functions rather than performance (say, if someone uses "cosl()" or similar).

    Still not ideal, as (with my existing ISA extensions) there is still no single-instruction way to load a 64-bit value into an FPR.

    But, could at least reduce it from 11 (44 bytes) instructions to 3 (20
    bytes; "LI-Imm33; SHORI-Imm32; FMV.D.X"). This still means 40 bytes to
    load a full-width Binary128 literal.
    Loading the same literal would need 24 bytes in XG3.
    And, an unrolled Taylor expansion uses a lot of them.

    With Q proper? Only option would be to use memory loads here.
    Like, these C math functions are annoyingly bulky in this case.


    Meanwhile, elsewhere I saw a mention that, apparently to deal with RISC-V fragmentation issues, there is now work being done on a mechanism to allow modification of the RISC-V instruction listings in GCC without needing
    to modify the code in GCC proper each time (basically hot injecting
    stuff into the instruction listing and similar).

    As apparently having everyone trying to modify the ISA every which way
    is making a bit of an awful mess of things.

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Wed Oct 15 03:45:31 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programmers to write all the programs that were needed to solve business and user needs.

    By that definition, I don't think the "crisis" exists any more. It went away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Wed Oct 15 03:47:14 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 15:47:20 GMT, MitchAlsup wrote:

    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    All the good languages have IEEE754 compliant arithmetic libraries,
    including type queries for things like isnan().

    E.g. <https://docs.python.org/3/library/math.html>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 15 05:55:40 2025
    From Newsgroup: comp.arch

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programs to write all >the programs that were needed to solve business and user needs.

    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don't think the "crisis" exists any more. It went away with the rise of very-high-level languages, from about the time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Oct 15 12:41:28 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 05:55:40 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a
    useful definition for deciding whether there is a software crisis
    or not, and it does not even mention the symptom that was
    mentioned first when I learned about the software crisis (in
    1986): The cost of software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programs to
    write all the programs that were needed to solve business and user
    needs.

    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I donrCOt think the "crisis" exists any more. It
    went away with the rise of very-high-level languages, from about the
    time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on a browser released after 2018-01-28.
    I'm of the strong opinion that at least Spectre Variant 1 (Bounds Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than that of
    Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 15 12:36:17 2025
    From Newsgroup: comp.arch

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point, ignoring
    NaNs and infinities, you can imagine the calculation being done by first getting the mathematical real values from x and y. Then - again in the mathematical real domain - the operation is carried out. Then the
    result is truncated or rounded to fit back within the mantissa and
    exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^ -308
    to 10 ^ +308, or 716 orders of magnitude. (For comparison, the size of
    the universe measured in Planck lengths is only about 61 orders of
    magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another 16
    orders of magnitude - at the cost of rapidly decreasing precision. They
    don't stop the inevitable approximation to zero, they just delay it a
    little.
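
    Those limits are easy to inspect from <float.h>; a small illustrative
    sketch (DBL_TRUE_MIN is C11 and names the smallest subnormal):

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        printf("DBL_MAX      = %g\n", DBL_MAX);        /* ~1.8e308 */
        printf("DBL_MIN      = %g\n", DBL_MIN);        /* ~2.2e-308, smallest normal */
    #ifdef DBL_TRUE_MIN
        printf("DBL_TRUE_MIN = %g\n", DBL_TRUE_MIN);   /* ~4.9e-324, smallest subnormal */
    #endif
        printf("DBL_DIG      = %d\n", DBL_DIG);        /* 15 decimal digits */
        return 0;
    }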

    I am still at a loss to understand how this is going to be useful - when
    will that small extra margin near zero actually make a difference, in
    the real world, with real values? When you are using your
    Newton-Raphson iteration to find your function's zeros, what are the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero - especially
    when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and there
    you are looking to get as high an accuracy as you can in the
    intermediary steps so that you can continue for longer. But even there, denormals are not going to give you more than a tiny amount extra.

    (There are, of course, mathematical problems which deal with values or precisions far outside anything of relevance to the physical world, but
    if you are dealing with those kinds of tasks then IEEE floating point is
    not going to do the job anyway.)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 15 12:54:30 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I think -- to be persuaded that IEEE754 really was worth implementing in its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs more reliable (more numerically stable) and to give the programmer a constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did make it easier--but NaNs, infinities, Underflow at the Denorm level went in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It's about the principle of least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    MAX( x, NaN ) is x.
    That was true under 754-2008 but we fixed it for 2019: All NaNs
    propagate through the new min/max definitions. The old still exist of
    course, but they are deprecated.
    The point that made it obvious to everyone was that under the 2008
    definition an SNaN would always propagate, but be converted to a QNaN, while a QNaN could silently disappear as shown above.
    What this meant was that for any kind of vector reduction, the final
    result could be the NaN or any of the other input values, depending upon the order of the individual comparisons!
    I was one of the proponents who pushed this change through, but I will
    say that after we showed some of the most surprising results, everyone
    agreed to fix it. Having NaN maximally sticky is also definitely in the
    spirit of the entire 754 standard:
    The only operations that do not propagate NaN are those that explicitly
    handle this case, or those that don't return a floating point value.
    Having all compares return 'false' is an example of the latter.
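
    The 2008-style maxNum rule is what C99's fmax() gives you, so the silent
    disappearance is easy to see; a tiny illustrative sketch:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0, q = NAN;

        printf("fmax(x, q)      = %g\n", fmax(x, q));      /* 1: QNaN vanishes */
        printf("(x > q ? x : q) = %g\n", x > q ? x : q);   /* nan: comparison is false */
        return 0;
    }
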
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 15 13:07:01 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point, ignoring
    NaNs and infinities, you can imagine the calculation being done by first getting the mathematical real values from x and y. Then - again in the mathematical real domain - the operation is carried out. Then the
    result is truncated or rounded to fit back within the mantissa and
    exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^ -308
    to 10 ^ +308, or 716 orders of magnitude. (For comparison, the size of
    the universe measured in Planck lengths is only about 61 orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another 16 orders of magnitude - at the cost of rapidly decreasing precision. They don't stop the inevitable approximation to zero, they just delay it a little.

    I am still at a loss to understand how this is going to be useful - when will that small extra margin near zero actually make a difference, in
    the real world, with real values? When you are using your
    Newton-Raphson iteration to find your function's zeros, what are the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero - especially when these smaller numbers have lower precision?
    Please note that I have NOT personally observed this, but I have been
    told by people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.
    I.e. they differ by exactly one ulp.
    As I noted, I have not been bitten by this particular issue, one of the
    reasons being that I tend to not write infinite loops inside functions,
    instead I'll pre-calculate how many (typically NR) iterations should be needed.
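
    A minimal sketch of that style for Heron's sqrt iteration (the seed and
    the fixed count of six iterations are my own rough choices, not Terje's):

    #include <math.h>
    #include <stdio.h>

    static double nr_sqrt(double a)          /* a > 0 assumed */
    {
        double x = ldexp(1.0, ilogb(a) / 2); /* power-of-two seed, within 2x of the root */

        for (int i = 0; i < 6; i++)          /* fixed count; quadratic convergence */
            x = 0.5 * (x + a / x);
        return x;
    }

    int main(void)
    {
        printf("%.17g\n", nr_sqrt(2.0));     /* ~1.41421356237309... */
        printf("%.17g\n", nr_sqrt(1e300));   /* ~1e150 */
        return 0;
    }
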
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Oct 15 16:50:13 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 12:36:17 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting that small enough to
    make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 716 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just
    delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if
    you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and
    there you are looking to get as high an accuracy as you can in the intermediary steps so that you can continue for longer. But even
    there, denormals are not going to give you more than a tiny amount
    extra.

    (There are, of course, mathematical problems which deal with values
    or precisions far outside anything of relevance to the physical
    world, but if you are dealing with those kinds of tasks then IEEE
    floating point is not going to do the job anyway.)



    I don't think that I agree with Anton's point, at least as formulated.

    Yes, subnormals improve precision of Newton-Raphson and such*, but only
    when the numbers involved in calculations are below 2**-971, which does
    not happen very often. What is more important is that *when* it happens,
    naively written implementations of such algorithms still converge.
    Without subnormals (or without expert provisions) there is a big chance
    that they would not converge at all. That happens mostly because
    IEEE-754 preserves the following intuitive invariant:
    When x > y then x - y > 0
    Without subnormals, e.g. with VAX float formats that are otherwise
    pretty good, this invariant does not hold.
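
    The invariant is easy to demonstrate near the bottom of the normal range;
    an illustrative C sketch (with gradual underflow the difference below is a
    positive subnormal, where a flush-to-zero or VAX-style format delivers 0):

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        double y = DBL_MIN;          /* smallest normal, 2^-1022 */
        double x = 1.5 * DBL_MIN;    /* still a normal number    */
        double d = x - y;            /* 2^-1023, a subnormal     */

        printf("x > y     : %d\n", x > y);     /* 1 */
        printf("x - y     : %g\n", d);         /* ~1.11e-308 */
        printf("x - y > 0 : %d\n", d > 0.0);   /* 1, thanks to subnormals */
        return 0;
    }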


    * - I personally prefer to illustrate it with the chord-and-tangent
    root-finding algorithm, which can be used for any type of function as
    long as you have proved that on the section of interest there is no change of
    sign of its first and second derivatives. Maybe because I
    was taught this algorithm at the age of 15. This algo can be called
    half-Newton.








    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Oct 15 17:46:21 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 13:07:01 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting that small enough
    to make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the
    solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you,
    Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 716 orders of magnitude. (For comparison,
    the size of the universe measured in Planck lengths is only about
    61 orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here -
    another 16 orders of magnitude - at the cost of rapidly decreasing precision. They don't stop the inevitable approximation to zero,
    they just delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are
    using your Newton-Raphson iteration to find your function's zeros,
    what are the circumstances in which you can get a more useful end
    result if you continue to 10 ^ -324 instead of treating 10 ^ -308
    as zero - especially when these smaller numbers have lower
    precision?

    Please note that I have NOT personally observed this, but I have been
    told by people I trust (on the 754 working group) that at least
    some zero-seeking algorithms will stabilize on an exact value, if and
    only if you have subnormals, otherwise it is possible to wobble back
    & forth between two neighboring results.

    I.e. they differ by exactly one ulp.

    As I noted, I have not been bitten by this particular issue, one of
    the reasons being that I tend to not write infinite loops inside
    functions, instead I'll pre-calculate how many (typically NR)
    iterations should be needed.

    Terje
    It does not sound right to me. In Newton-like iterations, oscillations by
    1 ULP could happen even with subnormals. They should be taken care of by properly written exit conditions.
    What could happen without subnormals are oscillations by *more* than 1
    ULP, sometimes much more.
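
    One shape such an exit condition can take (a sketch, not Michael's
    code): a tolerance of a few ulps plus a hard iteration cap, here
    wrapped around a Newton iteration for sqrt.

    #include <math.h>
    #include <float.h>

    /* Illustration only: terminate a Newton-style loop without relying on
     * the iterate ever hitting an exact fixpoint.  Assumes s > 0. */
    double newton_sqrt(double s) {
        int e;
        frexp(s, &e);                          /* s = m * 2^e, m in [0.5, 1) */
        double x = ldexp(1.0, (e + 1) / 2);    /* seed within ~2x of sqrt(s) */
        for (int i = 0; i < 60; i++) {         /* hard cap: never loop forever */
            double next = 0.5 * (x + s / x);   /* Newton step for x*x = s */
            if (fabs(next - x) <= 4.0 * DBL_EPSILON * fabs(next))
                return next;                   /* step is down to a few ulps */
            x = next;
        }
        return x;                              /* backstop; shouldn't be reached */
    }
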
    Also, in the absence of subnormals one can suffer division by zero in
    code like the loop below:
    while (fb > fa) {
        /* without subnormals, fb - fa can flush to 0 even though fb > fa */
        a -= b*fa/(fb - fa);
        ...
    }
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 15 16:53:33 2025
    From Newsgroup: comp.arch

    On 15/10/2025 13:07, Terje Mathisen wrote:
    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting small enough to make denormal results, your code is at least questionable.

    As Terje Mathisen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y. Then
    - again in the mathematical real domain - the operation is carried
    out. Then the result is truncated or rounded to fit back within the
    mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or about 616 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just delay
    it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I.e. they differ by exactly one ulp.

    I have no problems believing that this can occur on occasion. No matter
    what range you pick for your floating point formats, or what precision
    you pick, you will always be able to find examples of this kind of
    algorithm that home in on the right value with the format you have
    chosen but would fail with just one bit less. I just don't think that
    such pathological examples mean that subnormals are important.

    But if such cases occur regularly in real-world calculations, not just artificial examples, then it's a different matter.


    As I noted, I have not been bitten by this particular issue, one of the reasons being that I tend to not write infinite loops inside functions; instead I'll pre-calculate how many (typically NR) iterations should be needed.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 15 17:52:48 2025
    From Newsgroup: comp.arch

    On 15/10/2025 15:50, Michael S wrote:
    On Wed, 15 Oct 2025 12:36:17 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting small enough to
    make denormal results, your code is at least questionable.

    As Terje Mathisen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the
    solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or about 616 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just
    delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if
    you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and
    there you are looking to get as high an accuracy as you can in the
    intermediary steps so that you can continue for longer. But even
    there, denormals are not going to give you more than a tiny amount
    extra.

    (There are, of course, mathematical problems which deal with values
    or precisions far outside anything of relevance to the physical
    world, but if you are dealing with those kinds of tasks then IEEE
    floating point is not going to do the job anyway.)



    I don't think that I agree with Anton's point, at least as formulated.

    Yes, subnormals improve precision of Newton-Raphson and such*, but only
    when the numbers involved in calculations are below 2**-971, which does
    not happen very often. What is more important is that *when* it happens,
    naively written implementations of such algorithms still converge.
    Without subnormals (or without expert provisions) there is a big chance
    that they would not converge at all. That happens mostly because
    IEEE-754 preserves the following intuitive invariant:
        when x > y, then x - y > 0
    Without subnormals, e.g. with VAX float formats that are otherwise
    pretty good, this invariant does not hold.


    I can appreciate that you can have x > y, but with such small x and y
    and such close values that (x - y) is a subnormal - thus without
    subnormals, (x - y) would be 0.
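
    A concrete instance of that gap (a sketch): two normal doubles whose
    difference is representable only as a subnormal.

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        double x = 1.5 * DBL_MIN;   /* normal */
        double y = 1.0 * DBL_MIN;   /* normal */
        double d = x - y;           /* 0.5 * DBL_MIN: a subnormal */
        /* With IEEE gradual underflow this prints a nonzero subnormal.
         * On hardware that flushes subnormals to zero (or on VAX), d would
         * be 0 even though x > y - the broken invariant discussed above. */
        printf("x > y: %d,  x - y = %a\n", x > y, d);
        return 0;
    }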

    Perhaps I am being obtuse, but I don't see how you would write a Newton-Raphson algorithm that would fail to converge, or fail to stop,
    just because you don't have subnormals. Could you give a very rough
    outline of such problematic code?


    * - I personally prefer to illustrate it with the chord-and-tangent
    root-finding algorithm, which can be used for any type of function as
    long as you have proved that, on the section of interest, there is no
    change of sign of its first or second derivative. Maybe that is because
    I was taught this algorithm at the age of 15. This algorithm could be
    called half-Newton.


    I was perhaps that age when I first came across Newton-Raphson in a
    maths book, and wrote an implementation for it on a computer. That was
    in BBC Basic, and I'm pretty sure that the floating point type there was
    not IEEE compatible, and did not support such fancy stuff as subnormals!
    But I am also very sure I did not push the program to more difficult examples. (But it did show nice graphic illustrations of what it was
    doing.)

    It was also around then that I wrote a program for matrix inversion, and discovered the joys of numeric instability, and thus the need for care
    when picking the order for Gaussian elimination.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Oct 15 13:22:01 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Wed, 15 Oct 2025 05:55:40 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a
    useful definition for deciding whether there is a software crisis
    or not, and it does not even mention the symptom that was
    mentioned first when I learned about the software crisis (in
    1986): The cost of software exceeds the cost of hardware.
    The "crisis" was supposed to do with the shortage of programs to
    write all the programs that were needed to solve business and user
    needs.
    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don't think the "crisis" exists any more. It
    went away with the rise of very-high-level languages, from about the
    time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the
    difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton

    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.

    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).
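
    A rough C-level analogue of the same trick (a sketch; the Linux
    kernel's array_index_nospec() does something similar, using inline asm
    to keep the comparison itself branch-free):

    #include <stddef.h>

    /* Branchless clamp: mask is all-ones when idx < limit, all-zeros
     * otherwise, so the load address carries a data dependency on the
     * bounds check.  A compiler may still emit a branch for the compare;
     * real implementations pin this down with inline asm. */
    static inline size_t clamp_index(size_t idx, size_t limit) {
        size_t mask = (size_t)0 - (size_t)(idx < limit);
        return idx & mask;
    }

    int read_elem(const int *array, size_t idx, size_t limit) {
        if (idx >= limit)
            return -1;                          /* architectural bounds check  */
        return array[clamp_index(idx, limit)];  /* misspeculation sees index 0 */
    }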

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 21:09:27 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ----------------------------

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost?

    Most people would say:: "When it adds performance" AND the compiler
    can use it. Some would add: "from unmodified source code"; but I
    am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Printf-family "closes more of the gap" than EDIT ever could. And there
    is a whole suite of things better off left in subroutines than being
    raised into Instructions.

    Unfortunately, elementary FP functions are no longer in that category.
    When one can perform SIN(x) along with argument reduction and polynomial calculation in the cycle time of FDIV, SIN() deserves to be a first
    class member of the instruction set--especially if the HW cost is
    "not that much".

    On the other hand: things like polynomial evaluating instructions
    seem a bridge too far as you have to pick for all time 1 of {Horner,
    Estrin, Padé, Power Series, Clenshaw, ...} and at some point it
    becomes better to start using FFT-derived evaluation means.
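
    For reference, the first two of those schemes applied to the same cubic
    (a sketch): Horner is one serial dependency chain, Estrin trades an
    extra multiply for independently evaluable halves.

    /* Evaluate c0 + c1*x + c2*x^2 + c3*x^3 two ways (illustration only). */

    /* Horner: fewest operations, but each step depends on the previous. */
    static double horner(double x, double c0, double c1, double c2, double c3) {
        return ((c3 * x + c2) * x + c1) * x + c0;
    }

    /* Estrin: the two halves can be computed in parallel, then combined. */
    static double estrin(double x, double c0, double c1, double c2, double c3) {
        double x2 = x * x;
        return (c0 + c1 * x) + x2 * (c2 + c3 * x);
    }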

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Arguably, the best thing to do here is to Trap on the creation of deNorms.
    At least then you can see them and do something about them at the algorithm level. {Gee Whiz Cap. Obvious: IEEE 754 already did this!}

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    My 66000 is immune from Spectré; µA state is not updated until retire.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    We just don't have the smoking gun of a missing $1M-to-$1B to make it
    worth the effort to do something about it. But mark my words:: the vulnerability is being exploited ...

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 21:13:53 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    -------------------------------
    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton

    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    My 66000 allows an application to crap all over "the stack";
    but it does provide a means whereby "crapping all over the stack"
    does not allow the application to violate the contract between caller
    and callee. Once application performs a RET (or EXIT) control is returns
    to caller 1 instruction past calling point, and with the preserved
    registers preserved !

    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 21:28:52 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.

    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    Because there is no branch, there is no way to speculate around the check (but load value speculation could negate this fix).

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 21:34:14 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more
    accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    As I noted, I have not been bitten by this particular issue, one of the reasons being that I tend to not write infinite loops inside functions; instead I'll pre-calculate how many (typically NR) iterations should be needed.

    Almost always the right course of events.

    The W() function may be different. W( poly×(e^poly) ) = poly.

    Terje
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Wed Oct 15 21:37:42 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:

    Most people would say:: "When it adds performance" AND the compiler can
    use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    I believe GCC can do auto-vectorization in some situations.

    But the RISC-V folks still think Cray-style long vectors are better than
    SIMD, if only because it preserves the "R" in "RISC".
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Wed Oct 15 21:42:32 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The "crisis" was supposed to do with the shortage of programs to write
    all the programs that were needed to solve business and user needs.

    By that definition, I don't think the "crisis" exists any more. It went
    away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked.

    Correct. That does seem to be a key part of what "very-high-level" means.

    There has been quite a bit of work on adding static typechecking to some
    of these languages in the last decade or so, and the motivation given
    for that is difficulties in large software projects using these
    languages.

    What we're seeing here is a downward creep, as those very-high-level
    languages (Python and JavaScript, particularly) are encroaching into the
    territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels,
    otherwise we might as well use the latter.

    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    I'm not aware of such; feel free to give an example of some large Python
    project, for example, which has exceeded its time and/or budget. The key
    point about using such a very-high-level language is you can do a lot in
    just a few lines of code.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 22:19:18 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:

    Most people would say:: "When it adds performance" AND the compiler can
    use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    I believe GCC can do auto-vectorization in some situations.

    Yes, 28 YEARS after it was first put in !! it danged better be
    able !?! {yes argue about when}

    My point was that you don't put it in until you can see a performance
    advantage in the very next (or internal) compiler. {Where 'you' are
    the designers of that generation.}

    But the RISC-V folks still think Cray-style long vectors are better than
    SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors
    (or vice versa)--they simply represent different ways of shooting
    yourself in the foot.

    No ISA with more than 200 instructions deserves the RISC mantra.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 22:31:32 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The "crisis" was supposed to do with the shortage of programs to write
    all the programs that were needed to solve business and user needs.

    By that definition, I don't think the "crisis" exists any more. It went
    away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked.

    Correct. That does seem to be a key part of what "very-high-level" means.

    There has been quite a bit of work on adding static typechecking to some
    of these languages in the last decade or so, and the motivation given
    for that is difficulties in large software projects using these
    languages.

    What we're seeing here is a downward creep, as those very-high-level
    languages (Python and JavaScript, particularly) are encroaching into the
    territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels,
    otherwise we might as well use the latter.

    There is a pernicious trap:: once an application written in a VHLL
    is acclaimed by the masses--it instantly falls into the trap where
    "users want more performance":: something the VHLL cannot provide
    until they.........

    45 years ago it was LISP, you wrote the application in LISP to figure
    out the required algorithms and once you got it working, you rewrote
    it in a high-performance language (FORTRAN or C) so it was usably fast.

    History has a way of repeating itself, when no-one remembers the past.

    In any case, even with these languages there are still software projects that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how to
    make the (17 kinds of) hammers one needs, there is little need to make a
    new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been happier... The mouse was more precise in W7 than in W8 ... With a little upgrade for new PCIe architecture along the way rather than redesigning
    whole kit and caboodle for tablets and phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by demand
    for the product, but pushed by companies to make already satisfied users
    have to upgrade.

    Those programmers could have transitioned to new SW projects rather than
    redesigning the same old thing 8 more times. Presto, there are now enough
    well-trained SW engineers to tackle the undone SW backlog.

    I'm not aware of such; feel free to give an example of some large Python
    project, for example, which has exceeded its time and/or budget. The key
    point about using such a very-high-level language is you can do a lot in
    just a few lines of code.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Thu Oct 16 05:44:04 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Thu Oct 16 05:57:34 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup wrote:

    On Wed, 15 Oct 2025 21:42:32 -0000 (UTC), Lawrence D'Oliveiro wrote:

    What we're seeing here is a downward creep, as those very-high-level
    languages (Python and JavaScript, particularly) are encroaching into
    the territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels,
    otherwise we might as well use the latter.

    There is a pernicious trap:: once an application written in a VHLL is acclaimed by the masses--it instantly falls into the trap where "users
    want more performance":: something the VHLL cannot provide until they.........

    45 years ago it was LISP, you wrote the application in LISP to figure
    out the required algorithms and once you got it working, you rewrote it
    in a high-performance language (FORTRAN or C) so it was usably fast.

    No, you didn't. There is a Pareto rule in effect, in that the majority of
    the CPU time (say, 90%) is spent in a minority of the code (say, 10%). So
    having got your prototype working, and done suitable profiling to identify
    the bottlenecks, you concentrate on optimizing those bottlenecks, not on
    rewriting the whole app.

    Paul Graham (well-known LISP guru) described how the company he was with
    -- one of the early Dotcom startups -- wrote Orbitz, an airline
    reservation system, in LISP. But the most performance critical part was
    done in C++.

    Nowadays, with the popularity of Python, we already have lots of efficient
    lower-level toolkits to take care of common tasks, taking advantage of the
    versatility of the core Python language. For example, NumPy for handling
    serious number-crunching: you write a few lines of Python, to express a
    high-level operation that crunches a million sets of numbers in just a few
    seconds.

    Maybe it only took you a minute to come up with the line of code; maybe
    you will never need to run it again. Writing a program entirely in FORTRAN
    or C to perform the same operation might take an expert programmer an hour
    or two, say; in that time, the Python programmer could try out dozens of similar operations, maybe discard the results of three quarters of them,
    to narrow down the important information to be extracted from the raw
    data.

    That's the kind of productivity gain we enjoy nowadays, on a routine
    basis, without making a big deal about it in news headlines. And that's
    why we don't talk about a "software crisis" any more.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 16 09:04:23 2025
    From Newsgroup: comp.arch

    On 16/10/2025 07:44, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers. With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the
    actual execution units are. I have no experience with this (or much
    experience with SIMD), but that seems like a big win to my mind. It is
    akin to letting the processor hardware handle multiple instructions in
    parallel in superscalar cpus, rather than Itanium EPIC coding.
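
    The pattern being described is the classic strip-mined loop. A C sketch,
    with hw_set_vector_length() as a hypothetical stand-in for what RISC-V's
    vsetvli returns, shows why the same source code works for any hardware
    vector width:

    #include <stdint.h>
    #include <stddef.h>

    /* Stand-in model: the hardware tells us how many elements it will
     * process this trip (at most its own VLMAX, here pretended to be 8). */
    static size_t hw_set_vector_length(size_t n) {
        const size_t VLMAX = 8;
        return n < VLMAX ? n : VLMAX;
    }

    void add_arrays(int32_t *c, const int32_t *a, const int32_t *b, size_t n) {
        while (n > 0) {
            size_t vl = hw_set_vector_length(n);
            for (size_t i = 0; i < vl; i++)   /* stands in for one vector add */
                c[i] = a[i] + b[i];
            a += vl; b += vl; c += vl; n -= vl;
        }
    }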


    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 16 07:00:58 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    You apparently only consider attacks through the browser as relevant. Netspectre demonstrates a completely remote attack, i.e., without a
    browser.

    As for the browsers, AFAIK they tried to make Spectre leak less by
    making the clock less precise. That does not stop Spectre, it only
    makes data extraction using the clock slower. Moreover, there are
    ways to work around that by running a timing loop, i.e., instead of
    the clock you use the current count of the counted loop.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    What do you mean with "mitigated in hardware"? The answers to
    hardware vulnerabilities are to either fix the hardware (for Spectre
    "invisible speculation" looks the most promising to me), or to leave
    the hardware vulnerable and mitigate the vulnerability in software
    (possibly supported by hardware or firmware changes that do not fix
    the vulnerability).

    So do you not want it to be fixed in hardware, or not mitigated in
    software? As long as the hardware is not fixed, you may not have a
    choice in the latter, unless you use an OS you write yourself. AFAIK
    you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
    slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations
    disabled.

    So if you are against hardware fixes, you will pay for software
    mitigations, in development cost (possibly indirectly) and in
    performance.

    More info on the topic:

    Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Oct 16 11:34:20 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some
    zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate
    lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.
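
    A scalar sketch of that approach (assuming positive finite inputs and a
    crude exponent-based seed rather than the table lookup Terje describes):

    #include <math.h>

    /* One Newton-Raphson step for y ~ 1/sqrt(x): y' = 0.5*y*(3 - x*y*y). */
    static double rsqrt_step(double x, double y) {
        return 0.5 * y * (3.0 - (x * y) * y);
    }

    /* Approximate 1/sqrt(x) for x > 0.  The seed only gets the exponent
     * about right, so 8 fixed steps are used; a table-lookup seed would
     * need far fewer. */
    static double my_rsqrt(double x) {
        int e;
        frexp(x, &e);                    /* x = m * 2^e, m in [0.5, 1) */
        double y = ldexp(1.0, -e / 2);   /* within a factor of ~2 of 1/sqrt(x) */
        for (int i = 0; i < 8; i++)
            y = rsqrt_step(x, y);
        return y;
    }

    /* The other two operations, built on the same primitive (a final
     * correction step would still be needed for last-ulp accuracy). */
    static double my_sqrt(double x)           { return x * my_rsqrt(x); }
    static double my_div (double a, double b) { double r = my_rsqrt(b); return a * r * r; }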

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 16 10:24:37 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 16/10/2025 07:44, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results
    this *design approach* should achieve.

    "Several factors indicate a Reduced Instruction Set Computer as a
    reasonable design alternative.
    ...
    Implementation Feasibility. A great deal depends on being able to fit
    an entire CPU design on a single chip.
    ...
    [EricP: reduced absolute amount of logic for a minimum implementation]

    Design Time. Design difficulty is a crucial factor in the success of
    VLSI computer.
    ...
    [EricP: reduced complexity leading to reduced design time]

    Speed. The ultimate test for cost-effectiveness is the speed at which an implementation executes a given algorithm. Better use of chip area and availability of newer technology through reduced debugging time contribute
    to the speed of the chip. A RISC potentially gains in speed merely from a simpler design.
    ...
    [EricP: reduced complexity and logic leads to reduced critical
    path lengths giving increased frequency.]

    Better use of chip area. If you have the area, why not implement the CISC?
    For a given chip area there are many tradeoffs for what can be realized.
    We feel that the area gained back by designing a RISC architecture rather
    than a CISC architecture can be used to make the RISC even more attractive
    than the CISC. ... When the CISC becomes realizable on a single chip,
    the RISC will have the silicon area to use pipelining techniques;
    when the CISC gets pipelining the RISC will have on chip caches, etc.
    ...
    [EricP: reduced waste on dragging around architectural boat anchors]

    The experience we have from compilers suggests that the burden on compiler writers is eased when the instruction set is simple and uniform.
    ...
    [EricP: reduced compiler complexity and development work]
    "

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 16 10:32:21 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.
    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register
    write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    My idea is the same as a SUB instruction with overflow detect,
    which I would already have. I like cheap solutions.

    But the core idea here, to eliminate a control flow race condition by
    changing it to a data flow dependency, may be applicable in other areas.

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).

    On second thought, no, load value speculation would not negate this fix.

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.

    I'd prefer not to step in that cow pie to begin with.
    Then I won't have to spend time cleaning my shoes afterwards.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 16 23:04:44 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 07:00:58 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    You apparently only consider attacks through the browser as relevant. Netspectre demonstrates a completely remote attack, i.e., without a
    browser.

    As for the browsers, AFAIK they tried to make Spectre leak less by
    making the clock less precise. That does not stop Spectre, it only
    makes data extraction using the clock slower. Moreover, there are
    ways to work around that by running a timing loop, i.e., instead of
    the clock you use the current count of the counted loop.


    I don't think that was the primary mitigation of Spectre Variant 1
    implemented in browsers.
    Indeed, they made the clock less precise, but that was their secondary
    line of defense, mostly aimed at new Spectre variants that have not
    been discovered yet.
    For Spectre Variant 1 they implemented a much more direct defense.
    For example, before mitigation the JS statement val = x[i] was compiled to:
        cmp    %RAX, 0(%RDX)        # compare i with x.limit
        jbe    oob_handler
        mov    8(%RDX, %RAX, 4), %RCX
    After mitigation it looks like:
        xor    %ECX, %ECX
        cmp    %RAX, 0(%RDX)        # compare i with x.limit
        jbe    oob_handler
        cmovbe %ECX, %EAX           # data dependency prevents problematic speculation
        mov    8(%RDX, %RAX, 4), %RCX

    Almost identical code could be generated on ARM or POWER or SPARC. On
    MIPS rev6 it could be even shorter. On non-extended RISC-V it would be
    somewhat longer, but browser vendors do not care about RISC-V, extended
    or not.

    The part above written for the benefit of interested bystanders.
    You already know all that.


    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    What do you mean with "mitigated in hardware"? The answers to
    hardware vulnerabilities are to either fix the hardware (for Spectre "invisible speculation" looks the most promising to me), or to leave
    the hardware vulnerable and mitigate the vulnerability in software
    (possibly supported by hardware or firmware changes that do not fix
    the vulnerability).

    So do you not want it to be fixed in hardware, or not mitigated in
    software? As long as the hardware is not fixed, you may not have a
    choice in the latter, unless you use an OS you write yourself. AFAIK
    you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
    slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations disabled.

    So if you are against hardware fixes, you will pay for software
    mitigations, in development cost (possibly indirectly) and in
    performance.

    More info on the topic:

    Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf

    - anton

    Maybe I'll look at it some day. Certainly not tonight.
    Maybe never.
    After all, neither you nor I are experts in the design of modern high-perf
    CPUs. So our reasoning about the performance impact of this or that HW
    solution is at best educated hand-waving.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 16 15:17:22 2025
    From Newsgroup: comp.arch

    On 10/16/2025 12:44 AM, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.


    With some fighting as to what exactly it means:
    Small Listing (or smallest viable listing);
    Simple Instructions (Eg: Load/Store);
    Fixed-size instructions;
    ...

    So, for RISC-V:
    First point only really holds in the case of RV64I.
    For RV64G, there is already a lot of unnecessary stuff in there.
    Second Point:
    Fails with the 'A' extension;
    Also parts of F/D.
    Third Point:
    Fails with RV-C.
    Though, people redefine it:
    Still RISC so long as not using an x86-style encoding scheme.

    Well, and still the past example of some old marketing for MSP430 trying
    to pass it off as a RISC, where it had more in common with PDP-11 than
    with any of the RISC's (and the only reason the listing looks tiny is by
    ignoring the special cases encoded in certain combinations of registers
    and addressing modes).

    Like, you can sweep things like immediate-form instructions when you can
    do "@PC+" and get the same effect.


    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)


    RISC-V tends to fail at this one in some areas...

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    The P extension is also a fail in this area, as they went whole-hog in defining new instructions for nearly every possible combination.



    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.


    IME, SIMD tends to primarily show benefits with 2 and 4 element vectors.

    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often achieved sufficiently with two vectors.

    Also, element sizes:
    Most of the dominant use-cases seem to involve 16 and 32 bit elements.
    Most cases that involve 8 bit elements are less suited to actual
    computation at 8 bits (for example, RGB math often works better at 16 bits).


    There are some weaknesses, for example, I mostly ended up dealing with
    RGB math by simply repeating the 8-bit values twice within a 16-bit spot.
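
    In scalar terms, "repeating the value twice" is the x * 257 trick, which
    maps 0..255 onto the full 0..65535 range and back (an illustration, not
    BGB's code):

    #include <stdint.h>

    /* Widen an 8-bit channel by repeating the byte: 0xAB -> 0xABAB.
     * 0 maps to 0 and 255 maps to 65535, so the endpoints stay exact. */
    static inline uint16_t widen8(uint8_t c)    { return (uint16_t)(c * 257u); }

    /* Narrow back by taking the high byte (truncating; a rounding version
     * would add 128 before the shift). */
    static inline uint8_t narrow16(uint16_t w)  { return (uint8_t)(w >> 8); }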

    For various tasks, it might have been better to have gone with an
    unpack/repack scheme like:
    Pad2.Value8.Frac6
    Pad4.Value8.Frac4
    Where Pad can deal with values outside unit range, and Frac with values between the two LDR points. Then the RGB narrowing conversion operations
    could have had the option for round-and-saturate.

    Though, a more tacky option is to use the existing unpack operation and
    then invert the low-order bits to add a little bit of padding space for underflow/overflow.

    Another option being to use "Packed Shift" instructions to get a format
    with pad bits.


    No saturating ops in my case, as saturating ops didn't seem worth it
    (and having Wrap/SSat/USat/... is a big part of the combinatorial
    explosion seen in the P extension).



    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the rCLRrCY stands for.


    Checking, if I take XG3, and exclude SIMD, 128-bit integer instructions,
    stuff for 96-bit addressing, etc, the listing drops to around 208 instructions.

    This does still include things like instructions with niche addressing
    modes (such as "(GP,Disp16)"), etc.

    If stripped back to "core instructions" (excluding rarely-used
    instructions, such as ROT*/etc, and some of these alternate-mode
    instructions, etc), could be dropped back a little further.

    There are some instructions in the listing that would have been merged
    in RISC-V, like FPU instructions which differ only in rounding mode (the
    RNE and DYN instructions exist as separate instructions in this case, ...).


    It is a little over 400 if the SIMD and ALUX stuff and similar is added
    back in (excluding things like placeholder spots, or instructions which
    were copied from XG2 but are either N/A or redundant, ...).

    There is a fair chunk of instructions which mostly exist as SIMD format converters and similar.


    So, seems roughly:
    ~ 50%: Base instructions
    ~ 20%: ALUX and 96-bit addressing.
    ~ 30%: SIMD stuff

    Internally to the CPU core, there are roughly 44 core operations ATM,
    though many multiplex groups of related operations as sub-operations.

    So, things like ALU/CONV/etc don't represent a single instruction.
    But, JMP/JSR/BRA/BSR are singular operations (and BRA/BSR both map to
    JAL on the RV side, differing as to whether Rd is X0 or X1; similarly
    with both JMP and JSR mapping to JALR in a similar way).
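
    As a decode-side sketch of that mapping (illustrative names only, not
    the actual implementation):

    /* RISC-V JAL/JALR carry a link register Rd; whether the internal op is
       a plain branch/jump or a call is just a question of Rd. */
    typedef enum { OP_BRA, OP_BSR, OP_JMP, OP_JSR } internal_op;

    static internal_op map_jump(int is_jalr, unsigned rd)
    {
        if (!is_jalr)                    /* JAL                            */
            return (rd == 0) ? OP_BRA    /* jal x0, ...  -> branch         */
                             : OP_BSR;   /* jal x1, ...  -> call (link)    */
        else                             /* JALR                           */
            return (rd == 0) ? OP_JMP    /* jalr x0, ... -> indirect jump  */
                             : OP_JSR;   /* jalr x1, ... -> indirect call  */
        /* rd other than x0/x1 would need the general link-register path
           (or a trap), per the discussion below. */
    }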

    BSR and JSR had been modified to allow arbitrary link register, but it
    may make sense to reverse this; as Rd other than X0 and X1 is seemingly
    pretty much never used in practice (so not really worth the logic cost).


    Other option being to trap and (potentially) emulate, if Rd is not X0 or
    X1 (or just ignore it). Also, very possible, is demoting basically the
    entire RV 'A' extension to "trap and emulate".

    So, in HW:
    RV64I : Fully
    M : Mostly
    A : Trap/Emulate
    F/D : Partial (many cases are traps)
    Zicsr : Partial (trap in general case)
    Zifencei: Trap
    ...


    where, say, ALU gets a 6-bit control value (see the decode sketch below):
    (3:0): Which basic operation to perform;
    (5:4): Width/packing mode, one of:
    00: 32-bit, sign-ext result (eg: ADDW in RV terms)
    01: 32-bit, zero-ext result (eg: ADDWU in RV terms)
    10: 64-bit (ADD)
    11: 2x 32-bit (some ops) or 4x 16-bit (other ops), eg: PADD.L or PADD.W.
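
    A minimal decode sketch of that control value (field and type names are
    illustrative assumptions, not the actual implementation):

    #include <stdint.h>

    typedef enum { W32_SX, W32_ZX, W64, PACKED } alu_width;

    typedef struct {
        unsigned  op;      /* ctl[3:0]: which basic ALU operation          */
        alu_width width;   /* ctl[5:4]: how the 64-bit datapath is treated */
    } alu_ctl;

    static alu_ctl decode_alu_ctl(uint8_t ctl6)
    {
        alu_ctl c;
        c.op = ctl6 & 0x0F;
        switch ((ctl6 >> 4) & 0x03) {
        case 0:  c.width = W32_SX; break;   /* 32-bit, sign-extended (ADDW)  */
        case 1:  c.width = W32_ZX; break;   /* 32-bit, zero-extended (ADDWU) */
        case 2:  c.width = W64;    break;   /* full 64-bit (ADD)             */
        default: c.width = PACKED; break;   /* 2x32 or 4x16, op-dependent    */
        }
        return c;
    }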

    There is CONV/CONV2/CONV3:
    CONV: Simple 2R converter ops which may have 1-cycle latency
    (later demoted to 2-cycle, with MOV being relocated elsewhere).
    CONV2: More complex 2R converter ops, 2 cycle latency.
    CONV3: Same as CONV2, but because CONV2 ran out of space.


    Still no real mechanism to deal with the potential proliferation of
    ".UW" instructions in RISC-V; for now I have been ignoring this.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 16 16:26:27 2025
    From Newsgroup: comp.arch

    On 10/16/2025 2:04 AM, David Brown wrote:
    On 16/10/2025 07:44, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD
    extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different
    hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers. With the vector-style
    instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA, and you have the same code no matter how wide
    the actual execution units are. I have no experience with this (or much
    experience with SIMD), but that seems like a big win to my mind. It is
    akin to letting the processor hardware handle multiple instructions in
    parallel in superscalar CPUs, rather than Itanium EPIC coding.
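
    A small illustration of that width lock-in with x86 intrinsics (an
    illustrative example; scalar tails omitted): the 128-bit and 256-bit
    versions are entirely different code even though they do the same thing.

    #include <immintrin.h>

    /* SSE: 4 floats per instruction -- tied to 128-bit registers. */
    void add_f32_sse(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }

    /* AVX: 8 floats per instruction -- different registers, different
       intrinsics, different binary code for the same loop. */
    void add_f32_avx(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
        }
    }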


    But, there is a problem:
    Once you go wider than 2 or 4 elements, the cases where wider SIMD
    brings more benefit tend to fall off a cliff.

    More so, when you go wider, there are new problems:
    Vector Masking;
    Resource and energy costs of using wider vectors;
    ...

    Then, for 'V':
    In the basic case, it effectively doubles the size of the register file
    vs 'G';
    ...


    Then we have x86 land:
    SSE: Did well;
    AVX256: Rocky start, negligible benefit from the YMM registers;
    Using AVX encodings for 128-bit vectors being arguably better.
    AVX512: Sorta exists, but:
    Very often not supported;
    Trying to use it (on supported hardware) often makes stuff slower.

    If even Intel can't make their crap work well, I am skeptical.

    While arguably GPUs were very wide, it is different:
    They were often doing very specialized tasks (such as 3D rendering);
    And, often with a SIMT model rather than "very large SIMD";
    Things like CUDA (and RTX) actually push things narrower;
    Larger numbers of narrower cores,
    rather than a smaller number of wider cores.
    ...


    The one area that doesn't seem to run into a diminishing returns wall
    seems to be to map "embarrassingly parallel" problems to large numbers
    of processor cores, and to try to keep things as loosely coupled as
    possible.

    This works mostly until the CPU runs out of memory bandwidth or similar.



    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.


    Agreed, this is more the stance I take.

    Instructions should be simple for the hardware, and should try to allow
    for low latency, rather than the goal being to keep the instruction
    listing small.



    Though, that said, in my case I still ended up making most instructions
    have a 2 or 3 cycle latency.

    So, generally, MOV-RR and MOV-IR end up as basically the only
    single-cycle instructions. A case could almost be made for making *all*
    instructions 2 or 3 cycles and then eliminating forwarding from EX1
    entirely (or maybe adding an EX4 stage).

    Say:
    PF IF ID RF E1 E2 E3 WB
    FW from E2 and E3
    RAW hazard between RF and E1 always stalls.
    Or:
    PF IF ID RF E1 E2 E3 E4 WB
    FW from E2, E3, and E4.

    With an E4 stage, one could maybe allow for pipelined low-precision FMAC
    or similar.
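
    A toy model of the resulting stall behavior (an illustrative sketch:
    single-issue, in-order, every result taking 2 cycles before it can be
    forwarded; it ignores the E3/E4 paths and only shows how back-to-back
    RAW dependencies insert bubbles):

    #include <stdio.h>

    #define N 6

    int main(void)
    {
        /* deps[i] = index of the instruction whose result insn i consumes,
           or -1 if it has no register dependency. */
        int deps[N] = { -1, 0, 1, -1, 3, -1 };
        int ready[N];                       /* cycle a result can forward   */
        int cycle = 0;

        for (int i = 0; i < N; i++) {
            if (deps[i] >= 0 && ready[deps[i]] > cycle)
                cycle = ready[deps[i]];     /* stall until value forwards   */
            printf("insn %d issues at cycle %d\n", i, cycle);
            ready[i] = cycle + 2;           /* 2-cycle result latency       */
            cycle++;                        /* one instruction per cycle    */
        }
        return 0;
    }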


    Though, I see it more as the ISA not actively hindering achieving >= 1
    IPC throughput, rather than instructions having 1 cycle latency.

    But, it can be noted that having 2-cycle latency does hinder the
    efficiency of some common patterns in RISC-V, where tight register RAW
    dependencies run rampant.

    So, say, you ideally want 5-8 instructions between each instruction and
    the next instruction that uses the result. This typically does not
    happen in most code, and particularly not if one needs instruction
    chains for semi-common idioms (say, where the optimal instruction
    scheduling would far exceed the length of a typical loop body).

    For better or worse, this does tend to result in a lot of
    performance-sensitive code being written with fairly heavy-handed loop
    unrolling (as in the sketch below).
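
    For example (an illustrative sketch): a dot product written naively
    chains every add through the same accumulator, so with 2-3 cycle
    latency and no OoO it stalls on every iteration; unrolling with
    independent accumulators spreads the RAW distance out.

    float dot_naive(const float *a, const float *b, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];          /* each add waits on the previous */
        return sum;
    }

    float dot_unrolled(const float *a, const float *b, int n)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i+0] * b[i+0];       /* four independent chains: each  */
            s1 += a[i+1] * b[i+1];       /* accumulator is reused only     */
            s2 += a[i+2] * b[i+2];       /* every ~4 instructions          */
            s3 += a[i+3] * b[i+3];
        }
        for (; i < n; i++)
            s0 += a[i] * b[i];           /* scalar tail */
        return (s0 + s1) + (s2 + s3);
    }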

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 16 21:52:22 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on a browser released after 2018-01-28.

    I'm of the strong opinion that at least Spectre Variant 1 (Bounds Check
    Bypass) should not be mitigated in hardware.
    W.r.t. Variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation is it a realistic threat for me.
    However, it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than
    that of Variant 1, if some CPU vendors decide to mitigate Variant 2, I
    would not call them spineless idiots because of it. I'd call them
    "slick businessmen", which in my book is less derogatory.
    I had an idea on how to eliminate Bounds Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    CHKLTU value_Rs1, #limit_imm

    which throw an overflow fault exception if the value register is >= the
    unsigned limit. (The unsigned >= check also catches negative signed
    integer values.)

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectre-sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value
    register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]
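
    In C terms the shape of the idea is roughly the following (an
    illustrative sketch; chk_index() stands in for a CHKLTU-style
    instruction and is not a real intrinsic, and in plain C the compiler is
    of course still free to speculate -- the point is that the hardware
    instruction would make the load's index operand data-dependent on the
    check):

    #include <stddef.h>
    #include <stdlib.h>

    static inline size_t chk_index(size_t idx, size_t limit)
    {
        if (idx >= limit)
            abort();       /* the real instruction would raise a fault */
        return idx;        /* value only produced if the check passed  */
    }

    int load_checked(const int *base, size_t idx, size_t limit)
    {
        size_t safe = chk_index(idx, limit);
        return base[safe]; /* consumes 'safe', not the unchecked 'idx' */
    }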

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    My idea is the same as a SUB instruction with overflow detect,
    which I would already have. I like cheap solutions.

    But the core idea here, to eliminate a control flow race condition by changing it to a data flow dependency, may be applicable in other areas.

    This adds unnecessary execution latency to the architectural path.
    Without the check you have <say> 3-cycle unchecked LD
    With the check you have 4-cycle checked LD

    Now get some multi-pointer chasing per iteration algorithm in a loop and
    all of a sudden the execution window is no longer big enough to run it at
    full speed.

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).

    On second thought, no, load value speculation would not negate this fix.

    x86 has cases (like Shift by 0) where HW predicts that CFLAGS are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at the µA level.

    I'd prefer not to step in that cow pie to begin with.

    Just making sure you remain aware of the cow-pies littering the field...

    Then I won't have to spend time cleaning my shoes afterwards.

    I am more worried about the blood on the shoes than the cow-pie.
    {{shooting oneself in the foot}}
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 16 21:59:14 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told by people I trust (on the 754 working group) that at least some
    zero-seeking algorithms will stabilize on an exact value if and only if
    you have subnormals; otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.
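
    A sketch of that "one helper" approach (an illustrative example;
    rsqrt_seed() stands in for whatever approximate-estimate instruction or
    table lookup the hardware provides, and none of these are correctly
    rounded -- they only show the data flow):

    /* One Newton-Raphson step for y ~= 1/sqrt(x): y' = y*(1.5 - 0.5*x*y*y).
       Each step roughly doubles the number of correct bits. */
    static inline double rsqrt_step(double x, double y)
    {
        return y * (1.5 - 0.5 * x * y * y);
    }

    extern double rsqrt_seed(double x);   /* low-precision initial guess */

    double my_rsqrt(double x)             /* x > 0 assumed throughout    */
    {
        double y = rsqrt_seed(x);
        y = rsqrt_step(x, y);
        y = rsqrt_step(x, y);             /* iterate to target precision */
        return y;
    }

    double my_sqrt(double x)              /* sqrt(x) = x * rsqrt(x)      */
    {
        return x * my_rsqrt(x);
    }

    double my_div(double a, double b)     /* a/b = a * rsqrt(b)^2, b > 0 */
    {
        double r = my_rsqrt(b);
        return a * r * r;
    }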

    In practice:: RSQRT() is no harder to compute {both HW and SW},
    yet:: RSQRT() is more useful::

    SQRT(x) = RSQRT(x)*x is 1 pipelined FMUL
    RSQRT(x) = 1/SQRT(x) is 1 non-pipelined FDIV

    Useful in vector normalization::

    some-vector-calculation
    -----------------------
    SQRT( SUM(x**2,1,n) )

    and a host of others.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 16 22:19:21 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 16/10/2025 07:44, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up the instruction opcode space with a combinatorial explosion. (Or sequence of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers.

    Among SIMD's ISA problems is additional state at context switch time
    on top of FP's added state at context switch time; and with all the
    fast memory-move subroutines being SIMD-based, the service routines
    need access to SIMD that they don't normally need for FP {and the
    SIMD register file is larger, too}.

    With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the actual execution units are.

    Vector LD and ST instructions are not conceptually different than
    LDM and STM--1 instruction accesses multiple memory locations.

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve
    many memory aliasing issues to use the vector ISA.
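
    A trivial C example of the aliasing problem (an illustrative example):
    with no further information the compiler must assume dst and src may
    overlap, so it cannot blindly issue wide vector loads and stores for
    the loop; 'restrict' is the source-level promise that lets it proceed,
    whereas run-time overlap checking in hardware removes the need for the
    programmer to make that promise.

    void scale(float *dst, const float *src, float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];      /* dst[i] might alias src[i+1], ... */
    }

    void scale_restrict(float * restrict dst, const float * restrict src,
                        float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];      /* no overlap: free to vectorize    */
    }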

    Software writes vector loops--yet the HW vectorizes instructions.

    {{I might note My 66000 vectorizes loops not instructions to avoid
    this problem; For example::

    for( i = 0; i < max; i++ )
    {
        temp = a[i];
        a[i] = a[max-i];
        a[max-i] = temp;
    }

    is vectorizable in My 66000--those loops where the memory references
    do not overlap can run "as fast as the width of the data path allows",
    while those with memory reference collisions run no worse than scalar
    code. For a large value of max the profile would look like::

    FFFFFFFFFFFFFFFFFsssFFFFFFFFFFFFFFFFF

    F representing fast (say 4-wide or 8-wide)
    s representing slow (say 1-wide)

    The same binary runs as fast as memory references (and data-flow
    dependencies and data-path width) allow.
    }}

    I have no experience with this (or much experience with SIMD), but that
    seems like a big win to my mind. It is akin to letting the processor
    hardware handle multiple instructions in parallel in superscalar CPUs,
    rather than Itanium EPIC coding.


    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.

    CRAY-like vector computers built memory systems that could handle the
    load of the vector calculations. The CRAY-1 could perform a new memory
    access every clock; the CRAY-[XY]MP could handle 2 LDs and 1 ST per
    clock continuously.

    If the CPUs of today were really going to fully utilize the vector
    data path, they would need a much better memory system than they are
    building presently (1 new cache miss per cycle).

    The power of the vector computers was almost entirely in the memory
    system, not in the data path (which is surprisingly easy to build, and
    surprisingly difficult to keep fed).

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.

    On vacation over the summer, I coined a new phrase to denote what I
    hope My 66000 will end up being::

    CARD: Computer Architecture Rightly Done.

    Note: It does not stop at ISA--as ISA is less than 1/3rd of what a
    computer architecture is and means.



    --- Synchronet 3.21a-Linux NewsLink 1.2