• On Cray arithmetic

    From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Oct 11 10:32:22 2025
    From Newsgroup: comp.arch

    Just found a gem on Cray arithmetic, which (rightly) incurred
    The Wrath of Kahan:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    "Pessimism comes less from the error-analyst's dour personality
    than from his mental model of computer arithmetic."

    I also had to look up "equipollent".

    I assume many people in this group know this, but for those who
    don't, it is well worth reading.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Oct 11 19:36:44 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Just found a gem on Cray arithmetic, which (rightly) incurred
    The Wrath of Kahan:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    I hope BGB reads this and takes it to heart.

    "Pessimism comes less from the error-analyst's dour personality
    than from his mental model of computer arithmetic."

    I also had to look up "equipollent".

    I assume many people in this group know this, but for those who
    don't, it is well worth reading.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 12 00:28:16 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 -0000 (UTC), Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan ...

    No harm in reminding everyone of his legendary foreword to the
    Standard Apple Numerics manual, 2nd ed, of 1988. He had something
    suitably acerbic to say about a great number of different vendors'
    idea of floating-point arithmetic (including Cray).

    I posted one instance here
    <http://groups.google.com/group/comp.lang.python/msg/5aaf5dd86cb00651?hl=en>.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 12 01:15:23 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 -0000 (UTC), Thomas Koenig wrote:

    https://people.eecs.berkeley.edu/~wkahan/CS279/CrayUG.pdf

    Anybody curious about what's on pages 62-5 of the Apple Numerics Manual
    2nd ed can find a copy here <https://vintageapple.org/inside_o/>.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Oct 12 04:04:46 2025
    From Newsgroup: comp.arch

    On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan:

    While the arithmetic on the Cray I was bad enough, this document seems
    to focus on some later models in the Cray line, which, like the IBM
    System/360 when it first came out, before an urgent retrofit, lacked a
    guard digit!
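
    (For readers who have not met the term: a guard digit is an extra
    digit carried during operand alignment for subtraction. A toy decimal
    illustration, not from the post: in 3-digit arithmetic, 1.00 - 0.999
    requires shifting 0.999 to align exponents; with no guard digit it is
    truncated to 0.99 and the result is 0.01, ten times the true
    difference of 0.001, while keeping a single guard digit preserves the
    trailing 9 and yields the exact answer.)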

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Oct 12 06:06:35 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 04:04:46 -0000 (UTC), John Savard wrote:

    On Sat, 11 Oct 2025 10:32:22 +0000, Thomas Koenig wrote:

    Just found a gem on Cray arithmetic, which (rightly) incurred The Wrath
    of Kahan:

    While the arithmetic on the Cray I was bad enough, this document seems
    to focus on some later models in the Cray line, which, like the IBM
    System/360 when it first came out, before an urgent retrofit, lacked a
    guard digit!

    The concluding part of that article had a postscript which said that,
    while Cray accepted the importance of fixing the deficiencies in future
    models, there would be no retrofit to existing ones.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Mon Oct 13 07:23:21 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 06:06:35 +0000, Lawrence D'Oliveiro wrote:

    The concluding part of that article had a postscript which said that,
    while Cray accepted the importance of fixing the deficiencies in future
    models, there would be no retrofit to existing ones.

    That is a pity.

    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently quoted here.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Oct 13 07:39:11 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I has not just branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Mon Oct 13 09:05:18 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I think
    -- to be persuaded that IEEE754 really was worth implementing in its
    entirety, that the "too hard" or "too obscure" parts were there for an
    important reason, to make programming that much easier, and should not be
    skipped.

    You'll notice that Kahan mentioned Apple more than once, as seemingly his
    favourite example of a company that took IEEE754 to heart and implemented
    it completely in software, where their hardware vendor of choice at the
    time (Motorola), skimped a bit on hardware support.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Oct 13 13:12:12 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing
    in its entirety, that the "too hard" or "too obscure" parts were
    there for an important reason,
    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?
    to make programming that much easier,
    and should not be skipped.
    For many non-obvious parts of 754 it's true. For many other parts, esp.
    related to exceptions, it's false.
    That is, they should not be skipped, but the only reason for that is
    ease of documentation (just write "754" and you are done) and access to
    test vectors. These parts are not well thought out, do not make
    application programming any easier, and do not fit well into
    programming languages.

    You'll notice that Kahan mentioned Apple more than once, as seemingly
    his favourite example of a company that took IEEE754 to heart and
    implemented it completely in software, where their hardware vendor of
    choice at the time (Motorola), skimped a bit on hardware support.
    According to my understanding, Motorola suffered from being an early
    adopter, similarly to Intel. They implemented 754 before the standard
    was finished and later on were in the difficult position of a conflict
    between compatibility with the standard and compatibility with previous
    generations.
    Moto is less forgivable than Intel, because although they were also
    early adopters, they were not nearly as early.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 13 12:30:33 2025
    From Newsgroup: comp.arch

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I has not just branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the
    rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    Though, from what I have read, a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.



    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall-like handler. Though, in
    this case it would likely overlap with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt
    handler for too long because this blocks any other interrupts, so for
    longer-running operations it is better to switch to a handler that can
    deal with interrupts (and, ATM, FDIV.Q and FSQRT.Q are kinda horridly
    slow; so, less like a TLB miss, and more like a page-fault...).

    The TestKern-related code is getting a little behind in my GitHub repo;
    the idea is that these parts will be posted when they are done.


    I had found/fixed one RVC bug since the last upload of the CPU core to
    GitHub, but more bugs remain and are still being hunted down.


    Progress is slow...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Oct 13 17:33:32 2025
    From Newsgroup: comp.arch


    Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in its
    entirety, that the "too hard" or "too obscure" parts were there for an
    important reason, to make programming that much easier, and should not be
    skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
    did make it easier--but NaNs, infinities, Underflow at the Denorm level
    went in the other direction.

    You'll notice that Kahan mentioned Apple more than once, as seemingly his
    favourite example of a company that took IEEE754 to heart and implemented
    it completely in software, where their hardware vendor of choice at the
    time (Motorola), skimped a bit on hardware support.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Oct 13 21:08:56 2025
    From Newsgroup: comp.arch

    On 13/10/2025 19:33, MitchAlsup wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Mon, 13 Oct 2025 07:39:11 GMT, Anton Ertl wrote:

    Concerning implementing only a part of FP in hardware, and throwing
    the rest over the wall to software, Alpha ist probably the
    best-known example (denormal support only in software) ...

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in its
    entirety, that the "too hard" or "too obscure" parts were there for an
    important reason, to make programming that much easier, and should not be
    skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    It does not make the programs more reliable - it makes them more
    consistent, predictable and portable. It does not make things easier
    for most code (support for NaNs and infinities can make some code
    easier, if mathematically nonsensical results are a real possibility). But
    since consistency, predictability and portability are often very useful
    characteristics, full IEEE 754 compliance is a good thing for
    general-purpose processors.

    However, there are plenty of more niche situations where these are not
    vital, and where cost (die space, design costs, run-time power, etc.) is
    more important. Thus on small microcontrollers, it can be a better
    choice to skip support for the "obscure" stuff, and maybe even cut
    corners on things like rounding behaviour. The same applies for
    software floating point routines for devices that don't have hardware
    floating point at all.




    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex
    did make it easier--but NaNs, infinities, Underflow at the Denorm level
    went in the other direction.

    You'll notice that Kahan mentioned Apple more than once, as seemingly his
    favourite example of a company that took IEEE754 to heart and implemented
    it completely in software, where their hardware vendor of choice at the
    time (Motorola), skimped a bit on hardware support.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Oct 13 21:53:33 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I has not just branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants a high-quality implementation.

    Though, from what I have read, a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.

    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    While I can agree with the sentiment, the emulation overhead makes this
    very hard to achieve indeed.

    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall-like handler. Though, in
    this case it would likely overlap with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt
    handler for too long because this blocks any other interrupts,

    At the time of control arrival, interrupts are already reentrant in
    My 66000. A higher priority interrupt will take control from the
    lower priority interrupt.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 14 02:27:46 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at
    once and going straight to zero. It's about the principle of least
    surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 14 02:36:50 2025
    From Newsgroup: comp.arch

    On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    The hardware designers took many years -- right through the 1990s,
    I think -- to be persuaded that IEEE754 really was worth
    implementing in its entirety, that the "too hard" or "too obscure"
    parts were there for an important reason,

    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    I thought they all did, just about.

    You'll notice that Kahan mentioned Apple more than once, as
    seemingly his favourite example of a company that took IEEE754 to
    heart and implemented it completely in software, where their
    hardware vendor of choice at the time (Motorola), skimped a bit on
    hardware support.

    According to my understanding, Motorola suffered from being an early
    adopter, similarly to Intel. They implemented 754 before the
    standard was finished and later on were in the difficult position of
    a conflict between compatibility with the standard and compatibility
    with previous generations. Moto is less forgivable than Intel, because
    although they were also early adopters, they were not nearly as early.

    Let's see, the Motorola 68881 came out in 1984
    <https://en.wikipedia.org/wiki/Motorola_68881>, while the first
    release of IEEE754 dates from two years before
    <https://en.wikipedia.org/wiki/IEEE_754>.

    I would say Motorola had plenty of time to read the spec and get it
    right. But they didn't. So Apple had to patch things up in its
    software implementation, introducing a mode where for example those
    last few inaccurate bits in transcendentals were fixed up in software,
    sacrificing some speed over the raw hardware to ensure consistent
    results with the (even slower) pure-software implementation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 13 22:38:18 2025
    From Newsgroup: comp.arch

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha is
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I has not just branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the
    rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and
    truncate rounding, with the emulators then running instead on hardware
    with denormals and RNE.

    But, the result was that the games would work correctly on the original
    hardware, but in the emulators things would drift; things like moving
    platforms gradually creeping away from the origin, etc.
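
    As a rough illustration of the rounding-mode half of that (a sketch,
    not N64 or emulator code; it assumes a C99 compiler that honours
    fesetround, e.g. no -ffast-math and -frounding-math where needed), the
    same accumulation loop drifts apart under truncating rounding versus
    round-to-nearest-even:

    #include <stdio.h>
    #include <fenv.h>

    #pragma STDC FENV_ACCESS ON   /* tell the compiler we change FP modes */

    static float accumulate(int mode)
    {
        /* one million additions of an inexact constant under the given
           dynamic rounding mode */
        fesetround(mode);
        float pos = 0.0f;
        for (int i = 0; i < 1000000; i++)
            pos += 0.1f;              /* 0.1 is not exact in Binary32 */
        fesetround(FE_TONEAREST);
        return pos;
    }

    int main(void)
    {
        float rne = accumulate(FE_TONEAREST);   /* round-to-nearest-even */
        float rtz = accumulate(FE_TOWARDZERO);  /* truncating arithmetic */
        printf("RNE: %.3f  RTZ: %.3f  drift: %.3f\n", rne, rtz, rne - rtz);
        return 0;
    }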





    Though, from what I have read, a predecessor chip (the R4000) had a
    more functionally complete FPU. So, I guess it is also possible that the
    R4300 had a more limited FPU to make it cheaper for the embedded market.


    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS-era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    Or, is the argument here that sticking with a weaker, not-quite-IEEE FPU
    is preferable to using trap handlers?



    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is that it has a smaller code footprint
    than using runtime calls.

    Also on RISC-V, it is more expensive to implement 128-bit arithmetic, so
    the actual cost might be lower.

    The main deviation from the Q extension is that it will use register
    pairs rather than 128-bit registers. I suspect that 128-bit registers
    would likely cause more problems for software built to assume RV64G
    than the problems resulting from breaking the spec and using pairs.

    Or, if the proper Q extension were supported, it would make more sense in
    the context of RV128, so that XLEN==FLEN. Otherwise, Q on RV64 would break
    the ability to move values between FPRs and GPRs (in the RV spec, they
    note the assumption that in this configuration, moves between FPRs
    and GPRs would be done via memory loads and stores). This would suck,
    and actively make the FPU worse than sticking primarily with the D
    extension and doing something nonstandard.


    As I see it though, if the overall cost of the traps remains below 1%,
    it is mostly OK.

    While I can agree with the sentiment, the emulation overhead makes this
    very hard to achieve indeed.


    Will have to test this more to find out.

    But, at least in the case of Binary128, the operations themselves are
    likely to be slow enough to partly offset the trap-handling and
    instruction decoding overheads.



    Though, ATM the FDIV and FSQRT traps for Binary128 are almost slow
    enough to justify turning them into a syscall-like handler. Though, in
    this case it would likely overlap with the Page-Fault handler (fallback
    path for the TLB Miss handler, which is also being used here for FPU
    emulation).

    Partial issue is mostly that one doesn't want to remain in an interrupt
    handler for too long because this blocks any other interrupts,

    At the time of control arrival, interrupts are already reentrant in
    My 66000. A higher priority interrupt will take control from the
    lower priority interrupt.

    Yeah, no re-entrant interrupts here.

    For a longer-running operation, it is mostly necessary to handle things
    with a context switch into supervisor mode. Can't use the normal SYSCALL
    handler though, as it itself may have been the source of the trap. So,
    Page-Fault needs its own handler task.


    It is likely that re-entrant interrupts would require a different and
    more complex mechanism.

    Well, and/or rework things at the compiler level so that the ISR proper
    is only used to implement a transition into supervisor mode (or from
    supervisor mode back to usermode), and then fake something more like the
    x86-style interrupt handling.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Oct 14 01:53:17 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 13:12:12 +0300, Michael S wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    The hardware designers took many years -- right through the 1990s,
    I think -- to be persuaded that IEEE754 really was worth
    implementing in its entirety, that the "too hard" or "too obscure"
    parts were there for an important reason,
    It took many years to figure it out for *DEC* hardware designers.
    Was there any other general-purpose RISC vendor that suffered from
    similar denseness?

    I thought they all did, just about.

    You'll notice that Kahan mentioned Apple more than once, as
    seemingly his favourite example of a company that took IEEE754 to
    heart and implemented it completely in software, where their
    hardware vendor of choice at the time (Motorola), skimped a bit on
    hardware support.
    According to my understanding, Motorola suffered from being an early
    adopter, similarly to Intel. They implemented 754 before the
    standard was finished and later on were in the difficult position of
    a conflict between compatibility with the standard and compatibility
    with previous generations. Moto is less forgivable than Intel, because
    although they were also early adopters, they were not nearly as early.

    Let's see, the Motorola 68881 came out in 1984
    <https://en.wikipedia.org/wiki/Motorola_68881>, while the first
    release of IEEE754 dates from two years before
    <https://en.wikipedia.org/wiki/IEEE_754>.

    Circa 1981 there were the Weitek chips. Wikipedia doesn't say if the
    early ones were 754 compatible, but later chips from 1986 intended
    for the 386 were compatible, and they seem to have been used by many
    (Motorola, Intel, Sun, PA-RISC, ...).

    https://en.wikipedia.org/wiki/Weitek

    Unfortunately not all the chip documents are on bitsavers

    http://www.bitsavers.org/components/weitek/dataSheets/

    but the WTL-1164_1165 PDF from 1986 says

    FULL 32-BIT AND 64-BIT FLOATING POINT
    FORMAT AND OPERATIONS, CONFORMING TO
    THE IEEE STANDARD FOR FLOATING POINT ARITHMETIC

    2.38 MFlops (420 ns) 32-bit add/subtract/convert and compare
    1.85 MFlops (540 ns) 64-bit add/subtract/convert and compare
    2.38 MFlops (420 ns) 32-bit multiply
    1.67 MFlops (600 ns) 64-bit multiply
    0.52 MFlops (1.92 µs) 32-bit divide
    0.26 MFlops (3.78 µs) 64-bit divide
    Up to 3.33 MFlops (300 ns) for pipelined operations
    Up to 3.33 MFlops (300 ns) for chained operations
    32-bit data input or 32-bit data output operation every 60 ns


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 14 08:30:44 2025
    From Newsgroup: comp.arch

    On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at
    once and going straight to zero. It's about the principle of least
    surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.
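
    A minimal sketch of that style (the names here are made up for
    illustration, not from the post): mark the bad input with a NaN, let it
    flow through the arithmetic, and test once at the end:

    #include <math.h>
    #include <stdio.h>

    /* any NaN input propagates through to the output */
    static double blend(double a, double b, double t)
    {
        return a + t * (b - a);
    }

    int main(void)
    {
        double missing = nan("");    /* marker for bad/missing data */
        double r = blend(blend(1.0, missing, 0.5), 4.0, 0.25);
        if (isnan(r))
            printf("some input upstream was bad\n");
        else
            printf("result = %f\n", r);
        return 0;
    }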

    But I find it harder to understand why denormals or subnormals are going
    to be useful. Ultimately, your floating point code is approximating arithmetic on real numbers. Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting
    results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Tue Oct 14 06:56:46 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 01:53:17 -0400, EricP wrote:

    Circa 1981 there was the Weitek chips. Wikipedia doesn't say if the
    early ones were 754 compatible, but later chips from 1986 intended for
    the 386 were compatible, and they seem to have been used by many
    (Motorola, Intel, Sun, PA-RISC, ...)

    Weitek add-on cards, I think mainly the early ones, were popular with more
    hard-core power users of Lotus 1-2-3. Remember, that was the "killer app"
    that prompted a lot of people to buy the IBM PC (and compatibles) in the
    first place. Some of them must have been doing some serious number-
    crunching, such that floating-point speed became a real issue.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 14 07:51:09 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.
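
    A small sketch of the pitfall (not from the post):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = nan(""), b = 1.0;
        printf("a < b    : %d\n", a < b);      /* 0: any comparison with NaN is false */
        printf("!(a >= b): %d\n", !(a >= b));  /* 1: the "equivalent" rewrite disagrees */
        return 0;
    }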

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.
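
    A sketch of that classical example (assuming ordinary IEEE doubles):

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = DBL_MIN;                   /* smallest normal double */
        double b = nextafter(DBL_MIN, 1.0);   /* the next double up */
        double d = a - b;                     /* -0x1p-1074, a subnormal */
        printf("a < b     : %d\n", a < b);    /* 1 */
        printf("a - b < 0 : %d\n", d < 0.0);  /* 1 here; 0 on a flush-to-zero FPU,
                                                 where the difference is flushed */
        return 0;
    }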

    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have, all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant. The last increases the
    resource usage much more than proper support for denormals.

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware. So that's my decision
    criterion: If the software cost is higher than the hardware cost,
    the software crisis is relevant; and in the present context, it
    means that expending hardware to reduce the cost of software is
    justified. Denormal numbers are such a feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 14 10:47:56 2025
    From Newsgroup: comp.arch

    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer. But as long as you are aware of the
    possibility and consequences of NaNs, they can be useful.


    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?


    I'm sure there are a number of interesting ways to model this kind of
    thing, in a programming language that supported it. NaNs in floating
    point are somewhat akin to error values in C++ std::expected<>, or empty
    std::optional<> types, or like "result" types found in many languages.

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result. And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold? I do not know the details here - it is simply
    not something that turns up in the kind of coding I do. (In my line of
    work, floating point values and expression results are always "normal",
    if that is the correct term. I can always use gcc's "-ffast-math", and
    I think a lot of real-world floating point code could do so - but I
    fully appreciate that does not apply to all code.)

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?


    perhaps
    calculations should be re-arranged, algorithms changed, or you should be
    using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have, all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant. The last increases the
    resource usage much more than proper support for denormals.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount. Doing it right
    is going to cost you, in development time or runtime efficiency, but
    that's better than getting the wrong answers quickly!



    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware. So that's my decision
    criterion: If the software cost is higher than the hardware cost,
    the software crisis is relevant; and in the present context, it
    means that expending hardware to reduce the cost of software is
    justified. Denormal numbers are such a feature.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 14 11:26:10 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer.

    Not really:

    * Null pointers don't materialize spontaneously as results of
    arithmetic operations. They are stored explicitly by the
    programmer, making the programmer much more aware of their
    existence.

    * Programmers are trained to check for null pointers. And if they
    forget such a check, the result usually is that the program traps,
    usually soon after the place where the check should have been. With
    a NaN you just silently execute the wrong branch of an IF, and later
    you wonder what happened.

    * The most common use for null pointers is terminating a linked list
    or other recursive data structure. Programmers are trained to deal
    with the terminating case in their code.

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result.

    When a denormal is generated, an underflow "exception" happens (IEEE
    "exceptions" are not traps). You can set your FPU to trap on a
    certain kind of exception. Maybe you can also set it up such that it
    produces a NaN instead. I doubt that many people would find that
    useful, however.
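
    A small sketch of the flag (as opposed to trap) behaviour, using the
    C99 <fenv.h> interface and assuming the compiler respects FENV_ACCESS
    (no -ffast-math):

    #include <fenv.h>
    #include <float.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    int main(void)
    {
        feclearexcept(FE_ALL_EXCEPT);
        volatile double tiny = DBL_MIN;    /* smallest normal double */
        volatile double sub = tiny / 3.0;  /* tiny and inexact result: subnormal */
        if (fetestexcept(FE_UNDERFLOW))
            printf("underflow flag raised, result = %g\n", sub);
        return 0;
    }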


    And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    Denormals don't help much here. IEEE doubles cannot represent 2^1024,
    but denormals allow representing positive numbers down to 2^-1074.
    So, with denormal numbers, the absolute value of your divisor must be
    less than 2^-50 to produce a non-infinite result where flush-to-zero
    would have produced an infinity.

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    Not mine. An assumption that I like is that the associative law
    holds. It holds with -fwrapv, but not with overflow-is-undefined.

    I fail to see how declaring any condition undefined behaviour would
    increase any guarantees.

    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold?

    Yes.

    If any of them is a NaN, the result is false for either comparison
    (because a-b would be NaN, and because the result of any comparison
    with a NaN is false).

    For infinity there are a number of cases

    1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
    2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
    3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
    4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
    5) inf<inf (false) vs. inf-inf=NaN<0 (false)
    6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
    7) inf<-inf (false) vs. inf--inf=inf<0 (false)
    8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)

    The most interesting case here is 5), because it means that a<=b is
    not equivalent to a-b<=0, even with denormal numbers.
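
    A short sketch that walks the table above (and shows case 5 breaking
    the <= variant):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double vals[] = { -INFINITY, -1.0, 1.0, INFINITY, NAN };
        int n = sizeof vals / sizeof vals[0];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double a = vals[i], b = vals[j];
                if ((a < b) != (a - b < 0.0))
                    printf("'<'  mismatch: a=%g b=%g\n", a, b);  /* never triggers here */
                if ((a <= b) != (a - b <= 0.0))
                    printf("'<=' mismatch: a=%g b=%g\n", a, b);  /* inf,inf and -inf,-inf */
            }
        return 0;
    }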

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?

    I was thinking about what programmers might use when writing their
    code. For compilers, having that equivalence may occasionally be
    helpful for producing better code, but if it does not hold, the
    compiler will just not use such an equivalence (once the compiler is
    debugged).

    This is an example from Kahan that stuck in my mind, because it
    appeals to me as a programmer. He has also given other examples that
    don't do that for me, but may appeal to a mathematician, physicist or
    chemist.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount.

    I think the nicer properties (such as the equivalence mentioned above)
    are the more important benefit. And if you take a different branch of
    an IF-statement because you have a flush-to-zero FPU, you can easily
    get a completely bogus result when the denormal case would still have
    had enough accuracy by far.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Oct 14 15:37:10 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in
    its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went
    in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological
    results right through to the end of the calculation, in a mathematically
    consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about
    making things easier. Providing graceful underflow means a gradual loss
    of precision as you get too close to zero, instead of losing all the
    bits at once and going straight to zero. It's about the principle of
    least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    But I find it harder to understand why denormals or subnormals are going
    to be useful. Ultimately, your floating point code is approximating
    arithmetic on real numbers. Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting
    results that have such a dynamic range that you are using denormals? And
    what are you doing where it is acceptable to lose some precision with
    those numbers, but not to give up and say things have gone badly wrong
    (a NaN or infinity, or underflow signal)? I have a lot of difficulty
    imagining a situation where denormals would be helpful and you haven't
    got a major design issue with your code - perhaps calculations should be
    re-arranged, algorithms changed, or you should be using an arithmetic
    format with greater range (switch from single to double, double to quad,
    or use something more advanced).
    Subnormals are critical for the stability of zero-seeking algorithms,
    i.e. a lot of standard algorithmic building blocks.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Oct 14 15:42:45 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    You have just named the only common pitfall, where all comparisons
    against NaN shall return false.

    You can in fact define your own

    bool IsNan(f64 x)
    {
        return ((x < 0.0) | (x >= 0.0)) == false;  /* any NaN compares false both ways */
    }

    but this depends on the compiler/optimizer not messing up.
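
    As a quick check (a sketch, assuming f64 is simply a typedef for
    double): the hand-rolled test agrees with the standard isnan() from
    <math.h>, again provided the compiler is not given -ffast-math, which
    lets it assume NaNs never occur:

    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>

    typedef double f64;

    static bool IsNan(f64 x)
    {
        return ((x < 0.0) | (x >= 0.0)) == false;
    }

    int main(void)
    {
        f64 tests[] = { 0.0, -1.5, INFINITY, NAN };
        for (int i = 0; i < 4; i++)
            printf("%g: IsNan=%d isnan=%d\n",
                   tests[i], (int)IsNan(tests[i]), !!isnan(tests[i]));
        return 0;
    }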

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Oct 14 17:29:40 2025
    From Newsgroup: comp.arch

    On 14/10/2025 13:26, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    On 14/10/2025 09:51, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what
    programmers tend to expect. So NaNs have their pitfalls.

    I entirely agree. If you have a type that has some kind of non-value,
    and it might contain that representation, you have to take that into
    account in your code. It's much the same thing as having a pointer that
    could be a null pointer.

    Not really:

    * Null pointers don't materialize spontaneously as results of
    arithmetic operations. They are stored explicitly by the
    programmer, making the programmer much more aware of their
    existence.


    NaNs don't materialise spontaneously either. They can be the result of intentionally using NaNs for missing data, or when your code is buggy
    and failing to calculate something reasonable. In either case, the
    surprise happens when someone passes the non-value to code that was not expecting to have to deal with it.

    * Programmers are trained to check for null pointers. And if they
    forget such a check, the result usually is that the program traps,
    usually soon after the place where the check should have been. With
    a NaN you just silently execute the wrong branch of an IF, and later
    you wonder what happened.

    Fair enough.


    * The most common use for null pointers is terminating a linked list
    or other recursive data structure. Programmers are trained to deal
    with the terminating case in their code.

    I would disagree that this is the most common use for null pointers.
    But it certainly is /one/ use, and programmers should handle that usage correctly.

    So to sum up, there is a certain similarity, but there are also
    significant differences.


    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.


    Sure. My thoughts with NaN are that it might be appropriate for a
    floating point model (not IEEE) to return a NaN in circumstances where
    IEEE says the result is a denormal - I think that might have been a more
    useful result.

    When a denormal is generated, an underflow "exception" happens (IEEE "exceptions" are not traps). You can set your FPU to trap on a
    certain kind of exception. Maybe you can also set it up such that it produces a NaN instead. I doubt that many people would find that
    useful, however.


    And my mention of infinity is because often when people
    have a very small value but are very keen on it not being zero, it is
    because they intend to divide by it and want to avoid division by zero
    (and thus infinity).

    Denormals don't help much here. IEEE doubles cannot represent 2^1024,
    but denormals allow representing positive numbers down to 2^-1074.
    So, with denormal numbers, the absolute value of your divisor must be
    less than 2^-50 to produce a non-infinite result where flush-to-zero
    would have produced an infinity.
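
    As a concrete illustration of that difference (a hedged C sketch;
    DBL_MIN/2 is a subnormal divisor for which flush-to-zero would give an
    infinity while gradual underflow keeps the quotient finite):

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        double tiny = DBL_MIN / 2.0;           /* 2^-1023, a subnormal */

        printf("tiny     = %g\n", tiny);       /* ~1.11e-308 */
        printf("1.0/tiny = %g\n", 1.0 / tiny); /* 2^1023 ~ 8.99e307, finite */
        return 0;
    }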


    OK.

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    OK. (I like that aspect of signed integer overflow being UB - more of
    your usual assumptions hold.)

    Not mine. An assumption that I like is that the associative law
    holds. It holds with -fwrapv, but not with overflow-is-undefined.

    I fail to see how declaring any condition undefined behaviour would
    increase any guarantees.

    The associative law holds fine with UB on overflow, as do things like
    "adding a positive number to an integer makes it bigger". But this is all straying from the discussion on floating point, and I suspect that we'd
    just re-hash old disagreements rather than starting new and interesting
    ones :-)


    However, if "a" or "b" could be a NaN or an infinity, does that
    equivalence still hold?

    Yes.

    If any of them is a NaN, the result is false for either comparison
    (because a-b would be NaN, and because the result of any comparison
    with a NaN is false).

    For infinity there are a number of cases

    1) inf<noninf (false) vs. inf-noninf=inf<0 (false)
    2) -inf<noninf (true) vs. -inf-noninf=-inf<0 (true)
    3) noninf<inf (true) vs. noninf-inf=-inf<0 (true)
    4) noninf<-inf (false) vs. noninf--inf=inf<0 (false)
    5) inf<inf (false) vs. inf-inf=NaN<0 (false)
    6) -inf<-inf (false) vs. -inf--inf=NaN<0 (false)
    7) inf<-inf (false) vs. inf--inf=inf<0 (false)
    8) -inf<inf (true) vs. -inf-inf=-inf<0 (true)

    The most interesting case here is 5), because it means that a<=b is
    not equivalent to a-b<=0, even with denormal numbers.
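
    The eight cases are easy to machine-check; an illustrative C sketch
    (the helper name and the finite value 1.0 are arbitrary choices):

    #include <math.h>
    #include <stdio.h>

    static void check(double a, double b)
    {
        /* compare a<b with a-b<0, and a<=b with a-b<=0 */
        printf("a=%6g b=%6g : a<b=%d a-b<0=%d | a<=b=%d a-b<=0=%d\n",
               a, b, a < b, a - b < 0.0, a <= b, a - b <= 0.0);
    }

    int main(void)
    {
        double inf = INFINITY;

        check( inf,  1.0);   /* case 1 */
        check(-inf,  1.0);   /* case 2 */
        check( 1.0,  inf);   /* case 3 */
        check( 1.0, -inf);   /* case 4 */
        check( inf,  inf);   /* case 5: a<=b is 1, but a-b<=0 is 0 (NaN) */
        check(-inf, -inf);   /* case 6: likewise for <= */
        check( inf, -inf);   /* case 7 */
        check(-inf,  inf);   /* case 8 */
        return 0;
    }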


    Any kind of arithmetic with infinities is going to be awkward in some way!

    Are you thinking of this equivalence as something the compiler would do
    in optimisation, or something programmers would use when writing their code?

    I was thinking about what programmers might use when writing their
    code. For compilers, having that equivalence may occasionally be
    helpful for producing better code, but if it does not hold, the
    compiler will just not use such an equivalence (once the compiler is debugged).


    Sure.

    This is an example from Kahan that stuck in my mind, because it
    appeals to me as a programmer. He has also given other examples that
    don't do that for me, but may appeal to a mathematician, physicist or chemist.


    Fair enough.

    I fully agree on both these points. However, I can't help feeling that
    if you are seeing denormals, you are unlikely to be getting results from
    your code that are as accurate as you had expected - your calculations
    are numerically unstable. Denormals might give you slightly more leeway
    before everything falls apart, but only a tiny amount.

    I think the nicer properties (such as the equivalence mentioned above)
    is the more important benefit. And if you take a different branch of
    an IF-statement if you have a flush-to-zero FPU, you can easily get a completely bogus result when the denormal case would still have had
    enough accuracy by far.


    Well, I think that if your values are getting that small enough to make denormal results, your code is at least questionable. I am not
    convinced that the equivalency you mentioned above is enough to make
    denormals worth the effort, but that may be just the kind of code I
    write. (And while I did study some of this stuff - numerical stability
    - in my mathematics degree, it was quite a long time ago.)

    Thanks for the comprehensive and educational information here. It is appreciated.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 14 15:31:00 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other
    processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha ist
    probably the best-known example (denormal support only in software),
    and Linus Torvalds worked on it personally. Concerning exposing the
    pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so
    a lot of the N64 games suffer drift and other issues over time (as the
    programmers had compensated for the MIPS issues in code rather than via
    traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and truncate rounding. Then, with emulators running instead on hardware with denormals and RNE.

    In the above sentence I was talking about your FPU not getting
    an infinitely correct result and then rounding to container size.
    Not about the "other" anomalies, many of which can be dealt
    with in SW.

    But, the result was that the games would work correctly on the original hardware, but in the emulators things would drift; things like
    moving platforms gradually creeping away from the origin, etc.





    Though, reading some stuff implies a predecessor chip (the R4000) had a more functionally complete FPU. So, I guess it is also possible that the R4300 had a more limited FPU to make it cheaper for the embedded market.

    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core; RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.

    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    But their combination of HW+SW gets the right answer.
    Your multiply does not.

    Or, is the argument here that sticking with weaker not-quite IEEE FPU is preferable to using trap handlers.

    The 5-bang instructions as used by HW+SW have to compute the result
    to infinite precision and then round to container size.

    The paper illustrates CRAY 1,... FP was fast but inaccurate enough
    to fund an army of numerical analysts to see if the program was
    delivering acceptable results.

    IEEE 754 got rid of the army of Numerical Analysts.
    But now, nobody remembers how bad it was/can be.

    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Nobody is asking for that.

    <snip>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 14 15:34:23 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I
    think -- to be persuaded that IEEE754 really was worth implementing in its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should
    not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs
    more reliable (more numerically stable) and to give the programmer a
    constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did
    make it easier--but NaNs, infinities, Underflow at the Denorm level went in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It's about the principle of least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be useful to have a representation for that. The defined "viral" nature of NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have checks and conditionals in the middle of your calculations.

    MAX( x, NaN ) is x.

    But I find it harder to understand why denormals or subnormals are going
    to be useful.

    1/Big_Num does not underflow .............. completely.

    Ultimately, your floating point code is approximating arithmetic on real numbers.

    Don't make me laugh.

    Where are you getting your real numbers,
    and what calculations are you doing on them, that mean you are getting results that have such a dynamic range that you are using denormals?
    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly
    wrong (a NaN or infinity, or underflow signal)? I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code - perhaps
    calculations should be re-arranged, algorithms changed, or you should be using an arithmetic format with greater range (switch from single to
    double, double to quad, or use something more advanced).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 14 15:47:20 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    I see the benefits of NaNs - sometimes you have bad data, and it can be useful to have a representation for that. The defined "viral" nature of NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have checks and conditionals in the middle of your calculations.

    Unfortunately, there are no NaFs (not a flag), and if there were, how
    would an IF behave? As a consequence, a<b can have a different result
    than !(a>=b) if a or b can be a NaN. That's quite contrary to what programmers tend to expect. So NaNs have their pitfalls.

    Many ISAs and many programs have trouble in getting NaNs into the
    ELSE-clause. One cannot use deMorgan's Law to invert conditions in
    the presence of NaNs.

    We (Brain, Thomas and I) went to great pain to have FCMP deliver a
    bit pattern where one could invert the condition AND still deliver
    the NaN to the expected Clause. We threw in Ordered and Totally-
    Ordered at the same time, along with OpenCL FP CLASS() function.

    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}

    And what are you doing where it is acceptable to lose some precision
    with those numbers, but not to give up and say things have gone badly wrong (a NaN or infinity, or underflow signal)?

    The usual alternative to denormals is not NaN or Infinity (of course
    not), or a trap (I assume that's what you mean with "signal"), but 0.

    The worst of all possible results is no information whatsoever.

    I have a lot of
    difficulty imagining a situation where denormals would be helpful and
    you haven't got a major design issue with your code

    The classical example is the assumption that a<b is equivalent to
    a-b<0. It holds if denormals are implemented and fails on
    flush-to-zero.

    Basically, with denormals more of the usual assumptions hold.

    perhaps
    calculations should be re-arranged, algorithms changed, or you should be using an arithmetic format with greater range (switch from single to double, double to quad, or use something more advanced).

    The first two require more knowledge about FP than many programmers
    have,

    Don't allow THOSE programmers to program FP codes !!
    Get ones that understand the nuances.

    all just to avoid some hardware cost. Not a good idea in any
    area where the software crisis* is relevant.

    Windows 7 and Office 2003 were good enough. That would have allowed
    zillions of programmers to go address the software crisis after being
    freed from projects that had become good enough not to need continual
    work.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Oct 14 16:48:50 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    David Brown <david.brown@hesbynett.no> posted:

    Ultimately, your floating point code is approximating
    arithmetic on real numbers.

    Don't make me laugh.

    Somebody (not me) recently added the following to the gcc bugzilla
    quip file:

    The "real" type in fortran is called "real" because the
    mathematician should not notice that it has finite decimal places
    and forget that one needs lengthy adaptions of the proofs for
    that....
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 14 16:46:03 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    The associative law holds fine with UB on overflow,

    With 32-bit ints:

    The result of (2000000000+2000000000)+(-2000000000) is undefined.

    The result of 2000000000+(2000000000+(-2000000000)) is 2000000000.

    So, the associative law does not hold.

    With -fwrapv both are defined to produce 2000000000, and the
    associative law holds because modulo arithmetic is associative.
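
    A minimal sketch of the same example in C (assuming 32-bit int; compile
    once with and once without -fwrapv to compare):

    #include <stdio.h>

    int main(void)
    {
        int a = 2000000000, b = 2000000000, c = -2000000000;

        /* (a+b) overflows int: UB without -fwrapv, wraps modulo 2^32 with it */
        printf("%d\n", (a + b) + c);   /* 2000000000 under -fwrapv */
        printf("%d\n", a + (b + c));   /* 2000000000 either way */
        return 0;
    }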

    Well, I think that if your values are getting that small enough to make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 14 17:26:16 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [...]
    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    That may be a good idea. You can write it in current languages as
    follows:

    if (a<b) {
      ...
    } else if (a>=b) {
      ...
    } else {
      ... NaN case ...
    }

    Would it be better to trap if a NaN is compared with an ordinary
    comparison operator, and to use special NaN-aware comparison operators
    when that is actually intended?

    You are thinking that FCMP only decodes 6 states {==, !=, <, <=, >, >=}

    I don't think anything about FCMP. What I wrote above is about
    programming languages. I.e., a<b would trap if a or b is a NaN, while lt_or_nan(a,b) would be true if a or b is a NaN, and
    lt_and_not_nan(a,b) would be false if a or b is a NaN. I think the
    IEEE754 people have better names for these comparisons, but am too
    lazy to look them up.
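
    One possible spelling of those two hypothetical predicates in terms of the
    C99 quiet comparison macros (a sketch only; the names lt_or_nan and
    lt_and_not_nan are the ones invented above, not standard ones):

    #include <math.h>
    #include <stdbool.h>

    /* The raw operator a<b raises the IEEE "invalid" exception when either
       operand is a NaN (and would trap if trapping on invalid is enabled);
       the <math.h> macros below stay quiet. */
    static bool lt_and_not_nan(double a, double b)
    {
        return isless(a, b);             /* false if a or b is a NaN */
    }

    static bool lt_or_nan(double a, double b)
    {
        return !isgreaterequal(a, b);    /* true if a<b, or if a or b is a NaN */
    }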

    The first two require more knowledge about FP than many programmers
    have,

    Don't allow THOSE programmers to program FP codes !!
    Get ones that understand the nuances.

    We can all wish for Kahan writing all FP code, but that only deepens
    the software crisis. Educating programmers is certainly a worthy
    undertaking, but providing a good foundation for them to build on
    helps those programmers as well as those that are less educated.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Oct 14 12:45:08 2025
    From Newsgroup: comp.arch

    On 10/14/2025 10:31 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 4:53 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/13/2025 2:39 AM, Anton Ertl wrote:
    John Savard <quadibloc@invalid.invalid> writes:
    After reading that article, I looked for more information on other processors with poor arithmetic, and I found that the Intel i860 also had
    a branch delay slot, as well as using traps to implement some portions of
    the IEEE 754 standard... thus, presumably, being one of the architectures
    to inspire the piece about bad architectures from Linus Torvalds recently
    quoted here.

    There never was a Linux port to the i860. There are lots of
    architectures with Linux ports that have the properties that Linus
    Torvalds mentions. Concerning implementing only a part of FP in
    hardware, and throwing the rest over the wall to software, Alpha ist >>>>> probably the best-known example (denormal support only in software), >>>>> and Linus Torvalds worked on it personally. Concerning exposing the >>>>> pipeline, MIPS-I not just has branch-delay slots, but also other
    limitations. SPARC and HPPA have branch delay slots.


    From what I can gather, the MIPS chip in the N64 also only did a
    partial implementation in hardware, with optional software traps for the rest.


    Apparently it can be a problem because modern FPUs don't exactly
    recreate N64 behavior, and a lot of the games ran without the traps, so a lot of the N64 games suffer drift and other issues over time (as the programmers had compensated for the MIPS issues in code rather than via traps).

    And this is why FP wants high quality implementation.


    From what I gather, it was a combination of Binary32 with DAZ/FTZ and
    truncate rounding. Then, with emulators running instead on hardware with
    denormals and RNE.

    In the above sentence I was talking about your FPU not getting
    an infinitely correct result and then rounding to container size.
    Not about the "other" anomalies, many of which can be dealt
    with in SW.


    This mostly applies to FMUL, but:
    I had already added a trap case for this as well.

    In the cases where all the low-order bits of either input are 0,
    then the low-order results would also be 0 and so are N/A (the final
    result would be the same either way).

    If both sets of low-order bits are non-zero, it can trap.
    This does mean that the software emulation will need to provide a full
    width result though.

    Checking for non-zero here is more cost-effective than actually doing
    a full-width multiply.
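
    A rough C model of that decision (purely illustrative, not BGB's actual
    logic; LOW_BITS is a placeholder for however many low product bits the
    narrow hardware multiplier drops):

    #include <stdbool.h>
    #include <stdint.h>

    #define LOW_BITS 16   /* placeholder width, not from the original post */

    /* Per the reasoning above: if either mantissa is all zero in its low
       LOW_BITS, the truncated hardware result already matches the full-width
       one; otherwise punt to the software trap handler. */
    static bool fmul_needs_trap(uint64_t mant_a, uint64_t mant_b)
    {
        uint64_t low_mask = (UINT64_C(1) << LOW_BITS) - 1;

        return (mant_a & low_mask) != 0 && (mant_b & low_mask) != 0;
    }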


    Also, RISC-V FMADD.D and similar are sorta also going to end up as traps
    due to the lack of single-rounded FMA (though had debated whether to
    have a separate control-flag for this to still allow non-slow FMADD.D
    and similar; but as-is, these will trap).



    For FADD:
    The shifted-right bits that fall off the bottom (of the slightly-wider internal mantissa) don't matter, since they were always being added to
    0, which can't generate any carry.

    For FSUB, it may matter, but more in the sense that one can check
    whether the "fell off the bottom" part had non-zero bits and use this to adjust the carry-in part of the subtractor (since non-zero bits would
    absorb the carry-propagation of adding 1 to the bottom of a
    theoretically arbitrarily wide twos complement negation).

    So, in theory, can be dealt with in hardware to still give an exact result.


    There are still some sub-ULP bits, so the complaints about the lack of
    a guard bit don't really apply.


    Also apparently the Cray used a non-normalized floating point format (no hidden bit), which was odd (and could create its own issues).

    Though, potentially a non-normalized format with lax normalization could
    allow for cheaper re-normalization (even if it could require
    re-normalization logic for FMUL). Though, for such a format, there is
    the possibility that someone could make re-normalization be its own instruction (allowing for an FPU with less latency).


    But, the result was that the games would work correctly on the original
    hardware, but in the emulators things would drift; things like
    moving platforms gradually creeping away from the origin, etc.





    Though, reading some stuff implies a predecessor chip (the R4000) had a more functionally complete FPU. So, I guess it is also possible that the R4300 had a more limited FPU to make it cheaper for the embedded market.

    Well, in any case, my recent efforts in these areas have been mostly:
    Trying to hunt down some remaining bugs involving RVC in the CPU core;
    RVC is seemingly "the gift that keeps on giving" in this area.
    (The more dog-chewed the encoding, the harder it is to find bugs)
    Going from just:
    "Doing weak/crappy FP in hardware"
    To:
    "Trying to do less crappy FPU via software traps".
    A "mostly traps only" implementation of Binary128.
    Doesn't exactly match the 'Q' extension, but that is OK.
    I sorta suspect not many people are going to implement Q either.
    Do it right or don't do it at all.


    ?...

    The traps route sorta worked OK in a lot of the MIPS era CPUs.
    But, it will be opt-in via an FPSCR flag.
    If the flag is not set, it will not trap.

    But their combination of HW+SW gets the right answer.
    Your multiply does not.


    As noted above, I was already working on this.


    Or, is the argument here that sticking with weaker not-quite IEEE FPU is
    preferable to using trap handlers.

    The 5-bang instructions as used by HW+SW have to compute the result
    to infinite precision and then round to container size.

    The paper illustrates CRAY 1,... FP was fast but inaccurate enough
    to fund an army of numerical analysts to see if the program was
    delivering acceptable results.

    IEEE 754 got rid of the army of Numerical Analysts.
    But now, nobody remembers how bad it was/can be.



    OK.

    As can be noted, for scalar operations I consider there to be a limit as
    to how bad is "acceptable".

    For SIMD operations, it is a little looser.
    For example, the ability to operate on integer values and get exact
    results is basically required for scalar operations, but optional for SIMD.

    Though, in this case it is a case of both Quake and also some JavaScript
    VMs relying on the ability to express integer values as floating-point
    numbers and use them in calculations as such (so, for example, if the operations don't give exact results then the programs break).


    For Binary128, real HW support is not likely to happen. The main reason
    to consider trap-only Binary128 is more because it has less code
    footprint than using runtime calls.

    Nobody is asking for that.


    OK.


    Can note that in my looking, it seems like:
    Pretty much none of the ASIC implementations support the Q extension;
    It is not required in any of the mainline profiles;
    Implementing Q proper would have non-zero impact on RV64G:
    The differences between F+D and F+D+Q being non-zero.
    Whereas, "fudging it" can retain strict compatibility with D.
    Where, people actually use 'D'.

    There is a non-zero amount of code using "long double", but in this case
    the bigger issue is more the code footprint of the associated
    long-double math functions rather than performance (say, if someone uses "cosl()" or similar).

    Still not ideal, as (with my existing ISA extensions) there is still no single-instruction way to load a 64-bit value into an FPR.

    But, could at least reduce it from 11 (44 bytes) instructions to 3 (20
    bytes; "LI-Imm33; SHORI-Imm32; FMV.D.X"). This still means 40 bytes to
    load a full-width Binary128 literal.
    Loading the same literal would need 24 bytes in XG3.
    And, an unrolled Taylor expansion uses a lot of them.

    With Q proper? Only option would be to use memory loads here.
    Like, these C math functions are annoyingly bulky in this case.


    Meanwhile, elsewhere I saw a mention that, apparently to deal with RISC-V fragmentation issues, there is now work being done on a mechanism to allow modification of the RISC-V instruction listings in GCC without needing
    to modify the code in GCC proper each time (basically hot injecting
    stuff into the instruction listing and similar).

    As apparently having everyone trying to modify the ISA every which way
    is making a bit of an awful mess of things.

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Wed Oct 15 03:45:31 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programmers to write all the programs that were needed to solve business and user needs.

    By that definition, I don't think the "crisis" exists any more. It went away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Wed Oct 15 03:47:14 2025
    From Newsgroup: comp.arch

    On Tue, 14 Oct 2025 15:47:20 GMT, MitchAlsup wrote:

    Languages have not kept up with NaNs, needing IF-THEN-ELSE-NAN
    semantics.

    All the good languages have IEEE754 compliant arithmetic libraries,
    including type queries for things like isnan().

    E.g. <https://docs.python.org/3/library/math.html>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 15 05:55:40 2025
    From Newsgroup: comp.arch

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a useful
    definition for deciding whether there is a software crisis or not,
    and it does not even mention the symptom that was mentioned first
    when I learned about the software crisis (in 1986): The cost of
    software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programs to write all >the programs that were needed to solve business and user needs.

    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don't think the "crisis" exists any more. It went away with the rise of very-high-level languages, from about the time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Oct 15 12:41:28 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 05:55:40 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a
    useful definition for deciding whether there is a software crisis
    or not, and it does not even mention the symptom that was
    mentioned first when I learned about the software crisis (in
    1986): The cost of software exceeds the cost of hardware.

    The "crisis" was supposed to do with the shortage of programs to
    write all the programs that were needed to solve business and user
    needs.

    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I donrCOt think the "crisis" exists any more. It
    went away with the rise of very-high-level languages, from about the
    time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on a browser released after 2018-01-28.
    I'm of the strong opinion that at least Spectre Variant 1 (Bounds Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than that of
    Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 15 12:36:17 2025
    From Newsgroup: comp.arch

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point, ignoring
    NaNs and infinities, you can imagine the calculation being done by first getting the mathematical real values from x and y. Then - again in the mathematical real domain - the operation is carried out. Then the
    result is truncated or rounded to fit back within the mantissa and
    exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^ -308
    to 10 ^ +308, or 716 orders of magnitude. (For comparison, the size of
    the universe measured in Planck lengths is only about 61 orders of
    magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another 16
    orders of magnitude - at the cost of rapidly decreasing precision. They
    don't stop the inevitable approximation to zero, they just delay it a
    little.
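
    Those limits are easy to inspect from <float.h>; a small illustrative
    sketch (DBL_TRUE_MIN is C11 and names the smallest subnormal):

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        printf("DBL_MAX      = %g\n", DBL_MAX);        /* ~1.8e308 */
        printf("DBL_MIN      = %g\n", DBL_MIN);        /* ~2.2e-308, smallest normal */
    #ifdef DBL_TRUE_MIN
        printf("DBL_TRUE_MIN = %g\n", DBL_TRUE_MIN);   /* ~4.9e-324, smallest subnormal */
    #endif
        printf("DBL_DIG      = %d\n", DBL_DIG);        /* 15 decimal digits */
        return 0;
    }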

    I am still at a loss to understand how this is going to be useful - when
    will that small extra margin near zero actually make a difference, in
    the real world, with real values? When you are using your
    Newton-Raphson iteration to find your function's zeros, what are the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero - especially
    when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and there
    you are looking to get as high an accuracy as you can in the
    intermediary steps so that you can continue for longer. But even there, denormals are not going to give you more than a tiny amount extra.

    (There are, of course, mathematical problems which deal with values or precisions far outside anything of relevance to the physical world, but
    if you are dealing with those kinds of tasks then IEEE floating point is
    not going to do the job anyway.)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 15 12:54:30 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 14/10/2025 04:27, Lawrence D'Oliveiro wrote:
    On Mon, 13 Oct 2025 17:33:32 GMT, MitchAlsup wrote:

    On Mon, 13 Oct 2025 09:05:18 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The hardware designers took many years -- right through the 1990s, I think -- to be persuaded that IEEE754 really was worth implementing in its entirety, that the "too hard" or "too obscure" parts were there for
    an important reason, to make programming that much easier, and should not be skipped.

    I disagree:: full compliance with IEEE 754-whenever is to make programs more reliable (more numerically stable) and to give the programmer a constant programming model (not easier).

    As a programmer, I count all that under my definition of "easier".

    You can argue that not having to do ((x-0.5)-0.5) as you did in Hex did make it easier--but NaNs, infinities, Underflow at the Denorm level went in the other direction.

    NaNs and infinities allow you to propagate certain kinds of pathological results right through to the end of the calculation, in a mathematically consistent way.

    Denormals -- aren't they called "subnormals" now? -- are also about making
    things easier. Providing graceful underflow means a gradual loss of
    precision as you get too close to zero, instead of losing all the bits at once and going straight to zero. It's about the principle of least surprise.

    Again, all that helps to make things easier for programmers --
    particularly those of us whose expertise of numerics is not on a level
    with Prof Kahan.

    I see the benefits of NaNs - sometimes you have bad data, and it can be
    useful to have a representation for that. The defined "viral" nature of
    NaNs means that you can write your code in a streamlined fashion,
    knowing that if a NaN goes in, a NaN comes out - you don't have to have
    checks and conditionals in the middle of your calculations.

    MAX( x, NaN ) is x.
    That was true under 754-2008 but we fixed it for 2019: All NaNs
    propagate through the new min/max definitions. The old still exist of
    course, but they are deprecated.
    The point that made it obvious to everyone was that under the 2008
    definition an SNaN would always propagate, but be converted to a QNaN, while a QNaN could silently disappear as shown above.
    What this meant was that for any kind of vector reduction, the final
    result could be the NaN or any of the other input values, depending upon the order of the individual comparisons!
    I was one of the proponents who pushed this change through, but I will
    say that after we showed some of the most surprising results, everyone
    agreed to fix it. Having NaN maximally sticky is also definitely in the
    spirit of the entire 754 standard:
    The only operations that do not propagate NaN are those that explicitly
    handle this case, or those that don't return a floating point value.
    Having all compares return 'false' is an example of the latter.
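
    The 2008-style maxNum rule is what C99's fmax() gives you, so the silent
    disappearance is easy to see; a tiny illustrative sketch:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 1.0, q = NAN;

        printf("fmax(x, q)      = %g\n", fmax(x, q));      /* 1: QNaN vanishes */
        printf("(x > q ? x : q) = %g\n", x > q ? x : q);   /* nan: comparison is false */
        return 0;
    }
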
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Oct 15 13:07:01 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting that small enough to make
    denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point, ignoring
    NaNs and infinities, you can imagine the calculation being done by first getting the mathematical real values from x and y. Then - again in the mathematical real domain - the operation is carried out. Then the
    result is truncated or rounded to fit back within the mantissa and
    exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^ -308
    to 10 ^ +308, or 716 orders of magnitude. (For comparison, the size of
    the universe measured in Planck lengths is only about 61 orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another 16 orders of magnitude - at the cost of rapidly decreasing precision. They don't stop the inevitable approximation to zero, they just delay it a little.

    I am still at a loss to understand how this is going to be useful - when will that small extra margin near zero actually make a difference, in
    the real world, with real values? When you are using your
    Newton-Raphson iteration to find your function's zeros, what are the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero - especially when these smaller numbers have lower precision?
    Please note that I have NOT personally observed this, but I have been
    told by people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.
    I.e. they differ by exactly one ulp.
    As I noted, I have not been bitten by this particular issue, one of the
    reasons being that I tend to not write infinite loops inside functions,
    instead I'll pre-calculate how many (typically NR) iterations should be needed.
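
    A minimal sketch of that style for Heron's sqrt iteration (the seed and
    the fixed count of six iterations are my own rough choices, not Terje's):

    #include <math.h>
    #include <stdio.h>

    static double nr_sqrt(double a)          /* a > 0 assumed */
    {
        double x = ldexp(1.0, ilogb(a) / 2); /* power-of-two seed, within 2x of the root */

        for (int i = 0; i < 6; i++)          /* fixed count; quadratic convergence */
            x = 0.5 * (x + a / x);
        return x;
    }

    int main(void)
    {
        printf("%.17g\n", nr_sqrt(2.0));     /* ~1.41421356237309... */
        printf("%.17g\n", nr_sqrt(1e300));   /* ~1e150 */
        return 0;
    }
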
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Oct 15 16:50:13 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 12:36:17 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting that small enough to
    make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 716 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just
    delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if
    you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and
    there you are looking to get as high an accuracy as you can in the intermediary steps so that you can continue for longer. But even
    there, denormals are not going to give you more than a tiny amount
    extra.

    (There are, of course, mathematical problems which deal with values
    or precisions far outside anything of relevance to the physical
    world, but if you are dealing with those kinds of tasks then IEEE
    floating point is not going to do the job anyway.)



    I don't think that I agree with Anton's point, at least as formulated.

    Yes, subnormals improve precision of Newton-Raphson and such*, but only
    when the numbers involved in calculations are below 2**-971, which does
    not happen very often. What is more important is that *when* it happens,
    naively written implementations of such algorithms still converge.
    Without subnormals (or without expert provisions) there is a big chance
    that they would not converge at all. That happens mostly because
    IEEE-754 preserves the following intuitive invariant:
    When x > y then x - y > 0
    Without subnormals, e.g. with VAX float formats that are otherwise
    pretty good, this invariant does not hold.
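
    The invariant is easy to demonstrate near the bottom of the normal range;
    an illustrative C sketch (with gradual underflow the difference below is a
    positive subnormal, where a flush-to-zero or VAX-style format delivers 0):

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        double y = DBL_MIN;          /* smallest normal, 2^-1022 */
        double x = 1.5 * DBL_MIN;    /* still a normal number    */
        double d = x - y;            /* 2^-1023, a subnormal     */

        printf("x > y     : %d\n", x > y);     /* 1 */
        printf("x - y     : %g\n", d);         /* ~1.11e-308 */
        printf("x - y > 0 : %d\n", d > 0.0);   /* 1, thanks to subnormals */
        return 0;
    }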


    * - I personally prefer to illustrate it with the chord-and-tangent
    root-finding algorithm, which can be used for any type of function as
    long as you have proved that on the section of interest there is no change of
    sign of its first and second derivatives. Maybe because I
    was taught this algorithm at the age of 15. This algo can be called
    half-Newton.








    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Oct 15 17:46:21 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 13:07:01 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting that small enough
    to make denormal results, your code is at least questionable.

    As Terje Mathiesen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the
    solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you,
    Terje, and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or 716 orders of magnitude. (For comparison,
    the size of the universe measured in Planck lengths is only about
    61 orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here -
    another 16 orders of magnitude - at the cost of rapidly decreasing precision. They don't stop the inevitable approximation to zero,
    they just delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are
    using your Newton-Raphson iteration to find your function's zeros,
    what are the circumstances in which you can get a more useful end
    result if you continue to 10 ^ -324 instead of treating 10 ^ -308
    as zero - especially when these smaller numbers have lower
    precision?

    Please note that I have NOT personally observed this, but I have been
    told by people I trust (on the 754 working group) that at least
    some zero-seeking algorithms will stabilize on an exact value, if and
    only if you have subnormals, otherwise it is possible to wobble back
    & forth between two neighboring results.

    I.e. they differ by exactly one ulp.

    As I noted, I have not been bitten by this particular issue, one of
    the reasons being that I tend to not write infinite loops inside
    functions, instead I'll pre-calculate how many (typically NR)
    iterations should be needed.

    Terje
    It does not sound right to me. In Newton-like iterations, oscillations by
    1 ULP could happen even with subnormals. They should be taken care of by properly written exit conditions.
    What could happen without subnormals are oscillations by *more* than 1
    ULP, sometimes much more.
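
    One shape such an exit condition can take (a sketch, not Michael's
    code): a tolerance of a few ulps plus a hard iteration cap, here
    wrapped around a Newton iteration for sqrt.

    #include <math.h>
    #include <float.h>

    /* Illustration only: terminate a Newton-style loop without relying on
     * the iterate ever hitting an exact fixpoint.  Assumes s > 0. */
    double newton_sqrt(double s) {
        int e;
        frexp(s, &e);                          /* s = m * 2^e, m in [0.5, 1) */
        double x = ldexp(1.0, (e + 1) / 2);    /* seed within ~2x of sqrt(s) */
        for (int i = 0; i < 60; i++) {         /* hard cap: never loop forever */
            double next = 0.5 * (x + s / x);   /* Newton step for x*x = s */
            if (fabs(next - x) <= 4.0 * DBL_EPSILON * fabs(next))
                return next;                   /* step is down to a few ulps */
            x = next;
        }
        return x;                              /* backstop; shouldn't be reached */
    }
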
    Also, in the absence of subnormals one can suffer division by zero in
    code like the loop below:
    while (fb > fa) {
        /* without subnormals, fb - fa can flush to 0 even though fb > fa */
        a -= b*fa/(fb - fa);
        ...
    }
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 15 16:53:33 2025
    From Newsgroup: comp.arch

    On 15/10/2025 13:07, Terje Mathisen wrote:
    David Brown wrote:
    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting small enough to make denormal results, your code is at least questionable.

    As Terje Mathisen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of course
    you can terminate the loop while you are still far from the solution,
    but that's not going to improve the accuracy of the results.


    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y. Then
    - again in the mathematical real domain - the operation is carried
    out. Then the result is truncated or rounded to fit back within the
    mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or about 616 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just delay
    it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if you
    continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I.e. they differ by exactly one ulp.

    I have no problems believing that this can occur on occasion. No matter
    what range you pick for your floating point formats, or what precision
    you pick, you will always be able to find examples of this kind of
    algorithm that home in on the right value with the format you have
    chosen but would fail with just one bit less. I just don't think that
    such pathological examples mean that subnormals are important.

    But if such cases occur regularly in real-world calculations, not just artificial examples, then it's a different matter.


    As I noted, I have not been bitten by this particular issue, one of the reasons being that I tend to not write infinite loops inside functions; instead I'll pre-calculate how many (typically NR) iterations should be needed.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Oct 15 17:52:48 2025
    From Newsgroup: comp.arch

    On 15/10/2025 15:50, Michael S wrote:
    On Wed, 15 Oct 2025 12:36:17 +0200
    David Brown <david.brown@hesbynett.no> wrote:

    On 14/10/2025 18:46, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:

    Well, I think that if your values are getting small enough to
    make denormal results, your code is at least questionable.

    As Terje Mathisen wrote, getting close to 0 is standard fare for
    approximation algorithms, such as Newton-Raphson iteration. Of
    course you can terminate the loop while you are still far from the
    solution, but that's not going to improve the accuracy of the
    results.

    Feel free to correct me if what I write below is wrong - you, Terje,
    and others here know a lot more about this stuff than I do.

    When you write an expression like "x + y" with floating point,
    ignoring NaNs and infinities, you can imagine the calculation being
    done by first getting the mathematical real values from x and y.
    Then - again in the mathematical real domain - the operation is
    carried out. Then the result is truncated or rounded to fit back
    within the mantissa and exponent format of the floating point type.

    Double precision IEEE format has 53 bits of mantissa and 11 bits of
    exponent. For normal floating point values, that covers from 10 ^
    -308 to 10 ^ +308, or about 616 orders of magnitude. (For comparison, the
    size of the universe measured in Planck lengths is only about 61
    orders of magnitude.)

    Denormals let you squeeze a bit more at the lower end here - another
    16 orders of magnitude - at the cost of rapidly decreasing precision.
    They don't stop the inevitable approximation to zero, they just
    delay it a little.

    I am still at a loss to understand how this is going to be useful -
    when will that small extra margin near zero actually make a
    difference, in the real world, with real values? When you are using
    your Newton-Raphson iteration to find your function's zeros, what are
    the circumstances in which you can get a more useful end result if
    you continue to 10 ^ -324 instead of treating 10 ^ -308 as zero -
    especially when these smaller numbers have lower precision?

    I realise there are plenty of numerical calculations in which errors
    "build up", such as simulating non-linear systems over time, and
    there you are looking to get as high an accuracy as you can in the
    intermediary steps so that you can continue for longer. But even
    there, denormals are not going to give you more than a tiny amount
    extra.

    (There are, of course, mathematical problems which deal with values
    or precisions far outside anything of relevance to the physical
    world, but if you are dealing with those kinds of tasks then IEEE
    floating point is not going to do the job anyway.)



    I don't think that I agree with Anton's point, at least as formulated.

    Yes, subnormals improve precision of Newton-Raphson and such*, but only
    when the numbers involved in calculations are below 2**-971, which does
    not happen very often. What is more important is that *when* it happens,
    naively written implementations of such algorithms still converge.
    Without subnormals (or without expert provisions) there is a big chance
    that they would not converge at all. That happens mostly because
    IEEE-754 preserves the following intuitive invariant:
        when x > y, then x - y > 0
    Without subnormals, e.g. with VAX float formats that are otherwise
    pretty good, this invariant does not hold.


    I can appreciate that you can have x > y, but with such small x and y
    and such close values that (x - y) is a subnormal - thus without
    subnormals, (x - y) would be 0.
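
    A concrete instance of that gap (a sketch): two normal doubles whose
    difference is representable only as a subnormal.

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        double x = 1.5 * DBL_MIN;   /* normal */
        double y = 1.0 * DBL_MIN;   /* normal */
        double d = x - y;           /* 0.5 * DBL_MIN: a subnormal */
        /* With IEEE gradual underflow this prints a nonzero subnormal.
         * On hardware that flushes subnormals to zero (or on VAX), d would
         * be 0 even though x > y - the broken invariant discussed above. */
        printf("x > y: %d,  x - y = %a\n", x > y, d);
        return 0;
    }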

    Perhaps I am being obtuse, but I don't see how you would write a Newton-Raphson algorithm that would fail to converge, or fail to stop,
    just because you don't have subnormals. Could you give a very rough
    outline of such problematic code?


    * - I personally prefer to illustrate it with the chord-and-tangent
    root-finding algorithm, which can be used for any type of function as
    long as you have proved that, on the section of interest, there is no
    change of sign of its first or second derivative. Maybe that is because
    I was taught this algorithm at the age of 15. This algorithm could be
    called half-Newton.


    I was perhaps that age when I first came across Newton-Raphson in a
    maths book, and wrote an implementation for it on a computer. That was
    in BBC Basic, and I'm pretty sure that the floating point type there was
    not IEEE compatible, and did not support such fancy stuff as subnormals!
    But I am also very sure I did not push the program to more difficult examples. (But it did show nice graphic illustrations of what it was
    doing.)

    It was also around then that I wrote a program for matrix inversion, and discovered the joys of numeric instability, and thus the need for care
    when picking the order for Gaussian elimination.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Oct 15 13:22:01 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Wed, 15 Oct 2025 05:55:40 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> writes:
    On Tue, 14 Oct 2025 07:51:09 GMT, Anton Ertl wrote:

    * The Wikipedia article on the software crisis does not give a
    useful definition for deciding whether there is a software crisis
    or not, and it does not even mention the symptom that was
    mentioned first when I learned about the software crisis (in
    1986): The cost of software exceeds the cost of hardware.
    The "crisis" was supposed to do with the shortage of programs to
    write all the programs that were needed to solve business and user
    needs.
    I never heard that one. The software project failures, deadline
    misses, and cost overruns, and their increasing number was a symptom
    that is reflected in the Wikipedia article.

    By that definition, I don't think the "crisis" exists any more. It
    went away with the rise of very-high-level languages, from about the
    time of those such as Tcl/Tk, Perl, Python, PHP and JavaScript.
    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked. There has
    been quite a bit of work on adding static typechecking to some of
    these languages in the last decade or so, and the motivation given for
    that is difficulties in large software projects using these languages.

    In any case, even with these languages there are still software
    projects that fail, miss their deadlines and have overrun their
    budget; and to come back to the criterion I mentioned, where software
    cost is higher than hardware cost.

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost? When it affects many programmers and especially if the
    difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton

    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.

    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).
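
    A rough C-level analogue of the same trick (a sketch; the Linux
    kernel's array_index_nospec() does something similar, using inline asm
    to keep the comparison itself branch-free):

    #include <stddef.h>

    /* Branchless clamp: mask is all-ones when idx < limit, all-zeros
     * otherwise, so the load address carries a data dependency on the
     * bounds check.  A compiler may still emit a branch for the compare;
     * real implementations pin this down with inline asm. */
    static inline size_t clamp_index(size_t idx, size_t limit) {
        size_t mask = (size_t)0 - (size_t)(idx < limit);
        return idx & mask;
    }

    int read_elem(const int *array, size_t idx, size_t limit) {
        if (idx >= limit)
            return -1;                          /* architectural bounds check  */
        return array[clamp_index(idx, limit)];  /* misspeculation sees index 0 */
    }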

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 21:09:27 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ----------------------------

    Anyway, the relevance for comp.arch is how to evaluate certain
    hardware features: If we have a way to make the programmers' jobs
    easier at a certain hardware cost, when is it justified to add the
    hardware cost?

    Most people would say:: "When it adds performance" AND the compiler
    can use it. Some would add: "from unmodified source code"; but I
    am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    When it affects many programmers and especially if the difficulty that would otherwise be added is outside the expertise of
    many of them. Let's look at some cases:

    "Closing the semantic gap" by providing instructions like EDIT: Even
    with assembly-language programmers, calling a subroutine is hardly
    harder. With higher-level languages, such instructions buy nothing.

    Printf-family "closes more of the gap" than EDIT ever could. And there
    is a whole suite of things better off left in subroutines than being
    raised into Instructions.

    Unfortunately, elementary FP functions are no longer in that category.
    When one can perform SIN(x) along with argument reduction and polynomial calculation in the cycle time of FDIV, SIN() deserves to be a first
    class member of the instruction set--especially if the HW cost is
    "not that much".

    On the other hand: things like polynomial evaluating instructions
    seem a bridge too far as you have to pick for all time 1 of {Horner,
    Estrin, Padé, Power Series, Clenshaw, ...} and at some point it
    becomes better to start using FFT-derived evaluation means.
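
    For reference, the first two of those schemes applied to the same cubic
    (a sketch): Horner is one serial dependency chain, Estrin trades an
    extra multiply for independently evaluable halves.

    /* Evaluate c0 + c1*x + c2*x^2 + c3*x^3 two ways (illustration only). */

    /* Horner: fewest operations, but each step depends on the previous. */
    static double horner(double x, double c0, double c1, double c2, double c3) {
        return ((c3 * x + c2) * x + c1) * x + c0;
    }

    /* Estrin: the two halves can be computed in parallel, then combined. */
    static double estrin(double x, double c0, double c1, double c2, double c3) {
        double x2 = x * x;
        return (c0 + c1 * x) + x2 * (c2 + c3 * x);
    }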

    Denormal numbers: It affects lots of code that deals with FP, and
    where many programmers are not well-educated (and even the educated
    ones have a harder time when they have to work around their absence).

    Arguably, the best thing to do here is to Trap on the creation of deNorms.
    At least then you can see them and do something about them at the algorithm level. {Gee Whiz Cap. Obvious: IEEE 754 already did this!}

    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    My 66000 is immune from Spectré; µA state is not updated until retire.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    We just don't have the smoking gun of a missing $1M-to-$1B to make it
    worth the effort to do something about it. But mark my words:: the vulnerability is being exploited ...

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 21:13:53 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    -------------------------------
    Hardware without Spectre (e.g., with invisible speculation): There are
    two takes here:

    1) If you consider Spectre to be a realistically exploitable
    vulnerability, you need to protect at least the secret keys against
    extraction with Spectre; then you either need such hardware, or you
    need to use software mitigations against all Spectre variants in all
    software that runs in processes that have secret keys in their
    address space; the latter would be a huge cost that easily
    justifies the cost of adding invisible speculation to the hardware.

    2) The other take is that Spectre is too hard to exploit to be a
    realistic threat and that we do not need to eliminate it or
    mitigate it. That's similar to the mainstream opinion on
    cache-timing attacks on AES before Dan Bernstein demonstrated that
    such attacks can be performed. Except that for Spectre we already
    have demonstrations.

    - anton

    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    My 66000 allows an application to crap all over "the stack";
    but it does provide a means whereby "crapping all over the stack"
    does not allow the application to violate the contract between caller
    and callee. Once application performs a RET (or EXIT) control is returns
    to caller 1 instruction past calling point, and with the preserved
    registers preserved !

    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 21:28:52 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me. However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.

    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest register when the check is successful. This makes the dest value register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    Because there is no branch, there is no way to speculate around the check (but load value speculation could negate this fix).

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 21:34:14 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more
    accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    As I noted, I have not been bitten by this particular issue, one of the reasons being that I tend to not write infinite loops inside functions; instead I'll pre-calculate how many (typically NR) iterations should be needed.

    Almost always the right course of events.

    The W() function may be different. W( poly×(e^poly) ) = poly.

    Terje
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Wed Oct 15 21:37:42 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:

    Most people would say:: "When it adds performance" AND the compiler can
    use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    I believe GCC can do auto-vectorization in some situations.

    But the RISC-V folks still think Cray-style long vectors are better than
    SIMD, if only because it preserves the "R" in "RISC".
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Wed Oct 15 21:42:32 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The "crisis" was supposed to do with the shortage of programs to write
    all the programs that were needed to solve business and user needs.

    By that definition, I don't think the "crisis" exists any more. It went
    away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked.

    Correct. That does seem to be a key part of what "very-high-level" means.

    There has been quite a bit of work on adding static typechecking to some
    of these languages in the last decade or so, and the motivation given
    for that is difficulties in large software projects using these
    languages.

    What we're seeing here is a downward creep, as those very-high-level
    languages (Python and JavaScript, particularly) are encroaching into the
    territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels,
    otherwise we might as well use the latter.

    In any case, even with these languages there are still software projects
    that fail, miss their deadlines and have overrun their budget ...

    I'm not aware of such; feel free to give an example of some large Python
    project, for example, which has exceeded its time and/or budget. The key
    point about using such a very-high-level language is you can do a lot in
    just a few lines of code.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 22:19:18 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 15 Oct 2025 21:09:27 GMT, MitchAlsup wrote:

    Most people would say:: "When it adds performance" AND the compiler can
    use it. Some would add: "from unmodified source code"; but I am a little wishy-washy on the last clause.

    I might note that SIMD obeys none of the 3 conditions.

    I believe GCC can do auto-vectorization in some situations.

    Yes, 28 YEARS after it was first put in !! it danged better be
    able !?! {yes argue about when}

    My point was that you don't put it in until you can see a performance
    advantage in the very next (or internal) compiler. {Where 'you' are
    the designers of that generation.}

    But the RISC-V folks still think Cray-style long vectors are better than
    SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors
    (or vice versa)--they simply represent different ways of shooting
    yourself in the foot.

    No ISA with more than 200 instructions deserves the RISC mantra.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 15 22:31:32 2025
    From Newsgroup: comp.arch


    Lawrence =?iso-8859-13?q?D=FFOliveiro?= <ldo@nz.invalid> posted:

    On Wed, 15 Oct 2025 05:55:40 GMT, Anton Ertl wrote:

    On Wed, 15 Oct 2025 03:45:31 -0000 (UTC), Lawrence D'Oliveiro wrote:

    The "crisis" was supposed to do with the shortage of programs to write
    all the programs that were needed to solve business and user needs.

    By that definition, I don't think the "crisis" exists any more. It went
    away with the rise of very-high-level languages, from about the time of
    those such as Tcl/Tk, Perl, Python, PHP and JavaScript.

    Better tools certainly help. One interesting aspect here is that all
    the languages you mention are only dynamically typechecked.

    Correct. That does seem to be a key part of what "very-high-level" means.

    There has been quite a bit of work on adding static typechecking to some
    of these languages in the last decade or so, and the motivation given
    for that is difficulties in large software projects using these
    languages.

    What we're seeing here is a downward creep, as those very-high-level
    languages (Python and JavaScript, particularly) are encroaching into the
    territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels,
    otherwise we might as well use the latter.

    There is a pernicious trap:: once an application written in a VHLL
    is acclaimed by the masses--it instantly falls into the trap where
    "users want more performance":: something the VHLL cannot provide
    until they.........

    45 years ago it was LISP, you wrote the application in LISP to figure
    out the required algorithms and once you got it working, you rewrote
    it in a high-performance language (FORTRAN or C) so it was usably fast.

    History has a way of repeating itself, when no-one remembers the past.

    In any case, even with these languages there are still software projects that fail, miss their deadlines and have overrun their budget ...

    A lot of these projects were unnecessary. Once someone figured out how to
    make the (17 kinds of) hammers one needs, there is little need to make a
    new hammer architecture.

    Windows could have stopped at W7, and many MANY people would have been happier... The mouse was more precise in W7 than in W8 ... With a little upgrade for new PCIe architecture along the way rather than redesigning
    whole kit and caboodle for tablets and phones which did not work BTW...

    Office application work COULD have STOPPED in 2003, eXcel in 1998, ...
    and few people would have cared. Many SW projects are driven not by demand
    for the product, but pushed by companies to make already satisfied users
    have to upgrade.

    Those programmers could have transitioned to new SW projects rather than
    redesigning the same old thing 8 more times. Presto, there are now enough
    well-trained SW engineers to tackle the undone SW backlog.

    I'm not aware of such; feel free to give an example of some large Python
    project, for example, which has exceeded its time and/or budget. The key
    point about using such a very-high-level language is you can do a lot in
    just a few lines of code.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Thu Oct 16 05:44:04 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence =?iso-8859-13?q?D=FFOliveiro?=@ldo@nz.invalid to comp.arch on Thu Oct 16 05:57:34 2025
    From Newsgroup: comp.arch

    On Wed, 15 Oct 2025 22:31:32 GMT, MitchAlsup wrote:

    On Wed, 15 Oct 2025 21:42:32 -0000 (UTC), Lawrence D'Oliveiro wrote:

    What we're seeing here is a downward creep, as those very-high-level
    languages (Python and JavaScript, particularly) are encroaching into
    the territory of the lower levels. Clearly they must still have some
    advantages over those languages that already inhabit the lower levels,
    otherwise we might as well use the latter.

    There is a pernicious trap:: once an application written in a VHLL is acclaimed by the masses--it instantly falls into the trap where "users
    want more performance":: something the VHLL cannot provide until they.........

    45 years ago it was LISP, you wrote the application in LISP to figure
    out the required algorithms and once you got it working, you rewrote it
    in a high-performance language (FORTRAN or C) so it was usably fast.

    No, you didn't. There is a Pareto rule in effect, in that the majority of
    the CPU time (say, 90%) is spent in a minority of the code (say, 10%). So
    having got your prototype working, and done suitable profiling to identify
    the bottlenecks, you concentrate on optimizing those bottlenecks, not on
    rewriting the whole app.

    Paul Graham (well-known LISP guru) described how the company he was with
    -- one of the early Dotcom startups -- wrote Orbitz, an airline
    reservation system, in LISP. But the most performance critical part was
    done in C++.

    Nowadays, with the popularity of Python, we already have lots of efficient
    lower-level toolkits to take care of common tasks, taking advantage of the
    versatility of the core Python language. For example, NumPy for handling
    serious number-crunching: you write a few lines of Python, to express a
    high-level operation that crunches a million sets of numbers in just a few
    seconds.

    Maybe it only took you a minute to come up with the line of code; maybe
    you will never need to run it again. Writing a program entirely in FORTRAN
    or C to perform the same operation might take an expert programmer an hour
    or two, say; in that time, the Python programmer could try out dozens of similar operations, maybe discard the results of three quarters of them,
    to narrow down the important information to be extracted from the raw
    data.

    That's the kind of productivity gain we enjoy nowadays, on a routine
    basis, without making a big deal about it in news headlines. And that's
    why we don't talk about a "software crisis" any more.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Oct 16 09:04:23 2025
    From Newsgroup: comp.arch

    On 16/10/2025 07:44, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers. With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the
    actual execution units are. I have no experience with this (or much
    experience with SIMD), but that seems like a big win to my mind. It is
    akin to letting the processor hardware handle multiple instructions in
    parallel in superscalar cpus, rather than Itanium EPIC coding.
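
    The pattern being described is the classic strip-mined loop. A C sketch,
    with hw_set_vector_length() as a hypothetical stand-in for what RISC-V's
    vsetvli returns, shows why the same source code works for any hardware
    vector width:

    #include <stdint.h>
    #include <stddef.h>

    /* Stand-in model: the hardware tells us how many elements it will
     * process this trip (at most its own VLMAX, here pretended to be 8). */
    static size_t hw_set_vector_length(size_t n) {
        const size_t VLMAX = 8;
        return n < VLMAX ? n : VLMAX;
    }

    void add_arrays(int32_t *c, const int32_t *a, const int32_t *b, size_t n) {
        while (n > 0) {
            size_t vl = hw_set_vector_length(n);
            for (size_t i = 0; i < vl; i++)   /* stands in for one vector add */
                c[i] = a[i] + b[i];
            a += vl; b += vl; c += vl; n -= vl;
        }
    }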


    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 16 07:00:58 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    You apparently only consider attacks through the browser as relevant. Netspectre demonstrates a completely remote attack, i.e., without a
    browser.

    As for the browsers, AFAIK they tried to make Spectre leak less by
    making the clock less precise. That does not stop Spectre, it only
    makes data extraction using the clock slower. Moreover, there are
    ways to work around that by running a timing loop, i.e., instead of
    the clock you use the current count of the counted loop.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    What do you mean with "mitigated in hardware"? The answers to
    hardware vulnerabilities are to either fix the hardware (for Spectre
    "invisible speculation" looks the most promising to me), or to leave
    the hardware vulnerable and mitigate the vulnerability in software
    (possibly supported by hardware or firmware changes that do not fix
    the vulnerability).

    So do you not want it to be fixed in hardware, or not mitigated in
    software? As long as the hardware is not fixed, you may not have a
    choice in the latter, unless you use an OS you write yourself. AFAIK
    you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
    slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations
    disabled.

    So if you are against hardware fixes, you will pay for software
    mitigations, in development cost (possibly indirectly) and in
    performance.

    More info on the topic:

    Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Oct 16 11:34:20 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told from people I trust (on the 754 working group) that at least some
    zero-seeking algorithms will stabilize on an exact value, if and only if
    you have subnormals, otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate
    lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.
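
    A scalar sketch of that approach (assuming positive finite inputs and a
    crude exponent-based seed rather than the table lookup Terje describes):

    #include <math.h>

    /* One Newton-Raphson step for y ~ 1/sqrt(x): y' = 0.5*y*(3 - x*y*y). */
    static double rsqrt_step(double x, double y) {
        return 0.5 * y * (3.0 - (x * y) * y);
    }

    /* Approximate 1/sqrt(x) for x > 0.  The seed only gets the exponent
     * about right, so 8 fixed steps are used; a table-lookup seed would
     * need far fewer. */
    static double my_rsqrt(double x) {
        int e;
        frexp(x, &e);                    /* x = m * 2^e, m in [0.5, 1) */
        double y = ldexp(1.0, -e / 2);   /* within a factor of ~2 of 1/sqrt(x) */
        for (int i = 0; i < 8; i++)
            y = rsqrt_step(x, y);
        return y;
    }

    /* The other two operations, built on the same primitive (a final
     * correction step would still be needed for last-ulp accuracy). */
    static double my_sqrt(double x)           { return x * my_rsqrt(x); }
    static double my_div (double a, double b) { double r = my_rsqrt(b); return a * r * r; }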

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 16 10:24:37 2025
    From Newsgroup: comp.arch

    David Brown wrote:
    On 16/10/2025 07:44, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.

    Looking at

    The Case for the Reduced Instruction Set Computer, 1980, David Patterson https://dl.acm.org/doi/pdf/10.1145/641914.641917

    he never says what defines RISC, just what improved results
    this *design approach* should achieve.

    "Several factors indicate a Reduced Instruction Set Computer as a
    reasonable design alternative.
    ...
    Implementation Feasibility. A great deal depends on being able to fit
    an entire CPU design on a single chip.
    ...
    [EricP: reduced absolute amount of logic for a minimum implementation]

    Design Time. Design difficulty is a crucial factor in the success of
    VLSI computer.
    ...
    [EricP: reduced complexity leading to reduced design time]

    Speed. The ultimate test for cost-effectiveness is the speed at which an implementation executes a given algorithm. Better use of chip area and availability of newer technology through reduced debugging time contribute
    to the speed of the chip. A RISC potentially gains in speed merely from a simpler design.
    ...
    [EricP: reduced complexity and logic leads to reduced critical
    path lengths giving increased frequency.]

    Better use of chip area. If you have the area, why not implement the CISC?
    For a given chip area there are many tradeoffs for what can be realized.
    We feel that the area gained back by designing a RISC architecture rather
    than a CISC architecture can be used to make the RISC even more attractive
    than the CISC. ... When the CISC becomes realizable on a single chip,
    the RISC will have the silicon area to use pipelining techniques;
    when the CISC gets pipelining the RISC will have on chip caches, etc.
    ...
    [EricP: reduced waste on dragging around architectural boat anchors]

    The experience we have from compilers suggests that the burden on compiler writers is eased when the instruction set is simple and uniform.
    ...
    [EricP: reduced compiler complexity and development work]
    "

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 16 10:32:21 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.
    W.r.t. variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation it is a realistic threat for me.
    However it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has smaller performance impact than that
    of Variant 1, if some CPU vendors decide to mitigate Variant 2, I would
    not call them spineless idiots because of it. I'd call them "slick
    businessmen", which in my book is less derogatory.
    I had an idea on how to eliminate Bound Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    value_Rs1, #limit_imm

    throws an overflow fault exception if value register >= unsigned limit.
    (The unsigned >= check also catches negative signed integer values).

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectré sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value register
    write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    My idea is the same as a SUB instruction with overflow detect,
    which I would already have. I like cheap solutions.

    But the core idea here, to eliminate a control flow race condition by
    changing it to a data flow dependency, may be applicable in other areas.

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).

    On second thought, no, load value speculation would not negate this fix.

    x86 has cases (like Shift by 0) where HW predicts that CFLAGs are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at µA level.

    I'd prefer not to step in that cow pie to begin with.
    Then I won't have to spend time cleaning my shoes afterwards.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 16 23:04:44 2025
    From Newsgroup: comp.arch

    On Thu, 16 Oct 2025 07:00:58 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    The demonstration that I would consider realistic should be from JS
    running on browser released after 2018-01-28.

    You apparently only consider attacks through the browser as relevant. Netspectre demonstrates a completely remote attack, i.e., without a
    browser.

    As for the browsers, AFAIK they tried to make Spectre leak less by
    making the clock less precise. That does not stop Spectre, it only
    makes data extraction using the clock slower. Moreover, there are
    ways to work around that by running a timing loop, i.e., instead of
    the clock you use the current count of the counted loop.


    I don't think that was the primary mitigation of Spectre Variant 1
    implemented in browsers.
    Indeed, they made the clock less precise, but that was their secondary
    line of defense, mostly aimed at new Spectre variants that have not
    been discovered yet.
    For Spectre Variant 1 they implemented a much more direct defense.
    For example, before mitigation the JS statement val = x[i] was compiled to:
        cmp    %RAX, 0(%RDX)        # compare i with x.limit
        jbe    oob_handler
        mov    8(%RDX, %RAX, 4), %RCX
    After mitigation it looks like:
        xor    %ECX, %ECX
        cmp    %RAX, 0(%RDX)        # compare i with x.limit
        jbe    oob_handler
        cmovbe %ECX, %EAX           # data dependency prevents problematic speculation
        mov    8(%RDX, %RAX, 4), %RCX

    Almost identical code could be generated on ARM or POWER or SPARC. On
    MIPS rev6 it could be even shorter. On non-extended RISC-V it would be
    somewhat longer, but browser vendors do not care about RISC-V, extended
    or not.

    The part above written for the benefit of interested bystanders.
    You already know all that.


    I'm of strong opinion that at least Spectre Variant 1 (Bound Check
    Bypass) should not be mitigated in hardware.

    What do you mean with "mitigated in hardware"? The answers to
    hardware vulnerabilities are to either fix the hardware (for Spectre "invisible speculation" looks the most promising to me), or to leave
    the hardware vulnerable and mitigate the vulnerability in software
    (possibly supported by hardware or firmware changes that do not fix
    the vulnerability).

    So do you not want it to be fixed in hardware, or not mitigated in
    software? As long as the hardware is not fixed, you may not have a
    choice in the latter, unless you use an OS you write yourself. AFAIK
    you can disable the software mitigations in the Linux kernel, but the development cost of these mitigations still has to be paid, and any
    slowdowns that result from organizing the code such that enabling the mitigations is possible will still be there even with the mitigations disabled.

    So if you are against hardware fixes, you will pay for software
    mitigations, in development cost (possibly indirectly) and in
    performance.

    More info on the topic:

    Fix Spectre in Hardware! Why and How https://repositum.tuwien.at/bitstream/20.500.12708/210758/1/Ertl-2025-Fix%20Spectre%20in%20Hardware%21%20Why%20and%20How-smur.pdf

    - anton

    Maybe I'll look at it some day. Certainly not tonight.
    Maybe never.
    After all, neither you nor I are experts in the design of modern high-perf
    CPUs. So our reasoning about the performance impact of this or that HW
    solution is at best educated hand-waving.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 16 15:17:22 2025
    From Newsgroup: comp.arch

    On 10/16/2025 12:44 AM, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.


    With some fighting as to what exactly it means:
    Small Listing (or smallest viable listing);
    Simple Instructions (Eg: Load/Store);
    Fixed-size instructions;
    ...

    So, for RISC-V:
    First point only really holds in the case of RV64I.
    For RV64G, there is already a lot of unnecessary stuff in there.
    Second Point:
    Fails with the 'A' extension;
    Also parts of F/D.
    Third Point:
    Fails with RV-C.
    Though, people redefine it:
    Still RISC so long as not using an x86-style encoding scheme.

    Well, and still the past example of some old marketing for MSP430 trying
    to pass it off as a RISC, where it had more in common with PDP-11 than
    with any of the RISC's (and the only reason the listing looks tiny is by
    ignoring the special cases encoded in certain combinations of registers
    and addressing modes).

    Like, you can sweep things like immediate-form instructions when you can
    do "@PC+" and get the same effect.


    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)


    RISC-V tends to fail at this one in some areas...

    Also, the V extension doesn't even fit entirely in the opcode, it
    depends on additional state held in CSRs.

    The P extension is also a fail in this area, as they went whole-hog in defining new instructions for nearly every possible combination.



    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.


    IME, SIMD tends to primarily show benefits with 2 and 4 element vectors.

    Most use-cases for longer vectors tend to be matrix-like rather than
    vector-like. Or, the cases that would appear suited to an 8-element
    vector are often achieved sufficiently with two vectors.

    Also, element sizes:
    Most of the dominant use-cases seem to involve 16 and 32 bit elements.
    Most cases that involve 8 bit elements are less suited to actual
    computation at 8 bits (for example, RGB math often works better at 16 bits).


    There are some weaknesses, for example, I mostly ended up dealing with
    RGB math by simply repeating the 8-bit values twice within a 16-bit spot.
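
    In scalar terms, "repeating the value twice" is the x * 257 trick, which
    maps 0..255 onto the full 0..65535 range and back (an illustration, not
    BGB's code):

    #include <stdint.h>

    /* Widen an 8-bit channel by repeating the byte: 0xAB -> 0xABAB.
     * 0 maps to 0 and 255 maps to 65535, so the endpoints stay exact. */
    static inline uint16_t widen8(uint8_t c)    { return (uint16_t)(c * 257u); }

    /* Narrow back by taking the high byte (truncating; a rounding version
     * would add 128 before the shift). */
    static inline uint8_t narrow16(uint16_t w)  { return (uint8_t)(w >> 8); }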

    For various tasks, it might have been better to have gone with an
    unpack/repack scheme like:
    Pad2.Value8.Frac6
    Pad4.Value8.Frac4
    Where Pad can deal with values outside unit range, and Frac with values between the two LDR points. Then the RGB narrowing conversion operations
    could have had the option for round-and-saturate.

    Though, a more tacky option is to use the existing unpack operation and
    then invert the low-order bits to add a little bit of padding space for underflow/overflow.

    Another option being to use "Packed Shift" instructions to get a format
    with pad bits.


    No saturating ops in my case, as saturating ops didn't seem worth it
    (and having Wrap/SSat/USat/... is a big part of the combinatorial
    explosion seen in the P extension).



    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the rCLRrCY stands for.


    Checking, if I take XG3, and exclude SIMD, 128-bit integer instructions,
    stuff for 96-bit addressing, etc, the listing drops to around 208 instructions.

    This does still include things like instructions with niche addressing
    modes (such as "(GP,Disp16)"), etc.

    If stripped back to "core instructions" (excluding rarely-used
    instructions, such as ROT*/etc, and some of these alternate-mode
    instructions, etc), could be dropped back a little further.

    There are some instructions in the listing that would have been merged
    in RISC-V, like FPU instructions which differ only in rounding mode (the
    RNE and DYN instructions exist as separate instructions in this case, ...).


    It is a little over 400 if the SIMD and ALUX stuff and similar is added
    back in (excluding things like placeholder spots, or instructions which
    were copied from XG2 but are either N/A or redundant, ...).

    There is a fair chunk of instructions which mostly exist as SIMD format converters and similar.


    So, seems roughly:
    ~ 50%: Base instructions
    ~ 20%: ALUX and 96-bit addressing.
    ~ 30%: SIMD stuff

    Internally to the CPU core, there are roughly 44 core operations ATM,
    though many multiplex groups of related operations as sub-operations.

    So, things like ALU/CONV/etc don't represent a single instruction.
    But, JMP/JSR/BRA/BSR are singular operations (and BRA/BSR both map to
    JAL on the RV side, differing as to whether Rd is X0 or X1; similarly
    with both JMP and JSR mapping to JALR in a similar way).
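
    As a decode-side sketch of that mapping (illustrative names only, not
    the actual implementation):

    /* RISC-V JAL/JALR carry a link register Rd; whether the internal op is
       a plain branch/jump or a call is just a question of Rd. */
    typedef enum { OP_BRA, OP_BSR, OP_JMP, OP_JSR } internal_op;

    static internal_op map_jump(int is_jalr, unsigned rd)
    {
        if (!is_jalr)                    /* JAL                            */
            return (rd == 0) ? OP_BRA    /* jal x0, ...  -> branch         */
                             : OP_BSR;   /* jal x1, ...  -> call (link)    */
        else                             /* JALR                           */
            return (rd == 0) ? OP_JMP    /* jalr x0, ... -> indirect jump  */
                             : OP_JSR;   /* jalr x1, ... -> indirect call  */
        /* rd other than x0/x1 would need the general link-register path
           (or a trap), per the discussion below. */
    }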

    BSR and JSR had been modified to allow arbitrary link register, but it
    may make sense to reverse this; as Rd other than X0 and X1 is seemingly
    pretty much never used in practice (so not really worth the logic cost).


    Other option being to trap and (potentially) emulate, if Rd is not X0 or
    X1 (or just ignore it). Also, very possible, is demoting basically the
    entire RV 'A' extension to "trap and emulate".

    So, in HW:
    RV64I : Fully
    M : Mostly
    A : Trap/Emulate
    F/D : Partial (many cases are traps)
    Zicsr : Partial (trap in general case)
    Zifencei: Trap
    ...


    where, say, ALU gets a 6-bit control value (see the decode sketch below):
    (3:0): Which basic operation to perform;
    (5:4): Width/packing mode, one of:
    00: 32-bit, sign-ext result (eg: ADDW in RV terms)
    01: 32-bit, zero-ext result (eg: ADDWU in RV terms)
    10: 64-bit (ADD)
    11: 2x 32-bit (some ops) or 4x 16-bit (other ops), eg: PADD.L or PADD.W.
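
    A minimal decode sketch of that control value (field and type names are
    illustrative assumptions, not the actual implementation):

    #include <stdint.h>

    typedef enum { W32_SX, W32_ZX, W64, PACKED } alu_width;

    typedef struct {
        unsigned  op;      /* ctl[3:0]: which basic ALU operation          */
        alu_width width;   /* ctl[5:4]: how the 64-bit datapath is treated */
    } alu_ctl;

    static alu_ctl decode_alu_ctl(uint8_t ctl6)
    {
        alu_ctl c;
        c.op = ctl6 & 0x0F;
        switch ((ctl6 >> 4) & 0x03) {
        case 0:  c.width = W32_SX; break;   /* 32-bit, sign-extended (ADDW)  */
        case 1:  c.width = W32_ZX; break;   /* 32-bit, zero-extended (ADDWU) */
        case 2:  c.width = W64;    break;   /* full 64-bit (ADD)             */
        default: c.width = PACKED; break;   /* 2x32 or 4x16, op-dependent    */
        }
        return c;
    }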

    There is CONV/CONV2/CONV3:
    CONV: Simple 2R converter ops which may have 1-cycle latency
    (later demoted to 2-cycle, with MOV being relocated elsewhere).
    CONV2: More complex 2R converter ops, 2 cycle latency.
    CONV3: Same as CONV2, but because CONV2 ran out of space.


    Still no real mechanism to deal with the potential proliferation of
    ".UW" instructions in RISC-V; for now I have been ignoring this.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 16 16:26:27 2025
    From Newsgroup: comp.arch

    On 10/16/2025 2:04 AM, David Brown wrote:
    On 16/10/2025 07:44, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up
    the instruction opcode space with a combinatorial explosion. (Or sequence
    of combinatorial explosions, when you look at the wave after wave of SIMD
    extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different
    hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers. With the vector-style
    instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA, and you have the same code no matter how wide
    the actual execution units are. I have no experience with this (or much
    experience with SIMD), but that seems like a big win to my mind. It is
    akin to letting the processor hardware handle multiple instructions in
    parallel in superscalar CPUs, rather than Itanium EPIC coding.
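
    A small illustration of that width lock-in with x86 intrinsics (an
    illustrative example; scalar tails omitted): the 128-bit and 256-bit
    versions are entirely different code even though they do the same thing.

    #include <immintrin.h>

    /* SSE: 4 floats per instruction -- tied to 128-bit registers. */
    void add_f32_sse(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }

    /* AVX: 8 floats per instruction -- different registers, different
       intrinsics, different binary code for the same loop. */
    void add_f32_avx(float *dst, const float *a, const float *b, int n)
    {
        for (int i = 0; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
        }
    }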


    But, there is a problem:
    Once you go wider than 2 or 4 elements, the cases where wider SIMD
    brings more benefit tend to fall off a cliff.

    More so, when you go wider, there are new problems:
    Vector Masking;
    Resource and energy costs of using wider vectors;
    ...

    Then, for 'V':
    In the basic case, it effectively doubles the size of the register file
    vs 'G';
    ...


    Then we have x86 land:
    SSE: Did well;
    AVX256: Rocky start, negligible benefit from the YMM registers;
    Using AVX encodings for 128-bit vectors being arguably better.
    AVX512: Sorta exists, but:
    Very often not supported;
    Trying to use it (on supported hardware) often makes stuff slower.

    If even Intel can't make their crap work well, I am skeptical.

    While arguably GPUs were very wide, it is different:
    They were often doing very specialized tasks (such as 3D rendering);
    And, often with a SIMT model rather than "very large SIMD";
    Things like CUDA (and RTX) actually push things narrower;
    Larger numbers of narrower cores,
    rather than a smaller number of wider cores.
    ...


    The one area that doesn't seem to run into a diminishing returns wall
    seems to be to map "embarrassingly parallel" problems to large numbers
    of processor cores, and to try to keep things as loosely coupled as
    possible.

    This works mostly until the CPU runs out of memory bandwidth or similar.



    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.


    Agreed, this is more the stance I take.

    Instructions should be simple for the hardware, and should try to allow
    for low latency, rather than the goal being to keep the instruction
    listing small.



    Though, that said, in my case I still ended up making most instructions
    have a 2 or 3 cycle latency.

    So, generally, MOV-RR and MOV-IR end up as basically the only
    single-cycle instructions. A case could almost be made for making *all*
    instructions 2 or 3 cycles and then eliminating forwarding from EX1
    entirely (or maybe adding an EX4 stage).

    Say:
    PF IF ID RF E1 E2 E3 WB
    FW from E2 and E3
    RAW hazard between RF and E1 always stalls.
    Or:
    PF IF ID RF E1 E2 E3 E4 WB
    FW from E2, E3, and E4.

    With an E4 stage, one could maybe allow for pipelined low-precision FMAC
    or similar.
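
    A toy model of the resulting stall behavior (an illustrative sketch:
    single-issue, in-order, every result taking 2 cycles before it can be
    forwarded; it ignores the E3/E4 paths and only shows how back-to-back
    RAW dependencies insert bubbles):

    #include <stdio.h>

    #define N 6

    int main(void)
    {
        /* deps[i] = index of the instruction whose result insn i consumes,
           or -1 if it has no register dependency. */
        int deps[N] = { -1, 0, 1, -1, 3, -1 };
        int ready[N];                       /* cycle a result can forward   */
        int cycle = 0;

        for (int i = 0; i < N; i++) {
            if (deps[i] >= 0 && ready[deps[i]] > cycle)
                cycle = ready[deps[i]];     /* stall until value forwards   */
            printf("insn %d issues at cycle %d\n", i, cycle);
            ready[i] = cycle + 2;           /* 2-cycle result latency       */
            cycle++;                        /* one instruction per cycle    */
        }
        return 0;
    }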


    Though, I see it more as the ISA not actively hindering achieving >= 1
    IPC throughput, rather than instructions having 1 cycle latency.

    But, it can be noted that having 2-cycle latency does hinder the
    efficiency of some common patterns in RISC-V, where tight register RAW
    dependencies run rampant.

    So, say, you ideally want 5-8 instructions between each instruction and
    the next instruction that uses the result. This typically does not
    happen in most code, and particularly not if one needs instruction
    chains for semi-common idioms (say, where the optimal instruction
    scheduling would far exceed the length of a typical loop body).

    For better or worse, this does tend to result in a lot of
    performance-sensitive code being written with fairly heavy-handed loop
    unrolling (as in the sketch below).
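
    For example (an illustrative sketch): a dot product written naively
    chains every add through the same accumulator, so with 2-3 cycle
    latency and no OoO it stalls on every iteration; unrolling with
    independent accumulators spreads the RAW distance out.

    float dot_naive(const float *a, const float *b, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];          /* each add waits on the previous */
        return sum;
    }

    float dot_unrolled(const float *a, const float *b, int n)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i+0] * b[i+0];       /* four independent chains: each  */
            s1 += a[i+1] * b[i+1];       /* accumulator is reused only     */
            s2 += a[i+2] * b[i+2];       /* every ~4 instructions          */
            s3 += a[i+3] * b[i+3];
        }
        for (; i < n; i++)
            s0 += a[i] * b[i];           /* scalar tail */
        return (s0 + s1) + (s2 + s3);
    }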

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 16 21:52:22 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:
    ---------------------------
    What demonstrations?
    The demonstration that I would consider realistic should be from JS
    running on a browser released after 2018-01-28.

    I'm of the strong opinion that at least Spectre Variant 1 (Bounds Check
    Bypass) should not be mitigated in hardware.
    W.r.t. Variant 2 (Branch Target Injection) I am less categorical. That
    is, I rather prefer it not mitigated on the hardware that I use,
    because I am sure that in no situation is it a realistic threat for me.
    However, it is harder to prove that it is not a realistic threat to
    anybody. And since HW mitigation has a smaller performance impact than
    that of Variant 1, if some CPU vendors decide to mitigate Variant 2, I
    would not call them spineless idiots because of it. I'd call them
    "slick businessmen", which in my book is less derogatory.
    I had an idea on how to eliminate Bounds Check Bypass.
    I intend to have range-check-and-fault instructions like

    CHKLTU value_Rs1, limit_Rs2
    CHKLTU value_Rs1, #limit_imm

    which throw an overflow fault exception if the value register is >= the
    unsigned limit. (The unsigned >= check also catches negative signed
    integer values.)

    It can be used to check an array index before use in a LD/ST, e.g.

    CHKLTU index_Rs, limit_Rs
    LD Rd, [base_Rs, index_Rs*scale]

    The problem is that there is no guarantee that an OoO cpu will execute
    the CHKLTU instruction before using the index register in the LD/ST.

    Yes, order in OoO is sanity-impairing.

    But, what you do know is that CHKx will be performed before LD can
    retire. _AND_ if your µA does not update µA state prior to retire,
    you can be as OoO as you like and still not be Spectre-sensitive.

    One of the things recently put into My 66000 is that AGEN detects
    overflow and raises PageFault.

    My idea is for the CHKcc instruction to copy the test value to a dest
    register when the check is successful. This makes the dest value
    register write-dependent on successfully passing the range check,
    and blocks the subsequent LD from using the index until validated.

    CHKLTU index_R2, index_R1, limit_R3
    LD R4, [base_R5, index_R2*scale]
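
    In C terms the shape of the idea is roughly the following (an
    illustrative sketch; chk_index() stands in for a CHKLTU-style
    instruction and is not a real intrinsic, and in plain C the compiler is
    of course still free to speculate -- the point is that the hardware
    instruction would make the load's index operand data-dependent on the
    check):

    #include <stddef.h>
    #include <stdlib.h>

    static inline size_t chk_index(size_t idx, size_t limit)
    {
        if (idx >= limit)
            abort();       /* the real instruction would raise a fault */
        return idx;        /* value only produced if the check passed  */
    }

    int load_checked(const int *base, size_t idx, size_t limit)
    {
        size_t safe = chk_index(idx, limit);
        return base[safe]; /* consumes 'safe', not the unchecked 'idx' */
    }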

    If you follow my rule above this is unnecessary, but it may be less
    painful than holding back state update until retire.

    My idea is the same as a SUB instruction with overflow detect,
    which I would already have. I like cheap solutions.

    But the core idea here, to eliminate a control flow race condition by changing it to a data flow dependency, may be applicable in other areas.

    This adds unnecessary execution latency to the architectural path.
    Without the check you have <say> 3-cycle unchecked LD
    With the check you have 4-cycle checked LD

    Now get some multi-pointer chasing per iteration algorithm in a loop and
    all of a sudden the execution window is no longer big enough to run it at
    full speed.

    Because there is no branch, there is no way to speculate around the check
    (but load value speculation could negate this fix).

    On second thought, no, load value speculation would not negate this fix.

    x86 has cases (like Shift by 0) where HW predicts that CFLAGS are set
    and µfaults when shift count == 0 and prevents setting of CFLAGS.
    You "COULD" do something similar at the µA level.

    I'd prefer not to step in that cow pie to begin with.

    Just making sure you remain aware of the cow-pies littering the field...

    Then I won't have to spend time cleaning my shoes afterwards.

    I am more worried about the blood on the shoes than the cow-pie.
    {{shooting oneself in the foot}}
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 16 21:59:14 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    ----------------------

    Please note that I have NOT personally observed this, but I have been
    told by people I trust (on the 754 working group) that at least some
    zero-seeking algorithms will stabilize on an exact value if and only if
    you have subnormals; otherwise it is possible to wobble back & forth
    between two neighboring results.

    I know of several Newton-Raphson-iterations that converge faster and
    more accurately using reciprocal-SQRT() than the equivalent algorithm
    using SQRT() directly in NR-iteration.

    I.e. they differ by exactly one ulp.

    In my cases, the RSQRT() was 1 or 2 iterations faster and 2 ULP more accurate. I don't know of a case oscillating at 1 ULP due to arithmetic anomalies.

    Interesting! I have also found rsqrt() to be a very good building block,
    to the point where if I can only have one helper function (approximate lookup to start the NR), it would be rsqrt, and I would use it for all
    of sqrt, fdiv and rsqrt.
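
    A sketch of that "one helper" approach (an illustrative example;
    rsqrt_seed() stands in for whatever approximate-estimate instruction or
    table lookup the hardware provides, and none of these are correctly
    rounded -- they only show the data flow):

    /* One Newton-Raphson step for y ~= 1/sqrt(x): y' = y*(1.5 - 0.5*x*y*y).
       Each step roughly doubles the number of correct bits. */
    static inline double rsqrt_step(double x, double y)
    {
        return y * (1.5 - 0.5 * x * y * y);
    }

    extern double rsqrt_seed(double x);   /* low-precision initial guess */

    double my_rsqrt(double x)             /* x > 0 assumed throughout    */
    {
        double y = rsqrt_seed(x);
        y = rsqrt_step(x, y);
        y = rsqrt_step(x, y);             /* iterate to target precision */
        return y;
    }

    double my_sqrt(double x)              /* sqrt(x) = x * rsqrt(x)      */
    {
        return x * my_rsqrt(x);
    }

    double my_div(double a, double b)     /* a/b = a * rsqrt(b)^2, b > 0 */
    {
        double r = my_rsqrt(b);
        return a * r * r;
    }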

    In practice:: RSQRT() is no harder to compute {both HW and SW},
    yet:: RSQRT() is more useful::

    SQRT(x) = RSQRT(x)*x is 1 pipelined FMUL
    RSQRT(x) = 1/SQRT(x) is 1 non-pipelined FDIV

    Useful in vector normalization::

    some-vector-calculation
    -----------------------
    SQRT( SUM(x**2,1,n) )

    and a host of others.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 16 22:19:21 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 16/10/2025 07:44, Lawrence D'Oliveiro wrote:
    On Wed, 15 Oct 2025 22:19:18 GMT, MitchAlsup wrote:

    But the RISC-V folks still think Cray-style long vectors are better
    than SIMD, if only because it preserves the "R" in "RISC".

    The R in RISC-V comes from "student _R_esearch".

    "Reduced Instruction Set Computing". That was what every single primer on
    the subject said, right from the 1980s onwards.

    Oh, and BTW: I don't believe SIMD is better than CRAY-like vectors (or
    vice versa)--they simply represent different ways of shooting yourself
    in the foot.

    The primary design criterion, as I understood it, was to avoid filling up the instruction opcode space with a combinatorial explosion. (Or sequence of combinatorial explosions, when you look at the wave after wave of SIMD extensions in x86 and elsewhere.)

    I believe another aim is to have the same instructions work on different hardware. With SIMD, you need different code if your processor can add
    4 ints at a time, or 8 ints, or 16 ints - it's all different
    instructions using different SIMD registers.

    Among SIMD's ISA problems is additional state at context switch time
    on top of FP's added state at context switch time; and with all the
    fast memory-move subroutines being SIMD-based, the service routines
    need access to SIMD that they don't normally need for FP {and the
    SIMD register file is larger, too}.

    With the vector style instructions in RISC-V, the actual SIMD registers and implementation are
    not exposed to the ISA and you have the same code no matter how wide the actual execution units are.

    Vector LD and ST instructions are not conceptually different than
    LDM and STM--1 instruction accesses multiple memory locations.

    But what gets me is the continual disconnect from actual vector
    calculations in source code--causing the compilers to have to solve
    many memory aliasing issues to use the vector ISA.
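
    A trivial C example of the aliasing problem (an illustrative example):
    with no further information the compiler must assume dst and src may
    overlap, so it cannot blindly issue wide vector loads and stores for
    the loop; 'restrict' is the source-level promise that lets it proceed,
    whereas run-time overlap checking in hardware removes the need for the
    programmer to make that promise.

    void scale(float *dst, const float *src, float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];      /* dst[i] might alias src[i+1], ... */
    }

    void scale_restrict(float * restrict dst, const float * restrict src,
                        float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];      /* no overlap: free to vectorize    */
    }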

    Software writes vector loops--yet the HW vectorizes instructions.

    {{I might note My 66000 vectorizes loops not instructions to avoid
    this problem; For example::

    for( i = 0; i < max; i++ )
    {
        temp = a[i];
        a[i] = a[max-i];
        a[max-i] = temp;
    }

    is vectorizable in My 66000--those loops where the memory references
    do not overlap can run "as fast as the width of the data path allows",
    while those with memory reference collisions run no worse than scalar
    code. For a large value of max the profile would look like::

    FFFFFFFFFFFFFFFFFsssFFFFFFFFFFFFFFFFF

    F representing fast (say 4-wide or 8-wide)
    s representing slow (say 1-wide)

    The same binary runs as fast as memory references (and data-flow
    dependencies and data-path width) allow.
    }}

    I have no experience with this (or much experience with SIMD), but that
    seems like a big win to my mind. It is akin to letting the processor
    hardware handle multiple instructions in parallel in superscalar CPUs,
    rather than Itanium EPIC coding.


    Also there might be some pipeline benefits in having longer vector
    operands ... I'll bow to your opinion on that.

    CRAY-like vector computers built memory systems that could handle the
    load of the vector calculations. The CRAY-1 could perform a new memory
    access every clock; the CRAY-[XY]MP could handle 2 LDs and 1 ST per
    clock continuously.

    If the CPUs of today were really going to fully utilize the vector
    data path, they would need a much better memory system than they are
    building presently (1 new cache miss per cycle).

    The power of the vector computers was almost entirely in the memory
    system, not in the data path (which is surprisingly easy to build, and
    surprisingly difficult to keep fed).

    No ISA with more than 200 instructions deserves the RISC mantra.

    There you go ... agreeing with me about what the "R" stands for.

    I have always thought that RISC parsed as (RI)SC rather than R(IS)C.
    That is, the instructions themselves should be simple (the aim was one instruction, one clock cycle) rather than that there necessarily should
    be fewer instructions.

    On vacation over the summer, I coined a new phrase to denote what I
    hope My 66000 will end up being::

    CARD: Computer Architecture Rightly Done.

    Note: It does not stop at ISA--as ISA is less than 1/3rd of what a
    computer architecture is and means.



    --- Synchronet 3.21a-Linux NewsLink 1.2