• Re: Linus Torvalds on bad architectural features

From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Dec 28 00:10:48 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.

There is no "top" or "bottom" or "left" or "right" in memory. There
are only addresses (bit numbers and byte numbers).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 17:43:25 2025
    From Newsgroup: comp.arch


Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.

There is no "top" or "bottom" or "left" or "right" in memory. There
are only addresses (bit numbers and byte numbers).

    In order to stop the BE::LE war, one could always do a Middle Endian
    bit/Byte order. You start in the middle and each step goes right-then-left.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Dec 28 13:34:26 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.
There is no "top" or "bottom" or "left" or "right" in memory. There
are only addresses (bit numbers and byte numbers).

    In order to stop the BE::LE war, one could always do a Middle Endian
    bit/Byte order. You start in the middle and each step goes right-then-left.

    That is pretty much what DEC did on their
    PDP-11 and VAX floating point values.
(It's what happens when you plug HW designed for
    BE into a LE bus and don't rearrange the wires.)


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Dec 28 13:55:09 2025
    From Newsgroup: comp.arch

    MitchAlsup [2025-12-28 17:43:25] wrote:
    In order to stop the BE::LE war, one could always do a Middle Endian
    bit/Byte order. You start in the middle and each step goes right-then-left.

    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bill Findlay@findlaybill@blueyonder.co.uk to comp.arch on Sun Dec 28 19:21:44 2025
    From Newsgroup: comp.arch

On 28 Dec 2025, Lawrence D'Oliveiro wrote
    (in article <10ipsi8$3ssi3$4@dont-email.me>):

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.

    There is no "top" or "bottom" or "left" or "right" in memory.
    There are only addresses (bit numbers and byte numbers).

    Priceless!
    (I needed a good laugh.)
    --
    Bill Findlay

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 21:34:10 2025
    From Newsgroup: comp.arch


    Bill Findlay <findlaybill@blueyonder.co.uk> posted:

On 28 Dec 2025, Lawrence D'Oliveiro wrote
    (in article <10ipsi8$3ssi3$4@dont-email.me>):

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.

    There is no "top" or "bottom" or "left" or "right" in memory.
    There are only addresses (bit numbers and byte numbers).

    Priceless!
    (I needed a good laugh.)

    I always put higher addresses higher on my drawings than lower addresses.
    So top -> Address = 0xFFFFFFFFFFFFFFFF...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Dec 28 16:09:11 2025
    From Newsgroup: comp.arch

    On 12/28/2025 12:55 PM, Stefan Monnier wrote:
    MitchAlsup [2025-12-28 17:43:25] wrote:
    In order to stop the BE::LE war, one could always do a Middle Endian
    bit/Byte order. You start in the middle and each step goes right-then-left.

    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").


    Apparently DEC Alpha did this:
    Nothing smaller than 64 bits in HW.

    So, you want byte-oriented memory access or similar? Implement it yourself.



    Well, my recent goings on in ISA space:
    Ended up adding a J52I prefix to my jumbo-prefix extension in RISC-V.

    J52I is a 64-bit prefix that glues 52 bits onto the immediate, making it possible to encode 64-bit immediate and displacement values.
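
As an aside, the arithmetic works out because 52 prefix bits plus the
base instruction's 12-bit immediate give exactly 64 bits. A small C
sketch of the fusion, assuming the prefix supplies the upper 52 bits
(the post does not spell out the exact placement):

#include <stdint.h>

/* Hypothetical fusion of a J52I prefix with a base instruction's
   12-bit immediate: 52 prefix bits on top, 12 base bits below.
   The placement is an assumption for illustration only. */
static int64_t fuse_j52i(uint64_t prefix52, uint32_t imm12)
{
    return (int64_t)((prefix52 << 12) | (imm12 & 0xFFFu));
}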

    I changed the interpretation such that: [Reg+Disp64] is instead
    understood as [Abs64+Reg]. It is now possible to encode Abs64 branches
    via J52I+JALR.

    Potentially similar could be defined for XG2 and XG3 as well. Wouldn't
require any new changes or additions encoding-wise, but would define
    something new in terms of decoding behavior in the case of Abs64
(currently, Disp64 is not allowed; this would make it allowed, just
    understood as Abs64).

Though, unlike RISC-V, where [Rb+Disp64] and [Abs64+Rb] are conceptually equivalent, would need to decide the specifics in the XG2/XG3 case:
    Do the same thing as what I did for RV, meaning the displacement
    register is unscaled;
    Break symmetry, and make it so that it is [Abs64+Rb*Scale].


    In XG3, it could in theory just use the RV-J52I encodings as well for a
    lot of the cases if needed.

    For XG2, there is a 64-bit encoding for an Abs48 branch (special case).
    Would need to debate whether or not Abs64 memory ops are needed. But,
still niche, as it more often applies to thunks and similar than to normal
    code generation.



    It was added partly as I started to realize I had some non-zero use
    cases for Imm64 and Abs64 addressing in RV Mode.

    At first, partly designed a new encoding scheme for Imm64 instructions,
    but then realized it was possible to devise a J52I prefix which could be
    done more cheaply within my implementation.

    Then ended up battling with timing failures (this stuff was "the straw
    that broke the camel's back" in terms of timing constraints). Have ended
    up partially restructuring some parts of the decoder, partly reducing
    clutter and improving timing some (so back to passing timing again).


    Formally, the J52I prefix will likely fully replace the use of
    J22+J22+LUI and similar. It can also express the same behavior (via
    "ADDI Xd, X0, Imm64") and with slightly less hair. Internally, the J52I prefixes also better leverage J21I decoding (as effectively both
    "halves" of the J52I prefix are decoded in ways more consistent with the handling of J21I; and for the low 32 bits, the immediate is decoded
    as-if it had been given a J21I prefix).


    Then after this ended up tweaking things in BGBCC and my PEL loaders
    such that the base-reloc previously used to encode tripwires now also
    can encode the location of stack canary values; allowing the loader to essentially randomize the stack canary values each time a program is loaded.

    Mostly works, though seemingly fails for some reason on the CPU core
    when the boot ROM is built in RISC-V mode. At first I thought it was a cache-coherence issue (in the RISC-V case the relevant cache-related
functions were NOPs). Now it appears as though the code was somehow disrupting the application of base relocs (in a way that doesn't happen
    in the emulator).

    So, it is possible that bugs remain in the RV support.


    Also in the process got around to re-enabling basic ASLR for the kernel
    in the Boot ROM (the original issue impacting the symbol listing in the emulator is no longer as relevant).

    Well, and implementing some of the RV CBO instructions and similar.
    But, still doesn't fully address needs nor is a particularly close match
    to how my CPU does things. Also FENCE.I uses too much encoding space,
    and I ended up handling it in a similar way to CBO.

    Had ended up adding a few non-standard 0R encodings for things like TLB flushing and similar.

    As-is, FENCE.I can't fully implement the standard semantics, which in my
    case would need to also be able to flush the L1 D$.
    FENCE: Effectively needs full cache flush.
    Strategy: Trap and emulate.
    FENCE.I:
    Proper semantics needs full cache flush (both D$ and I$).
    Strategy: Trap and emulate proper version,
    allow CBO-like handling of a variant case.

Some wonk exists in the CBO spec; it is like someone didn't quite get
why one would want cache-line invalidation instructions
and was trying to work this in with assumptions of a fully coherent
cache (rather than, say, one using explicit cache flushing because the
HW uses a weak coherence model; using explicit flushes doesn't really make sense if one has coherent caches).

    ...




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 28 23:00:05 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:
    On 12/28/2025 12:55 PM, Stefan Monnier wrote:
    MitchAlsup [2025-12-28 17:43:25] wrote:
    In order to stop the BE::LE war, one could always do a Middle Endian
bit/Byte order. You start in the middle and each step goes right-then-left.
    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").


    Apparently DEC Alpha did this:
    Nothing smaller than 64 bits in HW.

    So, you want byte-oriented memory access or similar? Implement it yourself.

    That turned out to be a mistake, which they later corrected with the
    BWX extension.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 29 06:59:02 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 12/28/2025 12:55 PM, Stefan Monnier wrote:
    MitchAlsup [2025-12-28 17:43:25] wrote:
    In order to stop the BE::LE war

    The war is over. Little-Endian has won.

    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").


    Apparently DEC Alpha did this:
    Nothing smaller than 64 bits in HW.

    Alpha is a byte-addressed architecture, and therefore there is a
    difference between big-endian and little-endian on Alpha. The Alpha architecture manual also explains how to implement byte accesses for little-endian systems and for big-endian systems (AFAIK nobody ever
    built a big-endian Alpha system).
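
The technique is essentially an aligned wide load plus shift and mask.
A minimal C sketch of the little-endian case (the flat mem array and
names are illustrative, not the manual's actual LDQ_U-style sequence):

#include <stdint.h>

/* Byte load on a machine whose memory only supports aligned 64-bit
   accesses, with little-endian byte numbering. */
static uint8_t load_byte(const uint64_t *mem, uintptr_t addr)
{
    uint64_t word = mem[addr >> 3];          /* aligned 64-bit load */
    unsigned shift = (unsigned)(addr & 7) * 8;
    return (uint8_t)(word >> shift);         /* extract the byte */
}

/* A byte store is the read-modify-write version: */
static void store_byte(uint64_t *mem, uintptr_t addr, uint8_t val)
{
    unsigned shift = (unsigned)(addr & 7) * 8;
    uint64_t mask = (uint64_t)0xFF << shift;
    mem[addr >> 3] = (mem[addr >> 3] & ~mask) | ((uint64_t)val << shift);
}

A big-endian system would compute the shift as (7 - (addr & 7)) * 8,
which is the only place where the byte order shows up.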

    Mitch Alsup mentioned one architecture without order problems: The
    Cray-1 is word-addressed and does not support numbers that take more
than one word. The same is true of the CDC 6600 and descendants.

The 36-bit machines were word-addressed, but supported double-precision (72-bit) FP numbers, making the word order of these
    numbers relevant. What order did they use? For IEEE FP, the order is
    probably not an issue, because one will usually not access a DP FP
    value with two SP loads (maybe if the FP value is represented in two
    SP registers, as on the 88000 or the VAX?). For formats where the DP
    FP representation only has a longer mantissa but the same exponent
    size as the SP FP representation, accessing the DP FP value with SP
    operations may be more interesting, and in that case the word order
    plays a role.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Dec 29 08:17:18 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    Mitch Alsup mentioned one architecture without order problems: The
    Cray-1 is word-addressed and does not support numbers that take more
    than one word. The same is true of the CDC 6600 and descendents.

    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 29 09:08:10 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.

    And my guess is that the word order for double-precision is also
    specified by FORTRAN. It probably is the word order of the IBM 704
    (the machine for which Fortran was designed), and that probably is
    big-endian (the higher-order bits of the mantissa are in the word with
    the lower address), guessed from the fact that the IBM S/360 has
    big-endian byte order.

    Interestingly, for FP formats that have the sequence of bits

    sign|exponent|mantissa

    with the highest-order mantissa bits leftmost, the exponent left of
    that, and the sign left of that, and where the double-precision format
just has a longer mantissa than the single-precision format, storing
    DP FP numbers as two words in big-endian format (i.e., the leftmost
    word at the lower address) has a similar property as little-endian has
for integers:

    You can load a number that fits into the smaller container at the same
    address with the appropriate shorter load, and get the correct result.
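
For integers the corresponding little-endian property is easy to see in
C (memcpy stands in for the shorter load; this only prints the expected
value on a little-endian host):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint64_t wide = 0x0000000012345678u;   /* value fits in 32 bits */
    uint32_t narrow;
    memcpy(&narrow, &wide, sizeof narrow); /* shorter load, same address */
    printf("%08" PRIx32 "\n", narrow);     /* 12345678 on little-endian */
    return 0;
}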

    So the choice of big-endian in the S/360 may have to do with FP in
    addition to unpacked decimal data. On the PDP-11, 8008, and 6502,
    where binary integers dominated and unpacked decimal data played no
    role, little-endian was chosen.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Dec 29 13:39:22 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.

    And my guess is that the word order for double-precision is also
    specified by FORTRAN.

    Your guess is wrong.

    If you have storage association (via COMMON/EQUIVALENCE)
    between two variables of different type and assign a value
    to one of them, the other one becomes undefined.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Dec 29 13:48:24 2025
    From Newsgroup: comp.arch

    BGB [2025-12-28 16:09:11] wrote:
    On 12/28/2025 12:55 PM, Stefan Monnier wrote:
    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").
    Apparently DEC Alpha did this:
    Nothing smaller than 64 bits in HW.

I thought they did include 32-bit load/store instructions even in the
    original AXP21064.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Dec 31 02:54:29 2025
    From Newsgroup: comp.arch

    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.

    And my guess is that the word order for double-precision is also
    specified by FORTRAN.

    Your guess is wrong.

    If you have storage association (via COMMON/EQUIVALENCE)
    between two variables of different type and assign a value
    to one of them, the other one becomes undefined.

    There was never any sort of type punning in FORTRAN. The 704 and 709
    floating point was single precision, other than a multiply that
    produced a two word result. Fortran II on the 709 and 7090 had a
    kludge where you could put D in the first column of a statement to
    make it double precision, and Fortran IV added explicit DOUBLE
    PRECISION. They did double precision in software, until the 7090 added
    double precision arithmetic instructions.

Fortran ran on many different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Dec 31 09:43:18 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> schrieb:
    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.

    And my guess is that the word order for double-precision is also
    specified by FORTRAN.

    Your guess is wrong.

    If you have storage association (via COMMON/EQUIVALENCE)
    between two variables of different type and assign a value
    to one of them, the other one becomes undefined.

    There was never any sort of type punning in FORTRAN.

    [Interesting history snipped]

Fortran ran on many different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.

    Unfortunately, this did not keep people from using a feature
    that was officially prohibited by the standard, see for example https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
    clear need (having certain floating point constants available for
the multitude of different floating point formats in the wild)
    but it had to resort to type punning. It is also interesting
    for the wild multitude of different floating point formats for
    double precision that were relevant in the past.
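
The C rendering of that EQUIVALENCE trick is a union; a sketch using
the IEEE double-precision bit patterns (the historical file instead
carried alternative DATA statements for each machine's format):

#include <stdint.h>
#include <stdio.h>

union pun { uint64_t bits; double d; };   /* overlay, as EQUIVALENCE did */

int main(void)
{
    union pun smallest = { .bits = 0x0010000000000000u }; /* min normal */
    union pun largest  = { .bits = 0x7FEFFFFFFFFFFFFFu }; /* max finite */
    printf("%g %g\n", smallest.d, largest.d);
    return 0;
}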

    Fortunately, these days it's all IEEE; I think nobody uses IBM's
    base-16 FP numbers for anything serious any more.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Jan 1 17:46:29 2026
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    John Levine <johnl@taugh.com> schrieb:
    There was never any sort of type punning in FORTRAN.

    [Interesting history snipped]

Fortran ran on many different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.

    Unfortunately, this did not keep people from using a feature
that was officially prohibited by the standard, see for example https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
    clear need (having certain floating point constants ...

    Wow, that's gross but I see the need. If you wanted to do extremely
    machine specific stuff in Fortran, it didn't try to stop you.

    Fortunately, these days it's all IEEE; I think nobody uses IBM's
    base-16 FP numbers for anything serious any more.

    Agreed, except IEEE has both binary and decimal flavors. It's never been
    clear to me how much people use decimal FP. The use case is clear enough,
    it lets you control normalization so you can control the decimal precision
    of calculations, which is important for financial calculations like bond prices. On the other hand, while it is somewhat painful to get correct
    decimal rounded results in binary FP, it's not all that hard -- forty
    years ago I wrote all the bond price functions for an MS DOS application
    using the 286's FP.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Jan 4 00:21:31 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    John Levine <johnl@taugh.com> schrieb:
    There was never any sort of type punning in FORTRAN.

    [Interesting history snipped]

Fortran ran on many different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.

    Unfortunately, this did not keep people from using a feature
that was officially prohibited by the standard, see for example https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
    clear need (having certain floating point constants ...

    Wow, that's gross but I see the need. If you wanted to do extremely
    machine specific stuff in Fortran, it didn't try to stop you.

    Fortunately, these days it's all IEEE; I think nobody uses IBM's
    base-16 FP numbers for anything serious any more.

    Agreed, except IEEE has both binary and decimal flavors.

    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.

    IBM does the arithmetic in hardware, their decimal arithmetic probably
    goes back to their adding and multiplying punches, far before computers.

    It's never been
    clear to me how much people use decimal FP. The use case is clear enough,
    it lets you control normalization so you can control the decimal precision
    of calculations, which is important for financial calculations like bond prices. On the other hand, while it is somewhat painful to get correct decimal rounded results in binary FP, it's not all that hard -- forty
    years ago I wrote all the bond price functions for an MS DOS application using the 286's FP.

    Speed? At least when using 128-bit densely packed decimal encoding
    on IBM machines, it is possible to get speed which is probably
    not attainable by software (but certainly not very good compared
    to what an optimized version could do, and POWER's 128-bit unit
    is also quite slow as a result).

    And people using other processors don't want to develop hardware,
    but still want to do the same applications. IIRC, everybody but
    IBM uses binary encoding of the significand and software, probably
    doing something similar to what you did.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Jan 4 04:12:25 2026
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    Agreed, except IEEE has both binary and decimal flavors.

    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.

    IBM does the arithmetic in hardware, their decimal arithmetic probably
    goes back to their adding and multiplying punches, far before computers.

    S/360 had packed decimal and zSeries even has vector ops for it. But it's different from DFP. Somewhere I saw a set of slides that said that the
    first DFP in z was done in millicode, with hardware later.

    It's never been
clear to me how much people use decimal FP. The use case is clear enough, it lets you control normalization so you can control the decimal precision ...

    Speed? At least when using 128-bit densely packed decimal encoding
    on IBM machines, it is possible to get speed which is probably
    not attainable by software ...

Software implementations of DFP would be slow, but if you know what you are doing you can get the correctly rounded decimal results using BFP which I
    would think would be faster.

    but still want to do the same applications. IIRC, everybody but
    IBM uses binary encoding of the significand and software, probably
    doing something similar to what you did.

    I used regular IEEE binary FP, with explicit code to do decimal rounding
    when needed. Like I said it was a pain but it wasn't all that hard.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Jan 4 08:06:50 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
Software implementations of DFP would be slow, but if you know what you are doing you can get the correctly rounded decimal results using BFP which I would think would be faster.

    If you know what you are doing, you use fixed point for those
    financial applications which DFP targets, because that's what finance
    used in the old days, and what they laid down in their rules (and
    still do; the Euro conversion (early 2000s) has to happen with 4
    decimal digits after the decimal point; which is noted as unusual,
    apparently 2 or 3 digits are more common). And whenever I bring that
    up, it is explained to me that DFP actually behaves like fixed point.
    Which leads to the question why one would use DFP rather than fixed
    point.

    In the bad old days of 16-bit processors, using the 64-bit mantissa of
    80-bit BFP as a large integer may have provided an advantage, but
    these days we have 64-bit integers, and 80-bit BFP is slower than
    64-bit BFP with 53-bit mantissa, so I don't see a reason to use any FP
    for financial calculations.
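
Concretely, "just use 64-bit integers" is nothing more than this sketch
(money as integer cents; no overflow checking, and division is the one
place where a rounding rule has to be chosen):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef int64_t cents_t;   /* ~92 * 10^15 currency units of headroom */

int main(void)
{
    cents_t unit_price = 1999;        /* 19.99 */
    cents_t total = unit_price * 3;   /* exact: 59.97 */
    printf("%" PRId64 ".%02" PRId64 "\n", total / 100, total % 100);
    return 0;
}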

    Concerning the question of how much DFP is used: My impression is that
    Intel's DFP implementation is not particularly efficient, and it sees
    no maintenance. And I have not read about other implementations. My
    guess is that there is so little use of this library that nobody
    bothers working on it, and the use that it sees is not in
    performance-critical code, so nobody works on making Intel's
    implementation faster or making another, faster implementation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 4 12:20:43 2026
    From Newsgroup: comp.arch

    On Sun, 04 Jan 2026 08:06:50 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    John Levine <johnl@taugh.com> writes:
Software implementations of DFP would be slow, but if you know what
you are doing you can get the correctly rounded decimal results
    using BFP which I would think would be faster.

    If you know what you are doing, you use fixed point for those
    financial applications which DFP targets, because that's what finance
    used in the old days, and what they laid down in their rules (and
    still do; the Euro conversion (early 2000s) has to happen with 4
    decimal digits after the decimal point; which is noted as unusual,
    apparently 2 or 3 digits are more common). And whenever I bring that
    up, it is explained to me that DFP actually behaves like fixed point.
    Which leads to the question why one would use DFP rather than fixed
    point.


DFP behaves as fixed point for as long as it has enough digits in the
significand to behave as fixed point.
It could use up all digits rather quickly:
- when you sum up a very big number with a very small number
(hopefully it does not happen in finance; big numbers there are not
really big and small numbers are not really small)
- when you multiply, esp. several times
- when you divide. If the result is inexact then just one division is enough

At this point, if you want it to continue to behave as fixed point, you
have to manually apply an operation called quantize, which typically would
cause rounding to lower precision.
Or, maybe, some languages can do it automatically. But I don't know
which languages those would be.
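
For those who have not met it, quantize rescales a value to a requested
exponent, rounding if digits are dropped. A toy C model on a
coefficient-times-power-of-ten pair (illustrative only; a real DFP
quantize also handles specials, inexact flags, and coefficient
overflow):

#include <stdint.h>

typedef struct { int64_t coeff; int exp; } dec_t;  /* value = coeff * 10^exp */

static dec_t quantize(dec_t x, int target_exp)
{
    if (x.exp > target_exp) {                /* add trailing zeros */
        while (x.exp > target_exp) { x.coeff *= 10; x.exp--; }
    } else if (x.exp < target_exp) {         /* drop digits, round once */
        int64_t div = 1;
        for (int e = x.exp; e < target_exp; e++) div *= 10;
        int64_t q = x.coeff / div, r = x.coeff % div;
        if (2 * r >= div)  q++;              /* half away from zero */
        if (2 * r <= -div) q--;
        x.coeff = q;
        x.exp = target_exp;
    }
    return x;
}

/* quantize((dec_t){12345, -3}, -2) yields {1235, -2}: 12.345 -> 12.35 */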


If the language is C and the compiler is gcc, then on POWER one can apply
the quantize op manually via the __builtin_dfp_quantize() family of built-ins.

I have no idea how one can do it on other gcc targets. It does not look
like gcc provides __builtin_bid_quantize(). If it exists, the manual does
not mention it.
Then again, the gcc manual mentions very few details about Decimal FP
support. It looks like work on Decimal FP was started ~10 years ago,
made progress for a year or two, and then interest was lost.

    In the bad old days of 16-bit processors, using the 64-bit mantissa of
    80-bit BFP as a large integer may have provided an advantage, but
    these days we have 64-bit integers, and 80-bit BFP is slower than
    64-bit BFP with 53-bit mantissa, so I don't see a reason to use any FP
    for financial calculations.

    Concerning the question of how much DFP is used: My impression is that Intel's DFP implementation is not particularly efficient,

If the gcc implementation is based on Intel's then my measurements posted
here a few weeks ago certainly agree, esp. for multiplication.
Disclaimer:
I only measured operations with 34 significant digits. It is possible
that this implementation is optimized for cases with a significantly
smaller number of digits.

    and it sees
    no maintenance. And I have not read about other implementations. My
    guess is that there is so little use of this library that nobody
    bothers working on it, and the use that it sees is not in performance-critical code, so nobody works on making Intel's
    implementation faster or making another, faster implementation.


An absence of easily accessible quantize operations seems to hint that
the gcc implementation has no production use at all.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 4 12:36:42 2026
    From Newsgroup: comp.arch

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


Since both formats have exactly identical semantics, in theory the mess
is not worse (and not better) than the two byte orders of IEEE binary FP.

I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
standard does not specify a few very important things about global state
and interoperability of BFP and DFP in the same process. Like whether
BFP and DFP have a common rounding mode or each one has a mode of its own.
The same question applies to exception flags and exception masks.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Jan 4 13:17:04 2026
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    John Levine <johnl@taugh.com> schrieb:
    There was never any sort of type punning in FORTRAN.

    [Interesting history snipped]

    Fortran ran on many Different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.

    Unfortunately, this did not keep people from using a feature
    that was officially prohibited by the standard, see for example
    https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
    clear need (having certain floating point constants ...

    Wow, that's gross but I see the need. If you wanted to do extremely
    machine specific stuff in Fortran, it didn't try to stop you.

    Fortunately, these days it's all IEEE; I think nobody uses IBM's
    base-16 FP numbers for anything serious any more.

    Agreed, except IEEE has both binary and decimal flavors.

    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.

    IBM does the arithmetic in hardware, their decimal arithmetic probably
    goes back to their adding and multiplying punches, far before computers.

    It's never been
clear to me how much people use decimal FP. The use case is clear enough, it lets you control normalization so you can control the decimal precision of calculations, which is important for financial calculations like bond
    prices. On the other hand, while it is somewhat painful to get correct
    decimal rounded results in binary FP, it's not all that hard -- forty
    years ago I wrote all the bond price functions for an MS DOS application
    using the 286's FP.

    Speed? At least when using 128-bit densely packed decimal encoding
    on IBM machines, it is possible to get speed which is probably
    not attainable by software (but certainly not very good compared
    to what an optimized version could do, and POWER's 128-bit unit
    is also quite slow as a result).

    And people using other processors don't want to develop hardware,
    but still want to do the same applications. IIRC, everybody but
    IBM uses binary encoding of the significand and software, probably
    doing something similar to what you did.

    Using binary mantissa encoding makes everything you do in DFP quite
easy, with the exception of (re-)normalization. Our own Michael S has
    come up with some really nice ideas here to make it (significantly) less painful.

    That said, for the classic AT&T phone bill benchmark you pretty much
    never do any normalization in the main processing loop, so software
    emulation with binary mantissa is perfectly OK.

    A somewhat similar problem is the 1brc (One Billion Rows Challenge)
    where you aggregate records containing two-digit signed temperature
values, all of which have one decimal digit.

Here the fastest approach is to simply do everything with int16_t/i16
min/max values and an i64 global accumulator. No DFP needed!
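
A miniature of that in C (field names illustrative): temperatures are
parsed straight to tenths, so "-12.3" becomes -123, and the only wide
arithmetic is the running sum:

#include <stdint.h>

struct station {
    int16_t  min, max;   /* tenths of a degree */
    int64_t  sum;        /* tenths; 10^9 samples fit with room to spare */
    uint32_t count;
};

static void update(struct station *s, int16_t tenths)
{
    if (tenths < s->min) s->min = tenths;
    if (tenths > s->max) s->max = tenths;
    s->sum += tenths;
    s->count++;
}

/* mean in tenths: s->sum / (int64_t)s->count, rounded as desired */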

    Terje
    PS. In reality, the 1brc is 90%+ determined by the speed of your chosen
    hash table implementation.
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Jan 4 13:22:18 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Agreed, except IEEE has both binary and decimal flavors.

    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.

    IBM does the arithmetic in hardware, their decimal arithmetic probably
    goes back to their adding and multiplying punches, far before computers.

    S/360 had packed decimal and zSeries even has vector ops for it. But it's different from DFP. Somewhere I saw a set of slides that said that the
    first DFP in z was done in millicode, with hardware later.

    It's never been
clear to me how much people use decimal FP. The use case is clear enough, it lets you control normalization so you can control the decimal precision ...

    Speed? At least when using 128-bit densely packed decimal encoding
    on IBM machines, it is possible to get speed which is probably
    not attainable by software ...

Software implementations of DFP would be slow, but if you know what you are doing you can get the correctly rounded decimal results using BFP which I would think would be faster.

    but still want to do the same applications. IIRC, everybody but
    IBM uses binary encoding of the significand and software, probably
    doing something similar to what you did.

    I used regular IEEE binary FP, with explicit code to do decimal rounding
    when needed. Like I said it was a pain but it wasn't all that hard.

    I did the same around 1983/84, using Borland's Turbo Pascal. I started
PC programming in 1982 to solve a problem for my father-in-law-to-be.

    I had to adhere to Norwegian legal accounting rules, so all the rounding
    had to be correct, but like you I found that with the total precision available and the known maximum number of operations before I had to do
    the final rounding, just a small epsilon added was enough to guarantee
    that the result would be OK.
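
A minimal C sketch of that approach, with an illustrative epsilon (the
real bound has to be derived from the actual worst-case operation
count):

#include <math.h>
#include <stdio.h>

/* Nudge by more than the worst-case accumulated binary error, then
   round half away from zero at two decimals. */
static double round_to_cents(double x)
{
    const double eps = 1e-9;                       /* illustrative bound */
    double nudged = x + (x < 0 ? -eps : eps);
    return floor(nudged * 100.0 + 0.5) / 100.0;
}

int main(void)
{
    /* 0.1 + 0.2 is 0.30000000000000004 in binary FP... */
    printf("%.2f\n", round_to_cents(0.1 + 0.2));   /* ...but prints 0.30 */
    return 0;
}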

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Jan 4 16:45:10 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
is not worse (and not better) than the two byte orders of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 4 19:35:13 2026
    From Newsgroup: comp.arch

    On Sun, 4 Jan 2026 16:45:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
mess is not worse (and not better) than the two byte orders of IEEE
    binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

It's true that there is no restriction, but it does not influence
semantics.
In binary encoding, when the value of the significand exceeds
10**(3*J+1)-1, it is a non-canonical representation of zero.
It is very similar to how declets in DPD are allowed to have all 1024 bit
patterns: each pattern has a defined numeric value in the range [0:999], but
only 1000 patterns are canonical, i.e. allowed both as inputs and as outputs
of arithmetic operations, and the remaining 24 patterns are non-canonical -
accepted as inputs, but never produced as outputs.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 4 18:47:20 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
Software implementations of DFP would be slow, but if you know what you are doing you can get the correctly rounded decimal results using BFP which I would think would be faster.

    If you know what you are doing, you use fixed point for those
    financial applications which DFP targets, because that's what finance
    used in the old days, and what they laid down in their rules (and
    still do; the Euro conversion (early 2000s) has to happen with 4
    decimal digits after the decimal point; which is noted as unusual,
    apparently 2 or 3 digits are more common). And whenever I bring that
    up, it is explained to me that DFP actually behaves like fixed point.
    Which leads to the question why one would use DFP rather than fixed
    point.

    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
    number of digits below the decimal point, in a control register ???!!!

    3-cycle ADD/SUB
    6-cycle MUL
    ~30-cycle DIV


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Jan 4 18:01:34 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    DFP behaves as fixed point for as long as it has enough digits in
the significand to behave as fixed point.
It could use up all digits rather quickly:
- when you sum up a very big number with a very small number

In fixed point, the very big number is an overflow, and the very
small number is 0.

(hopefully it does not happen in finance; big numbers there are not
really big and small numbers are not really small)

    Exactly. There are some places where the currency is vastly inflated
    (or currently undergoing high inflation, but I don't think that any
    inflation was ever so big that it could not be handled with 128-bit
    fixed point and a scale factor of 1/10000 (where the largest number is
    10^34).

    - when you multiply, esp. several times

    What do you want to multiply in financial applications that would
overflow 128-bit fixed point? If the multiplications result in
    rounding to zero, that's fine.

- when you divide. If the result is inexact then just one division is enough

    Yes, fixed point is inexact. You round (or truncate, whatever is
    specified; usually merchant's rounding) at the specified number of
    digits; that's what the rules say. If DFP behaves differently, it's
    not appropriate for these kinds of calculations.
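
A sketch of exactly that in C, on integer cents, with merchant's
rounding (half away from zero) as the assumed rule:

#include <stdint.h>

/* Divide an amount in cents, rounding the quotient half away from
   zero. The rounding rule is whatever the regulation specifies;
   merchant's rounding is assumed here. */
static int64_t div_cents(int64_t amount, int64_t divisor)
{
    int64_t q = amount / divisor;          /* C truncates toward zero */
    int64_t r = amount % divisor;
    int64_t ar = r < 0 ? -r : r;
    int64_t ad = divisor < 0 ? -divisor : divisor;
    if (2 * ar >= ad)
        q += ((amount < 0) == (divisor < 0)) ? 1 : -1;
    return q;
}

/* div_cents(10000, 3) == 3333: 100.00 split three ways is 33.33 */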

Then again, the gcc manual mentions very few details about Decimal FP
support. It looks like work on Decimal FP was started ~10 years ago,
made progress for a year or two, and then interest was lost.

    This supports my theory that nobody is using DFP.

An absence of easily accessible quantize operations seems to hint that
the gcc implementation has no production use at all.

    And that, too.

    OTOH, gcc apparently does support fixed-point (https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html), but that is
    based on a standard for embedded systems (https://www.open-std.org/JTC1/SC22/WG14/www/docs/n1005.pdf) and uses
    binary scale factors (specified as bits), so it's not the kind of
    fixed point useful for financial calculations. OTOH, C is probably
    not the language that is used for financial software.

    For Java, I found that it has BigDecimal in its library, which
    somewhat fits the bill. Each number comes with a scale that is a
    power of 10 (and the programmer actually communicates with the system
    by specifying log(scale)). The scale of the result of an arithmetic
    operation is determined by a MathContext if given, and some default if
    not given. For addition and subtraction the default looks reasonable,
    for multiply one probably wants to supply a MathContext unless you
    multiply a number with scale 10^0 with some other number (which might
actually be a frequent occurrence). Of course BigDecimal is boxed (and
    probably references BigInteger, another boxed type), so it's not
    particularly efficient unless the JIT compiler is doing heroic things;
    but that's what you get with Java.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Jan 4 14:12:52 2026
    From Newsgroup: comp.arch

    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
    number of digits below the decimal point, in a control register ???!!!

    I'm not sure I understand your proposal correctly, but the number of
    digits below the decimal point should not be a global setting because
    several computations can commonly happen at the same time with different
numbers of digits below the decimal point.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 4 20:14:09 2026
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    So, would it not be easier and faster to simply make a densely-packed 128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
    number of digits below the decimal point, in a control register ???!!!

    I'm not sure I understand your proposal correctly,

It is not so much of a question as it is my mind rambling through the possibilities. Loosely based on IBM 360--except 2× or 4× as long and
    stored in densely packed decimal instead of 4-bit digits. No decision
    on whether data is stored in registers or processed via memory.

    but the number of
    digits below the decimal point should not be a global setting because
several computations can commonly happen at the same time with different numbers of digits below the decimal point.

    OK, forgot that COBOL has each number defined with its own decimal
    location.

    Should each calculation have its own "rounding mode" or "what to do
    with bits that fall off below the defined Lower-end" ??

    It just seems to me that once the "container" is big enough to deal
    with world GDP (of 2100) in the least valuable currency in the world,
    that making the decimal point "float" adds little value.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Jan 4 13:05:19 2026
    From Newsgroup: comp.arch

    On 1/4/2026 12:14 PM, MitchAlsup wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
    number of digits below the decimal point, in a control register ???!!!

    I'm not sure I understand your proposal correctly,

It is not so much of a question as it is my mind rambling through the possibilities. Loosely based on IBM 360--except 2× or 4× as long and
    stored in densely packed decimal instead of 4-bit digits. No decision
    on whether data is stored in registers or processed via memory.

    but the number of
    digits below the decimal point should not be a global setting because
    several computations can commonly happen at the same time with different
numbers of digits below the decimal point.

    OK, forgot that COBOL has each number defined with its own decimal
    location.

    Should each calculation have its own "rounding mode" or "what to do
    with bits that fall off below the defined Lower-end" ??

    COBOL also allows specification of rounded or not on each calculation.
I don't know how it handles different rounding modes.


    It just seems to me that once the "container" is big enough to deal
    with world GDP (of 2100) in the least valuable currency in the world,
    that making the decimal point "float" adds little value.

It used to be that people cared about how much storage (both main memory
and disk/tape) each data item took, but I think those days are mostly over.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Jan 4 15:20:01 2026
    From Newsgroup: comp.arch

    On 1/4/2026 10:01 AM, Anton Ertl wrote:

    snip

Then again, the gcc manual mentions very few details about Decimal FP
support. It looks like work on Decimal FP was started ~10 years ago,
made progress for a year or two, and then interest was lost.

    This supports my theory that nobody is using DFP.

I don't know how many people, if any, are using DFP; however,

if it is used, it is probably used most on IBM Z series (and hence
they probably don't care about gcc);
if it is used, the people who use it most probably don't subscribe to
this NG.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 08:16:15 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!?

    128-bit binary integers are mostly good enough, and support for that
is ok in current architectures; division support could be better,
though (see below). Rescaling with a power of 10 is something that
    may merit additional hardware support if it occurs often enough; but I
    am not convinced that it occurs often enough:

    You usually don't need it for addition and subtraction, because the
    operands have the same scale factor, and the same scale factor as the
    result.

    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    For division, it seems to me that the most common case is division by
    a percentage that is applied to many dividends (maybe not in the USA,
    but certainly in Europe it is common to compute the price without VAT
    (sales tax) from the price with VAT; but there are only few VAT rates
    in each country); that can be turned into a two-stage operation that
    might include any necessary rescaling: compute an inverse that can
    then be used for a cheap multiplication-and-rounding operation (e.g.,
    where a power-of-2 scale factor is involved for computing something
    below the least significant digit, in order to implement rounding).
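
A sketch of the two-stage scheme in C, with a 2^64-scaled reciprocal
(gcc/clang __int128 assumed; rates and scaling are illustrative, and a
production version would have to prove correct rounding over the whole
input range):

#include <stdint.h>

typedef unsigned __int128 u128;

/* Stage 1: precompute round(2^64 * 100 / (100 + rate)), rate > 0. */
static uint64_t vat_inverse(unsigned rate_percent)
{
    u128 num = (u128)100 << 64;
    return (uint64_t)((num + (100 + rate_percent) / 2) / (100 + rate_percent));
}

/* Stage 2: net = round(gross * inv / 2^64), one multiply per price. */
static uint64_t net_from_gross(uint64_t gross_cents, uint64_t inv)
{
    u128 p = (u128)gross_cents * inv;
    return (uint64_t)((p + ((u128)1 << 63)) >> 64);
}

/* net_from_gross(1200, vat_inverse(20)) == 1000: 12.00 gross at 20%
   VAT is 10.00 net */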

    And yes, support for several rounding modes is needed when an inexact
    result is involved. Hardware may be helpful here.

    I have not done much financial programming, so maybe somebody else can complement my views.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 08:51:48 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Should each calculation have its own "rounding mode" or "what to do
    with bits that fall off below the defined Lower-end" ??

    Yes.

    If you look at Java's BigDecimal operations <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
they all have versions without and with MathContext (which includes a
    target scale and a rounding mode), and sometimes additional variants
    (e.g., divide() has variants where you pass just the rounding mode, or
    the rounding mode and scale individually instead of through a
    MathContext).

    One feature of BigDecimal is arbitrary precision. As we have
    discussed, that's not necessary for financial calculations.

    It just seems to me that once the "container" is big enough to deal
    with world GDP (of 2100) in the least valuable currency in the world,
    that making the decimal point "float" adds little value.

    I agree. It also adds little value otherwise, because the regulations
    specify fixed-point computations.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 09:08:25 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
if it is used, it is probably used most on IBM Z series.

    Possibly. But the lack of takeup of the Intel library and of the gcc
    support shows that "build it and they will come" does not work out for
    DFP. So even on System Z, if there is a takeup, it is probably small
    and the result of an effort by IBM to make programmers use DFP.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 10:21:31 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    One interesting aspect is that the interest rates I have seen are
    multiples of 1/800 (e.g., 1 3/4%=7/4%=14/8%=14/800). One can also
    represent these through decimal scales, but the decimal scale that
    allows to represent them is 1/100000 (1/800=125/100000). It may be
    more economical in bits to scale with 1/800 (or maybe 1/1600 to be
prepared for the next innovation in finance).

    For tax rates, IIRC I have also seen half percentages, so using a
    1/800 or 1/1600 scale factor may be a good idea for them, too.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 10:31:08 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!?

    128-bit binary integers are mostly good enough, and support for that
    is ok in current architectures; division support might be better,
    though, but see below. Rescaling with a power of 10 is something that
    may merit additional hardware support if it occurs often enough; but I
    am not convinced that it occurs often enough:

    You usually don't need it for addition and subtraction, because the
    operands have the same scale factor, and the same scale factor as the
    result.

    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    For division, it seems to me that the most common case is division by
    a percentage that is applied to many dividends (maybe not in the USA,
    but certainly in Europe it is common to compute the price without VAT
    (sales tax) from the price with VAT; but there are only few VAT rates
    in each country); that can be turned into a two-stage operation that
    might include any necessary rescaling: compute an inverse that can
    then be used for a cheap multiplication-and-rounding operation (e.g.,
    where a power-of-2 scale factor is involved for computing something
    below the least significant digit, in order to implement rounding).

    And yes, support for several rounding modes is needed when an inexact
    result is involved. Hardware may be helpful here.

    I have not done much financial programming, so maybe somebody else can complement my views.

    - anton

    There are many different kinds of bonds, each with its own calculation.
    Bonds were invented by the Dutch in the early 1600s.
    They are just a business deal to borrow money and repay it,
    and these calculations are standardizations of those terms.
    People have had 400 years to be creative in designing these deals,
    which is why there are many different calculations.

    Many bonds are time series polynomials with non-integer times.
    Many calculations use pow(base,exp) with a non-integer exponent.

    I used double for everything and my results were fine,
    matching the benchmarks exactly to their 11 digits
    and matching the HP Financial Calculator to 13 decimal places.

    Decimal would also need conversions with integers and BFP.
    Many of these values are coming from and going to databases
    and are exchanged with other systems.

    The book I have on bond calculations lists different rules for
    different types of bonds (municipal, corporate, T-bill, other),
    and rules and calculations for price given yield and yield given price.
    e.g.:

    - all prices and yields calculated to at least 10 significant digits

    - for municipal and corporate securities dollar prices should be accurate
    to seven places after the decimal, *truncating* to 3 places just prior
    to display.

    - for T-bills dollar price accuracy should be eight places after the
    decimal, *rounding* to seven places just prior to display.

    - for all other securities dollar price accuracy should be to seven
    places after the decimal, *rounding* to six places just prior to
    display.

    - calculations for yield should be at a minimum accurate to four places
    after the decimal, *rounding* to three places just prior to display.
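    To make the *truncating* vs *rounding* distinction concrete (Java; the
    price is a made-up value):

        import java.math.BigDecimal;
        import java.math.RoundingMode;

        public class DisplayRules {
            public static void main(String[] args) {
                BigDecimal price = new BigDecimal("98.7656789"); // internal
                // municipal/corporate: *truncate* for display
                System.out.println(
                    price.setScale(3, RoundingMode.DOWN));    // 98.765
                // other securities: *round* for display
                System.out.println(
                    price.setScale(3, RoundingMode.HALF_UP)); // 98.766
            }
        }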

    There are also many different ways of counting the number of days
    between dates, which are not always integers.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Jan 5 15:40:55 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    John Levine <johnl@taugh.com> writes:
    Software implementations of DFP would be slow, but if you know what you
    are doing you can get the correctly rounded decimal results using BFP
    which I would think would be faster.

    If you know what you are doing, you use fixed point for those
    financial applications which DFP targets, because that's what finance
    used in the old days, and what they laid down in their rules (and
    still do; the Euro conversion (early 2000s) has to happen with 4
    decimal digits after the decimal point; which is noted as unusual,
    apparently 2 or 3 digits are more common). And whenever I bring that
    up, it is explained to me that DFP actually behaves like fixed point.
    Which leads to the question why one would use DFP rather than fixed
    point.

    The Burroughs B3500 (1965) had both decimal fixed point and decimal floating point modes. By the second generation (B4700), Burroughs had dropped
    the decimal floating point since all the business (COBOL) customers
    were happy with the decimal fixed point (100 digits should be good enough
    for most financial calculations, even today).

    The floating point implementation supported an exponent range of -100
    to +100 and a full 100 digit mantissa.

    For the B4700 and successors, the decimal floating point was replaced
    with a floating point accumulator (same exponent range, but only a
    twenty digit mantissa).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 11:02:00 2026
    From Newsgroup: comp.arch

    EricP wrote:
    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!?

    128-bit binary integers are mostly good enough, and support for that
    is ok in current architectures; division support might be better,
    though, but see below. Rescaling with a power of 10 is something that
    may merit additional hardware support if it occurs often enough; but I
    am not convinced that it occurs often enough:

    You usually don't need it for addition and subtraction, because the
    operands have the same scale factor, and the same scale factor as the
    result.

    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    For division, it seems to me that the most common case is division by
    a percentage that is applied to many dividends (maybe not in the USA,
    but certainly in Europe it is common to compute the price without VAT
    (sales tax) from the price with VAT; but there are only few VAT rates
    in each country); that can be turned into a two-stage operation that
    might include any necessary rescaling: compute an inverse that can
    then be used for a cheap multiplication-and-rounding operation (e.g.,
    where a power-of-2 scale factor is involved for computing something
    below the least significant digit, in order to implement rounding).

    And yes, support for several rounding modes is needed when an inexact
    result is involved. Hardware may be helpful here.

    I have not done much financial programming, so maybe somebody else can
    complement my views.

    - anton

    Many bonds are time series polynomials with non-integer times.
    Many calculations use pow(base,exp) to a non-integer exponent.

    Looking at Black-Scholes derivative and options pricing

    https://en.wikipedia.org/wiki/Black-Scholes_pricing_formula#Black%E2%80%93Scholes_formula

    I see exp(), ln(), sqrt().
    I don't see any rules for accuracy.
    I found double was fine for calculating bonds.
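    For concreteness, the textbook call-price formula in binary64 (Java; a
    sketch only - the parameters and the polynomial normal-CDF
    approximation are my assumptions, since Java has no built-in erf):

        import static java.lang.Math.*;

        public class BlackScholes {
            // Standard normal CDF via an Abramowitz-Stegun 26.2.17 style
            // polynomial approximation (about 1e-7 absolute error).
            static double cnd(double x) {
                double t = 1.0 / (1.0 + 0.2316419 * abs(x));
                double poly = t * (0.319381530 + t * (-0.356563782
                            + t * (1.781477937 + t * (-1.821255978
                            + t * 1.330274429))));
                double tail = exp(-0.5 * x * x) / sqrt(2 * PI) * poly;
                return x >= 0 ? 1.0 - tail : tail;
            }

            // European call: spot s, strike k, rate r, volatility sigma,
            // time t in years; uses exp(), log(), sqrt() as noted above.
            static double call(double s, double k, double r,
                               double sigma, double t) {
                double d1 = (log(s / k) + (r + 0.5 * sigma * sigma) * t)
                          / (sigma * sqrt(t));
                double d2 = d1 - sigma * sqrt(t);
                return s * cnd(d1) - k * exp(-r * t) * cnd(d2);
            }

            public static void main(String[] args) {
                System.out.println(call(100, 100, 0.05, 0.2, 1.0)); // ~10.45
            }
        }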

    There are lots of other calculations:

    https://en.wikipedia.org/wiki/Financial_mathematics


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 11:05:41 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    One interesting aspect is that the interest rates I have seen are
    multiples of 1/800 (e.g., 1 3/4%=7/4%=14/8%=14/800). One can also
    represent these through decimal scales, but the decimal scale that
    allows to represent them is 1/100000 (1/800=125/100000). It may be
    more economical in bits to scale with 1/800 (or maybe 1/1600 to be
    prepared the next innovation in finance).

    For tax rates, IIRC I have also seen half percentages, so using a
    1/800 or 1/1600 scale factor may be a good idea for them, too.

    - anton

    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8. I can't remember when they
    switched to publishing in decimal.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jan 5 11:55:21 2026
    From Newsgroup: comp.arch

    If you look at Java's BigDecimal operations <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
    they all have versions without and with MathContext (which includes a
    target scale and a rounding mode), and sometimes additional variants
    (e.g., divide() has variants where you pass just the rounding mode, or
    the rounding mode and scale individually instead of through a
    MathContext).

    I wonder how that would compare in practice with a Rational type, where
    all arithmetic operations are exact (and thus don't need anything like
    a MathContext) and you simply provide a rounding function that takes
    two arguments: a "target scale" (in the form of a target denominator)
    and a rounding mode.
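    A minimal sketch of such a Rational type (Java; the class name and the
    choice of BigInteger components are mine, just to make the idea
    concrete):

        import java.math.BigDecimal;
        import java.math.BigInteger;
        import java.math.RoundingMode;

        final class Rational {
            final BigInteger num, den;  // invariant: den > 0, gcd(num,den)=1
            Rational(BigInteger n, BigInteger d) {
                if (d.signum() < 0) { n = n.negate(); d = d.negate(); }
                BigInteger g = n.gcd(d);
                num = n.divide(g); den = d.divide(g);
            }
            Rational add(Rational o) {  // exact, no rounding anywhere
                return new Rational(
                    num.multiply(o.den).add(o.num.multiply(den)),
                    den.multiply(o.den));
            }
            Rational mul(Rational o) {
                return new Rational(num.multiply(o.num), den.multiply(o.den));
            }
            Rational div(Rational o) {  // exact; division by zero is an error
                return new Rational(num.multiply(o.den), den.multiply(o.num));
            }
            // The single lossy step: round to multiples of 1/targetDen
            // with an explicit rounding mode.
            BigInteger roundTo(BigInteger targetDen, RoundingMode mode) {
                return new BigDecimal(num.multiply(targetDen))
                    .divide(new BigDecimal(den), 0, mode)
                    .toBigIntegerExact();
            }
        }

    Calling roundTo(BigInteger.valueOf(100), RoundingMode.HALF_EVEN) on
    such a value would then yield, say, an amount in cents.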

    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible. Maybe also figure out
    how to eliminate the use of bignums for the numerators. ]


    - Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Jan 5 19:26:05 2026
    From Newsgroup: comp.arch

    On Mon, 05 Jan 2026 11:55:21 -0500
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    If you look at Java's BigDecimal operations <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
    they all have versions without and with MathContext (which
    includes a target scale and a rounding mode), and sometimes
    additional variants (e.g., divide() has variants where you pass
    just the rounding mode, or the rounding mode and scale individually
    instead of through a MathContext).

    I wonder how that would compare in practice with a Rational type,
    where all arithmetic operations are exact (and thus don't need
    anything like a MathContext) and you simply provide a rounding
    function that takes two arguments: a "target scale" (in the form of a
    target denominator) and a rounding mode.

    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible. Maybe also figure out
    how to eliminate the use of bignums for the numerators. ]


    - Stefan
    exp() would be a challenge.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 17:40:15 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    If you look at Java's BigDecimal operations
    <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
    they all have versions without and with MathContext (which includes a
    target scale and a rounding mode), and sometimes additional variants
    (e.g., divide() has variants where you pass just the rounding mode, or
    the rounding mode and scale individually instead of through a
    MathContext).

    I wonder how that would compare in practice with a Rational type, where
    all arithmetic operations are exact (and thus don't need anything like
    a MathContext) and you simply provide a rounding function that takes
    two arguments: a "target scale" (in the form of a target denominator) and
    a rounding mode.

    BigDecimal is almost like what you imagine, except that the
    denominators are always powers of 10. Without a MathContext, addition,
    subtraction, and multiplication are exact, and division is also exact
    or produces an exception.

    The MathContext consists of the target precision (significant digits)
    and the rounding mode.

    Proper rational arithmetic (used, IIRC, in Prolog II) is also exact for
    division (and has no rounding), but you can get really long numerators
    and denominators.

    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible.

    For the kind of fixed point used for financial calculation rules, the
    scale of every calculation is statically known (it comes out of the
    rules), so a compiler for a programming language that has such
    fixed-point numbers as a native type (Cobol, Ada, anything else?) does
    not need to check every time whether rescaling is necessary (which
    probably happens for Java's BigDecimal).
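    A sketch of what a statically known scale buys (Java stand-in for the
    Cobol/Ada situation; the Money2 class is a made-up illustration):

        // Scale fixed at 2 decimal digits by the type itself; no run-time
        // rescaling checks are ever needed.
        final class Money2 {
            final long units;   // value * 100: 12.34 is stored as 1234
            Money2(long units) { this.units = units; }
            // same scale on both operands and the result: exact
            Money2 add(Money2 o) { return new Money2(units + o.units); }
            // price * number of pieces: result keeps the same scale
            Money2 times(long pieces) { return new Money2(units * pieces); }
        }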

    Maybe also figure out
    how to eliminate the use of bignums for the numerators.

    I don't think that's possible if the language specifies
    arbitrary-precision arithmetic, because the program processes input
    data that is coming from data sources that can contain
    arbitrarily large numbers.

    What is possible, and is done in various dynamically-typed languages,
    is to have the common case (a bignum that's actually small) unboxed, and
    use boxing only in those cases where the number exceeds the range of
    unboxed numbers. I have looked at the OpenJDK BigInteger
    implementation, and there the BigInteger is always boxed (as is
    everything else in Java that is an object).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 18:03:08 2026
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8.

    17 1/8% = 137/8% = 137/800

    I can't remember when they
    switched to publishing in decimal.

    But all those that I have seen published in decimal are also multiples
    of 1/8%, i.e., of 1/800.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 13:51:27 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8.

    17 1/8% = 137/8% = 137/800

    I can't remember when they
    switched to publishing in decimal.

    But all those that I have seen published in decimal are also multiples
    of 1/8%, i.e., of 1/800.

    - anton

    Ah... it was due to Spanish traders and gold doubloons about 400 years ago.

    Why the NYSE Once Reported Prices in Fractions
    https://www.investopedia.com/ask/answers/why-nyse-switch-fractions-to-decimals/


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 14:21:37 2026
    From Newsgroup: comp.arch

    EricP wrote:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8.

    17 1/8% = 137/8% = 137/800

    I can't remember when they
    switched to publishing in decimal.

    But all those that I have seen published in decimal are also multiples
    of 1/8%, i.e., of 1/800.

    - anton

    Ah... it was due to Spanish traders and gold doubloons about 400 years ago.

    Why the NYSE Once Reported Prices in Fractions
    https://www.investopedia.com/ask/answers/why-nyse-switch-fractions-to-decimals/

    And possibly the factor of 100 comes from the Basis Point, which is 0.01%.

    Basis Point: Meaning, Value, and Uses
    https://www.investopedia.com/terms/b/basispoint.asp

    Basis points only have to do with interest rates or yields, not prices,
    but since prices and yields convert back and forth, maybe in bygone days
    it was easier to do calculations in fixed-point quanta of 1/800.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jan 5 14:33:57 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    I wonder how that would compare in practice with a Rational type,
    where all arithmetic operations are exact (and thus don't need
    anything like a MathContext) and you simply provide a rounding
    function that takes two arguments: a "target scale" (in the form of a
    target denominator) and a rounding mode.

    [ Note: those two arguments together are basically the same thing as
    a MathContext. ]

    Michael S [2026-01-05 19:26:05] wrote:
    exp() would be a challenge.

    [ I assume you mean the case where the exponent is non-integer. ]
    Not any more than for other approaches, AFAICT. Most likely it would
    convert to and from float. If you want to be thorough, you'd let it
    take a MathContext argument and use arbitrary-precision floats to
    perform the computation if the precision of IEEE floats isn't
    sufficient.

    Anton Ertl [2026-01-05 17:40:15] wrote:
    BigDecimal is almost like what you imagine, except that the
    denominators are always powers of 10. Without a MathContext, addition,
    subtraction, and multiplication are exact, and division is also exact
    or produces an exception.

    Hmm... so they're rationals limited to denominators that are powers
    of 10? I guess it does save them from GCD-style computations to simplify
    the fractions.

    Proper rational arithmetic (used, IIRC, in Prolog II) is also exact
    for division (and has no rounding),

    It's also available in many other languages as part of the
    standard library.

    but you can get really long numerators and denominators.

    The only case where the numerators would need to get larger than for
    BigDecimal is for division (when BigDecimal produces an exception), so
    I guess that would argue in favor of providing an additional division
    operation that takes something like a MathContext to avoid the
    computation of the exact result before doing the rounding (or signaling
    an exception).

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible.

    Anton Ertl [2026-01-05 17:40:15] wrote:
    For the kind of fixed point used for financial calculation rules, the
    scale of every calculation is statically known (it comes out of the
    rules), so a compiler for a programming language that has such fixed
    point numbers as native type (Cobol, Ada, anything else?) does not
    need to check every time whether rescaling is necessary (which
    probably happens for Java's BigDecimal).

    Indeed, that's why I was suggesting it would be useful for the compiler
    to try and optimize away the management of the scales. If you can move
    it to types, it's of course even better, since it removes the need for
    the optimizer to figure out when and if the optimization can be
    applied.

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Maybe also figure out how to eliminate the use of bignums for
    the numerators.

    Anton Ertl [2026-01-05 17:40:15] wrote:
    I don't think that's possible if the language specifies
    arbitrary-precision arithmetic, because the program processes input
    data that is coming from data sources that can contain
    arbitrarily large numbers.

    I was assuming we're free to define the semantics of the Rational type,
    e.g. specifying a limit to the precision.

    What is possible, and is done in various dynamically-typed languages
    is to have the common case (a bignum that's actually small) unboxed, and
    use boxing only in those cases where the number exceeds the range of
    unboxed numbers.

    Or that, indeed.
    [ Tho, I tend to use the word "boxing" in a different way, where
    I consider both cases "boxed" (i.e. made to fit in a fixed-size
    (typically 64bit) "box"), just that one of them involves placing the
    data in a separate memory location and putting the "pointer + tag" in
    the box, whereas the other puts the "small integer + tag" in that
    same box. ]


    - Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Jan 6 12:35:20 2026
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than two bytes orders of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal, any mantissa
    corresponding to a number greater than the maximum allowed (1e34 AFAIR)
    is also illegal, and there are rules for how to handle both cases
    (without checking, I seem to remember that they should be treated as zero?)

    Happy New Year everyone!

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Jan 6 15:26:46 2026
    From Newsgroup: comp.arch

    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal.
    Silently accepted as input operands but never produced as results.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle both
    cases (without checking, i seem to remember that they should be
    treated as zero?)


    BID significand extension > max is indeed treated as zeros.
    Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i are
    in range [0:1].
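    Spelled out (a trivial Java loop; the "three apiece" is just the
    arithmetic of 24 spare patterns spread over these 8 values):

        public class NonCanonical {
            public static void main(String[] args) {
                for (int c = 0; c <= 1; c++)
                    for (int f = 0; f <= 1; f++)
                        for (int i = 0; i <= 1; i++)
                            // digit triple (8+c, 8+f, 8+i):
                            // 888, 889, 898, 899, 988, 989, 998, 999
                            System.out.println(
                                (8 + c) * 100 + (8 + f) * 10 + (8 + i));
            }
        }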

    Happy New Year everyone!

    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Jan 6 17:06:52 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as result.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle both
    cases (without checking, i seem to remember that they should be
    treated as zero?)


    BID significand extension > max is indeed treated as zeros.
    Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i are
    in range [0:1].

    OK, that is probably because allowing them on input is significantly faster/cheaper than having to detect and modify/trap/erase.

    That said, out-of-range mantissas could also have been accepted, except
    they would not have had a valid conversion to either DPD or ASCII.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Jan 6 17:26:22 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Anton Ertl [2026-01-05 17:40:15] wrote:
    BigDecimal is almost like what you imagine, except that the
    denominators are always powers of 10. Without MathContext addition,
    subtraction, and multiplication are exact, and division is also exact
    or produces an exception.

    Hmm... so they're rationals limited to denominators that are powers
    of 10? I guess it does save them from GCD-style computations to simplify
    the fractions.

    Yes.

    but you can get really long numerators and denominators.

    The only case where the numerators would need to get larger than for
    BigDecimal is for division (when BigDecimal produces an exception), so
    I guess that would argue in favor of providing an additional division
    operation that takes something like a MathContext to avoid the
    computation of the exact result before doing the rounding (or signaling
    an exception).

    If you have rationals as input numbers, addition and subtraction
    result in a denominator that has up to the sum of the number of bits in
    the operand denominators, and the numerator of the result can also get
    much larger than the numerators of the operands. Now consider what
    happens if you add up a lot of rational numbers.
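    A quick demonstration with exact fractions (Java; summing 1/n up to 50
    is just an arbitrary example):

        import java.math.BigInteger;

        public class DenGrowth {
            public static void main(String[] args) {
                BigInteger num = BigInteger.ZERO, den = BigInteger.ONE;
                for (int n = 2; n <= 50; n++) {    // num/den += 1/n, reduced
                    num = num.multiply(BigInteger.valueOf(n)).add(den);
                    den = den.multiply(BigInteger.valueOf(n));
                    BigInteger g = num.gcd(den);
                    num = num.divide(g);
                    den = den.divide(g);
                }
                // After only 49 exact additions the reduced denominator is
                // already dozens of bits (it divides lcm(2..50), roughly 2^71):
                System.out.println(den.bitLength());
            }
        }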

    For multiplication, in the worst case the numerators have the sum of
    the bits of the operand numerators, and likewise for the
    denominators.

    If you only have integer inputs and don't have division, that's
    (big?) integers, not rational numbers.

    Concerning the idea of division with inexact results, I don't think that
    those people who choose to use rational numbers will choose to use
    such a division operation in the usual case. If they were satisfied
    with inexact results, they would have chosen FP (or, for special
    requirements, something like BigDecimal).

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Maybe also figure out how to eliminate the use of bignums for
    the numerators.

    Anton Ertl [2026-01-05 17:40:15] wrote:
    I don't think that's possible if the language specifies
    arbitrary-precision arithmetic, because the program processes input
    data that is coming from data sources that can contain
    arbitrarily large numbers.
    arbitrarily-large numbers.

    I was assuming we're free to define the semantics of the Rational type,
    e.g. specifying a limit to the precision.

    For a general rational type, I expect that it would overflow a fixed
    size often enough that that would not be practical.

    For something like BigDecimal and a use in financial institutions, I
    think that a 128-bit mantissa (38 significant digits) is good enough
    for representing any amount of currency occurring in practice.

    [ Tho, I tend to use the word "boxing" in a different way, where
    I consider both cases "boxed" (i.e. made to fit in a fixed-size
    (typically 64bit) "box"), just that one of them involves placing the
    data in a separate memory location and putting the "pointer + tag" in
    the box, whereas the other puts the "small integer + tag" in that
    same box. ]

    https://en.wikipedia.org/wiki/Boxing_(computer_programming)

    describes the meaning in common use (that I use, too).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 17:56:23 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than two bytes orders of IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global state
    and interoperability of BFP and DFP in the same process. Like whether
    BFP and DFP have a common rounding mode or each one has a mode of its
    own. The same question applies to exception flags and exception masks.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 17:59:33 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as result.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle both
    cases (without checking, i seem to remember that they should be
    treated as zero?)


    BID significand extension > max is indeed treated as zeros.
    Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i are
    in range [0:1].

    OK, that is probably because allowing them on input is significantly faster/cheaper than having to detect and modify/trap/erase.

    With the calculation latencies of IBM Z-series, modify/trap/erase is
    no problem.

    That said, out of range mantissas could also have been accepted, except
    they would not have had a valid conversion to either DPD or ascii.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Jan 6 20:12:48 2026
    From Newsgroup: comp.arch

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global
    state and interoperability of BFP and DFP in the same process.
    Like whether BFP and DFP have a common rounding mode or each one has
    a mode of its own. The same question applies to exception flags and
    exception masks.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Jan 6 17:50:36 2026
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8.

    17 1/8% = 137/8% = 137/800

    I can't remember when they
    switched to publishing in decimal.

    But all those that I have seen published in decimal are also multiples
    of 1/8%, i.e., of 1/800.

    - anton

    Ah... it was due to Spanish traders and gold doubloons about 400 years ago.
    Why the NYSE Once Reported Prices in Fractions
    https://www.investopedia.com/ask/answers/why-nyse-switch-fractions-to-decimals/

    That's interesting, but it is about prices, not percentages (e.g.,
    interest rates or taxes).

    And possibly the factor of 100 comes from the Basis Point, which is 0.01%.

    If you mean the factor of 100 in 1/800, that comes from the %:
    1/100 is another way to write 1% (the "cent" in percent is 100, and
    the per indicates the division); in German some people write vH ("von
    Hundert", i.e. "of hundred") instead of % (IIRC especially in lawyery contexts).

    So when I write 1/8%, that's one 800th of the whole, i.e. 1/800.

    If you have to deal with bp (1bp=1/10000 of the whole), you need the
    scale factor 1/10000. If you have to deal with both bp and 1/8%, you
    can represent both with a scale factor of 1/20000:

    1bp = 2/20000
    1/8% = 25/20000

    But in that case I would probably go directly to a scale factor of
    1/100000, because the savings of 1/20000 over that are not big, and
    one is prepared for people using pcm.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Jan 6 20:15:15 2026
    From Newsgroup: comp.arch

    On Tue, 06 Jan 2026 17:59:33 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit of a
    mess.

    Since both formats have exactly identical semantics, in theory
    the mess is not worse (and not better) than two bytes orders
    of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as result.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle
    both cases (without checking, i seem to remember that they
    should be treated as zero?)


    BID significand extension > max is indeed treated as zeros.
    Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i
    are in range [0:1].

    OK, that is probably because allowing them on input is
    significantly faster/cheaper than having to detect and
    modify/trap/erase.

    With the calculation latencies of IBM Z-series, modify/trap/erase is
    no problem.


    How do you know the calculation latencies of the IBM Z-series?
    Did they make that information public?

    That said, out of range mantissas could also have been accepted,
    except they would not have had a valid conversion to either DPD or
    ascii.

    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Jan 6 18:14:51 2026
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Many bonds are time series polynomials with non-integer times.
    Many calculations use pow(base,exp) to a non-integer exponent.

    Looking at Black-Scholes derivative and options pricing

    https://en.wikipedia.org/wiki/Black-Scholes_pricing_formula#Black%E2%80%93Scholes_formula

    I see exp(), ln(), sqrt().
    I don't see any rules for accuracy.
    I found double was fine for calculating bonds.

    There are lots of other calculations:

    https://en.wikipedia.org/wiki/Financial_mathematics

    My guess is that for the kinds of complex computations where you have
    such things, you don't get such detailed rules as for stuff like taxes
    and accounting which have relatively simple computations.

    And in that case it's probably ok not to use fixed point or other
    decimal-based stuff; I guess that the parties involved specify a
    specific number of significant digits (maybe 10; who cares for EUR 0.1
    in the interest of a EUR 1G account?), and then binary64 is good
    enough if the formula is numerically stable. In case of instability
    or a badly conditioned problem, the parties involved should probably
    put the formula in the contract, and then do the calculation exactly
    as in the formula. Should be good enough to show due diligence.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 19:29:40 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    More like justification for the facility itself.

    But also note: DPD can be converted to ASCII with all digits in
    parallel, without any division going on. Binary does not have that
    property.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global
    state and interoperability of BFP and DFP in the same process.
    Like whether BFP and DFP have a common rounding mode or each one has
    a mode of its own. The same question applies to exception flags and
    exception masks.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 19:35:34 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:59:33 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit of a
    mess.

    Since both formats have exactly identical semantics, in theory
    the mess is not worse (and not better) than two bytes orders
    of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as result.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle
    both cases (without checking, i seem to remember that they
    should be treated as zero?)


    BID significand extension > max is indeed treated as zeros. Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i
    are in range [0:1].

    OK, that is probably because allowing them on input is
    significantly faster/cheaper than having to detect and
    modify/trap/erase.

    With the calculation latencies of IBM Z-series, modify/trap/erase is
    no problem.


    How do you know the calculation latencies of the IBM Z-series?
    Did they make that information public?

    9-15 months ago there was a presentation of their latest mainframe
    showing the pipeline lengths.

    Decode was on the order of 20 cycles, down from the top left;
    execute was horizontal across the middle;
    retire was on the order of 12 cycles, down from the top right.

    That said, out of range mantissas could also have been accepted,
    except they would not have had a valid conversion to either DPD or
    ascii.

    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Jan 6 23:09:33 2026
    From Newsgroup: comp.arch

    On Tue, 06 Jan 2026 19:35:34 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:59:33 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit
    of a mess.

    Since both formats have exactly identical semantics, in
    theory the mess is not worse (and not better) than two
    bytes orders of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as
    result.
    any mantissa
    corresponding to a number greater than the maximum allowed
    (1e34 afair) is also illegal, and there are rules for how to
    handle both cases (without checking, i seem to remember that
    they should be treated as zero?)


    BID significand extension > max is indeed treated as zeros. Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f,
    and i are in range [0:1].

    OK, that is probably because allowing them on input is
    significantly faster/cheaper than having to detect and modify/trap/erase.

    With the calculation latencies of IBM Z-series, modify/trap/erase
    is no problem.


    How do you know the calculation latencies of the IBM Z-series?
    Did they make that information public?

    9-15 months ago there was a presentation of their latest mainframe
    showing the pipeline lengths.

    Decode was on the order of 20 cycles, down from the top left;
    execute was horizontal across the middle;
    Retire was on the order of 12 cycles, down from the top right;


    Back 20 years ago Intel used to have pipelines of comparable depth
    (IIRC, ~35 cycles in the 3rd and 4th generations of Pentium 4). But
    despite that, latency of simple ALU ops was 1 clock. Latency of L1D hit
    was 4 clocks, long for 2005, but standard today. Latencies of FMUL
    and FADD were 7 and 5 clocks, respectively - long, but not
    extraordinary.

    IBM's own POWER6 18 years ago had an integer pipeline close to 30
    stages and an FP pipeline of around 35 stages. However, FP MUL/ADD/FMA
    latency was 6 or 7 clocks.

    I would expect similar or shorter latency figures for BFP on modern IBM
    z. Likely shorter, because today they have far more silicon to throw
    at various bypasses.
    Now, in the case of DFP I don't want to guess, because I have no basis
    for guessing.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Tue Jan 6 22:06:00 2026
    From Newsgroup: comp.arch

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the field I
    work in, that would make some things much simpler. I did try to interest
    AMD in the idea in the early days of x86-64, but they didn't bite.

    John
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 13:34:24 2026
    From Newsgroup: comp.arch

    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John
    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?
    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez exchange spiced with a small portion of black magic)
    was not up to the task. The GNU Octave implementation was somewhat
    worse yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    or a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate results
    caused trouble.
    In both cases quad-precision FP was key to the solution.
    For the DCT (FFT) I went for a full re-implementation at higher
    precision.
    For the linear solver, I left the LU decomposition, which happens to be
    the heavy O(N**3) part, in DP. Quad precision was applied only during
    the final solver stages - forward propagation, back propagation,
    calculation of the residual error vector, and repetition of forward and
    back propagation. All those parts are O(N**2). That modification was
    sufficient to improve the precision of the result almost to the best
    possible in the DP FP format, and sufficient for good convergence of
    the Parks-McClellan algorithm.
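
    For flavor, a minimal sketch of that refinement scheme (Java, which
    has no binary128, so BigDecimal's 34-digit DECIMAL128 context stands
    in for quad precision here; the tiny solver and example system are
    illustrative only):

        import java.math.BigDecimal;
        import java.math.MathContext;

        public class Refine {
            static final MathContext QP = MathContext.DECIMAL128; // 34 digits

            // Plain Gaussian elimination with partial pivoting, in double.
            // (A real implementation factors once and reuses the LU
            // factors, keeping each refinement step O(N**2) as above.)
            static double[] solveDP(double[][] a0, double[] b0) {
                int n = b0.length;
                double[][] a = new double[n][];
                double[] b = b0.clone();
                for (int i = 0; i < n; i++) a[i] = a0[i].clone();
                for (int k = 0; k < n; k++) {
                    int p = k;
                    for (int i = k + 1; i < n; i++)
                        if (Math.abs(a[i][k]) > Math.abs(a[p][k])) p = i;
                    double[] tr = a[k]; a[k] = a[p]; a[p] = tr;
                    double tb = b[k]; b[k] = b[p]; b[p] = tb;
                    for (int i = k + 1; i < n; i++) {
                        double m = a[i][k] / a[k][k];
                        for (int j = k; j < n; j++) a[i][j] -= m * a[k][j];
                        b[i] -= m * b[k];
                    }
                }
                double[] x = new double[n];
                for (int i = n - 1; i >= 0; i--) {
                    double s = b[i];
                    for (int j = i + 1; j < n; j++) s -= a[i][j] * x[j];
                    x[i] = s / a[i][i];
                }
                return x;
            }

            // One refinement step: residual in extended precision,
            // correction solved in DP, then x updated in place.
            static void refine(double[][] a, double[] b, double[] x) {
                int n = b.length;
                double[] r = new double[n];
                for (int i = 0; i < n; i++) {
                    BigDecimal s = new BigDecimal(b[i]);
                    for (int j = 0; j < n; j++)
                        s = s.subtract(new BigDecimal(a[i][j])
                                .multiply(new BigDecimal(x[j]), QP), QP);
                    r[i] = s.doubleValue();
                }
                double[] d = solveDP(a, r);
                for (int i = 0; i < n; i++) x[i] += d[i];
            }

            public static void main(String[] args) {
                double[][] a = {{1e-8, 1}, {1, 1}};
                double[] b = {1, 2};
                double[] x = solveDP(a, b);
                refine(a, b, x);
                System.out.println(x[0] + " " + x[1]);
            }
        }
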
    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were
    not heavy. They were heavy all right, thank you very much.
    Using quad precision only when necessary helped.
    But what helped more is not being hesitant: doing things instead of
    worrying that they would be too slow.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 15:06:27 2026
    From Newsgroup: comp.arch

    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit of a
    mess.

    Since both formats have exactly identical semantics, in theory
    the mess is not worse (and not better) than two bytes orders of
    IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    More like justification for the facility itself.

    But also note: DPD can be converted to ASCII with all digits in
    parallel, without any division going on. Binary does not have that
    property.


    I took a look at how IBM exploits this property.
    I don't have an up-to-date zArch manual. It's probably easily
    available, but right now I have no time or desire to search.
    So I looked at POWER, which tends to copy DFP stuff from zArch with a
    one-generation time gap.
    POWER ISA v.3.0 (2015) has following relevant instructions:

    ddedpd - DFP Decode DPD to BCD
    For Decimal128 it has two forms
    - convert 32 rightmost digits of significand (unsigned)
    - convert 31 rightmost digits of significand (signed)
    With IBM I am never sure what they call 'rightmost' :(

    If you wonder about the couple of remaining digits, IBM has the
    following helper instructions:
    dscli - DFP shift significand left immediate
    dscri - DFP shift significand right immediate
    Once again, since it's IBM, I am not sure about directions.

    dxex - DFP extract biased exponent.
    The exponent is extracted in binary form, not in BCD.


    They also have instructions that work in the opposite direction:
    denbcd - DFP encode BCD to DPD
    It converts a signed 31-digit or unsigned 32-digit BCD-encoded integer
    to DPD with exponent = 0.

    diex - DFP insert biased exponent.
    Here too the exponent is in binary form, not in BCD.


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.
    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Wed Jan 7 13:16:00 2026
    From Newsgroup: comp.arch

    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you couple of years ago how fast do want binary128
    in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave totally
    unrealistic answer like "the same as binary64".
    May be, nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware binary128?

    John
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jan 7 15:24:38 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 17:55:07 2026
    From Newsgroup: comp.arch

    On Wed, 7 Jan 2026 13:16 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <20260107133424.00000e99@yahoo.com>,
    already5chosen@yahoo.com (Michael S) wrote:

    I already asked you couple of years ago how fast do want binary128
    in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave totally
    unrealistic answer like "the same as binary64".
    May be, nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    John

    I think that you are asking the wrong question, but I'd answer nevertheless.

    For a design that takes a significant amount of design effort and
    non-trivial silicon resources (say +3-5% in core area):

    SIMD throughput within the same TDP: 1/4th of DP FP; maybe 1/3rd if the
    designers worked very hard.
    Latency: assuming a 4-clock FMA for DP FP, a 9-clock QP FMA sounds very
    realistic; maybe 8 clocks. Mitch could answer better. FADD can be
    faster - 6 sounds realistic.

    Another extreme in the design space is what IBM did on POWER9.
    I would guess that there the silicon resources dedicated to
    quad-precision BFP were below 0.5% of the core area. Likely below 0.1%.
    They did a scalar (i.e., non-SIMD) quad-FP unit. FADD is
    pipelined, but FMUL/FMA is very minimally pipelined (at most 2
    operations proceed simultaneously).
    Throughput/latency table (T = throughput in ops/clock, L = latency in
    clocks):

    Oper    DP T   DP L   QP T   QP L
    ADD     4      5-7    1      12
    MUL     4      5-7    1/13   24
    MADD    4      5-7    1/13   24

    As you can see, POWER9 double-precision throughput/latency numbers are
    somewhat worse than what we are accustomed to on x86-64 and on high-end
    ARM64. However, even relative to those not-great numbers, the throughput
    of QP FMA is 52 times lower and the latency ~4 times higher.


    And still, it all depends on the application. If all an application
    does is multiplication or decomposition of big matrices, then migration
    to a minimalistic QP engine similar to the one in POWER9 will cause a
    major slowdown (then again, as shown by my anecdote, sometimes DP is
    inadequate for exactly those tasks).
    But most applications are not like that. Even for activities that are
    normally considered numerically intensive, like recalculation of a huge
    spreadsheet, I'd expect at most a few percent of slowdown on POWER9 QP.
    I don't know where your application is placed in this spectrum. Most
    likely, you don't know either. Until you try! And in order to get a
    feeling you don't need hardware. Use a software implementation.
    Experiments, even with a not-very-good software implementation like the
    one in gcc on x86-64 and ARM64, will give you a massively better feel
    for a lot of questions. They will put you in a position of knowledge
    when proposing something to AMD, Intel or Arm.
    Which, of course, does not guarantee that they will bite your bait. I'd
    even dare to say that for as long as the current "AI" bubble lasts they
    will not bite. But it will not last forever.
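
    For example, gcc already gives you software binary128 as __float128,
    with printing helpers in libquadmath; a minimal sketch (link with
    -lquadmath):

        #include <quadmath.h>
        #include <stdio.h>

        int main(void)
        {
            __float128 x = 1.0Q / 3.0Q;   /* soft-float binary128 */
            char buf[64];
            quadmath_snprintf(buf, sizeof buf, "%.33Qg", x);
            puts(buf);
            return 0;
        }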



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 18:06:59 2026
    From Newsgroup: comp.arch

    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide
    word. I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'
    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jan 7 16:41:26 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
    a move from the BCD '0123456789abcdef' to the corresponding ASCII
    bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT DATA 8 UN 0123456789 // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Jan 7 17:32:17 2026
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
    a move from the BCD '0123456789abcdef' to the corresponding ASCII
    bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT DATA 8 UN 0123456789 // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation on strings
    of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given such a (fast) transformation it is very easy to add the proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.
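
    A minimal C sketch of that masking-and-shifting idea, one 8-digit half
    at a time (names illustrative): each step doubles the spacing between
    digits, and the final OR adds the zone.

        #include <stdint.h>

        /* Spread 8 packed BCD digits (one per nibble) into one digit
           per byte, then zone to ASCII. */
        static uint64_t bcd8_to_ascii(uint32_t packed)
        {
            uint64_t x = packed;
            x = ((x << 16) | x) & 0x0000FFFF0000FFFFull;
            x = ((x <<  8) | x) & 0x00FF00FF00FF00FFull;
            x = ((x <<  4) | x) & 0x0F0F0F0F0F0F0F0Full;
            return x | 0x3030303030303030ull;   /* '0'..'9' */
        }

    For Michael's 16-digit example, call it twice, once for each 32-bit
    half; the two halves are independent.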
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Jan 7 18:38:26 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than the two byte orders of IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global state
    and the interoperability of BFP and DFP in the same process. Like whether
    BFP and DFP have a common rounding mode or each one has a mode of its own.
    The same question applies to exception flags and exception masks.

    In hardware a division by 10 is either an adjustment of the exponent,
    which is equally fast for both encodings, or for a real division DPD
    just requires unpacking of all the 10-bit fields (I'm assuming IBM does
    this in parallel for all 11 groups, so max one cycle), then shifting all
    the nybbles down one position before the reverse to pack them back up, probably including a rounding step before the repack.

    This operation is very closely related to the general case of having to re-normalize after any operation which would require that, i.e. commonly
    for DFMUL, very seldom for DFADD/DFSUB, and almost always for DFDIV.

    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int. No matter how we do it, two iterations are enough to
    handle even maximally large numbers, and that would require 4 64x64->128
    MULs per iteration. Since these would pipeline nicely I'm guessing it
    would be doable in 15-30 cycles total?
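
    The basic building block is well known; a sketch in C, using the exact
    reciprocal for 10 (two such steps, on 128-bit products, would handle a
    full-width significand as described above):

        #include <stdint.h>

        /* x/10 for any uint64_t x: multiply by 0xCCCCCCCCCCCCCCCD,
           which is ceil(2^67/10), and keep the high bits
           (gcc/clang unsigned __int128). */
        static uint64_t div10(uint64_t x)
        {
            unsigned __int128 p = (unsigned __int128)x * 0xCCCCCCCCCCCCCCCDull;
            return (uint64_t)(p >> 67);
        }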

    So yes, scaling by a power of ten is the one operation where DPD is
    clearly much faster, but if you try to implement DPD in software, then
    you have to handle the unpack and pack operations, and they could easily
    take the same or even more time.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 17:39:25 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 7 Jan 2026 13:16 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <20260107133424.00000e99@yahoo.com>,
    already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    John

    I think that you are asking the wrong question, but I'll answer nevertheless.

    For a design that takes a significant amount of design effort and
    non-trivial silicon resources (say +3-5% in core area):

    SIMD throughput within the same TDP: 1/4th of DP FP. Maybe 1/3rd if
    the designers worked very hard.
    Latency: assuming a 4-clock FMA for DP FP, a 9-clock QP FMA sounds very
    realistic. Maybe 8 clocks. Mitch could answer better. FADD can be
    faster - 6 clocks sounds realistic.

    Those are realistic numbers when the designers work hard AND operands
    are shipped in 1 cycle--add 1 if B128 is shipped in 2 cycles.

    The other extreme in the design space is what IBM did on POWER9.
    I would guess that the silicon resources dedicated to quad-precision
    BFP there were below 0.5% of the core area. Likely, below 0.1%.
    They did a scalar (i.e. non-SIMD) quad-precision FP unit. FADD is
    pipelined, but FMUL/FMA is only minimally pipelined (at most 2
    operations proceed simultaneously).
    Throughput/latency table (T = results per clock, L = clocks):

    Oper :  DP T  DP L  | QP T  QP L
    ADD  :  4     5-7   | 1     12
    MUL  :  4     5-7   | 1/13  24
    MADD :  4     5-7   | 1/13  24

    As you can see, POWER9 double-precision throughput/latency numbers are
    somewhat worse than what we are accustomed to on x86-64 and on high-end
    ARM64. However, even relative to those not-great numbers, the throughput
    of QP FMA is 52 times lower and the latency ~4 times higher.


    And still, it all depends on the application. If all an application
    does is multiplication or decomposition of big matrices, then migration
    to a minimalistic QP engine similar to the one in POWER9 will cause a
    major slowdown (then again, as shown by my anecdote, sometimes DP is
    inadequate for exactly those tasks).
    But most applications are not like that. Even for activities that are
    normally considered numerically intensive, like recalculation of a huge
    spreadsheet, I'd expect at most a few percent of slowdown on POWER9 QP.
    I don't know where your application is placed in this spectrum. Most
    likely, you don't know either. Until you try! And in order to get a
    feeling you don't need hardware. Use a software implementation.
    Experiments, even with a not-very-good software implementation like the
    one in gcc on x86-64 and ARM64, will give you a massively better feel
    for a lot of questions. They will put you in a position of knowledge
    when proposing something to AMD, Intel or Arm.
    Which, of course, does not guarantee that they will bite your bait. I'd
    even dare to say that for as long as the current "AI" bubble lasts they
    will not bite. But it will not last forever.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 17:44:08 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide
    word. I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'
    Without HW help it is not fast.

    With Extract and Insert instructions this becomes 16 extracts all
    concurrent, and 16 inserts, 8 serially dependent pairs. With shift
    instructions only--you are on your own.

    In HW "its just wires."

    Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Jan 7 18:47:47 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than the two byte orders of
    IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    More like justification for the facility itself.

    But also note: DPD can be converted to ASCII, all data in parallel,
    without any division going on. Binary does not have that property.

    a) Conversion to ASCII is _never_ in the critical path, or if it is,
    then the problem is trivial.

    b) Fast Binary to ASCII is a solved problem: i.e. easily doable in less
    than 50 clock cycles even for 128-bit values.
    I invented the original unsigned_to_ascii() conversion algorithm ~30
    years ago, taking advantage of fast multipliers and splitting the input
    into multiple parts which are then converted in parallel using simple
    mul_by_5 operations on a scaled input.

    My original code which I posted here in c.arch was for the 32-bit CPUs
    we had at the time, but extending to 64 or even 128-bit inputs is straightforward.
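
    Not the original code, but a C sketch of the same splitting idea: one
    (strength-reduced) division splits 16 digits into two independent
    8-digit halves, and each half is then converted with one fixed-point
    multiply per digit, no divisions at all:

        #include <stdint.h>

        /* Convert 0 <= v < 10^16 to 16 ASCII digits. 1441151881 is
           ceil(2^57/10^8), so f and g hold each half as a 0.57
           fixed-point fraction; each *10 pushes out the next digit. */
        static void u64_to_ascii16(uint64_t v, char out[16])
        {
            uint64_t hi = v / 100000000;   /* compiled to a multiply */
            uint64_t lo = v % 100000000;
            uint64_t f = hi * 1441151881u;
            uint64_t g = lo * 1441151881u;
            for (int i = 0; i < 8; i++) {
                f *= 10; g *= 10;
                out[i]     = (char)('0' + (f >> 57));
                out[i + 8] = (char)('0' + (g >> 57));
                f &= (1ull << 57) - 1;
                g &= (1ull << 57) - 1;
            }
        }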

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Jan 7 18:56:12 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. The Gnu Octave implementation was somewhat
    worse yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was a Discrete Cosine Transform and the underlying FFT engine
    for N around 32K.
    The second was solving a system of linear equations for N around 1000
    or a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate results
    caused trouble.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left the LU decomposition, which happens to be
    the heavy O(N**3) part, in DP. Quad-precision was applied only during
    the final solver stages - forward propagation, back propagation,
    calculation of the residual error vector and repetition of forward and
    back propagation. All those parts are O(N**2). That modification was
    sufficient to improve the precision of the result almost to the best
    possible in the DP FP format.
    And sufficient for good convergence of the Parks-McClellan algorithm.


    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.
    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    an error budget calculation across their algorithms.
    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.
    For the rest, just having fp128 fast enough that it could be applied
    naively would solve a number of problems.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Jan 7 17:57:34 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    If you look at Java's BigDecimal operations
    <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
    they all have versions without and with MathContext (which includes a
    target scale and a rounding mode), and sometimes additional variants
    (e.g., divide() has variants where you pass just the rounding mode, or
    the rounding mode and scale individually instead of through a
    MathContext).

    I wonder how that would compare in practice with a Rational type, where
    all arithmetic operations are exact (and thus don't need anything like
    a MathContext) and you simply provide a rounding function that takes
    two argument: a "target scale" (in the form of a target denominator) and
    a rounding mode.

    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible. Maybe also figure out
    how to eliminate the use of bignums for the numerators. :-) ]

    I use a computer algebra system every day. One possibility here
    is to use arbitrary-precision fractions. Observations:
    - once there is a more complex computation, numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcds, but if you try to save on
    gcds and work with unsimplified fractions you may get tremendous
    blowup (like million-digit numbers in what should be a reasonably
    small computation).
    - in general, if one needs a numeric approximation, then arbitrary-
    precision (software) floating point is much faster

    BTW: sometimes people write a rational type/class that uses fixed-size
    numerators and denominators. Such a type is useless once
    there is a longer/less regular sequence of operations: fixed-size
    numbers simply overflow too easily.

    BTW2: the usual trick with rational operations is to estimate the size
    of the final result. If the size of the final result is known with
    reasonable accuracy, then computation using finite fields is usually
    much faster and allows exact recovery of the final result.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Jan 7 18:58:04 2026
    From Newsgroup: comp.arch

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Jan 7 10:22:52 2026
    From Newsgroup: comp.arch

    On 1/7/2026 5:06 AM, Michael S wrote:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit of a
    mess.

    Since both formats have exactly identical semantics, in theory
    the mess is not worse (and not better) than two bytes orders of
    IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    More like justification for the facility itself.

    But also note: DPD can be converted to ASCII all data in parallel
    without any division going on. Binary does not have that property.


    I took a look at how IBM exploits this property.
    I don't have an up-to-date zArch manual. It's probably easily available,
    but right now I have no time or desire to search.
    So I looked at POWER, which tends to copy DFP stuff from zArch with a
    one-generation time gap.
    POWER ISA v.3.0 (2015) has the following relevant instructions:

    ddedpd - DFP Decode DPD to BCD
    For Decimal128 it has two forms
    - convert 32 rightmost digits of significand (unsigned)
    - convert 31 rightmost digits of significand (signed)
    With IBM I am never sure what they call 'rightmost' :(

    If you wonder about the couple of remaining digits, IBM has the
    following helper instructions:
    dscli - DFP shift significand left immediate
    dscri - DFP shift significand right immediate
    Once again, since it's IBM, I am not sure about directions.

    dxex - DFP extract biased exponent.
    The exponent is extracted in binary form, not in BCD.


    They also have instructions that work in the opposite direction:
    denbcd - DFP encode BCD to DPD
    It converts signed 31-digit or unsigned 32-digit BCD-encoded integer to
    DPD with exponent=0

    diex - DFP insert biased exponent.
    Here too the exponent is in binary form, not in BCD.


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.
    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    On Z series, that sounds like the unpack instruction, available since
    the decimal arithmetic extension in the S/360, though, being Z series,
    it uses EBCDIC rather than ASCII.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jan 7 18:38:10 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int.

    2^64 < 10^20

    How do you put 26/27 decimal digits in 64 bits? Or do you mean 27
    quinary digits? What would that be good for?

    No matter how we do it, two iterations are enough to
    handle even maximally large numbers, and that would require 4 64x64->128
    MULs per iteration. Since these would pipeline nicely I'm guessing it
    would be doable in 15-30 cycles total?

    On Skylake the latency of a 64x64->128 multiplication is 6 cycles (4
    cycles for the lower 64 bits), and I expect it to be lower on newer
    hardware. The pipelined multiplications should be done by cycle 9.
    There are also some additions involved, but I would not expect them to
    increase the latency to 15 cycles. What other operations do you have
    in mind that would result in 15-30 cycles? For scaling you don't need
    the remainder, only some rounding.

    So yes, scaling by a power of ten is the one operation where DPD is
    clearly much faster

    I would not bet on it. It needs to unpack the 34 digits into 136
    bits, do a 136-bit shift, then repack into DPD. Either they widen the
    data path beyond what they normally do, or they do it in parcels of 64
    bits or less, and the end result can easily take a similar number of
    cycles as 128-bit binary multiplication with the reciprocal. My guess
    is that they did the slow implementation at first, and then there was
    so little takeup of DFP that the slow implementation is good enough to
    this day.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jan 7 19:14:08 2026
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
    a move from the BCD '0123456789abcdef' to the corresponding ASCII
    bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT  DATA 8 UN 0123456789  // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA             // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation on strings
    of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given such a (fast) transformation it is very easy to add the proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.

    Since you know that the zone digit after transformation will
    always be zero, a bitwise "OR" of the ASCII/EBCDIC value
    for '0' (0x30/0xf0) over each byte should be sufficient.


    e.g.
    000102030405060708090a0b0c0d0e0f | 30303030303030303030303030303030


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jan 7 19:19:22 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    MitchAlsup wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than the two byte orders of IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global state
    and the interoperability of BFP and DFP in the same process. Like whether
    BFP and DFP have a common rounding mode or each one has a mode of its own.
    The same question applies to exception flags and exception masks.

    In hardware a division by 10 is either an adjustment of the exponent,
    which is equally fast for both encodings, or for a real division DPD
    just requires unpacking of all the 10-bit fields (I'm assuming IBM does
    this in parallel for all 11 groups, so max one cycle), then shifting all
    the nybbles down one position before the reverse to pack them back up,
    probably including a rounding step before the repack.

    It was even easier on the B3500. As it was addressed to the nibble,
    division by 10 simply required dropping the last digit of the source,
    while multiplication by 10 simply required appending a zero digit to
    the result (both of which the MVN instruction did automatically when
    the operand lengths differed). A common peephole optimization in the
    compilers.

    There were no operand registers, all arithmetic was memory to memory.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 19:56:54 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than the two byte orders of IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global state
    and the interoperability of BFP and DFP in the same process. Like whether
    BFP and DFP have a common rounding mode or each one has a mode of its own.
    The same question applies to exception flags and exception masks.

    In hardware a division by 10 is either an adjustment of the exponent,
    which is equally fast for both encodings, or for a real division DPD

    just requires unpacking of all the 10-bit fields (I'm assuming IBM does
    this in parallel for all 11 groups, so max one cycle),

    unpack is 3 gates of delay.

    then shifting all
    the nybbles down one position before the reverse to pack them back up, probably including a rounding step before the repack.

    pack is also 3 gates of delay.

    This operation is very closely related to the general case of having to re-normalize after any operation which would require that, i.e. commonly
    for DFMUL, very seldom for DFADD/DFSUB, and almost always for DFDIV.

    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int. No matter how we do it, two iterations are enough to
    handle even maximally large numbers, and that would require 4 64x64->128
    MULs per iteration. Since these would pipeline nicely I'm guessing it
    would be doable in 15-30 cycles total?

    Closer to 16 than 30 if one tries hard.

    So yes, scaling by a power of ten is the one operation where DPD is
    clearly much faster, but if you try to implement DPD in software, then
    you have to handle the unpack and pack operations, and they could easily take the same or even more time.

    Which is why you don't WANT to do it in HW.
    There is obviously a class of SW that wants these things--the question
    is whether YOUR architecture wants people of this class buying your HW.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 20:05:17 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Assuming 128-bit operands are delivered in 1 cycle and 128-bit
    results are delivered in 1 cycle::

    128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.

    128-bit Fmul requires that the multiplier tree be 64x64 instead of
    53x53 (a 1.46x bigger tree, a 1.22x bigger FU), and would/should be 3-4
    cycles longer than 64-bit Fmul. If you wanted to be "really clever"
    you could use a 59x59 tree and the FU is only 1.12x bigger; but here
    you could not use the tree for Integer MUL.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 20:11:20 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int.

    2^64 < 10^20

    How do you put 26/27 decimal digits in 64 bits. Or do you mean 27
    quinary digits? What would that be good for?

    it takes log2(10) = ~3.32 bits to encode one decimal digit, thus there
    are only 64/3.32 = ~19.3 decimal digits in 64 bits.

    No matter how we do it, two iterations are enough to
    handle even maximally large numbers, and that would require 4 64x64->128
    MULs per iteration. Since these would pipeline nicely I'm guessing it
    would be doable in 15-30 cycles total?

    On Skylake the latency of a 64x64->128 multiplication is 6 cycles (4
    cycles for the lower 64 bits), and I expect it to be lower on newer
    hardware. The pipelined multiplications should be done by cycle 9.
    There are also some additions involved, but I would not expect them to increase the latency to 15 cycles. What other operations do you have
    in mind that would result in 15-30 cycles? For scaling you don't need
    the remainder, only some rounding.

    So yes, scaling by a power of ten is the one operation where DPD is
    clearly much faster

    I would not bet on it. It needs to unpack the 34 digits into 136
    bits, do a 136-bit shift, then repack into DPD.

    In HW::
    unpack is 3 gates of delay
    pack is 3 gates of delay

    Either they widen the
    data path beyond what they normally do, or they do it in parcels of 64
    bits or less, and the end result can easily take a similar number of
    cycles as 128-bit binary multiplication with the reciprocal.

    In HW, 136 bits is conceptually no different from 128 bits.

    My guess
    is that they did the slow implementation at first, and then there was
    so little takeup of DFP that the slow implementation is good enough to
    this day.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jan 7 14:23:10 2026
    From Newsgroup: comp.arch

    On 1/7/2026 7:16 AM, John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you couple of years ago how fast do want binary128
    in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave totally
    unrealistic answer like "the same as binary64".
    May be, nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware binary128?


    Likely estimate for FPGA:
    Around 28 DSP48's for a "triangular" multiplier;
    Would need to add several clock cycles for the adder tree;
    ...
    FADD/FSUB unit, also around 12 cycles,
    as most intermediate steps now take 2 clock cycles;


    Estimate:
    Probably around 5k LUTs for the FMUL, with a latency of around 10 or 12
    clock cycles.
    Probably around 12k LUTs for FADD/FSUB unit;
    Will need a few more kLUT for the glue logic.

    So, will put the cost at:
    18-20 kLUT likely;
    ~ 28 DSP48s;
    Around 12 cycles of latency.


    What about an FMA based implementation:
    Probably 49 DSP48's and around 24 cycles of latency.
    Where, 49 is needed for full-width multiplier results.
    Also add a big bump to the LUT cost vs separate units.
    An FMA unit roughly has the latency cost of both the FADD and FMUL.
    But, some people really like the ability to quickly have single-rounded results.


    The initial FPU would likely take around 1/3 of the total LUT budget of
    an XC7A100T, and it is unclear if such a thing would be possible within
    a 50 MHz CPU core (might require dropping to 33 MHz or similar).


    In my case, similar issues wrecked my ideas of doing a 96-bit truncated
    format, and even then 96 bits is still less than 128 bits. My current
    strategy is to instead allow for trap-based handling or hot-patching.




    To simplify a hot-patching implementation, I am now considering having
    the compiler set aside roughly 4 instruction-words of "hot patch zone"
    for any instruction that is likely to be implemented via hot-patching.

    These would be dumped out in blobs within 1MB of the target, or at the
    end of ".text", whichever comes first. Technically, 3 words would be the minimum, but 4 allows for a little more working flexibility.

    May make sense to assume that the hot-patching is free to stomp X5, as
    this would make it possible to implement on RV64G. Though, would need 6
    to allow for AUIPC+LD+JALR; but still works if assuming AUIPC+JALR (+/-
    4GB).


    This would give space for the handler to replace the offending
    instruction with a JAL, and then to branch off to whatever memory is
    being used for hot-patched instruction sequences.

    Granted, this sort of thing only works well if one assumes compiler cooperation.

    Current possibility is that the compiler could hint at these spaces by
    filling them with a special instruction, such as:
    JALR X0, 0(X0) //branch to NULL
    Where, if the loader or trap handler sees large blobs of such an
    instruction, it can assume that this area was set aside for use by the
    hot patching to reuse to encode long-distance branches.

    Could probably also add this to XG1/XG2 if trying to do similar (like
    enabling the "FPUX" extension), may make sense to find some other filler instruction that makes sense for XG2 though (using RISC-V JALR
    instructions would be a little out of place in this case).

    Granted, could also make sense to use a large blob of EBREAK or similar,
    which could have a similar effect (mostly depends on the probability
    that a program would have some other likely reason to have a big blob of EBREAK's, and EBREAK has a higher probability to be "actually useful"
    than a JALR-NULL).



    Granted, one could argue that using pad-space defeats the merit of using trapping-instructions rather than runtime calls. But, alas...


    Ironically, for my RV+SIMD stuff, I partly leaned into still using
    runtime calls for some operations rather than doing them inline, as
    doing them inline is more bulk with a comparably weaker SIMD ISA (but,
    with some more fiddling, weak SIMD is still a big improvement over
    no-SIMD for things like GLQuake).

    Well, and more fiddling to make RV FPU handling by BGBCC less crappy:
    More likely to use the correct registers, etc.

    And, currently putting 128-bit SIMD in FPU register pairs, which is
    mostly less-bad than GPRs even in the absence of native SIMD ops, apart
    from the "epic crapiness" of trying to deal with shuffle operations (I
    did add "FPU PACK" style instructions as otherwise this part is "dog
    crap").


    Or, in RV terms, one has, say:
    PACK Rd, Rs1, Rs2 // { Rs1[31: 0], Rs2[31: 0] }
    PACKU Rd, Rs1, Rs2 // { Rs1[63:32], Rs2[63:32] }
    BitManip stopped there, my case would also have PACKBT/PACKTB, though in
    my ISA they were called MOVLD/MOVHD/MOVLHD/MOVHLD (BGBCC still mostly
    uses these names, but allows PACK/PACKU for ASM code). The RV P
    extension also defines all 4 cases, but only for GPRs.

    For sake of sanity, my SIMD extension had also defined variants for
    FPRs, albeit still using the same mnemonics (the assembler figures out
    what to do based on registers here).


    So, in this case, still more sensible to use internal runtime calls for operations like DotProduct and CrossProduct and similar (but are likely
    remain as inline operations for XG2/3).

    Similar also applies to complex-number and quaternion operations, which
    will mostly remain as runtime calls.


    As noted, no current plans to move beyond 64/128 bit SIMD.

    Most likely option is that, rather than (hypothetically) define any sort
    of large-vector SIMD, may make more sense to fake large SIMD via the
    RV-V extension, and then probably use hot-patching to pretend that it
    exists (if needed, by faking RV-V on top of the narrower SIMD).

    Like, by the time one wants crap like AVX or similar, then RV-V starts
    to seem more sane.

    Big problem-case is when one wants something more like MMX or SSE-1,
    where RV-V seems like a pretty big ask to expect a hardware implementation.

    But, a hot-patching implementation could potentially be fast enough to
    make RV-V "not totally worthless" (if faking 256 bit vectors or similar,
    it is then more likely to eat the relative overhead of the patch-calls).
    And, could then implement native RV-V for hardware that can justify the
    cost.


    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jan 7 14:38:30 2026
    From Newsgroup: comp.arch

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. The Gnu Octave implementation was somewhat
    worse yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was a Discrete Cosine Transform and the underlying FFT engine
    for N around 32K.
    The second was solving a system of linear equations for N around 1000
    or a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate results
    caused trouble.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left the LU decomposition, which happens to be
    the heavy O(N**3) part, in DP. Quad-precision was applied only during
    the final solver stages - forward propagation, back propagation,
    calculation of the residual error vector and repetition of forward and
    back propagation. All those parts are O(N**2). That modification was
    sufficient to improve the precision of the result almost to the best
    possible in the DP FP format.
    And sufficient for good convergence of the Parks-McClellan algorithm.


    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    an error budget calculation across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be applied
    naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though, in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16;
    with Binary64 more remaining as the "de-facto default" precision for floating-point).

    As can be noted, in my case, it was a partial motivation for supporting
    things like 128-bit integer instructions (in my C compiler, and
    optionally in the underlying ISA), as supporting Int128 ops is a step
    towards making doing Binary128 in software more practical (without the
    steep cost of a 128-bit FPU).

    ...


    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 21:18:54 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. The Gnu Octave implementation was somewhat
    worse yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was a Discrete Cosine Transform and the underlying FFT engine
    for N around 32K.
    The second was solving a system of linear equations for N around 1000
    or a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate results
    caused trouble.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left the LU decomposition, which happens to be
    the heavy O(N**3) part, in DP. Quad-precision was applied only during
    the final solver stages - forward propagation, back propagation,
    calculation of the residual error vector and repetition of forward and
    back propagation. All those parts are O(N**2). That modification was
    sufficient to improve the precision of the result almost to the best
    possible in the DP FP format.
    And sufficient for good convergence of the Parks-McClellan algorithm.

    So, what is a point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.
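
    [In outline, the refinement Michael describes looks something like the
    sketch below -- an illustration under assumptions (GCC-style __float128,
    row-major A, packed LU factors with a pivot vector), not his actual
    code:]

    #include <stdlib.h>

    typedef __float128 q;   /* quad precision (GCC/Clang extension) */

    /* Forward + back substitution through the double-precision LU
       factors, carried out in quad.  piv[] holds LAPACK-style row swaps. */
    static void solve_lu_quad(int n, const double *lu, const int *piv, q *v)
    {
        for (int i = 0; i < n; i++) {             /* pivot + forward subst. */
            q t = v[piv[i]]; v[piv[i]] = v[i]; v[i] = t;
            for (int j = 0; j < i; j++)
                v[i] -= (q)lu[i*n + j] * v[j];
        }
        for (int i = n - 1; i >= 0; i--) {        /* back substitution */
            for (int j = i + 1; j < n; j++)
                v[i] -= (q)lu[i*n + j] * v[j];
            v[i] /= (q)lu[i*n + i];
        }
    }

    /* One refinement pass: all O(N^2); only the O(N^3) LU stays in DP. */
    void refine_once(int n, const double *A, const double *lu,
                     const int *piv, const double *b, double *x)
    {
        q *r = malloc(n * sizeof *r);
        for (int i = 0; i < n; i++) {             /* r = b - A*x, in quad */
            q s = (q)b[i];
            for (int j = 0; j < n; j++)
                s -= (q)A[i*n + j] * (q)x[j];
            r[i] = s;
        }
        solve_lu_quad(n, lu, piv, r);             /* d = A^-1 * r */
        for (int i = 0; i < n; i++)
            x[i] += (double)r[i];                 /* x += d; repeat if needed */
        free(r);
    }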

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    error budget calculations across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be
    applied naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16,
    and with Binary64 remaining as the "de-facto default" precision for
    floating-point).

    {To date::}
    My only use for 128-bit FP was to compute Chebyshev Coefficients for
    my high speed DP Transcendentals. I only needed 64-bits of fractions
    but, in practice, 80-bit FP was only giving me 63-bits of precision.
    Since these are a) computed once, b) used infinitely many times; the
    speed of 128-bit FP is completely irrelevant.

    As can be noted, in my case, it was a partial motivation for supporting things like 128-bit integer instructions (in my C compiler, and
    optionally in the underlying ISA), as supporting Int128 ops is a step towards making doing Binary128 in software more practical (without the
    steep cost of a 128-bit FPU).

    It seems to me that if one has "reasonable" ISA support for tearing a
    128-bit FP value into {sign, exponent, fraction} and reasonably fast
    128-bit integer support, then emulating 128-bit FP in SW is "not that
    bad"--especially if one can do 128x128 -> 256 in 4-8 cycles.
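
    [For concreteness, a minimal sketch of that "tearing apart" step,
    assuming a compiler with unsigned __int128; the helper name is made
    up for illustration:]

    #include <stdint.h>

    typedef unsigned __int128 u128;

    /* Split an IEEE binary128 bit pattern into sign, biased exponent,
       and fraction, making the hidden bit explicit for normal numbers. */
    static void fp128_unpack(u128 bits, int *sign, int *exp, u128 *frac)
    {
        *sign = (int)(bits >> 127);
        *exp  = (int)((bits >> 112) & 0x7fff);    /* 15 exponent bits */
        *frac = bits & (((u128)1 << 112) - 1);    /* 112 fraction bits */
        if (*exp != 0)                            /* normal: add hidden bit */
            *frac |= (u128)1 << 112;
    }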

    ...


    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 23:47:06 2026
    From Newsgroup: comp.arch

    On Wed, 07 Jan 2026 20:05:17 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>,
    already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Assuming 128-bit operands are delivered in 1 cycle and 128-bit
    results are delivered in 1 cycle::

    If we are talking about SIMD of the same width (measured in bits) as
    SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
    and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
    that fully pipelined binary128 operations are a non-starter, because it
    would blow your power and thermal budget. One full-width result (i.e. 8
    binary128 results) every 2 cycles sounds somewhat more realistic.
    After all, in a general-purpose CPU, binary128, if implemented at all,
    is a proverbial tail that can't be allowed to wag the dog.
    OTOH, if we define our binary128 to use only the least-significant
    128-bit lane of our 512-bit register, and only build b128 capabilities
    into one of our pair of FPUs, then full pipelining (i.e. 1 result per
    cycle) looks like a good choice, at least from a power/thermal
    perspective. That is, as long as designers find a way to avoid a hot
    spot.
    128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.

    128-bit Fmul requires that the multiplier tree be 64x64 instead of
    53x53 (1.46x bigger tree, 1.22x bigger FU), and would/should be 3-4
    cycles longer than 64-bit Fmul. If you wanted to be "really clever"
    you could use a 59x59 tree and the FU is only 1.12x bigger; but here
    you could not use the tree for Integer MUL.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jan 7 16:10:01 2026
    From Newsgroup: comp.arch

    On 1/7/2026 3:18 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. Gnu Octave implementation was somewhat worse
    yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    and a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part, in DP. Quad-precision was applied only during final
    solver stages - forward propagation, back propagation, calculation of
    residual error vector and repetition of forward and back propagation.
    All those parts are O(N**2). That modification was sufficient to
    improve the precision of the result almost to the best possible in DP
    FP format, and sufficient for good convergence of the Parks-McClellan
    algorithm.

    FWIW: For most cases where I had used DCT or FFT, it has almost always
    been with fixed-point integer math...



    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    error budget calculations across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be
    applied naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though, in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16;
    with Binary64 remaining as the "de-facto default" precision for
    floating-point).

    {To date::}
    My only use for 128-bit FP was to compute Chebyshev Coefficients for
    my high speed DP Transcendentals. I only needed 64-bits of fractions
    but, in practice, 80-bit FP was only giving me 63-bits of precision.
    Since these are a) computed once, b) used infinitely many times; the
    speed of 128-bit FP is completely irrelevant.


    As noted, low usage frequency.

    If it is something that mostly applies to initial program startup or occasionally in the slow path, that it is "kinda slow" doesn't matter
    too much.

    Though, it is starting to seem that "trap and emulate" might still be a
    little too slow, leading to my recent efforts in the direction of
    efficient hot-patching.

    Granted, this is more a case of "just sort of pushing the cost somewhere
    else" and in theory, if the compiler knows that the instruction will
    just be patched anyways, it could in principle generate the intermediate
    calls more cheaply.

    But, for Binary128 there is another factor:
    RV64G/RV64GC lacks access to 128-bit integer instructions;
    So, it makes sense to instead run this logic in XG3;
    But, the compiler can't just use XG3, since if it uses any XG3 ops, it may as well
    just compile the whole binary as XG3;
    So, it makes sense to use XG3 as a "make RV64 less poor" feature, but
    then the compiler can't be allowed to depend on it directly, and at
    least needs to pretend it is living in RV64 land.

    But, then, this leads to hot-patch wonk.


    As can be noted, in my case, it was a partial motivation for supporting
    things like 128-bit integer instructions (in my C compiler, and
    optionally in the underlying ISA), as supporting Int128 ops is a step
    towards making doing Binary128 in software more practical (without the
    steep cost of a 128-bit FPU).

    It seems to me that if one has "reasonable" ISA support for tearing a
    128-bit FP value into {sign, exponent, fraction} and reasonably fast
    128-bit integer support, then emulating 128-bit FP in SW is "not that
    bad"--especially if one can do 128x128 -> 256 in 4-8 cycles.


    Yeah, this is basically the idea.

    Int128 ops, and my BITMOV instructions (which can extract/insert/move bitfields within 64 and 128 bit containers; as a combined "Shift and
    masked MUX"), can provide a nice boost here.

    Sadly, there is still not really a great way to do a 128x128 => 256
    multiply though. The current fastest option is still to decompose it
    into a crapload of 32x32 => 64 bit widening multiply ops, as sketched
    below (which, ironically, is another thing that RV is lacking; one
    needs to use a full 64-bit multiply, but there are downsides, more so
    when the base ISA is also lacking PACK/PACKU).
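
    [A portable sketch of that decomposition -- helper names made up, not
    BGBCC output: 64x64 => 128 from four 32x32 => 64 widening multiplies,
    then 128x128 => 256 schoolbook from those:]

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;
    typedef struct { uint64_t w[4]; } u256;     /* w[0] = least significant */

    /* the 32x32 -> 64 widening primitive assumed throughout */
    static inline uint64_t mul32w(uint32_t a, uint32_t b) {
        return (uint64_t)a * (uint64_t)b;
    }

    /* 64x64 -> 128 from four 32x32 -> 64 partial products */
    static u128 mul64w(uint64_t a, uint64_t b) {
        uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
        uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);
        uint64_t p00 = mul32w(a0, b0), p01 = mul32w(a0, b1);
        uint64_t p10 = mul32w(a1, b0), p11 = mul32w(a1, b1);
        /* mid cannot overflow: at most 3*(2^32-1) < 2^34 */
        uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
        u128 r;
        r.lo = (mid << 32) | (uint32_t)p00;
        r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
        return r;
    }

    /* 128x128 -> 256 from four 64x64 -> 128 products plus carries */
    static u256 mul128w(u128 a, u128 b) {
        u128 ll = mul64w(a.lo, b.lo), lh = mul64w(a.lo, b.hi);
        u128 hl = mul64w(a.hi, b.lo), hh = mul64w(a.hi, b.hi);
        uint64_t s, c, c2;
        u256 r;
        r.w[0] = ll.lo;
        s = ll.hi + lh.lo;  c  = (s < lh.lo);
        s += hl.lo;         c += (s < hl.lo);
        r.w[1] = s;
        s = lh.hi + hl.hi;  c2  = (s < hl.hi);
        s += hh.lo;         c2 += (s < hh.lo);
        s += c;             c2 += (s < c);
        r.w[2] = s;
        r.w[3] = hh.hi + c2;    /* no carry out: product fits in 256 bits */
        return r;
    }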


    Still kinda funny that RV land, with all of its wide industrial support,
    lots of people doing lots of extensions, advanced features, etc.,
    seemingly still fails at making an ISA where "basic things" fit together
    well.

    And, then a lot of features going off in rabbit holes like "why would
    you want this?", and then it turns out it is to micro-optimize some
    specific test case within SPECint or something (often, rather than
    finding a more general solution that would address multiple related issues).

    More so when the "micro-optimize the benchmark" features were more often chosen over the more general purpose "actually address the underlying
    issue" features.


    Granted, then someone is almost invariably going to be like "all the
    parts of RV do fit together well, but you are using it wrong...".

    But, in this case, I would expect GCC to generate smaller binaries than
    BGBCC; leaving me to think it is more a case of "these parts don't fit
    together all that well".



    ...


    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 22:10:27 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 07 Jan 2026 20:05:17 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Assuming 128-bit operands are delivered in 1 cycle and 128-bit
    results are delivered in 1 cycle::


    If we are talking about SIMD of the same width (measured in bits) as
    SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
    and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
    that fully pipelined binary128 operations are a non-starter, because it
    would blow your power and thermal budget.

    I agree; however, a single 128-bit FPU would fit inside a reasonable
    power budget.

    One full-width result (i.e. 8 binary128 results) every 2 cycles sounds somewhat more realistic.

    Likely still over a reasonable power budget.

    After all, in a general-purpose CPU, binary128, if implemented at all,
    is a proverbial tail that can't be allowed to wag the dog.

    We build (and call) our current machines 64-bits because that is the
    size of the register files (not including SIMD/Vector) and because
    we can run the scalar unit at rated clock frequency (non SIMD/Vector) essentially continuously.

    Once we step over the scalar width, power goes up 2x-4x and we get a
    couple of hundred cycles before frequency throttling. Thus, we cannot,
    in general, run SIMD/Vector at rated frequency continuously. Nor can
    we, at the present time, build a memory system that can properly feed a
    SIMD/Vector RF so that one can use all of the available calculation
    lanes. {HBM is approaching this point; however, it becomes more like
    B-memory from the CRAY-2 than main memory for applications that can
    use that much B-memory effectively.}

    OTOH, if we define our binary128 to use only the least-significant
    128-bit lane of our 512-bit register, and only build b128 capabilities
    into one of our pair of FPUs, then full pipelining (i.e. 1 result per
    cycle) looks like a good choice, at least from a power/thermal
    perspective. That is, as long as designers find a way to avoid a hot
    spot.

    We could, instead, treat them as pairs of GPRs--like we did in Mc 88120
    and still not need SIMD/Vectors.

    128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.

    128-bit Fmul requires that the multiplier tree be 64x64 instead of
    53x53 (1.46x bigger tree, 1.22x bigger FU), and would/should be 3-4
    cycles longer than 64-bit Fmul. If you wanted to be "really clever"
    you could use a 59x59 tree and the FU is only 1.12x bigger; but here
    you could not use the tree for Integer MUL.

    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 00:05:33 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 3:18 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. Gnu Octave implementation was somewhat worse
    yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    and a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part, in DP. Quad-precision was applied only during final
    solver stages - forward propagation, back propagation, calculation of
    residual error vector and repetition of forward and back propagation.
    All those parts are O(N**2). That modification was sufficient to
    improve the precision of the result almost to the best possible in DP
    FP format, and sufficient for good convergence of the Parks-McClellan
    algorithm.

    FWIW: For most cases where I had used DCT or FFT, it has almost always
    been with fixed-point integer math...



    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    error budget calculations across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be
    applied naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though, in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16,
    and with Binary64 remaining as the "de-facto default" precision for
    floating-point).

    {To date::}
    My only use for 128-bit FP was to compute Chebyshev Coefficients for
    my high speed DP Transcendentals. I only needed 64-bits of fractions
    but, in practice, 80-bit FP was only giving me 63-bits of precision.
    Since these are a) computed once, b) used infinitely many times; the
    speed of 128-bit FP is completely irrelevant.


    As noted, low usage frequency.

    If it is something that mostly applies to initial program startup or occasionally in the slow path, that it is "kinda slow" doesn't matter
    too much.

    Though, it is starting to seem that "trap and emulate" might still be a little too slow, leading to my recent efforts in the direction of
    efficient hot-patching.

    Depends on the speed of T&E. If privilege control transfer is 10-cycles then its probably OK, if 100+ it is getting on the annoying side of thiigns.

    Granted, this is more a case of "just sort of pushing the cost somewhere else" and in theory, if the compiler knows that the instruction will
    just be patched anyways, it could in principle generate the intermediate
    calls more cheaply.
    -------------------
    As can be noted, in my case, it was a partial motivation for supporting
    things like 128-bit integer instructions (in my C compiler, and
    optionally in the underlying ISA), as supporting Int128 ops is a step
    towards making doing Binary128 in software more practical (without the
    steep cost of a 128-bit FPU).

    It seems to me that if one has "reasonable" ISA support for tearing a
    128-bit FP value into {sign, exponent, fraction} and reasonably fast
    128-bit integer support, then emulating 128-bit FP in SW is "not that
    bad"--especially if one can do 128x128 -> 256 in 4-8 cycles.


    Yeah, this is basically the idea.

    Int128 ops, and my BITMOV instructions (which can extract/insert/move bitfields within 64 and 128 bit containers; as a combined "Shift and
    masked MUX"), can provide a nice boost here.

    Sadly, there is still not really a great way to do a 128x128 => 256
    multiply though.

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of
    thought to this {~1 year} before deciding that a "Do everything else"
    function unit was "overall" better than a couple of "near miss" FUs.
    So, IMUL comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at
    22, and IDIV at 25. The fast IMUL makes up a lot for the "size" of
    the FU.

    To get 64x64->128 I need my <single> prefix instruction CARRY. But
    this also gives me (64x64->128)/64 -> {64,64} {quotient, remainder}.
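
    [The semantics of those two CARRY-prefixed forms, sketched in C with
    the GCC/Clang unsigned __int128 extension -- for reference, not My
    66000 code:]

    #include <stdint.h>

    typedef unsigned __int128 u128_t;

    /* 64x64 -> 128: what the CARRY-prefixed multiply delivers */
    static inline void mul_wide(uint64_t a, uint64_t b,
                                uint64_t *hi, uint64_t *lo)
    {
        u128_t p = (u128_t)a * b;
        *hi = (uint64_t)(p >> 64);
        *lo = (uint64_t)p;
    }

    /* (64x64 -> 128)/64 -> {quotient, remainder}: the CARRY-prefixed
       divide; assumes hi < d so the quotient fits in 64 bits */
    static inline void divmod_wide(uint64_t hi, uint64_t lo, uint64_t d,
                                   uint64_t *quot, uint64_t *rem)
    {
        u128_t n = ((u128_t)hi << 64) | lo;
        *quot = (uint64_t)(n / d);
        *rem  = (uint64_t)(n % d);
    }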

    Current fastest option is still to decompose it into a crapload of 32x32=>64 bit widening multiply ops (which, ironically, is another thing that RV is lacking in; need to use a full 64-bit multiply,
    but there are downsides, more-so when the base ISA is also lacking PACK/PACKU).

    "Not my fault".

    Still kinda funny that RV land, with all of its wide industrial support,
    lots of people doing lots of extensions, advanced features, etc.,
    seemingly still fails at making an ISA where "basic things" fit
    together well.

    And, then a lot of features going off in rabbit holes like "why would
    you want this?", and then it turns out it is to micro-optimize some
    specific test case within SPECint or something (often, rather than
    finding a more general solution that would address multiple related issues).

    Reasonable support for 64x64->128 is what makes emulation "affordable".

    Side note: Back in 1987, MIPS had a 13-cycle multiply using their non-
    pipelined FU and special registers--while the Mc 88100 had a 3-cycle
    32x32 multiply. Well, it turns out one could program this multiplier
    to do 32x32->64 in 13 cycles, TOO!!

    More so when the "micro-optimize the benchmark" features were more often chosen over the more general purpose "actually address the underlying
    issue" features.

    Been there done that......


    Granted, then someone is almost invariably going to be like "all the
    parts of RV do fit together well, but you are using it wrong...".

    But, in this case, I would expect GCC to generate smaller binaries than
    BGBCC; leaving me to think it is more a case of "these parts don't fit
    together all that well".



    ...


    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Jan 7 21:16:38 2026
    From Newsgroup: comp.arch

    On 2026-01-07 3:23 p.m., BGB wrote:
    On 1/7/2026 7:16 AM, John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com
    (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware
    binary128?


    Likely estimate for FPGA:
      Around 28 DSP48's for a "triangular" multiplier;
        Would need to add several clock cycles for the adder tree;
        ...
      FADD/FSUB unit, also around 12 cycles,
        as most intermediate steps now take 2 clock cycles;


    Estimate:
    Probably around 5k LUTs for the FMUL, with a latency of around 10 or 12 clock cycles.
    Probably around 12k LUTs for FADD/FSUB unit;
    Will need a few more kLUT for the glue logic.

    So, will put the cost at:
      18-20 kLUT likely;
      ~ 28 DSP48s;
      Around 12 cycles of latency.


    What about an FMA based implementation:
      Probably 49 DSP48's and around 24 cycles of latency.
        Where, 49 is needed for full-width multiplier results.
        Also add a big bump to the LUT cost vs separate units.
      An FMA unit roughly has the latency cost of both the FADD and FMUL.
    But, some people really like the ability to quickly have single-rounded results.



    The 128-bit FMA I implemented has an eight-cycle latency and uses 36
    DSPs (Karatsuba multiplier). The latency is a bit less than double
    that of an FADD. One cycle can be trimmed off for operand decoding,
    which can happen in parallel; then there is only a single
    normalization and round taking place, which also trims a couple of
    clocks off the doubled latency.

    My FADD has a five-cycle latency. Latency is a bit of a designer's
    choice and can be set up as desired for the clock frequency. I picked
    eight to try and match the FP clock to the CPU clock (slow CPU clock).
    Many more stages could be added to bump up the clock frequency.

    The FMA consumes about 8600 LUTs and 2600 FFs. I decided to use FMAs
    (without FADD, FMUL) in my design even though the latency is a bit
    more, as I think the total LUT cost is lower.

    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 02:38:57 2026
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 8 10:52:21 2026
    From Newsgroup: comp.arch

    On Thu, 08 Jan 2026 02:38:57 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is
    59x59 bits {a bit more than half of them get 1 ULP at 58x58}. I gave a
    lot of thought to this {~1 year} before deciding that a "Do
    everything else" function unit was "overall" better than a couple
    of "near miss" FUs. So, IMUL comes in at 4 cycles, FMUL at 4
    cycles, FDIV at 17, SQRT at 22, and IDIV at 25. The fast IMUL makes
    up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    Don't you mean '0.5002 ULP' ?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 12:50:32 2026
    From Newsgroup: comp.arch

    Waldek Hebisch wrote:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is having the bits in the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
    a move from the BCD '0123456789abcdef' to the corresponding ASCII
    bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT  DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA              // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or, may be, they have instruction for that as well, but it's not in
    DFP related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation
    on strings of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given (fast) such a transformation it is very easy to add proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.


    Intel (and then AMD) added the PDEP (and corresponding PEXT) opcode
    sometime within the last 20 years (probably less than 10?); it is
    perfect for this operation:

    ;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
    mov rbx, 0x0f0f0f0f0f0f0f0f    ; byte-wise nybble mask
    pdep rax, rsi, rbx             ; deposit low 8 nybbles -> 8 bytes
    shr rsi, 32
    pdep rdx, rsi, rbx             ; deposit high 8 nybbles -> 8 bytes

    This is sub-5 cycles of latency.

    It is also doable with much older CPUs using the permute/byte shuffle
    operation, with a bit more or less latency depending upon where the
    source and destination data resides (SIMD vs regular integer reg).
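
    [The same idea via compiler intrinsics -- a sketch assuming BMI2's
    _pdep_u64 (compile with -mbmi2), little-endian digit order, and the
    ASCII zone OR'ed in at the end:]

    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>   /* _pdep_u64 */

    /* unpack 16 packed BCD nybbles into 16 printable ASCII digit bytes */
    static void bcd16_to_ascii(uint64_t bcd, uint8_t out[16])
    {
        const uint64_t mask = 0x0f0f0f0f0f0f0f0fULL;
        const uint64_t zone = 0x3030303030303030ULL;
        uint64_t lo = _pdep_u64(bcd, mask) | zone;        /* low 8 digits  */
        uint64_t hi = _pdep_u64(bcd >> 32, mask) | zone;  /* high 8 digits */
        memcpy(out, &lo, 8);        /* little-endian: lowest digit first */
        memcpy(out + 8, &hi, 8);
    }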

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 13:01:45 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int.

    2^64 < 10^20

    How do you put 26/27 decimal digits in 64 bits. Or do you mean 27
    quinary digits? What would that be good for?

    it takes 3.32 binary digits to encode 10, thus there are only 19.25
    decimal digits in 64-bits.

    Michael's idea was to split the division by a power of ten into two
    parts: A division by a power of 5 and a bitshift for the 2^N.

    If we start with the bitshift (but remember the bits shifted out from
    the bottom), then 5^26 fits into 2^64.

    Does that make sense?
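
    [A sketch of the single-step case -- floor(x/10) as a shift for the
    factor of 2 plus a reciprocal multiply for the factor of 5, using the
    usual ceil(2^66/5) magic constant; illustrative, not Michael's code:]

    #include <stdint.h>

    /* floor(x/5) for any 64-bit x: 0xCCCCCCCCCCCCCCCD = ceil(2^66 / 5),
       so (x * magic) >> 66 is exact. */
    static inline uint64_t div5(uint64_t x)
    {
        return (uint64_t)(((unsigned __int128)x *
                           0xCCCCCCCCCCCCCCCDULL) >> 64) >> 2;
    }

    /* floor(x/10) = floor(floor(x/2)/5): peel the power of 2 off with a
       shift, then handle the power of 5 with the reciprocal multiply. */
    static inline uint64_t div10(uint64_t x)
    {
        return div5(x >> 1);
    }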

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 13:05:10 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. Gnu Octave implementation was somewhat worse
    yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    and a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part, in DP. Quad-precision was applied only during final
    solver stages - forward propagation, back propagation, calculation of
    residual error vector and repetition of forward and back propagation.
    All those parts are O(N**2). That modification was sufficient to
    improve the precision of the result almost to the best possible in DP
    FP format, and sufficient for good convergence of the Parks-McClellan
    algorithm.


    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    error budget calculations across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be
    applied naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though, in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16;
    with Binary64 remaining as the "de-facto default" precision for
    floating-point).

    {To date::}
    My only use for 128-bit FP was to compute Chebyshev Coefficients for
    my high speed DP Transcendentals. I only needed 64-bits of fractions
    but, in practice, 80-bit FP was only giving me 63-bits of precision.
    Since these are a) computed once, b) used infinitely many times; the
    speed of 128-bit FP is completely irrelevant.
    Sounds similar to the weekend I spent writing an fp128 library (using
    1:31:96 for speed/ease of implementation on a Pentium) just to be able
    to verify that our FPATAN2 workaround for the FDIV bug was correct.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 13:10:14 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of
    thought to this {~1 year} before deciding that a "Do everything else"
    function unit was "overall" better than a couple of "near miss" FUs.
    So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.
    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to get the rounding correct?
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 8 15:41:17 2026
    From Newsgroup: comp.arch

    On Thu, 8 Jan 2026 12:50:32 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Waldek Hebisch wrote:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but
    leaves to software the last step of unpacking 4-bit BCD to
    8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is having the bits in the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit
    numbers: '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'.
    That's a move from the BCD '0123456789abcdef' to the corresponding
    ASCII bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT  DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA              // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields
    'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least,
    get 3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or, may be, they have instruction for that as well, but it's
    not in DFP related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation
    on strings of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given (fast) such a transformation it is very easy to add proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.


    Intel (and then AMD) added the PDEP (and corresponding PEXT) opcode
    sometime within the last 20 years (probably less than 10?); it is
    perfect for this operation:


    Haswell. Officially launched on June 4, 2013. So 12.5 years ago.
    Time flies.

    ;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
    mov rbx, 0x0f0f0f0f0f0f0f0f    ; byte-wise nybble mask
    pdep rax, rsi, rbx             ; deposit low 8 nybbles -> 8 bytes
    shr rsi, 32
    pdep rdx, rsi, rbx             ; deposit high 8 nybbles -> 8 bytes

    This is sub-5 cycles of latency.

    That's nice.
    I'm not sure if POWER has similar instruction.


    It is also doable with much older CPUs using the permute/byte shuffle
    operation, with a bit more or less latency depending upon where the
    source and destination data resides (SIMD vs regular integer reg).

    Terje



    I don't understand that part. Do you suggest that there are better
    swizzle instructions than the unpack mentioned by Waldek Hebisch?
    So far, I don't see it. Unpack looks to me the most suitable.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 18:50:14 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 08 Jan 2026 02:38:57 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is
    59x59 bits {a bit more than half of them get 1 ULP at 58x58}. I gave a
    lot of thought to this {~1 year} before deciding that a "Do
    everything else" function unit was "overall" better than a couple
    of "near miss" FUs. So, IMUL comes in at 4 cycles, FMUL at 4
    cycles, FDIV at 17, SQRT at 22, and IDIV at 25. The fast IMUL makes
    up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.


    Don't you mean '0.5002 ULP' ?

    Technically, any rounding that is not IEEE correct is at least 1 ULP.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 18:52:26 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int.

    2^64 < 10^20

    How do you put 26/27 decimal digits in 64 bits. Or do you mean 27
    quinary digits? What would that be good for?

    it takes 3.32 binary digits to encode 10, thus there are only 19.25
    decimal digits in 64-bits.

    Michael's idea was to split the division by a power of ten into two
    parts: A division by a power of 5 and a bitshift for the 2^N.

    If we start with the bitshift (but remember the bits shifted out from
    the bottom, then 5^26 fits into 2^64.

    Does that make sense?

    My point was that you cannot fit 26 encodings representing 0-9 into 64
    bits; it was not about what math is in play.


    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 18:54:40 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 21:25:57 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 8 Jan 2026 12:50:32 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Waldek Hebisch wrote:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but
    leaves to software the last step of unpacking 4-bit BCD to
    8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is having the bits in the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit
    numbers: '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'.
    That's a move from the BCD '0123456789abcdef' to the corresponding
    ASCII bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT  DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA              // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields
    'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least,
    get 3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or, may be, they have instruction for that as well, but it's
    not in DFP related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation
    on strings of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given (fast) such a transformation it is very easy to add proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.


    Intel (and then AMD) added the PDEP (and corresponding PEXT) opcode
    sometime within the last 20 years (probably less than 10?); it is
    perfect for this operation:


    Haswell. Officially launched on June 4, 2013. So 12.5 years ago.
    Time flies.

    ;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
    mov rbx, 0x0f0f0f0f0f0f0f0f    ; byte-wise nybble mask
    pdep rax, rsi, rbx             ; deposit low 8 nybbles -> 8 bytes
    shr rsi, 32
    pdep rdx, rsi, rbx             ; deposit high 8 nybbles -> 8 bytes

    This is sub-5 cycles of latency.

    That's nice.
    I'm not sure if POWER has similar instruction.


    It is also doable with much older CPUs using the permute/byte shuffle
    operation, with a bit more or less latency depending upon where the
    source and destination data resides (SIMD vs regular integer reg).

    Terje



    I don't understand that part. Do you suggest that there are better
    swizzle instructions than the unpack mentioned by Waldek Hebisch?
    So far, I don't see it. Unpack looks to me the most suitable.

    There are at least three ways to do it:

    a) PDEP in 64-bit regs

    b) PSHUFB and nybble masks using SSE/AVX regs

    c) PUNPCKLBW, which expands bytes to words. Do it twice, with a bytewise
    SHR 4 to select the upper nybbles and a mask to keep the lower nybbles
    of the first part (see the sketch below).

    Did you intend to use (c) or is there yet another method?
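
    [A minimal SSE2 sketch of (c): shift, mask, then PUNPCKLBW to
    interleave the low/high nybbles; digit order matches the PDEP
    version (least significant first):]

    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2 */

    /* unpack 16 packed nybbles (8 bytes) into 16 bytes, one nybble each */
    static __m128i nybbles_to_bytes(const uint8_t src[8])
    {
        __m128i v   = _mm_loadl_epi64((const __m128i *)src);
        __m128i m0f = _mm_set1_epi8(0x0f);
        __m128i lo  = _mm_and_si128(v, m0f);                     /* low nybbles  */
        __m128i hi  = _mm_and_si128(_mm_srli_epi16(v, 4), m0f);  /* high nybbles */
        return _mm_unpacklo_epi8(lo, hi);    /* PUNPCKLBW interleave */
    }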

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 21:35:16 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of
    thought to this {~1 year} before deciding that a "Do everything else"
    function unit was "overall" better than a couple of "near miss" FUs.
    So, IMUL comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at
    22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes
    For many of the functions you can do a lot by letting the final
    operation be a merging of the first/largest term, particularly if you
    do that with extended precision.
    I.e. something like fpatan2() works quite nicely this way, just not
    enough for exact rounding.
    You need to combine this with extended-precision range adjustment at
    the end.

    But a single incorrect rounding is 1 ULP all by itself.
    :-)
    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions which
    are exactly rounded.
    It is provably doable for float, in very close to the same cycle count
    as the best libraries in current use; double is "somewhat" harder to
    totally verify/prove.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 8 22:50:36 2026
    From Newsgroup: comp.arch

    On Thu, 8 Jan 2026 21:25:57 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 8 Jan 2026 12:50:32 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Waldek Hebisch wrote:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but
    leaves to software the last step of unpacking 4-bit BCD to
    8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is having bits in right places within wide word.
    I.e. you have 64 bits like that:
    '0123456789abcdef'. You want to convert it to pair of 64-bit
    numbers: '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'.
    That's a move from the BCD '0123456789abcdef' to the
    corresponding ASCII bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT DATA 8 UN 0123456789 // Unsigned Numeric 8
    nibble field OUTPUT DATA 8 UA // Unsigned
    Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields
    'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at
    least, get 3 ASCII digits per look-up. On a modern wide core, it is
    likely only marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or, may be, they have instruction for that as well, but it's
    not in DFP related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation
    on strings of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    Given such a (fast) transformation it is very easy to add the proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place,
    as sketched below.
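    A minimal C sketch of that mask-and-shift unpack (the helper name
    spread_nibbles is mine, not from any library): each nibble of a 64-bit
    packed-BCD value ends up in the low nibble of its own byte, after
    which OR-ing in the 0x30 zone gives ASCII digits.

    #include <stdint.h>
    #include <stdio.h>

    /* Spread the low 8 nibbles of x so each lands in the low nibble
       of one byte: pure shift-and-mask, no SIMD needed. */
    static uint64_t spread_nibbles(uint64_t x)
    {
        x &= 0xFFFFFFFFu;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFu;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFu;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0Fu;
        return x;
    }

    int main(void)
    {
        uint64_t bcd = 0x0123456789abcdefULL;
        /* OR in the ASCII zone 0x3_. Note that the "undigits" a..f
           come out as 0x3a..0x3f, i.e. not printable hex, matching
           the B3500 remark above. */
        uint64_t hi = spread_nibbles(bcd >> 32) | 0x3030303030303030ULL;
        uint64_t lo = spread_nibbles(bcd)       | 0x3030303030303030ULL;
        printf("%016llx %016llx\n", (unsigned long long)hi,
                                    (unsigned long long)lo);
        /* prints 3031323334353637 38393a3b3c3d3e3f */
        return 0;
    }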


    Intel (and then AMD) added the PDEP (and corresponding PEXT) opcode
    sometime within the last 20 years (probably less than 10?); it is
    perfect for this operation:


    Haswell. Officially launched on June 4, 2013. So 12.5 years ago.
    Time runs.

    ;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
    ;; pdep dst, src, mask: deposit src bits into the mask positions
    mov rbx, 0x0f0f0f0f0f0f0f0f
    pdep rax, rsi, rbx      ; low 8 nybbles -> 8 bytes
    shr rsi, 32
    pdep rdx, rsi, rbx      ; high 8 nybbles -> 8 bytes

    This is sub-5 cycles of latency.

    That's nice.
    I'm not sure if POWER has a similar instruction.


    It is also doable with much older CPUs using the permute/byte
    shuffle operation, with a bit more or less latency depending upon
    where the source and destination data resides (SIMD vs regular
    integer reg).

    Terje



    I don't understand that part. Do you suggest that there are better
    swizzle instructions than the unpack mentioned by Waldek Hebisch?
    So far, I don't see any. Unpack looks to me the most suitable.

    There are at least three ways to do it:

    a) PDEP in 64-bit regs

    b) PSHUFB and nybble masks using SSE/AVX regs

    c) PUNPCKLBW which expands bytes to words. Do it twice with a
    bytewise SHR 4 to select the upper nybbles and a mask to keep the
    lower nybbles of the first part (see the sketch below).
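    For concreteness, a compilable SSE2 sketch of method (c) as I read it
    (my code, not Terje's): bytewise nibble split via a 16-bit shift plus
    mask, then PUNPCKLBW to interleave, then OR in the zone.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 8 bytes = 16 packed-BCD digits, most significant pair first. */
        uint8_t in[8] = {0x01,0x23,0x45,0x67,0x89,0x01,0x23,0x45};
        __m128i x   = _mm_loadl_epi64((const __m128i *)in);
        __m128i m   = _mm_set1_epi8(0x0F);
        /* There is no byte shift in SSE2; shift 16-bit lanes and mask. */
        __m128i hi  = _mm_and_si128(_mm_srli_epi16(x, 4), m);
        __m128i lo  = _mm_and_si128(x, m);
        /* PUNPCKLBW: interleave high/low digits into written order. */
        __m128i dig = _mm_unpacklo_epi8(hi, lo);
        __m128i asc = _mm_or_si128(dig, _mm_set1_epi8(0x30));

        char out[17] = {0};
        _mm_storeu_si128((__m128i *)out, asc);
        printf("%s\n", out);   /* 0123456789012345 */
        return 0;
    }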

    Did you intend to use (c) or is there yet another method?

    Terje



    I'd use (c) if (a) is either not available or slow. The latter
    case applies to AMD Zen1/2. Otherwise I'd use (a). I don't see
    circumstances for preferring (b).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 9 01:24:17 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
    {a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    For many of the functions you can do a lot by letting the final
    operation be a merging of the first/largest term, particularly if you do that with extended precision.

    I.e. something like fpatan2() works quite nicely this way, just not
    enough for exact rounding.

    You need to combine this with extended precision range adjustment at the end.


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions which
    are exactly rounded.

    After the final addition, I know 1 of the top 3-bits is a 1, and I
    have 69-bits in the accumulated result. I also know that the poly-
    nomial error is below the 3rd least significant bit.

    It is provably doable for float, in very close to the same cycle count
    as the best libraries in current use; double is "somewhat" harder to
    totally verify/prove.

    I have logic (patented) that allows the FU to raise an UNCERTAIN
    rounding exception, so SW can take over and change 0.5002 into
    0.5000 at the cost of the exception and running the long-winded
    SW correctly-rounded subroutine. I expect this to be used only
    during verification and on the 3 machines owned by Kahan, Coonen,
    and someone else I forgot.
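    A software analogue of that uncertain-rounding test is Ziv's strategy:
    evaluate with extra precision and fall back to a slow correctly-rounded
    path only when the estimate sits too close to a rounding boundary. A
    minimal sketch, assuming long double is wider than double (true on
    x87, not on all targets); crsin_slow() is a hypothetical stand-in for
    the "long-winded SW" routine, not a real library call.

    #include <math.h>

    static double crsin_slow(double x)
    {
        /* stand-in: a real fallback would re-evaluate with arbitrary
           precision (e.g. MPFR) and round correctly */
        return (double)sinl((long double)x);
    }

    double sin_checked(double x)
    {
        long double y   = sinl((long double)x);  /* extended estimate */
        double      r   = (double)y;             /* rounded to double */
        long double ulp = nextafter(fabs(r), INFINITY) - fabs(r);
        long double err = fabsl(y - (long double)r);
        /* If the residual is within a tiny guard band of half an ulp,
           the estimate cannot decide the rounding: take the trap. */
        if (fabsl(err - ulp / 2) < ulp * 0x1p-10L)
            return crsin_slow(x);
        return r;
    }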

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Jan 9 16:32:30 2026
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd-s, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Jan 10 18:02:46 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:
    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions which
    are exactly rounded.

    After the final addition, I know 1 of the top 3-bits is a 1, and I
    have 69-bits in the accumulated result. I also know that the poly-
    nomial error is below the 3rd least significant bit.

    It is provably doable for float, in very close to the same cycle count
    as the best libraries in current use; double is "somewhat" harder to
    totally verify/prove.

    I have logic (patented) that allows the FU to raise an UNCERTAIN
    rounding exception, so SW can take over and change 0.5002 into
    0.5000 at the cost of the exception and running the long-winded
    SW correctly-rounded subroutine. I expect this to be used only
    during verification and on the 3 machines owned by Kahan, Coonen,
    and someone else I forgot.



    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Jan 10 23:21:40 2026
    From Newsgroup: comp.arch

    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it. I had
    a short e-mail exchange with the main DFP advocate at IBM. My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers. And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 00:03:36 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work acceptably with it. Lots of organisations don't see anything obvious for them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it. I had
    a short e-mail exchange with the main DFP advocate at IBM.

    Mike Cow<something>shaw ??

    My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers. And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 00:33:22 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128 in
    order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I would hope for half throughput and say 1-2 clocks more latency
    for addition. For multiplication I would expect 1/4 throughput
    and maybe twice the latency of binary64.

    As of today, there is double-double. IIRC double-double addition
    needs 6 double additions, that is way too much. AFAICS
    quantifying double-double multiplication performance is more
    tricky: there is a relatively easy implementation using
    64-bit multiply-add (it takes advantage of the fact that multiply-add
    can deliver low-order bits that only contribute to rounding in
    normal FP multiply), but this implements normal multiply in
    terms of multiply-add. Implementing multiply-add takes
    more effort and implementing multiply using only multiply
    takes even more effort.
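    For reference, the six-addition figure is the canonical TwoSum building
    block (Knuth), which turns two doubles into a double-double; a minimal
    C sketch:

    #include <stdio.h>

    /* TwoSum: *s + *e == a + b exactly, *s is the rounded sum and *e
       the rounding error. Exactly 6 FP additions/subtractions. */
    static void two_sum(double a, double b, double *s, double *e)
    {
        double sum   = a + b;            /* 1 */
        double bb    = sum - a;          /* 2 */
        double err_a = a - (sum - bb);   /* 3, 4 */
        double err_b = b - bb;           /* 5 */
        *s = sum;
        *e = err_a + err_b;              /* 6 */
    }

    int main(void)
    {
        double s, e;
        two_sum(1.0, 0x1p-60, &s, &e);
        printf("s=%a e=%a\n", s, e);     /* s=0x1p+0 e=0x1p-60 */
        return 0;
    }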

    Anyway, to make sense hardware should be faster than double-double.

    Anecdote.
    A few months ago I tried to design very long decimation filters with stop
    band attenuation of ~160 dB.
    Matlab's implementation of the Parks–McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. The Gnu Octave implementation was somewhat worse
    yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT engine
    for N around 32K.
    The second was solving a system of linear equations for N around 1000 and
    a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    Hmm, I did estimates for FFT and my result was that in the classic
    implementation each layer of butterflies essentially additively
    contributes to the L^2 error. So a 32K point radix-2 FFT (log2(32768)
    = 15 layers) has 15 times bigger L^2 error than a single layer of
    butterflies, which has error about 4 times machine epsilon. With a
    radix-4 FFT the error of a single butterfly is larger, but the number
    of layers is halved and the result is similar. So, in terms of L^2
    error a 32K point FFT needs very little extra precision, essentially
    6 bits. But Remez works in terms of the supremum norm and at 32K
    points that may need an extra 8 bits. So it is possible that the
    80-bit format would have enough accuracy for your purpose.

    I looked at FFT as one of the possible ways to implement convolution
    of integer sequences with an exact result. Alas, double precision
    computation is good only for about 20 bits for relatively short
    sequences and less for longer ones. It seems that integer-only
    computation is much faster. Fast 128-bit floating point would
    shift balance towards floating point, but probably not enough
    to beat integer computations.

    For Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part in DP. Quad-precision was applied only during final
    solver stages - forward propagation, back propagation, calculation of residual error vector and repetition of forward and back propagation.
    All those parts are O(N**2). That modification was sufficient to
    improve precision of result almost to the best possible in DP FP format.
    And sufficient for good convergence of the Parks–McClellan algorithm.

    Yes, as long as your system is reasonably well conditioned it is
    easy to improve accuracy in a postprocessing step. OTOH system
    may be so badly conditioned that solving in double precision leads
    to catastrophic errors while solving in higher precision works
    fine.
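    What Michael describes is classic mixed-precision iterative refinement:
    keep the O(N^3) LU factorization in double and spend the wider format
    only on the O(N^2) residual and re-solve. A sketch, with long double
    standing in for binary128 (gcc users would use __float128/libquadmath);
    lu_solve() is a hypothetical helper that forward/back-substitutes with
    an existing factorization, not a real library function.

    #include <stddef.h>

    typedef long double wide_t;   /* stand-in for binary128 */

    void lu_solve(size_t n, const double *LU, const int *piv,
                  const double *rhs, double *x);   /* assumed given */

    void refine(size_t n, const double *A, const double *LU,
                const int *piv, const double *b, double *x, int steps)
    {
        enum { MAXN = 2048 };             /* fixed cap for the sketch */
        double r[MAXN], d[MAXN];
        for (int it = 0; it < steps; it++) {
            for (size_t i = 0; i < n; i++) {
                wide_t acc = (wide_t)b[i];     /* residual b - A*x,   */
                for (size_t j = 0; j < n; j++) /* accumulated wide    */
                    acc -= (wide_t)A[i*n + j] * (wide_t)x[j];
                r[i] = (double)acc;
            }
            lu_solve(n, LU, piv, r, d);   /* reuse the double LU  */
            for (size_t i = 0; i < n; i++)
                x[i] += d[i];             /* apply the correction */
        }
    }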

    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    Well, I have an arbitrary precision implementation of the LLL algorithm.
    It works, but it is about 100 times slower than using double precision
    math. The trouble is, in the worst case double precision LLL in
    dimension 53 may fail to converge. On tame data LLL is expected
    to work in higher dimensions, but even on tame data at dimension
    about 250 double precision LLL is expected to fail. In a sense
    this is a no-win situation, as the needed number of bits grows linearly
    with dimension (both worst case and tame case). One can try to
    use double precision when it works. But it is frustrating
    how much effort one needs to spend to get better speed using
    the FPU. And especially, there is a contrast with integer math,
    where it is relatively easy to get higher precision when
    needed. But for integer math RISC-V tries to change this by not
    providing a carry bit. And, AFAICS SSE/AVX do not provide the high-
    order bits of multiplication (no vectored MULHI instruction), so
    multiprecision multiplies must go through the scalar multiplier.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 00:59:51 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 07 Jan 2026 20:05:17 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>,
    already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want
    binary128 in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because with non-answers or answers like
    that nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Assuming 128-bit operands are delivered in 1 cycle and 128-bit
    results are delivered in 1 cycle::


    If we are talking about SIMD of the same width (measured in bits) as
    SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
    and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
    that fully pipelined binary128 operations are a non-starter, because it
    would blow your power and thermal budget.

    I agree, however a single 128-bit FPU would fit inside a reasonable
    power budget.

    One full-width result (i.e. 8
    binary128 results) every 2 cycles sounds somewhat more realistic.

    Likely still over a reasonable power budget.

    After all, in a general-purpose CPU binary128, if implemented at all, is
    a proverbial tail that can't be allowed to wag the dog.

    We build (and call) our current machines 64-bits because that is the
    size of the register files (not including SIMD/Vector) and because
    we can run the scalar unit at rated clock frequency (non SIMD/Vector) essentially continuously.

    Once we step over the scalar width, power goes up 2×-4× and we get a
    couple of hundred cycles before frequency throttling. Thus, we cannot
    in general, run SIMD/Vector at rated frequency continuously.

    I understand that multipliers are big and power hungry. I know
    almost nothing about permute unit, but it too looks like big
    and power hungry thing. But how bad is it when one is doing
    simple operations say mostly in registers.

    Nor can
    we, at the present time, build a memory system that can properly feed a
    SIMD/Vector RF so that one can use all of the available calculation lanes.

    There is matrix multiply which is doing n^3 multiplies on n^2
    data. I need polynomial multiplication, that is n^2 multiplies
    on size n data. There are real computations where a piece or
    two pieces of data go through several steps. So there are a
    lot of compute-intensive problems where processing units can
    do work on data in registers or from L1 cache.

    So if the compute units can do the work, it is still useful,
    even if other problems are memory bound.

    {HBM is approaching this point, however--it becomes
    more like B-memory from the CRAY-2 than main memory for applications
    that can use that much B-memory effectively.}

    OTOH, if we define our binary128 to use only the least-significant 128-bit
    lane of our 512-bit register and only build b128 capabilities into one
    of our pair of FPUs then full pipelining (i.e. 1 result per cycle) looks
    like a good choice, at least from power/thermal perspective. That is,
    as long as designers found a way to avoid a hot spot.

    We could, instead, treat them as pairs of GPRs--like we did in Mc 88120
    and still not need SIMD/Vectors.

    128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.

    128-bit Fmul requires that the multiplier tree be 64×64 instead of
    53×53 (1.46× bigger tree, 1.22× bigger FU), and would/should be 3-4
    cycles longer than 64-bit Fmul. If you wanted to be "really clever"
    you could use a 59×59 tree and the FU is only 1.12× bigger; but here
    you could not use the tree for Integer MUL.

    Terje



    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 01:14:26 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
    {a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    It is clear that when your rounding is different than "IEEE correct"
    rounding, then there is 1 ULP difference between your result and
    IEEE rounding. But claims like max 0.5002 ulp mean that there
    is at most 0.5002 ulp difference between the true result and the result
    delivered by your FPU. Or do you really mean that there may be
    1 ULP difference between the true result and your FPU?
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 11:21:27 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it. I had
    a short e-mail exchange with the main DFP advocate at IBM.

    Mike Cow<something>shaw ??

    Mike Cowlishaw, yes.

    My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers. And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.

    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Jan 11 12:52:03 2026
    From Newsgroup: comp.arch

    Waldek Hebisch wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
    {a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    It is clear that when your rounding is different than "IEEE correct"
    rounding, then there is 1 ULP difference between your result and
    IEEE rounding. But claims like max 0.5002 ulp mean that there
    is at most 0.5002 ulp difference between the true result and the result
    delivered by your FPU. Or do you really mean that there may be
    1 ULP difference between the true result and your FPU?

    It is well established that when you measure the accuracy of special
    functions, you compare against the perfect result, which is never more
    than 0.5 ulp away from the arbitrary/infinitely precise exact result.
    Stating that some algorithm delivers 0.5002 ulp means that with the
    worst possible input, the before-rounding result is 0.0002 ulp away from the real/exact result, and in such a way that rounding will go in the
    wrong direction.
    It is perfectly OK to be 0.4 ulp wrong as long as you are within the
    correct 0.5 ulp wide interval, but in reality the only way to deliver
    results like Mitch is by being nearly exact everywhere.
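    Measured concretely: the error of a double result f against a much more
    accurate reference F, in ulps of f, is |f - F| / ulp(f). A tiny sketch
    (helper name mine; assumes long double is wider than double):

    #include <math.h>
    #include <stdio.h>

    /* Error of f relative to the reference F, in ulps of f. */
    static double ulp_error(double f, long double F)
    {
        double ulp = nextafter(fabs(f), INFINITY) - fabs(f);
        return (double)(fabsl((long double)f - F) / (long double)ulp);
    }

    int main(void)
    {
        double x = 0.1;
        printf("%.6f ulp\n", ulp_error(sin(x), sinl((long double)x)));
        return 0;
    }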
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 11:49:50 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd-s, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps, so the computational
    cost is much higher than numerics (even arbitrary precision
    numerics tend to be much faster). As a little example
    in a slightly different spirit, one can try to run an approximate
    root finding procedure for polynomials in rational
    arithmetic. This solves the problem of potential numerical
    instability, but leads to large fractions which are
    much more costly than arbitrary precision floating point
    (which with some care deals with instability too).
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 11 14:31:54 2026
    From Newsgroup: comp.arch

    On Sun, 11 Jan 2026 00:33:22 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:
    Michael S <already5chosen@yahoo.com> wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of
    the gcc support shows that "build it and they will come" does
    not work out for DFP.

    The world has got very used to IEEE BFP, and has solutions that
    work acceptably with it. Lots of organisations don't see anything
    obvious for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did
    try to interest AMD in the idea in the early days of x86-64, but
    they didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I would hope for half throughput and say 1-2 clocks more latency
    for addition.
    That sounds doable from a power and thermal perspective, but does not
    sound sufficiently important for anybody to bother.
    Having addition at half throughput of binary64 instead of quarter would
    not sell you more chips.
    For multiplication I would expect 1/4 throughput
    and maybe twice the latency of binary64.

    As of today, there is double-double. IIRC double-double addition
    needs 6 double additions, that is way too much. AFAICS
    quantifying double-double multiplication performance is more
    tricky: there is a relatively easy implementation using
    64-bit multiply-add (it takes advantage of the fact that multiply-add
    can deliver low-order bits that only contribute to rounding in
    normal FP multiply), but this implements normal multiply in
    terms of multiply-add. Implementing multiply-add takes
    more effort and implementing multiply using only multiply
    takes even more effort.

    Are there compilers that are able to vectorize double-double? If not,
    any talk about throughput is pointless.
    Anyway, to make sense hardware should be faster than double-double.

    I disagree. Numeric properties of binary128 are better than
    double-double. And far easier to analyze, both deeply and applying
    rules of thumb.
    As far as I am concerned, the low performance limit for hardware
    implementation of binary128 is set not by double-double, but by
    competent implementation of binary128 with integer math, including
    competent ABI. Current soft binary128 in gcc is ~ factor of two away
    from that in add/mul/fma, larger factor in div, larger yet in sqrt.
    As to ABI incompetence in this case, it is hard to quantify.
    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop band attenuation of ~160 dB.
    Matlab's implementation of the Parks–McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black
    magic) was not up to the task. The Gnu Octave implementation was
    somewhat worse yet.
    When I started to investigate the reasons I found out that there
    were actually two of them, both related to insufficient precision
    of the series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    and a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to solution.

    For DCT (FFT) I went for full re-implementation at higher
    precision.

    Hmm, I did estimates for FFT and my result was that in the classic
    implementation each layer of butterflies essentially additively
    contributes to the L^2 error. So a 32K point radix-2 FFT has 15 times
    bigger L^2 error than a single layer of butterflies, which has error
    about 4 times machine epsilon. With a radix-4 FFT the error of a single
    butterfly is larger, but the number of layers is halved and the result
    is similar. So, in terms of L^2 error a 32K point FFT needs very
    little extra precision, essentially 6 bits.
    My estimate was 7.5 bits.
    But Remez works
    in terms of the supremum norm and at 32K points that may need an extra
    8 bits. So it is possible that the 80-bit format would have enough
    accuracy for your purpose.

    Yes, 80-bit would suffice.
    But since it was not a time-critical part, I chose 128 bits.
    I looked at FFT as one of the possible ways to implement convolution
    of integer sequences with an exact result. Alas, double precision
    computation is good only for about 20 bits for relatively short
    sequences and less for longer ones. It seems that integer-only
    computation is much faster. Fast 128-bit floating point would
    shift balance towards floating point, but probably not enough
    to beat integer computations.

    For Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part in DP. Quad-precision was applied only during
    final solver stages - forward propagation, back propagation,
    calculation of residual error vector and repetition of forward and
    back propagation. All those parts are O(N**2). That modification
    was sufficient to improve precision of result almost to the best
    possible in DP FP format. And sufficient for good convergence of the
    Parks–McClellan algorithm.

    Yes, as long as your system is reasonably well conditioned it is
    easy to improve accuracy in a postprocessing step. OTOH system
    may be so badly conditioned that solving in double precision leads
    to catastrophic errors while solving in higher precision works
    fine.

    That is part of the black magic that I mentioned above. The Parks–McClellan
    algorithm works in the acos(x) domain. It leads to better conditioned
    linear systems than when doing Remez as taken from math books. At least
    it's true for the sort of filters that are suitable for decimation.
    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when
    running on rather old hardware. And it's not like calculations here
    were not heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of worrying that they would be too slow.

    Well, I have arbitrary precision implementation of LLL algorithm.
    It works, but it is about 100 times slower than using double precision
    math.
    My point is that John should measure first. Only after measuring does he
    have full rights to cry "Slow!".
    The trouble is, in the worst case double precision LLL in
    dimension 53 may fail to converge. On tame data LLL is expected
    to work in higher dimensions, but even on tame data at dimension
    about 250 double precision LLL is expected to fail. In a sense
    this is a no-win situation, as the needed number of bits grows linearly
    with dimension (both worst case and tame case). One can try to
    use double precision when it works. But it is frustrating
    how much effort one needs to spend to get better speed using
    the FPU. And especially, there is a contrast with integer math,
    where it is relatively easy to get higher precision when
    needed. But for integer math RISC-V tries to change this by not
    providing a carry bit. And, AFAICS SSE/AVX do not provide the high-
    order bits of multiplication (no vectored MULHI instruction), so
    multiprecision multiplies must go through the scalar multiplier.

    Vectored MULHI exists in SSE/AVX, but it is intended for image
    processing and for low-end audio processing (16-bit input and
    output). So it does not help.
    They have width-doubling multiplication that is closer to your need
    (look for PMULUDQ). It is still rather narrow (32x32=64). However on
    modern high-end Intel and AMD it provides 2x8 = 16 multiplications per
    clock, so at least potentially it has higher bandwidth than a
    single 64x64=128-bit multiplication in the non-SIMD domain.
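    A quick demonstration of PMULUDQ via intrinsics: each _mm_mul_epu32
    takes the even 32-bit lanes of its inputs and produces full 64-bit
    products, which is the widening multiply a multiprecision kernel wants.

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        __m128i a = _mm_set_epi32(0, -1, 0, -1);  /* lanes 2,0 = 0xFFFFFFFF */
        __m128i b = _mm_set_epi32(0, -1, 0, 2);
        __m128i p = _mm_mul_epu32(a, b);          /* two 64-bit products */

        uint64_t out[2];
        _mm_storeu_si128((__m128i *)out, p);
        printf("%016llx %016llx\n", (unsigned long long)out[1],
                                    (unsigned long long)out[0]);
        /* fffffffe00000001 = (2^32-1)^2,
           00000001fffffffe = 2*(2^32-1)  */
        return 0;
    }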
    It seems the most serious problem for attempts to use AVX/AVX512 for
    very high precision integer math is the absence of support for carry
    chains for items wider than 64 bits.
    However there are a few interesting ideas for how to deal with that
    limitation by means of speculation and of replacement of data
    dependencies with control dependencies. The core idea is that a carry
    caused by a carry is extremely rare, so it can be profitably predicted as
    not happening. It normally does not matter how slow the fix is when
    it happens nevertheless.
    For more concrete examples you can look at the discussion of 3-way
    addition of 64Kbit integers that happened here a few months ago.
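    A scalar sketch of that speculation idea (helper name and layout mine):
    pass 1 adds all limbs independently (the part SIMD can do), recording
    which limbs generate a carry and which would merely propagate one
    (sum == all-ones, the rare "carry caused by carry" case); pass 2
    resolves the carries branch-free.

    #include <stdint.h>
    #include <stddef.h>

    void spec_add(uint64_t *dst, const uint64_t *a,
                  const uint64_t *b, size_t n)
    {
        enum { MAXN = 1024 };              /* fixed cap for the sketch */
        uint8_t gen[MAXN], prop[MAXN];

        for (size_t i = 0; i < n; i++) {   /* pass 1: independent adds */
            uint64_t s = a[i] + b[i];
            gen[i]  = s < a[i];            /* limb generates a carry   */
            prop[i] = (s == ~(uint64_t)0); /* limb would pass one on   */
            dst[i]  = s;
        }
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {   /* pass 2: resolve carries  */
            dst[i] += carry;
            carry = gen[i] | (prop[i] & carry);
        }
        /* A truly speculative version would skip pass 2 unless some
           prop[] bit sits next to a carry, as described above. */
    }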
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 11 15:01:16 2026
    From Newsgroup: comp.arch

    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions which
    are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    I would guess that they are mostly software and hardware verification
    people rather than people that use transcendental functions in
    engineering and physical calculations.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd then the approximate
    function f(x) is also even or odd.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)


    In practice, it's probably unlikely to have these invariants preserved
    when the RMS error is 0.75 ULP, but for RMS error = 0.60-0.65 ULP I don't
    see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near the edges of implementation-specific ranges where one
    has to be careful. (A quick brute-force check of both invariants is
    sketched below.)
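    A brute-force probe of both invariants over a monotonic stretch, with
    libm's sin as the stand-in implementation (my test code, nobody's
    library):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int bad_odd = 0, bad_mono = 0;
        /* 1e6 ulp-steps from 0.25 stay far below pi/2, so sin is
           strictly increasing over the whole scanned interval. */
        double x = 0.25;
        for (int i = 0; i < 1000000; i++) {
            double x2 = nextafter(x, 2.0);
            if (sin(-x) != -sin(x)) bad_odd++;   /* invariant 1 */
            if (sin(x2) < sin(x))   bad_mono++;  /* invariant 2 */
            x = x2;
        }
        printf("odd: %d, monotonicity: %d violations\n",
               bad_odd, bad_mono);
        return 0;
    }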

    It is provably doable for float, in very close to the same cycle
    count as the best libraries in current use; double is "somewhat"
    harder to totally verify/prove.

    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Jan 11 08:38:36 2026
    From Newsgroup: comp.arch

    On 1/10/2026 3:21 PM, Waldek Hebisch wrote:
    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it.

    While I have no personal knowledge of this, I don't doubt it.


    I had
    a short e-mail exchange with the main DFP advocate at IBM. My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers.

    Or perhaps (again, no personal knowledge - just speculation) that
    supporting an additional data type in the IBM COBOL (and, for what it's
    worth, PL/1) compilers is easier if there was hardware support for it.

    And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    For the existing Z series base, I suspect anything related to C++ is not significant, i.e. about as important as DFP is to the typical C++ user. :-)


    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.

    Yes.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 18:07:53 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
    {a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    It is clear that when your rounding is different than "IEEE correct"
    rounding, then there is 1 ULP difference between your result and
    IEEE rounding. But claims like max 0.5002 ulp mean that there
    is at most 0.5002 ulp difference between the true result and the result
    delivered by your FPU. Or do you really mean that there may be
    1 ULP difference between the true result and your FPU?

    It means I make a single IEEE rounding error once every several thousand calculations; AND I can achieve this in all IEEE rounding modes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 18:08:51 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it. I had
    a short e-mail exchange with the main DFP advocate at IBM.

    Mike Cow<something>shaw ??

    Mike Cowlishaw, yes.

    Thanks: my memory had it as Cowlingshaw--which I knew was wrong.

    My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers. And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 18:11:00 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possiblity here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd, but if you try to save on
    gcd-s and work with unsimplified fractions you may get trenendous
    blowup (like million digit numbers in what should be reasonably
    small computation).
    - general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    sytem, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps,

    Between 2×n+6 and 2×n+13, which is 3-9 bits larger than the
    next higher precision.

    so the computational
    cost is much higher than numerics (even arbitrary precision
    numerics tend to be much faster). As a little example
    in a slightly different spirit, one can try to run an approximate
    root finding procedure for polynomials in rational
    arithmetic. This solves the problem of potential numerical
    instability, but leads to large fractions which are
    much more costly than arbitrary precision floating point
    (which with some care deals with instability too).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 18:18:00 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions which
    are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    The problem here is that even when one gets all the rounding correct,
    one has still lost various algebraic identities.

    CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0
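    Easy to see even with an ordinary libm standing in for CRSIN/CRCOS:
    each correctly rounded result carries up to 0.5 ulp of error, and the
    squares and the sum round again, so the identity need not hold
    bit-exactly. A quick probe:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int off = 0;
        for (int i = 1; i <= 1000000; i++) {
            double x = i * 1e-3;
            double s = sin(x), c = cos(x);
            if (s * s + c * c != 1.0) off++;   /* identity lost? */
        }
        printf("identity missed in %d of 1000000 cases\n", off);
        return 0;
    }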

    I would guess that they are mostly software and hardware verification
    people rather than people that use transcendental functions in
    engineering and physical calculations.

    Numerical people, almost never engineers, physicists, or chemists.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd then the approximate
    function f(x) is also even or odd.

    Odd functions need to be monotonic around zero.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)

    Small scale Monotonicity.


    In practice, it's probably unlikely to have these invariants preserved
    when the RMS error is 0.75 ULP, but for RMS error = 0.60-0.65 ULP I don't
    see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near edges of implementation-specific ranges where one
    has to be careful.

    It is provably doable for float, in very close to the same cycle
    count as the best libraries in current use; double is "somewhat"
    harder to totally verify/prove.

    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 11 20:50:04 2026
    From Newsgroup: comp.arch

    On Sun, 11 Jan 2026 18:18:00 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions
    which are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    The problem here is that even when one gets all the rounding
    correct, one has still lost various algebraic identities.

    CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0

    I would guess that they are mostly software and hardware
    verification people rather than people that use transcendental
    functions in engineering and physical calculations.

    Numerical people, almost never engineers, physicists, or chemists.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd then the approximate
    function f(x) is also even or odd.

    Odd functions need to be monotonic around zero.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)

    Small scale Monotonicity.


    Yes, that's a better name.
    I just wanted to express it as simple non-equality conditions and made
    it too simple and stronger than necessary.
    In fact I would not complain if my conditions do not hold when F(x) has
    an extremum in between x and x+ULP. That is, it's nice if the condition
    holds here as well, but it is relatively less important than holding on
    monotonic intervals.


    In practice, it's probably unlikely to have these invariants
    preserved when the RMS error is 0.75 ULP, but for RMS error = 0.60-0.65
    ULP I don't see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near edges of implementation-specific ranges where one
    has to be careful.

    It is provably doable for float, in very close to the same cycle
    count as the best libraries in current use; double is "somewhat"
    harder to totally verify/prove.

    Terje




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Jan 11 19:03:48 2026
    From Newsgroup: comp.arch

    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers.

    Or perhaps (again, no personal knowledge - just speculation) that
    supporting an additional data type in the IBM COBOL (and, for what it's
    worth, PL/1) compilers is easier if there was hardware support for it.

    Having written a few compilers, I can say that it is equally easy
    within epsilon to emit a DFADD instruction as the equivalent of CALL
    DFADD. I could believe it's politically easier, hey we'll look dumb if
    we announce this swell DFP feature and our own compilers don't use it.

    And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    For the existing Z series base, I suspect anything related to C++ is not
    significant, i.e. about as important as DFP is to the typical C++ user. :-)

    Maybe. Remember that IBM has full support for linux on Z. There used
    to be a pricing hack (may still be) where you could buy a lower cost
    linux-only Z series processor which was just a regular processor with
    a microcode tweak to keep it from booting z/OS.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Jan 11 12:40:55 2026
    From Newsgroup: comp.arch

    On 1/11/2026 10:07 AM, MitchAlsup wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59×59 bits
    {a bit more than ½ of them get 1 ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    It is clear that when your rounding is different from "IEEE correct"
    rounding, then there is 1 ULP difference between your result and
    IEEE rounding. But claims like max 0.5002 ulp mean that there
    is at most 0.5002 ulp difference between the true result and the result
    delivered by your FPU. Or do you really mean that there may be
    1 ULP difference between the true result and your FPU?

    It means I make a single IEEE rounding error once every several thousand calculations; AND I can achieve this in all IEEE rounding modes.

    Here is some older experimental code of mine that is HYPER sensitive to floating point errors. I was going to try another method, but I forgot
    the damn name of it. Uhhh, wait. Unums?

    https://groups.google.com/g/comp.lang.c++/c/bB1wA4wvoFc/m/OTccTiXLAgAJ


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 22:11:55 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 11 Jan 2026 18:18:00 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions
    which are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    The problem here is that even when one gets all the rounding
    correct, one has still lost various algebraic identities:

    CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0

    I would guess that they are mostly software and hardware
    verification people rather than people that use transcendental
    functions in engineering and physical calculations.

    Numerical people, almost never engineers, physicists, or chemists.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd, then the approximate function
    f(x) is also even or odd.

    Odd functions need to be monotonic around zero.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)

    Small scale Monotonicity.


    Yes, that's a better name.
    I just wanted to express it as simple non-equality conditions and made
    it too simple and stronger than necessary.
    In fact I would not complain if my conditions do not hold when F(x) has an
    extremum between x and x+ULP. That is, it's nice if the condition holds
    here as well, but it is relatively less important than holding on
    monotonic intervals.

    Consider COS(x) near 0.0

    The transition from 1.0 to .99999999 and later to 0.99999998 (in both
    directions) is small scale monotonic. AND it is exactly at this
    transition where my rounding takes the biggest number of hits (incorrect
    roundings).

    Seen in binary, one has a prerounded result of:

    0.1111111111 1111111111 1111111111 1111111111 1111111111 1111 and digits

    behind where rounding transpires. If those digits start with 01111111111
    or 1000000000 then we are in the situation where we cannot know if we can
    choose a correct rounding; the next term of the polynomial could sway the
    balance. J.-M. Muller, chapter 11, shows that one might need as many as
    2N+13 bits in order to get the rounding "correct". This must include
    polynomial error, arithmetic error, and certain boundary conditions.

    If rounding that begins 01 contains a second 0, correct rounding happens.
    If rounding that begins 10 contains a second 1, correct rounding happens.

    And it is exactly at these points that
    a) while the result remains monotonic, the point of change can be "off"
    by a small number of ULPs
    b) when the slope is shallow, one can get several rounding errors in a row
    without losing the property of monotonicity or overall RMS.
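
    The 01.../10... rule is mechanical enough to show in a few lines. A toy
    sketch (my illustration of the rule as stated, not the actual FU logic):
    _____________________________________
    // Given the G guard bits sitting below the rounding point, report
    // whether round-to-nearest is already decided, or whether the next
    // polynomial term could still sway it (patterns 0111...1 / 1000...0).
    #include <cstdint>
    #include <cstdio>

    bool rounding_decidable(uint32_t guard, int G) {
        uint32_t rest_mask = (1u << (G - 1)) - 1; // bits after the first
        uint32_t top = (guard >> (G - 1)) & 1u;   // first guard bit
        uint32_t rest = guard & rest_mask;
        if (top == 0) return rest != rest_mask;   // 0111...1 stays undecided
        return rest != 0;                         // 1000...0 stays undecided
    }

    int main() {
        const int G = 11;                             // e.g. 64-53 guard bits
        printf("%d\n", rounding_decidable(0x3FF, G)); // 01111111111 -> 0 (hard)
        printf("%d\n", rounding_decidable(0x400, G)); // 10000000000 -> 0 (hard)
        printf("%d\n", rounding_decidable(0x3FE, G)); // 01111111110 -> 1 (safe)
        return 0;
    }
    _____________________________________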



    In practice, it's probably unlikely to have these invariants
    preserved when the RMS error is 0.75 ULP, but for an RMS error of
    0.60-0.65 ULP I don't see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near the edges of implementation-specific ranges where one
    has to be careful.

    It is provably doable for float, in very close to the same cycle
    count as the best libraries in current use; double is "somewhat"
    harder to totally verify/prove.

    Terje




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 22:30:14 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> wrote:
    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions which
    are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    I would guess that they are mostly software and hardware verification
    people rather than people that use transcendental functions in
    engineering and physical calculations.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd, then the approximate function
    f(x) is also even or odd.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)


    In practice, it's probably unlikely to have these invariants preserved
    when the RMS error is 0.75 ULP, but for an RMS error of 0.60-0.65 ULP I
    don't see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near the edges of implementation-specific ranges where one
    has to be careful.

    This property is independent of the magnitude of the error. For
    example, a nominally "double" routine may deliver correctly rounded
    single-precision results. Of course the errors are huge, but the
    property holds. More realistically, monotonic behaviour can
    be obtained as a composition of monotonic operations. If done
    in software such a composition may produce more than 1 ulp error,
    but still be monotonic where required.
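
    A quick illustration of the composition point (a probe, not a proof:
    exp is not guaranteed correctly rounded, but IEEE sqrt is, and both
    steps are nondecreasing, so sqrt(exp(x)) should stay nondecreasing
    even though its total error can exceed 1 ulp):
    _____________________________________
    // Walk adjacent doubles and check that the two-step composition
    // sqrt(exp(x)) never decreases.
    #include <cmath>
    #include <cstdio>

    int main() {
        double x = 0.5;
        long bad = 0;
        for (long i = 0; i < 1000000; ++i) {
            double xn = std::nextafter(x, 2.0);
            if (std::sqrt(std::exp(xn)) < std::sqrt(std::exp(x))) ++bad;
            x = xn;
        }
        printf("%ld monotonicity violations\n", bad);
        return 0;
    }
    _____________________________________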
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Mon Jan 12 00:37:20 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation, numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be a reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps,

    Between 2N+6 and 2N+13, which is 3-9 bits larger than the
    next higher precision.

    You are talking about a specific, rather special problem.
    A reasonably typical task in exact computations is to compute the
    determinant of an n by n matrix with k-bit integer entries.
    Sometimes k is large, but k <= 10 is frequent. Using
    reasonably normal arithmetic operations you need slightly
    more than n*k bits at intermediate steps. For a similar
    matrix with rational entries the needed number of bits may
    be as large as n^2*k. If you skip simplifications of
    fractions at intermediate steps your numbers may grow
    exponentially with n. In the root-finding problem that
    I mentioned below, to get k bits of accuracy you need
    to evaluate the polynomial at a k-bit number. If you do
    the evaluation in exact arithmetic, then at intermediate
    steps you get n*k bit numbers, where n is the degree of the
    polynomial. OTOH in numeric computation you can get a
    good result with a much smaller number of bits (though the
    analysis and its result are complex), but growing
    with n.

    so the computational
    cost is much higher than numerics (even arbitrary precision
    numerics tend to be much faster). As a little example
    in a slightly different spirit, one can try to run an approximate
    root-finding procedure for polynomials in rational
    arithmetic. This solves the problem of potential numerical
    instability, but leads to large fractions which are
    much more costly than arbitrary precision floating point
    (which with some care deals with instability too).
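
    For the determinant case, the "slightly more than n*k bits" estimate
    can be made concrete with Hadamard's bound: |det A| <= product of the
    row norms <= (sqrt(n) * 2^k)^n, i.e. about n*(k + log2(n)/2) bits.
    A small sketch of that bound (my illustration):
    _____________________________________
    // Upper bound, in bits, on the exact determinant of an n x n matrix
    // with k-bit integer entries, via Hadamard's inequality.
    #include <cmath>
    #include <cstdio>

    double det_bits_bound(int n, int k) {
        // each entry < 2^k, so every row 2-norm is <= sqrt(n) * 2^k
        return n * (k + 0.5 * std::log2((double)n));
    }

    int main() {
        int ns[] = { 4, 16, 64 };
        for (int n : ns)
            printf("n=%2d k=10 -> <= %.0f bits\n", n, det_bits_bound(n, 10));
        return 0;
    }
    _____________________________________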

    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jan 12 02:05:15 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation, numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be a reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps,

    Between 2N+6 and 2N+13, which is 3-9 bits larger than the
    next higher precision.

    You are talking about a specific, rather special problem.

    Yes, exactly, where the exact result is either known or computable
    with currently known methods/means. And it's all in the Muller book.
    All I did was to take typical elementary functions and make the
    evaluation of them similar, in clock cycles, to FDIV of the same
    operand size; and for this gain in performance, I am willing to
    sacrifice the merest loss in precision: 1 rounding error every
    "quite large number of calculations"

    A reasonably typical task in exact computations is to compute the
    determinant of an n by n matrix with k-bit integer entries.
    Sometimes k is large, but k <= 10 is frequent. Using
    reasonably normal arithmetic operations you need slightly
    more than n*k bits at intermediate steps. For a similar
    matrix with rational entries the needed number of bits may
    be as large as n^2*k. If you skip simplifications of
    fractions at intermediate steps your numbers may grow
    exponentially with n. In the root-finding problem that
    I mentioned below, to get k bits of accuracy you need
    to evaluate the polynomial at a k-bit number. If you do
    the evaluation in exact arithmetic, then at intermediate
    steps you get n*k bit numbers, where n is the degree of the
    polynomial. OTOH in numeric computation you can get a
    good result with a much smaller number of bits (though the
    analysis and its result are complex), but growing
    with n.

    so the computational
    cost is much higher than numerics (even arbitrary precision
    numerics tend to be much faster). As a little example
    in a slightly different spirit, one can try to run an approximate
    root-finding procedure for polynomials in rational
    arithmetic. This solves the problem of potential numerical
    instability, but leads to large fractions which are
    much more costly than arbitrary precision floating point
    (which with some care deals with instability too).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Jan 11 15:41:59 2026
    From Newsgroup: comp.arch

    On 1/11/2026 12:40 PM, Chris M. Thomasson wrote:
    On 1/11/2026 10:07 AM, MitchAlsup wrote:
    [...]

    This is a reworked fun experiment I had about how to store and load data
    in complex numbers:

    https://groups.google.com/g/comp.lang.c++/c/bB1wA4wvoFc/m/OTccTiXLAgAJ



    // updated code
    Can you run it and tell me what you get? Thanks!

    :^)
    _____________________________________
    // Chris M. Thomasson
    // complex storage for fun...


    #include <complex>
    #include <iostream>
    #include <vector>
    #include <limits>
    #include <algorithm>
    #include <cstdint>
    #include <cassert>
    #include <cstring>
    #include <cstdio>   // std::fflush
    #include <cstdlib>  // std::abs on 64-bit integers
    #include <cmath>    // std::pow, std::cos, std::sin, std::round
    #include <string>   // std::string

    typedef std::int64_t ct_int;
    typedef std::uint64_t ct_uint;
    typedef double ct_float;
    typedef std::numeric_limits<ct_float> ct_float_nlim;
    typedef std::complex<ct_float> ct_complex;
    typedef std::vector<ct_complex> ct_complex_vec;

    #define CT_PI 3.14159265358979323846

    ct_float
    ct_roots(
    ct_complex const& z,
    ct_int p,
    ct_complex_vec& out
    ) {
    assert(p != 0);

    ct_float radius = std::pow(std::abs(z), 1.0 / p);
    ct_float angle_base = std::arg(z) / p;
    ct_float angle_step = (CT_PI * 2.0) / p;

    ct_uint n = std::abs(p);
    ct_float avg_err = 0.0;

    for (ct_uint i = 0; i < n; ++i) {
    ct_float angle = angle_step * i;
    ct_complex c = {
    std::cos(angle_base + angle) * radius,
    std::sin(angle_base + angle) * radius
    };

    out.push_back(c);

    ct_complex raised = std::pow(c, p);
    avg_err = avg_err + std::abs(raised - z);
    }

    return avg_err / n;
    }

    // Direct angular calculation - O(1) instead of O(n)
    ct_int
    ct_try_find_direct(
    ct_complex const& z,
    ct_complex const& z_next,
    ct_int power,
    ct_float eps
    ) {
    // Calculate what the angle_base was when z_next's roots were computed
    ct_float angle_base = std::arg(z_next) / power;

    // Get z's angle relative to origin
    ct_float z_angle = std::arg(z);

    // Find which root slot z falls into
    // Subtract the base angle and normalize
    ct_float relative_angle = z_angle - angle_base;

    // Normalize to [0, 2*pi)
    while (relative_angle < 0) relative_angle += CT_PI * 2.0;
    while (relative_angle >= CT_PI * 2.0) relative_angle -= CT_PI * 2.0;

    // Calculate step size between roots
    ct_float angle_step = (CT_PI * 2.0) / power;

    // Find nearest root index
    ct_uint index = (ct_uint)std::round(relative_angle / angle_step);

    // Handle wrap-around
    if (index >= (ct_uint)std::abs(power)) {
    index = 0;
    }

    return index;
    }

    // Original linear search version - more robust but O(n)
    ct_int
    ct_try_find(
    ct_complex const& z,
    ct_complex_vec const& roots,
    ct_float eps
    ) {
    std::size_t n = roots.size();

    for (std::size_t i = 0; i < n; ++i) {
    ct_complex const& root = roots[i];
    ct_float adif = std::abs(root - z);

    if (adif < eps) {
    return i;
    }
    }

    return -1;
    }

    static std::string const g_tokens_str =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    ct_int
    ct_gain_power(
    std::string const& tokens
    ) {
    ct_uint n = tokens.length();
    std::size_t pmax = 0;

    for (ct_uint i = 0; i < n; ++i) {
    std::size_t fridx = g_tokens_str.find_first_of(tokens[i]);
    assert(fridx != std::string::npos);
    pmax = std::max(pmax, fridx);
    }

    return (ct_int)(pmax + 1);
    }

    ct_complex
    ct_store(
    ct_complex const& z_origin,
    ct_int p,
    std::string const& tokens
    ) {
    ct_uint n = tokens.length();
    ct_complex z = z_origin;
    ct_float store_avg_err = 0.0;

    std::cout << "Storing Data..." << "\n";
    std::cout << "stored:z_origin:" << z_origin << "\n";

    for (ct_uint i = 0; i < n; ++i) {
    ct_complex_vec roots;
    ct_float avg_err = ct_roots(z, p, roots);
    store_avg_err = store_avg_err + avg_err;

    std::size_t fridx = g_tokens_str.find_first_of(tokens[i]);
    assert(fridx != std::string::npos);

    z = roots[fridx];
    std::cout << "stored[" << i << "]:" << z << "\n";
    }

    store_avg_err = store_avg_err / n;
    std::cout << "store_avg_err:" << store_avg_err << "\n";

    return z;
    }

    ct_float
    ct_load(
    ct_complex const& z_store,
    ct_complex const& z_target,
    ct_int p,
    ct_float eps,
    std::string& out_tokens,
    ct_complex& out_z,
    bool use_direct = false // Toggle between direct and linear search
    ) {
    ct_complex z = z_store;
    ct_uint n = 128;
    ct_float load_err_sum = 0.0;

    std::cout << "Loading Data... (using " << (use_direct ? "direct" : "linear search") << " method)\n";

    for (ct_uint i = 0; i < n; ++i) {
    // Raise to power to get parent point
    ct_complex z_next = std::pow(z, p);

    ct_int root_idx;

    if (use_direct) {
    // Direct O(1) calculation
    root_idx = ct_try_find_direct(z, z_next, p, eps);
    }
    else {
    // Linear search O(n) - compute roots and search
    ct_complex_vec roots;
    ct_float avg_err = ct_roots(z_next, p, roots);
    load_err_sum += avg_err;
    root_idx = ct_try_find(z, roots, eps);
    }

    if (root_idx < 0 || (ct_uint)root_idx >= g_tokens_str.length()) {
    break;
    }

    std::cout << "loaded[" << i << "]:" << z << " (index:" <<
    root_idx << ")\n";
    out_tokens += g_tokens_str[root_idx];

    // Move to parent point
    z = z_next;

    // Check if we've reached the origin
    if (std::abs(z - z_target) < eps) {
    std::cout << "fin detected!:[" << i << "]:" << z << "\n";
    break;
    }
    }

    // Reverse to get original order
    std::reverse(out_tokens.begin(), out_tokens.end());
    out_z = z;

    return load_err_sum;
    }

    int main() {
    std::cout.precision(ct_float_nlim::max_digits10);
    std::cout << "g_tokens_str:" << g_tokens_str << "\n\n";

    {
    ct_complex z_origin = { -.75, .06 };
    std::string stored = "CHRIS";
    ct_int power = ct_gain_power(stored);

    std::cout << "stored:" << stored << "\n";
    std::cout << "power:" << power << "\n\n";
    std::cout << "________________________________________\n";

    // STORE
    ct_complex z_stored = ct_store(z_origin, power, stored);

    std::cout << "________________________________________\n";
    std::cout << "\nSTORED POINT:" << z_stored << "\n";
    std::cout << "________________________________________\n";

    // LOAD - try both methods
    std::string loaded;
    ct_complex z_loaded;
    ct_float eps = .001;

    std::cout << "\n=== Testing LINEAR SEARCH method ===\n";
    ct_float load_err_sum =
    ct_load(z_stored, z_origin, power, eps, loaded, z_loaded,
    false);

    std::cout << "________________________________________\n";
    std::cout << "\nORIGIN POINT:" << z_origin << "\n";
    std::cout << "LOADED POINT:" << z_loaded << "\n";
    std::cout << "\nloaded:" << loaded << "\n";
    std::cout << "load_err_sum:" << load_err_sum << "\n";

    if (stored == loaded) {
    std::cout << "\n\nDATA COHERENT! :^D" << "\n";
    }
    else {
    std::cout << "\n\n***** DATA CORRUPTED!!! Shi%! *****" << "\n";
    std::cout << "Expected: " << stored << "\n";
    std::cout << "Got: " << loaded << "\n";
    }

    // Try direct method
    std::cout << "\n\n=== Testing DIRECT ANGULAR method ===\n";
    std::string loaded_direct;
    ct_complex z_loaded_direct;

    ct_float load_err_sum_direct =
    ct_load(z_stored, z_origin, power, eps, loaded_direct, z_loaded_direct, true);

    std::cout << "________________________________________\n";
    std::cout << "\nloaded:" << loaded_direct << "\n";

    if (stored == loaded_direct) {
    std::cout << "\n\nDATA COHERENT (DIRECT METHOD)! :^D" << "\n";
    }
    else {
    std::cout << "\n\n***** DATA CORRUPTED (DIRECT METHOD)!!!
    *****" << "\n";
    std::cout << "Expected: " << stored << "\n";
    std::cout << "Got: " << loaded_direct << "\n";
    }
    }

    std::cout << "\n\nFin, hit <ENTER> to exit...\n";
    std::fflush(stdout);
    std::cin.get();

    return 0;
    }
    _____________________________________



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Jan 12 01:07:34 2026
    From Newsgroup: comp.arch

    On Sun, 11 Jan 2026 22:30:14 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions
    which are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    I would guess that they are mostly software and hardware
    verification people rather than people that use transcendental
    functions in engineering and physical calculations.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd, then the approximate function
    f(x) is also even or odd.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)


    In practice, it's probably unlikely to have these invariants
    preserved when the RMS error is 0.75 ULP, but for an RMS error of
    0.60-0.65 ULP I don't see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near the edges of implementation-specific ranges where one
    has to be careful.

    This property is independent of the magnitude of the error. For
    example, a nominally "double" routine may deliver correctly rounded
    single-precision results. Of course the errors are huge, but the
    property holds.


    That's why I specified them not in isolation but together with RMS <
    0.75 ULP.

    I copied the latter part from Mitch's post. But I don't like this sort
    of characterization of precision. It takes into account the discrete
    nature of the Y axis, but ignores the discreteness of the X axis.
    Tonight is too late for a better definition. Maybe I'll do it tomorrow.


    More realistically, monotonic behaviour can
    be obtained as a composition of monotonic operations. If done
    in software such a composition may produce more than 1 ulp error,
    but still be monotonic where required.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Jan 12 12:20:48 2026
    From Newsgroup: comp.arch

    On Sun, 11 Jan 2026 22:11:55 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Consider COS(x) near 0.0


    sin/cos is not an interesting or hard case, because for sin/cos the
    value at the extremum is exactly 1 or -1, i.e. it is representable
    exactly in any BFP format. Plus, the slope is very shallow. It means
    that any sane implementation of sin/cos will have no trouble correctly
    rounding both sides of the interval that contains the extremum to 1
    (or -1). At least it holds as long as x is in the sane range
    (abs(x) < 2**26). For x outside that range, you (i.e. the engineer,
    chemist, or physicist) just know that you are doing something very
    wrong, and the implementation of trigs is among the last things that
    you should be concerned about.

    More challenging cases are transcendental functions that have extrema
    whose values are not exactly representable, especially so when the
    value is close to the mid-point between two representable numbers.
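
    The shallow-slope point is easy to see numerically: in double, cos(x)
    rounds to exactly 1.0 for all |x| up to about 2^-26.5, since there
    1 - x^2/2 is within half an ulp of 1.0. A toy bisection (assuming the
    libm cos is monotonic on [0, 1], which any sane one is):
    _____________________________________
    // Find how far from 0 cos(x) still rounds to exactly 1.0 in double.
    #include <cmath>
    #include <cstdio>

    int main() {
        double lo = 0.0, hi = 1.0;      // cos(lo) == 1.0, cos(hi) < 1.0
        for (int i = 0; i < 200; ++i) { // plenty of steps to converge fully
            double mid = 0.5 * (lo + hi);
            if (std::cos(mid) == 1.0) lo = mid; else hi = mid;
        }
        printf("cos rounds to 1.0 up to x ~= %g (log2 ~= %.1f)\n",
               lo, std::log2(lo));
        return 0;
    }
    _____________________________________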

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Jan 12 16:28:37 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation, numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be a reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps,

    Between 2N+6 and 2N+13, which is 3-9 bits larger than the
    next higher precision.
    double -> fp128 is 53 vs 113 bits mantissa (including the hidden bit),
    so 2N+7, which is _almost_ enough even for the handful of really bad
    cases. Using u128 unsigned calculations might be enough for exact
    double results?
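
    A minimal sketch of the u128 idea (using GCC/Clang's unsigned __int128;
    the Q1.63 fixed-point format and the coefficients are invented for
    illustration):
    _____________________________________
    // One Horner step on 64-bit fixed-point data with a full 64x64->128
    // product, keeping ~2N bits at the intermediate step.
    #include <cstdint>
    #include <cstdio>

    // acc, c, x are Q1.63 fractions in [0, 1): value = v / 2^63.
    uint64_t horner_step(uint64_t acc, uint64_t c, uint64_t x) {
        unsigned __int128 p = (unsigned __int128)acc * x; // exact 128-bit product
        return (uint64_t)(p >> 63) + c;                   // renormalize, add coeff
    }

    int main() {
        // toy: evaluate (1/4)*x + 1/2 at x = 1/2 -> 5/8
        uint64_t half = 1ull << 62;    // 0.5 in Q1.63
        uint64_t quarter = 1ull << 61; // 0.25 in Q1.63
        uint64_t r = horner_step(quarter, half, half);
        printf("%.6f\n", (double)r / (double)(1ull << 63)); // prints 0.625000
        return 0;
    }
    _____________________________________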
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Jan 12 22:22:05 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> schrieb:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored fact that even with hardware support there still needs
    to be support in compilers.

    Or perhaps (again, no personal knowledge - just speculation) that >>supporting an additional data type in the IBM COBOL (and, for what its >>worth PL/1) compilers is easier if there was hardware support for it.

    Having written a few compilers, I can say that it is equally easy
    within epsilon to emit a DFADD instruction as the equivalent of CALL
    DFADD. I could believe it's politically easier, hey we'll look dumb if
    we announce this swell DFP feature and our own compilers don't use it.

    Unfortunately, I do not have access to a machine with an IBM COBOL
    compiler. It would be interesting to see if it actually uses
    the decimal float arithmetic. But xlc has an option for that,
    -qdfp.

    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Jan 13 09:55:00 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored fact that even with hardware support there still needs
    to be support in compilers.

    Or perhaps (again, no personal knowledge - just speculation) that
    supporting an additional data type in the IBM COBOL (and, for what its
    worth PL/1) compilers is easier if there was hardware support for it.

    Having written a few compilers, I can say that it is equally easy
    within epsilon to emit a DFADD instruction as the equivalent of CALL
    DFADD. I could believe it's politically easier, hey we'll look dumb if
    we announce this swell DFP feature and our own compilers don't use it.

    Back in the FDIV bug days, the workaround code I did most of the writing
    on simply replaced all FDIV opcodes with a CALL FDIVFIX; none of the
    compiler teams found that to be any problem at all. For most it was
    probably just a patch to the code output table?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Jan 13 14:45:02 2026
    From Newsgroup: comp.arch

    MitchAlsup [2026-01-11 18:18:00] wrote:
    Michael S <already5chosen@yahoo.com> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions which
    are exactly rounded.
    I wonder who are those forces and what is the set they push for.

    One reason to want it comes from portability and bit-for-bit
    reproducibility. These requirements don't actually care about the
    rounding being *correct* as much as the rounding always being the same
    across different hardware and libm implementations, but it seems rather unlikely that the various actors involved would agree on a particular
    return value if it's not the correctly-rounded one, so in practice this
    becomes a push for correctly rounded results.

    The problem here is that even when one gets all the rounding correct,
    one has still lost various algebraic identities:

    CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0

    Which properties are preserved and which ones aren't is inevitably
    a compromise since, for example, the above one cannot be preserved
    without breaking several others.
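
    That compromise is easy to observe. A quick probe (even a correctly
    rounded sin/cos pair would show misses here, since the two squarings
    and the add each round once more):
    _____________________________________
    // Count how often sin(x)^2 + cos(x)^2 lands exactly on 1.0.
    #include <cmath>
    #include <cstdio>

    int main() {
        long off = 0;
        for (long i = 1; i <= 1000000; ++i) {
            double x = i * 1e-3;
            double s = std::sin(x), c = std::cos(x);
            if (s * s + c * c != 1.0) ++off;
        }
        printf("%ld of 1000000 samples miss 1.0 exactly\n", off);
        return 0;
    }
    _____________________________________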


    - Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2