Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.
On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:
Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.
There is no "top" or "bottom" or "left" or "right" in memory. There
are only addresses (bit numbers and byte numbers).
Lawrence D'Oliveiro <ldo@nz.invalid> posted:
On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:
Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.
There is no "top" or "bottom" or "left" or "right" in memory. There
are only addresses (bit numbers and byte numbers).
In order to stop the BE::LE war, one could always do a Middle Endian
bit/Byte order. You start in the middle and each step goes right-then-left.
On 28 Dec 2025, Lawrence D'Oliveiro wrote
(in article <10ipsi8$3ssi3$4@dont-email.me>):
On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:
Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.
There is no "top" or "bottom" or "left" or "right" in memory.
There are only addresses (bit numbers and byte numbers).
Priceless!
(I needed a good laugh.)
MitchAlsup [2025-12-28 17:43:25] wrote:
In order to stop the BE::LE war, one could always do a Middle Endian
bit/Byte order. You start in the middle and each step goes right-then-left.
I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").
Stefan
On 12/28/2025 12:55 PM, Stefan Monnier wrote:
MitchAlsup [2025-12-28 17:43:25] wrote:
In order to stop the BE::LE war, one could always do a Middle Endian
bit/Byte order. You start in the middle and each step goes right-then-left.
I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").
Apparently DEC Alpha did this:
Nothing smaller than 64 bits in HW.
So, you want byte-oriented memory access or similar? Implement it yourself.
On 12/28/2025 12:55 PM, Stefan Monnier wrote:
MitchAlsup [2025-12-28 17:43:25] wrote:
In order to stop the BE::LE war
I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").
Apparently DEC Alpha did this:
Nothing smaller than 64 bits in HW.
Mitch Alsup mentioned one architecture without order problems: The
Cray-1 is word-addressed and does not support numbers that take more
than one word. The same is true of the CDC 6600 and descendants.
The Cray-1 had double precision numbers, with software support
only. They had to in order to conform to the FORTRAN standards
of storage association.
Thomas Koenig <tkoenig@netcologne.de> writes:
The Cray-1 had double precision numbers, with software support
only. They had to in order to conform to the FORTRAN standards
of storage association.
And my guess is that the word order for double-precision is also
specified by FORTRAN.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
The Cray-1 had double precision numbers, with software support
only. They had to in order to conform to the FORTRAN standards
of storage association.
And my guess is that the word order for double-precision is also
specified by FORTRAN.
Your guess is wrong.
If you have storage association (via COMMON/EQUIVALENCE)
between two variables of different type and assign a value
to one of them, the other one becomes undefined.
It appears that Thomas Koenig <tkoenig@netcologne.de> said:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Thomas Koenig <tkoenig@netcologne.de> writes:
The Cray-1 had double precision numbers, with software support
only. They had to in order to conform to the FORTRAN standards
of storage association.
And my guess is that the word order for double-precision is also
specified by FORTRAN.
Your guess is wrong.
If you have storage association (via COMMON/EQUIVALENCE)
between two variables of different type and assign a value
to one of them, the other one becomes undefined.
There was never any sort of type punning in FORTRAN.
Fortran ran on many different machines with different floating point
formats and you could not make any assumptions about similarities in
single and double float formats.
John Levine <johnl@taugh.com> schrieb:
There was never any sort of type punning in FORTRAN.
[Interesting history snipped]
Fortran ran on many different machines with different floating point
formats and you could not make any assumptions about similarities in
single and double float formats.
Unfortunately, this did not keep people from using a feature
that was officially prohibited by the standard, see for example
https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
clear need (having certain floating point constants ...
Fortunately, these days it's all IEEE; I think nobody uses IBM's
base-16 FP numbers for anything serious any more.
According to Thomas Koenig <tkoenig@netcologne.de>:
John Levine <johnl@taugh.com> schrieb:
There was never any sort of type punning in FORTRAN.
[Interesting history snipped]
Fortran ran on many different machines with different floating point
formats and you could not make any assumptions about similarities in
single and double float formats.
Unfortunately, this did not keep people from using a feature
that was officially prohibited by the standard, see for example
https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
clear need (having certain floating point constants ...
Wow, that's gross but I see the need. If you wanted to do extremely
machine specific stuff in Fortran, it didn't try to stop you.
Fortunately, these days it's all IEEE; I think nobody uses IBM's
base-16 FP numbers for anything serious any more.
Agreed, except IEEE has both binary and decimal flavors.
It's never been clear to me how much people use decimal FP. The use
case is clear enough: it lets you control normalization so you can
control the decimal precision of calculations, which is important for
financial calculations like bond prices. On the other hand, while it
is somewhat painful to get correct decimal rounded results in binary
FP, it's not all that hard -- forty years ago I wrote all the bond
price functions for an MS DOS application using the 286's FP.
Agreed, except IEEE has both binary and decimal flavors.
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
IBM does the arithmetic in hardware, their decimal arithmetic probably
goes back to their adding and multiplying punches, far before computers.
It's never been clear to me how much people use decimal FP. The use
case is clear enough, it lets you control normalization so you can
control the decimal precision ...
Speed? At least when using 128-bit densely packed decimal encoding
on IBM machines, it is possible to get speed which is probably
not attainable by software ...
And people using other processors don't want to develop hardware,
but still want to do the same applications. IIRC, everybody but
IBM uses binary encoding of the significand and software, probably
doing something similar to what you did.
Software implementations of DFP would be slow, but if you know what you
are doing you can get the correctly rounded decimal results using BFP,
which I would think would be faster.
John Levine <johnl@taugh.com> writes:
Software implementations of DFP would be slow, but if you know what
you are doing you can get the correctly rounded decimal results
using BFP which I would think would be faster.
If you know what you are doing, you use fixed point for those
financial applications which DFP targets, because that's what finance
used in the old days, and what they laid down in their rules (and
still do; the Euro conversion (early 2000s) has to happen with 4
decimal digits after the decimal point; which is noted as unusual,
apparently 2 or 3 digits are more common). And whenever I bring that
up, it is explained to me that DFP actually behaves like fixed point.
Which leads to the question why one would use DFP rather than fixed
point.
In the bad old days of 16-bit processors, using the 64-bit mantissa of
80-bit BFP as a large integer may have provided an advantage, but
these days we have 64-bit integers, and 80-bit BFP is slower than
64-bit BFP with 53-bit mantissa, so I don't see a reason to use any FP
for financial calculations.
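Anton's point about plain integers can be made concrete. The sketch
below (illustrative names, assuming two decimal digits and
round-half-up on non-negative amounts) does money arithmetic entirely
in integer cents, well within a 64-bit range:

```python
# Fixed-point money as integer cents: a minimal sketch of doing
# financial arithmetic with plain integers, no FP involved.
# Function names and the 19% rate are illustrative assumptions.

def to_cents(s: str) -> int:
    """Parse a decimal string like '19.99' into integer cents."""
    sign = -1 if s.startswith('-') else 1
    s = s.lstrip('+-')
    whole, _, frac = s.partition('.')
    frac = (frac + '00')[:2]              # pad/truncate to 2 digits
    return sign * (int(whole or '0') * 100 + int(frac))

def percent_of(cents: int, rate_num: int, rate_den: int) -> int:
    """Rate as a fraction, e.g. 19% = (19, 100); round half up.
    Assumes a non-negative amount."""
    q, r = divmod(cents * rate_num, rate_den)
    if 2 * r >= rate_den:
        q += 1
    return q

price = to_cents('19.99')             # 1999 cents
vat = percent_of(price, 19, 100)      # 19% of 19.99 = 3.7981 -> 380
print(price, vat)                     # 1999 380
```

The rounding rule (here half up) is exactly the kind of thing the
financial regulations Anton mentions pin down.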
Concerning the question of how much DFP is used: My impression is that
Intel's DFP implementation is not particularly efficient, and it sees
no maintenance. And I have not read about other implementations. My
guess is that there is so little use of this library that nobody
bothers working on it, and the use that it sees is not in
performance-critical code, so nobody works on making Intel's
implementation faster or making another, faster implementation.
- anton
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
John Levine <johnl@taugh.com> schrieb:
According to Thomas Koenig <tkoenig@netcologne.de>:
John Levine <johnl@taugh.com> schrieb:
There was never any sort of type punning in FORTRAN.
[Interesting history snipped]
Fortran ran on many different machines with different floating point
formats and you could not make any assumptions about similarities in
single and double float formats.
Unfortunately, this did not keep people from using a feature
that was officially prohibited by the standard, see for example
https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
clear need (having certain floating point constants ...
Wow, that's gross but I see the need. If you wanted to do extremely
machine specific stuff in Fortran, it didn't try to stop you.
Fortunately, these days it's all IEEE; I think nobody uses IBM's
base-16 FP numbers for anything serious any more.
Agreed, except IEEE has both binary and decimal flavors.
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
IBM does the arithmetic in hardware, their decimal arithmetic probably
goes back to their adding and multiplying punches, far before computers.
It's never been clear to me how much people use decimal FP. The use
case is clear enough, it lets you control normalization so you can
control the decimal precision of calculations, which is important for
financial calculations like bond prices. On the other hand, while it
is somewhat painful to get correct decimal rounded results in binary
FP, it's not all that hard -- forty years ago I wrote all the bond
price functions for an MS DOS application using the 286's FP.
Speed? At least when using 128-bit densely packed decimal encoding
on IBM machines, it is possible to get speed which is probably
not attainable by software (but certainly not very good compared
to what an optimized version could do, and POWER's 128-bit unit
is also quite slow as a result).
And people using other processors don't want to develop hardware,
but still want to do the same applications. IIRC, everybody but
IBM uses binary encoding of the significand and software, probably
doing something similar to what you did.
According to Thomas Koenig <tkoenig@netcologne.de>:
Agreed, except IEEE has both binary and decimal flavors.
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
IBM does the arithmetic in hardware, their decimal arithmetic probably
goes back to their adding and multiplying punches, far before computers.
S/360 had packed decimal and zSeries even has vector ops for it. But it's different from DFP. Somewhere I saw a set of slides that said that the
first DFP in z was done in millicode, with hardware later.
It's never been clear to me how much people use decimal FP. The use
case is clear enough, it lets you control normalization so you can
control the decimal precision ...
Speed? At least when using 128-bit densely packed decimal encoding
on IBM machines, it is possible to get speed which is probably
not attainable by software ...
Software implementations of DFP would be slow, but if you know what
you are doing you can get the correctly rounded decimal results using
BFP, which I would think would be faster.
but still want to do the same applications. IIRC, everybody but
IBM uses binary encoding of the significand and software, probably
doing something similar to what you did.
I used regular IEEE binary FP, with explicit code to do decimal rounding
when needed. Like I said it was a pain but it wasn't all that hard.
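John's approach can be sketched as follows; this is illustrative, not
his actual 286-era code, and it assumes the scaled values stay
comfortably inside double precision (which bond math usually does):

```python
# Decimal rounding on top of binary FP: compute in doubles, then
# round explicitly onto a decimal grid, half away from zero.

def round_decimal(x: float, digits: int) -> float:
    """Round to `digits` decimal places, half away from zero."""
    scale = 10 ** digits
    scaled = x * scale
    # Nudge by half a grid step and truncate; valid while the scaled
    # value is well inside the 53-bit integer range of a double.
    if scaled >= 0:
        return int(scaled + 0.5) / scale
    return -int(-scaled + 0.5) / scale

# 0.125 is exact in binary, so the behavior is fully predictable:
print(round_decimal(0.125, 2))   # 0.13 (Python's round() gives 0.12,
                                 # since it rounds half to even)
```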
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the mess
is not worse (and not better) than two byte orders of IEEE binary FP.
Michael S <already5chosen@yahoo.com> schrieb:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the
mess is not worse (and not better) than two byte orders of IEEE
binary FP.
Almost.
IIRC, there is no restriction on the binary mantissa, so its
range is slightly larger for the same number of bits
(1000/1024)**(n/3).
John Levine <johnl@taugh.com> writes:
Software implementations of DFP would be slow, but if you know what
you are doing you can get the correctly rounded decimal results using
BFP, which I would think would be faster.
If you know what you are doing, you use fixed point for those
financial applications which DFP targets, because that's what finance
used in the old days, and what they laid down in their rules (and
still do; the Euro conversion (early 2000s) has to happen with 4
decimal digits after the decimal point; which is noted as unusual,
apparently 2 or 3 digits are more common). And whenever I bring that
up, it is explained to me that DFP actually behaves like fixed point.
Which leads to the question why one would use DFP rather than fixed
point.
- anton
DFP behaves as fixed point for as long as it has enough digits in the
significand to behave as fixed point. It could use up all digits
rather quickly:
- when you sum up a very big number with a very small number;
  hopefully that does not happen in finance, where big numbers are
  not really big and small numbers are not really small (a decimal128
  significand holds at most 34 digits, i.e. stays below 10^34);
- when you multiply, especially several times;
- when you divide: if the result is inexact, then just one division
  is enough.
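The digit budget is easy to demonstrate with Python's decimal module;
the sketch below assumes a 34-digit context to mimic the decimal128
significand:

```python
# Exhausting a 34-digit significand (the IEEE decimal128 size)
# with big+small sums and with a single inexact division.
from decimal import Decimal, getcontext, Inexact

ctx = getcontext()
ctx.prec = 34

ctx.clear_flags()
ok = Decimal('1e30') + Decimal('0.001')     # 31 + 3 digits: exact
print(bool(ctx.flags[Inexact]))             # False

ctx.clear_flags()
lost = Decimal('1e30') + Decimal('0.0001')  # 35 digits needed: rounded
print(bool(ctx.flags[Inexact]))             # True

ctx.clear_flags()
third = Decimal(1) / Decimal(3)             # one division is enough
print(bool(ctx.flags[Inexact]))             # True
```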
Then again, the gcc manual mentions very few details about decimal FP
support. It looks like work on decimal FP was started ~10 years ago,
made progress for a year or two, and then interest was lost.
The absence of easily accessible quantize operations seems to hint
that the gcc implementation has no production use at all.
So, would it not be easier and faster to simply make a densely-packed
128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
number of digits below the decimal point, in a control register ???!!!
So, would it not be easier and faster to simply make a densely-packed
128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
number of digits below the decimal point, in a control register ???!!!
I'm not sure I understand your proposal correctly, but the number of
digits below the decimal point should not be a global setting because
several computations can commonly happen at the same time with
different numbers of digits below the decimal point.
Stefan
Stefan Monnier <monnier@iro.umontreal.ca> posted:
So, would it not be easier and faster to simply make a densely-packed
128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
number of digits below the decimal point, in a control register ???!!!
I'm not sure I understand your proposal correctly,
It is not so much of a question as it is my mind rambling through the
possibilities. Loosely based on the IBM 360--except 2× or 4× as long,
and stored in densely packed decimal instead of 4-bit digits. No
decision on whether data is stored in registers or processed via memory.
but the number of
digits below the decimal point should not be a global setting because
several computations can commonly happen at the same time with different
number of digits below the decimal point.
OK, I forgot that COBOL has each number defined with its own decimal
location.
Should each calculation have its own "rounding mode" or "what to do
with bits that fall off below the defined Lower-end" ??
It just seems to me that once the "container" is big enough to deal
with world GDP (of 2100) in the least valuable currency in the world,
that making the decimal point "float" adds little value.
Then again, the gcc manual mentions very few details about decimal FP
support. It looks like work on decimal FP was started ~10 years ago,
made progress for a year or two, and then interest was lost.
This supports my theory that nobody is using DFP.
So, would it not be easier and faster to simply make a densely-packed
128-bit Fixed-Point decimal function unit ?!?
Should each calculation have its own "rounding mode" or "what to do
with bits that fall off below the defined Lower-end" ??
It just seems to me that once the "container" is big enough to deal
with world GDP (of 2100) in the least valuable currency in the world,
that making the decimal point "float" adds little value.
If it is used, it is probably used most on the IBM Z series.
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
So, would it not be easier and faster to simply make a densely-packed
128-bit Fixed-Point decimal function unit ?!?
128-bit binary integers are mostly good enough, and support for that
is ok in current architectures; division support might be better,
though, but see below. Rescaling with a power of 10 is something that
may merit additional hardware support if it occurs often enough; but I
am not convinced that it occurs often enough:
You usually don't need it for addition and subtraction, because the
operands have the same scale factor, and the same scale factor as the
result.
For multiplication, one common operation is to multiply a price with a
number of pieces resulting in a price, and no rescaling is necessary
there. Another common operation is to compute a percentage; you do
have rescaling there.
For division, it seems to me that the most common case is division by
a percentage that is applied to many dividends (maybe not in the USA,
but certainly in Europe it is common to compute the price without VAT
(sales tax) from the price with VAT; but there are only few VAT rates
in each country); that can be turned into a two-stage operation that
might include any necessary rescaling: compute an inverse that can
then be used for a cheap multiplication-and-rounding operation (e.g.,
where a power-of-2 scale factor is involved for computing something
below the least significant digit, in order to implement rounding).
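The two-stage idea can be sketched with scaled integers. The 19% VAT
rate and the 2^32 scale factor below are assumptions for illustration,
and a production version would need a proper error analysis of the
precomputed reciprocal:

```python
# Anton's two-stage division: to strip 19% VAT from many gross prices
# (in cents), precompute a scaled reciprocal of 1.19 once, then each
# price costs only a multiply, an add, and a shift.

SHIFT = 32
# gross = net * 119/100, so net = gross * 100/119.
INV = (100 << SHIFT) // 119          # fixed-point 100/119, scale 2^32

def net_of_gross(gross_cents: int) -> int:
    # Adding 1 << (SHIFT-1) rounds the 2^-32 fraction to nearest;
    # for cent-sized inputs the 32-bit scale leaves ample margin.
    return (gross_cents * INV + (1 << (SHIFT - 1))) >> SHIFT

print(net_of_gross(11900))   # 10000: 119.00 gross -> 100.00 net
print(net_of_gross(999))     # 839: 9.99 gross -> 8.39 net (839.4957...)
```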
And yes, support for several rounding modes is needed when an inexact
result is involved. Hardware may be helpful here.
I have not done much financial programming, so maybe somebody else can complement my views.
- anton
Anton Ertl wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> writes:
So, would it not be easier and faster to simply make a densely-packed
128-bit Fixed-Point decimal function unit ?!?
128-bit binary integers are mostly good enough, and support for that
is ok in current architectures; division support might be better,
though, but see below. Rescaling with a power of 10 is something that
may merit additional hardware support if it occurs often enough; but I
am not convinced that it occurs often enough:
You usually don't need it for addition and subtraction, because the
operands have the same scale factor, and the same scale factor as the
result.
For multiplication, one common operation is to multiply a price with a
number of pieces resulting in a price, and no rescaling is necessary
there. Another common operation is to compute a percentage; you do
have rescaling there.
For division, it seems to me that the most common case is division by
a percentage that is applied to many dividends (maybe not in the USA,
but certainly in Europe it is common to compute the price without VAT
(sales tax) from the price with VAT; but there are only few VAT rates
in each country); that can be turned into a two-stage operation that
might include any necessary rescaling: compute an inverse that can
then be used for a cheap multiplication-and-rounding operation (e.g.,
where a power-of-2 scale factor is involved for computing something
below the least significant digit, in order to implement rounding).
And yes, support for several rounding modes is needed when an inexact
result is involved. Hardware may be helpful here.
I have not done much financial programming, so maybe somebody else can
complement my views.
- anton
Many bonds are time series polynomials with non-integer times.
Many calculations use pow(base,exp) to a non-integer exponent.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
For multiplication, one common operation is to multiply a price with a
number of pieces resulting in a price, and no rescaling is necessary
there. Another common operation is to compute a percentage; you do
have rescaling there.
One interesting aspect is that the interest rates I have seen are
multiples of 1/800 (e.g., 1 3/4%=7/4%=14/8%=14/800). One can also
represent these through decimal scales, but the decimal scale that
allows representing them is 1/100000 (1/800=125/100000). It may be
more economical in bits to scale with 1/800 (or maybe 1/1600 to be
prepared for the next innovation in finance).
For tax rates, IIRC I have also seen half percentages, so using a
1/800 or 1/1600 scale factor may be a good idea for them, too.
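A quick check of the arithmetic above with exact rationals (Python's
fractions module):

```python
# 1 3/4 % is an exact multiple of 1/800, while the coarsest decimal
# scale that can represent it is 1/100000.
from fractions import Fraction

rate = Fraction(7, 4) / 100            # 1 3/4 % as an exact rational
print(rate)                            # 7/400
print(rate == Fraction(14, 800))       # True
print(Fraction(1, 800) == Fraction(125, 100000))   # True
```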
- anton
If you look at Java's BigDecimal operations <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
they all have versions with and without a MathContext (which includes a
target scale and a rounding mode), and sometimes additional variants
(e.g., divide() has variants where you pass just the rounding mode, or
the rounding mode and scale individually instead of through a
MathContext).
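For comparison, Python's decimal module pairs the same two mechanisms:
a Context carries precision and rounding mode, and quantize() sets a
per-result scale. This is an analogy to BigDecimal, not Java's API:

```python
# Context = precision + rounding mode; quantize() = per-result scale.
from decimal import Decimal, Context, ROUND_HALF_EVEN, ROUND_DOWN

ctx = Context(prec=28, rounding=ROUND_HALF_EVEN)
q = ctx.divide(Decimal(1), Decimal(7))       # rounded to 28 digits
print(q)                    # 0.1428571428571428571428571429

# Analogous to divide(divisor, scale, roundingMode): choose scale
# and rounding for one result only.
cents = ctx.divide(Decimal(1), Decimal(7)).quantize(
    Decimal('0.01'), rounding=ROUND_DOWN)
print(cents)                # 0.14
```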
If you look at Java's BigDecimal operations <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
they all have versions with and without a MathContext (which
includes a target scale and a rounding mode), and sometimes
additional variants (e.g., divide() has variants where you pass
just the rounding mode, or the rounding mode and scale individually
instead of through a MathContext).
I wonder how that would compare in practice with a Rational type,
where all arithmetic operations are exact (and thus don't need
anything like a MathContext) and you simply provide a rounding
function that takes two arguments: a "target scale" (in the form of a
target denominator) and a rounding mode.
[ Extra points for implementing compiler optimizations that keep track
of the denominators statically to try and do away with the
denominators at run-time as much as possible. Maybe also figure out
how to eliminate the use of bignums for the numerators. ]
- Stefan
I don't know about the 800 but stock and bond prices used to be
published with fractions like 17 1/8.
I can't remember when they
switched to publishing in decimal.
EricP <ThatWouldBeTelling@thevillage.com> writes:
I don't know about the 800 but stock and bond prices used to be
published with fractions like 17 1/8.
17 1/8% = 137/8% = 137/800
I can't remember when they
switched to publishing in decimal.
But all those that I have seen published in decimal are also multiples
of 1/8%, i.e., of 1/800.
- anton
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
I don't know about the 800 but stock and bond prices used to be
published with fractions like 17 1/8.
17 1/8% = 137/8% = 137/800
I can't remember when they
switched to publishing in decimal.
But all those that I have seen published in decimal are also multiples
of 1/8%, i.e, of 1/800.
- anton
Ah... it was due to Spanish traders and gold doubloons about 400 years ago.
Why the NYSE Once Reported Prices in Fractions https://www.investopedia.com/ask/answers/why-nyse-switch-fractions-to-decimals/
I wonder how that would compare in practice with a Rational type,
where all arithmetic operations are exact (and thus don't need
anything like a MathContext) and you simply provide a rounding
function that takes two arguments: a "target scale" (in the form of a
target denominator) and a rounding mode.
exp() would be a challenge.
BigDecimal is almost like what you imagine, except that the
denominators are always powers of 10. Without a MathContext, addition,
subtraction, and multiplication are exact, and division is also exact
or produces an exception.
Proper rational arithmetic (used, IIRC, in Prolog II) is also exact
for division (and has no rounding), but you can get really long
numerators and denominators.
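The Rational-plus-rounding-function idea can be sketched with Python's
fractions module; the rounding interface and mode names are invented
for illustration:

```python
# Exact rational arithmetic plus one rounding function that maps a
# result onto a target denominator with a chosen rounding mode.
from fractions import Fraction

def round_to(x: Fraction, denom: int, mode: str = 'half-even') -> Fraction:
    """Round x to a multiple of 1/denom (illustrative modes only)."""
    n = x * denom
    if mode == 'floor':
        k = n.numerator // n.denominator
    else:                      # round() on a Fraction is half-to-even
        k = round(n)
    return Fraction(k, denom)

# Arithmetic stays exact, but denominators grow, as noted above:
x = Fraction(1, 3) + Fraction(1, 7) * Fraction(22, 100)
print(x)                   # 383/1050
print(round_to(x, 100))    # 9/25, i.e. 0.36 at a 1/100 target scale
```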
[ Extra points for implementing compiler optimizations that keep track
of the denominators statically to try and do away with the
denominators at run-time as much as possible.
For the kind of fixed point used for financial calculation rules, the
scale of every calculation is statically known (it comes out of the
rules), so a compiler for a programming language that has such fixed
point numbers as native type (Cobol, Ada, anything else?) does not
need to check every time whether rescaling is necessary (which
probably happens for Java's BigDecimal).
Maybe also figure out how to eliminate the use of bignums for
the numerators.
I don't think that's possible if the language specifies
arbitrary-precision arithmetic, because the program processes input
data that is coming from data sources that can contain
arbitrarily large numbers.
What is possible, and is done in various dynamically-typed languages,
is to have the common case (a bignum that's actually small) unboxed,
and use boxing only in those cases where the number exceeds the range
of unboxed numbers.
Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the
mess is not worse (and not better) than two byte orders of IEEE
binary FP.
Almost.
IIRC, there is no restriction on the binary mantissa, so its
range is slightly larger for the same number of bits
(1000/1024)**(n/3).
Sorry, that's wrong:
Just like the 24 "spare" DPD patterns are illegal, any mantissa
corresponding to a number greater than the maximum allowed (1e34
AFAIR) is also illegal, and there are rules for how to handle both
cases (without checking, I seem to remember that they should be
treated as zero?)
Happy New Year everyone!
Terje
On Tue, 6 Jan 2026 12:35:20 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the
mess is not worse (and not better) than two byte orders of IEEE
binary FP.
Almost.
IIRC, there is no restriction on the binary mantissa, so its
range is slightly larger for the same number of bits
(1000/1024)**(n/3).
Sorry, that's wrong:
Just like the 24 "spare" DPD patterns are illegal,
Non-canonical, which is not the same as illegal: they are silently
accepted as input operands, but never produced as a result.
any mantissa
corresponding to a number greater than the maximum allowed (1e34
AFAIR) is also illegal, and there are rules for how to handle both
cases (without checking, I seem to remember that they should be
treated as zero?)
BID significand extension > max is indeed treated as zeros.
Non-canonical DPD declets have non-zero values.
They are of the form (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i are
in range [0:1].
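The formula is easy to enumerate; the sketch below just lists the
eight values, each of which has three redundant declet encodings on
top of its canonical one (8 × 3 = 24 non-canonical patterns):

```python
# The 8 values behind the 24 non-canonical DPD declets, per the
# formula (8+c)*100 + (8+f)*10 + (8+i) with c, f, i in {0, 1}.
values = sorted((8 + c) * 100 + (8 + f) * 10 + (8 + i)
                for c in (0, 1) for f in (0, 1) for i in (0, 1))
print(values)   # [888, 889, 898, 899, 988, 989, 998, 999]
```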
Anton Ertl [2026-01-05 17:40:15] wrote:
BigDecimal is almost like what you imagine, except that the
denominators are always powers of 10. Without a MathContext, addition,
subtraction, and multiplication are exact, and division is also exact
or produces an exception.
Hmm... so they're rationals limited to denominators that are powers
of 10? I guess it does save them from GCD-style computations to simplify
the fractions.
but you can get really long numerators and denominators.
The only case where the numerators would need to get larger than for
BigDecimal is for division (when BigDecimal produces an exception), so
I guess that would argue in favor of providing an additional division
operation that takes something like a MathContext to avoid the
computation of the exact result before doing the rounding (or signaling
an exception).
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
Maybe also figure out how to eliminate the use of bignums for
the numerators.
Anton Ertl [2026-01-05 17:40:15] wrote:
I don't think that's possible if the language specifies
arbitrary-precision arithmetic, because the program processes input
data that is coming from data sources that can contain
arbitrarily-large numbers.
I was assuming we're free to define the semantics of the Rational type,
e.g. specifying a limit to the precision.
[ Tho, I tend to use the word "boxing" in a different way, where
I consider both cases "boxed" (i.e. made to fit in a fixed-size
(typically 64bit) "box"), just that one of them involves placing the
data in a separate memory location and putting the "pointer + tag" in
the box, whereas the other puts the "small integer + tag" in that
same box. ]
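The rationals-with-explicit-rounding idea sketched in this subthread is easy to prototype on top of Python's Fraction; the function name and the ties-to-even choice below are mine, not from any actual proposal:

```python
from fractions import Fraction

def round_to_denominator(x: Fraction, denom: int) -> Fraction:
    """Round an exact rational x to a multiple of 1/denom, ties to even."""
    scaled = x * denom
    q = scaled.numerator // scaled.denominator          # floor
    rem = scaled - q
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and q % 2 == 1):
        q += 1
    return Fraction(q, denom)

# arithmetic stays exact; rounding happens only where the caller asks for it
total = Fraction(1, 3) + Fraction(1, 6)                 # exactly 1/2
print(round_to_denominator(total, 100))                 # 1/2
```

A "target scale" of 10**k as the denominator gives exactly the decimal-places behavior a MathContext would, without any rounding inside the arithmetic itself.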
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the mess
is not worse (and not better) than two byte orders of IEEE binary FP.
I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
standard does not specify a few very important things about global states
and interoperability of BFP and DFP in the same process. Like whether
BFP and DFP have a common rounding mode or each one has a mode of its own.
The same goes for exception flags and exception masks.
Michael S wrote:
On Tue, 6 Jan 2026 12:35:20 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
Sorry, that's wrong:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the
mess is not worse (and not better) than two bytes orders of IEEE
binary FP.
Almost.
IIRC, there is no restriction on the binary mantissa, so its
range is slightly larger for the same number of bits
(1000/1024)**(n/3).
Just like the 24 "spare" DPD patterns are illegal,
Non-canonical, which is not the same as illegal
Silently accepted as input operands but never produced as result.
any mantissa
corresponding to a number greater than the maximum allowed (1e34
afair) is also illegal, and there are rules for how to handle both
cases (without checking, i seem to remember that they should be
treated as zero?)
BID significand extension > max is indeed treated as zeros.
Non-canonical DPD declets have non-zero values.
They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i are
in range [0:1].
OK, that is probably because allowing them on input is significantly faster/cheaper than having to detect and modify/trap/erase.
That said, out of range mantissas could also have been accepted, except
they would not have had a valid conversion to either DPD or ascii.
Terje
Michael S <already5chosen@yahoo.com> posted:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the
mess is not worse (and not better) than two bytes orders of IEEE
binary FP.
Division by 10 is way faster in DPD than in Binary.
I see much bigger problem [than BID vs DPD] in the fact that IEEE
standard does not specify few very important things about global
states and interoperability of BFP and DFP in the same process.
Like whether BFP and DFP have common rounding mode or each one has
mode of its own. The same question about for exception flags and
exception masks.
EricP wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
I don't know about the 800 but stock and bond prices used to be
published with fractions like 17 1/8.
17 1/8% = 137/8% = 137/800
I can't remember when they
switched to publishing in decimal.
But all those that I have seen published in decimal are also multiples
of 1/8%, i.e, of 1/800.
- anton
Ah... it was due to Spanish traders and gold doubloons about 400 years ago.
Why the NYSE Once Reported Prices in Fractions
https://www.investopedia.com/ask/answers/why-nyse-switch-fractions-to-decimals/
And possibly the factor of 100 comes from Basis Point which is 0.01%.
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Michael S wrote:
On Tue, 6 Jan 2026 12:35:20 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
Sorry, that's wrong:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely
packed decimal encoding of the significand... it's a bit of a
mess.
Since both formats have exactly identical semantics, in theory
the mess is not worse (and not better) than two bytes orders
of IEEE binary FP.
Almost.
IIRC, there is no restriction on the binary mantissa, so its
range is slightly larger for the same number of bits
(1000/1024)**(n/3).
Just like the 24 "spare" DPD patterns are illegal,
Non-canonical, which is not the same as illegal
Silently accepted as input operands but never produced as result.
any mantissa
corresponding to a number greater than the maximum allowed (1e34
afair) is also illegal, and there are rules for how to handle
both cases (without checking, i seem to remember that they
should be treated as zero?)
BID significand extension > max is indeed treated as zeros.
Non-canonical DPD declets have non-zero values.
They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i
are in range [0:1].
OK, that is probably because allowing them on input is
significantly faster/cheaper than having to detect and
modify/trap/erase.
With the calculation latencies of the IBM Z-series, modify/trap/erase is
no problem.
That said, out of range mantissas could also have been accepted,
except they would not have had a valid conversion to either DPD or
ascii.
Terje
EricP wrote:
Many bonds are time series polynomials with non-integer times.
Many calculations use pow(base,exp) to a non-integer exponent.
Looking at Black-Scholes derivative and options pricing
https://en.wikipedia.org/wiki/Black-Scholes_pricing_formula#Black%E2%80%93Scholes_formula
I see exp(), ln(), sqrt().
I don't see any rules for accuracy.
I found double was fine for calculating bonds.
There are lots of other calculations:
https://en.wikipedia.org/wiki/Financial_mathematics
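For concreteness, the formula linked above uses only exp(), ln() and sqrt(); a minimal double-precision transcription of the standard textbook form (my own sketch, European call, no dividends):

```python
from math import erf, exp, log, sqrt

def norm_cdf(x: float) -> float:
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S: float, K: float, T: float, r: float, sigma: float) -> float:
    """Black-Scholes price of a European call option (no dividends)."""
    d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

print(bs_call(100.0, 100.0, 1.0, 0.05, 0.2))  # ~10.45
```

This supports EricP's point: double precision is comfortably adequate here, and the standard does not state accuracy requirements for the transcendental functions involved.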
On Tue, 06 Jan 2026 17:56:23 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the
mess is not worse (and not better) than two bytes orders of IEEE
binary FP.
Division by 10 is way faster in DPD than in Binary.
Do you consider speed to be part of semantics?
Just wondering...
I see much bigger problem [than BID vs DPD] in the fact that IEEE standard does not specify few very important things about global
states and interoperability of BFP and DFP in the same process.
Like whether BFP and DFP have common rounding mode or each one has
mode of its own. The same question about for exception flags and exception masks.
On Tue, 06 Jan 2026 17:59:33 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Michael S wrote:
On Tue, 6 Jan 2026 12:35:20 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
Sorry, that's wrong:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely
packed decimal encoding of the significand... it's a bit of a
mess.
Since both formats have exactly identical semantics, in theory
the mess is not worse (and not better) than two bytes orders
of IEEE binary FP.
Almost.
IIRC, there is no restriction on the binary mantissa, so its
range is slightly larger for the same number of bits
(1000/1024)**(n/3).
Just like the 24 "spare" DPD patterns are illegal,
Non-canonical, which is not the same as illegal
Silently accepted as input operands but never produced as result.
any mantissa
corresponding to a number greater than the maximum allowed (1e34
afair) is also illegal, and there are rules for how to handle
both cases (without checking, i seem to remember that they
should be treated as zero?)
BID significand extension > max is indeed treated as zeros. Non-canonical DPD declets have non-zero values.
They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i
are in range [0:1].
OK, that is probably because allowing them on input is
significantly faster/cheaper than having to detect and
modify/trap/erase.
With the calculation latencies of the IBM Z-series, modify/trap/erase is
no problem.
How do you know the calculation latencies of the IBM Z-series?
Did they make that information public?
That said, out of range mantissas could also have been accepted,
except they would not have had a valid conversion to either DPD or
ascii.
Terje
Michael S <already5chosen@yahoo.com> posted:
On Tue, 06 Jan 2026 17:59:33 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
Michael S wrote:
On Tue, 6 Jan 2026 12:35:20 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
Michael S <already5chosen@yahoo.com> schrieb:
Sorry, that's wrong:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely
packed decimal encoding of the significand... it's a bit
of a mess.
Since both formats have exactly identical semantics, in
theory the mess is not worse (and not better) than two
bytes orders of IEEE binary FP.
Almost.
IIRC, there is no restriction on the binary mantissa, so its
range is slightly larger for the same number of bits
(1000/1024)**(n/3).
Just like the 24 "spare" DPD patterns are illegal,
Non-canonical, which is not the same as illegal
Silently accepted as input operands but never produced as
result.
any mantissa
corresponding to a number greater than the maximum allowed
(1e34 afair) is also illegal, and there are rules for how to
handle both cases (without checking, i seem to remember that
they should be treated as zero?)
BID significand extension > max is indeed treated as zeros. Non-canonical DPD declets have non-zero values.
They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f,
and i are in range [0:1].
OK, that is probably because allowing them on input is
significantly faster/cheaper than having to detect and modify/trap/erase.
With the calculation latencies of the IBM Z-series, modify/trap/erase
is no problem.
How do you know the calculation latencies of the IBM Z-series?
Did they make that information public?
9-15 months ago there was a presentation of their latest mainframe
showing the pipeline lengths.
Decode was on the order of 20 cycles, down from the top left;
execute was horizontal across the middle;
retire was on the order of 12 cycles, down from the top right.
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work acceptably with it. Lots of organisations don't see anything obvious
for them in DFP.
The thing I'd like to try out is fast quad-precision BFP. For the
field I work in, that would make some things much simpler. I did try
to interest AMD in the idea in the early days of x86-64, but they
didn't bite.
John
Michael S <already5chosen@yahoo.com> posted:
On Tue, 06 Jan 2026 17:56:23 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely
packed decimal encoding of the significand... it's a bit of a
mess.
Since both formats have exactly identical semantics, in theory
the mess is not worse (and not better) than two bytes orders of
IEEE binary FP.
Division by 10 is way faster in DPD than in Binary.
Do you consider speed to be part of semantics?
Just wondering...
More like justification for the facility itself.
But also note: DPD can be converted to ASCII all data in parallel
without any division going on. Binary does not have that property.
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because, with non-answers or answers like that,
nobody thinks that you are serious?
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but leaves to
software the last step of unpacking 4-bit BCD to 8-bit ASCII.
Or, maybe, they have an instruction for that as well, but it's not in the
DFP-related part of the book, so I missed it.
In article <20260107133424.00000e99@yahoo.com>,
already5chosen@yahoo.com (Michael S) wrote:
I already asked you couple of years ago how fast do want binary128
in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave totally
unrealistic answer like "the same as binary64".
May be, nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
I don't know much about hardware design. What is realistic for
hardware binary128?
John
Michael S <already5chosen@yahoo.com> writes:
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but leaves to
software the last step of unpacking 4-bit BCD to 8-bit ASCII.
That's a fairly simple operation, just adding the proper zone
digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.
The B3500 (1965) did that in hardware;
I would find it strange if the Power CPU didn't.
Or, may be, they have instruction for that as well, but it's not in
DFP related part of the book, so I missed it.
It was just a flavor of the move instruction in the B3500.
On Wed, 07 Jan 2026 15:24:38 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but leaves to
software the last step of unpacking 4-bit BCD to 8-bit ASCII.
That's a fairly simple operation, just adding the proper zone
digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.
The hard part is getting the bits into the right places within a wide word.
I.e. you have 64 bits like this: '0123456789abcdef', and you want to
convert it to a pair of 64-bit numbers: '0001020304050607', '08090a0b0c0d0e0f'.
Without HW help it is not fast. Likely not faster than running
the respective DPD declets through a look-up table where you, at least,
get 3 ASCII digits per look-up. On a modern wide core, likely only
marginally faster than converting a BID mantissa.
The B3500 (1965) did that in hardware;
I would find it strange if the Power CPU didn't.
Or, may be, they have instruction for that as well, but it's not in
DFP related part of the book, so I missed it.
It was just a flavor of the move instruction in the B3500.
I am not sure that we are talking about the same thing.
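The step being discussed (spread packed-BCD nibbles to bytes, then add the zone) looks like this in a naive per-digit sketch; this is my illustration for decimal digits only, and a real SWAR or pdep-based version would avoid the loop:

```python
def bcd_to_ascii(x: int) -> bytes:
    """Unpack 16 packed-BCD digits (one 64-bit word) into 16 ASCII bytes."""
    # nibble i (counting from the most significant end) becomes byte i,
    # OR'ed with the ASCII zone 0x30 ('0')
    return bytes(0x30 | ((x >> (4 * (15 - i))) & 0xF) for i in range(16))

print(bcd_to_ascii(0x0123456789012345))  # b'0123456789012345'
```

The zone-add itself is trivial; as Michael says, the expensive part without hardware help is the nibble-to-byte spread that precedes it.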
Michael S <already5chosen@yahoo.com> writes:
On Wed, 07 Jan 2026 15:24:38 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but leaves to
software the last step of unpacking 4-bit BCD to 8-bit ASCII.
That's a fairly simple operation, just adding the proper zone
digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.
The hard part is getting the bits into the right places within a wide word.
I.e. you have 64 bits like this: '0123456789abcdef', and you want to
convert it to a pair of 64-bit numbers: '0001020304050607', '08090a0b0c0d0e0f'.
The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
a move from the BCD '0123456789abcdef' to the corresponding ASCII
bytes.
[*] Printable version of the BCD input number.
The B3500 addressed to the digit, so it was a simple move to add
the zone digit when converting to ASCII (or EBCDIC depending on
a processor flag). Although 'undigits' (bit patterns 0b1010
through 0b1111) were not legal in BCD numbers on the B3500
and adding a zone digit to them didn't make them printable.
e.g.
INPUT DATA 8 UN 0123456789 // Unsigned Numeric 8 nibble field
OUTPUT DATA 8 UA // Unsigned Alphanumeric 8 byte field
MVN INPUT(UN), OUTPUT(UA) // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)
If the output field was larger than the input field, leading
blanks would be added before the number when using MVN. MVA
would blank pad the output field after the number when the
output field was larger.
Without HW help it is not fast. Likely not faster than running
the respective DPD declets through a look-up table where you, at least,
get 3 ASCII digits per look-up. On a modern wide core, likely only
marginally faster than converting a BID mantissa.
The B3500 (1965) did that in hardware;
I would find it strange if the Power CPU didn't.
Or, may be, they have instruction for that as well, but it's not in
DFP related part of the book, so I missed it.
It was just a flavor of the move instruction in the B3500.
I am not sure that we are talking about the same thing.
Probably not, since the ASCII zero character is encoded as 0x30
instead of the 0x00 you show in the example above.
Michael S <already5chosen@yahoo.com> posted:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the mess
is not worse (and not better) than two byte orders of IEEE binary FP.
Division by 10 is way faster in DPD than in Binary.
I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
standard does not specify a few very important things about global states
and interoperability of BFP and DFP in the same process. Like whether
BFP and DFP have a common rounding mode or each one has a mode of its own.
The same goes for exception flags and exception masks.
On Wed, 7 Jan 2026 13:16 +0000 (GMT Standard Time)
jgd@cix.co.uk (John Dallman) wrote:
In article <20260107133424.00000e99@yahoo.com>,
already5chosen@yahoo.com (Michael S) wrote:
I already asked you couple of years ago how fast do want binary128
in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave totally unrealistic answer like "the same as binary64".
May be, nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
I don't know much about hardware design. What is realistic for
hardware binary128?
John
I think that you are asking the wrong question, but I'll answer nevertheless.
For a design that takes a significant amount of design effort and
non-trivial silicon resources (say +3-5% in core area):
SIMD throughput within the same TDP: 1/4th of DP FP. Maybe 1/3rd if the
designers worked very hard.
Latency: assuming a 4-clock FMA for DP FP, a 9-clock QP FMA sounds very
realistic. Maybe 8 clocks. Mitch could answer better. FADD can be
faster - 6 sounds realistic.
The other extreme in the design space is what IBM did on POWER9.
I would guess that there the silicon resources dedicated to quad-precision
BFP were below 0.5% of the core area. Likely, below 0.1%.
They did a scalar (i.e. non-SIMD) quad-FP unit. FADD is
pipelined, but FMUL/FMA is very minimally pipelined (at most 2
operations proceed simultaneously).
Throughput/latency table (T = throughput in ops/clock, L = latency in clocks):
Oper :  DP T    L  |  QP T     L
ADD  :     4  5-7  |     1    12
MUL  :     4  5-7  |  1/13    24
MADD :     4  5-7  |  1/13    24
As you can see, POWER9 double-precision throughput/latency numbers are
somewhat worse than what we are accustomed to on x86-64 and on high-end
ARM64. However, even relative to those not-great numbers, the throughput
of QP FMA is 52 times lower and the latency ~4 times higher.
And still, it all depends on the application. If all the application does
is multiplication or decomposition of big matrices, then migration to a
minimalistic QP engine similar to the one in POWER9 will cause a major
slowdown (then again, as shown by my anecdote, sometimes DP is
inadequate for exactly those tasks).
But most applications are not like that. Even for activities that are
normally considered numerically intensive, like recalculation of a huge
spreadsheet, I'd expect at most a few percent of slowdown on POWER9 QP.
I don't know where your application is placed in this spectrum. Most
likely, you don't know either. Until you try! And in order to get a
feeling you don't need hardware. Use a software implementation.
Experiments, even with a not very good software implementation like the
one in gcc on x86-64 and ARM64, will give you a massively better feel for
a lot of questions. They will put you into a position of knowledge when
proposing something to AMD, Intel or Arm.
Which, of course, does not guarantee that they will bite your bait. I'd
even dare to say that for as long as the current "AI" bubble lasts they
will not bite. But it will not last forever.
On Wed, 07 Jan 2026 15:24:38 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but leaves to
software the last step of unpacking 4-bit BCD to 8-bit ASCII.
That's a fairly simple operation, just adding the proper zone
digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.
The hard part is getting the bits into the right places within a wide word.
I.e. you have 64 bits like this: '0123456789abcdef', and you want to
convert it to a pair of 64-bit numbers: '0001020304050607', '08090a0b0c0d0e0f'.
Without HW help it is not fast.
Likely not faster than running
respective DPD declets through look-up table where you, at least, get 3
ASCII digits per look-up. On modern wide core, likely only marginally
faster than converting BYD mantissa.
The B3500 (1965) did that in hardware;
I would find it strange if the Power CPU didn't.
Or, may be, they have instruction for that as well, but it's not in
DFP related part of the book, so I missed it.
It was just a flavor of the move instruction in the B3500.
I am not sure that we are talking about the same thing.
Michael S <already5chosen@yahoo.com> posted:
On Tue, 06 Jan 2026 17:56:23 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the
mess is not worse (and not better) than two bytes orders of IEEE
binary FP.
Division by 10 is way faster in DPD than in Binary.
Do you consider speed to be part of semantics?
Just wondering...
More like justification for the facility itself.
But also note: DPD can be converted to ASCII all data in parallel without
any division going on. Binary does not have that property.
On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
jgd@cix.co.uk (John Dallman) wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious
for them in DFP.
The thing I'd like to try out is fast quad-precision BFP. For the
field I work in, that would make some things much simpler. I did try
to interest AMD in the idea in the early days of x86-64, but they
didn't bite.
John
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because, with non-answers or answers like that,
nobody thinks that you are serious?
Anecdote.
A few months ago I tried to design very long decimation filters with
stopband attenuation of ~160 dB.
Matlab's implementation of the Parks-McClellan algorithm (a customized
variation of Remez Exchange spiced with a small portion of black magic)
was not up to the task. Gnu Octave's implementation was somewhat worse
yet.
When I started to investigate the reasons I found out that there were
actually two of them, both related to insufficient precision in a
series of DP FP calculations.
The first was the Discrete Cosine Transform and the underlying FFT
engine for N around 32K.
The second was solving a system of linear equations for N around 1000 or
a little more.
In both cases the precision of DP FP was perfectly sufficient both for
inputs and for outputs. But errors accumulated in intermediate
calculations caused trouble.
In both cases quad-precision FP was key to the solution.
For the DCT (FFT) I went for a full re-implementation at higher precision.
For the linear solver, I left the LU decomposition, which happens to be
the heavy O(N**3) part, in DP. Quad precision was applied only during the
final solver stages - forward substitution, back substitution, calculation
of the residual error vector, and repetition of the forward and back
substitutions. All those parts are O(N**2). That modification was
sufficient to improve the precision of the result almost to the best
possible in the DP FP format.
And sufficient for good convergence of the Parks-McClellan algorithm.
So, what is the point of my anecdote?
The speed of quad-precision FP was never an obstacle, even when running
on rather old hardware. And it's not like the calculations here were not
heavy. They were heavy all right, thank you very much.
Using quad precision only when necessary helped.
But what helped more is not being hesitant: doing things instead of
worrying that they would be too slow.
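The DP-solve-plus-higher-precision-residual scheme described above is classic iterative refinement. A toy sketch, using Python's exact Fraction as a stand-in for quad precision in the residual (the matrix, names, and iteration count are all illustrative):

```python
from fractions import Fraction

def lu_solve_float(A, b):
    """Plain Gaussian elimination in double precision (no pivoting; demo only)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def refine(A, b, x, iters=2):
    """Iterative refinement: residual in exact arithmetic, correction in double."""
    for _ in range(iters):
        r = [float(Fraction(bi) - sum(Fraction(aij) * Fraction(xj)
                                      for aij, xj in zip(row, x)))
             for row, bi in zip(A, b)]
        dx = lu_solve_float(A, r)          # reuse the cheap O(N**2)/O(N**3) solver
        x = [xi + di for xi, di in zip(x, dx)]
    return x

# mildly ill-conditioned 2x2 system with solution near (1, 1)
A = [[4.0, 2.0], [2.0, 1.0001]]
b = [6.0, 3.0001]
print(refine(A, b, lu_solve_float(A, b)))
```

The structure mirrors the anecdote: the O(N**3) factorization stays in working precision, and only the O(N**2) residual/correction steps use the extended precision.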
If you look at Java's BigDecimal operations
<https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
they all have versions without and with a MathContext (which includes a
target scale and a rounding mode), and sometimes additional variants
(e.g., divide() has variants where you pass just the rounding mode, or
the rounding mode and scale individually instead of through a
MathContext).
I wonder how that would compare in practice with a Rational type, where
all arithmetic operations are exact (and thus don't need anything like
a MathContext) and you simply provide a rounding function that takes
two arguments: a "target scale" (in the form of a target denominator) and
a rounding mode.
[ Extra points for implementing compiler optimizations that keep track
of the denominators statically to try and do away with the
denominators at run-time as much as possible. Maybe also figure out
how to eliminate the use of bignums for the numerators. ]
In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:
I already asked you couple of years ago how fast do want binary128
in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave totally
unrealistic answer like "the same as binary64".
May be, nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
I don't know much about hardware design. What is realistic for hardware binary128?
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Tue, 06 Jan 2026 17:56:23 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely
packed decimal encoding of the significand... it's a bit of a
mess.
Since both formats have exactly identical semantics, in theory
the mess is not worse (and not better) than two bytes orders of
IEEE binary FP.
Division by 10 is way faster in DPD than in Binary.
Do you consider speed to be part of semantics?
Just wondering...
More like justification for the facility itself.
But also note: DPD can be converted to ASCII all data in parallel
without any division going on. Binary does not have that property.
I took a look at how IBM exploits this property.
I don't have an up-to-date zArch manual. It's probably easily available,
but right now I have no time or desire to search.
So I looked at POWER, which tends to copy DFP stuff from zArch with a
one-generation time gap.
POWER ISA v3.0 (2015) has the following relevant instructions:
ddedpd - DFP Decode DPD to BCD
For Decimal128 it has two forms
- convert 32 rightmost digits of significand (unsigned)
- convert 31 rightmost digits of significand (signed)
With IBM I am never sure what they call 'rightmost' :(
If you wonder about couple of remaining digits, IBM has following
helper instruction:
dscli - DFP shift significand left immediate
dscri - DFP shift significand right immediate
Once again, since it's IBM, I am not sure about directions.
dxex - DFP extract biased exponent.
The exponent is extracted in binary form, not in BCD.
They also have instructions that work in the opposite direction:
denbcd - DFP encode BCD to DPD
It converts a signed 31-digit or unsigned 32-digit BCD-encoded integer to
DPD with exponent=0.
diex - DFP insert biased exponent.
Here too the exponent is in binary form, not in BCD.
So, POWER hardware helps a lot converting DPD to BCD, but leaves to
software the last step of unpacking 4-bit BCD to 8-bit ASCII.
Or, maybe, they have an instruction for that as well, but it's not in the
DFP-related part of the book, so I missed it.
In BID we would do division by 10 with a reciprocal multiplication that
handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
into a 64-bit int.
No matter how we do it, two iterations is enough to
handle even maximally large numbers, and that would require 4 64x64->128
MULs per iteration. Since these would pipeline nicely, I'm guessing it
would be doable in 15-30 cycles total?
So yes, scaling by a power of ten is the one operation where DPD is
clearly much faster.
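The reciprocal multiplication mentioned here is the usual magic-number division; a 64-bit sketch of the principle (the constant is ceil(2**67/10), a well-known pair that is exact for all 64-bit inputs):

```python
MAGIC = (2**67 + 9) // 10          # 0xCCCCCCCCCCCCCCCD == ceil(2**67 / 10)

def div10_u64(x: int) -> int:
    """x // 10 for any unsigned 64-bit x, via one widening multiply and a shift."""
    assert 0 <= x < 2**64
    # in hardware/C this is the high half of a 64x64->128 multiply, shifted by 3
    return (x * MAGIC) >> 67

print(div10_u64(2**64 - 1))  # 1844674407370955161
```

For a full 113-bit BID significand the same idea needs the wider 64x64->128 products and the couple of iterations Terje estimates above.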
Scott Lurndal <scott@slp53.sl.home> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 07 Jan 2026 15:24:38 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but leaves to
software the last step of unpacking 4-bit BCD to 8-bit ASCII.
That's a fairly simple operation, just adding the proper zone
digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.
The hard part is getting the bits into the right places within a wide word.
I.e. you have 64 bits like this:
'0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
'0001020304050607', '08090a0b0c0d0e0f'
The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
a move from the BCD '0123456789abcdef' to the corresponding ASCII
bytes.
[*] Printable version of the BCD input number.
The B3500 addressed to the digit, so it was a simple move to add
the zone digit when converting to ASCII (or EBCDIC depending on
a processor flag). Although 'undigits' (bit patterns 0b1010
through 0b1111) were not legal in BCD numbers on the B3500
and adding a zone digit to them didn't make them printable.
e.g.
INPUT  DATA 8 UN 0123456789  // Unsigned Numeric 8 nibble field
OUTPUT DATA 8 UA             // Unsigned Alphanumeric 8 byte field
MVN INPUT(UN), OUTPUT(UA)    // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)
If the output field was larger than the input field, leading
blanks would be added before the number when using MVN. MVA
would blank pad the output field after the number when the
output field was larger.
Without HW help it is not fast. Likely not faster than running the
respective DPD declets through a look-up table where you, at least, get 3
ASCII digits per look-up. On a modern wide core, likely only marginally
faster than converting a BID mantissa.
The B3500 (1965) did that in hardware;
I would find it strange if the Power CPU didn't.
Or, may be, they have instruction for that as well, but it's not in
DFP related part of the book, so I missed it.
It was just a flavor of the move instruction in the B3500.
I am not sure that we are talking about the same thing.
Probably not, since the ASCII zero character is encoded as 0x30
instead of the 0x00 you show in the example above.
IIUC Michael was asking for the following transformation
on strings of hex digits:
0123456789abcdef
into
000102030405060708090a0b0c0d0e0f
given (fast) such a transformation it is very easy to add proper
zone bits on modern hardware. One possible approach to the transform
above would be to do a byte-type unpacking operation (that is, a
version of the above working on bytes) and then use masking and
shifting to move the upper bits of each byte to the right place.
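A C sketch of that mask-and-shift unpacking (my own illustration; the function names are mine, and the constants are the usual bit-interleave masks):

```c
#include <stdint.h>

/* Spread the 8 nibbles of a 32-bit value into the 8 bytes of a 64-bit value
   by doubling the spacing in three mask-and-shift steps. */
static uint64_t spread_nibbles(uint32_t x)
{
    uint64_t v = x;
    v = (v | (v << 16)) & 0x0000FFFF0000FFFFULL;
    v = (v | (v <<  8)) & 0x00FF00FF00FF00FFULL;
    v = (v | (v <<  4)) & 0x0F0F0F0F0F0F0F0FULL;
    return v;
}

/* '0123456789abcdef' -> '0001020304050607', '08090a0b0c0d0e0f'.
   OR-ing 0x3030303030303030 into each half would then add the ASCII zone
   (digits a-f would need a further fixup to land on 'A'-'F'). */
static void unpack16(uint64_t x, uint64_t out[2])
{
    out[0] = spread_nibbles((uint32_t)(x >> 32));
    out[1] = spread_nibbles((uint32_t)x);
}
```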
MitchAlsup wrote:
Michael S <already5chosen@yahoo.com> posted:
On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
And two decimal flavors, as well, with binary and densely packed
decimal encoding of the significand... it's a bit of a mess.
Since both formats have exactly identical semantics, in theory the mess
is not worse (and not better) than two bytes orders of IEEE binary FP.
Division by 10 is way faster in DPD than in Binary.
I see much bigger problem [than BID vs DPD] in the fact that IEEE
standard does not specify few very important things about global states
and interoperability of BFP and DFP in the same process. Like whether
BFP and DFP have common rounding mode or each one has mode of its own.
The same question about for exception flags and exception masks.
In hardware a division by 10 is either an adjustment of the exponent,
which is equally fast for both encodings, or for a real division DPD
just requires unpacking of all the 10-bit fields (I'm assuming IBM does
this in parallel for all 11 groups, so max one cycle),
then shifting all
the nybbles down one position before the reverse operation packs them
back up, probably including a rounding step before the repack.
This operation is very closely related to the general case of having to re-normalize after any operation which would require that, i.e. commonly
for DFMUL, very seldom for DFADD/DFSUB, and almost always for DFDIV.
In BID we would do division by 10 with a reciprocal multiplication that
handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
into a 64-bit int. No matter how we do it, two iterations are enough to
handle even maximally large numbers, and that would require 4 64x64->128
MULs per iteration. Since these would pipeline nicely I'm guessing it
would be doable in 15-30 cycles total?
So yes, scaling by a power of ten is the one operation where DPD is
clearly much faster, but if you try to implement DPD in software, then
you have to handle the unpack and pack operations, and they could easily take the same or even more time.
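The reciprocal trick on the BID side can be sketched in C for the simplest case: an exact unsigned 64-bit divide by 10 using the well-known magic constant ceil(2^67/10) and one 64x64->128 MUL. A standalone illustration, not Michael's actual code:

```c
#include <stdint.h>

/* Exact unsigned divide-by-10 for any 64-bit x: one widening multiply by
   the magic constant 0xCCCCCCCCCCCCCCCD = ceil(2^67 / 10), then a shift by
   67.  (unsigned __int128 is a GCC/Clang extension.) */
static uint64_t div10(uint64_t x)
{
    unsigned __int128 p = (unsigned __int128)x * 0xCCCCCCCCCCCCCCCDULL;
    return (uint64_t)(p >> 67);
}
```

This is exactly what compilers emit for `x / 10` on 64-bit targets; the wider iterations Terje describes chain several such multiplies across limbs.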
Terje
John Dallman wrote:
In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
I don't know much about hardware design. What is realistic for hardware binary128?
Sub-10 cycles fmul/fadd/fsub seems very doable?
Mitch?
Terje
Terje Mathisen <terje.mathisen@tmsw.no> writes:
In BID we would do division by 10 with a reciprocal multiplication that
handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
into a 64-bit int.
2^64 < 10^20
How do you put 26/27 decimal digits in 64 bits? Or do you mean 27
quinary digits? What would that be good for?
No matter how we do it, two iterations are enough to
handle even maximally large numbers, and that would require 4 64x64->128
MULs per iteration. Since these would pipeline nicely I'm guessing it
would be doable in 15-30 cycles total?
On Skylake the latency of a 64x64->128 multiplication is 6 cycles (4
cycles for the lower 64 bits), and I expect it to be lower on newer
hardware. The pipelined multiplications should be done by cycle 9.
There are also some additions involved, but I would not expect them to increase the latency to 15 cycles. What other operations do you have
in mind that would result in 15-30 cycles? For scaling you don't need
the remainder, only some rounding.
So yes, scaling by a power of ten is the one operation where DPD is
clearly much faster
I would not bet on it. It needs to unpack the 34 digits into 136
bits, do a 136-bit shift, then repack into DPD.
Either they widen the
data path beyond what they normally do, or they do it in parcels of 64
bits or less, and the end result can easily take a similar number of
cycles as 128-bit binary multiplication with the reciprocal.
My guess
is that they did the slow implementation at first, and then there was
so little takeup of DFP that the slow implementation is good enough to
this day.
- anton
In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
I don't know much about hardware design. What is realistic for hardware binary128?
Michael S wrote:
On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
jgd@cix.co.uk (John Dallman) wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious
for them in DFP.
The thing I'd like to try out is fast quad-precision BFP. For the
field I work in, that would make some things much simpler. I did try
to interest AMD in the idea in the early days of x86-64, but they
didn't bite.
John
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
Anecdote.
A few months ago I tried to design very long decimation filters with stop
band attenuation of ~160 dB.
Matlab's implementation of the Parks-McClellan algorithm (a customized
variation of Remez Exchange spiced with a small portion of black magic)
was not up to the task. The GNU Octave implementation was somewhat worse
yet.
When I started to investigate the reasons I found out that there were
actually two of them, both related to insufficient precision of the
series of DP FP calculations.
The first was the Discrete Cosine Transform and underlying FFT engine for N
around 32K.
The second was solving a system of linear equations for N around 1000 or
a little more.
In both cases the precision of DP FP was perfectly sufficient both for
inputs and for outputs. But errors accumulated in intermediate calculations
caused trouble.
In both cases quad-precision FP was key to the solution.
For DCT (FFT) I went for full re-implementation at higher precision.
For the Linear Solver, I left LU decomposition, which happens to be the
heavy O(N**3) part, in DP. Quad-precision was applied only during the final
solver stages - forward propagation, back propagation, calculation of
residual error vector and repetition of forward and back propagation.
All those parts are O(N**2). That modification was sufficient to
improve the precision of the result almost to the best possible in DP FP
format. And sufficient for good convergence of the Parks-McClellan algorithm.
So, what is the point of my anecdote?
The speed of quad-precision FP was never an obstacle, even when running
on rather old hardware. And it's not like the calculations here were not
heavy. They were heavy all right, thank you very much.
Using quad-precision only when necessary helped.
But what helped more is not being hesitant. Doing things instead of
worrying that they would be too slow.
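The mixed-precision refinement described above can be illustrated one precision level down: factor and solve in float (standing in for DP), compute the residual in double (standing in for quad), then apply a correction. A toy sketch of the technique only, not the actual Matlab/Octave code; all names are mine:

```c
#define N 3

/* Solve A*x = b by Gaussian elimination in float, no pivoting (fine for
   the diagonally dominant toy system in the usage note below). */
static void solve_float(const float A[N][N], const float b[N], float x[N])
{
    float a[N][N], v[N];
    for (int i = 0; i < N; i++) {
        v[i] = b[i];
        for (int j = 0; j < N; j++) a[i][j] = A[i][j];
    }
    for (int k = 0; k < N; k++)                /* forward elimination */
        for (int i = k + 1; i < N; i++) {
            float f = a[i][k] / a[k][k];
            for (int j = k; j < N; j++) a[i][j] -= f * a[k][j];
            v[i] -= f * v[k];
        }
    for (int i = N - 1; i >= 0; i--) {         /* back substitution */
        float s = v[i];
        for (int j = i + 1; j < N; j++) s -= a[i][j] * x[j];
        x[i] = s / a[i][i];
    }
}

/* One refinement step: residual r = b - A*x computed in double (the wider
   format), then solve for the correction and update x.  With a saved LU
   factor the correction solve would be O(N^2), as in the post. */
static void refine(const float A[N][N], const float b[N], float x[N])
{
    float rf[N], dx[N];
    for (int i = 0; i < N; i++) {
        double s = (double)b[i];
        for (int j = 0; j < N; j++) s -= (double)A[i][j] * (double)x[j];
        rf[i] = (float)s;
    }
    solve_float(A, rf, dx);
    for (int i = 0; i < N; i++) x[i] += dx[i];
}
```

With A = {{4,1,0},{1,5,1},{0,1,6}} and b = {6,14,20}, the exact solution is x = (1,2,3); a solve plus one refinement step lands essentially on it.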
I think the main issue is similar to what we had before 754, i.e. every
FP programmer needed to also be an FP analyst, capable of carrying out
error budget calculations across their algorithms.
You can obviously do that, and so can a number of regulars here, but in
the real world we are in a _very_ small minority.
For the rest, just having fp128 fast enough that it could be
applied naively would solve a number of problems.
Terje
On 1/7/2026 11:56 AM, Terje Mathisen wrote:
Michael S wrote:
On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
jgd@cix.co.uk (John Dallman) wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious
for them in DFP.
The thing I'd like to try out is fast quad-precision BFP. For the
field I work in, that would make some things much simpler. I did try
to interest AMD in the idea in the early days of x86-64, but they
didn't bite.
John
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
Anecdote.
A few months ago I tried to design very long decimation filters with stop
band attenuation of ~160 dB.
Matlab's implementation of the Parks-McClellan algorithm (a customized
variation of Remez Exchange spiced with a small portion of black magic)
was not up to the task. The GNU Octave implementation was somewhat worse
yet.
When I started to investigate the reasons I found out that there were
actually two of them, both related to insufficient precision of the
series of DP FP calculations.
The first was the Discrete Cosine Transform and underlying FFT engine for N
around 32K.
The second was solving a system of linear equations for N around 1000 or
a little more.
In both cases the precision of DP FP was perfectly sufficient both for
inputs and for outputs. But errors accumulated in intermediate calculations
caused trouble.
In both cases quad-precision FP was key to the solution.
For DCT (FFT) I went for full re-implementation at higher precision.
For the Linear Solver, I left LU decomposition, which happens to be the
heavy O(N**3) part, in DP. Quad-precision was applied only during the final
solver stages - forward propagation, back propagation, calculation of
residual error vector and repetition of forward and back propagation.
All those parts are O(N**2). That modification was sufficient to
improve the precision of the result almost to the best possible in DP FP
format. And sufficient for good convergence of the Parks-McClellan algorithm.
So, what is the point of my anecdote?
The speed of quad-precision FP was never an obstacle, even when running
on rather old hardware. And it's not like the calculations here were not
heavy. They were heavy all right, thank you very much.
Using quad-precision only when necessary helped.
But what helped more is not being hesitant. Doing things instead of
worrying that they would be too slow.
I think the main issue is similar to what we had before 754, i.e. every
FP programmer needed to also be an FP analyst, capable of carrying out
error budget calculations across their algorithms.
You can obviously do that, and so can a number of regulars here, but in
the real world we are in a _very_ small minority.
For the rest, just having fp128 fast enough that it could be
applied naively would solve a number of problems.
As I see it, FP128 is fast enough for practical use even with a
software-only implementation (though, in part, due to its relatively low
usage frequency; if it is used, it is mostly for cases that actually
need precision rather than high throughput, with high-throughput cases
likely to remain dominated by smaller types like Binary32 and Binary16,
and with Binary64 remaining as the "de-facto default" precision for
floating-point).
As can be noted, in my case, it was a partial motivation for supporting
things like 128-bit integer instructions (in my C compiler, and
optionally in the underlying ISA), as supporting Int128 ops is a step
towards making Binary128 in software more practical (without the
steep cost of a 128-bit FPU).
...
Terje
Terje Mathisen <terje.mathisen@tmsw.no> posted:
John Dallman wrote:
In article <20260107133424.00000e99@yahoo.com>,
already5chosen@yahoo.com (Michael S) wrote:
I already asked you a couple of years ago how fast you want
binary128 to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like
that nobody thinks that you are serious?
I don't know much about hardware design. What is realistic for
hardware binary128?
Sub-10 cycles fmul/fadd/fsub seems very doable?
Mitch?
Assuming 128-bit operands are delivered in 1 cycle and 128-bit
results are delivered in 1 cycle::
128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.
128-bit Fmul requires that the multiplier tree be 64×64 instead of
53×53 (1.46× bigger tree, 1.22× bigger FU), and would/should be 3-4
cycles longer than 64-bit Fmul. If you wanted to be "really clever"
you could use a 59×59 tree and the FU is only 1.12× bigger; but here
you could not use the tree for Integer MUL.
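The tree ratios above follow from area scaling with the square of the operand width; the FU figures also match if one assumes (my assumption, purely for illustration) that the tree is roughly half of the whole FU area:

```c
/* Back-of-envelope check of the area ratios above: tree area scales as
   bits^2 relative to a 53x53 tree, and (assumption) the tree is about half
   of the total FU area. */
static double tree_ratio(int bits)
{
    return ((double)bits / 53.0) * ((double)bits / 53.0);
}

static double fu_ratio(int bits)
{
    return 1.0 + 0.5 * (tree_ratio(bits) - 1.0);
}
/* tree_ratio(64) ~ 1.46, fu_ratio(64) ~ 1.23, fu_ratio(59) ~ 1.12 */
```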
Terje
BGB <cr88192@gmail.com> posted:
On 1/7/2026 11:56 AM, Terje Mathisen wrote:
Michael S wrote:
On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
jgd@cix.co.uk (John Dallman) wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious
for them in DFP.
The thing I'd like to try out is fast quad-precision BFP. For the
field I work in, that would make some things much simpler. I did try
to interest AMD in the idea in the early days of x86-64, but they
didn't bite.
John
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
Anecdote.
A few months ago I tried to design very long decimation filters with stop
band attenuation of ~160 dB.
Matlab's implementation of the Parks-McClellan algorithm (a customized
variation of Remez Exchange spiced with a small portion of black magic)
was not up to the task. The GNU Octave implementation was somewhat worse
yet.
When I started to investigate the reasons I found out that there were
actually two of them, both related to insufficient precision of the
series of DP FP calculations.
The first was the Discrete Cosine Transform and underlying FFT engine for N
around 32K.
The second was solving a system of linear equations for N around 1000 or
a little more.
In both cases the precision of DP FP was perfectly sufficient both for
inputs and for outputs. But errors accumulated in intermediate calculations
caused trouble.
In both cases quad-precision FP was key to the solution.
For DCT (FFT) I went for full re-implementation at higher precision.
For the Linear Solver, I left LU decomposition, which happens to be the
heavy O(N**3) part, in DP. Quad-precision was applied only during the final
solver stages - forward propagation, back propagation, calculation of
residual error vector and repetition of forward and back propagation.
All those parts are O(N**2). That modification was sufficient to
improve the precision of the result almost to the best possible in DP FP
format. And sufficient for good convergence of the Parks-McClellan algorithm.
So, what is the point of my anecdote?
The speed of quad-precision FP was never an obstacle, even when running
on rather old hardware. And it's not like the calculations here were not
heavy. They were heavy all right, thank you very much.
Using quad-precision only when necessary helped.
But what helped more is not being hesitant. Doing things instead of
worrying that they would be too slow.
I think the main issue is similar to what we had before 754, i.e. every
FP programmer needed to also be an FP analyst, capable of carrying out
error budget calculations across their algorithms.
You can obviously do that, and so can a number of regulars here, but in
the real world we are in a _very_ small minority.
For the rest, just having fp128 fast enough that it could be
applied naively would solve a number of problems.
As I see it, FP128 is fast enough for practical use even with a
software-only implementation (though, in part due to its relatively low
usage frequency; if it is used, it is mostly for cases that actually
need precision, rather than high throughput, with high-throughput cases
likely to remain dominated by smaller types, like Binary32 and Binary16;
with Binary64 remaining as the "de-facto default" precision for
floating-point).
{To date::}
My only use for 128-bit FP was to compute Chebyshev Coefficients for
my high speed DP Transcendentals. I only needed 64-bits of fractions
but, in practice, 80-bit FP was only giving me 63-bits of precision.
Since these are a) compute once b) use infinitely many times; the
speed of 128-bit FP is completely irrelevant.
As can be noted, in my case, it was a partial motivation for supporting
things like 128-bit integer instructions (in my C compiler, and
optionally in the underlying ISA), as supporting Int128 ops is a step
towards making doing Binary128 in software more practical (without the
steep cost of a 128-bit FPU).
It seems to me that if one has "reasonable" ISA support for tearing a
128-bit FP into {sign, exponent, fraction} and reasonably fast 128-bit
integer support, then emulating 128-bit FP in SW is "not that bad" --
especially if one can do 128×128 -> 256 in 4-8 cycles.
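The "tearing apart" step is cheap even in plain C; a sketch (field layout per IEEE 754-2008 binary128 held as two 64-bit halves; names are mine, and unsigned __int128 is a GCC/Clang extension):

```c
#include <stdint.h>

/* Split a binary128 bit pattern (hi = sign, 15-bit exponent, top 48
   fraction bits; lo = low 64 fraction bits) into its fields, restoring the
   hidden bit for normal numbers. */
typedef struct {
    unsigned sign;              /* 1 bit */
    unsigned exponent;          /* 15-bit biased exponent */
    unsigned __int128 frac;     /* 112-bit fraction, plus hidden bit */
} f128_parts;

static f128_parts f128_unpack(uint64_t hi, uint64_t lo)
{
    f128_parts p;
    p.sign     = (unsigned)(hi >> 63);
    p.exponent = (unsigned)((hi >> 48) & 0x7FFF);
    p.frac     = (((unsigned __int128)(hi & 0xFFFFFFFFFFFFULL)) << 64) | lo;
    if (p.exponent != 0)                    /* normal: restore hidden bit */
        p.frac |= (unsigned __int128)1 << 112;
    return p;
}
```

For 1.0 (hi = 0x3FFF000000000000, lo = 0) this yields sign 0, exponent 0x3FFF, and a fraction of exactly the hidden bit.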
...
Terje
On Wed, 07 Jan 2026 20:05:17 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
John Dallman wrote:
In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:
I already asked you a couple of years ago how fast you want
binary128 to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like
that nobody thinks that you are serious?
I don't know much about hardware design. What is realistic for
hardware binary128?
Sub-10 cycles fmul/fadd/fsub seems very doable?
Mitch?
Assuming 128-bit operands are delivered in 1 cycle and 128-bit
results are delivered in 1 cycle::
If we are talking about SIMD of the same width (measured in bits) as
SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
that fully pipelined binary128 operations are a non-starter, because they
would blow your power and thermal budget.
One full-width result (i.e. 8 binary128 results) every 2 cycles sounds somewhat more realistic.
After all in general-purpose CPU binary128, if at all implemented, is
a proverbial tail that can't be allowed to wag the dog.
OTOH, if we define our binary128 to use only the least-significant 128-bit
lane of our 512-bit register and only build b128 capabilities into one
of our pair of FPUs then full pipelining (i.e. 1 result per cycle) looks
like a good choice, at least from power/thermal perspective. That is,
as long as designers found a way to avoid a hot spot.
128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.
128-bit Fmul requires that the multiplier tree be 64×64 instead of
53×53 (1.46× bigger tree, 1.22× bigger FU), and would/should be 3-4
cycles longer than 64-bit Fmul. If you wanted to be "really clever"
you could use a 59×59 tree and the FU is only 1.12× bigger; but here
you could not use the tree for Integer MUL.
Terje
On 1/7/2026 3:18 PM, MitchAlsup wrote:
BGB <cr88192@gmail.com> posted:
On 1/7/2026 11:56 AM, Terje Mathisen wrote:
Michael S wrote:
On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
jgd@cix.co.uk (John Dallman) wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious
for them in DFP.
The thing I'd like to try out is fast quad-precision BFP. For the
field I work in, that would make some things much simpler. I did try
to interest AMD in the idea in the early days of x86-64, but they
didn't bite.
John
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
Anecdote.
A few months ago I tried to design very long decimation filters with stop
band attenuation of ~160 dB.
Matlab's implementation of the Parks-McClellan algorithm (a customized
variation of Remez Exchange spiced with a small portion of black magic)
was not up to the task. The GNU Octave implementation was somewhat worse
When I started to investigate the reasons I found out that there were
actually two of them, both related to insufficient precision of the
series of DP FP calculations.
The first was the Discrete Cosine Transform and underlying FFT engine for N
around 32K.
The second was solving a system of linear equations for N around 1000 or
a little more.
In both cases the precision of DP FP was perfectly sufficient both for
inputs and for outputs. But errors accumulated in intermediate calculations
caused trouble.
In both cases quad-precision FP was key to the solution.
For DCT (FFT) I went for full re-implementation at higher precision.
For the Linear Solver, I left LU decomposition, which happens to be the
heavy O(N**3) part, in DP. Quad-precision was applied only during the final
solver stages - forward propagation, back propagation, calculation of
residual error vector and repetition of forward and back propagation.
All those parts are O(N**2). That modification was sufficient to
improve the precision of the result almost to the best possible in DP FP
format. And sufficient for good convergence of the Parks-McClellan algorithm.
FWIW: For most cases where I had used DCT or FFT, it has almost always
been with fixed-point integer math...
So, what is the point of my anecdote?
The speed of quad-precision FP was never an obstacle, even when running
on rather old hardware. And it's not like the calculations here were not
heavy. They were heavy all right, thank you very much.
Using quad-precision only when necessary helped.
But what helped more is not being hesitant. Doing things instead of
worrying that they would be too slow.
I think the main issue is similar to what we had before 754, i.e. every
FP programmer needed to also be an FP analyst, capable of carrying out
error budget calculations across their algorithms.
You can obviously do that, and so can a number of regulars here, but in
the real world we are in a _very_ small minority.
For the rest, just having fp128 fast enough that it could be
applied naively would solve a number of problems.
As I see it, FP128 is fast enough for practical use even with a
software-only implementation (though, in part due to its relatively low
usage frequency; if it is used, it is mostly for cases that actually
need precision, rather than high throughput, with high-throughput cases
likely to remain dominated by smaller types, like Binary32 and Binary16;
with Binary64 remaining as the "de-facto default" precision for
floating-point).
{To date::}
My only use for 128-bit FP was to compute Chebyshev Coefficients for
my high speed DP Transcendentals. I only needed 64-bits of fractions
but, in practice, 80-bit FP was only giving me 63-bits of precision.
Since these are a) compute once b) use infinitely many times; the
speed of 128-bit FP is completely irrelevant.
As noted, low usage frequency.
If it is something that mostly applies to initial program startup or
occasionally in the slow path, then its being "kinda slow" doesn't matter
too much.
Though, it is starting to seem that "trap and emulate" might still be a little too slow, leading to my recent efforts in the direction of
efficient hot-patching.
Granted, this is more a case of "just sort of pushing the cost somewhere else" and in theory, if the compiler knows that the instruction will
just be patched anyways, it could in principle generate intermediate calls
for cheaper.
As can be noted, in my case, it was a partial motivation for supporting
things like 128-bit integer instructions (in my C compiler, and
optionally in the underlying ISA), as supporting Int128 ops is a step
towards making doing Binary128 in software more practical (without the
steep cost of a 128-bit FPU).
It seems to me that if one has "reasonable" ISA support for tearing a 128-bit FP into {sign, exponent, fraction} and reasonably fast 128-bit integer support, then emulating 128-bit FP in SW is "not that bad" -- especially if one can do 128×128 -> 256 in 4-8 cycles.
Yeah, this is basically the idea.
Int128 ops, and my BITMOV instructions (which can extract/insert/move bitfields within 64 and 128 bit containers; as a combined "Shift and
masked MUX"), can provide a nice boost here.
Sadly, there is still not really a great way to do a 128x128 => 256
multiply though.
The current fastest option is still to decompose it into a crapload of
32x32=>64 bit widening multiply ops (which, ironically, is another thing
that RV is lacking; one needs to use a full 64-bit multiply, but there are
downsides, more so when the base ISA is also lacking PACK/PACKU).
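The decomposition bottoms out in the classic schoolbook widening multiply. A portable C sketch of the 64x64->128 building block from four 32x32->64 partial products (a full 128x128->256 then takes four of these plus carry propagation; names are mine):

```c
#include <stdint.h>

/* 64x64 -> 128 multiply from four 32x32 -> 64 partial products.
   'mid' cannot overflow: it is at most the sum of three 32-bit values. */
static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0;                 /* low  x low  */
    uint64_t p01 = a0 * b1;                 /* low  x high */
    uint64_t p10 = a1 * b0;                 /* high x low  */
    uint64_t p11 = a1 * b1;                 /* high x high */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    *lo = (p00 & 0xFFFFFFFFULL) | (mid << 32);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```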
Still kinda funny that RV land, with all of its wide industrial support,
lots of people doing lots of extensions, advanced features, etc.,
seemingly still fails at making an ISA where "basic things" fit together
well.
And, then a lot of features going off in rabbit holes like "why would
you want this?", and then it turns out it is to micro-optimize some
specific test case within SPECint or something (often, rather than
finding a more general solution that would address multiple related issues).
More so when the "micro-optimize the benchmark" features were more often chosen over the more general purpose "actually address the underlying
issue" features.
Granted, then someone is almost invariably going to be like "all the
parts of RV do fit together well, but you are using it wrong...".
But, in this case, I would expect GCC to generate smaller binaries than
BGBCC, leaving me to think it is more a case of "these parts don't fit
together all that well".
...
Terje
On 1/7/2026 7:16 AM, John Dallman wrote:
In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com
(Michael S) wrote:
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
I don't know much about hardware design. What is realistic for hardware
binary128?
Likely estimate for FPGA:
-a Around 28 DSP48's for a "triangular" multiplier;
-a-a-a Would need to add several clock cycles for the adder tree;
-a-a-a ...
-a FADD/FSUB unit, also around 12 cycles,
-a-a-a as most intermediate steps now take 2 clock cycles;
Estimate:
Probably around 5k LUTs for the FMUL, with a latency of around 10 or 12 clock cycles.
Probably around 12k LUTs for FADD/FSUB unit;
Will need a few more kLUT for the glue logic.
So, will put the cost at:
  18-20 kLUT likely;
  ~28 DSP48s;
  Around 12 cycles of latency.
What about an FMA based implementation:
  Probably 49 DSP48's and around 24 cycles of latency.
    Where, 49 is needed for full-width multiplier results.
    Also add a big bump to the LUT cost vs separate units.
  An FMA unit roughly has the latency cost of both the FADD and FMUL.
But, some people really like the ability to quickly have single-rounded results.
My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
{a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1ULP when the multiplier tree is
59×59-bits {a bit more than ½ of them get 1ULP at 58×58}. I gave a
lot of thought to this {~1 year} before deciding that a "Do
everything else" function unit was "overall" better than a couple
of "near miss" FUs. So, IMUL comes in at 4 cycles, FMUL at 4
cycles, FDIV at 17, SQRT at 22, and IDIV at 25. The fast IMUL makes
up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
Scott Lurndal <scott@slp53.sl.home> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 07 Jan 2026 15:24:38 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but leaves to
software the last step of unpacking 4-bit BCD to 8-bit ASCII.
That's a fairly simple operation, just adding the proper zone
digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.
The hard part is having bits in right places within wide word.
I.e. you have 64 bits like that:
'0123456789abcdef'. You want to convert it to pair of 64-bit numbers:
'0001020304050607', '08090a0b0c0d0e0f'
The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
a move from the BCD '0123456789abcdef' to the corresponding ASCII
bytes.
[*] Printable version of the BCD input number.
The B3500 addressed to the digit, so it was a simple move to add
the zone digit when converting to ASCII (or EBCDIC depending on
a processor flag). Although 'undigits' (bit patterns 0b1010
through 0b1111) were not legal in BCD numbers on the B3500
and adding a zone digit to them didn't make them printable.
e.g.
INPUT DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
OUTPUT DATA 8 UA             // Unsigned Alphanumeric 8 byte field
MVN INPUT(UN), OUTPUT(UA)    // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)
If the output field was larger than the input field, leading
blanks would be added before the number when using MVN. MVA
would blank pad the output field after the number when the
output field was larger.
Without HW help it is not fast. Likely not faster than running
respective DPD declets through look-up table where you, at least, get 3
ASCII digits per look-up. On modern wide core, likely only marginally
faster than converting a BID mantissa.
The B3500 (1965) did that in hardware;
I would find it strange if the Power CPU didn't.
Or, may be, they have instruction for that as well, but it's not in
DFP related part of the book, so I missed it.
It was just a flavor of the move instruction in the B3500.
I am not sure that we are talking about the same thing.
Probably not, since the ASCII zero character is encoded as 0x30
instead of the 0x00 you show in the example above.
IIUC Michael was asking for the following transformation on
strings of hex digits:
0123456789abcdef
into
000102030405060708090a0b0c0d0e0f
given (fast) such a transformation it is very easy to add proper
zone bits on modern hardware. One possible approach to the transform
above would be to do a byte-type unpacking operation (that is, a
version of the above working on bytes) and then use masking and
shifting to move the upper bits of each byte to the right place.
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
In BID we would do division by 10 with a reciprocal multiplication that
handles powers of 5, this way we (i.e Michael S) can fit 26/27 digits
into a 64-bit int.
2^64 < 10^20
How do you put 26/27 decimal digits in 64 bits. Or do you mean 27
quinary digits? What would that be good for?
it takes 3.32 binary digits to encode 10, thus there are only 19.25
decimal digits in 64-bits.
BGB <cr88192@gmail.com> posted:
Sounds similar to the weekend I spent writing a fp128 (using 1:31:96
for speed/ease of implementation on a Pentium) library just to be able to
On 1/7/2026 11:56 AM, Terje Mathisen wrote:
Michael S wrote:
On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
jgd@cix.co.uk (John Dallman) wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious
for them in DFP.
The thing I'd like to try out is fast quad-precision BFP. For the
field I work in, that would make some things much simpler. I did try
to interest AMD in the idea in the early days of x86-64, but they
didn't bite.
John
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because, with non-answers or answers like that,
nobody thinks that you are serious?
Anecdote.
Few months ago I tried to design very long decimation filters with stop
band attenuation of ~160 dB.
Matlab's implementation of Parks-McClellan algorithm (a customized
variation of Remez Exchange spiced with a small portion of black magic)
was not up to the task. Gnu Octave implementation was somewhat worse
yet.
When I started to investigate the reasons I found out that there were
actually two of them, both related to insufficient precision of the
series of DP FP calculations.
The first was Discrete Cosine Transform and underlying FFT engine for N
around 32K.
The second was solving system of linear equations for N around 1000 or
a little more.
In both cases precision of DP FP was perfectly sufficient both for
inputs and for outputs. But errors accumulated in intermediate
calculations caused troubles.
In both cases quad-precision FP was key to solution.
For DCT (FFT) I went for full re-implementation at higher precision.
For Linear Solver, I left LU decomposition, which happens to be the
heavy O(N**3) part, in DP. Quad-precision was applied only during final
solver stages - forward propagation, back propagation, calculation of
residual error vector and repetition of forward and back propagation.
All those parts are O(N**2). That modification was sufficient to
improve precision of result almost to the best possible in DP FP format.
And sufficient for good convergence of Parks-McClellan algorithm.
So, what is the point of my anecdote?
The speed of quad-precision FP was never an obstacle, even when running
on rather old hardware. And it's not like calculations here were not
heavy. They were heavy all right, thank you very much.
Using quad-precision only when necessary helped.
But what helped more is not being hesitant. Doing things instead of
worrying that they would be too slow.
I think the main issue is similar to what we had before 754, i.e. every
fp programmer needed to also be a fp analyst, capable of carrying out
error budget calculation across their algorithms.
You can obviously do that, and so can a number of regulars here, but in
the real world we are in a _very_ small minority.
For the rest, just having fp128 fast enough that it could be
applied naively would solve a number of problems.
As I see it, FP128 is fast enough for practical use even with a
software-only implementation (though, in part due to its relatively low
usage frequency; if it is used, it is mostly for cases that actually
need precision, rather than high throughput, with high-throughput cases
likely to remain dominated by smaller types, like Binary32 and Binary16;
with Binary64 more remaining as the "de-facto default" precision for
floating-point).
{To date::}
My only use for 128-bit FP was to compute Chebyshev Coefficients for
my high speed DP Transcendentals. I only needed 64-bits of fractions
but, in practice, 80-bit FP was only giving me 63-bits of precision.
Since these are a) compute once b) use infinitely many times; the
speed of 128-bit FP is completely irrelevant.
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
{a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
Waldek Hebisch wrote:
Scott Lurndal <scott@slp53.sl.home> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 07 Jan 2026 15:24:38 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but
leaves to software the last step of unpacking 4-bit BCD to
8-bit ASCII.
That's a fairly simple operation, just adding the proper zone
digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.
The hard part is having bits in right places within wide word.
I.e. you have 64 bits like that:
'0123456789abcdef'. You want to convert it to pair of 64-bit
numbers: '0001020304050607', '08090a0b0c0d0e0f'
The ASCII[*] would be '3031323334353637', '3839414243444546'.
That's a move from the BCD '0123456789abcdef' to the corresponding
ASCII bytes.
[*] Printable version of the BCD input number.
The B3500 addressed to the digit, so it was a simple move to add
the zone digit when converting to ASCII (or EBCDIC depending on
a processor flag). Although 'undigits' (bit patterns 0b1010
through 0b1111) were not legal in BCD numbers on the B3500
and adding a zone digit to them didn't make them printable.
e.g.
INPUT DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
OUTPUT DATA 8 UA             // Unsigned Alphanumeric 8 byte field
MVN INPUT(UN), OUTPUT(UA)    // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)
If the output field was larger than the input field, leading
blanks would be added before the number when using MVN. MVA
would blank pad the output field after the number when the
output field was larger.
Without HW help it is not fast. Likely not faster than running
respective DPD declets through look-up table where you, at least,
get 3 ASCII digits per look-up. On modern wide core, likely only
marginally faster than converting a BID mantissa.
The B3500 (1965) did that in hardware;
I would find it strange if the Power CPU didn't.
Or, may be, they have instruction for that as well, but it's
not in DFP related part of the book, so I missed it.
It was just a flavor of the move instruction in the B3500.
I am not sure that we are talking about the same thing.
Probably not, since the ASCII zero character is encoded as 0x30
instead of the 0x00 you show in the example above.
IIUC Michael was asking for the following transformation on
strings of hex digits:
0123456789abcdef
into
000102030405060708090a0b0c0d0e0f
given (fast) such a transformation it is very easy to add proper
zone bits on modern hardware. One possible approach to the transform
above would be to do a byte-type unpacking operation (that is, a
version of the above working on bytes) and then use masking and
shifting to move the upper bits of each byte to the right place.
Intel (and then AMD) added PDEP (and the corresponding PEXT) opcode
sometime within the last 20 years (probably less than 10?), it is
perfect for this operation:
;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
mov rbx, 0x0f0f0f0f0f0f0f0f
pdep rax, rsi, rbx    ; low 8 nybbles into the low nibble of each byte
shr rsi, 32
pdep rdx, rsi, rbx    ; high 8 nybbles likewise
This is sub-5 cycles of latency.
It is also doable with much older CPUs using the permute/byte shuffle operation, with a bit more or less latency depending upon where the
source and destination data resides (SIMD vs regular integer reg).
Terje
On Thu, 08 Jan 2026 02:38:57 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1ULP when the multiplier tree is
59×59-bits {a bit more than ½ of them get 1ULP at 58×58}. I gave a
lot of thought to this {~1 year} before deciding that a "Do
everything else" function unit was "overall" better than a couple
of "near miss" FUs. So, IMUL comes in at 4 cycles, FMUL at 4
cycles, FDIV at 17, SQRT at 22, and IDIV at 25. The fast IMUL makes
up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
Don't you mean '0.5002 ULP' ?
MitchAlsup wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
In BID we would do division by 10 with a reciprocal multiplication that
handles powers of 5, this way we (i.e. Michael S) can fit 26/27 digits
into a 64-bit int.
2^64 < 10^20
How do you put 26/27 decimal digits in 64 bits. Or do you mean 27
quinary digits? What would that be good for?
it takes 3.32 binary digits to encode 10, thus there are only 19.25
decimal digits in 64-bits.
Michael's idea was to split the division by a power of ten into two
parts: A division by a power of 5 and a bitshift for the 2^N.
If we start with the bitshift (but remember the bits shifted out from
the bottom), then 5^26 fits into 2^64.
Does that make sense?
Terje
MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
{a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
get the rounding correct?
Terje
On Thu, 8 Jan 2026 12:50:32 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Waldek Hebisch wrote:
Scott Lurndal <scott@slp53.sl.home> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 07 Jan 2026 15:24:38 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but
leaves to software the last step of unpacking 4-bit BCD to
8-bit ASCII.
That's a fairly simple operation, just adding the proper zone
digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.
The hard part is having bits in right places within wide word.
I.e. you have 64 bits like that:
'0123456789abcdef'. You want to convert it to pair of 64-bit
numbers: '0001020304050607', '08090a0b0c0d0e0f'
The ASCII[*] would be '3031323334353637', '3839414243444546'.
That's a move from the BCD '0123456789abcdef' to the corresponding
ASCII bytes.
[*] Printable version of the BCD input number.
The B3500 addressed to the digit, so it was a simple move to add
the zone digit when converting to ASCII (or EBCDIC depending on
a processor flag). Although 'undigits' (bit patterns 0b1010
through 0b1111) were not legal in BCD numbers on the B3500
and adding a zone digit to them didn't make them printable.
e.g.
INPUT DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
OUTPUT DATA 8 UA             // Unsigned Alphanumeric 8 byte field
MVN INPUT(UN), OUTPUT(UA)    // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)
If the output field was larger than the input field, leading
blanks would be added before the number when using MVN. MVA
would blank pad the output field after the number when the
output field was larger.
Without HW help it is not fast. Likely not faster than running
respective DPD declets through look-up table where you, at least,
get 3 ASCII digits per look-up. On modern wide core, likely only
marginally faster than converting a BID mantissa.
The B3500 (1965) did that in hardware;
I would find it strange if the Power CPU didn't.
Or, may be, they have instruction for that as well, but it's
not in DFP related part of the book, so I missed it.
It was just a flavor of the move instruction in the B3500.
I am not sure that we are talking about the same thing.
Probably not, since the ASCII zero character is encoded as 0x30
instead of the 0x00 you show in the example above.
IIUC Michael was asking for the following transformation on
strings of hex digits:
0123456789abcdef
into
000102030405060708090a0b0c0d0e0f
given (fast) such a transformation it is very easy to add proper
zone bits on modern hardware. One possible approach to the transform
above would be to do a byte-type unpacking operation (that is, a
version of the above working on bytes) and then use masking and
shifting to move the upper bits of each byte to the right place.
Intel (and then AMD) added PDEP (and the corresponding PEXT) opcode
sometime within the last 20 years (probably less than 10?), it is
perfect for this operation:
Haswell. Officially launched on June 4, 2013. So 12.5 years ago.
Time runs.
;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
mov rbx, 0x0f0f0f0f0f0f0f0f
pdep rax, rsi, rbx    ; low 8 nybbles into the low nibble of each byte
shr rsi, 32
pdep rdx, rsi, rbx    ; high 8 nybbles likewise
This is sub-5 cycles of latency.
That's nice.
I'm not sure if POWER has similar instruction.
It is also doable with much older CPUs using the permute/byte shuffle
operation, with a bit more or less latency depending upon where the
source and destination data resides (SIMD vs regular integer reg).
Terje
I don't understand that part. Do you suggest that there is a better
swizzle instruction than unpack, mentioned by Waldek Hebisch?
So far, I don't see one. Unpack looks to me the most suitable.
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
{a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
get the rounding correct?
64-53 = 11 yes
But a single incorrect rounding is 1 ULP all by itself. :-)
Michael S wrote:
On Thu, 8 Jan 2026 12:50:32 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Waldek Hebisch wrote:
Scott Lurndal <scott@slp53.sl.home> wrote:
Michael S <already5chosen@yahoo.com> writes:
On Wed, 07 Jan 2026 15:24:38 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Tue, 06 Jan 2026 19:29:40 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
So, POWER hardware helps a lot converting DPD to BCD, but
leaves to software the last step of unpacking 4-bit BCD to
8-bit ASCII.
That's a fairly simple operation, just adding the proper zone
digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.
The hard part is having bits in right places within wide word.
I.e. you have 64 bits like that:
'0123456789abcdef'. You want to convert it to pair of 64-bit
numbers: '0001020304050607', '08090a0b0c0d0e0f'
The ASCII[*] would be '3031323334353637', '3839414243444546'.
That's a move from the BCD '0123456789abcdef' to the
corresponding ASCII bytes.
[*] Printable version of the BCD input number.
The B3500 addressed to the digit, so it was a simple move to add
the zone digit when converting to ASCII (or EBCDIC depending on
a processor flag). Although 'undigits' (bit patterns 0b1010
through 0b1111) were not legal in BCD numbers on the B3500
and adding a zone digit to them didn't make them printable.
e.g.
INPUT DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
OUTPUT DATA 8 UA             // Unsigned Alphanumeric 8 byte field
MVN INPUT(UN), OUTPUT(UA)    // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)
If the output field was larger than the input field, leading
blanks would be added before the number when using MVN. MVA
would blank pad the output field after the number when the
output field was larger.
Without HW help it is not fast. Likely not faster than running
respective DPD declets through look-up table where you, at
least, get 3 ASCII digits per look-up. On modern wide core,
likely only marginally faster than converting a BID mantissa.
The B3500 (1965) did that in hardware;
I would find it strange if the Power CPU didn't.
Or, may be, they have instruction for that as well, but it's
not in DFP related part of the book, so I missed it.
It was just a flavor of the move instruction in the B3500.
I am not sure that we are talking about the same thing.
Probably not, since the ASCII zero character is encoded as 0x30
instead of the 0x00 you show in the example above.
IIUC Michael was asking for the following transformation on
strings of hex digits:
0123456789abcdef
into
000102030405060708090a0b0c0d0e0f
given (fast) such a transformation it is very easy to add proper
zone bits on modern hardware. One possible approach to the transform
above would be to do a byte-type unpacking operation (that is, a
version of the above working on bytes) and then use masking and
shifting to move the upper bits of each byte to the right place.
Intel (and then AMD) added PDEP (and the corresponding PEXT) opcode
sometime within the last 20 years (probably less than 10?), it is
perfect for this operation:
Haswell. Officially launched on June 4, 2013. So 12.5 years ago.
Time runs.
;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
mov rbx, 0x0f0f0f0f0f0f0f0f
pdep rax, rsi, rbx    ; low 8 nybbles into the low nibble of each byte
shr rsi, 32
pdep rdx, rsi, rbx    ; high 8 nybbles likewise
This is sub-5 cycles of latency.
That's nice.
I'm not sure if POWER has similar instruction.
It is also doable with much older CPUs using the permute/byte
shuffle operation, with a bit more or less latency depending upon
where the source and destination data resides (SIMD vs regular
integer reg).
Terje
I don't understand that part. Do you suggest that there is a better
swizzle instruction than unpack, mentioned by Waldek Hebisch?
So far, I don't see one. Unpack looks to me the most suitable.
There are at least three ways to do it:
a) PDEP in 64-bit regs
b) PSHUFB and nybble masks using SSE/AVX regs
c) PUNPCKLBW which expands bytes to words. Do it twice with a
bytewise SHR 4 to select the upper nybbles and a mask to keep the
lower nybbles of the first part.
Did you intend to use (c) or is there yet another method?
Terje
MitchAlsup wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
{a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
You obviously intended 0.5002 ulp, so you have about 11-13 guard bits
to get the rounding correct?
64-53 = 11 yes
For many of the functions you can do a lot by letting the final
operation be a merging of the first/largest term, particularly if you do that with extended precision.
I.e something like fpatan2() works quite nicely this way, just not
enough for exact rounding.
You need to combine this with extended precision range adjustment at the end.
But a single incorrect rounding is 1 ULP all by itself.
:-)
Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions which
are exactly rounded.
It is provably doable for float, in very close to the same cycle count
as the best libraries in current use, double is "somewhat" harder to
totally verify/prove.
Terje
I use a computer algebra system every day. One possibility here
is to use arbitrary precision fractions. Observations:
- once there is more complex computation numerators and denominators
tend to be pretty big
- a lot of time is spent computing gcd, but if you try to save on
gcd-s and work with unsimplified fractions you may get tremendous
blowup (like million digit numbers in what should be reasonably
small computation).
- in general, if one needs numeric approximation, then arbitrary
precision (software) floating point is much faster
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
But a single incorrect rounding is 1 ULP all by itself.
:-)
Yeah, there are strong forces who want to have, at least as a
suggested/recommended option, a set of transcendental functions which
are exactly rounded.
After the final addition, I know 1 of the top 3-bits is a 1, and I
have 69-bits in the accumulated result. I also know that the polynomial
error is below the 3rd least significant bit.
It is provably doable for float, in very close to the same cycle count
as the best libraries in current use, double is "somewhat" harder to
totally verify/prove.
I have logic (patented) that allows the FU to raise an UNCERTAIN
rounding exception, so SW can take over and change 0.5002 into
0.5000 at the cost of the exception and running the long winded
SW correctly rounded subroutine. I expect this to be used only
during verification and on the 3 machines owned by Kahan, Coonen,
and someone else I forgot.
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work acceptably with it. Lots of organisations don't see anything obvious for
them in DFP.
John Dallman <jgd@cix.co.uk> wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work acceptably with it. Lots of organisations don't see anything obvious for them in DFP.
AFAICS DFP exists as a standard only because IBM pushed it. I had
a short e-mail exchange with the main DFP advocate at IBM.
My point
was that a purely software implementation of his decimal benchmark had
perfectly adequate performance. His answer was that he knows this,
but that was hand-written code that normal users would not write.
Compilers for Cobol could generate such code, but apparently no
existing compilers supported it. He claimed that there
must be hardware support to make it widely available. He somewhat
ignored the fact that even with hardware support there still needs
to be support in compilers. And that C++ templates allow
fast fixed point decimals as a library feature. If there is no
such library (and I am not aware of one) it is due to low
demand and not due to difficulty.
Anyway, he and IBM succeeded in pushing the DFP standard through, but
adoption is as I and other folks predicted.
On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
jgd@cix.co.uk (John Dallman) wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious
for them in DFP.
The thing I'd like to try out is fast quad-precision BFP. For the
field I work in, that would make some things much simpler. I did try
to interest AMD in the idea in the early days of x86-64, but they
didn't bite.
John
I already asked you a couple of years ago how fast you want binary128
to be in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because, with non-answers or answers like that,
nobody thinks that you are serious?
Anecdote.
Few months ago I tried to design very long decimation filters with stop
band attenuation of ~160 dB.
Matlab's implementation of Parks-McClellan algorithm (a customized
variation of Remez Exchange spiced with a small portion of black magic)
was not up to the task. Gnu Octave implementation was somewhat worse
yet.
When I started to investigate the reasons I found out that there were actually two of them, both related to insufficient precision of the
series of DP FP calculations.
The first was Discrete Cosine Transform and underlying FFT engine for N around 32K.
The second was solving system of linear equations for N around 1000 or
a little more.
In both cases precision of DP FP was perfectly sufficient both for
inputs and for outputs. But errors accumulated in intermediate calculations caused troubles.
In both cases quad-precision FP was key to solution.
For DCT (FFT) I went for full re-implementation at higher precision.
For Linear Solver, I left LU decomposition, which happens to be the
heavy O(N**3) part, in DP. Quad-precision was applied only during final
solver stages - forward propagation, back propagation, calculation of residual error vector and repetition of forward and back propagation.
All those parts are O(N**2). That modification was sufficient to
improve precision of result almost to the best possible in DP FP format.
And sufficient for good convergence of Parks-McClellan algorithm.
So, what is the point of my anecdote?
The speed of quad-precision FP was never an obstacle, even when running
on rather old hardware. And it's not like calculations here were not
heavy. They were heavy all right, thank you very much.
Using quad-precision only when necessary helped.
But what helped more is not being hesitant. Doing things instead of
worrying that they would be too slow.
Michael S <already5chosen@yahoo.com> posted:
On Wed, 07 Jan 2026 20:05:17 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
John Dallman wrote:
In article <20260107133424.00000e99@yahoo.com>,
already5chosen@yahoo.com (Michael S) wrote:
I already asked you couple of years ago how fast do you want
binary128 in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like
that nobody thinks that you are serious?
I don't know much about hardware design. What is realistic for
hardware binary128?
Sub-10 cycles fmul/fadd/fsub seems very doable?
Mitch?
Assuming 128-bit operands are delivered in 1 cycle and 128-bit
results are delivered in 1 cycle:
If we are talking about SIMD of the same width (measured in bits) as
SP/DP SIMD on the given general-purpose machine, i.e. on modern Intel
and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
that fully pipelined binary128 operations are a non-starter, because they
would blow your power and thermal budget.
I agree, however a single 128-bit FPU would fit inside a reasonable
power budget.
One full-width result (i.e. 8
binary128 results) every 2 cycles sounds somewhat more realistic.
Likely still over a reasonable power budget.
After all, in a general-purpose CPU binary128, if implemented at all, is
a proverbial tail that can't be allowed to wag the dog.
We build (and call) our current machines 64-bit because that is the
size of the register files (not including SIMD/Vector) and because
we can run the scalar unit at rated clock frequency (non-SIMD/Vector)
essentially continuously.
Once we step over the scalar width, power goes up 2×-4× and we get a
couple of hundred cycles before frequency throttling. Thus we cannot,
in general, run SIMD/Vector at rated frequency continuously.
Nor can we, at the present time, build a memory system that can properly
feed a SIMD/Vector RF so that one can use all of the available lanes for
calculations.
{HBM is approaching this point; however, it becomes more like the
B-memory of the CRAY-2 than main memory for applications that can use
that much B-memory effectively.}
OTOH, if we define our binary128 to use only the least-significant 128-bit
lane of our 512-bit register and only build b128 capabilities into one
of our pair of FPUs, then full pipelining (i.e. 1 result per cycle) looks
like a good choice, at least from a power/thermal perspective. That is,
as long as designers find a way to avoid a hot spot.
We could, instead, treat them as pairs of GPRs--like we did in Mc 88120
and still not need SIMD/Vectors.
128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.
128-bit Fmul requires that the multiplier tree be 64×64 instead of
53×53 (1.46× bigger tree, 1.22× bigger FU), and would/should be 3-4
cycles longer than 64-bit Fmul. If you wanted to be "really clever"
you could use a 59×59 tree and the FU is only 1.12× bigger; but here
you could not use the tree for Integer MUL.
Terje
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1 ULP when the multiplier tree is 59×59 bits
{a bit more than ½ of them get 1 ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
get the rounding correct?
64-53 = 11 yes
But a single incorrect rounding is 1 ULP all by itself.
antispam@fricas.org (Waldek Hebisch) posted:
John Dallman <jgd@cix.co.uk> wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious for
them in DFP.
AFAICS DFP exists as a standard only because IBM pushed it. I had
a short e-mail exchange with main DFP advocate at IBM.
Mike Cow<something>shaw ??
My point
was that a purely software implementation of his decimal benchmark had
perfectly adequate performance. His answer was that he knows this,
but that was hand-written code that normal users would not write.
Compilers for Cobol could generate such code, but apparently no
existing compilers supported it. He claimed that there
must be hardware support to make it widely available. He somewhat
ignored the fact that even with hardware support there still needs
to be support in compilers. And that C++ templates allow
fast fixed-point decimals as a library feature. If there is no
such library (and I am not aware of one) it is due to low
demand and not due to difficulty.
Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
is as I and other folks predicted.
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1 ULP when the multiplier tree is 59×59 bits
{a bit more than ½ of them get 1 ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
get the rounding correct?
64-53 = 11 yes
But a single incorrect rounding is 1 ULP all by itself.
It is clear that when your rounding is different from "IEEE correct"
rounding, then there is 1 ULP difference between your result and
IEEE rounding. But claims like max 0.5002 ulp mean that there
is at most 0.5002 ulp difference between the true result and the result
delivered by your FPU. Or do you really mean that there may be
1 ULP difference between the true result and your FPU?
Waldek Hebisch <antispam@fricas.org> schrieb:
I use every day a computer algebra system. One possibility here
is to use arbitrary precision fractions. Observations:
- once there is a more complex computation, numerators and denominators
tend to be pretty big
- a lot of time is spent computing gcds, but if you try to save on
gcds and work with unsimplified fractions you may get tremendous
blowup (like million-digit numbers in what should be a reasonably
small computation).
- in general, if one needs numeric approximation, then arbitrary
precision (software) floating point is much faster
It is also not possible to express irrational numbers,
transcendental functions etc. When I use a computer algebra
system, I tend to use such functions, so solutions are usually
not rational numbers.
Michael S <already5chosen@yahoo.com> wrote:
That sounds doable, from power and thermal perspective, but does not
On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
jgd@cix.co.uk (John Dallman) wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of
the gcc support shows that "build it and they will come" does
not work out for DFP.
The world has got very used to IEEE BFP, and has solutions that
work acceptably with it. Lots of organisations don't see anything
obvious for them in DFP.
The thing I'd like to try out is fast quad-precision BFP. For the
field I work in, that would make some things much simpler. I did
try to interest AMD in the idea in the early days of x86-64, but
they didn't bite.
John
I already asked you couple of years ago how fast do you want binary128
in order to consider it fast enough.
IIRC, you either avoided the answer completely or gave a totally
unrealistic answer like "the same as binary64".
Maybe nobody bites because with non-answers or answers like that
nobody thinks that you are serious?
I would hope for half throughput and say 1-2 clocks more latency
for addition.
For multiplication I would expect 1/4 throughput
and maybe twice the latency of binary64.
As of today, there is double-double. IIRC double-double addition
needs 6 double additions, which is way too much. AFAICS
quantifying double-double multiplication performance is more
tricky: there is a relatively easy implementation using
64-bit multiply-add (it takes advantage of the fact that multiply-add
can deliver low-order bits that only contribute to rounding in
normal FP multiply), but this implements normal multiply in
terms of multiply-add. Implementing multiply-add takes
more effort, and implementing multiply using only multiply
takes even more effort.
Anyway, to make sense hardware should be faster than double-double.
My estimate was 7.5 bits.
Anecdote.
A few months ago I tried to design very long decimation filters with
stop band attenuation of ~160 dB.
Matlab's implementation of the Parks–McClellan algorithm (a customized
variation of Remez Exchange spiced with a small portion of black
magic) was not up to the task. GNU Octave's implementation was
somewhat worse yet.
When I started to investigate the reasons I found out that there
were actually two of them, both related to insufficient precision
of the series of DP FP calculations.
The first was the Discrete Cosine Transform and the underlying FFT engine
for N around 32K.
The second was solving a system of linear equations for N around 1000
and a little more.
In both cases the precision of DP FP was perfectly sufficient both for
inputs and for outputs. But errors accumulated in intermediate
steps caused trouble.
In both cases quad-precision FP was key to the solution.
For DCT (FFT) I went for a full re-implementation at higher
precision.
Hmm, I did estimates for FFT and my result was that in the classic
implementation each layer of butterflies essentially additively
contributes to the L^2 error. So a 32K point radix-2 FFT has 15 times
bigger L^2 error than a single layer of butterflies, which has error
about 4 times machine epsilon. With radix-4 FFT the error of a single
butterfly is larger, but the number of layers is halved and the result
is similar. So, in terms of L^2 error a 32K point FFT needs very
little extra precision, essentially 6 bits.
But Remez works
in terms of the supremum norm, and at 32K points that may need an extra
8 bits. So it is possible that the 80-bit format would have enough
accuracy for your purpose.
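The growth estimate above is easy to probe numerically. This hypothetical snippet compares a single-precision FFT against a double-precision reference (standing in for working vs. higher precision), assuming a NumPy recent enough to run float32 transforms in single precision:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (1 << 10, 1 << 15):                    # 1K and 32K points
    # Make the input exactly representable in float32 so both
    # transforms see identical data.
    x = rng.standard_normal(n).astype(np.float32).astype(np.float64)
    ref = np.fft.fft(x)                         # double precision
    lo = np.fft.fft(x.astype(np.float32)).astype(np.complex128)
    l2 = np.linalg.norm(lo - ref) / np.linalg.norm(ref)
    sup = np.max(np.abs(lo - ref)) / np.max(np.abs(ref))
    print(f"N={n:6d}  rel L2 err={l2:.2e}  rel sup err={sup:.2e}")
```

The L2 column grows slowly with N, in line with the per-layer argument; the sup-norm column is the one that matters for Remez.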
That is part of the black magic that I mentioned above. Parks–McClellan
I looked at FFT as one of the possible ways to implement convolution
of integer sequences with exact result. Alas, double precision
computation is good only for about 20 bits for relatively short
sequences and less for longer ones. It seems that integer-only
computation is much faster. Fast 128-bit floating point would
shift the balance towards floating point, but probably not enough
to beat integer computation.
For the Linear Solver, I left the LU decomposition, which happens to be
the heavy O(N**3) part, in DP. Quad-precision was applied only during the
final solver stages: forward propagation, back propagation,
calculation of the residual error vector, and repetition of forward and
back propagation. All those parts are O(N**2). That modification
was sufficient to improve the precision of the result almost to the best
possible in DP FP format, and sufficient for good convergence of the
Parks–McClellan algorithm.
Yes, as long as your system is reasonably well conditioned it is
easy to improve accuracy in a postprocessing step. OTOH system
may be so badly conditioned that solving in double precision leads
to catastrophic errors while solving in higher precision works
fine.
So, what is the point of my anecdote?
The speed of quad-precision FP was never an obstacle, even when
running on rather old hardware. And it's not like the calculations here
were not heavy. They were heavy all right, thank you very much.
Using quad-precision only when necessary helped.
But what helped more is not being hesitant. Doing things instead of
worrying that they would be too slow.
My point is that John should measure first. Only after measurements he
Well, I have an arbitrary precision implementation of the LLL algorithm.
It works, but it is about 100 times slower than using double precision
math.
The trouble is, in the worst case double precision LLL in
dimension 53 may fail to converge. On tame data LLL is expected
to work in higher dimensions, but even on tame data at dimension
about 250 double precision LLL is expected to fail. In a sense
this is a no-win situation, as the needed number of bits grows linearly
with dimension (both worst case and tame case). One can try to
use double precision when it works. But it is frustrating
how much effort one needs to spend to get better speed using
the FPU. And especially, there is a contrast with integer math,
where it is relatively easy to get higher precision when
needed. But for integer math RISC-V tries to change this by not
providing a carry bit. And, AFAICS SSE/AVX do not provide the high
order bits of multiplication (no vectored MULHI instruction), so
multiprecision multiplies must go through the scalar multiplier.
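The carry-bit point can be illustrated with the instruction pattern a compiler must emit on such ISAs: recover the carry out of each limb with an unsigned compare instead of reading a flag. A toy sketch (Python ints masked to 64 bits model the registers):

```python
# Multiword addition without a carry flag, RISC-V style: carry out of
# each 64-bit limb is recovered by unsigned comparisons.
MASK = (1 << 64) - 1

def addc64(a: int, b: int, cin: int):
    """Add two 64-bit limbs plus carry-in; return (sum, carry_out)."""
    t = (a + b) & MASK
    c1 = 1 if t < a else 0        # carry out of a + b
    s = (t + cin) & MASK
    c2 = 1 if s < t else 0        # carry out of adding cin
    return s, c1 | c2

def add_multiword(x, y):
    """Add two equal-length little-endian lists of 64-bit limbs."""
    out, carry = [], 0
    for a, b in zip(x, y):
        s, carry = addc64(a, b, carry)
        out.append(s)
    out.append(carry)
    return out
```

Each limb costs an extra compare-and-or pair versus an add-with-carry instruction, which is the overhead being complained about.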
MitchAlsup wrote:
But a single incorrect rounding is 1 ULP all by itself.
:-)
Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions which
are exactly rounded.
It is provably doable for float, in very close to the same cycle
count as the best libraries in current use, double is "somewhat"
harder to totally verify/prove.
Terje
John Dallman <jgd@cix.co.uk> wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious for
them in DFP.
AFAICS DFP exists as a standard only because IBM pushed it.
I had
a short e-mail exchange with the main DFP advocate at IBM. My point
was that a purely software implementation of his decimal benchmark had
perfectly adequate performance. His answer was that he knows this,
but that was hand-written code that normal users would not write.
Compilers for Cobol could generate such code, but apparently no
existing compilers supported it. He claimed that there
must be hardware support to make it widely available. He somewhat
ignored the fact that even with hardware support there still needs
to be support in compilers. And that C++ templates allow
fast fixed-point decimals as a library feature. If there is no
such library (and I am not aware of one) it is due to low
demand and not due to difficulty.
Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
is as I and other folks predicted.
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1 ULP when the multiplier tree is 59×59 bits
{a bit more than ½ of them get 1 ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
get the rounding correct?
64-53 = 11 yes
But a single incorrect rounding is 1 ULP all by itself.
It is clear that when your rounding is different from "IEEE correct"
rounding, then there is 1 ULP difference between your result and
IEEE rounding. But claims like max 0.5002 ulp mean that there
is at most 0.5002 ulp difference between the true result and the result
delivered by your FPU. Or do you really mean that there may be
1 ULP difference between the true result and your FPU?
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
antispam@fricas.org (Waldek Hebisch) posted:
John Dallman <jgd@cix.co.uk> wrote:
In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Possibly. But the lack of takeup of the Intel library and of the
gcc support shows that "build it and they will come" does not work
out for DFP.
The world has got very used to IEEE BFP, and has solutions that work
acceptably with it. Lots of organisations don't see anything obvious for >> > them in DFP.
AFAICS DFP exists as a standard only because IBM pushed it. I had
a short e-mail exchange with main DFP advocate at IBM.
Mike Cow<something>shaw ??
Mike Cowlishaw, yes.
My point
was that a purely software implementation of his decimal benchmark had
perfectly adequate performance. His answer was that he knows this,
but that was hand-written code that normal users would not write.
Compilers for Cobol could generate such code, but apparently no
existing compilers supported it. He claimed that there
must be hardware support to make it widely available. He somewhat
ignored the fact that even with hardware support there still needs
to be support in compilers. And that C++ templates allow
fast fixed-point decimals as a library feature. If there is no
such library (and I am not aware of one) it is due to low
demand and not due to difficulty.
Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
is as I and other folks predicted.
Thomas Koenig <tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
I use every day a computer algebra system. One possibility here
is to use arbitrary precision fractions. Observations:
- once there is a more complex computation, numerators and denominators
tend to be pretty big
- a lot of time is spent computing gcds, but if you try to save on
gcds and work with unsimplified fractions you may get tremendous
blowup (like million-digit numbers in what should be a reasonably
small computation).
- in general, if one needs numeric approximation, then arbitrary
precision (software) floating point is much faster
It is also not possible to express irrational numbers,
transcendental functions etc. When I use a computer algebra
system, I tend to use such functions, so solutions are usually
not rational numbers.
Support for algebraic irrationalities is by now a standard
feature in computer algebra. Dealing with transcendental
elementary functions too. Support for special functions
is weaker, but that is possible too. Deciding if a number
is transcendental or not is theoretically tricky, but for
elementary numbers there is the Schanuel conjecture, which
while unproven tends to work well in practice.
What troubles many folks is the fact that for many practical
problems answers are implicit, so if you want numbers at the
end you need to do numeric computation anyway.
Anyway, my point was that exact computations tend to need
large accuracy at intermediate steps,
so the computational
cost is much higher than numerics (even arbitrary precision
numerics tend to be much faster). As a little example
in a slightly different spirit, one can try to run an approximate
root finding procedure for polynomials in rational
arithmetic. This solves the problem of potential numerical
instability, but leads to large fractions which are
much more costly than arbitrary precision floating point
(which with some care deals with instability too).
On Thu, 8 Jan 2026 21:35:16 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup wrote:
But a single incorrect rounding is 1 ULP all by itself.
:-)
Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions which
are exactly rounded.
I wonder who are those forces and what is the set they push for.
I would guess that they are mostly software and hardware verification
people rather than people that use transcendental functions in
engineering and physical calculations.
The majority of the latter crowd would likely have no objections
against 0.75 ULP RMS as long as the implementation is both fast and
preserves a few invariants, of which I can primarily think of two:
1. Evenness/oddness
If the precise function F(x) is even or odd then the approximate function
f(x) is also even or odd.
2. Weak preservation of sign of delta.
If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
If F(x) > F(x+ULP) then f(x) >= f(x+ULP)
In practice, it's probably unlikely to have these invariants preserved
when RMS error is 0.75 ULP, but for RMS error = 0.60-0.65 ULP I don't
see difficulties.
For many transcendental functions there will be only a few problematic
values of x near the edges of implementation-specific ranges where one
has to be careful.
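Both invariants are cheap to test mechanically. A hypothetical checker, demonstrated on a function we control (x**3, which is odd and weakly monotone under IEEE rounding) rather than asserting anything about a particular libm:

```python
import math

def preserves_oddness(f, samples):
    # Invariant 1: f(-x) must equal -f(x) bit-for-bit.
    return all(f(-x) == -f(x) for x in samples)

def weakly_monotone(f, samples):
    # Invariant 2, on an increasing stretch of the true function F:
    # f(x) <= f(nextafter(x)) must hold at each sample.
    return all(f(x) <= f(math.nextafter(x, math.inf)) for x in samples)

cube = lambda x: x * x * x      # stand-in for a transcendental
xs = [0.1 * k for k in range(1, 50)]
```

For a real math library the same loops would be run over the problematic ranges mentioned above, with x stepped by nextafter.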
It is provably doable for float, in very close to the same cycle
count as the best libraries in current use, double is "somewhat"
harder to totally verify/prove.
Terje
Michael S <already5chosen@yahoo.com> posted:
On Thu, 8 Jan 2026 21:35:16 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup wrote:
But a single incorrect rounding is 1 ULP all by itself.
:-)
Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions
which are exactly rounded.
I wonder who are those forces and what is the set they push for.
The problem, here, is that even when one gets all the rounding
correct, one has still lost various algebraic identities.
CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0
I would guess that they are mostly software and hardware
verification people rather than people that use transcendental
functions in engineering and physical calculations.
Numerical people, almost never engineers, physicists, or chemists.
The majority of the latter crowd would likely have no objections
against 0.75 ULP RMS as long as the implementation is both fast and
preserves a few invariants, of which I can primarily think of two:
1. Evenness/oddness
If the precise function F(x) is even or odd then the approximate function
f(x) is also even or odd.
Odd functions need to be monotonic around zero.
2. Weak preservation of sign of delta.
If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
If F(x) > F(x+ULP) then f(x) >= f(x+ULP)
Small scale Monotonicity.
In practice, it's probably unlikely to have these invariants
preserved when RMS error is 0.75 ULP, but for RMS error = 0.60-0.65
ULP I don't see difficulties.
For many transcendental functions there will be only a few problematic
values of x near the edges of implementation-specific ranges where one
has to be careful.
It is provably doable for float, in very close to the same cycle
count as the best libraries in current use, double is "somewhat"
harder to totally verify/prove.
Terje
existing compilers supported it. He claimed that there
must be hardware support to make it widely available. He somewhat
ignored fact that even with hardware support there still needs
to be support in compilers.
Or perhaps (again, no personal knowledge - just speculation) that
supporting an additional data type in the IBM COBOL (and, for what it's
worth, PL/1) compilers is easier if there was hardware support for it.
And that C++ templates allow
fast fixed point decimals as library feature. If there is no
such library (and I am not aware of one) it is due to low
demand and not due to difficulty.
For the existing Z series base, I suspect anything related to C++ is not
significant, i.e. about as important as DFP is to the typical C++ user. :-)
antispam@fricas.org (Waldek Hebisch) posted:
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Terje Mathisen <terje.mathisen@tmsw.no> posted:
MitchAlsup wrote:
MitchAlsup <user5857@newsgrouper.org.invalid> posted:
-------------------
My Transcendentals get to 1 ULP when the multiplier tree is 59×59 bits
{a bit more than ½ of them get 1 ULP at 58×58}. I gave a lot of thought
to this {~1 year} before deciding that a "Do everything else" function
unit was "overall" better than a couple of "near miss" FUs. So, IMUL
comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.
I forgot to add that my transcendentals went from <just barely>
faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
expanded.
You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
get the rounding correct?
64-53 = 11 yes
But a single incorrect rounding is 1 ULP all by itself.
It is clear that when your rounding is different from "IEEE correct"
rounding, then there is 1 ULP difference between your result and
IEEE rounding. But claims like max 0.5002 ulp mean that there
is at most 0.5002 ulp difference between the true result and the result
delivered by your FPU. Or do you really mean that there may be
1 ULP difference between the true result and your FPU?
It means I make a single IEEE rounding error once every several thousand calculations; AND I can achieve this in all IEEE rounding modes.
On Sun, 11 Jan 2026 18:18:00 GMT
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
Michael S <already5chosen@yahoo.com> posted:
On Thu, 8 Jan 2026 21:35:16 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup wrote:
But a single incorrect rounding is 1 ULP all by itself.
:-)
Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions
which are exactly rounded.
I wonder who are those forces and what is the set they push for.
The problem, here, is that even when one gets all the rounding
correct, one has still lost various algebraic identities.
CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0
I would guess that they are mostly software and hardware
verification people rather than people that use transcendental
functions in engineering and physical calculations.
Numerical people, almost never engineers, physicists, or chemists.
The majority of the latter crowd would likely have no objections
against 0.75 ULP RMS as long as the implementation is both fast and
preserves a few invariants, of which I can primarily think of two:
1. Evenness/oddness
If the precise function F(x) is even or odd then the approximate function
f(x) is also even or odd.
Odd functions need to be monotonic around zero.
2. Weak preservation of sign of delta.
If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
If F(x) > F(x+ULP) then f(x) >= f(x+ULP)
Small scale Monotonicity.
Yes, that's a better name.
I just wanted to express it as simple inequality conditions and made
it too simple and stronger than necessary.
In fact I would not complain if my conditions do not hold when F(x) has an
extremum between x and x+ULP. That is, it's nice if the condition holds
there as well, but it is relatively less important than holding on
monotonic intervals.
In practice, it's probably unlikely to have these invariants
preserved when RMS error is 0.75 ULP, but for RMS error = 0.60-0.65
ULP I don't see difficulties.
For many transcendental functions there will be only a few problematic
values of x near the edges of implementation-specific ranges where one
has to be careful.
It is provably doable for float, in very close to the same cycle
count as the best libraries in current use, double is "somewhat"
harder to totally verify/prove.
Terje
On Thu, 8 Jan 2026 21:35:16 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup wrote:
But a single incorrect rounding is 1 ULP all by itself.
:-)
Yeah, there are strong forces who want to have, at least as a
suggested/recommended option, a set of transcendental functions which
are exactly rounded.
I wonder who are those forces and what is the set they push for.
I would guess that they are mostly software and hardware verification
people rather than people that use transcendental functions in
engineering and physical calculations.
The majority of the latter crowd would likely have no objections
against 0.75 ULP RMS as long as the implementation is both fast and
preserves a few invariants, of which I can primarily think of two:
1. Evenness/oddness
If the precise function F(x) is even or odd then the approximate function
f(x) is also even or odd.
2. Weak preservation of sign of delta.
If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
If F(x) > F(x+ULP) then f(x) >= f(x+ULP)
In practice, it's probably unlikely to have these invariants preserved
when RMS error is 0.75 ULP, but for RMS error = 0.60-0.65 ULP I don't
see difficulties.
For many transcendental functions there will be only a few problematic
values of x near the edges of implementation-specific ranges where one
has to be careful.
antispam@fricas.org (Waldek Hebisch) posted:
Thomas Koenig <tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
I use every day a computer algebra system. One possibility here
is to use arbitrary precision fractions. Observations:
- once there is a more complex computation, numerators and denominators
tend to be pretty big
- a lot of time is spent computing gcds, but if you try to save on
gcds and work with unsimplified fractions you may get tremendous
blowup (like million-digit numbers in what should be a reasonably
small computation).
- in general, if one needs numeric approximation, then arbitrary
precision (software) floating point is much faster
It is also not possible to express irrational numbers,
transcendental functions etc. When I use a computer algebra
system, I tend to use such functions, so solutions are usually
not rational numbers.
Support for algebraic irrationalities is by now a standard
feature in computer algebra. Dealing with transcendental
elementary functions too. Support for special functions
is weaker, but that is possible too. Deciding if a number
is transcendental or not is theoretically tricky, but for
elementary numbers there is the Schanuel conjecture, which
while unproven tends to work well in practice.
What troubles many folks is the fact that for many practical
problems answers are implicit, so if you want numbers at the
end you need to do numeric computation anyway.
Anyway, my point was that exact computations tend to need
large accuracy at intermediate steps,
Between 2×n+6 and 2×n+13, which is 3-9 bits larger than the
next higher precision.
so the computational
cost is much higher than numerics (even arbitrary precision
numerics tend to be much faster). As a little example
in a slightly different spirit, one can try to run an approximate
root finding procedure for polynomials in rational
arithmetic. This solves the problem of potential numerical
instability, but leads to large fractions which are
much more costly than arbitrary precision floating point
(which with some care deals with instability too).
MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
antispam@fricas.org (Waldek Hebisch) posted:
Thomas Koenig <tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
I use every day a computer algebra system. One possibility here
is to use arbitrary precision fractions. Observations:
- once there is a more complex computation, numerators and denominators
tend to be pretty big
- a lot of time is spent computing gcds, but if you try to save on
gcds and work with unsimplified fractions you may get tremendous
blowup (like million-digit numbers in what should be a reasonably
small computation).
- in general, if one needs a numeric approximation, then arbitrary
precision (software) floating point is much faster
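The gcd cost and the blowup are easy to reproduce. A minimal sketch (mine, not from the post) in Python: `fractions.Fraction` runs a gcd after every operation, yet iterating a rational map still grows the denominator exponentially.

```python
from fractions import Fraction

# Iterate the logistic map x -> (7/2) * x * (1 - x) in exact rationals.
# Starting from 1/3, the denominator after n steps is exactly 3**(2**n),
# so its size doubles every iteration even though Fraction reduces
# through a gcd on every operation -- the blowup described above.
x = Fraction(1, 3)
r = Fraction(7, 2)
for step in range(1, 16):
    x = r * x * (1 - x)
    print(step, x.denominator.bit_length())
```

By step 15 the denominator is 3**(2**15), roughly 52,000 bits, while a float iteration of the same map costs a constant 8 bytes per step.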
It is also not possible to express irrational numbers,
transcendental functions etc. When I use a computer algebra
system, I tend to use such functions, so solutions are usually
not rational numbers.
Support for algebraic irrationalities is by now a standard
feature in computer algebra. Dealing with transcendental
elementary functions is too. Support for special functions
is weaker, but that is possible too. Deciding if a number
is transcendental or not is theoretically tricky, but for
elementary numbers there is the Schanuel conjecture, which,
while unproven, tends to work well in practice.
What troubles many folks is the fact that for many practical
problems answers are implicit, so if you want numbers at the
end you need to do numeric computation anyway.
Anyway, my point was that exact computations tend to need
large accuracy at intermediate steps,
Between 2×n+6 and 2×n+13, which is 3-9 bits larger than the
next higher precision.
You are talking about a specific, rather special problem.
A reasonably typical task in exact computations is to compute the
determinant of an n by n matrix with k-bit integer entries.
Sometimes k is large, but k <= 10 is frequent. Using
reasonably normal arithmetic operations you need slightly
more than n*k bits at intermediate steps. For a similar
matrix with rational entries the needed number of bits may
be as large as n^2*k. If you skip simplification of
fractions at intermediate steps your numbers may grow
exponentially with n. In the root-finding problem that
I mentioned below, to get k bits of accuracy you need
to evaluate the polynomial at a k-bit number. If you do the
evaluation in exact arithmetic, then at intermediate
steps you get n*k-bit numbers, where n is the degree of the
polynomial. OTOH in numeric computation you can get a
good result with a much smaller number of bits (though the
analysis and its result are complex), but growing
with n.
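The n*k-bit behaviour shows up in fraction-free (Bareiss) elimination; here is a sketch of it (mine, not from the thread). All the divisions below are exact, and every intermediate entry is a minor of the input matrix, so by Hadamard's bound it needs only slightly more than n*k bits.

```python
def det_bareiss(m):
    """Exact integer determinant via Bareiss fraction-free elimination."""
    n = len(m)
    m = [row[:] for row in m]          # work on a copy
    sign, prev = 1, 1
    for k in range(n - 1):
        if m[k][k] == 0:
            # find a nonzero pivot below; a row swap flips the sign
            for i in range(k + 1, n):
                if m[i][k] != 0:
                    m[k], m[i] = m[i], m[k]
                    sign = -sign
                    break
            else:
                return 0               # whole column is zero: singular
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # the division by the previous pivot is always exact;
                # the result is a minor of the original matrix
                m[i][j] = (m[i][j] * m[k][k] - m[i][k] * m[k][j]) // prev
        prev = m[k][k]
    return sign * m[n - 1][n - 1]

print(det_bareiss([[1, 2, 3], [4, 5, 6], [7, 8, 10]]))  # exact: -3
```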
so computational
cost is much higher than numerics (even arbitrary precision
numerics tend to be much faster). As a little example
in a slightly different spirit, one can try to run an approximate
root-finding procedure for polynomials in rational
arithmetic. This solves the problem of potential numerical
instability, but leads to large fractions which are
much more costly than arbitrary precision floating point
(which with some care deals with instability too).
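That fraction growth is visible with even the simplest root finder. A sketch (mine, not from the post) of Newton's method for x^2 - 2 = 0 run in exact rationals:

```python
from fractions import Fraction

# Newton iteration for x**2 - 2: x <- x/2 + 1/x, in exact rational
# arithmetic. Convergence is quadratic, but the denominator roughly
# doubles in size each step, so exact evaluation keeps getting costlier
# even as the approximation improves.
x = Fraction(3, 2)
for step in range(1, 7):
    x = x / 2 + 1 / x
    print(step, x.denominator.bit_length())
```

After six steps the answer is accurate to far better than any hardware format, but the denominator already exceeds 10^40; arbitrary precision floating point would get the same accuracy with a few hundred bits.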
On 1/11/2026 10:07 AM, MitchAlsup wrote: [...]
Michael S <already5chosen@yahoo.com> wrote:
On Thu, 8 Jan 2026 21:35:16 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
MitchAlsup wrote:
But a single incorrect rounding is 1 ULP all by itself.
:-)
Yeah, there are strong forces who want to have, at least as a
suggested/recommended option, a set of transcendental functions
which are exactly rounded.
I wonder who are those forces and what is the set they push for.
I would guess that they are mostly software and hardware
verification people rather than people that use transcendental
functions in engineering and physical calculations.
The majority of the latter crowd would likely have no objections
against 0.75 ULP RMS as long as the implementation is both fast and
preserves a few invariants, of which I can primarily think of two:
1. Evenness/oddness
If the precise function F(x) is even or odd then the approximate
function f(x) is also even or odd.
2. Weak preservation of sign of delta.
If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
If F(x) > F(x+ULP) then f(x) >= f(x+ULP)
In practice, it's probably unlikely to have these invariants
preserved when the RMS error is 0.75 ULP, but for RMS error = 0.60-0.65
ULP I don't see difficulties.
For many transcendental functions there will be only a few problematic
values of x near the edges of implementation-specific ranges where one
has to be careful.
This property is independent of the magnitude of the error. For
example, a nominally "double" routine may deliver correctly rounded
single precision results. Of course the errors are huge, but the
property holds.
More realistically, monotonic behaviour can
be obtained as a composition of monotonic operations. If done
in software such a composition may produce more than 1 ulp error,
but still be monotonic where required.
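Invariant 2 is cheap to probe empirically. A small checker (my sketch, not from the thread; `math.nextafter` needs Python 3.9+) that counts weak-monotonicity violations of a library function over consecutive representable doubles:

```python
import math

def monotonicity_violations(f, x0, steps, direction=math.inf):
    """Count adjacent-double pairs where f decreases.

    Walks `steps` consecutive representable doubles upward from x0 and
    counts the places where f(next(x)) < f(x), i.e. where invariant 2
    (weak preservation of the sign of the delta) would fail on a range
    where the exact function is increasing.
    """
    violations = 0
    x, fx = x0, f(x0)
    for _ in range(steps):
        nxt = math.nextafter(x, direction)
        fn = f(nxt)
        if fn < fx:
            violations += 1
        x, fx = nxt, fn
    return violations

# sin is increasing near 0.5; a good libm should report 0 violations
# here, but nothing guarantees that on every platform.
print(monotonicity_violations(math.sin, 0.5, 10_000))
```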
Consider COS(x) near 0.0
antispam@fricas.org (Waldek Hebisch) posted: [...]
Between 2×n+6 and 2×n+13, which is 3-9 bits larger than the
next higher precision.
double -> fp128 is 53 vs 113 bits mantissa (including the hidden bit),
According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
existing compilers supported it. He claimed that there
must be hardware support to make it widely available. He somewhat
ignored the fact that even with hardware support there still needs
to be support in compilers.
Or perhaps (again, no personal knowledge - just speculation)
supporting an additional data type in the IBM COBOL (and, for what
it's worth, PL/1) compilers is easier if there was hardware support for it.
Having written a few compilers, I can say that it is equally easy
within epsilon to emit a DFADD instruction as the equivalent of CALL
DFADD. I could believe it's politically easier: hey, we'll look dumb if
we announce this swell DFP feature and our own compilers don't use it.
Michael S <already5chosen@yahoo.com> posted:
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Yeah, there are strong forces who want to have, at least as a
suggested/recommended option, a set of transcendental functions which
are exactly rounded.
I wonder who are those forces and what is the set they push for.
The problem here is that even when one gets all the rounding correct,
one has still lost various algebraic identities:
CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0
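The lost identity is easy to measure even with today's (not correctly rounded) libm calls; a sketch, mine rather than Mitch's:

```python
import math

# Scan sin(x)**2 + cos(x)**2 over a few thousand points. Each call,
# square, and sum rounds independently, so the exact identity
# sin^2 + cos^2 = 1 only survives to within a few ulps of 1.0 --
# and correctly rounding sin and cos individually would not repair
# the rounding of the squares and the final addition.
worst = 0.0
for i in range(1, 5000):
    x = i * 0.001
    err = abs(math.sin(x) ** 2 + math.cos(x) ** 2 - 1.0)
    worst = max(worst, err)
print(worst)   # typically a couple of ulps of 1.0, i.e. around 1e-16
```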