• Re: Linus Torvalds on bad architectural features

From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sun Dec 28 00:10:48 2025
    From Newsgroup: comp.arch

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.

There is no "top" or "bottom" or "left" or "right" in memory. There
are only addresses (bit numbers and byte numbers).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 17:43:25 2025
    From Newsgroup: comp.arch


Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.

There is no "top" or "bottom" or "left" or "right" in memory. There
are only addresses (bit numbers and byte numbers).

    In order to stop the BE::LE war, one could always do a Middle Endian
    bit/Byte order. You start in the middle and each step goes right-then-left.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Dec 28 13:34:26 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> posted:

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.
There is no "top" or "bottom" or "left" or "right" in memory. There
are only addresses (bit numbers and byte numbers).

    In order to stop the BE::LE war, one could always do a Middle Endian
    bit/Byte order. You start in the middle and each step goes right-then-left.

    That is pretty much what DEC did on their
    PDP-11 and VAX floating point values.
(It's what happens when you plug HW designed for
    BE into a LE bus and don't rearrange the wires.)


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Dec 28 13:55:09 2025
    From Newsgroup: comp.arch

    MitchAlsup [2025-12-28 17:43:25] wrote:
    In order to stop the BE::LE war, one could always do a Middle Endian
    bit/Byte order. You start in the middle and each step goes right-then-left.

    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bill Findlay@findlaybill@blueyonder.co.uk to comp.arch on Sun Dec 28 19:21:44 2025
    From Newsgroup: comp.arch

On 28 Dec 2025, Lawrence D'Oliveiro wrote
    (in article <10ipsi8$3ssi3$4@dont-email.me>):

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.

    There is no "top" or "bottom" or "left" or "right" in memory.
    There are only addresses (bit numbers and byte numbers).

    Priceless!
    (I needed a good laugh.)
    --
    Bill Findlay

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 28 21:34:10 2025
    From Newsgroup: comp.arch


    Bill Findlay <findlaybill@blueyonder.co.uk> posted:

On 28 Dec 2025, Lawrence D'Oliveiro wrote
    (in article <10ipsi8$3ssi3$4@dont-email.me>):

    On Sun, 12 Oct 2025 16:11:27 GMT, MitchAlsup wrote:

Top to bottom works for Japanese and Chinese. Yet I hear no
appetite for TB byte order.

    There is no "top" or "bottom" or "left" or "right" in memory.
    There are only addresses (bit numbers and byte numbers).

    Priceless!
    (I needed a good laugh.)

    I always put higher addresses higher on my drawings than lower addresses.
    So top -> Address = 0xFFFFFFFFFFFFFFFF...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Dec 28 16:09:11 2025
    From Newsgroup: comp.arch

    On 12/28/2025 12:55 PM, Stefan Monnier wrote:
    MitchAlsup [2025-12-28 17:43:25] wrote:
    In order to stop the BE::LE war, one could always do a Middle Endian
    bit/Byte order. You start in the middle and each step goes right-then-left.

    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").


    Apparently DEC Alpha did this:
    Nothing smaller than 64 bits in HW.

    So, you want byte-oriented memory access or similar? Implement it yourself.



    Well, my recent goings on in ISA space:
    Ended up adding a J52I prefix to my jumbo-prefix extension in RISC-V.

    J52I is a 64-bit prefix that glues 52 bits onto the immediate, making it possible to encode 64-bit immediate and displacement values.
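
As an aside, the arithmetic works out because 52 prefix bits plus the
base instruction's 12-bit immediate give exactly 64 bits. A small C
sketch of the fusion, assuming the prefix supplies the upper 52 bits
(the post does not spell out the exact placement):

#include <stdint.h>

/* Hypothetical fusion of a J52I prefix with a base instruction's
   12-bit immediate: 52 prefix bits on top, 12 base bits below.
   The placement is an assumption for illustration only. */
static int64_t fuse_j52i(uint64_t prefix52, uint32_t imm12)
{
    return (int64_t)((prefix52 << 12) | (imm12 & 0xFFFu));
}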

    I changed the interpretation such that: [Reg+Disp64] is instead
    understood as [Abs64+Reg]. It is now possible to encode Abs64 branches
    via J52I+JALR.

    Potentially similar could be defined for XG2 and XG3 as well. Wouldn't
require any new changes or additions encoding-wise, but would define
    something new in terms of decoding behavior in the case of Abs64
(currently, Disp64 is not allowed; this would make it allowed, just
    understood as Abs64).

Though, unlike RISC-V, where [Rb+Disp64] and [Abs64+Rb] are conceptually equivalent, would need to decide the specifics in the XG2/XG3 case:
    Do the same thing as what I did for RV, meaning the displacement
    register is unscaled;
    Break symmetry, and make it so that it is [Abs64+Rb*Scale].


    In XG3, it could in theory just use the RV-J52I encodings as well for a
    lot of the cases if needed.

    For XG2, there is a 64-bit encoding for an Abs48 branch (special case).
    Would need to debate whether or not Abs64 memory ops are needed. But,
still niche, as it more often applies to thunks and similar than to normal
    code generation.



    It was added partly as I started to realize I had some non-zero use
    cases for Imm64 and Abs64 addressing in RV Mode.

    At first, partly designed a new encoding scheme for Imm64 instructions,
    but then realized it was possible to devise a J52I prefix which could be
    done more cheaply within my implementation.

    Then ended up battling with timing failures (this stuff was "the straw
    that broke the camel's back" in terms of timing constraints). Have ended
    up partially restructuring some parts of the decoder, partly reducing
    clutter and improving timing some (so back to passing timing again).


    Formally, the J52I prefix will likely fully replace the use of
    J22+J22+LUI and similar. It can also express the same behavior (via
    "ADDI Xd, X0, Imm64") and with slightly less hair. Internally, the J52I prefixes also better leverage J21I decoding (as effectively both
    "halves" of the J52I prefix are decoded in ways more consistent with the handling of J21I; and for the low 32 bits, the immediate is decoded
    as-if it had been given a J21I prefix).


    Then after this ended up tweaking things in BGBCC and my PEL loaders
    such that the base-reloc previously used to encode tripwires now also
    can encode the location of stack canary values; allowing the loader to essentially randomize the stack canary values each time a program is loaded.

    Mostly works, though seemingly fails for some reason on the CPU core
    when the boot ROM is built in RISC-V mode. At first I thought it was a cache-coherence issue (in the RISC-V case the relevant cache-related
functions were NOPs). Now it appears as though the code was somehow disrupting the application of base relocs (in a way that doesn't happen
    in the emulator).

    So, it is possible that bugs remain in the RV support.


    Also in the process got around to re-enabling basic ASLR for the kernel
    in the Boot ROM (the original issue impacting the symbol listing in the emulator is no longer as relevant).

    Well, and implementing some of the RV CBO instructions and similar.
    But, still doesn't fully address needs nor is a particularly close match
    to how my CPU does things. Also FENCE.I uses too much encoding space,
    and I ended up handling it in a similar way to CBO.

    Had ended up adding a few non-standard 0R encodings for things like TLB flushing and similar.

    As-is, FENCE.I can't fully implement the standard semantics, which in my
    case would need to also be able to flush the L1 D$.
    FENCE: Effectively needs full cache flush.
    Strategy: Trap and emulate.
    FENCE.I:
    Proper semantics needs full cache flush (both D$ and I$).
    Strategy: Trap and emulate proper version,
    allow CBO-like handling of a variant case.

Some wonk exists in the CBO spec; it is like someone didn't quite get
why one would want cache-line invalidation instructions
and was trying to work this in with assumptions of a fully coherent
cache (rather than, say, one using explicit cache flushing because the
HW uses a weak coherence model; using explicit flushes doesn't really make sense if one has coherent caches).

    ...




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 28 23:00:05 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:
    On 12/28/2025 12:55 PM, Stefan Monnier wrote:
    MitchAlsup [2025-12-28 17:43:25] wrote:
    In order to stop the BE::LE war, one could always do a Middle Endian
bit/Byte order. You start in the middle and each step goes right-then-left.
    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").


    Apparently DEC Alpha did this:
    Nothing smaller than 64 bits in HW.

    So, you want byte-oriented memory access or similar? Implement it yourself.

    That turned out to be a mistake, which they later corrected with the
    BWX extension.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 29 06:59:02 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 12/28/2025 12:55 PM, Stefan Monnier wrote:
    MitchAlsup [2025-12-28 17:43:25] wrote:
    In order to stop the BE::LE war

    The war is over. Little-Endian has won.

    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").


    Apparently DEC Alpha did this:
    Nothing smaller than 64 bits in HW.

    Alpha is a byte-addressed architecture, and therefore there is a
    difference between big-endian and little-endian on Alpha. The Alpha architecture manual also explains how to implement byte accesses for little-endian systems and for big-endian systems (AFAIK nobody ever
    built a big-endian Alpha system).
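
The technique is essentially an aligned wide load plus shift and mask.
A minimal C sketch of the little-endian case (the flat mem array and
names are illustrative, not the manual's actual LDQ_U-style sequence):

#include <stdint.h>

/* Byte load on a machine whose memory only supports aligned 64-bit
   accesses, with little-endian byte numbering. */
static uint8_t load_byte(const uint64_t *mem, uintptr_t addr)
{
    uint64_t word = mem[addr >> 3];          /* aligned 64-bit load */
    unsigned shift = (unsigned)(addr & 7) * 8;
    return (uint8_t)(word >> shift);         /* extract the byte */
}

/* A byte store is the read-modify-write version: */
static void store_byte(uint64_t *mem, uintptr_t addr, uint8_t val)
{
    unsigned shift = (unsigned)(addr & 7) * 8;
    uint64_t mask = (uint64_t)0xFF << shift;
    mem[addr >> 3] = (mem[addr >> 3] & ~mask) | ((uint64_t)val << shift);
}

A big-endian system would compute the shift as (7 - (addr & 7)) * 8,
which is the only place where the byte order shows up.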

    Mitch Alsup mentioned one architecture without order problems: The
    Cray-1 is word-addressed and does not support numbers that take more
than one word. The same is true of the CDC 6600 and descendants.

The 36-bit machines were word-addressed, but supported double-precision (72-bit) FP numbers, making the word order of these
    numbers relevant. What order did they use? For IEEE FP, the order is
    probably not an issue, because one will usually not access a DP FP
    value with two SP loads (maybe if the FP value is represented in two
    SP registers, as on the 88000 or the VAX?). For formats where the DP
    FP representation only has a longer mantissa but the same exponent
    size as the SP FP representation, accessing the DP FP value with SP
    operations may be more interesting, and in that case the word order
    plays a role.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Dec 29 08:17:18 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    Mitch Alsup mentioned one architecture without order problems: The
    Cray-1 is word-addressed and does not support numbers that take more
    than one word. The same is true of the CDC 6600 and descendents.

    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 29 09:08:10 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.

    And my guess is that the word order for double-precision is also
    specified by FORTRAN. It probably is the word order of the IBM 704
    (the machine for which Fortran was designed), and that probably is
    big-endian (the higher-order bits of the mantissa are in the word with
    the lower address), guessed from the fact that the IBM S/360 has
    big-endian byte order.

    Interestingly, for FP formats that have the sequence of bits

    sign|exponent|mantissa

    with the highest-order mantissa bits leftmost, the exponent left of
    that, and the sign left of that, and where the double-precision format
just has a longer mantissa than the single-precision format, storing
    DP FP numbers as two words in big-endian format (i.e., the leftmost
    word at the lower address) has a similar property as little-endian has
for integers:

    You can load a number that fits into the smaller container at the same
    address with the appropriate shorter load, and get the correct result.
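
For integers the corresponding little-endian property is easy to see in
C (memcpy stands in for the shorter load; this only prints the expected
value on a little-endian host):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    uint64_t wide = 0x0000000012345678u;   /* value fits in 32 bits */
    uint32_t narrow;
    memcpy(&narrow, &wide, sizeof narrow); /* shorter load, same address */
    printf("%08" PRIx32 "\n", narrow);     /* 12345678 on little-endian */
    return 0;
}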

    So the choice of big-endian in the S/360 may have to do with FP in
    addition to unpacked decimal data. On the PDP-11, 8008, and 6502,
    where binary integers dominated and unpacked decimal data played no
    role, little-endian was chosen.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Dec 29 13:39:22 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.

    And my guess is that the word order for double-precision is also
    specified by FORTRAN.

    Your guess is wrong.

    If you have storage association (via COMMON/EQUIVALENCE)
    between two variables of different type and assign a value
    to one of them, the other one becomes undefined.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Dec 29 13:48:24 2025
    From Newsgroup: comp.arch

    BGB [2025-12-28 16:09:11] wrote:
    On 12/28/2025 12:55 PM, Stefan Monnier wrote:
    I thought the solution was to follow the Cray 1's lead, where memory is
only ever accessed in units of the same size (a "word").
    Apparently DEC Alpha did this:
    Nothing smaller than 64 bits in HW.

I thought they did include 32-bit load/store instructions even in the
    original AXP21064.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Dec 31 02:54:29 2025
    From Newsgroup: comp.arch

    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.

    And my guess is that the word order for double-precision is also
    specified by FORTRAN.

    Your guess is wrong.

    If you have storage association (via COMMON/EQUIVALENCE)
    between two variables of different type and assign a value
    to one of them, the other one becomes undefined.

    There was never any sort of type punning in FORTRAN. The 704 and 709
    floating point was single precision, other than a multiply that
    produced a two word result. Fortran II on the 709 and 7090 had a
    kludge where you could put D in the first column of a statement to
    make it double precision, and Fortran IV added explicit DOUBLE
    PRECISION. They did double precision in software, until the 7090 added
    double precision arithmetic instructions.

Fortran ran on many different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Dec 31 09:43:18 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> schrieb:
    It appears that Thomas Koenig <tkoenig@netcologne.de> said:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    The Cray-1 had double precision numbers, with software support
    only. They had to in order to conform to the FORTRAN standards
    of storage association.

    And my guess is that the word order for double-precision is also
    specified by FORTRAN.

    Your guess is wrong.

    If you have storage association (via COMMON/EQUIVALENCE)
    between two variables of different type and assign a value
    to one of them, the other one becomes undefined.

    There was never any sort of type punning in FORTRAN.

    [Interesting history snipped]

Fortran ran on many different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.

    Unfortunately, this did not keep people from using a feature
    that was officially prohibited by the standard, see for example https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
    clear need (having certain floating point constants available for
the multitude of different floating point formats in the wild)
    but it had to resort to type punning. It is also interesting
    for the wild multitude of different floating point formats for
    double precision that were relevant in the past.
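
The C rendering of that EQUIVALENCE trick is a union; a sketch using
the IEEE double-precision bit patterns (the historical file instead
carried alternative DATA statements for each machine's format):

#include <stdint.h>
#include <stdio.h>

union pun { uint64_t bits; double d; };   /* overlay, as EQUIVALENCE did */

int main(void)
{
    union pun smallest = { .bits = 0x0010000000000000u }; /* min normal */
    union pun largest  = { .bits = 0x7FEFFFFFFFFFFFFFu }; /* max finite */
    printf("%g %g\n", smallest.d, largest.d);
    return 0;
}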

    Fortunately, these days it's all IEEE; I think nobody uses IBM's
    base-16 FP numbers for anything serious any more.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Jan 1 17:46:29 2026
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    John Levine <johnl@taugh.com> schrieb:
    There was never any sort of type punning in FORTRAN.

    [Interesting history snipped]

Fortran ran on many different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.

    Unfortunately, this did not keep people from using a feature
that was officially prohibited by the standard, see for example https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
    clear need (having certain floating point constants ...

    Wow, that's gross but I see the need. If you wanted to do extremely
    machine specific stuff in Fortran, it didn't try to stop you.

    Fortunately, these days it's all IEEE; I think nobody uses IBM's
    base-16 FP numbers for anything serious any more.

    Agreed, except IEEE has both binary and decimal flavors. It's never been
    clear to me how much people use decimal FP. The use case is clear enough,
    it lets you control normalization so you can control the decimal precision
    of calculations, which is important for financial calculations like bond prices. On the other hand, while it is somewhat painful to get correct
    decimal rounded results in binary FP, it's not all that hard -- forty
    years ago I wrote all the bond price functions for an MS DOS application
    using the 286's FP.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Jan 4 00:21:31 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    John Levine <johnl@taugh.com> schrieb:
    There was never any sort of type punning in FORTRAN.

    [Interesting history snipped]

Fortran ran on many different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.

    Unfortunately, this did not keep people from using a feature
that was officially prohibited by the standard, see for example https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
    clear need (having certain floating point constants ...

    Wow, that's gross but I see the need. If you wanted to do extremely
    machine specific stuff in Fortran, it didn't try to stop you.

    Fortunately, these days it's all IEEE; I think nobody uses IBM's
    base-16 FP numbers for anything serious any more.

    Agreed, except IEEE has both binary and decimal flavors.

    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.

    IBM does the arithmetic in hardware, their decimal arithmetic probably
    goes back to their adding and multiplying punches, far before computers.

    It's never been
    clear to me how much people use decimal FP. The use case is clear enough,
    it lets you control normalization so you can control the decimal precision
    of calculations, which is important for financial calculations like bond prices. On the other hand, while it is somewhat painful to get correct decimal rounded results in binary FP, it's not all that hard -- forty
    years ago I wrote all the bond price functions for an MS DOS application using the 286's FP.

    Speed? At least when using 128-bit densely packed decimal encoding
    on IBM machines, it is possible to get speed which is probably
    not attainable by software (but certainly not very good compared
    to what an optimized version could do, and POWER's 128-bit unit
    is also quite slow as a result).

    And people using other processors don't want to develop hardware,
    but still want to do the same applications. IIRC, everybody but
    IBM uses binary encoding of the significand and software, probably
    doing something similar to what you did.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Jan 4 04:12:25 2026
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    Agreed, except IEEE has both binary and decimal flavors.

    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.

    IBM does the arithmetic in hardware, their decimal arithmetic probably
    goes back to their adding and multiplying punches, far before computers.

    S/360 had packed decimal and zSeries even has vector ops for it. But it's different from DFP. Somewhere I saw a set of slides that said that the
    first DFP in z was done in millicode, with hardware later.

    It's never been
clear to me how much people use decimal FP. The use case is clear enough, it lets you control normalization so you can control the decimal precision ...

    Speed? At least when using 128-bit densely packed decimal encoding
    on IBM machines, it is possible to get speed which is probably
    not attainable by software ...

Software implementations of DFP would be slow, but if you know what you are doing you can get the correctly rounded decimal results using BFP which I
    would think would be faster.

    but still want to do the same applications. IIRC, everybody but
    IBM uses binary encoding of the significand and software, probably
    doing something similar to what you did.

    I used regular IEEE binary FP, with explicit code to do decimal rounding
    when needed. Like I said it was a pain but it wasn't all that hard.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Jan 4 08:06:50 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
Software implementations of DFP would be slow, but if you know what you are doing you can get the correctly rounded decimal results using BFP which I would think would be faster.

    If you know what you are doing, you use fixed point for those
    financial applications which DFP targets, because that's what finance
    used in the old days, and what they laid down in their rules (and
    still do; the Euro conversion (early 2000s) has to happen with 4
    decimal digits after the decimal point; which is noted as unusual,
    apparently 2 or 3 digits are more common). And whenever I bring that
    up, it is explained to me that DFP actually behaves like fixed point.
    Which leads to the question why one would use DFP rather than fixed
    point.

    In the bad old days of 16-bit processors, using the 64-bit mantissa of
    80-bit BFP as a large integer may have provided an advantage, but
    these days we have 64-bit integers, and 80-bit BFP is slower than
    64-bit BFP with 53-bit mantissa, so I don't see a reason to use any FP
    for financial calculations.
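
Concretely, "just use 64-bit integers" is nothing more than this sketch
(money as integer cents; no overflow checking, and division is the one
place where a rounding rule has to be chosen):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef int64_t cents_t;   /* ~92 * 10^15 currency units of headroom */

int main(void)
{
    cents_t unit_price = 1999;        /* 19.99 */
    cents_t total = unit_price * 3;   /* exact: 59.97 */
    printf("%" PRId64 ".%02" PRId64 "\n", total / 100, total % 100);
    return 0;
}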

    Concerning the question of how much DFP is used: My impression is that
    Intel's DFP implementation is not particularly efficient, and it sees
    no maintenance. And I have not read about other implementations. My
    guess is that there is so little use of this library that nobody
    bothers working on it, and the use that it sees is not in
    performance-critical code, so nobody works on making Intel's
    implementation faster or making another, faster implementation.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 4 12:20:43 2026
    From Newsgroup: comp.arch

    On Sun, 04 Jan 2026 08:06:50 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    John Levine <johnl@taugh.com> writes:
Software implementations of DFP would be slow, but if you know what
you are doing you can get the correctly rounded decimal results
    using BFP which I would think would be faster.

    If you know what you are doing, you use fixed point for those
    financial applications which DFP targets, because that's what finance
    used in the old days, and what they laid down in their rules (and
    still do; the Euro conversion (early 2000s) has to happen with 4
    decimal digits after the decimal point; which is noted as unusual,
    apparently 2 or 3 digits are more common). And whenever I bring that
    up, it is explained to me that DFP actually behaves like fixed point.
    Which leads to the question why one would use DFP rather than fixed
    point.


DFP behaves as fixed point for as long as it has enough digits in the
significand to behave as fixed point.
It could use up all digits rather quickly:
- when you sum up a very big number with a very small number
(hopefully it does not happen in finance; big numbers there are not
really big and small numbers are not really small)
- when you multiply, esp. several times
- when you divide. If the result is inexact then just one division is enough

At this point, if you want it to continue to behave as fixed point, you
have to manually apply an operation called quantize, which typically would
cause rounding to lower precision.
Or, maybe, some languages can do it automatically. But I don't know
which languages those would be.
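
For those who have not met it, quantize rescales a value to a requested
exponent, rounding if digits are dropped. A toy C model on a
coefficient-times-power-of-ten pair (illustrative only; a real DFP
quantize also handles specials, inexact flags, and coefficient
overflow):

#include <stdint.h>

typedef struct { int64_t coeff; int exp; } dec_t;  /* value = coeff * 10^exp */

static dec_t quantize(dec_t x, int target_exp)
{
    if (x.exp > target_exp) {                /* add trailing zeros */
        while (x.exp > target_exp) { x.coeff *= 10; x.exp--; }
    } else if (x.exp < target_exp) {         /* drop digits, round once */
        int64_t div = 1;
        for (int e = x.exp; e < target_exp; e++) div *= 10;
        int64_t q = x.coeff / div, r = x.coeff % div;
        if (2 * r >= div)  q++;              /* half away from zero */
        if (2 * r <= -div) q--;
        x.coeff = q;
        x.exp = target_exp;
    }
    return x;
}

/* quantize((dec_t){12345, -3}, -2) yields {1235, -2}: 12.345 -> 12.35 */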


If the language is C and the compiler is gcc, then on POWER one can apply
the quantize op manually via the __builtin_dfp_quantize() family of built-ins.

I have no idea how one can do it on other gcc targets. It does not look
like gcc provides __builtin_bid_quantize(). If it exists, the manual does
not mention it.
Then again, the gcc manual mentions very few details about Decimal FP
support. It looks like work on Decimal FP was started ~10 years ago,
made progress for a year or two, and then interest was lost.

    In the bad old days of 16-bit processors, using the 64-bit mantissa of
    80-bit BFP as a large integer may have provided an advantage, but
    these days we have 64-bit integers, and 80-bit BFP is slower than
    64-bit BFP with 53-bit mantissa, so I don't see a reason to use any FP
    for financial calculations.

    Concerning the question of how much DFP is used: My impression is that Intel's DFP implementation is not particularly efficient,

If the gcc implementation is based on Intel's then my measurements posted
here a few weeks ago certainly agree, esp. for multiplication.
Disclaimer:
I only measured operations with 34 significant digits. It is possible
that this implementation is optimized for cases with a significantly
smaller number of digits.

    and it sees
    no maintenance. And I have not read about other implementations. My
    guess is that there is so little use of this library that nobody
    bothers working on it, and the use that it sees is not in performance-critical code, so nobody works on making Intel's
    implementation faster or making another, faster implementation.


An absence of easily accessible quantize operations seems to hint that
the gcc implementation has no production use at all.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 4 12:36:42 2026
    From Newsgroup: comp.arch

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


Since both formats have exactly identical semantics, in theory the mess
is not worse (and not better) than the two byte orders of IEEE binary FP.

I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
standard does not specify a few very important things about global state
and interoperability of BFP and DFP in the same process. Like whether
BFP and DFP have a common rounding mode or each one has a mode of its own.
The same question applies to exception flags and exception masks.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Jan 4 13:17:04 2026
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    John Levine <johnl@taugh.com> schrieb:
    There was never any sort of type punning in FORTRAN.

    [Interesting history snipped]

    Fortran ran on many Different machines with different floating point
    formats and you could not make any assumptions about similarities in
    single and double float formats.

    Unfortunately, this did not keep people from using a feature
    that was officially prohibited by the standard, see for example
    https://netlib.org/slatec/src/d1mach.f . This file fulfilled a
    clear need (having certain floating point constants ...

    Wow, that's gross but I see the need. If you wanted to do extremely
    machine specific stuff in Fortran, it didn't try to stop you.

    Fortunately, these days it's all IEEE; I think nobody uses IBM's
    base-16 FP numbers for anything serious any more.

    Agreed, except IEEE has both binary and decimal flavors.

    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.

    IBM does the arithmetic in hardware, their decimal arithmetic probably
    goes back to their adding and multiplying punches, far before computers.

    It's never been
clear to me how much people use decimal FP. The use case is clear enough, it lets you control normalization so you can control the decimal precision of calculations, which is important for financial calculations like bond
    prices. On the other hand, while it is somewhat painful to get correct
    decimal rounded results in binary FP, it's not all that hard -- forty
    years ago I wrote all the bond price functions for an MS DOS application
    using the 286's FP.

    Speed? At least when using 128-bit densely packed decimal encoding
    on IBM machines, it is possible to get speed which is probably
    not attainable by software (but certainly not very good compared
    to what an optimized version could do, and POWER's 128-bit unit
    is also quite slow as a result).

    And people using other processors don't want to develop hardware,
    but still want to do the same applications. IIRC, everybody but
    IBM uses binary encoding of the significand and software, probably
    doing something similar to what you did.

    Using binary mantissa encoding makes everything you do in DFP quite
easy, with the exception of (re-)normalization. Our own Michael S has
    come up with some really nice ideas here to make it (significantly) less painful.

    That said, for the classic AT&T phone bill benchmark you pretty much
    never do any normalization in the main processing loop, so software
    emulation with binary mantissa is perfectly OK.

    A somewhat similar problem is the 1brc (One Billion Rows Challenge)
    where you aggregate records containing two-digit signed temperature
values, all of which have one decimal digit.

Here the fastest approach is to simply do everything with int16_t/i16
min/max values and an i64 global accumulator. No DFP needed!
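
A miniature of that in C (field names illustrative): temperatures are
parsed straight to tenths, so "-12.3" becomes -123, and the only wide
arithmetic is the running sum:

#include <stdint.h>

struct station {
    int16_t  min, max;   /* tenths of a degree */
    int64_t  sum;        /* tenths; 10^9 samples fit with room to spare */
    uint32_t count;
};

static void update(struct station *s, int16_t tenths)
{
    if (tenths < s->min) s->min = tenths;
    if (tenths > s->max) s->max = tenths;
    s->sum += tenths;
    s->count++;
}

/* mean in tenths: s->sum / (int64_t)s->count, rounded as desired */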

    Terje
    PS. In reality, the 1brc is 90%+ determined by the speed of your chosen
    hash table implementation.
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Jan 4 13:22:18 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Agreed, except IEEE has both binary and decimal flavors.

    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.

    IBM does the arithmetic in hardware, their decimal arithmetic probably
    goes back to their adding and multiplying punches, far before computers.

    S/360 had packed decimal and zSeries even has vector ops for it. But it's different from DFP. Somewhere I saw a set of slides that said that the
    first DFP in z was done in millicode, with hardware later.

    It's never been
clear to me how much people use decimal FP. The use case is clear enough, it lets you control normalization so you can control the decimal precision ...

    Speed? At least when using 128-bit densely packed decimal encoding
    on IBM machines, it is possible to get speed which is probably
    not attainable by software ...

Software implementations of DFP would be slow, but if you know what you are doing you can get the correctly rounded decimal results using BFP which I would think would be faster.

    but still want to do the same applications. IIRC, everybody but
    IBM uses binary encoding of the significand and software, probably
    doing something similar to what you did.

    I used regular IEEE binary FP, with explicit code to do decimal rounding
    when needed. Like I said it was a pain but it wasn't all that hard.

    I did the same around 1983/84, using Borland's Turbo Pascal. I started
PC programming in 1982 to solve a problem for my father-in-law-to-be.

    I had to adhere to Norwegian legal accounting rules, so all the rounding
    had to be correct, but like you I found that with the total precision available and the known maximum number of operations before I had to do
    the final rounding, just a small epsilon added was enough to guarantee
    that the result would be OK.
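
A minimal C sketch of that approach, with an illustrative epsilon (the
real bound has to be derived from the actual worst-case operation
count):

#include <math.h>
#include <stdio.h>

/* Nudge by more than the worst-case accumulated binary error, then
   round half away from zero at two decimals. */
static double round_to_cents(double x)
{
    const double eps = 1e-9;                       /* illustrative bound */
    double nudged = x + (x < 0 ? -eps : eps);
    return floor(nudged * 100.0 + 0.5) / 100.0;
}

int main(void)
{
    /* 0.1 + 0.2 is 0.30000000000000004 in binary FP... */
    printf("%.2f\n", round_to_cents(0.1 + 0.2));   /* ...but prints 0.30 */
    return 0;
}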

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Jan 4 16:45:10 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
is not worse (and not better) than the two byte orders of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 4 19:35:13 2026
    From Newsgroup: comp.arch

    On Sun, 4 Jan 2026 16:45:10 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
mess is not worse (and not better) than the two byte orders of IEEE
    binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

It's true that there is no restriction, but it does not influence
semantics.
In binary encoding, when the value of the significand exceeds
10**(3*J+1)-1, it is a non-canonical representation of zero.
It is very similar to how declets in DPD are allowed to have all 1024 bit
patterns: each pattern has a defined numeric value in the range [0:999], but
only 1000 patterns are canonical, i.e. allowed both as inputs and as outputs
of arithmetic operations, and the remaining 24 patterns are non-canonical -
accepted as inputs, but never produced as outputs.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 4 18:47:20 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
Software implementations of DFP would be slow, but if you know what you are doing you can get the correctly rounded decimal results using BFP which I would think would be faster.

    If you know what you are doing, you use fixed point for those
    financial applications which DFP targets, because that's what finance
    used in the old days, and what they laid down in their rules (and
    still do; the Euro conversion (early 2000s) has to happen with 4
    decimal digits after the decimal point; which is noted as unusual,
    apparently 2 or 3 digits are more common). And whenever I bring that
    up, it is explained to me that DFP actually behaves like fixed point.
    Which leads to the question why one would use DFP rather than fixed
    point.

    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
    number of digits below the decimal point, in a control register ???!!!

    3-cycle ADD/SUB
    6-cycle MUL
    ~30-cycle DIV


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Jan 4 18:01:34 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    DFP behaves as fixed point for as long as it has enough digits in
the significand to behave as fixed point.
It could use up all digits rather quickly:
- when you sum up a very big number with a very small number

In fixed point, the very big number is an overflow, and the very
small number is 0.

(hopefully it does not happen in finance; big numbers there are not
really big and small numbers are not really small)

    Exactly. There are some places where the currency is vastly inflated
    (or currently undergoing high inflation, but I don't think that any
    inflation was ever so big that it could not be handled with 128-bit
    fixed point and a scale factor of 1/10000 (where the largest number is
    10^34).

    - when you multiply, esp. several times

    What do you want to multiply in financial applications that would
overflow 128-bit fixed point? If the multiplications result in
    rounding to zero, that's fine.

- when you divide. If the result is inexact then just one division is enough

    Yes, fixed point is inexact. You round (or truncate, whatever is
    specified; usually merchant's rounding) at the specified number of
    digits; that's what the rules say. If DFP behaves differently, it's
    not appropriate for these kinds of calculations.
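
A sketch of exactly that in C, on integer cents, with merchant's
rounding (half away from zero) as the assumed rule:

#include <stdint.h>

/* Divide an amount in cents, rounding the quotient half away from
   zero. The rounding rule is whatever the regulation specifies;
   merchant's rounding is assumed here. */
static int64_t div_cents(int64_t amount, int64_t divisor)
{
    int64_t q = amount / divisor;          /* C truncates toward zero */
    int64_t r = amount % divisor;
    int64_t ar = r < 0 ? -r : r;
    int64_t ad = divisor < 0 ? -divisor : divisor;
    if (2 * ar >= ad)
        q += ((amount < 0) == (divisor < 0)) ? 1 : -1;
    return q;
}

/* div_cents(10000, 3) == 3333: 100.00 split three ways is 33.33 */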

Then again, the gcc manual mentions very few details about Decimal FP
support. It looks like work on Decimal FP was started ~10 years ago,
made progress for a year or two, and then interest was lost.

    This supports my theory that nobody is using DFP.

An absence of easily accessible quantize operations seems to hint that
the gcc implementation has no production use at all.

    And that, too.

    OTOH, gcc apparently does support fixed-point (https://gcc.gnu.org/onlinedocs/gcc/Fixed-Point.html), but that is
    based on a standard for embedded systems (https://www.open-std.org/JTC1/SC22/WG14/www/docs/n1005.pdf) and uses
    binary scale factors (specified as bits), so it's not the kind of
    fixed point useful for financial calculations. OTOH, C is probably
    not the language that is used for financial software.

    For Java, I found that it has BigDecimal in its library, which
    somewhat fits the bill. Each number comes with a scale that is a
    power of 10 (and the programmer actually communicates with the system
    by specifying log(scale)). The scale of the result of an arithmetic
    operation is determined by a MathContext if given, and some default if
    not given. For addition and subtraction the default looks reasonable,
    for multiply one probably wants to supply a MathContext unless you
    multiply a number with scale 10^0 with some other number (which might
actually be a frequent occurrence). Of course BigDecimal is boxed (and
    probably references BigInteger, another boxed type), so it's not
    particularly efficient unless the JIT compiler is doing heroic things;
    but that's what you get with Java.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Jan 4 14:12:52 2026
    From Newsgroup: comp.arch

    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
    number of digits below the decimal point, in a control register ???!!!

    I'm not sure I understand your proposal correctly, but the number of
    digits below the decimal point should not be a global setting because
    several computations can commonly happen at the same time with different
numbers of digits below the decimal point.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 4 20:14:09 2026
    From Newsgroup: comp.arch


    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    So, would it not be easier and faster to simply make a densely-packed 128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
    number of digits below the decimal point, in a control register ???!!!

    I'm not sure I understand your proposal correctly,

It is not so much of a question as it is my mind rambling through the possibilities. Loosely based on IBM 360--except 2× or 4× as long and
    stored in densely packed decimal instead of 4-bit digits. No decision
    on whether data is stored in registers or processed via memory.

    but the number of
    digits below the decimal point should not be a global setting because
several computations can commonly happen at the same time with different numbers of digits below the decimal point.

    OK, forgot that COBOL has each number defined with its own decimal
    location.

    Should each calculation have its own "rounding mode" or "what to do
    with bits that fall off below the defined Lower-end" ??

    It just seems to me that once the "container" is big enough to deal
    with world GDP (of 2100) in the least valuable currency in the world,
    that making the decimal point "float" adds little value.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Jan 4 13:05:19 2026
    From Newsgroup: comp.arch

    On 1/4/2026 12:14 PM, MitchAlsup wrote:

    Stefan Monnier <monnier@iro.umontreal.ca> posted:

    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!? A rounding mode, and a
    number of digits below the decimal point, in a control register ???!!!

    I'm not sure I understand your proposal correctly,

It is not so much of a question as it is my mind rambling through the possibilities. Loosely based on IBM 360--except 2× or 4× as long and
    stored in densely packed decimal instead of 4-bit digits. No decision
    on whether data is stored in registers or processed via memory.

    but the number of
    digits below the decimal point should not be a global setting because
    several computations can commonly happen at the same time with different
numbers of digits below the decimal point.

    OK, forgot that COBOL has each number defined with its own decimal
    location.

    Should each calculation have its own "rounding mode" or "what to do
    with bits that fall off below the defined Lower-end" ??

    COBOL also allows specification of rounded or not on each calculation.
I don't know how it handles different rounding modes.


    It just seems to me that once the "container" is big enough to deal
    with world GDP (of 2100) in the least valuable currency in the world,
    that making the decimal point "float" adds little value.

It used to be that people cared about how much storage (both main memory
and disk/tape) each data item took, but I think those days are mostly over.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Jan 4 15:20:01 2026
    From Newsgroup: comp.arch

    On 1/4/2026 10:01 AM, Anton Ertl wrote:

    snip

Then again, the gcc manual mentions very few details about Decimal FP
support. It looks like work on Decimal FP was started ~10 years ago,
made progress for a year or two, and then interest was lost.

    This supports my theory that nobody is using DFP.

I don't know how many people, if any, are using DFP; however,

if it is used, it is probably used most on IBM Z series (and hence
they probably don't care about gcc);
if it is used, the people who use it most probably don't subscribe to
this NG.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 08:16:15 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!?

    128-bit binary integers are mostly good enough, and support for that
is ok in current architectures; division support could be better,
though (see below). Rescaling with a power of 10 is something that
    may merit additional hardware support if it occurs often enough; but I
    am not convinced that it occurs often enough:

    You usually don't need it for addition and subtraction, because the
    operands have the same scale factor, and the same scale factor as the
    result.

    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    For division, it seems to me that the most common case is division by
    a percentage that is applied to many dividends (maybe not in the USA,
    but certainly in Europe it is common to compute the price without VAT
    (sales tax) from the price with VAT; but there are only few VAT rates
    in each country); that can be turned into a two-stage operation that
    might include any necessary rescaling: compute an inverse that can
    then be used for a cheap multiplication-and-rounding operation (e.g.,
    where a power-of-2 scale factor is involved for computing something
    below the least significant digit, in order to implement rounding).
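
A sketch of the two-stage scheme in C, with a 2^64-scaled reciprocal
(gcc/clang __int128 assumed; rates and scaling are illustrative, and a
production version would have to prove correct rounding over the whole
input range):

#include <stdint.h>

typedef unsigned __int128 u128;

/* Stage 1: precompute round(2^64 * 100 / (100 + rate)), rate > 0. */
static uint64_t vat_inverse(unsigned rate_percent)
{
    u128 num = (u128)100 << 64;
    return (uint64_t)((num + (100 + rate_percent) / 2) / (100 + rate_percent));
}

/* Stage 2: net = round(gross * inv / 2^64), one multiply per price. */
static uint64_t net_from_gross(uint64_t gross_cents, uint64_t inv)
{
    u128 p = (u128)gross_cents * inv;
    return (uint64_t)((p + ((u128)1 << 63)) >> 64);
}

/* net_from_gross(1200, vat_inverse(20)) == 1000: 12.00 gross at 20%
   VAT is 10.00 net */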

    And yes, support for several rounding modes is needed when an inexact
    result is involved. Hardware may be helpful here.

    I have not done much financial programming, so maybe somebody else can complement my views.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 08:51:48 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    Should each calculation have its own "rounding mode" or "what to do
    with bits that fall off below the defined Lower-end" ??

    Yes.

    If you look at Java's BigDecimal operations <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
they all have versions without and with MathContext (which includes a
    target scale and a rounding mode), and sometimes additional variants
    (e.g., divide() has variants where you pass just the rounding mode, or
    the rounding mode and scale individually instead of through a
    MathContext).

    One feature of BigDecimal is arbitrary precision. As we have
    discussed, that's not necessary for financial calculations.

    It just seems to me that once the "container" is big enough to deal
    with world GDP (of 2100) in the least valuable currency in the world,
    that making the decimal point "float" adds little value.

    I agree. It also adds little value otherwise, because the regulations
    specify fixed-point computations.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 09:08:25 2026
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
if it is used, it is probably used most on IBM Z series.

    Possibly. But the lack of takeup of the Intel library and of the gcc
    support shows that "build it and they will come" does not work out for
    DFP. So even on System Z, if there is a takeup, it is probably small
    and the result of an effort by IBM to make programmers use DFP.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 10:21:31 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    One interesting aspect is that the interest rates I have seen are
    multiples of 1/800 (e.g., 1 3/4%=7/4%=14/8%=14/800). One can also
    represent these through decimal scales, but the decimal scale that
    allows to represent them is 1/100000 (1/800=125/100000). It may be
    more economical in bits to scale with 1/800 (or maybe 1/1600 to be
prepared for the next innovation in finance).

    For tax rates, IIRC I have also seen half percentages, so using a
    1/800 or 1/1600 scale factor may be a good idea for them, too.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 10:31:08 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!?

    128-bit binary integers are mostly good enough, and support for that
    is ok in current architectures; division support might be better,
    though, but see below. Rescaling with a power of 10 is something that
    may merit additional hardware support if it occurs often enough; but I
    am not convinced that it occurs often enough:

    You usually don't need it for addition and subtraction, because the
    operands have the same scale factor, and the same scale factor as the
    result.

    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    For division, it seems to me that the most common case is division by
    a percentage that is applied to many dividends (maybe not in the USA,
    but certainly in Europe it is common to compute the price without VAT
    (sales tax) from the price with VAT; but there are only few VAT rates
    in each country); that can be turned into a two-stage operation that
    might include any necessary rescaling: compute an inverse that can
    then be used for a cheap multiplication-and-rounding operation (e.g.,
    where a power-of-2 scale factor is involved for computing something
    below the least significant digit, in order to implement rounding).

    And yes, support for several rounding modes is needed when an inexact
    result is involved. Hardware may be helpful here.

    I have not done much financial programming, so maybe somebody else can complement my views.

    - anton

    There are many different kinds of bonds, each with its own calculation.
    Bonds were invented by the Dutch in the early 1600s.
    They are just a business deal to borrow money and repay it,
    and these calculations are standardizations of those terms.
    People have had 400 years to be creative in designing these deals,
    which is why there are many different calculations.

    Many bonds are time series polynomials with non-integer times.
    Many calculations use pow(base,exp) with a non-integer exponent.

    I used double for everything and my results were fine,
    matching the benchmarks exactly to their 11 digits
    and matching the HP Financial Calculator to 13 decimal places.

    Decimal would also need conversions with integers and BFP.
    Many of these values are coming from and going to databases
    and are exchanged with other systems.

    The book I have on bond calculations lists different rules for
    different types of bonds (municipal, corporate, T-bill, other),
    and rules and calculations for price given yield and yield given price.
    e.g.:

    - all prices and yields calculated to at least 10 significant digits

    - for municipal and corporate securities dollar prices should be accurate
    to seven places after the decimal, *truncating* to 3 places just prior
    to display.

    - for T-bills dollar price accuracy should be eight places after the
    decimal, *rounding* to seven places just prior to display.

    - for all other securities dollar price accuracy should be to seven
    places after the decimal, *rounding* to six places just prior to
    display.

    - calculations for yield should be at a minimum accurate to four places
    after the decimal, *rounding* to three places just prior to display.
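    To make the *truncating* vs *rounding* distinction concrete (Java; the
    price is a made-up value):

        import java.math.BigDecimal;
        import java.math.RoundingMode;

        public class DisplayRules {
            public static void main(String[] args) {
                BigDecimal price = new BigDecimal("98.7656789"); // internal
                // municipal/corporate: *truncate* for display
                System.out.println(
                    price.setScale(3, RoundingMode.DOWN));    // 98.765
                // other securities: *round* for display
                System.out.println(
                    price.setScale(3, RoundingMode.HALF_UP)); // 98.766
            }
        }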

    There are also many different ways of counting the number of days
    between dates, which are not always integers.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Jan 5 15:40:55 2026
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    John Levine <johnl@taugh.com> writes:
    Software implementations of DFP would be slow, but if you know what you
    are doing you can get the correctly rounded decimal results using BFP
    which I would think would be faster.

    If you know what you are doing, you use fixed point for those
    financial applications which DFP targets, because that's what finance
    used in the old days, and what they laid down in their rules (and
    still do; the Euro conversion (early 2000s) has to happen with 4
    decimal digits after the decimal point; which is noted as unusual,
    apparently 2 or 3 digits are more common). And whenever I bring that
    up, it is explained to me that DFP actually behaves like fixed point.
    Which leads to the question why one would use DFP rather than fixed
    point.

    The Burroughs B3500 (1965) had both decimal fixed point and decimal floating point modes. By the second generation (B4700), Burroughs had dropped
    the decimal floating point since all the business (COBOL) customers
    were happy with the decimal fixed point (100 digits should be good enough
    for most financial calculations, even today).

    The floating point implementation supported an exponent range of -100
    to +100 and a full 100 digit mantissa.

    For the B4700 and successors, the decimal floating point was replaced
    with a floating point accumulator (same exponent range, but only a
    twenty digit mantissa).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 11:02:00 2026
    From Newsgroup: comp.arch

    EricP wrote:
    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    So, would it not be easier and faster to simply make a densely-packed
    128-bit Fixed-Point decimal function unit ?!?

    128-bit binary integers are mostly good enough, and support for that
    is ok in current architectures; division support might be better,
    though, but see below. Rescaling with a power of 10 is something that
    may merit additional hardware support if it occurs often enough; but I
    am not convinced that it occurs often enough:

    You usually don't need it for addition and subtraction, because the
    operands have the same scale factor, and the same scale factor as the
    result.

    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    For division, it seems to me that the most common case is division by
    a percentage that is applied to many dividends (maybe not in the USA,
    but certainly in Europe it is common to compute the price without VAT
    (sales tax) from the price with VAT; but there are only few VAT rates
    in each country); that can be turned into a two-stage operation that
    might include any necessary rescaling: compute an inverse that can
    then be used for a cheap multiplication-and-rounding operation (e.g.,
    where a power-of-2 scale factor is involved for computing something
    below the least significant digit, in order to implement rounding).

    And yes, support for several rounding modes is needed when an inexact
    result is involved. Hardware may be helpful here.

    I have not done much financial programming, so maybe somebody else can
    complement my views.

    - anton

    Many bonds are time series polynomials with non-integer times.
    Many calculations use pow(base,exp) to a non-integer exponent.

    Looking at Black-Scholes derivative and options pricing

    https://en.wikipedia.org/wiki/Black-Scholes_pricing_formula#Black%E2%80%93Scholes_formula

    I see exp(), ln(), sqrt().
    I don't see any rules for accuracy.
    I found double was fine for calculating bonds.
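    For concreteness, the textbook call-price formula in binary64 (Java; a
    sketch only - the parameters and the polynomial normal-CDF
    approximation are my assumptions, since Java has no built-in erf):

        import static java.lang.Math.*;

        public class BlackScholes {
            // Standard normal CDF via an Abramowitz-Stegun 26.2.17 style
            // polynomial approximation (about 1e-7 absolute error).
            static double cnd(double x) {
                double t = 1.0 / (1.0 + 0.2316419 * abs(x));
                double poly = t * (0.319381530 + t * (-0.356563782
                            + t * (1.781477937 + t * (-1.821255978
                            + t * 1.330274429))));
                double tail = exp(-0.5 * x * x) / sqrt(2 * PI) * poly;
                return x >= 0 ? 1.0 - tail : tail;
            }

            // European call: spot s, strike k, rate r, volatility sigma,
            // time t in years; uses exp(), log(), sqrt() as noted above.
            static double call(double s, double k, double r,
                               double sigma, double t) {
                double d1 = (log(s / k) + (r + 0.5 * sigma * sigma) * t)
                          / (sigma * sqrt(t));
                double d2 = d1 - sigma * sqrt(t);
                return s * cnd(d1) - k * exp(-r * t) * cnd(d2);
            }

            public static void main(String[] args) {
                System.out.println(call(100, 100, 0.05, 0.2, 1.0)); // ~10.45
            }
        }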

    There are lots of other calculations:

    https://en.wikipedia.org/wiki/Financial_mathematics


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 11:05:41 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    For multiplication, one common operation is to multiply a price with a
    number of pieces resulting in a price, and no rescaling is necessary
    there. Another common operation is to compute a percentage; you do
    have rescaling there.

    One interesting aspect is that the interest rates I have seen are
    multiples of 1/800 (e.g., 1 3/4%=7/4%=14/8%=14/800). One can also
    represent these through decimal scales, but the decimal scale that
    allows to represent them is 1/100000 (1/800=125/100000). It may be
    more economical in bits to scale with 1/800 (or maybe 1/1600 to be
    prepared the next innovation in finance).

    For tax rates, IIRC I have also seen half percentages, so using a
    1/800 or 1/1600 scale factor may be a good idea for them, too.

    - anton

    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8. I can't remember when they
    switched to publishing in decimal.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jan 5 11:55:21 2026
    From Newsgroup: comp.arch

    If you look at Java's BigDecimal operations <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
    they all have versions without and with MathContext (which includes a
    target scale and a rounding mode), and sometimes additional variants
    (e.g., divide() has variants where you pass just the rounding mode, or
    the rounding mode and scale individually instead of through a
    MathContext).

    I wonder how that would compare in practice with a Rational type, where
    all arithmetic operations are exact (and thus don't need anything like
    a MathContext) and you simply provide a rounding function that takes
    two arguments: a "target scale" (in the form of a target denominator)
    and a rounding mode.
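    A minimal sketch of such a Rational type (Java; the class name and the
    choice of BigInteger components are mine, just to make the idea
    concrete):

        import java.math.BigDecimal;
        import java.math.BigInteger;
        import java.math.RoundingMode;

        final class Rational {
            final BigInteger num, den;  // invariant: den > 0, gcd(num,den)=1
            Rational(BigInteger n, BigInteger d) {
                if (d.signum() < 0) { n = n.negate(); d = d.negate(); }
                BigInteger g = n.gcd(d);
                num = n.divide(g); den = d.divide(g);
            }
            Rational add(Rational o) {  // exact, no rounding anywhere
                return new Rational(
                    num.multiply(o.den).add(o.num.multiply(den)),
                    den.multiply(o.den));
            }
            Rational mul(Rational o) {
                return new Rational(num.multiply(o.num), den.multiply(o.den));
            }
            Rational div(Rational o) {  // exact; division by zero is an error
                return new Rational(num.multiply(o.den), den.multiply(o.num));
            }
            // The single lossy step: round to multiples of 1/targetDen
            // with an explicit rounding mode.
            BigInteger roundTo(BigInteger targetDen, RoundingMode mode) {
                return new BigDecimal(num.multiply(targetDen))
                    .divide(new BigDecimal(den), 0, mode)
                    .toBigIntegerExact();
            }
        }

    Calling roundTo(BigInteger.valueOf(100), RoundingMode.HALF_EVEN) on
    such a value would then yield, say, an amount in cents.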

    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible. Maybe also figure out
    how to eliminate the use of bignums for the numerators. ]


    - Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Jan 5 19:26:05 2026
    From Newsgroup: comp.arch

    On Mon, 05 Jan 2026 11:55:21 -0500
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    If you look at Java's BigDecimal operations <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
    they all have versions without and with MathContext (which
    includes a target scale and a rounding mode), and sometimes
    additional variants (e.g., divide() has variants where you pass
    just the rounding mode, or the rounding mode and scale individually
    instead of through a MathContext).

    I wonder how that would compare in practice with a Rational type,
    where all arithmetic operations are exact (and thus don't need
    anything like a MathContext) and you simply provide a rounding
    function that takes two arguments: a "target scale" (in the form of a
    target denominator) and a rounding mode.

    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible. Maybe also figure out
    how to eliminate the use of bignums for the numerators. ]


    - Stefan
    exp() would be a challenge.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 17:40:15 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    If you look at Java's BigDecimal operations
    <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
    they all have versions without and with MathContext (which includes a
    target scale and a rounding mode), and sometimes additional variants
    (e.g., divide() has variants where you pass just the rounding mode, or
    the rounding mode and scale individually instead of through a
    MathContext).

    I wonder how that would compare in practice with a Rational type, where
    all arithmetic operations are exact (and thus don't need anything like
    a MathContext) and you simply provide a rounding function that takes
    two arguments: a "target scale" (in the form of a target denominator) and
    a rounding mode.

    BigDecimal is almost like what you imagine, except that the
    denominators are always powers of 10. Without a MathContext, addition,
    subtraction, and multiplication are exact, and division is also exact
    or produces an exception.

    The MathContext consists of the target precision (significant digits)
    and the rounding mode.

    Proper rational arithmetic (used, IIRC, in Prolog II) is also exact for
    division (and has no rounding), but you can get really long numerators
    and denominators.

    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible.

    For the kind of fixed point used for financial calculation rules, the
    scale of every calculation is statically known (it comes out of the
    rules), so a compiler for a programming language that has such
    fixed-point numbers as a native type (Cobol, Ada, anything else?) does
    not need to check every time whether rescaling is necessary (which
    probably happens for Java's BigDecimal).
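    A sketch of what a statically known scale buys (Java stand-in for the
    Cobol/Ada situation; the Money2 class is a made-up illustration):

        // Scale fixed at 2 decimal digits by the type itself; no run-time
        // rescaling checks are ever needed.
        final class Money2 {
            final long units;   // value * 100: 12.34 is stored as 1234
            Money2(long units) { this.units = units; }
            // same scale on both operands and the result: exact
            Money2 add(Money2 o) { return new Money2(units + o.units); }
            // price * number of pieces: result keeps the same scale
            Money2 times(long pieces) { return new Money2(units * pieces); }
        }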

    Maybe also figure out
    how to eliminate the use of bignums for the numerators.

    I don't think that's possible if the language specifies
    arbitrary-precision arithmetic, because the program processes input
    data that is coming from data sources that can contain
    arbitrarily large numbers.

    What is possible, and is done in various dynamically-typed languages,
    is to have the common case (a bignum that's actually small) unboxed, and
    use boxing only in those cases where the number exceeds the range of
    unboxed numbers. I have looked at the OpenJDK BigInteger
    implementation, and there the BigInteger is always boxed (as is
    everything else in Java that is an object).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Jan 5 18:03:08 2026
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8.

    17 1/8% = 137/8% = 137/800

    I can't remember when they
    switched to publishing in decimal.

    But all those that I have seen published in decimal are also multiples
    of 1/8%, i.e., of 1/800.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 13:51:27 2026
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8.

    17 1/8% = 137/8% = 137/800

    I can't remember when they
    switched to publishing in decimal.

    But all those that I have seen published in decimal are also multiples
    of 1/8%, i.e., of 1/800.

    - anton

    Ah... it was due to Spanish traders and gold doubloons about 400 years ago.

    Why the NYSE Once Reported Prices in Fractions
    https://www.investopedia.com/ask/answers/why-nyse-switch-fractions-to-decimals/


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Jan 5 14:21:37 2026
    From Newsgroup: comp.arch

    EricP wrote:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8.

    17 1/8% = 137/8% = 137/800

    I can't remember when they
    switched to publishing in decimal.

    But all those that I have seen published in decimal are also multiples
    of 1/8%, i.e., of 1/800.

    - anton

    Ah... it was due to Spanish traders and gold doubloons about 400 years ago.

    Why the NYSE Once Reported Prices in Fractions
    https://www.investopedia.com/ask/answers/why-nyse-switch-fractions-to-decimals/

    And possibly the factor of 100 comes from the Basis Point, which is 0.01%.

    Basis Point: Meaning, Value, and Uses
    https://www.investopedia.com/terms/b/basispoint.asp

    Basis points only have to do with interest rates or yields, not prices,
    but since prices and yields convert back and forth, maybe in bygone days
    it was easier to do calculations in fixed-point quanta of 1/800.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Mon Jan 5 14:33:57 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    I wonder how that would compare in practice with a Rational type,
    where all arithmetic operations are exact (and thus don't need
    anything like a MathContext) and you simply provide a rounding
    function that takes two arguments: a "target scale" (in the form of a
    target denominator) and a rounding mode.

    [ Note: those two arguments together are basically the same thing as
    a MathContext. ]

    Michael S [2026-01-05 19:26:05] wrote:
    exp() would be a challenge.

    [ I assume you mean the case where the exponent is non-integer. ]
    Not any more than for other approaches, AFAICT. Most likely it would
    convert to and from float. If you want to be thorough, you'd let it
    take a MathContext argument and use arbitrary-precision floats to
    perform the computation if the precision of IEEE floats isn't
    sufficient.

    Anton Ertl [2026-01-05 17:40:15] wrote:
    BigDecimal is almost like what you imagine, except that the
    denominators are always powers of 10. Without a MathContext, addition,
    subtraction, and multiplication are exact, and division is also exact
    or produces an exception.

    Hmm... so they're rationals limited to denominators that are powers
    of 10? I guess it does save them from GCD-style computations to simplify
    the fractions.

    Proper rational arithmetic (used, IIRC, in Prolog II) is also exact
    for division (and has no rounding),

    It's also available in many other languages as part of the
    standard library.

    but you can get really long numerators and denominators.

    The only case where the numerators would need to get larger than for
    BigDecimal is for division (when BigDecimal produces an exception), so
    I guess that would argue in favor of providing an additional division
    operation that takes something like a MathContext to avoid the
    computation of the exact result before doing the rounding (or signaling
    an exception).

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible.

    Anton Ertl [2026-01-05 17:40:15] wrote:
    For the kind of fixed point used for financial calculation rules, the
    scale of every calculation is statically known (it comes out of the
    rules), so a compiler for a programming language that has such fixed
    point numbers as native type (Cobol, Ada, anything else?) does not
    need to check every time whether rescaling is necessary (which
    probably happens for Java's BigDecimal).

    Indeed, that's why I was suggesting it would be useful for the compiler
    to try and optimize away the management of the scales. If you can move
    it to types, it's of course even better, since it removes the need for
    the optimizer to figure out when and if the optimization can be
    applied.

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Maybe also figure out how to eliminate the use of bignums for
    the numerators.

    Anton Ertl [2026-01-05 17:40:15] wrote:
    I don't think that's possible if the language specifies
    arbitrary-precision arithmetic, because the program processes input
    data that is coming from data sources that can contain
    arbitrarily large numbers.

    I was assuming we're free to define the semantics of the Rational type,
    e.g. specifying a limit to the precision.

    What is possible, and is done in various dynamically-typed languages
    is to have the common case (a bignum that's actually small) unboxed, and
    use boxing only in those cases where the number exceeds the range of
    unboxed numbers.

    Or that, indeed.
    [ Tho, I tend to use the word "boxing" in a different way, where
    I consider both cases "boxed" (i.e. made to fit in a fixed-size
    (typically 64bit) "box"), just that one of them involves placing the
    data in a separate memory location and putting the "pointer + tag" in
    the box, whereas the other puts the "small integer + tag" in that
    same box. ]


    - Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Jan 6 12:35:20 2026
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than two bytes orders of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal, any mantissa
    corresponding to a number greater than the maximum allowed (1e34 AFAIR)
    is also illegal, and there are rules for how to handle both cases
    (without checking, I seem to remember that they should be treated as zero?)

    Happy New Year everyone!

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Jan 6 15:26:46 2026
    From Newsgroup: comp.arch

    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal.
    Silently accepted as input operands but never produced as results.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle both
    cases (without checking, i seem to remember that they should be
    treated as zero?)


    BID significand extension > max is indeed treated as zeros.
    Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i are
    in range [0:1].
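    Spelled out (a trivial Java loop; the "three apiece" is just the
    arithmetic of 24 spare patterns spread over these 8 values):

        public class NonCanonical {
            public static void main(String[] args) {
                for (int c = 0; c <= 1; c++)
                    for (int f = 0; f <= 1; f++)
                        for (int i = 0; i <= 1; i++)
                            // digit triple (8+c, 8+f, 8+i):
                            // 888, 889, 898, 899, 988, 989, 998, 999
                            System.out.println(
                                (8 + c) * 100 + (8 + f) * 10 + (8 + i));
            }
        }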

    Happy New Year everyone!

    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Jan 6 17:06:52 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as result.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle both
    cases (without checking, i seem to remember that they should be
    treated as zero?)


    BID significand extension > max is indeed treated as zeros.
    Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i are
    in range [0:1].

    OK, that is probably because allowing them on input is significantly faster/cheaper than having to detect and modify/trap/erase.

    That said, out-of-range mantissas could also have been accepted, except
    they would not have had a valid conversion to either DPD or ASCII.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Jan 6 17:26:22 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    Anton Ertl [2026-01-05 17:40:15] wrote:
    BigDecimal is almost like what you imagine, except that the
    denominators are always powers of 10. Without MathContext addition,
    subtraction, and multiplication are exact, and division is also exact
    or produces an exception.

    Hmm... so they're rationals limited to denominators that are powers
    of 10? I guess it does save them from GCD-style computations to simplify
    the fractions.

    Yes.

    but you can get really long numerators and denominators.

    The only case where the numerators would need to get larger than for
    BigDecimal is for division (when BigDecimal produces an exception), so
    I guess that would argue in favor of providing an additional division
    operation that takes something like a MathContext to avoid the
    computation of the exact result before doing the rounding (or signaling
    an exception).

    If you have rationals as input numbers, addition and subtraction
    result in a denominator that has up to the sum of the number of bits in
    the operand denominators, and the numerator of the result can also get
    much larger than the numerators of the operands. Now consider what
    happens if you add up a lot of rational numbers.
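    A quick demonstration with exact fractions (Java; summing 1/n up to 50
    is just an arbitrary example):

        import java.math.BigInteger;

        public class DenGrowth {
            public static void main(String[] args) {
                BigInteger num = BigInteger.ZERO, den = BigInteger.ONE;
                for (int n = 2; n <= 50; n++) {    // num/den += 1/n, reduced
                    num = num.multiply(BigInteger.valueOf(n)).add(den);
                    den = den.multiply(BigInteger.valueOf(n));
                    BigInteger g = num.gcd(den);
                    num = num.divide(g);
                    den = den.divide(g);
                }
                // After only 49 exact additions the reduced denominator is
                // already dozens of bits (it divides lcm(2..50), roughly 2^71):
                System.out.println(den.bitLength());
            }
        }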

    For multiplication, in the worst case the numerators have the sum of
    the bits of the operand numerators, and likewise for the
    denominators.

    If you only have integer inputs and don't have division, that's
    (big?) integers, not rational numbers.

    Concerning the idea of division with inexact results, I don't think that
    those people who choose to use rational numbers will choose to use
    such a division operation in the usual case. If they were satisfied
    with inexact results, they would have chosen FP (or, for special
    requirements, something like BigDecimal).

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    Maybe also figure out how to eliminate the use of bignums for
    the numerators.

    Anton Ertl [2026-01-05 17:40:15] wrote:
    I don't think that's possible if the language specifies
    arbitrary-precision arithmetic, because the program processes input
    data that is coming from data sources that can contain
    arbitrarily large numbers.
    arbitrarily-large numbers.

    I was assuming we're free to define the semantics of the Rational type,
    e.g. specifying a limit to the precision.

    For a general rational type, I expect that it would overflow a fixed
    size often enough that that would not be practical.

    For something like BigDecimal and a use in financial institutions, I
    think that a 128-bit mantissa (38 significant digits) is good enough
    for representing any amount of currency occurring in practice.

    [ Tho, I tend to use the word "boxing" in a different way, where
    I consider both cases "boxed" (i.e. made to fit in a fixed-size
    (typically 64bit) "box"), just that one of them involves placing the
    data in a separate memory location and putting the "pointer + tag" in
    the box, whereas the other puts the "small integer + tag" in that
    same box. ]

    https://en.wikipedia.org/wiki/Boxing_(computer_programming)

    describes the meaning in common use (that I use, too).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 17:56:23 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than two bytes orders of IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global state
    and interoperability of BFP and DFP in the same process. Like whether
    BFP and DFP have a common rounding mode or each one has a mode of its
    own. The same question applies to exception flags and exception masks.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 17:59:33 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as result.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle both
    cases (without checking, i seem to remember that they should be
    treated as zero?)


    BID significand extension > max is indeed treated as zeros.
    Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i are
    in range [0:1].

    OK, that is probably because allowing them on input is significantly faster/cheaper than having to detect and modify/trap/erase.

    With the calculation latencies of IBM Z-series, modify/trap/erase is
    no problem.

    That said, out of range mantissas could also have been accepted, except
    they would not have had a valid conversion to either DPD or ascii.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Jan 6 20:12:48 2026
    From Newsgroup: comp.arch

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global
    state and interoperability of BFP and DFP in the same process.
    Like whether BFP and DFP have a common rounding mode or each one has
    a mode of its own. The same question applies to exception flags and
    exception masks.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Jan 6 17:50:36 2026
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    I don't know about the 800 but stock and bond prices used to be
    published with fractions like 17 1/8.

    17 1/8% = 137/8% = 137/800

    I can't remember when they
    switched to publishing in decimal.

    But all those that I have seen published in decimal are also multiples
    of 1/8%, i.e., of 1/800.

    - anton

    Ah... it was due to Spanish traders and gold doubloons about 400 years ago.
    Why the NYSE Once Reported Prices in Fractions
    https://www.investopedia.com/ask/answers/why-nyse-switch-fractions-to-decimals/

    That's interesting, but it is about prices, not percentages (e.g.,
    interest rates or taxes).

    And possibly the factor of 100 comes from the Basis Point, which is 0.01%.

    If you mean the factor of 100 in 1/800, that comes from the %:
    1/100 is another way to write 1% (the "cent" in percent is 100, and
    the per indicates the division); in German some people write vH ("von
    Hundert", i.e. "of hundred") instead of % (IIRC especially in lawyery contexts).

    So when I write 1/8%, that's one 800th of the whole, i.e. 1/800.

    If you have to deal with bp (1bp=1/10000 of the whole), you need the
    scale factor 1/10000. If you have to deal with both bp and 1/8%, you
    can represent both with a scale factor of 1/20000:

    1bp = 2/20000
    1/8% = 25/20000

    But in that case I would probably go directly to a scale factor of
    1/100000, because the savings of 1/20000 over that are not big, and
    one is prepared for people using pcm.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Jan 6 20:15:15 2026
    From Newsgroup: comp.arch

    On Tue, 06 Jan 2026 17:59:33 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit of a
    mess.

    Since both formats have exactly identical semantics, in theory
    the mess is not worse (and not better) than two bytes orders
    of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as result.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle
    both cases (without checking, i seem to remember that they
    should be treated as zero?)


    BID significand extension > max is indeed treated as zeros.
    Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i
    are in range [0:1].

    OK, that is probably because allowing them on input is
    significantly faster/cheaper than having to detect and
    modify/trap/erase.

    With the calculation latencies of IBM Z-series, modify/trap/erase is
    no problem.


    How do you know the calculation latencies of the IBM Z-series?
    Did they make that information public?

    That said, out of range mantissas could also have been accepted,
    except they would not have had a valid conversion to either DPD or
    ascii.

    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Jan 6 18:14:51 2026
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Many bonds are time series polynomials with non-integer times.
    Many calculations use pow(base,exp) to a non-integer exponent.

    Looking at Black-Scholes derivative and options pricing

    https://en.wikipedia.org/wiki/Black-Scholes_pricing_formula#Black%E2%80%93Scholes_formula

    I see exp(), ln(), sqrt().
    I don't see any rules for accuracy.
    I found double was fine for calculating bonds.

    There are lots of other calculations:

    https://en.wikipedia.org/wiki/Financial_mathematics

    My guess is that for the kinds of complex computations where you have
    such things, you don't get such detailed rules as for stuff like taxes
    and accounting which have relatively simple computations.

    And in that case it's probably ok not to use fixed point or other
    decimal-based stuff; I guess that the parties involved specify a
    specific number of significant digits (maybe 10; who cares for EUR 0.1
    in the interest of a EUR 1G account?), and then binary64 is good
    enough if the formula is numerically stable. In case of instability
    or a badly conditioned problem, the parties involved should probably
    put the formula in the contract, and then do the calculation exactly
    as in the formula. Should be good enough to show due diligence.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 19:29:40 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than two bytes orders of IEEE
    binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    More like justification for the facility itself.

    But also note: DPD can be converted to ASCII with all digits in
    parallel, without any division going on. Binary does not have that
    property.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global
    state and interoperability of BFP and DFP in the same process.
    Like whether BFP and DFP have a common rounding mode or each one has
    a mode of its own. The same question applies to exception flags and
    exception masks.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Jan 6 19:35:34 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:59:33 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit of a
    mess.

    Since both formats have exactly identical semantics, in theory
    the mess is not worse (and not better) than two bytes orders
    of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as result.

    any mantissa
    corresponding to a number greater than the maximum allowed (1e34
    afair) is also illegal, and there are rules for how to handle
    both cases (without checking, i seem to remember that they
    should be treated as zero?)


    BID significand extension > max is indeed treated as zeros. Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f, and i
    are in range [0:1].

    OK, that is probably because allowing them on input is
    significantly faster/cheaper than having to detect and
    modify/trap/erase.

    With the calculation latencies of IBM Z-series, modify/trap/erase is
    no problem.


    How do you know the calculation latencies of the IBM Z-series?
    Did they make that information public?

    9-15 months ago there was a presentation of their latest mainframe
    showing the pipeline lengths.

    Decode was on the order of 20 cycles, down from the top left;
    execute was horizontal across the middle;
    retire was on the order of 12 cycles, down from the top right.

    That said, out of range mantissas could also have been accepted,
    except they would not have had a valid conversion to either DPD or
    ascii.

    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Jan 6 23:09:33 2026
    From Newsgroup: comp.arch

    On Tue, 06 Jan 2026 19:35:34 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:59:33 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Michael S wrote:
    On Tue, 6 Jan 2026 12:35:20 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit
    of a mess.

    Since both formats have exactly identical semantics, in
    theory the mess is not worse (and not better) than two
    bytes orders of IEEE binary FP.

    Almost.

    IIRC, there is no restriction on the binary mantissa, so its
    range is slightly larger for the same number of bits
    (1000/1024)**(n/3).

    Sorry, that's wrong:

    Just like the 24 "spare" DPD patterns are illegal,

    Non-canonical, which is not the same as illegal
    Silently accepted as input operands but never produced as
    result.
    any mantissa
    corresponding to a number greater than the maximum allowed
    (1e34 afair) is also illegal, and there are rules for how to
    handle both cases (without checking, i seem to remember that
    they should be treated as zero?)


    BID significand extension > max is indeed treated as zeros. Non-canonical DPD declets have non-zero values.
    They are forms of (8+c)*100 + (8+f)*10 + (8+i), where c, f,
    and i are in range [0:1].

    OK, that is probably because allowing them on input is
    significantly faster/cheaper than having to detect and modify/trap/erase.

    With the calculation latencies of IBM Z-series, modify/trap/erase
    is no problem.


    How do you know the calculation latencies of the IBM Z-series?
    Did they make that information public?

    9-15 months ago there was a presentation of their latest mainframe
    showing the pipeline lengths.

    Decode was on the order of 20 cycles, down from the top left;
    execute was horizontal across the middle;
    Retire was on the order of 12 cycles, down from the top right;


    Back 20 years ago Intel used to have pipelines of comparable depth
    (IIRC, ~35 cycles in the 3rd and 4th generations of Pentium 4). But
    despite that, latency of simple ALU ops was 1 clock. Latency of L1D hit
    was 4 clocks, long for 2005, but standard today. Latencies of FMUL
    and FADD were 7 and 5 clocks, respectively - long, but not
    extraordinary.

    IBM's own POWER6 18 years ago had an integer pipeline close to 30
    stages and an FP pipeline of around 35 stages. However, FP MUL/ADD/FMA
    latency was 6 or 7 clocks.

    I would expect similar or shorter latency figures for BFP on modern IBM
    z. Likely shorter, because today they have far more silicon to throw
    at various bypasses.
    Now, in the case of DFP I don't want to guess, because I have no basis
    for guessing.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Tue Jan 6 22:06:00 2026
    From Newsgroup: comp.arch

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the field I
    work in, that would make some things much simpler. I did try to interest
    AMD in the idea in the early days of x86-64, but they didn't bite.

    John
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 13:34:24 2026
    From Newsgroup: comp.arch

    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John
    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?
    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez exchange spiced with a small portion of black magic)
    was not up to the task. The GNU Octave implementation was somewhat
    worse yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    or a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate results
    caused trouble.
    In both cases quad-precision FP was key to the solution.
    For the DCT (FFT) I went for a full re-implementation at higher
    precision.
    For the linear solver, I left the LU decomposition, which happens to be
    the heavy O(N**3) part, in DP. Quad precision was applied only during
    the final solver stages - forward propagation, back propagation,
    calculation of the residual error vector, and repetition of forward and
    back propagation. All those parts are O(N**2). That modification was
    sufficient to improve the precision of the result almost to the best
    possible in the DP FP format, and sufficient for good convergence of
    the Parks-McClellan algorithm.
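
    For flavor, a minimal sketch of that refinement scheme (Java, which
    has no binary128, so BigDecimal's 34-digit DECIMAL128 context stands
    in for quad precision here; the tiny solver and example system are
    illustrative only):

        import java.math.BigDecimal;
        import java.math.MathContext;

        public class Refine {
            static final MathContext QP = MathContext.DECIMAL128; // 34 digits

            // Plain Gaussian elimination with partial pivoting, in double.
            // (A real implementation factors once and reuses the LU
            // factors, keeping each refinement step O(N**2) as above.)
            static double[] solveDP(double[][] a0, double[] b0) {
                int n = b0.length;
                double[][] a = new double[n][];
                double[] b = b0.clone();
                for (int i = 0; i < n; i++) a[i] = a0[i].clone();
                for (int k = 0; k < n; k++) {
                    int p = k;
                    for (int i = k + 1; i < n; i++)
                        if (Math.abs(a[i][k]) > Math.abs(a[p][k])) p = i;
                    double[] tr = a[k]; a[k] = a[p]; a[p] = tr;
                    double tb = b[k]; b[k] = b[p]; b[p] = tb;
                    for (int i = k + 1; i < n; i++) {
                        double m = a[i][k] / a[k][k];
                        for (int j = k; j < n; j++) a[i][j] -= m * a[k][j];
                        b[i] -= m * b[k];
                    }
                }
                double[] x = new double[n];
                for (int i = n - 1; i >= 0; i--) {
                    double s = b[i];
                    for (int j = i + 1; j < n; j++) s -= a[i][j] * x[j];
                    x[i] = s / a[i][i];
                }
                return x;
            }

            // One refinement step: residual in extended precision,
            // correction solved in DP, then x updated in place.
            static void refine(double[][] a, double[] b, double[] x) {
                int n = b.length;
                double[] r = new double[n];
                for (int i = 0; i < n; i++) {
                    BigDecimal s = new BigDecimal(b[i]);
                    for (int j = 0; j < n; j++)
                        s = s.subtract(new BigDecimal(a[i][j])
                                .multiply(new BigDecimal(x[j]), QP), QP);
                    r[i] = s.doubleValue();
                }
                double[] d = solveDP(a, r);
                for (int i = 0; i < n; i++) x[i] += d[i];
            }

            public static void main(String[] args) {
                double[][] a = {{1e-8, 1}, {1, 1}};
                double[] b = {1, 2};
                double[] x = solveDP(a, b);
                refine(a, b, x);
                System.out.println(x[0] + " " + x[1]);
            }
        }
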
    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were
    not heavy. They were heavy all right, thank you very much.
    Using quad precision only when necessary helped.
    But what helped more is not being hesitant: doing things instead of
    worrying that they would be too slow.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 15:06:27 2026
    From Newsgroup: comp.arch

    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit of a
    mess.

    Since both formats have exactly identical semantics, in theory
    the mess is not worse (and not better) than two bytes orders of
    IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    More like justification for the facility itself.

    But also note: DPD can be converted to ASCII with all digits in
    parallel, without any division going on. Binary does not have that
    property.


    I took a look at how IBM exploits this property.
    I don't have an up-to-date zArch manual. It's probably easily
    available, but right now I have no time or desire to search.
    So I looked at POWER, which tends to copy DFP stuff from zArch with a
    one-generation time gap.
    POWER ISA v.3.0 (2015) has following relevant instructions:

    ddedpd - DFP Decode DPD to BCD
    For Decimal128 it has two forms
    - convert 32 rightmost digits of significand (unsigned)
    - convert 31 rightmost digits of significand (signed)
    With IBM I am never sure what they call 'rightmost' :(

    If you wonder about the couple of remaining digits, IBM has the
    following helper instructions:
    dscli - DFP shift significand left immediate
    dscri - DFP shift significand right immediate
    Once again, since it's IBM, I am not sure about directions.

    dxex - DFP extract biased exponent.
    The exponent is extracted in binary form, not in BCD.


    They also have instructions that work in the opposite direction:
    denbcd - DFP encode BCD to DPD
    It converts a signed 31-digit or unsigned 32-digit BCD-encoded integer
    to DPD with exponent = 0.

    diex - DFP insert biased exponent.
    Here too the exponent is in binary form, not in BCD.


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.
    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From jgd@jgd@cix.co.uk (John Dallman) to comp.arch on Wed Jan 7 13:16:00 2026
    From Newsgroup: comp.arch

    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you couple of years ago how fast do want binary128
    in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave totally
    unrealistic answer like "the same as binary64".
    May be, nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware binary128?

    John
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jan 7 15:24:38 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 17:55:07 2026
    From Newsgroup: comp.arch

    On Wed, 7 Jan 2026 13:16 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <20260107133424.00000e99@yahoo.com>,
    already5chosen@yahoo.com (Michael S) wrote:

    I already asked you couple of years ago how fast do want binary128
    in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave totally
    unrealistic answer like "the same as binary64".
    May be, nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    John

    I think that you are asking the wrong question, but I'd answer nevertheless.

    For a design that takes a significant amount of design effort and
    non-trivial silicon resources (say +3-5% in core area):

    SIMD throughput within the same TDP: 1/4th of DP FP; maybe 1/3rd if the
    designers worked very hard.
    Latency: assuming a 4-clock FMA for DP FP, a 9-clock QP FMA sounds very
    realistic; maybe 8 clocks. Mitch could answer better. FADD can be
    faster - 6 sounds realistic.

    Another extreme in the design space is what IBM did on POWER9.
    I would guess that there the silicon resources dedicated to
    quad-precision BFP were below 0.5% of the core area. Likely below 0.1%.
    They did a scalar (i.e., non-SIMD) quad-FP unit. FADD is
    pipelined, but FMUL/FMA is very minimally pipelined (at most 2
    operations proceed simultaneously).
    Throughput/latency table (T = throughput in ops/clock, L = latency in
    clocks):

    Oper    DP T   DP L   QP T   QP L
    ADD     4      5-7    1      12
    MUL     4      5-7    1/13   24
    MADD    4      5-7    1/13   24

    As you can see, POWER9 double-precision throughput/latency numbers are
    somewhat worse than what we are accustomed to on x86-64 and on high-end
    ARM64. However, even relative to those not-great numbers, the throughput
    of QP FMA is 52 times lower and the latency ~4 times higher.


    And still, it all depends on the application. If all an application
    does is multiplication or decomposition of big matrices, then migration
    to a minimalistic QP engine similar to the one in POWER9 will cause a
    major slowdown (then again, as shown by my anecdote, sometimes DP is
    inadequate for exactly those tasks).
    But most applications are not like that. Even for activities that are
    normally considered numerically intensive, like recalculation of a huge
    spreadsheet, I'd expect at most a few percent of slowdown on POWER9 QP.
    I don't know where your application is placed in this spectrum. Most
    likely, you don't know either. Until you try! And in order to get a
    feeling you don't need hardware. Use a software implementation.
    Experiments, even with a not-very-good software implementation like the
    one in gcc on x86-64 and ARM64, will give you a massively better feel
    for a lot of questions. They will put you in a position of knowledge
    when proposing something to AMD, Intel or Arm.
    Which, of course, does not guarantee that they will bite your bait. I'd
    even dare to say that for as long as the current "AI" bubble lasts they
    will not bite. But it will not last forever.
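
    For example, gcc already gives you software binary128 as __float128,
    with printing helpers in libquadmath; a minimal sketch (link with
    -lquadmath):

        #include <quadmath.h>
        #include <stdio.h>

        int main(void)
        {
            __float128 x = 1.0Q / 3.0Q;   /* soft-float binary128 */
            char buf[64];
            quadmath_snprintf(buf, sizeof buf, "%.33Qg", x);
            puts(buf);
            return 0;
        }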



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 18:06:59 2026
    From Newsgroup: comp.arch

    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide
    word. I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'
    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jan 7 16:41:26 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
    a move from the BCD '0123456789abcdef' to the corresponding ASCII
    bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT DATA 8 UN 0123456789 // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Jan 7 17:32:17 2026
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
    a move from the BCD '0123456789abcdef' to the corresponding ASCII
    bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT DATA 8 UN 0123456789 // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation on strings
    of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given such a (fast) transformation it is very easy to add the proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.
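
    A minimal C sketch of that masking-and-shifting idea, one 8-digit half
    at a time (names illustrative): each step doubles the spacing between
    digits, and the final OR adds the zone.

        #include <stdint.h>

        /* Spread 8 packed BCD digits (one per nibble) into one digit
           per byte, then zone to ASCII. */
        static uint64_t bcd8_to_ascii(uint32_t packed)
        {
            uint64_t x = packed;
            x = ((x << 16) | x) & 0x0000FFFF0000FFFFull;
            x = ((x <<  8) | x) & 0x00FF00FF00FF00FFull;
            x = ((x <<  4) | x) & 0x0F0F0F0F0F0F0F0Full;
            return x | 0x3030303030303030ull;   /* '0'..'9' */
        }

    For Michael's 16-digit example, call it twice, once for each 32-bit
    half; the two halves are independent.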
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Jan 7 18:38:26 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than the two byte orders of IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global state
    and the interoperability of BFP and DFP in the same process. Like whether
    BFP and DFP have a common rounding mode or each one has a mode of its own.
    The same question applies to exception flags and exception masks.

    In hardware a division by 10 is either an adjustment of the exponent,
    which is equally fast for both encodings, or for a real division DPD
    just requires unpacking of all the 10-bit fields (I'm assuming IBM does
    this in parallel for all 11 groups, so max one cycle), then shifting all
    the nybbles down one position before the reverse to pack them back up, probably including a rounding step before the repack.

    This operation is very closely related to the general case of having to re-normalize after any operation which would require that, i.e. commonly
    for DFMUL, very seldom for DFADD/DFSUB, and almost always for DFDIV.

    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int. No matter how we do it, two iterations are enough to
    handle even maximally large numbers, and that would require 4 64x64->128
    MULs per iteration. Since these would pipeline nicely I'm guessing it
    would be doable in 15-30 cycles total?
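
    The basic building block is well known; a sketch in C, using the exact
    reciprocal for 10 (two such steps, on 128-bit products, would handle a
    full-width significand as described above):

        #include <stdint.h>

        /* x/10 for any uint64_t x: multiply by 0xCCCCCCCCCCCCCCCD,
           which is ceil(2^67/10), and keep the high bits
           (gcc/clang unsigned __int128). */
        static uint64_t div10(uint64_t x)
        {
            unsigned __int128 p = (unsigned __int128)x * 0xCCCCCCCCCCCCCCCDull;
            return (uint64_t)(p >> 67);
        }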

    So yes, scaling by a power of ten is the one operation where DPD is
    clearly much faster, but if you try to implement DPD in software, then
    you have to handle the unpack and pack operations, and they could easily
    take the same or even more time.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 17:39:25 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 7 Jan 2026 13:16 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <20260107133424.00000e99@yahoo.com>,
    already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    John

    I think that you are asking the wrong question, but I'll answer nevertheless.

    For a design that takes a significant amount of design effort and
    non-trivial silicon resources (say +3-5% in core area):

    SIMD throughput within the same TDP: 1/4th of DP FP. Maybe 1/3rd if
    the designers worked very hard.
    Latency: assuming a 4-clock FMA for DP FP, a 9-clock QP FMA sounds very
    realistic. Maybe 8 clocks. Mitch could answer better. FADD can be
    faster - 6 clocks sounds realistic.

    Those are realistic numbers when the designers work hard AND operands
    are shipped in 1 cycle--add 1 if B128 is shipped in 2 cycles.

    The other extreme in the design space is what IBM did on POWER9.
    I would guess that the silicon resources dedicated to quad-precision
    BFP there were below 0.5% of the core area. Likely, below 0.1%.
    They did a scalar (i.e. non-SIMD) quad-precision FP unit. FADD is
    pipelined, but FMUL/FMA is only minimally pipelined (at most 2
    operations proceed simultaneously).
    Throughput/latency table (T = results per clock, L = clocks):

    Oper :  DP T  DP L  | QP T  QP L
    ADD  :  4     5-7   | 1     12
    MUL  :  4     5-7   | 1/13  24
    MADD :  4     5-7   | 1/13  24

    As you can see, POWER9 double-precision throughput/latency numbers are
    somewhat worse than what we are accustomed to on x86-64 and on high-end
    ARM64. However, even relative to those not-great numbers, the throughput
    of QP FMA is 52 times lower and the latency ~4 times higher.


    And still, it all depends on the application. If all an application
    does is multiplication or decomposition of big matrices, then migration
    to a minimalistic QP engine similar to the one in POWER9 will cause a
    major slowdown (then again, as shown by my anecdote, sometimes DP is
    inadequate for exactly those tasks).
    But most applications are not like that. Even for activities that are
    normally considered numerically intensive, like recalculation of a huge
    spreadsheet, I'd expect at most a few percent of slowdown on POWER9 QP.
    I don't know where your application is placed in this spectrum. Most
    likely, you don't know either. Until you try! And in order to get a
    feeling you don't need hardware. Use a software implementation.
    Experiments, even with a not-very-good software implementation like the
    one in gcc on x86-64 and ARM64, will give you a massively better feel
    for a lot of questions. They will put you in a position of knowledge
    when proposing something to AMD, Intel or Arm.
    Which, of course, does not guarantee that they will bite your bait. I'd
    even dare to say that for as long as the current "AI" bubble lasts they
    will not bite. But it will not last forever.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 17:44:08 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide
    word. I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'
    Without HW help it is not fast.

    With Extract and Insert instructions this becomes 16 extracts all
    concurrent, and 16 inserts, 8 serially dependent pairs. With shift
    instructions only--you are on your own.

    In HW "its just wires."

    Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Jan 7 18:47:47 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the
    mess is not worse (and not better) than the two byte orders of
    IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    More like justification for the facility itself.

    But also note: DPD can be converted to ASCII, all data in parallel,
    without any division going on. Binary does not have that property.

    a) Conversion to ASCII is _never_ in the critical path, or if it is,
    then the problem is trivial.

    b) Fast Binary to ASCII is a solved problem: i.e. easily doable in less
    than 50 clock cycles even for 128-bit values.
    I invented the original unsigned_to_ascii() conversion algorithm ~30
    years ago, taking advantage of fast multipliers and splitting the input
    into multiple parts which are then converted in parallel using simple
    mul_by_5 operations on a scaled input.

    My original code which I posted here in c.arch was for the 32-bit CPUs
    we had at the time, but extending to 64 or even 128-bit inputs is straightforward.
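
    Not the original code, but a C sketch of the same splitting idea: one
    (strength-reduced) division splits 16 digits into two independent
    8-digit halves, and each half is then converted with one fixed-point
    multiply per digit, no divisions at all:

        #include <stdint.h>

        /* Convert 0 <= v < 10^16 to 16 ASCII digits. 1441151881 is
           ceil(2^57/10^8), so f and g hold each half as a 0.57
           fixed-point fraction; each *10 pushes out the next digit. */
        static void u64_to_ascii16(uint64_t v, char out[16])
        {
            uint64_t hi = v / 100000000;   /* compiled to a multiply */
            uint64_t lo = v % 100000000;
            uint64_t f = hi * 1441151881u;
            uint64_t g = lo * 1441151881u;
            for (int i = 0; i < 8; i++) {
                f *= 10; g *= 10;
                out[i]     = (char)('0' + (f >> 57));
                out[i + 8] = (char)('0' + (g >> 57));
                f &= (1ull << 57) - 1;
                g &= (1ull << 57) - 1;
            }
        }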

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Jan 7 18:56:12 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. The Gnu Octave implementation was somewhat
    worse yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was a Discrete Cosine Transform and the underlying FFT engine
    for N around 32K.
    The second was solving a system of linear equations for N around 1000
    or a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate results
    caused trouble.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left the LU decomposition, which happens to be
    the heavy O(N**3) part, in DP. Quad-precision was applied only during
    the final solver stages - forward propagation, back propagation,
    calculation of the residual error vector and repetition of forward and
    back propagation. All those parts are O(N**2). That modification was
    sufficient to improve the precision of the result almost to the best
    possible in the DP FP format.
    And sufficient for good convergence of the Parks-McClellan algorithm.


    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.
    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    an error budget calculation across their algorithms.
    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.
    For the rest, just having fp128 fast enough that it could be applied
    naively would solve a number of problems.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Jan 7 17:57:34 2026
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    If you look at Java's BigDecimal operations
    <https://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html>,
    they all have versions without and with MathContext (which includes a
    target scale and a rounding mode), and sometimes additional variants
    (e.g., divide() has variants where you pass just the rounding mode, or
    the rounding mode and scale individually instead of through a
    MathContext).

    I wonder how that would compare in practice with a Rational type, where
    all arithmetic operations are exact (and thus don't need anything like
    a MathContext) and you simply provide a rounding function that takes
    two argument: a "target scale" (in the form of a target denominator) and
    a rounding mode.

    [ Extra points for implementing compiler optimizations that keep track
    of the denominators statically to try and do away with the
    denominators at run-time as much as possible. Maybe also figure out
    how to eliminate the use of bignums for the numerators. :-) ]

    I use a computer algebra system every day. One possibility here
    is to use arbitrary-precision fractions. Observations:
    - once there is a more complex computation, numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcds, but if you try to save on
    gcds and work with unsimplified fractions you may get tremendous
    blowup (like million-digit numbers in what should be a reasonably
    small computation).
    - in general, if one needs a numeric approximation, then arbitrary-
    precision (software) floating point is much faster

    BTW: sometimes people write a rational type/class that uses fixed-size
    numerators and denominators. Such a type is useless once
    there is a longer/less regular sequence of operations: fixed-size
    numbers simply overflow too easily.

    BTW2: the usual trick with rational operations is to estimate the size
    of the final result. If the size of the final result is known with
    reasonable accuracy, then computation using finite fields is usually
    much faster and allows exact recovery of the final result.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Jan 7 18:58:04 2026
    From Newsgroup: comp.arch

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Jan 7 10:22:52 2026
    From Newsgroup: comp.arch

    On 1/7/2026 5:06 AM, Michael S wrote:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 06 Jan 2026 17:56:23 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely
    packed decimal encoding of the significand... it's a bit of a
    mess.

    Since both formats have exactly identical semantics, in theory
    the mess is not worse (and not better) than two bytes orders of
    IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.


    Do you consider speed to be part of semantics?
    Just wondering...

    More like justification for the facility itself.

    But also note: DPD can be converted to ASCII all data in parallel
    without any division going on. Binary does not have that property.


    I took a look at how IBM exploits this property.
    I don't have an up-to-date zArch manual. It's probably easily available,
    but right now I have no time or desire to search.
    So I looked at POWER, which tends to copy DFP stuff from zArch with a
    one-generation time gap.
    POWER ISA v.3.0 (2015) has the following relevant instructions:

    ddedpd - DFP Decode DPD to BCD
    For Decimal128 it has two forms
    - convert 32 rightmost digits of significand (unsigned)
    - convert 31 rightmost digits of significand (signed)
    With IBM I am never sure what they call 'rightmost' :(

    If you wonder about the couple of remaining digits, IBM has the
    following helper instructions:
    dscli - DFP shift significand left immediate
    dscri - DFP shift significand right immediate
    Once again, since it's IBM, I am not sure about directions.

    dxex - DFP extract biased exponent.
    The exponent is extracted in binary form, not in BCD.


    They also have instructions that work in the opposite direction:
    denbcd - DFP encode BCD to DPD
    It converts signed 31-digit or unsigned 32-digit BCD-encoded integer to
    DPD with exponent=0

    diex - DFP insert biased exponent.
    Here too the exponent is in binary form, not in BCD.


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.
    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    On Z series, that sounds like the unpack instruction, available since
    the decimal arithmetic extension in the S/360, though, being Z series,
    it uses EBCDIC rather than ASCII.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jan 7 18:38:10 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int.

    2^64 < 10^20

    How do you put 26/27 decimal digits in 64 bits? Or do you mean 27
    quinary digits? What would that be good for?

    No matter how we do it, two iterations are enough to
    handle even maximally large numbers, and that would require 4 64x64->128
    MULs per iteration. Since these would pipeline nicely I'm guessing it
    would be doable in 15-30 cycles total?

    On Skylake the latency of a 64x64->128 multiplication is 6 cycles (4
    cycles for the lower 64 bits), and I expect it to be lower on newer
    hardware. The pipelined multiplications should be done by cycle 9.
    There are also some additions involved, but I would not expect them to
    increase the latency to 15 cycles. What other operations do you have
    in mind that would result in 15-30 cycles? For scaling you don't need
    the remainder, only some rounding.

    So yes, scaling by a power of ten is the one operation where DPD is
    clearly much faster

    I would not bet on it. It needs to unpack the 34 digits into 136
    bits, do a 136-bit shift, then repack into DPD. Either they widen the
    data path beyond what they normally do, or they do it in parcels of 64
    bits or less, and the end result can easily take a similar number of
    cycles as 128-bit binary multiplication with the reciprocal. My guess
    is that they did the slow implementation at first, and then there was
    so little takeup of DFP that the slow implementation is good enough to
    this day.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jan 7 19:14:08 2026
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is getting the bits into the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
    a move from the BCD '0123456789abcdef' to the corresponding ASCII
    bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT  DATA 8 UN 0123456789  // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA             // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or maybe they have an instruction for that as well, but it's not in
    the DFP-related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation on strings
    of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given such a (fast) transformation it is very easy to add the proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.

    Since you know that the zone digit after transformation will
    always be zero, a bitwise "OR" of the ASCII/EBCDIC value
    for '0' (0x30/0xf0) over each byte should be sufficient.


    e.g.
    000102030405060708090a0b0c0d0e0f | 30303030303030303030303030303030


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Jan 7 19:19:22 2026
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    MitchAlsup wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than the two byte orders of IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global state
    and the interoperability of BFP and DFP in the same process. Like whether
    BFP and DFP have a common rounding mode or each one has a mode of its own.
    The same question applies to exception flags and exception masks.

    In hardware a division by 10 is either an adjustment of the exponent,
    which is equally fast for both encodings, or for a real division DPD
    just requires unpacking of all the 10-bit fields (I'm assuming IBM does
    this in parallel for all 11 groups, so max one cycle), then shifting all
    the nybbles down one position before the reverse to pack them back up,
    probably including a rounding step before the repack.

    It was even easier on the B3500. As it was addressed to the nibble,
    division by 10 simply required dropping the last digit of the source,
    while multiplication by 10 simply required appending a zero digit to
    the result (both of which the MVN instruction did automatically when
    the operand lengths differed). A common peephole optimization in the
    compilers.

    There were no operand registers, all arithmetic was memory to memory.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 19:56:54 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 4 Jan 2026 00:21:31 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:


    And two decimal flavors, as well, with binary and densely packed
    decimal encoding of the significand... it's a bit of a mess.


    Since both formats have exactly identical semantics, in theory the mess
    is not worse (and not better) than the two byte orders of IEEE binary FP.

    Division by 10 is way faster in DPD than in Binary.

    I see a much bigger problem [than BID vs DPD] in the fact that the IEEE
    standard does not specify a few very important things about global state
    and the interoperability of BFP and DFP in the same process. Like whether
    BFP and DFP have a common rounding mode or each one has a mode of its own.
    The same question applies to exception flags and exception masks.

    In hardware a division by 10 is either an adjustment of the exponent,
    which is equally fast for both encodings, or for a real division DPD

    just requires unpacking of all the 10-bit fields (I'm assuming IBM does
    this in parallel for all 11 groups, so max one cycle),

    unpack is 3 gates of delay.

    then shifting all
    the nybbles down one position before the reverse to pack them back up, probably including a rounding step before the repack.

    pack is also 3 gates of delay.

    This operation is very closely related to the general case of having to re-normalize after any operation which would require that, i.e. commonly
    for DFMUL, very seldom for DFADD/DFSUB, and almost always for DFDIV.

    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int. No matter how we do it, two iterations are enough to
    handle even maximally large numbers, and that would require 4 64x64->128
    MULs per iteration. Since these would pipeline nicely I'm guessing it
    would be doable in 15-30 cycles total?

    Closer to 16 than 30 if one tries hard.

    So yes, scaling by a power of ten is the one operation where DPD is
    clearly much faster, but if you try to implement DPD in software, then
    you have to handle the unpack and pack operations, and they could easily take the same or even more time.

    Which is why you don't WANT to do it in HW.
    There is obviously a class of SW that wants these things--the question
    is whether YOUR architecture wants people of this class buying your HW.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 20:05:17 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Assuming 128-bit operands are delivered in 1 cycle and 128-bit
    results are delivered in 1 cycle::

    128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.

    128-bit Fmul requires that the multiplier tree be 64x64 instead of
    53x53 (a 1.46x bigger tree, a 1.22x bigger FU), and would/should be 3-4
    cycles longer than 64-bit Fmul. If you wanted to be "really clever"
    you could use a 59x59 tree and the FU is only 1.12x bigger; but here
    you could not use the tree for Integer MUL.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 20:11:20 2026
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int.

    2^64 < 10^20

    How do you put 26/27 decimal digits in 64 bits. Or do you mean 27
    quinary digits? What would that be good for?

    it takes log2(10) = ~3.32 bits to encode one decimal digit, thus there
    are only 64/3.32 = ~19.3 decimal digits in 64 bits.

    No matter how we do it, two iterations are enough to
    handle even maximally large numbers, and that would require 4 64x64->128
    MULs per iteration. Since these would pipeline nicely I'm guessing it
    would be doable in 15-30 cycles total?

    On Skylake the latency of a 64x64->128 multiplication is 6 cycles (4
    cycles for the lower 64 bits), and I expect it to be lower on newer
    hardware. The pipelined multiplications should be done by cycle 9.
    There are also some additions involved, but I would not expect them to increase the latency to 15 cycles. What other operations do you have
    in mind that would result in 15-30 cycles? For scaling you don't need
    the remainder, only some rounding.

    So yes, scaling by a power of ten is the one operation where DPD is
    clearly much faster

    I would not bet on it. It needs to unpack the 34 digits into 136
    bits, do a 136-bit shift, then repack into DPD.

    In HW::
    unpack is 3 gates of delay
    pack is 3 gates of delay

    Either they widen the
    data path beyond what they normally do, or they do it in parcels of 64
    bits or less, and the end result can easily take a similar number of
    cycles as 128-bit binary multiplication with the reciprocal.

    In HW, 136 bits is conceptually no different from 128 bits.

    My guess
    is that they did the slow implementation at first, and then there was
    so little takeup of DFP that the slow implementation is good enough to
    this day.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jan 7 14:23:10 2026
    From Newsgroup: comp.arch

    On 1/7/2026 7:16 AM, John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you couple of years ago how fast do want binary128
    in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave totally
    unrealistic answer like "the same as binary64".
    May be, nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware binary128?


    Likely estimate for FPGA:
    Around 28 DSP48's for a "triangular" multiplier;
    Would need to add several clock cycles for the adder tree;
    ...
    FADD/FSUB unit, also around 12 cycles,
    as most intermediate steps now take 2 clock cycles;


    Estimate:
    Probably around 5k LUTs for the FMUL, with a latency of around 10 or 12
    clock cycles.
    Probably around 12k LUTs for FADD/FSUB unit;
    Will need a few more kLUT for the glue logic.

    So, will put the cost at:
    18-20 kLUT likely;
    ~ 28 DSP48s;
    Around 12 cycles of latency.


    What about an FMA based implementation:
    Probably 49 DSP48's and around 24 cycles of latency.
    Where, 49 is needed for full-width multiplier results.
    Also add a big bump to the LUT cost vs separate units.
    An FMA unit roughly has the latency cost of both the FADD and FMUL.
    But, some people really like the ability to quickly have single-rounded results.


    The initial FPU would likely take around 1/3 of the total LUT budget of
    an XC7A100T, and it is unclear if such a thing would be possible within
    a 50 MHz CPU core (might require dropping to 33 MHz or similar).


    In my case, similar issues wrecked my ideas of doing a 96-bit truncated
    format, and even then 96 bits is still less than 128 bits. My current
    strategy is to instead allow for trap-based handling or hot-patching.




    To simplify a hot-patching implementation, I am now considering having
    the compiler set aside roughly 4 instruction-words of "hot patch zone"
    for any instruction that is likely to be implemented via hot-patching.

    These would be dumped out in blobs within 1MB of the target, or at the
    end of ".text", whichever comes first. Technically, 3 words would be the minimum, but 4 allows for a little more working flexibility.

    May make sense to assume that the hot-patching is free to stomp X5, as
    this would make it possible to implement on RV64G. Though, would need 6
    to allow for AUIPC+LD+JALR; but still works if assuming AUIPC+JALR (+/-
    4GB).


    This would give space for the handler to replace the offending
    instruction with a JAL, and then to branch off to whatever memory is
    being used for hot-patched instruction sequences.

    Granted, this sort of thing only works well if one assumes compiler cooperation.

    Current possibility is that the compiler could hint at these spaces by
    filling them with a special instruction, such as:
    JALR X0, 0(X0) //branch to NULL
    Where, if the loader or trap handler sees large blobs of such an
    instruction, it can assume that this area was set aside for use by the
    hot patching to reuse to encode long-distance branches.

    Could probably also add this to XG1/XG2 if trying to do similar (like
    enabling the "FPUX" extension), may make sense to find some other filler instruction that makes sense for XG2 though (using RISC-V JALR
    instructions would be a little out of place in this case).

    Granted, could also make sense to use a large blob of EBREAK or similar,
    which could have a similar effect (mostly depends on the probability
    that a program would have some other likely reason to have a big blob of EBREAK's, and EBREAK has a higher probability to be "actually useful"
    than a JALR-NULL).



    Granted, one could argue that using pad-space defeats the merit of using trapping-instructions rather than runtime calls. But, alas...


    Ironically, for my RV+SIMD stuff, I partly leaned into still using
    runtime calls for some operations rather than doing them inline, as
    doing them inline is more bulk with a comparably weaker SIMD ISA (but,
    with some more fiddling, weak SIMD is still a big improvement over
    no-SIMD for things like GLQuake).

    Well, and more fiddling to make RV FPU handling by BGBCC less crappy:
    More likely to use the correct registers, etc.

    And, currently putting 128-bit SIMD in FPU register pairs, which is
    mostly less-bad than GPRs even in the absence of native SIMD ops, apart
    from the "epic crapiness" of trying to deal with shuffle operations (I
    did add "FPU PACK" style instructions as otherwise this part is "dog
    crap").


    Or, in RV terms, one has, say:
    PACK Rd, Rs1, Rs2 // { Rs1[31: 0], Rs2[31: 0] }
    PACKU Rd, Rs1, Rs2 // { Rs1[63:32], Rs2[63:32] }
    BitManip stopped there, my case would also have PACKBT/PACKTB, though in
    my ISA they were called MOVLD/MOVHD/MOVLHD/MOVHLD (BGBCC still mostly
    uses these names, but allows PACK/PACKU for ASM code). The RV P
    extension also defines all 4 cases, but only for GPRs.

    For sake of sanity, my SIMD extension had also defined variants for
    FPRs, albeit still using the same mnemonics (the assembler figures out
    what to do based on registers here).


    So, in this case, still more sensible to use internal runtime calls for operations like DotProduct and CrossProduct and similar (but are likely
    remain as inline operations for XG2/3).

    Similar also applies to complex-number and quaternion operations, which
    will mostly remain as runtime calls.


    As noted, no current plans to move beyond 64/128 bit SIMD.

    Most likely option is that, rather than (hypothetically) define any sort
    of large-vector SIMD, may make more sense to fake large SIMD via the
    RV-V extension, and then probably use hot-patching to pretend that it
    exists (if needed, by faking RV-V on top of the narrower SIMD).

    Like, by the time one wants crap like AVX or similar, then RV-V starts
    to seem more sane.

    Big problem-case is when one wants something more like MMX or SSE-1,
    where RV-V seems like a pretty big ask to expect a hardware implementation.

    But, a hot-patching implementation could potentially be fast enough to
    make RV-V "not totally worthless" (if faking 256 bit vectors or similar,
    it is then more likely to eat the relative overhead of the patch-calls).
    And, could then implement native RV-V for hardware that can justify the
    cost.


    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jan 7 14:38:30 2026
    From Newsgroup: comp.arch

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. The Gnu Octave implementation was somewhat
    worse yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was a Discrete Cosine Transform and the underlying FFT engine
    for N around 32K.
    The second was solving a system of linear equations for N around 1000
    or a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate results
    caused trouble.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left the LU decomposition, which happens to be
    the heavy O(N**3) part, in DP. Quad-precision was applied only during
    the final solver stages - forward propagation, back propagation,
    calculation of the residual error vector and repetition of forward and
    back propagation. All those parts are O(N**2). That modification was
    sufficient to improve the precision of the result almost to the best
    possible in the DP FP format.
    And sufficient for good convergence of the Parks-McClellan algorithm.


    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    an error budget calculation across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be applied
    naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though, in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16;
    with Binary64 more remaining as the "de-facto default" precision for floating-point).

    As can be noted, in my case, it was a partial motivation for supporting
    things like 128-bit integer instructions (in my C compiler, and
    optionally in the underlying ISA), as supporting Int128 ops is a step
    towards making doing Binary128 in software more practical (without the
    steep cost of a 128-bit FPU).

    ...


    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 21:18:54 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. The Gnu Octave implementation was somewhat
    worse yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was a Discrete Cosine Transform and the underlying FFT engine
    for N around 32K.
    The second was solving a system of linear equations for N around 1000
    or a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate results
    caused trouble.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left the LU decomposition, which happens to be
    the heavy O(N**3) part, in DP. Quad-precision was applied only during
    the final solver stages - forward propagation, back propagation,
    calculation of the residual error vector and repetition of forward and
    back propagation. All those parts are O(N**2). That modification was
    sufficient to improve the precision of the result almost to the best
    possible in the DP FP format.
    And sufficient for good convergence of the Parks-McClellan algorithm.

    So, what is a point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.
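
    [In outline, the refinement Michael describes looks something like the
    sketch below -- an illustration under assumptions (GCC-style __float128,
    row-major A, packed LU factors with a pivot vector), not his actual
    code:]

    #include <stdlib.h>

    typedef __float128 q;   /* quad precision (GCC/Clang extension) */

    /* Forward + back substitution through the double-precision LU
       factors, carried out in quad.  piv[] holds LAPACK-style row swaps. */
    static void solve_lu_quad(int n, const double *lu, const int *piv, q *v)
    {
        for (int i = 0; i < n; i++) {             /* pivot + forward subst. */
            q t = v[piv[i]]; v[piv[i]] = v[i]; v[i] = t;
            for (int j = 0; j < i; j++)
                v[i] -= (q)lu[i*n + j] * v[j];
        }
        for (int i = n - 1; i >= 0; i--) {        /* back substitution */
            for (int j = i + 1; j < n; j++)
                v[i] -= (q)lu[i*n + j] * v[j];
            v[i] /= (q)lu[i*n + i];
        }
    }

    /* One refinement pass: all O(N^2); only the O(N^3) LU stays in DP. */
    void refine_once(int n, const double *A, const double *lu,
                     const int *piv, const double *b, double *x)
    {
        q *r = malloc(n * sizeof *r);
        for (int i = 0; i < n; i++) {             /* r = b - A*x, in quad */
            q s = (q)b[i];
            for (int j = 0; j < n; j++)
                s -= (q)A[i*n + j] * (q)x[j];
            r[i] = s;
        }
        solve_lu_quad(n, lu, piv, r);             /* d = A^-1 * r */
        for (int i = 0; i < n; i++)
            x[i] += (double)r[i];                 /* x += d; repeat if needed */
        free(r);
    }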

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    error budget calculations across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be
    applied naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16,
    and with Binary64 remaining as the "de-facto default" precision for
    floating-point).

    {To date::}
    My only use for 128-bit FP was to compute Chebyshev Coefficients for
    my high speed DP Transcendentals. I only needed 64-bits of fractions
    but, in practice, 80-bit FP was only giving me 63-bits of precision.
    Since these are a) computed once, b) used infinitely many times; the
    speed of 128-bit FP is completely irrelevant.

    As can be noted, in my case, it was a partial motivation for supporting things like 128-bit integer instructions (in my C compiler, and
    optionally in the underlying ISA), as supporting Int128 ops is a step towards making doing Binary128 in software more practical (without the
    steep cost of a 128-bit FPU).

    It seems to me that if one has "reasonable" ISA support for tearing a
    128-bit FP value into {sign, exponent, fraction} and reasonably fast
    128-bit integer support, then emulating 128-bit FP in SW is "not that
    bad"--especially if one can do 128x128 -> 256 in 4-8 cycles.
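
    [For concreteness, a minimal sketch of that "tearing apart" step,
    assuming a compiler with unsigned __int128; the helper name is made
    up for illustration:]

    #include <stdint.h>

    typedef unsigned __int128 u128;

    /* Split an IEEE binary128 bit pattern into sign, biased exponent,
       and fraction, making the hidden bit explicit for normal numbers. */
    static void fp128_unpack(u128 bits, int *sign, int *exp, u128 *frac)
    {
        *sign = (int)(bits >> 127);
        *exp  = (int)((bits >> 112) & 0x7fff);    /* 15 exponent bits */
        *frac = bits & (((u128)1 << 112) - 1);    /* 112 fraction bits */
        if (*exp != 0)                            /* normal: add hidden bit */
            *frac |= (u128)1 << 112;
    }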

    ...


    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Jan 7 23:47:06 2026
    From Newsgroup: comp.arch

    On Wed, 07 Jan 2026 20:05:17 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>,
    already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Assuming 128-bit operands are delivered in 1 cycle and 128-bit
    results are delivered in 1 cycle::

    If we are talking about SIMD of the same width (measured in bits) as
    SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
    and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
    that fully pipelined binary128 operations are a non-starter, because it
    would blow your power and thermal budget. One full-width result (i.e. 8
    binary128 results) every 2 cycles sounds somewhat more realistic.
    After all, in a general-purpose CPU, binary128, if implemented at all,
    is a proverbial tail that can't be allowed to wag the dog.
    OTOH, if we define our binary128 to use only the least-significant
    128-bit lane of our 512-bit register, and only build b128 capabilities
    into one of our pair of FPUs, then full pipelining (i.e. 1 result per
    cycle) looks like a good choice, at least from a power/thermal
    perspective. That is, as long as designers find a way to avoid a hot
    spot.
    128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.

    128-bit Fmul requires that the multiplier tree be 64x64 instead of
    53x53 (1.46x bigger tree, 1.22x bigger FU), and would/should be 3-4
    cycles longer than 64-bit Fmul. If you wanted to be "really clever"
    you could use a 59x59 tree and the FU is only 1.12x bigger; but here
    you could not use the tree for Integer MUL.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jan 7 16:10:01 2026
    From Newsgroup: comp.arch

    On 1/7/2026 3:18 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. Gnu Octave implementation was somewhat worse
    yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    and a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part, in DP. Quad-precision was applied only during final
    solver stages - forward propagation, back propagation, calculation of
    residual error vector and repetition of forward and back propagation.
    All those parts are O(N**2). That modification was sufficient to
    improve the precision of the result almost to the best possible in DP
    FP format, and sufficient for good convergence of the Parks-McClellan
    algorithm.

    FWIW: For most cases where I had used DCT or FFT, it has almost always
    been with fixed-point integer math...



    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    error budget calculations across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be
    applied naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though, in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16;
    with Binary64 remaining as the "de-facto default" precision for
    floating-point).

    {To date::}
    My only use for 128-bit FP was to compute Chebyshev Coefficients for
    my high speed DP Transcendentals. I only needed 64-bits of fractions
    but, in practice, 80-bit FP was only giving me 63-bits of precision.
    Since these are a) computed once, b) used infinitely many times; the
    speed of 128-bit FP is completely irrelevant.


    As noted, low usage frequency.

    If it is something that mostly applies to initial program startup or occasionally in the slow path, that it is "kinda slow" doesn't matter
    too much.

    Though, it is starting to seem that "trap and emulate" might still be a
    little too slow, leading to my recent efforts in the direction of
    efficient hot-patching.

    Granted, this is more a case of "just sort of pushing the cost somewhere
    else" and in theory, if the compiler knows that the instruction will
    just be patched anyways, it could in principle generate the intermediate
    calls more cheaply.

    But, for Binary128 there is another factor:
    RV64G/RV64GC lacks access to 128-bit integer instructions;
    So, it makes sense to instead run this logic in XG3;
    But, the compiler can't just use XG3, since if it uses any XG3 ops, it may as well
    just compile the whole binary as XG3;
    So, it makes sense to use XG3 as a "make RV64 less poor" feature, but
    then the compiler can't be allowed to depend on it directly, and at
    least needs to pretend it is living in RV64 land.

    But, then, this leads to hot-patch wonk.


    As can be noted, in my case, it was a partial motivation for supporting
    things like 128-bit integer instructions (in my C compiler, and
    optionally in the underlying ISA), as supporting Int128 ops is a step
    towards making doing Binary128 in software more practical (without the
    steep cost of a 128-bit FPU).

    It seems to me that if one has "reasonable" ISA support for tearing a
    128-bit FP value into {sign, exponent, fraction} and reasonably fast
    128-bit integer support, then emulating 128-bit FP in SW is "not that
    bad"--especially if one can do 128x128 -> 256 in 4-8 cycles.


    Yeah, this is basically the idea.

    Int128 ops, and my BITMOV instructions (which can extract/insert/move bitfields within 64 and 128 bit containers; as a combined "Shift and
    masked MUX"), can provide a nice boost here.

    Sadly, there is still not really a great way to do a 128x128 => 256
    multiply though. The current fastest option is still to decompose it
    into a crapload of 32x32 => 64 bit widening multiply ops, as sketched
    below (which, ironically, is another thing that RV is lacking; one
    needs to use a full 64-bit multiply, but there are downsides, more so
    when the base ISA is also lacking PACK/PACKU).
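
    [A portable sketch of that decomposition -- helper names made up, not
    BGBCC output: 64x64 => 128 from four 32x32 => 64 widening multiplies,
    then 128x128 => 256 schoolbook from those:]

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;
    typedef struct { uint64_t w[4]; } u256;     /* w[0] = least significant */

    /* the 32x32 -> 64 widening primitive assumed throughout */
    static inline uint64_t mul32w(uint32_t a, uint32_t b) {
        return (uint64_t)a * (uint64_t)b;
    }

    /* 64x64 -> 128 from four 32x32 -> 64 partial products */
    static u128 mul64w(uint64_t a, uint64_t b) {
        uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
        uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);
        uint64_t p00 = mul32w(a0, b0), p01 = mul32w(a0, b1);
        uint64_t p10 = mul32w(a1, b0), p11 = mul32w(a1, b1);
        /* mid cannot overflow: at most 3*(2^32-1) < 2^34 */
        uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
        u128 r;
        r.lo = (mid << 32) | (uint32_t)p00;
        r.hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
        return r;
    }

    /* 128x128 -> 256 from four 64x64 -> 128 products plus carries */
    static u256 mul128w(u128 a, u128 b) {
        u128 ll = mul64w(a.lo, b.lo), lh = mul64w(a.lo, b.hi);
        u128 hl = mul64w(a.hi, b.lo), hh = mul64w(a.hi, b.hi);
        uint64_t s, c, c2;
        u256 r;
        r.w[0] = ll.lo;
        s = ll.hi + lh.lo;  c  = (s < lh.lo);
        s += hl.lo;         c += (s < hl.lo);
        r.w[1] = s;
        s = lh.hi + hl.hi;  c2  = (s < hl.hi);
        s += hh.lo;         c2 += (s < hh.lo);
        s += c;             c2 += (s < c);
        r.w[2] = s;
        r.w[3] = hh.hi + c2;    /* no carry out: product fits in 256 bits */
        return r;
    }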


    Still kinda funny that RV land, with all of its wide industrial support,
    lots of people doing lots of extensions, advanced features, etc.,
    seemingly still fails at making an ISA where "basic things" fit together
    well.

    And, then a lot of features going off in rabbit holes like "why would
    you want this?", and then it turns out it is to micro-optimize some
    specific test case within SPECint or something (often, rather than
    finding a more general solution that would address multiple related issues).

    More so when the "micro-optimize the benchmark" features were more often chosen over the more general purpose "actually address the underlying
    issue" features.


    Granted, then someone is almost invariably going to be like "all the
    parts of RV do fit together well, but you are using it wrong...".

    But, in this case, I would expect GCC to generate smaller binaries than
    BGBCC; leaving me to think it is more a case of "these parts don't fit
    together all that well".



    ...


    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Jan 7 22:10:27 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 07 Jan 2026 20:05:17 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Assuming 128-bit operands are delivered in 1 cycle and 128-bit
    results are delivered in 1 cycle::


    If we are talking about SIMD of the same width (measured in bits) as
    SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
    and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
    that fully pipelined binary128 operations are a non-starter, because it
    would blow your power and thermal budget.

    I agree; however, a single 128-bit FPU would fit inside a reasonable
    power budget.

    One full-width result (i.e. 8 binary128 results) every 2 cycles sounds somewhat more realistic.

    Likely still over a reasonable power budget.

    After all, in a general-purpose CPU, binary128, if implemented at all,
    is a proverbial tail that can't be allowed to wag the dog.

    We build (and call) our current machines 64-bits because that is the
    size of the register files (not including SIMD/Vector) and because
    we can run the scalar unit at rated clock frequency (non SIMD/Vector) essentially continuously.

    Once we step over the scalar width, power goes up 2x-4x and we get a
    couple of hundred cycles before frequency throttling. Thus, we cannot,
    in general, run SIMD/Vector at rated frequency continuously. Nor can
    we, at the present time, build a memory system that can properly feed a
    SIMD/Vector RF so that one can use all of the available calculation
    lanes. {HBM is approaching this point; however, it becomes more like
    B-memory from the CRAY-2 than main memory for applications that can
    use that much B-memory effectively.}

    OTOH, if we define our binary128 to use only the least-significant
    128-bit lane of our 512-bit register, and only build b128 capabilities
    into one of our pair of FPUs, then full pipelining (i.e. 1 result per
    cycle) looks like a good choice, at least from a power/thermal
    perspective. That is, as long as designers find a way to avoid a hot
    spot.

    We could, instead, treat them as pairs of GPRs--like we did in Mc 88120
    and still not need SIMD/Vectors.

    128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.

    128-bit Fmul requires that the multiplier tree be 64x64 instead of
    53x53 (1.46x bigger tree, 1.22x bigger FU), and would/should be 3-4
    cycles longer than 64-bit Fmul. If you wanted to be "really clever"
    you could use a 59x59 tree and the FU is only 1.12x bigger; but here
    you could not use the tree for Integer MUL.

    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 00:05:33 2026
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 3:18 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. Gnu Octave implementation was somewhat worse
    yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    and a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part, in DP. Quad-precision was applied only during final
    solver stages - forward propagation, back propagation, calculation of
    residual error vector and repetition of forward and back propagation.
    All those parts are O(N**2). That modification was sufficient to
    improve the precision of the result almost to the best possible in DP
    FP format, and sufficient for good convergence of the Parks-McClellan
    algorithm.

    FWIW: For most cases where I had used DCT or FFT, it has almost always
    been with fixed-point integer math...



    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    error budget calculations across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be
    applied naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though, in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16,
    and with Binary64 remaining as the "de-facto default" precision for
    floating-point).

    {To date::}
    My only use for 128-bit FP was to compute Chebyshev Coefficients for
    my high speed DP Transcendentals. I only needed 64-bits of fractions
    but, in practice, 80-bit FP was only giving me 63-bits of precision.
    Since these are a) computed once, b) used infinitely many times; the
    speed of 128-bit FP is completely irrelevant.


    As noted, low usage frequency.

    If it is something that mostly applies to initial program startup or occasionally in the slow path, that it is "kinda slow" doesn't matter
    too much.

    Though, it is starting to seem that "trap and emulate" might still be a little too slow, leading to my recent efforts in the direction of
    efficient hot-patching.

    Depends on the speed of T&E. If privilege control transfer is 10-cycles then its probably OK, if 100+ it is getting on the annoying side of thiigns.

    Granted, this is more a case of "just sort of pushing the cost somewhere else" and in theory, if the compiler knows that the instruction will
    just be patched anyways, it could in principle generate the intermediate
    calls more cheaply.
    -------------------
    As can be noted, in my case, it was a partial motivation for supporting
    things like 128-bit integer instructions (in my C compiler, and
    optionally in the underlying ISA), as supporting Int128 ops is a step
    towards making doing Binary128 in software more practical (without the
    steep cost of a 128-bit FPU).

    It seems to me that if one has "reasonable" ISA support for tearing a
    128-bit FP value into {sign, exponent, fraction} and reasonably fast
    128-bit integer support, then emulating 128-bit FP in SW is "not that
    bad"--especially if one can do 128x128 -> 256 in 4-8 cycles.


    Yeah, this is basically the idea.

    Int128 ops, and my BITMOV instructions (which can extract/insert/move bitfields within 64 and 128 bit containers; as a combined "Shift and
    masked MUX"), can provide a nice boost here.

    Sadly, there is still not really a great way to do a 128x128 => 256
    multiply though.

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of
    thought to this {~1 year} before deciding that a "Do everything else"
    function unit was "overall" better than a couple of "near miss" FUs.
    So, IMUL comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at
    22, and IDIV at 25. The fast IMUL makes up a lot for the "size" of
    the FU.

    To get 64x64->128 I need my <single> prefix instruction CARRY. But
    this also gives me (64x64->128)/64 -> {64,64} {quotient, remainder}.
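
    [The semantics of those two CARRY-prefixed forms, sketched in C with
    the GCC/Clang unsigned __int128 extension -- for reference, not My
    66000 code:]

    #include <stdint.h>

    typedef unsigned __int128 u128_t;

    /* 64x64 -> 128: what the CARRY-prefixed multiply delivers */
    static inline void mul_wide(uint64_t a, uint64_t b,
                                uint64_t *hi, uint64_t *lo)
    {
        u128_t p = (u128_t)a * b;
        *hi = (uint64_t)(p >> 64);
        *lo = (uint64_t)p;
    }

    /* (64x64 -> 128)/64 -> {quotient, remainder}: the CARRY-prefixed
       divide; assumes hi < d so the quotient fits in 64 bits */
    static inline void divmod_wide(uint64_t hi, uint64_t lo, uint64_t d,
                                   uint64_t *quot, uint64_t *rem)
    {
        u128_t n = ((u128_t)hi << 64) | lo;
        *quot = (uint64_t)(n / d);
        *rem  = (uint64_t)(n % d);
    }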

    Current fastest option is still to decompose it into a crapload of 32x32=>64 bit widening multiply ops (which, ironically, is another thing that RV is lacking in; need to use a full 64-bit multiply,
    but there are downsides, more-so when the base ISA is also lacking PACK/PACKU).

    "Not my fault".

    Still kinda funny that RV land, with all of its wide industrial support,
    lots of people doing lots of extensions, advanced features, etc.,
    seemingly still fails at making an ISA where "basic things" fit
    together well.

    And, then a lot of features going off in rabbit holes like "why would
    you want this?", and then it turns out it is to micro-optimize some
    specific test case within SPECint or something (often, rather than
    finding a more general solution that would address multiple related issues).

    Reasonable support for 64x64->128 is what makes emulation "affordable".

    Side note: Back in 1987, MIPS had a 13-cycle multiply using their non-
    pipelined FU and special registers--while the Mc 88100 had a 3-cycle
    32x32 multiply. Well, it turns out one could program this multiplier
    to do 32x32->64 in 13 cycles, TOO!!

    More so when the "micro-optimize the benchmark" features were more often chosen over the more general purpose "actually address the underlying
    issue" features.

    Been there done that......


    Granted, then someone is almost invariably going to be like "all the
    parts of RV do fit together well, but you are using it wrong...".

    But, in this case, I would expect GCC to generate smaller binaries than
    BGBCC; leaving me to think it is more a case of "these parts don't fit
    together all that well".



    ...


    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Jan 7 21:16:38 2026
    From Newsgroup: comp.arch

    On 2026-01-07 3:23 p.m., BGB wrote:
    On 1/7/2026 7:16 AM, John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>, already5chosen@yahoo.com
    (Michael S) wrote:

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for hardware
    binary128?


    Likely estimate for FPGA:
      Around 28 DSP48's for a "triangular" multiplier;
        Would need to add several clock cycles for the adder tree;
        ...
      FADD/FSUB unit, also around 12 cycles,
        as most intermediate steps now take 2 clock cycles;


    Estimate:
    Probably around 5k LUTs for the FMUL, with a latency of around 10 or 12 clock cycles.
    Probably around 12k LUTs for FADD/FSUB unit;
    Will need a few more kLUT for the glue logic.

    So, will put the cost at:
      18-20 kLUT likely;
      ~ 28 DSP48s;
      Around 12 cycles of latency.


    What about an FMA based implementation:
      Probably 49 DSP48's and around 24 cycles of latency.
        Where, 49 is needed for full-width multiplier results.
        Also add a big bump to the LUT cost vs separate units.
      An FMA unit roughly has the latency cost of both the FADD and FMUL.
    But, some people really like the ability to quickly have single-rounded results.



    The 128-bit FMA I implemented has an eight-cycle latency and uses 36
    DSPs (Karatsuba multiplier). The latency is a bit less than double
    that of an FADD. One cycle can be trimmed off for operand decoding,
    which can happen in parallel; then there is only a single
    normalization and round taking place, which also trims a couple of
    clocks off the doubled latency.

    My FADD has a five-cycle latency. Latency is a bit of a designer's
    choice and can be set up as desired for the clock frequency. I picked
    eight to try and match the FP clock to the CPU clock (slow CPU clock).
    Many more stages could be added to bump up the clock frequency.

    The FMA consumes about 8600 LUTs and 2600 FFs. I decided to use FMAs
    (without FADD, FMUL) in my design even though the latency is a bit
    more, as I think the total LUT cost is lower.

    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 02:38:57 2026
    From Newsgroup: comp.arch


    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 8 10:52:21 2026
    From Newsgroup: comp.arch

    On Thu, 08 Jan 2026 02:38:57 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is
    59x59 bits {a bit more than half of them get 1 ULP at 58x58}. I gave a
    lot of thought to this {~1 year} before deciding that a "Do
    everything else" function unit was "overall" better than a couple
    of "near miss" FUs. So, IMUL comes in at 4 cycles, FMUL at 4
    cycles, FDIV at 17, SQRT at 22, and IDIV at 25. The fast IMUL makes
    up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    Don't you mean '0.5002 ULP' ?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 12:50:32 2026
    From Newsgroup: comp.arch

    Waldek Hebisch wrote:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but leaves to
    software the last step of unpacking 4-bit BCD to 8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is having the bits in the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit numbers:
    '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'. That's
    a move from the BCD '0123456789abcdef' to the corresponding ASCII
    bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT  DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA              // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields 'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least, get
    3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or, may be, they have instruction for that as well, but it's not in
    DFP related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation
    on strings of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given (fast) such a transformation it is very easy to add proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.


    Intel (and then AMD) added the PDEP (and corresponding PEXT) opcode
    sometime within the last 20 years (probably less than 10?); it is
    perfect for this operation:

    ;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
    mov rbx, 0x0f0f0f0f0f0f0f0f    ; byte-wise nybble mask
    pdep rax, rsi, rbx             ; deposit low 8 nybbles -> 8 bytes
    shr rsi, 32
    pdep rdx, rsi, rbx             ; deposit high 8 nybbles -> 8 bytes

    This is sub-5 cycles of latency.

    It is also doable with much older CPUs using the permute/byte shuffle
    operation, with a bit more or less latency depending upon where the
    source and destination data resides (SIMD vs regular integer reg).
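
    [The same idea via compiler intrinsics -- a sketch assuming BMI2's
    _pdep_u64 (compile with -mbmi2), little-endian digit order, and the
    ASCII zone OR'ed in at the end:]

    #include <stdint.h>
    #include <string.h>
    #include <immintrin.h>   /* _pdep_u64 */

    /* unpack 16 packed BCD nybbles into 16 printable ASCII digit bytes */
    static void bcd16_to_ascii(uint64_t bcd, uint8_t out[16])
    {
        const uint64_t mask = 0x0f0f0f0f0f0f0f0fULL;
        const uint64_t zone = 0x3030303030303030ULL;
        uint64_t lo = _pdep_u64(bcd, mask) | zone;        /* low 8 digits  */
        uint64_t hi = _pdep_u64(bcd >> 32, mask) | zone;  /* high 8 digits */
        memcpy(out, &lo, 8);        /* little-endian: lowest digit first */
        memcpy(out + 8, &hi, 8);
    }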

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 13:01:45 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int.

    2^64 < 10^20

    How do you put 26/27 decimal digits in 64 bits. Or do you mean 27
    quinary digits? What would that be good for?

    it takes 3.32 binary digits to encode 10, thus there are only 19.25
    decimal digits in 64-bits.

    Michael's idea was to split the division by a power of ten into two
    parts: A division by a power of 5 and a bitshift for the 2^N.

    If we start with the bitshift (but remember the bits shifted out from
    the bottom), then 5^26 fits into 2^64.

    Does that make sense?
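
    [A sketch of the single-step case -- floor(x/10) as a shift for the
    factor of 2 plus a reciprocal multiply for the factor of 5, using the
    usual ceil(2^66/5) magic constant; illustrative, not Michael's code:]

    #include <stdint.h>

    /* floor(x/5) for any 64-bit x: 0xCCCCCCCCCCCCCCCD = ceil(2^66 / 5),
       so (x * magic) >> 66 is exact. */
    static inline uint64_t div5(uint64_t x)
    {
        return (uint64_t)(((unsigned __int128)x *
                           0xCCCCCCCCCCCCCCCDULL) >> 64) >> 2;
    }

    /* floor(x/10) = floor(floor(x/2)/5): peel the power of 2 off with a
       shift, then handle the power of 5 with the reciprocal multiply. */
    static inline uint64_t div10(uint64_t x)
    {
        return div5(x >> 1);
    }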

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 13:05:10 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 1/7/2026 11:56 AM, Terje Mathisen wrote:
    Michael S wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    to be in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because, with non-answers or answers like that,
    nobody thinks that you are serious?

    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop-band attenuation of ~160 dB.
    Matlab's implementation of the Parks-McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. Gnu Octave implementation was somewhat worse
    yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    and a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    For the Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part, in DP. Quad-precision was applied only during final
    solver stages - forward propagation, back propagation, calculation of
    residual error vector and repetition of forward and back propagation.
    All those parts are O(N**2). That modification was sufficient to
    improve the precision of the result almost to the best possible in DP
    FP format, and sufficient for good convergence of the Parks-McClellan
    algorithm.


    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like the calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    I think the main issue is similar to what we had before 754, i.e. every
    fp programmer needed to also be an fp analyst, capable of carrying out
    error budget calculations across their algorithms.

    You can obviously do that, and so can a number of regulars here, but in
    the real world we are in a _very_ small minority.

    For the rest, just having fp128 fast enough that it could be
    applied naively would solve a number of problems.


    As I see it, FP128 is fast enough for practical use even with a
    software-only implementation (though, in part due to its relatively low
    usage frequency; if it is used, it is mostly for cases that actually
    need precision, rather than high throughput, with high-throughput cases
    likely to remain dominated by smaller types, like Binary32 and Binary16;
    with Binary64 remaining as the "de-facto default" precision for
    floating-point).

    {To date::}
    My only use for 128-bit FP was to compute Chebyshev Coefficients for
    my high speed DP Transcendentals. I only needed 64-bits of fractions
    but, in practice, 80-bit FP was only giving me 63-bits of precision.
    Since these are a) computed once, b) used infinitely many times; the
    speed of 128-bit FP is completely irrelevant.
    Sounds similar to the weekend I spent writing an fp128 library (using
    1:31:96 for speed/ease of implementation on a Pentium) just to be able
    to verify that our FPATAN2 workaround for the FDIV bug was correct.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 13:10:14 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of
    thought to this {~1 year} before deciding that a "Do everything else"
    function unit was "overall" better than a couple of "near miss" FUs.
    So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.
    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to get the rounding correct?
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 8 15:41:17 2026
    From Newsgroup: comp.arch

    On Thu, 8 Jan 2026 12:50:32 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Waldek Hebisch wrote:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but
    leaves to software the last step of unpacking 4-bit BCD to
    8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is having the bits in the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit
    numbers: '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'.
    That's a move from the BCD '0123456789abcdef' to the corresponding
    ASCII bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT  DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA              // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields
    'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least,
    get 3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or, may be, they have instruction for that as well, but it's
    not in DFP related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation
    on strings of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given (fast) such a transformation it is very easy to add proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.


    Intel (and then AMD) added the PDEP (and corresponding PEXT) opcode
    sometime within the last 20 years (probably less than 10?); it is
    perfect for this operation:


    Haswell. Officially launched on June 4, 2013. So 12.5 years ago.
    Time flies.

    ;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
    mov rbx, 0x0f0f0f0f0f0f0f0f    ; byte-wise nybble mask
    pdep rax, rsi, rbx             ; deposit low 8 nybbles -> 8 bytes
    shr rsi, 32
    pdep rdx, rsi, rbx             ; deposit high 8 nybbles -> 8 bytes

    This is sub-5 cycles of latency.

    That's nice.
    I'm not sure if POWER has similar instruction.


    It is also doable with much older CPUs using the permute/byte shuffle
    operation, with a bit more or less latency depending upon where the
    source and destination data resides (SIMD vs regular integer reg).

    Terje



    I don't understand that part. Do you suggest that there are better
    swizzle instructions than the unpack mentioned by Waldek Hebisch?
    So far, I don't see it. Unpack looks to me the most suitable.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 18:50:14 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 08 Jan 2026 02:38:57 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is
    59x59 bits {a bit more than half of them get 1 ULP at 58x58}. I gave a
    lot of thought to this {~1 year} before deciding that a "Do
    everything else" function unit was "overall" better than a couple
    of "near miss" FUs. So, IMUL comes in at 4 cycles, FMUL at 4
    cycles, FDIV at 17, SQRT at 22, and IDIV at 25. The fast IMUL makes
    up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.


    Don't you mean '0.5002 ULP' ?

    Technically, any rounding that is not IEEE correct is at least 1 ULP.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 18:52:26 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    In BID we would do division by 10 with a reciprocal multiplication that
    handles powers of 5; this way we (i.e. Michael S) can fit 26/27 digits
    into a 64-bit int.

    2^64 < 10^20

    How do you put 26/27 decimal digits in 64 bits. Or do you mean 27
    quinary digits? What would that be good for?

    it takes 3.32 binary digits to encode 10, thus there are only 19.25
    decimal digits in 64-bits.

    Michael's idea was to split the division by a power of ten into two
    parts: A division by a power of 5 and a bitshift for the 2^N.

    If we start with the bitshift (but remember the bits shifted out from
    the bottom, then 5^26 fits into 2^64.

    Does that make sense?

    My point was that you cannot fit 26 encodings representing 0-9 into 64
    bits; it was not about what math is in play.


    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Jan 8 18:54:40 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 21:25:57 2026
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 8 Jan 2026 12:50:32 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Waldek Hebisch wrote:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but
    leaves to software the last step of unpacking 4-bit BCD to
    8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is having the bits in the right places within a wide word.
    I.e. you have 64 bits like this:
    '0123456789abcdef'. You want to convert it to a pair of 64-bit
    numbers: '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'.
    That's a move from the BCD '0123456789abcdef' to the corresponding
    ASCII bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT  DATA 8 UN 0123456789   // Unsigned Numeric 8 nibble field
    OUTPUT DATA 8 UA              // Unsigned Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields
    'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at least,
    get 3 ASCII digits per look-up. On a modern wide core, likely only
    marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or, may be, they have instruction for that as well, but it's
    not in DFP related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation
    on strings of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    given (fast) such a transformation it is very easy to add proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place.


    Intel (and then AMD) added the PDEP (and corresponding PEXT) opcode
    sometime within the last 20 years (probably less than 10?); it is
    perfect for this operation:


    Haswell. Officially launched on June 4, 2013. So 12.5 years ago.
    Time flies.

    ;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
    mov rbx, 0x0f0f0f0f0f0f0f0f    ; byte-wise nybble mask
    pdep rax, rsi, rbx             ; deposit low 8 nybbles -> 8 bytes
    shr rsi, 32
    pdep rdx, rsi, rbx             ; deposit high 8 nybbles -> 8 bytes

    This is sub-5 cycles of latency.

    That's nice.
    I'm not sure if POWER has similar instruction.


    It is also doable with much older CPUs using the permute/byte shuffle
    operation, with a bit more or less latency depending upon where the
    source and destination data resides (SIMD vs regular integer reg).

    Terje



    I don't understand that part. Do you suggest that there are better
    swizzle instructions than the unpack mentioned by Waldek Hebisch?
    So far, I don't see it. Unpack looks to me the most suitable.

    There are at least three ways to do it:

    a) PDEP in 64-bit regs

    b) PSHUFB and nybble masks using SSE/AVX regs

    c) PUNPCKLBW, which expands bytes to words. Do it twice, with a bytewise
    SHR 4 to select the upper nybbles and a mask to keep the lower nybbles
    of the first part (see the sketch below).

    Did you intend to use (c) or is there yet another method?
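
    [A minimal SSE2 sketch of (c): shift, mask, then PUNPCKLBW to
    interleave the low/high nybbles; digit order matches the PDEP
    version (least significant first):]

    #include <stdint.h>
    #include <emmintrin.h>   /* SSE2 */

    /* unpack 16 packed nybbles (8 bytes) into 16 bytes, one nybble each */
    static __m128i nybbles_to_bytes(const uint8_t src[8])
    {
        __m128i v   = _mm_loadl_epi64((const __m128i *)src);
        __m128i m0f = _mm_set1_epi8(0x0f);
        __m128i lo  = _mm_and_si128(v, m0f);                     /* low nybbles  */
        __m128i hi  = _mm_and_si128(_mm_srli_epi16(v, 4), m0f);  /* high nybbles */
        return _mm_unpacklo_epi8(lo, hi);    /* PUNPCKLBW interleave */
    }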

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Jan 8 21:35:16 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59x59 bits
    {a bit more than half of them get 1 ULP at 58x58}. I gave a lot of
    thought to this {~1 year} before deciding that a "Do everything else"
    function unit was "overall" better than a couple of "near miss" FUs.
    So, IMUL comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at
    22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes
    For many of the functions you can do a lot by letting the final
    operation be a merging of the first/largest term, particularly if you
    do that with extended precision.
    I.e. something like fpatan2() works quite nicely this way, just not
    enough for exact rounding.
    You need to combine this with extended-precision range adjustment at
    the end.

    But a single incorrect rounding is 1 ULP all by itself.
    :-)
    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions which
    are exactly rounded.
    It is provably doable for float, in very close to the same cycle count
    as the best libraries in current use; double is "somewhat" harder to
    totally verify/prove.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Jan 8 22:50:36 2026
    From Newsgroup: comp.arch

    On Thu, 8 Jan 2026 21:25:57 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Thu, 8 Jan 2026 12:50:32 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Waldek Hebisch wrote:
    Scott Lurndal <scott@slp53.sl.home> wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 07 Jan 2026 15:24:38 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 06 Jan 2026 19:29:40 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    So, POWER hardware helps a lot converting DPD to BCD, but
    leaves to software the last step of unpacking 4-bit BCD to
    8-bit ASCII.

    That's a fairly simple operation, just adding the proper zone
    digit (3 for ASCII, F for EBCDIC) to the 8-bit byte.


    The hard part is having bits in right places within wide word.
    I.e. you have 64 bits like that:
    '0123456789abcdef'. You want to convert it to pair of 64-bit
    numbers: '0001020304050607', '08090a0b0c0d0e0f'

    The ASCII[*] would be '3031323334353637', '3839414243444546'.
    That's a move from the BCD '0123456789abcdef' to the
    corresponding ASCII bytes.

    [*] Printable version of the BCD input number.

    The B3500 addressed to the digit, so it was a simple move to add
    the zone digit when converting to ASCII (or EBCDIC depending on
    a processor flag). Although 'undigits' (bit patterns 0b1010
    through 0b1111) were not legal in BCD numbers on the B3500
    and adding a zone digit to them didn't make them printable.

    e.g.

    INPUT DATA 8 UN 0123456789 // Unsigned Numeric 8
    nibble field OUTPUT DATA 8 UA // Unsigned
    Alphanumeric 8 byte field

    MVN INPUT(UN), OUTPUT(UA) // yields
    'f0f1f2f3f4f5f6f7f8f9' (EBCDIC)

    If the output field was larger than the input field, leading
    blanks would be added before the number when using MVN. MVA
    would blank pad the output field after the number when the
    output field was larger.

    Without HW help it is not fast. Likely not faster than running the
    respective DPD declets through a look-up table where you, at
    least, get 3 ASCII digits per look-up. On a modern wide core, it is
    likely only marginally faster than converting a BID mantissa.

    The B3500 (1965) did that in hardware;
    I would find it strange if the Power CPU didn't.

    Or, may be, they have instruction for that as well, but it's
    not in DFP related part of the book, so I missed it.

    It was just a flavor of the move instruction in the B3500.

    I am not sure that we are talking about the same thing.

    Probably not, since the ASCII zero character is encoded as 0x30
    instead of the 0x00 you show in the example above.


    IIUC Michael was asking for the following transformation
    on strings of hex digits:

    0123456789abcdef

    into

    000102030405060708090a0b0c0d0e0f

    Given such a (fast) transformation it is very easy to add the proper
    zone bits on modern hardware. One possible approach to the transform
    above would be to do a byte-type unpacking operation (that is, a
    version of the above working on bytes) and then use masking and
    shifting to move the upper bits of each byte to the right place,
    as sketched below.
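    A minimal C sketch of that mask-and-shift unpack (the helper name
    spread_nibbles is mine, not from any library): each nibble of a 64-bit
    packed-BCD value ends up in the low nibble of its own byte, after
    which OR-ing in the 0x30 zone gives ASCII digits.

    #include <stdint.h>
    #include <stdio.h>

    /* Spread the low 8 nibbles of x so each lands in the low nibble
       of one byte: pure shift-and-mask, no SIMD needed. */
    static uint64_t spread_nibbles(uint64_t x)
    {
        x &= 0xFFFFFFFFu;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFu;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFu;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0Fu;
        return x;
    }

    int main(void)
    {
        uint64_t bcd = 0x0123456789abcdefULL;
        /* OR in the ASCII zone 0x3_. Note that the "undigits" a..f
           come out as 0x3a..0x3f, i.e. not printable hex, matching
           the B3500 remark above. */
        uint64_t hi = spread_nibbles(bcd >> 32) | 0x3030303030303030ULL;
        uint64_t lo = spread_nibbles(bcd)       | 0x3030303030303030ULL;
        printf("%016llx %016llx\n", (unsigned long long)hi,
                                    (unsigned long long)lo);
        /* prints 3031323334353637 38393a3b3c3d3e3f */
        return 0;
    }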


    Intel (and then AMD) added the PDEP (and corresponding PEXT) opcode
    sometime within the last 20 years (probably less than 10?); it is
    perfect for this operation:


    Haswell. Officially launched on June 4, 2013. So 12.5 years ago.
    Time runs.

    ;; rsi has 16 nybbles to unpack into 16 bytes (rdx:rax)
    ;; pdep dst, src, mask: deposit src bits into the mask positions
    mov rbx, 0x0f0f0f0f0f0f0f0f
    pdep rax, rsi, rbx      ; low 8 nybbles -> 8 bytes
    shr rsi, 32
    pdep rdx, rsi, rbx      ; high 8 nybbles -> 8 bytes

    This is sub-5 cycles of latency.

    That's nice.
    I'm not sure if POWER has a similar instruction.


    It is also doable with much older CPUs using the permute/byte
    shuffle operation, with a bit more or less latency depending upon
    where the source and destination data resides (SIMD vs regular
    integer reg).

    Terje



    I don't understand that part. Do you suggest that there are better
    swizzle instructions than the unpack mentioned by Waldek Hebisch?
    So far, I don't see any. Unpack looks to me the most suitable.

    There are at least three ways to do it:

    a) PDEP in 64-bit regs

    b) PSHUFB and nybble masks using SSE/AVX regs

    c) PUNPCKLBW which expands bytes to words. Do it twice with a
    bytewise SHR 4 to select the upper nybbles and a mask to keep the
    lower nybbles of the first part (see the sketch below).
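    For concreteness, a compilable SSE2 sketch of method (c) as I read it
    (my code, not Terje's): bytewise nibble split via a 16-bit shift plus
    mask, then PUNPCKLBW to interleave, then OR in the zone.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 8 bytes = 16 packed-BCD digits, most significant pair first. */
        uint8_t in[8] = {0x01,0x23,0x45,0x67,0x89,0x01,0x23,0x45};
        __m128i x   = _mm_loadl_epi64((const __m128i *)in);
        __m128i m   = _mm_set1_epi8(0x0F);
        /* There is no byte shift in SSE2; shift 16-bit lanes and mask. */
        __m128i hi  = _mm_and_si128(_mm_srli_epi16(x, 4), m);
        __m128i lo  = _mm_and_si128(x, m);
        /* PUNPCKLBW: interleave high/low digits into written order. */
        __m128i dig = _mm_unpacklo_epi8(hi, lo);
        __m128i asc = _mm_or_si128(dig, _mm_set1_epi8(0x30));

        char out[17] = {0};
        _mm_storeu_si128((__m128i *)out, asc);
        printf("%s\n", out);   /* 0123456789012345 */
        return 0;
    }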

    Did you intend to use (c) or is there yet another method?

    Terje



    I'd use (c) if (a) is either not available or slow. The latter
    case applies to AMD Zen1/2. Otherwise I'd use (a). I don't see
    circumstances for preferring (b).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Jan 9 01:24:17 2026
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
    {a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    For many of the functions you can do a lot by letting the final
    operation be a merging of the first/largest term, particularly if you do that with extended precision.

    I.e. something like fpatan2() works quite nicely this way, just not
    enough for exact rounding.

    You need to combine this with extended precision range adjustment at the end.


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions which
    are exactly rounded.

    After the final addition, I know 1 of the top 3-bits is a 1, and I
    have 69-bits in the accumulated result. I also know that the poly-
    nomial error is below the 3rd least significant bit.

    It is provably doable for float, in very close to the same cycle count
    as the best libraries in current use; double is "somewhat" harder to
    totally verify/prove.

    I have logic (patented) that allows the FU to raise an UNCERTAIN
    rounding exception, so SW can take over and change 0.5002 into
    0.5000 at the cost of the exception and running the long-winded
    SW correctly-rounded subroutine. I expect this to be used only
    during verification and on the 3 machines owned by Kahan, Coonen,
    and someone else I forgot.
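    A software analogue of that uncertain-rounding test is Ziv's strategy:
    evaluate with extra precision and fall back to a slow correctly-rounded
    path only when the estimate sits too close to a rounding boundary. A
    minimal sketch, assuming long double is wider than double (true on
    x87, not on all targets); crsin_slow() is a hypothetical stand-in for
    the "long-winded SW" routine, not a real library call.

    #include <math.h>

    static double crsin_slow(double x)
    {
        /* stand-in: a real fallback would re-evaluate with arbitrary
           precision (e.g. MPFR) and round correctly */
        return (double)sinl((long double)x);
    }

    double sin_checked(double x)
    {
        long double y   = sinl((long double)x);  /* extended estimate */
        double      r   = (double)y;             /* rounded to double */
        long double ulp = nextafter(fabs(r), INFINITY) - fabs(r);
        long double err = fabsl(y - (long double)r);
        /* If the residual is within a tiny guard band of half an ulp,
           the estimate cannot decide the rounding: take the trap. */
        if (fabsl(err - ulp / 2) < ulp * 0x1p-10L)
            return crsin_slow(x);
        return r;
    }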

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Jan 9 16:32:30 2026
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd-s, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Jan 10 18:02:46 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:
    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions which
    are exactly rounded.

    After the final addition, I know 1 of the top 3-bits is a 1, and I
    have 69-bits in the accumulated result. I also know that the poly-
    nomial error is below the 3rd least significant bit.

    It is provably doable for float, in very close to the same cycle count
    as the best libraries in current use; double is "somewhat" harder to
    totally verify/prove.

    I have logic (patented) that allows the FU to raise an UNCERTAIN
    rounding exception, so SW can take over and change 0.5002 into
    0.5000 at the cost of the exception and running the long-winded
    SW correctly-rounded subroutine. I expect this to be used only
    during verification and on the 3 machines owned by Kahan, Coonen,
    and someone else I forgot.



    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sat Jan 10 23:21:40 2026
    From Newsgroup: comp.arch

    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it. I had
    a short e-mail exchange with the main DFP advocate at IBM. My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers. And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 00:03:36 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>, anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work acceptably with it. Lots of organisations don't see anything obvious for them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it. I had
    a short e-mail exchange with the main DFP advocate at IBM.

    Mike Cow<something>shaw ??

    My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers. And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 00:33:22 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious
    for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did try
    to interest AMD in the idea in the early days of x86-64, but they
    didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128 in
    order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I would hope for half throughput and say 1-2 clocks more latency
    for addition. For multiplication I would expect 1/4 throughput
    and maybe twice the latency of binary64.

    As of today, there is double-double. IIRC double-double addition
    needs 6 double additions, that is way too much. AFAICS
    quantifying double-double multiplication performance is more
    tricky: there is a relatively easy implementation using
    64-bit multiply-add (it takes advantage of the fact that multiply-add
    can deliver low-order bits that only contribute to rounding in
    normal FP multiply), but this implements normal multiply in
    terms of multiply-add. Implementing multiply-add takes
    more effort and implementing multiply using only multiply
    takes even more effort.
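    For reference, the six-addition figure is the canonical TwoSum building
    block (Knuth), which turns two doubles into a double-double; a minimal
    C sketch:

    #include <stdio.h>

    /* TwoSum: *s + *e == a + b exactly, *s is the rounded sum and *e
       the rounding error. Exactly 6 FP additions/subtractions. */
    static void two_sum(double a, double b, double *s, double *e)
    {
        double sum   = a + b;            /* 1 */
        double bb    = sum - a;          /* 2 */
        double err_a = a - (sum - bb);   /* 3, 4 */
        double err_b = b - bb;           /* 5 */
        *s = sum;
        *e = err_a + err_b;              /* 6 */
    }

    int main(void)
    {
        double s, e;
        two_sum(1.0, 0x1p-60, &s, &e);
        printf("s=%a e=%a\n", s, e);     /* s=0x1p+0 e=0x1p-60 */
        return 0;
    }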

    Anyway, to make sense hardware should be faster than double-double.

    Anecdote.
    A few months ago I tried to design very long decimation filters with stop
    band attenuation of ~160 dB.
    Matlab's implementation of the Parks–McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black magic)
    was not up to the task. The Gnu Octave implementation was somewhat worse
    yet.
    When I started to investigate the reasons I found out that there were
    actually two of them, both related to insufficient precision of the
    series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT engine
    for N around 32K.
    The second was solving a system of linear equations for N around 1000 and
    a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to the solution.

    For DCT (FFT) I went for full re-implementation at higher precision.

    Hmm, I did estimates for FFT and my result was that in the classic
    implementation each layer of butterflies essentially additively
    contributes to the L^2 error. So a 32K point radix-2 FFT (log2(32768)
    = 15 layers) has 15 times bigger L^2 error than a single layer of
    butterflies, which has error about 4 times machine epsilon. With a
    radix-4 FFT the error of a single butterfly is larger, but the number
    of layers is halved and the result is similar. So, in terms of L^2
    error a 32K point FFT needs very little extra precision, essentially
    6 bits. But Remez works in terms of the supremum norm and at 32K
    points that may need an extra 8 bits. So it is possible that the
    80-bit format would have enough accuracy for your purpose.

    I looked at FFT as one of the possible ways to implement convolution
    of integer sequences with an exact result. Alas, double precision
    computation is good only for about 20 bits for relatively short
    sequences and less for longer ones. It seems that integer-only
    computation is much faster. Fast 128-bit floating point would
    shift balance towards floating point, but probably not enough
    to beat integer computations.

    For Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part in DP. Quad-precision was applied only during final
    solver stages - forward propagation, back propagation, calculation of residual error vector and repetition of forward and back propagation.
    All those parts are O(N**2). That modification was sufficient to
    improve precision of result almost to the best possible in DP FP format.
    And sufficient for good convergence of the Parks–McClellan algorithm.

    Yes, as long as your system is reasonably well conditioned it is
    easy to improve accuracy in a postprocessing step. OTOH system
    may be so badly conditioned that solving in double precision leads
    to catastrophic errors while solving in higher precision works
    fine.
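    What Michael describes is classic mixed-precision iterative refinement:
    keep the O(N^3) LU factorization in double and spend the wider format
    only on the O(N^2) residual and re-solve. A sketch, with long double
    standing in for binary128 (gcc users would use __float128/libquadmath);
    lu_solve() is a hypothetical helper that forward/back-substitutes with
    an existing factorization, not a real library function.

    #include <stddef.h>

    typedef long double wide_t;   /* stand-in for binary128 */

    void lu_solve(size_t n, const double *LU, const int *piv,
                  const double *rhs, double *x);   /* assumed given */

    void refine(size_t n, const double *A, const double *LU,
                const int *piv, const double *b, double *x, int steps)
    {
        enum { MAXN = 2048 };             /* fixed cap for the sketch */
        double r[MAXN], d[MAXN];
        for (int it = 0; it < steps; it++) {
            for (size_t i = 0; i < n; i++) {
                wide_t acc = (wide_t)b[i];     /* residual b - A*x,   */
                for (size_t j = 0; j < n; j++) /* accumulated wide    */
                    acc -= (wide_t)A[i*n + j] * (wide_t)x[j];
                r[i] = (double)acc;
            }
            lu_solve(n, LU, piv, r, d);   /* reuse the double LU  */
            for (size_t i = 0; i < n; i++)
                x[i] += d[i];             /* apply the correction */
        }
    }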

    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when running
    on rather old hardware. And it's not like calculations here were not
    heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of
    worrying that they would be too slow.

    Well, I have an arbitrary precision implementation of the LLL algorithm.
    It works, but it is about 100 times slower than using double precision
    math. The trouble is, in the worst case double precision LLL in
    dimension 53 may fail to converge. On tame data LLL is expected
    to work in higher dimensions, but even on tame data at dimension
    about 250 double precision LLL is expected to fail. In a sense
    this is a no-win situation, as the needed number of bits grows linearly
    with dimension (both worst case and tame case). One can try to
    use double precision when it works. But it is frustrating
    how much effort one needs to spend to get better speed using
    the FPU. And especially, there is a contrast with integer math,
    where it is relatively easy to get higher precision when
    needed. But for integer math RISC-V tries to change this by not
    providing a carry bit. And, AFAICS SSE/AVX do not provide the high-
    order bits of multiplication (no vectored MULHI instruction), so
    multiprecision multiplies must go through the scalar multiplier.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 00:59:51 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Wed, 07 Jan 2026 20:05:17 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    John Dallman wrote:
    In article <20260107133424.00000e99@yahoo.com>,
    already5chosen@yahoo.com (Michael S) wrote:

    I already asked you a couple of years ago how fast you want
    binary128 in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because with non-answers or answers like
    that nobody thinks that you are serious?

    I don't know much about hardware design. What is realistic for
    hardware binary128?

    Sub-10 cycles fmul/fadd/fsub seems very doable?

    Mitch?

    Assuming 128-bit operands are delivered in 1 cycle and 128-bit
    results are delivered in 1 cycle::


    If we are talking about SIMD of the same width (measured in bits) as
    SP/DP SIMD on the given general purpose machine, i.e. on modern Intel
    and AMD server cores, 2 FPUs with 4 128-bit lanes each, then I think
    that fully pipelined binary128 operations are a non-starter, because it
    would blow your power and thermal budget.

    I agree, however a single 128-bit FPU would fit inside a reasonable
    power budget.

    One full-width result (i.e. 8
    binary128 results) every 2 cycles sounds somewhat more realistic.

    Likely still over a reasonable power budget.

    After all, in a general-purpose CPU binary128, if implemented at all, is
    a proverbial tail that can't be allowed to wag the dog.

    We build (and call) our current machines 64-bits because that is the
    size of the register files (not including SIMD/Vector) and because
    we can run the scalar unit at rated clock frequency (non SIMD/Vector) essentially continuously.

    Once we step over the scalar width, power goes up 2×-4× and we get a
    couple of hundred cycles before frequency throttling. Thus, we cannot
    in general, run SIMD/Vector at rated frequency continuously.

    I understand that multipliers are big and power hungry. I know
    almost nothing about permute unit, but it too looks like big
    and power hungry thing. But how bad is it when one is doing
    simple operations say mostly in registers.

    Nor can
    we, at the present time, build a memory system that can properly feed a
    SIMD/Vector RF so that one can use all of the available calculation lanes.

    There is matrix multiply which is doing n^3 multiplies on n^2
    data. I need polynomial multiplication, that is n^2 multiplies
    on size n data. There are real computations where a piece or
    two pieces of data go through several steps. So there are a
    lot of compute-intensive problems where processing units can
    do work on data in registers or from L1 cache.

    So if the compute units can do the work, it is still useful,
    even if other problems are memory bound.

    {HBM is approaching this point, however--it becomes
    more like B-memory from the CRAY-2 than main memory for applications
    that can use that much B-memory effectively.}

    OTOH, if we define our binary128 to use only the least-significant 128-bit
    lane of our 512-bit register and only build b128 capabilities into one
    of our pair of FPUs then full pipelining (i.e. 1 result per cycle) looks
    like a good choice, at least from power/thermal perspective. That is,
    as long as designers found a way to avoid a hot spot.

    We could, instead, treat them as pairs of GPRs--like we did in Mc 88120
    and still not need SIMD/Vectors.

    128-bit Fadd/Fsub should be no worse than 1 cycle longer than 64-bit.

    128-bit Fmul requires that the multiplier tree be 64×64 instead of
    53×53 (1.46× bigger tree, 1.22× bigger FU), and would/should be 3-4
    cycles longer than 64-bit Fmul. If you wanted to be "really clever"
    you could use a 59×59 tree and the FU is only 1.12× bigger; but here
    you could not use the tree for Integer MUL.

    Terje



    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 01:14:26 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
    {a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    It is clear that when your rounding is different than "IEEE correct"
    rounding, then there is 1 ULP difference between your result and
    IEEE rounding. But claims like max 0.5002 ulp mean that there
    is at most 0.5002 ulp difference between the true result and the result
    delivered by your FPU. Or do you really mean that there may be
    1 ULP difference between the true result and your FPU?
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 11:21:27 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it. I had
    a short e-mail exchange with the main DFP advocate at IBM.

    Mike Cow<something>shaw ??

    Mike Cowlishaw, yes.

    My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers. And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.

    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Jan 11 12:52:03 2026
    From Newsgroup: comp.arch

    Waldek Hebisch wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
    {a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    It is clear that when your rounding is different than "IEEE correct"
    rounding, then there is 1 ULP difference between your result and
    IEEE rounding. But claims like max 0.5002 ulp mean that there
    is at most 0.5002 ulp difference between the true result and the result
    delivered by your FPU. Or do you really mean that there may be
    1 ULP difference between the true result and your FPU?

    It is well established that when you measure the accuracy of special
    functions, you compare against the perfect result, which is never more
    than 0.5 ulp away from the arbitrary/infinitely precise exact result.
    Stating that some algorithm delivers 0.5002 ulp means that with the
    worst possible input, the before-rounding result is 0.0002 ulp away from the real/exact result, and in such a way that rounding will go in the
    wrong direction.
    It is perfectly OK to be 0.4 ulp wrong as long as you are within the
    correct 0.5 ulp wide interval, but in reality the only way to deliver
    results like Mitch is by being nearly exact everywhere.
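    Measured concretely: the error of a double result f against a much more
    accurate reference F, in ulps of f, is |f - F| / ulp(f). A tiny sketch
    (helper name mine; assumes long double is wider than double):

    #include <math.h>
    #include <stdio.h>

    /* Error of f relative to the reference F, in ulps of f. */
    static double ulp_error(double f, long double F)
    {
        double ulp = nextafter(fabs(f), INFINITY) - fabs(f);
        return (double)(fabsl((long double)f - F) / (long double)ulp);
    }

    int main(void)
    {
        double x = 0.1;
        printf("%.6f ulp\n", ulp_error(sin(x), sinl((long double)x)));
        return 0;
    }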
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 11:49:50 2026
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd-s, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps, so the computational
    cost is much higher than numerics (even arbitrary precision
    numerics tend to be much faster). As a little example
    in a slightly different spirit, one can try to run an approximate
    root finding procedure for polynomials in rational
    arithmetic. This solves the problem of potential numerical
    instability, but leads to large fractions which are
    much more costly than arbitrary precision floating point
    (which with some care deals with instability too).
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 11 14:31:54 2026
    From Newsgroup: comp.arch

    On Sun, 11 Jan 2026 00:33:22 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:
    Michael S <already5chosen@yahoo.com> wrote:
    On Tue, 6 Jan 2026 22:06 +0000 (GMT Standard Time)
    jgd@cix.co.uk (John Dallman) wrote:

    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of
    the gcc support shows that "build it and they will come" does
    not work out for DFP.

    The world has got very used to IEEE BFP, and has solutions that
    work acceptably with it. Lots of organisations don't see anything
    obvious for them in DFP.

    The thing I'd like to try out is fast quad-precision BFP. For the
    field I work in, that would make some things much simpler. I did
    try to interest AMD in the idea in the early days of x86-64, but
    they didn't bite.

    John

    I already asked you a couple of years ago how fast you want binary128
    in order to consider it fast enough.
    IIRC, you either avoided the answer completely or gave a totally
    unrealistic answer like "the same as binary64".
    Maybe nobody bites because with non-answers or answers like that
    nobody thinks that you are serious?

    I would hope for half throughput and say 1-2 clocks more latency
    for addition.
    That sounds doable from a power and thermal perspective, but does not
    sound sufficiently important for anybody to bother.
    Having addition at half throughput of binary64 instead of quarter would
    not sell you more chips.
    For multiplication I would expect 1/4 throughput
    and maybe twice the latency of binary64.

    As of today, there is double-double. IIRC double-double addition
    needs 6 double additions, that is way too much. AFAICS
    quantifying double-double multiplication performance is more
    tricky: there is a relatively easy implementation using
    64-bit multiply-add (it takes advantage of the fact that multiply-add
    can deliver low-order bits that only contribute to rounding in
    normal FP multiply), but this implements normal multiply in
    terms of multiply-add. Implementing multiply-add takes
    more effort and implementing multiply using only multiply
    takes even more effort.

    Are there compilers that are able to vectorize double-double? If not,
    any talk about throughput is pointless.
    Anyway, to make sense hardware should be faster than double-double.

    I disagree. Numeric properties of binary128 are better than
    double-double. And far easier to analyze, both deeply and applying
    rules of thumb.
    As far as I am concerned, the low performance limit for hardware
    implementation of binary128 is set not by double-double, but by
    competent implementation of binary128 with integer math, including
    competent ABI. Current soft binary128 in gcc is ~ factor of two away
    from that in add/mul/fma, larger factor in div, larger yet in sqrt.
    As to ABI incompetence in this case, it is hard to quantify.
    Anecdote.
    A few months ago I tried to design very long decimation filters with
    stop band attenuation of ~160 dB.
    Matlab's implementation of the Parks–McClellan algorithm (a customized
    variation of Remez Exchange spiced with a small portion of black
    magic) was not up to the task. The Gnu Octave implementation was
    somewhat worse yet.
    When I started to investigate the reasons I found out that there
    were actually two of them, both related to insufficient precision
    of the series of DP FP calculations.
    The first was the Discrete Cosine Transform and the underlying FFT
    engine for N around 32K.
    The second was solving a system of linear equations for N around 1000
    and a little more.
    In both cases the precision of DP FP was perfectly sufficient both for
    inputs and for outputs. But errors accumulated in intermediate
    calculations caused troubles.

    In both cases quad-precision FP was key to solution.

    For DCT (FFT) I went for full re-implementation at higher
    precision.

    Hmm, I did estimates for FFT and my result was that in the classic
    implementation each layer of butterflies essentially additively
    contributes to the L^2 error. So a 32K point radix-2 FFT has 15 times
    bigger L^2 error than a single layer of butterflies, which has error
    about 4 times machine epsilon. With a radix-4 FFT the error of a single
    butterfly is larger, but the number of layers is halved and the result
    is similar. So, in terms of L^2 error a 32K point FFT needs very
    little extra precision, essentially 6 bits.
    My estimate was 7.5 bits.
    But Remez works
    in terms of the supremum norm and at 32K points that may need an extra
    8 bits. So it is possible that the 80-bit format would have enough
    accuracy for your purpose.

    Yes, 80-bit would suffice.
    But since it was not a time-critical part, I chose 128 bits.
    I looked at FFT as one of the possible ways to implement convolution
    of integer sequences with an exact result. Alas, double precision
    computation is good only for about 20 bits for relatively short
    sequences and less for longer ones. It seems that integer-only
    computation is much faster. Fast 128-bit floating point would
    shift balance towards floating point, but probably not enough
    to beat integer computations.

    For Linear Solver, I left LU decomposition, which happens to be the
    heavy O(N**3) part in DP. Quad-precision was applied only during
    final solver stages - forward propagation, back propagation,
    calculation of residual error vector and repetition of forward and
    back propagation. All those parts are O(N**2). That modification
    was sufficient to improve precision of result almost to the best
    possible in DP FP format. And sufficient for good convergence of the
    Parks–McClellan algorithm.

    Yes, as long as your system is reasonably well conditioned it is
    easy to improve accuracy in a postprocessing step. OTOH system
    may be so badly conditioned that solving in double precision leads
    to catastrophic errors while solving in higher precision works
    fine.

    That is part of the black magic that I mentioned above. The Parks–McClellan
    algorithm works in the acos(x) domain. It leads to better conditioned
    linear systems than when doing Remez as taken from math books. At least
    it's true for the sort of filters that are suitable for decimation.
    So, what is the point of my anecdote?
    The speed of quad-precision FP was never an obstacle, even when
    running on rather old hardware. And it's not like calculations here
    were not heavy. They were heavy all right, thank you very much.
    Using quad-precision only when necessary helped.
    But what helped more is not being hesitant. Doing things instead of worrying that they would be too slow.

    Well, I have arbitrary precision implementation of LLL algorithm.
    It works, but it is about 100 times slower than using double precision
    math.
    My point is that John should measure first. Only after measuring does he
    have full rights to cry "Slow!".
    The trouble is, in the worst case double precision LLL in
    dimension 53 may fail to converge. On tame data LLL is expected
    to work in higher dimensions, but even on tame data at dimension
    about 250 double precision LLL is expected to fail. In a sense
    this is a no-win situation, as the needed number of bits grows linearly
    with dimension (both worst case and tame case). One can try to
    use double precision when it works. But it is frustrating
    how much effort one needs to spend to get better speed using
    the FPU. And especially, there is a contrast with integer math,
    where it is relatively easy to get higher precision when
    needed. But for integer math RISC-V tries to change this by not
    providing a carry bit. And, AFAICS SSE/AVX do not provide the high-
    order bits of multiplication (no vectored MULHI instruction), so
    multiprecision multiplies must go through the scalar multiplier.

    Vectored MULHI exists in SSE/AVX, but it is intended for image
    processing and for low-end audio processing (16-bit input and
    output). So it does not help.
    They have width-doubling multiplication that is closer to your need
    (look for PMULUDQ). It is still rather narrow (32x32=64). However on
    modern high-end Intel and AMD it provides 2x8 = 16 multiplications per
    clock, so at least potentially it has higher bandwidth than a
    single 64x64=128-bit multiplication in the non-SIMD domain.
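    A quick demonstration of PMULUDQ via intrinsics: each _mm_mul_epu32
    takes the even 32-bit lanes of its inputs and produces full 64-bit
    products, which is the widening multiply a multiprecision kernel wants.

    #include <emmintrin.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        __m128i a = _mm_set_epi32(0, -1, 0, -1);  /* lanes 2,0 = 0xFFFFFFFF */
        __m128i b = _mm_set_epi32(0, -1, 0, 2);
        __m128i p = _mm_mul_epu32(a, b);          /* two 64-bit products */

        uint64_t out[2];
        _mm_storeu_si128((__m128i *)out, p);
        printf("%016llx %016llx\n", (unsigned long long)out[1],
                                    (unsigned long long)out[0]);
        /* fffffffe00000001 = (2^32-1)^2,
           00000001fffffffe = 2*(2^32-1)  */
        return 0;
    }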
    It seems the most serious problem for attempts to use AVX/AVX512 for
    very high precision integer math is the absence of support for carry
    chains for items wider than 64 bits.
    However there are a few interesting ideas for how to deal with that
    limitation by means of speculation and of replacement of data
    dependencies with control dependencies. The core idea is that a carry
    caused by a carry is extremely rare, so it can be profitably predicted as
    not happening. It normally does not matter how slow the fix is when
    it happens nevertheless.
    For more concrete examples you can look at the discussion of 3-way
    addition of 64Kbit integers that happened here a few months ago.
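    A scalar sketch of that speculation idea (helper name and layout mine):
    pass 1 adds all limbs independently (the part SIMD can do), recording
    which limbs generate a carry and which would merely propagate one
    (sum == all-ones, the rare "carry caused by carry" case); pass 2
    resolves the carries branch-free.

    #include <stdint.h>
    #include <stddef.h>

    void spec_add(uint64_t *dst, const uint64_t *a,
                  const uint64_t *b, size_t n)
    {
        enum { MAXN = 1024 };              /* fixed cap for the sketch */
        uint8_t gen[MAXN], prop[MAXN];

        for (size_t i = 0; i < n; i++) {   /* pass 1: independent adds */
            uint64_t s = a[i] + b[i];
            gen[i]  = s < a[i];            /* limb generates a carry   */
            prop[i] = (s == ~(uint64_t)0); /* limb would pass one on   */
            dst[i]  = s;
        }
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {   /* pass 2: resolve carries  */
            dst[i] += carry;
            carry = gen[i] | (prop[i] & carry);
        }
        /* A truly speculative version would skip pass 2 unless some
           prop[] bit sits next to a carry, as described above. */
    }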
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 11 15:01:16 2026
    From Newsgroup: comp.arch

    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions which
    are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    I would guess that they are mostly software and hardware verification
    people rather than people that use transcendental functions in
    engineering and physical calculations.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd then the approximate
    function f(x) is also even or odd.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)


    In practice, it's probably unlikely to have these invariants preserved
    when the RMS error is 0.75 ULP, but for RMS error = 0.60-0.65 ULP I don't
    see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near the edges of implementation-specific ranges where one
    has to be careful. (A quick brute-force check of both invariants is
    sketched below.)
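    A brute-force probe of both invariants over a monotonic stretch, with
    libm's sin as the stand-in implementation (my test code, nobody's
    library):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int bad_odd = 0, bad_mono = 0;
        /* 1e6 ulp-steps from 0.25 stay far below pi/2, so sin is
           strictly increasing over the whole scanned interval. */
        double x = 0.25;
        for (int i = 0; i < 1000000; i++) {
            double x2 = nextafter(x, 2.0);
            if (sin(-x) != -sin(x)) bad_odd++;   /* invariant 1 */
            if (sin(x2) < sin(x))   bad_mono++;  /* invariant 2 */
            x = x2;
        }
        printf("odd: %d, monotonicity: %d violations\n",
               bad_odd, bad_mono);
        return 0;
    }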

    It is provably doable for float, in very close to the same cycle
    count as the best libraries in current use; double is "somewhat"
    harder to totally verify/prove.

    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Jan 11 08:38:36 2026
    From Newsgroup: comp.arch

    On 1/10/2026 3:21 PM, Waldek Hebisch wrote:
    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it.

    While I have no personal knowledge of this, I don't doubt it.


    I had
    a short e-mail exchange with the main DFP advocate at IBM. My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers.

    Or perhaps (again, no personal knowledge - just speculation) that
    supporting an additional data type in the IBM COBOL (and, for what it's
    worth, PL/1) compilers is easier if there was hardware support for it.

    And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    For the existing Z series base, I suspect anything related to C++ is not significant, i.e. about as important as DFP is to the typical C++ user. :-)


    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.

    Yes.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 18:07:53 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1ULP when the multiplier tree is 59×59-bits
    {a bit more than ½ of them get 1ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    It is clear that when your rounding is different than "IEEE correct"
    rounding, then there is 1 ULP difference between your result and
    IEEE rounding. But claims like max 0.5002 ulp mean that there
    is at most 0.5002 ulp difference between the true result and the result
    delivered by your FPU. Or do you really mean that there may be
    1 ULP difference between the true result and your FPU?

    It means I make a single IEEE rounding error once every several thousand calculations; AND I can achieve this in all IEEE rounding modes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 18:08:51 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    John Dallman <jgd@cix.co.uk> wrote:
    In article <2026Jan5.100825@mips.complang.tuwien.ac.at>,
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Possibly. But the lack of takeup of the Intel library and of the
    gcc support shows that "build it and they will come" does not work
    out for DFP.

    The world has got very used to IEEE BFP, and has solutions that work
    acceptably with it. Lots of organisations don't see anything obvious for
    them in DFP.

    AFAICS DFP exists as a standard only because IBM pushed it. I had
    a short e-mail exchange with the main DFP advocate at IBM.

    Mike Cow<something>shaw ??

    Mike Cowlishaw, yes.

    Thanks: my memory had it as Cowlingshaw--which I knew was wrong.

    My point
    was that a purely software implementation of his decimal benchmark had
    perfectly adequate performance. His answer was that he knows this,
    but that was hand-written code that normal users would not write.
    Compilers for Cobol could generate such code, but apparently no
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers. And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    Anyway, he and IBM succeeded in pushing the DFP standard, but adoption
    is as I and other folks predicted.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 18:11:00 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possiblity here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd, but if you try to save on
    gcd-s and work with unsimplified fractions you may get trenendous
    blowup (like million digit numbers in what should be reasonably
    small computation).
    - general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    sytem, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps,

    Between 2×n+6 and 2×n+13, which is 3-9 bits larger than the
    next higher precision.

    so the computational
    cost is much higher than numerics (even arbitrary precision
    numerics tend to be much faster). As a little example
    in a slightly different spirit, one can try to run an approximate
    root finding procedure for polynomials in rational
    arithmetic. This solves the problem of potential numerical
    instability, but leads to large fractions which are
    much more costly than arbitrary precision floating point
    (which with some care deals with instability too).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 18:18:00 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions which
    are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    The problem here is that even when one gets all the rounding correct,
    one has still lost various algebraic identities.

    CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0
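    Easy to see even with an ordinary libm standing in for CRSIN/CRCOS:
    each correctly rounded result carries up to 0.5 ulp of error, and the
    squares and the sum round again, so the identity need not hold
    bit-exactly. A quick probe:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int off = 0;
        for (int i = 1; i <= 1000000; i++) {
            double x = i * 1e-3;
            double s = sin(x), c = cos(x);
            if (s * s + c * c != 1.0) off++;   /* identity lost? */
        }
        printf("identity missed in %d of 1000000 cases\n", off);
        return 0;
    }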

    I would guess that they are mostly software and hardware verification
    people rather than people that use transcendental functions in
    engineering and physical calculations.

    Numerical people, almost never engineers, physicists, or chemists.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd then the approximate
    function f(x) is also even or odd.

    Odd functions need to be monotonic around zero.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)

    Small scale Monotonicity.


    In practice, it's probably unlikely to have these invariants preserved
    when the RMS error is 0.75 ULP, but for RMS error = 0.60-0.65 ULP I don't
    see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near edges of implementation-specific ranges where one
    has to be careful.

    It is provably doable for float, in very close to the same cycle
    count as the best libraries in current use; double is "somewhat"
    harder to totally verify/prove.

    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Jan 11 20:50:04 2026
    From Newsgroup: comp.arch

    On Sun, 11 Jan 2026 18:18:00 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions
    which are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    The problem here is that even when one gets all the rounding
    correct, one has still lost various algebraic identities.

    CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0

    I would guess that they are mostly software and hardware
    verification people rather than people that use transcendental
    functions in engineering and physical calculations.

    Numerical people, almost never engineers, physicists, or chemists.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd then the approximate
    function f(x) is also even or odd.

    Odd functions need to be monotonic around zero.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)

    Small scale Monotonicity.


    Yes, that's a better name.
    I just wanted to express it as simple non-equality conditions and made
    it too simple and stronger than necessary.
    In fact I would not complain if my conditions do not hold when F(x) has
    an extremum in between x and x+ULP. That is, it's nice if the condition
    holds here as well, but it is relatively less important than holding on
    monotonic intervals.


    In practice, it's probably unlikely to have these invariants
    preserved when the RMS error is 0.75 ULP, but for RMS error = 0.60-0.65
    ULP I don't see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near edges of implementation-specific ranges where one
    has to be careful.

    It is provably doable for float, in very close to the same cycle
    count as the best libraries in current use; double is "somewhat"
    harder to totally verify/prove.

    Terje




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Jan 11 19:03:48 2026
    From Newsgroup: comp.arch

    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored the fact that even with hardware support there still needs
    to be support in compilers.

    Or perhaps (again, no personal knowledge - just speculation) that
    supporting an additional data type in the IBM COBOL (and, for what it's
    worth, PL/1) compilers is easier if there was hardware support for it.

    Having written a few compilers, I can say that it is equally easy
    within epsilon to emit a DFADD instruction as the equivalent of CALL
    DFADD. I could believe it's politically easier, hey we'll look dumb if
    we announce this swell DFP feature and our own compilers don't use it.

    And that C++ templates allow
    fast fixed-point decimals as a library feature. If there is no
    such library (and I am not aware of one) it is due to low
    demand and not due to difficulty.

    For the existing Z series base, I suspect anything related to C++ is not
    significant, i.e. about as important as DFP is to the typical C++ user. :-)

    Maybe. Remember that IBM has full support for linux on Z. There used
    to be a pricing hack (may still be) where you could buy a lower cost
    linux-only Z series processor which was just a regular processor with
    a microcode tweak to keep it from booting z/OS.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Jan 11 12:40:55 2026
    From Newsgroup: comp.arch

    On 1/11/2026 10:07 AM, MitchAlsup wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> posted:
    -------------------

    My Transcendentals get to 1 ULP when the multiplier tree is 59×59 bits
    {a bit more than ½ of them get 1 ULP at 58×58}. I gave a lot of thought
    to this {~1 year} before deciding that a "Do everything else" function
    unit was "overall" better than a couple of "near miss" FUs. So, IMUL
    comes in at 4 cycles, FMUL at 4 cycles, FDIV at 17, SQRT at 22, and
    IDIV at 25. The fast IMUL makes up a lot for the "size" of the FU.

    I forgot to add that my transcendentals went from <just barely>
    faithfully rounded 0.75ULP RMS to 5.002ULP RMS when the tree was
    expanded.

    You obviously intended 0.5002 ulp, so you have about 11-13 guard bits to
    get the rounding correct?

    64-53 = 11 yes

    But a single incorrect rounding is 1 ULP all by itself.

    It is clear that when your rounding is different from "IEEE correct"
    rounding, then there is 1 ULP difference between your result and
    IEEE rounding. But claims like max 0.5002 ulp mean that there
    is at most 0.5002 ulp difference between the true result and the result
    delivered by your FPU. Or do you really mean that there may be
    1 ULP difference between the true result and your FPU?

    It means I make a single IEEE rounding error once every several thousand calculations; AND I can achieve this in all IEEE rounding modes.

    Here is some older experimental code of mine that is HYPER sensitive to floating point errors. I was going to try another method, but I forgot
    the damn name of it. Uhhh, wait. Unums?

    https://groups.google.com/g/comp.lang.c++/c/bB1wA4wvoFc/m/OTccTiXLAgAJ


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Jan 11 22:11:55 2026
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Sun, 11 Jan 2026 18:18:00 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a suggested/recommended option, a set of transcendental functions
    which are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    The problem here is that even when one gets all the rounding
    correct, one has still lost various algebraic identities:

    CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0

    I would guess that they are mostly software and hardware
    verification people rather than people that use transcendental
    functions in engineering and physical calculations.

    Numerical people, almost never engineers, physicists, or chemists.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd, then the approximate function
    f(x) is also even or odd.

    Odd functions need to be monotonic around zero.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)

    Small scale Monotonicity.


    Yes, that's a better name.
    I just wanted to express it as simple non-equality conditions and made
    it too simple and stronger than necessary.
    In fact I would not complain if my conditions do not hold when F(x) has an
    extremum between x and x+ULP. That is, it's nice if the condition holds
    here as well, but it is relatively less important than holding on
    monotonic intervals.

    Consider COS(x) near 0.0

    The transition from 1.0 to .99999999 and later to 0.99999998 (in both
    directions) is small scale monotonic. AND it is exactly at this
    transition where my rounding takes the biggest number of hits (incorrect
    roundings).

    Seen in binary, one has a prerounded result of:

    0.1111111111 1111111111 1111111111 1111111111 1111111111 1111 and digits

    behind where rounding transpires. If those digits start with 01111111111
    or 1000000000 then we are in the situation where we cannot know if we can
    choose a correct rounding; the next term of the polynomial could sway the
    balance. J.-M. Muller, chapter 11, shows that one might need as many as
    2N+13 bits in order to get the rounding "correct". This must include
    polynomial error, arithmetic error, and certain boundary conditions.

    If rounding that begins 01 contains a second 0, correct rounding happens.
    If rounding that begins 10 contains a second 1, correct rounding happens.

    And it is exactly at these points that
    a) while the result remains monotonic, the point of change can be "off"
    by a small number of ULPs
    b) when the slope is shallow, one can get several rounding errors in a row
    without losing the property of monotonicity or overall RMS.
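
    The 01.../10... rule is mechanical enough to show in a few lines. A toy
    sketch (my illustration of the rule as stated, not the actual FU logic):
    _____________________________________
    // Given the G guard bits sitting below the rounding point, report
    // whether round-to-nearest is already decided, or whether the next
    // polynomial term could still sway it (patterns 0111...1 / 1000...0).
    #include <cstdint>
    #include <cstdio>

    bool rounding_decidable(uint32_t guard, int G) {
        uint32_t rest_mask = (1u << (G - 1)) - 1; // bits after the first
        uint32_t top = (guard >> (G - 1)) & 1u;   // first guard bit
        uint32_t rest = guard & rest_mask;
        if (top == 0) return rest != rest_mask;   // 0111...1 stays undecided
        return rest != 0;                         // 1000...0 stays undecided
    }

    int main() {
        const int G = 11;                             // e.g. 64-53 guard bits
        printf("%d\n", rounding_decidable(0x3FF, G)); // 01111111111 -> 0 (hard)
        printf("%d\n", rounding_decidable(0x400, G)); // 10000000000 -> 0 (hard)
        printf("%d\n", rounding_decidable(0x3FE, G)); // 01111111110 -> 1 (safe)
        return 0;
    }
    _____________________________________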



    In practice, it's probably unlikely to have these invariants
    preserved when the RMS error is 0.75 ULP, but for an RMS error of
    0.60-0.65 ULP I don't see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near the edges of implementation-specific ranges where one
    has to be careful.

    It is provably doable for float, in very close to the same cycle
    count as the best libraries in current use; double is "somewhat"
    harder to totally verify/prove.

    Terje




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Sun Jan 11 22:30:14 2026
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> wrote:
    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions which
    are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    I would guess that they are mostly software and hardware verification
    people rather than people that use transcendental functions in
    engineering and physical calculations.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd, then the approximate function
    f(x) is also even or odd.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)


    In practice, it's probably unlikely to have these invariants preserved
    when the RMS error is 0.75 ULP, but for an RMS error of 0.60-0.65 ULP I
    don't see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near the edges of implementation-specific ranges where one
    has to be careful.

    This property is independent of the magnitude of the error. For
    example, a nominally "double" routine may deliver correctly rounded
    single-precision results. Of course the errors are huge, but the
    property holds. More realistically, monotonic behaviour can
    be obtained as a composition of monotonic operations. If done
    in software such a composition may produce more than 1 ulp error,
    but still be monotonic where required.
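
    A quick illustration of the composition point (a probe, not a proof:
    exp is not guaranteed correctly rounded, but IEEE sqrt is, and both
    steps are nondecreasing, so sqrt(exp(x)) should stay nondecreasing
    even though its total error can exceed 1 ulp):
    _____________________________________
    // Walk adjacent doubles and check that the two-step composition
    // sqrt(exp(x)) never decreases.
    #include <cmath>
    #include <cstdio>

    int main() {
        double x = 0.5;
        long bad = 0;
        for (long i = 0; i < 1000000; ++i) {
            double xn = std::nextafter(x, 2.0);
            if (std::sqrt(std::exp(xn)) < std::sqrt(std::exp(x))) ++bad;
            x = xn;
        }
        printf("%ld monotonicity violations\n", bad);
        return 0;
    }
    _____________________________________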
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Mon Jan 12 00:37:20 2026
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation, numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be a reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps,

    Between 2N+6 and 2N+13, which is 3-9 bits larger than the
    next higher precision.

    You are talking about a specific, rather special problem.
    A reasonably typical task in exact computations is to compute the
    determinant of an n by n matrix with k-bit integer entries.
    Sometimes k is large, but k <= 10 is frequent. Using
    reasonably normal arithmetic operations you need slightly
    more than n*k bits at intermediate steps. For a similar
    matrix with rational entries the needed number of bits may
    be as large as n^2*k. If you skip simplifications of
    fractions at intermediate steps your numbers may grow
    exponentially with n. In the root-finding problem that
    I mentioned below, to get k bits of accuracy you need
    to evaluate the polynomial at a k-bit number. If you do
    the evaluation in exact arithmetic, then at intermediate
    steps you get n*k bit numbers, where n is the degree of the
    polynomial. OTOH in numeric computation you can get a
    good result with a much smaller number of bits (though the
    analysis and its result are complex), but growing
    with n.

    so the computational
    cost is much higher than numerics (even arbitrary precision
    numerics tend to be much faster). As a little example
    in a slightly different spirit, one can try to run an approximate
    root-finding procedure for polynomials in rational
    arithmetic. This solves the problem of potential numerical
    instability, but leads to large fractions which are
    much more costly than arbitrary precision floating point
    (which with some care deals with instability too).
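
    For the determinant case, the "slightly more than n*k bits" estimate
    can be made concrete with Hadamard's bound: |det A| <= product of the
    row norms <= (sqrt(n) * 2^k)^n, i.e. about n*(k + log2(n)/2) bits.
    A small sketch of that bound (my illustration):
    _____________________________________
    // Upper bound, in bits, on the exact determinant of an n x n matrix
    // with k-bit integer entries, via Hadamard's inequality.
    #include <cmath>
    #include <cstdio>

    double det_bits_bound(int n, int k) {
        // each entry < 2^k, so every row 2-norm is <= sqrt(n) * 2^k
        return n * (k + 0.5 * std::log2((double)n));
    }

    int main() {
        int ns[] = { 4, 16, 64 };
        for (int n : ns)
            printf("n=%2d k=10 -> <= %.0f bits\n", n, det_bits_bound(n, 10));
        return 0;
    }
    _____________________________________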

    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Jan 12 02:05:15 2026
    From Newsgroup: comp.arch


    antispam@fricas.org (Waldek Hebisch) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation, numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be a reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps,

    Between 2N+6 and 2N+13, which is 3-9 bits larger than the
    next higher precision.

    You are talking about a specific, rather special problem.

    Yes, exactly, where the exact result is either known or computable
    with currently known methods/means. And it's all in the Muller book.
    All I did was to take typical elementary functions and make the
    evaluation of them similar, in clock cycles, to FDIV of the same
    operand size; and for this gain in performance, I am willing to
    sacrifice the merest loss in precision: 1 rounding error every
    "quite large number of calculations"

    A reasonably typical task in exact computations is to compute the
    determinant of an n by n matrix with k-bit integer entries.
    Sometimes k is large, but k <= 10 is frequent. Using
    reasonably normal arithmetic operations you need slightly
    more than n*k bits at intermediate steps. For a similar
    matrix with rational entries the needed number of bits may
    be as large as n^2*k. If you skip simplifications of
    fractions at intermediate steps your numbers may grow
    exponentially with n. In the root-finding problem that
    I mentioned below, to get k bits of accuracy you need
    to evaluate the polynomial at a k-bit number. If you do
    the evaluation in exact arithmetic, then at intermediate
    steps you get n*k bit numbers, where n is the degree of the
    polynomial. OTOH in numeric computation you can get a
    good result with a much smaller number of bits (though the
    analysis and its result are complex), but growing
    with n.

    so the computational
    cost is much higher than numerics (even arbitrary precision
    numerics tend to be much faster). As a little example
    in a slightly different spirit, one can try to run an approximate
    root-finding procedure for polynomials in rational
    arithmetic. This solves the problem of potential numerical
    instability, but leads to large fractions which are
    much more costly than arbitrary precision floating point
    (which with some care deals with instability too).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Jan 11 15:41:59 2026
    From Newsgroup: comp.arch

    On 1/11/2026 12:40 PM, Chris M. Thomasson wrote:
    On 1/11/2026 10:07 AM, MitchAlsup wrote:
    [...]

    This is a reworked fun experiment I had about how to store and load data
    in complex numbers:

    https://groups.google.com/g/comp.lang.c++/c/bB1wA4wvoFc/m/OTccTiXLAgAJ



    // updated code
    Can you run it and tell me what you get? Thanks!

    :^)
    _____________________________________
    // Chris M. Thomasson
    // complex storage for fun...


    #include <complex>
    #include <iostream>
    #include <vector>
    #include <limits>
    #include <algorithm>
    #include <cstdint>
    #include <cassert>
    #include <cstring>
    #include <cstdio>   // std::fflush
    #include <cstdlib>  // std::abs on 64-bit integers
    #include <cmath>    // std::pow, std::cos, std::sin, std::round
    #include <string>   // std::string

    typedef std::int64_t ct_int;
    typedef std::uint64_t ct_uint;
    typedef double ct_float;
    typedef std::numeric_limits<ct_float> ct_float_nlim;
    typedef std::complex<ct_float> ct_complex;
    typedef std::vector<ct_complex> ct_complex_vec;

    #define CT_PI 3.14159265358979323846

    ct_float
    ct_roots(
    ct_complex const& z,
    ct_int p,
    ct_complex_vec& out
    ) {
    assert(p != 0);

    ct_float radius = std::pow(std::abs(z), 1.0 / p);
    ct_float angle_base = std::arg(z) / p;
    ct_float angle_step = (CT_PI * 2.0) / p;

    ct_uint n = std::abs(p);
    ct_float avg_err = 0.0;

    for (ct_uint i = 0; i < n; ++i) {
    ct_float angle = angle_step * i;
    ct_complex c = {
    std::cos(angle_base + angle) * radius,
    std::sin(angle_base + angle) * radius
    };

    out.push_back(c);

    ct_complex raised = std::pow(c, p);
    avg_err = avg_err + std::abs(raised - z);
    }

    return avg_err / n;
    }

    // Direct angular calculation - O(1) instead of O(n)
    ct_int
    ct_try_find_direct(
    ct_complex const& z,
    ct_complex const& z_next,
    ct_int power,
    ct_float eps
    ) {
    // Calculate what the angle_base was when z_next's roots were computed
    ct_float angle_base = std::arg(z_next) / power;

    // Get z's angle relative to origin
    ct_float z_angle = std::arg(z);

    // Find which root slot z falls into
    // Subtract the base angle and normalize
    ct_float relative_angle = z_angle - angle_base;

    // Normalize to [0, 2*pi)
    while (relative_angle < 0) relative_angle += CT_PI * 2.0;
    while (relative_angle >= CT_PI * 2.0) relative_angle -= CT_PI * 2.0;

    // Calculate step size between roots
    ct_float angle_step = (CT_PI * 2.0) / power;

    // Find nearest root index
    ct_uint index = (ct_uint)std::round(relative_angle / angle_step);

    // Handle wrap-around
    if (index >= (ct_uint)std::abs(power)) {
    index = 0;
    }

    return index;
    }

    // Original linear search version - more robust but O(n)
    ct_int
    ct_try_find(
    ct_complex const& z,
    ct_complex_vec const& roots,
    ct_float eps
    ) {
    std::size_t n = roots.size();

    for (std::size_t i = 0; i < n; ++i) {
    ct_complex const& root = roots[i];
    ct_float adif = std::abs(root - z);

    if (adif < eps) {
    return i;
    }
    }

    return -1;
    }

    static std::string const g_tokens_str =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    ct_int
    ct_gain_power(
    std::string const& tokens
    ) {
    ct_uint n = tokens.length();
    std::size_t pmax = 0;

    for (ct_uint i = 0; i < n; ++i) {
    std::size_t fridx = g_tokens_str.find_first_of(tokens[i]);
    assert(fridx != std::string::npos);
    pmax = std::max(pmax, fridx);
    }

    return (ct_int)(pmax + 1);
    }

    ct_complex
    ct_store(
    ct_complex const& z_origin,
    ct_int p,
    std::string const& tokens
    ) {
    ct_uint n = tokens.length();
    ct_complex z = z_origin;
    ct_float store_avg_err = 0.0;

    std::cout << "Storing Data..." << "\n";
    std::cout << "stored:z_origin:" << z_origin << "\n";

    for (ct_uint i = 0; i < n; ++i) {
    ct_complex_vec roots;
    ct_float avg_err = ct_roots(z, p, roots);
    store_avg_err = store_avg_err + avg_err;

    std::size_t fridx = g_tokens_str.find_first_of(tokens[i]);
    assert(fridx != std::string::npos);

    z = roots[fridx];
    std::cout << "stored[" << i << "]:" << z << "\n";
    }

    store_avg_err = store_avg_err / n;
    std::cout << "store_avg_err:" << store_avg_err << "\n";

    return z;
    }

    ct_float
    ct_load(
    ct_complex const& z_store,
    ct_complex const& z_target,
    ct_int p,
    ct_float eps,
    std::string& out_tokens,
    ct_complex& out_z,
    bool use_direct = false // Toggle between direct and linear search
    ) {
    ct_complex z = z_store;
    ct_uint n = 128;
    ct_float load_err_sum = 0.0;

    std::cout << "Loading Data... (using " << (use_direct ? "direct" : "linear search") << " method)\n";

    for (ct_uint i = 0; i < n; ++i) {
    // Raise to power to get parent point
    ct_complex z_next = std::pow(z, p);

    ct_int root_idx;

    if (use_direct) {
    // Direct O(1) calculation
    root_idx = ct_try_find_direct(z, z_next, p, eps);
    }
    else {
    // Linear search O(n) - compute roots and search
    ct_complex_vec roots;
    ct_float avg_err = ct_roots(z_next, p, roots);
    load_err_sum += avg_err;
    root_idx = ct_try_find(z, roots, eps);
    }

    if (root_idx < 0 || (ct_uint)root_idx >= g_tokens_str.length()) {
    break;
    }

    std::cout << "loaded[" << i << "]:" << z << " (index:" <<
    root_idx << ")\n";
    out_tokens += g_tokens_str[root_idx];

    // Move to parent point
    z = z_next;

    // Check if we've reached the origin
    if (std::abs(z - z_target) < eps) {
    std::cout << "fin detected!:[" << i << "]:" << z << "\n";
    break;
    }
    }

    // Reverse to get original order
    std::reverse(out_tokens.begin(), out_tokens.end());
    out_z = z;

    return load_err_sum;
    }

    int main() {
    std::cout.precision(ct_float_nlim::max_digits10);
    std::cout << "g_tokens_str:" << g_tokens_str << "\n\n";

    {
    ct_complex z_origin = { -.75, .06 };
    std::string stored = "CHRIS";
    ct_int power = ct_gain_power(stored);

    std::cout << "stored:" << stored << "\n";
    std::cout << "power:" << power << "\n\n";
    std::cout << "________________________________________\n";

    // STORE
    ct_complex z_stored = ct_store(z_origin, power, stored);

    std::cout << "________________________________________\n";
    std::cout << "\nSTORED POINT:" << z_stored << "\n";
    std::cout << "________________________________________\n";

    // LOAD - try both methods
    std::string loaded;
    ct_complex z_loaded;
    ct_float eps = .001;

    std::cout << "\n=== Testing LINEAR SEARCH method ===\n";
    ct_float load_err_sum =
    ct_load(z_stored, z_origin, power, eps, loaded, z_loaded,
    false);

    std::cout << "________________________________________\n";
    std::cout << "\nORIGIN POINT:" << z_origin << "\n";
    std::cout << "LOADED POINT:" << z_loaded << "\n";
    std::cout << "\nloaded:" << loaded << "\n";
    std::cout << "load_err_sum:" << load_err_sum << "\n";

    if (stored == loaded) {
    std::cout << "\n\nDATA COHERENT! :^D" << "\n";
    }
    else {
    std::cout << "\n\n***** DATA CORRUPTED!!! Shi%! *****" << "\n";
    std::cout << "Expected: " << stored << "\n";
    std::cout << "Got: " << loaded << "\n";
    }

    // Try direct method
    std::cout << "\n\n=== Testing DIRECT ANGULAR method ===\n";
    std::string loaded_direct;
    ct_complex z_loaded_direct;

    ct_float load_err_sum_direct =
    ct_load(z_stored, z_origin, power, eps, loaded_direct, z_loaded_direct, true);

    std::cout << "________________________________________\n";
    std::cout << "\nloaded:" << loaded_direct << "\n";

    if (stored == loaded_direct) {
    std::cout << "\n\nDATA COHERENT (DIRECT METHOD)! :^D" << "\n";
    }
    else {
    std::cout << "\n\n***** DATA CORRUPTED (DIRECT METHOD)!!!
    *****" << "\n";
    std::cout << "Expected: " << stored << "\n";
    std::cout << "Got: " << loaded_direct << "\n";
    }
    }

    std::cout << "\n\nFin, hit <ENTER> to exit...\n";
    std::fflush(stdout);
    std::cin.get();

    return 0;
    }
    _____________________________________



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Jan 12 01:07:34 2026
    From Newsgroup: comp.arch

    On Sun, 11 Jan 2026 22:30:14 -0000 (UTC)
    antispam@fricas.org (Waldek Hebisch) wrote:

    Michael S <already5chosen@yahoo.com> wrote:
    On Thu, 8 Jan 2026 21:35:16 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    MitchAlsup wrote:


    But a single incorrect rounding is 1 ULP all by itself.

    :-)

    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions
    which are exactly rounded.


    I wonder who are those forces and what is the set they push for.

    I would guess that they are mostly software and hardware
    verification people rather than people that use transcendental
    functions in engineering and physical calculations.

    The majority of the latter crowd would likely have no objections
    against 0.75 ULP RMS as long as the implementation is both fast and
    preserves a few invariants, of which I can primarily think of two:

    1. Evenness/oddness
    If the precise function F(x) is even or odd, then the approximate function
    f(x) is also even or odd.

    2. Weak preservation of sign of delta.
    If F(x) < F(x+ULP) then f(x) <= f(x+ULP)
    If F(x) > F(x+ULP) then f(x) >= f(x+ULP)


    In practice, it's probably unlikely to have these invariants
    preserved when the RMS error is 0.75 ULP, but for an RMS error of
    0.60-0.65 ULP I don't see difficulties.
    For many transcendental functions there will be only a few problematic
    values of x near the edges of implementation-specific ranges where one
    has to be careful.

    This property is independent of the magnitude of the error. For
    example, a nominally "double" routine may deliver correctly rounded
    single-precision results. Of course the errors are huge, but the
    property holds.


    That's why I specified them not in isolation but together with RMS <
    0.75 ULP.

    I copied the latter part from Mitch's post. But I don't like this sort
    of characterization of precision. It takes into account the discrete
    nature of the Y axis, but ignores the discreteness of the X axis.
    Tonight is too late for a better definition. Maybe I'll do it tomorrow.


    More realistically, monotonic behaviour can
    be obtained as a composition of monotonic operations. If done
    in software such a composition may produce more than 1 ulp error,
    but still be monotonic where required.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Jan 12 12:20:48 2026
    From Newsgroup: comp.arch

    On Sun, 11 Jan 2026 22:11:55 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Consider COS(x) near 0.0


    sin/cos is not an interesting or hard case, because for sin/cos the
    value at the extremum is exactly 1 or -1, i.e. it is representable
    exactly in any BFP format. Plus, the slope is very shallow. It means
    that any sane implementation of sin/cos will have no trouble correctly
    rounding both sides of the interval that contains the extremum to 1
    (or -1). At least it holds as long as x is in the sane range
    (abs(x) < 2**26). For x outside that range, you (i.e. the engineer,
    chemist, or physicist) just know that you are doing something very
    wrong, and the implementation of trigs is among the last things that
    you should be concerned about.

    More challenging cases are transcendental functions that have extrema
    whose values are not exactly representable, especially so when the
    value is close to the mid-point between two representable numbers.
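
    The shallow-slope point is easy to see numerically: in double, cos(x)
    rounds to exactly 1.0 for all |x| up to about 2^-26.5, since there
    1 - x^2/2 is within half an ulp of 1.0. A toy bisection (assuming the
    libm cos is monotonic on [0, 1], which any sane one is):
    _____________________________________
    // Find how far from 0 cos(x) still rounds to exactly 1.0 in double.
    #include <cmath>
    #include <cstdio>

    int main() {
        double lo = 0.0, hi = 1.0;      // cos(lo) == 1.0, cos(hi) < 1.0
        for (int i = 0; i < 200; ++i) { // plenty of steps to converge fully
            double mid = 0.5 * (lo + hi);
            if (std::cos(mid) == 1.0) lo = mid; else hi = mid;
        }
        printf("cos rounds to 1.0 up to x ~= %g (log2 ~= %.1f)\n",
               lo, std::log2(lo));
        return 0;
    }
    _____________________________________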

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Jan 12 16:28:37 2026
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    antispam@fricas.org (Waldek Hebisch) posted:

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    I use every day a computer algebra system. One possibility here
    is to use arbitrary precision fractions. Observations:
    - once there is more complex computation, numerators and denominators
    tend to be pretty big
    - a lot of time is spent computing gcd, but if you try to save on
    gcd-s and work with unsimplified fractions you may get tremendous
    blowup (like million digit numbers in what should be a reasonably
    small computation).
    - in general, if one needs numeric approximation, then arbitrary
    precision (software) floating point is much faster

    It is also not possible to express irrational numbers,
    transcendental functions etc. When I use a computer algebra
    system, I tend to use such functions, so solutions are usually
    not rational numbers.

    Support for algebraic irrationalities is by now a standard
    feature in computer algebra. Dealing with transcendental
    elementary functions too. Support for special functions
    is weaker, but that is possible too. Deciding if a number
    is transcendental or not is theoretically tricky, but for
    elementary numbers there is the Schanuel conjecture, which
    while unproven tends to work well in practice.

    What troubles many folks is the fact that for many practical
    problems answers are implicit, so if you want numbers at the
    end you need to do numeric computation anyway.

    Anyway, my point was that exact computations tend to need
    large accuracy at intermediate steps,

    Between 2N+6 and 2N+13, which is 3-9 bits larger than the
    next higher precision.
    double -> fp128 is 53 vs 113 bits mantissa (including the hidden bit),
    so 2N+7, which is _almost_ enough even for the handful of really bad
    cases. Using u128 unsigned calculations might be enough for exact
    double results?
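
    A minimal sketch of the u128 idea (using GCC/Clang's unsigned __int128;
    the Q1.63 fixed-point format and the coefficients are invented for
    illustration):
    _____________________________________
    // One Horner step on 64-bit fixed-point data with a full 64x64->128
    // product, keeping ~2N bits at the intermediate step.
    #include <cstdint>
    #include <cstdio>

    // acc, c, x are Q1.63 fractions in [0, 1): value = v / 2^63.
    uint64_t horner_step(uint64_t acc, uint64_t c, uint64_t x) {
        unsigned __int128 p = (unsigned __int128)acc * x; // exact 128-bit product
        return (uint64_t)(p >> 63) + c;                   // renormalize, add coeff
    }

    int main() {
        // toy: evaluate (1/4)*x + 1/2 at x = 1/2 -> 5/8
        uint64_t half = 1ull << 62;    // 0.5 in Q1.63
        uint64_t quarter = 1ull << 61; // 0.25 in Q1.63
        uint64_t r = horner_step(quarter, half, half);
        printf("%.6f\n", (double)r / (double)(1ull << 63)); // prints 0.625000
        return 0;
    }
    _____________________________________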
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Jan 12 22:22:05 2026
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> schrieb:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored fact that even with hardware support there still needs
    to be support in compilers.

    Or perhaps (again, no personal knowledge - just speculation) that >>supporting an additional data type in the IBM COBOL (and, for what its >>worth PL/1) compilers is easier if there was hardware support for it.

    Having written a few compilers, I can say that it is equally easy
    within epsilon to emit a DFADD instruction as the equivalent of CALL
    DFADD. I could believe it's politically easier, hey we'll look dumb if
    we announce this swell DFP feature and our own compilers don't use it.

    Unfortunately, I do not have access to a machine with an IBM COBOL
    compiler. It would be interesting to see if it actually uses
    the decimal float arithmetic. But xlc has an option for that,
    -qdfp.

    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Jan 13 09:55:00 2026
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Stephen Fuld <sfuld@alumni.cmu.edu.invalid>:
    existing compilers supported it. He claimed that there
    must be hardware support to make it widely available. He somewhat
    ignored fact that even with hardware support there still needs
    to be support in compilers.

    Or perhaps (again, no personal knowledge - just speculation) that
    supporting an additional data type in the IBM COBOL (and, for what its
    worth PL/1) compilers is easier if there was hardware support for it.

    Having written a few compilers, I can say that it is equally easy
    within epsilon to emit a DFADD instruction as the equivalent of CALL
    DFADD. I could believe it's politically easier, hey we'll look dumb if
    we announce this swell DFP feature and our own compilers don't use it.

    Back in the FDIV bug days, the workaround code I did most of the writing
    on simply replaced all FDIV opcodes with a CALL FDIVFIX; none of the
    compiler teams found that to be any problem at all. For most it was
    probably just a patch to the code output table?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Tue Jan 13 14:45:02 2026
    From Newsgroup: comp.arch

    MitchAlsup [2026-01-11 18:18:00] wrote:
    Michael S <already5chosen@yahoo.com> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    Yeah, there are strong forces who want to have, at least as a
    suggested/recommended option, a set of transcendental functions which
    are exactly rounded.
    I wonder who are those forces and what is the set they push for.

    One reason to want it comes from portability and bit-for-bit
    reproducibility. These requirements don't actually care about the
    rounding being *correct* as much as the rounding always being the same
    across different hardware and libm implementations, but it seems rather unlikely that the various actors involved would agree on a particular
    return value if it's not the correctly-rounded one, so in practice this
    becomes a push for correctly rounded results.

    The problem here is that even when one gets all the rounding correct,
    one has still lost various algebraic identities:

    CRSIN(x)^2 + CRCOS(x)^2 ~= 1.0

    Which properties are preserved and which ones aren't is inevitably
    a compromise since, for example, the above one cannot be preserved
    without breaking several others.
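
    That compromise is easy to observe. A quick probe (even a correctly
    rounded sin/cos pair would show misses here, since the two squarings
    and the add each round once more):
    _____________________________________
    // Count how often sin(x)^2 + cos(x)^2 lands exactly on 1.0.
    #include <cmath>
    #include <cstdio>

    int main() {
        long off = 0;
        for (long i = 1; i <= 1000000; ++i) {
            double x = i * 1e-3;
            double s = std::sin(x), c = std::cos(x);
            if (s * s + c * c != 1.0) ++off;
        }
        printf("%ld of 1000000 samples miss 1.0 exactly\n", off);
        return 0;
    }
    _____________________________________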


    - Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2