• Random: Very Low Precision FP

    From BGB@cr88192@gmail.com to comp.arch on Tue Aug 26 13:08:29 2025
    From Newsgroup: comp.arch

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    But, will use Binary16 and BF16 as the example formats.

    So, can note that one can approximate some ops with modified integer
    ADD/SUB (excluding sign-bit handling):
    a*b : A+B-0x3C00 (0x3F80 for BF16)
    a/b : A-B+0x3C00
    sqrt(a): (A>>1)+0x1E00
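
    As a rough sketch in C (made-up helper names; positive normal binary16
    inputs passed around as raw uint16_t bit patterns, no sign/NaN/Inf or
    subnormal handling):

    #include <stdint.h>

    #define H_ONE 0x3C00u  /* bit pattern of 1.0 in binary16 (0x3F80 for BF16) */

    static uint16_t h_mul_approx(uint16_t A, uint16_t B) { return A + B - H_ONE; }
    static uint16_t h_div_approx(uint16_t A, uint16_t B) { return A - B + H_ONE; }
    static uint16_t h_sqrt_approx(uint16_t A)            { return (A >> 1) + (H_ONE >> 1); }

    /* e.g. 2.0*2.0: 0x4000+0x4000-0x3C00 = 0x4400 = 4.0 (exact here; in
       general only the top few mantissa bits come out right). */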

    The harder ones though, are ADD/SUB.

    A partial ADD seems to be:
    a+b: A+((B-A)>>1)+0x0400
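
    In the same raw-bit-pattern style (again only a sketch; assumes an
    arithmetic shift on the signed difference, same-sign operands, no special
    cases):

    #include <stdint.h>

    static uint16_t h_add_approx(uint16_t A, uint16_t B)
    {
        int16_t d = (int16_t)(B - A);  /* signed difference of the bit patterns */
        return (uint16_t)(A + (d >> 1) + 0x0400);
    }

    /* e.g. 1.0+1.0: A=B=0x3C00 -> 0x4000 = 2.0 (exact);
       but 1.0+4.0: A=0x3C00, B=0x4400 -> 0x4400 = 4.0, vs. 5.0 exact. */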

    But, this simple case seems not to hold up when either doing subtract,
    or when A and B are far apart.

    So, it would appear either that there is a 4th term or the bias is
    variable (depending on the B-A term; and for ADD/SUB).

    Seems like the high bits (exponent and operator) could be used to drive
    a lookup table, but this is lame. The magic bias appears to have
    non-linear properties, so it isn't as easily represented with basic
    integer operations.

    Then again, probably other people know about all of this and might know
    what I am missing.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Aug 26 21:17:47 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256|u256]

    But, will use Binary16 and BF16 as the example formats.

    So, can note that one can approximate some ops with modified integer
    ADD/SUB (excluding sign-bit handling):
    a*b : A+B-0x3C00 (0x3F80 for BF16)
    a/b : A-B+0x3C00
    sqrt(a): (A>>1)+0x1E00

    You are aware that GPUs perform elementary transcendental functions
    (32-bits) in 5 cycles {sin(), cos(), tan(), exp(), ln(), ...}.
    These functions get within 1.5-2 ULP. See authors: Oberman, Pierno,
    Matula circa 2000-2005 for relevant data. I did a crack at this
    (patented: Samsung) that got within 0.7 and 1.2 ULP using a three
    term polynomial instead of a 2 term polynomial.
    Standard GPU FP math (32-bit and 16-bit) is 4 cycles and is now
    IEEE 754 accurate (except for a couple of outlying cases).

    So, I don't see this suggestion bringing value to the table.

    The harder ones though, are ADD/SUB.

    A partial ADD seems to be:
    a+b: A+((B-A)>>1)+0x0400

    But, this simple case seems not to hold up when either doing subtract,
    or when A and B are far apart.

    So, it would appear either that there is a 4th term or the bias is
    variable (depending on the B-A term; and for ADD/SUB).

    Seems like the high bits (exponent and operator) could be used to drive
    a lookup table, but this is lame. The magic bias appears to have
    non-linear properties, so it isn't as easily represented with basic
    integer operations.

    Then again, probably other people know about all of this and might know
    what I am missing.

    I still recommend getting the right answer over getting a close but wrong answer a couple cycles earlier.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Wed Aug 27 01:35:08 2025
    From Newsgroup: comp.arch

    On Tue, 26 Aug 2025 13:08:29 -0500, BGB wrote:

    Then again, probably other people know about all of this and might know
    what I am missing.

    A long time ago, a notation called FOCUS was proposed for low-precision floats. It represented numbers by their logarithms. Multiplication and division were done quickly by addition and subtraction.

    Addition and subtraction required a lookup table - but because the two
    numbers involved needed to be not too far apart in magnitude for the operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Wed Aug 27 01:39:47 2025
    From Newsgroup: comp.arch

    On Wed, 27 Aug 2025 01:35:08 +0000, John Savard wrote:

    Addition and subtraction required a lookup table - but because the two numbers involved needed to be not too far apart in magnitude for the operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.

    So to add two numbers, first switch them if necessary, so that the larger
    one is a, and the smaller one is b.

    Calculate b/a by subtraction.

    Then use a short table to find (a+b)/a from b/a. The value found from that table can be added to a to give (the logarithmic representation of) a+b.
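
    In C, the scheme reads something like this (a sketch only; the names, the
    steps-per-octave scale, and the table size are all made up here, and the
    values are fixed-point base-2 logarithms):

    #include <math.h>
    #include <stdint.h>

    #define SCALE   16   /* table steps per power of two (made-up resolution) */
    #define TAB_MAX 96   /* past this difference the correction rounds to 0 */

    static int16_t add_tab[TAB_MAX + 1];  /* log2(1 + 2^(-d/SCALE)), scaled */

    static void lns_init(void)
    {
        for (int d = 0; d <= TAB_MAX; d++)
            add_tab[d] = (int16_t)lround(SCALE * log2(1.0 + pow(2.0, -d / (double)SCALE)));
    }

    /* log(a+b) from log(a) and log(b); a and b both positive */
    static int16_t lns_add(int16_t la, int16_t lb)
    {
        if (lb > la) { int16_t t = la; la = lb; lb = t; }  /* make la the larger */
        int d = la - lb;                    /* -log(b/a), found by subtraction */
        if (d > TAB_MAX) return la;         /* b is too small to matter */
        return (int16_t)(la + add_tab[d]);  /* log(a) + log((a+b)/a) = log(a+b) */
    }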

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 27 00:06:44 2025
    From Newsgroup: comp.arch

    On 8/26/2025 8:35 PM, John Savard wrote:
    On Tue, 26 Aug 2025 13:08:29 -0500, BGB wrote:

    Then again, probably other people know about all of this and might know
    what I am missing.

    A long time ago, a notation called FOCUS was proposed for low-precision floats. It represented numbers by their logarithms. Multiplication and division were done quickly by addition and subtraction.


    OK, it is similar.

    In this case, floating-point values can be seen as roughly analogous to fixed-point log2 values. Not exactly, but "close enough" for some use cases.

    As long as one keeps conventional FP formats, it mostly maps over,
    nevermind if values are "slightly distorted". Say, because in
    traditional FP, each step in the mantissa is the same size, but in log2
    space the step-size differs based on the relative position within the mantissa.
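
    Concretely, for binary16 the raw bit pattern of a positive normal value,
    read as a fixed-point number, already approximates log2 (essentially
    Mitchell's old log2 approximation; exact at powers of two, off by up to
    about 0.086 in between). A quick throwaway check:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* crude float -> binary16 bit pattern, positive normals only, truncating */
    static uint16_t half_bits(float x)
    {
        uint32_t u; memcpy(&u, &x, sizeof u);
        return (uint16_t)(((u >> 16) & 0x8000u) |
                          ((((u >> 23) & 0xFF) - 127 + 15) << 10) |
                          ((u >> 13) & 0x03FFu));
    }

    int main(void)
    {
        for (float x = 1.0f; x < 8.0f; x *= 1.3f)
            printf("x=%6.3f  approx log2=%7.4f  true log2=%7.4f\n",
                   x, (half_bits(x) - 0x3C00) / 1024.0, log2(x));
        return 0;
    }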

    But, generally good at least to try to keep things looking like
    traditional floating-point values.


    So, say:
    a*b: A+B-Bias
    Where in the simple case, Bias is 1.0.

    One can get clever and lookup a bias adjustment based on the high-order
    bits of the mantissa for more accuracy. Sadly, not found any cheap way
    to calculate the bias adjustment semi-accurately. Typically, the
    relative bias offset drops to 0 at around each power of 2, so the
    adjustment stays within a power-of-2 range, and repeats for every power
    of 2.

    Where, one might naively expect the bias to simply get slightly larger
    as the mantissa gets larger (as each step itself gets larger), except
    that it is more like a "slightly lopsided hump" (IIRC, reaching its
    highest point at around 0.625-0.667 rather than 0.5; then dropping off
    more quickly than it had risen).
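
    FWIW, one way to at least derive such a per-bucket bias adjustment offline
    (a throwaway sketch, not necessarily matching the table being described
    here) is to sweep mantissa pairs, compare the raw "add the bit patterns"
    result against the exact product in log2 space, and average the needed
    correction per bucket of the raw sum's mantissa:

    #include <math.h>
    #include <stdio.h>

    #define BUCKETS 16  /* index by the top 4 mantissa bits of the raw sum */

    int main(void)
    {
        double sum[BUCKETS] = {0};
        int    cnt[BUCKETS] = {0};

        for (int ma = 0; ma < 1024; ma++) {
            for (int mb = 0; mb < 1024; mb++) {
                double fa = ma / 1024.0, fb = mb / 1024.0;
                double approx = fa + fb;                        /* what A+B-bias yields */
                double exact  = log2((1.0 + fa) * (1.0 + fb));  /* what it should yield */
                double frac   = approx - floor(approx);         /* mantissa of the raw sum */
                int    i      = (int)(frac * BUCKETS);
                sum[i] += exact - approx;   /* needed correction, in log2 units */
                cnt[i]++;
            }
        }
        for (int i = 0; i < BUCKETS; i++)   /* mean correction, in 1/1024 steps */
            printf("bucket %2d: %+7.1f\n", i, 1024.0 * sum[i] / cnt[i]);
        return 0;
    }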


    At least in theory, it could be more accurate if one could use a cubic
    spline or similar to interpolate the bias values, but by this point one
    may as well use a traditional FPU multiply.

    But, without interpolation, the table-lookup approach has an undesirable
    jitter, whereas at least with the fixed bias it is consistent.


    Had noted that seemingly one could do something like (IIRC):
    T=A+B
    Bias=0xC400|(T[8:2]^T[9:3])
    C=T+Bias
    But, this merely slightly improved "average case" accuracy (over some
    parts of the range it gets worse; as it doesn't particularly closely
    mimic the desired shape, peak not being in the right place nor having
    the correct slope, as the proper version rises and drops off more
    sharply, ...).


    Though, can note that the high 2 bits of the result would need to be
    used for range clamping:
    0x: Normal range
    10: Overflow (clamp to maximum value)
    11: Underflow (clamp to 0)
    With bit 15 or similar then being replaced with the sign bit.



    But, the interpolated lookup table does exist as a possible alternative
    FDIV strategy to Newton-Raphson. Done in software, though, it doesn't
    have many obvious advantages over N-R if one has a fast FPU multiply
    (but is a bit more tempting if one needs to implement FDIV using
    integer math).
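
    For reference, the N-R route looks something like this for binary32 (a
    sketch only; a plain exponent-flip seed rather than a tuned magic
    constant, positive normal b, no rounding guarantees):

    #include <stdint.h>
    #include <string.h>

    static float fdiv_nr(float a, float b)
    {
        uint32_t B, Y;
        float y;

        memcpy(&B, &b, sizeof B);
        Y = 0x7F000000u - B;      /* seed ~1/b: negate the exponent field */
        memcpy(&y, &Y, sizeof y);

        y = y * (2.0f - b * y);   /* each N-R step squares the relative error */
        y = y * (2.0f - b * y);   /* two steps: ~12 bits, three: ~24 bits */
        return a * y;
    }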

    It is kind of a similar situation to doing sin/cos with interpolated
    lookup tables or Taylor expansion. Here, Taylor expansion wins by
    providing more accuracy relative to CPU cost, but it does implicitly
    assume that one has native FPU ops.


    Addition and subtraction required a lookup table - but because the two numbers involved needed to be not too far apart in magnitude for the operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.


    Yeah, addition and subtraction are seemingly the harder operations in
    the case of trying to operate in logarithmic space.

    It was more a case of wondering if anyone knew of something cheaper than,
    seemingly, the:
    D=B-A
    C=A+(D>>1)+MagicBias

    strategy with some boilerplate glued on.


    One cost saving simplification I am aware of (conventional FP) is to not
    just eliminate the concept of NaN and Inf, but also the concept of "true zero"; however, this only really works OK for formats with "sufficient
    dynamic range". In this case, 0 is no longer special, merely "the
    smallest value, generally understood as 0".

    So, for example, with BF16 or Binary32, one could mostly eliminate zero
    handling, but with FP8 or Binary16, the zero point is large enough that
    a lack of 0 would be more noticeable (in particular because various
    mathematical identities fail or misbehave in the absence of zero).

    Like, if 0*Inf => 2.0, this is "a little off".
    Also, ((0+0)!=0), ...



    But, does seem almost like I am "pretty close" to figuring out something "really cheap".

    But, while accuracy isn't that important, still at least need to land
    "in the right general area".

    Also would prefer to avoid "stair steps" due to coarse lookup tables.


    John Savard


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 27 01:17:47 2025
    From Newsgroup: comp.arch

    On 8/26/2025 4:17 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256|u256]


    Would work OK for scalar ops on a CPU; less great for SIMD (or other
    cases where one can't afford a 64K lookup table).

    For 4x FP8 on a CPU, it makes sense to just use a normal SIMD unit for this.

    But, say, what if one wants 8x or 16x wide SIMD with FP8; or use within
    more specialized units?...


    Granted, FP8 multiply is fairly cheap either way (eg: the mantissa
    multiply already fits into LUT6's and can be pulled off in combinatorial logic). It is mostly FP8 FADD that needs to be made cheaper and low
    latency; sadly approximating FADD/FSUB in the general case has typically
    been the harder problem.


    And, while a special case has been found (works for simple add with
    similar exponents), it doesn't extend to the general case.


    But, will use Binary16 and BF16 as the example formats.

    So, can note that one can approximate some ops with modified integer
    ADD/SUB (excluding sign-bit handling):
    a*b : A+B-0x3C00 (0x3F80 for BF16)
    a/b : A-B+0x3C00
    sqrt(a): (A>>1)+0x1E00

    You are aware that GPUs perform elementary transcendental functions
    (32-bits) in 5 cycles {sin(), cos(), tan(), exp(), ln(), ...}.
    These functions get within 1.5-2 ULP. See authors: Oberman, Pierno,
    Matula circa 2000-2005 for relevant data. I did a crack at this
    (patented: Samsung) that got within 0.7 and 1.2 ULP using a three
    term polynomial instead of a 2 term polynomial.
    Standard GPU FP math (32-bit and 16-bit) is 4 cycles and is now
    IEEE 754 accurate (except for a couple of outlying cases).

    So, I don't see this suggestion bringing value to the table.


    These can do some operations more cheaply.

    But, as noted, primarily for low precision values, likely in dedicated
    logic.

    This approach would not make as much sense in general for larger
    formats, given the accuracy is far below what would be considered
    acceptable.

    Though, a few of these were already in use (as CPU helper ops), usually
    to provide "starter-values" for Newton-Raphson.


    But, this sort of thing, is unlikely to replace general-purpose SIMD ops
    on a CPU or similar in any case.

    And, for the SIMD unit, can continue doing floating-point in ways that
    "are not complete garbage".


    But, say, for working with HDR values inside the rasterizer module or
    similar, this is more where this sort of thing could matter.

    Or, maybe, could be relevant for perspective-correct texture filtering
    (well, if it were working with floating-point texture coords rather than
    fixed point).

    Might be better if the module could also do transform and deal with full primitives, but this is asking too much.

    Or, failing this, if it could be used for 2D "blit" operations
    (currently only deals with square or rectangular power-of-2 images in
    Morton Order, which isn't terribly useful for "blit").

    Though, as noted, TKRA-GL keeps its textures internally in Morton Order.

    Currently, TKRA-GL uses a 12-bit linear Z-Buffer (with 4 bits for
    stencil), though it is possible that it could make sense to use floating
    point for the Z-buffer (maybe S.E3.F8; as it mostly only needs to hold
    values between -1.0 and 1.0, etc).


    Some of the audio modules also use values mostly in A-Law form.
    Though, annoyingly, it seems I have now ended up stuck with A-Law formats
    with both Bias=7 and Bias=8. Initially I added the ops primarily for
    audio, where Bias=8 made sense, but for other (non-audio) uses I needed
    Bias=7 (renamed as FP8A). So, annoyingly, there are now two sets of
    converter ops differing primarily in bias.

    But, I am mostly phasing out FP8S (E4.F3.S) in favor of plain FP8
    (S.E4.F3). Though, it lingers on as a sort of a design mistake, much
    like making my A-Law ops originally Bias=8. But, then, FP8A remains
    preferable mostly because it has a slightly larger mantissa than normal
    FP8.


    ...


    The harder ones though, are ADD/SUB.

    A partial ADD seems to be:
    a+b: A+((B-A)>>1)+0x0400

    But, this simple case seems not to hold up when either doing subtract,
    or when A and B are far apart.

    So, it would appear either that there is a 4th term or the bias is
    variable (depending on the B-A term; and for ADD/SUB).

    Seems like the high bits (exponent and operator) could be used to drive
    a lookup table, but this is lame. The magic bias appears to have
    non-linear properties, so it isn't as easily represented with basic
    integer operations.

    Then again, probably other people know about all of this and might know
    what I am missing.

    I still recommend getting the right answer over getting a close but wrong answer a couple cycles earlier.


    A lot depends on what is needed...

    In cases where a person is doing math using FP8, any real semblance of "accuracy" or "right answer" has already gone out the window. Usually
    the math is just sorta throwing darts at a dartboard and hoping they
    land somewhere in the right area.

    Though, that said, usually even with these sorts of approximations (such
    as approximating an FMUL with a modified ADD), often the top 3-5 bits of
    the mantissa are correct. So, for FP8 or BF16, the approximate answer may
    in many cases still be close to the answer from real floating-point
    logic.


    But, even for something like Binary16, it is a bit iffy.

    There are 10 bits of mantissa, and math ops that only give around 3-5
    bits of accuracy or so aren't great in this case.

    Though, sometimes the accuracy doesn't matter that much, but one may
    still want to avoid "stair steps" as the artifacts generated by this may
    be much more obvious.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Aug 27 13:56:53 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256|u256]
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?
    Then again, probably other people know about all of this and might know
    what I am missing.
    The infamous invsqrt() trick is the canonical example of where all the
    quirks of the IEEE 754 format work just right to get you to 10+ bits
    with a single NR iteration.
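
    For reference, the usual form of that trick (from memory, the well-known
    magic-constant version plus one NR step):

    #include <stdint.h>
    #include <string.h>

    static float inv_sqrt(float x)
    {
        uint32_t i;
        float y;

        memcpy(&i, &x, sizeof i);
        i = 0x5f3759df - (i >> 1);             /* initial guess from the bits */
        memcpy(&y, &i, sizeof y);
        return y * (1.5f - 0.5f * x * y * y);  /* one NR iteration */
    }
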
    Your basic ops examples are a lot more iffy.

    I still recommend getting the right answer over getting a close but wrong answer a couple cycles earlier.

    Exactly.
    I think you showed me the idea of usually getting the correct result in
    N cycles, but in a low number of cases, the trailing bits would be too
    close to a rounding boundary, so they would add one more NR iteration.
    I just realized that the code I wrote to fix Pentium FDIV could have
    been even more efficient on a proper superscalar OoO CPU:
    Start the FDIV immediately, then at the same time do the divisor
    mantissa inspection to determine if the workaround would be needed (5
    out of 1024 cases), and only if that happens, start the slower path that takes up to twice as long.
    The idea is that for 99.5% of all divisors, the only cost would be a
    close to zero cycle correctly predicted branch, but then the remainder
    would require two FDIV operations, so 80 instead of 40 cycles.
    OTOH, that same Big OoO core can probably predict that the entire
    mantissa inspection part will end up with a "skip the workaround" branch and start the FDIV almost at once. I'm assuming that when the mispredict turns up, the core can stop a long operation like FDIV more or less
    immediately and discard the current status.
    (From memory)

    #include <stdint.h>
    #include <string.h>

    /* 128-byte bitmap flagging the (5 out of 1024) problematic divisor
       mantissa patterns; built elsewhere */
    extern const uint8_t fdiv_table[128];

    double fdiv(double a, double b)
    {
        uint64_t mant10;
        memcpy(&mant10, &b, sizeof(b));
        mant10 = (mant10 >> 42) & 1023;  /* top 10 bits of the divisor mantissa */
        if (fdiv_table[mant10 >> 3] & (1 << (mant10 & 7))) {
            // set fpu to extended/long double, save previous mode
            b *= 15.0/16.0; // Exact operation!
            a *= 15.0/16.0; // Exact operation!
            // Restore to previous precision mode
        }
        return a / b;
    }
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 27 14:43:20 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).
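
    A quick way to sanity-check that threshold (a throwaway sketch; the table
    here is the usual one-dimensional Gaussian-log correction, indexed only by
    the difference d in base-2^(1/4) steps):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        for (int d = 0; d <= 16; d++) {
            long f = lround(4.0 * log2(1.0 + pow(2.0, -d / 4.0)));
            printf("d=%2d  F(d)=%ld\n", d, f);  /* F(0)=4; F(14) rounds to 0 */
        }
        return 0;
    }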

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 27 12:26:31 2025
    From Newsgroup: comp.arch

    On 8/27/2025 9:43 AM, Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?


    There is sort of a thing here:
    When the number of bits gets small, the practical differences between FP
    and exponential mostly evaporate.

    If at the same scale with the same biases, the values match up between
    the two systems.


    At FP8, they are basically equivalent.
    With BF16, or S.E8.M7, the values will differ, but not drastically.

    With Binary16, they are not equivalent.
    For FDIV, much like InvSqrt, a single N-R can fix it up.


    But, yeah, with FP8:
    If difference between exponents is >3, FADD would merely return the
    larger of the two values, so yeah, a table size of 24 works (and fits in
    5 bits).

    This means, at least for FP8, the ADD/SUB lookup table could fit in 6 bits.

    So, something like:
    if(Ain[6:0]>=Bin[6:0])
    begin
        A={1'b0,Ain[6:0]}; B={1'b0,Bin[6:0]};
        sgn=Ain[7];
    end
    else
    begin
        A={1'b0,Bin[6:0]}; B={1'b0,Ain[6:0]};
        sgn=Bin[7];
    end
    isSub=Ain[7]^Bin[7];
    isOor=0;  //out of range, no effect
    D={1'b0, B[6:0]}-{1'b0, A[6:0]};
    if((!D[7] && D[6:0]!=0) || (D[7:5]!=3'b111))
        isOor=1;
    case({isSub, D[5:0]})
        7'b0000000: tBias=8'h08;
        ...
    endcase
    C=A+{D[7],D[7:1]}+tBias;
    if(isOor)
        C=A;
    if(C[7])
        C=C[6]?8'h00:8'h7F;  //overflow/underflow
    result={sgn,C[6:0]};

    Works for FP8; for bigger formats (e.g., Binary16) it would exceed the
    size of a LUT6 though.

    Maybe might need 9 bits for C though, since if subtracting a value from
    itself yields the maximally negative bias (to try to reliably hit "0"),
    then with 8 bits it might reach back into positive overflow territory.

    It is either that or special-case the scenario of isSub and D==0.

    ...


    But, yeah, the question of if there is a cheaper way to do this is
    starting to look like "probably no"...

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 27 13:01:08 2025
    From Newsgroup: comp.arch

    On 8/27/2025 12:26 PM, BGB wrote:
    On 8/27/2025 9:43 AM, Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?


    There is sort of a thing here:
    When the number of bits gets small, the practical differences between FP
    and exponential mostly evaporate.

    If at the same scale with the same biases, the values match up between
    the two systems.


    At FP8, they are basically equivalent.
    With BF16, or S.E8.M7, the values will differ, but not drastically.

    With Binary16, they are not equivalent.
    For FDIV, much like InvSqrt, a single N-R can fix it up.


    But, yeah, with FP8:
    If difference between exponents is >3, FADD would merely return the
    larger of the two values, so yeah, a table size of 24 works (and fits in
    5 bits).

    This means, at least for FP8, the ADD/SUB lookup table could fit in 6 bits.

    So, something like:
    if(Ain[6:0]>=Bin[6:0])
    begin
        A={1'b0,Ain[6:0]}; B={1'b0,Bin[6:0]};
        sgn=Ain[7];
    end
    else
    begin
        A={1'b0,Bin[6:0]}; B={1'b0,Ain[6:0]};
        sgn=Bin[7];
    end
    isSub=Ain[7]^Bin[7];
    isOor=0;  //out of range, no effect
    D={1'b0, B[6:0]}-{1'b0, A[6:0]};
    if((!D[7] && D[6:0]!=0) || (D[7:5]!=3'b111))
        isOor=1;
    case({isSub, D[5:0]})
        7'b0000000: tBias=8'h08;
        ...
    endcase
    C=A+{D[7],D[7:1]}+tBias;
    if(isOor)
        C=A;
    if(C[7])
        C=C[6]?8'h00:8'h7F;  //overflow/underflow
    result={sgn,C[6:0]};

    Works for FP8; for bigger formats (e.g., Binary16) it would exceed the
    size of a LUT6 though.

    Maybe might need 9 bits for C though, since if subtracting a value from itself yields the maximally negative bias (to try to reliably hit "0"),
    then with 8 bits it might reach back into positive overflow territory.

    It is either that or special-case the scenario of isSub and D==0.

    ...


    But, yeah, the question of if there is a cheaper way to do this is
    starting to look like "probably no"...


    Well, never mind; I then just wandered off and realized a few
    simplifications that would apply to the FP8 case:
    Flip the subtract such that D>=0;
    Eliminate the (D>>1) term, as it is effectively redundant and over its
    whole applicable range could be folded into the bias table.

    ...

    Or, essentially:
    D=A-B
    U=lookup[D]
    C=A+U

    Thinking more: will probably need to widen to 9 bits to deal with
    separating overflow and underflow over the full dynamic range.

    Like, 480-480 should give 0, and 0-480 should also give 0, ...
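
    Or, as a hypothetical C model of that structure (a sketch only; the names
    are made up, and the table here is just filled with Gaussian-log
    corrections, treating the 7 magnitude bits as a fixed-point log2 value
    with 3 fraction bits, so the exact bias values and corner cases will
    differ from the Verilog earlier):

    #include <math.h>
    #include <stdint.h>

    #define FP8_MAXMAG 0x7F

    static int8_t fp8_tab[2][32];   /* [isSub][d], d = magnitude difference */

    static void fp8_tab_init(void)  /* call once before use */
    {
        for (int d = 0; d < 32; d++) {
            double r = pow(2.0, -d / 8.0);          /* ratio of smaller to larger */
            fp8_tab[0][d] = (int8_t)lround(8.0 * log2(1.0 + r));
            fp8_tab[1][d] = (d == 0) ? -FP8_MAXMAG  /* x-x: push all the way down */
                                     : (int8_t)lround(8.0 * log2(1.0 - r));
        }
    }

    static uint8_t fp8_add(uint8_t a, uint8_t b)    /* FP8 S.E4.F3, a+b */
    {
        uint8_t A = a & 0x7F, B = b & 0x7F, sgn = a & 0x80;
        if (A < B) { uint8_t t = A; A = B; B = t; sgn = b & 0x80; }
        int isSub = ((a ^ b) & 0x80) != 0;
        int d = A - B;                        /* >= 0 after the swap */
        int C = (d >= 32) ? A : A + fp8_tab[isSub][d];
        if (C < 0)          C = 0;            /* underflow: clamp to smallest */
        if (C > FP8_MAXMAG) C = FP8_MAXMAG;   /* overflow: clamp to largest */
        return sgn | (uint8_t)C;
    }

    With the bias folded entirely into the table, the datapath really is just
    the compare/swap, one subtract, one table read, and one add, plus the
    clamps.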



    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Wed Aug 27 15:29:40 2025
    From Newsgroup: comp.arch

    On Wed, 27 Aug 2025 14:43:20 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?

    - anton

    Excellent question. Wish I had an answer.

    Given that the use case almost invariably is NN, the only interesting
    values are [or should be] fractions in the range 0 to 1. Little/no
    need for floating point.
    --- Synchronet 3.21a-Linux NewsLink 1.2