Well, the idea here is that sometimes one wants to be able to do
floating-point math where accuracy is a very low priority.
Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
for (though, what I am thinking of here is low-precision even by
Binary16 standards).
But, will use Binary16 and BF16 as the example formats.
So, can note that one can approximate some ops with modified integer
ADD/SUB (excluding sign-bit handling):
a*b : A+B-0x3C00 (0x3F80 for BF16)
a/b : A-B+0x3C00
sqrt(a): (A>>1)+0x1E00
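Since it may not be obvious how these behave, here is a minimal C sketch of the same trick spelled out at float32 width, where 0x3F800000 (the encoding of 1.0) plays the role of the 0x3C00 / 0x3F80 constants above; the f2u/u2f helpers and the sample values are just for this sketch, and only positive normal inputs are considered:

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>
  #include <math.h>

  /* Reinterpret a float's bits as an integer and back. */
  static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, 4); return u; }
  static float u2f(uint32_t u) { float f; memcpy(&f, &u, 4); return f; }

  int main(void) {
      float a = 3.0f, b = 1.7f;
      /* Approximate a*b, a/b, sqrt(a) with integer arithmetic on the bit
         patterns; 0x3F800000 is the float32 encoding of 1.0, and 0x1FC00000
         is half of it (the analog of 0x1E00).  Results are only rough. */
      float mul = u2f(f2u(a) + f2u(b) - 0x3F800000u);
      float quo = u2f(f2u(a) - f2u(b) + 0x3F800000u);
      float sqr = u2f((f2u(a) >> 1) + 0x1FC00000u);
      printf("a*b     ~ %f (exact %f)\n", mul, a * b);
      printf("a/b     ~ %f (exact %f)\n", quo, a / b);
      printf("sqrt(a) ~ %f (exact %f)\n", sqr, sqrtf(a));
      return 0;
  }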
The harder ones though, are ADD/SUB.
A partial ADD seems to be:
a+b: A+((B-A)>>1)+0x0400
But, this simple case seems not to hold up when either doing subtract,
or when A and B are far apart.
So, it would appear either that there is a 4th term or the bias is
variable (depending on the B-A term; and for ADD/SUB).
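To make the "holds up when close, breaks when far apart" behavior concrete, here is a small C check of the same formula at float32 width (0x00800000 being one exponent step, the analog of the 0x0400 bias); approx_add and the test values are just for this sketch, same-sign operands only:

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, 4); return u; }
  static float u2f(uint32_t u) { float f; memcpy(&f, &u, 4); return f; }

  /* A + ((B-A)>>1) + bias, assuming arithmetic shift of the signed difference. */
  static float approx_add(float a, float b) {
      uint32_t A = f2u(a), B = f2u(b);
      int32_t d = (int32_t)(B - A) >> 1;
      return u2f(A + (uint32_t)d + 0x00800000u);
  }

  int main(void) {
      printf("close: %f vs exact %f\n", approx_add(3.0f, 2.5f),  3.0f + 2.5f);
      printf("far:   %f vs exact %f\n", approx_add(3.0f, 0.01f), 3.0f + 0.01f);
      return 0;
  }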
Seems like the high bits (exponent and operator) could be used to drive
a lookup table, but this is lame. The magic bias appears to have
non-linear properties, so it isn't as easily represented with basic integer operations.
Then again, probably other people know about all of this and might know
what I am missing.
On Tue, 26 Aug 2025 13:08:29 -0500, BGB wrote:
Then again, probably other people know about all of this and might know
what I am missing.
A long time ago, a notation called FOCUS was proposed for low-precision floats. It represented numbers by their logarithms. Multiplication and division were done quickly by addition and subtraction.
Addition and subtraction required a lookup table - but because the two numbers involved needed to be fairly close in magnitude for the operation to have any effect, the tables required were shorter than the multiplication and division tables would be for numbers represented normally.
John Savard
BGB <cr88192@gmail.com> posted:
Well, the idea here is that sometimes one wants to be able to do
floating-point math where accuracy is a very low priority.
Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
for (though, what I am thinking of here is low-precision even by
Binary16 standards).
For 8-bit stuff, just use 5 memory tables [256|u256]
But, will use Binary16 and BF16 as the example formats.
So, can note that one can approximate some ops with modified integer
ADD/SUB (excluding sign-bit handling):
a*b : A+B-0x3C00 (0x3F80 for BF16)
a/b : A-B+0x3C00
sqrt(a): (A>>1)+0x1E00
You are aware that GPUs perform elementary transcendental functions
(32-bits) in 5 cycles {sin(), cos(), tan(), exp(), ln(), ...}.
These functions get within 1.5-2 ULP. See authors: Oberman, Pierno,
Matula circa 2000-2005 for relevant data. I took a crack at this
(patented: Samsung) and got within 0.7 and 1.2 ULP using a three-term
polynomial instead of a two-term polynomial.
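For context, here is a toy C sketch of the general table-plus-quadratic-polynomial scheme (this is not the cited Oberman/Matula hardware and does not reach its accuracy; the 32-segment split, Taylor coefficients, and names like recip_approx are made up for illustration): approximate 1/x on [1,2) from a small coefficient table plus two multiplies and two adds:

  #include <stdio.h>
  #include <math.h>

  #define SEGS 32
  static double c0[SEGS], c1[SEGS], c2[SEGS];

  /* Quadratic coefficients about each segment midpoint (Taylor here;
     real designs fit minimax coefficients instead). */
  static void build_table(void) {
      for (int i = 0; i < SEGS; i++) {
          double xm = 1.0 + (i + 0.5) / SEGS;
          c0[i] =  1.0 / xm;
          c1[i] = -1.0 / (xm * xm);
          c2[i] =  1.0 / (xm * xm * xm);
      }
  }

  static double recip_approx(double x) {        /* assumes 1.0 <= x < 2.0 */
      int i = (int)((x - 1.0) * SEGS);          /* table index from high bits */
      double d = x - (1.0 + (i + 0.5) / SEGS);  /* offset from segment midpoint */
      return c0[i] + d * (c1[i] + d * c2[i]);   /* Horner: 2 mul + 2 add */
  }

  int main(void) {
      build_table();
      double worst = 0.0;
      for (double x = 1.0; x < 2.0; x += 1e-5) {
          double err = fabs(recip_approx(x) * x - 1.0);
          if (err > worst) worst = err;
      }
      printf("worst relative error ~ %g\n", worst);  /* roughly 4e-6 */
      return 0;
  }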
Standard GPU FP math (32-bit and 16-bit) is 4 cycles and is now
IEEE 754 accurate (except for a couple of outlying cases).
So, I don't see this suggestion bringing value to the table.
The harder ones though, are ADD/SUB.
A partial ADD seems to be:
a+b: A+((B-A)>>1)+0x0400
But, this simple case seems not to hold up when either doing subtract,
or when A and B are far apart.
So, it would appear either that there is a 4th term or the bias is
variable (depending on the B-A term; and for ADD/SUB).
Seems like the high bits (exponent and operator) could be used to drive
a lookup table, but this is lame. The magic bias appears to have
non-linear properties, so it isn't as easily represented with basic integer
operations.
Then again, probably other people know about all of this and might know
what I am missing.
I still recommend getting the right answer over getting a close but wrong answer a couple cycles earlier.
BGB <cr88192@gmail.com> posted:
Well, the idea here is that sometimes one wants to be able to do
floating-point math where accuracy is a very low priority.
Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
for (though, what I am thinking of here is low-precision even by
Binary16 standards).
For 8-bit stuff, just use 5 memory tables [256|u256]
The infamous invsqrt() trick is the canonical example of where all the ...
Then again, probably other people know about all of this and might know
what I am missing.
I still recommend getting the right answer over getting a close but wrong answer a couple cycles earlier.
They don't even need to be full 8-bit: With a tiny amount of logic to
handle the signs you are already down to 128x128, right?
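To illustrate the "just use memory tables" / 128x128 point, here is a small C sketch that builds a multiply table over the 7 magnitude bits of an assumed E4M3-like FP8 format (1 sign, 4 exponent, 3 mantissa bits, bias 7; Inf/NaN handling omitted); the format choice and names such as fp8_mag_to_double are assumptions of this sketch, with the sign handled outside the table:

  #include <stdio.h>
  #include <stdint.h>
  #include <math.h>

  static double fp8_mag_to_double(int m) {          /* m = low 7 bits */
      int e = (m >> 3) & 15, f = m & 7;
      if (e == 0) return (f / 8.0) * pow(2.0, -6);  /* subnormal */
      return (1.0 + f / 8.0) * pow(2.0, e - 7);
  }

  static int double_to_fp8_mag(double x) {          /* nearest magnitude code */
      int best = 0; double bestd = 1e300;
      for (int m = 0; m < 128; m++) {
          double d = fabs(fp8_mag_to_double(m) - x);
          if (d < bestd) { bestd = d; best = m; }
      }
      return best;
  }

  static uint8_t mul_tab[128][128];                 /* 16K entries */

  int main(void) {
      for (int a = 0; a < 128; a++)
          for (int b = 0; b < 128; b++)
              mul_tab[a][b] = (uint8_t)double_to_fp8_mag(
                  fp8_mag_to_double(a) * fp8_mag_to_double(b));

      /* Usage: strip the sign bits, index the table, XOR the signs back on. */
      int a = 0x48, b = 0x3C;
      printf("%.4f * %.4f ~ %.4f\n",
             fp8_mag_to_double(a), fp8_mag_to_double(b),
             fp8_mag_to_double(mul_tab[a][b]));
      return 0;
  }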
Terje Mathisen <terje.mathisen@tmsw.no> writes:
They don't even need to be full 8-bit: With a tiny amount of logic to
handle the signs you are already down to 128x128, right?
With exponential representation, say with base 2^(1/4) (range
0.000018-55109 for exponents -63 to +63, and factor 1.19 between
consecutive numbers), if the absolutely smaller number is smaller by a
fixed amount in exponential representation (14 for our base 2^(1/4)
numbers), adding or subtracting it won't make a difference. Which
means that we need a 14*15/2=105 entry table (with 8-bit results) for addition and a table with the same size for subtraction, and a little
logic for handling the cases where the numbers are too different or 0,
or, if supported, +/-Inf or NaN (which reduce the range a little).
If you want such a representation with finer-grained resolution, you
get a smaller range and need larger tables. E.g., if you want to have
a granularity as good as the minimum granularity of FP with 3-bit
mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
number where the representation is 25 smaller makes no difference, so
the table sizes are 25*26/2=325 entries. Still looks relatively
cheap.
Why are people going for something FP-like instead of exponential
if the number of bits is so small?
- anton
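A minimal C sketch of the exponential (logarithmic-number-system) addition being described, using the base-2^(1/4) code from above; this sketch only covers addition of magnitudes, with a 1-D correction table indexed by the exponent difference (subtraction would use a second, similar table), and names like lns_add are made up here:

  #include <stdio.h>
  #include <stdlib.h>
  #include <math.h>

  /* Magnitude = 2^(e/4), e a small signed integer; sign, zero, Inf, NaN
     handling omitted.  Multiplication is just e_a + e_b. */
  #define DMAX 14                        /* differences >= 14 change nothing */
  static int add_tab[DMAX];

  static void build_tab(void) {
      for (int d = 0; d < DMAX; d++)
          add_tab[d] = (int)lround(4.0 * log2(1.0 + pow(2.0, -d / 4.0)));
  }

  /* Add two magnitudes given their exponent codes; returns the result code. */
  static int lns_add(int ea, int eb) {
      int hi = ea > eb ? ea : eb;
      int d  = abs(ea - eb);
      return (d >= DMAX) ? hi : hi + add_tab[d];
  }

  int main(void) {
      build_tab();
      int ea = 8, eb = 2;                /* 2^(8/4)=4.0 and 2^(2/4)~1.414 */
      double exact  = pow(2.0, ea / 4.0) + pow(2.0, eb / 4.0);
      double approx = pow(2.0, lns_add(ea, eb) / 4.0);
      printf("exact %.4f, LNS add gives %.4f\n", exact, approx);
      return 0;
  }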
On 8/27/2025 9:43 AM, Anton Ertl wrote:
Terje Mathisen <terje.mathisen@tmsw.no> writes:
They don't even need to be full 8-bit: With a tiny amount of logic to
handle the signs you are already down to 128x128, right?
With exponential representation, say with base 2^(1/4) (range
0.000018-55109 for exponents -63 to +63, and factor 1.19 between
consecutive numbers), if the absolutely smaller number is smaller by a
fixed amount in exponential representation (14 for our base 2^(1/4)
numbers), adding or subtracting it won't make a difference. Which
means that we need a 14*15/2=105 entry table (with 8-bit results) for
addition and a table with the same size for subtraction, and a little
logic for handling the cases where the numbers are too different or 0,
or, if supported, +/-Inf or NaN (which reduce the range a little).
If you want such a representation with finer-grained resolution, you
get a smaller range and need larger tables. E.g., if you want to have
a granularity as good as the minimum granularity of FP with 3-bit
mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
number where the representation is 25 smaller makes no difference, so
the table sizes are 25*26/2=325 entries. Still looks relatively
cheap.
Why are people going for something FP-like instead of exponential
if the number of bits is so small?
There is sort of this thing:
When the number of bits gets small, the practical differences between FP
and exponential mostly evaporate.
If they are at the same scale with the same biases, the values match up
between the two systems.
At FP8, they are basically equivalent.
With BF16, or S.E8.M7, the values will differ, but not drastically.
With Binary16, they are not equivalent.
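As a quick check of that claim, here is a short C comparison (assuming an E4M3-style FP8 with bias 7 and 3 mantissa bits): the raw bit pattern of a positive normal value tracks round(8*(log2(x)+7)), i.e. a base-2^(1/8) exponential code, to within about one step:

  #include <stdio.h>
  #include <math.h>

  int main(void) {
      for (double x = 0.75; x <= 12.0; x *= 1.7) {
          int e = (int)floor(log2(x));
          int m = (int)lround((x / pow(2.0, e) - 1.0) * 8.0);  /* 3-bit mantissa */
          if (m == 8) { m = 0; e++; }                          /* rounding carried */
          int fp8_pattern = (e + 7) * 8 + m;
          int exp_code    = (int)lround(8.0 * (log2(x) + 7.0));
          printf("x=%6.3f  fp8=%3d  exponential=%3d\n", x, fp8_pattern, exp_code);
      }
      return 0;
  }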
For FDIV, much like InvSqrt, a single N-R can fix it up.
But, yeah, with FP8:
If the difference between exponents is >3, FADD would merely return the
larger of the two values, so yeah, a table size of 24 works (and the
index fits in 5 bits).
This means, at least for FP8, the ADD/SUB lookup table could fit in 6 bits.
So, something like:
  // Split off the signs and order the operands so A has the larger magnitude.
  if(Ain[6:0]>=Bin[6:0])
  begin
    A={1'b0,Ain[6:0]}; B={1'b0,Bin[6:0]};
    sgn=Ain[7];
  end
  else
  begin
    A={1'b0,Bin[6:0]}; B={1'b0,Ain[6:0]};
    sgn=Bin[7];
  end
  isSub=Ain[7]^Bin[7];

  // D = B - A is zero or negative; below -32 the smaller operand has no effect.
  isOor=0;  //out of range, no effect
  D={1'b0, B[6:0]}-{1'b0, A[6:0]};
  if((D[7:5]!=3'b111) && (D[7:0]!=8'd0))
    isOor=1;

  // Bias LUT indexed by the operation and the magnitude difference.
  case({isSub, D[5:0]})
    7'b0000000: tBias=8'h08;
    ...
  endcase

  // C = A + (D >>> 1) + bias, clamped on overflow/underflow.
  C=A+{D[7],D[7:1]}+tBias;
  if(isOor)
    C=A;
  if(C[7])
    C=C[6]?8'h00:8'h7F;  //overflow/underflow
  result={sgn,C[6:0]};
This works for FP8; for bigger formats (e.g., Binary16) it would exceed the
size of a LUT6 though.
Might need 9 bits for C though, since if subtracting a value from itself
yields the maximally negative bias (to try to reliably hit "0"), then with
8 bits it might wrap back into positive overflow territory.
It is either that or special-case the scenario of isSub and D==0.
...
But, yeah, the question of if there is a cheaper way to do this is
starting to look like "probably no"...