• Random: Very Low Precision FP

    From BGB@cr88192@gmail.com to comp.arch on Tue Aug 26 13:08:29 2025
    From Newsgroup: comp.arch

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    But, will use Binary16 and BF16 as the example formats.

    So, can note that one can approximate some ops with modified integer
    ADD/SUB (excluding sign-bit handling):
    a*b : A+B-0x3C00 (0x3F80 for BF16)
    a/b : A-B+0x3C00
    sqrt(a): (A>>1)+0x1E00
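
    As a rough sketch in C (made-up helper names; positive normal binary16
    inputs passed around as raw uint16_t bit patterns, no sign/NaN/Inf or
    subnormal handling):

    #include <stdint.h>

    #define H_ONE 0x3C00u  /* bit pattern of 1.0 in binary16 (0x3F80 for BF16) */

    static uint16_t h_mul_approx(uint16_t A, uint16_t B) { return A + B - H_ONE; }
    static uint16_t h_div_approx(uint16_t A, uint16_t B) { return A - B + H_ONE; }
    static uint16_t h_sqrt_approx(uint16_t A)            { return (A >> 1) + (H_ONE >> 1); }

    /* e.g. 2.0*2.0: 0x4000+0x4000-0x3C00 = 0x4400 = 4.0 (exact here; in
       general only the top few mantissa bits come out right). */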

    The harder ones though, are ADD/SUB.

    A partial ADD seems to be:
    a+b: A+((B-A)>>1)+0x0400
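
    In the same raw-bit-pattern style (again only a sketch; assumes an
    arithmetic shift on the signed difference, same-sign operands, no special
    cases):

    #include <stdint.h>

    static uint16_t h_add_approx(uint16_t A, uint16_t B)
    {
        int16_t d = (int16_t)(B - A);  /* signed difference of the bit patterns */
        return (uint16_t)(A + (d >> 1) + 0x0400);
    }

    /* e.g. 1.0+1.0: A=B=0x3C00 -> 0x4000 = 2.0 (exact);
       but 1.0+4.0: A=0x3C00, B=0x4400 -> 0x4400 = 4.0, vs. 5.0 exact. */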

    But, this simple case seems not to hold up when either doing subtract,
    or when A and B are far apart.

    So, it would appear either that there is a 4th term or the bias is
    variable (depending on the B-A term; and for ADD/SUB).

    Seems like the high bits (exponent and operator) could be used to drive
    a lookup table, but this is lame. The magic bias appears to have
    non-linear properties, so it isn't as easily represented with basic
    integer operations.

    Then again, probably other people know about all of this and might know
    what I am missing.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Aug 26 21:17:47 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256|u256]

    But, will use Binary16 and BF16 as the example formats.

    So, can note that one can approximate some ops with modified integer
    ADD/SUB (excluding sign-bit handling):
    a*b : A+B-0x3C00 (0x3F80 for BF16)
    a/b : A-B+0x3C00
    sqrt(a): (A>>1)+0x1E00

    You are aware that GPUs perform elementary transcendental functions
    (32-bits) in 5 cycles {sin(), cos(), tan(), exp(), ln(), ...}.
    These functions get within 1.5-2 ULP. See authors: Oberman, Pierno,
    Matula circa 2000-2005 for relevant data. I did a crack at this
    (patented: Samsung) that got within 0.7 and 1.2 ULP using a three
    term polynomial instead of a 2 term polynomial.
    Standard GPU FP math (32-bit and 16-bit) is 4 cycles and is now
    IEEE 754 accurate (except for a couple of outlying cases).

    So, I don't see this suggestion bringing value to the table.

    The harder ones though, are ADD/SUB.

    A partial ADD seems to be:
    a+b: A+((B-A)>>1)+0x0400

    But, this simple case seems not to hold up when either doing subtract,
    or when A and B are far apart.

    So, it would appear either that there is a 4th term or the bias is
    variable (depending on the B-A term; and for ADD/SUB).

    Seems like the high bits (exponent and operator) could be used to drive
    a lookup table, but this is lame. The magic bias appears to have
    non-linear properties, so it isn't as easily represented with basic
    integer operations.

    Then again, probably other people know about all of this and might know
    what I am missing.

    I still recommend getting the right answer over getting a close but wrong answer a couple cycles earlier.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Wed Aug 27 01:35:08 2025
    From Newsgroup: comp.arch

    On Tue, 26 Aug 2025 13:08:29 -0500, BGB wrote:

    Then again, probably other people know about all of this and might know
    what I am missing.

    A long time ago, a notation called FOCUS was proposed for low-precision floats. It represented numbers by their logarithms. Multiplication and division were done quickly by addition and subtraction.

    Addition and subtraction required a lookup table - but because the two
    numbers involved needed to be not too far apart in magnitude for the operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Wed Aug 27 01:39:47 2025
    From Newsgroup: comp.arch

    On Wed, 27 Aug 2025 01:35:08 +0000, John Savard wrote:

    Addition and subtraction required a lookup table - but because the two numbers involved needed to be not too far apart in magnitude for the operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.

    So to add two numbers, first switch them if necessary, so that the larger
    one is a, and the smaller one is b.

    Calculate b/a by subtraction.

    Then use a short table to find (a+b)/a from b/a. The value found from that table can be added to a to give (the logarithmic representation of) a+b.
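
    In C, the scheme reads something like this (a sketch only; the names, the
    steps-per-octave scale, and the table size are all made up here, and the
    values are fixed-point base-2 logarithms):

    #include <math.h>
    #include <stdint.h>

    #define SCALE   16   /* table steps per power of two (made-up resolution) */
    #define TAB_MAX 96   /* past this difference the correction rounds to 0 */

    static int16_t add_tab[TAB_MAX + 1];  /* log2(1 + 2^(-d/SCALE)), scaled */

    static void lns_init(void)
    {
        for (int d = 0; d <= TAB_MAX; d++)
            add_tab[d] = (int16_t)lround(SCALE * log2(1.0 + pow(2.0, -d / (double)SCALE)));
    }

    /* log(a+b) from log(a) and log(b); a and b both positive */
    static int16_t lns_add(int16_t la, int16_t lb)
    {
        if (lb > la) { int16_t t = la; la = lb; lb = t; }  /* make la the larger */
        int d = la - lb;                    /* -log(b/a), found by subtraction */
        if (d > TAB_MAX) return la;         /* b is too small to matter */
        return (int16_t)(la + add_tab[d]);  /* log(a) + log((a+b)/a) = log(a+b) */
    }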

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 27 00:06:44 2025
    From Newsgroup: comp.arch

    On 8/26/2025 8:35 PM, John Savard wrote:
    On Tue, 26 Aug 2025 13:08:29 -0500, BGB wrote:

    Then again, probably other people know about all of this and might know
    what I am missing.

    A long time ago, a notation called FOCUS was proposed for low-precision floats. It represented numbers by their logarithms. Multiplication and division were done quickly by addition and subtraction.


    OK, it is similar.

    In this case, floating-point values can be seen as roughly analogous to fixed-point log2 values. Not exactly, but "close enough" for some use cases.

    As long as one keeps conventional FP formats, it mostly maps over,
    nevermind if values are "slightly distorted". Say, because in
    traditional FP, each step in the mantissa is the same size, but in log2
    space the step-size differs based on the relative position within the mantissa.
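
    Concretely, for binary16 the raw bit pattern of a positive normal value,
    read as a fixed-point number, already approximates log2 (essentially
    Mitchell's old log2 approximation; exact at powers of two, off by up to
    about 0.086 in between). A quick throwaway check:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* crude float -> binary16 bit pattern, positive normals only, truncating */
    static uint16_t half_bits(float x)
    {
        uint32_t u; memcpy(&u, &x, sizeof u);
        return (uint16_t)(((u >> 16) & 0x8000u) |
                          ((((u >> 23) & 0xFF) - 127 + 15) << 10) |
                          ((u >> 13) & 0x03FFu));
    }

    int main(void)
    {
        for (float x = 1.0f; x < 8.0f; x *= 1.3f)
            printf("x=%6.3f  approx log2=%7.4f  true log2=%7.4f\n",
                   x, (half_bits(x) - 0x3C00) / 1024.0, log2(x));
        return 0;
    }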

    But, generally good at least to try to keep things looking like
    traditional floating-point values.


    So, say:
    a*b: A+B-Bias
    Where in the simple case, Bias is 1.0.

    One can get clever and lookup a bias adjustment based on the high-order
    bits of the mantissa for more accuracy. Sadly, not found any cheap way
    to calculate the bias adjustment semi-accurately. Typically, the
    relative bias offset drops to 0 at around each power of 2, so the
    adjustment stays within a power-of-2 range, and repeats for every power
    of 2.

    Where, one might naively expect the bias to simply get slightly larger
    as the mantissa gets larger (as each step itself gets larger), except
    that it is more like a "slightly lopsided hump" (IIRC, reaching its
    highest point at around 0.625-0.667 rather than 0.5; then dropping off
    more quickly than it had risen).
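
    FWIW, one way to at least derive such a per-bucket bias adjustment offline
    (a throwaway sketch, not necessarily matching the table being described
    here) is to sweep mantissa pairs, compare the raw "add the bit patterns"
    result against the exact product in log2 space, and average the needed
    correction per bucket of the raw sum's mantissa:

    #include <math.h>
    #include <stdio.h>

    #define BUCKETS 16  /* index by the top 4 mantissa bits of the raw sum */

    int main(void)
    {
        double sum[BUCKETS] = {0};
        int    cnt[BUCKETS] = {0};

        for (int ma = 0; ma < 1024; ma++) {
            for (int mb = 0; mb < 1024; mb++) {
                double fa = ma / 1024.0, fb = mb / 1024.0;
                double approx = fa + fb;                        /* what A+B-bias yields */
                double exact  = log2((1.0 + fa) * (1.0 + fb));  /* what it should yield */
                double frac   = approx - floor(approx);         /* mantissa of the raw sum */
                int    i      = (int)(frac * BUCKETS);
                sum[i] += exact - approx;   /* needed correction, in log2 units */
                cnt[i]++;
            }
        }
        for (int i = 0; i < BUCKETS; i++)   /* mean correction, in 1/1024 steps */
            printf("bucket %2d: %+7.1f\n", i, 1024.0 * sum[i] / cnt[i]);
        return 0;
    }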


    At least in theory, it could be more accurate if one could use a cubic
    spline or similar to interpolate the bias values, but by this point one
    may as well use a traditional FPU multiply.

    But, without interpolation, the table-lookup approach has an undesirable
    jitter, whereas at least with the fixed bias it is consistent.


    Had noted that seemingly one could do something like (IIRC):
    T=A+B
    Bias=0xC400|(T[8:2]^T[9:3])
    C=T+Bias
    But, this merely slightly improved "average case" accuracy (over some
    parts of the range it gets worse; as it doesn't particularly closely
    mimic the desired shape, peak not being in the right place nor having
    the correct slope, as the proper version rises and drops off more
    sharply, ...).


    Though, can note that the high 2 bits of the result would need to be
    used for range clamping:
    0x: Normal range
    10: Overflow (clamp to maximum value)
    11: Underflow (clamp to 0)
    With bit 15 or similar then being replaced with the sign bit.



    But, the interpolated lookup table does exist as a possible alternative
    FDIV strategy to Newton-Raphson. Done in software, though, it doesn't
    have many obvious advantages over N-R if one has a fast FPU multiply
    (but is a bit more tempting if one needs to implement FDIV using
    integer math).
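
    For reference, the N-R route looks something like this for binary32 (a
    sketch only; a plain exponent-flip seed rather than a tuned magic
    constant, positive normal b, no rounding guarantees):

    #include <stdint.h>
    #include <string.h>

    static float fdiv_nr(float a, float b)
    {
        uint32_t B, Y;
        float y;

        memcpy(&B, &b, sizeof B);
        Y = 0x7F000000u - B;      /* seed ~1/b: negate the exponent field */
        memcpy(&y, &Y, sizeof y);

        y = y * (2.0f - b * y);   /* each N-R step squares the relative error */
        y = y * (2.0f - b * y);   /* two steps: ~12 bits, three: ~24 bits */
        return a * y;
    }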

    It is kind of a similar situation to doing sin/cos with interpolated
    lookup tables or Taylor expansion. Here, Taylor expansion wins by
    providing more accuracy relative to CPU cost, but it does implicitly
    assume that one has native FPU ops.


    Addition and subtraction required a lookup table - but because the two numbers involved needed to be not too far apart in magnitude for the operations to do anything, the lookup tables required were shorter than
    they would be for numbers represented normally, where it would be multiplication and division that required the lookup tables.


    Yeah, addition and subtraction are seemingly the harder operations in
    the case of trying to operate in logarithmic space.

    It was more a case of wondering if anyone knew of something cheaper than,
    seemingly, the:
    D=B-A
    C=A+(D>>1)+MagicBias

    strategy with some boilerplate glued on.


    One cost saving simplification I am aware of (conventional FP) is to not
    just eliminate the concept of NaN and Inf, but also the concept of "true zero"; however, this only really works OK for formats with "sufficient
    dynamic range". In this case, 0 is no longer special, merely "the
    smallest value, generally understood as 0".

    So, for example, with BF16 or Binary32, one could mostly eliminate zero
    handling, but with FP8 or Binary16, the zero point is large enough that
    a lack of 0 would be more noticeable (in particular because various
    mathematical identities fail or misbehave in the absence of zero).

    Like, if 0*Inf => 2.0, this is "a little off".
    Also, ((0+0)!=0), ...



    But, does seem almost like I am "pretty close" to figuring out something "really cheap".

    But, while accuracy isn't that important, still at least need to land
    "in the right general area".

    Also would prefer to avoid "stair steps" due to coarse lookup tables.


    John Savard


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 27 01:17:47 2025
    From Newsgroup: comp.arch

    On 8/26/2025 4:17 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256|u256]


    Would work OK for scalar ops on a CPU; less great for SIMD (or other
    cases where one can't afford a 64K lookup table).

    For 4x FP8 on a CPU, it makes sense to just use a normal SIMD unit for this.

    But, say, what if one wants 8x or 16x wide SIMD with FP8; or use within
    more specialized units?...


    Granted, FP8 multiply is fairly cheap either way (eg: the mantissa
    multiply already fits into LUT6's and can be pulled off in combinatorial logic). It is mostly FP8 FADD that needs to be made cheaper and low
    latency; sadly approximating FADD/FSUB in the general case has typically
    been the harder problem.


    And, while a special case has been found (works for simple add with
    similar exponents), it doesn't extend to the general case.


    But, will use Binary16 and BF16 as the example formats.

    So, can note that one can approximate some ops with modified integer
    ADD/SUB (excluding sign-bit handling):
    a*b : A+B-0x3C00 (0x3F80 for BF16)
    a/b : A-B+0x3C00
    sqrt(a): (A>>1)+0x1E00

    You are aware that GPUs perform elementary transcendental functions
    (32-bits) in 5 cycles {sin(), cos(), tan(), exp(), ln(), ...}.
    These functions get within 1.5-2 ULP. See authors: Oberman, Pierno,
    Matula circa 2000-2005 for relevant data. I did a crack at this
    (patented: Samsung) that got within 0.7 and 1.2 ULP using a three
    term polynomial instead of a 2 term polynomial.
    Standard GPU FP math (32-bit and 16-bit) is 4 cycles and is now
    IEEE 754 accurate (except for a couple of outlying cases).

    So, I don't see this suggestion bringing value to the table.


    These can do some operations more cheaply.

    But, as noted, primarily for low precision values, likely in dedicated
    logic.

    This approach would not make as much sense in general for larger
    formats, given the accuracy is far below what would be considered
    acceptable.

    Though, a few of these were already in use (as CPU helper ops), usually
    to provide "starter-values" for Newton-Raphson.


    But, this sort of thing, is unlikely to replace general-purpose SIMD ops
    on a CPU or similar in any case.

    And, for the SIMD unit, can continue doing floating-point in ways that
    "are not complete garbage".


    But, say, for working with HDR values inside the rasterizer module or
    similar, this is more where this sort of thing could matter.

    Or, maybe, could be relevant for perspective-correct texture filtering
    (well, if it were working with floating-point texture coords rather than
    fixed point).

    Might be better if the module could also do transform and deal with full primitives, but this is asking too much.

    Or, failing this, if it could be used for 2D "blit" operations
    (currently only deals with square or rectangular power-of-2 images in
    Morton Order, which isn't terribly useful for "blit").

    Though, as noted, TKRA-GL keeps its textures internally in Morton Order.

    Currently, TKRA-GL uses a 12-bit linear Z-Buffer (with 4 bits for
    stencil), though it is possible that it could make sense to use floating
    point for the Z-buffer (maybe S.E3.F8; as it mostly only needs to hold
    values between -1.0 and 1.0, etc).


    Some of the audio modules also use values mostly in A-Law form.
    Though, annoyingly, it seems I have now ended up stuck with A-Law formats
    with both Bias=7 and Bias=8. Initially I added the ops primarily for
    audio, where Bias=8 made sense, but for other (non-audio) uses I needed
    Bias=7 (renamed as FP8A). So, annoyingly, there are now two sets of
    converter ops differing primarily in bias.

    But, I am mostly phasing out FP8S (E4.F3.S) in favor of plain FP8
    (S.E4.F3). Though, it lingers on as a sort of a design mistake, much
    like making my A-Law ops originally Bias=8. But, then, FP8A remains
    preferable mostly because it has a slightly larger mantissa than normal
    FP8.


    ...


    The harder ones though, are ADD/SUB.

    A partial ADD seems to be:
    a+b: A+((B-A)>>1)+0x0400

    But, this simple case seems not to hold up when either doing subtract,
    or when A and B are far apart.

    So, it would appear either that there is a 4th term or the bias is
    variable (depending on the B-A term; and for ADD/SUB).

    Seems like the high bits (exponent and operator) could be used to drive
    a lookup table, but this is lame. The magic bias appears to have
    non-linear properties, so it isn't as easily represented with basic
    integer operations.

    Then again, probably other people know about all of this and might know
    what I am missing.

    I still recommend getting the right answer over getting a close but wrong answer a couple cycles earlier.


    A lot depends on what is needed...

    In cases where a person is doing math using FP8, any real semblance of "accuracy" or "right answer" has already gone out the window. Usually
    the math is just sorta throwing darts at a dartboard and hoping they
    land somewhere in the right area.

    Though, that said, usually even with these sorts of approximations (such
    as approximating an FMUL with a modified ADD), often the top 3-5 bits of
    the mantissa are correct. So, for FP8 or BF16, the approximate answer may
    in many cases still be close to the answer from real floating-point
    logic.


    But, even for something like Binary16, it is a bit iffy.

    There are 10 bits of mantissa, and math ops that only give around 3-5
    bits of accuracy or so aren't great in this case.

    Though, sometimes the accuracy doesn't matter that much, but one may
    still want to avoid "stair steps" as the artifacts generated by this may
    be much more obvious.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Aug 27 13:56:53 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    Well, idea here is that sometimes one wants to be able to do
    floating-point math where accuracy is a very low priority.

    Say, the sort of stuff people might use FP8 or BF16 or maybe Binary16
    for (though, what I am thinking of here is low-precision even by
    Binary16 standards).

    For 8-bit stuff, just use 5 memory tables [256|u256]
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?
    Then again, probably other people know about all of this and might know
    what I am missing.
    The infamous invsqrt() trick is the canonical example of where all the
    quirks of the IEEE 754 format work just right to get you to 10+ bits
    with a single NR iteration.
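
    For reference, the usual form of that trick (from memory, the well-known
    magic-constant version plus one NR step):

    #include <stdint.h>
    #include <string.h>

    static float inv_sqrt(float x)
    {
        uint32_t i;
        float y;

        memcpy(&i, &x, sizeof i);
        i = 0x5f3759df - (i >> 1);             /* initial guess from the bits */
        memcpy(&y, &i, sizeof y);
        return y * (1.5f - 0.5f * x * y * y);  /* one NR iteration */
    }
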
    Your basic ops examples are a lot more iffy.

    I still recommend getting the right answer over getting a close but wrong answer a couple cycles earlier.

    Exactly.
    I think you showed me the idea of usually getting the correct result in
    N cycles, but in a low number of cases, the trailing bits would be too
    close to a rounding boundary, so they would add one more NR iteration.
    I just realized that the code I wrote to fix Pentium FDIV could have
    been even more efficient on a proper superscalar OoO CPU:
    Start the FDIV immediately, then at the same time do the divisor
    mantissa inspection to determine if the workaround would be needed (5
    out of 1024 cases), and only if that happens, start the slower path that takes up to twice as long.
    The idea is that for 99.5% of all divisors, the only cost would be a
    close to zero cycle correctly predicted branch, but then the remainder
    would require two FDIV operations, so 80 instead of 40 cycles.
    OTOH, that same Big OoO core can probably predict that the entire
    mantissa inspection part will end up with a "skip the workaround" branch and start the FDIV almost at once. I'm assuming that when the mispredict turns up, the core can stop a long operation like FDIV more or less
    immediately and discard the current status.
    (From memory)

    #include <stdint.h>
    #include <string.h>

    /* 128-byte bitmap flagging the (5 out of 1024) problematic divisor
       mantissa patterns; built elsewhere */
    extern const uint8_t fdiv_table[128];

    double fdiv(double a, double b)
    {
        uint64_t mant10;
        memcpy(&mant10, &b, sizeof(b));
        mant10 = (mant10 >> 42) & 1023;  /* top 10 bits of the divisor mantissa */
        if (fdiv_table[mant10 >> 3] & (1 << (mant10 & 7))) {
            // set fpu to extended/long double, save previous mode
            b *= 15.0/16.0; // Exact operation!
            a *= 15.0/16.0; // Exact operation!
            // Restore to previous precision mode
        }
        return a / b;
    }
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 27 14:43:20 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).
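
    A quick way to sanity-check that threshold (a throwaway sketch; the table
    here is the usual one-dimensional Gaussian-log correction, indexed only by
    the difference d in base-2^(1/4) steps):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        for (int d = 0; d <= 16; d++) {
            long f = lround(4.0 * log2(1.0 + pow(2.0, -d / 4.0)));
            printf("d=%2d  F(d)=%ld\n", d, f);  /* F(0)=4; F(14) rounds to 0 */
        }
        return 0;
    }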

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 27 12:26:31 2025
    From Newsgroup: comp.arch

    On 8/27/2025 9:43 AM, Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?


    There is sort of a thing here:
    When the number of bits gets small, the practical differences between FP
    and exponential mostly evaporate.

    If at the same scale with the same biases, the values match up between
    the two systems.


    At FP8, they are basically equivalent.
    With BF16, or S.E8.M7, the values will differ, but not drastically.

    With Binary16, they are not equivalent.
    For FDIV, much like InvSqrt, a single N-R can fix it up.


    But, yeah, with FP8:
    If difference between exponents is >3, FADD would merely return the
    larger of the two values, so yeah, a table size of 24 works (and fits in
    5 bits).

    This means, at least for FP8, the ADD/SUB lookup table could fit in 6 bits.

    So, something like:
    if(Ain[6:0]>=Bin[6:0])
    begin
        A={1'b0,Ain[6:0]}; B={1'b0,Bin[6:0]};
        sgn=Ain[7];
    end
    else
    begin
        A={1'b0,Bin[6:0]}; B={1'b0,Ain[6:0]};
        sgn=Bin[7];
    end
    isSub=Ain[7]^Bin[7];
    isOor=0;  //out of range, no effect
    D={1'b0, B[6:0]}-{1'b0, A[6:0]};
    if((!D[7] && D[6:0]!=0) || (D[7:5]!=3'b111))
        isOor=1;
    case({isSub, D[5:0]})
        7'b0000000: tBias=8'h08;
        ...
    endcase
    C=A+{D[7],D[7:1]}+tBias;
    if(isOor)
        C=A;
    if(C[7])
        C=C[6]?8'h00:8'h7F;  //overflow/underflow
    result={sgn,C[6:0]};

    Works for FP8; for bigger formats (e.g., Binary16) it would exceed the
    size of a LUT6 though.

    Maybe might need 9 bits for C though, since if subtracting a value from
    itself yields the maximally negative bias (to try to reliably hit "0"),
    then with 8 bits it might reach back into positive overflow territory.

    It is either that or special-case the scenario of isSub and D==0.

    ...


    But, yeah, the question of if there is a cheaper way to do this is
    starting to look like "probably no"...

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 27 13:01:08 2025
    From Newsgroup: comp.arch

    On 8/27/2025 12:26 PM, BGB wrote:
    On 8/27/2025 9:43 AM, Anton Ertl wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?


    There is sort of a thing here:
    When the number of bits gets small, the practical differences between FP
    and exponential mostly evaporate.

    If at the same scale with the same biases, the values match up between
    the two systems.


    At FP8, they are basically equivalent.
    With BF16, or S.E8.M7, the values will differ, but not drastically.

    With Binary16, they are not equivalent.
    For FDIV, much like InvSqrt, a single N-R can fix it up.


    But, yeah, with FP8:
    If difference between exponents is >3, FADD would merely return the
    larger of the two values, so yeah, a table size of 24 works (and fits in
    5 bits).

    This means, at least for FP8, the ADD/SUB lookup table could fit in 6 bits.

    So, something like:
    if(Ain[6:0]>=Bin[6:0])
    begin
        A={1'b0,Ain[6:0]}; B={1'b0,Bin[6:0]};
        sgn=Ain[7];
    end
    else
    begin
        A={1'b0,Bin[6:0]}; B={1'b0,Ain[6:0]};
        sgn=Bin[7];
    end
    isSub=Ain[7]^Bin[7];
    isOor=0;  //out of range, no effect
    D={1'b0, B[6:0]}-{1'b0, A[6:0]};
    if((!D[7] && D[6:0]!=0) || (D[7:5]!=3'b111))
        isOor=1;
    case({isSub, D[5:0]})
        7'b0000000: tBias=8'h08;
        ...
    endcase
    C=A+{D[7],D[7:1]}+tBias;
    if(isOor)
        C=A;
    if(C[7])
        C=C[6]?8'h00:8'h7F;  //overflow/underflow
    result={sgn,C[6:0]};

    Works for FP8; for bigger formats (e.g., Binary16) it would exceed the
    size of a LUT6 though.

    Maybe might need 9 bits for C though, since if subtracting a value from itself yields the maximally negative bias (to try to reliably hit "0"),
    then with 8 bits it might reach back into positive overflow territory.

    It is either that or special-case the scenario of isSub and D==0.

    ...


    But, yeah, the question of if there is a cheaper way to do this is
    starting to look like "probably no"...


    Well, never mind; I then just wandered off and realized a few
    simplifications that would apply to the FP8 case:
    Flip the subtract such that D>=0;
    Eliminate the (D>>1) term, as it is effectively redundant and over its
    whole applicable range could be folded into the bias table.

    ...

    Or, essentially:
    D=A-B
    U=lookup[D]
    C=A+U

    Thinking more: will probably need to widen to 9 bits to deal with
    separating overflow and underflow over the full dynamic range.

    Like, 480-480 should give 0, and 0-480 should also give 0, ...
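
    Or, as a hypothetical C model of that structure (a sketch only; the names
    are made up, and the table here is just filled with Gaussian-log
    corrections, treating the 7 magnitude bits as a fixed-point log2 value
    with 3 fraction bits, so the exact bias values and corner cases will
    differ from the Verilog earlier):

    #include <math.h>
    #include <stdint.h>

    #define FP8_MAXMAG 0x7F

    static int8_t fp8_tab[2][32];   /* [isSub][d], d = magnitude difference */

    static void fp8_tab_init(void)  /* call once before use */
    {
        for (int d = 0; d < 32; d++) {
            double r = pow(2.0, -d / 8.0);          /* ratio of smaller to larger */
            fp8_tab[0][d] = (int8_t)lround(8.0 * log2(1.0 + r));
            fp8_tab[1][d] = (d == 0) ? -FP8_MAXMAG  /* x-x: push all the way down */
                                     : (int8_t)lround(8.0 * log2(1.0 - r));
        }
    }

    static uint8_t fp8_add(uint8_t a, uint8_t b)    /* FP8 S.E4.F3, a+b */
    {
        uint8_t A = a & 0x7F, B = b & 0x7F, sgn = a & 0x80;
        if (A < B) { uint8_t t = A; A = B; B = t; sgn = b & 0x80; }
        int isSub = ((a ^ b) & 0x80) != 0;
        int d = A - B;                        /* >= 0 after the swap */
        int C = (d >= 32) ? A : A + fp8_tab[isSub][d];
        if (C < 0)          C = 0;            /* underflow: clamp to smallest */
        if (C > FP8_MAXMAG) C = FP8_MAXMAG;   /* overflow: clamp to largest */
        return sgn | (uint8_t)C;
    }

    With the bias folded entirely into the table, the datapath really is just
    the compare/swap, one subtract, one table read, and one add, plus the
    clamps.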



    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Wed Aug 27 15:29:40 2025
    From Newsgroup: comp.arch

    On Wed, 27 Aug 2025 14:43:20 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    They don't even need to be full 8-bit: With a tiny amount of logic to
    handle the signs you are already down to 128x128, right?

    With exponential representation, say with base 2^(1/4) (range
    0.000018-55109 for exponents -63 to +63, and factor 1.19 between
    consecutive numbers), if the absolutely smaller number is smaller by a
    fixed amount in exponential representation (14 for our base 2^(1/4)
    numbers), adding or subtracting it won't make a difference. Which
    means that we need a 14*15/2=105 entry table (with 8-bit results) for
    addition and a table with the same size for subtraction, and a little
    logic for handling the cases where the numbers are too different or 0,
    or, if supported, +/-Inf or NaN (which reduce the range a little).

    If you want such a representation with finer-grained resolution, you
    get a smaller range and need larger tables. E.g., if you want to have
    a granularity as good as the minimum granularity of FP with 3-bit
    mantissa (with hidden 1), i.e., 1.125, you need 2^(1/6), with a
    granularity of 1.122 and a range 0.00069-1448; adding or subtracting a
    number where the representation is 25 smaller makes no difference, so
    the table sizes are 25*26/2=325 entries. Still looks relatively
    cheap.

    Why are people going for something FP-like instead of exponential
    if the number of bits is so small?

    - anton

    Excellent question. Wish I had an answer.

    Given that the use case almost invariably is NN, the only interesting
    values are [or should be] fractions in the range 0 to 1. Little/no
    need for floating point.
    --- Synchronet 3.21a-Linux NewsLink 1.2