• Time to eat Crow

    From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Oct 3 02:50:23 2025
    From Newsgroup: comp.arch


    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

Well, it's time to eat crow.
--------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
in both Signed and unSigned flavors. These support both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

ISA 2.0 extends that same uniformity to calculation instructions,
both Integer and Floating Point, and to a few other miscellaneous
instructions (not so easily classified).

    In all cases, an integer calculation produces a 64-bit value
range limited to that of the {Sign}×{Size}--no garbage bits
in the high parts of the registers--the register accurately
represents the calculation as specified {Sign}×{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

Conversions between integer and floating point are now also
governed by {Size}, so one can convert FP64 directly
into {unSigned}×{Int16}--more fully supporting strongly typed
languages.
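
As a concrete illustration of what the {Size}-governed convert buys at
the source level (a minimal sketch; the single-instruction claim is the
one made above, and the function name is mine):

#include <stdint.h>

/* Under ISA 2.0 the cast below can map to one FP64 -> uint16 convert,
   already range-limited; on a conventional 64-bit ISA it typically
   needs a convert plus a separate truncation/zero-extension "smash". */
uint16_t quantize(double x)
{
    return (uint16_t)x;
}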
    --------------------------------------------------------------
    Integer instructions are now::
{Signed and unSigned}×{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}
Although I am still oscillating over whether to support FP8 or FP128.

With this rearrangement of bits in the instruction formats, I
was able to get all Constant and routing control bits in the
same place and format in all {1, 2, and 3}-Operand instructions
uniformly. This simplifies <trifling> the Decoder, but more
importantly, the Operand delivery (and/or reception) mechanism.

    I was also able to compress the 7 extended operation formats
    into a single extended operation format. The instruction
    format now looks like:

    inst<31:26> Major OpCode
    inst<20:16> {Rd, Cnd field}
    inst<25:21> {SRC1, Rbase}
    inst<15:10> {SH width, else, {I,d,Sign,Size}}
    inst< 9: 6> {Minor OpCode, SRC3}
    inst< 4: 0> {offset,SRC2,Rindex,1-OP|u}
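
For illustration only, a small C sketch of pulling out the uniformly
positioned fields listed above (helper names are mine; only the field
boundaries that are not in question are used here -- the SRC3/Minor
OpCode split is revisited downthread):

#include <stdint.h>

/* Extract bit field inst<hi:lo> from a 32-bit instruction word. */
static inline uint32_t field(uint32_t inst, int hi, int lo)
{
    return (inst >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

static inline uint32_t major_opcode(uint32_t inst) { return field(inst, 31, 26); }
static inline uint32_t rd_or_cnd   (uint32_t inst) { return field(inst, 20, 16); }
static inline uint32_t src1_rbase  (uint32_t inst) { return field(inst, 25, 21); }
static inline uint32_t modifiers   (uint32_t inst) { return field(inst, 15, 10); } /* {I,d,Sign,Size} */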

    So there is 1 uniformly positioned field of Minor OpCodes,
    and one uniformly interpreted field of Operand Modifiers.
Operand Modifiers apply register routing and constant insertion
to XOP Instructions.
--------------------------------------------------------------
    So, what does this buy the Instruction Set ??

A) All integer calculations are performed at the size and
type of the result as required by the high level language::
{Signed and unSigned}×{Byte, HalfWord, Word, DoubleWord}.
This gets rid of all smash instructions across all data
types. {smash == {sext, zext, ((x<<2^n)>>2^n), ...}}
(A short C rendering of a 'smash' appears after this list.)

    B) I actually gained 1 more extended OpCode for future expansion.

    C) assembler/disassembler was simplified

D) and while I did not add any new 'instructions', I made those
already present more uniform, more supportive of the requirements
of higher level languages (like Ada), and better suited to the
stricter typing LLVM applies compared to GCC.

    In some ways I 'doubled' the instruction count while not adding
a single instruction {spelling or field-pattern} to ISA.
--------------------------------------------------------------
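For readers who have not met the 'smash' idiom named in A), a minimal
C rendering of what such an instruction does (helper names are mine):

#include <stdint.h>

/* A "smash" merely re-imposes the value range of a narrow type on a
   64-bit register, e.g. ((x << 48) >> 48) with an arithmetic shift. */
static inline int64_t smash_s32(int64_t x) { return (int32_t)x;  }  /* sign-extend from bit 31 */
static inline int64_t smash_u16(int64_t x) { return (uint16_t)x; }  /* zero-extend from bit 15 */

/* ISA 2.0 makes these redundant: every sized operation already
   delivers its result in this range-limited form. */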
    The elimination of 'smashes' shrinks the instruction count of
    GNUPLOT by 4%--maybe a bit more once we sort out all of the
compiler patterns it needs to recognize.
--------------------------------------------------------------
I wonder if crow tastes good in shepherd's pie ?!?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Oct 3 03:17:16 2025
    From Newsgroup: comp.arch

    On 2025-10-02 10:50 p.m., MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.
    --------------------------------------------------------------
    Integer instructions are now::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}
    Although I am oscillating whether to support FP8 or FP128.

For my arch, I decided to support FP128, thinking that FP8 could be
implemented with lookup tables, given that eight-bit floats tend to vary
in composition. Of course, I like more precision.
    Could it be a build option? Or a bit in a control register to flip
    between FP8 and FP128?

    With this rearrangement of bit in the instruction formats, I
    was able to get all Constant and routing control bits in the
    same place and format in all {1, 2, and 3}-Operand instructions
    uniformly. This simplifies <trifling> the Decoder, but more
    importantly; the Operand delivery (and/or reception) mechanism.

    I was also able to compress the 7 extended operation formats
    into a single extended operation format. The instruction
    format now looks like:

    inst<31:26> Major OpCode
    inst<20:16> {Rd, Cnd field}
    inst<25:21> {SRC1, Rbase}
    inst<15:10> {SH width, else, {I,d,Sign,Size}}
    inst< 9: 6> {Minor OpCode, SRC3}
    inst< 4: 0> {offset,SRC2,Rindex,1-OP|u}

    Only four bits for SRC3?

    So there is 1 uniformly positioned field of Minor OpCodes,
    and one uniformly interpreted field of Operand Modifiers.
    Operand Modifiers applies routing registers and inserting
    of constants to XOP Instructions. --------------------------------------------------------------
    So, what does this buy the Instruction Set ??

    A) All integer calculations are performed at the size and
    type of the result as required by the high level language::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}.
    This, gets rid of all smash instructions across all data
    types. {smash == {sext, zext, ((x<<2^n)>>2^n), ...}

    B) I actually gained 1 more extended OpCode for future expansion.

    C) assembler/disassembler was simplified

    D) and while I did not add any new 'instructions' I made those
    already present more uniform and supporting of the requirements
    of higher level languages (like ADA) and more suitable to the
    stricter typing LLVM provides over GCC.

    In some ways I 'doubled' the instruction count while not adding
    a single instruction {spelling or field-pattern} to ISA. --------------------------------------------------------------
    The elimination of 'smashes' shrinks the instruction count of
    GNUPLOT by 4%--maybe a bit more once we sort out all of the
    compiler patterns it needs to recognize. --------------------------------------------------------------
    I wonder if crow tastes good in shepard's pie ?!?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Oct 3 15:33:36 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-10-02 10:50 p.m., MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.
    --------------------------------------------------------------
    Integer instructions are now::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}
    Although I am oscillating whether to support FP8 or FP128.

    For my arch, I decided to support FP128 thinking that FP8 could be implemented with lookup tables, given that eight bit floats tend to vary
    in composition. Of course, I like more precision.
    Could it be a build option? Or a bit in a control register to flip
    between FP8 and FP128?

    With this rearrangement of bit in the instruction formats, I
    was able to get all Constant and routing control bits in the
    same place and format in all {1, 2, and 3}-Operand instructions
    uniformly. This simplifies <trifling> the Decoder, but more
    importantly; the Operand delivery (and/or reception) mechanism.

    I was also able to compress the 7 extended operation formats
    into a single extended operation format. The instruction
    format now looks like:

    inst<31:26> Major OpCode
    inst<20:16> {Rd, Cnd field}
    inst<25:21> {SRC1, Rbase}
    inst<15:10> {SH width, else, {I,d,Sign,Size}}
    inst< 9: 6> {Minor OpCode, SRC3}
    inst< 4: 0> {offset,SRC2,Rindex,1-OP|u}

    Only four bits for SRC3?
No, there are 5 bits--inst<9:5>--whoops.

    So there is 1 uniformly positioned field of Minor OpCodes,
    and one uniformly interpreted field of Operand Modifiers.
    Operand Modifiers applies routing registers and inserting
    of constants to XOP Instructions. --------------------------------------------------------------
    So, what does this buy the Instruction Set ??

    A) All integer calculations are performed at the size and
    type of the result as required by the high level language::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}.
    This, gets rid of all smash instructions across all data
    types. {smash == {sext, zext, ((x<<2^n)>>2^n), ...}

    B) I actually gained 1 more extended OpCode for future expansion.

    C) assembler/disassembler was simplified

    D) and while I did not add any new 'instructions' I made those
    already present more uniform and supporting of the requirements
    of higher level languages (like ADA) and more suitable to the
    stricter typing LLVM provides over GCC.

    In some ways I 'doubled' the instruction count while not adding
    a single instruction {spelling or field-pattern} to ISA. --------------------------------------------------------------
    The elimination of 'smashes' shrinks the instruction count of
    GNUPLOT by 4%--maybe a bit more once we sort out all of the
    compiler patterns it needs to recognize. --------------------------------------------------------------
    I wonder if crow tastes good in shepard's pie ?!?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Oct 3 12:40:17 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    Why? Compilers do not have any problem with this
as it's been handled by overload resolution since forever.

It's people who have the problems following type changes, and most
    compilers will warn of mixed type operations for exactly that reason.

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.

    Strongly typed languages don't natively support mixed type operations.
    They come with a set of predefined operations for specific types that
    produce specific results.

    If YOU want operators/functions that allow mixed types then they force
    you to define your own functions to perform your specific operations,
    and it forces you to deal with the consequences of your type mixing.

    All this does is force YOU, the programmer, to be explicit in your
    definition and not depend on invisible compiler specific interpretations.

If you want to support Uns8 * Int8 then it forces you, the programmer,
to deal with the fact that this produces a signed 16-bit result
in the range 255*-128..255*127 = -32640..32385.
    Now if you want to convert that result bit pattern to Uns8 by truncating
    it to the lower 8 bits, or worse treat the result as Int8 and take
    whatever random value falls in bit [7] as the sign, then that's on you.
    They just force you to be explicit what you are doing.
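
A brute-force check of that product range (a throwaway sketch,
exhaustive over both operand ranges):

#include <stdio.h>

/* Exhaustively compute the range of Uns8 * Int8 products. */
int main(void)
{
    int min = 0, max = 0;
    for (int u = 0; u <= 255; u++) {
        for (int s = -128; s <= 127; s++) {
            int p = u * s;
            if (p < min) min = p;
            if (p > max) max = p;
        }
    }
    printf("range [%d .. %d]\n", min, max);  /* [-32640 .. 32385]; fits in 16-bit signed */
    return 0;
}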

    --------------------------------------------------------------
    Integer instructions are now::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
Strongly typed languages don't have predefined operators that allow mixing.
Weakly typed languages deal with this in overload resolution and by having
predefined invisible type conversions in those operators, and then using
the normal single-type arithmetic instructions.

    Although I am oscillating whether to support FP8 or FP128.

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.

    The issue with FP128 seems associated with scaling on LD and ST
    because now scaling is 1,2,4,8,16 which adds 1 bit to the scale field.
    And in the case of a combined int-float register file deciding whether
    to expand all registers to 128 bits, or use 64-bit register pairs.
    Using 128-bit registers raises the question of 128-bit integer support,
    and using register pairs opens a whole new category of pair instructions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Oct 3 10:55:46 2025
    From Newsgroup: comp.arch

    On 10/2/2025 7:50 PM, MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    I must be missing something. Suppose I have

    C := A + B

    where A and C are 16 bit signed integers and B is an 8 bit signed
    integer. As I understand what you are doing, loading B into a register
    will leave the high order 56 bits zero. But the add instruction will presumably be half word, so if B is negative, it will get an incorrect
    answer (because B is not sign extended to 16 bits).

    What am I missing?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Oct 3 15:25:25 2025
    From Newsgroup: comp.arch

    --------------------------------------------------------------
Integer instructions are now:: {Signed and unSigned}×{Byte, HalfWord,
Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
    Strong typed languages don't have predefined operators that allow mixing.

    Not sure who's confused, but my reading of the above is not some sort of "mixing": I believe Mitch is just saying that his addition operation
    (for example) can be specified to operate on either one of int8, uint8,
    int16, uint16, ...
    But that specification applies to all inputs and outputs of the
    instruction, so it does not support adding an int8 to an int32, or other "mixes".


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Oct 3 19:55:00 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/2/2025 7:50 PM, MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    I must be missing something. Suppose I have

    C := A + B

    where A and C are 16 bit signed integers and B is an 8 bit signed
    integer. As I understand what you are doing, loading B into a register
    will leave the high order 56 bits zero. But the add instruction will presumably be half word, so if B is negative, it will get an incorrect answer (because B is not sign extended to 16 bits).

    What am I missing?

A is loaded as 16 bits, properly sign-extended to 64 bits: range [-32768..32767]
B is loaded as 8 bits, properly sign-extended to 64 bits: range [-128..127]

    ADDSH Rc,Ra,Rb

    Adds 64-bit Ra and 64-bit Rb and then sign extends the result from bit<15>. The result is a properly signed 64-bit value: range [-32768..32767]
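
A C model of what that description says ADDSH delivers (the function
name is only for illustration):

#include <stdint.h>

/* Model of ADDSH Rc,Ra,Rb: add the full 64-bit registers, then
   sign-extend the sum from bit<15>, so Rc again holds a properly
   signed 64-bit value in [-32768..32767]. */
static inline int64_t addsh(int64_t ra, int64_t rb)
{
    return (int16_t)(ra + rb);
}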


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Oct 3 20:47:08 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    Why? Compilers do not have any problem with this
    as its been handled by overload resolution since forever.

    A non-My66000 example:

    int add (int a, int b)
    {
    return a + b;
    }

is translated on powerpc64le-unknown-linux-gnu (with -O3) to

    add 3,3,4
    extsw 3,3
    blr

extsw fills the 32 high-order bits with copies of the sign bit, because
numbers returned in registers have to be correct, either as 32- or 64-bit values.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Oct 3 21:04:16 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    --------------------------------------------------------------
Integer instructions are now:: {Signed and unSigned}×{Byte, HalfWord,
Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
    Strong typed languages don't have predefined operators that allow mixing.

    Not sure who's confused, but my reading of the above is not some sort of "mixing": I believe Mitch is just saying that his addition operation
    (for example) can be specified to operate on either one of int8, uint8, int16, uint16, ...
    But that specification applies to all inputs and outputs of the
    instruction, so it does not support adding an int8 to an int32, or other "mixes".

    The outputs are correctly extended to a 64-bit number (signed or
    unsigned) so it is possible to pass results to wider operations
    without conversion.

    One example would be

    unsigned long foo (unsigned int a, unsigned int b)
    {
    return a + b;
    }

which would normally need an adjustment after the add, but which would
just be something like

    adduw r1,r1,r2
    ret

    using Mitch's new encoding.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Oct 3 21:36:07 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    Why? Compilers do not have any problem with this
    as its been handled by overload resolution since forever.

    LLVM compiles C with stricter typing than GCC resulting in a lot
    of smashes:: For example::

    int subroutine( int a, int b )
    {
    return a+b;
    }

    Compiles into:

    subroutine:
    ADD R1,R1,R2
    SRA R1,R1,<32,0> // limit result to (int)
    RET

    LLVM thinks the smash is required because [-2^31..+2^31-1] +
    [-2^31..+2^31-1] does not always fit into [-2^31..+2^31-1] !!!
    and chasing down all the cases is harder than the compiler is
ready to do. At first I thought that the Value propagation in
LLVM would find that the vast majority of arithmetic does not
need smashing. This proved frustrating to both Brian and me.
The more I read RISC-V and ARM assembly code, the more
    I realized that adding sized integer arithmetic is the only
    way to get through to the LLVM infrastructure.

    We (the My 66000 team; mostly me and Brian) have been trying to
    obey the stricter than necessary typing of LLVM and achieve the
    code density possible as if K&R rules were in play with 64-bit
    only (int)s.

    RISC-V has ADDW (but no ADDH or ADDB) to alleviate the issue on
    a majority of calculations. ARM has word sized Registers to
    alleviate the issue. Since ARM started as 32-bits ADDW is natural.
    I am exploring how to provide integer arithmetic such that smashing
    never has to happen.

    We have been chasing smashes for 9 months making little progress...

    Its people who have the problems following type changes and most
    compilers will warn of mixed type operations for exactly that reason.

It is more the Ada problem that values must fit in containers--that
is, values have a range {min..max} and calculated values outside
of that range are to be "addressed".

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.

    Strongly typed languages don't natively support mixed type operations.
    They come with a set of predefined operations for specific types that
    produce specific results.

Yes, indeed, and this is what I am providing: {Sign}×{Size} calculations,
where the result is known to be range limited to {Sign}×{Size}. Thus:

    ADDSH R7,R8,R9

R7 is range limited {Signed}×{HalfWord} == [-32768..+32767]
------------------------------------------------------------------------
    So let's look at some egregious cases::

    cvtds r2,r2 // convert double to signed 64
    srl r3,r2,#0,#32 // convert signed 64 to signed 32
    --------
    sra r1,r23,#0,#32 // smash to signed 32
    sra r2,r20,#0,#32 // smash to signed 32
    maxs r23,r2,r1 // max of signed 32
    --------
    ldd r24,[r24] // LD signed 64
    add r1,r28,#1 // innocently add #1
    sra r28,r1,#0,#32 // smash to Signed 32
cmp r1,r28,r16 // to match the other operand of CMP
--------
    call strspn
    srl r2,r1,#0,#32 // smash result Signed 32
    add r1,r25,-r1
    sra r1,r1,#0,#32 // smash Signed 32
    cmp r2,r19,r2
    srl r2,r2,#2,#1
    add r21,r21,r2 // add Bool to Signed 32
    sra r2,r20,#0,#32 // smash Signed 32
    maxs r20,r1,r2 // MAX Signed 32
    --------
    mov r1,r29 // Signed 64
    ple0 r17,FFFFFFF // ignore
    stw r17,[ip,key_rows] // ignore
    add r1,r29,#-1 // innocent subtract
    sra r1,r1,#0,#32 // smash to Signed 32
    divs r1,r1,r17 // DIV Signed 32
    --------
    lduw r2,[ip,keyT+4]
    add r2,r2,#-1 // innocent subtract
    srl r2,r2,#0,#32 // smash to unSigned 32
    cmp r3,r2,#1 // CMP unSigned 32
    // even though CMP is Signless
    --------
    add r1,r19,-r6 // not so innocent subtract
    sra r2,r1,#0,#32 // Signed
    srl r1,r1,#0,#32 // unSigned
    // only one of these can be eliminated
    --------

    If YOU want operators/functions that allow mixed types then they force
    you to define your own functions to perform your specific operations,
    and it forces you to deal with the consequences of your type mixing.

    All this does is force YOU, the programmer, to be explicit in your
    definition and not depend on invisible compiler specific interpretations.

If you want to support Uns8 * Int8 then it forces you, the programmer,
to deal with the fact that this produces a signed 16-bit result
in the range 255*-128..255*127 = -32640..32385.

    Uns8 occupies 64-bits in a register range-limited to [0..255]
    Int8 occupies 64-bits in a register range-limited to [-128..127]
    So, integer values sitting in registers occupy the whole 64-bits
    but are properly range-limited to base-type.

Multiply multiplies 2×64-bit registers and produces a 128-bit
result; since CARRY is not in effect, the bits<127..64> are
discarded and bits<63..0> are then considered.

    unSigned results simply discard bits more significant than base-type.
Signed results raise OVERFLOW if there is more significance than
    base-type (and if enabled take an exception).
    In all cases, the result delivered fits within the range of base-type.

    So, in the case you mention::

    LDUB R8,[---]
    LDSB R9,[---]
    MULSH R7,R8,R9 // result range [-32768..32767]
    -----
    MULUH R7,R8,R9 // result range [0..65535]


    Now if you want to convert that result bit pattern to Uns8 by truncating
    it to the lower 8 bits,

    MULUB R7,R8,R9 // result range [0..255]

    or worse treat the result as Int8 and take
    whatever random value falls in bit [7] as the sign, then that's on you.

    MULSB R7,R8,R9 // result range [-128..127] or OVERFLOW

    Personally, I prefer range checks that raise OVERFLOW.
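
A hedged C model of the two behaviours just described (the
mnemonic-named helpers and the overflow flag are illustrative only):

#include <stdbool.h>
#include <stdint.h>

/* MULUB-style: the unsigned result simply discards bits above the base type. */
static inline uint64_t mulub(uint64_t ra, uint64_t rb)
{
    return (uint8_t)(ra * rb);                /* range [0..255] */
}

/* MULSB-style: the signed result is range-checked; OVERFLOW is raised
   (and, if enabled, an exception taken) when the product has more
   significance than the base type. */
static inline int64_t mulsb(int64_t ra, int64_t rb, bool *overflow)
{
    int64_t p = ra * rb;                      /* low 64 bits of the product */
    *overflow = (p != (int8_t)p);             /* does not fit in [-128..127] */
    return (int8_t)p;
}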

    They just force you to be explicit what you are doing.

    --------------------------------------------------------------
    Integer instructions are now::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.

    RISC-V and ARM LLVM compilers already do this and use it to eliminate
    smashes. RISC-V is limited to WORD, ARM uses registers of WORD size.
    Both eliminate smashes. Since there are already LLVM compilers using
this (to eliminate smashes) it should not be terribly difficult to add.

    On the other hand:: ILP64 ALSO gets rid of the problem (at a different cost).

    Strong typed languages don't have predefined operators that allow mixing. Weak typed languages deal with this in overload resolution and by having predefined invisible type conversions in those operators and then using
    the normal single type arithmetic instructions.

    Although I am oscillating whether to support FP8 or FP128.

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.

    Thank you for your input.

    The issue with FP128 seems associated with scaling on LD and ST
    because now scaling is 1,2,4,8,16 which adds 1 bit to the scale field.
    And in the case of a combined int-float register file deciding whether
    to expand all registers to 128 bits, or use 64-bit register pairs.

My position is that people want 64-bit registers and an ISA that allows
reasonably easy and efficient access to 128 bits; CARRY provides this.
But the architecture is not cut out to be a big 128-bit number cruncher;
occasionally, sure, but all the time, no.

    Using 128-bit registers raises the question of 128-bit integer support,
    and using register pairs opens a whole new category of pair instructions.

    CARRY supports this.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 4 04:56:21 2025
    From Newsgroup: comp.arch

    On 10/3/2025 4:04 PM, Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    --------------------------------------------------------------
    Integer instructions are now:: {Signed and unSigned}|u{Byte, HalfWord,
    Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
Strong typed languages don't have predefined operators that allow mixing.
    Not sure who's confused, but my reading of the above is not some sort of
    "mixing": I believe Mitch is just saying that his addition operation
    (for example) can be specified to operate on either one of int8, uint8,
    int16, uint16, ...
    But that specification applies to all inputs and outputs of the
    instruction, so it does not support adding an int8 to an int32, or other
    "mixes".

    The outputs are correctly extended to a 64-bit number (signed or
    unsigned) so it is possible to pass results to wider operations
    without conversion.

    One example would be

    unsigned long foo (unsigned int a, unsigned int b)
    {
    return a + b;
    }

    which would need an adjustment after the add, and which would
    just be somethign like

    adduw r1,r1,r2
    ret

    using Mitch's new encoding.



    Yes.

    Sign extend signed types, zero extend unsigned types.
    Up-conversion is free.


    This is something the RISC-V people got wrong IMO, and adding a bunch of
    ".UW" instructions in an attempt to patch over it is just kinda ugly.

Partly for my own uses, I revived ADDWU and SUBWU (which had been dropped
in BitManip), because these are less bad than the alternative.

    I get annoyed that new extensions keep trying to add ever more ".UW" instructions rather than just having the compiler go over to
    zero-extended unsigned and make this whole mess go away.

    ...



    Ironically, the number of new instructions being added to my own ISA has mostly died off recently, largely because there is little particularly relevant to add at this point (within the realm of stuff that could be
    added).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 4 04:57:23 2025
    From Newsgroup: comp.arch

    On 10/3/2025 11:40 AM, EricP wrote:
    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it reached the
    point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both integer and
    floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply value
    range constraints--just like memory !

    Why? Compilers do not have any problem with this
    as its been handled by overload resolution since forever.

    Its people who have the problems following type changes and most
    compilers will warn of mixed type operations for exactly that reason.

    ISA 2.0 changes allows calculation instructions; both Integer and
    Floating Point; and a few other miscellaneous instructions (not so
    easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.

    Strongly typed languages don't natively support mixed type operations.
    They come with a set of predefined operations for specific types that
    produce specific results.

    If YOU want operators/functions that allow mixed types then they force
    you to define your own functions to perform your specific operations,
    and it forces you to deal with the consequences of your type mixing.

    All this does is force YOU, the programmer, to be explicit in your
    definition and not depend on invisible compiler specific interpretations.

If you want to support Uns8 * Int8 then it forces you, the programmer,
to deal with the fact that this produces a signed 16-bit result
in the range 255*-128..255*127 = -32640..32385.
    Now if you want to convert that result bit pattern to Uns8 by truncating
    it to the lower 8 bits, or worse treat the result as Int8 and take
    whatever random value falls in bit [7] as the sign, then that's on you.
    They just force you to be explicit what you are doing.

    --------------------------------------------------------------
Integer instructions are now:: {Signed and unSigned}×{Byte,
HalfWord, Word, DoubleWord}
while FP instructions are now:
{Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
    Strong typed languages don't have predefined operators that allow mixing. Weak typed languages deal with this in overload resolution and by having predefined invisible type conversions in those operators and then using
    the normal single type arithmetic instructions.

    Although I am oscillating whether to support FP8 or FP128.

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.

    The issue with FP128 seems associated with scaling on LD and ST
    because now scaling is 1,2,4,8,16 which adds 1 bit to the scale field.
    And in the case of a combined int-float register file deciding whether
    to expand all registers to 128 bits, or use 64-bit register pairs.
    Using 128-bit registers raises the question of 128-bit integer support,
    and using register pairs opens a whole new category of pair instructions.


    I generally went with register pairs...

    Where, say, for base types:
    8-bits: Rarely big enough
    16-bits: Sometimes big enough
    32-bits: Usually big enough
    64-bits: Almost always big enough

    Vector types:
    2x: Good
    4x: Better
    8x: Rarely Needed

    For a scalar type, the high 64 bits of a 128-bit register would be
    almost always wasted, so it isn't worthwhile to spend resources on
    things that are mostly just going to waste.



    At least with 64-bit registers, they cover:
    Integer values: Usually overkill
    'int' is far more common than 'long long'.
    Floating Point: Usually Optimal
    Binary64 is almost always good.
    Binary32 is frequently insufficient.
    2x Binary32 and 4x Binary16: OK

    Then, 128-bit as pairs:
    Deals with the occasional 128-bit vector and integer;
    Avoids wasting resources all the times we don't need it.

    Well, since computation isn't exactly a gas that expands to efficiently utilize the register size (going bigger = diminishing returns).


    If the CPU is superscalar, can use 2x64b lanes for the 128-bit path, ...


    As for Binary128:
    Infrequently used;
    Too expensive for direct hardware support;
So, I ended up adding trap-only support;
    Trap-only allows it to exist without also eating the FPGA.

    As for FP8:
    There are multiple formats in use:
    S.E3.M4: Bias=7 (Quats / Unit Vectors)
    S.E3.M4: Bias=8 (Audio)
    S.E4.M3: Bias=7 (NN's)
    E4.M4: Bias=7 (HDR images)

    Then, for 16-bit:
    S.E5.M10: Generic, Graphics Processing, Sometimes 3D Geometry
    Sometimes not enough dynamic range.
    S.E8.M7: NNs
    Usually not enough precision.

    It is likely the more optimal 16-bit format might actually be S.E6.M9,
    but this is non-standard.
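
A simplified decoder for one of the FP8 layouts listed above, S.E4.M3
with bias 7 (a sketch only; NaN/Inf conventions are deliberately
ignored since the various FP8 definitions disagree on them):

#include <math.h>
#include <stdint.h>

/* Decode an FP8 value laid out as S.E4.M3, bias 7.
   e == 0 is treated as subnormal; no special-case handling of NaN/Inf. */
static float fp8_e4m3_to_float(uint8_t x)
{
    int s = (x >> 7) & 1;
    int e = (x >> 3) & 0xF;
    int m =  x       & 0x7;
    float mag = (e == 0)
        ? ldexpf((float)m / 8.0f, 1 - 7)           /* subnormal: m/8 * 2^(1-bias) */
        : ldexpf(1.0f + (float)m / 8.0f, e - 7);   /* normal: (1+m/8) * 2^(e-bias) */
    return s ? -mag : mag;
}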


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Oct 4 12:37:18 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 10/2/2025 7:50 PM, MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
range limited to that of the {Sign}×{Size}--no garbage bits
in the high parts of the registers--the register accurately
represents the calculation as specified {Sign}×{Size}.

I must be missing something. Suppose I have

    C := A + B

    where A and C are 16 bit signed integers and B is an 8 bit signed
integer. As I understand what you are doing, loading B into a register
will leave the high order 56 bits zero. But the add instruction will
presumably be half word, so if B is negative, it will get an incorrect
answer (because B is not sign extended to 16 bits).

    What am I missing?


    I am pretty sure A would be sign extended to 64 bit on load and the same for B, from 8->64 bits, at which point the addition works as it should?
    When storing a 64-bit result as a 16-bit signed integer, the cpu can
    verify that the top 48 bits are either all 1 or all 0.
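
That store-time check is simply "does the 64-bit value survive a round
trip through a signed 16-bit integer", e.g.:

#include <stdbool.h>
#include <stdint.h>

/* True when the top 48 bits are all copies of bit 15, i.e. the value
   is representable as a signed 16-bit integer. */
static inline bool fits_in_int16(int64_t v)
{
    return v == (int16_t)v;
}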
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Oct 4 10:17:41 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    LLVM compiles C with stricter typing than GCC resulting in a lot
    of smashes:: For example::

    int subroutine( int a, int b )
    {
    return a+b;
    }

    Compiles into:

    subroutine:
    ADD R1,R1,R2
    SRA R1,R1,<32,0> // limit result to (int)
    RET

    I tested this on AMD64, and did not find sign-extension in the caller,
    neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret

    It's not about strict or lax typing, it's about what the calling
    convention promises about types that are smaller than a machine word.
    If the calling convention requires/guarantees that ints are
    sign-extended, the compiler must use instructions that produce a
    sign-extended result. If the calling convention guarantees that ints
    are zero-extended (sounds perverse, but RV64 has the guarantee that
    unsigned is passed in sign-extended form, which is equally perverse),
    then the compiler must use instructions that produce a zero-extended
    result (e.g., AMD64's addl). If the calling convention only requires
    and guarantees the low-order 32 bits (I call this garbage-extended),
    then the compiler can use instructions that perform 64-bit adds; this
    is what we are seeing above.

The other side of the coin is what is needed at the caller: if the
caller needs to convert a sign-extended int into a long, it does not
have to do anything. If it needs to convert a zero-extended or
garbage-extended int into a long, it has to sign-extend the value.

    I have tested this with:

    int subroutine2(int,int);

    long subroutine3(int a,int b)
    {
    return subroutine2(a,b);
    }

    On AMD64 the result is:

    gcc-14:
    0000000000000010 <subroutine3>:
    10: 48 83 ec 08 sub $0x8,%rsp
    14: e8 00 00 00 00 call 19 <subroutine3+0x9>
    19: 48 83 c4 08 add $0x8,%rsp
    1d: 48 98 cltq
    1f: c3 ret

    clang-19:
    0000000000000010 <subroutine3>:
    10: 50 push %rax
    11: e8 00 00 00 00 call 16 <subroutine3+0x6>
    16: 48 98 cltq
    18: 59 pop %rcx
    19: c3 ret

    The compilers introduce the sign-extension CLTQ because the result of
    the call is not sign-extended. For parameter passing, it's the same:

    int subroutine4(long,long);

    long subroutine5(int a,int b)
    {
    return subroutine4(a,b);
    }

    0000000000000020 <subroutine5>:
    20: 48 83 ec 08 sub $0x8,%rsp
    24: 48 63 f6 movslq %esi,%rsi
    27: 48 63 ff movslq %edi,%rdi
    2a: e8 00 00 00 00 call 2f <subroutine5+0xf>
    2f: 48 83 c4 08 add $0x8,%rsp
    33: 48 98 cltq
    35: c3 ret
    0000000000000020 <subroutine5>:
    20: 50 push %rax
    21: 48 63 ff movslq %edi,%rdi
    24: 48 63 f6 movslq %esi,%rsi
    27: e8 00 00 00 00 call 2c <subroutine5+0xc>
    2c: 48 98 cltq
    2e: 59 pop %rcx
    2f: c3 ret

    BTW, In C as it was originally conceived, that was not an issue,
    because int occupied a complete register and all smaller types are
converted to ints. The I32LP64 mistake has made it necessary to insert a lot
of sign-extensions (and C compiler writers embrace undefined behaviour
    to avoid that in some cases).

    Another mistake we see in this example is the 16-byte alignment
    requirement of SSEx. It results in the RSP adjustments around the
    call. If only AMD had decided to support unaligned SSEx memory
    accesses by default in 64-bit mode.

    LLVM thinks the smash is required because [-2^31..+2^31-1] +
    [-2^31..+2^31-1] does not always fit into [-2^31..+2^31-1] !!!
    and chasing down all the cases is harder than the compiler is
    ready to do.

    In your example, there is nothing to chase down, because subroutine()
    can be called from anywhere.

    At first I though that the Value propagation in
    LLVM would find that the vast majority of arithmetic does not
    need smashing. This proved frustrating to both myself and to
Brian. The more I read RISC-V and ARM assembly code, the more
    I realized that adding sized integer arithmetic is the only
    way to get through to the LLVM infrastructure.

    You might try changing the calling convention for int to
    garbage-extended. It can introduce sign or zero extension elsewhere,
    but maybe fewer than otherwise.

    RISC-V has ADDW (but no ADDH or ADDB) to alleviate the issue on
    a majority of calculations.

    That's an RV64 extension. RV32 does not have ADDW.

    ARM has word sized Registers to
    alleviate the issue. Since ARM started as 32-bits ADDW is natural.

    Not at all. ARM A64 is a completely new instruction set that has at
    least as much in common with PowerPC as with ARM A32 or ARM T32. I
    expect that they would not have added the 32-bit ADDW or the
    addressing modes with sign- or zero-extended 32-bit indexes if the
    MIPS and Alpha people had not made the I32LP64 mistake. Instead, they
    would have used the encoding space for more useful things.

    I am exploring how to provide integer arithmetic such that smashing
    never has to happen.

    If you want to avoid every use of a separate sign-extension or
    zero-extension instruction, add three bits to every source-register
    specifier: 2 bits for the input size (1,2,4,8 bytes), 1 for
signed/unsigned. Once you have that, there is no need to extend the
result: you can always perform the extension on input to the use of a
    result; the natural calling convention to go along with that is to garbage-extend.
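
A sketch of what such a per-source-operand specifier and its read-out
could look like (purely illustrative; the encoding and names are mine,
not anything My 66000, RISC-V, or any other ISA defines):

#include <stdint.h>

/* 5 register bits plus the 3 extra bits suggested above:
   2 bits of input size (1/2/4/8 bytes) and 1 signed/unsigned bit. */
typedef struct {
    unsigned reg  : 5;   /* register number                  */
    unsigned size : 2;   /* 0=byte, 1=half, 2=word, 3=dword  */
    unsigned uns  : 1;   /* 0 = sign-extend, 1 = zero-extend */
} src_spec;

/* The consumer extends each input as it reads it, so results can stay
   "garbage-extended" in the register file. */
static inline int64_t read_src(const int64_t *regs, src_spec s)
{
    int64_t v = regs[s.reg];
    switch (s.size) {
    case 0:  return s.uns ? (int64_t)(uint8_t)v  : (int64_t)(int8_t)v;
    case 1:  return s.uns ? (int64_t)(uint16_t)v : (int64_t)(int16_t)v;
    case 2:  return s.uns ? (int64_t)(uint32_t)v : (int64_t)(int32_t)v;
    default: return v;                            /* full 64 bits */
    }
}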

    I don't think that extension instructions are frequent enough to merit
    going to such lengths. I actually think that the RISC-V people made
    the wrong choice here, contrary to their usual stance. Instead of
    having sign-extension as a separate instruction (like zero-extension),
    they added it to a number of integer instructions, inflating the
    number of instructions for little benefit.

    So let's look at some egregious cases::

    cvtds r2,r2 // convert double to signed 64
    srl r3,r2,#0,#32 // convert signed 64 to signed 32

    unsigned?

    --------
    sra r1,r23,#0,#32 // smash to signed 32
    sra r2,r20,#0,#32 // smash to signed 32
    maxs r23,r2,r1 // max of signed 32

    With garbage-extension, you need a 32-bit maxs or sign-extend the
    operands. But you are sign-extended; why do you need it?

    Such things are not necessary with garbage-extension for add, sub,
    mul, and, or xor, i.e., the most common operations.

    --------
    ldd r24,[r24] // LD signed 64
    add r1,r28,#1 // innocently add #1
    sra r28,r1,#0,#32 // smash to Signed 32
    cmp r1,r28,r16 // to match the other operand of CMP

    Similar to the maxs case.

    --------
    call strspn
    srl r2,r1,#0,#32 // smash result Signed 32
    add r1,r25,-r1
    sra r1,r1,#0,#32 // smash Signed 32
    cmp r2,r19,r2
    srl r2,r2,#2,#1
    add r21,r21,r2 // add Bool to Signed 32
    sra r2,r20,#0,#32 // smash Signed 32
    maxs r20,r1,r2 // MAX Signed 32

    Maybe the right way here is to use size_t for the variable where you
    put the return value (strspn() returns a size_t).

    --------
    mov r1,r29 // Signed 64
    ple0 r17,FFFFFFF // ignore
    stw r17,[ip,key_rows] // ignore
    add r1,r29,#-1 // innocent subtract
    sra r1,r1,#0,#32 // smash to Signed 32
    divs r1,r1,r17 // DIV Signed 32

    Division is one of the operations where garbage-extended input is not
    ok; but fortunately it is rare.

    I doubt any compilers will use this feature.

RISC-V and ARM LLVM compilers already do this and use it to eliminate smashes.

    Shortly after we got our first Alphas in 1995, I saw DEC's C compiler
    produce lots of explicit sign-extensions (using the addl instruction)
    of both int operands and int results. In later years they got the
    compiler to emit many fewer sign-extensions. I don't remember seeing
    that many sign extensions on Alpha from gcc, ever, so apparently they
    already kept track of the extension status of a value at the time.

    On the other hand:: ILP64 ALSO gets rid of the problem (at a different cost).

    Exactly. If the I32LP64 mistake had not been made, we would have been
    spared a lot (not just extension instructions). But for ARM A64 and
    RV64, they have to adapt to the world as it is, not as it should be,
    and unfortunately that means I32LP64. For MY66000, it's your call, of
    course.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Oct 4 11:52:22 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    If you are not familiar with them, they are:

    - INTEGER takes up one storage unit
    - REAL takes up one storage unit
    - DOUBLE PRECISION takes up two storage units

    where storage units are implementation-defined. Also consider
    that 32-bit REALs and 64-bit REALs are both useful and needed,
    and that (unofficially) C's integers were identical to
    FORTRAN's INTEGER.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Oct 4 16:11:37 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    I am not familiar enough with FORTRAN to give a recommendation on
    that. However, two observations:

    * The Cray-1 is primarily a Fortran machine, and it's C implementation
    is ILP64, and it is successful. So obviously an ILP64 C can live
    fine with FORTRAN.

    * Whatever inconvenience ILP64 would have caused to Fortran
    implementors is small compared to the cost in performance and
    reliability that I32LP64 has cost in the C world and the cost in
    encoding space (and thus code size) and implementation effort and
    transistors (probably not that many, but still) that it is costing
    all customers of 64-bit processors.

    If you are not familiar with them, they are:

    - INTEGER takes up one storage unit
    - REAL takes up one storage unit
    - DOUBLE PRECISION takes up two storage units

    where storage units are implementation-defined. Also consider
    that 32-bit REALs and 64-bit REALs are both useful and needed,
    and that (unofficially) C's integers were identical to
    FORTRAN's INTEGER.

    And unofficially C's integers were as long as pointers (with a legacy
    reaching back to BCPL). If I had to choose between breaking an
    unofficial FORTRAN-C interface tradition and a C-internal tradition, I
    would choose the C-internal tradition every time.

    There are two other languages that I have thought about:

    Java was introduced with fixed-size 32-bit int and 64-bit long, and
    with references typically having the size of a machine word. The
    choice of "int" and "long" may be due to I32LP64, and if the C people
    had gone for ILP64, the Java people might have chosen different names.
    But given their goal of write-once-run-everywhere with bit-identical
    results, they probably did not want to provide a machine-word-sized
    integer type. Java became popular when 32-bit machines were still a
    thing for running Java, so there would be lots of Java around that
    uses the 32-bit integer type. Given the large amount of Java code,
    that alone might be enough to make computer architects want to add
    special architectural support for signed 32-bit integers. At least we
    would have been spared architectural support for unsigned 32-bit
    integers.

    AFAIK Rust does not have a machine-word-sized integer type; instead,
    each type has its size in its name (e.g., i32, u64). Given that Rust
    was designed recently, that does not lead to portability problems yet:
    On servers, desktops (and recently smartphones) machine words are only
    64 bits, so if you write for that, you can just use i64 and u64, and
    your software will be efficient (or you can use smaller integers, and
    unless you store a lot of them, your software will be inefficient on
    various machines thanks to sign or zero extension). If you program on
    an embedded system, the code probably won't be ported to a machine
    with a different word size, so again, choosing the integer types that
    match the word size is a good choice. If there is ever a transition
    to 128-bit machines, I expect that the Rust approach will backfire,
    but who knows if Rust will still be in significant use by then. If it
    is, it may result in costs like I32LP64 is causing now.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Oct 4 20:44:37 2025
    From Newsgroup: comp.arch

    On Sat, 04 Oct 2025 16:11:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    AFAIK Rust does not have a machine-word-sized integer type; instead,
    each type has its size in its name (e.g., i32, u64).

    Rust has machine-dependent isize and usize types, identical to ptrdiff_t
    and size_t in C.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Oct 4 20:51:43 2025
    From Newsgroup: comp.arch

    On Sat, 04 Oct 2025 16:11:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    I am not familiar enough with FORTRAN to give a recommendation on
    that. However, two observations:

    * The Cray-1 is primarily a Fortran machine, and its C implementation
    is ILP64, and it is successful. So obviously an ILP64 C can live
    fine with FORTRAN.


    I would guess that Cray-1 FORTRAN was not 100% conformant to the FORTRAN 77 standard. And they likely didn't care.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Oct 4 18:01:59 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    FORTRAN INTEGER == INT32_T

    allowing ILP64.

    If you are not familiar with them, they are:

    - INTEGER takes up one storage unit
    - REAL takes up one storage unit
    - DOUBLE PRECISION takes up two storage units

    where storage units are implementation-defined. Also consider
    that 32-bit REALs and 64-bit REALs are both useful and needed,
    and that (unofficially) C's integers were identical to
    FORTRAN's INTEGER.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Oct 4 18:05:18 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    I am not familiar enough with FORTRAN to give a recommendation on
    that. However, two observations:

    * The Cray-1 is primarily a Fortran machine, and its C implementation
    is ILP64, and it is successful. So obviously an ILP64 C can live
    fine with FORTRAN.

    * Whatever inconvenience ILP64 would have caused to Fortran
    implementors is small compared to the cost in performance and
    reliability that I32LP64 has cost in the C world and the cost in
    encoding space (and thus code size) and implementation effort and
    transistors (probably not that many, but still) that it is costing
    all customers of 64-bit processors.

    If you are not familiar with them, they are:

    - INTEGER takes up one storage unit
    - REAL takes up one storage unit
    - DOUBLE PRECISION takes up two storage units

    where storage units are implementation-defined. Also consider
    that 32-bit REALs and 64-bit REALs are both useful and needed,
    and that (unofficially) C's integers were identical to
    FORTRAN's INTEGER.

    And unofficially C's integers were as long as pointers (with a legacy reaching back to BCPL). If I had to choose between breaking an
    unofficial FORTRAN-C interface tradition and a C-internal tradition, I
    would choose the C-internal tradition every time.

    There is a quote from K&R C that states int is the most efficient
    form for computing integer arithmetic values.

    With the demand for int to remain 32-bits and the countering demand
    of LLVM to obey typing, int no longer obeys its original stated goal.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Oct 4 14:42:25 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !
    Why? Compilers do not have any problem with this
    as its been handled by overload resolution since forever.

    A non-My66000 example:

    int add (int a, int b)
    {
    return a + b;
    }

    is translated on powerpc64le-unknown-linux-gnu (with -O3 to)

    add 3,3,4
    extsw 3,3
    blr

    extsw fills the 32 high-value bits with the sign bit, because numbers returned
    in registers have to be correct, either as 32- or 64-bit values.

    Ok I see what's going on - the reference to strong typing got me
    thinking this was about operand type matching.

    Above it is treating integer arguments and return types that are
    smaller than full register width, and presumably short and char also,
    as modulo (wrapping) data types and converting them to canonical
    form by sign or zero extension. That avoids later problems in compare operations where the low order bits match but high order bits differ.
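
    A small C sketch of that hazard (the helper name is made up): if the
    high bits were left as garbage, a full-width compare could disagree
    with a 32-bit compare, so values are kept in canonical form instead.

        #include <stdint.h>

        /* Compare only the low 32 bits; with canonical (sign/zero-extended)
           values a plain 64-bit compare would give the same answer. */
        int same_low32(uint64_t a, uint64_t b)
        {
            return (uint32_t)a == (uint32_t)b;
        }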

    A strongly typed language would have separate data types for signed
    and unsigned linear integers, and for signed and unsigned modulo integers.
    The sign/zero extend for modulo result types would mask any overflow
    and prevent proper result overflow checking.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Oct 4 18:55:05 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    I am not familiar enough with FORTRAN to give a recommendation on
    that. However, two observations:

    * The Cray-1 is primarily a Fortran machine, and its C implementation
    is ILP64, and it is successful. So obviously an ILP64 C can live
    fine with FORTRAN.

    As you may know, the Cray-1 was a very special machine, which got
    away with a lot of idiosyncrasies because it was blindingly fast
    (and caused users a lot of trouble with conversion between DOUBLE
    PRECISION and REAL).

    But that was in the late 1970s. By the time the 64-bit workstations
    were being designed, REAL was firmly established as 32-bit and
    DOUBLE PRECISION as 64-bit, from the /360, the PDP-11, the VAX
    and the very 32-bit workstations that the 64-bit workstations were
    supposed to replace.


    * Whatever inconvenience ILP64 would have caused to Fortran
    implementors is small compared to the cost in performance and
    reliability that I32LP64 has cost in the C world and the cost in
    encoding space (and thus code size) and implementation effort and
    transistors (probably not that many, but still) that it is costing
    all customers of 64-bit processors.

    A 64-bit REAL and (consequently) a 128-bit DOUBLE PRECISION
    would have made the 64-bit workstations pretty much unusable for
    scientific use, and a lot of these were aimed at the technical
    and scientific market, and that meant FORTRAN.

    So, put yourself into the shoes of the people designing R4000
    workstations: they could allow their scientific and technical customers
    to use the same codes "as is", with no conversion, or tell them
    they cannot use 32-bit REAL any more, and that they need to rewrite
    all their software.

    What would they have expected their customers to do? Buy a system
    which forces them to do this, or buy a competitor's system where
    they can just recompile their software?

    You're always harping about how compilers should be bug-compatible
    to previous releases. Well, that would have been the mother of
    all incompatibilities, aka business suicide.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 4 16:04:54 2025
    From Newsgroup: comp.arch

    On 10/4/2025 12:44 PM, Michael S wrote:
    On Sat, 04 Oct 2025 16:11:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    AFAIK Rust does not have a machine-word-sized integer type; instead,
    each type has its size in its name (e.g., i32, u64).

    Rust has machine-dependent isize and usize types, identical to ptrdiff_t
    and size_t in C.


    I guess, if starting from a clean slate (in a from-scratch language), it might
    make sense to have:
    A range of defined fixed sizes;
    A range of types whose size is a product of various machine constraints.


    So, say:
    u8/u16/u32/u64/u128 //Unsigned, fixed size, default endian
    s8/s16/s32/s64/s128 //Signed, fixed size, default endian
    u8l/u16l/u32l/u64l/u128l //Unsigned, fixed size, little endian
    s8l/s16l/s32l/s64l/s128l //Signed, fixed size, little endian
    u8b/u16b/u32b/u64b/u128b //Unsigned, fixed size, big endian
    s8b/s16b/s32b/s64b/s128b //Signed, fixed size, big endian
    u8l/s8l/u8b/s8b: Technically redundant with u8/s8, but added for
    consistency.


    i8/i16/i32/i64/i128, could also make sense.
    Could also have sbit(N) and ubit(N), which specify exact-width types, but otherwise behave like the normal integer types. The power-of-2 sizes
    could be seen as mostly equivalent to the fixed-size types.


    Floating point types:
    f16/f32/f64/f128
    f8/f8a/f8u/...: Assortment of 8-bit types.
    Since no one-size-fits-all with FP8.
    (Maybe also with f*l and f*b variants?).

    Machine constraint-sized types:
    sasize/uasize: Size for arrays and similar
    spsize/upsize: Size for pointers and pointer differences
    sfsize/ufsize: Size for file offsets
    int: default 'fast' size (32 or 64 bits)
    long: default 'large but fast' size (64 or 128 bits)
    Would be 64 if machine only has 64 bit ALU operations;
    Would be 128 if machine has a 128-bit ALU available.
    intmul: Whichever size allows the fastest integer MUL or MAC.
    More likely to be 16 or 32 bits.
    ...

    Special types:
    void: No Type, pointers may freely convert to other types
    m8: Like void, but with a defined size, but no operators.
    m8 could be assumed the default type for raw memory buffers.
    m8 pointers may be freely cast to/from other pointer types.
    m16/m32/m64/m128: Has size but no defined operators.
    Casts involving these types will be bit-preserving.
    Size-mismatched casts will not be allowed.

    May use slightly different type promotion rules from C, for integer types:
      Td = Ts OP Tt
        If the range of Td is greater than or equal to that of (Ts OP Tt):
          promote to the wider of the two;
          (Ts OP Tt) promotes by default to the wider of Ts or Tt;
          if there is a signed/unsigned mismatch of the same size, or a
            smaller signed type, promote to the next larger signed type
            (Note: NOT the "same sized unsigned" as C would use).
        If the range of Td is less than that of (Ts OP Tt):
          If the result will be the same either way,
            promote to the most efficient type to carry out the operation,
            or use Td if doing so is efficient;
            narrow the result if needed (Td narrower than the intermediate type).
          Else, promote to the type of (Ts OP Tt), and narrow the result.

    In this case, the types may flow-out from the inputs and operators, but
    also flow-in from the destination type. Usually C lacks the flowing-in
    part, but it is relevant for efficient code generation.

    Note that the inward flow may happen recursively, where if Td promotion
    is used for an outward expression, the two sub-expressions may be
    re-evaluated in light of 'Td' as the destination type (vs merely the
    result of the input expressions).

    Unlike C, would still apply the same promotion behavior to 8 and 16 bit
    types as for wider types (so, there is no implicit "first auto-promote everything to int" rule). Though, it can generally still use wider ALU
    so long as the result value will retain the expected sign or zero extension.


    This would differ from C's behavior in the case of widening expressions,
    in that operating on narrower types and storing the result as a wider
    type will promote first (so no overflow happens) rather than in C where
    an overflow may happen at the narrower type, with the result promoted after the fact.
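
    A C sketch of the difference (hypothetical function), showing the case
    the proposed rules would avoid:

        #include <stdint.h>

        uint64_t scale(uint32_t a, uint32_t b)
        {
            /* C: the multiply is done at 32 bits (it wraps), then widened.
               Proposed rules: a and b would be promoted to 64 bits first,
               because the destination type is wider, so no wrap occurs.
               C workaround: return (uint64_t)a * b; */
            return a * b;
        }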

    This would have fewer "gotchas" on average than the C approach, but C's
    rules need to be maintained for C code, as some code will break if the original integer overflow behavior is not preserved. But, the existing
    rules are not entirely consistent.

    Can make the working assumption that widening is cheap but narrowing has
    a non-zero cost (though, this is the reverse from the normal RV ABI,
    where on RV64G the ABI would normally have people pay the cost at
    "unsigned int"->"long" promotion).

    In the abstract model, all narrower signed or unsigned types are sign or
    zero extended to the maximum widest type in play; we can also assume
    twos complement as the working model; ...



    The big and little endian types would mostly apply to structures and
    pointers. They would only effect local variables if the address of the
    local variable is taken (else the machine default is used; or "all
    choices being equal" assume little endian).

    By default, assume native alignment of a type unless a packed modifier
    is used (with packed applied either per variable or to the structure as
    a whole). If no packed is used, the alignment of a struct will be the
    widest member in the struct. If used on a struct, the whole struct will
    assume byte alignment. Else, the alignment will be the largest alignment
    seen within the struct (or the largest non-packed member). Could maybe
    have an 'align_as()' modifier (to specify to use the same alignment as
    another type) with the packed case being equal to byte alignment.
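
    A rough C analogue of those packing rules, using a common GCC/Clang
    extension purely for illustration (the proposed language would express
    this directly):

        #include <stdint.h>

        struct natural {                 /* aligned to its widest member */
            uint8_t  tag;
            uint32_t len;                /* offset 4, struct size 8 */
        };

        struct __attribute__((packed)) packed_whole {
            uint8_t  tag;
            uint32_t len;                /* offset 1, struct size 5 */
        };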

    Possible:
    Allow 'if()' in structs, but would be evaluated as a compile-time
    constant (so in this sense, it functions more like an ifdef, just evaluated
    later in the process).

    Might also allow VLA-like patterns if the expression is a compile time constant. Could allow a VLA as the final member of a struct, which will
    be understood the same as a zero-element array. Will have the side
    effect that the size of the struct is unknown, and it may not be used in arrays nor as the non-final member of a parent struct (and if present,
    will apply the same property on the parent struct).


    Note that structs may be classified as serializable or non-serializable. Serializable structs will need a fixed and unambiguous size;
    They will explicitly disallow pointers, references, or any other types
    that can't be serialized.

    Serializable structs would be assumed to be able to be safely read from
    or written to a file or socket, ...


    Might make sense, in such a language, to have an object model similar to C#: Structs exist, by-value by default;
    Classes always by-reference, with a single inheritance and interfaces model; Maybe for nicety, assume that interfaces can be mapped to COM-like
    objects (should map the underlying COM layout);
    ...

    Could also assume similar scoping rules to C#, with full scope known at
    the time an EXE or DLL is compiled (any undefined types or variables at
    this stage being a compiler error). The front-end parser and compiler
    would be required to still work even without a full knowledge of the type-system (WRT class-like types), but may enforce stricter constraints
    on normal value types. Though, if doing separate compilation, this only
    allows partial compilation of some features (the object system will need
    to be sorted out at link time).

    Would not have C++ style templates, but could still have generics.


    But:
    No garbage collector;
    Objects may have an explicit automatic lifetime.

    Say:
    Foo! foo();
    Does not mean that it is necessarily stack-allocated or by-value (unlike
    C++), but will mean that 'foo' will be auto-deleted when foo goes out of scope.

    Similar could also be applied to class members, so a T! member is
    auto-deleted when the parent goes out of scope. Could maybe also
    consider "T^" for cases where the member is to use reference counting
    (though it could also make sense on the class definition).

    so, some modifiers could be applied one of several places:
    Class definition: Default behavior to be used, may be overridden.
    Variable: Used in this context, may override class.
    "new()": Used at object creation for dynamically created objects.

    With possible syntax:
    T //base type, default behavior, global lifetime for objects.
    T* //pointer, structs, N/A for class objects
    T! //automatic / parent-scope lifetime
    T^ //reference counted
    T(Z) //zone lifetime

    Typically the stronger rule may be used, with it being a compiler error
    if a variable or member doesn't match the lifetime specified elsewhere
    (though with fudging for "T!" as it would apply to the point of creation and/or place-of-residence of the object in question). As such, it is
    likely that "T!" class members would primarily be initialized in
    constructors (but may be treated as 'final' outside of a constructor for
    the class in question).

    Zones will be compile-time entities. It could be treated as an error for
    an object in a longer-lived zone to hold a reference into a
    shorter-lived zone. Though, it is unclear how to enforce this at compile time.
    Zone lifetime would depend on program control flow rather than being known
    at compile time. Though, a zone-tree could be defined at compile time, and
    the compiler or runtime could error-out or fault if it detects zone
    creation or destruction which deviates from the specified dependency order.

    zonedef Z; //define a zone Z, parent of Z is global
    zonedef Z(Zp); //define zone Z whose lifetime exists within Zp.
    If Z is live and Zp is destroyed, throw.
    If Z is created and Zp is not live, throw
    If an object in Z is created, and Z is not live, throw.
    ...


    In most cases, 'delete' could be discouraged, as the only time delete is likely to be needed is if lifetime is poorly specified in some other
    way. But, we don't need generalized garbage collection, as pretty much
    no one has really made this work acceptably.

    Reference counting may leak memory, though one possibility could be to
    try to detect and flag cycle-formation when creating object graphs, with
    an explicit "weak object reference" being created in cases where cycle-creation is detected (in this case, the reference count is
    special). If the reference count for non-weak references drops to 0, it destroys the object. Downside: This puts some of the computational cost
    of a mark/sweep collector into the code for incrementing and
    decrementing reference counts.

    Though possible is allowing both reference-counting and zones on the
    same object, in which case the zone may clean up leaks from the reference-counter (assuming periodic zone destruction).


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 4 17:28:09 2025
    From Newsgroup: comp.arch

    On 10/4/2025 4:56 AM, BGB wrote:
    On 10/3/2025 4:04 PM, Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    --------------------------------------------------------------
    Integer instructions are now::
        {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
        {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
    Strong typed languages don't have predefined operators that allow
    mixing.

    Not sure who's confused, but my reading of the above is not some sort of "mixing": I believe Mitch is just saying that his addition operation
    (for example) can be specified to operate on either one of int8, uint8,
    int16, uint16, ...
    But that specification applies to all inputs and outputs of the
    instruction, so it does not support adding an int8 to an int32, or other "mixes".

    The outputs are correctly extended to a 64-bit number (signed or
    unsigned) so it is possible to pass results to wider operations
    without conversion.

    One example would be

    unsigned long foo (unsigned int a, unsigned int b)
    {
        return a + b;
    }

    which would need an adjustment after the add, and which would
    just be something like

        adduw   r1,r1,r2
        ret

    using Mitch's new encoding.



    Yes.

    Sign extend signed types, zero extend unsigned types.
    Up-conversion is free.


    This is something the RISC-V people got wrong IMO, and adding a bunch of ".UW" instructions in an attempt to patch over it is just kinda ugly.

    Partly for my own uses, I revived ADDWU and SUBWU (which had been dropped
    in BitManip), because these are less bad than the alternative.

    I get annoyed that new extensions keep trying to add ever more ".UW" instructions rather than just having the compiler go over to zero-
    extended unsigned and make this whole mess go away.

    ...



    Ironically, the number of new instructions being added to my own ISA has mostly died off recently, largely because there is little particularly relevant to add at this point (within the realm of stuff that could be added).


    Going and looking back, most major new instructions added were:
    BITMOV and BITMOV.S, ~ 7 months ago
    Some new ops related to FP8A handling and similar, ~ 2 months ago
    Mostly for Bias=7 (where, FP8A=S.E3.M4, or A-Law format)
    I couldn't just change the Bias=8 ops to 7 without breaking stuff;
    But, for non-audio uses 7 is a lot more useful.
    Mostly used for unit vectors,
    where ability to store values >= 1.0 sometimes needed.
    But, most values still < 1.0 ...
    Sorta relates to Trellis re-normalization trickery.
    Stored vector isn't exactly unit-length, but unit post-renorm.

    A few operations in the "possible" category:
    A few NN related packed multiply instructions;
    Instructions for a possible UVF1 packed block format
    (graphics and NN);
    ...

    FPU Compare 3R instructions, ~8 months ago


    While XG3 was added 11 months ago, it isn't really new instructions, so
    much as a new mode and encoding scheme for the same instructions (and it
    was only fairly recently that I got support for predicated instructions implemented in RISC-V).

    And, 12 months ago, a RISC-V target for BGBCC, and jumbo prefixes for
    the RISC-V side, ... Somehow I thought all of this happened several
    years ago, seems it was 1 year.


    Seems initial efforts to start adding RISC-V support were (only) 2 years
    ago.

    A lot more fiddling has been in things mostly related to dealing with
    RISC-V and trying to make it less terrible.


    The stuff for the recent FPU behavior tweaks are more tweaking FPU
    behavior, and haven't really involved adding new instructions (except on
    the RISC-V side, ones which already existed in the RISC-V specs).


    Hmm...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Oct 5 11:58:14 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Sat, 04 Oct 2025 16:11:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    AFAIK Rust does not have a machine-word-sized integer type; instead,
    each type has its size in its name (e.g., i32, u64).

    Rust has machine-dependent isize and usize types

    Good. But for some reason all the examples I have seen use
    integer types like i32 and u64.

    identical to ptrdiff_t and size_t in C.

    I have read that there are C implementations (variants) where ptrdiff_t
    and size_t are smaller than a pointer, in particular large-model C on
    the 8086, and that was the reason for C standard restrictions about
    pointer subtraction and pointer inequality comparison.

    I hope nobody is doing large-model Rust, even though Rust may be more appropriate for that than C.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Oct 5 15:01:06 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?
    ...
    By the time the 64-bit workstations
    were being designed, REAL was firmly established as 32-bit and
    DOUBLE PRECISION as 64-bit, from the /360, the PDP-11, the VAX
    and the very 32-bit workstations that the 64-bit workstations were
    supposed to replace.

    On the PDP-11 C's int is 16 bits. I don't know what FORTRAN's INTEGER
    is on the PDP-11 (but I remember reading about INTEGER*2 and
    INTEGER*4, AFAIK not in a PDP-11 context). In any case, I expect that FORTRAN's REAL was 32-bit on a PDP-11, and that any rule chain that
    requires that C's int is as wide as FORTRAN's REAL is broken at some
    point on the PDP-11.

    So your rules do not even work for the first machine where C has been implemented. If shortsighted FORTRAN people look at 32-bit machines
    and become accustomed to C's int being as wide as FORTRAN's INTEGER
    and REAL, they could have known from the PDP-11 that that's going to
    break for other machine word sizes.

    So, put yourself into the shoes of the people designing R4000
    workstations: they could allow their scientific and technical customers
    to use the same codes "as is", with no conversion, or tell them
    they cannot use 32-bit REAL any more, and that they need to rewrite
    all their software.

    If they want to use their software as-is, and it is written to work
    with an ILP32 C implementation, the only solution is to continue using
    an ILP32 implementation. That's not only for FORTRAN/C mixing, but
    for most C code of the day, certainly with I32LP64; I expect that the
    porting effort would have been smaller with ILP64, but there still
    would have been some.

    BTW, we have a DecStation 5000/150 with an R4000, and all C compilers
    on this machine support ILP32 and nothing else.

    What would they have expected their customers to do? Buy a system
    which forces them to do this, or buy a competitor's system where
    they can just recompile their software?

    If just recompiling is the requirement, what follows is ILP32.

    You're always harping about how compilers should be bug-compatible
    to previous releases.

    Not in the least. I did not ask for bug compatibility.

    I also did not ask for "compiling as is" on a different architecture,
    much less on a system with different address size.

    I have actually written up what I ask for: <https://www.complang.tuwien.ac.at/papers/ertl17kps.pdf>. Maybe you
    should read it one day, or reread it given that you have forgotten it.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Oct 5 18:19:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    <snip>

    Not in the least. I did not ask for bug compatibility.

    I also did not ask for "compiling as is" on a different architecture,
    much less on a system with different address size.

    I have actually written up what I ask for: <https://www.complang.tuwien.ac.at/papers/ertl17kps.pdf>. Maybe you
    should read it one day, or reread it given that you have forgotten it.

    In the referenced article you write::
    "Access to uninitialized data is another issue where absolute equivalence
    with the basic model would make important optimizations impossible. Consider
    a variable v at the end of its life (e.g., at the end of a function). Unless the compiler can prove that the location of the variable is not read later
    as a result of reading uninitialized data (say, reading the uninitialized variable w living in the same location in a different function), v would
    have to stay in the same location in future compiler versions or other optimization levels; or at least the final value of v would have to be
    stored in this location, and the initial value of w would have to be
    fetched from this location."

    If variable v and variable w are "stack variables" local to their own subroutines, it seems perfectly reasonable to assume that all deallocated
    stack variables become inaccessible. Then, later when new stack space is allocated those new variables have no relationship to any previously deallocated variables.

    That is: when the stack pointer is incremented the space is no longer accessible and::
    a) any modified cache lines are discarded instead of being written
    to memory--the space is no longer accessible so don't waste power
    making DRAM coherent with inaccessible stack space.

    Later, when the stack pointer is decremented::
    b) new cache line area can be "allocated" without reading DRAM and
    being <conceptually> initialized to zero.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Oct 5 19:30:42 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    On the PDP-11 C's int is 16 bits. I don't know what FORTRAN's INTEGER
    is on the PDP-11 (but I remember reading about INTEGER*2 and
    INTEGER*4, AFAIK not in a PDP-11 context). In any case, I expect that FORTRAN's REAL was 32-bit on a PDP-11, and that any rule chain that
    requires that C's int is as wide as FORTRAN's REAL is broken at some
    point on the PDP-11.

    I wrote INFort, one of the two F77 implementations for the PDP-11.
    INTEGER and REAL were the same size because that's what the standard
    said, and any program that used EQUIVALENCE would break otherwise. If
    you wanted shorter ints, INTEGER*2 provided them.

    Bell Labs independently wrote f77 around the same time, and its manual says they did the same thing, INTEGER was C long int, INTEGER*2 was short int.

    If the speed difference mattered, it wasn't hard to say something like

    IMPLICIT INTEGER*2(I-N)

    to make your ints short.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Oct 5 19:51:26 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?
    ...
    By the time the 64-bit worksations
    were being designed, REAL was firmly established as 32-bit and
    DOUBLE PRECISION as 64-bit, from the /360, the PDP-11, the VAX
    and the very 32-bit workstations that the 64-bit workstations were
    supposed to replace.

    On the PDP-11 C's int is 16 bits. I don't know what FORTRAN's INTEGER
    is on the PDP-11 (but I remember reading about INTEGER*2 and
    INTEGER*4, AFAIK not in a PDP-11 context). In any case, I expect that FORTRAN's REAL was 32-bit on a PDP-11, and that any rule chain that
    requires that C's int is as wide as FORTRAN's REAL is broken at some
    point on the PDP-11.

    It is possible to have a two-byte integer and a 32-bit real.
    Storage association then requires four bytes for an integer.
    This wastes space for integers (at least for arrays) but that
    is not such a big deal, because most big arrays in scientific
    code are reals.

    The same held for the Cray-1 - default integers (24 bit)
    and their weird 64-bit reals.

    The main problem is when the size of the default INTEGER _exceeds_ that
    of the smallest useful REAL: then REAL arrays become twice as big,
    plus you need to implement 128-bit REALs.

    So, put yourself into the shoes of the people designing R4000
    workstations: they could allow their scientific and technical customers
    to use the same codes "as is", with no conversion, or tell them
    they cannot use 32-bit REAL any more, and that they need to rewrite
    all their software.

    If they want to use their software as-is, and it is written to work
    with an ILP32 C implementation, the only solution is to continue using
    an ILP32 implementation.

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.


    What would they have expected their customers to do? Buy a system
    which forces them to do this, or buy a competitor's system where
    they can just recompile their software?

    If just recompiling is the requirement, what follows is ILP32.

    There is absolutely no problem with 64-bit pointers when recompiling
    Fortran.


    You're always harping about how compilers should be bug-compatible
    to previous releases.

    Not in the least. I did not ask for bug compatibility.

    I'll keep that in mind for the next time.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Oct 6 05:56:53 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    If variable v and variable w are "stack variables" local to their own subroutines, it seems perfectly reasonable to assume that all deallocated stack variables become inaccessible.

    That is debatable. This assumption is the basis of "optimizing" away
    memset() (or similar) that is intended to keep the lifetime of secret
    keys as short as possible. After this "optimization", the secret key
    continues to be in memory, and can be extracted through
    vulnerabilities, preserved for much longer in the swap area or in
    snapshots, or in the value of newly allocated uninitialized areas.
    All of which prove that the assumption is wrong.
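
    The canonical example of that problem (a sketch): the final memset is a
    dead store from the compiler's point of view and may be removed, which
    is why explicit_bzero() and C11's optional memset_s() exist.

        #include <string.h>

        void use_key(void)
        {
            unsigned char key[32];
            /* ... fill key and use it ... */
            memset(key, 0, sizeof key);   /* may be optimized away entirely */
            /* explicit_bzero(key, sizeof key); would not be removed */
        }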

    Then, later when new stack space is
    allocated, those new variables have no relationship to any previously deallocated variables.

    That is: when the stack pointer is incremented the space is no longer accessible and::
    a) any modified cache lines are discarded instead of being written
    to memory--the space is no longer accessible so don't waste power
    making DRAM coherent with inaccessible stack space.

    Later, when the stack pointer is decremented::
    b) new cache line area can be "allocated" without reading DRAM and
    being <conceptually> initialized to zero.

    I have outlined ways to optimize zeroing of memory in <2014Jul9.193122@mips.complang.tuwien.ac.at> <2022Aug5.141325@mips.complang.tuwien.ac.at>

    With that idea, the way to use it is to zero the memory when it is
    deallocated (so it is not written back to main memory; it may be
    written to the zero area as part of a larger unit). And to also zero
    it when it is allocated so that there is no need to load the data from
    outer cache levels or main memory (or their equivalents in zeroed
    memory).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Oct 6 06:26:12 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    [...]
    It is possible to have a two-byte integer and a 32-bit real.

    But according to John Levine that is not what happens on the PDP-11.
    Instead, it has 4-byte INTEGERs, demonstrating that your "unofficial
    rule" that C int is as wide as FORTRAN INTEGER did not hold.

    The same held for the Cray-1 - default integers (24 bit)
    and their weird 64-bit reals

    If FORTRAN INTEGERs are 24 bits on the Cray-1, this architecture is
    another example where your "unofficial rule" does not hold. C ints
    are 64-bit on the Cray 1.

    If they want to use their software as-is, and it is written to work
    with an ILP32 C implementation, the only solution is to continue using
    an ILP32 implementation.

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either. And I am sure that C
    programs were much more relevant for selling these machines than
    FORTRAN programs. C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    If just recompiling is the requirement, what follows is ILP32.

    There is absolutely no problem with 64-bit pointers when recompiling
    Fortran.

    Fortran is not the only consideration for designing an ABI for C, if
    it is one at all. The large number of 32bit->64bit sign-extension and zero-extension operations, either explicitly, or integrated into
    instructions such as RISC-V's addw, plus the
    "optimizations"/miscompilations to ged rid of some of the sign
    extensions are a cost that we pay all the time for the I32LP64
    mistake.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Oct 6 14:23:50 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    [...]
    <snip>

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64.

    The vast majority of C/C++ programs ran just fine on I32LP64. There
    were some that didn't, but it was certainly not "most".
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 6 11:51:18 2025
    From Newsgroup: comp.arch

    On 10/6/2025 9:23 AM, Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    [...]
    <snip>

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64.

    The vast majority of C/C++ programs ran just fine on I32LP64. There
    were some that didn't, but it was certainly not "most".

    Yes, most programs only needed minor edits.


    Some stuff I had ported:
    Doom: Mostly trivial edits;
    Had to re-implement audio and music handling.
    Heretic and Hexen:
    More edits, mostly removing MS-DOS stuff;
    Had to replace most of the audio and music code.
    ROTT:
    Extensive modification to graphics handling;
    Was very dependent on low-level VGA hardware twiddling.
    (Vs Doom's "Set 320x200 and done" approach).
    Lots of memory management and out-of-bounds issues;
    Some amount of code that is sensitive to integer wrap-on-overflow;
    ...
    (ROTT was a little harder to port)
    Quake:
    Few issues for most of the engine;
    The "progs.dat" VM required getting creative.
    It mixes pointers and 'float' in ways
    "some might consider unnatural"
    Quake 2:
    Basically 64-bit clean out of the box.
    Quake 3:
    The QVM architecture very much assumes 32-bit,
    not really a way to make it 64-bit absent a significant rewrite.
    Did allow for falling back to the Quake2 strategy,
    of using natively compiled DLLs.


    Of the programs, I still have not fully debugged ROTT when built via
    BGBCC, where there is an issue somewhere that is resulting in demo
    desyncs that tend to change from one run to another.

    Last I checked, I had it stable when built with MSVC, and had it
    basically working with a GCC build.


    Can note that ROTT is one of the larger programs I had ported to my
    project (in terms of code size), where both the ROTT and Quake3 ports
    weigh in at a little over 300 kLOC (very much larger than Doom or Quake).

    Quake 3 builds as multiple DLLs, whereas ROTT as a single binary. As
    such, ROTT currently builds the biggest EXE (with around 1MB of ".text").

    Though, curiously, there are (on average) less than 4 bytes per line of
    C; not entirely sure how that happens.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Oct 6 17:38:13 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *). This was
    not true on the PDP-11, and it was a standards violation, anyway.
    Only people who liked to play these kind of games (I know you do)
    were caught.
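
    The kind of code that was not "64-bit-clean" looked roughly like this
    (a sketch): it stores a pointer in an int, which happens to work under
    ILP32 but truncates under I32LP64.

        #include <stdint.h>

        void *roundtrip(void *p)
        {
            int h = (int)(intptr_t)p;      /* fits on ILP32, truncates on I32LP64 */
            return (void *)(intptr_t)h;    /* not equal to p once addresses exceed 32 bits */
        }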

    And I am sure that C
    programs were much more relevant for selling these machines than
    FORTRAN programs.

    Based on what data? Your own personal guess?

    C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    The problem with 64-bit INTEGERs for Fortran is that they make REAL
    unusable for lots of existing code.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Mon Oct 6 20:02:50 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *). This was
    not true on the PDP-11, ...

    The PDP-11 was a 16 bit machine with 16 bit ints and 16 bit pointers.
    There were 32 bit long and float, and 64 bit double.

    I didn't port a lot of code from the 11 to other machines, but my recollection is that the widespread assumption in Berkeley Vax code that location zero was addressable and contained binary zeros was much more painful to fix than
    size issues.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Oct 6 20:46:11 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *). This was
    not true on the PDP-11, ...

    The PDP-11 was a 16 bit machine with 16 bit ints and 16 bit pointers.
    There were 32 bit long and float, and 64 bit double.

    I didn't port a lot of code from the 11 to other machines, but my recollection is that the widespread assumption in Berkeley Vax code that location zero was addressable and contained binary zeros was much more painful to fix than
    size issues.

    "location zero was addressible". Might also point out it was RO, but yes
    that caused many problems porting BSD utilities to SVR4.

    The other issue with leaving the PDP-11 for 32-bit systems was the change
    in the size of the PID, UID, and GID. Which required more than a simple recompile, since there weren't abstract types (e.g. pid_t, gid_t, uid_t)
    for those data items yet, so code needed to be updated manually.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Tue Oct 7 01:38:02 2025
    From Newsgroup: comp.arch

    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    LLVM compiles C with stricter typing than GCC resulting in a lot
    of smashes:: For example::

    int subroutine( int a, int b )
    {
    return a+b;
    }

    Compiles into:

    subroutine:
    ADD R1,R1,R2
    SRA R1,R1,<32,0> // limit result to (int)
    RET

    I tested this on AMD64, and did not find sign-extension in the caller, neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret

    It's not about strict or lax typing, it's about what the calling
    convention promises about types that are smaller than a machine word.
    If the calling convention requires/guarantees that ints are
    sign-extended, the compiler must use instructions that produce a sign-extended result. If the calling convention guarantees that ints
    are zero-extended (sounds perverse, but RV64 has the guarantee that
    unsigned is passed in sign-extended form, which is equally perverse),
    then the compiler must use instructions that produce a zero-extended
    result (e.g., AMD64's addl). If the calling convention only requires
    and guarantees the low-order 32 bits (I call this garbage-extended),
    then the compiler can use instructions that perform 64-bit adds; this
    is what we are seeing above.

    The other side of the medal is what is needed at the caller: If the
    caller needs to convert a sign-extended int into a long, it does not
    have to do anything. If it needs to convert a zero-extended or garbage-extended int into a long, it has to sign-extend the value.

    AMD64 in hardware does 0 extension of 32-bit operations. From your
    example "lea (%rdi,%rsi,1),%eax" (AT&T notation, so %eax is the dest),
    the 64-bit register %rax will have 0's written into bits [63:32].
    So the AMD64 convention for 32-bit values in 64-bit registers is to
    zero-extend on writes. And to ignore the upper 32-bits on reads, so
    using a 64-bit register should use the %exx name.

    I agree with you that I32LP64 was a mistake, but it exists, and I
    think ARM64 did a good job handling it. It has all integer operations
    working on two sizes: 32-bit and 64-bit, and when writing a 32-bit result,
    it 0-extends the register value.

    You don't want "garbage extend" since you want a predictable answer.
    Your choices for writing 32-bit results in a 64-bit register are thus sign-extend (not a good choice) or zero-extend (what almost
    everyone chose). RISC-V is in another land, where they effectively have
    no 32-bit operations, but rather a convention that all 32-bit inputs
    must be sign-extended in a 64-bit register.

    For C and C++ code, the standard dictates that all integer operations are
    done with "int" precision, unless some operand is larger than int, and then
    do it in that precision. So there's no real need for 8-bit and 16-bit operations to be natively supported by the CPU--these operations are actually done
    as int's already. If you have a variable which is a byte, then assigning
    to that variable, and then using that variable again you will need to zero-extend, but honestly, this is not usually a performance path. It's
    likely to be stored to memory instead, so no masking or sign extending
    should be needed.
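
    A small example of that point (a sketch): the operands below are
    promoted to int before the add, so the CPU needs no native 8-bit add;
    only storing back into the narrow type implies a truncation.

        #include <stdint.h>

        uint8_t sum8(uint8_t a, uint8_t b)
        {
            return a + b;   /* a and b promoted to int; result converted back to 8 bits */
        }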

    If you pick ILP64 for your ABI, then you will get rid of almost all of
    these zero- and sign-extensions of 32-bit C and C++ code. It will just
    work. If you pick I32LP64, then you should have a full suite of 32-bit operations and 64-bit operations, at least for all add, subtract, and
    compare operations. And if you do I32LP64, your indexed addressing
    modes should have 3 types of indexed registers: 64-bit, 32-bit signed,
    and 32-bit unsigned. That worked well for ARM64.
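
    A sketch of why the three index flavors matter under I32LP64: each of
    the loads below needs a different treatment of the index before the
    64-bit address add, which ARM64 folds into the addressing mode (the
    SXTW/UXTW extended-register forms).

        double load_s(const double *p, int i)      { return p[i]; }  /* sign-extend i */
        double load_u(const double *p, unsigned i) { return p[i]; }  /* zero-extend i */
        double load_l(const double *p, long i)     { return p[i]; }  /* no extension  */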

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 7 15:52:17 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    LLVM compiles C with stricter typing than GCC resulting in a lot
    of smashes:: For example::

    int subroutine( int a, int b )
    {
    return a+b;
    }

    Compiles into:

    subroutine:
    ADD R1,R1,R2
    SRA R1,R1,<32,0> // limit result to (int)
    RET

    I tested this on AMD64, and did not find sign-extension in the caller, neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret

    It's not about strict or lax typing, it's about what the calling
    convention promises about types that are smaller than a machine word.
    If the calling convention requires/guarantees that ints are
    sign-extended, the compiler must use instructions that produce a sign-extended result. If the calling convention guarantees that ints
    are zero-extended (sounds perverse, but RV64 has the guarantee that unsigned is passed in sign-extended form, which is equally perverse),
    then the compiler must use instructions that produce a zero-extended
    result (e.g., AMD64's addl). If the calling convention only requires
    and guarantees the low-order 32 bits (I call this garbage-extended),
    then the compiler can use instructions that perform 64-bit adds; this
    is what we are seeing above.

    The other side of the medal is what is needed at the caller: If the
    caller needs to convert a sign-extended int into a long, it does not
    have to do anything. If it needs to convert a zero-extended or garbage-extended int into a long, it has to sign-extend the value.

    AMD64 in hardware does 0 extension of 32-bit operations. From your
    example "lea (%rdi,%rsi,1),%eax" (AT&T notation, so %eax is the dest),
    the 64-bit register %rax will have 0's written into bits [63:32].
    So the AMD64 convention for 32-bit values in 64-bit registers is to
    zero-extend on writes, and to ignore the upper 32 bits on reads, so code
    that wants just the 32-bit value should use the %exx register name.

    I agree with you that I32LP64 was a mistake, but it exists, and I
    think ARM64 did a good job handling it. It has all integer operations working on two sizes: 32-bit and 64-bit, and when writing a 32-bit result,
    it 0-extends the register value.

    You don't want "garbage extend" since you want a predictable answer.

    Strongly Agree.

    Your choices for writing 32-bit results in a 64-bit register are thus sign-extend (not a good choice) or zero-extend (what almost
    everyone chose). RISC-V is in another land, where they effectively have
    no 32-bit operations, but rather a convention that all 32-bit inputs
    must be sign-extended in a 64-bit register.

    Why not zero extend unSigned and sign extend Signed ?!?
    That way the value in the register is (IS) the value in the smaller
    container !!

    Also, why not extend this to both shorts and chars ?!?

    For C and C++ code, the standard dictates that all integer operations are
    done with "int" precision, unless some operand is larger than int, in which
    case they are done in that larger precision. So there's no real need for
    8-bit and 16-bit operations to be supported natively by the CPU--these
    operations are already done as ints. If you have a variable which is a
    byte, then assigning to that variable and then using it again will require
    a zero-extend,

    You could perform the operation at base-size (byte in this case).

    Languages like Ada are not defined like C.

    but honestly, this is not usually a performance path. It's likely to be stored to memory instead, so no masking or sign extending
    should be needed.

    If you pick ILP64 for your ABI, then you will get rid of almost all of
    these zero- and sign-extensions of 32-bit C and C++ code.

    Then, the only access to 32-bit integers is int32_t and uint32_t.

    It will just
    work. If you pick I32LP64, then you should have a full suite of 32-bit operations and 64-bit operations, at least for all add, subtract, and
    compare operations. And if you do I32LP64, your indexed addressing
    modes should have 3 types of indexed registers: 64-bit, 32-bit signed,
    and 32-bit unsigned. That worked well for ARM64.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 7 11:27:39 2025
    From Newsgroup: comp.arch

    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    int subroutine( int a, int b )
    {
    return a+b;
    }
    ...
    I tested this on AMD64, and did not find sign-extension in the caller,
    neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret
    ...
    AMD64 in hardware does 0 extension of 32-bit operations. From your
    example "lea (%rdi,%rsi,1),%eax" (AT&T notation, so %eax is the dest),
    the 64-bit register %rax will have 0's written into bits [63:32].
    So the AMD64 convention for 32-bit values in 64-bit registers is to
    zero-extend on writes. And to ignore the upper 32-bits on reads, so
    using a 64-bit register should use the %exx name.

    Interesting. At some point I got the impression that LEA produces a
    64-bit result, because it produces an address, but testing reveals
    that LEA has a 32-bit zero-extended variant indeed.

    I agree with you that I32LP64 was a mistake, but it exists, and I
    think ARM64 did a good job handling it. It has all integer operations
    working on two sizes: 32-bit and 64-bit, and when writing a 32-bit result,
    it 0-extends the register value.

    You don't want "garbage extend" since you want a predictable answer.

    Zero-extended for unsigned and sign-extended for int are certainly
    more forgiving when some function is called without a prototype and
    the actual type does not match the implied type (I once read about
    IIRC miranda prototypes, but a web search only gives me Star Trek
    stuff when I ask for that).

    Zero-extending for int is less forgiving. Apparently by 2003 (when
    AMD64 appeared) the use of prototypes was widespread enough that such
    a calling convention was acceptable.

    But once all the functions have correct prototypes, garbage-extension
    is just as workable as other alternatives.

    Your choices for writing 32-bit results in a 64-bit register are thus
    sign-extend (not a good choice) or zero-extend (what almost everyone chose).

    What makes you think that one is a better choice than the other?

    The most obvious choices to me are:

    Sign-extend int and zero-extend unsigned: That has the best chance at
    the expected behaviour when the prototype is missing and would be
    required.

    If you rely on prototypes being present, you can take any choice,
    including garbage-extension. Then you can use the full 64-bit
    operation in many cases, and only insert sign or zero extension when a conversion from 32-bit to 64 bit is needed (and that extension can be
    part of an instruction, as in ARM A64 addressing modes).

    As for what "almost everyone chose", here's some data:

    int            unsigned       ABI
    sign-extended  sign-extended  MIPS o64 and 64
    sign-extended  zero-extended  SPARC V9
    sign-extended  zero-extended  PowerPC64
    zero-extended  zero-extended  AMD64
    zero-extended  zero-extended  ARM A64
    sign-extended  sign-extended  RV64

    I determined this by looking at the code for

    unsigned usubroutine( unsigned a, unsigned b )
    {
      return a+b;
    }

    int isubroutine( int a, int b )
    {
      return a+b;
    }

    The code on various architectures (as compiled with gcc -O) is:

    MIPS64 (gcc -mabi=64 -O and gcc -mabi=o64 -O):
    0000000000000034 <usubroutine>:
    34: 03e00008 jr ra
    38: 00851021 addu v0,a0,a1

    000000000000003c <isubroutine>:
    3c: 03e00008 jr ra
    40: 00851021 addu v0,a0,a1

    SPARC V9:
    0000000000000018 <usubroutine>:
    18: 9d e3 bf 50 save %sp, -176, %sp
    1c: b0 06 00 19 add %i0, %i1, %i0
    20: 81 cf e0 08 return %i7 + 8
    24: 91 32 20 00 srl %o0, 0, %o0

    0000000000000028 <isubroutine>:
    28: 9d e3 bf 50 save %sp, -176, %sp
    2c: b0 06 00 19 add %i0, %i1, %i0
    30: 81 cf e0 08 return %i7 + 8
    34: 91 3a 20 00 sra %o0, 0, %o0

    PowerPC64:
    0000000000000030 <.usubroutine>:
    30: 7c 63 22 14 add r3,r3,r4
    34: 78 63 00 20 clrldi r3,r3,32
    38: 4e 80 00 20 blr
    ...

    0000000000000048 <.isubroutine>:
    48: 7c 63 22 14 add r3,r3,r4
    4c: 7c 63 07 b4 extsw r3,r3
    50: 4e 80 00 20 blr

    RISC-V is in another land, where they effectively have
    no 32-bit operations, but rather a convention that all 32-bit inputs
    must be sign-extended in a 64-bit register.

    RISC-V has a number of sign-extending 32-bit instructions, and a
    calling convention to go with it.

    There seem to be the following options:

    Have no 32-bit instructions, and insert sign-extension or
    zero-extension instructions where necessary (or implicitly in all
    operands, as I outlined earlier). SPARC V9 and PowerPC64 seem to take
    this approach.

    Have 32-bit instructions that sign-extend: MIPS64, Alpha, and RV64.

    Have 32-bit instructions that zero-extend: AMD64 and ARM A64.

    Have 32-bit instructions that sign-extend and 32-bit instructions that zero-extend. No architecture that does that is known to me. It would
    be a good match for the SPARC-V9 and PowerPC64 calling convention.

    There is also one instruction set (ARM A64) that has special 32-bit sign-extension and zero-extension forms for some operands.

    And you can then adapt the calling convention to match the instruction
    set. For "no 32-bit instructions", garbage-extension seems to be the
    cheapest approach to me, but I expect that when SPARC-V9 and PowerPC64
    came on the market, there was enough C code with missing prototypes
    around that they preferred a more forgiving calling convention.

    If you pick ILP64 for your ABI, then you will get rid of almost all of
    these zero- and sign-extensions of 32-bit C and C++ code. It will just
    work. If you pick I32LP64, then you should have a full suite of 32-bit
    operations and 64-bit operations, at least for all add, subtract, and
    compare operations.

    For compare, divide, shift-right and rotate, you either first need to sign/zero-extend the register, or you need 32-bit versions (possibly
    both signed and unsigned).
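    A small C sketch of why (hypothetical function; imagine the 32-bit value
    living garbage-extended in the low half of a 64-bit register):

        #include <stdint.h>

        /* With garbage in bits 63..32, a plain 64-bit shift would pull that
           garbage down into the result, so the value must be zero-extended
           first (or a 32-bit shift form must exist). */
        uint32_t shr4(uint64_t reg_with_u32_in_low_half)
        {
            uint64_t zext = (uint32_t)reg_with_u32_in_low_half; /* explicit extend */
            return (uint32_t)(zext >> 4);
        }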

    And if you do I32LP64, your indexed addressing
    modes should have 3 types of indexed registers: 64-bit, 32-bit signed,
    and 32-bit unsigned. That worked well for ARM64.

    It is certainly part of the way towards my idea of having sign- and zero-extended 32-bit operands for every operand of every instruction.

    It would be interesting to see how many sign-extensions and
    zero-extensions (whether explicit or implicitly part of the
    instruction) are executed in code that is generated from various C
    sources (with and without -fwrapv). I expect that it's highly
    dependent on the programming style. Sure there are types like pid_t
    where you have no choice, but in frequently occurring cases you can
    choose:

    for (i=0; i<n; i++) {
      ... a[i] ...
    }

    Here you can choose whether to define i as int, unsigned, long,
    unsigned long, size_t, etc. If you care for portability to 16-bit
    machines, size_t is a good idea here, otherwise long and unsigned long
    also are efficient. If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    If n is int, you can also choose int, and there is actually enough
    information here to make the code efficient (even with -fwrapv),
    because in this code int overflow really cannot happen, but in code
    that's not much different from this one (e.g., using != instead of <),
    -fwrapv will result in an inserted sign extension on AMD64, and not
    using -fwrapv may result in unintended behaviour thanks to the
    compiler assuming that int overflow does not happen.
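    A sketch of that contrast (hypothetical functions; the point is only that
    the != form gives the compiler less to work with under -fwrapv):

        long sum_lt(long *a, int n)
        {
            long r = 0;
            for (int i = 0; i < n; i++)   /* i provably stays in [0,n); the index
                                             can be widened once, before the loop */
                r += a[i];
            return r;
        }

        long sum_ne(long *a, int n)
        {
            long r = 0;
            for (int i = 0; i != n; i++)  /* with -fwrapv, i may legally wrap, so
                                             a sign extension of i may be needed
                                             on each iteration when indexing      */
                r += a[i];
            return r;
        }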

    ILP64 would have spared us all these considerations.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Oct 7 18:01:25 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    int subroutine( int a, int b )
    {
    return a+b;
    }
    ...
    I tested this on AMD64, and did not find sign-extension in the caller,
    neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret
    ...
    AMD64 in hardware does 0 extension of 32-bit operations. From your
    example "lea (%rdi,%rsi,1),%eax" (AT&T notation, so %eax is the dest),
    the 64-bit register %rax will have 0's written into bits [63:32].
    So the AMD64 convention for 32-bit values in 64-bit registers is to
    zero-extend on writes. And to ignore the upper 32-bits on reads, so
    using a 64-bit register should use the %exx name.

    Interesting. At some point I got the impression that LEA produces a
    64-bit result, because it produces an address, but testing reveals
    that LEA has a 32-bit zero-extended variant indeed.

    Architecturally, any store to a 32-bit register (%e_x) will
    clear the high-order bits of the 64-bit version of the
    register.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 7 18:34:45 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    int subroutine( int a, int b )
    {
    return a+b;
    }
    ------------------------------------------------------------

    RISC-V is in another land, where they effectively have
    no 32-bit operations, but rather a convention that all 32-bit inputs
    must be sign-extended in a 64-bit register.

    RISC-V has a number of sign-extending 32-bit instructions, and a
    calling convention to go with it.

    RISC-V has word sized integer arithmetic.

    There seem to be the following options:

    Have no 32-bit instructions, and insert sign-extension or
    zero-extension instructions where necessary (or implicitly in all
    operands, as I outlined earlier). SPARC V9 and PowerPC64 seem to take
    this approach.

    This was My 66000 between 2016 and two weeks ago.
    The cost is 4% growth in code footprint and similar perf degradation.

    Have 32-bit instructions that sign-extend: MIPS64, Alpha, and RV64.

    Have 32-bit instructions that zero-extend: AMD64 and ARM A64.

    Have 32-bit instructions that sign-extend and 32-bit instructions that zero-extend. No architecture that does that is known to me. It would
    be a good match for the SPARC-V9 and PowerPC64 calling convention.

    This is the starting point for My 66000 2.0:: integer arithmetic has
    size and signedness, with the property that all integer results have
    the 64-bit register <container> contain a range-limited result suitable
    to the base-type of the calculation {no garbage in HoBs}.

    There is also one instruction set (ARM A64) that has special 32-bit sign-extension and zero-extension forms for some operands.

    And you can then adapt the calling convention to match the instruction
    set. For "no 32-bit instructions", garbage-extension seems to be the cheapest approach to me, but I expect that when SPARC-V9 and PowerPC64
    came on the market, there was enough C code with missing prototypes
    around that they preferred a more forgiving calling convention.

    If you pick ILP64 for your ABI, then you will get rid of almost all of
    these zero- and sign-extensions of 32-bit C and C++ code. It will just
    work. If you pick I32LP64, then you should have a full suite of 32-bit
    operations and 64-bit operations, at least for all add, subtract, and
    compare operations.

    For compare, divide, shift-right and rotate, you either first need to sign/zero-extend the register, or you need 32-bit versions (possibly
    both signed and unsigned).

    My 66000 CMP is signless--it compares two integer registers and delivers
    a bit vector of all possible comparisons {2 equality, 4 signed, 4 unsigned,
    4 range checks, [and in FP land 10-bits are the class of the RS1 operand]}

    My 66000 SL, SR can be used in extract form--and here you need no operand preparation if you only extract meaningful bits.

    My 66000 2.0 DIV has a size component to the calculation.

    And if you do I32LP64, your indexed addressing
    modes should have 3 types of indexed registers: 64-bit, 32-bit signed,
    and 32-bit unsigned. That worked well for ARM64.

    It is certainly part of the way towards my idea of having sign- and zero-extended 32-bit operands for every operand of every instruction.

    Unnecessary if the integer calculation delivers properly range-limited
    64-bit results.

    It would be interesting to see how many sign-extensions and
    zero-extensions (whether explicit or implicitly part of the
    instruction) are executed in code that is generated from various C
    sources (with and without -fwrapv).

    In GNUPLOT it is just over 4% of instruction count for 64-bit-only
    integer calculations.

    I expect that it's highly
    dependent on the programming style. Sure there are types like pid_t
    where you have no choice, but in frequently occuring cases you can
    choose:

    for (i=0; i<n; i++) {
    ... a[i] ...
    }

    Here you can choose whether to define i as int, unsigned, long,
    unsigned long, size_t, etc. If you care for portability to 16-bit
    machines, size_t is a good idea here, otherwise long and unsigned long
    also are efficient.

    Counted for() loops are somewhat special in that it is quite easy to
    determine that the loop index never exceeds the range-limit of the
    container.

    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and PowerPC64 and Alpha).

    Example please !?!

    If n is int, you can also choose int, and there is actually enough information here to make the code efficient (even with -fwrapv),
    because in this code int overflow really cannot happen,

    Consider the case where n is int64_t or uint64_t !?!

    Consider the C-preprocessor with::
    # define int (short int) // !!
    in scope.

    but in code
    that's not much different from this one (e.g., using != instead of <), -fwrapv will result in an inserted sign extension on AMD64, and not
    using -fwrapv may result in unintended behaviour thanks to the
    compiler assuming that int overflow does not happen.

    ILP64 would have spared us all these considerations.

    Agreed. I32LP64 is an abomination, especially if one is bothering to
    try to keep the number of instructions down.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Oct 7 12:20:08 2025
    From Newsgroup: comp.arch

    On 10/3/2025 12:55 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/2/2025 7:50 PM, MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    I must be missing something. Suppose I have

    C := A + B

    where A and C are 16 bit signed integers and B is an 8 bit signed
    integer. As I understand what you are doing, loading B into a register
    will leave the high order 56 bits zero. But the add instruction will
    presumably be half word, so if B is negative, it will get an incorrect
    answer (because B is not sign extended to 16 bits).

    What am I missing?

    A is loaded as 16 bits, properly sign-extended to 64 bits: range [-32768..32767]
    B is loaded as 8 bits, properly sign-extended to 64 bits: range [-128..127]

    ADDSH Rc,Ra,Rb

    Adds 64-bit Ra and 64-bit Rb and then sign extends the result from bit<15>. The result is a properly signed 64-bit value: range [-32768..32767]
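    In C terms, a minimal sketch of the ADDSH behaviour described above (a
    hypothetical helper, not actual ISA code):

        #include <stdint.h>

        /* full 64-bit add, then sign-extend the sum from bit 15 */
        static int64_t addsh(int64_t ra, int64_t rb)
        {
            uint64_t low16 = ((uint64_t)ra + (uint64_t)rb) & 0xFFFF;
            return (int64_t)(low16 ^ 0x8000) - 0x8000;  /* range [-32768..32767] */
        }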

    First let me apologize, then admit my embarrassment. I didn't write
    what I intended to, and even if I did, it wouldn't have been correct.

    I had totally missed the issue of perhaps not extending the result of an arithmetic operation to the full register width. I must admit that this
    never came up in the programming I have done, and I never considered it.
    But subsequent posts in this thread have explained the issue well, and
    so I learned something. Thanks to all!
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 7 19:09:25 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ...
    My 66000 CMP is signless--it compares two integer registers and delivers
    a bit vector of all possible comparisons {2 equality, 4 signed, 4 unsigned,
    4 range checks, [and in FP land 10-bits are the class of the RS1 operand]}

    With an 88000-style compare and a result register of 64 bits, you can
    spend 14 bits on 64-bit comparison, 14 bits on 32-bit comparison, 14
    bits on 16-bit comparison, and 14 bits on 8-bit comparison, and still
    have 8 bits left. What is a "range check" and why does it take 4
    bits?

    It is certainly part of the way towards my idea of having sign- and
    zero-extended 32-bit operands for every operand of every instruction.

    Unnecessary if the integer calculation delivers properly range-limited
    64-bit results.

    Sign- or zero extension will still be necessary for things like

    long a=...
    int b=a;
    ... c[b];

    With the extension in the operands, you do not need any extension
    instructions, not even for division, right-shift etc.

    The question, however, is if the extensions occur often enough to
    merit such features. I lean towards the SPARC/PowerPC/My 66000-v1
    approach here.

    It would be interesting to see how many sign-extensions and
    zero-extensions (whether explicit or implicitly part of the
    instruction) are executed in code that is generated from various C
    sources (with and without -fwrapv).

    In GNUPLOT it is just over 4% of instruction count for 64-bit-only
    integer calculations.

    Now what if you had a calling convention with garbage-extension? A
    number of extensions in your examples would go away.

    Counted for() loops are somewhat special in that it is quite easy to
    determine that the loop index never exceeds the range-limit of the
    container.

    There have been enough cases where such reasoning led to "optimizing"
    code into an infinite loop and other fallout of adversarial compilers.

    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
      unsigned i;
      long r=0;
      for (i=l; i!=h; i++)
        r+=a[i];
      return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20
    c: 83f5 srli a5,a5,0x1d
    e: 97ba add a5,a5,a4
    10: 639c ld a5,0(a5)
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret




    If n is int, you can also choose int, and there is actually enough
    information here to make the code efficient (even with -fwrapv),
    because in this code int overflow really cannot happen,

    Consider the case where n is int64_t or uint64_t !?!

    Then the first condition does not hold on I32LP64.

    Consider the C-preprocessor with::
    # define int (short int) // !!
    in scope.

    Then the compiler will see short int, and generate code accordingly.
    What's your point?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 7 20:18:11 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ..
    My 66000 CMP is signless--it compares two integer registers and delivers
    a bit vector of all possible comparisons {2 equality, 4 signed, 4 unsigned,
    4 range checks, [and in FP land 10-bits are the class of the RS1 operand]}

    With an 88000-style compare and a result register of 64 bits, you can
    spend 14 bits on 64-bit comparison, 14 bits on 32-bit comparison, 14
    bits on 16-bit comparison, and 14 bits on 8-bit comparison, and still
    have 8 bits left. What is a "range check" and why does it take 4
    bits?

    CIN 0 <= Reg < Max
    FIN 0 < Reg <= Max
    RIN 0 < Reg < Max
    SIN 0 <= Reg <= Max
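    Read as predicates on a register value against a limit Max, a minimal C
    sketch (hypothetical helpers named after the mnemonics above; sin_ has a
    trailing underscore only to avoid the <math.h> name):

        #include <stdbool.h>
        #include <stdint.h>

        static bool cin (int64_t reg, int64_t max) { return 0 <= reg && reg <  max; }
        static bool fin (int64_t reg, int64_t max) { return 0 <  reg && reg <= max; }
        static bool rin (int64_t reg, int64_t max) { return 0 <  reg && reg <  max; }
        static bool sin_(int64_t reg, int64_t max) { return 0 <= reg && reg <= max; }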


    It is certainly part of the way towards my idea of having sign- and
    zero-extended 32-bit operands for every operand of every instruction.

    Unnecessary if the integer calculation delivers properly range-limited
    64-bit results.

    Sign- or zero extension will still be necessary for things like

    long a=...
    int b=a;
    .. c[b];

    The movement of long to int will 'smash' out extraneous significance.
    As written: b has range [-2G..+2G] and the register holding b's value
    will too.

    The important property is that registers contain 64-bits and the value
    in the register is range-limited to the calculated (or LDed) result.

    With the extension in the operands, you do not need any extension instructions, not even for division, right-shift etc.

    The question, however, is if the extensions occur often enough to
    merit such features. I lean towards the SPARC/PowerPC/My 66000-v1
    approach here.

    I did too, until <many> conversations with an LLVM compiler writer.
    GNUPLOT seems to be a banner application wrt range-limited
    calculations.

    It would be interesting to see how many sign-extensions and
    zero-extensions (whether explicit or implicitly part of the
    instruction) are executed in code that is generated from various C
    sources (with and without -fwrapv).

    In GNUPLOT it is just over 4% of instruction count for 64-bit-only
    integer calculations.

    Now what if you had a calling convention with garbage-extension? A
    number of extensions in your examples would go away.

    Not many; few are on the ABI, and most of the ones that are get dealt with
    when moving arguments to preserved registers. So, you could send HoBs
    that are never observed, since the MOV Rpreserved,Rargument gets changed
    into a SR[AL] Rpreserved,Rargument<32:0> at no space or time cost.

    Counted for() loops are somewhat special in that it is quite easy to
    determine that the loop index never exceeds the range-limit of the
    container.

    There have been enough cases where such reasoning led to "optimizing"
    code into an infinite loop and other fallout of adversarial compilers.

    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
      unsigned i;   // <--- this variable should be uint64_t
      long r=0;
      for (i=l; i!=h; i++)
        r+=a[i];
      return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20 // eliminate HoBs
    c: 83f5 srli a5,a5,0x1d // does not have scaled indexing
    e: 97ba add a5,a5,a4 // does not have indexing
    10: 639c ld a5,0(a5) // all that work
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5 // loop induction
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret

    foo:
    MOV R4,#0
    MOV R5,#1
    VEC R7,{}
    LDD R6,[R1,R5<<3]
    ADD R4,R4,R6
    LOOP2 NE,R5,#1,R3
    MOV R1,R4
    RET



    If n is int, you can also choose int, and there is actually enough
    information here to make the code efficient (even with -fwrapv),
    because in this code int overflow really cannot happen,

    Consider the case where n is int64_t or uint64_t !?!

    Then the first condition does not hold on I32LP64.

    Consider the C-preprocessor with::
    # define int (short int) // !!
    in scope.

    Then the compiler will see short int, and generate code accordingly.
    What's your point?

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Wed Oct 8 20:41:21 2025
    From Newsgroup: comp.arch

    In article <2025Oct7.210925@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ...
    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
    unsigned i;
    long r=0;
    for (i=l; i!=h; i++)
    r+=a[i];
    return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20
    c: 83f5 srli a5,a5,0x1d
    e: 97ba add a5,a5,a4
    10: 639c ld a5,0(a5)
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret

    Unsigned 32-bit stuff on RISC-V has a habit of blowing up with lots of
    overhead instructions. Change the loop condition to "i < h", and you get
    on godbolt.org with -O2 -march=rv64g

    foo(long*, unsigned int, unsigned int):
    mv a5,a0
    bgeu a1,a2,.L4
    addiw a4,a2,-1
    subw a4,a4,a1
    slli a4,a4,32
    slli a1,a1,32
    srli a1,a1,32
    srli a4,a4,32
    add a4,a4,a1
    addi a3,a0,8
    slli a4,a4,3
    slli a1,a1,3
    li a0,0
    add a5,a5,a1
    add a4,a4,a3
    .L3:
    ld a3,0(a5)
    addi a5,a5,8
    add a0,a0,a3
    bne a5,a4,.L3
    ret
    .L4:
    li a0,0
    ret

    This does get better with "-march=rv64g_zba", but Zba isn't part of RV64G.

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    In many cases, the sign-extension works well (BGEU on 64-bit registers
    that are 32-bit sign-extended, works as it would if the values were 0-extended). But mixing true 64-bit unsigned with 32-bit unsigned
    requires fixup instructions. And the lack of a ZEXT.W in the basic
    64-bit instruction set was a mistake. RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.
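    As a quick sanity check of the BGEU point above, a minimal self-contained
    C sketch (hypothetical, exhaustive only over a few sample values):

        #include <assert.h>
        #include <stdint.h>

        /* Sign-extend a 32-bit value the way the RV64 ABI mandates. */
        static uint64_t sext32(uint32_t v)
        {
            return (uint64_t)((int64_t)(v ^ 0x80000000u) - (int64_t)0x80000000);
        }

        int main(void)
        {
            uint32_t s[] = { 0, 1, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu };
            for (int i = 0; i < 5; i++)
                for (int j = 0; j < 5; j++)
                    /* unsigned 64-bit compare of sign-extended values agrees
                       with the unsigned 32-bit compare */
                    assert((s[i] < s[j]) == (sext32(s[i]) < sext32(s[j])));
            return 0;
        }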

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 8 22:58:53 2025
    From Newsgroup: comp.arch

    On 10/8/2025 3:41 PM, Kent Dickey wrote:
    In article <2025Oct7.210925@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ...
    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
    unsigned i;
    long r=0;
    for (i=l; i!=h; i++)
    r+=a[i];
    return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20
    c: 83f5 srli a5,a5,0x1d
    e: 97ba add a5,a5,a4
    10: 639c ld a5,0(a5)
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret

    Unsigned 32-bit stuff on RISC-V has a habit of blowing up with lots of overhead instructions. Change the loop condition to "i < h", and you get
    on godbolt.org with -O2 -march=rv64g

    foo(long*, unsigned int, unsigned int):
    mv a5,a0
    bgeu a1,a2,.L4
    addiw a4,a2,-1
    subw a4,a4,a1
    slli a4,a4,32
    slli a1,a1,32
    srli a1,a1,32
    srli a4,a4,32
    add a4,a4,a1
    addi a3,a0,8
    slli a4,a4,3
    slli a1,a1,3
    li a0,0
    add a5,a5,a1
    add a4,a4,a3
    .L3:
    ld a3,0(a5)
    addi a5,a5,8
    add a0,a0,a3
    bne a5,a4,.L3
    ret
    .L4:
    li a0,0
    ret

    This does get better with "-march=rv64g_zba", but Zba isn't part of RV64G.

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    In many cases, the sign-extension works well (BGEU on 64-bit registers
    that are 32-bit sign-extended, works as it would if the values were 0-extended). But mixing true 64-bit unsigned with 32-bit unsigned
    requires fixup instructions. And the lack of a ZEXT.W in the basic
    64-bit instruction set was a mistake. RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.


    Had they not dropped ADDWU and SUBWU from BitManip, and instead done the
    sensible thing of using zero-extended "unsigned int", much of this mess
    would go away...


    Sign-extending "unsigned int" is almost the worst possible option (even
    within the limits of plain RV64G). Sign extension makes "a+b" slightly cheaper, but everything else gets worse. It is, ironically, better to
    just pay the up-front cost of zero extension for add/subtract (and maybe
    throw up a middle finger to the ABI spec on this one).


    Well, then again, it seems there are multiple versions of the ABI spec floating around on the internet, seemingly with differences as to the
    exact handling of passing/returning structures, etc. So, I don't
    personally put too much weight into worrying about there being a minor mismatch here.

    Where:
      Some versions appear to be using SysV-AMD64 style struct rules;
        With structs being returned by on-stack copy.
      Some versions using the register, register-pair, or by-reference;
        With structs returned in X10, X11:X10,
        or by passing a return pointer as a hidden argument.
        This also being what BGBCC uses;
      ...

    Then, differences between LP64 and LP64D:
    LP64: All F registers are Scratch;
    LP64D: Some of the F registers are Preserved.


    Well, and there are bigger concerns on the ABI front (the ABI used by
    BGBCC not being strictly 1:1 with the standard ABI, but close enough
    that most cases will work):
    Basic case is LP64 argument passing with LP64D's register rules.

    Then an XG3 ABI variant (can also be used for RV64G) which defines there
    as being 16-argument registers and reassigns 4 of the F registers from
    scratch to preserved (to bring the balance slightly closer to an even
    split).

    So:
    X: 4 SPR, 16 Scratch, 12 Preserved
    F: 16 Scratch, 16 Preserved (Vs 20+12)
    So: 32 Scratch + 28 Preserved
    Vs: 36 Scratch + 24 Preserved



    Kent

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 9 05:28:56 2025
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> schrieb:

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    You mean 0xffffffff as unsigned has to be passed as
    0xffffffffffffffff ? Somebody was not thinking that one through...

    At least Loongarch gets that one right; unsigned and signed are
    zero- and sign-extended, respectively.


    [...]

    RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.

    Seems like it...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 9 01:13:54 2025
    From Newsgroup: comp.arch

    On 10/9/2025 12:28 AM, Thomas Koenig wrote:
    Kent Dickey <kegs@provalid.com> schrieb:

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    You mean 0xffffffff as unsigned has to be passed as
    0xffffffffffffffff ? Somebody was not thinking that one through...


    Yes, and it is real stupid...


    At least Loongarch gets that one right; unsigned and signed are
    zero- and sign-extended, respectively.


    Meanwhile, in RISC-V land, they are like, "You know, UInt is sign
    extended but we need 0 extension." and rather than do something sane,
    like zero-extend UInt...

    Well, first the B extension adds some ".UW" instructions:
    ADD.UW, SH1ADD.UW, SH2ADD.UW, SH3ADD.UW, SLLI.UW

    Which have the amazing behavior of zero-extending on the input side.


    And then more extensions come along, and add more ".UW" instructions...
    How many did the indexed Zilx/Zisx proposal add?... 19.

    Me: FFS.


    I could almost just ignore it, except if implementing a CPU core that
    might want to support these extensions, I also end up needing to waste
    FPGA resources to support this stuff.



    [...]

    RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.

    Seems like it...

    Yeah, some of the 32-bit instructions only handle signed variants.

    Also, the number of instructions you need for zero-extension on output is
    far smaller than the number you need for zero-extension on input.

    How many do you need?: ADDWU, SUBWU.
    Or, 2 instructions.
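    A minimal C sketch of the zero-extend-on-output semantics being asked for
    here (assuming the ADDWU/SUBWU behaviour as drafted: the low 32 bits of
    the result, zero-extended into the 64-bit register):

        #include <stdint.h>

        static uint64_t addwu(uint64_t rs1, uint64_t rs2) { return (uint32_t)(rs1 + rs2); }
        static uint64_t subwu(uint64_t rs1, uint64_t rs2) { return (uint32_t)(rs1 - rs2); }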


    Even without these, zero-extending UInt for operations that might go out
    of range is still less bad than dealing with the mess left by
    sign-extended UInt.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Oct 9 07:17:43 2025
    From Newsgroup: comp.arch

    On Fri, 03 Oct 2025 02:50:23 +0000, MitchAlsup wrote:

    And after bragging to Quadribloc about its stability--it reached the
    point where it was time to switch to version 2.0.

    I really don't think you have anything to be ashamed of.

    Getting new ideas that are capable of radically improving your ISA and modifying it to make use of them is an entirely appropriate thing to do.
    And your ideas are ones which fit into your philosophy - it's not as if
    you found yourself doing something wrong where _my_ way was better.

    That would be a valid occasion for "eating crow", but that isn't what
    happened.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Oct 9 07:32:44 2025
    From Newsgroup: comp.arch

    Also, looking at the specific area which you discussed in your post...

    I do this stuff the way most other computers do it, I think; the
    conventional way that is exemplified by the IBM System/360. I hadn't
    really even thought the matter through to see if there was another, better way: it seemed obvious that this way made sense and worked.

    So multiplying two 8-bit integers produces a 16-bit result... or,
    optionally, perhaps just an 8-bit result, since while the high bits may sometimes be needed for multi-precision arithmetic, usually they're just
    extra bother.

    And this applies to every other size of integer. Loading an integer into a 64-bit register always produces a 64-bit result, though, because registers don't shrink. So there's load (sign extension), unsigned load (clear the
    high bits), and insert (don't touch bits higher than the data coming in).

    Since fixed point data is usually considered to be _integers_ rather than numbers between 0 and 1, fixed-point data is right aligned. Maybe
    fractional fixed-point is a cheaper substitute for floating-point in some applications, and so I should add the extra option of left-aligned loads
    and stores - and fixed-point arithmetic instructions that behave more like floating-point instructions. Since I haven't encountered that feature very much on computers - the multiply instructions on some minis, however, suggested that they viewed fixed-point data as left-aligned fractional -
    I've assumed it's too esoteric a feature to support, but I could be wrong.

    Particularly in comparison to all the other esoteric features I plan to support, this one could be genuinely useful. But it could also be
    something that nobody uses because floating-point is safer to use for the
    kind of problems that fractional fixed-point is applicable to; no constant worry about overflows.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 07:07:11 2025
    From Newsgroup: comp.arch

    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct7.210925@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ...
    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
    unsigned i;
    long r=0;
    for (i=l; i!=h; i++)
    r+=a[i];
    return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20
    c: 83f5 srli a5,a5,0x1d
    e: 97ba add a5,a5,a4
    10: 639c ld a5,0(a5)
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret

    Unsigned 32-bit stuff on RISC-V has a habit of blowing up with lots of
    overhead instructions. Change the loop condition to "i < h", and you get
    on godbolt.org with -O2 -march=rv64g

    foo(long*, unsigned int, unsigned int):
    mv a5,a0
    bgeu a1,a2,.L4
    addiw a4,a2,-1
    subw a4,a4,a1
    slli a4,a4,32
    slli a1,a1,32
    srli a1,a1,32
    srli a4,a4,32
    add a4,a4,a1
    addi a3,a0,8
    slli a4,a4,3
    slli a1,a1,3
    li a0,0
    add a5,a5,a1
    add a4,a4,a3
    .L3:
    ld a3,0(a5)
    addi a5,a5,8
    add a0,a0,a3
    bne a5,a4,.L3
    ret
    .L4:
    li a0,0
    ret

    Yes, in many cases the compiler manages to pull the zero extension out
    of the loop or eliminate it completely, and it took me three tries to
    find a loop where this does not happen; and given that I actually
    intended to find such a case and made my changes accordingly, I expect
    that the occurrences in practice are rarer than 1 in 3. Nevertheless,
    you don't want your hot loop to fall into this trap, so better use
    size_t rather than unsigned for your loop counter.

    And the lack of a ZEXT.W in the basic
    64-bit instruction set was a mistake.

    The slli and srli instructions above are compressible to 16 bits, so
    it looks to me that the RISC-V designers knew that zero extension is
    going to be somewhat frequent, and wanted to make the instructions for
    that cheap (I don't expect that other uses of srli are frequent enough
    to merit inclusion in the compressed instructions; for slli the use in addressing is probably more frequent than its use in zero extension).
    But they either thought that making the more general slli and srli
    instructions cheap was good enough (and also benefits other cases), or
    they were too reluctant to add another instruction.

    But given that they added a number of sign-extending 32-bit
    instructions, such a reluctance certainly did not exist when they did
    that; the combination of slli and srai performs a sign extension all
    right, no such instructions are strictly necessary, and certainly not
    all of them: addw reg,x0,reg also performs a sign extension. Just
    having a sign-extending and a zero-extending add could have been
    another option.

    RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.

    No instruction set has a full suite of 32-bit instructions with both sign-extended and zero-extended results. They all have workarounds
    for the lack of the full suite, with various associated costs that
    have to be balanced against the costs of providing the full suite.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 08:22:31 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Kent Dickey <kegs@provalid.com> schrieb:

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    You mean 0xffffffff as unsigned has to be passed as
    0xffffffffffffffff ? Somebody was not thinking that one through...

    I am sure they did think that one through. The manual says:

    |The compiler and calling convention maintain an invariant that all
    |32-bit values are held in a sign-extended format in 64-bit
    |registers. Even 32-bit unsigned integers extend bit 31 into bits 63
    |through 32. Consequently, conversion between unsigned and signed
    |32-bit integers is a no-op, as is conversion from a signed 32-bit
    |integer to a signed 64-bit integer. Existing 64-bit wide SLTU and
    |unsigned branch compares still operate correctly on unsigned 32-bit
    |integers under this invariant. Similarly, existing 64-bit wide logical
    |operations on 32-bit sign-extended integers preserve the
    |sign-extension property. A few new instructions (ADD[I]W/SUBW/SxxW)
    |are required for addition and shifts to ensure reasonable performance
    |for 32-bit values.

    What I find more interesting is that MIPS apparently made the same
    choice. In the early 90s a lot of code was around which did not
    provide prototypes, code that worked on 32-bit systems because there
    was no difference between int and long, and between unsigned int and
    unsigned long. I expect such code to have a better chance to work as
    intended in an I32LP64 setting if unsigned is zero-extended (and the
    choices for SPARC and PowerPC are along these lines). So why did the
    MIPS people go that way?

    One other interesting thing is how various architectures define the
    upper 32 bits of existing instructions when they extend to 64 bits.
    Let's consider addition:

    MIPS-IV: addu performs sign-extended 32-bit addition (and the instruction
    is called "Add Unsigned Word":-); they added daddu for 64-bit
    addition. They undefined the result of addu if the inputs were not sign-extended 32-bit numbers to make their lack of competence more
    obvious.

    SPARC and PowerPC: The addition instructions perform 64-bit addition.
    No extra 32-bit variant was added.

    AMD64 is actually a new instruction set incompatible with IA-32, but
    given that AMD64 is so close to IA-32 that their decoder is shared on
    all implementations I know of, I include it here: AMD64 defines the
    existing instructions as producing the same result as on IA-32 (that
    includes instructions like shift-right and division where the upper
    bits of a 64-bit operation would play a role), with the upper 32 bits
    being zero, and adds 64-bit variants.

    ARM A64 is a completely new 64-bit instruction set. There have been
    efforts for an ILP32 ABI, but I don't think that this instruction set
    was designed for that.

    RISC-V was designed for both 32-bit and 64-bit settings. The add
    instruction performs the full 64-bit addition in 64-bit
    implementations. The 64-bit extension adds addw (along with addiw
    slliw srliw sraiw sllw srlw subw sraw), which sign-extends the lower
    32 bits of its result. addw produces a defined result for all inputs.
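    In C terms, a sketch of what addw computes (every input pattern gives a
    defined result, whatever the upper 32 bits held):

        #include <stdint.h>

        static int64_t addw(uint64_t rs1, uint64_t rs2)
        {
            uint64_t low32 = (rs1 + rs2) & 0xFFFFFFFFu;
            return (int64_t)(low32 ^ 0x80000000u) - (int64_t)0x80000000;  /* sign-extend from bit 31 */
        }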

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 10:39:05 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    It is certainly part of the way towards my idea of having sign- and
    zero-extended 32-bit operands for every operand of every instruction.

    Unnecessary if the integer calculation delivers properly range-limited
    64-bit results.

    Sign- or zero extension will still be necessary for things like

    long a=...
    int b=a;
    .. c[b];

    The movement of long to int will 'smash' out extraneous significance.

    I.e., you have an extra instruction for that purpose.

    Actually, the example is incomplete. Let's make it more complete:

    long x=..., y=...;
    long a=x-y;
    int b=a;
    long d=c[b];
    long e=a*3;
    live: d,e; dead: a,b

    Let's call the 64-bit registers x..., with alternative names i... and
    u..., where i... are sign-extended from the low-order 32-bits as
    source operands, and a 32-bit result sign-extended to 64 bits is
    stored into x.../i... when i... is a destination. Likewise for u... and zero-extension. With that, if you only have instructions that allow
    all variants as destinations, but have no choice on the source side,
    the code looks as follows:

    sub xa=xx,xy
    mov/sext ib=ia
    load xd=(xd+xb*8)
    mul xe=xa,3

    If you have choice on the source side, you can implement this as:

    sub xa=xx,xy
    load xd=(xd+ia*8)
    mul xe=xa,3

    i.e., you can eliminate the sign-extension instruction. And if you
    arrange your conventions such that the consumer of a value is
    responsible for sign/zero extension (e.g., with a garbage-extending
    calling convention), you can use x... as destination (no
    i.../u... needed there), and do not need to execute any separate
    sign/zero-extension instructions (whether you call them SEXT/ZEXT or
    MOV).

    However, I doubt that this benefit is worth the price. Nevertheless,
    ARM A64 has addressing modes that correspond to (xd+ia*1/2/4/8) and (xd+ua*1/2/4/8).

    I did too, until <many> conversations with LLVM compiler writer.
    GNUPLOT seems to be a banner application wrt range-limited calcu-
    lations.

    But will a several percent lower instruction count on GNUPLOT sell
    many MY66000s?

    Now what if you had a calling convention with garbage-extension? A
    number of extensions in your examples would go away.

    Not many, few are on ABI and most of the ones that are are dealt with
    when moving arguments to preserved registers.

    That sounds like something that is done on the callee side. And doing
    the sign/zero extension on the callee side is what one would do if the convention is garbage-extension.

    But yes, with either convention you are able to combine the
    sign/zero-extension with a mov that would otherwise have been
    necessary anyway.

    So, you could send HoBs
    that are never observed since the MOV Rpreserved,Rargument gets changed
    into a SR[AL] Rpreserved,Rargument<32:0> at no space or time cost.

    Concerning time cost, many microarchitectures nowadays have zero-cycle
    MOVs (the MOVs are performed by the register renamer). It is possible
    to extend this to also do zero-cycle sign- and zero-extensions
    (resulting in i... and u... registers in the microarchitecture), but
    that certainly has a cost in design and implementation complexity, and
    I am not sure that this is justified by the performance advantages of
    this hardware optimization.

    If the hardware has zero-cycle moves, but not zero-cycle
    sign/zero-extensions, then there is a time cost.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Oct 9 13:51:52 2025
    From Newsgroup: comp.arch

    On Sat, 04 Oct 2025 10:17:41 +0000, Anton Ertl wrote:

    If the calling convention guarantees that ints are zero-extended (sounds perverse, but RV64 has the guarantee that unsigned is passed in
    sign-extended form, which is equally perverse), then the compiler must
    use instructions that produce a zero-extended result (e.g., AMD64's
    addl). If the calling convention only requires and guarantees the
    low-order 32 bits (I call this garbage-extended), then the compiler can
    use instructions that perform 64-bit adds; this is what we are seeing
    above.

    The other side of the medal is what is needed at the caller: If the
    caller needs to convert a sign-extended int into a long, it does not
    have to do anything.

    I find this just a bit confusing.

    Obviously, regular signed integer values should be sign extended.

    But, equally, _unsigned_ integer values should be zero extended for
    precisely the same reason, so that the longer value, as an unsigned
    integer, has the same value without doing anything.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 9 15:40:20 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct7.210925@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:<snip>

    No instruction set has a full suite of 32-bit instructions with both sign-extended and zero-extended results. They all have workarounds
    for the lack of the full suite, with various associated costs that
    have to be balanced against the costs of providing the full suite.

    Once I get a workable solution to converts, My 66000 will have
    {Sign}|u{Size}.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 9 15:42:40 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Fri, 03 Oct 2025 02:50:23 +0000, MitchAlsup wrote:

    And after bragging to Quadribloc about its stability--it reached the
    point where it was time to switch to version 2.0.

    I really don't think you have anything to be ashamed of.

    What I am ashamed of is "breast beating" when you started to announce
    Concertina III while I was still on My 66000 1.0--then shortly afterwards
    having to jump to 2.0 with the breast beating still fresh in my mind.

    Getting new ideas that are capable of radically improving your ISA and modifying it to make use of them is an entirely appropriate thing to do.
    And your ideas are ones which fit into your philosophy - it's not as if
    you found yourself doing something wrong where _my_ way was better.

    That would be a valid occasion for "eating crow", but that isn't what happened.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 15:37:37 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *).

    And lots of others, e.g., those that assumed that longs are 4 bytes in
    size.

    This was
    not true on the PDP-11,

    Can you elaborate on this? What do you think is sizeof(int) on a
    PDP-11, and what do you think is sizeof(char *) on a PDP-11?

    and it was a standards violation, anyway.

    That's hilarious. C89 was three years old in 1992. The majority of C
    programs available in 1992 were started before ANSI C was released,
    and thus contained code from before ANSI C. And like today,
    programmers are asked to spend time on other things than fixing things
    that are not broken.

    And I am sure that C
    programs were much more relevant for selling these machines than
    FORTRAN programs.

    Based on what data?

    Based on 4 months of internship at HP in 1988 and 1989, in a group
    that did sales support, tech support, and courses on HP 9000
    workstations and servers and HP/UX (the OS of the HP 9000 machines).
    I don't remember hearing about a customer that used FORTRAN.

    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    The problem with 64-bit INTEGERs for Fortran is that they make REAL
    unusable for lots of existing code.

    The size of FORTRAN INTEGERs is something the FORTRAN people have to
    decide, and I made no statement on that.

    If FORTRAN programs make the assumptions that sizeof(int)==4, maybe
    you should tell the FORTRAN programmers something along these lines:
    "it is a standards violation, anyway. Only people who like to play
    these kind of games are caught."

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 16:04:50 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    I didn't port a lot of code from the 11 to other machines, but my
    recollection is that the widespread assumption in Berkeley Vax code that
    location zero was addressable and contained binary zeros was much more
    painful to fix than size issues.

    Sure, lots of things are more painful to fix, but Thomas Koenig's
    claim was that if the 64-bit machines would not run FORTRAN code "as
    is", nobody would buy them.

    Concerning pain, I found that in Gforth (which contains C code and
    Forth code) we had many more portability bugs in the C code than in
    the Forth code, where we had almost no portability bugs.

    That's because Forth has only two integer types: cell (a machine word)
    and double cell (two machine words); and if you use one instead of the
    other, the code fails, whatever the cell size is.

    By contrast, in the C code we have to deal with a large number of
    integer types (not just int, long, etc., but also, e.g., off_t), with
    the relations between the types being different on different
    platforms, or, in the case of off_t, also depending on #defines. On one
    machine some function parameter was a long or whatever, on a different
    one it was a bla_t or whatever. Of course, these days one might
    target only Linux and MacOS and reach >99% of desktops and servers
    (the result runs on Windows through WSL2), but that solves the problem
    by reducing the portability requirements.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 9 17:04:43 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    John Levine <johnl@taugh.com> writes:
    I didn't port a lot of code from the 11 to other machines, but my
    recollection is that the widespread assumption in Berkeley Vax code that
    location zero was addressable and contained binary zeros was much more
    painful to fix than size issues.

    Sure, lots of things are more painful to fix, but Thomas Koenig's
    claim was that if the 64-bit machines would not run FORTRAN code "as
    is", nobody would buy them.

    That is a misrepresentation (not that I'm surprised).

    My argument was that this would remove a sizable enough market share
    that nobody would risk that.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 9 18:19:40 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *).

    And lots of others, e.g., those that assumed that longs are 4 bytes in
    size.

    This was
    not true on the PDP-11,

    Can you elaborate on this?

    That was a mistake, as others have pointed out.

    Based on 4 months of internship at HP in 1988 and 1989, in a group
    that did sales support, tech support, and courses on HP 9000
    workstations and servers and HP/UX (the OS of the HP 9000 machines).
    I don't remember hearing about a customer that used FORTRAN.

    *shrug* Oh well, that is very scientific evidence, statistically
    proven.

    Counterpoint: On the University workstations I worked on, Fortran
    was very much in use. People wrote code to run on IBM mainframes
    and ported this to the HP workstations. Plus, there were vector
    computers where REAL also was 32 bits.

    Whose anecdotal evidence counts more?

    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    News flash: Engineers rarely use Usenet (I'm a bit of an exception
    there).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Oct 9 22:48:14 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    News flash: Engineers rarely use Usenet (I'm a bit of an exception
    there).

    That depends strongly on when and where you are talking about:

    Back when the Internet (Arpanet) got its first node outside of the US,
    it was in Norway in 1973, but our universities did not get a link until
    1983.

    When I started working for Norsk Hydro in 1984, they had already set up a
    64-kbit/s line from our Bergen office to the university there, and the
    reason it was installed was that we had engineers who needed Usenet access.

    Personally I've been on Usenet for close to 40 years. It could have been
    a bit more, but I did not use Usenet at NTH in Trondheim and I did not
    set up a news reader immediately when I started in Hydro, with
    responsibility for all IBM PC compatibles worldwide.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 9 21:08:25 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    News flash: Engineers rarely use Usenet (I'm a bit of an exception
    there).

    That depends strongly on when and where you are talking about:

    Back when the Internet (Arpanet) got its first node outside of the US,
    it was in Norway in 1973, but our universities did not get a link until 1983.

    When I started working for Norsk Hydro in 1984, they had already setup a 64-kbit/s line from our Bergen office to the university there, and the reason it was installed was that we had engineers who needed Usenet access.

    Personally I've been on Usenet for close to 40 years. It could have been
    a bit more but I did not use Usenet at NTH in Trondheim and I did not
    setup a news reader immediately when I started in Hydro, with
    responsibility for all IBM PC compatibles worldwide.

    I started using USENET when it reached European universities in the
    very early 1990s, so maybe 35 years.

    But my personal observation, from the people I knew, was that users
    were mostly computer scientists, with some mathematicians thrown in.

    Now, of course, USENET is dying off fast; few news servers are left,
    and those are also being switched off (for example news.individual.net).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Thu Oct 9 16:30:38 2025
    From Newsgroup: comp.arch

    On 10/6/25 8:38 PM, Kent Dickey wrote:
    [SNIP]
    For C and C++ code, the standard dictates that all integer operations are
    done with "int" precision, unless some operand is larger than int, in which
    case it is done in that precision. So there's no real need for 8-bit and
    16-bit operations to be supported natively by the CPU--these operations are
    actually done
    as int's already. If you have a variable which is a byte, then assigning
    to that variable, and then using that variable again you will need to zero-extend, but honestly, this is not usually a performance path. It's likely to be stored to memory instead, so no masking or sign extending
    should be needed.

    [SNIP]
    Kent

    Can you point me to the section in "the standard" which indicates
    'all integer operations are done with "int" precision'?

    What if the wording was changed to:
    'all integer operations are done with _at least_ "int" precision',
    e.g. one could use long. Would that break conforming code?

    Brian
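
    The wording Kent is paraphrasing is usually taken to be the "integer
    promotions" and "usual arithmetic conversions" (6.3.1.1 and 6.3.1.8 in
    C11); a small example of their observable effect, assuming the usual
    8-bit unsigned char and 32-bit int:

        #include <stdio.h>

        int main(void)
        {
            unsigned char a = 200, b = 100;
            /* a and b are promoted to int before the +, so the sum is 300,
               not (200 + 100) mod 256 = 44; only the assignment back to an
               8-bit object reduces it modulo 256. */
            int wide = a + b;
            unsigned char narrow = a + b;
            printf("%d %d\n", wide, narrow);   /* prints "300 44" */
            return 0;
        }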




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 9 21:54:21 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    I didn't port a lot of code from the 11 to other machines, but my
    recollection is that the widespread assumption in Berkeley Vax code that
    location zero was addressable and contained binary zeros was much more
    painful to fix than size issues.

    Sure, lots of things are more painful to fix, but Thomas Koenig's
    claim was that if the 64-bit machines would not run FORTRAN code "as
    is", nobody would buy them.

    Concerning pain, I found that in Gforth (which contains C code and
    Forth code) we had many more portability bugs in the C code than in
    the Forth code, where we had almost no portability bugs.

    C, itself, would be a "lot less painful" if C only had 2 integer types,
    1-word and 2-words. But, instead, the typical 2^(n+3)-bit machines have
    8 integer types {Signed, unSigned}|u{Byte, Half, Word, DBLE}, and then,
    to make it as bad as possible, there are a myriad of types {ptrdiff_t,
    size_t, off_t, ...} that change {Sign}|u{Size} on an architecture basis.

    That's because Forth has only two integer types: cell (a machine word)
    and double cell (two machine words); and if you use one instead of the
    other, the code fails, whatever the cell size is.

    Same as <old> FORTRAN.

    By contrast, in the C code we have to deal with a large number of
    integer types (not just int, long, etc., but also, e.g., off_t), with
    the relations between the types being different on different
    platforms, or, in the case of off_t, also depending on #defines. On one
    machine some function parameter was a long or whatever, on a different
    one it was a bla_t or whatever. Of course, these days one might
    target only Linux and MacOS and reach >99% of desktops and servers
    (the result runs on Windows through WSL2), but that solves the problem
    ^only
    by reducing the portability requirements.

    Blame goes to:: ISO/IEC 9899:1999 for trying to accommodate everyone
    and ending up screwing everyone.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 9 22:24:01 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *).

    And lots of others, e.g., those that assumed that longs are 4 bytes in
    size.

    This was
    not true on the PDP-11,

    Can you elaborate on this? What do you think is sizeof(int) on a
    PDP-11, and what do you think is sizeof(char *) on a PDP-11?

    sizeof int == 2
    sizeof char * == 2

    and it was a standards violation, anyway.

    That's hilarious. C89 was three years old in 1992. The majority of C programs available in 1992 were started before ANSI C was released,
    and thus contained code from before ANSI C. And like today,
    programmers are asked to spend time on other things than fixing things
    that are not broken.

    If application vendors were subject to the same recall standards that
    the auto industry is subject to, that might change. {Remember the Pinto}

    And I am sure that C
    programs were much more relevant for selling these machines than
    FORTRAN programs.

    Based on what data?

    Based on 4 months of internship at HP in 1988 and 1989, in a group
    that did sales support, tech support, and courses on HP 9000
    workstations and servers and HP/UX (the OS of the HP 9000 machines).
    I don't remember hearing about a customer that used FORTRAN.

    C got the OS and compilers up and running, then the people who
    bought the machine ran applications they could compile from their
    source.

    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    Most FEM, Optics, CFD and larger scale engineering applications are
    all written in FORTRAN with C-front ends shuffling data/commands
    back and forth. {Spice, Layout, Design Rule Checking, GDSII, ...}

    C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    The problem with 64-bit INTEGERs for Fortran is that they make REAL
    unusable for lots of existing code.

    Nonsense::

    CDC only had Double Precision FP data (60-bit)
    with 18-bit integers
    CRAY only had Double Precision FP data (64-bit)
    with 24-bit integers

    {{Even numerical analysts liked Seymour's 60-bit and 64-bit arithmetic
    compared to 32-bit IBM and 36-bit Univac FP arithmetic--even with those
    littered with huge mistakes we would not allow today.}}

    The size of FORTRAN INTEGERs is something the FORTRAN people have to
    decide, and I made no statement on that.

    If FORTRAN programs make the assumptions that sizeof(int)==4, maybe
    you should tell the FORTRAN programmers something along these lines:
    "it is a standards violation, anyway. Only people who like to play
    these kind of games are caught."

    FORTRAN programmers think of an integer as 1 storage container--even on
    CDC and CRAY. The integer in memory is 60 or 64 bits, the integer in a
    register is 18 or 24 bits. FORTRAN programmers do not have problems
    with putting six 6-bit characters in a PDP-10 memory container, or ten
    6-bit field-data characters in one CDC memory container.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 9 22:26:33 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 10/3/2025 11:40 AM, EricP wrote:

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.


    As for FP8:
    There are multiple formats in use:
    S.E3.M4: Bias=7 (Quats / Unit Vectors)
    S.E3.M4: Bias=8 (Audio)
    S.E4.M3: Bias=7 (NN's)
    E4.M4: Bias=7 (HDR images)

    It's not just the memory formats, it's also the operations.
    In FP8 few may want to waste 1/8th of the encode space on NaN's.
    Maybe not sticky infinity, rather saturate at max but not stick there.
    Maybe no negative zero.

    All of those encodings might be reallocated to values more useful
    for that application.

    They won't want to calculate single-argument transcendentals like tan(x);
    they will use 256-byte lookup tables.
    The multi-operand functions ADD, SUB, MUL would be faster in hardware
    than 64kB lookup tables.

    Also a lot of these are used in matrix ops - super-duper-SIMD.
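
    A hedged sketch of what a software decoder for one of the FP8 flavours
    listed above might look like (S.E3.M4, bias 7, and assuming the
    no-NaN/no-Inf/no-subnormal treatment described here; real FP8
    definitions differ, which is rather the point):

        #include <math.h>
        #include <stdint.h>

        /* Every encoding is treated as an ordinary value with an implicit
           leading 1; a 256-entry table built from this is the "lookup
           table" approach mentioned above. */
        static float fp8_s_e3_m4_to_float(uint8_t v)
        {
            int   sign = (v >> 7) & 1;
            int   exp  = (v >> 4) & 7;     /* 3-bit exponent field */
            int   man  =  v       & 15;    /* 4-bit mantissa field */
            float f    = ldexpf(1.0f + man / 16.0f, exp - 7);
            return sign ? -f : f;
        }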


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 9 22:45:36 2025
    From Newsgroup: comp.arch

    Brian G. Lucas wrote:
    On 10/6/25 8:38 PM, Kent Dickey wrote:
    [SNIP]
    For C and C++ code, the standard dictates that all integer operations are
    done with "int" precision, unless some operand is larger than int, and
    then
    do it in that precision. So there's no real need for 8-bit and 16-bit
    operations to be natively by the CPU--these operations are actually done
    as int's already. If you have a variable which is a byte, then assigning
    to that variable, and then using that variable again you will need to
    zero-extend, but honestly, this is not usually a performance path. It's
    likely to be stored to memory instead, so no masking or sign extending
    should be needed.

    [SNIP]
    Kent

    Can you point me to the section in "the standard" which indicates
    'all integer operations are done with "int" precision'?

    What if the wording was changed to:
    'all integer operations are done with _at least_ "int" precision',
    e.g. one could use long. Would that break conforming code?

    Brian

    I was wondering this myself. The down-cast rule to a smaller size
    appears to be in the C11 standard:
    6.3.1.3(3) implementation-defined, or an implementation-defined signal
    is raised, if out of range.

    "
    6.3 Conversions
    6.3.1.3 Signed and unsigned integers

    1 When a value with integer type is converted to another integer type
    other than _Bool, if the value can be represented by the new type,
    it is unchanged.

    2 Otherwise, if the new type is unsigned, the value is converted by
    repeatedly adding or subtracting one more than the maximum value that
    can be represented in the new type until the value is in the range of
    the new type.60)

    [EricP: this is the same as zero extend]

    3 Otherwise, the new type is signed and the value cannot be
    represented in it; either the result is implementation-defined or
    an implementation-defined signal is raised.
    "


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 10 07:07:03 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    6.3 Conversions
    6.3.1.3 Signed and unsigned integers

    1 When a value with integer type is converted to another integer type
    other than _Bool, if the value can be represented by the new type,
    it is unchanged.

    2 Otherwise, if the new type is unsigned, the value is converted by
    repeatedly adding or subtracting one more than the maximum value that
    can be represented in the new type until the value is in the range of
    the new type.60)

    [EricP: this is the same as zero extend]

    If the new type is larger, then case 1 is the only relevant one. For
    signed integers and twos-complement representation, that is sign
    extension. For unsigned integers, it's zero extension.

    Case 2 can only happen if the new type is smaller than the old type.
    In that case no extension happens, and what is described is modulo
    equivalence. It could be described as a modulo operation, but isn't.
    Maybe this description was originally intended to also cover signed
    numbers (where the modulo operation would not be appropriate), but
    later case 3 was added.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 10 07:31:16 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    and it was a standards violation, anyway.

    That's hilarious. C89 was three years old in 1992. The majority of C
    programs available in 1992 were started before ANSI C was released,
    and thus contained code from before ANSI C. And like today,
    programmers are asked to spend time on other things than fixing things
    that are not broken.

    If application vendors were subject to the same recall standards that
    the auto industry is subject to, that might change. {Remember the Pinto}

    I had not heard about "the Pinto" before, so I cannot remember it.
    Searching for it, it seems that you mean the Ford Pinto which had fuel
    system fires.

    I don't think that a program that works as intended but does not
    comply to a later-introduced standard is in the same position.

    Actually, the regulations for cars only hold for newly sold cars. All
    the other cars can be as unsafe and poison the air as badly as when
    they were introduced, and the Diesel emissions scandal (VW and many
    other car makers) shows that they are actually allowed to poison the
    air even more; at least in Austria none of the cars have been recalled
    that produce more emissions than was allowed when the cars were sold.

    Even for aircraft it is apparently enough to comply with the
    regulations valid at the time of certification of the aircraft, with
    fatal consequences for Chalk's Ocean Airways Flight 101. <https://en.wikipedia.org/wiki/Chalk%27s_Ocean_Airways_Flight_101#Age_of_fleet>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Oct 10 10:50:41 2025
    From Newsgroup: comp.arch

    In article <WhQEQ.117554$7Ika.12025@fx17.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    [...]
    <snip>

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64.

    The vast majority of C/C++ programs ran just fine on I32LP64. There
    were some that didn't, but it was certainly not "most".

    Yeah, but I remember the switchover to 64-bit pretty well. Most
    programs ran ok, but there were quite a few that punned int for
    a native word and assumed it was interchangeable with a pointer,
    and it took a very long time to get all of that cruft cleaned
    up. That transition was pretty painful.
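
    The classic not-64-bit-clean pattern being described: code that works on
    ILP32, where int and pointers are both 32 bits, but silently truncates
    the pointer on I32LP64 (buf and the casts here are only illustrative):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            char buf[16];
            int p   = (int)(intptr_t)&buf[0];  /* old code wrote (int)&buf[0] */
            char *q = (char *)(intptr_t)p;     /* ...and recovered it later   */
            /* On I32LP64, q need not equal &buf[0] if the address does not
               fit in 32 bits.                                                */
            printf("%p %p\n", (void *)&buf[0], (void *)q);
            return 0;
        }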

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 10 12:04:52 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Concerning pain, I found that in Gforth (which contains C code and
    Forth code) we had many more portability bugs in the C code than in
    the Forth code, where we had almost no portability bugs.

    C, itself, would be a "lot less painful" if C only had 2 integer types,
    1-word and 2-words. But, instead, the typical 2^(n+3)-bit machines have
    8 integer types {Signed, unSigned}|u{Byte, Half, Word, DBLE}, and then,
    to make it as bad as possible, there are a myriad of types {ptrdiff_t,
    size_t, off_t, ...} that change {Sign}|u{Size} on an architecture basis.

    Actually, ptrdiff_t might be seen as the signed word-size integer type
    and size_t as the unsigned one. That's somewhat what you are asking for.

    Concerning off_t, if C had the single-word and two-word type, one
    could have used the two-word type instead of off_t from the start,
    avoiding the pain of _FILE_OFFSET_BITS etc.

    Concerning signedness: Forth also supports signed and unsigned cells
    and double-cells. This does not cause portability problems, because
    the signedness of a value does not change between platforms.
    Signedness bugs are easy to miss, however.

    That's because Forth has only two integer types: cell (a machine word)
    and double cell (two machine words); and if you use one instead of the
    other, the code fails, whatever the cell size is.

    Same as <old> FORTRAN.

    According to the information discussed here recently, FORTRAN uses the
    same approach on byte-addressed machines as Java: 32-bit INTEGERs,
    32-bit REALs, 64-bit DOUBLEs. No word-sized INTEGERs in FORTRAN.

    BTW, in Forth the FP sizes are not related to integer sizes; this does
    not cause portability problems in my experience, but I have
    experienced FP-related portability problems, typically coming from the assumption that an FP value consumes a power-of-two number of bytes in
    memory (there are systems with 10-byte floats).

    By contrast, in the C code we have to deal with a large number of
    integer types (not just int, long, etc., but also, e.g., off_t), with
    the relations between the types being different on different
    platforms, or, in the case of off_t, also depending on #defines. On one
    machine some function parameter was a long or whatever, on a different
    one it was a bla_t or whatever. Of course, these days one might
    target only Linux and MacOS and reach >99% of desktops and servers
    (the result runs on Windows through WSL2), but that solves the problem
    ^only
    by reducing the portability requirements.

    Blame goes to:: ISO/IEC 9899:1999 for trying to accommodate everyone
    and ending up screwing everyone.

    I don't think that blaming anyone is useful. One can, however, think
    about what contributed to the portability problems and what
    alternative approaches would have avoided them.

    The machine-word-oriented B proved insufficient for the byte-addressed
    PDP-11, so Ritchie added types and C was born. There was int (the
    machine word) and char (the byte). Because in B p+1 means the next
    machine word after p, and Ritchie wanted to preserve this, C also has
    typed pointers: int * and char *. long was added because int is
    occasionally too small on the PDP-11.

    One way to avoid the portability problems would have been to define
    int and pointers to be machine words and long to be two machine
    words. In this scenario, as long as machine-internal data is
    accessed, there would not be portability problems: pid_t, uid_t,
    etc. would all be ints. There would be problems when exchanging data
    with other machines. E.g., a file system probably wants
    architecture-independent data, and would spend, say, 32 bits on the
    uid. But at least these issues would be limited to the code that
    accesses these file systems (at least if the programmer isolates these accesses).

    But C did not go there, and instead made long 32 bits long on both
    16-bit machines and on 32-bit machines, with the result that lseek(),
    which produced and consumed a long, could only deal with 2GB files.
    Good enough at the start, but limiting later, so at some point off_t
    and the whole _FILE_OFFSET_BITS mess had to be introduced.
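
    For what it's worth, the workaround as it ended up looking on
    glibc-style systems (a sketch; the macro must appear before any system
    header, and "big.dat" is just a placeholder):

        #define _FILE_OFFSET_BITS 64   /* make off_t, lseek, ... 64-bit even
                                          on a 32-bit platform              */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("big.dat", O_RDONLY);
            if (fd < 0)
                return 1;
            off_t end = lseek(fd, 0, SEEK_END);   /* works past 2 GB */
            printf("size: %lld bytes\n", (long long)end);
            close(fd);
            return 0;
        }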

    Another way to avoid the portability problems would have been to go
    for special-purpose types like off_t from the start and make all
    integer types incompatible, i.e., require explicit instead of implicit conversion between them. That (along with appropriate teaching
    material) would make it clear that conversion should be avoided where
    possible, which in turn would reduce the dependencies on relations
    between type sizes. However, going full-bore in this direction when
    coming from B was probably incompatible with Ritchie's apparent goal
    of using B code with as few changes as possible.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 10 19:01:59 2025
    From Newsgroup: comp.arch

    On 10/9/2025 9:26 PM, EricP wrote:
    BGB wrote:
    On 10/3/2025 11:40 AM, EricP wrote:

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.


    As for FP8:
    There are multiple formats in use:
        S.E3.M4: Bias=7 (Quats / Unit Vectors)
        S.E3.M4: Bias=8 (Audio)
        S.E4.M3: Bias=7 (NN's)
        E4.M4: Bias=7   (HDR images)

    It's not just the memory formats, it's also the operations.
    In FP8 few may want to waste 1/8th of the encode space on NaN's.
    Maybe not sticky infinity, rather saturate at max but not stick there.
    Maybe no negative zero.

    All of those encodings might be reallocated to values more useful
    for that application.


    In my uses, the 8-bit formats lacked Inf/NaN or subnormals.
    To what extent NaN existed, it was encoded as -0.


    They won't want to calculate single argument transcendentals like tan(x), they will use 256 byte lookup tables.
    The multi-operand functions ADD, SUB, MUL, would be faster in hardware
    than 64kB lookup tables.


    Different ways exist.

    In many cases, directly performing computations on FP8 was insufficient,
    so generally Binary16 or similar was used as the intermediate working
    format.


    Also a lot of these are used in matrix ops - super-duper-SIMD.


    Yeah.

    Though, in a 3D model format of mine from not too long ago, I was using
    a mix of FP8 and Joint-Exponent formats.

    So, there are a lot of possible use cases.



    X/Y/Z coords: Joint exponent.
    3x 9-bit, denormalized, 5 bit shared exponent.
    Unpacked to 3x Binary16.
    Similar to the RGB9_E5 format,
    except the values were also sign-extended.
    S/T coords: joint exponent (sorta).
    The scheme for texture coords is more convoluted.
    Normal: 3x FP8A

    There were skeletal animations, with poses also mostly stored as FP8A.
    I tested a few options and noted that storing each rotation quaternion
    as FP8A was the most accurate option.

    While, taken individually, the average absolute error of an FP8A value
    was worse than that of an 8-bit byte (where -127..127 maps to -1.0..1.0),
    after normalization the FP8A vector was more accurate. This could be
    further improved by jittering the scaling and rounding slightly and
    looking for the vector that produced a result closer to the original
    after normalization.
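
    A rough sketch of the quantize-then-renormalize step (without the
    jittered search), assuming "FP8A" is the S.E3.M4, bias-8 flavour from
    the earlier list, with no Inf/NaN/subnormals and, in this sketch, no
    exact zero; BGB's actual format may differ:

        #include <math.h>

        static unsigned char fp8a_encode(float x)
        {
            unsigned s = x < 0.0f;
            int e;
            float m = frexpf(fabsf(x), &e);   /* |x| = m * 2^e, 0.5 <= m < 1 */
            int exp = e - 1 + 8;              /* stored exponent, bias 8     */
            if (x == 0.0f || exp < 0)
                return (unsigned char)(s << 7);        /* smallest magnitude */
            if (exp > 7) { exp = 7; m = 1.0f - 1.0f/32; }        /* saturate  */
            int man = (int)((m * 2.0f - 1.0f) * 16.0f + 0.5f);   /* round     */
            if (man > 15) {                   /* mantissa overflowed: carry   */
                man = 0;
                if (++exp > 7) { exp = 7; man = 15; }
            }
            return (unsigned char)((s << 7) | (exp << 4) | man);
        }

        static float fp8a_decode(unsigned char v)
        {
            int s = (v >> 7) & 1, e = (v >> 4) & 7, m = v & 15;
            float f = ldexpf(1.0f + m / 16.0f, e - 8);
            return s ? -f : f;
        }

        /* Quantize a quaternion to 4 bytes, then renormalize on decode. */
        static void quat_roundtrip(const float q[4], float out[4])
        {
            float len = 0.0f;
            for (int i = 0; i < 4; i++) out[i] = fp8a_decode(fp8a_encode(q[i]));
            for (int i = 0; i < 4; i++) len += out[i] * out[i];
            len = sqrtf(len);
            if (len > 0.0f)
                for (int i = 0; i < 4; i++) out[i] /= len;
        }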


    Note that for texture-mapped models, it would encode S/T coords and
    infer the normal from the geometry. Base RGB would be assumed white, and vertex color is calculated based on pose and normal (engine didn't use positional dynamic lights, so would calculate colors assuming that light
    comes down from overhead).

    For shaded/untextured meshes, a vertex normal would be stored instead
    (again used to calculate vertex RGB). Texture name would encode the RGB
    base color for the mesh (as "#rrggbb").

    In a more advanced engine, one might use the normal vectors directly and calculate shading based on color + normal + light-source. But, this is
    slower and typically requires per-light rendering passes, etc.

    Or, basically a lot of the stuff that makes Doom3 slow (or, like, a game
    that came out in 2003, but it took until 2015 before "mere mortal"
    computers could run it at decent framerates...).



    The encoding of vertex coords here differed from the Quake engine
    family, which typically encoded them as 3x BYTE relative to a
    bounding-box; except for Quake3 which went to 16-bit values (still
    relative to a bounding-box).


    Though on a PC, there is the downside that normally OpenGL only really
    allows Binary32 or small integer values mapped to unit-range for vertex
    arrays (it would be useful, say, to be able to use HALF or one of the
    joint-exponent formats here; seemingly OpenGL decided a lot of these are
    only for HDR color data or similar, and not for vertex coords...).


    For TKRA-GL, it allows some more compact formats for vertex arrays.

    Partly relevant as even at arguably fairly low triangle counts,
    unpacking animation frames to vertex arrays can still eat a lot of RAM
    (and you don't want to burn CPU time recalculating the model vertices
    every time it is redrawn).

    Though, there is still the option of using GL_BYTE or similar and
    scaling via the transformation matrix.
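
    As a sketch of that option with standard calls (the attribute index,
    stride, and scale factor are made up here, and a GL 2.0+ function loader
    is assumed):

        #include <GL/gl.h>

        void bind_byte_positions(const void *verts, float bbox_half_extent)
        {
            /* 3 signed bytes per vertex, not normalized: the shader sees
               -128..127 and the model matrix carries the scale, e.g.
               bbox_half_extent / 127, back to model units.              */
            glVertexAttribPointer(0, 3, GL_BYTE, GL_FALSE, 4, verts);
            glEnableVertexAttribArray(0);
            (void)bbox_half_extent;   /* folded into the matrix elsewhere */
        }

    With normalized = GL_TRUE the bytes would instead arrive as -1..1 and
    the matrix scale would change accordingly.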


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Oct 11 10:01:44 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    The problem with 64-bit INTEGERs for Fortran is that they make REAL
    unusable for lots of existing code.

    Nonsense::

    CDC only had Double Precision FP data (60-bit)
    with 18-bit integers
    CRAY only had Double Precision FP data (64-bit)
    with 24-bit integers

    By the time the 64-bit workstations appeared, vector computers
    were very much on the way out, and people had long since gotten
    used to 32-bit reals and 64-bit floating points. Making a 64-bit
    integer would have required a 128-bit double precision, and you
    know how popular that is - only one vendor has it, and there it
    is more of an add-on to their decimal float unit, and hence
    very slow (but still faster than software emulation).


    {{Even numerical analysts liked Seymour's 60-bit and 64-bit arithmetic compared to 32-bit IBM and 36-bit Univac FP arithmetic--even with those littered with huge mistakes we would not allow today.}}

    The size of FORTRAN INTEGERs is something the FORTRAN people have to
    decide, and I made no statement on that.

    If FORTRAN programs make the assumptions that sizeof(int)==4, maybe
    you should tell the FORTRAN programmers something along these lines:
    "it is a standards violation, anyway. Only people who like to play
    these kind of games are caught."

    FORTRAN programmers think of integer as 1 storage container--even on
    CDC and CRAY.

    That's what the standard says.

    The integer in memory is 60 or 64 bits, the integer in a
    register is 18 or 24 bits. FORTRAN programmers do not have problems
    with putting six 6-bit characters in a PDP-10 memory container, or ten
    6-bit field-data characters in one CDC memory container.

    Most of them have learned to use CHARACTER by now, it's only been
    47 years :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2