• Time to eat Crow

    From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Oct 3 02:50:23 2025
    From Newsgroup: comp.arch


    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

Well, it's time to eat crow.
--------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
in both Signed and unSigned flavors. These support both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

ISA 2.0 extends that same uniformity to calculation instructions,
both Integer and Floating Point, and to a few other miscellaneous
instructions (not so easily classified).

    In all cases, an integer calculation produces a 64-bit value
range limited to that of the {Sign}×{Size}--no garbage bits
in the high parts of the registers--the register accurately
represents the calculation as specified {Sign}×{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

Conversions between integer and floating point are now also
governed by {Size}, so one can convert FP64 directly
into {unSigned}×{Int16}--more fully supporting strongly typed
languages.
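
As a concrete illustration of what the {Size}-governed convert buys at
the source level (a minimal sketch; the single-instruction claim is the
one made above, and the function name is mine):

#include <stdint.h>

/* Under ISA 2.0 the cast below can map to one FP64 -> uint16 convert,
   already range-limited; on a conventional 64-bit ISA it typically
   needs a convert plus a separate truncation/zero-extension "smash". */
uint16_t quantize(double x)
{
    return (uint16_t)x;
}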
    --------------------------------------------------------------
    Integer instructions are now::
{Signed and unSigned}×{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}
Although I am still oscillating over whether to support FP8 or FP128.

With this rearrangement of bits in the instruction formats, I
was able to get all Constant and routing control bits in the
same place and format in all {1, 2, and 3}-Operand instructions
uniformly. This simplifies <trifling> the Decoder, but more
importantly, the Operand delivery (and/or reception) mechanism.

    I was also able to compress the 7 extended operation formats
    into a single extended operation format. The instruction
    format now looks like:

    inst<31:26> Major OpCode
    inst<20:16> {Rd, Cnd field}
    inst<25:21> {SRC1, Rbase}
    inst<15:10> {SH width, else, {I,d,Sign,Size}}
    inst< 9: 6> {Minor OpCode, SRC3}
    inst< 4: 0> {offset,SRC2,Rindex,1-OP|u}
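
For illustration only, a small C sketch of pulling out the uniformly
positioned fields listed above (helper names are mine; only the field
boundaries that are not in question are used here -- the SRC3/Minor
OpCode split is revisited downthread):

#include <stdint.h>

/* Extract bit field inst<hi:lo> from a 32-bit instruction word. */
static inline uint32_t field(uint32_t inst, int hi, int lo)
{
    return (inst >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

static inline uint32_t major_opcode(uint32_t inst) { return field(inst, 31, 26); }
static inline uint32_t rd_or_cnd   (uint32_t inst) { return field(inst, 20, 16); }
static inline uint32_t src1_rbase  (uint32_t inst) { return field(inst, 25, 21); }
static inline uint32_t modifiers   (uint32_t inst) { return field(inst, 15, 10); } /* {I,d,Sign,Size} */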

    So there is 1 uniformly positioned field of Minor OpCodes,
    and one uniformly interpreted field of Operand Modifiers.
Operand Modifiers apply register routing and constant insertion
to XOP Instructions.
--------------------------------------------------------------
    So, what does this buy the Instruction Set ??

A) All integer calculations are performed at the size and
type of the result as required by the high level language::
{Signed and unSigned}×{Byte, HalfWord, Word, DoubleWord}.
This gets rid of all smash instructions across all data
types. {smash == {sext, zext, ((x<<2^n)>>2^n), ...}}
(A short C rendering of a 'smash' appears after this list.)

    B) I actually gained 1 more extended OpCode for future expansion.

    C) assembler/disassembler was simplified

D) and while I did not add any new 'instructions', I made those
already present more uniform, more supportive of the requirements
of higher level languages (like Ada), and better suited to the
stricter typing LLVM applies compared to GCC.

    In some ways I 'doubled' the instruction count while not adding
a single instruction {spelling or field-pattern} to ISA.
--------------------------------------------------------------
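For readers who have not met the 'smash' idiom named in A), a minimal
C rendering of what such an instruction does (helper names are mine):

#include <stdint.h>

/* A "smash" merely re-imposes the value range of a narrow type on a
   64-bit register, e.g. ((x << 48) >> 48) with an arithmetic shift. */
static inline int64_t smash_s32(int64_t x) { return (int32_t)x;  }  /* sign-extend from bit 31 */
static inline int64_t smash_u16(int64_t x) { return (uint16_t)x; }  /* zero-extend from bit 15 */

/* ISA 2.0 makes these redundant: every sized operation already
   delivers its result in this range-limited form. */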
    The elimination of 'smashes' shrinks the instruction count of
    GNUPLOT by 4%--maybe a bit more once we sort out all of the
compiler patterns it needs to recognize.
--------------------------------------------------------------
I wonder if crow tastes good in shepherd's pie ?!?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Oct 3 03:17:16 2025
    From Newsgroup: comp.arch

    On 2025-10-02 10:50 p.m., MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.
    --------------------------------------------------------------
    Integer instructions are now::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}
    Although I am oscillating whether to support FP8 or FP128.

For my arch, I decided to support FP128, thinking that FP8 could be
implemented with lookup tables, given that eight-bit floats tend to vary
in composition. Of course, I like more precision.
    Could it be a build option? Or a bit in a control register to flip
    between FP8 and FP128?

    With this rearrangement of bit in the instruction formats, I
    was able to get all Constant and routing control bits in the
    same place and format in all {1, 2, and 3}-Operand instructions
    uniformly. This simplifies <trifling> the Decoder, but more
    importantly; the Operand delivery (and/or reception) mechanism.

    I was also able to compress the 7 extended operation formats
    into a single extended operation format. The instruction
    format now looks like:

    inst<31:26> Major OpCode
    inst<20:16> {Rd, Cnd field}
    inst<25:21> {SRC1, Rbase}
    inst<15:10> {SH width, else, {I,d,Sign,Size}}
    inst< 9: 6> {Minor OpCode, SRC3}
    inst< 4: 0> {offset,SRC2,Rindex,1-OP|u}

    Only four bits for SRC3?

    So there is 1 uniformly positioned field of Minor OpCodes,
    and one uniformly interpreted field of Operand Modifiers.
    Operand Modifiers applies routing registers and inserting
    of constants to XOP Instructions. --------------------------------------------------------------
    So, what does this buy the Instruction Set ??

    A) All integer calculations are performed at the size and
    type of the result as required by the high level language::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}.
    This, gets rid of all smash instructions across all data
    types. {smash == {sext, zext, ((x<<2^n)>>2^n), ...}

    B) I actually gained 1 more extended OpCode for future expansion.

    C) assembler/disassembler was simplified

    D) and while I did not add any new 'instructions' I made those
    already present more uniform and supporting of the requirements
    of higher level languages (like ADA) and more suitable to the
    stricter typing LLVM provides over GCC.

    In some ways I 'doubled' the instruction count while not adding
    a single instruction {spelling or field-pattern} to ISA. --------------------------------------------------------------
    The elimination of 'smashes' shrinks the instruction count of
    GNUPLOT by 4%--maybe a bit more once we sort out all of the
    compiler patterns it needs to recognize. --------------------------------------------------------------
    I wonder if crow tastes good in shepard's pie ?!?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Oct 3 15:33:36 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-10-02 10:50 p.m., MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.
    --------------------------------------------------------------
    Integer instructions are now::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}
    Although I am oscillating whether to support FP8 or FP128.

    For my arch, I decided to support FP128 thinking that FP8 could be implemented with lookup tables, given that eight bit floats tend to vary
    in composition. Of course, I like more precision.
    Could it be a build option? Or a bit in a control register to flip
    between FP8 and FP128?

    With this rearrangement of bit in the instruction formats, I
    was able to get all Constant and routing control bits in the
    same place and format in all {1, 2, and 3}-Operand instructions
    uniformly. This simplifies <trifling> the Decoder, but more
    importantly; the Operand delivery (and/or reception) mechanism.

    I was also able to compress the 7 extended operation formats
    into a single extended operation format. The instruction
    format now looks like:

    inst<31:26> Major OpCode
    inst<20:16> {Rd, Cnd field}
    inst<25:21> {SRC1, Rbase}
    inst<15:10> {SH width, else, {I,d,Sign,Size}}
    inst< 9: 6> {Minor OpCode, SRC3}
    inst< 4: 0> {offset,SRC2,Rindex,1-OP|u}

    Only four bits for SRC3?
No, there are 5 bits--inst<9:5>--whoops.

    So there is 1 uniformly positioned field of Minor OpCodes,
    and one uniformly interpreted field of Operand Modifiers.
    Operand Modifiers applies routing registers and inserting
    of constants to XOP Instructions. --------------------------------------------------------------
    So, what does this buy the Instruction Set ??

    A) All integer calculations are performed at the size and
    type of the result as required by the high level language::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}.
    This, gets rid of all smash instructions across all data
    types. {smash == {sext, zext, ((x<<2^n)>>2^n), ...}

    B) I actually gained 1 more extended OpCode for future expansion.

    C) assembler/disassembler was simplified

    D) and while I did not add any new 'instructions' I made those
    already present more uniform and supporting of the requirements
    of higher level languages (like ADA) and more suitable to the
    stricter typing LLVM provides over GCC.

    In some ways I 'doubled' the instruction count while not adding
    a single instruction {spelling or field-pattern} to ISA. --------------------------------------------------------------
    The elimination of 'smashes' shrinks the instruction count of
    GNUPLOT by 4%--maybe a bit more once we sort out all of the
    compiler patterns it needs to recognize. --------------------------------------------------------------
    I wonder if crow tastes good in shepard's pie ?!?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Oct 3 12:40:17 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    Why? Compilers do not have any problem with this
as it's been handled by overload resolution since forever.

It's people who have the problems following type changes, and most
    compilers will warn of mixed type operations for exactly that reason.

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.

    Strongly typed languages don't natively support mixed type operations.
    They come with a set of predefined operations for specific types that
    produce specific results.

    If YOU want operators/functions that allow mixed types then they force
    you to define your own functions to perform your specific operations,
    and it forces you to deal with the consequences of your type mixing.

    All this does is force YOU, the programmer, to be explicit in your
    definition and not depend on invisible compiler specific interpretations.

If you want to support Uns8 * Int8 then it forces you, the programmer,
to deal with the fact that this produces a signed 16-bit result
in the range 255*-128..255*127 = -32640..32385.
    Now if you want to convert that result bit pattern to Uns8 by truncating
    it to the lower 8 bits, or worse treat the result as Int8 and take
    whatever random value falls in bit [7] as the sign, then that's on you.
    They just force you to be explicit what you are doing.
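
A brute-force check of that product range (a throwaway sketch,
exhaustive over both operand ranges):

#include <stdio.h>

/* Exhaustively compute the range of Uns8 * Int8 products. */
int main(void)
{
    int min = 0, max = 0;
    for (int u = 0; u <= 255; u++) {
        for (int s = -128; s <= 127; s++) {
            int p = u * s;
            if (p < min) min = p;
            if (p > max) max = p;
        }
    }
    printf("range [%d .. %d]\n", min, max);  /* [-32640 .. 32385]; fits in 16-bit signed */
    return 0;
}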

    --------------------------------------------------------------
    Integer instructions are now::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
Strongly typed languages don't have predefined operators that allow mixing.
Weakly typed languages deal with this in overload resolution and by having
predefined invisible type conversions in those operators, and then using
the normal single-type arithmetic instructions.

    Although I am oscillating whether to support FP8 or FP128.

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.

    The issue with FP128 seems associated with scaling on LD and ST
    because now scaling is 1,2,4,8,16 which adds 1 bit to the scale field.
    And in the case of a combined int-float register file deciding whether
    to expand all registers to 128 bits, or use 64-bit register pairs.
    Using 128-bit registers raises the question of 128-bit integer support,
    and using register pairs opens a whole new category of pair instructions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Oct 3 10:55:46 2025
    From Newsgroup: comp.arch

    On 10/2/2025 7:50 PM, MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    I must be missing something. Suppose I have

    C := A + B

    where A and C are 16 bit signed integers and B is an 8 bit signed
    integer. As I understand what you are doing, loading B into a register
    will leave the high order 56 bits zero. But the add instruction will presumably be half word, so if B is negative, it will get an incorrect
    answer (because B is not sign extended to 16 bits).

    What am I missing?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Fri Oct 3 15:25:25 2025
    From Newsgroup: comp.arch

    --------------------------------------------------------------
Integer instructions are now:: {Signed and unSigned}×{Byte, HalfWord,
Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
    Strong typed languages don't have predefined operators that allow mixing.

    Not sure who's confused, but my reading of the above is not some sort of "mixing": I believe Mitch is just saying that his addition operation
    (for example) can be specified to operate on either one of int8, uint8,
    int16, uint16, ...
    But that specification applies to all inputs and outputs of the
    instruction, so it does not support adding an int8 to an int32, or other "mixes".


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Oct 3 19:55:00 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/2/2025 7:50 PM, MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    I must be missing something. Suppose I have

    C := A + B

    where A and C are 16 bit signed integers and B is an 8 bit signed
    integer. As I understand what you are doing, loading B into a register
    will leave the high order 56 bits zero. But the add instruction will presumably be half word, so if B is negative, it will get an incorrect answer (because B is not sign extended to 16 bits).

    What am I missing?

A is loaded as 16 bits, properly sign-extended to 64 bits: range [-32768..32767]
B is loaded as 8 bits, properly sign-extended to 64 bits: range [-128..127]

    ADDSH Rc,Ra,Rb

    Adds 64-bit Ra and 64-bit Rb and then sign extends the result from bit<15>. The result is a properly signed 64-bit value: range [-32768..32767]
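
A C model of what that description says ADDSH delivers (the function
name is only for illustration):

#include <stdint.h>

/* Model of ADDSH Rc,Ra,Rb: add the full 64-bit registers, then
   sign-extend the sum from bit<15>, so Rc again holds a properly
   signed 64-bit value in [-32768..32767]. */
static inline int64_t addsh(int64_t ra, int64_t rb)
{
    return (int16_t)(ra + rb);
}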


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Oct 3 20:47:08 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    Why? Compilers do not have any problem with this
    as its been handled by overload resolution since forever.

    A non-My66000 example:

    int add (int a, int b)
    {
    return a + b;
    }

is translated on powerpc64le-unknown-linux-gnu (with -O3) to

    add 3,3,4
    extsw 3,3
    blr

extsw fills the 32 high-order bits with copies of the sign bit, because
numbers returned in registers have to be correct, either as 32- or 64-bit values.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Oct 3 21:04:16 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    --------------------------------------------------------------
Integer instructions are now:: {Signed and unSigned}×{Byte, HalfWord,
Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
    Strong typed languages don't have predefined operators that allow mixing.

    Not sure who's confused, but my reading of the above is not some sort of "mixing": I believe Mitch is just saying that his addition operation
    (for example) can be specified to operate on either one of int8, uint8, int16, uint16, ...
    But that specification applies to all inputs and outputs of the
    instruction, so it does not support adding an int8 to an int32, or other "mixes".

    The outputs are correctly extended to a 64-bit number (signed or
    unsigned) so it is possible to pass results to wider operations
    without conversion.

    One example would be

    unsigned long foo (unsigned int a, unsigned int b)
    {
    return a + b;
    }

which would normally need an adjustment after the add, but which would
just be something like

    adduw r1,r1,r2
    ret

    using Mitch's new encoding.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Oct 3 21:36:07 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow. --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    Why? Compilers do not have any problem with this
    as its been handled by overload resolution since forever.

    LLVM compiles C with stricter typing than GCC resulting in a lot
    of smashes:: For example::

    int subroutine( int a, int b )
    {
    return a+b;
    }

    Compiles into:

    subroutine:
    ADD R1,R1,R2
    SRA R1,R1,<32,0> // limit result to (int)
    RET

    LLVM thinks the smash is required because [-2^31..+2^31-1] +
    [-2^31..+2^31-1] does not always fit into [-2^31..+2^31-1] !!!
    and chasing down all the cases is harder than the compiler is
ready to do. At first I thought that the Value propagation in
LLVM would find that the vast majority of arithmetic does not
need smashing. This proved frustrating to both Brian and me.
The more I read RISC-V and ARM assembly code, the more
    I realized that adding sized integer arithmetic is the only
    way to get through to the LLVM infrastructure.

    We (the My 66000 team; mostly me and Brian) have been trying to
    obey the stricter than necessary typing of LLVM and achieve the
    code density possible as if K&R rules were in play with 64-bit
    only (int)s.

    RISC-V has ADDW (but no ADDH or ADDB) to alleviate the issue on
    a majority of calculations. ARM has word sized Registers to
    alleviate the issue. Since ARM started as 32-bits ADDW is natural.
    I am exploring how to provide integer arithmetic such that smashing
    never has to happen.

    We have been chasing smashes for 9 months making little progress...

    Its people who have the problems following type changes and most
    compilers will warn of mixed type operations for exactly that reason.

It is more the Ada problem that values must fit in containers--that
is, values have a range {min..max} and calculated values outside
of that range are to be "addressed".

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.

    Strongly typed languages don't natively support mixed type operations.
    They come with a set of predefined operations for specific types that
    produce specific results.

Yes, indeed, and this is what I am providing: {Sign}×{Size} calculations,
where the result is known to be range limited to {Sign}×{Size}. Thus:

    ADDSH R7,R8,R9

R7 is range limited {Signed}×{HalfWord} == [-32768..+32767]
------------------------------------------------------------------------
    So let's look at some egregious cases::

    cvtds r2,r2 // convert double to signed 64
    srl r3,r2,#0,#32 // convert signed 64 to signed 32
    --------
    sra r1,r23,#0,#32 // smash to signed 32
    sra r2,r20,#0,#32 // smash to signed 32
    maxs r23,r2,r1 // max of signed 32
    --------
    ldd r24,[r24] // LD signed 64
    add r1,r28,#1 // innocently add #1
    sra r28,r1,#0,#32 // smash to Signed 32
cmp r1,r28,r16 // to match the other operand of CMP
--------
    call strspn
    srl r2,r1,#0,#32 // smash result Signed 32
    add r1,r25,-r1
    sra r1,r1,#0,#32 // smash Signed 32
    cmp r2,r19,r2
    srl r2,r2,#2,#1
    add r21,r21,r2 // add Bool to Signed 32
    sra r2,r20,#0,#32 // smash Signed 32
    maxs r20,r1,r2 // MAX Signed 32
    --------
    mov r1,r29 // Signed 64
    ple0 r17,FFFFFFF // ignore
    stw r17,[ip,key_rows] // ignore
    add r1,r29,#-1 // innocent subtract
    sra r1,r1,#0,#32 // smash to Signed 32
    divs r1,r1,r17 // DIV Signed 32
    --------
    lduw r2,[ip,keyT+4]
    add r2,r2,#-1 // innocent subtract
    srl r2,r2,#0,#32 // smash to unSigned 32
    cmp r3,r2,#1 // CMP unSigned 32
    // even though CMP is Signless
    --------
    add r1,r19,-r6 // not so innocent subtract
    sra r2,r1,#0,#32 // Signed
    srl r1,r1,#0,#32 // unSigned
    // only one of these can be eliminated
    --------

    If YOU want operators/functions that allow mixed types then they force
    you to define your own functions to perform your specific operations,
    and it forces you to deal with the consequences of your type mixing.

    All this does is force YOU, the programmer, to be explicit in your
    definition and not depend on invisible compiler specific interpretations.

If you want to support Uns8 * Int8 then it forces you, the programmer,
to deal with the fact that this produces a signed 16-bit result
in the range 255*-128..255*127 = -32640..32385.

    Uns8 occupies 64-bits in a register range-limited to [0..255]
    Int8 occupies 64-bits in a register range-limited to [-128..127]
    So, integer values sitting in registers occupy the whole 64-bits
    but are properly range-limited to base-type.

Multiply multiplies 2×64-bit registers and produces a 128-bit
result; since CARRY is not in effect, the bits<127..64> are
discarded and bits<63..0> are then considered.

    unSigned results simply discard bits more significant than base-type.
Signed results raise OVERFLOW if there is more significance than
    base-type (and if enabled take an exception).
    In all cases, the result delivered fits within the range of base-type.

    So, in the case you mention::

    LDUB R8,[---]
    LDSB R9,[---]
    MULSH R7,R8,R9 // result range [-32768..32767]
    -----
    MULUH R7,R8,R9 // result range [0..65535]


    Now if you want to convert that result bit pattern to Uns8 by truncating
    it to the lower 8 bits,

    MULUB R7,R8,R9 // result range [0..255]

    or worse treat the result as Int8 and take
    whatever random value falls in bit [7] as the sign, then that's on you.

    MULSB R7,R8,R9 // result range [-128..127] or OVERFLOW

    Personally, I prefer range checks that raise OVERFLOW.
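
A hedged C model of the two behaviours just described (the
mnemonic-named helpers and the overflow flag are illustrative only):

#include <stdbool.h>
#include <stdint.h>

/* MULUB-style: the unsigned result simply discards bits above the base type. */
static inline uint64_t mulub(uint64_t ra, uint64_t rb)
{
    return (uint8_t)(ra * rb);                /* range [0..255] */
}

/* MULSB-style: the signed result is range-checked; OVERFLOW is raised
   (and, if enabled, an exception taken) when the product has more
   significance than the base type. */
static inline int64_t mulsb(int64_t ra, int64_t rb, bool *overflow)
{
    int64_t p = ra * rb;                      /* low 64 bits of the product */
    *overflow = (p != (int8_t)p);             /* does not fit in [-128..127] */
    return (int8_t)p;
}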

    They just force you to be explicit what you are doing.

    --------------------------------------------------------------
    Integer instructions are now::
    {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.

    RISC-V and ARM LLVM compilers already do this and use it to eliminate
    smashes. RISC-V is limited to WORD, ARM uses registers of WORD size.
    Both eliminate smashes. Since there are already LLVM compilers using
this (to eliminate smashes) it should not be terribly difficult to add.

    On the other hand:: ILP64 ALSO gets rid of the problem (at a different cost).

    Strong typed languages don't have predefined operators that allow mixing. Weak typed languages deal with this in overload resolution and by having predefined invisible type conversions in those operators and then using
    the normal single type arithmetic instructions.

    Although I am oscillating whether to support FP8 or FP128.

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.

    Thank you for your input.

    The issue with FP128 seems associated with scaling on LD and ST
    because now scaling is 1,2,4,8,16 which adds 1 bit to the scale field.
    And in the case of a combined int-float register file deciding whether
    to expand all registers to 128 bits, or use 64-bit register pairs.

My position is that people want 64-bit registers and an ISA that allows
reasonably easy and efficient access to 128 bits; CARRY provides this.
But the architecture is not cut out to be a big 128-bit number cruncher;
occasionally, sure, but all the time, no.

    Using 128-bit registers raises the question of 128-bit integer support,
    and using register pairs opens a whole new category of pair instructions.

    CARRY supports this.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 4 04:56:21 2025
    From Newsgroup: comp.arch

    On 10/3/2025 4:04 PM, Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    --------------------------------------------------------------
    Integer instructions are now:: {Signed and unSigned}|u{Byte, HalfWord,
    Word, DoubleWord}
    while FP instructions are now:
    {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
Strong typed languages don't have predefined operators that allow mixing.
    Not sure who's confused, but my reading of the above is not some sort of
    "mixing": I believe Mitch is just saying that his addition operation
    (for example) can be specified to operate on either one of int8, uint8,
    int16, uint16, ...
    But that specification applies to all inputs and outputs of the
    instruction, so it does not support adding an int8 to an int32, or other
    "mixes".

    The outputs are correctly extended to a 64-bit number (signed or
    unsigned) so it is possible to pass results to wider operations
    without conversion.

    One example would be

    unsigned long foo (unsigned int a, unsigned int b)
    {
    return a + b;
    }

    which would need an adjustment after the add, and which would
    just be somethign like

    adduw r1,r1,r2
    ret

    using Mitch's new encoding.



    Yes.

    Sign extend signed types, zero extend unsigned types.
    Up-conversion is free.


    This is something the RISC-V people got wrong IMO, and adding a bunch of
    ".UW" instructions in an attempt to patch over it is just kinda ugly.

Partly for my own uses, I revived ADDWU and SUBWU (which had been dropped
in BitManip), because these are less bad than the alternative.

    I get annoyed that new extensions keep trying to add ever more ".UW" instructions rather than just having the compiler go over to
    zero-extended unsigned and make this whole mess go away.

    ...



    Ironically, the number of new instructions being added to my own ISA has mostly died off recently, largely because there is little particularly relevant to add at this point (within the realm of stuff that could be
    added).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 4 04:57:23 2025
    From Newsgroup: comp.arch

    On 10/3/2025 11:40 AM, EricP wrote:
    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it reached the
    point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both integer and
    floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply value
    range constraints--just like memory !

    Why? Compilers do not have any problem with this
    as its been handled by overload resolution since forever.

    Its people who have the problems following type changes and most
    compilers will warn of mixed type operations for exactly that reason.

    ISA 2.0 changes allows calculation instructions; both Integer and
    Floating Point; and a few other miscellaneous instructions (not so
    easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    Integer and floating point compare instructions only compare
    bits of the specified {Size}.

    Conversions between integer and floating point are now also
    governed by {Size} so one can directly convert FP64 directly
    into {unSigned}|u{Int16}--more fully supporting strongly typed
    languages.

    Strongly typed languages don't natively support mixed type operations.
    They come with a set of predefined operations for specific types that
    produce specific results.

    If YOU want operators/functions that allow mixed types then they force
    you to define your own functions to perform your specific operations,
    and it forces you to deal with the consequences of your type mixing.

    All this does is force YOU, the programmer, to be explicit in your
    definition and not depend on invisible compiler specific interpretations.

If you want to support Uns8 * Int8 then it forces you, the programmer,
to deal with the fact that this produces a signed 16-bit result
in the range 255*-128..255*127 = -32640..32385.
    Now if you want to convert that result bit pattern to Uns8 by truncating
    it to the lower 8 bits, or worse treat the result as Int8 and take
    whatever random value falls in bit [7] as the sign, then that's on you.
    They just force you to be explicit what you are doing.

    --------------------------------------------------------------
Integer instructions are now:: {Signed and unSigned}×{Byte,
HalfWord, Word, DoubleWord}
while FP instructions are now:
{Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
    Strong typed languages don't have predefined operators that allow mixing. Weak typed languages deal with this in overload resolution and by having predefined invisible type conversions in those operators and then using
    the normal single type arithmetic instructions.

    Although I am oscillating whether to support FP8 or FP128.

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.

    The issue with FP128 seems associated with scaling on LD and ST
    because now scaling is 1,2,4,8,16 which adds 1 bit to the scale field.
    And in the case of a combined int-float register file deciding whether
    to expand all registers to 128 bits, or use 64-bit register pairs.
    Using 128-bit registers raises the question of 128-bit integer support,
    and using register pairs opens a whole new category of pair instructions.


    I generally went with register pairs...

    Where, say, for base types:
    8-bits: Rarely big enough
    16-bits: Sometimes big enough
    32-bits: Usually big enough
    64-bits: Almost always big enough

    Vector types:
    2x: Good
    4x: Better
    8x: Rarely Needed

    For a scalar type, the high 64 bits of a 128-bit register would be
    almost always wasted, so it isn't worthwhile to spend resources on
    things that are mostly just going to waste.



    At least with 64-bit registers, they cover:
    Integer values: Usually overkill
    'int' is far more common than 'long long'.
    Floating Point: Usually Optimal
    Binary64 is almost always good.
    Binary32 is frequently insufficient.
    2x Binary32 and 4x Binary16: OK

    Then, 128-bit as pairs:
    Deals with the occasional 128-bit vector and integer;
    Avoids wasting resources all the times we don't need it.

    Well, since computation isn't exactly a gas that expands to efficiently utilize the register size (going bigger = diminishing returns).


    If the CPU is superscalar, can use 2x64b lanes for the 128-bit path, ...


    As for Binary128:
    Infrequently used;
    Too expensive for direct hardware support;
So, I ended up adding trap-only support;
    Trap-only allows it to exist without also eating the FPGA.

    As for FP8:
    There are multiple formats in use:
    S.E3.M4: Bias=7 (Quats / Unit Vectors)
    S.E3.M4: Bias=8 (Audio)
    S.E4.M3: Bias=7 (NN's)
    E4.M4: Bias=7 (HDR images)

    Then, for 16-bit:
    S.E5.M10: Generic, Graphics Processing, Sometimes 3D Geometry
    Sometimes not enough dynamic range.
    S.E8.M7: NNs
    Usually not enough precision.

    It is likely the more optimal 16-bit format might actually be S.E6.M9,
    but this is non-standard.
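
A simplified decoder for one of the FP8 layouts listed above, S.E4.M3
with bias 7 (a sketch only; NaN/Inf conventions are deliberately
ignored since the various FP8 definitions disagree on them):

#include <math.h>
#include <stdint.h>

/* Decode an FP8 value laid out as S.E4.M3, bias 7.
   e == 0 is treated as subnormal; no special-case handling of NaN/Inf. */
static float fp8_e4m3_to_float(uint8_t x)
{
    int s = (x >> 7) & 1;
    int e = (x >> 3) & 0xF;
    int m =  x       & 0x7;
    float mag = (e == 0)
        ? ldexpf((float)m / 8.0f, 1 - 7)           /* subnormal: m/8 * 2^(1-bias) */
        : ldexpf(1.0f + (float)m / 8.0f, e - 7);   /* normal: (1+m/8) * 2^(e-bias) */
    return s ? -mag : mag;
}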


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Oct 4 12:37:18 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 10/2/2025 7:50 PM, MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
range limited to that of the {Sign}×{Size}--no garbage bits
in the high parts of the registers--the register accurately
represents the calculation as specified {Sign}×{Size}.

I must be missing something. Suppose I have

    C := A + B

    where A and C are 16 bit signed integers and B is an 8 bit signed
integer. As I understand what you are doing, loading B into a register
will leave the high order 56 bits zero. But the add instruction will
presumably be half word, so if B is negative, it will get an incorrect
answer (because B is not sign extended to 16 bits).

    What am I missing?


    I am pretty sure A would be sign extended to 64 bit on load and the same for B, from 8->64 bits, at which point the addition works as it should?
    When storing a 64-bit result as a 16-bit signed integer, the cpu can
    verify that the top 48 bits are either all 1 or all 0.
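
That store-time check is simply "does the 64-bit value survive a round
trip through a signed 16-bit integer", e.g.:

#include <stdbool.h>
#include <stdint.h>

/* True when the top 48 bits are all copies of bit 15, i.e. the value
   is representable as a signed 16-bit integer. */
static inline bool fits_in_int16(int64_t v)
{
    return v == (int16_t)v;
}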
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Oct 4 10:17:41 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    LLVM compiles C with stricter typing than GCC resulting in a lot
    of smashes:: For example::

    int subroutine( int a, int b )
    {
    return a+b;
    }

    Compiles into:

    subroutine:
    ADD R1,R1,R2
    SRA R1,R1,<32,0> // limit result to (int)
    RET

    I tested this on AMD64, and did not find sign-extension in the caller,
    neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret

    It's not about strict or lax typing, it's about what the calling
    convention promises about types that are smaller than a machine word.
    If the calling convention requires/guarantees that ints are
    sign-extended, the compiler must use instructions that produce a
    sign-extended result. If the calling convention guarantees that ints
    are zero-extended (sounds perverse, but RV64 has the guarantee that
    unsigned is passed in sign-extended form, which is equally perverse),
    then the compiler must use instructions that produce a zero-extended
    result (e.g., AMD64's addl). If the calling convention only requires
    and guarantees the low-order 32 bits (I call this garbage-extended),
    then the compiler can use instructions that perform 64-bit adds; this
    is what we are seeing above.

The other side of the coin is what is needed at the caller: if the
caller needs to convert a sign-extended int into a long, it does not
have to do anything. If it needs to convert a zero-extended or
garbage-extended int into a long, it has to sign-extend the value.

    I have tested this with:

    int subroutine2(int,int);

    long subroutine3(int a,int b)
    {
    return subroutine2(a,b);
    }

    On AMD64 the result is:

    gcc-14:
    0000000000000010 <subroutine3>:
    10: 48 83 ec 08 sub $0x8,%rsp
    14: e8 00 00 00 00 call 19 <subroutine3+0x9>
    19: 48 83 c4 08 add $0x8,%rsp
    1d: 48 98 cltq
    1f: c3 ret

    clang-19:
    0000000000000010 <subroutine3>:
    10: 50 push %rax
    11: e8 00 00 00 00 call 16 <subroutine3+0x6>
    16: 48 98 cltq
    18: 59 pop %rcx
    19: c3 ret

    The compilers introduce the sign-extension CLTQ because the result of
    the call is not sign-extended. For parameter passing, it's the same:

    int subroutine4(long,long);

    long subroutine5(int a,int b)
    {
    return subroutine4(a,b);
    }

    0000000000000020 <subroutine5>:
    20: 48 83 ec 08 sub $0x8,%rsp
    24: 48 63 f6 movslq %esi,%rsi
    27: 48 63 ff movslq %edi,%rdi
    2a: e8 00 00 00 00 call 2f <subroutine5+0xf>
    2f: 48 83 c4 08 add $0x8,%rsp
    33: 48 98 cltq
    35: c3 ret
    0000000000000020 <subroutine5>:
    20: 50 push %rax
    21: 48 63 ff movslq %edi,%rdi
    24: 48 63 f6 movslq %esi,%rsi
    27: e8 00 00 00 00 call 2c <subroutine5+0xc>
    2c: 48 98 cltq
    2e: 59 pop %rcx
    2f: c3 ret

    BTW, In C as it was originally conceived, that was not an issue,
    because int occupied a complete register and all smaller types are
converted to ints. The I32LP64 mistake has made it necessary to insert a lot
of sign-extensions (and C compiler writers embrace undefined behaviour
    to avoid that in some cases).

    Another mistake we see in this example is the 16-byte alignment
    requirement of SSEx. It results in the RSP adjustments around the
    call. If only AMD had decided to support unaligned SSEx memory
    accesses by default in 64-bit mode.

    LLVM thinks the smash is required because [-2^31..+2^31-1] +
    [-2^31..+2^31-1] does not always fit into [-2^31..+2^31-1] !!!
    and chasing down all the cases is harder than the compiler is
    ready to do.

    In your example, there is nothing to chase down, because subroutine()
    can be called from anywhere.

    At first I though that the Value propagation in
    LLVM would find that the vast majority of arithmetic does not
    need smashing. This proved frustrating to both myself and to
Brian. The more I read RISC-V and ARM assembly code, the more
    I realized that adding sized integer arithmetic is the only
    way to get through to the LLVM infrastructure.

    You might try changing the calling convention for int to
    garbage-extended. It can introduce sign or zero extension elsewhere,
    but maybe fewer than otherwise.

    RISC-V has ADDW (but no ADDH or ADDB) to alleviate the issue on
    a majority of calculations.

    That's an RV64 extension. RV32 does not have ADDW.

    ARM has word sized Registers to
    alleviate the issue. Since ARM started as 32-bits ADDW is natural.

    Not at all. ARM A64 is a completely new instruction set that has at
    least as much in common with PowerPC as with ARM A32 or ARM T32. I
    expect that they would not have added the 32-bit ADDW or the
    addressing modes with sign- or zero-extended 32-bit indexes if the
    MIPS and Alpha people had not made the I32LP64 mistake. Instead, they
    would have used the encoding space for more useful things.

    I am exploring how to provide integer arithmetic such that smashing
    never has to happen.

    If you want to avoid every use of a separate sign-extension or
    zero-extension instruction, add three bits to every source-register
    specifier: 2 bits for the input size (1,2,4,8 bytes), 1 for
signed/unsigned. Once you have that, there is no need to extend the
result: you can always perform the extension on input to the use of a
    result; the natural calling convention to go along with that is to garbage-extend.
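
A sketch of what such a per-source-operand specifier and its read-out
could look like (purely illustrative; the encoding and names are mine,
not anything My 66000, RISC-V, or any other ISA defines):

#include <stdint.h>

/* 5 register bits plus the 3 extra bits suggested above:
   2 bits of input size (1/2/4/8 bytes) and 1 signed/unsigned bit. */
typedef struct {
    unsigned reg  : 5;   /* register number                  */
    unsigned size : 2;   /* 0=byte, 1=half, 2=word, 3=dword  */
    unsigned uns  : 1;   /* 0 = sign-extend, 1 = zero-extend */
} src_spec;

/* The consumer extends each input as it reads it, so results can stay
   "garbage-extended" in the register file. */
static inline int64_t read_src(const int64_t *regs, src_spec s)
{
    int64_t v = regs[s.reg];
    switch (s.size) {
    case 0:  return s.uns ? (int64_t)(uint8_t)v  : (int64_t)(int8_t)v;
    case 1:  return s.uns ? (int64_t)(uint16_t)v : (int64_t)(int16_t)v;
    case 2:  return s.uns ? (int64_t)(uint32_t)v : (int64_t)(int32_t)v;
    default: return v;                            /* full 64 bits */
    }
}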

    I don't think that extension instructions are frequent enough to merit
    going to such lengths. I actually think that the RISC-V people made
    the wrong choice here, contrary to their usual stance. Instead of
    having sign-extension as a separate instruction (like zero-extension),
    they added it to a number of integer instructions, inflating the
    number of instructions for little benefit.

    So let's look at some egregious cases::

    cvtds r2,r2 // convert double to signed 64
    srl r3,r2,#0,#32 // convert signed 64 to signed 32

    unsigned?

    --------
    sra r1,r23,#0,#32 // smash to signed 32
    sra r2,r20,#0,#32 // smash to signed 32
    maxs r23,r2,r1 // max of signed 32

    With garbage-extension, you need a 32-bit maxs or sign-extend the
    operands. But you are sign-extended; why do you need it?

    Such things are not necessary with garbage-extension for add, sub,
    mul, and, or xor, i.e., the most common operations.

    --------
    ldd r24,[r24] // LD signed 64
    add r1,r28,#1 // innocently add #1
    sra r28,r1,#0,#32 // smash to Signed 32
    cmp r1,r28,r16 // to match the other operand of CMP

    Similar to the maxs case.

    --------
    call strspn
    srl r2,r1,#0,#32 // smash result Signed 32
    add r1,r25,-r1
    sra r1,r1,#0,#32 // smash Signed 32
    cmp r2,r19,r2
    srl r2,r2,#2,#1
    add r21,r21,r2 // add Bool to Signed 32
    sra r2,r20,#0,#32 // smash Signed 32
    maxs r20,r1,r2 // MAX Signed 32

    Maybe the right way here is to use size_t for the variable where you
    put the return value (strspn() returns a size_t).

    --------
    mov r1,r29 // Signed 64
    ple0 r17,FFFFFFF // ignore
    stw r17,[ip,key_rows] // ignore
    add r1,r29,#-1 // innocent subtract
    sra r1,r1,#0,#32 // smash to Signed 32
    divs r1,r1,r17 // DIV Signed 32

    Division is one of the operations where garbage-extended input is not
    ok; but fortunately it is rare.

    I doubt any compilers will use this feature.

RISC-V and ARM LLVM compilers already do this and use it to eliminate smashes.

    Shortly after we got our first Alphas in 1995, I saw DEC's C compiler
    produce lots of explicit sign-extensions (using the addl instruction)
    of both int operands and int results. In later years they got the
    compiler to emit many fewer sign-extensions. I don't remember seeing
    that many sign extensions on Alpha from gcc, ever, so apparently they
    already kept track of the extension status of a value at the time.

    On the other hand:: ILP64 ALSO gets rid of the problem (at a different cost).

    Exactly. If the I32LP64 mistake had not been made, we would have been
    spared a lot (not just extension instructions). But for ARM A64 and
    RV64, they have to adapt to the world as it is, not as it should be,
    and unfortunately that means I32LP64. For MY66000, it's your call, of
    course.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Oct 4 11:52:22 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    If you are not familiar with them, they are:

    - INTEGER takes up one storage unit
    - REAL takes up one storage unit
    - DOUBLE PRECISION takes up two storage units

    where storage units are implementation-defined. Also consider
    that 32-bit REALs and 64-bit REALs are both useful and needed,
    and that (unofficially) C's integers were identical to
    FORTRAN's INTEGER.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Oct 4 16:11:37 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    I am not familiar enough with FORTRAN to give a recommendation on
    that. However, two observations:

    * The Cray-1 is primarily a Fortran machine, and it's C implementation
    is ILP64, and it is successful. So obviously an ILP64 C can live
    fine with FORTRAN.

    * Whatever inconvenience ILP64 would have caused to Fortran
    implementors is small compared to the cost in performance and
    reliability that I32LP64 has cost in the C world and the cost in
    encoding space (and thus code size) and implementation effort and
    transistors (probably not that many, but still) that it is costing
    all customers of 64-bit processors.

    If you are not familiar with them, they are:

    - INTEGER takes up one storage unit
    - REAL takes up one storage unit
    - DOUBLE PRECISION takes up two storage units

    where storage units are implementation-defined. Also consider
    that 32-bit REALs and 64-bit REALs are both useful and needed,
    and that (unofficially) C's integers were identical to
    FORTRAN's INTEGER.

    And unofficially C's integers were as long as pointers (with a legacy
    reaching back to BCPL). If I had to choose between breaking an
    unofficial FORTRAN-C interface tradition and a C-internal tradition, I
    would choose the C-internal tradition every time.

    There are two other languages that I have thought about:

    Java was introduced with fixed-size 32-bit int and 64-bit long, and
    with references typically having the size of a machine word. The
    choice of "int" and "long" may be due to I32LP64, and if the C people
    had gone for ILP64, the Java people might have chosen different names.
    But given their goal of write-once-run-everywhere with bit-identical
    results, they probably did not want to provide a machine-word-sized
    integer type. Java became popular when 32-bit machines were still a
    thing for running Java, so there would be lots of Java around that
    uses the 32-bit integer type. Given the large amount of Java code,
    that alone might be enough to make computer architects want to add
    special architectural support for signed 32-bit integers. At least we
    would have been spared architectural support for unsigned 32-bit
    integers.

    AFAIK Rust does not have a machine-word-sized integer type; instead,
    each type has its size in its name (e.g., i32, u64). Given that Rust
    was designed recently, that does not lead to portability problems yet:
    On servers, desktops (and recently smartphones) machine words are only
    64 bits, so if you write for that, you can just use i64 and u64, and
    your software will be efficient (or you can use smaller integers, and
    unless you store a lot of them, your software will be inefficient on
    various machines thanks to sign or zero extension). If you program on
    an embedded system, the code probably won't be ported to a machine
    with a different word size, so again, choosing the integer types that
    match the word size is a good choice. If there is ever a transition
    to 128-bit machines, I expect that the Rust approach will backfire,
    but who knows if Rust will still be in significant use by then. If it
    is, it may result in costs like I32LP64 is causing now.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Oct 4 20:44:37 2025
    From Newsgroup: comp.arch

    On Sat, 04 Oct 2025 16:11:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    AFAIK Rust does not have a machine-word-sized integer type; instead,
    each type has its size in its name (e.g., i32, u64).

    Rust has machine-dependent isize and usize types, identical to ptrdiff_t
    and size_t in C.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Oct 4 20:51:43 2025
    From Newsgroup: comp.arch

    On Sat, 04 Oct 2025 16:11:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    I am not familiar enough with FORTRAN to give a recommendation on
    that. However, two observations:

    * The Cray-1 is primarily a Fortran machine, and its C implementation
    is ILP64, and it is successful. So obviously an ILP64 C can live
    fine with FORTRAN.


    I would guess that Cray-1 FORTRAN was not 100% conformant to the FORTRAN 77 standard. And they likely didn't care.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Oct 4 18:01:59 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    FORTRAN INTEGER == INT32_T

    allowing ILP64.

    If you are not familiar with them, they are:

    - INTEGER takes up one storage unit
    - REAL takes up one storage unit
    - DOUBLE PRECISION takes up two storage units

    where storage units are implementation-defined. Also consider
    that 32-bit REALs and 64-bit REALs are both useful and needed,
    and that (unofficially) C's integers were identical to
    FORTRAN's INTEGER.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Oct 4 18:05:18 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    I am not familiar enough with FORTRAN to give a recommendation on
    that. However, two observations:

    * The Cray-1 is primarily a Fortran machine, and its C implementation
    is ILP64, and it is successful. So obviously an ILP64 C can live
    fine with FORTRAN.

    * Whatever inconvenience ILP64 would have caused to Fortran
    implementors is small compared to the cost in performance and
    reliability that I32LP64 has cost in the C world and the cost in
    encoding space (and thus code size) and implementation effort and
    transistors (probably not that many, but still) that it is costing
    all customers of 64-bit processors.

    If you are not familiar with them, they are:

    - INTEGER takes up one storage unit
    - REAL takes up one storage unit
    - DOUBLE PRECISION takes up two storage units

    where storage units are implementation-defined. Also consider
    that 32-bit REALs and 64-bit REALs are both useful and needed,
    and that (unofficially) C's integers were identical to
    FORTRAN's INTEGER.

    And unofficially C's integers were as long as pointers (with a legacy reaching back to BCPL). If I had to choose between breaking an
    unofficial FORTRAN-C interface tradition and a C-internal tradition, I
    would choose the C-internal tradition every time.

    There is a quote from K&R C that states int is the most efficient
    form for computing integer arithmetic values.

    With the demand for int to remain 32-bits and the countering demand
    of LLVM to obey typing, int no longer obeys its original stated goal.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Oct 4 14:42:25 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    MitchAlsup wrote:
    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !
    Why? Compilers do not have any problem with this
    as its been handled by overload resolution since forever.

    A non-My66000 example:

    int add (int a, int b)
    {
    return a + b;
    }

    is translated on powerpc64le-unknown-linux-gnu (with -O3 to)

    add 3,3,4
    extsw 3,3
    blr

    extsw fills the 32 high-value bits with the sign bit, because numbers returned
    in registers have to be correct, either as 32- or 64-bit values.

    Ok I see what's going on - the reference to strong typing got me
    thinking this was about operand type matching.

    Above it is treating integer arguments and return types that are
    smaller than full register width, and presumably short and char also,
    as modulo (wrapping) data types and converting them to canonical
    form by sign or zero extension. That avoids later problems in compare operations where the low order bits match but high order bits differ.
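
    A small C sketch of that hazard (the helper name is made up): if the
    high bits were left as garbage, a full-width compare could disagree
    with a 32-bit compare, so values are kept in canonical form instead.

        #include <stdint.h>

        /* Compare only the low 32 bits; with canonical (sign/zero-extended)
           values a plain 64-bit compare would give the same answer. */
        int same_low32(uint64_t a, uint64_t b)
        {
            return (uint32_t)a == (uint32_t)b;
        }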

    A strongly typed language would have separate data types for signed
    and unsigned linear integers, and for signed and unsigned modulo integers.
    The sign/zero extend for modulo result types would mask any overflow
    and prevent proper result overflow checking.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Oct 4 18:55:05 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?

    I am not familiar enough with FORTRAN to give a recommendation on
    that. However, two observations:

    * The Cray-1 is primarily a Fortran machine, and its C implementation
    is ILP64, and it is successful. So obviously an ILP64 C can live
    fine with FORTRAN.

    As you may know, the Cray-1 was a very special machine, which got
    away with a lot of idiosyncrasies because it was blindingly fast
    (and caused users a lot of trouble with conversion between DOUBLE
    PRECISION and REAL).

    But that was in the late 1970s. By the time the 64-bit workstations
    were being designed, REAL was firmly established as 32-bit and
    DOUBLE PRECISION as 64-bit, from the /360, the PDP-11, the VAX
    and the very 32-bit workstations that the 64-bit workstations were
    supposed to replace.


    * Whatever inconvenience ILP64 would have caused to Fortran
    implementors is small compared to the cost in performance and
    reliability that I32LP64 has cost in the C world and the cost in
    encoding space (and thus code size) and implementation effort and
    transistors (probably not that many, but still) that it is costing
    all customers of 64-bit processors.

    A 64-bit REAL and (consequently) a 128-bit DOUBLE PRECISION
    would have made the 64-bit workstations pretty much unusable for
    scientific use, and a lot of these were aimed at the technical
    and scientific market, and that meant FORTRAN.

    So, put yourself into the shoes of the people designing R4000
    workstations: they could allow their scientific and technical customers
    to use the same codes "as is", with no conversion, or tell them
    they cannot use 32-bit REAL any more, and that they need to rewrite
    all their software.

    What would they have expected their customers to do? Buy a system
    which forces them to do this, or buy a competitor's system where
    they can just recompile their software?

    You're always harping about how compilers should be bug-compatible
    to previous releases. Well, that would have been the mother of
    all incompatibilities, aka business suicide.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 4 16:04:54 2025
    From Newsgroup: comp.arch

    On 10/4/2025 12:44 PM, Michael S wrote:
    On Sat, 04 Oct 2025 16:11:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    AFAIK Rust does not have a machine-word-sized integer type; instead,
    each type has its size in its name (e.g., i32, u64).

    Rust has machine-dependent isize and usize types, identical to ptrdiff_t
    and size_t in C.


    I guess, if starting from a clean slate (in a from-scratch language), it might
    make sense to have:
    A range of defined fixed sizes;
    A range of types whose size is a product of various machine constraints.


    So, say:
    u8/u16/u32/u64/u128 //Unsigned, fixed size, default endian
    s8/s16/s32/s64/s128 //Signed, fixed size, default endian
    u8l/u16l/u32l/u64l/u128l //Unsigned, fixed size, little endian
    s8l/s16l/s32l/s64l/s128l //Signed, fixed size, little endian
    u8b/u16b/u32b/u64b/u128b //Unsigned, fixed size, big endian
    s8b/s16b/s32b/s64b/s128b //Signed, fixed size, big endian
    u8l/s8l/u8b/s8b: Technically redundant with u8/s8, but added for
    consistency.


    i8/i16/i32/i64/i128, could also make sense.
    Could also have sbit(N) and ubit(N), which specify exact-width types, but otherwise behave like the normal integer types. The power-of-2 sizes
    could be seen as mostly equivalent to the fixed-size types.


    Floating point types:
    f16/f32/f64/f128
    f8/f8a/f8u/...: Assortment of 8-bit types.
    Since no one-size-fits-all with FP8.
    (Maybe also with f*l and f*b variants?).

    Machine constraint-sized types:
    sasize/uasize: Size for arrays and similar
    spsize/upsize: Size for pointers and pointer differences
    sfsize/ufsize: Size for file offsets
    int: default 'fast' size (32 or 64 bits)
    long: default 'large but fast' size (64 or 128 bits)
    Would be 64 if machine only has 64 bit ALU operations;
    Would be 128 if machine has a 128-bit ALU available.
    intmul: Whichever size allows the fastest integer MUL or MAC.
    More likely to be 16 or 32 bits.
    ...

    Special types:
    void: No Type, pointers may freely convert to other types
    m8: Like void, but with a defined size, but no operators.
    m8 could be assumed the default type for raw memory buffers.
    m8 pointers may be freely cast to/from other pointer types.
    m16/m32/m64/m128: Has size but no defined operators.
    Casts involving these types will be bit-preserving.
    Size-mismatched casts will not be allowed.

    May use slightly different type promotion rules from C, for integer types:
      Td = Ts OP Tt
        If the range of Td is greater than or equal to that of (Ts OP Tt):
          promote to the wider of the two;
          (Ts OP Tt) promotes by default to the wider of Ts or Tt;
          if there is a signed/unsigned mismatch of the same size, or a
            smaller signed type, promote to the next larger signed type
            (Note: NOT the "same sized unsigned" as C would use).
        If the range of Td is less than that of (Ts OP Tt):
          If the result will be the same either way,
            promote to the most efficient type to carry out the operation,
            or use Td if doing so is efficient;
            narrow the result if needed (Td narrower than the intermediate type).
          Else, promote to the type of (Ts OP Tt), and narrow the result.

    In this case, the types may flow-out from the inputs and operators, but
    also flow-in from the destination type. Usually C lacks the flowing-in
    part, but it is relevant for efficient code generation.

    Note that the inward flow may happen recursively, where if Td promotion
    is used for an outward expression, the two sub-expressions may be
    re-evaluated in light of 'Td' as the destination type (vs merely the
    result of the input expressions).

    Unlike C, would still apply the same promotion behavior to 8 and 16 bit
    types as for wider types (so, there is no implicit "first auto-promote everything to int" rule). Though, it can generally still use wider ALU
    so long as the result value will retain the expected sign or zero extension.


    This would differ from C's behavior in the case of widening expressions,
    in that operating on narrower types and storing the result as a wider
    type will promote first (so no overflow happens) rather than in C where
    an overflow may happen at the narrower type, with the result promoted after the fact.
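
    A C sketch of the difference (hypothetical function), showing the case
    the proposed rules would avoid:

        #include <stdint.h>

        uint64_t scale(uint32_t a, uint32_t b)
        {
            /* C: the multiply is done at 32 bits (it wraps), then widened.
               Proposed rules: a and b would be promoted to 64 bits first,
               because the destination type is wider, so no wrap occurs.
               C workaround: return (uint64_t)a * b; */
            return a * b;
        }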

    This would have fewer "gotchas" on average than the C approach, but C's
    rules need to be maintained for C code, as some code will break if the original integer overflow behavior is not preserved. But, the existing
    rules are not entirely consistent.

    Can make the working assumption that widening is cheap but narrowing has
    a non-zero cost (though, this is the reverse from the normal RV ABI,
    where on RV64G the ABI would normally have people pay the cost at
    "unsigned int"->"long" promotion).

    In the abstract model, all narrower signed or unsigned types are sign or
    zero extended to the maximum widest type in play; we can also assume
    twos complement as the working model; ...



    The big and little endian types would mostly apply to structures and
    pointers. They would only effect local variables if the address of the
    local variable is taken (else the machine default is used; or "all
    choices being equal" assume little endian).

    By default, assume native alignment of a type unless a packed modifier
    is used (with packed applied either per variable or to the structure as
    a whole). If no packed is used, the alignment of a struct will be the
    widest member in the struct. If used on a struct, the whole struct will
    assume byte alignment. Else, the alignment will be the largest alignment
    seen within the struct (or the largest non-packed member). Could maybe
    have an 'align_as()' modifier (to specify to use the same alignment as
    another type) with the packed case being equal to byte alignment.
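
    A rough C analogue of those packing rules, using a common GCC/Clang
    extension purely for illustration (the proposed language would express
    this directly):

        #include <stdint.h>

        struct natural {                 /* aligned to its widest member */
            uint8_t  tag;
            uint32_t len;                /* offset 4, struct size 8 */
        };

        struct __attribute__((packed)) packed_whole {
            uint8_t  tag;
            uint32_t len;                /* offset 1, struct size 5 */
        };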

    Possible:
    Allow 'if()' in structs, but would be evaluated as a compile-time
    constant (so in this sense, it functions more like an ifdef, just evaluated
    later in the process).

    Might also allow VLA-like patterns if the expression is a compile time constant. Could allow a VLA as the final member of a struct, which will
    be understood the same as a zero-element array. Will have the side
    effect that the size of the struct is unknown, and it may not be used in arrays nor as the non-final member of a parent struct (and if present,
    will apply the same property on the parent struct).


    Note that structs may be classified as serializable or non-serializable. Serializable structs will need a fixed and unambiguous size;
    They will explicitly disallow pointers, references, or any other types
    that can't be serialized.

    Serializable structs would be assumed to be able to be safely read from
    or written to a file or socket, ...


    Might make sense, in such a language, to have an object model similar to C#: Structs exist, by-value by default;
    Classes always by-reference, with a single inheritance and interfaces model; Maybe for nicety, assume that interfaces can be mapped to COM-like
    objects (should map the underlying COM layout);
    ...

    Could also assume similar scoping rules to C#, with full scope known at
    the time an EXE or DLL is compiled (any undefined types or variables at
    this stage being a compiler error). The front-end parser and compiler
    would be required to still work even without a full knowledge of the type-system (WRT class-like types), but may enforce stricter constraints
    on normal value types. Though, if doing separate compilation, this only
    allows partial compilation of some features (the object system will need
    to be sorted out at link time).

    Would not have C++ style templates, but could still have generics.


    But:
    No garbage collector;
    Objects may have an explicit automatic lifetime.

    Say:
    Foo! foo();
    Does not mean that it is necessarily stack-allocated or by-value (unlike
    C++), but will mean that 'foo' will be auto-deleted when foo goes out of scope.

    Similar could also be applied to class members, so a T! member is
    auto-deleted when the parent goes out of scope. Could maybe also
    consider "T^" for cases where the member is to use reference counting
    (though it could also make sense on the class definition).

    so, some modifiers could be applied one of several places:
    Class definition: Default behavior to be used, may be overridden.
    Variable: Used in this context, may override class.
    "new()": Used at object creation for dynamically created objects.

    With possible syntax:
    T //base type, default behavior, global lifetime for objects.
    T* //pointer, structs, N/A for class objects
    T! //automatic / parent-scope lifetime
    T^ //reference counted
    T(Z) //zone lifetime

    Typically the stronger rule may be used, with it being a compiler error
    if a variable or member doesn't match the lifetime specified elsewhere
    (though with fudging for "T!" as it would apply to the point of creation and/or place-of-residence of the object in question). As such, it is
    likely that "T!" class members would primarily be initialized in
    constructors (but may be treated as 'final' outside of a constructor for
    the class in question).

    Zones will be compile-time entities. It could be treated as an error for
    an object in a longer-lived zone to hold a reference into a
    shorter-lived zone. Though, it is unclear how to enforce this at compile time.
    Zone lifetime would depend on program control flow rather than being known
    at compile time. Though, a zone-tree could be defined at compile time, and
    the compiler or runtime could error-out or fault if it detects zone
    creation or destruction which deviates from the specified dependency order.

    zonedef Z; //define a zone Z, parent of Z is global
    zonedef Z(Zp); //define zone Z whose lifetime exists within Zp.
    If Z is live and Zp is destroyed, throw.
    If Z is created and Zp is not live, throw
    If an object in Z is created, and Z is not live, throw.
    ...


    In most cases, 'delete' could be discouraged, as the only time delete is likely to be needed is if lifetime is poorly specified in some other
    way. But, we don't need generalized garbage collection, as pretty much
    no one has really made this work acceptably.

    Reference counting may leak memory, though one possibility could be to
    try to detect and flag cycle-formation when creating object graphs, with
    an explicit "weak object reference" being created in cases where cycle-creation is detected (in this case, the reference count is
    special). If the reference count for non-weak references drops to 0, it destroys the object. Downside: This puts some of the computational cost
    of a mark/sweep collector into the code for incrementing and
    decrementing reference counts.

    Though possible is allowing both reference-counting and zones on the
    same object, in which case the zone may clean up leaks from the reference-counter (assuming periodic zone destruction).


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Oct 4 17:28:09 2025
    From Newsgroup: comp.arch

    On 10/4/2025 4:56 AM, BGB wrote:
    On 10/3/2025 4:04 PM, Thomas Koenig wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> schrieb:
    --------------------------------------------------------------
    Integer instructions are now::
        {Signed and unSigned}|u{Byte, HalfWord, Word, DoubleWord}
    while FP instructions are now:
        {Byte, HalfWord, Word, DoubleWord}

    I doubt any compilers will use this feature.
    Strong typed languages don't have predefined operators that allow
    mixing.

    Not sure who's confused, but my reading of the above is not some sort of "mixing": I believe Mitch is just saying that his addition operation
    (for example) can be specified to operate on either one of int8, uint8,
    int16, uint16, ...
    But that specification applies to all inputs and outputs of the
    instruction, so it does not support adding an int8 to an int32, or other "mixes".

    The outputs are correctly extended to a 64-bit number (signed or
    unsigned) so it is possible to pass results to wider operations
    without conversion.

    One example would be

    unsigned long foo (unsigned int a, unsigned int b)
    {
        return a + b;
    }

    which would need an adjustment after the add, and which would
    just be something like

        adduw   r1,r1,r2
        ret

    using Mitch's new encoding.



    Yes.

    Sign extend signed types, zero extend unsigned types.
    Up-conversion is free.


    This is something the RISC-V people got wrong IMO, and adding a bunch of ".UW" instructions in an attempt to patch over it is just kinda ugly.

    Partly for my own uses, I revived ADDWU and SUBWU (which had been dropped
    in BitManip), because these are less bad than the alternative.

    I get annoyed that new extensions keep trying to add ever more ".UW" instructions rather than just having the compiler go over to zero-
    extended unsigned and make this whole mess go away.

    ...



    Ironically, the number of new instructions being added to my own ISA has mostly died off recently, largely because there is little particularly relevant to add at this point (within the realm of stuff that could be added).


    Going and looking back, most major new instructions added were:
    BITMOV and BITMOV.S, ~ 7 months ago
    Some new ops related to FP8A handling and similar, ~ 2 months ago
    Mostly for Bias=7 (where, FP8A=S.E3.M4, or A-Law format)
    I couldn't just change the Bias=8 ops to 7 without breaking stuff;
    But, for non-audio uses 7 is a lot more useful.
    Mostly used for unit vectors,
    where ability to store values >= 1.0 sometimes needed.
    But, most values still < 1.0 ...
    Sorta relates to Trellis re-normalization trickery.
    Stored vector isn't exactly unit-length, but unit post-renorm.

    A few operations in the "possible" category:
    A few NN related packed multiply instructions;
    Instructions for a possible UVF1 packed block format
    (graphics and NN);
    ...

    FPU Compare 3R instructions, ~8 months ago


    While XG3 was added 11 months ago, it isn't really new instructions, so
    much as a new mode and encoding scheme for the same instructions (and it
    was only fairly recently that I got support for predicated instructions implemented in RISC-V).

    And, 12 months ago, a RISC-V target for BGBCC, and jumbo prefixes for
    the RISC-V side, ... Somehow I thought all of this happened several
    years ago, seems it was 1 year.


    Seems initial efforts to start adding RISC-V support were (only) 2 years
    ago.

    A lot more fiddling has been in things mostly related to dealing with
    RISC-V and trying to make it less terrible.


    The stuff for the recent FPU behavior tweaks are more tweaking FPU
    behavior, and haven't really involved adding new instructions (except on
    the RISC-V side, ones which already existed in the RISC-V specs).


    Hmm...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Oct 5 11:58:14 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Sat, 04 Oct 2025 16:11:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    AFAIK Rust does not have a machine-word-sized integer type; instead,
    each type has its size in its name (e.g., i32, u64).

    Rust has machine-dependent isize and usize types

    Good. But for some reason all the examples I have seen use
    integer types like i32 and u64.

    identical to ptrdiff_t and size_t in C.

    I have read that there are C implementations (variants) where ptrdiff_t
    and size_t are smaller than a pointer, in particular large-model C on
    the 8086, and that was the reason for C standard restrictions about
    pointer subtraction and pointer inequality comparison.

    I hope nobody is doing large-model Rust, even though Rust may be more appropriate for that than C.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Oct 5 15:01:06 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?
    ...
    By the time the 64-bit workstations
    were being designed, REAL was firmly established as 32-bit and
    DOUBLE PRECISION as 64-bit, from the /360, the PDP-11, the VAX
    and the very 32-bit workstations that the 64-bit workstations were
    supposed to replace.

    On the PDP-11 C's int is 16 bits. I don't know what FORTRAN's INTEGER
    is on the PDP-11 (but I remember reading about INTEGER*2 and
    INTEGER*4, AFAIK not in a PDP-11 context). In any case, I expect that FORTRAN's REAL was 32-bit on a PDP-11, and that any rule chain that
    requires that C's int is as wide as FORTRAN's REAL is broken at some
    point on the PDP-11.

    So your rules do not even work for the first machine where C has been implemented. If shortsighted FORTRAN people look at 32-bit machines
    and become accustomed to C's int being as wide as FORTRAN's INTEGER
    and REAL, they could have known from the PDP-11 that that's going to
    break for other machine word sizes.

    So, put yourself into the shoes of the people designing R4000
    workstations: they could allow their scientific and technical customers
    to use the same codes "as is", with no conversion, or tell them
    they cannot use 32-bit REAL any more, and that they need to rewrite
    all their software.

    If they want to use their software as-is, and it is written to work
    with an ILP32 C implementation, the only solution is to continue using
    an ILP32 implementation. That's not only for FORTRAN/C mixing, but
    for most C code of the day, certainly with I32LP64; I expect that the
    porting effort would have been smaller with ILP64, but there still
    would have been some.

    BTW, we have a DecStation 5000/150 with an R4000, and all C compilers
    on this machine support ILP32 and nothing else.

    What would they have expected their customers to do? Buy a system
    which forces them to do this, or buy a competitor's system where
    they can just recompile their software?

    If just recompiling is the requirement, what follows is ILP32.

    You're always harping about how compilers should be bug-compatible
    to previous releases.

    Not in the least. I did not ask for bug compatibility.

    I also did not ask for "compiling as is" on a different architecture,
    much less on a system with different address size.

    I have actually written up what I ask for: <https://www.complang.tuwien.ac.at/papers/ertl17kps.pdf>. Maybe you
    should read it one day, or reread it given that you have forgotten it.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Oct 5 18:19:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    <snip>

    Not in the least. I did not ask for bug compatibility.

    I also did not ask for "compiling as is" on a different architecture,
    much less on a system with different address size.

    I have actually written up what I ask for: <https://www.complang.tuwien.ac.at/papers/ertl17kps.pdf>. Maybe you
    should read it one day, or reread it given that you have forgotten it.

    In the referenced article you write::
    "Access to uninitialized data is another issue where absolute equivalence
    with the basic model would make important optimizations impossible. Consider
    a variable v at the end of its life (e.g., at the end of a function). Unless the compiler can prove that the location of the variable is not read later
    as a result of reading uninitialized data (say, reading the uninitialized variable w living in the same location in a different function), v would
    have to stay in the same location in future compiler versions or other optimization levels; or at least the final value of v would have to be
    stored in this location, and the initial value of w would have to be
    fetched from this location."

    If variable v and variable w are "stack variables" local to their own subroutines, it seems perfectly reasonable to assume that all deallocated
    stack variables become inaccessible. Then, later when new stack space is allocated those new variables have no relationship to any previously deallocated variables.

    That is: when the stack pointer is incremented the space is no longer accessible and::
    a) any modified cache lines are discarded instead of being written
    to memory--the space is no longer accessible so don't waste power
    making DRAM coherent with inaccessible stack space.

    Later, when the stack pointer is decremented::
    b) new cache line area can be "allocated" without reading DRAM and
    being <conceptually> initialized to zero.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Oct 5 19:30:42 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    On the PDP-11 C's int is 16 bits. I don't know what FORTRAN's INTEGER
    is on the PDP-11 (but I remember reading about INTEGER*2 and
    INTEGER*4, AFAIK not in a PDP-11 context). In any case, I expect that FORTRAN's REAL was 32-bit on a PDP-11, and that any rule chain that
    requires that C's int is as wide as FORTRAN's REAL is broken at some
    point on the PDP-11.

    I wrote INFort, one of the two F77 implementations for the PDP-11.
    INTEGER and REAL were the same size because that's what the standard
    said, and any program that used EQUIVALENCE would break otherwise. If
    you wanted shorter ints, INTEGER*2 provided them.

    Bell Labs independently wrote f77 around the same time, and its manual says they did the same thing, INTEGER was C long int, INTEGER*2 was short int.

    If the speed difference mattered, it wasn't hard to say something like

    IMPLICIT INTEGER*2(I-N)

    to make your ints short.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Oct 5 19:51:26 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The I32LP64 mistake

    If you consider I32LP64 a mistake, how should FORTRAN's (note the
    upper case, this is pre-Fortran-90) storage association rules have
    been handled, in your opinion?
    ...
    By the time the 64-bit worksations
    were being designed, REAL was firmly established as 32-bit and
    DOUBLE PRECISION as 64-bit, from the /360, the PDP-11, the VAX
    and the very 32-bit workstations that the 64-bit workstations were
    supposed to replace.

    On the PDP-11 C's int is 16 bits. I don't know what FORTRAN's INTEGER
    is on the PDP-11 (but I remember reading about INTEGER*2 and
    INTEGER*4, AFAIK not in a PDP-11 context). In any case, I expect that FORTRAN's REAL was 32-bit on a PDP-11, and that any rule chain that
    requires that C's int is as wide as FORTRAN's REAL is broken at some
    point on the PDP-11.

    It is possible to have a two-byte integer and a 32-bit real.
    Storage association then requires four bytes for an integer.
    This wastes space for integers (at least for arrays) but that
    is not such a big deal, because most big arrays in scientific
    code are reals.

    The same held for the Cray-1 - default integers (24 bit)
    and their weird 64-bit reals.

    The main problem is when the size of the default INTEGER _exceeds_ that
    of the smallest useful REAL: then REAL arrays become twice as big,
    plus you need to implement 128-bit REALs.

    So, put yourself into the shoes of the people designing R4000
    workstations: they could allow their scientific and technical customers
    to use the same codes "as is", with no conversion, or tell them
    they cannot use 32-bit REAL any more, and that they need to rewrite
    all their software.

    If they want to use their software as-is, and it is written to work
    with an ILP32 C implementation, the only solution is to continue using
    an ILP32 implementation.

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.


    What would they have expected their customers to do? Buy a system
    which forces them to do this, or buy a competitor's system where
    they can just recompile their software?

    If just recompiling is the requirement, what follows is ILP32.

    There is absolutely no problem with 64-bit pointers when recompiling
    Fortran.


    You're always harping about how compilers should be bug-compatible
    to previous releases.

    Not in the least. I did not ask for bug compatibility.

    I'll keep that in mind for the next time.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Oct 6 05:56:53 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    If variable v and variable w are "stack variables" local to their own subroutines, it seems perfectly reasonable to assume that all deallocated stack variables become inaccessible.

    That is debatable. This assumption is the basis of "optimizing" away
    memset() (or similar) that is intended to keep the lifetime of secret
    keys as short as possible. After this "optimization", the secret key
    continues to be in memory, and can be extracted through
    vulnerabilities, preserved for much longer in the swap area or in
    snapshots, or in the value of newly allocated uninitialized areas.
    All of which prove that the assumption is wrong.
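
    The canonical example of that problem (a sketch): the final memset is a
    dead store from the compiler's point of view and may be removed, which
    is why explicit_bzero() and C11's optional memset_s() exist.

        #include <string.h>

        void use_key(void)
        {
            unsigned char key[32];
            /* ... fill key and use it ... */
            memset(key, 0, sizeof key);   /* may be optimized away entirely */
            /* explicit_bzero(key, sizeof key); would not be removed */
        }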

    Then, later when new stack space is
    allocated, those new variables have no relationship to any previously deallocated variables.

    That is: when the stack pointer is incremented the space is no longer accessible and::
    a) any modified cache lines are discarded instead of being written
    to memory--the space is no longer accessible so don't waste power
    making DRAM coherent with inaccessible stack space.

    Later, when the stack pointer is decremented::
    b) new cache line area can be "allocated" without reading DRAM and
    being <conceptually> initialized to zero.

    I have outlined ways to optimize zeroing of memory in <2014Jul9.193122@mips.complang.tuwien.ac.at> <2022Aug5.141325@mips.complang.tuwien.ac.at>

    With that idea, the way to use it is to zero the memory when it is
    deallocated (so it is not written back to main memory; it may be
    written to the zero area as part of a larger unit). And to also zero
    it when it is allocated so that there is no need to load the data from
    outer cache levels or main memory (or their equivalents in zeroed
    memory).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Oct 6 06:26:12 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    [...]
    It is possible to have a two-byte integer and a 32-bit real.

    But according to John Levine that is not what happens on the PDP-11.
    Instead, it has 4-byte INTEGERs, demonstrating that your "unofficial
    rule" that C int is as wide as FORTRAN INTEGER did not hold.

    The same held for the Cray-1 - default integers (24 bit)
    and their weird 64-bit reals

    If FORTRAN INTEGERs are 24 bits on the Cray-1, this architecture is
    another example where your "unofficial rule" does not hold. C ints
    are 64-bit on the Cray 1.

    If they want to use their software as-is, and it is written to work
    with an ILP32 C implementation, the only solution is to continue using
    an ILP32 implementation.

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either. And I am sure that C
    programs were much more relevant for selling these machines than
    FORTRAN programs. C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    If just recompiling is the requirement, what follows is ILP32.

    There is absolutely no problem with 64-bit pointers when recompiling
    Fortran.

    Fortran is not the only consideration for designing an ABI for C, if
    it is one at all. The large number of 32bit->64bit sign-extension and zero-extension operations, either explicitly, or integrated into
    instructions such as RISC-V's addw, plus the
    "optimizations"/miscompilations to ged rid of some of the sign
    extensions are a cost that we pay all the time for the I32LP64
    mistake.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Oct 6 14:23:50 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    [...]
    <snip>

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64.

    The vast majority of C/C++ programs ran just fine on I32LP64. There
    were some that didn't, but it was certainly not "most".
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Oct 6 11:51:18 2025
    From Newsgroup: comp.arch

    On 10/6/2025 9:23 AM, Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    [...]
    <snip>

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64.

    The vast majority of C/C++ programs ran just fine on I32LP64. There
    were some that didn't, but it was certainly not "most".

    Yes, most programs only needed minor edits.


    Some stuff I had ported:
    Doom: Mostly trivial edits;
    Had to re-implement audio and music handling.
    Heretic and Hexen:
    More edits, mostly removing MS-DOS stuff;
    Had to replace most of the audio and music code.
    ROTT:
    Extensive modification to graphics handling;
    Was very dependent on low-level VGA hardware twiddling.
    (Vs Doom's "Set 320x200 and done" approach).
    Lots of memory management and out-of-bounds issues;
    Some amount of code that is sensitive to integer wrap-on-overflow;
    ...
    (ROTT was a little harder to port)
    Quake:
    Few issues for most of the engine;
    The "progs.dat" VM required getting creative.
    It mixes pointers and 'float' in ways
    "some might consider unnatural"
    Quake 2:
    Basically 64-bit clean out of the box.
    Quake 3:
    The QVM architecture very much assumes 32-bit,
    not really a way to make it 64-bit absent a significant rewrite.
    Did allow for falling back to the Quake2 strategy,
    of using natively compiled DLLs.


    Of the programs, I still have not fully debugged ROTT when built via
    BGBCC, where there is an issue somewhere that is resulting in demo
    desyncs that tend to change from one run to another.

    Last I checked, I had it stable when built with MSVC, and had it
    basically working with a GCC build.


    Can note that ROTT is one of the larger programs I had ported to my
    project (in terms of code size), where both the ROTT and Quake3 ports
    weigh in at a little over 300 kLOC (very much larger than Doom or Quake).

    Quake 3 builds as multiple DLLs, whereas ROTT as a single binary. As
    such, ROTT currently builds the biggest EXE (with around 1MB of ".text").

    Though, curiously, there are (on average) less than 4 bytes per line of
    C; not entirely sure how that happens.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Oct 6 17:38:13 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *). This was
    not true on the PDP-11, and it was a standards violation, anyway.
    Only people who liked to play these kind of games (I know you do)
    were caught.
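
    The kind of code that was not "64-bit-clean" looked roughly like this
    (a sketch): it stores a pointer in an int, which happens to work under
    ILP32 but truncates under I32LP64.

        #include <stdint.h>

        void *roundtrip(void *p)
        {
            int h = (int)(intptr_t)p;      /* fits on ILP32, truncates on I32LP64 */
            return (void *)(intptr_t)h;    /* not equal to p once addresses exceed 32 bits */
        }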

    And I am sure that C
    programs were much more relevant for selling these machines than
    FORTRAN programs.

    Based on what data? Your own personal guess?

    C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    The problem with 64-bit INTEGERs for Fortran is that they make REAL
    unusable for lots of existing code.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Mon Oct 6 20:02:50 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *). This was
    not true on the PDP-11, ...

    The PDP-11 was a 16 bit machine with 16 bit ints and 16 bit pointers.
    There were 32 bit long and float, and 64 bit double.

    I didn't port a lot of code from the 11 to other machines, but my recollection is that the widespread assumption in Berkeley Vax code that location zero was addressable and contained binary zeros was much more painful to fix than
    size issues.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Oct 6 20:46:11 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *). This was
    not true on the PDP-11, ...

    The PDP-11 was a 16 bit machine with 16 bit ints and 16 bit pointers.
    There were 32 bit long and float, and 64 bit double.

    I didn't port a lot of code from the 11 to other machines, but my recollection is that the widespread assumption in Berkeley Vax code that location zero was addressable and contained binary zeros was much more painful to fix than
    size issues.

    "location zero was addressible". Might also point out it was RO, but yes
    that caused many problems porting BSD utilities to SVR4.

    The other issue with leaving the PDP-11 for 32-bit systems was the change
    in the size of the PID, UID, and GID. Which required more than a simple recompile, since there weren't abstract types (e.g. pid_t, gid_t, uid_t)
    for those data items yet, so code needed to be updated manually.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Tue Oct 7 01:38:02 2025
    From Newsgroup: comp.arch

    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    LLVM compiles C with stricter typing than GCC resulting in a lot
    of smashes:: For example::

    int subroutine( int a, int b )
    {
    return a+b;
    }

    Compiles into:

    subroutine:
    ADD R1,R1,R2
    SRA R1,R1,<32,0> // limit result to (int)
    RET

    I tested this on AMD64, and did not find sign-extension in the caller, neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret

    It's not about strict or lax typing, it's about what the calling
    convention promises about types that are smaller than a machine word.
    If the calling convention requires/guarantees that ints are
    sign-extended, the compiler must use instructions that produce a sign-extended result. If the calling convention guarantees that ints
    are zero-extended (sounds perverse, but RV64 has the guarantee that
    unsigned is passed in sign-extended form, which is equally perverse),
    then the compiler must use instructions that produce a zero-extended
    result (e.g., AMD64's addl). If the calling convention only requires
    and guarantees the low-order 32 bits (I call this garbage-extended),
    then the compiler can use instructions that perform 64-bit adds; this
    is what we are seeing above.

    The other side of the medal is what is needed at the caller: If the
    caller needs to convert a sign-extended int into a long, it does not
    have to do anything. If it needs to convert a zero-extended or garbage-extended int into a long, it has to sign-extend the value.

    AMD64 in hardware does 0 extension of 32-bit operations. From your
    example "lea (%rdi,%rsi,1),%eax" (AT&T notation, so %eax is the dest),
    the 64-bit register %rax will have 0's written into bits [63:32].
    So the AMD64 convention for 32-bit values in 64-bit registers is to
    zero-extend on writes. And to ignore the upper 32-bits on reads, so
    using a 64-bit register should use the %exx name.

    I agree with you that I32LP64 was a mistake, but it exists, and I
    think ARM64 did a good job handling it. It has all integer operations
    working on two sizes: 32-bit and 64-bit, and when writing a 32-bit result,
    it 0-extends the register value.

    You don't want "garbage extend" since you want a predictable answer.
    Your choices for writing 32-bit results in a 64-bit register are thus sign-extend (not a good choice) or zero-extend (what almost
    everyone chose). RISC-V is in another land, where they effectively have
    no 32-bit operations, but rather a convention that all 32-bit inputs
    must be sign-extended in a 64-bit register.

    For C and C++ code, the standard dictates that all integer operations are
    done with "int" precision, unless some operand is larger than int, and then
    do it in that precision. So there's no real need for 8-bit and 16-bit operations to be natively supported by the CPU--these operations are actually done
    as int's already. If you have a variable which is a byte, then assigning
    to that variable, and then using that variable again you will need to zero-extend, but honestly, this is not usually a performance path. It's
    likely to be stored to memory instead, so no masking or sign extending
    should be needed.
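
    A small example of that point (a sketch): the operands below are
    promoted to int before the add, so the CPU needs no native 8-bit add;
    only storing back into the narrow type implies a truncation.

        #include <stdint.h>

        uint8_t sum8(uint8_t a, uint8_t b)
        {
            return a + b;   /* a and b promoted to int; result converted back to 8 bits */
        }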

    If you pick ILP64 for your ABI, then you will get rid of almost all of
    these zero- and sign-extensions of 32-bit C and C++ code. It will just
    work. If you pick I32LP64, then you should have a full suite of 32-bit operations and 64-bit operations, at least for all add, subtract, and
    compare operations. And if you do I32LP64, your indexed addressing
    modes should have 3 types of indexed registers: 64-bit, 32-bit signed,
    and 32-bit unsigned. That worked well for ARM64.
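
    A sketch of why the three index flavors matter under I32LP64: each of
    the loads below needs a different treatment of the index before the
    64-bit address add, which ARM64 folds into the addressing mode (the
    SXTW/UXTW extended-register forms).

        double load_s(const double *p, int i)      { return p[i]; }  /* sign-extend i */
        double load_u(const double *p, unsigned i) { return p[i]; }  /* zero-extend i */
        double load_l(const double *p, long i)     { return p[i]; }  /* no extension  */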

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 7 15:52:17 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    LLVM compiles C with stricter typing than GCC resulting in a lot
    of smashes:: For example::

    int subroutine( int a, int b )
    {
    return a+b;
    }

    Compiles into:

    subroutine:
    ADD R1,R1,R2
    SRA R1,R1,<32,0> // limit result to (int)
    RET

    I tested this on AMD64, and did not find sign-extension in the caller, neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret

    It's not about strict or lax typing, it's about what the calling
    convention promises about types that are smaller than a machine word.
    If the calling convention requires/guarantees that ints are
    sign-extended, the compiler must use instructions that produce a sign-extended result. If the calling convention guarantees that ints
    are zero-extended (sounds perverse, but RV64 has the guarantee that unsigned is passed in sign-extended form, which is equally perverse),
    then the compiler must use instructions that produce a zero-extended
    result (e.g., AMD64's addl). If the calling convention only requires
    and guarantees the low-order 32 bits (I call this garbage-extended),
    then the compiler can use instructions that perform 64-bit adds; this
    is what we are seeing above.

    The other side of the medal is what is needed at the caller: If the
    caller needs to convert a sign-extended int into a long, it does not
    have to do anything. If it needs to convert a zero-extended or garbage-extended int into a long, it has to sign-extend the value.

    AMD64 in hardware does 0 extension of 32-bit operations. From your
    example "lea (%rdi,%rsi,1),%eax" (AT&T notation, so %eax is the dest),
    the 64-bit register %rax will have 0's written into bits [63:32].
    So the AMD64 convention for 32-bit values in 64-bit registers is to
    zero-extend on writes, and to ignore the upper 32 bits on reads, so code
    that wants just the 32-bit value should use the %exx register name.

    I agree with you that I32LP64 was a mistake, but it exists, and I
    think ARM64 did a good job handling it. It has all integer operations working on two sizes: 32-bit and 64-bit, and when writing a 32-bit result,
    it 0-extends the register value.

    You don't want "garbage extend" since you want a predictable answer.

    Strongly Agree.

    Your choices for writing 32-bit results in a 64-bit register are thus sign-extend (not a good choice) or zero-extend (what almost
    everyone chose). RISC-V is in another land, where they effectively have
    no 32-bit operations, but rather a convention that all 32-bit inputs
    must be sign-extended in a 64-bit register.

    Why not zero extend unSigned and sign extend Signed ?!?
    That way the value in the register is (IS) the value in the smaller
    container !!

    Also, why not extend this to both shorts and chars ?!?

    For C and C++ code, the standard dictates that all integer operations are
    done with "int" precision, unless some operand is larger than int, in which
    case they are done in that larger precision. So there's no real need for
    8-bit and 16-bit operations to be supported natively by the CPU--these
    operations are already done as ints. If you have a variable which is a
    byte, then assigning to that variable and then using it again will require
    a zero-extend,

    You could perform the operation at base-size (byte in this case).

    Languages like Ada are not defined like C.

    but honestly, this is not usually a performance path. It's likely to be stored to memory instead, so no masking or sign extending
    should be needed.

    If you pick ILP64 for your ABI, then you will get rid of almost all of
    these zero- and sign-extensions of 32-bit C and C++ code.

    Then, the only access to 32-bit integers is int32_t and uint32_t.

    It will just
    work. If you pick I32LP64, then you should have a full suite of 32-bit operations and 64-bit operations, at least for all add, subtract, and
    compare operations. And if you do I32LP64, your indexed addressing
    modes should have 3 types of indexed registers: 64-bit, 32-bit signed,
    and 32-bit unsigned. That worked well for ARM64.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 7 11:27:39 2025
    From Newsgroup: comp.arch

    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    int subroutine( int a, int b )
    {
    return a+b;
    }
    ...
    I tested this on AMD64, and did not find sign-extension in the caller,
    neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret
    ...
    AMD64 in hardware does 0 extension of 32-bit operations. From your
    example "lea (%rdi,%rsi,1),%eax" (AT&T notation, so %eax is the dest),
    the 64-bit register %rax will have 0's written into bits [63:32].
    So the AMD64 convention for 32-bit values in 64-bit registers is to
    zero-extend on writes. And to ignore the upper 32-bits on reads, so
    using a 64-bit register should use the %exx name.

    Interesting. At some point I got the impression that LEA produces a
    64-bit result, because it produces an address, but testing reveals
    that LEA has a 32-bit zero-extended variant indeed.

    I agree with you that I32LP64 was a mistake, but it exists, and I
    think ARM64 did a good job handling it. It has all integer operations
    working on two sizes: 32-bit and 64-bit, and when writing a 32-bit result,
    it 0-extends the register value.

    You don't want "garbage extend" since you want a predictable answer.

    Zero-extended for unsigned and sign-extended for int are certainly
    more forgiving when some function is called without a prototype and
    the actual type does not match the implied type (I once read about
    IIRC miranda prototypes, but a web search only gives me Star Trek
    stuff when I ask for that).

    Zero-extending for int is less forgiving. Apparently by 2003 (when
    AMD64 appeared) the use of prototypes was widespread enough that such
    a calling convention was acceptable.

    But once all the functions have correct prototypes, garbage-extension
    is just as workable as other alternatives.

    Your choices for writing 32-bit results in a 64-bit register are thus
    sign-extend (not a good choice) or zero-extend (what almost everyone chose).

    What makes you think that one is a better choice than the other?

    The most obvious choices to me are:

    Sign-extend int and zero-extend unsigned: That has the best chance at
    the expected behaviour when the prototype is missing and would be
    required.

    If you rely on prototypes being present, you can take any choice,
    including garbage-extension. Then you can use the full 64-bit
    operation in many cases, and only insert sign or zero extension when a conversion from 32-bit to 64 bit is needed (and that extension can be
    part of an instruction, as in ARM A64 addressing modes).

    As for what "almost everyone chose", here's some data:

    int            unsigned       ABI
    sign-extended  sign-extended  MIPS o64 and 64
    sign-extended  zero-extended  SPARC V9
    sign-extended  zero-extended  PowerPC64
    zero-extended  zero-extended  AMD64
    zero-extended  zero-extended  ARM A64
    sign-extended  sign-extended  RV64

    I determined this by looking at the code for

    unsigned usubroutine( unsigned a, unsigned b )
    {
      return a+b;
    }

    int isubroutine( int a, int b )
    {
      return a+b;
    }

    The code on various architectures (as compiled with gcc -O) is:

    MIPS64 (gcc -mabi=64 -O and gcc -mabi=o64 -O):
    0000000000000034 <usubroutine>:
    34: 03e00008 jr ra
    38: 00851021 addu v0,a0,a1

    000000000000003c <isubroutine>:
    3c: 03e00008 jr ra
    40: 00851021 addu v0,a0,a1

    SPARC V9:
    0000000000000018 <usubroutine>:
    18: 9d e3 bf 50 save %sp, -176, %sp
    1c: b0 06 00 19 add %i0, %i1, %i0
    20: 81 cf e0 08 return %i7 + 8
    24: 91 32 20 00 srl %o0, 0, %o0

    0000000000000028 <isubroutine>:
    28: 9d e3 bf 50 save %sp, -176, %sp
    2c: b0 06 00 19 add %i0, %i1, %i0
    30: 81 cf e0 08 return %i7 + 8
    34: 91 3a 20 00 sra %o0, 0, %o0

    PowerPC64:
    0000000000000030 <.usubroutine>:
    30: 7c 63 22 14 add r3,r3,r4
    34: 78 63 00 20 clrldi r3,r3,32
    38: 4e 80 00 20 blr
    ...

    0000000000000048 <.isubroutine>:
    48: 7c 63 22 14 add r3,r3,r4
    4c: 7c 63 07 b4 extsw r3,r3
    50: 4e 80 00 20 blr

    RISC-V is in another land, where they effectively have
    no 32-bit operations, but rather a convention that all 32-bit inputs
    must be sign-extended in a 64-bit register.

    RISC-V has a number of sign-extending 32-bit instructions, and a
    calling convention to go with it.

    There seem to be the following options:

    Have no 32-bit instructions, and insert sign-extension or
    zero-extension instructions where necessary (or implicitly in all
    operands, as I outlined earlier). SPARC V9 and PowerPC64 seem to take
    this approach.

    Have 32-bit instructions that sign-extend: MIPS64, Alpha, and RV64.

    Have 32-bit instructions that zero-extend: AMD64 and ARM A64.

    Have 32-bit instructions that sign-extend and 32-bit instructions that zero-extend. No architecture that does that is known to me. It would
    be a good match for the SPARC-V9 and PowerPC64 calling convention.

    There is also one instruction set (ARM A64) that has special 32-bit sign-extension and zero-extension forms for some operands.

    And you can then adapt the calling convention to match the instruction
    set. For "no 32-bit instructions", garbage-extension seems to be the
    cheapest approach to me, but I expect that when SPARC-V9 and PowerPC64
    came on the market, there was enough C code with missing prototypes
    around that they preferred a more forgiving calling convention.

    If you pick ILP64 for your ABI, then you will get rid of almost all of
    these zero- and sign-extensions of 32-bit C and C++ code. It will just
    work. If you pick I32LP64, then you should have a full suite of 32-bit
    operations and 64-bit operations, at least for all add, subtract, and
    compare operations.

    For compare, divide, shift-right and rotate, you either first need to sign/zero-extend the register, or you need 32-bit versions (possibly
    both signed and unsigned).
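    A small C sketch of why (hypothetical function; imagine the 32-bit value
    living garbage-extended in the low half of a 64-bit register):

        #include <stdint.h>

        /* With garbage in bits 63..32, a plain 64-bit shift would pull that
           garbage down into the result, so the value must be zero-extended
           first (or a 32-bit shift form must exist). */
        uint32_t shr4(uint64_t reg_with_u32_in_low_half)
        {
            uint64_t zext = (uint32_t)reg_with_u32_in_low_half; /* explicit extend */
            return (uint32_t)(zext >> 4);
        }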

    And if you do I32LP64, your indexed addressing
    modes should have 3 types of indexed registers: 64-bit, 32-bit signed,
    and 32-bit unsigned. That worked well for ARM64.

    It is certainly part of the way towards my idea of having sign- and zero-extended 32-bit operands for every operand of every instruction.

    It would be interesting to see how many sign-extensions and
    zero-extensions (whether explicit or implicitly part of the
    instruction) are executed in code that is generated from various C
    sources (with and without -fwrapv). I expect that it's highly
    dependent on the programming style. Sure there are types like pid_t
    where you have no choice, but in frequently occurring cases you can
    choose:

    for (i=0; i<n; i++) {
      ... a[i] ...
    }

    Here you can choose whether to define i as int, unsigned, long,
    unsigned long, size_t, etc. If you care for portability to 16-bit
    machines, size_t is a good idea here, otherwise long and unsigned long
    also are efficient. If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    If n is int, you can also choose int, and there is actually enough
    information here to make the code efficient (even with -fwrapv),
    because in this code int overflow really cannot happen, but in code
    that's not much different from this one (e.g., using != instead of <),
    -fwrapv will result in an inserted sign extension on AMD64, and not
    using -fwrapv may result in unintended behaviour thanks to the
    compiler assuming that int overflow does not happen.
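    A sketch of that contrast (hypothetical functions; the point is only that
    the != form gives the compiler less to work with under -fwrapv):

        long sum_lt(long *a, int n)
        {
            long r = 0;
            for (int i = 0; i < n; i++)   /* i provably stays in [0,n); the index
                                             can be widened once, before the loop */
                r += a[i];
            return r;
        }

        long sum_ne(long *a, int n)
        {
            long r = 0;
            for (int i = 0; i != n; i++)  /* with -fwrapv, i may legally wrap, so
                                             a sign extension of i may be needed
                                             on each iteration when indexing      */
                r += a[i];
            return r;
        }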

    ILP64 would have spared us all these considerations.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Oct 7 18:01:25 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    int subroutine( int a, int b )
    {
    return a+b;
    }
    ...
    I tested this on AMD64, and did not find sign-extension in the caller,
    neither with gcc-14 nor with clang-19; both produce the following code
    for your example (with "subroutine" renamed into "subroutine1").

    0000000000000000 <subroutine1>:
    0: 8d 04 37 lea (%rdi,%rsi,1),%eax
    3: c3 ret
    ...
    AMD64 in hardware does 0 extension of 32-bit operations. From your
    example "lea (%rdi,%rsi,1),%eax" (AT&T notation, so %eax is the dest),
    the 64-bit register %rax will have 0's written into bits [63:32].
    So the AMD64 convention for 32-bit values in 64-bit registers is to
    zero-extend on writes. And to ignore the upper 32-bits on reads, so
    using a 64-bit register should use the %exx name.

    Interesting. At some point I got the impression that LEA produces a
    64-bit result, because it produces an address, but testing reveals
    that LEA has a 32-bit zero-extended variant indeed.

    Architecturally, any store to a 32-bit register (%e_x) will
    clear the high-order bits of the 64-bit version of the
    register.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 7 18:34:45 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct4.121741@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    int subroutine( int a, int b )
    {
    return a+b;
    }
    ------------------------------------------------------------

    RISC-V is in another land, where they effectively have
    no 32-bit operations, but rather a convention that all 32-bit inputs
    must be sign-extended in a 64-bit register.

    RISC-V has a number of sign-extending 32-bit instructions, and a
    calling convention to go with it.

    RISC-V has word sized integer arithmetic.

    There seem to be the following options:

    Have no 32-bit instructions, and insert sign-extension or
    zero-extension instructions where necessary (or implicitly in all
    operands, as I outlined earlier). SPARC V9 and PowerPC64 seem to take
    this approach.

    This was My 66000 between 2016 and two weeks ago.
    The cost is 4% growth in code footprint and similar perf degradation.

    Have 32-bit instructions that sign-extend: MIPS64, Alpha, and RV64.

    Have 32-bit instructions that zero-extend: AMD64 and ARM A64.

    Have 32-bit instructions that sign-extend and 32-bit instructions that zero-extend. No architecture that does that is known to me. It would
    be a good match for the SPARC-V9 and PowerPC64 calling convention.

    This is the starting point for My 66000 2.0:: integer arithmetic has
    size and signedness, with the property that all integer results have
    the 64-bit register <container> contain a range-limited result suitable
    to the base-type of the calculation {no garbage in HoBs}.

    There is also one instruction set (ARM A64) that has special 32-bit sign-extension and zero-extension forms for some operands.

    And you can then adapt the calling convention to match the instruction
    set. For "no 32-bit instructions", garbage-extension seems to be the cheapest approach to me, but I expect that when SPARC-V9 and PowerPC64
    came on the market, there was enough C code with missing prototypes
    around that they preferred a more forgiving calling convention.

    If you pick ILP64 for your ABI, then you will get rid of almost all of
    these zero- and sign-extensions of 32-bit C and C++ code. It will just
    work. If you pick I32LP64, then you should have a full suite of 32-bit
    operations and 64-bit operations, at least for all add, subtract, and
    compare operations.

    For compare, divide, shift-right and rotate, you either first need to sign/zero-extend the register, or you need 32-bit versions (possibly
    both signed and unsigned).

    My 66000 CMP is signless--it compares two integer registers and delivers
    a bit vector of all possible comparisons {2 equality, 4 signed, 4 unsigned,
    4 range checks, [and in FP land 10-bits are the class of the RS1 operand]}

    My 66000 SL, SR can be used in extract form--and here you need no operand preparation if you only extract meaningful bits.

    My 66000 2.0 DIV has a size component to the calculation.

    And if you do I32LP64, your indexed addressing
    modes should have 3 types of indexed registers: 64-bit, 32-bit signed,
    and 32-bit unsigned. That worked well for ARM64.

    It is certainly part of the way towards my idea of having sign- and zero-extended 32-bit operands for every operand of every instruction.

    Unnecessary if the integer calculation delivers properly range-limited
    64-bit results.

    It would be interesting to see how many sign-extensions and
    zero-extensions (whether explicit or implicitly part of the
    instruction) are executed in code that is generated from various C
    sources (with and without -fwrapv).

    In GNUPLOT it is just over 4% of instruction count for 64-bit-only
    integer calculations.

    I expect that it's highly
    dependent on the programming style. Sure there are types like pid_t
    where you have no choice, but in frequently occuring cases you can
    choose:

    for (i=0; i<n; i++) {
    ... a[i] ...
    }

    Here you can choose whether to define i as int, unsigned, long,
    unsigned long, size_t, etc. If you care for portability to 16-bit
    machines, size_t is a good idea here, otherwise long and unsigned long
    also are efficient.

    Counted for() loops are somewhat special in that it is quite easy to
    determine that the loop index never exceeds the range-limit of the
    container.

    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and PowerPC64 and Alpha).

    Example please !?!

    If n is int, you can also choose int, and there is actually enough information here to make the code efficient (even with -fwrapv),
    because in this code int overflow really cannot happen,

    Consider the case where n is int64_t or uint64_t !?!

    Consider the C-preprocessor with::
    # define int (short int) // !!
    in scope.

    but in code
    that's not much different from this one (e.g., using != instead of <), -fwrapv will result in an inserted sign extension on AMD64, and not
    using -fwrapv may result in unintended behaviour thanks to the
    compiler assuming that int overflow does not happen.

    ILP64 would have spared us all these considerations.

    Agreed. I32LP64 is an abomination, especially if one is bothering to
    try to keep the number of instructions down.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Oct 7 12:20:08 2025
    From Newsgroup: comp.arch

    On 10/3/2025 12:55 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/2/2025 7:50 PM, MitchAlsup wrote:

    My 66000 2.0

    After 4-odd years of ISA stability, I ran into a case where
    I <pretty much> needed to change the instruction formats.
    And after bragging to Quadribloc about its stability--it
    reached the point where it was time to switch to version 2.0.

    Well, its time to eat crow.
    --------------------------------------------------------------
    Memory reference instructions already produce 64-bit values
    from Byte, HalfWord, Word and DoubleWord memory references
    in both Signed and unSigned flavors. These supports both
    integer and floating point due to the single register file.

    Essentially, I need that property in both integer and floating
    point calculations to eliminate instructions that merely apply
    value range constraints--just like memory !

    ISA 2.0 changes allows calculation instructions; both Integer
    and Floating Point; and a few other miscellaneous instructions
    (not so easily classified) the same uniformity.

    In all cases, an integer calculation produces a 64-bit value
    range limited to that of the {Sign}|u{Size}--no garbage bits
    in the high parts of the registers--the register accurately
    represents the calculation as specified {Sign}|u{Size}.

    I must be missing something. Suppose I have

    C := A + B

    where A and C are 16 bit signed integers and B is an 8 bit signed
    integer. As I understand what you are doing, loading B into a register
    will leave the high order 56 bits zero. But the add instruction will
    presumably be half word, so if B is negative, it will get an incorrect
    answer (because B is not sign extended to 16 bits).

    What am I missing?

    A is loaded as 16 bits, properly sign-extended to 64 bits: range [-32768..32767]
    B is loaded as 8 bits, properly sign-extended to 64 bits: range [-128..127]

    ADDSH Rc,Ra,Rb

    Adds 64-bit Ra and 64-bit Rb and then sign extends the result from bit<15>. The result is a properly signed 64-bit value: range [-32768..32767]
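    In C terms, a minimal sketch of the ADDSH behaviour described above (a
    hypothetical helper, not actual ISA code):

        #include <stdint.h>

        /* full 64-bit add, then sign-extend the sum from bit 15 */
        static int64_t addsh(int64_t ra, int64_t rb)
        {
            uint64_t low16 = ((uint64_t)ra + (uint64_t)rb) & 0xFFFF;
            return (int64_t)(low16 ^ 0x8000) - 0x8000;  /* range [-32768..32767] */
        }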

    First let me apologize, then admit my embarrassment. I didn't write
    what I intended to, and even if I did, it wouldn't have been correct.

    I had totally missed the issue of perhaps not extending the result of an arithmetic operation to the full register width. I must admit that this
    never came up in the programming I have done, and I never considered it.
    But subsequent posts in this thread have explained the issue well, and
    so I learned something. Thanks to all!
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Oct 7 19:09:25 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ...
    My 66000 CMP is signless--it compares two integer registers and delivers
    a bit vector of all possible comparisons {2 equality, 4 signed, 4 unsigned,
    4 range checks, [and in FP land 10-bits are the class of the RS1 operand]}

    With an 88000-style compare and a result register of 64 bits, you can
    spend 14 bits on 64-bit comparison, 14 bits on 32-bit comparison, 14
    bits on 16-bit comparison, and 14 bits on 8-bit comparison, and still
    have 8 bits left. What is a "range check" and why does it take 4
    bits?

    It is certainly part of the way towards my idea of having sign- and
    zero-extended 32-bit operands for every operand of every instruction.

    Unnecessary if the integer calculation delivers properly range-limited
    64-bit results.

    Sign- or zero extension will still be necessary for things like

    long a=...
    int b=a;
    ... c[b];

    With the extension in the operands, you do not need any extension
    instructions, not even for division, right-shift etc.

    The question, however, is if the extensions occur often enough to
    merit such features. I lean towards the SPARC/PowerPC/My 66000-v1
    approach here.

    It would be interesting to see how many sign-extensions and
    zero-extensions (whether explicit or implicitly part of the
    instruction) are executed in code that is generated from various C
    sources (with and without -fwrapv).

    In GNUPLOT it is just over 4% of instruction count for 64-bit-only
    integer calculations.

    Now what if you had a calling convention with garbage-extension? A
    number of extensions in your examples would go away.

    Counted for() loops are somewhat special in that it is quite easy to
    determine that the loop index never exceeds the range-limit of the
    container.

    There have been enough cases where such reasoning led to "optimizing"
    code into an infinite loop and other fallout of adversarial compilers.

    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
      unsigned i;
      long r=0;
      for (i=l; i!=h; i++)
        r+=a[i];
      return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20
    c: 83f5 srli a5,a5,0x1d
    e: 97ba add a5,a5,a4
    10: 639c ld a5,0(a5)
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret




    If n is int, you can also choose int, and there is actually enough
    information here to make the code efficient (even with -fwrapv),
    because in this code int overflow really cannot happen,

    Consider the case where n is int64_t or uint64_t !?!

    Then the first condition does not hold on I32LP64.

    Consider the C-preprocessor with::
    # define int (short int) // !!
    in scope.

    Then the compiler will see short int, and generate code accordingly.
    What's your point?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Oct 7 20:18:11 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ..
    My 66000 CMP is signless--it compares two integer registers and delivers
    a bit vector of all possible comparisons {2 equality, 4 signed, 4 unsigned,
    4 range checks, [and in FP land 10-bits are the class of the RS1 operand]}

    With an 88000-style compare and a result register of 64 bits, you can
    spend 14 bits on 64-bit comparison, 14 bits on 32-bit comparison, 14
    bits on 16-bit comparison, and 14 bits on 8-bit comparison, and still
    have 8 bits left. What is a "range check" and why does it take 4
    bits?

    CIN 0 <= Reg < Max
    FIN 0 < Reg <= Max
    RIN 0 < Reg < Max
    SIN 0 <= Reg <= Max
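    Read as predicates on a register value against a limit Max, a minimal C
    sketch (hypothetical helpers named after the mnemonics above; sin_ has a
    trailing underscore only to avoid the <math.h> name):

        #include <stdbool.h>
        #include <stdint.h>

        static bool cin (int64_t reg, int64_t max) { return 0 <= reg && reg <  max; }
        static bool fin (int64_t reg, int64_t max) { return 0 <  reg && reg <= max; }
        static bool rin (int64_t reg, int64_t max) { return 0 <  reg && reg <  max; }
        static bool sin_(int64_t reg, int64_t max) { return 0 <= reg && reg <= max; }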


    It is certainly part of the way towards my idea of having sign- and
    zero-extended 32-bit operands for every operand of every instruction.

    Unnecessary if the integer calculation delivers properly range-limited
    64-bit results.

    Sign- or zero extension will still be necessary for things like

    long a=...
    int b=a;
    .. c[b];

    The movement of long to int will 'smash' out extraneous significance.
    As written: b has range [-2G..+2G] and the register holding b's value
    will too.

    The important property is that registers contain 64-bits and the value
    in the register is range-limited to the calculated (or LDed) result.

    With the extension in the operands, you do not need any extension instructions, not even for division, right-shift etc.

    The question, however, is if the extensions occur often enough to
    merit such features. I lean towards the SPARC/PowerPC/My 66000-v1
    approach here.

    I did too, until <many> conversations with an LLVM compiler writer.
    GNUPLOT seems to be a banner application wrt range-limited
    calculations.

    It would be interesting to see how many sign-extensions and
    zero-extensions (whether explicit or implicitly part of the
    instruction) are executed in code that is generated from various C
    sources (with and without -fwrapv).

    In GNUPLOT it is just over 4% of instruction count for 64-bit-only
    integer calculations.

    Now what if you had a calling convention with garbage-extension? A
    number of extensions in your examples would go away.

    Not many; few are on the ABI, and most of the ones that are get dealt with
    when moving arguments to preserved registers. So, you could send HoBs
    that are never observed, since the MOV Rpreserved,Rargument gets changed
    into a SR[AL] Rpreserved,Rargument<32:0> at no space or time cost.

    Counted for() loops are somewhat special in that it is quite easy to
    determine that the loop index never exceeds the range-limit of the
    container.

    There have been enough cases where such reasoning led to "optimizing"
    code into an infinite loop and other fallout of adversarial compilers.

    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
      unsigned i;   // <--- this variable should be uint64_t
      long r=0;
      for (i=l; i!=h; i++)
        r+=a[i];
      return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20 // eliminate HoBs
    c: 83f5 srli a5,a5,0x1d // does not have scaled indexing
    e: 97ba add a5,a5,a4 // does not have indexing
    10: 639c ld a5,0(a5) // all that work
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5 // loop induction
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret

    foo:
    MOV R4,#0
    MOV R5,#1
    VEC R7,{}
    LDD R6,[R1,R5<<3]
    ADD R4,R4,R6
    LOOP2 NE,R5,#1,R3
    MOV R1,R4
    RET



    If n is int, you can also choose int, and there is actually enough
    information here to make the code efficient (even with -fwrapv),
    because in this code int overflow really cannot happen,

    Consider the case where n is int64_t or uint64_t !?!

    Then the first condition does not hold on I32LP64.

    Consider the C-preprocessor with::
    # define int (short int) // !!
    in scope.

    Then the compiler will see short int, and generate code accordingly.
    What's your point?

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Wed Oct 8 20:41:21 2025
    From Newsgroup: comp.arch

    In article <2025Oct7.210925@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ...
    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
    unsigned i;
    long r=0;
    for (i=l; i!=h; i++)
    r+=a[i];
    return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20
    c: 83f5 srli a5,a5,0x1d
    e: 97ba add a5,a5,a4
    10: 639c ld a5,0(a5)
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret

    Unsigned 32-bit stuff on RISC-V has a habit of blowing up with lots of
    overhead instructions. Change the loop condition to "i < h", and you get
    on godbolt.org with -O2 -march=rv64g

    foo(long*, unsigned int, unsigned int):
    mv a5,a0
    bgeu a1,a2,.L4
    addiw a4,a2,-1
    subw a4,a4,a1
    slli a4,a4,32
    slli a1,a1,32
    srli a1,a1,32
    srli a4,a4,32
    add a4,a4,a1
    addi a3,a0,8
    slli a4,a4,3
    slli a1,a1,3
    li a0,0
    add a5,a5,a1
    add a4,a4,a3
    .L3:
    ld a3,0(a5)
    addi a5,a5,8
    add a0,a0,a3
    bne a5,a4,.L3
    ret
    .L4:
    li a0,0
    ret

    This does get better with "-march=rv64g_zba", but Zba isn't part of RV64G.

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    In many cases, the sign-extension works well (BGEU on 64-bit registers
    that are 32-bit sign-extended, works as it would if the values were 0-extended). But mixing true 64-bit unsigned with 32-bit unsigned
    requires fixup instructions. And the lack of a ZEXT.W in the basic
    64-bit instruction set was a mistake. RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.
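    As a quick sanity check of the BGEU point above, a minimal self-contained
    C sketch (hypothetical, exhaustive only over a few sample values):

        #include <assert.h>
        #include <stdint.h>

        /* Sign-extend a 32-bit value the way the RV64 ABI mandates. */
        static uint64_t sext32(uint32_t v)
        {
            return (uint64_t)((int64_t)(v ^ 0x80000000u) - (int64_t)0x80000000);
        }

        int main(void)
        {
            uint32_t s[] = { 0, 1, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu };
            for (int i = 0; i < 5; i++)
                for (int j = 0; j < 5; j++)
                    /* unsigned 64-bit compare of sign-extended values agrees
                       with the unsigned 32-bit compare */
                    assert((s[i] < s[j]) == (sext32(s[i]) < sext32(s[j])));
            return 0;
        }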

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 8 22:58:53 2025
    From Newsgroup: comp.arch

    On 10/8/2025 3:41 PM, Kent Dickey wrote:
    In article <2025Oct7.210925@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ...
    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
    unsigned i;
    long r=0;
    for (i=l; i!=h; i++)
    r+=a[i];
    return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20
    c: 83f5 srli a5,a5,0x1d
    e: 97ba add a5,a5,a4
    10: 639c ld a5,0(a5)
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret

    Unsigned 32-bit stuff on RISC-V has a habit of blowing up with lots of overhead instructions. Change the loop condition to "i < h", and you get
    on godbolt.org with -O2 -march=rv64g

    foo(long*, unsigned int, unsigned int):
    mv a5,a0
    bgeu a1,a2,.L4
    addiw a4,a2,-1
    subw a4,a4,a1
    slli a4,a4,32
    slli a1,a1,32
    srli a1,a1,32
    srli a4,a4,32
    add a4,a4,a1
    addi a3,a0,8
    slli a4,a4,3
    slli a1,a1,3
    li a0,0
    add a5,a5,a1
    add a4,a4,a3
    .L3:
    ld a3,0(a5)
    addi a5,a5,8
    add a0,a0,a3
    bne a5,a4,.L3
    ret
    .L4:
    li a0,0
    ret

    This does get better with "-march=rv64g_zba", but Zba isn't part of RV64G.

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    In many cases, the sign-extension works well (BGEU on 64-bit registers
    that are 32-bit sign-extended, works as it would if the values were 0-extended). But mixing true 64-bit unsigned with 32-bit unsigned
    requires fixup instructions. And the lack of a ZEXT.W in the basic
    64-bit instruction set was a mistake. RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.


    Had they not dropped ADDWU and SUBWU from BitManip, and instead done the
    sensible thing of using zero-extended "unsigned int", much of this mess
    would go away...


    Sign-extending "unsigned int" is almost the worst possible option (even
    within the limits of plain RV64G). Sign extension makes "a+b" slightly cheaper, but everything else gets worse. It is, ironically, better to
    just pay the up-front cost of zero extension for add/subtract (and maybe
    throw up a middle finger to the ABI spec on this one).


    Well, then again, it seems there are multiple versions of the ABI spec floating around on the internet, seemingly with differences as to the
    exact handling of passing/returning structures, etc. So, I don't
    personally put too much weight into worrying about there being a minor mismatch here.

    Where:
      Some versions appear to be using SysV-AMD64 style struct rules;
        With structs being returned by on-stack copy.
      Some versions using the register, register-pair, or by-reference;
        With structs returned in X10, X11:X10,
        or by passing a return pointer as a hidden argument.
        This also being what BGBCC uses;
      ...

    Then, differences between LP64 and LP64D:
    LP64: All F registers are Scratch;
    LP64D: Some of the F registers are Preserved.


    Well, and there are bigger concerns on the ABI front (the ABI used by
    BGBCC not being strictly 1:1 with the standard ABI, but close enough
    that most cases will work):
    Basic case is LP64 argument passing with LP64D's register rules.

    Then an XG3 ABI variant (can also be used for RV64G) which defines there
    as being 16-argument registers and reassigns 4 of the F registers from
    scratch to preserved (to bring the balance slightly closer to an even
    split).

    So:
    X: 4 SPR, 16 Scratch, 12 Preserved
    F: 16 Scratch, 16 Preserved (Vs 20+12)
    So: 32 Scratch + 28 Preserved
    Vs: 36 Scratch + 24 Preserved



    Kent

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 9 05:28:56 2025
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> schrieb:

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    You mean 0xffffffff as unsigned has to be passed as
    0xffffffffffffffff ? Somebody was not thinking that one through...

    At least Loongarch gets that one right; unsigned and signed are
    zero- and sign-extended, respectively.


    [...]

    RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.

    Seems like it...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 9 01:13:54 2025
    From Newsgroup: comp.arch

    On 10/9/2025 12:28 AM, Thomas Koenig wrote:
    Kent Dickey <kegs@provalid.com> schrieb:

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    You mean 0xffffffff as unsigned has to be passed as
    0xffffffffffffffff ? Somebody was not thinking that one through...


    Yes, and it is real stupid...


    At least Loongarch gets that one right; unsigned and signed are
    zero- and sign-extended, respectively.


    Meanwhile, in RISC-V land, they are like, "You know, UInt is sign
    extended but we need 0 extension." and rather than do something sane,
    like zero-extend UInt...

    Well, first the B extension adds some ".UW" instructions:
    ADD.UW, SH1ADD.UW, SH2ADD.UW, SH3ADD.UW, SLLI.UW

    Which have the amazing behavior of zero-extending on the input side.


    And then more extensions come along, and add more ".UW" instructions...
    How many did the indexed Zilx/Zisx proposal add?... 19.

    Me: FFS.


    I could almost just ignore it, except if implementing a CPU core that
    might want to support these extensions, I also end up needing to waste
    FPGA resources to support this stuff.



    [...]

    RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.

    Seems like it...

    Yeah, some of the 32-bit instructions only handle signed variants.

    Also, the number of instructions you need for zero-extension on output is
    far smaller than the number you need for zero-extension on input.

    How many do you need?: ADDWU, SUBWU.
    Or, 2 instructions.
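    A minimal C sketch of the zero-extend-on-output semantics being asked for
    here (assuming the ADDWU/SUBWU behaviour as drafted: the low 32 bits of
    the result, zero-extended into the 64-bit register):

        #include <stdint.h>

        static uint64_t addwu(uint64_t rs1, uint64_t rs2) { return (uint32_t)(rs1 + rs2); }
        static uint64_t subwu(uint64_t rs1, uint64_t rs2) { return (uint32_t)(rs1 - rs2); }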


    Even without these, zero-extending UInt for operations that might go out
    of range is still less bad than dealing with the mess left by
    sign-extended UInt.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Oct 9 07:17:43 2025
    From Newsgroup: comp.arch

    On Fri, 03 Oct 2025 02:50:23 +0000, MitchAlsup wrote:

    And after bragging to Quadribloc about its stability--it reached the
    point where it was time to switch to version 2.0.

    I really don't think you have anything to be ashamed of.

    Getting new ideas that are capable of radically improving your ISA and modifying it to make use of them is an entirely appropriate thing to do.
    And your ideas are ones which fit into your philosophy - it's not as if
    you found yourself doing something wrong where _my_ way was better.

    That would be a valid occasion for "eating crow", but that isn't what
    happened.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Oct 9 07:32:44 2025
    From Newsgroup: comp.arch

    Also, looking at the specific area which you discussed in your post...

    I do this stuff the way most other computers do it, I think; the
    conventional way that is exemplified by the IBM System/360. I hadn't
    really even thought the matter through to see if there was another, better way: it seemed obvious that this way made sense and worked.

    So multiplying two 8-bit integers produces a 16-bit result... or,
    optionally, perhaps just an 8-bit result, since while the high bits may sometimes be needed for multi-precision arithmetic, usually they're just
    extra bother.

    And this applies to every other size of integer. Loading an integer into a 64-bit register always produces a 64-bit result, though, because registers don't shrink. So there's load (sign extension), unsigned load (clear the
    high bits), and insert (don't touch bits higher than the data coming in).

    Since fixed point data is usually considered to be _integers_ rather than numbers between 0 and 1, fixed-point data is right aligned. Maybe
    fractional fixed-point is a cheaper substitute for floating-point in some applications, and so I should add the extra option of left-aligned loads
    and stores - and fixed-point arithmetic instructions that behave more like floating-point instructions. Since I haven't encountered that feature very much on computers - the multiply instructions on some minis, however, suggested that they viewed fixed-point data as left-aligned fractional -
    I've assumed it's too esoteric a feature to support, but I could be wrong.

    Particularly in comparison to all the other esoteric features I plan to support, this one could be genuinely useful. But it could also be
    something that nobody uses because floating-point is safer to use for the
    kind of problems that fractional fixed-point is applicable to; no constant worry about overflows.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 07:07:11 2025
    From Newsgroup: comp.arch

    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct7.210925@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    ...
    If n is unsigned, you can also choose unsigned,
    but then this code will be slow on RV64 (and MIPS64 and SPARC V9 and
    PowerPC64 and Alpha).

    Example please !?!

    With a slightly different loop:

    long foo(long a[], unsigned l, unsigned h)
    {
    unsigned i;
    long r=0;
    for (i=l; i!=h; i++)
    r+=a[i];
    return r;
    }

    gcc-10 -O3 produces on RV64G:

    0000000000000000 <foo>:
    0: 872a mv a4,a0
    2: 4501 li a0,0
    4: 00c58c63 beq a1,a2,1c <.L4>

    0000000000000008 <.L3>:
    8: 02059793 slli a5,a1,0x20
    c: 83f5 srli a5,a5,0x1d
    e: 97ba add a5,a5,a4
    10: 639c ld a5,0(a5)
    12: 2585 addiw a1,a1,1
    14: 953e add a0,a0,a5
    16: feb619e3 bne a2,a1,8 <.L3>
    1a: 8082 ret

    000000000000001c <.L4>:
    1c: 8082 ret

    Unsigned 32-bit stuff on RISC-V has a habit of blowing up with lots of
    overhead instructions. Change the loop condition to "i < h", and you get
    on godbolt.org with -O2 -march=rv64g

    foo(long*, unsigned int, unsigned int):
    mv a5,a0
    bgeu a1,a2,.L4
    addiw a4,a2,-1
    subw a4,a4,a1
    slli a4,a4,32
    slli a1,a1,32
    srli a1,a1,32
    srli a4,a4,32
    add a4,a4,a1
    addi a3,a0,8
    slli a4,a4,3
    slli a1,a1,3
    li a0,0
    add a5,a5,a1
    add a4,a4,a3
    .L3:
    ld a3,0(a5)
    addi a5,a5,8
    add a0,a0,a3
    bne a5,a4,.L3
    ret
    .L4:
    li a0,0
    ret

    Yes, in many cases the compiler manages to pull the zero extension out
    of the loop or eliminate it completely, and it took me three tries to
    find a loop where this does not happen; and given that I actually
    intended to find such a case and made my changes accordingly, I expect
    that the occurrences in practice are rarer than 1 in 3. Nevertheless,
    you don't want your hot loop to fall into this trap, so better use
    size_t rather than unsigned for your loop counter.

    And the lack of a ZEXT.W in the basic
    64-bit instruction set was a mistake.

    The slli and srli instructions above are compressible to 16 bits, so
    it looks to me that the RISC-V designers knew that zero extension is
    going to be somewhat frequent, and wanted to make the instructions for
    that cheap (I don't expect that other uses of srli are frequent enough
    to merit inclusion in the compressed instructions; for slli the use in addressing is probably more frequent than its use in zero extension).
    But they either thought that making the more general slli and srli
    instructions cheap was good enough (and also benefits other cases), or
    they were too reluctant to add another instruction.

    But given that they added a number of sign-extending 32-bit
    instructions, such a reluctance certainly did not exist when they did
    that; the combination of slli and srai performs a sign extension all
    right, no such instructions are strictly necessary, and certainly not
    all of them: addw reg,x0,reg also performs a sign extension. Just
    having a sign-extending and a zero-extending add could have been
    another option.

    RISC-V gives us a modern example
    of how to handle not having a full suite of 32-bit instructions, and
    what that would look like.

    No instruction set has a full suite of 32-bit instructions with both sign-extended and zero-extended results. They all have workarounds
    for the lack of the full suite, with various associated costs that
    have to be balanced against the costs of providing the full suite.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 08:22:31 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Kent Dickey <kegs@provalid.com> schrieb:

    GCC has actually optimized the loop itself better, but it has lots of
    fixup code to create 64-bit register versions of the unsigned inputs
    (because the RISC-V ABI specifies all 32-bit quantities must be
    sign-extended at the function call boundaries, even if they are
    unsigned).

    You mean 0xffffffff as unsigned has to be passed as
    0xffffffffffffffff ? Somebody was not thinking that one through...

    I am sure they did think that one through. The manual says:

    |The compiler and calling convention maintain an invariant that all
    |32-bit values are held in a sign-extended format in 64-bit
    |registers. Even 32-bit unsigned integers extend bit 31 into bits 63
    |through 32. Consequently, conversion between unsigned and signed
    |32-bit integers is a no-op, as is conversion from a signed 32-bit
    |integer to a signed 64-bit integer. Existing 64-bit wide SLTU and
    |unsigned branch compares still operate correctly on unsigned 32-bit
    |integers under this invariant. Similarly, existing 64-bit wide logical
    |operations on 32-bit sign-extended integers preserve the
    |sign-extension property. A few new instructions (ADD[I]W/SUBW/SxxW)
    |are required for addition and shifts to ensure reasonable performance
    |for 32-bit values.

    What I find more interesting is that MIPS apparently made the same
    choice. In the early 90s a lot of code was around which did not
    provide prototypes, code that worked on 32-bit systems because there
    was no difference between int and long, and between unsigned int and
    unsigned long. I expect such code to have a better chance to work as
    intended in an I32LP64 setting if unsigned is zero-extended (and the
    choices for SPARC and PowerPC are along these lines). So why did the
    MIPS people go that way?

    One other interesting thing is how various architectures define the
    upper 32 bits of existing instructions when they extend to 64 bits.
    Let's consider addition:

    MIPS-IV: addu performs sign-extended 32-bit addition (and the instruction
    is called "Add Unsigned Word":-); they added daddu for 64-bit
    addition. They undefined the result of addu if the inputs were not sign-extended 32-bit numbers to make their lack of competence more
    obvious.

    SPARC and PowerPC: The addition instructions perform 64-bit addition.
    No extra 32-bit variant was added.

    AMD64 is actually a new instruction set incompatible with IA-32, but
    given that AMD64 is so close to IA-32 that their decoder is shared on
    all implementations I know of, I include it here: AMD64 defines the
    existing instructions as producing the same result as on IA-32 (that
    includes instructions like shift-right and division where the upper
    bits of a 64-bit operation would play a role), with the upper 32 bits
    being zero, and adds 64-bit variants.

    ARM A64 is a completely new 64-bit instruction set. There have been
    efforts for an ILP32 ABI, but I don't think that this instruction set
    was designed for that.

    RISC-V was designed for both 32-bit and 64-bit settings. The add
    instruction performs the full 64-bit addition in 64-bit
    implementations. The 64-bit extension adds addw (along with addiw
    slliw srliw sraiw sllw srlw subw sraw), which sign-extends the lower
    32 bits of its result. addw produces a defined result for all inputs.
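    In C terms, a sketch of what addw computes (every input pattern gives a
    defined result, whatever the upper 32 bits held):

        #include <stdint.h>

        static int64_t addw(uint64_t rs1, uint64_t rs2)
        {
            uint64_t low32 = (rs1 + rs2) & 0xFFFFFFFFu;
            return (int64_t)(low32 ^ 0x80000000u) - (int64_t)0x80000000;  /* sign-extend from bit 31 */
        }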

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 10:39:05 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    It is certainly part of the way towards my idea of having sign- and
    zero-extended 32-bit operands for every operand of every instruction.

    Unnecessary if the integer calculation delivers properly range-limited
    64-bit results.

    Sign- or zero extension will still be necessary for things like

    long a=...
    int b=a;
    .. c[b];

    The movement of long to int will 'smash' out extraneous significance.

    I.e., you have an extra instruction for that purpose.

    Actually, the example is incomplete. Let's make it more complete:

    long x=..., y=...;
    long a=x-y;
    int b=a;
    long d=c[b];
    long e=a*3;
    live: d,e; dead: a,b

    Let's call the 64-bit registers x..., with alternative names i... and
    u..., where i... are sign-extended from the low-order 32-bits as
    source operands, and a 32-bit result sign-extended to 64 bits is
    stored into x.../i... when i... is a destination. Likewise for u... and zero-extension. With that, if you only have instructions that allow
    all variants as destinations, but have no choice on the source side,
    the code looks as follows:

    sub xa=xx,xy
    mov/sext ib=ia
    load xd=(xd+xb*8)
    mul xe=xa,3

    If you have choice on the source side, you can implement this as:

    sub xa=xx,xy
    load xd=(xd+ia*8)
    mul xe=xa,3

    i.e., you can eliminate the sign-extension instruction. And if you
    arrange your conventions such that the consumer of a value is
    responsible for sign/zero extension (e.g., with a garbage-extending
    calling convention), you can use x... as destination (no
    i.../u... needed there), and do not need to execute any separate
    sign/zero-extension instructions (whether you call them SEXT/ZEXT or
    MOV).

    However, I doubt that this benefit is worth the price. Nevertheless,
    ARM A64 has addressing modes that correspond to (xd+ia*1/2/4/8) and (xd+ua*1/2/4/8).

    I did too, until <many> conversations with LLVM compiler writer.
    GNUPLOT seems to be a banner application wrt range-limited calcu-
    lations.

    But will a several percent lower instruction count on GNUPLOT sell
    many MY66000s?

    Now what if you had a calling convention with garbage-extension? A
    number of extensions in your examples would go away.

    Not many, few are on ABI and most of the ones that are are dealt with
    when moving arguments to preserved registers.

    That sounds like something that is done on the callee side. And doing
    the sign/zero extension on the callee side is what one would do if the convention is garbage-extension.

    But yes, with either convention you are able to combine the
    sign/zero-extension with a mov that would otherwise have been
    necessary anyway.

    So, you could send HoBs
    that are never observed since the MOV Rpreserved,Rargument gets changed
    into a SR[AL] Rpreserved,Rargument<32:0> at no space or time cost.

    Concerning time cost, many microarchitectures nowadays have zero-cycle
    MOVs (the MOVs are performed by the register renamer). It is possible
    to extend this to also do zero-cycle sign- and zero-extensions
    (resulting in i... and u... registers in the microarchitecture), but
    that certainly has a cost in design and implementation complexity, and
    I am not sure that this is justified by the performance advantages of
    this hardware optimization.

    If the hardware has zero-cycle moves, but not zero-cycle
    sign/zero-extensions, then there is a time cost.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Oct 9 13:51:52 2025
    From Newsgroup: comp.arch

    On Sat, 04 Oct 2025 10:17:41 +0000, Anton Ertl wrote:

    If the calling convention guarantees that ints are zero-extended (sounds perverse, but RV64 has the guarantee that unsigned is passed in
    sign-extended form, which is equally perverse), then the compiler must
    use instructions that produce a zero-extended result (e.g., AMD64's
    addl). If the calling convention only requires and guarantees the
    low-order 32 bits (I call this garbage-extended), then the compiler can
    use instructions that perform 64-bit adds; this is what we are seeing
    above.

    The other side of the medal is what is needed at the caller: If the
    caller needs to convert a sign-extended int into a long, it does not
    have to do anything.

    I find this just a bit confusing.

    Obviously, regular signed integer values should be sign extended.

    But, equally, _unsigned_ integer values should be zero extended for
    precisely the same reason, so that the longer value, as an unsigned
    integer, has the same value without doing anything.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 9 15:40:20 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    kegs@provalid.com (Kent Dickey) writes:
    In article <2025Oct7.210925@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:<snip>

    No instruction set has a full suite of 32-bit instructions with both sign-extended and zero-extended results. They all have workarounds
    for the lack of the full suite, with various associated costs that
    have to be balanced against the costs of providing the full suite.

    Once I get a workable solution to converts, My 66000 will have
    {Sign}|u{Size}.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 9 15:42:40 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Fri, 03 Oct 2025 02:50:23 +0000, MitchAlsup wrote:

    And after bragging to Quadribloc about its stability--it reached the
    point where it was time to switch to version 2.0.

    I really don't think you have anything to be ashamed of.

    What I am ashamed of is "breast beating" when you started to announce
    Concertina III while I was still on My 66000 1.0--then shortly afterwards
    having to jump to 2.0 with the breast beating still fresh in my mind.

    Getting new ideas that are capable of radically improving your ISA and modifying it to make use of them is an entirely appropriate thing to do.
    And your ideas are ones which fit into your philosophy - it's not as if
    you found yourself doing something wrong where _my_ way was better.

    That would be a valid occasion for "eating crow", but that isn't what happened.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 15:37:37 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *).

    And lots of others, e.g., those that assumed that longs are 4 bytes in
    size.

    This was
    not true on the PDP-11,

    Can you elaborate on this? What do you think is sizeof(int) on a
    PDP-11, and what do you think is sizeof(char *) on a PDP-11?

    and it was a standards violation, anyway.

    That's hilarious. C89 was three years old in 1992. The majority of C
    programs available in 1992 were started before ANSI C was released,
    and thus contained code from before ANSI C. And like today,
    programmers are asked to spend time on other things than fixing things
    that are not broken.

    And I am sure that C
    programs were much more relevant for selling these machines than
    FORTRAN programs.

    Based on what data?

    Based on 4 months of internship at HP in 1988 and 1989, in a group
    that did sales support, tech support, and courses on HP 9000
    workstations and servers and HP/UX (the OS of the HP 9000 machines).
    I don't remember hearing about a customer that used FORTRAN.

    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    The problem with 64-bit INTEGERs for Fortran is that they make REAL
    unusable for lots of existing code.

    The size of FORTRAN INTEGERs is something the FORTRAN people have to
    decide, and I made no statement on that.

    If FORTRAN programs make the assumptions that sizeof(int)==4, maybe
    you should tell the FORTRAN programmers something along these lines:
    "it is a standards violation, anyway. Only people who like to play
    these kind of games are caught."

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 9 16:04:50 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    I didn't port a lot of code from the 11 to other machines, but my
    recollection is that the widespread assumption in Berkeley Vax code that
    location zero was addressable and contained binary zeros was much more
    painful to fix than size issues.

    Sure, lots of things are more painful to fix, but Thomas Koenig's
    claim was that if the 64-bit machines would not run FORTRAN code "as
    is", nobody would buy them.

    Concerning pain, I found that in Gforth (which contains C code and
    Forth code) we had many more portability bugs in the C code than in
    the Forth code, where we had almost no portability bugs.

    That's because Forth has only two integer types: cell (a machine word)
    and double cell (two machine words); and if you use one instead of the
    other, the code fails, whatever the cell size is.

    By contrast, in the C code we have to deal with a large number of
    integer types (not just int, long, etc., but also, e.g., off_t), with
    the relations between the types being different on different
    platforms, or, in the case of off_t, also depending on #defines. On one
    machine some function parameter was a long or whatever, on a different
    one it was a bla_t or whatever. Of course, these days one might
    target only Linux and MacOS and reach >99% of desktops and servers
    (the result runs on Windows through WSL2), but that solves the problem
    by reducing the portability requirements.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 9 17:04:43 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    John Levine <johnl@taugh.com> writes:
    I didn't port a lot of code from the 11 to other machines, but my
    recollection is that the widespread assumption in Berkeley Vax code that
    location zero was addressable and contained binary zeros was much more
    painful to fix than size issues.

    Sure, lots of things are more painful to fix, but Thomas Koenig's
    claim was that if the 64-bit machines would not run FORTRAN code "as
    is", nobody would buy them.

    That is a misrepresentation (not that I'm surprised).

    My argument was that this would remove a sizable enough market share
    that nobody would risk that.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 9 18:19:40 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *).

    And lots of others, e.g., those that assumed that longs are 4 bytes in
    size.

    This was
    not true on the PDP-11,

    Can you elaborate on this?

    That was a mistake, as others have pointed out.

    Based on 4 months of internship at HP in 1988 and 1989, in a group
    that did sales support, tech support, and courses on HP 9000
    workstations and servers and HP/UX (the OS of the HP 9000 machines).
    I don't remember hearing about a customer that used FORTRAN.

    *shrug* Oh well, that is very scientific evidence, statistically
    proven.

    Counterpoint: On the University workstations I worked on, Fortran
    was very much in use. People wrote code to run on IBM mainframes
    and ported this to the HP workstations. Plus, there were vector
    computers where REAL also was 32 bits.

    Whose anecdotal evidence counts more?

    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    News flash: Engineers rarely use Usenet (I'm a bit of an exception
    there).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Oct 9 22:48:14 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    News flash: Engineers rarely use Usenet (I'm a bit of an exception
    there).

    That depends strongly on when and where you are talking about:

    Back when the Internet (Arpanet) got its first node outside of the US,
    it was in Norway in 1973, but our universities did not get a link until
    1983.

    When I started working for Norsk Hydro in 1984, they had already set up a
    64-kbit/s line from our Bergen office to the university there, and the
    reason it was installed was that we had engineers who needed Usenet access.

    Personally I've been on Usenet for close to 40 years. It could have been
    a bit more, but I did not use Usenet at NTH in Trondheim and I did not
    set up a news reader immediately when I started in Hydro, with
    responsibility for all IBM PC compatibles worldwide.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 9 21:08:25 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    News flash: Engineers rarely use Usenet (I'm a bit of an exception
    there).

    That depends strongly on when and where you are talking about:

    Back when the Internet (Arpanet) got its first node outside of the US,
    it was in Norway in 1973, but our universities did not get a link until 1983.

    When I started working for Norsk Hydro in 1984, they had already setup a 64-kbit/s line from our Bergen office to the university there, and the reason it was installed was that we had engineers who needed Usenet access.

    Personally I've been on Usenet for close to 40 years. It could have been
    a bit more but I did not use Usenet at NTH in Trondheim and I did not
    setup a news reader immediately when I started in Hydro, with
    responsibility for all IBM PC compatibles worldwide.

    I started using USENET when it reached European universities in the
    very early 1990s, so maybe 35 years.

    But my personal observation, from the people I knew, was that users
    were mostly computer scientists, with some mathematicians thrown in.

    Now, of course, USENET is dying off fast; few news servers are left,
    and those are also being switched off (for example news.individual.net).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Thu Oct 9 16:30:38 2025
    From Newsgroup: comp.arch

    On 10/6/25 8:38 PM, Kent Dickey wrote:
    [SNIP]
    For C and C++ code, the standard dictates that all integer operations are
    done with "int" precision, unless some operand is larger than int, in which
    case it is done in that precision. So there's no real need for 8-bit and
    16-bit operations to be supported natively by the CPU--these operations are
    actually done
    as int's already. If you have a variable which is a byte, then assigning
    to that variable, and then using that variable again you will need to zero-extend, but honestly, this is not usually a performance path. It's likely to be stored to memory instead, so no masking or sign extending
    should be needed.

    [SNIP]
    Kent

    Can you point me to the section in "the standard" which indicates
    'all integer operations are done with "int" precision'?

    What if the wording was changed to:
    'all integer operations are done with _at least_ "int" precision',
    e.g. one could use long. Would that break conforming code?

    Brian
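
    The wording Kent is paraphrasing is usually taken to be the "integer
    promotions" and "usual arithmetic conversions" (6.3.1.1 and 6.3.1.8 in
    C11); a small example of their observable effect, assuming the usual
    8-bit unsigned char and 32-bit int:

        #include <stdio.h>

        int main(void)
        {
            unsigned char a = 200, b = 100;
            /* a and b are promoted to int before the +, so the sum is 300,
               not (200 + 100) mod 256 = 44; only the assignment back to an
               8-bit object reduces it modulo 256. */
            int wide = a + b;
            unsigned char narrow = a + b;
            printf("%d %d\n", wide, narrow);   /* prints "300 44" */
            return 0;
        }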




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 9 21:54:21 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    John Levine <johnl@taugh.com> writes:
    I didn't port a lot of code from the 11 to other machines, but my
    recollection is that the widespread assumption in Berkeley Vax code that
    location zero was addressable and contained binary zeros was much more
    painful to fix than size issues.

    Sure, lots of things are more painful to fix, but Thomas Koenig's
    claim was that if the 64-bit machines would not run FORTRAN code "as
    is", nobody would buy them.

    Concerning pain, I found that in Gforth (which contains C code and
    Forth code) we had many more portability bugs in the C code than in
    the Forth code, where we had almost no portability bugs.

    C, itself, would be a "lot less painful" if C only had 2 integer types,
    1-word and 2-words. But, instead, the typical 2^(n+3)-bit machines have
    8 integer types {Signed, unSigned}|u{Byte, Half, Word, DBLE}, and then,
    to make it as bad as possible, there are a myriad of types {ptrdiff_t,
    size_t, off_t, ...} that change {Sign}|u{Size} on an architecture basis.

    That's because Forth has only two integer types: cell (a machine word)
    and double cell (two machine words); and if you use one instead of the
    other, the code fails, whatever the cell size is.

    Same as <old> FORTRAN.

    By contrast, in the C code we have to deal with a large number of
    integer types (not just int, long, etc., but also, e.g., off_t), with
    the relations between the types being different on different
    platforms, or, in the case of off_t, also depending on #defines. On one
    machine some function parameter was a long or whatever, on a different
    one it was a bla_t or whatever. Of course, these days one might
    target only Linux and MacOS and reach >99% of desktops and servers
    (the result runs on Windows through WSL2), but that solves the problem
    ^only
    by reducing the portability requirements.

    Blame goes to:: ISO/IEC 9899:1999 for trying to accommodate everyone
    and ending up screwing everyone.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 9 22:24:01 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64, and
    that did not kill these machines, either.

    Only those who assumed sizeof(int) = sizeof(char *).

    And lots of others, e.g., those that assumed that longs are 4 bytes in
    size.

    This was
    not true on the PDP-11,

    Can you elaborate on this? What do you think is sizeof(int) on a
    PDP-11, and what do you think is sizeof(char *) on a PDP-11?

    sizeof int == 2
    sizeof char * == 2

    and it was a standards violation, anyway.

    That's hilarious. C89 was three years old in 1992. The majority of C programs available in 1992 were started before ANSI C was released,
    and thus contained code from before ANSI C. And like today,
    programmers are asked to spend time on other things than fixing things
    that are not broken.

    If application vendors were subject to the same recall standards that
    the auto industry is subject to, that might change. {Remember the Pinto}

    And I am sure that C
    programs were much more relevant for selling these machines than
    FORTRAN programs.

    Based on what data?

    Based on 4 months of internship at HP in 1988 and 1989, in a group
    that did sales support, tech support, and courses on HP 9000
    workstations and servers and HP/UX (the OS of the HP 9000 machines).
    I don't remember hearing about a customer that used FORTRAN.

    C got the OS and compilers up and running, then the people who
    bought the machine ran applications they could compile from their
    source.

    Based also on the impressions I got on Usenet. Apart from SPECfp,
    Fortran was nowhere to be seen.

    Most FEM, Optics, CFD and larger scale engineering applications are
    all written in FORTRAN with C-front ends shuffling data/commands
    back and forth. {Spice, Layout, Design Rule Checking, GDSII, ...}

    C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    The problem with 64-bit INTEGERs for Fortran is that they make REAL
    unusable for lots of existing code.

    Nonsense::

    CDC only had Double Precision FP data (60-bit)
    with 18-bit integers
    CRAY only had Double Precision FP data (64-bit)
    with 24-bit integers

    {{Even numerical analysts liked Seymour's 60-bit and 64-bit arithmetic
    compared to 32-bit IBM and 36-bit Univac FP arithmetic--even with those
    littered with huge mistakes we would not allow today.}}

    The size of FORTRAN INTEGERs is something the FORTRAN people have to
    decide, and I made no statement on that.

    If FORTRAN programs make the assumptions that sizeof(int)==4, maybe
    you should tell the FORTRAN programmers something along these lines:
    "it is a standards violation, anyway. Only people who like to play
    these kind of games are caught."

    FORTRAN programmers think of an integer as 1 storage container--even on
    CDC and CRAY. The integer in memory is 60 or 64 bits, the integer in a
    register is 18 or 24 bits. FORTRAN programmers do not have problems
    with putting six 6-bit characters in a PDP-10 memory container, or ten
    6-bit field-data characters in one CDC memory container.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 9 22:26:33 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 10/3/2025 11:40 AM, EricP wrote:

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.


    As for FP8:
    There are multiple formats in use:
    S.E3.M4: Bias=7 (Quats / Unit Vectors)
    S.E3.M4: Bias=8 (Audio)
    S.E4.M3: Bias=7 (NN's)
    E4.M4: Bias=7 (HDR images)

    It's not just the memory formats, it's also the operations.
    In FP8 few may want to waste 1/8th of the encode space on NaN's.
    Maybe not sticky infinity, rather saturate at max but not stick there.
    Maybe no negative zero.

    All of those encodings might be reallocated to values more useful
    for that application.

    They won't want to calculate single-argument transcendentals like tan(x);
    they will use 256-byte lookup tables.
    The multi-operand functions ADD, SUB, MUL would be faster in hardware
    than 64kB lookup tables.

    Also a lot of these are used in matrix ops - super-duper-SIMD.
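
    A hedged sketch of what a software decoder for one of the FP8 flavours
    listed above might look like (S.E3.M4, bias 7, and assuming the
    no-NaN/no-Inf/no-subnormal treatment described here; real FP8
    definitions differ, which is rather the point):

        #include <math.h>
        #include <stdint.h>

        /* Every encoding is treated as an ordinary value with an implicit
           leading 1; a 256-entry table built from this is the "lookup
           table" approach mentioned above. */
        static float fp8_s_e3_m4_to_float(uint8_t v)
        {
            int   sign = (v >> 7) & 1;
            int   exp  = (v >> 4) & 7;     /* 3-bit exponent field */
            int   man  =  v       & 15;    /* 4-bit mantissa field */
            float f    = ldexpf(1.0f + man / 16.0f, exp - 7);
            return sign ? -f : f;
        }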


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Oct 9 22:45:36 2025
    From Newsgroup: comp.arch

    Brian G. Lucas wrote:
    On 10/6/25 8:38 PM, Kent Dickey wrote:
    [SNIP]
    For C and C++ code, the standard dictates that all integer operations are
    done with "int" precision, unless some operand is larger than int, and
    then
    do it in that precision. So there's no real need for 8-bit and 16-bit
    operations to be natively by the CPU--these operations are actually done
    as int's already. If you have a variable which is a byte, then assigning
    to that variable, and then using that variable again you will need to
    zero-extend, but honestly, this is not usually a performance path. It's
    likely to be stored to memory instead, so no masking or sign extending
    should be needed.

    [SNIP]
    Kent

    Can you point me to the section in "the standard" which indicates
    'all integer operations are done with "int" precision'?

    What if the wording was changed to:
    'all integer operations are done with _at least_ "int" precision',
    e.g. one could use long. Would that break conforming code?

    Brian

    I was wondering this myself. The down-cast rule to a smaller size
    appears to be in the C11 standard:
    6.3.1.3(3) implementation-defined, or an implementation-defined signal
    is raised, if out of range.

    "
    6.3 Conversions
    6.3.1.3 Signed and unsigned integers

    1 When a value with integer type is converted to another integer type
    other than _Bool, if the value can be represented by the new type,
    it is unchanged.

    2 Otherwise, if the new type is unsigned, the value is converted by
    repeatedly adding or subtracting one more than the maximum value that
    can be represented in the new type until the value is in the range of
    the new type.60)

    [EricP: this is the same as zero extend]

    3 Otherwise, the new type is signed and the value cannot be
    represented in it; either the result is implementation-defined or
    an implementation-defined signal is raised.
    "


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 10 07:07:03 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    6.3 Conversions
    6.3.1.3 Signed and unsigned integers

    1 When a value with integer type is converted to another integer type
    other than _Bool, if the value can be represented by the new type,
    it is unchanged.

    2 Otherwise, if the new type is unsigned, the value is converted by
    repeatedly adding or subtracting one more than the maximum value that
    can be represented in the new type until the value is in the range of
    the new type.60)

    [EricP: this is the same as zero extend]

    If the new type is larger, then case 1 is the only relevant one. For
    signed integers and twos-complement representation, that is sign
    extension. For unsigned integers, it's zero extension.

    Case 2 can only happen if the new type is smaller than the old type.
    In that case no extension happens, and what is described is modulo
    equivalence. It could be described as a modulo operation, but isn't.
    Maybe this description was originally intended to also cover signed
    numbers (where the modulo operation would not be appropriate), but
    later case 3 was added.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 10 07:31:16 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    and it was a standards violation, anyway.

    That's hilarious. C89 was three years old in 1992. The majority of C
    programs available in 1992 were started before ANSI C was released,
    and thus contained code from before ANSI C. And like today,
    programmers are asked to spend time on other things than fixing things
    that are not broken.

    If application vendors were subject to the same recall standards that
    the auto industry is subject to, that might change. {Remember the Pinto}

    I had not heard about "the Pinto" before, so I cannot remember it.
    Searching for it, it seems that you mean the Ford Pinto which had fuel
    system fires.

    I don't think that a program that works as intended but does not
    comply to a later-introduced standard is in the same position.

    Actually, the regulations for cars only hold for newly sold cars. All
    the other cars can be as unsafe and poison the air as badly as when
    they were introduced, and the Diesel emissions scandal (VW and many
    other car makers) shows that they are actually allowed to poison the
    air even more; at least in Austria none of the cars have been recalled
    that produce more emissions than was allowed when the cars were sold.

    Even for aircraft it is apparently enough to comply with the
    regulations valid at the time of certification of the aircraft, with
    fatal consequences for Chalk's Ocean Airways Flight 101. <https://en.wikipedia.org/wiki/Chalk%27s_Ocean_Airways_Flight_101#Age_of_fleet>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Oct 10 10:50:41 2025
    From Newsgroup: comp.arch

    In article <WhQEQ.117554$7Ika.12025@fx17.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    [...]
    <snip>

    So, kill the 64-bit machines in the scientific marketplace. I'm glad
    you agree.

    Not in the least. Most C programs did not run as-is on I32LP64.

    The vast majority of C/C++ programs ran just fine on I32LP64. There
    were some that didn't, but it was certainly not "most".

    Yeah, but I remember the switchover to 64-bit pretty well. Most
    programs ran ok, but there were quite a few that punned int for
    a native word and assumed it was interchangeable with a pointer,
    and it took a very long time to get all of that cruft cleaned
    up. That transition was pretty painful.
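
    The classic not-64-bit-clean pattern being described: code that works on
    ILP32, where int and pointers are both 32 bits, but silently truncates
    the pointer on I32LP64 (buf and the casts here are only illustrative):

        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
            char buf[16];
            int p   = (int)(intptr_t)&buf[0];  /* old code wrote (int)&buf[0] */
            char *q = (char *)(intptr_t)p;     /* ...and recovered it later   */
            /* On I32LP64, q need not equal &buf[0] if the address does not
               fit in 32 bits.                                                */
            printf("%p %p\n", (void *)&buf[0], (void *)q);
            return 0;
        }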

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 10 12:04:52 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Concerning pain, I found that in Gforth (which contains C code and
    Forth code) we had many more portability bugs in the C code than in
    the Forth code, where we had almost no portability bugs.

    C, itself, would be a "lot less painful" if C only had 2 integer types,
    1-word and 2-words. But, instead, the typical 2^(n+3)-bit machines have
    8 integer types {Signed, unSigned}|u{Byte, Half, Word, DBLE}, and then,
    to make it as bad as possible, there are a myriad of types {ptrdiff_t,
    size_t, off_t, ...} that change {Sign}|u{Size} on an architecture basis.

    Actually, ptrdiff_t might be seen as the signed word-size integer type
    and size_t as the unsigned one. That's somewhat what you are asking for.

    Concerning off_t, if C had the single-word and two-word type, one
    could have used the two-word type instead of off_t from the start,
    avoiding the pain of _FILE_OFFSET_BITS etc.

    Concerning signedness: Forth also supports signed and unsigned cells
    and double-cells. This does not cause portability problems, because
    the signedness of a value does not change between platforms.
    Signedness bugs are easy to miss, however.

    That's because Forth has only two integer types: cell (a machine word)
    and double cell (two machine words); and if you use one instead of the
    other, the code fails, whatever the cell size is.

    Same as <old> FORTRAN.

    According to the information discussed here recently, FORTRAN uses the
    same approach on byte-addressed machines as Java: 32-bit INTEGERs,
    32-bit REALs, 64-bit DOUBLEs. No word-sized INTEGERs in FORTRAN.

    BTW, in Forth the FP sizes are not related to integer sizes; this does
    not cause portability problems in my experience, but I have
    experienced FP-related portability problems, typically coming from the assumption that an FP value consumes a power-of-two number of bytes in
    memory (there are systems with 10-byte floats).

    By contrast, in the C code we have to deal with a large number of
    integer types (not just int, long, etc., but also, e.g., off_t), with
    the relations between the types being different on different
    platforms, or, in the case of off_t, also depending on #defines. On one
    machine some function parameter was a long or whatever, on a different
    one it was a bla_t or whatever. Of course, these days one might
    target only Linux and MacOS and reach >99% of desktops and servers
    (the result runs on Windows through WSL2), but that solves the problem
    ^only
    by reducing the portability requirements.

    Blame goes to:: ISO/IEC 9899:1999 for trying to accommodate everyone
    and ending up screwing everyone.

    I don't think that blaming anyone is useful. One can, however, think
    about what contributed to the portability problems and what
    alternative approaches would have avoided them.

    The machine-word-oriented B proved insufficient for the byte-addressed
    PDP-11, so Ritchie added types and C was born. There was int (the
    machine word) and char (the byte). Because in B p+1 means the next
    machine word after p, and Ritchie wanted to preserve this, C also has
    typed pointers: int * and char *. long was added because int is
    occasionally too small on the PDP-11.

    One way to avoid the portability problems would have been to define
    int and pointers to be machine words and long to be two machine
    words. In this scenario, as long as machine-internal data is
    accessed, there would not be portability problems: pid_t, uid_t,
    etc. would all be ints. There would be problems when exchanging data
    with other machines. E.g., a file system probably wants
    architecture-independent data, and would spend, say, 32 bits on the
    uid. But at least these issues would be limited to the code that
    accesses these file systems (at least if the programmer isolates these accesses).

    But C did not go there, and instead made long 32 bits long on both
    16-bit machines and on 32-bit machines, with the result that lseek(),
    which produced and consumed a long, could only deal with 2GB files.
    Good enough at the start, but limiting later, so at some point off_t
    and the whole _FILE_OFFSET_BITS mess had to be introduced.
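
    For what it's worth, the workaround as it ended up looking on
    glibc-style systems (a sketch; the macro must appear before any system
    header, and "big.dat" is just a placeholder):

        #define _FILE_OFFSET_BITS 64   /* make off_t, lseek, ... 64-bit even
                                          on a 32-bit platform              */
        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("big.dat", O_RDONLY);
            if (fd < 0)
                return 1;
            off_t end = lseek(fd, 0, SEEK_END);   /* works past 2 GB */
            printf("size: %lld bytes\n", (long long)end);
            close(fd);
            return 0;
        }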

    Another way to avoid the portability problems would have been to go
    for special-purpose types like off_t from the start and make all
    integer types incompatible, i.e., require explicit instead of implicit conversion between them. That (along with appropriate teaching
    material) would make it clear that conversion should be avoided where
    possible, which in turn would reduce the dependencies on relations
    between type sizes. However, going full-bore in this direction when
    coming from B was probably incompatible with Ritchie's apparent goal
    of using B code with as few changes as possible.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 10 19:01:59 2025
    From Newsgroup: comp.arch

    On 10/9/2025 9:26 PM, EricP wrote:
    BGB wrote:
    On 10/3/2025 11:40 AM, EricP wrote:

    The issue with FP8 support seems to be that everyone who wants it also
    wants their own definition so no matter what you do, it will be unused.


    As for FP8:
    There are multiple formats in use:
        S.E3.M4: Bias=7 (Quats / Unit Vectors)
        S.E3.M4: Bias=8 (Audio)
        S.E4.M3: Bias=7 (NN's)
        E4.M4: Bias=7   (HDR images)

    It's not just the memory formats, it's also the operations.
    In FP8 few may want to waste 1/8th of the encode space on NaN's.
    Maybe not sticky infinity, rather saturate at max but not stick there.
    Maybe no negative zero.

    All of those encodings might be reallocated to values more useful
    for that application.


    In my uses, the 8-bit formats lacked Inf/NaN or subnormals.
    To what extent NaN existed, it was encoded as -0.


    They won't want to calculate single argument transcendentals like tan(x), they will use 256 byte lookup tables.
    The multi-operand functions ADD, SUB, MUL, would be faster in hardware
    than 64kB lookup tables.


    Different ways exist.

    In many cases, directly performing computations on FP8 was insufficient,
    so generally Binary16 or similar was used as the intermediate working
    format.


    Also a lot of these are used in matrix ops - super-duper-SIMD.


    Yeah.

    Though, in a 3D model format of mine from not too long ago, I was using
    a mix of FP8 and Joint-Exponent formats.

    So, there are a lot of possible use cases.



    X/Y/Z coords: Joint exponent.
    3x 9-bit, denormalized, 5 bit shared exponent.
    Unpacked to 3x Binary16.
    Similar to the RGB9_E5 format,
    except the values were also sign-extended.
    S/T coords: joint exponent (sorta).
    The scheme for texture coords is more convoluted.
    Normal: 3x FP8A

    There were skeletal animations, with poses also mostly stored as FP8A.
    I tested a few options and noted that storing each rotation quaternion
    as FP8A was the most accurate option.

    While, taken individually, the average absolute error of an FP8A value
    was worse than that of an 8-bit byte (where -127..127 maps to -1.0..1.0),
    after normalization the FP8A vector was more accurate. This could be
    further improved by jittering the scaling and rounding slightly and
    looking for the vector that produced a result closer to the original
    after normalization.
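
    A rough sketch of the quantize-then-renormalize step (without the
    jittered search), assuming "FP8A" is the S.E3.M4, bias-8 flavour from
    the earlier list, with no Inf/NaN/subnormals and, in this sketch, no
    exact zero; BGB's actual format may differ:

        #include <math.h>

        static unsigned char fp8a_encode(float x)
        {
            unsigned s = x < 0.0f;
            int e;
            float m = frexpf(fabsf(x), &e);   /* |x| = m * 2^e, 0.5 <= m < 1 */
            int exp = e - 1 + 8;              /* stored exponent, bias 8     */
            if (x == 0.0f || exp < 0)
                return (unsigned char)(s << 7);        /* smallest magnitude */
            if (exp > 7) { exp = 7; m = 1.0f - 1.0f/32; }        /* saturate  */
            int man = (int)((m * 2.0f - 1.0f) * 16.0f + 0.5f);   /* round     */
            if (man > 15) {                   /* mantissa overflowed: carry   */
                man = 0;
                if (++exp > 7) { exp = 7; man = 15; }
            }
            return (unsigned char)((s << 7) | (exp << 4) | man);
        }

        static float fp8a_decode(unsigned char v)
        {
            int s = (v >> 7) & 1, e = (v >> 4) & 7, m = v & 15;
            float f = ldexpf(1.0f + m / 16.0f, e - 8);
            return s ? -f : f;
        }

        /* Quantize a quaternion to 4 bytes, then renormalize on decode. */
        static void quat_roundtrip(const float q[4], float out[4])
        {
            float len = 0.0f;
            for (int i = 0; i < 4; i++) out[i] = fp8a_decode(fp8a_encode(q[i]));
            for (int i = 0; i < 4; i++) len += out[i] * out[i];
            len = sqrtf(len);
            if (len > 0.0f)
                for (int i = 0; i < 4; i++) out[i] /= len;
        }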


    Note that for texture-mapped models, it would encode S/T coords and
    infer the normal from the geometry. Base RGB would be assumed white, and vertex color is calculated based on pose and normal (engine didn't use positional dynamic lights, so would calculate colors assuming that light
    comes down from overhead).

    For shaded/untextured meshes, a vertex normal would be stored instead
    (again used to calculate vertex RGB). Texture name would encode the RGB
    base color for the mesh (as "#rrggbb").

    In a more advanced engine, one might use the normal vectors directly and calculate shading based on color + normal + light-source. But, this is
    slower and typically requires per-light rendering passes, etc.

    Or, basically a lot of the stuff that makes Doom3 slow (or, like, a game
    that came out in 2003, but it took until 2015 before "mere mortal"
    computers could run it at decent framerates...).



    The encoding of vertex coords here differed from the Quake engine
    family, which typically encoded them as 3x BYTE relative to a
    bounding-box; except for Quake3 which went to 16-bit values (still
    relative to a bounding-box).


    Though on a PC, there is the downside that normally OpenGL only really
    allows Binary32 or small integer values mapped to unit-range for vertex
    arrays (it would be useful, say, to be able to use HALF or one of the
    joint-exponent formats here; seemingly OpenGL decided a lot of these are
    only for HDR color data or similar, and not for vertex coords...).


    For TKRA-GL, it allows some more compact formats for vertex arrays.

    Partly relevant as even at arguably fairly low triangle counts,
    unpacking animation frames to vertex arrays can still eat a lot of RAM
    (and you don't want to burn CPU time recalculating the model vertices
    every time it is redrawn).

    Though, there is still the option of using GL_BYTE or similar and
    scaling via the transformation matrix.
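
    As a sketch of that option with standard calls (the attribute index,
    stride, and scale factor are made up here, and a GL 2.0+ function loader
    is assumed):

        #include <GL/gl.h>

        void bind_byte_positions(const void *verts, float bbox_half_extent)
        {
            /* 3 signed bytes per vertex, not normalized: the shader sees
               -128..127 and the model matrix carries the scale, e.g.
               bbox_half_extent / 127, back to model units.              */
            glVertexAttribPointer(0, 3, GL_BYTE, GL_FALSE, 4, verts);
            glEnableVertexAttribArray(0);
            (void)bbox_half_extent;   /* folded into the matrix elsewhere */
        }

    With normalized = GL_TRUE the bytes would instead arrive as -1..1 and
    the matrix scale would change accordingly.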


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Oct 11 10:01:44 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:

    C programmers changed the programs to run on
    I32LP64 (this was called "making them 64-bit-clean"). And until that
    was done, ILP32 was used.

    The problem with 64-bit INTEGERs for Fortran is that they make REAL
    unusable for lots of existing code.

    Nonsense::

    CDC only had Double Precision FP data (60-bit)
    with 18-bit integers
    CRAY only had Double Precision FP data (64-bit)
    with 24-bit integers

    By the time the 64-bit workstations appeared, vector computers
    were very much on the way out, and people had long since gotten
    used to 32-bit reals and 64-bit floating points. Making a 64-bit
    integer would have required a 128-bit double precision, and you
    know how popular that is - only one vendor has it, and there it
    is more of an add-on to their decimal float unit, and hence
    very slow (but still faster than software emulation).


    {{Even numerical analysts liked Seymour's 60-bit and 64-bit arithmetic compared to 32-bit IBM and 36-bit Univac FP arithmetic--even with those littered with huge mistakes we would not allow today.}}

    The size of FORTRAN INTEGERs is something the FORTRAN people have to
    decide, and I made no statement on that.

    If FORTRAN programs make the assumptions that sizeof(int)==4, maybe
    you should tell the FORTRAN programmers something along these lines:
    "it is a standards violation, anyway. Only people who like to play
    these kind of games are caught."

    FORTRAN programmers think of integer as 1 storage container--even on
    CDC and CRAY.

    That's what the standard says.

    The integer in memory is 60 or 64 bits, the integer in a
    register is 18 or 24 bits. FORTRAN programmers do not have problems
    with putting six 6-bit characters in a PDP-10 memory container, or ten
    6-bit field-data characters in one CDC memory container.

    Most of them have learned to use CHARACTER by now, it's only been
    47 years :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2