• Compilers and flags (was: Concertina III May Be Returning)

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 5 09:33:06 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers don't
    know about those side effect flags and will always emit a CMP or TST first.

    Compilers certainly have problems with single flag registers, as they
    run contrary to the base assumption of register allocation. But you
    don't need full-blown tracking of flags in order to make use of flags
    side effects in compilers. Plain peephole optimization can be good
    enough. E.g., if you have

    if (a+b<0) ...

    the compiler may naively translate this to

    add tmp = a, b
    tst tmp
    bge cont

    The peephole optimizer can have a rule that says that this is
    equivalent to

    add tmp = a, b
    bge cont

    When I compile

    long foo(long a, long b)
    {
      if (a+b<0)
        return a-b;
      else
        return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret

    Look, Ma, no tst.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Sep 5 11:00:55 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers don't
    know about those side effect flags and will always emit a CMP or TST first.

    Compilers certainly have problems with single flag registers, as they
    run contrary to the base assumption of register allocation. But you
    don't need full-blown tracking of flags in order to make use of flags
    side effects in compilers. Plain peephole optimization can be good
    enough. E.g., if you have

    if (a+b<0) ...

    the compiler may naively translate this to

    add tmp = a, b
    tst tmp
    bge cont

    The peephole optimizer can have a rule that says that this is
    equivalent to

    add tmp = a, b
    bge cont

    When I compile

    long foo(long a, long b)
    {
      if (a+b<0)
        return a-b;
      else
        return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret

    Look, Ma, no tst.

    - anton

    This could be 1 MOV shorter.
    It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
    Just ADD %rsi,%rdi and after that use the %rax copy.
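
    Spelled out, that would be seven instructions instead of eight,
    something like this (a hand-written sketch, not actual compiler output):

    foo:
            mov  %rdi,%rax      # rax = a, kept live for the sub/imul below
            add  %rsi,%rdi      # rdi = a+b, flags set as a side effect
            js   .Lneg          # branch on the sign of a+b, no tst needed
            imul %rsi,%rax      # rax = a*b
            ret
    .Lneg:  sub  %rsi,%rax      # rax = a-b
            ret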

    For that optimization { ADD CMP Bcc } => { ADD Bcc }
    to work those three instructions must be adjacent.
    In this case it wouldn't make a difference, but in general I think
    they would want the freedom to move code about and not have the ADD
    bound to the Bcc too early, so this would have to be about the very
    last optimization so that it doesn't interfere with code motion.

    The Microsoft compiler uses LEA to do the add, which doesn't change
    flags, so even if it has a flags optimization it would not detect it:

    long foo(long,long) PROC ; foo, COMDAT
    lea eax, DWORD PTR [rcx+rdx]
    test eax, eax
    jns SHORT $LN2@foo
    sub ecx, edx
    mov eax, ecx
    ret 0
    $LN2@foo:
    imul ecx, edx
    mov eax, ecx
    ret 0

    Also, if MS had moved ecx to eax first, as GCC does, the function
    result could land in eax, eliminating the final two MOV eax,ecx
    instructions.
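
    For illustration, a hand-written sketch (not actual MSVC output) with
    the result built in eax from the start, still using LEA for the add:

    long foo(long,long) PROC                 ; sketch, not compiler output
        mov eax, ecx                         ; eax = a, the result register
        lea ecx, DWORD PTR [rcx+rdx]         ; ecx = a+b, flags untouched
        test ecx, ecx
        jns SHORT $LN2@foo
        sub eax, edx                         ; eax = a-b
        ret 0
    $LN2@foo:
        imul eax, edx                        ; eax = a*b
        ret 0

    The one extra MOV up front replaces the two trailing MOV eax,ecx.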


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 5 15:51:13 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers don't
    know about those side effect flags and will always emit a CMP or TST first.

    Compilers certainly have problems with single flag registers, as they
    run contrary to the base assumption of register allocation. But you
    don't need full-blown tracking of flags in order to make use of flags
    side effects in compilers. Plain peephole optimization can be good
    enough. E.g., if you have

    if (a+b<0) ...

    the compiler may naively translate this to

    add tmp = a, b
    tst tmp
    bge cont

    The peephole optimizer can have a rule that says that this is
    equivalent to

    add tmp = a, b
    bge cont

    When I compile

    long foo(long a, long b)
    {
      if (a+b<0)
        return a-b;
      else
        return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret

    Look, Ma, no tst.

    - anton

    This could be 1 MOV shorter.
    It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
    Just ADD %rsi,%rdi and after that use the %rax copy.

    foo:
    ADD R3,R1,R2
    PLT0 R3,TF
    ADD R1,R1,-R2
    MUL R1,R1,R2
    RET

    5 inst versus 8.

    For that optimization { ADD CMP Bcc } => { ADD Bcc }
    to work those three instructions must be adjacent.
    In this case it wouldn't make a difference, but in general I think
    they would want the freedom to move code about and not have the ADD
    bound to the Bcc too early, so this would have to be about the very
    last optimization so that it doesn't interfere with code motion.

    The Microsoft compiler uses LEA to do the add, which doesn't change
    flags, so even if it has a flags optimization it would not detect it:

    long foo(long,long) PROC ; foo, COMDAT
    lea eax, DWORD PTR [rcx+rdx]
    test eax, eax
    jns SHORT $LN2@foo
    sub ecx, edx
    mov eax, ecx
    ret 0
    $LN2@foo:
    imul ecx, edx
    mov eax, ecx
    ret 0

    5 versus 9

    Also, if MS had moved ecx to eax first, as GCC does, the function
    result could land in eax, eliminating the final two MOV eax,ecx
    instructions.

    still 5 versus 7
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 5 16:13:47 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    When I compile

    long foo(long a, long b)
    {
      if (a+b<0)
        return a-b;
      else
        return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret
    ...
    This could be 1 MOV shorter.
    It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
    Just ADD %rsi,%rdi and after that use the %rax copy.

    Yes, I often see more register-register moves in gcc-generated code
    than necessary.

    For that optimization { ADD CMP Bcc } => { ADD Bcc }
    to work those three instructions must be adjacent.
    In this case it wouldn't make a difference, but in general I think
    they would want the freedom to move code about and not have the ADD
    bound to the Bcc too early, so this would have to be about the very
    last optimization so that it doesn't interfere with code motion.

    Yes, possible. When I look at what clang-14.0.6 -O -c produces, it's
    this:

    0000000000000000 <foo>:
    0: 48 89 f9 mov %rdi,%rcx
    3: 48 29 f1 sub %rsi,%rcx
    6: 48 89 f0 mov %rsi,%rax
    9: 48 0f af c7 imul %rdi,%rax
    d: 48 01 fe add %rdi,%rsi
    10: 48 0f 48 c1 cmovs %rcx,%rax
    14: c3 ret

    clang seems to prefer using cmov. The interesting thing here is that
    it puts the add right in front of the cmovs, after the code for "a-b"
    and "a*b". When I do

    long foo(long a, long b)
    {
      if (a+b*111<0)
        return a-b;
      else
        return a*b;
    }

    clang produces this code:

    0000000000000000 <foo>:
    0: 48 6b ce 6f imul $0x6f,%rsi,%rcx
    4: 48 89 f8 mov %rdi,%rax
    7: 48 29 f0 sub %rsi,%rax
    a: 48 0f af f7 imul %rdi,%rsi
    e: 48 01 f9 add %rdi,%rcx
    11: 48 0f 49 c6 cmovns %rsi,%rax
    15: c3 ret

    I.e., rcx=b*111 is first, but a+rcx is late, right before the cmovns.
    So it seems to have some mechanism for keeping the add and the
    cmov(n)s as one unit.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2