That shows about 12% of instructions are conditional branches and 9% are CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
I would expect those two numbers to be closer, as even today compilers don't know about those side-effect flags and will always emit a CMP or TST first.
EricP <ThatWouldBeTelling@thevillage.com> writes:
That shows about 12% of instructions are conditional branches and 9% are CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
I would expect those two numbers to be closer, as even today compilers don't know about those side-effect flags and will always emit a CMP or TST first.
Compilers certainly have problems with single flag registers, as they
run contrary to the base assumption of register allocation. But you
don't need full-blown tracking of flags in order to make use of flags
side effects in compilers. Plain peephole optimization can be good
enough. E.g., if you have
if (a+b<0) ...
the compiler may naively translate this to
add tmp = a, b
tst tmp
bge cont
The peephole optimizer can have a rule that says that this is
equivalent to
add tmp = a, b
bge cont
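As a toy illustration (made up here, not taken from gcc or any real compiler; the instruction representation and register numbers are invented), such a rule can be a single linear scan over adjacent instructions:

/* Toy peephole pass illustrating { ALU ; TST ; Bcc } => { ALU ; Bcc }. */
#include <stdio.h>

enum op { OP_ADD, OP_SUB, OP_TST, OP_BGE };

struct insn {
    enum op op;
    int dest;               /* register written (register tested, for TST) */
    int src1, src2;
};

/* ALU ops whose result also sets the N/Z flags as a side effect. */
static int sets_flags(const struct insn *i)
{
    return i->op == OP_ADD || i->op == OP_SUB;
}

/* Drop a TST whose operand was just produced by a flag-setting ALU op
   and whose flags are immediately consumed by a conditional branch.
   Rewrites the buffer in place and returns the new length. */
static int drop_redundant_tst(struct insn *code, int n)
{
    int out = 0;
    for (int in = 0; in < n; in++) {
        if (code[in].op == OP_TST && out > 0 &&
            sets_flags(&code[out - 1]) &&
            code[out - 1].dest == code[in].dest &&
            in + 1 < n && code[in + 1].op == OP_BGE)
            continue;                    /* flags are already correct */
        code[out++] = code[in];
    }
    return out;
}

int main(void)
{
    /* add tmp = a, b ; tst tmp ; bge cont   (tmp = r2, a = r0, b = r1) */
    struct insn code[] = {
        { OP_ADD, 2, 0, 1 },
        { OP_TST, 2, 0, 0 },
        { OP_BGE, 0, 0, 0 },
    };
    printf("%d instructions left\n", drop_redundant_tst(code, 3));  /* prints 2 */
    return 0;
}

A real compiler would of course match on its own IR rather than on a toy encoding like this, but the rule itself is that simple.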
When I compile
long foo(long a, long b)
{
if (a+b<0)
return a-b;
else
return a*b;
}
with gcc-12.2.0 -O -c on AMD64, I get
0000000000000000 <foo>:
0: 48 89 f8 mov %rdi,%rax
3: 48 89 fa mov %rdi,%rdx
6: 48 01 f2 add %rsi,%rdx
9: 78 05 js 10 <foo+0x10>
b: 48 0f af c6 imul %rsi,%rax
f: c3 ret
10: 48 29 f0 sub %rsi,%rax
13: c3 ret
Look, Ma, no tst.
- anton
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
That shows about 12% of instructions are conditional branches and 9% are CMP.
That says to me that almost all Bcc are paired with a CMP,
and very few use the flags set as a side effect of ALU ops.
I would expect those two numbers to be closer, as even today compilers don't
know about those side-effect flags and will always emit a CMP or TST first.
Compilers certainly have problems with single flag registers, as they
run contrary to the base assumption of register allocation. But you
don't need full-blown tracking of flags in order to make use of flags
side effects in compilers. Plain peephole optimization can be good
enough. E.g., if you have
if (a+b<0) ...
the compiler may naively translate this to
add tmp = a, b
tst tmp
bge cont
The peephole optimizer can have a rule that says that this is
equivalent to
add tmp = a, b
bge cont
When I compile
long foo(long a, long b)
{
if (a+b<0)
return a-b;
else
return a*b;
}
with gcc-12.2.0 -O -c on AMD64, I get
0000000000000000 <foo>:
0: 48 89 f8 mov %rdi,%rax
3: 48 89 fa mov %rdi,%rdx
6: 48 01 f2 add %rsi,%rdx
9: 78 05 js 10 <foo+0x10>
b: 48 0f af c6 imul %rsi,%rax
f: c3 ret
10: 48 29 f0 sub %rsi,%rax
13: c3 ret
Look, Ma, no tst.
- anton
This could be 1 MOV shorter.
It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
Just ADD %rsi,%rdi and after that use the %rax copy.
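That is, something along these lines (a hand-edited sketch, not actual gcc output):
mov    %rdi,%rax        # rax = a, kept for the result
add    %rsi,%rdi        # rdi = a+b, sets SF as a side effect
js     .Lneg
imul   %rsi,%rax        # a*b
ret
.Lneg:
sub    %rsi,%rax        # a-b
ret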
For that optimization { ADD CMP Bcc } => { ADD Bcc }
to work those three instructions must be adjacent.
In this case it wouldn't make a difference, but in general
I think they would want the freedom to move code about and not have
the ADD bound to the Bcc too early, so this would have to be about
the very last optimization pass so that it doesn't interfere with code motion.
The Microsoft compiler uses LEA to do the add, which doesn't change the flags,
so even if it has a flags optimization it would not detect this case:
long foo(long,long) PROC ; foo, COMDAT
lea eax, DWORD PTR [rcx+rdx]
test eax, eax
jns SHORT $LN2@foo
sub ecx, edx
mov eax, ecx
ret 0
$LN2@foo:
imul ecx, edx
mov eax, ecx
ret 0
Also, if MS had moved ecx to eax first, as GCC does, then the function result
would land in eax and the final two MOV eax, ecx instructions could be eliminated.
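Something like this (a hand-written sketch, not actual MSVC output) would give that:
mov eax, ecx                    ; result register holds a from the start
lea ecx, DWORD PTR [rcx+rdx]    ; a+b (rcx still holds the value of a)
test ecx, ecx
jns SHORT $LN2@foo
sub eax, edx                    ; a-b is already in eax
ret 0
$LN2@foo:
imul eax, edx                   ; a*b is already in eax
ret 0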