• Compilers and flags (was: Concertina III May Be Returning)

    From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 5 09:33:06 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers don't
    know about those side effect flags and will always emit a CMP or TST first.

    Compilers certainly have problems with single flag registers, as they
    run contrary to the base assumption of register allocation. But you
    don't need full-blown tracking of flags in order to make use of flags
    side effects in compilers. Plain peephole optimization can be good
    enough. E.g., if you have

    if (a+b<0) ...

    the compiler may naively translate this to

    add tmp = a, b
    tst tmp
    bge cont

    The peephole optimizer can have a rule that says that this is
    equivalent to

    add tmp = a, b
    bge cont

    When I compile

    long foo(long a, long b)
    {
      if (a+b<0)
        return a-b;
      else
        return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret

    Look, Ma, no tst.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Sep 5 11:00:55 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers don't
    know about those side effect flags and will always emit a CMP or TST first.

    Compilers certainly have problems with single flag registers, as they
    run contrary to the base assumption of register allocation. But you
    don't need full-blown tracking of flags in order to make use of flags
    side effects in compilers. Plain peephole optimization can be good
    enough. E.g., if you have

    if (a+b<0) ...

    the compiler may naively translate this to

    add tmp = a, b
    tst tmp
    bge cont

    The peephole optimizer can have a rule that says that this is
    equivalent to

    add tmp = a, b
    bge cont

    When I compile

    long foo(long a, long b)
    {
      if (a+b<0)
        return a-b;
      else
        return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret

    Look, Ma, no tst.

    - anton

    This could be 1 MOV shorter.
    It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
    Just ADD %rsi,%rdi and after that use the %rax copy.
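
    Spelled out, that would be seven instructions instead of eight,
    something like this (a hand-written sketch, not actual compiler output):

    foo:
            mov  %rdi,%rax      # rax = a, kept live for the sub/imul below
            add  %rsi,%rdi      # rdi = a+b, flags set as a side effect
            js   .Lneg          # branch on the sign of a+b, no tst needed
            imul %rsi,%rax      # rax = a*b
            ret
    .Lneg:  sub  %rsi,%rax      # rax = a-b
            ret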

    For that optimization { ADD CMP Bcc } => { ADD Bcc }
    to work those three instructions must be adjacent.
    In this case it wouldn't make a difference, but in general I think
    they would want the freedom to move code about and not have the ADD
    bound to the Bcc too early, so this would have to be about the very
    last optimization so that it doesn't interfere with code motion.

    The Microsoft compiler uses LEA to do the add, which doesn't change
    flags, so even if it has a flags optimization it would not detect it:

    long foo(long,long) PROC ; foo, COMDAT
    lea eax, DWORD PTR [rcx+rdx]
    test eax, eax
    jns SHORT $LN2@foo
    sub ecx, edx
    mov eax, ecx
    ret 0
    $LN2@foo:
    imul ecx, edx
    mov eax, ecx
    ret 0

    Also, if MS had moved ecx to eax first, as GCC does, the function
    result could land in eax, eliminating the final two MOV eax,ecx
    instructions.
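
    For illustration, a hand-written sketch (not actual MSVC output) with
    the result built in eax from the start, still using LEA for the add:

    long foo(long,long) PROC                 ; sketch, not compiler output
        mov eax, ecx                         ; eax = a, the result register
        lea ecx, DWORD PTR [rcx+rdx]         ; ecx = a+b, flags untouched
        test ecx, ecx
        jns SHORT $LN2@foo
        sub eax, edx                         ; eax = a-b
        ret 0
    $LN2@foo:
        imul eax, edx                        ; eax = a*b
        ret 0

    The one extra MOV up front replaces the two trailing MOV eax,ecx.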


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Sep 5 15:51:13 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    That shows about 12% instructions are conditional branch and 9% CMP.
    That says to me that almost all Bcc are paired with a CMP,
    and very few use the flags set as a side effect of ALU ops.

    I would expect those two numbers to be closer as even today compilers don't
    know about those side effect flags and will always emit a CMP or TST first.

    Compilers certainly have problems with single flag registers, as they
    run contrary to the base assumption of register allocation. But you
    don't need full-blown tracking of flags in order to make use of flags
    side effects in compilers. Plain peephole optimization can be good
    enough. E.g., if you have

    if (a+b<0) ...

    the compiler may naively translate this to

    add tmp = a, b
    tst tmp
    bge cont

    The peephole optimizer can have a rule that says that this is
    equivalent to

    add tmp = a, b
    bge cont

    When I compile

    long foo(long a, long b)
    {
      if (a+b<0)
        return a-b;
      else
        return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret

    Look, Ma, no tst.

    - anton

    This could be 1 MOV shorter.
    It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
    Just ADD %rsi,%rdi and after that use the %rax copy.

    foo:
    ADD R3,R1,R2
    PLT0 R3,TF
    ADD R1,R1,-R2
    MUL R1,R1,R2
    RET

    5 inst versus 8.

    For that optimization { ADD CMP Bcc } => { ADD Bcc }
    to work those three instructions must be adjacent.
    In this case it wouldn't make a difference, but in general I think
    they would want the freedom to move code about and not have the ADD
    bound to the Bcc too early, so this would have to be about the very
    last optimization so that it doesn't interfere with code motion.

    The Microsoft compiler uses LEA to do the add, which doesn't change
    flags, so even if it has a flags optimization it would not detect it:

    long foo(long,long) PROC ; foo, COMDAT
    lea eax, DWORD PTR [rcx+rdx]
    test eax, eax
    jns SHORT $LN2@foo
    sub ecx, edx
    mov eax, ecx
    ret 0
    $LN2@foo:
    imul ecx, edx
    mov eax, ecx
    ret 0

    5 versus 9

    Also, if MS had moved ecx to eax first, as GCC does, the function
    result could land in eax, eliminating the final two MOV eax,ecx
    instructions.

    still 5 versus 7
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Sep 5 16:13:47 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    When I compile

    long foo(long a, long b)
    {
      if (a+b<0)
        return a-b;
      else
        return a*b;
    }

    with gcc-12.2.0 -O -c on AMD64, I get

    0000000000000000 <foo>:
    0: 48 89 f8 mov %rdi,%rax
    3: 48 89 fa mov %rdi,%rdx
    6: 48 01 f2 add %rsi,%rdx
    9: 78 05 js 10 <foo+0x10>
    b: 48 0f af c6 imul %rsi,%rax
    f: c3 ret
    10: 48 29 f0 sub %rsi,%rax
    13: c3 ret
    ...
    This could be 1 MOV shorter.
    It didn't need to MOV %rdi, %rdx as it already copied rdi to rax.
    Just ADD %rsi,%rdi and after that use the %rax copy.

    Yes, I often see more register-register moves in gcc-generated code
    than necessary.

    For that optimization { ADD CMP Bcc } => { ADD Bcc }
    to work those three instructions must be adjacent.
    In this case it wouldn't make a difference, but in general I think
    they would want the freedom to move code about and not have the ADD
    bound to the Bcc too early, so this would have to be about the very
    last optimization so that it doesn't interfere with code motion.

    Yes, possible. When I look at what clang-14.0.6 -O -c produces, it's
    this:

    0000000000000000 <foo>:
    0: 48 89 f9 mov %rdi,%rcx
    3: 48 29 f1 sub %rsi,%rcx
    6: 48 89 f0 mov %rsi,%rax
    9: 48 0f af c7 imul %rdi,%rax
    d: 48 01 fe add %rdi,%rsi
    10: 48 0f 48 c1 cmovs %rcx,%rax
    14: c3 ret

    clang seems to prefer using cmov. The interesting thing here is that
    it puts the add right in front of the cmovs, after the code for "a-b"
    and "a*b". When I do

    long foo(long a, long b)
    {
      if (a+b*111<0)
        return a-b;
      else
        return a*b;
    }

    clang produces this code:

    0000000000000000 <foo>:
    0: 48 6b ce 6f imul $0x6f,%rsi,%rcx
    4: 48 89 f8 mov %rdi,%rax
    7: 48 29 f0 sub %rsi,%rax
    a: 48 0f af f7 imul %rdi,%rsi
    e: 48 01 f9 add %rdi,%rcx
    11: 48 0f 49 c6 cmovns %rsi,%rax
    15: c3 ret

    I.e., rcx=b*111 is first, but a+rcx is late, right before the cmovns.
    So it seems to have some mechanism for keeping the add and the
    cmov(n)s as one unit.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2