• Calling conventions (particularly 32-bit ARM)

    From David Brown@21:1/5 to All on Mon Jan 6 14:57:51 2025
    I'm trying to understand the reasoning behind some of the calling
    conventions used with 32-bit ARM. I work primarily with small embedded systems, so the efficiency of code on 32-bit Cortex-M devices is very
    important to me - good calling conventions make a big difference.

    No doubt most people here know this already, but in summary these
    devices are a 32-bit load/store RISC architecture with 16 registers.
    R0-R3 and R12 are scratch/volatile registers, R4-R11 are preserved
    registers, R13 is the stack pointer, R14 is the link register and R15 is
    the program counter. For most Cortex-M cores, there is no
    super-scaling, out-of-order execution, speculative execution, etc., but instructions are pipelined.

    The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as
    32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.
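
    As a concrete sketch of that case (hypothetical names; the hidden-pointer
    behaviour is the AAPCS rule described above):

        #include <cstdint>

        struct result_t { std::uint32_t value; std::uint32_t status; };  // 8 bytes

        // A composite larger than 4 bytes: the caller allocates a slot on its
        // stack and passes its address in R0 as a hidden first argument.
        result_t read_sensor(int channel)
        {
            // stand-in body; a real driver would read hardware registers here
            return { static_cast<std::uint32_t>(channel) * 2u, 0u };
        }

        std::uint32_t poll(int channel)
        {
            result_t r = read_sensor(channel);  // caller reserves 8 bytes of stack,
                                                // callee stores through the pointer,
                                                // caller loads the fields back
            return r.status == 0u ? r.value : 0u;
        }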

    Is there any good reason why the ABI is designed with such limited
    register usage for returns? Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a
    /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument passing?

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than
    using C-style error codes or passing manual pointers to return value
    slots. But the limited return registers add significant overhead to
    small functions.
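
    For instance (a minimal sketch with made-up names; std::optional is used
    here since std::expected needs C++23):

        #include <cstdint>
        #include <optional>

        // sizeof(std::optional<std::uint32_t>) is typically 8: the value plus an
        // "engaged" flag and padding.  As a composite larger than 4 bytes it is
        // returned via a caller-allocated temporary (address in R0), not in R0:R1.
        std::optional<std::uint32_t> try_read(unsigned reg)
        {
            if (reg > 15u)
                return std::nullopt;      // stand-in validity check
            return reg * 4u;              // stand-in "register contents"
        }

        std::uint32_t read_or_zero(unsigned reg)
        {
            auto v = try_read(reg);       // caller reserves the 8-byte slot
            return v ? *v : 0u;           // and reloads the result from memory
        }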


    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?


    Thanks for any pointers or explanations here.

  • From Theo@21:1/5 to David Brown on Mon Jan 6 15:23:40 2025
    David Brown <david.brown@hesbynett.no> wrote:
    The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as 32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    According to EABI, it's also possible to return a 128-bit vector in R0-R3: https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst#result-return

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns? Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument passing?

    The 'composite type' return value, where a pointer is passed in as the first argument to the function and a struct at that pointer is filled in with the return values, has existed since the first ARM ABI [*] - APCS-R: http://www.riscos.com/support/developers/dde/appf.html

    That dates from the mid 1980s before 'modern compilers', and I'm guessing
    that has stuck around. A lot of early ARM code was in assembler. The
    original ARMCC was good but fairly basic - GCC didn't support ARM until
    about 1993.

    [*] technically APCS-R was the second ARM ABI, APCS-A was the first: https://heyrick.eu/assembler/apcsintro.html
    but I don't think return value handling was any different.

    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?

    Probably the latter. Also that AArch64 was an opportunity to throw all this stuff away and start again, with a much richer calling convention: https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#result-return

    but obviously that's no help to the microcontroller folks. At this stage, a change of calling convention might be a fairly big ask.

    Theo

  • From Anton Ertl@21:1/5 to David Brown on Mon Jan 6 15:32:04 2025
    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Most calling conventions on RISCs are oriented towards C (if you want
    calling conventions that try to be more cross-language (and slower),
    look at VAX) and its properties and limitations at the time when the
    calling convention was designed, in particular, the PCC
    implementation, which was the de-facto standard Unix C compiler at the
    time. C compilers in the 1980s did not allocate structs to registers,
    so passing structs in registers was foreign to them, so the solution
    is that the caller passes the target struct as an additional
    parameter.
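
    Written out by hand, that convention amounts to roughly the following
    sketch (hypothetical names):

        #include <cstdint>

        struct pair32 { std::uint32_t a, b; };

        // What "return a struct" turns into: the result slot is an extra pointer
        // parameter, and the callee stores through it.
        void myfunc_as_lowered(pair32* ret, int arg1, int arg2)
        {
            ret->a = static_cast<std::uint32_t>(arg1);
            ret->b = static_cast<std::uint32_t>(arg2);
        }

        void caller(pair32& mystruct, int arg1, int arg2)
        {
            pair32 tmp;                          // caller-allocated return slot
            myfunc_as_lowered(&tmp, arg1, arg2);
            mystruct = tmp;                      // copy-out at the call site
        }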

    And passing the return value in registers might not have saved
    anything on a compiler that does not deal with structs in registers.
    E.g., if you have

    mystruct = myfunc(arg1, arg2);

    you would see stores to mystruct behind the call. With the PCC
    calling convention, the same stores would happen in the caller
    (possibly resulting in smaller code if there are several calls to
    myfunc()).

    I wonder, though, how things look for

    mystruct = foo(&mystruct);

    Does PCC perform the return stores to mystruct only after performing
    all other memory accesses in foo? Probably yes, anything else would
    complicate the compiler. In that case the caller could pass &mystruct
    for the return value (a slight complication). But is that restriction reflected in the calling convention?
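
    A small made-up example of why passing &mystruct straight through as the
    return slot is only safe under that restriction:

        #include <cstdint>

        struct pair32 { std::uint32_t a, b; };

        pair32 foo(const pair32* in)
        {
            pair32 out;
            out.a = in->b + 1;   // if 'out' were the caller's mystruct itself,
            out.b = in->a + 1;   // this read of in->a would see the value just
                                 // stored above, not the original - so the caller
                                 // needs a separate temporary unless the callee is
                                 // known to do all of its return stores last
            return out;
        }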

    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    gcc has an option -freg-struct-return, which does what you want. Of
    course, if you use this option on ARM A32/T32, you are not following
    the calling convention, so you should only use it when all sides of a
    struct return are compiled with that option.
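
    A sketch of how that plays out (compiler name and flags are illustrative,
    and every translation unit that crosses such a call boundary has to be
    built the same way):

        // arm-none-eabi-g++ -mcpu=cortex-m4 -O2 -freg-struct-return pair.cpp
        #include <cstdint>

        struct pair32 { std::uint32_t lo, hi; };

        // Default AAPCS: returned through a hidden pointer passed in R0.
        // With -freg-struct-return the intent is that a struct this small comes
        // back in R0 (lo) and R1 (hi) instead - whether a given gcc honours this
        // for AAPCS targets is worth checking before relying on it.
        pair32 split(std::uint64_t x)
        {
            return { static_cast<std::uint32_t>(x),
                     static_cast<std::uint32_t>(x >> 32) };
        }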

    Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values. Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
    of memory.

    ARM A32 is from 1985, and its calling convention is probably not much
    younger.

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions. IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers. Not sure why they were so
    reluctant to use more registers earlier.

    In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than using C-style error codes or passing manual pointers to return value
    slots.

    The ARM calling convention is certainly much older than "modern C++ programming".

    But the limited return registers add significant overhead to
    small functions.

    C++ programmers think they know what C programming is about (and
    unfortunately they dominate not just C++ compiler writers, but they
    also damage C compilers while they are at it), so my sympathy for your
    problem is very limited.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From MitchAlsup1@21:1/5 to David Brown on Mon Jan 6 20:10:13 2025
    On Mon, 6 Jan 2025 13:57:51 +0000, David Brown wrote:

    I'm trying to understand the reasoning behind some of the calling
    conventions used with 32-bit ARM. I work primarily with small embedded systems, so the efficiency of code on 32-bit Cortex-M devices is very important to me - good calling conventions make a big difference.

    No doubt most people here know this already, but in summary these
    devices are a 32-bit load/store RISC architecture with 16 registers.
    R0-R3 and R12 are scratch/volatile registers, R4-R11 are preserved
    registers, R13 is the stack pointer, R14 is the link register and R15 is
    the program counter. For most Cortex-M cores, there is no
    super-scaling,
    SuperScalar
    out-of-order execution, speculative execution, etc., but instructions are pipelined.

    The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as 32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
    parts of structs.

    Someone above mentioned a trick to pass back a 128-bit value.

    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for
    the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs
    that are made up of two 32-bit parts.

    I have seen subroutines that returned structures where the point
    in the subroutine that puts values in the returned structure is
    such that putting the structure in registers is less efficient
    than returning it through memory--it all depends on how
    the struct is laid out in memory. Doing the struct field
    assignments in the middle of the subroutine (long path to return) is
    often enough to sway which is more efficient.
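
    A hypothetical illustration of that trade-off: a result field that is
    known early on a long path can be stored into a memory return slot and
    forgotten, while a register return keeps it live (or spilled) until the
    end.

        #include <cstdint>

        struct stats { std::uint32_t first; std::uint32_t sum; };

        std::uint32_t expensive(std::uint32_t x)
        {
            return x * x + 1u;      // stand-in for a longer computation
        }

        stats analyse(const std::uint32_t* data, std::uint32_t n)   // n assumed > 0
        {
            stats s;
            s.first = data[0];      // known at the start: with a memory return slot
                                    // this can be stored right away and its register
                                    // reused; a register return keeps it live (or
                                    // spilled) across the whole loop below
            std::uint32_t sum = 0;
            for (std::uint32_t i = 1; i < n; ++i)
                sum += expensive(data[i]);
            s.sum = sum;
            return s;
        }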

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Vogue in 1980 was to have 1 result passed back from subroutines.

    Newer ABIs like RISC-V 32-bit and x86_64
    can at least use two registers for return values.

    My 66000 can pass up to 8 registers back as an aggregate result.

    Modern compilers are
    quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
    of memory. Can anyone give me an explanation why return types can't
    simply use all the same registers that are available for argument
    passing?

    In My 66000 ABI they can and do.

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
    way of dealing safely with status and multiple return values rather than using C-style error codes or passing manual pointers to return value
    slots. But the limited return registers add significant overhead to
    small functions.

    C++ also has:
    the try-throw-catch exception model, which requires new-and-fun stuff
    to be thrown onto the stack,
    constructors and destructors,
    new,
    Atomic stuff.

    Are there good technical reasons for the conventions on 32-bit ARM? Or
    is this all just historical from the days when everything was an "int"
    and that's all anyone ever returned from functions?

    At the time, there were good technical rationales--which may have
    faded in importance as the years go by.


    Thanks for any pointers or explanations here.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Jan 6 20:19:15 2025
    On Mon, 6 Jan 2025 15:32:04 +0000, Anton Ertl wrote:

    David Brown <david.brown@hesbynett.no> writes:
    But the ABI only allows returning a single 32-bit value in R0, or a
    scalar 64-bit value in R0:R1. If a function returns a non-scalar that
    is larger than 32-bit, the caller has to allocate space on the stack for the return type and pass a pointer to that space in R0.

    To my mind, this is massively inefficient, especially when using structs that are made up of two 32-bit parts.

    Is there any good reason why the ABI is designed with such limited
    register usage for returns?

    Most calling conventions on RISCs are oriented towards C (if you want
    calling conventions that try to be more cross-language (and slower),
    look at VAX) and its properties and limitations at the time when the
    calling convention was designed, in particular, the PCC
    implementation, which was the de-facto standard Unix C compiler at the
    time. C compilers in the 1980s did not allocate structs to registers,
    so passing structs in registers was foreign to them, so the solution
    is that the caller passes the target struct as an additional
    parameter.

    And passing the return value in registers might not have saved
    anything on a compiler that does not deal with structs in registers.
    E.g., if you have

    mystruct = myfunc(arg1, arg2);

    you would see stores to mystruct behind the call. With the PCC
    calling convention, the same stores would happen in the caller
    (possibly resulting in smaller code if there are several calls to
    myfunc()).

    I wonder, though, how things look for

    mystruct = foo(&mystruct);

    Does PCC perform the return stores to mystruct only after performing
    all other memory accesses in foo? Probably yes, anything else would complicate the compiler. In that case the caller could pass &mystruct
    for the return value (a slight complication). But is that restriction reflected in the calling convention?

    For VERY MANY circumstances passing a struct by address is more
    efficient than passing it by value, AND especially when the
    compiler does not optimize heavily.
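
    A made-up comparison of the two forms (under the AAPCS the by-value
    argument is copied into R0-R3 and the stack, while the by-address form
    passes a single pointer in R0):

        #include <cstdint>

        struct config { std::uint32_t field[16]; };   // 64 bytes

        // By value: every call materialises a 64-byte copy of the argument.
        std::uint32_t checksum_by_value(config c)
        {
            std::uint32_t s = 0;
            for (std::uint32_t f : c.field)
                s += f;
            return s;
        }

        // By address: one pointer in R0, and no copy when the callee only reads it.
        std::uint32_t checksum_by_addr(const config* c)
        {
            std::uint32_t s = 0;
            for (std::uint32_t f : c->field)
                s += f;
            return s;
        }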

    Struct returns were (and AFAIK still are, many decades after
    they were added to C) a relatively rarely used feature, so Johnson
    (PCC's author) probably did not want to waste a lot of effort on
    making it more efficient.

    In addition, the programmer has the choice of changing from value
    form (struct) to pointer form (&struct), which is what we learned
    was better style way back then.

    --------------------------

    I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    There is a tendency towards passing more parameters in registers in
    more recent calling conventions. IA-32 (and IIRC VAX) passes none,
    MIPS uses 4 integer registers (for either integer or FP parameters),
    Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
    integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
    RISC-V has 8 integer and 8 FP registers. Not sure why they were so
    reluctant to use more registers earlier.

    Compiler people were telling us that more callee saved registers would
    be higher performing than more argument registers. It did not turn out
    to be that way.

    Oh and BTW, lack of argument registers leads to an increased
    desire for the linker to perform inline folding. ...



    - anton

  • From Waldek Hebisch@21:1/5 to mitchalsup@aol.com on Tue Jan 7 02:11:45 2025
    MitchAlsup1 <mitchalsup@aol.com> wrote:
    I also think code would be a bit more efficient if there were more registers
    available for parameter passing and as scratch registers - perhaps 6
    would make more sense.

    Basically, here, there is competing pressure between the compiler
    needing a handful of preserved registers, and the compiler being
    more efficient if there were more argument/result passing registers.

    My 66000 ABI has 8 argument registers, 7 temporary registers, 14
    preserved registers, a FP, and a SP. IP is not part of the register
    file. My ABI has a note indicating that the aggregations can be
    altered, just that I need a good reason to change.

    I looked high and low for codes using more than 8 arguments and
    returning aggregates larger than 8 double words, and about the
    only things I found were a handful of []print[]() calls.

    I meet such code with reasonable frequency. I peeked
    semi-randomly into Lapack. The first routine that I looked at
    had 8 arguments, so within your limit. The second is:

    SUBROUTINE ZUNMR3( SIDE, TRANS, M, N, K, L, A, LDA, TAU, C, LDC,
    $ WORK, INFO )

    which has 13 arguments.

    A large number of arguments is typical in old-style Fortran numeric
    code. It also appears in functional-style code, where to get
    around the lack of destructive modification one frequently has to
    double the number of arguments. Another source is closures: when
    looking at the source the captured values are not visible as arguments,
    but the implementation has to pass them behind the scenes.
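
    A C++ illustration of the closure point (hypothetical code): the captured
    values become members of a closure object, and a pointer to that object is
    the hidden argument at each out-of-line call.

        #include <cstdint>

        std::uint32_t poly_sum(const std::uint32_t* data, std::uint32_t n,
                               std::uint32_t a, std::uint32_t b, std::uint32_t c)
        {
            auto f = [a, b, c](std::uint32_t x) {   // closure object holding a, b, c
                return a * x * x + b * x + c;
            };
            std::uint32_t s = 0;
            for (std::uint32_t i = 0; i < n; ++i)
                s += f(data[i]);                    // conceptually, &f rides along as
                                                    // an extra hidden argument
            return s;
        }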

    More generally, large numbers of arguments tend to appear in
    hand-optimized code, where they may lead to faster code than
    using structures in memory. In C, structures in memory are
    not that expensive, so the scope for gain is limited, but several
    languages dynamically allocate all structures (and pass them
    by address). In such cases avoiding dynamic allocation can
    give a substantial gain. Programmers now are much less
    inclined to do micro-optimizations of this sort. But it may
    appear in machine-generated sources.

    --
    Waldek Hebisch
