I'm trying to understand the reasoning behind some of the calling
conventions used with 32-bit ARM. I work primarily with small embedded systems, so the efficiency of code on 32-bit Cortex-M devices is very important to me - good calling conventions make a big difference.
No doubt most people here know this already, but in summary these
devices are a 32-bit load/store RISC architecture with 16 registers.
R0-R3 and R12 are scratch/volatile registers, R4-R11 are preserved
registers, R13 is the stack pointer, R14 is the link register and R15 is
the program counter. For most Cortex-M cores, there is no
superscalar execution, out-of-order execution, speculative execution, etc., but instructions are pipelined.
The big problem I see is the registers used for returning values from functions. R0-R3 can all be used for passing arguments to functions, as 32-bit (or smaller) values, pointers, in pairs as 64-bit values, and as
parts of structs.
But the ABI only allows returning a single 32-bit value in R0, or a
scalar 64-bit value in R0:R1. If a function returns a non-scalar that
is larger than 32-bit, the caller has to allocate space on the stack for
the return type and pass a pointer to that space in R0.
To my mind, this is massively inefficient, especially when using structs
that are made up of two 32-bit parts.
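To make the two-word case concrete, here is a minimal sketch. The names Pair32, divmod_struct, divmod_packed and unpack are made up for illustration. Under the return rules described above, the first function receives the address of a caller-allocated result slot in R0, while the second returns a 64-bit scalar in R0:R1. Packing by hand is one possible workaround, at some cost in readability.

```cpp
#include <cstdint>

// Hypothetical two-word struct: 8 bytes, non-scalar.
struct Pair32 {
    std::uint32_t lo;
    std::uint32_t hi;
};

// Struct return: per the rules described above, the caller allocates
// space for the result and passes a pointer to it in R0.
Pair32 divmod_struct(std::uint32_t n, std::uint32_t d) {
    return Pair32{ n / d, n % d };
}

// Workaround: pack both halves into a 64-bit scalar, which is
// returned in R0:R1. Quotient in the low word, remainder in the high word.
std::uint64_t divmod_packed(std::uint32_t n, std::uint32_t d) {
    return (static_cast<std::uint64_t>(n % d) << 32) | (n / d);
}

// Split the packed value back into its two 32-bit parts.
inline Pair32 unpack(std::uint64_t v) {
    return Pair32{ static_cast<std::uint32_t>(v),
                   static_cast<std::uint32_t>(v >> 32) };
}
```

Comparing the generated code for the two variants with an ARM cross-compiler at -O2 should show whether the difference matters for a given target.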
Is there any good reason why the ABI is designed with such limited
register usage for returns?
Newer ABIs like RISC-V 32-bit and x86_64
can at least use two registers for return values.
Modern compilers are
quite happy breaking structs into parts in individual registers - it's a /long/ time since they insisted that structs occupied a contiguous block
of memory. Can anyone give me an explanation why return types can't
simply use all the same registers that are available for argument
passing?
I also think code would be a bit more efficient if there were more registers available for parameter passing and as scratch registers - perhaps 6
would make more sense.
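As an illustration of where the four-register limit bites, assuming the usual 32-bit ARM rules: the first four integer arguments of the hypothetical function below travel in R0-R3, and the remaining two are passed on the stack, so the callee has to load them back before it can use them. With six argument registers, everything here would stay in registers.

```cpp
#include <cstdint>

// Hypothetical six-argument function.
// a..d are passed in R0-R3; e and f are passed on the stack.
std::int32_t sum6(std::int32_t a, std::int32_t b, std::int32_t c,
                  std::int32_t d, std::int32_t e, std::int32_t f) {
    return a + b + c + d + e + f;
}
```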
In more modern C++ programming, it's very practical to use types like std::optional<>, std::variant<>, std::expected<> and std::tuple<> as a
way of dealing safely with status and multiple return values rather than using C-style error codes or passing manual pointers to return value
slots. But the limited return registers add significant overhead to
small functions.
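A small sketch of the kind of function I mean, using C++17 std::optional. The function name hex_digit is made up for the example. An optional holding a 32-bit value also has to hold an "engaged" flag, so it cannot fit in one 32-bit register; by the rule described above it would therefore be returned through a caller-allocated slot rather than in R0:R1, even though the useful payload is just a value and a flag.

```cpp
#include <cstdint>
#include <optional>

// The optional needs room for the value plus a flag, so it is
// necessarily larger than a single 32-bit register.
static_assert(sizeof(std::optional<std::uint32_t>) > sizeof(std::uint32_t),
              "optional<uint32_t> does not fit in one 32-bit register");

// Returns the value of a hexadecimal digit, or nullopt for any other
// character, instead of a C-style error code or an out-pointer.
std::optional<std::uint32_t> hex_digit(char c) {
    if (c >= '0' && c <= '9') return static_cast<std::uint32_t>(c - '0');
    if (c >= 'a' && c <= 'f') return static_cast<std::uint32_t>(c - 'a' + 10);
    if (c >= 'A' && c <= 'F') return static_cast<std::uint32_t>(c - 'A' + 10);
    return std::nullopt;
}
```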
Are there good technical reasons for the conventions on 32-bit ARM? Or
is this all just historical from the days when everything was an "int"
and that's all anyone ever returned from functions?
Thanks for any pointers or explanations here.
David Brown <david.brown@hesbynett.no> writes:
> But the ABI only allows returning a single 32-bit value in R0, or a
> scalar 64-bit value in R0:R1. If a function returns a non-scalar that
> is larger than 32-bit, the caller has to allocate space on the stack for
> the return type and pass a pointer to that space in R0.
>
> To my mind, this is massively inefficient, especially when using structs
> that are made up of two 32-bit parts.
>
> Is there any good reason why the ABI is designed with such limited
> register usage for returns?
Most calling conventions on RISCs are oriented towards C (if you want
calling conventions that try to be more cross-language (and slower),
look at VAX) and its properties and limitations at the time when the
calling convention was designed, in particular, the PCC
implementation, which was the de-facto standard Unix C compiler at the
time. C compilers in the 1980s did not allocate structs to registers,
so passing structs in registers was foreign to them, so the solution
is that the caller passes the target struct as an additional
parameter.
And passing the return value in registers might not have saved
anything on a compiler that does not deal with structs in registers.
E.g., if you have
    mystruct = myfunc(arg1, arg2);
you would see stores to mystruct behind the call. With the PCC
calling convention, the same stores would happen in the caller
(possibly resulting in smaller code if there are several calls to
myfunc()).
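The "additional parameter" scheme can be sketched by hand. MyStruct, its fields and myfunc_lowered are illustrative names only; the second function is roughly what such a calling convention turns the first one into, not the output of any particular compiler.

```cpp
// Hypothetical struct that is too large for a single register.
struct MyStruct {
    int a;
    int b;
    int c;
};

// What the programmer writes: a struct returned by value.
MyStruct myfunc(int arg1, int arg2) {
    MyStruct r;
    r.a = arg1;
    r.b = arg2;
    r.c = arg1 + arg2;
    return r;
}

// Roughly what the convention makes of it: the caller passes the
// address of the target struct as an extra first parameter, and the
// stores to the result happen inside the callee.
void myfunc_lowered(MyStruct* result, int arg1, int arg2) {
    result->a = arg1;
    result->b = arg2;
    result->c = arg1 + arg2;
}
```

With this shape the stores exist once, in the callee, rather than after every call site.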
I wonder, though, how things look for
    mystruct = foo(&mystruct);
Does PCC perform the return stores to mystruct only after performing
all other memory accesses in foo? Probably yes, anything else would complicate the compiler. In that case the caller could pass &mystruct
for the return value (a slight complication). But is that restriction reflected in the calling convention?
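The aliasing concern can be sketched with hand-lowered code. Vec2 and both functions below are hypothetical stand-ins for what a compiler might generate for a struct-returning foo when the caller passes &mystruct as the return slot; this is not a claim about what PCC actually did.

```cpp
// Hypothetical two-field struct.
struct Vec2 {
    int x;
    int y;
};

// Lowered form that performs the return stores only after all reads
// of the argument: safe even if out and in point to the same object.
void swap_xy_safe(Vec2* out, const Vec2* in) {
    Vec2 tmp{ in->y, in->x };  // all reads of *in happen here
    *out = tmp;                // return stores happen last
}

// Lowered form that stores into the return slot while still reading
// the argument: gives a different result when out == in.
void swap_xy_direct(Vec2* out, const Vec2* in) {
    out->x = in->y;
    out->y = in->x;  // reads the already-overwritten x if out == in
}
```

Starting from {1, 2} and calling each function with the same object as both arguments, the first yields {2, 1} and the second yields {2, 2}. So passing &mystruct as the return slot is only safe if the callee defers its return stores.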
Struct returns were (and AFAIK still are, many decades after
they were added to C) a relatively rarely used feature, so Johnson
(PCC's author) probably did not want to waste a lot of effort on
making it more efficient.
> I also think code would be a bit more efficient if there were more registers
> available for parameter passing and as scratch registers - perhaps 6
> would make more sense.
There is a tendency towards passing more parameters in registers in
more recent calling conventions. IA-32 (and IIRC VAX) passes none,
MIPS uses 4 integer registers (for either integer or FP parameters),
Alpha uses 6 integer and 6 FP registers, AMD64's System V ABI 6
integer and 8 FP registers, ARM A64 has 8 integer and 8 FP registers,
RISC-V has 8 integer and 8 FP registers. Not sure why they were so
reluctant to use more registers earlier.
- anton
> I also think code would be a bit more efficient if there were more registers
> available for parameter passing and as scratch registers - perhaps 6
> would make more sense.
Basically, here, there is competing pressure between the compiler
needing a handful of preserved registers, and the compiler being
more efficient if there were more argument/result passing registers.
My 66000 ABI has 8 argument registers, 7 temporary registers, 14
preserved registers, a FP, and a SP. IP is not part of the register
file. My ABI has a note indicating that the aggregations can be
altered, just that I need a good reason to change.
I looked high and low for codes using more than 8 arguments and
returning aggregates larger than 8 double words, and about the
only things I found were a handful of printf()-style calls.