Description
In the System V x86_64 ABI, large arguments are passed on the stack. In LLVM IR, the byval attribute is used for them. Calling a function with a byval argument requires making an intermediate copy of the argument bytes. It seems that this copying is implemented in a target-specific way, somewhere in code generation. This results in poor generated code, since no IR-level optimizations can be applied to byval arguments, including elimination of unnecessary loads and stores.
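To illustrate why a by-value argument needs its own copy in the general case, here is a minimal sketch. The names Big, MutateAndSum and CallerCopyIsUntouched are hypothetical and do not appear in LLVM or in this report. The callee is free to modify its parameter, so the caller's object has to stay separate from the bytes the callee sees.

```cpp
#include <cstdint>

// Hypothetical aggregate with the same layout as S in this report:
// twelve 64-bit integers.
struct Big
{
    std::int64_t arr[12];
};

// The callee receives its own copy and is allowed to modify it.
std::int64_t MutateAndSum(Big b)
{
    b.arr[0] = -1; // Writes to the callee's copy only.
    std::int64_t sum = 0;
    for (int i = 0; i < 12; ++i)
        sum += b.arr[i];
    return sum;
}

// Returns true if the caller's object is unchanged after the call.
bool CallerCopyIsUntouched()
{
    Big local{}; // Value-initialized: all elements are zero.
    MutateAndSum(local);
    return local.arr[0] == 0;
}
```

When the argument is a named object that the caller can still read afterwards, as here, the copy is required. When the argument is a temporary that nothing else can observe, as in the Foo/Bar example of this report, that reasoning does not by itself call for a second set of stores.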
Consider this example:
#include <cstdint>

struct S
{
    int64_t arr[12];
    S()
    {
        for( int64_t i= 0; i < 12; ++i )
            arr[i]= i * i * 3 + i * 7 + 13;
    }
};

void Bar(S s);

void Foo()
{
    Bar(S());
}
clang 20.1 produces the following assembly:
Foo():
        sub rsp, 200
        mov qword ptr [rsp + 104], 13
        mov qword ptr [rsp + 112], 23
        mov qword ptr [rsp + 120], 39
        mov qword ptr [rsp + 128], 61
        mov qword ptr [rsp + 136], 89
        mov qword ptr [rsp + 144], 123
        mov qword ptr [rsp + 152], 163
        mov qword ptr [rsp + 160], 209
        mov qword ptr [rsp + 168], 261
        mov qword ptr [rsp + 176], 319
        mov qword ptr [rsp + 184], 383
        mov qword ptr [rsp + 192], 453
        movups xmm0, xmmword ptr [rsp + 184]
        movups xmmword ptr [rsp + 80], xmm0
        movups xmm0, xmmword ptr [rsp + 168]
        movups xmmword ptr [rsp + 64], xmm0
        movups xmm0, xmmword ptr [rsp + 104]
        movups xmm1, xmmword ptr [rsp + 120]
        movups xmm2, xmmword ptr [rsp + 136]
        movups xmm3, xmmword ptr [rsp + 152]
        movups xmmword ptr [rsp + 48], xmm3
        movups xmmword ptr [rsp + 32], xmm2
        movups xmmword ptr [rsp + 16], xmm1
        movups xmmword ptr [rsp], xmm0
        call Bar(S)@PLT
        add rsp, 200
        ret
https://godbolt.org/z/rzEzn18v9
It can be seen that the memory for a local variable is initialized first (the first 12 mov instructions), and then this memory is read back and copied to a new stack location in order to pass the variable to the function Bar. Targeting a newer CPU (with the -march=skylake option) only changes which memory move instructions are used; the overall pattern remains the same.
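As a sanity check on reading the listing, the twelve immediates stored by the mov instructions can be compared with the constructor's formula. The sketch below is only an illustration; the names ElementValue, kImmediates and ImmediatesMatch are hypothetical. It shows that the constructor loop was fully evaluated at compile time, so the stores to [rsp + 104] through [rsp + 192] are the whole initialization of the temporary, and the following movups instructions only copy it.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Same formula as in the constructor of S.
constexpr std::int64_t ElementValue(std::int64_t i)
{
    return i * i * 3 + i * 7 + 13;
}

// Immediates of the 12 qword stores in the listing, in order of
// increasing stack offset (rsp + 104 ... rsp + 192).
constexpr std::array<std::int64_t, 12> kImmediates = {
    13, 23, 39, 61, 89, 123, 163, 209, 261, 319, 383, 453};

// Compares every immediate with the value the constructor computes.
constexpr bool ImmediatesMatch()
{
    for (std::size_t i = 0; i < kImmediates.size(); ++i)
    {
        if (kImmediates[i] != ElementValue(static_cast<std::int64_t>(i)))
            return false;
    }
    return true;
}

static_assert(ImmediatesMatch(), "listing immediates should equal i*i*3 + i*7 + 13");
```

The check needs C++14 or later because of the loop inside a constexpr function.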
I expect that LLVM can optimize such cases and write the necessary values directly to the stack region of the argument. A lot of existing C++ code may benefit from such an optimization, since it's nowadays pretty common to pass arguments by value (since move semantics were introduced).