Missing optimizations involving byval arguments passing on x86_64

In the System V x86_64 ABI, large arguments are passed on the stack. In LLVM IR, the `byval` attribute is used for them.
Calling a function with a `byval` argument requires making an intermediate copy of the argument bytes. It seems that this copy is implemented in a target-specific way, somewhere in code generation. This results in poor generated code, since no IR-level optimizations, including elimination of unnecessary loads and stores, can be applied to `byval` arguments.

Consider this example:
```cpp
#include <cstdint>

struct S
{
    int64_t arr[12];

    S()
    {
        for( int64_t i= 0; i < 12; ++i )
            arr[i]= i * i * 3 + i * 7 + 13;
    }
};

void Bar(S s);

void Foo()
{
    Bar(S());
}
```
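As a sanity check on why `S` ends up as a stack-passed argument at all, the properties that matter can be asserted at compile time. This is only a sketch: it repeats the definition of `S` so that it compiles on its own, and the comments summarize my understanding of the System V x86_64 classification rules rather than quoting the ABI document.

```cpp
#include <cstdint>
#include <type_traits>

// Same definition as in the example above, repeated so this snippet is self-contained.
struct S
{
    int64_t arr[12];

    S()
    {
        for( int64_t i= 0; i < 12; ++i )
            arr[i]= i * i * 3 + i * 7 + 13;
    }
};

// 12 elements * 8 bytes = 96 bytes. An aggregate of plain integers this large
// does not fit into registers, so it is passed in memory (on the stack).
static_assert( sizeof(S) == 96, "S is expected to occupy 96 bytes" );

// The copy constructor and destructor are trivial, so passing S by value
// is a plain bytewise copy; no user code runs for the copy.
static_assert( std::is_trivially_copy_constructible<S>::value, "copy must be trivial" );
static_assert( std::is_trivially_destructible<S>::value, "destructor must be trivial" );
```

The user-provided default constructor does not affect this: only the copy/move constructors and the destructor decide whether the object can be copied bytewise when it is passed by value.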

clang 20.1 produces the following assembly:

```asm
Foo():
        sub     rsp, 200
        mov     qword ptr [rsp + 104], 13
        mov     qword ptr [rsp + 112], 23
        mov     qword ptr [rsp + 120], 39
        mov     qword ptr [rsp + 128], 61
        mov     qword ptr [rsp + 136], 89
        mov     qword ptr [rsp + 144], 123
        mov     qword ptr [rsp + 152], 163
        mov     qword ptr [rsp + 160], 209
        mov     qword ptr [rsp + 168], 261
        mov     qword ptr [rsp + 176], 319
        mov     qword ptr [rsp + 184], 383
        mov     qword ptr [rsp + 192], 453
        movups  xmm0, xmmword ptr [rsp + 184]
        movups  xmmword ptr [rsp + 80], xmm0
        movups  xmm0, xmmword ptr [rsp + 168]
        movups  xmmword ptr [rsp + 64], xmm0
        movups  xmm0, xmmword ptr [rsp + 104]
        movups  xmm1, xmmword ptr [rsp + 120]
        movups  xmm2, xmmword ptr [rsp + 136]
        movups  xmm3, xmmword ptr [rsp + 152]
        movups  xmmword ptr [rsp + 48], xmm3
        movups  xmmword ptr [rsp + 32], xmm2
        movups  xmmword ptr [rsp + 16], xmm1
        movups  xmmword ptr [rsp], xmm0
        call    Bar(S)@PLT
        add     rsp, 200
        ret
```

https://godbolt.org/z/rzEzn18v9

The memory for a local variable is initialized first (the first 12 `mov` instructions); this memory is then read and copied to a new stack location in order to pass the variable to function `Bar`. Targeting a newer CPU (with the `-march=skylake` option) only results in newer memory move instructions being used; the overall pattern remains the same.
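To make the reading of the listing concrete, the numbers in it can be checked against the source. The sketch below is illustrative only: `InitialValue`, `AllImmediatesMatch` and the `k_*` constants are names made up for this check, and the offsets are copied by hand from the assembly above.

```cpp
#include <cstdint>

// Hypothetical helper: the value the constructor of S stores into arr[i].
constexpr int64_t InitialValue( int64_t i )
{
    return i * i * 3 + i * 7 + 13;
}

// Immediates of the first 12 `mov` instructions, in order of increasing address.
constexpr int64_t k_mov_immediates[12]=
    { 13, 23, 39, 61, 89, 123, 163, 209, 261, 319, 383, 453 };

// Returns true if every immediate equals the value computed by the constructor formula.
constexpr bool AllImmediatesMatch()
{
    for( int i= 0; i < 12; ++i )
        if( InitialValue(i) != k_mov_immediates[i] )
            return false;
    return true;
}

static_assert( AllImmediatesMatch(), "the constructor was fully constant-folded" );

// Stack layout, read off the listing.
constexpr int k_object_size= 96;   // sizeof(S)
constexpr int k_local_offset= 104; // the local temporary starts at [rsp + 104]
constexpr int k_frame_size= 200;   // sub rsp, 200

// The local temporary occupies [rsp + 104, rsp + 200), the argument copy [rsp, rsp + 96).
static_assert( k_local_offset + k_object_size == k_frame_size, "local fills the top of the frame" );
// Copying 96 bytes takes six 16-byte loads and six 16-byte stores.
static_assert( k_object_size / 16 == 6, "six xmm-sized chunks" );
```

So all twelve values are already known at compile time, and the six `movups` load/store pairs only move them from one stack slot to another.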

I expect LLVM to be able to optimize such cases and write the necessary values directly to the stack region of the argument. A lot of existing C++ code may benefit from such an optimization, since passing arguments by value has become quite common since move semantics were introduced.

Missing optimizations involving byval arguments passing on x86_64 #151288

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Missing optimizations involving byval arguments passing on x86_64 #151288

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions