Description
Is your feature request related to a problem? Please describe
As noted in #3815, the performance behavior of sycl::detail::memcpy is different to the performance behavior of std::memcpy. In my tests, performance is up to 2x better with std::memcpy.
Describe the solution you would like
I think there are two options:
- Support std::memcpy in device code.
  This appears to work already, but the function isn't explicitly listed here. This is my preferred solution, because it would allow us to call std::memcpy in the implementation and for users to call std::memcpy in their kernels.
- Implement sycl::detail::memcpy the same way as std::memcpy where possible.
  The implementation of sycl::detail::memcpy here is just a simple loop, and the compiler doesn't seem to optimize this as aggressively as it does std::memcpy. Making sycl::detail::memcpy faster wouldn't help user code, but would improve performance for those parts of the implementation currently relying on it.
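To make the difference between the two options concrete, here is a minimal host-side sketch. The names loop_memcpy and forwarding_memcpy are hypothetical and used only for illustration; loop_memcpy is a stand-in for a "simple loop" copy and is not the actual sycl::detail::memcpy source. The sketch only shows that both forms have the same observable behavior; it does not measure performance and does not exercise device code.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a byte-by-byte "simple loop" copy.
// This is NOT the real sycl::detail::memcpy implementation.
inline void loop_memcpy(void *dst, const void *src, std::size_t n) {
  auto *d = static_cast<unsigned char *>(dst);
  const auto *s = static_cast<const unsigned char *>(src);
  for (std::size_t i = 0; i < n; ++i)
    d[i] = s[i];
}

// Hypothetical wrapper that forwards to std::memcpy, so the compiler can
// treat the call as a known library function instead of a generic loop.
inline void forwarding_memcpy(void *dst, const void *src, std::size_t n) {
  std::memcpy(dst, src, n);
}

// Returns true when both variants produce identical copies of the input.
inline bool copies_match() {
  const char src[] = "sycl memcpy";
  char a[sizeof(src)] = {};
  char b[sizeof(src)] = {};
  loop_memcpy(a, src, sizeof(src));
  forwarding_memcpy(b, src, sizeof(src));
  return std::memcmp(a, b, sizeof(src)) == 0 &&
         std::memcmp(a, src, sizeof(src)) == 0;
}
```

Because the two functions are interchangeable from the caller's point of view, either option could in principle be adopted without changing call sites.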
Describe alternatives you have considered
Calling __builtin_memcpy might also work, but adding a third variant of memcpy to the mix seems more confusing.
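For reference, a sketch of what that alternative could look like. The helper name builtin_or_std_memcpy is hypothetical, and whether __builtin_memcpy behaves well in device code is an assumption that has not been verified here; the guard simply falls back to std::memcpy on compilers that are not GCC-compatible.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper, for illustration only.
inline void builtin_or_std_memcpy(void *dst, const void *src, std::size_t n) {
#if defined(__GNUC__) || defined(__clang__)
  // GCC and Clang provide __builtin_memcpy as a compiler builtin.
  __builtin_memcpy(dst, src, n);
#else
  // Fallback for compilers without the builtin.
  std::memcpy(dst, src, n);
#endif
}
```

This illustrates the concern above: the codebase would then contain three spellings of the same operation.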
Additional context
I think there are other headers that currently assume std::memcpy works in device kernels, and I wouldn't be surprised if there was also user code relying on this behavior.
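As a purely hypothetical illustration of the kind of user code that might rely on this, type punning through std::memcpy is a common C++ idiom. The function float_bits below is an invented example, not taken from any existing header or kernel, and it is shown as plain host C++.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical example: reinterpret the bytes of a float as a 32-bit integer
// by copying them, which avoids undefined behavior from pointer casts.
inline std::uint32_t float_bits(float f) {
  static_assert(sizeof(std::uint32_t) == sizeof(float),
                "float is assumed to be 32 bits wide");
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return bits;
}
```

If a helper like this were called from inside a kernel, it would depend on std::memcpy being usable in device code.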