[SYCL][CUDA] Add async_group_copy free function implementation #4907
Conversation
Tests are here: intel/llvm-test-suite#552
just a few nits
detail::enable_if_t<is_group_v<Group> && !detail::is_bool<dataT>::value,
                    device_event>
This is a C++14 feature; we can use it safely:
-detail::enable_if_t<is_group_v<Group> && !detail::is_bool<dataT>::value,
-                    device_event>
+std::enable_if_t<is_group_v<Group> && !detail::is_bool<dataT>::value,
+                 device_event>
done
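The SFINAE guard under discussion can be sketched in isolation. This is a minimal, illustrative example: `group`, `is_group_v`, and `copy_enabled` here are hypothetical stand-ins, not the real SYCL types; the point is only that `std::enable_if_t` (available since C++14) removes an overload from resolution when the condition is false.

```cpp
#include <type_traits>

struct group {}; // stand-in for sycl::group (illustrative only)

template <typename G>
inline constexpr bool is_group_v = std::is_same_v<G, group>;

// Enabled only when G is a group and T is not bool, mirroring the
// std::enable_if_t guard suggested in review.
template <typename G, typename T>
std::enable_if_t<is_group_v<G> && !std::is_same_v<T, bool>, bool>
copy_enabled(G, T) { return true; }

// Variadic fallback lets us observe when SFINAE removes the overload above;
// the two-parameter template is more specialized, so it wins when enabled.
template <typename... Ts>
bool copy_enabled(Ts...) { return false; }
```

Calling `copy_enabled(group{}, true)` or passing a non-group first argument falls through to the fallback, which is exactly the behaviour the real `device_event`-returning overload relies on.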
static_assert(sizeof(bool) == sizeof(uint8_t),
              "Async copy to/from bool memory is not supported.");
I guess the idea here is that `bool` takes 1 byte? Wouldn't it be better, then, to replace `uint8_t` with `char`?
Yes, I think that is the idea. I've replaced that now.
static_assert(sizeof(bool) == sizeof(uint8_t),
              "Async copy to/from bool memory is not supported.");
using VecT = detail::change_base_type_t<T, uint8_t>;
auto DestP = multi_ptr<VecT, DestS>(reinterpret_cast<VecT *>(Dest.get()));
Same as above. This `reinterpret_cast` is UB. Can we replace `uint8_t` with `char`?
done
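The aliasing point raised here can be demonstrated with a small sketch. The C++ standard permits examining any object's bytes through `char` (and `unsigned char`) pointers, while `uint8_t` is not formally guaranteed to be a character type, so casting to `char *` sidesteps the strict-aliasing concern. `copy_bools_via_char` below is an illustrative helper, not code from this PR.

```cpp
#include <cassert>

// sizeof(bool) == 1 on mainstream ABIs, which the PR's static_assert checks.
static_assert(sizeof(bool) == sizeof(char),
              "Async copy to/from bool memory is not supported.");

// Copy bool storage byte-by-byte through char pointers; char aliasing
// of any object type is well-defined, unlike aliasing through uint8_t
// on implementations where uint8_t is not a character type.
void copy_bools_via_char(const bool *src, bool *dst, int n) {
  const char *s = reinterpret_cast<const char *>(src);
  char *d = reinterpret_cast<char *>(dst);
  for (int i = 0; i < n; ++i)
    d[i] = s[i];
}
```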
// https://github.com/intel/llvm/blob/sycl/libclc/generic/libspirv/async/wait_group_events.cl
// __spirv_ControlBarrier calls __syncthreads or __nvvm_bar_warp_sync
// https://github.com/intel/llvm/blob/sycl/libclc/ptx-nvidiacl/libspirv/synchronization/barrier.cl
(Events.ext_oneapi_wait(g), ...);
Fold expressions are a C++17 feature. @romanovvlad, should we guard those with macros for extensions?
It depends. There is a test, sycl/test/basic_tests/stdcpp_compat.cpp; if it passes, there is no need to guard.
Works fine on my machine
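The construct in question is a C++17 comma-operator fold over a parameter pack, which calls a member function on every pack element in order. A minimal sketch, with `device_event` as a hypothetical stand-in that counts `wait()` calls instead of synchronizing:

```cpp
struct device_event {
  int *counter;
  void wait() const { ++*counter; } // stand-in for ext_oneapi_wait(g)
};

// Expands to (e1.wait(), (e2.wait(), (...))); an empty pack expands
// to void(), so zero-event calls compile and do nothing.
template <typename... EventTs>
void wait_for_all(EventTs... Events) {
  (Events.wait(), ...);
}
```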
libclc/ptx-nvidiacl/libspirv/async/async_work_group_strided_copy.cl (outdated review comments, resolved)
/verify with intel/llvm-test-suite#552
I've updated intel/llvm-test-suite#552 reflecting the latest commits in this PR.
I don't think we should merge this before the spec (#4950) is merged, and I don't think we should mark it as a final/implemented/supported extension until we get clarification from Khronos on whether group algorithms act as synchronization points (https://gitlab.khronos.org/sycl/Specification/-/issues/576). If we want to merge this soon, it may be worth considering putting it in the experimental namespace.
If the implementation is moved to "experimental", then the API spec in #4903 should also be updated to say the extension is experimental.
@gmlueck @Pennycook I've moved the implementation to the experimental namespace and updated the proposal in #4950. We'll update #4903 in the near future.
Force-pushed from fbe5835 to 5f34077
also reworked enable_if for wait_for
Signed-off-by: jack.kirk <[email protected]>
All OpenCL types with 4-, 8-, or 16-byte alignment have been optimized in the CUDA backend for sm_80. The test is up to date here: intel/llvm-test-suite#552
@FMarno, could you resolve merge conflicts, please?
We've marked this as a draft for the moment; I've resolved the conflicts. It still needs some fixes.
I've fixed the outstanding issues. I've removed the implementation for trivially copyable types that OpenCL does not support: that implementation was broken for group<1> etc. and is not yet possible, because it requires the group::get_local_id method, but SYCL 2020 groups are not implemented (there is a PR with the necessary functionality: #5447). If we add back full trivially copyable support, it should probably also be added to async_work_group_copy at the same time.
@FMarno Thanks again for the very useful discussions today. I've added the libclc nvptx implementation part of this PR to #5611 (minus the subgroup case). |
Partial implementation of the proposal at #4903.
This adds async_group_copy as a free function, implemented for `group` and `sub_group`. This also adds optimizations for NVIDIA architectures of sm_80 and over, making it actually async.
Credit to @JackAKirk for most of the implementation.
Tests to follow shortly.