Skip to content

[SYCL][CUDA] Add async_group_copy free function implementation #4907

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 20 commits into from

Conversation

FMarno
Copy link
Contributor

@FMarno FMarno commented Nov 5, 2021

Partial Implement for proposal at #4903.
This adds async_group_copy as a free function, implemented for group and sub_group.
This also adds optimizations for NVIDIA architectures of sm_80 and over, making it actually async.

Credit to @JackAKirk for most of the implementation.

Tests to follow shortly

@JackAKirk
Copy link
Contributor

JackAKirk commented Nov 5, 2021

Tests are here: intel/llvm-test-suite#552

Copy link
Contributor

@alexbatashev alexbatashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few nits

Comment on lines 84 to 85
detail::enable_if_t<is_group_v<Group> && !detail::is_bool<dataT>::value,
device_event>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is C++14 feature, we can use it safely

Suggested change
detail::enable_if_t<is_group_v<Group> && !detail::is_bool<dataT>::value,
device_event>
std::enable_if_t<is_group_v<Group> && !detail::is_bool<dataT>::value,
device_event>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 127 to 128
static_assert(sizeof(bool) == sizeof(uint8_t),
"Async copy to/from bool memory is not supported.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the idea here is that bool takes 1 byte? Wouldn't it be better then to replace uint8_t with char?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that is the idea. I've replaced that now.

static_assert(sizeof(bool) == sizeof(uint8_t),
"Async copy to/from bool memory is not supported.");
using VecT = detail::change_base_type_t<T, uint8_t>;
auto DestP = multi_ptr<VecT, DestS>(reinterpret_cast<VecT *>(Dest.get()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. This reinterpret_cast is UB. Can we replace uint8_t with char?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// https://github.com/intel/llvm/blob/sycl/libclc/generic/libspirv/async/wait_group_events.cl
// __spirv_ControlBarrier calls __syncthreads or __nvvm_bar_warp_sync
// https://github.com/intel/llvm/blob/sycl/libclc/ptx-nvidiacl/libspirv/synchronization/barrier.cl
(Events.ext_oneapi_wait(g), ...);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Folding expressions are C++ 17 feature. @romanovvlad should we guard those with macros for extensions?

Copy link
Contributor

@romanovvlad romanovvlad Nov 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends. There is a test: sycl/test/basic_tests/stdcpp_compat.cpp, if it passes there is no need to guard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works fine on my machine

@bader bader changed the title [SYCL][CUDA] impl of async group copy [SYCL][CUDA] Add async_group_copy free function implementation Nov 16, 2021
bader
bader previously approved these changes Nov 27, 2021
bader
bader previously approved these changes Nov 30, 2021
@bader
Copy link
Contributor

bader commented Nov 30, 2021

/verify with intel/llvm-test-suite#552

@JackAKirk
Copy link
Contributor

JackAKirk commented Nov 30, 2021

/verify with intel/llvm-test-suite#552

I've updated intel/llvm-test-suite#552 reflecting the latest commits in this PR.

alexbatashev
alexbatashev previously approved these changes Nov 30, 2021
@Pennycook
Copy link
Contributor

I don't think we should merge this before the spec (#4950) is merged, and I don't think we should mark it as a final/implemented/supported extension until we get clarification from Khronos on whether group algorithms act as synchronization points (https://gitlab.khronos.org/sycl/Specification/-/issues/576).

If we want to merge this soon, it may be worth considering putting it in the experimental:: namespace. That would make the feature available for testing, and we could take things out of experimental:: once we've reached consensus on the expected synchronization behavior. Tagging @gmlueck for comment, since he's also a reviewer on #4950.

@gmlueck
Copy link
Contributor

gmlueck commented Dec 3, 2021

Moved async_group_copy to experimental

If the implementation is moved to "experimental", then the API spec in #4903 should also be updated to say the extension is experimental.

@FMarno
Copy link
Contributor Author

FMarno commented Dec 3, 2021

@gmlueck @Pennycook I've moved the implementation to the experimental namespace and updated the proposal in #4950. We'll update #4903 in the near future.

@FMarno FMarno force-pushed the async_group_copy_impl branch from fbe5835 to 5f34077 Compare January 10, 2022 16:31
@romanovvlad romanovvlad removed their request for review January 18, 2022 13:19
@JackAKirk
Copy link
Contributor

JackAKirk commented Feb 9, 2022

All OpenCL types with 4, 8 or 16 byte alignment have been optimized in the cuda backend for sm_80. The test is up to date here: intel/llvm-test-suite#552

@bader bader requested a review from alexbatashev February 10, 2022 10:31
Copy link
Contributor

@bader bader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@FMarno, could you resolve merge conflicts, please?

@FMarno FMarno marked this pull request as draft February 10, 2022 11:14
@JackAKirk
Copy link
Contributor

JackAKirk commented Feb 10, 2022

@FMarno, could you resolve merge conflicts, please?

We've marked this as draft for the moment, I've resolved the conflicts. It needs some fixes.

@JackAKirk
Copy link
Contributor

JackAKirk commented Feb 11, 2022

@FMarno, could you resolve merge conflicts, please?

We've marked this as draft for the moment, I've resolved the conflicts. It needs some fixes.

I've fixed the outstanding issues.

I've removed the implementation for non OpenCL supported trivially copyable types: This implementation was broken for group<1> etc and is not yet possible because it requires the group::get_local_id method but SYCL 2020 groups are not implemented, although there is a PR for the necessary functionality here: #5447

If we add back this full trivially copyable support then it should also probably be added to async_work_group_copy at the same time.

@FMarno FMarno marked this pull request as ready for review February 11, 2022 16:34
@JackAKirk
Copy link
Contributor

@FMarno
We decided that this implementation requires some more work in the meeting today. Can you mark this PR and the Proposal PR as draft again please?

@Pennycook

Thanks again for the very useful discussions today. I've added the libclc nvptx implementation part of this PR to #5611 (minus the subgroup case).

@FMarno FMarno marked this pull request as draft February 22, 2022 11:12
@github-actions github-actions bot added the Stale label Aug 22, 2022
@github-actions github-actions bot closed this Sep 21, 2022
@AerialMantis AerialMantis reopened this Jan 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuda CUDA back-end Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants