[SYCL][DOC] Proposal to generalize async_work_group_copy to include sub-group #4950
Conversation
`async_group_copy` methods would be valid for groups `group` and `sub_group`.
Note that the destination and source arguments have been have been swapped to be
more in line with other SYCL copy functions.
I know this is inherited from SYCL 2020, but the fact that these operations only support copying from local to global memory (or vice versa) is quite subtle. If there aren't any plans to generalize this, I think a short paragraph explaining that here would be helpful.
I'm also curious to hear your thoughts on whether generalizing this would be a good idea, or not. What are the reasons not to support generic pointers here, or copies within an address space?
Currently most of these arrangements would lead to just falling back on generic code since I don't think the backends have an optimized version. PTX only actually has an asynchronous copy from global to local and not the other way around.
I think this async method should only be used as a potential optimization, and not as the first choice for a basic implementation. I'd guess that in some cases this async method could even lead to overhead (though a benchmark would be required to see if that is true). Restricting it to model the real capabilities of hardware more closely might help users not pick it first.
That's a good argument, and I think adding some text to this effect in the proposal would help people understand the difference between this feature and a higher-level algorithm that might come later.
Part of me wonders whether there's anything we can do with the naming to make it clear that this is a low-level feature that may have hardware-specific considerations and map directly to certain hardware features. We don't really have anything like that today -- the closest equivalent I can think of is the `native` namespace, which is currently only used for math operations.
The optimization warning has been included. Sorry, I'm not sure about a better name.
This document proposes that `async_work_group_copy` be deprecated and replaced
with `async_group_copy`. `async_group_copy` is a non-member function, in line
with the other group functions, which generalizes `async_work_group_copy` to
also work with sub-groups.
Bikeshedding: It didn't occur to me until today that the name `group_copy` is aligned with the "Group functions", and it would probably be called `joint_group_copy` if it were defined as a "Group algorithm".

Is the intent here for this to be a function or an algorithm?
I'm actually a little bit confused about where `async_group_copy` fits in. It definitely requires all the work-items to call the function with the same parameters, but I don't see a necessity for the work-items to do the copies synchronously.

We are a bit unsure on the wait as well. We don't think it should necessarily synchronize the group, so maybe it doesn't actually need the group as an argument. The memory a single work-item copied should be consistent for that work-item after wait, but there don't need to be guarantees about the memory that other work-items work on. I guess the question then is "what memory did the work-item copy?".
I just double-checked the OpenCL definitions, and they require both the copy and wait call to be encountered by all work-items in the group: https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#async-copies.
Given what you said earlier about it being a low-level functionality, I think positioning the copy and wait functions as "Group functions" makes the most sense. Viewing these as a collective operation allows for more implementations (e.g. an implementation can sit atop OpenCL), and avoids the complexity you mentioned -- as soon as you specify that certain work-items copy specific memory locations, you might prevent this operation from being implemented effectively on some hardware.
The semantics that make the most sense to me are:

- `async_group_copy` must be encountered by all work-items in converged control flow and acts as a synchronization point with an acquire-release fence. (Otherwise, different work-items might disagree on what values they're copying.)
- `wait` must be encountered by all work-items in converged control flow and acts as a synchronization point with an acquire-release fence. (Otherwise, different work-items might disagree on what values were copied.)
The OpenCL specification is clear that these operations in OpenCL don't imply any sort of fencing, but that seems poorly defined to me.
Ok, I agree on copy and wait being collective operations. As you say, I don't want to specify exactly what data each work-item copies.

Since we can't tell what work each work-item did, I think adding a synchronization point with an acquire-release fence to `wait` makes sense. Otherwise you would just always need one for correctness.

I am a little reluctant to add the synchronization point with an acquire-release fence to `async_group_copy`, since I can see that being used as either the first or last function call in a kernel and it would be a shame to have a wasted sync.
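As a concrete illustration of the "wasted sync" concern, here is a minimal sketch of a kernel whose very first statement is the copy. The `exp` alias, the `wait_for` free function, and the pointer parameters are assumptions based on the API discussed in this thread, not the final extension wording:

```c++
#include <sycl/sycl.hpp>

namespace exp = sycl::ext::oneapi::experimental;

// Hypothetical kernel body: the async copy is the first thing executed, so no
// work-item has written anything that a leading barrier would need to order.
template <typename T>
void kernel_body(sycl::nd_item<1> it,
                 sycl::decorated_global_ptr<T> in,
                 sycl::decorated_local_ptr<T> tile,
                 size_t count) {
  auto g = it.get_group();
  auto e = exp::async_group_copy(g, in, tile, count); // first call in the kernel
  // ... independent work that does not touch `tile` ...
  exp::wait_for(g, e); // synchronization + fence happen here instead
  // ... compute using `tile` ...
}
```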
I am a little reluctant to add the synchronization point with an acquire-release fence to `async_group_copy` since I can see that being used as either the first or last function call in a kernel and it would be a shame to have a wasted sync.

I see where you're coming from and I share the reluctance. It feels like the work-items should be able to start opportunistically copying the part of the range assigned to them as soon as they hit the `async_group_copy` call without a sync, but I don't think that works in practice. I think in real-life use-cases you'd need to add an explicit sync before the group copy (just like with the wait).
Consider this example:
```c++
// Each work-item writes a few elements to local memory with a stride
local[it.get_local_id()] = x;
local[it.get_local_id() + it.get_local_range()] = y;

// Running this without a sync requires this work-item to be responsible for
// copying the same values as written above
async_group_copy(g, local, global, 2 * it.get_local_range());
```
In general I don't see how you can provide the necessary guarantee without specifying a relationship between work-items and the values that they copy. Saying that the copy is implicitly strided across work-items would address the example above, but would then break in a case like this:
```c++
// Each work-item writes a few contiguous elements to local memory
local[it.get_local_id() * 2 + 0] = x;
local[it.get_local_id() * 2 + 1] = y;

// Running this without a sync requires this work-item to be responsible for
// copying the same values as written above
async_group_copy(g, local, global, 2 * it.get_local_range());
```
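For comparison, a sketch of the workaround described above, i.e. making the writes group-visible with an explicit `sycl::group_barrier` before the collective copy (argument order `(group, src, dest, count)` as used elsewhere in this thread; purely illustrative, not normative wording):

```c++
// Each work-item writes its elements to local memory in whatever layout it likes.
local[it.get_local_id()] = x;
local[it.get_local_id() + it.get_local_range()] = y;

// Make every work-item's writes visible to the whole group first...
sycl::group_barrier(g);

// ...so the implementation is free to assign any elements to any work-item.
async_group_copy(g, local, global, 2 * it.get_local_range());
```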
The case would be at the beginning of a kernel, when loading memory from global to local. For CUDA, there is actually no asynchronous copy from local to global!
From https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#collectives-cg-memcpy-async :
It is important to note that while this is a memcpy in the general case, it is only asynchronous if the source is global memory and the destination is shared memory and both can be addressed with 16, 8, or 4 byte alignments.
Maybe it is the case that `async_group_copy` is neither a group function nor a group algorithm, due to the restrictions on them.

I don't want to distract, but looking at the implementation for `joint_reduce`, I don't see any synchronization before the `sycl::detail::for_each` is done: https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/group_algorithm.hpp#L231.
I don't want to distract but looking at the implementation for `joint_reduce`, I don't see any synchronization before the `sycl::detail::for_each` is done: https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/group_algorithm.hpp#L231.
Hm, I didn't realize that, thanks for pointing it out.
I think we need the specification to be clarified before we can actually decide what the right thing to do here is. I've opened an issue at Khronos: https://gitlab.khronos.org/sycl/Specification/-/issues/576.
The case would be at the beginning of a kernel, when loading memory from global to local. For CUDA, there is actually no asynchronous copy from local to global!
From docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#collectives-cg-memcpy-async :
It is important to note that while this is a memcpy in the general case, it is only asynchronous if the source is global memory and the destination is shared memory and both can be addressed with 16, 8, or 4 byte alignments.
I guess it is somehow an optimization and implementation detail.
For example, the Xilinx OpenCL implementation implements the asynchronous copy as just a synchronous copy. This is also correct. :-)
So we might also expose things in SYCL even if the CUDA back-end cannot take advantage of it.
The case would be at the beginning of a kernel, when loading memory from global to local. For CUDA, there is actually no asynchronous copy from local to global!
From docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#collectives-cg-memcpy-async : It is important to note that while this is a memcpy in the general case, it is only asynchronous if the source is global memory and the destination is shared memory and both can be addressed with 16, 8, or 4 byte alignments.
I guess it is somehow an optimization and implementation detail. For example, the Xilinx OpenCL implementation implements the asynchronous copy as just a synchronous copy. This is also correct. :-) So we might also expose things in SYCL even if the CUDA back-end cannot take advantage of it.
The initial accompanying implementation to this proposal did make use of the asynchronous copy using the Direct Memory Access hardware for copying from global to shared memory introduced in the sm_80 Ampere architecture, but it did not fully allow these DMA memory copies to overlap with computation, since the `wait_for` implementation waited for all outstanding asynchronous copies. We now plan to introduce functionality similar to "wait_prior" from the CUDA runtime so that we can get more "free" computation during the async copies, by executing multiple async copies and only waiting for those necessary for a following computation.
This poses some obvious questions/problems because the CUDA backend does not return anything that can be translated into a `device_event` from `async_group_copy` (as discussed in #4903 (comment), the backend instead takes an argument N telling it how many async copies to NOT wait for). Currently there is no backend in DPC++ (or any other SYCL implementation, as far as I am aware!) that actually uses `async_work_group_copy` asynchronously, and I am not sure that any GPU backend beyond Nvidia actually has hardware to support it! It looks like it is mostly CPU and FPGA backends that might have supported asynchronous `async_work_group_copy` via DMA, although this is just a guess from some Google searches, so if anyone can clarify this I'd appreciate it!
The OpenCL `async_work_group_copy` is more than a decade old, so it would be good to confirm that it is still being actively used, since I am a little uncomfortable that it is incompatible with the only backend in DPC++ currently supporting asynchronous copies to shared memory (CUDA). If in the future other backends plan to implement asynchronous copies from global to local memory, it would be an opportune moment to consider whether they will make use of the existing `async_work_group_copy`/`device_event` interface/workflow.
Assuming that we do keep the existing device_event workflow we need to work out how we can make a compatible cuda implementation.
One option would be to introduce a new optional template parameter to sycl::wait_for that can be used by (at least) the cuda backend to NOT wait for numEvents, and only wait for the other totalEvents - numEvents asynchronous copies, something like:
```c++
template <int numEvents, typename Group, typename... eventT>
std::enable_if_t<(sycl::is_group_v<Group>) &&
                 (std::is_same_v<eventT, async_copy_event<Group>> && ...)>
wait_for(Group, eventT... Events) {
#ifdef __NVPTX__
  if constexpr (numEvents == 0) {
    (__spirv_GroupWaitEvents(detail::group_to_scope<Group>(), 0, &Events.Event),
     ...);
  } else {
    __nvvm_cp_async_wait_group(numEvents);
    __spirv_ControlBarrier(sycl::detail::group_barrier_scope<Group>::Scope, 0,
                           __spv::MemorySemanticsMask::AcquireRelease |
                               __spv::MemorySemanticsMask::SubgroupMemory |
                               __spv::MemorySemanticsMask::WorkgroupMemory |
                               __spv::MemorySemanticsMask::CrossWorkgroupMemory);
  }
#else
  (__spirv_GroupWaitEvents(detail::group_to_scope<Group>(), 0, &Events.Event),
   ...);
#endif
}
```
If this is not acceptable I think we will have to somehow add a list of the order of the async copy device_event events to the cuda backend so that we can map each device_event to the correct N numEvents.
I am assuming that the async_work_group_copy from the SYCL spec is still useful for some backends in DPC++ to achieve true asynchronous copies. I would be more comfortable actually knowing this. If on the other hand async_work_group_copy is not appropriate then we should have a conversation about what a more appropriate interface would be for general DPC++ backends.
I think that we would prefer to get this proposal and accompanying implementation merged asap as it is, so that we can address the issue stated in this message at a later date, with a CUDA specific proposal if necessary.
The meaning of "synchronization points" here isn't expanded upon, but the standard C++ synchronizes-with relationship implies some agreement about the state of memory. The definition of group barrier in the glossary says:
A synchronization function within a group of work items. All the work items of a group must execute the barrier construct before any work-item continues execution beyond the barrier. Additionally all work-items in the group execute a release mem-fence prior to synchronizing at the barrier, all work-items in the group execute an acquire mem-fence after synchronizing at the barrier, and there is an implicit synchronization between these acquire and release fences as if through an atomic operation on an atomic object internal to the barrier implementation.
...so I think it's possible to read this as saying that there are appropriate fences and barriers at the beginning and end of each group algorithm.
I think that we agree that https://gitlab.khronos.org/sycl/Specification/-/issues/576 should clarify the meaning of
"These group functions act as synchronization points and must be encountered in converged control flow by all work-items in the group."
As Finlay points out, the current implementation of the joint_reduce group function is not consistent with an interpretation of the above statement as meaning that group collectives require fencing by definition.
As you point out, the OpenCL definition of `async_work_group_copy` doesn't imply any sort of fencing. I believe that this proposal defines `async_group_copy` as a straightforward generalization of `async_work_group_copy` to support sub-groups. The behaviour of `async_group_copy` (or whatever its final name is chosen to be!) in the `group` case should match the behaviour of `async_work_group_copy`. This should mean that if `async_group_copy` needs to be altered then `async_work_group_copy` should be identically altered (they would have to be at the moment, since they use the same SPIR-V implementation!).
There are strong performance reasons for not synchronizing with `sycl::barrier` prior to a call to `async_group_copy`. I find that if I add an additional call to `barrier` prior to `async_group_copy` when copying 8KB of data from global to shared memory using the maximum work-group size, the sm_80 DMA performance is worse than a non-asynchronous copy on the same device (assuming there is no useful compute that can be done while the DMA performs the copy without using the new data being copied), even though the DMA route copies from global to shared memory more efficiently by avoiding the use of registers.
It is worth noting that although the CUDA implementation of the asynchronous copy does not require the work-items to be converged, there is a considerable performance advantage to the work-items being converged: see this discussion on warp entanglement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-entanglement-commit.

Our existing PTX implementation is consistent with the CUDA runtime, where it is also left up to the user to ensure that the threads are converged.
Taking this and the OpenCL definition of `async_work_group_copy` into account, and given that I am not aware of any DMA setups other than those supported by OpenCL and CUDA, it seems likely that for the purposes of `async_group_copy` the barrier prior to `async_group_copy` will not be necessary unless the user has explicitly diverged the work-items. However, since `async_group_copy` may fall under the banner of SYCL group functions in the future, if a barrier is decided to be important for other SYCL group functions, we may have to add one to the beginning of `async_group_copy`. I don't think that this in itself can be a blocker to merging this proposal, particularly since it is aimed at the `oneapi::experimental` namespace.
As an aside, for the bigger picture it is worth mentioning that asynchronous copies are most advantageous when copying as much memory as possible in a single kernel to shared memory, and when multiple memory copies (often roughly 8KB each) can be initialised in tandem, with concurrent computation (and therefore waits) on subsets of these data copies while other subsets are left in flight (I think this may be referred to as "double buffering"). The most obvious application for this in DPC++ that comes to mind for me is matrix MMA kernels; however, the current interface of `async_work_group_copy`/`async_group_copy` is incompatible with such an application, since it would be necessary to specify precisely the data that each work-item will copy (I don't see that this could not be done in a sub-group/group-convergent way). Memory copies to shared memory can be a bottleneck for such kernels.
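For reference, a minimal sketch of the double-buffering pattern described above, assuming the `(group, src, dest, count)` signature and collective `wait_for` discussed in this thread; the `exp` alias, the `compute_on` helper, and the pointer parameters are placeholders, not part of the proposal:

```c++
#include <sycl/sycl.hpp>

namespace exp = sycl::ext::oneapi::experimental;

// Placeholder for whatever per-tile computation the kernel performs.
template <typename T>
void compute_on(sycl::nd_item<1> it, sycl::decorated_local_ptr<T> tile, size_t n);

// Two local-memory tiles: while the group computes on the first tile, the
// copy filling the second tile can still be in flight on the DMA engine.
template <typename T>
void double_buffered(sycl::nd_item<1> it,
                     sycl::decorated_global_ptr<T> in,
                     sycl::decorated_local_ptr<T> tileA,
                     sycl::decorated_local_ptr<T> tileB,
                     size_t tile_elems) {
  auto g = it.get_group();
  auto eA = exp::async_group_copy(g, in, tileA, tile_elems);
  auto eB = exp::async_group_copy(g, in + tile_elems, tileB, tile_elems);

  exp::wait_for(g, eA);               // only the first tile must be resident
  compute_on(it, tileA, tile_elems);  // overlaps with the copy into tileB

  exp::wait_for(g, eB);
  compute_on(it, tileB, tile_elems);
}
```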
## Introduction

This document proposes that `async_work_group_copy` be deprecated and replaced
with `async_group_copy`. `async_group_copy` is a non-member function, in line
An extension cannot deprecate a core SYCL feature, so the wording about deprecating the existing `async_work_group_copy` should be removed. The SYCL Khronos group may decide to deprecate that function if/when this extension is adopted into the next SYCL core spec, but the extension itself cannot deprecate it.
To understand why, consider an application that is written to the core SYCL 2020 spec. The author of that application should be able to expect that the application compiles (without deprecation messages) on any conformant SYCL 2020 implementation, regardless of which extensions that implementation supports.
That makes a lot of sense, I have reworded.
This document proposes that `async_work_group_copy` be deprecated and replaced
with `async_group_copy`. `async_group_copy` is a non-member function, in line
with the other group functions, which generalizes `async_work_group_copy` to
also work with sub-groups.
All the other group functions in section 4.17.3 "Group functions" must be encountered in convergent control flow. Is that also the case for the new `async_group_copy` function? My understanding is that the existing `async_work_group_copy` functions do not need to be encountered in convergent control flow and are not collective operations.
I expect there should be a restriction that all the work-items in the group would call the function with the same arguments. I don't think convergence is required. Could you suggest where it might fit better?
I think this relates a bit to one of John's questions #4950 (comment)
I don't think this extension document needs to decide where in the SYCL 2020 spec these functions would eventually be described. That will be decided later by the Khronos SYCL group if/when this extension is adopted into a future version of the core SYCL spec.
The extension spec just needs to document the behavior of the new extended APIs you propose. The current description is somewhat unclear because it says "in line with the other group functions", but it sounds like these APIs are not perfectly in line with them because they do not need to be in convergent code. It would be better to just describe the requirements of the new APIs. It sounds like it is a requirement that all work items in the group must call the API passing the same arguments. You should say this somewhere in the extension spec.
I think I had an incorrect understanding of converged control flow. I don't believe (currently) it should be a requirement that the work-items are synchronized, but all the work-items should call `async_group_copy` with the same arguments while in converged control flow.
I've added a bit saying `async_group_copy` requires convergent control flow and removed the line about it being 'in line with the other group functions'.
like `group_barrier`.

`async_group_copy` methods would be valid for groups `group` and `sub_group`.
Note that the destination and source arguments have been have been swapped to be
Note that the destination and source arguments have been have been swapped to be
Note that the destination and source arguments have been swapped to be
```c++
namespace sycl::ext::oneapi {
template <typename Group, typename dataT>
device_event async_group_copy(Group group, sycl::decorated_local_ptr<dataT> src, sycl::decorated_global_ptr<dataT> dest, size_t numElements);
```
device_event async_group_copy(Group group, sycl::decorated_local_ptr<dataT> src, sycl::decorated_global_ptr<dataT> dest, size_t numElements);
device_event async_group_copy(Group group, sycl::decorated_local_ptr<dataT> src, sycl::decorated_global_ptr<dataT> dest, size_t count);

Kind of a nit, but I think the other copy functions name this parameter `count`.
```c++
...
public:
template <typename Group>
void ext_oneapi_wait(Group group) noexcept;
```
Several questions here:

- If the new `async_group_copy` functions are collective operations, is this also a collective operation? If so, maybe it should be a free function instead.
- What about the existing `nd_item::wait_for()` and `group::wait_for()` functions? Can they still be used to wait for a `device_event` that is returned from the new `async_group_copy` functions?
- I don't understand the reason this needs the `group` parameter. Is it because of the NVIDIA semantics you mention above, where this function acts like a barrier for all work-items in the group?
I'd be happy to get rid of this in favor of a free function with no group. If the user would like synchronisation, they can add their own sync. I would also add a note to the docs specifying that no synchronisation or memory fencing takes place.
Sorry, I don't understand what you mean. What is the free function that you propose adding instead?
Sorry, ignore the earlier comment, I've changed my mind.

- Since how the copy is done is implementation defined, I think `wait` should act as a synchronization point and a mem-fence, so it will need the group to get the scope for that. This would make it a collective operation, and I like the idea of making it a free function.
- The existing `wait_for` would work, but it would implicitly use work-group scope. This could become tricky if it is called in a divergent branch that only that sub-group follows.
- Same as point one.
This has been updated and hopefully much more thoroughly specified.
## Change to `device_event::wait` function
`device_event::wait` will be modified to take the group that will be synchronized in the wait.
This is not technically true. The extension does not modify the existing `device_event::wait()` member function. That function is part of the core spec, so it must continue to be supported. Rather, this extension adds a new function named `ext_oneapi_wait()`.
Good point!
trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
used by permission by Khronos.

This extension is written against the SYCL 2020 revision 3 specification. All
The SYCL 2020 revision 4 has just been published! This extension should be updated to be written against that new revision. I think there is no impact beyond the reference here to "revision 3".
`async_group_copy` will asynchronously copy a given number of values of type
`dataT` from global memory to local memory or from local memory to global
memory. If an asynchorous copy is not supported then `async_group_copy` will
memory. If an asynchorous copy is not supported then `async_group_copy` will
memory. If an asynchronous copy is not supported then `async_group_copy` will
Note that the destination and source arguments have been swapped to be more in
line with other SYCL copy functions.
Note that the destination and source arguments have been swapped to be more in
line with other SYCL copy functions.
NOTE: The order of the destination and source arguments differs from the SYCL 1.2.1 `async_work_group_copy` function to be more in line with other copy functions in SYCL 2020. |
I don't think it's clear what has been swapped here. I suggest making it a non-normative note and making the comparison explicit.
arguments, otherwise the behaviour is undefined.
* All work-items of the group are required to call the function in convergent
control flow, otherwise the behaviour is undefined.
* The type of `dataT` should be a scalar or vector type.
* The type of `dataT` should be a scalar or vector type.
* The type of `dataT` must be a scalar or vector type.
If this is a requirement, use "must".
Also, is this requirement really what we want? The term "scalar type" refers to any of the types listed in 4.14.1 "Scalar data types", which seems reasonable. However, "vector type" means either the `sycl::vec` type or the `sycl::marray` type as described in 4.14.2 "Vector types".

Note that the element type of `sycl::marray` is defined as "must be a numeric type as it is defined by C++ standard", which refers to the C++ named requirement NumericType. However, that requirement is very broad, allowing non-scalar types.

Therefore, it seems weird that the type passed to `async_group_copy` cannot be a non-scalar type, but it can be a `sycl::marray` of non-scalar types. I presume this is not what we intend.

Maybe we want to say that the type passed to `async_group_copy` can be any trivially copyable type?
I didn't realise that `sycl::marray` was for numeric types, that is a bit of a surprise to me. I've gone for trivially copyable as suggested.
Also, I've tried to remove any weak language.
@gmlueck Hi, I've been looking into implementing this and I think it'll be easier for now to just keep it to scalar types and `sycl::vec`.
When implementing for trivially copyable types my plan was to cast the data to a `char*`, but then it was impossible to do the stride correctly without knowing the size of the object, which would require a new SPIR-V instruction.
Hopefully this still leaves scope for that in the future if it would be beneficial.
@gmlueck @Pennycook I was talking to @AerialMantis earlier and we decided that we would actually like to keep trivially copyable in the proposal, but for the first implementation we will just implement SYCL scalar types and `sycl::vec`. In the future we would like to add a SPIR-V extension much like OpGroupAsyncCopy, but also with an object shape (size in bytes) that will allow for the stride to be done properly.
Sorry about the flip-flop.
In general, I think the specifications for DPC++ extensions should match our intended implementation in DPC++. Therefore, if DPC++ intends to implement only scalar types and `sycl::vec` here, I think the spec should state those restrictions. I agree that it would be nicer if the spec allowed any trivially copyable type. If we decide to add that support later, we can update the specification to "revision 2", which broadens the set of allowable types.
The reason I prefer this approach is because DPC++ users can then rely on the extension specification as the documentation of the feature. By contrast, if the specification is "aspirational", then we need some separate document describing which parts of the extension are really implemented. This is less friendly to users.
I would consider an exception to this rule if there are other vendors who plan to implement this extension. If the extension is DPC++ only for now, though, then I don't see any value in specifying some behavior that we don't implement in DPC++.
If you want to capture this issue in the spec, you could add a non-normative section at the end stating that we would like to broaden the set of allowed types in the future. This section could even discuss the challenges with allowing trivially copyable types, noting that a SPIR-V extension is required.
The specification already says the following:

If an asynchronous copy is not supported then `async_group_copy` will fall back on a synchronous copy

Couldn't the initial implementation of `async_group_copy` for types that do not map to native SPIR-V instructions just be implemented using a loop? That would allow the initial implementation to support everything out of the box, and for true asynchronous support for other types to be added later as an optimization.
I've checked a bit more thoroughly this time by implementing the synchronous fallback that @Pennycook suggested for trivially copyable types, so I'm happy that it is actually possible.
I've also added a bit at the end describing the difficulties of OpGroupAsyncCopy for future reference.
```c++
template <typename Group, typename dataT>
device_event async_group_copy(Group group, sycl::decorated_global_ptr<dataT> src, sycl::decorated_local_ptr<dataT> dest, size_t count, src_stride srcStride);
} // namespace sycl::ext::oneapi::experimental
```
I think there should be some description here about the operation that is performed by these functions. The text above lists the restrictions, but it doesn't really say what they do. In particular, I think the following should be clarified:
- That all work items calling the function collectively copy the elements in the range.
- A description of how the `destStride` and `srcStride` work.
Also, I assume that the number of work items in the group need not be the same as the number of elements that are copied, right? Assuming this is the case, it would be good to clarify that too.
For example, maybe something like:
When work items call one of these functions in converged control flow, all work items collectively copy `count` data elements from `src` to `dest`. The number of work items in the group need not be the same as `count`. When `destStride` is provided, contiguous elements from `src` are copied to potentially discontiguous locations in `dest`. The element at `src[0]` is copied to the location at `dest[0*destStride]`, the element at `src[1]` is copied to the location at `dest[1*destStride]`, etc. When `srcStride` is provided, elements that are potentially discontiguous in `src` are copied to contiguous locations in `dest` using a similar scheme.
Also, can `destStride` or `srcStride` be `0`? If this is forbidden, it should be listed as one of the restrictions.
I've specified that `destStride` and `srcStride` have to be non-zero.
I've added a bit to explain stride; I think it is quite clear.
I don't see a need to specify how the copy is done, so I've tried to make it clear that it is implementation defined.
`wait_for` will block until all the asychronous copies represented by the
`device_event` arguments are complete. Data written to a location by `async_group_copy`
should not be read until `wait_for` has been called with the returned
should not be read until `wait_for` has been called with the returned
must not be read until `wait_for` has been called with the returned
`async_group_copy` that returned the `device_event` arguments or the behaviour is undefined.
* All work-items of the group are required to call the function in convergent
control flow, otherwise the behaviour is undefined.
What happens if the application gets a `device_event` from the new `async_group_copy` function and then waits using the old `nd_item::wait_for` or `group::wait_for` functions? Similarly, what happens if the application gets a `device_event` from the old `async_work_group_copy` function and then calls the new `wait_for` free function?

If these are undefined, should this extension declare some new type rather than reusing `device_event`?
I've added a new type `async_copy_event` to avoid the potential undefined behaviour.
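Purely as a speculative illustration of what such a group-tagged event type could look like (consistent with the `Events.Event` usage in the `wait_for` sketch earlier in this thread, but not necessarily the actual type added to the proposal):

```c++
#include <sycl/sycl.hpp>

namespace sycl::ext::oneapi::experimental {

// Hypothetical: tagging the event with its Group type lets the new collective
// wait_for overloads reject events produced by the old async_work_group_copy.
template <typename Group>
struct async_copy_event {
  sycl::device_event Event; // underlying event returned by the backend copy
};

} // namespace sycl::ext::oneapi::experimental
```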
```c++
namespace sycl::ext::oneapi::experimental {
template <typename Group, typename eventTN>
```
template <typename Group, typename eventTN>
template <typename Group, typename ...eventTN>
I assume this is what you meant.
There should also be a statement someplace saying that all the types `eventTN` must be `device_event`.
Yes, thank you.
Also: * Used stronger language * specified what stride does
…cCopy Also: * Fixed less than or equal signs * clarified that async_group_copy can be used with sycl::vec * fixed function signatures
@Pennycook @gmlueck this proposal still needs a few small clarifications/updates for clarity. I will be doing this today, so it's best to wait a little while before reviewing. Thanks
Consistent usage of American English.
I've clarified that the new functions introduced in this proposal have behavior that is consistent with the SYCL/OpenCL functions. I've summarized the other issue in the discussions on this page in point 2 of the #Issues section at the bottom of the document, and I've described a route to its resolution. I don't think that this issue blocks this proposal. I've updated #4907 to be consistent with this proposal: as discussed in #4907 (comment), I've removed the support for non-OpenCL-supported trivially copyable types. I think that both the proposal and implementation PRs are now suitable for review.
This proposal generalizes `async_work_group_copy` from the `nd_item` and `group` classes so it can be used with sub-groups.

Implementation at #4907 and accompanying tests at intel/llvm-test-suite#552

Signed-off-by: Finlay Marno [email protected]