
[SYCL][DOC] Proposal to generalize async_work_group_copy to include sub-group #4950


Closed
wants to merge 12 commits into from

Conversation

Contributor

@FMarno FMarno commented Nov 12, 2021

This proposal generalizes async_work_group_copy from the nd_item and group classes so that it can also be used with sub-groups.

Implementation at #4907 and accompanying tests at intel/llvm-test-suite#552

Signed-off-by: Finlay Marno [email protected]

@FMarno FMarno requested a review from a team as a code owner November 12, 2021 09:46
Comment on lines 72 to 74
`async_group_copy` methods would be valid for groups `group` and `sub_group`.
Note that the destination and source arguments have been have been swapped to be
more in line with other SYCL copy functions.
Contributor

I know this is inherited from SYCL 2020, but the fact that these operations only support copying from local to global memory (or vice versa) is quite subtle. If there aren't any plans to generalize this, I think a short paragraph explaining that here would be helpful.

I'm also curious to hear your thoughts on whether generalizing this would be a good idea, or not. What are the reasons not to support generic pointers here, or copies within an address space?

Contributor Author

Currently most of these arrangements would lead to just falling back on generic code since I don't think the backends have an optimized version. PTX only actually has an asynchronous copy from global to local and not the other way around.

I think this async method should only be used as a potential optimization, and not as the first choice for a basic implementation. I'd guess that in some cases this async method could even lead to overhead (though a benchmark would be required to see if that is true). Restricting it to roughly model the real capabilities of hardware might help users not pick it first.

Contributor

That's a good argument, and I think adding some text to this effect in the proposal would help people understand the difference between this feature and a higher-level algorithm that might come later.

Part of me wonders whether there's anything we can do with the naming to make it clear that this is a low-level feature that may have hardware-specific considerations and map directly to certain hardware features. We don't really have anything like that today -- the closest equivalent I can think of is the native namespace, which is currently only used for math operations.

Contributor Author

The optimization warning has been included. Sorry, I'm not sure about a better name.

Comment on lines 44 to 47
This document proposes that `async_work_group_copy` be deprecated and replaced
with `async_group_copy`. `async_group_copy` is a non-member function, in line
with the other group functions, which generalizes `async_work_group_copy` to
also work with sub-groups.
Contributor

Bikeshedding: It didn't occur to me until today that the name group_copy is aligned with the "Group functions", and it would probably be called joint_group_copy if it were defined as a "Group algorithm".

Is the intent here for this to be a function or an algorithm?

Contributor Author

I'm actually a little bit confused about where async_group_copy fits in. It definitely requires all the work-items to call the function with the same parameters, but I don't see a necessity for the work-items to do the copies synchronously.

We are a bit unsure about the wait as well. We don't think it should necessarily synchronize the group, so maybe it doesn't actually need the group as an argument. The memory a single work-item copied should be consistent for that work-item after wait, but there don't need to be guarantees about the memory that other work-items worked on. I guess the question then is "what memory did the work-item copy?".

Contributor

@Pennycook Pennycook Nov 17, 2021

I just double-checked the OpenCL definitions, and they require both the copy and wait call to be encountered by all work-items in the group: https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_C.html#async-copies.

Given what you said earlier about it being a low-level functionality, I think positioning the copy and wait functions as "Group functions" makes the most sense. Viewing these as a collective operation allows for more implementations (e.g. an implementation can sit atop OpenCL), and avoids the complexity you mentioned -- as soon as you specify that certain work-items copy specific memory locations, you might prevent this operation from being implemented effectively on some hardware.

The semantics that make the most sense to me are:

  • async_group_copy must be encountered by all work-items in converged control flow and acts as a synchronization point with an acquire-release fence. (Otherwise, different work-items might disagree on what values they're copying).
  • wait must be encountered by all work-items in converged control flow and acts as a synchronization point with an acquire-release fence. (Otherwise, different work-items might disagree on what values were copied).

The OpenCL specification is clear that these operations in OpenCL don't imply any sort of fencing, but that seems poorly defined to me.
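
For illustration, a minimal fragment of how a kernel might use the copy/wait pair under these proposed semantics (a sketch only, assuming the free-function spellings `async_group_copy` and `wait_for` discussed later in this thread, and pre-existing `g`, `local`, `global`, and `count` variables):

```c++
// Sketch only: both calls are collective and act as acquire-release
// synchronization points under the semantics proposed above.
auto e = async_group_copy(g, local, global, count); // reached by every work-item of g
// ... work that does not touch the destination may overlap with the copy ...
wait_for(g, e); // after this, the whole group agrees the copied values are visible
```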

Contributor Author

Ok, I agree on copy and wait being collective operations. As you say, I don't want to specify exactly what data each work-item copies.

Since we can't tell what work each work-item did, I think adding a synchronization point with an acquire-release fence to wait makes sense. Otherwise you would just always need one for correctness.

I am a little reluctant to add the synchronization point with an acquire-release fence to async_group_copy since I can see it being used as either the first or last function call in a kernel and it would be a shame to have a wasted sync.

Contributor

I am a little reluctant to add the synchronization point with an acquire-release fence to async_group_copy since I can see it being used as either the first or last function call in a kernel and it would be a shame to have a wasted sync.

I see where you're coming from and I share the reluctance. It feels like the work-items should be able to start opportunistically copying the part of the range assigned to them as soon as they hit the async_group_copy call without a sync, but I don't think that works in practice. I think in real-life use-cases you'd need to add an explicit sync before the group copy (just like with the wait).

Consider this example:

```c++
// Each work-item writes a few elements to local memory with a stride
local[it.get_local_id()] = x;
local[it.get_local_id() + it.get_local_range()] = y;

// Running this without a sync requires this work-item to be responsible for
// copying the same values as written above
async_group_copy(g, local, global, 2 * it.get_local_range());
```

In general I don't see how you can provide the necessary guarantee without specifying a relationship between work-items and the values that they copy. Saying that the copy is implicitly strided across work-items would address the example above, but would then break in a case like this:

```c++
// Each work-item writes a few contiguous elements to local memory
local[it.get_local_id() * it.get_local_range() + 0] = x;
local[it.get_local_id() * it.get_local_range() + 1] = y;

// Running this without a sync requires this work-item to be responsible for
// copying the same values as written above
async_group_copy(g, local, global, 2 * it.get_local_range());
```
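
For completeness, a sketch of the workaround implied here (same fragment context as the examples above, using the standard `sycl::group_barrier`):

```c++
// With an explicit barrier, either write pattern above becomes safe: the
// barrier makes every work-item's writes to local memory visible before
// any work-item begins copying.
sycl::group_barrier(g);
async_group_copy(g, local, global, 2 * it.get_local_range());
```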

Contributor Author

The case would be at the beginning of a kernel, when loading memory from global to local. For CUDA, there is actually no asynchronous load from local to global!

From https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#collectives-cg-memcpy-async :

It is important to note that while this is a memcpy in the general case, it is only asynchronous if the source is global memory and the destination is shared memory and both can be addressed with 16, 8, or 4 byte alignments.

Maybe it is the case that async_group_copy is neither a group function nor a group algorithm, due to the restrictions on them.

I don't want to distract but looking at the implementation for joint_reduce, I don't see any synchronization before the sycl::detail::for_each is done https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/group_algorithm.hpp#L231.

Contributor

I don't want to distract but looking at the implementation for joint_reduce, I don't see any synchronization before the sycl::detail::for_each is done https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/group_algorithm.hpp#L231.

Hm, I didn't realize that, thanks for pointing it out.

I think we need the specification to be clarified before we can actually decide what the right thing to do here is. I've opened an issue at Khronos: https://gitlab.khronos.org/sycl/Specification/-/issues/576.

Contributor

The case would be at the beginning of a kernel, when loading memory from global to local. For CUDA, there is actually no asynchronous load from local to global!

From docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#collectives-cg-memcpy-async :

It is important to note that while this is a memcpy in the general case, it is only asynchronous if the source is global memory and the destination is shared memory and both can be addressed with 16, 8, or 4 byte alignments.

I guess it is somehow an optimization and implementation detail.
For example, the Xilinx OpenCL implementation implements the asynchronous copy as just a synchronous copy.
This is also correct. :-)
So we might also expose things in SYCL even if the CUDA back-end cannot take advantage of it.

Contributor

@JackAKirk JackAKirk Feb 3, 2022

The case would be at the beginning of a kernel, when loading memory from global to local. For CUDA, there is actually no asynchronous load from local to global!
From docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#collectives-cg-memcpy-async :

It is important to note that while this is a memcpy in the general case, it is only asynchronous if the source is global memory and the destination is shared memory and both can be addressed with 16, 8, or 4 byte alignments.

I guess it is somehow an optimization and implementation detail. For example, the Xilinx OpenCL implementation implements the asynchronous copy as just a synchronous copy. This is also correct. :-) So we might also expose things in SYCL even if the CUDA back-end cannot take advantage of it.

cc @Pennycook @keryell

The initial implementation accompanying this proposal did make use of the asynchronous copy using the Direct Memory Access (DMA) hardware for copying from global to shared memory introduced in the sm_80 (Ampere) architecture, but it did not fully allow the overlapping of these DMA memory copies with computes, since the wait_for implementation waited for all outstanding asynchronous copies. We now plan to introduce functionality similar to "wait_prior" from the CUDA runtime so that we can get more "free" computes during the async copies, by executing multiple async copies and only waiting for those necessary for a following compute.

This poses some obvious questions/problems because the cuda backend does not return anything that can be translated into a device_event from async_group_copy (as discussed #4903 (comment) the backend instead takes an argument N telling it how many async copies to NOT wait for). Currently there is no backend in DPC++ (or any other SYCL implementation as far as I am aware!) that actually uses async_work_group_copy asynchronously, and I am not sure that any GPU backend beyond Nvidia actually has hardware to support it! It looks like it is mostly CPU and FPGA backends that might have supported asynchronous async_work_group_copy via DMA, although this is just a guess from some google searches, so if anyone can clarify this I'd appreciate it!
The OpenCL async_work_group_copy is more than a decade old so it would be good to confirm that it is still being actively used, since I am a little uncomfortable that it is incompatible with the only backend in DPC++ currently supporting asynchronous copies to shared memory (cuda). If in the future other backends plan to implement asynchronous copies from global to local memory, it would be an opportune moment to consider whether they will make use of the existing async_work_group_copy/device_event interface/workflow.

Assuming that we do keep the existing device_event workflow we need to work out how we can make a compatible cuda implementation.
One option would be to introduce a new optional template parameter to sycl::wait_for that can be used by (at least) the cuda backend to NOT wait for numEvents, and only wait for the other totalEvents - numEvents asynchronous copies, something like:

```c++
template <int numEvents, typename Group, typename... eventT>
std::enable_if_t<(sycl::is_group_v<Group>) &&
                 (std::is_same_v<eventT, async_copy_event<Group>> && ...)>
wait_for(Group, eventT... Events) {
#ifdef __NVPTX__
  if constexpr (numEvents == 0) {
    (__spirv_GroupWaitEvents(detail::group_to_scope<Group>(), 0, &Events.Event),
     ...);
  } else {
    __nvvm_cp_async_wait_group(numEvents);
    __spirv_ControlBarrier(sycl::detail::group_barrier_scope<Group>::Scope, 0,
                           __spv::MemorySemanticsMask::AcquireRelease |
                               __spv::MemorySemanticsMask::SubgroupMemory |
                               __spv::MemorySemanticsMask::WorkgroupMemory |
                               __spv::MemorySemanticsMask::CrossWorkgroupMemory);
  }
#else
  (__spirv_GroupWaitEvents(detail::group_to_scope<Group>(), 0, &Events.Event),
   ...);
#endif
}
```

If this is not acceptable I think we will have to somehow add a list of the order of the async copy device_event events to the cuda backend so that we can map each device_event to the correct N numEvents.

I am assuming that the async_work_group_copy from the SYCL spec is still useful for some backends in DPC++ to achieve true asynchronous copies. I would be more comfortable actually knowing this. If on the other hand async_work_group_copy is not appropriate then we should have a conversation about what a more appropriate interface would be for general DPC++ backends.

I think that we would prefer to get this proposal and accompanying implementation merged asap as it is, so that we can address the issue stated in this message at a later date, with a CUDA specific proposal if necessary.

Contributor

@JackAKirk JackAKirk Feb 9, 2022

The meaning of "synchronization points" here isn't expanded upon, but the standard C++ synchronizes-with relationship implies some agreement about the state of memory. The definition of group barrier in the glossary says:

A synchronization function within a group of work items. All the work items of a group must execute the barrier construct before any work-item continues execution beyond the barrier. Additionally all work-items in the group execute a release mem-fence prior to synchronizing at the barrier, all work-items in the group execute an acquire mem-fence after synchronizing at the barrier, and there is an implicit synchronization between these acquire and release fences as if through an atomic operation on an atomic object internal to the barrier implementation.

...so I think it's possible to read this as saying that there are appropriate fences and barriers at the beginning and end of each group algorithm.

I think that we agree that https://gitlab.khronos.org/sycl/Specification/-/issues/576 should clarify the meaning of

"These group functions act as synchronization points and must be encountered in converged control flow by all work-items in the group."

As Finlay points out, the current implementation of the joint_reduce group function is not consistent with an interpretation of the above statement as meaning that group collectives require fencing by definition.

As you point out the OpenCL definition of async_work_group_copy doesn't imply any sort of fencing. I believe that this proposal defines async_group_copy as a straightforward generalization of async_work_group_copy to support sub-groups. The behaviour of async_group_copy (or whatever its final name is chosen to be!) in the group case should match the behaviour of async_work_group_copy. This should mean that if async_group_copy needs to be altered then async_work_group_copy should be identically altered (They would have to be atm, since they use the same SPIR-V implementation!).

There are strong performance reasons for not synchronizing with sycl::barrier prior to a call to async_group_copy. I find that if I add an additional call to barrier prior to async_group_copy when copying 8KB of data from global to shared memory using the maximum work-group size, the sm_80 DMA performance is worse than a non-asynchronous copy on the same device (assuming there is no useful compute that can be performed during the copy without using the new data being copied), even though the DMA route copies from global to shared memory more efficiently by avoiding the use of registers.
It is worth noting that although the CUDA implementation of the asynchronous copy does not require the work-items to be converged, there is a considerable performance advantage to the work-items being converged: see this discussion on warp entanglement: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-entanglement-commit.
Our existing PTX implementation is consistent with the CUDA runtime, where it is also left up to the user to ensure that the threads are converged.
Taking this and the OpenCL definition of async_work_group_copy into account, and given that I am not aware of any DMA setups other than those supported by OpenCL and CUDA, it seems likely that for the purposes of async_group_copy the barrier prior to the call will not be necessary unless the user has explicitly diverged the work-items. However, since async_group_copy may fall under the banner of SYCL group functions in the future, if a leading barrier is decided to be important for the other group functions we may have to add one to the beginning of async_group_copy. I don't think that this in itself can be a blocker to merging this proposal, particularly since it is aimed at the oneapi::experimental namespace.

As an aside, for the bigger picture it is worth mentioning that asynchronous copies are most advantageous when copying as much memory as possible to shared memory in a single kernel, and when multiple memory copies (often roughly 8KB each) can be initiated in tandem, with concurrent computes (and therefore waits) on subsets of these data copies while other subsets are left in flight (I think this may be referred to as "double buffering"). The most obvious application for this in DPC++ that comes to mind is Matrix MMA kernels; however, the current interface of async_work_group_copy/async_group_copy is incompatible with such an application, since it would be necessary to specify precisely the data that each work-item will copy (I don't see that this could not be done in a sub-group/group convergent way). Memory copies to shared memory can be a bottleneck for such kernels.
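
As a very rough illustration of the double-buffering pattern described above (fragment only; `tile0`, `tile1`, `global_src`, `TILE`, `num_tiles`, and `compute_on` are hypothetical names, and this assumes a future interface where each copy can be waited on independently, which is exactly what the current interface does not provide):

```c++
// Hypothetical double-buffering fragment: stage tiles of global data into two
// local buffers and overlap compute on one tile with the copy of the next.
auto e0 = async_group_copy(g, global_src + 0 * TILE, tile0, TILE);
auto e1 = async_group_copy(g, global_src + 1 * TILE, tile1, TILE);
for (size_t t = 0; t < num_tiles; ++t) {
  auto  buf = (t % 2 == 0) ? tile0 : tile1; // local buffer holding tile t
  auto &ev  = (t % 2 == 0) ? e0 : e1;
  wait_for(g, ev);        // wait only for the copy feeding this tile
  compute_on(buf);        // overlaps with the other copy still in flight
  if (t + 2 < num_tiles)  // start refilling this buffer with a later tile
    ev = async_group_copy(g, global_src + (t + 2) * TILE, buf, TILE);
}
```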

## Introduction

This document proposes that `async_work_group_copy` be deprecated and replaced
with `async_group_copy`. `async_group_copy` is a non-member function, in line
Contributor

An extension cannot deprecate a core SYCL feature, so the wording about deprecating the existing async_work_group_copy should be removed. The SYCL Khronos group may decide to deprecate that function if/when this extension is adopted into the next SYCL core spec, but the extension itself cannot deprecate it.

To understand why, consider an application that is written to the core SYCL 2020 spec. The author of that application should be able to expect that the application compiles (without deprecation messages) on any conformant SYCL 2020 implementation, regardless of which extensions that implementation supports.

Contributor Author

@FMarno FMarno Nov 18, 2021

That makes a lot of sense, I have reworded.

This document proposes that `async_work_group_copy` be deprecated and replaced
with `async_group_copy`. `async_group_copy` is a non-member function, in line
with the other group functions, which generalizes `async_work_group_copy` to
also work with sub-groups.
Contributor

All the other group functions in section 4.17.3 "Group functions" must be encountered in convergent control flow. Is that also the case for the new async_group_copy function? My understanding is that the existing async_work_group_copy functions do not need to be encountered in convergent control flow and are not collective operations.

Contributor Author

I expect there should be a restriction that all the work-items in the group call the function with the same arguments. I don't think convergence is required. Could you suggest where it might fit better?
I think this relates a bit to one of John's questions: #4950 (comment)

Contributor

I don't think this extension document needs to decide where in the SYCL 2020 spec these functions would eventually be described. That will be decided later by the Khronos SYCL group if/when this extension is adopted into a future version of the core SYCL spec.

The extension spec just needs to document the behavior of the new extended APIs you propose. The current description is somewhat unclear because it says "in line with the other group functions", but it sounds like these APIs are not perfectly in line with them because they do not need to be in convergent code. It would be better to just describe the requirements of the new APIs. It sounds like it is a requirement that all work items in the group must call the API passing the same arguments. You should say this somewhere in the extension spec.

Contributor Author

I think I had an incorrect understanding of converged control flow. I don't believe (currently) it should be a requirement that the work-items are synchronized, but all the work-items should call async_group_copy with the same arguments while in converged control flow.
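
A small fragment illustrating the distinction (hypothetical `g`, `src`, `dest`, `count`; the first call violates the convergence requirement, the second satisfies it):

```c++
// Not allowed: only work-item 0 reaches the call, so the group is not in
// converged control flow and the behaviour would be undefined.
if (it.get_local_id(0) == 0)
  async_group_copy(g, src, dest, count);

// Allowed: every work-item of g reaches the call with the same arguments,
// even though no barrier or synchronization is required beforehand.
auto e = async_group_copy(g, src, dest, count);
```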

Contributor Author

I've added a bit saying async_group_copy requires convergent control flow and removed the line about it being 'in line with the other group functions'

like `group_barrier`.

`async_group_copy` methods would be valid for groups `group` and `sub_group`.
Note that the destination and source arguments have been have been swapped to be
Contributor

Suggested change
Note that the destination and source arguments have been have been swapped to be
Note that the destination and source arguments have been swapped to be

```c++
namespace sycl::ext::oneapi {
template <typename Group, typename dataT>
device_event async_group_copy(Group group, sycl::decorated_local_ptr<dataT> src, sycl::decorated_global_ptr<dataT> dest, size_t numElements);
Contributor

Suggested change
device_event async_group_copy(Group group, sycl::decorated_local_ptr<dataT> src, sycl::decorated_global_ptr<dataT> dest, size_t numElements);
device_event async_group_copy(Group group, sycl::decorated_local_ptr<dataT> src, sycl::decorated_global_ptr<dataT> dest, size_t count);

Kind of a nit, but I think the other copy functions name this parameter count.

...
public:
template <typename Group>
void ext_oneapi_wait(Group group) noexcept;
Contributor

Several questions here:

  • If the new async_group_copy functions are collective operations, is this also a collective operation? If so, maybe it should be a free function instead.

  • What about the existing nd_item::wait_for() and group::wait_for() functions? Can they still be used to wait for a device_event that is returned from the new async_group_copy functions?

  • I don't understand the reason this needs the group parameter. Is it because of the NVIDIA semantics you mention above, where this function acts like a barrier for all work-items in the group?

Contributor Author

I'd be happy to get rid of this in favor of a free function with no group. In the case that the user would like synchronisation, they can add their own sync. I would also add a note to the docs specifying that no synchronisation or memory fencing takes place.

Contributor

Sorry, I don't understand what you mean. What is the free function that you propose adding instead?

Contributor Author

Sorry, ignore the earlier comment; I've changed my mind.

  • Since how the copy is done is implementation defined, I think wait should act as a synchronization point and a mem-fence, so it will need the group to get the scope for that. This would make it a collective operation and I like the idea of making it a free function.
  • The existing wait_for would work, but it would implicitly use work-group scope. This could become tricky if it is called in a divergent branch that only that sub-group follows.
  • Same as point one.

Contributor Author

This has been updated and hopefully much more thoroughly specified.

```

## Change to `device_event::wait` function
`device_event::wait` will be modified to take the group that will be synchronized in the wait.
Contributor

This is not technically true. The extension does not modify the existing device_event::wait() member function. That function is part of the core spec, so it must continue to be supported. Rather, this extension adds a new function named ext_oneapi_wait().

Contributor Author

Good point!

trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
used by permission by Khronos.

This extension is written against the SYCL 2020 revision 3 specification. All
Contributor

The SYCL 2020 revision 4 has just been published! This extension should be updated to be written against that new revision. I think there is no impact beyond the reference here to "revision 3".


`async_group_copy` will asynchronously copy a given number of values of type
`dataT` from global memory to local memory or from local memory to global
memory. If an asynchorous copy is not supported then `async_group_copy` will
Contributor

Suggested change
memory. If an asynchorous copy is not supported then `async_group_copy` will
memory. If an asynchronous copy is not supported then `async_group_copy` will

Comment on lines 94 to 95
Note that the destination and source arguments have been swapped to be more in
line with other SYCL copy functions.
Contributor

Suggested change
Note that the destination and source arguments have been swapped to be more in
line with other SYCL copy functions.
NOTE: The order of the destination and source arguments differs from the SYCL 1.2.1 `async_work_group_copy` function to be more in line with other copy functions in SYCL 2020.

I don't think it's clear what has been swapped here. I suggest making it a non-normative note and making the comparison explicit.

@FMarno FMarno requested review from Pennycook and gmlueck December 15, 2021 11:05
arguments, otherwise the behaviour is undefined.
* All work-items of the group are required to call the function in convergent
control flow, otherwise the behaviour is undefined.
* The type of `dataT` should be a scalar or vector type.
Contributor

Suggested change
* The type of `dataT` should be a scalar or vector type.
* The type of `dataT` must be a scalar or vector type.

If this is a requirement, use "must".

Also, is this requirement really what we want? The term "scalar type" refers to any of the types listed in 4.14.1. "Scalar data types", which seems reasonable. However, "vector type" means either the sycl::vec type or the sycl::marray type as described in 4.14.2. Vector types.

Note that the element type of sycl::marray is defined as "must be a numeric type as it is defined by C++ standard", which refers to the C++ named requirement NumericType. However, that requirement is very broad, allowing non-scalar types.

Therefore, it seems weird that the type passed to async_group_copy cannot be a non-scalar type, but it can be a sycl::marray of non-scalar types. I presume this is not what we intend.

Maybe we want to say that the type passed to async_group_copy can be any trivially copyable type?
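
For what it's worth, a sketch of how such a restriction could be spelled if the extension settled on "any trivially copyable type" (hypothetical; it mirrors the proposal's signature but adds a constraint that is not in the current text):

```c++
#include <type_traits>

namespace sycl::ext::oneapi::experimental {
// Hypothetical constraint: only trivially copyable element types, so the
// copy could in principle be lowered to a plain byte-wise transfer.
template <typename Group, typename dataT,
          typename = std::enable_if_t<std::is_trivially_copyable_v<dataT>>>
device_event async_group_copy(Group group,
                              sycl::decorated_local_ptr<dataT> src,
                              sycl::decorated_global_ptr<dataT> dest,
                              size_t count);
} // namespace sycl::ext::oneapi::experimental
```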

Contributor Author

I didn't realise that sycl::marray was for numeric types; that is a bit of a surprise to me. I've gone for trivially copyable as suggested.

Also, I've tried to remove any weak language.

Contributor Author

@gmlueck Hi, I've been looking into implementing this and I think it'll be easier for now to just keep it to scalar types and sycl::vec.
When implementing for trivially copyable types my plan was to cast the data to a char*, but then it was impossible to do the stride correctly without knowing the size of the object, which would require a new SPIR-V instruction.
Hopefully this still leaves scope for that in the future if it would be beneficial.

Contributor Author

@gmlueck @Pennycook
I was talking to @AerialMantis earlier and we decided that we would actually like to keep trivially copyable in the proposal, but for the first implementation we will just implement SYCL scalar types and sycl::vec. In the future we would like to add a SPIR-V extension much like OpGroupAsyncCopy, but also with an object shape (size in bytes) that will allow for the stride to be done properly.
Sorry about the flip-flop.

Contributor

In general, I think the specifications for DPC++ extensions should match our intended implementation in DPC++. Therefore, if DPC++ intends to implement only scalar types and sycl::vec here, I think the spec should state those restrictions. I agree that it would be nicer if the spec allowed any trivially copyable type. If we decide to add that support later, we can update the specification to "revision 2" which broadens the set of allowable types.

The reason I prefer this approach is because DPC++ users can then rely on the extension specification as the documentation of the feature. By contrast, if the specification is "aspirational", then we need some separate document describing which parts of the extension are really implemented. This is less friendly to users.

I would consider an exception to this rule if there are other vendors who plan to implement this extension. If the extension is DPC++ only for now, though, then I don't see any value in specifying some behavior that we don't implement in DPC++.

If you want to capture this issue in the spec, you could add a non-normative section at the end stating that we would like to broaden the set of allowed types in the future. This section could even discuss the challenges with allowing trivially copyable types, noting that a SPIR-V extension is required.

Contributor

The specification already says the following:

If an asynchronous copy is not supported then async_group_copy will fall back on a synchronous copy

Couldn't the initial implementation of async_group_copy for types that do not map to native SPIR-V instructions just be implemented using a loop? That would allow the initial implementation to support everything out of the box, and for true asynchronous support for other types to be added later as an optimization.
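
As a rough sketch of the kind of fallback loop being suggested (hypothetical helper; a real implementation may partition the range differently):

```c++
// Each work-item of the group copies a strided subset of the count elements;
// visibility of the result is left to the subsequent wait, matching the
// asynchronous-copy contract discussed above.
template <typename Group, typename dataT>
void synchronous_fallback_copy(Group g, const dataT *src, dataT *dest,
                               size_t count) {
  const size_t id   = g.get_local_linear_id();
  const size_t size = g.get_local_linear_range();
  for (size_t i = id; i < count; i += size)
    dest[i] = src[i];
}
```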

Contributor Author

I've checked a bit more thoroughly this time by implementing the synchronous fallback that @Pennycook suggested for trivially copyable types, so I'm happy that it is actually possible.
I've also added a bit at the end describing the difficulties of OpGroupAsyncCopy for future reference.

template <typename Group, typename dataT>
device_event async_group_copy(Group group, sycl::decorated_global_ptr<dataT> src, sycl::decorated_local_ptr<dataT> dest, size_t count, src_stride srcStride);
} // namespace sycl::ext::oneapi::experimental
```
Contributor

I think there should be some description here about the operation that is performed by these functions. The text above lists the restrictions, but it doesn't really say what they do. In particular, I think the following should be clarified:

  • That all work items calling the function collectively copy the elements in the range.
  • A description of how the destStride and srcStride work.

Also, I assume that the number of work items in the group need not be the same as the number of elements that are copied, right? Assuming this is the case, it would be good to clarify that too.

For example, maybe something like:

When work items call one of these functions in converged control flow, all work items collectively copy count data elements from src to dest. The number of work items in the group need not be the same as count. When destStride is provided, contiguous elements from src are copied to potentially discontiguous locations in dest. The element at src[0] is copied to the location at dest[0*destStride], the element at src[1] is copied to the location at dest[1*destStride], etc. When srcStride is provided, elements that are potentially discontiguous in src are copied to contiguous locations in dest using a similar scheme.

Also, can destStride or srcStride be 0? If this is forbidden, it should be listed as one of the restrictions.
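
To make the destStride mapping described above concrete, a plain serial loop expressing the same indexing (illustration only; not how an implementation would distribute the work across work-items):

```c++
// Copies count elements from a contiguous src into dest spaced destStride
// elements apart: dest[i * destStride] = src[i]; destStride assumed non-zero.
template <typename dataT>
void copy_with_dest_stride(const dataT *src, dataT *dest, size_t count,
                           size_t destStride) {
  for (size_t i = 0; i < count; ++i)
    dest[i * destStride] = src[i];
}
```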

Contributor Author

I've specified that destStride and srcStride have to be non-zero.
I've added a bit to explain stride, I think it is quite clear.

I don't see a need to specify how the copy is done so I've tried to make it clear that it is implementation defined.


`wait_for` will block until all the asychronous copies represented by the
`device_event` arguments are complete. Data written to a location by `async_group_copy`
should not be read until `wait_for` has been called with the returned
Contributor

Suggested change
should not be read until `wait_for` has been called with the returned
must not be read until `wait_for` has been called with the returned

`async_group_copy` that returned the `device_event` arguments or the behaviour is undefined.
* All work-items of the group are required to call the function in convergent
control flow, otherwise the behaviour is undefined.

Contributor

What happens if the application gets a device_event from the new async_group_copy function and then waits using the old nd_item::wait_for or group::wait_for functions? Similarly, what happens if the application gets a device_event from the old async_work_group_copy function and then calls the new wait_for free function?

If these are undefined, should this extension declare some new type rather than reusing device_event?

Contributor Author

I've added a new type async_copy_event to avoid the potential undefined behaviour.


```c++
namespace sycl::ext::oneapi::experimental {
template <typename Group, typename eventTN>
Contributor

Suggested change
template <typename Group, typename eventTN>
template <typename Group, typename ...eventTN>

I assume this is what you meant.

There should also be a statement someplace saying that all the types eventTN must be device_event.
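
One way that "all the types eventTN must be device_event" could be expressed, as a sketch (the extension text could equally just state it in prose):

```c++
// Hypothetical constraint: the variadic wait_for participates in overload
// resolution only when every event argument is a device_event.
template <typename Group, typename... eventTN>
std::enable_if_t<(std::is_same_v<eventTN, device_event> && ...)>
wait_for(Group g, eventTN... events);
```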

Contributor Author

Yes, thank you.

Also:
* Used stronger language
* specified what stride does
…cCopy

Also:
* Fixed less than or equal signs
* clarified that async_group_copy can be used with sycl::vec
* fixed function signatures
@JackAKirk JackAKirk self-requested a review February 9, 2022 18:00
Contributor

JackAKirk commented Feb 11, 2022

@Pennycook @gmlueck this proposal still needs a few small clarifications/updates for clarity. I will be doing this today, so it's best to wait a little while before reviewing. Thanks

Contributor

JackAKirk commented Feb 11, 2022

@Pennycook @gmlueck this proposal still needs a few small clarifications/updates for clarity. I will be doing this today, so it's best to wait a little while before reviewing. Thanks

I've clarified that the new functions introduced in this proposal have behavior that is consistent with the SYCL/OpenCL async_work_group_copy functions in the overlapping case, i.e. when the group type is sycl::group rather than sycl::sub_group.

I've summarized the other issue in the discussions on this page in point 2 of the #Issues section at the bottom of the document, and I've described a route to the resolution. I don't think that this issue blocks this proposal.

I've updated #4907 to be consistent with this proposal: as discussed #4907 (comment), I've removed the support for trivially copyable types that are not supported by OpenCL.

I think that both the proposal and implementation PRs are now suitable for review.
