[SYCL] Add a document describing the current status of reduction implementation #4831
Conversation
Add a document describing the current status of reduction implementation

Signed-off-by: Vyacheslav N Klochkov <[email protected]>
Good stuff! Some minor corrections and a style note: some of the inline code snippets (like "usm memory") are not code. Would it make sense to use bold or italic for highlighting these instead?
Co-authored-by: Steffen Larsen <[email protected]>
Are we planning to implement (and support) both the SYCL 2020 version of the reduction API and also the "ext::oneapi" version? As you note in the document, the APIs are similar, but not exactly the same.

This question is for @Pennycook . I suppose the current plan is to support both APIs.

My preference is to not support the "ext::oneapi" version.

If it is not planned to be supported anymore, then it is a good time to add a deprecation warning to ext::oneapi::reduction
As discussed in #4831, we should officially deprecate the original reduction extension in favor of the equivalent SYCL 2020 reduction functionality. This PR only makes the changes to the documentation -- adding deprecation notices to the implementation will be done in a separate PR.
LGTM
@Pennycook - do you think it makes sense to merge this PR?
There are 2 specifications of the reduction feature and both are still current:
* `sycl::ext::oneapi::reduction` is described in [this document](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Reduction/Reduction.md). This extension was created as part of a pathfinding/prototyping work before it was added to SYCL 2020 standard.
Suggested change:
```diff
- * `sycl::ext::oneapi::reduction` is described in [this document](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Reduction/Reduction.md). This extension was created as part of a pathfinding/prototyping work before it was added to SYCL 2020 standard.
+ * `sycl::ext::oneapi::reduction` is described in [this document](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Reduction/Reduction.md). This extension is deprecated, and was created as part of a pathfinding/prototyping work before it was added to SYCL 2020 standard.
```
Thank you. Fixed as suggested: 55c3d75
## Reduction inside 1 work-group - the main building block for reduction

The reduction algorithm for 1 basic block depends currently on combination/availability of 2 features:
Suggested change:
```diff
- The reduction algorithm for 1 basic block depends currently on combination/availability of 2 features:
+ The reduction algorithm for 1 work-group depends currently on combination/availability of 2 features:
```
Misprint. Thank you. Fixed as suggested: 55c3d75
- `fast atomics` (i.e. operations available on the target device), such as fetch_add(), fetch_min(), etc.
- `fast reduce` operation for work-group, i.e. `reduce_over_group()`

So, if the reduction operation/type has both `fast atomics` and `fast reduce`, then the reduction on work-group with 1 reduction variable is implemented using does the following: the elements inside the work-group are reduced using `ext::oneapi::reduce` and the final result is atomically added to the final/global reduction variable.
Suggested change:
```diff
- So, if the reduction operation/type has both `fast atomics` and `fast reduce`, then the reduction on work-group with 1 reduction variable is implemented using does the following: the elements inside the work-group are reduced using `ext::oneapi::reduce` and the final result is atomically added to the final/global reduction variable.
+ So, if the reduction operation/type has both `fast atomics` and `fast reduce`, then the reduction on work-group with 1 reduction variable does the following: the elements inside the work-group are reduced using `ext::oneapi::reduce` and the final result is atomically added to the final/global reduction variable.
```
Thank you. Fixed as suggested: 55c3d75
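For illustration, the fast path described above can be modeled on the host in plain C++. This is a hedged sketch, not the actual SYCL implementation: `reduce_over_group_sim` is an invented stand-in for `reduce_over_group()`, and `std::atomic` stands in for the device's fast atomics.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <numeric>
#include <vector>

// Stand-in for `reduce_over_group()`: reduce the elements owned by
// one work-group. (Invented name; the real API runs on the device.)
int reduce_over_group_sim(const std::vector<int>& group_elems) {
    return std::accumulate(group_elems.begin(), group_elems.end(), 0);
}

// Fast path: each work-group reduces its own elements with the
// `fast reduce` operation, then publishes its partial result to the
// global reduction variable with a single `fast atomic` per group.
int fast_path_reduce(const std::vector<int>& data, std::size_t wg_size) {
    std::atomic<int> global{0};
    for (std::size_t first = 0; first < data.size(); first += wg_size) {
        std::size_t last = std::min(first + wg_size, data.size());
        std::vector<int> group(data.begin() + first, data.begin() + last);
        int partial = reduce_over_group_sim(group);            // fast reduce
        global.fetch_add(partial, std::memory_order_relaxed);  // fast atomic
    }
    return global.load();
}
```

The key property this models is that each work-group issues exactly one atomic operation on the global reduction variable, regardless of how many elements it processed.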
The most general case is when the `fast atomics` and `fast reduce` are not available.
It computes the partial sum using tree-reduction loop and stores the partial sum for the work-group to a global array of partial sums, which later must be also reduced (by using additional(s) kernel(s)). This general case algorithm also requires allocation of a local memory for the tree-reduction loop:
Suggested change:
```diff
- It computes the partial sum using tree-reduction loop and stores the partial sum for the work-group to a global array of partial sums, which later must be also reduced (by using additional(s) kernel(s)). This general case algorithm also requires allocation of a local memory for the tree-reduction loop:
+ It computes the partial sum using tree-reduction loop and stores the partial sum for the work-group to a global array of partial sums, which later must be also reduced (by using additional kernel(s)). This general case algorithm also requires allocation of a local memory for the tree-reduction loop:
```
Thank you. Fixed as suggested: 55c3d75
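To make the tree-reduction idea concrete, here is a hedged host-side C++ model of the general case (invented names, not the actual SYCL code): each work-group fills a simulated "local memory" buffer, runs a halving tree-reduction loop over it, and writes one partial sum per group to a global array. The sketch assumes `wg_size` is a power of two, as the halving loop requires.

```cpp
#include <cstddef>
#include <vector>

// Host model of the general case: one partial sum per work-group is
// produced; the returned array still needs a further reduction pass
// (the "additional kernel(s)"). `wg_size` must be a power of two.
std::vector<int> partial_sums_per_group(const std::vector<int>& data,
                                        std::size_t wg_size) {
    std::vector<int> partials;
    for (std::size_t first = 0; first < data.size(); first += wg_size) {
        // Copy the group's elements into simulated local memory,
        // padding with the identity (0 for plus) up to wg_size.
        std::vector<int> local(wg_size, 0);
        for (std::size_t i = 0; i < wg_size && first + i < data.size(); ++i)
            local[i] = data[first + i];
        // Tree-reduction loop: halve the active range each step.
        for (std::size_t stride = wg_size / 2; stride > 0; stride /= 2)
            for (std::size_t i = 0; i < stride; ++i)
                local[i] += local[i + stride];
        partials.push_back(local[0]);  // one partial sum per work-group
    }
    return partials;
}
```

On a device, each iteration of the stride loop would be separated by a work-group barrier; the serial host loop above elides that.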
The *additional kernel(s)* that reduce the partial sums look very similar to the *main* kernel; the only difference is that they do not call the user's function.

---
TODO #1 (Performance): After the reduction accepting `range` was implemented, it became possible to simply re-use that implementation for the *additional kernel* runs. Instead of running the *additional kernel* several times (if the number of partial sums is still big), it is better to run the parallel_for accepting `range` once.
I think you should qualify here which devices you're talking about, unless these statements about performance apply to all backends and devices.
This is a general comment, i.e. for all backends/devices. If there is an efficient implementation of parallel_for(range, reduction), then why not use it to reduce, let's say, 1M partial sums, instead of adding N calls of additional kernels inside the parallel_for(nd_range, reduction) implementation?
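A host-side sketch of the point being made (hypothetical names, not implementation code): each iteration of the `while` loop models one extra "additional kernel" launch of the current nd_range strategy, whereas a range-based reduction would consume all remaining partial sums in a single pass.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Model of the current strategy: repeatedly launch an "additional
// kernel" that shrinks the array of partial sums by a factor of
// `wg_size` until one value remains. With 1M partial sums and
// wg_size == 256 this needs 3 extra launches; a single
// parallel_for(range, reduction) pass would need only 1.
int reduce_by_repeated_kernels(std::vector<int> sums, std::size_t wg_size,
                               int* launches = nullptr) {
    int n_launches = 0;
    while (sums.size() > 1) {
        std::vector<int> next;
        for (std::size_t first = 0; first < sums.size(); first += wg_size) {
            std::size_t last = std::min(first + wg_size, sums.size());
            next.push_back(std::accumulate(sums.begin() + first,
                                           sums.begin() + last, 0));
        }
        sums = std::move(next);
        ++n_launches;  // one more "additional kernel" launch
    }
    if (launches) *launches = n_launches;
    return sums.empty() ? 0 : sums[0];
}
```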
### 3) Support `parallel_for` accepting `range` and 2 or more reduction variables.
Currently `parallel_for()` accepting `range` may handle only 1 reduction variable. It does not support 2 or more.

The temporary work-around for that is to use some container for multiple reduction variables, i.e. std::pair, std::struct or a custom struct/class containing 2 or more reduction variables, and also define a custom operator that would be passed to the `reduction` constructor.
What is `std::struct`? I'm not familiar with it, and haven't been able to find any information online.
Misprint. I was thinking about std::tuple and struct (which is written a few words later), which resulted in the misprint "std::struct".
Fixed in: 55c3d75
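As an illustration of the work-around described above (a hypothetical sketch, not code from the implementation): pack two reduction variables, here a sum and a minimum, into one struct and combine values with a custom operator of the same shape as the one that would be passed to the `reduction` constructor.

```cpp
#include <climits>
#include <vector>

// Hypothetical container for two reduction variables.
struct SumMin {
    int sum;
    int min;
};

// Custom combine operator: the analogue of what would be passed to
// the `reduction` constructor in the work-around.
SumMin combine(const SumMin& a, const SumMin& b) {
    return {a.sum + b.sum, a.min < b.min ? a.min : b.min};
}

// Serial host model of reducing with the combined variable; the
// identity is {0, INT_MAX} (0 for plus, INT_MAX for min).
SumMin reduce_sum_min(const std::vector<int>& data) {
    SumMin acc{0, INT_MAX};
    for (int x : data)
        acc = combine(acc, SumMin{x, x});
    return acc;
}
```

The operator must be associative and commutative for the reduction to be valid under any grouping of work-items; the component-wise combination above preserves that because plus and min individually are.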
I'm not opposed to things being merged in their current form, but doing another read through this morning I spotted a few typos and some points that could be clarified.
LGTM. I will approve once @Pennycook's comments have been addressed.
Signed-off-by: Vyacheslav N Klochkov <[email protected]>