Skip to content

[SYCL] Add a document describing the current status of reduction impl… #4831

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 15, 2021

Conversation

v-klochkov
Copy link
Contributor

…ementation

Signed-off-by: Vyacheslav N Klochkov [email protected]

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff! Some minor corrections and a style note:

Some of the inline code snippets (like "usm memory") are not code. Would it make sense to use bold or italic for highlighting these instead?

@v-klochkov v-klochkov marked this pull request as ready for review November 1, 2021 18:30
@v-klochkov v-klochkov requested a review from a team as a code owner November 1, 2021 18:30
@gmlueck
Copy link
Contributor

gmlueck commented Nov 1, 2021

Are we planning to implement (and support) both the SYCL 2020 version of the reduction API and also the "ext::oneapi" version? As you note in the document, the APIs are similar, but not exactly the same.

@v-klochkov
Copy link
Contributor Author

As you note in the document, the APIs are similar, but n

This question is for @Pennycook . I suppose the current plan is to support both APIs.

@Pennycook
Copy link
Contributor

This question is for @Pennycook . I suppose the current plan is to support both APIs.

My preference is to not support the ext::oneapi version any longer than we have to. The SYCL 2020 version is superior, and we should be steering developers towards using it. I would be in favor of marking the original extension deprecated as soon as possible, and removing it in the first release where that is an option.

@v-klochkov
Copy link
Contributor Author

This question is for @Pennycook . I suppose the current plan is to support both APIs.

My preference is to not support the ext::oneapi version any longer than we have to. The SYCL 2020 version is superior, and we should be steering developers towards using it. I would be in favor of marking the original extension deprecated as soon as possible, and removing it in the first release where that is an option.

If it is not planned to be supported anymore, then it is good time to add the deprecated warning to ext::oneapi::reduction

dm-vodopyanov pushed a commit that referenced this pull request Nov 12, 2021
As discussed in #4831, we should officially deprecate the original reduction extension in favor of the equivalent SYCL 2020 reduction functionality.  This PR only makes the changes to the documentation -- adding deprecation notices to the implementation will be done in a separate PR.
s-kanaev
s-kanaev previously approved these changes Dec 8, 2021
Copy link
Contributor

@s-kanaev s-kanaev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@v-klochkov
Copy link
Contributor Author

@Pennycook - do you think it makes sense merging this PR?


There are 2 specifications of the reduction feature and both are still actual:

* `sycl::ext::oneapi::reduction` is described in [this document](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Reduction/Reduction.md). This extension was created as part of a pathfinding/prototyping work before it was added to SYCL 2020 standard.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `sycl::ext::oneapi::reduction` is described in [this document](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Reduction/Reduction.md). This extension was created as part of a pathfinding/prototyping work before it was added to SYCL 2020 standard.
* `sycl::ext::oneapi::reduction` is described in [this document](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/Reduction/Reduction.md). This extension is deprecated, and was created as part of a pathfinding/prototyping work before it was added to SYCL 2020 standard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Fixed as suggested: 55c3d75


## Reduction inside 1 work-group - the main building block for reduction

The reduction algorithm for 1 basic block depends currently on combination/availability of 2 features:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The reduction algorithm for 1 basic block depends currently on combination/availability of 2 features:
The reduction algorithm for 1 work-group depends currently on combination/availability of 2 features:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misprint. Thank you. Fixed as suggested: 55c3d75

- `fast atomics` (i.e. operations available on the target device), such as fetch_add(), fetch_min(), etc.
- `fast reduce` operation for work-group, i.e. `reduce_over_group()`

So, if the reduction operation/type has both `fast atomics` and `fast reduce`, then the reduction on work-group with 1 reduction variable is implemented using does the following: the elements inside the work-group are reduced using `ext::oneapi::reduce` and the final result is atomically added to the final/global reduction variable.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
So, if the reduction operation/type has both `fast atomics` and `fast reduce`, then the reduction on work-group with 1 reduction variable is implemented using does the following: the elements inside the work-group are reduced using `ext::oneapi::reduce` and the final result is atomically added to the final/global reduction variable.
So, if the reduction operation/type has both `fast atomics` and `fast reduce`, then the reduction on work-group with 1 reduction variable does the following: the elements inside the work-group are reduced using `ext::oneapi::reduce` and the final result is atomically added to the final/global reduction variable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Fixed as suggested: 55c3d75

```

The most general case is when the `fast atomics` and `fast reduce` are not available.
It computes the partial sum using tree-reduction loop and stores the partial sum for the work-group to a global array of partial sums, which later must be also reduced (by using additional(s) kernel(s)). This general case algorithm also requires allocation of a local memory for the tree-reduction loop:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It computes the partial sum using tree-reduction loop and stores the partial sum for the work-group to a global array of partial sums, which later must be also reduced (by using additional(s) kernel(s)). This general case algorithm also requires allocation of a local memory for the tree-reduction loop:
It computes the partial sum using tree-reduction loop and stores the partial sum for the work-group to a global array of partial sums, which later must be also reduced (by using additional kernel(s)). This general case algorithm also requires allocation of a local memory for the tree-reduction loop:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Fixed as suggested: 55c3d75

The *additional kernel(s)* that reduce the partial sums look very similar to the *main* kernel, the only difference is that they do not call user's function.

---
TODO #1 (Performance): After the reduction accepting 'range' was implemented it became possible to simply re-use that implementation for the *additional kernel* runs. Instead of running the *additional kernel* several times (if the number of partial sums is still big), it is better to run the parallel_for accepting 'range' once.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should qualify here which devices you're talking about, unless these statements about performance apply to all backends and devices.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the general comment, i.e. for all backends/devices. If there is an efficient implementation of parallel_for(range, reduction), then why not use that to reduce let's say 1M partial sums, instead of inventing/adding N calls of additional kernels inside parallel_for(nd_range, reduction) implementation.

### 3) Support `parallel_for` accepting `range` and 2 or more reduction variables.
Currently `parallel_for()` accepting `range` may handle only 1 reduction variable. It does not support 2 or more.

The temporary work-around for that is to use some container multiple reduction variables, i.e. std::pair, std::struct or a custom struct/class containing 2 or more reduction variables, and also define a custom operator that would be passed to `reduction` constructor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is std::struct? I'm not familiar with it, and haven't been able to find any information online.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misprint. I was thinking about std::tuple and struct (that is written few words later), which resulted into misprint "std::struct".
Fixed in: 55c3d75

@Pennycook
Copy link
Contributor

@Pennycook - do you think it makes sense merging this PR?

I'm not opposed to things being merged in their current form, but doing another read through this morning I spotted a few typos and some points that could be clarified.

Copy link
Contributor

@steffenlarsen steffenlarsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I will approve once @Pennycook's comments have been addressed.

@v-klochkov v-klochkov merged commit 6f1bd41 into intel:sycl Dec 15, 2021
@v-klochkov v-klochkov deleted the reduction_impl_doc branch December 15, 2021 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants