diff --git a/sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc b/sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc index 8f342878e155d..b050bb3a372e0 100755 --- a/sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc +++ b/sycl/doc/extensions/GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc @@ -35,7 +35,8 @@ SYCL specification refer to that revision. == Introduction -This extension introduces sorting functions to the group algorithms library and Sorter objects. +This extension introduces sorting functions to the group algorithms library, along with +associated Sorter objects and Group Helper objects. == Feature test macro @@ -47,18 +48,19 @@ Applications can test for the existence of this macro to determine if the implementation supports this feature, or applications can test the macro's value to determine which of the extension's APIs the implementation supports. +Table 1. Values of the `SYCL_EXT_ONEAPI_GROUP_SORT` macro. [%header,cols="1,5"] |=== |Value |Description -|1 |Initial extension version. Base features are supported. +|1 |Initial extension version. Base features are supported. |=== -==== Sorter +== Sorters Sorter is a special type that encapsulates a sorting algorithm. Sorter may contain parameters that help to get better performance. Data for sorting are provided to the `operator()` that should contain an implementation of a sorting algorithm. -Semantics of `operator()` is following: +General semantics of `operator()` is following: [source,c++] ---- @@ -69,9 +71,13 @@ template T operator()(Group g, T val); ---- -At least one overload for `operator()` is required. +NOTE: At least one overload for `operator()` is required. +If only `void operator()(Group g, Ptr first, Ptr last);` is defined then a Sorter can be passed +to `joint_sort` function only. If it's passed to `sort_over_group`, it leads to a compilation +error. 
If only `T operator()(Group g, T val);` is defined then a Sorter can be passed to +`sort_over_group` function only. If it's passed to `joint_sort`, it leads to a compilation error. -Table. `operator()` for Sorters. +Table 2. `operator()` for Sorters. |=== |`operator()`|Description @@ -91,7 +97,7 @@ Example of custom Sorter: [source,c++] ---- template -class bubble_sort{ +class bubble_sorter{ public: Compare comp; @@ -108,16 +114,21 @@ public: }; ---- +This sorter can be invoked by `joint_sort`, but won't work with `sort_over_group` +due to the absence of corresponding `operator()` + ==== Predefined Sorters -`radix_order` is a `enum` that defines the sorting order when `radix_sorter` is used. +===== Sorting Order + +`sorting_order` is an `enum` that defines a sorting order when `radix_sorter` is used. Only ascending and descending orders are applicable. [source,c++] ---- -namespace sycl::ext::oneapi { +namespace sycl::ext::oneapi::experimental { - enum class radix_order { + enum class sorting_order { ascending, descending }; @@ -129,37 +140,55 @@ SYCL provides the following predefined classes: [source,c++] ---- -namespace sycl::ext::oneapi { +namespace sycl::ext::oneapi::experimental { template> class default_sorter { public: - default_sorter(Compare comp = Compare()); + template + default_sorter(sycl::span scratch, Compare comp = Compare()); template void operator()(Group g, Ptr first, Ptr last); template T operator()(Group g, T val); + + template + static constexpr std::size_t + memory_required(sycl::memory_scope scope, std::size_t range_size); + + template + static constexpr std::size_t + memory_required(sycl::memory_scope scope, sycl::range local_range); }; - template + template class radix_sorter { public: - radix_sorter(const std::bitset mask = - std::bitset (std::numeric_limits::max())); + template + radix_sorter(sycl::span scratch, + const std::bitset mask = + std::bitset (std::numeric_limits::max())); template void operator()(Group g, Ptr first, Ptr last); 
template T operator()(Group g, T val); + + static constexpr std::size_t + memory_required(sycl::memory_scope scope, std::size_t range_size); + + template + static constexpr std::size_t + memory_required(sycl::memory_scope scope, sycl::range local_range); }; } ---- -Table. Description of predefined Sorters. +Table 3. Description of predefined Sorters. |=== |Sorter|Description @@ -167,25 +196,34 @@ Table. Description of predefined Sorters. default_sorter` |Use a default sorting method based on an implementation-defined heuristic using `Compare` as the binary comparison function object. +The algorithm requires an additional memory that must be allocated on callers side. +Size of required memory (bytes) is defined by calling `memory_required`. -|`template +|`template radix_sorter` |Use radix sort as a sorting method. `Order` specify the sorting order. Only arithmetic types as `T` can be passed to `radix_sorter`. `BitsPerPass` is a number of bits that values are split by. For example, if a sequence of `int32_t` is sorted using `BitsPerPass == 4` then one pass of the radix sort algorithm considers only 4 bits. The number of passes is `32/4=8`. +The algorithm requires an additional memory that must be allocated on callers side. +Size of required memory (bytes) is defined by calling `memory_required`. |=== -Table. Constructors of the `default_sorter` class. +Table 4. Constructors of the `default_sorter` class. |=== |Constructor|Description -|`default_sorter(Compare comp = Compare())` +|`template +default_sorter(sycl::span scratch, Compare comp = Compare())` |Creates the `default_sorter` object using `comp`. +Additional memory for the algorithm is provided using `scratch`. +If `scratch.size()` is less than the value returned by +`memory_required`, behavior of the corresponding sorting algorithm is undefined. + |=== -Table. Member functions of the `default_sorter` class. +Table 5. Member functions of the `default_sorter` class. 
|=== |Member function|Description @@ -193,26 +231,50 @@ Table. Member functions of the `default_sorter` class. void operator()(Group g, Ptr first, Ptr last)` |Implements a default sorting algorithm to be called by the `joint_sort` algorithm. -_Complexity_: Let `N` be `last - first`. `O(N*log_2(N))` comparisons. +_Complexity_: Let `N` be `last - first`. `O(N*log(N)*log(N))` comparisons. |`template T operator()(Group g, T val)` |Implements a default sorting algorithm to be called by the `sort_over_group` algorithm. -_Complexity_: Let `N` be the work group size. `O(N*log_2(N))` comparisons. +_Complexity_: Let `N` be the `Group` size. `O(N*log(N)*log(N))` comparisons. + +|`template +static std::size_t memory_required(sycl::memory_scope scope, std::size_t range_size)` +|Returns the size of temporary memory (in bytes) that is required by +the default sorting algorithm when the sorter is called by `joint_sort`. +`range_size` is the size of the range to sort, +e.g. `last - first` from the `operator()` arguments. +The result depends on the `scope` parameter: +use `sycl::memory_scope::work_group` to get the memory size required for each work-group; +use `sycl::memory_scope::sub_group` to get the memory size required for each sub-group. +If other `scope` values are passed, behavior is unspecified. + +|`static std::size_t memory_required(sycl::memory_scope scope, sycl::range local_range)` +|Returns the size of temporary memory (in bytes) that is required by the default +sorting algorithm when the sorter is called by `sort_over_group`. +If `scope = sycl::memory_scope::work_group`, +`local_range` is the local range of the `sycl::nd_range` that was used to run the kernel; +if `scope = sycl::memory_scope::sub_group`, `local_range` is the sub-group size. +If other `scope` values are passed, behavior is unspecified. |=== -Table. Constructors of the `radix_sorter` class. +Table 6. Constructors of the `radix_sorter` class.
|=== |Constructor|Description -|`radix_sorter(const std::bitset mask = std::bitset -(std::numeric_limits::max()));` +|`template +radix_sorter(sycl::span scratch, const std::bitset mask = std::bitset +(std::numeric_limits::max()))` |Creates the `radix_sorter` object to sort values considering only bits that correspond to 1 in `mask`. +Additional memory for the algorithm is provided using `scratch`. +If `scratch.size()` is less than the value returned by `memory_required`, +behavior of the corresponding sorting algorithm is undefined. + |=== -Table. Member functions of the `radix_sorter` class. +Table 7. Member functions of the `radix_sorter` class. |=== |Member function|Description @@ -223,9 +285,117 @@ void operator()(Group g, Ptr first, Ptr last)` |`template T operator()(Group g, T val)` |Implements the radix sort algorithm to be called by the `sort_over_group` algorithm. + +|`static std::size_t +memory_required(sycl::memory_scope scope, std::size_t range_size)` +|Returns the size of temporary memory (in bytes) that is required by the radix sort algorithm +when called by `joint_sort`. `range_size` is the size of the range to sort, +e.g. `last - first` from the `operator()` arguments. +The result depends on the `scope` parameter: +use `sycl::memory_scope::work_group` to get the memory size required for each work-group; +use `sycl::memory_scope::sub_group` to get the memory size required for each sub-group. +If other `scope` values are passed, behavior is unspecified. + +|`template +static std::size_t +memory_required(sycl::memory_scope scope, sycl::range local_range)` +|Returns the size of temporary memory (in bytes) that is required by the radix sort algorithm +when called by `sort_over_group`. +If `scope = sycl::memory_scope::work_group`, `local_range` is the local range of the +`sycl::nd_range` that was used to run the kernel; +if `scope = sycl::memory_scope::sub_group`, `local_range` is the sub-group size. +If other `scope` values are passed, behavior is unspecified.
+|=== + +NOTE: Predefined sorters are in the `experimental` namespace: interfaces might be changed later. + +=== Group Helper + +To pass additional memory to algorithms that don't have the Sorter +parameter SYCL introduces special type: group helper. +It encapsulates a group and a memory. + +Group helper must have following methods: + +[source,c++] +---- +/* unspecified */ get_group() const; + +/* unspecified */ get_memory() const; +---- + +Table 8. Member functions of group helpers. +|=== +|Member function|Description + +|`/* unspecified */ get_group() const` +|Returns the group that is handled by the group helper object. +Assuming `Group` is a type of method's result `sycl::is_group_v>` must be true. + +|`/* unspecified */ get_memory() const` +|Returns the memory object that represents a memory handled by the group helper object. +A type of the returned value must be the same as the type of the `default_sorter` 's constructor +that passes an additional memory to `default_sorter`. +|=== + +==== Predefined Group Helpers +SYCL introduces the following predefined group helper: + +[source,c++] +---- +namespace sycl::ext::oneapi::experimental { + + // Exposition only: all template arguments except Group are unspecified + template + class group_with_scratchpad + { + public: + group_with_scratchpad(Group group, sycl::span scratch); + Group get_group() const; + + sycl::span + get_memory() const; + }; + + // Deduction guides + template + group_with_scratchpad(Group, sycl::span) + -> group_with_scratchpad; + +} +---- + +NOTE: `group_with_scratchpad` is in the `experimental` namespace: +interfaces might be changed later. + +Table 9. Constructors of the `group_with_scratchpad` class. +|=== +|Constructor|Description + +|`group_with_scratchpad(Group group, sycl::span scratch)` +|Creates the `group_with_scratchpad` object using `group` and `scratch`. +`sycl::is_group_v>` must be true. 
+`scratch.size()` must not be less than value returned by the `memory_required` method +of `default_sorter`. Otherwise, +behavior of sorting algorithm, which is called with the constructed object, is undefined. +The `scratch` value must be the same for all work-items in `group`. +|=== + +Table 10. Member functions of the `group_with_scratchpad` class. +|=== +|Member function|Description + +|`Group get_group() const` +|Returns the `Group` class object that is handled by the `group_with_scratchpad` object. + +|`sycl::span +get_memory() const` +|Returns `sycl::span` that represents an additional memory +that is handled by the `group_with_scratchpad` object. + |=== -==== Sort +=== Sort The sort function from the {cpp} standard sorts elements with respect to the binary comparison function object. @@ -241,77 +411,216 @@ position `i` in the ordered range. [source,c++] ---- namespace sycl::ext::oneapi { - template - void joint_sort(Group g, Ptr first, Ptr last); // (1) - template - void joint_sort(Group g, Ptr first, Ptr last, Compare comp); // (2) + template + void joint_sort(GroupHelper exec, Ptr first, Ptr last); // (1) + + template + void joint_sort(GroupHelper exec, Ptr first, Ptr last, Compare comp); // (2) template void joint_sort(Group g, Ptr first, Ptr last, Sorter sorter); // (3) - template - T sort_over_group(Group g, T val); // (4) + template + T sort_over_group(GroupHelper exec, T val); // (4) - template - T sort_over_group(Group g, T val, Compare comp); // (5) + template + T sort_over_group(GroupHelper exec, T val, Compare comp); // (5) template T sort_over_group(Group g, T val, Sorter sorter); // (6) } ---- -_Constraints_: All functions are available only if `sycl::is_group_v>` -is true and `Sorter` is a SYCL Sorter. +1._Preconditions_: `first`, `last` must be the same for all work-items in the group. -_Preconditions_: `first`, `last` must be the same for all work-items in the group. 
+_Effects_: Sort the elements in the range `[first, last)` +using the `exec` group helper object. Elements are compared by `operator<`. -1._Effects_: Sort the elements in the range `[first, last)`. -Elements are compared by `operator<`. +_Complexity_: Let `N` be `last - first`. `O(N*log(N)*log(N))` comparisons. -_Complexity_: Let `N` be `last - first`. `O(N*log_2(N))` comparisons. +_Constraints_: Only available if `GroupHelper` was created with a work group or sub_group and +some associated scratch space. -2._Mandates_: `comp` must satisfy the requirements of `Compare` from +2._Preconditions_: `first`, `last` must be the same for all work-items in the group. + +_Mandates_: `comp` must satisfy the requirements of `Compare` from the {cpp} standard. _Effects_: Sort the elements in the range `[first, last)` with respect to the -binary comparison function object `comp`. +binary comparison function object `comp` using the `exec` group helper object. + +_Complexity_: Let `N` be `last - first`. `O(N*log(N)*log(N))` comparisons. -_Complexity_: Let `N` be `last - first`. `O(N*log_2(N))` comparisons. +_Constraints_: Only available if `GroupHelper` was created with a work group or sub_group and +some associated scratch space. -3._Effects_: Equivalent to: `sorter(g, first, last)`. +3._Preconditions_: `first`, `last` must be the same for all work-items in the group. + +_Effects_: Equivalent to: `sorter(g, first, last)`. + +_Constraints_: All functions are available only if `Sorter` is a SYCL Sorter and +it provides `operator()(Group, Ptr, Ptr)` overload. 4._Returns_: The value returned on work-item `i` is the value in position `i` -of the ordered range resulting from sorting `val` from all work-items in the -`g` group. Elements are compared by `operator<`. +of the ordered range resulting from sorting `val` from all work-items in the group. +Elements are compared by `operator<` +using the `exec` group helper object. 
For multi-dimensional groups, the order of work-items in the group is determined by their linear id. -_Complexity_: Let `N` be the work group size. `O(N*log_2(N))` comparisons. +_Complexity_: Let `N` be the work group or sub-group size. `O(N*log(N)*log(N))` comparisons. -5._Mandates_: `comp` must satisfy the requirements of `Compare` from -the {cpp} standard. +_Constraints_: Only available if `GroupHelper` was created with a work group or sub_group and +some associated scratch space. + +5._Mandates_: `comp` must satisfy the requirements of `Compare` from the {cpp} standard. _Returns_: The value returned on work-item `i` is the value in position `i` of the ordered range resulting from sorting `val` from all work-items in the -`g` group with respect to the binary comparison function object `comp`. +group with respect to the binary comparison function object `comp` +using the `exec` group helper object. For multi-dimensional groups, the order of work-items in the group is determined by their linear id. -_Complexity_: Let `N` be the work group size. `O(N*log_2(N))` comparisons. +_Complexity_: Let `N` be the work group or sub-group size. `O(N*log(N)*log(N))` comparisons. + +_Constraints_: Only available if `GroupHelper` was created with a work group or sub_group and +some associated scratch space. 6._Effects_: Equivalent to: `return sorter(g, val)`. -== Issues +_Constraints_: Only available if `Sorter` is a SYCL Sorter and +it provides an `operator()(Group, T)` overload. + +== Examples + +1.Using `joint_sort` without Sorters. + +[source,c++] +---- +...
+namespace my_sycl = sycl::ext::oneapi; +namespace my_sycl_exp = sycl::ext::oneapi::experimental; +// calculate required local memory size +size_t temp_memory_size = + my_sycl_exp::default_sorter<>::memory_required(sycl::memory_scope::work_group, n); + +q.submit([&](sycl::handler& h) { + auto acc = sycl::accessor(buf, h); + auto scratch = sycl::local_accessor( {temp_memory_size}, h ); + + h.parallel_for( + sycl::nd_range<1>{ /*global_size = */ {256}, /*local_size = */ {256} }, + [=](sycl::nd_item<1> id) { + auto ptr = acc.get_pointer() + id.get_group(0) * n; + + my_sycl::joint_sort( + // create group helper using deduction guides + my_sycl_exp::group_with_scratchpad( + id.get_group(), + sycl::span{scratch.get_pointer(), temp_memory_size} + ), + ptr, + ptr + n + ); + }); + }); +... +---- + +2.Using `sort_over_group` and `radix_sorter` + +[source,c++] +---- +... +namespace my_sycl = sycl::ext::oneapi; +namespace my_sycl_exp = sycl::ext::oneapi::experimental; + +sycl::range<1> local_range{256}; +// predefine radix_sorter to calculate local memory size +using RSorter = my_sycl_exp::radix_sorter; +// calculate required local memory size +size_t temp_memory_size = + RSorter::memory_required(sycl::memory_scope::work_group, local_range); + +q.submit([&](sycl::handler& h) { + auto acc = sycl::accessor(buf, h); + auto scratch = sycl::local_accessor( {temp_memory_size}, h); + + h.parallel_for( + sycl::nd_range<1>{ local_range, local_range }, + [=](sycl::nd_item<1> id) { + + acc[id.get_local_id()] = + my_sycl::sort_over_group( + id.get_group(), + acc[id.get_local_id()], + RSorter(sycl::span{scratch.get_pointer(), temp_memory_size}) + ); + }); + }); +... +---- + +3.Using `joint_sort` for key-value sorting +(keys are compared, but keys and values are reordered both). + +NOTE: `oneapi::dpl::zip_iterator` is used here. +See https://spec.oneapi.com/versions/latest/elements/oneDPL/source/index.html[oneDPL Spec] +for details. + +[source,c++] +---- +... 
+namespace my_sycl = sycl::ext::oneapi; +namespace my_sycl_exp = sycl::ext::oneapi::experimental; +using TupleType = typename std::iterator_traits>::value_type; +// calculate required local memory size +size_t temp_memory_size = + my_sycl_exp::default_sorter<>::memory_required(sycl::memory_scope::work_group, n); + +q.submit([&](sycl::handler& h) { + auto keys_acc = sycl::accessor(keys_buf, h); + auto vals_acc = sycl::accessor(vals_buf, h); + auto scratch = sycl::local_accessor( {temp_memory_size}, h); + + h.parallel_for( + sycl::nd_range<1>{ /*global_size = */ {1024}, /*local_size = */ {256} }, + [=](sycl::nd_item<1> id) { + size_t group_id = id.get_group(0); + auto keys_ptr = keys_acc.get_pointer() + group_id * n; + auto vals_ptr = vals_acc.get_pointer() + group_id * n; + auto first = oneapi::dpl::make_zip_iterator(keys_ptr, vals_ptr); + + my_sycl::joint_sort( + // create group helper using deduction guides + my_sycl_exp::group_with_scratchpad( + id.get_group(), + sycl::span{scratch.get_pointer(), temp_memory_size} + ), + first, + first + n, + [](auto x, auto y){ return std::get<0>(x) < std::get<0>(y); } + ); + }); + }); +... +---- + +== Issues for later investigations . Sort function can have interfaces with static arrays in private memory as well. The concern is that it can require changes for other group algorithms as well since sort basing on private memory is not very useful if other algorithms in the chain use local -memory only. +memory only. We need to make sure that different layouts are allowed for values in static arrays +across work-items, e.g. "row major" or "column major" storage formats. . It can be a separate proposal for key-value sorting basing on Projections. It needs to be investigated what is the response for that. . Sorter traits can be useful if there are Finder, Reducer or other objects will be added to the Spec to be used with other Group algorithms, e.g. find, reduce. +. 
Predefined sorters can be revisited to find better interfaces for the `memory_required` overloads. +. `group_with_scratchpad` can be replaced with a concept that requires two methods. == Revision History @@ -320,5 +629,6 @@ will be added to the Spec to be used with other Group algorithms, e.g. find, red [options="header"] |======================================== |Rev|Date|Author|Changes -|1|{docdate}|Andrey Fedorov|Initial public working draft +|1|2021-04-28|Andrey Fedorov|Initial public working draft +|2|{docdate}|Andrey Fedorov|Changes related to providing additional memory |========================================