Skip to content

Commit 837ba01

Browse files
committed
replace intel#3754
Signed-off-by: Fedorov, Andrey <[email protected]>
1 parent f5bc09a commit 837ba01

File tree

1 file changed

+220
-0
lines changed

1 file changed

+220
-0
lines changed

sycl/doc/design/GroupSort.md

Lines changed: 220 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,220 @@
1+
# Group sort algorithm
2+
3+
Group sorting algorithms are needed to sort data without calling additional kernels
4+
They are described by SYCL 2020 Extension specification:
5+
[direct link to the specification's extension][group_sort_spec].
6+
7+
[group_sort_spec]: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_group_sort.asciidoc
8+
9+
Example usage:
10+
11+
```cpp
12+
#include <sycl/sycl.hpp>
13+
14+
namespace oneapi_exp = sycl::ext::oneapi::experimental;
15+
sycl::range<1> local_range{256};
16+
// predefine radix_sorter to calculate local memory size
17+
using RSorter = oneapi_exp::radix_sorter<T, oneapi_exp::sorting_order::descending>;
18+
// calculate required local memory size
19+
size_t temp_memory_size =
20+
RSorter::memory_required(sycl::memory_scope::work_group, local_range);
21+
q.submit([&](sycl::handler& h) {
22+
auto acc = sycl::accessor(buf, h);
23+
auto scratch = sycl::local_accessor<std::byte, 1>( {temp_memory_size}, h);
24+
h.parallel_for(
25+
sycl::nd_range<1>{ local_range, local_range },
26+
[=](sycl::nd_item<1> id) {
27+
acc[id.get_local_id()] =
28+
oneapi_exp::sort_over_group(
29+
id.get_group(),
30+
acc[id.get_local_id()],
31+
RSorter(sycl::span{scratch.get_pointer(), temp_memory_size})
32+
);
33+
});
34+
});
35+
...
36+
```
37+
38+
## Design objectives
39+
40+
In DPC++ Headers/DPC++ RT we don't know which sorting algorithm is better for
41+
different architectures. Backends have more capability to optimize the sorting algorithm
42+
using low-level instructions.
43+
44+
Data types that should be supported by backends: arithmetic types
45+
(https://en.cppreference.com/w/c/language/arithmetic_types), `sycl::half`.
46+
47+
Comparators that should be supported by backends: `std::less`, `std::greater`,
48+
custom comparators
49+
50+
## Design
51+
52+
Overall, for backend support we need to have the following:
53+
- Fallback implementation of sorting algorithms for user's types, comparators and/or sorters.
54+
55+
- Backend implementation for types, comparators and/or sorters
56+
that can be optimized using backend specific instructions.
57+
58+
**NOTE**: It was decided that `radix_sorter` will be implemented only in DPC++ Headers since
59+
it's difficult to support such algorithm at backends' level.
60+
61+
- Fallback implementation in case if backends don't have more optimized implementations yet.
62+
63+
- Level Zero extension for `memory_required` functions.
64+
65+
The following should be implemented:
66+
67+
- [x] Sorter classes and their `operator()` including sorting algorithms
68+
- [x] Default sorter.
69+
- [x] Radix sorter.
70+
- [x] `joint_sort` and `sort_over_group` functions.
71+
- [x] Traits to distinguish interfaces with `Compare` and `Sorter` parameters.
72+
- [x] Checks when radix sort is applicable (arithmetic types only).
73+
- [x] The `radix_order` enum class.
74+
- [x] `group_with_scratchpad` predefined group helper.
75+
- [x] `SYCL_EXT_ONEAPI_GROUP_SORT` feature macro.
76+
- [ ] `sort_over_group` with `span`-based parameters.
77+
- [ ] Level Zero extension for `memory_required` functions
78+
- [ ] Specification.
79+
- [ ] Implementation.
80+
- [ ] Backend support for sorting algorithms.
81+
- [ ] Default sorter
82+
- [ ] Fallback library if device doesn't implement functions.
83+
84+
**Note**: The "tick" means that corresponding feature is implemented.
85+
86+
Sections below describe each component in more details.
87+
88+
### DPC++ Headers
89+
90+
DPC++ Headers contain the following:
91+
- required definitions of `joint_sort`, `sort_over_group` functions, `radix_order` enum class,
92+
`default_sorter`, `radix_sorter` classes with corresponding `operator()`
93+
as well as other classes and methods.
94+
95+
- Checks if radix sort is applicable for provided data types.
96+
97+
- Traits to distinguish interfaces with `Compare` and `Sorter` parameters.
98+
99+
- Fallback solution for user's types, user's comparators and/or user's sorters.
100+
101+
### Level Zero
102+
103+
To implement `memory_required` methods for sorters we need to calculate
104+
how much temporary memory is needed.
105+
However, we don't have an information how much memory is needed by backend compiler.
106+
That's why we need a Level Zero function that calls a function from the backend and
107+
provide actual value to the SYCL code.
108+
109+
Required interfaces:
110+
```cpp
111+
// Returns whether default work-group or sub-group sort is present in builtins
112+
virtual bool DefaultGroupSortSupported(GroupSortMemoryScope::MemoryScope_t scope,
113+
GroupSortKeyType::KeyType_t keyType,
114+
bool isKeyValue,
115+
bool isJointSort) const;
116+
117+
// Returns required amount of memory for default joint work-group or sub-group sort
118+
// devicelib builtin function in bytes per workgroup (or sub-group), >= 0
119+
// or -1 if the algorithm for the specified parameters is not implemented
120+
//
121+
// totalItems -- number of elements to sort
122+
// rangeSize -- work-group or sub-group size respectively
123+
//
124+
// For key-only sort pass valueTypeSizeInBytes = 0
125+
virtual long DefaultGroupJointSortMemoryRequired(GroupSortMemoryScope::MemoryScope_t scope,
126+
long totalItems,
127+
long rangeSize,
128+
long keyTypeSizeInBytes,
129+
long valueTypeSizeInBytes) const;
130+
131+
// Returns required amount of memory for default private memory work-group or sub-group sort
132+
// devicelib builtin function in bytes per workgroup (or sub-group), >= 0
133+
// or -1 if the algorithm for the specified parameters is not implemented
134+
//
135+
// itemsPerWorkItem -- number of elements in private array to sort
136+
// rangeSize -- work-group or sub-group size respectively
137+
//
138+
// For key-only sort pass valueTypeSizeInBytes = 0
139+
virtual long DefaultGroupPrivateSortMemoryRequired(GroupSortMemoryScope::MemoryScope_t scope,
140+
long itemsPerWorkItem,
141+
long rangeSize,
142+
long keyTypeSizeInBytes,
143+
long valueTypeSizeInBytes) const;
144+
```
145+
146+
### Fallback SPIR-V library
147+
148+
If backend compilers can generate optimized implementations based on low-level instructions,
149+
we need a function that they can take and optimize.
150+
151+
If there are no implementations in a backend yet,
152+
implementations from the fallback library will be called.
153+
154+
Interface for the library and backends:
155+
156+
```cpp
157+
// for default sorting algorithm
158+
void __devicelib_default_work_group_joint_sort_ascending_<encoded_param_types>(T* first, uint n, byte* scratch);
159+
160+
void __devicelib_default_work_group_joint_sort_descending_<encoded_param_types>(T* first, uint n, byte* scratch);
161+
162+
// for fixed-size arrays
163+
void __devicelib_default_work_group_private_sort_close_ascending_<encoded_param_types>(T* first, uint n, byte* scratch);
164+
165+
void __devicelib_default_work_group_private_sort_close_descending_<encoded_param_types>(T* first, uint n, byte* scratch);
166+
167+
void __devicelib_default_work_group_private_sort_spread_ascending_<encoded_param_types>(T* first, uint n, byte* scratch);
168+
169+
void __devicelib_default_work_group_private_sort_spread_descending_<encoded_param_types>(T* first, uint n, byte* scratch);
170+
171+
// for sub-groups
172+
T __devicelib_default_sub_group_private_sort_ascending_<encoded_scalar_param_type>(T value);
173+
174+
T __devicelib_default_sub_group_private_sort_descending_<encoded_scalar_param_type>(T value);
175+
176+
// for key value sorting using the default algorithm
177+
void __devicelib_default_work_group_joint_sort_ascending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);
178+
179+
void __devicelib_default_work_group_joint_sort_descending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);
180+
181+
// for key value sorting using fixed-size arrays
182+
void __devicelib_default_work_group_private_sort_close_ascending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);
183+
184+
void __devicelib_default_work_group_private_sort_close_descending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);
185+
186+
void __devicelib_default_work_group_private_sort_spread_ascending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);
187+
188+
void __devicelib_default_work_group_private_sort_spread_descending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);
189+
190+
```
191+
192+
Notes:
193+
- `T`, `U` are from the following list `i8`, `i16`,
194+
`i32`, `i64`, `u8`, `u16`, `u32`, `u64`, `f16`, `f32`, `f64`.
195+
- `encoded_param_types` is `T` prepended with `p1` for global/private address
196+
space and `p3` for shared local memory.
197+
- `first` is a pointer to the actual data for sorting.
198+
- The type of `n` (number of elements) is u32.
199+
- `keys_first` points to "keys" for key-value sorting.
200+
"Keys" are comparing and moving during the sorting.
201+
- `scratch` is a temporary storage (local or global) that can be used by backends.
202+
The type of `scratch` is always `byte*`.
203+
- `values_first` points to "values" for key-value sorting. "Keys" are only moving
204+
corresponding the "keys" order during the sorting.
205+
206+
Examples:
207+
```cpp
208+
void __devicelib_default_work_group_joint_sort_ascending_p1i32_u32_p3i8(int* first, uint n, byte* scratch);
209+
void __devicelib_default_work_group_joint_sort_descending_p1u32_u32_p1i8(uint* first, uint n, byte* scratch);
210+
void __devicelib_default_work_group_joint_sort_ascending_p3u32_p3u32_u32_p1i8(uint* first_keys, uint* first_values, uint n, byte* scratch);
211+
void __devicelib_default_work_group_private_sort_close_ascending_p1u32_p1u32_u32_p1i8(uint* first_keys, uint* first_values, uint n, byte* scratch);
212+
double __devicelib_default_sub_group_private_sort_ascending_f64(double value);
213+
```
214+
215+
## Alternative Design
216+
217+
If it's proved that no specific improvements can be done at backends' level (e.g. special
218+
instructions, hardware dispatch) comparing to high-level SYCL code then implementations
219+
of sorting functions can be placed in DPC++ Headers
220+
(no hardware backends, no Level Zero support will be needed in such cases).

0 commit comments

Comments
 (0)