# Group sort algorithm

Group sorting algorithms sort data within a kernel, without launching additional kernels.
They are described by the SYCL 2020 extension specification
[`sycl_ext_oneapi_group_sort`][group_sort_spec].

[group_sort_spec]: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_group_sort.asciidoc

Example usage:

```cpp
#include <sycl/sycl.hpp>

namespace oneapi_exp = sycl::ext::oneapi::experimental;

// Assumes T is the element type, q is a sycl::queue and buf is a sycl::buffer<T>.
sycl::range<1> local_range{256};
// Predefine radix_sorter to calculate the local memory size.
using RSorter = oneapi_exp::radix_sorter<T, oneapi_exp::sorting_order::descending>;
// Calculate the required local memory size.
size_t temp_memory_size =
    RSorter::memory_required(sycl::memory_scope::work_group, local_range);
q.submit([&](sycl::handler& h) {
  auto acc = sycl::accessor(buf, h);
  auto scratch = sycl::local_accessor<std::byte, 1>({temp_memory_size}, h);
  h.parallel_for(
      sycl::nd_range<1>{local_range, local_range},
      [=](sycl::nd_item<1> id) {
        acc[id.get_local_id()] =
            oneapi_exp::sort_over_group(
                id.get_group(),
                acc[id.get_local_id()],
                RSorter(sycl::span{scratch.get_pointer(), temp_memory_size}));
      });
});
...
```

## Design objectives

In DPC++ Headers/DPC++ RT we don't know which sorting algorithm is best for a
given architecture. Backends are better positioned to optimize the sorting
algorithm using low-level instructions.

Data types that should be supported by backends: arithmetic types
(https://en.cppreference.com/w/c/language/arithmetic_types) and `sycl::half`.

Comparators that should be supported by backends: `std::less`, `std::greater`,
and custom comparators.

## Design

Overall, for backend support we need the following:
- A fallback implementation of the sorting algorithms for user-defined types, comparators and/or sorters.

- Backend implementations for types, comparators and/or sorters
  that can be optimized using backend-specific instructions.

  **NOTE**: It was decided that `radix_sorter` will be implemented only in DPC++ Headers since
  it's difficult to support such an algorithm at the backends' level.

- A fallback implementation for cases where backends don't have more optimized implementations yet.

- A Level Zero extension for the `memory_required` functions.

The following should be implemented:

- [x] Sorter classes and their `operator()` including sorting algorithms
  - [x] Default sorter.
  - [x] Radix sorter.
- [x] `joint_sort` and `sort_over_group` functions.
- [x] Traits to distinguish interfaces with `Compare` and `Sorter` parameters.
- [x] Checks when radix sort is applicable (arithmetic types only).
- [x] The `radix_order` enum class.
- [x] `group_with_scratchpad` predefined group helper.
- [x] `SYCL_EXT_ONEAPI_GROUP_SORT` feature macro.
- [ ] `sort_over_group` with `span`-based parameters.
- [ ] Level Zero extension for `memory_required` functions.
  - [ ] Specification.
  - [ ] Implementation.
- [ ] Backend support for sorting algorithms.
  - [ ] Default sorter.
- [ ] Fallback library if the device doesn't implement the functions.

**Note**: A checked box means the corresponding feature is implemented.

The sections below describe each component in more detail.

### DPC++ Headers

DPC++ Headers contain the following:
- Required definitions of the `joint_sort` and `sort_over_group` functions, the `radix_order`
  enum class, and the `default_sorter` and `radix_sorter` classes with their corresponding
  `operator()`, as well as other classes and methods.

- Checks whether radix sort is applicable for the provided data types.

- Traits to distinguish interfaces with `Compare` and `Sorter` parameters.

- A fallback solution for user-defined types, comparators and/or sorters.

### Level Zero

To implement the `memory_required` methods for sorters we need to calculate
how much temporary memory is needed.
However, we don't have information about how much memory is needed by the backend compiler.
That's why we need a Level Zero function that calls a function in the backend and
provides the actual value to the SYCL code.

Required interfaces:
```cpp
// Returns whether the default work-group or sub-group sort is present in the builtins.
virtual bool DefaultGroupSortSupported(GroupSortMemoryScope::MemoryScope_t scope,
                                       GroupSortKeyType::KeyType_t keyType,
                                       bool isKeyValue,
                                       bool isJointSort) const;

// Returns the amount of memory required by the default joint work-group or sub-group sort
// devicelib builtin function, in bytes per work-group (or sub-group), >= 0,
// or -1 if the algorithm for the specified parameters is not implemented.
//
// totalItems -- number of elements to sort
// rangeSize  -- work-group or sub-group size respectively
//
// For key-only sort pass valueTypeSizeInBytes = 0.
virtual long DefaultGroupJointSortMemoryRequired(GroupSortMemoryScope::MemoryScope_t scope,
                                                 long totalItems,
                                                 long rangeSize,
                                                 long keyTypeSizeInBytes,
                                                 long valueTypeSizeInBytes) const;

// Returns the amount of memory required by the default private-memory work-group or sub-group sort
// devicelib builtin function, in bytes per work-group (or sub-group), >= 0,
// or -1 if the algorithm for the specified parameters is not implemented.
//
// itemsPerWorkItem -- number of elements in the private array to sort
// rangeSize        -- work-group or sub-group size respectively
//
// For key-only sort pass valueTypeSizeInBytes = 0.
virtual long DefaultGroupPrivateSortMemoryRequired(GroupSortMemoryScope::MemoryScope_t scope,
                                                   long itemsPerWorkItem,
                                                   long rangeSize,
                                                   long keyTypeSizeInBytes,
                                                   long valueTypeSizeInBytes) const;
```

### Fallback SPIR-V library

If backend compilers can generate optimized implementations based on low-level instructions,
we need a function that they can take and optimize.

If a backend has no implementation yet,
the implementation from the fallback library is called.

Interface for the library and backends:

```cpp
// for the default sorting algorithm
void __devicelib_default_work_group_joint_sort_ascending_<encoded_param_types>(T* first, uint n, byte* scratch);

void __devicelib_default_work_group_joint_sort_descending_<encoded_param_types>(T* first, uint n, byte* scratch);

// for fixed-size arrays
void __devicelib_default_work_group_private_sort_close_ascending_<encoded_param_types>(T* first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_close_descending_<encoded_param_types>(T* first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_spread_ascending_<encoded_param_types>(T* first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_spread_descending_<encoded_param_types>(T* first, uint n, byte* scratch);

// for sub-groups
T __devicelib_default_sub_group_private_sort_ascending_<encoded_scalar_param_type>(T value);

T __devicelib_default_sub_group_private_sort_descending_<encoded_scalar_param_type>(T value);

// for key-value sorting using the default algorithm
void __devicelib_default_work_group_joint_sort_ascending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

void __devicelib_default_work_group_joint_sort_descending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

// for key-value sorting using fixed-size arrays
void __devicelib_default_work_group_private_sort_close_ascending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_close_descending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_spread_ascending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);

void __devicelib_default_work_group_private_sort_spread_descending_<encoded_param_types>(T* keys_first, U* values_first, uint n, byte* scratch);
```

Notes:
- `T` and `U` are from the following list: `i8`, `i16`,
  `i32`, `i64`, `u8`, `u16`, `u32`, `u64`, `f16`, `f32`, `f64`.
- `encoded_param_types` is `T` prepended with `p1` for the global/private address
  space and `p3` for shared local memory.
- `first` is a pointer to the actual data for sorting.
- The type of `n` (the number of elements) is `u32`.
- `keys_first` points to the "keys" for key-value sorting.
  "Keys" are compared and moved during sorting.
- `scratch` is temporary storage (local or global) that can be used by backends.
  The type of `scratch` is always `byte*`.
- `values_first` points to the "values" for key-value sorting. "Values" are only
  moved to match the order of the "keys" during sorting.

Examples:
```cpp
void __devicelib_default_work_group_joint_sort_ascending_p1i32_u32_p3i8(int* first, uint n, byte* scratch);
void __devicelib_default_work_group_joint_sort_descending_p1u32_u32_p1i8(uint* first, uint n, byte* scratch);
void __devicelib_default_work_group_joint_sort_ascending_p3u32_p3u32_u32_p1i8(uint* first_keys, uint* first_values, uint n, byte* scratch);
void __devicelib_default_work_group_private_sort_close_ascending_p1u32_p1u32_u32_p1i8(uint* first_keys, uint* first_values, uint n, byte* scratch);
double __devicelib_default_sub_group_private_sort_ascending_f64(double value);
```

## Alternative Design

If it's proven that no backend-level improvements (e.g., special
instructions, hardware dispatch) are possible compared to high-level SYCL code, then implementations
of the sorting functions can be placed in DPC++ Headers
(no hardware backend and no Level Zero support would be needed in that case).