Anti-pattern in dpex.DEFAULT_LOCAL_SIZE #766

@fcharras

Description

As I understand it, choosing a local size for running a kernel must follow a few rules to ensure that the execution of the kernel maps well onto the underlying hardware:

  • preferably, it should be a multiple of the size of the pools of threads that execute in lock step at the hardware level (what would be called the warp size on NVIDIA GPUs, or the wavefront size on AMD GPUs)

  • and it should be at least equal to this value: if it is smaller, the remaining threads of a warp stay idle (causing underutilization and hurting performance). More generally, part of the device remains idle whenever the group size is not a multiple of the warp size.
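To make the second point concrete, here is a rough back-of-the-envelope model of lane utilization (a sketch for illustration, not dpex code; the warp size of 64 matches the "preferred work group size multiple" clinfo reports on my device):

```python
import math


def warp_utilization(local_size, warp_size=64):
    # A work-group of local_size threads occupies ceil(local_size / warp_size)
    # warps; lanes past local_size in the last warp sit idle.
    warps = math.ceil(local_size / warp_size)
    return local_size / (warps * warp_size)


print(warp_utilization(512))  # multiple of 64: every lane busy
print(warp_utilization(400))  # 400 threads spread over 7 warps of 64
print(warp_utilization(1))    # a single busy lane per group
```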

clinfo, among other information, displays the values the group size should be a multiple of:

```
> clinfo
...
Device Name                                     Intel(R) UHD Graphics [0x9a60]
...
Device Type                                     GPU
...
Max work group size                             512
Preferred work group size multiple              64
...
```

On the OpenCL/Python side, those values are also exposed by pyopencl.

dpex.DEFAULT_LOCAL_SIZE seems to enforce different rules:

  • it tries to set a value that is actually reasonable (it is 512 on my computer; is it hardcoded, or does it change with the hardware?)
  • in any case, the local size must divide the global size. The global size is user-provided and can be any value; usually it is related to the size of the input (e.g. the number of rows in an array). If the previous dpex.DEFAULT_LOCAL_SIZE does not divide the global size, it falls back to the largest divisor of global_size that is smaller than the previous default value.

Here are a few examples:

```python
import numba_dpex as dpex
import dpctl
import numpy as np


def inspect_default_local_size(global_size):
    @dpex.kernel
    def kernel(array):
        idx = dpex.get_global_id(0)
        size = dpex.get_local_size(0)
        array[idx] = size

    array = dpctl.tensor.empty(sh=global_size, dtype=np.int32)
    kernel[global_size, dpex.DEFAULT_LOCAL_SIZE](array)
    return dpctl.tensor.asnumpy(array)[0]


print(inspect_default_local_size(8192))   # prints 512, OK
print(inspect_default_local_size(10000))  # prints 400 (suboptimal but not too bad)
print(inspect_default_local_size(9973))   # prints 1 because 9973 is prime, looks very suspect
```
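The prime case is not specific to 9973: any divisor-based fallback degenerates for primes. A minimal model of the rule as I understand it (a sketch, not the actual dpex implementation — note it would give 500 rather than the observed 400 for 10000, so the real heuristic must differ somewhat) shows why:

```python
def largest_divisor_at_most(n, limit):
    # Largest d <= limit with n % d == 0. A prime n > limit has no
    # divisors other than 1 and n itself, so this always returns 1.
    for d in range(min(n, limit), 0, -1):
        if n % d == 0:
            return d


print(largest_divisor_at_most(8192, 512))  # 512
print(largest_divisor_at_most(9973, 512))  # 1, since 9973 is prime
```

Whatever extra heuristic dpex applies on top, the degenerate prime case is the same.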

If dpex.DEFAULT_LOCAL_SIZE ends up close enough to a multiple of the recommended value, the impact on performance should not be significant, and the cost may even be offset because it saves implementing a boundary check in the kernel.

But when it is not (e.g. when global_size is a prime number, which forces the default local size down to 1 (!)), the performance drop can be massive: only one thread per warp would be doing useful work in this case.

I think the user should be responsible for choosing the global and local work-group sizes, and for adapting the behavior of the kernel at the boundaries if necessary. It seems like good practice to work with a fixed local size and adapt the kernel, rather than ignoring boundaries and adapting the local size, e.g.:

```python
import math

import numba_dpex as dpex
import dpctl
import numpy as np


@dpex.kernel
def kernel(array):
    idx = dpex.get_global_id(0)
    n = array.shape[0]
    if idx >= n:
        # out-of-bounds work items in the padded tail do nothing
        return
    array[idx] = idx


array_size = 9973
array = dpctl.tensor.empty(sh=array_size, dtype=np.int32)
local_size = 512
# round the global size up to the next multiple of local_size
global_size = math.ceil(array_size / local_size) * local_size
kernel[global_size, local_size](array)
```
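The round-up on the last lines is plain ceil-division; wrapped in a helper (hypothetical, not a numba_dpex API), the intent is explicit:

```python
import math


def padded_global_size(n_items, local_size):
    # Smallest multiple of local_size covering n_items; the kernel's
    # boundary check masks out the padded tail. (Hypothetical helper,
    # not part of numba_dpex.)
    return math.ceil(n_items / local_size) * local_size


print(padded_global_size(9973, 512))  # 10240, i.e. 20 groups of 512
```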

And exposing an automatic setting for the local size is counterproductive, because it suggests the opposite practice to the user.

If anything, numba_dpex could expose the maximum possible local size and the value it is recommended to be a multiple of.

Labels

user: User submitted issue