Anti-pattern in dpex.DEFAULT_LOCAL_SIZE #766

@fcharras

Description

As I understand it, choosing a local size for running a kernel must follow a few rules to ensure that the execution of the kernel maps well onto the underlying hardware:

  • preferably, it should be a multiple of the size of the pools of threads that execute in lock step at the hardware level (what would be called the warp size on NVIDIA GPUs, or the wavefront size on AMD GPUs)

  • and it should be at least equal to this value: if it is smaller, the remaining threads of a warp stay idle (causing underutilization and hurting performance). More generally, part of the device remains idle whenever the group size is not a multiple of the warp size.
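To make the second point concrete, here is a rough back-of-the-envelope model of lane utilization (a sketch for illustration, not dpex code; the warp size of 64 matches the "preferred work group size multiple" clinfo reports on my device):

```python
import math


def warp_utilization(local_size, warp_size=64):
    # A work-group of local_size threads occupies ceil(local_size / warp_size)
    # warps; lanes past local_size in the last warp sit idle.
    warps = math.ceil(local_size / warp_size)
    return local_size / (warps * warp_size)


print(warp_utilization(512))  # multiple of 64: every lane busy
print(warp_utilization(400))  # 400 threads spread over 7 warps of 64
print(warp_utilization(1))    # a single busy lane per group
```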

clinfo, among other information, displays the values the group size should be a multiple of:

```
> clinfo
...
Device Name                                     Intel(R) UHD Graphics [0x9a60]
...
Device Type                                     GPU
...
Max work group size                             512
Preferred work group size multiple              64
...
```

On the OpenCL/Python side, those values are also exposed by pyopencl.

dpex.DEFAULT_LOCAL_SIZE seems to enforce different rules:

  • it tries to set a value that is actually reasonable (it is 512 on my computer; is it hardcoded, or does it change with the hardware?)
  • in any case, the local size must divide the global size. The global size is user-provided and can be any value; usually it is related to the size of the input (e.g. the number of rows in an array). If the previous dpex.DEFAULT_LOCAL_SIZE does not divide the global size, it falls back to the largest divisor of global_size that is smaller than the previous default value.

Here are a few examples:

```python
import numba_dpex as dpex
import dpctl
import numpy as np


def inspect_default_local_size(global_size):
    @dpex.kernel
    def kernel(array):
        idx = dpex.get_global_id(0)
        size = dpex.get_local_size(0)
        array[idx] = size

    array = dpctl.tensor.empty(sh=global_size, dtype=np.int32)
    kernel[global_size, dpex.DEFAULT_LOCAL_SIZE](array)
    return dpctl.tensor.asnumpy(array)[0]


print(inspect_default_local_size(8192))   # prints 512, OK
print(inspect_default_local_size(10000))  # prints 400 (suboptimal but not too bad)
print(inspect_default_local_size(9973))   # prints 1 because 9973 is prime, looks very suspect
```
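The prime case is not specific to 9973: any divisor-based fallback degenerates for primes. A minimal model of the rule as I understand it (a sketch, not the actual dpex implementation — note it would give 500 rather than the observed 400 for 10000, so the real heuristic must differ somewhat) shows why:

```python
def largest_divisor_at_most(n, limit):
    # Largest d <= limit with n % d == 0. A prime n > limit has no
    # divisors other than 1 and n itself, so this always returns 1.
    for d in range(min(n, limit), 0, -1):
        if n % d == 0:
            return d


print(largest_divisor_at_most(8192, 512))  # 512
print(largest_divisor_at_most(9973, 512))  # 1, since 9973 is prime
```

Whatever extra heuristic dpex applies on top, the degenerate prime case is the same.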

If dpex.DEFAULT_LOCAL_SIZE ends up close enough to a multiple of the recommended value, the impact on performance should not be significant, and the cost may even be offset because it saves implementing a boundary check in the kernel.

But when it is not (e.g. when global_size is a prime number, which forces the default local size down to 1 (!)), the performance drop can be massive: only one thread per warp would be doing useful work in this case.

I think the user should be responsible for choosing the global and local work-group sizes, and for adapting the behavior of the kernel at the boundaries if necessary. It seems like good practice to work with a fixed local size and adapt the kernel, rather than ignoring boundaries and adapting the local size, e.g.:

```python
import math

import numba_dpex as dpex
import dpctl
import numpy as np


@dpex.kernel
def kernel(array):
    idx = dpex.get_global_id(0)
    n = array.shape[0]
    if idx >= n:
        # out-of-bounds work items in the padded tail do nothing
        return
    array[idx] = idx


array_size = 9973
array = dpctl.tensor.empty(sh=array_size, dtype=np.int32)
local_size = 512
# round the global size up to the next multiple of local_size
global_size = math.ceil(array_size / local_size) * local_size
kernel[global_size, local_size](array)
```
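The round-up on the last lines is plain ceil-division; wrapped in a helper (hypothetical, not a numba_dpex API), the intent is explicit:

```python
import math


def padded_global_size(n_items, local_size):
    # Smallest multiple of local_size covering n_items; the kernel's
    # boundary check masks out the padded tail. (Hypothetical helper,
    # not part of numba_dpex.)
    return math.ceil(n_items / local_size) * local_size


print(padded_global_size(9973, 512))  # 10240, i.e. 20 groups of 512
```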

And exposing an automatic setting for the local size is counterproductive, because it suggests the opposite practice to the user.

If anything, numba_dpex could expose the maximum possible local size and the value it is recommended to be a multiple of.

Labels

user: User submitted issue