[Offload] Add olGetKernelMaxGroupSize #142950

Open · wants to merge 2 commits into main

13 changes: 13 additions & 0 deletions offload/liboffload/API/Kernel.td
@@ -24,6 +24,19 @@ def : Function {
let returns = [];
}

def : Function {
let name = "olGetKernelMaxGroupSize";
let desc = "Get the maximum block size needed to achieve maximum occupancy.";
let details = [];
let params = [
Param<"ol_kernel_handle_t", "Kernel", "handle of the kernel", PARAM_IN>,
Param<"ol_device_handle_t", "Device", "device intended to run the kernel", PARAM_IN>,
Contributor Author:
This function takes a device handle because, going by the HIP implementation, computing this on AMDGPU appears to require certain device properties.

Contributor:
The output for this should be a struct.

Contributor Author:
Any particular reason why? There's only one output value.

Contributor:
Right, but if we add more stuff we don't want to break the ABI in the future. However, this should probably be merged with some big 'getKernelInfo' type thing if we don't already have it. No reason for a standalone function here.
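
A minimal sketch of what an extensible, struct-shaped output could look like (the names here are hypothetical and not part of this patch or of liboffload):

// Hypothetical names, for illustration only -- not part of this patch.
// A size field up front lets fields be appended later without breaking the
// ABI: the caller reports how large a struct it was built against, and the
// implementation only fills in that much.
typedef struct ol_kernel_group_size_properties_t {
  size_t Size;         // set by the caller to the sizeof it was compiled with
  size_t MaxGroupSize; // maximum block size for full occupancy
  // future fields would be appended here
} ol_kernel_group_size_properties_t;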

Contributor Author:
I was originally planning on using getKernelInfo (similar to olGetDeviceInfo, olGetPlatformInfo etc.), but the fact that the query requires a device and memory size complicates that.

Unless you meant something similar to stat(), where it populates a buffer with information that the implementation might find useful? I can see the appeal of that, but it might result in plugins wasting time computing things that won't end up used.
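
A rough sketch of the shape being discussed, with hypothetical names that are not part of liboffload or this patch:

// Hypothetical sketch, for illustration only.
// An enum-keyed olGetKernelInfo-style query suits properties that depend only
// on the kernel itself:
typedef enum ol_kernel_info_t {
  OL_KERNEL_INFO_NAME,
  OL_KERNEL_INFO_MAX_GROUP_SIZE,
} ol_kernel_info_t;
//
// olGetKernelInfo(ol_kernel_handle_t Kernel, ol_kernel_info_t PropName,
//                 size_t PropSize, void *PropValue);
//
// ...but OL_KERNEL_INFO_MAX_GROUP_SIZE would also need a device handle and a
// dynamic shared-memory size as inputs, which this shape cannot carry without
// extra plumbing.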

Contributor:
Well, a kernel implies that it's been successfully loaded onto the device, so it should have a device. Might be reasonable to add it as an argument to the main one. That or we could revisit the other question and add some metadata associated with the kernel.

Contributor Author:
The question about adding a dedicated kernel_handle_t type that contains a pointer to the GenericKernelTy, the device and maybe even the amount of dynamic memory required? That's my preference and will ease any headaches down the line if we need to add additional information to kernels.
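
Purely as an illustration (the type name and fields are hypothetical, not part of this patch), such a kernel object might look like:

// Hypothetical, for illustration only -- not part of this patch.
// ol_kernel_handle_t would wrap a pointer to this struct instead of pointing
// directly at the GenericKernelTy.
struct OlKernelImpl {
  GenericKernelTy *Kernel;   // the underlying plugin kernel
  ol_device_handle_t Device; // device the program was loaded onto
  size_t DynamicMemSize;     // dynamic shared memory the launch will request
};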

Param<"size_t", "SharedMemory", "dynamic shared memory required", PARAM_IN>,
Param<"size_t*", "GroupSize", "maximum block size", PARAM_OUT>
];
let returns = [];
}

def : Struct {
let name = "ol_kernel_launch_size_args_t";
let desc = "Size-related arguments for a kernel launch.";
20 changes: 19 additions & 1 deletion offload/liboffload/src/OffloadImpl.cpp
@@ -557,6 +557,10 @@ Error olDestroyProgram_impl(ol_program_handle_t Program) {
return olDestroy(Program);
}

inline GenericKernelTy *getPluginKernel(ol_kernel_handle_t OlKernel) {
return reinterpret_cast<GenericKernelTy *>(OlKernel);
}

Error olGetKernel_impl(ol_program_handle_t Program, const char *KernelName,
ol_kernel_handle_t *Kernel) {

@@ -573,6 +577,20 @@ Error olGetKernel_impl(ol_program_handle_t Program, const char *KernelName,
return Error::success();
}

Error olGetKernelMaxGroupSize_impl(ol_kernel_handle_t Kernel,
ol_device_handle_t Device,
size_t DynamicMemSize, size_t *GroupSize) {
auto *KernelImpl = getPluginKernel(Kernel);

auto Res = KernelImpl->maxGroupSize(*Device->Device, DynamicMemSize);
if (auto Err = Res.takeError())
return Err;

*GroupSize = *Res;

return Error::success();
}

Error olLaunchKernel_impl(ol_queue_handle_t Queue, ol_device_handle_t Device,
ol_kernel_handle_t Kernel, const void *ArgumentsData,
size_t ArgumentsSize,
@@ -603,7 +621,7 @@ Error olLaunchKernel_impl(ol_queue_handle_t Queue, ol_device_handle_t Device,
// Don't do anything with pointer indirection; use arg data as-is
LaunchArgs.Flags.IsCUDA = true;

auto *KernelImpl = reinterpret_cast<GenericKernelTy *>(Kernel);
auto *KernelImpl = getPluginKernel(Kernel);
auto Err = KernelImpl->launch(*DeviceImpl, LaunchArgs.ArgPtrs, nullptr,
LaunchArgs, AsyncInfoWrapper);

8 changes: 8 additions & 0 deletions offload/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -570,6 +570,14 @@ struct AMDGPUKernelTy : public GenericKernelTy {
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const override;

/// Return maximum block size for maximum occupancy
///
/// TODO: This needs to be implemented for AMDGPU
Expected<uint64_t> maxGroupSize(GenericDeviceTy &GenericDevice,
uint64_t DynamicMemSize) const override {
return 1;
}

/// Print more elaborate kernel launch info for AMDGPU
Error printLaunchInfoDetails(GenericDeviceTy &GenericDevice,
KernelArgsTy &KernelArgs, uint32_t NumThreads[3],
3 changes: 3 additions & 0 deletions offload/plugins-nextgen/common/include/PluginInterface.h
@@ -316,6 +316,9 @@ struct GenericKernelTy {
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const = 0;

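/// Return the maximum block size for maximum occupancy, given
/// \p DynamicMemSize bytes of dynamic shared memory per block.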
virtual Expected<uint64_t> maxGroupSize(GenericDeviceTy &GenericDevice,
uint64_t DynamicMemSize) const = 0;

/// Get the kernel name.
const char *getName() const { return Name; }

1 change: 1 addition & 0 deletions offload/plugins-nextgen/cuda/dynamic_cuda/cuda.cpp
@@ -71,6 +71,7 @@ DLWRAP(cuDevicePrimaryCtxGetState, 3)
DLWRAP(cuDevicePrimaryCtxSetFlags, 2)
DLWRAP(cuDevicePrimaryCtxRetain, 2)
DLWRAP(cuModuleLoadDataEx, 5)
DLWRAP(cuOccupancyMaxPotentialBlockSize, 6)

DLWRAP(cuDeviceCanAccessPeer, 3)
DLWRAP(cuCtxEnablePeerAccess, 2)
3 changes: 3 additions & 0 deletions offload/plugins-nextgen/cuda/dynamic_cuda/cuda.h
@@ -289,6 +289,7 @@ static inline void *CU_LAUNCH_PARAM_BUFFER_POINTER = (void *)0x01;
static inline void *CU_LAUNCH_PARAM_BUFFER_SIZE = (void *)0x02;

typedef void (*CUstreamCallback)(CUstream, CUresult, void *);
typedef size_t (*CUoccupancyB2DSize)(int);

CUresult cuCtxGetDevice(CUdevice *);
CUresult cuDeviceGet(CUdevice *, int);
@@ -370,5 +371,7 @@ CUresult cuMemSetAccess(CUdeviceptr ptr, size_t size,
CUresult cuMemGetAllocationGranularity(size_t *granularity,
const CUmemAllocationProp *prop,
CUmemAllocationGranularity_flags option);
CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
CUoccupancyB2DSize, size_t, int);

#endif
14 changes: 14 additions & 0 deletions offload/plugins-nextgen/cuda/src/rtl.cpp
@@ -157,6 +157,20 @@ struct CUDAKernelTy : public GenericKernelTy {
KernelLaunchParamsTy LaunchParams,
AsyncInfoWrapperTy &AsyncInfoWrapper) const override;

/// Return maximum block size for maximum occupancy
Expected<uint64_t> maxGroupSize(GenericDeviceTy &,
uint64_t DynamicMemSize) const override {
int MinGridSize;
int MaxBlockSize;
auto Res = cuOccupancyMaxPotentialBlockSize(
&MinGridSize, &MaxBlockSize, Func, nullptr, DynamicMemSize, INT_MAX);
if (auto Err = Plugin::check(
Res, "error in cuOccupancyMaxPotentialBlockSize: %s"))
return Err;
return MaxBlockSize;
}

private:
/// The CUDA kernel function to execute.
CUfunction Func;
7 changes: 7 additions & 0 deletions offload/plugins-nextgen/host/src/rtl.cpp
@@ -114,6 +114,13 @@ struct GenELF64KernelTy : public GenericKernelTy {
return Plugin::success();
}

/// Return maximum block size for maximum occupancy
Expected<uint64_t> maxGroupSize(GenericDeviceTy &Device,
uint64_t DynamicMemSize) const override {
// TODO: This needs to be implemented for the host plugin
return 1;
}

private:
/// The kernel function to execute.
void (*Func)(void);
1 change: 1 addition & 0 deletions offload/unittests/OffloadAPI/CMakeLists.txt
@@ -18,6 +18,7 @@ target_compile_definitions("init.unittests" PRIVATE DISABLE_WRAPPER)

add_offload_unittest("kernel"
kernel/olGetKernel.cpp
kernel/olGetKernelMaxGroupSize.cpp
kernel/olLaunchKernel.cpp)

add_offload_unittest("memory"
37 changes: 37 additions & 0 deletions offload/unittests/OffloadAPI/kernel/olGetKernelMaxGroupSize.cpp
@@ -0,0 +1,37 @@
//===------- Offload API tests - olGetKernelMaxGroupSize ------------------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#include "../common/Fixtures.hpp"
#include <OffloadAPI.h>
#include <gtest/gtest.h>

using olKernelGetMaxGroupSizeTest = OffloadKernelTest;
OFFLOAD_TESTS_INSTANTIATE_DEVICE_FIXTURE(olKernelGetMaxGroupSizeTest);

TEST_P(olKernelGetMaxGroupSizeTest, Success) {
size_t Size{0};
ASSERT_SUCCESS(olGetKernelMaxGroupSize(Kernel, Device, 0, &Size));
ASSERT_GT(Size, 0u);
}

TEST_P(olKernelGetMaxGroupSizeTest, NullKernel) {
size_t Size;
ASSERT_ERROR(OL_ERRC_INVALID_NULL_HANDLE,
olGetKernelMaxGroupSize(nullptr, Device, 0, &Size));
}

TEST_P(olKernelGetMaxGroupSizeTest, NullDevice) {
size_t Size;
ASSERT_ERROR(OL_ERRC_INVALID_NULL_HANDLE,
olGetKernelMaxGroupSize(Kernel, nullptr, 0, &Size));
}

TEST_P(olKernelGetMaxGroupSizeTest, NullOutput) {
ASSERT_ERROR(OL_ERRC_INVALID_NULL_POINTER,
olGetKernelMaxGroupSize(Kernel, Device, 0, nullptr));
}