[SYCL] USM shared memory allocator for L0 plugin #2366

Merged
merged 5 commits into from
Sep 9, 2020
1 change: 1 addition & 0 deletions sycl/doc/EnvironmentVariables.md
@@ -26,6 +26,7 @@ subject to change. Do not rely on these variables in production code.
| SYCL_QUEUE_THREAD_POOL_SIZE | Positive integer | Number of threads in thread pool of queue. |
| SYCL_DEVICELIB_NO_FALLBACK | Any(\*) | Disable loading and linking of device library images |
| SYCL_PI_LEVEL0_MAX_COMMAND_LIST_CACHE | Positive integer | Maximum number of oneAPI Level Zero Command lists that can be allocated with no reuse before throwing an "out of resources" error. Default is 20000, threshold may be increased based on resource availability and workload demand. |
| SYCL_PI_LEVEL0_DISABLE_USM_ALLOCATOR | Any(\*) | Disable USM allocator in Level Zero plugin (each memory request will go directly to Level Zero runtime) |

`(*) Note: Any means this environment variable is effective when set to any non-null value.`
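
As a rough illustration of the "any non-null value" rule, the sketch below tests only whether a variable is present in the environment, which is also how the plugin code in this PR reads `SYCL_PI_LEVEL0_DISABLE_USM_ALLOCATOR`. The helper names `isSetInEnv` and `allocatorEnabled` are hypothetical and are not part of the SYCL runtime.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: true when the variable exists in the environment,
// regardless of what it is set to.
bool isSetInEnv(const char *Name) { return std::getenv(Name) != nullptr; }

// "Enabled unless explicitly disabled": the allocator is on by default and
// is turned off merely by the variable being present.
bool allocatorEnabled() {
  return !isSetInEnv("SYCL_PI_LEVEL0_DISABLE_USM_ALLOCATOR");
}
```

Under a presence-only check like this, `export SYCL_PI_LEVEL0_DISABLE_USM_ALLOCATOR=0` would still disable the allocator, because the value itself is never inspected.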

2 changes: 2 additions & 0 deletions sycl/plugins/level_zero/CMakeLists.txt
@@ -75,6 +75,8 @@ add_library(pi_level_zero SHARED
"${sycl_inc_dir}/CL/sycl/detail/pi.h"
"${CMAKE_CURRENT_SOURCE_DIR}/pi_level_zero.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/pi_level_zero.hpp"
"${CMAKE_CURRENT_SOURCE_DIR}/usm_allocator.cpp"
"${CMAKE_CURRENT_SOURCE_DIR}/usm_allocator.hpp"
)

if (MSVC)
195 changes: 183 additions & 12 deletions sycl/plugins/level_zero/pi_level_zero.cpp
@@ -22,6 +22,8 @@

#include <level_zero/zet_api.h>

#include "usm_allocator.hpp"

namespace {

// Controls Level Zero calls serialization to w/a Level Zero driver being not MT
@@ -1491,10 +1493,16 @@ pi_result piContextRelease(pi_context Context) {

assert(Context);
if (--(Context->RefCount) == 0) {
+ auto ZeContext = Context->ZeContext;
// Destroy the command list used for initializations
ZE_CALL(zeCommandListDestroy(Context->ZeCommandListInit));
- ZE_CALL(zeContextDestroy(Context->ZeContext));
delete Context;
+
+ // Destruction of some members of pi_context uses L0 context
+ // and therefore it must be valid at that point.
+ // Technically it should be placed to the destructor of pi_context
+ // but this makes API error handling more complex.
+ ZE_CALL(zeContextDestroy(ZeContext));
}
return PI_SUCCESS;
}
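
The ordering in this hunk (save the native handle, delete the wrapper, then destroy the native context) can be sketched in isolation. Everything below is a hypothetical stand-in: `NativeContext`, `Wrapper`, `release` and the `Log` vector are illustrative names, not Level Zero or PI types.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order of teardown steps (illustration only).
std::vector<std::string> Log;

// Stand-in for the native (driver-level) context.
struct NativeContext {
  bool Alive;
};

// Stand-in for the wrapper object whose members still need the native
// context while they are being destroyed.
struct Wrapper {
  NativeContext *Native;
  ~Wrapper() {
    assert(Native->Alive); // member cleanup requires a live native context
    Log.push_back("members released");
  }
};

void release(Wrapper *W) {
  NativeContext *Saved = W->Native; // copy the handle before the wrapper dies
  delete W;                         // members are released while it is alive
  Saved->Alive = false;             // only now tear down the native context
  Log.push_back("native destroyed");
}
```

Reversing the last two steps in `release` would trip the assertion in the destructor, which is the failure mode the reordering is meant to avoid.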
@@ -4052,7 +4060,6 @@ pi_result piextGetDeviceFunctionPointer(pi_device Device, pi_program Program,
pi_result piextUSMHostAlloc(void **ResultPtr, pi_context Context,
pi_usm_mem_properties *Properties, size_t Size,
pi_uint32 Alignment) {

assert(Context);
// Check that incorrect bits are not set in the properties.
assert(!Properties || (Properties && !(*Properties & ~PI_MEM_ALLOC_FLAGS)));
@@ -4066,11 +4073,17 @@ pi_result piextUSMHostAlloc(void **ResultPtr, pi_context Context,
return PI_SUCCESS;
}

- pi_result piextUSMDeviceAlloc(void **ResultPtr, pi_context Context,
- pi_device Device,
- pi_usm_mem_properties *Properties, size_t Size,
- pi_uint32 Alignment) {
+ static bool ShouldUseUSMAllocator() {
+ // Enable allocator by default if it's not explicitly disabled
+ return std::getenv("SYCL_PI_LEVEL0_DISABLE_USM_ALLOCATOR") == nullptr;
+ }
+
+ static const bool UseUSMAllocator = ShouldUseUSMAllocator();
+
+ pi_result USMDeviceAllocImpl(void **ResultPtr, pi_context Context,
+ pi_device Device,
+ pi_usm_mem_properties *Properties, size_t Size,
+ pi_uint32 Alignment) {
assert(Context);
assert(Device);
// Check that incorrect bits are not set in the properties.
@@ -4086,11 +4099,10 @@ pi_result piextUSMDeviceAlloc(void **ResultPtr, pi_context Context,
return PI_SUCCESS;
}

- pi_result piextUSMSharedAlloc(void **ResultPtr, pi_context Context,
- pi_device Device,
- pi_usm_mem_properties *Properties, size_t Size,
- pi_uint32 Alignment) {
-
+ pi_result USMSharedAllocImpl(void **ResultPtr, pi_context Context,
+ pi_device Device,
+ pi_usm_mem_properties *Properties, size_t Size,
+ pi_uint32 Alignment) {
assert(Context);
assert(Device);
// Check that incorrect bits are not set in the properties.
@@ -4108,11 +4120,170 @@ pi_result piextUSMSharedAlloc(void **ResultPtr, pi_context Context,
return PI_SUCCESS;
}

- pi_result piextUSMFree(pi_context Context, void *Ptr) {
+ pi_result USMFreeImpl(pi_context Context, void *Ptr) {
ZE_CALL(zeMemFree(Context->ZeContext, Ptr));
return PI_SUCCESS;
}

// Exception type to pass allocation errors
class UsmAllocationException {
const pi_result Error;

public:
UsmAllocationException(pi_result Err) : Error{Err} {}
pi_result getError() const { return Error; }
};

pi_result USMSharedMemoryAlloc::allocateImpl(void **ResultPtr, size_t Size,
pi_uint32 Alignment) {
return USMSharedAllocImpl(ResultPtr, Context, Device, nullptr, Size,
Alignment);
}

pi_result USMDeviceMemoryAlloc::allocateImpl(void **ResultPtr, size_t Size,
pi_uint32 Alignment) {
return USMDeviceAllocImpl(ResultPtr, Context, Device, nullptr, Size,
Alignment);
}

void *USMMemoryAllocBase::allocate(size_t Size) {
void *Ptr = nullptr;

auto Res = allocateImpl(&Ptr, Size, sizeof(void *));
if (Res != PI_SUCCESS) {
throw UsmAllocationException(Res);
}

return Ptr;
}

void *USMMemoryAllocBase::allocate(size_t Size, size_t Alignment) {
void *Ptr = nullptr;

auto Res = allocateImpl(&Ptr, Size, Alignment);
if (Res != PI_SUCCESS) {
throw UsmAllocationException(Res);
}
return Ptr;
}

void USMMemoryAllocBase::deallocate(void *Ptr) {
auto Res = USMFreeImpl(Context, Ptr);
if (Res != PI_SUCCESS) {
throw UsmAllocationException(Res);
}
}
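
The allocator interface reports failures by throwing, while the PI entry points return error codes, so the exception has to be converted back at the API boundary. A minimal sketch of that translation follows, assuming a hypothetical `Status` enum, `AllocError` class and `MaxSize` limit in place of `pi_result`, `UsmAllocationException` and the real driver behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical status codes standing in for pi_result.
enum Status { Ok = 0, OutOfMemory = 1 };

// Carries a status code across layers that cannot return one directly.
class AllocError {
  const Status Code;

public:
  explicit AllocError(Status S) : Code{S} {}
  Status code() const { return Code; }
};

// Arbitrary limit chosen for the example so a failure is easy to trigger.
constexpr std::size_t MaxSize = 1u << 20;

// Internal layer: returns a pointer on success, throws on failure.
void *poolAllocate(std::size_t Size) {
  if (Size == 0 || Size > MaxSize)
    throw AllocError(OutOfMemory);
  void *Ptr = std::malloc(Size);
  if (!Ptr)
    throw AllocError(OutOfMemory);
  return Ptr;
}

// C-style entry point: no exception escapes, the caller sees a status code.
Status apiAlloc(void **Result, std::size_t Size) {
  try {
    *Result = poolAllocate(Size);
  } catch (const AllocError &E) {
    *Result = nullptr;
    return E.code();
  }
  return Ok;
}
```

The design keeps the allocator interface pointer-returning while preserving the original error code for the caller.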

pi_result piextUSMDeviceAlloc(void **ResultPtr, pi_context Context,
pi_device Device,
pi_usm_mem_properties *Properties, size_t Size,
pi_uint32 Alignment) {
if (!UseUSMAllocator ||
// L0 spec says that allocation fails if Alignment != 2^n, in order to
// keep the same behavior for the allocator, just call L0 API directly and
// return the error code.
((Alignment & (Alignment - 1)) != 0)) {
return USMDeviceAllocImpl(ResultPtr, Context, Device, Properties, Size,
Alignment);
}

try {
auto It = Context->DeviceMemAllocContexts.find(Device);
if (It == Context->DeviceMemAllocContexts.end())
return PI_INVALID_VALUE;

*ResultPtr = It->second.allocate(Size, Alignment);
} catch (const UsmAllocationException &Ex) {
*ResultPtr = nullptr;
return Ex.getError();
}

return PI_SUCCESS;
}
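
The alignment test in the entry point above relies on a standard bit trick: a value with at most one bit set shares no bits with itself minus one. The sketch below isolates that expression; `isNotPowerOfTwo` is a hypothetical helper name, and `std::uint32_t` is assumed to match `pi_uint32`.

```cpp
#include <cassert>
#include <cstdint>

// True when Alignment has more than one bit set, i.e. it is neither zero
// nor a power of two. This is the same expression as in the condition above.
constexpr bool isNotPowerOfTwo(std::uint32_t Alignment) {
  return (Alignment & (Alignment - 1)) != 0;
}

static_assert(!isNotPowerOfTwo(1), "1 is a power of two");
static_assert(!isNotPowerOfTwo(64), "64 is a power of two");
static_assert(isNotPowerOfTwo(48), "48 has two bits set");
// Unsigned wrap-around makes 0 - 1 all ones, so 0 is not flagged either.
static_assert(!isNotPowerOfTwo(0), "0 passes the check");
```

One consequence worth noting is that an alignment of 0 is not rejected by this expression, so such a request would take the allocator path rather than the direct Level Zero call.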

pi_result piextUSMSharedAlloc(void **ResultPtr, pi_context Context,
pi_device Device,
pi_usm_mem_properties *Properties, size_t Size,
pi_uint32 Alignment) {
if (!UseUSMAllocator ||
// L0 spec says that allocation fails if Alignment != 2^n, in order to
// keep the same behavior for the allocator, just call L0 API directly and
// return the error code.
((Alignment & (Alignment - 1)) != 0)) {
return USMSharedAllocImpl(ResultPtr, Context, Device, Properties, Size,
Alignment);
}

try {
auto It = Context->SharedMemAllocContexts.find(Device);
if (It == Context->SharedMemAllocContexts.end())
return PI_INVALID_VALUE;

*ResultPtr = It->second.allocate(Size, Alignment);
} catch (const UsmAllocationException &Ex) {
*ResultPtr = nullptr;
return Ex.getError();
}

return PI_SUCCESS;
}

pi_result piextUSMFree(pi_context Context, void *Ptr) {
if (!UseUSMAllocator) {
return USMFreeImpl(Context, Ptr);
}

// Query the device of the allocation to determine the right allocator context
ze_device_handle_t ZeDeviceHandle;
ze_memory_allocation_properties_t ZeMemoryAllocationProperties = {};

// Query the memory type of the pointer we're freeing to determine the correct
// way to free it (directly or via the allocator)
ZE_CALL(zeMemGetAllocProperties(
Context->ZeContext, Ptr, &ZeMemoryAllocationProperties, &ZeDeviceHandle));

// TODO: when support for multiple devices is implemented, we should do the
// following here:
// - Find the pi_device instance corresponding to the ZeDeviceHandle we've
// just obtained, if it exists
// - Use that pi_device to find the right allocator context and free the
// pointer.

// The allocation doesn't belong to any device for which USM allocator is
// enabled.
if (Context->Device->ZeDevice != ZeDeviceHandle) {
return USMFreeImpl(Context, Ptr);
}

auto DeallocationHelper =
[Context,
Ptr](std::unordered_map<pi_device, USMAllocContext> &AllocContextMap) {
try {
auto It = AllocContextMap.find(Context->Device);
if (It == AllocContextMap.end())
return PI_INVALID_VALUE;

// The right context is found, deallocate the pointer
It->second.deallocate(Ptr);
} catch (const UsmAllocationException &Ex) {
return Ex.getError();
}

return PI_SUCCESS;
};

switch (ZeMemoryAllocationProperties.type) {
case ZE_MEMORY_TYPE_SHARED:
return DeallocationHelper(Context->SharedMemAllocContexts);
case ZE_MEMORY_TYPE_DEVICE:
return DeallocationHelper(Context->DeviceMemAllocContexts);
default:
// Handled below
break;
}
return USMFreeImpl(Context, Ptr);
}
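
The routing performed by `piextUSMFree` (pick the pool that matches the allocation's memory type, otherwise free directly) can be sketched without any Level Zero dependency. All names below (`MemType`, `Route`, `Pool`, `Ctx`, `freeByType`) are hypothetical, and the memory type is passed in rather than queried from a driver.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical mirror of the memory types a driver query might report.
enum class MemType { Unknown, Host, Device, Shared };
// Which path a free request ended up taking.
enum class Route { SharedPool, DevicePool, Direct, NotFound };

using DeviceId = int;

// Stand-in for one allocator context; it only records what was freed.
struct Pool {
  std::vector<void *> Freed;
  void deallocate(void *Ptr) { Freed.push_back(Ptr); }
};

struct Ctx {
  std::unordered_map<DeviceId, Pool> SharedPools; // one pool per device
  std::unordered_map<DeviceId, Pool> DevicePools; // one pool per device
  std::vector<void *> DirectFrees;                // bypassed the pools
};

Route freeByType(Ctx &C, DeviceId Dev, MemType Type, void *Ptr) {
  // Looks up the pool for this device in the given map and frees through it.
  auto ViaPool = [&](std::unordered_map<DeviceId, Pool> &Pools, Route Hit) {
    auto It = Pools.find(Dev);
    if (It == Pools.end())
      return Route::NotFound; // reported as an error in this sketch
    It->second.deallocate(Ptr);
    return Hit;
  };

  switch (Type) {
  case MemType::Shared:
    return ViaPool(C.SharedPools, Route::SharedPool);
  case MemType::Device:
    return ViaPool(C.DevicePools, Route::DevicePool);
  default:
    break; // other types are not pooled
  }
  C.DirectFrees.push_back(Ptr);
  return Route::Direct;
}
```

Because each memory type has its own map, the free path needs only the type reported for the pointer to find the right pool; no per-allocation bookkeeping is required in this sketch.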

pi_result piextKernelSetArgPointer(pi_kernel Kernel, pi_uint32 ArgIndex,
size_t ArgSize, const void *ArgValue) {

62 changes: 61 additions & 1 deletion sycl/plugins/level_zero/pi_level_zero.hpp
@@ -32,6 +32,8 @@

#include <level_zero/ze_api.h>

#include "usm_allocator.hpp"

template <class To, class From> To pi_cast(From Value) {
// TODO: see if more sanity checks are possible.
assert(sizeof(From) == sizeof(To));
@@ -89,6 +91,46 @@ struct _pi_platform {
std::atomic<int> ZeGlobalCommandListCount{0};
};

// Implements memory allocation via L0 RT for USM allocator interface.
class USMMemoryAllocBase : public SystemMemory {
protected:
pi_context Context;
pi_device Device;
// Internal allocation routine which must be implemented for each allocation
// type
virtual pi_result allocateImpl(void **ResultPtr, size_t Size,
pi_uint32 Alignment) = 0;

public:
USMMemoryAllocBase(pi_context Ctx, pi_device Dev)
: Context{Ctx}, Device{Dev} {}
void *allocate(size_t Size) override final;
void *allocate(size_t Size, size_t Alignment) override final;
void deallocate(void *Ptr) override final;
};

// Allocation routines for shared memory type
class USMSharedMemoryAlloc : public USMMemoryAllocBase {
protected:
pi_result allocateImpl(void **ResultPtr, size_t Size,
pi_uint32 Alignment) override;

public:
USMSharedMemoryAlloc(pi_context Ctx, pi_device Dev)
: USMMemoryAllocBase(Ctx, Dev) {}
};

// Allocation routines for device memory type
class USMDeviceMemoryAlloc : public USMMemoryAllocBase {
protected:
pi_result allocateImpl(void **ResultPtr, size_t Size,
pi_uint32 Alignment) override;

public:
USMDeviceMemoryAlloc(pi_context Ctx, pi_device Dev)
: USMMemoryAllocBase(Ctx, Dev) {}
};
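
The three classes above follow a template-method layout: the base class implements the public interface once and marks it `final`, and each subclass supplies only the allocation call for its memory type. A reduced sketch of that shape is given below; `MemorySource`, `AllocBase`, `SharedAlloc` and `DeviceAlloc` are hypothetical names, and strings stand in for real pointers so the dispatch is easy to observe.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical minimal interface; the real SystemMemory is declared in
// usm_allocator.hpp and is not reproduced here.
struct MemorySource {
  virtual ~MemorySource() = default;
  virtual std::string allocate(std::size_t Size) = 0;
};

class AllocBase : public MemorySource {
protected:
  // The only customization point: one implementation per memory type.
  virtual std::string allocateImpl(std::size_t Size) = 0;

public:
  // Shared logic lives here and cannot be overridden further.
  std::string allocate(std::size_t Size) final {
    if (Size == 0)
      return "rejected"; // illustrative check, not taken from the PR
    return allocateImpl(Size);
  }
};

class SharedAlloc : public AllocBase {
protected:
  std::string allocateImpl(std::size_t) override { return "shared"; }
};

class DeviceAlloc : public AllocBase {
protected:
  std::string allocateImpl(std::size_t) override { return "device"; }
};
```

Code holding a `MemorySource &` does not need to know which memory type it is backed by; the subclass chosen at construction decides.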

struct _pi_device : _pi_object {
_pi_device(ze_device_handle_t Device, pi_platform Plt,
bool isSubDevice = false)
@@ -145,7 +187,19 @@ struct _pi_device : _pi_object {
struct _pi_context : _pi_object {
_pi_context(pi_device Device)
: Device{Device}, ZeCommandListInit{nullptr}, ZeEventPool{nullptr},
- NumEventsAvailableInEventPool{}, NumEventsLiveInEventPool{} {}
+ NumEventsAvailableInEventPool{}, NumEventsLiveInEventPool{} {
+ // TODO: when support for multiple devices is added, here we should
+ // loop over all the devices and initialize allocator context for each
+ // pair (context, device)
+ SharedMemAllocContexts.emplace(
+ std::piecewise_construct, std::make_tuple(Device),
+ std::make_tuple(std::unique_ptr<SystemMemory>(
+ new USMSharedMemoryAlloc(this, Device))));
+ DeviceMemAllocContexts.emplace(
+ std::piecewise_construct, std::make_tuple(Device),
+ std::make_tuple(std::unique_ptr<SystemMemory>(
+ new USMDeviceMemoryAlloc(this, Device))));
+ }

// A L0 context handle is primarily used during creation and management of
// resources that may be used by multiple devices.
@@ -174,6 +228,12 @@ struct _pi_context : _pi_object {
// and destroy the pool if there are no alive events.
ze_result_t decrementAliveEventsInPool(ze_event_pool_handle_t pool);

// Stores the USM allocator contexts (internal allocator structures)
// for USM shared/host and device allocations. There is one allocator context
// per (context, device) pair and per memory type.
Contributor

Why should there be separate allocators for device and shared USM?

Contributor Author

They use different allocation functions: zeMemAllocShared vs. zeMemAllocDevice.
Technically, a single allocator could serve both, but it would have to differentiate and keep track of the allocation type. That makes the core logic more complex and offers no advantage over the current approach, where we keep one allocator per allocation type.

Contributor

OK, that makes sense.

Contributor

Yes, keeping track of them separately is good.

std::unordered_map<pi_device, USMAllocContext> SharedMemAllocContexts;
std::unordered_map<pi_device, USMAllocContext> DeviceMemAllocContexts;

private:
// Following member variables are used to manage assignment of events
// to event pools.