Skip to content

[SYCL[L0] Change the SYCL_PI_LEVEL_ZERO_USM_RESIDENT default to force device allocations only #9442

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 16, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion sycl/doc/EnvironmentVariables.md
Original file line number Diff line number Diff line change
Expand Up @@ -259,7 +259,7 @@ variables in production code.</span>
| `SYCL_PI_LEVEL_ZERO_EXPOSE_CSLICE_IN_AFFINITY_PARTITIONING` (Deprecated) | Integer | When set to non-zero value exposes compute slices as sub-sub-devices in `sycl::info::partition_property::partition_by_affinity_domain` partitioning scheme. Default is zero meaning that they are only exposed when partitioning by `sycl::info::partition_property::ext_intel_partition_by_cslice`. This option is introduced for compatibility reasons and is immediately deprecated. New code must not rely on this behavior. Also note that even if sub-sub-device was created using `partition_by_affinity_domain` it would still be reported as created via partitioning by compute slices. |
| `SYCL_PI_LEVEL_ZERO_COMMANDLISTS_CLEANUP_THRESHOLD` | Integer | If non-negative then the threshold is set to this value. If negative, the threshold is set to INT_MAX. Whenever the number of command lists in a queue exceeds this threshold, an attempt is made to cleanup completed command lists for their subsequent reuse. The default is 20. |
| `SYCL_PI_LEVEL_ZERO_IMMEDIATE_COMMANDLISTS_EVENT_CLEANUP_THRESHOLD` | Integer | If non-negative then the threshold is set to this value. If negative, the threshold is set to INT_MAX. Whenever the number of events associated with an immediate command list exceeds this threshold, a check is made for signaled events and these events are recycled. Setting this threshold low causes events to be checked more often, which could result in unneeded events being recycled sooner. However, more frequent event status checks may cost time. The default is 1000. |
| `SYCL_PI_LEVEL_ZERO_USM_RESIDENT` | Integer | Controls if/where to make USM allocations resident at the time of allocation. If set to 0 then no special residency is forced. If set to 1 then allocation (device or shared) is made resident at the device of allocation. If set to 2 then allocation (device or shared) is made resident on all devices in the context of allocation that have P2P access to the device of allocation. For host allocation, any non-0 setting forces the allocation resident on all devices in the context. Default is 2. |
| `SYCL_PI_LEVEL_ZERO_USM_RESIDENT` | Integer | Bit-mask controls if/where to make USM allocations resident at the time of allocation. Input value is of the form 0xHSD, where 4-bits of D control device allocations, 4-bits of S control shared allocations, and 4-bits of H control host allocations. Each 4-bit componenet is holding one of the following values: "0" - then no special residency is forced, "1" - then allocation is made resident at the device of allocation, or "2" - then allocation is made resident on all devices in the context of allocation that have P2P access to the device of allocation. Default is 0x002, i.e. force full residency for device allocations only. |
| `SYCL_PI_LEVEL_ZERO_USE_NATIVE_USM_MEMCPY2D` | Integer | When set to a positive value enables the use of Level Zero USM 2D memory copy operations. Default is 0. |

## Debugging variables for CUDA Plugin
Expand Down
62 changes: 47 additions & 15 deletions sycl/plugins/level_zero/pi_level_zero.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -7231,48 +7231,77 @@ enum class USMAllocationForceResidencyType {
// Force memory resident on the device of allocation at allocation time.
// For host allocation force residency on all devices in a context.
Device = 1,
// [Default] Force memory resident on all devices in the context with P2P
// Force memory resident on all devices in the context with P2P
// access to the device of allocation.
// For host allocation force residency on all devices in a context.
P2PDevices = 2
};

// Returns the desired USM residency setting
static USMAllocationForceResidencyType USMAllocationForceResidency = [] {
// Input value is of the form 0xHSD, where:
// 4-bits of D control device allocations
// 4-bits of S control shared allocations
// 4-bits of H control host allocations
// Each 4-bit value is holding a USMAllocationForceResidencyType enum value.
// The default is 0x2, i.e. force full residency for device allocations only.
//
static uint32_t USMAllocationForceResidency = [] {
const char *UrRet = std::getenv("UR_L0_USM_RESIDENT");
const char *PiRet = std::getenv("SYCL_PI_LEVEL_ZERO_USM_RESIDENT");
const char *Str = UrRet ? UrRet : (PiRet ? PiRet : nullptr);
try {
if (Str) {
// Auto-detect radix to allow more convinient hex base
return std::stoi(Str, nullptr, 0);
}
} catch (...) {
}
return 0x2;
}();

if (!Str)
return USMAllocationForceResidencyType::P2PDevices;
switch (std::atoi(Str)) {
// Convert from an integer value to USMAllocationForceResidencyType enum value
static USMAllocationForceResidencyType
USMAllocationForceResidencyConvert(uint32_t Val) {
switch (Val) {
case 1:
return USMAllocationForceResidencyType::Device;
case 2:
return USMAllocationForceResidencyType::P2PDevices;
default:
return USMAllocationForceResidencyType::None;
};
}

static USMAllocationForceResidencyType USMHostAllocationForceResidency = [] {
return USMAllocationForceResidencyConvert(
(USMAllocationForceResidency & 0xf00) >> 8);
}();
static USMAllocationForceResidencyType USMSharedAllocationForceResidency = [] {
return USMAllocationForceResidencyConvert(
(USMAllocationForceResidency & 0x0f0) >> 4);
}();
static USMAllocationForceResidencyType USMDeviceAllocationForceResidency = [] {
return USMAllocationForceResidencyConvert(
(USMAllocationForceResidency & 0x00f));
}();

// Make USM allocation resident as requested
static pi_result
USMAllocationMakeResident(pi_context Context,
USMAllocationMakeResident(USMAllocationForceResidencyType ForceResidency,
pi_context Context,
pi_device Device, // nullptr for host allocation
void *Ptr, size_t Size) {
if (ForceResidency == USMAllocationForceResidencyType::None)
return PI_SUCCESS;

std::list<pi_device> Devices;

if (USMAllocationForceResidency == USMAllocationForceResidencyType::None)
return PI_SUCCESS;
else if (!Device) {
if (!Device) {
// Host allocation, make it resident on all devices in the context
Devices.insert(Devices.end(), Context->Devices.begin(),
Context->Devices.end());
} else {
Devices.push_back(Device);
if (USMAllocationForceResidency ==
USMAllocationForceResidencyType::P2PDevices) {
if (ForceResidency == USMAllocationForceResidencyType::P2PDevices) {
ze_bool_t P2P;
for (const auto &D : Context->Devices) {
if (D == Device)
Expand Down Expand Up @@ -7322,7 +7351,8 @@ static pi_result USMDeviceAllocImpl(void **ResultPtr, pi_context Context,
reinterpret_cast<std::uintptr_t>(*ResultPtr) % Alignment == 0,
PI_ERROR_INVALID_VALUE);

USMAllocationMakeResident(Context, Device, *ResultPtr, Size);
USMAllocationMakeResident(USMDeviceAllocationForceResidency, Context, Device,
*ResultPtr, Size);
return PI_SUCCESS;
}

Expand Down Expand Up @@ -7353,7 +7383,8 @@ static pi_result USMSharedAllocImpl(void **ResultPtr, pi_context Context,
reinterpret_cast<std::uintptr_t>(*ResultPtr) % Alignment == 0,
PI_ERROR_INVALID_VALUE);

USMAllocationMakeResident(Context, Device, *ResultPtr, Size);
USMAllocationMakeResident(USMSharedAllocationForceResidency, Context, Device,
*ResultPtr, Size);

// TODO: Handle PI_MEM_ALLOC_DEVICE_READ_ONLY.
return PI_SUCCESS;
Expand All @@ -7379,7 +7410,8 @@ static pi_result USMHostAllocImpl(void **ResultPtr, pi_context Context,
reinterpret_cast<std::uintptr_t>(*ResultPtr) % Alignment == 0,
PI_ERROR_INVALID_VALUE);

USMAllocationMakeResident(Context, nullptr, *ResultPtr, Size);
USMAllocationMakeResident(USMHostAllocationForceResidency, Context, nullptr,
*ResultPtr, Size);
return PI_SUCCESS;
}

Expand Down
40 changes: 40 additions & 0 deletions sycl/test-e2e/Plugin/level_zero_usm_residency.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
// REQUIRES: gpu, level_zero

// RUN: %{build} %level_zero_options -o %t.out
// RUN: env SYCL_PI_TRACE=-1 ZE_DEBUG=-1 %{run} %t.out 2>&1 | FileCheck --check-prefixes=DEVICE %s
// RUN: env SYCL_PI_LEVEL_ZERO_USM_RESIDENT=0x001 SYCL_PI_TRACE=-1 ZE_DEBUG=-1 %{run} %t.out 2>&1 | FileCheck --check-prefixes=DEVICE %s
// RUN: env SYCL_PI_LEVEL_ZERO_USM_RESIDENT=0x010 SYCL_PI_TRACE=-1 ZE_DEBUG=-1 %{run} %t.out 2>&1 | FileCheck --check-prefixes=SHARED %s
// RUN: env SYCL_PI_LEVEL_ZERO_USM_RESIDENT=0x100 SYCL_PI_TRACE=-1 ZE_DEBUG=-1 %{run} %t.out 2>&1 | FileCheck --check-prefixes=HOST %s

// Test that USM is made resident at allocation as requested.

#include <sycl/sycl.hpp>

using namespace std;
using namespace sycl;

int main(int argc, char *argv[]) {
queue Q;

auto ptr1 = malloc_device<int>(1, Q);
// DEVICE: ---> piextUSMDeviceAlloc
// DEVICE: ZE ---> zeMemAllocDevice
// DEVICE: ZE ---> zeContextMakeMemoryResident
// DEVICE-NOT: ZE ---> zeContextMakeMemoryResident

auto ptr2 = malloc_shared<int>(1, Q);
// SHARED: ---> piextUSMSharedAlloc
// SHARED: ZE ---> zeMemAllocShared
// SHARED: ZE ---> zeContextMakeMemoryResident
// SHARED-NOT: ZE ---> zeContextMakeMemoryResident

auto ptr3 = malloc_host<int>(1, Q);
// HOST: ---> piextUSMHostAlloc
// HOST: ZE ---> zeMemAllocHost
// HOST: ZE ---> zeContextMakeMemoryResident

free(ptr1, Q);
free(ptr2, Q);
free(ptr3, Q);
return 0;
}