From af7ea6d6a113aac5abcecd434df74e4aa5f25ee5 Mon Sep 17 00:00:00 2001 From: Greg Lueck Date: Tue, 22 Nov 2022 15:22:43 -0500 Subject: [PATCH 1/8] [SYCL] Add spec for sycl_ext_intel_cslice Add a proposed extension specification that allows partitioning a device by "cslice" (aka CCS-es). --- .../proposed/sycl_ext_intel_cslice.asciidoc | 275 ++++++++++++++++++ 1 file changed, 275 insertions(+) create mode 100644 sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc diff --git a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc new file mode 100644 index 0000000000000..0452f4a1c8d51 --- /dev/null +++ b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc @@ -0,0 +1,275 @@ += sycl_ext_intel_cslice + +:source-highlighter: coderay +:coderay-linenums-mode: table + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en +:dpcpp: pass:[DPC++] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +[%hardbreaks] +Copyright (C) 2022-2022 Intel Corporation. All rights reserved. + +Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks +of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by +permission by Khronos. + + +== Contact + +To report problems with this extension, please open a new issue at: + +https://github.com/intel/llvm/issues + + +== Dependencies + +This extension is written against the SYCL 2020 revision 6 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + + +== Status + +This is a proposed extension specification, intended to gather community +feedback. Interfaces defined in this specification may not be implemented yet +or may be in a preliminary state. The specification itself may also change in +incompatible ways before it is finalized. *Shipping software products should +not rely on APIs defined in this specification.* + + +== Overview + +:multi-CCS: https://github.com/intel/compute-runtime/blob/master/level_zero/doc/experimental_extensions/MULTI_CCS_MODES.md + +Some Intel GPU devices can be partitioned at a granularity of "cslice" (compute +slice), which is a smaller granularity than "tile". At present, the GPU device +drivers don't expose this mode by default, so this form of partitioning is +considered an advanced feature which most applications are not expected to use. +This extension provides a way for these advanced applications to partition a +device by cslice when it is enabled in the device driver. + +Unlike "tile" partitions, a cslice partition does not have any different cache +affinity from its sibling cslice partitions. Therefore, this extension does +not expose this type of partitioning through +`info::partition_property::partition_by_affinity_domain`. Instead, it adds a +new partitioning type +`info::partition_property::ext_intel_partition_by_cslice`. + +Intel GPU devices that support this type of partitioning currently support it +only at the "tile" level. Therefore, a device with multiple tiles (e.g. PVC) +must first be partitioned into per-tile sub-devices via +`partition_by_affinity_domain`, and then each of the +resulting sub-devices can be further partitioned by +`ext_intel_partition_by_cslice`. Single-tile devices (e.g. ATS-M) can be +directly partitioned by `ext_intel_partition_by_cslice` (for those ATS-M parts +that have multiple cslice partitions). + +It is important to understand that the device driver virtualizes work +submission to the cslice sub-devices. This virtualization happens only between +processes, and not within a single process. For example, consider a single +process that constructs two SYCL queues on cslice sub-device #0. Kernels +submitted to these two queues are guaranteed to conflict, both using the same +set of execution units. Therefore, if a single process wants to explicitly +submit kernels to cslice sub-devices and it wants to avoid conflict, it should +create queues on different sub-devices. By contrast, consider an example where +two separate processes create a SYCL queue on cslice sub-device #0. In this +case, the device driver virtualizes access to this cslice, and kernels +submitted from the first process may run on different execution units than +kernels submitted from the second process. In this second case, the device +driver binds the process's requested cslice to a physical cslice according to +the overall system load. + +For information about configuring the device driver to support cslice +partitioning, see the driver documentation on {multi-CCS}[multi-CCS] mode. +Currently, it is only possible to partition a device by cslice if the driver is +in "2 CCS Mode" or "4 CCS Mode". When in 2 CCS Mode, a tile can be partitioned +into two cslice sub-devices. When in 4 CCS Mode, a tile can be partitioned +into four cslice sub-devices. + +Note that this extension can be supported by any backend and any device. If a +backend or device does not support the concept of cslice partitions, it can +still conform to this extension by declaring the new enumerator and member +functions specified below. If the info descriptor query +`info::device::partition_properties` does not report +`ext_intel_partition_by_cslice`, then the implementation need not support +the creation of cslice partitions. + + +== Specification + +=== Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification. An implementation supporting this extension must predefine the +macro `SYCL_EXT_INTEL_CSLICE` to one of the values defined in the table +below. Applications can test for the existence of this macro to determine if +the implementation supports this feature, or applications can test the macro's +value to determine which of the extension's features the implementation +supports. + +[%header,cols="1,5"] +|=== +|Value +|Description + +|1 +|Initial version of this extension. +|=== + +=== New partition property + +This extension adds a new enumerator named `ext_intel_partition_by_cslice` to +`info::partition_property`: + +``` +namespace sycl::info { + +enum class partition_property : /* unspecified */ { + // ... + ext_intel_partition_by_cslice +}; + +} // namespace sycl::info +``` + +The behavior of the `info::device::partition_properties` info descriptor query +is also extended to include `ext_intel_partition_by_cslice` in the vector of +returned values if the device can be partitioned into at least two sub-devices +along that partition property. + +=== New function template specialization to create sub-devices + +This extension adds a new function template specialization to the `device` +class: + +``` +namespace sycl { + +class device { + // ... + + // Available only when + // Prop == info::partition_property::ext_intel_partition_by_cslice + template + std::vector create_sub_devices() const; +}; + +} // namespace sycl +``` + +This function only participates in overload resolution if the `Prop` template +parameter is `info::partition_property::ext_intel_partition_by_cslice`. It +returns a `std::vector` of sub-devices partitioned from this SYCL `device`, +each representing a fixed set of hardware cslices. + +If the SYCL `device` does not support +`info::partition_property::ext_intel_partition_by_cslice`, calling this +function throws a synchronous `exception` with the +`errc::feature_not_supported` error code. + +=== Behavior of device info queries for a "cslice" sub-device + +This section describes the behavior for some of the device info queries when +applied to a `device` object that represents a "cslice" partition. + +* `info::device::partition_type_property` ++ +Returns `ext_intel_partition_by_cslice`. + +* `info::device::max_compute_units` ++ +When partitioning by `ext_intel_partition_by_cslice`, each sub-device +represents a fixed subset of the parent device's compute units. This query +returns the number of compute units represented by the sub-device. + +The remaining device info queries return the properties or limits of the +sub-device, as is typical for these queries. In general, if a resource is +partitioned among the sub-devices, then the associated info query will +return each sub-device's share of the resource. However, if a resource is +shared by all of the sub-devices, then the associated info query for each +sub-device will return the same value as for the parent device. For example, +if device global memory is shared by all cslice partitions in a tile, then the +info query `info::device::global_mem_size` will return the same value for the +`device` object representing the tile as for the `device` object representing +a cslice. + +=== Behavior of the Level Zero backend interop functions + +The Level Zero device driver doesn't use the concept of sub-device to represent +a fixed partition of cslices. Instead, a Level Zero command queue can be +created with a particular queue index, which represents a partition of the +cslices. + +As a result, calling `get_native` for a SYCL `device` that represents a cslice +partition returns the same `ze_device_handle_t` as the parent device. If an +application wants a native handle to the cslice partition, it must create a +SYCL `queue` and then call `get_native` on the `queue`. This will return a +`ze_command_queue_handle_t` that corresponds to the cslice partition. + +=== Behavior of the OpenCL backend interop functions + +The OpenCL device driver doesn't use the concept of sub-device to represent a +fixed partition of cslices. Instead, an OpenCL command queue can be created +with a particular queue index, which represents a partition of the cslices. + +As a result, calling `get_native` for a SYCL `device` that represents a cslice +partition returns the same `cl_device_id` as the parent device. If an +application wants a native handle to the cslice partition, it must create a +SYCL `queue` and then call `get_native` on the `queue`. This will return a +`cl_command_queue` that corresponds to the cslice partition. + + +== Impact on the ONEAPI_DEVICE_SELECTOR environment variable + +:oneapi-device-selector: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md#oneapi_device_selector + +This section describes the effect of this extension on the {dpcpp} +`ONEAPI_DEVICE_SELECTOR` environment variable. Since this environment variable +is not part of the SYCL specification, this section is not a normative part of +the extension specification. Rather, it only describes the impact on {dpcpp}. + +As described in the {oneapi-device-selector}[documentation] for the +`ONEAPI_DEVICE_SELECTOR`, a term in the selector string can be an integral +number followed by a decimal point (`.`), where the decimal point indicates a +sub-device. For example, `1.2` means sub-device #2 of device #1. These +decimal points can represent either a sub-device created via +`partition_by_affinity_domain` or via `ext_intel_partition_by_cslice`. When +{dpcpp} processes a term with a decimal point, it first attempts to partition +by `ext_intel_partition_by_cslice`. If that is not possible, it next attempts +to partition by `partition_by_affinity_domain` / +`partition_affinity_domain::next_partitionable`. + +It is important to keep in mind, though, that requesting a specific cslice via +this environment variable has limited effect due to the device driver's +virtualization of cslices. To illustrate, consider an example where two +processes are launched as follows, selecting different cslice sub-devices: + +``` +$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo +$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.1 ZEX_NUMBER_OF_CCS=0:2 ./foo +``` + +The first process selects cslice #0 while the second selects cslice #1. This +does have the effect that each process is constrained to a single cslice (which +is not the {dpcpp} default). However, the actual cslice number is irrelevant. +Because of cslice virtualization, the device driver will choose some available +cslice for each process, ignoring the value requested in the environment +variable. As a result, the following example has exactly the same effect: + +``` +$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo +$ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo +``` From 5222e38d3cfffe0cb0ad0919494764bd1bc6fb60 Mon Sep 17 00:00:00 2001 From: Greg Lueck Date: Wed, 23 Nov 2022 13:34:50 -0500 Subject: [PATCH 2/8] Reword to be PVC specific It turns out that ATS-M does not support this type of partitioning, so reword the overview to note that PVC is the only supported device. --- .../proposed/sycl_ext_intel_cslice.asciidoc | 38 +++++++++---------- 1 file changed, 18 insertions(+), 20 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc index 0452f4a1c8d51..098073181d2c6 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc @@ -55,11 +55,11 @@ not rely on APIs defined in this specification.* :multi-CCS: https://github.com/intel/compute-runtime/blob/master/level_zero/doc/experimental_extensions/MULTI_CCS_MODES.md Some Intel GPU devices can be partitioned at a granularity of "cslice" (compute -slice), which is a smaller granularity than "tile". At present, the GPU device -drivers don't expose this mode by default, so this form of partitioning is -considered an advanced feature which most applications are not expected to use. -This extension provides a way for these advanced applications to partition a -device by cslice when it is enabled in the device driver. +slice), which is a smaller granularity than "tile". This form of partitioning +is not currently enabled by default, so it is considered an advanced feature +which most applications are not expected to use. This extension provides a way +for these advanced applications to partition a device by cslice when it is +enabled in the device driver. Unlike "tile" partitions, a cslice partition does not have any different cache affinity from its sibling cslice partitions. Therefore, this extension does @@ -68,14 +68,19 @@ not expose this type of partitioning through new partitioning type `info::partition_property::ext_intel_partition_by_cslice`. -Intel GPU devices that support this type of partitioning currently support it -only at the "tile" level. Therefore, a device with multiple tiles (e.g. PVC) -must first be partitioned into per-tile sub-devices via -`partition_by_affinity_domain`, and then each of the -resulting sub-devices can be further partitioned by -`ext_intel_partition_by_cslice`. Single-tile devices (e.g. ATS-M) can be -directly partitioned by `ext_intel_partition_by_cslice` (for those ATS-M parts -that have multiple cslice partitions). +The only Intel GPU device that currently supports this type of partitioning is +PVC, and this support is only available when the device driver is configured in +{multi-CCS}[multi-CCS] mode. See that documentation for instructions on how to +enable this mode and for other important information. Currently, it is only +possible to partition a device by cslice if the driver is in "2 CCS Mode" or +"4 CCS Mode". When in 2 CCS Mode, a tile can be partitioned into two cslice +sub-devices. When in 4 CCS Mode, a tile can be partitioned into four cslice +sub-devices. + +This type of partitioning is currently supported only at the "tile" level. +Therefore, a device must first be partitioned into per-tile sub-devices via +`partition_by_affinity_domain`, and then each of the resulting sub-devices can +be further partitioned by `ext_intel_partition_by_cslice`. It is important to understand that the device driver virtualizes work submission to the cslice sub-devices. This virtualization happens only between @@ -92,13 +97,6 @@ kernels submitted from the second process. In this second case, the device driver binds the process's requested cslice to a physical cslice according to the overall system load. -For information about configuring the device driver to support cslice -partitioning, see the driver documentation on {multi-CCS}[multi-CCS] mode. -Currently, it is only possible to partition a device by cslice if the driver is -in "2 CCS Mode" or "4 CCS Mode". When in 2 CCS Mode, a tile can be partitioned -into two cslice sub-devices. When in 4 CCS Mode, a tile can be partitioned -into four cslice sub-devices. - Note that this extension can be supported by any backend and any device. If a backend or device does not support the concept of cslice partitions, it can still conform to this extension by declaring the new enumerator and member From 0eb19c64d9ee62af7b497ef2aa81a52442b0370d Mon Sep 17 00:00:00 2001 From: Greg Lueck Date: Tue, 29 Nov 2022 16:54:40 -0500 Subject: [PATCH 3/8] Suggestion from Ben Co-authored-by: Ben Ashbaugh --- sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc index 098073181d2c6..0cf8481fcc5cf 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc @@ -97,12 +97,12 @@ kernels submitted from the second process. In this second case, the device driver binds the process's requested cslice to a physical cslice according to the overall system load. -Note that this extension can be supported by any backend and any device. If a -backend or device does not support the concept of cslice partitions, it can +Note that this extension can be supported by any implementation. If an implementation supports a +backend or device without the concept of cslice partitions it can still conform to this extension by declaring the new enumerator and member functions specified below. If the info descriptor query `info::device::partition_properties` does not report -`ext_intel_partition_by_cslice`, then the implementation need not support +`ext_intel_partition_by_cslice`, then the backend or device does not support the creation of cslice partitions. From 81b6ea1c8fe23b79e3a56ceb8cf09a1ab8101fa3 Mon Sep 17 00:00:00 2001 From: Greg Lueck Date: Tue, 29 Nov 2022 16:56:54 -0500 Subject: [PATCH 4/8] Reformat to 80 columns --- .../extensions/proposed/sycl_ext_intel_cslice.asciidoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc index 0cf8481fcc5cf..6ba76771cb0bd 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc @@ -97,10 +97,10 @@ kernels submitted from the second process. In this second case, the device driver binds the process's requested cslice to a physical cslice according to the overall system load. -Note that this extension can be supported by any implementation. If an implementation supports a -backend or device without the concept of cslice partitions it can -still conform to this extension by declaring the new enumerator and member -functions specified below. If the info descriptor query +Note that this extension can be supported by any implementation. If an +implementation supports a backend or device without the concept of cslice +partitions it can still conform to this extension by declaring the new +enumerator and member functions specified below. If the info descriptor query `info::device::partition_properties` does not report `ext_intel_partition_by_cslice`, then the backend or device does not support the creation of cslice partitions. From 6d7448a78983f7ddcb1af0d6e33903e2ceb71dfa Mon Sep 17 00:00:00 2001 From: Greg Lueck Date: Tue, 29 Nov 2022 17:08:32 -0500 Subject: [PATCH 5/8] Suggesting from Jaime --- sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc index 6ba76771cb0bd..25760e9e8a2c6 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc @@ -264,8 +264,9 @@ The first process selects cslice #0 while the second selects cslice #1. This does have the effect that each process is constrained to a single cslice (which is not the {dpcpp} default). However, the actual cslice number is irrelevant. Because of cslice virtualization, the device driver will choose some available -cslice for each process, ignoring the value requested in the environment -variable. As a result, the following example has exactly the same effect: +cslice for each process instead of honoring the value requested in the +environment variable. As a result, the following example has exactly the same +effect: ``` $ ONEAPI_DEVICE_SELECTOR=level_zero:0.1.0 ZEX_NUMBER_OF_CCS=0:2 ./foo From b93f2fe32c36f7b0f48d8ce5ce49d5da4cb01e3c Mon Sep 17 00:00:00 2001 From: Greg Lueck Date: Mon, 5 Dec 2022 11:07:16 -0500 Subject: [PATCH 6/8] Address comment from Jaime --- .../proposed/sycl_ext_intel_cslice.asciidoc | 28 ++++++++++--------- 1 file changed, 15 insertions(+), 13 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc index 25760e9e8a2c6..34249d5d09962 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc @@ -83,19 +83,21 @@ Therefore, a device must first be partitioned into per-tile sub-devices via be further partitioned by `ext_intel_partition_by_cslice`. It is important to understand that the device driver virtualizes work -submission to the cslice sub-devices. This virtualization happens only between -processes, and not within a single process. For example, consider a single -process that constructs two SYCL queues on cslice sub-device #0. Kernels -submitted to these two queues are guaranteed to conflict, both using the same -set of execution units. Therefore, if a single process wants to explicitly -submit kernels to cslice sub-devices and it wants to avoid conflict, it should -create queues on different sub-devices. By contrast, consider an example where -two separate processes create a SYCL queue on cslice sub-device #0. In this -case, the device driver virtualizes access to this cslice, and kernels -submitted from the first process may run on different execution units than -kernels submitted from the second process. In this second case, the device -driver binds the process's requested cslice to a physical cslice according to -the overall system load. +submission to the cslice sub-devices. (More specifically, the device driver +virtualizes work submission to different CCS-es, and this means that on PVC +the work submission to a cslice is virtualized.) This virtualization happens +only between processes, and not within a single process. For example, consider +a single process that constructs two SYCL queues on cslice sub-device #0. +Kernels submitted to these two queues are guaranteed to conflict, both using +the same set of execution units. Therefore, if a single process wants to +explicitly submit kernels to cslice sub-devices and it wants to avoid conflict, +it should create queues on different sub-devices. By contrast, consider an +example where two separate processes create a SYCL queue on cslice sub-device +#0. In this case, the device driver virtualizes access to this cslice, and +kernels submitted from the first process may run on different execution units +than kernels submitted from the second process. In this second case, the +device driver binds the process's requested cslice to a physical cslice +according to the overall system load. Note that this extension can be supported by any implementation. If an implementation supports a backend or device without the concept of cslice From 17ca6b61615414fae358eb8a821308c90eeb358e Mon Sep 17 00:00:00 2001 From: Greg Lueck Date: Thu, 8 Dec 2022 17:14:03 -0500 Subject: [PATCH 7/8] Use correct marketing name for PVC --- .../proposed/sycl_ext_intel_cslice.asciidoc | 44 +++++++++---------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc index 34249d5d09962..91b8df12416dc 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc @@ -68,14 +68,14 @@ not expose this type of partitioning through new partitioning type `info::partition_property::ext_intel_partition_by_cslice`. -The only Intel GPU device that currently supports this type of partitioning is -PVC, and this support is only available when the device driver is configured in -{multi-CCS}[multi-CCS] mode. See that documentation for instructions on how to -enable this mode and for other important information. Currently, it is only -possible to partition a device by cslice if the driver is in "2 CCS Mode" or -"4 CCS Mode". When in 2 CCS Mode, a tile can be partitioned into two cslice -sub-devices. When in 4 CCS Mode, a tile can be partitioned into four cslice -sub-devices. +The only Intel GPU devices that currently support this type of partitioning +are the Data Center GPU Max series (aka PVC), and this support is only +available when the device driver is configured in {multi-CCS}[multi-CCS] mode. +See that documentation for instructions on how to enable this mode and for +other important information. Currently, it is only possible to partition a +device by cslice if the driver is in "2 CCS Mode" or "4 CCS Mode". When in +2 CCS Mode, a tile can be partitioned into two cslice sub-devices. When in +4 CCS Mode, a tile can be partitioned into four cslice sub-devices. This type of partitioning is currently supported only at the "tile" level. Therefore, a device must first be partitioned into per-tile sub-devices via @@ -84,20 +84,20 @@ be further partitioned by `ext_intel_partition_by_cslice`. It is important to understand that the device driver virtualizes work submission to the cslice sub-devices. (More specifically, the device driver -virtualizes work submission to different CCS-es, and this means that on PVC -the work submission to a cslice is virtualized.) This virtualization happens -only between processes, and not within a single process. For example, consider -a single process that constructs two SYCL queues on cslice sub-device #0. -Kernels submitted to these two queues are guaranteed to conflict, both using -the same set of execution units. Therefore, if a single process wants to -explicitly submit kernels to cslice sub-devices and it wants to avoid conflict, -it should create queues on different sub-devices. By contrast, consider an -example where two separate processes create a SYCL queue on cslice sub-device -#0. In this case, the device driver virtualizes access to this cslice, and -kernels submitted from the first process may run on different execution units -than kernels submitted from the second process. In this second case, the -device driver binds the process's requested cslice to a physical cslice -according to the overall system load. +virtualizes work submission to different CCS-es, and this means that on Data +Center GPU Max series devices the work submission to a cslice is virtualized.) +This virtualization happens only between processes, and not within a single +process. For example, consider a single process that constructs two SYCL +queues on cslice sub-device #0. Kernels submitted to these two queues are +guaranteed to conflict, both using the same set of execution units. Therefore, +if a single process wants to explicitly submit kernels to cslice sub-devices +and it wants to avoid conflict, it should create queues on different +sub-devices. By contrast, consider an example where two separate processes +create a SYCL queue on cslice sub-device #0. In this case, the device driver +virtualizes access to this cslice, and kernels submitted from the first process +may run on different execution units than kernels submitted from the second +process. In this second case, the device driver binds the process's requested +cslice to a physical cslice according to the overall system load. Note that this extension can be supported by any implementation. If an implementation supports a backend or device without the concept of cslice From 1b251183e46f66d823c0ba75e644b51666715036 Mon Sep 17 00:00:00 2001 From: jbrodman Date: Fri, 9 Dec 2022 15:05:32 -0500 Subject: [PATCH 8/8] Update sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc Co-authored-by: Ben Ashbaugh --- .../doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc index 91b8df12416dc..2e53585c710cf 100644 --- a/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc +++ b/sycl/doc/extensions/proposed/sycl_ext_intel_cslice.asciidoc @@ -78,8 +78,11 @@ device by cslice if the driver is in "2 CCS Mode" or "4 CCS Mode". When in 4 CCS Mode, a tile can be partitioned into four cslice sub-devices. This type of partitioning is currently supported only at the "tile" level. -Therefore, a device must first be partitioned into per-tile sub-devices via -`partition_by_affinity_domain`, and then each of the resulting sub-devices can +A device should be queried using `info::device::partition_properties` to +determine if it supports partitioning by `ext_intel_partition_by_cslice`. If a +device does not support partitioning by `ext_intel_partition_by_cslice` it may +first need to be partitioned into per-tile sub-devices via +`partition_by_affinity_domain`, and then each of the resulting sub-devices may be further partitioned by `ext_intel_partition_by_cslice`. It is important to understand that the device driver virtualizes work