From 49b6749ea9175ae250b718c04d71af4ccfecc06c Mon Sep 17 00:00:00 2001 From: Dounia Date: Thu, 15 Apr 2021 09:50:14 -0700 Subject: [PATCH 01/10] [SYCL][Matrix]Add spec document for the matrix extension interface and its first implementation for AMX Signed-off-by: Dounia --- sycl/doc/extensions/Matrix/README.md | 2 + .../Matrix/dpcpp-joint-matrix.asciidoc | 223 ++++++++++++++++++ 2 files changed, 225 insertions(+) create mode 100644 sycl/doc/extensions/Matrix/README.md create mode 100644 sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc diff --git a/sycl/doc/extensions/Matrix/README.md b/sycl/doc/extensions/Matrix/README.md new file mode 100644 index 0000000000000..a930e3598ca88 --- /dev/null +++ b/sycl/doc/extensions/Matrix/README.md @@ -0,0 +1,2 @@ +# Matrix Programming Extension for DPC++ +`matrix` is a new experimental DPC++ extension to provide unified matrix programming on different tensor hardware. The current implementation provides support of the matrix interface using Intel(R) Advanced Matrix Extensions (AMX). \ No newline at end of file diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc new file mode 100644 index 0000000000000..9a8b3a950a88b --- /dev/null +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -0,0 +1,223 @@ +# Matrix Programming Extension for DPC++: SYCL_EXT_ONEAPI_MATRIX +:source-highlighter: coderay +:coderay-linenums-mode: table + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en + +:blank: pass:[ +] + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +Copyright (c) 2021-2021 Intel Corporation. All rights reserved. 
+
+NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
used by permission by Khronos.

This extension is written against the SYCL 2020 revision 3 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.


**_NOTE:_** _This document describes the current design and API for the matrix
extension to DPC++. This is an initial experimental version to try out functionality
and performance. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX). We are going to work with the community on incrementally improving
the API to bring them closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._

## Introduction
This document presents ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: AMX in Intel CPUs, Habana Gaudi and Goya tensor and gemm cores, NVIDIA Tensor Cores, IBM Power MMA. All of these hardware units provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different hardware units can offer.

## Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification section 6.3.3 "Feature test macros". Therefore, an
implementation supporting this extension must predefine the macro
`SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below.
Applications can test for the existence of this macro to determine if the
implementation supports this feature, or applications can test the macro's
value to determine which of the extension's APIs the implementation supports.
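For illustration, application code can query the feature-test macro through a small helper (a sketch, not part of the extension; the function name is an assumption):

```cpp
// Illustrative sketch: query the feature-test macro from application code.
// SYCL_EXT_ONEAPI_MATRIX is predefined by an implementation that supports
// this extension; on a compiler without the extension it is simply undefined.
constexpr int matrix_extension_version() {
#if defined(SYCL_EXT_ONEAPI_MATRIX)
  return SYCL_EXT_ONEAPI_MATRIX;  // e.g. 1: initial AMX implementation
#else
  return 0;  // extension not available
#endif
}
```

An application can then dispatch at compile time, e.g. `if constexpr (matrix_extension_version() >= 1) { ... }`.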
+ +[frame="none",options="header"] +|====================== +|Value |Description +|1 |Initial extension implementation on AMX. Base features are supported. +|====================== + +### New `matrix` class +We introduce a new class called `matrix`. The user needs to specify the type of the elements, sizes, and the memory layout. + +The same class `matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. + +Layout is necessary on the type to be able to calculate the physical offset if the user needs to access a single entry for some purpose. Besides row major and column major layout, `matrix_layout` is flexible enough to introduce customed layouts such as symmetric or tiled layouts. AMX hardware requires A and B to be in VNNI or 32 bits packed layout. If users did not specify these layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. + +```c++ +template +struct matrix; +enum class matrix_layout { + row_major, + col_major, + packed_a, + packed_b +}; +``` +### Matrix Operations +We define three new functions needed to perform the main and common operations on matrices. This set of functions can be easily extended if the tensor hardware implements new features. +The base pointer determines the starting address of the matrix to be loaded/stored. 
+`layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). + +```c++ +void matrix_load(matrix<>A, T *base, unsigned stride, matrix_layout l = row_major); +``` +This function loads data from memory to the 2d tiles of AMX that is a 2d storage. +```c++ +void matrix_store(matrix<>A, T *base, unsigned stride, matrix_layout l = row_major); +``` +This function stores the data from the 2d tiles back to memory. +```c++ +matrix<> matrix_mad(matrix<>A, matrix<>B, matrix<>C); +``` + +The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. + +## Integration with DPC++ +When using the matrix interface in a DPC++ kernel, additional semantics have to be added to define the memory scope of the matrix object and the execution scope of its operations. In the context of DPC++, `matrix` is distributed among an execution unit. In practice this can be one work-item, the work-items in a sub-group, or the work-items in a work-group. + +Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a non-diverged control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by all the work items in a group. + +### Memory Scope +For the memory scope, we have two solutions. The long term solution is to use the proposed https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LocalMemory/SYCL_INTEL_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below. 
+ +```c++ +multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); +``` +However, sub-group local memory is not yet well defined in DPC++. Moreover, the representation of this notion in LLVM IR and SPIR-V is not yet clear. Hence, for this proposal, we will proceed with adding the memory scope as an additional constructor argument as follows: + +```c++ +joint_matrix tA(sg); +``` +### Execution Scope +To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The DPC++ syntax is the following: + +```c++ +void joint_matrix_load(Group g, matrix<>A, T *base, unsigned stride, matrix_layout l = row_major); +``` +```c++ +void joint_matrix_store(Group g, matrix<>A, T *base, unsigned stride, matrix_layout l = row_major); +``` +```c++ +matrix<> joint_matrix_mad(Group g, matrix<>A, matrix<>B, matrix<>C); +``` + +## Example using int8_t type +```c++ +using namespace cl::sycl::ext::intel::matrix; + +queue q; +range<2> G = {M, N}; +// For this first implementation, SG_SIZE has to be equal to one +range<2> L = {1, SG_SIZE}; +int8_t *memA = malloc_shared(M*K, q); +int8_t *memB = malloc_shared(K*N, q); +Int32_t *memC = malloc_shared(M*N, q); +//Assuming memB has already been VNNIed +q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) + [[sycl::reqd_sub_group_size(SG_SIZE)]] { + const auto global_idx = item.get_global_id(0); + const auto global_idy = item.get_global_id(1); + const auto sg_startx = global_idx - item.get_local_id(0); + const auto sg_starty = global_idy - item.get_local_id(1); + sub_group sg = item.get_sub_group(); + joint_matrix tA(sg); + // For B, since current implementation does not support non packed layout, + // users need to specify the updated VNNI sizes along with the packed_b layout + joint_matrix tB(sg); + joint_matrix tC(sg); + joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty, N, row_major); + for (int k = 0; 
k < K; k += tk) {
+      joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, row_major);//collective
+      joint_matrix_load(sg, tB, memB + k * N + sg_starty, N, packed_b);//VNNI
+      tC = joint_matrix_mad(sg, tA, tB, tC);
+    }
+    joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty, N, row_major);
+}).wait();
+
+```
+## Implementation Status
+For oneAPI release 3, an AOT implementation is available on the CPU device to target AMX hardware. We are using AMX tile intrinsics to implement the matrix load and store operations. Since we are currently emitting AMX intrinsics directly, this only enables AOT compilation. Please refer to the section below on the future unified SPIR-V path that will enable JIT compilation.
+// We used the https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AMX[`_tile_`-prefixed intrinsics] defined in `immintrin.h`.
+
+Currently, this is the compilation command line needed to invoke the AMX unit of a Sapphire Rapids CPU:
+
+```c++
+clang++ -fsycl -march=sapphirerapids -fsycl-targets="spir64_x86_64-unknown-linux-sycldevice" -O2 matmul-int8.cpp -o matmul-int8
+```
+
+### Current Implementation Restrictions
+#### Type, Sizes, and Layouts
+The types supported by this AMX implementation are restricted to the types that the AMX hardware supports. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this current implementation can handle any size. If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile. A performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be smaller than 16x64 bytes and the A and B matrices should already be packed (put in VNNI format).
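This size restriction can be made concrete with a small sketch (the helper below is hypothetical and not part of the extension):

```cpp
#include <cstddef>

// Hypothetical helper (not part of the extension): a tile maps to a single
// AMX 2d tile register only if it has at most 16 rows and at most 64 bytes
// per row (1024 bytes total); anything larger falls back to memory.
constexpr bool fits_in_amx_tile(std::size_t rows, std::size_t cols,
                                std::size_t elem_bytes) {
  return rows <= 16 && cols * elem_bytes <= 64;
}
```

For example, a 16x64 tile of int8 elements fits in a tile register, while a 16x64 tile of int32 elements does not.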
+
+More specifically, the operation `C = A*B + C` can be performed on AMX with this interface where:
+A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)
+or
+A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major)
+
+#### Memory and Execution Scope
+This current implementation only considers a sub-group scope. However, the sub-group size has to be equal to one in this first implementation.
+
+## Future Work: Unified LLVM IR and SPIR-V JIT Enabling
+To enable JIT compilation, a unified matrix IR needs to be added. Currently, there is no matrix type in LLVM IR or SPIR-V. We are working towards adding a new matrix type in both LLVM IR and SPIR-V. This JIT enabling is expected to be part of a future compiler release.
+
+### LLVM IR Extension
+As a short-term solution, we are extending the https://llvm.org/docs/LangRef.html#llvm-matrix-transpose-intrinsic[existing LLVM IR matrix intrinsics] to include features like the VNNI layout. The current matrix intrinsics use flattened vectors to represent the matrix. Therefore, we are exploring both adding a matrix type to LLVM IR and using the MLIR `vector` dialect for this work.
+
+### SPIR-V Extension
+The current draft proposal can be found https://gitlab.devtools.intel.com/OpenCL/opencl-extension-drafts/-/blob/master/SPV_INTEL_matrix.asciidoc[here].
+We are adding translation from the LLVM IR matrix to the SPIR-V matrix and vice versa in the LLVM to SPIR-V translator tool.
+
+## VNNI/Packed Layout
+AMX compute assumes the register for the B tile (src1) to be in VNNI format, as 32 bits of K-data in A and B need to be contiguous in memory.
+The VNNI blocking factor is 2 for 16-bit elements and 4 for 8-bit elements. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform.
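The repacking transform can be sketched as a standalone helper (illustrative only; the extension expects the user to have done this packing, and `pack_vnni` is not part of the API):

```cpp
#include <vector>

// Illustrative repacking helper (not part of the API). Repacks a row-major
// K x N matrix into VNNI layout with blocking factor F (2 for 16-bit
// elements, 4 for 8-bit elements): rows are grouped F at a time so that
// 32 bits of K-data become contiguous in memory. Element (k, n) moves to
// row k/F, column n*F + k%F of a (K/F) x (N*F) view.
// Assumes K is a multiple of F.
template <typename T, int F>
std::vector<T> pack_vnni(const std::vector<T> &b, int K, int N) {
  std::vector<T> packed(b.size());
  for (int k = 0; k < K; ++k)
    for (int n = 0; n < N; ++n)
      packed[(k / F) * (N * F) + n * F + (k % F)] = b[k * N + n];
  return packed;
}
```

Applied to the bf16 example below (F = 2), rows (a1..d1) and (a2..d2) interleave into a1, a2, b1, b2, c1, c2, d1, d2.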
+ + + // Example of bf16 data type: + // --------------------------------- + // a1, b1, c1, d1 + // a2, b2, c2, d2 + // a3, b3, c3, d3 + // a4, b4, c4, d4 + // --------------------------------- + // reformat to + // --------------------------------- + // a1, a2, b1, b2, c1, c2, d1, d2 + // a3, a4, b3, b4, c3, c4, d3, d4 + + + +## Open Questions +- Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added? +- Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI Layout in matrix A and VNNI layout in matrix B of a matrix multiply and add operation on AMX? +- Ronan Keyrell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?" + +## TODO List +- Handle sub group sizes that are bigger than one. +- Add support for queries that gives information about the capabilities of the implementation on a particular device. +- Once the SPIRV translator work is done, this code generation work will move to the backend along enabling JIT compilation. + +## Revision History + +[frame="none",options="header"] +|====================== +|Rev |Date |Author |Changes +|1 |2021-04-13 |Dounia Khaldi |Initial public working draft. 
+|====================== From 37456c9591116d1640ec8e0917f28fd0b5e33ccf Mon Sep 17 00:00:00 2001 From: Dounia Date: Tue, 1 Jun 2021 13:26:11 -0700 Subject: [PATCH 02/10] [SYCL][Matrix] Incorporate feedback related to: adding the API to namespace sycl::ext::intel::experimental::matrix, remove the C++ specific API from the document, better formatting Signed-off-by: Dounia --- .../Matrix/dpcpp-joint-matrix.asciidoc | 86 ++++++++++++------- 1 file changed, 54 insertions(+), 32 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index 9a8b3a950a88b..d60c298a2802c 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -1,6 +1,7 @@ # Matrix Programming Extension for DPC++: SYCL_EXT_ONEAPI_MATRIX :source-highlighter: coderay :coderay-linenums-mode: table +:dpcpp: pass:[DPC++] // This section needs to be after the document title. :doctype: book @@ -31,7 +32,7 @@ SYCL specification refer to that revision. **_NOTE:_** _This document describes the current design and API for the matrix -extension to DPC++. This is an initial experimental version to try out functionality +extension to {dpcpp}. This is an initial experimental version to try out functionality and performance. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX). We are going to work with the community on incrementally improving the API to bring them closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._ @@ -54,15 +55,17 @@ value to determine which of the extension's APIs the implementation supports. |1 |Initial extension implementation on AMX. Base features are supported. |====================== -### New `matrix` class -We introduce a new class called `matrix`. 
The user needs to specify the type of the elements, sizes, and the memory layout. +## New `matrix` class +We introduce a new class called `matrix`. The user needs to specify the type of the elements, sizes, and the memory layout.`matrix` is distributed among an execution unit. In practice this can be one work-item, the work-items in a sub-group, or the work-items in a work-group. + The same class `matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. Layout is necessary on the type to be able to calculate the physical offset if the user needs to access a single entry for some purpose. Besides row major and column major layout, `matrix_layout` is flexible enough to introduce customed layouts such as symmetric or tiled layouts. AMX hardware requires A and B to be in VNNI or 32 bits packed layout. If users did not specify these layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. ```c++ -template +namespace sycl::ext::intel::experimental::matrix { +template struct matrix; enum class matrix_layout { row_major, @@ -70,58 +73,77 @@ enum class matrix_layout { packed_a, packed_b }; +} ``` -### Matrix Operations -We define three new functions needed to perform the main and common operations on matrices. This set of functions can be easily extended if the tensor hardware implements new features. 
-The base pointer determines the starting address of the matrix to be loaded/stored. -`layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). + + +### Memory Scope +For the memory scope, we have two solutions. The long term solution is to use the proposed https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LocalMemory/SYCL_INTEL_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below. ```c++ -void matrix_load(matrix<>A, T *base, unsigned stride, matrix_layout l = row_major); +multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); ``` -This function loads data from memory to the 2d tiles of AMX that is a 2d storage. +However, sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not yet clear. Hence, for this proposal, we use `joint_matrix` instead of `matrix` to emphasis that the matrix is shared among a group of work items and is not private to each work item. Moreover, the memory scope is added as an additional constructor argument. This results into the following description: + ```c++ -void matrix_store(matrix<>A, T *base, unsigned stride, matrix_layout l = row_major); +namespace sycl::ext::intel::experimental::matrix { +template +struct joint_matrix { + joint_matrix(Group g) {} +}; +} ``` -This function stores the data from the 2d tiles back to memory. +When the group is a `sycl::sub_group`, a matrix is declared as follows: + ```c++ -matrix<> matrix_mad(matrix<>A, matrix<>B, matrix<>C); +joint_matrix tA(sg); ``` -The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. 
-## Integration with DPC++ -When using the matrix interface in a DPC++ kernel, additional semantics have to be added to define the memory scope of the matrix object and the execution scope of its operations. In the context of DPC++, `matrix` is distributed among an execution unit. In practice this can be one work-item, the work-items in a sub-group, or the work-items in a work-group. +## Matrix Operations and their Execution Scope +We define three new functions needed to perform the main and common operations on matrices namely, load, store, and the actual multiply and add operation. This set of functions can be easily extended if the tensor hardware implements new features. + +The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a non-diverged control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by all the work items in a group. -### Memory Scope -For the memory scope, we have two solutions. The long term solution is to use the proposed https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LocalMemory/SYCL_INTEL_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below. +To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. 
The {dpcpp} syntax is the following: ```c++ -multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg); +namespace sycl::ext::intel::experimental::matrix { +template + void joint_matrix_load(Group sg, joint_matrix &res, + multi_ptr src, size_t stride, matrix_layout l = row_major); +} ``` -However, sub-group local memory is not yet well defined in DPC++. Moreover, the representation of this notion in LLVM IR and SPIR-V is not yet clear. Hence, for this proposal, we will proceed with adding the memory scope as an additional constructor argument as follows: +This function loads data from memory to the 2d tiles of AMX that is a 2d storage. ```c++ -joint_matrix tA(sg); +namespace sycl::ext::intel::experimental::matrix { +template + void joint_matrix_store(Group sg, matrix &res, + multi_ptr src, size_t stride, matrix_layout l = row_major); +} ``` -### Execution Scope -To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The DPC++ syntax is the following: +This function stores the data from the 2d tiles back to memory. ```c++ -void joint_matrix_load(Group g, matrix<>A, T *base, unsigned stride, matrix_layout l = row_major); -``` -```c++ -void joint_matrix_store(Group g, matrix<>A, T *base, unsigned stride, matrix_layout l = row_major); -``` -```c++ -matrix<> joint_matrix_mad(Group g, matrix<>A, matrix<>B, matrix<>C); +namespace sycl::ext::intel::experimental::matrix { + template + joint_matrix joint_matrix_mad(Group sg, joint_matrix A, + joint_matrix B, joint_matrix C); +} ``` +The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. 
+ + ## Example using int8_t type ```c++ -using namespace cl::sycl::ext::intel::matrix; +using namespace sycl::ext::intel::experimental::matrix; queue q; range<2> G = {M, N}; @@ -207,7 +229,7 @@ The VNNI blocking factor is 2 in the case of 16bits, 4 in the case of 8 bits ele ## Open Questions - Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added? - Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI Layout in matrix A and VNNI layout in matrix B of a matrix multiply and add operation on AMX? -- Ronan Keyrell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?" +- Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?" ## TODO List - Handle sub group sizes that are bigger than one. 
From fa675c3292ff478ee8d1e28c1a073691171f09f8 Mon Sep 17 00:00:00 2001 From: Dounia Date: Fri, 11 Jun 2021 12:11:41 -0700 Subject: [PATCH 03/10] [SYCL][Matrix] Incorporate feedback related to: Move future looking description to a different section towards the end Signed-off-by: Dounia --- .../Matrix/dpcpp-joint-matrix.asciidoc | 100 +++++++++++------- 1 file changed, 61 insertions(+), 39 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index d60c298a2802c..59a58d0e3bb8b 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -55,35 +55,23 @@ value to determine which of the extension's APIs the implementation supports. |1 |Initial extension implementation on AMX. Base features are supported. |====================== -## New `matrix` class -We introduce a new class called `matrix`. The user needs to specify the type of the elements, sizes, and the memory layout.`matrix` is distributed among an execution unit. In practice this can be one work-item, the work-items in a sub-group, or the work-items in a work-group. +## New `joint_matrix` class +We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix. + +//`joint_matrix` is distributed among an execution unit. In practice this can be one work-item, the work-items in a sub-group, or the work-items in a work-group. + +#### Memory Scope +In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasis that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. 
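The described shape can be illustrated with a self-contained mock (not the real extension types; names and parameters are assumptions): the memory scope type appears both in the template parameter list and in the constructor.

```cpp
#include <cstddef>

// Mock of the described joint_matrix shape (illustration only): the memory
// scope `Group` is both a template parameter and a constructor argument.
struct sub_group {};

template <typename Group, typename T, std::size_t NumRows, std::size_t NumCols>
struct joint_matrix {
  Group scope;  // the group that jointly owns the matrix
  explicit joint_matrix(Group g) : scope(g) {}
};
```

A sub-group scoped 8x32 tile of int8 data would then be declared along the lines of `joint_matrix<sub_group, int8_t, 8, 32> tA(sg);`.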
-The same class `matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. - -Layout is necessary on the type to be able to calculate the physical offset if the user needs to access a single entry for some purpose. Besides row major and column major layout, `matrix_layout` is flexible enough to introduce customed layouts such as symmetric or tiled layouts. AMX hardware requires A and B to be in VNNI or 32 bits packed layout. If users did not specify these layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. +#### Shape +The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. -```c++ -namespace sycl::ext::intel::experimental::matrix { -template -struct matrix; -enum class matrix_layout { - row_major, - col_major, - packed_a, - packed_b -}; -} -``` +#### Layout +Layout is necessary on the type to be able to calculate the physical offset if the user needs to access a single entry for some purpose. 
Besides row major and column major layout, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts.
+This results in the following description:

-### Memory Scope
-For the memory scope, we have two solutions. The long term solution is to use the proposed https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LocalMemory/SYCL_INTEL_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below.
-
-```c++
-multi_ptr, address_space::local_space> tA_ptr = group_local_memory>(sg);
-```
-However, sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not yet clear. Hence, for this proposal, we use `joint_matrix` instead of `matrix` to emphasis that the matrix is shared among a group of work items and is not private to each work item. Moreover, the memory scope is added as an additional constructor argument. This results into the following description:

```c++
namespace sycl::ext::intel::experimental::matrix {
-template
+template
struct joint_matrix {
  joint_matrix(Group g) {}
};
@@ -100,6 +88,22 @@ joint_matrix tA(sg);
```

+AMX hardware requires A and B to be in VNNI or 32-bit packed layout. If users did not specify these layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the layouts `packed_a` and `packed_b` so that the user can specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below.
+
+```c++
+namespace sycl::ext::intel::experimental::matrix {
+enum class matrix_layout {
+  row_major,
+  col_major,
+  packed_a,
+  packed_b
+};
+}
+```
+
+
+
 ## Matrix Operations and their Execution Scope
 We define three new functions needed to perform the main and common operations on matrices, namely load, store, and the actual multiply and add operation.
This set of functions can be easily extended if the tensor hardware implements new features. @@ -109,6 +113,7 @@ Since the matrix functions are group operations (as defined in Section 4.17.3 of To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: +#### Load ```c++ namespace sycl::ext::intel::experimental::matrix { template , address_space::local_space> tA_ptr = group_local_memory>(sg); +``` +We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet. + ## Open Questions - Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added? From 8e94952d6c638c9e807bc532ddaf018cb8a33904 Mon Sep 17 00:00:00 2001 From: Dounia Date: Mon, 14 Jun 2021 13:50:43 -0700 Subject: [PATCH 04/10] [SYCL][Matrix] Incorporate feedback related to: add matrix layout namespace, some formatting and rewording Signed-off-by: Dounia --- .../Matrix/dpcpp-joint-matrix.asciidoc | 35 ++++++++++--------- 1 file changed, 19 insertions(+), 16 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index 59a58d0e3bb8b..cd94dbfa570af 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -33,7 +33,7 @@ SYCL specification refer to that revision. **_NOTE:_** _This document describes the current design and API for the matrix extension to {dpcpp}. This is an initial experimental version to try out functionality -and performance. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX). 
We are going to work with the community on incrementally improving +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX). We are going to work with the community on incrementally improving the API to bring them closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._ ## Introduction @@ -75,7 +75,7 @@ This results into the following description: ```c++ namespace sycl::ext::intel::experimental::matrix { -template +template struct joint_matrix { joint_matrix(Group g) {} }; @@ -109,17 +109,17 @@ We define three new functions needed to perform the main and common operations o The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). -Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a non-diverged control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by all the work items in a group. +Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group. To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. 
The {dpcpp} syntax is the following: #### Load ```c++ namespace sycl::ext::intel::experimental::matrix { -template void joint_matrix_load(Group sg, joint_matrix &res, - multi_ptr src, size_t stride, matrix_layout l = row_major); + multi_ptr src, size_t stride, matrix_layout layout = matrix_layout::row_major); } ``` This function loads data from memory to the 2d tiles of AMX that is a 2d storage. @@ -128,10 +128,10 @@ This function loads data from memory to the 2d tiles of AMX that is a 2d storage #### Store ```c++ namespace sycl::ext::intel::experimental::matrix { -template void joint_matrix_store(Group sg, matrix &res, - multi_ptr src, size_t stride, matrix_layout l = row_major); + multi_ptr src, size_t stride, matrix_layout layout = matrix_layout::row_major); } ``` This function stores the data from the 2d tiles back to memory. @@ -140,7 +140,7 @@ This function stores the data from the 2d tiles back to memory. ```c++ namespace sycl::ext::intel::experimental::matrix { - template joint_matrix joint_matrix_mad(Group sg, joint_matrix A, joint_matrix B, joint_matrix C); @@ -191,13 +191,13 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) // users need to specify the updated VNNI sizes along with the packed_b layout joint_matrix tB(sg); joint_matrix tC(sg); - joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty, N, row_major); + joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major); for (int k = 0; k < K; k += tk) { - joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, row_major);//collective - joint_matrix_load(sg, tB, memB + k * N + sg_starty, N, packed_b);//VNNI + joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major);//collective + joint_matrix_load(sg, tB, memB + k * N + sg_starty, N, matrix_layout::packed_b);//VNNI tC = joint_matrix_mad(sg, tA, tB, tC); } - joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty, N, row_major); + joint_matrix_store(sg, tC, 
memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major); }).wait(); ``` @@ -212,8 +212,10 @@ clang++ -fsycl -march=sapphirerapids fsycl-targets="spir64_x86_64-uknown-linux-s ``` ### Current Implementation Restrictions +This section provides the specific features that this implementation supports. However, in future versions of this API and implementation, the expectation is to provide a query interface to guide the usage of this API. + #### Type, Sizes, and Layouts -The types supported by this AMX implementation are restricted to the types that AMX hardware support. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this current implementation can handle any size. If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile. Performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be smaller than 16x64 bytes and A and B matrices should be already packed (put in VNNI format). +The types supported by this AMX implementation are restricted to the types that AMX hardware support. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this current implementation can handle any size. If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile. Performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be no larger than 16x64 bytes and A and B matrices should be already packed (put in VNNI format). More specifically, the following operation C = A*B+C can be performed on AMX with this interface where: A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major) @@ -225,13 +227,14 @@ This current implementation only considers a sub-group scope. 
However, the sub-g ## Future Implementation Work -###Unified LLVM IR and SPIRV JIT Enabling + +### Unified LLVM IR and SPIRV JIT Enabling To enable JIT compilation, a unified matrix IR needs to be added. Currently, there is no matrix type in LLVM IR or SPIR-V. We are working towards adding a new matrix type in both LLVM IR and SPIR-V. This JIT enabling is expected to be part of a future compiler release. -### LLVM IR Extension +#### LLVM IR Extension As a short-term solution, we are extending the https://llvm.org/docs/LangRef.html#llvm-matrix-transpose-intrinsic[existing LLVM IR matrix intrinsics] to include features like VNNI layout. The current matrix intrinsics use flattened vectors to represent the matrix. Therefore, we are exploring both adding a matrix type to LLVM IR and using the MLIR `vector` dialect for this work. -### SPIR-V Extension +#### SPIR-V Extension The current draft proposal can be found https://gitlab.devtools.intel.com/OpenCL/opencl-extension-drafts/-/blob/master/SPV_INTEL_matrix.asciidoc[here]. We are adding translation from LLVM IR matrix to SPIR-V matrix and vice versa in the LLVM to SPIR-V translator tool.
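As a side note on the packed shapes discussed in this patch, the relation between a matrix's logical shape and the shape its VNNI/packed layout uses can be sketched with a small helper. This is an illustration only, not part of the extension API; the helper name `packed_b_shape` is hypothetical. It encodes the rule stated in the text that 32 bits of K-data are packed contiguously.

```cpp
#include <cstddef>
#include <utility>

// Illustrative helper (not part of the extension API): given the logical
// shape (K, N) of matrix B and the element width in bits, return the shape
// the packed (VNNI) layout uses. The VNNI blocking factor packs 32 bits of
// K-data contiguously: 2 rows for 16-bit elements, 4 rows for 8-bit elements.
constexpr std::pair<std::size_t, std::size_t>
packed_b_shape(std::size_t K, std::size_t N, std::size_t element_bits) {
  std::size_t factor = 32 / element_bits; // VNNI blocking factor
  return {K / factor, N * factor};
}
```

For bf16 (16-bit) with K = 32 and N = 16, this yields (16, 32), matching the (K/2, N*2) rule described in the text.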
From 7b4a3bcbbfe0fc7121d99b3056f720fa0700e907 Mon Sep 17 00:00:00 2001 From: Dounia Date: Tue, 22 Jun 2021 09:17:28 -0700 Subject: [PATCH 05/10] - Add layout parameter on the template for load, store, mad functions - Add the implementation status to doc/extensions/README.md - Add "API description vs what is actually implemented" question like dynamic_ extent and Group to the open questions - Add more clarification about packed_a and packed_b layout, and difference between layouts on matrix and in load/store functions Signed-off-by: Dounia --- .../Matrix/dpcpp-joint-matrix.asciidoc | 22 +++++++++++++++---- sycl/doc/extensions/README.md | 5 +++++ 2 files changed, 23 insertions(+), 4 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index cd94dbfa570af..63e22af795eb6 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -88,7 +88,13 @@ joint_matrix tA(sg); ``` -AMX hardware requires A and B to be in VNNI or 32 bits packed layout. If users did not specify these layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. +AMX hardware requires both A and B to be in VNNI or 32-bit packed layout. +If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N), the logical sizes are M, K, N. However, the packed size for B is different. Let's take an example of bf16 type. The size of packed B is (K/2, N*2). For maximum performance, the allocation of the B tile has to be done at the declaration point of the B matrix using these VNNIed sizes. This motivates the choice of adding a layout argument to the matrix type. + +The same applies to matrix A.
But in the case where the memory layout (where the matrices will be loaded from) is column major. +Here, we multiply matrices A (K, M) and B (N, K) into a matrix C (N, M). The packed size for A is different. For bf16 type, the size of packed A is (K/2, M*2). + +//If users did not specify these layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. ```c++ namespace sycl::ext::intel::experimental::matrix { @@ -107,7 +113,10 @@ enum class matrix_layout { ## Matrix Operations and their Execution Scope We define three new functions needed to perform the main and common operations on matrices namely, load, store, and the actual multiply and add operation. This set of functions can be easily extended if the tensor hardware implements new features. -The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). +The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). + +Note that for getting maximum performance on AMX, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. 
The packed or VNNI layout is introduced in `VNNI layout` section below. + Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group. @@ -117,6 +126,7 @@ To be aligned with the SYCL 2020 group algorithms, an additional group argument ```c++ namespace sycl::ext::intel::experimental::matrix { template void joint_matrix_load(Group sg, joint_matrix &res, multi_ptr src, size_t stride, matrix_layout layout = matrix_layout::row_major); @@ -129,7 +139,8 @@ This function loads data from memory to the 2d tiles of AMX that is a 2d storage ```c++ namespace sycl::ext::intel::experimental::matrix { template + matrix_layout Layout, + access::address_space Space> void joint_matrix_store(Group sg, matrix &res, multi_ptr src, size_t stride, matrix_layout layout = matrix_layout::row_major); } @@ -141,7 +152,9 @@ This function stores the data from the 2d tiles back to memory. ```c++ namespace sycl::ext::intel::experimental::matrix { template + std::size_t K, std::size_t N, + matrix_layout LayoutA, matrix_layout LayoutB, + matrix_layout LayoutC> joint_matrix joint_matrix_mad(Group sg, joint_matrix A, joint_matrix B, joint_matrix C); } @@ -255,6 +268,7 @@ We did not utilize this extension for this matrix API version because sub-group - Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added? - Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI Layout in matrix A and VNNI layout in matrix B of a matrix multiply and add operation on AMX? - Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. 
Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?" +- What should the API description include: (1) only features that are implemented, (2) features that are actually part of the API: currently implemented and the ones that we expect implementing them in the future. Specifically, should the document include things like dynamic_ extent and Group? These are part of the API but are not currently implemented. ## TODO List - Handle sub group sizes that are bigger than one. diff --git a/sycl/doc/extensions/README.md b/sycl/doc/extensions/README.md index 0faddf8955ee2..df7d209894b51 100755 --- a/sycl/doc/extensions/README.md +++ b/sycl/doc/extensions/README.md @@ -38,6 +38,11 @@ DPC++ extensions status: | [Level-Zero backend specification](LevelZeroBackend/LevelZeroBackend.md) | Supported | | | [ITT annotations support](ITTAnnotations/ITTAnnotations.rst) | Supported | | | [SYCL_EXT_ONEAPI_DEVICE_IF](DeviceIf/device_if.asciidoc) | Proposal | | +| [SYCL_INTEL_group_sort](GroupAlgorithms/SYCL_INTEL_group_sort.asciidoc) | Proposal | | +| [Invoke SIMD](InvokeSIMD/InvokeSIMD.asciidoc) | Proposal | | +| [Uniform](Uniform/Uniform.asciidoc) | Proposal | | +| [Assert](Assert/SYCL_ONEAPI_ASSERT.asciidoc) | Proposal | | +| [Matrix](Matrix/dpcpp-joint-matrix.asciidoc) | Partially supported(AMX AOT) | Not supported: dynamic-extent, wg and wi scopes, layouts other than packed| Legend: From 111fe1e0b7bcede42705d187b5a9fe4d29a7037b Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Wed, 30 Jun 2021 09:04:13 -0500 Subject: [PATCH 06/10] [SYCL][matrix] Incorporate changes related to: add specific implementation capabilities in each of the API sections, and some improvement to the layouts explanation text --- .../Matrix/dpcpp-joint-matrix.asciidoc | 66 ++++++++++++------- 1 file changed, 43 insertions(+), 23 deletions(-) diff --git 
a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index 63e22af795eb6..8fcbdc7b588f8 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -60,17 +60,8 @@ We introduce a new class called `joint_matrix`. The user needs to specify the ty //`joint_matrix` is distributed among an execution unit. In practice this can be one work-item, the work-items in a sub-group, or the work-items in a work-group. -#### Memory Scope -In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasis that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. - - -#### Shape -The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. - -#### Layout -Layout is necessary on the type to be able to calculate the physical offset if the user needs to access a single entry for some purpose. Besides row major and column major layout, `matrix_layout` is flexible enough to introduce customed layouts such as symmetric or tiled layouts. 
-This results into the following description: +//This results into the following description: ```c++ @@ -81,20 +72,28 @@ struct joint_matrix { }; } ``` + + +#### Memory Scope +In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. + +IMPORTANT: In the current implementation, only the subgroup scope is supported. + When the group is a `sycl::sub_group`, a matrix is declared as follows: ```c++ joint_matrix tA(sg); ``` +#### Shape +The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variable (CPU case). Note that an AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. -AMX hardware requires both A and B to be in VNNI or 32 bits packed layout. -If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed size for B is different. Let's take an example of bf16 type. The size of packed B is (K/2, N*2). For maximum performance, the allocation of B tile has to be done at the declaration of B matrix point using these VNNIed sizes. This motivate the choice of adding layout argument to the matrix type. +IMPORTANT: In the current implementation, only the static extent is supported. -Same applies to matrix A. But in the case where the memory layout (where the matrices will be loaded from) is column major. -Here, we multiply matrices A (K, M) and B (N, K) into a matrix C (N, M). The packed size for A is different.
For bf16 type, the size of packed A is (K/2, M*2). -//If users did not specify these layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. +#### Layout +//Layout is necessary on the type to be able to calculate the physical offset if the user needs to access a single entry for some purpose. +Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts. ```c++ namespace sycl::ext::intel::experimental::matrix { @@ -107,21 +106,38 @@ enum class matrix_layout { } ``` +AMX hardware requires both A and B to be in VNNI or 32-bit packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N), the logical sizes are M, K, N. However, the packed shape for B used by the implementation is different. Let's take an example of bf16 type. The shape of packed B is (K/2, N*2), as should be specified in user code as follows: + +//For maximum performance, the allocation of B tile has to be done at the declaration of B matrix point using these VNNIed sizes. This motivate the choice of adding layout argument to the matrix type. + +//Same applies to matrix A. But in the case where the memory layout (where the matrices will be loaded from) is column major. +//Here, we multiply matrices A (K, M) and B (N, K) into a matrix C (N, M). The packed size for A is different. For bf16 type, the size of packed A is (K/2, M*2). + +//If users did not specify these layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below.
+ +IMPORTANT: In the current implementation, only `packed_b` layout is necessary to specify on matrix B, the layout on other matrices is ignored. + +```c++ +joint_matrix tB(sg); +``` ## Matrix Operations and their Execution Scope We define three new functions needed to perform the main and common operations on matrices namely, load, store, and the actual multiply and add operation. This set of functions can be easily extended if the tensor hardware implements new features. -The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). +The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row (`row_major`), column major (`column_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for row major and packed layout, columns for column major layout. Note that for getting maximum performance on AMX, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. +IMPORTANT: In the current implementation, the layout in the load of matrix B should be `packed_b`, layout in the load of matrix A and C should be `row_major`, layout in the store of matrix C should be `row_major`. 
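The role of `stride` described above can be illustrated with a small sketch. This is not part of the extension; `linear_offset` is a hypothetical helper that mirrors the `matrix_layout` enum and shows how an element's linear offset would be computed from `stride` under each layout.

```cpp
#include <cstddef>

// Hypothetical illustration of how `stride` is interpreted: for row_major
// and the packed layouts it is the number of elements between consecutive
// rows; for col_major it is the number of elements between consecutive
// columns.
enum class matrix_layout { row_major, col_major, packed_a, packed_b };

constexpr std::size_t linear_offset(std::size_t row, std::size_t col,
                                    std::size_t stride, matrix_layout l) {
  return l == matrix_layout::col_major ? col * stride + row
                                       : row * stride + col;
}
```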
-Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a sub-group. These functions will be called once by each work item in the group. +Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group. To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: +IMPORTANT: In the current implementation, only the subgroup scope is supported + #### Load ```c++ namespace sycl::ext::intel::experimental::matrix { @@ -167,14 +183,14 @@ AMX compute assumes register for B tile (src1) to be in VNNI format as they need The VNNI blocking factor is 2 in the case of 16bits, 4 in the case of 8 bits elements. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. 
- // Example of bf16 data type: + // Example of bf16 data type: For the 4 rows x 4 columns matrix below // --------------------------------- // a1, b1, c1, d1 // a2, b2, c2, d2 // a3, b3, c3, d3 // a4, b4, c4, d4 // --------------------------------- - // reformat to + // The reformatting to VNNI layout yields the below 2 rows x 8 columns matrix // --------------------------------- // a1, a2, b1, b2, c1, c2, d1, d2 // a3, a4, b3, b4, c3, c4, d3, d4 @@ -202,7 +218,7 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) joint_matrix tA(sg); // For B, since current implementation does not support non packed layout, // users need to specify the updated VNNI sizes along with the packed_b layout - joint_matrix tB(sg); + joint_matrix tB(sg); joint_matrix tC(sg); joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major); for (int k = 0; k < K; k += tk) { @@ -215,8 +231,7 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item) ``` ## Implementation Status -For oneAPI release 3, an AOT implementation is available on the CPU device to targets AMX hardware. we are using AMX tile intrinsics to implement the matrix load and store operations. Since we are currently emitting AMX intrinsics directly, this only enables AOT compilation. Please refer to the following section that talks about the future unified SPIR-V path that will enable JIT compilation. -// We used the https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AMX[`_tile_`-prefixed intrinsics] defined in `immintrin.h`. +For oneAPI release 3, an AOT implementation is available on the CPU device to target AMX hardware. We are using AMX tile intrinsics to implement the matrix load and store operations. Since we are currently emitting AMX intrinsics directly, this only enables AOT compilation.
Currently, this is the compilation command line needed to invoke the AMX unit of a Sapphire Rapids CPU: @@ -224,8 +239,13 @@ clang++ -fsycl -march=sapphirerapids -fsycl-targets="spir64_x86_64-unknown-linux-sycldevice" -O2 matmul-int8.cpp -o matmul-int8 ``` +Please refer to the section "Future Implementation Work" that talks about the future unified SPIR-V path that will enable JIT compilation. + +// We used the https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AMX[`_tile_`-prefixed intrinsics] defined in `immintrin.h`. + + ### Current Implementation Restrictions -This section provides the specific features that this implementation supports. However, in future versions of this API and implementation, the expectation is to provide a query interface to guide the usage of this API. +This section summarizes the specific features that this implementation supports. In future versions of this API and implementation, the expectation is to provide a query interface to guide the usage of this API. #### Type, Sizes, and Layouts The types supported by this AMX implementation are restricted to the types that the AMX hardware supports. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this current implementation can handle any size. If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile. A performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be no larger than 16x64 bytes and A and B matrices should be already packed (put in VNNI format).
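The VNNI prepacking that the restriction above asks the user to perform can be sketched as a host-side transform. This is an illustration only, not part of the extension API; `pack_vnni` is a hypothetical helper name, with blocking factor `t` equal to 2 for 16-bit elements and 4 for 8-bit elements, as stated earlier in the document.

```cpp
#include <cstddef>
#include <vector>

// Illustrative host-side transform (not part of the extension API):
// reformat a row-major K x N matrix into the VNNI/packed layout with
// blocking factor t. Element (k, n) moves to packed row k/t and packed
// column n*t + k%t, so t consecutive K-rows become contiguous in memory.
template <typename T>
std::vector<T> pack_vnni(const std::vector<T> &src, std::size_t K,
                         std::size_t N, std::size_t t) {
  std::vector<T> dst(K * N); // packed shape is (K/t) x (N*t)
  for (std::size_t k = 0; k < K; ++k)
    for (std::size_t n = 0; n < N; ++n)
      dst[(k / t) * (N * t) + n * t + (k % t)] = src[k * N + n];
  return dst;
}
```

Applied to a 4x4 matrix like the one in the VNNI layout example of this document, t = 2 produces the 2 rows x 8 columns result and t = 4 a 1 row x 16 columns result.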
From 05aa2acd8e4563799065d07555f07fc3cdf9e2dd Mon Sep 17 00:00:00 2001 From: Dounia Khaldi Date: Thu, 1 Jul 2021 13:08:22 -0500 Subject: [PATCH 07/10] [SYCL][Matrix] Incorporate review changes related to: - Remove all the comments from the asciidoc file - Add more explicitly in two other places that "a kernel using this extension must be decorated with the [[sycl::reqd_sub_group_size(1)]] attribute" - Add 8 bit example for VNNI transform - Incorporate rewording suggestions from Greg. --- .../Matrix/dpcpp-joint-matrix.asciidoc | 73 ++++++++++--------- 1 file changed, 40 insertions(+), 33 deletions(-) diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index 8fcbdc7b588f8..de422f3bd09a4 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -56,13 +56,7 @@ value to determine which of the extension's APIs the implementation supports. |====================== ## New `joint_matrix` class -We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix. - -//`joint_matrix` is distributed among an execution unit. In practice this can be one work-item, the work-items in a sub-group, or the work-items in a work-group. - - -//This results into the following description: - +We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix. This results in the following description: ```c++ namespace sycl::ext::intel::experimental::matrix { @@ -106,20 +100,12 @@ enum class matrix_layout { } ``` -AMX hardware requires both A and B to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B used by the implementation is different.
Let's take an example of bf16 type. The shape of packed B is (K/2, N*2) as should be specified in user code as follows: - -//For maximum performance, the allocation of B tile has to be done at the declaration of B matrix point using these VNNIed sizes. This motivate the choice of adding layout argument to the matrix type. - -//Same applies to matrix A. But in the case where the memory layout (where the matrices will be loaded from) is column major. -//Here, we multiply matrices A (K, M) and B (N, K) into a matrix C (N, M). The packed size for A is different. For bf16 type, the size of packed A is (K/2, M*2). - -//If users did not specify these layouts, transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. -IMPORTANT: In the current implementation, only `packed_b` layout is necessary to specify on matrix B, the layout on other matrices is ignored. +AMX hardware requires both A and B to be in VNNI or 32-bit packed layout. So the matrices are always packed in AMX. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N), the logical sizes are M, K, N. However, the packed shape for B used by the implementation is different. For the bf16 type, for instance, the shape of the AMX tile B is (K/2, N*2). The user must provide the `packed_b` layout information so that the implementation allocates the right shape. The layout information for AMX should be specified in user code as follows: ```c++ joint_matrix tB(sg); ``` +IMPORTANT: In the current implementation, only the `packed_b` layout on matrix B needs to be specified; the layout on the other matrices is ignored.
@@ -130,13 +116,13 @@ The base pointer determines the starting address of the matrix to be loaded/stor Note that for getting maximum performance on AMX, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below. -IMPORTANT: In the current implementation, the layout in the load of matrix B should be `packed_b`, layout in the load of matrix A and C should be `row_major`, layout in the store of matrix C should be `row_major`. +IMPORTANT: In the current implementation, the layout in the load of matrix B must be `packed_b`. Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`. Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group. To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: -IMPORTANT: In the current implementation, only the subgroup scope is supported +IMPORTANT: In the current implementation, only the subgroup scope is supported. 
Moreover, a kernel using this extension must be decorated with the [[sycl::reqd_sub_group_size(1)]] attribute.

#### Load
```c++
@@ -180,21 +166,40 @@ The matrix multiply and add function performs the multiply operation on the matrices
 ## VNNI/Packed Layout
 AMX compute assumes the register for the B tile (src1) to be in VNNI format, as 32 bits of K-data in A and B need to be contiguous in memory.
-The VNNI blocking factor is 2 in the case of 16bits, 4 in the case of 8 bits elements. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform.
+The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed_b` layout for a 16-bit type.
-
- // Example of bf16 data type: For the 4 rows x 4 columns matrix below
+#### Example 1: 16-bit elements
+ // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout.
+ // Element a1 is contiguous in memory with element b1, etc.
 // ---------------------------------
 // a1, b1, c1, d1
 // a2, b2, c2, d2
 // a3, b3, c3, d3
 // a4, b4, c4, d4
 // ---------------------------------
- // The reformating to VNNI layout yield the below 2 rows x 8 columns matrix
+ // The same matrix reformatted in packed_b layout.
+ // Here, packing of 2 elements is needed to form 32 bits.
+ // Element a1 is contiguous in memory with element a2, etc.
 // ---------------------------------
 // a1, a2, b1, b2, c1, c2, d1, d2
 // a3, a4, b3, b4, c3, c4, d3, d4
+#### Example 2: 8-bit elements
+
+ // Example of a 4 row x 4 column matrix using an 8-bit data element, in row-major layout.
+ // Element a1 is contiguous in memory with element b1, etc.
+ // ---------------------------------
+ // a1, b1, c1, d1
+ // a2, b2, c2, d2
+ // a3, b3, c3, d3
+ // a4, b4, c4, d4
+ // ---------------------------------
+ // The same matrix reformatted in packed_b layout.
+ // Here, packing of 4 elements is needed to form 32 bits.
+ // Elements a1, a2, a3, a4 are contiguous in memory, etc.
+ // ---------------------------------
+ // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4
+
 ## Example using int8_t type
 ```c++
@@ -207,7 +212,7 @@ range<2> L = {1, SG_SIZE};
 int8_t *memA = malloc_shared(M*K, q);
 int8_t *memB = malloc_shared(K*N, q);
 int32_t *memC = malloc_shared(M*N, q);
-//Assuming memB has already been VNNIed
+// Assuming memB has already been VNNIed
 q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
   [[sycl::reqd_sub_group_size(SG_SIZE)]] {
   const auto global_idx = item.get_global_id(0);
@@ -222,8 +227,8 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
   joint_matrix tC(sg);
   joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major);
   for (int k = 0; k < K; k += tk) {
-    joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major);//collective
-    joint_matrix_load(sg, tB, memB + k * N + sg_starty, N, matrix_layout::packed_b);//VNNI
+    joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major);
+    joint_matrix_load(sg, tB, memB + k * N + sg_starty, N, matrix_layout::packed_b); // VNNI
     tC = joint_matrix_mad(sg, tA, tB, tC);
   }
   joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major);
@@ -241,27 +246,29 @@ clang++ -fsycl -march=sapphirerapids -fsycl-targets="spir64_x86_64-unknown-linux-s

Please
refer to the section "Future Implementation Work" that talks about the future unified SPIR-V path that will enable JIT compilation.

-// We used the https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AMX[`_tile_`-prefixed intrinsics] defined in `immintrin.h`.
-
-
 ### Current Implementation Restrictions

 This section summarizes the specific features that this implementation supports. In future versions of this API and implementation, the expectation is to provide a query interface to guide the usage of this API.

 #### Type, Sizes, and Layouts

-The types supported by this AMX implementation are restricted to the types that AMX hardware support. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this current implementation can handle any size. If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile. Performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be no larger than 16x64 bytes and A and B matrices should be already packed (put in VNNI format).
+The types supported by this AMX implementation are restricted to the types that the AMX hardware supports. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this current implementation can handle any size. If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile; a performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be no larger than 16x64 bytes and the B matrix should already be packed (put in VNNI format).
 More specifically, the following operation C = A*B+C can be performed on AMX with this interface where:
+
 A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)
+
 or
-A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major)
+
+A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major).
+
+No other types or layouts are supported at this time.

 #### Memory and Execution Scope

-This current implementation only considers a sub-group scope. However, the sub-group size has to be equal to one in this first implementation.
+This current implementation only considers a sub-group scope. However, the sub-group size has to be equal to one in this first implementation. In this case, a kernel using this extension must be decorated with the [[sycl::reqd_sub_group_size(1)]] attribute.

 ## Future Implementation Work

-### Unfied LLVM IR and SPIRV JIT Enabling
+### Unified LLVM IR and SPIR-V JIT Enabling

 To enable JIT compilation, a unified matrix IR needs to be added. Currently, there is no matrix type in LLVM IR or SPIR-V. We are working towards adding a new matrix type in both LLVM IR and SPIR-V. This JIT enabling is expected to be part of a future compiler release.
#### LLVM IR Extension

From f5c1d13c26badcb63eed35401e74a811cfba8498 Mon Sep 17 00:00:00 2001
From: Dounia Khaldi
Date: Thu, 1 Jul 2021 14:44:22 -0500
Subject: [PATCH 08/10] [SYCL][Matrix] Add missing template parameter layout in
 load/store/mad functions

---
 sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
index de422f3bd09a4..142fbf60f8b0c 100644
--- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
+++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
@@ -130,7 +130,7 @@ namespace sycl::ext::intel::experimental::matrix {
 template
-  void joint_matrix_load(Group sg, joint_matrix &res,
+  void joint_matrix_load(Group sg, joint_matrix &res,
     multi_ptr src, size_t stride, matrix_layout layout = matrix_layout::row_major);
 }
 ```
@@ -143,7 +143,7 @@ namespace sycl::ext::intel::experimental::matrix {
 template
-  void joint_matrix_store(Group sg, matrix &res,
+  void joint_matrix_store(Group sg, joint_matrix &res,
     multi_ptr src, size_t stride, matrix_layout layout = matrix_layout::row_major);
 }
 ```
@@ -157,8 +157,8 @@ namespace sycl::ext::intel::experimental::matrix {
           std::size_t K, std::size_t N, matrix_layout LayoutA,
           matrix_layout LayoutB, matrix_layout LayoutC>
-  joint_matrix joint_matrix_mad(Group sg, joint_matrix A,
-    joint_matrix B, joint_matrix C);
+  joint_matrix joint_matrix_mad(Group sg, joint_matrix A,
+    joint_matrix B, joint_matrix C);
 }
 ```
 The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result.
From ce7e0152ed04ec3e1139108e1626844a09a46e8a Mon Sep 17 00:00:00 2001
From: Dounia Khaldi
Date: Thu, 1 Jul 2021 15:27:52 -0500
Subject: [PATCH 09/10] [SYCL][Matrix] reword the matrix layout description

---
 sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
index 142fbf60f8b0c..fdd4274f512ca 100644
--- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
+++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
@@ -100,7 +100,7 @@ enum class matrix_layout {
 }
 ```
-AMX hardware requires both A and B to be in VNNI or 32 bits packed layout. So the matrices are always packed in AMX. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B used by the implementation is different. For bf16 type for instance, the shape of AMX tile B is (K/2, N*2). The user must provide the information of `packed_b` layout to make the implementation allocate the right shape. The layout information for AMX should be specified in user code as follows:
+AMX hardware requires the B matrix to be in VNNI or 32-bit packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N), the logical sizes are M, K, and N. However, the packed shape for B used by the implementation uses the VNNI format, which is described below. The user must provide the `packed_b` layout information so that the implementation allocates the right shape.
 The layout information for AMX should be specified in user code as follows:
 ```c++
 joint_matrix tB(sg);
 ```

From 467ef25a309ec882027052f3d4c3df58c11ee2ac Mon Sep 17 00:00:00 2001
From: Dounia Khaldi
Date: Thu, 1 Jul 2021 15:29:43 -0500
Subject: [PATCH 10/10] Update dpcpp-joint-matrix.asciidoc

---
 sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
index fdd4274f512ca..264af216f6d6c 100644
--- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
+++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
@@ -86,7 +86,6 @@ IMPORTANT: In the current implementation, only the static extent is supported

 #### Layout

-//Layout is necessary on the type to be able to calculate the physical offset if the user needs to access a single entry for some purpose.
 Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts.
 ```c++
@@ -100,7 +99,7 @@ enum class matrix_layout {
 }
 ```
-AMX hardware requires B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B used by the implementation uses the VNNI format, which is described below. The user must provide the information of `packed_b` layout to make the implementation allocate the right shape.
+AMX hardware requires the B matrix to be in VNNI or 32-bit packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N), the logical sizes are M, K, and N. However, the packed shape for the B tile uses the VNNI format, which is described below. The user must provide the `packed_b` layout information so that the implementation allocates the right shape.
 The layout information for AMX should be specified in user code as follows:
 ```c++
 joint_matrix tB(sg);
 ```