diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc index f3b96a8827ea..5ab2ba094601 100644 --- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc +++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc @@ -33,11 +33,11 @@ SYCL specification refer to that revision. **_NOTE:_** _This document describes the current design and API for the matrix extension to {dpcpp}. This is an initial experimental version to try out functionality -and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. We are going to work with the community on incrementally improving +and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX), DPAS, and Nvidia® Tensor Cores. We are going to work with the community on incrementally improving the API to bring them closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._ ## Introduction -This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer. +This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia® TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different hardware can offer. ## Feature test macro @@ -54,41 +54,67 @@ value to determine which of the extension's APIs the implementation supports. |Value |Description |1 |Initial extension implementation on Intel AMX. Base features are supported. |2 |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad and the query interface are supported +|3 |Initial extension implementation on Nvidia® Tensor Cores. load, store, mad, and bmad are supported. bf16, fp19, mixed precision float, mixed precision u(int), double, and single-bit data formats are supported. |====================== ## New `joint_matrix` class -We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix. This results into the following description: + +We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, shape, the memory layout, and the memory scope of the matrix. 
This results in the following description: ```c++ namespace sycl::ext::oneapi::experimental::matrix { -template struct joint_matrix { - joint_matrix(Group g) {} + joint_matrix(Group g) {} }; } ``` +### New `matrix_use` Enumeration used in Nvidia® case + +A new parameter has been added to the Tensor Core implementation of joint_matrix. The type of the new parameter is an enum, `matrix_use` (a, b, accumulator). +`matrix_use` is used to distinguish the suitable place for a given matrix in a Multiply and Add operation. The Multiply and Add operation is described later on in this document. The long term plan is to incorporate the use argument for other TPUs (AMX and DPAS). In the Tensor Cores architecture the supported matrix shapes and data types depend upon the `matrix_use`. +The joint_matrix interface used by the Tensor Cores implementation is: + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template +struct joint_matrix { + joint_matrix(Group g) {} +}; +} +``` #### Memory Scope -In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasis that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. +In this experimental API version, we used the terminology of `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments. -IMPORTANT: In the current implementation, only the subgroup scope is supported +IMPORTANT: In the current implementations of Intel AMX, Intel DPAS, and Nvidia® Tensor Cores, only the subgroup scope is supported. -When the group is a `sycl::sub_group`, a matrix is declared as follows: +When the group is a `sycl::sub_group`, an example matrix declaration is as follows: ```c++ -joint_matrix tA(sg); +joint_matrix tA(sg); +``` + +In the Tensor Cores case an example declaration adds the `matrix_use` parameter, e.g.: + +```c++ +joint_matrix tA(sg); ``` #### Shape -The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. -IMPORTANT: In the current implementation, only the static extent is supported +The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a Intel AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes. +IMPORTANT: In the current implementation, only the static extent is supported. #### Layout -Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce customed layouts such as symmetric or tiled layouts. 
+Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts.

```c++
namespace sycl::ext::oneapi::experimental::matrix {
@@ -101,13 +127,14 @@ enum class matrix_layout {
}
```
-Intel AMX and DPAS hardware require B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B tile uses the VNNI format, which is described below. The user must provide the information of packed_b layout to make the implementation allocate the right shape. The layout information for Intel AMX should be specified in user code as follows:
+Intel AMX and DPAS hardware require the B matrix to be in a VNNI or 32-bit packed layout. Suppose we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N); the logical sizes are M, K, N. However, the packed shape for the B tile uses the VNNI format, which is described below. The user must provide the packed_b layout information so that the implementation allocates the right shape. The layout information for Intel AMX should be specified in user code as follows:

```c++
joint_matrix tB(sg);
-```
-IMPORTANT: In the current implementation, only `packed_b` layout is necessary to specify on matrix B, the layout on other matrices is ignored.
+```
+IMPORTANT: AMX and DPAS: In the current implementation, only the `packed_b` layout needs to be specified, and only on matrix B; the layout on the other matrices is ignored.
+ Tensor Cores: For the CUDA backend, the layout in the load of matrices A, B, and C must be either `row_major` or `col_major`, and `Layout` in the store of matrix C must also be either `row_major` or `col_major`.

## Matrix Operations and their Execution Scope

@@ -117,15 +144,16 @@ The base pointer determines the starting address of the matrix to be loaded/stor
Note that for getting maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` when matrix `C` is column major, `packed_b` when matrix `C` is row major), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below.

-IMPORTANT: In the current implementation, the layout in the load of matrix B must be `packed_b`. Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
+IMPORTANT: In the current AMX and DPAS implementation, the layout in the load of matrix B must be `packed_b`. Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.

Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup.
These functions will be called once by each work item in the group. -To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: +IMPORTANT: The CUDA backend does not require any other sub-group size than 32, which is the size of a warp which acts as the sub-group, and also the default DPC++ sub-group size for the CUDA backend. + +To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: -IMPORTANT: In the current implementation, only the subgroup scope is supported. +### Load -#### Load ```c++ namespace sycl::ext::oneapi::experimental::matrix { template src, size_t stride, matrix_layout MemLayout); } ``` -This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS. +This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS or Nvidia® Tensor Cores. + +##### Tensor Cores syntax + +In the Tensor Cores implementation the `matrix_use` enum is added: + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template +void joint_matrix_load( + Group sg, joint_matrix &res, + multi_ptr src, size_t stride); +} +``` + +Note that `Layout` is not included as an argument since it may be determined from the joint_matrix argument. + +### Store -#### Store ```c++ namespace sycl::ext::oneapi::experimental::matrix { template + access::address_space Space> void joint_matrix_store(Group sg, joint_matrix &res, multi_ptr src, size_t stride, matrix_layout memL); } ``` -This function stores the data from the 2d tiles back to memory. -#### Multiply and Add +This function stores the data from the 2d tiles/registers back to memory. + +##### Tensor Cores syntax + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template +void joint_matrix_store(Group sg, + joint_matrix &src, + multi_ptr dst, size_t stride); +} +``` +Note that this function is only available for `matrix_use::accumulator`. + +### Multiply and Add ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -162,8 +222,99 @@ namespace sycl::ext::oneapi::experimental::matrix { joint_matrix B, joint_matrix C); } ``` -The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulate the result with `C` and return the result. +Multiply and Add (MAD) operations, also known as Matrix Multiply and Accumulate (MMA), multiply matrices A (`matrix_use::a`) with shape (M, K) and B (`matrix_use::b`) with shape (K, N) and add the result to matrix C (`matrix_use::accumulator`) with shape (M, N). The logical sizes are M, K, N. + +C = A * B + C + +##### Tensor Cores syntax + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template +joint_matrix +joint_matrix_mad( + Group sg, joint_matrix A, + joint_matrix B, + joint_matrix C); +} +``` + +### Bitwise Multiply and Add - `joint_matrix_bmad` (Nvidia® only) + +```c++ +namespace sycl::ext::oneapi::experimental::matrix { +template +joint_matrix +joint_matrix_bmad( + Group sg, joint_matrix A, + joint_matrix B, + joint_matrix C, BinaryOperation Op); +} +``` + +Bitwise Multiply and Add (BMAD) operations replace the usual dot product between a row of matrix A (M by K) with a column of matrix B (K by N), where the programmer can construct e.g. 
a standard C++ (M by K) array of a specified type T to represent matrix A. Instead, a sequence of logical operations is performed: the AND or XOR logical operation acts on the ith bit of a (K * 32)-bit row of matrix A and the ith bit of a (K * 32)-bit column of matrix B, producing a 128-bit intermediate output.
+The Population Count (popc) operator then operates on this intermediate output, and the result is added to the (M, N)th element of the accumulator matrix C. Currently only the shape M = 8, N = 8, K = 4 (K = 4 corresponds to 128 single-bit matrix elements) is supported.
+An important difference with respect to the joint_matrix_mad interface is the addition of the `BinaryOperation Op` parameter. `Op` may be either:
+
+`sycl::bit_and()`
+
+or
+
+`sycl::bit_xor()`
+
+The A, B, and C `joint_matrix` objects are constructed and loaded/stored in the normal way, using the previously defined `joint_matrix`, `joint_matrix_load`, and `joint_matrix_store` interfaces, respectively.
+The C matrix must be loaded from an array of 32-bit signed integers, and the single-bit A and B matrices must be loaded from arrays of 32-bit unsigned integers.
+
+IMPORTANT: When using Bitwise Multiply and Add, joint_matrix A must be in row major layout and joint_matrix B must be in column major layout.
+
+IMPORTANT: Bitwise Multiply and Add operations are an experimental hardware feature and all implementation details are subject to change.
+
+#### Motivation for BMAD
+
+Single-bit MADs can be used as part of Binarized Neural Networks (BNNs) in the case that *both* the activations *and* the weights are binarized. "Quantizing" a network to form a BNN represents the extreme limit of reducing the precision of the network degrees of freedom in order to gain performance and improve efficiency.
+Hubara et al. (I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks, Advances in Neural Information Processing Systems 29 (NIPS 2016)) first demonstrated the utility of an algorithm that could use both binarized activations and weights with backpropagation, by keeping track of real-valued weights that are mapped to the binarized weights. In the backward pass, the real-valued weights are updated according to a heuristic named the "Straight Through Estimator", whereby the gradient of the loss function with respect to the real weights is set equal to the gradient of the loss function with respect to the binarized weights.
+This implies that the precision of the data type used in the matrix multiplications can be single bit, with the necessary addition of forward and backward element-wise mappings between binarized and real-valued representations of the matrices.
+This could prove a significant advantage for large models, since the computational cost of matrix multiplication scales with the number of elements per dimension, N, as O(N^3) for square matrices, whereas corresponding element-wise operations scale as O(N^2).
+Further algorithms based on this binarized approach have been proposed; see, e.g., Rastegari et al. (M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. Computer Vision – ECCV 2016, 525-542), who compare a binarized version of a CNN (using an XNOR binary dot product) with corresponding full-precision models, for both the accuracy and the performance of image classification using the ImageNet data set.
+
+For an example of how bitwise MADs can be leveraged on current Nvidia® hardware, see A. Li and S. Su, IEEE Transactions on Parallel and Distributed Systems, 32(7):1878-1891, 2021.
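+
+To make the arithmetic above concrete, the following host-side sketch computes the same reduction as a single BMAD for the only supported shape (M = 8, N = 8, K = 4). It is a conceptual reference only, not part of the extension API; the function name, the array layout, the `use_xor` flag, and the use of C++20 `std::popcount` are illustrative assumptions.
+
+```c++
+#include <bit>      // std::popcount (C++20); illustrative choice of popcount
+#include <cstdint>
+
+constexpr int M = 8, N = 8, K = 4; // K = 4 words of 32 bits = 128 single-bit elements
+
+// A is M x K (row major). B is stored column major, so column j of the logical
+// single-bit B matrix occupies the K words B[j][0..K-1]. C is the 32-bit signed
+// accumulator. `use_xor` selects between the bit_xor and bit_and variants.
+void bmad_reference(const uint32_t A[M][K], const uint32_t B[N][K],
+                    int32_t C[M][N], bool use_xor) {
+  for (int i = 0; i < M; ++i) {
+    for (int j = 0; j < N; ++j) {
+      int32_t acc = 0;
+      for (int k = 0; k < K; ++k) {
+        // Bitwise op on 32 bits of row i of A and column j of B, then popc.
+        const uint32_t word = use_xor ? (A[i][k] ^ B[j][k]) : (A[i][k] & B[j][k]);
+        acc += std::popcount(word);
+      }
+      C[i][j] += acc; // accumulate into the (i, j)th element of C
+    }
+  }
+}
+```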
+
+#### Example using bitwise operations with `joint_matrix_bmad`
+
+```c++
+using namespace sycl::ext::oneapi::experimental::matrix;
+
+queue q;
+  q.submit([&](handler &cgh) {
+    auto accC = bufC.template get_access<access::mode::read_write>(cgh);
+    auto accA = bufA.template get_access<access::mode::read>(cgh);
+    auto accB = bufB.template get_access<access::mode::read>(cgh);
+    auto accD = bufD.template get_access<access::mode::write>(cgh);
+    range<2> LocalRange = {1, N_THREADS_PER_MATRIX_OP};
+    range<2> GlobalRange = {Sub_Tiles_M, Sub_Tiles_N * N_THREADS_PER_MATRIX_OP};
+    cgh.parallel_for(
+        nd_range<2>(GlobalRange, LocalRange), [=](nd_item<2> item) {
+          sycl::sub_group sg = item.get_sub_group();
+          const auto m = item.get_group().get_id()[0]; // row id of current submatrix of BIG C matrix
+          const auto n = item.get_group().get_id()[1]; // column id of current submatrix of BIG C matrix
+          joint_matrix<uint32_t, matrix_use::a, M, K, matrix_layout::row_major> sub_a;
+          joint_matrix<uint32_t, matrix_use::b, K, N, matrix_layout::col_major> sub_b;
+          joint_matrix<int32_t, matrix_use::accumulator, M, N, matrix_layout::row_major> sub_c;
+          joint_matrix_load(sg, sub_c, accC.get_pointer() + (m * M) * Big_N + n * N, Big_N);
+          for (int k = 0; k < Sub_Tiles_K; k++) // row/col id of current submatrix of BIG A/B matrices
+          {
+            joint_matrix_load(sg, sub_a, accA.get_pointer() + (k * K) + (m * M * Big_K), Big_K);
+            joint_matrix_load(sg, sub_b, accB.get_pointer() + (n * N * Big_K) + (k * K), Big_K);
+            sub_c = joint_matrix_bmad(sg, sub_a, sub_b, sub_c, Op);
+          }
+          joint_matrix_store(sg, sub_c, accD.get_pointer() + (m * M) * Big_N + n * N, Big_N);
+        });
+  }).wait();
+```

## VNNI/Packed Layout
Intel AMX and DPAS compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
@@ -235,8 +386,17 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
 }).wait();
 ```
+IMPORTANT: When compiling for Nvidia®, the matrix extension requires specification of the architecture version. The compilation invocation is given below.
+
+This is the compilation command line needed to invoke the Tensor Cores matrix extension on the program "matrix-nvidia.cpp":
+
+```c++
+clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -DSYCL_EXT_ONEAPI_MATRIX=3 matrix-nvidia.cpp -o output
+```
+**_NOTE:_** _--cuda-gpu-arch may be set lower than sm_80 depending on the required matrix operation and whether it is supported by the desired arch._
+
 == Query Interface
-Intel AMX, DPAS and Nvidia TPUs support different sizes and types.
+Intel AMX, DPAS, and Nvidia® TPUs support different sizes and types.
The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the implementation.
This also offers development and tuning productivity by both scientists and library developers. The query interface we are proposing here is a compile-time query, so there will be no runtime errors.
@@ -248,7 +408,7 @@ The query interface proposed here consists of three functionalities:
- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query return an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C.
Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively depending whether the implementation supports a continuous or discrete number of sizes. For example, Intel AMX implementation supports a continuous number of sizes so the `max_*` variant is applied and only the maximum number is returned. DPAS implementation, on the other hand, supports a discrete list of numbers so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using. -The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. +The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which they are defined. [frame="none",options="header"] |====================== @@ -276,8 +436,6 @@ The table below provides a description for each of the member variables and type - - ```c++ namespace sycl::ext::oneapi::experimental::matrix { @@ -533,7 +691,7 @@ for (int i = 0; i < 8; i++) for (int j = 0; j < 8; j++) C(i,j) *= alpha; //Align with mdspan ``` -#### Option2: Restrictive fast element indexing +#### Option 2: Restrictive fast element indexing In the DPC++ context, the expectation is that all element-wise operations will happen in a converged control path by all work items in the group. Option 2 proposes a new set of element-wise operations by overloading existing operations to work on `matrix` object. An example is shown below: ```c++ @@ -542,8 +700,8 @@ joint_matrix C(sg); ``` The problem with this option is that it is restrictive to a very limited set of operations. -#### Option3: Restrictive conversion in the interface from SIMD to SPMD -Nvidia wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix. +#### Option 3: Restrictive conversion in the interface from SIMD to SPMD +Nvidia® wmma interface added a new member to `fragment` class to designate the WI owned part of the matrix. While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. However using the `mma` ptx instructions as opposed to the `wmma` ptx instructions the mapping is known. Knowing this mapping is important for the user to implement new operations like sum of rows of a matrix for quantized algorithms. @@ -579,18 +737,27 @@ We did not utilize this extension for this matrix API version because sub-group ## Open Questions - Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added? - Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI Layout in matrix A and VNNI layout in matrix B of a matrix multiply and add operation on Intel AMX? --- Yes, this will be addressed in the next revision where `use` argument will be introduced to distinguish between right (B) , left (A), and accumulator matrix. +-- Yes, this will be addressed in the next revision where `matrix_use` argument will be introduced to distinguish between right (B) , left (A), and accumulator matrix. - Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. 
Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"
- In the future looking APIs, `get_wi_slice` (that is currently under design) returns an owned object. Should this return a view object to make sure the original matrix C is changed after its slices are modified.
+- multi_ptr can be constructed from T* since https://github.com/intel/llvm/pull/1183. However, currently this cannot be used with USM in all cases.
+ It is expected that eventually the `joint_matrix_load` and `joint_matrix_store` interfaces will be fully compatible with USM.

## TODO List

- Add support for fill matrix and element-wise operations features
-- Add 'matrix_use' parameter to the matrix to distinguish between matrix A, B, and matrix accumulator. This is necessary for supporting VNNI and transpose transform
- Change the names default sizes in the query from defaultM, defaultN, defaultK to M,N,K
- Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes like workgroup besides subgroups
- Add a more realistic and complete example that shows the value of the general query
+Tensor Cores:
+
+- Clarify USM compatibility
+- Add support for other combinations, the query interface, and consider how the future-looking API can be added here
+
+AMX and DPAS:
+
+- Add 'matrix_use' parameter to the matrix to distinguish between matrix A, B, and matrix accumulator. This is necessary for supporting VNNI and transpose transform

## Revision History

@@ -599,4 +766,5 @@ We did not utilize this extension for this matrix API version because sub-group
|Rev |Date |Author |Changes
|1 |2021-04-13 |Dounia Khaldi |Initial public working draft.
|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS
+|3 |2021-11-08 |Jack Kirk |Initial AOT use case on Nvidia® Tensor Cores
|======================
diff --git a/sycl/doc/extensions/README.md b/sycl/doc/extensions/README.md
index 5e17f716c845..a0041cf970d6 100755
--- a/sycl/doc/extensions/README.md
+++ b/sycl/doc/extensions/README.md
@@ -42,7 +42,7 @@ DPC++ extensions status:
| [Invoke SIMD](InvokeSIMD/InvokeSIMD.asciidoc) | Proposal | |
| [Uniform](Uniform/Uniform.asciidoc) | Proposal | |
| [Assert](Assert/SYCL_ONEAPI_ASSERT.asciidoc) | Proposal | |
-| [Matrix](Matrix/dpcpp-joint-matrix.asciidoc) | Partially supported(AMX AOT) | Not supported: dynamic-extent, wg and wi scopes, layouts other than packed|
+| [Matrix](Matrix/dpcpp-joint-matrix.asciidoc) | Partially supported (AMX, DPAS, Tensor Cores (AOT)) | Not supported: AMX, DPAS, Tensor Cores: dynamic-extent, wg and wi scopes; Not supported: AMX, DPAS: layouts other than packed; Tensor Cores: the only supported data type is double|
| [SYCL_INTEL_free_function_queries](FreeFunctionQueries/SYCL_INTEL_free_function_queries.asciidoc) | Supported (experimental) | |
| [EXT_ONEAPI_max_work_groups](MaxWorkGroupQueries/max_work_group_query.md) | Supported | |
| [SYCL_EXT_ONEAPI_DEVICE_GLOBAL](DeviceGlobal/SYCL_INTEL_device_global.asciidoc) | Proposal | |