diff --git a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
index f7394df705a6a..f3b96a8827ea0 100644
--- a/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
+++ b/sycl/doc/extensions/Matrix/dpcpp-joint-matrix.asciidoc
@@ -33,11 +33,11 @@ SYCL specification refer to that revision.

**_NOTE:_** _This document describes the current design and API for the matrix
extension to {dpcpp}. This is an initial experimental version to try out functionality
-and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX). We are going to work with the community on incrementally improving
+and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support for the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. We are going to work with the community on incrementally improving
the API to bring it closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._

## Introduction
-This document presents an ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: AMX in Intel CPU, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, IBM Power MMA. All these hardware provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefit from the maximum performance these different hardware can offer.
+This document presents ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and gemm cores, Nvidia TPUs, and IBM Power MMA. All of these hardware architectures provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance these different architectures can offer.

## Feature test macro
@@ -52,15 +52,17 @@ value to determine which of the extension's APIs the implementation supports.

[frame="none",options="header"]
|======================
|Value |Description
-|1 |Initial extension implementation on AMX. Base features are supported.
+|1 |Initial extension implementation on Intel AMX. Base features are supported.
+|2 |Initial extension JIT implementation on Intel AMX and DPAS. Load, store, mad, and the query interface are supported.
|======================

## New `joint_matrix` class
We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, the shape, the memory layout, and the memory scope of the matrix. This results in the following description:

```c++
-namespace sycl::ext::intel::experimental::matrix {
-template <typename Group, typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent, matrix_layout Layout = matrix_layout::row_major>
+namespace sycl::ext::oneapi::experimental::matrix {
+template <typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent, matrix_layout Layout = matrix_layout::row_major, typename Group = sycl::sub_group>
struct joint_matrix {
  joint_matrix(Group g) {}
};
@@ -76,11 +78,11 @@ IMPORTANT: In the current implementation, only the subgroup scope is supported

When the group is a `sycl::sub_group`, a matrix is declared as follows:

```c++
-joint_matrix<sub_group, int, tM, tN> tA(sg);
+joint_matrix<int, tM, tN> tA(sg);
```

#### Shape
-The same class `joint_matrix` should handle both cases where sizes are constant (GPU case) and when sizes are variables (CPU case). Note that a AMX 2d tile register permits sizes up to 1024 (16rowsx64cols) bytes. The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes.
+The same class `joint_matrix` should handle both cases, where sizes are constant (the GPU case) and where sizes are variable (the CPU case). Note that an Intel AMX 2d tile register permits sizes up to 1024 bytes (16 rows x 64 columns). The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU while at the same time saving resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes.
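+For example (an illustrative sketch, using the `joint_matrix` template above):
+```c++
+// Shape fixed at compile time (the GPU case): a 16 x 32 tile of 8-bit integers.
+joint_matrix<int8_t, 16, 32> tA(sg);
+// Leaving Rows/Cols defaulted to sycl::dynamic_extent would instead carry the
+// shape at run time (the CPU case); see the IMPORTANT note below.
+```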

IMPORTANT: In the current implementation, only the static extent is supported

@@ -89,7 +91,7 @@ IMPORTANT: In the current implementation, only the static extent is supported

Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts.

```c++
-namespace sycl::ext::intel::experimental::matrix {
+namespace sycl::ext::oneapi::experimental::matrix {
enum class matrix_layout {
  row_major,
  col_major,
@@ -99,10 +101,10 @@ enum class matrix_layout {
}
```

-AMX hardware requires B matrix to be in VNNI or 32 bits packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N). The logical sizes are M, K, N. However, the packed shape for B tile uses the VNNI format, which is described below. The user must provide the information of packed_b layout to make the implementation allocate the right shape. The layout information for AMX should be specified in user code as follows:
+Intel AMX and DPAS hardware require the B matrix to be in VNNI or 32-bit packed layout. Consider multiplying matrices A (M, K) and B (K, N) into a matrix C (M, N); the logical sizes are M, K, and N. However, the packed shape for the B tile uses the VNNI format, which is described below. The user must provide the packed_b layout information so that the implementation allocates the right shape. The layout information for Intel AMX should be specified in user code as follows:

```c++
-joint_matrix<sub_group, int8_t, tK, tN, matrix_layout::packed_b> tB(sg);
+joint_matrix<int8_t, tK, tN, matrix_layout::packed_b> tB(sg);
```

IMPORTANT: In the current implementation, only the `packed_b` layout needs to be specified, and only on matrix B; the layout on the other matrices is ignored.

@@ -113,7 +115,7 @@ We define three new functions needed to perform the main and common operations o
The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row major (`row_major`) or column major (`column_major`) fashion, or whether the data have already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for the row major and packed layouts, and between consecutive columns for the column major layout.

-Note that for getting maximum performance on AMX, prepacking data in the memory is necessary. If users did not specify the packed layouts (`packed_a` in column major case, `packed_b` in row major case), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose these layouts `packed_a` and `packed_b` to the user to specify that A and/or B have already been VNNIed. The packed or VNNI layout is introduced in `VNNI layout` section below.
+Note that to get maximum performance on Intel AMX and DPAS, prepacking data in memory is necessary. If users do not specify the packed layouts (`packed_a` when matrix `C` is column major, `packed_b` when matrix `C` is row major), the transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the layouts `packed_a` and `packed_b` so the user can specify that A or B has already been VNNIed. The packed or VNNI layout is introduced in the `VNNI layout` section below.

IMPORTANT: In the current implementation, the layout in the load of matrix B must be `packed_b`. Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout. The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.

@@ -121,29 +123,29 @@ Since the matrix functions are group operations (as defined in Section 4.17.3 of
To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following:

-IMPORTANT: In the current implementation, only the subgroup scope is supported. Moreover, a kernel using this extension must be decorated with the [[sycl::reqd_sub_group_size(1)]] attribute.
+IMPORTANT: In the current implementation, only the subgroup scope is supported.

#### Load
```c++
-namespace sycl::ext::intel::experimental::matrix {
+namespace sycl::ext::oneapi::experimental::matrix {
  template <typename Group, typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, access::address_space Space>
-  void joint_matrix_load(Group sg, joint_matrix<Group, T, NumRows, NumCols, Layout> &res,
-        multi_ptr<T, Space> src, size_t stride, matrix_layout layout = matrix_layout::row_major);
+  void joint_matrix_load(Group sg, joint_matrix<T, NumRows, NumCols, Layout, Group> &res,
+        multi_ptr<T, Space> src, size_t stride, matrix_layout MemLayout);
}
```
-This function loads data from memory to the 2d tiles of AMX that is a 2d storage.
+This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS.

#### Store
```c++
-namespace sycl::ext::intel::experimental::matrix {
+namespace sycl::ext::oneapi::experimental::matrix {
  template <typename Group, typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, access::address_space Space>
-  void joint_matrix_store(Group sg, joint_matrix<Group, T, NumRows, NumCols, Layout> &res,
-        multi_ptr<T, Space> src, size_t stride, matrix_layout layout = matrix_layout::row_major);
+  void joint_matrix_store(Group sg, joint_matrix<T, NumRows, NumCols, Layout, Group> &res,
+        multi_ptr<T, Space> src, size_t stride, matrix_layout memL);
}
```
This function stores the data from the 2d tiles back to memory.

@@ -151,20 +153,20 @@ This function stores the data from the 2d tiles back to memory.

#### Multiply and Add

```c++
-namespace sycl::ext::intel::experimental::matrix {
-  template <typename Group, typename Ta, typename Tb, typename Tc, std::size_t M, std::size_t K, std::size_t N, matrix_layout La, matrix_layout Lb, matrix_layout Lc>
-  joint_matrix<Group, Tc, M, N, Lc> joint_matrix_mad(Group sg, joint_matrix<Group, Ta, M, K, La> A,
-        joint_matrix<Group, Tb, K, N, Lb> B, joint_matrix<Group, Tc, M, N, Lc> C);
+namespace sycl::ext::oneapi::experimental::matrix {
+  template <typename Group, typename Ta, typename Tb, typename Tc, std::size_t M, std::size_t K, std::size_t N, matrix_layout La, matrix_layout Lb, matrix_layout Lc>
+  joint_matrix<Tc, M, N, Lc, Group> joint_matrix_mad(Group sg, joint_matrix<Ta, M, K, La, Group> A,
+        joint_matrix<Tb, K, N, Lb, Group> B, joint_matrix<Tc, M, N, Lc, Group> C);
}
```

The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result.

## VNNI/Packed Layout
-AMX compute assumes register for B tile (src1) to be in VNNI format as they need 32bit of K-data in A and B to be contiguous in memory.
+Intel AMX and DPAS compute assume the register for the B tile (src1) to be in VNNI format, as they need 32 bits of K-data in A and B to be contiguous in memory.
The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has already been packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform. The following example illustrates how a matrix in `row_major` layout is transformed into the `packed_b` layout for a 16-bit type.

#### Example 1: 16-bit elements
@@ -202,11 +204,10 @@

## Example using int8_t type
```c++
-using namespace sycl::ext::intel::experimental::matrix;
+using namespace sycl::ext::oneapi::experimental::matrix;

queue q;
-range<2> G = {M, N};
-// For this first implementation, SG_SIZE has to be equal to one
+range<2> G = {M/tM, N};
range<2> L = {1, SG_SIZE};
int8_t *memA = malloc_shared<int8_t>(M*K, q);
int8_t *memB = malloc_shared<int8_t>(K*N, q);
@@ -219,65 +220,350 @@ q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)
   const auto sg_startx = global_idx - item.get_local_id(0);
   const auto sg_starty = global_idy - item.get_local_id(1);
   sub_group sg = item.get_sub_group();
-  joint_matrix<sub_group, int8_t, tM, tK> tA(sg);
+  joint_matrix<int8_t, tM, tK> tA(sg);
   // For B, since current implementation does not support non packed layout,
-  // users need to specify the updated VNNI sizes along with the packed_b layout
-  joint_matrix<sub_group, int8_t, tK/4, tN*4, matrix_layout::packed_b> tB(sg);
-  joint_matrix<sub_group, int32_t, tM, tN> tC(sg);
-  joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major);
+  // users need to specify the packed_b layout
+  joint_matrix<int8_t, tK, tN, matrix_layout::packed_b> tB(sg);
+  joint_matrix<int32_t, tM, tN> tC(sg);
+  joint_matrix_load(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, matrix_layout::row_major);
   for (int k = 0; k < K; k += tK) {
     joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major);
-    joint_matrix_load(sg, tB, memB + k * N + sg_starty, N, matrix_layout::packed_b); // VNNI
+    joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, matrix_layout::packed_b); // VNNI
     tC = joint_matrix_mad(sg, tA, tB, tC);
   }
-  joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty, N, matrix_layout::row_major);
+  joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, matrix_layout::row_major);
}).wait();
+```
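+The example above assumes that `memB` already holds B in the VNNI (`packed_b`) layout. The helper below is an illustrative host-side packing sketch, not part of this extension; `VF` stands for the VNNI blocking factor described earlier (4 for the 8-bit types used here):
+```c++
+// Pack a K x N row-major matrix B into the packed_b (VNNI) layout:
+// element B[k][n] lands at B_packed[k / VF][n * VF + k % VF], where
+// B_packed is a (K/VF) x (N*VF) row-major matrix.
+template <typename T, size_t VF>
+void pack_b_vnni(const T *B, T *B_packed, size_t K, size_t N) {
+  for (size_t k = 0; k < K; ++k)
+    for (size_t n = 0; n < N; ++n)
+      B_packed[(k / VF) * (N * VF) + n * VF + k % VF] = B[k * N + n];
+}
+// For the int8_t example above (srcB is a hypothetical unpacked source):
+// pack_b_vnni<int8_t, 4>(srcB, memB, K, N);
+```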
+
+== Query Interface
+Intel AMX, DPAS, and Nvidia TPUs support different sizes and types.
+The query interface is used to validate user code and to inform the user about the types, sizes, scopes, and layouts supported by the implementation.
+This also improves development and tuning productivity for both scientists and library developers. The query interface we are proposing here is a compile-time query,
+so there will be no runtime errors.
+The query interface proposed here consists of three functionalities:
+
+- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters.
+
+- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/c`, where no additional argument is needed. This form is used when the user specifies all template parameters except the sizes of the matrices (tiles) M, N, and K.
+
+- General query: the general query interface provides information about the sizes, types, static/dynamic behavior, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and for efficient code generation if used by a library. The general query returns an array `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns either `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize`, depending on whether the implementation supports a continuous or a discrete set of sizes. For example, the Intel AMX implementation supports a continuous range of sizes, so the `max_*` variant applies and only the maximum size is returned. The DPAS implementation, on the other hand, supports a discrete list of sizes, so the `msize, nsize, ksize` variant applies. This form takes place when users only specify the TPU they are interested in using.
+
+The table below provides a description for each of the member variables and type aliases in the `tpu_params` class and the forms in which they are defined.
+
+[frame="none",options="header"]
+|======================
+|Member/type alias in `tpu_params` |Forms they are defined in |Description
+|`type_a`| validation, default values|type alias for the type of matrix A
+|`type_b`| validation, default values|type alias for the type of matrix B
+|`type_c`| validation, default values|type alias for the type of matrix C
+|`defaultM`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation
+|`defaultN`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation
+|`defaultK`| validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation
+|`joint_matrix_a`| validation, default values|type alias for `joint_matrix` for matrix A
+|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B
+|`joint_matrix_c`| validation, default values| type alias for `joint_matrix` for matrix C
+|`dynamic_p`| validation, default values, general query| a boolean that indicates whether the implementation supports dynamic sizes (true) or not (false)
+|`numtiles`| validation, default values, general query|indicates the number of tiles in Intel AMX (does not apply to DPAS)
+|`scope`| validation, default values, general query| indicates the memory and execution scope supported by the TPU implementation
+|`combination` | validation, default values, general query|describes the types and sizes of the A, B, and C matrices allowed in one combination
+|`max_msize`, `max_nsize`, `max_ksize`| validation, default values, general query| if the TPU implementation supports a continuous range of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete set of element sizes, each of these members has the value zero
+|`msize`, `nsize`, `ksize`| validation, default values, general query| if the TPU implementation supports a discrete set of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. By contrast, if the TPU supports a continuous range of element sizes, each of these members has the value zero
+|`atype`, `btype`, `ctype`| validation, default values, general query| indicate the types supported in the combination
+|`combinations` | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided. In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes, so only this specific combination is returned in the combinations array.
+|`num_combinations`| validation, default values, general query|indicates the number of combinations supported by the TPU implementation, which corresponds to the size of the `combinations` array
+|======================


```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+
+template <tpu u, typename Ta = void, typename Tb = void, typename Tc = void,
+          int M = 0, int N = 0, int K = 0, typename Enabled = void>
+struct tpu_params;
+
+// Validation form: Valid or not
+// Specialization when both types and sizes are given
+template <typename Ta, typename Tb, typename Tc, int M, int N, int K>
+struct tpu_params<
+    tpu::amx, Ta, Tb, Tc, M, N, K,
+    typename std::enable_if<(
+        !std::is_same_v<Ta, void> && !std::is_same_v<Tb, void> &&
+        !std::is_same_v<Tc, void> && M != 0 && N != 0 && K != 0)>::type> {
+  // Validate that parameters are supported
+  static_assert(
+      (M == 0 && N == 0 && K == 0) ||
+          (is_combination_valid_amx<Ta, Tb, Tc>(M, N, K)),
+      "Invalid parameters for Intel AMX, query valid types and maximum sizes "
+      "using: "
+      "tpu_params<tpu::amx> myparams; and then check out myparams.combinations array");
+
+  using type_a = Ta; // this type alias is not available in the current implementation
+  using type_b = Tb; // this type alias is not available in the current implementation
+  using type_c = Tc; // this type alias is not available in the current implementation
+
+  // if the combination is valid, construct the matrices
+  static constexpr std::size_t defaultM = (M != 0) ? M : 16;
+  static constexpr std::size_t defaultN = (N != 0) ? N : 16;
+  static constexpr std::size_t defaultK =
+      (K != 0) ? K : ((sizeof(Ta) == 1) ? 64 : 32);
+
+  template <typename Group>
+  using joint_matrix_a = joint_matrix<Ta, defaultM, defaultK, matrix_layout::row_major, Group>;
+  template <typename Group>
+  using joint_matrix_b = joint_matrix<Tb, defaultK, defaultN, matrix_layout::packed_b, Group>;
+  template <typename Group>
+  using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, matrix_layout::row_major, Group>;
+
+  static constexpr bool dynamic_p = false; // should be true in future implementations
+                                           // because Intel AMX hardware supports dynamic sizes
+  static constexpr uint32_t numtiles = 8;
+  static constexpr scope_t scope = scope_t::sub_group;
+  struct combination {
+    uint32_t max_msize;
+    uint32_t max_nsize;
+    uint32_t max_ksize;
+    uint32_t msize;
+    uint32_t nsize;
+    uint32_t ksize;
+    matrix_type atype;
+    matrix_type btype;
+    matrix_type ctype;
+  };
+  // In this case, the combinations array contains only the combination that the user provided
+  static constexpr combination combinations[] = {
+      {16, 16, (sizeof(Ta) == 1) ? 64 : 32, M, N, K}};
+  static constexpr int num_combinations =
+      sizeof(combinations) / sizeof(combination);
+};
+
+// Default values form: Sizes-only query
+// Specialization for when only types are given, need to query only sizes
+template <typename Ta, typename Tb, typename Tc>
+struct tpu_params<tpu::amx, Ta, Tb, Tc, 0, 0, 0,
+                  typename std::enable_if<(!std::is_same_v<Ta, void> &&
+                                           !std::is_same_v<Tb, void> &&
+                                           !std::is_same_v<Tc, void>)>::type> {
+  static_assert((are_types_valid_amx<Ta, Tb, Tc>()),
+                "Invalid types for Intel AMX, supported types are int8_t, uint8_t, "
+                "and bf16 (Note that unsigned short should be used in the "
+                "DPC++ code to implement bf16)");
+
+  using type_a = Ta; // this type alias is not available in the current implementation
+  using type_b = Tb; // this type alias is not available in the current implementation
+  using type_c = Tc; // this type alias is not available in the current implementation
+
+  // construct the matrices using the default sizes
+  static constexpr std::size_t defaultM = 16;
+  static constexpr std::size_t defaultN = 16;
+  static constexpr std::size_t defaultK = ((sizeof(Ta) == 1) ? 64 : 32);
+
+  template <typename Group>
+  using joint_matrix_a = joint_matrix<Ta, defaultM, defaultK, matrix_layout::row_major, Group>;
+  template <typename Group>
+  using joint_matrix_b = joint_matrix<Tb, defaultK, defaultN, matrix_layout::packed_b, Group>;
+  template <typename Group>
+  using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, matrix_layout::row_major, Group>;
+
+  static constexpr bool dynamic_p = false; // should be true in future implementations because
+                                           // Intel AMX hardware supports dynamic sizes
+  static constexpr uint32_t numtiles = 8;
+  static constexpr scope_t scope = scope_t::sub_group;
+  struct combination {
+    uint32_t max_msize;
+    uint32_t max_nsize;
+    uint32_t max_ksize;
+    uint32_t msize;
+    uint32_t nsize;
+    uint32_t ksize;
+    matrix_type atype;
+    matrix_type btype;
+    matrix_type ctype;
+  };
+  // In this case, the combinations array contains only the combinations that correspond to the Ta, Tb, and Tc
+  // types that the user provided
+  static constexpr combination combinations[] = {
+      {16, 16, (sizeof(Ta) == 1) ? 64 : 32}};
+  static constexpr int num_combinations =
+      sizeof(combinations) / sizeof(combination);
+};
+
+// General query form:
+// types are not given, no default sizes and no implicit matrix construction
+template <int M, int N, int K>
+struct tpu_params<tpu::amx, void, void, void, M, N, K> {
+  static constexpr bool dynamic_p = false; // should be true in future implementations because
+                                           // Intel AMX hardware supports dynamic sizes
+  static constexpr uint32_t numtiles = 8;
+  static constexpr scope_t scope = scope_t::sub_group;
+  struct combination {
+    uint32_t max_msize;
+    uint32_t max_nsize;
+    uint32_t max_ksize;
+    uint32_t msize;
+    uint32_t nsize;
+    uint32_t ksize;
+    matrix_type atype;
+    matrix_type btype;
+    matrix_type ctype;
+  };
+  static constexpr combination combinations[] = {
+      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32},
+      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32},
+      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32},
+      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32},
+      {16, 16, 32, 0, 0, 0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}};
+  static constexpr int num_combinations =
+      sizeof(combinations) / sizeof(combination);
+};
+
+
+enum class tpu {
+  dpas,
+  amx
+};
+
+enum class matrix_type {
+  bf16,
+  fp16,
+  fp19, // tfloat32
+  fp32,
+  fp64,
+  sint2,
+  sint4,
+  sint8,
+  sint16,
+  sint32,
+  sint64,
+  uint2,
+  uint4,
+  uint8,
+  uint16,
+  uint32,
+  uint64
+};
+
+enum class scope_t {
+  sub_group,
+  work_group
+};
+}
```
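+As an illustration of how the `combinations` array of the general query form can be consumed, the following compile-time check is a sketch only; the `amx_shape_supported` helper is hypothetical and not part of this extension. It relies on the Intel AMX convention described above, namely that the `max_*` members are the non-zero ones:
+```c++
+// Hypothetical helper: search the general-query combinations for Intel AMX
+// (continuous sizes, so max_msize/max_nsize/max_ksize are the non-zero members).
+constexpr bool amx_shape_supported(unsigned M, unsigned N, unsigned K) {
+  using params = tpu_params<tpu::amx>;
+  for (int i = 0; i < params::num_combinations; i++) {
+    const auto &c = params::combinations[i];
+    if (M <= c.max_msize && N <= c.max_nsize && K <= c.max_ksize)
+      return true;
+  }
+  return false;
+}
+static_assert(amx_shape_supported(16, 16, 64), "unsupported Intel AMX shape");
+```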
-## Implementation Status
-For oneAPI release 3, an AOT implementation is available on the CPU device to targets AMX hardware. we are using AMX tile intrinsics to implement the matrix load and store operations. Since we are currently emitting AMX intrinsics directly, this only enables AOT compilation.
-Currently, this is the compilation command line needed to invoke AMX unit of Sapphire Rapids CPU:

+=== Validation Example:
```c++
-clang++ -fsycl -march=sapphirerapids fsycl-targets="spir64_x86_64-unknown-linux" -O2 matmul-int8.cpp -o matmul-int8
+// User can provide sizes besides the types, and tpu_params can assert whether they are supported or not
+// in this case, an assertion will happen as 16 is not a supported size for M
+using myparams = tpu_params<tpu::dpas, int8_t, int8_t, int, 16, 8, 32>;
+size_t NDRangeM = M / myparams::defaultM; // Assertion would happen at this line
+size_t NDRangeN = N / myparams::defaultN;
```

-Please refer to the section "Future Implementation Work" that talks about the future unified SPIR-V path that will enable JIT compilation.

+=== Default Values Example:
+```c++
+using myparams = tpu_params<tpu::amx, int8_t, int8_t, int>;
+// use this to construct the ranges on the host side
+size_t NDRangeM = M / myparams::defaultM;
+size_t NDRangeN = N / myparams::defaultN;
+// if M, N, K are not multiples of the default sizes, padding has to be done
+// device code: the matrices are constructed using the default dimensions
+myparams::joint_matrix_a<sub_group> sub_a(sg);
+myparams::joint_matrix_b<sub_group> sub_b(sg);
+myparams::joint_matrix_c<sub_group> sub_c(sg);
+```

-### Current Implementation Restrictions
-This section summarizes the specific features that this implementation supports. In future versions of this API and implementation, the expectation is to provide a query interface to guide the usage of this API.

-#### Type, Sizes, and Layouts
-The types supported by this AMX implementation are restricted to the types that AMX hardware support. Although the AMX hardware supports 2d tiles with a maximum size of 16x64 bytes, this current implementation can handle any size. If the matrix size is bigger than 1024 bytes, it will be stored in memory rather than mapped to a 2d tile. Performance penalty may occur in this case. In order to get the best performance with this implementation, matrix sizes should be no larger than 16x64 bytes and B matrix should be already packed (put in VNNI format).

+=== General Query Example:
+```c++
+constexpr int M = 1500; // with msize = 8 and msize = 4,
+                        // M can be broken up into 125 sequences of 8-sized ops,
+                        // and the remaining 500 into 125 sequences of 4-sized ops
+tpu_params<tpu::dpas> params;
+constexpr int msize = break_dimension(params, M);
+constexpr int msize_remainder = break_dimension_remainder(params, M);
+constexpr int nsize = params.combinations[0].nsize;
+constexpr int ksize = params.combinations[0].ksize;
+// device code:
+joint_matrix<int8_t, msize, ksize> sub_a(sg);
+joint_matrix<int8_t, ksize, nsize, matrix_layout::packed_b> sub_b(sg);
+joint_matrix<int, msize, nsize> sub_c(sg);
+// Remainder handling
+```
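+`break_dimension` and `break_dimension_remainder` above are placeholders. One possible sketch of them is shown below; these helpers are hypothetical, not part of this extension, and assume a DPAS specialization of `tpu_params` that reports discrete `msize` values, at least one of which fits:
+```c++
+// Hypothetical helper: largest discrete msize reported by the query that fits in dim.
+template <typename Params>
+constexpr int break_dimension(Params p, int dim) {
+  int best = 0;
+  for (int i = 0; i < Params::num_combinations; i++) {
+    int s = p.combinations[i].msize;
+    if (s != 0 && s <= dim && s > best)
+      best = s;
+  }
+  return best;
+}
+// Hypothetical helper: msize used for the rows left over after tiling by break_dimension.
+template <typename Params>
+constexpr int break_dimension_remainder(Params p, int dim) {
+  int tail = dim % break_dimension(p, dim);
+  return tail == 0 ? 0 : break_dimension(p, tail);
+}
+```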

-More specifically, the following operation C = A*B+C can be performed on AMX with this interface where:
-A(int8, any-size, row_major), B(int8, any-size, packed_b), C(int32, any-size, row_major)
-or
-A(bf16, any-size, row_major), B(bf16, any-size, packed_b), C(float, any-size, row_major).
-No other types or layouts are supported at this time.

-#### Memory and Execution Scope
-This current implementation only considers a sub-group scope. However, the sub-group size has to be equal to one in this first implementation. In this case, a kernel using this extension must be decorated with the [[sycl::reqd_sub_group_size(1)]] attribute.

+// No need to provide more details in this section because the query interface can serve this.
+//## Implementation Status
+//### oneAPI 2022.0 release
+//For the oneAPI 2022.0 release, a JIT implementation has been made available on both Intel AMX and DPAS hardware of the specific features discussed above. In this case, there is no need to specify any architectural options on the command line. The static query interface can be used to guide the usage of this API.
+// The DPAS and Intel AMX implementations support the logical capability of the HW

+## Future-looking API
-## Future Implementation Work

### Matrix Initialization: `joint_matrix_fill`
+The current interface presented above assumes that all the matrices are directly loaded from memory. The new function `joint_matrix_fill` makes it possible to multiply a matrix that is not loaded from memory but rather initialized directly in registers. On Intel AMX, if the initialization constant is zero, this would map to the `_tile_zero` intrinsic:

-### Unified LLVM IR and SPIRV JIT Enabling
-To enable JIT compilation, a unified matrix IR needs to be added. Currently, there is no matrix type in LLVM IR or SPIR-V. We are working towards adding a new matrix type in both LLVM IR and SPIR-V. This JIT enabling is expected to be part of a future compiler release.

+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+  template <typename Group, typename T, size_t NumRows, size_t NumCols, matrix_layout Layout>
+  void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, Layout, Group> &m, const T& v);
+}
+```
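+A usage sketch (illustrative only; `tM` and `tN` are the tile sizes from the earlier int8_t example): the accumulator is zero-initialized directly in registers instead of being loaded from memory, and can then be passed to `joint_matrix_mad` as usual:
+```c++
+// Zero-initialize the C accumulator without a memory load; on Intel AMX a
+// zero constant could lower to the _tile_zero intrinsic.
+joint_matrix<int32_t, tM, tN> tC(sg);
+joint_matrix_fill(sg, tC, int32_t(0));
+```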

-#### LLVM IR Extension
-As a short-term solution, we are extending the https://llvm.org/docs/LangRef.html#llvm-matrix-transpose-intrinsic[existing LLVM IR matrix intrinsics] to include features like VNNI layout. The current matrix intrinsics use flattened vectors to represent the matrix. Therefore, we are exploring both adding matrix type to LLVM IR and also using MLIR `vector` dialect for this work.

+### Element Indexing and Element-Wise Operations
+There are multiple options on how to enable this feature.

-#### SPIR-V Extension
-The current draft proposal can be found https://gitlab.devtools.intel.com/OpenCL/opencl-extension-drafts/-/blob/master/SPV_INTEL_matrix.asciidoc[here].
-We are adding translation from LLVM IR matrix to SPIR-V matrix and vice versa in the LLVM to SPIR-V translator tool.

+#### Option 1: Non-restrictive element indexing
+Allowing non-restrictive element indexing on the matrix elements, as shown below, would result in slow indexing on the GPU. Besides, it would rely heavily on SPIR-V and compiler vectorization:

-## Future-looking API

+```c++
+matrix<int, 8, 8> C;
+for (int i = 0; i < 8; i++)
+  for (int j = 0; j < 8; j++)
+    C(i,j) *= alpha; // Align with mdspan
+```

+#### Option 2: Restrictive fast element indexing
+In the DPC++ context, the expectation is that all element-wise operations will happen in a converged control path by all work items in the group.
+Option 2 proposes a new set of element-wise operations by overloading existing operations to work on the `matrix` object. An example is shown below:
+```c++
+joint_matrix<int, 8, 8> C(sg);
+C *= alpha;
+```
+The problem with this option is that it is restricted to a very limited set of operations.

+#### Option 3: Restrictive conversion in the interface from SIMD to SPMD
+The Nvidia wmma interface added a new member to the `fragment` class to designate the WI-owned part of the matrix.
+While this provides fast element indexing on the GPU compared to the non-restrictive option, the user does not know the mapping of the owned data to the original matrix. However, when using the `mma` ptx instructions as opposed to the `wmma` ptx instructions, the mapping is known. Knowing this mapping is important for the user to implement new operations, like the sum of the rows of a matrix for quantized algorithms.

+#### Proposal: Explicit conversion in the interface from SIMD to SPMD
+We introduce a new function `get_wi_slice` that provides any portion of the matrix that the user wants, but in an SPMD array object:

+```c++
+namespace sycl::ext::oneapi::experimental::matrix {
+template <size_t Size, typename Group, typename T, size_t NumRows, size_t NumCols, matrix_layout Layout>
+  marray<T, Size> get_wi_slice(joint_matrix<T, NumRows, NumCols, Layout, Group> &m, size_t row_index,
+                               size_t col_index, size_t n_rows, size_t n_cols);
+}
+```

+Example where each WI gets 1 column:
+```c++
+marray<int, msize> wi_C = get_wi_slice<msize>(C, 0, wi_idx, msize, 1);
+for (int i = 0; i < msize; i++)
+  row_sum += wi_C[i];
+```

### Memory scope
@@ -292,14 +578,19 @@ We did not utilize this extension for this matrix API version because sub-group

## Open Questions
- Besides row, col major and packed (VNNI) layout, what are the additional layouts that should absolutely be added?
-- Are there alternative names for the `packed_a` and `packed_b` layouts that would be clearer to distinguish between the VNNI Layout in matrix A and VNNI layout in matrix B of a matrix multiply and add operation on AMX?
+- Are there alternative names for the `packed_a` and `packed_b` layouts that would more clearly distinguish between the VNNI layout in matrix A and the VNNI layout in matrix B of a matrix multiply and add operation on Intel AMX?
+-- Yes, this will be addressed in the next revision, where a `use` argument will be introduced to distinguish between the right (B), left (A), and accumulator matrices.
- Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"
-- What should the API description include: (1) only features that are implemented, (2) features that are actually part of the API: currently implemented and the ones that we expect implementing them in the future. Specifically, should the document include things like dynamic_ extent and Group? These are part of the API but are not currently implemented.

+- In the future-looking APIs, `get_wi_slice` (currently under design) returns an owned object. Should it instead return a view object, so that modifying the slices also changes the original matrix C?

## TODO List
-- Handle sub group sizes that are bigger than one.
-- Add support for queries that gives information about the capabilities of the implementation on a particular device.
-- Once the SPIRV translator work is done, this code generation work will move to the backend along enabling JIT compilation.
+- Add support for the matrix fill and element-wise operations features
+- Add a `matrix_use` parameter to the matrix to distinguish between matrix A, matrix B, and the accumulator matrix. This is necessary for supporting the VNNI and transpose transforms
+- Change the names of the default sizes in the query from defaultM, defaultN, defaultK to M, N, K
+- Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes, like work-group, besides sub-group
+- Add a more realistic and complete example that shows the value of the general query

## Revision History

@@ -307,4 +598,5 @@
[frame="none",options="header"]
|======================
|Rev |Date |Author |Changes
|1 |2021-04-13 |Dounia Khaldi |Initial public working draft.
+|2 |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS.
|======================