diff --git a/sycl/doc/KernelProgramCache.md b/sycl/doc/KernelProgramCache.md
new file mode 100644
index 0000000000000..88d1e11881ce8
--- /dev/null
+++ b/sycl/doc/KernelProgramCache.md
@@ -0,0 +1,248 @@
+# A brief overview of the kernel and program caching mechanism
+
+## Rationale behind caching
+
+During SYCL program execution, the SYCL runtime creates internal objects
+representing kernels and programs; it may also invoke the JIT compiler to
+bring the kernels in a program to an executable state. These runtime
+operations are quite expensive, and in some cases a caching approach can be
+employed to eliminate redundant kernel or program object re-creation and
+online recompilation. The examples below illustrate scenarios where such
+optimization is possible.
+
+*Use-case #1.* Submission of the same kernel in a loop:
+```C++
+  using namespace cl::sycl;
+
+  queue Q;
+  std::vector<buffer<int, 2>> Bufs;
+
+  ...
+  // initialize Bufs with some number of buffers
+  ...
+
+  for (size_t Idx = 0; Idx < Bufs.size(); ++Idx) {
+    Q.submit([&](handler &CGH) {
+      auto Acc = Bufs[Idx].get_access<access::mode::read_write>(CGH);
+
+      CGH.parallel_for<class TheKernel>(
+          range<2>{N, M}, [=](item<2> Item) { ... });
+    });
+  }
+```
+
+*Use-case #2.* Submission of multiple kernels within a single program[1](#what-is-program):
+```C++
+  using namespace cl::sycl;
+
+  queue Q;
+
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for<class TheKernel_1>(
+        range<2>{N_1, M_1}, [=](item<2> Item) { ... });
+  });
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for<class TheKernel_2>(
+        range<2>{N_2, M_2}, [=](item<2> Item) { ... });
+  });
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for<class TheKernel_3>(
+        range<2>{N_3, M_3}, [=](item<2> Item) { ... });
+  });
+  ...
+  Q.submit([&](handler &CGH) {
+    ...
+
+    CGH.parallel_for<class TheKernel_K>(
+        range<2>{N_K, M_K}, [=](item<2> Item) { ... });
+  });
+```
+
+In both cases the SYCL runtime will need to build the program and the kernels
+multiple times, which may involve JIT compilation and take quite a lot of
+time.
+
+To eliminate this waste of run time we introduce kernel and program caching.
+The cache is per-context, and it caches the underlying objects of non-interop
+kernels and programs that are built with no build options.
+
+<a name="what-is-program">1</a>: Here "program" means an internal SYCL runtime
+object corresponding to a SPIR-V module or native binary defining a set of SYCL
+kernels and/or device functions.
+
+
+## Data structure of cache
+
+The cache stores the underlying PI objects behind the `cl::sycl::program` and
+`cl::sycl::kernel` user-level objects in a per-context data storage. The
+storage consists of two maps: one for programs and the other for kernels.
+
+The program map's key consists of three components: the kernel set
+id[1](#what-is-ksid), the specialization constants' values, and the device the
+program is built for.
+
+The kernel map's key also consists of three components: the program the kernel
+belongs to, the kernel name[2](#what-is-kname), and the device the program is
+built for.
+
+<a name="what-is-ksid">1</a>: The kernel set id is the ordinal number of the
+device binary image the kernel is contained in.
+
+<a name="what-is-kname">2</a>: The kernel name is the mangled name of the class
+used as the kernel ID, which is provided to methods of `cl::sycl::handler`
+(e.g. `parallel_for` or `single_task`).
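+
+A sketch of the two maps described above follows. It is a simplified
+illustration only, not a quote from the sources: the PI handle types are
+stand-ins, the map and key types are assumptions, and the real value type is
+the `BuildResult` wrapper described later in this document.
+```C++
+#include <map>
+#include <string>
+#include <tuple>
+#include <vector>
+
+// Stand-ins for the opaque PI handles used by the real implementation.
+using PiProgramT = void *;
+using PiKernelT = void *;
+using PiDeviceT = void *;
+
+// Specialization constants' values, serialized into a byte blob so that
+// they can participate in key comparison.
+using SerializedObjT = std::vector<unsigned char>;
+
+// Program map: {kernel set id, spec. constants, device} -> built program.
+using ProgramKeyT = std::tuple<int, SerializedObjT, PiDeviceT>;
+std::map<ProgramKeyT, PiProgramT> CachedPrograms;
+
+// Kernel map: {program, kernel name, device} -> built kernel.
+using KernelKeyT = std::tuple<PiProgramT, std::string, PiDeviceT>;
+std::map<KernelKeyT, PiKernelT> CachedKernels;
+```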
+
+
+## Points of improvement (things to do)
+
+ - Implement an LRU policy on cached objects. See [issue](https://github.com/intel/llvm/issues/2517).
+ - Allow for caching of objects built with some build options.
+ - Employ the same built object for multiple devices of the same ISA,
+   capabilities and so on. *NOTE:* It's not really known if it's possible to
+   check whether two distinct devices are *exactly* the same. Probably this
+   should be an improvement request for the plugins.
+ - Improve testing: cover real use-cases. See the currently covered cases
+   [here](https://github.com/intel/llvm/blob/sycl/sycl/unittests/kernel-and-program/Cache.cpp).
+
+
+## Implementation details
+
+The caches are represented with an instance of the [`KernelProgramCache`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp)
+class. The runtime creates one instance of the class per distinct SYCL context
+(a context object which is the result of copying another context object isn't
+"distinct", as it corresponds to the same underlying internal object
+representing the context).
+
+The `KernelProgramCache` is essentially the pair of maps described above.
+
+
+### When does the cache come into play?
+
+The cache is used when one submits a kernel for execution or builds a program
+with the SYCL API. That means the cache works both when the user explicitly
+calls the `program::build_with_kernel_type<>()`/`program::get_kernel<>()`
+methods and when the SYCL RT builds a program or gets the required kernel as
+needed during application execution. Whether an object is cacheable can be
+tested with the `program_impl::is_cacheable()` method. The SYCL RT will only
+try to insert cacheable programs or kernels into the cache. This is done as a
+part of the `ProgramManager::getOrCreateKernel()` method.
+
+*NOTE:* A kernel is cacheable if and only if the program it belongs to is
+cacheable. On the other hand, if the program is cacheable, then each and every
+kernel of this program will be cached too.
+
+All requests to build a program or to create a kernel - whether they originate
+from explicit user API calls or from internal SYCL runtime execution logic -
+end up calling the function [`getOrBuild()`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
+with a number of lambda functions passed as arguments (see the sketch at the
+end of this section):
+ - Acquire function;
+ - GetCache function;
+ - Build function.
+
+The *Acquire* function returns a locked version of the cache. Locking is
+employed for thread safety. Threads are blocked only for the insert-or-acquire
+attempt, i.e. for the call to `map::insert` in the [`getOrBuild`](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/program_manager/program_manager.cpp#L149)
+function. The rest of the operation is done with the help of atomics and
+condition variables (plus a mutex for the proper work of the condition
+variable).
+
+The *GetCache* function returns a reference to the `key->value` mapping inside
+the locked instance of the cache. We will see the rationale behind it a bit
+later.
+
+The *Build* function actually builds the kernel or program.
+
+Caching isn't done:
+ - when the program is built out of source with the
+   `program::build_with_source()` or `program::compile_with_source()` method;
+ - when the program is the result of linking multiple programs.
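+
+As promised above, here is a sketch of the call shape. This is a toy model,
+not the actual `getOrBuild()` implementation: all types and names besides
+`getOrBuild` itself are made up for illustration, and the real function also
+implements the waiting logic described in the next section.
+```C++
+#include <map>
+#include <string>
+
+// Toy stand-ins: "int" stands in for a BuildResult, and the struct pretends
+// to hold a lock guard for the cache.
+struct LockedCacheT {
+  std::map<std::string, int> &Map;
+  std::map<std::string, int> &get() { return Map; }
+};
+
+// Acquire locks the cache, GetCache exposes the map inside the locked cache,
+// and Build produces the object on a cache miss.
+template <typename AcquireFT, typename GetCacheFT, typename BuildFT>
+int getOrBuild(const std::string &Key, AcquireFT &&Acquire,
+               GetCacheFT &&GetCache, BuildFT &&Build) {
+  auto Locked = Acquire();
+  auto &Map = GetCache(Locked);
+  auto Inserted = Map.emplace(Key, 0); // the insert-or-acquire attempt
+  if (Inserted.second)                 // first requester: do the build
+    Inserted.first->second = Build();
+  return Inserted.first->second;
+}
+
+// A caller would then pass the three lambdas roughly like this:
+int buildExample() {
+  static std::map<std::string, int> ProgramCache;
+  return getOrBuild(
+      "kernel-set-0",
+      /*Acquire=*/[&] { return LockedCacheT{ProgramCache}; },
+      /*GetCache=*/[](LockedCacheT &L) -> auto & { return L.get(); },
+      /*Build=*/[] { return 42; /* pretend this JIT-compiles a program */ });
+}
+```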
+
+
+### Thread-safety
+
+Why do we need thread safety here? It is quite possible to have a use-case
+where the same `cl::sycl::context` is shared across multiple threads (e.g. via
+sharing a queue). The possibility of enqueueing multiple cacheable kernels
+simultaneously from multiple threads requires us to provide thread safety for
+the caching mechanisms.
+
+It is worth noting that we don't cache the PI resource (kernel or program) by
+itself. Instead, we augment the resource with the status of its build process.
+Hence, what is cached is a wrapper structure `BuildResult` which contains three
+information fields: a pointer to the built resource, the build error (if
+applicable) and the current build status (one of "in progress", "succeeded",
+"failed").
+
+One can find the definition of the `BuildResult` template in [KernelProgramCache](https://github.com/intel/llvm/blob/sycl/sycl/source/detail/kernel_program_cache.hpp).
+
+The built resource access synchronization approach aims at minimizing the time
+any thread holds the global lock guarding the maps, to improve performance. To
+achieve that, the global lock is acquired only for the duration of the global
+map access. The actual build of the program happens outside of the lock, so
+other threads can request or build other programs in the meantime. A thread
+requesting a `BuildResult` instance via `getOrBuild` can go one of three ways:
+ A) The build result is **not** available and this is the first thread to
+    request it. This thread will then execute the build, letting the others
+    wait for the result using the per-build-result condition variable kept in
+    `BuildResult`'s `MBuildCV` field.
+ B) The build result is **not** available and another thread is already
+    building it. The current thread will then wait for the result using the
+    `MBuildCV` condition variable.
+ C) The build result **is** available. The thread simply takes it from the
+    `Ptr` field without using any mutexes or condition variables.
+
+As noted before, access to the fields of a `BuildResult` instance may occur
+from different threads simultaneously while the global lock is no longer held.
+So, to make this safe and to make sure only one thread builds the requested
+program, the following is done:
+ - The program build state is reflected in the `State` field. Threads use the
+   compare-and-swap technique to compete for who will do the build and become
+   thread A. Threads of kind C will find 'DONE' in this field and immediately
+   return with the built result at hand.
+ - Thread A and thread(s) B use the `MBuildCV` condition variable field and
+   the `MBuildResultMutex` mutex field guarding that variable to implement the
+   "single producer - multiple consumers" scheme.
+ - The build result itself appears in the `Ptr` field when available.
+
+All fields are atomic because they can be accessed from multiple threads.
+
+A specialization of the helper class [Locked](https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/detail/locked.hpp)
+for a reference to the proper mapping is returned by the Acquire function. The
+use of this class implements RAII to make the code look a bit cleaner. The
+GetCache function then returns the mapping to be employed; its key includes
+the three components: kernel name, device, as well as any specialization
+constants. These map to a `BuildResult` instance, which is what gets cached.
+The `BuildResult` structure is specialized with either `PiKernel` or
+`PiProgram`[1](#remove-pointer).
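+
+The following is a simplified sketch of such a wrapper. The field names
+(`Ptr`, `State`, `MBuildCV`, `MBuildResultMutex`) are taken from the text
+above; the error-data fields and the state enum are illustrative assumptions,
+and the actual template in `kernel_program_cache.hpp` differs in details.
+```C++
+#include <atomic>
+#include <condition_variable>
+#include <mutex>
+#include <string>
+
+enum class BuildState { InProgress, Succeeded, Failed };
+
+template <typename T> struct BuildResult {
+  std::atomic<T *> Ptr{nullptr}; // the built PI object, once available
+  std::atomic<BuildState> State{BuildState::InProgress};
+
+  // Stored contents of the build exception, if any (message + error code).
+  std::string ErrorMsg;
+  int ErrorCode = 0;
+
+  // "Single producer - multiple consumers": the building thread (A) signals
+  // via the condition variable; the waiting threads (B) block on it.
+  std::mutex MBuildResultMutex;
+  std::condition_variable MBuildCV;
+};
+
+// A waiting thread (case B above) would block roughly like this:
+template <typename T> void waitUntilBuilt(BuildResult<T> &BR) {
+  std::unique_lock<std::mutex> Lock(BR.MBuildResultMutex);
+  BR.MBuildCV.wait(Lock, [&BR] {
+    return BR.State.load() != BuildState::InProgress;
+  });
+}
+```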
+
+
+### Core of caching mechanism
+
+Now, let us see how the `getOrBuild` function works.
+First, we fetch the cache with sequential calls to the Acquire and GetCache
+functions. Then, we check if this is the first attempt to build this kernel or
+program. This is achieved with an attempt to insert another key-value pair
+into the map: a `BuildResult` stub instance with the status "in progress",
+which lets other threads know that someone (i.e. we) is building the object
+(i.e. the kernel or program) now. If the insertion fails, we wait for the
+building thread to finish with a call to the `waitUntilBuilt` function. This
+function will throw the stored exception[2](#exception-data) upon build
+failure, which allows the waiting threads to see the same result as the
+building thread. A special case of failure is when the build result doesn't
+contain the error (i.e. the error wasn't of `cl::sycl::exception` type) and
+the pointer to the object in the `BuildResult` instance is nil. In this case,
+the building thread has finished the build process and has returned an error
+to the user. But this error may be sporadic in nature and may be spurious.
+Hence, the waiting thread will try to build the same object once more.
+
+The `BuildResult` structure also contains synchronization objects: a mutex and
+a condition variable. We employ them to signal the waiting threads that the
+build process for this kernel/program is finished (either successfully or with
+a failure).
+
+
+<a name="remove-pointer">1</a>: The use of `std::remove_pointer` was omitted
+for the sake of simplicity here.
+
+<a name="exception-data">2</a>: Actually, we store the contents of the
+exception: its message and error code.
diff --git a/sycl/doc/index.rst b/sycl/doc/index.rst
index 5fb5c2da73296..8ecd1031c371e 100644
--- a/sycl/doc/index.rst
+++ b/sycl/doc/index.rst
@@ -28,3 +28,4 @@ Developing oneAPI DPC++ Compiler
    PluginInterface
    ABIPolicyGuide
    SpecializationConstants
+   KernelProgramCache