Skip to content

Commit 09282d6

Browse files
author
Mohammad Fawaz
committed
New spec for controlling load-store units in FPGAs
1 parent 4cc1da1 commit 09282d6

File tree

1 file changed

+116
-0
lines changed

1 file changed

+116
-0
lines changed
Lines changed: 116 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,116 @@
1+
2+
# FPGA lsu
3+
4+
The Intel FPGA `lsu` class is implemented in `CL/sycl/intel/fpga_lsu.hpp` which
5+
is included in `CL/sycl/intel/fpga_extensions.hpp`.
6+
7+
The class `cl::sycl::intel::lsu` allows users to explicitly request that the
8+
implementation of a global memory access is configured in a certain way. The
9+
class has two member functions, `load()` and `store()` which allow loading from
10+
and storing to a `global_ptr`, respectively, and is templated on the following
11+
4 optional paremeters:
12+
13+
1. **`cl::sycl::intel::burst_coalesce<B>`, where `B` is a boolean**: request,
14+
to the extent possible, that a dynamic burst coalescer be implemented when
15+
`load` or `store` are called. The default value of this parameter is `false`.
16+
2. **`cl::sycl::intel::cache<N>`, where `N` is an integer greater or equal to
17+
0**: request, to the extent possible, that a read-only cache of the specified
18+
size in bytes be implemented when when `load` is called. It is not allowed to
19+
use that parameter for `store`. The default value of this parameter is `0`.
20+
3. **`cl::sycl::intel::statically_coalesce<N>`, where `B` is a boolean**:
21+
request, to the extent possible, that `load` or `store` accesses, is allowed to
22+
be statically coalesced with other memory accesses at compile time. The default
23+
value of this parameter is `true`.
24+
4. **`cl::sycl::intel::prefetch<B>`, where `N` is a boolean**: request, to the
25+
extent possible, that a prefetcher be implemented when `load` is called. It is
26+
not allowed to use that parameter for `store`. The default value of this
27+
parameter is `false`.
28+
29+
Currently, not every combination of parameters is allowed due to limitations in
30+
the backend. The following rules apply:
31+
1. For `store`, `cl::sycl::intel::cache` must be `0` and
32+
`cl::sycl::intel::prefetch` must be `false`.
33+
2. For `load`, if `cl::sycl::intel::cache` is set to a value greater than `0`,
34+
then `cl::sycl::intel::burst_coalesce` must be set to `true`.
35+
3. For `load`, exactly one of `cl::sycl::intel::prefetch` and
36+
`cl::sycl::intel::burst_coalesce` is allowed to be `true`.
37+
4. For `load`, exactly one of `cl::sycl::intel::prefetch` and
38+
`cl::sycl::intel::cache` is allowed to be `true`.
39+
40+
## Implementation
41+
42+
The implementation relies on the Clang built-in `__builtin_intel_fpga_mem` when
43+
parsing the SYCL device code. The built-in uses the LLVM `ptr.annotation`
44+
intrinsic under the hood to annotate the pointer that is being accessed.
45+
```c++
46+
template <class... mem_access_params> class lsu final {
47+
public:
48+
lsu() = delete;
49+
50+
template <typename T> static T &load(sycl::global_ptr<T> Ptr) {
51+
check_load();
52+
#if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem)
53+
return *__builtin_intel_fpga_mem((T *)Ptr,
54+
_burst_coalesce | _cache |
55+
_dont_statically_coalesce | _prefetch,
56+
_cache_val);
57+
#else
58+
return *Ptr;
59+
#endif
60+
}
61+
62+
template <typename T> static void store(sycl::global_ptr<T> Ptr, T Val) {
63+
check_store();
64+
#if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem)
65+
*__builtin_intel_fpga_mem((T *)Ptr,
66+
_burst_coalesce | _cache |
67+
_dont_statically_coalesce | _prefetch,
68+
_cache_val) = Val;
69+
#else
70+
*Ptr = Val;
71+
#endif
72+
}
73+
...
74+
}
75+
```
76+
77+
## Usage
78+
79+
```c++
80+
#include <CL/sycl/intel/fpga_extensions.hpp>
81+
...
82+
cl::sycl::buffer<int, 1> output_buffer(output_data, 1);
83+
cl::sycl::buffer<int, 1> input_buffer(input_data, 1);
84+
85+
Queue.submit([&](cl::sycl::handler &cgh) {
86+
auto output_accessor = output_buffer.get_access<cl::sycl::access::mode::write>(cgh);
87+
auto input_accessor = input_buffer.get_access<cl::sycl::access::mode::read>(cgh);
88+
89+
cgh.single_task<class kernel>([=] {
90+
auto input_ptr = input_accessor.get_pointer();
91+
auto output_ptr = output_accessor.get_pointer();
92+
93+
using PrefetchingLSU =
94+
cl::sycl::intel::lsu<cl::sycl::intel::prefetch<true>,
95+
cl::sycl::intel::statically_coalesce<false>>;
96+
97+
using BurstCoalescedLSU =
98+
cl::sycl::intel::lsu<cl::sycl::intel::burst_coalesce<false>,
99+
cl::sycl::intel::statically_coalesce<false>>;
100+
101+
using CachingLSU =
102+
cl::sycl::intel::lsu<cl::sycl::intel::burst_coalesce<true>,
103+
cl::sycl::intel::cache<1024>,
104+
cl::sycl::intel::statically_coalesce<true>>;
105+
106+
using PipelinedLSU = cl::sycl::intel::lsu<>;
107+
108+
int X = PrefetchingLSU::load(input_ptr); // int X = input_ptr[0]
109+
int Y = CachingLSU::load(input_ptr + 1); // int Y = input_ptr[1]
110+
111+
BurstCoalescedLSU::store(output_ptr, X); // output_ptr[0] = X
112+
PipelinedLSU::store(output_ptr + 1, Y); // output_ptr[1] = Y
113+
});
114+
});
115+
...
116+
```

0 commit comments

Comments
 (0)