intel · bader · Dec 3, 2021 · Nov 24, 2021
@@ -634,6 +634,92 @@ Automated mechanisms are possible to provide uniquification across calls, and co
 *RESOLUTION*: Resolved.  Abstraction/libraries on top enable functionality like this.  We will make public a library that enables arrays of pipes.
 --
 
+== Experimental APIs
+
+*NOTE*: The APIs described in this section are experimental. Future versions of this extension may change these APIs in ways that are incompatible with the versions described here.
+
+In the experimental API version, read/write methods take template arguments, which can contain the latency control properties `latency_anchor_id` and/or `latency_constraint`.
+
+* `sycl::ext::intel::experimental::latency_anchor_id<N>`, where `N` is an integer: An ID to associate with the current read/write function call, which can then be referenced by other `latency_constraint` properties elsewhere in the program to define relative latency constaints. ID must be unique within the application, and a diagnostic is required if that condition is not met.
+* `sycl::ext::intel::experimental::latency_constraint<A, B, C>`: A tuple of three values which cause the current read/write function call to act as an endpoint of a latency constraint relative to a specified `latency_anchor_id` defined by a different instruction.
+** `A` is an integer: The ID of the target anchor defined on a different instruction through a `latency_anchor_id` property.
+** `B` is an enum value: The type of control from the set {`type::exact`, `type::max`, `type::min`}.
+** `C` is an integer: The relative clock cycle difference between the target anchor and the current function call, that the constraint should infer subject to the type of the control (exact, max, min).
+
+The template arguments above don't have to be specified if user doesn't want to apply latency controls. The template arguments can be passed in arbitrary order.
+
+=== Implementation
+
+[source,c++]
+----
+// Added in version 2 of this extension.
+namespace sycl::ext::intel::experimental {
+enum class type {
+  none, // default
+  exact,
+  max,
+  min
+};
+
+template <int32_t _N> struct latency_anchor_id {
+  static constexpr int32_t value = _N;
+  static constexpr int32_t default_value = -1;
+};
+
+template <int32_t _N1, type _N2, int32_t _N3> struct latency_constraint {
+  static constexpr std::tuple<int32_t, type, int32_t> value = {_N1, _N2, _N3};
+  static constexpr std::tuple<int32_t, type, int32_t> default_value = {
+      0, type::none, 0};
+};
+
+template <typename name,
+          typename dataT,
+          size_t min_capacity = 0>
+class pipe {
+  // Blocking
+  template <class... _Params>
+  static dataT read();
+  template <class... _Params>
+  static void write( const dataT &data );
+
+  // Non-blocking
+  template <class... _Params>
+  static dataT read( bool &success_code );
+  template <class... _Params>
+  static void write( const dataT &data, bool &success_code );
+}
+} // namespace sycl::ext::intel::experimental
+----
+
+=== Usage
+
+[source,c++]
+----
+// Added in version 2 of this extension.
+#include <sycl/ext/intel/fpga_extensions.hpp>
+...
+using Pipe1 = ext::intel::experimental::pipe<class PipeClass1, int, 8>;
+using Pipe2 = ext::intel::experimental::pipe<class PipeClass2, int, 8>;
+using Pipe3 = ext::intel::experimental::pipe<class PipeClass2, int, 8>;
+
+myQueue.submit([&](handler &cgh) {
+  cgh.single_task<class foo>([=] {
+    // The following Pipe1::read is anchor 0
+    int value = Pipe1::read<ext::intel::experimental::latency_anchor_id<0>>();
+
+    // The following Pipe2::write is anchor 1
+    // The following Pipe2::write occurs exactly 2 cycles after anchor 0
+    Pipe2::write<ext::intel::experimental::latency_anchor_id<1>,
+                 ext::intel::experimental::latency_constraint<
+                     0, ext::intel::experimental::type::exact, 2>>(value);
+
+    // The following Pipe3::write occurs at least 2 cycles after anchor 1
+    Pipe3::write<ext::intel::experimental::latency_constraint<
+        1, ext::intel::experimental::type::min, 2>>(value);
+  });
+});
+----
+
 == Feature test macro
 
 This extension provides a feature-test macro as described in the core SYCL
@@ -648,6 +734,7 @@ extension's APIs the implementation supports.
 |===
 |Value |Description
 |1     |Initial extension version. Base features are supported.
+|2     |Add experimental latency control API.
 |===
 
 == Revision History
@@ -660,6 +747,7 @@ extension's APIs the implementation supports.
 |1|2019-09-12|Michael Kinsner|*Initial public working draft*
 |2|2019-11-13|Michael Kinsner|Incorporate feedback
 |3|2020-04-27|Michael Kinsner|Clarify that pipe operations behave as-if they are relaxed atomic operations.  Make SYCL2020 the baseline
+|4|2021-12-02|Shuo Niu|Add experimental latency control API
 |========================================
 
 //************************************************************************

@@ -119,16 +119,188 @@ Queue.submit([&](cl::sycl::handler &cgh) {
 ...
 ```
 
+## Experimental APIs
+
+**NOTE**: The APIs described in this section are experimental. Future versions of
+this extension may change these APIs in ways that are incompatible with the
+versions described here.
+
+In the experimental API version, member functions `load()` and `store()` take
+template arguments, which can contain the latency control properties
+`latency_anchor_id` and/or `latency_constraint`.
+
+1. **`sycl::ext::intel::experimental::latency_anchor_id<N>`, where `N` is an integer**:
+represents ID of the current function call when it performs as an anchor. The ID
+must be unique within the application, with a diagnostic required if that
+condition is not met.
+2. **`sycl::ext::intel::experimental::latency_constraint<A, B, C>`** contains control
+parameters when the current function performs as a non-anchor, where:
+    - **`A` is an integer**: The ID of the target anchor defined on a different
+    instruction through a `latency_anchor_id` property.
+    - **`B` is an enum value**: The type of control from the set
+    {`type::exact`, `type::max`, `type::min`}.
+    - **`C` is an integer**: The relative clock cycle difference between the
+    target anchor and the current function call, that the constraint should
+    infer subject to the type of the control (exact, max, min).
+
+The template arguments above don't have to be specified if user doesn't want to
+apply latency controls. The template arguments can be passed in arbitrary order.
+
+### Implementation
+```c++
+// Added in version 2 of this extension.
+namespace sycl::ext::intel::experimental {
+enum class type {
+  none, // default
+  exact,
+  max,
+  min
+};
+
+template <int32_t _N> struct latency_anchor_id {
+  static constexpr int32_t value = _N;
+  static constexpr int32_t default_value = -1;
+};
+
+template <int32_t _N1, type _N2, int32_t _N3> struct latency_constraint {
+  static constexpr std::tuple<int32_t, type, int32_t> value = {_N1, _N2, _N3};
+  static constexpr std::tuple<int32_t, type, int32_t> default_value = {
+      0, type::none, 0};
+};
+
+template <class... mem_access_params> class lsu final {
+public:
+  lsu() = delete;
+
+  template <class... _Params, typename _T, access::address_space _space>
+  static _T load(sycl::multi_ptr<_T, _space> Ptr) {
+    check_space<_space>();
+    check_load();
+#if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem)
+    static constexpr auto _anchor_id =
+        __GetValue<int, latency_anchor_id, _Params...>::value;
+    static constexpr auto _constraint =
+        __GetValue3<int, type, int, latency_constraint, _Params...>::value;
+
+    static constexpr int _target_anchor = std::get<0>(_constraint);
+    static constexpr type _control_type = std::get<1>(_constraint);
+    static constexpr int _cycle = std::get<2>(_constraint);
+    int _type;
+    if (_control_type == type::none) {
+      _type = 0;
+    } else if (_control_type == type::exact) {
+      _type = 1;
+    } else if (_control_type == type::max) {
+      _type = 2;
+    } else { // _control_type == type::min
+      _type = 3;
+    }
+
+    return *__latency_control_mem_wrapper((_T *)Ptr, _anchor_id, _target_anchor,
+                                          _type, _cycle);
+#else
+    return *Ptr;
+#endif
+  }
+
+  template <class... _Params, typename _T, access::address_space _space>
+  static void store(sycl::multi_ptr<_T, _space> Ptr, _T Val) {
+    check_space<_space>();
+    check_store();
+#if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem)
+    static constexpr auto _anchor_id =
+        __GetValue<int, latency_anchor_id, _Params...>::value;
+    static constexpr auto _constraint =
+        __GetValue3<int, type, int, latency_constraint, _Params...>::value;
+
+    static constexpr int _target_anchor = std::get<0>(_constraint);
+    static constexpr type _control_type = std::get<1>(_constraint);
+    static constexpr int _cycle = std::get<2>(_constraint);
+    int _type;
+    if (_control_type == type::none) {
+      _type = 0;
+    } else if (_control_type == type::exact) {
+      _type = 1;
+    } else if (_control_type == type::max) {
+      _type = 2;
+    } else { // _control_type == type::min
+      _type = 3;
+    }
+
+    *__latency_control_mem_wrapper((_T *)Ptr, _anchor_id, _target_anchor, _type,
+                                   _cycle) = Val;
+#else
+    *Ptr = Val;
+#endif
+  }
+  ...
+private:
+#if defined(__SYCL_DEVICE_ONLY__) && __has_builtin(__builtin_intel_fpga_mem)
+  template <typename _T>
+  static _T *__latency_control_mem_wrapper(_T * Ptr, int32_t AnchorID,
+                                           int32_t TargetAnchor, int32_t Type,
+                                           int32_t Cycle) {
+    return __builtin_intel_fpga_mem(Ptr,
+                                    _burst_coalesce | _cache |
+                                    _dont_statically_coalesce | _prefetch,
+                                    _cache_val);
+  }
+#endif
+  ...
+}
+} // namespace sycl::ext::intel::experimental
+```
+
+### Usage
+```c++
+// Added in version 2 of this extension.
+#include <sycl/ext/intel/fpga_extensions.hpp>
+...
+sycl::buffer<int, 1> output_buffer(output_data, 1);
+sycl::buffer<int, 1> input_buffer(input_data, 1);
+Queue.submit([&](sycl::handler &cgh) {
+  auto output_accessor =
+      output_buffer.get_access<sycl::access::mode::write>(cgh);
+  auto input_accessor = input_buffer.get_access<sycl::access::mode::read>(cgh);
+
+  cgh.single_task<class kernel>([=] {
+    auto input_ptr = input_accessor.get_pointer();
+    auto output_ptr = output_accessor.get_pointer();
+
+    // latency controls
+    using ExpPrefetchingLSU = sycl::ext::intel::experimental::lsu<
+        sycl::ext::intel::experimental::prefetch<true>,
+        sycl::ext::intel::experimental::statically_coalesce<false>>;
+
+    using ExpBurstCoalescedLSU = sycl::ext::intel::experimental::lsu<
+        sycl::ext::intel::experimental::burst_coalesce<false>,
+        sycl::ext::intel::experimental::statically_coalesce<false>>;
+
+    // The following load is anchor 1
+    int Z = ExpPrefetchingLSU::load<
+        sycl::ext::intel::experimental::latency_anchor_id<1>>(input_ptr + 2);
+
+    // The following store occurs exactly 5 cycles after the anchor 1 read
+    ExpBurstCoalescedLSU::store<
+        sycl::ext::intel::experimental::latency_constraint<
+            1, sycl::ext::intel::experimental::type::exact, 5>>(output_ptr + 2,
+                                                                Z);
+  });
+});
+...
+```
+
 ## Feature Test Macro
 
 This extension provides a feature-test macro as described in the core SYCL
 specification section 6.3.3 "Feature test macros". Therefore, an implementation
 supporting this extension must predefine the macro `SYCL_EXT_INTEL_FPGA_LSU`
 to one of the values defined in the table below. Applications can test for the
 existence of this macro to determine if the implementation supports this
-feature, or applications can test the macro�s value to determine which of the
-extension�s APIs the implementation supports.
+feature, or applications can test the macro's value to determine which of the
+extension's APIs the implementation supports.
 
 |Value |Description|
 |:---- |:---------:|
 |1     |Initial extension version. Base features are supported.|
+|2     |Add experimental latency control API.|