Skip to content

Extensions to the SYCL Graph proposal #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 30 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
30 commits
Select commit Hold shift + click to select a range
7c62056
Inital version of sycl graph prototype
reble Feb 18, 2022
59bb7da
Adding initial sycl graph doc
reble Feb 18, 2022
528017a
Reusing command list for re-execution (WIP)
reble Jan 7, 2022
aee48a5
Adding lazy execution property to queue
reble Feb 15, 2022
24fa5a9
fix merge
reble Feb 22, 2022
b7ce271
Update pi_level_zero.cpp
reble Feb 22, 2022
f3d30ed
update extension proposal started to incorporate feedback
reble Mar 11, 2022
0e96d12
typo
reble Mar 11, 2022
8f1a8dc
Apply suggestions from code review
reble Mar 14, 2022
a8c7265
fix typos and syntax issues
reble May 3, 2022
5055f59
Merge branch 'sycl' of github.com:reble/llvm into sycl
reble May 3, 2022
7d90885
Propagate lazy queue property
julianmi May 3, 2022
ff0d764
Formatting
julianmi May 3, 2022
cc64b96
Rename add_device_node() to submit()
julianmi May 3, 2022
f097481
Clarify naming of graph instantiation
julianmi May 3, 2022
2b3fc88
Add support for empty nodes
julianmi May 3, 2022
116497d
Add support for memory nodes
julianmi May 3, 2022
42253ca
Add single_task() and parallel_for() shurtcuts
julianmi May 3, 2022
ac2c1f3
Add support for setting/updating node parameters
julianmi May 3, 2022
2c072f0
Assign provided node to the graph called from
julianmi May 5, 2022
60507c1
Propagate lazy queue property
julianmi May 3, 2022
2d8ea35
Include predecessor dependencies for empty nodes
julianmi May 18, 2022
f1a9db3
Clarify naming of adding/updating nodes
julianmi May 18, 2022
d0069e5
Update pi_level_zero.cpp
reble May 25, 2022
ed74e86
Update pi_level_zero.cpp
reble May 25, 2022
38a6fda
Avoid flag redefinition and fix semantics
reble May 26, 2022
058c875
Update pi_level_zero.cpp
reble May 26, 2022
a61515b
Submit node only if body exists and it is not empty
julianmi May 27, 2022
b9c0381
Add update node routine
julianmi May 27, 2022
9dcf29e
Merge remote-tracking branch 'upstream/sycl' into sycl
julianmi May 27, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
274 changes: 274 additions & 0 deletions sycl/doc/extensions/experimental/SYCL_EXT_ONEAPI_GRAPH.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,274 @@
= SYCL_EXT_ONEAPI_GRAPH
:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en

:blank: pass:[ +]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}

== Notice

Copyright (c) 2022 Intel Corporation. All rights reserved.

IMPORTANT: This specification is a draft.

NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
trademarks of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc.
used by permission by Khronos.

NOTE: This document is better viewed when rendered as html with asciidoctor.
GitHub does not render image icons.

This extension is written against the SYCL 2020 revision 4 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.

NOTE: This extension is experimental: interfaces are subject to change later.

== Introduction

This extension introduces an interface that enables a lazy execution and easy replay of a kernel graph by separating
its definition and execution.

== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification section 6.3.3 "Feature test macros". Therefore, an
implementation supporting this extension must predefine the macro
`SYCL_EXT_ONEAPI_GRAPH` to one of the values defined in the table below.
Applications can test for the existence of this macro to determine if the
implementation supports this feature, or applications can test the macro's
value to determine which of the extension's APIs the implementation supports.

Table 1. Values of the `SYCL_EXT_ONEAPI_GRAPH` macro.
[%header,cols="1,5"]
|===
|Value |Description
|1 |Initial extension version. Base features are supported.
|===

== SYCL Graph Terminology

Table 2. Terminology.
|===
|Concept|Description
|graph| Class that stores structured work units and their dependencies
|node| The unit of work. Can have different attributes.
|edge| Dependency between work units. Happens-before relation.
|===

== Node

Node is a class that encapsulates tasks like SYCL kernel functions or host tasks for deferred execution.
A graph has to be created first, the structure of a graph is defined second by adding nodes and edges.

[source,c++]
----
namespace sycl::ext::oneapi::experimental {

class node{
};
}
----

NOTE:

== Edge

A dependency between two nodes representing a happens-before relationship. `sender` and `receiver` may be associated to different graphs.

[source,c++]
----
namespace sycl::ext::oneapi::experimental {

// Adding dependency between two nodes.
void make_edge(node sender, node receiver);
}
----

== Graph

Graph is a class that represents a directed acyclic graph of nodes.
A graph can have different states, can be nested, can have multiple root nodes that are scheduled for execution first and multiple leaf nodes that are scheduled for execution last. The execution of a graph has been completed when all leaf node tasks have been completed.
Member functions as listed in Table 2 and 3 can be used to add nodes to a graph.

[source,c++]
----
namespace sycl::ext::oneapi::experimental {

enum class graph_state{
modifiable,
executable
};

template<graph_state State>
class graph {
public:
operator graph<graph_state::executable>();
};

graph<graph_state::modifiable> make_graph();

graph<graph_state::executable> compile(const graph<graph_state::modifiable> Graph);

}

sycl::event sycl::queue(const graph<graph_state::executable> Graph);

----

=== Executable Graph

A `graph` object in `graph_state::executable` represents a user generated device and context specific execution object that is submitted to a queue for execution.
The structure of such a `graph` object in this state is immutable and cannot be changed, so are the tasks assigned with each node.
Support of submitting a graph for execution, before a previous execution has been completed is backend specific. The runtime may throw an error.

Table 3. Constructor of the `graph` class.
|===
|Constructor|Description

|`graph()`
|Creates a `graph` object. It's default state is `graph_state::modifiable`.

|===

Table 4. Member functions of the `graph` class.
|===
|Member function|Description

|`node add_node(const std::vector<node>& dep = {});`
|This creates an empty node which is associated to no task. Its intended use is either a connection point inside a graph between groups of nodes, and can significantly reduce the number of edges ( O(n) vs. O(n^2) ). Another use-case is building the structure of a graph first and adding tasks later.

|`template<typename T>
node add_node(T cgf, const std::vector<node>& dep = {});`
|This node captures a command group function object containing host task which is scheduled by the SYCL runtime or a SYCL function for invoking kernels with all restrictions that apply as described in the spec.

|===

Table 5. Member functions of the `graph` class (memory operations).
|===
|Member function|Description

|`node add_memcpy_node(void* dest, const void* src, size_t numBytes, const std::vector<node>& dep = {});`
|Adding a node that encapsulates a `memcpy` operation.

|`node add_memset_node(void* ptr, int value, size_t numBytes, const std::vector<node>& dep = {});`
|Adding a node that encapsulates a `memset` operation.

|`node add_malloc_node(void *data, size_t numBytes, usm::alloc kind, const std::vector<node>& dep = {});`
|Adding a node that encapsulates a `malloc` operation.

|`node add_free_node(void *data, const std::vector<node>& dep = {});`
|Adding a node that encapsulates a `free` operation.

|===


== Examples

NOTE: The examples below demonstrate intended usage of the extension, but are not compatible with the proof-of-concept implementation. The proof-of-concept implementation currently requires different syntax, as described in the "Non-implemented features" section at the end of this document.

1. Dot product

[source,c++]
----
...

#include <sycl/ext/oneapi/experimental/graph.hpp>

int main() {
const size_t n = 10;
float alpha = 1.0f;
float beta = 2.0f;
float gamma = 3.0f;

sycl::queue q;

auto g = sycl::ext::oneapi::experimental::make_graph();

float *x = sycl::malloc_shared<float>(n, q);
float *y = sycl::malloc_shared<float>(n, q);
float *z = sycl::malloc_shared<float>(n, q);

float *dotp = sycl::malloc_shared<float>(1, q);

for (int i = 0; i < n; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
z[i] = 3.0f;
}

auto node_a = g.add_node([&](sycl::handler &h) {
h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) {
const size_t i = it[0];
x[i] = alpha * x[i] + beta * y[i];
});
});

auto node_b = g.add_node([&](sycl::handler &h) {
h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> it) {
const size_t i = it[0];
z[i] = gamma * z[i] + beta * y[i];
});
});

auto node_c = g.add_node(
[&](sycl::handler &h) {
h.parallel_for(sycl::range<1>{n},
sycl::reduction(dotp, 0.0f, std::plus()),
[=](sycl::id<1> it, auto &sum) {
const size_t i = it[0];
sum += x[i] * z[i];
});
},
{node_a, node_b});

auto exec = compile(q);

q.submit(exec).wait();

sycl::free(x, q);
sycl::free(y, q);
sycl::free(z, q);
sycl::free(dotp, q);

return 0;
}


...
----

== Issues for later investigations

. Explicit memory movement can cause POC to stall.

== Non-implemented features
Please, note that the following features are not yet implemented:

. Level Zero backend only
. Memory operation nodes not implemented
. Host node not implemented
. Submit overload of a queue. `submit(graph)` Use a combination of `executable_graph::exec_and_wait()` and queue property `sycl::ext::oneapi::property::queue::lazy_execution{}` instead.
. `class graph<graph_state>` Use dedicated `class graph` (equivalent to `graph_state == modifiable`) and `class executable_graph` (equivalent to `graph_state == executable`) instead.

== Revision History

[cols="5,15,15,70"]
[grid="rows"]
[options="header"]
|========================================
|Rev|Date|Author|Changes
|1|2022-02-11|Pablo Reble|Initial public working draft
|2|2022-03-11|Pablo Reble|Incorporate feedback from PR
|========================================
1 change: 1 addition & 0 deletions sycl/include/CL/sycl/detail/pi.def
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,7 @@ _PI_API(piSamplerGetInfo)
_PI_API(piSamplerRetain)
_PI_API(piSamplerRelease)
// Queue commands
_PI_API(piKernelLaunch)
_PI_API(piEnqueueKernelLaunch)
_PI_API(piEnqueueNativeKernel)
_PI_API(piEnqueueEventsWait)
Expand Down
3 changes: 3 additions & 0 deletions sycl/include/CL/sycl/detail/pi.h
Original file line number Diff line number Diff line change
Expand Up @@ -602,6 +602,7 @@ constexpr pi_queue_properties PI_QUEUE_PROFILING_ENABLE =
constexpr pi_queue_properties PI_QUEUE_ON_DEVICE = CL_QUEUE_ON_DEVICE;
constexpr pi_queue_properties PI_QUEUE_ON_DEVICE_DEFAULT =
CL_QUEUE_ON_DEVICE_DEFAULT;
constexpr pi_queue_properties PI_QUEUE_LAZY_EXECUTION = 1 << 10;

using pi_result = _pi_result;
using pi_platform_info = _pi_platform_info;
Expand Down Expand Up @@ -1453,6 +1454,8 @@ __SYCL_EXPORT pi_result piSamplerRelease(pi_sampler sampler);
//
// Queue Commands
//
__SYCL_EXPORT pi_result piKernelLaunch(pi_queue queue);

__SYCL_EXPORT pi_result piEnqueueKernelLaunch(
pi_queue queue, pi_kernel kernel, pi_uint32 work_dim,
const size_t *global_work_offset, const size_t *global_work_size,
Expand Down
3 changes: 2 additions & 1 deletion sycl/include/CL/sycl/detail/property_helper.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -34,8 +34,9 @@ enum DataLessPropKind {
InitializeToIdentity = 7,
UseDefaultStream = 8,
DiscardEvents = 9,
LazyExecution = 10,
// Indicates the last known dataless property.
LastKnownDataLessPropKind = 9,
LastKnownDataLessPropKind = 10,
// Exceeding 32 may cause ABI breaking change on some of OSes.
DataLessPropKindSize = 32
};
Expand Down
1 change: 1 addition & 0 deletions sycl/include/CL/sycl/feature_test.hpp.in
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,7 @@ namespace sycl {
#define SYCL_EXT_ONEAPI_FREE_FUNCTION_QUERIES 1
#define SYCL_EXT_ONEAPI_GROUP_ALGORITHMS 1
#define SYCL_EXT_ONEAPI_GROUP_SORT 1
#define SYCL_EXT_ONEAPI_LAZY_QUEUE 1
#define SYCL_EXT_ONEAPI_MAX_WORK_GROUP_QUERY 1
#define SYCL_EXT_ONEAPI_ND_RANGE_REDUCTIONS 1
#define SYCL_EXT_ONEAPI_DEFAULT_CONTEXT 1
Expand Down
8 changes: 8 additions & 0 deletions sycl/include/CL/sycl/properties/queue_properties.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,8 @@ namespace property {
namespace queue {
class discard_events : public ::cl::sycl::detail::DataLessProperty<
::cl::sycl::detail::DiscardEvents> {};
class lazy_execution : public ::cl::sycl::detail::DataLessProperty<
::cl::sycl::detail::LazyExecution> {};
} // namespace queue
} // namespace property

Expand Down Expand Up @@ -63,6 +65,9 @@ template <>
struct is_property<ext::oneapi::property::queue::discard_events>
: std::true_type {};
template <>
struct is_property<ext::oneapi::property::queue::lazy_execution>
: std::true_type {};
template <>
struct is_property<property::queue::cuda::use_default_stream> : std::true_type {
};
template <>
Expand All @@ -78,6 +83,9 @@ template <>
struct is_property_of<ext::oneapi::property::queue::discard_events, queue>
: std::true_type {};
template <>
struct is_property_of<ext::oneapi::property::queue::lazy_execution, queue>
: std::true_type {};
template <>
struct is_property_of<property::queue::cuda::use_default_stream, queue>
: std::true_type {};
template <>
Expand Down
Loading