Skip to content

[SYCL] Support for arrays as kernel parameters #1387

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 30 commits into from
Closed

[SYCL] Support for arrays as kernel parameters #1387

wants to merge 30 commits into from

Conversation

rdeodhar
Copy link
Contributor

Signed-off-by: rdeodhar [email protected]

rdeodhar and others added 9 commits March 24, 2020 14:32
Implementation of piEventSetCallback with tests

GlueEvent uses now the correct plugins

The SYCL RT code for GlueEvent calls now
the right plugin to create the event that triggers the
dependency chain.
Renamed variables to clarify the source code and avoid
confusions between Context and Plugin

Signed-off-by: Ruyman Reyes <[email protected]>
Signed-off-by: Stuart Adams <[email protected]>
Signed-off-by: Steffen Larsen <[email protected]>
NOTE: This flag is not exposed to the driver and not intended for users.
It's added to make experiments and identify issues with optimizations.

Signed-off-by: Alexey Bader <[email protected]>
By emitting the legacy variant of the LLVM IR alongside the newer
representation of the attribute, backwards compatibility with any
existing BE implementation is restored. A smooth transition period
is thus achieved for the aforementiond BE - until it's able to consume
the new LLVM IR, it has an option to simply ignore the unknown metadata.

Signed-off-by: Artem Gindinson <[email protected]>
If found alloca command is not sub-buffer alloca, then
it's parent alloca which has same context

Signed-off-by: Ivan Karachun <[email protected]>
@erichkeane
Copy link
Contributor

@romanovvlad As discussed previously via email, I want @rolandschulz to review the design/interface/IR output, so we're waiting on the submitter to provide the design documentation for this. Additionally, @elizabethandrews and @Fznamznon know this section of code better than me, so I'll delegate review to them.

Copy link
Contributor

@Fznamznon Fznamznon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This patch is not so trivial.
Could you please describe the approach in commit message?
Could you please also add a CodeGen test?

@@ -0,0 +1,259 @@
// This test checks array kernel parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this test checks some kernel execution. We don't put such tests in llvm/clang/test/* directories. Please put somewhere to llvm/sycl/test/ directory.
BTW, SYCL kernels can be run on different devices. Other tests in llvm/sycl/test/ directory has bunch of special RUN lines to run the test on particular device. Like // RUN: %GPU_RUN_PLACEHOLDER %t.out here:

// RUN: %GPU_RUN_PLACEHOLDER %t.out
. You will need to add them too, if you are doing to add test with kernel execution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will write a design document to explain the overall approach to supporting array arguments of kernels. I assume attaching documents here is accepted practice?
I will also address comments posted here in my next attempt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we appreciate design docs. You can put your doc somewhere in this folder https://github.com/intel/llvm/tree/sycl/sycl/doc (or extend some existing doc here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Comment on lines +3 to +4
// This test checks if compiler reports compilation error on an attempt to pass
// a struct with non-trivially copyable type as SYCL kernel parameter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean array of non-trivially copyable objects, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, comment changed.

@@ -1095,6 +1225,32 @@ static void populateIntHeader(SYCLIntegrationHeader &H, const StringRef Name,
populateHeaderForWrappedAccessors(FldType,
Offset + OffsetInWrapper);
}
} else if (FldType->isConstantArrayType()) {
auto CAT = cast<ConstantArrayType>(FldType);
auto ET = CAT->getElementType();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not use auto when the type is not obvious from the context.
See https://llvm.org/docs/CodingStandards.html#id26
Please apply this comment to all uses of auto in your patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

Comment on lines 902 to 921
if (Util::isSyclAccessorType(ET)) {
size_t NumElements =
static_cast<size_t>(CAT->getSize().getZExtValue());
DeclAccessPair FieldDAP = DeclAccessPair::make(Field, AS_none);
auto Lhs = MemberExpr::Create(
S.Context, KernelObjCloneRef, false, Loc,
NestedNameSpecifierLoc(), Loc, Field, FieldDAP,
DeclarationNameInfo(Field->getDeclName(), Loc), nullptr,
Field->getType(), VK_LValue, OK_Ordinary, NOUR_None);
// Treat each element of accessor array as a separate item
// Call its __init function
for (size_t I = 0; I < NumElements; ++I) {
ExprResult IndexExpr = S.ActOnIntegerConstant(Loc, I);
auto MA = S.CreateBuiltinArraySubscriptExpr(Lhs, Loc,
IndexExpr.get(), Loc);
getExprForSpecialSYCLObj(ET, nullptr, MA.get(),
ET->getAsCXXRecordDecl(), nullptr,
InitMethodName, BodyStmts);
KernelFuncParam++;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Déjà vu...
The new code in getExprForWrappedAccessorInit lambda looks almost the same. Can we create a function for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two blocks of code are superficially similar. The check for Accessor type and then iterating over the array elements is common. However, the actions in the two cases are very different. It is not practical to make a callable function out of these two blocks of code.

}
}
} else if (FieldType->isConstantArrayType()) {
auto Loc = Field->getLocation();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really get why you are applying source location of Kernel Object field to newly generated AST nodes, I think it can be tricky. Let's apply just empty source location since these AST nodes do not represent real code from source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. This was done based on a previous reviewers suggestion. I have reverted to empty source location to conform to the approach in other cases in this function.

};
RecordDecl *NewClass;
FieldDecl *Field;
wrapAnArray(ArgTy, NewClass, Field);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I get it right that to pass array of something, we wrap it by some struct (which seems to be generated by the compiler)? i.e. resulting kernel for each array field of kernel object has some wrapped_array struct parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a design document will be posted.

Comment on lines 1299 to 1314
auto CAT = cast<ConstantArrayType>(ArgTy);
auto ET = CAT->getElementType();
if (Util::isSyclAccessorType(ET)) {
size_t NumElements = static_cast<size_t>(CAT->getSize().getZExtValue());
const auto *Accessor = ET->getAsCXXRecordDecl();
ASTContext &AccessorCtx = Accessor->getASTContext();
const ASTRecordLayout &AccessorLayout =
AccessorCtx.getASTRecordLayout(Accessor);
auto AccessorSize = AccessorCtx.toBits(AccessorLayout.getSize()) / 8;
// Start with offset of array, i.e., offset of first element of array
uint64_t OffsetInArray = Offset;
// Treat each element of accessor array as a separate item
// Create a parameter descriptor for it
for (size_t I = 0; I < NumElements; ++I) {
populateHeaderForAccessor(ET, OffsetInArray);
OffsetInArray += AccessorSize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After lines 1231-1253 it's déjà vu again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, there was somewhat more similarity in the actions done in the two cases, and this code has been factored out into a separate function, called twice.

@elizabethandrews
Copy link
Contributor

From what I undestand you've handled 3 types of kernel parameters here -

  1. Accessor arrays
  2. Arrays of other types
  3. Class type with an accessor array as a member.

For 1. and 3., you added each element as a separate parameter in generated OpenCL Kernel and called init functions accordingly.

For 2. You wrapped the array inside a new class 'wrapped _array' and added this as the parameter.

Is this a gist of what you did? Am I missing anything? It would be helpful having this information in the commit message.

@elizabethandrews
Copy link
Contributor

Please add a Sema test showing AST generated.

AlexeySachkov and others added 8 commits March 26, 2020 10:15
Enable -fdeclare-spirv-builtins for SYCL device compilation mode

For device compilation, SPIR-V builtins are now looked up by
the device compiler. They now longer need to be forward declared.

[SYCL-PTX] Revert manual mangling of some SPIR-V builtins
[SYCL-PTX] Add fmod builtin
[SYCL-PTX] Update Atomic mangling

Signed-off-by: Victor Lomuller <[email protected]>
…<dir> (#1346)

When using /Fo<dir> the improper dependency file name was generated, causing
the bundle step to not be able to locate the dependency file when compiling
to object

Signed-off-by: Michael D Toguchi <[email protected]>
This patch introduces the following loop attributes:
- loop_coalesce:
  Indicates that the loop nest should be coalesced into a single loop without
  affecting functionality
- speculated_iterations:
  Specifies the number of concurrent speculated iterations that will be in
  flight for a loop invocation
- disable_loop_pipelining:
  Disables pipelining of the loop data path, causing the loop to be executed
  serially
- max_interleaving:
  Places a maximum limit N on the number of interleaved invocations of an inner
  loop by an outer loop

Signed-off-by: Viktoria Maksimova <[email protected]>
Fixed the buffer constructor called with a pair of iterators.
The current implementation has a problem due to ambiguous spec.
The buffer should never write back data unless there is a call to set_final_data(), but the current implementation does it.
I corrected the spec in KhronosGroup/SYCL-Docs#76.
So, now we can change the buffer implementation according to the clarified spec.

The test case buffer.cpp also needed change because of this change.
The user should not expect the automatic write-back of data upon destruction of buffer.

Signed-off-by: Byoungro So <[email protected]>
Co-authored-by: Ronan Keryell <[email protected]>
A simple library which allows to construct and serialize/deserialize
a sequence of typed property sets, where each property is a <name,typed value>
pair. To be used in offload tools.

Signed-off-by: Konstantin S Bobrovsky <[email protected]>
The library allows to create, serialize/deserialize tables of strings,
insert/delete/replace/rename columns, add rows. To be used in offload
tools.

Signed-off-by: Konstantin S Bobrovsky <[email protected]>
@rdeodhar
Copy link
Contributor Author

rdeodhar commented Mar 27, 2020 via email

@erichkeane
Copy link
Contributor

erichkeane commented Mar 27, 2020

From what I undestand you've handled 3 types of kernel parameters here -

1. Accessor arrays

2. Arrays of other types

3. Class type with an accessor array as a member.

For 1. and 3., you added each element as a separate parameter in generated OpenCL Kernel and called init functions accordingly.

For 2. You wrapped the array inside a new class 'wrapped _array' and added this as the parameter.

Is this a gist of what you did? Am I missing anything? It would be helpful having this information in the commit message.

Hmm.. I cannot help but wonder what happens when multiple arrays are passed? Are they in separate types? Are those types necessary? I think LLVM-IR supports array-level parameters (despite C/C++ not ), so I wonder if we can do away with the type itself to avoid the obvious trouble that comes with passing it through a struct.

@rdeodhar
Copy link
Contributor Author

Design document attached.
Array_Kernel_Parameters.docx

@rdeodhar rdeodhar requested a review from rolandschulz March 27, 2020 03:08
This reverts commit d357add.

Signed-off-by: Vladimir Lazarev <[email protected]>
Alexander Batashev and others added 11 commits March 27, 2020 12:28
…for (#1348)

The kernel callable being invoked from an nd_range parallel_for is accepting an id argument, while it should be nd_item.

After my analysis, I found we check arguments' type for kernel_parallel_for instead of parallel_for. But that check is useless, because the compiler can still find a candidate for kernel_parallel_for with nd_range and id which is a wrong combination.

In my solution, parallel_for with nd_range calls kernel_parallel_for_nd_range(...) which is only available for nd_item.

Signed-off-by: Bing1 Yu <[email protected]>
Implements a few code simplification/unification for LowerWGScope.

Signed-off-by: Victor Lomuller <[email protected]>
)

For NVPTX target address space inference for kernel arguments and
allocas is happening in the backend (NVPTXLowerArgs and
NVPTXLowerAlloca passes). After frontend these pointers are in LLVM
default address space 0 which is the generic address space for NVPTX
target. Perform address space cast of a pointer to the shadow global
variable from the local to the generic address space before replacing
all usages of a byval argument.

Signed-off-by: Artur Gainullin <[email protected]>
- Adds static members to sub_group class.
- sub_group member functions marked deprecated, to be removed later.
- SPIR-V helpers expanded to convert SYCL group to SPIR-V scope.
- Add workaround for half types

Signed-off-by: John Pennycook <[email protected]>
Whereas it is not possible to generate vector of bools in FE,
we have to change return type for corresponding instructions in SPIRV
translator to vector of bools. SPIRV translator already did this for
some instructions, this patch extends this behaviour to handle more
instructions.
Adding doxygen documentation to PI CUDA backend.
Some code is re-ordered in the file to help sorting the
doxygen.

Co-Authored-By: Alexey Bader <[email protected]>
Co-Authored-By: Alexander Batashev <[email protected]>
Co-Authored-By: Romanov Vlad <[email protected]>

Signed-off-by: Ruyman Reyes <[email protected]>
Based on
https://github.com/codeplaysoftware/standards-proposals/blob/master/spec-constant/index.md

* [SYCL] PI changes:

1. Add specialization constant API to the SYCL RT Plugin Interface.
New PI API added:
pi_result piProgramSetSpecializationConstant(pi_program prog, pi_uint32 spec_id,
                                             size_t spec_size,
                                             const void *spec_value);
2. Add property set fields to the binary image descriptor, bump PI version.
This change breaks backward binary compatibility of device binary image descriptors.
3. Add convenience C++ wrappers for PI binary image hierarchy objects.

* [SYCL] Support device binary properties and file tables in the offload wrapper.

1. New option - "-properties=<file>". <file> must be a property set registry
file, as defined by llvm/Support/PropertySetIO.h. The wrapper will add the
property sets to the binary image descriptor and the them available to the
runtime.

2. New options - "-batch". With this option the only input can be a file table,
as defined by llvm/Support/SimpleTable.h. Column names are a part of interface
between this tool and the sycl-post-link, which produces the file table.

3. Binary image descriptor LLVM type updated to resemble changes in Plugin
Interface v1.2.

* [SYCL] Specialization constants support in the Front End.

1. Detect kernel lambda object captures corresponding to specialization
constants and (a) don't create kernel arguments for them (b) generate
specializations of the SpecConstantInfo structure into the integration
header.

2. Recognize the __unique_stable_name intrinsic and replace
it with a string literal uniquely identifying the type of the typename
template parameter to this intrinsic.

3. FE-related changes in the runtime:
- new SpecConstantInfo templated struct for type->name translation for
  specialization constants used by integration header
- define the __sycl_fe_getStableUniqueTypeName intrinsic

* [SYCL] Add specialization constant support in SYCL runtime.

1. Define SYCL API (sycl/include/CL/sycl/experimental/spec_constant.hpp)
2. Add convenience C++ wrappers for PI device binary structures and refactor
   runtime to use the wrappers. Get rid of custom deleters for binary images.
3. Implement SYCL spec constant APIs in program an program manager.

* [SYCL] Use file-table-tform in SYCL offload processing in clang driver.

Clang driver's design can't handily model
(1) multiple inputs/outputs in the action graph. Because of that, for
example, sycl-post-link tool is invoked twice - once to to split the code
and produce multiple bitcode files, and secondly - to generate symbol
files for the split modules.
(2) "Clusters" of inputs/outputs, when subsets of inputs/outputs are
associated and describe different aspects of the same data. Example of
such clustering is the split module + its symbol file above. Clustering
would require support both in the driver and the tools invoked in
response to actions.

This commit moves SYCL offload processing to the "file table concept."
sycl-post-link instead of
(1) being invoked n times, once per each output type requested (once for
    device split and once for symbol file generation)
(2) outputting multiple file lists each listing outputs from the
    corresponding invocation above
is now invoked once and produces single file table output. E.g.
  [Code|Symbols|Properties]
  a_0.bc|a_0.sym|a_0.props
  a_1.bc|a_1.sym|a_1.props
This solves both problems - multiple input/output and clustering.
Combined with the file-table-tform tool, this allows for efficent handling
of multiple clusters of files (each represented as a row in the table file)
in the clang driver infrastructure.
For example, there is a real offload processing problem:
step1. sycl-post-link outputs N clusters of files
step2. "Code" file of each cluster resuilting from step1 ({a_0.bc, a_1.bc}
       in the example above) must undergo further transformations -
       translation to SPIRV and optional ahead-of-time compilation.
step3. In each cluster resulting from step1 the "Code" file needs to be
       replaced with the result of step2
step4. All the clusters are processed by the ClangOffloadWrapper tool, which
       needs to know how files are distributed into clusters and what is
       the roles of each file in a cluster - whether it is "Code", "Symbol"
       or "Properties".
To solve this, the following action graph is constructed in the clang driver:

                        column:"Code"
t1 -> [file-table-tform:extract column] -> t1a -> [for-each:] -> t1b
                                                  llvm-spirv
                                                   aot-comp
t1
   \                  column:"Code"
    [file-table-tform:replace column] -> t2 -> [ClangOffloadWrapper]
   /
t1b

where t1b is ["Code"]  and t2 is  [Code|Symbols|Properties]
              a_0.bin             a_0.bin|a_0.sym|a_0.props
              a_1.bin             a_1.bin|a_1.sym|a_1.props

Note that the graph does not change with growing number of clusters, neither
it changes when more files are added to each cluster (e.g. a "Manifest" file).

* [SYCL] Process specialization constants in sycl-post-link tool.

Add a spec constant lowering pass to sycl-post-link tool. Support
file table output format.

* [SYCL] Temporarily disable spec_const_hw.cpp on CPU.

CPU OpenCL Runtime on build machines is not updated yet.

Signed-off-by: Konstantin S Bobrovsky <[email protected]>
@AlexeySachkov
Copy link
Contributor

Seems to be moved to #1423

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.