Skip to content

[SYCL][Doc] Extended group load/store APIs proposal #7593

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 79 commits into from
Apr 4, 2024

Conversation

aelizaro
Copy link
Contributor

An initial draft of extended group load/store APIs to provide capabilities to work with temporary memory buffers and load/store multiple elements per work item.

@aelizaro aelizaro requested a review from a team as a code owner November 30, 2022 16:39
@aelizaro
Copy link
Contributor Author

@andreyfe1, @Pennycook, could you please take a look at the initial proposal?

Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also wrap things at 80 columns; it makes it much easier to review.

work-group or sub-group and some associated scratch space.

_Effects_: Loads `ItemsPerWorkItem` elements from `in_ptr` to `out`
using the `gh` group helper object. `GroupMemoryHelper` specifies data placement properties and also can work with extra options such as specifying out-of-boundry value and limited work-items number to work with.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using extra options in the GroupMemoryHelper, could we make these two new properties? Then a developer could use them with the overload that doesn't require a GroupMemoryHelper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, removed (2) and (4) and added a TODO for other properties, wanna do some prototyping for it

@dkhaldi
Copy link
Contributor

dkhaldi commented Dec 2, 2022

It will be good to have complete kernels for this extension in tests directory.
Specifically, I would like to see the usage of vec, marray, and simd as the span memory storage.

@aelizaro
Copy link
Contributor Author

aelizaro commented Dec 6, 2022

@dkhaldi, do you mean to add full tests without implementation, to illustrate how it supposes to work with sycl's data types?

@dkhaldi
Copy link
Contributor

dkhaldi commented Dec 6, 2022

@dkhaldi, do you mean to add full tests without implementation, to illustrate how it supposes to work with sycl's data types?

Right.

@aelizaro aelizaro requested a review from a team as a code owner January 6, 2023 13:40
@aelizaro
Copy link
Contributor Author

aelizaro commented Jan 6, 2023

Test cases are added to illustrate how APIs should work.
For sycl::vec case, we can use both span approaches (as it is shown in the test) or have a specialization for a single value case - it will look cleaner.
For sycl::marray it works nicely as its internal representation guarantees contiguous storage of memory, so we can use:

sycl::marray<InputT, items_per_thread> data;
sycl_exp::joint_load(item.get_group(), in.get_pointer(), sycl::span<InputT, items_per_thread>{ data.begin(), data.end() });

Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spotted a few minor issues during review of the latest changes, but I think the fixes are straightforward. Let me know when they're applied and I'll approve.

Comment on lines 402 to 416
namespace sycl::ext::oneapi::experimental {

struct full_work_group_hint {
using value_t =
property_value<full_work_group_hint>;
};

inline constexpr full_work_group_hint::value_t full_work_group;

struct full_sub_group_hint {
using value_t =
property_value<full_sub_group_hint>;
};

inline constexpr full_sub_group_hint::value_t full_sub_group;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should combine these into one hint. We can call it full_group for now.

We already know whether we're using a group or a sub-group from the first argument. There are also going to be other group types in the future, and it would be convenient to be able to re-use the same property across all of those group types as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure full_group is a good name. parallel_for requires that global size is a multiple of a work group size, so WG is always "full" in some sense. It might still be that the last SG of each WG isn't full.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but that just means that full_group doesn't do anything when the algorithm is invoked with a sycl::group, but improves performance when the algorithm is invoked with a sycl::sub_group. The "group" in "full group" is intended to mean "anything satisfying the group concept" and not "sycl::group".

I want to push for a renaming of sycl::group to sycl::work_group to avoid this sort of confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need another property for the WG case, I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Maybe I'm misunderstanding what you're expecting to happen here. I thought your earlier message was saying that work-groups were always "full"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are, but the block-read is SG-level intrinsic. We need SGs to be "full" to use them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're talking past each other.

Consider the below:

auto sg = it.get_sub_group();
group_load(sg, input, output_span, properties{contiguous_memory, full_group});

The interpretation here is that sg represents a full sub-group.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user writes

  auto g = it.get_group();
  group_load(g, input, output_span, properties{contiguous_memory, full_group});

we still can't use block loads without runtime checks. WG is full, but its last SG might not be. Also, full_group is meaningless for WG because it always full per SYCL spec parallel_for/nd_range restrictions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We couldn't use block loads in this case anyway, because the stride in this case is the size of the work-group and not the sub-group. You're right that we could generate block loads in some cases (like a work-group + blocked data placement) but I think we'd still need a runtime check because different devices carve up work-groups into sub-groups differently.

full_group is currently meaningless for work-groups, but I think it makes more sense to have a generic full_group than to make the property sub-group specific. That allows a developer to write generic code that uses full_group and have it work in both cases, and it may be meaningful for future group types.

Comment on lines 397 to 398
The following properties is introduced to be used
as a hint that implementation can use get_max_local_range():
Copy link
Contributor

@Pennycook Pennycook Mar 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users might not understand why this is helpful for implementations, so I think we should be a little more explicit here. Maybe something like:

Suggested change
The following properties is introduced to be used
as a hint that implementation can use get_max_local_range():
The following property can be used as a hint that
`get_local_range()` is equal to `get_max_local_range()`,
which may enable more aggressive optimizations for some
implementations.
[NOTE]
====
Using `full_group` is necessary to generate SPIR-V block read
and block write instructions, because these instructions are
defined to use the maximum group size as the stride.
====

@aelizaro
Copy link
Contributor Author

@gmlueck, @Pennycook, @aelovikov-intel, could you please, double-check the PR? (and huge thanks to you for contribution!)

I think I addressed everything except one ongoing discussion about full_group hint naming #7593 (comment)

Copy link
Contributor

@Pennycook Pennycook left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spotted one minor potential issue, but apart from this and the full_group discussion, I think we're good.

Comment on lines 212 to 216
`Properties` argument is reserved for future revisions of this extention and is
ignored now.
Default value is empty `sycl::ext::oneapi::experimental::empty_properties_t`
May be used in future for setting boundary values or limiting numbers of work
items.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above. Can we use contiguous here?

Copy link
Contributor

@gmlueck gmlueck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good, but I have some comments about the use of the word "hint". See below.

Copy link
Contributor

@gmlueck gmlueck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@aelizaro
Copy link
Contributor Author

aelizaro commented Apr 4, 2024

@Pennycook, can we merge it now?

@Pennycook
Copy link
Contributor

@Pennycook, can we merge it now?

I think so! @intel/llvm-gatekeepers, please merge.

@dm-vodopyanov dm-vodopyanov merged commit e320aa4 into intel:sycl Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants