Skip to content

[SYCL] Fix assertion failure in E2E marray test #14234

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Jul 1, 2024

Conversation

lbushi25
Copy link
Contributor

@lbushi25 lbushi25 commented Jun 20, 2024

This PR fixes a GPU accuracy bug by upscaling the error-tolerance to a double type if the GPU supports 64-bit floating point arithmetic.

@lbushi25 lbushi25 requested a review from a team as a code owner June 20, 2024 06:20
@lbushi25 lbushi25 requested a review from sergey-semenov June 20, 2024 06:20
// Make sure we don't use fp64 on devices that don't support it.
sycl::detail::get_elem_type_t<ExpectedTy> d(delta);

sycl::queue{}.submit([&](sycl::handler &cgh) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we actually have a bug somewhere in our execution graph builder. queue destruction is a non-blocking operation, but the kernel should still be launched and all completion events communicated as usual:

A SYCL queue may be destroyed even when there are uncompleted commands that have been submitted to the queue. Doing so does not block. Instead, any commands that have been submitted to the queue begin execution when their requisites are satisfied, just as they would had the queue not been destroyed. Any event objects for those commands are signaled in the normal manner when the command completes. Resources associated with the queue will be freed by the time the last command completes.

Which makes me think that it could be a bug that we don't communicate kernel completion event properly and host_accessor creation doesn't wait for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this temporary queue is the problem here to be honest, I just rewrote it to declare it beforehand in order to improve readability. The change that actually fixed the test was introducing the new boolean variable result and having the buffer point to it. This is also unusual because according to the spec, host accessor can be safely used to access buffers like the test was doing before and yet it was failing so we could also have a bug in host accessor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host accessor can be safely used to access buffers like the test was doing before and yet it was failing so we could also have a bug in host accessor

IMO, we shouldn't be merging this "fix" until the investigation is done.

Copy link
Contributor Author

@lbushi25 lbushi25 Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more investigation, it seems to be a GPU accuracy issue. Note in line 37 of helpers.hpp the delta error tolerance is converted from double to whatever the type that the function under test produces. If this type happens to be float, then some accuracy is lost going from double to float and apparently in some of the test cases in the marray_common.cpp file, the results of the GPU computation differ from the expected values by large enough errors so as to expose this loss in accuracy and the equal(result, expected, delta) function that verifies the result returns false which causes our assertion to fail.

Therefore, the synchronization is actually correct, the problem seems to be the lack of accuracy of GPU for float arguments or any argument type with significantly less bits than double. Easy fix by removing the variable line 37 and replacing d by the original delta which has type double?
I tested this and it works, its simple and IMO does not compromise the original purpose of the test.
Also tagging @steffenlarsen if he has time to give his 2 cents on this.

Just for clarity, at the moment I've rewritten the test to use buffers with host pointers instead of host_accessor but that was before I knew that this accuracy problem was the heart of the issue. As to why the test was passing when using buffers with host pointers, I'm clueless!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the idea of upscaling before error-tolerance checking.

Copy link
Contributor Author

@lbushi25 lbushi25 Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the changes. Also, I explicitly created a context and created a queue with that context in order to make the test independent of the default context extension.

@lbushi25
Copy link
Contributor Author

@intel/llvm-reviewers-runtime ping

@aelovikov-intel
Copy link
Contributor

Please fix PR's title. It goes into git commit message and stays there forever. We don't need to describe every single wrong step we've made on the path to the PR there.

Comment on lines 37 to 39
sycl::context ctx;
sycl::queue q{ctx, ctx.get_devices()[0]};
q.submit([&](sycl::handler &cgh) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Without it, the queue uses the default context as described in the default context extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_oneapi_default_context.asciidoc.
Core SYCL tests should not rely on extensions so I've explicitly created a context and then created a queue with that context.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what? Lots of tests just do sycl::queue q; and it's perfectly legal in SYCL. Are you saying that because of this extension we must rewrite all these tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what? Lots of tests just do sycl::queue q; and it's perfectly legal in SYCL. Are you saying that because of this extension we must rewrite all these tests?

You're right, the reporter of the tracker was under the impression that the bug was due to the extension so that's why they suggested we rewrite it. I've changed it to use the idiomatic sycl::queue q;

Comment on lines 37 to 38
sycl::queue q;
q.submit([&](sycl::handler &cgh) {
Copy link
Contributor

@aelovikov-intel aelovikov-intel Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?!.. You can just remove line 37/update line 44 on the left and not change anything else at all. Not even line 53 on the right side...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 53 was a rogue change of the formatter but I've reverted it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.

That's debatable, which is it has to go separately from the bugfix, if at all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.

That's debatable, which is why it has to go separately from the bugfix, if at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.

That's debatable, which is it has to go separately from the bugfix, if at all.

Ok, have a look now.

Copy link
Contributor Author

@lbushi25 lbushi25 Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a few more changes to make sure that the upscaling does not happen if the device does not support 64-bit floating point arithmetic. This was exposed by pre-commit tests.

@lbushi25 lbushi25 requested a review from a team as a code owner June 28, 2024 02:38
@lbushi25 lbushi25 removed the request for review from a team June 28, 2024 02:43
@AlexeySachkov AlexeySachkov merged commit 8a44553 into intel:sycl Jul 1, 2024
14 checks passed
KseniyaTikhomirova pushed a commit to KseniyaTikhomirova/llvm that referenced this pull request Jul 1, 2024
This PR fixes a GPU accuracy bug by upscaling the error-tolerance to a
double type if the GPU supports 64-bit floating point arithmetic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants