[SYCL] Fix assertion failure in E2E marray test #14234

lbushi25 · 2024-06-20T06:20:53Z

This PR fixes a GPU accuracy bug by upscaling the error-tolerance to a double type if the GPU supports 64-bit floating point arithmetic.

AlexeySachkov · 2024-06-20T07:45:06Z

sycl/test-e2e/Basic/built-ins/helpers.hpp

-  // Make sure we don't use fp64 on devices that don't support it.
-  sycl::detail::get_elem_type_t<ExpectedTy> d(delta);
-
-  sycl::queue{}.submit([&](sycl::handler &cgh) {


I wonder if we actually have a bug somewhere in our execution graph builder. Queue destruction is a non-blocking operation, but the kernel should still be launched and all completion events communicated as usual:

A SYCL queue may be destroyed even when there are uncompleted commands that have been submitted to the queue. Doing so does not block. Instead, any commands that have been submitted to the queue begin execution when their requisites are satisfied, just as they would had the queue not been destroyed. Any event objects for those commands are signaled in the normal manner when the command completes. Resources associated with the queue will be freed by the time the last command completes.

This makes me think the bug could be that we don't communicate the kernel completion event properly, so host_accessor creation doesn't wait for it.

I don't think this temporary queue is the problem here to be honest, I just rewrote it to declare it beforehand in order to improve readability. The change that actually fixed the test was introducing the new boolean variable result and having the buffer point to it. This is also unusual because according to the spec, host accessor can be safely used to access buffers like the test was doing before and yet it was failing so we could also have a bug in host accessor.

host accessor can be safely used to access buffers like the test was doing before and yet it was failing so we could also have a bug in host accessor

IMO, we shouldn't be merging this "fix" until the investigation is done.

I did some more investigation, and it seems to be a GPU accuracy issue. Note that in line 37 of helpers.hpp the delta error tolerance is converted from double to whatever type the function under test produces. If this type happens to be float, some accuracy is lost in the conversion from double to float. Apparently, in some of the test cases in marray_common.cpp, the results of the GPU computation differ from the expected values by errors large enough to expose this loss of accuracy, so the equal(result, expected, delta) function that verifies the result returns false, which causes our assertion to fail.

Therefore, the synchronization is actually correct; the problem seems to be the lack of accuracy on the GPU for float arguments, or for any argument type with significantly fewer bits than double. An easy fix would be to remove the variable on line 37 and replace d with the original delta, which has type double.
I tested this and it works; it's simple and, IMO, does not compromise the original purpose of the test.
Also tagging @steffenlarsen in case he has time to give his 2 cents on this.
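
To illustrate the effect described above, here is a minimal host-only sketch, not the actual helpers.hpp code. The names within_tolerance and narrowing_flips_result, and the values 0.7 and 0.69999999, are made up for illustration. The only point is that narrowing a double tolerance to float moves the threshold, so a borderline difference can pass against the double tolerance and fail against the narrowed one.

```cpp
#include <cassert>

// Hypothetical stand-in for the tolerance check; TolT is the type the
// tolerance is stored in (double, or float after narrowing).
template <typename TolT>
bool within_tolerance(double abs_diff, TolT tol) {
  return abs_diff <= static_cast<double>(tol);
}

// Returns true when the same difference passes against the double tolerance
// but fails once the tolerance has been narrowed to float.
bool narrowing_flips_result() {
  const double delta = 0.7;                          // illustrative tolerance
  const float narrowed = static_cast<float>(delta);  // nearest float lies just below 0.7
  const double abs_diff = 0.69999999;                // sits between the two thresholds
  return within_tolerance(abs_diff, delta) &&
         !within_tolerance(abs_diff, narrowed);
}
```

Whether narrowing makes the check stricter or more lenient depends on which way the particular tolerance value rounds, so this sketch shows one direction only.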

Just for clarity, at the moment I've rewritten the test to use buffers with host pointers instead of host_accessor, but that was before I knew that this accuracy problem was at the heart of the issue. As to why the test was passing when using buffers with host pointers, I'm clueless!

I agree with the idea of upscaling before error-tolerance checking.

I've made the changes. Also, I explicitly created a context and created a queue with that context in order to make the test independent of the default context extension.

lbushi25 · 2024-06-24T15:56:03Z

@intel/llvm-reviewers-runtime ping

aelovikov-intel · 2024-06-26T18:32:35Z

Please fix the PR's title. It goes into the git commit message and stays there forever. We don't need to describe every single wrong step we've made on the path to the PR there.

aelovikov-intel · 2024-06-26T18:33:24Z

sycl/test-e2e/Basic/built-ins/helpers.hpp

+  sycl::context ctx;
+  sycl::queue q{ctx, ctx.get_devices()[0]};
+  q.submit([&](sycl::handler &cgh) {


Why do we need this?

Without it, the queue uses the default context as described in the default context extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_oneapi_default_context.asciidoc.
Core SYCL tests should not rely on extensions, so I've explicitly created a context and then created a queue with that context.

So what? Lots of tests just do sycl::queue q; and it's perfectly legal in SYCL. Are you saying that because of this extension we must rewrite all these tests?

You're right. The reporter of the tracker was under the impression that the bug was due to the extension, which is why they suggested we rewrite it. I've changed it to use the idiomatic sycl::queue q;

aelovikov-intel · 2024-06-26T20:09:07Z

sycl/test-e2e/Basic/built-ins/helpers.hpp

+  sycl::queue q;
+  q.submit([&](sycl::handler &cgh) {


Why?!.. You can just remove line 37/update line 44 on the left and not change anything else at all. Not even line 53 on the right side...

It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.

Line 53 was a rogue change of the formatter but I've reverted it.

It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.

That's debatable, which is why it has to go separately from the bugfix, if at all.

Ok, have a look now.

Made a few more changes to make sure that the upscaling does not happen if the device does not support 64-bit floating point arithmetic. This was exposed by pre-commit tests.
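
As a rough sketch of the idea (not the code that landed in helpers.hpp), the comparison can be done in double only when the device supports it, falling back to the element type otherwise. The function name equal_with_tolerance and the has_fp64 flag are hypothetical; in a real SYCL test the flag would presumably come from a device query such as dev.has(sycl::aspect::fp64).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical host-side model of the check. has_fp64 stands in for a
// device capability query; it is an assumption, not the helper's real API.
template <typename T>
bool equal_with_tolerance(T result, T expected, double delta, bool has_fp64) {
  if (has_fp64) {
    // Upscale both operands and keep the tolerance in double.
    const double diff =
        static_cast<double>(result) - static_cast<double>(expected);
    return std::fabs(diff) <= delta;
  }
  // No 64-bit floating point available: stay in the element type and
  // narrow the tolerance to match.
  const T narrowed = static_cast<T>(delta);
  return std::fabs(result - expected) <= narrowed;
}
```

For example, equal_with_tolerance(1.25f, 1.0f, 0.5, true) and equal_with_tolerance(1.25f, 1.0f, 0.5, false) both accept the result, while a difference of 0.5 against a tolerance of 0.25 is rejected on either path.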

lbushi25 added 3 commits June 19, 2024 23:01

Fix bug in forward_progress extension

bc939bd

Empty

1c21400

Fix buffer consistency bug in built in marray test

6d42c93

lbushi25 requested a review from a team as a code owner June 20, 2024 06:20

lbushi25 requested a review from sergey-semenov June 20, 2024 06:20

Remove rogue changes

aa8772d

Remove rogue changes

5d9f6e1

Update helpers.hpp

75133ad

AlexeySachkov reviewed Jun 20, 2024

lbushi25 added 2 commits June 26, 2024 09:24

Upscale error tolerance to double in marray tests helper file

b9374d8

Apply formatter

fd699f4

lbushi25 requested a review from aelovikov-intel June 26, 2024 18:08

aelovikov-intel reviewed Jun 26, 2024

Update helpers.hpp

29ad244

lbushi25 requested a review from aelovikov-intel June 26, 2024 20:06

aelovikov-intel reviewed Jun 26, 2024

lbushi25 added 2 commits June 26, 2024 16:15

Update helpers.hpp

3661d3b

Update helpers.hpp

6fa0fa6

Update helpers.hpp

41c4641

Update helpers.hpp

7593a2c

lbushi25 added 3 commits June 27, 2024 19:36

Add link to bug for hardware_dispatch test

2c5f35a

Update helpers.hpp

85ed5c6

Update helpers.hpp

131ba3e

lbushi25 requested a review from a team as a code owner June 28, 2024 02:38

lbushi25 added 7 commits June 27, 2024 22:39

Update hardware_dispatch.cpp

8f05214

Update hardware_dispatch.cpp

d46d0ca

Update helpers.hpp

587f713

Update helpers.hpp

070dec6

Update helpers.hpp

53383fb

Update helpers.hpp

891fc0d

Update helpers.hpp

c98df77

lbushi25 removed the request for review from a team June 28, 2024 02:43

Update helpers.hpp

8d47e91

Update helpers.hpp

ca899a6

AlexeySachkov merged commit 8a44553 into intel:sycl Jul 1, 2024
14 checks passed
