-
Notifications
You must be signed in to change notification settings - Fork 794
[SYCL] Fix assertion failure in E2E marray test #14234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
// Make sure we don't use fp64 on devices that don't support it. | ||
sycl::detail::get_elem_type_t<ExpectedTy> d(delta); | ||
|
||
sycl::queue{}.submit([&](sycl::handler &cgh) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we actually have a bug somewhere in our execution graph builder. queue
destruction is a non-blocking operation, but the kernel should still be launched and all completion events communicated as usual:
A SYCL queue may be destroyed even when there are uncompleted commands that have been submitted to the queue. Doing so does not block. Instead, any commands that have been submitted to the queue begin execution when their requisites are satisfied, just as they would had the queue not been destroyed. Any event objects for those commands are signaled in the normal manner when the command completes. Resources associated with the queue will be freed by the time the last command completes.
Which makes me think that it could be a bug that we don't communicate kernel completion event properly and host_accessor
creation doesn't wait for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this temporary queue is the problem here to be honest, I just rewrote it to declare it beforehand in order to improve readability. The change that actually fixed the test was introducing the new boolean variable result
and having the buffer point to it. This is also unusual because according to the spec, host accessor can be safely used to access buffers like the test was doing before and yet it was failing so we could also have a bug in host accessor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
host accessor can be safely used to access buffers like the test was doing before and yet it was failing so we could also have a bug in host accessor
IMO, we shouldn't be merging this "fix" until the investigation is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some more investigation, it seems to be a GPU accuracy issue. Note in line 37 of helpers.hpp the delta error tolerance is converted from double to whatever the type that the function under test produces. If this type happens to be float, then some accuracy is lost going from double to float and apparently in some of the test cases in the marray_common.cpp
file, the results of the GPU computation differ from the expected values by large enough errors so as to expose this loss in accuracy and the equal(result, expected, delta)
function that verifies the result returns false which causes our assertion to fail.
Therefore, the synchronization is actually correct, the problem seems to be the lack of accuracy of GPU for float arguments or any argument type with significantly less bits than double. Easy fix by removing the variable line 37 and replacing d by the original delta which has type double?
I tested this and it works, its simple and IMO does not compromise the original purpose of the test.
Also tagging @steffenlarsen if he has time to give his 2 cents on this.
Just for clarity, at the moment I've rewritten the test to use buffers with host pointers instead of host_accessor but that was before I knew that this accuracy problem was the heart of the issue. As to why the test was passing when using buffers with host pointers, I'm clueless!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the idea of upscaling before error-tolerance checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made the changes. Also, I explicitly created a context and created a queue with that context in order to make the test independent of the default context extension.
@intel/llvm-reviewers-runtime ping |
Please fix PR's title. It goes into git commit message and stays there forever. We don't need to describe every single wrong step we've made on the path to the PR there. |
sycl::context ctx; | ||
sycl::queue q{ctx, ctx.get_devices()[0]}; | ||
q.submit([&](sycl::handler &cgh) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this?
Without it, the queue uses the default context as described in the default context extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_oneapi_default_context.asciidoc.
Core SYCL tests should not rely on extensions so I've explicitly created a context and then created a queue with that context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what? Lots of tests just do sycl::queue q;
and it's perfectly legal in SYCL. Are you saying that because of this extension we must rewrite all these tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what? Lots of tests just do
sycl::queue q;
and it's perfectly legal in SYCL. Are you saying that because of this extension we must rewrite all these tests?
You're right, the reporter of the tracker was under the impression that the bug was due to the extension so that's why they suggested we rewrite it. I've changed it to use the idiomatic sycl::queue q;
sycl::queue q; | ||
q.submit([&](sycl::handler &cgh) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?!.. You can just remove line 37/update line 44 on the left and not change anything else at all. Not even line 53 on the right side...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 53 was a rogue change of the formatter but I've reverted it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.
That's debatable, which is it has to go separately from the bugfix, if at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.
That's debatable, which is why it has to go separately from the bugfix, if at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a matter of taste I suppose, It looks cleaner to use a variable name for the queue.
That's debatable, which is it has to go separately from the bugfix, if at all.
Ok, have a look now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a few more changes to make sure that the upscaling does not happen if the device does not support 64-bit floating point arithmetic. This was exposed by pre-commit tests.
This PR fixes a GPU accuracy bug by upscaling the error-tolerance to a double type if the GPU supports 64-bit floating point arithmetic.
This PR fixes a GPU accuracy bug by upscaling the error-tolerance to a double type if the GPU supports 64-bit floating point arithmetic.