Skip to content

[SYCL] Fix assertion failure in E2E marray test #14234

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Jul 1, 2024
Merged
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
28 changes: 20 additions & 8 deletions sycl/test-e2e/Basic/built-ins/helpers.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -33,16 +33,28 @@ void test(bool CheckDevice, double delta, FuncTy F, ExpectedTy Expected,

sycl::buffer<bool, 1> SuccessBuf{1};

sycl::queue q;
sycl::device dev = q.get_device();
// Make sure we don't use fp64 on devices that don't support it.
sycl::detail::get_elem_type_t<ExpectedTy> d(delta);

sycl::queue{}.submit([&](sycl::handler &cgh) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we actually have a bug somewhere in our execution graph builder. queue destruction is a non-blocking operation, but the kernel should still be launched and all completion events communicated as usual:

A SYCL queue may be destroyed even when there are uncompleted commands that have been submitted to the queue. Doing so does not block. Instead, any commands that have been submitted to the queue begin execution when their requisites are satisfied, just as they would had the queue not been destroyed. Any event objects for those commands are signaled in the normal manner when the command completes. Resources associated with the queue will be freed by the time the last command completes.

Which makes me think that it could be a bug that we don't communicate kernel completion event properly and host_accessor creation doesn't wait for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this temporary queue is the problem here to be honest, I just rewrote it to declare it beforehand in order to improve readability. The change that actually fixed the test was introducing the new boolean variable result and having the buffer point to it. This is also unusual because according to the spec, host accessor can be safely used to access buffers like the test was doing before and yet it was failing so we could also have a bug in host accessor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

host accessor can be safely used to access buffers like the test was doing before and yet it was failing so we could also have a bug in host accessor

IMO, we shouldn't be merging this "fix" until the investigation is done.

Copy link
Contributor Author

@lbushi25 lbushi25 Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some more investigation, it seems to be a GPU accuracy issue. Note in line 37 of helpers.hpp the delta error tolerance is converted from double to whatever the type that the function under test produces. If this type happens to be float, then some accuracy is lost going from double to float and apparently in some of the test cases in the marray_common.cpp file, the results of the GPU computation differ from the expected values by large enough errors so as to expose this loss in accuracy and the equal(result, expected, delta) function that verifies the result returns false which causes our assertion to fail.

Therefore, the synchronization is actually correct, the problem seems to be the lack of accuracy of GPU for float arguments or any argument type with significantly less bits than double. Easy fix by removing the variable line 37 and replacing d by the original delta which has type double?
I tested this and it works, its simple and IMO does not compromise the original purpose of the test.
Also tagging @steffenlarsen if he has time to give his 2 cents on this.

Just for clarity, at the moment I've rewritten the test to use buffers with host pointers instead of host_accessor but that was before I knew that this accuracy problem was the heart of the issue. As to why the test was passing when using buffers with host pointers, I'm clueless!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the idea of upscaling before error-tolerance checking.

Copy link
Contributor Author

@lbushi25 lbushi25 Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the changes. Also, I explicitly created a context and created a queue with that context in order to make the test independent of the default context extension.

const bool fp64 = dev.has(sycl::aspect::fp64);
q.submit([&](sycl::handler &cgh) {
sycl::accessor Success{SuccessBuf, cgh};
cgh.single_task([=]() {
auto R = F(Args...);
static_assert(std::is_same_v<decltype(Expected), decltype(R)>);
Success[0] = equal(R, Expected, d);
});
if (fp64) {
cgh.single_task([=]() {
auto R = F(Args...);
static_assert(std::is_same_v<decltype(Expected), decltype(R)>);
// use double precision error tolerance when fp64 supported
Success[0] = equal(R, Expected, delta);
});
} else {
// downscale the error tolerance when fp64 is not supported
sycl::detail::get_elem_type_t<ExpectedTy> d(delta);
cgh.single_task([=]() {
auto R = F(Args...);
static_assert(std::is_same_v<decltype(Expected), decltype(R)>);
Success[0] = equal(R, Expected, d);
});
}
});
assert(sycl::host_accessor{SuccessBuf}[0]);
}
Expand Down
Loading