-
Notifications
You must be signed in to change notification settings - Fork 797
[SYCL] Fix assertion failure in E2E marray test #14234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
27 commits
Select commit
Hold shift + click to select a range
bc939bd
Fix bug in forward_progress extension
lbushi25 1c21400
Empty
lbushi25 6d42c93
Fix buffer consistency bug in built in marray test
lbushi25 aa8772d
Remove rogue changes
lbushi25 5d9f6e1
Remove rogue changes
lbushi25 75133ad
Update helpers.hpp
lbushi25 b9374d8
Upscale error tolerance to double in marray tests helper file
lbushi25 fd699f4
Apply formatter
lbushi25 29ad244
Update helpers.hpp
lbushi25 3661d3b
Update helpers.hpp
lbushi25 6fa0fa6
Update helpers.hpp
lbushi25 d6b8a5a
Update helpers.hpp
lbushi25 6f3e1ad
Verify device supports fp64 before upscaling to double precision
lbushi25 41c4641
Update helpers.hpp
lbushi25 7593a2c
Update helpers.hpp
lbushi25 2c5f35a
Add link to bug for hardware_dispatch test
lbushi25 85ed5c6
Update helpers.hpp
lbushi25 131ba3e
Update helpers.hpp
lbushi25 8f05214
Update hardware_dispatch.cpp
lbushi25 d46d0ca
Update hardware_dispatch.cpp
lbushi25 587f713
Update helpers.hpp
lbushi25 070dec6
Update helpers.hpp
lbushi25 53383fb
Update helpers.hpp
lbushi25 891fc0d
Update helpers.hpp
lbushi25 c98df77
Update helpers.hpp
lbushi25 8d47e91
Update helpers.hpp
lbushi25 ca899a6
Update helpers.hpp
lbushi25 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we actually have a bug somewhere in our execution graph builder.
queue
destruction is a non-blocking operation, but the kernel should still be launched and all completion events communicated as usual:Which makes me think that it could be a bug that we don't communicate kernel completion event properly and
host_accessor
creation doesn't wait for itThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this temporary queue is the problem here to be honest, I just rewrote it to declare it beforehand in order to improve readability. The change that actually fixed the test was introducing the new boolean variable
result
and having the buffer point to it. This is also unusual because according to the spec, host accessor can be safely used to access buffers like the test was doing before and yet it was failing so we could also have a bug in host accessor.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we shouldn't be merging this "fix" until the investigation is done.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some more investigation, it seems to be a GPU accuracy issue. Note in line 37 of helpers.hpp the delta error tolerance is converted from double to whatever the type that the function under test produces. If this type happens to be float, then some accuracy is lost going from double to float and apparently in some of the test cases in the
marray_common.cpp
file, the results of the GPU computation differ from the expected values by large enough errors so as to expose this loss in accuracy and theequal(result, expected, delta)
function that verifies the result returns false which causes our assertion to fail.Therefore, the synchronization is actually correct, the problem seems to be the lack of accuracy of GPU for float arguments or any argument type with significantly less bits than double. Easy fix by removing the variable line 37 and replacing d by the original delta which has type double?
I tested this and it works, its simple and IMO does not compromise the original purpose of the test.
Also tagging @steffenlarsen if he has time to give his 2 cents on this.
Just for clarity, at the moment I've rewritten the test to use buffers with host pointers instead of host_accessor but that was before I knew that this accuracy problem was the heart of the issue. As to why the test was passing when using buffers with host pointers, I'm clueless!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the idea of upscaling before error-tolerance checking.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made the changes. Also, I explicitly created a context and created a queue with that context in order to make the test independent of the default context extension.