[SYCL][CUDA][PI] Improve performance of piQueueFinish #6201
Ingenious! LGTM!
Could your improvement be migrated directly to the HIP plugin interface?
This could be migrated to the HIP plugin once it also uses multiple streams.
@smaslov-intel - Would you like to have a look or should we merge this?
LGTM
Fixed an off-by-one error introduced in #6201 that caused queue synchronization to synchronize all streams when no stream had been used. The code still produced correct results, but the unnecessary synchronization could hurt performance in some cases.
Improves performance of `piQueueFinish`, and therefore of `queue::wait()`, on the CUDA backend by reducing the number of `cuStreamSynchronize()` calls invoked. In most use cases this fixes the slowdown to `queue::wait()` introduced in #6102.

This does not change any interface, so there are no changes to the test suite.
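
The general idea of reducing synchronization calls can be sketched as follows. This is a minimal illustration and not the plugin's actual code: `FakeStream`, `StreamPool`, and all member names are hypothetical, and a counter stands in for the real `cuStreamSynchronize()` call so the sketch runs without a GPU. It assumes streams are handed out round-robin and that it is enough to synchronize only the streams handed out since the previous finish.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CUDA stream. A real plugin would hold a
// CUstream and call cuStreamSynchronize() where we bump a counter.
struct FakeStream {
  int sync_calls = 0;
  void synchronize() { ++sync_calls; }
};

// Hypothetical pool that hands out streams round-robin and remembers
// which ones were handed out since the last finish().
class StreamPool {
public:
  explicit StreamPool(std::size_t n) : streams_(n) {
    assert(n > 0 && "the sketch assumes at least one stream");
  }

  // Return the next stream in round-robin order and mark it as used.
  FakeStream &next_stream() {
    FakeStream &s = streams_[next_ % streams_.size()];
    ++next_;
    return s;
  }

  // Synchronize only the streams used since the previous finish().
  // Returns the number of synchronize calls issued.
  std::size_t finish() {
    const std::size_t n = streams_.size();
    const std::size_t pending = next_ - last_sync_;
    // If the round-robin index wrapped, every stream was used.
    const std::size_t count = pending < n ? pending : n;
    for (std::size_t i = 0; i < count; ++i) {
      streams_[(last_sync_ + i) % n].synchronize();
    }
    last_sync_ = next_;
    return count;
  }

  // Total synchronize calls across all streams, for inspection.
  int total_sync_calls() const {
    int total = 0;
    for (const FakeStream &s : streams_) total += s.sync_calls;
    return total;
  }

private:
  std::vector<FakeStream> streams_;
  std::size_t next_ = 0;      // streams handed out so far
  std::size_t last_sync_ = 0; // value of next_ at the last finish()
};
```

With a pool of four streams, calling `finish()` before any stream has been handed out issues no synchronize calls, which is the case the off-by-one fix above is concerned with. Calling it after two hand-outs issues two calls, and only after the round-robin index has wrapped does it synchronize all four.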