[SYCL][CUDA] Decouple CUDA contexts from PI contexts
This patch moves the CUDA context from the PI context to the PI device,
and switches to always using the primary context.
CUDA contexts are different from SYCL contexts in that they are tied to a
single device, and they must be active on the calling thread for most
calls to the CUDA driver API.
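As a minimal illustration of that requirement, here is a rough sketch using the CUDA driver API directly (error handling omitted, not plugin code):

```cpp
// Sketch of the driver API's thread-local "current context" requirement.
#include <cuda.h>

int main() {
  cuInit(0);

  CUdevice dev;
  cuDeviceGet(&dev, 0);

  // The primary context is a per-device context shared with the CUDA runtime.
  CUcontext ctx;
  cuDevicePrimaryCtxRetain(&ctx, dev);

  // Most driver calls implicitly use whichever context is current on the
  // calling thread, so it must be made current before e.g. allocating memory.
  cuCtxSetCurrent(ctx);

  CUdeviceptr ptr;
  cuMemAlloc(&ptr, 1024);
  cuMemFree(ptr);

  cuCtxSetCurrent(nullptr);
  cuDevicePrimaryCtxRelease(dev);
  return 0;
}
```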
As shown in #8124 and #7526, the current mapping of CUDA context to PI
context causes issues for device-based entry points that still need to
call the CUDA APIs. We have workarounds for that, but they are somewhat
hacky, inefficient, and have a lot of edge-case issues.
The peer-to-peer interface proposal in #6104 is also device-based, but
enabling peer-to-peer access in CUDA is done on the CUDA contexts, so the
current mapping would make it difficult to implement.
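To make that concrete, here is a hypothetical sketch of how peer access could be enabled once each device owns its primary context; the function and parameter names are illustrative, not the actual plugin API:

```cpp
// With one primary context per device, enabling P2P between two devices
// reduces to activating one device's context and granting it access to
// the other's context.
#include <cuda.h>

void enable_p2p(CUdevice dev, CUdevice peer_dev,
                CUcontext dev_primary_ctx, CUcontext peer_primary_ctx) {
  int can_access = 0;
  cuDeviceCanAccessPeer(&can_access, dev, peer_dev);
  if (!can_access)
    return;

  // Peer access is a property of the contexts, not the devices.
  cuCtxSetCurrent(dev_primary_ctx);
  cuCtxEnablePeerAccess(peer_primary_ctx, 0 /* flags must be 0 */);
}
```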
This patch solves most of these issues by decoupling the CUDA context
from the SYCL context and simply managing the CUDA contexts in the
devices. It also changes the CUDA context management to always use the
primary context.
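As a rough sketch of what this management could look like, assuming the PI device retains its device's primary context and entry points use an RAII guard to activate it (the `_pi_device` and `ScopedContext` names below are illustrative, not the exact plugin types):

```cpp
#include <cuda.h>

struct _pi_device {
  CUdevice cu_device_;
  CUcontext cu_context_; // retained primary context, owned by the device

  explicit _pi_device(CUdevice dev) : cu_device_(dev) {
    // Retain the primary context once, at device creation.
    cuDevicePrimaryCtxRetain(&cu_context_, cu_device_);
  }
  ~_pi_device() { cuDevicePrimaryCtxRelease(cu_device_); }
};

// RAII guard used by device-based entry points: make the device's primary
// context current for the duration of the call, restoring the previous one.
class ScopedContext {
  CUcontext previous_ = nullptr;
  bool needs_restore_ = false;

public:
  explicit ScopedContext(const _pi_device &device) {
    cuCtxGetCurrent(&previous_);
    if (previous_ != device.cu_context_) {
      cuCtxSetCurrent(device.cu_context_);
      needs_restore_ = true;
    }
  }
  ~ScopedContext() {
    if (needs_restore_)
      cuCtxSetCurrent(previous_);
  }
};
```

Because all entry points for a device share the same primary context, the guard only switches contexts when a different one is current, which is where the reduction in accidental context switches comes from.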
This approach has a number of advantages:
* Use of the primary context is recommended by Nvidia
* Simplifies the CUDA context management in the plugin
* Available CUDA context in device based entry points
* Likely more efficient in the general case, with fewer opportunities to
accidentally cause costly CUDA context switches.
* Easier and likely more efficient interactions with CUDA runtime
applications.
* Easier to expose P2P capabilities
* Easier to support multiple devices in a SYCL context
It does have a few drawbacks compared to the previous approach:
* Drops support for `make_context` interop, as there is no sensible "native
  handle" to pass in (`get_native` is still fully supported).
* No opportunity for users to separate their work into different CUDA
  contexts. It's unclear whether there is an actual use case for this; it
  seems very uncommon in CUDA codebases to create multiple CUDA contexts
  for a single CUDA device in the same process.
Overall I believe this should be a net benefit, and we can revisit if we
run into an edge case that needs more fine-grained CUDA context
management.