Feature: LL nvlink p2p #173
Conversation
Thanks so much for this. The SGLang/vLLM team strongly needs it, since EP settings in the community are usually not that large. The only suggestion is about overlapping: NVLink uses memory semantics, so kernels won't exit until the transmission is finished. How about adding a flag …
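A minimal sketch of what such a toggle could look like, assuming an environment-variable gate. The variable name `DEEPEP_LL_NVLINK_P2P` and the helper function are hypothetical illustrations, not the actual flag discussed in this PR:

```python
import os

def use_nvlink_p2p() -> bool:
    """Return True when the NVLink P2P path should be used.

    Hypothetical gate: defaults to on, but lets users fall back to the
    pure-RDMA path if the extra kernel time from NVLink load/store
    hurts their compute/communication overlap.
    """
    return os.getenv("DEEPEP_LL_NVLINK_P2P", "1") == "1"
```

A caller would check `use_nvlink_p2p()` once at dispatch setup and select the transport accordingly.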
No worries, I will refactor this for you. Merging it first.
I've added another PR #174 with some refactoring and bug fixes. The PCIe domain may not support some atomic operations. Although we don't use such atomic operations, we don't have clusters to test this, so we also added a warning in the comments. See https://docs.nvidia.com/cuda/parallel-thread-execution/#limitations-system-scope-atomicity and https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity.
When we use NVLink load/store, it does indeed increase kernel execution time; however, in our tests on 4 nodes (32 EP), we found the increase is actually quite small, approximately 1 us to 3 us. Notably, in the 2-node (16 EP) H800 test scenario, the latency is slightly higher because NVLink bandwidth is insufficient, which makes the kernel execution time longer (load/store instructions can be issued faster than NVLink can transfer the data).
Thanks for the explanation, you are right: if the …
Would it be possible for the repository maintainer to include the experiment results in the README? The use of NVLink P2P copy for intra-node data transfer in low-latency scenarios is a valuable feature, but many users may not be aware that it's now supported. |
@yhyang201 Already updated in de8cfca. Thanks for the advice!
This PR introduces the use of NVLink P2P copy for intra-node data transfer in low-latency scenarios. For smaller-scale decode instances, we found this approach improves overall decoding throughput. Specifically, in our inference service, 4-node decode efficiency improves by 27%.
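The routing idea can be sketched as follows. The function name and `ranks_per_node=8` (typical for 8-GPU H20/H800 nodes, consistent with the 4-node/32-EP figures above) are assumptions for illustration, not the PR's actual code:

```python
def transport_for(src_rank: int, dst_rank: int, ranks_per_node: int = 8) -> str:
    """Pick the transport for one src->dst transfer in low-latency mode.

    Intra-node pairs use the new NVLink P2P copy path; cross-node
    traffic still goes over RDMA, as it did before this PR.
    """
    same_node = src_rank // ranks_per_node == dst_rank // ranks_per_node
    return "nvlink_p2p" if same_node else "rdma"
```

For example, with 8 ranks per node, ranks 0 and 7 share a node and use NVLink, while ranks 0 and 8 cross nodes and use RDMA.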
The following test data was obtained using H20: