Feature: LL nvlink p2p #173


Merged
1 commit merged into deepseek-ai:main on May 23, 2025

Conversation

cywork121
Copy link
Contributor

This PR introduces the use of NVLink P2P copy for intra-node data transfer in low-latency scenarios. For smaller-scale decode instance requirements, we found this approach enhances the overall decoding throughput. Specifically, in our inference service, 4-node decode efficiency demonstrates a 27% improvement.

The following test data was obtained using H20:

| #EP | Dispatch latency (us) | Dispatch bandwidth (GB/s) | Combine latency (us) | Combine bandwidth (GB/s) |
|-----|-----------------------|---------------------------|----------------------|--------------------------|
| 8   | 172 👉🏻 37            | 43.5 👉🏻 203.2            | 319 👉🏻 59           | 45.5 👉🏻 244.2           |
| 16  | 193 👉🏻 119           | 38.8 👉🏻 62.9             | 335 👉🏻 199          | 43.3 👉🏻 72.7            |
| 32  | 198 👉🏻 157           | 37.8 👉🏻 47.8             | 344 👉🏻 280          | 42.1 👉🏻 51.7            |
| 64  | 203 👉🏻 176           | 36.9 👉🏻 42.6             | 351 👉🏻 318          | 41.3 👉🏻 45.6            |

@cywork121 cywork121 marked this pull request as ready for review May 22, 2025 14:39
@LyricZhao
Copy link
Collaborator

LyricZhao commented May 23, 2025

Thanks so much for this. The SGLang/vLLM team strongly needs it, as the EP settings used in the community are not too large.

The only suggestion is about overlapping: NVLink uses memory semantics, so kernels won't exit until the transmission is finished.

How about adding a flag `allow_nvlink_traffic` to `Buffer` and asserting that it is incompatible with hook mode?
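As a minimal sketch of the suggested API (the `allow_nvlink_traffic` flag name comes from this comment, but the constructor signature and the `return_recv_hook` parameter name are assumptions for illustration):

```python
class Buffer:
    """Hypothetical sketch of the suggested guard: NVLink memory-semantics
    copies keep the kernel resident until the transfer completes, so they
    cannot be overlapped via the receive-hook mechanism."""

    def __init__(self, allow_nvlink_traffic: bool = False,
                 return_recv_hook: bool = False):
        # NVLink P2P copy is incompatible with hook-based overlapping
        assert not (allow_nvlink_traffic and return_recv_hook), \
            "allow_nvlink_traffic is incompatible with hook mode"
        self.allow_nvlink_traffic = allow_nvlink_traffic
        self.return_recv_hook = return_recv_hook


# OK: NVLink traffic without hook mode
buf = Buffer(allow_nvlink_traffic=True)
```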

@LyricZhao
Copy link
Collaborator

No worries, I will refactor this for you. Merging it first.

@LyricZhao LyricZhao merged commit 68ae8b3 into deepseek-ai:main May 23, 2025
@cywork121
Copy link
Contributor Author

cywork121 commented May 23, 2025

> Thanks so much for this. The SGLang/vLLM team strongly needs it, as the EP settings used in the community are not too large.
>
> The only suggestion is about overlapping: NVLink uses memory semantics, so kernels won't exit until the transmission is finished.
>
> How about adding a flag `allow_nvlink_traffic` to `Buffer` and asserting that it is incompatible with hook mode?

I've implemented an environment variable toggle: set `NVSHMEM_DISABLE_P2P=0` to activate NVLink P2P copy for low-latency (LL) operations.
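For example, a launch might look like the following (the `NVSHMEM_DISABLE_P2P` variable is the one named in this comment; the launch command itself is illustrative, not the project's actual CLI):

```shell
# Enable NVSHMEM's intra-node P2P transport so the LL kernels
# can use NVLink P2P copy (0 = do not disable P2P).
export NVSHMEM_DISABLE_P2P=0
echo "NVSHMEM_DISABLE_P2P=${NVSHMEM_DISABLE_P2P}"

# Illustrative usage, e.g.:
#   NVSHMEM_DISABLE_P2P=0 python your_decode_server.py ...
```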

@LyricZhao
Copy link
Collaborator

LyricZhao commented May 23, 2025

I've added another PR #174 with some refactoring and bug fixes.

The `st_na_relaxed` scope is `.gpu`, but if you enlarge the communication scope to the NVLink domain, you should change it to `.sys` scope (which covers PCIe and NVLink).

The PCIe domain may not support some atomic operations. Although we don't use such atomic operations, we don't have clusters to test this, so we also added a warning in the comments.

See https://docs.nvidia.com/cuda/parallel-thread-execution/#limitations-system-scope-atomicity and https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity.
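To illustrate the scope change being described (the `st_na_relaxed` helper name is from this comment, but this exact wrapper and its PTX body are a sketch, not the project's actual code):

```cuda
// Hedged sketch: widening a relaxed store from device scope to system
// scope so peers reached over NVLink (and PCIe) observe it.
// .gpu scope only orders accesses within the local device;
// .sys scope extends ordering across the NVLink/PCIe domain.
__device__ __forceinline__ void st_relaxed_gpu(int* ptr, int value) {
    asm volatile("st.relaxed.gpu.global.s32 [%0], %1;"
                 :: "l"(ptr), "r"(value) : "memory");
}

__device__ __forceinline__ void st_relaxed_sys(int* ptr, int value) {
    asm volatile("st.relaxed.sys.global.s32 [%0], %1;"
                 :: "l"(ptr), "r"(value) : "memory");
}
```

Note that, per the linked PTX documentation, system-scope atomicity over PCIe has additional limitations, which is why the comment above recommends a warning in the code comments.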

@cywork121
Copy link
Contributor Author

cywork121 commented May 23, 2025

> The only suggestion is about overlapping: NVLink uses memory semantics, so kernels won't exit until the transmission is finished.

When we use NVLink load/store, it does indeed increase the kernel execution time; however, in our tests on 4 nodes (32 EP), we found that the increase in kernel execution time is actually quite small, approximately 1 us to 3 us. Notably, in the 2-node (16 EP) H800 test scenario, the latency is slightly higher because the NVLink bandwidth is insufficient, which makes the kernel execution time longer (the rate at which load/store instructions are issued exceeds what NVLink can transfer).

@LyricZhao
Copy link
Collaborator

Thanks for the explanation, you are right: if the NVLink absolute traffic / NVLink bandwidth < ~10 us, the kernel execution time won't be longer 👍🏻
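The rule of thumb above can be sketched as a quick calculation (the ~10 us threshold is from this comment; the function name and unit conversion are illustrative):

```python
def nvlink_transfer_hidden(traffic_bytes: float, bandwidth_gb_s: float,
                           threshold_us: float = 10.0) -> bool:
    """Rough break-even check from the discussion: if the NVLink transfer
    time (traffic / bandwidth) stays under ~10 us, the extra load/store
    work is hidden and kernel execution time does not grow."""
    # 1 GB/s = 1e9 bytes/s = 1e3 bytes/us
    transfer_us = traffic_bytes / (bandwidth_gb_s * 1e3)
    return transfer_us < threshold_us


# e.g. 1 MB at 200 GB/s takes ~5 us, so it stays hidden
print(nvlink_transfer_hidden(1_000_000, 200.0))
```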

@cywork121 cywork121 deleted the DeepEP/ll_nvlink_load_store branch May 29, 2025 02:49
@yhyang201
Copy link

Would it be possible for the repository maintainer to include the experiment results in the README? The use of NVLink P2P copy for intra-node data transfer in low-latency scenarios is a valuable feature, but many users may not be aware that it's now supported.

@LyricZhao
Copy link
Collaborator

@yhyang201 Already updated in de8cfca. Thanks for the advice!
