Feature: LL nvlink p2p #173
Conversation
Thanks so much for this. The SGLang/vLLM team strongly needs it, since EP settings in the community are usually not that large. The only suggestion is about overlapping: NVLink uses memory semantics, so kernels won't exit until the transmission is finished. How about adding a flag …
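A minimal sketch of what such a toggle could look like, assuming an environment-variable gate. The variable name `DEEPEP_LL_NVLINK_P2P` and the helper function are hypothetical illustrations, not the actual flag discussed in this PR:

```python
import os

def use_nvlink_p2p() -> bool:
    """Return True when the NVLink P2P path should be used.

    Hypothetical gate: defaults to on, but lets users fall back to the
    pure-RDMA path if the extra kernel time from NVLink load/store
    hurts their compute/communication overlap.
    """
    return os.getenv("DEEPEP_LL_NVLINK_P2P", "1") == "1"
```

A caller would check `use_nvlink_p2p()` once at dispatch setup and select the transport accordingly.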
No worries, I will refactor this for you. Merging it first.
I've added another PR #174 with some refactoring and bug fixes. The PCIe domain may not support some atomic operations. Although we don't use such atomic operations, we don't have clusters to test this, so we also added a warning in the comments. See https://docs.nvidia.com/cuda/parallel-thread-execution/#limitations-system-scope-atomicity and https://nvidia.github.io/cccl/libcudacxx/extended_api/memory_model.html#atomicity.
When we use NVLink load/store, it does indeed increase kernel execution time; however, in our tests on 4 nodes (32 EP), we found the increase is actually quite small, approximately 1 us to 3 us. Notably, in the 2-node (16 EP) H800 test scenario, the latency is slightly higher because NVLink bandwidth is insufficient, which makes the kernel execution time longer (load/store instructions can be issued faster than NVLink can transfer the data).
Thanks for the explanation, you are right: if the …
Would it be possible for the repository maintainer to include the experiment results in the README? The use of NVLink P2P copy for intra-node data transfer in low-latency scenarios is a valuable feature, but many users may not be aware that it's now supported. |
@yhyang201 Already updated in de8cfca. Thanks for the advice!
This PR introduces the use of NVLink P2P copy for intra-node data transfer in low-latency scenarios. For smaller-scale decode instances, we found this approach improves overall decoding throughput. Specifically, in our inference service, 4-node decode efficiency improves by 27%.
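The routing idea can be sketched as follows. The function name and `ranks_per_node=8` (typical for 8-GPU H20/H800 nodes, consistent with the 4-node/32-EP figures above) are assumptions for illustration, not the PR's actual code:

```python
def transport_for(src_rank: int, dst_rank: int, ranks_per_node: int = 8) -> str:
    """Pick the transport for one src->dst transfer in low-latency mode.

    Intra-node pairs use the new NVLink P2P copy path; cross-node
    traffic still goes over RDMA, as it did before this PR.
    """
    same_node = src_rank // ranks_per_node == dst_rank // ranks_per_node
    return "nvlink_p2p" if same_node else "rdma"
```

For example, with 8 ranks per node, ranks 0 and 7 share a node and use NVLink, while ranks 0 and 8 cross nodes and use RDMA.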
The following test data was obtained using H20: