Description
I ran test_internode.py in a dual-port configuration with the environment variable NVSHMEM_IBGDA_ENABLE_MULTI_PORT set to 1. DeepEP crashed while creating the RDMA team through the call to nvshmem_team_split_strided(...) in the following code snippet (runtime.cu:init(...)):
    if (low_latency_mode and num_ranks > NUM_MAX_NVL_PEERS) {
        EP_HOST_ASSERT(cpu_rdma_team == NVSHMEM_TEAM_INVALID);
        EP_HOST_ASSERT(num_ranks % NUM_MAX_NVL_PEERS == 0);
        EP_HOST_ASSERT(nvshmem_team_split_strided(NVSHMEM_TEAM_WORLD, rank % NUM_MAX_NVL_PEERS, NUM_MAX_NVL_PEERS,
                                                  num_ranks / NUM_MAX_NVL_PEERS, &cpu_rdma_team_config, 0, &cpu_rdma_team) == 0);
        EP_HOST_ASSERT(cpu_rdma_team != NVSHMEM_TEAM_INVALID);
    }
It printed the following error message:
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
[/workspace/nvshmem/src/host/team/team_internal.cpp:1421] cuda failed with an illegal memory access was encountered
Any suggestions would be highly appreciated. Many thanks.