
[Bug]: NCCL issues when running vllm v0.9.1 for the Deepseek-R1 model [B200 GPU] #19865

Open
@haic0


Your current environment

Machine: NVIDIA B200 GPU
Docker image: vllm/vllm-openai:v0.9.1
Model: deepseek-ai/DeepSeek-R1
CUDA: 12.8
Driver Version: 570.133.20
Command: VLLM_USE_V1=1 vllm serve /models/DeepSeek-R1 --tensor-parallel-size 8 --disable-log-requests --trust-remote-code
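
For reference, a minimal PyTorch-only check (not part of the failing run; added only as a sketch) to confirm all GPUs are visible inside the container and to report their free memory before launching the server:

import torch

def main():
    count = torch.cuda.device_count()
    print(f"visible CUDA devices: {count}")
    for i in range(count):
        free, total = torch.cuda.mem_get_info(i)  # (free, total) in bytes
        name = torch.cuda.get_device_name(i)
        print(f"GPU {i} ({name}): {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")

if __name__ == "__main__":
    main()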

🐛 Describe the bug

While running online serving with:
VLLM_USE_V1=1 vllm serve /models/DeepSeek-R1 --tensor-parallel-size 8 --disable-log-requests --trust-remote-code

the following NCCL error occurs. The same failure reproduces with NCCL debugging enabled:

NCCL_DEBUG=INFO VLLM_USE_V1=1 vllm serve /models/DeepSeek-R1 --tensor-parallel-size 8 --disable-log-requests --trust-remote-code

Errors:

(VllmWorker rank=3 pid=15164) ERROR 06-19 07:06:13 [multiproc_executor.py:527] torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
(VllmWorker rank=3 pid=15164) ERROR 06-19 07:06:13 [multiproc_executor.py:527] ncclUnhandledCudaError: Call to CUDA function failed.
(VllmWorker rank=3 pid=15164) ERROR 06-19 07:06:13 [multiproc_executor.py:527] Last error:
(VllmWorker rank=3 pid=15164) ERROR 06-19 07:06:13 [multiproc_executor.py:527] Cuda failure 2 'out of memory'
(VllmWorker rank=3 pid=15164) ERROR 06-19 07:06:13 [multiproc_executor.py:527]
ERROR 06-19 07:06:13 [core.py:515] EngineCore failed to start.
ERROR 06-19 07:06:13 [core.py:515] Traceback (most recent call last):
ERROR 06-19 07:06:13 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-19 07:06:13 [core.py:515] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-19 07:06:13 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 07:06:13 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 390, in init
ERROR 06-19 07:06:13 [core.py:515] super().init(vllm_config, executor_class, log_stats,
ERROR 06-19 07:06:13 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in init
ERROR 06-19 07:06:13 [core.py:515] self._initialize_kv_caches(vllm_config)
ERROR 06-19 07:06:13 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
ERROR 06-19 07:06:13 [core.py:515] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-19 07:06:13 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 07:06:13 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-19 07:06:13 [core.py:515] output = self.collective_rpc("determine_available_memory")
ERROR 06-19 07:06:13 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 07:06:13 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
ERROR 06-19 07:06:13 [core.py:515] result = get_response(w, dequeue_timeout)
ERROR 06-19 07:06:13 [core.py:515] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-19 07:06:13 [core.py:515] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
ERROR 06-19 07:06:13 [core.py:515] raise RuntimeError(
ERROR 06-19 07:06:13 [core.py:515] RuntimeError: Worker failed with error 'NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ERROR 06-19 07:06:13 [core.py:515] ncclUnhandledCudaError: Call to CUDA function failed.
ERROR 06-19 07:06:13 [core.py:515] Last error:
ERROR 06-19 07:06:13 [core.py:515] Cuda failure 2 'out of memory'', please check the stack trace above for the root cause
ERROR 06-19 07:06:15 [multiproc_executor.py:140] Worker proc VllmWorker-4 died unexpectedly, shutting down executor.
Process EngineCore_0:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 519, in run_engine_core
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 506, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 390, in init
super().init(vllm_config, executor_class, log_stats,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in init
self._initialize_kv_caches(vllm_config)
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
available_gpu_memory = self.model_executor.determine_available_memory()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 220, in collective_rpc
result = get_response(w, dequeue_timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
raise RuntimeError(
RuntimeError: Worker failed with error 'NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:77, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 2 'out of memory'', please check the stack trace above for the root cause
Traceback (most recent call last):
File "/usr/local/bin/vllm", line 10, in
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 59, in main
args.dispatch_function(args)
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 58, in cmd
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1323, in run_server
await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1343, in run_server_worker
async with build_async_engine_client(args, client_config) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 155, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 191, in build_async_engine_client_from_engine_args
async_llm = AsyncLLM.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 124, in init
self.engine_core = EngineCoreClient.make_async_mp_client(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 93, in make_async_mp_client
return AsyncMPClient(vllm_config, executor_class, log_stats,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 716, in init
super().init(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 422, in init
self._init_engines_direct(vllm_config, local_only,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 491, in _init_engines_direct
self._wait_for_engine_startup(handshake_socket, input_address,
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 511, in _wait_for_engine_startup
wait_for_engine_startup(
File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 494, in wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
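
The trace shows the failure happens while the workers profile available memory (determine_available_memory), and the underlying error is a CUDA 'out of memory' reported by NCCL (Cuda failure 2). To check whether NCCL itself can initialize across all eight GPUs independently of vLLM, a minimal torch.distributed all_reduce test can be run in the same container; this is only a sketch, and the file name nccl_check.py is arbitrary:

# Launch with: torchrun --nproc_per_node=8 nccl_check.py
# If this also fails with an unhandled CUDA error, the problem is likely in the
# driver / NCCL / container setup rather than in vLLM itself.
import os

import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # A single all_reduce over a tiny tensor forces NCCL communicator creation
    # and a first collective, similar to what happens during vLLM engine startup.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: all_reduce ok, value={x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()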

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
