
Add SM120 to the Dockerfile #19794


Open
mgoin wants to merge 6 commits into main from neuralmagic/add-sm120-dockerfile

Conversation

mgoin (Member) commented Jun 18, 2025

Now that #19336 has landed, maybe we can add SM 12.0 without going over the 400MB wheel limit

EDIT: The wheel is 365MB!
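
For reference, the change boils down to appending 12.0 to the CUDA architecture list used for the wheel build in docker/Dockerfile. A rough sketch of the diff (not the literal patch; the old default list is taken from the discussion below and may carry a +PTX suffix):

--- a/docker/Dockerfile
+++ b/docker/Dockerfile
-ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0'
+ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'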


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist bot (Contributor) left a comment:

Summary of Changes

Hello @mgoin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request aims to expand the Docker image's compatibility with newer NVIDIA GPU architectures by adding support for SM 12.0. This change is crucial for leveraging the latest hardware capabilities within the Dockerized environment, building upon a previous effort to manage wheel size limits.

Highlights

  • CUDA Architecture Support: I've updated the torch_cuda_arch_list build argument in the Dockerfile to include 12.0. This enables PyTorch builds within the Docker image to support NVIDIA GPUs with SM (Streaming Multiprocessor) capability 12.0, which corresponds to newer GPU architectures.
  • FlashInfer Compilation: I've also extended the TORCH_CUDA_ARCH_LIST environment variable to include 12.0 specifically for the flashinfer library's compilation path. This ensures that flashinfer is built with support for SM 12.0 when not using pre-built wheels for CUDA 12.8.
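
To make the second bullet concrete, the flashinfer install step roughly has the following shape; only the CUDA 12.8 wheel URL is verbatim (it appears in the diff hunk quoted later in this thread), the rest is a paraphrase rather than the exact Dockerfile contents:

# Sketch of the flashinfer step in docker/Dockerfile
if [[ "$CUDA_VERSION" == 12.8* ]]; then
    # CUDA 12.8 images install a prebuilt wheel, so SM 12.0 support there
    # depends on what that wheel itself was compiled for (see the review below).
    uv pip install --system https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl
else
    # Other CUDA versions compile flashinfer from source; this PR extends the
    # arch list used for that compilation. Exact build command elided.
    export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0a 10.0a 12.0'
fi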

mergify bot added the ci/build label on Jun 18, 2025
gemini-code-assist bot (Contributor) left a comment:

Code Review

This PR updates the Dockerfile to include CUDA Compute Capability SM 12.0 in the TORCH_CUDA_ARCH_LIST. This affects the main vLLM wheel build process and the compilation of flashinfer from source for non-default CUDA versions (i.e., when CUDA_VERSION is not 12.8.*).

The changes themselves appear correct for enabling SM 12.0 in these compilation steps.

My main feedback point, detailed in a specific comment, is a request for clarification regarding flashinfer's SM 12.0 support when using the default CUDA_VERSION=12.8.1. For this version, a pre-built flashinfer wheel is used, and it's important to understand if this pre-built wheel also includes SM 12.0 capabilities.

Additionally, for documentation consistency within the Dockerfile, the example TORCH_CUDA_ARCH_LIST for flashinfer in the comment block on lines 246-255 (specifically line 247, which currently shows '7.5 8.0 8.9 9.0a 10.0a') could be updated to include 12.0. This would help future maintainers by reflecting the architectures now typically compiled for flashinfer due to this PR's changes. Since this comment block is outside the diff, this is a suggestion for general consideration.
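
As a side note, one way to answer the prebuilt-wheel question is to inspect which cubins are embedded in the wheel's shared objects; the .so glob below is an assumption about the wheel layout:

# Download the prebuilt wheel pinned in the Dockerfile and list its embedded SM archs.
wget 'https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl'
unzip -o flashinfer_python-*.whl -d flashinfer_whl
# cuobjdump --list-elf prints one line per embedded cubin, with the arch in the
# name; no sm_120 entries would mean no SM 12.0 kernels in the prebuilt wheel.
find flashinfer_whl -name '*.so' -exec cuobjdump --list-elf {} \; | grep -o 'sm_[0-9]*' | sort -u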

houseroad (Collaborator) commented:

What's the new wheel size? :-)

mgoin (Member, Author) commented Jun 18, 2025

The wheel is 365MB!

cyril23 commented Jun 18, 2025

> The wheel is 365MB!

Sounds awesome! I'll try to confirm. Currently building the whole thing on my desktop, it'll take a while:

~/vllm$ git status
On branch neuralmagic-add-sm120-dockerfile
Your branch is up to date with 'neuralmagic/add-sm120-dockerfile'.
nothing to commit, working tree clean

~/vllm$ git log -1
commit f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1 (HEAD -> neuralmagic-add-sm120-dockerfile, neuralmagic/add-sm120-dockerfile)
Author: Michael Goin <[email protected]>
Date:   Thu Jun 19 01:21:08 2025 +0900

    Update Dockerfile

~/vllm$ DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

# edit:   --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' is not needed anymore of course
# then extract the wheel from the build stage, check size, and build image via target vllm-openai
  • confirm the new wheel size of 365 MB. ❌ edit: nope, the new wheel size is 832.61 MiB when building for the new default arch list (same as --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0'), see this comment. ✅ later edit: confirmed after all (the larger number came from a non-Release build), see Add SM120 to the Dockerfile #19794 (comment)

  • ✅ confirm SM 120 compatibility (for FlashInfer, too)
    edit: probably needs huydhn's rebuilt wheel for the new arch list. edit: Yes, otherwise I get the error

    RuntimeError: TopKMaskLogits failed with error code no kernel image is available for execution on the device

    edit: tested on RTX 5090, it works now with the new FlashInfer wheel (a quick diagnostic sketch follows below)
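
A quick first diagnostic for this class of "no kernel image is available" errors, run inside the built container (it only covers PyTorch's own kernels, not FlashInfer's, so it is a hint rather than a full check):

# Print the GPU's compute capability ((12, 0) on SM 120 cards) and the SM targets
# the installed PyTorch build ships kernels for.
python3 -c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"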

@@ -261,7 +261,7 @@ if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
if [[ "$CUDA_VERSION" == 12.8* ]]; then \
uv pip install --system https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl; \
huydhn (Contributor) commented on this hunk:

It looks like I will need to rebuild this wheel with the new TORCH_CUDA_ARCH_LIST. I could just do it on my end, I think (nothing will break).

mgoin (Member, Author) replied:

I would appreciate it!

huydhn (Contributor) replied on Jun 19, 2025:

It's done. The new wheel built with 12.0 has been uploaded.

Reply:

> It's done. The new wheel built with 12.0 has been uploaded

I've tested it on RTX 5090, it works!

cyril23 commented Jun 19, 2025

> The wheel is 365MB!

Do you mean for SM 120 (torch_cuda_arch_list='12.0') only? What have you tested exactly?

  • I am sorry, but the wheel size for ARG torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' (your new default) is 832.61 MiB, which is still too big. I've built it based on your branch, with default settings, see Add SM120 to the Dockerfile #19794 (comment)
  • Output
#23 DONE 23847.4s

#24 [build 7/8] COPY .buildkite/check-wheel-size.py check-wheel-size.py
#24 DONE 0.0s

#25 [build 8/8] RUN if [ "true" = "true" ]; then         python3 check-wheel-size.py dist;     else         echo "Skipping wheel size check.";     fi
#25 0.274 Not allowed: Wheel dist/vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl is larger (832.61 MB) than the limit (400 MB).
#25 0.274 vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so: 1882.29 MBs uncompressed.
#25 0.274 vllm/_C.abi3.so: 752.47 MBs uncompressed.
#25 0.274 vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so: 216.57 MBs uncompressed.
#25 0.274 vllm/_moe_C.abi3.so: 164.88 MBs uncompressed.
#25 0.274 vllm/_flashmla_C.abi3.so: 4.89 MBs uncompressed.
#25 0.274 vllm/third_party/pynvml.py: 0.22 MBs uncompressed.
#25 0.274 vllm/config.py: 0.20 MBs uncompressed.
#25 0.274 vllm-0.9.2.dev139+gf3bddb6d6.dist-info/RECORD: 0.14 MBs uncompressed.
#25 0.274 vllm/distributed/kv_transfer/disagg_prefill_workflow.jpg: 0.14 MBs uncompressed.
#25 0.274 vllm/v1/worker/gpu_model_runner.py: 0.10 MBs uncompressed.
#25 ERROR: process "/bin/sh -c if [ \"$RUN_WHEEL_CHECK\" = \"true\" ]; then         python3 check-wheel-size.py dist;     else         echo \"Skipping wheel size check.\";     fi" did not complete successfully: exit code: 1
------
 > [build 8/8] RUN if [ "true" = "true" ]; then         python3 check-wheel-size.py dist;     else         echo "Skipping wheel size check.";     fi:
0.274 vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so: 1882.29 MBs uncompressed.
0.274 vllm/_C.abi3.so: 752.47 MBs uncompressed.
0.274 vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so: 216.57 MBs uncompressed.
0.274 vllm/_moe_C.abi3.so: 164.88 MBs uncompressed.
0.274 vllm/_flashmla_C.abi3.so: 4.89 MBs uncompressed.
0.274 vllm/third_party/pynvml.py: 0.22 MBs uncompressed.
0.274 vllm/config.py: 0.20 MBs uncompressed.
0.274 vllm-0.9.2.dev139+gf3bddb6d6.dist-info/RECORD: 0.14 MBs uncompressed.
0.274 vllm/distributed/kv_transfer/disagg_prefill_workflow.jpg: 0.14 MBs uncompressed.
0.274 vllm/v1/worker/gpu_model_runner.py: 0.10 MBs uncompressed.
------
Dockerfile:155
--------------------
 154 |     ARG RUN_WHEEL_CHECK=true
 155 | >>> RUN if [ "$RUN_WHEEL_CHECK" = "true" ]; then \
 156 | >>>         python3 check-wheel-size.py dist; \
 157 | >>>     else \
 158 | >>>         echo "Skipping wheel size check."; \
 159 | >>>     fi
 160 |     #################### EXTENSION Build IMAGE ####################
--------------------
ERROR: failed to solve: process "/bin/sh -c if [ \"$RUN_WHEEL_CHECK\" = \"true\" ]; then         python3 check-wheel-size.py dist;     else         echo \"Skipping wheel size check.\";     fi" did not complete successfully: exit code: 1
  • Running the command again, this time with the wheel-size check disabled so I can extract the wheel and finish the build image:
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg RUN_WHEEL_CHECK=false \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .
sudo docker create --name temp-wheel-container wurstdeploy/vllm:wheel-stage
sudo docker cp temp-wheel-container:/workspace/dist ./extracted-wheels
sudo docker rm temp-wheel-container
ls -la extracted-wheels/
# output:
total 852604
drwxr-xr-x  2 root     root          4096 Jun 19 08:35 .
drwxr-xr-x 16 freeuser freeuser      4096 Jun 19 08:50 ..
-rw-r--r--  1 root     root     873053002 Jun 19 08:36 vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl
# That's 873 MB or 832.61 MiB

Unfortunately, we still can't update the Dockerfile defaults to include SM120 without touching anything else, because the new list would also be applied when building the CUDA 12.8 wheel here, and PyPI's current limit of 400 MB is too low (even increasing it to 800 MB would not be enough).
How could we solve this problem?

  1. Either we keep your changes to the main Dockerfile as in this PR but build for specific architectures within the "Build wheel - CUDA 12.8" step here:
    1.1 Either by adding --build-arg torch_cuda_arch_list='12.0' (I haven't confirmed your 365MB yet when building 12.0 only) to make an SM120-only build, incompatible with all older architectures like SM 100 Blackwell and older.
    1.2 Or by adding the old default --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' (with or without PTX, it does not matter); the CUDA 12.8 wheel would still be incompatible with SM 120 Blackwell but would work for SM 100 Blackwell and all older generations, i.e. just like the current wheel (see the sketch at the end of this comment).
  2. Or we do not update the main Dockerfile but explicitly add something like --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.6 8.9 9.0 10.0 12.0' --build-arg RUN_WHEEL_CHECK=false to the Docker "Build release image" step here, which was the idea of my PR "buildkite release pipeline: add torch_cuda_arch_list including 12.0 to the Docker 'Build release image' build args in order to enable Blackwell SM120 support" #19747

I prefer solution 1.2. What do you guys think? @mgoin
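
For illustration, option 1.2 would amount to pinning the wheel-build step to the old architecture list while the Dockerfile default gains 12.0, roughly like this (the image tag is just an example; all build args already appear elsewhere in this thread):

# Hypothetical "Build wheel - CUDA 12.8" invocation with the old arch list so the
# published wheel stays under the PyPI size limit.
DOCKER_BUILDKIT=1 docker build \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0' \
  --build-arg USE_SCCACHE=1 \
  --tag vllm-wheel-cu128 \
  --target build \
  -f docker/Dockerfile .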

mgoin (Member, Author) commented Jun 19, 2025

Hey @cyril23, thanks for the concern, but the "build image" job in CI succeeds. This is the source of truth for wheel size and is now building for '7.0 7.5 8.0 8.9 9.0 10.0 12.0': https://buildkite.com/vllm/ci/builds/22282/summary/annotations?jid=019783d9-406d-409e-8a20-5313f098957a#019783d9-406d-409e-8a20-5313f098957a/6-4367

I think you aren't building the image the "right way" if you are getting such a large wheel size. Perhaps you are building with Debug information rather than a proper Release build like we use for CI and release?

cyril23 commented Jun 19, 2025

My wheels are bigger because I built with USE_SCCACHE=0 and thus without CMAKE_BUILD_TYPE=Release, so they include debug symbols etc.

> I think you aren't building the image the "right way" if you are getting such a large wheel size. Perhaps you are building with Debug information rather than a proper Release build like we use for CI and release?

I wish I had built it the wrong way, so we could just merge this PR. I built it as shown here, which gave me an 832.61 MiB wheel:

~/vllm$ DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

Now I've just tried building again for SM 120 only:

# on Azure Standard E96s v6 (96 vcpus, 768 GiB memory); actually used Max: 291289 MiB RAM
DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=384 \
  --build-arg nvcc_threads=4 \
  --build-arg USE_SCCACHE=0 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --build-arg torch_cuda_arch_list='12.0' \
  --tag wurstdeploy/vllm:wheel-stage-120only \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

Result:

#24 [build 8/8] RUN if [ "true" = "true" ]; then         python3 check-wheel-size.py dist;     else         echo "Skipping wheel size check.";     fi
#24 0.251 Not allowed: Wheel dist/vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl is larger (558.31 MB) than the limit (400 MB).
#24 0.251 vllm/vllm_flash_attn/_vllm_fa3_C.abi3.so: 1504.86 MBs uncompressed.
#24 0.251 vllm/_C.abi3.so: 297.77 MBs uncompressed.
#24 0.251 vllm/vllm_flash_attn/_vllm_fa2_C.abi3.so: 216.57 MBs uncompressed.
#24 0.251 vllm/_moe_C.abi3.so: 95.23 MBs uncompressed.
#24 0.251 vllm/third_party/pynvml.py: 0.22 MBs uncompressed.
#24 0.251 vllm/config.py: 0.20 MBs uncompressed.
#24 0.251 vllm-0.9.2.dev139+gf3bddb6d6.dist-info/RECORD: 0.14 MBs uncompressed.
#24 0.251 vllm/distributed/kv_transfer/disagg_prefill_workflow.jpg: 0.14 MBs uncompressed.
#24 0.251 vllm/v1/worker/gpu_model_runner.py: 0.10 MBs uncompressed.
#24 0.251 vllm/worker/hpu_model_runner.py: 0.10 MBs uncompressed.
#24 ERROR: process "/bin/sh -c if [ \"$RUN_WHEEL_CHECK\" = \"true\" ]; then         python3 check-wheel-size.py dist;     else         echo \"Skipping wheel size check.\";     fi" did not complete successfully: exit code: 1

After extracting the wheels:

azureuser@building:~/vllm$ ls -la extracted-wheels/
total 571720
drwxr-xr-x  2 root      root           4096 Jun 19 08:09 .
drwxrwxr-x 16 azureuser azureuser      4096 Jun 19 08:14 ..
-rw-r--r--  1 root      root      585426919 Jun 19 08:10 vllm-0.9.2.dev139+gf3bddb6d6-cp38-abi3-linux_x86_64.whl
# that's 558 MB or 558.31 MiB

I am not sure what https://buildkite.com/vllm/ci/builds/22282/summary/annotations?jid=019783d9-406d-409e-8a20-5313f098957a#019783d9-406d-409e-8a20-5313f098957a/6-4367 did differently; they build the "test" target.

Anyway, as long as it works on Buildkite I am happy! Would love to understand the differences, though.

edit: this is what buildkite did:

aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7
#!/bin/bash
if [[ -z $(docker manifest inspect public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1) ]]; then
echo "Image not found, proceeding with build..."
else
echo "Image found"
exit 0
fi

docker build --file docker/Dockerfile --build-arg max_jobs=16 --build-arg buildkite_commit=f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1 --build-arg USE_SCCACHE=1 --tag public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1 --target test --progress plain .
docker push public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:f3bddb6d6ef7539a59654ef5a3834e4c6f456cf1

edit: the differences:

  • I use a different max_jobs value, which shouldn't affect the wheel size
  • I used USE_SCCACHE=0 instead of 1 as in Buildkite. Can this affect the wheel size? YES, thanks Gemini:

The difference in wheel size between your local build and the Buildkite build is most likely due to the USE_SCCACHE build argument and its effect on the build type.

Here's a breakdown of why this is happening:

The Root Cause
In the docker/Dockerfile, the USE_SCCACHE argument controls which build path is taken. When USE_SCCACHE is set to 1 (as it is in the Buildkite CI), the build command also sets CMAKE_BUILD_TYPE=Release:

# docker/Dockerfile

...
RUN --mount=type=bind,source=.git,target=.git \
    if [ "$USE_SCCACHE" = "1" ]; then \
        echo "Installing sccache..." \
...
        && export CMAKE_BUILD_TYPE=Release \
        && sccache --show-stats \
        && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
...
    fi
...

However, when USE_SCCACHE is not 1 (you are setting it to 0), the other build path is taken, and CMAKE_BUILD_TYPE is not set:

# docker/Dockerfile

...
ENV CCACHE_DIR=/root/.cache/ccache
RUN --mount=type=cache,target=/root/.cache/ccache \
    --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=.git,target=.git  \
    if [ "$USE_SCCACHE" != "1" ]; then \
        # Clean any existing CMake artifacts
        rm -rf .deps && \
        mkdir -p .deps && \
        python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
    fi
...

When CMAKE_BUILD_TYPE is not explicitly set, the build falls back to a non-Release configuration that keeps debugging symbols and is not optimized for size. This is why your locally built wheel is so much larger; the huge size of the .so files in your output is a strong indicator of this.

You didn't accidentally set any debug flags; you accidentally missed setting the release flag!

How to Fix It
To fix this, you have two options:

  1. Set USE_SCCACHE in your build command:

The easiest solution is to mimic the CI environment by setting --build-arg USE_SCCACHE=1 in your docker build command. This will ensure that CMAKE_BUILD_TYPE=Release is set.

DOCKER_BUILDKIT=1 sudo docker build \
  --build-arg max_jobs=5 \
  --build-arg USE_SCCACHE=1 \
  --build-arg GIT_REPO_CHECK=1 \
  --build-arg CUDA_VERSION=12.8.1 \
  --tag wurstdeploy/vllm:wheel-stage \
  --target build \
  --progress plain \
  -f docker/Dockerfile .

  2. Modify the Dockerfile:

If you prefer to build without sccache locally, you can modify the Dockerfile to set the CMAKE_BUILD_TYPE for both build paths. This would make local builds more consistent with CI builds, regardless of the USE_SCCACHE setting.

Here is a diff of the proposed change:

--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -141,6 +141,7 @@
     if [ "$USE_SCCACHE" != "1" ]; then \
         # Clean any existing CMake artifacts
         rm -rf .deps && \
+        export CMAKE_BUILD_TYPE=Release && \
         mkdir -p .deps && \
         python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
     fi

By making one of these changes, you should see your wheel size decrease significantly and fall within the acceptable range.
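
A quick way to confirm that debug sections are what inflate the .so files (the path comes from the size report above; this is just a sanity check, not a required step):

# Compare one build artifact before and after stripping its debug sections.
ls -lh vllm/_C.abi3.so                  # ~750 MB in the non-Release local build
cp vllm/_C.abi3.so /tmp/_C.stripped.so
strip --strip-debug /tmp/_C.stripped.so
ls -lh /tmp/_C.stripped.so              # far smaller once debug info is removed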

edit: I'll propose a new ARG CMAKE_BUILD_TYPE=Release build argument in a separate issue to allow for creating a Release type build even without using SCCACHE.

cyril23 commented Jun 19, 2025

> The wheel is 365MB!

Now I've verified that using CMAKE_BUILD_TYPE=Release with the default arches indeed results in a 382 MB file (365.10 MiB), i.e. exactly as in the Buildkite run https://buildkite.com/vllm/ci/builds/22282/summary/annotations?jid=019783d9-406d-409e-8a20-5313f098957a#019783d9-406d-409e-8a20-5313f098957a/6-4367

azureuser@building:~/vllm$ ls -la extracted-wheels/
total 373876
drwxr-xr-x  2 root      root           4096 Jun 19 09:20 .
drwxrwxr-x 16 azureuser azureuser      4096 Jun 19 09:22 ..
-rw-r--r--  1 root      root      382836018 Jun 19 09:21 vllm-0.9.2.dev139+gf3bddb6d6.d20250619-cp38-abi3-linux_x86_64.whl
azureuser@building:~/vllm$

By the way, I've further tested that using CMAKE_BUILD_TYPE=Release for SM 120 only (--build-arg torch_cuda_arch_list='12.0') results in a small 167 MB (159.34 MiB) wheel.

azureuser@building:~/vllm/extracted-wheels$ ls -la
total 163172
drwxr-xr-x  2 root      root           4096 Jun 19 09:00 .
drwxrwxr-x 16 azureuser azureuser      4096 Jun 19 09:02 ..
-rw-r--r--  1 root      root      167077290 Jun 19 09:01 vllm-0.9.2.dev139+gf3bddb6d6.d20250619-cp38-abi3-linux_x86_64.whl
azureuser@building:~/vllm/extracted-wheels$

In order to test it without using sccache, I've modified my Dockerfile as follows (I'll open an issue about it):

diff --git a/docker/Dockerfile b/docker/Dockerfile
index 8d4375470..ae866edd0 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -112,6 +112,7 @@ ENV MAX_JOBS=${max_jobs}
 ARG nvcc_threads=8
 ENV NVCC_THREADS=$nvcc_threads

+ARG CMAKE_BUILD_TYPE=Release
 ARG USE_SCCACHE
 ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
 ARG SCCACHE_REGION_NAME=us-west-2
@@ -129,7 +130,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
         && export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
         && export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
         && export SCCACHE_IDLE_TIMEOUT=0 \
-        && export CMAKE_BUILD_TYPE=Release \
+        && export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} \
         && sccache --show-stats \
         && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
         && sccache --show-stats; \
@@ -143,6 +144,7 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
         # Clean any existing CMake artifacts
         rm -rf .deps && \
         mkdir -p .deps && \
+        export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} && \
         python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
     fi

So let's merge! 👍

cyril23 commented Jun 20, 2025

With the new FlashInfer wheel, I've tried it out on an RTX 5090 (but just built it using torch_cuda_arch_list='12.0' and CMAKE_BUILD_TYPE=Release) and inference works without a problem.

edit: by the way, the wheel size is pretty much the same as with the old FlashInfer version (compared to #19794 (comment))

~/vllm$ ls -la extracted-wheels/
total 163192
drwxr-xr-x  2 root     root          4096 Jun 20 10:10 .
drwxr-xr-x 16 freeuser freeuser      4096 Jun 20 10:16 ..
-rw-r--r--  1 root     root     167097574 Jun 20 10:10 vllm-0.9.2.dev182+g47c454049.d20250620-cp38-abi3-linux_x86_64.whl
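
For anyone wanting to repeat this kind of check, a minimal smoke test could look roughly like the following; the image tag and model are examples, and it assumes the vllm-openai image exposes the OpenAI-compatible server on port 8000:

# Start the server on the SM 120 GPU (image tag and model are placeholders).
docker run --rm --gpus all -p 8000:8000 \
  wurstdeploy/vllm:openai-sm120 \
  --model Qwen/Qwen2.5-0.5B-Instruct &

# Once the server is up, list the served models and run a tiny completion.
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "Qwen/Qwen2.5-0.5B-Instruct", "prompt": "Hello", "max_tokens": 8}'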

Labels: ci/build, ready (ONLY add when PR is ready to merge/full CI is needed)
Projects: None yet
6 participants