Add SM120 to the Dockerfile #19794
base: main
Conversation
Signed-off-by: mgoin <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Summary of Changes
Hello @mgoin, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request aims to expand the Docker image's compatibility with newer NVIDIA GPU architectures by adding support for SM 12.0. This change is crucial for leveraging the latest hardware capabilities within the Dockerized environment, building upon a previous effort to manage wheel size limits.
Highlights
- CUDA Architecture Support: I've updated the `torch_cuda_arch_list` build argument in the Dockerfile to include `12.0`. This enables PyTorch builds within the Docker image to support NVIDIA GPUs with SM (Streaming Multiprocessor) capability 12.0, which corresponds to newer GPU architectures.
- FlashInfer Compilation: I've also extended the `TORCH_CUDA_ARCH_LIST` environment variable to include `12.0` specifically for the `flashinfer` library's compilation path. This ensures that `flashinfer` is built with support for SM 12.0 when not using pre-built wheels for CUDA 12.8. (A hedged example build invocation follows this list.)
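A minimal sketch of how such a build might be invoked, assuming the upstream `docker/Dockerfile` and its `torch_cuda_arch_list` build argument; the exact architecture list and the image tag below are illustrative, not taken from this PR's diff:

```bash
# Illustrative only: pass an arch list that includes SM 12.0 to the image build.
# The specific values and the vllm-sm120 tag are assumptions for this sketch.
docker build -f docker/Dockerfile \
  --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' \
  -t vllm-sm120 .
```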
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for the GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This PR updates the Dockerfile to include CUDA Compute Capability SM 12.0 in the `TORCH_CUDA_ARCH_LIST`. This affects the main vLLM wheel build process and the compilation of `flashinfer` from source for non-default CUDA versions (i.e., when `CUDA_VERSION` is not `12.8.*`).

The changes themselves appear correct for enabling SM 12.0 in these compilation steps.

My main feedback point, detailed in a specific comment, is a request for clarification regarding `flashinfer`'s SM 12.0 support when using the default `CUDA_VERSION=12.8.1`. For this version, a pre-built `flashinfer` wheel is used, and it's important to understand whether that pre-built wheel also includes SM 12.0 capabilities.

Additionally, for documentation consistency within the Dockerfile, the example `TORCH_CUDA_ARCH_LIST` for `flashinfer` in the comment block on lines 246-255 (specifically line 247, which currently shows `'7.5 8.0 8.9 9.0a 10.0a'`) could be updated to include `12.0`. This would help future maintainers by reflecting the architectures now typically compiled for `flashinfer` as a result of this PR's changes. Since this comment block is outside the diff, this is a suggestion for general consideration.
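For concreteness, a hedged sketch of the export the source-build path described above would end up using, combining the example list quoted from line 247 with the suggested `12.0` addition (values illustrative, not copied from the actual Dockerfile):

```bash
# Sketch only: arch list for the flashinfer source build, extended with SM 12.0.
# The base list mirrors the example quoted in the review comment above.
export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0a 10.0a 12.0'
```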
What's the new wheel size? :-)
The wheel is 365MB!
Sounds awesome! I'll try to confirm. Currently building the whole thing on my desktop; it'll take a while.
```diff
@@ -261,7 +261,7 @@ if [ "$TARGETPLATFORM" != "linux/arm64" ]; then \
     if [[ "$CUDA_VERSION" == 12.8* ]]; then \
         uv pip install --system https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl; \
```
It looks like I will need to rebuild this wheel with the new `TORCH_CUDA_ARCH_LIST`. I could just do it on my end, I think (nothing will break).
I would appreciate it!
It's done. The new wheel built with 12.0 has been uploaded
> It's done. The new wheel built with 12.0 has been uploaded

I've tested it on an RTX 5090, and it works!
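A hedged way to double-check, from inside the image, that the installed PyTorch build actually carries SM 12.0 kernels (the `vllm-sm120` tag reuses the assumption from the earlier sketch):

```bash
# Expect sm_120 (or compute_120) in the arch list and (12, 0) as the device
# capability on an RTX 5090. The image tag is an assumption, not from this PR.
docker run --rm --gpus all vllm-sm120 \
  python3 -c "import torch; print(torch.cuda.get_arch_list()); print(torch.cuda.get_device_capability())"
```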
Do you mean for SM 120 (
Unfortunately we still can't update the defaults of the Dockerfile to include SM120 without touching anything else, because it would also be applied to building the CUDA 12.8 wheel here, and PyPI's current limit of 400 MB is too low (even increasing it to 800 MB would not be enough).
I prefer solution 1.2. What do you guys think? @mgoin
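A quick, hedged way to sanity-check a locally built wheel against that 400 MB limit, assuming the `dist/` output directory used by the `--dist-dir=dist` calls quoted later in this thread:

```bash
# Report the wheel size in MB; the PyPI per-file limit discussed above is 400 MB.
du -m dist/vllm-*.whl
```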
Hey @cyril23, thanks for the concern, but the "build image" job in CI succeeds. This is the source of truth for wheel size and is now building with SM 12.0 included. I think you aren't building the image the "right way" if you are getting such a large wheel size. Perhaps you are building with debug information rather than a proper Release build like we use for CI and release?
My wheels are bigger because I build with USE_SCCACHE=0 and thus without CMAKE_BUILD_TYPE=Release being set.
Now I've verified that using […]
By the way I've further tested that using […]
In order to test it without using SCCACHE I've modified my Dockerfile as follows (I'll make an issue about it):

```diff
diff --git a/docker/Dockerfile b/docker/Dockerfile
index 8d4375470..ae866edd0 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -112,6 +112,7 @@ ENV MAX_JOBS=${max_jobs}
 ARG nvcc_threads=8
 ENV NVCC_THREADS=$nvcc_threads
 
+ARG CMAKE_BUILD_TYPE=Release
 ARG USE_SCCACHE
 ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
 ARG SCCACHE_REGION_NAME=us-west-2
@@ -129,7 +130,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
     && export SCCACHE_REGION=${SCCACHE_REGION_NAME} \
     && export SCCACHE_S3_NO_CREDENTIALS=${SCCACHE_S3_NO_CREDENTIALS} \
     && export SCCACHE_IDLE_TIMEOUT=0 \
-    && export CMAKE_BUILD_TYPE=Release \
+    && export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} \
     && sccache --show-stats \
     && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
     && sccache --show-stats; \
@@ -143,6 +144,7 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
     # Clean any existing CMake artifacts
     rm -rf .deps && \
     mkdir -p .deps && \
+    export CMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} && \
     python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38; \
     fi
```

So let's merge! 👍
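For reference, a hedged sketch of how the patched Dockerfile above might be invoked so that a local, non-sccache build still produces a Release wheel comparable to CI (arch list and image tag are illustrative assumptions):

```bash
# Build without sccache but explicitly keep a Release build, using the
# CMAKE_BUILD_TYPE build-arg introduced by the patch above. Arch list and
# image tag are assumptions for this sketch.
docker build -f docker/Dockerfile \
  --build-arg USE_SCCACHE=0 \
  --build-arg CMAKE_BUILD_TYPE=Release \
  --build-arg torch_cuda_arch_list='7.0 7.5 8.0 8.9 9.0 10.0 12.0' \
  -t vllm-sm120 .
```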
With the new FlashInfer wheel, I've tried it out on an RTX 5090 (but just built it using […]). Edit: by the way, the wheel size is pretty much the same as with the old FlashInfer version (compared to #19794 (comment)).
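A hedged way to confirm the FlashInfer install inside such an image (the image tag reuses the earlier assumption; the distribution name matches the `flashinfer_python` wheel quoted above):

```bash
# Import flashinfer and print the installed distribution's version.
# The image tag and this particular check are assumptions, not from the PR.
docker run --rm --gpus all vllm-sm120 \
  python3 -c "import flashinfer, importlib.metadata as md; print(md.version('flashinfer-python'))"
```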
Now that #19336 has landed, maybe we can add SM 12.0 without going over the 400MB wheel limit
EDIT: The wheel is 365MB!