
add release note for vLLM 0.9.0 release #802


Merged · 5 commits merged into intel:main on Jul 24, 2025

Conversation

@yma11 (Contributor) commented on Jul 18, 2025

Description

Related Issue

Changes Made

  • The code follows the project's coding standards.
  • No Intel Internal IP is present within the changes.
  • The documentation has been updated to reflect any changes in functionality.

Validation

  • I have tested the changed container groups locally with test_runner.py, with all existing tests passing, and I have added new tests where applicable.

@@ -0,0 +1,198 @@
# Optimize LLM serving with vLLM on Intel® GPUs
@rogerxfeng8 commented on Jul 21, 2025

We don't update the historical release notes; we just need to add this 0.9.0-xpu.md.


vLLM is a fast and easy-to-use library for LLM inference and serving. It has evolved into a community-driven project with contributions from both academia and industry. Intel, as one of the community contributors, is working actively to bring satisfying performance with vLLM on Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, as well as Intel® Gaud® AI accelerators. This blog focuses on Intel® discrete GPUs at this time and brings you the necessary information to get the workloads running well on your Intel® graphics cards.


Gaud® -> Gaudi®
This blog -> This readme



The vLLM used in the latest docker image is based on [v0.9.0](https://github.com/vllm-project/vllm/tree/v0.9.0)


latest -> this

because it will no longer be 'latest' once new releases come out.
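For context, a minimal sketch of pulling and running this container; the image name/tag and the served model below are placeholders (the actual ones come from the 0.9.0-xpu.md release note this PR adds), and `--device /dev/dri` is the usual way to expose Intel GPUs to a container.

```bash
# Placeholder image -- substitute the name/tag published with 0.9.0-xpu.md.
IMAGE=intel/vllm:0.9.0-xpu

docker pull "${IMAGE}"

# Expose Intel GPU devices (/dev/dri) and start an OpenAI-compatible server
# on port 8000; the model name is only an example.
docker run --rm -it \
  --device /dev/dri \
  -p 8000:8000 \
  "${IMAGE}" \
  vllm serve meta-llama/Llama-3.1-8B-Instruct
```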


## Optimizations

* tensor parallel inference: Intel® oneAPI Collective Communications Library(oneCCL) is optimized to provide boosted performance in Intel® Arc™ B-Series graphics cards.


tensor -> Tensor

* GQA kernel optimization: An optimized version of Grouped-Query Attention(GQA) kernel is adopted and obvious perf improvement is observed in models like Qwen and Llama.


10-16% end-to-end throughput improvement for 8B/14B/32B fp8 workloads in 1024/512 input/output lengths.
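As a rough illustration of the tensor-parallel path discussed above (the model name and parallel size are arbitrary examples, not taken from the release note), the standard vLLM flags apply on Intel GPUs as well:

```bash
# Shard the model across two Intel Arc B-Series cards with tensor parallelism;
# oneCCL provides the inter-GPU collectives mentioned above.
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --tensor-parallel-size 2
```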


The table below lists models that have been verified by Intel. However, a broader set of models supported by vLLM should also work on Intel® GPUs.

| Model Type | Model (company/model name) | Dynamic Online FP8 |

@yma11 (Contributor Author)

Hi @rogerxfeng8, one open question here: the int4 models we tested are not public but were obtained from ccg. How can we add them here?


The following issues are known issues:

* Memory reservation increases in vLLM 0.9.0 and it may cause OOM to multi-modility models like Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2.5-VL-72B-Instruct and Qwen/Qwen2.5-VL-32B-Instruct. We need downgrade `gpu-memory-utilization` from default value `0.9` to `0.85`.


downgrade -> decrease

@yma11 force-pushed the 0.9.0 branch 2 times, most recently from 52e947b to 5561274 on July 23, 2025 02:11
Signed-off-by: yan <[email protected]>
Signed-off-by: yan <[email protected]>

* Multi Modility Support for Qwen2.5-VL Models


Modility -> Modality


The following issues are known issues:

* Memory reservation increases in vLLM 0.9.0 and it may cause OOM to multi-modility models like Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2.5-VL-72B-Instruct and Qwen/Qwen2.5-VL-32B-Instruct. We need decrease `gpu-memory-utilization` from default value `0.9` to `0.85`.


modility -> modality
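A quick sketch of the workaround described in this known issue, using one of the listed models; `--gpu-memory-utilization` is vLLM's standard CLI flag for the setting named above:

```bash
# Lower GPU memory utilization from the default 0.9 to 0.85 to avoid OOM
# with multi-modal models on vLLM 0.9.0.
vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
  --gpu-memory-utilization 0.85
```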


We also have some experimental features supported, including:

* **pipeline parallelism**: Works on on single node as only backend `mp` is supported for now.


Works on on -> Works on
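A minimal single-node sketch of the experimental pipeline-parallel mode; the model and pipeline size are illustrative, and `--distributed-executor-backend mp` selects the `mp` backend named in the bullet:

```bash
# Experimental: pipeline parallelism on a single node via the 'mp' backend.
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend mp
```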

We support both representations through ENV variable `VLLM_XPU_FP8_DTYPE` with default value `E5M2`.

:::{warning}
Currently, by default we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model. To avoid this, adding `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` can allow offloading weights to cpu before quantization and quantized weights will be kept in device.


cpu -> CPU
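Putting the two environment variables together, a hedged example of dynamic online FP8 serving; the model name and the `--quantization fp8` flag are illustrative assumptions on top of what the note states:

```bash
# Pick the FP8 representation (the default is E5M2 per the note above) and
# offload weights to CPU before quantization to reduce peak device memory.
export VLLM_XPU_FP8_DTYPE=E4M3
export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1

# Dynamic online FP8 quantization at model load time (illustrative).
vllm serve Qwen/Qwen2.5-14B-Instruct --quantization fp8
```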

Signed-off-by: yan <[email protected]>
@rogerxfeng8

@jitendra42 can you help merge the pr?

@jitendra42 jitendra42 merged commit 48cbbb4 into intel:main Jul 24, 2025
9 checks passed