
add release note for vLLM 0.9.0 release #802


Merged · 5 commits merged into intel:main on Jul 24, 2025

Conversation

@yma11 (Contributor) commented on Jul 18, 2025

Description

Related Issue

Changes Made

  • The code follows the project's coding standards.
  • No Intel Internal IP is present within the changes.
  • The documentation has been updated to reflect any changes in functionality.

Validation

  • I have tested the changed container groups locally with test_runner.py, with all existing tests passing, and I have added new tests where applicable.

@@ -0,0 +1,198 @@
# Optimize LLM serving with vLLM on Intel® GPUs
@rogerxfeng8 commented on Jul 21, 2025

We don't update the historical release notes; we just need to add this 0.9.0-xpu.md.


vLLM is a fast and easy-to-use library for LLM inference and serving. It has evolved into a community-driven project with contributions from both academia and industry. Intel, as one of the community contributors, is working actively to bring satisfying performance with vLLM on Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, as well as Intel® Gaud® AI accelerators. This blog focuses on Intel® discrete GPUs at this time and brings you the necessary information to get the workloads running well on your Intel® graphics cards.


Gaud® -> Gaudi®
This blog -> This readme



The vLLM used in the latest docker image is based on [v0.9.0](https://github.com/vllm-project/vllm/tree/v0.9.0)


latest -> this

because it will no longer be 'latest' once new releases come out.
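For context, a minimal sketch of pulling and running this container; the image name/tag and the served model below are placeholders (the actual ones come from the 0.9.0-xpu.md release note this PR adds), and `--device /dev/dri` is the usual way to expose Intel GPUs to a container.

```bash
# Placeholder image -- substitute the name/tag published with 0.9.0-xpu.md.
IMAGE=intel/vllm:0.9.0-xpu

docker pull "${IMAGE}"

# Expose Intel GPU devices (/dev/dri) and start an OpenAI-compatible server
# on port 8000; the model name is only an example.
docker run --rm -it \
  --device /dev/dri \
  -p 8000:8000 \
  "${IMAGE}" \
  vllm serve meta-llama/Llama-3.1-8B-Instruct
```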


## Optimizations

* tensor parallel inference: Intel® oneAPI Collective Communications Library(oneCCL) is optimized to provide boosted performance in Intel® Arc™ B-Series graphics cards.


tensor -> Tensor

* GQA kernel optimization: An optimized version of Grouped-Query Attention(GQA) kernel is adopted and obvious perf improvement is observed in models like Qwen and Llama.


10-16% end-to-end throughput improvement for 8B/14B/32B fp8 workloads in 1024/512 input/output lengths.
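As a rough illustration of the tensor-parallel path discussed above (the model name and parallel size are arbitrary examples, not taken from the release note), the standard vLLM flags apply on Intel GPUs as well:

```bash
# Shard the model across two Intel Arc B-Series cards with tensor parallelism;
# oneCCL provides the inter-GPU collectives mentioned above.
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --tensor-parallel-size 2
```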


The table below lists models that have been verified by Intel. However, a broader set of models supported by vLLM should also work on Intel® GPUs.

| Model Type | Model (company/model name) | Dynamic Online FP8 |

@yma11 (Contributor Author)

Hi @rogerxfeng8, one open question here: the int4 models we tested are not public but were obtained from ccg. How can we add them here?


The following issues are known issues:

* Memory reservation increases in vLLM 0.9.0 and it may cause OOM to multi-modility models like Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2.5-VL-72B-Instruct and Qwen/Qwen2.5-VL-32B-Instruct. We need downgrade `gpu-memory-utilization` from default value `0.9` to `0.85`.


downgrade -> decrease

@yma11 force-pushed the 0.9.0 branch 2 times, most recently from 52e947b to 5561274 on July 23, 2025 02:11
Signed-off-by: yan <[email protected]>
Signed-off-by: yan <[email protected]>

* Multi Modility Support for Qwen2.5-VL Models


Modility -> Modality


The following issues are known issues:

* Memory reservation increases in vLLM 0.9.0 and it may cause OOM to multi-modility models like Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2.5-VL-72B-Instruct and Qwen/Qwen2.5-VL-32B-Instruct. We need decrease `gpu-memory-utilization` from default value `0.9` to `0.85`.


modility -> modality
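A quick sketch of the workaround described in this known issue, using one of the listed models; `--gpu-memory-utilization` is vLLM's standard CLI flag for the setting named above:

```bash
# Lower GPU memory utilization from the default 0.9 to 0.85 to avoid OOM
# with multi-modal models on vLLM 0.9.0.
vllm serve Qwen/Qwen2.5-VL-32B-Instruct \
  --gpu-memory-utilization 0.85
```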


We also have some experimental features supported, including:

* **pipeline parallelism**: Works on on single node as only backend `mp` is supported for now.


Works on on -> Works on
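A minimal single-node sketch of the experimental pipeline-parallel mode; the model and pipeline size are illustrative, and `--distributed-executor-backend mp` selects the `mp` backend named in the bullet:

```bash
# Experimental: pipeline parallelism on a single node via the 'mp' backend.
vllm serve Qwen/Qwen2.5-14B-Instruct \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend mp
```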

We support both representations through ENV variable `VLLM_XPU_FP8_DTYPE` with default value `E5M2`.

:::{warning}
Currently, by default we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model. To avoid this, adding `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` can allow offloading weights to cpu before quantization and quantized weights will be kept in device.


cpu -> CPU
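Putting the two environment variables together, a hedged example of dynamic online FP8 serving; the model name and the `--quantization fp8` flag are illustrative assumptions on top of what the note states:

```bash
# Pick the FP8 representation (the default is E5M2 per the note above) and
# offload weights to CPU before quantization to reduce peak device memory.
export VLLM_XPU_FP8_DTYPE=E4M3
export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1

# Dynamic online FP8 quantization at model load time (illustrative).
vllm serve Qwen/Qwen2.5-14B-Instruct --quantization fp8
```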

Signed-off-by: yan <[email protected]>
@rogerxfeng8

@jitendra42 can you help merge the pr?

@jitendra42 jitendra42 merged commit 48cbbb4 into intel:main Jul 24, 2025
9 checks passed