Commit 48cbbb4

add release note for vLLM 0.9.0 release (#802)

yma11 and sramakintel authored

Signed-off-by: yan <[email protected]>
Co-authored-by: Srikanth Ramakrishna <[email protected]>

1 parent 0183947 commit 48cbbb4

2 files changed: +214 -8 lines changed

vllm/0.9.0-xpu.md

Lines changed: 206 additions & 0 deletions
# Optimize LLM serving with vLLM on Intel® GPUs

vLLM is a fast and easy-to-use library for LLM inference and serving. It has evolved into a community-driven project with contributions from both academia and industry. Intel, as one of the community contributors, works actively to bring satisfying performance with vLLM to Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, and Intel® Gaudi® AI accelerators. This README focuses on Intel® discrete GPUs and provides the information you need to get these workloads running well on your Intel® graphics cards.

The vLLM used in this docker image is based on [v0.9.0](https://github.com/vllm-project/vllm/tree/v0.9.0).

## 1. What's Supported?

Intel GPUs benefit from enhancements brought by the [vLLM V1 engine](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html), including:

* Optimized Execution Loop & API Server
* Simple & Flexible Scheduler
* Zero-Overhead Prefix Caching
* Clean Architecture for Tensor-Parallel Inference
* Efficient Input Preparation

In addition, following the vLLM V1 design, corresponding optimized kernels have been implemented for Intel GPUs:

* chunked_prefill:

chunked_prefill is an optimization feature in vLLM that allows large prefill requests to be divided into small chunks and batched together with decode requests. This approach prioritizes decode requests, improving inter-token latency (ITL) and GPU utilization by combining compute-bound (prefill) and memory-bound (decode) requests in the same batch. The vLLM V1 engine is built on this feature, and in this release it is also supported on Intel GPUs by leveraging the corresponding kernel from Intel® Extension for PyTorch\* for model execution.
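
The chunk size is governed by the scheduler's token budget. As a minimal sketch (not taken from this release note), assuming the offline `LLM` API and a placeholder model, chunked prefill can be tuned as below; the server command in section 3.3.1 does the same thing with `--max_num_batched_tokens`:

```python
# Hypothetical sketch: tuning chunked prefill through the offline LLM API.
# The model name and token budget are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="float16",
    enforce_eager=True,
    # Chunked prefill is on by default in the V1 engine; this caps how many
    # prefill + decode tokens the scheduler batches together per step.
    max_num_batched_tokens=8192,
)
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```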

* FP8 W8A16:

vLLM supports FP8 (8-bit floating point) weights using hardware acceleration on GPUs. We support weight-only online dynamic quantization with FP8, which allows for a 2x reduction in model memory requirements and up to a 1.6x improvement in throughput with minimal impact on accuracy.

Dynamic quantization of an original-precision BF16/FP16 model to FP8 can be achieved with vLLM without any calibration data. You can enable the feature by specifying `--quantization="fp8"` on the command line or setting `quantization="fp8"` in the LLM constructor (see the sketch after this item).

In addition, the FP8 types typically supported in hardware have two distinct representations, each useful in different scenarios:

* **E4M3**: Consists of 1 sign bit, 4 exponent bits, and 3 bits of mantissa. It can store values up to +/-448 and `nan`.
* **E5M2**: Consists of 1 sign bit, 5 exponent bits, and 2 bits of mantissa. It can store values up to +/-57344, +/- `inf`, and `nan`. The tradeoff for the increased dynamic range is lower precision of the stored values.

We support both representations through the environment variable `VLLM_XPU_FP8_DTYPE`, with default value `E5M2`.

:::{warning}
Currently, by default we load the model at its original precision before quantizing it down to 8 bits, so you need enough memory to load the whole model. To avoid this, set `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` to offload weights to the CPU before quantization; the quantized weights are then kept on the device.
:::
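
As a quick illustration of the constructor path mentioned above, here is a minimal sketch; it is not part of the official examples, and the model name, prompt, and sampling settings are placeholder assumptions:

```python
# Hypothetical sketch: weight-only online dynamic FP8 (W8A16) quantization on XPU.
import os

# Assumption: set these before vLLM is initialized.
os.environ["VLLM_XPU_FP8_DTYPE"] = "e4m3"              # default is e5m2
os.environ["VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT"] = "1"  # offload weights to CPU before quantizing

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # placeholder; any verified model
    quantization="fp8",    # enable online dynamic FP8 weight quantization
    dtype="float16",
    device="xpu",
    enforce_eager=True,
)
out = llm.generate(["Explain KV caching in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```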

* Multi Modality Support for Qwen2.5-VL Models

In this release, image/audio input can be processed using Qwen2.5-VL models, such as Qwen/Qwen2.5-VL-32B-Instruct on 4 BMG cards.
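
Once such a model is being served through the OpenAI-compatible server (section 3.3.1), an image request might look like the sketch below; this is an illustration only, and the image URL, question, and use of the `openai` client are assumptions rather than part of this release note:

```python
# Hypothetical sketch: sending an image to a served Qwen2.5-VL model through the
# OpenAI-compatible /v1/chat/completions endpoint. URL and prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```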

We also have some experimental features supported, including:

* **pipeline parallelism**: Works on a single node, as only the `mp` backend is supported for now.
* **torch.compile**: Can be enabled for both the FP16 and online FP8 quantization paths.
* **speculative decoding**: Supports the `n-gram`, `EAGLE`, and `EAGLE3` methods (see the sketch below).
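
For the speculative decoding item above, a minimal sketch using the `n-gram` method through the offline API might look as follows; the `speculative_config` keys follow upstream vLLM documentation, and the model and values are placeholder assumptions:

```python
# Hypothetical sketch: n-gram speculative decoding via the offline LLM API.
# Keys follow upstream vLLM's speculative_config; values are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,   # tokens proposed per step
        "prompt_lookup_max": 4,        # longest n-gram matched in the prompt
    },
)
out = llm.generate(["The quick brown fox"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```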

## Optimizations

* Tensor parallel inference: Intel® oneAPI Collective Communications Library (oneCCL) is optimized to provide boosted performance on Intel® Arc™ B-Series graphics cards.
* GQA kernel optimization: An optimized Grouped-Query Attention (GQA) kernel is adopted; we observed a 10-16% end-to-end throughput improvement for 8B/14B/32B FP8 workloads at 1024/512 input/output lengths.
* Other: Long-context (>4K) optimization for output token latency, bringing a 1.8x performance gain on next-token latency for 40K sequence length, 1.6x for 20K, and 1.4x for 12K.

## Supported Models

The table below lists models that have been verified by Intel. A broader range of models supported by vLLM is expected to work on Intel® GPUs as well.

| Model Type | Model (company/model name) | FP16 | Dynamic Online FP8 |
| ---------- | -------------------------- | --- | --- |
| Text Generation | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |✅︎|✅︎|
| Text Generation | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |✅︎|✅︎|
| Text Generation | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |✅︎|✅︎|
| Text Generation | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |✅︎|✅︎|
| Text Generation | Qwen/Qwen2.5-72B-Instruct |✅︎|✅︎|
| Text Generation | Qwen/Qwen3-14B |✅︎|✅︎|
| Text Generation | Qwen/Qwen3-32B |✅︎|✅︎|
| Text Generation | Qwen/Qwen3-30B-A3B |✅︎|✅︎|
| Text Generation | deepseek-ai/DeepSeek-V2-Lite |✅︎|✅︎|
| Text Generation | meta-llama/Llama-3.1-8B-Instruct |✅︎|✅︎|
| Text Generation | baichuan-inc/Baichuan2-13B-Chat |✅︎|✅︎|
| Text Generation | THUDM/GLM-4-9B-chat |✅︎|✅︎|
| Text Generation | THUDM/GLM-4v-9B-chat |✅︎|✅︎|
| Text Generation | THUDM/CodeGeex4-All-9B |✅︎|✅︎|
| Text Generation | chuhac/TeleChat2-35B |✅︎|✅︎|
| Text Generation | 01-ai/Yi1.5-34B-Chat |✅︎|✅︎|
| Text Generation | deepseek-ai/DeepSeek-Coder-33B-base |✅︎|✅︎|
| Text Generation | meta-llama/Llama-2-13b-chat-hf |✅︎|✅︎|
| Text Generation | Qwen/Qwen1.5-14B-Chat |✅︎|✅︎|
| Text Generation | Qwen/Qwen1.5-32B-Chat |✅︎|✅︎|
| Multi Modality | Qwen/Qwen2.5-VL-72B-Instruct |✅︎|✅︎|
| Multi Modality | Qwen/Qwen2.5-VL-32B-Instruct |✅︎|✅︎|

## 2. Limitations

Some vLLM V1 features may need extra support, including LoRA (Low-Rank Adaptation), pipeline parallelism on Ray, EP (Expert Parallelism)/TP (Tensor Parallelism) MoE (Mixture of Experts), DP (Data Parallelism) Attention, and MLA (Multi-head Latent Attention).

The following are known issues:

* Memory reservation increases in vLLM 0.9.0, which may cause OOM for multi-modality models such as Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2.5-VL-72B-Instruct, and Qwen/Qwen2.5-VL-32B-Instruct. Decrease `gpu-memory-utilization` from the default `0.9` to `0.85` (see the sketch after this list).
* W8A8 models quantized through llm_compressor, such as RedHatAI/DeepSeek-R1-Distill-Qwen-32B-FP8-dynamic, are not supported yet.
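
As a sketch of the memory-utilization workaround (an illustration only; the model choice and other arguments are assumptions mirroring the serving command in section 3.3.1), the lower value can be passed through the offline API as below, or as `--gpu-memory-utilization 0.85` on the server command line:

```python
# Hypothetical sketch: lowering GPU memory utilization for a multi-modality model.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-32B-Instruct",  # illustrative multi-modality model
    gpu_memory_utilization=0.85,           # reduced from the 0.9 default to avoid OOM
    dtype="float16",
    max_model_len=4096,
    tensor_parallel_size=4,
    enforce_eager=True,
)
```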

## 3. How to Get Started

### 3.1. Prerequisite

| OS | Hardware |
| ---------- | ---------- |
| Ubuntu 25.04 | Intel® Arc™ B-Series |

### 3.2. Prepare a Serving Environment

1. Get the released docker image with the command `docker pull intel/vllm:0.9.0-xpu`
2. Instantiate a docker container with the command `docker run -t -d --shm-size 10g --net=host --ipc=host --privileged -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test --device /dev/dri:/dev/dri --entrypoint= intel/vllm:0.9.0-xpu /bin/bash`
3. Source the oneAPI environment variables to ensure they are set correctly with the command `docker exec vllm-test /bin/bash -c "source /opt/intel/oneapi/setvars.sh --force"`
4. Run the command `docker exec -it vllm-test bash` in 2 separate terminals to enter container environments for the server and the client respectively.

\* Starting from here, all commands are expected to be run inside the docker container, if not explicitly noted.

In both environments, you may then wish to set a `HUGGING_FACE_HUB_TOKEN` environment variable to make sure necessary files can be downloaded from the HuggingFace website.

```bash
export HUGGING_FACE_HUB_TOKEN=xxxxxx
```

### 3.3. Launch Workloads

#### 3.3.1. Launch Server in the Server Environment

Command:

```bash
TORCH_LLM_ALLREDUCE=1 VLLM_USE_V1=1 VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dtype=float16 --device=xpu --enforce-eager --port 8000 --block-size 64 --gpu-memory-util 0.9  --no-enable-prefix-caching --trust-remote-code --disable-sliding-window --disable-log-requests --max_num_batched_tokens=8192 --max_model_len 4096 -tp=4 --quantization fp8
```

Note that by default FP8 online quantization uses `e5m2`; you can switch to `e4m3` by explicitly adding the environment variable `VLLM_XPU_FP8_DTYPE=e4m3`. If there is not enough memory to hold the whole model before quantization to FP8, you can use `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` to offload weights to the CPU first.

Expected output:

```bash
INFO 02-20 03:20:29 api_server.py:937] Starting vLLM API server on http://0.0.0.0:8000
INFO 02-20 03:20:29 launcher.py:23] Available routes are:
INFO 02-20 03:20:29 launcher.py:31] Route: /openapi.json, Methods: HEAD, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /docs, Methods: HEAD, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /redoc, Methods: HEAD, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /health, Methods: GET
INFO 02-20 03:20:29 launcher.py:31] Route: /ping, Methods: POST, GET
INFO 02-20 03:20:29 launcher.py:31] Route: /tokenize, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /detokenize, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/models, Methods: GET
INFO 02-20 03:20:29 launcher.py:31] Route: /version, Methods: GET
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/chat/completions, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/completions, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/embeddings, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /pooling, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /score, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/score, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/audio/transcriptions, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /rerank, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v1/rerank, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /v2/rerank, Methods: POST
INFO 02-20 03:20:29 launcher.py:31] Route: /invocations, Methods: POST
INFO: Started server process [1636943]
INFO: Waiting for application startup.
INFO: Application startup complete.
```

Startup may take some time. The message `INFO: Application startup complete.` indicates that the server is ready.
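
Before benchmarking, you can optionally verify the endpoint with a single completion request from the client terminal. The sketch below uses the `openai` Python client against the server launched above; the prompt and the use of that client are assumptions, not part of the original guide:

```python
# Optional smoke test against the OpenAI-compatible server started in section 3.3.1.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # any placeholder key works
completion = client.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    prompt="San Francisco is a",
    max_tokens=32,
    temperature=0.0,
)
print(completion.choices[0].text)
```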

#### 3.3.2. Raise Requests for Benchmarking in the Client Environment

We leverage a [benchmarking script](https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py) provided in vLLM to perform performance benchmarking. You can use your own client scripts as well.

Use the command below to send serving requests:

```bash
python3 benchmarks/benchmark_serving.py --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --dataset-name random --random-input-len=1024 --random-output-len=1024 --ignore-eos --num-prompt 1 --max-concurrency 16 --request-rate inf --backend vllm --port=8000 --host 0.0.0.0
```

The command uses the model `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B`. Both input and output token sizes are set to `1024`. At most `16` requests are processed concurrently by the server.

Expected output:

```bash
Maximum request concurrency: 16
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): xxx
Total input tokens: 1024
Total generated tokens: 1024
Request throughput (req/s): xxx
Output token throughput (tok/s): xxx
Total Token throughput (tok/s): xxx
---------------Time to First Token----------------
Mean TTFT (ms): xxx
Median TTFT (ms): xxx
P99 TTFT (ms): xxx
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): xxx
Median TPOT (ms): xxx
P99 TPOT (ms): xxx
---------------Inter-token Latency----------------
Mean ITL (ms): xxx
Median ITL (ms): xxx
P99 ITL (ms): xxx
==================================================
```

## 4. Need Assistance?

Should you encounter any issues or have any questions, please submit an issue ticket at [vLLM GitHub Issues](https://github.com/vllm-project/vllm/issues). Include the text `[Intel GPU]` in the issue title to ensure it gets noticed.

vllm/xpu.md

Lines changed: 8 additions & 8 deletions
@@ -1,6 +1,6 @@
- # Optimize LLM serving with vLLM on Intel? GPUs
+ # Optimize LLM serving with vLLM on Intel® GPUs

- vLLM is a fast and easy-to-use library for LLM inference and serving. It has evolved into a community-driven project with contributions from both academia and industry. Intel, as one of the community contributors, is working actively to bring satisfying performance with vLLM on Intel? platforms, including Intel? Xeon? Scalable Processors, Intel? discrete GPUs, as well as Intel? Gaud? AI accelerators. This blog focuses on Intel? discrete GPUs at this time and brings you the necessary information to get the workloads running well on your Intel? graphics cards.
+ vLLM is a fast and easy-to-use library for LLM inference and serving. It has evolved into a community-driven project with contributions from both academia and industry. Intel, as one of the community contributors, is working actively to bring satisfying performance with vLLM on Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, as well as Intel® Gaudi® AI accelerators. This readme focuses on Intel® discrete GPUs at this time and brings you the necessary information to get the workloads running well on your Intel® graphics cards.

  ## 1. What's Supported?

@@ -13,15 +13,15 @@ Intel GPUs benefit from enhancements brought by [vLLM V1 engine](https://blog.vl
  * Efficient Input Preparation
  * Enhanced Support for Multimodal LLMs

- Moreover, **`chunked_prefill`**, an optimization feature in vLLM that allows large prefill requests to be divided into small chunks and batched together with decode requests, is also enabled. This approach prioritizes decode requests, improving inter-token latency (ITL) and GPU utilization by combining compute-bound (prefill) and memory-bound (decode) requests in the same batch. vLLM v1 engine is built on this feature and in this release, it's also supported on intel GPUs by leveraging corresponding kernel from Intel? Extension for PyTorch\* for model execution.
+ Moreover, **`chunked_prefill`**, an optimization feature in vLLM that allows large prefill requests to be divided into small chunks and batched together with decode requests, is also enabled. This approach prioritizes decode requests, improving inter-token latency (ITL) and GPU utilization by combining compute-bound (prefill) and memory-bound (decode) requests in the same batch. vLLM v1 engine is built on this feature and in this release, it's also supported on intel GPUs by leveraging corresponding kernel from Intel® Extension for PyTorch\* for model execution.

  In a near future release, we will support the following features.

  * **Spec decode**: Speculative decoding in vLLM is a technique designed to improve inter-token latency during LLM inference by using a smaller, faster draft model to predict future tokens.
  * **Sliding window**: Sliding window attention is a mechanism used in large language models to manage memory usage efficiently by limiting the context length to a fixed window size. This approach allows the model to focus on the most recent tokens while discarding older ones, which is particularly useful for handling long sequences without exceeding memory constraints.
- * **FP8 KV cache**: We will support FP8 KV cache in this release with kernels from Intel? Extension for PyTorch\*. It allows for a larger number of tokens to be stored in the cache, effectively doubling the space available for KV cache allocation. This increase in storage capacity enhances throughput by enabling the processing of longer context lengths for individual requests or handling more concurrent request batches.
+ * **FP8 KV cache**: We will support FP8 KV cache in this release with kernels from Intel® Extension for PyTorch\*. It allows for a larger number of tokens to be stored in the cache, effectively doubling the space available for KV cache allocation. This increase in storage capacity enhances throughput by enabling the processing of longer context lengths for individual requests or handling more concurrent request batches.

- The table below lists models that have been verified by Intel. However, there should be broader models that are supported by vLLM work on Intel? GPUs.
+ The table below lists models that have been verified by Intel. However, there should be broader models that are supported by vLLM work on Intel® GPUs.

  | Model Type | Model |
  | ---------- | ---------- |
@@ -51,14 +51,14 @@ The following issues are known issues that we plan to fix in future releases:

  | OS | Hardware |
  | ---------- | ---------- |
- | Ubuntu 24.10 | Intel? Arc? B580 |
- | Ubuntu 22.04 | Intel? Data Center GPU Max Series |
+ | Ubuntu 24.10 | Intel® Arc B580 |
+ | Ubuntu 22.04 | Intel® Data Center GPU Max Series |

  ### 3.2. Prepare a Serving Environment

  1. Follow [instructions](https://dgpu-docs.intel.com/driver/overview.html) to install driver packages.
  2. Get the released docker image with command `docker pull intel/vllm:xpu`
- 3. Instantiate a docker container with command `docker run -t -d --shm-size 10g --net=host --ipc=host --privileged -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test --device /dev/dri:/dev/dri --entrypoint= intel/vllm:xpu?/bin/bash`
+ 3. Instantiate a docker container with command `docker run -t -d --shm-size 10g --net=host --ipc=host --privileged -v /dev/dri/by-path:/dev/dri/by-path --name=vllm-test --device /dev/dri:/dev/dri --entrypoint= intel/vllm:xpu /bin/bash`
  4. Run command `docker exec -it vllm-test bash` in 2 separate terminals to enter container environments for the server and the client respectively.

  \* Starting from here, all commands are expected to be run inside the docker container, if not explicitly noted.
