add release note for vLLM 0.9.0 release #802
Conversation
Signed-off-by: yan <[email protected]>
@@ -0,0 +1,198 @@
# Optimize LLM serving with vLLM on Intel® GPUs
We don't update the historical release notes; just add this 0.9.0-xpu.md.
vllm/0.9.0-xpu.md
Outdated
@@ -0,0 +1,198 @@
# Optimize LLM serving with vLLM on Intel® GPUs

vLLM is a fast and easy-to-use library for LLM inference and serving. It has evolved into a community-driven project with contributions from both academia and industry. Intel, as one of the community contributors, is working actively to bring satisfying performance with vLLM on Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, as well as Intel® Gaud® AI accelerators. This blog focuses on Intel® discrete GPUs at this time and brings you the necessary information to get the workloads running well on your Intel® graphics cards.
Gaud® -> Gaudi®
This blog -> This readme
vllm/0.9.0-xpu.md
Outdated
vLLM is a fast and easy-to-use library for LLM inference and serving. It has evolved into a community-driven project with contributions from both academia and industry. Intel, as one of the community contributors, is working actively to bring satisfying performance with vLLM on Intel® platforms, including Intel® Xeon® Scalable Processors, Intel® discrete GPUs, as well as Intel® Gaud® AI accelerators. This blog focuses on Intel® discrete GPUs at this time and brings you the necessary information to get the workloads running well on your Intel® graphics cards.

The vLLM used in the latest docker image is based on [v0.9.0](https://github.com/vllm-project/vllm/tree/v0.9.0)
latest -> this
because it will no longer be 'latest' once new releases come out.
vllm/0.9.0-xpu.md
Outdated
## Optimizations

* tensor parallel inference: Intel® oneAPI Collective Communications Library(oneCCL) is optimized to provide boosted performance in Intel® Arc™ B-Series graphics cards.
tensor -> Tensor
vllm/0.9.0-xpu.md
Outdated
## Optimizations

* tensor parallel inference: Intel® oneAPI Collective Communications Library(oneCCL) is optimized to provide boosted performance in Intel® Arc™ B-Series graphics cards.
* GQA kernel optimization: An optimized version of Grouped-Query Attention(GQA) kernel is adopted and obvious perf improvement is observed in models like Qwen and Llama.
10-16% end-to-end throughput improvement for 8B/14B/32B fp8 workloads with 1024/512 input/output lengths.
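(Aside for readers of this thread: a minimal sketch of the kind of tensor-parallel run these numbers describe, using vLLM's standard offline Python API. The model name and parallel degree are illustrative assumptions, not from this PR.)

```python
# Sketch only: tensor-parallel inference via vLLM's offline Python API.
# Model name and parallel degree are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed 8B example model
    tensor_parallel_size=2,                    # shard weights across 2 GPUs
)
params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Summarize what tensor parallelism does."], params)
print(outputs[0].outputs[0].text)
```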
vllm/0.9.0-xpu.md
Outdated
The table below lists models that have been verified by Intel. However, there should be broader models that are supported by vLLM work on Intel® GPUs.

| Model Type | Model (company/model name) | Dynamic Online FP8 |
We have supported model lists for fp16 and w4a8 as well:
Hi @rogerxfeng8, one open question here: the int4 models we tested are not public but were obtained from ccg. How can we add them here?
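(Aside: "Dynamic Online FP8" in the table header means weights are quantized on the fly at load time. A minimal sketch, assuming vLLM's standard `quantization="fp8"` option and an illustrative model name:)

```python
# Sketch only: dynamic online FP8 quantization, i.e. weights are quantized
# at load time so no pre-quantized checkpoint is needed. Model name is an
# illustrative assumption.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    quantization="fp8",  # quantize weights online to FP8
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```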
vllm/0.9.0-xpu.md
Outdated
The following issues are known issues:

* Memory reservation increases in vLLM 0.9.0 and it may cause OOM to multi-modility models like Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2.5-VL-72B-Instruct and Qwen/Qwen2.5-VL-32B-Instruct. We need downgrade `gpu-memory-utilization` from default value `0.9` to `0.85`.
downgrade -> decrease
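(For context, a minimal sketch of lowering the memory utilization as the note suggests, using the standard vLLM Python API; the model name is illustrative.)

```python
# Sketch: work around the 0.9.0 memory-reservation increase by lowering
# gpu_memory_utilization from the default 0.9 to 0.85. Model name is
# illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    gpu_memory_utilization=0.85,  # default is 0.9
)
out = llm.generate(["Describe vLLM in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```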
Force-pushed from 52e947b to 5561274
Signed-off-by: yan <[email protected]>
Signed-off-by: yan <[email protected]>
vllm/0.9.0-xpu.md
Outdated
Currently, by default we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model. To avoid this, adding `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` can allow offloading weights to cpu before quantization and quantized weights will be kept in device.
:::

* Multi Modility Support for Qwen2.5-VL Models
Modility -> Modality
vllm/0.9.0-xpu.md
Outdated
The following issues are known issues:

* Memory reservation increases in vLLM 0.9.0 and it may cause OOM to multi-modility models like Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2.5-VL-72B-Instruct and Qwen/Qwen2.5-VL-32B-Instruct. We need decrease `gpu-memory-utilization` from default value `0.9` to `0.85`.
modility -> modality
vllm/0.9.0-xpu.md
Outdated
We also have some experimental features supported, including:

* **pipeline parallelism**: Works on on single node as only backend `mp` is supported for now.
Works on on -> Works on
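(For illustration, a minimal sketch of single-node pipeline parallelism with the `mp` backend, assuming the standard vLLM Python API; model name and pipeline degree are illustrative.)

```python
# Sketch: single-node pipeline parallelism with the multiprocessing ("mp")
# backend. Model name and pipeline_parallel_size are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    pipeline_parallel_size=2,           # split layers across 2 GPUs
    distributed_executor_backend="mp",  # only "mp" is supported for now
)
print(llm.generate(["Hi"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```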
We support both representations through ENV variable `VLLM_XPU_FP8_DTYPE` with default value `E5M2`.

:::{warning}
Currently, by default we load the model at original precision before quantizing down to 8-bits, so you need enough memory to load the whole model. To avoid this, adding `VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1` can allow offloading weights to cpu before quantization and quantized weights will be kept in device.
cpu -> CPU
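(A minimal sketch of how the two environment variables from the quoted text might be used together; the variable names come from the release note, while the alternative FP8 value and the model name are assumptions.)

```python
# Sketch: pick the FP8 representation and offload weights to CPU before
# quantization, using the env variables described in the release note.
# "E4M3" is assumed to be the alternative to the default "E5M2"; the model
# name is illustrative only. Env vars are set before importing vllm.
import os

os.environ["VLLM_XPU_FP8_DTYPE"] = "E4M3"              # default is E5M2
os.environ["VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT"] = "1"  # quantize from CPU copies

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-32B-Instruct", quantization="fp8")
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```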
Signed-off-by: yan <[email protected]>
@jitendra42 can you help merge the PR?
Description
Related Issue
Changes Made
Validation
I have run `test_runner.py` with all existing tests passing, and I have added new tests where applicable.