
Enable CPU nightly performance benchmark and its Markdown report #18444


Open
wants to merge 16 commits into main from nightly_cpu_benchmark

Conversation

@louie-tsai (Contributor) commented May 21, 2025

We need to standardize vLLM CPU benchmarks among customers and Intel users by using the vLLM benchmark suite.
We also hope to enable CPU performance numbers on the vLLM performance dashboard.

This PR enables the vLLM benchmark suite for CPU; below is a snapshot of the serving benchmark report.
The numbers are aligned with our Xeon EMR numbers.

How to run it on CPU from the vllm folder:
ON_CPU=1 bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh

Also added a new "Platform Information" section to list CPU info:

[screenshot: "Platform Information" section of the benchmark report]
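For illustration, here is a minimal sketch of the kind of data such a section could collect. It uses psutil and py-cpuinfo, which the benchmark script imports; the exact fields and labels are assumptions, not the PR's actual code.

```python
# Minimal sketch only: gather basic CPU/memory facts for a "Platform Information"
# table. Field names are illustrative; the real script in this PR may differ.
import psutil
from cpuinfo import get_cpu_info


def collect_platform_info() -> dict:
    cpu = get_cpu_info()  # dict with keys such as "brand_raw" and "arch"
    return {
        "CPU model": cpu.get("brand_raw", "unknown"),
        "Architecture": cpu.get("arch", "unknown"),
        "Physical cores": psutil.cpu_count(logical=False),
        "Logical cores": psutil.cpu_count(logical=True),
        "Total memory (GiB)": round(psutil.virtual_memory().total / 2**30, 1),
    }


if __name__ == "__main__":
    for key, value in collect_platform_info().items():
        print(f"- {key}: {value}")
```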

Here is the full report.
benchmark_results_0527_3.md

Overall, the current tests took ~2 hours.

It also needs the automatic OMP thread binding from the PR below:
#17930

Also added a compare-json-results.py script to compare different benchmark-results.json files:
[screenshot: compare-json-results.py output]
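To illustrate the idea (not the actual script in this PR), here is a minimal pandas sketch that joins one metric across several benchmark-results.json files; the column names "Test name" and "Tput (req/s)" are assumptions about the result schema.

```python
# Hedged sketch: join one metric across multiple benchmark-results.json files.
# Column names below are assumptions; adjust them to the real result schema.
import argparse

import pandas as pd


def main() -> None:
    parser = argparse.ArgumentParser(description="Compare benchmark result files")
    parser.add_argument("files", nargs="+", help="benchmark-results.json files")
    args = parser.parse_args()

    merged = None
    for path in args.files:
        # Each file is assumed to hold a list of per-test records.
        df = pd.read_json(path)[["Test name", "Tput (req/s)"]]
        df = df.rename(columns={"Tput (req/s)": f"Tput (req/s) [{path}]"})
        merged = df if merged is None else merged.merge(df, on="Test name", how="outer")

    # to_markdown() needs the optional 'tabulate' package.
    print(merged.to_markdown(index=False))


if __name__ == "__main__":
    main()
```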


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify bot added the ci/build label on May 21, 2025
@louie-tsai force-pushed the nightly_cpu_benchmark branch 9 times, most recently from 913611c to 298aba1 on June 3, 2025 00:21
@louie-tsai force-pushed the nightly_cpu_benchmark branch 3 times, most recently from 1299794 to 51ede3a on June 4, 2025 01:09
@louie-tsai (Contributor Author)

@bigPYJ1151 could you help review this PR?

@louie-tsai force-pushed the nightly_cpu_benchmark branch 2 times, most recently from 2f11d5e to 21ce351 on June 4, 2025 21:22
@louie-tsai (Contributor Author)

@xuechendi please help review and merge the PR. Thanks.

@louie-tsai force-pushed the nightly_cpu_benchmark branch 2 times, most recently from 5bf0667 to 433ed5c on June 9, 2025 17:53
@xuechendi (Contributor)

@louie-tsai, it looks like renaming the GPU script is not necessary. Do you think it is OK to drop the changes to the existing GPU script and only add the CPU files?

@xuechendi (Contributor)

@bigPYJ1151, please take a look at this PR. The customer-facing team wants to provide the CPU benchmark script from the vLLM upstream repo so customers can reproduce the numbers easily. Please check whether the current test settings make sense to you.

@louie-tsai force-pushed the nightly_cpu_benchmark branch 3 times, most recently from 5164476 to 7eb926e on June 10, 2025 01:58
@bigPYJ1151 (Contributor) left a comment

Frankly speaking, I'm not sure reusing the benchmark script between CPU and GPU is a good idea.

Meanwhile, this PR also relates to the vLLM CI infra and needs benchmark machines and pipelines to be set up. Do we have a plan to establish these?

Nightly benchmark will be triggered when:
- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.

## Performance benchmark details

See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.

See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests-gpu.json`, `tests/throughput-tests-gpu.json`, `tests/serving-tests-gpu.json` to configure the test cases.
Contributor

I also think adding the -gpu suffix is not necessary, as it would introduce lots of changes.

Contributor Author

changed it accordingly.

import pandas as pd
import psutil
from cpuinfo import get_cpu_info
from numa import info
Contributor

Perhaps numa should be added to a requirements file such as test.in.

Contributor Author

Added them accordingly.
We are targeting the release docker image. Should we use the test image instead?
If yes, do you have the build command for the test image?

Contributor

Right, the test image includes more dependencies and packages for testing/benchmarking.

To build it, just add --target=vllm-test when building the CPU image.

@@ -187,7 +222,7 @@ def results_to_json(latency, throughput, serving):
# The GPUs sometimes come in format of "GPUTYPE\nGPUTYPE\n...",
# we want to turn it into "8xGPUTYPE"
df["GPU"] = df["GPU"].apply(
lambda x: f"{len(x.split('\n'))}x{x.split('\n')[0]}"
lambda x: f"{len(x.split(chg_line_char))}x{x.split(chg_line_char)[0]}"
Contributor

Why is this change required?

Contributor Author

We used to face a parsing issue on CPU results, but there is no issue with the latest code. Removed the changes.

@louie-tsai force-pushed the nightly_cpu_benchmark branch 2 times, most recently from a876145 to 2353aa2 on June 11, 2025 21:47
@louie-tsai (Contributor Author)

> Frankly speaking, I'm not sure reusing the benchmark script between CPU and GPU is a good idea.
>
> Meanwhile, this PR also relates to the vLLM CI infra and needs benchmark machines and pipelines to be set up. Do we have a plan to establish these?

For a similar user experience across architectures, it might be good to have the same benchmark scripts for CPU and GPU. We separated the run arguments into different JSON files, so the bash scripts are mostly general to both CPU and GPU. Hope that explanation makes this change more reasonable to you.
Yes, Chendi and I plan to use EMR as the benchmark machine.

@louie-tsai force-pushed the nightly_cpu_benchmark branch from 23bda57 to 1f565b5 on June 11, 2025 21:54
@louie-tsai requested a review from bigPYJ1151 on June 11, 2025 21:54

@@ -134,6 +134,8 @@ ENTRYPOINT ["bash"]
FROM base AS vllm-openai

WORKDIR /workspace/
ADD ./benchmarks/ ./benchmarks/
ADD ./.buildkite/ ./.buildkite/
Contributor

If using the test image, these changes will not be required.

Contributor Author

removed them accordingly. thanks

{
"test_name": "latency_llama8B_tp1",
"environment_variables": {
"VLLM_USE_V1": 1,
Contributor

After CPU V1 was merged, VLLM_USE_V1 is not required.

Contributor Author

addressed it accordingly.

Comment on lines 6 to 8
"VLLM_RPC_TIMEOUT": 1000000,
"VLLM_ALLOW_LONG_MAX_MODEL_LEN": 1,
"VLLM_ENGINE_ITERATION_TIMEOUT_S": 600,
@bigPYJ1151 (Contributor) Jun 12, 2025

VLLM_RPC_TIMEOUT and VLLM_ENGINE_ITERATION_TIMEOUT_S only take effect for serving benchmarks, so setting them for latency/throughput benchmarks is not necessary.

Meanwhile, even for serving benchmarks, I think setting them to large values is not a good idea. If a benchmark setting exceeds the CPU device's capability and takes too much time, the result may be less meaningful, and we should adjust the benchmark/engine settings to be more practical. So with effective timeouts we can tell whether our benchmark/engine settings are good.

Contributor Author

thanks for the suggestions. addressed it accordingly.

@louie-tsai (Contributor Author) commented Jun 17, 2025

Reserve 0 CPUs per NUMA node, since performance is better without reservation for Llama 3.1 8B on an AWS c7i.48xlarge instance.

@louie-tsai force-pushed the nightly_cpu_benchmark branch 3 times, most recently from 12f7c6d to 5836136 on June 17, 2025 17:43
@louie-tsai force-pushed the nightly_cpu_benchmark branch 2 times, most recently from 3f045db to ea7ab42 on June 18, 2025 15:12
Comment on lines 176 to 177
from cpuinfo import get_cpu_info
from numa import info
Contributor

cpuinfo and numa will break the CUDA benchmarking as they are missing in the CUDA containers.

We should check them here and only capture platform_data when the packages are available.
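A minimal sketch of that guard, assuming platform_data is the dict the script fills in (the surrounding code is illustrative, not the PR's exact implementation):

```python
# Illustrative guard: only collect CPU platform details when the optional
# packages are importable (they are absent from the CUDA containers).
try:
    from cpuinfo import get_cpu_info
    from numa import info as numa_info  # noqa: F401

    HAS_CPU_PLATFORM_DEPS = True
except ImportError:
    HAS_CPU_PLATFORM_DEPS = False

platform_data = {}
if HAS_CPU_PLATFORM_DEPS:
    platform_data["CPU"] = get_cpu_info().get("brand_raw", "unknown")
    # NUMA topology details could be appended here via the numa.info module.
```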

Contributor Author

addressed it accordingly

@louie-tsai force-pushed the nightly_cpu_benchmark branch from ea7ab42 to d9d64e9 on June 20, 2025 06:34
@louie-tsai requested a review from bigPYJ1151 on June 20, 2025 06:34