
[Platform] Add custom default max tokens #18557


Open · wants to merge 8 commits into main

Conversation

@gmarinho2 (Contributor) commented May 22, 2025

FIX: vllm-project/vllm-spyre#148

Currently the default max tokens for the completions API is set to max_model_len - prompt_len. With this PR, a platform that needs a different default_max_tokens can change it simply by overriding the maybe_update_max_tokens method of the Platform class; when no override is needed, the base method returns the current default. Edit: typo in the commit message: "class Plataform" should read "class Platform".
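
For illustration, here is a minimal sketch of how an out-of-tree platform might use the new hook. Only the method name (maybe_update_max_tokens) and the Platform base class come from this PR; the signature, import path, and cap value below are assumptions.

```python
# Hypothetical platform overriding the hook added in this PR.
# The signature is an assumption; only the method name and the
# Platform base class are taken from the PR description.
from vllm.platforms.interface import Platform


class MyPlatform(Platform):  # hypothetical out-of-tree platform

    @classmethod
    def maybe_update_max_tokens(cls, max_model_len: int, prompt_len: int,
                                default_max_tokens: int) -> int:
        # Cap the default instead of allowing the full
        # max_model_len - prompt_len budget.
        return min(default_max_tokens, 1024)
```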


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@maxdebayser (Contributor) left a comment


LGTM

@joerunde (Collaborator)

Is this something that can be handled by --generation-config?

--generation-config
The folder path to the generation config. Defaults to "auto", meaning the generation config will be loaded from the model path. If set to "vllm", no generation config is loaded and vLLM defaults will be used. If set to a folder path, the generation config will be loaded from the specified folder path. If max_new_tokens is specified in the generation config, it sets a server-wide limit on the number of output tokens for all requests.

Default: 'auto'

Should the chat API be respecting a max_new_tokens override from the generation config instead of setting the default to max_model_len - prompt_len? That would allow a default override to be set regardless of platform.

That said, I do like the code hook to be able to write whatever code you want too...
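
As a hedged illustration of that route: the snippet below writes a Hugging Face-style generation_config.json into a folder and points vLLM at it via the flag quoted above. The folder path is made up; the flag semantics and the max_new_tokens behavior come from the help text.

```python
# Sketch: set a server-wide output-token cap via --generation-config.
import json
from pathlib import Path

cfg_dir = Path("./gen_config_override")  # hypothetical path
cfg_dir.mkdir(exist_ok=True)
(cfg_dir / "generation_config.json").write_text(
    json.dumps({"max_new_tokens": 512}))

# Then launch, for example:
#   vllm serve <model> --generation-config ./gen_config_override
# Per the help text above, max_new_tokens then acts as a server-wide
# limit on output tokens for all requests.
```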

@maxdebayser (Contributor)

Is this something that can be handled by --generation-config?

Unfortunately not, because in vllm-spyre the permissible max_new_tokens depends on the request.

@NickLucche (Contributor) left a comment


Thanks for the contribution!

Shouldn't we update serving_completion too?

@gmarinho2 (Contributor, Author)

Thanks for the contribution!

Shouldn't we update serving_completion too?

Done. Since class CompletionRequest has 16 as its default, that value will probably be selected most of the time, because the default is set to the minimum of the context window, the user request, and the server limit: REF1, REF2, REF3, REF4.
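
For readers following the references, here is a sketch of that selection logic; the identifiers are illustrative and do not match vLLM's internal names.

```python
# Illustrative only: the names below are not vLLM's actual identifiers.
def effective_max_tokens(context_window_remaining: int,
                         user_max_tokens: int | None,
                         server_limit: int | None) -> int:
    # CompletionRequest defaults max_tokens to 16 when the client omits
    # it, so that small value usually wins the min() below.
    requested = 16 if user_max_tokens is None else user_max_tokens
    limits = [context_window_remaining, requested]
    if server_limit is not None:
        limits.append(server_limit)
    return min(limits)


assert effective_max_tokens(4096 - 100, None, None) == 16
```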

@gmarinho2 requested a review from NickLucche on May 30, 2025 18:53
@joerunde (Collaborator) commented Jun 9, 2025

@youkaichao any thoughts on getting this merged?

@NickLucche (Contributor) left a comment


lgtm! Apologies for the delay.

@joerunde added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jun 12, 2025
@joerunde (Collaborator)

Let's get the full CI running then and see if we can get a maintainer to get this merged 👍

@joerunde (Collaborator)

@njhill can you hit the big green merge button for us?

@gmarinho2 requested a review from aarnphm as a code owner on June 17, 2025 20:45
@aarnphm (Collaborator) left a comment


One formatting comment, otherwise lgtm

Signed-off-by: Gabriel Marinho <[email protected]>
@aarnphm changed the title from "Add custom default max tokens for different plataforms" to "[Platform] Add custom default max tokens" on Jun 19, 2025
@aarnphm enabled auto-merge (squash) on June 19, 2025 05:21
Labels: frontend, ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues:

- Incorrect default max_completion_tokens being set
7 participants