[V1] batch scheduling of simultaneously arriving requests #23

Open
@yannicks1

Description

V1 Scheduler: Batch simultaneously arriving requests together

The current V1 vllm-spyre scheduler does not schedule requests that were submitted simultaneously in the same batch.
Note: the V0 vllm-spyre scheduler did schedule these requests together.

How to reproduce

Alter examples/offline_inference_spyre.py to use batch size 4 and submit 4 prompts simultaneously:

os.environ['VLLM_SPYRE_WARMUP_BATCH_SIZES'] = '4'
.
.
.
prompts = [prompt1, prompt1, prompt1, prompt1]
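For context on the "Padding request" lines in the logs below: vllm-spyre runs requests at precompiled warmup shapes, so each 49-token prompt is padded up to the 64-token warmup prompt length. A minimal sketch of that shape selection (the function name and exact rules are illustrative assumptions, not the actual runner code):

```python
def pad_to_warmup_len(prompt_len: int, warmup_prompt_lens: list[int]) -> int:
    """Return the smallest warmup prompt length that can hold the request."""
    fitting = [n for n in sorted(warmup_prompt_lens) if n >= prompt_len]
    if not fitting:
        raise ValueError(
            f"prompt of {prompt_len} tokens exceeds all warmup shapes")
    return fitting[0]

# A 49-token prompt lands in the 64-token warmup shape, as in the logs.
print(pad_to_warmup_len(49, [64, 128]))  # 64
```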

V1 yields (export VLLM_USE_V1=1):

[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
Processed prompts:   0%|                                                                                          | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
[spyre_model_runner:execute_model] t_token: 231.08ms
[spyre_model_runner:execute_model] t_token: 67.17ms
[spyre_model_runner:execute_model] t_token: 65.63ms
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
Processed prompts:  25%|████████████████████                                                            | 1/4 [00:00<00:01,  2.75it/s, est. speed input: 134.60 toks/s, output: 8.24 toks/s]
[spyre_model_runner:execute_model] t_token: 171.95ms
[spyre_model_runner:execute_model] t_token: 75.90ms
[spyre_model_runner:execute_model] t_token: 62.83ms
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.92it/s, est. speed input: 290.16 toks/s, output: 17.76 toks/s]
Time elaspsed for 3 tokens is 0.69 sec

V0 yields (export VLLM_USE_V1=0):

[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[spyre_model_runner:execute_model] t_token: 166.63ms
[spyre_model_runner:execute_model] t_token: 57.04ms
[spyre_model_runner:execute_model] t_token: 57.95ms
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 14.03it/s, est. speed input: 687.52 toks/s, output: 42.09 toks/s]
Time elaspsed for 3 tokens is 0.29 sec

The logs show that the V1 scheduler schedules only the first sequence (padding the batch with 3 empty slots) and, once its decodes have finished, the remaining 3 sequences (padded with 1 empty slot). The V0 scheduler puts all 4 sequences in a single batch, which is the desired behavior.
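The observed difference can be modeled as a toy prefill policy (purely illustrative; this function does not exist in vllm-spyre, and the real schedulers are more involved):

```python
def count_prefill_batches(num_requests: int, batch_size: int,
                          batch_all_waiting: bool) -> list[int]:
    """Toy model: return the number of real requests in each prefill batch.

    batch_all_waiting=True  ~ V0 behavior: fill each batch with up to
                              batch_size waiting requests.
    batch_all_waiting=False ~ observed V1 behavior: the first batch takes
                              only one request (padded), the rest wait.
    """
    batches = []
    waiting = num_requests
    while waiting:
        if batch_all_waiting or batches:
            take = min(batch_size, waiting)
        else:
            take = 1  # first V1 batch holds a single real sequence
        batches.append(take)
        waiting -= take
    return batches

print(count_prefill_batches(4, 4, batch_all_waiting=True))   # [4]
print(count_prefill_batches(4, 4, batch_all_waiting=False))  # [1, 3]
```

Two prefill batches instead of one is consistent with the roughly doubled wall time above (0.69 s vs 0.29 s for 3 tokens).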

Note: experiments were run on CPU.
