Description
V1 Scheduler: Batch simultaneously arriving requests together
The current V1 vllm-spyre scheduler does not schedule requests that are submitted simultaneously in the same batch.
Note: The V0 vllm-spyre scheduler did schedule these requests together.
How to reproduce
Alter `examples/offline_inference_spyre.py` to use batch size 4 and submit 4 prompts simultaneously:

```python
os.environ['VLLM_SPYRE_WARMUP_BATCH_SIZES'] = '4'
...
prompts = [prompt1, prompt1, prompt1, prompt1]
```
V1 yields (`export VLLM_USE_V1=1`):
```
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s][spyre_model_runner:execute_model] t_token: 231.08ms
[spyre_model_runner:execute_model] t_token: 67.17ms
[spyre_model_runner:execute_model] t_token: 65.63ms
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
Processed prompts: 25%|████████████████████ | 1/4 [00:00<00:01, 2.75it/s, est. speed input: 134.60 toks/s, output: 8.24 toks/s][spyre_model_runner:execute_model] t_token: 171.95ms
[spyre_model_runner:execute_model] t_token: 75.90ms
[spyre_model_runner:execute_model] t_token: 62.83ms
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5.92it/s, est. speed input: 290.16 toks/s, output: 17.76 toks/s]
Time elaspsed for 3 tokens is 0.69 sec
```
V0 yields (`export VLLM_USE_V1=0`):
```
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[spyre_model_runner:execute_model] t_token: 166.63ms
[spyre_model_runner:execute_model] t_token: 57.04ms
[spyre_model_runner:execute_model] t_token: 57.95ms
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 14.03it/s, est. speed input: 687.52 toks/s, output: 42.09 toks/s]
Time elaspsed for 3 tokens is 0.29 sec
```
As the padding messages show, the V1 scheduler runs only the first sequence in the initial batch (padded with 3 empty sequences) and, once its decodes have finished, runs the remaining 3 sequences in a second batch (padded with 1 empty sequence). The V0 scheduler puts all 4 sequences into a single batch, which is the desired behavior.
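The measured wall-clock gap is consistent with V1 doing two batch passes where V0 does one. A rough cost model (a sketch, not scheduler code; the per-token latencies are read off the V0 log above, and the one-prefill-plus-decodes structure of a pass is an assumption about these runs):

```python
# Rough cost model for one batch pass: one prefill step plus one decode
# step per remaining token. Latencies are approximate t_token values
# from the V0 log (assumption: each pass costs roughly the same).
prefill_ms = 167.0
decode_ms = 57.5
tokens = 3  # tokens generated per request in this experiment

def pass_ms() -> float:
    return prefill_ms + (tokens - 1) * decode_ms

v0_total = 1 * pass_ms()  # V0: all 4 requests in a single batch pass
v1_total = 2 * pass_ms()  # V1: first request alone, then the other 3

print(f"V0 ~ {v0_total / 1000:.2f} s, V1 ~ {v1_total / 1000:.2f} s")
```

This predicts roughly 0.28 s for V0 (close to the measured 0.29 s) and twice that for V1; the measured 0.69 s for V1 is a bit higher mainly because the first pass's steps were slower (231/67/66 ms).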
Note: experiments run on CPU.