Description
V1 Scheduler: Batch simultaneously arriving requests together
The current V1 vllm-spyre scheduler does not schedule requests that are submitted simultaneously in the same batch.
Note: The V0 vllm-spyre scheduler did schedule these requests together.
How to reproduce
Alter `examples/offline_inference_spyre.py` to use batch size 4 and submit 4 prompts simultaneously:

```python
os.environ['VLLM_SPYRE_WARMUP_BATCH_SIZES'] = '4'
...
prompts = [prompt1, prompt1, prompt1, prompt1]
```
V1 yields (`export VLLM_USE_V1=1`):
```
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s][spyre_model_runner:execute_model] t_token: 231.08ms
[spyre_model_runner:execute_model] t_token: 67.17ms
[spyre_model_runner:execute_model] t_token: 65.63ms
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
Processed prompts: 25%|████████████████████ | 1/4 [00:00<00:01, 2.75it/s, est. speed input: 134.60 toks/s, output: 8.24 toks/s][spyre_model_runner:execute_model] t_token: 171.95ms
[spyre_model_runner:execute_model] t_token: 75.90ms
[spyre_model_runner:execute_model] t_token: 62.83ms
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5.92it/s, est. speed input: 290.16 toks/s, output: 17.76 toks/s]
Time elaspsed for 3 tokens is 0.69 sec
```
V0 yields (`export VLLM_USE_V1=0`):
```
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[SpyreModelRunner] INFO: Padding request of length 49 tokens to 64 tokens.
[spyre_model_runner:execute_model] t_token: 166.63ms
[spyre_model_runner:execute_model] t_token: 57.04ms
[spyre_model_runner:execute_model] t_token: 57.95ms
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 14.03it/s, est. speed input: 687.52 toks/s, output: 42.09 toks/s]
Time elaspsed for 3 tokens is 0.29 sec
```
As the padding messages show, the V1 scheduler runs only the first sequence in the initial batch (padded with 3 empty sequences) and, once its decodes have finished, runs the remaining 3 sequences in a second batch (padded with 1 empty sequence). The V0 scheduler puts all 4 sequences into a single batch, which is the desired behavior.
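The measured wall-clock gap is consistent with V1 doing two batch passes where V0 does one. A rough cost model (a sketch, not scheduler code; the per-token latencies are read off the V0 log above, and the one-prefill-plus-decodes structure of a pass is an assumption about these runs):

```python
# Rough cost model for one batch pass: one prefill step plus one decode
# step per remaining token. Latencies are approximate t_token values
# from the V0 log (assumption: each pass costs roughly the same).
prefill_ms = 167.0
decode_ms = 57.5
tokens = 3  # tokens generated per request in this experiment

def pass_ms() -> float:
    return prefill_ms + (tokens - 1) * decode_ms

v0_total = 1 * pass_ms()  # V0: all 4 requests in a single batch pass
v1_total = 2 * pass_ms()  # V1: first request alone, then the other 3

print(f"V0 ~ {v0_total / 1000:.2f} s, V1 ~ {v1_total / 1000:.2f} s")
```

This predicts roughly 0.28 s for V0 (close to the measured 0.29 s) and twice that for V1; the measured 0.69 s for V1 is a bit higher mainly because the first pass's steps were slower (231/67/66 ms).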
Note: experiments run on CPU.