[Misc] SLO-aware router with profile support #1192

Open
wants to merge 88 commits into
base: main
88 commits
5652f12
Prepare profile generator for router.
Feb 13, 2025
e3e4d4f
Merge branch 'main' into jingyuan/load_aware_routing
Feb 26, 2025
dc7c12c
Finish cache refactor and load utilization accounting
Mar 6, 2025
02b8392
Fix the missing part: trigger scheduling on there is spare capacity.
Mar 7, 2025
91d401e
Merge commit '9990ab80b86467e33eb6b1a12042ff73269a8ffa' into feature/…
Mar 7, 2025
9f7d77c
Merge commit 'd630ffd99dbf1ae9bc4eb21c55fcf395a681b8a7' into feature/…
Mar 14, 2025
d00c6b6
Cache and router refactoring for stateful router.
Mar 16, 2025
c48804f
remove unused code
Mar 17, 2025
f2bfc7d
Pass existing tests.
Mar 17, 2025
8cebab0
Remove queue router
Mar 17, 2025
9d77b54
Add tests for pod model relationship and minor fixes.
Mar 19, 2025
fe2fb18
remove unused file in this refactor
Mar 19, 2025
632966b
Merge branch 'main' into feature/cache_router_refactor
zhangjyr Mar 19, 2025
57f8a03
Bug fix
Mar 19, 2025
2a6abc5
Lint fix
Mar 19, 2025
8d787d9
Bug fix
Mar 19, 2025
b43d3b0
Bug fix and remove unnecessary log.
Mar 19, 2025
323acf2
Bug fix
Mar 20, 2025
618b248
Merge branch 'feature/cache_router_refactor' into feature/load_aware_…
Mar 20, 2025
0999e1f
Add more tests for basic classes.
Mar 20, 2025
35ec945
Merge branch 'feature/cache_router_refactor' into feature/load_aware_…
Mar 20, 2025
2c047f5
Rebase: Cache and Router refactoring for concurrent performance, conc…
Mar 24, 2025
cea9419
bug ifx
Mar 24, 2025
093439e
Merge branch 'feature/cache_router_refactor' into feature/load_aware_…
Mar 24, 2025
97ed8c6
Rename TraceCache to RequestTracker
Mar 25, 2025
77cc884
Bug fix: concurrent registry array update
Mar 26, 2025
313e73e
Merge branch 'feature/cache_router_refactor' into feature/load_aware_…
Mar 26, 2025
4d7347c
Rebase: Cache and Router refactoring for concurrent performance, conc…
Mar 24, 2025
e721e29
Merge branch 'feature/cache_router_refactor' into feature/load_aware_…
Mar 28, 2025
ddc8d33
Merge commit 'cd9da485c6b2aa0ec8562511ee116b9cffaecb5f' into feature/…
Mar 31, 2025
880448f
Add test cases for output predictor
Apr 1, 2025
b19430e
Add thread safety to simple queue and corresponding tests.
Apr 3, 2025
0820696
Add router fallback mechanism and SLO router will fallback to leastRe…
Apr 7, 2025
4e78c55
AddRequestCount now support idempotency and will be called during rou…
Apr 7, 2025
1476244
Merge commit 'c94029bfede7afb57ee15ba7649313b274dbb32d' into feature/…
Apr 8, 2025
3cc6269
Merge commit '79bb224a474cafa1992716d60977350a81dadb47' into feature/…
Apr 10, 2025
07ba0d0
lint-fx
Apr 10, 2025
f592133
Add unit test and handle legacy profiles that has no SLO info.
May 19, 2025
ff33ab0
Merge commit 'f888028288b2e1f0ffc704f5e2f7e29d3f60fd5c' into feature/…
May 19, 2025
b4d94d7
Merge branch 'main' into feature/load_aware_routing
May 19, 2025
c48e4eb
Add more unit tests
May 20, 2025
088cf16
Added packed slo routing test cases
May 20, 2025
9d20c8e
Add more unit test to make sure slo router behave correctly
May 22, 2025
bbdf5eb
Improve routing result according to simulation results.
May 28, 2025
bcf362f
Merge branch 'main' into feature/load_aware_routing
May 29, 2025
1c076cf
Added vke-dev verison of gateway_plugins
May 29, 2025
da14870
Add support for routing-strategy to benchmark
May 29, 2025
dda2df3
Fix profile generator to contain metrics disregard SLO filter.
Jun 2, 2025
2284339
Ensure any non temporary routing error triggered from slo_queue will …
Jun 2, 2025
2496d22
Merge commit '10a52379f6b5af593907ab272191fcdfe97701d5' into feature/…
Jun 4, 2025
0d0574b
bug fix
Jun 4, 2025
610c1aa
Improve router initialization timeline
Jun 4, 2025
8f26347
Improve log
Jun 4, 2025
a1690ea
Improve log
Jun 5, 2025
66aff7c
Bug fix
Jun 5, 2025
ff1403a
typo fix
Jun 5, 2025
1467ed8
typo fix
Jun 5, 2025
fb4ba2a
Improve logs
Jun 6, 2025
3ef9d44
Remove hard coded pod keys.
Jun 6, 2025
b10ef22
Bug fix in pod_array in heterogenous settings.
Jun 9, 2025
2ff8afc
Add support for variances of slo router.
Jun 11, 2025
455aa7e
AddRequestCount can be move a little earlier before request enqueuing…
Jun 11, 2025
cf4a1f0
Lint fix
Jun 11, 2025
872cf5d
Bug fix in simple queue.
Jun 12, 2025
daf3890
Merge branch 'main' into feature/load_aware_routing
Jun 12, 2025
75b8459
remove unused file
Jun 12, 2025
3f92bb7
Disable unrelated modification
Jun 12, 2025
38ed814
Change default SLO router
Jun 12, 2025
a931453
Merge branch 'main' into feature/load_aware_routing
zhangjyr Jun 12, 2025
db1ab57
Python lint
Jun 12, 2025
e19c139
Fix unit tests
Jun 13, 2025
225bac8
Improve for race test.
Jun 13, 2025
6822b84
Merge branch 'main' into feature/load_aware_routing
zhangjyr Jun 13, 2025
3270ca8
Bug fix
Jun 13, 2025
18562f2
Disable simple_queue_test in race test.
Jun 13, 2025
449bf19
Disable race tests.
Jun 14, 2025
81d1cbc
Disable race tests.
Jun 14, 2025
cb00177
Merge branch 'main' into feature/load_aware_routing
zhangjyr Jun 14, 2025
63a03af
Merge commit '6cd953f203f631b6db86d79fb1bd7064cbf1f668' into feature/…
Jun 16, 2025
57b8c77
Make gpu benchmark support both jsonl and json
Jun 18, 2025
54dfa61
Bug fix and lint fix
Jun 18, 2025
7a1cbe6
Support workload from new generator
Jun 18, 2025
f2faf80
Bug fix
Jun 18, 2025
7cddb5a
Bug fix
Jun 18, 2025
a2c4bb8
Fix interval degradation
Jun 18, 2025
2de4a84
Lint fix
Jun 18, 2025
e12fa29
Merge branch 'main' into feature/load_aware_routing
zhangjyr Jun 20, 2025
6148242
Introduced types.QueueRouter interface for exposing queue status.
Jun 20, 2025
3 changes: 2 additions & 1 deletion cmd/plugins/main.go
@@ -34,6 +34,7 @@ import (
 	extProcPb "github.com/envoyproxy/go-control-plane/envoy/service/ext_proc/v3"
 	"github.com/vllm-project/aibrix/pkg/cache"
 	"github.com/vllm-project/aibrix/pkg/plugins/gateway"
+	routing "github.com/vllm-project/aibrix/pkg/plugins/gateway/algorithms"
 	"github.com/vllm-project/aibrix/pkg/utils"
 	"google.golang.org/grpc/health"
 	healthPb "google.golang.org/grpc/health/grpc_health_v1"
@@ -77,7 +78,7 @@ func main() {
 		panic(err)
 	}

-	cache.InitForGateway(config, stopCh, redisClient)
+	cache.InitForGateway(config, stopCh, redisClient, routing.ModelRouterFactory)

 	// Connect to K8s cluster
 	k8sClient, err := kubernetes.NewForConfig(config)
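
The new fourth argument hands a router factory to the cache at initialization time. The following is a minimal sketch of that injection pattern only. The names Router, RouterFactory, Store, NewStore, and fixedRouter are hypothetical and are not the AIBrix API, and the motivation suggested in the comments (keeping the cache package from importing the routing package) is an assumption, not something the diff states.

```go
package main

import "fmt"

// Router is a hypothetical stand-in for a per-model router.
type Router interface {
	Route(requestID string) string
}

// RouterFactory is a hypothetical factory type. The store only depends on
// this function type, so it does not need to import a routing package.
type RouterFactory func(modelName string) (Router, error)

// Store is a hypothetical cache that builds routers on demand.
type Store struct {
	factory RouterFactory
	routers map[string]Router
}

// NewStore receives the factory once, at initialization.
func NewStore(factory RouterFactory) *Store {
	return &Store{factory: factory, routers: make(map[string]Router)}
}

// RouterFor returns the router for a model, creating it on first use.
// This sketch is not safe for concurrent use; a real cache would add locking.
func (s *Store) RouterFor(modelName string) (Router, error) {
	if r, ok := s.routers[modelName]; ok {
		return r, nil
	}
	r, err := s.factory(modelName)
	if err != nil {
		return nil, err
	}
	s.routers[modelName] = r
	return r, nil
}

// fixedRouter always routes to the same pod; it exists only for the demo.
type fixedRouter struct{ pod string }

func (f fixedRouter) Route(string) string { return f.pod }

func main() {
	store := NewStore(func(modelName string) (Router, error) {
		return fixedRouter{pod: modelName + "-pod-0"}, nil
	})
	r, err := store.RouterFor("llama2-7b")
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Route("req-1"))
}
```

Passing a function value keeps the dependency one-directional: the caller in main wires the two packages together, and the store never names a concrete router type.
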
4 changes: 2 additions & 2 deletions config/gateway/gateway-plugin/gateway-plugin.yaml
@@ -2,7 +2,7 @@ apiVersion: v1
 kind: Service
 metadata:
   name: gateway-plugins
-  namespace: aibrix-system
+  namespace: system
Collaborator

system will eventually be overwritten to aibrix-system here, right? Is the change to system meant to align with the default setting?

Collaborator Author

Yes, all other configurations use system instead of aibrix-system. This just keeps things consistent here.

 spec:
   selector:
     app: gateway-plugins
@@ -20,7 +20,7 @@ apiVersion: apps/v1
 kind: Deployment
 metadata:
   name: gateway-plugins
-  namespace: aibrix-system
+  namespace: system
 spec:
   strategy:
     type: RollingUpdate
38 changes: 38 additions & 0 deletions config/overlays/dev/gateway-plugin/kustomization.yaml
@@ -0,0 +1,38 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+namespace: aibrix-system
+
+namePrefix: aibrix-
+
+resources:
+- ../../../gateway/gateway-plugin
+
+images:
+- name: gateway-plugins
+  newName: aibrix/gateway-plugins
+  newTag: nightly
+
+patches:
+- patch: |- # Use the '|' and '-' for inline patching
+    apiVersion: apps/v1
+    kind: Deployment
+    metadata:
+      name: gateway-plugins
+    spec:
+      template:
+        spec:
+          containers:
+            - name: gateway-plugin
+              args:
+              - -v=5
+              env:
+              - name: AIBRIX_POD_METRIC_REFRESH_INTERVAL_MS
+                value: "60000"
+              - name: AIBRIX_GPU_OPTIMIZER_TRACING_FLAG
+                value: "true"
+  target:
+    kind: Deployment
+    name: gateway-plugins
+    namespace: system
+    version: v1
4 changes: 1 addition & 3 deletions config/overlays/dev/manager/kustomization.yaml
@@ -28,6 +28,4 @@ patches:
     kind: Deployment
     name: controller-manager
     namespace: system
-    version: v1
-
-apiVersion: kustomize.config.k8s.io/v1beta1
+    version: v1
39 changes: 39 additions & 0 deletions config/overlays/vke-dev/gateway-plugin/gateway_plugins_patch.yaml
@@ -0,0 +1,39 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: gateway-plugins
+  namespace: aibrix-system
+spec:
+  replicas: 1
+  template:
+    spec:
+      affinity:
+        nodeAffinity: # prevent the gateway pod from being placed on a GPU node.
+          preferredDuringSchedulingIgnoredDuringExecution:
+            - weight: 100
+              preference:
+                matchExpressions:
+                  - key: vke.node.gpu.schedule
+                    operator: NotIn
+                    values:
+                      - nvidia
+      containers:
+        - name: gateway-plugin
+          resources:
+            limits:
+              cpu: "2"
+              memory: 8Gi
+            requests:
+              cpu: "2"
+              memory: 8Gi
+          env:
+            - name: AIBRIX_PREFIX_CACHE_TOKENIZER_TYPE
+              value: "character"
+            - name: AIBRIX_PREFIX_CACHE_BLOCK_SIZE
+              value: "128"
+            - name: AIBRIX_PREFIX_CACHE_BLOCK_NUMBER
+              value: "200000"
+            - name: AIBRIX_PREFIX_CACHE_POD_RUNNING_REQUEST_IMBALANCE_ABS_COUNT
+              value: "16"
+            - name: AIBRIX_PREFIX_CACHE_STANDARD_DEVIATION_FACTOR
+              value: "2"
11 changes: 5 additions & 6 deletions config/overlays/vke-dev/gateway-plugin/kustomization.yaml
@@ -1,17 +1,16 @@
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization

-namespace: aibrix-system
-
-namePrefix: aibrix-
-
 resources:
-- ../../../gateway/gateway-plugin
+- ../../dev/gateway-plugin

+patches:
+- path: gateway_plugins_patch.yaml
+
 images:
 - name: busybox
   newName: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/busybox
   newTag: stable
-- name: gateway-plugins
+- name: aibrix/gateway-plugins
   newName: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/gateway-plugins
   newTag: nightly
12 changes: 12 additions & 0 deletions development/app/Makefile
@@ -113,5 +113,17 @@ test-gateway2:
 		"max_tokens": 512 \
 	}'

+test-router:
+	curl -v http://localhost:8888/v1/chat/completions \
+	-H "model: llama2-7b" \
+	-H "Content-Type: application/json" \
+	-H "Authorization: Bearer any_key" \
+	-H "routing-strategy: slo" \
+	-d '{ \
+		"model": "llama2-7b", \
+		"messages": [{"role": "user", "content": "Say this is a test!"}], \
+		"temperature": 0.7 \
+	}'
+
 metrics:
 	curl http://localhost:8000/metrics
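
The test-router target selects the router through the routing-strategy request header. For callers that are not using curl, the same request could be built in Go roughly as follows. This is a sketch: the helper name newRouterTestRequest is hypothetical, the URL, headers, and body are copied from the Makefile target, and the request is printed rather than sent so the example runs without a gateway.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// newRouterTestRequest (hypothetical helper) builds the same request as the
// test-router Makefile target. baseURL and strategy are parameters so the
// request can target another gateway address or routing strategy.
func newRouterTestRequest(baseURL, strategy string) (*http.Request, error) {
	body := []byte(`{
		"model": "llama2-7b",
		"messages": [{"role": "user", "content": "Say this is a test!"}],
		"temperature": 0.7
	}`)
	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("model", "llama2-7b")
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer any_key")
	req.Header.Set("routing-strategy", strategy)
	return req, nil
}

func main() {
	req, err := newRouterTestRequest("http://localhost:8888", "slo")
	if err != nil {
		panic(err)
	}
	// Print instead of sending, so the sketch runs without a live gateway.
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("routing-strategy"))
}
```

To actually send it, pass the request to http.DefaultClient.Do and close the response body when done.
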
31 changes: 28 additions & 3 deletions pkg/cache/cache_api.go
@@ -28,6 +28,9 @@ type Cache interface {
 	ModelCache
 	MetricCache
 	RequestTracker
+	ProfileCache
+	types.OutputPredictorProvider
+	types.RouterProvider
 }

 // PodCache defines operations for pod information caching
@@ -106,7 +109,10 @@ type MetricCache interface {

 // RequestTracker defines operations for track workload statistics
 type RequestTracker interface {
-	// AddRequestCount starts tracking request count
+	// AddRequestCount tracks the start of a request after routing.
+	// To support realtime statistics update and access, AddRequestCount can be called multiple times for a request.
+	// As a result, implementations should ensure thread-safe access to the counter and idempotency.
+	//
 	// Parameters:
 	//   ctx: Routing context
 	//   requestID: Unique request identifier
@@ -115,14 +121,18 @@ type RequestTracker interface {
 	//   int64: Trace term identifier
 	AddRequestCount(ctx *types.RoutingContext, requestID string, modelName string) (traceTerm int64)

-	// DoneRequestCount completes request count tracking, only one DoneRequestXXX should be called for a request
+	// DoneRequestCount tracks the completion of a request without usage information like inputTokens and outputTokens.
+	// Only one DoneRequestXXX should be called for a request. Idempotency is not required.
+	//
 	// Parameters:
 	//   requestID: Unique request identifier
 	//   modelName: Name of the model
 	//   traceTerm: Trace term identifier
 	DoneRequestCount(ctx *types.RoutingContext, requestID string, modelName string, traceTerm int64)

-	// DoneRequestTrace completes request tracing, only one DoneRequestXXX should be called for a request
+	// DoneRequestTrace tracks the completion of a request with usage information like inputTokens and outputTokens.
+	// Only one DoneRequestXXX should be called for a request. Idempotency is not required.
+	//
 	// Parameters:
 	//   ctx: Routing context
 	//   requestID: Unique request identifier
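
The AddRequestCount contract above (callable several times per request, thread-safe, idempotent) can be met in more than one way. The following is a minimal sketch of one option, a mutex-guarded set of request IDs. The requestCounter type and its methods are hypothetical and are not the AIBrix implementation; they also ignore the model name and trace term that the real interface carries.

```go
package main

import (
	"fmt"
	"sync"
)

// requestCounter is a hypothetical tracker of in-flight requests.
type requestCounter struct {
	mu      sync.Mutex
	seen    map[string]struct{} // request IDs that have already been counted
	running int64
}

func newRequestCounter() *requestCounter {
	return &requestCounter{seen: make(map[string]struct{})}
}

// Add counts a request at most once, however many times it is called for the
// same requestID. It reports whether this call performed the counting.
func (c *requestCounter) Add(requestID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[requestID]; ok {
		return false
	}
	c.seen[requestID] = struct{}{}
	c.running++
	return true
}

// Done releases a counted request. Unknown IDs are ignored.
func (c *requestCounter) Done(requestID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[requestID]; !ok {
		return
	}
	delete(c.seen, requestID)
	c.running--
}

// Running returns the number of requests counted and not yet released.
func (c *requestCounter) Running() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.running
}

func main() {
	c := newRequestCounter()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Add("req-1") // concurrent duplicate calls for one request
		}()
	}
	wg.Wait()
	fmt.Println(c.Running()) // 1
	c.Done("req-1")
	fmt.Println(c.Running()) // 0
}
```

A single mutex is the simplest choice; a sharded map or sync.Map would reduce contention under heavy load at the cost of more code.
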
@@ -132,3 +142,18 @@ type RequestTracker interface {
 	//   traceTerm: Trace term identifier
 	DoneRequestTrace(ctx *types.RoutingContext, requestID string, modelName string, inputTokens, outputTokens, traceTerm int64)
 }
+
+// ProfileCache defines operations for model profiles
+type ProfileCache interface {
+	// GetModelProfileByPod gets model profile for a pod
+	// Parameters:
+	//   pod: Pod object
+	//   modelName: Name of the model
+	GetModelProfileByPod(pod *v1.Pod, modelName string) (*ModelGPUProfile, error)
+
+	// GetModelProfileByDeploymentName gets model profile for a deployment
+	// Parameters:
+	//   deploymentName: Name of the deployment
+	//   modelName: Name of the model
+	GetModelProfileByDeploymentName(deploymentName string, modelName string) (*ModelGPUProfile, error)
Collaborator

TODO: we may use other objects to orchestrate pods in the future, in which case the deployment may be replaced. This looks good for now.

One more problem: a deployment name without a namespace cannot uniquely identify a deployment. We need to add the namespace field.

Collaborator Author

If deployments are replaced by other objects, the GPU optimizer would have to change as well (it monitors deployments only). For Ray cluster support, let me keep a note, leave this comment open, and file an issue after merging.

Can you explain the cases where "deployment without namespace can not be used to identify a deployment"?
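
One such case: Kubernetes only requires a Deployment name to be unique within its namespace, so the same name can exist in two namespaces. A minimal sketch of the resulting collision follows. The profileKey helper, the namespace/name/model key layout, and the team-a and team-b namespaces are all hypothetical and are not taken from the PR.

```go
package main

import "fmt"

// profileKey is a hypothetical helper that qualifies a deployment name with
// its namespace, following the common "namespace/name" convention.
func profileKey(namespace, deploymentName, modelName string) string {
	return namespace + "/" + deploymentName + "/" + modelName
}

func main() {
	// Keyed by deployment name only: the second write replaces the first.
	byName := map[string]string{}
	byName["llama2-7b-deploy"] = "profile from namespace team-a"
	byName["llama2-7b-deploy"] = "profile from namespace team-b"
	fmt.Println(len(byName)) // 1

	// Keyed by namespace and name: both profiles are kept.
	byKey := map[string]string{}
	byKey[profileKey("team-a", "llama2-7b-deploy", "llama2-7b")] = "profile from namespace team-a"
	byKey[profileKey("team-b", "llama2-7b-deploy", "llama2-7b")] = "profile from namespace team-b"
	fmt.Println(len(byKey)) // 2
}
```
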

+}