[Misc] SLO-aware router with profile support #1192
base: main
Conversation
…load_aware_routing # Conflicts: # pkg/cache/cache.go # pkg/plugins/gateway/gateway.go # pkg/types/router.go
…load_aware_routing # Conflicts: # pkg/cache/cache.go # pkg/plugins/gateway/algorithms/least_busy_time.go # pkg/plugins/gateway/algorithms/least_kv_cache.go # pkg/plugins/gateway/algorithms/least_latency.go # pkg/plugins/gateway/algorithms/least_request.go # pkg/plugins/gateway/algorithms/prefix_cache.go # pkg/plugins/gateway/algorithms/prefix_cache_and_load.go # pkg/plugins/gateway/algorithms/random.go # pkg/plugins/gateway/algorithms/router.go # pkg/plugins/gateway/algorithms/router_test.go # pkg/plugins/gateway/algorithms/throughput.go # pkg/plugins/gateway/gateway.go # pkg/plugins/gateway/gateway_req_body.go
Add random routing policy to e2e test.
Signed-off-by: Jingyuan <[email protected]>
…routing # Conflicts: # cmd/plugins/main.go # pkg/cache/cache.go # pkg/cache/cache_test.go # pkg/cache/model.go # pkg/cache/pod.go # pkg/plugins/gateway/algorithms/prefix_cache.go # pkg/plugins/gateway/algorithms/prefix_cache_test.go # pkg/plugins/gateway/algorithms/router_test.go # pkg/plugins/gateway/gateway_req_body.go # pkg/plugins/gateway/gateway_rsp_body.go # pkg/types/router_context.go
…urrent safety and stateful routing Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
…routing Signed-off-by: Jingyuan Zhang <[email protected]> # Conflicts: # cmd/controllers/main.go # cmd/metadata/main.go # cmd/plugins/main.go # pkg/cache/cache.go # pkg/cache/cache_test.go # pkg/cache/model.go # pkg/cache/pod.go # pkg/cache/registry_test.go # pkg/plugins/gateway/algorithms/least_busy_time.go # pkg/plugins/gateway/algorithms/least_kv_cache.go # pkg/plugins/gateway/algorithms/least_latency.go # pkg/plugins/gateway/algorithms/least_request.go # pkg/plugins/gateway/algorithms/prefix_cache_test.go # pkg/plugins/gateway/algorithms/router.go # pkg/plugins/gateway/algorithms/router_test.go # pkg/plugins/gateway/algorithms/throughput.go # pkg/plugins/gateway/gateway.go # pkg/plugins/gateway/gateway_req_body.go # pkg/plugins/gateway/gateway_rsp_body.go # pkg/types/router.go # pkg/types/router_context.go
Add PodList interface to replace utils.PodArray. Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
…routing Signed-off-by: Jingyuan Zhang <[email protected]> # Conflicts: # pkg/cache/cache_api.go # pkg/cache/cache_init.go
…urrent safety and stateful routing Signed-off-by: Jingyuan Zhang <[email protected]>
…routing Signed-off-by: Jingyuan Zhang <[email protected]> # Conflicts: # cmd/plugins/main.go # pkg/cache/cache_api.go # pkg/cache/cache_impl.go # pkg/cache/cache_init.go # pkg/cache/cache_test.go # pkg/cache/cache_trace.go # pkg/cache/informers.go # pkg/cache/model.go # pkg/cache/pod.go # pkg/cache/trace.go # pkg/plugins/gateway/algorithms/least_busy_time.go # pkg/plugins/gateway/algorithms/least_kv_cache.go # pkg/plugins/gateway/algorithms/least_latency.go # pkg/plugins/gateway/algorithms/least_request.go # pkg/plugins/gateway/algorithms/prefix_cache.go # pkg/plugins/gateway/algorithms/prefix_cache_and_load.go # pkg/plugins/gateway/algorithms/prefix_cache_test.go # pkg/plugins/gateway/algorithms/random.go # pkg/plugins/gateway/algorithms/router.go # pkg/plugins/gateway/algorithms/router_test.go # pkg/plugins/gateway/algorithms/throughput.go # pkg/plugins/gateway/gateway.go # pkg/plugins/gateway/gateway_req_body.go # pkg/plugins/gateway/gateway_req_headers.go # pkg/plugins/gateway/gateway_rsp_body.go # pkg/types/router.go # pkg/types/router_context.go # pkg/utils/pod.go # test/e2e/routing_strategy_test.go
…load_aware_routing Signed-off-by: Jingyuan Zhang <[email protected]> # Conflicts: # cmd/plugins/main.go # pkg/cache/cache_api.go # pkg/cache/cache_impl.go # pkg/cache/cache_init.go # pkg/cache/cache_test.go # pkg/cache/cache_trace.go # pkg/cache/informers.go # pkg/cache/model.go # pkg/cache/pod.go # pkg/cache/trace.go # pkg/metrics/metrics.go # pkg/plugins/gateway/algorithms/prefix_cache_test.go # pkg/plugins/gateway/algorithms/router_test.go # pkg/plugins/gateway/gateway_req_body.go # pkg/plugins/gateway/gateway_rsp_body.go # pkg/types/router_context.go # pkg/types/router_context_test.go # pkg/utils/sync_map.go
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
I notice there are some refactoring changes (e.g. internal interface changes). Technically, those affect other areas; could they go in separate changes? I mean splitting the work into common parts (which stakeholders need to review) and SLO-specific changes (where review can be looser and the feature can be protected by a feature gate). If the split is too complicated, we can do a first round of review and then decide how to move forward.
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
@Jeffwan, I think the only internal interface change is Select(). The function is called in only one place, and if you find the change inappropriate, we can restore it.
…load_aware_routing Signed-off-by: Jingyuan Zhang <[email protected]> # Conflicts: # pkg/plugins/gateway/algorithms/prefix_cache_preble.go
@@ -77,7 +78,7 @@ func main() {
 		panic(err)
 	}

-	cache.InitForGateway(config, stopCh, redisClient)
+	cache.InitForGateway(config, stopCh, redisClient, routing.NewSLORouter)
Is there a better way to handle this case? routing.NewSLORouter is a specific solution, but cmd/plugins/main.go serves a common purpose. Can we have a factory for this kind of initialization?
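One way to read the factory suggestion is a small registry that maps a policy name to a constructor, so that common entry points never name a specific router. The sketch below is only an illustration under assumptions: the `Router` interface, `RouterFactory` type, `Register`, `New`, and the `"slo"` policy name are hypothetical and are not taken from the PR or from the project's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// Router is a hypothetical stand-in for the gateway's routing interface.
type Router interface {
	Name() string
}

// RouterFactory builds a Router. The real constructor signature may differ.
type RouterFactory func() (Router, error)

var (
	mu        sync.RWMutex
	factories = map[string]RouterFactory{}
)

// Register associates a policy name with its factory.
func Register(name string, f RouterFactory) {
	mu.Lock()
	defer mu.Unlock()
	factories[name] = f
}

// New looks up the factory for a policy name and builds the router.
func New(name string) (Router, error) {
	mu.RLock()
	f, ok := factories[name]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown routing policy %q", name)
	}
	return f()
}

// sloRouter is a placeholder for an SLO-specific router implementation.
type sloRouter struct{}

func (sloRouter) Name() string { return "slo" }

func init() {
	// Each policy registers itself; main never references it directly.
	Register("slo", func() (Router, error) { return sloRouter{}, nil })
}

func main() {
	r, err := New("slo")
	if err != nil {
		panic(err)
	}
	fmt.Println(r.Name())
}
```

With a registry like this, the common initialization path would take a policy name (for example from configuration) instead of a concrete constructor.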
@@ -2,7 +2,7 @@ apiVersion: v1
 kind: Service
 metadata:
   name: gateway-plugins
-  namespace: aibrix-system
+  namespace: system
system will eventually be overwritten to aibrix-system here, right? Is the change to system meant to align with the default setting?
// Parameters:
//   deploymentName: Name of the deployment
//   modelName: Name of the model
GetModelProfileByDeploymentName(deploymentName string, modelName string) (*ModelGPUProfile, error)
TODO: We may use other objects to orchestrate pods in the future, in which case the deployment might change. This looks good for now.
One more problem: a deployment name without a namespace cannot uniquely identify a deployment. We need to add the namespace field.
break
// Current implementation assumes AddRequestCount() will not be called concurrently.
// TODO: Implment "wait for trace term" logic if AddRequestCount() is called concurrently.
if ctx == nil || ctx.CanAddTrace() {
Is this refactor for the common case?
metaModels utils.SyncMap[string, *Model] // model_name -> *Model

// Deploymnent related storage
deploymentProfiles utils.SyncMap[string, *ModelGPUProfile] // deployment_name -> *ModelGPUProfile
Same here: we can use namespace/deployment as the key.
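The namespace/deployment key suggested here could be built and parsed with two small helpers. This is a minimal sketch, not code from the PR: `deploymentKey`, `splitDeploymentKey`, and the example names are hypothetical. It relies on the fact that Kubernetes namespace and object names cannot contain `/`, so the separator is unambiguous.

```go
package main

import (
	"fmt"
	"strings"
)

// deploymentKey builds a cache key that is unique across namespaces.
func deploymentKey(namespace, name string) string {
	return namespace + "/" + name
}

// splitDeploymentKey reverses deploymentKey and rejects malformed keys.
func splitDeploymentKey(key string) (string, string, error) {
	ns, name, ok := strings.Cut(key, "/")
	if !ok || ns == "" || name == "" {
		return "", "", fmt.Errorf("invalid deployment key %q", key)
	}
	return ns, name, nil
}

func main() {
	// Two deployments with the same name in different namespaces
	// no longer collide in a map keyed this way.
	fmt.Println(deploymentKey("team-a", "llama"))
	fmt.Println(deploymentKey("team-b", "llama"))
}
```

A map such as deploymentProfiles would then be keyed by the combined string rather than the bare deployment name.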
@@ -98,6 +97,7 @@ func (c *Store) addPod(obj interface{}) {
 	// only track pods with model deployments
 	modelName, ok := pod.Labels[modelIdentifier]
 	if !ok {
+		// klog.InfoS("ignored pod without model label", "name", pod.Name)
Use a log level instead of commenting the line out?
		}
	}
}
q.queue, q.baseCursor = newQueue, q.baseCursor+dequeuePos
Could it be a problem if another goroutine invokes physicalPosRLocked? Can we introduce something like the following, use it in physicalPosRLocked, and use a matching setBaseCursor in expand?
func (q *SimpleQueue[V]) getBaseCursor() int64 {
	return atomic.LoadInt64(&q.baseCursor)
}
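To make the suggestion concrete, here is a minimal sketch of the getter together with a matching setter. Only `getBaseCursor` and the `baseCursor` field come from the review comment; the struct layout, `setBaseCursor` body, and `physicalPos` are assumptions made for illustration and may not match the PR's `SimpleQueue`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// SimpleQueue is a reduced stand-in for the queue under review.
type SimpleQueue[V any] struct {
	// baseCursor is kept first so it stays 64-bit aligned for
	// sync/atomic on 32-bit platforms.
	baseCursor int64
	queue      []V
}

// getBaseCursor reads the cursor without tearing.
func (q *SimpleQueue[V]) getBaseCursor() int64 {
	return atomic.LoadInt64(&q.baseCursor)
}

// setBaseCursor publishes a new cursor; expand would call this
// instead of assigning the field directly.
func (q *SimpleQueue[V]) setBaseCursor(c int64) {
	atomic.StoreInt64(&q.baseCursor, c)
}

// physicalPos maps a logical cursor to an index into queue
// (hypothetical helper mirroring physicalPosRLocked).
func (q *SimpleQueue[V]) physicalPos(cursor int64) int64 {
	return cursor - q.getBaseCursor()
}

func main() {
	q := &SimpleQueue[string]{queue: []string{"a", "b", "c"}}
	q.setBaseCursor(10)
	fmt.Println(q.physicalPos(12)) // logical cursor 12 maps to index 2
}
```

Note that atomics only make each cursor read and write consistent on its own. If the slice and the cursor must change together, as in the assignment above, a reader could still pair a new cursor with an old slice unless a lock covers both.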
	atomic.AddInt32(&hist.size, -hist.Tail().getSkipped())
}

func NewSimmpleOutputPredictor(maxInputTokens, maxOutputTokens int, window time.Duration) *SimmpleOutputPredictor {
Let's briefly describe the algorithm here, as a comment?
debugDelay time.Duration
tokens     []int
predictor  OutputPredictor
One of my concerns is which fields can be used by routing algorithms that run with profiles disabled. As a routing algorithm developer, which fields should I expect to be available when I enable or disable certain features?
	return
}

func (q *SLOQueue) higherRank(rank1 float64, rank2 float64) float64 {
Returning a bool directly looks simpler.
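As a rough illustration of the bool-returning form, a comparator could look like the sketch below. The ordering is an assumption: the excerpt does not show what the `float64` result of `higherRank` means or whether a larger rank value is the better one, so here a larger value is simply treated as the higher rank, and the function is written as a free function rather than a method on `SLOQueue`.

```go
package main

import "fmt"

// higherRank reports whether rank1 outranks rank2.
// Assumption: a larger value means a higher rank; the PR may
// define the ordering differently.
func higherRank(rank1, rank2 float64) bool {
	return rank1 > rank2
}

func main() {
	if higherRank(0.8, 0.3) {
		fmt.Println("first request outranks the second")
	}
}
```

A bool result lets callers branch directly on the comparison instead of interpreting a numeric return value.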
queueOverallSLO          bool = false
monogenousGPURouting     bool = true
monogenousGPURoutingOnly bool = monogenousGPURouting && false
initialTotalSubQueues    int  = 8 // Expect no more than 8 subqueues
Are these magic numbers constants, or should they be adjusted based on the available resources?
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Pull Request Description
This PR introduces an SLO-aware router with profile support. It adds three new SLO-aware routing policies:
All three routing policies prioritize requests with a profiled SLO target.
In addition to the slo-family routing policies, this PR adds built-in queues to support request reordering and future delayed scheduling. In particular, QueueRouter enables pull mode within the gateway. Below is a comparison of pull mode and the default push mode:
With profile support, the gateway now has server capacity knowledge and can achieve pull mode within the gateway.
Additional feature added in this PR:
Preliminary results show that the SLO policy can meet the SLO target for a composite workload on heterogeneous GPUs:

Workload: mixed sharegpt and bird workload with a ratio of 7:4
GPU: 1 A10, 4 L20
SLO: latency per token of 0.05 s
Related Issues
Resolves: #642 #606
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines (Expand for Details)
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
[Bug]: Corrections to existing functionality
[CI]: Changes to build process or CI pipeline
[Docs]: Updates or additions to documentation
[API]: Modifications to aibrix's API or interface
[CLI]: Changes or additions to the Command Line Interface
[Misc]: For changes not covered above (use sparingly)
Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.