
[Misc] SLO-aware router with profile support #1192


Status: Open, wants to merge 86 commits into base: main

Conversation

zhangjyr (Collaborator) commented Jun 12, 2025

Pull Request Description

Introducing SLO-aware router with profile support. This PR introduces three new SLO-aware routing policies:

  1. slo (or slo-least-load-pulling)
  2. slo-least-load
  3. slo-pack-load
All three routing policies prioritize requests that have a profiled SLO target.

In addition to the slo-family routing policies, this PR adds built-in queues to support request reordering and future delay scheduling. In particular, QueueRouter enables pull mode within the gateway. Below is a comparison of pull mode and the default push mode:

  1. Push mode: The router dispatches requests to the server, possibly overloading the server.
  2. Pull mode: The server pulls requests from the router based on the server's capacity.
With profile support, the gateway now knows each server's capacity and can implement pull mode within the gateway.
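
To make the push/pull distinction concrete, here is a minimal sketch of pull mode. It is not the PR's QueueRouter: the `server` and `pullQueue` types, their fields, and the idea of expressing profiled capacity as a maximum number of in-flight requests are all assumptions made for illustration.

```go
package main

import "fmt"

// server is a hypothetical backend whose capacity (maximum number of
// in-flight requests) is assumed to come from a performance profile.
type server struct {
	name     string
	capacity int
	inflight int
}

// pullQueue is a hypothetical gateway-side FIFO queue of pending request IDs.
type pullQueue struct {
	pending []string
}

// enqueue holds a request at the gateway instead of pushing it to a server.
func (q *pullQueue) enqueue(id string) { q.pending = append(q.pending, id) }

// pull hands requests to a server only while it has spare capacity and
// returns the IDs that the server accepted.
func (q *pullQueue) pull(s *server) []string {
	var taken []string
	for len(q.pending) > 0 && s.inflight < s.capacity {
		taken = append(taken, q.pending[0])
		q.pending = q.pending[1:]
		s.inflight++
	}
	return taken
}

func main() {
	q := &pullQueue{}
	for _, id := range []string{"r1", "r2", "r3"} {
		q.enqueue(id)
	}
	s := &server{name: "pod-a", capacity: 2}
	// Only two requests fit; the third stays queued at the gateway.
	fmt.Println(q.pull(s), len(q.pending))
}
```

In push mode the loop condition on capacity would be absent, so all three requests would be dispatched regardless of the server's load.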

Additional features added in this PR:

  1. Add a fallback routing policy mechanism that lets developers designate a default routing policy to use if the specified routing policy fails.
  2. Wrap the routing policy registration mechanism in RouterManager, allowing it to be reused for managing a family of related routing policies. This includes simplifying Select() to follow RouterProviderFunc.
  3. Add a profile cache API to manage model-based performance profiles.
  4. Improve the profile generator in the GPU manager to include SLO information and detailed metrics.

Preliminary results show that the SLO policy can achieve the SLO target for a composite workload on heterogeneous GPUs:
[Figure: results image not preserved]
Workload: mixed sharegpt and bird workload with a ratio of 7:4
GPUs: 1 A10, 4 L20
SLO: latency per token of 0.05 s

Related Issues

Resolves: #642 #606

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Jingyuan Zhang and others added 30 commits February 12, 2025 16:24
…load_aware_routing

# Conflicts:
#	pkg/cache/cache.go
#	pkg/plugins/gateway/gateway.go
#	pkg/types/router.go
…load_aware_routing

# Conflicts:
#	pkg/cache/cache.go
#	pkg/plugins/gateway/algorithms/least_busy_time.go
#	pkg/plugins/gateway/algorithms/least_kv_cache.go
#	pkg/plugins/gateway/algorithms/least_latency.go
#	pkg/plugins/gateway/algorithms/least_request.go
#	pkg/plugins/gateway/algorithms/prefix_cache.go
#	pkg/plugins/gateway/algorithms/prefix_cache_and_load.go
#	pkg/plugins/gateway/algorithms/random.go
#	pkg/plugins/gateway/algorithms/router.go
#	pkg/plugins/gateway/algorithms/router_test.go
#	pkg/plugins/gateway/algorithms/throughput.go
#	pkg/plugins/gateway/gateway.go
#	pkg/plugins/gateway/gateway_req_body.go
Add random routing policy to e2e test.
…routing

# Conflicts:
#	cmd/plugins/main.go
#	pkg/cache/cache.go
#	pkg/cache/cache_test.go
#	pkg/cache/model.go
#	pkg/cache/pod.go
#	pkg/plugins/gateway/algorithms/prefix_cache.go
#	pkg/plugins/gateway/algorithms/prefix_cache_test.go
#	pkg/plugins/gateway/algorithms/router_test.go
#	pkg/plugins/gateway/gateway_req_body.go
#	pkg/plugins/gateway/gateway_rsp_body.go
#	pkg/types/router_context.go
…urrent safety and stateful routing

Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
…routing

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	cmd/controllers/main.go
#	cmd/metadata/main.go
#	cmd/plugins/main.go
#	pkg/cache/cache.go
#	pkg/cache/cache_test.go
#	pkg/cache/model.go
#	pkg/cache/pod.go
#	pkg/cache/registry_test.go
#	pkg/plugins/gateway/algorithms/least_busy_time.go
#	pkg/plugins/gateway/algorithms/least_kv_cache.go
#	pkg/plugins/gateway/algorithms/least_latency.go
#	pkg/plugins/gateway/algorithms/least_request.go
#	pkg/plugins/gateway/algorithms/prefix_cache_test.go
#	pkg/plugins/gateway/algorithms/router.go
#	pkg/plugins/gateway/algorithms/router_test.go
#	pkg/plugins/gateway/algorithms/throughput.go
#	pkg/plugins/gateway/gateway.go
#	pkg/plugins/gateway/gateway_req_body.go
#	pkg/plugins/gateway/gateway_rsp_body.go
#	pkg/types/router.go
#	pkg/types/router_context.go
Add PodList interface to replace utils.PodArray.

Signed-off-by: Jingyuan Zhang <[email protected]>
…routing

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	pkg/cache/cache_api.go
#	pkg/cache/cache_init.go
…urrent safety and stateful routing

Signed-off-by: Jingyuan Zhang <[email protected]>
…routing

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	cmd/plugins/main.go
#	pkg/cache/cache_api.go
#	pkg/cache/cache_impl.go
#	pkg/cache/cache_init.go
#	pkg/cache/cache_test.go
#	pkg/cache/cache_trace.go
#	pkg/cache/informers.go
#	pkg/cache/model.go
#	pkg/cache/pod.go
#	pkg/cache/trace.go
#	pkg/plugins/gateway/algorithms/least_busy_time.go
#	pkg/plugins/gateway/algorithms/least_kv_cache.go
#	pkg/plugins/gateway/algorithms/least_latency.go
#	pkg/plugins/gateway/algorithms/least_request.go
#	pkg/plugins/gateway/algorithms/prefix_cache.go
#	pkg/plugins/gateway/algorithms/prefix_cache_and_load.go
#	pkg/plugins/gateway/algorithms/prefix_cache_test.go
#	pkg/plugins/gateway/algorithms/random.go
#	pkg/plugins/gateway/algorithms/router.go
#	pkg/plugins/gateway/algorithms/router_test.go
#	pkg/plugins/gateway/algorithms/throughput.go
#	pkg/plugins/gateway/gateway.go
#	pkg/plugins/gateway/gateway_req_body.go
#	pkg/plugins/gateway/gateway_req_headers.go
#	pkg/plugins/gateway/gateway_rsp_body.go
#	pkg/types/router.go
#	pkg/types/router_context.go
#	pkg/utils/pod.go
#	test/e2e/routing_strategy_test.go
…load_aware_routing

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	cmd/plugins/main.go
#	pkg/cache/cache_api.go
#	pkg/cache/cache_impl.go
#	pkg/cache/cache_init.go
#	pkg/cache/cache_test.go
#	pkg/cache/cache_trace.go
#	pkg/cache/informers.go
#	pkg/cache/model.go
#	pkg/cache/pod.go
#	pkg/cache/trace.go
#	pkg/metrics/metrics.go
#	pkg/plugins/gateway/algorithms/prefix_cache_test.go
#	pkg/plugins/gateway/algorithms/router_test.go
#	pkg/plugins/gateway/gateway_req_body.go
#	pkg/plugins/gateway/gateway_rsp_body.go
#	pkg/types/router_context.go
#	pkg/types/router_context_test.go
#	pkg/utils/sync_map.go
Jingyuan Zhang added 4 commits June 12, 2025 15:01

Jeffwan (Collaborator) commented Jun 14, 2025

I notice there are some refactoring changes (e.g. internal interface changes). Technically, those affect other areas. Could they go into separate changes? I mean splitting the PR into common parts (which stakeholders need to review) and SLO-specific changes (where review could be looser and the feature can be protected by a feature gate).

If the splitting is too complicated, we can do a first round of review and then decide how to move forward.

Jingyuan Zhang added 2 commits June 13, 2025 20:51
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
zhangjyr (Collaborator, Author) commented

@Jeffwan, I think the only internal interface change is Select(). The function is called in only one place, and if you find the change inappropriate, we can restore it.

zhangjyr and others added 2 commits June 13, 2025 22:27
…load_aware_routing

Signed-off-by: Jingyuan Zhang <[email protected]>

# Conflicts:
#	pkg/plugins/gateway/algorithms/prefix_cache_preble.go
@@ -77,7 +78,7 @@ func main() {
panic(err)
}

-cache.InitForGateway(config, stopCh, redisClient)
+cache.InitForGateway(config, stopCh, redisClient, routing.NewSLORouter)
Review comment (Collaborator):

Is there a better way to handle this case? routing.NewSLORouter is a specific solution, but cmd/plugins/main.go serves a common purpose. Can we have a factory for this kind of initialization?

@@ -2,7 +2,7 @@ apiVersion: v1
kind: Service
metadata:
name: gateway-plugins
-namespace: aibrix-system
+namespace: system
Review comment (Collaborator):

system will eventually be overwritten to aibrix-system here, right? Is the change to system meant to align with the default setting?

// Parameters:
// deploymentName: Name of the deployment
// modelName: Name of the model
GetModelProfileByDeploymentName(deploymentName string, modelName string) (*ModelGPUProfile, error)
Review comment (Collaborator):

TODO: We may use other objects to orchestrate pods in the future, in which case the deployment might change. This looks good for now.

One more problem: a deployment name without a namespace cannot uniquely identify a deployment. We need to add the namespace field.

break
// Current implementation assumes AddRequestCount() will not be called concurrently.
// TODO: Implement "wait for trace term" logic if AddRequestCount() is called concurrently.
if ctx == nil || ctx.CanAddTrace() {
Review comment (Collaborator):

Is this refactor for the common case?

metaModels utils.SyncMap[string, *Model] // model_name -> *Model

// Deployment related storage
deploymentProfiles utils.SyncMap[string, *ModelGPUProfile] // deployment_name -> *ModelGPUProfile
Review comment (Collaborator):

Same here. We can use namespace/deployment as the key.
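
A minimal sketch of the namespace-qualified key suggested here, assuming a plain `namespace/name` string format. The `deploymentKey` and `splitDeploymentKey` helpers are hypothetical and not part of the PR; a plain map stands in for utils.SyncMap.

```go
package main

import (
	"fmt"
	"strings"
)

// deploymentKey builds a key that stays unique across namespaces.
func deploymentKey(namespace, name string) string {
	return namespace + "/" + name
}

// splitDeploymentKey reverses deploymentKey; ok is false if the key
// contains no "/" separator.
func splitDeploymentKey(key string) (namespace, name string, ok bool) {
	namespace, name, ok = strings.Cut(key, "/")
	return
}

func main() {
	// Two deployments share a name but live in different namespaces.
	profiles := map[string]string{}
	profiles[deploymentKey("team-a", "llama")] = "profile for team-a"
	profiles[deploymentKey("team-b", "llama")] = "profile for team-b"

	// Keyed by name alone, the second entry would overwrite the first.
	fmt.Println(len(profiles))

	ns, name, ok := splitDeploymentKey("team-a/llama")
	fmt.Println(ns, name, ok)
}
```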

@@ -98,6 +97,7 @@ func (c *Store) addPod(obj interface{}) {
// only track pods with model deployments
modelName, ok := pod.Labels[modelIdentifier]
if !ok {
// klog.InfoS("ignored pod without model label", "name", pod.Name)
Review comment (Collaborator):

Use a log level instead?

}
}
}
q.queue, q.baseCursor = newQueue, q.baseCursor+dequeuePos
Review comment (Collaborator):

Could it be a problem if another goroutine invokes physicalPosRLocked? Can we introduce something like the following, use it in physicalPosRLocked, and add a matching setBaseCursor for use in expand?

func (q *SimpleQueue[V]) getBaseCursor() int64 {
	return atomic.LoadInt64(&q.baseCursor)
}
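
To round out the suggestion, a matching setter and a reader could look like the sketch below. The `cursorQueue` type and the `physicalPos` mapping (logical cursor minus base cursor) are assumptions for illustration, not the PR's SimpleQueue. Note that atomic access only protects the cursor value itself; keeping the cursor consistent with a replaced backing slice would still require the queue's lock.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cursorQueue is a hypothetical reduction of a queue to its base cursor.
type cursorQueue struct {
	baseCursor int64
}

// getBaseCursor reads the cursor atomically, as suggested in the review.
func (q *cursorQueue) getBaseCursor() int64 {
	return atomic.LoadInt64(&q.baseCursor)
}

// setBaseCursor is the matching atomic write, intended for use during expansion.
func (q *cursorQueue) setBaseCursor(v int64) {
	atomic.StoreInt64(&q.baseCursor, v)
}

// physicalPos maps a logical cursor to an index in the backing slice.
// Assumption: physical index = logical cursor - base cursor.
func (q *cursorQueue) physicalPos(logical int64) int64 {
	return logical - q.getBaseCursor()
}

func main() {
	q := &cursorQueue{}
	// Simulate an expansion that drops three already-dequeued slots.
	q.setBaseCursor(q.getBaseCursor() + 3)
	fmt.Println(q.physicalPos(5))
}
```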

atomic.AddInt32(&hist.size, -hist.Tail().getSkipped())
}

func NewSimmpleOutputPredictor(maxInputTokens, maxOutputTokens int, window time.Duration) *SimmpleOutputPredictor {
Review comment (Collaborator):

Let's briefly describe the algorithm here in a comment.

debugDelay time.Duration
tokens []int
predictor OutputPredictor
Review comment (Collaborator):

One of my concerns is which fields can be used by routing algorithms when profiles are disabled. As a routing algorithm developer, which fields should I expect to be available when I enable or disable certain features?

return
}

func (q *SLOQueue) higherRank(rank1 float64, rank2 float64) float64 {
Review comment (Collaborator):

Directly returning a bool looks simpler.
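
A sketch of the bool-returning form, written as a free function for brevity. The comparison direction (a larger value means a higher rank) is an assumption; the PR's actual ranking rule is not shown in this excerpt.

```go
package main

import "fmt"

// higherRank reports whether rank1 outranks rank2.
// Assumption for illustration: a larger value means a higher rank.
func higherRank(rank1, rank2 float64) bool {
	return rank1 > rank2
}

func main() {
	// Callers can branch on the result directly instead of comparing
	// a returned float64 against one of the inputs.
	if higherRank(0.9, 0.4) {
		fmt.Println("first request goes first")
	}
}
```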

queueOverallSLO bool = false
monogenousGPURouting bool = true
monogenousGPURoutingOnly bool = monogenousGPURouting && false
initialTotalSubQueues int = 8 // Expect no more than 8 subqueues
Review comment (Collaborator):

Are these magic numbers constants, or should they be adjusted based on the available resources?

Jingyuan Zhang added 7 commits June 17, 2025 20:55
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Signed-off-by: Jingyuan Zhang <[email protected]>
Development

Successfully merging this pull request may close these issues.

[RFC]: Load-aware pattern-based routing policy with profile support
2 participants