Skip to content

[Ratelimit Processor] Instrument the ratelimiter service with telemetry metrics #562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 53 commits into from
May 21, 2025

Conversation

gizas
Copy link
Contributor

@gizas gizas commented May 2, 2025

This is the initial implementation for https://github.com/elastic/hosted-otel-collector/issues/513

The code adds a newMetrics reader in the ratelimitprocessor.

At the moment we have implemented only otelcol_ratelimit.requests metric.Int64Counter
Also the newMetricsReader makes use of the global otel processor that we assume that is initialized upon processor usage.

To be reviewed:

  • if err != nil: WithReason = "request_error" and WithDecision = "accepted"
  • if n := len(responses); n != 1: WithReason ="request_error" and WithDecision = "accepted"
  • if n := len(responses)==1 and resp.GetError() != "": WithReason = "limit_error" and WithDecision ="accepted"
  • if gubernator.Status_OVER_LIMIT: WithReason = "over_limit" and WithDecision ="throttled"
  • Default: "WithReason" = "under_limit" and "WithDecision" ="accepted"

Sample document

After running make smoketest from the hosted-otel-collector repo, I can see this document in Kibana:

{
  "_index": ".ds-metrics-apm.app.hosted_otel_collector-default-2025.05.15-000001",
  "_id": "z24a75YBM8MU2AZ6gfwv",
  "_version": 1,
  "_source": {
    "@timestamp": "2025-05-20T19:10:12.799Z",
    "agent": {
      "name": "otlp",
      "version": "unknown"
    },
    "data_stream": {
      "dataset": "apm.app.hosted_otel_collector",
      "namespace": "default",
      "type": "metrics"
    },
    "event": {
      "ingested": "2025-05-20T19:10:13.000Z"
    },
    "labels": {
      "orchestrator_cluster_name": "default",
      "orchestrator_deploymentslice": "",
      "orchestrator_environment": "default",
      "ratelimit_decision": "accepted",
      "reason": "under_limit",
      "x-elastic-project-id": "local"
    },
    "metricset": {
      "name": "app"
    },
    "numeric_labels": {
      "limit_threshold": 0
    },
    "observer": {
      "hostname": "apm-server-apm-server-69868958cf-4rcgq",
      "type": "apm-server",
      "version": "8.18.1"
    },
    "otelcol_ratelimit": {
      "requests": 6
    },
    "service": {
      "framework": {
        "name": "github.com/elastic/opentelemetry-collector-components/processor/ratelimitprocessor"
      },
      "language": {
        "name": "unknown"
      },
      "name": "hosted-otel-collector",
      "node": {
        "name": "9f735ea3-a83e-4098-ad1a-8b289b8f0a9e"
      },
      "version": "git"
    }
  }
}

@gizas gizas requested a review from a team as a code owner May 2, 2025 11:41
@gizas gizas requested review from jackshirazi and edmocosta May 2, 2025 11:41
gizas added 2 commits May 2, 2025 14:43
Signed-off-by: Andreas Gkizas <[email protected]>
gizas added 3 commits May 2, 2025 15:39
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
gizas added 3 commits May 2, 2025 16:26
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
@gizas gizas requested review from a team as code owners May 2, 2025 13:38
gizas added 10 commits May 2, 2025 16:39
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach looks good to me. Some thoughts on additional metrics that we would need to introduce

  • Request duration
  • Can we measure the no of requests that are over the limits at a given time? I believe we can by using the decision dimension, just confirm if thats is the case.
  • Concurrent requests

@lahsivjar Could you think of additional metrics that were useful in MIS so we can add them here? Thanks.

}

func newGubernatorRateLimiter(cfg *Config, set processor.Settings) (*gubernatorRateLimiter, error) {
var behavior int32

telemetryBuilder, err := metadata.NewTelemetryBuilder(set.TelemetrySettings)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add this also to the local rate limiter, to make sure we track all the incoming requests to this component. I would move this out.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vigneshshanmugam Could you explain a bit more here please? Sorry I'm lost 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was definitely without much context, apologeis. What I meant was to add it to

func (r *localRateLimiter) RateLimit(ctx context.Context, hits int) error {
as well as there is way to disable gubernator and use local in-memory ratelimier instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you think of additional metrics that were useful in MIS so we can add them here? Thanks.

Currently in MIS we collect only 2 metrics from the rate limiter (ref):

  1. A metric to track the status of the ingest requests. This metric has dimensions with bounded cardinality to record the decision and the reason for the decision by the rate limiter service.
  2. A metric to track the update requests sent between gubernator instances to update them. This metric has just one dimension of status.

While I like the idea of collecting request duration, however, I am not sure how useful concurrent requests would be. We should also be wary of adding too many metrics in hot path, as it does add some overhead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vigneshshanmugam Could you explain a bit more here please? Sorry I'm lost 😅

I think I can explain this @kaiyan-sheng.

This rate limiter can be local (using a go library for the implementation) or from the gubernator. Basically, if you define a config from the gubernator, we expect it to be from there, otherwise it is local:

if config.Gubernator != nil {
return newGubernatorRateLimiter(config, set)
}
return newLocalRateLimiter(config, set)
})

So the rate limiter is later created this way:

return &LogsRateLimiterProcessor{
rateLimiterProcessor: rateLimiterProcessor{
Component: rateLimiter,
rl: rateLimiter.Unwrap(),
},
count: getLogsCountFunc(strategy),
next: next,
}
}

(I am using one signal just for example).

Since the telemetry builder for the local and gubernator is the same, I think @vigneshshanmugam is suggesting to only initiate it one time (so somewhere in the processor.go).

Then you can also check the functions:

func (r *MetricsRateLimiterProcessor) ConsumeMetrics(ctx context.Context, md pmetric.Metrics) error {

Could the metric from the rate limiter be set here? Or does it really neeed to be in the local.go and gubernator.go? I don't know the answer to this now, I haven't checked the code in detail since the PR was first opened.

I hope it is clear. We can discuss this in more detail next week if you would like :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@constanca-m Thanks for the comments! Yep make sense to move telemetryBuilder, err := metadata.NewTelemetryBuilder(set.TelemetrySettings) into factory.go so we don't create it separately for local.go and gubernator.go. But for metrics, I think it's still necessary to have in both places. I will keep them separate for now, move forward with this task and let's revisit it next week!

@kaiyan-sheng
Copy link
Contributor

Can we measure the no of requests that are over the limits at a given time? I believe we can by using the decision dimension, just confirm if thats is the case.

Yes we can use the ratelimit_decision attribute to filter otelcol_ratelimit.requests metric.

Request duration and concurrent requests

These two metrics are implemented in a separate PR.

@gizas
Copy link
Contributor Author

gizas commented May 19, 2025

@kaiyan-sheng what is still pending on this PR?
Have we decided to implement ratelimit.broadcasts based on the above #562 (comment)?

@kaiyan-sheng
Copy link
Contributor

kaiyan-sheng commented May 19, 2025

@kaiyan-sheng what is still pending on this PR? Have we decided to implement ratelimit.broadcasts based on the above #562 (comment)?

Nothing is pending on this PR, it's ready for review.
@gizas @vigneshshanmugam I probabaly don't understand ratelimit.broadcasts from APM fully. But I don't think this metric is applicable to our ratelimit processor. Please let me know if I'm wrong 🙂 I'm also trying to limit this PR's scope to only adding otelcol_ratelimit.requests metric and fix for CI to pass. I have a separate PR to add more metrics for ratelimiter.

Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM

@gizas
Copy link
Contributor Author

gizas commented May 20, 2025

@kaiyan-sheng I think we need the WithReason in fe67b2e for local.go
All the decision is throttled so how are we going to distinguish which case we have?

Signed-off-by: Andreas Gkizas <[email protected]>
Signed-off-by: Andreas Gkizas <[email protected]>
@kaiyan-sheng
Copy link
Contributor

@kaiyan-sheng I think we need the WithReason in fe67b2e for local.go All the decision is throttled so how are we going to distinguish which case we have?

@gizas Oh good point! I forgot there are two diff reasons there for the local case. I will put them back!

Comment on lines +120 to +123
r.requestTelemetry(ctx, []attribute.KeyValue{
telemetry.WithReason(telemetry.RequestErr),
telemetry.WithDecision("accepted"),
})
Copy link
Contributor

@constanca-m constanca-m May 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So continuing my comment https://github.com/elastic/opentelemetry-collector-components/pull/562/files#r2098431295, maybe you could check errors.Is(...,...) in processor.go and then set the metric in processor.go. This is so that local.go and gubernator.go won't need to have duplicated code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, would be good to do as part of the followup.

Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few nits, mostly there.

Comment on lines +120 to +123
r.requestTelemetry(ctx, []attribute.KeyValue{
telemetry.WithReason(telemetry.RequestErr),
telemetry.WithDecision("accepted"),
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++, would be good to do as part of the followup.

Copy link
Member

@vigneshshanmugam vigneshshanmugam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks you 🎉

@vigneshshanmugam vigneshshanmugam merged commit 0a0d449 into main May 21, 2025
13 checks passed
@vigneshshanmugam vigneshshanmugam deleted the ratelimit_instrument branch May 21, 2025 17:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants