[Ratelimit Processor] Instrument the ratelimiter service with telemetry metrics #562

gizas · 2025-05-02T11:41:11Z

This is the initial implementation for https://github.com/elastic/hosted-otel-collector/issues/513

The code adds a newMetricsReader in the ratelimitprocessor.

At the moment we have implemented only the otelcol_ratelimit.requests metric, a metric.Int64Counter.
The newMetricsReader also makes use of the global otel processor, which we assume is initialized when the processor is used.

To be reviewed:

if err != nil: WithReason = "request_error" and WithDecision = "accepted"
if n := len(responses); n != 1: WithReason = "request_error" and WithDecision = "accepted"
if len(responses) == 1 and resp.GetError() != "": WithReason = "limit_error" and WithDecision = "accepted"
if gubernator.Status_OVER_LIMIT: WithReason = "over_limit" and WithDecision = "throttled"
Default: WithReason = "under_limit" and WithDecision = "accepted"
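As a rough illustration of the mapping listed above, here is a minimal, self-contained sketch. The `response` and `status` types, `errBackend`, and the `classify` helper are hypothetical stand-ins written for this example; they are not the gubernator types or the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// status is a hypothetical stand-in for the gubernator status enum.
type status int

const (
	underLimit status = iota
	overLimit
)

// response is a hypothetical stand-in for a gubernator rate-limit response.
type response struct {
	Err    string
	Status status
}

// errBackend is an illustrative error for a failed rate-limit call.
var errBackend = errors.New("backend unavailable")

// classify returns the (reason, decision) pair for one rate-limit call,
// following the five cases in the list above, in the same order.
func classify(err error, responses []response) (reason, decision string) {
	switch {
	case err != nil:
		return "request_error", "accepted"
	case len(responses) != 1:
		return "request_error", "accepted"
	case responses[0].Err != "":
		return "limit_error", "accepted"
	case responses[0].Status == overLimit:
		return "over_limit", "throttled"
	default:
		return "under_limit", "accepted"
	}
}

func main() {
	fmt.Println(classify(errBackend, nil))                      // request_error accepted
	fmt.Println(classify(nil, []response{{Status: overLimit}})) // over_limit throttled
	fmt.Println(classify(nil, []response{{}}))                  // under_limit accepted
}
```

Only the over-limit case produces a "throttled" decision; every error path is counted as "accepted" with a reason that explains why.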

Sample document

After running make smoketest from the hosted-otel-collector repo, I can see this document in Kibana:

{
  "_index": ".ds-metrics-apm.app.hosted_otel_collector-default-2025.05.15-000001",
  "_id": "z24a75YBM8MU2AZ6gfwv",
  "_version": 1,
  "_source": {
    "@timestamp": "2025-05-20T19:10:12.799Z",
    "agent": {
      "name": "otlp",
      "version": "unknown"
    },
    "data_stream": {
      "dataset": "apm.app.hosted_otel_collector",
      "namespace": "default",
      "type": "metrics"
    },
    "event": {
      "ingested": "2025-05-20T19:10:13.000Z"
    },
    "labels": {
      "orchestrator_cluster_name": "default",
      "orchestrator_deploymentslice": "",
      "orchestrator_environment": "default",
      "ratelimit_decision": "accepted",
      "reason": "under_limit",
      "x-elastic-project-id": "local"
    },
    "metricset": {
      "name": "app"
    },
    "numeric_labels": {
      "limit_threshold": 0
    },
    "observer": {
      "hostname": "apm-server-apm-server-69868958cf-4rcgq",
      "type": "apm-server",
      "version": "8.18.1"
    },
    "otelcol_ratelimit": {
      "requests": 6
    },
    "service": {
      "framework": {
        "name": "github.com/elastic/opentelemetry-collector-components/processor/ratelimitprocessor"
      },
      "language": {
        "name": "unknown"
      },
      "name": "hosted-otel-collector",
      "node": {
        "name": "9f735ea3-a83e-4098-ad1a-8b289b8f0a9e"
      },
      "version": "git"
    }
  }
}

Signed-off-by: Andreas Gkizas <[email protected]>

vigneshshanmugam

The approach looks good to me. Some thoughts on additional metrics that we would need to introduce:

Request duration
Can we measure the number of requests that are over the limit at a given time? I believe we can by using the decision dimension; just confirm whether that is the case.
Concurrent requests

@lahsivjar Could you think of additional metrics that were useful in MIS so we can add them here? Thanks.

vigneshshanmugam · 2025-05-02T16:45:36Z

processor/ratelimitprocessor/gubernator.go

 }

 func newGubernatorRateLimiter(cfg *Config, set processor.Settings) (*gubernatorRateLimiter, error) {
 	var behavior int32
+
+	telemetryBuilder, err := metadata.NewTelemetryBuilder(set.TelemetrySettings)


It would be good to add this also to the local rate limiter, to make sure we track all the incoming requests to this component. I would move this out.

@vigneshshanmugam Could you explain a bit more here please? Sorry I'm lost 😅

That was definitely without much context, apologies. What I meant was to add it to

opentelemetry-collector-components/processor/ratelimitprocessor/local.go

Line 53 in 81b3c24

func (r *localRateLimiter) RateLimit(ctx context.Context, hits int) error {

as well, since there is a way to disable gubernator and use the local in-memory rate limiter instead.

Could you think of additional metrics that were useful in MIS so we can add them here? Thanks.

Currently in MIS we collect only 2 metrics from the rate limiter (ref):

A metric to track the status of the ingest requests. This metric has dimensions with bounded cardinality to record the decision and the reason for the decision by the rate limiter service.

A metric to track the update requests sent between gubernator instances to update them. This metric has just one dimension of status.

While I like the idea of collecting request duration, I am not sure how useful concurrent requests would be. We should also be wary of adding too many metrics in the hot path, as they add some overhead.

@vigneshshanmugam Could you explain a bit more here please? Sorry I'm lost 😅

I think I can explain this @kaiyan-sheng.

This rate limiter can be local (using a Go library for the implementation) or backed by gubernator. Basically, if the config defines a gubernator section, the gubernator rate limiter is used; otherwise the local one is:

opentelemetry-collector-components/processor/ratelimitprocessor/factory.go

Lines 60 to 64 in a2d4b06

	if config.Gubernator != nil {
		return newGubernatorRateLimiter(config, set)
	}
	return newLocalRateLimiter(config, set)
})

So the rate limiter is later created this way:

opentelemetry-collector-components/processor/ratelimitprocessor/processor.go

Lines 67 to 75 in a2d4b06

	return &LogsRateLimiterProcessor{
		rateLimiterProcessor: rateLimiterProcessor{
			Component: rateLimiter,
			rl:        rateLimiter.Unwrap(),
		},
		count: getLogsCountFunc(strategy),
		next:  next,
	}
}

(I am using one signal just for example).

Since the telemetry builder for the local and gubernator is the same, I think @vigneshshanmugam is suggesting to only initiate it one time (so somewhere in the processor.go).

Then you can also check the functions:

opentelemetry-collector-components/processor/ratelimitprocessor/processor.go

Line 146 in a2d4b06

func (r *MetricsRateLimiterProcessor) ConsumeMetrics(ctx context.Context, md pmetric.Metrics) error {

Could the metric from the rate limiter be set here? Or does it really need to be in local.go and gubernator.go? I don't know the answer yet; I haven't checked the code in detail since the PR was first opened.

I hope it is clear. We can discuss this in more detail next week if you would like :)

@constanca-m Thanks for the comments! Yep, it makes sense to move telemetryBuilder, err := metadata.NewTelemetryBuilder(set.TelemetrySettings) into factory.go so we don't create it separately for local.go and gubernator.go. But for the metrics, I think it's still necessary to record them in both places. I will keep them separate for now, move forward with this task, and let's revisit it next week!
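To make the idea discussed above concrete, here is a minimal sketch of creating the telemetry builder once and handing it to whichever limiter the config selects. Everything in it is hypothetical: `telemetryBuilder`, `localLimiter`, `gubernatorLimiter`, and `newRateLimiter` are simplified stand-ins, not the generated metadata package or the constructors in this repository.

```go
package main

import "fmt"

// telemetryBuilder is a hypothetical stand-in for the generated telemetry
// builder; here it only counts recorded requests.
type telemetryBuilder struct{ requests int }

func (t *telemetryBuilder) recordRequest() { t.requests++ }

// rateLimiter is the behaviour both implementations share.
type rateLimiter interface {
	RateLimit(hits int) error
}

type localLimiter struct{ tb *telemetryBuilder }

type gubernatorLimiter struct{ tb *telemetryBuilder }

// Both implementations record into the same shared builder.
func (l *localLimiter) RateLimit(hits int) error      { l.tb.recordRequest(); return nil }
func (g *gubernatorLimiter) RateLimit(hits int) error { g.tb.recordRequest(); return nil }

// newRateLimiter receives a builder that was created once by the caller and
// passes it to the selected implementation.
func newRateLimiter(useGubernator bool, tb *telemetryBuilder) rateLimiter {
	if useGubernator {
		return &gubernatorLimiter{tb: tb}
	}
	return &localLimiter{tb: tb}
}

func main() {
	tb := &telemetryBuilder{} // built once, in one place
	_ = newRateLimiter(true, tb).RateLimit(1)
	_ = newRateLimiter(false, tb).RateLimit(1)
	fmt.Println(tb.requests) // 2
}
```

The recording call still lives in each implementation, which matches the position taken in the comment above; only the construction is shared.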

Co-authored-by: Vignesh Shanmugam <[email protected]>

kaiyan-sheng · 2025-05-13T00:07:21Z

Can we measure the number of requests that are over the limit at a given time? I believe we can by using the decision dimension; just confirm whether that is the case.

Yes, we can filter the otelcol_ratelimit.requests metric by the ratelimit_decision attribute.
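A small sketch of what that filtering amounts to, assuming a simplified in-memory view of the counter's data points. The `dataPoint` type, the `sumByDecision` helper, and the throttled value of 2 are illustrative assumptions; only the attribute names and the accepted count of 6 come from the sample document in the description.

```go
package main

import "fmt"

// dataPoint is a hypothetical, simplified view of one
// otelcol_ratelimit.requests data point.
type dataPoint struct {
	Attrs map[string]string
	Value int64
}

// sumByDecision adds up the data points whose ratelimit_decision attribute
// equals the given decision.
func sumByDecision(points []dataPoint, decision string) int64 {
	var total int64
	for _, p := range points {
		if p.Attrs["ratelimit_decision"] == decision {
			total += p.Value
		}
	}
	return total
}

func main() {
	points := []dataPoint{
		{Attrs: map[string]string{"ratelimit_decision": "accepted", "reason": "under_limit"}, Value: 6},
		{Attrs: map[string]string{"ratelimit_decision": "throttled", "reason": "over_limit"}, Value: 2},
	}
	fmt.Println(sumByDecision(points, "throttled")) // 2
}
```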

Request duration and concurrent requests

These two metrics are implemented in a separate PR.

gizas · 2025-05-19T13:43:21Z

@kaiyan-sheng what is still pending on this PR?
Have we decided to implement ratelimit.broadcasts based on the above #562 (comment)?

kaiyan-sheng · 2025-05-19T17:11:02Z

@kaiyan-sheng what is still pending on this PR? Have we decided to implement ratelimit.broadcasts based on the above #562 (comment)?

Nothing is pending on this PR, it's ready for review.
@gizas @vigneshshanmugam I probably don't fully understand ratelimit.broadcasts from APM, but I don't think this metric is applicable to our ratelimit processor. Please let me know if I'm wrong 🙂 I'm also trying to limit this PR's scope to adding the otelcol_ratelimit.requests metric and fixing CI so it passes. I have a separate PR to add more metrics for the ratelimiter.

vigneshshanmugam

Mostly LGTM

gizas · 2025-05-20T09:40:42Z

@kaiyan-sheng I think we still need the WithReason that fe67b2e removed from local.go.
Every decision there is throttled, so how are we going to distinguish which case we have?

kaiyan-sheng · 2025-05-20T13:38:25Z

@kaiyan-sheng I think we still need the WithReason that fe67b2e removed from local.go. Every decision there is throttled, so how are we going to distinguish which case we have?

@gizas Oh, good point! I forgot there are two different reasons there for the local case. I will put them back!

constanca-m · 2025-05-20T16:42:07Z

processor/ratelimitprocessor/gubernator.go

+		r.requestTelemetry(ctx, []attribute.KeyValue{
+			telemetry.WithReason(telemetry.RequestErr),
+			telemetry.WithDecision("accepted"),
+		})


So continuing my comment https://github.com/elastic/opentelemetry-collector-components/pull/562/files#r2098431295, maybe you could check errors.Is(...,...) in processor.go and then set the metric in processor.go. This is so that local.go and gubernator.go won't need to have duplicated code.

++, would be good to do as part of the followup.
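A minimal sketch of the errors.Is suggestion, assuming both limiter implementations return a shared sentinel error when a request is throttled. The sentinel names and the `attrsFor` helper are hypothetical and written for this example; the follow-up PR may structure this differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors that both local.go and gubernator.go would
// return, so the caller can classify the outcome in one place.
var (
	errTooManyRequests = errors.New("too many requests")
	errBackend         = errors.New("rate limiter backend error")
	// errWrapped shows that errors.Is still matches through wrapping.
	errWrapped = fmt.Errorf("project local: %w", errTooManyRequests)
)

// attrsFor maps the error returned by RateLimit to a (reason, decision)
// pair, so the metric can be recorded once by the caller.
func attrsFor(err error) (reason, decision string) {
	switch {
	case err == nil:
		return "under_limit", "accepted"
	case errors.Is(err, errTooManyRequests):
		return "over_limit", "throttled"
	default:
		// Any other failure is counted as a request error and let through.
		return "request_error", "accepted"
	}
}

func main() {
	fmt.Println(attrsFor(nil))        // under_limit accepted
	fmt.Println(attrsFor(errWrapped)) // over_limit throttled
	fmt.Println(attrsFor(errBackend)) // request_error accepted
}
```

With this shape the two implementations only need to agree on the sentinel errors, and the attribute selection is no longer duplicated between them.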

vigneshshanmugam

Few nits, mostly there.

vigneshshanmugam · 2025-05-20T21:25:59Z

processor/ratelimitprocessor/gubernator.go

+		r.requestTelemetry(ctx, []attribute.KeyValue{
+			telemetry.WithReason(telemetry.RequestErr),
+			telemetry.WithDecision("accepted"),
+		})


++, would be good to do as part of the followup.

vigneshshanmugam

LGTM, thank you 🎉

gizas added 2 commits May 2, 2025 11:12

first effort to instrument

7b1cd05

Signed-off-by: Andreas Gkizas <[email protected]>

adding initial implementation for metrics provider in ratelimiter

104d18c

Signed-off-by: Andreas Gkizas <[email protected]>

gizas requested a review from a team as a code owner May 2, 2025 11:41

gizas requested review from jackshirazi and edmocosta May 2, 2025 11:41

gizas added 2 commits May 2, 2025 14:43

updating test to read projectid from metadata

20336df

Signed-off-by: Andreas Gkizas <[email protected]>

license update

e45b817

Signed-off-by: Andreas Gkizas <[email protected]>

lahsivjar reviewed May 2, 2025

processor/ratelimitprocessor/internal/telemetry/attributes.go (outdated)

processor/ratelimitprocessor/ratelimiter.go (outdated)

processor/ratelimitprocessor/gubernator.go (outdated)

constanca-m reviewed May 2, 2025

processor/ratelimitprocessor/ratelimiter.go (outdated)

gizas added 3 commits May 2, 2025 15:39

update with correct attributes as fucntions

9d3f88e

Signed-off-by: Andreas Gkizas <[email protected]>

changing Meter namespace

bc3faad

Signed-off-by: Andreas Gkizas <[email protected]>

make goproto

4bd75d7

Signed-off-by: Andreas Gkizas <[email protected]>

constanca-m reviewed May 2, 2025

processor/ratelimitprocessor/ratelimiter.go (outdated)

gizas added 3 commits May 2, 2025 16:26

run with metadata

1bb84df

Signed-off-by: Andreas Gkizas <[email protected]>

run with metadata and remove unneeded funcsions

684bff2

Signed-off-by: Andreas Gkizas <[email protected]>

make go generate

fb72049

Signed-off-by: Andreas Gkizas <[email protected]>

gizas requested review from a team as code owners May 2, 2025 13:38

gizas added 10 commits May 2, 2025 16:39

make lint

8c221be

Signed-off-by: Andreas Gkizas <[email protected]>

adding err for projectID

9f4baac

Signed-off-by: Andreas Gkizas <[email protected]>

license update

82d7673

Signed-off-by: Andreas Gkizas <[email protected]>

Merge branch 'main' into ratelimit_instrument

7339ecf

remove files

fc32712

Signed-off-by: Andreas Gkizas <[email protected]>

fixing tests

ec8528d

Signed-off-by: Andreas Gkizas <[email protected]>

make fmt

2a8748e

Signed-off-by: Andreas Gkizas <[email protected]>

make default projectID empty

5690517

Signed-off-by: Andreas Gkizas <[email protected]>

make gogenerate

de1b8e1

Signed-off-by: Andreas Gkizas <[email protected]>

make gogenerate

c8365ee

Signed-off-by: Andreas Gkizas <[email protected]>

vigneshshanmugam reviewed May 2, 2025

Update processor/ratelimitprocessor/gubernator.go

570f81d

Co-authored-by: Vignesh Shanmugam <[email protected]>

vigneshshanmugam reviewed May 19, 2025

kaiyan-sheng and others added 5 commits May 19, 2025 17:28

Merge branch 'main' into ratelimit_instrument

2b74b1d

remove processor_id and reason field

2b8cae0

remove reason field when throttled in local.go

fe67b2e

Merge branch 'main' into ratelimit_instrument

2b8a325

fix errors

02b7b27

Signed-off-by: Andreas Gkizas <[email protected]>

fix errors

97ad2a3

Signed-off-by: Andreas Gkizas <[email protected]>

gizas commented May 20, 2025

processor/ratelimitprocessor/gubernator_test.go (outdated)

fix errors

943377b

Signed-off-by: Andreas Gkizas <[email protected]>

fix test

6abe18f

kaiyan-sheng requested a review from vigneshshanmugam May 20, 2025 15:44

constanca-m reviewed May 20, 2025

kaiyan-sheng added 3 commits May 20, 2025 13:29

move metadata.NewTelemetryBuilder to factory.go

db60e80

fix tests

86e13a1

fix local_test.go

197718b

vigneshshanmugam reviewed May 20, 2025

kaiyan-sheng and others added 5 commits May 20, 2025 22:52

remove attributes

180dd04

generate new documentation.md

a844247

adding test with AssertEqualRatelimitRequests

207f2aa

Signed-off-by: Andreas Gkizas <[email protected]>

adding test with AssertEqualRatelimitRequests

0ccc6bf

Signed-off-by: Andreas Gkizas <[email protected]>

add comment for the goimports workaround

4524448

vigneshshanmugam approved these changes May 21, 2025

vigneshshanmugam merged commit 0a0d449 into main May 21, 2025
13 checks passed

vigneshshanmugam deleted the ratelimit_instrument branch May 21, 2025 17:04

constanca-m mentioned this pull request May 26, 2025

[ratelimiter] Remove wrong error, remove duplicate code #578

Merged
