[tailsamplingprocessor] Decision Timer Latency Metric is misleading #38502

Closed
@Logiraptor

Component(s)

processor/tailsampling

What happened?

Description

After #37722, I realized there is another issue with the decision timer latency metric.

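Basically, it currently measures the latency from the start of policy evaluation for a batch until just after each trace is evaluated, so every trace's recorded value includes the time spent on all traces evaluated before it in the same batch. This isn't super useful. Consider the following scenario: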

  • A batch with 10 traces, each taking 1ms to evaluate all policies

Steps to Reproduce

Run the tsp and observe the otelcol_processor_tail_sampling_sampling_decision_timer_latency metric.

Expected Result

I would expect this metric to report one of two things, either:

  1. The total time to evaluate a batch (in this case 10ms)
  2. The time to evaluate each trace (in this case 1ms)

IMO (1) is more useful, since it's a direct indication of whether the tsp is at risk of falling behind on processing traces.

Actual Result

  • The histogram will record times 1ms, 2ms, 3ms, etc up to 10ms.
  • In the end, the p99 will be ~10ms, the p50 will be ~5ms, and the average will be 5.5ms.

I considered opening a PR to change the implementation to (1) above, but figured I would open this issue first to make sure I'm not missing an important use-case for (2). Happy to submit the PR though if not!

Collector version

N/A

