Component(s)
processor/tailsampling
What happened?
Description
After #37722, I realized there is another issue with the decision timer latency metric.
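Basically, it currently measures the latency from the start of policy evaluation for a batch until just after each trace is evaluated, so every observation includes the time spent on all earlier traces in the same batch. This isn't very useful. Consider the following scenario: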
- A batch with 10 traces, each taking 1ms to evaluate all policies
Steps to Reproduce
Run the tail sampling processor (tsp) and observe the otelcol_processor_tail_sampling_sampling_decision_timer_latency metric.
Expected Result
I would expect this metric to report one of two things, either:
- (1) The total time to evaluate a batch (in this case 10ms)
- (2) The time to evaluate each trace (in this case 1ms)
IMO (1) is more useful, since it's a direct indication of whether the tsp is at risk of falling behind on processing traces.
Actual Result
- The histogram will record times of 1ms, 2ms, 3ms, etc., up to 10ms.
- In the end the p99 will be close to 10ms, p50 will be ~5ms, and the average will be 5.5ms.
I considered opening a PR to change the implementation to (1) above, but figured I would open this issue first to make sure I'm not missing an important use-case for (2). Happy to submit the PR though if not!
Collector version
N/A