Component(s)
processor/tailsampling
What happened?
Description
From my understanding of the tailsamplingprocessor README, whenever an "inverted not sample" decision is made, that trace should not be sampled, regardless of any other policies. Specifically, it states:
Each policy will result in a decision, and the processor will evaluate them to make a final decision:
- When there's an "inverted not sample" decision, the trace is not sampled;
- When there's a "sample" decision, the trace is sampled;
- When there's a "inverted sample" decision and no "not sample" decisions, the trace is sampled;
- In all other cases, the trace is NOT sampled
The "inverted not sample" decision having the highest precedence seems to be confirmed by how this is actually implemented here.
Based upon that I have been trying to put together a policy similar to the following:
- If a trace has a certain attribute AND it only contains 1 span → do not sample it
- For all other traces → sample them
In practice our policy is more complicated than this but I've simplified it as much as possible for the purpose of explaining this issue.
The policy looks like this in config:
- name: ignore-single-span-traces-from-certain-jobs
  type: and
  and:
    and_sub_policy:
      - name: not-boring-job
        type: string_attribute
        string_attribute:
          key: job
          values: ["boring"]
          invert_match: true
      - name: at-least-two-spans
        type: span_count
        span_count:
          min_spans: 2
- name: always-sample
  type: always_sample
What I'm finding in practice, though, is that traces with the job attribute set to boring that only contain a single span are still being sampled. Unless I'm misunderstanding how the tail sampling processor works, this seems like a bug to me.
I debugged this a bit by adding several extra log statements into the processor to see what decisions it was making on each policy, and I believe the issue lies in how AND policies handle not sampled decisions here.
As you can see, regardless of whether the decision is NotSampled or InvertNotSampled, it always returns NotSampled. This is problematic because a NotSampled decision won't take precedence over a Sampled decision, so any InvertNotSampled decision that occurs within an AND policy is effectively downgraded to a NotSampled decision.
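For reference, this is roughly what the AND evaluator does today, paraphrased from the internal sampling package rather than quoted verbatim (the signature and names are from memory, so treat it as a sketch):

func (c *And) Evaluate(ctx context.Context, traceID pcommon.TraceID, trace *TraceData) (Decision, error) {
	for _, sub := range c.subpolicies {
		decision, err := sub.Evaluate(ctx, traceID, trace)
		if err != nil {
			return Unspecified, err
		}
		// Both NotSampled and InvertNotSampled are collapsed into NotSampled here,
		// which is where the inverted decision loses its precedence.
		if decision == NotSampled || decision == InvertNotSampled {
			return NotSampled, nil
		}
	}
	return Sampled, nil
}

So for the single-span boring trace, the not-boring-job sub-policy evaluates to InvertNotSampled, but the AND policy reports NotSampled, and the always-sample policy's Sampled decision then wins in the final evaluation.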
If it's useful for confirming this behaviour, these were the extra log statements I added:
diff --git a/processor/tailsamplingprocessor/processor.go b/processor/tailsamplingprocessor/processor.go
index b0a5850400..edf36cf931 100644
--- a/processor/tailsamplingprocessor/processor.go
+++ b/processor/tailsamplingprocessor/processor.go
@@ -256,6 +256,10 @@ func (tsp *tailSamplingSpanProcessor) samplingPolicyOnTick() {
 		trace.DecisionTime = time.Now()
 		decision := tsp.makeDecision(id, trace, &metrics)
+		tsp.logger.Info("Tail Sampling Decision",
+			zap.Int64("spans", trace.SpanCount.Load()),
+			zap.Int32("sampled", int32(decision)),
+		)
 		tsp.telemetry.ProcessorTailSamplingSamplingDecisionTimerLatency.Record(tsp.ctx, int64(time.Since(startTime)/time.Microsecond))
 		tsp.telemetry.ProcessorTailSamplingSamplingTraceDroppedTooEarly.Add(tsp.ctx, metrics.idNotFoundOnMapCount)
 		tsp.telemetry.ProcessorTailSamplingSamplingPolicyEvaluationError.Add(tsp.ctx, metrics.evaluateErrorCount)
@@ -309,6 +313,10 @@ func (tsp *tailSamplingSpanProcessor) makeDecision(id pcommon.TraceID, trace *sa
 			tsp.telemetry.ProcessorTailSamplingCountSpansSampled.Add(ctx, trace.SpanCount.Load(), p.attribute, decisionToAttribute[decision])
 		}
+		tsp.logger.Info("Sampling policy decision",
+			zap.Int32("decision", int32(decision)),
+			zap.String("policy", p.name),
+		)
 		samplingDecision[decision] = true
 	}
 }
@@ -317,10 +325,13 @@ func (tsp *tailSamplingSpanProcessor) makeDecision(id pcommon.TraceID, trace *sa
 	switch {
 	case samplingDecision[sampling.InvertNotSampled]:
 		finalDecision = sampling.NotSampled
+		tsp.logger.Info("Setting finalDecision to NotSampled via InvertNotSampled")
 	case samplingDecision[sampling.Sampled]:
 		finalDecision = sampling.Sampled
+		tsp.logger.Info("Setting finalDecision to Sampled via Sampled")
 	case samplingDecision[sampling.InvertSampled] && !samplingDecision[sampling.NotSampled]:
 		finalDecision = sampling.Sampled
+		tsp.logger.Info("Setting finalDecision to Sampled via InvertSampled && !NotSampled")
 	}
 	return finalDecision
I believe the fix for this would be to return the decision itself, so that either NotSampled or InvertNotSampled is returned as appropriate. I have tested this and it seems to fix the problem.
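Concretely, the fix I'm proposing looks roughly like this (a sketch of the idea, not the exact diff):

// Inside the AND evaluator's sub-policy loop: propagate the sub-policy's own
// decision instead of collapsing it, so an InvertNotSampled keeps its
// precedence when the final decision is made.
if decision == NotSampled || decision == InvertNotSampled {
	return decision, nil
}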
I've made those changes in a branch here: main...jamesrwhite:opentelemetry-collector-contrib:fix-and-policy-invert-not-sampled.
I'm happy to open a PR with the above changes if you agree that's the correct fix, also if you need any more information please let me know.
Thanks!
Steps to Reproduce
- Run collector with attached config
- Run sample application
from opentelemetry import trace
tracer = trace.get_tracer('worker')
print('Starting worker..')
with tracer.start_as_current_span('worker') as span:
    span.set_attribute('job', 'boring')
print('Done')
$ OTEL_SERVICE_NAME=service.worker opentelemetry-instrument python app.py
- Review collector logs and see that trace is sampled
Expected Result
Trace is not sampled.
Actual Result
Trace is sampled.
Collector version
0.102.0-dev
Environment information
Environment
OS: MacOS 14.5
Go: 1.21.0
Python: 3.12.4
❯ go env
GO111MODULE=''
GOARCH='arm64'
GOBIN='/Users/james.white/.local/share/mise/installs/go/1.21.0/bin'
GOCACHE='/Users/james.white/Library/Caches/go-build'
GOENV='/Users/james.white/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/james.white/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/james.white/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/Users/james.white/.local/share/mise/installs/go/1.21.0/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/Users/james.white/.local/share/mise/installs/go/1.21.0/go/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21.0'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/Users/james.white/Code/otel/opentelemetry-collector-contrib-fork/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/96/_bvls_n53qg78rl760qw21lm0000gq/T/go-build997096154=/tmp/go-build -gno-record-gcc-switches -fno-common'
OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: :4317
      http:
        endpoint: :4318
exporters:
  debug:
    verbosity: detailed
processors:
  tail_sampling:
    policies:
      - name: ignore-single-span-traces-from-certain-jobs
        type: and
        and:
          and_sub_policy:
            - name: not-boring-job
              type: string_attribute
              string_attribute:
                key: job
                values: ["boring"]
                invert_match: true
            - name: at-least-two-spans
              type: span_count
              span_count:
                min_spans: 2
      - name: always-sample
        type: always_sample
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [debug]
Log output
$ make run
cd ./cmd/otelcontribcol && GO111MODULE=on go run --race . --config ../../local/config.yaml
# github.com/open-telemetry/opentelemetry-collector-contrib/cmd/otelcontribcol
ld: warning: ignoring duplicate libraries: '-lproc'
2024-06-19T11:59:29.405+0100 info [email protected]/service.go:115 Setting up own telemetry...
2024-06-19T11:59:29.405+0100 info [email protected]/telemetry.go:96 Serving metrics {"address": ":8888", "level": "Normal"}
2024-06-19T11:59:29.405+0100 info [email protected]/exporter.go:280 Development component. May change in the future. {"kind": "exporter", "data_type": "traces", "name": "debug"}
2024-06-19T11:59:29.406+0100 info [email protected]/service.go:182 Starting otelcontribcol... {"Version": "0.102.0-dev", "NumCPU": 10}
2024-06-19T11:59:29.406+0100 info extensions/extensions.go:34 Starting extensions...
2024-06-19T11:59:29.406+0100 warn [email protected]/warning.go:42 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning. {"kind": "receiver", "name": "otlp", "data_type": "traces", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-06-19T11:59:29.407+0100 info [email protected]/otlp.go:102 Starting GRPC server {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": ":4317"}
2024-06-19T11:59:29.407+0100 warn [email protected]/warning.go:42 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. Enable the feature gate to change the default and remove this warning. {"kind": "receiver", "name": "otlp", "data_type": "traces", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks", "feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-06-19T11:59:29.407+0100 info [email protected]/otlp.go:152 Starting HTTP server {"kind": "receiver", "name": "otlp", "data_type": "traces", "endpoint": ":4318"}
2024-06-19T11:59:29.407+0100 info [email protected]/service.go:208 Everything is ready. Begin running and processing data.
2024-06-19T11:59:29.407+0100 warn localhostgate/featuregate.go:63 The default endpoints for all servers in components will change to use localhost instead of 0.0.0.0 in a future version. Use the feature gate to preview the new default. {"feature gate ID": "component.UseLocalHostAsDefaultHost"}
2024-06-19T13:38:26.131+0100 info TracesExporter {"kind": "exporter", "data_type": "traces", "name": "debug", "resource spans": 1, "spans": 1}
2024-06-19T13:38:26.133+0100 info ResourceSpans #0
Resource SchemaURL:
Resource attributes:
-> telemetry.sdk.language: Str(python)
-> telemetry.sdk.name: Str(opentelemetry)
-> telemetry.sdk.version: Str(1.25.0)
-> service.name: Str(service.worker)
-> telemetry.auto.version: Str(0.46b0)
ScopeSpans #0
ScopeSpans SchemaURL:
InstrumentationScope worker
Span #0
Trace ID : 92776c0eb3c2a56cab8827066d03b4c3
Parent ID :
ID : e295a5bacda12b5a
Name : worker
Kind : Internal
Start time : 2024-06-19 12:37:55.499827 +0000 UTC
End time : 2024-06-19 12:37:55.499846 +0000 UTC
Status code : Unset
Status message :
Attributes:
-> job: Str(boring)
{"kind": "exporter", "data_type": "traces", "name": "debug"}
Additional context
This is related to support being added for invert_match in AND policies in #9768.