Component(s)
processor/tailsampling
What happened?
Description
There is a data race during the policy evaluation process. It occurs when the following events happen concurrently:
- a policy evaluation for a certain trace is in progress
- new spans arrive for the same trace
The problem is that when some evaluation rules "copy" a slice of spans, they actually copy only a pointer (1, 2) to the underlying reflect.SliceHeader, not the values. As a result, there is a concurrent read (evaluation) and write (appending spans) on the same slice, which can lead to a panic.
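Below is a minimal, standalone sketch (my illustration, not the processor's code) of the underlying behaviour: copying a ptrace.Traces value copies only a small wrapper that points at shared storage, so the two goroutines below read and write the same slice. Running it with go run -race reports a race analogous to the one in the log output further down.

```go
package main

import (
	"sync"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

func main() {
	td := ptrace.NewTraces()
	copied := td // copies only the wrapper struct; the span storage is shared

	var wg sync.WaitGroup
	wg.Add(2)

	// Writer: simulates ConsumeTraces appending newly arrived spans to the trace.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			td.ResourceSpans().AppendEmpty()
		}
	}()

	// Reader: simulates a policy evaluation iterating over the "copied" spans.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = copied.ResourceSpans().Len()
		}
	}()

	wg.Wait()
}
```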
The data race appears for the following evaluation policies:
- latency
- boolean_attribute
- ottl_condition
- numeric_attribute
- status_code
- string_attribute
- trace_state
Steps to Reproduce
- Add one of the evaluation policies listed above to the tail_sampling config
- Build the collector with the Go race detector enabled (-race)
- Start the collector
- Continuously send spans with the same trace ID to the collector; for better results, send as many spans per second as possible (see the sketch below)
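For reference, this is the kind of load generator that can be used; it is a sketch, not part of the original setup, and it assumes an OTLP gRPC receiver on localhost:4317 (the config below happens to use a Jaeger receiver, but any receiver works). The endpoint and IDs are placeholders. Every span is created as a child of the same fixed remote parent, so all spans carry the same trace ID.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	ctx := context.Background()

	// Exporter pointed at the collector; adjust the endpoint to your setup.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithInsecure(),
		otlptracegrpc.WithEndpoint("localhost:4317"),
	)
	if err != nil {
		log.Fatal(err)
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	tracer := tp.Tracer("tailsampling-race-repro")

	// Fixed trace ID: every span below is a child of this remote parent,
	// so all spans belong to the same trace.
	tid, _ := trace.TraceIDFromHex("0123456789abcdef0123456789abcdef")
	sid, _ := trace.SpanIDFromHex("0123456789abcdef")
	parent := trace.NewSpanContext(trace.SpanContextConfig{
		TraceID:    tid,
		SpanID:     sid,
		TraceFlags: trace.FlagsSampled,
		Remote:     true,
	})
	parentCtx := trace.ContextWithSpanContext(ctx, parent)

	// Send spans as fast as possible so new spans keep arriving while
	// policies for the same trace are being evaluated.
	for {
		_, span := tracer.Start(parentCtx, "repro-span")
		span.End()
	}
}
```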
Expected Result
There are no panics or data race messages in the collector console output.
Actual Result
There are panics or data race messages in the output.
Collector version
v0.81.0
Environment information
any
OpenTelemetry Collector configuration
receivers:
  # any convenient receiver could be used
  jaeger:
    protocols:
      thrift_compact:
        endpoint: 0.0.0.0:6831
        max_packet_size: 65000
        queue_size: 100
        workers: 10

processors:
  tail_sampling:
    num_traces: 100500
    decision_wait: 2s
    # duplicated policies are used to increase evaluation duration
    policies: [
        {
          name: latency-composite-policy,
          type: latency,
          latency: { threshold_ms: 100 }
        },
        {
          name: latency-composite-policy,
          type: latency,
          latency: { threshold_ms: 100 }
        },
        {
          name: latency-composite-policy,
          type: latency,
          latency: { threshold_ms: 100 }
        },
        {
          name: latency-composite-policy,
          type: latency,
          latency: { threshold_ms: 100 }
        },
      ]

exporters:
  logging:
    loglevel: error

service:
  telemetry:
    logs:
      level: info
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [tail_sampling]
      exporters: [logging]
Log output
2023-07-16T17:15:26.398+0300 info service/telemetry.go:81 Setting up own telemetry...
2023-07-16T17:15:26.398+0300 info service/telemetry.go:104 Serving Prometheus metrics {"address": ":8888", "level": "Detailed"}
2023-07-16T17:15:26.399+0300 info [email protected]/exporter.go:275 Development component. May change in the future. {"kind": "exporter", "data_type": "traces", "name": "logging"}
2023-07-16T17:15:26.399+0300 warn [email protected]/factory.go:98 'loglevel' option is deprecated in favor of 'verbosity'. Set 'verbosity' to equivalent value to preserve behavior. {"kind": "exporter", "data_type": "traces", "name": "logging", "loglevel": "error", "equivalent verbosity level": "Basic"}
2023-07-16T17:15:26.401+0300 info service/service.go:131 Starting otelcol-avito... {"Version": "0.81.0", "NumCPU": 10}
2023-07-16T17:15:26.401+0300 info extensions/extensions.go:30 Starting extensions...
2023-07-16T17:15:26.401+0300 info service/service.go:148 Everything is ready. Begin running and processing data.
==================
WARNING: DATA RACE
Write at 0x00c000412198 by goroutine 43:
go.opentelemetry.io/collector/pdata/ptrace.ResourceSpansSlice.AppendEmpty()
/Users/salarionenko/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/ptrace/generated_resourcespansslice.go:82 +0x12c
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.appendToTraces()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:425 +0x44
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).processTraces()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:364 +0x484
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).ConsumeTraces()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:296 +0x60
go.opentelemetry.io/collector/consumer.Traces.ConsumeTraces-fm()
<autogenerated>:1 +0x60
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces()
/Users/salarionenko/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/traces.go:25 +0x64
go.opentelemetry.io/collector/service/internal/graph.(*capabilitiesNode).ConsumeTraces()
<autogenerated>:1 +0x20
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver.consumeTraces()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/trace_receiver.go:169 +0xa0
github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver.(*agentHandler).EmitBatch()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/trace_receiver.go:202 +0xa8
github.com/jaegertracing/jaeger/thrift-gen/agent.(*agentProcessorEmitBatch).Process()
/Users/salarionenko/go/pkg/mod/github.com/jaegertracing/[email protected]/thrift-gen/agent/agent.go:178 +0xec
github.com/jaegertracing/jaeger/thrift-gen/agent.(*AgentProcessor).Process()
/Users/salarionenko/go/pkg/mod/github.com/jaegertracing/[email protected]/thrift-gen/agent/agent.go:125 +0x264
github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer()
/Users/salarionenko/go/pkg/mod/github.com/jaegertracing/[email protected]/cmd/agent/app/processors/thrift_processor.go:122 +0x27c
github.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2()
/Users/salarionenko/go/pkg/mod/github.com/jaegertracing/[email protected]/cmd/agent/app/processors/thrift_processor.go:87 +0x2c
Previous read at 0x00c000412198 by goroutine 34:
go.opentelemetry.io/collector/pdata/ptrace.ResourceSpansSlice.Len()
/Users/salarionenko/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/ptrace/generated_resourcespansslice.go:41 +0x4c
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.hasSpanWithCondition()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/sampling/util.go:57 +0x2c
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.(*latency).Evaluate()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/sampling/latency.go:41 +0xbc
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).makeDecision()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:222 +0x220
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).samplingPolicyOnTick()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:179 +0x7ec
github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).samplingPolicyOnTick-fm()
<autogenerated>:1 +0x34
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).OnTick()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/internal/[email protected]/timeutils/ticker_helper.go:45 +0x4c
github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).Start.func1()
/Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/internal/[email protected]/timeutils/ticker_helper.go:36 +0x30
Additional context
There is no cheap way to copy the slice underlying ptrace.Traces. Therefore, one possible fix is to extend the per-trace spans lock so that it also covers policy evaluation, as sketched below.
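A rough sketch of the idea with illustrative names (traceData, addSpans and evaluate are placeholders, not the actual tailsamplingprocessor identifiers): the per-trace mutex is held for the whole policy evaluation, so newly arrived spans cannot be appended while a policy iterates the same slice.

```go
package main

import (
	"sync"

	"go.opentelemetry.io/collector/pdata/ptrace"
)

// traceData mirrors the idea of the processor's per-trace state:
// a mutex guarding the batches received so far.
type traceData struct {
	sync.Mutex
	receivedBatches ptrace.Traces
}

// addSpans is the write path: new spans for the trace are appended under the lock.
func (t *traceData) addSpans(from ptrace.Traces) {
	t.Lock()
	defer t.Unlock()
	from.ResourceSpans().MoveAndAppendTo(t.receivedBatches.ResourceSpans())
}

// evaluate is the read path: instead of locking only while "copying" the slice,
// the lock is extended over the whole evaluation.
func (t *traceData) evaluate(policies []func(ptrace.Traces) bool) []bool {
	t.Lock()
	defer t.Unlock()
	decisions := make([]bool, len(policies))
	for i, p := range policies {
		decisions[i] = p(t.receivedBatches)
	}
	return decisions
}

func main() {
	td := &traceData{receivedBatches: ptrace.NewTraces()}

	batch := ptrace.NewTraces()
	batch.ResourceSpans().AppendEmpty()
	td.addSpans(batch)

	// A trivial "latency-like" policy: sample if the trace has any resource spans.
	hasSpans := func(tr ptrace.Traces) bool { return tr.ResourceSpans().Len() > 0 }
	_ = td.evaluate([]func(ptrace.Traces) bool{hasSpans})
}
```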
In my fork I have already fixed the problem and written some unit tests. After applying this fix in production, I haven't noticed any side effects: CPU consumption has remained the same, and the processing pipeline hasn't slowed down.