
[processor/tailsampling] Data race during evaluation #24283

Closed
@larsn777

Description


Component(s)

processor/tailsampling

What happened?

Description

There is a data race during the policy evaluation process. The race occurs when the following events happen concurrently:

  • policy evaluation for a given trace is in progress
  • new spans arrive for the same trace

When some evaluation policies try to copy the slice of spans, they actually copy a pointer (1,2) to the reflect.SliceHeader, not the underlying data. As a result, the same slice is read (evaluation) and written (appending spans) concurrently, which can lead to a panic.
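
For illustration, here is a minimal, self-contained sketch (not the processor's code) that reproduces the same class of race using nothing but the pdata API; the loop counts are arbitrary. Running it with go run -race reports the same concurrent Len/AppendEmpty access that shows up in the log output below.

package main

import (
    "sync"

    "go.opentelemetry.io/collector/pdata/ptrace"
)

func main() {
    // Assigning a ptrace.Traces value copies an internal pointer, not the
    // spans, so both variables share the same underlying storage.
    td := ptrace.NewTraces()
    shared := td

    var wg sync.WaitGroup
    wg.Add(2)

    // Writer: simulates ConsumeTraces appending newly arrived spans.
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            td.ResourceSpans().AppendEmpty()
        }
    }()

    // Reader: simulates a policy iterating the same trace during evaluation.
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            _ = shared.ResourceSpans().Len()
        }
    }()

    wg.Wait()
}

Because td and shared wrap the same internal pointer, the writer and reader goroutines touch the same slice, which is exactly the situation of new spans arriving while a policy evaluates the trace.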

The data race appears with the following evaluation policies:

  • latency
  • boolean_attribute
  • ottl_condition
  • numeric_attribute
  • status_code
  • string_attribute
  • trace_state

Steps to Reproduce

  • Add one of the evaluation policies described above to the tail sampling config
  • Build the collector with the race detector enabled (the -race flag)
  • Start the collector
  • Continuously send spans with the same trace ID to the collector. For better results, send as many spans per second as possible

Expected Result

There are no panics or data race messages in the collector console output

Actual Result

There are panics or data race messages in the collector output

Collector version

v0.81.0

Environment information

any

OpenTelemetry Collector configuration

receivers:
  # any convenient receiver could be used
  jaeger:
    protocols:
      thrift_compact:
        endpoint: 0.0.0.0:6831
        max_packet_size: 65000
        queue_size: 100
        workers: 10

processors:
  tail_sampling:
    num_traces: 100500
    decision_wait: 2s
    # duplicated policies are used to increase evaluation duration
    policies: [
      {
        name: latency-composite-policy,
        type: latency,
        latency: { threshold_ms: 100 }
      },
      {
        name: latency-composite-policy,
        type: latency,
        latency: { threshold_ms: 100 }
      },
      {
        name: latency-composite-policy,
        type: latency,
        latency: { threshold_ms: 100 }
      },
      {
        name: latency-composite-policy,
        type: latency,
        latency: { threshold_ms: 100 }
      },
    ]

exporters:
  logging:
    loglevel: error

service:
  telemetry:
    logs:
      level: info
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [tail_sampling]
      exporters: [logging]

Log output

2023-07-16T17:15:26.398+0300	info	service/telemetry.go:81	Setting up own telemetry...
2023-07-16T17:15:26.398+0300	info	service/telemetry.go:104	Serving Prometheus metrics	{"address": ":8888", "level": "Detailed"}
2023-07-16T17:15:26.399+0300	info	[email protected]/exporter.go:275	Development component. May change in the future.	{"kind": "exporter", "data_type": "traces", "name": "logging"}
2023-07-16T17:15:26.399+0300	warn	[email protected]/factory.go:98	'loglevel' option is deprecated in favor of 'verbosity'. Set 'verbosity' to equivalent value to preserve behavior.	{"kind": "exporter", "data_type": "traces", "name": "logging", "loglevel": "error", "equivalent verbosity level": "Basic"}
2023-07-16T17:15:26.401+0300	info	service/service.go:131	Starting otelcol-avito...	{"Version": "0.81.0", "NumCPU": 10}
2023-07-16T17:15:26.401+0300	info	extensions/extensions.go:30	Starting extensions...
2023-07-16T17:15:26.401+0300	info	service/service.go:148	Everything is ready. Begin running and processing data.
==================
WARNING: DATA RACE
Write at 0x00c000412198 by goroutine 43:
  go.opentelemetry.io/collector/pdata/ptrace.ResourceSpansSlice.AppendEmpty()
      /Users/salarionenko/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/ptrace/generated_resourcespansslice.go:82 +0x12c
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.appendToTraces()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:425 +0x44
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).processTraces()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:364 +0x484
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).ConsumeTraces()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:296 +0x60
  go.opentelemetry.io/collector/consumer.Traces.ConsumeTraces-fm()
      <autogenerated>:1 +0x60
  go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces()
      /Users/salarionenko/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/traces.go:25 +0x64
  go.opentelemetry.io/collector/service/internal/graph.(*capabilitiesNode).ConsumeTraces()
      <autogenerated>:1 +0x20
  github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver.consumeTraces()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/trace_receiver.go:169 +0xa0
  github.com/open-telemetry/opentelemetry-collector-contrib/receiver/jaegerreceiver.(*agentHandler).EmitBatch()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/receiver/[email protected]/trace_receiver.go:202 +0xa8
  github.com/jaegertracing/jaeger/thrift-gen/agent.(*agentProcessorEmitBatch).Process()
      /Users/salarionenko/go/pkg/mod/github.com/jaegertracing/[email protected]/thrift-gen/agent/agent.go:178 +0xec
  github.com/jaegertracing/jaeger/thrift-gen/agent.(*AgentProcessor).Process()
      /Users/salarionenko/go/pkg/mod/github.com/jaegertracing/[email protected]/thrift-gen/agent/agent.go:125 +0x264
  github.com/jaegertracing/jaeger/cmd/agent/app/processors.(*ThriftProcessor).processBuffer()
      /Users/salarionenko/go/pkg/mod/github.com/jaegertracing/[email protected]/cmd/agent/app/processors/thrift_processor.go:122 +0x27c
  github.com/jaegertracing/jaeger/cmd/agent/app/processors.NewThriftProcessor.func2()
      /Users/salarionenko/go/pkg/mod/github.com/jaegertracing/[email protected]/cmd/agent/app/processors/thrift_processor.go:87 +0x2c

Previous read at 0x00c000412198 by goroutine 34:
  go.opentelemetry.io/collector/pdata/ptrace.ResourceSpansSlice.Len()
      /Users/salarionenko/go/pkg/mod/go.opentelemetry.io/collector/[email protected]/ptrace/generated_resourcespansslice.go:41 +0x4c
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.hasSpanWithCondition()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/sampling/util.go:57 +0x2c
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling.(*latency).Evaluate()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/internal/sampling/latency.go:41 +0xbc
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).makeDecision()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:222 +0x220
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).samplingPolicyOnTick()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:179 +0x7ec
  github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor.(*tailSamplingSpanProcessor).samplingPolicyOnTick-fm()
      <autogenerated>:1 +0x34
  github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).OnTick()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/internal/[email protected]/timeutils/ticker_helper.go:45 +0x4c
  github.com/open-telemetry/opentelemetry-collector-contrib/internal/coreinternal/timeutils.(*PolicyTicker).Start.func1()
      /Users/salarionenko/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/internal/[email protected]/timeutils/ticker_helper.go:36 +0x30

Additional context

There is no cheap way to deep-copy the underlying slice of a ptrace.Traces. Therefore, one possible fix is to extend the per-trace spans lock so that it also covers the evaluation.
In my fork I have already fixed the problem and written some unit tests. After applying this fix in production I haven't noticed any side effects: CPU consumption has stayed the same and the processing pipeline hasn't slowed down.
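
A minimal sketch of that approach, assuming a per-trace mutex guards the received batches (the traceData type and its method names below are illustrative, not the processor's actual API): the append path and the evaluation path take the same lock, so spans cannot be appended while a policy iterates them.

package main

import (
    "sync"

    "go.opentelemetry.io/collector/pdata/ptrace"
)

// traceData is a simplified stand-in for the processor's per-trace state;
// field and method names are hypothetical.
type traceData struct {
    sync.Mutex
    received ptrace.Traces
}

// addSpans is what ConsumeTraces effectively does when more spans arrive.
func (t *traceData) addSpans(incoming ptrace.Traces) {
    t.Lock()
    defer t.Unlock()
    incoming.ResourceSpans().MoveAndAppendTo(t.received.ResourceSpans())
}

// evaluate holds the same lock for the whole policy evaluation, so the
// slice cannot be mutated while it is being read.
func (t *traceData) evaluate() int {
    t.Lock()
    defer t.Unlock()
    return t.received.ResourceSpans().Len() // stand-in for a real Evaluate call
}

func main() {
    td := &traceData{received: ptrace.NewTraces()}

    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            in := ptrace.NewTraces()
            in.ResourceSpans().AppendEmpty()
            td.addSpans(in)
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            _ = td.evaluate()
        }
    }()
    wg.Wait()
}

Run with go run -race: with the lock held across the whole evaluation the detector stays quiet, while removing the locking in evaluate() brings back a report like the one above.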
