Skip to content

[cmd/opampsupervisor] Data race on access to Supervisor.lastHealthFromClient #40207

Closed
@douglascamata

Description

@douglascamata

Component(s)

cmd/opampsupervisor

What happened?

Description

While writing some tests that run the Supervisor with race detection on I got this report:

==================
WARNING: DATA RACE
Read at 0x00c00038bf18 by goroutine 466:
  github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).runAgentProcess()
      /Users/douglas/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/cmd/[email protected]/supervisor/supervisor.go:1392 +0xe80
  github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).Start.func1()
      /Users/douglas/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/cmd/[email protected]/supervisor/supervisor.go:344 +0x78

Previous write at 0x00c00038bf18 by goroutine 497:
  github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).handleAgentOpAMPMessage()
      /Users/douglas/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/cmd/[email protected]/supervisor/supervisor.go:774 +0x720
  github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.(*Supervisor).handleAgentOpAMPMessage-fm()
      <autogenerated>:1 +0x4c
  github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.flattenedSettings.OnMessage()
      /Users/douglas/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/cmd/[email protected]/supervisor/server.go:57 +0x6c
  github.com/open-telemetry/opentelemetry-collector-contrib/cmd/opampsupervisor/supervisor.flattenedSettings.OnMessage-fm()
      <autogenerated>:1 +0x30
  github.com/open-telemetry/opamp-go/server.(*server).handleWSConnection()
      /Users/douglas/go/pkg/mod/github.com/open-telemetry/[email protected]/server/serverimpl.go:253 +0x3cc
  github.com/open-telemetry/opamp-go/server.(*server).httpHandler.gowrap1()
      /Users/douglas/go/pkg/mod/github.com/open-telemetry/[email protected]/server/serverimpl.go:204 +0x64

This is pointing to an access to Supervisor.lastHealthFromClient from inside Supervisor.runAgentProcess(), which runs in its own go-routine:

case <-configApplyTimeoutTimer.C:
if s.lastHealthFromClient == nil || !s.lastHealthFromClient.Healthy {
s.reportConfigStatus(protobufs.RemoteConfigStatuses_RemoteConfigStatuses_FAILED, "Config apply timeout exceeded")
} else {
s.reportConfigStatus(protobufs.RemoteConfigStatuses_RemoteConfigStatuses_APPLIED, "")
}

And a concurrent access also here, from a different go-routine:

if message.Health != nil {
s.telemetrySettings.Logger.Debug("Received health status from agent", zap.Bool("healthy", message.Health.Healthy))
s.lastHealthFromClient = message.Health
err := s.opampClient.SetHealth(message.Health)
if err != nil {
s.telemetrySettings.Logger.Error("Could not report health to OpAMP server", zap.Error(err))
}
}

This happens without any synchronization, so a data race is possible.

Steps to Reproduce

Set up a Supervisor in a test, start an OpAMP server and let them talk. I think this should be enough. Here's a small snippet that I am using to start the Supervisor:

func setupSupervisor(t *testing.T, opampServerHostPort string) (*supervisor.Supervisor, string) {
    t.Helper()

	supervisorStorageDir := t.TempDir()
	otelcolBinName := "otelcol-contrib"
	otelcolBinPath := filepath.Join("..", "..", "bin", otelcolBinName)
	if _, err := os.Stat(otelcolBinPath); err != nil {
		t.Fatalf("failed to setup supervisor: %v", err)
	}

	supervisorConfig := supervisorconfig.Supervisor{
		Server: supervisorconfig.OpAMPServer{
			Endpoint: "http://" + opampServerHostPort + "/v1/opamp",
		},
		Capabilities: supervisorconfig.Capabilities{
			AcceptsRemoteConfig:            true,
			AcceptsRestartCommand:          true,
			AcceptsOpAMPConnectionSettings: true,
			ReportsEffectiveConfig:         true,
			ReportsOwnMetrics:              true,
			ReportsOwnLogs:                 true,
			ReportsOwnTraces:               true,
			ReportsHealth:                  true,
			ReportsRemoteConfig:            true,
			ReportsAvailableComponents:     true,
		},
		Agent: supervisorconfig.Agent{
			Executable:              otelcolBinPath,
			ConfigApplyTimeout:      3 * time.Second,
			OrphanDetectionInterval: 3 * time.Second,
			BootstrapTimeout:        3 * time.Second,
			Arguments:               []string{"--feature-gates=+service.AllowNoPipelines"},
		},
		Storage: supervisorconfig.Storage{
			Directory: supervisorStorageDir,
		},
	}

	logger, err := zap.NewDevelopment()
	if err != nil {
		t.Fatalf("failed to create logger: %v", err)
	}

	sup, err := supervisor.NewSupervisor(logger, supervisorConfig)
	if err != nil {
		t.Fatalf("failed to create supervisor: %v", err)
	}

	t.Cleanup(sup.Shutdown)
	return sup, supervisorStorageDir
}

Expected Result

No data race. Any data potentially accessed concurrently by different go-routines is protected by mutexes or through the use of atomics.

Actual Result

A data race happens :(

Collector version

da60438

Environment information

Environment

OS: macOS
Compiler(if manually compiled): go version go1.24.3 darwin/arm64

OpenTelemetry Collector configuration

Irrelevant.

Log output

Additional context

No response

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions