
OpenTelemetry Collector does not gracefully shutdown, losing metrics on spot instance termination #33441

Closed as not planned
@Rommmmm

Description


Component(s)

datadogexporter

What happened?

Description

We are currently experiencing an issue with the OpenTelemetry Collector running in our Kubernetes cluster, which is managed by Karpenter. Our setup involves spot instances, and we've noticed that when Karpenter terminates these instances, the OpenTelemetry Collector does not shut down gracefully. As a result, we lose metrics and traces that were presumably still in flight or queued for export.

Steps to Reproduce

  1. Deploy the OpenTelemetry Collector on a Kubernetes cluster with Karpenter managing spot instances.
  2. Simulate a spot instance termination (or simply terminate a node in the cluster).
  3. Observe that the metrics and traces during the termination period are lost.
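For reference, step 2 can be simulated without waiting for a real spot interruption by draining the node, which triggers the same pod eviction Karpenter performs. This is a sketch; the label selector and node name are placeholders for your own deployment:

```shell
# Find which node is running the collector pod
# (label selector is a placeholder; adjust to your deployment).
kubectl get pods -l app.kubernetes.io/name=opentelemetry-collector -o wide

# Drain that node to evict the pod, mimicking what Karpenter does
# during a spot interruption.
kubectl drain <NODE_NAME> --ignore-daemonsets --delete-emptydir-data
```

Comparing the collector's own telemetry (accepted vs. sent data points) from before and after the drain makes the gap visible.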

Expected Result

The OpenTelemetry Collector should flush all pending metrics and traces before shutting down to ensure no data is lost during spot instance termination.

Actual Result

During a spot termination event triggered by Karpenter, the OpenTelemetry Collector shuts down without flushing all the data, causing loss of metrics and traces.

Collector version

0.95.0

Environment information

Environment

Kubernetes Version: 1.27
Karpenter Version: 0.35.2
Cloud Provider: AWS

OpenTelemetry Collector configuration

connectors:
  datadog/connector: null
exporters:
  datadog:
    api:
      fail_on_invalid_key: true
      key: <KEY>
      site: <SITE>
    host_metadata:
      enabled: false
    metrics:
      histograms:
        mode: distributions
        send_count_sum_metrics: true
      instrumentation_scope_metadata_as_tags: true
      resource_attributes_as_tags: true
      sums:
        cumulative_monotonic_mode: raw_value
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 600s
      max_interval: 20s
    sending_queue:
      enabled: true
      num_consumers: 100
      queue_size: 3000
    traces:
      trace_buffer: 30
  debug: {}
  logging: {}
extensions:
  health_check:
    endpoint: <HEALTHCHECK>
processors:
  batch:
    send_batch_max_size: 3000
    send_batch_size: 2000
    timeout: 3s
  memory_limiter:
    check_interval: 5s
    limit_mib: 1800
    spike_limit_mib: 750
receivers:
  carbon:
    endpoint: <CARBON>
  otlp:
    protocols:
      grpc:
        endpoint: <ENDPOINT>
      http:
        endpoint: <ENDPOINT>
  prometheus:
    config:
      scrape_configs:
      - job_name: <JOB_NAME>
        scrape_interval: 30s
        static_configs:
        - targets:
          - <ENDPOINT>
  statsd:
    aggregation_interval: 60s
    endpoint: <ENDPOINT>
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - datadog
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
    metrics:
      exporters:
      - datadog
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
      - carbon
      - statsd
      - prometheus
      - datadog/connector
    traces:
      exporters:
      - datadog
      - datadog/connector
      processors:
      - memory_limiter
      - batch
      - resource
      receivers:
      - otlp
  telemetry:
    metrics:
      address: <ENDPOINT>

Log output

No response

Additional context

I noticed that Kubernetes Deployments support a terminationGracePeriodSeconds setting that can give workloads more time to shut down. However, this option does not seem to be exposed in the OpenTelemetry Collector Helm chart.

I would like to suggest the following enhancements:

  1. Expose the terminationGracePeriodSeconds parameter in the Helm chart to allow users to specify a custom grace period.
  2. Review the shutdown procedure of the OpenTelemetry Collector to ensure that it attempts to flush all buffered data before exiting.
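For suggestion 1, the rendered pod spec would only need the fields below. This is a hypothetical patch, not chart syntax: the 60s value is arbitrary, the container name is a placeholder, and the preStop hook assumes a sleep binary exists in the collector image (distroless images may not ship one):

```yaml
# Hypothetical pod-spec override for the collector Deployment.
spec:
  template:
    spec:
      # Allow more time than the 30s Kubernetes default for the
      # collector to flush queued metrics and traces.
      terminationGracePeriodSeconds: 60
      containers:
        - name: opentelemetry-collector
          lifecycle:
            preStop:
              exec:
                # Brief delay so the pod is removed from Service
                # endpoints before the collector receives SIGTERM.
                command: ["sleep", "5"]
```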
