Component(s)
datadogexporter
What happened?
Description
We are experiencing an issue with the OpenTelemetry Collector running in a Kubernetes cluster managed by Karpenter. Our nodes run on spot instances, and when Karpenter terminates an instance, the Collector does not appear to shut down gracefully. As a result, we lose metrics and traces that are presumably still being processed or exported.
Steps to Reproduce
- Deploy the OpenTelemetry Collector on a Kubernetes cluster with Karpenter managing spot instances.
- Simulate a spot instance termination (or simply terminate a node in the cluster).
- Observe that metrics and traces from the termination window are lost.
Expected Result
The OpenTelemetry Collector should flush all pending metrics and traces before shutting down to ensure no data is lost during spot instance termination.
Actual Result
During a spot termination event triggered by Karpenter, the OpenTelemetry Collector shuts down without flushing all the data, causing loss of metrics and traces.
Collector version
0.95.0
Environment information
Environment
Kubernetes Version: 1.27
Karpenter Version: 0.35.2
Cloud Provider: AWS
OpenTelemetry Collector configuration
connectors:
  datadog/connector: null
exporters:
  datadog:
    api:
      fail_on_invalid_key: true
      key: <KEY>
      site: <SITE>
    host_metadata:
      enabled: false
    metrics:
      histograms:
        mode: distributions
        send_count_sum_metrics: true
      instrumentation_scope_metadata_as_tags: true
      resource_attributes_as_tags: true
      sums:
        cumulative_monotonic_mode: raw_value
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 600s
      max_interval: 20s
    sending_queue:
      enabled: true
      num_consumers: 100
      queue_size: 3000
    traces:
      trace_buffer: 30
  debug: {}
  logging: {}
extensions:
  health_check:
    endpoint: <HEALTHCHECK>
processors:
  batch:
    send_batch_max_size: 3000
    send_batch_size: 2000
    timeout: 3s
  memory_limiter:
    check_interval: 5s
    limit_mib: 1800
    spike_limit_mib: 750
receivers:
  carbon:
    endpoint: <CARBON>
  otlp:
    protocols:
      grpc:
        endpoint: <ENDPOINT>
      http:
        endpoint: <ENDPOINT>
  prometheus:
    config:
      scrape_configs:
        - job_name: <JOB_NAME>
          scrape_interval: 30s
          static_configs:
            - targets:
                - <ENDPOINT>
  statsd:
    aggregation_interval: 60s
    endpoint: <ENDPOINT>
service:
  extensions:
    - health_check
  pipelines:
    logs:
      exporters:
        - datadog
      processors:
        - memory_limiter
        - batch
        - resource
      receivers:
        - otlp
    metrics:
      exporters:
        - datadog
      processors:
        - memory_limiter
        - batch
        - resource
      receivers:
        - otlp
        - carbon
        - statsd
        - prometheus
        - datadog/connector
    traces:
      exporters:
        - datadog
        - datadog/connector
      processors:
        - memory_limiter
        - batch
        - resource
      receivers:
        - otlp
  telemetry:
    metrics:
      address: <ENDPOINT>
Log output
No response
Additional context
I noticed that Kubernetes Deployments have a terminationGracePeriodSeconds setting that can give workloads more time to shut down. However, this option does not appear to be exposed in the OpenTelemetry Collector Helm chart.
I would like to suggest the following enhancements:
- Expose the terminationGracePeriodSeconds parameter in the Helm chart to allow users to specify a custom grace period (see the pod-spec sketch after this list).
- Review the shutdown procedure of the OpenTelemetry Collector to ensure that it attempts to flush all buffered data before exiting.
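For illustration, here is a minimal sketch of the standard Kubernetes pod-spec field the chart would need to render. This is not a chart value that exists today; the Deployment name, labels, image tag, and the 120-second value are placeholders I chose for the example.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      # Allow more time than the 30-second default for the collector to
      # finish exporting in-flight data after it receives SIGTERM.
      terminationGracePeriodSeconds: 120
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.95.0

Whether 120 seconds is appropriate would depend on queue depth and exporter latency; the value here is only a placeholder for discussion.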