Component(s)
datadogexporter
What happened?
Description
We are experiencing an issue with the OpenTelemetry Collector running in a Kubernetes cluster managed by Karpenter. Our nodes run on spot instances, and when Karpenter terminates an instance, the Collector does not appear to shut down gracefully. As a result, we lose metrics and traces that are presumably still being processed or exported.
Steps to Reproduce
- Deploy the OpenTelemetry Collector on a Kubernetes cluster with Karpenter managing spot instances.
- Simulate a spot instance termination (or simply terminate a node in the cluster).
- Observe that metrics and traces from the termination window are lost.
Expected Result
The OpenTelemetry Collector should flush all pending metrics and traces before shutting down to ensure no data is lost during spot instance termination.
Actual Result
During a spot termination event triggered by Karpenter, the OpenTelemetry Collector shuts down without flushing all the data, causing loss of metrics and traces.
Collector version
0.95.0
Environment information
Environment
Kubernetes Version: 1.27
Karpenter Version: 0.35.2
Cloud Provider: AWS
OpenTelemetry Collector configuration
connectors:
  datadog/connector: null
exporters:
  datadog:
    api:
      fail_on_invalid_key: true
      key: <KEY>
      site: <SITE>
    host_metadata:
      enabled: false
    metrics:
      histograms:
        mode: distributions
        send_count_sum_metrics: true
      instrumentation_scope_metadata_as_tags: true
      resource_attributes_as_tags: true
      sums:
        cumulative_monotonic_mode: raw_value
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_elapsed_time: 600s
      max_interval: 20s
    sending_queue:
      enabled: true
      num_consumers: 100
      queue_size: 3000
    traces:
      trace_buffer: 30
  debug: {}
  logging: {}
extensions:
  health_check:
    endpoint: <HEALTHCHECK>
processors:
  batch:
    send_batch_max_size: 3000
    send_batch_size: 2000
    timeout: 3s
  memory_limiter:
    check_interval: 5s
    limit_mib: 1800
    spike_limit_mib: 750
receivers:
  carbon:
    endpoint: <CARBON>
  otlp:
    protocols:
      grpc:
        endpoint: <ENDPOINT>
      http:
        endpoint: <ENDPOINT>
  prometheus:
    config:
      scrape_configs:
        - job_name: <JOB_NAME>
          scrape_interval: 30s
          static_configs:
            - targets:
                - <ENDPOINT>
  statsd:
    aggregation_interval: 60s
    endpoint: <ENDPOINT>
service:
  extensions:
    - health_check
  pipelines:
    logs:
      exporters:
        - datadog
      processors:
        - memory_limiter
        - batch
        - resource
      receivers:
        - otlp
    metrics:
      exporters:
        - datadog
      processors:
        - memory_limiter
        - batch
        - resource
      receivers:
        - otlp
        - carbon
        - statsd
        - prometheus
        - datadog/connector
    traces:
      exporters:
        - datadog
        - datadog/connector
      processors:
        - memory_limiter
        - batch
        - resource
      receivers:
        - otlp
  telemetry:
    metrics:
      address: <ENDPOINT>
Log output
No response
Additional context
I noticed that Kubernetes Deployments have a terminationGracePeriodSeconds setting that can give workloads more time to shut down. However, this option does not appear to be exposed in the OpenTelemetry Collector Helm chart.
I would like to suggest the following enhancements:
- Expose the terminationGracePeriodSeconds parameter in the Helm chart to allow users to specify a custom grace period (see the pod-spec sketch after this list).
- Review the shutdown procedure of the OpenTelemetry Collector to ensure that it attempts to flush all buffered data before exiting.
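For illustration, here is a minimal sketch of the standard Kubernetes pod-spec field the chart would need to render. This is not a chart value that exists today; the Deployment name, labels, image tag, and the 120-second value are placeholders I chose for the example.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
spec:
  replicas: 1
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      # Allow more time than the 30-second default for the collector to
      # finish exporting in-flight data after it receives SIGTERM.
      terminationGracePeriodSeconds: 120
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.95.0

Whether 120 seconds is appropriate would depend on queue depth and exporter latency; the value here is only a placeholder for discussion.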