### Component(s)
processor/k8sattributes
### What happened?

#### Description
We have otel deployed as a DaemonSet in our cluster and have noticed that, after a restart of the otel agent pods, Prometheus metrics are missing the k8s_deployment_name label even though it is extracted by the k8sattributes processor. After about 5 minutes the problem resolves and the label is present on the metrics. Notably, we have not observed this issue with the k8s_daemonset_name attribute. Looking at the debug logs, we found that the k8s.deployment.name attribute is missing from the resource while the other extracted metadata is present (see attached logs).
Our config sets the wait_for_metadata flag to true. The k8sattributes processor config we are currently using:
```yaml
k8sattributes:
  wait_for_metadata: true
  auth_type: "serviceAccount"
  passthrough: false
  filter:
    node_from_env_var: KUBE_NODE_NAME
  extract:
    metadata:
      - k8s.pod.name
      - k8s.pod.uid
      - k8s.deployment.name
      - k8s.deployment.uid
      - k8s.namespace.name
    labels:
      # Extracts the value of a pod label and inserts it as a resource attribute
      - tag_name: service_name
        key: tags.datadoghq.com/service
        from: pod
      - tag_name: version
        key: tags.datadoghq.com/version
        from: pod
      - tag_name: label_k8s_bluecore_com_team
        key: k8s.bluecore.com/team
        from: pod
  pod_association:
    # below association takes a look at the datapoint's k8s.pod.uid resource attribute and tries to match it with
    # the pod having the same attribute.
    - sources:
        - from: resource_attribute
          name: k8s.pod.uid
```
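For reference, we have not set the related `wait_for_metadata_timeout` option; if we read the processor README correctly, it bounds how long startup waits for the Kubernetes metadata to sync (we believe the default is 10s). A minimal sketch of what setting it explicitly would look like, purely as an illustration (the timeout value below is an assumption, not something we run):

```yaml
k8sattributes:
  # Block collector startup until the Kubernetes metadata informers have synced.
  wait_for_metadata: true
  # Assumption: upper bound on that wait; 10s is what we believe the default to be.
  wait_for_metadata_timeout: 10s
```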
#### Steps to Reproduce
Deploy otel with a Prometheus receiver and a pipeline that uses the k8sattributes processor with the config above, and use the debug exporter to output to stdout. For roughly 5 minutes after a restart of the otel pod, k8s_deployment_name is missing from the metrics; after that, the resource attribute appears.
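For convenience, here is a stripped-down sketch of that pipeline, distilled from the full configuration further below (the scrape relabeling, resource detection, and batching are omitted, so treat it as a reproduction aid rather than our exact deployment):

```yaml
receivers:
  prometheus/bc:
    config:
      scrape_configs:
        - job_name: bc-prom
          scrape_interval: 60s
          kubernetes_sd_configs:
            - role: pod
              selectors:
                - role: pod
                  # only scrape pods running on the same node as the collector
                  field: "spec.nodeName=${env:KUBE_NODE_NAME}"

processors:
  k8sattributes:
    wait_for_metadata: true
    auth_type: "serviceAccount"
    filter:
      node_from_env_var: KUBE_NODE_NAME
    extract:
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.namespace.name
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics/bc:
      receivers: [prometheus/bc]
      processors: [k8sattributes]
      exporters: [debug]
```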
#### Expected Result
After the otel pods restart, the k8s_deployment_name label should be on all exported metrics.
#### Actual Result
For about 5 minutes after the otel pods restart, exported Prometheus metrics are missing the k8s_deployment_name label, despite it being extracted by the k8sattributes processor.
### Collector version
v0.116.1
### Environment information

#### Environment
OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")
### OpenTelemetry Collector configuration
```yaml
receivers:
  prometheus/bc:
    config:
      scrape_configs:
        - job_name: bc-prom
          scrape_interval: 60s
          kubernetes_sd_configs:
            - role: pod
              selectors:
                - role: pod
                  # only scrape data from pods running on the same node as collector
                  field: "spec.nodeName=${env:KUBE_NODE_NAME}"
          relabel_configs:
            # scrape pods annotated with "prometheus.io/scrape: true"
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              regex: "true"
              action: keep
            # read the port from "prometheus.io/port: <port>" annotation and update scraping address accordingly
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              target_label: __address__
              regex: ([^:]+)(?::\d+)?;(\d+)
              # escaped $1:$2
              replacement: $$1:$$2
            # do not scrape init containers; the above does not catch this case because most init containers do not have ports defined
            - source_labels: [ __meta_kubernetes_pod_container_init ]
              regex: "false"
              action: keep
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: "(bc_.*|controller_runtime_.*|rest_client_requests_total|velero_.*|node_.*|certmanager_.*|kubelet_.*)"
              action: keep

processors:
  resourcedetection:
    detectors: [env, gcp]
    timeout: 5s
    gcp:
      resource_attributes:
        cloud.provider:
          enabled: true
        cloud.platform:
          enabled: true
        cloud.account.id:
          enabled: true
        cloud.region:
          enabled: true
        cloud.availability_zone:
          enabled: true
        k8s.cluster.name:
          enabled: true
        host.id:
          enabled: false
        host.name:
          enabled: false
  k8sattributes:
    wait_for_metadata: true
    auth_type: "serviceAccount"
    passthrough: false
    filter:
      node_from_env_var: KUBE_NODE_NAME
    extract:
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.deployment.uid
        - k8s.namespace.name
      labels:
        # Extracts the value of a pod label and inserts it as a resource attribute
        - tag_name: service_name
          key: tags.datadoghq.com/service
          from: pod
        - tag_name: version
          key: tags.datadoghq.com/version
          from: pod
        - tag_name: label_k8s_bluecore_com_team
          key: k8s.bluecore.com/team
          from: pod
    pod_association:
      # below association takes a look at the datapoint's k8s.pod.uid resource attribute and tries to match it with
      # the pod having the same attribute.
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
  batch:
    send_batch_max_size: 1000
    send_batch_size: 100
    timeout: 10s

exporters:
  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

service:
  pipelines:
    metrics/bc:
      receivers: [prometheus/bc]
      processors:
        - batch
        - resourcedetection
        - k8sattributes
      exporters:
        - debug
```
### Log output
Immediately after otel pods restart, note that k8s.deployment.name is missing as a resource attribute but the other extracted metadata is present:
```
Resource attributes:
     -> service.name: Str(bc-prom)
     -> net.host.name: Str(<ip>)
     -> server.address: Str(<ip>)
     -> service.instance.id: Str(<ip>:9102)
     -> net.host.port: Str(9102)
     -> http.scheme: Str(http)
     -> server.port: Str(9102)
     -> url.scheme: Str(http)
     -> cluster_name: Str(<cluster_name>)
     -> k8s.pod.name: Str(txnl-api-5fc5f496c5-r4grd)
     -> k8s.pod.uid: Str(03a84821-8629-4345-b265-c5018856468d)
     -> k8s.container.name: Str(txnl-api)
     -> k8s.namespace.name: Str(txnl-api)
     -> pod: Str(txnl-api-5fc5f496c5-r4grd)
     -> cloud.provider: Str(gcp)
     -> cloud.account.id: Str(<cluster_name>)
     -> cloud.platform: Str(gcp_kubernetes_engine)
     -> cloud.region: Str(<region>)
     -> label_k8s_bluecore_com_team: Str(<team>)
     -> service_name: Str(txnl-api)
     -> version: Str(7c79a41)
```
Also notice that the k8s.daemonset.name attribute IS present immediately after otel pods restart:
```
Resource attributes:
     -> service.name: Str(bc-prom)
     -> net.host.name: Str(<ip>)
     -> server.address: Str(<ip>)
     -> service.instance.id: Str(<ip>:9100)
     -> net.host.port: Str(9100)
     -> http.scheme: Str(http)
     -> server.port: Str(9100)
     -> url.scheme: Str(http)
     -> k8s.namespace.name: Str(otel-system)
     -> pod: Str(prometheus-node-exporter-c9xsw)
     -> k8s.pod.name: Str(prometheus-node-exporter-c9xsw)
     -> k8s.pod.uid: Str(37b69e3e-6589-4aae-806e-a0009822bf25)
     -> k8s.container.name: Str(node-exporter)
     -> k8s.daemonset.name: Str(prometheus-node-exporter)
     -> cloud.provider: Str(gcp)
     -> cloud.account.id: Str(<cluster_name>)
     -> cloud.platform: Str(gcp_kubernetes_engine)
     -> cloud.region: Str(<region>)
     -> cluster_name: Str(<cluster_name>)
```
After about 5 minutes, with no further changes, the resource DOES contain the k8s.deployment.name attribute:
```
Resource attributes:
     -> service.name: Str(bc-prom)
     -> net.host.name: Str(<ip>)
     -> server.address: Str(<ip>)
     -> service.instance.id: Str(<ip>:9102)
     -> net.host.port: Str(9102)
     -> http.scheme: Str(http)
     -> server.port: Str(9102)
     -> url.scheme: Str(http)
     -> k8s.pod.name: Str(txnl-api-5fc5f496c5-8khhd)
     -> k8s.pod.uid: Str(90eda72b-4dc7-40e9-bf4a-f9827b298315)
     -> k8s.container.name: Str(txnl-api)
     -> k8s.namespace.name: Str(txnl-api)
     -> cluster_name: Str(<cluster_name>)
     -> pod: Str(txnl-api-5fc5f496c5-8khhd)
     -> cloud.provider: Str(gcp)
     -> cloud.account.id: Str(<cluster_name>)
     -> cloud.platform: Str(gcp_kubernetes_engine)
     -> cloud.region: Str(<region>)
     -> k8s.deployment.name: Str(txnl-api)
     -> k8s.deployment.uid: Str(f5d96967-5e4a-4728-8f12-ae7ad969b42e)
     -> service_name: Str(txnl-api)
     -> version: Str(7c79a41)
     -> label_k8s_bluecore_com_team: Str(<team>)
```
### Additional context
_No response_