Describe the bug
When running the prometheusremotewrite exporter with the WAL enabled under (very) high load (250k active series), it quickly builds up memory until the kernel OOM-kills otelcol.
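For reference, the exporter's wal block also accepts tuning knobs besides directory. The snippet below is only a sketch of that section; the buffer_size and truncate_frequency values are illustrative assumptions, not the settings used in the reproduction further down:

exporters:
  prometheusremotewrite:
    endpoint: http://receiver:9090/api/v1/write
    wal:
      directory: /wal          # where WAL segments are written
      buffer_size: 300         # assumed value: entries read back from the WAL per batch
      truncate_frequency: 1m   # assumed value: how often the WAL is truncated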
Steps to reproduce
docker-compose.yml
version: '2'
services:
  # generates 250k series to be scraped by otelcol
  avalanche:
    image: quay.io/prometheuscommunity/avalanche:main
    command:
      - --metric-count=1000
      - --series-count=250
      - --label-count=5
      - --series-interval=3600
      - --metric-interval=3600
  otel:
    image: otel/opentelemetry-collector
    volumes: [otel-cfg:/etc/otelcol]
    user: 0:0
    tmpfs:
      - /wal
    depends_on:
      otel-cfg:
        condition: service_completed_successfully
    mem_limit: 8G
    restart: always
  otel-cfg:
    image: alpine
    volumes: [otel-cfg:/etc/otelcol]
    command:
      - sh
      - -c
      - |
        cat - > /etc/otelcol/config.yaml << EOF
        receivers:
          prometheus:
            config:
              scrape_configs:
                - job_name: stress
                  scrape_interval: 15s
                  static_configs:
                    - targets:
                        - avalanche:9001
        processors:
          batch:
        exporters:
          prometheusremotewrite:
            endpoint: http://receiver:9090/api/v1/write
            wal:
              directory: /wal
        service:
          pipelines:
            metrics:
              receivers: [prometheus]
              processors: [batch]
              exporters: [prometheusremotewrite]
        EOF
  # dummy http server to "receive" remote_write samples by always replying with http 200
  receiver:
    image: caddy
    command: sh -c 'echo ":9090" > /tmp/Caddyfile && exec caddy run --config /tmp/Caddyfile'
  # prometheus observing resource usage of otelcol
  prometheus:
    image: prom/prometheus
    ports:
      - 9090:9090
    entrypoint: /bin/sh
    command:
      - -c
      - |
        cat - > prometheus.yml << EOF && /bin/prometheus
        global:
          scrape_interval: 5s
        scrape_configs:
          - job_name: otel
            static_configs:
              - targets:
                  - otel:8888
        EOF
volumes:
  otel-cfg: {}
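To reproduce, bring the stack up and watch the collector's own memory via the bundled Prometheus UI on localhost:9090. The metric name below is an assumption based on the collector's default self-telemetry exposed on :8888:

docker compose up -d
# then graph the collector's RSS at http://localhost:9090, e.g.
#   otelcol_process_memory_rss
# it climbs until the 8G mem_limit is reached, the container is OOM-killed and restarts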
What did you expect to see?
Otelcol having high but (periodically) stable memory usage.
What did you see instead?
Otelcol repeatedly builds up memory until it is OOM-killed by the operating system, only to restart and repeat the exact same behavior.
What version did you use?
Version: Docker otel/opentelemetry-collector-contrib:0.72.0
What config did you use?
See above docker-compose.yml
Environment
docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc., 2.13.0)

Server:
 Containers: 31
  Running: 0
  Paused: 0
  Stopped: 31
 Images: 65
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661.m
 runc version:
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.12.10-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 15.39GiB
 Name: <omit>
 ID: <omit>
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: <omit>
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
Additional context
This only occurs when the WAL is enabled. Other Prometheus agents (Grafana Agent, Prometheus in Agent mode) do not exhibit this behavior on the exact same input data.