exporter/prometheusremotewrite: wal leads to oom under high load #19363

@tombrk

Describe the bug
When running the prometheusremotewrite exporter with the WAL enabled under (very) high load (250k active series), the collector quickly builds up memory until the kernel OOM-kills otelcol.

Steps to reproduce

docker-compose.yml
version: '2'
services:
  # generates 250k series to be scraped by otelcol
  avalanche:
    image: quay.io/prometheuscommunity/avalanche:main
    command:
      - --metric-count=1000
      - --series-count=250
      - --label-count=5
      - --series-interval=3600
      - --metric-interval=3600

  otel:
    image: otel/opentelemetry-collector
    volumes: [otel-cfg:/etc/otelcol]
    user: "0:0"
    tmpfs:
      - /wal
    depends_on:
      otel-cfg:
        condition: service_completed_successfully
    mem_limit: 8G
    restart: always

  otel-cfg:
    image: alpine
    volumes: [otel-cfg:/etc/otelcol]
    command:
      - sh
      - -c
      - |
        cat - > /etc/otelcol/config.yaml << EOF
        receivers:
          prometheus:
            config:
              scrape_configs:
              - job_name: stress
                scrape_interval: 15s
                static_configs:
                  - targets:
                    - avalanche:9001
        processors:
          batch:
        exporters:
          prometheusremotewrite:
            endpoint: http://receiver:9090/api/v1/write
            wal:
              directory: /wal
        service:
          pipelines:
            metrics:
              receivers: [prometheus]
              processors: [batch]
              exporters: [prometheusremotewrite]
        EOF
  # dummy http server to "receive" remote_write samples by always replying with http 200
  receiver:
    image: caddy
    command: sh -c 'echo ":9090" > /tmp/Caddyfile && exec caddy run --config /tmp/Caddyfile'

  # prometheus observing resource usage of otelcol
  prometheus:
    image: prom/prometheus
    ports:
      - 9090:9090
    entrypoint: /bin/sh
    command:
      - -c
      - |
        cat - > prometheus.yml << EOF && /bin/prometheus
        global:
          scrape_interval: 5s
        scrape_configs:
          - job_name: otel
            static_configs:
              - targets:
                - otel:8888
        EOF
volumes:
  otel-cfg: {}
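
To reproduce, save the file above as docker-compose.yml, bring the stack up, and watch the otel container's memory climb toward the 8G mem_limit. The commands below assume the compose file above; the internal metric name is from memory and may differ by collector version:

docker compose up -d
# container-level view of the growing RSS
docker stats $(docker compose ps -q otel)
# alternatively, graph the collector's own process metrics (scraped from otel:8888)
# in the bundled Prometheus at http://localhost:9090, e.g. otelcol_process_memory_rss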

What did you expect to see?
Otelcol having high but (periodically) stable memory usage

What did you see instead?

Otelcol repeatedly builds up memory until it is OOM-killed by the operating system, then restarts and repeats the exact same behavior.

(Screenshot from 2023-03-06 23-33-13: otelcol memory usage repeatedly climbing until the OOM kill.)

What version did you use?
Version: Docker otel/opentelemetry-collector-contrib:0.72.0

What config did you use?
See above docker-compose.yml

Environment

docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc., 2.13.0)

Server:
 Containers: 31
  Running: 0
  Paused: 0
  Stopped: 31
 Images: 65
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661.m
 runc version: 
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.12.10-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 15.39GiB
 Name: <omit>
 ID: <omit>
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username: <omit>
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional context
This only occurs when the WAL is enabled. Other Prometheus agents (Grafana Agent, Prometheus in agent mode) do not show this behavior on the exact same input data.
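
For whoever triages this: if I read the exporter docs correctly, the wal block also accepts buffer_size and truncate_frequency next to directory. The reproduction above sets only directory; the values below are purely illustrative and I have not verified whether tuning them changes the buildup:

exporters:
  prometheusremotewrite:
    endpoint: http://receiver:9090/api/v1/write
    wal:
      directory: /wal
      buffer_size: 300         # illustrative value
      truncate_frequency: 1m   # illustrative; how often the WAL is truncated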
