Component(s)
receiver/hostmetrics
What happened?
Description
An OTel Collector running on Windows Server 2019 was observed to have high CPU spikes (3-7%) each time the hostmetrics receiver ran its collection, which was set to a 1 minute interval.
After testing, the issue was narrowed down to the process scraper; the attached graph shows the collector's CPU usage when only the process scraper is enabled.
After re-enabling all of the other hostmetrics scrapers while leaving the process scraper disabled, the magnitude of the CPU spikes comes down significantly (<0.5%).
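For reference, this is a minimal sketch of the kind of config used for the isolation test; the 1 minute interval and debug exporter mirror the full config attached below, and the exact exporter used during testing is an assumption:

receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      # only the process scraper is enabled; all other scrapers removed
      process:
        mute_process_exe_error: true
exporters:
  debug:
    verbosity: normal
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]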

Steps to Reproduce
1. On a machine running Windows Server 2019, download the v0.94.0 release of the OTel Collector from https://github.com/open-telemetry/opentelemetry-collector-releases/releases/tag/v0.94.0.
2. Modify config.yaml to enable the hostmetrics process scraper and set the collection interval (see the config attached to this issue for an example).
3. Run the OTel Collector exe.
4. Monitor the collector's CPU usage in Task Manager, or graph it with perfmon (an alternative using the collector's own telemetry is sketched after this list).
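As an alternative to Task Manager or perfmon, the collector's self-telemetry (already enabled in the attached config) can be graphed over time; a minimal sketch, assuming the self-telemetry endpoint is scraped and that the collector's process CPU counter (e.g. otelcol_process_cpu_seconds) is available in that setup:

service:
  telemetry:
    metrics:
      level: detailed
      # the collector exposes its own metrics in Prometheus format at this address;
      # scraping it periodically (as the prometheus receiver job in the attached
      # config does) captures the collector's CPU usage, e.g. otelcol_process_cpu_seconds
      address: 0.0.0.0:9999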
Expected Result
CPU usage comparable to observed levels on Linux collectors (<0.5%)
Actual Result
CPU spikes to 3-7%
Collector version
v0.93.0
Environment information
Environment
OS: Windows Server 2019
OpenTelemetry Collector configuration
receivers:
  hostmetrics:
    collection_interval: 1m
    scrapers:
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      disk:
      load:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      memory:
        metrics:
          system.memory.utilization:
            enabled: true
      network:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
      processes:
      process:
        mute_process_exe_error: true
        metrics:
          process.cpu.utilization:
            enabled: true
          process.memory.utilization:
            enabled: true
  docker_stats:
    collection_interval: 1m
    metrics:
      container.cpu.throttling_data.periods:
        enabled: true
      container.cpu.throttling_data.throttled_periods:
        enabled: true
      container.cpu.throttling_data.throttled_time:
        enabled: true
  prometheus:
    config:
      scrape_configs:
        - job_name: $InstanceId/otel-self-metrics-collector-$Region
          scrape_interval: 1m
          static_configs:
            - targets: ['0.0.0.0:9999']
  otlp:
    protocols:
      grpc:
      http:
exporters:
  debug:
    verbosity: normal
  otlp:
    endpoint: <endpoint>
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 500
    spike_limit_mib: 100
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192
    timeout: 2000ms
  filter:
    metrics:
      exclude:
        match_type: strict
        metric_names:
          # comment a metric to remove from exclusion rule
          - otelcol_exporter_queue_capacity
          - otelcol_exporter_enqueue_failed_spans
          - otelcol_exporter_enqueue_failed_log_records
          - otelcol_exporter_enqueue_failed_metric_points
          - otelcol_exporter_send_failed_metric_points
          - otelcol_process_runtime_heap_alloc_bytes
          - otelcol_process_runtime_total_alloc_bytes
          - otelcol_processor_batch_timeout_trigger_send
          - otelcol_process_runtime_total_sys_memory_bytes
          - otelcol_process_uptime
          - otelcol_scraper_errored_metric_points
          - otelcol_scraper_scraped_metric_points
          - scrape_samples_scraped
          - scrape_samples_post_metric_relabeling
          - scrape_series_added
          - scrape_duration_seconds
          # - up
  resourcedetection:
    detectors: ec2, env, system
    ec2:
      tags:
        - ^Environment$
    system:
      hostname_sources: ["os"]
      resource_attributes:
        host.id:
          enabled: true
extensions:
  health_check:
  pprof:
  zpages:
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:9999
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection]
    metrics:
      receivers: [otlp, hostmetrics, prometheus]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection, filter]
    logs:
      receivers: [otlp]
      exporters: [debug, otlp]
      processors: [memory_limiter, batch, resourcedetection]
Log output
No response
Additional context
Windows Server 2019 was running on an m5x.large EC2 instance.