A high-performance agent for collecting NVIDIA GPU metrics and exporting them via the OpenTelemetry Arrow protocol. Developed by Polar Signals for comprehensive GPU observability.
Modern GPU workloads require deep visibility into resource utilization, performance characteristics, and system behavior. This agent provides the metrics you need to optimize GPU usage, reduce costs, and troubleshoot performance issues.
Track GPU and memory utilization to identify underutilized resources. Optimize batch sizes and workload scheduling based on actual GPU usage patterns.
Key metrics: `gpu_utilization_percent`, `gpu_utilization_memory_percent`
Monitor per-process GPU utilization to ensure fair resource allocation in shared environments. Track which processes are consuming GPU resources and enforce usage policies.
Key metrics: Per-process `gpu_utilization_percent` with `pid` and `comm` attributes
Identify performance bottlenecks by correlating GPU metrics with application behavior. Detect thermal throttling, power limitations, and PCIe bandwidth constraints.
Key metrics: `gpu_temperature_celsius`, `gpu_clock_hertz`, `gpu_pcie_throughput_*_bytes`
Monitor power consumption to calculate operational costs and optimize for efficiency. Track power usage trends and identify opportunities for cost reduction.
Key metrics: `gpu_power_watt`, `gpu_power_limit_watt`
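As a rough, hypothetical illustration of the cost calculation (not part of the agent), an average of `gpu_power_watt` readings can be converted into an energy cost estimate; the helper and electricity rate below are made up:

```go
package main

import "fmt"

// energyCostUSD is a hypothetical helper: it turns an average power draw
// (e.g. the mean of gpu_power_watt samples over a period) into an energy cost.
func energyCostUSD(avgPowerWatts, hours, usdPerKWh float64) float64 {
	kWh := avgPowerWatts * hours / 1000.0 // watts x hours -> kilowatt-hours
	return kWh * usdPerKWh
}

func main() {
	// Example: a GPU averaging 250 W for 24 h at an assumed $0.12/kWh costs about $0.72/day.
	fmt.Printf("$%.2f per day\n", energyCostUSD(250, 24, 0.12))
}
```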
Analyze historical utilization patterns to plan for future GPU infrastructure needs. Understand peak usage times and resource requirements for different workloads.
Key metrics: All utilization metrics with time-series analysis
Track GPU performance during model training and inference. Ensure optimal resource allocation for different phases of machine learning pipelines.
Key metrics: All metrics combined with workload-specific context
| Metric | Description | Collection Interval |
|---|---|---|
| `gpu_utilization_percent` | GPU compute utilization (0-100%) | 5s |
| `gpu_utilization_memory_percent` | GPU memory utilization (0-100%) | 5s |
| `gpu_power_watt` | Current power consumption | 1s |
| `gpu_power_limit_watt` | Maximum power limit | 1s |
| `gpu_clock_hertz` | Clock speeds (graphics, SM, memory, video) | 1s |
| `gpu_temperature_celsius` | GPU temperature | 1s |
| `gpu_pcie_throughput_transmit_bytes` | PCIe transmit throughput | 100ms |
| `gpu_pcie_throughput_receive_bytes` | PCIe receive throughput | 100ms |
All metrics include `uuid` (GPU identifier) and `index` (GPU index) attributes. Process-level metrics also include `pid` and `comm` attributes.
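To illustrate the shape of these data points, here is a minimal sketch using the OpenTelemetry Go API; it is not the agent's actual code, the metric value and attribute values are made up, and the OTel Arrow export path is not shown:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	// Note: without an SDK MeterProvider configured this is a no-op;
	// it only illustrates how the metric name and attributes fit together.
	meter := otel.Meter("gpu-metrics-agent")

	// Observable gauge matching one of the exported metric names.
	util, _ := meter.Float64ObservableGauge("gpu_utilization_percent")

	// Each observation carries the uuid/index (and, for process-level
	// metrics, pid/comm) attributes described above. Values are examples.
	_, _ = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		o.ObserveFloat64(util, 87.5, metric.WithAttributes(
			attribute.String("uuid", "GPU-00000000-0000-0000-0000-000000000000"),
			attribute.Int("index", 0),
		))
		return nil
	}, util)
}
```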
For Kubernetes deployments, see our comprehensive setup guide.
```bash
./gpu-metrics-agent \
  --remote-store-address=grpc.polarsignals.com:443 \
  --metrics-producer-nvidia-gpu=true \
  --node=$(hostname) \
  --log-level=info
```
```bash
docker run -it --rm \
  --gpus all \
  -v /etc/machine-id:/etc/machine-id:ro \
  -v /var/run/secrets/polarsignals.com:/var/run/secrets/polarsignals.com:ro \
  ghcr.io/polarsignals/gpu-metrics-agent:latest \
  --remote-store-address=grpc.polarsignals.com:443 \
  --metrics-producer-nvidia-gpu=true
```
| Flag | Description | Default |
|---|---|---|
| `--remote-store-address` | gRPC endpoint for metric storage | Required |
| `--metrics-producer-nvidia-gpu` | Enable NVIDIA GPU metrics | `false` |
| `--collection-interval` | Metric export interval | `10s` |
| `--node` | Node name for metric labeling | Machine ID |
| `--bearer-token` | Authentication token | - |
The agent consists of three main components:
- NVIDIA Producer: Interfaces with the NVIDIA Management Library (NVML) to collect GPU metrics
- Metric Exporter: Batches and exports metrics using OpenTelemetry Arrow protocol
- gRPC Client: Manages secure connections to remote storage endpoints
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  NVIDIA GPUs    │────▶│   GPU Metrics    │────▶│ Remote Storage  │
│  (via NVML)     │     │      Agent       │     │  (gRPC/OTel)    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
```
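For a sense of what the NVIDIA Producer does, the sketch below reads utilization, power, and temperature per GPU through the go-nvml bindings. It is an illustrative approximation, not the agent's actual implementation; it assumes the `github.com/NVIDIA/go-nvml` module is available, and error handling is abbreviated:

```go
package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// Minimal NVML collection loop (sketch only, not the agent's code).
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("NVML init failed: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	count, _ := nvml.DeviceGetCount()
	for i := 0; i < count; i++ {
		dev, _ := nvml.DeviceGetHandleByIndex(i)
		uuid, _ := dev.GetUUID()

		util, _ := dev.GetUtilizationRates()                 // gpu_utilization_percent, gpu_utilization_memory_percent
		powerMw, _ := dev.GetPowerUsage()                    // milliwatts -> gpu_power_watt
		temp, _ := dev.GetTemperature(nvml.TEMPERATURE_GPU)  // gpu_temperature_celsius

		fmt.Printf("index=%d uuid=%s gpu=%d%% mem=%d%% power=%.1fW temp=%d°C\n",
			i, uuid, util.Gpu, util.Memory, float64(powerMw)/1000.0, temp)
	}
}
```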
- NVIDIA GPU with driver version 390.x or newer
- Linux operating system
- NVIDIA Management Library (NVML) available
```bash
go build ./cmd/gpu-metrics-agent
```
Apache License 2.0
- Documentation: polarsignals.com/docs
- Issues: GitHub Issues