Monitoring Pipeline
-------------------
We have a GPU farm with thousands of GPUs on which ML workloads need to run.
On each GPU, the following processes would be running:
1. ML Job
2. GPU Metrics Collector (separate process): collects GPU metrics and exposes them to Prometheus on port 8000 (a minimal sketch follows the list below). Metrics:
   - GPU utilization (%)
   - Memory (total available / used / cached)
   - Temperature
   - Active processes
3. ML Log Monitor (separate process): collects application errors (a minimal sketch follows the list below), such as:
   - OOM errors
   - CUDA errors
   - Numerical instability warnings
   - Training divergence
   - Hardware errors
   It should parse both stdout/stderr and log files.
4. Health Checker (separate process): takes preventive measures when values go above their thresholds (a minimal sketch follows the table below).

| Detects | Condition | Response |
|---|---|---|
| High temperature | Temperature above threshold | Throttle (slow down) training speed |
| High memory | Memory usage above threshold | Memory cleanup: clear caches |
| Hung process | Deadlock (no progress) | Kill hanging threads |
| ML process hits OOM error | Detected in logs | Trigger a batch-size reduction |
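
A minimal sketch of the GPU Metrics Collector (item 2), assuming `pynvml` and `prometheus_client` are used; the metric names and the polling interval are illustrative, not part of the original spec.

```python
# GPU Metrics Collector sketch: polls NVML and exposes gauges on port 8000.
import time

import pynvml
from prometheus_client import Gauge, start_http_server

POLL_INTERVAL_S = 5  # assumed polling interval

gpu_util = Gauge("gpu_utilization_percent", "GPU utilization (%)", ["gpu"])
mem_used = Gauge("gpu_memory_used_bytes", "GPU memory used", ["gpu"])
mem_total = Gauge("gpu_memory_total_bytes", "GPU memory total", ["gpu"])
gpu_temp = Gauge("gpu_temperature_celsius", "GPU temperature", ["gpu"])
gpu_procs = Gauge("gpu_active_processes", "Active compute processes on the GPU", ["gpu"])


def collect_forever() -> None:
    pynvml.nvmlInit()
    start_http_server(8000)  # expose /metrics for Prometheus on port 8000
    count = pynvml.nvmlDeviceGetCount()
    while True:
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
            gpu_util.labels(gpu=str(i)).set(util.gpu)
            mem_used.labels(gpu=str(i)).set(mem.used)
            mem_total.labels(gpu=str(i)).set(mem.total)
            gpu_temp.labels(gpu=str(i)).set(temp)
            gpu_procs.labels(gpu=str(i)).set(len(procs))
        time.sleep(POLL_INTERVAL_S)


if __name__ == "__main__":
    collect_forever()
```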
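
A minimal sketch of the ML Log Monitor (item 3), assuming the ML job's stdout/stderr are redirected to a single log file; the log path, the error patterns, and the print-based reporting are assumptions.

```python
# ML Log Monitor sketch: tail the job log and classify error lines.
import re
import time
from pathlib import Path

LOG_PATH = Path("/var/log/ml_job.log")  # assumed location of combined stdout/stderr

ERROR_PATTERNS = {
    "oom": re.compile(r"CUDA out of memory|OutOfMemoryError", re.IGNORECASE),
    "cuda": re.compile(r"CUDA error|cudaError", re.IGNORECASE),
    "numerical": re.compile(r"\bnan\b|\binf\b|overflow", re.IGNORECASE),
    "divergence": re.compile(r"loss (exploded|diverged)", re.IGNORECASE),
    "hardware": re.compile(r"Xid|ECC error|fallen off the bus", re.IGNORECASE),
}


def tail_and_classify() -> None:
    """Follow the log file (like `tail -f`) and report lines matching known errors."""
    with LOG_PATH.open() as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            for kind, pattern in ERROR_PATTERNS.items():
                if pattern.search(line):
                    # In the real pipeline this would notify the Health Checker,
                    # e.g. via a queue, a file, or an HTTP endpoint.
                    print(f"[{kind}] {line.strip()}")


if __name__ == "__main__":
    tail_and_classify()
```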
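
A minimal sketch of the Health Checker (item 4), again polling NVML; the thresholds and the response hooks (`throttle_training`, `cleanup_memory`) are placeholders for whatever mechanism the trainer actually exposes.

```python
# Health Checker sketch: compare readings against thresholds and react.
import time

import pynvml

TEMP_LIMIT_C = 85          # assumed temperature threshold
MEM_LIMIT_FRACTION = 0.90  # assumed memory-usage threshold
CHECK_INTERVAL_S = 10


def throttle_training() -> None:
    print("high temperature: signalling the trainer to slow down")


def cleanup_memory() -> None:
    print("high memory: requesting a cache clear from the trainer")


def check_forever() -> None:
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # one GPU per container (per the layout below)
    while True:
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        if temp > TEMP_LIMIT_C:
            throttle_training()
        if mem.used / mem.total > MEM_LIMIT_FRACTION:
            cleanup_memory()
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    check_forever()
```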
Docker Container (per GPU):
```
├── Main ML Process (PID 1)
├── GPU Monitor Daemon (port 8000/metrics)
├── Log Monitor Thread
├── Health Checker Thread
└── Metrics Exporter (Prometheus client)
```
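
A minimal sketch of a container entrypoint that starts the monitoring side processes and then runs the ML job in the foreground so the container exits with it; the script names (`gpu_metrics_collector.py`, `ml_log_monitor.py`, `health_checker.py`, `train.py`) are assumptions.

```python
# Container entrypoint sketch: launch side processes, then run the ML job.
import subprocess
import sys

SIDE_PROCESSES = [
    ["python", "gpu_metrics_collector.py"],  # exposes /metrics on port 8000
    ["python", "ml_log_monitor.py"],
    ["python", "health_checker.py"],
]


def main() -> int:
    children = [subprocess.Popen(cmd) for cmd in SIDE_PROCESSES]
    try:
        # The ML job runs in the foreground; its exit code becomes the container's.
        return subprocess.call(["python", "train.py"])
    finally:
        for child in children:
            child.terminate()


if __name__ == "__main__":
    sys.exit(main())
```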