Monitoring Pipeline
-------------------
We have a GPU farm with thousands of GPUs on which ML workloads need to run.
On each GPU, the following processes would be running:
1. ML Job
2. GPU Metrics Collector (separate process): collects GPU metrics and exposes them to Prometheus on port 8000 (see the collector sketch after this list). Metrics:
   - GPU utilization (%)
   - Memory (total available / used / cached)
   - Temperature
   - Active processes
3. ML Log Monitor (separate process): collects application errors by parsing both stdout/stderr and log files (see the log-parsing sketch after this list). Error categories:
   - OOM errors
   - CUDA errors
   - Numerical instability warnings
   - Training divergence
   - Hardware errors
4. Health Checker (separate process): takes preventive measures when a value goes above its threshold (see the health-check sketch after this list).

   | Detects | Condition | Response |
   |---|---|---|
   | High temperature | Temperature above threshold | Throttle (slow down) training speed |
   | High memory | Memory usage above threshold | Memory cleanup; clear caches |
   | Hung process | Deadlock (no progress) | Kill hanging threads |
   | OOM error in ML process | Detected in logs | Reduce batch size |
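A minimal sketch of the GPU Metrics Collector (item 2), assuming NVIDIA GPUs with the `pynvml` bindings and the `prometheus_client` library; the metric names and the 5-second poll interval are illustrative choices, not part of the design above.

```python
# gpu_metrics_collector.py - minimal sketch (assumes pynvml + prometheus_client)
import time

import pynvml
from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization (%)", ["gpu"])
GPU_MEM_USED = Gauge("gpu_memory_used_bytes", "GPU memory used", ["gpu"])
GPU_MEM_TOTAL = Gauge("gpu_memory_total_bytes", "GPU memory total", ["gpu"])
GPU_TEMP = Gauge("gpu_temperature_celsius", "GPU temperature", ["gpu"])
GPU_PROCS = Gauge("gpu_active_processes", "Active compute processes on the GPU", ["gpu"])

def collect(handle, gpu: str) -> None:
    """Read one GPU's counters from NVML and publish them as Prometheus gauges."""
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    GPU_UTIL.labels(gpu=gpu).set(util.gpu)
    GPU_MEM_USED.labels(gpu=gpu).set(mem.used)
    GPU_MEM_TOTAL.labels(gpu=gpu).set(mem.total)
    GPU_TEMP.labels(gpu=gpu).set(temp)
    GPU_PROCS.labels(gpu=gpu).set(len(procs))

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(8000)                       # exposes /metrics for Prometheus
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    while True:
        for i, h in enumerate(handles):
            collect(h, gpu=str(i))
        time.sleep(5)                             # poll interval is an assumption
```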
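A minimal sketch of the ML Log Monitor (item 3), here following a single log file; the regex patterns, the log path, and the plain-print reporting are assumptions, and the same `classify()` helper could equally be fed from the job's stdout/stderr.

```python
# ml_log_monitor.py - minimal sketch; patterns and log path are assumptions
from __future__ import annotations

import re
import time
from collections import Counter

ERROR_PATTERNS = {
    "oom": re.compile(r"CUDA out of memory|OutOfMemoryError", re.IGNORECASE),
    "cuda": re.compile(r"CUDA error|cudaError", re.IGNORECASE),
    "numerical_instability": re.compile(r"\bnan\b|\binf\b|overflow", re.IGNORECASE),
    "training_divergence": re.compile(r"loss (exploded|diverged)", re.IGNORECASE),
    "hardware": re.compile(r"Xid|ECC error|fallen off the bus", re.IGNORECASE),
}

def classify(line: str) -> str | None:
    """Return the error category a log line belongs to, if any."""
    for name, pattern in ERROR_PATTERNS.items():
        if pattern.search(line):
            return name
    return None

def follow(path: str):
    """Yield lines appended to a log file (a simple `tail -f`)."""
    with open(path, "r", errors="replace") as f:
        f.seek(0, 2)                 # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(0.5)

if __name__ == "__main__":
    counts = Counter()
    for line in follow("/var/log/ml_job.log"):    # hypothetical log path
        category = classify(line)
        if category:
            counts[category] += 1
            print(f"[log-monitor] {category}: {line.strip()}")
```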
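A minimal sketch of the Health Checker (item 4), covering the temperature, memory, and hung-process rows of the table; the thresholds, the heartbeat file, and the response hooks are assumptions, and the OOM-to-reduce-batch-size path would be driven by events from the log monitor rather than by this polling loop.

```python
# health_checker.py - minimal sketch; thresholds and response hooks are assumptions
import os
import time

import pynvml

TEMP_LIMIT_C = 85                          # assumed temperature threshold
MEM_LIMIT_FRACTION = 0.95                  # assumed memory-usage threshold
HEARTBEAT_FILE = "/tmp/ml_job_heartbeat"   # hypothetical progress marker touched by the ML job
HANG_TIMEOUT_S = 600                       # assumed: no heartbeat for 10 min => hung

def throttle_training() -> None:
    # Placeholder: the real response would signal the ML job (e.g. via IPC) to slow down.
    print("response: throttle training speed")

def clear_caches() -> None:
    # Placeholder: the real response would trigger memory cleanup / cache clearing.
    print("response: memory cleanup, clear caches")

def kill_hung_job() -> None:
    # Placeholder: the real response would kill the hanging threads/process.
    print("response: kill hanging process")

def check(handle) -> None:
    """Compare current GPU readings against thresholds and react."""
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    if temp > TEMP_LIMIT_C:
        throttle_training()
    if mem.used / mem.total > MEM_LIMIT_FRACTION:
        clear_caches()
    if os.path.exists(HEARTBEAT_FILE):
        if time.time() - os.path.getmtime(HEARTBEAT_FILE) > HANG_TIMEOUT_S:
            kill_hung_job()

if __name__ == "__main__":
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # one GPU per container
    while True:
        check(handle)
        time.sleep(30)
```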
```
Docker Container (per GPU):
├── Main ML Process (PID 1)
├── GPU Monitor Daemon (port 8000/metrics)
├── Log Monitor Thread
├── Health Checker Thread
└── Metrics Exporter (Prometheus client)
```
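One possible way to wire the per-GPU container together is a thin entrypoint that starts the monitoring sidecars and then exec's the ML job so the job ends up as PID 1; the script names and the default training command below are hypothetical.

```python
# entrypoint.py - minimal sketch of the per-GPU container wiring;
# sidecar script names and the ML command are assumptions
import os
import subprocess
import sys

SIDECARS = [
    ["python", "gpu_metrics_collector.py"],   # Prometheus exporter on :8000
    ["python", "ml_log_monitor.py"],
    ["python", "health_checker.py"],
]

def main() -> None:
    # Start the monitoring sidecars as child processes of this entrypoint.
    for cmd in SIDECARS:
        subprocess.Popen(cmd)
    # Replace the entrypoint with the ML job so it runs as PID 1;
    # the sidecars spawned above keep running alongside it.
    ml_cmd = sys.argv[1:] or ["python", "train.py"]   # hypothetical default command
    os.execvp(ml_cmd[0], ml_cmd)

if __name__ == "__main__":
    main()
```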