Monitoring Pipeline

We have a GPU farm with thousands of GPUs on which ML workloads need to run.

On each GPU node, the following processes would be running:
1. ML Job

2. GPU Metrics Collector (separate process): collects GPU metrics and exposes them to Prometheus on port 8000 (a sketch follows below). Metrics:
  GPU utilization (%)
  Memory (total available / used / cached)
  Temperature
  Active processes

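A minimal sketch of the collector, assuming NVML access via the pynvml package and the standard prometheus_client library; the metric names and 5-second poll interval are illustrative, not part of the design.

```python
# gpu_metrics_collector.py - minimal sketch; metric names and poll interval are assumptions
import time

from prometheus_client import Gauge, start_http_server
import pynvml

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization (%)", ["gpu"])
GPU_MEM_USED = Gauge("gpu_memory_used_bytes", "GPU memory used", ["gpu"])
GPU_MEM_TOTAL = Gauge("gpu_memory_total_bytes", "GPU memory total", ["gpu"])
GPU_TEMP = Gauge("gpu_temperature_celsius", "GPU temperature", ["gpu"])
GPU_PROCS = Gauge("gpu_active_processes", "Processes using the GPU", ["gpu"])

def collect_once(handle, gpu_id: str) -> None:
    """Read one round of metrics from NVML and update the Prometheus gauges."""
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    GPU_UTIL.labels(gpu=gpu_id).set(util.gpu)
    GPU_MEM_USED.labels(gpu=gpu_id).set(mem.used)
    GPU_MEM_TOTAL.labels(gpu=gpu_id).set(mem.total)
    GPU_TEMP.labels(gpu=gpu_id).set(temp)
    GPU_PROCS.labels(gpu=gpu_id).set(len(procs))

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(8000)            # Prometheus scrapes http://<node>:8000/metrics
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    while True:
        for i, h in enumerate(handles):
            collect_once(h, str(i))
        time.sleep(5)                  # poll interval is an assumption
```
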

3. ML Log Monitor (separate process): collects application errors (a parsing sketch follows below):
  OOM errors
  CUDA errors
  Numerical instability warnings
  Training divergence
  Hardware errors
It should parse both stdout/stderr and log files.
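
A sketch of the log monitor's parsing loop, assuming stdout/stderr are redirected into a log file that the monitor tails; the file path, error patterns, and metric names are illustrative assumptions.

```python
# ml_log_monitor.py - minimal sketch; path, patterns, and metric names are assumptions
import re
import time

from prometheus_client import Counter, start_http_server

# Regexes for the error classes listed above (patterns are illustrative).
ERROR_PATTERNS = {
    "oom": re.compile(r"CUDA out of memory|OutOfMemoryError", re.I),
    "cuda": re.compile(r"CUDA error|cudaError", re.I),
    "numerical": re.compile(r"\bnan\b|\binf\b|overflow", re.I),
    "divergence": re.compile(r"loss (exploded|diverged)", re.I),
    "hardware": re.compile(r"Xid|ECC error|fallen off the bus", re.I),
}
ML_ERRORS = Counter("ml_errors_total", "ML application errors by type", ["type"])

def follow(path: str):
    """Yield new lines appended to the file (like `tail -f`)."""
    with open(path, "r", errors="replace") as f:
        f.seek(0, 2)                   # start at end of file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

def scan(line: str) -> None:
    """Classify one log line and count any matching error types."""
    for err_type, pattern in ERROR_PATTERNS.items():
        if pattern.search(line):
            ML_ERRORS.labels(type=err_type).inc()

if __name__ == "__main__":
    start_http_server(8001)            # separate port from the GPU collector (assumption)
    for line in follow("/var/log/ml_job.log"):   # log path is an assumption
        scan(line)
```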

4. Health Checker (separate process): takes preventive measures when a value goes above its threshold (a sketch follows below).

| Condition        | Detection                          | Response                            |
|------------------|------------------------------------|-------------------------------------|
| High temperature | gpu_temp > 85°C                    | Throttle (slow down) training speed |
| High memory      | gpu_mem > 90%                      | Memory cleanup: clear caches        |
| Hung process     | Deadlock (no progress)             | Kill hanging threads                |
| OOM error        | ML process hits OOM, seen in logs  | Reduce batch size                   |
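
A sketch of the health checker's decision loop for the temperature and memory rows, assuming it reads GPU state directly via NVML; the action functions are placeholders, and the hung-process and OOM responses would hook into the log monitor's output rather than NVML.

```python
# health_checker.py - sketch of the threshold -> action loop; action bodies are placeholders
import time

import pynvml

TEMP_LIMIT_C = 85          # throttle above this temperature
MEM_LIMIT_FRACTION = 0.90  # clean up above 90% memory usage

def throttle_training(gpu_id: int) -> None:
    # Placeholder: e.g. signal the trainer to slow down or pause briefly.
    print(f"[gpu {gpu_id}] high temperature: throttling training")

def cleanup_memory(gpu_id: int) -> None:
    # Placeholder: e.g. ask the trainer to clear its caches.
    print(f"[gpu {gpu_id}] high memory: requesting cache cleanup")

def check_gpu(handle, gpu_id: int) -> None:
    """Compare current readings against the thresholds and trigger responses."""
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    if temp > TEMP_LIMIT_C:
        throttle_training(gpu_id)
    if mem.used / mem.total > MEM_LIMIT_FRACTION:
        cleanup_memory(gpu_id)

if __name__ == "__main__":
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    while True:
        for i, h in enumerate(handles):
            check_gpu(h, i)
        time.sleep(10)     # check interval is an assumption
```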

Docker Container (per GPU):
├── Main ML Process (PID 1)
├── GPU Monitor Daemon (port 8000/metrics)
├── Log Monitor Thread
├── Health Checker Thread
└── Metrics Exporter (Prometheus client)
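
One way to wire this layout together (a sketch under assumptions, not the actual setup) is a small entrypoint that launches the monitors as child processes alongside the ML job; script names and the training command are hypothetical, and note that in this sketch the entrypoint, rather than the ML process, is PID 1.

```python
# entrypoint.py - hypothetical container entrypoint; script names and commands are assumptions
import signal
import subprocess
import sys

def main() -> int:
    # Start the monitoring sidecar processes.
    monitors = [
        subprocess.Popen([sys.executable, "gpu_metrics_collector.py"]),
        subprocess.Popen([sys.executable, "ml_log_monitor.py"]),
        subprocess.Popen([sys.executable, "health_checker.py"]),
    ]
    # The ML job runs in the foreground; its exit code decides the container's fate.
    ml_job = subprocess.Popen(["python", "train.py"])   # command is an assumption
    try:
        return ml_job.wait()
    finally:
        for proc in monitors:
            proc.send_signal(signal.SIGTERM)            # shut monitors down with the job

if __name__ == "__main__":
    sys.exit(main())
```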