Monitoring Pipeline

We have a GPU farm with thousands of GPUs on which ML workloads need to run.

On each GPU node, the following processes would be running:
1. ML Job

2. GPU Metrics Collector (separate process): collects GPU metrics and exposes them to Prometheus on port 8000 (a sketch follows below). Metrics:
  GPU utilization (%)
  Memory (total available / used / cached)
  Temperature
  Active processes

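A minimal sketch of the collector, assuming NVML access via the pynvml package and the standard prometheus_client library; the metric names and 5-second poll interval are illustrative, not part of the design.

```python
# gpu_metrics_collector.py - minimal sketch; metric names and poll interval are assumptions
import time

from prometheus_client import Gauge, start_http_server
import pynvml

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization (%)", ["gpu"])
GPU_MEM_USED = Gauge("gpu_memory_used_bytes", "GPU memory used", ["gpu"])
GPU_MEM_TOTAL = Gauge("gpu_memory_total_bytes", "GPU memory total", ["gpu"])
GPU_TEMP = Gauge("gpu_temperature_celsius", "GPU temperature", ["gpu"])
GPU_PROCS = Gauge("gpu_active_processes", "Processes using the GPU", ["gpu"])

def collect_once(handle, gpu_id: str) -> None:
    """Read one round of metrics from NVML and update the Prometheus gauges."""
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    GPU_UTIL.labels(gpu=gpu_id).set(util.gpu)
    GPU_MEM_USED.labels(gpu=gpu_id).set(mem.used)
    GPU_MEM_TOTAL.labels(gpu=gpu_id).set(mem.total)
    GPU_TEMP.labels(gpu=gpu_id).set(temp)
    GPU_PROCS.labels(gpu=gpu_id).set(len(procs))

if __name__ == "__main__":
    pynvml.nvmlInit()
    start_http_server(8000)            # Prometheus scrapes http://<node>:8000/metrics
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    while True:
        for i, h in enumerate(handles):
            collect_once(h, str(i))
        time.sleep(5)                  # poll interval is an assumption
```
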

3. ML Log Monitor (separate process): collects application errors (a parsing sketch follows below):
  OOM errors
  CUDA errors
  Numerical instability warnings
  Training divergence
  Hardware errors
It should parse both stdout/stderr and log files.
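
A sketch of the log monitor's parsing loop, assuming stdout/stderr are redirected into a log file that the monitor tails; the file path, error patterns, and metric names are illustrative assumptions.

```python
# ml_log_monitor.py - minimal sketch; path, patterns, and metric names are assumptions
import re
import time

from prometheus_client import Counter, start_http_server

# Regexes for the error classes listed above (patterns are illustrative).
ERROR_PATTERNS = {
    "oom": re.compile(r"CUDA out of memory|OutOfMemoryError", re.I),
    "cuda": re.compile(r"CUDA error|cudaError", re.I),
    "numerical": re.compile(r"\bnan\b|\binf\b|overflow", re.I),
    "divergence": re.compile(r"loss (exploded|diverged)", re.I),
    "hardware": re.compile(r"Xid|ECC error|fallen off the bus", re.I),
}
ML_ERRORS = Counter("ml_errors_total", "ML application errors by type", ["type"])

def follow(path: str):
    """Yield new lines appended to the file (like `tail -f`)."""
    with open(path, "r", errors="replace") as f:
        f.seek(0, 2)                   # start at end of file
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)
                continue
            yield line

def scan(line: str) -> None:
    """Classify one log line and count any matching error types."""
    for err_type, pattern in ERROR_PATTERNS.items():
        if pattern.search(line):
            ML_ERRORS.labels(type=err_type).inc()

if __name__ == "__main__":
    start_http_server(8001)            # separate port from the GPU collector (assumption)
    for line in follow("/var/log/ml_job.log"):   # log path is an assumption
        scan(line)
```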

4. Health Checker (separate process): takes preventive measures when a value goes above its threshold (a sketch follows below).

| Condition        | Detection                          | Response                            |
|------------------|------------------------------------|-------------------------------------|
| High temperature | gpu_temp > 85°C                    | Throttle (slow down) training speed |
| High memory      | gpu_mem > 90%                      | Memory cleanup: clear caches        |
| Hung process     | Deadlock (no progress)             | Kill hanging threads                |
| OOM error        | ML process hits OOM, seen in logs  | Reduce batch size                   |
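
A sketch of the health checker's decision loop for the temperature and memory rows, assuming it reads GPU state directly via NVML; the action functions are placeholders, and the hung-process and OOM responses would hook into the log monitor's output rather than NVML.

```python
# health_checker.py - sketch of the threshold -> action loop; action bodies are placeholders
import time

import pynvml

TEMP_LIMIT_C = 85          # throttle above this temperature
MEM_LIMIT_FRACTION = 0.90  # clean up above 90% memory usage

def throttle_training(gpu_id: int) -> None:
    # Placeholder: e.g. signal the trainer to slow down or pause briefly.
    print(f"[gpu {gpu_id}] high temperature: throttling training")

def cleanup_memory(gpu_id: int) -> None:
    # Placeholder: e.g. ask the trainer to clear its caches.
    print(f"[gpu {gpu_id}] high memory: requesting cache cleanup")

def check_gpu(handle, gpu_id: int) -> None:
    """Compare current readings against the thresholds and trigger responses."""
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    if temp > TEMP_LIMIT_C:
        throttle_training(gpu_id)
    if mem.used / mem.total > MEM_LIMIT_FRACTION:
        cleanup_memory(gpu_id)

if __name__ == "__main__":
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    while True:
        for i, h in enumerate(handles):
            check_gpu(h, i)
        time.sleep(10)     # check interval is an assumption
```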

Docker Container (per GPU):
├── Main ML Process (PID 1)
├── GPU Monitor Daemon (port 8000/metrics)
├── Log Monitor Thread
├── Health Checker Thread
└── Metrics Exporter (Prometheus client)
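
One way to wire this layout together (a sketch under assumptions, not the actual setup) is a small entrypoint that launches the monitors as child processes alongside the ML job; script names and the training command are hypothetical, and note that in this sketch the entrypoint, rather than the ML process, is PID 1.

```python
# entrypoint.py - hypothetical container entrypoint; script names and commands are assumptions
import signal
import subprocess
import sys

def main() -> int:
    # Start the monitoring sidecar processes.
    monitors = [
        subprocess.Popen([sys.executable, "gpu_metrics_collector.py"]),
        subprocess.Popen([sys.executable, "ml_log_monitor.py"]),
        subprocess.Popen([sys.executable, "health_checker.py"]),
    ]
    # The ML job runs in the foreground; its exit code decides the container's fate.
    ml_job = subprocess.Popen(["python", "train.py"])   # command is an assumption
    try:
        return ml_job.wait()
    finally:
        for proc in monitors:
            proc.send_signal(signal.SIGTERM)            # shut monitors down with the job

if __name__ == "__main__":
    sys.exit(main())
```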