1. GPU Node Setup: Install Packages
# Install the NVIDIA driver
sudo apt install nvidia-driver-470
# Install the NVIDIA container toolkit (requires the NVIDIA apt repository to be configured first)
sudo apt install nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Join the node to the cluster (use the join command printed by kubeadm init)
kubeadm join ...
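A quick sanity check of the node before joining it (the CUDA image tag below is just an example; any CUDA base image works):
  # Driver check: should list the GPU(s)
  nvidia-smi
  # Container runtime check: the GPU should be visible from inside a container
  docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi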
2. Kubernetes Cluster Configuration
- Create the cluster with kubeadm or use a managed solution such as EKS or GKE.
- Label and taint the GPU nodes so that only GPU workloads are scheduled on them:
  kubectl label nodes <gpu-node-name> hardware-type=gpu
  kubectl taint nodes <gpu-node-name> hardware-type=gpu:NoSchedule
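To confirm the label and taint are in place (node name is a placeholder):
  kubectl get nodes -l hardware-type=gpu
  kubectl describe node <gpu-node-name> | grep Taints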
3. Install NVIDIA Kubernetes Device Plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
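The plugin runs as a DaemonSet; if the GPU nodes carry the hardware-type taint from step 2, its pods need a matching toleration or they will not be scheduled there. Once it is running, the GPU nodes should advertise nvidia.com/gpu as an allocatable resource:
  kubectl get pods -n kube-system | grep nvidia-device-plugin
  kubectl describe node <gpu-node-name> | grep nvidia.com/gpu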
4. Create a Pod That Uses a GPU
The pod requests one GPU, carries a node selector and toleration matching the label and taint from step 2, and runs a short command to confirm PyTorch can see the device:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  restartPolicy: Never
  nodeSelector:
    hardware-type: gpu
  tolerations:
  - key: hardware-type
    operator: Equal
    value: gpu
    effect: NoSchedule
  containers:
  - name: ml-container
    image: pytorch/pytorch:latest
    # Sample one-shot command: report whether PyTorch can reach the GPU
    command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
    resources:
      limits:
        nvidia.com/gpu: 1
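Assuming the manifest is saved as gpu-pod.yaml (the filename is arbitrary), apply it and read the log to see whether the GPU is reachable:
  kubectl apply -f gpu-pod.yaml
  kubectl get pod gpu-job -o wide    # should land on a node labeled hardware-type=gpu
  kubectl logs gpu-job               # prints "True" if CUDA is available to PyTorch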
5. Helm Charts for ML and GPU Workloads
Use Helm charts for:
- ML frameworks (JupyterHub, TensorFlow Serving, PyTorch)
- Storage (e.g., NFS, Ceph)
- Monitoring (Prometheus + Grafana)
- Ingress (NGINX or Istio)
- Auth (Keycloak)
- GPU jobs (custom charts per customer or job); a sample chart layout:
gpu-job-chart/
  Chart.yaml
  values.yaml
  templates/
    deployment.yaml
    service.yaml
    job.yaml
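As a rough sketch (not from an existing chart; the .Values keys image, command, and gpuCount are assumptions), templates/job.yaml could render a Kubernetes Job that reuses the same node selector and toleration as the pod above:
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: {{ .Release.Name }}-gpu-job
  spec:
    backoffLimit: 0
    template:
      spec:
        restartPolicy: Never
        nodeSelector:
          hardware-type: gpu
        tolerations:
        - key: hardware-type
          operator: Equal
          value: gpu
          effect: NoSchedule
        containers:
        - name: worker
          image: "{{ .Values.image }}"           # e.g. pytorch/pytorch:latest
          command: {{ toJson .Values.command }}  # e.g. ["python", "train.py"]
          resources:
            limits:
              nvidia.com/gpu: {{ .Values.gpuCount }}
Each job would then be installed with its own values, for example:
  helm install customer-a-job ./gpu-job-chart --set image=pytorch/pytorch:latest --set gpuCount=1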