1. GPU Node Setup: Install Packages


# Install the NVIDIA drivers
sudo apt install nvidia-driver-470

# Install the NVIDIA container toolkit
# (the NVIDIA container toolkit apt repository must already be configured)
sudo apt install nvidia-container-toolkit

# Configure Docker to use the NVIDIA runtime, then restart it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Join the node to the cluster (token and CA hash elided)
kubeadm join ...
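
To confirm the driver and the container runtime can both see the GPU, a quick check along these lines works (the CUDA image tag below is only an example, not from the source):

# Driver check
nvidia-smi

# Container check; --gpus all requires the container toolkit configured above
sudo docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi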
        

2. Kubernetes Cluster Configuration

- Use kubeadm to create a cluster, or use a managed offering such as EKS or GKE.
- Apply labels and taints to GPU nodes so that only workloads requesting GPUs are scheduled on them:

# Replace <node-name> with the name of the GPU node
kubectl label node <node-name> hardware-type=gpu
kubectl taint nodes <node-name> hardware-type=gpu:NoSchedule
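
To verify the label and taint took effect (node name is a placeholder):

kubectl get nodes -L hardware-type
kubectl describe node <node-name> | grep Taints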
        

3. Install NVIDIA Kubernetes Device Plugin


kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml
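
Once the plugin's DaemonSet is running on the GPU nodes, the GPU should show up as an allocatable resource. A quick check (node name is a placeholder):

# The plugin runs as a DaemonSet in kube-system
kubectl get daemonset -n kube-system | grep nvidia

# The GPU should now appear under the node's allocatable resources
kubectl describe node <node-name> | grep nvidia.com/gpu

Note that if the GPU nodes carry the hardware-type taint from step 2, the plugin's DaemonSet may also need a matching toleration.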
        

4. Create a Pod Using a GPU


apiVersion: v1
kind: Pod
metadata:
    name: gpu-job
spec:
    containers:
    - name: ml-container
      image: pytorch/pytorch:latest
      resources:
          limits:
              nvidia.com/gpu: 1
    # Schedule onto the labeled GPU nodes and tolerate the taint from step 2
    nodeSelector:
        hardware-type: gpu
    tolerations:
    - key: hardware-type
      operator: Equal
      value: gpu
      effect: NoSchedule
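
Once the pod is scheduled, it is worth checking that the GPU was actually allocated. Assuming the pod is named gpu-job as above and the container is still running:

# Confirm the GPU limit was applied and the pod landed on a GPU node
kubectl get pod gpu-job -o wide
kubectl describe pod gpu-job | grep nvidia.com/gpu

# Check that PyTorch inside the container can see the GPU
kubectl exec gpu-job -- python -c "import torch; print(torch.cuda.is_available())"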
        

5. Helm Charts for ML and GPU Workloads

Helm charts for:
- ML frameworks (JupyterHub, TensorFlow Serving, PyTorch)
- Storage (e.g., NFS, Ceph)
- Monitoring (Prometheus + Grafana)
- Ingress (NGINX or Istio)
- Auth (Keycloak)
- GPU jobs (custom charts per customer/job)

gpu-job-chart/
    Chart.yaml
    values.yaml
    templates/
        deployment.yaml
        service.yaml
        job.yaml
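
As a rough sketch of how such a chart might be wired up, the values and job template below are illustrative only; the value names (image, gpuLimit) and release naming are assumptions, not taken from an existing chart:

# values.yaml (illustrative defaults)
image: pytorch/pytorch:latest
gpuLimit: 1

# templates/job.yaml (illustrative)
apiVersion: batch/v1
kind: Job
metadata:
    name: {{ .Release.Name }}-gpu-job
spec:
    template:
        spec:
            restartPolicy: Never
            containers:
            - name: ml-container
              image: {{ .Values.image }}
              resources:
                  limits:
                      nvidia.com/gpu: {{ .Values.gpuLimit }}
            nodeSelector:
                hardware-type: gpu
            tolerations:
            - key: hardware-type
              operator: Equal
              value: gpu
              effect: NoSchedule

# Install one release per customer/job
helm install customer-a ./gpu-job-chart --set image=pytorch/pytorch:latest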