The Challenge
Running AI inference workloads requires significant compute resources, particularly GPU access. Cloud GPU pricing can be prohibitive for sustained workloads, and latency requirements often favor on-premises deployment. The challenge was building a production-grade Kubernetes cluster that could support GPU workloads while maintaining enterprise-level reliability and observability.
Specific requirements included:
- Support for multiple GPU-accelerated workloads with fair scheduling
- High-availability control plane with automatic failover
- GitOps-based deployment for reproducibility and audit trails
- NVMe-backed storage with NFS for shared persistent volumes
- Comprehensive monitoring, alerting, and log aggregation
- Zero-downtime upgrades and well-defined maintenance windows
The Solution
I designed a hybrid cluster with dedicated control plane nodes, standard workers for general workloads, and GPU-equipped workers for AI inference. The entire deployment is managed through GitOps with FluxCD, ensuring all changes are tracked in Git and automatically reconciled.
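Two Flux resources drive that reconciliation loop: a GitRepository pointing at the config repo and a Kustomization that applies it. Here's a minimal sketch of the pattern; the repository URL, branch, and path are placeholders, not the actual SolidRusT repo:

```yaml
# Minimal Flux source + reconciler pair (names and URL are hypothetical)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: cluster-config
  namespace: flux-system
spec:
  interval: 5m                  # poll Git every 5 minutes
  url: https://github.com/example/cluster-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  interval: 5m                  # re-apply manifests on the same cadence
  sourceRef:
    kind: GitRepository
    name: cluster-config
  path: ./clusters/production   # hypothetical path within the repo
  prune: true                   # delete cluster resources removed from Git
```

With prune enabled, deleting a manifest from Git removes the corresponding object from the cluster, which is what makes the Git history a complete audit trail.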
Cluster Architecture
| Node Type | Count | Specs | Purpose |
|---|---|---|---|
| Control Plane | 3 | 16GB RAM, 4 vCPU | etcd, API Server, Scheduler |
| Workers (General) | 4 | 32GB RAM, 8 vCPU | Services, APIs, Databases |
| Workers (GPU) | 5 | 32GB RAM, RTX 3060 12GB | AI Inference (vLLM) |
```
┌─────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Control Plane (HA) Workers (General) │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ CP1 │ │ CP2 │ │ CP3 │ │ W1 │ │ W2 │ │ W3 │ │ W4 │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│ │ │ │ │ │ │ │ │
│ └───────┴───────┴─────────┴───────┴───────┴───────┘ │
│ │ │
│ GPU Workers │ │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │GPU1 │ │GPU2 │ │GPU3 │ │GPU4 │ │GPU5 │ │
│ │3060 │ │3060 │ │3060 │ │3060 │ │3060 │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│ │ │ │ │ │ │
│ └───────┴───────┴───────┴───────┘ │
│ │ │
├──────────────────────────┼───────────────────────────────────────┤
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ TrueNAS (NFS Storage) │ │
│ │ NVMe-backed, 20TB usable │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
Key Implementation Details
- FluxCD GitOps: All manifests stored in Git, automatically reconciled every 5 minutes (see the Flux sketch above)
- Kustomize Overlays: Base manifests with environment-specific overlays for dev/staging/prod (layout sketched after this list)
- NVIDIA GPU Operator: Automatic driver and container toolkit installation on GPU nodes
- MetalLB: Bare-metal load balancer for service exposure (example after this list)
- cert-manager: Automatic TLS certificate provisioning with Let's Encrypt (issuer example after this list)
- democratic-csi: NFS-based persistent volume provisioner for TrueNAS (StorageClass example after this list)
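The overlay structure keeps environment differences small and explicit. Here's a sketch of what one overlay's kustomization.yaml might look like; the directory layout and patch file are illustrative, not the actual repo:

```yaml
# overlays/prod/kustomization.yaml -- hypothetical layout where base/
# holds shared manifests and overlays/<env>/ holds per-environment patches
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: replicas-patch.yaml   # e.g. prod-only replica counts or resource limits
```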
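MetalLB in Layer 2 mode needs only two resources: a pool of addresses to hand out and an advertisement for them. The address range below is an example, not the actual network:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # example LAN range; adjust to your subnet
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
```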
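The cert-manager side is a standard ACME ClusterIssuer. The contact email is a placeholder, and the HTTP-01 solver assumes an nginx ingress controller sits in front of the cluster:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com               # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key # where the ACME account key is stored
    solvers:
      - http01:
          ingress:
            class: nginx                 # assumes an nginx ingress controller
```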
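And a StorageClass backed by democratic-csi. The provisioner name is whatever the chart's csiDriver.name was set to, so `org.democratic-csi.nfs` below is an assumption:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: truenas-nfs
provisioner: org.democratic-csi.nfs   # must match csiDriver.name from the helm values
reclaimPolicy: Delete
allowVolumeExpansion: true
parameters:
  fsType: nfs                         # volumes are mounted over NFS from TrueNAS
```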
Monitoring & Observability
A comprehensive monitoring stack ensures visibility into cluster health and application performance:
- Prometheus: Metrics collection with custom ServiceMonitors for all applications (example after this list)
- Grafana: Dashboards for cluster health, GPU utilization, and application metrics
- Loki: Log aggregation with Promtail agents on each node
- Alertmanager: Alert routing to Slack and email for critical issues
- Node Exporter: Host-level metrics for all cluster nodes
- DCGM Exporter: NVIDIA GPU metrics including utilization, memory, and temperature
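A representative ServiceMonitor, assuming the kube-prometheus-stack chart and an application Service that exposes a named metrics port; the app name and labels are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inference-api                # hypothetical application
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: inference-api             # label on the app's Service
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics                  # named port on the Service
      interval: 30s
```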
Key Learnings
- Start with GitOps: Implementing FluxCD from day one saved countless hours in debugging and rollbacks
- GPU scheduling is complex: Proper node affinity, taints, and resource limits are crucial for fair GPU sharing (see the sketch after this list)
- Storage performance matters: NVMe-backed NFS made a significant difference in model loading times
- Monitoring is not optional: The investment in observability pays off every time there's an incident
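To make the GPU-scheduling learning concrete, here's a sketch of how an inference pod can pin itself to GPU nodes and claim a whole device. The taint key, node label, and RuntimeClass reflect a typical GPU Operator setup and are assumptions about this cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-inference               # hypothetical workload
spec:
  runtimeClassName: nvidia           # RuntimeClass typically registered by the GPU Operator
  nodeSelector:
    nvidia.com/gpu.present: "true"   # node label commonly applied by GPU feature discovery
  tolerations:
    - key: nvidia.com/gpu            # assumes GPU nodes are tainted to repel non-GPU pods
      operator: Exists
      effect: NoSchedule
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1          # whole-GPU claim; the device plugin enforces exclusivity
```

Because nvidia.com/gpu is an extended resource, the scheduler only places as many pods on a node as that node has GPUs, which is what keeps sharing fair across the five 3060 workers.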
"This infrastructure now serves as the foundation for all SolidRusT Networks services, from AI inference to game servers, with consistent deployment patterns and centralized observability."