The Challenge
Running AI inference workloads requires significant compute resources, particularly GPU access. Cloud GPU pricing can be prohibitive for sustained workloads, and latency requirements often favor on-premises deployment. The challenge was building a production-grade Kubernetes cluster that could support GPU workloads while maintaining enterprise-level reliability and observability.
Specific requirements included:
- Support for multiple GPU-accelerated workloads with fair scheduling
- High-availability control plane with automatic failover
- GitOps-based deployment for reproducibility and audit trails
- NVMe-backed storage with NFS for shared persistent volumes
- Comprehensive monitoring, alerting, and log aggregation
- Zero-downtime upgrades and well-defined maintenance windows
The Solution
I designed a hybrid cluster with dedicated control plane nodes, standard workers for general workloads, and GPU-equipped workers for AI inference. The entire deployment is managed through GitOps with FluxCD, ensuring all changes are tracked in Git and automatically reconciled.
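Two Flux resources drive that reconciliation loop: a GitRepository pointing at the config repo and a Kustomization that applies it. Here's a minimal sketch of the pattern; the repository URL, branch, and path are placeholders, not the actual SolidRusT repo:

```yaml
# Minimal Flux source + reconciler pair (names and URL are hypothetical)
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: cluster-config
  namespace: flux-system
spec:
  interval: 5m                  # poll Git every 5 minutes
  url: https://github.com/example/cluster-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: cluster-apps
  namespace: flux-system
spec:
  interval: 5m                  # re-apply manifests on the same cadence
  sourceRef:
    kind: GitRepository
    name: cluster-config
  path: ./clusters/production   # hypothetical path within the repo
  prune: true                   # delete cluster resources removed from Git
```

With prune enabled, deleting a manifest from Git removes the corresponding object from the cluster, which is what makes the Git history a complete audit trail.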
Cluster Architecture
| Node Type | Count | Specs | Purpose |
|---|---|---|---|
| Control Plane | 3 | 16GB RAM, 4 vCPU | etcd, API Server, Scheduler |
| Workers (General) | 4 | 32GB RAM, 8 vCPU | Services, APIs, Databases |
| Workers (GPU) | 5 | 32GB RAM, RTX 3060 12GB | AI Inference (vLLM) |
```
┌─────────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Control Plane (HA) Workers (General) │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ CP1 │ │ CP2 │ │ CP3 │ │ W1 │ │ W2 │ │ W3 │ │ W4 │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│ │ │ │ │ │ │ │ │
│ └───────┴───────┴─────────┴───────┴───────┴───────┘ │
│ │ │
│ GPU Workers │ │
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │GPU1 │ │GPU2 │ │GPU3 │ │GPU4 │ │GPU5 │ │
│ │3060 │ │3060 │ │3060 │ │3060 │ │3060 │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ │
│ │ │ │ │ │ │
│ └───────┴───────┴───────┴───────┘ │
│ │ │
├──────────────────────────┼───────────────────────────────────────┤
│ ▼ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ TrueNAS (NFS Storage) │ │
│ │ NVMe-backed, 20TB usable │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
Key Implementation Details
- FluxCD GitOps: All manifests stored in Git, automatically reconciled every 5 minutes (see the Flux sketch above)
- Kustomize Overlays: Base manifests with environment-specific overlays for dev/staging/prod (layout sketched after this list)
- NVIDIA GPU Operator: Automatic driver and container toolkit installation on GPU nodes
- MetalLB: Bare-metal load balancer for service exposure (example after this list)
- cert-manager: Automatic TLS certificate provisioning with Let's Encrypt (issuer example after this list)
- democratic-csi: NFS-based persistent volume provisioner for TrueNAS (StorageClass example after this list)
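The overlay structure keeps environment differences small and explicit. Here's a sketch of what one overlay's kustomization.yaml might look like; the directory layout and patch file are illustrative, not the actual repo:

```yaml
# overlays/prod/kustomization.yaml -- hypothetical layout where base/
# holds shared manifests and overlays/<env>/ holds per-environment patches
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: replicas-patch.yaml   # e.g. prod-only replica counts or resource limits
```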
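MetalLB in Layer 2 mode needs only two resources: a pool of addresses to hand out and an advertisement for them. The address range below is an example, not the actual network:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # example LAN range; adjust to your subnet
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
```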
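The cert-manager side is a standard ACME ClusterIssuer. The contact email is a placeholder, and the HTTP-01 solver assumes an nginx ingress controller sits in front of the cluster:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com               # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key # where the ACME account key is stored
    solvers:
      - http01:
          ingress:
            class: nginx                 # assumes an nginx ingress controller
```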
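And a StorageClass backed by democratic-csi. The provisioner name is whatever the chart's csiDriver.name was set to, so `org.democratic-csi.nfs` below is an assumption:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: truenas-nfs
provisioner: org.democratic-csi.nfs   # must match csiDriver.name from the helm values
reclaimPolicy: Delete
allowVolumeExpansion: true
parameters:
  fsType: nfs                         # volumes are mounted over NFS from TrueNAS
```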
Monitoring & Observability
A comprehensive monitoring stack ensures visibility into cluster health and application performance:
- Prometheus: Metrics collection with custom ServiceMonitors for all applications (example after this list)
- Grafana: Dashboards for cluster health, GPU utilization, and application metrics
- Loki: Log aggregation with Promtail agents on each node
- Alertmanager: Alert routing to Slack and email for critical issues
- Node Exporter: Host-level metrics for all cluster nodes
- DCGM Exporter: NVIDIA GPU metrics including utilization, memory, and temperature
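A representative ServiceMonitor, assuming the kube-prometheus-stack chart and an application Service that exposes a named metrics port; the app name and labels are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: inference-api                # hypothetical application
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # must match Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: inference-api             # label on the app's Service
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics                  # named port on the Service
      interval: 30s
```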
Key Learnings
- Start with GitOps: Implementing FluxCD from day one saved countless hours in debugging and rollbacks
- GPU scheduling is complex: Proper node affinity, taints, and resource limits are crucial for fair GPU sharing (see the sketch after this list)
- Storage performance matters: NVMe-backed NFS made a significant difference in model loading times
- Monitoring is not optional: The investment in observability pays off every time there's an incident
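To make the GPU-scheduling learning concrete, here's a sketch of how an inference pod can pin itself to GPU nodes and claim a whole device. The taint key, node label, and RuntimeClass reflect a typical GPU Operator setup and are assumptions about this cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vllm-inference               # hypothetical workload
spec:
  runtimeClassName: nvidia           # RuntimeClass typically registered by the GPU Operator
  nodeSelector:
    nvidia.com/gpu.present: "true"   # node label commonly applied by GPU feature discovery
  tolerations:
    - key: nvidia.com/gpu            # assumes GPU nodes are tainted to repel non-GPU pods
      operator: Exists
      effect: NoSchedule
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1          # whole-GPU claim; the device plugin enforces exclusivity
```

Because nvidia.com/gpu is an extended resource, the scheduler only places as many pods on a node as that node has GPUs, which is what keeps sharing fair across the five 3060 workers.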
"This infrastructure now serves as the foundation for all SolidRusT Networks services, from AI inference to game servers, with consistent deployment patterns and centralized observability."