Case Study

Production Kubernetes Infrastructure

Designing and deploying a 12-node Kubernetes cluster with GPU support, FluxCD GitOps, and comprehensive monitoring for AI workloads.

Duration: 12 Months
Role: Infrastructure Architect
Cluster Size: 12 Nodes

The Challenge

Running AI inference workloads requires significant compute resources, particularly GPU access. Cloud GPU pricing can be prohibitive for sustained workloads, and latency requirements often favor on-premises deployment. The challenge was building a production-grade Kubernetes cluster that could support GPU workloads while maintaining enterprise-level reliability and observability.

Specific requirements included:

- GPU scheduling for AI inference workloads (vLLM)
- A highly available control plane
- GitOps-driven deployments with every change tracked in Git
- Comprehensive monitoring, logging, and alerting
- Shared persistent storage for stateful services

The Solution

I designed a hybrid cluster with dedicated control plane nodes, standard workers for general workloads, and GPU-equipped workers for AI inference. The entire deployment is managed through GitOps with FluxCD, ensuring all changes are tracked in Git and automatically reconciled.
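To make the reconciliation loop concrete, here is a minimal sketch of the two Flux objects involved. The repository URL, branch, and path are illustrative placeholders, not the actual SolidRusT configuration:

```yaml
# Flux watches a Git repository and continuously applies a kustomize
# overlay from it. URL, branch, and path below are placeholders.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: cluster-config
  namespace: flux-system
spec:
  interval: 1m          # how often to poll Git for new commits
  url: https://github.com/example/cluster-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: cluster-config
  path: ./infrastructure
  prune: true           # remove cluster objects deleted from Git
```

Anything applied this way is drift-corrected: manual changes in the cluster are reverted to the state declared in Git on the next reconciliation.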

Cluster Architecture

Node Type          Count  Specs                      Purpose
Control Plane      3      16GB RAM, 4 vCPU           etcd, API Server, Scheduler
Workers (General)  4      32GB RAM, 8 vCPU           Services, APIs, Databases
Workers (GPU)      5      32GB RAM, RTX 3060 12GB    AI Inference (vLLM)
┌─────────────────────────────────────────────────────────────────┐
│                       Kubernetes Cluster                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Control Plane (HA)        Workers (General)                   │
│   ┌─────┐ ┌─────┐ ┌─────┐   ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐     │
│   │ CP1 │ │ CP2 │ │ CP3 │   │ W1  │ │ W2  │ │ W3  │ │ W4  │     │
│   └──┬──┘ └──┬──┘ └──┬──┘   └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘     │
│      │       │       │         │       │       │       │        │
│      └───────┴───────┴───┬─────┴───────┴───────┴───────┘        │
│                          │                                      │
│   GPU Workers            │                                      │
│   ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐                       │
│   │GPU1 │ │GPU2 │ │GPU3 │ │GPU4 │ │GPU5 │                       │
│   │3060 │ │3060 │ │3060 │ │3060 │ │3060 │                       │
│   └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘                       │
│      │       │       │       │       │                          │
│      └───────┴───────┴───┬───┴───────┘                          │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│                          ▼                                      │
│   ┌─────────────────────────────────────────────────────────┐   │
│   │                  TrueNAS (NFS Storage)                  │   │
│   │                 NVMe-backed, 20TB usable                │   │
│   └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
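The case study doesn't state how workloads consume the TrueNAS export, so the StorageClass below is a hypothetical sketch using the Kubernetes NFS CSI driver (csi-driver-nfs), one common way to provision NFS-backed volumes; the server address and share path are placeholders:

```yaml
# Hypothetical StorageClass for dynamically provisioning volumes on
# the TrueNAS NFS export via the Kubernetes NFS CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: truenas-nfs
provisioner: nfs.csi.k8s.io
parameters:
  server: 10.0.0.10        # TrueNAS address (placeholder)
  share: /mnt/tank/k8s     # NFS export path (placeholder)
reclaimPolicy: Retain      # keep data if the PVC is deleted
mountOptions:
  - nfsvers=4.1
```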

Key Implementation Details

Technology Stack

Kubernetes 1.29 · FluxCD v2 · Kustomize · Helm · Prometheus · Grafana · Loki · MetalLB · NVIDIA GPU Operator · cert-manager · Traefik · TrueNAS
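With the NVIDIA GPU Operator in place, GPU workers advertise an nvidia.com/gpu extended resource that pods request just like CPU or memory. The Deployment below is an illustrative sketch, not the production manifest; the namespace, image tag, and model are placeholders:

```yaml
# Illustrative vLLM Deployment: requesting nvidia.com/gpu makes the
# scheduler place the pod on a GPU worker with a free device.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  namespace: ai                          # placeholder namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen2.5-3B-Instruct"   # placeholder model id
          ports:
            - containerPort: 8000          # OpenAI-compatible API
          resources:
            limits:
              nvidia.com/gpu: 1            # one RTX 3060 per replica
```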

Results

Nodes in Production: 12
Deployments Running: 63
Namespaces: 46
GPU VRAM Available: 60GB (5 × RTX 3060 12GB)

Monitoring & Observability

A comprehensive monitoring stack ensures visibility into cluster health and application performance:

- Prometheus for metrics collection and alerting
- Grafana for dashboards and visualization
- Loki for centralized log aggregation
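As an example of how a workload plugs into this stack, the ServiceMonitor below (a Prometheus Operator CRD shipped with kube-prometheus-stack) declares a scrape target. The label selector and port name are hypothetical and must match the target Service:

```yaml
# Hypothetical ServiceMonitor: Prometheus discovers Services labeled
# app: example-api and scrapes /metrics on their http-metrics port.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-api
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-api
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http-metrics   # named port on the target Service
      interval: 30s
      path: /metrics
```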

Key Learnings

"This infrastructure now serves as the foundation for all SolidRusT Networks services, from AI inference to game servers, with consistent deployment patterns and centralized observability."
