The Challenge
As AI capabilities rapidly evolved, I recognized the need for a robust, self-hosted inference platform that could serve multiple consumer-facing applications. The challenge was building infrastructure that could handle variable workloads, provide low-latency responses, and maintain high availability - all while keeping costs manageable compared to cloud AI services.
Key challenges included:
- Managing GPU resources efficiently across multiple models
- Providing automatic failover when local inference was unavailable
- Implementing proper API key management and billing integration
- Maintaining infrastructure as code with zero-downtime deployments
- Serving multiple frontend applications through a unified API gateway
The Solution
I designed and implemented a comprehensive AI platform built on Kubernetes, featuring a 12-node cluster with 5 GPU-equipped worker nodes. The architecture follows GitOps principles with FluxCD for continuous deployment and Kustomize for environment management.
┌─────────────────────────────────────────────────────────────────┐
│                         Public Internet                         │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Artemis Gateway (AWS EC2)                    │
│  - TLS Termination                                              │
│  - Rate Limiting                                                │
│  - API Key Validation                                           │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                  Kubernetes Cluster (12 nodes)                  │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   LiteLLM   │  │    vLLM     │  │     Data Layer API      │  │
│  │    Proxy    │──│  Inference  │  │    (RAG, Embeddings)    │  │
│  │ (Failover)  │  │ (GPU Pods)  │  │                         │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │                        PAM Platform                         │ │
│ │               (API Keys, Billing, User Auth)                │ │
│ └─────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
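From a client's perspective, everything behind this diagram is exposed as a single OpenAI-compatible endpoint. The sketch below shows how one of the frontend applications might call it; the gateway hostname, model alias, and key format are placeholders rather than the real values.

```python
# Hypothetical client call through the Artemis gateway (hostname, key,
# and model alias are placeholders). The gateway validates the API key,
# applies rate limits, and forwards the request to the LiteLLM proxy
# inside the cluster.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # placeholder gateway URL
    api_key="pam_xxxxxxxxxxxx",             # key issued by the PAM platform
)

response = client.chat.completions.create(
    model="qwen3-4b",  # alias the LiteLLM proxy maps to the vLLM deployment
    messages=[{"role": "user", "content": "Give me three boss-fight tips."}],
)
print(response.choices[0].message.content)
```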
Key Components
- vLLM Inference: High-performance serving with PagedAttention, running Qwen3-4B on RTX 3060 GPUs
- LiteLLM Proxy: Unified API interface with automatic failover to Claude Haiku during GPU maintenance (see the routing sketch after this list)
- Artemis Gateway: Public-facing nginx proxy handling TLS, rate limiting, and request routing
- PAM Platform: Custom-built API key management with Stripe billing integration
- Data Layer: RAG capabilities with vector search (bge-m3 embeddings) and knowledge graph (a retrieval sketch also follows the list)
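The failover behavior lives in the LiteLLM layer. In production it is declared in the proxy's configuration; the sketch below expresses the same idea with LiteLLM's Python Router, and the model names, endpoints, and keys are illustrative assumptions rather than the actual values.

```python
# Illustrative failover routing with LiteLLM's Router. The real setup is
# the LiteLLM proxy configured declaratively; names and endpoints here
# are placeholders.
import os

from litellm import Router

router = Router(
    model_list=[
        {
            # Primary deployment: vLLM serving Qwen3-4B on the GPU nodes.
            "model_name": "chat-default",
            "litellm_params": {
                "model": "openai/qwen3-4b",
                "api_base": "http://vllm.inference.svc.cluster.local:8000/v1",
                "api_key": "not-needed",  # cluster-internal endpoint
            },
        },
        {
            # Fallback deployment: Claude Haiku via the Anthropic API.
            "model_name": "chat-fallback",
            "litellm_params": {
                "model": "anthropic/claude-3-haiku-20240307",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
            },
        },
    ],
    # If the primary errors (e.g. a GPU node is drained), retry on the fallback.
    fallbacks=[{"chat-default": ["chat-fallback"]}],
)

reply = router.completion(
    model="chat-default",
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)
```

The data layer's retrieval path follows the usual embed-and-search pattern. This stripped-down version uses sentence-transformers and a brute-force cosine search in place of the actual vector store, so it only shows the shape of the pipeline:

```python
# Toy retrieval sketch: embed documents with bge-m3 and rank by cosine
# similarity. The production data layer uses a proper vector store; this
# brute-force version just illustrates the flow.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

docs = [
    "API keys can be rotated from the account dashboard.",
    "GPU nodes are drained one at a time during maintenance windows.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

query_vec = model.encode(["how do I rotate my key?"], normalize_embeddings=True)[0]
scores = doc_vecs @ query_vec  # cosine similarity, since vectors are normalized
best = int(np.argmax(scores))
print(f"{scores[best]:.3f}  {docs[best]}")
```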
Technology Stack
- Orchestration: Kubernetes (12 nodes, 5 GPU workers), FluxCD, Kustomize
- Inference: vLLM, Qwen3-4B, LiteLLM, Claude Haiku (fallback)
- Hardware and edge: RTX 3060 GPUs, AWS EC2 gateway (nginx)
- Platform services: custom PAM platform, Stripe billing
- Data: bge-m3 embeddings, vector search, knowledge graph
- Observability: Prometheus, Grafana
Results
Key Learnings
Building this platform provided invaluable experience in production ML infrastructure. The most important lessons learned:
- Failover is essential: GPU workloads can fail unpredictably. Having Claude Haiku as a fallback ensures continuous service.
- GitOps simplifies everything: FluxCD's reconciliation loop catches drift and ensures consistency across environments.
- Observability first: Prometheus metrics and Grafana dashboards made debugging and optimization possible (a minimal metrics sketch follows this list).
- API gateways add complexity but are worth it: Centralized rate limiting, auth, and routing simplify client integration.
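As an example of what "observability first" looks like in practice, the snippet below exposes the kind of request counters and latency histograms that Prometheus scrapes and Grafana charts; the metric names, labels, and port are illustrative, not the platform's actual ones.

```python
# Minimal Prometheus instrumentation sketch; metric names, labels, and
# the port are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "inference_requests_total", "Inference requests served", ["model", "status"]
)
LATENCY = Histogram("inference_latency_seconds", "End-to-end inference latency")

def handle_request(model: str) -> None:
    with LATENCY.time():                       # records duration into the histogram
        time.sleep(random.uniform(0.05, 0.2))  # stand-in for the actual model call
    REQUESTS.labels(model=model, status="ok").inc()

if __name__ == "__main__":
    start_http_server(9090)  # serves /metrics for the Prometheus scraper
    while True:
        handle_request("qwen3-4b")
```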
"The platform now handles inference for five consumer applications including AI chat interfaces, game assistants, and multi-agent deliberation systems - all from the same underlying infrastructure."