How We Handle 500+ Concurrent Users: A Deep Dive Into Our Base Infrastructure

Rahul Mehta · April 2, 2026 · 9 min read

When we say we've tested at 500+ concurrent users, we don't mean we ran a quick Locust script and called it done. We mean we systematically broke the system, tracked down each failure point we hit, and fixed it before shipping.

Here's what that actually looked like.

The Architecture

Our Base Infrastructure runs on a microservices pattern. An API gateway handles authentication and routing. Behind it sit three core services: a transcription service (Whisper/Deepgram), an AI orchestration service (model-agnostic, routes to Claude or OpenAI based on task type), and a streaming service that manages WebSocket connections and SSE streams.
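To make the "routes based on task type" idea concrete, here is a minimal sketch of a routing table. The task names and the task-to-provider mapping are hypothetical, since this post does not say which task types go to which model; only the two providers come from the description above.

```typescript
// Hypothetical task categories, invented for illustration.
type TaskType = "summarize" | "extract" | "chat";
type Provider = "claude" | "openai";

// Assumed mapping; the real routing rules are not described in this post.
const ROUTES: Record<TaskType, Provider> = {
  summarize: "claude",
  extract: "openai",
  chat: "openai",
};

// Look up the provider configured for a given task type.
function routeTask(task: TaskType): Provider {
  return ROUTES[task];
}

console.log(routeTask("summarize"));
```

Keeping the mapping in one table is what makes a service like this model-agnostic: callers name a task, not a vendor, so swapping a provider is a one-line change.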

State lives in PostgreSQL. Session data lives in Redis with TTLs. Inter-service communication goes through a Redis Streams message queue for async operations and direct HTTP for synchronous ones.

The Load Test

We used k6 for load testing. The test scenario: a user connects via WebSocket, sends a voice message (simulated as a binary payload), receives a streaming text response, and disconnects. We ramped from 10 to 500 concurrent virtual users over 5 minutes, held at 500 for 10 minutes, then ramped down.
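The ramp profile described above can be written down as data plus a small interpolation function. This is a sketch of the shape of the test, not the actual k6 script: the ramp and hold durations come from the text, while the 2-minute ramp-down and all names (`Stage`, `targetVusAt`) are assumptions.

```typescript
// One segment of the profile: move linearly to `target` virtual users over `seconds`.
interface Stage {
  seconds: number;
  target: number;
}

const START_VUS = 10;

// Ramp-up and hold match the description; the ramp-down length is assumed.
const STAGES: Stage[] = [
  { seconds: 5 * 60, target: 500 },  // ramp from 10 to 500
  { seconds: 10 * 60, target: 500 }, // hold at 500
  { seconds: 2 * 60, target: 0 },    // ramp down (assumed duration)
];

// Virtual users the profile calls for at `t` seconds from the start,
// using linear interpolation inside each stage.
function targetVusAt(t: number, startVus: number = START_VUS, stages: Stage[] = STAGES): number {
  let from = startVus;
  let elapsed = 0;
  for (const stage of stages) {
    if (t <= elapsed + stage.seconds) {
      const fraction = (t - elapsed) / stage.seconds;
      return Math.round(from + (stage.target - from) * fraction);
    }
    elapsed += stage.seconds;
    from = stage.target;
  }
  // Past the end of the profile: stay at the final target.
  return from;
}

console.log(targetVusAt(150)); // halfway through the ramp-up
```

In k6 itself, a profile like this is typically expressed through the `startVUs` and `stages` settings of a ramping scenario, so the function here only serves to show what load the test applies at each moment.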

**Failure mode 1: WebSocket connection pool exhaustion.** At around 300 concurrent users, the streaming service started refusing connections. Root cause: the default listen backlog on the Node.js HTTP server was too small for the rate of incoming connections. Fix: explicit connection pool configuration and a load balancer in front of the streaming service.

**Failure mode 2: Redis connection timeouts.** At peak load, session reads were timing out. Root cause: a single Redis connection being shared across async handlers. Fix: a proper connection pool with min/max bounds and timeout configuration.
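The idea behind "min/max bounds and timeout configuration" can be sketched with a small generic pool. This is illustrative only: it is not tied to any Redis client, the option names (`min`, `max`, `acquireTimeoutMs`) are assumptions, and a production service would more likely rely on the pooling offered by its client library.

```typescript
interface PoolOptions {
  min: number;              // connections created up front
  max: number;              // hard upper bound on open connections
  acquireTimeoutMs: number; // how long a caller may wait for a free connection
}

interface Waiter<T> {
  resolve: (conn: T) => void;
  timer: ReturnType<typeof setTimeout>;
}

class BoundedPool<T> {
  private readonly create: () => T;
  private readonly opts: PoolOptions;
  private idle: T[] = [];
  private waiters: Waiter<T>[] = [];
  private open = 0;

  constructor(create: () => T, opts: PoolOptions) {
    this.create = create;
    this.opts = opts;
    for (let i = 0; i < opts.min; i++) {
      this.idle.push(create());
      this.open++;
    }
  }

  // Resolve with an idle connection, or a new one if still under `max`;
  // otherwise wait until release() hands one over or the timeout fires.
  acquire(): Promise<T> {
    const conn = this.idle.pop();
    if (conn !== undefined) return Promise.resolve(conn);
    if (this.open < this.opts.max) {
      this.open++;
      return Promise.resolve(this.create());
    }
    return new Promise<T>((resolve, reject) => {
      const waiter: Waiter<T> = {
        resolve,
        timer: setTimeout(() => {
          this.waiters = this.waiters.filter((w) => w !== waiter);
          reject(new Error("timed out waiting for a connection"));
        }, this.opts.acquireTimeoutMs),
      };
      this.waiters.push(waiter);
    });
  }

  // Hand the connection to the oldest waiter, or park it as idle.
  release(conn: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      clearTimeout(waiter.timer);
      waiter.resolve(conn);
    } else {
      this.idle.push(conn);
    }
  }

  stats(): { open: number; idle: number; waiting: number } {
    return { open: this.open, idle: this.idle.length, waiting: this.waiters.length };
  }
}
```

The contrast with a single shared connection is that handlers no longer serialize behind one socket, and the timeout turns an indefinite stall into an explicit error the caller can handle. Callers would normally pair `acquire()` with `release()` in a `finally` block.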

**Failure mode 3: AI API rate limits.** At 500 concurrent users, we hit OpenAI's rate limits. Fix: a request queue with backpressure, exponential backoff, and automatic failover to Claude when OpenAI is throttled.
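The three parts of this fix (a bounded queue for backpressure, exponential backoff, and failover) can be sketched as small, separately testable pieces. All constants here, including the 250 ms base delay, the 8 s cap, and the three-retry threshold, are assumed values for illustration; the post does not state the real ones. Jitter, which is common in production backoff, is left out to keep the example deterministic.

```typescript
type Provider = "openai" | "claude";

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs: number = 250, capMs: number = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Bounded FIFO queue: enqueue() returns false when full, so the caller can
// slow down or shed load instead of accumulating unbounded work.
class BoundedQueue<T> {
  private readonly capacity: number;
  private items: T[] = [];

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  enqueue(item: T): boolean {
    if (this.items.length >= this.capacity) return false;
    this.items.push(item);
    return true;
  }

  dequeue(): T | undefined {
    return this.items.shift();
  }

  get size(): number {
    return this.items.length;
  }
}

// Decide what to do after a request has been throttled `throttledCount` times:
// retry OpenAI with a growing delay, then fail over to Claude.
function nextStep(throttledCount: number, maxRetries: number = 3): { provider: Provider; waitMs: number } {
  if (throttledCount >= maxRetries) return { provider: "claude", waitMs: 0 };
  const waitMs = throttledCount === 0 ? 0 : backoffDelayMs(throttledCount - 1);
  return { provider: "openai", waitMs };
}
```

A worker would dequeue a request, call the provider returned by `nextStep`, and on a rate-limit response wait `waitMs` before trying again with an incremented count. The bounded queue is what keeps a throttled upstream from turning into unbounded memory growth.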

The Results

After these fixes, the system held 500 concurrent users with p99 latency under 2.4 seconds for a full round trip (voice-in to text-out), zero connection drops, and zero dropped messages. Horizontal scaling via Kubernetes adds capacity linearly when needed.
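For readers less familiar with the metric, p99 is the latency that 99% of requests stayed at or below. A minimal sketch using the nearest-rank method follows; load testing tools compute this themselves and may interpolate differently, and the sample values in the usage line are synthetic, not the measurements from this test.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Synthetic round-trip latencies in milliseconds: 1, 2, ..., 100.
const syntheticMs = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(percentile(syntheticMs, 99) < 2400);
```

A p99 target is stricter than an average: one slow request in a hundred is enough to move it, which is why it is a better proxy for what the unluckiest users experience.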

We publish these numbers because they're the only credible evidence that a system works. Architecture diagrams don't prove anything. Load test results do.