System Design Study

GenAI Forward Deployed Engineer prep · foundations before scenario practice · last built 2026-05-25

Study sequence (the plan we agreed on):

Get comfortable with general SD fundamentals (cheatsheet below) + review 3–4 illuminating examples.
Review GenAI-specific architectures (cheatsheet below) + 1–2 illuminating examples.
Then we practice scenarios together (Days 12–13). We drill the interview framework itself in the practice/mock rounds near June 1, not now.

1 · Resources to review

Watch/read these offline. The goal is recognition and vocabulary, not memorization.

General SD — channels & references

Gaurav Sen (YouTube) — best for fundamentals: sharding, consistent hashing, CAP, load balancing, caching. Start here.
ByteByteGo (YouTube) — Alex Xu; best visual system breakdowns. 5–15 min animated videos.
system-design-primer (GitHub) — the canonical free written reference; skim the "study guide" + the worked examples.
Hello Interview — free written design walkthroughs structured the way interviews actually go.
Book (if you want depth): Alex Xu, System Design Interview Vol 1 & 2.

3–4 illuminating examples (general)

Each teaches a reusable lesson. Watch/read one walkthrough of each (search the title on ByteByteGo or Gaurav Sen, or read the system-design-primer / Hello Interview version).

Design a URL shortener (TinyURL) — teaches: back-of-envelope estimation, hashing/base62 key generation, read-heavy caching, SQL-vs-NoSQL choice. The cleanest first example.

Design a rate limiter — teaches: token bucket / leaky bucket / sliding window, where to place it (gateway vs service), distributed state in Redis. (You already built the token-bucket logic conceptually — this is the systems framing.)

Design a news feed / Twitter timeline — teaches: fan-out-on-write vs fan-out-on-read, caching hot timelines, the "celebrity" hotspot problem. The classic push/pull tradeoff.

Design a chat system (WhatsApp) or YouTube/Netflix — teaches: (chat) websockets, delivery/ordering, presence; (video) CDN, blob storage, the read-path at massive scale. Pick whichever interests you.

GenAI — written walkthroughs

RAG: orq.ai — RAG Architecture Explained · n8n — RAG System Architecture (production) · Pinecone Learn (vector search + RAG fundamentals).
LLM serving: Ubicloud — Life of an inference request (vLLM) (excellent end-to-end) · Nebius — Serving LLMs with vLLM.

1–2 illuminating examples (GenAI)

RAG document Q&A for an enterprise — teaches: the full ingestion→retrieval→generation pipeline, vector DB choice, chunking, hybrid search + re-ranking, grounding/eval. Maps to your RRK scenario #2.

LLM-powered chatbot / serving platform at scale — teaches: latency vs throughput, continuous batching, KV cache, prompt caching, autoscaling, cost. Maps to your RRK scenario #1.

2 · General SD fundamentals — cheatsheet

The building blocks. For each, know what it is, when to reach for it, and the main tradeoff.

Requirements & estimation · start every design here

Functional ("what it does") vs non-functional ("how well"): scale/QPS, latency, availability, consistency, durability, cost. Pin these numbers before designing.
Back-of-envelope: estimate QPS (DAU × actions ÷ 86,400), storage (records × size × retention), bandwidth, memory. Round aggressively; show the reasoning.
Availability in "nines": 99.9% ≈ 8.76 h/yr of downtime, 99.99% ≈ 52.6 min/yr. Latency: reason in p50/p95/p99, not averages.
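The estimation formulas above can be written out as a small sketch. The function names and the example inputs (10M DAU, 20 actions per user, 500-byte records) are made up for illustration; only the arithmetic comes from the notes.

```python
SECONDS_PER_DAY = 86_400
HOURS_PER_YEAR = 365 * 24  # 8,760


def average_qps(dau: int, actions_per_user_per_day: float) -> float:
    """Average queries per second: DAU x actions / seconds in a day."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY


def storage_bytes(records_per_day: int, bytes_per_record: int, retention_days: int) -> int:
    """Raw storage: records x size x retention (no replication or indexes)."""
    return records_per_day * bytes_per_record * retention_days


def downtime_hours_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


# Illustrative inputs only; swap in the numbers from your own requirements.
print(f"avg QPS      ~ {average_qps(10_000_000, 20):,.0f}")
print(f"storage / yr ~ {storage_bytes(1_000_000, 500, 365) / 1e9:,.1f} GB")
print(f"99.9%  down  ~ {downtime_hours_per_year(99.9):.2f} h/yr")
print(f"99.99% down  ~ {downtime_hours_per_year(99.99) * 60:.1f} min/yr")
```

This gives an average, not a peak; in an interview it is usually worth stating a peak multiplier as a separate assumption.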

Scaling & load balancing

Vertical (bigger box, simple, capped) vs horizontal (more boxes, needs statelessness + LB). Default to horizontal for scale.
Load balancer: L4 (transport, fast) vs L7 (application, content-aware routing). Algorithms: round-robin, least-connections, consistent hashing (minimizes reshuffling on node add/remove — key for caches & sharding).
Stateless services scale trivially; push state to data stores / caches so any replica can serve any request.
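To make "minimizes reshuffling" concrete, here is a minimal consistent-hash ring with virtual nodes. It is a sketch under assumptions: the class name `HashRing`, the choice of SHA-256, and 100 virtual nodes per server are all illustrative, not taken from any particular load balancer.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring; each node owns several points on the ring."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # ring position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of SHA-256 as an integer ring position.
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owner:  # ignore the (very unlikely) collision
                continue
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove(self, node: str) -> None:
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        if not self._positions:
            raise LookupError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._positions, self._hash(key)) % len(self._positions)
        return self._owner[self._positions[idx]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {f"user:{i}": ring.lookup(f"user:{i}") for i in range(1000)}
ring.add("cache-d")
moved = [k for k, owner in before.items() if ring.lookup(k) != owner]
print(f"{len(moved)} of {len(before)} keys moved after adding one node")
```

With naive `hash(key) % N` routing, going from 3 to 4 nodes would remap most keys; on the ring, only the keys now owned by the new node move.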

Caching

Where: client → CDN → API gateway → application (Redis/Memcached) → database. Cache closest to the reader that's still correct.
Strategies: cache-aside (lazy, most common), write-through (consistent, slower writes), write-back (fast writes, durability risk).
Eviction: LRU / LFU / TTL. Invalidation is the hard part — stale data vs thundering herd on expiry.
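The cache-aside strategy with TTL expiry can be sketched in a few lines. This is an in-process toy, assuming a hypothetical `load_from_db` callable stands in for the real database; a production version would sit on Redis or Memcached and would need protection against many callers reloading the same expired key at once.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Read path: check the cache, fall back to the source on a miss, then fill."""

    def __init__(self, load_from_db: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit
        value = self._load(key)                  # miss or expired: go to the source
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read reloads fresh data."""
        self._entries.pop(key, None)


cache = CacheAside(load_from_db=lambda key: f"row-for-{key}", ttl_seconds=30)
print(cache.get("user:42"))  # loads from the source
print(cache.get("user:42"))  # served from the cache
```

The TTL bounds how stale a read can be; `invalidate` is the explicit path for writes that must be visible sooner.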
"There are only two hard things: cache invalidation and naming things."

Data stores

Type	Use when
SQL (relational)	Strong consistency, transactions, complex joins, well-defined schema.
NoSQL key-value (Redis, DynamoDB)	Simple lookups, massive scale, low latency, flexible schema.
Document (MongoDB)	Semi-structured/nested data, schema flexibility.
Wide-column (Cassandra)	Write-heavy, time-series, huge scale, tunable consistency.
Graph (Neo4j)	Relationship-centric queries (social, fraud).
Vector (pgvector, Pinecone)	Semantic similarity over embeddings — the GenAI store.

Indexing: trades write speed + storage for read speed.
Sharding/partitioning: split by a key (user_id, geo); watch for hotspots from a bad key.
Replication: leader-follower (read scaling), multi-leader / quorum (availability); read replicas can serve stale data.

Consistency · CAP

CAP: under a network partition you choose Consistency or Availability. Many large web systems favor AP + eventual consistency; systems where a wrong answer is worse than no answer (payments, ledgers) usually favor CP.
Strong (every read sees the latest write) vs eventual (converges over time). Variants: read-after-write, monotonic reads.
ACID (transactional DBs) vs BASE (Basically Available, Soft state, Eventual consistency — NoSQL).

Async & messaging

Message queue (SQS, RabbitMQ) / log (Kafka): decouple producers from consumers, absorb spikes, enable retries. Smooths bursty load.
Pub/sub for fan-out to many consumers. Most brokers deliver at-least-once, so a message can arrive twice; consumers must be idempotent and deduplicate.
Use async for anything slow or spiky: notifications, media processing, analytics, LLM batch jobs. Backpressure when consumers fall behind.
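The idempotency point is easiest to see in code. Below is a minimal sketch of a deduplicating consumer, assuming every message carries a unique ID; the class name and the in-memory `set` are illustrative, and a real system would keep seen IDs in a durable store with expiry.

```python
from typing import Any, Callable, Set


class IdempotentConsumer:
    """Processes each message ID at most once, even if the broker redelivers it."""

    def __init__(self, handler: Callable[[Any], None]):
        self._handler = handler
        self._seen: Set[str] = set()  # assumption: in-memory; use a durable store in practice

    def handle(self, message_id: str, payload: Any) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so a failure gets retried.
        self._seen.add(message_id)
        return True


sent = []
consumer = IdempotentConsumer(handler=sent.append)
consumer.handle("msg-1", "welcome email")
consumer.handle("msg-1", "welcome email")  # redelivery is skipped
print(sent)
```

Marking after the handler means a crash between the two steps can still process a message twice, so the handler's side effect should ideally be idempotent as well.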

Reliability, API, observability

Reliability

Redundancy + failover, no single point of failure.
Retries with exponential backoff + jitter.
Circuit breakers, timeouts, graceful degradation.
Rate limiting: token/leaky bucket, sliding window.
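As a sketch of "retries with exponential backoff + jitter": the helper below uses the "full jitter" variant (sleep a random amount between zero and the capped exponential delay). The function name, defaults, and the injectable `sleep`/`rng` parameters are assumptions chosen to keep the example testable.

```python
import random
import time
from typing import Any, Callable


def call_with_retries(fn: Callable[[], Any], attempts: int = 5, base: float = 0.1,
                      cap: float = 5.0, sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> Any:
    """Call fn, retrying on any exception with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: random delay in [0, min(cap, base * 2**attempt)).
            sleep(rng() * min(cap, base * 2 ** attempt))


state = {"calls": 0}

def flaky() -> str:
    """Fails twice, then succeeds (stands in for a flaky downstream call)."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(call_with_retries(flaky, base=0.01))
```

The jitter matters because many clients that fail at the same moment would otherwise retry at the same moment too. In practice you would retry only on errors known to be transient rather than on every exception.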

API & observability

REST (simple) · gRPC (fast, internal) · GraphQL (flexible reads).
Pagination, idempotency keys, versioning.
Logging · metrics · tracing (the 3 pillars).
SLI / SLO / SLA; alert on user-facing symptoms.

3 · GenAI architecture — cheatsheet

The patterns that show up in GenAI FDE system design. This is the differentiating material for the role.

RAG (retrieval-augmented generation) · core FDE pattern

Two pipelines: an offline ingestion pipeline and an online query pipeline. Query flow: query → retrieve → assemble context → generate.

Ingestion

Chunking (often the biggest lever on quality): size + overlap; semantic vs fixed-size; respect document structure. Bad chunking is one of the most common causes of weak RAG.
Embedding model choice; embed chunks → store vectors + metadata in a vector DB (HNSW / IVF index).
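A minimal sketch of fixed-size chunking with overlap, the simplest of the strategies named above. It counts whitespace-separated words rather than model tokens, and the function name and default sizes are illustrative assumptions; structure-aware or semantic chunking would split on headings or sentence boundaries instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of chunk_size, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk; the cost is more vectors to store and search.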

Retrieval

Dense (semantic, embeddings) + sparse (keyword, BM25) → hybrid search for robustness.
Re-ranking (cross-encoder) on the top-k candidates usually improves precision noticeably, at the cost of extra latency per query.
Context management: fit the window, order matters ("lost in the middle"), dedup, cite sources.
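One common way to combine the dense and sparse result lists is reciprocal rank fusion (RRF). The notes do not name a fusion method, so treat this as one option rather than the prescribed one; `k = 60` is the constant conventionally used with RRF, and the document IDs are made up.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["d1", "d2", "d3"]   # hypothetical embedding-search results
sparse_hits = ["d3", "d1", "d4"]  # hypothetical BM25 results
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF only needs ranks, not scores, which sidesteps the problem that cosine similarities and BM25 scores are on different scales. The fused list is what you would then hand to a re-ranker.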

Eval & failure modes

Measure retrieval (recall@k, precision) separately from generation (faithfulness/groundedness, answer relevance).
Common failures: bad chunk boundaries, weak embeddings, retrieval misses, hallucination despite good context.
RAG vs fine-tuning vs long-context: RAG for fresh/proprietary/citable knowledge; fine-tune for behavior/format/domain style; long-context when the corpus is small.
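The retrieval metrics mentioned above are simple to compute once you have labeled relevant chunks per query. A minimal sketch, with illustrative function names and made-up IDs:

```python
from typing import Collection, Sequence


def recall_at_k(retrieved: Sequence[str], relevant: Collection[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Collection[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k


retrieved = ["a", "b", "c", "d"]  # ranked retriever output for one query
relevant = {"a", "d", "e"}        # ground-truth relevant chunk IDs
print(recall_at_k(retrieved, relevant, k=4))
print(precision_at_k(retrieved, relevant, k=4))
```

Low recall@k points at ingestion or retrieval (chunking, embeddings, search); good recall with poor answers points at context assembly or generation.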

LLM serving & inference · latency/cost core

Two metrics: TTFT (time-to-first-token, dominated by prefill) and TPOT/ITL (per-token, dominated by decode). Streaming makes TTFT the latency users actually feel and masks the total generation time.
Latency vs throughput is the central tradeoff; batching trades one for the other.
Continuous (in-flight) batching: add/remove requests every step instead of waiting for a fixed batch. A big GPU-utilization win; reported throughput gains are often several-fold but depend heavily on workload and baseline.
KV cache + PagedAttention: cache attention keys/values to avoid recompute; page it like OS virtual memory to nearly eliminate fragmentation (the vLLM authors report earlier systems wasting 60–80% of KV-cache memory).
Prompt/prefix caching: reuse the KV for shared prompt prefixes (system prompts, few-shot) → big latency cut. (This is exactly the prompt-prefix caching your LRU/trie work modeled.)
Quantization (INT8/INT4, GPTQ/AWQ): smaller, faster, cheaper, slight quality cost. Speculative decoding: a small draft model proposes tokens a big model verifies.
Serving stacks: vLLM, TGI, TensorRT-LLM. Autoscaling on GPU is slow + expensive — scale-to-zero rarely fits low-latency.
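Two back-of-envelope calculations tie the serving metrics together: response latency from TTFT and TPOT, and the memory the KV cache needs. The function names and the example numbers (300 ms TTFT, 20 ms per token, a 32-layer model shape) are illustrative assumptions, not measurements.

```python
def end_to_end_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Total time for one response: the first token, then one TPOT per remaining token."""
    if output_tokens < 1:
        raise ValueError("need at least one output token")
    return ttft_ms + tpot_ms * (output_tokens - 1)


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int, tokens: int) -> int:
    """KV-cache size: keys and values (the factor of 2) per layer, head, and cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical request: 300 ms to first token, 20 ms per token, 200 output tokens.
print(end_to_end_latency_ms(300, 20, 200), "ms total")

# Hypothetical model shape: 32 layers, 32 KV heads of dim 128, 2-byte (FP16) values.
per_token = kv_cache_bytes(32, 32, 128, 2, tokens=1)
print(per_token / 2**20, "MiB of KV cache per token")
print(kv_cache_bytes(32, 32, 128, 2, tokens=4096) / 2**30, "GiB for a 4,096-token sequence")
```

The second number is why KV-cache memory, not compute, often limits how many requests fit in a batch, and why paging and prefix reuse pay off.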

Agentic systems

ReAct loop: reason → act (call a tool) → observe → repeat until done. Tool/function calling is the action interface.
Planning (decompose a goal into steps), memory (short-term scratchpad + long-term store, often a vector DB).
Control: step/iteration caps to avoid infinite loops, termination conditions, fallbacks. Guardrails on tool use.
Multi-agent orchestration (planner + workers) when one agent's context/skills aren't enough. Eval is hard — measure task success + trajectory quality.
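The ReAct loop and its step cap can be sketched without any LLM at all. Here a hypothetical `policy` callable stands in for the model: given the goal and the history so far, it returns either an action (tool name plus input) or a final answer. All names and the tuple format are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]      # (tool name, tool input, observation)
Decision = Tuple[str, str, str]  # ("act", tool, input) or ("finish", answer, "")


def run_agent(policy: Callable[[str, List[Step]], Decision],
              tools: Dict[str, Callable[[str], str]],
              goal: str, max_steps: int = 5) -> str:
    """Reason -> act -> observe, repeated until the policy finishes or the cap is hit."""
    history: List[Step] = []
    for _ in range(max_steps):
        kind, name, arg = policy(goal, history)      # reason: decide the next move
        if kind == "finish":
            return name                              # for "finish", name holds the answer
        tool = tools.get(name)
        # Act, then record the observation; unknown tools become an error observation.
        observation = tool(arg) if tool else f"error: unknown tool {name!r}"
        history.append((name, arg, observation))
    return "stopped: step limit reached"             # fallback instead of looping forever


def scripted_policy(goal: str, history: List[Step]) -> Decision:
    """Stand-in for an LLM: call the adder once, then answer with its result."""
    if not history:
        return ("act", "add", "2+3")
    return ("finish", f"The answer is {history[-1][2]}", "")


tools = {"add": lambda expr: str(sum(int(x) for x in expr.split("+")))}
print(run_agent(scripted_policy, tools, goal="What is 2 + 3?"))
```

In a real agent the policy is a model call that sees the history as part of its prompt, and the guardrails would also validate tool names and arguments before execution.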

LLMOps & cross-cutting concerns

LLMOps

Prompt versioning & management.
Eval pipelines: offline + online, LLM-as-judge, A/B.
Monitoring: drift, hallucination rate, token/cost, latency.
Feedback loop back to prompts/data/model.

Cost · latency · safety

Cost levers: smaller models, caching, routing, quantization, token budgets.
Latency levers: streaming, prefix cache, smaller/distilled models, parallel retrieval.
Safety/privacy: input/output filtering, PII handling, jailbreak defense — critical for enterprise FDE (data residency, governance).

4 · The answer framework · drill in practice rounds, not now

Listed for awareness; we'll rehearse this live in the mock rounds near June 1.

Clarify requirements (functional + non-functional, scale numbers) → estimate (QPS/storage) → API sketch → high-level architecture → data model → deep-dive one or two components → bottlenecks & tradeoffs → scale story (10K → millions) with cost/latency.