System Design Study
GenAI Forward Deployed Engineer prep · foundations before scenario practice · last built 2026-05-25
Study sequence (the plan we agreed on):
- Get comfortable with general SD fundamentals (cheatsheet below) + review 3–4 illuminating examples.
- Review GenAI-specific architectures (cheatsheet below) + 1–2 illuminating examples.
- Then we practice scenarios together (Days 12–13). The interview framework itself we drill in the practice/mock rounds near June 1 — not now.
1 · Resources to review
Watch/read these offline. The goal is recognition and vocabulary, not memorization.
General SD — channels & references
- Gaurav Sen (YouTube) — best for fundamentals: sharding, consistent hashing, CAP, load balancing, caching. Start here.
- ByteByteGo (YouTube) — Alex Xu; best visual system breakdowns. 5–15 min animated videos.
- system-design-primer (GitHub) — the canonical free written reference; skim the "study guide" + the worked examples.
- Hello Interview — free written design walkthroughs structured the way interviews actually go.
- Book (if you want depth): Alex Xu, System Design Interview Vol 1 & 2.
3–4 illuminating examples (general)
Each teaches a reusable lesson. Watch/read one walkthrough of each (search the title on ByteByteGo or Gaurav Sen, or read the system-design-primer / Hello Interview version).
- Design a URL shortener (TinyURL). Teaches: back-of-envelope estimation, hashing/base62 key generation, read-heavy caching, SQL-vs-NoSQL choice. The cleanest first example.
- Design a rate limiter. Teaches: token bucket / leaky bucket / sliding window, where to place it (gateway vs service), distributed state in Redis. (You already built the token-bucket logic conceptually; this is the systems framing.)
- Design a news feed / Twitter timeline. Teaches: fan-out-on-write vs fan-out-on-read, caching hot timelines, the "celebrity" hotspot problem. The classic push/pull tradeoff.
- Design a chat system (WhatsApp) or YouTube/Netflix. Teaches: (chat) websockets, delivery/ordering, presence; (video) CDN, blob storage, the read path at massive scale. Pick whichever interests you.
GenAI — written walkthroughs
1–2 illuminating examples (GenAI)
- RAG document Q&A for an enterprise. Teaches: the full ingestion → retrieval → generation pipeline, vector DB choice, chunking, hybrid search + re-ranking, grounding/eval. Maps to your RRK scenario #2.
- LLM-powered chatbot / serving platform at scale. Teaches: latency vs throughput, continuous batching, KV cache, prompt caching, autoscaling, cost. Maps to your RRK scenario #1.
2 · General SD fundamentals — cheatsheet
The building blocks. For each, know what it is, when to reach for it, and the main tradeoff.
Requirements & estimation (start every design here)
- Functional ("what it does") vs non-functional ("how well"): scale/QPS, latency, availability, consistency, durability, cost. Pin these numbers before designing.
- Back-of-envelope: estimate QPS (DAU × actions ÷ 86,400), storage (records × size × retention), bandwidth, memory. Round aggressively; show the reasoning.
- Availability: "nines" — 99.9% ≈ 8.7h/yr down, 99.99% ≈ 52 min/yr. Latency: reason in p50/p95/p99, not averages.
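The estimation formulas above can be sketched in a few lines. This is a minimal illustration, not a standard tool: the function names and the example inputs (10M DAU, 10 actions per user, 500-byte records) are made up for the exercise.

```python
SECONDS_PER_DAY = 86_400


def average_qps(dau: int, actions_per_user_per_day: float) -> float:
    """Average queries per second: DAU x actions / 86,400."""
    return dau * actions_per_user_per_day / SECONDS_PER_DAY


def storage_bytes(records_per_day: int, bytes_per_record: int, retention_days: int) -> int:
    """Raw storage: records x size x retention (ignores replication and indexes)."""
    return records_per_day * bytes_per_record * retention_days


def downtime_hours_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * 365 * 24


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to make the arithmetic easy to follow.
    print(f"avg QPS: {average_qps(10_000_000, 10):,.0f}")
    print(f"storage: {storage_bytes(1_000_000, 500, 365) / 1e9:.1f} GB")
    print(f"99.9%  : {downtime_hours_per_year(99.9):.2f} h/yr")
    print(f"99.99% : {downtime_hours_per_year(99.99) * 60:.1f} min/yr")
```

In an interview, peak QPS is usually taken as a small multiple of the average; the multiplier is a judgment call to state out loud, not a fixed constant.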
Scaling & load balancing
- Vertical (bigger box, simple, capped) vs horizontal (more boxes, needs statelessness + LB). Default to horizontal for scale.
- Load balancer: L4 (transport, fast) vs L7 (application, content-aware routing). Algorithms: round-robin, least-connections, consistent hashing (minimizes reshuffling on node add/remove — key for caches & sharding).
- Stateless services scale trivially; push state to data stores / caches so any replica can serve any request.
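To make the "minimizes reshuffling" property of consistent hashing concrete, here is a small sketch of a hash ring with virtual nodes. The class name, the choice of SHA-256, and the default of 100 virtual nodes are illustrative assumptions, not a reference implementation.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes; a key belongs to the next node clockwise."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # (h,) sorts before any (h, node), so this finds the first position >= h.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


if __name__ == "__main__":
    ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
    keys = [f"user:{i}" for i in range(1000)]
    before = {k: ring.get(k) for k in keys}
    ring.add("cache-d")
    moved = sum(1 for k in keys if ring.get(k) != before[k])
    print(f"{moved} of {len(keys)} keys moved after adding a fourth node")
```

With plain `hash(key) % N`, going from 3 to 4 nodes would remap most keys; on the ring, only the keys that now fall to the new node move.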
Caching
- Where: client → CDN → API gateway → application (Redis/Memcached) → database. Cache closest to the reader that's still correct.
- Strategies: cache-aside (lazy, most common), write-through (consistent, slower writes), write-back (fast writes, durability risk).
- Eviction: LRU / LFU / TTL. Invalidation is the hard part — stale data vs thundering herd on expiry.
- "There are only two hard things in Computer Science: cache invalidation and naming things." (attributed to Phil Karlton)
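Cache-aside with a TTL, the most common strategy listed above, might look like the sketch below. It uses an in-process dict as a stand-in for Redis/Memcached; `CacheAside`, `load_from_db`, and the injectable `clock` are hypothetical names introduced for the example.

```python
import time


class CacheAside:
    """Lazy (cache-aside) reads: check the cache, fall back to the source on a miss."""

    def __init__(self, load_from_db, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._load = load_from_db      # callable: key -> value (the source of truth)
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries = {}             # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]            # hit: may be stale until TTL or invalidation
        value = self._load(key)        # miss or expired: read through to the store
        self._entries[key] = (value, self._clock() + self._ttl)
        return value

    def invalidate(self, key) -> None:
        """Call after a write so the next read reloads fresh data."""
        self._entries.pop(key, None)
```

The stale-data risk is visible here: a write to the database is not seen until the entry expires or `invalidate` is called. This sketch does nothing about the thundering herd; a real deployment would typically add request coalescing or jittered TTLs.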
Data stores
| Type | Use when |
| SQL (relational) | Strong consistency, transactions, complex joins, well-defined schema. |
| NoSQL key-value (Redis, DynamoDB) | Simple lookups, massive scale, low latency, flexible schema. |
| Document (MongoDB) | Semi-structured/nested data, schema flexibility. |
| Wide-column (Cassandra) | Write-heavy, time-series, huge scale, tunable consistency. |
| Graph (Neo4j) | Relationship-centric queries (social, fraud). |
| Vector (pgvector, Pinecone) | Semantic similarity over embeddings — the GenAI store. |
- Indexing: trades write speed + storage for read speed.
- Sharding/partitioning: split by a key (user_id, geo); watch for hotspots from a bad key.
- Replication: leader-follower (read scaling), multi-leader / quorum (availability). Read replicas can serve stale data.
Consistency (CAP)
- CAP: under a network partition you choose Consistency or Availability. Most large web systems pick AP + eventual consistency; finance picks CP.
- Strong (every read sees the latest write) vs eventual (converges over time). Variants: read-after-write, monotonic reads.
- ACID (transactional DBs) vs BASE (Basically Available, Soft state, Eventual consistency — NoSQL).
Async & messaging
- Message queue (SQS, RabbitMQ) / log (Kafka): decouple producers from consumers, absorb spikes, enable retries. Smooths bursty load.
- Pub/sub for fan-out to many consumers. Idempotency + dedup are mandatory (at-least-once delivery).
- Use async for anything slow or spiky: notifications, media processing, analytics, LLM batch jobs. Backpressure when consumers fall behind.
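Because at-least-once delivery means the same message can arrive twice, the consumer has to deduplicate. A minimal sketch, assuming each message carries a unique ID; the in-memory set stands in for what would need to be a durable store in practice, and all names are hypothetical.

```python
class IdempotentConsumer:
    """Wraps a handler so redelivered messages are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in production, a durable store with a TTL

    def handle(self, message_id: str, payload) -> bool:
        """Return True if the payload was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after success, so a failed handler can be retried.
        self._seen.add(message_id)
        return True


if __name__ == "__main__":
    balance = {"total": 0}

    def credit(amount: int) -> None:
        balance["total"] += amount

    consumer = IdempotentConsumer(credit)
    consumer.handle("msg-1", 50)
    consumer.handle("msg-1", 50)  # redelivery: ignored
    print(balance["total"])
```

Marking after the handler runs leaves a small window where a crash causes a reprocess, which is why the handler's own side effects should also be safe to repeat where possible.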
Reliability, API, observability
Reliability
- Redundancy + failover, no single point of failure.
- Retries with exponential backoff + jitter.
- Circuit breakers, timeouts, graceful degradation.
- Rate limiting: token/leaky bucket, sliding window.
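Retries with exponential backoff and jitter can be sketched as follows. This uses the "full jitter" variant (a random delay between zero and the capped exponential value); the function names, defaults, and the set of retryable exceptions are assumptions for illustration.

```python
import random
import time


def backoff_delays(max_retries: int, base: float = 0.1, cap: float = 5.0, rng=random.random):
    """Yield one delay per retry: random in [0, min(cap, base * 2**attempt))."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(fn, max_retries: int = 5, base: float = 0.1, cap: float = 5.0,
                      sleep=time.sleep, retry_on=(ConnectionError, TimeoutError)):
    """Call fn(); on a retryable error, wait and try again up to max_retries times."""
    delays = backoff_delays(max_retries, base, cap)
    while True:
        try:
            return fn()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)
```

The jitter matters because clients that fail together would otherwise retry together and hit the recovering service in synchronized waves. Retrying is only safe for idempotent operations.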
API & observability
- REST (simple) · gRPC (fast, internal) · GraphQL (flexible reads).
- Pagination, idempotency keys, versioning.
- Logging · metrics · tracing (the 3 pillars).
- SLI / SLO / SLA; alert on user-facing symptoms.
3 · GenAI architecture — cheatsheet
The patterns that show up in GenAI FDE system design. This is the differentiating material for the role.
RAG (retrieval-augmented generation): core FDE pattern
Two pipelines: an offline ingestion pipeline and an online query pipeline. Query flow: query → retrieve → assemble context → generate.
Ingestion
- Chunking (biggest lever on quality): size + overlap; semantic vs fixed-size; respect document structure. Bad chunking is the #1 cause of weak RAG.
- Embedding model choice; embed chunks → store vectors + metadata in a vector DB (HNSW / IVF index).
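The "size + overlap" idea for fixed-size chunking can be sketched as below. It splits on whitespace for simplicity; real pipelines usually count model tokens and respect headings or paragraphs, and the defaults here are arbitrary placeholders.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks of chunk_size, each sharing `overlap` words
    with the previous chunk so content cut at a boundary appears in both."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Larger overlap reduces the chance of splitting an answer across chunks but increases index size and duplicate retrievals, which is one reason dedup appears in the retrieval step.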
Retrieval
- Dense (semantic, embeddings) + sparse (keyword, BM25) → hybrid search for robustness.
- Re-ranking (cross-encoder) on the top-k candidates sharply improves precision.
- Context management: fit the window, order matters ("lost in the middle"), dedup, cite sources.
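One common way to merge dense and sparse results into a hybrid ranking is reciprocal rank fusion (RRF). The notes above do not name a fusion method, so treat this as one option among several; `k=60` is a conventional default, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so items ranked well by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense_hits = ["d1", "d2", "d3"]    # hypothetical embedding-search results
    sparse_hits = ["d3", "d1", "d4"]   # hypothetical BM25 results
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF only needs ranks, not comparable scores, which is convenient because cosine similarities and BM25 scores live on different scales. The fused top-k is then what a cross-encoder re-ranker would rescore.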
Eval & failure modes
- Measure retrieval (recall@k, precision) separately from generation (faithfulness/groundedness, answer relevance).
- Common failures: bad chunk boundaries, weak embeddings, retrieval misses, hallucination despite good context.
- RAG vs fine-tuning vs long-context: RAG for fresh/proprietary/citable knowledge; fine-tune for behavior/format/domain style; long-context when the corpus is small.
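The retrieval-side metrics mentioned above are simple to compute once you have labeled relevant documents per query. A minimal sketch with hypothetical function names; generation-side metrics such as faithfulness need a judge (human or LLM) and are not covered here.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention chosen for this sketch when nothing is relevant
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    retrieved = ["a", "b", "c", "d"]   # hypothetical ranked results
    relevant = {"a", "d", "z"}         # hypothetical ground truth
    print(precision_at_k(retrieved, relevant, 2), recall_at_k(retrieved, relevant, 4))
```

Low recall@k points at chunking, embeddings, or the retriever; good recall with poor answers points at context assembly or the generator. That is the practical reason to measure the two stages separately.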
LLM serving & inference (latency/cost core)
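- Two metrics: TTFT (time-to-first-token, dominated by prefill) and TPOT/ITL (time per output token / inter-token latency, dominated by decode). Streaming does not shorten TTFT; it makes TTFT the wait users actually perceive and hides the total generation time behind it.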
- Latency vs throughput is the central tradeoff; batching trades one for the other.
- Continuous (in-flight) batching: add/remove requests every step instead of waiting for a fixed batch. Big GPU-utilization win (throughput gains around 5× are often cited, but the factor is workload-dependent).
- KV cache + PagedAttention: cache attention keys/values to avoid recompute; page it like OS virtual memory to kill fragmentation (the vLLM authors report that earlier systems wasted 60–80% of KV-cache memory to fragmentation and over-reservation).
- Prompt/prefix caching: reuse the KV for shared prompt prefixes (system prompts, few-shot) → big latency cut. (This is exactly the prompt-prefix caching your LRU/trie work modeled.)
- Quantization (INT8/INT4, GPTQ/AWQ): smaller, faster, cheaper, slight quality cost. Speculative decoding: a small draft model proposes tokens a big model verifies.
- Serving stacks: vLLM, TGI, TensorRT-LLM. Autoscaling on GPU is slow + expensive — scale-to-zero rarely fits low-latency.
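Two back-of-envelope formulas tie the serving bullets together: end-to-end latency from TTFT and TPOT, and KV-cache size per sequence. The sketch below assumes standard multi-head or grouped-query attention with one K and one V tensor per layer; the model shape in the example is hypothetical, not a specific model.

```python
def end_to_end_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Total generation time: first token after TTFT, then one TPOT per further token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) * tpot_s


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """KV-cache size: 2 (K and V) x layers x KV heads x head dim x tokens x batch x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    print(f"{end_to_end_latency_s(0.5, 0.02, 101):.2f} s for 101 output tokens")
    size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
    print(f"{size / 2**30:.1f} GiB of KV cache per 4096-token sequence at FP16")
```

The second formula explains why KV memory, rather than compute, often caps batch size, and why paging, prefix reuse, and quantizing the cache all raise throughput.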
Agentic systems
- ReAct loop: reason → act (call a tool) → observe → repeat until done. Tool/function calling is the action interface.
- Planning (decompose a goal into steps), memory (short-term scratchpad + long-term store, often a vector DB).
- Control: step/iteration caps to avoid infinite loops, termination conditions, fallbacks. Guardrails on tool use.
- Multi-agent orchestration (planner + workers) when one agent's context/skills aren't enough. Eval is hard — measure task success + trajectory quality.
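The ReAct loop with a step cap can be sketched without any LLM by passing in a `policy` callable that plays the model's role. Everything here (the `run_agent` name, the `("finish", answer)` convention, the toy tool) is a hypothetical interface for illustration, not a framework API.

```python
def run_agent(policy, tools: dict, goal: str, max_steps: int = 5) -> dict:
    """Reason -> act -> observe loop with a hard iteration cap.

    policy(goal, history) returns (action, argument); the action "finish"
    ends the run with the argument as the answer.
    """
    history = []  # short-term scratchpad of (action, argument, observation)
    for _ in range(max_steps):
        action, arg = policy(goal, history)           # "reason": pick the next step
        if action == "finish":
            return {"status": "done", "answer": arg, "steps": history}
        tool = tools.get(action)
        if tool is None:
            observation = f"error: unknown tool {action!r}"   # guardrail on tool use
        else:
            try:
                observation = tool(arg)               # "act"
            except Exception as exc:                  # feed failures back as observations
                observation = f"error: {exc}"
        history.append((action, arg, observation))    # "observe"
    return {"status": "step_limit", "answer": None, "steps": history}  # fallback


def multiply(expr: str) -> str:
    """Toy tool: multiply two integers written as 'a*b'."""
    left, right = expr.split("*")
    return str(int(left) * int(right))


def scripted_policy(goal: str, history: list):
    """Stand-in for an LLM: call the tool once, then finish with its result."""
    if not history:
        return ("multiply", "6*7")
    return ("finish", history[-1][2])
```

Returning a distinct `step_limit` status, rather than raising or looping, gives the caller a clean place to apply the fallback behavior, and the recorded `steps` are what a trajectory-quality eval would inspect.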
LLMOps & cross-cutting concerns
LLMOps
- Prompt versioning & management.
- Eval pipelines: offline + online, LLM-as-judge, A/B.
- Monitoring: drift, hallucination rate, token/cost, latency.
- Feedback loop back to prompts/data/model.
Cost · latency · safety
- Cost levers: smaller models, caching, routing, quantization, token budgets.
- Latency levers: streaming, prefix cache, smaller/distilled models, parallel retrieval.
- Safety/privacy: input/output filtering, PII handling, jailbreak defense — critical for enterprise FDE (data residency, governance).
4 · The answer framework (drill in practice rounds, not now)
Listed for awareness; we'll rehearse this live in the mock rounds near June 1.
Clarify requirements (functional + non-functional, scale numbers) → estimate (QPS/storage) → API sketch → high-level architecture → data model → deep-dive one or two components → bottlenecks & tradeoffs → scale story (10K → millions) with cost/latency.