System Design Study

GenAI Forward Deployed Engineer prep · foundations before scenario practice · last built 2026-05-25

Study sequence (the plan we agreed on):
  1. Get comfortable with general SD fundamentals (cheatsheet below) + review 3–4 illuminating examples.
  2. Review GenAI-specific architectures (cheatsheet below) + 1–2 illuminating examples.
  3. Then we practice scenarios together (Days 12–13). We drill the interview framework itself in the practice/mock rounds near June 1, not now.

1 · Resources to review

Watch/read these offline. The goal is recognition and vocabulary, not memorization.

General SD — channels & references

3–4 illuminating examples (general)

Each teaches a reusable lesson. Watch/read one walkthrough of each (search the title on ByteByteGo or Gaurav Sen, or read the system-design-primer / Hello Interview version).

  • Design a URL shortener (TinyURL) · teaches: back-of-envelope estimation, hashing/base62 key generation, read-heavy caching, SQL-vs-NoSQL choice. The cleanest first example.
  • Design a rate limiter · teaches: token bucket / leaky bucket / sliding window, where to place it (gateway vs service), distributed state in Redis. (You already built the token-bucket logic conceptually; this is the systems framing.)
  • Design a news feed / Twitter timeline · teaches: fan-out-on-write vs fan-out-on-read, caching hot timelines, the "celebrity" hotspot problem. The classic push/pull tradeoff.
  • Design a chat system (WhatsApp) or YouTube/Netflix · teaches: (chat) websockets, delivery/ordering, presence; (video) CDN, blob storage, the read path at massive scale. Pick whichever interests you.
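For the URL shortener, one common way to generate short keys is to base62-encode a unique integer ID. Here is a minimal sketch. It assumes the ID comes from a counter or database sequence, and the alphabet order is an arbitrary choice, not a standard.

```python
# Minimal sketch: base62-encode a unique integer ID into a short URL key.
# Assumptions: IDs come from a counter/DB sequence; alphabet order is arbitrary.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode_base62(n: int) -> str:
    """Convert a non-negative integer ID to a base62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most significant digit first


def decode_base62(key: str) -> int:
    """Invert encode_base62: map a short key back to its integer ID."""
    n = 0
    for ch in key:
        n = n * BASE + ALPHABET.index(ch)
    return n


if __name__ == "__main__":
    # 7 characters cover 62**7 (about 3.5 trillion) distinct IDs.
    print(62 ** 7)
    key = encode_base62(125)
    print(key, decode_base62(key))
```

Sequential IDs make keys guessable. A hash of the long URL is the usual alternative, at the cost of handling collisions.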

GenAI — written walkthroughs

1–2 illuminating examples (GenAI)

  • RAG document Q&A for an enterprise · teaches: the full ingestion→retrieval→generation pipeline, vector DB choice, chunking, hybrid search + re-ranking, grounding/eval. Maps to your RRK scenario #2.
  • LLM-powered chatbot / serving platform at scale · teaches: latency vs throughput, continuous batching, KV cache, prompt caching, autoscaling, cost. Maps to your RRK scenario #1.
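Chunking in the RAG ingestion pipeline can be as simple as fixed-size windows with overlap. This is a minimal sketch that assumes whitespace-split words stand in for tokens. Real pipelines usually count model tokens and often split on sentence or section boundaries. The default sizes are illustrative, not recommendations.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of `chunk_size` words.

    Overlap keeps a sentence that straddles a boundary retrievable from either
    neighboring chunk. Words approximate tokens here (assumption).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```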

2 · General SD fundamentals — cheatsheet

The building blocks. For each, know what it is, when to reach for it, and the main tradeoff.

Requirements & estimation · start every design here

Scaling & load balancing

Caching

Data stores

  Type                                 Use when
  SQL (relational)                     Strong consistency, transactions, complex joins, well-defined schema.
  NoSQL key-value (Redis, DynamoDB)    Simple lookups, massive scale, low latency, flexible schema.
  Document (MongoDB)                   Semi-structured/nested data, schema flexibility.
  Wide-column (Cassandra)              Write-heavy, time-series, huge scale, tunable consistency.
  Graph (Neo4j)                        Relationship-centric queries (social, fraud).
  Vector (pgvector, Pinecone)          Semantic similarity over embeddings; the GenAI store.

  • Indexing: trades slower writes and extra storage for faster reads.
  • Sharding/partitioning: split data by a key (user_id, geo). A bad key (low cardinality or skewed traffic) creates hotspots.
  • Replication: leader-follower scales reads; multi-leader or quorum replication improves availability. Asynchronously updated replicas can serve stale reads.
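Here is a minimal sketch of hash partitioning. It assumes a fixed shard count and uses MD5 only as a stable, well-spread hash. Python's built-in hash() of strings is salted per process, so it is unsuitable for routing. Counting load per shard shows why the key choice matters: hashing user_id spreads load, while a skewed, low-cardinality key like country piles traffic onto one or two shards. The workload below is made up.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # assumption: fixed shard count for illustration


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard with a stable hash (same key -> same shard)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_per_shard(keys: list[str]) -> Counter:
    """Count how many rows/requests land on each shard for a given key choice."""
    return Counter(shard_for(k) for k in keys)


if __name__ == "__main__":
    # Hypothetical workload: 10,000 users, 90% of them in one country.
    users = [f"user-{i}" for i in range(10_000)]
    countries = ["US" if i % 10 else "DE" for i in range(10_000)]
    print("by user_id:", load_per_shard(users))      # spread across all shards
    print("by country:", load_per_shard(countries))  # at most 2 shards used, 90% on one
```

`% num_shards` remaps most keys when the shard count changes. Consistent hashing is the usual fix for that.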

Consistency · CAP

Async & messaging

Reliability, API, observability

Reliability

  • Redundancy + failover, no single point of failure.
  • Retries with exponential backoff + jitter.
  • Circuit breakers, timeouts, graceful degradation.
  • Rate limiting: token/leaky bucket, sliding window.
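Retries with exponential backoff and jitter, sketched with the "full jitter" variant: sleep a random amount between 0 and the capped exponential delay. The base delay, cap, attempt count, and which exceptions count as transient are illustrative assumptions. Only retry operations that are idempotent or protected by an idempotency key.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `op`, retrying transient failures with capped exponential backoff + full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential growth (base * 2^attempt), capped so waits stay bounded.
            cap = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: randomize over [0, cap] so clients don't retry in lockstep.
            sleep(random.uniform(0, cap))
    raise RuntimeError("unreachable")  # the loop always returns or raises


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky() -> str:
        # Hypothetical operation: fails twice with a transient error, then succeeds.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("transient")
        return "ok"

    print(retry_with_backoff(flaky), "after", attempts["n"], "attempts")
```

Non-retryable errors (anything outside `retry_on`) propagate immediately. Pair this with a circuit breaker so a dependency that is fully down isn't hammered.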

API & observability

  • REST (simple) · gRPC (fast, internal) · GraphQL (flexible reads).
  • Pagination, idempotency keys, versioning.
  • Logging · metrics · tracing (the 3 pillars).
  • SLI / SLO / SLA; alert on user-facing symptoms.
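Idempotency keys let a client safely retry a non-idempotent call such as "create charge." The server records the first response under the client-supplied key and replays it on retries. This is a minimal in-memory sketch with a hypothetical `PaymentService`. A real service would persist keys in a shared store with a TTL, reject a reused key with a different payload, and handle concurrent requests carrying the same key. All of that is omitted here.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class PaymentService:
    """Hypothetical service showing idempotency-key handling (in-memory only)."""
    responses: dict[str, dict] = field(default_factory=dict)  # key -> first response
    charges: list[dict] = field(default_factory=list)         # side effects performed

    def create_charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # Replay: a retry with the same key returns the stored response
        # and does NOT repeat the side effect.
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]
        charge = {"id": str(uuid.uuid4()), "amount_cents": amount_cents}
        self.charges.append(charge)  # the side effect we must not duplicate
        self.responses[idempotency_key] = charge
        return charge


if __name__ == "__main__":
    svc = PaymentService()
    key = str(uuid.uuid4())  # client generates one key per logical operation
    first = svc.create_charge(key, 500)
    retry = svc.create_charge(key, 500)  # e.g. after a network timeout
    print(first == retry, len(svc.charges))
```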

3 · GenAI architecture — cheatsheet

The patterns that show up in GenAI FDE system design. This is the differentiating material for the role.

RAG (retrieval-augmented generation) · core FDE pattern

Two pipelines: an offline ingestion pipeline and an online query pipeline. Query flow: query → retrieve → assemble context → generate.
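The online query flow can be sketched end to end with toy components. Bag-of-words cosine similarity stands in for an embedding model plus a vector DB, and `generate` is a hypothetical stub where the real LLM call would go. The prompt wording and `top_k` are illustrative assumptions, and the chunks are assumed to come from the ingestion pipeline.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase term counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank pre-chunked documents by similarity to the query and keep the top_k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]


def assemble_context(query: str, retrieved: list[str]) -> str:
    """Build a grounded prompt: numbered sources plus an instruction to cite or abstain."""
    sources = "\n".join(f"[{i}] {c}" for i, c in enumerate(retrieved, start=1))
    return (
        "Answer using only the sources below and cite them like [1]. "
        "If the answer is not in the sources, say you don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


def generate(prompt: str) -> str:
    """Hypothetical stub for the LLM call (hosted API or self-hosted model)."""
    return f"<LLM answer for a {len(prompt)}-char prompt>"


def answer(query: str, chunks: list[str]) -> str:
    # query -> retrieve -> assemble context -> generate
    return generate(assemble_context(query, retrieve(query, chunks)))


if __name__ == "__main__":
    docs = [
        "Refunds are processed within 5 business days.",
        "Our office is in Berlin.",
        "Refunds require a receipt.",
    ]
    print(retrieve("how long do refunds take", docs))
    print(answer("how long do refunds take", docs))
```

In a real system, each stub is a swap point: a real embedding model plus a vector DB for `embed`/`retrieve`, hybrid search and a re-ranker before `assemble_context`, and the model client in `generate`.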

Ingestion

Retrieval

Eval & failure modes

LLM serving & inference · latency/cost core

Agentic systems

LLMOps & cross-cutting concerns

LLMOps

  • Prompt versioning & management.
  • Eval pipelines: offline + online, LLM-as-judge, A/B.
  • Monitoring: drift, hallucination rate, token/cost, latency.
  • Feedback loop back to prompts/data/model.
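An offline eval pipeline reduced to its core loop: run each prompt version over a fixed labeled set and compare scores before promoting one. This sketch uses exact-match scoring and a caller-supplied `call_model` function. The prompt registry and eval set are hypothetical. Real pipelines often use task-specific metrics or an LLM-as-judge in place of exact match.

```python
from typing import Callable

# Hypothetical prompt registry: version id -> template (prompt versioning).
PROMPTS = {
    "v1": "Answer briefly: {question}",
    "v2": "Answer with a single word, no punctuation: {question}",
}

# Hypothetical labeled eval set: (question, expected answer).
EVAL_SET = [
    ("Capital of France?", "paris"),
    ("Largest planet?", "jupiter"),
]


def exact_match(prediction: str, expected: str) -> float:
    """1.0 if the normalized prediction equals the label, else 0.0."""
    return float(prediction.strip().lower() == expected.strip().lower())


def run_eval(template: str, call_model: Callable[[str], str]) -> float:
    """Mean score of one prompt version over the eval set."""
    scores = [exact_match(call_model(template.format(question=q)), a) for q, a in EVAL_SET]
    return sum(scores) / len(scores)


def compare_versions(call_model: Callable[[str], str]) -> dict[str, float]:
    """Score every registered prompt version with the same model and data."""
    return {version: run_eval(tpl, call_model) for version, tpl in PROMPTS.items()}


if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        # Stand-in for a real model: verbose unless told to answer in one word.
        word = "Paris" if "France" in prompt else "Jupiter"
        return word if "single word" in prompt else f"I think it is {word}."

    print(compare_versions(fake_model))
```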

Cost · latency · safety

  • Cost levers: smaller models, caching, routing, quantization, token budgets.
  • Latency levers: streaming, prefix cache, smaller/distilled models, parallel retrieval.
  • Safety/privacy: input/output filtering, PII handling, jailbreak defense — critical for enterprise FDE (data residency, governance).
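Two of the cost levers, caching and routing, combine naturally in a gateway in front of the model. This is a minimal sketch that assumes an exact-match response cache keyed on the whitespace-normalized prompt, plus a crude prompt-length heuristic for choosing between two hypothetical model tiers. Production routers typically use a classifier or confidence signal, and a shared cache needs TTLs and per-tenant isolation.

```python
import hashlib
from typing import Callable

ModelFn = Callable[[str], str]


class RoutedLLM:
    """Hypothetical gateway: check the response cache, then route by a crude difficulty heuristic."""

    def __init__(self, small: ModelFn, large: ModelFn, long_prompt_chars: int = 500):
        self.small, self.large = small, large
        self.long_prompt_chars = long_prompt_chars  # assumption: length as a proxy for difficulty
        self.cache: dict[str, str] = {}
        self.calls = {"cache": 0, "small": 0, "large": 0}  # simple cost accounting

    def _key(self, prompt: str) -> str:
        # Collapse whitespace so trivially different prompts share a cache entry.
        normalized = " ".join(prompt.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self.cache:
            self.calls["cache"] += 1
            return self.cache[key]
        if len(prompt) < self.long_prompt_chars:
            self.calls["small"] += 1
            result = self.small(prompt)
        else:
            self.calls["large"] += 1
            result = self.large(prompt)
        self.cache[key] = result
        return result


if __name__ == "__main__":
    gw = RoutedLLM(small=lambda p: "small-model answer", large=lambda p: "large-model answer")
    gw.complete("What is our refund policy?")
    gw.complete("What is our  refund policy? ")  # served from cache
    gw.complete("Summarize this contract: " + "clause " * 100)
    print(gw.calls)
```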

4 · The answer framework · drill in practice rounds, not now

Listed for awareness; we'll rehearse this live in the mock rounds near June 1.

Clarify requirements (functional + non-functional, scale numbers) → estimate (QPS/storage) → API sketch → high-level architecture → data model → deep-dive one or two components → bottlenecks & tradeoffs → scale story (10K → millions) with cost/latency.
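The estimate step is plain arithmetic. This worked sketch uses made-up inputs: 100M DAU, 10 reads and 1 write per user per day, 500-byte records, 5-year retention, and peak at 3× average. Swap in the scenario's real numbers. The storage figure is raw data only, before replication and index overhead.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau: int, reads_per_user: float, writes_per_user: float,
             bytes_per_write: int, retention_years: float,
             peak_factor: float = 3.0) -> dict[str, float]:
    """Back-of-envelope QPS and storage from daily active users (all inputs are assumptions)."""
    read_qps = dau * reads_per_user / SECONDS_PER_DAY
    write_qps = dau * writes_per_user / SECONDS_PER_DAY
    storage_bytes = dau * writes_per_user * bytes_per_write * 365 * retention_years
    return {
        "avg_read_qps": read_qps,
        "peak_read_qps": read_qps * peak_factor,
        "avg_write_qps": write_qps,
        "peak_write_qps": write_qps * peak_factor,
        "storage_tb": storage_bytes / 1e12,  # decimal terabytes, raw data only
    }


if __name__ == "__main__":
    for name, value in estimate(100_000_000, 10, 1, 500, 5).items():
        print(f"{name}: {value:,.0f}")
```

With these inputs the sketch gives about 11.6K average / 35K peak read QPS, about 1.2K average write QPS, and about 91 TB over five years. That points to a read-heavy, cache-friendly design.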