A practical overview of challenging real-world system designs. Each design idea includes its purpose, blockers, solutions, intuition, and a popular interview Q&A to help you prepare for high-level interviews or system architecture discussions.
Use this as a cheat sheet or learning reference to guide your system design thinking.
| # | System Design Problem | Intuition & Design Idea | Blockers & Challenges | Solution/Best Practices | Famous Interview Question & Answer |
|---|---|---|---|---|---|
1 | URL Shortening (bit.ly) | Map long URLs to short codes. Store metadata and handle redirection. | High scale, link abuse | Base62-encode a unique ID or truncated hash, Redis cache, rate-limiting | Q: How to avoid collisions in shortened URLs? A: Encode a unique counter ID, or hash the URL, check the DB for duplicates, and retry on conflict. |
2 | Distributed KV Store (Redis) | Store data as key-value pairs across nodes. | Network partitions, consistency | Gossip/Raft protocol, sharding, replication | Q: How to handle Redis master failure? A: Sentinel auto-failover. |
3 | Scalable Social Network (Facebook) | Users interact via posts, likes, comments. Need timeline/feed generation. | Feed generation latency, DB bottlenecks | Precompute feed (fanout), cache timeline | Q: How is news feed generated? A: Fan-out to followers or pull on-demand. |
4 | Recommendation System (Netflix) | Suggest content based on user taste + trends | Cold start, real-time scoring | Use hybrid filtering, vector embeddings | Q: How to solve cold start? A: Use content-based filtering. |
5 | Distributed File System (HDFS) | Break files into blocks, replicate across nodes. | Metadata scaling, file recovery | NameNode for metadata, block replication | Q: How does HDFS ensure fault tolerance? A: 3x replication and heartbeat checks. |
6 | Real-time Messaging (WhatsApp) | Deliver messages instantly, maintain order. | Ordering, delivery failures | Kafka queues, delivery receipts, retries | Q: How to ensure delivery? A: ACK, retry, message status flags. |
7 | Web Crawler (Googlebot) | Crawl web, avoid duplicate/irrelevant content. | URL duplication, crawl efficiency | BFS + filters, politeness policy | Q: How to avoid crawling same URL? A: Normalize + deduplicate with hash. |
8 | Distributed Cache (Memcached) | Keep frequently accessed data in memory, close to the application. | Cache invalidation, stampede | TTL staggering, background refresh | Q: How to handle cache stampede? A: Use a mutex/lock so only one caller rebuilds the entry. |
9 | CDN (Cloudflare) | Serve static assets from edge for low latency. | Cache expiry, geolocation | Use geo-DNS, cache invalidation APIs | Q: How does CDN reduce latency? A: Edge nodes cache closer to user. |
10 | Search Engine (Google) | Index content and rank pages on queries. | Real-time indexing, ranking | MapReduce, inverted index, TF-IDF | Q: How does Google rank pages? A: Relevance + PageRank + freshness. |
11 | Ride-sharing (Uber) | Match drivers to riders using location data. | Geo-search, dynamic pricing | Use GeoHashing, Kafka, ETA predictions | Q: How does Uber find nearby drivers? A: Geo index or R-tree based lookup. |
12 | Video Streaming (YouTube) | Store and stream videos with low buffer. | Encoding, adaptive playback | ABR (adaptive bitrate), chunking, CDN | Q: How to support multiple devices? A: Transcode to multiple formats. |
13 | Food Delivery (Zomato) | Show restaurants, manage orders, track delivery. | ETA accuracy, busy hours | ML models for ETA, real-time maps | Q: How is ETA calculated? A: Based on past data + live traffic. |
14 | Collaborative Docs (Google Docs) | Enable multiple users to edit in real time. | Conflict resolution | Use CRDTs/OT, state sync | Q: How does real-time collaboration work? A: Merge concurrent edits with OT (the approach Google Docs uses) or CRDTs. |
15 | E-Commerce (Amazon) | Sell products, track inventory, handle payments. | Concurrency, pricing errors | Use event sourcing, locking, audit trail | Q: How to handle flash sale? A: Queue requests + inventory locking. |
16 | Marketplace Recommendation | Personalize based on shopping history. | New users, noisy data | Use embeddings, clustering, trending items | Q: How to personalize for new user? A: Use trending/best-selling items. |
17 | Fault-tolerant DB | Ensure consistency + uptime in failures. | Partitioning, network split | Raft/Paxos, quorum reads/writes | Q: CAP theorem real example? A: CP (MongoDB), AP (Cassandra). |
18 | Event System (Twitter) | Send tweets/events to followers in real time. | Fan-out, latency | Kafka, event store, async processing | Q: Push or pull tweets? A: Push for active, pull for passive. |
19 | Photo Sharing (Instagram) | Users upload, view, and like photos. | Storage, metadata | Store media on CDN/S3, DB for metadata | Q: Where are images stored? A: CDN edge, S3 origin. |
20 | Task Scheduler | Schedule and trigger jobs reliably. | Time zone issues, duplicate runs | Use cron with distributed locks | Q: How to ensure a task runs once? A: Use leader election or DB locks. |
🧠 Tips for Developers:
- Always consider scalability (horizontal vs vertical).
- Trade-offs are key: consistency vs availability under partitions (CAP), and latency vs consistency.
- Use queues to decouple services.
- Think about observability: logging, metrics, alerts.
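
The queue tip above can be illustrated with a small in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for a real message broker such as Kafka; the `run_pipeline` function and the `processed:` label are hypothetical names chosen for the example. The point is that the producer only knows about the queue, not about the consumer.

```python
import queue
import threading

# Unique marker telling the consumer to stop; it cannot clash with real items.
_STOP = object()


def run_pipeline(orders: list[str]) -> list[str]:
    """Push orders through a bounded queue to a single consumer thread."""
    q: queue.Queue = queue.Queue(maxsize=100)  # bounded: applies backpressure
    processed: list[str] = []

    def consumer() -> None:
        while True:
            item = q.get()
            try:
                if item is _STOP:
                    return
                # Stand-in for real work (e.g. charging a card, sending email).
                processed.append(f"processed:{item}")
            finally:
                q.task_done()

    worker = threading.Thread(target=consumer)
    worker.start()

    # Producer side: enqueue and move on, without calling the consumer directly.
    for order in orders:
        q.put(order)
    q.put(_STOP)

    q.join()       # wait until every queued item has been handled
    worker.join()
    return processed
```

With a single consumer the items are handled in the order they were enqueued. Adding more consumers would raise throughput but give up that ordering guarantee, which is the same trade-off a partitioned broker exposes.
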
📚 Want to go deeper? Check out:
- "Designing Data-Intensive Applications" by Martin Kleppmann
- System Design Primer (GitHub)
- Grokking the System Design Interview (Educative.io)