When I first heard of Alex Xu’s System Design Interview – An insider’s guide, my reaction was “finally someone wrote a book with deep discussions around scalable systems design”, so I purchased it.

tl;dr: If you are new to systems design and uninitiated in the distributed systems space, this book should be an instant buy for you. If you are seeking answers to how Facebook scales or how APIs like AWS S3 work, or you want to learn the trade-offs around distributed databases and queues, this book might leave you unsatisfied, as it seems focused on getting beginners to succeed in interviews.

My expectations before reading

I was expecting to read a deep discussion around:

  • Challenges around distributed systems

  • Designing for failure, availability modes

  • An established framework for approaching systems design, like:

    • determine requirements and goals
    • back of the napkin estimation
    • determine components and high-level design
    • no SPoF (single point of failure)
    • determine bottlenecks, assess reliability
  • Common problem solving patterns e.g. pull vs push, streaming, async processing, discovery, gossip protocols

  • Scaling and availability tradeoffs of distributed storage systems

  • Distributed data structures that can be applied to problems (CRDTs, Merkle trees)

  • Real world implementations from the field (Maglev, Borg/Kubernetes, DAG engines, Spanner/CockroachDB etc)

Admittedly, I set my expectations perhaps a little too high. I was expecting to find answers to how AWS builds EBS, or how Apple scales APNS.

The book does provide a good problem-solving framework (similar to what I listed above), but I found it to be heavily optimized for interviews rather than for actually learning the craft. Other topics are either covered in a very hand-wavy manner or not at all.

Table of Contents

  • Chapters 1-3: Scaling for millions, Back-of-the-envelope estimations and Framework for design interviews:

    These few chapters are basically Backend Engineering 101, almost to the extent that they don’t assume any prior experience in the space. I got the feeling here that the author is really targeting audiences like frontend/iOS devs, but then it gets more serious at a sharp pace.

  • Chapter 4: Designing a rate limiter

    Not sure why the book moved on from basics directly to this chapter. I personally learned a lot in this chapter in terms of state-of-the-art rate limiting algorithms.
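
    To give a flavor of the problem, here is a minimal sketch of a token bucket, one well-known algorithm in this family. It is my own illustration, not code from the book; the class name, parameters and the explicit `now` argument (used instead of a real clock so the behavior is easy to follow) are all assumptions.

```python
class TokenBucket:
    """Illustrative token bucket: bursts up to `capacity`, refilled over time.

    All names here are hypothetical; time is passed in explicitly (in seconds)
    so the example is deterministic.
    """

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity        # maximum number of tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # the bucket starts full
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = max(self.last_refill, now)
        # Spend tokens if there are enough; otherwise reject the request.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=2, refill_rate=1.0)
burst = [bucket.allow(now=0.0) for _ in range(3)]  # two pass, the third is rejected
later = bucket.allow(now=1.0)                      # one token has been refilled
```

    In a distributed setting the counter would have to live in shared storage, which is where the interesting trade-offs start.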

  • Chapter 5: Consistent hashing

    Appropriately placed in the ToC. This chapter is key to solving many distributed storage problems. I always wondered how ring hashing solves the hot-replica problem; this book taught me about the virtual nodes technique, which helps balance load more evenly when there are only a few actual nodes.
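
    The virtual nodes idea can be sketched roughly as follows. This is my own minimal illustration, not the book's code: the `HashRing` class, the `node#i` naming of virtual nodes and the default of 100 virtual nodes per server are assumptions picked for the example.

```python
import bisect
import hashlib


def _position(key: str) -> int:
    # Map a string to a point on the ring. Any stable, well-spread hash works;
    # SHA-256 is used here only for convenience.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring where each node owns many points."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []       # sorted (position, node) pairs
        self._positions = []  # positions only, kept in step for bisect
        for node in nodes:
            self.add(node)

    def _rebuild(self, ring):
        self._ring = sorted(ring)
        self._positions = [pos for pos, _ in self._ring]

    def add(self, node: str) -> None:
        # Each physical node is placed on the ring `vnodes` times.
        points = [(_position(f"{node}#{i}"), node) for i in range(self.vnodes)]
        self._rebuild(self._ring + points)

    def remove(self, node: str) -> None:
        self._rebuild([(p, n) for p, n in self._ring if n != node])

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # Walk clockwise to the first point after the key, wrapping around.
        idx = bisect.bisect_right(self._positions, _position(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["server-a", "server-b", "server-c"])
owner = ring.lookup("user:42")
```

    Because every server owns many small arcs instead of one big one, removing a server hands its keys to several neighbors rather than dumping them all on a single successor, and keys owned by the remaining servers do not move at all.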

  • Chapter 6: Design a key-value store

    Fasten your seat belts ‘cause you’ve been fooling around. This chapter squeezes the entire distributed systems theory and its concepts into a mere 30 pages.

    It introduces CAP theorem, vector clocks, consensus, read/write replicas, leader election and brings it all together by designing a Redis-like KV store.

    Having taken a semester-long Distributed Systems class in undergrad from one of the best, I was really pumped for this chapter. But it left me unsatisfied, and I assume many of the beginners the author targeted in the early chapters are probably left very confused.

    I recommend splitting the discussion in this chapter over many chapters and going into greater detail while relating the theory to the real-world implementations and trade-offs.

  • Chapter 7: Designing distributed unique ID generator

    While reading the title, I expected to find Twitter Snowflake and I was not mistaken. This question is regarded as one-dimensional and (hopefully) not widely asked out there. As I said earlier, a chapter on these sorts of “auxiliary concepts” for solving distributed systems problems is what the book is lacking, as they individually do not warrant a chapter.
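
    For readers who have not seen it, a Snowflake-style ID packs a millisecond timestamp, a machine ID and a per-millisecond sequence number into one 64-bit integer. Below is a minimal sketch under the commonly described 41/10/12 bit split. The epoch value, the class name and the injectable `clock` are my assumptions for illustration, not Twitter's or the book's code.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch, in milliseconds
MACHINE_BITS = 10             # up to 1024 machines
SEQUENCE_BITS = 12            # up to 4096 IDs per machine per millisecond


class SnowflakeGenerator:
    """Illustrative generator of roughly time-sortable 64-bit IDs."""

    def __init__(self, machine_id: int, clock=None):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id out of range")
        self.machine_id = machine_id
        # `clock` returns the current time in ms; injectable for testing.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & ((1 << SEQUENCE_BITS) - 1)
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (((now - EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self.machine_id << SEQUENCE_BITS)
                    | self._sequence)


def decode(snowflake: int):
    """Split an ID back into (ms since EPOCH_MS, machine_id, sequence)."""
    return (snowflake >> (MACHINE_BITS + SEQUENCE_BITS),
            (snowflake >> SEQUENCE_BITS) & ((1 << MACHINE_BITS) - 1),
            snowflake & ((1 << SEQUENCE_BITS) - 1))
```

    The hard parts an interviewer would probe (clock skew between machines, assigning machine IDs safely) are exactly what this sketch leaves out.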

  • Chapter 8: Designing a URL shortener

    a.k.a. the FizzBuzz of systems interviews. There is good discussion here on how to create the shortest URLs and resolve conflicts. Surprisingly, there is very little discussion of Bloom filters; the book could go into a bit more detail on what guarantees Bloom filters provide and give example use cases.
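
    The guarantee in question is simple to state: a Bloom filter never gives a false negative, but it can give a false positive. A rough sketch of how that falls out of the data structure is below. It is my own example with assumed sizes and names (`BloomFilter`, `might_contain`), and the double-hashing trick is just one convenient way to derive several bit positions from a single digest.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter: 'definitely absent' or 'possibly present'."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from one SHA-256 digest
        # (double hashing: h1 + i * h2).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means the item was never added. True means it may have been:
        # other items could have set the same bits.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add("https://example.com/a")
hit = seen.might_contain("https://example.com/a")  # always True once added
```

    In a URL shortener, that makes it a cheap pre-check before hitting the database: if the filter says a short code is absent, it is safe to use without a lookup, and only a "possibly present" answer needs the real query.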

  • Chapter 9: Designing a web crawler

    I wish this chapter, beyond the high-level design, had gone into scheduling, prioritization, fault tolerance and QoS in a distributed environment. Instead, it focuses on long-tail aspects of a crawler, such as detecting webspam and eliminating redundant content, which are not interesting from a system design perspective.

    Overall, the book lacks a solid discussion of message/job queues: delivery modes, acks/retries, dead-letter queues, and the trade-offs around ordering and exactly-once delivery.

  • Chapter 10: Design a notification system

    Came to hear how Apple APNS or Google GCM/FCM is built to deliver billions of messages a day. Instead, found a question around how a startup might send emails, SMS etc using separate message queues. (Ditto the lack of queue trade-off discussions here.)

  • Chapter 11: Design a newsfeed/timeline

    a.k.a. the cornerstone of any system design interview loop. There’s solid discussion in this chapter. It could be even better if it touched on database normalization and FB-style aggressive caching.

    Something I still don’t know about timelines is how FB/IG shows my friends’ names first in a list of tens of millions of likes.

  • Chapter 12: Design a chat system

    Covers the problem fairly by focusing on different aspects (e.g. Telegram/WhatsApp style as well as IRC-style/websockets). It talks about why room-size limitations exist in some chat apps, but doesn’t go into eliminating those limitations through techniques like fanout servers (e.g. how Twitch uses IRC under the covers but develops layered WebSockets listeners on top of that to stream to a million viewers simultaneously). This would be a great addition to the book IMO.

  • Chapter 13: Designing search autocomplete service

    Overall good chapter but the two problems (collecting data vs serving) should probably have been different chapters. Talks about building distributed read-only tries, but hand-wavy about how to change the underlying data structure while still serving traffic.

  • Chapter 14: Design YouTube

    The chapter focuses on introducing the “workflow engine” model and using DAGs to create processing pipelines, in this case for videos. I didn’t learn much from this part.

    Then, it focuses on the networking part, such as sending the uploader to a closer server (but doesn’t explain how this is achieved, e.g. through geo DNS or anycast IPs/ECMP). Then it talks about more hand-wavy stuff like adaptive streaming and DRM (which are not relevant).

    On a positive note, instead of slapping CDN on top, it talks about long-tail optimizations like co-lo with ISPs like Netflix Open Connect.

  • Chapter 15: Design Google Drive

    Overall quite an underwhelming chapter, but it provides excellent flow diagrams around file syncing between clients, which is the meat of the question.

    I expected more advanced discussion around how block storage systems work (the book just recommends using an off-the-shelf one) in terms of block alignment, and relating this back to files, such as calculating diffs (e.g. how would you detect and handle a single byte change in a 5 GB file on Google Drive) and de-duplication in practice (it’s mentioned in a hand-wavy way).
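
    To make the single-byte-change question concrete, here is a rough sketch of the simplest approach I know of: split the file into fixed-size blocks, hash each block, and re-upload only the blocks whose hash changed. The block size and function names are my assumptions, not the book's or Google Drive's design.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # hypothetical; real systems typically use much larger blocks


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash every fixed-size block of the file content."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]


def changed_blocks(old: List[str], new: List[str]) -> List[int]:
    """Indexes of blocks in `new` that differ from, or do not exist in, `old`."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]


original = bytes(range(256)) * 64   # 16 KiB made of four identical 4 KiB blocks
old = block_hashes(original)

# Overwrite one byte in place: only the block containing it needs re-uploading.
edited = bytearray(original)
edited[5000] ^= 0xFF
in_place = changed_blocks(old, block_hashes(bytes(edited)))        # [1]

# Insert one byte at the front: every fixed-size block shifts and changes.
shifted = changed_blocks(old, block_hashes(b"x" + original))       # [0, 1, 2, 3, 4]

# De-duplication: identical blocks share a hash, so each is stored only once.
unique_blocks = len(set(old))                                      # 1
```

    The insertion case is why fixed-size blocks alone are not the whole story: content-defined chunking or rsync-style rolling hashes exist to keep block boundaries stable when data shifts, and that is the kind of discussion I was hoping for.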

Conclusion

Overall, great book for those uninitiated in the space. I had very high expectations as I was expecting to get the academic fundamentals and then relate them to real-world implementations from the industry.

I also didn’t read it with the purpose of getting better at interviewing; rather, I wanted to get visibility into:

  • storage system trade-offs, especially around CAP theorem and serializability
  • building blocks like distributed queues and reliability trade-offs around them (as I mentioned above, this is really lacking in the book)
  • state-of-the-art data structures (like CRDTs)
  • discovery, load balancing and health checks in a dynamic environment
  • how to look for failure modes and address them
  • replication techniques (e.g. log-based storage systems)

but the book fell short in these areas. I really recommend Amazon Builders' Library as a learning source for these areas. On top of that, keep a close eye on Adrian Colyer’s The Morning Paper which explains complex papers (not always relevant) in understandable terms.

Hope you buy the book. I enjoyed following along with it despite the shortcomings and added a few more tricks to my belt. Cheers.