Is Designing Data-Intensive Applications Worth Reading?

January 22, 2025

I recently finished reading Designing Data-Intensive Applications (DDIA) by Martin Kleppmann. It was a long read, and I am relieved to have finally finished it cover to cover. The book has generated a lot of buzz online, so I want to share my impressions and whether it is worth reading.

A picture of a physical copy of DDIA
My physical copy of the book.

Summary

Part I: Foundations of Data Systems

The opening chapters serve as the conceptual foundation of the book, which is referenced repeatedly in later sections. The first chapter covers the three main themes of the book: reliability, scalability, and maintainability. Each is an important factor to consider when designing any serious application.

The later chapters cover:

  1. Data models and their query languages: SQL, NoSQL, and graph data models.
  2. Different ways databases store data: SSTables, B-trees, OLTP vs OLAP, row- vs column-oriented databases.
  3. Ways of encoding data and dataflow: JSON, Thrift, and Avro; REST vs RPC; message-passing dataflow.
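The SSTable idea from the storage chapter is simple enough to sketch. Below is my own toy illustration (not code from the book): a segment keeps key-value pairs sorted by key, so lookups can use binary search and segments can later be merged like in mergesort.

```python
import bisect

class SSTableSegment:
    """Toy immutable SSTable segment: keys stored in sorted order."""

    def __init__(self, items):
        # items: a dict of key -> value, frozen into sorted arrays
        self.keys = sorted(items)
        self.values = [items[k] for k in self.keys]

    def get(self, key):
        # Binary search over the sorted keys
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

seg = SSTableSegment({"banana": 2, "apple": 1, "cherry": 3})
print(seg.get("banana"))  # 2
print(seg.get("durian"))  # None (would fall through to older segments)
```

A real LSM-tree storage engine adds a write buffer (memtable), background compaction, and Bloom filters to skip segments that cannot contain a key, but the sorted-segment core is the same.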

Part II: Distributed Data

The second part is the crux of DDIA. We start with various methods for replicating and partitioning data. Later, transactions and distributed transactions (consensus algorithms) are discussed in detail.

Kleppmann makes it clear that in distributed, concurrent settings, everything can go wrong: replication lag, network faults, and node outages plague distributed data systems, and guarantees like linearizability are expensive.

Key concepts I learned:

  1. Leader, multi-leader, and leaderless replication; consistency, eventual consistency, read-your-own-writes, and monotonic reads
  2. Partitioning: distributing a database table evenly across many nodes; rebalancing algorithms; request routing
  3. Transactions and ACID; isolation levels: read committed, read skew, write skew, snapshot isolation, serializability, SSI
  4. Network faults, clock skew, and system models
  5. Linearizability vs serializability, ordering guarantees, total order broadcast, consensus algorithms; leader replication vs Lamport timestamps vs 2PC

Part III: Derived Data

The last section is about data integration: how to combine data from different systems. In complex systems, there is a source of truth, such as the main database, and derived data systems like recommendation systems, search indices, or caches depend on those records.

The book explores batch processing, stream processing, and alternative dataflows (as opposed to RESTful request/response applications). The common theme is that idempotent, immutable, self-contained functions are useful for fault tolerance and explainability, which is the core insight behind the Unix philosophy and MapReduce. Important topics included:

  1. Hadoop vs MPP, fault tolerance, workflows
  2. Stream windows, change data capture, event sourcing
  3. Unbundling databases, end-to-end argument
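The idempotence point is worth a concrete sketch (my own toy example, not from the book): if a stream processor may redeliver events after a crash, applying an event twice must give the same result as applying it once.

```python
def apply_event(state, event):
    """Apply one event to an immutable copy of state.

    Idempotent: setting a key to a value can be retried safely.
    A non-idempotent operation like "increment by 1" would
    double-count if the event were redelivered after a crash.
    """
    new_state = dict(state)  # leave the input untouched
    new_state[event["key"]] = event["value"]
    return new_state

state = {}
event = {"key": "user:1:email", "value": "a@example.com"}
once = apply_event(state, event)
twice = apply_event(once, event)  # simulated redelivery
assert once == twice  # at-least-once delivery becomes effectively exactly-once
```

Real systems get the same effect with deduplication IDs or transactional offsets, but designing the operations themselves to be idempotent is the cheapest form of fault tolerance.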

Takeaways

Overall, I give this book four stars. The book is well written and Kleppmann is an excellent expositor. I loved that the author walks through the naive approaches a newcomer might try, and their pitfalls, to motivate the definitions and concepts. The thought process behind arriving at the proven algorithms we use today is something that is often overlooked.

Even as an undergraduate, I found many topics in the book relatable. For instance, during my internship at Ramp, I had to deal with schemas, different data models, denormalization, etc. Admittedly, I will probably never need to directly apply knowledge like Lamport timestamps or serializable snapshot isolation unless I become a database engineer.

Ultimately, I think this book is invaluable for inexperienced programmers because it teaches us how to think critically about large systems, how to weigh the pros and cons of different approaches, and how to be careful in distributed settings. Data races are extremely common, and it is important for software developers to be able to identify them, because we can't always rely on databases to do it for us.