CAP Theorem Explained

When building large-scale software systems today, you have to make tradeoffs.  You can't have an ACID-compliant data store with infinite storage/throughput/connections that's always available in every part of the world, with super low latency, where clients can read/write concurrently without any risk of inconsistencies, all for free.  If you could, the problem would be solved and our industry could go build spaceships at SpaceX or retire and make sourdough every Sunday.


Instead, we need to make tradeoffs.  Does our product/system need ACID semantics?  Is latency more important?  Can we allow certain types of data inconsistencies for a short time in favor of availability?  How much are we able to spend so that we don't have to sacrifice as much?

These are some questions that everyone building a large-scale software system has to grapple with in the design phase.  A great way to begin your thinking is with the CAP Theorem - or at least with what it has slowly crystallized into over the years.

CAP Theorem was first proposed in 2000 and proven a couple of years later.  CAP stands for Consistency, Availability, and Partition tolerance.  Initially, it was thought that you had to choose 2 out of the 3 in a distributed system - and that was your tradeoff.  The initial definition of Availability was that any node in the system that is not considered to be down must respond to every read/write request it receives.  Partition tolerance meant that if a network fault caused some nodes to no longer be able to reach each other - even though clients could still connect to at least one node - the system would still work as expected.  And Consistency meant that every read receives the latest write and that writes are never lost.

Practically speaking (and NOT theoretically), Partition tolerance is a must-have.  Any real-world network will have faults that cause partitions, and they will happen often - if the network is large enough, there is a partition somewhere at any given moment.  If the entire system stopped working because of a partition, a distributed system at a large enough scale would be down all the time.

So it's not really a choice of 2 out of 3 - Partition tolerance is a must, which leaves only a choice between Consistency and Availability.  CP or AP - which one fits your use case?
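To make the choice concrete, here's a minimal Go sketch - entirely illustrative, with made-up types, not taken from any real system - of how a single node might answer a read when it suspects it has been partitioned away from its peers.  A CP-leaning node refuses to answer rather than risk returning stale data; an AP-leaning node answers from its local copy and accepts that the value may be out of date.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrPartitioned signals that this node can't reach enough peers
// to guarantee it has the latest value.
var ErrPartitioned = errors.New("node is partitioned from its peers")

type Node struct {
	local       map[string]string // this node's local (possibly stale) copy
	partitioned bool              // true when the node can't reach its peers
	preferCP    bool              // true = consistency over availability
}

// Get shows the CP vs AP choice at read time.
func (n *Node) Get(key string) (string, error) {
	if n.partitioned && n.preferCP {
		// CP: refuse to answer rather than risk serving stale data.
		return "", ErrPartitioned
	}
	// AP: answer from the local copy, even if it might be stale.
	return n.local[key], nil
}

func main() {
	n := &Node{local: map[string]string{"headline": "old headline"}, partitioned: true}

	n.preferCP = true
	if _, err := n.Get("headline"); err != nil {
		fmt.Println("CP node:", err) // unavailable, but never wrong
	}

	n.preferCP = false
	v, _ := n.Get("headline")
	fmt.Println("AP node:", v) // available, but possibly stale
}
```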

Of course, like all the juiciest problems, it's not a black-and-white tradeoff.  It's more of a spectrum.  Your system might be able to handle some forms of inconsistency for short periods of time - if it's a news site and updates to an article take a few minutes to slowly propagate to each user-facing node in the system, it's probably tolerable as long as all the nodes get updated eventually.
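Here's a rough sketch of what "eventually" looks like in the news-site case (hypothetical names, no particular product in mind): the editor's update is acknowledged right away and then pushed to the user-facing replicas asynchronously, so until the push completes, different readers may see different versions of the article.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// replica is one user-facing node holding its own copy of an article.
type replica struct {
	mu      sync.Mutex
	article string
}

func (r *replica) read() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.article
}

func (r *replica) apply(v string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.article = v
}

// publish acknowledges the write immediately and propagates it to the
// replicas in the background - readers may briefly see the old version.
func publish(replicas []*replica, v string) {
	for _, r := range replicas {
		go func(r *replica) {
			time.Sleep(50 * time.Millisecond) // simulated replication lag
			r.apply(v)
		}(r)
	}
}

func main() {
	replicas := []*replica{{article: "v1"}, {article: "v1"}, {article: "v1"}}

	publish(replicas, "v2")
	fmt.Println("right after publish:", replicas[0].read()) // likely still "v1"

	time.Sleep(100 * time.Millisecond)
	fmt.Println("a moment later:", replicas[0].read()) // "v2" everywhere, eventually
}
```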

You might also want to accept some unavailability because you can NEVER accept inconsistencies - say you're a replicated storage system for user data and you're using quorum writes: if a majority of the replica servers are down, you might choose to let writes fail until one of them comes back so that user data doesn't get corrupted, while reads continue to be served from the remaining live servers.
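A rough sketch of that quorum rule (again illustrative, with made-up types - real systems like Cassandra or Dynamo-style stores handle far more): a write is accepted only if a majority of replicas can acknowledge it, so when too many replicas are down the write fails loudly instead of letting copies silently diverge, while reads are still served from whatever replicas remain up.

```go
package main

import (
	"errors"
	"fmt"
)

var ErrNoQuorum = errors.New("write rejected: quorum of replicas unavailable")

type Replica struct {
	up   bool
	data map[string]string
}

type Cluster struct {
	replicas []*Replica
}

// Write succeeds only if a majority of replicas are up to acknowledge it.
func (c *Cluster) Write(key, value string) error {
	quorum := len(c.replicas)/2 + 1
	live := 0
	for _, r := range c.replicas {
		if r.up {
			live++
		}
	}
	if live < quorum {
		// Too many replicas are down: fail the write rather than risk
		// diverging copies of the user's data.
		return ErrNoQuorum
	}
	for _, r := range c.replicas {
		if r.up {
			r.data[key] = value
		}
	}
	return nil
}

// Read is still served from any live replica, even without a quorum.
func (c *Cluster) Read(key string) (string, bool) {
	for _, r := range c.replicas {
		if r.up {
			v, ok := r.data[key]
			return v, ok
		}
	}
	return "", false
}

func main() {
	c := &Cluster{replicas: []*Replica{
		{up: true, data: map[string]string{"user:1": "alice"}},
		{up: false, data: map[string]string{"user:1": "alice"}},
		{up: false, data: map[string]string{"user:1": "alice"}},
	}}

	if err := c.Write("user:1", "alice-updated"); err != nil {
		fmt.Println(err) // write fails: only 1 of 3 replicas is up
	}
	if v, ok := c.Read("user:1"); ok {
		fmt.Println("read still works:", v)
	}
}
```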

It's a spectrum - and this is the best way to start thinking about building a large-scale software system.  There's a lot of knowledge out there to help you understand all the points on that spectrum - I'll write about the different types of consistency in a future post.



The CP and AP tradeoff is well documented in two important texts on the subject: Designing Data-Intensive Applications by Martin Kleppmann and NoSQL Distilled by Pramod J. Sadalage and Martin Fowler.
