SW engineering, engineering management and the business of software

subscribe for more
stuff like this:

CRDTs in a Nutshell

Note: This post is a updated, expanded rewrite of an older post from 2012. I have removed references to older technology, but the concepts still stand.

Understanding CRDTs is fairly straightforward if you have some concepts clear ahead of time.

First, you have to be clear on two different types of possible contention with distributed data stores:

  1. Simultaneous Read and Write
  2. Simultaneous Write and Write

In case 1, If you are writing a value to a key, that value needs to be replicated to a few other nodes. This is where the read quorum (for fetches) and write quorum (for stores) come into play: While reading or writing, I must have X number of nodes agree on what the correct value is. You can set X to be all the nodes to get a strong certainty of getting the most recent value. If X is 1 then you might give up on getting the latest value in exchange for some improvement to latency. The typical, balanced – and often default solution – is to set X to be a simple majority of nodes.

In case 2, the write quorum has no practical use. If you have a 10-node cluster, and some client writes “flub” to nodeA and some other client writes “biggle” to nodeF, then we have something called write contention or siblings. How do you decide who wins? This is where sibling resolution comes into play. There are many, many strategies for this. The simplest involve last-write-wins (which is a good way to lose data if you have a counter). If you’ve ever seen a file in your Dropbox called SOME_TITLE (Matt Nunogawa's conflicted copy 2020-07-12) this is an example of siblig resolutions I like to call last-write-wins-but-save-a-copy-of-the-loser.

What happens in practice is that you cannot think about a distributed data store in terms of sets and gets. A better mental model is more like an operation log.

A counter works well in a distributed system if you are only adding. Each client simply tags his increment with some arbitrary but unique client ID. The operation is not get x, then set x+1, but rather counterID:ABC:clientID:gamma:count:+1. Since it’s add-only, if you have multiple counters incrementing at once, they will only modify their own clientID entry. If client alpha adds 1 to nodeA while client gamma adds 1, 1 and 1 to nodeF, you would see something like this:

[
    counterID:ABC:clientID:alpha:count:1
    counterID:ABC:clientID:gamma:count:3
]

or posisbly this:

[
    counterID:ABC:clientID:alpha:count:1
    counterID:ABC:clientID:gamma:count:1
    counterID:ABC:clientID:gamma:count:2
    counterID:ABC:clientID:gamma:count:3
]

To get the total count of the ABC counter, simply add up all the counts, taking the highest value per clientID. Both of the above examples resolve to 4. You could eventually whip up some garbage collection to clean up older entries if space is something you care about.

If client gamma partitions off the network? It’s just going to keep accumulating counts from is local clients. The short term totals are off but hopefully the system knows it is partitioned, but when it connects back, doesn’t matter what happend locally or on client alpha, the total counts should again be accurate assuming eventual consistency.

Add-only-counters are great but if you need counters that go up and down, you have to get a little clever. If you keep two counters, posABC and negABC and subtract, you effectively get a couter that can go in both directions.

And now we have Conflict-free Replicated Data Types:

CRDTs can be thought of as primitive, resolvable operation logs that can be composed into useful data types.

Now the implications are interesting, because if your distributed datastore can do key-value, and is eventually consistent, you can hypothetically develop CRDTs on top of it.

The above example is an over-simplification. In a real world operation, you would do lots and lots of finicky housekeeping around compression of the log as high-volume writes start to get expensive in terms of storage. But hopefully the concept is clearer.

CRDTS are exciting (in the way that only useful mathematical properties are). The add-only-counter and the add-only-set versions of these are relatively straight-forward. Some very smart people have created CRDTs for more complex data types like lists, maps and even directed graphs. Searching for CRDTs should bring you to the work of Marc Shapiro.

Very recently, Martin Kleppmann published a CRDT talk on recent advancements in this area as well.

But if you really want to learn about all the edge cases, finicky bits, etc. there’s no substitute for implementing a counter in a three node distributed datastore and hammering all three nodes with writes. Odd cookie that I am, I found it to be a great learning experience.



in lieu of comments, you should follow me on bluesky at @amattn.com and on twitch.tv at twitch.tv/amattn. I'm happy to chat about content here anytime.


the fine print:
aboutarchivemastodonblueskytwitchconsulting or speaking inquiries
© matt nunogawa 2010 - 2023 / all rights reserved