SW engineering, engineering management and the business of software
Every time I talk to someone about Riak, I mention about how difficult it was to get distributed counters working. Then I mention about how I ended up implementing a impoverished man’s version of CRDTs (Conflict-free Replicated Data Types).
The usual response is along the lines but doesn’t Riak solve that with their read/write quorum functionality?
The answer is no.
This is a surprising answer to a common misperception of Riak. So common that at my very first Riak meetup, I asked a variation of the same question.
In a distributed data store such as Riak, you have two basic kinds of inconsistencies that require resolution:
In case 1, If you are writing a value to a key, that value needs to be replicated to a few other nodes. This is where the read quorum (for fetches) and write quorum (for stores) come into play: While reading or writing, I must have X number of nodes agree on what the correct value is. You can set X to be all the nodes to get a strong certainty of getting the most recent value. If X is 1 then you might give up on getting the latest value in exchange for some improvement to latency. The typical, balanced and default solution is to set X to be a simple majority of nodes.
In case 2, the write quorum has no practical use. If you have a 10-node cluster, and some client writes “flub” to nodeA and some other client writes “biggle” to nodeF, then we have something called siblings. How do you decide who wins? This is where sibling resolution comes into play. There are many, many strategies for this. The simplest involve last-write-wins (which is a good way to lose data if you have a counter).
What happens in practice is that you cannot think about a distributed data store in terms of sets and gets. You need to approach it more like an operation log. A counter works well in a distributed system if you are only adding. Each client simply tags his increment with some arbitrary but unique client ID. The operation is not get x, then set x+1, but rather
counterID:ABC:clientID:XYZ:count:+1. Since it’s add-only, if you have multiple counters incrementing at once, they will only modify their own clientID entry. If client
PRQ adds 1 to nodeA while client
XYZ adds 1, 1 and 1 to nodeF, you would see something like this:
[ counterID:ABC:clientID:PRQ:count:1 counterID:ABC:clientID:XYZ:count:3 ]
or posisbly this:
[ counterID:ABC:clientID:PRQ:count:1 counterID:ABC:clientID:XYZ:count:1 counterID:ABC:clientID:XYZ:count:2 counterID:ABC:clientID:XYZ:count:3 ]
To get the total count of the
ABC counter, simply add up all the counts, taking the highest value per clientID. Both of the above examples resolve to 4. You could eventually whip up some garbage collection to clean up older entries if space is something you care about.
Add-only-counters are great but if you need counters that go up and down, you have to get a little clever. If you keep two counters,
negABC and subtract, you effectively get a couter that can go in both directions.
And now we have Conflict-free Replicated Data Types:
CRDTs can be thought of as primitive, resolvable operation logs that can be composed into useful data types.
In a real world operation, you would do lots and lots of finicky housekeeping around compression of the log as high-volume writes start to get expensive in terms of storage. But hopefully the concept is clearer.
CRDTS are exciting (in the way that only useful mathematical properties are). The add-only-counter and the add-only-set versions of these are relatively straight-forward. Some very smart people have created CRDTs for more complex data types like lists, maps and even directed graphs. Searching for CRDTs should bring you to the work of Marc Shapiro.
If you are patient, the very smart people at Basho are working to integrate CRDTs into Riak itself.
But if you really want to learn this stuff go implement a multi-node, write-heavy counter in a three node Riak cluster (make sure allow_mult=true). I found it to be a great learning experience.