Start now
Feb 23, 2010 nosql.mypopescu.com
There have been confirmed rumors[1] about Twitter planning to use Cassandra for a long time. But except the mentioned post, I couldn’t find any other references.
Twitter is fun by itself and we all know that NoSQL projects love Twitter. So, imagine how excited I was when after posting about Cassandra 0.5.0 release, I received a short email from Ryan King, the lead of Cassandra efforts at Twitter simply saying that he would be glad to talk about these efforts.
So without further ado, here is the conversation I had with Ryan King (@rk) about Cassandra usage at Twitter:
MyNoSQL: Can you please start by stating the problem that lead you to look into NoSQL?
Ryan King: We have a lot of data, the growth factor in that data is huge and the rate of growth is accelerating.
We have a system in place based on shared mysql + memcache but its quickly becoming prohibitively costly (in terms of manpower) to operate. We need a system that can grow in a more automated fashion and be highly available.
MyNoSQL: I imagine you’ve investigated many possible approaches, so what are the major solutions that you have considered?
Ryan King:
MyNoSQL: What kind of tests have you run to evaluate these systems?
Ryan King: We first evaluated them on their architectures by asking many questions along the lines of:
Asking these questions narrowed down our choices dramatically. Everything but Cassandra was ruled out by those questions. Given that it seemed to be our best choice, we went about testing its functionality (“can we reasonably model our data in this system?”) and load testing.
The load testing mostly focused on the write-path. In the medium/long term we’d like to be able to run without a cache in front of Cassandra, but for now we have plenty of memcache capacity and experience with scaling traffic that way.
MyNoSQL: If you draw a line, what were the top reasons for going with Cassandra?
Ryan King:
MyNoSQL: Will Cassandra completely replace the current solution?
Ryan King: Over time, yes. We’re currently moving our largest (and most painful to maintain) table — the statuses table, which contains all tweets and retweets. After this we’ll start putting some new projects on Cassandra and migrating other tables.
MyNoSQL: How do you plan to migrate existing data?
Ryan King: We have a nice system for dynamically controlling features on our site. We commonly use this to roll out new features incrementally across our user base. We use the same system for rolling out new infrastructure.
So to roll out the new data store we do this:
Eventually we get to a point where we’re doing 100% doubling of our writes and comfortable that we’re going to stay there. Then we:
Run an importer that imports the data to cassandra
Some side notes here about importing. We were originally trying to use the BinaryMemtable[2] interface, but we actually found it to be too fast — it would saturate the backplane of our network. We’ve switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster.
Once the data is imported we start turning on real read traffic to Cassandra (in parallel to the mysql traffic), again by user groups and percentages.
A philosophical note here — our process for rolling out new major infrastructure can be summed up as “integrate first, then iterate”. We try to get new systems integrated into the application code base as early in their development as possible (but likely only activated for a small number of people). This allows us to iterate on many fronts in parallel: design, engineering, operations, etc.
MyNoSQL: Please include anything I’ve missed.
Ryan King: I can’t really think of anything else.
MyNoSQL: Thank you very much!
BinaryMemtable to load preserialized data. (↩)