Migrating from RavenDB to Cassandra

Today on the MarkedUp Analytics Blog I authored a post entitled “Cassandra, Hive, and Hadoop: How We Picked Our Analytics Stack.”

In it I explain MarkedUp’s evaluation process for choosing a new database, how we selected Cassandra, and some benchmarks from our test. If you want to learn more, go read it.

I’ve written about RavenDB in the past on and I’ve spoken about it at code camp before. I think it’s a great technology for prototyping and it has some really interesting concepts not found in any other database that make Raven really simple and elegant to operate.

So I wanted to copy over a section from our blog post about building MarkedUp on Cassandra which explains why we ultimately needed to move away from RavenDB:

Looking back to what went wrong with RavenDB, we determined that it was fundamentally flawed in the following ways:

  • Raven’s indexing system is very expensive on disk, which makes it difficult to scale vertically – even on SSDs Raven’s indexing system would keep indexes stale by as much as three or four days;
  • Raven’s map/reduce system requires re-aggregation once it’s written by our data collection API, which works great at low volumes but scales at an inverted ratio to data growth – the more people using us, the worse the performance gets for everyone;
  • Raven’s sharding system is really more of a hack at the client level which marries your network topology to your data, which is a really bad design choice – it literally appends the ID of your server to all document identifiers;
  • Raven’s sharding system actually makes read performance on indices orders of magnitude worse (has to hit every server in the cluster on every request to an index) and doesn’t alleviate any issues with writing to indexes – no benefit there;
  • Raven’s map/reduce pipeline was too simplistic, which stopped us from being able to do some more in-depth queries that we wanted; and
  • We had to figure out everything related to RavenDB on our own – we even had to write our own backup software and our own indexing-building tool for RavenDB; there’s very little in the way of a RavenDB ecosystem.

So based on all of this, we decided that our next database system needed to be capable of:

  1. Integrating with Hadoop and the Hadoop ecosystem, so we could get more powerful map/reduce capabilities;
  2. “Linear” hardware scale – make it easy for us to increase our service’s capacity with better / more hardware;
  3. Aggregate-on-write – eliminate the need to constantly iterate over our data set;
  4. Utilizing higher I/O – it’s difficult to get RavenDB to move any of its I/O to memory, hence why it’s so hard on disk;
  5. Fast setup time – need to be able to move quickly;
  6. Great ecosystem support – we don’t want to be the biggest company using whatever database we pick next.

So there you have it. If you want to learn more about our migration to Cassandra, check out our original blog post.

Discussion, links, and tweets

I'm the CTO and founder of Petabridge, where I'm making distributed programming for .NET developers easy by working on Akka.NET, Phobos, and more..