Cassandra looks really impressive. I have heard from many organizations, though, that it is an operational burden. Clearly Netflix has either figured out the operational problems or decided to live with them. Does anyone have any experiences and opinions on this?
I joke that Cassandra is a great “write-only” database.
This is because, although it has a very interesting distributed design and a flexible data model (especially these days with CQL3), it does not actually know how to do the kinds of queries that you typically associate with databases.
You essentially can only read data back exactly as you write the data. This even applies to basic operations like filtering and sorting. The only form of efficient filtering is the “range slice”, which can fit some time series scenarios; and the only form of efficient sorting is “clustering order”, which supports sorting only in one direction per table. It is even less query-flexible in this respect than Redis, which at least has some interesting data structures like sorted sets that can support multiple query operations.
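To make the "read it back as you wrote it" point concrete, here is a toy in-memory model in Python. It is only a sketch and not how Cassandra is implemented: `ToyTable`, `write` and `range_slice` are made-up names, and the sensor data is invented for illustration. Rows are placed in clustering order when they are written, so the one cheap read is a contiguous slice of a single partition.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Hypothetical stand-in for one table: partition key -> rows sorted by clustering key."""

    def __init__(self):
        # Each partition holds (clustering_key, value) tuples kept in sorted order.
        self._partitions = defaultdict(list)

    def write(self, partition_key, clustering_key, value):
        # Insert in clustering order at write time, so reads never need to sort.
        bisect.insort(self._partitions[partition_key], (clustering_key, value))

    def range_slice(self, partition_key, start, end):
        """Return rows with start <= clustering_key < end, in stored order."""
        rows = self._partitions.get(partition_key, [])
        keys = [ck for ck, _ in rows]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        return rows[lo:hi]


readings = ToyTable()
readings.write("sensor-1", 30, 21.5)
readings.write("sensor-1", 10, 20.0)
readings.write("sensor-1", 20, 20.7)
readings.write("sensor-2", 15, 18.2)

# Cheap: a contiguous slice of one partition, already in clustering order.
print(readings.range_slice("sensor-1", 10, 30))  # [(10, 20.0), (20, 20.7)]
```

Anything else, such as "all readings above 21 across every sensor" or "sensor-1 sorted by value", has no matching layout in this model and would require scanning every row.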
This means that people who use Cassandra as their primary data store do one of two things. They either write data in a lot of different ways representing all the query scenarios to separate tables or rows or columns. This can be a maintenance nightmare or a waste of resources in many cases. Or they treat Cassandra as a “staging area” for data and index it elsewhere to get query flexibility (e.g. SQL database, MongoDB, ElasticSearch, Solr, etc.). This seems like the best use of it currently. Basically you use Cassandra to handle your “durable, distributed, real-time write path” and then use some other database to handle your “flexible, possibly delayed, read path”.
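The second option, a write path feeding a separate and possibly delayed read path, might look roughly like the Python sketch below. Both classes are hypothetical in-memory stand-ins (`StagingStore` for the Cassandra side, `ReadIndex` for whatever search or SQL store does the indexing); no real client library is used and the events are invented.

```python
from collections import defaultdict


class StagingStore:
    """Hypothetical stand-in for the write path: an append-only list of events."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)
        return len(self.events) - 1  # offset of the stored event


class ReadIndex:
    """Hypothetical stand-in for the read path: indexes events by every field."""

    def __init__(self):
        self._by_field = defaultdict(lambda: defaultdict(list))
        self.next_offset = 0  # first staged event not yet indexed

    def catch_up(self, staging):
        # Index everything written since the last run; reads may lag behind writes.
        for event in staging.events[self.next_offset:]:
            for field, value in event.items():
                self._by_field[field][value].append(event)
        self.next_offset = len(staging.events)

    def find(self, field, value):
        return list(self._by_field.get(field, {}).get(value, []))


staging = StagingStore()
index = ReadIndex()
staging.append({"user": "ann", "action": "login"})
staging.append({"user": "bob", "action": "login"})
index.catch_up(staging)
staging.append({"user": "ann", "action": "logout"})

print(len(index.find("user", "ann")))  # 1: the logout is staged but not yet indexed
index.catch_up(staging)
print(len(index.find("user", "ann")))  # 2
```

The trade-off is visible in the two prints: writes are accepted immediately, while queries only see what the indexer has caught up on.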
I’m a heavy user of Riak and have similar experiences, as one would expect. In that sense I use Riak just for the high availability and have another database as the actual system of record. Mostly this is because it’s hard to explore one’s data inside Riak.
But my question is less about the operations one can apply to the database and more about how to operate the database. Riak, for example, is pretty operator friendly. One has to do very little to set it up and maintain it. I have heard the opposite of Cassandra and am looking for some real-world-experience reports.