Cassandra data modelling is a bit of a minefield: there are more ways to get it wrong than to get it right. I've found very few people who can model data in it correctly, and I'm not one of them either. After re-reading everything available about data modelling in Cassandra, I'm still not sure how to cover all the use cases.
I once had the displeasure of building a smallish Cassandra cluster that did 1 million inserts/s. It was not a fun experience. 0/10, would not recommend.
My biggest takeaway is that it was incredibly easy to get Cassandra into a bad state (GCs that never finish, OOM errors) just by sending it traffic. This was a stark contrast to other systems I’ve worked with, where overload results in high latency and backpressure and generally resolves once the load is withdrawn. At the time my recommendation would have been: no really, don’t use Cassandra unless you’re absolutely sure you need it and nothing else will work. For the love of all that’s holy, don’t use it for your account database because “we want to be able to handle the scale some day”. For companies that need certain types of horizontally scalable writes, don’t have strong consistency requirements, and are willing to spend a lot of people-time on the problem, it can be an option.
What would I use instead these days? Depends on the use case. I’d use ClickHouse if I could shoehorn it in. GCP BigQuery is great for some types of high-latency analytics. Vitess or sharded MySQL if I needed something horizontally scalable but relational and transactional. Maybe Cloud Spanner. I’d use vanilla MySQL/Postgres if I could.
I got burned by Cassandra Counters, too. Wrote about it in this long blog post about Cassandra data modeling:
(Search “counters” for the relevant section.)
Interesting to me because in a past life (about 5ish years ago) I was working at a web advertising company. The company was in the process of switching from having statistics in MySQL and counting things with queries like SELECT hour, count(*) FROM hits WHERE customerId=? GROUP BY hour, to storing everything in Cassandra instead with counters precomputed for all of the roll-ups (e.g. one counter for each (customerId, hour) pair).
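A hedged sketch of what that counter-based schema might have looked like (table and column names are my guesses, not the company’s actual schema):

```sql
-- Hypothetical CQL roll-up table: one counter row per (customerId, hour).
CREATE TABLE hits_by_hour (
    customer_id bigint,
    hour        timestamp,   -- truncated to the hour
    hit_count   counter,
    PRIMARY KEY (customer_id, hour)
);

-- Each incoming hit bumps the precomputed counter instead of inserting a row:
UPDATE hits_by_hour SET hit_count = hit_count + 1
 WHERE customer_id = 42 AND hour = '2016-01-01 13:00:00+0000';

-- The old MySQL GROUP BY then becomes a plain read of precomputed rows:
SELECT hour, hit_count FROM hits_by_hour WHERE customer_id = 42;
```

Note that CQL counter tables can hold only counter columns outside the primary key, which is part of why you end up with a dedicated table per roll-up.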
I say it was in that process because the company died ignominiously, being bought out by a competitor, AFAIK for a rather modest sum. I don’t think this was Cassandra’s fault though. ;)
The thing I’d be super interested to know, which this article didn’t get into at all (*), is why the use of counter columns was so much more expensive for them than regular upserts. I never benchmarked it myself, but at the time I was told that Cassandra counter increments were cheaper than regular upserts - precisely because of their crappy (non-idempotent, prone to under- or over-counting) semantics.
(* No blame for not digging into the root cause, it’d probably take a bunch of effort.)
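To illustrate the semantic difference (my own sketch, not something from the article): a counter increment that times out can’t be safely retried, whereas an upsert of a per-event row can.

```sql
-- Counter write: if this times out, the client can't tell whether it was
-- applied. Retrying may double-count; not retrying may under-count.
UPDATE hits_by_hour SET hit_count = hit_count + 1
 WHERE customer_id = 42 AND hour = '2016-01-01 13:00:00+0000';

-- Idempotent alternative: upsert one row per raw event, keyed by an event id
-- generated once at the client and reused on every retry. Replaying the same
-- write just overwrites the same row, so retries are safe; the cost is
-- storing every event and aggregating at read time.
CREATE TABLE hits (
    customer_id bigint,
    hour        timestamp,
    event_id    uuid,
    PRIMARY KEY ((customer_id, hour), event_id)
);
INSERT INTO hits (customer_id, hour, event_id)
VALUES (42, '2016-01-01 13:00:00+0000', 50554d6e-29bb-11e5-b345-feff819cdc9f);
```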