I normally don’t link to HN, but in this case the conversation about this article over there included a pretty lengthy reply by the original author that I think is worth reading, almost as a follow-up to this post:
I don’t really like how the author talks about persistence. There are no general solutions, and certain approaches must be taken to solve certain problems. TSDBs are an area where usage patterns will sharply dictate whether one is usable for you or not. There are severe trade-offs that need to be made to solve nontrivial problems in this space. LSMs are well suited to high-write workloads and give better latency when reading recently written data. Of course there’s complexity. If you ignore it, you are going to have problems. This is a space that heavily punishes avoidance of the inherent complexity of a problem.
An issue that really irks me about many TSDBs is their approach to operational complexity as they scale. So many of them completely ignore the work that has been done in distributed databases to solve the same problems, and because they choose not to use existing ones, they force operators to work a lot harder to learn and build tooling around a far less generic system. InfluxDB is a distributed database that you have to learn the characteristics of, and then can’t use for anything except InfluxDB. Prometheus forces operators to manually manage master-slave architectures, and because it’s pull-based, the higher-level aggregators have to know about lots of little configuration changes; if it were the other way around, you would only need to swap a single DNS entry or similar. When you talk to people in this space about pull vs push, they don’t think about coordination avoidance, but about some crazy notion that one makes capacity planning easier than the other. This is not true.