I am confused by this statement: “This Inverted Index comes at a cost however - it has to be updated on every insert, and takes up disk space and RAM to keep in memory (don’t @ me about mmap).”
Lucene stores data in immutable segments that it slowly and continuously merges into larger segments. Each of those segments carries an index. You can switch off indexing for many fields (which is done in ELK scenarios), and a lot of queries exploit the immutability of the segments for really aggressive and cheap caches.
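To make the segment idea concrete, here is a toy sketch in Python. It is not Lucene's actual data structure or API; `Segment`, `merge` and `search` are names made up for illustration. It only shows the shape of the argument: inserts create new small segments instead of touching old ones, merges build a bigger segment from smaller ones, and because a segment never changes, a per-segment query cache never needs invalidating.

```python
from collections import defaultdict


class Segment:
    """A toy immutable mini-index: term -> set of doc ids (hypothetical, not Lucene)."""

    def __init__(self, docs):
        # docs: dict of doc_id -> text. The index is built once and never modified.
        postings = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                postings[term].add(doc_id)
        self.docs = dict(docs)
        self.postings = {t: frozenset(ids) for t, ids in postings.items()}
        # Cached results stay valid for the lifetime of the segment,
        # because nothing inside the segment can change.
        self._cache = {}

    def search(self, *terms):
        """Return ids of docs in this segment that contain all given terms."""
        key = frozenset(terms)
        if key not in self._cache:
            sets = [self.postings.get(t, frozenset()) for t in key]
            self._cache[key] = frozenset.intersection(*sets) if sets else frozenset()
        return self._cache[key]


def merge(a, b):
    """Build one larger segment from two smaller ones; the inputs are then dropped."""
    return Segment({**a.docs, **b.docs})


def search(segments, *terms):
    """A query fans out over every live segment and combines the hits."""
    hits = set()
    for seg in segments:
        hits |= seg.search(*terms)
    return sorted(hits)


# Two "flushes" produce two small segments; nothing is updated in place.
segments = [
    Segment({1: "disk full on host a", 2: "login ok"}),
    Segment({3: "disk replaced on host b"}),
]
before = search(segments, "disk", "host")

# A background merge replaces both with one larger segment.
segments = [merge(segments[0], segments[1])]
after = search(segments, "disk", "host")
```

The query result is the same before and after the merge; what changes is how many segments a query has to visit and how much indexing work is redone during the merge.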
How does Lucene segment merging relate to the memory and disk requirements of the inverted index?
Also, segment merging can be quite heavyweight, and with ES (this is about to change!) you can end up with 20% of your index being deleted documents that merging will never reclaim.
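A rough Python sketch of why deleted documents linger, under the assumption that a delete in an immutable segment is only a tombstone and that an update is a delete plus a re-add. The class and function names are hypothetical, and the real merge policy (which decides whether a segment ever gets picked for merging) is not modelled here.

```python
class Segment:
    """Toy segment: stored docs are fixed, only the tombstone set can grow."""

    def __init__(self, docs):
        self.docs = dict(docs)   # never modified after construction
        self.deleted = set()     # tombstones for docs that are logically gone

    def live_docs(self):
        return {i: d for i, d in self.docs.items() if i not in self.deleted}

    def deleted_ratio(self):
        return len(self.deleted) / len(self.docs) if self.docs else 0.0


def update(segments, doc_id, new_text):
    """An update tombstones the old copy and writes the new copy to a new segment."""
    for seg in segments:
        if doc_id in seg.docs and doc_id not in seg.deleted:
            seg.deleted.add(doc_id)
    segments.append(Segment({doc_id: new_text}))


def merge(segments):
    """Only a merge rewrites segments, so only a merge drops tombstoned docs."""
    merged = {}
    for seg in segments:
        merged.update(seg.live_docs())
    return [Segment(merged)]


segments = [Segment({i: f"v1 of doc {i}" for i in range(10)})]
update(segments, 3, "v2 of doc 3")
update(segments, 7, "v2 of doc 7")

ratio_before = segments[0].deleted_ratio()           # share of dead docs in the big segment
stored_before = sum(len(s.docs) for s in segments)   # dead copies still take up space
segments = merge(segments)
stored_after = sum(len(s.docs) for s in segments)
```

Until the big segment takes part in a merge, its dead copies keep occupying disk. If the merge policy never selects that segment, the space is never reclaimed, which is the situation described above.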
I was referring to the update. Yes, deleted documents are a problem, but not in the logging case, where you usually don’t delete.
I just don’t get a good feeling for what exact problem they are pointing out in context. I agree with the general sentiment that Lucene is an okay storage for the log use case, but will be beaten by something more specific.
That makes sense. I wasn’t thinking about how you don’t usually delete for logging (it’s not what I use ES for - though in our case most of our deleted documents come from updates, not “real” deletes).
I have played with Loki and think it’s a pretty good model. For small use-cases you can totally just use the filesystem storage option. You only have to upgrade to an object store when your scale requires it. It does require some configuration on the ingestion side to specify the fields you are going to index on, but that doesn’t seem that big of a problem to me.
It may have been me configuring it poorly (probably), but my experience w/ Loki in a small scale setting has been that it will do terrible, terrible things to your inodes when running Loki using filesystem storage.
Just something to look out for and keep an eye on. Besides the “Oops! All inodes!” issue, Loki+Grafana is a pretty nice setup.
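If you want to keep an eye on it, a small Python sketch along these lines works on Unix-like systems. It uses `os.statvfs`, which reports total and free inodes for the file system containing a path. The path `"."` is a placeholder; point it at wherever your Loki filesystem storage lives. The helper names are made up for this example.

```python
import os


def inode_usage(path="."):
    """Return the fraction of inodes in use on the file system holding `path`.

    Returns None when the file system reports no fixed inode count,
    which some file systems do.
    """
    st = os.statvfs(path)  # Unix only
    total, free = st.f_files, st.f_ffree
    if total == 0:
        return None
    return (total - free) / total


def count_files(root):
    """Count regular directory entries under `root`, as a rough per-directory view."""
    return sum(len(files) for _, _, files in os.walk(root))


usage = inode_usage(".")
if usage is None:
    print("file system does not report an inode limit")
else:
    print(f"inodes in use: {usage:.1%}")
print(f"files under current directory: {count_files('.')}")
```

Running it periodically against the storage directory shows whether the file count is growing faster than the log volume would suggest.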
I have not run into that issue in my setup. It may be a result of the amount of logs I’m pulling in, which is actually quite small, or something else to do with my setup.
It also has to do with the file system you are using, so it might partly be about using the right tool for the job. But it would certainly make sense to structure them in a better way, regardless.