In my anecdotal experience, SQLite full-text search performs worse than PostrgeSQL and Xapian, but better than Sphinx. Since this is basically a REST shim on top of SQLite, I’d expect a similar profile here.
I have never been in a situation where the logical choices were ES and SQLite, so I can’t compare those two for speed, but I’d point out that you’re likely in the same situation, too: ES is a tool to reach for when Postgres or Xapian or similar is insufficient (not least because the minimum cluster size to avoid data loss is three, versus one plus WAL shipping for PostgreSQL). By that point, SQLite is not your answer.
In my anecdotal experience, SQLite full-text search performs worse than PostgreSQL and Xapian, but better than Sphinx.
Does it mean that, in your experience, PostgreSQL and Xapian perform better than Sphinx? I didn’t expect that (especially regarding PostgreSQL).
I tried to use PostgreSQL for full-text search, but kept using Xapian instead, because document ranking was too slow. It’s slow because PostgreSQL has to check the heap for each ranked row. The improvements recently submitted by Oleg Bartunov et al. should have fixed this by storing in the index all the information necessary to ranking.
I’d point out that you’re likely in the same situation, too: ES is a tool to reach for when Postgres or Xapian or similar is insufficient
Good guess :-) Yes, my data set fits in a single server, which is a why I keep using Xapian. I’m having a look a PostgreSQL, but it requires some work to implement faceted search, and I don’t know what to expect in terms of performance, compared to a specialized engine like Xapian or Sphinx.
Does it mean that, in your experience, PostgreSQL and Xapian perform better than Sphinx?
Yes, for the use case I was looking at, and given how I was testing. I’ll provide more background here, but I really want to make very clear I’m providing an anecdote, not science.
For Kiln On Demand, when we decided we wanted instant everything-wide search, and before we settled on Elasticsearch or looked at Solr, I looked at several smaller options, including Sphinx, Xapian, SQLite FTS, and PostgreSQL (the last of which was interesting to me because we also wanted to migrate off of SQL Server, and were actively using SQL Server’s full-text search engine at the time). We were interested in providing instant search across both source code (nice, rich, full-text search fodder), and across more structured data (project names, committer email addresses, etc.; think Gmail search). Given what we had to do, I was interested in total search time. This matters because I had to store a lot more data in Sphinx to achieve what in PostgreSQL was a SQL query joined on an FTS query. In that context, Sphinx lost to everything else I tried.
If I’d gone with having the same data in PostgreSQL and in Sphinx, it’s entirely possible that Sphinx would’ve done a lot better, but I’d note that Xapian (and, much later, ES) seemed totally fine with it, so Sphinx may just be a little bit slower. I’d also note that, given when I did this, I was probably using SQLite FTS3, whereas I think that FTS5 may be stable now, so SQLite’s performance profile may have changed.
It looks like our use cases are quite close. I’d like to ask a few questions:
How do you synchronize your source of truth datastore (SQL Server in your case) with your full-text search engine (ES in your case)?
In my use case, the storage used by our FTS engine (currently Xapian) is 10x bigger than the storage used by PostgreSQL. If I get rid of Xapian, and use PostgreSQL’s FTS instead, I’m afraid of the FTS feature eating all the resources, at the cost of “normal” usage. For example, backup/restore could be longer because of that. What is your experience on this matter?
Did you try to implement some kind of faceted search with PostgreSQL?
I haven’t worked at Fog Creek for a couple of years, so this may have changed, but this is how things worked as of winter 2013:
Elasticsearch was never the source of truth. ES at the time had lousy support for backups (specifically: there wasn’t any), so we had to treat it as de juris ephemeral storage. Thus, the source of truth was always either the repositories themselves or SQL Server. In the former case, both SQL Server and ES could be rebuilt with a button click. This made certain classes of data migration and repair really pleasant, and had very few drawbacks (ES usually lagged less than a second behind SQL Server), so I’d absolutely do it that way again. (I am almost positive I remember ES coming out with a backup solution in the last couple of years, though, so this may no longer be a sane way to approach things. I have no idea.)
Your point on resource usage matched ours, and is part of why I ended up throwing out all of those solutions in favor of ES: if you ditched the full-text search indices, then a single large SQL Server box could handle our workload amazingly well, whereas you’d otherwise need a fleet. Throw in ES' near-linear scaling and it was a no-brainer; it was literally the only game in town. (At the time, ES v. Solr was extremely clear-cut: only ES could handle our write-load. This isn’t work I did, but its conclusions and data roughly matched mine; I remember being happy at the time to see my results weren’t unique. I am positive that Solr and ES have both changed, so I’m providing this only for historical context.)
I never tried to use PostgreSQL in a faceted way; staying on SQL Server, but adding ES, made so much economic sense that we never got too serious with the PostgreSQL work. Since then, I’ve never worked with a PostgreSQL BD big enough I needed to shard.
Any idea of its performance compared to ElasticSearch, Sphinx, Xapian or PostgreSQL?
In my anecdotal experience, SQLite full-text search performs worse than PostrgeSQL and Xapian, but better than Sphinx. Since this is basically a REST shim on top of SQLite, I’d expect a similar profile here.
I have never been in a situation where the logical choices were ES and SQLite, so I can’t compare those two for speed, but I’d point out that you’re likely in the same situation, too: ES is a tool to reach for when Postgres or Xapian or similar is insufficient (not least because the minimum cluster size to avoid data loss is three, versus one plus WAL shipping for PostgreSQL). By that point, SQLite is not your answer.
Thanks for sharing your experience.
Does it mean that, in your experience, PostgreSQL and Xapian perform better than Sphinx? I didn’t expect that (especially regarding PostgreSQL).
I tried to use PostgreSQL for full-text search, but kept using Xapian instead, because document ranking was too slow. It’s slow because PostgreSQL has to check the heap for each ranked row. The improvements recently submitted by Oleg Bartunov et al. should have fixed this by storing in the index all the information necessary to ranking.
Good guess :-) Yes, my data set fits in a single server, which is a why I keep using Xapian. I’m having a look a PostgreSQL, but it requires some work to implement faceted search, and I don’t know what to expect in terms of performance, compared to a specialized engine like Xapian or Sphinx.
Another interesting project is blevesearch.com.
Yes, for the use case I was looking at, and given how I was testing. I’ll provide more background here, but I really want to make very clear I’m providing an anecdote, not science.
For Kiln On Demand, when we decided we wanted instant everything-wide search, and before we settled on Elasticsearch or looked at Solr, I looked at several smaller options, including Sphinx, Xapian, SQLite FTS, and PostgreSQL (the last of which was interesting to me because we also wanted to migrate off of SQL Server, and were actively using SQL Server’s full-text search engine at the time). We were interested in providing instant search across both source code (nice, rich, full-text search fodder), and across more structured data (project names, committer email addresses, etc.; think Gmail search). Given what we had to do, I was interested in total search time. This matters because I had to store a lot more data in Sphinx to achieve what in PostgreSQL was a SQL query joined on an FTS query. In that context, Sphinx lost to everything else I tried.
If I’d gone with having the same data in PostgreSQL and in Sphinx, it’s entirely possible that Sphinx would’ve done a lot better, but I’d note that Xapian (and, much later, ES) seemed totally fine with it, so Sphinx may just be a little bit slower. I’d also note that, given when I did this, I was probably using SQLite FTS3, whereas I think that FTS5 may be stable now, so SQLite’s performance profile may have changed.
It looks like our use cases are quite close. I’d like to ask a few questions:
I haven’t worked at Fog Creek for a couple of years, so this may have changed, but this is how things worked as of winter 2013:
Thanks a lot for sharing your experience here. I appreciate.