This is a bit of a simplistic model, but it’s absolutely a good intro to the topic. There are many more filters you might want (compound words, for example), and enrichment (synonyms, phonetics, spellcheck) you could apply to the tokens before indexing or searching for them. You could also boost or reduce scores based on exact versus partial matches.
Also consider the target language. This will work well for English, but many other languages have different grammar rules that will make the strategies employed here fail.
There are tonnes of other options here (no surprise, considering the competition in the market), and I would encourage people to explore the problem space, since everything from the matching algorithm to the tokenizers and other processors can be improved in various interesting ways.
It is using the Snowball stemmer, which works for English, Spanish, Portuguese, French, German, and Italian.
Not with the current implementation as he specifically imports the English version.
Adding to that, stemmers won’t break up compound words properly. The classic joke “this is a flammenwerfer. It werfs flammen” plays on the word “flammenwerfer”, which literally translates to “flamethrower” in German (note the lack of a space), but a stemmer won’t split those two words into separate tokens. Many languages work like this. Many more do even crazier things, compounding words and adding suffixes, prefixes, and infixes, often several at the same time. Stemming is really just the absolute start of reducing tokens down to their constituent parts. There are many more steps to this.
I’m not trying to shit on his post, because it’s an excellent introduction to the topic, but I wanted to point out that there are many areas where this can be vastly improved if you wanted to tinker with this yourself.
This is a nice little writeup, thanks for sharing! I’ve been considering building a text search engine that compiles to WASM, so I would be interested in any further advanced material on this topic if anyone knows any. Currently I’m using https://github.com/weixsong/elasticlunr.js which is nice, but this seems like a fun project to build from scratch!
This is very relevant, thanks! I’ll have to take a look to see whether it might be an easy addition. Something like this, where the index can be created using a library, is almost exactly what I need (and technically what I currently have, but it would be fun to do with WASM as the runner).
Persistence, if you want to query without loading the entire dataset, is where things get tricky. Some FTS engines build their own storage layer, while others use existing database engines. Bleve is a good example of the former, and SQLite’s FTS extension(s) of the latter. Both are open source and well documented.
I’ll check out Bleve, thanks. I’m definitely more interested in a FTS which can use a pre-generated index, rather than something like SQLite.
In terms of advanced topics, things like boosting, stemming, other filters and fuzzy search all seem like interesting problems to tackle. Very cool problem space for sure.
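Of those, fuzzy search is probably the most self-contained to prototype: one common approach is to accept any indexed token within a small edit distance of the query token. A sketch of the standard dynamic-programming Levenshtein distance in Go (byte-based for simplicity, so multi-byte Unicode would need a rune-based variant):

```go
package main

import "fmt"

// levenshtein computes the edit distance between a and b using the
// classic DP recurrence, keeping only the previous row to save memory.
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr := make([]int, len(b)+1)
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = minInt(prev[j]+1, minInt(curr[j-1]+1, prev[j-1]+cost))
		}
		prev = curr
	}
	return prev[len(b)]
}

func minInt(x, y int) int {
	if x < y {
		return x
	}
	return y
}

func main() {
	fmt.Println(levenshtein("stemming", "steaming")) // → 1
}
```

A fuzzy matcher would then treat distance ≤ 1 (or 2, scaled by token length) as a hit, usually with a score penalty relative to exact matches, which ties back into boosting.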
Well, SQLite’s FTS uses its b-tree module directly; it isn’t using SQL. So it may be more relevant than you think.
Stemming does seem a good area to look at. Snowball is far from perfect. But given how irregular human languages are, esp. English, and how many of them there are, it also seems a difficult problem. If you find any better stemmers, do post them to Lobsters!
(FYI, I implemented FTS in Couchbase Lite. The current implementation uses SQLite FTS4, but an earlier version was hand-rolled atop a generic KV store, so I got to learn more of the innards of how it’s done. I also used to work with Marty Schoch who wrote Bleve (on the server side) but I don’t know anything about its code, as we don’t use Go on the client side.)
Very cool, thanks for the context! SQLite is certainly a paragon of software quality, so you’re right that I should take a look at it.
It does seem like the human language part of this problem is actually the more complicated part. The actual look-ups and storage side of things seem fairly benign in comparison! I’ll certainly post anything cool I find to Lobsters.
One of my projects a while back was implementing an autocomplete for a proprietary programming language offering at a company. However programming is much more well-behaved than human language, so I didn’t have to deal with any stemming or other complications thankfully. I wonder if that would have tainted my memory of that project – it was one of the few times where I got to really flex the pure CS muscles!
Loved the write-up! I found it easy to follow along despite not knowing golang.