How hard is calculating edit distance?
Assuming a computer can execute a billion steps per second and an edit distance calculation takes on the order of 100 steps, that is the ability to scan 10,000,000 words per second. Most full dictionaries seem to have about 1,000,000 words, so that is 10 full scans per second. You could probably cheaply maintain a top-K as you scan, but a full sort would take ~20,000,000 steps (n log n), i.e. about 50 full sorts per second, and either way we are still bounded by the edit distance calculation.
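To make the top-K-vs-full-sort point concrete, here is a minimal brute-force scan in Python (the function names and parameters are made up for illustration; `levenshtein` here is the textbook DP, not a fast method):

```python
import heapq

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance, roughly len(a)*len(b) steps."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_k(query: str, dictionary, k: int = 5):
    """One pass over the dictionary, keeping only the k best matches.

    heapq.nsmallest does the cheap top-K maintenance: ~n log k steps
    instead of the ~n log n a full sort of all distances would cost.
    """
    return heapq.nsmallest(k, dictionary, key=lambda w: levenshtein(query, w))
```

E.g. `closest_k("helo", ["hello", "world", "help"], k=2)` returns `["hello", "help"]` (both at distance 1).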
Hyyrö 2003 shows how to compute edit distances quickly with Myers' 1999 bit-vector algorithm. References are HERE.
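For reference, the bit-vector recurrence fits in a few lines. This is a from-memory sketch of the Myers 1999 / Hyyrö 2003 method, assuming the first string fits in one machine word; since Python ints are unbounded, an explicit mask stands in for the word size:

```python
def myers_levenshtein(a: str, b: str) -> int:
    """Bit-parallel Levenshtein distance (Myers 1999 / Hyyrö 2003 variant).

    Each text character is processed in O(1) word operations instead of
    an O(len(a)) DP column, which is where the speedup comes from.
    Sketch assumes len(a) <= 64 on real hardware; here `mask` emulates it.
    """
    if not a:
        return len(b)
    m = len(a)
    mask = (1 << m) - 1
    last = 1 << (m - 1)
    # peq[c]: bitmask of positions in `a` holding character c
    peq = {}
    for i, c in enumerate(a):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv, score = mask, 0, m   # positive/negative vertical deltas, current distance
    for c in b:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = (((eq & pv) + pv) ^ pv) | eq
        ph = mv | (~(xh | pv) & mask)
        mh = pv & xh
        if ph & last:
            score += 1
        if mh & last:
            score -= 1
        ph = ((ph << 1) | 1) & mask  # the |1 encodes the D[0][j] = j boundary
        mh = (mh << 1) & mask
        pv = mh | (~(xv | ph) & mask)
        mv = ph & xv
    return score
```

The inner loop is all word-parallel bit operations, which is why a scan of a whole dictionary column-by-column is so much faster than the classic DP table.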
This is also coded up in the Nim implementation already mentioned. One of the graphs there shows sub-ms time to scan 80,000 words with a hot cache. Really, that is 80,000*6, because that graph shows "6 batches". The concept being modeled is that one first does vanilla hash lookups to identify possible misspellings in some document, gets (say) 6 unknown words, and then wants the closest K matches for all 6, which you can get with just a single IO sweep over the dictionary, amortizing the IO cost. Anyway, roughly half a million distance calculations per millisecond is probably a simple ballpark for words with English-like statistics.
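The batch-sweep idea can be sketched as one pass over the dictionary serving every query word at once (names and word lists below are made up; the plain DP distance is inlined just to keep the sketch self-contained):

```python
import heapq

def levenshtein(a: str, b: str) -> int:
    """Textbook DP edit distance, inlined for self-containment."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def batch_closest(queries, dictionary, k=3):
    """One sweep over the dictionary (an iterable, e.g. a file of words)
    finds the top-k matches for *every* query word, so the IO cost of
    reading the dictionary is amortized across the whole batch.
    """
    heaps = {q: [] for q in queries}   # max-heap of (-distance, word), capped at k
    for w in dictionary:               # the single IO sweep
        for q in queries:
            d = levenshtein(q, w)
            h = heaps[q]
            if len(h) < k:
                heapq.heappush(h, (-d, w))
            elif -d > h[0][0]:
                heapq.heapreplace(h, (-d, w))
    return {q: sorted((-nd, w) for nd, w in heaps[q]) for q in queries}
```

E.g. `batch_closest(["helo", "wrold"], ["hello", "world", "help"], k=1)` pairs each suspected misspelling with its nearest dictionary word in a single pass.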
This batch method is a pretty natural way to go since the vast majority of words in the vast majority of documents will not be misspelled. It also scales well to very high edit distances like 5 because it does not really depend upon "how far apart" the words are from dictionary words. Wolf Garbe's SymSpell does not scale well to high edit distance - either in space usage, where it basically starts blowing up, or in time usage, where it degrades more gradually.
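The space blow-up is easy to see by just counting SymSpell-style delete variants. A tiny sketch (the word and distances below are arbitrary; SymSpell precomputes something like this set per dictionary word):

```python
def deletes(word: str, d: int) -> set:
    """All strings reachable from `word` by up to d character deletions.

    The size of this set per dictionary word is what grows quickly
    with the maximum edit distance (roughly combinatorially, C(n, d)).
    """
    frontier = {word}
    out = set(frontier)
    for _ in range(d):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        out |= frontier
    return out

# e.g. neighborhood sizes for one 10-letter word at increasing distances:
for d in range(1, 6):
    print(d, len(deletes("spellcheck", d)))
```

Multiply that growth by a million dictionary words and the index size at distance 4 or 5 gets uncomfortable, which is the blow-up being described.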
The Peter Norvig idea that is “a million times slower” was just some did-it-on-an-airplane-in-an-hour implementation. That might almost define “strawman” except for Norvig’s fame/status/captive audiences.
If you have further interest then reading the whole README may be enlightening and/or interesting.
A noteworthy Java-based implementation of SymSpell:
It’s used in KeenWrite to check all the words in a plain text paragraph:
The spell check algorithm is insanely fast.
Note that the lexicon for the reference implementation is missing about 18,000 words at the time of writing. The author is aware. For the impatient, grab en.txt from the repo:
Fast implementation that saves to disk (in an online/updatable way) to avoid slow recomputes: https://github.com/c-blake/suggest. (Also some performance analysis suggesting the algo is not necessarily a good idea if you care about consistent time, though this conclusion depends a lot on what kind of device backs the database and/or RAM/swap - NVMe vs Winchester drives.)
Could you just store a bloom filter of all the deletes for each word?
edit: Hm, not sure if there’s a way to scan a tree of bloom filters. Seems like it should be possible though.
You need a lookup table to find out what the candidate words are, not just a membership test. You could use a bloom filter as a pre-test, but I don’t think there’s a lot of sense in that when the recommended data structure here is a hash table, and a bloom filter requires several hash computations :)
No, I mean literally a bloom filter per word. Though given the limited size of a word, it's probably not worth anything space-wise.
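For what it's worth, the per-word idea is easy to toy with. A sketch with a deliberately tiny Bloom filter (everything here is made up for illustration; salted blake2b stands in for the k independent hash functions):

```python
import hashlib

class TinyBloom:
    """Toy Bloom filter: m bits, k salted-hash probes. Purely illustrative."""
    def __init__(self, m_bits: int = 256, k: int = 3):
        self.m, self.k, self.bits = m_bits, k, 0

    def _probes(self, item: str):
        # k "independent" hashes simulated by salting one hash function
        for salt in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=bytes([salt])).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._probes(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # no false negatives; false positives possible
        return all(self.bits >> p & 1 for p in self._probes(item))

# one filter per dictionary word, holding that word's 1-deletes:
word = "hello"
bloom = TinyBloom()
for i in range(len(word)):
    bloom.add(word[:i] + word[i + 1:])
```

A query word would then be checked against each filter via its own deletes, but, as noted, each membership test costs k hash computations versus one hash-table probe, and for word-sized sets the filter is unlikely to beat just storing the deletes.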