
  2. 20

    Those same tricks are still being used today; they just aren’t quite needed inside a spell checker any more, simply because the scale of the data is so small (relatively speaking).

    One of the most popular tricks is to store the word set in a finite state machine, and then minimize the FSM. A trie is of course a kind of FSM, but a trie typically only compresses prefixes; minimizing the FSM will also compress suffixes. My favorite author on the topic is Jan Daciuk, and it’s still an active area of research. In terms of practical implementations, the most popular one I know of is Lucene, which uses a finite state transducer. (FSTs permit storing values—that follow a certain algebra—on the transitions of the FSM, which essentially lets you store not just sets of words but a map from word to some value.) Lucene uses FSTs primarily as an index into its term dictionary, but Elasticsearch also uses FSTs for fast autocomplete.
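
    To make the trie-vs-minimal-FSM point concrete, here is a minimal sketch (not how Lucene does it): build a trie as nested dicts, then canonicalize equivalent subtrees bottom-up through a register, which is exactly what merges shared suffixes into a DAWG. The function and variable names are my own.

    ```python
    # A trie shares only prefixes; minimizing also shares suffixes.
    # Build a trie as nested dicts, then replace each subtree by a
    # canonical representative (a sorted transition tuple) via a register.

    def build_trie(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node[""] = {}  # "" marks end of word
        return root

    def minimize(node, register):
        # Two states with identical (label, canonical-child) transitions
        # accept the same suffix set, so they collapse into one.
        rep = tuple(sorted((ch, minimize(child, register))
                           for ch, child in node.items()))
        return register.setdefault(rep, rep)

    def count_trie(node):
        return 1 + sum(count_trie(c) for c in node.values())

    def count_dawg(rep, seen=None):
        seen = set() if seen is None else seen
        if rep not in seen:
            seen.add(rep)
            for _, child in rep:
                count_dawg(child, seen)
        return len(seen)

    words = ["dropping", "stopping", "topping"]
    trie = build_trie(words)
    dawg = minimize(trie, {})
    print(count_trie(trie), count_dawg(dawg))  # the shared "opping" suffix shrinks the machine
    ```

    With these three words the trie needs 27 states but the minimized automaton only 11, because the whole "opping" tail is stored once.
    
    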

    The central challenge of using FSMs/FSTs this way is actually building them in the first place with reasonable resources. The naive approach is to build the full automaton and then minimize it with Hopcroft’s algorithm, but that requires holding the entire finite state machine in memory first. This is a problem because you tend to reach for these data structures precisely when working with very large quantities of data, so you really want to avoid “throw everything into memory” if you can. It turns out you can do a lot better if you can make some assumptions like:

    • The keys are inserted in lexicographic order.
    • The FSM/FST is acyclic.
    • You’re OK with a small automaton rather than one that is truly minimal.

    The combination of these lets you build the FSM very quickly and incrementally, in a streaming fashion, using memory proportional to the (small) minimized automaton rather than to the input.
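
    A sketch of that incremental idea (in the spirit of Daciuk et al.’s sorted-input algorithm; names and details are mine, not from any particular implementation): only the path spelling the most recent word stays “unfrozen”. Once the next word diverges from it, the finished suffix is minimized bottom-up into a register and never revisited, so memory stays proportional to the output automaton.

    ```python
    # Incremental construction of a (near-)minimal acyclic automaton from
    # lexicographically sorted input. Only the path for the previous word
    # is mutable; diverged suffixes are canonicalized into `register`.

    class State:
        __slots__ = ("edges", "final")
        def __init__(self):
            self.edges = {}     # char -> State
            self.final = False

    def build_dawg(sorted_words):
        register = {}

        def replace(state):
            # Children are already canonical, so identity works in the key.
            sig = (state.final,
                   tuple((c, id(t)) for c, t in sorted(state.edges.items())))
            return register.setdefault(sig, state)

        root = State()
        path, prev = [root], ""     # path[i] = state after reading prev[:i]
        for word in sorted_words:
            # length of the common prefix with the previous word
            k = 0
            while k < min(len(word), len(prev)) and word[k] == prev[k]:
                k += 1
            # the previous word's suffix beyond k is finished: freeze it
            for i in range(len(prev), k, -1):
                path[i - 1].edges[prev[i - 1]] = replace(path[i])
            # grow the new suffix as fresh, mutable states
            path, node = path[:k + 1], path[k]
            for ch in word[k:]:
                nxt = State()
                node.edges[ch] = nxt
                path.append(nxt)
                node = nxt
            node.final = True
            prev = word
        for i in range(len(prev), 0, -1):   # freeze the last word's path
            path[i - 1].edges[prev[i - 1]] = replace(path[i])
        return root

    def contains(root, word):
        node = root
        for ch in word:
            node = node.edges.get(ch)
            if node is None:
                return False
        return node.final

    def count(root, seen=None):
        seen = set() if seen is None else seen
        if id(root) not in seen:
            seen.add(id(root))
            for t in root.edges.values():
                count(t, seen)
        return len(seen)

    root = build_dawg(["dog", "dogs", "log", "logs"])  # input must be sorted
    print(count(root))  # 5 states: the "og"/"ogs" tails are shared
    ```

    Note that only the current path plus the register live in memory; the input stream itself is never materialized.
    
    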

    Other areas of research relax some requirements (like the lexicographic ordering), but I’m out of my depth at that point.

    The other part of this task is optimizing how you store your FSM states in memory (or on disk). If you do things right, the vast majority of states should occupy a single byte! (Jan Daciuk has published a few papers on that specific topic, which are just so much fun to read.)
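
    To give a feel for how a state can fit in one byte, here is a hypothetical layout (my own illustration, in the spirit of Daciuk’s compression papers, not a layout from any specific paper): since most states have a single outgoing transition whose target is stored immediately after them, the target address can be implicit and the whole transition reduces to a label plus three flag bits.

    ```python
    # Hypothetical one-byte transition layout:
    #   bits 7..3 : label index into a <=32-symbol alphabet
    #   bit 2     : FINAL - this transition ends a word
    #   bit 1     : LAST  - last outgoing transition of this state
    #   bit 0     : NEXT  - target state is stored right after this byte,
    #                       so no explicit pointer is needed

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # fits in 5 bits

    def pack(label, final, last, is_next):
        return (ALPHABET.index(label) << 3) | (final << 2) | (last << 1) | is_next

    def unpack(b):
        return ALPHABET[b >> 3], bool(b & 4), bool(b & 2), bool(b & 1)

    b = pack("q", final=False, last=True, is_next=True)
    assert unpack(b) == ("q", False, True, True)
    ```

    States with multiple transitions, or targets that aren’t the physically next state, need extra pointer bytes; the win comes from the long single-child chains that dominate real dictionaries.
    
    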

    1. 3

      Moreover, spell checkers are context-sensitive nowadays, so you probably want some form of language model. Language models are typically much larger than the dictionary. There are many interesting approaches to language model compression (e.g. Stolcke pruning, quantization of probabilities and Golomb coding of n-grams).
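
      A tiny sketch of the Golomb-coding idea mentioned above (using the Rice special case, where the parameter is a power of two, for simplicity; this is an illustration, not any particular system’s format): sort your hashed n-grams, store only the gaps between consecutive values, and encode each gap as a unary quotient plus a fixed-width remainder. Gaps between sorted hashes are roughly geometrically distributed, which is the distribution Golomb codes compress near-optimally.

      ```python
      # Rice coding (Golomb with m = 2**k) of gap values, as used for
      # compressing sorted n-gram hash sets. Bits are modeled as a list
      # of 0/1 ints for clarity; a real implementation would pack bytes.

      def rice_encode(gaps, k):
          bits = []
          for g in gaps:
              q, r = g >> k, g & ((1 << k) - 1)
              bits.extend([1] * q)   # quotient in unary
              bits.append(0)         # unary terminator
              bits.extend((r >> i) & 1 for i in reversed(range(k)))  # remainder
          return bits

      def rice_decode(bits, k, count):
          out, i = [], 0
          for _ in range(count):
              q = 0
              while bits[i] == 1:
                  q += 1; i += 1
              i += 1                 # skip terminator
              r = 0
              for _ in range(k):
                  r = (r << 1) | bits[i]; i += 1
              out.append((q << k) | r)
          return out

      gaps = [3, 1, 7, 2, 4]
      assert rice_decode(rice_encode(gaps, 2), 2, len(gaps)) == gaps
      ```

      The parameter k is tuned to the mean gap; small gaps then cost only a few bits each, which is what makes storing a large n-gram set (approximately, via hashes) feasible.
      
      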

      Since you mention Jan Daciuk ;), he also did work in that direction with Gertjan van Noord, which they called ‘tuple automata’. Last time I looked, tuple automata were still used in e.g. the Alpino parser, both for storing the tag trigram model for Alpino’s HMM tagger and for storing the language model for fluency ranking in generation.

      Some years ago I developed a spell checker for a company, which also used many of these techniques (including perfect hash automata and tuple automata). The language model in particular was too large when stored in ‘naive’ data structures.