1. 5
  1. 1

    Perhaps they could have used a compression algorithm instead.

    1. 3

      You can almost always do better with a compression algorithm that’s tailored to the data than with a general-purpose one, and it looks as if this one reduces the size quite considerably relative to gzip (30338 bytes gzip’d vs 17871 bytes with this approach).

      The alternative approach, which is coming back into fashion now, is to use some kind of ML model and store that plus the exceptions. This is how TeX stores its hyphenation databases: they’re a map of two-letter sequences to weights plus a short list of exceptions. From memory, for English, something like 70 words don’t get correctly hyphenated if you take the weights and use them to determine hyphenation points, so there’s a separate 70-word table alongside a 676-entry weight table, which lets TeX correctly hyphenate 20K+ English words. It’s especially fun because the same code for generating the compressed version works for both British and American English, in spite of their different hyphenation rules (British rules follow root words, American rules follow syllable breaks), as long as you have a corpus of valid words.

      It would be interesting to train even something simple, like a Markov chain that generates 5-letter sequences, on the set of words in the list, then pick a threshold probability at which every word in the list is generated and store the list of strings it generates that aren’t in the original list (or a threshold that generates almost all of the required words plus a list of the ones it misses, or some combination of the two), and see how that compares in size. A rough sketch of that experiment is included after the thread.

      1. 2

        Keep reading; they refer to that later on.
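
For anyone who wants to try the Markov-chain idea from that comment, here is a minimal sketch in Python. It is only an illustration of the suggested experiment, not code from the thread: the `words.txt` path is hypothetical, the model is an order-1 letter chain, and the threshold is set to the probability of the least likely real word, so that every listed word is generated and the exception list only has to hold false positives.

```python
# Sketch of the "Markov chain plus exception list" experiment.
# Assumes a plain-text word list (one lowercase 5-letter word per line)
# at WORDS_PATH; the path and the order-1 letter model are illustrative
# choices, not anything specified in the thread.
from collections import defaultdict

WORDS_PATH = "words.txt"   # hypothetical input file
START = "^"                # marker for "beginning of word"

def train(words):
    """Count letter-to-letter transitions and normalise to probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        prev = START
        for ch in w:
            counts[prev][ch] += 1
            prev = ch
    probs = {}
    for prev, nxt in counts.items():
        total = sum(nxt.values())
        probs[prev] = {ch: n / total for ch, n in nxt.items()}
    return probs

def word_prob(probs, word):
    """Probability the chain assigns to generating this exact word."""
    p, prev = 1.0, START
    for ch in word:
        p *= probs.get(prev, {}).get(ch, 0.0)
        prev = ch
    return p

def generate(probs, threshold, length=5):
    """Enumerate every string of the given length whose chain probability
    is at least `threshold`.  Prefix probability only shrinks as letters
    are added, so any prefix already below the threshold can be pruned."""
    out = []
    stack = [("", START, 1.0)]
    while stack:
        prefix, prev, p = stack.pop()
        if len(prefix) == length:
            out.append(prefix)
            continue
        for ch, q in probs.get(prev, {}).items():
            if p * q >= threshold:
                stack.append((prefix + ch, ch, p * q))
    return out

if __name__ == "__main__":
    with open(WORDS_PATH) as f:
        words = {w.strip() for w in f if len(w.strip()) == 5}
    probs = train(words)
    # The lowest-probability real word sets the threshold, so every listed
    # word is generated and only the false positives need storing.
    threshold = min(word_prob(probs, w) for w in words)
    generated = set(generate(probs, threshold))
    extras = generated - words       # generated, but not in the list
    missing = words - generated      # should be empty with this threshold
    print(f"{len(words)} words, {len(generated)} generated, "
          f"{len(extras)} false positives, {len(missing)} missing")
```

For a real 5-letter word list the false-positive set could easily be large, which is exactly the size trade-off the comment asks about: the cost of the model plus its exception list versus the cost of the tailored encoding or gzip.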