1. 11
  1. 7

    Long ago I contributed some Levenshtein distance code to apache commons. There are two interesting optimizations that aren’t touched on in the post:

    1. The iterative grid solution doesn’t need to maintain the full grid, only two rows (which can be swapped back and forth). This can significantly reduce memory usage on large strings.
    2. Using an upper bound alg and iteratively increasing the bound until completion lets you bring the general case down from O(nm) (where n and m are the length of the two strings) to O(km) (where k is the edit distance between n and m). This performance can be significant, especially for similar or long strings.

    Most of this I have in a comment starting on line 175. I also recommend this computation biology book for anyone with an interest in string algorithms.