1. 10
  1. 1

    I am not sure Lucene has only a vector-model-based approximate matching algorithm. Can someone point me to the relevant code? I was under the impression it also uses BM25 / TF-IDF.

    Given that it is a sparse matrix, it seems unlikely to be computationally tractable to compute the top-k nearest neighbors over a very large 10k x 10k matrix, where 10k is the count of words, lemmas, or stems in the corpus. And that is without an online algorithm (one that can be updated in place as new documents, possibly containing new words, come in).
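
    To make concrete what I mean by BM25 / TF-IDF style scoring, here is a minimal sketch of a toy inverted index with a common BM25 variant. This is not Lucene's code or API: `TinyBM25Index`, its method names, whitespace tokenization, and the defaults `k1=1.2`, `b=0.75` are all assumptions for illustration. It only shows that scoring can walk the postings of the query terms instead of a full term-by-term matrix, and that adding a document (even one with unseen words) is an in-place update.

    ```python
    import heapq
    import math
    from collections import Counter, defaultdict


    class TinyBM25Index:
        """Toy inverted index with BM25 scoring (hypothetical, not Lucene's API)."""

        def __init__(self, k1=1.2, b=0.75):
            self.k1 = k1                       # term-frequency saturation (assumed default)
            self.b = b                         # length normalization (assumed default)
            self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
            self.doc_len = {}                  # doc_id -> number of tokens
            self.total_len = 0                 # sum of all document lengths

        def add(self, doc_id, text):
            """Index one new document in place; unseen terms just get new postings."""
            tokens = text.lower().split()      # naive whitespace tokenizer (assumption)
            self.doc_len[doc_id] = len(tokens)
            self.total_len += len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf

        def search(self, query, k=3):
            """Return up to k (doc_id, score) pairs, best first."""
            n_docs = len(self.doc_len)
            if n_docs == 0:
                return []
            avg_len = self.total_len / n_docs
            scores = defaultdict(float)
            for term in set(query.lower().split()):
                docs = self.postings.get(term)
                if not docs:
                    continue                   # term not in the index: contributes nothing
                idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
                # Only documents containing the term are touched (sparse traversal).
                for doc_id, tf in docs.items():
                    norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                    scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
            return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


    index = TinyBM25Index()
    index.add("a", "sparse vectors and inverted index")
    index.add("b", "dense vectors and nearest neighbor search")
    hits = index.search("inverted index")      # only document "a" contains these terms
    ```

    The work per query is proportional to the lengths of the postings lists for the query terms, not to the square of the vocabulary size. Whether this resembles what Lucene actually does is exactly the part I would like someone to confirm against the code.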