I am not sure Lucene has only a vector-model-based approximate matching algorithm. Can someone point me to the relevant code? I was under the impression that it also uses BM25 / TF-IDF.

Given that it is a sparse matrix, it seems unlikely to be computationally tractable to compute the top-k nearest neighbors over a very large 10k x 10k matrix, where 10k is the number of words, lemmas, or stems in the corpus. And that is without an online algorithm (one that can be updated in place as new documents, possibly containing new words, come in).
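
To make concrete what I mean by BM25 scoring that can be updated online, here is a minimal Python sketch. It is not Lucene's code: the class `TinyBM25Index` and its methods are made up for illustration, tokenization is plain whitespace splitting, and the `k1=1.2`, `b=0.75` defaults are just commonly used values. The point is that an inverted index only touches the postings of the query terms, and adding a document (even one with unseen words) does not require rebuilding any dense matrix.

```python
import math
from collections import Counter, defaultdict


class TinyBM25Index:
    """Hypothetical toy index for illustration; not Lucene's implementation."""

    def __init__(self, k1=1.2, b=0.75):
        self.k1 = k1
        self.b = b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens
        self.total_len = 0                 # sum of all document lengths

    def add(self, doc_id, text):
        """Index one document in place; new terms simply get new posting lists."""
        tokens = text.lower().split()
        self.doc_len[doc_id] = len(tokens)
        self.total_len += len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query, k=3):
        """Return up to k (doc_id, score) pairs, best first, using BM25."""
        n = len(self.doc_len)
        if n == 0:
            return []
        avg_len = self.total_len / n
        scores = defaultdict(float)
        for term in set(query.lower().split()):
            docs = self.postings.get(term)
            if not docs:
                continue  # term not in the index: contributes nothing
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5))
            idf = math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


if __name__ == "__main__":
    index = TinyBM25Index()
    index.add("d1", "lucene scores documents with bm25")
    index.add("d2", "sparse matrix of term counts")
    print(index.search("bm25 lucene"))
    # Online update: a new document with a previously unseen word.
    index.add("d3", "bm25 bm25 ranking")
    print(index.search("bm25"))
```

Only documents that share at least one term with the query are ever scored, so the cost depends on the length of the posting lists involved rather than on the full vocabulary size.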
