Good article.
I prefer my code:
def search(query):
    # term_frequency, index and for_each_par_map are assumed to be
    # defined elsewhere.
    # assuming query is a list of tokens / terms
    terms = [(term, term_frequency.get(term, 0)) for term in query]
    terms.sort(key=lambda x: x[1])
    # The seed is the least frequent term of the query in the corpus.
    # It is associated with the smallest set of documents that can
    # match the full query.
    seed = terms[0]
    if seed[1] == 0:
        # if the seed appears zero times, the query yields zero hits
        return []
    # assuming index is a dictionary mapping a term to its documents,
    # aka. inverted index or postings
    candidates = index.get(seed[0], [])

    def rank(document):
        return (sum(document.term_frequency(token) for token in query), document)

    hits = []

    def sort(hit):
        # hits is reassigned here, hence nonlocal (not global)
        nonlocal hits
        hits.append(hit)
        # sort by rank
        hits = sorted(hits, key=lambda x: x[0], reverse=True)

    for_each_par_map(sort, rank, candidates)  # a possible helper is sketched below
    return hits
That doesn’t mean that we can’t think about fun expansions on this basic functionality though;
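The for_each_par_map helper isn't shown in the comment. Here is a minimal sketch of what it might look like, assuming it applies rank to every candidate in parallel and hands each result to sort as it completes; this is a guess at the intended semantics, not the author's actual helper:

from concurrent.futures import ThreadPoolExecutor, as_completed

def for_each_par_map(each, map_func, items):
    # Run map_func over the items in a thread pool and call each() on
    # the results as they complete. each() runs in the calling thread,
    # so the shared hits list in search() is never mutated concurrently.
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(map_func, item) for item in items]
        for future in as_completed(futures):
            each(future.result())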
I thought this was gonna be like one of those “write a neural network in 50 lines of Python” where they import a dozen libraries first, but I was pleasantly surprised. A very clear and concise article.
The author of a Rust text search library called Tantivy has some posts that go into more advanced details: https://fulmicoton.com/posts/behold-tantivy-part2/
Very well written post that explains the concepts really well.
It reminded me of this article: https://artem.krylysov.com/blog/2020/07/28/lets-build-a-full-text-search-engine/
Fun and useful. What database do you plan to use?
NB: There is an Abstract data class; where is the concrete one?
Nice! The class is called Abstract because that’s what I’m indexing (i.e. Wikipedia abstracts); it’s not an actual abstract class :-)

FWIW, I also tried the intersection of ordered sets that are lazily loaded; it does not scale either.
My current algorithm: At index time, I store transliterated words from the documents in the inverted index, to support many languages. I also store associations between words and their stems, lemmas, or synonyms. At query time, I expand the query according to the stem, lemma, and synonym mappings, then pick the seed in a similar way as described above; I lazily fetch candidates and rank them in a map-reduce (for-each-par-map), keeping only the top N results (sketched below).
Anyway, check out http://peacesear.ch/
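A rough sketch of the query-time side described above. The data structures (inverted_index, term_frequency, expansions) and the document API are hypothetical stand-ins, not the actual peacesear.ch code:

import heapq

def search(query, inverted_index, term_frequency, expansions, top_n=10):
    # Expand the query terms with their stem / lemma / synonym mappings.
    terms = set(query)
    for term in query:
        terms.update(expansions.get(term, ()))
    # Pick the rarest expanded term that occurs in the corpus as the seed.
    seen = [t for t in terms if term_frequency.get(t, 0) > 0]
    if not seen:
        return []
    seed = min(seen, key=term_frequency.get)
    # Rank the seed's documents and keep only the top N; the real thing
    # fetches candidates lazily and ranks them with for-each-par-map.
    def rank(document):
        return sum(document.term_frequency(t) for t in terms)
    return heapq.nlargest(top_n, inverted_index.get(seed, ()), key=rank)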
Really like this breakdown. Usually when I encounter this need, I just use sqlite3; FTS5 generally just works. You could probably make a full-text search engine in less Python by punting everything to that.
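For reference, the FTS5 route is indeed short. A minimal sketch, assuming your Python's SQLite build ships the FTS5 extension (most do):

import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table: tokenization, the inverted index and ranking are
# all handled by SQLite itself.
con.execute("CREATE VIRTUAL TABLE documents USING fts5(title, body)")
con.executemany(
    "INSERT INTO documents (title, body) VALUES (?, ?)",
    [
        ("python search", "build a full-text search engine in python"),
        ("sqlite notes", "fts5 generally just works"),
    ],
)
# MATCH queries the index; bm25() scores are lower for better matches.
for (title,) in con.execute(
    "SELECT title FROM documents WHERE documents MATCH ? ORDER BY bm25(documents)",
    ("search engine",),
):
    print(title)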