1. 31
  1. 4

    I thought this was gonna be like one of those “write a neural network in 50 lines of Python” where they import a dozen libraries first, but I was happily surprised. A very clear and concise article.

    The author of a Rust text search library called Tantivy has some posts that go more into more advanced details: https://fulmicoton.com/posts/behold-tantivy-part2/

    1. 3

      Very well written post and explains the concepts really well

        1. 1

          Good article.

          I prefer my code:

          def search(query):
              # assuming query is a list of token / terms
              terms = [(term, term_frequency.get(term, 0)) for term in query]
              terms.sort(key=lambda x: x[1])
              # The seed is least frequent term in the corpus part of the query.
              # It is associated to the smallest set of document that match the
              # full query.
              seed = terms[1]
              if seed[0] == 0:
                  # if the seed appears zero time, it means the query yields
                  # zero hits.
                  return []
              # assuming index is dictionary mapping term to document
              # aka. inverted index or posting.
              candidates = index.get(seed)
              def rank(document):
                  return (sum(document.term_frequency(token) for token in query), document)
              hits = []
              def sort(hit):
                  global hits
                  # sort with rank
                  hits = sorted(hits, key=lambda x: x[0], reverse=True)
              for_each_par_map(sort, rank, candidates)
              return hits

          That doesn’t mean that we can’t think about fun expansions on this basic functionality though;

          Fun and useful. What database do you plan to use?

          NB: There is Abstract data class, where is the concrete one?

          1. 1

            Nice! The class is called Abstract because that’s what I’m indexing (ie Wikipedia abstracts), it’s not an actual abstract class :-)

            1. 1

              FWIW, I also tried the intersection of ordered sets that are lazily loaded, it does not scale either.

              My current algorithm: At index time: store transliterated words from documents to support many languages in the inverted index. I also store association between words and their stem or lemme or synonyms. At query time: I expand the query according to stem, lemme and synonyms mappings, then I pick the seed in a similar way as described above, I lazily fetch candidates and rank them in a map-reduce (for-each-par-map). I keep only top N results.

              Anway, check out http://peacesear.ch/

          2. 1

            Really like this breakdown. usually when I encounter this need, I just use sqlite3. FTS5 generally just works. could probably make a full-text search engine in less Python punting everything to that.