1. 15

    This post uses GIN indexes; one of their authors (Oleg Bartunov) strongly suggested using RUM indexes (https://github.com/postgrespro/rum) instead when I spoke with him at PGCon in Ottawa (2019).
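
    A minimal sketch of what that could look like - assuming the rum extension is installed, and using a hypothetical posts table with a precomputed tsvector column body_tsv:

        CREATE EXTENSION rum;
        CREATE INDEX posts_body_rum_idx ON posts USING rum (body_tsv rum_tsvector_ops);

        -- RUM stores positional information, so unlike GIN it can also
        -- order results by its distance operator:
        SELECT id
        FROM posts
        WHERE body_tsv @@ to_tsquery('english', 'foo')
        ORDER BY body_tsv <=> to_tsquery('english', 'foo')
        LIMIT 10;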

    Other things that would be nice:

    • show us your PostgreSQL config
    • show us your query plans
    • use the tsvector column type instead of functional indexes
    • avoid using OR in the query by using ANY - this would likely improve query plans a lot (if we were able to see them); see the sketch after this list
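
    As a rough sketch of the OR-to-ANY rewrite (table and column names are hypothetical):

        -- OR between two full-text conditions:
        SELECT id FROM posts
        WHERE body_tsv @@ to_tsquery('english', 'foo')
           OR body_tsv @@ to_tsquery('english', 'bar');

        -- the same condition expressed with ANY over an array of tsqueries:
        SELECT id FROM posts
        WHERE body_tsv @@ ANY (ARRAY[
            to_tsquery('english', 'foo'),
            to_tsquery('english', 'bar')
        ]);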

    For a moment I thought there was enough information to actually reproduce the test, but I think there is a fundamental flaw in this whole experiment: the datasets are not the same for both systems!

          JSON.stringify({
            key: `${faker.lorem.word()} ${faker.lorem.word()}`,
            val: faker.lorem.words(),
            valInt: Math.floor(faker.random.float()),
            valDate: faker.date.past()
          })
    

    Please correct me if I am wrong, as I don’t know the ‘faker’ library used here, but I would assume that it’s building a randomised lorem ipsum document. Depending on the actual distribution of words, this would greatly impact how the index is built and would affect the execution times of queries run against it. This means that one system could have gotten a much worse execution path with a document biased differently than the other system.

    Regardless of word distribution - is Elasticsearch really working on the same initial data, and are its documents split into equivalent logical chunks? PostgreSQL is searching over 1.5M entries; is the data loaded into Elasticsearch also split into 1.5M documents?

    Am I really not understanding something fundamental? I only see one iteration (Iterations = 1) of the faker generator for the Elasticsearch dataset:

    const faker = require("faker")
    const { writeFileSync } = require("fs")
    const Iterations = 1
    writeFileSync(
      "./dataset.ndjson",
      Array.from(Array(Iterations))
        .map(() =>
          JSON.stringify({
            key: `${faker.lorem.word()} ${faker.lorem.word()}`,
            val: faker.lorem.words(),
            valInt: Math.floor(faker.random.float()),
            valDate: faker.date.past()
          })
        )
        .join("\n")
    )
    
    1. 4

      Lorem ipsum text also makes it impossible to evaluate search result quality (unlike, say, the Wikipedia dataset).

      1. 1

        What is the Wikipedia dataset? URL?

        1. 1
          1. 1

            Wikipedia offers full database backups: https://en.m.wikipedia.org/wiki/Wikipedia:Database_download

      2. 12

        TL;DR: the guy says that over 1.5 million records, PostgreSQL will return results in up to 120ms, while Elasticsearch gets it done in 20ms. So it’s worth maintaining an Elasticsearch cluster next to your Postgres.

        I think that for some critical search, it might make sense. But for most of us, first we need to get to 1.5 million records for this to be relevant. And even then - meh, it’s likely that our search is just fine with 120ms.

        1. 3

          I’d also like to see this on Linux; I’ve had weird issues with PostgreSQL on a Mac that I’ve never seen on Linux. Mind you, if this is getting deployed on a Mac server, this quick comparison would be enough for me.

          1. 1

            Yep, cluster sizes, everything. I mean, if you had 1.5M records that you need FTS on, you probably don’t run this on a desktop machine. That’s what I’m thinking - this probably doesn’t matter so much for the majority of us - the overhead of another technology for a few milliseconds.

          2. 2

            Not only that - full-text search over 1.5 million documents, but with no other filtering? I’m sure something needs that, but I have never worked on a system matching that need.

          3. 8

            An unfair comparison, because he is building his tsvector at query time.

            For speed, you have to use pre-filled tsvector columns (populated by a script or a DB trigger); see the sketch below.
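
            A minimal sketch of the trigger variant, assuming a hypothetical posts table with a text column body; tsvector_update_trigger is a built-in helper function:

                ALTER TABLE posts ADD COLUMN body_tsv tsvector;
                UPDATE posts SET body_tsv = to_tsvector('english', coalesce(body, ''));
                CREATE INDEX posts_body_tsv_idx ON posts USING gin (body_tsv);

                -- keep the column up to date on writes
                CREATE TRIGGER posts_tsv_update
                  BEFORE INSERT OR UPDATE ON posts
                  FOR EACH ROW
                  EXECUTE FUNCTION tsvector_update_trigger(body_tsv, 'pg_catalog.english', body);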

            1. [Comment removed by author]

              1. 2

                calculated columns

                I didn’t know about them. Thank you.

                1. 2

                  You’re welcome - sorry I ninja-deleted my comment - I thought it didn’t really add enough to the discussion - but am glad you learned about pg12 generated (“calculated”) columns :)

                  https://www.postgresql.org/docs/12/ddl-generated-columns.html
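
                  A minimal sketch with a generated column (PostgreSQL 12+), again with hypothetical table and column names:

                      ALTER TABLE posts
                        ADD COLUMN body_tsv tsvector
                        GENERATED ALWAYS AS (to_tsvector('english', coalesce(body, ''))) STORED;

                      CREATE INDEX posts_tsv_gen_idx ON posts USING gin (body_tsv);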

                  I also came across this example of caching ts_vector via triggers:

                  https://thoughtbot.com/blog/optimizing-full-text-search-with-postgres-tsvector-columns-and-triggers

                  A similar example using generated columns (both are for Ruby on Rails, but both feature examples of the SQL queries involved):

                  https://pganalyze.com/blog/full-text-search-ruby-rails-postgres

                  An interesting tidbit from the last article, regarding speed:

                  We have a problem! The query that is produced by Job.search_job(“Ruby on Rails”) takes an astounding 130ms. That may not seem like such a large number, but it is astounding because there are only 145 records in my database. Imagine if there were thousands! The majority of time is spent in the to_tsvector function. (…)

                  With this optimization done, we have gone from around 130ms to 7ms per query.

            2. 3

              It would be interesting to see a similar test, but with pg_trgm included on the Postgres side.

              1. 1

                What does that do?

                1. 2

                  Creates a trigram index, which helps with searches for fixed strings and some regular expressions; a sketch follows below.
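
                  A minimal sketch, assuming a hypothetical posts table with a text column body:

                      CREATE EXTENSION IF NOT EXISTS pg_trgm;
                      CREATE INDEX posts_body_trgm_idx ON posts USING gin (body gin_trgm_ops);

                      -- substring and some regular-expression searches can now use the index:
                      SELECT id FROM posts WHERE body ILIKE '%elastic%';
                      SELECT id FROM posts WHERE body ~ 'elastic(search)?';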

              2. 3

                This article appears to be a simple performance comparison and nothing more. Elasticsearch does a fundamentally different thing than PostgreSQL: Elasticsearch is an IR system, while PostgreSQL is a relational database with some barebones full-text features. AIUI, PostgreSQL does not use any kind of tf-idf scheme, because it doesn’t track corpus frequencies of terms. This will fundamentally change the quality of search results.
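
                As a quick illustration of that point, PostgreSQL’s built-in ts_rank scores on statistics local to each document (lexeme frequency, proximity); there is no corpus-wide idf component anywhere (body_tsv below is a hypothetical precomputed tsvector column):

                    -- ranking with document-local statistics only, no idf
                    SELECT id, ts_rank(body_tsv, query) AS rank
                    FROM posts, to_tsquery('english', 'postgres & search') AS query
                    WHERE body_tsv @@ query
                    ORDER BY rank DESC
                    LIMIT 10;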

                Maybe PostgreSQL’s full-text features are enough for you. But the only way to know is to do some kind of evaluation.