1. 29

tl;dr: you can do jobs, queues, real-time change feeds, time series, object storage, document storage, and full-text search with PostgreSQL. How-tos, pros/cons, rough performance, and complexity levels are all discussed, with links to many sources and relevant documentation.

  1. 17

    As a postgresql DBA, I agree. :)

    That said, if you do find yourself considering using pg as (for example) a queue, object store, log storage, and graph database at the same time, I have some advice: make them separate databases. Doing so forces development to consider them separately, and keep a cleaner separation of concerns in the codebase. If you end up in a situation where, say, log storage ends up being the biggest disk consumer, but you need high-performance disk for your graph database, you’ll be glad you can easily pop the log database out and put it on separate hardware without any code changes.
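    For illustration, a minimal sketch of what that separation might look like (database names are hypothetical):

      -- one physical cluster, but logically isolated databases
      CREATE DATABASE app_queue;
      CREATE DATABASE app_objects;
      CREATE DATABASE app_logs;
      CREATE DATABASE app_graph;

    Each concern then gets its own connection string (e.g. postgres://db-host/app_logs), so relocating one database to dedicated hardware later is a connection-string change, not a code change.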

    It also makes things easier should you end up wanting to move away from queues in pg to RabbitMQ, or graphs in pg to Neo4j (or whatever’s hot in a few years in those spaces).

    On the flip side, if they’re all part of the same database, inevitably someone will join tables which should be separated, and then your migration path becomes much, much harder.

    I’m speaking from experience here, having dealt with a monolithic database that held log data, reporting data, session data, and object data, all in one big mess of a schema. It took years to move sessions out, and it looks like the solution for separating the rest is going to be a rewrite.

    In short: use postgresql for all of the things; just use different, separate databases for each of the things.

    1. 3

      I’m not a PostgreSQL DBA, but I pretend to be one at work. I can vouch for this answer 100%.

      1. 2

        This advice doesn’t just apply to Postgres: it applies any time you want a backend store. Scalability problems are often down to over-reliance on some single part of your system, which ends up being used in more places than it should.

      2. 13

        A small niggle: there is a pretty big difference between the fulltext search found in PostgreSQL and the fulltext search found in a real IR engine like Lucene (which powers Elasticsearch and Solr). In particular, Lucene will rank results using corpus-wide term statistics, via a weighting scheme called tf-idf. PostgreSQL won’t do that for you (last time I checked). Taking corpus frequencies into account is important because it lets your ranking function weight a term by how informative it is across the entire corpus: rare terms count for more than common ones.
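        To make the difference concrete, here’s roughly what ranking looks like on the PostgreSQL side (a sketch; the docs table is hypothetical). ts_rank scores each document using only that document’s own term frequencies and proximity, with no corpus-wide document-frequency component:

          -- hypothetical table: docs(id serial, body text)
          SELECT id,
                 ts_rank(to_tsvector('english', body),
                         to_tsquery('english', 'postgres & queue')) AS rank
          FROM docs
          WHERE to_tsvector('english', body) @@ to_tsquery('english', 'postgres & queue')
          ORDER BY rank DESC;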

        To be fair, I don’t think I’ve ever seen anyone directly compare PostgreSQL’s ranking with Lucene’s (or any other search engine), so I don’t know where the “it works well enough for X% of use cases” line falls.

        This is otherwise a nice resource. Thanks!

        1. 3

          Appreciate you calling this out. I maintain an Elasticsearch client, and while I’m an avid PostgreSQL user, I’ve run into some people who are very stubborn about using PG’s fulltext search instead of a real search engine.

          It’s too bad ES plays hot potato with your data.

          1. 1

            I suspect this can be treated like a scaling constraint? Like, pg’s full text search will often suffice while your corpus isn’t large, so most searches return few results and you aren’t overly bothered by the fact that the ordering isn’t as good as you’d like.

            It is definitely not as good as ES/Lucene, but pgsql full text search is IME a huge step up if you move to it from completely naïve byte-by-byte substring testing, which I’ve also seen in production plenty of times!
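            For comparison, a sketch of both approaches (table and column names hypothetical):

              -- naïve substring testing: no stemming, no ranking,
              -- and (without pg_trgm) a full scan every time
              SELECT id FROM docs WHERE body ILIKE '%queue%';

              -- pg full text search: stemmed matching ('queues' matches 'queue'),
              -- plus a GIN index to keep lookups fast
              CREATE INDEX docs_fts_idx ON docs USING gin (to_tsvector('english', body));
              SELECT id FROM docs
              WHERE to_tsvector('english', body) @@ plainto_tsquery('english', 'queues');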

            1. 1

              > I suspect this can be treated like a scaling constraint? Like, pg’s full text search will often suffice while your corpus isn’t large, so most searches return few results and you aren’t overly bothered by the fact that the ordering isn’t as good as you’d like.

              People do not expect boolean search anymore. So, if the query contains some non-discriminative term, you will retrieve a large number of non-relevant documents. Aggressive stopword filtering will help you somewhat, but not much.

              Note that TF-IDF is fairly trivial to compute, assuming that you have a term -> (document, freq) mapping (a typical postings list). Though, depending on how your data is structured, computing the denominator in query-document cosine similarity can be more difficult (more specifically, you need to compute the L2 norm of the complete document vector).

              If documents have virtually the same length and there are no odd repetitions, you might get away with the dot product of the query and document vector as a similarity measure. If each word occurs once in the query, then you simply sum the document’s TF-IDFs for the query terms.
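              A sketch of that computation in SQL, assuming a hypothetical postings(term, doc_id, freq) table and the dot-product shortcut just described:

                -- document frequency per term, and the total document count
                WITH df AS (
                  SELECT term, count(DISTINCT doc_id) AS df
                  FROM postings GROUP BY term
                ), n AS (
                  SELECT count(DISTINCT doc_id) AS total FROM postings
                )
                -- score = sum of tf-idf over the query terms
                -- (a dot product with a binary query vector)
                SELECT p.doc_id,
                       sum(p.freq * ln(n.total::float / df.df)) AS score
                FROM postings p
                JOIN df USING (term)
                CROSS JOIN n
                WHERE p.term IN ('postgres', 'search')   -- the query terms
                GROUP BY p.doc_id
                ORDER BY score DESC;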

              1. 2

                People are used to shit search, though. They’ll try variations on their queries and take words out if the entire corpus comes back. Again, this is a replacement for substring matching (which has exactly the same problem), not for Lucene!

          2. 2

            The RUM index extension plans to address TF/IDF in the near future.
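            In the meantime, RUM already does index-supported ranking at query time. A rough sketch based on the extension’s examples, assuming a hypothetical docs table with a precomputed tsv tsvector column (check the RUM docs for the current API):

              CREATE EXTENSION rum;
              CREATE INDEX docs_rum_idx ON docs USING rum (tsv rum_tsvector_ops);

              SELECT id FROM docs
              WHERE tsv @@ to_tsquery('english', 'postgres & search')
              ORDER BY tsv <=> to_tsquery('english', 'postgres & search')  -- distance ranking, served from the index
              LIMIT 10;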