I liked this article, but it’s flawed.
They’re insulated from most of the MySQL problems because they’re not actually using MySQL directly in their code; they’re using an abstraction layer which can enforce acceptable behaviour when using the DB. Their complaints about connection pooling, etc., would be handled if they had the same abstraction for PostgreSQL. This also applies to their complaints about people with long-lived queries on replicas:
While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions
They’re not letting those same engineers at MySQL itself, they’re forcing them through an abstraction layer which cleans things up. So while it’s certainly true that there are trade-offs, the arguments used to dismiss PostgreSQL as unfit could be similarly used for MySQL. Mostly, they’re at “we’re moving to a new abstraction layer, it makes things better and we don’t let most non-SQL-expert engineers directly at the RDBMS” and getting all the Win from that.
There are a few good points about problems for Pg to address, though, so it’s still a good read, just … caveat lector.
I wonder if their connection pool problem isn’t a consequence of micro services. The last trip service needs a connection, the favorite trip service needs a connection, etc. Is the solution to have each service connect to a different database? In theory, they’re all isolated, right?
Given the suggested practices, I would guess that is not the case. The number of database connections should be a function of the number of service nodes that wrap up that database rather than the number of services that need to use that data.
Is the solution to have each service connect to a different database?
A while back I listened to a talk given by Monsanto at AWS re:Invent 2015, and that’s pretty much the pattern they went with - each microservice is completely standalone, with its own (non-shared) persistence. They have hundreds (if not thousands) of microservices and this approach apparently works well for them. I guess it really enforces separation of concerns…
That’s the way to go, really; from a scaling standpoint you don’t really want contention on your database resource that is shared by other services. Potentially that would lead to all services suffering because one of them is under load. I’ve rarely seen this in practice though.
It is in practice an issue with most web stacks. Running ~1000 Postgres connections is a lot, and it doesn’t really gain you much as far as parallelism or concurrency control go, since having that many active queries scanning the disk, or that many locks active, is fairly IO- or CPU-intensive.
However, the way most web stacks work is that they create and manage a small pool of connections on each node, say 10-20. Imagine you have 80 web nodes – there are 800 to 1600 connections right there. From the reasoning given earlier, we cannot expect much of a speed-up relative to a few dozen connections – it ultimately comes down to how much activity we can get out of 8 CPUs or 2 disk heads – but it turns out that it is advantageous for the web stack implementer to use a pool instead of trying to use one connection in a very optimal way. For example, maybe each and every request gets to check out and hold a DB connection for its entire duration, just in case it needs it. This is a safe thing to do; implementing correct and optimal use of a shared resource like a database connection is difficult.
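The per-node pool pattern described above can be sketched in a few lines. This is a toy model (the `FakeConnection` class stands in for a real driver connection, e.g. psycopg2’s), just to show the checkout/checkin mechanics and the multiplication that produces those server-side connection counts:

```python
import queue

class FakeConnection:
    """Hypothetical stand-in for a real DB driver connection."""
    def execute(self, sql):
        return f"ran: {sql}"

class ConnectionPool:
    """Minimal checkout/checkin pool, like most web stacks keep per node."""
    def __init__(self, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(FakeConnection())

    def checkout(self, timeout=5):
        # Blocks if every connection is held -- the safe-but-wasteful
        # "hold one for the whole request" pattern described above.
        return self._pool.get(timeout=timeout)

    def checkin(self, conn):
        self._pool.put(conn)

# 80 nodes x a pool of 10-20 each is how you reach 800-1600
# server-side connections without any single node misbehaving.
pool = ConnectionPool(size=10)
conn = pool.checkout()
result = conn.execute("SELECT 1")
pool.checkin(conn)
```

The point is that each node reasons only about its own small pool; the database sees the sum across all nodes.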
I haven’t used MySQL in a long time, but it was so bad when I did that I have a hard time believing it’s better than Postgres at anything.
In this case I have to wonder how many MySQL problems they’re hiding in their abstraction layer? Could they have hidden the Postgres problems just as easily, and still got the benefits of Postgres?
And “Schemaless” is a terrible project name.
The pg-hackers mailing list has a pretty honest self-assessment of this article: https://www.postgresql.org/message-id/579795DF.10502%40commandprompt.com
The responses in the mailing list (if you’re willing to click through a lot of them) are pretty good.
Not only are they honest about the shortcomings of PostgreSQL, they even start digging into how to fix them. Open Source for the win.
I enjoyed this article and learned some stuff I didn’t know from it. I think the connection handling dig is a bit unfair, pgbouncer is a really nice piece of software and I’ve run thousands of clients against it with no issues. The physical vs. logical replication thing is definitely a pain point though. The article also doesn’t really address some of the downsides of MySQL’s choices, e.g. I’ve heard some wacky buffer pool locking horror stories.
This was an interesting trip through some of the database internals. I’m particularly surprised by Postgres spawning a process per connection, which I would not have expected.
One thing I’d like to note is that I consider large numbers of indexes on a single table to be a smell. An indicator of a potential God Object, or an architectural problem with a better solution (e.g. elasticsearch).
Has anybody ever tuned a MySQL InnoDB LRU cache strategy? Seems like a neat capability (though I’m not sure I would ever need it).
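For anyone curious, InnoDB’s buffer pool LRU uses midpoint insertion: newly read pages land in an “old” sublist and are only promoted if touched again, which protects the cache from big scans. The two server variables below are the real tuning knobs; the values shown are purely illustrative, not recommendations:

```
-- Shrink the "old" sublist from its default 37% of the buffer pool,
-- and require pages to survive 1000 ms before promotion to "new":
SET GLOBAL innodb_old_blocks_pct = 25;
SET GLOBAL innodb_old_blocks_time = 1000;
```

In practice you’d only touch these if scan-heavy workloads were evicting your hot pages.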
[Comment removed by author]
I love postgres, but I do not love its huge per-connection resource consumption. Almost all the pg outages I’ve seen have been caused/worsened by the number of open connections spiking.
There are two practical reasons that Postgres uses one process per connection:
It makes the server quite resilient to user code – even crashy extensions (shared objects dynamically loaded into the Postgres process) will affect only the connection process, not the master process.
It allows the OS to schedule different connections on different cores, making Postgres scale on multi-CPU machines “for free”.
Relevant documentation: https://www.postgresql.org/docs/9.5/static/connect-estab.html
The criticism with regard to Postgres’s MVCC model is legitimate, in terms of the costs of write amplification. However, it does seem a little disingenuous not to go into the disadvantages of intermediating all other indexes through the primary key:
Slower queries (at least potentially).
Must have a primary key.
Compound primary keys…what then?
Effect of updates to primary keys.
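A toy model makes the first and last of those points concrete. In InnoDB-style clustered storage, rows live in the primary-key index and every secondary index stores the primary key rather than a row location (the dicts below are stand-ins, not real storage engine code):

```python
# Toy model of clustered storage: rows live in the primary-key index,
# and the secondary index stores primary keys, not row addresses.
clustered = {}      # primary key -> full row
idx_by_email = {}   # secondary index: email -> primary key

def insert(pk, email, name):
    clustered[pk] = {"email": email, "name": name}
    idx_by_email[email] = pk

def lookup_by_email(email):
    # Two steps: the secondary index yields the PK, then the clustered
    # index yields the row -- the potential slowdown in the first point.
    pk = idx_by_email[email]
    return clustered[pk]

def change_primary_key(old_pk, new_pk):
    # The last point: updating a PK means rewriting the clustered entry
    # AND every secondary index entry that embeds the old key.
    row = clustered.pop(old_pk)
    clustered[new_pk] = row
    for email, pk in idx_by_email.items():
        if pk == old_pk:
            idx_by_email[email] = new_pk

insert(1, "ada@example.com", "Ada")
row = lookup_by_email("ada@example.com")
change_primary_key(1, 42)
```

The flip side, which the article does cover, is that secondary indexes never need touching when a row merely moves on disk.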