I hear the phrases “the database is always the bottleneck” and “don’t put logic in the DB because it doesn’t scale” a lot. Of course it doesn’t scale if, instead of sending one complex query which the DB can inspect, be smart about, optimise, and rewrite, people send 100+ trivial queries per request and then do the joining on the other side. I am not sure why they think their code can do better than a query planner.
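A minimal sketch of that contrast, using sqlite3 and a made-up users/orders schema: N+1 round trips with an app-side join versus one query the planner can optimise as a whole.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 9.5), (2, 1, 3.0), (3, 2, 7.25);
""")

# Anti-pattern: one query per user, joining and summing in application code.
def app_side_join(conn):
    result = []
    for uid, name in conn.execute("SELECT id, name FROM users"):
        totals = [t for (t,) in conn.execute(
            "SELECT total FROM orders WHERE user_id = ?", (uid,))]
        result.append((name, sum(totals)))
    return result

# One query the database can plan, rewrite, and optimise as a whole.
def db_side_join(conn):
    return list(conn.execute("""
        SELECT u.name, SUM(o.total)
        FROM users u JOIN orders o ON o.user_id = u.id
        GROUP BY u.id ORDER BY u.id
    """))

assert app_side_join(conn) == db_side_join(conn)
```

Same answer both ways, but the first costs one round trip per user and hides the join from the planner.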
I used to do that in the past, when the ratio of DB servers to cheap client machines was something like 1:10 or 1:15. The DBs were expensive (first Informix, then Oracle), so it made sense to do more processing on the client side. This wasn’t a web app or anything like that; it was data processing, and we basically had to touch every byte in the DB at least once. In those cases it made perfect sense to offload the processing elsewhere.
Sure, there are different cases, and what you are saying makes sense; I was just talking about your basic CRUD app / WordPress kind of thing, which is the vast majority.
But even in your case, from what you say, the reason was that the DB was expensive and you could not afford read replicas; the reason was not that it is faster to do it in the client, or that the code is simpler than a long query.
I haven’t seen app-joining in a while, but working with Django, people sure could use .values() more, listing only the columns they need when doing read-only queries.
And naturally naming the columns whenever possible.
It’s just too easy to get things up and running fast with ORMs and then suffer from retrofitting all that into the code.
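A hedged illustration of that point: fetching only named columns (roughly what Django’s .values('id', 'title') compiles to) versus pulling every column. Toy sqlite3 schema; the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute("INSERT INTO article VALUES (1, 'hello', ?)", ("x" * 100000,))

# SELECT * drags the large body column over the wire on every read...
wide = conn.execute("SELECT * FROM article").fetchone()

# ...while naming the columns keeps the row small. In Django, roughly:
#   Article.objects.values('id', 'title')   instead of   Article.objects.all()
narrow = conn.execute("SELECT id, title FROM article").fetchone()

assert len(wide[2]) == 100000 and narrow == (1, "hello")
```

The ORM makes the wide version the path of least resistance, which is exactly the retrofitting pain described above.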
GraphQL seems like an API that encourages the right kind of thinking from developers – and translating GraphQL->SQL is straightforward.
While the GraphQL spec does encourage the elimination of SELECT *, the reference implementation makes things worse than ORMs. It’s not the implementation’s fault, since it’s a general-purpose one, but people took it as the way to do everything. They bolt on a few simple resolvers, which of course generate the classic N+1 queries; then, when a newcomer asks how to fix this, everyone just says “use dataloader”. It never crosses their mind that they might need a custom execution module. So the best-case scenario you end up with is a few sequential queries which could have been a single query with joins.
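A minimal sketch of that failure mode, with no real GraphQL library involved: naive per-field “resolvers” issue one query per parent row, while a dataloader-style batch still costs two queries where a single JOIN would cost one. The posts/comments schema is invented.

```python
import sqlite3

class CountingConn:
    """Wrap a sqlite3 connection and count the queries issued through it."""
    def __init__(self, conn):
        self.conn, self.queries = conn, 0
    def execute(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params)

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO comments VALUES (1, 1, 'x'), (2, 2, 'y'), (3, 3, 'z');
""")

def naive(conn):
    # 1 query for the posts + N more, one per post: the N+1 pattern.
    for pid, _ in conn.execute("SELECT id, title FROM posts").fetchall():
        conn.execute("SELECT body FROM comments WHERE post_id = ?", (pid,)).fetchall()

def batched(conn):
    # Dataloader-style batching: 2 sequential queries total,
    # which a JOIN-compiling execution module would collapse into 1.
    pids = [pid for pid, _ in conn.execute("SELECT id, title FROM posts").fetchall()]
    marks = ",".join("?" * len(pids))
    conn.execute(f"SELECT post_id, body FROM comments WHERE post_id IN ({marks})",
                 pids).fetchall()

c1 = CountingConn(db); naive(c1)
c2 = CountingConn(db); batched(c2)
assert c1.queries == 4 and c2.queries == 2
```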
GraphQL->SQL: you are probably thinking of PostgreSQL when you say this (I know your work), but can you do it in MySQL, where the JSON support is not that good?
I see what you are saying about the N+1 queries. A good implementation would have to compile to JOINs.
As regards JSON and MySQL – I don’t think it makes too much of a difference whether there is JSON support. The project I did a while back, GraphpostgresQL, did all the steps of GraphQL processing in the database, using stored procedures, so JSON processing definitely mattered for formatting results; but a production GraphQL->SQL implementation should definitely involve a proxy server, and much of the work can be done there.
The reason I say JSON matters is that without it, if you have to execute, say, a 3-level query (joining 3 tables), you will be transferring a lot of redundant data over the wire. This is the type of query this tool generates https://github.com/stems/join-monster but the more levels, the worse the result, since there is a lot of duplication. To be efficient, the database has to format the result as JSON (the only way a database can return a tree-like response). This is the way it works in PostgREST, and it works well; its interface is powerful enough to build GraphQL on top of it (which I did :))
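A small illustration of the duplication being described, on toy sqlite3 data: a flat 2-level join repeats every parent column once per child row, while a tree shape (what json_agg/json_build_object produce in PostgreSQL) sends each parent value once. Here the tree is rebuilt in Python only to show the shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, bio TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'a long biography...');
    INSERT INTO books VALUES (1, 1, 't1'), (2, 1, 't2'), (3, 1, 't3');
""")

# Flat join: the bio column travels over the wire once per book row.
flat = conn.execute("""
    SELECT a.bio, b.title FROM authors a JOIN books b ON b.author_id = a.id
""").fetchall()
assert sum(1 for bio, _ in flat if bio) == 3  # bio duplicated 3 times

# Tree shape (roughly what PostgreSQL gives you with
#   SELECT a.bio, json_agg(b.title) FROM ... GROUP BY a.id):
# each bio appears exactly once, with its titles nested under it.
tree = {}
for bio, title in flat:
    tree.setdefault(bio, []).append(title)
assert list(tree.values()) == [['t1', 't2', 't3']]
```

With more levels and wider parent rows, the flat version’s duplication grows multiplicatively, which is the problem JSON aggregation in the database avoids.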
Without JSON aggregations it’s hard to pull everything up into a tree, even if there is basic JSON support, that’s true. I am not sure what MySQL’s level of support is, beyond the basics.
At my work, we have one data store that contains arbitrary key/values and we query against it a lot. People were complaining that the queries were taking on the order of 15-30 minutes, so I tried to make something faster. I spoke with the DBA and we concluded that serialization and transport were a huge bottleneck. People were querying for hundreds of thousands of keys, each representing between 5 and 500 KB of data, serializing it, transmitting it, then querying for that same data again with a slightly increased timestamp. Of course you’re getting terrible perf if this is the way you’re doing things: no caching on the client side and no pruning of unnecessary data. I got 100x perf gains just from fixing those two alone.
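A hedged sketch of those two fixes, with a made-up store and schema: keep a client-side cache keyed on the row key, and only ask for rows changed since the last timestamp you saw, instead of re-serialising and re-transmitting everything each time.

```python
import sqlite3

# Stand-in for the key/value store; "updated" is a change timestamp.
store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT, updated INTEGER)")
store.executemany("INSERT INTO kv VALUES (?, ?, ?)",
                  [("k1", "v1", 10), ("k2", "v2", 20), ("k3", "v3", 30)])

cache, last_seen = {}, 0

def refresh():
    """Pull only rows updated since the last sync into the client cache."""
    global last_seen
    rows = store.execute(
        "SELECT key, value, updated FROM kv WHERE updated > ?", (last_seen,)
    ).fetchall()
    for key, value, updated in rows:
        cache[key] = value
        last_seen = max(last_seen, updated)
    return len(rows)  # how many rows crossed the wire

assert refresh() == 3          # first sync transfers everything
store.execute("UPDATE kv SET value = 'v2b', updated = 40 WHERE key = 'k2'")
assert refresh() == 1          # later syncs transfer only the delta
assert cache["k2"] == "v2b"
```

After the first sync, each poll moves only the changed rows rather than hundreds of thousands of 5-500 KB values, which is where gains of that order come from.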
Not only is SELECT * terrible, but it gets worse when you forget it’s there and start adding columns to the table.