Hm interesting … I knew sqlite had bytecode, but I didn’t know most databases including MySQL evaluate a tree.
Off hand I’d guess the reasons for this are
most databases only execute user-defined expressions, not so much loops and functions (although I know there are lots of weird imperative languages embedded in various SQL dialects)
those expressions are generally pretty small – 50-100 lines is big, not 1,000 or 10,000 or 100,000 lines like Python or JS
the query planning is where the bigger wins in terms of speed come in?
since it affects what you actually read off disk, and what you join, and those things are orders of magnitude slower
I’m also fairly surprised. While your points 1 and 2 seem like they make sense because each expression would only be evaluated once the corollary is that most of these are executed per-row, so in many cases you can achieve a performance win quite quickly. Basically there is an implicit loop.
Although I know that MySQL is well known for how quickly it can execute small queries such as single row reads and writes, so maybe that is a common use case.
Which basically leaves the 3rd point I guess :) I wish open source databases would publish benchmarks with every release so we could find out the details, e.g. what’s actually hot in most queries
query planning is where the bigger wins in terms of speed come in
Correct. Although you’ll notice many of the analytic DBs support JIT execution, as they typically do evaluate lots of expressions when you consider rows processed, even if the actual expression code itself is short. Redshift, not listed here, also compiles queries.
Also worth mentioning is that databases use query planners and optimizations, much like compiler optimization passes to run queries. I wonder if any of that has an effect on the actual expression evaluation; does it end up being evaluated verbatim or potentially altered. I do not have the mental fortitude to go through the source code like @eatonphil, though 😃
Yeah it would be a whole separate discussion to talk about the optimizations that these various databases do and don’t do. Common subexpression elimination, constant folding, dead code elimination, etc. Each database is all over the place relative to each other. @tekknolagi is also thinking about this stuff so maybe we’ll see something from him too.
I’m surprised tree-walking is so common, as it’s a lot slower than a bytecode or threaded interpreter. I can only guess that execution is dominated by I/O so the interpretation isn’t a big factor?
Fantastic post, I had no idea most of these just walked the AST directly, very interesting choice as andyc pointed out. I’m wondering if there are any databases out there that operate more like MongoDB in the sense that your query is a fully declarative statement rather than a weird little parsed procedural language. Do you know of any?
I’ve wondered about building a radical new relational database for a while now that wouldn’t use a traditional “query language” as an interface and this summary is a great confirmation of my bias that such a thing is yet to exist!
I think there is a pretty large subset of SQL that is fully declarative ?
But there are a bunch of databases based around Datalog – I think Datomic is one of the more prominent (or at least talked about, since it’s made by the creator of Clojure)
I actually wonder why I haven’t heard more about WASM plugins for databases. I’m not a big WASM booster (I wrote that it’s “less general than claimed”), but it seems to make perfect sense for this problem.
It’s remote evaluation – you send your code near the data on a database server, just like a web server sends JS to your browser to be evaluated.
The main limitation of WASM is bindings to the platform, but query evaluation doesn’t really have that problem. It wants to be sandboxed.
I googled and found this, but it seems a bit immature, since it supports integers only, and not strings:
(also it has come to light recently that wasmer is not very reputable)
I’d be interested if anyone knows reasons WASM is a poor fit for databases. Seems like an obvious combination, as I know many databases have R and Python plugins, etc. I think Oracle in particular supports a bazillion languages.
Maybe a per-row granularity is slow? But I think that relates to an implementation of WASM, not the language. It seems like you should be able to warm up a JIT with a particular query, and reuse that across all rows.
Yes, SQL can be described as declarative from certain perspectives - you’re not literally writing table joins and query plans by hand so in this case it could be called “declarative” (though, the amount of messing about and weird rules you have to figure out when it comes to optimising queries, I find it hard to argue for it being fully declarative in the real world!)
Datalog looks quite interesting, though I find the syntax is quite awkward, yet another DSL! I’d like to write queries in my application language that get baked into some simple JSON - no custom AST walker required, just data structures (which I do find Datalog has leaned into much more than SQL - less stringy things and more properly structured data!)
Against SQL by Jamie is fantastic, and came about the time I was writing a similar series called “Stockholm syndrome of SQL” 😂 which essentially covered the same ideas. I’d like to explore this area more tbh. feels like lots of untrodden ground!
WASM for plugins is an interesting area of work, I’ve attempted it and it’s still very rough around the edges (I did a talk at Go x Rust London last December on my attempts and failures) mostly because the system interface, IPC and calling conventions is still in development but it’s getting there! I think once the dust settles around things like WASI and calling in/out of WASM binaries becomes easier, we’ll see more use cases for stuff like this. It’s very compelling compared to the status quo of OS-bound DLL/shared object linking which usually requires uploading binaries to a server and doing lots of non-cattle type configuration (which hurts portability and disaster recovery)
I’ve been playing with the idea of the client figuring out a query plan locally then submitting that to a very thin rudimentary database server which simply executes and responds. WASM could fit into this architecture quite nicely as a query plan optimiser module or a post-query processing pipeline. I should probably just write this stuff down and build it rather than spam lobste.rs comments :)
Thanks for the links though, this is all really interesting stuff!
Hm interesting … I knew sqlite had bytecode, but I didn’t know most databases including MySQL evaluate a tree.
Off hand I’d guess the reasons for this are
I’m also fairly surprised. While your points 1 and 2 seem like they make sense because each expression would only be evaluated once the corollary is that most of these are executed per-row, so in many cases you can achieve a performance win quite quickly. Basically there is an implicit loop.
Although I know that MySQL is well known for how quickly it can execute small queries such as single row reads and writes, so maybe that is a common use case.
Yeah it’s true, there is an implicit loop
Which basically leaves the 3rd point I guess :) I wish open source databases would publish benchmarks with every release so we could find out the details, e.g. what’s actually hot in most queries
https://www.oilshell.org/release/0.18.0/quality.html#benchmarks
Correct. Although you’ll notice many of the analytic DBs support JIT execution, as they typically do evaluate lots of expressions when you consider rows processed, even if the actual expression code itself is short. Redshift, not listed here, also compiles queries.
Also worth mentioning is that databases use query planners and optimizations, much like compiler optimization passes to run queries. I wonder if any of that has an effect on the actual expression evaluation; does it end up being evaluated verbatim or potentially altered. I do not have the mental fortitude to go through the source code like @eatonphil, though 😃
Yeah it would be a whole separate discussion to talk about the optimizations that these various databases do and don’t do. Common subexpression elimination, constant folding, dead code elimination, etc. Each database is all over the place relative to each other. @tekknolagi is also thinking about this stuff so maybe we’ll see something from him too.
ooooh ~ foreshadowing ~
I’m surprised tree-walking is so common, as it’s a lot slower than a bytecode or threaded interpreter. I can only guess that execution is dominated by I/O so the interpretation isn’t a big factor?
Fantastic post, I had no idea most of these just walked the AST directly, very interesting choice as andyc pointed out. I’m wondering if there are any databases out there that operate more like MongoDB in the sense that your query is a fully declarative statement rather than a weird little parsed procedural language. Do you know of any?
I’ve wondered about building a radical new relational database for a while now that wouldn’t use a traditional “query language” as an interface and this summary is a great confirmation of my bias that such a thing is yet to exist!
I think there is a pretty large subset of SQL that is fully declarative ?
But there are a bunch of databases based around Datalog – I think Datomic is one of the more prominent (or at least talked about, since it’s made by the creator of Clojure)
https://docs.datomic.com/pro/query/query-executing.html
Jamie Brandon has written a ton about alternative database designs - https://www.scattered-thoughts.net/
Example: https://www.scattered-thoughts.net/writing/imp-sets-and-funs/
I actually wonder why I haven’t heard more about WASM plugins for databases. I’m not a big WASM booster (I wrote that it’s “less general than claimed”), but it seems to make perfect sense for this problem.
It’s remote evaluation – you send your code near the data on a database server, just like a web server sends JS to your browser to be evaluated.
The main limitation of WASM is bindings to the platform, but query evaluation doesn’t really have that problem. It wants to be sandboxed.
I googled and found this, but it seems a bit immature, since it supports integers only, and not strings:
https://medium.com/wasmer/announcing-the-first-postgres-extension-to-run-webassembly-561af2cfcb1
(also it has come to light recently that wasmer is not very reputable)
I’d be interested if anyone knows reasons WASM is a poor fit for databases. Seems like an obvious combination, as I know many databases have R and Python plugins, etc. I think Oracle in particular supports a bazillion languages.
Maybe a per-row granularity is slow? But I think that relates to an implementation of WASM, not the language. It seems like you should be able to warm up a JIT with a particular query, and reuse that across all rows.
Yes, SQL can be described as declarative from certain perspectives - you’re not literally writing table joins and query plans by hand so in this case it could be called “declarative” (though, the amount of messing about and weird rules you have to figure out when it comes to optimising queries, I find it hard to argue for it being fully declarative in the real world!)
Datalog looks quite interesting, though I find the syntax is quite awkward, yet another DSL! I’d like to write queries in my application language that get baked into some simple JSON - no custom AST walker required, just data structures (which I do find Datalog has leaned into much more than SQL - less stringy things and more properly structured data!)
Against SQL by Jamie is fantastic, and came about the time I was writing a similar series called “Stockholm syndrome of SQL” 😂 which essentially covered the same ideas. I’d like to explore this area more tbh. feels like lots of untrodden ground!
WASM for plugins is an interesting area of work, I’ve attempted it and it’s still very rough around the edges (I did a talk at Go x Rust London last December on my attempts and failures) mostly because the system interface, IPC and calling conventions is still in development but it’s getting there! I think once the dust settles around things like WASI and calling in/out of WASM binaries becomes easier, we’ll see more use cases for stuff like this. It’s very compelling compared to the status quo of OS-bound DLL/shared object linking which usually requires uploading binaries to a server and doing lots of non-cattle type configuration (which hurts portability and disaster recovery)
I’ve been playing with the idea of the client figuring out a query plan locally then submitting that to a very thin rudimentary database server which simply executes and responds. WASM could fit into this architecture quite nicely as a query plan optimiser module or a post-query processing pipeline. I should probably just write this stuff down and build it rather than spam lobste.rs comments :)
Thanks for the links though, this is all really interesting stuff!