Very interesting but I remain unconvinced. It seems as though once you start adding the required features for productionization the models converge.
The obvious way to improve throughput in a pull-based model is to have the producer prefetch a buffer. This mimics the buffer that would exist on the DAG consumers in the push-based model. (In fact it may be better, as it is naturally shared among consumers.) In many real-world systems you will need to add back pressure to the producer in a push-based model, which makes it basically equivalent to the pull-based model with buffering (again, except that the buffers are in the consumers).
The author does raise an interesting point though: PostgreSQL and CockroachDB materialize the entire table when using WITH clauses. I have heard that this is treated as some sort of optimization hint in PostgreSQL, so I wonder if there is something fundamental about the model that is holding it back, or if it is a “feature” that it works this way.
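To make the buffering idea concrete, here is a minimal Python sketch, not anything taken from the article. A hypothetical `prefetch()` helper fills a bounded queue on a background thread while the consumer keeps pulling one item at a time. The names `prefetch`, `fill`, `scan` and `buffer_size` are my own, and error handling and early cancellation are left out.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetch(source: Iterable[T], buffer_size: int = 4) -> Iterator[T]:
    """Pull from `source` on a background thread into a bounded buffer."""
    buf: queue.Queue = queue.Queue(maxsize=buffer_size)

    def fill() -> None:
        for item in source:
            # put() blocks while the buffer is full, so the producer is
            # throttled by the consumer: this is the back pressure.
            buf.put(item)
        buf.put(_DONE)

    # Daemon thread so an abandoned pipeline does not keep the process alive.
    threading.Thread(target=fill, daemon=True).start()

    while True:
        item = buf.get()  # the consumer still pulls, one item at a time
        if item is _DONE:
            return
        yield item


def scan(n: int) -> Iterator[int]:
    """Stand-in for a table scan."""
    for i in range(n):
        yield i


rows = list(prefetch(scan(10), buffer_size=3))
```

The bounded queue is the whole trick: the producer runs ahead by at most `buffer_size` items, which is the same limit a push-based system has to enforce once it adds back pressure.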
PostgreSQL used to materialize WITH clauses unconditionally, and it was often used as a mechanism for hand-optimizing queries. But as of PostgreSQL 12, by default it only materializes them if they are used more than once, are recursive, or have side effects. Otherwise it folds them into the main query for purposes of generating a query plan. If you still want to use them for manual optimization, you can explicitly say whether to materialize them by writing AS MATERIALIZED or AS NOT MATERIALIZED in the WITH clause.
+1 Insightful. This is approximately my take too. I’ve written query engines and I opted for something that’s kind of a mix: Next() and Check() being the respective primitive operations to pull and push. The buffer you mention is a materialize operator that’s a logical no-op to the plan but something the optimizer can add where it sees fit.
That said, there may be something to leaning more push than I have been, specifically for DAGs, which require hacks in a pull model. Furthermore, in a distributed setting, push works better because there’s much less back-and-forth. The tradeoff there is throwing a ton of data over the network. It still needs a hybrid.
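As a rough illustration of a materialize operator that is a logical no-op, here is a small Python sketch. It is not the engine described above: `Scan`, `Materialize` and the `opens` counter are hypothetical names, and a real operator would likely spill to disk rather than hold everything in a list.

```python
class Scan:
    """Stand-in leaf operator that counts how often it is opened."""

    def __init__(self, rows):
        self.rows = rows
        self.opens = 0

    def __iter__(self):
        self.opens += 1
        return iter(self.rows)


class Materialize:
    """Yields exactly what its child yields, but evaluates the child at most once."""

    def __init__(self, child):
        self.child = child
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            # First pull: drain the child into the buffer.
            self._cache = list(self.child)
        # Later pulls are served from the buffer, so several consumers
        # can share one evaluation of the child.
        return iter(self._cache)


scan = Scan([1, 2, 3])
shared = Materialize(scan)
first = list(shared)
second = list(shared)
```

Because the output is identical with or without the wrapper, an optimizer is free to insert it wherever a subplan has more than one consumer.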
If you find it interesting, let me know what specifically, maybe I’ll put it on my blogging stack.
This article’s code was exceptionally easy to follow.
I guess I’m late to the discussion, but what about coroutines, which appear to be the solution to all producer/consumer problems?
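For what it is worth, here is a hedged Python sketch of how the same filter operator looks in both styles once coroutines are available. Generators stand in for coroutines, and `pull_filter`, `push_filter`, `collect` and `start` are made-up names for illustration, not anything from the article.

```python
def pull_filter(rows, pred):
    """Pull style: the consumer drives by iterating this generator."""
    for row in rows:
        if pred(row):
            yield row


def push_filter(pred, downstream):
    """Push style: the producer drives by calling send() on this coroutine."""
    while True:
        row = yield
        if pred(row):
            downstream.send(row)


def collect(out):
    """Sink coroutine that appends every row it receives to `out`."""
    while True:
        row = yield
        out.append(row)


def start(coro):
    """Advance a generator-based coroutine to its first yield so it can accept send()."""
    next(coro)
    return coro


def is_even(row):
    return row % 2 == 0


pulled = list(pull_filter(range(6), is_even))

pushed = []
pipeline = start(push_filter(is_even, start(collect(pushed))))
for row in range(6):
    pipeline.send(row)
```

Both versions produce the same rows; the operator body barely changes and only the side that drives the loop differs, which may be why coroutines look like a natural fit here.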
Fantastically comprehensible and informative.