A common problem with no straightforward solution in PostgreSQL, solved with some array trickery… Hope find this useful!
The straightforward and portable solution I would reach for would involve the RANK() or ROW_NUMBER() window functions.
Looks like these are fully supported in Postgres: https://www.postgresql.org/docs/9.1/tutorial-window.html
Even Hive! https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
The downside to using RANK/DENSE_RANK/ROW_NUMBER is that they don’t work with group by. As your example below demonstrate, you need to use more than one query and join the results. Another downside, which is probably more significant, is that rank etc. requires a sort. Using max(array) keeps a “running max” in memory dosen’t need any type of sort.
I don’t understand what you mean by saying they don’t work with group by. Window functions have PARTITION BY syntax for evaluating over groups.
You are definitely right that performance could be improved on huge datasets by using your array trick. I would still prefer the window functions unless this query is a prominent bottleneck.
Window functions cannot be used with group by. What you did was to combine the window function with DISTINCT ON, so you essentially calculate the aggregates for the entire set, and then take only the first row.
This is different than how a group by works in the sense that group by applying “reducing” rows to group while computing the aggregates. Window functions operate on the entire set (w/o distinct on you would get duplicate rows).
I’m sorry, now I’m really confused. I did not write DISTINCT ON.
It is a unique way of phrasing it, but if I were to guess, he’s saying: “What you [must have] did [at some point] was to combine the window function with DISTINCT ON [which while similar, has important differences]”
I appreciate the post, especially putting all the DDL in a <details>. I didn’t like how both credit events had the same date, so I put the 50 earlier than the 100.
I write T-SQL mostly; I didn’t know about the distinct on trick, that’s cool. I am trying to get better at PostgreSQL.
This is how I think about & approach the problem, with separate tables, row_number() & CTEs: db-fiddle.
Thanks, I just realized I accidentally put the same date on two records. Good catch!
The downside (and upside) to using row number is that it makes it harder to add other aggregates. I just added a bit about comments I got from readers with my thoughts on them.