1. 3

    As per barakmich’s comment, I agree the “joins don’t scale” argument isn’t appropriately answered with a query that fits on a single machine. I also believe you could rewrite the query to remove the join:

    select date_trunc('month',ORDER_DATE) order_month, USER_ID, sum(amount), count(*)
    from ORDERS
    where USER_ID=42
    group by order_month, USER_ID
    ;
    
    1. 1

      Essentially, yes, this is correct. Or at least that’s what the query optimizer deduced it could do.

      There’s one slight difference between what you wrote and what the join as written in the article returns: if USER_ID=42 does not exist in the other table, the join returns nothing.

      So it is an according-to-Hoyle relational algebra join, but one that’s really easy to optimize. You can see it in the EXPLAIN output in the article:

      Index Only Scan using users_pkey on public.users  (cost=0.29..4.30 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1)
        Output: users.user_id
        Index Cond: (users.user_id = 42)
      

      Which is the aside-check that verifies existence of the key. You’d get pretty much the same performance even if you asked for some details from that table (say, selecting last_name), since it’s pulling that row anyway.

      A somewhat more indicative join would be WHERE last_name='Smith' (with appropriately loaded data; Smith being statistically prevalent in North America, for example), so we wouldn’t know a priori how many USER_IDs match, and there’s potential fanout.

      Still a one-to-many join (many-to-many is still undiscussed), but at least it would be something that made the join do some real lifting.
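
      For illustration, here’s a rough sketch of that kind of query. The schema is my assumption (ORDERS and USERS joined on user_id, in the style of the query above), not the article’s exact DDL:

      -- join where the number of matching users isn't known up front (potential fanout)
      select date_trunc('month', o.order_date) order_month, o.user_id, sum(o.amount), count(*)
      from orders o
      join users u on u.user_id = o.user_id
      where u.last_name = 'Smith'
      group by order_month, o.user_id
      ;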

    1. 3

      This talk https://www.youtube.com/watch?v=oF_lIZHJPig from CMU on Deepgreen DB (a Postgres/Greenplum commercial offering with JIT) goes into some good detail on the benefits and drawbacks of JIT in databases.

      1. 1

        I have watched that talk twice, it is damn excellent!

      1. 1

        I had a somewhat similar road: Knoppix -> Ubuntu -> Arch -> Fedora -> Arch (OS X used throughout mainly for work).

        My main complaint about Arch Linux is the amount of configuration that must be done up front. I actually enjoy using GNOME (themed appropriately with Arc) and so have used Fedora as well, but I’ve found Fedora upgrades much more painful than just running Arch Linux. I’m an Arch Linux tester and am happy creating my own PKGBUILDs.

        I’m interested in NixOS, and I’d also like to try out Alpine Linux as a desktop OS.

        1. 2

          It was interesting to hear Greg Kroah-Hartman’s surprising (to me) comments on the stability of ZFS on Linux in a recent AMA.

          “You are relying on a kernel module that no one in the kernel community can ever touch, help out with, or debug. The very existence of the kernel module is at the whim of the kernel itself not doing something that might end up breaking it either with api changes, or functional changes, as the kernel community does not know what is in that code, nor does it care one bit about it.”

          https://www.reddit.com/r/linux/comments/fx5e4v/im_greg_kroahhartman_linux_kernel_developer_ama/fn5t6t4

          Slightly scary considering how important correctness is when talking about filesystems. I’ve been using btrfs recently but prefer ZFS and would like more explicit kernel support.

          1. 8

            It was interesting to hear Greg Kroah-Hartman’s surprising (to me) comments on the stability of ZFS on Linux in a recent AMA: “You are relying on a kernel module that no one in the kernel community can ever touch, help out with, or debug. […]”

            Slightly scary considering how important correctness is when talking about filesystems.

            That ZFS on Linux is an unsupported external module is scary. But relying on filesystems such as ext4 or XFS that do not have any measures in place against silent data corruption is, to me, even scarier.

            1. 1

              What about btrfs? I use it at work and it seems to work pretty well.

            2. 6

              They have a point, but you’ve got to pick the least risky option that provides the features you want.

              I’ve run btrfs. I’ve run ZFS. I’ve used snapshots/subvolumes on both. Btrfs was flaky and I lost data. ZFS was a joy to use and I have not lost data.

              1. 5

                the kernel community does not know what is in that code, nor does it care one bit about it

                An even stronger argument could be made against the 99% of flagship Android phones shipping proprietary, out-of-tree modules. Yet ZFS is open source and gets flak anyway because it’s not GPL.

                I think we’d all like to see ZFS ship with the Linux kernel, but adding boot.supportedFilesystems = ["zfs"] to my nix config isn’t really ever going to be a huge deal. So, really, this entire line of argument is just more party-line GPL bickering, nothing new to see here.

                1. 4

                  The position Greg, and seemingly others in the Linux project, take on this is persistently tedious – for ZFS users and developers alike. All I can suggest is that there are other UNIX platforms that make different trade-offs, and which don’t have a frustrating relationship with ZFS; e.g., illumos or FreeBSD.

                1. 3

                  The demoscene (see http://www.pouet.net/) has some brilliant examples of this.

                  1. 1

                    What’s this? It looks like a news aggregator.

                    1. 2

                      Pouet is an aggregator for the demoscene (https://en.m.wikipedia.org/wiki/Demoscene). Demos are like videos, but made via programming, often under extreme limitations (e.g. under 4 KB in size).

                  1. 6

                    Although PostgreSQL is my go-to for relational databases, I’d echo this, especially #4, MVCC Garbage Collection. Additionally, I find the permission system labyrinthine and dislike how a basic “select count(*) from table” takes non-trivial time compared to other databases.
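
                    For what it’s worth, when an exact count isn’t needed, a common workaround is to read the planner’s row estimate instead. A minimal sketch, assuming a hypothetical orders table (the estimate is maintained by autovacuum/ANALYZE and can lag the true count):

                    -- approximate row count from planner statistics rather than scanning the table
                    select reltuples::bigint approx_rows
                    from pg_class
                    where relname = 'orders';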

                    1. 5

                      A data warehousing approach is to store data in layers and preserve the raw sources as the initial source layer. That layer is often files, but you can use a database if you are lightweight about it (e.g. by storing just an id, some metadata, and the raw JSON in a blob). Ideally, you can then rebuild your entire data warehouse from the source layer.

                      Additional access layers are then created as needed, so you can have views over multiple sources (e.g. a single view, with a common schema, of all the messages you have written).
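
                      A minimal sketch of that layering in SQL (PostgreSQL-flavoured); the table, view, and column names are illustrative, not prescriptive:

                      -- source layer: immutable raw payloads plus minimal metadata
                      create table raw_messages (
                          id        bigserial primary key,
                          source    text not null,                      -- e.g. 'email', 'chat'
                          loaded_at timestamptz not null default now(),
                          payload   jsonb not null                      -- raw document as received
                      );

                      -- access layer: a common schema over multiple sources, rebuilt as needed
                      create view all_messages as
                      select id, source, payload ->> 'author' author, payload ->> 'body' body, loaded_at
                      from raw_messages;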

                      1. 1

                        As a counterpoint, I’ve also had a Surface Go for the last year.

                        My purpose for buying it was the great Linux support along with portability and the touchscreen. I used Windows along with WSL for a few weeks (I mainly use Mac OS and Linux) and couldn’t get past the slowness of the filesystem on git checkouts and my general dissatisfaction with Windows.

                        Linux worked like a charm (though with additional steps for WiFi). However, for long-term usage I found the screen too small and the keyboard cramped. Additionally, one big blocker is that you can’t really use it on a couch with the keyboard, as there is no hinge.

                        I’m now of the view that any mobile device I own needs a decent-sized, usable keyboard and a screen >= 13”. Although I do love the form factor of small devices, I find them unusable, and their limited input abilities lead to passive consumption.

                        1. 1

                          I totally understand where you’re coming from. I think I mention in the post that it is not a machine for everyone and that your experience may be quite different from mine. Since you’re into Linux and might want a lightweight portable device, have you checked out the Pinebook Pro? It appears to fit what you enjoy, and it is not expensive (it’s aarch64 though, which might be a dealbreaker for you).

                          1. 1

                            Thanks, I have been following the Pinebook Pro and might look at getting one.

                        1. 2

                          This is not just an EU thing; exactly the same has happened in Australia.

                          I don’t believe this will be fixed until we are forced to by some geopolitical upheaval. It is important that developers actively assume the communications network is hostile, to prevent harm to users.

                          1. 1

                            It is important that developers actively assume the communications network is hostile, to prevent harm to users.

                            Harm from whom? The service provider itself, for whom the user traffic is a valuable recurring revenue stream?

                            The domestic government, who may be interested in preventing crime, tracking dissent, or other uses of “lawful intercept”?

                            Or a foreign government, who may use backdoors to spy on the domestic government or on industry?

                            1. 5

                              The network is always hostile. There are many public 0-days currently affecting hundreds of mobile operators and millions of customers around the world, and most of them are not patchable. Some operators don’t verify source addresses for custom SMS senders, leading to easy phishing attacks, as the SMS will be listed along with legit ones. Some operators are state-owned and can tap any form of 2FA that uses SMS. And so on. There are a lot of cases like this, so it’s better to assume the network is a clear-text channel unless it is wrapped in another layer of security.

                          1. 1

                             Try running Elasticsearch + Filebeat rather than the full ELK stack for a more lightweight approach. Ingest pipelines negate most of the need for Logstash, and Kibana can be spun up when needed. You can also change the processors config option to reduce the number of processors Elasticsearch uses. NB: I work at Elastic.

                            1. 2

                              Great work. A lot of these optimisations seem to be surprisingly low-hanging fruit. I would have assumed that Array.unshift would already be optimised to not be O(n). It probably implies there are many such opportunities in the lower-level layers of browsers and UI frameworks.