1. 95
    1. 56

      I think DuckDB has some of the best developer experience I’ve seen in a database; it really feels like a Swiss Army knife in the same vein as jq, so I share that sentiment.

      But I want to post a word of warning here: I drank this Kool-Aid and replaced a data process I was working on at work with it, and.. I’ve never seen so many different segfault bugs in a database in my life; the project became a huge source of stress and severe outages. Once the segfaults ended, the random crashes when calling S3 with invalid HTTP requests started. It would do a few hundred million calls fine, and then out of the blue send something that created thousands of corrupted nonsense files instead of a single parquet file as intended.

      The number of severe bugs that cause hard crashes has me worried about what hides below the surface; I don’t trust this to keep my data safe if it can’t reliably make calls to S3.

      1. 8

        I love it but wouldn’t use it in production pipelines. It’s just too new.

        1. 3

          Can you please share which version these issues happened with?

          I was thinking recently about using DuckDB at work. If there have been a couple of new versions since the one you had problems with, maybe it has stabilised a bit.

          1. 10

            I had similar problems yesterday when I was kicking the tires. Sometimes I’d get a segfault when loading into an in-memory DB. Sometimes, with the same data, I wouldn’t.

            duckdb --version
            v0.10.1 4a89d97db8
            
            1. 4

              We had segfaulting on multiple (3 or 4?) separate releases. Eventually, when they released 0.7.1 and we stopped using a lot of the S3/Parquet features, we managed to get to a state where our specific use case no longer segfaulted, so I’m not touching it now, just leaving it alone while writing the replacement system.

              It was actually running OK for ~4 months on this version, after I had removed most of the places where we ask it to call S3, replacing that with doing the S3 work manually and asking DuckDB to do only local file system ops (e.g. instead of push-to-S3 we ask it to write a local parquet file, then push to S3 with the AWS SDK, and vice versa for fetching).

              I had missed one spot, and that suddenly started crashing last week, with HTTP 400s from S3. The failure path there was something like:

              • Populate a duckdb table with ~1M rows of UUIDs
              • Write that to local disk as parquet and upload to S3
              • In a separate process, load the file (directly in DuckDB) from S3 [this works ok]
              • Delete the table
              • Load it again using the same function <- this causes HTTP 400s

              What was annoying here was that this exact process had been running for several months without anyone touching it, and then one day it just started exploding like this, and would consistently explode, always getting a 400 on that second call.. presumably something about how the data in the table happened to end up being shaped or something?

              Resolved this the same as mentioned above: stop using their S3 integration and instead ask it to write to local disk, then manually push the file with the AWS SDK. Made the issue go away; back to working on getting DuckDB out of our stack.
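
              For illustration, that workaround looks roughly like this from Python (the table name, bucket and paths here are made up, not our real ones):

              import duckdb
              import boto3

              con = duckdb.connect("pipeline.duckdb")
              # Ask DuckDB to touch only the local filesystem...
              con.execute("COPY events TO '/tmp/events.parquet' (FORMAT PARQUET)")
              # ...and do the S3 transfer ourselves with the AWS SDK.
              boto3.client("s3").upload_file("/tmp/events.parquet", "my-bucket", "exports/events.parquet")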

              1. 1

                Thank you for sharing that! Very useful indeed.

            2. 3

              I was about to replace part of my data pipeline with this tool, since I had started to read so many (and good!) articles about it, but since it was a pretty early trend, I decided to wait and do more research before implementing it; wow, what a relief to read this now.

              It would be nice to have some kind of “https://www.arewewebyet.org/” for this type of project, given the risk of trusting your data to it.

              1. 3

                We’re trying it in a bakeoff against a few other databases like SQLite and also getting a fair number of segfaults just reading and writing. It’s cool but needs a lot more baking; it’s definitely out of the running.

                1. 2

                  That’s a shame. Mark Litwintschik tried DuckDB and there were two issues (or maybe one issue exposed two ways) reading 1.1 billion rows of CSV that other tools parse OK. Also, Parquet import to DuckDB’s internal format performed poorly when loading all rows in one go rather than a chunk at a time.

                  I like the ability to query files as they are, schema auto-inference, query planning (ClickHouse, for example, had some fiddly bits in my tests like needing to put the smaller table first in a JOIN for optimal behavior), and the SQL dialect. Maybe DuckDB still needs to be exercised more at scale to shake this kind of stuff out.
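
                  To illustrate the “query files as they are” part, here is a rough sketch from Python against a repos.json like the one in the article (read_json_auto comes from DuckDB’s bundled JSON extension, so availability may vary by version):

                  import duckdb

                  # Schema is inferred from the file; license is read as a struct,
                  # so license.key is NULL for repos without a license.
                  print(duckdb.sql("""
                      SELECT license.key AS license, count(*) AS n
                      FROM read_json_auto('repos.json')
                      GROUP BY 1
                      ORDER BY n DESC
                  """).fetchall())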

                  1. 4

                    I will say: the performance has been excellent. The use case is an ETL-style job processing about 6 TiB of IoT data per run, and even on very modest hardware DuckDB is nowhere near the bottleneck; the main speed limit is the pace of the EBS disks we pull the source data from.

                    But I dunno.. I don’t want to be negative about someone else’s project, and clearly this thing is really incredible from many (most!) angles, but I think it is a shame that they didn’t pick a memory safe language to implement it in.

                    1. 1

                      It is somewhat encouraging that some bugs went away with upgrades, and from your longer reply, another problem sounds specific to the S3 integration. I’d say that out of self-hostable software like this (DuckDB, ClickHouse, Trino) this nudged me from ‘DuckDB clearly wins for its use case’ (i.e. when you only need one node) to ‘all of the options have some downsides’.

                  2. 1

                    oh good, it’s not just me

                  3. 14

                    I agree that doing stuff like this in jq can be pretty difficult if you don’t do it often, but I think my instinct would have been to do the field extraction with jq, and the rest with sort and uniq:

                    $ < repos.json jq -r '.[] | .license.key' | sort | uniq -c | sort -rn
                      23 bsd-3-clause
                      5 apache-2.0
                      2 null
                    

                    Or for the hell of it, with gron:

                    $ gron repos.json | fgrep license.key | gron -v | sort | uniq -c | sort -rn
                      23 bsd-3-clause
                      5 apache-2.0
                    

                    Admittedly you do lose the information that two of the repos did not have a license (the null values) that way.

                    Disclaimer: I’m the author of gron.

                    1. 8

                      That would be my instinct too, as it is something I learned long before any of the tools mentioned in this thread were first released.

                      Thanks for your work on gron. Seriously useful tool. I use it daily. My coworkers look confused when I use it to find some value inside a JSON blob. I usually have the answer while they are still finishing the sentence “we can write a python script that…” 😂

                      1. 7

                        Yes! I too like thinking in pipes. Also, for the hell of it, in nushell:

                        > cat repos.json
                        | from json
                        | get license
                        | compact
                        | get key
                        | uniq --count
                        | sort-by value --reverse
                        
                        ╭───┬──────────────┬───────╮
                        │ # │    value     │ count │
                        ├───┼──────────────┼───────┤
                        │ 0 │ bsd-3-clause │    23 │
                        │ 1 │ apache-2.0   │     5 │
                        ╰───┴──────────────┴───────╯
                        

                        I’m still a bit new to nushell and am curious if anyone has a more concise and/or elegant way to extract the nested license key.

                        1. 8

                          You can use the builtin open, which will do the from json automatically. Or use http get which also does that if the Content-Type matches.

                          Here is the optimal way to do that:

                          ~
                           : open repos.json | get license.key --ignore-errors | uniq --count
                          
                          ╭───┬──────────────┬───────╮
                          │ # │    value     │ count │
                          ├───┼──────────────┼───────┤
                          │ 0 │ bsd-3-clause │    23 │
                          │ 1 │ apache-2.0   │     5 │
                          │ 2 │              │     2 │
                          ╰───┴──────────────┴───────╯
                          
                          1. 5

                            Awesome. Thanks for the tips! I wonder why --ignore-errors is necessary to get nested data, though, or how that flag makes it work. Intuitively, I would not expect get license.key to produce any errors.

                            1. 6

                              Some don’t have anything in the license field (it is an empty attrset). So you need -i for Nushell’s get to return null instead of erroring out.

                              1. 3

                                Thank you and ~adamshaylor, I need to push myself to learn nu, and these kinds of comments are really helpful.

                        2. 6

                          Tangential aside: apparently, ... | sort | uniq -c | sort -rn is known to old-timers as the Honeyman idiom after Peter Honeyman

                          1. 7

                            Great to put a name to a pattern I use probably dozens of times per month, thank you :)

                            Perhaps it’s time to add this to my .bashrc:

                            alias honey="sort | uniq -c | sort -rn"
                            
                            1. 2

                              Depending upon redundancy & scale of your data compared to DIMM memory, it may be much faster to use awk '{n[$0]++}END{for(k in n)print n[k],k}' than sort|uniq -c (granted >3X more keydowns, but if you are saving it in a file anyway that matters little; I might also suggest "linefreq" or maybe even "histo" as more descriptive names – more about what is done rather than how).

                              Similarly, if there are many unique keys but you are only interested in the most frequent ones, then a heap-based topn (or topn|tac) might also be better. That way only needs space proportional to the target number kept, which is probably CPU L1-cacheable (for a few screenfuls of short lines, anyway). mawk is even pretty fast, within 2-4X of optimized C or Nim, though when I tried it gawk was 2X slower than mawk.
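
                              A minimal sketch of that heap-based top-N idea in Python (counting still uses a hash table like the awk version above; the heap only replaces the full sort, and the script name and N=10 are arbitrary):

                              #!/usr/bin/env python3
                              # linefreq-topn: print the N most frequent input lines, most frequent first.
                              import sys
                              import heapq
                              from collections import Counter

                              N = 10
                              counts = Counter(line.rstrip("\n") for line in sys.stdin)
                              # nlargest keeps an N-element heap instead of sorting every distinct key.
                              for key, n in heapq.nlargest(N, counts.items(), key=lambda kv: kv[1]):
                                  print(n, key)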

                          2. 2

                            Is there an advantage to using the pipe in the jq expression instead of this:

                            $ jq '.[].license.key' repos.json | sort | uniq -c | sort -rh
                                 23 "bsd-3-clause"
                                  5 "apache-2.0"
                                  2 null
                            

                            ?

                            1. 4

                              By “pipe” do you mean the usage of the < repos.json redirect? If so, I do it the way that tomnomnom does. There’s not a perf advantage, but there is a UX advantage: I can easily put another tool there if I want to preprocess the file. Say that the file was huge and had 1M JSON lines in it (one JSON entry per line). If I want to test the pipeline out, I can modify the statement above to say:

                              $ < repos.json head -10 | jq -r '.[] | .license.key' | sort | uniq -c | sort -rn
                                23 bsd-3-clause
                                5 apache-2.0
                                2 null
                              

                              to process just the first 10 lines. Then, when things are working like I want and I want to process the whole file, I can just delete the head -10 | and run it, vs having to slice out the file name and transpose it somewhere else.

                              1. 2

                                No, sorry, I should have been more clear. I was calling the | symbol a “pipe”. In other words, I’m asking if there is a benefit of doing

                                '.[] | .license.key'
                                

                                over

                                '.[].license.key'
                                

                                ?

                                1. 4

                                  I think it partly boils down to a style preference, and where one prefers the balance between concision and clarity to be. One advantage of using the | in this case might be to draw attention to the fact that the array iterator .[] actually generates a stream of multiple values, and I find that having the subsequent part of the filter more distinctly visible (i.e. after a pipe), especially in longer sequences of filters, helps me read it and remember that, plus explain it to others too.

                              2. 2

                                Oh! Probably not; just my relative inexperience with jq! :D

                            2. 6

                              Since it seems we’re all trading our own methods I tried to do it in ryelang:

                              get https://api.github.com/orgs/golang/repos
                              |parse-json
                              |map { -> "license" |fix\either { "no" } { -> "key" } }
                              |spreadsheet* { "license" }
                              |group-by 'license { 'license count } |display
                              
                              | license      | license_count |
                              +------------------------------+
                              | bsd-3-clause | 23            |
                              | apache-2.0   | 5             |
                              | no           | 2             |
                              
                              1. 4

                                Love to see your ryelang examples, thanks!

                                Skipping imports and not formatting as a table, I practiced getting the same data with Factor:

                                "https://api.github.com/orgs/golang/repos" http-get-json nip
                                [
                                  "license" of
                                  dup json-null =
                                  [ drop "None" ] [ "key" of ] if
                                ] map histogram .
                                

                                I’m also trying to learn PRQL, but my favorite examples in these comments are actually Nu.

                                1. 3

                                  you can use json-null? and maybe "key" ?of "None" or, but that is probably as tacky as the if

                                  1. 1

                                    Nice, thanks!

                                  2. 3

                                    Nice Factor code. I’m just not sure what nip does? The Rye example would benefit from a function like unique\count (it has unique); I will probably add it. Nushell looks really cool, yes, I didn’t know about it before.

                                    1. 3

                                      Ha, I had guessed you knew Factor much better than I do, based on rye. nip just removes the second-from-the-top from the data stack – http-get-json pushes (1) a response object and (2) the json, so nip tosses away the response object.

                                      1. 2

                                        It’s been more than 10 years since I’ve worked with Factor, but as I said it marked me forever :). I had some clue what nip does, but I didn’t know that http-get-json pushes two values onto the stack.

                                2. 5

                                  Reposting from another comment, but this is how it is done in Nushell:

                                   http get https://api.github.com/orgs/golang/repos
                                  | get license.key -i
                                  | uniq --count
                                  | rename license
                                  

                                  This outputs the following:

                                  ╭───┬──────────────┬───────╮
                                  │ # │    license   │ count │
                                  ├───┼──────────────┼───────┤
                                  │ 0 │ bsd-3-clause │    23 │
                                  │ 1 │ apache-2.0   │     5 │
                                  │ 2 │              │     2 │
                                  ╰───┴──────────────┴───────╯
                                  

                                  And it is identical to the expected result in the blog post if you run to json on it (yes, that’s a null key).

                                  1. 2

                                    shocked they still don’t let you read from stdin

                                    1. 2

                                      “The syntax is super powerful, but I have to study the docs anytime I want to do anything beyond just selecting fields.”

                                      Yes. Exactly. This is why I created fx. I know JS very well, and JS is best for JSON (it is in the name of the format).

                                      1. 2

                                        Rumor is that the j in jq is for JSON ;) Joke aside, I think DuckDB, JS and jq all have their uses, but I personally think the composable nature of jq makes it extremely well suited for ad-hoc and exploratory programming.

                                      2. 2

                                        I enjoyed this article, thanks. The jq example which had this group_by | map({(first...):...}) pattern caught my eye, and I was inspired to write this short post: Curating a collection of jq functions.

                                        1. 1

                                          This is so cool.

                                          1. 1

                                            I was playing with DuckDB this weekend and I also really liked it!

                                            1. [Comment removed by author]