1. 12
    1. 10

      When you’re iteratively building jq commands by trial and error, as I do, you’ll quickly grow tired of waiting about a minute for a command to run, only to find out that it didn’t in fact return what you were looking for.

      I feel like the article misses step 0: build and test your tools on a minimal, representative sample. There’s no point running the first few jq attempts on the full dataset rather than on a “head -n 100” sample of it. Then, potentially, a better-selected sample where you know the right answers. Then step 1: think about how you’re going to apply your ready-and-tested process to gigabytes of JSON.

      Also, choosing a better language is sometimes an automatic, massive gain. I wrote https://github.com/viraptor/heapleak in Crystal because it’s ~10x faster than heapy in Ruby.

      1. 1

        If I can do it in 50 seconds on the full dataset in the way I’m used to, I’ll do it. If that hadn’t been possible, I probably would have used jq to sample, but thanks to parallel, that wasn’t necessary. Hooray!

    2. 7
      1. 1

        Cool, thanks! I wasn’t familiar with DuckDB! I almost wrote a coda mentioning Postgres, which would probably have worked as well.

        1. 3

          duckdb is all the rage in certain circles lately. Totally worth checking out!
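
          For the curious, a rough sketch of what the DuckDB route could look like from Python; the file name, JSON layout, and column name below are placeholders, not anything from the article:

          import duckdb

          # Query the big JSON file directly with SQL; read_json_auto infers the schema.
          # 'big.json' and 'column1' are hypothetical.
          duckdb.sql("""
              SELECT column1, count(*) AS n
              FROM read_json_auto('big.json')
              GROUP BY column1
              ORDER BY n DESC
          """).show()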

    3. 2

      There are also ijson and json-stream (and many other) libraries that can turn JSON files into Python generators, which are really fast and memory-efficient.
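
      As a rough example of the generator style, assuming the file is one big top-level JSON array (file name and field values are made up):

      import ijson

      # Stream records one at a time instead of loading the whole file;
      # only the current record is held in memory.
      with open("big.json", "rb") as f:
          for record in ijson.items(f, "item"):  # "item" = each element of the top-level array
              if record.get("column1") == "some value":
                  print(record)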

      1. 3

        It does feel like streaming parsing strategies have been somewhat forgotten as memory sizes and connection speeds have grown.

      2. 2

        I’ve tried working with the chunksize parameter for pandas.read_csv, but I ran out of memory very quickly again when trying to reduce my JSON data. Maybe pandas doesn’t use one of these libraries, but that discouraged me from trying any other non-parallelized approach.

        1. 1

          Chunksize should work in this case unless you store the result somewhere instead of just printing it on screen and letting the garbage collector do its thing. Though afaik pandas has nothing for chunking/streaming JSON.

          In general, if you can take advantage of the garbage collector and generators, you can process files of any size with very little memory (you just need enough to load one chunk/item at a time). Working with generators is much more hands-on than just loading everything into memory, though, so unsurprisingly it’s less approachable.
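
          A minimal sketch of that pattern with pandas (CSV input; the file name is a placeholder), reducing each chunk right away so the garbage collector can reclaim it:

          import pandas as pd

          counts = {}
          # Read the file in fixed-size chunks; only one chunk is in memory at a time.
          for chunk in pd.read_csv("big.csv", chunksize=100_000):
              # Reduce the chunk to what you need immediately instead of keeping it around.
              for key, n in chunk["column1"].value_counts().items():
                  counts[key] = counts.get(key, 0) + int(n)
          print(counts)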

          1. 2

            The code was something like

            import pandas as pd

            result = pd.DataFrame()
            for chunk in pd.read_csv(...):  # read in chunks via the chunksize parameter
                df = pd.DataFrame(chunk)[["column1", "column2", "column3"]]
                result = pd.concat([result, df])
            

            Used all the memory. Probably creating full dataframes and throwing them away causes a lot of memory pressure. But Dask took care of all that for me, so it gets a mention.
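
            For comparison, the Dask equivalent is roughly the sketch below (same column names; the file name is a placeholder). Dask does the partitioning itself and only materializes the selected columns on .compute():

            import dask.dataframe as dd

            # Dask handles the chunking internally instead of concatenating by hand.
            result = dd.read_csv("big.csv")[["column1", "column2", "column3"]].compute()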

          2. 1

            If you were to use the ijson basic_parse interface, you’d have to write a small state machine to keep track of where you are in the JSON file, matching the opening and closing braces/brackets via the start_map/end_map and start_array/end_array events. That might be a bit tedious, but it’s easily abstracted away with a few functions, and used that way you basically only pay the memory cost of one JSON token at a time. I’m guessing the reason the 10GB file takes so long to process in jq is that jq tries to parse the entire file into memory before doing any work, so any solution that doesn’t keep the entire parsed file in memory is probably going to be a lot faster.
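
            A minimal sketch of that state-machine approach, assuming a top-level array of objects (the file name is a placeholder; real code would extract fields rather than just count):

            import ijson

            depth = 0
            records = 0
            with open("big.json", "rb") as f:
                for event, value in ijson.basic_parse(f):
                    # Track nesting with the start/end events.
                    if event in ("start_map", "start_array"):
                        depth += 1
                    elif event in ("end_map", "end_array"):
                        depth -= 1
                        # depth == 1 means we just closed an object that sits
                        # directly inside the top-level array.
                        if event == "end_map" and depth == 1:
                            records += 1
            print(records)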

            By the way @jeeger, I looked at jq’s documentation and saw that it has a --stream flag that does exactly that and is claimed to be faster on very large files, though it requires a slightly different querying syntax. Have you tried it?

    4. 2

      I really liked that parallel+jq solution. Really nice! I also had to do something like this on my own, and for me it meant Rust. Here’s a quick brain dump of what I did: https://tilde.cat/posts/analyzing-multi-gigabyte-json-files-locally-with-serde/

      I also had to write a minimal site generator to avoid writing HTML by hand for the second time in one week; it was a busy weekend ;)

    5. 2

      I’d be curious whether you’ve tried polars; my sense is that it’s very good and fast at these sorts of things. (But I don’t have a lot of experience with it.)
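
      For anyone curious, a rough polars sketch, assuming newline-delimited JSON and made-up file/column names; scan_ndjson builds a lazy query, so only the selected columns get materialized:

      import polars as pl

      # Lazily scan the NDJSON file and only read the columns we need.
      result = (
          pl.scan_ndjson("big.ndjson")
            .select(["column1", "column2", "column3"])
            .collect()
      )
      print(result)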

      1. 1

        Sounds very interesting, I’ll try it out sometime!