1. 2

    What size would the sqlite file be if it was also gzipped?

    1. 7

      Parquet can read directly from the compressed file and decompress on the fly, though.

      1. 3

        471MB zipped.

        1. 1

          Has anyone seriously investigated this? https://sqlite.org/zipvfs/doc/trunk/www/readme.wiki

          1. 1

            It’s not very popular due to the license and fees. Here is an interesting project that should work really well with zstd dictionary compression. Since there are tons of repeated values in some columns, or in JSON text, zstd’s dictionary-based compression works really well.

            [2020-12-23T21:14:06Z INFO sqlite_zstd::transparent] Compressed 5228 rows with dictid=111. Total size of entries before: 91.97MB, afterwards: 1.41MB, (average: before=17.59kB, after=268B)
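
            To make the dictionary idea concrete, here is a minimal Go sketch. It is neither sqlite-zstd nor zstd itself: Go's standard library has no zstd package, so this swaps in DEFLATE with a preset dictionary (`compress/flate`), which relies on the same principle of priming the compressor with strings that recur across rows. The sample row and dictionary are made up for illustration.

            ```go
            package main

            import (
                "bytes"
                "compress/flate"
                "fmt"
                "io"
            )

            // compress deflates data. A non-nil dict primes the compressor with
            // byte sequences that are expected to recur in the input.
            func compress(data, dict []byte) ([]byte, error) {
                var buf bytes.Buffer
                var w *flate.Writer
                var err error
                if dict != nil {
                    w, err = flate.NewWriterDict(&buf, flate.BestCompression, dict)
                } else {
                    w, err = flate.NewWriter(&buf, flate.BestCompression)
                }
                if err != nil {
                    return nil, err
                }
                if _, err := w.Write(data); err != nil {
                    return nil, err
                }
                if err := w.Close(); err != nil {
                    return nil, err
                }
                return buf.Bytes(), nil
            }

            // decompress inflates data; it must be given the same dictionary.
            func decompress(data, dict []byte) ([]byte, error) {
                r := flate.NewReaderDict(bytes.NewReader(data), dict)
                defer r.Close()
                return io.ReadAll(r)
            }

            func main() {
                // Hypothetical dictionary: the JSON skeleton shared by every row.
                dict := []byte(`{"type":"PushEvent","actor":{"login":""},"repo":{"name":""}}`)
                row := []byte(`{"type":"PushEvent","actor":{"login":"alice"},"repo":{"name":"alice/demo"}}`)

                plain, err := compress(row, nil)
                if err != nil {
                    panic(err)
                }
                primed, err := compress(row, dict)
                if err != nil {
                    panic(err)
                }
                fmt.Printf("raw=%dB plain=%dB with-dict=%dB\n", len(row), len(plain), len(primed))
            }
            ```

            Small rows gain the most from this, because on their own they contain too little repetition for the compressor to exploit.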

            1. 1

              I would want to investigate using ZSTD (with a dictionary) and Parquet as well. Columnar formats are better for OLAP.

            2. 1

              Why would anybody investigate this when you have Parquet and ORC, which are specifically designed for OLAP use cases? What does SQLite add for a data scientist who wants to process the GitHub export data?

            3. 1

              Apples to oranges.

            1. 6

              Can’t wait for Generics and I don’t think I am ever going back to a JVM based language!

              1. 2

                Maybe Go will be worth another look after generics. I guess it should make implementing a sequence/stream API more feasible. Although I suspect the performance would suck; Go probably can’t optimize the virtual function calls as much as a JIT can.
                Coming from a JVM background, and having recently written a CLI app with Go, I found the experience extremely painful, and I don’t quite understand why one would give up a higher-level language to work with Go for non-trivial applications.
                Being able to easily build and cross-compile native binaries is a great feature, especially for CLIs, but if running a JVM isn’t a major constraint, I’d take any major JVM language over Go.
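
                For a sense of what such a sequence API could look like, here is a minimal sketch using type parameters as they later shipped in Go 1.18. `Filter` and `Map` are hypothetical helpers written for this example, not standard library functions.

                ```go
                package main

                import "fmt"

                // Filter returns the elements of xs for which keep reports true.
                func Filter[T any](xs []T, keep func(T) bool) []T {
                    var out []T
                    for _, x := range xs {
                        if keep(x) {
                            out = append(out, x)
                        }
                    }
                    return out
                }

                // Map applies f to every element of xs and collects the results.
                func Map[T, U any](xs []T, f func(T) U) []U {
                    out := make([]U, 0, len(xs))
                    for _, x := range xs {
                        out = append(out, f(x))
                    }
                    return out
                }

                func main() {
                    evens := Filter([]int{1, 2, 3, 4}, func(n int) bool { return n%2 == 0 })
                    labels := Map(evens, func(n int) string { return fmt.Sprintf("#%d", n) })
                    fmt.Println(labels) // prints [#2 #4]
                }
                ```

                Each stage here builds an intermediate slice and calls its callback through a function value, which is the kind of overhead the performance worry above is about.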

                1. 2

                  This kind of reflects my views about Go as well. I think once you are out of the “simplicity” dogma, you quickly realize how messy the code gets with interface{} casts everywhere. I use generics on a daily basis! Even a basic cache requires generic support. I don’t want to litter my code with casts and ifs when there exists a decent solution that does all of the manual work for you. That is what compilers were invented for, rather than just generating plain code. You can obviously ignore them if you don’t need them, but not having them is a big pain in the a**.
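
                  The complaint is easier to see in code. Below is a minimal sketch of a pre-generics cache that holds `interface{}` values; `Cache` and `lookupAddress` are hypothetical names for this example.

                  ```go
                  package main

                  import "fmt"

                  // Cache stores values of any type, so the compiler cannot tell
                  // callers what they will get back.
                  type Cache struct {
                      items map[string]interface{}
                  }

                  func NewCache() *Cache {
                      return &Cache{items: make(map[string]interface{})}
                  }

                  func (c *Cache) Set(key string, v interface{}) { c.items[key] = v }

                  func (c *Cache) Get(key string) (interface{}, bool) {
                      v, ok := c.items[key]
                      return v, ok
                  }

                  // lookupAddress shows the check-and-cast each call site needs:
                  // one test for presence, one for the dynamic type.
                  func lookupAddress(c *Cache, key string) (string, bool) {
                      v, ok := c.Get(key)
                      if !ok {
                          return "", false
                      }
                      addr, ok := v.(string)
                      return addr, ok
                  }

                  func main() {
                      c := NewCache()
                      c.Set("home", "1 Example Street")
                      if addr, ok := lookupAddress(c, "home"); ok {
                          fmt.Println(addr)
                      }
                  }
                  ```

                  Storing a value of the wrong type under a key compiles fine and only shows up at run time, when the type assertion fails.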

                  1. 7

                    > you quickly realize how messy the code gets with interface{} casts everywhere.

                    It should essentially never happen that you use interface{} in day-to-day code. If you’re having that experience, I can understand why you’d be frustrated! You’re essentially fighting the language.

                    interface{} and package reflect are tools of last resort that really only need to be used in library code.

                    1. 1

                      Please feel free to attach a reusable LRU cache solution. A pretty common thing to do in any production-grade app is to use a basic cache for anything from addresses to deserialized objects. I am pretty sure you will end up with the same hacky solution: check the cast with an if, then use the value.

                      1. 1

                        > Please feel free to attach a reusable LRU cache solution. A pretty common thing to do in any production-grade app is to use a basic cache for anything from addresses to deserialized objects. I am pretty sure you will end up with the same hacky solution: check the cast with an if, then use the value.

                        It’s not necessary for the LRU cache you write for your application to be reusable. It only needs to solve the needs of your application, which, by definition, deals with a bounded set of types, typically just one.

                        If your reaction to this is “But that’s stupid and of course I’m going to write a generic data structure…” then this is kind of exactly what I mean when I say that you’re fighting the language.
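
                        As an illustration of that approach, here is a minimal sketch of an LRU cache written for exactly one type. `User` and `UserLRU` are hypothetical names for this example, and it is not safe for concurrent use.

                        ```go
                        package main

                        import (
                            "container/list"
                            "fmt"
                        )

                        // User is a hypothetical application type.
                        type User struct{ Name string }

                        // entry is what each list element holds.
                        type entry struct {
                            key string
                            val User
                        }

                        // UserLRU caches User values only, so callers never cast.
                        type UserLRU struct {
                            capacity int
                            order    *list.List               // front = most recently used
                            items    map[string]*list.Element // key -> element in order
                        }

                        func NewUserLRU(capacity int) *UserLRU {
                            return &UserLRU{
                                capacity: capacity,
                                order:    list.New(),
                                items:    make(map[string]*list.Element),
                            }
                        }

                        // Get returns the cached user and marks it as recently used.
                        func (c *UserLRU) Get(key string) (User, bool) {
                            el, ok := c.items[key]
                            if !ok {
                                return User{}, false
                            }
                            c.order.MoveToFront(el)
                            return el.Value.(entry).val, true
                        }

                        // Put inserts or updates a user, evicting the least recently
                        // used entry when the cache grows past its capacity.
                        func (c *UserLRU) Put(key string, u User) {
                            if el, ok := c.items[key]; ok {
                                el.Value = entry{key, u}
                                c.order.MoveToFront(el)
                                return
                            }
                            c.items[key] = c.order.PushFront(entry{key, u})
                            if c.order.Len() > c.capacity {
                                oldest := c.order.Back()
                                c.order.Remove(oldest)
                                delete(c.items, oldest.Value.(entry).key)
                            }
                        }

                        func main() {
                            cache := NewUserLRU(2)
                            cache.Put("a", User{"Ann"})
                            cache.Put("b", User{"Bob"})
                            cache.Get("a")              // "a" is now the most recently used
                            cache.Put("c", User{"Cat"}) // evicts "b"
                            _, ok := cache.Get("b")
                            fmt.Println(ok) // prints false
                        }
                        ```

                        Since `container/list` itself stores `interface{}` values, a type assertion still exists, but it is confined to the cache internals and the public `Get` and `Put` are fully typed.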

                  2. 1

                    Two more releases, likely :)

                    I’m curious to see how the generics will work out in practice. But I do look forward to having a sane assert.Equal().
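
                    As a rough idea of what a type-checked equality assertion could look like, here is a minimal sketch using type parameters as they later shipped in Go 1.18. `Equal` is a hypothetical helper, not the API of any existing assertion library, and it assumes the complaint is about helpers that accept `interface{}` for both arguments.

                    ```go
                    package main

                    import "fmt"

                    // Equal returns nil when got == want and a descriptive error
                    // otherwise. Both arguments must share one comparable type,
                    // so mixing, say, an int and a string does not compile.
                    func Equal[T comparable](got, want T) error {
                        if got != want {
                            return fmt.Errorf("got %v, want %v", got, want)
                        }
                        return nil
                    }

                    func main() {
                        fmt.Println(Equal(2+2, 4))   // prints <nil>
                        fmt.Println(Equal("a", "b")) // prints got a, want b
                    }
                    ```

                    The `comparable` constraint excludes slices, maps, and functions, so a real helper would need a separate path for those.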