1. 14
    1. 4

      Looks interesting, but it’s notable that they don’t put compression speeds in the table. Zstd can get good results for compression and decompression at speeds that (even with a modest CPU) make I/O the bottleneck. Decompression performance at the expense of compression speed is a win for some cases (e.g. immutable system images), not for others (e.g. network traffic). Also, an AGPLv3 reference implementation absolutely guarantees that no experts are going to look at this, since it would taint them for working on any of the competitors (which are all permissively licensed), and no one will incorporate it into any products. That seems to be their goal, since this whole post was an ad for their compression service.
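      The speed-versus-ratio tradeoff described above can be sketched with the standard library’s zlib as a stand-in for zstd (the shape of the tradeoff is similar; the absolute numbers are not, and the payload here is artificial):

      ```python
      import time
      import zlib

      # Highly compressible toy payload; real corpora behave differently,
      # so treat this purely as an illustration of the level tradeoff.
      data = b"the quick brown fox jumps over the lazy dog\n" * 20_000

      for level in (1, 6, 9):
          t0 = time.perf_counter()
          packed = zlib.compress(data, level)
          c_time = time.perf_counter() - t0

          t0 = time.perf_counter()
          unpacked = zlib.decompress(packed)
          d_time = time.perf_counter() - t0

          assert unpacked == data
          print(f"level {level}: {len(packed):>7} bytes, "
                f"compress {c_time*1e3:.1f} ms, decompress {d_time*1e3:.1f} ms")
      ```

      Decompression time stays roughly flat across levels, which is why spending more compression time only pays off for write-once/read-many data like the system images mentioned above.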

      1. 2

        Re AGPLv3, they relicensed to Apache v2.

        1. 4

          TBH that sucks. This kind of technology is enormously valuable, disproportionately so to those who are already the richest people in the world.

          I did similar work some years ago with results in the same ballpark as this post. Some of what I found has since become publicly known, some has not. It was disheartening to realise that my options were 1) try and likely fail to monetise my findings, 2) donate billions of dollars to Google etc, or 3) keep it to myself, which (aside from sharing with a few close friends) is what I ultimately did.

          1. 1

            Indeed. All of my personal stuff is GPL’d or at least LGPL’d for precisely this reason. Not that I’ve written anything super valuable, mind you.

    2. 4

      Whenever crazy optimized SIMD code is presented, I wonder whether it’s portable to other architectures. (All but one of the general-purpose computing devices I own are ARM-based.) Last I checked, ARM doesn’t have 512-bit SIMD instructions yet…?

      1. 3

        Getting the best possible performance out of modern processors requires careful tuning, both for architectural and microarchitectural facilities. Although multiple people (myself included) have expressed interest in and may be working on sufficiently smart compilers (fsvo ‘sufficient’), nothing exists yet (ispc is closest, but is not particularly good, so I hear), so manual porting is necessary.

        Furthermore: compared with avx512, neon is decently capable but a bit anemic. Firestorm is quad-issue, which helps somewhat, but not enough to make it an unalloyed win over scalar code for most non-embarrassingly-parallel cases, particularly given how wide the scalar end is. Sve has seen some interest from hpc and server, but comparatively little from mobile/client. That’s probably partly because bigger vectors suck more power (and arm is easier to decode than x86, so ~microcoding is not that worthwhile), and partly because variable-length vectors are a bit of a clusterfuck. That said, I expect we’ll see some standardisation on maybe 256-bit sve in mobile/client parts in the future, and that might open up some new doors.

      2. 2

        Last I checked, ARM doesn’t have 512-bit SIMD instructions yet…?

        Well, kind of. Arm has SVE, which (from memory) allows vector registers to be 128–2048 bits, with some logic for determining which size you have at run time. This is designed mostly as a target for autovectorisation: if you determine that loop iterations are independent (or mostly independent) then you can rewrite the loop to perform n iterations in parallel, where n is determined by the vector unit width. It’s less useful for code that is specifically designed around a single vector width (though you can potentially move the branch to an ifunc resolver and ship versions that use 1, 2, or 4 vector registers for a 512-bit value; there’s masking, so you can always use less than a full register).
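        The vector-length-agnostic style can be mimicked in plain Python: query a “hardware” width at run time, then drive the loop with a whilelt-style predicate so the tail is masked rather than handled by a scalar epilogue. This is a toy model, not real SVE — `VL` stands in for what `svcntw` would report, and `whilelt` imitates the `svwhilelt` intrinsic:

        ```python
        # Toy model of an SVE-style vector-length-agnostic loop.
        # VL would come from the hardware at run time; here it is a constant.
        VL = 8

        def whilelt(i, n):
            """Predicate mask: lane k is active iff i + k < n (like svwhilelt)."""
            return [i + k < n for k in range(VL)]

        def vla_add(a, b):
            """out[i] = a[i] + b[i] for arbitrary n, with no scalar tail loop."""
            n = len(a)
            out = [0] * n
            i = 0
            while i < n:
                pred = whilelt(i, n)
                for lane in range(VL):      # one "vector" operation
                    if pred[lane]:          # masked-off lanes do nothing
                        out[i + lane] = a[i + lane] + b[i + lane]
                i += VL                     # identical code works for any VL
            return out

        print(vla_add([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [10] * 10))
        # → [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
        ```

        The point is that the loop body never mentions the concrete width, which is exactly what makes it a good autovectorisation target and an awkward fit for kernels hand-tuned around one width.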

        I don’t believe Apple’s Arm cores implement SVE yet (which surprised me a bit, I assumed at least the M2 would, but possibly they expect that anything that would get a good speedup from SVE will get a bigger one from offload to the GPU or the ML core, and with them both on the same cache-coherent interconnect that’s much cheaper to do than on a lot of other platforms).

    3. 1

      Literals are mixed with variable-length uints, making it impossible to infer the semantics of an arbitrary bitstream portion without interpreting all the preceding tokens. Not an issue for a scalar implementation of the decoder.

      Of course it is. From what I understand, the RAD decompressors work on multiple bitstreams at once precisely to compensate for this.
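      The serial dependency, and the multi-stream trick that works around it, can be sketched with LEB128-style varints. This is a hypothetical format chosen only to show the structure — RAD’s actual stream layouts differ:

      ```python
      def encode_varint(n):
          """LEB128-style: 7 payload bits per byte, high bit = continuation."""
          out = bytearray()
          while True:
              b = n & 0x7F
              n >>= 7
              if n:
                  out.append(b | 0x80)
              else:
                  out.append(b)
                  return bytes(out)

      def decode_stream(buf):
          """Serial decode: each byte's meaning depends on the byte before it,
          so you cannot start decoding from an arbitrary offset."""
          vals, cur, shift = [], 0, 0
          for byte in buf:
              cur |= (byte & 0x7F) << shift
              if byte & 0x80:
                  shift += 7
              else:
                  vals.append(cur)
                  cur, shift = 0, 0
          return vals

      # The workaround: round-robin values across k independent streams.
      # Each stream is still serial internally, but k decoders can run at
      # once (SIMD lanes, or just ILP in a scalar decoder).
      def encode_multi(values, k=4):
          streams = [bytearray() for _ in range(k)]
          for i, v in enumerate(values):
              streams[i % k] += encode_varint(v)
          return [bytes(s) for s in streams]

      def decode_multi(streams):
          decoded = [decode_stream(s) for s in streams]  # independent work
          k = len(streams)
          total = sum(len(d) for d in decoded)
          return [decoded[i % k][i // k] for i in range(total)]

      vals = [3, 300, 70000, 5, 129, 1 << 20, 0, 42]
      assert decode_multi(encode_multi(vals)) == vals
      ```

      Splitting the token sequence this way trades a little compression ratio (k separate stream states) for decoders whose inner loops have k independent dependency chains instead of one.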