1. 3
  1.  

    1. 2

      It is very tricky to program with such a data representation. It is so error prone that, despite the performance edge, our empirical observation is that even very dedicated C programmers don’t do it.

      Isn’t this what the data-oriented design people do?

      What I like about this API is that it turns a highly error-prone endeavour, programming directly with serialized representation of data types, into a rather comfortable situation. The key ingredient is linear types.

      Yeah, that’s cool, but at what cost? Compare add1 in the post with the idiomatic way of writing it: fmap (+1) :: Tree Int -> Tree Int. What would be really neat is if we could elaborate the latter into the former inside the compiler (out of sight of the developer).
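
      For concreteness, here is a runnable sketch of the idiomatic version (the Tree definition here is an assumption; the article may define it differently):

      ```haskell
      {-# LANGUAGE DeriveFunctor #-}

      -- Illustrative Tree type; the article's definition may differ.
      data Tree a = Leaf a | Node (Tree a) (Tree a)
        deriving (Eq, Show, Functor)

      -- The idiomatic add1: builds a fresh pointer-linked tree.
      add1 :: Tree Int -> Tree Int
      add1 = fmap (+ 1)
      ```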

      1. 2

        I think a closer point of comparison is flatbuffers

        1. 1

          Looking at the page you linked to, I can’t see any example where a function is defined over the serialised data without deserialising it first, i.e. what add1 does?

          1. 1

            The tutorial says,

            Deserialization is a bit of a misnomer, since FlatBuffers doesn’t deserialize the whole buffer when accessed. It just “decodes” the data that is requested, leaving all the other data untouched. It is up to the application to decide if the data is copied out or even read in the first place.

            So the flatbuffers accessors are analogous to caseTree in the article, and the flatbuffers builders are analogous to the leaf and tree needy constructors.
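
            To make the analogy concrete, here is a hedged, non-linear sketch of what builder/accessor-style programming over a serialized tree can look like. The byte encoding and all names (Buf, leafB, nodeB, add1S) are mine for illustration, not the article's API or FlatBuffers' actual one:

            ```haskell
            import Data.Word (Word8)

            -- Preorder byte encoding (illustrative): tag 0 = leaf followed
            -- by a value byte, tag 1 = node followed by the two subtrees.
            type Buf = [Word8]

            -- "Builders", analogous to the leaf/tree constructors.
            leafB :: Word8 -> Buf
            leafB n = [0, n]

            nodeB :: Buf -> Buf -> Buf
            nodeB l r = 1 : l ++ r

            -- An "accessor"-style traversal, in the spirit of caseTree:
            -- rewrite the serialized tree directly, never building a
            -- pointer representation; also return the unconsumed remainder.
            add1S :: Buf -> (Buf, Buf)
            add1S (0 : n : rest) = ([0, n + 1], rest)
            add1S (1 : rest) =
              let (l, rest')  = add1S rest
                  (r, rest'') = add1S rest'
              in (1 : l ++ r, rest'')
            add1S _ = error "malformed buffer"
            ```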

            1. 1

              Can flatbuffers represent recursive datatypes such as Tree? Can they modify the data in-place, rather than copying it when rebuilding? (To be fair, it’s not clear if the article does it in-place, but it seems plausible if the underlying array is mutable.)

              Compare with DoD, https://lobste.rs/s/8bhptf/data_oriented_design_revisited_type , where they do have recursive types and do it in-place.
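
              As a hedged sketch of the in-place question (same illustrative tag-0-leaf / tag-1-node preorder encoding as above, using GHC's bundled array package; nothing here is from the article): if the serialized tree lives in a mutable array, add1 can overwrite the value cells without rebuilding anything.

              ```haskell
              import Data.Array.IO

              -- Walk one subtree starting at index i, bump every leaf value
              -- in place, and return the index just past the subtree.
              add1InPlace :: IOUArray Int Int -> Int -> IO Int
              add1InPlace arr i = do
                tag <- readArray arr i
                if tag == 0
                  then do v <- readArray arr (i + 1)
                          writeArray arr (i + 1) (v + 1)
                          return (i + 2)
                  else do j <- add1InPlace arr (i + 1)  -- left subtree
                          add1InPlace arr j             -- right subtree
              ```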

        2. 1

          I agree, this is what I thought as well. In fact, this is one of the main tricks used in simdjson, see page 4, section 3, where they parse the (tree-like) nested JSON structure into a flat array called a tape, which is fast to navigate when retrieving values.
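
          The core of the tape idea can be sketched in a few lines (an illustrative encoding, not simdjson's actual tape format): each container cell records the index just past its own contents, so navigation can skip an entire subtree in O(1) instead of walking it.

          ```haskell
          data Value = Num Int | Arr [Value]

          -- TArr's payload is the tape index one past this array's contents.
          data Cell = TNum Int | TArr Int
            deriving (Eq, Show)

          -- Flatten a nested value into a tape, threading the current index.
          toTape :: Value -> [Cell]
          toTape = snd . go 0
            where
              go i (Num n)  = (i + 1, [TNum n])
              go i (Arr xs) =
                let step (j, acc) x = let (j', cs) = go j x in (j', acc ++ cs)
                    (end, cells)    = foldl step (i + 1, []) xs
                in (end, TArr end : cells)
          ```

          Skipping a subtree is then just a jump: on hitting a TArr cell, resume at its stored index.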

        3. 1

          OMG! The article says

          When I receive the tree, it is already in a pointer representation (remember, unCompact is free)

          And I wondered how deserializing a pointerful data structure like part of a Haskell heap can be free.

          The compact normal forms paper says,

          Pointer adjustment

          If the data associated with a compact region is not loaded into the same address as its original address, it is necessary to offset all of the internal pointers so that they point to the new address of the data in question. This procedure can be skipped if the sender is trusted and the compact region is loaded to its original address.

          To ensure that we will be able to load compact regions into the correct address space, we observe that the address space in a 64-bit architecture (or even a 48-bit one like x86_64) is fairly large, more than the application will need. Therefore, our approach is to divide it into n chunks (in our case, 256 chunks of 128 GiB each) and assign each chunk to a specific machine/process combination.

          Which is a magnificent performance hack. But they say in the introductory sections that they are aiming to improve performance in data centres and supercomputers, for which a limit of 256 machine/process combinations seems rather small!
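
          A quick sanity check of the arithmetic (the chunkBase formula is mine, illustrating the scheme, not the paper's code): 256 chunks of 128 GiB cover only 2^45 bytes (32 TiB), so even a 48-bit address space has room for 2048 such chunks, which supports the point that 256 is a fairly conservative cap.

          ```haskell
          chunkSize, totalCovered :: Integer
          chunkSize    = 128 * 1024 ^ 3   -- 128 GiB in bytes (2^37)
          totalCovered = 256 * chunkSize  -- 2^45 bytes = 32 TiB

          -- Illustrative base address of chunk i under this scheme.
          chunkBase :: Integer -> Integer
          chunkBase i = i * chunkSize
          ```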

          Even so, it’s pretty cool. I should read the rest of the paper :-)