1. 10
  1. 2

    In a world where many/most people used this, there would be a lot of redundant computation occurring.

    I wonder if such a system could/should be built so that this redundancy could be exploited - effectively memoising the work across all users. (In the same way that cloud storage exploits redundancy between users by using content-addressable storage on the backend, so they only need to store one copy of $POPULAR_FILE.)

    e.g. for compilation you’d probably want:

    1. content-based-addressing of the data input (I am compiling something equivalent to foo.c)
    2. content-based-addressing of the executable input (I am running gcc 2.6.1 on x86)
    3. some kind of hash of the execution environment (cpuid flags, env vars, container OS?)

    and probably some other bits and pieces (e.g. does the executable make system calls such as reading the local “current time”?). Could probably be made to work if the executables agreed to play nice and/or run inside a decent sandbox.

    It would be challenging, but also very cool, to get this right.

    1. 1

      Gradle has a distributed cache that can probably do that.

      1. 1

        That’s interesting, thank you.

        I was using the example of compilation, but I think the general question is interesting. “We are performing this computation, with this executable code, on this input data, in this runtime environment (which is a special case of input data)”

        If we can determine that all essential characteristics of these are the same as in a previous run, then we can look up the result instead of recomputing it.

        I think there are interesting questions as to what constitutes inputs here (e.g. non-pure things like ‘time’ and ‘network’) and - more so - what makes the executable code “the same” for this purpose. (What level do you work at - source code, binary, etc.?)

        1. 1

          Gradle has a very flexible task system - you can completely define the relevant inputs and dependencies of tasks yourself. Often that will just be files, and some dependencies on the outputs of other tasks. The tricky part is usually defining all of them correctly. But once you do that, magic happens - tasks can be cached, and if you set up a distributed cache, the results may even be shared amongst multiple machines (devs, CI, etc.). A task doesn’t have to be compilation; it can be anything that takes inputs and produces outputs, maybe you want to do some code generation, or whatever. I’m sure there are other build systems that are backed by a similarly flexible high-level task system; Gradle is just the one that I happen to know.

      2. 1

        Llama (https://github.com/nelhage/llama) does a few of these (like content addressing). I have played around with it a bit and it’s been a joy.
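
For anyone who wants to poke at the idea from this thread, here is a rough Python sketch of the "hash the data, the executable and the environment, then look up the result" scheme. It is only an illustration under stated assumptions: `digest`, `cache_key`, `ResultCache` and `compile_stub` are made-up names, the store is an in-memory dict rather than anything shared between users, and it says nothing about how Gradle or Llama actually do it. It also ignores the hard parts raised above (time, network, sandboxing).

```python
import hashlib
import json
from typing import Callable, Dict


def digest(*parts: bytes) -> str:
    """SHA-256 over length-prefixed parts, so (b"ab", b"c") != (b"a", b"bc")."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(8, "big"))
        h.update(part)
    return h.hexdigest()


def cache_key(data: bytes, executable: bytes, env: Dict[str, str]) -> str:
    """Hypothetical key: data input + executable input + execution environment."""
    # Serialise the environment with sorted keys so dict ordering cannot change the key.
    env_blob = json.dumps(env, sort_keys=True).encode("utf-8")
    return digest(data, executable, env_blob)


class ResultCache:
    """Toy in-memory stand-in for a store shared across users."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def run(self, data: bytes, executable: bytes, env: Dict[str, str],
            compute: Callable[[bytes], bytes]) -> bytes:
        key = cache_key(data, executable, env)
        if key in self._store:
            # Same inputs as a previous run: reuse the stored result.
            self.hits += 1
            return self._store[key]
        # Unknown combination: do the work once and remember it.
        self.misses += 1
        result = compute(data)
        self._store[key] = result
        return result


def compile_stub(source: bytes) -> bytes:
    """Placeholder for a real, deterministic compilation step."""
    return b"OBJ:" + source


cache = ResultCache()
source = b"int main(void){return 0;}"
compiler = b"<bytes of the compiler binary>"  # placeholder, not a real binary

first = cache.run(source, compiler, {"arch": "x86_64", "CFLAGS": "-O2"}, compile_stub)
# Same environment, different dict ordering: should be served from the cache.
second = cache.run(source, compiler, {"CFLAGS": "-O2", "arch": "x86_64"}, compile_stub)
# Different flags: a different key, so the work is done again.
third = cache.run(source, compiler, {"arch": "x86_64", "CFLAGS": "-O3"}, compile_stub)
```

The design choice worth noting is that correctness depends entirely on the key capturing every input that can affect the output; anything left out (a header file, an env var, the clock) turns into a stale cache hit.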