1. 12
  1.  

  2. 3

    It seems like a starting point for this sort of thing would be dependent types, with the obvious caveat that dependent types can become undecidable if you demand too much of them.

    I tried to work out a sane way to do data science in a strongly typed, ideologically pure way and found that it usually requires some compromise. If you’re happy to call a data frame a [[Double]] or even a Map String [Double] (column-based), then it can be reasonably sane, but if you want to push your domain logic automatically into the type system, with neat auto-derived record/row types (e.g. {name: String, age: Int, income: Double, ... }), then it can get messy.

    For example, if you use record types (i.e. each record shape is its own type) and (a) have 100 field names, (b) are doing a data-dependent removal of columns, such as dropping all columns correlated at > 0.99 with previous columns (a common data-cleaning operation on real-world data), and (c) want the type signature to represent what has and has not been removed, you now have 2^100 different record types to worry about.
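
    For concreteness, a minimal sketch of the two “reasonably sane” encodings (type names are made up; assumes the containers package). The point is that the column set lives in a value rather than in the type, so a data-dependent drop doesn’t change the type at all:

    import qualified Data.Map.Strict as Map   -- containers package

    -- Row-major: columns identified only by position.
    type FrameRows = [[Double]]

    -- Column-major: columns identified by name.
    type FrameCols = Map.Map String [Double]

    -- A data-dependent column drop is easy because the column set is a value,
    -- not a type: the result has the same type no matter what gets removed.
    dropColumns :: [String] -> FrameCols -> FrameCols
    dropColumns cols frame = foldr Map.delete frame cols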

    1. 6

      Dependent types are needed to statically guarantee more complicated properties, but I think a lot of the low-hanging fruit in statistical computing is more around finding a not-super-tedious way of making some basic kind of tagging/annotation system on data work (more the 2nd part of your comment). Less verification that mathematical identities hold, and more just basic provenance: this function requires a distribution, and you passed me array X. Is array X something that even claims to be a distribution? I.e. either it comes from a function that produces distributions as output, or you read it from a file and consciously claimed it is one. But not, e.g., that you accidentally passed the wrong variable, or (very common in something like R) you were chaining together a series of operations and did a().b().c() when you should’ve done a().c().b(), ending up with numbers that don’t mean anything.

      For various reasons, statistical computing has settled on essentially a unityped system, where absolutely everything is a matrix of numbers, so even these kinds of basic errors don’t produce a static error, and whether you get a runtime error comes down to luck: whether some function happens to do sanity checking and the check trips an assertion.

      There’s an obvious OO solution to this problem, where instead of arrays as the universal data type, you have Distribution and Sample and whatever, and force people to use them explicitly, but the trick is making this not so tedious that people just prefer the approach with unityped matrices.
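
      A minimal sketch of that kind of provenance layer, assuming Haskell and made-up names (newtypes plus a single smart constructor, rather than a full OO hierarchy):

      newtype Distribution = Distribution [Double]
      newtype Sample       = Sample       [Double]

      -- The one place a raw array can be blessed as a distribution; everything
      -- else has to get a Distribution from a function that returns one.
      asDistribution :: [Double] -> Either String Distribution
      asDistribution xs
        | null xs                 = Left "empty array"
        | any (< 0) xs            = Left "negative probability mass"
        | abs (sum xs - 1) > 1e-9 = Left "mass does not sum to 1"
        | otherwise               = Right (Distribution xs)

      -- Downstream functions state their assumption in the signature, so passing
      -- a Sample (or a bare [Double]) where a Distribution is expected is a
      -- static error rather than silently meaningless numbers.
      entropy :: Distribution -> Double
      entropy (Distribution ps) = negate (sum [ p * log p | p <- ps, p > 0 ])

      The tedium lands exactly at the asDistribution call sites; the bet is that one explicit claim per piece of external data is cheap enough that people won’t fall back to bare matrices.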

      1. 3

        I think the solution there is better support for existentials (and perhaps embracing a certain amount of casting). There are 2^100 possible record types, but at compile time there is just an existential type that can be represented as an erasable generic, and at runtime there is just the concrete type that exists.
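
        Roughly, something like this (hypothetical class and names): the 2^100 concrete shapes all hide behind one compile-time type, and downstream code consults the runtime column set when it needs to.

        {-# LANGUAGE ExistentialQuantification #-}

        -- Every concrete record shape implements this interface.
        class Frame r where
          columnNames :: r -> [String]
          getColumn   :: String -> r -> Maybe [Double]

        -- One compile-time type standing in for all 2^100 column subsets;
        -- the particular subset only exists at runtime.
        data SomeFrame = forall r. Frame r => SomeFrame r

        -- Code after a data-dependent column drop works against the existential.
        columnMeans :: SomeFrame -> [(String, Double)]
        columnMeans (SomeFrame r) =
          [ (c, sum xs / fromIntegral (length xs))
          | c <- columnNames r
          , Just xs <- [getColumn c r]
          , not (null xs)
          ]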

      2. 3

        I think there is a lot of low-hanging fruit around the intersection of types and probability / statistics.

        You can nicely encode a general probabilistic program as

        type Program a = Free (Sum (ProbF a) (Ap (ProbF a)))
        

        where Free is the free monad, Ap is the free applicative, ProbF is a base functor that defines a probabilistic instruction set, and Sum is a coproduct of functors.

        What you get is a structure-preserving probabilistic model that exactly encodes dependence (via the left branch of Sum) and conditional independence (via the right branch). It’s really nice, and has some implications for inference. If your model is shallow/wide (i.e. you use the right branch a lot) then inference tends to be much easier than if it’s deep and narrow. The dual (cofree) representation can also be used to do inference.
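
        For concreteness, one kind-correct way to flesh that out: a toy instruction set with made-up names, dropping the a parameter on ProbF to keep the kinds simple, and assuming the free package for Free and Ap.

        {-# LANGUAGE DeriveFunctor #-}

        import Control.Monad.Free       (Free, liftF)   -- free package
        import Control.Applicative.Free (Ap, liftAp)    -- free package
        import Data.Functor.Sum         (Sum (..))      -- base

        -- A tiny probabilistic instruction set; k is the continuation.
        data ProbF k
          = Normal Double Double (Double -> k)
          | Bernoulli Double     (Bool   -> k)
          deriving Functor

        type Program = Free (Sum ProbF (Ap ProbF))

        -- Dependence goes through the monadic (left) branch...
        normal :: Double -> Double -> Program Double
        normal mu sd = liftF (InL (Normal mu sd id))

        -- ...and blocks of conditionally independent draws go through the
        -- applicative (right) branch.
        independent :: Ap ProbF a -> Program a
        independent block = liftF (InR block)

        -- e.g. two independent coin flips, grouped so an interpreter can see
        -- that they do not depend on each other:
        flips :: Program (Bool, Bool)
        flips = independent ((,) <$> liftAp (Bernoulli 0.5 id) <*> liftAp (Bernoulli 0.5 id))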

        There are also some neat things that might just be interesting as structural connections more than anything. Various probability distributions are induced by recursive processes, and these can typically be encoded as recursion schemes. Autoregressive models are anamorphisms, the geometric distribution can be encoded as an apomorphism, etc. Whether this has any practical implication I can’t say, but it’s interesting to think about classifying models by their recursion-scheme encoding and, say, studying the properties of inference algos over some class of models that can be encoded by a particular scheme.
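
        As a toy illustration of the “distribution as recursive process” point (using a plain list anamorphism, unfoldr, rather than a proper apomorphism, and assuming the random package): a geometric draw is just the length of the failure prefix of an unfolded Bernoulli stream.

        import Data.List     (unfoldr)
        import System.Random (StdGen, mkStdGen, randomR)

        -- An infinite stream of Bernoulli(p) trials, produced by an unfold.
        bernoullis :: Double -> StdGen -> [Bool]
        bernoullis p = unfoldr step
          where
            step g = let (u, g') = randomR (0, 1) g in Just (u < p, g')

        -- Geometric(p): the number of failures before the first success.
        geometric :: Double -> StdGen -> Int
        geometric p = length . takeWhile not . bernoullis p

        -- e.g. geometric 0.3 (mkStdGen 42)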

        1. 1

          I like type safety, but this seems restrictive. If I had, for instance, gotten an array of numbers from an external library, I would need to turn it into a RandomSample just to compute inferential statistics on it. If I already knew it was one, then at that point I would coerce it or call a constructor before using the library; but since I’m still the one making that decision, it is the same as if I had just abandoned specific statistical types entirely.

          1. 8

            Not quite. For whoever comes along after you, the explicit typing of the entire computation would at a minimum improve the readability of what has been done. Having some record of the assumptions made in an analysis would actually be pretty fantastic and would make reading statistical code less of a nightmare.