1. 21
    1. 5

      As someone in the data processing and analytics field, a fan of array languages (including R), and a dataframe geek, I’m very much in agreement that this topic is worth more investment. Also, if you’re like me and want to understand the presenter’s background on the topic: this is from Graydon Hoare, the creator of Rust.

    2. 2

      There’s a lot to say about using interpreted languages as sugar and glue around a high-performance kernel. I think the approach we see in Python/ML, for instance in Jax, would benefit from being generalised. These patterns apply beyond the specific use cases, and this talk does a great job of calling them out.

    3. 2

      There’s some good info here, and the conclusion that interpreters should take implementation influence from databases is probably worth repeating [1]. You could flip it around and say that interpreters are slow because their batch sizes aren’t 1 KB to 8 KB, as stated in one of the slides. The batch size for a Lua or Python ADD opcode is more like 24 to 64 bytes :-P
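      To make that concrete, here’s a toy comparison in Python (my own example, not from the slides). The list comprehension dispatches one interpreted ADD per pair of boxed objects; the NumPy version dispatches once for the whole batch:

          import numpy as np

          N = 1_000_000

          # Scalar interpretation: one opcode dispatch per element, each
          # touching a few dozen bytes of boxed PyObjects.
          a = list(range(N))
          b = list(range(N))
          c = [x + y for x, y in zip(a, b)]

          # Vectorized: one dispatch amortized over the whole ~8 MB batch;
          # the inner loop runs over contiguous machine integers in C.
          av = np.arange(N, dtype=np.int64)
          bv = np.arange(N, dtype=np.int64)
          cv = av + bv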


      BUT, I disagree with the framing of vectorized interpreters as a third option to compilers vs. interpreters.

      It’s a language design issue, not a language implementation issue (acknowledged in 1 or 2 slides, but he didn’t shed much light on language design).

      I mean he pointed out that the most popular interpreter in the world – CPython – already makes use of the technique – SOMETIMES. Basically for NumPy and Pandas.

      So I don’t really get what’s actionable here – you can already do this. But it means convincing your users to think in terms of 1 KB to 8 KB batches to get optimal performance. And maybe coaxing them through language design to do so.

      I don’t think that will happen more than it’s already happened.

      I would propose a challenge: rewrite the Django template language in a vectorized style (or the equivalent for Ruby’s ERB, etc.). You can maybe do a little bit better, but those languages are working on trees, not flat arrays. They are also working on variable-sized strings, not ints and floats that line up nicely. This is inherent in the language design, and also extremely useful for users.

      Templates are sort of representative of what front end server processes spend a lot of time doing, when they’re CPU and memory bound (which they often are in interpreted languages).
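      To be fair, you can batch the flat, repeated-row part of a template. Here’s a toy sketch with NumPy’s batched string ops (my example, nothing from the talk):

          import numpy as np

          names = np.array(["alice", "bob", "carol"])
          scores = np.array([10, 20, 30])

          # Each np.char.add is one whole-column operation instead of a
          # per-row template walk. Note the cost: NumPy pads every string
          # to the width of the longest element.
          rows = np.char.add(
              np.char.add("<li>", names),
              np.char.add(": ", np.char.add(scores.astype(str), "</li>")))

          html = "\n".join(rows)  # "<li>alice: 10</li>\n<li>bob: 20</li>..."

      But that only covers the easy case – nested tags, conditionals, and escaping don’t reduce to flat column operations, which is exactly the gap I mean.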

      Another way to state the hypothesis: can languages be divided into a control plane (the pipelines) and a data plane (the vectorized operations, which may potentially be reordered, etc.)?
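      A toy sketch of that split (names are my own invention):

          import numpy as np

          # Control plane: ordinary interpreted code deciding what happens.
          # Data plane: whole-batch operations whose per-batch semantics
          # leave the runtime free to reorder or fuse them.
          def pipeline(prices: np.ndarray, qty: np.ndarray) -> np.ndarray:
              revenue = prices * qty      # data plane
              taxed = revenue * 1.08      # data plane (fusable with above)
              if taxed.size > 1_000:      # control plane branch
                  taxed = np.sort(taxed)  # data plane
              return taxed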


      There was also a language invented ~10 years ago for scientific computing that rejected vectorized interpreters, and specifically aimed to get rid of the two-language problem that plagued Python:

      https://juliadatascience.io/julia_accomplish

      The “Two-Language Problem” is a very typical situation in scientific computing where a researcher devises an algorithm or a solution to tackle a desired problem or analysis at hand. Then, the solution is prototyped in an easy to code language (like Python or R). If the prototype works, the researcher would code in a fast language that would not be easy to prototype (C++ or FORTRAN). Thus, we have two languages involved in the process of developing a new solution.

      To be clear, to develop a new algorithm, you often cannot do it quickly in pure Python or R – you will have to drop down into C or Fortran to create new vector operations.

      So Julia’s solution is basically its unusual compilation model involving “type stability”. You can write nested for loops – that’s explicitly encouraged – and the language compiler will understand all of that globally, not force you to split it into two levels for speed.

      I’m surprised that the slides didn’t address optimization across that boundary, or loop fusion. I think Julia has loop fusion because even if you write in a vectorized style, you can generate a lot of memory traffic for trivial batched operations.
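      For example (my sketch, using the numexpr library to stand in for what Julia does at the language level):

          import numpy as np
          import numexpr as ne

          a = np.random.rand(10_000_000)
          b = np.random.rand(10_000_000)
          c = np.random.rand(10_000_000)

          # Unfused vectorized style: a * b materializes an 80 MB temporary,
          # then + c makes a second full pass, so memory traffic dominates.
          out = a * b + c

          # Fused: the expression is compiled and each element is streamed
          # through once. Julia's dotted syntax (out .= a .* b .+ c) fuses
          # the same way, but in the compiler rather than in a library.
          out2 = ne.evaluate("a * b + c")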

      Graydon was very complimentary about Julia when it came out:

      https://graydon2.dreamwidth.org/189377.html

      Julia’s a descendant of Dylan. A really, really good one.

      Julia has all the tidy signs of Lisps: it’s easily syntactically extensible, highly interactive, reflective, open, but also built around a solid, minimal, very clearly defined semantics that admits efficient compilation. This latter part is important: many people are drawn to Julia for its speed, and that speed is primarily a result of a very well thought out implementation of multimethods. These are the core design abstraction in the language, and Julia’s done great work here. I encourage anyone who’s bothered to read this far to read Karpinski’s recent presentation or, if you have the time, the full original Julia paper from 2012. It’s an absolute delight.

      So overall I’m kind of confused about what’s really actionable here. I think there is a problem, but you need different techniques to solve it, not vectorized interpreters. I don’t see the evidence that implementers are overlooking it.


      And I’m glad he acknowledged shell – I explicitly think of it as vectorized:

      Shell can be really fast for this reason, though as mentioned the string-ish nature holds it back. (This can be solved with JSON-based data languages.)


      [1] Seems like Jamie Brandon has been thinking along the same lines, with references to Umbra, etc. And also trying to do something different than the interpreter/compiler tradeoff - https://www.scattered-thoughts.net/writing/implementing-interactive-languages/ - https://news.ycombinator.com/item?id=37256479

      1. 2

        I would propose a challenge: rewrite the Django template language in a vectorized style (or the equivalent for Ruby’s ERB, etc.). You can maybe do a little bit better, but those languages are working on trees, not flat arrays. They are also working on variable-sized strings, not ints and floats that line up nicely. This is inherent in the language design, and also extremely useful for users.

        Templates are sort of representative of what front end server processes spend a lot of time doing, when they’re CPU and memory bound (which they often are in interpreted languages).

        The APL community has found ways to process text and trees using an APL array representation. Because of the inherent vectorization, this has resulted in compilers that run on a GPU.

        Suppose you implement the front end of a compiler using purely object-oriented techniques, the coding style that most people are familiar and comfortable with. Lexical analysis produces a stream of dynamically allocated token objects. The parser produces an AST of dynamically allocated parse tree nodes. The semantic analyser transforms this tree into an IR tree, of dynamically allocated IR nodes.

        This will produce a slow compiler, even if you implement it in a systems language like C, Zig, or Rust. It’s slow because of all that dynamic memory allocation, and because the memory representation destroys cache locality.

        In order to make all this text processing run as fast as possible, you need to use data oriented techniques. Examples are the new Zig compiler, or the new Carbon compiler https://www.youtube.com/watch?v=ZI198eFghJk.

        I see a lot of similarity between the vectorized data representations that I cited, used for writing compilers in array languages, and the data oriented representations that I cited, used to write extremely fast compilers in low level systems languages. In both cases, you are representing a collection of like objects as one or more arrays, instead of dynamically allocating each object. These techniques are generally applicable to text processing, a point made by the dyalog.tv video, since you generally parse text into a tree representation for further processing.
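        A minimal sketch of that representation in Python (illustrative only; the Zig and Carbon layouts differ in detail):

            # All nodes live in parallel arrays ("struct of arrays"), and
            # child references are indexes into those arrays rather than
            # pointers to separately allocated objects.
            class Ast:
                def __init__(self):
                    self.kind = []  # "lit" or "add"
                    self.lhs = []   # child index, -1 if none
                    self.rhs = []   # child index, -1 if none
                    self.val = []   # literal payload

                def lit(self, v):
                    self.kind.append("lit"); self.lhs.append(-1)
                    self.rhs.append(-1); self.val.append(v)
                    return len(self.kind) - 1

                def add(self, l, r):
                    self.kind.append("add"); self.lhs.append(l)
                    self.rhs.append(r); self.val.append(0)
                    return len(self.kind) - 1

            # (1 + 2) + 3 becomes five rows in four arrays: one allocation
            # amortized per array instead of one per node.
            t = Ast()
            root = t.add(t.add(t.lit(1), t.lit(2)), t.lit(3))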

        1. 2

          Oh yeah definitely, but I would conjecture that people don’t want to use APL semantics for emitting HTML from structured data. They want to use Python, JavaScript, Ruby semantics.

          Or at least something more familiar. The APL stuff seems cool, and can probably help with some solutions, but I’d say there’s a big gap.

          My point was that vectorization is a language design issue, not really an implementation issue, and the APL examples seem to prove exactly that!


          I have a longstanding interest in making trees faster – e.g. I’ve been maintaining this wiki page for a number of years, and I added the Carbon link a few days ago; it has the Zig link, the APL stuff, etc.

          https://github.com/oilshell/oil/wiki/Compact-AST-Representation

          I like that everyone is trying this, but I’m not really on board with the Zig style for my own code, because of type safety. A T* is type safe, but an index into an array of T is not. Also, IIRC it makes the code look weirder – they did it very “manually” from what I remember.

          Likewise Rust arenas compromise memory safety, at least to some degree. It’s basically circumventing the ownership system and manually managing indexes yourself. It is probably fine for some use cases, but it’s not clear where it falls down.
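          For what it’s worth, you can claw back some of the type safety with typed index wrappers – a sketch of the idea (checked by mypy only in Python; in Rust or C++ you’d use a newtype struct):

              from typing import List, NewType

              # Hypothetical: give each arena its own index type so a type
              # checker rejects cross-arena mixups. No runtime enforcement.
              TokenId = NewType("TokenId", int)
              NodeId = NewType("NodeId", int)

              tokens: List[str] = ["fn", "main"]
              nodes: List[str] = ["FuncDecl"]

              def token_text(i: TokenId) -> str:
                  return tokens[i]

              token_text(TokenId(0))   # OK
              # token_text(NodeId(0))  # mypy error, though fine at runtime

          That narrows the gap, but it’s still weaker than what a real T* gives you.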


          I actually have an idea I want to pursue in the mycpp runtime (the runtime for the language https://oilshell.org is written in).

          I called it “squeeze and freeze”. At the cost of a traversal / copy, you get a compressed immutable block that’s more type safe than Zig and more memory safe than Rust idioms. It’s integrated with the garbage collector.

          The closest thing I found was this work in GHC which I posted a few weeks ago:

          https://lobste.rs/s/hgct1d/efficient_communication_collection

          We probably won’t have bandwidth to do this for at least a year, but I think it’s a good idea. It is a follow-on from the “oheap” experiments I did several years ago (also on the wiki page).


          I guess to summarize I would say APL is “manually being clever”, and Zig and Rust are also “manually being clever” to some degree… I think it would be nice to get away from that.

          I would like my tree traversals to look like “normal” code, but also be fast!

          1. 5

            Thanks for the AST Representation link, that’s a goldmine.

            I don’t like the coding style either.

            vectorization is a language design issue

            I like this framing of the issue.

            Although I don’t have a ready-made solution, my initial impulse is to consider high level programming at the tree level, analogous to APL programming at the array level. Instead of vectorized code, we have arborized code. Use functional operations that transform tree values, instead of imperative code that traffics in node pointers or array indexes. And then represent the trees using a compact, data-oriented representation.
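            As a toy of what I mean (my own sketch): if children always precede parents in the compact layout, then one forward scan over the arrays is a bottom-up fold over the whole tree, with no node pointers in sight.

                # (1 + 2) + 3 in an arrays-of-fields layout; -1 means "no child"
                kind = ["lit", "lit", "add", "lit", "add"]
                lhs  = [-1, -1, 0, -1, 2]
                rhs  = [-1, -1, 1, -1, 3]
                val  = [1, 2, 0, 3, 0]

                # One whole-tree operation: a forward scan is a post-order
                # fold because each cell's children have smaller indexes.
                def fold_tree(kind, lhs, rhs, val):
                    acc = [0] * len(kind)
                    for i in range(len(kind)):
                        if kind[i] == "lit":
                            acc[i] = val[i]
                        else:
                            acc[i] = acc[lhs[i]] + acc[rhs[i]]
                    return acc[-1]  # the root is the last cell

                print(fold_tree(kind, lhs, rhs, val))  # 6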

            1. 1

              Instead of vectorized code, we have arborized code. Use functional operations that transform tree values

              I think the Haskell community has lots of libraries/types/functions for transforming trees in a functional way. Perhaps that would be a good place to look to get inspiration for a “basis set” of tree transformation operations that are needed/most common. I see the “squeeze and freeze”-related link above touches on Haskell too.

              I hear words like “catamorphism” thrown around when Haskell people are talking about generalizations of tree transformation operations, although I don’t know precisely what that means.
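              From what I gather, it’s roughly a fold generalized to recursive structures: you supply one function per constructor and the traversal itself is factored out. A rough Python sketch of my understanding (two-constructor case):

                  from dataclasses import dataclass
                  from typing import Callable, Union

                  @dataclass
                  class Lit:
                      val: int

                  @dataclass
                  class Add:
                      left: "Expr"
                      right: "Expr"

                  Expr = Union[Lit, Add]

                  # "Catamorphism" (as I understand it): replace each
                  # constructor with a function; the recursion is generic.
                  def cata(on_lit: Callable, on_add: Callable, e: Expr):
                      if isinstance(e, Lit):
                          return on_lit(e.val)
                      return on_add(cata(on_lit, on_add, e.left),
                                    cata(on_lit, on_add, e.right))

                  tree = Add(Add(Lit(1), Lit(2)), Lit(3))
                  print(cata(lambda v: v, lambda a, b: a + b, tree))   # 6
                  print(cata(str, lambda a, b: f"({a} + {b})", tree))  # ((1 + 2) + 3)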

          2. 1

            I called it “squeeze and freeze”. At the cost of a traversal / copy, you get a compressed immutable block that’s more type safe than Zig and more memory safe than Rust idioms. It’s integrated with the garbage collector. The closest thing I found was this work in GHC which I posted a few weeks ago: https://lobste.rs/s/hgct1d/efficient_communication_collection

            So it sounds to me like the idea of squeeze-and-freeze is for the programmer to identify data structures (for example, consider a tree) that should be “serialized” as follows:

            • the tree is originally represented by various independently allocated node objects connected by pointers,

            • and then after the squeeze-and-freeze transformation, it is represented as a struct-of-arrays (one array for each field within the node object type), where the nodes refer to each other with array indices instead of pointers, and nothing in the frozen tree is allowed to have any pointers to the outside (also, in the Haskell example, the tree must be fully evaluated, e.g. no lazy thunks in the middle of it). Also, the frozen tree is treated as a single unit by the garbage collector (so the GC doesn’t trace the references between nodes inside the tree, and it won’t reorder them either).

            Am I understanding the idea right?

            1. 2

              Yes, that’s right, except if you copy a subgraph:

              • You can pack the entries as tightly as you want, with no holes. So you don’t necessarily need a struct of arrays – the purpose of that was mainly to get rid of pointers and create integers.
              • You turn your pointers into uint8_t if there are less than 256 “cells”, uint16_t if there are less than 64 Ki “cells”, etc. Though I guess this requires knowing the subgraph size ahead of time, which is not always easy.
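              A toy of the packing step (my own layout, not oheap’s actual format):

                  import array

                  # "Squeeze": copy a subgraph into one contiguous block where
                  # each cell is (tag, left, right) and references are uint8
                  # cell numbers, valid because this subgraph has < 256 cells.
                  cells = array.array("B")

                  def emit(tag, left, right):
                      cells.extend((tag, left, right))
                      return len(cells) // 3 - 1  # cell index, not a pointer

                  LIT, ADD = 0, 1
                  a = emit(LIT, 1, 0)  # literal 1 (payload stuffed in "left")
                  b = emit(LIT, 2, 0)  # literal 2
                  root = emit(ADD, a, b)

                  # "Freeze": the block is immutable, and the GC can treat it
                  # as a single object with no internal pointers to trace.
                  frozen = bytes(cells)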

              The oheap blog post has the tiny Obj::Ref() implementation I was thinking of:

              http://www.oilshell.org/blog/2017/01/09.html

              If you want to discuss this further, feel free to join https://oilshell.zulipchat.com/. There’s a thread on our #performance channel about it!

              The other issue is that debuggers understand pointers, and this kind of breaks that – but so do the manual arenas. Those also break or complicate the use of essential tools like ASAN.