1. 28
    1. 16

      Oh my, this is exciting and embarrassing - this is my first paper written as a part of my PhD studies, and while I’m reasonably proud of the work, I am acutely aware that it is not a very well written paper.

      The existing discussion I think summarizes things pretty well - the benchmarks are microbenchmarks, the speedup is real and brings f-exprs into the realm of practicality, but it’s unclear whether they would work in existing, highly optimized Lisps, and a language with f-exprs is an entirely different language from one based on macros. I also do not have any formal proofs that the partial evaluation procedure is correct - I believe it is, and I eventually wanted to prove it in Coq or similar.

      Since doing this work, I still believe in f-exprs, but I was frustrated by the limits of the more heuristic decisions I had to make in the partial evaluator. My next attempt to make f-exprs practical was to be via tracing-JIT compilation, which I was also going to use to see how fast a Lisp I could make without “cheating” (I believe many optimizing Lisps don’t allow rebinding of primitives so that they can be optimized, but I think JIT compilation with deoptimization guards can do it better).

      In any case, my current research has taken a detour through CRDTs and distributed systems, but I’ll return to my possibly quixotic goal of making a fast practical language based on f-exprs before too long.

      I’m happy to answer any outstanding questions!

      1. 5

        If you have something new, post it here on Lobsters!

        1. 4

          I also do not have any formal proofs that the partial evaluation procedure is correct - I believe it is, and I eventually wanted to prove it in Coq or similar.

          For the record, I think that you are being too hard on yourself here: providing a mechanized proof is above what could/should be expected of most works in programming languages. It is super-nice when people do it, and impressive because it is a lot more work, but it is not currently a requirement in this scientific community. (Some problems are particularly tricky and have a past history of mistakes in pen-and-paper proofs, and in those cases mechanization is especially welcome.)

          I think that when we describe an algorithm in a programming-languages paper, we should at least formulate its specification – the correctness statements that should hold if it were correct. Even this can be instructive in some cases. It is then a reasonable expectation, but not a requirement for all kinds of work, that the authors would work out a pen-and-paper (mathematical) proof that the algorithm meets the specification, possibly on a simplified/idealized system. But there is also a lot of nice implementation-oriented work that does not come with a correctness proof and evaluates itself through other means (for example: good benchmarks, which are also a lot of work to do properly, or an extensive approach to testing).

          TL;DR: you don’t need a Coq proof. Working out a pen-and-paper proof is very nice if you can do it (it can be easy or very hard depending on the problem), but there are other ways to evaluate an implementation technique carefully, for example with a good testing strategy and good benchmarks.

          1. 2

            This is great work! I haven’t read the full paper yet, but will when I have some time.

          2. 2

            The performance evaluation of this work is not very good, and the way they describe it is fairly bad – it looks like they are giving inflated numbers by comparing their best system to a terrible system that is monstrously slow by design.

            1. 4

              The numbers aren’t inflated. The article is about fexprs, which are an historical feature of Lisp. Lisp had fexprs before it had macros, then abandoned them in favour of macros because fexprs are monstrously slow if implemented straightforwardly. Their baseline implementation of fexprs is comparable to the original Lisp implementation, and it is therefore reasonable to compare their new implementation against this implementation. They benchmark against other implementations as well.

              1. 2

                The numbers aren’t inflated.

                Thought experiment:

                1. I design a language with primitive machine numbers, but no primitive operations on them besides succ, pred, and an if-zero test. Addition can be implemented in O(n) (repeated increment: decrement one number and increment the other until one of them reaches zero), multiplication in O(n^2) (repeated addition), and exponentiation in O(n^3) (repeated multiplication). (See the sketch after this list.)

                2. I implement a compiler optimization that detects this implementation of addition and turns it into primitive addition, and likewise detects the implementation of multiplication and turns it into primitive multiplication. Now addition and multiplication are O(1), and exponentiation is O(n).

                3. I write a microbenchmark that compares the performance of exponentiation before and after my optimization, by measuring the time needed to compute (-1)^n. With the optimization the code is much faster. In fact I can pick any input n I want and I will get good numbers, and the larger the input, the larger the speedup factor. Maybe I pick n=100 and I get a 100,000x speedup over the non-optimized version (I made up this number; the point is that it’s big). If I don’t find this impressive enough (and I am very patient with my benchmarks), I pick n=400 and I get a 10^6x speedup. Whatever input I pick, the result will be “wow!”, and I get to choose how impressive the result looks simply by choosing the input.
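
                To make step 1 concrete, here is a minimal sketch in Scheme (my own illustration, not code from the paper; succ, pred, and zero? stand in for the assumed primitives, and it only handles non-negative numbers):

                  (define (succ n) (+ n 1))   ; stand-in for the primitive increment
                  (define (pred n) (- n 1))   ; stand-in for the primitive decrement

                  (define (add a b)           ; O(b): repeated increment/decrement
                    (if (zero? b) a (add (succ a) (pred b))))

                  (define (mul a b)           ; O(a*b): repeated addition
                    (if (zero? b) 0 (add a (mul a (pred b)))))

                  (define (pow a b)           ; repeated multiplication
                    (if (zero? b) 1 (mul a (pow a (pred b)))))

                The optimizer of step 2 would then pattern-match the bodies of add and mul and replace them with the machine’s native instructions.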

                This is what I call an “inflated” number. I see no point in saying in a paper abstract that a technique is 70,000x faster than another, other than to impress the reader. But this number, while certainly true/honest/faithful to the experiments, is arbitrary and was chosen to impress rather than to convey a scientific fact. Saying “The implementation is qualitatively much faster” would be okay, or maybe “For a subset of programs, that are representative of macro-like programming patterns, we eliminate the overhead of f-exprs completely, resulting in arbitrarily large speedups compared to un-optimized implementations.” Saying “It is 70,000x faster” in the abstract is bad science. (It’s okay to have a table for your benchmarks somewhere in the paper, with the actual numbers in it.) Same with “improving on NewLisp’s performance by 233x on one benchmark”.

                They benchmark against other implementations as well.

                They benchmark fibonacci. I don’t know of actual programs that people care about that have a similar performance profile to fibonacci.

                1. 3

                  The original Lisp interpreter did not implement numbers this way, and I’m not familiar with a language that does. That’s just a strawman. But it did implement FEXPRs in a way similar to what’s described in the paper. Therefore, the OP’s FEXPR implementation is neither “arbitrary” nor “chosen to impress”. The original Lisp implementation was a parse-tree interpreter, in which FEXPRs were trivial to implement. It was completed very quickly, years ahead of the compiler, and people started using it immediately for projects. FEXPRs (with their slow implementation) were used in real research projects. FEXPRs were finally abandoned because they couldn’t be compiled into efficient code: macros were invented as a replacement. So the paper is saying: yes, you can compile FEXPRs, and it’s a lot faster, at least for the common cases for which FEXPRs were used back then. The OP doesn’t do a good job of describing the historical context in which their results are relevant, but that is the historical context.

                  1. 3

                    The f-expr implementation is not arbitrary, but the 70,000x speedup number is. Take a smaller program and you get a smaller speedup; take a larger program and you get a larger speedup – because the optimization provides super-linear gains. 70,000x is just the number they got for one input, and they chose to emphasize it because it looks large.

                    That’s just a strawman.

                    I made an effort to present a parallel with a similar but different situation (a “thought experiment”) to clarify what I mean and to explain in more detail. I find your assessment unfair.

                    (Also, it is common to claim that “the lambda-calculus” is Turing-complete, and it precisely has linear-time addition in its simplest (non-machine) representation of natural numbers, and some people try to speed it up by recognizing important operations and replacing them by faster primitives. The example I gave is a pure thought experiment but it does resemble some language designs that have been proposed seriously and are worth considering: minimalist language with slow-by-design operations, with a clever optimizer that replaces them by fast primitives.)

                    I am not contesting that the OP’s baseline is reasonably faithful to the original Lisp implementation. My point is that it is slow by design, and when you compare yourself to a slow-by-design system, you observe a qualitative performance difference rather than meaningful constant factors; the specific numbers are not, in general, meaningful. Emphasizing impressive numbers to describe the difference is not a good idea, because those numbers are arbitrary – they depend on the specific measurements you make (and in this case there are very few of them).
                    The most problematic aspect is that evaluating mostly against known-bad implementations does not allow you to assess that the new/faster system is in fact “good enough”. For this you have to perform an adequate comparison to known-good systems, or perform another form of evaluation. The evaluation in the paper does not support its own scientific claims about performance.

                    I am not saying that the work itself, or the overall presentation, is bad – it’s an interesting approach and aspects of it are done with care – presenting operational semantics is nice, for example. I am saying that the evaluation of the work is not very good. The evaluation of a scientific paper should provide evidence for the scientific claims made in the paper, and for this it needs to be done carefully, thinking about what the claims are and what’s the appropriate way to evaluate them – and following the best practices in the scientific community, etc. You can do very nice work and have a weak evaluation, and that’s a valid (in fact fairly common) reason to criticize a scientific article, for example suggesting its rejection when it is submitted to a scientific venue.

                    1. 1

                      The f-expr implementation is not arbitrary, but the 70,000x speedup number is.

                      Yes, I think that’s right. I doubt that the first Lisp compiler would have gotten a 70,000x speedup over the initial Lisp interpreter for the kinds of programs being written in the early 1960s. You have to wield FEXPRs in a certain way to get the big slowdowns described in the paper, and early Lisp programmers would have avoided writing that kind of code. On the other side of the equation, the first Lisp compiler produced inefficient code and didn’t meet the standards of a modern optimizing compiler, so the gulf between interpreted and compiled speed would have been smaller.

                      A more in-depth analysis of why exactly the speedup is so great would be interesting to read; I’m just throwing out guesses. The difference between the IBM 704 that the first Lisp implementations ran on and modern CPUs might also be a factor. Early CPUs didn’t have a memory cache or pipelined, out-of-order, superscalar execution, which I can imagine might benefit compiled code more than parse-tree-interpreted code.

                      So yes I agree that the paper could do a better job of evaluation.

              2. 3

                But that’s the point: they’ve been able to get the language benefits of fexprs without the terrible slowdown. As I understand it, the goal was to make fexprs a practical replacement for macros, not to make Lisp faster.

                1. 3

                  But that’s the point: they’ve been able to get the language benefits of fexprs without the terrible slowdown.

                  The evaluation shows that on some examples their technique brings a large speed-up. This does not demonstrate that the terrible slowdowns are gone. To make a parallel: if the best-known algorithm for some problem is O(n), and you start from an O(n^4) algorithm and optimize it into an O(n^2) algorithm, you can show very impressive speedups and yet remain much slower than the competition (at n = 1000, that is a millionfold speedup, yet still a thousand times slower than the O(n) competitor). Maybe your optimized system is still unusable in practice for its intended use case, or maybe it isn’t – the point is that the impressive speedups do not provide evidence in either direction.

                  (The authors also make a more general and interesting claim: that they can optimize well the whole family of programs corresponding to the encoding of macro-using programs. This is more convincing, and it could be a strong point of the evaluation with a bit more exposition work. But they chose to emphasize the 70,000x performance number instead, and they spend most of their evaluation time/space on the comparison to known-bad systems.)

                  1. 1

                    But good Lisp implementations are pretty dang fast, and @gasche’s claim is that they compared their fastest implementation to a quite slow system. Thus there is no indication that their work allows fexprs to be fast enough to be used in actually fast Lisp systems that exist today.

                  2. 1

                    I’m curious as to whether the authors proved any equivalences between the plain and partially-evaluated languages. The paper shows no sign of it.

                    It’s easy to show fexpr languages running - doing things - and, if partially evaluated, doing things quickly. But do they do the right things? Proving things about fexpr languages seems much harder.

                    1. 4

                      This is why I have mixed feelings about fexprs.

                      I think it’s neat that, because they are first-class, a lot of other things in a Lisp can also be made first-class - notably modules. The same core language can encompass much more.

                      But… what does (+ 2 2) mean now? It might mean 4, and it might mean the list (+ 2 2), depending on whether the context it’s in allows me to pass in a fexpr that inspects the source. Partial evaluation can recover from some of the pragmatic problems with this, but the deeper semantic problem remains - it’s harder to figure out what a program means.
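
                      For example, a sketch in Kernel-style syntax (Kernel’s $vau being the classic operative/fexpr constructor; this is my illustration, not Kraken’s actual surface syntax):

                        ($define! as-value  ($lambda (x) x))        ; applicative: argument is evaluated
                        ($define! as-source ($vau (x) #ignore x))   ; operative (fexpr): operand arrives unevaluated

                        (as-value  (+ 2 2))    ; => 4
                        (as-source (+ 2 2))    ; => the list (+ 2 2)

                      Which result the caller gets depends entirely on which kind of combiner sits in operator position, which is exactly the ambiguity described above.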

                      (I’ve thought a bit about having two different kinds of bindings, one that can bind fexprs and one that can’t. Have the latter be the default, so any construct that allows passing in an arbitrary fexpr becomes syntactically obvious. I never got around to trying to implement it, though.)

                  3. 1

                    This looks interesting, but (without having read their paper yet) it looks like their performance numbers come from synthetic micro-benchmarks:

                    We show our partial evaluation and compilation framework produces code that is more than 70,000 times faster than naive interpretation due to the elimination of repeated work and exposure of static information enabling additional optimization. In addition, our Kraken compiler performs better compared to existing interpreted languages that support fexprs, including improving on NewLisp’s fexpr performance by 233x on one benchmark.

                    Almost nothing real is ever 70,000 times faster. I’d be curious what exactly is going on.

                    Aah: it’s 70,000x faster than their alternate implementation:

                    The incredible slowdown implied by these tables comes to full fruition in our RB-Tree test in Fig. 13. We kept this run shorter because Kraken’s non-partial-evaluating interpreter takes an incredibly long time even for 100 insertions (40 minutes). The compounding layers of repeated macro-like operative calls in the non-partially-evaluated Kraken version cause a 70,000x slowdown relative to the partial evaluated, optimized, and compiled version.

                    They do compare to NewLisp, which is much better. However, Fib(30) and inserting 10 nodes into an RB-tree and then summing are a bit small. I’d love to see, say, a parser for a simplified subset of JSON, or something like that.

                    1. 1

                      For practical reasons, I wish they had implemented this as an extension to one of the open-source Common Lisp or Scheme implementations.

                      Not only do Scheme and CL have a lot more libraries available, but it also sounds like the Kraken compiler is pretty slow. Do the performance benefits carry over to Lisps with better compilers? I don’t think this solves a big enough problem that I’d switch to another language to use it.

                      1. 3

                        You’d have to switch to another language to use it. If they removed the entire macro system from Scheme or CL and introduced a fexpr construct with completely different semantics, the resulting language wouldn’t meaningfully be Scheme or CL anymore.

                        1. 2

                          I don’t see why it can’t be orthogonal to normal macros, or maybe a layer below them.