1. 5

    I question the assertion that the interpreting slowdown is exponential; whether it takes a few dozen cycles per statement (as in a bytecode interpreter) or a few thousand (as in a pure AST interpreter), the same number of statements would need to be executed, and so the slowdown is roughly linear no matter how deep the call stack gets. The exception to that would be if the interpreter were itself getting interpreted for a subroutine call, and I can’t see any reason that an implementation would do such a thing.

    That said, one interesting place you might look for purely interpreted applications is in the Forth family; it’s fairly common to implement Forth using some form of threaded code, which is effectively an AST modulo the fact that neither the S nor the T really apply to Forth. You don’t see many programs written in Forth directly (though apparently Chuck Moore has written a VLSI layout package), but PostScript is fairly common and every implementation I’ve seen has been a pure interpreter.

    Another possibility, also outside the realm of traditional languages, is TeX, which seems to be completely uncompilable (even to bytecode); I’d definitely consider LaTeX to be a large application written entirely in TeX.

    1. 2

      and so the slowdown is roughly linear no matter how deep the call stack gets.

      You’re the first one to say anything about this! I need to think about this again. Right now I think you’re right. Hm, so the slowdown isn’t from where I think it is in that case. I hope I didn’t just waste more than a month on a false premise.

      Also, then I’m surprised there aren’t more examples of direct AST interpretation if it’s “just” a constant factor.

      Edit: Ok, so I think the issue is that the slowdown scales exponentially with the height of the call stack, but that’s unaffected by whether bytecode or a direct AST is interpreted. I’m trying to push primitives (like loops) out of the host language and have them written in L (a bit like Forth, but it’s more interpreted than compiled). Now, instead of calling a primitive D directly, I call A, which calls B, which calls C, which calls D, and if each of those has ~10 instructions in L then I get a 1000x slowdown. Is this reasoning right?

      The exception to that would be if the interpreter were itself getting interpreted for a subroutine call

      There are many nested layers but the interpreter itself isn’t interpreted (except possibly as a (very deliberate) test).

      Forth

      I guess I’m thinking of Forth as basically already being bytecode, since it’s a stack machine like most bytecode VMs. The same goes for PostScript, although I don’t know the language.

      TeX and LaTeX

      That’s an interesting example. I don’t know how it works internally; I’ve just heard it described as a macro expansion language. I’m not sure I’d say that’s interpreted, but it’s definitely not compiled to either assembly or bytecode. So it would be an example of what I’m looking for, although I don’t think I can actually make use of it.

      1. 4

        I think the way to look at it is that, as you push the veil back and write more and more of your logic in L, the slowdown applies to more and more of your execution time. It’s still not exponential; if a mixed program takes time t to run, and you have an interpretation overhead of o, the slowest you can possibly get by moving everything to interpretation is t*o, and the fastest it can get by moving everything to compiled code is t/o. Those bounds could be tightened somewhat by considering the time spent in interpreted code vs. compiled code separately; take i to be the ratio of time spent in interpreted code. Then the worst-case runtime would be t*i+(1-i)*t*o and the best-case runtime would be t*(1-i)+i*t/o.

        These bounds are difficult to think about intuitively, though, because they describe what happens if you moved everything to compiled code or to interpreted code. So, instead of looking at the possible performance improvement or loss, let’s look at the actual performance relative to the best case of everything being compiled. If the proportion of interpreted code in your program is i, the slowdown relative to the best possible case will simply be i*o+(1-i).

        It’s worth noting, though, that for the first few levels of pushing logic into your language, i grows (roughly) exponentially larger. The base has nothing to do with the interpretation overhead though; it’s simply the branching factor of your routines.
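        The formula above is easy to sanity-check numerically; a minimal sketch in plain Python, with the overhead o and proportion i picked arbitrarily for illustration:

```python
# Sketch of the slowdown model above: o is the interpretation
# overhead and i is the proportion of time spent in interpreted code.
def slowdown(i, o):
    """Slowdown relative to an all-compiled program: i*o + (1-i)."""
    return i * o + (1 - i)

# With a 30x overhead, interpreting 10% of the runtime costs ~3.9x,
# and interpreting all of it costs the full 30x -- linear in i,
# never worse than o itself.
for i in (0.1, 0.5, 1.0):
    print(f"i={i:.1f}: slowdown {slowdown(i, 30):.1f}x")
```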

        1. 1

          Thanks! Your comments here should be all the way at the top!

          If the proportion of interpreted code in your program is i, the slowdown relative to the best possible case will simply be i*o+(1-i)

          Yes, I agree, and the all-compiled program is a better baseline. This means that I should get a factor of o at worst, even if it’s all AST-interpreted.

          It’s worth noting, though, that for the first few levels of pushing logic into your language, i grows (roughly) exponentially larger.

          Hm, I’m wondering if this is what I’ve been seeing and what led me to the incorrect exponential conclusion.

          Still, that means both direct AST interpretation and layering will result in “only” a constant-factor slowdown? And it’s also “only” a constant factor to go from bytecode to native assembly.

          In general, is there somewhere I can run my reasoning (like this one) by some people to check for flaws? Other than posting on lobste.rs, that is. I’ve told some others and no one had seen the problem. At the same time as I’m re-doing the time-bound estimates, I’m also estimating how much time I lost because of this :(.

          Ok, one more thing. If functions in level j only call functions in level j-1 (and level 0 is primitives of the host language), then the time for a level-j call is exponential in j, right? (And this happens regardless of whether the levels are (all) interpreted bytecode or direct AST.)

          And that means I should make calls at as low a level as possible (instead of j-1) to make things faster. Except this goes in the wrong direction in terms of the abstraction I’m trying to provide (and the potential polymorphism that would allow). For example, not using the object system would be faster than using it.
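          For what it’s worth, the level-j model above can be written out directly; a toy sketch, assuming (as in the question) that every level-j routine executes b calls into level j-1 and that a level-0 host primitive costs one unit:

```python
# Toy cost model for the layering question: if every level-j
# routine executes b calls into level j-1, and a level-0 host
# primitive costs 1 unit, then cost(j) = b * cost(j-1) = b**j.
def cost(j, b=10):
    return 1 if j == 0 else b * cost(j - 1, b)

# Three layers of ~10-call routines gives the ~1000x factor
# mentioned earlier -- exponential in the number of layers, but a
# fixed constant once the layering stops growing.
print(cost(3))  # 1000
```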

        2. 1

          Is this reasoning right?

          I don’t think so.

          Perhaps think of it like this. A function can have more than one call site. Those call sites may have different stack depths. When executing that function, the interpreter will do the same work irrespective of what is on the stack.

          You may be thinking of nested loops? But that isn’t anything to do with how the program is run, but rather the algorithm which is implemented.

          1. 1

            Also, then I’m surprised there aren’t more examples of direct AST interpretation if its “just” a constant factor.

            You need to take into consideration how few toy languages make it into production use, where that constant factor results in significant savings in time and money. (Fewer servers, less developer downtime while tests run, etc.)

            1. 1

              I guess I’m thinking of Forth as basically already being bytecode, since it’s a stack machine like most bytecode VMs. The same goes for PostScript, although I don’t know the language.

              Hmm. Forths are typically pretty low level, and often run directly on hardware (though they often will use their own calling conventions, instead of the C one). The dynamism comes from the fact that words are “compiled” into a list of addresses to the other words, which can basically then be “jumped” to in succession. So, it’s actually slightly strange, and not really a VM with “a list of known operators that you switch on” in the traditional sense.
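              A loose sketch of that “list of addresses” idea, with Python functions standing in for the addresses of primitives and nested lists standing in for colon definitions (illustrative only; real Forths thread machine addresses, not closures):

```python
# Loose sketch of threaded code: a "word" is just a list of other
# words -- Python functions stand in for primitive addresses, and
# nested lists stand in for colon definitions.
stack = []

def lit2(): stack.append(2)                          # primitive
def lit3(): stack.append(3)                          # primitive
def add():  stack.append(stack.pop() + stack.pop())  # primitive

def run(word):
    for w in word:
        if callable(w):
            w()        # primitive: execute directly
        else:
            run(w)     # colon definition: thread into its list

FIVE = [lit2, lit3, add]   # like ": FIVE 2 3 + ;"
TEN = [FIVE, FIVE, add]    # like ": TEN FIVE FIVE + ;"
run(TEN)
print(stack)  # [10]
```

              There’s no opcode switch anywhere; execution just chases references, which is why it doesn’t look like a conventional bytecode VM.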

              1. 2

                Good point. Although it means Forths are even more compiled (and a less apt example of what I’m looking for) than languages that stop at bytecode.

          1. 3

            I am somewhere between appalled that he’d go so long with a label-less keyboard and ruefully aware that I would probably gripe but do nothing about it too :)

            1. 2

              It’s more amazing to me that you could go that long without developing muscle memory.

              1. 5

                I can touch-type, but I still look at the keyboard when I type complex passwords. I think my muscle memory is linked to sequences of key presses more than to specific characters. So I can type a sentence, but give me a random sequence of letters, numbers, and symbols and I’ll mess it up if I don’t glance at the keyboard.

                1. 2

                  I have my keyboard arranged in alphabetical order, mostly to troll anybody else who sits down at my keyboard. Oddly enough, I still find it useful to look at the keyboard when typing a complex password, even though the letters on the keys have nothing to do with the character that they result in.

            1. 4

              My automata theory is incredibly rusty, but isn’t it the case that you typically convert an NFA to a DFA, so that, say, you don’t have to simultaneously take two paths explicitly? I guess I could dig out my copy of the dragon book…

              1. 5

                You could, but NFA-to-DFA conversion is susceptible to exponential blowup in the number of DFA states. This is why production regex engines use techniques to avoid this, e.g., compiling the DFA lazily or compiling many small DFAs.

                1. 1

                  exponential blowup in the number of DFA states.

                  Yeah. That makes sense!

                2. 7

                  Yes, you could. However, if you compute the DFA strictly, you’ll usually end up with a HUGE (exponential in the size of the NFA) DFA, which then needs to be minimized to be workable. On the other hand, Thompson’s algorithm lazily generates the DFA as it’s needed, allowing you to still run in constant time per input character.
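                  A minimal sketch of the lazy construction, over a hand-written epsilon-free NFA for (a|b)*abb (the NFA encoding and names here are made up for illustration):

```python
# Minimal lazy-DFA sketch over a tiny epsilon-free NFA for
# (a|b)*abb: nfa[state][char] -> set of next NFA states.
nfa = {
    0: {"a": {0, 1}, "b": {0}},
    1: {"b": {2}},
    2: {"b": {3}},
}
start, accepting = frozenset({0}), {3}
dfa_cache = {}  # (frozenset of NFA states, char) -> next DFA state

def step(states, ch):
    key = (states, ch)
    if key not in dfa_cache:        # build this transition on demand
        nxt = set()
        for s in states:
            nxt |= nfa.get(s, {}).get(ch, set())
        dfa_cache[key] = frozenset(nxt)
    return dfa_cache[key]

def matches(text):
    states = start
    for ch in text:
        states = step(states, ch)   # one cache lookup per character
    return bool(states & accepting)

print(matches("aabb"), matches("abab"))  # True False
```

                  Only the DFA states the input actually reaches ever get built, which is how the up-front exponential blowup is avoided.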

                  1. 3

                    you’ll usually end up with a HUGE (exponential in the size of the NFA) DFA

                    That’s exaggerating a bit. There are pathological, worst case, regular expressions that are exponential in the size of the NFA, but most of them aren’t, and I wouldn’t say it’s “usual.”

                    1. 1

                      Thanks for that! I’ll have to look at Thompson’s algorithm. I guess I’m even more rusty than I thought…

                  1. 3

                    The example there about edges on an array is something I’ve had to directly deal with myself when I implemented a multidimensional image-processing function for GNU Octave. The problem is to find connected components in a binary image, where voxels are either 0 or 1. You want to find all the islands of ones, and you want to be flexible about whether diagonals count as being connected or not.

                    The problem, of course, is that along the edges you don’t want to check for neighbours outside of the image boundary. I found no good way to do this for an n-dimensional image, after a few attempts at writing very complicated code. In the end, I padded the whole image with a boundary of zeros and iterated over the padded image with an if(current_voxel) conditional; since the border voxels are all zero, the conditional skips them, and checking for lit neighbours at the original image’s boundaries simply finds no neighbours in the padded zero border.

                    The code was cleaner, but I incurred a big realloc, because N-dimensional images in Octave are stored as contiguous Fortran-order (column-major) arrays. I’m still looking for a better solution to this problem.
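                    For reference, a rough NumPy translation of the padding trick described above (the function name and the neighbour-count formulation are mine, just to make it concrete):

```python
import numpy as np
from itertools import product

# Rough NumPy version of the zero-padding trick: pad the binary
# image with a one-voxel border of zeros so neighbour lookups at
# the original boundary land on zeros instead of needing explicit
# bounds checks. Returns the number of lit neighbours per lit
# voxel (zeros elsewhere), for any number of dimensions.
def lit_neighbour_counts(img, diagonals=True):
    p = np.pad(img, 1)  # zero border; this is the one big copy
    offsets = [d for d in product((-1, 0, 1), repeat=img.ndim)
               if any(d) and (diagonals or sum(map(abs, d)) == 1)]
    out = np.zeros(img.shape, dtype=int)
    for off in offsets:
        # the neighbour of original voxel i along offset off sits
        # at p[i + 1 + off] in every dimension
        out += p[tuple(slice(1 + o, s + 1 + o)
                       for o, s in zip(off, img.shape))]
    return out * img
```

                    The single np.pad copy replaces all the per-voxel bounds checks, which is exactly the realloc trade-off mentioned above.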

                    So, how do you do this cleanly?

                    1. 2

                      I recently ran across a blog post that claims Julia’s ‘cartesian iterators’ make this less painful/ugly than normal (scroll down to “A multidimensional boxcar filter”). I haven’t put them to any serious use myself yet though, so I can’t really vouch for this either way.

                      1. 2

                        I’d use a union-find algorithm along one dimension at a time. Diagonals would simply be another set of axes.

                        1. 1

                          Can you explain this more? Note that iterating one dimension at a time is a bit tricky given that the thing is n-dimensional to begin with.

                      1. 3

                        I’d actually go a step further than the author: sometimes time spent not thinking about work can be the most productive possible use of your time. Programming is not an end unto itself, unless you’re writing development tools (and even then, your domain is usually not actually programming; instead, you’re writing tools to make writing some kind of software easier), and the projects that will be most useful to other people are likely to be the projects that you come up with while you aren’t actively programming.