1. 44
    1. 10

      The treatment of performance in this blog post is a bit dubious. In the design requirements, it claims:

      Over the years, processors and C compilers have gotten much better using a couple of techniques. These include pipelining, inlining functions, and branch prediction. Unfortunately, the parsers generated by most parser generators make it difficult for any of these techniques to apply. Most generated parsers operate with a combination of jump tables and gotos, rendering some of the more advanced optimization techniques impotent. Because of this, generated parsers have a maximum performance cliff that is extremely difficult to overcome without significant effort.

      (and then it goes to explain that one has to switch from Bison, used by the previous parser, to a hand-written recursive descent parser)

      But then, when evaluating the new parser, it says:

      Once the parser was able to produce semantically equivalent syntax trees, we began looking at performance. We don’t have great comparison numbers yet because as discussed our tree is different and does more things in general (for example we provide unescaped versions of strings on our string nodes to make life easier on the consumers of YARP).

      What we can share so far is that YARP is able to parse around 50,000 of Shopify’s Ruby files in about 4.49 seconds, with a peak memory footprint of 10.94 Mb. Needless to say, we’re thrilled with these results so far.

      It is trivial to run the old parser on the same inputs to get a meaningful comparison, and it would also be fairly easy to avoid the difference in produced syntax trees (by just removing all the AST-construction code from both parsers, leaving simple validators to compare). So why didn’t they do it? This gives the impression (which may or may not be true) that the new parser is in fact slower than the previous one, that they don’t want to admit it, and that they nonetheless kept a “Performance” goal which suggests it will be faster. Dubious.

      1. 9

        Note: the first paragraph does not actually claim that the new parser will be faster than the previous parser, and the design documents they link to very reasonably suggest that “about the same performance” should be the target goal. Still, mentioning performance as a design goal and then withholding the comparative performance evaluation is weird.

        1. 6

          Yeah, it looks like they haven’t gotten that deeply into performance yet, which they mention.

          The malloc() for each node raised my eyebrows, and it does seem to be reflected in the code:

          https://github.com/ruby/yarp/blob/main/src/yarp.c#L339

          In my experience with https://www.oilshell.org – which is quite similar in that it’s a recursive descent parser for a language that used yacc (bash) – malloc(), e.g. glibc’s, can easily account for over 50% of the parsing time.

          But it looks like it should be pretty easy to provide a hook for consumers, and get a huge speedup in most cases by replacing malloc. The problem is that when you have a parser with many clients (interpreters, tools, etc.), you don’t know what memory management scheme they’ll use.
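
          To make the idea concrete, here is a rough sketch of what such a hook could look like – the node_allocator_t type and the bump arena are made up for illustration, not YARP’s actual API:

          ```c
          #include <stddef.h>
          #include <stdlib.h>

          /* Hypothetical allocation hook -- NOT YARP's real API, just the shape of the idea.
           * The parser would call hook->alloc(hook->state, size) for every node instead of
           * calling malloc() directly, so a consumer can plug in whatever scheme it wants. */
          typedef struct {
              void *(*alloc)(void *state, size_t size);
              void *state;
          } node_allocator_t;

          /* Default behaviour: forward to malloc(), so nothing changes for existing users. */
          static void *alloc_malloc(void *state, size_t size) {
              (void)state;
              return malloc(size);
          }

          /* A bump arena: one big block, nodes handed out by advancing a pointer, and the
           * whole tree released in a single free() when the consumer is done with it. */
          typedef struct { char *base; size_t used, capacity; } arena_t;

          static void *alloc_arena(void *state, size_t size) {
              arena_t *a = state;
              size = (size + 7) & ~(size_t)7;                 /* keep 8-byte alignment */
              if (a->used + size > a->capacity) return NULL;  /* real code would chain a new block */
              void *p = a->base + a->used;
              a->used += size;
              return p;
          }
          ```

          The parser core stays agnostic: hand it { alloc_malloc, NULL } for the status quo, or an arena for batch tools that parse a file, walk the tree once, and throw the whole thing away.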


          Otherwise it’s a very good article, looks like a big achievement in cleaning up 20 year old code

          Whenever people say “parsing is solved”, we can point them at the clean slate 12,000 lines of C that is the best effort to understand Ruby’s syntax :) As they said, this was partly motivated by bugs

          My takeaway, as always, is that humans like more subtle syntax than is practical with current parsing algorithms. Parser engineering is tough. I’m more of a Python person, but a lot of real software has been written in Ruby and the language seems to fit many people’s brains very well.


          A small correction I posted, in one place WRT Python, they wrote “more ambiguous”, it really should be “more lenient” - https://news.ycombinator.com/item?id=36311573

          1. 2

            Whenever people say “parsing is solved”, we can point them at the clean slate 12,000 lines of C that is the best effort to understand Ruby’s syntax :) As they said, this was partly motivated by bugs

            This is a minor point, since I don’t want to argue that “parsing is solved”, but I’d interpret that claim as “there aren’t significant open problems in generating high quality parsers for industrial languages.” Having to rewrite an almost 30 year old parser using widely used techniques doesn’t really seem like evidence one way or another.

            1. 2

              I’d say that the open problem is being able to “declaratively” express the syntax that humans actually like, rather than what CFGs can express

              e.g. in Ruby or Rust

              Graydon mentioned language design and parser engineering as a real problem in this recent article https://lobste.rs/s/47amaq/rust_i_wanted_had_no_future

              If humans preferred Lisp syntax or even Pascal syntax over Ruby or Rust, then it could be argued parsing is “solved” :)

              Parsing has basically lost its status as “theory”. It’s now engineering elbow grease and art. It uses theory, but as a whole it’s not explained – it’s a bag of 100 tricks that are particular to certain languages and certain programs (ask me about my current issues with OSH and YSH, or about how to solve the “$PS2 problem”)

              matklad’s recent Resilient LL parsing is another good example of that – it’s not in books.


              Another significant open problem is automatically inverting parsers, rather than writing 2 or 3 parsers (again mentioned in the Rust community as a problem)

              That’s a very nice engineering benefit of declarative syntax – it doesn’t specify the recognition algorithm, similar to how PEGs have at least 2 different algorithms, and CFGs have a dozen or more, of varying power

              (Parsing is also still being worked on in academia, so presumably they don’t agree there aren’t significant open problems)

          2. 5

            Most generated parsers operate with a combination of jump tables and gotos, rendering some of the more advanced optimization techniques impotent

            Don’t have solid numbers at hand, but I’ve heard this claim several times, and it makes intuitive sense to me: recursive descent with direct function calls seems more mechanically sympathetic than jump-table based LR. I think this is analogous to interpreters: a giant switch statement is slower than threaded code (https://jilp.org/vol5/v5paper12.pdf might be the classical paper on the topic?).
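
            As a toy illustration of the interpreter side of that analogy – nothing here is from YARP, and the computed-goto form relies on a GCC/Clang extension – the two dispatch styles look roughly like this:

            ```c
            #include <stdio.h>

            /* Tiny bytecode: INC increments, DEC decrements, HALT stops.  Just enough to
             * contrast the two dispatch styles. */
            enum { OP_INC, OP_DEC, OP_HALT };

            /* Style 1: one giant switch -- a single, hard-to-predict indirect branch. */
            static int run_switch(const unsigned char *code) {
                int acc = 0;
                for (;;) {
                    switch (*code++) {
                    case OP_INC:  acc++; break;
                    case OP_DEC:  acc--; break;
                    case OP_HALT: return acc;
                    }
                }
            }

            /* Style 2: threaded code via computed goto (GCC/Clang extension) -- every opcode
             * ends in its own indirect jump, which branch predictors handle much better on
             * repetitive instruction streams. */
            static int run_threaded(const unsigned char *code) {
                static void *dispatch[] = { &&do_inc, &&do_dec, &&do_halt };
                int acc = 0;
                goto *dispatch[*code++];
            do_inc:  acc++; goto *dispatch[*code++];
            do_dec:  acc--; goto *dispatch[*code++];
            do_halt: return acc;
            }

            int main(void) {
                const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
                printf("%d %d\n", run_switch(prog), run_threaded(prog));  /* prints "1 1" */
                return 0;
            }
            ```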

            I think this criticism is a bit misplaced, and applies only to a specific style of generated parsers. You could avoid the giant state-transition table and instead emit a bunch of mutually recursive functions, using a technique called “recursive ascent”. LALRPOP has (had?) a mode for this.
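
            A minimal recursive-ascent sketch, made up for the toy grammar E -> E '+' n | n (not taken from YARP or LALRPOP): each LR state becomes a function, a shift is a direct call, and a reduction returns how many stack frames still have to be popped, so what comes out is exactly the sort of mutually recursive code that compilers are good at inlining and predicting:

            ```c
            #include <stdio.h>
            #include <stdlib.h>

            static const char *input;
            static int tok, tokval;          /* current token: 'n', '+', or 0 for end of input */

            static void next(void) {
                while (*input == ' ') input++;
                if (*input == '\0') { tok = 0; return; }
                if (*input >= '0' && *input <= '9') { tok = 'n'; tokval = *input++ - '0'; return; }
                tok = *input++;
            }

            static int sem;                  /* value produced by the most recent reduction of E */

            static int state1(int left);
            static int state2(int n);
            static int state3(int left);
            static int state4(int left, int n);

            /* State 0: start.  Shift a number, then loop on the goto for E. */
            static void state0(void) {
                if (tok != 'n') { fprintf(stderr, "expected number\n"); exit(1); }
                int v = tokval; next();
                int pops = state2(v);        /* shift n */
                while (pops == 0)            /* a reduction uncovered this state: goto on E */
                    pops = state1(sem);
            }

            /* State 1: E on top of the stack.  End of input accepts; '+' shifts. */
            static int state1(int left) {
                if (tok == 0) { printf("result = %d\n", left); exit(0); }
                if (tok != '+') { fprintf(stderr, "expected '+'\n"); exit(1); }
                next();
                return state3(left) - 1;     /* shift '+'; reductions pop through this frame */
            }

            /* State 2: E -> n .   Reduce: pop exactly this frame. */
            static int state2(int n) {
                sem = n;
                return 0;
            }

            /* State 3: E '+' on the stack.  Shift a number. */
            static int state3(int left) {
                if (tok != 'n') { fprintf(stderr, "expected number after '+'\n"); exit(1); }
                int v = tokval; next();
                return state4(left, v) - 1;  /* shift n; the reduction pops through this frame */
            }

            /* State 4: E -> E '+' n .   Reduce: pop this frame plus two more. */
            static int state4(int left, int n) {
                sem = left + n;
                return 2;
            }

            int main(void) {
                input = "1 + 2 + 3";
                next();
                state0();                    /* prints "result = 6" */
                return 0;
            }
            ```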

            Take with a grain of salt, as I’ve seen exactly zero benchmarks comparing performance of various parsing techniques.

            1. 2

              The lookup tables generated by that era of parser generators were, in part, designed for processors with minuscule caches, only basic instruction prefetch, limited branch prediction, and very narrow memory bandwidth. So it makes sense that modern machines would not benefit from the same optimization techniques.

            2. 3

              These blog posts are often coordinated across teams, and not all teams have the same focus. So they could very well have good reason to believe their performance numbers will be better, but they have been focused on the correctness and portability aspects without the time to whittle things down to an apples-to-apples comparison benchmark.

            3. 8

              I found this amusing:

              The last reason to switch to hand-written recursive descent actually comes from Matz himself. In version 0.95 of Ruby — released in 1995 — a small ToDo file was included in the repository. One of very few items in that file was:

              Hand written parser(recursive decent)

              1. 3

                Me too, that’s the longest TODO resolution I’ve ever heard of. 28 years later: marked as completed

              2. 6

                In the 25 years that the parser has existed, only 65 people have contributed to it, and only 13 of those have contributed more than 10 commits. In the last year, only 9 people have contributed to the parser, of which only 2 have contributed more than 10 commits.

                Wow. I knew it was scary code, but that’s a scary stat.

                I’m most excited about a stable syntax tree with error tolerance. It will let me rewrite a lot of syntax suggest (hopefully) and operate on tokens instead of lines. I should probably sync with Kevin some time on it.

                1. 4

                  Building a new parser for Ruby seems pretty close to Herculean to me. Kudos to Kevin!

                  1. 3

                    This is heartening, though:

                    Fortunately since open-sourcing the repository at the beginning of this year, we’ve had 31 contributors add code to the parser. We’ve been working to improve our contributing guidelines and guidance to make it even easier to contribute going forward.

                  2. [Comment removed by author]