For the hobby languages that I’ve designed and implemented, I used an LR(1) parser generator (Bison) to debug the grammar as I designed it. Bison’s conflict reports indicate bugs (places where my grammar is ambiguous). I don’t use %left, %right, or %precedence to make the errors go away; I redesign the grammar instead. This keeps the grammar simple and predictable: there’s a clear hierarchical structure with a fixed number of precedence levels, and each non-terminal belongs to a specific, clearly identified precedence level.
The result of this process is a simple, unambiguous grammar that is trivial to translate into a hand-written recursive descent parser. Once I’ve debugged the grammar using Bison, I find the parser to be the easiest part of the compiler to write (and maintain). The parts I find challenging are everything that comes after: semantic analysis, optimization, partial evaluation, code generation. Tree-walking interpreters are pretty easy.
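For illustration, here’s roughly what that translation looks like on a made-up two-level grammar (a sketch, not my actual grammar): each precedence level is one non-terminal in the Bison rules and one function in the C++ parser, with Bison’s left recursion becoming a loop.

    // Grammar (Bison form, shown as commentary):
    //   expr   : expr '+' factor | factor
    //   factor : factor '*' atom | atom
    //   atom   : NUMBER | '(' expr ')'
    #include <cctype>
    #include <iostream>

    const char* p;                       // cursor into the input

    int parseExpr();                     // forward declaration

    int parseAtom() {
        if (*p == '(') {
            ++p;                         // consume '('
            int v = parseExpr();
            ++p;                         // consume ')'
            return v;
        }
        int v = 0;
        while (std::isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
        return v;
    }

    int parseFactor() {                  // one function per precedence level
        int v = parseAtom();
        while (*p == '*') { ++p; v *= parseAtom(); }
        return v;
    }

    int parseExpr() {
        int v = parseFactor();
        while (*p == '+') { ++p; v += parseFactor(); }
        return v;
    }

    int main() {
        p = "2+3*(4+1)";
        std::cout << parseExpr() << "\n";   // prints 17
    }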
TIL that Guy Steele recommends the same approach for writing parsers.
Do you keep the grammar around for documentation and testing? I kind of wish more languages would do that… it’s so easy to add ambiguities into recursive descent parsers down the line!
So by not using left/right/precedence annotations, you force yourself to handle precedence the old-school way, by adding additional rules, which matches how precedence is implemented in recursive descent. Interesting!
I’ve done the same, but how do you handle the mismatch in grammar requirements between the generator (LR) and recursive descent (LL)? That mismatch is why PEG seems like a better basis for generators, to me.
My recursive descent parser, as written, requires one token of lookahead. That probably means the grammar isn’t LL. My lexical analyser supports lookahead (I can read a token then push it back onto the input stream). Certain parse functions will look ahead to see if the current parse is followed by an end token such as ‘)’, and will return immediately if this is seen. The unconsumed ‘)’ token is consumed by another parse function higher in the call stack.
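Sketched out, the mechanism looks something like this (hypothetical names, heavily simplified from the real parser): the lexer keeps a one-slot pushback buffer, so a parse function can read a token, see that it’s an end token like ‘)’, and hand it back for a caller to consume.

    #include <iostream>
    #include <optional>
    #include <string>
    #include <vector>

    struct Token { std::string text; };

    class Lexer {
        std::vector<Token> tokens;       // stand-in for real lexing
        size_t pos = 0;
        std::optional<Token> pushed;     // one-slot pushback buffer
    public:
        explicit Lexer(std::vector<Token> t) : tokens(std::move(t)) {}
        Token next() {
            if (pushed) { Token t = *pushed; pushed.reset(); return t; }
            return pos < tokens.size() ? tokens[pos++] : Token{"<eof>"};
        }
        void pushBack(Token t) { pushed = std::move(t); }
    };

    // Returns as soon as it sees ')', leaving it unconsumed for the caller.
    void parseList(Lexer& lex) {
        for (;;) {
            Token t = lex.next();
            if (t.text == ")" || t.text == "<eof>") {
                lex.pushBack(t);         // caller consumes the end token
                return;
            }
            std::cout << "element: " << t.text << "\n";
        }
    }

    int main() {
        Lexer lex({{"("}, {"a"}, {"b"}, {")"}});
        lex.next();                      // consume '('
        parseList(lex);
        std::cout << "closer: " << lex.next().text << "\n";
    }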
The parser I’m talking about is 749 lines of C++ including comments (the Bison grammar is embedded as commentary). My next language will have a simpler grammar requiring less code. I don’t mind writing the parser this way, in part because I feel that the size and complexity of the code is directly related to the difficulty of parsing the code in your head, and I want to reduce the cognitive load that the language places on human readers. If I had a terse DSL that let me create, in a few lines, a grammar too complex for me to understand as a user, I feel that wouldn’t help me reach my goal.
The tokenizer is also hand-coded in C++, and unlike the parser, I feel the code is too hard to understand and maintain. In retrospect, I would have preferred to use a DSL that describes tokens using regular expressions. That’s for next time.
Discussion when it was published - https://lobste.rs/s/9pcqys/which_parsing_approach (2020)
Very good in general, but it doesn’t mention operator precedence/precedence climbing or Pratt parsers, where you write a little sub-parser that lets you write your operator precedence in a lookup table. It’s a pretty common approach, and one that makes a lot of sense to me.
…Actually now that I reread it I disagree with like 80% of his conclusions/preferences, but he does a very good job talking about them.
As soon as I read about & grokked Pratt parsing, it became my favorite technique.
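For anyone who hasn’t run into it, here’s a minimal precedence-climbing sketch (my own toy example, not from the article): the binding powers live in a lookup table rather than in the shape of the grammar, so adding an operator means adding a table entry instead of a new rule.

    #include <cctype>
    #include <iostream>
    #include <map>

    const char* p;                                // cursor into the input

    std::map<char, int> prec = {{'+', 1}, {'-', 1}, {'*', 2}, {'/', 2}};

    int parsePrimary() {
        int v = 0;
        while (std::isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
        return v;
    }

    int parseExpr(int minPrec) {
        int lhs = parsePrimary();
        while (prec.count(*p) && prec[*p] >= minPrec) {
            char op = *p++;
            int rhs = parseExpr(prec[op] + 1);    // +1 makes it left-associative
            switch (op) {
                case '+': lhs += rhs; break;
                case '-': lhs -= rhs; break;
                case '*': lhs *= rhs; break;
                case '/': lhs /= rhs; break;
            }
        }
        return lhs;
    }

    int main() {
        p = "2+3*4-5";
        std::cout << parseExpr(0) << "\n";        // prints 9
    }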
I generally just write a recursive descent parser. Next project, I may try a parser combinator, but mainly because I’m curious – a friend wrote one, and explained the concept to me, and I like the concept.
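As I understand the concept (a toy sketch of the idea, not my friend’s implementation): a parser is a function from input to an optional result plus the remaining input, and combinators glue small parsers into bigger ones.

    #include <functional>
    #include <iostream>
    #include <optional>
    #include <string>
    #include <string_view>
    #include <utility>

    template <typename T>
    using Result = std::optional<std::pair<T, std::string_view>>;

    template <typename T>
    using Parser = std::function<Result<T>(std::string_view)>;

    // Match one specific character.
    Parser<char> ch(char c) {
        return [c](std::string_view in) -> Result<char> {
            if (!in.empty() && in[0] == c)
                return std::pair{c, in.substr(1)};
            return std::nullopt;
        };
    }

    // Match zero or more repetitions, collecting the results.
    Parser<std::string> many(Parser<char> p) {
        return [p](std::string_view in) -> Result<std::string> {
            std::string out;
            while (auto r = p(in)) {
                out.push_back(r->first);
                in = r->second;
            }
            return std::pair{out, in};
        };
    }

    int main() {
        auto as = many(ch('a'));
        if (auto r = as("aaab"))
            std::cout << "parsed: " << r->first << ", rest: " << r->second << "\n";
    }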
BTW - parsing is the easiest 1% of a compiler. Not sure why people get so focused on this one aspect of a language; it may be interesting, but it’s really a tiny portion of the effort.
Because it is nicely contained, nothing works until it works, and it kills most of the projects that try to write a language.
A lot of the rest is cherries on top: improvements, problems to solve to have a better language.
Getting a parser that works, though? Project killer.
I wrote my parser using a parser-combinator library, lexy.
It was hard, but I managed to do it.
I should really write about it and publish it. It’s a small Pythonic language that I preparse with Python to generate indent tokens, and then I generate a JSON AST.
Now I need to translate it to C.
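The indent-token step is the part most worth writing up; the general idea looks something like this (a C++ sketch of the technique, not my actual Python preparser): keep a stack of indentation widths and emit INDENT/DEDENT tokens as the width changes.

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    int main() {
        std::istringstream src("a\n  b\n  c\nd\n");
        std::vector<int> widths{0};          // indentation levels seen so far
        std::string line;
        while (std::getline(src, line)) {
            int w = 0;
            while (w < (int)line.size() && line[w] == ' ') ++w;
            if (w > widths.back()) {         // deeper: emit one INDENT
                widths.push_back(w);
                std::cout << "INDENT\n";
            }
            while (w < widths.back()) {      // shallower: one DEDENT per level
                widths.pop_back();
                std::cout << "DEDENT\n";
            }
            std::cout << "LINE " << line.substr(w) << "\n";
        }
        while (widths.size() > 1) {          // close any open blocks at EOF
            widths.pop_back();
            std::cout << "DEDENT\n";
        }
    }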