
  2. 2

    This post makes me feel a bit like an NPC in a game. I wish you luck in your adventure, and I’ll have some experience points and praise in the form of GitHub stars and editor-integration patches for you if you come back from the dungeon with nicer IDE integration for Rust, but I myself want absolutely no part in this “parsing without formal underpinning” shenanigans; there be dragons.

    :)

    1. 1

      Good points. Another one: Does rust-analyzer provide (or aim to provide) completion of variable names within printf format strings?

      https://doc.rust-lang.org/std/fmt/

      If so, then you have the sublanguage composition problem that essentially all parser generators are poor at.

      I elaborated on that in the original thread: https://lobste.rs/s/9pcqys/which_parsing_approach#c_lu3gu7

      Analogously, all “real” C/C++ compilers parse printf format strings. The problem is not just to parse the C language; it’s to parse this composition of languages.


      I think the argument boils down to definitions: how many use cases count as “basic parsing”, and how many need “production quality”? Some people may be prototyping languages and can get by with “basic parsing” and an LR parser generator, but it seems to me that the far more common problems require a hand-written parser, as you say and as many people in /r/ProgrammingLanguages have found (links in the original thread).

      But I still agree that writing parsers by hand is annoying and has problems… We do need better parser generators, and it’s great to see people working on that.

      1. 1

        Does rust-analyzer provide (or aim to provide) completion of variable names within printf format strings?

        Not yet, but we do something cooler. For rust-analyzer’s own tests, we analyze certain string literals as Rust code, to provide syntax highlighting (which, e.g., runs type inference). I feel “language composition” is very similar to “minimal-length syntax repair” – an academically and mathematically interesting problem which, in my experience, is irrelevant for the task of building an IDE. There’s a much simpler way to do composition: in the outer language, parse a single token (a string literal, the contents of a <script> tag, etc.). In the semantic layer, compute the string value of the token (this handles escape sequences). Then feed the resulting string value to a completely separate second parser.

        This is easier and more powerful than grammar composition. For example, you can use semantic info to decide the target language of string literals.
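
        To make that concrete, here’s a minimal sketch of the two-stage approach (hypothetical helper names and a toy {name} placeholder language, not rust-analyzer’s actual code or Rust’s real format-string grammar):

        ```rust
        // Sketch of "parse a token, then re-parse its cooked value".

        /// Stage 1, semantic layer: compute the cooked string value of a
        /// literal token. Only `\n` and `\\` are handled, for brevity.
        fn cooked_value(raw_literal: &str) -> String {
            let inner = raw_literal.trim_matches('"');
            let mut out = String::new();
            let mut chars = inner.chars();
            while let Some(c) = chars.next() {
                if c == '\\' {
                    match chars.next() {
                        Some('n') => out.push('\n'),
                        Some(other) => out.push(other),
                        None => {}
                    }
                } else {
                    out.push(c);
                }
            }
            out
        }

        /// Stage 2, a completely separate inner-language parser: here it
        /// just extracts `{name}` placeholders.
        fn parse_placeholders(value: &str) -> Vec<String> {
            let mut names = Vec::new();
            let mut rest = value;
            while let Some(open) = rest.find('{') {
                match rest[open + 1..].find('}') {
                    Some(len) => {
                        names.push(rest[open + 1..open + 1 + len].to_string());
                        rest = &rest[open + 1 + len + 1..];
                    }
                    None => break,
                }
            }
            names
        }

        fn main() {
            // The outer parser sees exactly one token: the string literal.
            let token = r#""hello {name}\n{count} items""#;
            let value = cooked_value(token); // escape sequences resolved here
            assert_eq!(parse_placeholders(&value), vec!["name", "count"]);
        }
        ```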

        1. 1

          Hm what is the image showing exactly? It’s completing from history?

          OK, I can see that for Rust format strings you don’t really need language/parser composition. The distinction is whether the inner language is recursive or not. In Rust it’s not, but in shell it is. I elaborated on that here:

          http://www.oilshell.org/blog/2017/12/17.html

          But with the “feed a token into another parser” solution, you still have the location-info problem. I’m sure it’s not too hard either way, but I find it convenient to always attach location info to tokens, and not have to do adjustments based on “nested” parsing.
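
          For what it’s worth, one way to deal with the location-info problem in that scheme is to build an offset map while unescaping, so positions from the inner parser can be translated back to outer-file positions. A rough sketch with invented names, handling only a couple of ASCII escapes:

          ```rust
          /// Unescape a quoted literal. Returns the cooked text, plus one
          /// outer-file byte offset per cooked byte, so inner-parser
          /// diagnostics can be mapped back to the original source.
          fn unescape_with_map(raw: &str) -> (String, Vec<usize>) {
              let bytes = raw.as_bytes();
              let mut cooked = String::new();
              let mut map = Vec::new();
              let mut i = 1; // skip the opening quote
              while i + 1 < bytes.len() {
                  // stop before the closing quote
                  let start = i;
                  let ch = if bytes[i] == b'\\' {
                      i += 2; // an escape is two bytes in the outer source
                      match bytes[start + 1] {
                          b'n' => '\n',
                          other => other as char,
                      }
                  } else {
                      i += 1;
                      bytes[start] as char
                  };
                  cooked.push(ch);
                  map.push(start); // cooked byte -> offset in the outer file
              }
              (cooked, map)
          }

          fn main() {
              let raw = r#""a\nb""#; // the outer token, escapes still raw
              let (cooked, map) = unescape_with_map(raw);
              assert_eq!(cooked, "a\nb");
              // An inner-parser error at cooked offset 2 (`b`) maps back
              // to outer offset 4, past the two-byte `\n` escape.
              assert_eq!(map[2], 4);
          }
          ```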


          Also, when I say “language composition”, I don’t mean anything mathematical. I just mean that there are two languages in the same source file, which is true when you have format strings, although in Rust’s case they’re not recursive, so you could argue it’s not a language (as opposed to Python, JS, Swift, shell, Perl, HTML, etc.).

          1. 2

            Hm what is the image showing exactly? It’s completing from history?

            This is a screenshot of a test from the rust-analyzer test suite. Everything inside r#""# is just a multiline raw string literal. However, because it is passed to the check function, rust-analyzer knows that the literal represents Rust code, and highlights it as such.

            But with the “feed a token into another parser solution”, you still have the location info problem

            For string literals, the problem is in some sense inherent, due to escape sequences: \n in the string literal in the outer language would be a single character in the inner language, and keywords in the inner language can even be spelled using Unicode escapes.

            The distinction is if the inner language is recursive or not.

            For a recursive language, an example I always look at is Kotlin; the relevant bits are in its Lexer and Parser. The key takeaway is that no coordination between the parser and the lexer is required: the lexer just counts curly braces.
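
            Something like this brace-counting sketch (an invented helper, not Kotlin’s actual lexer):

            ```rust
            /// `src[start..]` must begin with `${`. Returns the offset just
            /// past the matching `}`. The template expression may contain
            /// nested braces (lambdas, blocks), hence the depth counter:
            /// the lexer never parses the inner expression at all.
            fn template_expr_end(src: &str, start: usize) -> Option<usize> {
                assert!(src[start..].starts_with("${"));
                let mut depth = 0usize;
                for (i, ch) in src[start..].char_indices() {
                    match ch {
                        '{' => depth += 1,
                        '}' => {
                            depth -= 1;
                            if depth == 0 {
                                return Some(start + i + 1);
                            }
                        }
                        _ => {}
                    }
                }
                None // unterminated template expression
            }

            fn main() {
                // The template expression contains a lambda with its own braces.
                let src = r#""list: ${items.map { it.name }}!""#;
                let start = src.find("${").unwrap();
                let end = template_expr_end(src, start).unwrap();
                assert_eq!(&src[start..end], "${items.map { it.name }}");
            }
            ```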

      2. 1

        Completely off topic, but I find it interesting how many people incorrectly write “payed” rather than “paid”, including myself.

        I wonder if there’s some kind of linguistic psychology involved that makes the problem common even amongst educated people. It’s hardly the most bizarre idiosyncrasy in English.