1. 4
  1.  

  2. 1

    Here is a proposal for a context-free grammar interchange format, which most (all?) context-free grammars can be reduced to:

    1. A grammar is defined as a JSON object with nonterminal symbols as keys.
    2. Each nonterminal symbol starts with < and ends with >.
    3. Any key that does not start with < and ends with > is considered meta information.
    4. The definition associated with each nonterminal is given by a list of (expansion) rules. Empty definitions are not allowed.
    5. An expansion rule is given by a list of tokens. An empty list represents epsilon rule.
    6. A token can either be a nonterminal or a terminal symbol in the grammar.
    7. terminal symbols should be nonempty, and should match exactly the characters present in the symbol, and should not start with < and end with > (but can start with or end with these characters).
    8. All nonterminal symbols used should be defined.

    Here is an expression grammar:

    grammar = {
        "START" : "<expr>",
        "<expr>": [
            ["<term>", "+", "<expr>"],
            ["<term>", "-", "<expr>"],
            ["<term>"]],
        "<term>": [
            ["<fact>", "*", "<term>"],
            ["<fact>", "/", "<term>"],
            ["<fact>"]],
        "<fact>": [
            ["<digit+>"],
            ["(","<expr>",')"]],
        "<digit+>": [
            ["<digit>","<digit+>"],
            ["<digit>"]],
        "<digit>": [["0"], ["1"], ["2"], ["3"], ["4"],
                    ["5"], ["6"], ["7"], ["8"], ["9"]],
    }
    

    Conventions; The default start symbol is specified as the value of START (optional). Use <key*> for the nonterminal that specifies zero or more <key>s and <key+> for when the nonterminal specifies one or more keys, but provide the definitions within the grammar (see <digit+> abive). Use the grammar name as the value of the root key. In the above <expr>. All caps nonterminals are reserved for ANTLR like lexical tokens. E.g., "<END>" : [["end"]]. Do not define or use <>. It is reserved.

    The reason I propose this format is that this format contains just the basic necessities of representing a context-free grammar, and since it is in JSON (which can easily be transcribed into a binary format if necessary), makes loading such grammars into programs such as parsers and fuzzers, and processing them simpler, and given there is no pre-processing necessary to obtain the basic context-free grammar underneath, there is no chance for errors or minor differences between implementations.

    1. 1

      I’m not sure how this would recognise the C++ spaceship operator (<=>) as a terminal. Aside from that, it looks like it is just a BNF serialisation and so has the same ambiguity problems that mean that you cannot automatically transform a BNF grammar into a (correct) parser.

      1. 1

        The space ship operator would be encoded as a nonterminal

        "<opspaceship>" : [[ "<langle>", "=", "<rangle>"]],
        "<langle>" : [["<"]],
        "<rangle>" : [[">"]],
        

        The format is not meant to be tied to a specific subset of context-free grammar. Rather, it is meant to represent any context-free grammar. It is indeed just a serialization of pretty much any syntactic notation capable of expressing context-free grammars, and hence, intended to be used as an interchange format between different parsers and fuzzers avoiding the current mess of syntactic variants.

        1. 2

          Having to split a token into multiple terminals to be able to represent it is unfortunate. It’s also wrong, because <=> and < => parse differently (the first is a single token, the second is two tokens). You can avoid this only by having whitespace explicit in the grammar, which is sometimes a good choice but adds a lot of noise for common cases.

          1. 1

            Are there places where < and => comes together?

            For cases such as this, post processing can be applied to distinguish the cases (most parsers allow you to capture the match begin and end for tokens, which can tell you if there is a space between two consecutive tokens.) If not, going with a single level grammar without a tokenizing step is the best option.

            1. 1

              < => is a syntax error in C++. Being able to not recognise an ill-formed input is as important as recognising a correct one for a grammar. A formal grammar defines a predicate on an input. It should not give false positives or false negatives.

              If you require additional processing then your form has not captured the grammar of the language.

              1. 1

                There are two possibilities here: If we need to capture the complete language, then we need to incorporate spaces in the grammar. This will make the grammar noisier but standalone and accurate. The other option is to provide a second token grammar which deals only with tokens, and is a regular grammar. If this is the case, operators such as spaceship should be handled at the token grammar level, where the spaces are included.