1. 13

At some point in a programming language’s life the community will want to pursue standardization. In my mind this idea is bound up with multiple implementations of the language’s toolset - the lexer, parser, semantic analyzer, type checker, and so forth. A large implementation-independent test corpus is an important component of standardization, to ensure conformance of an implementation to an abstract standard. The corpus contains basic tests for all language constructs along with all the weird syntactic or semantic edge cases that nobody would ever think of when reading the language standard. Since many programming languages exist and all face this issue, I would like to ask: what file formats have been developed for these test corpora?

I currently only know of one example, which is the file format used for syntax-level tests by tree-sitter. This is a text file with flexibly-defined separators delineating a series of text inputs along with the expected parse tree as an S-expression. We have adapted these tests to run against multiple implementations of the TLA+ syntax parser, but are now considering what such a format might look like for semantic tests (where you want to assert one identifier refers to another) or type tests (where you want to assert a given subexpression has some type).
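
For readers who haven’t seen it, a corpus test looks roughly like this (the delimiter lines are the flexibly-defined separators I mentioned; the node names below are just illustrative):

==================
addition expression
==================

1 + 2

---

(source_file
  (binary_expression
    (number)
    (number)))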

    1. 12

      Not exactly what you’re looking for, but the Test Anything Protocol might be relevant.

      The approach I’ve used for several of my projects is to simply define test fixtures as a directory within which each test consists of a series of files with a shared name, using extensions to associate an input file (probably with the extension of a source file for the language) with either an expected stdout result (.out) or an expected error message (.err). Using separate files and a naming convention entirely sidesteps the problems of choosing language-agnostic delimiters.
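
      A rough sketch of a runner for that layout (Python here purely for illustration; the tool name, its arguments, and the file extensions are placeholders you’d swap for your own):

      import pathlib
      import subprocess

      FIXTURES = pathlib.Path("tests/fixtures")

      def run_fixture(src: pathlib.Path) -> None:
          # Run the tool under test on the input file.
          result = subprocess.run(
              ["mylang", "run", str(src)],
              capture_output=True, text=True,
          )
          out_file = src.with_suffix(".out")
          err_file = src.with_suffix(".err")
          if out_file.exists():
              # Expected to succeed: stdout must match the .out file.
              assert result.returncode == 0, result.stderr
              assert result.stdout == out_file.read_text()
          elif err_file.exists():
              # Expected to fail: stderr must match the .err file.
              assert result.returncode != 0
              assert result.stderr == err_file.read_text()
          else:
              raise RuntimeError(f"no .out or .err next to {src}")

      for src in sorted(FIXTURES.glob("*.mylang")):
          run_fixture(src)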

      I don’t believe in writing tests that directly examine the data structures used within a language implementation, as such tests tend to be very brittle; tests should be designed to surface those issues at the integration level. Building in hard, instrumented separations between components like lexers, parsers, and type checkers heavily ossifies the design of each component and introduces a great deal of incidental complexity that is orthogonal to the problem of correctly interpreting and executing programs.

      1. 4

        I still use TAP: I use pgtap for PostgreSQL testing. Works great.

        1. 1

          Do you hardcode the exact error message, or do something like search for standardized error codes in the output, like the Rust compiler emits?

          1. 2

            For my purposes, hardcoding the exact error messages works well. If error messages were extremely verbose and frequently changed phrasing, or perhaps if error messages were somehow localized, I suppose I could see the benefits of the Rust approach.

          2. 1

            Not exactly what you’re looking for, but the Test Anything Protocol might be relevant.

            Is that the thing that was used everywhere in the Perl universe in the early 90s?

            1. 4

              Yes. It’s still used today; see https://testanything.org/.

              1. 2

                My eye went right past the mention of Perl to the sample output. I should read more carefully. That brings back memories. Test Anything was way ahead of its time.

          3. 6

            I’d say the specific format doesn’t matter much. It’s fine if it is a bespoke text format; the job of emitting it from a different tool is O(1) anyway.

            Some things I’ve used:

            • for syntax trees, I am not a fan of S-expressions and prefer serializing them as indented trees like
            
            File
              Fn
                'fn'
                'fib_rec'
                ParamList
                  '('
                  Param
                    'f1'
                    ':'
                    TypeExpr
                      'u32'
                    ','
            

            All terminals are single-quoted; all non-terminals are CamelCase.

            Also, for parsers specifically, what I found to be ridiculously efficient is specifying example tests interspersed with the grammar, right in the middle of the parsing code next to a specific if or for:

            https://github.com/rust-lang/rust-analyzer/blob/b65911d5eecfa562532549ef7e70e460eb042e2c/crates/parser/src/grammar/generic_params.rs#L74

            • for symbol tables/scopes, print the name of the module, then, for each symbol, what it binds to: https://github.com/rust-lang/rust-analyzer/blob/b65911d5eecfa562532549ef7e70e460eb042e2c/crates/hir-def/src/nameres/tests.rs#L58

            • but scopes are a lie, they don’t exist. What actually exists is “this name over here refers to that name over there”. So a good test marks two identifiers in the source code and checks that they map to each other. Here you need this general pattern:

              A textual file format that specifies a tree of text files with paths, plus meta-information for some ranges in those files, where the meta is specified as comments in the target language. E.g.:

            
            #[test]
            fn infer_adt_self() {
                check_types(
                    r#"
            enum Nat { Succ(Self), Demo(Nat), Zero }
            
            fn test() {
                let foo: Nat = Nat::Zero;
                if let Nat::Succ(x) = foo {
                    x;
                } //^ Nat
            }
            "#,
                );
            }
            

            Here, //^ (or //^^^ for a wider span) marks a range, and what follows is the meta for that range. There’s generic infrastructure that extracts the ranges, which is then specialized for type inference tests here.
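
            For illustration, a minimal sketch of that extraction step (Python; the real infrastructure is more general, handling multiple files and richer annotations):

            import re

            # Matches a marker comment like `//^ Nat` or `//^^^ some meta`.
            MARKER = re.compile(r"//(\^+)\s*(.*)")

            def extract_annotations(source: str):
                """Yield (line, column, length, meta) for each //^ marker.

                A marker on line i points at line i - 1; the column of the
                first caret gives the start of the annotated range there.
                """
                lines = source.splitlines()
                for i, line in enumerate(lines):
                    m = MARKER.search(line)
                    if m and i > 0:
                        yield (i - 1, m.start(1), len(m.group(1)), m.group(2).strip())

            # For the snippet above this yields one annotation:
            # the line containing `x;`, the column of `x`, length 1, meta "Nat".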

            In general, there are two patterns here:

            • either you feed the test a single self-marked-up source file (the example above)
            • or you feed it two files: one with the source code, one with the textual serialization of the result.

            The former leads to more succinct tests, but is harder to update. The latter lends itself nicely to snapshot testing and handles verbose output more easily, but has a larger fixed cost in verbosity.

            Do lean on snapshot testing heavily. You don’t need to write parsers for gold files; you only need serializers and text diffing.
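
            A bare-bones version of that snapshot check, in Python for illustration (serialize stands in for whatever serializer you already have; real setups add nicer diffs and bulk-update tooling):

            import difflib
            import os
            import pathlib

            UPDATE = os.environ.get("UPDATE_SNAPSHOTS") == "1"

            def check_snapshot(actual: str, golden: pathlib.Path) -> None:
                # In update mode (or on first run) just rewrite the gold file.
                if UPDATE or not golden.exists():
                    golden.write_text(actual)
                    return
                expected = golden.read_text()
                if actual != expected:
                    diff = "\n".join(difflib.unified_diff(
                        expected.splitlines(), actual.splitlines(),
                        fromfile=str(golden), tofile="actual", lineterm="",
                    ))
                    raise AssertionError("snapshot mismatch:\n" + diff)

            # Usage: check_snapshot(serialize(parse(source)), pathlib.Path("tests/foo.ast"))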

            I do strongly recommend meta-in-comments as opposed to inventing special syntax (e.g., using XML tags for meta and stripping them out to get the raw source). I’ve tried both; if your input is also just a valid $language file, it is so much better.

            Optimize for the format itself being changeable. If, at some point, you decide to change the concrete syntax of the test format a bit, you shouldn’t need to update every test manually; the existing infra should be able to rewrite them for you.

            1. 5

              Another tip here is to avoid dependencies on the stdlib in tests — this makes them slow! Instead, have a mini-stdlib which models the real one and which can be partially included. E.g., rust-analyzer has a single-file mini core, which is additionally internally guarded by a bunch of opt-in ifdefs:

              https://github.com/rust-lang/rust-analyzer/blob/master/crates/test-utils/src/minicore.rs
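
              The partial-inclusion part needs very little machinery; a rough sketch in Python (the region markers mirror what minicore does, but treat the exact marker syntax as approximate):

              def slice_minicore(minicore_src: str, wanted: set) -> str:
                  """Keep only the regions a test asked for, e.g. {"option", "iterator"}."""
                  out, skip_depth = [], 0
                  for line in minicore_src.splitlines():
                      stripped = line.strip()
                      if stripped.startswith("// region:"):
                          name = stripped.removeprefix("// region:").strip()
                          # Skip this region if it wasn't requested, or if we are
                          # already inside a skipped region.
                          if name not in wanted or skip_depth:
                              skip_depth += 1
                          continue
                      if stripped.startswith("// endregion:"):
                          if skip_depth:
                              skip_depth -= 1
                          continue
                      if skip_depth == 0:
                          out.append(line)
                  return "\n".join(out)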

              One extra benefit is that you see what each test actually depends on!

              Generalizing this, pay attention not only to how you specify the output, but also to how you specify the input!

              1. 3

                I’d say the specific format doesn’t matter much.

                In my experience, the value of tests being concise is underestimated. Specifically, having related tests in the same file and, when possible, keeping the entire test on a single line. For example:

                : replace
                :
                {
                  : basics
                  :
                  {
                    $* <'print $string.replace(  abcb,      b, BB)' >'aBBcBB' : expand
                    $* <'print $string.replace(  aabbccbb, bb,  B)' >'aaBccB' : shrink
                    $* <'print $replace([string] abc,       b,  B)' >'aBc'    : typed
                    $* <'print $replace([string]  "",       b,  B)' >''       : empty
                    $* <'print $replace([string] bbb,       b, "")' >''       : to-empty
                    $* <'print $replace([string] bb,        b, Bb)' >'BbBb'   : no-recursion
                  }
                
                  : first
                  :
                  {
                    $* <'print $string.replace(babc, b, B, first_only)' >'Babc' : first
                    $* <'print $string.replace(abcb, b, B, first_only)' >'aBcb' : middle
                    $* <'print $string.replace(b,    b, B, first_only)' >'B'    : only
                  }
                
                  : last
                  :
                  {
                    $* <'print $string.replace(babc, b, B, last_only)' >'baBc' : middle
                    $* <'print $string.replace(abcb, b, B, last_only)' >'abcB' : last
                    $* <'print $string.replace(b,    b, B, last_only)' >'B'    : only
                  }
                }
                

                Now imagine the same spread out over 12 different files.

                1. 2

                  Yeah, I think we are trying to say the same thing: you could use JSON or YAML to encode this sort of thing, and use the file system to encode trees of files. The benefit here would be that you can read it “for free” from any language.

                  But this is a very small benefit, which comes with a gigantic drawback — the tests become much less readable.

                  That is, it doesn’t matter whether it’s JSON, YAML, TAP, or (any) home grown thing. What does matter is that it exists, and that it actually is good.

                2. 1

                  Great tip for just using in-language comments for semantic/type assertions, thanks!

                3. 4

                  LLVM has this tool called FileCheck where you can add comments like // CHECK: <something> to a source file and it will test those checks on some output file.

                  I think they use it for checking the generated assembly, byte code, and maybe even compiler error messages.

                  The Rust project uses it too (I think, based on the syntax of these tests).

                  Maybe those test files are reusable between compiler implementations? I’m not sure. The tool definitely is.

                  1. 3

                    For Oils, I made a format that is intended to be similar to a shell script, but you also have

                    #### test case name
                    

                    and, in addition to code blocks, you can specify expected output like:

                    ## status: 2
                    ## STDOUT:
                    foo 
                    bar
                    ## END
                    

                    https://github.com/oils-for-unix/oils/tree/master/spec

                    We have our own test runner for CI:

                    Wiki page: https://github.com/oils-for-unix/oils/wiki/Spec-Tests

                    This is a very simple format, but it grew and changed over the years.

                    It runs against many shells, including 2 different implementations of Oils!

                    For testing the AST, I just use Python unit tests. Or very occasionally I will “textualize” some internal data structures – but I’d argue that end-to-end is the most robust, and avoids coupling to implementation details.

                    i.e. the exact AST may change frequently, because it’s used for many different purposes, but the user-observable behavior doesn’t.

                    I also remember that WebAssembly developed some specification tools that sounded cool, but I haven’t looked into them:

                    Wasm SpecTec: Engineering a Formal Language Standard

                    https://lobste.rs/s/fdykb2/wasm_spectec_engineering_formal

                    1. 3

                      Not directly what you asked for, but my favorite example of a cross-implementation test suite that focuses on the semantics of a language standard is Paul F. Dietz’s test suite for Common Lisp at https://github.com/pfdietz/ansi-test

                      1. 2

                        I wonder if K could be useful here: https://kframework.org/

                        1. 2

                          When writing the test cases for Idol I went through three different answers to this question:

                          • First, since the original implementation was in Rust, I had the syntax tree types implement Debug and used plain text diff to compare them with the expected test output. This worked OK, but the Debug impls sometimes needed to be customized to avoid printing irrelevant internal state of the ST values, so the time and code savings weren’t as much as I originally hoped for.

                          • When rewriting the reference implementation into Go, matching the Rust Debug output proved impractical, so I switched to S-expressions. There were two big downsides to S-expressions: there’s no standard and they’re difficult to auto-format.

                            • There has never been a single widely-accepted document describing what S-expressions are. Every Lisp dialect has its own slightly different flavor, and every non-Lisp library for encoding/decoding S-expressions adds its own opinions into the mix. Even concepts as important as booleans (false vs #f) or string escape sequences differ between S-expression dialects. If I wanted to share test cases across languages I’d have to write my own sexpr library for each language, which is a lot of code that would itself need to be tested.
                            • S-expressions are difficult to auto-format consistently (is (a "b") a dict entry that should be one line, or a len=2 list that should be four lines?). The most generic method treats every (a "b") list as an actual list and indents the contents, which is hugely verbose. I could have hand-written a serializer that uses more compact output, but then I would have to write a parser and auto-formatter so that the test files’ whitespace doesn’t affect test behavior, which was yet more code.
                          • I switched to JSON, which has a stable specification and good-enough libraries available for ~every language. The test files can be generated by tooling (with compact indentation) or just hand-written, then in the tests they get auto-formatted prior to diff. So far this approach is working well, and I’ve used it for testing the tokenizer and parser.
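
                          The auto-format-before-diff step mentioned above is tiny; roughly (Python, with hypothetical file names):

                          import json

                          def normalized(path):
                              # Re-serialize with fixed indentation and key order so that
                              # hand-written and tool-generated files compare equal.
                              with open(path) as f:
                                  return json.dumps(json.load(f), indent=2, sort_keys=True)

                          assert normalized("expected.json") == normalized("actual.json")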

                          Token tests (example: https://github.com/jmillikin/idol/blob/trunk/testdata/tokens/int_literals.json) contain the source code to test (a short snippet), and either the expected sequence of tokens or an expected error.

                          Parser tests (example: https://github.com/jmillikin/idol/tree/trunk/testdata/syntax/int_literal) are a directory containing the .idol file to parse and either an expect_ok.json file with the serialized ST, or an expect_err.json file with error details.

                          Compiler tests (example: https://github.com/jmillikin/idol/tree/trunk/testdata/schema/const) use the Idol text format, since the output of the compiler is a Schema message.

                          Language-specific support code lives next to the implementation(s), for Go it’s in https://github.com/jmillikin/go-idol/tree/trunk/idol/internal/testutil.

                          1. 2

                            Just create a DSL on top of YAML, and then reuse it across languages. Not easy, but it is the best compromise: it avoids over-architecting without locking oneself into a single language.

                            1. 2

                              And then we can generate the YAML itself with Nickel, Dhall or Cue!

                              1. [Comment removed by author]

                            2. 2

                              Does Cucumber count? I guess it’s more “implement the same thing in different languages”, but the BDD tests should be kinda compatible?

                              1. 2

                                I designed a file format for testing shell commands. It checks exit status (set by ?, or zero) and you can provide input and mandate output of individual file descriptors (as either single lines or multi-line here-documents).

                                Some examples using a shell and my command-line chess logic tester:

                                $ sh /dev/fd/3
                                
                                ? 1
                                
                                3 << 'END'
                                set -e
                                echo pretty >&2
                                false
                                echo bad
                                END
                                
                                1 >> ""
                                
                                2 >= pretty
                                
                                $ chess-test e4 e5 Bc4 Bc5 Nf3 Nf6 Rg1 Rg8 Rh1 Rh8 0-0 v
                                1 >= "castling forbidden"
                                
                                $ chess-test e4 e5 Bc4 Bc5 Nf3 Nf6 a4 a5 Ra3 Ra6 0-0 v
                                1 >> '[END]'
                                  n b q k     r
                                  v v v   v v v
                                r         n    
                                v   b   v      
                                ^   B   ^      
                                R         N    
                                  ^ ^ ^   ^ ^ ^
                                  N B Q   R K  
                                [END]
                                

                                Thanks to writing a test suite, I did in fact find bugs in my chess logic engine. :-)

                                1. 1

                                  Hello @ahelwer!

                                  I have developed just such a language. Its purpose is no more or less than what you ask, and it is called CSTML. https://github.com/bablr-lang/#cstml

                                  It is a format which allows the interleaving of arbitrary parsing metadata into arbitrary streams of embedded text. The original text is always trivially recoverable.

                                  I am working on being able to express any and every language’s parse trees in terms of CSTML.

                                  1. 1

                                    Here is the format describing the input "\n"

                                    <!0:cstml bablr-language='https://bablr.org/languages/core/en/cstml'>
                                    <>
                                      .:
                                      <String>
                                        openToken: <*Punctuator '"' balanced='"' balancedSpan='String:Double' />
                                        content:
                                        <*StringContent>
                                          @:
                                          <EscapeSequence>
                                            escapeToken: <*Punctuator '\\' openSpan='Escape' />
                                            code: <*Keyword 'n' closeSpan='Escape' />
                                          </>
                                        </>
                                        closeToken: <*Punctuator '"' balancer />
                                      </>
                                    </>
                                    
                                  2. 1

                                    This is making me think of contextual equivalences. So there’s a little bit of language design going on for the language of contexts as seen in these tests. Huh.