1. 15
  1. 5

    Fathom is another project to watch in this space.

    1. 2

      Thanks for the mention, I’m working on Fathom right now!

      Responding to some of the criticisms of Kaitai:

      parsing a kaitai schema is significantly more involved than parsing an openapi schema. one contributing factor is that kaitai features an embedded expression language, allowing almost any attribute to be bound to expressions involving other fields, rather than simple constants. this makes it hard for an ecosystem to thrive because of a high technical barrier for entry.

      The tricky thing about real-world binary formats is just how interwoven they are with programming logic: often they start off implemented as some random bit of C code and then we have to go back and pick up the pieces in order to attempt to describe them declaratively and safely (and ideally in a way that can be checked using a proof assistant in the future). It’s a massive pain.
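      To make the "fields bound to expressions over other fields" point concrete, here is a minimal Python sketch. The schema shape, the field names, and the `parse` helper are all hypothetical and are not Kaitai's actual syntax or API; the only point is that a field's size can depend on values parsed earlier, which is what any tool consuming such a schema has to be able to evaluate.

```python
# Hypothetical toy schema: (name, size) pairs. A size is either a constant
# byte count or a function of the fields parsed so far.
SCHEMA = [
    ("header_len", 1),  # length of the header, including both length bytes
    ("total_len", 1),   # length of the whole record
    ("header", lambda f: f["header_len"] - 2),
    ("payload", lambda f: f["total_len"] - f["header_len"]),
]


def parse(schema, buf):
    """Parse buf field by field, evaluating size expressions as we go."""
    fields, pos = {}, 0
    for name, size in schema:
        n = size(fields) if callable(size) else size
        if n < 0 or pos + n > len(buf):
            raise ValueError(f"bad size {n} for field {name!r}")
        raw = buf[pos:pos + n]
        # Constant-size fields are decoded as unsigned big-endian integers;
        # expression-sized fields are kept as raw bytes.
        fields[name] = raw if callable(size) else int.from_bytes(raw, "big")
        pos += n
    return fields


record = bytes([4, 7, 0xAA, 0xBB]) + b"xyz"
parsed = parse(SCHEMA, record)
print(parsed)
```

      Even in this tiny case a consumer needs an expression evaluator, not just a table of constants, which is the barrier to entry the quoted passage describes.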

      the question of serialization remains; however, the technical challenges seem largely connected to the use of a feature in kaitai called “instances” rather than “sequences”. this feature allows defining a field at a fixed offset, rather than defining it in context to the field before or after. this feature is obviously very useful to reverse engineers, who may understand only a certain chunk of a binary blob and need to annotate its fields before moving on, but it wouldn’t be a desirable feature in a schema that serves as the source definition for a binary format.

      This is really hard to avoid if you want to deal with real world formats. Formats like OpenType are sprawling, with offsets and pointers to other parts of the binary data, sometimes with mutual dependencies between offset data.
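      As a rough illustration of the difference, here is a Python sketch of the two styles. The toy layouts and function names are assumptions made up for this example (this is not OpenType and not Kaitai): one parser reads fields back to back, the other follows absolute offsets stored in a header.

```python
import struct


def parse_sequential(buf):
    """Sequence style: each field starts where the previous one ended."""
    pos = 0
    (count,) = struct.unpack_from(">H", buf, pos)
    pos += 2
    values = []
    for _ in range(count):
        (value,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        values.append(value)
    return values


def parse_offset_table(buf):
    """Offset style: the header stores (offset, length) pairs that point
    to records located elsewhere in the buffer, in any order."""
    (count,) = struct.unpack_from(">H", buf, 0)
    records = []
    for i in range(count):
        offset, length = struct.unpack_from(">HH", buf, 2 + 4 * i)
        if offset + length > len(buf):
            raise ValueError("record points outside the buffer")
        records.append(bytes(buf[offset:offset + length]))
    return records


seq_blob = struct.pack(">HHH", 2, 7, 9)
# Header is 2 + 2 * 4 = 10 bytes; the table lists the second record first.
table_blob = struct.pack(">HHHHH", 2, 12, 3, 10, 2) + b"hi" + b"abc"

print(parse_sequential(seq_blob))
print(parse_offset_table(table_blob))
```

      The sequential parser can be inverted into a serializer almost mechanically. The offset-based one cannot, because a writer first has to decide where each record goes, and that layout decision is what makes serialization hard for formats built this way.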

      Ultimately it sounds like the author really wants a restricted language for a restricted subset of new binary formats. If you have the luxury of writing your own binary format from scratch, then metaformats like Protocol Buffers or Cap’n Proto could offer a more restricted approach that might make this kind of thing easier. I’m not sure how they fare for binary files though, as I’m pretty sure they are designed more for describing streaming data.

      While I’m here, I’d also like to call out some additional tools that I think don’t get enough airtime:

      • binary-data: DSL for parsing binary data for Dylan
      • EverParse: parser tools that extract an ‘RFC’ format to verified F* code and ultimately C, used as part of Project Everest (formally verified TLS stack by Inria and Microsoft) (paper)
      • Narcissus: a DSL embedded in Coq for formally verified descriptions of binary data formats (paper)

      More interesting tools and languages can be found at dloss/binary-parsing.

    2. 4

      I see that ASN.1 was mentioned, but what about the ASN.1 encoding control notation? That covers the “full control over encoding” gap. As far as I know, it’s expressive enough to handle most wire formats, targeted at serialization, and documentable.

      1. 2

        I’mma let you finish, but SWIG is the OG “swagger for binary”.

        In all seriousness, though, projects like SWIG did a pretty good job letting us dinosaurs link together multiple languages via C headers and annotated ABI long before JSON was a thing, much less widespread.

        Likewise, Windows and UNIX had their respective “component” ecosystems using IDLs to describe efficient binary encodings: CORBA, DCOM, etc. “way back” in the 90s. More recently we have Thrift, Protobufs/gRPC, Flatbuffers, etc.

        The problem isn’t a lack of interest or tools, it’s the endless treadmill of slightly different use-cases leading to entirely divergent toolchains and ABIs for each generation – not to mention vendor/OSS sponsoring org/etc. – getting picked up, used for a generation or two of systems, and then discarded for something new.

        Note: I’m not trying to suggest there’s nothing new worth trying, but if your list of prior art for “binary API specs” leaves out everything above except for Protobufs it’s a very incomplete (even ahistorical) sample.

        1. 1


          It’s higher level, but lower than that is a non-problem.

          1. 1

            Wouldn’t the current solution be C headers? Most languages support parsing these to then generate native bindings with FFI. It’s not very formal and has many drawbacks, but I feel like a successful standard could define a C subset that is easy to parse and pin down all the ambiguous parts. This way, most of the native binary tooling would just work out of the box. Defining yet another schema definition sounds like “we have N competing standards, now we have N+1”.

            1. 1

              The problem with attempting to parse a C header is that you need to pull in an entire C compiler (preprocessor, pull #defines from CFLAGS and command line arguments, etc).

              1. 1

                You could handle this complexity by supporting a subset of C and erroring out on anything outside of this subset. You could even make a tool that transpiles some random header, using a C compiler and the system header dependencies, into a header in that C subset language, without preprocessor directives, making the header file “portable”. This way you can add this step as part of the build pipeline and define the “swagger” file based on what is really being compiled. The more I write about it, the more I want to build this thing!
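                A very rough sketch of the "accept a subset, reject everything else" half of that idea, in Python. The subset chosen here is an assumption for illustration only: top-level structs, fixed-width integer fields, optional constant array lengths, and no comments, pointers, bitfields, nesting, or preprocessor directives. The names `SIZES` and `parse_subset` are hypothetical.

```python
import re

# Assumed subset: only these fixed-width integer types are allowed.
SIZES = {"uint8_t": 1, "uint16_t": 2, "uint32_t": 4, "uint64_t": 8,
         "int8_t": 1, "int16_t": 2, "int32_t": 4, "int64_t": 8}

STRUCT_RE = re.compile(r"struct\s+(\w+)\s*\{(.*?)\}\s*;", re.S)
FIELD_RE = re.compile(r"(\w+)\s+(\w+)\s*(?:\[\s*(\d+)\s*\])?")


def parse_subset(header):
    """Return {struct_name: [(field, ctype, count), ...]} or raise."""
    if "#" in header:
        raise ValueError("preprocessor directives are outside the subset")
    # Anything that is not a complete struct definition is rejected.
    leftover = STRUCT_RE.sub("", header).strip()
    if leftover:
        raise ValueError(f"unsupported top-level text: {leftover!r}")
    structs = {}
    for name, body in STRUCT_RE.findall(header):
        fields = []
        for decl in body.split(";"):
            decl = decl.strip()
            if not decl:
                continue
            m = FIELD_RE.fullmatch(decl)
            if m is None or m.group(1) not in SIZES:
                raise ValueError(f"unsupported declaration: {decl!r}")
            ctype, fname, count = m.groups()
            fields.append((fname, ctype, int(count) if count else 1))
        structs[name] = fields
    return structs


layout = parse_subset("struct point { int32_t x; int32_t y; uint8_t tag[4]; };")
print(layout)
```

                The transpiling step (running a real C compiler over an arbitrary header and emitting this subset) is the hard part and is not attempted here.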

              2. 1

                Fwiw, IDA (one of the most used reverse engineering tools) uses C structs to define all of its structures, which in my experience works really well and allows copy/pasting C code to import data structures from open source projects. I never hit any limitation. It also uses a C-like language to type functions, with some extra keywords to define the calling convention. If I were to define some kind of Swagger for binary, that’s probably where I would start.
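                This is not IDA’s API, but the same "a C struct is the schema" idea can be sketched with Python’s standard `ctypes` module. The `Header` layout and the sample bytes below are made up for illustration.

```python
import ctypes


# Mirrors a hypothetical packed C struct:
#   struct header { char magic[4]; uint16_t version; uint32_t entry_count; };
class Header(ctypes.LittleEndianStructure):
    _pack_ = 1  # no padding between fields
    _fields_ = [
        ("magic", ctypes.c_char * 4),
        ("version", ctypes.c_uint16),
        ("entry_count", ctypes.c_uint32),
    ]


raw = b"DEMO" + (2).to_bytes(2, "little") + (3).to_bytes(4, "little")
hdr = Header.from_buffer_copy(raw)

print(ctypes.sizeof(Header), hdr.magic, hdr.version, hdr.entry_count)
```

                The struct declaration doubles as documentation and as a parser for fixed-layout data, though it says nothing about variable-length or offset-based fields.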

              3. 1

                Not sure I understand this post, really. Basically you generate a binary (like you generate an API) from a definition (your program source) using a generator (your compiler) with a defined target (compiler options + ld scripts), and you obtain a defined, documented binary for the platform you’re targeting (your service). Maybe an ld script is what she/he is describing? I have no idea why fantabulous web technologies that already ruined the web are supposed to be inspiring here. Define yet another abstraction-spaghetti standard? I’m probably stupid and did not understand, but it reminds me of the well-known xkcd: https://xkcd.com/927