1. 14
  1. 3

    This must be an application of what @dcreager and his team have been working on with stack graphs.

    I love this approach, with tree-sitter as a lighter-weight alternative compared to the likes of Kythe.io.

    Looking forward to more languages and cross-language support.

    1. 2

      That’s right! This article describes the pipeline that powers “search-based” or “CTags-like” Code Navigation. Stack graphs power “precise” Code Navigation. Both have the same non-functional requirements described in the paper: zero config, incremental, language agnostic. Both build on tree-sitter and its DSLs to (hopefully!) achieve those requirements.

    2. 2

      Just the tip of the iceberg too, as I understand it! GitHub also owns CodeQL.

      1. 2

        Which they got from buying out Semmle IIRC.

      2. 1

        I really enjoyed reading this behind-the-scenes look at a feature I quite enjoy about GitHub. However, I was surprised by the numbers.

        GitHub serves more than 65 million developers contributing to over 200 million repositories across 370 programming languages… The system supports nine popular programming languages across six million repositories… It supports nine languages: C#, CodeQL, Go, Java, JavaScript, PHP, Python, Ruby, and TypeScript, with grammars for many others under development.

        I’m really surprised those nine languages add up to only 3% of the repositories (6 million of 200 million).

        1. 2

          There are other filters that limit the number of repos that we index. We don’t process repos that are marked as spammy, for instance, and at the time the article was written, we weren’t processing forks. (Currently we only process fork branches that are involved in a PR.)

          1. 1

            Gotcha, thanks for the clarification! That makes a lot more sense.

            1. 1

              I note that C/C++ are excluded from that list, and it feels as if supporting them is basically impossible with the Zero Configuration goal. You really need something like compile_commands.json to make them work, and that can often depend on host-system headers (e.g. for macros that are used for conditional compilation) or on generated files. For example, LLVM’s build system first builds a tool called TableGen and then uses TableGen to generate a bunch of C++ files that are included elsewhere. Similarly, FreeBSD’s build system generates things like the system-call stubs and other bits of libc, so you can’t parse many of the sources in either project as valid C/C++ unless you also have a build tree. In LLVM, this shows up (or, rather, doesn’t show up) in the Doxygen output, where many of the target-specific classes inherit from a class that’s generated from TableGen, which in turn inherits from a generic class in the target-agnostic code generator. Inspecting the Doxygen output (Doxygen runs without the generated files), you can’t tell that the class in the target is an indirect subclass of the target-agnostic class.
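
              To make the conditional-compilation point concrete, here is a minimal sketch. The macro `HYPOTHETICAL_HOST_HAS_WIDE_HANDLES` and every other name in it are invented for illustration; they stand in for whatever a host-system header or the build system would really define. Which definition of `Handle` exists is only knowable once that macro's value is known, which is exactly the information a zero-config indexer does not have.

              ```cpp
              #include <cassert>
              #include <cstddef>

              // Hypothetical stand-in for a macro that a host-system header or the
              // build system would normally define; the source file itself never sets it.
              #ifndef HYPOTHETICAL_HOST_HAS_WIDE_HANDLES
              #define HYPOTHETICAL_HOST_HAS_WIDE_HANDLES 0
              #endif

              #if HYPOTHETICAL_HOST_HAS_WIDE_HANDLES
              using Handle = long long;  // only exists when the macro is non-zero
              const char* active_branch() { return "wide"; }
              #else
              using Handle = int;        // only exists when the macro is zero
              const char* active_branch() { return "narrow"; }
              #endif

              // Code further down just says `Handle`; which definition it refers to
              // cannot be decided from this file alone.
              std::size_t handle_size() { return sizeof(Handle); }

              // With the fallback default above, the "narrow" branch is the active one.
              void check_default_branch() {
                  assert(active_branch()[0] == 'n');
                  assert(handle_size() == sizeof(int));
              }
              ```

              Compiling the same file with `-DHYPOTHETICAL_HOST_HAS_WIDE_HANDLES=1` selects the other branch, so a tool that never sees the compile flags can at best index both definitions and guess.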

              C++, in particular, is very hard to usefully parse with anything that isn’t a complete C++ parser. Whether a particular token is a field or a type name can be context-dependent, and that context may be in a different file. It’s hard to imagine something built on tree-sitter giving anything better than superficial information for C++.
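
              A small sketch of that ambiguity, with hypothetical names throughout: the two namespaces stand in for two different headers, and the same token sequence `Foo * bar` is a declaration under one and a multiplication under the other. This is only an illustration of the general problem, not a claim about how any particular tree-sitter grammar handles it.

              ```cpp
              #include <cassert>

              // Pretend each namespace is a separate header that may or may not be included.
              namespace header_a {
              struct Foo { int value = 1; };  // here `Foo` names a type
              }
              namespace header_b {
              const int Foo = 6;              // here `Foo` names a variable
              }

              int parsed_as_declaration() {
                  using header_a::Foo;
                  Foo * bar = nullptr;        // declares `bar` as a pointer to Foo
                  return bar == nullptr ? 0 : bar->value;
              }

              int parsed_as_expression() {
                  using header_b::Foo;
                  int bar = 7;
                  return Foo * bar;           // multiplies two ints: 6 * 7
              }

              // Same tokens, two different syntax trees.
              void run_examples() {
                  assert(parsed_as_declaration() == 0);
                  assert(parsed_as_expression() == 42);
              }
              ```

              A parser that looks at one file at a time has to pick one of the two trees for `Foo * bar` without knowing what `Foo` is, which is why the results tend to be superficial.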