Threads for dcreager

    1. 5

      For those who haven’t seen it, CCAN is a great collection of reusable C code, much of which specifically exists to work around the kinds of issues mentioned in OP. It turns “oh crap I should probably roll my own” into “wait I bet someone has already re-rolled this”.

      Edit: See below re the ccodearchive.net link belonging to a squatter now! I’ve updated the link to CCAN’s GitHub repo, which is still active and not spammy. h/t @taal for pointing this out!

      1. 8

        Please note that it seems this is NOT the right link anymore. Giveaway: the link to the online casino at the bottom. See: https://lists.ozlabs.org/pipermail/ccan/2022-September/001411.html

        However, it’s also a good warning not to include other people’s code found on the internet without looking at the actual code in detail - a lot of nasty things can be done by a squatter… especially with this kind of code, which outsources magic and goes into low-level parts that people might not be monitoring.

    2. 1

      errno: Without libc you don’t have to use this global, hopefully thread-local, pseudo-variable.

      Isn’t it used by POSIX system calls? Can’t escape it entirely in that case. :-/

      Oh, /u/spc476 says this is mainly a reaction to Windows programming. Don’t need POSIX there, then.

      1. 1

        The article also suggests rolling your own I/O instead of using <stdio.h> (even on POSIX systems), which is one of the main reasons you’d need to be reading errno.
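
        For example (a minimal sketch of my own, not from the article): even a hand-rolled replacement for stdio’s buffered output still ends up consulting errno, because that’s how write(2) reports failures.

            #include <errno.h>
            #include <unistd.h>

            /* Toy write-everything loop: the kind of thing you write once you
               give up <stdio.h>. Error reporting still flows through errno. */
            ssize_t write_all(int fd, const char *buf, size_t len) {
                size_t off = 0;
                while (off < len) {
                    ssize_t n = write(fd, buf + off, len - off);
                    if (n < 0) {
                        if (errno == EINTR)   /* interrupted by a signal: retry */
                            continue;
                        return -1;            /* caller inspects errno for the cause */
                    }
                    off += (size_t)n;
                }
                return (ssize_t)off;
            }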

    3. 1

      Are there still plans to publish stack graphs for general availability and local use?

      1. 1

        The stack-graphs crate is open source, is that what you mean? That repo implements the core algorithms, and is also where we’re putting the language-specific stack graph construction rules at the moment. (We had considered putting them into each language’s grammar repo, like we do for syntax highlighting and fuzzy tagging queries, but we decided that it would be asking too much of drive-by contributors to learn how to update stack graph rules in sync with any grammar changes that they submit.)

        1. 1

          Is the project intended to become a documented self-serve thing where people can write their own rules for their own languages that will presumably hook into clients in various editors?

          1. 1

            That’s still the intent, yes! I can’t say that there’s good narrative-style documentation yet, but we’re starting to put representative examples in the repo, showing what kinds of stack graph gadgets to use to model common language features.

    4. 2

      For folks that prefer to learn by watching rather than reading, the Strange Loop talk on this was one of the finest examples of technical communication I’ve ever had the pleasure of seeing with my own eyes: https://www.thestrangeloop.com/2021/incremental-zero-config-code-nav-using-stack-graphs.html

      My hat’s off to you, @dcreager :)

      1. 2

        Wow, thank you for the kind words!

    5. 1

      Recently I learned that in Zig there can be dynamic evaluation happening at build time which can change the result of the compilation entirely.

      How would stack graphs handle those cases, where there could be a fuzzy hole in the graph? I.e., they are not exactly precise, but a narrowed set of possibilities.

      1. 4

        In general, arbitrary compute happening at build time will be a problem with this kind of source-only analysis. We have some thoughts about how you could possibly simulate some of that computation in the stack graphs framework, but at that point, it’s not clear that you’re gaining anything over just running the build and recording the results.

        One thing we also call out in the paper is that we can handle ambiguity in an exploratory feature like Code Nav better than in an actual compiler or interpreter for the language. It’s okay for our rules to resolve a reference to a small number of possible definitions — it’s still helpful to the user even if we can’t nail it down to precisely one answer. That said, build-time compute is open-ended enough that we probably wouldn’t even be able to produce ambiguous answers!
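
        As a loose analogy in C (just a sketch I’m making up here, not something our rules actually emit): conditional compilation already produces the kind of small, bounded ambiguity that’s still useful to surface.

            /* Which init() does start() call? A compiler must pick exactly one,
               based on build-time configuration. A source-only Code Nav tool can
               report both candidates, which is still useful to a reader. */
            #ifdef USE_FAST_PATH
            void init(void) { /* fast implementation */ }
            #else
            void init(void) { /* portable implementation */ }
            #endif

            void start(void) {
                init();   /* resolves to one of two definitions */
            }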

    6. 1

      I really enjoyed reading this behind-the-scenes look at a feature of GitHub that I quite enjoy. However, I was surprised by the numbers.

      GitHub serves more than 65 million developers contributing to over 200 million repositories across 370 programming languages… The system supports nine popular programming languages across six million repositories… It supports nine languages: C#, CodeQL, Go, Java, JavaScript, PHP, Python, Ruby, and TypeScript, with grammars for many others under development.

      I’m really surprised those nine languages add up to only 3% of the repositories (6 of 200).

      1. 2

        There are other filters that limit the number of repos that we index. We don’t process repos that are marked as spammy, for instance, and at the time the article was written, we weren’t processing forks. (Currently we only process fork branches that are involved in a PR.)

        1. 1

          Gotcha, thanks for the clarification! That makes a lot more sense.

        2. 1

          I note that C/C++ are excluded from that list, and it feels as if supporting them is basically impossible with the Zero Configuration goal. You really need something like compile_commands.json to make them work, and that can often depend both on host-system headers (e.g. for macros that are used for conditional compilation) and on generated files. For example, LLVM’s build system first builds a tool called TableGen and then uses TableGen to generate a bunch of C++ files that are included elsewhere. Similarly, FreeBSD’s build system generates a load of things like the system-call stubs and other bits of libc, so you can’t parse a load of the sources in either as valid C/C++ unless you also have a build tree. In LLVM, this shows up (or, rather, doesn’t show up) in the Doxygen output, where a load of the target-specific classes inherit from a class that’s generated from TableGen, which inherits from a generic class in the target-agnostic code generator. Inspecting the Doxygen output (Doxygen runs without the generated files), you can’t tell that the class in the target is an indirect subclass of the target-agnostic class.

          C++, in particular, is very hard to usefully parse with anything that isn’t a complete C++ parser. Whether a particular token is a field or a type name can be context dependent and that context may be in a different file. It’s hard to imagine something built on TreeSitter giving anything better than superficial information for C++.
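
          A minimal illustration of that context dependence (my own sketch; the problem already bites in plain C, before templates make it worse for C++):

              /* The same token sequence "a * b;" is either a declaration or an
                 expression, depending on a declaration that may live in a header
                 in a completely different file. */
              #ifdef A_IS_A_TYPE
              typedef int a;          /* now "a * b;" declares a pointer named b */
              #else
              int a = 2, b = 3;       /* now "a * b;" multiplies two ints        */
              #endif

              void demo(void) {
                  a * b;              /* cannot be parsed without that non-local context */
              }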

    7. 3

      This must be an application of what @dcreager and his team have been working on with stack graphs.

      I love this approach, with tree-sitter being a lighter-weight alternative compared to the likes of Kythe.io.

      Looking forward to more languages and cross-language support.

      1. 2

        That’s right! This article describes the pipeline that powers “search-based” or “CTags-like” Code Navigation. Stack graphs powers “precise” Code Navigation. Both have the same non-functional requirements described in the paper — zero config, incremental, language agnostic. Both build on tree-sitter and its DSLs to (hopefully!) achieve those requirements.

    8. 3

      It’s not mentioned directly in the article, but another benefit of using tree-sitter like this is that it makes it easier to adapt the linter to work with other languages, since it’s parameterized by the language grammar that you use and the queries that you write.
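
      As a rough sketch of what that parameterization looks like, here’s a toy driver using tree-sitter’s C API (my own example, not the linter from the article); the grammar function and the query string are the only language-specific pieces:

          #include <stdio.h>
          #include <string.h>
          #include <tree_sitter/api.h>

          /* Each tree-sitter grammar exposes a function like this; link against
             the grammar you want to target. */
          const TSLanguage *tree_sitter_python(void);

          int main(void) {
              const char *source = "def f():\n    print('hi')\n";
              /* A made-up "rule": flag every call expression. */
              const char *rule = "(call) @flagged";

              TSParser *parser = ts_parser_new();
              ts_parser_set_language(parser, tree_sitter_python());
              TSTree *tree = ts_parser_parse_string(parser, NULL, source,
                                                    (uint32_t)strlen(source));

              uint32_t err_offset;
              TSQueryError err_type;
              TSQuery *query = ts_query_new(tree_sitter_python(), rule,
                                            (uint32_t)strlen(rule),
                                            &err_offset, &err_type);

              TSQueryCursor *cursor = ts_query_cursor_new();
              ts_query_cursor_exec(cursor, query, ts_tree_root_node(tree));

              TSQueryMatch match;
              while (ts_query_cursor_next_match(cursor, &match)) {
                  TSPoint start = ts_node_start_point(match.captures[0].node);
                  printf("rule matched at line %u, column %u\n",
                         start.row + 1, start.column + 1);
              }

              ts_query_cursor_delete(cursor);
              ts_query_delete(query);
              ts_tree_delete(tree);
              ts_parser_delete(parser);
              return 0;
          }

      Swapping in, say, tree_sitter_ruby() and a Ruby-shaped query is all it takes to retarget the same driver.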

    9. 8

      Gemini protocol has many constraints. Each of those constraints has been put into place to make a feature more powerful. Lack of cookies helps those who are interested in privacy. No scripting helps those who want documents instead of apps. Simple markup helps those who want to focus on content. You get the gist.

      These two make the features I need much less powerful. I mostly write documents, but I try to use the features of the web to present content in better and more informative ways. For example, hiding technical details in a dropdown, adding a button to toggle the visibility of advanced topics, or even just tweaking CSS to make certain things stand out more. If I knew how to make interactive JavaScript simulations a la Bartosz Ciechanowski, I’d absolutely add those too.

      I also “want documents” and “to focus on the content”, but I need far more power than Gemini provides to do either.

      1. 2

        In that post, I also mention that, depending on your use case, you’re better served by the Web. You are using styling and page interaction to make better content; that is what the Web was made for. Gemini is not what you need.

      2. 1

        I can definitely relate to this — that when presenting deep technical topics, you often need very fine control over presentation so that you can be absolutely sure you’re conveying your ideas in the way you intend. For that content, Gemini gets in your way.

        I also definitely relate to what you quoted, though! One of the (many) reasons that I write less than I’d like is that I get side-tracked by playing with the layout instead of writing the text. Or editing while writing instead of after writing. One endless yak shave after another. And I have found Gemini genuinely useful for providing guard rails that make it harder to do that. Or at least to delay the yak shaving until I create the HTML equivalent of what I just wrote in Gemini! 😀

    10. 1

      OP author here, happy to answer any question folks might have. I also spoke at Strange Loop back in October, and the talk has a bit more detail than the blog post.

      1. 1

        I actually was very impressed by your Strange Loop talk. I’ve watched it several times since it came out, and my mind was blown by how things tied together. Prior to this, I was looking deeply into Bazel + Kythe integration for something similar. Have you tried Kythe?

        I really want to start exploring a way to index a Bazel codebase using stack graphs. I would appreciate it if you could give me some pointers on where to start / examples.

        1. 2

          (I haven’t watched the full talk yet, so I might be talking nonsense)

          If you already have a Bazel codebase, then, in principle, you want Kythe rather than stack graphs. One of the original constraints for the thing is “we want it to just work for a heterogeneous corpus of code on GitHub without config”. With Bazel, you don’t need that constraint, as you have a homogeneous codebase which you can just build. This makes the Kythe approach of producing exact data during the build better (as it is guaranteed to be precise).

          But of course this fully ignores quality-of-implementation issues; I don’t know which would work better in practice.

          1. 1

            There is a problem with Kythe: it’s a Google product and therefore it sucks at being a ‘platform’ (to quote Stevey).

            It’s not developer-friendly, doesn’t get a lot of support, and doesn’t work with any other software. I have a feeling that both Kythe and stack graphs will end up helping solve the same set of problems (code intel, large-scale refactoring, etc…). But I trust that GitHub/MSFT have a better platform where things integrate with each other (IDE, code search, etc…).

            So I do look forward to trying and exploring these new product directions. So far I can see tree-sitter/stack graphs being a great tool for large-scale refactoring / code transformation (example: https://github.com/ThePrimeagen/refactoring.nvim)

          2. 1

            Bazel takes care of knowing how to invoke the language-specific tools to build a heterogeneous project, but you still need something to generate the necessary code navigation data as part of that build process. That could be Kythe, if the set of languages you’re using overlaps nicely with existing Kythe-compatible extractors. It could also be something built on LSP/LSIF, since it’s looking like there’s more critical mass in that space for “build-time symbol extractors”. But you’re right that if you’ve decided to extract this data as part of the build process, you probably won’t turn to stack graphs to do it.

            1. 1

              Adding some more nuance here:

              as it is guaranteed to be precise

              This is a really important point — it’s possible to have a build-time extraction be guaranteed precise, but this depends on the quality of the symbol extractor that you’re using. The Microsoft folks, for instance, are updating their compilers to be able to spit out LSIF-compatible symbol data as part of the compilation process. And matklad, I know you and others have been putting in amazing work to make rustc and rust-analyzer reuse more and more underlying code, for similar reasons.

              If you can do that, then you absolutely have “guaranteed precise” symbol data, since you are tapped into the process that defines what “absolute precision” is. But, it’s important to call out that this isn’t something you get for free just because your language is statically typed and/or compiled. It requires work on the part of the compiler/language server authors. Work that they’ve been almost universally willing to do, to be clear! But not free.

              For a language like Python, where there isn’t really a compile step, “build time” is more accurately called “CI time”. And then you’re more at the mercy of the quality of the tool that you’re using. This is one of the reasons that we targeted Python first — we feel that stack graphs can be useful for Python even if you’re extracting data as part of a CI process.

        2. 1

          We had looked at Kythe, especially for its data model. They’ve done a really thorough job of describing all of the entities and relationships that can exist in code. It’s not incremental, though — its defines relationship maps directly between two source entities. Stack graphs’ partial paths are concatenated together at query time to produce the same results, but the building blocks that we store are incremental, which is essential at our scale.
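
          (To make that distinction concrete, here’s a deliberately over-simplified C sketch of the stitching idea. It’s a toy illustration, not our actual data model: each file contributes only facts computable from that file alone, and the cross-file join happens at query time, so re-indexing one file only replaces that file’s facts.)

              #include <stdio.h>
              #include <string.h>

              /* Partial fact from the importing file alone. */
              struct partial_ref { const char *file; const char *symbol; };
              /* Partial fact from the defining file alone. */
              struct partial_def { const char *file; const char *symbol; int line; };

              static struct partial_ref refs[] = {
                  { "main.py", "helper" },     /* "from util import helper"; helper() */
              };
              static struct partial_def defs[] = {
                  { "util.py", "helper", 3 },  /* "def helper(): ..."                 */
              };

              /* Query time: stitch the per-file pieces together. */
              static void resolve(const char *symbol) {
                  for (size_t r = 0; r < sizeof refs / sizeof refs[0]; r++) {
                      if (strcmp(refs[r].symbol, symbol) != 0) continue;
                      for (size_t d = 0; d < sizeof defs / sizeof defs[0]; d++) {
                          if (strcmp(defs[d].symbol, symbol) == 0)
                              printf("%s: %s resolves to %s:%d\n",
                                     refs[r].file, symbol, defs[d].file, defs[d].line);
                      }
                  }
              }

              int main(void) { resolve("helper"); return 0; }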

          Unfortunately, we don’t have the example stack graph construction code in any public repos yet. We proved everything out in some internal repos, which included an older version of the graph DSL that I describe in the blog post and Strange Loop talk. The public tree-sitter-graph language is informed by it, and cleaned up a bit relative to the prototype internal DSL. The Python stack graph rules that we deployed to production use that prototype DSL. We’re actively moving them over to the new DSL, and will be developing the stack graph rules for other languages directly in the public open-source tree-sitter grammar repos for those languages. I realize it’s not a satisfying answer, but we do hope to have the Python stack graph rules over in the tree-sitter-python repo as soon as we can.

          1. 1

            Very exciting! Look forward to it.

    11. 2

      This looks great! I’m all for database-centric architectures. I’m a big fan of Materialize (right in the process of introducing it in a project I’m working on) and followed Eve closely when it was still alive. (I created the little bouncing ball example that became a benchmark of sorts towards the end.) Will definitely try to participate in this!

      1. 2

        What are Materialize and Eve? I’m puzzled by the “little bouncing ball” and its relationship with databases — that sounds pretty interesting!

        1. 4

          Materialize is a streaming database based on differential dataflow, from Frank McSherry and team. McSherry is well-known for his “Configuration that Outperforms a Single Thread” (COST) paper [PDF].

          1. 3

            This is one of my favorite papers from Frank McSherry. The other is “A Cool and Practical Alternative to Traditional Hash Tables”, which is a great intro to Cuckoo hashing, and a great analysis: https://www.ru.is/faculty/ulfar/CuckooHash.pdf
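
            For anyone who hasn’t met Cuckoo hashing before, the core trick is tiny (this is my own toy sketch, not code from the paper): two tables, two hash functions, and inserts that evict the current resident and re-place it in its other table.

                #include <stdbool.h>
                #include <stdint.h>
                #include <string.h>

                #define SLOTS     1024
                #define MAX_KICKS 32
                #define EMPTY     UINT32_MAX

                static uint32_t table1[SLOTS], table2[SLOTS];

                static uint32_t h1(uint32_t k) { return (k * 2654435761u) % SLOTS; }
                static uint32_t h2(uint32_t k) { return ((k ^ 0x9e3779b9u) * 40503u) % SLOTS; }

                static void cuckoo_init(void) {
                    memset(table1, 0xFF, sizeof table1);   /* every slot = EMPTY */
                    memset(table2, 0xFF, sizeof table2);
                }

                /* Lookups probe exactly two slots: that's the whole appeal. */
                static bool cuckoo_lookup(uint32_t key) {
                    return table1[h1(key)] == key || table2[h2(key)] == key;
                }

                /* Returns false if displacements cycle too long, which signals
                   that it's time to rehash or grow the tables. */
                static bool cuckoo_insert(uint32_t key) {
                    for (int kick = 0; kick < MAX_KICKS; kick++) {
                        uint32_t i = h1(key);
                        if (table1[i] == EMPTY) { table1[i] = key; return true; }
                        uint32_t evicted = table1[i];   /* kick out the resident */
                        table1[i] = key;
                        key = evicted;

                        uint32_t j = h2(key);
                        if (table2[j] == EMPTY) { table2[j] = key; return true; }
                        evicted = table2[j];
                        table2[j] = key;
                        key = evicted;
                    }
                    return false;
                }

                int main(void) {
                    cuckoo_init();
                    cuckoo_insert(42);
                    return cuckoo_lookup(42) ? 0 : 1;
                }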