1. 12
  1.  

    1. 3

      Semgrep is super cool as a tool. It kinda sucks that it’s trying really hard to bolt on “pay us money” value-adds, because at its core it feels so good as a standalone thing that should sit right next to grep itself, IMO.

      1. 2

        OK cool, I’d be interested to hear what kinds of problems you’ve used it for. I haven’t tried it, but I found it interesting because it is a polyglot tool.

        Although now that I look, the implementation appears pretty heavy. It looks like most languages have multiple back ends, like tree-sitter and menhir and “generic”:

        https://github.com/returntocorp/semgrep/tree/develop/languages/bash/ast

        So it feels like the result quality can vary depending on language and back end, though again I haven’t actually used it.


        I was also interested in https://cseweb.ucsd.edu/~dstefan/cse291-fall16/notes/uchex/ which uses “micro-grammars”, but it doesn’t appear to be open source. It’s a much lighter implementation and has some semantic understanding of code as well.


        FWIW, one motivation for looking at this is that I find GitHub’s new source viewer to be extremely slow since they added the more semantic features. Anyone else find that?

        A lot of this is the front-end JS, not the back-end analysis. But I’m also finding the back-end analysis doesn’t work that well either? I thought GitHub was supposed to have some semantic jump-to-definition.

        Looks like they have made multiple attempts:

        https://github.blog/changelog/2019-06-11-jump-to-definition-in-public-repositories/ (based on a Haskell library that seems to be no longer used)

        Actually here, it says Python is the ONLY language with precise code navigation?

        https://docs.github.com/en/repositories/working-with-files/using-files/navigating-code-on-github

        But I happen to have a ton of Python code, and it seems like it doesn’t work. I thought they were doing all this stuff with stack graphs:

        https://github.blog/2021-12-09-introducing-stack-graphs/

        1. 2

          I have a large monorepo containing many projects. When I deprecate an API in an internal library, I use semgrep to automatically rewrite the whole repo to use the new API. https://semgrep.dev/docs/writing-rules/autofix/
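
          For example, a rule for one of these migrations can be tiny. A minimal sketch (the rule id and the old_api/new_api names are made up for illustration, and it assumes a Python codebase):

              rules:
                - id: migrate-old-api       # hypothetical rule id
                  languages: [python]       # assumed target language
                  severity: WARNING
                  message: old_api() is deprecated; use new_api()
                  pattern: old_api($ARG)    # $ARG is a semgrep metavariable
                  fix: new_api($ARG)

          Running something like “semgrep --config migrate.yaml --autofix .” then rewrites every match in place.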

        2. 2

          Check out https://github.com/bablr-lang. Inasmuch as every other tool I’ve ever seen gets some important part of the requirements wrong, I think I may be the very first to hit the nail on the head. I’m shooting for supporting 10,000 languages instead of 10 (or, in the case of stack graphs, one). And indeed, my code is wildly, radically simpler than that of any other tool I’ve ever seen.

          1. 2

            Ehh, but you didn’t test your grammars/metalanguage on a real language? The only language you have is your own CSTML markup?

            It will get a lot more complicated if you try to do, say, Python or JavaScript, and then bash is the real test for any parsing metalanguage. C++ is another good test (including the preprocessor).


            I think you have to start from the problem and go down. Not start from the grammars and go up.

            Or at least balance the two. That was basically the whole point of the Oil blog in the beginning - http://www.oilshell.org/blog/tags.html?tag=parsing-shell#parsing-shell

            When you start from the grammars and try to go up, you’ll find that it falls apart: your metalanguage is too weak.

            e.g. tree-sitter solves a bunch of problems nicely, but they still have ~1,150 lines of hand-written C for lexing bash, which seems like a step backward - https://github.com/tree-sitter/tree-sitter-bash/blob/master/src/scanner.c

            Also, it appears that something like 50 bugs were recently fixed by one person, which means they had 50 or more bugs open for ~5 years. It’s a sign that the abstraction/metalanguage doesn’t quite fit the problem.

            In your case, I think you’ll also find that your metalanguage is too slow (e.g. to understand C++ or bash, or even JS itself), which I think was the comment you mailed me about.

            1. 1

              You’re right that the basic process is making the grammar more and more powerful. I’ve gone through that process about five times already, each time adding fundamental degrees of freedom that are necessary for parsing real languages. That is to say, in order to move quickly you will not see me maintaining a working Ruby syntax while I design the VM, but you will see me carefully adapt the design to ensure that it is powerful enough to parse Ruby’s heredocs. For example, even though I’ve never written a parser for heredoc syntax, I know heredocs will be the first time I have trivia tokens inside string quotes, so I’m preserving that freedom carefully as I work on the core.

            2. 1

              As to speed, I don’t have benchmark numbers up yet, but they won’t be nearly as bad as you think! That’s because I have taken some important steps for perf:

              • I’m using monomorphic ASTs, which NONE of the rest of the tooling in the JS ecosystem does! JS normally has a nasty problem with ASTs: if you have { type, id } and { type, expr }, you might assume type will always sit at memory offset 0, and all the ASTs in the current JS ecosystem seem to assume that JS can optimize the node.type access pattern that way. But it can’t and probably never will be able to, because every distinct object shape gets its own hidden class, which makes shared node.type access sites polymorphic. Instead, in my system nodes are more like { type, value }, where value might be { id } or it might be { expr }. Nearly every JS engine can and will optimize node.type accesses into direct memory-offset access when the objects are exactly monomorphic like this (see the sketch after this list).
              • All those template tag strings you see in the grammar are indeed written in an incredibly slow way. At every single step of evaluation the process must stop and do a string parse to get from a template string to an instruction object! But because those are just static objects, all I need to do to make that fast is pre-parse the expressions and hoist the resulting objects, and I’ll go from sllooowww to blazing speed (or, well, fast enough for my purposes).
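
              A minimal sketch of both points in plain JS (the node shapes and names here are illustrative, not BABLR’s actual layout or API):

                  // Typical JS AST: every node type is a distinct object shape, so a
                  // shared node.type access site sees many hidden classes and goes
                  // megamorphic.
                  const polyId = { type: 'Identifier', name: 'x' };
                  const polyBin = { type: 'BinaryExpression', left: polyId, right: polyId };

                  // Monomorphic layout: every node has exactly the shape { type, value },
                  // so the engine can compile node.type down to one fixed-offset load.
                  const monoId = { type: 'Identifier', value: { name: 'x' } };
                  const monoBin = { type: 'BinaryExpression', value: { left: monoId, right: monoId } };

                  // Hot path: this dispatch only ever sees the { type, value } shape.
                  function visit(node) {
                    switch (node.type) {
                      case 'Identifier':
                        return node.value.name;
                      case 'BinaryExpression':
                        return visit(node.value.left) + '+' + visit(node.value.right);
                    }
                  }

                  // Hoisting sketch for the second point: do each template-string parse
                  // once at module load instead of at every evaluation step.
                  const parseInstruction = (src) => ({ op: 'instr', src }); // stand-in parser
                  const EAT_TOKEN = parseInstruction('eat(<Token/>)');      // parsed exactly once
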
        3. 2

          I’ve used it for two things recently:

          • Finding examples of certain AST usage in a Python codebase (in my case it was finding “if x is y” usages where y is neither “None” nor “not None”) - see the rule sketch after this list

          • Trying to do an API refactor (x.foo + y.bar -> x.baz(y) stuff) that worked thanks to the very specific nature of my problem (other concerns were not a big deal, and a failed change would be instantly caught by the test suite anyway)
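
          As rules, those look something like this sketch (the ids and messages are made up; the foo/bar/baz names are from the examples above, and both rules assume Python):

              rules:
                - id: is-comparison-not-none     # the AST-search case
                  languages: [python]
                  severity: INFO
                  message: is-comparison against something other than None
                  patterns:
                    - pattern: $X is $Y
                    - pattern-not: $X is None    # exclude the idiomatic forms
                    - pattern-not: $X is not None
                - id: foo-plus-bar-to-baz        # the API-refactor case
                  languages: [python]
                  severity: WARNING
                  message: rewrite x.foo + y.bar as x.baz(y)
                  pattern: $X.foo + $Y.bar
                  fix: $X.baz($Y)                # applied with semgrep --autofix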

          semgrep is one way of looking at it, but with the suggested change feature it’s more like semawk (or a primordial version of it, anyways)

        4. 2

          OK cool, I’d be interested to hear what kind of problems you’ve used it for.

          https://github.com/fedimint/fedimint/blob/97b3b701aed2dffc556edd12c2159afcf9aa526d/.semgrep.all.yaml#L1 from our Rust project. Not much, but very useful.

        5. 1

          Oh sorry, you’re OilShell Andy! You already know about my project, though it didn’t interest you last time around. cst-tokens has since been rebranded to CSTML, Spamex, and the BABLR VM.