Threads for underyx

  1. 5

    I thought this community would be particularly interested in the parser work we’re doing with Semgrep. Since the tool is multilingual with minimal runtime dependencies, we don’t want to rely on native compiler parsers. While the tool was being developed at FB it used homegrown parsers written in OCaml, but we’ve now replaced most of those with bindings to the tree-sitter project, a big undertaking that has paid off by giving us a 4-5 nines parse success rate on some of our languages (https://dashboard.semgrep.dev has stats).

    1. 3

      I’ve been looking at this to replace splint which is effectively unmaintained…

      I note C is still in the “Experimental” bucket…

      My own experience with parsing C for analysis is that the preprocessor is a huge obstacle, especially as any statement about correctness or otherwise has to be made relative to the unprocessed source, and preferably the correctness of all branches of #if should be checked.

      What approach are you guys taking? Analysing post-preprocessor source? Pre-preprocessor source?

      1. 2

        I replied above (sorry, first time on lobsters)

    1. 2

      This is great! It looks a lot like Semmle’s ql, except usable from the command line. This is the kind of tool I’ve been dreaming of for a long time.

      I’m wondering, how are you parsing the languages? Are you using tree-sitter or re-implementing your own parsers? Why did you choose to base your work on pfff rather than github’s semantic?

      1. 1

        As you’ve seen, semgrep is a frontend to a larger program analysis library named pfff. Pfff began and was open-sourced at Facebook, but is now archived. Its primary maintainer now works in our team at r2c.

        The syntax for queries largely originated in INRIA’s coccinelle project, which created automatic semantic patches for the Linux kernel. The original creator of the tool, Yoann, did a PhD there.

        Because of that, the parsers are quite custom at the moment. Semantic lacks some linter-specific functionality we have in pfff; that being said, we are quite interested in using https://tree-sitter.github.io/ as a base.

      1. 4

        Cool! On the other hand, this is a lot longer of a command than grep foo

        docker run --rm -v "${PWD}:/home/repo" returntocorp/semgrep --lang python --pattern '$X == $X' test.py
        
        1. 1

          For programs that run in Docker, it’s handy to wrap the boilerplate in a shell function. For example, you might stick something like this in your Bash or ZSH startup script.

          semgrep() {
            docker run --rm -v "${PWD}:/home/repo" returntocorp/semgrep "$@"
          }
          

          Then running your example looks a little better:

          semgrep --lang python --pattern '$X == $X' test.py
          
          1. 1

            Wouldn’t an alias be more appropriate? But yes, a function works fine.

            1. 1

              I think the alias wouldn’t be able to interpolate the working directory into the volume mount with -v "${PWD}:/home/repo".

              1. 1

                It can; why wouldn’t it?
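
                For what it’s worth, here is a minimal sketch of the difference, assuming Bash. echo stands in for the docker run command so it runs anywhere, and the alias names are made up for illustration. With single quotes the alias body keeps ${PWD} unexpanded until the alias is used; with double quotes it is expanded once, when the alias is defined:

                ```shell
                # Scripts need this in Bash; interactive shells expand aliases by default.
                shopt -s expand_aliases 2>/dev/null || true

                start_dir=$PWD

                # Single quotes: ${PWD} is stored literally and expanded on every use.
                alias mount_late='echo "${PWD}:/home/repo"'
                # Double quotes: ${PWD} is expanded once, right here at definition time.
                alias mount_early="echo \"${PWD}:/home/repo\""

                cd /
                late=$(mount_late)     # reflects the new working directory
                early=$(mount_early)   # still reflects the directory at definition time
                echo "late:  $late"
                echo "early: $early"
                ```

                So an alias should work as long as its body is single-quoted; the function form mostly makes the deferred expansion harder to get wrong.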

        1. 11

          This is a program that reads text files and writes to stdout; why do you use docker to distribute it? Is it difficult to compile? If that is the case, why not give us a static binary?

          OK, I would love to try it, but I don’t use docker. Typically, in these cases, I can read the dockerfile and reproduce the build steps. But here, it seems to be really complex. Why does it mess with certificates?

          I would very much appreciate a clear explanation of the compilation instructions, if a binary cannot be made available. As in: I have just installed debian and cloned the semgrep repo. Which packages do I need to apt-get before compiling semgrep? (And similarly for openbsd.)

          1. 2

            Same maintainer speaking as above. Thanks for the feedback; we just shipped binaries for the first time for macOS and Debian in the most recent release, click through to find them.

            As for the certificates: I’m not sure off the top of my head! Perhaps it’s because Nuitka[0] doesn’t embed the certificates that our Python dependency chain brings in via certifi? That’s sort of a wild guess, but I’m quite curious now. I’d imagine some tools might expect to get a path to a certificate as opposed to just the certificate content itself.

            And if you want to compile from source, development.md will point you in the right direction. None of us has tried compiling on OpenBSD so far, so I’d suggest a Debian base for self-compiling.

            [0]: https://github.com/Nuitka/Nuitka compiles our Python package into a binary

            1. 2

              Thank you very much, it looks great! Will try to compile it on debian.

          1. 17

            Perhaps it’s just me, but running CLI apps with docker run seems rather weird. I tried using the semgrep-v0.6.1-ubuntu-16.04.tgz from the releases page, but can’t get that to work. Could probably figure that out, but having to figure out how to even run it for a quick test to see if I like it is a bit of a turn-off IMO.

            1. 4

              Hey, I’m on the semgrep team, sorry it’s been giving you trouble! If you just wanna give it a quick go, semgrep.live might be to your liking. As for the installation woes, we provide an install script[0] (warning, will download) for the time being.

              [0]: The install script’s needed cause the fastest way we could get things working was to do the parsing and heavy lifting in OCaml, and write the more feature-packed and user-friendly CLI in Python, so we have two binaries to ship together. I assume your preferred way here would be to just install a .deb package?

              1. 17

                heavy lifting in OCaml, and write the more feature-packed and user-friendly CLI in Python,

                Is writing a featureful CLI in OCaml really that hard? Shipping a single binary would be much more palatable than docker imo.

                1. 2

                  Daniel Buenzli made a reasonable arg parsing library for OCaml but I can’t remember the name. It’s pretty good, though.

                  EDIT: It’s cmdliner

                2. 12

                  My preferred way would be to install a single binary with a simple build into $PATH. Packaging would be nice, but I’m using OpenBSD, so a deb package (or a docker container) doesn’t really help.

                  (Docker as a part of someone’s build has become a red flag for “hellish to get building” – it’s led me to steer away from a whole bunch of packages when I was doing stuff that had to run on Android, as well as for my personal computing environment.)

                  1. 11

                    I think you’re insanely undervaluing this by saying “like grep, but for code.” This is a language for writing linters that can target multiple languages, and it has “one liner” support for ad hoc searching. Don’t sell yourself short.

                    1. 1

                      Did you write your own parser for all 5 languages currently supported?

                      1. 1

                        Yes, Debian package would be great.