I thought this community would be particularly interested in the parser work we’re doing with Semgrep. Since the tool is multilingual with minimal runtime dependencies, we don’t want to rely on native compiler parsers. While the tool was being developed at FB it used homegrown parsers written in OCaml, but we’ve now replaced most of those with bindings to the tree-sitter project, a big undertaking that has paid off by giving us a 4-5 nines parse success rate on some of our languages (https://dashboard.semgrep.dev has stats).
I’ve been looking at this to replace splint, which is effectively unmaintained…
I note C is still in the “Experimental” bucket….
My own experience with parsing C for analysis is that the preprocessor is a huge obstacle, especially as any statement about correctness or otherwise has to be made relative to the unprocessed source, and preferably the correctness of all branches of each #if should be checked.
What approach are you guys taking? Analysing post-preprocessor source? Pre-preprocessor source?
I replied above (sorry, first time on lobsters)
We parse the code as-is, that is, we don’t call the preprocessor, because this would require knowing how to invoke it (which -I and -D flags?), would slow the analysis down a lot (preprocessed code is really big), and would not allow you to match code as it is written.
We do the same thing as the Coccinelle tool, if you’re familiar with that refactoring tool. The parsing tricks we use are explained in this paper from CC’09: https://www.semanticscholar.org/paper/Parsing-C%2FC%2B%2B-Code-without-Pre-processing-Padioleau/e55537b006545ece0a1143b0f4f41307139ddfa2
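
To make the "parse the code as-is" idea a bit more concrete, here is a toy Python sketch. It is not Semgrep's or Coccinelle's actual implementation, and it skips the macro heuristics the paper covers. It only shows one ingredient: walking unpreprocessed C source and keeping every `#if`/`#else` branch, each line tagged with the conditionals around it, without evaluating any condition or needing any -I/-D flags. The names `annotate_branches` and `SAMPLE` are made up for this illustration.

```python
import re

# Conditional directives only; #include, #define, etc. are treated as
# ordinary lines in this simplified sketch.
DIRECTIVE = re.compile(r"^\s*#\s*(ifdef|ifndef|if|elif|else|endif)\b(.*)$")


def annotate_branches(source):
    """Return (line, guards) pairs for every non-blank, non-conditional line.

    `guards` is a tuple with the text of each enclosing conditional
    directive. No condition is evaluated, so all branches are kept.
    """
    stack = []
    out = []
    for line in source.splitlines():
        m = DIRECTIVE.match(line)
        if m is None:
            if line.strip():
                out.append((line.strip(), tuple(stack)))
            continue
        kind, rest = m.group(1), m.group(2).strip()
        if kind in ("if", "ifdef", "ifndef"):
            stack.append(f"{kind} {rest}".strip())
        elif kind in ("elif", "else"):
            if not stack:
                raise ValueError(f"#{kind} without a matching #if")
            # Replace the innermost guard with the new branch label.
            stack[-1] = f"{kind} {rest}".strip()
        else:  # endif
            if not stack:
                raise ValueError("#endif without a matching #if")
            stack.pop()
    if stack:
        raise ValueError("unterminated conditional block")
    return out


SAMPLE = """\
int f(void) {
#ifdef DEBUG
    log("enter");
#else
    fast_path();
#endif
    return 0;
}
"""

for text, guards in annotate_branches(SAMPLE):
    print(guards, text)
```

Both the `DEBUG` and the non-`DEBUG` line survive here, so a pattern could be matched against either one exactly as it was written in the file.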