This paper is somewhat jumbled. The short version is that a bunch of tools were run through the benchmark used for a 2017 software verification competition, in which lots of systems were compared. The benchmark programs were mostly short, i.e., under 20 lines, but there are lots of them.
The programs used to represent testing were: AFL-fuzz, CPATiger, Crest-ppc, FShell, Klee, and PRtest. A somewhat odd selection to label ‘testing’, but if you are looking for automated tools, what else is there?
The model checkers were: Cbmc, CPA-Seq, Esbmc-incr, and Esbmc-kInd.
Figure 6 shows the ‘testing’ tools to be a lot slower than the model-checking tools. Is this because the testing tools are designed to work on programs much larger than 20 lines, and so have a higher startup cost? The model-checking tools would certainly have performance issues with larger programs.
Perhaps when the authors have had time to think about their results, they will write a more interesting paper (assuming the data contains something interesting).
I also think they have focused entirely on the wrong benchmark.
I really don’t care about two orders of magnitude difference in CPU time for this class of task.
I really really do care about how much of my time and brain power is required to set up and run these tools.
They can then run all week for all I care.
I suspect that under that benchmark, and given their results, AFL looks very very good. (And from what I have read, and from a couple of personal experiments I have run, AFL is indeed very very good.)
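For what it's worth, the low setup cost is easy to illustrate. A minimal AFL session looks something like the sketch below (assuming AFL is installed; `target.c` is a hypothetical program that reads the file named on its command line):

```shell
# Compile with AFL's instrumenting wrapper around the normal compiler.
afl-gcc -o target target.c

# Provide at least one seed input for the fuzzer to mutate.
mkdir -p in
echo 'hello' > in/seed

# Run the fuzzer: -i is the seed corpus, -o is the findings directory,
# and @@ is replaced by the path of each generated input file.
afl-fuzz -i in -o out -- ./target @@
```

That is essentially the whole workflow; crashing inputs turn up under `out/crashes/`, and the tool can then be left running unattended.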
Also very interesting: I didn’t spot (admittedly on a fast read) any deeper analysis of the finding that the fuzzers found some bugs the model checkers didn’t, and vice versa.
They needed a benchmark that the model-checking tools had a chance of being able to handle; the one they used is it.
Methinks the authors saw the opportunity to get a paper involving formal methods published, with their names on it, that did not involve too much effort.
Appreciate you taking the time to enumerate weaknesses. Being able to check small modules is fine for my purposes, since I advocate using formal methods for the safe interfacing of components. However, for use with legacy software, or with software not designed with verification in mind, that would be a big problem for the model checkers.
I think this highlights an interesting point. Fuzzers test the program. A program IS a model. Model checkers also test models. Formal models are models of models. And software is arguably a model of a user’s mental model. Therefore formal models are models of models, which model a model. It might just be turtles all the way down.