This never gets old. On the one hand, the test strategy and execution are very impressive. With the tools that we currently have, this test suite is definitely one of the “best” on the planet, in quotes because it’s still such a multidimensional problem.
On the other, and with all due respect, the amount of effort they’ve put in here is almost scary. This is said about the overall size of the various suites they have:

By comparison, the project has 608 times as much test code and test script
I always say that 2-3x is a pretty fair scale factor between an application and a thorough test suite. 608x obviously blows that out of the water. When I read that, I assumed it must include generated code, but they clearly distinguish between the code they’ve written and the code that gets generated via various test case macros. Still, 90 million lines of code seems like a lot to write by hand, even for an open source project with a lot of developers. I’d be curious to have that clarified. Either way, this is well within the range where formal verification makes sense. I wonder if they’d consider it.
I love their usage of MC/DC (modified condition/decision coverage). I’ve talked and written extensively about the pitfalls of measuring statement coverage, and even regular branch coverage. MC/DC is likely the most robust and thorough coverage criterion we have, and it’s even required for the highest level of criticality in airplane software (Level A under DO-178C).
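To make the distinction concrete, here’s a tiny made-up decision in C (nothing to do with SQLite’s code) and the test vectors each criterion demands:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical decision with three conditions. */
static bool eligible(bool a, bool b, bool c){
  return (a && b) || c;
}

int main(void){
  /* Branch coverage needs only the whole decision evaluated true
   * once and false once, e.g. (1,1,0) and (0,0,0): two tests.
   *
   * MC/DC needs each condition shown to independently flip the
   * outcome while the others are held fixed: at least n+1 = 4
   * vectors for n = 3 conditions. */
  assert( eligible(1,1,0) == true  );  /* baseline */
  assert( eligible(0,1,0) == false );  /* only a changed: outcome flips */
  assert( eligible(1,0,0) == false );  /* only b changed: outcome flips */
  assert( eligible(1,0,1) == true  );  /* only c changed vs (1,0,0) */
  return 0;
}
```

A statement-coverage tool would call that function fully covered after the first test alone, which is exactly the pitfall.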
I don’t believe MC/DC is measurable, though, at least not automatically, so I’m not sure where their 100% number comes from. In their justification, they show the techniques they use to achieve it, but those look like they require correctly applying a macro to particular conditions in the code. That doesn’t sound like an automated process.
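For context, the technique their docs describe, as I read it, is to wrap each boolean subterm in a testcase() macro so that an ordinary line-coverage tool like gcov ends up proving that every subterm mattered. A rough sketch of the idiom from memory; the harness function name and the example condition are mine, not SQLite’s:

```c
/* Sketch of the testcase() idiom from SQLite's testing docs; the
 * harness function and the example condition are hypothetical. */
#ifdef COVERAGE_TEST
  void coverage_hit(int line);  /* harness records that this line ran */
# define testcase(X)  if( X ){ coverage_hit(__LINE__); }
#else
# define testcase(X)
#endif

static int in_range(int a, int b){
  /* Each subterm gets its own testcase(), so a 100% line-coverage
   * report implies each subterm evaluated to true at least once;
   * together with both branches of the if() being taken, that
   * approximates MC/DC for the compound condition. */
  testcase( a==0 );
  testcase( b>100 );
  if( a==0 || b>100 ) return 1;
  return 0;
}
```

The 100% claim then reduces to checking that every testcase() line fired, but nothing automatically verifies that a testcase() was written for every subterm in the first place, which is the manual step I mean.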
Worth noting that even with all this effort, bugs do get reported and fixed.
Overall, this is super inspirational, and an amazing case study. Verification remains the hardest problem in computer science.
I have the rare honor of having found and reported a bug in SQLite: https://www.sqlite.org/src/info/6bfb98dfc0c
When I was working on the Giles Production System Compiler, it generated some pretty hairy SQL, and that’s how I found this one. Because the test code I was running was for a government contract, we had to go through all sorts of rigamarole to report it: the bug was so obscure that the minimal reproduction included material that couldn’t be disclosed. That was fun.
I often reference SQLite’s tests for inspiration and real-world TCL examples when I have to write my own Expect scripts. It’s quite impressive.