I still don’t understand how people can routinely put back code into a master branch without ensuring that it even builds correctly, let alone functions as designed. When an accident happens and you break the build, or the software in general, it seems prudent to back the change out of the repository until you can understand it. Whatever the mistake was should be understood, so that you (and your colleagues) can learn from it and avoid it in future.
Mistakes can always happen, but negligence is simply not cricket.
“I still don’t understand how”: a blameless postmortem is an interesting tool to try to find out. The idea is that things that in hindsight look negligent might have seemed perfectly sensible at the time. Finding out why they seemed sensible might show gaps in tooling or training. (Eg. tests were run but not on exactly the changeset that got merged; tests have been failing for engineer X for weeks but she’s ignoring them because they pass in CI; etc.)
GitHub has a CI-integration “protect this branch”
feature: you can configure master so that a PR can only be merged if a particular CI check has passed on the branch being merged.
I agree wholeheartedly that it is important not to lay blame (or worse) for mistakes. Doing a post mortem analysis of mistakes is a crucial part of avoiding repeating the same mistake over and over; to ignore the problem is to become negligent.
If you run the tests on a patch that is not the same as the one you eventually merge, you didn’t really run the test. Discovering that this is true, and understanding that even small changes can result in unanticipated defects is an opportunity to take ownership of, and learn from, a mistake. To continue routinely putting back changes where you did not test the actual patch is, subsequently, negligence.
If the tests routinely fail on your machine (but not on others) and you ignore those failures without understanding the cause: that’s negligence as well. Every engineer in a team is responsible for the quality of the software. This is a core part of avoiding blame games – if everybody owns, analyses, learns from and shares their mistakes, nobody need point upset fingers at others.
Certainly the blameless postmortem idea is only going to work if you do something with the findings. If the kinds of mistakes you’re talking about carry on happening regularly, then yes you have a problem.
People will still make mistakes, though. That’s the nice thing about a tooling solution like that GitHub configuration: a whole class of mistakes simply can’t happen any more.
For this reason, I think more version control systems need to make an easy, visible revert part of their core functionality.
Where I work, it usually happens due to portability. Someone checks in code that builds fine on their preferred dev platform and assumes it will work on the others. We have an abstraction layer that helps with differences in the system libraries, but things like mistaken capitalization in an #include will work on Windows but not Linux. Conversely the GCC linker is more forgiving about mixing up struct vs. class forward declarations than VS.
Me too, the most common portability isssue we have is wchar_t = UTF-8 vs UTF-16.
Yeah, developing for more than one platform can make it much more tedious to make sure your code is tested before it goes back. If this kind of failure happened more than once or twice, though, I would probably consider adding some kind of #include case check tool to be run out of make check. We do this already for things like code style checks.
You could conceivably make it relatively easy to install the checks as a pre-commit hook in the local clone via some target like make hooks. Pushing code to a feature branch to trigger some kind of CI build/test process before cherry-picking into master could help as well.
I had close to a dozen build failures in the space of an hour because someone built live-environment integration tests into the CI test process, and they depended on a dev service that was down. “Fixing” the build entailed rerunning it unchanged once the depended-upon service had been restarted. It has always been my experience that broken CI builds are due to unforeseeable problems or circumstances outside the developer’s control, not a lack of due diligence on the developer’s part; so these “build-breaker shaming” devices seem incredibly counterproductive to me.
I had it happen when I changed to a job where I had to use a different IDE from the one I was used to. I was used to making the kind of change that would show up immediately as a failure in my IDE if it was incorrect; if not, I would habitually commit to master, confident that it would work. Running a command-line build or unit tests was simply not justified in terms of the cost given the level of confidence I tended to have in such a change. With the new IDE my confidence was entirely misplaced and I broke a lot of master builds until I adjusted.