Very interesting … would love to know bcantrill’s opinion here about the “over”-engineering around doing the right thing in filesystems :-D
One source of frustration when talking about these sorts of issues is when people are full of bluster about how the simple thing basically works and we don’t have any evidence that there was much corruption for most people. I think it’s pretty easy not to have that evidence when your system, by construction, can’t even tell if certain classes of corruption occur. How do you ever know that they really haven’t, and not just that you didn’t notice?
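To make that concrete, here is a toy sketch (not any real filesystem's on-disk format; every function name here is made up for illustration). A read path without a checksum returns a flipped bit as if nothing happened, while a read path that keeps a digest next to the data can at least tell you something is wrong:

```python
import hashlib
from typing import Tuple


def write_block(data: bytes) -> Tuple[bytes, bytes]:
    """Pretend to write a block; return it plus a SHA-256 digest kept alongside."""
    return data, hashlib.sha256(data).digest()


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate bit rot by flipping a single bit in the stored block."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def read_plain(stored: bytes) -> bytes:
    """No checksum: whatever the 'disk' hands back is accepted as correct."""
    return stored


def is_intact(stored: bytes, digest: bytes) -> bool:
    """Recompute the digest and compare it with the one saved at write time."""
    return hashlib.sha256(stored).digest() == digest


def read_verified(stored: bytes, digest: bytes) -> bytes:
    """Checksummed read: refuse to return data that no longer matches its digest."""
    if not is_intact(stored, digest):
        raise IOError("checksum mismatch: block is corrupt")
    return stored


if __name__ == "__main__":
    data, digest = write_block(b"important file contents")
    rotten = flip_bit(data, 5)

    # The plain read path has no way to notice the damage.
    print("plain read returned damaged data:", read_plain(rotten) != data)

    # The verified read path notices it.
    print("verified read sees intact block:", is_intact(rotten, digest))
```

The point is only that "we never saw corruption" means something different for the two read paths: the first one could not have seen it even if it happened.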
Sometimes I have to debug file system software, and silent corruption is my omnipresent fear.
Silent corruption is something that is hard to deal with, precisely because it’s silent.
Another big difference between BSD/Unix and Linux is probably that their interests diverged until ZFS came along. If you look at the history and at the scale of the systems involved, you’ll see that Linux was what people also ran on lower-end platforms, while Solaris and the BSDs ran on “enterprise setups”. Again, historically; I know that’s not true today.
Another thing is that systems at this scale really are rather new. While it used to be okay to swap out disks in a RAID setup and accept a bit of downtime on that node, one thing Sun tried to solve with ZFS is exactly the silent corruption problem, which won’t bug you until you have really big infrastructure. And even on a single system, ZFS simply needs resources that weren’t there yet.
Meanwhile Linux, which ran on “cheap” hardware even at the scale of Google, probably had a larger developer/user base in which this very specific failure case showed up. Another thing that has changed: Linux used to have more kernel panics. Again, that’s history, and I think today they are equal or Linux even has fewer, but it means that abrupt shutdowns were simply more likely, on average, to happen on Linux.
Bit rot and the like, or complete (noticed) disk failures, aren’t things that Linux has handled any better, partly because until not too long ago hardware RAID was preferred over software RAID. That is another thing that changed only recently.
None of that is meant to say anything about which is better or worse; it’s just the history.
Something that happens really frequently in IT, at both the software and the company level, is that the thing you do worst becomes the thing you do best, simply because it nags you (or the developers) enough. That holds true for so many developments.
I agree. And now there is ZFS (and to a degree HAMMER), which completely turned that around. Unix/BSD were the first to adopt it. Meanwhile Btrfs isn’t quite there yet, and in terms of community adoption it is slowly being overtaken by ZFS on Linux.
The BSDs weren’t worse off for having soft updates as an option; you could turn them off. NetBSD even removed all the soft updates code.