We’re sadly missing a tag for incident stories; they carry a lot of information and plenty of chances to learn from them.
Wow, when I saw Ceph I just assumed it was some deep distributed systems problem. But this issue would have broken even non-Ceph local volumes – and the initial error was more than a year ago. Storage is hard!
Indeed, and with many commonly deployed, deeply layered storage stacks it feels harder than it really needs to be.
Even if you use ZFS to eliminate layers, there are still questions about how to optimally configure the two main ZFS layers (pool and datasets) and the layers below ZFS (e.g., NVMe and partitions). For example, should I use nvme-format to initialize the NVMe device with a particular block size? Do I partition the device so I can boot from it, or give ZFS the whole thing and boot from PXE? What should I specify, if anything, for the ashift pool option? And then, of course, there are many choices when it comes to splitting the pool into datasets.
(I’m actually dealing with these questions right now. I don’t expect free advice from an illumos core developer though.)
I feel that in a courtroom, this would be prejudicial!
But seriously, I agree there are always choices you’re going to have to make on some level. My free advice extends to:
If you are picking a block size on the NVMe device, I’d be careful about going too large. Some aspects of the base ZFS design mean that large disk sectors can conspire with particular vdev topologies (e.g., raidz) and particular dataset options (e.g., small recordsize) to produce a surprising amount of space overhead. 512-byte sectors give ZFS the best space efficiency, as I suspect is true of many file systems on some level; 4K sectors are common now as well, but I’d be careful going larger without some testing with your particular workload and settings. If your datasets are going to have recordsizes close to the disk sector size, I believe mirrors do much better than raidz for space efficiency (a rough sketch of the arithmetic follows below).
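To make that overhead concrete, here’s a back-of-the-envelope sketch in Python of the raidz allocation arithmetic as it’s usually described (parity sectors per stripe row, plus padding of each allocation to a multiple of nparity + 1). The function names are mine and the numbers are illustrative, not a statement about any particular implementation:

```python
import math

def raidz_sectors(recordsize, sector_size, width, nparity):
    """Approximate sectors allocated for one logical block on a raidz vdev."""
    data = math.ceil(recordsize / sector_size)    # data sectors needed
    rows = math.ceil(data / (width - nparity))    # stripe rows used
    parity = rows * nparity                       # parity sectors, per row
    total = data + parity
    pad = nparity + 1                             # skip-sector padding unit
    return math.ceil(total / pad) * pad

def efficiency(recordsize, sector_size, width, nparity):
    data = math.ceil(recordsize / sector_size)
    return data / raidz_sectors(recordsize, sector_size, width, nparity)

# 128K records on 4K sectors, 6-wide raidz2: ~67% of raw space holds data.
print(f"{efficiency(128 * 1024, 4096, 6, 2):.0%}")
# 8K records on the same 4K-sector vdev: only ~33%.
print(f"{efficiency(8 * 1024, 4096, 6, 2):.0%}")
```

The second case is the trap: once the recordsize is only a couple of sectors, a 6-wide raidz2 keeps roughly a third of the raw space as data, while a two-way mirror always keeps half.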
I’m not sure if you’re planning to deploy on an illumos system, but I don’t think you will have to mess with ashift there if you’ve already formatted the device with a particular sector size. We should detect that size and pick an appropriate ashift at pool creation (9 for 512, 12 for 4K, etc.). Probably good to confirm that we picked the right one though!
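The mapping itself is just a power of two, so a sanity check is trivial. A minimal sketch (the helper name is hypothetical, not an illumos API):

```python
# ashift is log2 of the device's logical sector size:
# 512 -> 9, 4096 -> 12.
def expected_ashift(sector_size: int) -> int:
    assert sector_size > 0 and sector_size & (sector_size - 1) == 0, \
        "sector size must be a power of two"
    return sector_size.bit_length() - 1

print(expected_ashift(512))   # 9
print(expected_ashift(4096))  # 12
```

If memory serves, something like `zdb -C <pool>` will show the ashift each top-level vdev actually ended up with, so you can compare against the expected value.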
I would be inclined to give ZFS the entire disk and boot from something else if possible – whether PXE or some other device like a USB stick or SD card. It seems simpler on a number of levels, and it’s what we did with SmartOS at Joyent for many years when I was there.
There’s a reason most companies buy storage in a pre-made box rather than try to piece it together themselves.