1. 31
  1. 17

    Not an expert, but I do run a lot of ZFS RAIDz2 on NVMe on Linux and have done a fair bit of tuning for it. I don’t know which specific thing is giving you such “impossible” numbers, but I’m happy to suggest a few things that might be in play, and maybe even how to squeeze more out of it! (btw, I don’t mean this to come across as patronising, I’m just writing a few things out for readers that haven’t seen it, or for actual experts to tell me I’m doing it wrong!).

    Most of the performance is going to be from the ARC, that is, the memory cache. ZFS will aggressively use RAM for caching (on Linux, by default, the ARC will grow to as much as half the physical RAM). You’ve already seen this in Note #3; reducing the RAM reduces throughput. Incidentally, you can tune how much RAM is used for the ARC with zfs_arc_min and zfs_arc_max (see zfs-module-parameters(5)); you don’t have to reduce the system “physical” RAM (though maybe that was more convenient for you to do).
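
    If you want to see how much work the ARC is doing, here is a rough Python sketch. It assumes the usual OpenZFS-on-Linux kstat file at /proc/spl/kstat/zfs/arcstats, laid out as two header lines followed by name/type/data rows. The helper names are my own, and the sample text is made up so the sketch still runs on a machine without ZFS.

    ```python
    from pathlib import Path

    ARCSTATS = Path("/proc/spl/kstat/zfs/arcstats")  # assumed location on Linux


    def parse_arcstats(text):
        """Turn kstat 'name type data' rows into a {name: int} dict."""
        stats = {}
        for line in text.splitlines()[2:]:  # skip the two kstat header lines
            fields = line.split()
            if len(fields) == 3 and fields[2].isdigit():
                stats[fields[0]] = int(fields[2])
        return stats


    def arc_hit_ratio(stats):
        """Fraction of ARC lookups that were served from RAM."""
        total = stats["hits"] + stats["misses"]
        return stats["hits"] / total if total else 0.0


    # Made-up sample in the kstat layout (illustration only, not real data).
    SAMPLE = """\
    13 1 0x01 3 816 1 2
    name                            type data
    hits                            4    900
    misses                          4    100
    size                            4    34359738368
    """

    if __name__ == "__main__":
        text = ARCSTATS.read_text() if ARCSTATS.exists() else SAMPLE
        stats = parse_arcstats(text)
        print(f"ARC size:  {stats['size'] / 2**30:.1f} GiB")
        print(f"hit ratio: {arc_hit_ratio(stats):.1%}")
    ```

    A hit ratio close to 100% during a read benchmark is a strong hint that you are measuring RAM rather than the drives.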

    Compression gets ZFS a huge amount of throughput, because it’s faster to do a smaller read and decompress it than wait for the IO (turning compression off can actually make things slower, not faster, because it has to hit the disk more). Compression is block level, and as a special case, all-zero blocks are not even written - the block header has a special case that says “this is all-zeroes, length XXX” that ZFS just inflates. Finally, turning off compression doesn’t change the compression state of already-written blocks, so if you’re benchmarking on data that already exists, you’ll need to rewrite it to really “uncompress” it.

    In a RAIDzX, data is striped across multiple devices, and reads can be issued in parallel to get bits of the file back and then reassemble them in memory. You have 32 lanes, so you’re probably right in saying you’re not saturating the PCI bandwidth. You’re almost certainly getting stuff as fast as the drives can give it to you.

    You’re using 512B blocks. Most current NVMe is running 4K blocks internally. The drive firmware will likely be loading the full 4K block and returning a 512B chunk of it to the system, and keeping the rest of the block cached in its own memory. In sequential reads that’s going to mean almost always 7 out of 8 blocks are going to be served from the drive’s own cache memory, without touching the actual flash cells. (This is well worth tuning, by the way - use flashbench to determine the internal block size of your drive, and then find out how to do a low-level format for your device to switch it to its native block size. Along with an appropriate ashift for your pool, it will let ZFS and the Linux block layer deal in the drive’s native block size all the way through the stack, without ever having to split or join blocks).
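
    The arithmetic behind that, as a small Python sketch. The function names are my own invention. The only facts it leans on are that ashift is the base-2 logarithm of the sector size ZFS assumes, and the assumption from above that the drive keeps the rest of a fetched native block in its own cache.

    ```python
    def ashift_for(block_size):
        """ashift is log2 of the sector size ZFS should assume."""
        if block_size <= 0 or block_size & (block_size - 1):
            raise ValueError("block size must be a power of two")
        return block_size.bit_length() - 1


    def cached_fraction(logical, native):
        """Share of sequential logical-block reads that land in a native
        block the drive already fetched for an earlier read."""
        per_native = native // logical
        return (per_native - 1) / per_native


    print(ashift_for(512), ashift_for(4096))  # 9 12
    print(cached_fraction(512, 4096))         # 0.875, i.e. 7 out of 8
    ```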

    ZFS will use a variable blocksize, by default growing blocks as large as 128K. When reading, it will request the entire logical block, composed of multiple physical blocks, from the block layer. If they’re stored sequentially, that can translate to a single “range request” on the PCI bus, which may get coalesced into an even larger range, which the drive may be able to service entirely with parallel fetches against multiple flash cells internally.

    Not sure which version of ZFS you’re using, but versions before 2.0.6 or 2.1.0 have a serious performance bottleneck on wide low-latency vdevs (GH#12121 GH#12212).

    In my experience though, and yours too, ZFS performance is pretty good out of the box. Enough that even though my workload does some things that are outright hostile to a CoW filesystem, the gains have been so good that it hasn’t yet been worth changing the software.

    1. 2

      Great list, that’s almost surely what’s at play here. I don’t think the drive/file system speeds are actually being measured.

      Some other performance tuning things to think about with zfs: if you have a fast slog vdev you can set sync=always, but if your zil is slow you can set sync=disabled to gain a lot of speed at the expense of safety. For some use cases that’s okay.

      When I ran mechanical disks I used an Optane drive for my slog vdev and it was so fast I couldn’t measure a performance difference when using sync=always.

      1. 2

        I am trying out a few things with fio and will post the results here. There was a suggestion on HN that mirrors what you’re suggesting. I’ll update the article if I find that 157 GB/s is a bogus result.

        Edit: OK folks, party is over. 157 GB/s is a misleading number. The FIO library needs separate files for each thread; otherwise, it will report incorrect bandwidth numbers. See this post, I am in the process of updating the article: https://news.ycombinator.com/item?id=29547346

        Updated, thanks everyone! - https://neil.computer/notes/zfs-raidz2/#note-5

      2. 1

        and then find out how to do a low-level format for your device to switch it to its native block size

        What does this mean? I’ve set up aligned partitions and filesystem block sizes (or ashift for ZFS), but I don’t know what a low-level format even means.

        1. 3

          All drives (flash and spinners) have a “native” block size. This is the block size that the drive electronics will read from or write to the actual storage media (magnetic platters or flash cells) in a single unit. Sizes vary, but in pretty much every current NVMe SSD the block size is 4KB.

          Traditionally though, most drives arrive from the factory set to present a 512B block size to the OS. This is mostly for legacy reasons; back in the mists of time, physical disks blocks were actually 512B, and then the joy of PC backward compatibility means that almost everything ever since starts by pretending to be from 1981, even if that makes no sense anymore.

          So, the OS asks the drive what its block size is, and it comes back with 512B. Any upper layers (usually a filesystem, maybe also intermediate layers like cryptoloops) that operate in larger block sizes will eventually submit work to the block layer, and it will then have to split the block into 512B chunks before submitting them to the device.

          But, if the device isn’t actually 512B natively, then it has to do more work to get things back into its native block size. Say you write a single 512B block. A drive doing 4K internally will have to fetch the entire 4K block from storage into its memory, update it with the changed 512B, then write it back down. So it’s a bit slower and, for SSDs, it means more writes, which increases wear.
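
          A toy Python model of that read-modify-write cycle, in case it helps. The class and its counters are purely illustrative (real firmware is far smarter than this); it just counts how many native-block media reads and writes a host write costs.

          ```python
          class NativeBlockDevice:
              """Toy drive with a fixed native block size that counts media I/O."""

              def __init__(self, native=4096, blocks=4):
                  self.native = native
                  self.media = bytearray(native * blocks)
                  self.media_reads = 0
                  self.media_writes = 0

              def write(self, offset, data):
                  first = offset // self.native
                  last = (offset + len(data) - 1) // self.native
                  for blk in range(first, last + 1):
                      start = blk * self.native
                      lo = max(offset, start)
                      hi = min(offset + len(data), start + self.native)
                      if hi - lo < self.native:
                          # Partial block: the old contents must be fetched first.
                          self.media_reads += 1
                      buf = bytearray(self.media[start:start + self.native])
                      buf[lo - start:hi - start] = data[lo - offset:hi - offset]
                      self.media[start:start + self.native] = buf
                      self.media_writes += 1


          small = NativeBlockDevice()
          small.write(512, b"x" * 512)   # a single 512B sector
          full = NativeBlockDevice()
          full.write(0, b"x" * 4096)     # one aligned native block

          print(small.media_reads, small.media_writes)  # 1 1
          print(full.media_reads, full.media_writes)    # 0 1
          ```

          In this model the 512B write costs an extra media read and still rewrites a whole 4K block, so 4096 bytes hit the media for 512 bytes of payload.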

          So what you can do on many drives is a “low-level format”, which is also an old and now meaningless term for setting up the basic drive structure. Among other things, you can change the block size that is exposed to the OS. If you can make it match the native block size, then the drive never has to deal in partial blocks. And if you can set the same block size through the entire stack, then you get to eliminate partial block overheads from the entire stack.

          I should note here that all this talk of extra reads and writes and wear and whatnot makes it sound like every SSD must be a total piece of crap out of the box, running at glacial pace and wearing itself out while it’s still young. Not so! Drive electronics and firmware are extremely good at minimising the effects of all these, so for most workloads (especially large sequential reads) the difference is barely even measurable.

          But if you’re building storage systems that are busy all the time, then there is performance being left on the table, so it can be worth looking at this. My particular workload includes constant I/O of mostly small random reads and writes, so anything extra I can get can help.

          I mentioned flashbench before, which is a tool to measure the native block size of a flash drive, since the manufacturer won’t always tell you or might lie about it. It works by reading or writing blocks of different sizes, within and across theoretical block boundaries, and looks at the latency for each operation. For example, you might try to read 4K blocks at 0, 2K, 4K, 6K, etc. offsets. If it’s 4K internally, then the drive only has to load a single block at 0, but will have to load two blocks at 2K to cross the block boundary, and this will be visible because it takes just a little longer to do its work. It’s tough to outsmart the drive electronics (for example, current Intel 3DNAND SSDs will do two 4K fetches in parallel, so a naive read of the latency figures can make it look like it actually has an 8K block size internally), but with some thought and care, you can figure it out. Most of the time it is 4K, so you can use that as a starting point.
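
          To make the boundary-crossing idea concrete, here is a small Python sketch. It is not flashbench and it measures nothing; it only predicts how many native blocks each 4K read would touch for a candidate internal block size, which is the pattern you would then look for in the latency figures. The function name is my own.

          ```python
          OFFSETS = (0, 2048, 4096, 6144)


          def native_blocks_touched(offset, length, native):
              """Number of native blocks spanned by a read of `length` bytes."""
              first = offset // native
              last = (offset + length - 1) // native
              return last - first + 1


          for native in (4096, 8192):  # candidate internal block sizes
              pattern = [native_blocks_touched(off, 4096, native) for off in OFFSETS]
              print(f"native {native}: {pattern}")
          # native 4096: [1, 2, 1, 2]
          # native 8192: [1, 1, 1, 2]
          ```

          If the slow reads line up with the first pattern, the drive is probably 4K internally; if only the 6K offset is slow, 8K looks more likely (subject to the parallel-fetch caveat above).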

          On Linux, the nvme list tool can tell you the current block size reported by the drive. Here’s some output for a machine I’m currently right in the middle of reformatting as described above (it was inadvertently introduced to production without having been reformatted, so I’m having to reformat individual drives then resilver, repeatedly, until it’s all reformatted. Just another sysadmin adventure!).

          [fastmail root(robn)@imap52 ~]# nvme list
          Node             SN                   Model                 Namespace Usage                      Format           FW Rev  
          ---------------- -------------------- --------------------- --------- -------------------------- ---------------- --------
          /dev/nvme0n1     PHLJ133000RS8P0HGN   INTEL SSDPE2KX080T8   1           8.00  TB /   8.00  TB      4 KiB +  0 B   VDV10170
          /dev/nvme10n1    PHLJ132601GU8P0HGN   INTEL SSDPE2KX080T8   1           8.00  TB /   8.00  TB    512   B +  0 B   VDV10170
          /dev/nvme11n1    PHLJ133000RH8P0HGN   INTEL SSDPE2KX080T8   1           8.00  TB /   8.00  TB      4 KiB +  0 B   VDV10170
          /dev/nvme12n1    PHLJ131000MS8P0HGN   INTEL SSDPE2KX080T8   1           8.00  TB /   8.00  TB      4 KiB +  0 B   VDV10170
          
          

          So you can see that nvme10n1 is still on 512B.
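
          If you have a lot of drives to check, something like this Python sketch can pick out the stragglers. It parses text in the layout shown above (the rows are copied from that output); the regex and function name are my own, and the column layout of nvme list may differ between nvme-cli versions, so treat the pattern as an assumption.

          ```python
          import re

          # Rows copied from the nvme list output above.
          NVME_LIST = """\
          /dev/nvme0n1     PHLJ133000RS8P0HGN   INTEL SSDPE2KX080T8   1           8.00  TB /   8.00  TB      4 KiB +  0 B   VDV10170
          /dev/nvme10n1    PHLJ132601GU8P0HGN   INTEL SSDPE2KX080T8   1           8.00  TB /   8.00  TB    512   B +  0 B   VDV10170
          /dev/nvme11n1    PHLJ133000RH8P0HGN   INTEL SSDPE2KX080T8   1           8.00  TB /   8.00  TB      4 KiB +  0 B   VDV10170
          /dev/nvme12n1    PHLJ131000MS8P0HGN   INTEL SSDPE2KX080T8   1           8.00  TB /   8.00  TB      4 KiB +  0 B   VDV10170
          """

          # Matches the Format column, e.g. "4 KiB +  0 B" or "512   B +  0 B".
          FORMAT_RE = re.compile(r"(\d+)\s+(B|KiB)\s+\+\s+\d+\s+B")
          UNITS = {"B": 1, "KiB": 1024}


          def block_sizes(listing):
              """Map each device node to its reported block size in bytes."""
              sizes = {}
              for line in listing.splitlines():
                  if not line.startswith("/dev/"):
                      continue
                  match = FORMAT_RE.search(line)
                  if match:
                      node = line.split()[0]
                      sizes[node] = int(match.group(1)) * UNITS[match.group(2)]
              return sizes


          still_512 = [dev for dev, size in block_sizes(NVME_LIST).items() if size == 512]
          print(still_512)  # ['/dev/nvme10n1']
          ```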

          And then once you’ve done that, you have to issue a low-level format. I think it might be possible with nvme format, but I use the Intel-specific isdct and intelmas tools. Dunno about other brands, but I expect the info is easily findable especially for high-quality devices.

          Do remember though: low-level format destroys all data on the drive. Don’t attempt in-place! And I honestly wouldn’t bother if you’re not sure if you need it, though I guess plenty of people try it “just for fun”. You do you!

      3. 3

        Any experts here who can shed light on how this is possible? See Note 2 and Note 3 at the end of the article.

        1. 3

          I found 1 source that shows these drives have a 2GB LPDDR4 cache each[1]

          So that’s 16GB of faster-than-nand cache total, plus the 64GB of system memory, is 80GB. So maybe that’s skewing the fio numbers?

          1. https://techgage.com/news/samsung-reveals-the-970-pro-970-evo-generation-m-2-ssds/
        2. 1

          I haven’t spent a lot of time thinking about this but here is my two cents:

          The benchmark produces synthetic files which have low entropy and thus are highly compressible by lz4. This results in abnormal I/O bandwidth, i.e., small binary files on disk become big files in RAM.

          Can you measure the compressibility on the synthetic files?

          Small example: imagine the benchmark tool is creating binary files containing a long chain of 0’s. Lz4 can compress this file into a very small file. Real data will almost always have a decent amount of entropy, especially if it is already in a compressed file format like most pictures or videos. I think zfs is intelligent enough that it doesn’t compress high entropy files.
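
          A quick way to get a feel for this, sketched in Python. Note the assumption: it uses zlib from the standard library as a stand-in for lz4, so the exact ratios will not match what ZFS achieves, but the contrast between zeroes and random data should hold. To check the real benchmark files you would feed their contents through the same function.

          ```python
          import os
          import zlib


          def compression_ratio(data):
              """Uncompressed size divided by compressed size (higher = more compressible)."""
              return len(data) / len(zlib.compress(data))


          BLOCK = 128 * 1024           # 128K, the default maximum ZFS block size
          zeros = bytes(BLOCK)         # a long chain of 0's
          noise = os.urandom(BLOCK)    # high-entropy data

          print(f"all zeroes: {compression_ratio(zeros):.0f}x")
          print(f"random:     {compression_ratio(noise):.2f}x")
          ```

          The zero-filled block shrinks by orders of magnitude, while the random block does not shrink at all.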

          1. 2

            I found the problem, it’s to do with the benchmarking library fio: https://news.ycombinator.com/item?id=29547346