1. 20

  2. 9

    [On Rob Pike arguing against the sendfile system call] I find his argument in that post pretty difficult to follow (if the kernel provides a way to do something, and that way gives you better performance in practice, why not use it?). But I thought it was interesting.

    Rob Pike comes from the school of minimalist (post-V7 AT&T Unix, Plan 9) kernel & systems design, so I imagine he has a problem with such special-purpose syscalls. (Not to mention microkernel people would likely have a problem with it too.)

    1. 2

      It doesn’t strike me (at least in its present form) as being all that special-purpose – I suppose it was in the original “out_fd must be a socket” form, but that restriction went away in Linux 2.6.33 (Feb. 2010). I guess the man page linked at the start of that thread may not have been updated to reflect that at the time of the post in question (Sept. 2010), so Pike may still have been looking at an out-of-date description. I’m not sure (and mildly curious) exactly how much that factored into his negative opinion of it.

      (The “in_fd must be mmap-able” requirement still stands though, which is slightly unfortunate.)
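
      For reference, the call in its current form is easy to use. Here’s a minimal sketch of streaming a regular file to an already-connected socket (the function name and error handling are mine, not from the man page):

      ```c
      /* Minimal sketch: stream a regular file to a connected socket with
       * sendfile(2). sock_fd is assumed to be connected already; error
       * handling is pared down for brevity. */
      #include <sys/sendfile.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>

      static int send_whole_file(int sock_fd, const char *path)
      {
          int file_fd = open(path, O_RDONLY);
          struct stat st;
          if (file_fd < 0 || fstat(file_fd, &st) < 0)
              return -1;

          off_t offset = 0;
          while (offset < st.st_size) {
              /* The kernel feeds pagecache pages straight to the socket;
               * the data never passes through a userspace buffer. */
              ssize_t sent = sendfile(sock_fd, file_fd, &offset,
                                      (size_t)(st.st_size - offset));
              if (sent <= 0) {
                  close(file_fd);
                  return -1;
              }
          }
          close(file_fd);
          return 0;
      }
      ```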

      1. 2

        On a busy server, the in_fd file will already be in memory; servers already benefit so much from userspace caching (varnish, nginx, etc) that I don’t see the point of sendfile() – it’s only faster in situations where write() can be improved, so I’d rather see effort there. Furthermore, I think NaCl and TLS are fundamentally incompatible with sendfile(), and I think we should be recommending strong, fast crypto on the network.

        Maybe if you have a system without enough RAM, a small virtual address space (like a 386), no encryption/privacy requirements, and too many very big files, then sendfile() makes sense for performance reasons.

        1. 5

          The advantage of sendfile is that the data can never hit userspace; it’s handled entirely in the kernel. Having to bounce back and forth between user and kernel space to make read()s and write()s is not something you can improve only by optimizing write() (see the sketch below).

          > servers already benefit so much from userspace caching

          If you’re using sendfile() you’ll likely rely more on pagecache in the kernel than caching in userspace.
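
          To make the bounce concrete, this is the read()/write() loop that sendfile() replaces (a sketch; the function name is mine, and file_fd/sock_fd are assumed to be set up already). Every chunk costs two syscalls and two copies across the user/kernel boundary:

          ```c
          #include <unistd.h>

          /* Copy file_fd to sock_fd through a bounded userspace buffer:
           * each chunk is copied kernel->user by read(2), then
           * user->kernel by write(2), with a mode switch per call. */
          static int copy_read_write(int sock_fd, int file_fd)
          {
              char buf[64 * 1024];
              ssize_t n;
              while ((n = read(file_fd, buf, sizeof buf)) > 0) {
                  char *p = buf;
                  while (n > 0) {            /* write(2) may be partial */
                      ssize_t w = write(sock_fd, p, (size_t)n);
                      if (w < 0)
                          return -1;
                      p += w;
                      n -= w;
                  }
              }
              return n < 0 ? -1 : 0;
          }
          ```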

          1. 1

            > Having to bounce back and forth

            You do not have to “bounce back and forth” since you only read() once.

            > The advantage of sendfile is that the data can never hit userspace

            If I read() a file, it is in memory. That those pages are mapped to user pages isn’t significant unless they take up a lot of pages that I’m frequently swapping out for another large set of pages. Pinning to a CPU is the normal way people do this.

            The advantage of write() is that the code is much simpler which means better cache utilisation.

            > If you’re using sendfile() you’ll likely rely more on pagecache in the kernel than caching in userspace.

            I don’t know exactly what you’re trying to say here. You already need to maintain a cache in userspace.

            I don’t believe that mapping pages in the kernel is somehow more efficient than mapping them to a user program that never gets unmapped; if it’s true, then it’s a bug that should be fixed rather than adding hacky syscalls.
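
            To be concrete about the pattern I mean, here’s a sketch (cache_entry and the function names are hypothetical): the file is read() into a long-lived userspace cache once, and every request is then served with a single write() from that cache.

            ```c
            #include <stdlib.h>
            #include <unistd.h>
            #include <fcntl.h>
            #include <sys/stat.h>

            /* Hypothetical long-lived userspace cache entry: the file is
             * read() once at load time, then served with one write() per
             * request. Short reads/writes and error-path cleanup are
             * ignored for brevity. */
            struct cache_entry {
                char  *data;
                size_t len;
            };

            static int cache_load(struct cache_entry *e, const char *path)
            {
                int fd = open(path, O_RDONLY);
                struct stat st;
                if (fd < 0 || fstat(fd, &st) < 0)
                    return -1;
                e->len = (size_t)st.st_size;
                e->data = malloc(e->len);
                /* the one read() */
                ssize_t n = e->data ? read(fd, e->data, e->len) : -1;
                close(fd);
                return n == (ssize_t)e->len ? 0 : -1;
            }

            static int serve(int sock_fd, const struct cache_entry *e)
            {
                /* the write()-vs-sendfile() comparison is about this call */
                return write(sock_fd, e->data, e->len) == (ssize_t)e->len ? 0 : -1;
            }
            ```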

            1. 5

              I think you’re underestimating the cost of switching between user and kernel space and of copying data between them; it can be a bottleneck when you’re doing high-throughput serving. This is why sendfile was created: to avoid that overhead.

              > You do not have to “bounce back and forth” since you only read() once.

              The equivalent non-sendfile code would read a chunk, write it, read a chunk, write it, etc. This is to avoid slurping a multi-GB file into RAM; you want to keep your buffer sizes constrained.

              > If I read() a file, it is in memory.

              It’s now in memory twice: once in the pagecache from when the kernel read it, and a second time in userspace as the output of the read(). This is wasteful.

              > That those pages are mapped to user pages isn’t significant unless they take up a lot of pages that I’m frequently swapping out for another large set of pages.

              The problem is that you need to take all this data that has already been copied from kernel to user space by read(), and now copy it back from user to kernel space with write().

              > I don’t know exactly what you’re trying to say here. You already need to maintain a cache in userspace.

              You don’t need to maintain a cache in userspace. You can let the kernel take care of it for you via the pagecache, and many applications take this approach as you get caching mostly for free.

              > I don’t believe that mapping pages in the kernel is somehow more efficient than mapping them to a user program that never gets unmapped;

              The question isn’t the cost of mapping pages. The issue is the cost of copying data and transferring control between userspace and kernelspace.

              1. 1

                > The equivalent non-sendfile code would read a chunk, write it, read a chunk, write it, etc. This is to avoid slurping a multi-GB file into RAM; you want to keep your buffer sizes constrained. … The problem is that you need to take all this data that has already been copied from kernel to user space by read(), and now copy it back from user to kernel space with write().

                You are arguing a straw man.

                I am comparing write() with sendfile() and not sendfile() with some userspace code that does what sendfile does.

                It should be obvious that write() is going to be faster than sendfile() because it does less.

                There’s no reason to copy data between user and kernel space: User pages are accessible to kernel space, and they only need to be remapped if another process is going to run on that CPU. If write() copies unnecessarily, then it is a bug that should be fixed.

                If the file is demand-paged (e.g. mmap()) and you call write() on it, then it should be at least as fast as sendfile, and sendfile is just a wasted syscall. If it isn’t that fast, write() should be fixed. (See the sketch below.)

                If the address space is not large (e.g. a 386), but the files are very large, and we don’t need encryption, then that’s certainly something else, but I’m not going to argue that either.
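
                A sketch of what I mean (the function name is mine; partial writes and error paths are trimmed):

                ```c
                #include <sys/mman.h>
                #include <sys/stat.h>
                #include <fcntl.h>
                #include <unistd.h>

                /* Map the file (demand-paged, backed by the pagecache) and
                 * hand the mapping straight to write(2): no read() into a
                 * private buffer. */
                static int serve_mmapped(int sock_fd, const char *path)
                {
                    int fd = open(path, O_RDONLY);
                    struct stat st;
                    if (fd < 0 || fstat(fd, &st) < 0)
                        return -1;
                    char *map = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                     MAP_SHARED, fd, 0);
                    close(fd);                 /* the mapping stays valid */
                    if (map == MAP_FAILED)
                        return -1;
                    ssize_t n = write(sock_fd, map, (size_t)st.st_size);
                    munmap(map, (size_t)st.st_size);
                    return n == (ssize_t)st.st_size ? 0 : -1;
                }
                ```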

                1. 3

                  > You are arguing a straw man.

                  As you haven’t fully sketched out what you’re proposing, I’m having to make a best-effort guess at what you mean. I’m presuming you’re trying to create a system that reads files off disk and serves them over the network, which is what sendfile() was created for.

                  > I am comparing write() with sendfile() and not sendfile() with some userspace code that does what sendfile does.

                  That’s an odd comparison, and you had mentioned using read() so I thought you were comparing read()+write() to sendfile().

                  If you’re not using read(), where is the data coming from then?

                  In any case, sendfile() will still be faster than write() (presuming the data is cached in kernel and userspace respectively), as the data doesn’t have to be copied from user space with sendfile().

                  > User pages are accessible to kernel space, and they only need to be remapped if another process is going to run on that CPU.

                  I’m not sure what you mean by remapping user pages. If there’s a context switch on the CPU, page tables are per-process so there’s no remapping of pages required.

                  Are you getting confused by NUMA, or conflating this with memory management on something other than Linux?

                  I’m not sure what this point has to do with the question at hand.

                  > If the file is demand-paged (e.g. mmap()) and you call write() on it, then it should be at least as fast as sendfile, and sendfile is just a wasted syscall. If it isn’t that fast, write() should be fixed.

                  None of this is done by write() on Linux. You’d need splice() or vmsplice() to explicitly ask for this behaviour, as it’s not generally safe to do this without the user asking for it. Consider what’d happen if the data changes in memory after you make the syscall but before it’s sent over the network, or if the data isn’t aligned.
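
                  For illustration, the explicit opt-in looks roughly like this (a sketch; the function name is mine and error handling is minimal). splice(2) requires a pipe at one end, so the file data is staged through one on its way to the socket:

                  ```c
                  #define _GNU_SOURCE
                  #include <fcntl.h>
                  #include <unistd.h>

                  /* Rough sketch: push len bytes from file_fd to sock_fd with
                   * splice(2). The kernel moves page references around rather
                   * than copying data through a userspace buffer. */
                  static int splice_file_to_socket(int sock_fd, int file_fd, size_t len)
                  {
                      int p[2];
                      if (pipe(p) < 0)
                          return -1;
                      while (len > 0) {
                          ssize_t in = splice(file_fd, NULL, p[1], NULL, len,
                                              SPLICE_F_MOVE);
                          if (in <= 0)
                              break;
                          while (in > 0) {   /* drain the pipe into the socket */
                              ssize_t out = splice(p[0], NULL, sock_fd, NULL,
                                                   (size_t)in,
                                                   SPLICE_F_MOVE | SPLICE_F_MORE);
                              if (out <= 0) {
                                  close(p[0]);
                                  close(p[1]);
                                  return -1;
                              }
                              in -= out;
                              len -= (size_t)out;
                          }
                      }
                      close(p[0]);
                      close(p[1]);
                      return len == 0 ? 0 : -1;
                  }
                  ```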

                  1. 1

                    > That’s an odd comparison, you had mentioned using read() so I thought you were comparing read()+write() to sendfile().

                    You are finding it odd because you believe user-space caching is unnecessary. However, if you note that everyone building high-performance systems is already doing user-space caching, it makes sense to consider that case from the start.

                    User-space caching means we don’t have to consider the cost of read() at all. We only need to consider the cost difference between write() and sendfile().

                    A busy server that wants to maximise speed will already have all of its files in memory. How they got there (because sendfile() cached them, or read() cached them) is irrelevant to building high-performance systems: we just need to make sure the system has enough RAM.

                    If we’ve done that, I can’t see any reason write() wouldn’t be faster than sendfile() since it will obviously be doing less.

                    > In any case, sendfile() will still be faster than write() (presuming the data is cached in kernel and userspace respectively), as the data doesn’t have to be copied from user space with sendfile().

                    Nonsense. If the data is page-aligned, the computer doesn’t have to do any extra work.

                    > Consider what’d happen if the data changes in memory after you make the syscall but before it’s sent over the network

                    The block cache is unified on all modern unixish systems (including Linux), which means blocks read from the disk using mmap(), read(), or sendfile() can all be shared. The kernel is already dealing with this situation.

                    1. 3

                      > You are finding it odd because you believe user-space caching is unnecessary.

                      I don’t believe it’s always unnecessary; however, in many cases doing it yourself is busywork and/or counterproductive, as the pagecache gives you a lot out of the box.

                      > However, if you note that everyone building high-performance systems is already doing user-space caching, it makes sense to consider that case from the start.

                      I know and have worked on many high-performance systems that depend on pagecache. Kafka is one example, where they’ve made this design choice explicit.

                      > If we’ve done that, I can’t see any reason write() wouldn’t be faster than sendfile() since it will obviously be doing less.

                      I disagree with your premise. Even then, as I’ve explained, write() has to do more: it has to copy the data from userspace to kernel space.

                      > How they got there (because sendfile() cached them, or read() cached them) is irrelevant to building high-performance systems: we just need to make sure the system has enough RAM.

                      Not quite: if you’re doing user-space caching then you’re duplicating the work of the pagecache and could, in the worst case, need double the RAM. With all the memory copies you’re also eating up CPU cycles and memory bandwidth, and throughput could be halved as a result.

                      > If the data is page-aligned, the computer doesn’t have to do any extra work.

                      That’s a big if: malloc() doesn’t guarantee page-aligned data, so that’s something you’d need to implement yourself (and worry about hugepages and fragmentation).
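
                      (Getting an aligned buffer is doable, to be fair; a minimal sketch:)

                      ```c
                      #include <stdlib.h>
                      #include <unistd.h>

                      /* malloc() guarantees only max_align_t alignment; a buffer
                       * for page-granular tricks has to be allocated explicitly
                       * page-aligned, e.g. with posix_memalign(3). */
                      void *alloc_page_aligned(size_t len)
                      {
                          long page = sysconf(_SC_PAGESIZE);
                          void *buf = NULL;
                          if (page < 0 || posix_memalign(&buf, (size_t)page, len) != 0)
                              return NULL;
                          return buf;
                      }
                      ```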

                      Even given alignment, there’d be extra work in making the data safely available to the kernel to protect against concurrent access, and likely some memory resource management issues too.

                      In any case, Linux at least doesn’t implement this. A quick search of the web doesn’t reveal any other kernels that do either (even FreeBSD, which is big on zero-copy for network stuff).

                      > The block cache is unified on all modern unixish systems (including Linux), which means blocks read from the disk using mmap(), read(), or sendfile() can all be shared. The kernel is already dealing with this situation.

                      The pagecache is used directly by mmap() and sendfile(). read() will use the pagecache too, but the data now in userspace has no link back to the file it came from. Thus a subsequent write() of that data can’t reuse what’s in the pagecache, even presuming alignment is handled somehow.

                      You seem to have several misconceptions about Linux memory management. I highly recommend Understanding the Linux Kernel which covers this and many other interesting topics.

    2. 4

      FYI: Netflix has also implemented a wrapper for sendfile that enables SSL/TLS. Using encryption with sendfile is possible.

      1. 2

        The async_sendfile that the Netflix paper talks about is now in FreeBSD head. I believe it will be part of FreeBSD 11.

        1. 2

          Encryption will prevent the use of this in most scenarios.