Memory-mapping the file (e.g. with mmap) can give you the best of both worlds. You can code as though the file were all in memory, which is simpler, and you can even backtrack or peek forward if necessary. But the computer doesn’t have to copy the whole file into RAM.
(Depending on the OS, you may lose some performance because the file is read in smaller chunks, possibly as small as 4KB. But in practice the kernel will usually see your access pattern and read ahead in bigger chunks, which btw gives you async I/O for free.)
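For illustration, here's a minimal sketch of that approach on a POSIX system; the file name (big.log) and the newline-counting loop are made up, and error handling is kept minimal:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("big.log", O_RDONLY);   /* hypothetical input file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
        if (st.st_size == 0) { close(fd); return 0; }   /* nothing to map */

        /* Map the whole file read-only; nothing is read until pages are touched. */
        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Code as if the file were one big in-memory string: count newlines. */
        size_t lines = 0;
        for (off_t i = 0; i < st.st_size; i++) {
            if (data[i] == '\n') lines++;
        }
        printf("%zu lines\n", lines);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }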
It’s not just the I/O being in smaller chunks; it’s also the cost of the page faults. If you do a read system call, the calling thread blocks and the data is copied into your buffer. If you do an mmap call, there’s some page-table setup and any pages already in the buffer cache will typically be made available immediately, but every time you touch a page that isn’t there you end up with a page fault (more expensive than a system call). This gives you a lot more jitter in loading.
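For contrast, the plain read path looks roughly like this (same hypothetical file, arbitrary 64 KiB chunk size); each read blocks and copies, but there are no faults when you access the buffer afterwards:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("big.log", O_RDONLY);   /* same hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        char buf[1 << 16];   /* 64 KiB chunks */
        ssize_t n;
        size_t lines = 0;
        /* Each read() blocks the calling thread while the kernel copies data
           into buf; touching buf afterwards never faults. */
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            for (ssize_t i = 0; i < n; i++) {
                if (buf[i] == '\n') lines++;
            }
        }
        printf("%zu lines\n", lines);
        close(fd);
        return 0;
    }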
The aio_* family of system calls lets you run the I/O in the background quite easily (for some reason, Linux’s implementation is very slow), and you can also use it to prefault things for mmap. If you do a one-byte aio_read per page, it will force the page into the buffer cache. When you then access the pages with mmap, they’ll already be in the buffer cache by the time you get to them.
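A rough sketch of that prefault trick with POSIX AIO; the file name is again hypothetical, and a real program would throttle and eventually reap the requests (on Linux, link with -lrt):

    #include <aio.h>
    #include <fcntl.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Prefault a file into the buffer cache by issuing a one-byte aio_read
       per page.  The bytes read are thrown away; the side effect is the point. */
    int main(void) {
        int fd = open("big.log", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        off_t size = lseek(fd, 0, SEEK_END);
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (size + page - 1) / page;

        struct aiocb *cbs = calloc(npages, sizeof *cbs);
        char *scratch = malloc(npages ? npages : 1);  /* one throwaway byte each */
        if (!cbs || !scratch) return 1;

        for (size_t i = 0; i < npages; i++) {
            cbs[i].aio_fildes = fd;
            cbs[i].aio_offset = (off_t)i * page;
            cbs[i].aio_buf    = &scratch[i];
            cbs[i].aio_nbytes = 1;
            cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;  /* no completion signal */
            if (aio_read(&cbs[i]) != 0) perror("aio_read");
        }

        /* ... now mmap the file and start working; by the time you touch most
           pages they should already be in the buffer cache. */
        return 0;
    }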
All of that said, avoiding a big up-front load entirely is far better than trying to optimise it.
Yes, this is related to shell programming: It’s often simpler and more efficient to send a program to another process that loads the data (e.g. an SQL query for tables, a CSS selector for documents, a jq expression for records), rather than loading the data yourself. It’s language-oriented programming rather than library-oriented.
In the first case you get a small piece of data back. In the latter you get a big graph of pointers in your own address space, and often you don’t need most of it.
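As a concrete sketch of the first style: hand a small program (here a made-up jq expression over a hypothetical data.json) to another process and read back only the small answer:

    #include <stdio.h>

    int main(void) {
        /* Send the "program" (a jq expression) to another process; it loads and
           parses the big file, and only the small result comes back to us. */
        FILE *p = popen("jq -r '.users[].name' data.json", "r");
        if (!p) { perror("popen"); return 1; }

        char line[256];
        while (fgets(line, sizeof line, p)) {
            fputs(line, stdout);   /* one extracted name per line */
        }
        return pclose(p) == 0 ? 0 : 1;
    }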
It’s counterintuitive to many, but a big string / byte stream is often a more efficient data structure than a graph of pointers in memory. (Obviously it depends on the operation.)
This is what I’m trying to get at with the fallacy: Programming With Strings (Byte Streams) Isn’t Slow
https://www.oilshell.org/blog/2021/07/blog-backlog-1.html#fallacies
It also relates to the “narrow waist” of byte streams and the Perlis-Thompson principle, which I’ve been discussing with other shell authors here.