The late adoption of getline() into the POSIX standard is probably responsible for its lack of widespread familiarity, which is a terrible shame, because it’s a wheel that so many programs have had to reinvent.
Usually people are surprised to find out that it isn’t a GNU extension (any more), because they are used to POSIX being limiting and cumbersome to work with.
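For anyone who hasn’t used it, here is a minimal sketch of what getline() saves you from writing. The helper name longest_line is made up for illustration; the only real API here is getline() itself, which grows the buffer for you and returns -1 at end of input or on error.

```c
#define _POSIX_C_SOURCE 200809L /* getline() is POSIX.1-2008 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: length of the longest line in fp, not counting
 * the trailing newline. Returns -1 if reading failed. */
long longest_line(FILE *fp)
{
    assert(fp != NULL);

    char *line = NULL; /* getline() allocates and reallocates this */
    size_t cap = 0;    /* current capacity of that buffer */
    ssize_t len;
    long best = 0;

    while ((len = getline(&line, &cap, fp)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            len--; /* do not count the newline itself */
        if (len > best)
            best = (long)len;
    }

    free(line); /* one free, no matter how many lines were read */
    return ferror(fp) ? -1 : best;
}
```

No realloc loop and no fixed-size buffer: the same line buffer is reused across calls and only grows when a longer line shows up.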
But it has also somehow failed to make it into the C standard. Which is kind of sad.
One of the easiest ways to do this is probably by reading the file with mmap: backtracking is as simple as pointer arithmetic and memory is magically managed by the kernel. Forget reallocs in loops and off-by-ones. It won’t work with streams though.
It also won’t work with special virtual files. Take /proc/cpuinfo on Linux for example. Depending on your tool, your users will expect it to work on such things.
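A rough sketch of the mmap approach, assuming a regular file. The function names count_lines_fd and count_lines_path are invented for this example. Anything that is not a regular file (a pipe, a terminal) is rejected up front, which is the "won’t work with streams" part.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count newlines in the file behind fd by mapping it.
 * Returns -1 if fd is not a regular file or cannot be mapped. */
long count_lines_fd(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode))
        return -1; /* pipes and other streams end up here */
    if (st.st_size == 0)
        return 0; /* nothing to map; mmap() rejects a zero length */

    size_t size = (size_t)st.st_size;
    const char *data = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED)
        return -1;

    /* The whole file is now one array: scanning is plain pointer arithmetic. */
    long lines = 0;
    const char *p = data, *end = data + size;
    while (p < end && (p = memchr(p, '\n', (size_t)(end - p))) != NULL) {
        lines++;
        p++; /* step past the newline just found */
    }

    munmap((void *)data, size);
    return lines;
}

/* Convenience wrapper that opens the path read-only first. */
long count_lines_path(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    long n = count_lines_fd(fd);
    close(fd);
    return n;
}
```

Note that the mapping length comes from st_size, so a file that reports a size of 0 looks empty to this code even if reading it would produce data.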
This gives me a bad idea for faking making it “work” for pipes, along the lines of http://t-a-w.blogspot.com/2007/03/segfaulting-own-programs-for-fun-and.html
Carve out a huge area of address space with one big mmap() call and mprotect() it all to PROT_NONE. When the main program tries to access it, catch the SIGSEGV, mprotect() the faulting page to read/write, read(2) one full page from the pipe to fill out the input block and then return from the SIGSEGV handler. Assuming read(2) is safe to call from a synchronous signal handler, which I’m not sure whether it is or not.
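Here is a very rough sketch of that idea, assuming Linux, a single mapped region, and strictly sequential access to it. All names (lazy_map_fd, lazy_fault, the lazy_* globals) are made up for illustration. On the safety question: read() is on POSIX’s list of async-signal-safe functions, but mprotect() is not, so calling it from the handler is an assumption that happens to hold on Linux rather than something the standard promises.

```c
#define _GNU_SOURCE /* for MAP_ANONYMOUS and MAP_NORESERVE on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical globals describing the one lazily filled region. */
static char *lazy_base;
static size_t lazy_len;
static size_t lazy_page;
static int lazy_fd = -1;

/* On a fault inside the region: unprotect that page, then fill it from the fd. */
static void lazy_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    char *addr = (char *)si->si_addr;
    if (addr < lazy_base || addr >= lazy_base + lazy_len)
        _exit(139); /* a genuine crash somewhere else: give up */

    size_t off = (size_t)(addr - lazy_base);
    char *page = lazy_base + (off - off % lazy_page);
    if (mprotect(page, lazy_page, PROT_READ | PROT_WRITE) != 0)
        _exit(1);

    /* Block until the page is full or the input ends. A fresh anonymous
     * page is zero-filled, so a short final page stays NUL-padded. */
    size_t got = 0;
    while (got < lazy_page) {
        ssize_t n = read(lazy_fd, page + got, lazy_page - got);
        if (n <= 0)
            break; /* EOF or error */
        got += (size_t)n;
    }
}

/* Reserve len bytes of inaccessible address space backed lazily by fd.
 * len must be a multiple of the page size. Returns NULL on failure. */
char *lazy_map_fd(int fd, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && len > 0 && len % (size_t)page == 0);

    void *p = mmap(NULL, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    lazy_base = p;
    lazy_len = len;
    lazy_page = (size_t)page;
    lazy_fd = fd;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = lazy_fault;
    sa.sa_flags = SA_SIGINFO; /* needed to receive the faulting address */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

The caller just gets a pointer and reads through it as if the whole pipe were already in memory. Because the handler waits for a full page or EOF, it will stall on input that trickles in slowly, and touching pages out of order would fill them with the wrong data.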
This is silly and really complicated, completely defeating the concept of making it “easy”, but you could hide it all behind one function call. ;) Plus it’ll crash when the address space runs out (after, like, several terabytes?) and I suppose there’s no sensible way to communicate the length of the file but… eh whatever. I never claimed this was a good idea. :)
That’s super interesting! I’ll try to play with this silly idea this weekend.
If you do, please share your results with lobste.rs! I know I would at least find it interesting!
So I tried to implement this and unfortunately it doesn’t work for most streams. That’s because there’s no way to use mprotect on anything smaller than a page, which becomes a problem just after calling read(), when you want to split the memory map into two parts: an allowed part for characters already read, and a forbidden part for characters that still need to be read. These two parts must have different protections, but that is impossible if the boundary between them is not page-aligned. So it could work if you can read the stream in chunks of 4KB, but not if it’s something typed by a human on stdin.
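To put numbers on the alignment problem, here is a small sketch of the arithmetic. The helper names are invented for this example; the only real call is sysconf(_SC_PAGESIZE), which reports the page size that mprotect() works in.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Page size of the running system (commonly 4096, but not guaranteed). */
size_t system_page_size(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return (size_t)page;
}

/* Hypothetical helper: how many of the `filled` bytes read so far sit in
 * completely filled pages, and can therefore be made accessible without
 * exposing anything unread. */
size_t exposable_prefix(size_t filled, size_t page)
{
    assert(page > 0);
    return filled - filled % page; /* round down to a page boundary */
}

/* Hypothetical helper: bytes already read that share a page with unread
 * space. mprotect() cannot expose them without exposing the rest of
 * that page too. */
size_t stranded_bytes(size_t filled, size_t page)
{
    assert(page > 0);
    return filled % page;
}
```

With 4096-byte pages, a read() that returns 5000 bytes leaves 4096 of them exposable and 904 stranded, and a 12-byte line typed on stdin is stranded entirely.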
Bummer. Thanks for following up!
Imo there was no need to strike out the word silly ❤
I imagine it would also fail with files that are larger than available memory (assuming the program actually touches every page).
Of course, it’s rather unlikely for a text file to exceed available memory nowadays, but I’ve seen some nightmare logfiles that were many GB in size.
When you mmap() an ordinary file read only, you’re limited by address space but not by memory. The kernel is free to discard pages from the mapping that you haven’t touched in a while, cuz it can always reload them from the file later if it needs to.
You’re right that the 32 bit address space limit is inconvenient but you can still do this on finite amounts of RAM. ❤
Yes it’s something to consider on 32 bit systems.