Threads for p4bl0

  1. 26

    Forgive the wall of text. Text editor data structures are something I’m particularly passionate about.

    The piece table is superior to the gap buffer only if the following are all true:

    1. Files are stored in the encoding they will be manipulated in, and/or it is trivial to synchronize the encoding in memory. Fixed-length encodings and UTF-8 fit these criteria, most other variable-length multibyte encodings do not.
    2. Your system does not have efficient paged virtual memory with a secondary store.
    3. Filesystem operations are not more than some fairly small constant multiplier more expensive than memory operations.

    As a big fan of Oberon, which used piece chains as its mechanism for storing text (derived from Gutkenecht’s earlier work on Lara), I’m pretty familiar with the piece chain data structure. The piece chain made perfect sense in Oberon, because the machines in question didn’t have virtual memory and the programming language itself didn’t support dynamically-sized allocations so it couldn’t reallocate a gap buffer even if it wanted to.

    (Some later versions of Oberon did support dynamically-sized allocations…)

    Piece chains have the advantage that all or almost all of the text stays on disk. This is very useful in that file loads are essentially instantaneous and memory usage is proportional to the number of changes rather than the size of the file. It makes absolutely perfect sense if you don’t have virtual memory that can be paged to and from disk.

    (Pike, in his The Text Editor sam extends this argument: it doesn’t work if you have crappy virtual memory either.)

    Virtual memory, though, lets the operating system handle the disk operations too. And the rapid loading benefit goes away if you have to decode the text (e.g. from some multibyte encoding to wide characters in memory).

    In terms of expense of editing operations, gap buffers and piece chains are more-or-less identical in the asymptotic sense: the piece chain maintains a cached (offset,piece) pair that corresponds directly to the location of the gap in a gap buffer. The performance of a piece chain degrades as you add more edits as you have to traverse more and more pieces to get to a certain spot. Gap buffers lack that issue.

    Piece chains have two important advantages: it’s easier to store metainformation about text if the metainformation applies to “runs” of text, since the formatting information can be stored in the piece headers; and undo/redo is easy to implement (but not incredibly so).

    In other words, piece chains are awesome and a particularly beautiful data structure, but their actual advantage over gap buffers on modern hardware under modern operating systems is kinda a mirage in most use cases.

    1. 4

      First, thanks for your insightful comment. You seem knowledgeable on the subject so I have a question for you.

      I discovered the piece table data structure while reading the linked article (so, just now), and my first thought was that it would be a good fit to implement a collaborative editor. I have not thought this through yet but I guess that instead of a single “new” file it would be possible to have one per participant. What do you think about this idea?

      1. 2

        Thinking about it, yeah, that would be an excellent use case. Each user could have their own append-only key file from which pieces are carved and linked into the chain. Each user would also need their own point offset cache, so you’d need to be careful about managing those but I think that would be considerably easier than maintaining multiple gaps.

        1. 2

          Cool. So this will be another project for motivated master students ^^. Thanks!

      2. 1

        Can you explain how to implement gap buffers in a way to exploit virtual memory? This seems to require quite complicated and probably nonportable overlapping mappings, but perhaps I’m missing something obvious.

        Also, piece chains trivially and efficiently support “multiple cursors” which are en vogue today.

        1. 1

          Re: the multiple cursor thing, yeah, that’s true. There was another comment above talking about that (well, collaborative editors, which are essentially the same thing). Multiple cursors are neat but I never got into them. I’m too old and set in my ways.

          Gap buffers taking advantage of virtual memory is automatic, assuming the VM implementation is reasonable. Allocate a buffer of whatever size and the individual pages of that allocation will be paged in and out as needed.

          One major advantage of piece chains is that most or all of the text remains or can remain on disk, which is important when you have limited memory. When you have an essentially infinite virtual memory that the OS automatically pages to and from disk as needed, there’s no longer any reason to not simply load the whole file into memory.

          1. 1

            I don’t see how this will work. You can mmap the whole file into a contiguous area, but then you cannot add a gap… or you can map it twice but then you need to remap everytime you move the gap… just making use of VM is either “load whole file into memory” or, really, “copy whole file into swap partition”.

            For piece chains, you just map once and look for the unchanged text there.

            1. 2

              That’s it. Read the whole file into memory. You’re done. When memory runs low/something else needs physical memory, the OS will page out part of the buffer to disk. When you access that part again, a page fault happens and the page is brought back in.

              No mmaping or anything. Just allocate as normal and let the virtual memory subsystem do its job.

              You pay the price of the initial load, but if you have to decode the text in any way (e.g. UTF-8 to wchar), you were gonna have to do it anyway.