1. 4
    1. 1

      In most of the text editors I’ve written, the internal representation is whatever wchar_t is on the system, almost certainly a 32-bit integer these days (Windows, where it is 16 bits, being the notable exception).

      But in my most recent editor, I assume files, input, and display are all UTF-8; this is a safe assumption these days.

      I store a file as an array of arrays, more or less, since that’s how the user deals with text (as a ragged two-dimensional array). Each inner array is just treated as bytes.
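
      To make the array-of-arrays idea concrete, here is a minimal sketch in Python. It is not the actual editor's code; the names `load`, `save`, and `split_line` are made up for illustration, and it assumes lines are separated by a single `\n`.

      ```python
      # Hypothetical sketch: a buffer is a list of lines, each line a bytearray
      # holding raw UTF-8. Nothing here decodes the text.

      def load(data: bytes) -> list[bytearray]:
          """Split raw file bytes into one bytearray per line (newlines dropped)."""
          return [bytearray(line) for line in data.split(b"\n")]

      def save(lines: list[bytearray]) -> bytes:
          """Join the lines back into file bytes."""
          return b"\n".join(lines)

      def split_line(lines: list[bytearray], row: int, byte_col: int) -> None:
          """Break line `row` in two at a byte offset, as pressing Enter would."""
          tail = lines[row][byte_col:]
          del lines[row][byte_col:]
          lines.insert(row + 1, tail)

      buf = load("héllo\nwörld".encode("utf-8"))
      split_line(buf, 0, 3)  # "h" is 1 byte, "é" is 2, so offset 3 is after "hé"
      ```

      Note that `split_line` takes a byte offset, so the caller has to hand it an offset that sits on a character boundary.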

      I abstract away the encoding shenanigans by having riders (borrowing the term from Wirth and Gutknecht). A rider is attached to a buffer and then can move forward a character, up a line, back, whatever, but it’s responsible for moving the correct number of bytes to end up aligned at a character position.
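
      A rough sketch of what such a rider could look like, in Python. This is my guess at the shape, not the real implementation: the class and method names are hypothetical, the column is a byte offset, the buffer is assumed to hold valid UTF-8, and `up` keeps the byte column rather than the display column, which a real editor would probably want to refine.

      ```python
      class Rider:
          """Hypothetical cursor over a buffer stored as a list of UTF-8 bytearrays."""

          def __init__(self, lines, row=0, col=0):
              self.lines, self.row, self.col = lines, row, col  # col is a byte offset

          @staticmethod
          def _is_continuation(byte):
              # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
              return (byte & 0xC0) == 0x80

          def forward(self):
              line = self.lines[self.row]
              if self.col >= len(line):
                  # At end of line: wrap to the start of the next line, if any.
                  if self.row + 1 < len(self.lines):
                      self.row, self.col = self.row + 1, 0
                  return
              self.col += 1
              # Skip the rest of a multi-byte character.
              while self.col < len(line) and self._is_continuation(line[self.col]):
                  self.col += 1

          def back(self):
              if self.col == 0:
                  # At start of line: wrap to the end of the previous line, if any.
                  if self.row > 0:
                      self.row -= 1
                      self.col = len(self.lines[self.row])
                  return
              line = self.lines[self.row]
              self.col -= 1
              # Walk back to the lead byte of the character.
              while self.col > 0 and self._is_continuation(line[self.col]):
                  self.col -= 1

          def up(self):
              if self.row == 0:
                  return
              self.row -= 1
              line = self.lines[self.row]
              self.col = min(self.col, len(line))
              # Snap back to a character boundary if we landed mid-character.
              while 0 < self.col < len(line) and self._is_continuation(line[self.col]):
                  self.col -= 1

      lines = [bytearray("aé".encode("utf-8")), bytearray("日本x".encode("utf-8"))]
      r = Rider(lines, row=1, col=0)
      r.forward()  # steps over the three bytes of "日"
      ```

      Callers only ever say "forward" or "back"; the byte arithmetic stays inside the rider.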

      A rider reports its position as a Position struct, which is a row/column in the array.

      A Position has several validity flags: valid for insert (which might be, for example, one after the end of the line), valid for delete (points to an actual place within a line), etc. Obviously changing a buffer invalidates positions.
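
      As an illustration only, a Position with two such flags might be sketched like this in Python. The field names and the exact validity rules are assumptions on my part: here a position is insertable if it sits on a character boundary (including one past the end of the line) and deletable if there is also a character under it.

      ```python
      from dataclasses import dataclass

      @dataclass(frozen=True)
      class Position:
          row: int
          col: int          # byte offset within the line
          can_insert: bool  # hypothetical flag: text may be inserted here
          can_delete: bool  # hypothetical flag: a character starts here

      def position_at(lines, row, col):
          """Classify (row, col) against the current buffer contents."""
          in_row = 0 <= row < len(lines)
          length = len(lines[row]) if in_row else 0
          # Assumed rule: col must be in range and not inside a multi-byte character.
          on_boundary = (
              in_row
              and 0 <= col <= length
              and (col == length or (lines[row][col] & 0xC0) != 0x80)
          )
          return Position(row, col,
                          can_insert=on_boundary,
                          can_delete=on_boundary and col < length)

      lines = [bytearray("aé".encode("utf-8"))]  # 3 bytes: "a", then 2 for "é"
      end = position_at(lines, 0, 3)             # one past the end of the line
      ```

      Because the flags are computed from the buffer as it was at that moment, any edit means the position has to be recomputed rather than reused.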

      Anyway, it’s always neat to see how it’s done. I love text editors.