  2. 9

    Dare I say and ask: How many spaces is a tab? GCC seems to say “8”, for some reason. [emphasis mine]

    It never fails to astound me how a once universally understood truth was basically erased within a few scant years. GCC treats tabs as eight spaces because for decades that’s what they were. While most terminals let you redefine the tab width, the default was always eight, and people generally never changed the setting, because if they did, other people’s text files wouldn’t display correctly.

    You’ll note that many early C programmers indented their code to eight spaces. Using tabs for the indent level saved precious disk space (and even parsing time, on larger projects). But when programmers started favoring 4-space indents instead, they didn’t change the tab settings on their terminals! They just used spaces to indent on the first level. (But they would often still use a tab to indent at the second level. Why waste disk space, after all.)

    But somewhere in the 1980s-1990s (in my personal experience, anyway), people started conflating indent level with tab width. I started meeting more and more programmers who did not understand that these were two completely unrelated concepts. Often they were using GUI editors, and I thought they were just confusing tab characters with the Tab key, but then they would complain about my code looking wrong in their editors, and I realized that their editors were actually interpreting tab characters as having a width of four.

    I want to lay the blame for this at the feet of MSVC, but that’s probably just my bias. Likely there were a number of factors that led to this generational forgetting. I have grudgingly accepted that tabs are just cursed characters now, and stopped using them in my code. But I continue to be surprised by how few people even seem to know that this happened.

    1. 7

      The tab character means ‘indent to the next tabulator’. Where you place the tabulators is entirely environment dependent. Tabulators predate computers; they were present on typewriters. A typewriter carriage is pulled to the left by a spring, and when you reached the end of a line you typically had to push the carriage back to the right by hand to start the next one. Teletypes added control codes for this, so that the sequence of carriage return and line feed would move the carriage back to the leftmost column and advance to the next line. These are separate ASCII characters because they correspond to separate mechanical operations on a teletype.

      Tabs, in a typewriter, were usually implemented as sliders (tab stops) attached to the carriage. When you pressed the tab key, the carriage would be raised slightly and would slide (pulled by the spring) until it hit a tab stop. The same implementation was used by most teletypes.

      Not all typewriters were fixed-width. It requires a more complex design, but the amount the carriage moved to the left when you released a key could be different for each key, allowing proportional fonts to be written. Irrespective of whether a typewriter implemented this, the locations of the tab stops were not normally required to be an integer multiple of the character width; they were typically analogue things (sliders that could be clamped to the carriage at any distance).

      The idea that a tab is eight spaces has been true for only a very small slice of time and space:

      • Typewriters? Not true.
      • Teletypes? Not always true.
      • Early virtual teletypes (terminals)? Probably true (though I’ve not actually found any terminal that didn’t let you configure the tab width; 8 was the default, but just because you never changed the default doesn’t mean it couldn’t be changed).
      • Later terminals? Configurable; POSIX provides the tabs utility to control it.
      • Typesetters (manual or computerised)? Not true.
      • DTP / word processing software? Not true.
      • Most code editors (vi or newer)? Not true.

      Using tabs for indentation (i.e. the thing that the tab character was invented for, back when typewriters were new and exciting) makes it possible for the reader to control the indent width. Using a mixture of tabs and spaces makes this hard. As of the most recent versions, clang-format now supports an indent mode where tabs are used for indentation, spaces for alignment, so the code is displayed correctly irrespective of the consumer’s tab width.
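
      For the curious, a minimal sketch of that mode as a .clang-format file (assuming clang-format 11 or later, where the AlignWithSpaces value of UseTab was added; the widths are illustrative):

        # .clang-format: tabs for indentation, spaces for alignment
        UseTab: AlignWithSpaces
        IndentWidth: 4   # one indent level
        TabWidth: 4      # width clang-format assumes when it emits a tab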

    2. 7

      If the compiler use case is targeting a terminal, an option might be to use ANSI escapes to temporarily switch the output to a transparent foreground, draw the starting text of the previous line up to the error column, then revert to the default foreground to print the error. Super cheating, however!
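
      A sketch of that trick in C, approximating the “transparent foreground” with the ECMA-48 conceal attribute (SGR 8; conceal support varies between terminal emulators, and the column here is a plain byte offset, with all the caveats discussed below):

        #include <stdio.h>

        /* Re-print the prefix of `line` concealed (SGR 8) so tabs and
         * wide characters occupy the same cells as on the line above,
         * then print a visible caret under the error column. */
        static void underline_at(const char *line, int col) {
            printf("%s\n", line);
            printf("\033[8m%.*s\033[28m^\n", col, line); /* 28 = reveal */
        }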

      1. 6

        For machines, I have a relatively strong preference for using UTF8 offsets as coordinates. That way, only implementations using legacy encodings internally need to do translation. With code points, everyone needs to translate to a coordinate space which is not otherwise useful.

        1. 3

          I really dislike this design point for a couple of reasons:

          • If you’re targeting CJK languages, UTF-16 is a more space-efficient encoding, so assuming everything is UTF-8 is an ethnocentric assumption.
          • UTF-8 offsets can be in the middle of a code point, which can in turn be in the middle of a grapheme cluster. The first of these is definitely an invalid value; the second is probably an unhelpful one. Well-designed APIs should make it impossible, by construction, to express invalid values. In CJK text, over half of the possible representable values in this API are invalid.
          1. 1

            If you’re targeting CJK languages, UTF-16 is a more space-efficient encoding, so assuming everything is UTF-8 is an ethnocentric assumption.

            How much plain text do you intend to transfer?

            Say an (English) novel is 100k words, and each word is 7 characters (including spaces and punctuation; still an overshoot); then the whole thing is 0.7 MB.

            Say a CJK character requires 50% more space (2 bytes vs 3), and assume the same number of characters per word (which is not true). Then you go from 0.7 MB to just over 1 MB.

            That’s a relatively small change, all things considered. And, most likely, you’re not transferring or storing novels, but much smaller quantities of text.

            How much of the space on your hard drive is dedicated to text, vs binaries, images, videos, etc.? Would you notice if it doubled in size?

            1. 1

              If space isn’t a concern at all, then use UTF-32 and sidestep the problem entirely: indexing by code point becomes trivial, since every code point is exactly 4 bytes.

              My disk is large enough that, unless I’m doing something like processing the whole of Wikipedia, the overhead of UTF-32 is not that high. My L1 cache, in contrast, is smaller than a typical preprocessed C/C++ source file even in ASCII, and the additional cache misses from a less dense encoding are quite measurable on anything involving random access.

            2. 1

              UTF-8 offsets can be in the middle of a code point […] make it impossible, by construction, to express invalid values

              I sympathize with this position. Really, I do. I like to think that I am fairly good at designing C APIs that are difficult to misuse (a difficult task, considering the language does not assist you, and it is definitely harder than in other languages); certainly, I think about the problem a lot.

              However, I find the argument uncompelling in this case.

              First: the range problem is definitely still there, in that even if you measure grapheme clusters, you can have an index which is too large. If you lack unsigned numbers, you can also have an index which is too small.

              Second: performance should also be considered, alongside other factors; seeking to a particular byte offset is likely to be faster than seeking to a given grapheme.

              Third: what sorts of bugs or misuses should we expect to see? In the context of a text editor that attempts to be performant, it will likely try to cache a lot of things, so I expect a common bug category will be seeking to a stale offset in a fresh buffer. If offsets are measured in graphemes, the offset will simply be wrong, and the user may not notice. If offsets are measured in bytes, then the application can check whether the offset ended up in the middle of a code point or grapheme; if this check fails, the application will be able to self-diagnose the problem (and e.g. tell the user to submit a bug report, or …)
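
              A sketch of that self-check in C (assuming well-formed UTF-8, where continuation bytes always have the bit pattern 10xxxxxx):

                #include <stdbool.h>
                #include <stddef.h>

                /* Returns true if `offset` points at the start of a UTF-8
                 * code point (or at end-of-buffer), i.e. a cached offset
                 * is still plausible for this buffer. */
                static bool is_codepoint_boundary(const unsigned char *buf,
                                                  size_t len, size_t offset) {
                    if (offset > len)
                        return false;  /* stale offset past the new buffer */
                    if (offset == len)
                        return true;
                    return (buf[offset] & 0xC0) != 0x80;  /* not a continuation byte */
                }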

          2. 1

            My first thought for portable, human-readable console underlines (what GCC uses non-portable virtual columns for) was to suggest grouping characters by width. That is, counting that you have one tabstop, two emoji-width characters, five regular monospace characters, three double-width characters, and six East Asian script characters. That way, you could output a number of spaces for each character width, and have the spacing work across fonts.

            However, as I found out, there is basically no solution for emoji-width spaces in Unicode right now. So as soon as emoji are in the picture, accurately rendering terminal underlines becomes a fool’s errand.

            Probably the only good solution would be to punt the problem to the terminal viewer. For instance, one could use a zero-width elastic tab after every character, or just render everything in a CSS grid or table.

            Either way, you have to give up on the idea of everything being the same width, or even a multiple of a given width. So I can understand why someone might restrict source code to ISO characters—this would help ensure that tools like vim and gcc can do their work.

            It looks like David Chisnall had a similar suggestion using wcwidth, which apparently can be somewhat fickle.

            1. 1

              This is generally solved with a ‘wcwidth’-like solution. The function itself is in POSIX, but IIRC its results are crap. There are libraries that do it better with updated Unicode tables, but you can even implement it yourself by downloading the Unicode width tables.

              1. 5

                wcwidth() returns the width (in character cells) of a “wide character” (a character in libc’s internal wide encoding, which is almost always UTF-32 or UTF-16). Usually this is 1 (for regular characters), 2 (for wide characters like emoji), 0 (for combining characters), or -1 for non-printable characters. This data mostly comes from the “East Asian Width” property in the Unicode database.

                However, it’s not that simple:

                • not every character in the Unicode database has an East Asian Width property; for the ones that don’t, you need your own table of widths
                • Some characters have an East Asian Width of “ambiguous”, which means they could be width-1 or width-2 depending on display context. If your terminal emulator has a setting that makes “ambiguous width” characters narrow or wide, this is it — and there’s no standard way for a terminal app to query the state of this option
                • Even in Latin-1 there are quirks like U+00AD SOFT HYPHEN, a hyphen that is only displayed when followed by a line break, making its width either 0 or 1 depending on context. You can pick either one, but is your implementation going to match every other home-grown implementation in every other terminal app/emulator?
                • If you decide on a width for every existing Unicode code point, some emoji are defined as ligatures. For example, the 🐻‍❄️ emoji is represented by the sequence 🐻 + ZWJ + ❄ + VS16, where “ZWJ” is the special Zero Width Joiner code point and VS16 (Variation Selector-16) makes the preceding character render as an emoji. 🐻 has a width of 2, ZWJ has a width of 0, ❄ has a width of 1, and VS16 has a width of 0 or 1 depending on how hacky you want to be, so wcwidth() will report that 🐻‍❄️ has a total width of 3 or 4… but its actual width is 2.
                • wcwidth() depends on libc’s Unicode tables being up-to-date, which is fine on Linux but a losing battle on macOS. Apple insists on keeping their libc pinned to, like, 2005 for backwards compatibility even as their GUI APIs are kept up-to-date, guaranteeing that terminal-based apps using wcwidth() will be out of sync with GUI-based terminal emulators.

                In short, wcwidth() is terrible and the only alternative is (apparently) for every app and terminal emulator to embed their own idiosyncratic alternative implementation. sigh
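
                For what it’s worth, a sketch of the usual wcwidth() loop in C (assuming a UTF-8 locale; it inherits every caveat listed above):

                  #include <locale.h>
                  #include <stdio.h>
                  #include <stdlib.h>
                  #include <wchar.h>

                  /* Sum wcwidth() over a string's code points; returns -1 on
                   * a decode error or a non-printable character. */
                  static int display_width(const char *s) {
                      mbstate_t st = {0};
                      wchar_t wc;
                      int total = 0;
                      for (;;) {
                          size_t n = mbrtowc(&wc, s, MB_CUR_MAX, &st);
                          if (n == 0)
                              return total;  /* hit the terminating NUL */
                          if (n == (size_t)-1 || n == (size_t)-2)
                              return -1;     /* invalid or truncated sequence */
                          int w = wcwidth(wc);
                          if (w < 0)
                              return -1;     /* control/non-printable character */
                          total += w;
                          s += n;
                      }
                  }

                  int main(void) {
                      setlocale(LC_ALL, "");  /* use the environment's locale */
                      printf("%d\n", display_width("日本語"));  /* 6 in a UTF-8 locale */
                  }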

                1. 2

                  Curious. The font I’m currently using displays 🐻‍❄️ as a bear and a snowflake, and not (I assume) a polar bear. What I hate most about Unicode is the constant changing of the rules year by year.

                  1. 3

                    The constant growth of Unicode can be frustrating, but the Consortium does try to minimise the disruption. For example, “polar bear” is encoded as a combination of two existing glyphs that older systems can render, rather than a completely new glyph that would just appear as broken. Likewise the snowflake emoji, which is the traditional snowflake glyph with the VS16 modifier rather than a completely new glyph.

                  2. 1

                    Can’t you use the GUI APIs from a non-GUI context?

                    1. 2

                      Terminal-based apps want to know how many character cells a string will take in general; GUI APIs can report how many pixels a string will take in a specific font. That’s fine if you’re a terminal emulator, since you have a specific font in mind and you’re working in pixels anyway, but for terminal-based apps it doesn’t help.

                  3. 4

                    I believe that should work for everything except tabs. How tabs are rendered is much more fun, because the expected meaning of the character is ‘align to the next tabulator’. Typically, terminals place tabulators at 8-character intervals and handle this properly (from the article, vim places them at 4-character intervals and does it properly), but a few things just expand tabs to sequences of spaces. Doing it properly means that the rendered width of “ \t” and “    \t” will be the same if they occur at the start of a line, but may differ if they occur somewhere else.

                    For ASCII input, you can replace every non-tab character in a string with a space and end up with something that, if you print it under the old string, will have the same width. For Unicode, you probably need to replace every non-tab grapheme cluster with wcwidth()-many spaces, but you have to do it with grapheme clusters, not code units, or you’ll be completely broken by combining diacritics. You might be able to just strip combining diacritics and then apply wcwidth() to the remaining characters.
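
                    For instance, a sketch of the ASCII version in C (tabs are kept so they re-align exactly as in the original line):

                      #include <stddef.h>

                      /* Fill `out` with an underline prefix for the first `col`
                       * bytes of `line`: keep tabs, turn everything else into a
                       * space. ASCII only; `out` must hold col + 1 bytes. */
                      static void make_prefix(const char *line, size_t col, char *out) {
                          for (size_t i = 0; i < col; i++)
                              out[i] = (line[i] == '\t') ? '\t' : ' ';
                          out[col] = '\0';
                      }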

                    I think the real take-home from this is that ‘column number’ is not the right API to expose for allowing people to do the underlining thing. Any consumer of the API needs to parse the string anyway, so the right API should probably return the line number and the substring up to the error as a string view.

                    1. 1

                      vim places them at 4-character intervals and does it properly

                      fyi, ‘:help tabstop’ to see the setting that does this.

                      Tabs are of course troublesome, but as a general approach this works well.

                      1. 2

                        Here’s another summary that explains the other three settings you might want to set too. Vim’s help touches upon them, but this helped it click better for me.
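
                        For reference, a sketch of how those settings (presumably shiftwidth, softtabstop, and expandtab) are typically combined; the values are illustrative, not Vim’s defaults:

                          " display width of a real tab character
                          set tabstop=8
                          " indent used by >>, <<, and autoindent
                          set shiftwidth=4
                          " how much indentation the Tab key inserts while editing
                          set softtabstop=4
                          " insert spaces instead of tab characters
                          set expandtab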

                  4. 1

                    If the output is a TTY, it’d be better to insert ANSI escapes into the line before and after the error range to highlight it. Why bother with fake underlining when you can show a real underline?
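
                    A sketch of that in C (SGR 4 turns underlining on and SGR 24 turns it off; both are standard ECMA-48, and start/end are byte offsets into the line):

                      #include <stdio.h>

                      /* Print `line`, underlining the bytes in [start, end). */
                      static void print_underlined(const char *line, int start, int end) {
                          printf("%.*s\033[4m%.*s\033[24m%s\n",
                                 start, line,                /* before the range */
                                 end - start, line + start,  /* underlined range */
                                 line + end);                /* after the range  */
                      }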

                    This whole “how to align text to something on the previous line” issue reminds me of an older typesetting system I used in the ’80s, where you could put markup in a line that would set a temporary tab stop at that x position, and then the next line would indent to that tab stop. (Or maybe it set a hanging indent there? Anyway, it would have made this trivial.)