1. 20
    1. 8

      As I recall, POSIX didn’t deliberately choose; it independently added two constraints in different parts of the spec. One implicitly required char to be at least 8 bits, the other required it to be at most 8 bits. Only later did it realise this and make it an explicit requirement.

      I fully support this. LLVM put a lot of effort into supporting things with other byte sizes and, in spite of that, it doesn’t really work. Most recently someone came along with a word-addressable architecture with 16-bit words and the advice was to use pointers that point to bytes and introduce masking on unaligned loads and stores in the back end (LLVM is pretty good now at preserving alignment, so if your front end knows something is 2-byte aligned and the load or store width is a multiple of two bytes, this is all fine).
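
      For illustration, here’s a minimal C sketch of the kind of lowering described above, assuming a hypothetical word-addressable target with 16-bit words (the memory model and names are made up): a store of a single 8-bit byte becomes a masked read-modify-write of the containing word, while aligned, word-sized accesses need no masking at all.

      ```c
      #include <stdint.h>

      /* Hypothetical word-addressed memory: the smallest unit the hardware
       * can address is one 16-bit word. */
      static uint16_t memory[1024];

      /* Storing a single byte becomes a read-modify-write with masking. */
      void store_byte(uint32_t byte_addr, uint8_t value) {
          uint32_t word_addr = byte_addr >> 1;       /* word holding the byte  */
          unsigned shift     = (byte_addr & 1u) * 8; /* which half of the word */
          uint16_t mask      = (uint16_t)(0xFFu << shift);
          memory[word_addr]  = (uint16_t)((memory[word_addr] & ~mask)
                                          | ((uint16_t)value << shift));
      }

      /* Loading a single byte is a word load plus a shift and mask. */
      uint8_t load_byte(uint32_t byte_addr) {
          return (uint8_t)(memory[byte_addr >> 1] >> ((byte_addr & 1u) * 8));
      }
      ```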

      There are still some IBM architectures with 36-bit words, I believe, but I think this is mostly a backwards-compatibility mode on newer 64-bit systems.

      1. 3

        AFAIK IBM has two and a half architectures: z/Architecture (S/390), POWER, and AS/400. AS/400 was originally 48-bit but has run on POWER since the mid-1990s.

        The remaining 36-bit systems are emulated on x86 systems.

        The DEC PDP-10 series didn’t survive beyond the 1990s.

        There’s also still

        1. 1

          I recall, years (decades now?) ago, in the early days of “you can compile C to GPUs”, that many (all?) of the integral types were still just floats - e.g. fractional values were technically possible (I never looked into whether they actually did truncation or anything on operations), overflow was not two’s complement, etc.

        2. 3

          If all systems that the toolchain developers use have 8-bit bytes then we can be confident that any attempts to implement support for non-8-bit bytes will be wrong. Any code paths that haven’t been tested are surely going to have bugs.

          1. 1

            This is such an excellent argument that it should have been in the P3477R0 paper’s list of justifications, above “we look silly”.

          2. 2

            All of this makes me wonder what programming on a non-8-bit-byte platform is like. For example, if your byte is 16 or 36 bits: how do you represent character strings? Does it mean you cannot access characters individually, but only in groups? Or are you forced to use something like UTF-16 (or UTF-36)?

            And other questions: is our choice of 8-bit bytes purely cultural? Did we choose it because most hardware uses it, and because it is a fair trade-off between finely-granular access to data on one hand and complexity and address lines on the other? Was the 8-bit byte just a hack because we had 7-bit bytes before, but realised that there is value in adding another bit to represent language-dependent code pages (which turned out not to be sufficient, just like 16-bit Unicode)?

            I would say that it makes sense to have programming languages that align with the real world. But at the same time, it is interesting to think about a world where we made other choices.

            1. 1

              > is our choice of 8-bit bytes purely cultural?

              No, there’s a sound engineering reason.

              Sometimes it is necessary to address individual bits. This is trivial to do if the number of bits in an addressable unit is a power of two: the bit index is the bottom N bits and the address is the remaining upper bits. If it isn’t a power of two then bit addressing becomes horribly inefficient.
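
              To make that concrete, a small sketch (assuming 8-bit addressable units; the function is just for illustration):

              ```c
              #include <stdint.h>

              /* With a power-of-two unit size (8 bits here), splitting a bit
               * address is just a shift and a mask; with, say, 9-bit units it
               * would need a divide and a modulo instead. */
              static inline int test_bit(const uint8_t *base, uint32_t bit_addr) {
                  uint32_t byte = bit_addr >> 3;  /* upper bits select the byte   */
                  uint32_t bit  = bit_addr & 7u;  /* bottom 3 bits select the bit */
                  return (base[byte] >> bit) & 1;
              }
              ```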

              So, the question is how large to make an addressable unit. 4 bits? 8 bits? 16 bits? The choice was made in the 1960s so it was determined by the needs at that time, but it’s hard to say the alternatives would have been better.

              16 bits or more is too large: business data processing wanted to work on text, so one character per address is ideal. We have bigger character sets now, and it seems the best way to represent large character sets is with a variable-length encoding, rather than a bigger code unit.

              4 bits is good for BCD but not much else. 1 bit just makes everything more complicated for little benefit.

              8 bits is big enough to have an adequate character repertoire for protocol stuff like headers or markup, with enough slack to make variable-length encoding tolerably efficient.
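
              As a rough illustration of the variable-length-encoding point, here is plain UTF-8 sketched in C (the surrogate-range check is omitted for brevity): one 8-bit code unit covers ASCII, and the slack in the byte is where the length and continuation markers live.

              ```c
              #include <stddef.h>
              #include <stdint.h>

              /* Encode a Unicode code point as UTF-8; returns the number of
               * bytes written (0 if the code point is out of range). ASCII
               * costs one byte, everything else two to four. */
              static size_t utf8_encode(uint32_t cp, uint8_t out[4]) {
                  if (cp < 0x80) {                 /* 0xxxxxxx */
                      out[0] = (uint8_t)cp;
                      return 1;
                  } else if (cp < 0x800) {         /* 110xxxxx 10xxxxxx */
                      out[0] = (uint8_t)(0xC0 | (cp >> 6));
                      out[1] = (uint8_t)(0x80 | (cp & 0x3F));
                      return 2;
                  } else if (cp < 0x10000) {       /* 1110xxxx 10xxxxxx 10xxxxxx */
                      out[0] = (uint8_t)(0xE0 | (cp >> 12));
                      out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
                      out[2] = (uint8_t)(0x80 | (cp & 0x3F));
                      return 3;
                  } else if (cp < 0x110000) {      /* 11110xxx + 3 continuation bytes */
                      out[0] = (uint8_t)(0xF0 | (cp >> 18));
                      out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
                      out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
                      out[3] = (uint8_t)(0x80 | (cp & 0x3F));
                      return 4;
                  }
                  return 0;
              }
              ```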

              1. 1

                36-bit architectures usually used 6-bit character types, either manually shifted or with instructions supporting individual access.
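
                Something like this, simulating the manual shifting with the 36-bit word held in the low bits of a uint64_t (six 6-bit characters per word, first character in the high bits, as in PDP-10 SIXBIT; the names are made up):

                ```c
                #include <stdint.h>

                /* Read the i-th 6-bit character (i = 0..5) out of a 36-bit word. */
                static unsigned get_char6(uint64_t word36, int i) {
                    int shift = (5 - i) * 6;
                    return (unsigned)((word36 >> shift) & 077);
                }

                /* Replace the i-th 6-bit character and return the new word. */
                static uint64_t put_char6(uint64_t word36, int i, unsigned c) {
                    int shift = (5 - i) * 6;
                    uint64_t mask = (uint64_t)077 << shift;
                    return (word36 & ~mask) | ((uint64_t)(c & 077) << shift);
                }
                ```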

                Another common domain using different sizes was DSP chips, where character strings are not really a thing you expect to deal with (but 12, 20, or 24 bits were good sizes to do math on).

                1. 2

                  It varied! In the late 1950s / early 1960s, the US DOD FIELDATA 6-bit character set was a common choice on 36-bit machines. In the late 1960s, 7-bit ASCII fit OK (five characters per 36-bit word, with only 1 wasted bit). By the 1970s, 9-bit character sets were common on PDP-10 machines running LISP (think Space Cadet keyboard).

              2. 1
                1. 4

                  The point of the title is to assert that C’s byte shall have exactly 8 bits instead of at least 8 bits.

                  POSIX defines byte as

                  > An individually addressable unit of data storage that is exactly an octet, used to store a character or a portion of a character; see also Character. A byte is composed of a contiguous sequence of 8 bits.
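
                  For contrast, the C standard itself only guarantees CHAR_BIT >= 8, which is why code that depends on exactly 8 tends to pin the assumption down explicitly, e.g.:

                  ```c
                  #include <limits.h>

                  /* C only requires CHAR_BIT >= 8; POSIX (and the proposal)
                   * require exactly 8. Code that assumes octets can assert it. */
                  _Static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
                  ```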