1. 38
  1. 27

    There is one important reason for using big endian that the article leaves out: when specifying a binary protocol or file format, using big endian makes it much more likely that others implement it correctly. Most systems are little endian nowadays, so it’s likely that people won’t bother with proper endianness conversion during implementation.

    I can also confirm this from personal experience: I explicitly chose big-endian integer fields for my image file format farbfeld and got numerous complaints from people asking why I chose big over little endian. They argued that, given most systems are little endian anyway, such a conversion is wasteful (I’d argue this cost is negligible). After further discussion, I found out that they were mapping files directly to memory. Had I specified little-endian values, the code would have worked as-is on most systems even with a small oversight, only to fail drastically on big-endian machines.

    I think this is an interesting aspect to keep in mind.
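
    To make the portability point concrete, here is a minimal sketch of the kind of conversion being skipped (the function name is illustrative, not from the farbfeld sources): reading a 32-bit big-endian field byte by byte, which behaves identically on little- and big-endian hosts:

    ```cpp
    #include <cstdint>

    // Portable read of a 32-bit big-endian value from a byte buffer.
    // Works the same on any host because it never reinterprets the
    // buffer as a native uint32_t (as an mmap-and-cast approach would).
    static uint32_t read_be32(const unsigned char *p) {
        return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
               (uint32_t(p[2]) << 8)  |  uint32_t(p[3]);
    }
    ```

    A naive mmap-and-cast implementation skips this step entirely and only appears to work because the host happens to match the file’s byte order.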

    1. 13

      That’s perverse as heck and I love it. XD “Why did you do it this weird way, you could make it less work” “The work would always be there, you just don’t get to ignore it now” is an approach I have deep sympathy with.

      1. 1

        Exactly, it’s all about paying off debt now rather than down the road.

        1. 2

          Yeah. I wonder if there’s a way to do this by having some kind of conformance test suite rather than booby-trapping the specification? Though for a file format the conformance suite usually consists of a bunch of somewhat-degenerate example files with expected output they should produce, and if you want people to actually use it to vet their file reader/writer implementations you’ll have a hard time saying “run this on a big-endian VM of some kind”…

      2. 8

        It’s a shame that this is necessary because it’s a file format design decision to work around problems in specific languages. If I were parsing your file in Erlang, for example, then I’d do binary pattern matching which takes an explicit endian in the pattern description. My code would be trivially portable. Doing the same in C would require a load of manual byte swapping. Doing the same in C++ would probably make me just lift the endian wrappers from LLD. Even if I mmap the file, as long as I declared the field as the type of the correct endian, it would be fine. Similarly with Rust, I could use a fixed-layout structure containing a type that does byte swapping on access to define the header fields and still have the right thing happen.

        Things like this make me realise that C has probably been making the world a worse place for long enough.

        1. 4

          POSIX offers the ntoh*() and hton*() functions for that purpose. It’s not that difficult.
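
          A hedged sketch of that approach (the field layout here is made up for illustration): copy the raw bytes out, then let ntohl() swap them, which is a no-op on big-endian hosts:

          ```cpp
          #include <arpa/inet.h>  // ntohl() (POSIX)
          #include <cstdint>
          #include <cstring>

          // Read one big-endian ("network order") 32-bit field from a buffer.
          uint32_t read_length(const unsigned char *buf) {
              uint32_t raw;
              std::memcpy(&raw, buf, sizeof raw);  // avoids unaligned access
              return ntohl(raw);  // swaps on little-endian hosts, no-op on big-endian
          }
          ```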

          1. 2

            It’s still tedious, and if something’s tedious versus an “easy” way that will blow your foot off in a slightly different situation, C programmers will pick the latter.

            1. 1

              What could be simpler than a single function call?

              1. 2

                Not having to do anything. In C++, for example, you’d declare the field as something like a BigEndian<uint32_t>. The BigEndian<T> template would have an implicit cast-to-T operator containing an if constexpr that either did byte swapping or didn’t, based on the host platform. LittleEndian<T> would be similarly implemented (or, more commonly, both would just be using directives that expanded to T or ByteSwapped<T> depending on the current architecture). You declare a struct representing your data with these types and then never think about it again. In contrast, the C approach requires a function call for every field that you have to remember everywhere, or requires you to do a single byte-swap-everything pass to get it right.

                Similarly, in Erlang, you just write the /big or /little modifier in your pattern matches and everything falls out. C is the most verbose and error prone language that I’ve used for this kind of thing. Last time I had to write an implementation of a protocol that used a big-endian wire format, the byte-swapping bits of the C implementation were larger than the total Erlang implementation.
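
                A rough sketch of such a wrapper, assuming C++20 for std::endian (the names are illustrative, not lifted from LLD):

                ```cpp
                #include <bit>      // std::endian (C++20)
                #include <cstddef>
                #include <cstdint>

                // Wrapper holding a value in big-endian byte order; converting
                // to T byte-swaps only when the host is little endian.
                template <typename T>
                struct BigEndian {
                    T raw;  // bytes stored exactly as they appear in the file
                    operator T() const {
                        if constexpr (std::endian::native == std::endian::big) {
                            return raw;  // host order already matches
                        } else {
                            T v = raw, out = 0;
                            for (std::size_t i = 0; i < sizeof(T); ++i) {
                                out = T(out << 8) | (v & 0xFF);  // reverse bytes
                                v = T(v >> 8);
                            }
                            return out;
                        }
                    }
                };

                // Declare the header once with these types and never think
                // about byte order at the use sites again.
                struct Header {
                    BigEndian<uint32_t> width;
                    BigEndian<uint32_t> height;
                };
                ```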

                1. 1

                  Referencing a field or array member directly.

          2. 8

            In JSON APIs, I sometimes will deliberately use “kebab-case-keys” to force the clients to write a proper validator on the other end instead of doing response.camelCaseKey and just hoping it’s valid. :-)

            1. 3

              Thanks for the explanation! I was like /o\ when I saw that farbfeld uses big endian, and I still think it’s the wrong decision. But I can see the reasoning behind your decision, and it’s less of a “why?!” to me now and more of an “okay, seems like we value different things” 👍

              farbfeld is a cool project, and I should add support for it to zigimg for wider adoption

              1. 2

                Completely agree. I’m working a lot with parsing big endian data right now, and had it been little endian I wouldn’t have given nearly as much thought to my implementation. Now my code will work properly regardless of the endianness.

                1. 2

                  I experienced the opposite at work. There’s a project that started out life on a big-endian system (SPARC, for the curious), and it was my work that made it possible to use a little-endian system. It made the project more portable, and when we finally move off SPARC to Linux, it should just work [1].

                  [1] It will. We actually develop and test now on Linux (cheaper than getting everyone a SPARCstation).

                2. 3

                  This is a very good article: well written, and it lists pros and cons without bias. IMHO people should stick to little endian for new projects, as it’s likely to have less overall overhead. But, as FRIGN pointed out, people need to care about endianness in the first place and not just mmap stuff into their process (it’s almost never a good idea to do that).

                  It also makes clear why a lot of network hardware still uses MIPS processors: they can handle big endian natively, and the standard network protocols use big endian (network byte order).

                  1. 2

                    It would be interesting to see an adversarial collaboration https://www.lesswrong.com/tag/adversarial-collaboration about themes like this, spaces vs tabs (Elastic tabstops https://nickgravgaard.com/elastic-tabstops/ will have their day), IPv4 vs IPv6, etc.

                    1. 2

                      With little endian order, you need to check the last byte. With big endian order, you check the first byte.

                      Is there a technical reason why the sign bit needs to be where it is?

                      1. 7

                        There is no separate sign bit field; integers are stored in two’s complement format, where the most significant bit effectively indicates the sign.
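
                        Concretely: the top bit of the most significant byte carries the sign in two’s complement, and that byte is stored first in big-endian order and last in little-endian order. A small illustrative sketch:

                        ```cpp
                        #include <cstddef>

                        // Check the sign of a two's complement integer from its raw
                        // bytes: the sign bit is the top bit of the most significant
                        // byte, which is stored first in big-endian order and last
                        // in little-endian order.
                        bool is_negative(const unsigned char *bytes, std::size_t n,
                                         bool big_endian) {
                            unsigned char msb = big_endian ? bytes[0] : bytes[n - 1];
                            return (msb & 0x80) != 0;
                        }
                        ```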

                      2. 1

                        The perceived antagonism between ‘host’ and ‘network’ byte order does not allow PDP-11 users to sleep soundly at night.


                        1. 1

                          Unless I’m mistaken, the first example shows bytes whose half-bytes (nibbles) change endianness, despite the fact that a byte usually has no internal endianness.