Neat, I learned about some techniques I didn’t know about.
I’ve spent some time reversing custom/proprietary compression algorithms found in various video games, and usually they’re just LZSS variants that differ in exactly how the data is encoded (how big the literal-or-backref bitsets are and how backrefs are encoded).
Once, however, I came across something quite different: byte pair encoding. The general idea is to replace common pairs of bytes with unused bytes repeatedly to compress, and then store the final compressed data plus a list of substitutions done, and perform those substitutions in reverse to decompress. I thought I’d mention it since it wasn’t included in the linked-to book, but I don’t know how useful it is in practice (one huge drawback is that you can’t stream BPE). LZ-based approaches certainly seem to have won the lossless compression war.
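For the curious, here is a minimal sketch of the idea in Python. It is not the format I ran into in that game; the function names and the in-memory list of substitutions are my own assumptions, and a real format would also need to serialise that list somehow.

```python
from collections import Counter


def bpe_compress(data: bytes):
    """Repeatedly replace the most common byte pair with an unused byte value."""
    buf = list(data)
    substitutions = []  # (replacement_byte, (left, right)), in the order applied
    while True:
        unused = set(range(256)) - set(buf)
        pairs = Counter(zip(buf, buf[1:]))
        if not unused or not pairs:
            break  # no free byte values left, or fewer than two bytes of data
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # replacing a pair that occurs only once saves nothing
        new = min(unused)
        out, i = [], 0
        while i < len(buf):
            if i + 1 < len(buf) and (buf[i], buf[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(buf[i])
                i += 1
        buf = out
        substitutions.append((new, pair))
    return bytes(buf), substitutions


def bpe_decompress(data: bytes, substitutions):
    """Undo the substitutions in reverse order."""
    buf = list(data)
    for new, (left, right) in reversed(substitutions):
        out = []
        for b in buf:
            if b == new:
                out.extend((left, right))
            else:
                out.append(b)
        buf = out
    return bytes(buf)
```

Note that the compressor has to see the whole input before it can pick the pairs and find unused byte values, which is where the streaming problem comes from.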
I found last summer’s Why Google Stores Billions of Lines of Code in a Single Repository to be a much more interesting explanation of how this works out (it’s referenced by trunkbaseddevelopment.com, but the link there appears to be broken). Not only does it show that this is how Google prefers to work and explain some of the perceived benefits, it also shows that the approach scales to really big projects (or sets of projects, even, since Google apparently keeps all of theirs in one repository), which I found quite surprising.
Personally, well, I don’t really have any personal experience with large projects at all, but I can’t help but think this would be problematic for the same reason that centralised systems like Subversion are (in my mind), since it appears to mimic the way Subversion etc was used… but then again, I never really used Subversion, so I’m a bit wary of my view being biased in favour of how DVCSes are typically used here.
A TTY-related fact that wasn’t mentioned in the article:
To control text-formatting on an ANSI-compatible terminal (or emulator), you can send it terminal control sequences like \e[1m to enable bold or whatever. Paper-based terminals didn’t have control-sequences like that, but people still figured out ways to do formatting. For example, if you printed a letter, then sent backspace (Ctrl-H, octet 0x08) and printed the same letter again, it would be printed with twice as much ink, making it look “bold”. If you printed a letter, then sent backspace and an underscore, it would look underlined.
The original Unix typesetting software took full advantage of this trick. If you told it to output a document (say, a manpage) to your terminal (as opposed to the expensive typesetting machine in the corner), it would use the BS trick to approximate the intended formatting.
This worked great, up until the invention of video display terminals, where the backspace trick just replaced the original text, instead of adding to it. So people wrote software to translate the backspace-trick into ANSI control codes… software like less(1).
If you run:
printf 'H\x08He\x08el\x08ll\x08lo\x08o w\x08_o\x08_r\x08_l\x08_d\x08_\n'
…in a modern terminal emulator, you’ll probably get output like:
Hello _____
…because that’s how glass TTYs work. However, if you pipe it through less(1):
printf 'H\x08He\x08el\x08ll\x08lo\x08o w\x08_o\x08_r\x08_l\x08_d\x08_\n' | less
… it will convert the backspace trick into formatting your terminal can understand. Unfortunately, I can’t figure out how to represent it in Markdown, so you’ll have to try it for yourself!
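To make the translation concrete, here is a rough Python sketch of the kind of conversion described above. It is not how less(1) actually implements it; the function name is made up, and wrapping every character in its own escape sequence is a simplification.

```python
BOLD = "\x1b[1m"       # the \e[1m sequence mentioned above
UNDERLINE = "\x1b[4m"
RESET = "\x1b[0m"


def overstrike_to_ansi(text: str) -> str:
    """Turn 'X\\bX' into bold X, and 'X\\b_' or '_\\bX' into underlined X."""
    out = []
    i = 0
    while i < len(text):
        # Look for the pattern: character, backspace, character.
        if i + 2 < len(text) and text[i + 1] == "\b":
            first, second = text[i], text[i + 2]
            if first == second:
                out.append(f"{BOLD}{first}{RESET}")
                i += 3
                continue
            if second == "_":
                out.append(f"{UNDERLINE}{first}{RESET}")
                i += 3
                continue
            if first == "_":
                out.append(f"{UNDERLINE}{second}{RESET}")
                i += 3
                continue
        out.append(text[i])
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(overstrike_to_ansi("H\bHe\bel\bll\blo\bo w\b_o\b_r\b_l\b_d\b_"))
```

The sketch accepts the underscore on either side of the backspace, and treats the ambiguous `_\b_` as a bold underscore.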
Indeed, this is how man(1) formats manual pages. One could try PAGER='cat -v' man ls with GNU cat(1) (nonstandard -v option) to see all the ^H’s in their full glory.
Something else that I felt was missing from the article is DEL: it sits apart from the other control characters, at the far end of the table as value 0x7F. But why? It turns out that, back when paper tapes were still relevant, it was a useful property to have DEL be “all bits set”, because you can always overstrike existing data with more holes, which means you can take an existing tape and DEL out data.
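A tiny Python sketch of why that works, assuming a 7-bit tape where a punched hole is a 1 bit (the helper name is made up for illustration): punching can only add holes, so overpunching is a bitwise OR, and OR-ing anything with all seven holes gives DEL.

```python
DEL = 0x7F  # all seven bits set, i.e. all seven holes punched


def overpunch(existing: int, new_holes: int) -> int:
    """Punching can only add holes to the tape, never remove them: bitwise OR."""
    return existing | new_holes


# Any 7-bit character can be turned into DEL by punching out the remaining holes.
assert all(overpunch(code, DEL) == DEL for code in range(128))
```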
Also the ** power operator! Which means it’s now available in both stable Firefox and Chromium, very handy.
Eek!
This is a good lesson in the risks of wrapping a command-line interface, blindly trusting that its interface isn’t going to change.
Exactly. This is what we should show people when explaining why they shouldn’t just blindly use calls to exec().
See also ul(1)… so common it got one of the two-character command-names.