1. 173

  2. 51

    This is the kind of stuff that makes (some) people hate programming =P

    It is also the sort of thing I’ll likely remember for the rest of my life, unlike other less important information, like bank accounts, government ids, phone numbers, people’s names, etc.

    1. 22

      I don’t understand. Nowadays people don’t remember their ASCII by heart any more?

      1. 24

        I genuinely hope this is a joke, because I’m 35 and have been programming as a career for 12-ish years (including a 1-year paid internship), and I haven’t needed an ASCII chart more than a handful of times.

        1. 12

          Hahaha, I feel you.

          I guess it’s always other people who deal with stuff like vtys and software flow control and fast parsers and keyboard drivers for us.

          OTOH, man ascii. Typed in cool-retro-term for the best effect when teaching.

          I believe it’s nice to start with unescaped in-band formats, then move on to the escaped ones, then fixed layout and finally TLV.

          1. 4

            man ascii

            Did not know that was there. Nice!

            1. 4

              Ha, I just had to check to make sure \044 was what I thought it was when I saw it in a script yesterday.
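For anyone else who wants to double-check an octal escape like that, here is a tiny Python sketch. It only assumes the script meant the usual ASCII interpretation of the byte:

```python
# \044 is an octal escape: 0o44 == 36 decimal == 0x24 hex.
octal_value = 0o44
character = chr(octal_value)

# Show the same value in decimal, hex, and as the ASCII character it names.
print(octal_value, hex(octal_value), repr(character))
```

Python string literals accept the same escape, so "\044" is the same one-character string.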

            2. 8

              Being able to read hex and distinguish ASCII can come in handy in surprising places (especially during debugging).

              1. 1

                Recognizing SHIFT-JIS in hex has saved me days before.

              2. 2

                I’m sure it’s mostly a joke, but surely most developers know at least 1-3 ASCII codes by heart, especially full stack devs. When debugging HTTP-related stuff you see things like %20 a lot.

                1. 1

                  Depends on what you work on. After many years of googling (no, I don’t have it ALL memorized), I started self-hosting one at a URL I remember.

                2. 2

                  Most people can’t read hexdumps anymore, and don’t have to thanks to improvements in tooling in recent years. We should be glad people don’t have to remember ASCII anymore :)

                  1. 2

                    Mine has rotted quite thoroughly out of my head. In the blue moon that I need to check, I just go pull up the table.

                    Most of the character problems I deal with these days involve Unicode instead, or rather, people not understanding Unicode and screwing it up.

                    1. 2

                      The last time I saw an ASCII chart printed in a book, it was published in the 80s (in fact, I think every computer book published in the 80s, at least in the US, had an ASCII chart somewhere in it). By the 90s, I think it was assumed ASCII was a known standard.

                      1. 1

                        I slacked off back in the days and didn’t get around to actually learning it. Then we switched to Unicode and I just gave up.

                        1. 6

                          FWIW, all the ASCII knowledge is still applicable; the first 128 Unicode characters match 7-bit ASCII, and UTF-8 encodes the values below 128 as just plain bytes with those values. So when looking at most text using the Latin alphabet, you can’t tell if you’re looking at ASCII encoded as bytes or Unicode encoded as UTF-8 even when looking at a raw hex dump.
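A quick way to see this for yourself, as a minimal Python sketch (the sample strings are made up for illustration):

```python
# Pure-ASCII text produces identical bytes under both encodings.
text = "Hello, world."
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes.hex(" "))
print(utf8_bytes.hex(" "))

# Every byte stays below 128, so a hex dump cannot tell the two apart.
all_seven_bit = all(b < 0x80 for b in utf8_bytes)

# A non-ASCII character is where UTF-8 starts using bytes >= 0x80.
accented_bytes = "é".encode("utf-8")
print(accented_bytes.hex(" "))
```

The difference only shows up once a character outside the 7-bit range appears, which is when multi-byte sequences enter the dump.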

                      2. 27

                        This is just the type of coffee time reading that I enjoy. A beautiful discovery, must have been very satisfying.

                        1. 10

                          I read this as ‘any character between ‘,’ and ‘.’ in ASCII’, and wondered if the surprise was that ‘.’ comes before ‘,’, so that matching would somehow run from ‘.’ to the last character and also from ASCII 0 to ‘,’.

                          To me then, it does what I expected, but the surprise was that the three characters are at those particular positions. The author’s surprise was different. I think this just adds more evidence that regular expressions are surprising.

                          That said, they are still the best tool for some jobs and I will use them, laid out over multiple lines and commented, and not shared with others, because the worst problem with regular expressions is that you can’t trust that everyone will read them the same, including your future self.

                          1. 3

                            wondered if the surprise was that ‘.’ comes before ‘,’

                            That was it for me; I saw that ‘,’ comes before ‘.’ on the QWERTY and Dvorak keyboard layouts, so I assumed it was that way in ASCII too.
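To make the ordering concrete, here is a small Python sketch that expands the range by code point. The reversed class is only there to show what happens if the order were the other way round; the variable names are mine, not from the article:

```python
import re

# ',' is 0x2C, '-' is 0x2D, '.' is 0x2E, so the range covers exactly three characters.
separators = [chr(cp) for cp in range(ord(","), ord(".") + 1)]
print(separators)

# The pattern from the article accepts each of those separators.
pattern = re.compile(r"\d{2}[,-.]\d{2}")
accepted = [s for s in ("12,34", "12-34", "12.34", "12:34") if pattern.fullmatch(s)]

# Writing the range backwards is rejected by Python's re module.
try:
    re.compile(r"[.-,]")
    reversed_ok = True
except re.error:
    reversed_ok = False
```

So the class works as a range purely because of where the three characters sit in the table, not because of keyboard order.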

                          2. 8

                            Reminds me of some IRC daemons doing case-insensitive comparisons for special characters and therefore treating nicknames like abcde|{} to be the same as ABCDE\[]. It’s indeed a property of the ASCII table and an XOR with 0x20 will flip those characters from one case to the other.
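The XOR trick is easy to try out. This is just an illustrative Python sketch with a made-up helper name, not how any particular IRC daemon does it:

```python
def flip_case_bit(text: str) -> str:
    """Toggle bit 0x20 of every character (hypothetical helper for illustration)."""
    return "".join(chr(ord(ch) ^ 0x20) for ch in text)

lower_nick = "abcde|{}"
upper_nick = flip_case_bit(lower_nick)
print(upper_nick)

# Flipping the bit twice gets you back where you started.
round_trip = flip_case_bit(upper_nick)
```

Note that blindly flipping the bit only makes sense for letters and these bracket-like characters; applied to digits or spaces it produces unrelated characters.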

                            1. 18

                              Back before we had standardised 8-bit character sets with ASCII in the bottom half, we had standardised 7-bit character sets based on ASCII with “lesser used” characters replaced as needed. IRC was invented in Finland, and it turns out the ASCII variant used by Sweden and Finland replaces [\] with ÄÖÅ and {|} with äöå. So the IRC protocol defines those bytes as case-insensitively equal, and conforming implementations must do the same even if you’re using an encoding that treats them as punctuation instead of letters.

                              1. 2

                                Thanks. I didn’t know that!

                                1. 1

                                  I think that applies to most ISO-646 variants.

                                  1. 1

                                    The encoding used by Microsoft in Japan replaced \ with ¥, so Japanese users came to expect paths to look like C:¥Windows¥system32¥ etc.

                                2. 7

                                  Yup, ascii(7) manual page is very useful for this type of info. Also if one needs to %-encode a character, i.e. in a URI, etc.
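If you would rather compute the %-encoding than look it up, here is a short Python sketch. The helper name is invented for illustration and it assumes a single ASCII character as input; the standard library's urllib.parse.quote does the real work in practice:

```python
from urllib.parse import quote, unquote

def percent_encode_char(ch: str) -> str:
    """Percent-encode one ASCII character from its code (hypothetical helper)."""
    return f"%{ord(ch):02X}"

space_encoded = percent_encode_char(" ")
comma_encoded = percent_encode_char(",")
print(space_encoded, comma_encoded)

# The standard library agrees for these characters, and decoding reverses it.
library_encoded = quote(" ,", safe="")
decoded = unquote("%20%2C")
```

The two hex digits are simply the value from the ascii(7) table, which is why the man page doubles as a %-encoding reference.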

                                  1. 9

                                    With Egg expressions, you must quote characters that aren’t alphanumeric, so the difference between literals and operators is clear.

                                    oil$ const pat = / [',' - '.'] /
                                    oil$ = pat
                                    (Regex)   [,-.]
                                    

                                    (The = operator shows the value of an expression, which in the regex case is the ERE it compiles to.)

                                    oil$ const pat = / [ - '.'] /
                                      const pat = / [ - '.'] /
                                                      ^
                                    [ interactive ]:8: Syntax error in expression (near Id.Arith_Minus)
                                    

                                    So if you quoted that, then it would work. The same applies outside character classes – CODE and DATA are distinct! This is a big problem with both shell and regex syntax.

                                    Egg expressions are also statically parsed like code, not dynamically parsed like data.

                                    1. 4

                                      python’s re module has a re.VERBOSE flag which ignores (most) whitespace and allows comments. and python strings (and regexes) allow for ‘named characters’ using the \N{...} syntax.

                                      combining the two can massively help regex writing/maintenance. the \d{2}[,-.]\d{2} example from the article can be written like this, for example:

                                      import re
                                      pattern = re.compile(
                                          r"""
                                          \d{2}
                                          [\N{comma}\N{hyphen-minus}\N{full stop}]  # separator
                                          \d{2}
                                          """,
                                          re.VERBOSE
                                      )
                                      print(pattern.fullmatch('12.34'))
                                      # <re.Match object; span=(0, 5), match='12.34'>
                                      

                                      whitespace is not ignored inside character classes [...] so these three cannot be split over multiple lines. a (non-capturing) group with parentheses can be used instead though:

                                      pattern = re.compile(
                                          r"""
                                          \d{2}
                                          (?:  # separator
                                              \N{comma}
                                              | \N{hyphen-minus}
                                              | \N{full stop}
                                          )
                                          \d{2}
                                          """,
                                          re.VERBOSE
                                      )
                                      

                                      while this may not be the best example since it gets rather verbose for something that started out as something relatively simple, for more complex regular expressions this can be the difference between ‘write once’ and ‘actually maintainable’.
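A rough way to convince yourself that a verbose rewrite still means the same thing is to run both forms over a few inputs. This is a sketch, assuming Python 3.8+ (needed for \N{...} inside re patterns); the sample strings are made up:

```python
import re

# Compact form with the hyphen escaped, so it is a literal rather than a range.
compact = re.compile(r"\d{2}[,\-.]\d{2}")

# Verbose form using a non-capturing group and named characters.
verbose = re.compile(
    r"""
    \d{2}
    (?:  # separator
        \N{comma}
        | \N{hyphen-minus}
        | \N{full stop}
    )
    \d{2}
    """,
    re.VERBOSE,
)

samples = ["12,34", "12-34", "12.34", "12:34", "12 34", "1234"]
results = {s: (bool(compact.fullmatch(s)), bool(verbose.fullmatch(s))) for s in samples}
for sample, (short_form, long_form) in results.items():
    print(f"{sample!r}: compact={short_form} verbose={long_form}")
```

A handful of samples is not a proof of equivalence, but it catches the obvious slips when restructuring a pattern.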

                                      1. 6

                                        In the sport of fencing, there’s an aesthetic concept carried over from its dueling roots known as la belle mort (the beautiful death) wherein two opponents match each other touch for touch with increasingly complex attacks until, tied and exhausted, they are one touch away from the end of the bout. One of them brings the tactical wheel back around to a simple attack and the other, like being knocked over with a feather, fails to anticipate it and loses. In a real duel, it would be considered even more beautiful if both opponents died from their wounds.

                                        This is all very weirdly romanticized and morbid, much in the way that death by Perl regex seems to be from some of the applause here. Somehow the author has tricked themselves into spotting a mistake only to find that their limited knowledge of character order in the Unicode table (and I do believe it’s Unicode, not ASCII) made the mistake one of their own misinterpretation.

                                        At the risk of moralizing, the lesson should not be (in effect): Someone’s regular expression confused me, which is their fault, so I will rewrite it in a way that makes more sense to me and preemptively tell you all to write regular expressions the same way.

                                        Instead, it should be something more like: Humans are too fallible to understand regular expressions reliably without aids. Anyone who implies that there’s anything obvious about regular expression syntax is either lying or deluded. I was deluded. I will therefore use diagram tools when I have to read or write regular expressions (there have been several over the years) and will generally prefer parsers or codecs, which are generally more deliberate and less ambiguous.

                                        1. 3

                                          Unicode table (and I do believe it’s Unicode, not ASCII)

                                          It is both. Unicode is a superset of ascii.

                                          1. 2

                                            It is both. Unicode is a superset of ascii.

                                            It depends. UTF-16 is a unicode encoding and it is not a super-set of ascii. UTF-8 OTOH is.

                                            1. 3

                                              Ascii is both an encoding and a character set. The unicode character set is a superset of the ascii character set; this has no bearing on the relationship between the ascii encoding and the various ways in which unicode can be encoded.

                                          2. 2

                                            Well, Unicode was deliberately designed so that its first 128 code points are identical to those of ASCII, but yes.

                                            And I would agree that the traditional regex syntax is not very intuitive. I’d like to see something like a small DSL for constructing regexes. Instead of /\d{2}[,-.]\d{2}/ from the article, why not the much-easier-to-debug: Regex::new().digit().repeat(2).unicode_range(',', '.').digit().repeat(2).compile();

                                            1. 5

                                              Does that .repeat(2) apply only to the method call immediately prior, or to the entire built regex thus far? The compact notation lacks this ambiguity.
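For comparison, here is how the compact notation settles the scoping question, as a small Python sketch with made-up sample strings: a quantifier binds to the single preceding atom unless you group explicitly.

```python
import re

# {2} applies only to the 'b' immediately before it.
atom_only_abb = re.fullmatch(r"ab{2}", "abb") is not None
atom_only_abab = re.fullmatch(r"ab{2}", "abab") is not None

# Grouping makes the quantifier cover the whole 'ab' sequence.
grouped_abab = re.fullmatch(r"(?:ab){2}", "abab") is not None
grouped_abb = re.fullmatch(r"(?:ab){2}", "abb") is not None

print(atom_only_abb, atom_only_abab, grouped_abab, grouped_abb)
```

A builder API would need some equivalent of the group to express the second case, whatever its method names end up being.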

                                            2. 2

                                              I’m not saying the mistake is in failing to memorize character order in Unicode (or ASCII, the difference being somewhat beside the point). The mistake is in implying that there is some objective level of obviousness in regex character classes or in regex as a whole.

                                              Now that we’ve passed 150 upvotes, I think it’s time to revisit the old saw:

                                              Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

                                              —Jamie Zawinski, paraphrasing a similar saying about sed

                                            3. 3

                                              cursed regex

                                              1. 3

                                                Task failed successfully.

                                                or

                                                You arrived at the right answer by the wrong method.

                                                1. 4

                                                  Reminds me of my fav sentence from a help/man page:

                                                  false - does nothing, unsuccessfully

                                                2. 2

                                                  I never imagined a regex can make developers smile.

                                                  1. 2

                                                    Wouldn’t this be slightly faster in a naive implementation? Two compares instead of three…

                                                    1. 2

                                                      In my experience, character sets are implemented as a 32-byte array (256 bits), with each bit representing a character (0 - not in the set; 1 - in the set). So a set like [A-Z] takes the same time to test as the set [,-.].
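A minimal Python sketch of that kind of bitmap, assuming single-byte characters. The function names are invented for illustration and no specific regex engine is being described:

```python
def make_class_bitmap(chars: str) -> bytearray:
    """Build a 32-byte (256-bit) membership table for byte-sized characters."""
    bitmap = bytearray(32)
    for ch in chars:
        code = ord(ch)
        if code > 0xFF:
            raise ValueError("this sketch only handles single-byte characters")
        bitmap[code >> 3] |= 1 << (code & 7)  # byte index, then bit within the byte
    return bitmap

def in_class(bitmap: bytearray, ch: str) -> bool:
    """Test membership with one index, one shift, and one mask."""
    code = ord(ch)
    return code <= 0xFF and bool(bitmap[code >> 3] & (1 << (code & 7)))

# [,-.] and [A-Z] both become the same fixed-size table.
separator_class = make_class_bitmap(",-.")
upper_class = make_class_bitmap("".join(chr(c) for c in range(ord("A"), ord("Z") + 1)))

print(in_class(separator_class, "-"), in_class(separator_class, ":"))
print(in_class(upper_class, "Q"), in_class(upper_class, "q"))
```

The lookup cost is the same regardless of how many characters are in the set, which is the point being made above.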