1. 30
    1. 22

      Remember that for compatibility reasons Erlang will always use sv_SE.iso88591 encoding when running the re module.

      As a Swede this fills me with patriotic pride.

      1. 5

        And as an “Erikson” I suspect ;)

        But yes, we can at least have this one quirk considering how many of the American quirks we already have to deal with when handling “Swedish” data.

    2. 6

      Spoiler: implementation used <ctype.h>, and was cursed by the C locale.

      1. 17

        It’s more subtle than that. ctype.h includes an isascii, which returns true for things < 128. The FreeBSD libc version actually exposes this as a macro: (((c) & ~0x7F) == 0) (true if all of the bits above the low 7 are true). The regex library was using its own isascii equivalent implemented as isprint() || iscntrl(). These are locale-dependent, and isprint will return true for a lot of non-ASCII characters (most of Unicode, in a Unicode locale). This is not the C locales’ fault; it is incorrect API usage.

        That the C locale APIs are an abomination is only tangentially related.

        1. 2

          true if all of the bits above the low 7 are true

          Nit: it’s true if none of the bits above the low 7 are set.

        2. 2

          (true if all of the bits above the low 7 are true)

          false if any of the bits above the low 7 are true

      2. 3

        To be fair, this is a kind of bug you can run into without needing to hit C locales.

        For example, a few times I’ve had to re-teach people regexes in Python, because the behavior today isn’t what it was once upon a time.

        To take the most common example I personally see (because of the stuff I work on/with), Django used to only support regexes as the way to specify its URL routing. You write a regex that matches the URL you expect, tell Django to map it to a particular view, and any captured groups in the regex become arguments (keyword arguments for named captures, positional otherwise) used to call the view. Now there’s a simpler alternative syntax that covers a lot of common cases, but regexes are still supported when you need the kind of fine-grained/complex matching rules they provide.

        Anyway, suppose you want to build a blog, and you want to have the year, month, and day in the URL. Like /weblog/2021/02/26/post-title. Easy enough to do with regex. Except… most of the examples and tutorials floating around from days of yore are from a Python 2 world, where you could match things like the four-digit year with \d{4}. That only worked because Python 2 was an ASCII world, and \d was equivalent to [0-9]. In Python 3, the world is Unicode, and \d matches anything that Unicode considers to be a digit, which is a larger set than just the ten numerals of ASCII. So every once in a while someone pops up with “why is my regex URL pattern matching this weird stuff” and gets to learn that in a Python 3/Unicode world, if all you really want is [0-9], then [0-9] is what you have to write.

        This seems to have been an instance of the same problem, where the decision was deferred to some other system that might have its own ideas about things like “printable”, instead of just writing it correctly from the start to only match what it intended to match.

        1. 2

          In Python 3, the world is Unicode, and \d matches anything that Unicode considers to be a digit, which is a larger set than just the ten numerals of ASCII

          TIL!

        2. 1

          Have the Python core devs ever articulated what kind of use cases motivate “\d matches anything that Unicode considers to be a digit”?

          1. 3

            UTS#18 gives a set of recommendations for how regex metacharacters should behave, and its “Standard”-level recommendation (“applications should use this definition wherever possible”) is that \d match anything with Unicode general category Nd. This is what Python 3 does by default.

            So I would presume there’s no need to articulate “use cases” for simply following the recommendation of the Unicode standards.

            If you dislike this and want or absolutely need to use \d as a synonym for [0-9], you can explicitly switch the behavior to UTS#18’s “POSIX Compatible” fallback by passing the re.ASCII flag to your Python regex (just as in Python 2 you could opt in to Unicode-recommended behavior by passing re.UNICODE). You can also avoid it altogether by not using str instances; the regex behavior on bytes instances is the ASCII behavior.
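
For anyone who wants to poke at the two behaviors discussed in this thread, here is a rough Python sketch. The helper name is_ascii and the sample strings are made up for illustration. The first part mirrors the FreeBSD isascii macro quoted above, with str.isprintable() standing in (as an assumption, not an equivalent) for a locale-aware isprint(). The second part shows \d{4} on str, with re.ASCII, and on bytes.

```python
import re


def is_ascii(code_point: int) -> bool:
    """Hypothetical helper mirroring (((c) & ~0x7F) == 0):
    true only when no bit above the low 7 is set."""
    return (code_point & ~0x7F) == 0


# The mask test depends only on the numeric value, never on a locale.
assert is_ascii(ord("A"))
assert not is_ascii(ord("\u00e9"))  # e-acute, U+00E9, has bit 7 set

# A "printable or control" style check answers a different question.
# str.isprintable() is only a stand-in for a locale-aware isprint() here.
assert "\u00e9".isprintable() and not is_ascii(ord("\u00e9"))

# Four ARABIC-INDIC DIGIT characters (general category Nd).
non_ascii_year = "\u0662\u0660\u0662\u0661"

# str pattern, default flags: \d follows Unicode, so this matches.
unicode_match = re.fullmatch(r"\d{4}", non_ascii_year)

# re.ASCII narrows \d to [0-9], so the same input no longer matches.
ascii_flag_match = re.fullmatch(r"\d{4}", non_ascii_year, re.ASCII)

# Writing the class out explicitly behaves the same regardless of flags.
explicit_match = re.fullmatch(r"[0-9]{4}", non_ascii_year)

# bytes patterns always use the ASCII meaning of \d.
bytes_match = re.fullmatch(rb"\d{4}", non_ascii_year.encode("utf-8"))

# A plain ASCII year matches under every variant.
plain_ok = all(
    m is not None
    for m in (
        re.fullmatch(r"\d{4}", "2021"),
        re.fullmatch(r"\d{4}", "2021", re.ASCII),
        re.fullmatch(r"[0-9]{4}", "2021"),
        re.fullmatch(rb"\d{4}", b"2021"),
    )
)
```

For something like a URL route, spelling out [0-9] is probably the least surprising option, since it stays correct even if someone later drops or forgets the flag.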