1. 42

  2. 8

    I think it would be extremely incorrect to exclude Chinese numerals from is_alphabetic(), they are alphabetic, within the already strained metaphor of applying “alphabeticyness” to non-alphabets.

    Wait, what??? Did they just say ideograms are considered alphabetic? I know I’d file a bug if I encountered that in an API I was using.

    Other than that I found this an interesting and informative post, but dang–that one is a real head-scratcher.

    1. 7

      I assume that the API assumes that users assume that calling is_alphabetic() on characters that are spoken aloud, as opposed to the silent punctuation, special chars, etc., is going to return True. In other words, there’s a conjecture that too many Western programmers wouldn’t know the difference.

      1. 12

        You’re probably right in terms of what the intent was, but that’s like … not what an alphabet is. Not even close. Making an API give incorrect answers because people expect incorrect answers is … definitely a take, I guess.

        1. 2

          If it leads to more programs that work and fewer bugs, or even just more utility for users, that seems better than being correct according to some strict definition.

          1. 6

            Use a different name for your function then. “alphabet” means something already, and it’s not some esoteric or obscure term.

            1. 13

              Plenty of non-linguists use the term “alphabet” loosely for any writing system, not just writing systems where single symbols generally represent phonetic segments rather than syllables or morphemes. It’s kind of silly if the Rust standard library defines char::is_alphabetic('ი') to return true and char::is_alphabetic('い') to return false, because Georgian mkhedruli is an alphabet and Japanese hiragana is a syllabary according to technical definitions of both of those terms as widely used by linguists. This isn’t a distinction most programmers will fully understand or care about. I don’t object to picking a different term than is_alphabetic for the API function (although I can’t think of a better one offhand), but given that this API is in the language already I wouldn’t deprecate it because it doesn’t adhere to the technical vocabulary of a non-programming field.

        2. 8

          I would expect it to just implement Unicode, and not make up something of its own.

          So in this case, the same result as a \p{Alphabetic} regex test.
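As a rough illustration of how a Unicode-driven predicate treats the characters discussed in this thread, here is a minimal Python sketch. It assumes Python 3's `str.isalpha()` as a stand-in: that method is defined in terms of letter general categories rather than the Alphabetic property that `\p{Alphabetic}` tests, so the two can differ on other characters, but they agree on the samples below. The sample list is my own choice, not something from the post.

```python
import unicodedata

# Illustrative samples: Latin letter, Georgian mkhedruli letter,
# Japanese hiragana, CJK ideograph (also the numeral "one"),
# ASCII digit, ASCII punctuation.
samples = ["a", "ი", "い", "一", "1", "!"]

# str.isalpha() is category-based (Lu, Ll, Lt, Lm, Lo), which is an
# approximation of the Unicode Alphabetic property, not the property itself.
results = {ch: ch.isalpha() for ch in samples}

for ch in samples:
    category = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} category={category} isalpha={results[ch]}")
```

The alphabet, the syllabary, and the ideograph all come back as "alphabetic" here, while the digit and the punctuation mark do not.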

        3. 3

          That’s probably coming straight from Unicode.

        4. 8

          I remember a major outage at a past job where a Java frontend validated something like 10.💯.1.1 as a valid IP and committed it to the source of truth. Then the C++ control plane tried to reconfigure the network with that change and crashed, preventing reconfiguration until the source of truth was rolled back.

          Learned two important lessons that day.

          1. Parse and reserialize instead of validating.
          2. Always use a function like is-ascii-digit instead of is-number when parsing something you expect to be an ASCII number. You almost never want Unicode character classes in these cases. If your programming language doesn’t provide an API with a clear enough name, create an alias with a very unambiguous name so that it is clear what the predicate accepts.
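Both lessons can be sketched together in Python. This is a minimal illustration under my own assumptions, not the code from the outage: `parse_ipv4`, `serialize_ipv4`, and `ASCII_DIGITS` are hypothetical names, and the Arabic-Indic input is an example I picked because Python's `str.isdigit()` accepts it.

```python
ASCII_DIGITS = "0123456789"  # hypothetical, deliberately unambiguous name


def parse_ipv4(text):
    """Parse a dotted quad into a tuple of four ints, ASCII digits only."""
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected 4 octets, got {len(parts)}")
    octets = []
    for part in parts:
        # Explicit ASCII check instead of a Unicode-aware predicate.
        if not part or any(ch not in ASCII_DIGITS for ch in part):
            raise ValueError(f"non-ASCII-digit octet: {part!r}")
        value = int(part)
        if value > 255:
            raise ValueError(f"octet out of range: {value}")
        octets.append(value)
    return tuple(octets)


def serialize_ipv4(octets):
    """Re-emit the parsed value in one canonical ASCII form."""
    return ".".join(str(octet) for octet in octets)


# A Unicode-aware "validator" accepts Arabic-Indic digits for "10" ...
tricky = "\u0661\u0660.1.1.1"
naive_ok = all(part.isdigit() for part in tricky.split("."))

# ... while parse-then-reserialize only ever stores what the parser understood.
canonical = serialize_ipv4(parse_ipv4("010.1.1.1"))
```

Downstream consumers then see only the reserialized form, so they never have to agree with the frontend about what counts as a digit.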
          1. 2

            Parse and reserialize instead of validating.


            I have a post explaining why this is a good idea: https://www.brainonfire.net/blog/2022/04/29/preventing-parser-mismatch/

          2. 3

            This essay is such an enjoyable read.

            1. 3

              This is one of those things that people generally don’t appreciate when they move to a language with real Unicode support and bring ASCII (or “UTF-8, it’s effectively ASCII because I don’t use the rest of it”) assumptions with them.

              When the Python 2->3 transition happened, for example, Python’s re module became Unicode-aware by default. Which means that now \d matches anything Unicode says is a digit, so if you wanted to match only [0-9] you’d better be explicit about that.
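A small sketch of that `\d` behaviour, assuming Python 3 and the standard `re` module. The Arabic-Indic digits are an illustrative input of my own; `re.ASCII` and an explicit `[0-9]` class are the two usual ways to opt back into ASCII-only matching.

```python
import re

# ARABIC-INDIC DIGITS ONE, TWO, THREE, written as escapes to avoid
# right-to-left rendering surprises in the source.
arabic_indic = "\u0661\u0662\u0663"

# str patterns are Unicode-aware by default, so \d accepts these digits.
unicode_match = re.fullmatch(r"\d+", arabic_indic)

# With re.ASCII, \d is restricted to [0-9].
ascii_flag_match = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII)

# An explicit character class is restricted regardless of flags.
explicit_class_match = re.fullmatch(r"[0-9]+", arabic_indic)

print(bool(unicode_match), bool(ascii_flag_match), bool(explicit_class_match))
```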

              In general I’m sympathetic to the idea that the code points in question ought to be considered numeric since Unicode certainly designates them as such via the Numeric_Type property. I’m less sympathetic to the idea that Unicode APIs need to divide the world up into “this or that, or maybe sometimes neither, but never both”. Sometimes Unicode absolutely says it will be both, or that the answer depends on how you ask. For example: I wrote a thing a couple years back about case in Unicode and code points which, depending on exactly how you ask, may answer “yes” to both of “are you uppercase?” and “are you lowercase?” Or you might get “no” to both questions if you ask a different way (i.e., a code point might answer “yes” to both when asked in terms of whether it would be changed by applying a case mapping to the case in question, but answer “no” to both when asked in terms of its general category).

              1. 4

                In general I’m sympathetic to the idea that the code points in question ought to be considered numeric since Unicode certainly designates them as such via the Numeric_Type property.

                I don’t know. It makes sense to consider the context and “what are you trying to do” aspect here. For my typical use cases, I would care about the code point being numeric only when I am trying to actually convert it to a number. OK, you could make a tool to highlight numbers in a text for some reason and the Unicode classification might be handy, but in the usual (for me) use case of parsing, I would definitely maintain that numbers are only those code points that can be parsed as such by the program.

                Which means that now \d matches anything Unicode says is a digit, so if you wanted to match only [0-9] you’d better be explicit about that.

                I did not know of this. That’s actually pretty unfortunate, since e.g. int('一') fails despite the documentation not saying anything about code points, only about bases.
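The gap between "Unicode says it is numeric" and "the program can parse it" can be seen with a few standard-library calls. This is a sketch assuming Python 3; `try_int` is a hypothetical helper added only to keep the example short.

```python
import unicodedata

kanji_one = "一"                      # U+4E00, Numeric_Type=Numeric
arabic_indic = "\u0661\u0662\u0663"   # Arabic-Indic digits 1, 2, 3 (category Nd)


def try_int(text):
    """Return int(text), or None if int() rejects the input."""
    try:
        return int(text)
    except ValueError:
        return None


# int() accepts Unicode decimal digits, but not other numeric characters.
parsed_kanji = try_int(kanji_one)
parsed_arabic_indic = try_int(arabic_indic)

# The str predicates disagree with each other on the same character.
kanji_flags = (kanji_one.isnumeric(), kanji_one.isdigit(), kanji_one.isdecimal())

# unicodedata can still report the numeric value that int() will not use.
kanji_value = unicodedata.numeric(kanji_one)

print(parsed_kanji, parsed_arabic_indic, kanji_flags, kanji_value)
```

So a predicate like `isnumeric()` is not a reliable guard in front of `int()`; `isdecimal()` is closer, and an explicit ASCII check is stricter still.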

              2. 3

                I feel like these APIs really should say “is ASCII numeric”. Hell, “is ASCII alphabetic” while we’re at it. These APIs tend to get used when dealing with certain kinds of machine readable files, and when they get used in normal text there’s always something semantically off.

                I have been bitten by numeric detection considering Japanese characters to be numeric.

                Meanwhile I have never seen Japanese software say “actually you can use Japanese numerals for this import file number”. It’s ASCII (for numbers of course).

                1. 1

                  You can always rely on the well-known property of the Western Arabic numerals to never have a code point above U+0039.

                2. 1

                  More practically, allowing Chinese numerals in is_numeric could lead to some interesting bugs. Imagine setting your session id to a series of Chinese characters, and it getting stored in a numeric database field… It’s probably best to be conservative with this one.

                  1. 3

                    There’s a ton of Unicode characters in the “Decimal Number” (Nd) general category already, so adding Chinese numerals will probably not add to any mess: https://www.compart.com/en/unicode/category/Nd
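To get a feel for how large that category is, one can enumerate it with Python's `unicodedata` module. This is a sketch only: the exact count depends on the Unicode version bundled with the interpreter, so no number is quoted here.

```python
import sys
import unicodedata

# Every code point whose general category is Nd ("Decimal Number"),
# according to the Unicode database shipped with this Python build.
decimal_digits = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
]

# Only ten of them are the ASCII digits most parsers actually expect.
ascii_only = [ch for ch in decimal_digits if ch.isascii()]

print(f"Unicode {unicodedata.unidata_version}: {len(decimal_digits)} Nd code points")
print(f"ASCII subset: {''.join(ascii_only)}")
```

Note that "一" is not in this list: it carries a numeric value but belongs to the letter category Lo rather than Nd.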