1. 20
  1.  

  2. 21

    What it comes down to, and what the post sadly never mentions, is this: “length” is not well-defined on strings (not even ASCII strings!); there is only an intuitive understanding of it. In ASCII, that intuition happens to work for all printable characters: the number of visible glyphs equals the length in bytes. But, for example, “foobar\t” has a length of 7 bytes and an undefined visual/intuitive length (without context).
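A minimal sketch of the tab example, in Python (my choice of language for illustration; the thread itself is about JavaScript). The helper name `rendered_width` is made up here, and "visual length" is assumed to mean columns in a monospaced grid with fixed tab stops:

```python
# The byte length of "foobar\t" is fixed, but its rendered width
# depends on where the tab stops are (the missing "context").
TEXT = "foobar\t"


def rendered_width(text: str, tab_size: int) -> int:
    """Columns occupied once tabs are expanded to spaces (hypothetical helper)."""
    return len(text.expandtabs(tab_size))


# Always 7: six letters plus one tab byte.
byte_length = len(TEXT.encode("ascii"))

# One width per assumed tab size.
widths = {tab_size: rendered_width(TEXT, tab_size) for tab_size in (3, 8, 10)}
```

Here `byte_length` stays at 7, while `widths` comes out as 9, 8, and 10 columns for tab sizes 3, 8, and 10.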

    Unicode with combining characters makes that even harder, but in general, the problem is the same: whenever you say “string length”, the next question should be “which length?”.
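To make "which length?" concrete, here is a rough Python sketch that measures the same string four ways. The function names are illustrative, not from any library, and `approx_graphemes` is only a crude stand-in for real grapheme segmentation (it skips combining marks and does not implement the Unicode segmentation rules, so ZWJ emoji sequences and the like will be miscounted):

```python
import unicodedata


def code_points(s: str) -> int:
    # Python's len() counts Unicode code points.
    return len(s)


def utf8_bytes(s: str) -> int:
    # Size in an external format, here UTF-8.
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    # UTF-16 code units, which is what JavaScript's .length reports.
    return len(s.encode("utf-16-le")) // 2


def approx_graphemes(s: str) -> int:
    # Crude approximation: count code points that are not combining marks.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


ACCENTED = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT
EMOJI = "\U0001F600"    # a single code point outside the BMP
```

For `ACCENTED` the four measures give 2, 3, 2, and 1; for `EMOJI` they give 1, 4, 2, and 1. None of them is "the" length.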

    1. 1

      “length” is perfectly well-defined for bytestrings, and if you have a font and a size, “length” (or width; visual length) can also be defined.

      The real problem is that “length” is a stupid name for something nobody wants. People writing JavaScript are almost never interested in the internal buffer size; usually they want the number of bytes in an external format (like UTF-8) or the width in pixels. Yet the language designers gave the thing nobody wants a really discoverable and valuable name, while burying the things people actually look for.

      At this point, we can’t fix this aspect of JavaScript anymore, but we can still get a lesson from it: Stop building “unicode” into new languages.

      1. 1

        > “length” is perfectly well-defined for bytestrings, and if you have a font and a size, “length” (or width; visual length) can also be defined.

        What’s a “bytestring”? I’m talking about intuitive understanding of Strings.

        You reiterate my point: without context (font and size), “length” can have very different meanings and outcomes. Yes, it’s ambiguous and it would be better if we had “bytesize”, “visuallength”, “grapheme_length”.

        > At this point, we can’t fix this aspect of JavaScript anymore, but we can still get a lesson from it: Stop building “unicode” into new languages.

        I don’t think that’s a feasible solution, it would only make matters worse.

    2. 24

      I long for the day when these ‘I just discovered unicode’ posts will stop.

      1. 18

        Unicode (or rather, natural language) is a complex topic and we should encourage folks to learn more about it. One effective way to learn something is to try to explain it to someone else.

        1. 11

          I agree, but almost none of these unicode blogposts go beyond ‘.length gives a weird result.’ It’s like posting a new article on how to do joins in SQL or pointer arithmetic every week. These things are important to know, but you could and should just read proper documentation to learn them.

          1. 18

            The documentation for learning these things is voluminous and likely impenetrable for a lot of folks. Not only do you need to be familiar with the inner workings of Unicode itself, but you also need to be familiar with the specific, quirky implementation of Unicode in whatever toolset you’re using.

            I think it’s pointless to complain about how others choose to learn. You lament the learning style of others. Well, I lament people thumbing their noses at these folks while lecturing them on the “proper” way to learn.

            1. 6

              Maybe we are misunderstanding each other. People can of course read and write about anything they want, and if someone reads this and learns about the basics of unicode for the first time, that’s nice. I still don’t see why these articles have to be posted and upvoted here, because they seldom add anything new beyond the basic information that a grapheme isn’t one codepoint and that a codepoint isn’t a fixed number of bytes.

              1. 13

                > because they seldom add anything new

                Maybe to you or me. :-) If I come across a programmer who has never heard of “grapheme” before, I’m not at all surprised. It might be basic to you or me, but as these articles show, Unicode and its various implementations present oodles of surprises to unsuspecting folks. If they perceive that a lot of people don’t know about it (which matches my perception as well), then writing about it to share with others seems like a totally reasonable conclusion to come to!

                I think it would be interesting to start a link aggregator where some notion of novelty was a minimum standard for article submission. I’m not sure what the right level would be though. If it’s too high, then you’ve limited yourself to academia. I suppose the problem with novelty is that if topics are never duplicated, then new members or folks who missed the first link or folks that just weren’t ready to absorb the material will miss out.

                Here’s my opinion: when someone posts an article that clearly documents a learning process, the least we can do is not discourage the author.

                1. 8

                  This gave me an idea.

                  There could be a tip-of-the-day site that pretends to be a link aggregator.

                  1. Get a list of various common problems in a field. E.g. length of string has multiple definitions, addresses and names are surprising, you probably don’t want to use strncpy, you may want to use CLOCK_MONOTONIC, something from lists of falsehoods programmers believe in, etc.

                  2. Get multiple high quality links for each category with a few levels of depth.

                  3. Get a list of random links - one link per category per level of depth. Then shuffle it.

                  4. Get a link from the list every X hours (depending on the number of links you have). Then submit it to the site lobste.rs/HN style.

                  5. When the list is empty, go back to step 3, but don’t use links that are on the first Y pages of submitted links.
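The steps above could be sketched roughly like this in Python. Everything here is hypothetical: the `TipQueue` class, the `example.com` URLs, and the simplification that "the first Y pages of submitted links" is approximated by the last `recent_window` submissions. Scheduling (the "every X hours" part) is left to whatever calls `next_link`.

```python
import random
from collections import deque


class TipQueue:
    """Cycles through one link per (topic, depth) pair in shuffled order."""

    def __init__(self, links, recent_window, rng=None):
        # links: {topic: {depth: [url, ...]}}, curated by hand (steps 1 and 2).
        self.links = links
        # Assumed stand-in for "the first Y pages": the last N submitted links.
        self.recent = deque(maxlen=recent_window)
        self.rng = rng or random.Random()
        self.pending = []

    def _refill(self):
        # Step 3: pick one random link per topic per depth, then shuffle.
        batch = []
        for levels in self.links.values():
            for candidates in levels.values():
                # Step 5: skip links that were submitted recently.
                fresh = [url for url in candidates if url not in self.recent]
                if fresh:
                    batch.append(self.rng.choice(fresh))
        self.rng.shuffle(batch)
        self.pending = batch

    def next_link(self):
        # Step 4: meant to be called every X hours by an external scheduler.
        if not self.pending:
            self._refill()
        if not self.pending:
            return None  # nothing eligible right now
        url = self.pending.pop()
        self.recent.append(url)
        return url


# Placeholder data: two topics, with one or two depth levels each.
EXAMPLE_LINKS = {
    "string length": {
        1: ["https://example.com/len-intro-a", "https://example.com/len-intro-b"],
        2: ["https://example.com/len-deep-a", "https://example.com/len-deep-b"],
    },
    "monotonic clocks": {
        1: ["https://example.com/clock-a", "https://example.com/clock-b"],
    },
}
```

With `recent_window=3`, the first six calls to `next_link()` on this data return all six placeholder links once each, because the second refill has to avoid the three that were just submitted.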

            2. 5

              > These things are important to know, but you could and should just read proper documentation to learn them.

              and many people do.

              However some people are just playing around, ‘discover’ some interesting result and want to share.

              There’s no need to be a grumpy old man.

            3. 1

              One potentially effective way.

              However, if you explain it poorly and yet persuasively, then you have done worse than nothing. Burrito monads may have made me a little sensitive to this point, but the important thing isn’t that .length is weird; it’s that every language you think “supports” unicode doesn’t.