1. 29
  1.  

    1. 10

      Pretty much every language out there that I can think of uses bytewise equality when comparing strings, and for good reasons: it’s what you want 99% of the time.

      Arguing that differently encoded strings might impose problems is something you can (and should) approach by making (and requiring) strings to be UTF-8 and UTF-8 only. Those wishing to use a different encoding likely have very different requirements to begin with, such that those users are better catered by a third-party library.

      Expecting certain optimizations is similarly a weird argument, because the same is typically true when matching a value against a list of non-monotonic integers, i.e. something like this:

      match the-number
        10  -> foo
        123 -> bar
        583 -> baz
        _   -> fallback
      

      Zig does support this and in the worst case you’d compile it the same way as with strings: a nested if-else/linear search. And yet for strings this isn’t OK?

      In fact, supporting this explicitly actually enables you to generate better code. If somebody manually writes a nested if-else then there’s not much a compiler can do, but in the case of a switch/match the compiler has more options. For example, this:

      match the-string
        'foo'   -> A
        'bar'   -> B
        'hello' -> C 
        'x'     -> D
        _       -> E
      

      Could be compiled into something like this:

      match size(the-string)
        1 -> match the-string
          'x' -> D
          _   -> E
        3 -> match the-string
          'foo' -> A
          'bar' -> B
          _     -> E
        5 -> match the-string
          'hello' -> C
          _       -> E
      

      You can further optimize that by also branching on the first byte, or on the first N bytes if the strings share a common prefix, though in most cases the branch on the size alone already makes the remaining linear scan fast enough.
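
      A sketch of that lowering, with Python standing in for the code a compiler might emit (the function name `dispatch` and the labels A–E are purely illustrative, mirroring the pseudocode above):

      ```python
      # Illustrative sketch: branch on length first, then (within a length
      # bucket) on the first byte, then do the full comparison.
      def dispatch(s: bytes) -> str:
          n = len(s)
          if n == 1:
              if s == b"x":
                  return "D"
          elif n == 3:
              # "foo" and "bar" already differ in their first byte, so a
              # compiler could branch on s[0] before the full comparison.
              if s[0:1] == b"f" and s == b"foo":
                  return "A"
              if s[0:1] == b"b" and s == b"bar":
                  return "B"
          elif n == 5:
              if s == b"hello":
                  return "C"
          return "E"  # fallback arm
      ```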

      I also suspect that people using Zig (or a similarly low-level language) are very much aware that matching against strings won’t compile to a jump table, such that supporting it in a more basic form (i.e. without the above optimization) would be just fine.

      1. 3

        Arguing that differently encoded strings might impose problems is something you can (and should) approach by making (and requiring) strings to be UTF-8 and UTF-8 only.

        Which normal form? NFC? NFD? NFKC? NFKD?

        1. 1

          I’m not sure what you’re trying to say here, but “Use UTF-8 for strings” really isn’t rocket science: it’s widely used, there’s plenty of information on it out there, and it’s much better than whatever clever alternative one might come up with (including “strings are bytes lol”).

          1. 3

            Give https://en.wikipedia.org/wiki/Unicode_equivalence and “Dark corners of Unicode” a read.

            The TL;DR for what I’m referring to specifically is that, to ensure the ability to round-trip legacy encodings through Unicode, there are often multiple ways to encode visually identical characters. (eg. Do you write é as the precomposed character that was defined to round-trip latin1 or do you write it as an e plus a combining diacritic?)

            …and this occurs at the level of Unicode codepoints so UTF-8 is subject to this complexity too.

            NFD, NFC, NFKD, and NFKC are the abbreviations for the four different “Normal Forms” that Unicode codepoint streams can be normalized to. (Normalization Form Canonical Decomposition, Normalization Form Canonical Composition, Normalization Form Compatibility Decomposition, and Normalization Form Compatibility Composition)
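
            The difference is easy to observe from any language with Unicode support; for instance, with Python’s stdlib `unicodedata` module:

            ```python
            import unicodedata

            precomposed = "\u00e9"   # é as a single codepoint (the latin1 round-trip form)
            decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

            # Visually identical, but not equal codepoint-wise or bytewise:
            assert precomposed != decomposed
            assert len(precomposed) == 1 and len(decomposed) == 2
            assert precomposed.encode("utf-8") != decomposed.encode("utf-8")

            # Normalizing both to the same form makes them compare equal:
            assert unicodedata.normalize("NFC", decomposed) == precomposed
            assert unicodedata.normalize("NFD", precomposed) == decomposed
            ```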

            I remember reading about how you can break macOS Finder in weird ways because, while the filesystem is spec’d to use UTF-8, there are still ways to get filenames onto a filesystem in one of the normal forms Finder doesn’t expect to be using. (Sort of like how you can get weird results out of Windows Explorer or CMD.EXE by using Linux to set up case-sensitive filename collisions in an NTFS filesystem.)

            Example:

            As a long-time Korean user of macOS, I hate the fact that APFS is normalization-insensitive yet normalization-preserving.

            As a result, this means that… if you create a Korean-named file yourself in Finder: then the file name is normalized in Unicode NFD (decomposed) form, but if you create a Korean-named file in a zsh shell, the file name is normalized in Unicode NFC (composed) form.

            You don’t realize the difference until you try an ls | xxd, but the bytes are different, even though they look the same.

            It’s an invisible mess that people don’t realize until they do. And a lot of programs gets subtly crazy when fopen(A/B) succeeds but listdir(A) doesn’t list B.

            – goranmoomin @https://news.ycombinator.com/item?id=35336510

        2. 1

          I would be disappointed by a compiler that turns an integer switch into a linear search: it should be a jump table for dense cases or a binary search for sparse cases. (Linear might make sense for tiny cases.)

          Similarly, after switching on the string length, it’s naïve to restrict the compiler to test a prefix of the string. Much more effective to test the first byte that’s different in all cases — which is just a simple special case of perfect hashing.
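
          A sketch of those two strategies for the integer case, with Python standing in for the generated code (names here are illustrative; a real compiler would emit an actual jump table and inline branches):

          ```python
          from bisect import bisect_left

          # Dense keys: a jump table is just an array indexed by (key - base).
          DENSE_BASE = 10
          DENSE_TABLE = ["foo", "bar", "baz"]  # keys 10, 11, 12

          def match_dense(n: int) -> str:
              i = n - DENSE_BASE
              return DENSE_TABLE[i] if 0 <= i < len(DENSE_TABLE) else "fallback"

          # Sparse keys: binary search over a sorted key array.
          SPARSE_KEYS = [10, 123, 583]
          SPARSE_VALS = ["foo", "bar", "baz"]

          def match_sparse(n: int) -> str:
              i = bisect_left(SPARSE_KEYS, n)
              if i < len(SPARSE_KEYS) and SPARSE_KEYS[i] == n:
                  return SPARSE_VALS[i]
              return "fallback"
          ```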

        3. 14

          My brother in christ, please just let me switch on strings.

          1. 6

            The word “just” is sitting on top of a lot of complexity in this statement.

            Pardon my lack of Zig experience, but I wouldn’t expect any reference type to work in a switch in a systems language (similar to C).

            1. 10

              I would count Rust as a systems language, and it works in match there (not just for &str but for any reference).

              1. 2

                Any constant reference.

            2. 5

              Do you want equivalent but differently encoded Unicode strings to be considered equal?

              1. 17

                No, just raw byte-for-byte comparison.

                1. 3

                  None of the workarounds in the article seem to address that either.

                  Do you think that not allowing users to switch on strings will avoid canonicalisation issues somehow?

                  Can users define their own string type and make switch work on it?

                  1. 3

                    would it make sense to allow an optional comparator to the switch statement? e.g. switch (mystr, &std.mem.equal) { ... }
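
                    A sketch of what such a comparator-parameterized switch could mean semantically (the `switch_with` helper and the use of Python are purely illustrative; the default comparator being modelled is std.mem.eql-style bytewise equality):

                    ```python
                    # Hypothetical model of "switch (value, comparator)": try each
                    # case's pattern against the value with the supplied comparator.
                    def switch_with(value, comparator, cases, default):
                        for pattern, result in cases:
                            if comparator(value, pattern):
                                return result
                        return default

                    bytewise = lambda a, b: a == b  # models bytewise equality on byte slices

                    assert switch_with(b"red", bytewise,
                                       [(b"red", "R"), (b"blue", "B")], "other") == "R"

                    # Swapping in a different comparator, e.g. case-insensitive matching:
                    casefold = lambda a, b: a.lower() == b.lower()
                    assert switch_with(b"RED", casefold,
                                       [(b"red", "R"), (b"blue", "B")], "other") == "R"
                    ```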

                    1. 2

                      I had been imagining something like the ability to extend switch to operate on any aggregate type that has a hash-function definition available, but having a comparator generalizes that, and even supports some interesting use-cases like canonicalization-in-comparison (as @andrewrk asked about) as well as even fancier things like edit distance matching!

                      I find this idea quite compelling…

                      All the best,

                      -HG

                      1. 2

                        Sounds like pattern matching

                2. 5

                  if (std.mem.eql(u8, color, "red") == true)

                  Is boolean-expr == true idiomatic in Zig? A younger me lost a number of grade points in CSE101 to expressions like this.

                  1. 5

                    Very much no

                  2. 2

                    Alternative approaches: https://zigbin.io/274557

                    1. 2

                      I was hoping for a perfect hash on the statically known set of strings, so a switch looks like hash -> branch table -> memcmp. I suppose using the length as a hash works but I expect it’s possible to do better.

                      Oh, d’oh, that’s what the “improve in the future” link is about. Ah well, the idea deserves more space than the article gave it.

                      Fun fact! Early versions of php used string length as the hash function for looking up builtin functions, and the names of some functions were chosen to have lengths that balanced the chain lengths in the lookup logic.
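
                      For the example set earlier in the thread, a trivial perfect hash already exists; here is a Python sketch (purely illustrative) of the hash → branch table → full-compare shape:

                      ```python
                      # For {"foo", "bar", "hello", "x"} the first byte alone is
                      # already a perfect hash over the known set, so dispatch is:
                      # index by first byte, then a single full comparison.
                      CASES = {b"foo": "A", b"bar": "B", b"hello": "C", b"x": "D"}
                      BY_FIRST_BYTE = {key[0]: (key, label) for key, label in CASES.items()}

                      def match(s: bytes) -> str:
                          if s:
                              entry = BY_FIRST_BYTE.get(s[0])
                              if entry is not None and s == entry[0]:
                                  return entry[1]
                          return "E"  # fallback arm
                      ```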

                        1. 1

                          Yeah that link appears near the top of the “improve in the future” discussion.

                      1. 1

                        But then, without built-in understanding of UTF-8, is this supposed to only cover US-ASCII? The same string might be encoded in multiple ways in Unicode + UTF-8.

                        Example: in Unicode, å can be either the direct U+00E5 or a composition of U+0061 + U+030A; both render the same. That means different lengths.

                        1. 6

                          I didn’t know of any languages canonicalising unicode strings before comparing them, but it turns out that Swift does this.

                          Two String values (or two Character values) are considered equal if their extended grapheme clusters are canonically equivalent. Extended grapheme clusters are canonically equivalent if they have the same linguistic meaning and appearance, even if they’re composed from different Unicode scalars behind the scenes.

                          Do you know of any other languages that do?

                          1. 4

                            It works for any encoding if you canonicalize to the right flavour of bytestring first.

                            1. 3

                              I think this gets into the discussion at the top of the article, which is that switch (color) should obviously be semantically equivalent to if color == ... else if color == ..., but the definition of == is ambiguous. Pointer and string equality are two answers; what you’re talking about is whether two byte strings of valid UTF-8 represent the same sequence of glyphs. (*)

                              That’s useful in some contexts but it’s more specialized than just “match on this set of byte strings”.

                              (*) to be super pedantic, { U+00E5 } and { U+0061, U+030A } are different strings, but encode the same glyph. Text is hard.

                              1. 1

                                If you need unicode semantics you can use a library like zg https://codeberg.org/atman/zg