1. 38

  2. 15

    As a Haskell programmer, I’m definitely missing more types in this post, and the C programmer in me also has lots of questions.

    What is the type of s[1..3] == "ell"? Is it copied from s? Is it null-terminated? What’s stopping me from accessing s[1..3][3]? Will I get back 'o'?

    It looks like .len doesn’t include the null terminator in the length?

    What’s the type of .next() and chunk?

    .* looks pretty confusing. Is it a method/function which copies the contents of a const string? What happens if I do:

    var s = "good morning";
    var t = s.*;
    t[0] = 'm';
    std.debug.print("{s}\n", .{s});
    1. 10

      hey, thank you so much for the feedback. Yes, the article is definitely missing these points, and they’re really important information. I will try to update it to address them!

      1. 3
        1. 9

          Ideally I wouldn’t need to look at reference docs to understand an “X in N minutes” post; I think there’s plenty of space left in those five minutes to fill in the blanks.

          1. 3

            Yeah, I fully agree.

      2. 9

        For example, to find the index of a character in a string: var found = std.mem.indexOf(u8, c, "w");

        This seems very suspicious. Since you’re specifying u8 there, doesn’t this find the byte offset rather than the index by characters?
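        For comparison, Rust’s str::find behaves the same way: it returns a byte offset, not a character index. A minimal sketch (using a multi-byte character to make the difference visible):

        ```rust
        fn main() {
            let s = "héllo";
            // find() returns a byte offset: 'é' is two bytes in UTF-8,
            // so 'l' sits at byte 3 even though it is character number 2.
            assert_eq!(s.find('l'), Some(3));
            assert_eq!(s.chars().position(|c| c == 'l'), Some(2));
        }
        ```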

        1. 4

          Yes, you’re right, I’ve updated the post to point this out. Thank you so much!

        2. 9

          For better Unicode support when processing strings in Zig, see the libraries Zigstr and Ziglyph.

          In the comments of the Zig issue “Improved handling of strings and unicode”, language designer andrewrk wrote that he would be open to incorporating those libraries’ concepts into Zig’s standard library API before releasing Zig 1.0. He noted a caveat that the standard library might not necessarily gain all the functionality of those two libraries, as that would require the standard library to depend on a Unicode data file.

          1. 6

            Really weak string support like this kills a language’s ergonomics for me. Rust also blew it in this regard. I understand why for both Zig and Rust, but it’s still disappointing.

            1. 8

              I can understand problems with Zig, but Rust? What is an issue there?

              1. 7

                Yeah, Rust string methods are as complete as any other language I know. Presumably kris meant that Rust strings are not ergonomic? Because of the borrow checker, and String vs. &str vs. &mut str vs. whatever str is. (Which is all a necessary consequence of borrow checking and zero-cost abstractions, but that doesn’t make it less confusing to a beginner.)
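                To make the distinction concrete for a beginner, a minimal sketch using only standard Rust (the difference is ownership vs. borrowed view, not the API):

                ```rust
                fn main() {
                    // String: owned, heap-allocated, growable.
                    let owned: String = String::from("hello");
                    // &str: an immutable borrowed view; no copy is made.
                    let borrowed: &str = &owned;
                    assert_eq!(borrowed.len(), 5);

                    // &mut str: a mutable view; can change bytes in place but not grow.
                    let mut owned2 = String::from("hello");
                    let view: &mut str = owned2.as_mut_str();
                    view.make_ascii_uppercase();
                    assert_eq!(owned2, "HELLO");
                }
                ```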

                1. 4

                  I agree that it is confusing, but strings in general are hard; Rust just exposes that from day one instead of pretending that everything is ASCII until it is not.

              2. 4

                What would you like to see in the way of string support in a Zig/Rust-like?

                1. 4

                    An immutable string type is just so convenient that it’s really hard to justify doing string stuff in a language without it.

                  1. 2

                      In Rust, immutable and mutable strings would have exactly the same API and would have the same ergonomics. The only difference between the two would be in performance:

                      operation | “immutable” | “mutable”
                      ----------|-------------|----------
                      clone     | O(1)        | O(N)
                      push char | O(log(N))   | O(1)
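                      For what it’s worth, today’s Rust already gives owned and borrowed strings a shared read-only API, because String dereferences to &str. A minimal sketch:

                      ```rust
                      fn main() {
                          // String derefs to str, so both types expose the same read API.
                          let owned = String::from("good morning");
                          let slice: &str = "good morning";
                          assert_eq!(owned.find("morning"), slice.find("morning"));
                          assert!(owned.starts_with("good"));
                          assert!(slice.starts_with("good"));
                      }
                      ```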


              3. 5

                The post claims this is null-terminated: var s = [_]u8{'h', 'e', 'l', 'l', 'o'};

                Is there a missing '\0' at the end, or is Zig doing something clever?

                1. 4

                  That was a mistake: string literals are null-terminated, but string slices are not.

                2. 3

                  Caveat: I don’t have a lot of experience with C (or C++).

                  I had the impression from reading various internet discussions that null-terminated strings were considered a mistake. After some searching, I found multiple impassioned defenses of them on Stack Overflow. This gives me more context and understanding for why null-terminated strings were chosen for C, but it doesn’t explain why almost no language since has used them.

                  What is Zig’s rationale for using null terminated strings?

                  1. 4

                    Zig has support for both null-terminated and non-null-terminated strings. The []const u8 type, which is the convention for strings, is not null-terminated. The default type for a string literal is *const [N:0]u8. This can then coerce into a []const u8, which is a slice. Null-terminated strings are useful for C interop, but slices are very useful also.

                    1. 3

                      As someone who only knows a little of Zig, my guess is that the decision is a consequence of Zig’s origin. Zig is meant to be a better C. C uses null-terminated strings and (nearly) every C library does. Therefore, supporting them in an essential way seems hard to get away from.

                      1. 3

                        EDIT: looks like g-w1 actually knows: Zig has both kinds of strings, and the null-terminated ones are for C interop.

                      2. 1

                        Relying on the null terminator causes problems because calculating lengths (and doing bounds checks on random access) are O(n). C used null terminators because space was very constrained. A length field the same size as a null byte (as Pascal used) limited strings to 255 characters, which caused a lot of problems. If you have a 32-bit or 64-bit size field, you’re typically not losing much (especially if you do short-string optimisation and reserve a bit to indicate whether the short string is embedded in the space used by the size and pointer).
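                        A minimal Rust sketch of the length-field approach: a &str is a (pointer, byte length) pair, so taking the length is O(1) and interior null bytes are just bytes.

                        ```rust
                        fn main() {
                            // The byte length is stored in the fat pointer:
                            // len() just reads it, no scan for '\0' needed.
                            let s = "hello";
                            assert_eq!(s.len(), 5);
                            // An interior null byte doesn't truncate anything.
                            let t = "he\0llo";
                            assert_eq!(t.len(), 6);
                            assert_eq!(t.bytes().filter(|&b| b == 0).count(), 1);
                        }
                        ```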

                        In contrast, having the null terminator can make C interop easier because you don’t need to copy strings to convert them to C strings. How much this matters depends a lot on your use case. Having the null terminator can cause a lot of problems if you have one inconsistently. For example:

                        $ cat str.cc
                        #include <string>
                        #include <cstring>
                        #include <iostream>
                        int main() {
                                std::string hello = "hello";
                                auto hello_null = hello;
                                hello_null += '\0';
                                std::cout << hello << " == " << hello_null << " = " << (hello == hello_null) << std::endl;
                                std::cout << "strlen(" << hello << ".c_str()) == " << strlen(hello.c_str()) << std::endl;
                                std::cout << "strlen(" << hello_null << ".c_str()) == " << strlen(hello_null.c_str()) << std::endl;
                        }
                        $ c++ str.cc && ./a.out
                        hello == hello = 0
                        strlen(hello.c_str()) == 5
                        strlen(hello.c_str()) == 5

                        Converting a C++ standard string to a C string implicitly strips the null terminator (it’s there, you just can’t see it), which means that strlen(x.c_str()) and x.size() will be inconsistent.
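                        Rust’s FFI string type takes the opposite approach to the C++ example above: CString::new scans for embedded null bytes up front and refuses them, so the inconsistency can’t arise silently. A minimal sketch:

                        ```rust
                        use std::ffi::CString;

                        fn main() {
                            // A plain string converts fine.
                            assert!(CString::new("hello").is_ok());
                            // A stray '\0' is an error at construction time,
                            // not a silent strlen/size mismatch later.
                            assert!(CString::new("hello\0").is_err());
                        }
                        ```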

                        The biggest mistake that a string library can make is coupling the string interface to a string representation. A contiguous array of bytes containing a UTF-8 encoding is fine for a lot of uses of immutable strings, but what happens if you want to iterate over grapheme clusters (or even Unicode code points)? If you do this multiple times for the same string then you can do it much more efficiently if you cache the boundaries with the string.

                        For mutable strings, there are a lot more problems. Consider adding a character to the middle of a string with the contiguous-array representation. It’s an O(n) operation in the length of the string, because you have to reallocate and copy everything. With a model that over-allocates the buffer, it’s O(n) in the length of the tail of the string, with periodic O(n) copies when the buffer is exhausted (amortised to something better depending on the policy). With a twine-like representation, insertion can be cheap but indexing may be more expensive.

                        The optimal string representation depends hugely on the set of operations that you want to perform. If your string operations aren’t abstracted over the representation then there’s pressure to use a non-optimal representation.

                        Objective-C did this reasonably well. Strings implement a small set of primitive methods and can implement more efficient specialised versions. The UText interface in ICU is very similar to the Objective-C model, with one important performance improvement. When iterating over characters (actually, UTF-16 code units), implementations of UText have a choice of providing direct access to an internal buffer or to a temporary one. With a twine-like implementation, you can just update the pointer and length in the UText to point to the current segment, whereas with NSString you need to copy the characters to a caller-provided buffer.

                      3. 3

                        This is great. I’d love to see a part two dealing more with dynamic/run-time string I/O, e.g. the classic beginner program prompting “What is your name?”, responding with “Hello, {name}”, and/or reading from arguments, etc.

                        I feel it’s a bit underserved by current Zig documentation.

                        Come to think of it, maybe some of the examples from K&R’s “The C Programming Language” would be useful in idiomatic Zig? (Also, is there an equivalent for modern idiomatic C?)

                        1. 2

                          Things that bother me about living in the future: it is now impossible for any new language to implement strings.

                          1. 1

                            Swift implements them quite well.

                          2. 1

                            This is perhaps just me not being familiar enough with Zig: does the ++ operator do an alloc/memcpy, or is it compiler magic when the arrays are known at compile time?

                            1. 3

                              I think it’s comptime only. Zig has no language-level constructs that engage allocators.

                              1. 1

                                At compile time, it just creates another compile-time-known string. At runtime it should do alloc + memcpy only if the sizes are compile-time known. I started implementing this in the self-hosted compiler https://github.com/ziglang/zig/pull/9876.