1. 18
  1.  

  2. 4

    In both the Python and Ruby examples there were explanations about “what makes sense.” Another dimension for this could be “how are we likely to use the output?”

    The Python version (returning [""] on an empty string) has the advantage that you can always count on a string being in the output. It almost seems like a total function at two levels. It’s elegant but atypical.
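To make the “you can always count on a string being in the output” point concrete, here is a small sketch of what Python's str.split does with an empty string. The variable names are just for illustration. Note that the guarantee only applies when a separator is passed explicitly; the no-argument, whitespace-splitting form behaves differently.

```python
# With an explicit separator, splitting the empty string yields one empty string.
explicit_sep = "".split(",")
print(explicit_sep)  # ['']

# So indexing the first element is always safe in this form.
first = "".split(",")[0]
print(repr(first))  # ''

# The no-argument form (split on runs of whitespace) returns an empty list instead.
no_sep = "".split()
print(no_sep)  # []
```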

    1. 2

      Another dimension for this could be “how are we likely to use the output?”

      The follow-up to that, though, is “how likely are we to use the output correctly?”

      Functions that are really core in a system like string splitting, path manipulation, etc. are in a tricky position. They are so frequently used that you really want to make sure they do what the user intends. But, because they are so frequently used, it’s also really painful if they do something like throw an exception when the input makes the user’s intention ambiguous. So instead they get pressured into accepting more inputs than are really meaningful and then have to make somewhat arbitrary choices like this for how to interpret the weird edge cases.

      1. 1

        I’m interested in the question of how malleable this is. For instance, people in Python and Ruby are used to negative indexes working from the back of an array. It seems like we can make these choices and they work as long as they become commonly understood behavior.

    2. 4

      This article did a great thing–it made me think about my preference here. And what I’ve concluded is that both behaviors are useful, and the name sucks.

      The AWK method should really be called fields: "a,b,c".fields(",") == ["a", "b", "c"] with "".fields(",") == []. The Python version should continue to be called split. If you split a log with an axe, 1 of two things will happen: You’ll miss, which gives you back the whole log unscathed. Or, you’ll hit the wood and it’ll be split into 1 or more pieces. In either case, you have at least one piece of wood.

      fields is also sort of strange to me. "a,b,c".fields(",") doesn’t feel right, but ",".fields("a,b,c") does. It’s like the purpose of fields is to query the argument. Hmm.
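A minimal sketch of the split/fields distinction proposed above, written in Python. The fields function here is hypothetical (it is not a built-in in any of the languages discussed) and only models the one difference named in the comment: an empty input has no fields.

```python
from typing import List


def fields(text: str, sep: str) -> List[str]:
    """Hypothetical AWK-style helper: an empty input contains no fields."""
    if text == "":
        return []
    return text.split(sep)


print(fields("a,b,c", ","))  # ['a', 'b', 'c']
print(fields("", ","))       # []
print("".split(","))         # [''] (the "log" comes back whole)
```

Under this naming, split always hands back at least one piece, while fields answers “which fields are in here?” and can legitimately say “none”.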

      1. 2

        Perl is a superset of AWK, and of course if you have basic stuff like string manipulation you’re not gonna change how your function works compared to the model.

        The documentation for Perl split() is pretty clear:

        Note that splitting an EXPR that evaluates to the empty string always produces zero fields, regardless of the LIMIT specified.

        The documentation for Ruby continues the use of “fields”.

        I think the Python split method makes sense as the return type of splitting a string is always an array of strings, not, weirdly, an empty array. Perl doesn’t have this problem because the function split is defined to return a list (or its length, in scalar context). Where Ruby arguably fails is awkwardly marrying the Perl outcome with the Python syntax.

        If you split a log with an axe, 1 of two things will happen: You’ll miss, which gives you back the whole log unscathed. Or, you’ll hit the wood and it’ll be split into 1 or more pieces.

        What kind of log is an empty string? And how can a split function/method fail?

        1. 2

          What kind of log is an empty string? And how can a split function/method fail?

          It’s a log of zero length, which is too small for an axe to split, and therefore, it’s still a log of zero length. Obviously, you can’t extract fields from it, either, so the result is an empty array in that case.

          1. 1

            Let’s not split hairs, for all love ;)

            Seriously though, I agree with you that both approaches make sense in their own context.

      2. 1

        Another interesting thing to think about is that even with the different split semantics for both Python and Ruby, I think we still have this (using Ruby syntax):

        x == x.split(y).join(y)

        1. 1

          The author lost me near the start: The output is “cat\ndog\n” not “cat\ndog”, and trailing separators are sort of relevant for a technical discussion of splitting strings on separators.

          1. 2

            Not to mention that if I were writing this I’d not use the output of ls(1) to get the contents of the directory, I’d use the native options in my language of choice. But I’ve always been wary of “shelling out” for stuff like this.

            1. 2

              And even if shelling out, find -maxdepth 1 -mindepth 1 -print0 is a more paranoia-compatible option than ls.

              1. 2

                Assuming the find on the system has max/mindepth extensions.
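For the “native options” route mentioned above, here is a rough Python sketch that sidesteps splitting entirely by asking the OS for directory entries directly. The helper name list_entries is made up for illustration; os.scandir is the standard-library call doing the work.

```python
import os
from typing import List


def list_entries(directory: str) -> List[str]:
    """Return the names in `directory`, sorted, without parsing any command output."""
    with os.scandir(directory) as entries:
        return sorted(entry.name for entry in entries)


if __name__ == "__main__":
    # Names containing spaces or other odd characters arrive intact,
    # because nothing is ever joined into one string and split again.
    for name in list_entries("."):
        print(repr(name))
```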

          2. 1

            I think it is perhaps worse that it seems that in Ruby:

            • "-".split("-") is []
            • "-x".split("-") is ["", "x"]
            • "x-".split("-") is ["x"]

            Looking at the documentation, trailing fields are suppressed unless you supply a negative limit parameter. Python and AWK return two elements for each of these.
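For comparison, a sketch of the same three inputs in Python. The split_drop_trailing helper is hypothetical: it only approximates the Ruby results listed above by discarding trailing empty strings, and makes no claim to match Ruby in other cases.

```python
from typing import List


def split_drop_trailing(text: str, sep: str) -> List[str]:
    """Hypothetical helper: split, then drop trailing empty fields."""
    parts = text.split(sep)
    while parts and parts[-1] == "":
        parts.pop()
    return parts


for s in ("-", "-x", "x-"):
    print(repr(s), s.split("-"), split_drop_trailing(s, "-"))
# '-'  ['', '']   []
# '-x' ['', 'x']  ['', 'x']
# 'x-' ['x', '']  ['x']
```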

            1. 1

              Perl acts like Ruby in this case, unless you supply a negative LIMIT parameter.

              Edit edit: AWK does return 2 elements, it’s only Perl and Ruby in the weird corner :D

              My limited knowledge of AWK indicates it acts in the same way as Perl and Ruby, but this could be because I’m writing as

              $ awk -F: '{split($0,a,"-"); print a[1],":",a[2]}' <<< "x-"

              so I’m implicitly assuming the array a will have 2 elements.

              Edit: the manual for my version of AWK (GNU Awk 4.1.3) states:

              If string is null, the array has no elements. (So this is a portable way to delete an entire array with one statement. See Delete.)

            2. 1

              Whilst I like the Python justification as a “pattern”, we could go a little further to make it a real ‘specification’. As indicated by @mfeathers it’s nice to know how split and join interact.

              It would be nice if, for all strings x and y, join is the inverse of split:

              y.join(x.split(y)) == x
              

              Yet this doesn’t work in Python due to an ‘empty separator’ error when y = "". It seems to hold for non-empty separators though. Some languages allow empty separators, returning a list of the individual characters. Would that be enough to satisfy this equation in general?
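A quick, non-exhaustive way to poke at the first equation in Python. This is a brute-force check over short strings on a two-character alphabet, not a proof. The split_chars helper is an assumption standing in for “a language that returns the individual characters for an empty separator”.

```python
import itertools
from typing import Iterator, List


def round_trips(x: str, y: str) -> bool:
    """True when joining the pieces of x with y gives back x."""
    return y.join(x.split(y)) == x


def all_strings(alphabet: str, max_len: int) -> Iterator[str]:
    """Every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            yield "".join(chars)


# Non-empty separators, including multi-character ones.
separators = ["-", "a", "a-", "--"]
holds_for_nonempty = all(
    round_trips(x, y) for x in all_strings("a-", 5) for y in separators
)

# Python rejects the empty separator outright.
try:
    "abc".split("")
    empty_separator_rejected = False
except ValueError:
    empty_separator_rejected = True


def split_chars(x: str) -> List[str]:
    """Assumed empty-separator behaviour: one element per character."""
    return list(x)


holds_for_empty = all("".join(split_chars(x)) == x for x in all_strings("a-", 5))

print(holds_for_nonempty, empty_separator_rejected, holds_for_empty)
```

Under that assumed character-splitting behaviour the equation also holds for the empty separator on these inputs, including x = "", since joining an empty list gives back the empty string.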

              It would also be nice if, for all strings x and lists of strings y, split is the inverse of join:

              x.join(y).split(x) == y
              

              This doesn’t work because the elements of y may contain x, e.g. if x = '-' and y = ['foo', 'bar-baz'] we get ['foo', 'bar', 'baz'].

              Instead, we might split the inner values and append the results:

              x.join(y).split(x) == sum(map(lambda s: s.split(x), y), [])
              

              This also doesn’t work in Python, since y = [] gets [""] on the left hand side and [] on the right hand side. This may work for Ruby’s behaviour (although I don’t have Ruby to test with).
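The last equation can be tried against both behaviours in Python. As a stand-in for Ruby, this sketch uses a hypothetical split_drop_trailing helper that only drops trailing empty strings; it is an approximation based on the Ruby results quoted earlier in the thread, not Ruby itself.

```python
from typing import Callable, List

Splitter = Callable[[str, str], List[str]]


def split_drop_trailing(text: str, sep: str) -> List[str]:
    """Hypothetical Ruby-like split: trailing empty fields are discarded."""
    parts = text.split(sep)
    while parts and parts[-1] == "":
        parts.pop()
    return parts


def lhs(split: Splitter, x: str, y: List[str]) -> List[str]:
    # x.join(y).split(x)
    return split(x.join(y), x)


def rhs(split: Splitter, x: str, y: List[str]) -> List[str]:
    # sum(map(lambda s: s.split(x), y), [])
    return sum((split(s, x) for s in y), [])


cases = [["foo", "bar-baz"], [], ["a-", "b"]]
for y in cases:
    python_ok = lhs(str.split, "-", y) == rhs(str.split, "-", y)
    rubyish_ok = lhs(split_drop_trailing, "-", y) == rhs(split_drop_trailing, "-", y)
    print(y, python_ok, rubyish_ok)
# ['foo', 'bar-baz'] True True
# [] False True
# ['a-', 'b'] True False
```

So with this emulation the y = [] case is repaired, but an element ending in the separator (here 'a-') breaks the equation in a different spot, because the trailing empty field is dropped on the right-hand side while it sits in the middle of the list on the left.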