1. 11
  1.  

  2. 3

    I understand why it was included, but people should not mistake ByteString for being a string in the way Text and String are - it’s a byte array. Accordingly, you can’t convert from a ByteString to a text representation without knowing and choosing what encoding you’re expecting. This is no different in any other language.
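    A minimal sketch of what that explicit choice looks like, assuming the text and bytestring packages (bytesToText is a made-up name):

    ```haskell
    import qualified Data.ByteString as BS
    import qualified Data.Text as T
    import Data.Text.Encoding (decodeUtf8', encodeUtf8)

    -- Decoding is an explicit choice of encoding: decodeUtf8'
    -- returns Left on invalid UTF-8 instead of throwing.
    bytesToText :: BS.ByteString -> Either String T.Text
    bytesToText bs = either (Left . show) Right (decodeUtf8' bs)

    main :: IO ()
    main = do
      let t  = T.pack "héllo"   -- Text: five characters
          bs = encodeUtf8 t     -- ByteString: six bytes (é is two in UTF-8)
      print (BS.length bs)                -- 6
      print (bytesToText bs == Right t)   -- True
    ```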

    1. 1

      I wish the Haskell community would fix things like this.

      It’s absolutely ridiculous that there are a dozen string types in the standard library. And to make it 100 times worse, the default string representation is the most inefficient one possible. It gives a terrible first impression of the language when one of the fundamental built-in data types is completely brain dead, and all for what? So it’s compatible with list functions? Who cares!

      And it’s not just a bad first impression: any “real life” code that uses a few libraries has to deal with converting between all these types, which is a pain in the ass. It also adds extra work when writing new code, because you have to stop and think about what kind of string you need.

      1. 5

        Here are the rules we should be trying to follow:

        1. Use Text for actual “string” things.
        2. Use ByteString when there’s actually a sequence of bytes.
        3. Almost never use String.
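
        A minimal sketch of code following those rules, assuming the text and bytestring packages (greeting and toWire are made-up names):

        ```haskell
        {-# LANGUAGE OverloadedStrings #-}
        import qualified Data.ByteString as BS
        import qualified Data.Text as T
        import Data.Text.Encoding (encodeUtf8)

        -- Rule 1: human-readable text is Text.
        greeting :: T.Text -> T.Text
        greeting name = T.concat ["Hello, ", name, "!"]

        -- Rule 2: anything that hits the wire or disk is ByteString,
        -- and the encoding is chosen explicitly at the boundary.
        toWire :: T.Text -> BS.ByteString
        toWire = encodeUtf8

        -- Rule 3: String appears nowhere above.
        main :: IO ()
        main = print (toWire (greeting "wörld"))
        ```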

        I wish the Haskell community would fix things like this.

        People pretty much agree with the general rules I put above, but I’m not sure how we can fix this.

        The Prelude really sucks and makes it convenient to use String for the wrong things. We can’t remove the type String = [Char] alias (I’d absolutely love to), and we can’t stop functions from returning things like IO [Char], because all libraries would break. We’re kinda stuck if we’re trying to improve the Prelude.

        I think we need to start writing better Prelude-like modules and start using -XNoImplicitPrelude.
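
        As a single-file sketch of the idea (the Text-first imports are the point; shout is a made-up name):

        ```haskell
        {-# LANGUAGE NoImplicitPrelude #-}
        {-# LANGUAGE OverloadedStrings #-}
        -- With -XNoImplicitPrelude nothing is in scope by default, so an
        -- alternative prelude (or explicit imports, as here) can provide
        -- Text-based functions instead of the String-based ones.
        import Prelude (IO, Bool, (==))
        import Data.Text (Text)
        import qualified Data.Text as T
        import qualified Data.Text.IO as TIO

        shout :: Text -> Text
        shout = T.toUpper

        main :: IO ()
        main = TIO.putStrLn (shout "hello")   -- prints HELLO
        ```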

        Also, I think the conversion table at the bottom of this article could be much more succinctly expressed using lens, which would be cool to see.
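
        Something along these lines, perhaps - a sketch assuming the lens package, where packed and utf8 come from Data.Text.Strict.Lens:

        ```haskell
        import Control.Lens
        import qualified Data.Text as T
        import Data.Text.Strict.Lens (packed, utf8)

        main :: IO ()
        main = do
          -- packed :: Iso' String Text, so one optic covers both directions:
          print ("hello" ^. packed)               -- String -> Text
          print (T.pack "hello" ^. from packed)   -- Text -> String
          -- utf8 :: Prism' ByteString Text: encoding always succeeds (re),
          -- decoding can fail (^?), which the optic types make explicit.
          print (T.pack "héllo" ^. re utf8)           -- Text -> UTF-8 bytes
          print ((T.pack "héllo" ^. re utf8) ^? utf8) -- round-trip: Just "héllo"
        ```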

        1. 3

          I wouldn’t mind seeing a Python 3-like backwards incompatible change at a major version. Strings are probably the worst offender, but there are several other places in the stdlib and language that are just stupidly designed or implemented and could use cleanup. Maybe just have the compiler special case [Char] and string literals. Or do something with type classes. Just because strings act like [Char] doesn’t mean they have to be implemented that way.

          At the very least, maybe the community could stop teaching things like [Char]. Obviously, when asking in #haskell and on mailing lists and whatnot, [Char] is kinda shunned in favor of the better alternatives. But a lot of tutorials and language intros use [Char] exclusively.

          1. 5

            Python 3 is probably an example of why Haskell won’t do this anytime soon.

            The most sensible way forward is probably going to be ignoring the existing Prelude and making it easy to do so.

            1. 1

              The issue with Python 3 (disclaimer: I’m not a Python user) is that it doesn’t let programmers specify encodings, at all. I agree that Haskell should very much avoid going down that path.

              This blog post about how Python 3 (doesn’t) handle encoding issues probably focuses a bit too heavily on an audience who already understands encodings, and is probably a bit too angry to be effective, but I found it a pretty good snapshot of the state of the discourse, both the arguments it’s making and the replies listed in the “But you’re wrong” heading. As long as people are talking past each other like that, I intend to stay well away.

              I do think that a backward-incompatible change would be fine, as long as the new status quo is an improvement and not a regression! But I understand the hesitation.

              1. 2

                That wasn’t really the problem with Python 3 at all. The blog post isn’t even correct… There’s an entire module for dealing with encodings. Further, many people actually prefer the Python 3 way of handling Unicode, because the Python 2 way often only worked by accident. The Python 3 way requires that if you’re going to treat I/O values as strings, you’d better be sure they really are strings. If they’re not strings, or you don’t know, treat them as bytes and encode/decode the right way when you need to.

                There was no single big problem with Python 3, except that it annoyed a lot of people and orphaned a lot of existing code by making a lot of breaking changes. To make it worse the original Python 3 release didn’t have many cool features to entice people. The Python team basically said, “This will allow us to make cool changes in the future, but we haven’t got around to it yet, so right now we’re just breaking everything.”

                The biggest single problem was probably that it broke just about every third-party module that used the C FFI, because in addition to the string/bytes changes, it made major changes to the way objects and other internal “stuff” were handled behind the scenes in the interpreter. Suddenly things like numpy, matplotlib, scipy, etc. stopped working, along with most of the major web frameworks, which were heavily impacted by the bytes/string issue and often used C FFI libraries.

            2. 3

              I was discussing this the other day with the Chicago Haskell user group, and one person recalled a claim that the Prelude includes totally-common partial functions (e.g. head for lists) for pedagogical reasons: students get something naive and easy, then have to work up through improvements. A quick web search didn’t immediately turn up anyone official discussing this seriously, and I’m hungry for a reference if anyone can think of or find one. Until a citation appears, I can’t bring myself to believe it.

              1. 1

                I’ve heard that too. I’ve also heard that map, fold, etc. were specialised to lists because teaching (a -> b) -> [a] -> [b] is easy but Functor f => (a -> b) -> f a -> f b is not.
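
                For comparison, the two side by side (base only):

                ```haskell
                -- The specialised type students see first:
                --   map  ::              (a -> b) -> [a] -> [b]
                -- versus the general one it shadows:
                --   fmap :: Functor f => (a -> b) -> f a -> f b

                main :: IO ()
                main = do
                  print (map  (+ 1) [1, 2, 3])   -- [2,3,4]
                  print (fmap (+ 1) [1, 2, 3])   -- identical on lists: [2,3,4]
                  print (fmap (+ 1) (Just 1))    -- but also works on Maybe: Just 2
                ```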

                I agree that the list version is easier to teach, but sacrificing the whole Prelude for that was a huge cost. It’s great that GHC committed to the Burning Bridges Proposal.

                1. 1

                  The Typeclassopedia’s explanation is that it makes for nicer error messages:

                  You might ask why we need a separate map function. Why not just do away with the current list-only map function, and rename fmap to map instead? Well, that’s a good question. The usual argument is that someone just learning Haskell, when using map incorrectly, would much rather see an error about lists than about Functors.

                  It seems like a pretty difficult problem in general: How to make very generic code produce good error messages that make sense in a specific concrete context.

        2. 0

          The final note on bytestrings: the library isn’t designed for Unicode. For Unicode strings you should use Text from the text package.

          This is garbage in my eyes, then.

          1. 2

            You never ever work with raw bytes? Only encoded text?

            Have you ever worked on anything that wasn’t a web app?

            1. 1

              I see quite a few uses for non-Unicode strings, such as compactness/speed.

              1. 2

                OK, I get it: if you want efficient raw bytes, use ByteString. But please do not put actual text in such things; I get sad every time someone’s surname or just an actual sentence gets garbled. The world speaks more than just English.
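
                A quick illustration of one way the garbling happens, assuming the bytestring and text packages: Data.ByteString.Char8.pack silently truncates every Char to its low 8 bits.

                ```haskell
                import qualified Data.ByteString.Char8 as C8
                import qualified Data.Text as T
                import Data.Text.Encoding (encodeUtf8)

                main :: IO ()
                main = do
                  -- Char8.pack keeps only the low byte of each code point:
                  -- 'ж' (U+0436) silently becomes 0x36, i.e. the digit '6'.
                  print (C8.pack "жуть")              -- "6CBL" - garbage
                  -- Encoding explicitly keeps the text recoverable:
                  print (encodeUtf8 (T.pack "жуть"))  -- the actual UTF-8 bytes
                ```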

                1. 2

                  I really don’t understand why this is getting downvoted in 2015. “Non-Unicode strings” usually means 7-bit ASCII; I don’t think anyone is under the illusion that 8-bit codepages are a good idea anymore. That’s excluding people, deliberately - often excluding the authors of the language themselves. It’s unreasonable to expect anyone to use ASCII-only strings.

                  And “string” doesn’t mean “array of bytes” - that should indeed be a different type, for the cases where performance is the dominant concern. By all means, do not use a string type for your array of 16-bit dual-channel audio samples.

                  This discussion has veered far away from Haskell, but for the record, yes, ByteString is meant to be blob data, not text. Unfortunately for historical reasons it gets thought of as a text type, hence its inclusion in this article, but that’s not what it’s meant for.

                  It’s really saddening to still see this level of talking past each other; I kind of thought programmers more-or-less understood this problem domain, these days.

                  1. 3

                    It increases the risk of errors (and brittleness even when the original solution is correct), but I do find myself resorting to treating UTF-8 strings as arrays of bytes in some cases, because doing things properly via Unicode support seems to do everything an order of magnitude more slowly. This is possible to do without garbling non-English text, if you’re careful.

                    A common use-case for me is when I want to parse a structured data file (XML, say), where all the formatting characters are ASCII, but the payload may be arbitrary UTF-8. The Wikipedia XML data dumps happen to have this property: the tags are all named ASCII things like <title>, but the contents of the tags might be in Japanese or Arabic. One way to do this is to parse the file as UTF-8, but that’s really slow. Another way is to treat it as a raw byte string and scan for islands of these ASCII tags (with some additional logic for handling CDATA sections, also marked in ASCII). This wouldn’t work in all encodings, but UTF-8 happens to have the property that 7-bit ASCII values never appear as a byte within a multibyte codepoint’s encoding, so if you want to search a UTF-8 string for an ASCII needle, you can treat it as an array of bytes. Then once you’ve found your structure, you pull out the UTF-8 and treat it as UTF-8.
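
                    A sketch of that technique, assuming the bytestring and text packages (titleOf is a made-up name, and the error handling is minimal):

                    ```haskell
                    {-# LANGUAGE OverloadedStrings #-}
                    import qualified Data.ByteString as BS
                    import qualified Data.Text as T
                    import Data.Text.Encoding (decodeUtf8, encodeUtf8)

                    -- Scan raw bytes for ASCII delimiters; decode only the payload.
                    -- This is safe in UTF-8 because bytes below 0x80 never occur
                    -- inside a multi-byte sequence.
                    titleOf :: BS.ByteString -> Maybe T.Text
                    titleOf doc
                      | BS.null rest = Nothing                 -- no <title> in the input
                      | otherwise    = Just (decodeUtf8 body)  -- treat bytes as text last
                      where
                        open      = "<title>"
                        (_, rest) = BS.breakSubstring open doc
                        afterOpen = BS.drop (BS.length open) rest
                        body      = fst (BS.breakSubstring "</title>" afterOpen)

                    main :: IO ()
                    main = print (titleOf (encodeUtf8 "<page><title>日本語のタイトル</title></page>"))
                    ```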

                    I’d rather not mess with doing that kind of thing, but with speedups around 10x, it’s often practical.

                    1. 2

                      I suppose. I don’t necessarily think of that as breaking the abstraction; you’re using each type for the things it’s good at. But, yes, I think you’re describing a legitimate scenario for treating text as a blob. It’s not that different in some sense from the scenario I would have picked for casting back and forth, where you have a binary network protocol that contains null-delimited text in some of its messages.

                      The important point I’d make is that you’re aware, when treating it as a blob, that it isn’t arbitrary data. You’re being careful to not interfere with its text-like properties. There are all sorts of naive things one can do that cause trouble, which is why text types make sense at all, but you’ve taken time to understand the situation and not do those.

                      As a footnote, which you clearly know, but for anyone who doesn’t - this wouldn’t work for most multi-byte encodings; UTF-8 was specifically designed for this use.

                  2. 0

                    Being a speaker of multiple languages myself, I agree.

                  3. 1

                    e.g. O(1) indexing