1. 48
  1. 27

    The indifferent approach to correctness is the reason that I would never use MySQL again. I know that today it’s less haphazard, but I lost so much data to bad defaults, incomprehensible limitations, and pure documentation fails that I am unalterably biased against it.

    Of course, I work at Shopify, where we have a huge MySQL installation. Ah, but I don’t have to touch it.

    1. 3

      I showed your post to a friend knowing that he would single out the sentence: “The indifferent approach to correctness is the reason that I would never use MySQL again;” knowing that he and I would both agree that this was the singular reason we ditched MySQL years ago.

    2. 25

      Running into this actually gave us enough push to move to Postgres while we were making such a large change. MySQL is shooting itself in the foot.

      1. 13

        Oh man, this is only the tip of the iceberg with MySQL weirdness. On the topic of character sets, another fun one is that it doesn’t verify its input. So it will happily accept and store latin-1 characters in a UTF-8 connection. This used to cause so many issues when clients dealt with encodings in a haphazard way - if some store latin-1 and some store UTF-8, you’ll end up with a database that can’t really be corrected. This blog post helped me out a great deal back in the day when I was dealing with CMSes with shitty character handling written in PHP (Drupal and Wordpress, mostly, but others too). Ironically, instead of recommending utf8mb4, it also tells you to use the utf8 encoding….

        1. 5

          Back when I worked at Mozilla on MDN, we had tags on articles, and since it was a multilingual site each language-specific version of each article had its own language-specific tags. Except there was a recurring and hard-to-reproduce weird bug where sometimes you’d get, say, French tags showing up on an English article or vice-versa.

          It turned out, after some digging, that the issue was MySQL: its character collations didn’t treat accented and un-accented (so think of “e” versus “é”) as distinct. So if, say, the French and English tags of an article differed only in an accent or two (example: English tag “Reference”, French tag “Référence”), MySQL couldn’t tell them apart and which one you got back from the queries we were doing was non-deterministic. Which was why sometimes the English article retrieved the French tag, or the French article retrieved the English one.

          A co-worker solved this by building and installing a new MySQL collation which could distinguish them. I have no idea whether MySQL itself has ever fixed this, or if now there’s a particular character set or collation you can use to avoid the issue.

          1. 1

            The correct behaviour for text search in utf8 is super annoying.

            There’s the obvious issue of different encodings for the same character e.g é as combined vs single unit (I totally forget Unicode terms nowadays). Similarly some characters have multiple encodings.

            There’s other “obvious” things like the fi ligature, and other similar ligatures which having coding.

            Things start getting weird when you ask should a search for e match é? Most people seem to say yes, and so no to the reverse of should é match e? It seems more clear that è != é, but that implies non transitive reaction ships :)

            The real pain point though is things like “ss” - should this only match “ss” or should it also match ß? In practice they need to match. Vice versus should ß match “ss” some people seem strongly opposed but in practice yes. I vaguely recall some MS software makes this locale specific, but that was >10 years ago, so the exact behavior might be OS locale or user locale or something. Also I could simply be misremembering :)

            I do kind of feel for the authors of a sql DB as I’m sure any choice will have people complaining.

        2. 8

          I remember the JIRA ticket at my previous job that nobody wanted to touch b/c it was basically fixing this on the main production database.

          1. 3

            Oof.

          2. 8

            Yet another mysql quirk to be aware of.

            I keep having to say this: If you’re considering mysql, look at postgres.

            It’s been literal decades. Why do I still have to say this? Why does MySQL still exist, even?

            1. 8

              The sad thing is originally utf8 worked. Here’s the actual commit that broke in in Sept 2002.

              1. 5

                That should be common knowledge since ~10 years

                1. 13

                  This should have been fixed in ~10 years. I mean, at least print a warning if you try to create a database in the broken encoding.

                  It’s not unreasonable for someone to think “UTF8” would be a data type that stored UTF8 data.

                  1. 5

                    Yes, ~20 years of pain with no upside. Yikes! I use Postgres when I have a choice, personally.

                    1. 2

                      I haven’t used MySQL in anger since the early 2000s but the factoid “there’s something hinky about UTF8 on MySQL” has been in my mind for ages.

                      1. 1

                        it’s for backwards compatibility..

                      2. 5

                        I’ve never seen this until today. Probably because I stopped using MySQL and MariaDB over ten years ago. Still, if you say your encoding is utf8, I do believe you you have created an implicit contract with the consumers of your work. If you’re definition of utf8 is only a subset what a programmer would want when he chooses UTF-8 I do think that it’s on your to call that out in your documentation. After reading the MariaDB documentation on Unicode, I don’t think that’s been done.

                        1. 3

                          Your comment made me look up what the current versions are doing.

                          The MySQL docs mark utf8 as deprecated. The language around utf8mb3 is not as clear as I’d like (I would say “never use this it is broken”) but if you read carefully the info you need is there.

                          MariaDB seems to have redefined “utf8” to be a working UTF-8 encoding. See OLD MODE for details.

                          1. 2

                            I read the MariaDB docs before I wrote my comment. This sentence stood out: “From MariaDB 10.6, utf8 is an alias for utf8mb3, but this can changed to ut8mb4 by changing the default value of the old_mode system variable.” Note well that as stated, the problem in caused when using three byte unicode rather than four byte unicode. The gist of it is that even in current MariaDB to get four byte unicode you have either explicitly ask for “utf8mb4” encoding in your database or change “old_mode” from it’s default setting to make “utf8” mean four byte utf-8 encoding. These changes only started with MariaDB 10.5, April 2021. That’s only about eight or nine months ago.

                            1. 1

                              I think you misread that. You have to set UTF8_IS_UTF8MB3, otherwise UTF8 is UTF8MB4

                              1. 1

                                I took this sentence from the documentation: “If set, the default, utf8 is an alias for utf8mb3. If not set, utf8 would be an alias for utf8mb4.” to mean that UTF8_IS_UTF8MB3 is set by default in OLD_MODE meaning that in MariaDB 10.6.1 “utf8” means “utf8mb3”. Had I written that passage, I would have more clearly stated the effect of the tunable, and it’s default setting by splitting those two pieces of information into two sentences. Regardless, If I go back to my high school english classes and take the meaning of that sentence to mean By default, utf8 is an alias for utf8mb3.

                                1. 2

                                  Wow, what a mess, you’d think folks would care more about being able to store text correctly in a database. The docs sure are confusing.

                                  Thanks for correcting my mistake up above; I’d naively assumed the default was sensible. It seems to be broken though, from the latest changes: “Default value changed in MariaDB Community Server 10.6.1 from “” (empty) to UTF8_IS_UTF8MB3”.

                      3. 4

                        Back in the day, we used Latin-1 because it would store UTF8, it just couldn’t interpret it.

                        1. 1

                          Wow, I thought utf-8 had some 5 byte things in it but apparently not. It’s flexible but I guess there aren’t enough characters to need it yet. I wonder if the migration steps that are linked in the original article are live-safe or if you’d want to do this with Percona (or the prod shuffle dance).

                          1. 5

                            So, the design of UTF-8 could accommodate much wider code points – the limit is a code point requiring 31 bits to represent as an integer. But UTF-8 never uses more than four, and never should, because Unicode itself will never assign code points that wide. The limit for Unicode is the current setup of 17 “planes” of 2^16 code points each.

                            That in turn is due to a technical limitation of UTF-16. If you know how UTF-16 works and what a surrogate pair is, you already know this. But for those who don’t: originally, Unicode assigned only code points representable by a 16-bit integer (what we now call “Plane 0” or the “Basic Multilingual Plane”), so the scheme of just representing them as two-byte units worked. But when Unicode expanded past that, the solution for UTF-16 was to use two ranges in Plane 0 to encode higher code points. In UTF-16, a code point outside of Plane 0 is represented by two code units (so four bytes total), with the first being from the range U+D800-U+DBFF and the second from the range U+DC00-U+DFFF. This “surrogate pair” of code points serves to encode the actual code point. This allows UTF-16 to handle up to 17 planes’ worth of code points – Plane 0 encoded as-is with two-byte units, and Planes 1-16 encoded with surrogates. But since that’s a hard limit due to the size of the surrogate ranges in Plane 0, Unicode itself now promises not to assign code points past Plane 16 (max U+10FFFF ).

                            Since UTF-8 can handle code points up to 21 bits wide in four bytes, UTF-8 never has to use more than four bytes.