1. 57
  1.  

    1. 2

      Slightly off topic, but I really struggled to find the right tag for an article about encoding. Glad it’s about encoding a URL, so “web” it is… :)

      1. 2

        If the existing encodings don’t work well for typical URLs, shouldn’t they make a new encoding mode that is specialized for alphanumerics plus / : ? & etc.?

        1. 5

          It’s a chicken-and-egg problem now. Who would want to use a QR code encoding that won’t work with the majority of QR code readers for only a very small gain, and how many reader implementations are actively maintained and will add support for something nobody uses yet?

          1. 2

            The encoding could be invented for internal use by a very large company, kinda like UPS’s MaxiCode: https://en.m.wikipedia.org/wiki/MaxiCode

            1. 1

              What you’re describing is the problem for all new standards. How do they ever work? ;-)

              1. 1

                Better in environments that are not as fragmented and/or can provide backwards compatibility? ;)

            2. 2

              The “byte” encoding works well for that. Don’t forget, URLs can contain the full range of Unicode, so a restricted set is never going to please everyone. Given that QR codes can contain ~4k symbols, I’m not sure there’s much need for an entirely new encoding.

              1. 4

                > Given that QR codes can contain ~4k symbols

                Yes, although… there’s some benefit to making QR codes smaller even when nowhere near the limits. Smaller ones scan faster and more reliably. I find it difficult to get a 1kB QR code to scan at all with a phone.

                1. 3

                  QR codes also let you configure the amount of error correction. For small amounts of data, you often turn up the error correction which makes them possible to scan with a very poor image, so they can often scan while the camera is still trying to focus.

                  1. 1

                    IME small QR codes scan very fast even with the FEC at minimum.

                2. 4

                  URIs can contain the full Unicode range, but the average URL does not. Given that URLs have become a major use case for QR codes, it’s definitely a shame there isn’t a better mode for them: byte mode needs 8 bits per character, while alphanumeric mode needs only 5.5.
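As a rough sketch of that arithmetic: alphanumeric mode packs two characters into 11 bits (and a trailing odd character into 6), while byte mode spends 8 bits per byte. The function names below are made up for illustration, and the counts cover payload bits only, ignoring the mode indicator, the character-count field, and error correction.

```python
def alnum_payload_bits(n_chars: int) -> int:
    # Each pair of characters takes 11 bits; a leftover single character takes 6.
    return 11 * (n_chars // 2) + 6 * (n_chars % 2)


def byte_payload_bits(n_bytes: int) -> int:
    # Byte mode stores every byte as-is.
    return 8 * n_bytes


n = 40  # e.g. a 40-character, ASCII-only URL
print(alnum_payload_bits(n), byte_payload_bits(n))  # 220 320
```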

                  1. 3

                    All non-ASCII characters in URLs can be %-encoded, so Unicode isn’t a problem. A 6-bit code has room for lowercase letters, digits, and all the URL punctuation, with room to spare for space, an uppercase shift, and a few more.

                    1. 2

                      So ideally a QR encoder should ask the user whether the text is a URL, and should check which encoding is the smallest (alnum with %-encoding, or binary mode). Of course this assumes that any paths in the URL are also case-insensitive (which depends on the server).

                      Btw. can all HTTP servers correctly handle requests with uppercase domain names? I’m thinking about SNI, maybe certificate entries…? Or do browsers already “normalize” the domain name part of a URL into lowercase?
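A minimal sketch of that check, assuming the caller already knows whether the whole URL (path included) is case-insensitive. `pick_mode` and its return shape are hypothetical, not part of any QR library, and only payload bits are compared. Note that the QR alphanumeric set has no `?`, `&`, or `=`, so URLs with a query string fall back to byte mode here.

```python
# The 45 characters allowed in QR alphanumeric mode.
ALNUM = set("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:")


def pick_mode(url: str, case_insensitive: bool) -> tuple[str, str, int]:
    """Return (mode, text_to_encode, payload_bits) for the cheaper option."""
    best = ("byte", url, 8 * len(url.encode("utf-8")))
    # Uppercasing is only safe if the caller vouches for case-insensitivity.
    candidate = url.upper() if case_insensitive else url
    if set(candidate) <= ALNUM:
        bits = 11 * (len(candidate) // 2) + 6 * (len(candidate) % 2)
        if bits < best[2]:
            best = ("alphanumeric", candidate, bits)
    return best


print(pick_mode("https://example.com/a1", True))
print(pick_mode("https://example.com/?q=1", True))
```

The first call can use alphanumeric mode after uppercasing; the second contains `?` and `=`, so it stays in byte mode.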

                      1. 4

                        The spec says host names are case-insensitive. In practice I believe all (?) browsers normalize to lowercase, so I’m not sure all servers would handle uppercase correctly, but a lot certainly do. I just checked, and curl does not normalize, so it would be easy to test a particular server that way.

                        1. 3

                          The host name yes, but not the path. So if you are making a URL that includes a path, the path should be uppercase (or case-insensitive but encoded as uppercase for the QR code).
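For illustration, Python’s standard `urllib.parse.urlsplit` reflects the same split: its `hostname` attribute comes back lowercased, while the path is left exactly as written. The URL is a made-up example, and this says nothing about what any particular server does.

```python
from urllib.parse import urlsplit

parts = urlsplit("HTTPS://EXAMPLE.COM/Some/Path")
print(parts.scheme)    # https
print(parts.hostname)  # example.com
print(parts.path)      # /Some/Path
```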

                    2. 2

                      No, a URI consists of ASCII characters only. A particular URI scheme may define how non-ASCII characters are encoded as ASCII, e.g. via percent-encoding their UTF-8 bytes.
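A small sketch of that percent-encoding step using Python’s standard `urllib.parse.quote`, which escapes the UTF-8 bytes of each non-ASCII character and leaves `/` alone by default. The path is an arbitrary example.

```python
from urllib.parse import quote, unquote

iri_path = "/café/日本"
uri_path = quote(iri_path)  # percent-encodes the UTF-8 bytes, keeps "/"
print(uri_path)             # /caf%C3%A9/%E6%97%A5%E6%9C%AC

# The result is pure ASCII and decodes back to the original text.
assert uri_path.isascii()
assert unquote(uri_path) == iri_path
```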

                      1. 1

                        Ok? You see how that makes the binary/byte encoding even worse right?

                        1. 1

                          Furthermore, the thing that’s like a URI but not limited to ASCII is an IRI (Internationalized Resource Identifier).

                      2. 2

                        You’re right (oh and I should know that URLs can contain anything…. Let’s blame it on Sunday :-))

                        1. 2

                          Byte encoding is fun in practice, as I recently discovered, because Android’s default QR code API returns the result as a Java String. But aren’t Java Strings UTF-16? Why yes, and so the byte data is interpreted as UTF-8, then converted to UTF-16, and then provided as a string.

                          The workaround, apparently, if you want raw byte data, is to use an undocumented setting to tell the API that the data is in an 8-bit code page that can be safely round-tripped through Unicode, and then extract the data from the string by exporting it in that encoding.
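The round-trip idea can be sketched outside Android. The comment does not name the code page, so ISO-8859-1 below is an assumption, picked because it maps every byte value to exactly one code point. This is plain Python, not the Android API.

```python
payload = bytes(range(256))  # arbitrary binary data, including invalid UTF-8

# Decoding as UTF-8 is lossy: invalid sequences are replaced with U+FFFD.
as_utf8 = payload.decode("utf-8", errors="replace")
assert as_utf8.encode("utf-8") != payload

# ISO-8859-1 (assumed here) is a 1:1 byte-to-code-point mapping,
# so decoding and re-encoding returns the original bytes.
as_latin1 = payload.decode("iso-8859-1")
assert as_latin1.encode("iso-8859-1") == payload
```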

                          1. 3

                            I read somewhere that the EU’s COVID passes used alphanumeric (text) mode with base45 encoding because of this tendency for byte-mode QR codes to be interpreted as UTF-8.

                            1. 1

                              Do you really mean base45 or was that a typo for base64? I’m confused because base64 fits fine into UTF-8’s one-byte-per-character subset. :)

                              1. 4

                                There’s an RFC: https://datatracker.ietf.org/doc/rfc9285/ and yes, base45, which uses digits, uppercase letters, and symbols, matching the QR code “alphanumeric” set.
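A minimal encoder sketch following my reading of RFC 9285: every two input bytes become three base45 characters, least significant digit first, and a trailing single byte becomes two. `b45encode` is a name chosen here for illustration; decoding and input validation are left out.

```python
# The 45-character alphabet, identical to the QR alphanumeric set.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ $%*+-./:"


def b45encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 2):
        chunk = data[i:i + 2]
        if len(chunk) == 2:
            # Two bytes form a 16-bit number written as three base-45 digits.
            n = chunk[0] * 256 + chunk[1]
            e, rest = divmod(n, 45 * 45)
            d, c = divmod(rest, 45)
            out += [ALPHABET[c], ALPHABET[d], ALPHABET[e]]
        else:
            # A trailing single byte is written as two base-45 digits.
            d, c = divmod(chunk[0], 45)
            out += [ALPHABET[c], ALPHABET[d]]
    return "".join(out)


print(b45encode(b"AB"))  # BB8, matching the example in the RFC
```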

                      3. 2

                        Fun fact: Data Matrix codes can switch encoding after every codeword.