1. 38
    1. 17

      in the real world A B <@C :"D E"@F -G-H!> is a valid email.

      The HTML5 spec is doing the Lord’s work, by being much more restrictive than the RFCs on what a valid “email address” is for the purpose of HTML form inputs. So a lot of “but technically the RFC allows it!” addresses just don’t work in a modern input type="email" control. And crucially, there is no amount of filing bugs against an individual web app that will succeed at changing this, because the developers of web applications have no control over this and can simply deflect responsibility to the HTML5 spec, which is very clear about its intentions:

      This requirement is a willful violation of RFC 5322, which defines a syntax for email addresses that is simultaneously too strict (before the “@” character), too vague (after the “@” character), and too lax (allowing comments, whitespace characters, and quoted strings in manners unfamiliar to most users) to be of practical use here.
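      For anyone curious how that plays out, here is a minimal Python sketch. It assumes the pattern below is a faithful transcription of the regex the HTML spec offers for valid e-mail addresses, and the helper name `html5_accepts` is made up for illustration.

      ```python
      import re

      # Assumed transcription of the HTML spec's e-mail pattern (see note above).
      HTML5_EMAIL = re.compile(
          r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
          r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
          r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
      )

      def html5_accepts(address: str) -> bool:
          """Return True if the whole string matches the HTML-style pattern."""
          return HTML5_EMAIL.fullmatch(address) is not None

      print(html5_accepts("user@example.com"))    # ordinary address
      print(html5_accepts('"D E"@example.com'))   # quoted local part, RFC-style
      ```

      The quoted local part is the kind of thing the RFC grammar permits, but the quote and space characters are simply not in the allowed set here.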

      1. 2

        there is no amount of filing bugs against an individual web app that will succeed at changing this, because the developers of web applications have no control over this

        The developers may be unable to change HTML, but they can change the HTML that they write.

        I doubt bizarre email addresses are common enough for enough users to care, but developers of a Web app could use a type other than type="email" if users complain enough.

        1. 17

          We can analyze the situation by postulating the following:

          • there is a group of people who have functioning email addresses conforming to RFC 5322 but not the HTML5 spec, and who lack any other email addresses or any other means of communication, and who are willing to engage in e-commerce with a specific site or sites
          • there is a group who enjoy arguing about email address formatting online.

          My instinct is that the two groups are entirely disjoint, and that the former group is the empty set.

          1. 5

            Oh, I’ll bet there’s one really angry guy who’s in both.

          2. 1

            if users complain enough

            I’d be surprised

        2. 10

          Perhaps we could say strings are a “narrow waist”, an incredibly fragile one. The one all Unix tools sort of understand. I just recently had the tiny misfortune of adding a command line argument that required an embedded ampersand and a variable substitution. How many layers of escaping / substitution / command line argument splitting rules am I faced with in shell? In Dockerfiles wrapping shell? In Groovy wrapping shell? Ugh.
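          To make the layering concrete, here is a small Python sketch that only models POSIX shell quoting via the standard `shlex` module. Dockerfiles and Groovy have their own rules on top, which this does not attempt to cover, and the sample argument is made up.

          ```python
          import shlex

          arg = "a&b $HOME"                     # hypothetical argument with & and $
          one_layer = shlex.quote(arg)          # safe to paste into one POSIX shell
          two_layers = shlex.quote(one_layer)   # e.g. a shell command run by another shell

          # Each round of shell parsing strips exactly one layer of quoting.
          assert shlex.split(one_layer) == [arg]
          assert shlex.split(two_layers) == [one_layer]

          print(one_layer)
          print(two_layers)
          ```

          Every extra wrapper needs its own round of quoting, which is why the hand-written versions get unreadable so quickly.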

          1. 5

            Yup there are a lot of messes involving shell.

            Weirdly, people keep complaining about it, but the same pattern of generating/templating shell is perpetuated by “modern” tools like Docker and Kubernetes.

            The “original sin” was Make – it reads a line at a time, and passes it to /bin/sh. It then has a substitution language that’s not shell, but collides badly with shell.

            I always like to point out that monstrosities like this are in the official GNU make manual:

            https://www.gnu.org/software/make/manual/html_node/Automatic-Prerequisites.html

            %.d: %.c
                    @set -e; rm -f $@; \
                     $(CC) -M $(CPPFLAGS) $< > $@.$$$$; \
                     sed 's,\($*\)\.o[ :]*,\1.o $@ : ,g' < $@.$$$$ > $@; \
                     rm -f $@.$$$$
            

            The $$$$ is there because $$ means the PID in shell, and $ is Make's escape character, so $$$$ is how you write the shell's PID in a Makefile.

            The $@ is also a horrible collision with shell – it doesn’t mean the arguments array, it means the output file. I really can’t be sure what they were thinking when creating a language that offloads its semantics to shell, yet collides so badly with it.
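            Here is a toy Python model of the two stages, in case it helps. It is not how GNU Make is implemented and only handles the handful of tokens that appear in the recipe above; the function name and the sample values are invented for illustration.

            ```python
            import re

            def make_expand(recipe, target, first_prereq, stem, variables):
                """Toy model of Make expanding one recipe line before sh sees it."""
                def repl(match):
                    token = match.group(0)
                    if token == "$$":
                        return "$"              # Make's escape: $$ becomes a literal $
                    if token == "$@":
                        return target           # the output file, not shell's "$@"
                    if token == "$<":
                        return first_prereq     # first prerequisite
                    if token == "$*":
                        return stem             # the part matched by %
                    return variables.get(match.group(1), "")   # $(NAME)
                return re.sub(r"\$\$|\$@|\$<|\$\*|\$\((\w+)\)", repl, recipe)

            line = "$(CC) -M $(CPPFLAGS) $< > $@.$$$$"
            print(make_expand(line, "foo.d", "foo.c", "foo",
                              {"CC": "cc", "CPPFLAGS": "-I."}))
            ```

            What the shell finally receives ends in `foo.d.$$`, and only then does the shell turn `$$` into its PID.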


            I think there is a pretty straightforward way out of this mess though.

            What I discovered with https://www.oilshell.org/ is that the power of shell has nothing to do with all the sloppy string usage.

            You can just make a better shell that doesn’t have all the string bugs, and you lose nothing!


            We can do a few things:

            1. Teach people about string safety – it should be like memory safety. If you’re going to shell out, and you need variables, do it properly.

            No:

            os.system('ls %s > out.txt' % dir)   # what if dir contains shell characters?
            

            Better:

            os.system('ls %s > out.txt' % shlex.quote(dir))   
            

            This brings up a point: I think many people are confused about shell escaping because they want something “portable” to Windows.

            The sloppy method is actually “portable”, but the right method is specific to POSIX shell!

            https://docs.python.org/3/library/shlex.html#module-shlex

            Even better:

            subprocess.call(['sh', '-c', 'ls "$1" > out.txt', 'dummy0', dir])
            

            Many people don’t know about this last method, but it’s the best and most flexible.
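            A runnable sketch of the last two approaches, assuming a POSIX `sh` is on the PATH. The hostile-looking value and the helper name are made up; `printf` stands in for `ls` so nothing depends on the file system.

            ```python
            import shlex
            import subprocess

            def echo_via_shell_arg(value: str) -> str:
                """Pass value as $1 to sh; the script text never contains the value."""
                result = subprocess.run(
                    ["sh", "-c", 'printf "%s\\n" "$1"', "dummy0", value],
                    capture_output=True, text=True, check=True,
                )
                return result.stdout

            hostile = "my dir; echo pwned > /dev/null"

            # Approach 1: quote the value before splicing it into a command string.
            print(shlex.quote(hostile))

            # Approach 2: keep code and data separate, like awk -v or SQL parameters.
            print(echo_via_shell_arg(hostile), end="")
            ```

            In the second form the shell script is a constant, so there is nothing to escape in the first place.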

            Also, many tools don’t support it. Awk does support it:

            No:

            awk "{ \$1 = $myval }" file
            

            Yes:

            # doesn't confuse awk language and shell language
            awk -v myval="$myval" '{ $1 = myval }' file
            

            But sed does not support anything like this, which is a bug in sed! It’s a language meant to be used from shell, but there’s no good way of “parameterizing” it by a variable.


            2. The shell language itself should be a basis for DSLs, like what you see in Dockerfiles and Groovy.

            Instead of making languages that are awkwardly bolted on to shell, shell can be a flexible enough language to make languages in.

            This is exactly what Sketches of YSH Features is about.

            We’ve made very good progress on this – Aidan Olsen recently implemented the ctx builtin, which is a low-level primitive for such DSLs.

            We also need to add the notion of “bindings” to our structured eval primitive.


            3. It should be easy to write efficient HTML escaping, URL escaping, shell escaping, etc. in shell.

            If you make a language centered around strings, you must make it easy to write escaping functions, and it must be easy to use them.

            People keep writing this basic bug over and over again:

            https://lobste.rs/s/jcm2am/reminiscing_cgi_scripts#c_ukq5pe

            The plan is for Oils to do this with

            echo "<a>hello ${name|html}</a>"
            

            where html is a function, possibly in a namespace. This is backward compatible with shell.
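            For comparison, this is roughly what the same idea looks like in Python today with the standard `html.escape`. The `render_greeting` helper is hypothetical, and this says nothing about how Oils will implement `${name|html}`.

            ```python
            import html

            def render_greeting(name: str) -> str:
                """Escape the untrusted value at the point where it enters HTML."""
                return "<a>hello %s</a>" % html.escape(name)

            print(render_greeting('Tom & "Jerry" <b>'))
            ```

            The point is that the escaping function sits right next to the interpolation, so it is hard to forget.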

            Let me know if you want to help!

            If we can do those 3 things, we can get out of the mess with shell.

            https://github.com/oilshell/oil/wiki/Contributing

            https://github.com/oilshell/oil/wiki/Where-Contributors-Have-Problems

          2. 8

            Article summarises one of my biggest issues with how SQL is used in the wild. Languages within languages within languages often leads to footguns aimed at footguns aimed at footguns.

            1. 3

              Don’t you ever miss the days of PHP inside of JavaScript inside of HTML?

              1. 1

                thankfully this was before my time 😁

                1. 2

                  Oh, it’s still out there now. 😉

                  1. 1

                    yeah isn’t it like 40% of websites using wordpress? power to them tbh, if it works it works! I started working frontend around the time React got popular so that’s pretty much all I know!

            2. 7

              John Ousterhout: You think that’s bad? That’s nothing; hold my beer and watch this. *invents Tcl*

              1. 6

                Yes, absolutely.

                I think Cocoa did a good job by distinguishing “bag of bytes”, which is NSData, and “human readable string”, which is NSString. So “file contents” aka “the stuff that everything else serializes into” is NSData, not NSString.

                Alas, they really dropped the ball on identifiers, pressing NSString into that role. Objective-S really went all the way on identifiers with polymorphic identifiers [pdf], making a form of URIs part of the language. (The match isn’t perfect, but the benefits far outweigh the niggles, IMNSHO).

                The Wyvern people found that 81% of strings in Java constructors are identifiers, 4% file system paths and 2% URIs. With some of the 10% “other” category being things like IP addresses, it seems safe to say that polymorphic identifiers would capture the vast majority of this (mis-)use of strings.

                So we have these three actual data types that we currently map onto string:

                • Data, bag of bytes, external representation
                • Human readable text (that one is “string”)
                • Polymorphic identifiers

                1. 2

                  Are there languages/ecosystems where it is particularly easy or common to make special-typed strings to control where and how they are used? Right now I can only think of MarkupSafe as an example in the Python world, which creates a string-type that HTML-escapes its content if used with other strings, unless you explicitly use a special interface to get the contents unescaped, but that could be taken further…
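                  To show the flavor of what I mean, here is a toy sketch in plain Python. It is not MarkupSafe's actual code; the `SafeHtml` class and `escape` function are invented names that only imitate the behavior described above.

                  ```python
                  import html

                  class SafeHtml(str):
                      """A str subclass marking content as already HTML-escaped (toy model)."""

                      def __add__(self, other):
                          # str.__add__ avoids re-entering our own __radd__.
                          return SafeHtml(str.__add__(self, escape(other)))

                      def __radd__(self, other):
                          return SafeHtml(str.__add__(escape(other), self))

                  def escape(value):
                      """Escape plain strings; pass SafeHtml through untouched."""
                      if isinstance(value, SafeHtml):
                          return value
                      return SafeHtml(html.escape(str(value)))

                  page = SafeHtml("<b>") + "<script>" + SafeHtml("</b>")
                  print(page)
                  ```

                  Anything that is not already tagged gets escaped on contact, and `str(page)` is the explicit way back to an ordinary string.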

                  1. 4

                    The shibboleth you want is “tagged string”. They now exist in one form or another in multiple languages. They sometimes use a different name.

                    1. 4

                      It’s TypeScript. It’s always TypeScript.

                      type EmailUser = string extends `/^[\\x00-\\x7F&&[^ @]]+$/` ? string : never;
                      
                      type Domain = string extends `/^[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$/` ? string : never;
                      
                      type Email = `${EmailUser}@${Domain}`;
                      
                      1. 2

                        That’s very cool. I think your regex is wrong unless you’re punycoding domains, though.

                        The EmailUser part might also be wrong. I’ve never seen nested character groups like that and idk how exactly it all works out.

                        At some point I wrote this regex and tests that I believe would have fewer false negatives than your regex, tho some more false positives:

                        export const isValidEmail = (email: string) => {
                          // This overly complicated RegExp was written by Colin Caine as a bit of a joke.
                          // It validates that a given email address is not obviously wrong while permitting
                          // non-ascii mailbox addresses and domains (most regex for this task don't do this,
                          // including the one in the HTML Spec).
                          //
                          // Valid emails using very rarely used features of email, like quoted mailbox
                          // addresses or comments will not be validated and that is intentional.
                          const atext_ascii = "[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]"
                          const nonascii = "[^\u0000-\u009f]" // Excludes ascii and C1 control codes
                          const atext = `(?:${atext_ascii}|${nonascii})`
                        
                          // IDNA domains would be impossible to validate properly with regex.
                          // This is designed to exclude some invalid ASCII domains but does not attempt
                          // to validate non-ascii characters in domains
                          const let_dig = `(?:[a-zA-Z0-9]|${nonascii})`
                          const ldh_str = `(?:[a-zA-Z0-9-]|${nonascii})`
                          const label = `${let_dig}(?:${ldh_str}{0,61}${let_dig})?`
                          const domain = `${label}(?:\\.${label})*`
                        
                          const email_re = RegExp(`^${atext}+@${domain}$`, 'u')
                        
                          return email_re.test(email)
                        }
                        
                        describe('isValidEmail', () => {
                          test.each([
                            "hello@example.com",
                            "first.last@example.com",
                            "用户@例子.广告",
                            "अजय@डाटा.भारत",
                            "квіточка@пошта.укр",
                            "χρήστης@παράδειγμα.ελ",
                            "Dörte@Sörensen.example.com",
                            "коля@пример.рф",
                          ])('valid email:   %s', email => {
                            expect(isValidEmail(email)).toBe(true)
                          })
                        
                          test.each([
                            "hello@-example.com",
                            "hi@there@example.com",
                            "用 户@例子.广告",
                            "अ@जय@डाटा.भारत",
                            "квіточка@?",
                          ])('invalid email: %s', email => {
                            expect(isValidEmail(email)).toBe(false)
                          })
                        })
                        
                      2. 3

                        In Haskell you write a newtype around a string type (e.g. String, Text, ByteString), and then add typeclass instances (Show, IsString) as needed to make it behave more or less like a regular String.

                        A nice lightweight approach to checking preconditions on your (string-like) values is the “ghosts of departed proofs” technique, described here: https://kataskeue.com/gdp.pdf
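                        For readers who don't know Haskell, a rough Python analogue of the newtype-plus-smart-constructor idea might look like this. It is only a sketch: the `Email` class, its deliberately loose check, and `send_welcome` are all made up, and it has none of the static guarantees of the GDP technique.

                        ```python
                        from dataclasses import dataclass

                        @dataclass(frozen=True)
                        class Email:
                            """Wrapper that can only be built from a string passing the check."""
                            value: str

                            def __post_init__(self):
                                local, sep, domain = self.value.rpartition("@")
                                # Deliberately loose: something@something, no spaces.
                                if not (sep and local and domain) or " " in self.value:
                                    raise ValueError(f"not an email address: {self.value!r}")

                        def send_welcome(to: Email) -> str:
                            """Takes Email, not str, so callers must validate first."""
                            return f"To: {to.value}"

                        print(send_welcome(Email("hello@example.com")))
                        ```

                        The precondition is checked once at construction, and everything downstream can rely on the type instead of re-validating a raw string.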
