1. 6

¯\_(ツ)_/¯

  1.  

  2. 29

    I wish people would stop spreading this rubbish. Neither RFC 822 nor 2822 nor 5322 specifies the syntax of an email address. Nor, for that matter, do any of the SMTP or even IMAP specs. What all of these do give is quotation mechanisms that allow arbitrary email addresses to be encoded in them, because all of them have to be capable of processing such email addresses.

    An email address is a series of arbitrary UTF-8 characters* (the local or box part), followed by an @ character, followed by either a valid domain name or an IP address in square brackets (the domain part). Since “arbitrary UTF-8 characters” includes things like spaces, angle brackets, colons, and even other @ signs (the domain part is what comes after the last @ character in an email address, not the first) which have various special meanings in Internet mail format and in the SMTP protocol, the specifications for these provide quotation mechanisms to allow those characters to be included in the text content of an email address and be correctly parsed.

    Thinking that the quotation format of RFC 822 et al specifies what a valid email address is, and asking users to input their email addresses pre-quoted into it, is like thinking that the syntax of HTML entities specifies what a valid Unicode string is and asking users to provide all string input in forms with e.g. &amp; replacing &.

    In practice, many tools get email addresses wrong. Most of the special characters that are theoretically allowed in email addresses will be corrupted by MTAs, be incorrectly parsed by IMAP servers or mail clients, or both. The best practice for email validation (that’s future-proof against the move to Unicode email addresses but otherwise unlikely to be corrupted en route) is probably to split at the last @, then check that the local part matches something like (Unicode) ^[^\p{Z}\p{C}]+$, ensuring there are no spaces or control characters in it, and the domain part is a valid domain name (including the possibility that it’s an IDN).

    * UTF-8 since the introduction of SMTPUTF8 and the message/global MIME type; before that US-ASCII.
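
    A minimal sketch of the check described above, not a definitive validator. Python's built-in `re` module has no `\p{Z}` or `\p{C}` classes, so the local-part test uses `unicodedata` categories instead. The function names, the label pattern, and the 253-character length cap are my own assumptions, and the standard library's `idna` codec implements the older IDNA 2003 rules.

```python
import ipaddress
import re
import unicodedata

# Assumption: each (IDNA-encoded) label is 1-63 letters, digits, or hyphens,
# and does not start or end with a hyphen.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def local_part_ok(local):
    # Equivalent of ^[^\p{Z}\p{C}]+$: non-empty, no separator (Z*) or
    # control/format/unassigned (C*) characters.
    return bool(local) and all(
        unicodedata.category(ch)[0] not in "ZC" for ch in local
    )

def domain_ok(domain):
    # IP address in square brackets, with an optional "IPv6:" tag.
    if domain.startswith("[") and domain.endswith("]"):
        literal = domain[1:-1]
        if literal.lower().startswith("ipv6:"):
            literal = literal[5:]
        try:
            ipaddress.ip_address(literal)
            return True
        except ValueError:
            return False
    # Otherwise treat it as a domain name, possibly an IDN.
    try:
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return False
    labels = ascii_domain.split(".")
    return len(ascii_domain) <= 253 and all(_LABEL.fullmatch(l) for l in labels)

def is_plausible_address(address):
    # Split at the LAST "@": everything before it is the local part.
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False
    return local_part_ok(local) and domain_ok(domain)
```

    Note that this takes the raw, unquoted address as input, so `John Smith@asdf.org` is rejected for its space while `a@b@asdf.org` passes.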

    1. 2

      This website is garbage, but if you want to validate or recognize email addresses, it is possible to do so without burying your head in the sand and claiming it’s impossible. Basically, if you run software that interacts with a variety of email services and other databases that have email addresses, you’ll grow a regex that works. For example, you might need to accept "John Smith"@asdf.org or john.smith(comment)@asdf.org, and recognize that the comment is a comment, or possibly even that they are “the same” email address. (These are real-world examples that I’ve seen, names changed.) Meanwhile, you also might need to reject other strings.

      1. 2

        The email addresses you gave are in RFC 822/2822/5322 format with quotation. The actual email addresses are

        John Smith@asdf.org
        john.smith@asdf.org
        

        You could also have written John (Jonathan) Smith@asdf(ghjkl).org or \J\o\h\n\ \S\m\i\t\h@"asdf".org for the first. It’s still the same email address and both are different from the second.

        1. 1

          Obviously they’re different, otherwise there’d be no reason to point out that they needed to be treated “the same.”

          I don’t know what an “actual” email address is. The point is that email addresses with spaces do exist in the real world, in email servers and other systems. And if you need to make a system that accepts valid and rejects invalid email addresses, or even, gasp, RFC-quoted forms of them, because that’s what everything stores and uses, it’s possible to do a really good job of it. And ^[^\p{Z}\p{C}]+$ won’t get the job done.

          1. 1

            You shouldn’t be storing RFC quoted forms of email addresses. You should quote at the last step — actually inserting them into the relevant address list field, as with anything (that’s how we protect against XSS attacks in HTML, for instance).

            What would constitute doing a ‘really good job of it’ in your books?

            1. 1

              Look, email systems store and transfer email addresses in all sorts of RFC and non-RFC formats, sometimes in-band with other junk. (Thanks, Lotus.) You are the one that has to deal with it and decide whether a piece of data is trying to be a real email address or if it’s garbage.

              A ‘really good job’ means, say, even if every mistake results in a ticket being filed and a costly unit of customer service, you can process a few million unique fields from multiple systems purporting to be or contain an email address, with one “bug” revealed. (And plenty of “wrong but valid” fields, if they’re human edited.)

              If you revise your address extractor/validator based on real data, the time period between each bug keeps increasing, and in my experience it increases very non-linearly. On the other hand, if you build a validator from first principles without feedback from reality, it’ll have a higher error rate.

              1. 2

                Okay, I’m increasingly convinced that we’re talking at cross purposes here. I’m not talking about extracting email addresses from any ‘in-band’ data in documents etc.

                What I mean is that if you have a form field like

                <input type=text name=email-address>
                

                and a database column it goes directly into, you should not be asking users to enter a quoted form of their email address if it contains spaces or other special characters. You should be collecting it in a raw, unquoted format. Then you can do the validation I suggested at the start of this thread.

                That applies equally if you’re writing a mail client, etc. Unless you’ve deliberately designed it so that users have to enter raw header data themselves (a bad idea imho) the email address as input should be in unquoted format, then your email writing code should apply the quoting itself.

                This is not least because SMTP has a different special character system from RFC 822. You have to be able to speak both to be able to send messages.
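
                A minimal sketch of quoting at the last step, using the standard library's `email.headerregistry.Address` to produce the RFC 5322 header form from a raw address. The helper name `header_form` is my own, and this covers only the message-header side; the SMTP envelope needs its own encoding.

```python
from email.headerregistry import Address

def header_form(raw_address):
    """Quote a raw, unquoted address for use in an RFC 5322 header field."""
    # Split at the last "@", as with validation.
    local, _, domain = raw_address.rpartition("@")
    # addr_spec wraps the local part in a quoted-string only when it
    # contains characters that need it (spaces, specials).
    return Address(username=local, domain=domain).addr_spec

for stored in ("John Smith@asdf.org", "john.smith@asdf.org"):
    print(header_form(stored))
```

                The database keeps `John Smith@asdf.org`; only the code that writes the header ever sees the quoted form.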

    2. 12

      I need a wrong downvote option. I don’t even have to come up with common valid email addresses that fail, just note that the different examples implement wildly different matching. At least I have the satisfaction of knowing the author wasted several dollars registering this domain.

      1. 7

        Just send your users an email when they sign up and make them click a link in that email to confirm their account.
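
        A minimal sketch of that confirmation flow, assuming an in-memory dict as the pending-signup store and a hypothetical example.com confirmation URL. Sending the mail itself is left out.

```python
import secrets

# Hypothetical in-memory store: token -> unconfirmed address.
pending = {}

def start_signup(address):
    """Create a one-time token and return the link to email to the user."""
    token = secrets.token_urlsafe(32)
    pending[token] = address
    return f"https://example.com/confirm?token={token}"

def confirm(token):
    """Return the confirmed address, or None if the token is unknown or reused."""
    return pending.pop(token, None)
```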

        1. 6

          The thing that shocks me is that people use regular expressions for this despite the spec itself giving a hint at a better tool: BNF with parser generation. The articles all assume that the whole field of parsing and pattern matching stopped with regexes. The number of complex jobs people do with them is crazy, especially given that many other methods have open-source tooling available. I think there was one person I saw in search results who implemented this problem with BNF in Haskell’s parsing system. The article is gone and not in Wayback, but the comments were more positive than I see on the regex threads.

          I found a nice illustration of using the right tool for the job. Well, of the complexity you can avoid by choosing higher-level tools for parsing jobs. Here’s a BNF parser in Prolog versus the same one in C:

          https://muaddibspace.blogspot.com/2008/03/executable-bnf-parser-in-prolog.html

          http://cvs.savannah.gnu.org/viewvc/bnf/bnf/src/grio.c?view=markup

          1. 5

            God damn it. Are there no [parody] or [humor] or [bad idea] tags?

            1. 3

              There is a satire tag.

            2. 4

              Using this in production seems like bad advice.

              Just look at the Ruby regex and imagine blindly pasting that into your production code. How could you possibly even begin to read or debug that?

              What problem does this really address? Are invalid email addresses getting sent off to outbound email clients a necessary thing to defend against? Because obviously a simple regex isn’t sufficient. Why bother with such complexity?

              1. 3

                Worse still, there’s a vector for a denial-of-service attack there. Combining user-supplied content with a backtracking-based regular expression library is dangerous.

              2. 3

                I mean it’s sort of useful so you won’t find yourself wasting a stamp sending email to possibly invalid addresses.

                1. 2

                  Just what the world needs - another bit of software that only works most of the time.

                  1. 2

                    Opening graphic example fails+tags@gmail.com