1. 18
    1. 9

      The main problem with these libraries is that they bend over backwards attempting to parse garbage. They are not user agents, but generic libraries intended for a variety of use cases. At least in user agents that deal with user input directly, being forgiving might make some sense, but in server-side software there are more risks involved. It would be much better if these libraries simply rejected these URIs as invalid, or, if they must error-correct, made that an option which defaults to false.
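
      To make the last point concrete, here is a tiny sketch of the behaviour I mean; parse_url and its lenient option are hypothetical, not taken from any existing library:

      -- Hypothetical, strict-by-default front end: error correction is opt-in.
      -- The only check shown is for characters RFC 3986 never allows unencoded.
      local function parse_url(s, opts)
        opts = opts or {}
        if not opts.lenient and s:find('[%s<>"{}|\\^`]') then
          return nil, "invalid character in URI"
        end
        -- real parsing would go here; lenient mode would strip/escape first
        return { raw = s }
      end

      print(parse_url("http://exa mple.com/"))                     --> nil   invalid character in URI
      print(parse_url("http://exa mple.com/", { lenient = true })) --> table: 0x...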

      The other problem is that (as I have argued in the past) the newer WHATWG spec does not even have a formal grammar, making it very hard for different pieces of software to agree on how to parse a given URL. One reason for this seems to be that the WHATWG spec seems to be focused purely on use in browsers.

      1. 9

        I agree. This article presents a list of parse choices for each strange URL, but neglects to mention the actually correct option: rejecting the URL as invalid. The RFC exists for a reason and some of these URLs simply aren’t compliant.

        The other problem is that (as I have argued in the past) the newer WHATWG spec does not even have a formal grammar, making it very hard for different pieces of software to agree on how to parse a given URL. One reason for this seems to be that the WHATWG spec seems to be focused purely on use in browsers.

        URIs are formally specified by RFC 3986. I’m no fan of what WHATWG represents, as I’ve written about before. When WHATWG performed its standards coup d’état against the W3C, the most obvious point of distinction seemed to be seeing the web as a platform for applications rather than semantic documents. However, another significant difference seems to have been that WHATWG took the view that the spec should document what browsers actually do, as opposed to the W3C’s spec, which documented what browsers ought to do (but which they often didn’t).

        This pragmatic view isn’t necessarily a bad thing. However, the result of this view appears to be that WHATWG’s “standards” are basically just a web browser written in natural language. Essentially a standards-washing translation of “what Chrome does” to English. There doesn’t seem to be any effort at all to describe the “idealistic” vision of a standard as distinct from the implementation, even if just in addition to that natural-language code.

        1. 2

          This pragmatic view isn’t necessarily a bad thing.

          I agree - what matters is what works. But as this article (and others like it) indicates, clearly the status quo isn’t “working” for anyone but the browser vendors.

          However, the result of this view appears to be that WHATWG’s “standards” are basically just a web browser written in natural language. Essentially a standards-washing translation of “what Chrome does” to English.

          You know, at this point I don’t even care about the fact that they’re standards-washing what Chrome is doing. I’d be content if they just updated the ABNF to “whatever Chrome accepts”, with an added algorithm to “normalize” human input into what is represented in the JS URL objects, or whatever they use to determine what is sent to the server. But it seems like even that is too much of an ask.

      2. 1

        This looks like a perfectly reasonable state machine in the specification? https://url.spec.whatwg.org/#concept-basic-url-parser

        It’s not fun to read, but you can make an FSM to lex URLs from that.

        1. 1

          It’s a state machine alright, but not what I would call a proper syntax. In the sixties we already had formal language specifications better than clumsy descriptions like this. For instance, with a BNF you can even generate valid URLs, something you cannot reasonably do with an algorithm like the one they specify. Also, parser generators based on (regular) grammars are compact and fast.

          NOTE: The original RFC had such a BNF. Instead of updating it with their changes, they chose to drop it in favor of a description like the one above because, according to them, a BNF can’t capture all the error correction and other weird shit they’re doing.
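
          For comparison, the top-level rule of RFC 3986’s ABNF, URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ], maps almost directly onto a grammar-based parser. A toy LPEG sketch, with heavily simplified character sets (nowhere near the full grammar):

          local lpeg = require "lpeg"
          local P, R, S, C = lpeg.P, lpeg.R, lpeg.S, lpeg.C

          -- simplified character sets; the real ABNF is stricter
          local scheme    = C(R("az", "AZ") * (R("az", "AZ", "09") + S("+-."))^0)
          local hier_part = C((P(1) - S("?#"))^0)
          local query     = P"?" * C((P(1) - P"#")^0)
          local fragment  = P"#" * C(P(1)^0)

          local uri = scheme * P":" * hier_part * query^-1 * fragment^-1 * -P(1)

          print(uri:match("https://example.com/a?b=1#c"))
          --> https   //example.com/a   b=1   c

          The reverse direction (generating strings from the grammar) is exactly what you lose once the “grammar” is an imperative state machine.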

    2. 5

      Nice summary.

      What seems to work better against SSRF than an allow-list in the software itself is to implement that allow-list in a web proxy and require the software to always go through it.

      P.S.: the WHATWG URL standard strips newlines in the middle of URLs because an <a href="… attribute may break at any time according to HTML (and because of how people used to write these in 80-column-wide terminals).

      1. 4

        P.S.: the WHATWG URL standard strips newlines in the middle of URLs because an <a href="… attribute may break at any time

        Similarly, IETF RFC 3986 appendix C says:

        In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may have to be added to break a long URI across lines. The whitespace should be ignored when the URI is extracted.

        For robustness, software that accepts user-typed URI should attempt to recognize and strip both delimiters and embedded whitespace.
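
        In code that pre-processing step is tiny; a plain-Lua sketch (the WHATWG parser strips ASCII tab, CR and LF anywhere in the input, which covers the line-breaking case described above):

        local function strip_url_whitespace(s)
          -- remove embedded tabs and newlines before handing the string to a parser
          return (s:gsub("[\t\r\n]", ""))
        end

        print(strip_url_whitespace("https://exam\nple.com/pa\tth"))
        --> https://example.com/path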

        1. 2

          I had no idea. I thought this was an HTML specialty. Huh. Thanks!

      2. 3

        What seems to work better against SSRF than an allow-list in the software itself is to implement that allow-list in a web proxy and require the software to always go through it.

        At my last company the security team built this out and we transitioned to it. It’s a clever way of preventing SSRF. Does this idea have a name?

        1. 1

          What seems to work better against SSRF than an allow-list in the software itself is to implement that allow-list in a web proxy and require the software to always go through it.

          It’s just outbound filtering / egress firewalling. It’s awesome, and it’s crazy how often it doesn’t get deployed - it absolutely fucks attackers up.
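
          Concretely it means blocking direct outbound connections at the network level and pointing the application’s HTTP client at the egress proxy, which is where the allow-list lives. A minimal sketch assuming LuaSocket; the proxy host/port and target URL are made up:

          local http  = require "socket.http"
          local ltn12 = require "ltn12"

          local body = {}
          local ok, code = http.request{
            url   = "http://api.example.com/v1/resource",  -- made-up target
            proxy = "http://egress-proxy.internal:3128",   -- allow-list enforced here
            sink  = ltn12.sink.table(body),
          }
          print(ok, code, table.concat(body))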

    3. 2

      We ran into an inconsistency wrt Outlook’s parsing of mailto: URLs.

      We had a long-standing bug in the link-generation code that appended an extra ampersand (e.g. ?&subject=Foo), and when Outlook was used as the mail handler, it would launch okay but come up blank. No issues with webmail clients, of course (we didn’t try any other email clients), so it went undiscovered for years.

    4. 1

      Why not just parse the URLs into a data structure first, then pass that to the libraries? For instance, you could have a protocol field, a username field, a host field, a port field, etc.
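
      Something like this, as a sketch (the field names are illustrative, not taken from any particular library):

      -- hand the libraries a pre-parsed structure instead of a raw string
      local parsed = {
        scheme   = "https",
        userinfo = "alice",
        host     = "example.com",
        port     = 8443,
        path     = "/a/b",
        query    = "x=1&y=2",
        fragment = "top",
      }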

      1. 2

        I have a parser that does just that. The original version even broke down the query portion into name/value pairs, but as I realized later, not all queries have name/value pairs. Another issue: in one program that was too much work, and I had to undo the parsing of the query back into its encoded form to pass into another program. Also, some URLs are not a good fit for generic parsing, like gopher:, sip: and tel:, which require further processing anyway.

        1. 1

          This name/value query parsing is only for URLs that use the application/x-www-form-urlencoded encoding, i.e. “web” URLs (which include at least the http, https, ws and wss schemes, and possibly file). This is also quite messy indeed. It would’ve been better if they hadn’t tried to shoehorn completely unrelated identifiers into a single format. Now it’s up to the consumer of the URL (i.e. the server-side handler) to decide how to decode the URL’s component parts.
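
          For what it’s worth, the name/value convention itself takes only a few lines of plain Lua to sketch (this handles only the usual &-separated, percent-encoded, plus-for-space form):

          local function parse_query(q)
            local function decode(s)
              return (s:gsub("%+", " "):gsub("%%(%x%x)", function(h)
                return string.char(tonumber(h, 16))
              end))
            end
            local params = {}
            for pair in q:gmatch("[^&]+") do
              local name, value = pair:match("([^=]*)=?(.*)")
              params[#params + 1] = { name = decode(name), value = decode(value) }
            end
            return params
          end

          -- parse_query("a=1&b=hello+world&flag") -->
          --   { {name="a", value="1"}, {name="b", value="hello world"}, {name="flag", value=""} }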

          1. 1

            It gets worse. The gopher: URL doesn’t even use the query portion for gopher searches; the search string goes after the selector, separated by an encoded tab (%09). It’s insane.

        2. 1

          Hm. Seems like a modular/pluggable parsing system would be in order. I think that might let this approach generalize. Although, maybe I’m just defending my idea to the grave now ;)

          1. 1

            I’ve done that to some degree using LPEG. This bit will construct a URL parser that deals with five particular URL types, before falling back to the generic URL parser:

            local url = require "org.conman.parsers.url.data"
                      + require "org.conman.parsers.url.gopher"
                      + require "org.conman.parsers.url.siptel" -- handle sip: and tel:
                      + require "org.conman.parsers.url.tag"
                      + require "org.conman.parsers.url"
            
            local result = url:match(SOMEURL)
            

            This is one of the reasons I like using LPEG.

      2. 2

        This is usually a problem when the two parsers are separate processes, for instance nginx and a Python application; they frequently aren’t even on the same machine.

      3. 1

        While any problem can be solved by adding a layer of abstraction, it seems this would just kick the can down the road — something has to parse the original URL and produce the data structure. (Plenty of frameworks do this, btw, for instance Cocoa’s NSURLComponents.)

        1. 1

          Yes, the point being that this would cut down on inconsistencies in parsing.

    5. 1

      I’ve written a URL parser that I thought was fairly complete, but on rereading the RFC, I realized I’d never known that the address (authority) portion can be directly followed by a query or fragment without a “/” separator (e.g. http://example.com?q=1).

      (However, the library my parser’s used in requires at least one path component, because reasons, so this type of URL wouldn’t work anyway.)