1. 9
  1.  

  2. 6

    Complicated regexes are opaque, unmaintainable, and often wrong. The correct approach to validating a URL is as follows:

    from urllib.parse import urlparse

    First non-docstring line in urlparse.py:

    import re
    

    And OK, it doesn’t go as bananas on regexes as the example. But it does contain the following comment on the transparency, maintainability, and correctness of the implementation:

    # XXX The stuff below is bogus in various ways...
    

    It’s probably the problem that’s hard, rather than the choice of solution.
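
    For reference, a minimal sketch of what "validating" with urlparse might look like. The helper name `looks_like_http_url` and its definition of valid (an http or https scheme plus a host) are assumptions for illustration, not anything from the OP; urlparse itself only splits the string and does very little checking.

```python
from urllib.parse import urlparse


def looks_like_http_url(candidate: str) -> bool:
    """Hypothetical helper: accept only http(s) URLs that have a host."""
    try:
        parts = urlparse(candidate)
        host = parts.hostname
    except ValueError:
        # urlparse raises ValueError for some malformed inputs,
        # such as an unterminated IPv6 literal.
        return False
    return parts.scheme in ("http", "https") and bool(host)
```

    Note how much of the decision (which schemes count, whether a host is required) lives outside urlparse, which rather supports the point that the problem is the hard part.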

    1. 3

      Eh. The OP googled for “URL regex,” not “how to validate a URL.” The former is a solution to a problem, and the latter is a problem that isn’t well solved by said solution. The OP even goes on to say (emphasis mine):

      A regex is useful for validating simple patterns and for finding patterns in text.

      So… For example, a URL regex like the one the OP presented seems like a fine first approximation to the problem of “find URLs in text.” It won’t be perfect, and whether that’s OK or not (unsurprisingly) depends on the problem you’re trying to solve.
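
      As a rough illustration of the "find URLs in text" use case, here is a deliberately simple sketch. The pattern and the `find_urls` name are my own assumptions, far cruder than the regex in the OP, and the trailing-punctuation trim will wrongly clip URLs that really do end in ")".

```python
import re

# Assumed first-approximation pattern: http(s) scheme, then anything that is
# not whitespace, an angle bracket, or a quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def find_urls(text: str) -> list[str]:
    """Return URL-like substrings, minus sentence punctuation at the end."""
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]
```

      Whether that level of sloppiness is acceptable depends, again, on the problem being solved.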

      Of course, the Stack Overflow post that I think the OP is referring to is indeed about validating URLs, so…

      1. 2

        One place that uses URL regexes reasonably successfully is even this here website (for finding and auto-linkifying bare URLs). :-)

      2. 2

        I guess someone has to link to the (legendary) Stack Overflow response about parsing HTML with regex: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

        1. 2

          Also

          Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

          • Jamie Zawinski
          1. 5

            I’ve always hated this quote (and most of its variations). As far as I can tell it boils down to “never use regular expressions”, which is bad advice.

            1. 3

              Indeed, I’d rather see someone use a regex where appropriate than try to re-invent a capture-group from scratch.

              EDIT: I’ve seen some absolutely miserable text-extraction code written by juniors who hadn’t ever picked up regex because it was “icky”. A quick demonstration later the whole module could be deleted and replaced with a handful of neat little regexes, making good use of capture groups.
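
              To give a flavour of the kind of replacement I mean, here is a small sketch using named capture groups. The line format, the `LINE` pattern, and `parse_line` are all made up for illustration, not the actual module.

```python
import re

# Hypothetical line format: "<YYYY-MM-DD> <LEVEL> <message>"
LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)"
)


def parse_line(line: str):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = LINE.match(line)
    return match.groupdict() if match else None
```

              One pattern plus `groupdict()` stands in for what would otherwise be a pile of manual index arithmetic and string splitting.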

              1. 1

                I ‘grew up’ with Perl, so in that day and age, it was something that did need to be said…

            2. 1
            3. 2

              I use Lua patterns when I can now, and for the most part I’m pretty happy with them. The performance of my little implementation is nowhere near as good, but I rarely reach for regexps anyway.