1. 17
  1.  

  2. 7

    I find them about equally easy to read, but I think I’m unusual that way.

    1. 5

      Personally, I’m a big fan of Raku’s grammars for tasks like these. Here’s an (incomplete) example:

      grammar LogEntry {
      	rule TOP {
      		<ipaddr> <ident> <user> <date> <request> <responseno> <responsesize>
      	}
      	token ipaddr {
      		\d+ '.' \d+ '.' \d+ '.' \d+
      	}
      	token ident { <-[\s]>+ }
      	token user { <-[\s]>+ }
      
      	token date {
      		<day> '/' <month> '/' <year> ':' <hour> ':' <minute> ':' <second> \s+ <timezone>
      	}
      
      	token day { \d+ }
      	token month { Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec }
      
      	#...
      }
      

      Not every token or rule needs to be named explicitly; for instance, you can elide day and replace the definition of date with \d+ '/' <month> '/' <year>… But I think having a richer data structure increases clarity both in definition and use; the OP also uses named capture groups to a similar end.

      1. 1

        The last \s+ should idiomatically be replaced with <.ws> (though both will work fine).

      2. 5

        Regular expressions are a great example of where Lisp/Scheme’s Code = Data approach shines. Take the Scheme Regular Expression SRFI, or Elisp’s rx. The same could be done in other languages, but writing something like

        RE urlMatcher = new RE.Sequence(      // start matching a URL
            new RE.Optional("http://"),
            new RE.OneOrMore(
                new RE.Group("domain",
                    new RE.Sequence(new RE.OneOrMore(RE.Word), ".")
                )
            )
            // etc. ...
        );
        

        is far more cumbersome, even if it would allow you to insert comments, structure the expression, and statically ensure that the syntax is valid.

        1. 4

          There are libraries that do this: https://github.com/VerbalExpressions

          # assuming the Python port (PythonVerbalExpressions):
          from verbalexpressions import VerEx

          verbal_expression = VerEx()
          tester = (verbal_expression.
                      start_of_line().
                      find('http').
                      maybe('s').
                      find('://').
                      maybe('www.').
                      anything_but(' ').
                      end_of_line())
          
          1. 3

            I still find

            (rx bol "http" (? "s") "://" (? "www.") (* (not (any " "))) eol)
            

            nicer, plus it evaluates at compile-time. But good to know that other languages are thinking about these ideas too.

            1. 6

              So the regexp this represents is this one, I think?

              ^https?://(?:www\.)?[^ ]*$
              

              I don’t know … I find the “bare” regular expression easier. The Scheme/Lisp variant essentially uses the same characters, but with more syntax (e.g. (? "s") instead of s?). Maybe the advantages are clearer with larger examples, although I find the commenting solution mentioned in this post much better, as it allows you to clearly describe what it’s matching and/or why.

              1. 3

                This example is rather simple, but in Elisp I’d still use it, because it’s easier to maintain. I can break the line wherever I want and insert real comments. Usually it’s more verbose, but in one case I even managed to write a (slightly) shorter expression using rx than with a string literal, because of escape symbols:

                (rx (* ?\\ ?\\) (or ?\\ (group "%")))
                

                vs

                "\\(?:\\\\\\\\\\)*\\(?:\\\\\\|\\(%\\)\\)"
                

                but with more syntax (e.g. (? "s") instead of s?).

                If that’s the issue, these macros usually allow the flexibility to choose more verbose keywords. The example from above would then become (combined with the previous points)

                (rx line-start
                    "http" (zero-or-one "s") "://"	;http or https
                    (zero-or-one "www.")		;don't require "www."
                    (zero-or-more (not (any " ")))	;just no spaces
                    line-end)
                

                Edit: With Emacs 27 you can even extend the macro yourself (e.g. with rx-define), to add your own operators and variables.

                1. 2

                  Right; that example shows the advantages much more clearly. It still looks kinda unnatural to me, but that’s probably just lack of familiarity (both with this method and with Scheme in general; it’s been years since I did any Scheme programming, and I never did much in the first place: my entire experience is going through The Little Schemer and writing two small programs). But I’m kinda warming to the idea.

                  One way you can do this in other languages is by adding a sort of sexpr_regex.compile() which transforms it into a normal regexp object for the language:

                  regex = sexpr_regex.compile('''
                  	(rx line-start
                  		"http" (zero-or-one "s") "://"	;http or https
                  		(zero-or-one "www.")		    ;don't require "www."
                  		(zero-or-more (not (any " ")))	;just no spaces
                  		line-end)
                  ''')
                  

                  And then regex.match(), regex.find(), what-have-you.

                  Dunno if that’s worth it …

                  1. 1

                    It would be possible, but you’d lose the fact that rx and similar macros can be expanded and checked at compile-time.

                    1. 2

                      You already don’t have that in those languages anyway, so you’re not really losing much. And you can declare it as a package-level global (which isn’t too bad if you’re consistent about it) and throw an exception if there’s an error, so you’ll get an error on startup, which is the next best thing after a compile-time check. You can also integrate it in your test suite.
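
                      A minimal sketch of that startup-check idea in Python (the module layout and pattern here are illustrative, not from the parent comments):

                      import re

                      # Package-level global: if the pattern is invalid, re.compile
                      # raises re.error as soon as this module is imported, i.e. at
                      # program startup; the next best thing after a compile-time check.
                      URL_RE = re.compile(r'^https?://(?:www\.)?[^ ]*$')

                      def is_url(line):
                          return URL_RE.match(line) is not None

                      A test suite that merely imports this module already exercises the check.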

                      1. 2

                        Well, Elisp does (byte compilation), and some Schemes do too (but SRFI 115, the regex SRFI, couldn’t make use of it in most cases anyway).

              2. 2

                I agree – the problem with the chained-builder approach is that it shoehorns the abstract idea of a regex into the syntax of a host programming language. Designed from first principles, a regex syntax is much more likely to look like the s-expression syntax.

                1. 2

                  That is really pretty. Is this Elisp?

                  1. 1

                    Yes.

                  2. 1

                    An aside about compile-time: There exists a C++ compile-time regex parser, surely one of the most terrifying examples of template metaprogramming ever. It’s also quite feasible to compile regexes at compile time in Rust or Nim thanks to their macro facilities.

                    That LISP syntax is nice, though.

                    1. 1

                      That sounds very appealing to me!

                      1. 1

                        Do you have a link to some site that describes the C++ parser?

                        1. 1

                          No, sorry, or I’d have given it. I just remember a video of a conference presentation, by a woman with a Russian name…

                          Edit: a quick search turned up this, which looks familiar: https://youtu.be/3WGsN_Hp9QY

                  3. 1

                    I program in Lua, and there I use LPEG. There’s a submodule of LPEG that allows one to use a BNF-like syntax. Here’s the one I constructed from RFC-3986.

                  4. 4

                    The example given is quite extreme. Overall, I think verbosity isn’t necessarily a bad thing (look at Vim, for example).

                    Also, if you’re into comments, just give an example of the input and output of the regular expression. That does most of the job, and if the reader knows just a bit of regex, they can figure it out, either by themselves or with tools like regexr and regex101.

                    1. 1

                      regex101

                      +1 for regex101: their test functions, and also the storage for your tests, so you can embed a link as a comment.

                      1. 1

                        That might be a bad idea if you want your code to last longer than Regex101. Probably okay if it’s purely supplemental to your actual comment, though.

                        1. 1

                          True, but it’s just additional, with an easy way to verify and create it.

                    2. 3

                      Suggestion: use \d instead of [0-9]; it can be used inside a character class as well: [\d.] instead of [0-9.].
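
                      A quick illustration in Python:

                      import re

                      # \d works on its own and inside a character class:
                      print(re.fullmatch(r'[\d.]+', '127.0.0.1') is not None)   # True
                      print(re.fullmatch(r'[0-9.]+', '127.0.0.1') is not None)  # True; \d also covers Unicode digits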

                      1. 3

                        it’s unreadable, barely usable, and unmaintainable.

                        The author should try harder to read it. I bet if they put their mind to it, they could make progress.

                        Showing an example above the regex is the ultimate documentation.

                        I prefer verbose syntax, but this doesn’t seem to be a great example of why it’s effective. (I like the explicit nature of verbose syntax, not the ability to add comments more easily.)

                        1. 1

                          Confession: I can read it, and that line was for effect. However, I would not impose that expectation on others in code that I wrote.

                        2. 3

                          I think in cases where regular expressions get as big as this you’d be better served by using parser combinators instead, which are much more composable and can return structured data instead of just match groups.

                          1. 2

                            I think the end result was much worse. But I’m not sure if I just disagree with trying to make regexes more legible with those techniques, or if the specific example is just too simple to illustrate the advantages (the original example regex was perfectly clear to me).

                            1. 2

                              I think once you have to start naming your variables and defining capture groups, you are much better off simply using parser combinators. One can define a full-fledged combinator parser library with recursion and naming in about 20 lines of code if one is not already available in your language. The nice thing is that you can incorporate regular expressions for tokens to strike a balance between readability and terseness; see the sketch below.
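
                              To make that concrete, here is a minimal, hedged sketch of the idea in Python; it is not an existing library, and every name in it is made up for illustration. A parser is a function from (text, pos) to (value, new_pos), or None on failure:

                              import re

                              def token(pattern):
                                  """Build a leaf parser from a terse regular expression."""
                                  compiled = re.compile(pattern)
                                  def parse(text, pos):
                                      m = compiled.match(text, pos)
                                      return (m.group(), m.end()) if m else None
                                  return parse

                              def seq(*parsers):
                                  """Match all parsers in order, collecting their values."""
                                  def parse(text, pos):
                                      values = []
                                      for p in parsers:
                                          result = p(text, pos)
                                          if result is None:
                                              return None
                                          value, pos = result
                                          values.append(value)
                                      return values, pos
                                  return parse

                              def alt(*parsers):
                                  """Return the result of the first parser that succeeds."""
                                  def parse(text, pos):
                                      for p in parsers:
                                          result = p(text, pos)
                                          if result is not None:
                                              return result
                                      return None
                                  return parse

                              def named(name, parser):
                                  """Tag a parser's value with a name, like a capture group."""
                                  def parse(text, pos):
                                      result = parser(text, pos)
                                      if result is None:
                                          return None
                                      value, pos = result
                                      return (name, value), pos
                                  return parse

                              # Tokens stay terse regexes; the structure stays named:
                              ipaddr = named('ip', token(r'\d+\.\d+\.\d+\.\d+'))
                              ws     = token(r'\s+')
                              user   = named('user', token(r'\S+'))
                              entry  = seq(ipaddr, ws, user)

                              print(entry('127.0.0.1 james', 0))
                              # ([('ip', '127.0.0.1'), ' ', ('user', 'james')], 15)

                              The core is roughly the size the quoted claim suggests, and structured results come out for free via named.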

                              1. 2

                                Can anyone explain the title? What’s “now you have one and a bit problems” about?

                                1. 5

                                  It’s a reference to a quote by Jamie Zawinski (source), though there were other flavours of the “now they have two problems” meme both before and after.

                                  Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

                                  (Edited to clarify that it’s a meme.)

                                  1. 2

                                    It might be nice to enhance this article with a discussion of more verbose regex wrappers, such as verbal expressions, or even conceptually different replacements, which I know exist but can’t seem to find with a quick search.

                                    I don’t have experience with any of these, but I’d love to know more about them, especially in the context of rule-based refactoring, if that is what we can call refactoring with as few manual actions as possible to prevent accidents.

                                    1. 1

                                      I found it most readable at step 2. The example in the comment before it could have been more generic; that would have helped separate the syntax elements used to guide the regex from those captured by it:

                                      # ip separator user [date] "method path version" code size
                                      #     e.g. '127.0.0.1 - james [09/May/2018:16:00:39 +0000] "GET /report HTTP/1.0" 200 123'
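
                                      For instance, a hedged sketch in Python of pairing that generic comment with named groups (the group names are illustrative, not the OP’s exact ones):

                                      import re

                                      LOG_LINE = re.compile(r"""
                                          (?P<ip>[\d.]+)\s+        # ip
                                          (?P<sep>\S+)\s+          # separator ("-" when absent)
                                          (?P<user>\S+)\s+         # user
                                          \[(?P<date>[^]]+)\]\s+   # [date]
                                          "(?P<request>[^"]+)"\s+  # "method path version"
                                          (?P<code>\d+)\s+         # code
                                          (?P<size>\d+)            # size
                                      """, re.VERBOSE)

                                      m = LOG_LINE.match('127.0.0.1 - james [09/May/2018:16:00:39 +0000] "GET /report HTTP/1.0" 200 123')
                                      print(m.group('user'), m.group('code'))  # james 200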