1. 34
  1. 26

    Personally, I find having alternative syntaxes more disruptive than the weirdness of regex to begin with. Sure, maybe your new syntax is technically superior in every regard, but it’s just more non-standard info I have to jam into my head.

    1. 10

      Let’s go ahead and add emphasis to that “maybe”… from one of the linked example projects:

      const SuperExpressive = require('super-expressive');
      
      const myRegex = SuperExpressive()
        .startOfInput
        .optional.string('0x')
        .capture
          .exactly(4).anyOf
            .range('A', 'F')
            .range('a', 'f')
            .range('0', '9')
          .end()
        .end()
        .endOfInput
        .toRegex();
      
      // Produces the following regular expression:
      /^(?:0x)?([A-Fa-f0-9]{4})$/
      

      It’s fun to pretend that somehow “SuperExpressive” was the verbose established king… the Java of pattern-matching… and this snippet is from the README of an upstart competitor… a radical idea called “regex” – all you have to do is change the comment “Produces the following regular expression” to “With our approach, the above becomes the single line”, and that is the whole sales pitch.

      1. 1

        Fun to imagine, sure, but that’s not the reality. The reality is that hundreds of thousands of people have all learned regex, and while it may not be pretty, it works and isn’t actually that difficult to learn if you take the time, so GP’s point still stands.

        1. 2

          I think you misread my comment. I was agreeing with GP, and making the same point you are.

        2. 1

          I think part of it is that anything complicated enough that you’d really appreciate the readability and composability of a ‘builder’-type system… is probably something you shouldn’t be using regexes for in the first place.

          1. 1

            I think in such cases the sheer length of the builder version would still make it worse. Especially in a language like Ruby that supports splitting regexes over multiple lines with inline comments:

            regexp = %r{
              start         # some text
              \s            # white space char
              (group)       # first group
              (?:alt1|alt2) # some alternation
              end
            }x
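
            Python’s re.VERBOSE flag gives you much the same thing, for what it’s worth. A quick sketch of the equivalent (whitespace is ignored and # starts a comment inside the pattern):

            import re
            
            # a rough Python equivalent of the Ruby %r{ ... }x literal above
            regexp = re.compile(r"""
              start           # some text
              \s              # white space char
              (group)         # first group
              (?:alt1|alt2)   # some alternation
              end
            """, re.VERBOSE)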
            

            Fundamentally, the assumption underlying the builder approach – that verbosity is always an aid to understanding – is simply not true (see, e.g., Objective-C variable naming as another classic counterexample where verbosity harms readability).

            1. 1

              But also, this syntax doesn’t handle interpolating strings from elsewhere nicely:

              foo = "?*!"
              re = %r{#{foo}?bar}
              # throws RegexpError (target of repeat operator is not specified: /?*!?bar/)
              

              You have to remember to Regexp.quote it.

              And while in Ruby it looks like you can interpolate a regex into another regex because it has built-in regex literals, in Python this happens:

              >>> foo = re.compile("foo?")
              >>> re.compile(f"{foo}?bar")
              re.compile("re.compile('foo?')?bar") # definitely not right!
              >>> re.compile(f"{foo.pattern}?bar")
              re.compile('foo??bar') # also not right
              >>> re.compile(f"({foo.pattern})?bar")
              re.compile('(foo?)?bar') # there we go
              

              And it has the same ‘you have to remember to escape a string when you interpolate it’ problem.
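
              For completeness, one way to spell the fix in Python is to remember both to re.escape the interpolated literal and to group it before applying a quantifier. A small sketch:

              import re
              
              foo = "?*!"                                # a literal string, not pattern syntax
              
              # re.compile(f"{foo}?bar") would raise re.error ("nothing to repeat"),
              # just like the Ruby example above.
              
              # Escape the literal *and* group it before the quantifier:
              pattern = re.compile(f"(?:{re.escape(foo)})?bar")
              print(pattern.pattern)                     # something like (?:\?\*!)?bar
              print(bool(pattern.fullmatch("?*!bar")))   # True
              print(bool(pattern.fullmatch("bar")))      # True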

      2. 4

        I just started this wiki page, feel free to add projects you know about! :)

        1. 2

          There’s https://edicl.github.io/cl-ppcre/

          As the name indicates, it’s a Common Lisp library for Perl-Compatible Regular Expressions (the extra “P” is for “Portable”, because it is written in portable Common Lisp, i.e. no FFI).

          In addition to specifying regular expressions as strings like in Perl, you can also use S-expressions.

          The library provides functions to convert between the string and S-expression formats, which means you can manipulate the regexe(n|s) pretty much however you want.

          1. 2

            Thanks for doing this.

            Would you consider Lua’s pattern-matching syntax the kind of thing you’d like to add? It’s regex-adjacent, though not a flavor of regular expression.

            1. 2

              Definitely, that counts … I believe it’s more of a subset for simplicity than an alternative syntax for POSIX- or Perl-style regexes, so that is worth mentioning on the page.

              1. 2

                Sounds good. I added a link and comment.

            2. 1

              You might be interested in Gema.

              It can be used to do the sorts of things that are done by Unix utilities such as cpp, grep, sed, awk, or strings. It can be used as a macro processor, but it is much more general than cpp or m4 because it does not impose any particular syntax for what a macro call looks like. Unlike utilities like sed or awk, gema can deal with patterns that span multiple lines and with nested constructs. It is also distinguished by being able to use multiple sets of rules in different contexts.

              I’ve only dabbled in it myself, but have seen enough impressive Gema golfs to be convinced of its power.

              1. 1

                Isn’t this just a PEG library for Racket?

                1. 2

                  yes i wrote it

                  1. 2

                    Oops, my apologies. Seems I misunderstood both the posted link and your reply.

              2. 4

                If you are willing to step outside the canonical regular expression syntax, what stops you from going full context-free (or PEG)? Such parsers will give you the ability to name nonterminal symbols, allow recursion, as well as provide you with a parse tree.

                Context-free grammars are no harder to define than regular expressions. For example, here is a simple grammar for mathematical expressions.

                a_grammar = {
                    '<start>': [['<expr>']],
                    '<expr>': [
                        ['<expr>', '+', '<expr>'],
                        ['<expr>', '-', '<expr>'],
                        ['<expr>', '*', '<expr>'],
                        ['<expr>', '/', '<expr>'],
                        ['(', '<expr>', ')'],
                        ['<integer>']],
                    '<integer>': [
                        ['<digits>']],
                    '<digits>': [
                        ['<digit>','<digits>'],
                        ['<digit>']],
                    '<digit>': [[str(i)] for i in range(10)],
                }
                

                and you can use it with off-the-shelf parsers such as an Earley parser.
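
                To make that concrete, here is a rough, unoptimized sketch of an Earley-style recognizer that works directly on the dict format above (the name earley_recognize and the structure are just illustrative; it only answers yes/no, assumes single-character terminals, and a real off-the-shelf parser would also build the parse tree):

                def earley_recognize(grammar, start, text):
                    # An Earley item: (lhs, rhs, dot, origin).
                    # chart[i] holds the items valid after reading text[:i].
                    chart = [set() for _ in range(len(text) + 1)]
                    chart[0] = {(start, tuple(rhs), 0, 0) for rhs in grammar[start]}
                    for i in range(len(text) + 1):
                        while True:  # run predict/complete to a fixpoint on chart[i]
                            new = set()
                            for (lhs, rhs, dot, origin) in chart[i]:
                                if dot < len(rhs) and rhs[dot] in grammar:
                                    # predict: expand the nonterminal after the dot
                                    new |= {(rhs[dot], tuple(p), 0, i) for p in grammar[rhs[dot]]}
                                elif dot < len(rhs):
                                    # scan: next symbol is a terminal (a single character here)
                                    if i < len(text) and text[i] == rhs[dot]:
                                        chart[i + 1].add((lhs, rhs, dot + 1, origin))
                                else:
                                    # complete: advance the items that were waiting for lhs
                                    new |= {(l, r, d + 1, o) for (l, r, d, o) in chart[origin]
                                            if d < len(r) and r[d] == lhs}
                            if new <= chart[i]:
                                break
                            chart[i] |= new
                    return any(lhs == start and dot == len(rhs) and origin == 0
                               for (lhs, rhs, dot, origin) in chart[-1])
                
                print(earley_recognize(a_grammar, '<start>', "(1+2)*34"))  # True
                print(earley_recognize(a_grammar, '<start>', "1++2"))      # False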

                1. 1

                  Right, the metalanguage for regular languages and context-free languages is exactly the same syntactically – productions with alternation, repetition, etc.

                  The only difference is whether productions are allowed to be recursive, e.g.

                  x -> '(' x ')'
                  

                  is context free but not regular.

                  So yeah, I thought about making Eggex into a syntax for CFGs. However, a big problem is that CFGs are hard to recognize in general – there are many different subsets of CFG corresponding to different parsing algorithms. Practically speaking I have no use for most of them :)

                  1. 1

                    I see. However, general CFGs are not hard to recognize! Several well-known general algorithms exist with very good runtime for most kinds of grammars we see (e.g. linear time for deterministic grammars), and you can do it in sub-cubic time even in the worst case. So you do not have to deal with different subsets of CFGs separately.

                    Furthermore, the Earley algorithm I linked is almost the same as Thompson’s NFA matching, but extended for context-free grammars (the same parallel-threads-of-parsing approach). I think you should be able to get that algorithm to be as fast as Thompson matching on regular grammars (if it is not already).

                2. 3

                  On mobile right now so I can’t contribute, but Emacs’ rx falls into the Lisp-like category.

                  https://www.emacswiki.org/emacs/rx

                  1. 3

                      OCaml’s re library lets you write regular expressions by composing functions (kinda like a parser combinator) and then provides PCRE/Emacs/POSIX string regex parsing functions.

                    1. 3

                      Situation: there are 14 competing standards….

                      1. 1

                        There is also the cl-ppcre parse tree syntax: https://edicl.github.io/cl-ppcre/#create-scanner2

                        1. 1

                          a further nudge to check out eggex again, which I’ll gladly take

                          1. 2

                            Let me know how it turns out and if you have questions! (via Zulip, GitHub, etc.)

                            It has been pretty stable. The last thing I remember is that some user feedback caused me to change negation from ~[a-z] to ![a-z].

                            The main reason is that the syntax “stutters” when you have the awk-like (string ~ pattern). That is, you don’t want two different meanings of ~.

                            So it’s definitely still open to feedback!

                          2. 1

                            You might like my friend’s project: https://github.com/ethanpailes/remake

                            1. 1

                              (added, after getting back to a computer)

                            2. 1

                              mikmatch is old and unmaintained, but it was pretty nice to use