1. 10
  1.  

  2. [Comment removed by author]

    1. 9

      It’s strange the article points out not to use regex, but then doesn’t give any alternative.

      I think the traditional “don’t use regex” advice refers to code like this:

      import re

      KEYWORD_RE = re.compile(r"(?:if|else|while|return)\Z")
      FLOAT_RE = re.compile(r"\d+\.\d+\Z")
      INT_RE = re.compile(r"\d+\Z")

      for next_token in line.split(" "):  # line is the input string
          if KEYWORD_RE.match(next_token):
              pass  # handle a keyword
          elif FLOAT_RE.match(next_token):
              pass  # handle a floating-point number
          elif INT_RE.match(next_token):
              pass  # handle an integer
          # ...and on and on and on for every token type
      

      A proper lexical analyzer (especially a generated one) still lets you use regular expressions to define the language tokens, but it can be much faster and use less memory at runtime because it builds one big finite automaton that shares states between all of the tokens. It also makes error messages easier, in a way: given an input like “123omg”, it will tell you immediately at the ‘o’ that the integer is malformed, whereas an if/else chain of regexes only knows that the token didn’t match any of them.
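      For example, here is a minimal hand-rolled number scanner (a hypothetical sketch, just to illustrate the error-reporting point):

      def scan_number(s, pos):
          """Scan an integer or float starting at pos; pinpoint any bad character."""
          start = pos
          while pos < len(s) and s[pos].isdigit():
              pos += 1
          if pos < len(s) and s[pos] == ".":
              pos += 1
              if pos == len(s) or not s[pos].isdigit():
                  raise SyntaxError(f"expected digit after '.' at column {pos}")
              while pos < len(s) and s[pos].isdigit():
                  pos += 1
          # The number must end at whitespace or end of input.
          if pos < len(s) and not s[pos].isspace():
              raise SyntaxError(f"malformed number: unexpected {s[pos]!r} at column {pos}")
          return s[start:pos], pos

      Calling scan_number("123omg", 0) fails at column 3, pointing at the ‘o’, while the if/else chain above can only report that nothing matched.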

      1. 6

        It’s strange the article points out not to use regex, but then doesn’t give any alternative.

        That’s the whole article, for the most part. Don’t do this, do that. This is good, that is bad. A bunch of shallow opinions with hardly any justification. One could turn each sentence around and give it an equally shallow justification. Ironically, many of the points made are about appealing to other people’s wants and needs, while the article starts off by discouraging people who need validation from others. Perhaps a more appropriate title would be “My design principles.” There are no naked truths, just a personal form.

        That said, the linked explanation by Rob Pike does propose procedural code as the alternative to regexes for lexing.
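        For the curious, the core of Pike’s approach is a loop over “state functions”: each one consumes some input, maybe emits a token, and returns the next state function. A rough Python transliteration of the idea (the Lexer class and the single INT token are my own sketch, not from the talk):

        class Lexer:
            def __init__(self, src):
                self.src, self.pos, self.start = src, 0, 0
                self.tokens = []

            def peek(self):
                return self.src[self.pos] if self.pos < len(self.src) else ""

            def emit(self, kind):
                self.tokens.append((kind, self.src[self.start:self.pos]))
                self.start = self.pos

        def lex_default(lx):
            c = lx.peek()
            if c == "":
                return None              # end of input: no next state
            if c.isdigit():
                return lex_number        # hand off to the number state
            lx.pos += 1                  # skip anything we don't handle here
            lx.start = lx.pos
            return lex_default

        def lex_number(lx):
            while lx.peek().isdigit():
                lx.pos += 1
            lx.emit("INT")
            return lex_default

        def run(src):
            lx, state = Lexer(src), lex_default
            while state is not None:
                state = state(lx)        # each state picks its successor
            return lx.tokens

        run("12 345")  # [('INT', '12'), ('INT', '345')]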

        I’m not sure I understand the beef with regexes – is it about the syntax, or about the underlying mechanics (the FSM) involved in executing a regex?

        That is to say, if the author had instead written “don’t use matrix multiplication”, is that supposed to mean I should just not bother with the notation, or that I should entirely avoid the scalar multiplication and addition that we use to actually perform a matrix multiply?

        When I think regex, I think FSM. To me regexes are nothing more than a convenient syntax for specifying (a part of) a state machine. And state machines are great; it is so easy to verify that every possible input, invalid or valid, is appropriately considered in every possible state. It is a simple way to produce very robust lexers with good performance.
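        To make the exhaustiveness point concrete, a toy sketch: the regex \d+(\.\d+)? and an explicit transition table describe the same machine, and the table form makes it visible that every (state, input class) pair has a transition, including into an error state:

        ERR = -1
        TABLE = {
            (0, "digit"): 1, (0, "dot"): ERR, (0, "other"): ERR,  # 0: start
            (1, "digit"): 1, (1, "dot"): 2,   (1, "other"): ERR,  # 1: integer part
            (2, "digit"): 3, (2, "dot"): ERR, (2, "other"): ERR,  # 2: just saw '.'
            (3, "digit"): 3, (3, "dot"): ERR, (3, "other"): ERR,  # 3: fraction part
        }
        ACCEPTING = {1, 3}  # ended in the integer or fraction part

        def classify(c):
            return "digit" if c.isdigit() else "dot" if c == "." else "other"

        def accepts(s):
            state = 0
            for c in s:
                state = TABLE[(state, classify(c))]
                if state == ERR:
                    return False
            return state in ACCEPTING

        # accepts("123") -> True; accepts("3.14") -> True; accepts("123omg") -> False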

        1. 1

          You can combine the single regexes into one. I have seen such a regex lexer for a Java subset run at the same speed as a naively handwritten C implementation, thanks to the Python regex implementation. You can make the handwritten one faster, though.
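          Combining them looks roughly like this (a sketch in the spirit of the tokenizer recipe in Python’s re documentation; the token set is illustrative):

          import re

          # One alternation instead of trying each token regex in turn.
          TOKEN_RE = re.compile(
              r"(?P<FLOAT>\d+\.\d+)"      # before INT so '3.14' isn't split
              r"|(?P<INT>\d+)"
              r"|(?P<NAME>[A-Za-z_]\w*)"
              r"|(?P<SKIP>\s+)"
          )

          def tokenize(src):
              pos = 0
              while pos < len(src):
                  m = TOKEN_RE.match(src, pos)
                  if m is None:
                      raise SyntaxError(f"bad character {src[pos]!r} at column {pos}")
                  if m.lastgroup != "SKIP":
                      yield m.lastgroup, m.group()
                  pos = m.end()

          # list(tokenize("x 3.14 42")) -> [('NAME', 'x'), ('FLOAT', '3.14'), ('INT', '42')]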

          Error messages are not much of a problem, because you parse one token at a time and know the position anyways.

          In general, for quick development regex is fine. Sometimes you need more features like significant whitespace. Sometimes it is worth it to optimize further and write it manually.

          Edit: I have never tried a lexer generator like Ragel.

        2. 5

          Same; as long as tokenizing is a discrete step, I’m very pro-regex.

          1. 1

            Yeah, I got so angry about the linked Rob Pike article that I made a tweetstorm: https://twitter.com/robey/status/866402127987351552

            Glad to hear I’m not alone.