1.  

    I agree that the naming of the non-open-source licenses is confusing, e.g. “commons clause” is not a great name.

    However, I wonder if there is in fact a fundamental difference between some kinds of reuse and others (thinking aloud here).

    If I want to use Debian to power my corporate desktops, that’s one thing. Or if I want to run Debian on cloud machines that provide a Redis service.

    But if I actually expose the Redis API and protocol through a service to customers, and charge them for it, can that be considered another form of reuse that requires licensing?

    This is in contrast to running Redis as a backend for say a for-profit Twitter clone.

    Using Debian, the analogy would be providing a remote-desktop-as-a-service using Debian, as opposed to merely using Debian in a cloud service.


    Although, I may have missed it – why not make Redis AGPL? That would prevent proprietary forks, as I understand it. But they want something even stronger than that? They don’t want just a level playing field, but an advantage?

    I suppose I can understand that when you are faced with big cloud providers with tons of locked-in customers.

    I’m surprised that Cantrill isn’t more sympathetic to the business model problems and the lock-in effects of big cloud providers. He seems to take a pretty hard line that the “community” comes first. But what if there’s no real community? Is there a community around CockroachDB? It seems to be mainly for for-profit companies to run cloud services.


    In other words I think there could be another name for software where you have rights to view, modify, and distribute the source, but not to directly sell it as a service.

    He is saying that nobody is ever going to license this software – they will just reimplement it or use something else. But I think that is beside the point, which is that there are customers who WOULD pay for a hosted version of Redis or CockroachDB.

    I guess where this falls down is that it only works for projects in which there have been essentially no external contributors. Although a few people may have contributed the lion’s share of the code, it’s not clear that you should reserve rights for author A’s company but not author B’s company, even if author B only contributed 10 lines of code. It is hard to draw that line.

    (copy of my meandering HN comment)

    1. 6

      Although, I may have missed it – why not make Redis AGPL? That would prevent proprietary forks, as I understand it. But they want something even stronger than that? They don’t want just a level playing field, but an advantage?

      The AGPL prevents forks from differentiating themselves with “value add” features. If company X takes an AGPL’d webmail system and adds a cool filtering feature, X’s users can obtain the source to that feature and provide it to X’s competitor Y, so they can have the same feature.

      If an enterprise wants to build a system on top of Redis, the AGPL is often enough to scare the enterprise into paying for a licence. Building things on top might be considered an extra feature, and the enterprise doesn’t want to have to provide their core business software source to all their users.

      The problem that Redis et al. are having is that Amazon is not building a system on top of Redis, and they’re not trying to add value by extending Redis with custom features; they’re adding value by providing deployment and management services. Lots of people pay Amazon to run Redis for them, but Amazon isn’t paying Redis anything, and there’s nothing Redis can do about it (well, except changing the licence).

    1. 9

      I guess this makes sense, but the same could be said for JavaScript? When you load a script, doesn’t it affect the whole page? (Technically, the global is actually window?) And a script loaded via src is usually for more than one page, right? So should JavaScript be loaded via link rel tags too?

      Actually, the more I think about it the weirder link is for css. The other link types, next page, rss, lang=fr, etc. connect this page to others, but don’t change it. Browsers don’t generally load link targets. But a css link does get loaded and it affects this page. It’s not really a link at all.

      1. 7

        The semantics of <script src> are such that if you have

        <div id=a></div>
        <script src=...></script>
        <div id=b></div>
        

        Then the script can see ‘a’ but it cannot see ‘b’. So it’s evaluated exactly at the point it’s pulled in.

        Further, the script can actually document.write and insert more DOM/script before ‘b’.

        1. 1

          I’m not that clear on the timeline, but was this formalized before the event handlers for page ready were in place? This rationale seems like it could have made sense at the time but is now a bit of a dead metaphor.

          1. 3

            Formalized is a strong word to associate with Javascript’s history. I don’t know whether document events were present in the original implementation, but when JS was originally added to Netscape they definitely had no idea what paradigms would become dominant years later. At the time, it made perfect sense to run a script inline that would output raw HTML into the document, because that’s how everything else worked.

        2. 6

          Yeah honestly I didn’t find the response very enlightening.

          To me, it still seems like a mistake. Every time I write an HTML document (and I write by hand fairly often), I notice this inconsistency and have to copy from previous documents to maintain it.

          What else is the link tag used for? I’ve only ever used it for CSS.

          1. 7

            It also used to be used for “here’s the RSS feed for the blog that you’re currently reading”. There’s a bunch of early Web 2.0 stuff that used various <link> tags in page <head>s for stuff like pingbacks.

            1. 6

              <link>: The External Resource Link element

              tl;dr - icons and fonts, commonly

            2. 1

              The only difference I can recall is document.write, which writes right after its <script> tag, and I remember that it was quite popular in “Web 1.0”, even for “client-side templating” for static sites.

              With the async attribute, designed especially to overcome document.write problems, <script> finally loses its location dependency.

            1. 1

              On hiring illustrators, this blogger has done so with some nice results:

              https://mtlynch.io/how-to-hire-a-cartoonist/

              1. 3

                @andyc That could be potentially very interesting to me for the Ultimate Plumber tool! That said, not knowing LSP well yet, I don’t really have any idea what the difference between the two protocols is. Would you mind if I asked what the main differences are, or especially what limitations in LSP made you want to create a new protocol? Skimming the articles, I couldn’t find any elaboration on that; I’d be grateful for a pointer if I missed such a text. I think it could be interesting to others too! Thanks.

                (And btw, the “shellac” name is just awesome :D)

                1. 3

                  Ha, glad you got the name :) I think the tagline should be Shellac finishes your commands (groan).

                  So I actually haven’t worked much with the language server protocol. I know it uses JSON-RPC, and pretty much everyone was in unanimous agreement that we don’t want to impose that dependency on Shellac clients (shells and editors) and Shellac servers (command line tools).

                  There is also the issue that the argv array you want to complete is different than the actual line of text in the shell or the editor. I don’t know that the LSP can NOT accommodate this … but I suppose the problem just seems somewhat different on the surface.

                  As far as I know, the LSP is basically for when you type . in a Java-like language. I think it has notions of methods and arguments, which shell doesn’t have.

                  But I could be wrong and I would like it if someone tells me more about the language server protocol :)

                  It’s still very early, and the point of this post was to get people thinking about the problem!

                  As I mentioned in another comment, we’re discussing this on #shell-completion at https://oilshell.zulipchat.com , which is easy to log into with Github.

                  1. 2

                    In the fountain pen world, shellac is used mainly as a general-purpose glue, which also fits well here! :)

                    As to the protocol, is it expected that I could eventually call for completion on:

                    SHELLAC_ARGV=["oil", "-c", "echo $"]
                    SHELLAC_ARGC=2
                    SHELLAC_OFFSET=6
                    

                    and get a list of env variable names? or recursively for ["oil", "-c", "ls --"] have ls completion queried via oil?

                    (sorry for not going to zulipchat, but I have enough spam already in my inbox, and don’t expect to get involved unfortunately, having more than enough hobby projects on my head…)

                    1. 2

                      For the first question, Oil will be a shellac server as well as a client, so the answer is yes. It will know about its own syntax.

                      For the second question, I can imagine that will work, but that’s not the “right” way to do it. There would instead be a request you make to a shellac server regarding ls specifically. You wouldn’t have to go through Oil necessarily.

                      1. 1

                        Thanks! As for the second question, please try to look at it as a “degenerate case” of the first question. I.e. this would be more like ["oil", "-c", "if ls --"] or something similar in reality — that is, I’d need oil to parse its own syntax and then delegate further to ls. (Heheh, imagine an Inception-like scenario of oil -c "oil -c ..."; cue “We can go deeeeeper…” ;P)

                1. 4

                  I’ve given this some thought in the context of my own projects but I consider the ‘completion script’ angle as a workaround for a more fundamental issue: there is no option to query the client itself for completion or validation, and you would really want both tied to the actual binary that will receive the constructed arg/env to avoid any impedance mismatch, and to provide better feedback.

                  The legacy-compatible way for that would be another entrypoint to main(), one where the contract is to provide validation and completion for argc/argv without execution and, for binaries which miss this fictive “eval” entry point, provide a wrapper that exports the completion. Alas this is not something that ELF was really designed for, neither is the case of dlopen()ing a non-.so ELF binary (do I hear the chimes of rundll32.exe?). A much less appealing option would be yet-another-section and a static format.

                  1. 4

                    git, npm, and Clang actually already handle completion in the binary. It’s a common architecture – it’s just that there is a bunch of bash glued on, and they all do it in a slightly different way.

                    So the Shellac protocol should be designed so that they can migrate to something that’s not bash-specific, without rewriting all their logic.

                    I could have mentioned this directly in the post, but it’s on the wiki page:

                    https://github.com/oilshell/oil/wiki/Shell-Autocompletion

                    http://blog.llvm.org/2017/09/clang-bash-better-auto-completion-is.html

                    I think the easiest way is to simply use a flag or environment variable for the completion mode. A flag is better in that it’s not inherited, but environment variables feel more decoupled from the main program.
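
                    As a rough sketch of that idea (hypothetical, not a finished protocol: the SHELLAC_COMPLETE trigger and the output format are made up here, and SHELLAC_ARGV / SHELLAC_ARGC are just the names floated earlier in the thread), a tool could answer a completion query itself instead of executing:

                    import json, os, sys
                    def complete(argv, index):
                        # Hypothetical completion logic living inside the tool itself,
                        # rather than in a generated bash script.
                        flags = ['--verbose', '--output', '--help']
                        prefix = argv[index] if index < len(argv) else ''
                        return [f for f in flags if f.startswith(prefix)]
                    def main():
                        # Completion mode, triggered here by an environment variable.
                        if os.environ.get('SHELLAC_COMPLETE') == '1':
                            argv = json.loads(os.environ['SHELLAC_ARGV'])  # e.g. ["mytool", "--ver"]
                            index = int(os.environ['SHELLAC_ARGC'])        # which arg to complete
                            print('\n'.join(complete(argv, index)))
                            return
                        print('normal execution with', sys.argv[1:])
                    if __name__ == '__main__':
                        main()

                    A shell or editor would set those variables, run the binary, and read the candidates back, without knowing anything about bash.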

                    1. 2

                      I think if you got npm, git and clang to adopt something like this, your adoption would jump up pretty fast and each is representative of some major groups: package managers, common/important non-shell tools, and language-based things. I’d love to see adoption by languages like Go (which, as an example, already has a flags package that this should tie into) so that it’s easy to generate binaries that adhere to the protocol. I don’t use Rust (yet), but this seems like the sort of thing that would appeal to that community as well. At a different level it seems like supporting this protocol would appeal to linux distros/*BSDs for their package management as well.

                      Autocomplete comes to mind every time I use Acme (which has only filename autocompletion built-in). I’ve always wanted something that was equally general to help out with command line autocompletion. This could potentially make other tools move some of their logic to a more reasonable place (IDE plugins would either evaporate or become much smaller and simpler).

                      @andyc I’m actually a little overwhelmed at the possibilities here and how it can make so many tools so much better. Is there anything I can jump in and help with?

                      1. 2

                        I think you could get significant (albeit implicit) adoption of a new completion protocol just by adding support for it to libraries like Python’s argparse or docopt or Haskell’s optparse-applicative: these already know about all of a program’s valid command-line options. (The latter package can already do bash completion via the hidden --bash-completion-* options that get injected into programs that use it.)
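
                        For the argparse case, the existing argcomplete package already works roughly that way (a minimal sketch; the options here are just examples):

                        # PYTHON_ARGCOMPLETE_OK
                        import argcomplete, argparse
                        parser = argparse.ArgumentParser(prog='mytool')
                        parser.add_argument('--format', choices=['json', 'text'])
                        parser.add_argument('--verbose', action='store_true')
                        # argcomplete inspects the parser, so completions come from the
                        # same option definitions the program itself uses.
                        argcomplete.autocomplete(parser)
                        args = parser.parse_args()

                        Today that answers bash’s completion machinery; the point is that the same parser introspection could just as easily speak a shell-independent protocol.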

                        1. 1

                          That’s similar to, but more informative than, the point I was making about Go’s flags. Especially since it looks like docopt (at least) has versions in multiple languages.

                          Autocomplete all the things!

                          1. 1

                            Thanks for the Haskell link, I added it here:

                            https://github.com/oilshell/oil/wiki/Shell-Autocompletion

                            There’s an argcomplete package for argparse which someone else mentioned. That’s definitely the idea – instead of generating bash completion scripts, those libraries could implement a shell-independent protocol.

                            However, for backward compatibility, we could first turn OSH itself into a Shellac server! It could simply interpret the scripts generated by clap-rs and serve completions to another shell.

                          2. 1

                            Glad you like the idea! It’s still quite early and there’s a lot to do.

                            If you’re interested in completion for Acme, then that’s absolutely what I’m going for… I want to be able to complete my shell scripts in Vim!!! So Vim, Acme, Oil, Elvish, zsh, etc. are perfect clients for this. However it’s not obvious that they can all share the same logic… that’s what we have to work out.

                            The main issue I see is how to separate completion of the shell language itself, e.g. echo $V<TAB> to complete var names, vs. completion of the argv array. Those are two separate things that are intertwined in the way most shells including bash do things.

                            Autocompletion is sort of a “polyglot” problem, so if you have expertise in Go or the JS ecosystem that would help.

                            Does Go have any existing packages for completion of its flags package? I know that Python does for argparse and Rust does as well. There are some links off the threads in my wiki page.

                            We’re discussing this on the #shell-completion channel on https://oilshell.zulipchat.com/ , which is easy to log into with Github. The conversation is dormant now but I’d like to revive it for a while and get everyone on the same page. There was a debate about grammar-based approaches vs. this server-like approach. The server approach is closer to what npm, git, and Clang already do.

                      1. 17

                        Great talk! I’ve watched a lot of Hickey’s talks over the years, and thought he might have “run out of steam” by now. But there are some new insights here that I enjoyed.

                        One of his main points is that Maybe is context-dependent, and other aspects of schemas are as well. This is a great and underappreciated point.

                        It’s related to a subtle but important point that Google learned the hard way over many years with respect to protocol buffers. There’s an internal wiki page that says it better, but this argument has spilled out on to the Internet:

                        https://stackoverflow.com/questions/31801257/why-required-and-optional-is-removed-in-protocol-buffers-3

                        In proto3, they removed the ability to specify whether a field is optional. People say that the reason is “backward compatibility”, which is true, but I think Hickey’s analysis actually gets to the core of the issue.

                        The issue is that the shape/schema of data is an idea that can be reused across multiple contexts, while optional/required is context-specific. They are two separate things conflated by type systems when you use constructs like Maybe.

                        When you are threading protocol buffers through long chains of servers written and deployed at different times (and there are many of these core distributed types at Google), then you’ll start to appreciate why this is an important issue.

                        I think a lot of the disagreement in this thread comes down to a difference between thinking about code vs. thinking about systems.

                        When you’ve spent a lot of time with long-lived, evolving systems, as Hickey evidently has, then you’ll be hesitant to repeat naiveties about strict types being an unalloyed good. There are downsides to strict types, and he points out a good one here.

                        1. 5

                          The issue is that the shape/schema of data is an idea that can be reused across multiple contexts, while optional/required is context-specific. They are two separate things conflated by type systems when you use constructs like Maybe.

                          Can you elaborate, perhaps with an example, about this conflation? Maybe it’s hard to do in isolation if this is heavily tied to large systems, but I don’t see why types like Maybe shouldn’t be thought of as something that contributes to the shape (and schema) of data. I’ve certainly always seen it that way…

                          Also, are there perhaps different concerns in play when thinking about the problems that protocol buffers solve and the problems that any arbitrary type definition might solve?

                          1. 11

                            You can model it as part of the type system if you want, but the argument is that systems become brittle and harder to maintain if you do. There is less reuse of schemas. That’s what Hickey was saying, and that matches my experience with protobufs.

                            Hickey gives an example in his talk – the user/address one. It could sound trivial if you haven’t personally experienced this problem, but IME minor changes in the definition of core types have a big architectural effect on the system over time.

                            Protobufs are similarly central to nearly all Google code, certainly in search and ads, and to a lesser extent Chrome/Android. In a very real sense Google code doesn’t use the type system of C++ or Java; it uses the protobuf type system. I think of it as “extending your system over the network” (which has both good and bad properties).

                            I left a few years ago, but there are core types for log entries and search requests that have lasted 10-15 years. Maintaining these schemas well is important because there are terabytes/petabytes of data encoded in that format.

                            There are really two separate cases – when you’re persisting protobuf data, and when you’re sending it to another server as a message. Required fields are bad in both cases. Whether something is required shouldn’t be part of the schema. It’s an orthogonal concern to the shape of the data, because every binary has different requirements.

                            The schema is universal, but optional/required is relative to a particular piece of code, which you may write and deploy at different times. You can’t upgrade every binary atomically. It has to be done with backward compatibility for years and years. The point of the schemas is to share some information between different binaries operating on the same data, of course.

                            The issue with protobufs is perhaps a worse version of what happens in Clojure, but it sounds like the use case of “long-lived information systems that model the real world” is very close. What I’d say is that the design decision after ~15 years of using them ended up being the same as what Hickey proposed: in proto1 and proto2 you were forced to specify whether a field was required, optional, or repeated. In proto3 the concept of optional/required went away after some debates.

                            You can think of it as “every field is optional”, which is how Clojure works. Every field of a map is optional, although this doesn’t necessarily characterize the way you write code. There are higher-level conventions than that.


                            And as I noted in my other comment, the applicability of types is very domain-specific. So I’m not saying Maybe is bad in all cases – just that you have to think about it. People seem to get a lot out of the type system in Elm, in its very different domain of single-page apps.

                            It makes a very “closed world” assumption and the benefits outweigh the costs in many of those cases. (Although honestly I’m hearing a lot about the downsides of this strictness too, probably from people with more open problems.)

                            If you have an open world, which lasts over many years, then you should be careful about using every bell and whistle of the type system to model the world in a fine-grained / strict fashion. I don’t think it is even limited to Maybe or required/optional, but that is a great example, and he explained it well.

                            1. 4

                              Interesting. Thanks for the thoughtful reply. I will definitely ruminate on it. I certainly don’t have any experience with Google-scale systems, so I’m out of my depth there. If I were going to push back at all, I think I might try to explore whether your concerns are more applicable to a system’s serialization boundaries rather than to types in general. That is, if you have core data types shared across many processes over network boundaries, then I can definitely believe there might be different concerns there with respect to how much you use the type system. But do they apply to types that, say, don’t get directly serialized? Perhaps that’s what you were hitting on with your open/closed world distinction.

                              Also, on another note, I got a chance to read the re2c paper on tagged DFAs you pointed me to. I don’t grok everything yet, but it’s got my creative juices flowing. :-) Thanks for that!

                              1. 3

                                I think a lot of companies and codebases have problems that Hickey’s ideas address, because the “scale” that matters isn’t necessarily the number of requests (which Google gets a lot of), but really the age of the codebase, and diversity of users, data it deals with, and systems it talks to, etc.

                                I think banks and insurance companies are perfect examples. They are super old and accrete functionality over time. Amazon is apparently the same way – according to a talk by Werner Vogels I just watched, they have 3 main types that have lasted decades: users, orders, and products. Hickey’s example seems like a toy but you can imagine it on Amazon’s scale easily.

                                I agree that serialization boundaries make things worse (since I said that protobufs have a “worse version” of the problem). But I think the systems that Hickey is thinking about are largely the same, from watching his talks. They are long-lived systems that model the world with data. That describes most “web” apps, and many other apps.

                                I don’t think there are many closed world problems left. You might start with a closed-world problem that you can model in strict types, but as soon your application becomes successful, you have to talk to more things. There are few desktop apps that don’t talk to some kind of network service these days. Embedded systems not only talk to networks now, but their code is frequently updated!


                                If I had a bunch of free time I would try writing some code in Elm, since it’s very strongly typed and people seem to like it a lot. But to me it’s a bad sign it has problems with JSON [1]. Just a little data from the outside world causes problems. (That’s what I mean by “closed world”).

                                In fact strong types and external data in some way remind me of this statement I just read about blockchain.

                                I always say “the moment you need to verify that a human did something outside the system, your blockchain is broken.”

                                I extend it further: the moment you need to verify that something happened outside the system, your blockchain is broken.

                                https://news.ycombinator.com/item?id=18573966

                                Types have this problem with data !! And that is crazy. This is not theoretical – I’ve seen binaries crash in production because people didn’t realize that changing their schema and recompiling the code does not change data stored on disk. :) Yes I’m serious.

                                I didn’t understand “the map is not the territory” until I had been programming for a while. Type systems are a map of runtime behavior. They are useful up to that point. Runtime behavior is the territory; it’s how you make something happen in the world, and it’s what you ultimately care about. A lot of the arguments I see below in this thread seemingly forget that.

                                (I don’t want to pile onto this, but the Pony divide by zero thing was a similar mistake… they were making the runtime behavior wrong in order to satisfy the type system (the map), which is backwards.)

                                [1] https://medium.com/@eeue56/json-decoding-in-elm-is-still-difficult-cad2d1fb39ae


                                Glad you found the regex paper useful! Feel free to ping me if it leads to anything. I haven’t gotten back to regexes yet but I will probably mention it on a blog post at some point, e.g. the research on derivatives and some of my experiments.

                                1. 2

                                  Thanks again for the reply. I don’t think I quite agree with everything you’re saying, but I don’t necessarily strongly disagree either. I can definitely see your perspective. I don’t think I’ve quite come to appreciate it yet; perhaps I may not until I deal with systems at that scale. (And yeah, I totally get your point on the meaning of “scale”; we’re on the same page there.)


                                  the research on derivatives

                                  Ah right right! I re-read Reppy’s paper on derivatives, and I spent some time thinking and re-crystallizing the questions I had when I read the paper the first time around (many moons ago):

                                  • The paper does address Unicode, but it addresses it by talking about the increased alphabet size. This is generally not how non-toy DFA implementations handle Unicode. Namely, they don’t expand the alphabet to all Unicode codepoints; they keep the alphabet restricted to 256 byte values (or even smaller, using some tricks with a match-time cost). They achieve this by building knowledge of the encoding into the automaton. This is why, for example, re2c lets you specify the encoding when building the automata. Try using -8 for example. The regex derivative paper goes out of its way to talk about large Unicode alphabets and how those can be handled well, but doesn’t talk about how the large alternations are handled that ultimately result from building, say, UTF-8 into the DFA using a normal 256-byte alphabet. (See the sketch after this list.)
                                  • The paper doesn’t really talk much about other practical issues, such as finding the start of a match or encoding match priority. e.g., a|ab matches a in ab when using leftmost first semantics (e.g., Perl) but will match ab when using leftmost longest semantics (e.g., POSIX). Happily, Trofimovich dives into this topic in quite a bit of depth in his tagged DFA paper.
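
                                  To make the byte-alphabet point in the first bullet concrete, here’s a rough sketch of my own (not from the paper): “any one UTF-8 encoded scalar value” expressed as alternatives over byte ranges, which is the kind of thing that gets compiled into the DFA instead of enlarging the alphabet to every codepoint.

                                  # Sketch: "." over Unicode, lowered to byte ranges (RFC 3629 / Unicode Table 3-7).
                                  UTF8_ANY = [
                                      [(0x00, 0x7F)],
                                      [(0xC2, 0xDF), (0x80, 0xBF)],
                                      [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
                                      [(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
                                      [(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
                                      [(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
                                      [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
                                      [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
                                      [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
                                  ]
                                  def matches_one_char(data: bytes) -> bool:
                                      # True if data is exactly one well-formed UTF-8 scalar value.
                                      return any(len(data) == len(alt) and
                                                 all(lo <= b <= hi for b, (lo, hi) in zip(data, alt))
                                                 for alt in UTF8_ANY)
                                  for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):           # 1-, 2-, 3-, and 4-byte examples
                                      assert matches_one_char(chr(cp).encode('utf-8'))
                                  assert not matches_one_char(b'\xed\xa0\x80')        # surrogate half: rejected

                                  Each alternative becomes a short chain of states over byte ranges, so the automaton’s alphabet stays at 256 symbols no matter how large the character classes get.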

                                  Anyway, I’m not necessarily looking to restart our regex discussion. :-) I just wanted to throw these at you as two concerns that popped out at me from an implementor’s perspective. These concerns may not matter for your use case, which might make regex derivatives a good fit without needing to figure anything else out!

                                  1. 2

                                    Yeah I think it’s very domain-specific. I was going to add that I’m writing a Python VM now and C++ types work great there. I’m sure the same is true of Rust types and a regex engine.

                                    But VMs and regex engines are very particular kinds of software. I think the “open world” problems are very common, and existing type systems don’t handle them very well.

                                    Thanks for the notes on regexes. For the Oil lexer, all the significant characters are in the ASCII range, so I think I can just pass through UTF-8 and not do anything special.

                                    For the runtime, Unicode is one of the main reasons I am avoiding implementing my own globs and regexes for as long as possible :) I want to lean on libc a bit. Most shells actually implement their own, and it wasn’t too hard to find cases where they disagree, because character classes are a pain to implement.

                            2. 5

                              I realized I could have said this more concisely… Basically the idea is that the shape is a global property of the system. All binaries operating on the data share knowledge of the shape.

                              But whether an individual field is used or not isn’t a global property. Binary X might use field A and not B; the opposite is true for binary Y. So it’s better to handle optional/required at runtime in the code for that binary, rather than in the global schema.
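
                              A minimal sketch of that (plain Python dicts standing in for protobuf messages, and the field names are made up): two binaries share one schema, but each enforces its own requirements at runtime.

                              def require(entry, fields):
                                  missing = [f for f in fields if entry.get(f) is None]
                                  if missing:
                                      raise ValueError('missing fields: %s' % missing)
                              def binary_x_handle(entry):
                                  require(entry, ['query', 'timestamp'])      # X's requirements
                                  return '%s %s' % (entry['timestamp'], entry['query'])
                              def binary_y_handle(entry):
                                  require(entry, ['experiment_id'])           # Y's, deployed years later
                                  return entry['experiment_id']
                              entry = {'query': 'cats', 'timestamp': 1544140800}
                              print(binary_x_handle(entry))                    # fine
                              try:
                                  print(binary_y_handle(entry))
                              except ValueError as e:
                                  print('binary Y rejects it:', e)             # Y's rule, not the schema's

                              That’s the sense in which “required” in the shared schema would couple every binary (and all the old data) to every other binary’s needs.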

                              This is not a pedantic point – it comes up a lot in practice. Some people are just writing a simple client-server architecture with protobufs, and it doesn’t come up then. But e.g. web search isn’t exactly client/server – there are dozens or even hundreds of binaries operating on the same protobufs. Often they look at a small subset of data and pass it on to other services.

                              This was several years ago, but as far as I remember, the person who argued the most strongly against putting “required” in schemas was Kenton Varda, author of proto2 and capnproto, who was working on such servers in search. I think he wanted to get rid of the idea of optional/required in proto2, but it was eventually removed in proto3 after he stopped working on it. capnproto doesn’t have this concept either.

                              The reason it’s a long argument is that there are workarounds for making things required, and probably slight (local) benefits… but the claim is that your code becomes messy over time, with duplicated schemas, bloat from required fields that aren’t actually required, generated schemas to express slight differences, etc.

                              So I would say there is a tradeoff between locally reasoning about code, and evolving systems in a global fashion. The latter is more important, but you may not realize it yet … :)

                            3. -1

                              I agree that he’s a productive philosopher, but I think it’s funny that it’s been 13 years since he introduced Clojure and he’s still lauding RDF in his talks. We get it already!

                            1. 2

                              I really enjoyed this! It also makes me wonder if APLs would be good languages to do complicated dataframe analysis. Freebasing your “urls as percentage of total day hits” in J (I don’t have an interpreter rn to check, so gonna be super sloppy):

                              'dates counts' =: ({."1 ; {:"1) data
                              sum =: +/ counts 
                              table =: dates ; (dates) (%&(sum)@:+//. counts
                              

                              Probably a lot less readable and harder to write, but I think there’s a gem of an idea there

                              1. 1

                                I’m interested in more ports, including to APL and kdb (preferably runnable on an Ubuntu machine?).

                                I put a note about Julia here, which I may or may not get around to: https://github.com/oilshell/blog-code/tree/master/data-frames


                                As far as I understand, APL and J are written in a point-free style on vectors. The tidyverse is pretty much a point-free style on data frames, which are more or less dictionaries of {column name -> vector }. There are some longer examples here:

                                https://github.com/oilshell/oil/blob/master/benchmarks/report.R#L150

                                Also, I think the lazily evaluated named arguments do help, e.g. being able to say percentage = sum(num_hits) / total_hits * 100.0.
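
                                (For comparison, a rough pandas analogue of the same computation, not from the post; the url/num_hits column names are assumed and the numbers are made up:)

                                import pandas as pd
                                traffic = pd.DataFrame({
                                    'url':      ['/site.html', '/blog/', '/site.html', '/blog/'],
                                    'num_hits': [300, 1000, 320, 1300],
                                })
                                by_url = (traffic.groupby('url', as_index=False)['num_hits'].sum()
                                                 .assign(percentage=lambda d: d.num_hits * 100.0 / d.num_hits.sum())
                                                 .sort_values('percentage', ascending=False))
                                print(by_url)

                                Note the lambda: pandas needs it where dplyr’s lazily evaluated named arguments let you write percentage = sum(num_hits) / total_hits * 100.0 directly.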

                              1. 3

                                You’re doing important work. Love these blog posts.

                                1. 1

                                  Thank you, glad you appreciate them! I spent a long time writing this post in particular. Usually I write at least 2x the text I publish, and this one went through more editing and feedback cycles than most.

                                  As you might have noticed, I’m doing an experiment with monetizing the blog, to support the project in general. I think that honest book reviews are a good way to do that.

                                  One thing I’ve learned by blogging is that not everybody reads lobste.rs / reddit / HN. So if you have any friends or coworkers who would benefit from this post or any other, consider sending them a pointer with some context. Or consider sharing it on some online channel that I haven’t covered… for some reason my posts get very little traffic from Twitter.

                                  I should probably tag the educational ones since I have a fair mix now … http://www.oilshell.org/blog/

                                1. 2

                                  Hey Andy.

                                  For your SQL query, have you looked into the with statement?

                                  Then you could rewrite the query to something along these lines

                                  WITH total (hits)
                                  AS
                                  (
                                      SELECT SUM(num_hits)
                                      FROM traffic
                                  )
                                  
                                  SELECT url,
                                         SUM(num_hits) * 100.0 / total.hits
                                         AS percentage
                                  FROM traffic
                                  GROUP BY url
                                  ORDER BY percentage DESC;
                                  

                                  Keep up the interesting blog posts :)

                                  1. 2

                                    Thanks, I didn’t realize that was SQL 99, and sqlite supports it. I will edit the post!

                                    1. 2

                                        Hm, I was able to do this in sqlite, but it doesn’t look much better. It seems like you still need a SELECT, and you can’t use a syntax like total.hits.

                                      WITH total (hits) AS (SELECT SUM(num_hits) FROM traffic)       
                                      SELECT url, SUM(num_hits) * 100.0 / (SELECT hits FROM total) AS percentage 
                                      ...
                                      

                                      Thanks for the suggestion though. I will update the blog post but I think it still illustrates the awkwardness of SQL.

                                      1. 4

                                        Try this first because the “awkwardness” of SQL (which is really just Dataframes + simplified constraint solver) is the same awkwardness as any other dataframe. The only difference is that we don’t have to reinvent aggregation or deal with the fragility of implementation across M different scientists (and all the various edge cases)

                                        -- Create a single row, single column table with the sum of the hits:
                                        WITH total (hits)
                                        AS
                                        (
                                            SELECT SUM(hits)
                                            FROM traffic
                                        )
                                        -- And then use it as a cross join 
                                        -- (1 row from total * N rows from traffic) = N total rows
                                        SELECT url,
                                               traffic.hits * 100.0 / total.hits
                                               AS percentage
                                        FROM
                                            traffic,
                                            total
                                        GROUP BY url
                                        ORDER BY percentage DESC
                                        ;
                                        

                                        No inner query necessary. And looks just as nice as the Dataframe examples, if not better.

                                        Can we have a redemption arc for SQL because this is what JOINs are good at?

                                        Edit: Proof of work:

                                        $ sqlite3
                                        SQLite version 3.19.3 2017-06-27 16:48:08
                                        Enter ".help" for usage hints.
                                        Connected to a transient in-memory database.
                                        Use ".open FILENAME" to reopen on a persistent database.
                                        sqlite> CREATE TABLE traffic (
                                           ...>      date DATETIME NOT NULL,
                                           ...>      URL TEXT NOT NULL,
                                           ...>      hits INTEGER NOT NULL DEFAULT 0,
                                           ...>      PRIMARY KEY(date, URL)
                                           ...>  );
                                        sqlite>
                                        sqlite> INSERT INTO traffic
                                           ...>  VALUES
                                           ...>      ("2018-11-30",  "/site.html",  300),
                                           ...>      ("2018-11-30",  '/blog/'  , 1000),
                                           ...>      ("2018-12-01",  '/site.html',  320),
                                           ...>      ("2018-12-01",  '/blog/',  1300),
                                           ...>      ("2018-12-02",  '/site.html',  310),
                                           ...>      ("2018-12-02",  '/blog/',  1800),
                                           ...>      ("2018-12-02",  '/data-frames.html',   7500)
                                           ...> ;
                                        sqlite> -- Create a single row, single column table with the sum of the hits:
                                        sqlite> WITH total (hits)
                                           ...> AS
                                           ...> (
                                           ...>     SELECT SUM(hits)
                                           ...>     FROM traffic
                                           ...> )
                                           ...> -- And then use it as a cross join (1 row * N rows) = N rows
                                           ...> SELECT url,
                                           ...>        traffic.hits * 100.0 / total.hits
                                           ...>        AS percentage
                                           ...> FROM
                                           ...>     traffic, total
                                           ...> GROUP BY url
                                           ...> ORDER BY percentage DESC
                                           ...> ;
                                        /data-frames.html|59.8563447725459
                                        /blog/|14.365522745411
                                        /site.html|2.47406225059856
                                        sqlite>
                                        
                                        1. 2

                                          Thanks, that is better. I updated the code here and gave you credit:

                                          https://github.com/oilshell/blog-code/blob/master/data-frames/run.sh#L70

                                          For this case, SQL works OK. I think the data frame syntax is nicer, but you could argue it the other way.

                                          I still stand by my claim that SQL gets unwieldy for complex analysis. I don’t think this is controversial – most people including myself have dealt with 200 line SQL queries with no abstraction, that take forever to run.

                                          Here are some lengthier dplyr pipelines – I shudder to think how they would be expressed in SQL :)

                                          https://github.com/oilshell/oil/blob/master/benchmarks/report.R


                                          Also, data manipulation is only part of what R does. As mentioned, it is really meant for statistics, visualization, and mathematical modelling, and SQL isn’t appropriate in those domains. There’s a fuzzy boundary in the middle of a data science workflow where you could use either.

                                          My general guideline is to cut down the data with SQL – from terabytes to gigabytes – and then do analysis in R on a single machine. SQL is for extraction and R is for analysis. This is a very nice workflow IMO.


                                          Also note that sqlite supports direct CSV import: https://github.com/oilshell/blog-code/blob/master/data-frames/run.sh#L65

                                          Thanks for the suggestion (and testing it!)

                                          1. 2

                                            Usually when your SQL statement gets too unwieldy it’s due to lack of CTEs or trying to do too much in one query.

                                            Perhaps the issue with SQL is that people compare a (usually) singly scoped statement against a script file of successive statements. Most people in that situation will opt for a few giant, ugly SQL statements instead of using simpler successive imperative SQL statements that pass state via session variables.

                                    1. 13

                                      Rich has been railing on types for the last few keynotes, but it looks to me like he’s only tried Haskell and Kotlin and that he hasn’t used them a whole lot, because some of his complaints look like complete strawmen if you have a good understanding and experience with a type system as sophisticated as Haskell’s, and others are better addressed in languages with different type systems than Haskell, such as TypeScript.

                                      I think he makes lots of good points, I’m just puzzled as to why he’s seemingly ignoring a lot of research in type theory while designing his own type system (clojure.spec), and if he’s not, why he thinks other models don’t work either.

                                      1. 14

                                        One nit: spec is a contract system, not a type system. The former is often used to patch up a lack of the latter, but it’s a distinct concept you can do very different things with.

                                        EDIT: to see how they can diverge, you’re probably better off looking at what Racket does than what Clojure does. Racket is the main “research language” for contracts and does some pretty fun stuff with them.

                                        1. 4

                                            It’s all fuzzy to me. They’re both formal specifications, and they overlap in a lot of ways. Many of the types people describe could be pre/post conditions and invariants in contract form for specific data or functions on them. Then, a contract system extended to handle all kinds of things past Booleans will use enough logic to be able to do what advanced type systems do.

                                            Short of Pierce or someone formally defining it, I don’t know, as a formal-methods non-expert, that contract and type systems in their general forms are fundamentally that different, since they’re used the same way in a lot of cases. Interchangeably, it would appear, if each uses equally powerful and/or automated logics.

                                          1. 13

                                            It’s fuzzy but there are differences in practice. I’m going to assume we’re using non-FM-level type systems, so no refinement types or dependent types for full proofs, because once you get there all of our intuition about types and contracts breaks down. Also, I’m coming from a contract background, not a type background. So take everything I say about type systems with a grain of salt.

                                              In general, static types verify a program’s structure, while contracts verify its properties. Like, super roughly, static types are whether a program is sense or nonsense, while contracts are whether it’s correct or incorrect. Consider how we normally think of tail in Haskell vs, like, Dafny:

                                            tail :: [a] -> [a]
                                            
                                              method tail<T>(s: seq<T>) returns (o: seq<T>)
                                                requires |s| > 0
                                                ensures [s[0]] + o == s
                                            

                                            The tradeoff is that verifying structure automatically is a lot easier than verifying semantics. That’s why historically static typing has been compile-time while contracts have been runtime. Often advances in typechecking subsumed use cases for contracts. See, for example, how Eiffel used contracts to ensure “void-free programming” (no nulls), which is subsumed by optionals. However, there are still a lot of places where they don’t overlap, such as in loop invariants, separation logic, (possibly existential contracts?), arguably smart-fuzzing, etc.
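
                                              As a toy illustration of the runtime side (my sketch, in Python rather than a contract language): the same tail contract written as plain runtime checks.

                                              def tail(s):
                                                  assert len(s) > 0, 'requires: s must be non-empty'    # precondition
                                                  o = s[1:]
                                                  assert [s[0]] + o == s, 'ensures: s == [s[0]] + o'    # postcondition, about values
                                                  return o
                                              print(tail([1, 2, 3]))   # [2, 3]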

                                            Another overlap is refinement types, but I’d argue that refinement types are “types that act like contracts” versus contracts being “runtime refinement types”, as most successful uses of refinement types came out of research in contracts (like SPARK) and/or are more ‘contracty’ in their formulations.

                                            1. 3

                                              Is there anything contacts do that dependent types cannot?

                                              1. 2

                                                Fundamentally? Not really, nor vice versa. Both let you say arbitrary things about a function.

                                                In practice contracts are more popular for industrial work because they so far seem to map better to imperative languages than dependent types do.

                                                1. 1

                                                  That makes sense, thanks! I’ve never heard of them. I mean I’ve probably seen people throw the concept around but I never took it for an actual thing

                                          2. 1

                                            I see the distinction when we talk about pure values, sum and product types. I wonder if the IO monad for example isn’t kind of more on the contract side of things. Sure it works as a type, type inference algorithms work with it, but the side-effect thing makes it seem more like a pattern.

                                          3. 17

                                            I’m just puzzled as to why he’s seemingly ignoring a lot of research in type theory

                                            Isn’t that his thing? He’s made proud statements about his disinterest in theory. And it shows. His jubilation about transducers overlooked that they are just a less generic form of ad-hoc polymorphism, invented to abstract over operations on collections.

                                            1. 1

                                              wow, thanks for that, never really saw it that way but it totally makes sense. not a regular clojure user, but love lisp, and love the ML family of languages.

                                              1. 1

                                                So? Theory is useless without a usable and easy implementation.

                                              2. 6

                                                seemingly ignoring a lot of research in type theory

                                                I’ve come to translate this utterance as “it’s not Haskell”. Are there languages that have been hurt by “ignoring type theory research”? Some (Go, for instance) have clearly benefited from it.

                                                1. 12

                                              I don’t think rich is nearly as ignorant of Haskell’s type system as everyone seems to think. You can understand this stuff and not find it valuable, and it seems pretty clear to me that this is the case. He’s obviously a skilled programmer whose perspective warrants real consideration; people who are enamored with type systems shouldn’t be quick to write him off even if they disagree.

                                                  I don’t like dynamic languages fwiw.

                                                  1. 3

                                                I don’t think we can assume anything about what he knows. Even Haskellers here are always learning about its type system or new uses. He spends most of his time in a LISP. It’s safe to assume he knows more LISP benefits than Haskell benefits until we see otherwise in examples he gives.

                                                Best thing to do is probably come up with a lot of examples to run by him at various times/places. See what he says for/against them.

                                                    1. 9

                                                      I guess I would want hear what people think he’s ignorant of because he clearly knows the basics of the type system, sum types, typeclasses, etc. The clojure reducers docs mention requiring associative monoids. I would be extremely surprised if he didn’t know what monads were. I don’t know how far he has to go for people to believe he really doesn’t think it’s worthwhile. I heard edward kmett say he didn’t think dependent types were worth the overhead, saying that the power to weight ratio simply wasn’t there. I believe the same about haskell as a whole. I don’t think it’s insane to believe that about most type systems and I don’t think hickey’s position stems from ignorance.

                                                      1. 2

                                                    Good examples supporting the idea that he might know the stuff. Now, we just need more detail to further test the claims on each aspect of language design.

                                                        1. 2

                                                          From the discussions I see, it’s pretty clear to me that Rich has a better understanding of static typing and its trade offs than most Haskell fans.

                                                  2. 10

                                                    I’d love to hear in a detailed fashion how Go has clearly benefited from “ignoring type theory research”.

                                                    1. 5

                                                      Rust dropped GC by following that research. Several languages had race freedom with theirs. A few had contracts or type systems with similar benefits. Go’s developers ignored that to do a simpler, Oberon-2- and C-like language.

                                              There were two reasons. dmpk2k already gave the first, which Rob Pike himself has said: it was designed for anyone from any background to pick up easily right after Google hired them. Also, simplicity and consistency make it easy for them to immediately go to work on codebases they’ve never seen. This fits both Google’s needs and companies that want developers to be replaceable cogs.

                                              The other is that the three developers had to agree on every feature. One came from C. One liked stuff like Oberon-2. I don’t recall the other. Their consensus was unlikely to be an OCaml, Haskell, Rust, Pony, and so on. It was something closer to what they liked and understood well.

                                              If anything, I thought at the time they should’ve done something like Julia, with a mix of productivity features, high C/Python integration, a usable subset people stick to, and macros for just when needed. Much better. I think a Noogler could probably handle a slightly more advanced language than Go. That team wanted otherwise…

                                                      1. 2

                                                        I have a hard time with a number of these statements:

                                                “Rust dropped GC by following that research”? So did C++ also follow research to “drop GC”? What about “C”? I’ve seen plenty of type system conversation related to Rust but nothing that I would attribute directly to “dropping GC”. That seems like a bit of a simplification.

                                                Is there documentation that Go developers ignored type research? Has the Go team stated that? Or that they never cared? I’ve seen Rob Pike talk about wanting to appeal to C and C++ programmers but nothing about ignoring type research. I’d be interested in hearing about that being done and what they thought the benefits were.

                                                        It sounds like you are saying that the benefit is something familiar and approachable. Is that a benefit to the users of a language or to the language itself? Actually I guess that is more like, is the benefit that it made Go approachable and familiar to a broad swath of programmers and that allowed it to gain broad adoption?

                                                        If yes, is there anything other than anecdotes (which I would tend to believe) to support that assertion?

                                                        1. 9

                                                          “That seems like a bit of a simplification.”

                                                          It was. The topic is enormously complex. It gets worse when you consider I barely knew C++ before I lost my memory. I did learn about memory pools and reference counting from game developers who used C++. I know it keeps getting updated in ways that improve its safety. The folks that understand C++ and Rust keep arguing about how safe C++ is, with hardly any argument over Rust, since its safety model is baked thoroughly into the language rather than being one option in a sea of options. You could say I’m talking about Rust’s ability to be as safe as a GC in most of an app’s code without runtime checks on memory accesses.

                                                          “Is there documentation that Go developers ignored type research? Has the Go team stated that? Or that they never cared?”

                                                          Like with the Rich Hickey replies, this burden of proof is backwards, asking us to prove a negative. If assessing what people knew or did, we should assume nothing until we see evidence in their actions and/or informed opinions that they did these things. Only then do we believe they did. I start by comparing what I’ve read of Go to Common LISP, the ML’s, Haskell, Ada/SPARK, Racket/OMeta/Rascal on the metaprogramming side, Rust, Julia, Nim, and so on. Go has almost nothing in it compared to these. It looks like a mix of C, Wirth’s stuff, CSP-like stuff from the 1970’s-1980’s, and maybe some other things. Not much past the 1980’s. I wasn’t the first to notice, either. The article gets the point across despite the problems its author apologized for.

                                                          Now, that’s the hypothesis from observation of Go’s features vs other languages. Let’s test it on intent first. What was the goal? Rob Pike tells us here, with Moray Taylor having a nicer interpretation. The quote:

                                                          The key point here is our programmers are Googlers, they’re not researchers. They’re typically, fairly young, fresh out of school, probably learned Java, maybe learned C or C++, probably learned Python. They’re not capable of understanding a brilliant language but we want to use them to build good software. So, the language that we give them has to be easy for them to understand and easy to adopt.

                                                          It must be familiar, roughly C-like. Programmers working at Google are early in their careers and are most familiar with procedural languages, particularly from the C family. The need to get programmers productive quickly in a new language means that the language cannot be too radical.

                                                          So, they’re intentionally dumbing the language down as much as they can while making it practically useful. They’re doing this so smart people from many backgrounds can pick it up easily and go right to being productive for their new employer. It’s also gotta be C-like for the same reason.

                                                          Now, let’s look at its prior inspirations. In the FAQ, they tell you the ancestors: “Go is mostly in the C family (basic syntax), with significant input from the Pascal/Modula/Oberon family (declarations, packages), plus some ideas from languages inspired by Tony Hoare’s CSP, such as Newsqueak and Limbo (concurrency).” They then make an unsubstantiated claim, in that section at least, that it’s a new language across the board to make programming better and more fun. In reality, it seems really close to a C-like version of the Oberon-2 experience one developer (can’t recall which) wanted to recreate, with concurrency and tooling for aiding large projects. I covered the concurrency angle in another comment. You don’t see a lot of advanced or far-out stuff here: decades-old tech that’s behind current capabilities. LISP’ers, metaprogrammers and REBOL’ers might say behind old tech, too. ;)

                                                          Now, let’s look at execution of these C, Wirth-like, and specific concurrency ideas into practice. I actually can’t find this part. I did stumble upon its in-depth history of design decisions. The thing I’m missing, if it was correct, is a reference to the claim that the three developers had to agree on each feature. If that’s true, it automatically would hold the language back from advanced stuff.

                                                          In summary, we have a language designed by people who mostly didn’t use cutting-edge work in type systems, employed nothing of the sort, looked like languages from the 1970’s-1980’s, considered them ancestors, is admittedly dumbed down as much as possible so anyone from any background can use it, and maybe involved consensus from people who didn’t use cutting-edge stuff (or even much cutting-edge from the 90’s onward). They actually appear to be detractors of a lot of that stuff if we consider the languages they pushed as reflecting their views on what people should use. Meanwhile, the languages I mentioned above used stuff from the 1990’s-2000’s, giving them capabilities Go doesn’t have. I think the evidence weighs strongly in favor of that being because the designers didn’t look at it, were opposed to it for technical and/or industrial reasons, couldn’t reach a consensus, or some combo.

                                                          That’s what I think of Go’s history for now. People more knowledgeable, feel free to throw out any resources I might be missing. It just looks to be a highly practical, learn/use-quickly, C/Oberon-like language made to improve onboarding and productivity of random developers coming into big companies like Google. Rob Pike even says that was the goal. Seems open and shut to me. I thank the developers of languages like Julia and Nim for believing we were smart enough to learn a more modern language, even if we have to subset them for inexperienced people.

                                                      2. 4

                                                        It’s easy for non-LtU programmers to pick up, which happens to be the vast majority.

                                                        1. 3

                                                          Sorry, that isn’t detailed. Is there evidence that it’s easy for these programmers to pick up? What does “easy to pick up” mean? To get something to compile? To create error-free programs? “Clearly benefited” is a really loaded term that can mean pretty much anything to anyone. I’m looking for what the stated benefits are for Go. Is the benefit to Go that it is “approachable” and “familiar”?

                                                          There seems to be an idea in your statement that using any sort of type theory research will inherently make something hard to pick up. I have a hard time accepting that. I would, without evidence, be willing to accept that many type system ideas (like a number of them in Pony) are hard to pick up, but the idea that you have to ignore type theory research to be easy to pick up is hard for me to accept.

                                                          Could I create a language that ignores type system theory but uses an unfamiliar syntax, and end up with something that isn’t easy to pick up?

                                                          1. 5

                                                            I already gave you the quote from Pike saying it was specifically designed for this. As far as the how, I think one of its designers explains it well in those slides. The Guiding Principles section puts simplicity above everything else. Next, a slide says Pascal was a minimalist language designed for teaching non-programmers to code. Oberon was similarly simple. Oberon-2 added methods on records (think simpler OOP). The designer shows Oberon-2 and Go code, saying it’s C’s syntax with Oberon-2’s structure. I’ll add benefits like automatic memory management.

                                                            Then, the design link said they chose CSP because (a) they understood it enough to implement it and (b) it was the easiest thing to implement throughout the language. Like Go itself, it was the simplest option rather than the best along many attributes. There were lots of people who picked up SCOOP (super-easy but with overhead), with probably even more picking up Rust’s method grounded in affine types. Pony is itself doing clever stuff using advances in language design. Go would ignore those since (a) Go’s designers didn’t know them well from way back when and (b) it would’ve been more work than their intent/budget could take.

                                                            They’re at least consistent about simplicity for easy implementation and learning. I’ll give them that.

                                                        2. 3

                                                          It seems to me that Go was clearly designed to have a well-known, well-understood set of primitives, and that design angle translated into not incorporating anything fundamentally new or adventurous (unlike Pony and its impressive use of object capabilities). It looked already old at birth, but it feels impressively smooth, in the beginning at least.

                                                          1. 3

                                                            I find it hard to believe that CSP and goroutines were a “well-understood set of primitives”. Given the lack of usage of CSP as a mainstream concurrency mechanism, I think that saying Go incorporates nothing fundamentally new or adventurous is selling it short.

                                                            1. 5

                                                              CSP is one of the oldest ways people modeled concurrency. I think it was built on Hoare’s monitor concept from years before, which Per Brinch Hansen turned into Concurrent Pascal. He built the Solo OS with a mix of it and regular Pascal. It was also typical in high-assurance work to use something like Z or VDM for specifying the main system, with concurrency done in CSP and/or some temporal logic. Then, SPIN became the dominant way to analyze CSP-like stuff automatically, with a lot of industrial use for a formal method. Lots of other tools and formalisms existed, though, under the banner of process algebras.

                                                              Outside of verification, the introductory text that taught me about high-performance, parallel computing mentioned CSP as one of the basic models of parallel programming. I was experimenting with it in maybe 2000-2001 based on what those HPC/supercomputing texts taught me. It also tied into the Agent-Oriented Programming I was looking into then, given agents were also concurrent, sequential processes distributed across machines and networks. A quick DuckDuckGo search shows the summary article on Wikipedia mentions it, too.

                                                              There were so many courses teaching it and folks using it that experts in language design and/or concurrency should’ve noticed it a long time ago and tried to improve on it for their languages. Many did, some doing better. Eiffel SCOOP, ML variants like Concurrent ML, Chapel, Clay with Wittie’s extensions, Rust, and Pony are examples. Then you have Go doing something CSP-like (circa 1970’s) in the 2000’s and still getting race conditions and stuff. What did they learn? (shrugs) I don’t know…
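
                                                              Since CSP keeps coming up, here is a minimal sketch of the model in Haskell, using Chan from base; it only illustrates the “sequential processes communicating over channels” idea under discussion, not anything specific to how Go implements it:

                                                              import Control.Concurrent (forkIO)
                                                              import Control.Concurrent.Chan (newChan, readChan, writeChan)
                                                              import Control.Monad (forM_, replicateM_)

                                                              -- Two sequential processes that share no state and communicate only
                                                              -- through a channel: the core CSP idea that goroutines/channels echo.
                                                              main :: IO ()
                                                              main = do
                                                                ch <- newChan
                                                                _ <- forkIO $ forM_ [1 .. 5 :: Int] (writeChan ch)  -- producer process
                                                                replicateM_ 5 (readChan ch >>= print)               -- consumer process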

                                                              1. 10

                                                                Nick,

                                                                I’m going to take the 3 different threads of conversation we have going and try to pull them all together in this one reply. I want to thank you for the time you put into each answer. So much of what appears on Reddit, HN, and elsewhere is throwaway short stuff that often feels lazy or like communication wasn’t really the goal. For a long time, I have appreciated your contributions to lobste.rs because there is a thoughtfulness to them and an attempt to convey information and thinking that is often absent in this medium. Your replies earlier today are no exception.


                                                                Language is funny.

                                                                You have a very different interpretation of the words “well-understood primitives” than I do. Perhaps it has something to do with anchoring when I was writing my response. I would rephrase my statement this way (and I would still be imprecise):

                                                                While CSP has been around for a long time, I don’t think that, prior to Go, it was a well-known or familiar concurrency model for most programmers. From that, I would say it isn’t “well-understood”. But I’m reading quite a bit, based on context, into what “well-understood” means here. I’m taking it to mean “widely understood by a large body of programmers”.

                                                                And your response, Nick, actually makes me believe that more. The languages you mention aren’t ones that I would consider familiar or mainstream to most programmers.

                                                                Language is fun like that. I could be anchoring myself again. I rarely ask questions on lobste.rs or comment. I decided to on this occasion because I was really curious about a number of things from an earlier statement:

                                                                “Go has clearly benefited from “ignoring type theory research”.

                                                                Some things that came to mind when I read that and I wondered “what does this mean?”

                                                                “clearly benefited”

                                                                Hmmm, what does benefit mean? Especially in reference to a language. My reading of benefit is that “doing X helped the language designers achieve one or more goals in a way that had acceptable tradeoffs”. However, it was far from clear to me that this is what people meant.

                                                                “ignoring type theory research”

                                                                ignoring is an interesting term. This could mean many things and I think it has profound implications for the statement. Does ignoring mean ignorance? Does it mean willfully not caring? Or does it mean considered but decided not to use?

                                                                I’m familiar with some of the Rob Pike and Go early history comments that you referenced in the other threads. In particular related to the goal of Go being designed for:

                                                                The key point here is our programmers are Googlers, they’re not researchers. They’re typically, fairly young, fresh out of school, probably learned Java, maybe learned C or C++, probably learned Python. They’re not capable of understanding a brilliant language but we want to use them to build good software. So, the language that we give them has to be easy for them to understand and easy to adopt.

                                                                It must be familiar, roughly C-like. Programmers working at Google are early in their careers and are most familiar with procedural languages, particularly from the C family. The need to get programmers productive quickly in a new language means that the language cannot be too radical.

                                                                I haven’t found anything, though, that shows there was a willful disregard of type theory. I wasn’t attempting to get you to prove a negative; I’m more just curious. Has the Go team ever said something that would fall under the heading of “type system theory, bah, we don’t need it”? Perhaps they have. And if they have, is there anything that shows a benefit from that?

                                                                There’s so much that is loaded into those questions though. So, I’m going to make some statements, possibly open to being misconstrued, about what I’m hearing from your responses.

                                                                “Benefit” here means “helped make popular”, because Go, on its surface, presents a number of familiar concepts for the programmer to work with. There’s no individual primitive that feels novel or new to most programmers, except perhaps the concurrency model. However, on first approach that concurrency model is fairly straightforward in what it asks the programmer to grasp. Given Go’s stated goals from the quote above, it allows programmers to feel productive and “build good software”.

                                                                Even as I’m writing that though, I start to take issue with a number of the assumptions that are built into the Pike quote. But that is fine. I think most of it comes down, for me, to what “good software” is and what “simple” is. And those are further loaded words that can radically change the meaning of a comment based on the reader.

                                                                So let me try again:

                                                                When people say “Go has clearly benefited from “ignoring type theory research” what they are saying is:

                                                                Go’s level of popularity is based, in part, on it providing a set of ideas that should be mostly familiar to programmers who have some experience with the Algol family of languages such as C, C++, Python, Ruby, etc. We can further refine that to say that within the Algol family we are really talking about languages whose type systems make few if any guarantees (like C). Go put this familiarity first as its primary goal, and because of that, it is popular.

                                                                Would you say that is a reasonable summation?


                                                                When I asked:

                                                                “Is there documentation that Go developers ignored type research? Has the Go team stated that? Or that they never cared?”

                                                                I wasn’t asking you to prove a negative. I was very curious whether any such statements existed. I’ve never seen any. I’ve drawn a number of conclusions about Go based mostly on the Rob Pike quote you provided earlier. I was really looking for “has everyone else as well”, or do they know things that I don’t know?

                                                                It sounds like we are both mostly operating on the same set of information. That’s fine. We can draw conclusions from that. But I feel at least good in now saying that both you and I are inferring things based on what appears to be a mostly shared set of knowledge here, and not that I am ignorant of statements made by Go team members.

                                                                I wasn’t looking for proof. I was looking for information that might help clear up my ignorance in the area. Related to my ignorance.

                                                                1. 2

                                                                  I appreciate that you saw I was trying to put effort into it being productive and civil. Those posts took a while. I appreciate your introspective and kind reply, too. Now, let’s see where we’re at with this.

                                                                  Yeah, it looks like we were using words with different meanings. I was focused on well-understood by the PLT types that design languages and by folks studying parallelism. Rob Pike, at the least, should be in both categories, following that research. Most programmers don’t know about it. You’re right that Go could’ve been the first time it went mainstream.

                                                                  You also made a good point that it’s probably overstating it to say they never considered it. I have good evidence they avoided almost all of it. Other designers didn’t. Yet, they may have considered it (how much we don’t know), assessed it against their objectives, and decided against all of it. The simplest approach would be to just ask them in a non-confrontational way. The other possibility is to look at each one’s work to see if it showed any indication they were considering or using such techniques elsewhere. If those were absent, saying they didn’t consider it in their next work would be reasonable. Another angle would be to look at whether, like C’s developers, they had a personal preference for simple or barely-there typing, consistently avoiding developments in type systems. Since that’s lots of work, I’ll leave it at “Unknown” for now.

                                                                  Regarding its popularity, I’ll start by saying I agree its simple design reusing existing concepts was a huge element of that. It was Wirth’s philosophy to do the same thing for educating programmers. Go adapted that philosophy to the modern situation. Smart move. I think you shouldn’t underestimate the fact that Google backed it, though.

                                                                  There were a lot of interesting languages over the decades with all kinds of good tradeoffs. The ones with major, corporate backing and/or on top of advantageous foundations/ecosystems (eg hardware or OS’s) usually became big in a lasting way. That included COBOL on mainframes, C on cheap hardware spreading with UNIX, Java getting there almost entirely through marketing given its technical failures, .NET/C# forced by Microsoft on its huge ecosystem, Apple pushing Swift, and some smaller ones. Notice the language design is all across the board here in complexity, often more complex than existing languages. The ecosystem drivers, esp marketing or dominant companies, are the consistent thread driving at least these languages’ mass adoption.

                                                                  Now, mighty Google claims they’re backing a new language for their massive ecosystem. It’s also designed by celebrity researchers/programmers, including one whom many in the C community respect. It might also be a factor in whether developers get a six-digit job. These are two major pulls plus a minor one, each of which in isolation can draw in developers. Two of them, especially employment, will automatically produce a large number of users if they think Google is serious. Both also have ripple effects where other companies copy what the big company is doing so as not to get left behind. That makes the pull larger.

                                                                  So, as I think about your question, I have that in the back of my mind. I mean, those effects pull so hard that Google’s language could be a total piece of garbage and still have 50,000-100,000 developers just going for a gold rush. I think that simplifying the design to make it super-easy to learn and to maintain existing code just turbocharges that effect. Yes, I think the design and its designers could lead to a significant community without Google. I’m just leaning toward it being a major employer with celebrity designers and fanfare causing most of it.

                                                                  And then those other languages start getting uptake despite advanced features or learning troubles (especially Rust). That shows the Go team could’ve done better on typing using such techniques if they had wanted to and/or known about those techniques. I said that’s unknown. Go might be the best they could do given their background, constraints, goals, or whatever. Good that at least four different groups made languages to push programming further into the 90’s and 2000’s instead of just the 70’s to early 80’s. There are at least three creating languages closer to C that are generating a lot of excitement. C++ is also getting updates making it more like Ada. Non-mainstream languages like Ada/SPARK and Pony are still getting uptake, even though smaller.

                                                                  If anything, the choice of systems-type languages is exploding right now, with something for everyone. The decisions of Go’s language authors aren’t even worth worrying about since that time can be put into more appropriate tools. I’m still going to point out that Rob Pike quote to people to show they had very, very specific goals which made a language design that may or may not be ideal for a given task. It’s good for perspective. I don’t know the designers’ studies or their tradeoffs, and (given the alternatives) they barely matter past personal curiosity and PLT history. That also means I’ll remain too willfully ignorant about it to clear up anyone’s ignorance. At least till I see some submissions with them talking about it. :)

                                                                  1. 2

                                                                    Thanks for the time you put into this @nickpsecurity.

                                                                    1. 1

                                                                      Sure thing. I appreciate you patiently putting time into helping me be more accurate and fair describing Go designers’ work.

                                                                      1. 2

                                                                        And thank you, I have a different perspective on Go now than I did before. Or rather, I have a better understanding of other perspectives.

                                                      3. 6

                                                        I don’t see anything of substance in this comment other than “Haskell has a great type system”.

                                                        I just watched the talk. Rich took a lot of time to explain his thoughts carefully, and I’m convinced by many of his points. I’m not convinced by anything in this comment because there’s barely anything there. What are you referring to specifically?

                                                        edit: See my perspective here: https://lobste.rs/s/zdvg9y/maybe_not_rich_hickey#c_povjwe

                                                        1. 3

                                                          That wasn’t my point at all. I agree with what Rich says about Maybes in this talk, but it’s obvious from his bad Haskell examples that he hasn’t spent enough time with the language to justify criticizing its type system so harshly.

                                                          Also, what he said about representing the idea of a car with information that might or might not be there in different parts of a program might be correct for Haskell’s type system, but in languages with structural subtyping (like TypeScript) or row polymorphism (like Ur/Web) you can easily have a function that takes a car record which may be missing some fields, fills some of them out, and returns an object which has a few more fields than the input one, like Rich described at some point in the talk.

                                                          I’m interested to see where he’s gonna end up with this, but I don’t think he’s doing himself any favors by ignoring existing research in the same fields he’s thinking about.

                                                          1. 5

                                                            But if you say that you need to go to TypeScript to express something, that doesn’t help me as a Haskell user. I don’t start writing a program in a language with one type system and then switch into a language with a different one.

                                                            Anyway, my point is not to have a debate on types. My point is that I would rather read or watch an opinion backed up by real-world experience.

                                                            I don’t like the phrase “ignoring existing research”. It sounds too much like “somebody told me this type system was good and I’m repeating it”. Just because someone published a paper on it, doesn’t mean it’s good. Plenty of researchers disagree on types, and admit that there are open problems.

                                                            There was just one here the other day!

                                                            https://lobste.rs/s/dldtqq/ast_typing_problem

                                                            I’ve found that the applicability of types is quite domain-specific. Rich Hickey is very clear about what domains he’s talking about. If someone makes general statements about type systems without qualifying what they’re talking about, then I won’t take them very seriously.

                                                        2. 4

                                                          I don’t have a good understanding of type systems. What is it that Rich misses about Haskell’s Maybe? Does changing the return type of a function from Maybe T to T not mean that you have to change code which uses the return value of that function?

                                                          1. 23

                                                            Does changing the return type of a function from Maybe T to T not mean that you have to change code which uses the return value of that function?

                                                            It does in a way, but I think people sometimes overestimate the number of changes that are required. It depends on whether or not you really care about the returned value. Let’s look at a couple of examples.

                                                            First, let’s say that we have a function that gets the first element out of a list, so we start out with something like:

                                                            getFirstElem :: [a] -> a
                                                            getFirstElem = head
                                                            

                                                            Now, we’ll write a couple of functions that make use of this function. Afterwards, I’ll change my getFirstElem function to return a Maybe a so you can see when, why, and how these specific functions need to change.

                                                            First, let’s imagine that I have some list of lists, and I’d like to just return a single list that has the first element; for example I might have something like ["foo","bar","baz"] and I want to get back "fbb". I can do this by calling map over my list of lists with my getFirstElem function:

                                                            getFirsts :: [[a]] -> [a]
                                                            getFirsts = map getFirstElem
                                                            

                                                            Next, say we wanted to get an idea of how many elements we were removing from our list of lists. For example, in our case of ["foo","bar","baz"] -> "fbb", we’re going from a total of 9 elements down to 3, so we’ve eliminated 6 elements. We can write a function to help us figure out how many elements we’ve dropped pretty easily by looking at the sum of the lengths of the lists in the input lists, and the overall length of the output list.

                                                            countDropped :: [[a]] -> [b] -> Int
                                                            countDropped a b =
                                                              let a' = sum $ map length a
                                                                  b' = length b
                                                              in a' - b'
                                                            

                                                            Finally, we probably want to print out our string, so we’ll use print:

                                                            printFirsts =
                                                              let l = ["foo","bar","baz"]
                                                                  r = getFirsts l
                                                                  d = countDropped l r
                                                              in print l >> print r >> print d
                                                            

                                                            Later, if we decide that we want to change our program to look at ["foo","","bar","","baz"], we’ll see our program crashes! Oh no! The problem is that head doesn’t work with an empty list, so we’d better go and update it. We’ll have it return a Maybe a so that we can capture the case where we actually got an empty list.

                                                            getFirstElem :: [a] -> Maybe a
                                                            getFirstElem = listToMaybe  -- listToMaybe comes from Data.Maybe
                                                            

                                                            Now we’ve changed our program so that the type system will explicitly tell us whether we tried to take the head of an empty list or not, and it won’t crash if we pass one in. So what refactoring do we have to do to our program?

                                                            Let’s walk back through our functions one-by-one. Our getFirsts function had the type [[a]] -> [a] and we’ll need to change that to [[a]] -> [Maybe a] now. What about the code?

                                                            If we look at the type of map, we’ll see that it has the type map :: (c -> d) -> [c] -> [d]. Both versions of getFirstElem fit map’s first argument, (c -> d): in both cases c ~ [a], with d ~ a in the first case and d ~ Maybe a in the second, so map getFirstElem has type [[a]] -> [a] before the change and [[a]] -> [Maybe a] after it. In short, we had to fix our type signature, but nothing in our code has to change at all.
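
                                                            Spelled out, that means getFirsts becomes:

                                                            getFirsts :: [[a]] -> [Maybe a]
                                                            getFirsts = map getFirstElem  -- same body as before; only the signature changed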

                                                            What about countDropped? Even though the types of its arguments changed, we don’t have to change anything in countDropped at all! Why? Because countDropped never looks at any values inside of the lists; it only cares about the structure of the lists (in this case, how many elements they have).

                                                            Finally, we’ll need to update printFirsts. The type signature here doesn’t need to change, but we might want to change the way that we’re printing out our values. Technically we can print a Maybe value, but we’d end up with something like [Just 'f',Nothing,Just 'b',Nothing,Just 'b'], which isn’t particularly readable. Let’s update it to replace Nothing values with spaces:

                                                            printFirsts :: IO ()
                                                            printFirsts =
                                                              let l = ["foo","","bar","","baz"]
                                                                  r = map (fromMaybe ' ') $ getFirsts l  -- fromMaybe comes from Data.Maybe
                                                                  d = countDropped l r
                                                              in print l >> print r >> print d
                                                            

                                                            In short, from this example you can see that we can refactor our code to change the type, and in most cases the only code that needs to change is code that cares about the value that we’ve changed. In an untyped language you’d expect to still have to change the code that cares about the values you’re passing around, so the only additional change that we had to make here was a very small update to the type signature (but not the implementation) of one function. In fact, if I’d let the type be inferred (or written a much more general function) I wouldn’t have had to do even that.

                                                            There’s an impression that the types in Haskell require you to do a lot of extra work when refactoring, but in practice the changes you are making aren’t materially bigger or different from the ones you’d make in an untyped language; it’s just that the compiler will tell you about the changes you need to make, so you don’t need to find them through unit tests or program crashes.

                                                            1. 3

                                                              countDropped should be changed. What it should become will depend on your specification, but as a simple inspection, countDropped ["", "", "", ""] [Nothing, Nothing, Nothing, Nothing] will return -4, which isn’t likely to be what you want.

                                                              1. 2

                                                                That’s correct in a manner of speaking, since we’re essentially computing the difference between the number of characters in all of the substrings and the length of the printed items. Since "" is the empty list but is printed as " ", we print one extra character (the space) for it compared to the total length of the strings, so a negative “dropped” value is sensible.

                                                                Of course the entire thing was a completely contrived example I came up with while I was sitting at work trying to get through my morning coffee, and really only served to show “sometimes we don’t need to change the types at all”, so I’m not terribly worried about the semantics of the specification. You’re welcome to propose any other more sensible alternative you’d like.

                                                                1. -3

                                                                  That’s correct in a manner of speaking, since …

                                                                  This is an impressive contortion, on par with corporate legalese, but your post-hoc justification is undermined by the fact that you didn’t know this was the behavior of your function until I pointed it out.

                                                                  Of course the entire thing was a completely contrived example …

                                                                  On this, we can agree. You created a function whose definition would still typecheck after the change, without addressing the changed behavior, nor refuting that, in the general case, Maybe T is not a supertype of T.

                                                                  You’re welcome to propose any other more sensible alternative you’d like.

                                                                  Alternative to what, Maybe? The hour-long talk linked here is pretty good. Nullable types are more advantageous, too, like C#’s int?. The point is that if you have a function and call it as f(0) when the function requires its first argument, and later the requirement is “relaxed”, all the places where you wrote f(0) will still work and behave in exactly the same way.

                                                                  Getting back to the original question, which was (1) “what is it that Rich Hickey doesn’t understand about types?” and, (2) “does changing the return type from Maybe T to T cause calling code to break?”. The answer to (2) is yes. The answer to (1), given (2), is nothing.

                                                                  1. 9

                                                                    I was actually perfectly aware of the behavior, and I didn’t care because it was just a small toy example. I was just trying to show some examples of when and how you need to change code and/or type signatures, not write some amazing production quality code to drop some things from a list. No idea why you’re trying to be such an ass about it.

                                                                    1. 3

                                                                      She did not address question (1) at all. You are reading her response to question (2) as implying something about (1), which makes your response needlessly adversarial.

                                                                2. 1

                                                                  This is a great example. To further reinforce your point, I feel like this is the one place Haskell really shows its strength: these refactors. It’s often a pain to figure out what the correct types should be for parts of your program, but once you know this and make a change, the Haskell compiler becomes a real guiding light when working through a refactor.

                                                                3. 10

                                                                  He explicitly makes the point that “strengthening a promise”, that is, going from “I might give you a T” to “I’ll definitely give you a T”, shouldn’t necessarily be a breaking change, but it is in the absence of union types.

                                                                  1. 2

                                                                    Half baked thought here that I’m just airing to ask for an opinion on:

                                                                    Say as an alternative, the producer produces Either (forall a. a) T instead of Maybe T, and the consumer consumes Either x T. Then the producer’s author changes it to make a stronger promise by changing it to produce Either Void T instead.

                                                                    I think this does what I would want? This change hasn’t broken the consumer because x would match either alternative. The producer has strengthened the promise it makes because now it promises not to produce a Left constructor.
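
                                                                    Here is a minimal sketch of roughly that idea (substituting a concrete error type for the weaker promise and Void for the stronger one; the names are made up for illustration, and the only requirement on the consumer is that it stay polymorphic in the Left type):

                                                                    import Data.Void (Void)

                                                                    -- Consumer stays polymorphic in the Left type, so it compiles
                                                                    -- against both producers below.
                                                                    orZero :: Either e Int -> Int
                                                                    orZero = either (const 0) id

                                                                    -- Weaker promise: a Left might come back.
                                                                    produceV1 :: Either String Int
                                                                    produceV1 = Right 42

                                                                    -- Stronger promise: Left is uninhabited, so only Right can come back.
                                                                    produceV2 :: Either Void Int
                                                                    produceV2 = Right 42

                                                                    main :: IO ()
                                                                    main = print (orZero produceV1, orZero produceV2)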

                                                                    1. 4

                                                                      When the problem is “I can’t change my mind after I had insufficient forethought”, requiring additional forethought is not a solution.

                                                                      1. 2

                                                                        So we’d need a way to automatically rewrite Maybe t to Either (forall a. a) t everywhere - after the fact. ;)

                                                                4. 2

                                                                  Likewise, I wonder what he thinks about Rust’s type system to ensure temporal safety without a GC. Is safe, no-GC operation in general or for performance-critical modules desirable for Clojure practitioners? Would they like a compile to native option that integrates that safe, optimized code with the rest of their app? And if not affine types, what’s his solution that doesn’t involve runtime checks that degrade performance?

                                                                  1. 7

                                                                    I’d argue that GC is a perfectly fine solution in the vast majority of cases. The overhead from advanced GC systems like the one on the JVM is becoming incredibly small. So, the scenarios where you can’t afford GC are niche in my opinion. If you are in such a situation, then types do seem like a reasonable way to approach the problem.

                                                                    1. 3

                                                                      I have worked professionally in Clojure but I have never had to make a performance critical application with it. The high performance code I have written has been in C and CUDA. I have been learning Rust in my spare time.

                                                                      I argue that both Clojure and Rust have thread-safe memory abstractions, but Clojure’s solution has more (theoretical) overhead. This is because while Rust uses ownership and affine types, Clojure uses immutable data structures.

                                                                      In particular, get/insert/remove for a Rust HashMap is O(1) amortized while Clojure’s corresponding hash-map’s complexity is O(log_32(n)) for those operations.

                                                                      I haven’t made careful benchmarks to see how this scaling difference plays out in the real world, however.
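
                                                                      For what it’s worth, here is a tiny sketch of the persistent-update style in Haskell, using Data.HashMap.Strict from the unordered-containers package (a HAMT, much like Clojure’s hash-map); the point is only that an insert returns a new map rather than mutating in place, which is where the O(log_32 n) cost and the extra overhead come from:

                                                                      import qualified Data.HashMap.Strict as HM  -- from unordered-containers

                                                                      main :: IO ()
                                                                      main = do
                                                                        let m0 = HM.fromList [("a", 1 :: Int), ("b", 2)]
                                                                            m1 = HM.insert "c" 3 m0   -- m0 is untouched; m1 shares structure with it
                                                                        print (HM.lookup "c" m0)      -- Nothing
                                                                        print (HM.lookup "c" m1)      -- Just 3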

                                                                      1. 4

                                                                        Having used Clojure’s various “thread safe memory abstractions”, I would say that the overhead is actual, not theoretical.

                                                                  2. 2

                                                                    Disclaimer: I <3 types a lot, Purescript is lovely and whatnot

                                                                    I dunno, I kinda disagree about this. Even in the research languages, people are opting for nominal ADTs. Typescript is the exception, not the rule.

                                                                    His wants in this space almost require “everything is a dictionary/hashmap”, and I don’t think the research in type theory is tackling his complaints (the whole “place-oriented programming” stuff and positional argument difficulties ring extremely true). M…aybe row types, but row types are not easy to use compared to the transparent and simple Typescript model in my opinion.

                                                                    Row types help to solve issues generated in the ADT universe, but you still have the nominal typing problem, which is his other complaint.

                                                                    His last keynote was very aggressive and I think people wrote it off because it felt almost ignorant, but I think this keynote is extremely on point once he gets beyond the Maybe railing in the intro.

                                                                  1. 4

                                                                    You can think of R as a wrapper around Fortran, in much the same way that Python is a wrapper around C.

                                                                    Not only does building R itself require a Fortran compiler, common packages do as well. If you look inside your R package tarballs, particularly of solvers, you will see Fortran.

                                                                    (Scientific computing libraries in Python also wrap Fortran, but CPython doesn’t depend on it. Base R depends on Fortran.)

                                                                    (repost from HN)

                                                                    1. 3

                                                                      This is where taking the time to learn Gentoo is helpful. There is nothing like seeing Fortran scroll across the screen to remind you of all the code that exists.

                                                                    1. 9

                                                                      I’m also having problems keeping up with my side project. I had a lot of energy when I was cowboy coding and had no job. When I started to care more about quality and tests, my productivity dropped a lot, and seeing a lot of bad code at my job may be draining my energy to keep working.

                                                                      My plan was to gather money to take a sabbatical year to work on it. I’ve seen suggestions saying that it’s not a good idea because of the pressure, and that you should be able to do it even while having a job.

                                                                      1. 15

                                                                        Normally I don’t try to give anyone advice, but I have a lot of friends in this situation, and I would say it’s better to start something while you’re employed. This will help you make the most of your sabbatical year.

                                                                        It will be surprisingly easy to let 3 or 6 months go by “figuring out” what you want to do.

                                                                        Or it will be very easy to work on something for 3 months and then realize it wasn’t the right thing, and you don’t really want to work on it.

                                                                        1. 3

                                                                          I second this advice. No matter how difficult or taxing it may be, get a start on your project while you’re employed. Only after putting considerable time and effort into it should you consider taking a sabbatical.

                                                                          I learned this the hard way after deciding to take a year off to work on a side project myself. Unfortunately, I somehow ended up spending eleven months (!!!) figuring things out, working on it, realizing I messed up, and figuring things out again before finally starting on something truly workable. One year has already turned into one-and-a-half years, and it’s beginning to look like two years due to unforeseen complications. While I’m finally on track, I’m running short on both money and motivation.

                                                                          Hindsight is 20/20, but had I not convinced myself that I required a sabbatical to put real work into this project, I’d be in a much better place right now. While I’m happy with the progress I made, I’m unhappy with how I made it. Dedicating even an hour every day to my side project while keeping my day job would have been very stressful, but my current situation is certainly far worse.

                                                                          Be brave and careful, not one or the other.

                                                                      1. 6

                                                                        Can we please stop using the “Make X Y Again” schema for advertising things? I know there is no ill intent behind this but some of us are directly affected by the policies and rhetoric that comes out of the very much sincere desire to roll back progressivism by decades.

                                                                        1. 12

                                                                          Your comment is off-topic.

                                                                          You reasonably observe that the name of the project is derivative, acknowledge that the author bears you no ill intent, but nevertheless suggest the project name is harming you and yours.

                                                                          You don’t address the author, you don’t talk about Medium as a platform, you don’t talk about blogging or anything apparently connected to the article. Your comment is a generic complaint. (Applies to any submission matching your pattern.)

                                                                          We’re a community of practitioners. We show (create, invest, fix), rather than tell (scold, beg, demand).

                                                                          1. 26

                                                                            The title of this project is a riff on a political slogan that itself is a riff on various fascist slogans throughout history. Making a joke of it by using it as the name of a browser extension is, at the very least, in poor taste. The commenter you responded to made a polite request to the community to stop doing this thing that is in poor taste. There was no need for them to address the substance of the project because the comment was only concerned with the choice of title. In terms of scolding/begging/demanding, I see more of that in your comment than in the one you responded to.

                                                                            1. 1

                                                                              Apologies for the off-topicness, but are Mel Brooks’ Hitler jokes/comedy in bad taste? Can something horrible be alleviated by ridiculing it?

                                                                              This is a philosophical question that doesn’t even account for the author’s intent with the naming.

                                                                              And on the other side, would “Medium we can believe in” or “Medium we can” be more acceptable or less, and to whom?

                                                                              A rose by any other name… It seems to be a somewhat useful browser addition regardless.

                                                                              1. 2

                                                                                Can something horrible be alleviated by ridiculing it?

                                                                                Yes, somewhat, and only if actually done well. (And even then, sometimes the supposed object of ridicule can miss the point entirely and embrace whatever the “joke” was about.)

                                                                                I guess the point is, naming entirely unrelated things with the same pattern (“Make X Y again” here) is not comedy! It’s literally just spreading the slogan.

                                                                            2. 19

                                                                              You can’t ignore politics when they are no longer ignoring you. However much you may think that Lobsters is a domain of pure, unadulterated reason and everything unreasonable is offtopic, the linked software decided to make a political slogan ontopic.

                                                                              You’re grandstanding here about how neutral Lobsters is, but there’s no neutrality on this moving train, and telling people to shut up about the politics that affects them isn’t nice.

                                                                              1. 9

                                                                                We’re a community of practitioners. We show (create, invest, fix), rather than tell (scold, beg, demand).

                                                                                I like this a lot! The internet would be a better place if there were more places that followed this philosophy.

                                                                                1. 0

                                                                                  Yeah, wouldn’t that be something…

                                                                                  :-/

                                                                                2. 8

                                                                                  I also happen to feel that playful takes on MAGA are like putting googly eyes on a swastika, and was about to post a similar comment. I didn’t post it, as earlier OT exchanges like this on Lobsters suggest ethics is a taboo subject to many here.

                                                                                  But seriously, screw this.

                                                                                  1. -6

                                                                                    Fine, let’s discuss ethics.

Calling a playful riff on the MAGA slogan “putting googly eyes on a swastika” is bullshit. It’s the same authoritarian communist rhetorical technique that the East German government used when they called the Berlin Wall the “anti-fascist defense wall”. I’m not a huge fan of Trump myself, but I’m even less of a fan of the anti-Trumpist faction in American politics characterizing Trump’s policies as literally Nazi-like so they can feel justified in weaponizing the social norm that “Nazis=bad” in western society against their political enemies.

Nothing the Trump administration is doing is in any meaningful way close to the bad things that the Nazis did - frankly, most of what he’s been doing is the same things that every post-WWII American presidential administration has done, just with less high-class verbiage to describe it. The people who claim otherwise are doing so in order to make themselves feel like they’re morally righteous crusaders instead of people having ordinary political disagreements in the American political system.

Lobsters isn’t a political discussion forum, but if people are going to say that nonpolitical articles that happen to reference the current US President’s campaign slogan should be considered forbidden, you’re already bringing politics into the space, and you shouldn’t expect that your particular politics must go unchallenged. There’s nothing wrong with the title of the article, and people claiming otherwise are making a backhanded political argument that Trump is Bad on a technical forum.

                                                                                    1. 3

                                                                                      Now this is an off-topic comment.

                                                                                      1. 10

                                                                                        And yet despite being in good company, it is the only one flagged to death, because it comes from the perspective of the wrong tribe.

You see why I object to politics and “ethics” discussions? This is sort of the reason why: people don’t get a fair shake.

                                                                                        1. 0

                                                                                          This is a tough problem to solve, for sure.

I am among those who have flagged it as off-topic, as per @alynpost’s comment here

                                                                                          https://lobste.rs/s/f4t0y2/make_medium_readable_again#c_ty2pp6

                                                                                          (based on my understanding, posted here: https://lobste.rs/s/f4t0y2/make_medium_readable_again#c_szkkme)

As both this downvote and the one I made on the other post were made in the heat of the moment, I have removed them both.

                                                                                          1. -1

This whole discussion is a response to an unnecessarily politicised title. Ironically, it’s the objection to the title that was attacked by the “no ethics pls” crowd.

                                                                                        2. 2

I’m not taking the bait. I would just remark that my reply, and your rant, could have been entirely avoided if the author had stuck to the fucking technicals for a technical write-up.

                                                                                      2. 6

                                                                                        How does one show, create, invest or fix in response to a negative pattern like the “Make X Y Again” headline?

                                                                                        1. 4

                                                                                          Indeed. I suppose one could suggest an alternate name for the project, in which case I will propose “Readable Medium” as a straightforward name for a browser extension that would entirely avoid any political connotations that only serve to distract from the substance of the project.

                                                                                          1. 1

                                                                                            I like that also because I find it humorous – a medium is a person who may do a “reading”, so “readable medium” sounds backward to me.

                                                                                          2. 0

                                                                                            If the title of the project bothers you, open an issue and try to convince the author of your point. If not possible, fork it.

                                                                                          3. 3

                                                                                            I downvoted this comment as “incorrect” but I have since reconsidered and removed my downvote.

I initially read the comment to mean “never discuss anything political (as defined by us, the community*) on this site”.

I now hope it reads “please feel free to discuss things political, but the focus should be on the technical contents of the submitted post”.

                                                                                            In this spirit, I will submit a comment that both reflects my opinion on the linked content, and will serve as a template for an acceptable comment that also addresses the political/ethical implications.

                                                                                            [start comment]

                                                                                            This project strikes me as useful for now, but ultimately reactive. It’s easy for Medium to redesign their site to defeat the circumvention, and the developer and users will engage in a game of whack-a-mole to keep up.

                                                                                            It’s a similar situation with ad blockers, with the significant difference that the market for ad-free browsing is much larger than the market for reading Medium without a bunch of banners.

                                                                                            This segues nicely into the problems with Medium’s business plan. Ultimately, it’s just Wordpress.com with a nicer editor and draconian rules about CSS. There’s really no reason to pay for Medium apart from the content, and the content, for me personally, seems mostly to be cryptocurrency boosters nowadays. Essentially it’s content as a commodity… there has to be a critical mass of writers who are only available on Medium for it to be worth paying for.

                                                                                            If Medium promised a cleaner reading experience as part of a paid tier, that would maybe help?

                                                                                            As to the name of the linked project - it’s unfortunately hard to detect irony on the web, and considering the “alt-right” has had some success in shifting the conversation by “pretending” to be racist, saying it’s for the “lulz”, I am prepared to automatically assume that someone who seems to do the same is either on the same side as this political faction, or insensitive to how they appear by choosing this name.

                                                                                            Personally I would add the name choice as a negative in evaluating this project.

                                                                                            [end comment]

                                                                                            If anyone upvotes or downvotes this comment, please let me know if it was because of the content, or the presentation, or the meta-narrative on how to handle political/ethical/sensitive submissions to the site.


                                                                                            * who represents this community is another question that deserves discussion but that’s for another time.

                                                                                            1. 0

                                                                                              Good comment, upvoted. You address the content of the article first, make good points and analysis, and close with minor but reasonable speculation and an opinion–and you don’t go on a screed.

                                                                                            2. 2

                                                                                              @gerikson, @jamesmacaulay, @JordiGH, @varjag I’ll reply to all of you at once in the interest of my time.

                                                                                              I have had folk observe that I’m prone to understatement. I may have done that here describing the project name as derivative, when I could have said political slogan (h/t jamesmacaulay) or dog whistle (h/t gerikson). Both would have been more accurate.

The de minimis statement I made supporting my off-topic claim was “Your comment is a generic complaint.” I then provided a test so the claim can be falsified: “[Your comment] applies to any submission matching your pattern.” This same test holds without regard to the sentiment of the comment. A similarly context-free comment supporting, rather than detracting from, this political slogan, dog whistle, or derivative name would also be off-topic.

                                                                                              We know that naming things is hard. The problem is featured in a widely known joke. (“There are two hard problems in computer science…”) We also know that names can be chosen because they’re provocative. (“There’s no such thing as bad publicity.”) Discussing names gets the benefit of the doubt regarding topicality. The comment in question is off topic qua a kind of behavior.

                                                                                              Thank you all for your replies.

                                                                                            3. 1

                                                                                              I suppose to a progressive, the title would sound like “make medium awful again” – the exact opposite of what the author is trying to convey!

                                                                                              (I didn’t even pick up on the political reference until you pointed it out.)

                                                                                              1. 1

Can’t speak for others, but to me the original intent was clear given the context. But it’s hard to divorce the connotations of oppression and hate from it. As @JordiGH said so eloquently, at this point it’s impossible to ignore politics as they won’t ignore you. Using this language will hurt people. I assume this wasn’t anyone’s intention in choosing this name, so I’m just trying to point this out hoping that when the next time comes around people can make a more informed decision.

                                                                                            1. 3

I’ve flown on Christmas Day or Christmas Eve before – same idea.

                                                                                              Travel is very seasonal and you will get significantly better rates and experience if you travel when others don’t.

                                                                                              You should be able to just use the price as a gauge of popularity, and without even measuring Wi-Fi. Tickets on Thanksgiving day and Christmas Eve/Day are much cheaper, because there’s less demand. Airline pricing is very finely tuned so this shows up pretty clearly.

                                                                                              Also Tuesdays seem to be cheaper. A very common business flight is to fly out Sunday night or Monday morning, and then return to your home city on Friday. If you fly on Tuesdays and Saturdays, you can get a cheaper rate than by following the traditional business traveler schedule.

                                                                                              This is pretty apparent if you just go on Orbitz or any of the other aggregators, and sort by price!

                                                                                              1. 1

                                                                                                So if I understand this correctly, OSH will end up being implemented in a custom subset of Python 2, assuming this “OVM2” will remain compatible with it. I can appreciate wanting to make a better version of Python 2 more geared towards the goals of your project, but I find the idea nonetheless devious, unless you’re planning to turn this OVM2 into the Oil language compiler at some point (therefore turning OSH into an Oil program)?

                                                                                                1. 3

                                                                                                  Yes! The OVM2 work is not just infrastructure to support OSH. The Python-like semantics will be exposed to users as part of the Oil language.

                                                                                                  Oil is basically a cross between bash and Python (with some features of Ruby (blocks) and JS). I said some of that further down in the thread I linked in this post:

                                                                                                  https://lobste.rs/s/jrzwgy/what_is_systems_programming_really#c_xnvie9

                                                                                                  This is of course a lot of work (writing a small Python interpreter), but the point of this post is to say that I’ve “audited” the situation and believe it’s possible.

                                                                                                  1. 1

                                                                                                    Nice, I’m looking forward to seeing how it develops.

                                                                                                    By the way, I’ve noticed you mentioned removing all instances of closures from the OSH code; do you have any special plans for them in Oil?

                                                                                                    1. 1

                                                                                                      I guess I would want to sort out the difference between Python and JS closures. Are they the same? Python has the global and nonlocal keywords, but I believe that is mostly a syntactic difference not a semantic one.

                                                                                                      I don’t see closures used very often in Python. The two usages I had in Oil weren’t “real” closures, as the captured variable didn’t outlive the parent scope.

Closures are a lot more common and idiomatic in JavaScript code. The combination of function literals and closures leads to a different and often elegant style in JS, although this style doesn’t scale that well, and the JavaScript world seems to have moved on to other idioms (promises, virtual DOM to get your normal control flow back, etc.)

                                                                                                      1. 1

                                                                                                        I think they’re the same, I don’t know Python all that well but I don’t think there’s any difference.

                                                                                                        Although this style doesn’t scale that well, and the JavaScript world seems to have moved onto other idioms

                                                                                                        Trust me, closures are still used a lot in JS; I cannot see writing many parts of the React code I’m writing at work without closures. I think the reason why they’re not used in Python is mostly because Guido doesn’t like them, so he made them unpleasant to use.

Also, if you’re concerned about the captured variables in closures, have you considered making them explicit like in C++’s lambda syntax? I’ve always thought it was a really neat feature which I would definitely steal if I ever wrote a language.

                                                                                                        1. 1

                                                                                                          Yeah I’d like to make sure JavaScript idioms with respect to closures work in Oil (although this is a LONG way off, probably at least a year.)

                                                                                                          But yes a problem I have is that variable capture isn’t explicit! I usually prefer the explicit state of classes.

                                                                                                          There are probably idioms I don’t use just because Python doesn’t really support them.

Another issue I had is the difference between rebinding, like captured = true, and mutating, like captured.field = true. Python supports the latter, but the former requires nonlocal. But I actually don’t see that many use cases for the former – it’s mostly the latter as far as I can tell!

                                                                                                          In other words, I think Python is mostly missing function literals, which is a parsing thing, and it doesn’t actually have to do with nonlocal.
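Here’s a quick Python 3 sketch of that rebinding-vs-mutation distinction (Python 2 doesn’t have nonlocal at all; the names are made up purely for illustration):

    # Rebinding a captured name needs `nonlocal`; mutating the object it
    # refers to does not, because the name itself is never reassigned.

    def make_flag():
        captured = False

        def set_flag():
            nonlocal captured   # without this, `captured = True` would just
            captured = True     # create a new local inside set_flag()

        def get_flag():
            return captured

        return set_flag, get_flag


    def make_record():
        captured = {"field": False}

        def set_field():
            captured["field"] = True   # mutation only: no nonlocal needed

        def get_field():
            return captured["field"]

        return set_field, get_field


    set_flag, get_flag = make_flag()
    set_flag()
    assert get_flag() is True

    set_field, get_field = make_record()
    set_field()
    assert get_field() is True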

                                                                                                          If you want to discuss further feel free to e-mail me at andy@oilshell.org, or https://oilshell.zulipchat.com/ . Although as mentioned keep in mind this is very far off :)

                                                                                                          I would like to understand the implementation of closures better and how that affects their semantics. It’s not super straightforward as far as I can tell – the Lua 5.0 paper devotes a pretty large portion to it.

(I think the issue there is that the Lua compiler generates code on the fly rather than generating an AST you can do name analysis on. Shell has something of a problem like that because statements are parsed and executed one at a time. I was trying to get away from that for Oil though.)

                                                                                                1. 1

                                                                                                  Why not use rpython instead?

                                                                                                  1. 2

                                                                                                    In a similar vein, have you looked at python transpilers? For example, here is one that turns python into go then compiles it: https://github.com/google/grumpy

                                                                                                    I suppose the output binary size would be fairly large though (Go binaries are pretty large in general)…

                                                                                                    1. 1

                                                                                                      The FAQ here addresses PyPy: http://www.oilshell.org/blog/2018/03/04.html#faq

                                                                                                      I could probably add one on RPython specifically, but in short:

                                                                                                      RPython is overkill because it handles the full generality of Python, which takes hundreds of thousands of lines of code.

That is, RPython and PyPy are bigger than CPython!!! The goal is to make something simpler than bash, and that approach would end up more complex than bash.

                                                                                                      And PyPy takes forever to compile with the RPython toolchain – more than 1 hour IIRC.

                                                                                                      1. 1

                                                                                                        Right but the question isn’t why don’t you use pypy - it’s why don’t you use rpython, which is something you don’t have to maintain or ship with the shell. It’s not clear to me why you want to write this on top of a stripped down interpreter rather than compiling.

                                                                                                        1. 3

                                                                                                          OK sure:

                                                                                                          1. The build dependencies matter for a shell. Shells are often cross-compiled to embedded systems. For example every Android phone has a shell (mksh), which was ported to the somewhat quirky Android build process.

                                                                                                          The RPython toolchain doesn’t fit there at all, and it supports many fewer CPU architectures than C or C++.

2. Interpreters written in RPython take forever to compile, so it will slow down the dev process. Compiling 10K lines of C++ code with no dependencies is super fast.

                                                                                                          There is the “double interpretation” problem, and I will have an answer for that at some point, but it’s too early to get into details. There is a bunch of bootstrapping to do.

3. I eventually want to bootstrap Oil in Oil, so writing Oil in RPython is a distraction from that. This is pretty much like how the Go compiler was written in C, but it’s now largely written in Go.

                                                                                                          RPython is very different from Python, and isn’t a pleasant language to program in (by all accounts). It would probably make the Oil code significantly larger and less readable.

4. JITing shell scripts isn’t desirable IMO. It’s a lot of extra complexity and probably won’t benefit most workloads (just like PyPy’s JIT made the OSH parser slower and use more memory, not faster.)
                                                                                                    1. 3

                                                                                                      I really don’t understand the logic in implementing bash. It seems like 21 months of effort to re-implement bash and the real work of designing a better shell language isn’t started?

                                                                                                      Devil’s advocate, why is migrating old scripts to a hypothetical new language important? Why not have a way to compile a new shell into bash instead for deployment?

                                                                                                      1. 6

                                                                                                        Short answer: fish and zsh already exist, but they are generally used “on top of” bash. Linux distros and modern cloud systems all use bash:

                                                                                                        http://www.oilshell.org/blog/2018/01/28.html

                                                                                                        The goal is to replace bash, not just make a new shell.

                                                                                                        Another way to put it: a big motivation for Oil was digging around the guts of Debian and being horrified… :)


                                                                                                        Also, there’s a big list of other shells here: https://github.com/oilshell/oil/wiki/ExternalResources

                                                                                                        1. 2

                                                                                                          I fully agree with the mission, I still don’t understand why replacing bash involves reimplementing it? I suppose it is like a C++ situation? bash++ ?

                                                                                                          1. 5

                                                                                                            How else would you replace bash? Why didn’t any of the other shells I linked replace bash?

                                                                                                            Compatibility is really important – nobody has time to rewrite thousands of lines of bash scripts (and not just time, but expertise). If Oil isn’t compatible, they’ll keep using bash out of inertia.

                                                                                                            I myself have written thousands of lines of bash scripts, and I wouldn’t even port them into a new language wholesale! I would port them incrementally if there is enough benefit.

The Python 2 -> 3 transition should give an idea of how much inertia a programming language has, even for minor changes. That is, it was probably 10x more inertia than the Python dev team thought.

                                                                                                            1. 3

                                                                                                              To me the point is that it enables people with pre-existing bash scripts to move to a better shell system. This completely removes any barriers for people to migrate from bash to Oil.

                                                                                                          2. 3

                                                                                                            Although it wouldn’t lead to Oil, I considered new implementations for verifying correctness of configs and secure operation:

                                                                                                            Formally-Verified Interpreter for a Shell-like Language with interesting comments by AndyC

                                                                                                            SHILL: A Secure, Shell-Scripting Language

                                                                                                            1. 1

                                                                                                              Also, good question about compiling Oil to bash, answered here:

                                                                                                              https://www.reddit.com/r/ProgrammingLanguages/comments/9i3mnd/does_anybody_know_of_a_language_that_compiles_to/e6l6z80/

                                                                                                              It’s not out of the question, but I don’t think it’s as straightforward as it seems, or a clear win.

                                                                                                            1. 1

Does anyone know how it is related to Thompson’s idea (https://swtch.com/~rsc/regexp/regexp1.html ; the original paper seems to be behind a paywall) of how to match regular expressions?

                                                                                                              1. 11

I’m not quite sure how to answer the question since it’s kind of broad. In the most simplistic sense, they both implement matching using finite state machines that execute in time proportional to the input being searched. Aho-Corasick is one of many different ways to search for multiple strings, and IIRC, Navarro and Raffinot classify it as a prefix-oriented approach that generalizes from the idea of KMP for single-pattern search. (The other categories of string search being “suffix” and “factor based” approaches.)

                                                                                                                In the usual framing, an Aho-Corasick matcher is typically implemented as a simulation of a non-deterministic finite automaton, where the “non determinism” comes from following failure transitions. That is, for each byte of input, it is possible to follow N failure transitions before advancing to the next byte of input (where I believe N is proportional to the number of strings you’re trying to match). However, such a matcher can be converted to a deterministic finite automaton by pre-computing all of the failure transitions. Once you have that, each byte of input can be processed in a constant number of instructions, regardless of how many strings you’re searching. Of course, the downside of this approach is that it requires more memory. (In the literature, you usually see the NFA variant just called “Aho-Corasick” and the DFA variant called “Advanced Aho-Corasick.”)
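To make the failure-transition idea concrete, here’s a rough Python sketch of the NFA flavor: a trie plus failure links, nothing optimized (every name here is just for illustration):

    from collections import deque

    def build(patterns):
        # Each node: {'next': {ch: node_id}, 'fail': node_id, 'out': [patterns ending here]}
        nodes = [{'next': {}, 'fail': 0, 'out': []}]
        for pat in patterns:
            cur = 0
            for ch in pat:
                if ch not in nodes[cur]['next']:
                    nodes.append({'next': {}, 'fail': 0, 'out': []})
                    nodes[cur]['next'][ch] = len(nodes) - 1
                cur = nodes[cur]['next'][ch]
            nodes[cur]['out'].append(pat)
        # BFS to compute failure links (longest proper suffix that is also a trie prefix).
        queue = deque(nodes[0]['next'].values())
        while queue:
            cur = queue.popleft()
            for ch, nxt in nodes[cur]['next'].items():
                queue.append(nxt)
                f = nodes[cur]['fail']
                while f and ch not in nodes[f]['next']:
                    f = nodes[f]['fail']
                cand = nodes[f]['next'].get(ch, 0)
                nodes[nxt]['fail'] = cand if cand != nxt else 0
                nodes[nxt]['out'] += nodes[nodes[nxt]['fail']]['out']
        return nodes

    def search(nodes, text):
        state, hits = 0, []
        for i, ch in enumerate(text):
            # Chase failure links until a transition on `ch` exists (or we hit the root).
            while state and ch not in nodes[state]['next']:
                state = nodes[state]['fail']
            state = nodes[state]['next'].get(ch, 0)
            for pat in nodes[state]['out']:
                hits.append((i - len(pat) + 1, pat))
        return hits

    nodes = build(["he", "she", "his", "hers"])
    print(search(nodes, "ushers"))   # [(1, 'she'), (2, 'he'), (2, 'hers')]

Pre-computing all of those failure chases into a full transition table is exactly the NFA-to-DFA step described above.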

                                                                                                                Thompson’s construction targets a strictly more general use case beyond just searching for literal strings, and is usually implemented by compiling a regex down into a small set of bytecodes for implementing alternations and repetitions. An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions. The actual time complexity here remains the same I think, but the processing of the various “split” bytecodes in Thompson’s construction in a VM of sorts is generally going to be quite slow.
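Here’s a toy version of the split-bytecode VM just described, hand-assembled for the regex a|bc (a sketch, not any real engine’s bytecode format), mostly to show where the per-byte overhead comes from:

    # Each SPLIT forks the thread list, so per input byte we do work proportional
    # to the number of live threads, instead of a DFA's single table lookup.
    CHAR, SPLIT, MATCH = 'char', 'split', 'match'

    prog = [
        (SPLIT, 1, 3),   # try both the `a` branch and the `bc` branch
        (CHAR, 'a'),
        (MATCH,),
        (CHAR, 'b'),
        (CHAR, 'c'),
        (MATCH,),
    ]

    def add_thread(threads, pc):
        """Follow epsilon (SPLIT) edges so the list only holds CHAR/MATCH pcs."""
        if pc in threads:
            return
        op = prog[pc]
        if op[0] == SPLIT:
            add_thread(threads, op[1])
            add_thread(threads, op[2])
        else:
            threads.add(pc)

    def run(text):
        clist = set()
        add_thread(clist, 0)
        for ch in text:
            nlist = set()
            for pc in clist:
                op = prog[pc]
                if op[0] == CHAR and op[1] == ch:
                    add_thread(nlist, pc + 1)
            clist = nlist
        return any(prog[pc][0] == MATCH for pc in clist)

    print(run('a'), run('bc'), run('b'))   # True True False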

                                                                                                                It would be interesting to try and draw a more concrete correspondence between Thompson’s “split” bytecode and Aho-Corasick’s failure transitions.

I do also know that Paul Wankadia (current maintainer of RE2) has done some really cool work on removing the various split bytecodes from the normal Thompson construction, at the cost of doing a bit more work translating the regex to bytecodes. But I still don’t think it will approach the speed of Aho-Corasick. To a first approximation, I would somewhat expect Aho-Corasick to be about an order of magnitude faster than the typical Thompson construction with a VM. If you JIT the Thompson construction (which has been done), then all bets are off. :-)

For completeness, note that there is also Glushkov’s construction algorithm, which also constructs an NFA for matching regexes, but with different trade-offs than Thompson’s construction. That is, Glushkov’s algorithm does epsilon removal, whereas Thompson’s doesn’t. But this means you get a bigger machine (more transitions) and spend more time building it. I’ve wanted to experiment with Glushkov’s construction for a while now, but I always get stuck when I try to figure out how to handle giant Unicode character classes (which also thwarts a lot of interesting bit-parallel implementations of NFAs).

                                                                                                                1. 2

                                                                                                                  An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                                                                                                                  I don’t think this follows because both algorithms essentially compile their problems down to DFAs. The compile time might be different (and mostly negligible), but if the DFA is optimized, then I don’t see any reason that the execution time should be different.

                                                                                                                  The regex -> NFA -> DFA -> optimization transformation does feel like “magic” in that sense, in that you can solve a harder problem in the same amount of time, at least from the POV of computational complexity.

                                                                                                                  In Oil, I use re2c to compile a huge number of regexes down to a series of switch/goto statements, with no backtracking, and it seems to be very fast. There are some links to code at the beginning of this post: http://www.oilshell.org/blog/2017/12/17.html

                                                                                                                  As far as I understand, re2c is basically doing the “Thompson” or Dragon Book transformation. Perhaps with some DFA optimizations that I don’t understand. But the expressiveness of the problem is the same – it accepts alternations of arbitrary regexes.

                                                                                                                  It doesn’t do any backtracking or “split” – it’s all just switch and goto.
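As a rough Python analogue of what that generated code does (re2c emits switch/goto in C rather than a table, but the shape is the same): the regex a|bc flattened into a transition table, one lookup per byte, no backtracking.

    # Hand-written DFA for a|bc, just to illustrate "constant work per byte".
    ACCEPT = {2}
    TABLE = {
        0: {'a': 2, 'b': 1},
        1: {'c': 2},
        2: {},
    }

    def dfa_match(text):
        state = 0
        for ch in text:
            state = TABLE[state].get(ch)
            if state is None:          # no transition: reject immediately
                return False
        return state in ACCEPT

    print(dfa_match('a'), dfa_match('bc'), dfa_match('b'))   # True True False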

                                                                                                                  The actual time complexity here remains the same I think,

                                                                                                                  As long as both algorithms produce a DFA, the search should use O(n) time and O(1) space.

                                                                                                                  I think you are saying that the “split” bytecodes from [2] necessary to simulate the NFA make things slower.

                                                                                                                  But I am not sure this is true, based on [1]:

                                                                                                                  In a sense, Thompson’s NFA simulation is executing the equivalent DFA: each List corresponds to some DFA state, and the step function is computing, given a list and a next character, the next DFA state to enter. Thompson’s algorithm simulates the DFA by reconstructing each DFA state as it is needed.


                                                                                                                  All this is complicated by another fact I just read here:

                                                                                                                  https://research.swtch.com/yaccalive

                                                                                                                  The now-standard framing of Thompson’s paper as an NFA construction and the addition of caching states to produce a DFA were both introduced later, by Aho and Ullman in their various textbooks. Like many primary sources, the original is different from the textbook presentations of it. (I myself am also guilty of this particular misrepresentation.)

                                                                                                                  (my emphasis)

                                                                                                                  I think he’s saying that Thompson’s original method was actually based on regular expression derivatives (Brzozowski derivatives) and not the McNaughton-Yamada algorithm.

                                                                                                                  The derivatives method produces an optimized DFA directly – it doesn’t require the NFA -> unoptimized DFA intermediate stages.
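To illustrate just the core of the derivatives idea (the matching step, not the DFA construction; the representation here is made up for the example): differentiate the regex by each input character and check whether the result can match the empty string.

    EMPTY, EPS = ('empty',), ('eps',)          # matches nothing / matches ""

    def lit(c):    return ('lit', c)
    def seq(a, b): return ('seq', a, b)
    def alt(a, b): return ('alt', a, b)
    def star(a):   return ('star', a)

    def nullable(r):
        tag = r[0]
        if tag in ('eps', 'star'):  return True
        if tag in ('empty', 'lit'): return False
        if tag == 'seq':            return nullable(r[1]) and nullable(r[2])
        if tag == 'alt':            return nullable(r[1]) or nullable(r[2])

    def deriv(r, c):
        tag = r[0]
        if tag in ('empty', 'eps'): return EMPTY
        if tag == 'lit':            return EPS if r[1] == c else EMPTY
        if tag == 'alt':            return alt(deriv(r[1], c), deriv(r[2], c))
        if tag == 'star':           return seq(deriv(r[1], c), r)
        if tag == 'seq':
            d = seq(deriv(r[1], c), r[2])
            return alt(d, deriv(r[2], c)) if nullable(r[1]) else d

    def matches(r, s):
        for c in s:
            r = deriv(r, c)
        return nullable(r)

    # (ab)*c
    r = seq(star(seq(lit('a'), lit('b'))), lit('c'))
    print(matches(r, 'ababc'), matches(r, 'abc'), matches(r, 'abac'))   # True True False

A real implementation would also simplify/canonicalize the intermediate regexes so they don’t blow up, and the DFA construction treats each distinct (canonicalized) derivative as a state.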


                                                                                                                  I just watched a couple talks about derivatives recently, including one about redgrep (also by Paul Wankadia), which uses derivatives – not the McNaughton-Yamada algorithm used by RE2:

                                                                                                                  https://github.com/google/redgrep

                                                                                                                  redgrep: from regular expression derivatives to LLVM

                                                                                                                  This one is a lot more friendly and shows the code on one slide basically:

                                                                                                                  Regular Expression Derivatives in Python

                                                                                                                  https://github.com/MichaelPaddon/epsilon

                                                                                                                  I would like to replace Oil’s usage of re2c with this technique eventually. I’m not sure if it will make things faster, but it might make the code size smaller. The lexer is a single function that’s around 23K of native code right now.

                                                                                                                  Anyway I still have a lot more to learn about this, and to actually try out the “new” algorithms in code :) But I thought it was funny to see that little footnote that Thompson’s algorithm as explained in those long articles isn’t really Thompson’s algorithm. This is not a minor point – derivatives are a REALLY different algorithm.

                                                                                                                  (Note that parsing with derivatives is different than regex with derivatives. Cox is criticizing parsing with derivatives in that article, but he says that regex with derivatives is a “win-win” – the algorithm is smaller, and the output is more optimized (smaller DFA). I guess that was the point of the redgrep exploration.)


[1] https://swtch.com/~rsc/regexp/regexp1.html

                                                                                                                  [2] https://swtch.com/~rsc/regexp/regexp2.html

                                                                                                                  1. 3

                                                                                                                    Minor update: The original re2c [1] paper does cite the Dragon Book, so at least in 1994 it used the McNaughton-Yamada algorithm to generate a DFA (not an NFA).

                                                                                                                    From there it says that generating assembly code from the DFA is “straightforward” (section 3.2).

                                                                                                                    The code is quite large now but I’m guessing it still follows that basic algorithm, along with a bunch of optimizations for DFA size, and then the big complication of submatch extraction (which Russ Cox’s articles also deal heavily with).

                                                                                                                    [1] http://re2c.org/1994_bumbulis_cowan_re2c_a_more_versatile_scanner_generator.pdf


                                                                                                                    Also if anyone is still reading who wants to manipulate regexes for a practical purpose (i.e. portably implementing extended glob as in bash/ksh), check out my issues :) https://github.com/oilshell/oil/issues?q=is%3Aissue+is%3Aopen+label%3Acomputer-science

                                                                                                                    1. 1

                                                                                                                      Hiya! I’m a huge fan of the work you’re doing with Oil. I am really excited to see how it turns out. :-)

                                                                                                                      I don’t think this follows because both algorithms essentially compile their problems down to DFAs. The compile time might be different (and mostly negligible), but if the DFA is optimized, then I don’t see any reason that the execution time should be different.

                                                                                                                      The regex -> NFA -> DFA -> optimization transformation does feel like “magic” in that sense, in that you can solve a harder problem in the same amount of time, at least from the POV of computational complexity.

                                                                                                                      In Oil, I use re2c to compile a huge number of regexes down to a series of switch/goto statements, with no backtracking, and it seems to be very fast. There are some links to code at the beginning of this post: http://www.oilshell.org/blog/2017/12/17.html

                                                                                                                      As far as I understand, re2c is basically doing the “Thompson” or Dragon Book transformation. Perhaps with some DFA optimizations that I don’t understand. But the expressiveness of the problem is the same – it accepts alternations of arbitrary regexes.

                                                                                                                      It doesn’t do any backtracking or “split” – it’s all just switch and goto.

                                                                                                                      I’m actually not terribly familiar with re2c, but I think there might be some confusion here. Thompson’s construction is really about building an NFA and then simulating that. At least, that’s what his original paper describes. (scihub this: https://dl.acm.org/citation.cfm?id=363387) The original paper actually describes something closer to a JIT, since the output of his compiler is object code. But the object code is still executing a simulation of an NFA. This is very clear via the discussion around CLIST and NLIST handling at match time.

                                                                                                                      So to be clear, I wouldn’t say the output of Thompson’s construction is itself a DFA, but rather, an NFA that can either 1) be interpreted as-is or 2) further transformed into an actual DFA. (2) can itself be split into a couple different strategies. One of them is the “lazy DFA,” which I believe was pioneered by Thompson, but in his implementation of grep, and definitely not in his original IBM 7094 paper. A lazy DFA is something that essentially caches the result of interpreting the NFA during match time. This tends to work well in practice and also permits setting a bound of the amount of memory you use, at the cost of essentially falling back to the performance characteristics of NFA interpretation if you reach that bound. The other strategy for (2) is your normal DFA building: transform the NFA into a DFA via powerset construction, and then possibly, DFA minimization. I’m guessing that’s what re2c is doing. They are still likely using Thompson’s construction, but it’s only one piece of the puzzle.
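For what it’s worth, the lazy-DFA idea can be sketched in a few lines (toy NFA for a|bc, no cache bound or eviction, purely illustrative):

    # Simulate the NFA by sets of states, but memoize (state_set, byte) -> next
    # state_set so hot paths degrade into single dictionary lookups.
    NFA = {                       # state -> {char: set of next states}
        0: {'a': {3}, 'b': {1}},
        1: {'c': {3}},
        3: {},                    # accepting
    }
    ACCEPTING = {3}

    cache = {}

    def step(states, ch):
        key = (states, ch)
        if key not in cache:       # compute the powerset transition once...
            nxt = set()
            for s in states:
                nxt |= NFA[s].get(ch, set())
            cache[key] = frozenset(nxt)
        return cache[key]          # ...and reuse it on every later occurrence

    def lazy_dfa_match(text):
        states = frozenset({0})
        for ch in text:
            states = step(states, ch)
            if not states:
                return False
        return bool(states & ACCEPTING)

    print(lazy_dfa_match('bc'), lazy_dfa_match('ba'), len(cache))   # True False 3

A real engine additionally bounds the cache and falls back toward plain NFA interpretation when the bound is hit, as described above.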

                                                                                                                      But yeah, the expressiveness of everything here is all equivalent. It’s a weird framing, but at this point, we tend to couple theoretical terms (“NFA”, “DFA”, and so on) with specific performance characteristics. e.g., an NFA uses a small amount of memory to execute, but has very high overhead while a DFA can use lots of memory but has very small constant factors.

                                                                                                                      As long as both algorithms produce a DFA, the search should use O(n) time and O(1) space.

                                                                                                                      I think you are saying that the “split” bytecodes from [2] necessary to simulate the NFA make things slower.

                                                                                                                      But I am not sure this is true, based on [1]:

                                                                                                                      In a sense, Thompson’s NFA simulation is executing the equivalent DFA: each List corresponds to some DFA state, and the step function is computing, given a list and a next character, the next DFA state to enter. Thompson’s algorithm simulates the DFA by reconstructing each DFA state as it is needed.

                                                                                                                      I think it’s pretty uncontroversial to state that the interpretation of an NFA is going to be, in practice, slower than an implementation that reflects the key characteristic of a DFA: a small constant number of instructions per processed byte, typically in the form of a transition lookup, advancing to the next state and checking for a match condition. An interpretation of the NFA is usually going to result in a number of instructions proportional to the size of your regex.

                                                                                                                      There are some interesting exceptions, but I don’t think it really changes the above much. One exception is bit parallel NFAs, which can work quite well—perhaps even better than DFAs—but only for smaller machines. Another exception is JIT. If you JIT Thompson’s construction, then the performance characteristics change from the normal “interpretation” of an NFA, which usually resembles a VM of sorts.

                                                                                                                      The now-standard framing of Thompson’s paper as an NFA construction and the addition of caching states to produce a DFA were both introduced later, by Aho and Ullman in their various textbooks. Like many primary sources, the original is different from the textbook presentations of it. (I myself am also guilty of this particular misrepresentation.)

                                                                                                                      (my emphasis)

                                                                                                                      I think he’s saying that Thompson’s original method was actually based on regular expression derivatives (Brzozowski derivatives) and not the McNaughton-Yamada algorithm.

                                                                                                                      The derivatives method produces an optimized DFA directly – it doesn’t require the NFA -> unoptimized DFA intermediate stages.

                                                                                                                      Thompson’s original paper isn’t long, so I think I’d actually recommend that you just read it. :-) It should be clear that Thompson isn’t talking about a DFA, but an NFA.

                                                                                                                      I’m not sure what Cox means by his statement. Certainly, Thompson’s original paper does not talk about DFA state caching, but the construction and execution is undoubtedly an NFA simulation. The only thing I can think of here is that Thompson never actually mentions “NFA” or “DFA” in his paper, and therefore, framing Thompson’s contribution in those terms is something that others have added. But it kind of seems like a distinction without much of a difference, so I might be missing Cox’s point here.

                                                                                                                      I would like to replace Oil’s usage of re2c with this technique eventually. I’m not sure if it will make things faster, but it might make the code size smaller. The lexer is a single function that’s around 23K of native code right now.

                                                                                                                      Anyway I still have a lot more to learn about this, and to actually try out the “new” algorithms in code :) But I thought it was funny to see that little footnote that Thompson’s algorithm as explained in those long articles isn’t really Thompson’s algorithm. This is not a minor point – derivatives are a REALLY different algorithm.

                                                                                                                      (Note that parsing with derivatives is different than regex with derivatives. Cox is criticizing parsing with derivatives in that article, but he says that regex with derivatives is a “win-win” – the algorithm is smaller, and the output is more optimized (smaller DFA). I guess that was the point of the redgrep exploration.)

                                                                                                                      Yeah, Paul’s talk on redgrep is pretty freaking sweet.

                                                                                                                      I wish I could remember why I haven’t done much with regex derivatives. I’ve researched them before, but it’s been a while now at this point. If I had to guess, I’d say that I probably get stuck with figuring out how to handle large Unicode character classes (which is also where I get stuck in Glushkov’s construction and in trying bit parallel NFAs).

                                                                                                                      1. 2

                                                                                                                        BTW Darius Bacon had the same question about Thompson’s paper and derivatives and addressed it here:

                                                                                                                        https://news.ycombinator.com/item?id=18434225

                                                                                                                        I guess there is a related idea called Antimirov Derivatives / Partial Derivatives, and that leads to the NFA? I have to understand it better. It looks like Antimirov derivatives didn’t have a name until 1995, but Thompson must have used that idea (?).

The algorithm to generate a DFA in redgrep / epsilon based on derivatives seems to be completely different. It does NOT generate an NFA; it generates a DFA directly (and it has fewer states than some implementations of Thompson construction -> DFA minimization).

                                                                                                                        The hard part of that algorithm seems to be canonicalizing the DFA states so that you can compare them with equality. Everybody seems to mention that issue. But it seems like it’s a short algorithm.

                                                                                                                        Paul W. mentioned partial derivatives in his redgrep talk with respect to capturing groups. I guess that was an open problem in redgrep.

                                                                                                                        1. 2

                                                                                                                          Thanks for the kind words about Oil. I’m happy to have help from people familiar with regex implementation :)

                                                                                                                          Two somewhat separate points:

1. With regard to the original question: How does Aho-Corasick relate to Thompson’s construction? I agree the question isn’t super clear, but I don’t think “Thompson’s method uses an NFA and Aho-Corasick uses a DFA, and NFA has more overhead than DFA” (i.e. because there are split bytecodes) quite captures it.

                                                                                                                          If I understand correctly, the difference between NFA and DFA really comes down to submatch extraction. You can extract submatches with the slower NFA method, but doing it with a DFA was impossible or very difficult.

                                                                                                                          The authors of re2c published a new algorithm in 2017 for doing submatch extraction with compiled DFAs, but I haven’t gone over it:

                                                                                                                          http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf

                                                                                                                          Anyway, the point is that submatch extraction with DFAs is hard. That is borne out in both RE2 and re2c. IIRC, the whole point of the “multi-engine” backend in RE2 by Cox is to handle this. It uses the slower method when it needs to extract submatches, and the faster DFA method when it doesn’t.

                                                                                                                          If you are NOT extracting submatches, which is the case for the Aho-Corasick problem, then you might as well just convert your NFA to a DFA and match more quickly.
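
For recognition only, the NFA-to-DFA step is the textbook subset (powerset) construction. A self-contained sketch with a made-up NFA encoding (nothing from RE2 or re2c); '' marks epsilon moves:

    def epsilon_closure(states, transitions):
        """All states reachable from `states` through epsilon ('') moves alone."""
        stack, closure = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in transitions.get(s, {}).get('', set()):
                if t not in closure:
                    closure.add(t)
                    stack.append(t)
        return frozenset(closure)

    def nfa_to_dfa(start, accepts, transitions, alphabet):
        """transitions: state -> {symbol: set of states}. DFA states are frozensets."""
        start_set = epsilon_closure({start}, transitions)
        dfa_trans, dfa_accepts = {}, set()
        seen, work = {start_set}, [start_set]
        while work:
            cur = work.pop()
            if cur & accepts:
                dfa_accepts.add(cur)
            for a in alphabet:
                nxt = set()
                for s in cur:
                    nxt |= transitions.get(s, {}).get(a, set())
                nxt = epsilon_closure(nxt, transitions)
                dfa_trans[(cur, a)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    work.append(nxt)
        return start_set, dfa_accepts, dfa_trans

    # Tiny example: NFA for "ab" (state 2 accepting).
    nfa = {0: {'a': {1}}, 1: {'b': {2}}}
    d_start, d_accepts, d_trans = nfa_to_dfa(0, {2}, nfa, 'ab')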

                                                                                                                          I basically think of NFAs and DFAs as equivalent, since they are mathematically. They recognize the exact same class of languages.

                                                                                                                          This all boils down to terminology, but the way I think of it, there are two worlds. (And this is pretty much exactly the point that the Cox articles make.)

                                                                                                                          a) Perl-style regexes, PCRE, Python – this is backtracking world. They’re not using NFAs or DFAs.

b) awk, egrep, libc, RE2, re2c – this is NFA/DFA world or “Thompson’s construction” / Dragon Book / McNaughton-Yamada.

                                                                                                                          In my mind I don’t think of “Thompson’s construction” as NFA. But yes I see that Cox’s articles use “Thompson NFA” a lot. However I think the issue is not NFA vs. DFA. It’s NFA vs. backtracking, or really NFA/DFA vs. backtracking. That’s how I think about it and I think it captures the salient difference.

                                                                                                                          Anyway, this has probably hopelessly confused the original poster … long story short, I have never actually worked with Aho-Corasick, but my point is that if you create a regex like const_string_1|const_string_2|...|const_string_N, and you do the Thompson NFA -> DFA, you will end up with something that runs as fast as Aho-Corasick (since all DFAs can be interpreted in linear time and constant space).

                                                                                                                          And if you optimize the DFA, maybe you will end up with the exact same one (just guessing, I don’t know that.)
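
The “linear time and constant space” part is just the DFA matching loop: one table lookup per input character, no backtracking and no per-input allocation. Here is a sketch of anchored (whole-string) recognition, using a made-up table format of (state, char) -> state; an unanchored search would restart at each position or bake a leading .* into the machine.

    def dfa_accepts_string(trans, accepting, start, text):
        """One transition per character; missing entries mean a dead state."""
        DEAD = object()
        state = start
        for ch in text:
            state = trans.get((state, ch), DEAD)
            if state is DEAD:
                return False
        return state in accepting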


2. I went back and skimmed the Thompson paper, and I see the parts about CLIST and NLIST. It is not clear to me that this is about NFAs. As you say, it doesn’t mention those words. And it talks about Brzozowski derivatives in the first paragraph.

                                                                                                                          http://www.oilshell.org/archive/Thompson-1968.pdf

                                                                                                                          I don’t know 7094 assembly or ALGOL-60, so it’s hard for me to interpret. I tend to trust what Russ Cox says.

                                                                                                                          I think it’s plausible that it is using NFAs, because the Dragon Book explains it that way, and Russ Cox explained it that way in the first paragraphs of his first article.

                                                                                                                          But I think he went back and read it later and realized it was about derivatives! It’s also plausible to me that the technique is based on derivatives. In fact it is the most plausible to me because the paper mentions derivatives and does not mention NFAs or DFAs!!!

                                                                                                                          I guess to really answer this question I’d have to go implement both algorithms and see. (Or maybe just e-mail Russ Cox.)

                                                                                                                          (I actually implemented what were known as 2 separate parsing algorithms here and realized they are the same: http://www.oilshell.org/blog/2016/11/01.html . The author of precedence climbing acknowledged this to be true – links at end. Derivatives and McNaughton-Yamada seem very different to me, but maybe there is a deep connection.)

                                                                                                                          1. 1

With regard to the original question: How does Aho-Corasick relate to Thompson’s construction? I agree the question isn’t super clear, but I don’t think “Thompson’s method uses an NFA and Aho-Corasick uses a DFA, and NFA has more overhead than DFA” (i.e. because there are split bytecodes) quite captures it.

                                                                                                                            Hmm I don’t think I said that. :-/ Sorry if that’s how it came across. The typical Aho-Corasick formulation uses failure transitions at match time, so it isn’t a DFA. It’s an NFA. “Advanced” Aho-Corasick is the classical DFA variant.

                                                                                                                            It doesn’t really make sense to compare and contrast the theoretical properties of Aho-Corasick and Thompson’s construction, because they are both just alternative ways of framing the construction of an NFA, which can eventually be turned into equivalent DFAs through powerset construction and minimization. The interesting comparison is at the implementation level, and specifically, the construction algorithms themselves. Aho-Corasick is a fundamentally more constrained algorithm, and doesn’t need to handle the full generality of regular expressions.
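
To make the NFA-with-failure-transitions vs. DFA distinction concrete, here is a compact Python sketch of the classical construction (illustrative only, not the aho-corasick crate). The build step makes a trie plus failure links; the “advanced” step precomputes every transition so the failure loop disappears at match time, trading memory for speed.

    from collections import deque

    def build_aho_corasick(patterns):
        """Classical Aho-Corasick: trie + failure links (the 'NFA' variant)."""
        goto = [{}]       # goto[state][char] -> state
        fail = [0]        # failure link for each state
        out = [set()]     # patterns ending at this state (including via failure links)
        for p in patterns:                       # 1. build the trie
            s = 0
            for ch in p:
                if ch not in goto[s]:
                    goto.append({})
                    fail.append(0)
                    out.append(set())
                    goto[s][ch] = len(goto) - 1
                s = goto[s][ch]
            out[s].add(p)
        queue = deque(goto[0].values())          # 2. failure links, breadth-first
        while queue:
            s = queue.popleft()
            for ch, t in goto[s].items():
                queue.append(t)
                f = fail[s]
                while f and ch not in goto[f]:
                    f = fail[f]
                fail[t] = goto[f].get(ch, 0)
                out[t] |= out[fail[t]]
        return goto, fail, out

    def search(goto, fail, out, haystack):
        """Match loop of the NFA variant: chase failure links on a mismatch."""
        s = 0
        for i, ch in enumerate(haystack):
            while s and ch not in goto[s]:
                s = fail[s]
            s = goto[s].get(ch, 0)
            for p in out[s]:
                yield (i - len(p) + 1, p)

    def to_dfa(goto, fail, alphabet):
        """'Advanced' variant: precompute every (state, char) transition up front."""
        delta = [dict() for _ in goto]
        for s in range(len(goto)):
            for ch in alphabet:
                f = s
                while f and ch not in goto[f]:
                    f = fail[f]
                delta[s][ch] = goto[f].get(ch, 0)
        return delta

    goto, fail, out = build_aho_corasick(['Sam', 'Samwise'])
    print(list(search(goto, fail, out, 'Samwise')))   # [(0, 'Sam'), (0, 'Samwise')]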

                                                                                                                            If I understand correctly, the difference between NFA and DFA really comes down to submatch extraction. You can extract submatches with the slower NFA method, but doing it with a DFA was impossible or very difficult.

                                                                                                                            Your second sentence is true, but the first sentence doesn’t really make sense to me honestly. If we’re talking about an implementation that reflects the structure of an NFA, then you’re almost always going to be looking at some interpretation of the NFA at match time (or, perhaps, a bit-parallel implementation if the machine is small enough or a JIT). This looks very very different from what a DFA implementation looks like.

                                                                                                                            Thank you for the link about re2c. I hadn’t seen that. Currently, Rust’s regex library (like RE2) needs to use an NFA to do submatch capture, which stinks, because it’s much slower than the DFA.

                                                                                                                            If you are NOT extracting submatches, which is the case for the Aho-Corasick problem, then you might as well just convert your NFA to a DFA and match more quickly.

                                                                                                                            This is too much of a simplification. The DFA variant of Aho-Corasick requires a lot more memory than the NFA variant. In my Aho-Corasick library, this is exposed as an option for users, so they can decide the time vs space trade off themselves.

                                                                                                                            I wonder if your Oil use case is making this a bit more difficult to grasp. In that context, where you have a tokenizer that rarely changes and a tool to convert your DFA into code, then an NFA seems rather silly. But in a more general context, where you’re building regexes at runtime and matching them, the specific trade offs between space and time become much more relevant. In that context, eagerly building a DFA is a pretty rare thing to do (as supported by numerous implementations such as RE2, Go’s regex library, GNU grep, Thompson’s original grep and of course my own regex library too).

                                                                                                                            I basically think of NFAs and DFAs as equivalent, since they are mathematically. They recognize the exact same class of languages.

                                                                                                                            Yes… But like I said in my comment above, we are well beyond theory here. The vernacular in this domain couples theoretical terms with implementation details. There are very real differentiating properties between “I implemented an NFA matcher” and “I implemented a DFA matcher.”

                                                                                                                            Take a look at section 6 of the original paper that introduced Aho-Corasick. It explicitly introduces a DFA variant, and explicitly calls out its downside as increased memory usage but faster matching. This is in contrast to the initial and classical formulation with failure transitions that uses less memory but is slower at matching.

                                                                                                                            In my mind I don’t think of “Thompson’s construction” as NFA. But yes I see that Cox’s articles use “Thompson NFA” a lot. However I think the issue is not NFA vs. DFA. It’s NFA vs. backtracking, or really NFA/DFA vs. backtracking. That’s how I think about it and I think it captures the salient difference.

No, I don’t think so. I mean, yes, of course it’s a critical difference, but that’s really not what we’re talking about here. We’re talking about Aho-Corasick vs the Thompson construction (or, RE2 if you like). Bringing up backtracking engines is well outside the scope, and moreover, it is not the only distinction that matters. You yourself noted the variety of implementation strategies used by RE2. These are precisely a reflection of the various properties of typical NFA vs DFA implementations. Waving all of this away honestly makes this entire discussion rather moot.

                                                                                                                            Anyway, this has probably hopelessly confused the original poster … long story short, I have never actually worked with Aho-Corasick, but my point is that if you create a regex like const_string_1|const_string_2|…|const_string_N, and you do the Thompson NFA -> DFA, you will end up with something that runs as fast as Aho-Corasick (since all DFAs can be interpreted in linear time and constant space).

                                                                                                                            That sounds right, yes, but it’s a somewhat trivial theoretical result. e.g., Given two regular expressions, you can determine whether the languages they recognize are equivalent by converting both to a minimal DFA. If the resulting DFAs are equivalent up to state renaming, then you know they recognize the same languages. Framing it this way misses the forest for the trees; there are very real implementation distinctions that arise in practice between regex implementations and Aho-Corasick implementations. A lot of it is really about tweaking time vs space trade offs. For example, RE2 (and also Rust’s regex library) will never actually build the full DFA of a regex eagerly. Instead, it’s constructed lazily. That lazy construction is really useful because it gets you very close to the full speed of a DFA but allows more precise control over the space used. Aho-Corasick, on the other hand, is going to use far less memory. In practice, for small regexes of literal alternations, you might choose to use an Aho-Corasick DFA (“Advanced Aho-Corasick”) instead of using the full blown regex engine, because the Aho-Corasick DFA is likely faster than the fully general lazy DFA.
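
The lazy-DFA idea is roughly: run the subset construction during the match, one state at a time, and cache the states you have already seen (a real engine also bounds and evicts that cache, which is where the precise control over space comes from). A toy sketch over an epsilon-free NFA, for anchored recognition only; this is not RE2’s or the regex crate’s actual code.

    def lazy_dfa_match(nfa_trans, nfa_start, nfa_accepts, text):
        """nfa_trans maps (nfa_state, char) -> set of nfa_states.
        DFA states (frozensets of NFA states) are built only when first reached."""
        cache = {}                                  # (frozenset, char) -> frozenset
        cur = frozenset({nfa_start})
        for ch in text:
            key = (cur, ch)
            if key not in cache:                    # compute this transition on demand
                nxt = set()
                for s in cur:
                    nxt |= nfa_trans.get((s, ch), set())
                cache[key] = frozenset(nxt)
            cur = cache[key]
        return bool(cur & nfa_accepts)

The interesting design point is that the cache only ever holds the DFA states the input actually visits, so pathological blow-ups are paid for lazily (or not at all), while the per-byte cost stays close to a plain table lookup once a transition is cached.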

                                                                                                                            I went back and skimmed the Thompson paper, and I see the parts about CLIST and NLIST. It is not clear to me that this is about NFAs.

It is. No DFA implementation is going to be explicitly manipulating sets of states at match time. Derivatives are certainly used to motivate the construction, but it’s still an NFA.

                                                                                                                            I’m not quite sure how to fix your confusion on this topic. :-/ I think if I could say one thing, it would be that NFA vs DFA have very real distinctions in their typical implementations, even if theoretically they give the same power. This is why we can speak separately about “Thompson’s construction” and “DFA”, because they are not intricately coupled.

I feel like a read of Navarro and Raffinot’s “Flexible Pattern Matching in Strings” would help clear up a lot of this for you, because they make the NFA vs DFA distinction very clear at the implementation level with lots of examples. Unfortunately, the book is a little pricey. If we were friends in real life, I’d lend you my copy. :-)

                                                                                                                            1. 1

                                                                                                                              I did some work on this that clarifies things somewhat – at least I think so, let me know what you think.

                                                                                                                              https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks

                                                                                                                              I don’t think we are disagreeing that much about technical points. I was somewhat nitpicking the original answer because I didn’t think it illuminated the core issue. And because I want to understand the difference myself, for Oil! I think this will matter eventually and I’m not just arguing for its own sake :)

                                                                                                                              It doesn’t really make sense to compare and contrast the theoretical properties of Aho-Corasick and Thompson’s construction, because they are both just alternative ways of framing the construction of an NFA, which can eventually be turned into equivalent DFAs through powerset construction and minimization.

                                                                                                                              I think that totally makes sense! That is very important to understand!!! Does everyone reading this thread know that?

                                                                                                                              I don’t believe it’s common knowledge and it’s one of the first things I would say to someone who says “compare Thompson’s construction and Aho-Corasick”. If it’s true that they boil down to the same DFA, then that’s important to understand. (I think Aho-Corasick might behave differently with respect to overlapping matches. If that’s the main difference then that would clarify things.)

                                                                                                                              People generally care about execution time, not compilation time. Compilation time/memory depends on the size of the expressions, which is generally small (like bytes or kilobytes), while execution depends on the size of the input data, which could be gigabytes, terabytes, etc.

                                                                                                                              For one project I routinely used regexes over terabytes of data. I never thought about compilation time, or NFA vs. DFA. It wasn’t a choice you’re really able to make as a user of regular expressions. But you can make the choice of backtracking vs. automata-based (and not everyone is aware of that choice).

                                                                                                                              This is too much of a simplification. The DFA variant of Aho-Corasick requires a lot more memory than the NFA variant. In my Aho-Corasick library, this is exposed as an option for users, so they can decide the time vs space trade off themselves.

                                                                                                                              I don’t understand this… Look at the generated code I linked for re2c:

                                                                                                                              https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/_gen/fixed-strings.cc

                                                                                                                              https://raw.githubusercontent.com/oilshell/blog-code/master/fgrep-problem-benchmarks/_gen/code-size.txt

                                                                                                                              This code uses almost no memory, and the code size is tiny as well. It’s matching using a DFA. Why can’t Aho-Corasick do the same? Doesn’t it boil down to the same Trie-like DFA?

                                                                                                                              Framing it this way misses the forest for the trees; there are very real implementation distinctions that arise in practice between regex implementations and Aho-Corasick implementations. A lot of it is really about tweaking time vs space trade offs. For example, RE2 (and also Rust’s regex library) will never actually build the full DFA of a regex eagerly. Instead, it’s constructed lazily.

                                                                                                                              For the case that the input is an alternation of constant strings, why not do it? That seems to be exactly what re2c does. I get a trie-like DFA out of re2c and that appears to be what Aho-Corasick does.

                                                                                                                              Of course re2c has more time to compile things, since it’s offline, so that could be a difference. It does have a --dfa-minimization flag for the strategy. It could be that I am underestimating how hard that is.

                                                                                                                              But I’m only talking about the constant string case. I know that DFA minimization in general is hard, but what about minimizing an NFA with no repetition at all? (which is the Aho-Corasick problem)


                                                                                                                              I didn’t realize you had implemented the Rust aho-corasick library, but I linked it in my repo. (You should have mentioned that in the first reply! :) )

                                                                                                                              If I’m understanding correctly, I think it’s going to construct the same trie-like DFA as re2c did, and therefore it will run no faster than re2c. Or possibly slower since it’s interpreted and not compiled. I would love a PR for a benchmark :)

(Note that grep runs faster than re2c! Pretty amazing, and I think it’s due to skipping input, but I didn’t verify that in any way. grep also runs faster than fgrep on this problem!)


                                                                                                                              About Thompson’s paper: I now agree it’s about NFAs. After looking at Figure 5, it’s obvious it is an NFA for a(b|c)d.

                                                                                                                              I’m still confused by the derivatives comment in the first paragraph. The 2009 paper that “reintroduced” regex derivatives actually makes a comment on this exact line at the end in the “Related Work” section:

                                                                                                                              Regular-expression derivatives re-examined

I’ll definitely be looking at derivatives more! Let me know if you do any work with them. Apparently they do support Unicode well and this paper specifically talks about large character classes. The “epsilon” video by Michael Paddon I linked earlier says that Unicode was one of his primary motivations for derivatives.

                                                                                                                              There is a big Unicode data file in the repo: https://github.com/MichaelPaddon/epsilon/blob/master/epsilon/ucd.py

                                                                                                                              Also take a look at the table in section 5.2 on DFA size. It compares the derivatives method vs. a Thompson construction and says it’s always better.

I guess Aho-Corasick is probably always better than Thompson too for the subset of the problem it solves, as far as DFA size, but I would love to see if that translates to execution time. I think it will be hard to construct a case where re2c’s algorithm isn’t good enough and you have to use Aho-Corasick.


                                                                                                                              Also I see the clist and nlist thing here. It wasn’t clear from your comment what those were supposed to be doing in the Thompson paper, but it’s a lot clearer in this explanation:

                                                                                                                              https://swtch.com/~rsc/regexp/regexp1.html

                                                                                                                              The simulation uses two lists: clist is the current set of states that the NFA is in, and nlist is the next set of states that the NFA will be in, after processing the current character.
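
In Python-ish form, the clist/nlist simulation is small. This is a sketch in the style of the bytecode from Cox’s articles (just char, split and match instructions, and anchored whole-string matching), not his actual code.

    def thompson_match(prog, text):
        """prog: list of ('char', c, next), ('split', a, b) or ('match',) instructions."""
        def addthread(lst, pc):
            if pc in lst:
                return
            lst.add(pc)
            op = prog[pc]
            if op[0] == 'split':                   # follow both branches eagerly
                addthread(lst, op[1])
                addthread(lst, op[2])
        clist = set()                              # states the NFA is in now
        addthread(clist, 0)
        for ch in text:
            nlist = set()                          # states after consuming ch
            for pc in clist:
                op = prog[pc]
                if op[0] == 'char' and op[1] == ch:
                    addthread(nlist, op[2])
            clist = nlist
        return any(prog[pc][0] == 'match' for pc in clist)

    # a(b|c)d from the Thompson paper, compiled by hand.
    prog = [('char', 'a', 1), ('split', 2, 3), ('char', 'b', 4),
            ('char', 'c', 4), ('char', 'd', 5), ('match',)]
    assert thompson_match(prog, 'acd') and not thompson_match(prog, 'ad')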


                                                                                                                              Anyway, thanks for this discussion! I’m definitely learning a lot. I want to eventually unify the mess of regexes in a typical shell script for Oil, so that’s why I’m so interested in this.

                                                                                                                              1. 1

                                                                                                                                I think that totally makes sense! That is very important to understand!!! Does everyone reading this thread know that?

                                                                                                                                Not sure. I jumped right over the theory part because in the theory, everything folds into equivalence with the same expressive power. The only real practical distinction there that I can think to make is that I remember NFAs being easier to construct than DFAs in my computability theory class (many moons ago), which in turn made it easier to use NFAs instead of DFAs to prove a particular language to be regular.

                                                                                                                                If it’s true that they boil down to the same DFA, then that’s important to understand. (I think Aho-Corasick might behave differently with respect to overlapping matches. If that’s the main difference then that would clarify things.)

Right… Good point. You probably can’t say for certain that they’ll boil down to the same DFA without being a bit more careful. Overlapping matches are probably part of it, but even in the non-overlapping case, the typical Aho-Corasick implementation will still probably have subtly different match semantics than the typical regex DFA implementation. That is, regexes typically implement “leftmost first” match semantics (the natural consequence of a backtracking implementation) or “leftmost longest” match semantics (found primarily in POSIX-compatible regex implementations). For example, given the patterns Sam and Samwise and the haystack Samwise:

                                                                                                                                • Sam|Samwise matches Samwise in leftmost longest.
                                                                                                                                • Samwise|Sam matches Samwise in leftmost longest.
                                                                                                                                • Sam|Samwise matches Sam in leftmost first.
                                                                                                                                • Samwise|Sam matches Samwise in leftmost first.

So with leftmost longest semantics, the order of your alternation doesn’t matter. But with leftmost first semantics, the order does matter. RE2 (along with Go’s and Rust’s regex engines) all provide leftmost first semantics even though they are finite automata implementations. The reason for doing so is to match the semantics of popular regex implementations such as PCRE. Notably, RE2 (and Go) both permit the caller to switch to leftmost longest semantics if they want. Generally speaking, I’ve found leftmost first semantics to be more useful.
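
Since Python’s re module is a backtracking engine, it exhibits leftmost-first alternation and can reproduce two of the four cases above directly (there is no leftmost-longest mode in re; for that you would need a POSIX engine, or RE2/Go/Rust switched into longest-match mode):

    import re

    # Leftmost first: the earliest alternative that matches at the start position wins.
    assert re.search(r'Sam|Samwise', 'Samwise').group() == 'Sam'
    assert re.search(r'Samwise|Sam', 'Samwise').group() == 'Samwise'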

                                                                                                                                Back to Aho-Corasick… In most typical implementations that I know of (including my own), it will report all matches. So for Sam|Samwise, it would match at both Sam and Samwise. However, it’s also easy to tweak the matching routine (without changing the automaton) to not report overlapping matches by simply moving back to the start state after a match is found. In this case, Aho-Corasick will report Sam as a match but not Samwise regardless of the order of the patterns. In this sense, Aho-Corasick has neither leftmost first nor leftmost longest semantics, but leftmost shortest. (Generally, I see this as a problem and it’s one of the things on my list to fix. In particular, I’d like for Aho-Corasick to have the same match semantics as the regex engine so that Aho-Corasick can be used to completely replace the use of the regex engine internally when you have an alternation of literals.)

                                                                                                                                People generally care about execution time, not compilation time. Compilation time/memory depends on the size of the expressions, which is generally small (like bytes or kilobytes), while execution depends on the size of the input data, which could be gigabytes, terabytes, etc.

                                                                                                                                For one project I routinely used regexes over terabytes of data. I never thought about compilation time, or NFA vs. DFA. It wasn’t a choice you’re really able to make as a user of regular expressions. But you can make the choice of backtracking vs. automata-based (and not everyone is aware of that choice).

                                                                                                                                Sure, but this is mostly a domain specific thing. I’d agree that if you’re working on a problem in which re2c is worth using in the first place, then you probably always want to spend the extra time in compilation to make matching faster. In that sense, choosing between NFA and DFA is an easy choice: pick the DFA. But this is because you’re paying for compilation when you compile your program itself. In a typical regex engine, the cost of compilation is paid during the execution of your program. While users have generally grown to expect that regex compilation is not exactly super cheap, you generally don’t have all day either. For re2c, it’s conceivable that you might be willing to spend minutes building the DFA. But for a normal regex engine, you probably don’t want to spend more than a millisecond building your machine for reasonably sized regexes.

                                                                                                                                The other thing to look at here is that, in terms of implementation choices, an implementation modeled after an NFA tends to have more power in that it can resolve capture groups. Now, you did point out a paper that purports to do this in a DFA, but this generally isn’t how it has been done in practice (RE2, Go, Rust). I haven’t read the paper yet, so I’m also not sure what the trade offs are. My point here is that, depending on implementation, choosing between NFA and DFA might depend on what question you’re asking.

                                                                                                                                I don’t understand this… Look at the generated code I linked for re2c:

                                                                                                                                https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/_gen/fixed-strings.cc

                                                                                                                                https://raw.githubusercontent.com/oilshell/blog-code/master/fgrep-problem-benchmarks/_gen/code-size.txt

                                                                                                                                This code uses almost no memory, and the code size is tiny as well. It’s matching using a DFA. Why can’t Aho-Corasick do the same? Doesn’t it boil down to the same Trie-like DFA?

                                                                                                                                I don’t think looking at generated code out of context can really do much. e.g., I don’t know what set of strings went into the production of that machine. I’d guess it was small though. So in that sense, it’s no surprise that the corresponding DFA is small. When it comes to Aho-Corasick, I suspect that the size of the DFA is indeed proportional to the total size of all the patterns you’ve compiled into it. (I’m not 100% on that.) But in the case of regexes, with its repetition operators, the size of the DFA can be exponential in the size of the original regex.

                                                                                                                                Try building an automaton that matches any word in /usr/share/dict/words.
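
A quick back-of-the-envelope sketch of why that is a stress test: count the distinct trie states (prefixes) in a word list and multiply by a dense-table cost per state. The path is the usual Linux location, and the ~1KB-per-state figure assumes 256 transitions of 4 bytes each, as described below.

    # Roughly how many trie states does a whole dictionary need?
    states = {''}                                  # the root (empty prefix)
    with open('/usr/share/dict/words') as f:
        for word in f:
            word = word.strip()
            for i in range(1, len(word) + 1):
                states.add(word[:i])               # one state per distinct prefix
    print(len(states), 'states')
    print(len(states) * 1024 // (1 << 20), 'MB for a naive dense table at ~1KB/state')

On a typical word list this is on the order of hundreds of thousands of states, so the naive dense table runs to hundreds of megabytes, which is exactly the kind of blow-up that the sparse and equivalence-class tricks below are meant to tame.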

                                                                                                                                For the case that the input is an alternation of constant strings, why not do it? That seems to be exactly what re2c does. I get a trie-like DFA out of re2c and that appears to be what Aho-Corasick does.

It really just depends on your tolerance for memory-vs-speed-vs-compile-time. If you’re using re2c, then you almost certainly fall into the “I don’t care about compile time, and I probably have a very high tolerance of memory usage since that’s merely reflected in code size” camp. There are some secondary effects here of course, e.g., a large code size might result in thrashing your instruction cache.

Of course re2c has more time to compile things, since it’s offline, so that could be a difference. It does have a --dfa-minimization flag for the strategy. It could be that I am underestimating how hard that is.

                                                                                                                                But I’m only talking about the constant string case. I know that DFA minimization in general is hard, but what about minimizing an NFA with no repetition at all? (which is the Aho-Corasick problem)

Yes, while the general case is exponential, I believe that at least building an Aho-Corasick DFA is linear with respect to the number of literals you’re trying to match. I don’t know whether the resulting DFA is also minimal. But remember, memory usage is another key component here. The DFA is going to use a lot more of it. Usually, if you’re building a DFA, you’re either doing it at source compile time (so it’s embedded in the code as a state machine) or you’re doing it at runtime, which means it’s probably a table-based DFA. Naively, for a byte-based automaton, that’s 256 transitions for each state, where each transition needs to be big enough to point to any arbitrary state. If you choose 32 bits, then you’re already at 1KB per state. Now, obviously, there are different ways to represent DFAs. You could pick a sparse representation of the transitions, but this almost always sacrifices execution time because a sparse lookup will probably be slower. Beyond that, there are other tricks you can do like condensing the number of transitions per state to the minimal number of transitions needed by separating the space of all 256 bytes into equivalence classes. This incurs an extra lookup per byte at match time, but it’s generally worth it for the space savings.
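
Here is a small sketch of what that dense, table-based representation looks like with a byte-to-equivalence-class map bolted on the front (an illustrative layout, not any particular library’s). Setting a transition for one byte sets it for every byte in the same class, and each input byte costs one extra lookup in exchange for much narrower rows.

    import array

    class TableDFA:
        """Dense DFA: one row per state, one column per byte *class*; state 0 is dead."""
        def __init__(self, num_states, byte_classes):
            self.classes = array.array('B', byte_classes)        # 256 entries, byte -> class
            self.num_classes = max(byte_classes) + 1
            self.table = array.array('I', [0] * (num_states * self.num_classes))
            self.accepting = set()

        def set_transition(self, state, byte, next_state):
            # Sets the transition for every byte in the same class as `byte`.
            self.table[state * self.num_classes + self.classes[byte]] = next_state

        def run(self, data, start=1):
            """Anchored run over `data` (bytes): one table lookup per input byte."""
            state = start
            for b in data:
                state = self.table[state * self.num_classes + self.classes[b]]
                if state == 0:                                   # dead state
                    return False
            return state in self.accepting

Without the class map, each row is 256 x 4 bytes = 1KB, which is where the figure above comes from; with, say, 16 classes, a row shrinks to 64 bytes.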

                                                                                                                                I didn’t realize you had implemented the Rust aho-corasick library, but I linked it in my repo. (You should have mentioned that in the first reply! :) )

                                                                                                                                If I’m understanding correctly, I think it’s going to construct the same trie-like DFA as re2c did, and therefore it will run no faster than re2c. Or possibly slower since it’s interpreted and not compiled. I would love a PR for a benchmark :)

(Note that grep runs faster than re2c! Pretty amazing, and I think it’s due to skipping input, but I didn’t verify that in any way. grep also runs faster than fgrep on this problem!)

                                                                                                                                Right, yeah, hah. I’ve been working on string search for a lot of years now. In terms of credentials, I also wrote Rust’s regex library, ripgrep (which bests grep in many cases) and one of the fastest CSV parsers in existence (which is particularly relevant to our conversation here, because I did it by compiling the CSV parser into a DFA on the stack, where as most CSV parsers use the standard NFA simulation embedded in the code).

                                                                                                                                Most or all of my work is focused on building finite state machines at runtime. This biases toward a very different set of trade offs than the offline compilation found in the domain of re2c. So I don’t think I would necessarily claim anything about performance. In general, I’d probably almost always give the edge to re2c over any equivalent machine compiled at runtime.

That grep runs faster is probably what I’d expect. A finite state machine has its limits. Its key advantage is the ability to express complex match semantics in a general way. But there are all sorts of special cases that can be optimized to go quite a bit faster than a normal finite state machine. You hinted at it with the “skipping input” aspect of grep (you’ve probably read this). But in modern times, it’s less to do with skipping input and more to do with the fact that the skip loop itself is written to use optimized vector instructions via memchr (glibc’s implementation of that is in x86 Assembly, my own implementation uses normal Rust code with vendor intrinsics). To that end, it turns out that you don’t even need to skip input to go very fast, but it’s instead better to pick the byte you skip on more wisely. That’s what led to my own FreqyPacked algorithm that uses a static notion of how common certain bytes are, and chooses the skip loop based on that. This is distinct from Boyer-Moore, which is usually written with a skip loop on the very last byte. That’s what GNU grep does. This specific algorithmic difference is a key reason why ripgrep edges out GNU grep in a lot of cases.
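
The “pick the byte you skip on more wisely” idea can be sketched independently of any particular engine: scan for a rarely occurring byte of the literal with a vectorized primitive (Python’s bytes.find bottoms out in an optimized memchr-style routine in C), then verify the full literal at each candidate. The frequency table here is a made-up stand-in for real corpus statistics, and this is not the FreqyPacked algorithm itself.

    # Hypothetical byte-frequency ranks (higher = more common in "typical" text).
    FREQ_RANK = {ord(' '): 255, ord('e'): 250, ord('t'): 245,
                 ord('z'): 20, ord('q'): 15, ord('X'): 10}

    def rarest_offset(needle: bytes) -> int:
        """Offset of the needle byte assumed to be rarest in the haystack."""
        return min(range(len(needle)), key=lambda i: FREQ_RANK.get(needle[i], 128))

    def find_literal(haystack: bytes, needle: bytes):
        """Skip loop anchored on the (assumed) rarest byte, then verify the literal."""
        off = rarest_offset(needle)
        rare = needle[off]
        i = haystack.find(rare, off)               # vectorized scan, memchr-like
        while i != -1:
            start = i - off
            if start >= 0 and haystack[start:start + len(needle)] == needle:
                yield start
            i = haystack.find(rare, i + 1)

    print(list(find_literal(b'Samwise son of Hamfast', b'wise')))   # [3]

A Boyer-Moore style loop skips on the last byte of the literal instead; the point of ranking bytes is that a rare byte gives the vectorized scan fewer false candidates to verify.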

I’ll definitely be looking at derivatives more! Let me know if you do any work with them. Apparently they do support Unicode well and this paper specifically talks about large character classes. The “epsilon” video by Michael Paddon I linked earlier says that Unicode was one of his primary motivations for derivatives.

                                                                                                                                Interesting! Thanks for the links. One of the tasks that’s nextish on my list (after building out a bit more infrastructure) is to rethink the optimization and compilation passes in Rust’s regex crate. No small part of that will be figuring out how to address big Unicode classes, because right now, they are a primary thing that causes Thompson’s NFA construction to go into the weeds.

                                                                                                                                Anyway, thanks for this discussion! I’m definitely learning a lot. I want to eventually unify the mess of regexes in a typical shell script for Oil, so that’s why I’m so interested in this.

                                                                                                                                Good luck!

                                                                                                                                1. 1

                                                                                                                                  I will respond more later, but I just tested ripgrep and it beats grep (and fgrep)! Impressive :)

                                                                                                                                  https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks

                                                                                                                                  Is ripgrep basically a wrapper around the Rust regex crate, or is there more to it?

I’m curious about RE2 now. Also curious about your aho-corasick library. I’m going to be surprised if it can actually take advantage of anything to perform better than ripgrep :)

                                                                                                                                  The fact that fgrep is slower than grep makes me doubt that the constant string alternation problem is really “easier” than the regex problem.

                                                                                                                                  1. 1

                                                                                                                                    Is ripgrep basically a wrapper around the Rust regex crate, or is there more to it?

                                                                                                                                    Hah… You’ll want to read the blog post I linked you to. :-) There is quite a bit more to it, but most of the interesting algorithmy stuff is in the regex engine. ripgrep itself is mostly engineering. Note that ripgrep’s regex engine is actually pluggable. By default it uses Rust’s regex engine, but the -P flag enables PCRE2, assuming you build ripgrep with PCRE2 support. (See README.)

                                                                                                                                    Your RE2 and re2c benchmarks aren’t quite measuring the same thing as your grep/fgrep/ripgrep benchmarks. The latter are counting lines, but the former are counting all matches. Consider this comparison, where the former corresponds to counting all matching lines while the latter corresponds to counting every individual match:

                                                                                                                                    $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -c
                                                                                                                                    2969020
                                                                                                                                    real    0.673
                                                                                                                                    user    0.655
                                                                                                                                    sys     0.017
                                                                                                                                    maxmem  343 MB
                                                                                                                                    faults  0
                                                                                                                                    
                                                                                                                                    $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -c -o
                                                                                                                                    3830430
                                                                                                                                    real    1.497
                                                                                                                                    user    1.458
                                                                                                                                    sys     0.037
                                                                                                                                    maxmem  343 MB
                                                                                                                                    faults  0
                                                                                                                                    

                                                                                                                                    Since your query has a lot of matches, this actually winds up being quite significant. I don’t think GNU grep actually has a way of counting every individual match, short of using -o and piping the output to wc -l. But then you wind up measuring the time it takes to not just count every match, but print every match too. You can see the difference with ripgrep, which admittedly isn’t too much, but it’s not nothing:

                                                                                                                                    $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -o -c
                                                                                                                                    3830430
                                                                                                                                    real    1.511
                                                                                                                                    user    1.492
                                                                                                                                    sys     0.017
                                                                                                                                    maxmem  343 MB
                                                                                                                                    faults  0
                                                                                                                                    
                                                                                                                                    $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -o | wc -l
                                                                                                                                    3830430
                                                                                                                                    real    1.684
                                                                                                                                    user    1.648
                                                                                                                                    sys     0.033
                                                                                                                                    maxmem  343 MB
                                                                                                                                    faults  0
                                                                                                                                    

I’m going to be surprised if it can actually take advantage of anything to perform better than ripgrep :)

I welcome all competitors! Beating ripgrep in all or most cases will be tricky. If you limit yourself to a specific subset of regexes and allow yourself to use offline compilation, then I’m sure there are wins to be had. There are lots of interesting optimizations at play here though, and generally finite state machines are your last line of defense because they are slower than optimized vector routines. But vector optimizations don’t really scale up to arbitrary patterns; they only really work for small regexes.

                                                                                                                      2. 2

wow, thanks for this long answer. seems like i have quite a bit of reading to do! :) so only a limited answer for some points:

                                                                                                                        An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

I haven’t really thought about it this way (regular expressions as the matched unit instead of words) until now. Maybe because most regexp examples/uses tend to match against a limited set of words, not infinite ones.

                                                                                                                        For completeness, note that there is also Glushkov’s construction algorithm

                                                                                                                        I didn’t know about Glushkov’s algorithm, interesting!

                                                                                                                      3. 3

                                                                                                                        BTW I put the original paper here, based on the long discussion below:

                                                                                                                        http://www.oilshell.org/archive/Thompson-1968.pdf

This got me interested and I published some benchmarks. Supposedly fgrep originally used Aho-Corasick. I’m not sure if GNU fgrep actually does now though! But regular grep is faster than fgrep on this problem, perhaps because it has better Boyer-Moore-like “input skipping”. That’s my conjecture anyway.

                                                                                                                        https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks


EDIT: Yeah I see this in grep.c in the GNU source. It seems like it doesn’t try to treat the fgrep problem specially, for example by using Aho-Corasick. Not sure why fgrep is slower than grep then; it should be the same speed if the same engine is used.

                                                                                                                          /* In a unibyte locale, switch from fgrep to grep if                                                                                                                                                                                        
                                                                                                                             the pattern matches words (where grep is typically faster).                                                                                                                                                                              
                                                                                                                             In a multibyte locale, switch from fgrep to grep if either                                                                                                                                                                               
                                                                                                                             (1) case is ignored (where grep is typically faster), or                                                                                                                                                                                 
                                                                                                                             (2) the pattern has an encoding error (where fgrep might not work).  */                                                                                                                                                                  
                                                                                                                          if (compile == Fcompile                                                                                                                                                                                                                     
                                                                                                                              && (MB_CUR_MAX <= 1                                                                                                                                                                                                                     
                                                                                                                                  ? match_words                                                                                                                                                                                                                       
                                                                                                                                  : match_icase || contains_encoding_error (keys, keycc)))                                                                                                                                                                            
                                                                                                                            {                                                                                                                                                                                                                                         
                                                                                                                              size_t new_keycc;                                                                                                                                                                                                                       
                                                                                                                              char *new_keys;                                                                                                                                                                                                                         
                                                                                                                              fgrep_to_grep_pattern (keycc, keys, &new_keycc, &new_keys); 
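
                                                                                                                        If I’m reading it right, fgrep_to_grep_pattern mostly just has to escape any regex metacharacters in the fixed strings so the regex engine treats them literally. I haven’t traced the C closely, so this is only a guess at the idea, sketched in Python with a made-up function name:

                                                                                                                          import re

                                                                                                                          def fixed_strings_to_regex(patterns):
                                                                                                                              # Hypothetical sketch: turn fgrep-style fixed strings into a single
                                                                                                                              # regex alternation by escaping any metacharacters in each string.
                                                                                                                              return '|'.join(re.escape(p) for p in patterns)

                                                                                                                          print(fixed_strings_to_regex(['foo(', 'a.b', 'done']))
                                                                                                                          # foo\(|a\.b|done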
                                                                                                                        
                                                                                                                        1. 1

                                                                                                                          Yeah, GNU fgrep and GNU grep should be the same speed on the same input, assuming both inputs are just normal literals. That’s because grep should be able to detect the case of a simple literal and run as if it were fgrep. For your benchmark, what input are you searching? I can try and re-run the benchmark for you and explain the discrepancy if one indeed exists.

                                                                                                                          Also, if you’re interested in more details on how grep is optimized, you might want to read my ripgrep article (or perhaps skim it, since it’s long, but don’t miss the parts about Teddy and memchr :-)).

                                                                                                                          1. 1

                                                                                                                            Thanks, that would be useful. You should be able to clone the repo and run it with the instructions at the top of fixed-strings.sh. Instead of all-benchmarks you can run grep-fixed-benchmark.

                                                                                                                            Basically this will just download a file, make it 10x bigger, and run grep on it.

                                                                                                                            The input is an alternation of literal strings, which I deem the “fgrep problem”. For me, grep is reliably faster, but I don’t understand why since I think they should be using the same engine either way. There is some attempt to convert fgrep to grep, but I guess I don’t understand under what conditions that happens.

                                                                                                                            But even if the conversion were failing, I would expect fgrep to do better than grep on the problem it’s designed for!!!

                                                                                                                            Actually this is my beef with Aho-Corasick too. I cannot think of an example where it’s going to be measurably faster or use less memory in practice than Thompson construction (at runtime, I don’t think you can measure compile time in a benchmark like this). Generally I think the difference between fgrep and grep comes down to engineering and not algorithms, which is why the results are inverted (conjecture).

                                                                                                                            There are 13 keywords, described in the README:

                                                                                                                            Matching this file:
                                                                                                                            -rw-rw-r-- 1 andy andy 338M Nov 10 11:09 _tmp/all-10.txt
                                                                                                                            
                                                                                                                            Against 13 keywords:
                                                                                                                               for while continue break if fi then else elif case esac do done
                                                                                                                            
                                                                                                                            1. 1

                                                                                                                              All right… So I played with your benchmark script. I hacked at it until it ran without error, and here’s my output: https://gist.github.com/BurntSushi/1203d5b56112a8c1bacff5b19273d956

                                                                                                                              There are definitely some issues with how you’re benchmarking this here. Namely, you can’t give regexes to fgrep, and the BRE syntax used by standard grep doesn’t support alternation. So in order to test with fgrep, you need to pass the alternation via a file using the -f flag (or, as you’ll see, use the -e flag). For testing standard grep, you need to use egrep. For GNU grep specifically, grep, fgrep and egrep are all just different front ends. They share the same backend essentially. What this means is that your benchmarks aren’t actually measuring the speed of alternations, but rather, a very long single literal search.

                                                                                                                              I’m going to diverge from your benchmark script and just type the actual commands we want to run. It will be easier this way to verify results. So I started with just a regex alternation. The first two are ripgrep, where the first searches with a memory map and the second uses standard read calls (the latter is what GNU grep does):

                                                                                                                              $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt| wc -l
                                                                                                                              2969020
                                                                                                                              real    0.838
                                                                                                                              user    0.803
                                                                                                                              sys     0.033
                                                                                                                              maxmem  343 MB
                                                                                                                              faults  0
                                                                                                                              
                                                                                                                              $ time rg --no-mmap 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt| wc -l
                                                                                                                              2969020
                                                                                                                              
                                                                                                                              real    0.891
                                                                                                                              user    0.823
                                                                                                                              sys     0.067
                                                                                                                              maxmem  6 MB
                                                                                                                              faults  0
                                                                                                                              
                                                                                                                              $ time egrep 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt| wc -l
                                                                                                                              2969001
                                                                                                                              real    1.223
                                                                                                                              user    1.111
                                                                                                                              sys     0.110
                                                                                                                              maxmem  5 MB
                                                                                                                              faults  0
                                                                                                                              

                                                                                                                              For fgrep, we can’t use an alternation. Instead, we have to pass each pattern individually since fgrep doesn’t take regexes of any kind as input:

                                                                                                                              $ time fgrep -e for -e while -e continue -e break -e if -e fi -e then -e else -e elif -e case -e esac -e do -e done /tmp/oil.txt| wc -l
                                                                                                                              2969001
                                                                                                                              real    1.515
                                                                                                                              user    1.427
                                                                                                                              sys     0.087
                                                                                                                              maxmem  5 MB
                                                                                                                              faults  0
                                                                                                                              

                                                                                                                              Now this is interesting. There is a repeatable performance difference here between egrep and fgrep. However, this doesn’t appear to have anything to do with fgrep per se, but rather, the method of input. Specifically, if we give each pattern individually to egrep, then it has the same slowdown (and similarly for grep):

                                                                                                                              $ time egrep -e for -e while -e continue -e break -e if -e fi -e then -e else -e elif -e case -e esac -e do -e done /tmp/oil.txt| wc -l
                                                                                                                              2969001
                                                                                                                              real    1.537
                                                                                                                              user    1.411
                                                                                                                              sys     0.123
                                                                                                                              maxmem  5 MB
                                                                                                                              faults  0
                                                                                                                              

                                                                                                                              One of my hypotheses for the difference is match overhead. In particular, you’ve chosen a query that has a lot of matches. This tends to be a less common case for tools like grep, which are typically used interactively. Large result sets aren’t as useful because we humans can’t look through everything, so we tend to run searches that produce small result sets. As a result, cases with a large number of results might not receive as much optimization love (see, for example, the silver searcher). I tested my hypothesis by tweaking your query such that it doesn’t match anything:

                                                                                                                              $ time egrep 'faor|wbhile|ccontinue|bdreak|iXf|f#i|tghen|ehlse|eilif|cjase|eksac|d!o|dmone' /tmp/oil.txt
                                                                                                                              real    1.099
                                                                                                                              user    1.035
                                                                                                                              sys     0.063
                                                                                                                              maxmem  5 MB
                                                                                                                              faults  0
                                                                                                                              
                                                                                                                              $ time egrep -e faor -e wbhile -e ccontinue -e bdreak -e iXf -e 'f#i' -e tghen -e ehlse -e eilif -e cjase -e eksac -e 'd!o' -e dmone /tmp/oil.txt
                                                                                                                              real    1.482
                                                                                                                              user    1.433
                                                                                                                              sys     0.047
                                                                                                                              maxmem  5 MB
                                                                                                                              faults  0
                                                                                                                              

                                                                                                                              So my hypothesis is wrong. The -e variant is still slower, even though I believe they are semantically the same search. As a fun side note, ripgrep does really well here, and doesn’t suffer from the same (seeming) performance bug:

                                                                                                                              $ time rg 'faor|wbhile|ccontinue|bdreak|iXf|f#i|tghen|ehlse|eilif|cjase|eksac|d!o|dmone' /tmp/oil.txt
                                                                                                                              real    0.098
                                                                                                                              user    0.068
                                                                                                                              sys     0.029
                                                                                                                              maxmem  343 MB
                                                                                                                              faults  0
                                                                                                                              
                                                                                                                              $ time rg -e faor -e wbhile -e ccontinue -e bdreak -e iXf -e 'f#i' -e tghen -e ehlse -e eilif -e cjase -e eksac -e 'd!o' -e dmone /tmp/oil.txt
                                                                                                                              real    0.102
                                                                                                                              user    0.078
                                                                                                                              sys     0.024
                                                                                                                              maxmem  342 MB
                                                                                                                              faults  0
                                                                                                                              

                                                                                                                              The reason ripgrep is fast is that, for both searches, it isn’t actually using Aho-Corasick at all! Instead, it’s using a SIMD algorithm called Teddy. Teddy is part of the regex engine and is described in detail here: https://github.com/rust-lang/regex/blob/master/src/literal/teddy_ssse3/imp.rs — Teddy doesn’t do too well when there are a lot of matches, but when there are very few matches, like in the latter search above, it absolutely screams.
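
                                                                                                                              Teddy itself is SIMD-specific, so any scalar snippet is a caricature, but the high-level shape is “cheaply find candidate positions, then verify.” A toy version of that idea (emphatically not the real implementation, which finds candidates with SIMD over pattern fingerprints) also shows why lots of matches hurt: the expensive verify step runs constantly:

                                                                                                                                def prefilter_search(haystack, patterns):
                                                                                                                                    # Toy "prefilter then verify" search: candidate positions are
                                                                                                                                    # wherever any pattern's first character occurs; each candidate
                                                                                                                                    # is then verified against the full patterns.
                                                                                                                                    first_chars = {p[0] for p in patterns}
                                                                                                                                    matches = []
                                                                                                                                    for i, ch in enumerate(haystack):
                                                                                                                                        if ch in first_chars:                  # cheap filter
                                                                                                                                            for p in patterns:                 # expensive verify
                                                                                                                                                if haystack.startswith(p, i):
                                                                                                                                                    matches.append((i, p))
                                                                                                                                    return matches

                                                                                                                                print(prefilter_search('else elif done', ['else', 'elif', 'done']))
                                                                                                                                # [(0, 'else'), (5, 'elif'), (10, 'done')]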

                                                                                                                              Looking at grep’s source code (current master), this comment in GEAcompile sheds some light:

                                                                                                                                /* For GNU regex, pass the patterns separately to detect errors like
                                                                                                                                   "[\nallo\n]\n", where the patterns are "[", "allo" and "]", and
                                                                                                                                   this should be a syntax error.  The same for backref, where the
                                                                                                                                   backref should be local to each pattern.  */
                                                                                                                              

                                                                                                                              ripgrep’s default regex engine doesn’t support backreferences, so it doesn’t have to deal with this. In particular, ripgrep will take -e pat1 -e pat2 ... -e patN and just translate that to (?:pat1)|(?:pat2)|...|(?:patN) such that it’s one giant regex. Since it doesn’t have to worry about backreferences, it can do this. At this point, I’m out of gas w.r.t. the investigation, and it’s not a stretch to say that these two invocations go through different code paths. Since fgrep cannot invoke the pat1|pat2|...|patN code path, you can’t quite do a fair comparison unless you make egrep use the same code path as fgrep for multiple literals. And when you do that, they come out to be even.
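
                                                                                                                              To make that translation concrete, it’s essentially this (a sketch, not ripgrep’s actual code):

                                                                                                                                def combine_patterns(patterns):
                                                                                                                                    # Join -e pat1 -e pat2 ... -e patN into one big alternation,
                                                                                                                                    # wrapping each pattern in a non-capturing group.
                                                                                                                                    return '|'.join('(?:{})'.format(p) for p in patterns)

                                                                                                                                print(combine_patterns(['for', 'while', 'done']))
                                                                                                                                # (?:for)|(?:while)|(?:done)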

                                                                                                                              But even if the conversion were failing, I would expect fgrep to do better than grep on the problem it’s designed for!!!

                                                                                                                              I don’t know the origins of fgrep, but in today’s world, it’s primarily a UX thing and not a performance thing. Namely, it’s nice to be able to search for literal strings like Foo(. But if that’s interpreted as a regex, it’s probably a syntax error (unless you’re using BREs). So fgrep is really nice for that.

                                                                                                                              In a suboptimal implementation, it’s possible that fgrep will be faster than the equivalent egrep command. If that is indeed the case, then that just means the egrep implementation isn’t falling back to normal substring search when the pattern it has been given is just a literal. If egrep is just “parse and feed to the regex engine,” then there will be cases in which fgrep is faster. You need an explicit analysis phase that ducks out to normal substring search, bypassing the regex engine.
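
                                                                                                                              In other words, something like this minimal sketch of that analysis phase (real implementations do the literal search itself far more cleverly, e.g. with memchr and frequency heuristics):

                                                                                                                                import re

                                                                                                                                REGEX_META = set('.^$*+?()[]{}|\\')

                                                                                                                                def smart_search(pattern, haystack):
                                                                                                                                    # If the pattern contains no regex metacharacters, it is a plain
                                                                                                                                    # literal: skip the regex engine and use substring search.
                                                                                                                                    if not REGEX_META & set(pattern):
                                                                                                                                        return haystack.find(pattern) != -1
                                                                                                                                    return re.search(pattern, haystack) is not None

                                                                                                                                print(smart_search('Foo', 'call Foo() here'))      # literal fast path -> True
                                                                                                                                print(smart_search('Foo\\(', 'call Foo() here'))   # regex path -> True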

                                                                                                                              Actually this is my beef with Aho-Corasick too. I cannot think of an example where it’s going to be measurably faster or use less memory in practice than Thompson construction (at runtime, I don’t think you can measure compile time in a benchmark like this).

                                                                                                                              The problem with this statement is that “Thompson construction” isn’t really enough to go on. Are you referring to the NFA simulation of the Thompson construction? A lazy DFA of the Thompson construction? An eagerly compiled DFA of the Thompson construction? All of these things actually exist in practice.

                                                                                                                              Generally I think the difference between fgrep and grep comes down to engineering and not algorithms, which is why the results are inverted (conjecture).

                                                                                                                              I think this is too much of a simplification. As someone who has spent a lot of time building these types of tools, the real answer is very solidly “both.” :-)

                                                                                                                              1. 1

                                                                                                                                What this means is that your benchmarks aren’t actually measuring the speed of alternations, but rather, a very long single literal search.

                                                                                                                                That’s not true. As I just wrote, grep, fgrep, and ripgrep all return 2969020 results. BRE supports alternation with \|. You don’t need a flag to get alternation with fgrep either.

                                                                                                                                $ { echo foo; echo bar; }  | grep 'foo\|bar' 
                                                                                                                                foo
                                                                                                                                bar
                                                                                                                                
                                                                                                                                $ { echo foo; echo bar; }  | fgrep $'foo\nbar' 
                                                                                                                                foo
                                                                                                                                bar
                                                                                                                                
                                                                                                                                $ grep --version
                                                                                                                                grep (GNU grep) 2.25
                                                                                                                                Copyright (C) 2016 Free Software Foundation, Inc.
                                                                                                                                
                                                                                                                                1. 1

                                                                                                                                  Cute. I guess I was misled by this page then. It also indicates that BREs don’t have alternation: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

                                                                                                                                  I forgot that \n could be used to delimit patterns in fgrep. But it’s still taking the same code path AFAIK inside GNU grep when compared with -e foo -e bar. The code path is different from foo|bar.

                                                                                                                                2. 1

                                                                                                                                  After reading this, I’m still confused by why fgrep is slower than egrep.

                                                                                                                                  It should be either

                                                                                                                                  • The exact same speed, because it’s just a front end for the same engine.
                                                                                                                                  • Faster, because it’s solving a “simpler” problem and can use a faster algorithm. The whole thing that started this discussion was this:

                                                                                                                                  An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                                                                                                                                  I still don’t buy this! The fixed string alternation is not really easier. (I agree that there might be a small difference in theory, but I don’t think it matters in practice.)

                                                                                                                                  If you run your aho-corasick algorithm on this data set and it’s faster than the RE2 benchmark, that might convince me! (because RE2 is also counting all matches and not all lines, and it’s interpreted not compiled).

                                                                                                                                  I think this is too much of a simplification. As someone who has spent a lot of time building these types of tools, the real answer is very solidly “both.” :-)

                                                                                                                                  Maybe? But your post didn’t show that?

                                                                                                                                  The point that the re2c and RE2 benchmarks aren’t measuring the same thing is well taken, but that’s separate from the grep/fgrep issue.

                                                                                                                                  Are you referring to the NFA simulation of the Thompson construction? A lazy DFA of the Thompson construction? A eagerly compiled DFA of the Thompson construction? All of these things actually exist in practice.

                                                                                                                                  The eagerly compiled DFA, because I don’t think that’s hard to create in the alternation of constant strings case.

                                                                                                                                  I think you always get a DFA that’s going to run as fast as Aho-Corasick in that case. The NFA is very simple, and the DFA is very simple. I guess the test I should do is to run the simple powerset/subset construction on that case. I would be surprised if it “blows up” in this particular case (i.e. there are no repetitions, so the graph is very simple).
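
                                                                                                                                  Here’s roughly what I mean, as an untested sketch: the NFA states for an alternation of literals are just (pattern, offset) pairs, and the subset construction over them stays small:

                                                                                                                                    from collections import deque

                                                                                                                                    def alternation_dfa(patterns):
                                                                                                                                        # Sketch of the eager subset construction for an unanchored
                                                                                                                                        # alternation of literal strings.  NFA states are (pattern, offset)
                                                                                                                                        # pairs; keeping the start set inside every subset plays the role
                                                                                                                                        # of the "a match can begin anywhere" self-loop.  Characters that
                                                                                                                                        # appear in no pattern implicitly reset to the start state.
                                                                                                                                        start = frozenset((i, 0) for i in range(len(patterns)))
                                                                                                                                        alphabet = sorted({ch for p in patterns for ch in p})
                                                                                                                                        dfa = {}           # (dfa_state, char) -> dfa_state
                                                                                                                                        accepting = set()  # subsets containing a completed pattern
                                                                                                                                        queue, seen = deque([start]), {start}
                                                                                                                                        while queue:
                                                                                                                                            state = queue.popleft()
                                                                                                                                            if any(off == len(patterns[i]) for i, off in state):
                                                                                                                                                accepting.add(state)
                                                                                                                                            for ch in alphabet:
                                                                                                                                                nxt = {(i, off + 1) for i, off in state
                                                                                                                                                       if off < len(patterns[i]) and patterns[i][off] == ch}
                                                                                                                                                nxt = frozenset(nxt) | start
                                                                                                                                                dfa[state, ch] = nxt
                                                                                                                                                if nxt not in seen:
                                                                                                                                                    seen.add(nxt)
                                                                                                                                                    queue.append(nxt)
                                                                                                                                        return start, dfa, accepting

                                                                                                                                    start, dfa, accepting = alternation_dfa(['for', 'while', 'done'])
                                                                                                                                    print(len({s for s, _ in dfa} | set(dfa.values())))   # DFA state count

                                                                                                                                  Every reachable subset is the set of pattern prefixes ending at the current position, and those are all determined by the longest one, so the number of DFA states is bounded by the number of distinct prefixes (i.e. the size of the trie), which is why I don’t expect a blowup.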

                                                                                                                                  Every alternation should look the same from the engine’s point of view, e.g. foo|bar or spam|eggs|ham are the same problem in terms of creating an NFA, converting to DFA, and optimizing (if it’s necessary in that case).

                                                                                                                                  So do you think Aho-Corasick is going to do anything on this benchmark? To a first approximation, it seems like a different way to build a DFA. As you point out, you don’t even use it for this case! Teddy is apparently better.

                                                                                                                                  That’s essentially my point. I can’t think of any problems where Aho-Corasick will be better than an automata-based regex engine (“Thompson’s construction” in my terminology. They are all based on that.)

                                                                                                                                  Can you think of any examples of such problems?

                                                                                                                                  1. 1

                                                                                                                                    After reading this, I’m still confused by why fgrep is slower than egrep.

                                                                                                                                    Because that’s not the most precise statement we can make given the data. I carved the problem down to egrep 'pat1|pat2|...|patN' vs egrep -e pat1 -e pat2 ... -e patN in my previous comment, where the former is faster than the latter, and, crucially, the latter matches the speed of fgrep. So it’s not about “fgrep vs egrep,” but rather, a distinction in the code path. In other words, it looks like an unnecessary implementation quirk. But this is not a complete explanation. It looks like you’ll need to either ask someone who’s familiar with GNU grep automata construction, or dig into the code yourself. :-)

                                                                                                                                    An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                                                                                                                                    I still don’t buy this! The fixed string alternation is not really easier. (I agree that there might be a small difference in theory, but I don’t think it matters in practice.)

                                                                                                                                    If you run your aho-corasick algorithm on this data set and it’s faster than the re2c benchmark, that might convince me! (because re2c is also counting all matches and not all lines).

                                                                                                                                    I think we are unfortunately approaching diminishing returns in this discussion. We’re getting hung up on terminology. In my comment above, I was using “Thompson’s construction” as shorthand for “the NFA simulation of Thompson’s construction,” which roughly matches the algorithm in Thompson’s original paper. That is going to be very obviously and very trivially slower than even the NFA formulation of Aho-Corasick, and certainly much slower than the DFA formulation of Aho-Corasick.

                                                                                                                                    The above paragraph is 100% about implementation details. It has nothing to do with the theory.

                                                                                                                                    I think this is too much of a simplification. As someone who has spent a lot of time building these types of tools, the real answer is very solidly “both.” :-)

                                                                                                                                    Maybe? But your post didn’t show that?

                                                                                                                                    I’m not in a debate with you here, and I’m not trying to convince you of anything. I’m just trying to offer up my knowledge and experience. Do with it as you will.

                                                                                                                                    I think if you read my blog post about ripgrep, it should pretty solidly convince you that both algorithms and engineering are required. ripgrep, to a first approximation, uses the same algorithm as GNU grep. That is, both of them fundamentally use a lazy DFA as their work horse. But to a second approximation, the algorithms surrounding the lazy DFA, and in particular, the analysis performed on the regex, are very different. e.g., GNU grep does not use frequency based substring search and it does not have a vectorized multi-pattern matcher. Those are algorithmic differences. They aren’t huge, and someone could add them to GNU grep if they wanted to. But they are still differences. One other algorithmic difference has to do with how large Unicode character classes are handled. ripgrep is much faster than GNU grep in that space, and I suspect it largely has to do with how GNU grep implements POSIX locale support. But I’m not certain.

                                                                                                                                    The engineering aspect of these tools is also quite large, but not for any particularly interesting reasons. You just need to make sure you’re doing as little work as possible per line (or per byte) of input. Naive implementations (e.g., “read a file line by line and apply a regex to each line”) will invariably do much more work than is necessary. Rising above naive implementations requires engineering and smarter choices about how you find your matches.

                                                                                                                                    Are you referring to the NFA simulation of the Thompson construction? A lazy DFA of the Thompson construction? A eagerly compiled DFA of the Thompson construction? All of these things actually exist in practice.

                                                                                                                                    The eagerly compiled DFA, because I don’t think that’s hard to create in the alternation of constant strings case.

                                                                                                                                    OK, but you’ll wind up needing to pick a cutoff point where you don’t eagerly compile the DFA because it will use up too much memory. You don’t need to pick this cutoff point, probably, in the context of re2c.

                                                                                                                                    I think you always get a DFA that’s going to run as fast as Aho-Corasick in that case.

                                                                                                                                    I think I agree. At least, I can’t see any specific reason for a performance difference.

                                                                                                                                    What algorithm does rust/regex use for DFA minimization?

                                                                                                                                    It doesn’t do DFA minimization. Neither does RE2. It’s too expensive. Neither RE2 nor rust/regex ever eagerly computes a DFA for general regexes. It’s always done lazily. rust/regex has an optimization pass that RE2 doesn’t have, where, if it detects an alternation of literals, it will use Aho-Corasick specifically because Aho-Corasick is eagerly built and is therefore faster than the lazy DFA. The heuristics used for this currently are not optimal, but the code is there and works in a lot of cases.
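
                                                                                                                                    For anyone following along, the lazy DFA idea is just: keep the Thompson NFA, but compute each DFA transition the first time a (state, byte) pair is seen and cache it. A toy sketch for the literal-alternation case (nothing like the real RE2/rust-regex code, which also bounds the cache and evicts or falls back to the NFA simulation when it fills up):

                                                                                                                                      def lazy_contains(patterns, text):
                                                                                                                                          # Toy lazy DFA for an unanchored alternation of literals: DFA
                                                                                                                                          # states are sets of (pattern, offset) NFA states, and each
                                                                                                                                          # transition is computed on demand and memoized in `cache`.
                                                                                                                                          start = frozenset((i, 0) for i in range(len(patterns)))
                                                                                                                                          cache = {}
                                                                                                                                          state = start
                                                                                                                                          for ch in text:
                                                                                                                                              key = (state, ch)
                                                                                                                                              if key not in cache:
                                                                                                                                                  nxt = {(i, off + 1) for i, off in state
                                                                                                                                                         if off < len(patterns[i]) and patterns[i][off] == ch}
                                                                                                                                                  cache[key] = frozenset(nxt) | start   # unanchored search
                                                                                                                                              state = cache[key]
                                                                                                                                              if any(off == len(patterns[i]) for i, off in state):
                                                                                                                                                  return True
                                                                                                                                          return False

                                                                                                                                      print(lazy_contains(['for', 'while', 'done'], 'x = 1; done'))   # True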

                                                                                                                                    The NFA is very simple, and the DFA is very simple. I have never implemented that algorithm but I would be surprised if it “blows up” in this particular case.

                                                                                                                                    I would be too, in the sense that “blow up” means “exponentially blow up.” Other than that, I think I already characterized the memory usage of an Aho-Corasick DFA. In a typical table based DFA, every state uses 1KB of memory. That’s a lot.
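
                                                                                                                                    To spell out the arithmetic behind that 1KB figure (rough numbers, assuming a dense table with one 4-byte state id per possible byte value):

                                                                                                                                      # One transition per possible byte value, 4 bytes per state id:
                                                                                                                                      bytes_per_state = 256 * 4      # = 1024 bytes, i.e. ~1KB per state
                                                                                                                                      # Aho-Corasick needs at most (total pattern bytes + 1) states, so a
                                                                                                                                      # ~1MB dictionary of patterns means up to ~1M states, i.e. on the
                                                                                                                                      # order of a gigabyte of tables before any compression tricks
                                                                                                                                      # (smaller state ids, byte classes, sparse representations, ...).
                                                                                                                                      print(bytes_per_state)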

                                                                                                                                    Every alternation should look the same from the engine’s point of view, e.g. foo|bar or spam|eggs|ham are the same problem in terms of creating an NFA, converting to DFA, and optimizing (if it’s necessary in that case).

                                                                                                                                    So do you think Aho-Corasick is going to do anything on this benchmark? To a first approximation, it seems like a different way to build a DFA. As you point out, you don’t even use it for this case! Teddy is apparently better.

                                                                                                                                    That’s essentially my point. I can’t think of any problems where Aho-Corasick will be better than an automata-based regex engine (“Thompson’s construction” in my terminology. They are all based on that.)

                                                                                                                                    I wouldn’t suspect so either. But if my regex engine never bothers with the eager DFA construction or minimization after applying the Thompson construction, then it makes sense to just use Aho-Corasick when possible. To convince me to do otherwise, you’d need to show that building and minimizing a DFA by starting with Thompson’s construction won’t be noticeably slower than using the normal Aho-Corasick construction algorithm and then converting that into a table-based DFA directly. Intuitively speaking, I would be surprised if both construction algorithms ran at similar speeds. In your use cases, this distinction is probably wildly uninteresting, since this is all paid for when you compile your program, and therefore you have a very high tolerance. In that case, you probably don’t care about Aho-Corasick, and for implementation simplicity, you’d just stick with Thompson’s construction -> DFA -> minimization. That’s not true in my case.
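
                                                                                                                                    For reference, the traditional construction is just a trie plus failure links computed with a BFS, so building it is linear in the total pattern bytes. A textbook sketch (not the code in my Aho-Corasick library, just to make the comparison concrete):

                                                                                                                                      from collections import deque

                                                                                                                                      def build_aho_corasick(patterns):
                                                                                                                                          # Textbook construction: a trie over the patterns plus failure
                                                                                                                                          # links computed with a BFS.  Both passes are linear in the
                                                                                                                                          # total number of pattern characters.
                                                                                                                                          goto, out, fail = [{}], [[]], [0]           # node 0 is the root
                                                                                                                                          for p in patterns:                          # 1) build the trie
                                                                                                                                              node = 0
                                                                                                                                              for ch in p:
                                                                                                                                                  if ch not in goto[node]:
                                                                                                                                                      goto[node][ch] = len(goto)
                                                                                                                                                      goto.append({})
                                                                                                                                                      out.append([])
                                                                                                                                                      fail.append(0)
                                                                                                                                                  node = goto[node][ch]
                                                                                                                                              out[node].append(p)
                                                                                                                                          queue = deque(goto[0].values())             # 2) failure links (BFS)
                                                                                                                                          while queue:
                                                                                                                                              node = queue.popleft()
                                                                                                                                              for ch, child in goto[node].items():
                                                                                                                                                  queue.append(child)
                                                                                                                                                  f = fail[node]
                                                                                                                                                  while f and ch not in goto[f]:
                                                                                                                                                      f = fail[f]
                                                                                                                                                  fail[child] = goto[f].get(ch, 0)
                                                                                                                                                  out[child] += out[fail[child]]      # inherit matches
                                                                                                                                          return goto, out, fail

                                                                                                                                      def ac_search(text, goto, out, fail):
                                                                                                                                          # NFA formulation: follow failure links until a transition exists.
                                                                                                                                          node, matches = 0, []
                                                                                                                                          for i, ch in enumerate(text):
                                                                                                                                              while node and ch not in goto[node]:
                                                                                                                                                  node = fail[node]
                                                                                                                                              node = goto[node].get(ch, 0)
                                                                                                                                              for p in out[node]:
                                                                                                                                                  matches.append((i - len(p) + 1, p))
                                                                                                                                          return matches

                                                                                                                                      g, o, f = build_aho_corasick(['for', 'while', 'do', 'done'])
                                                                                                                                      print(ac_search('while do done', g, o, f))
                                                                                                                                      # [(0, 'while'), (6, 'do'), (9, 'do'), (9, 'done')]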

                                                                                                                                    1. 1

                                                                                                                                      FWIW I constructed a benchmark to match 1000 - 6000 strings “in parallel” with re2c, RE2, and ripgrep:

                                                                                                                                      https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/fixed-strings.sh#L328

                                                                                                                                      ripgrep (and grep) are faster than re2c for n = 100 just like they are for n = 13 (which was surprising to me since they’re not compilers).

                                                                                                                                      But in the range of 1000 - 6000 strings, re2c match time is essentially constant (< 3 seconds), while ripgrep increases linearly to 13+ seconds. RE2 also increases linearly but not as much.

                                                                                                                                      The re2c compile time is linear as well, and the generated code size is linear, with a small constant factor. For 6000 strings, it only uses 213 KiB of code (representing the DFA).

                                                                                                                                      I’m not pointing this out because I think it’s important for ripgrep to handle. GNU grep blows up somewhere around 400 strings and I gave up running it.

                                                                                                                                      But it is interesting in the sense that this is precisely the Aho-Corasick problem – an alternation of constant strings. (modulo the overlapping matches issue)

                                                                                                                                      My point was that “Thompson’s construction” can handle the Aho-Corasick problem, and do very well at it. It’s possible that I’m underestimating what re2c is doing after that, but considering that the compile time is low, I don’t think anything very advanced is going on.

                                                                                                                                      If I have time in the future I’ll explore that further, although actually I’m more interested in derivatives, so I may never get to it. If I do experiment with derivatives, this would be a great test case.


                                                                                                                                      In theory, I would expect DFAs to be able to match in constant time with respect to the number of strings, not linear time. But re2c is the only implementation that gets close to that!

                                                                                                                                      I would have thought RE2 would too, so I suppose it’s harder than I thought. RE2 also doesn’t want to use more than 8 MB of memory for the DFA by default, although it’s easy to bump up.

                                                                                                                                      I could say more about this, but I’ll probably blog about this in the future instead… I’ve been meaning to write a post/demo about re2c for a long time. If you have any comments on this feel free to mail me at andy@oilshell.org instead of this deep lobste.rs thread!

                                                                                                                                      1. 1

                                                                                                                                        Aye. Just some small quick notes:

                                                                                                                                        • In RE2’s case, it will never use Aho-Corasick and it will never use an eagerly compiled DFA.
                                                                                                                                        • In ripgrep’s case, it will use Aho-Corasick for very small alternations, but falls back to the same lazy DFA approach that RE2 uses beyond that. This is a failing of the current implementation. You can tweak the DFA size limit with the --dfa-size-limit flag.
                                                                                                                                        • In both of the aforementioned cases, once the DFA exceeds the internal limit, it starts thrashing. If it thrashes enough, it falls back to the NFA simulation of Thompson’s construction.
                                                                                                                                        • “Thompson’s construction” can only handle the Aho-Corasick problem well if you eagerly compile the DFA. 6,000 strings might not be enough to really test the differences between the traditional Aho-Corasick compilation algorithm and Thompson’s construction -> DFA -> minimization. /usr/share/dict/words for instance is over 100,000 entries on my system.

                                                                                                                                        Just as a quick example, here is the time difference in compilation between NFA Aho-Corasick (traditional) and DFA Aho-Corasick (“advanced”) using an example program from my Aho-Corasick library:

                                                                                                                                        $ cargo build --release --example dict-search
                                                                                                                                        
                                                                                                                                        $ time ./target/release/examples/dict-search -d /usr/share/dict/words /dev/null
                                                                                                                                        pattern,start,end
                                                                                                                                        
                                                                                                                                        real    0.109
                                                                                                                                        user    0.089
                                                                                                                                        sys     0.020
                                                                                                                                        maxmem  38 MB
                                                                                                                                        faults  0
                                                                                                                                        
                                                                                                                                        $ time ./target/release/examples/dict-search --full -d /usr/share/dict/words /dev/null
                                                                                                                                        pattern,start,end
                                                                                                                                        
                                                                                                                                        real    1.018
                                                                                                                                        user    0.903
                                                                                                                                        sys     0.113
                                                                                                                                        maxmem  321 MB
                                                                                                                                        faults  0
                                                                                                                                        

                                                                                                                                        Note also the difference in peak memory usage. :-)

                                                                                                                                  2. 1

                                                                                                                                    I think we might be getting hung up on “Thompson construction” terminology [1], so let me put it another way:

                                                                                                                                    I don’t think Aho-Corasick is currently used to speed up string search problems.

                                                                                                                                    GNU grep doesn’t appear to use it, and ripgrep doesn’t use it either.

                                                                                                                                    The automata-based regex techniques seem to have “subsumed” it (at least for any problems I can think of).

                                                                                                                                    In your article I see this, which seems to make sense, since you have to go beyond automata to get further speedup!

                                                                                                                                    Handling the case of multiple literals (e.g., foo|bar) is just as important. GNU grep uses a little known algorithm similar to Commentz-Walter for searching multiple patterns. In short, Commentz-Walter is what you get when you merge Aho-Corasick with Boyer-Moore: a skip table with a reverse automaton.

                                                                                                                                    (However I’m not seeing the results of this in practice either! But that is probably because I haven’t looked hard enough.)

                                                                                                                                    [1] Which I think of from a user’s perspective as “automata-based regex engine”, but you’re thinking of from an implementer’s perspective.

                                                                                                                                    1. 1

                                                                                                                                      GNU grep doesn’t appear to use [Aho-Corasick], and ripgrep doesn’t use it either.

                                                                                                                                      Both do, absolutely. For GNU grep, see: http://git.savannah.gnu.org/cgit/grep.git/tree/src/kwset.c

                                                                                                                                      For ripgrep, given that I wrote every line of string searching code in it, I can tell you that it will use Aho-Corasick. Teddy is an optimization that, when applicable, will best Aho-Corasick. But it is not always applicable. For example, it doesn’t work well when you have a large number of literals. In those cases, ripgrep falls back to Aho-Corasick.

                                                                                                                                      With respect to Commentz-Walter, I do not believe GNU grep uses it any more. I think it did at one time.

                                                                                                                                      If GNU grep and/or ripgrep compiled regexes eagerly to a DFA and then minimized them, then there might not be any difference between using that machine to search and Aho-Corasick to search, other than, conjecturally, the time it takes to build the machine. But since both use lazy DFAs built directly from Thompson’s NFA construction at match time, there is no infrastructure for eagerly building out and minimizing the DFA. So we both just use Aho-Corasick instead.
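
                                                                                                                                      To make “lazy DFA” a bit more concrete, here’s a toy sketch of the idea (my own illustration over a hand-written NFA, not the actual grep/ripgrep machinery): a DFA state is just a set of NFA states, and each DFA transition is computed and cached the first time the search needs it, rather than via an up-front subset construction.

                                                                                                                                        # A hand-written NFA for the regex (a|b)*abb, with epsilon moves already removed.
                                                                                                                                        # nfa[state][char] is the set of next NFA states.
                                                                                                                                        nfa = {
                                                                                                                                            0: {'a': {0, 1}, 'b': {0}},
                                                                                                                                            1: {'b': {2}},
                                                                                                                                            2: {'b': {3}},
                                                                                                                                            3: {},
                                                                                                                                        }
                                                                                                                                        accepting = {3}

                                                                                                                                        def lazy_dfa_match(text):
                                                                                                                                            # A DFA state is a frozenset of NFA states; transitions are built on first use.
                                                                                                                                            cache = {}
                                                                                                                                            state = frozenset({0})
                                                                                                                                            for ch in text:
                                                                                                                                                trans = cache.setdefault(state, {})
                                                                                                                                                if ch not in trans:
                                                                                                                                                    trans[ch] = frozenset(t for n in state for t in nfa[n].get(ch, ()))
                                                                                                                                                state = trans[ch]
                                                                                                                                                if not state:
                                                                                                                                                    return False
                                                                                                                                            return bool(state & accepting)

                                                                                                                                        print(lazy_dfa_match("ababb"))  # True
                                                                                                                                        print(lazy_dfa_match("abab"))   # False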

                                                                                                                                  3. 1

                                                                                                                                    In my previous comment, one other interesting thing that stood out to me is that GNU grep’s match count is slightly lower than ripgrep’s. It looks like this is because GNU grep, for some reason, thinks that some of the matched lines contain invalid UTF-8. This is easier to see on a smaller example:

                                                                                                                                    $ time egrep 's a very, very long f' oil.txt | wc -l
                                                                                                                                    51
                                                                                                                                    
                                                                                                                                    $ time rg 's a very, very long f' oil.txt | wc -l
                                                                                                                                    60
                                                                                                                                    

                                                                                                                                    Specifically, if you look at the output of egrep, it actually stops printing results and says Binary file oil.txt matches. In my case, it looks like GNU grep thinks it’s a binary file because it contains invalid UTF-8 in places and my locale is set to en_US.UTF-8. I can work around this by either setting my locale to C (via LC_ALL=C) or by enabling --text.

                                                                                                                                    This doesn’t seem to impact performance though.
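
                                                                                                                                    If it’s useful, here’s a minimal sketch of scripting the LC_ALL=C workaround from Python (a hypothetical helper, reusing the pattern and file from above):

                                                                                                                                      import os
                                                                                                                                      import subprocess

                                                                                                                                      # Run GNU grep in the C locale so the invalid UTF-8 no longer triggers the
                                                                                                                                      # "Binary file ... matches" behavior; -c just counts the matching lines.
                                                                                                                                      result = subprocess.run(
                                                                                                                                          ["grep", "-c", "s a very, very long f", "oil.txt"],
                                                                                                                                          env={**os.environ, "LC_ALL": "C"},
                                                                                                                                          capture_output=True,
                                                                                                                                          text=True,
                                                                                                                                      )
                                                                                                                                      print(result.stdout.strip())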

                                                                                                                                    1. 2

                                                                                                                                      Yes, LC_ALL=C is essential, otherwise on my machine GNU grep stops after 10% of the file!!!

                                                                                                                                      I spent a while figuring that out! Honestly, I think it’s a bug I should file, but I haven’t looked into it further. The test file is a concatenation of the same file 10 times, and a specific character sequence in the last part of the file makes grep stop!

                                                                                                                                      See ESSENTIAL BUG FIX comment here.

                                                                                                                                      https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/fixed-strings.sh

                                                                                                                                      I just re-ran the verification and grep, fgrep, and ripgrep are all returning 2969020 results, with the same timings as reported in the README.

                                                                                                                                      1. 1

                                                                                                                                        I don’t think it’s a bug. It’s intended behavior.

                                                                                                                                        My output from the benchmark script is very different: https://gist.github.com/BurntSushi/1203d5b56112a8c1bacff5b19273d956

                                                                                                                                        1. 1

                                                                                                                                          I dunno, why would you ever want grep to stop in the middle of a file, no matter what other text is there? Suppose I have:

                                                                                                                                          grep foo

                                                                                                                                          on a file that looks like this:

                                                                                                                                          foo
                                                                                                                                          <unintelligible poorly encoded garbage>
                                                                                                                                          foo
                                                                                                                                          

                                                                                                                                          Shouldn’t it find both foo’s? Or I guess it’s saying it doesn’t know that the second foo is even a foo after the encoding errors. That is plausible (for non-UTF-8 encodings, since UTF-8 is designed to be able to resynchronize). But at least it should print an error rather than stopping silently!
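
                                                                                                                                          To make the resynchronization point concrete, here’s a tiny demo (made-up data, nothing to do with grep’s actual code path): a byte-oriented search still finds both foo’s, and UTF-8 decoding picks back up right after the invalid bytes.

                                                                                                                                            # Three invalid bytes sit between two perfectly good lines.
                                                                                                                                            data = b"foo\n\xff\xfe\x80 garbage\nfoo\n"

                                                                                                                                            print(data.count(b"foo"))  # 2 -- a byte-oriented search sees both foos

                                                                                                                                            # errors="replace" turns the bad bytes into U+FFFD and then picks the stream
                                                                                                                                            # back up at the next valid sequence, so the second foo survives decoding too.
                                                                                                                                            print(data.decode("utf-8", errors="replace"))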

                                                                                                                                          1. 1

                                                                                                                                            It does print an error. Or at least, it should. It says “Binary file foo matches”:

                                                                                                                                              if (!out_quiet && (encoding_error_output
                                                                                                                                                                 || (0 <= nlines_first_null && nlines_first_null < nlines)))
                                                                                                                                                {
                                                                                                                                                  printf_errno (_("Binary file %s matches\n"), input_filename ());
                                                                                                                                                  if (line_buffered)
                                                                                                                                                    fflush_errno ();
                                                                                                                                                }
                                                                                                                                            

                                                                                                                                            I did some digging and found this: http://lists.gnu.org/archive/html/bug-grep/2016-02/msg00037.html

                                                                                                                              2. 3

                                                                                                                                Complementary. Here’s the “original paper” you’re looking for: Efficient String Matching: An Aid to Bibliographic Search. It learns from Thompson’s method and improves its performance and utility in exchange for some features (no backtracking).

                                                                                                                              1. 3

                                                                                                                                Maybe it’s not applicable at all, but have you thought about choosing a leaner Python implementation? MicroPython comes to mind. It’s targeted at embedded systems, but maybe it’s a fit (?)

                                                                                                                                1. 5

                                                                                                                                  MicroPython is an impressive project. I’ve looked at it a couple times and mentioned it on the blog.

                                                                                                                                  But it’s more than Oil needs and less than it needs at the same time… more because it implements a lot of the dynamism in Python that Oil doesn’t use (and which makes things slower), and less because it doesn’t have the same bindings (POSIX, readline). I looked at porting Oil to MicroPython, but it was too much work.

                                                                                                                                  It’s also more than 50K lines of code, if I remember correctly, so to meet my goal of “one-person maintainability” I’d have to strip it down just like I did with CPython. In particular, MicroPython has an entire front end that I don’t need, because I have “OPy”.

                                                                                                                                1. 3

                                                                                                                                    Wow, I love this! Just watched it; here’s a little summary:

                                                                                                                                  • Application: designing / synthesizing molecules from “molecular legos”.
                                                                                                                                    • It sounds like they mostly do brute-force search in the style of “superoptimization”: generate a bunch of candidates in a combinatorial fashion and score them (see the toy sketch right after this list). So they need both interactive computing, like Python, and raw speed.
                                                                                                                                    • Common Lisp fits the application domain of molecules because it’s all trees and graphs.
                                                                                                                                  • They want to use GPUs to do this search, hence LLVM.
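
                                                                                                                                    Here’s the toy sketch of that generate-and-score pattern (the fragment names and the scoring function are made up for illustration; this is not the actual API from the talk):

                                                                                                                                      from itertools import product

                                                                                                                                      # Stand-ins for the "molecular legos"; the real system assembles chemical fragments.
                                                                                                                                      fragments = ["amine", "ether", "phenyl"]

                                                                                                                                      def score(candidate):
                                                                                                                                          # Placeholder objective: reward assemblies built from more distinct fragments.
                                                                                                                                          return len(set(candidate))

                                                                                                                                      # Enumerate every length-3 assembly combinatorially and keep the best-scoring one.
                                                                                                                                      best = max(product(fragments, repeat=3), key=score)
                                                                                                                                      print(best, score(best))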

                                                                                                                                  Back story:

                                                                                                                                  • Crazy computational chemistry professor has been programming for 20 years. Claims he programs for 6-8 hours a day!
                                                                                                                                  • Has problems with interfacing Python and C++ (I can relate), and stumbles into Common Lisp 5 years ago.
                                                                                                                                  • Combines an existing Common Lisp implementation with LLVM. There is a complex bootstrapping process.
                                                                                                                                  • Wraps the unstable LLVM C++ interface in Common Lisp, so you can do metaprogramming on C++ with Common Lisp!
                                                                                                                                      • Generates his garbage collector! The garbage collector has to know which C++ types contain pointers.
                                                                                                                                      • Wraps the ASTMatcher API to do source-level refactoring of C++?