1. 4

    would be amazing if ms open sourced the engine now. could be interesting to see an engine written from scratch for the current technologies.

    but i guess this will never happen, just like opera didn’t release the presto engine :/

    1. 4

      EdgeHTML wasn’t rewritten from scratch; AFAIU that’s a myth they want people to believe. It’s still pretty close to Trident (the IE rendering engine), from which they removed a lot of Microsoft-legacy/proprietary code and then added some features.

      1. 2

        AFAIU that’s a myth they want people to believe

        I don’t think MS has ever promulgated this, actually; the Edge page definitely doesn’t claim it’s brand-new, and they initially launched Edge as literally just a document mode for IE—definitely not as some sort of total rewrite. I think the idea that EdgeHTML is an outright new browser got started by fans who either mistook the fact that the Edge chrome is brand-new (which it was) for the rendering engine being new (which it emphatically isn’t); who conflated EdgeHTML in general with the Chakra JavaScript engine (which was indeed new and has now been open-sourced); or who just straight-up exaggerated what Microsoft had done to Trident to make the Edge rendering engine.

        Microsoft itself though has been quite clear in everything I’ve read that Edge is a direct descendant of IE’s rendering engine, because it’s honestly their marketing for why it’s okay to use Edge in corporations currently stuck on IE: it’s the same stuff you know and love, but now it can also scale cleanly into Chrome/Firefox territory. That messaging makes no sense if it’s a brand-new engine.

        1. 1

          ah, thanks (also gecko for the background information) for the clarification, I must have got that mixed up :)

      1. 5

        As Linux gets more and more corporate and less targeted for the desktop, having a light-weight and responsive OS is enough to make it unique.

        I do patch my Linux with the MuQSS scheduler, the best thing for Linux responsiveness, but I was recently told the Haiku one is essentially the same. This is awesome to me.

        There is a lot and more in apps and hardware support that Haiku would need for me to switch over, but it seems like a cool project.

        Does it do virtual desktops, btw?

        1. 4

          MuQSS scheduler

          I heard there was a better scheduler for desktop use. Didn’t know the name. Thanks for the tip.

          1. 3

            Does it do virtual desktops, btw?

            It does.

            The things that it’s missing that I would need to make it my daily driver:

            Minimum:

            • Support for multiple monitors (was in the works at one point, may be there now)
            • Support for videoconferencing and screen sharing in Google Meet (long shot because Google barely even supports Firefox there)
            • Full disk encryption (there’s an encrypted block device driver in the tree but last I checked it was moribund)

            Optimal:

            • The ability to run virtual machines at full speed (there’s qemu but without OS support it’s doing true emulation and is unusably slow for my purposes)
            • The ability to use Firefox Sync

            I’d say BeOS is my favorite operating system of all time, but I can’t quite bring myself to say it since AmigaOS existed.

            1. 3

              I do patch my Linux with the MuQSS scheduler, the best thing for Linux responsiveness, but I was recently told the Haiku one is essentially the same. This is awesome to me.

              I don’t know a lot about the MuQSS scheduler, but from reading over the introductory document, it indeed looks pretty similar to Haiku’s. (I wonder where you read this previously, though?)

              There is a lot and more in apps and hardware support that Haiku would need for me to switch over, but it seems like a cool project.

              What would those be? Most minor tools are easily ported at this point.

              1. 3

                IRC, oftc.net, can’t remember why I joined Con Kolivas’ channel #ck, but there. I consider him a friend after all this time and tested some of his prototypes way back.

                The Godot engine would be one big thing.

              2. 3

                Virtual desktops: yes.

                Linux gets more and more corporate and less targeted for the desktop

                Let’s hope the competitors get better in quality. I doubt I will want change to Haiku unless something really bad happens in the nix world, but hopefully its presence will make everyone else better nonetheless.

                1. 3

                  Disk encryption, does it have that? Password-protected screensaver?

                  1. 4

                    BeOS had a password-protected screensaver.

                2. 2

                  How does mainstream GNU/Linux get worse?

                  1. 10

                    NB: this turned out to be a poettering rant.

                    adding ever more complicated layers onto complicated layers to reinvent the wheel. most things should be done a few layers down, not by adding a few layers on top. this while having the same functionality 10 years ago, which most of the time worked as well as it does today, only less complicated and less prone to break. the sound stack is just horrible; the most sane thing would be to throw out alsa and pulseaudio and use oss4, which implements most of the features. session and login management is also insane, a mess of daemons connected via dbus of all things. systemd people constantly reinvent square wheels (resolved, really?). while i’m at it, ps found a new one i didn’t know about: “rtkit-daemon”, fixing problems i don’t have, running by default.

                    i know, it’s open source, i can write a patch.

                    1. 3

                      I’ve been geeking out on schedulers for a long, long time and every encounter with vanilla Linux on a heavily-loaded box has been awful. It might behave better now, but only by way of very complicated code and bizarre special-case settings.

                      As a simple user, I just use the -ck patch set and ignore the horrors of the sound stack, systemd, Linux Foundation’s corporate politics, cgroups and what have you.

                      I mean, it kinda still works, but sometimes it feels like the best desktop-experience parity with Windows was reached 20 years ago (if you exclude hardware support and games), and with gnome3-type shit everything got worse.

                      I’m not positive the desktop experience is as good as it gets but I am positive it’s no one’s priority.

                      1. 4

                        I actually like Gnome 3 UI-wise, but the Linux scheduler seems to be more horrific than it used to be, and I remember it being bad a decade ago. I’ve had systems where X11 chugged hard and took 30 minutes to get to a VT when Firefox was stressing the system, whereas Windows on even more decrepit hardware was slow but at least felt usable, due to seemingly better scheduling - and it didn’t matter what WM you were using.

                    2. 1

                      I’m not seeing Linux move away from the desktop at all. In fact I’m seeing more investment in the Linux desktop than ever.

                      It’s just that they’re investing in the wrong (from my selfish stance :) desktop environment :)

                      1. 2

                        they’re moving away from the desktop and towards tablets, even though linux doesn’t run on any

                    1. 2

                      offtopic: the throbbing, always-in-view “join dev.to” button is a joke, isn’t it?

                      1. 21

                        To start, the ZFS filesystem combines the typical filesystem with a volume manager. It includes protection against corruption, snapshots and copy-on-write clones, as well as volume manager.

                        It continues to baffle me how “mainstream” filesystems like ext4 forgo checksumming of the data they contain. You’d think that combatting bitrot would be a priority for a filesystem.

                        Ever wondered where did vi come from? The TCP/IP stack? Your beloved macOS from Apple? All this is coming from the FreeBSD project.

                        Technically, vi and the BSD implementations of the TCP/IP stack can be attributed to 4.xBSD at UCB; FreeBSD is not the origin of either.

                        1. 10

                          It continues to baffle me how “mainstream” filesystems like ext4 forgo checksumming of the data they contain. You’d think that combatting bitrot would be a priority for a filesystem.

                          At least ext4 supports metadata checksums:

                          https://wiki.archlinux.org/index.php/ext4#Enabling_metadata_checksums

                          At any rate Ted Ts’o (the ext[34] maintainer) has said as far back as 2009 that ext4 was meant to be a transitional technology:

                          Despite the fact that Ext4 adds a number of compelling features to the filesystem, Ts’o doesn’t see it as a major step forward. He dismisses it as a rehash of outdated “1970s technology” and describes it as a conservative short-term solution. He believes that the way forward is Oracle’s open source Btrfs filesystem, which is designed to deliver significant improvements in scalability, reliability, and ease of management.

                          https://arstechnica.com/information-technology/2009/04/linux-collaboration-summit-the-kernel-panel/

                          Of course, the real failing here is not ext4, but that btrfs hasn’t been able to move to production use in more than ten years (at least according to some people).

                          That said, ZFS works fine on Linux as well and some distributions (e.g. NixOS) support ZFS on root out-of-the-box.

                          1. 3

                            Of course, the real failing here is not ext4, but that btrfs hasn’t been able to move to production use in more than ten years (at least according to some people).

                            I think it’s good to contrast “some people’s” opinion with the one from Facebook:

                            it’s safe to say every request you make to Facebook.com is processed by 1 or more machines with a btrfs filesystem.

                            Facebook’s open-source site:

                            Btrfs has played a role in increasing efficiency and resource utilization in Facebook’s data centers in a number of different applications. Recently, Btrfs helped eliminate priority inversions caused by the journaling behavior of the previous filesystem, when used for I/O control with cgroup2 (described below). Btrfs is the only filesystem implementation that currently works with resource isolation, and it’s now deployed on millions of servers, driving significant efficiency gains.

                            But Facebook employs the btrfs project lead.

                            There is also the fact that Google is now using BTRFS on Chromebooks with Crostini.

                            As for opinions I’ve seen one that claims that “ZFS is more mature than btrfs ON SOLARIS. It is mostly ok on FreeBSD (with various caveats) and I wouldn’t recommend it on Linux.”.

                            1. 2

                              I wouldn’t recommend it on Linux.

                              I’d still say that ZFS is more usable than lvm & linux-softraid. If only due to the more sane administration tooling :)

                          2. 9

                            Ext4, like most evolutions of existing filesystems, is strongly constrained by what the structure of on-disk data and the existing code allows it to do. Generally there is no space for on-disk checksums, especially for data; sometimes you can smuggle some metadata checksums into unused fields in things like inodes. Filesystems designed from the ground up for checksums build space for checksums into their on-disk data structures and also design their code’s data processing pipelines so there are natural central places to calculate and check checksums. The existing structure of the code matters too because when you’re evolving a filesystem, the last thing you want to do is to totally rewrite and restructure that existing battle-tested code with decade(s) of experience embedded into it; if you’re going to do that, you might as well start from scratch with an entirely new filesystem.

                            In short: that ext4 doesn’t have checksums isn’t surprising; it’s a natural result of ext4 being a backwards compatible evolution of ext3, which was an evolution of ext2, and so on.

                            1. 4

                              It continues to baffle me how “mainstream” filesystems like ext4 forgo checksumming of the data they contain. You’d think that combatting bitrot would be a priority for a filesystem.

                              Ext4 doesn’t aim to be that type of filesystem. For desktop use by the average user this is fairly okay, since actual bitrot in data the user cares about is rare (most bitrot occurs either in system files, in empty space, or in media files where a single corrupt frame barely matters).

                              If you want to check out a more modern alternative, there is bcachefs. I’ve been using it on my laptop for a while (I stopped at one point, but now I’m back on it) and it’s been basically rock solid. The developer is also working on erasure coding and replication in a more solid way than btrfs currently has.

                            1. 2

                              Deleted my original rant response because, as @sgreben rightfully pointed out, my rant was exactly the kind of thing i ranted about.

                              But, yes, I’d like to see a freeze in new users for a while, to avoid lobste.rs from collapsing under its own weight.

                              I’d also suggest:

                              1. Culling users who have been new members for X months and never posted (but once you post/comment once, you’re in for forever?)
                              2. Requiring more than one invitation for joining?
                              3. Moving to a subscription model. I get enough enjoyment out of this site that I’d be happy to pay for it.

                              But, that’s just my USD$0.02.

                              1. 15

                                Culling users who have been new members for X months and never posted (but once you post/comment once, you’re in for forever?)

                                I will note that there are people who primarily lurk in any online community. This site has a private message system and possibly other features of value to members who never post publicly.

                                Original research I was privy to in my first moderating position suggested that about 20 percent of users were active participants who posted regularly or semi-regularly, another 10 percent posted only once or very rarely, and the rest lurked. Anecdotal observation suggests that these figures are probably fairly representative of other communities I have engaged in.

                                1. 8

                                  Lurkers are harmful to communities like this: they have influence in shaping the site but don’t bother to engage beyond being a silent majority that can be pandered to (purposefully or not), and they amplify any democratic issues the site might have.

                                  Better to purge them and leave control of the site (what little there is) in the hands of the people who bother participating.

                                  Edit: lurkers here being those who have accounts but don’t post.

                                  1. 7

                                    Hi friendlysock, I’m malxau and I’m a lurker.

                                    The reason I ended up like this is because the technology landscape is very broad today (and getting broader), and I have firsthand knowledge or experience with a tiny fraction of topics that get discussed. So the best way I can see to keep the signal-to-noise ratio high is to read about things that I don’t know, including comments from people more familiar with them, and avoid contributing to or moderating those posts.

                                    Occasionally there will be something I know, but something I deeply know and have firsthand knowledge of is still rather rare. (In my case, I’ve spent the last 14 years working in Windows kernel mode; I’m an active practitioner, but looking at submissions you’ll see why I don’t feel like I know the breadth of topics being discussed, including things like the Palantir thread.)

                                    Do you still think I’m a problem? Do you think the site would work better if I commented or moderated more?

                                    1. 4

                                      I can’t see your upvotes or flags, so I can’t comment on that front. That said, I think the site would definitely be improved by your participation and by submissions of things relating to your background in arcane Windows programming!

                                      Thank you for giving your perspective here.

                                      1. 1

                                        Your site was refreshingly different since it covered stuff I don’t usually see on Lobsters. Doing low-level kernel stuff, I bet you ran into both content and lessons learned that Lobsters might have found interesting regardless of your writing being about Windows. There are also Lobsters on Windows. There are also a lot of Lobsters who hate Windows.

                                        I have no idea how well your stuff would’ve been received. There’s a chance people might have found it interesting, though. If it’s Windows-like as someone said, an easy example is Minoca OS getting lots of appreciation. Another thread on its documentation had 10 votes. So, there’s potential.

                                      2. 6

                                        Hey there. That seems like a fairly strong opinion. Any research or data you can point me to? I’m not aware of evidence that lurkers are somehow harmful in most cases.

                                        1. 5

                                          Have you seen HN or Reddit? I’m serious. It’s called hivemind for a reason.

                                          People that care enough about a site to post content, or even comment, are, by definition, more involved in the site than users who maintain accounts but don’t do anything but vote up and down.

                                          Lurkers who just vote and flag look an awful lot like slacktivists. They’re freeloaders, contributing no content of their own and no discussion, but they can still screw up conversations by voting with a knee-jerk reaction.

                                        One of the things that sets Lobsters apart is that it is made up quite largely of people who actually write code frequently (instead of, say, being growth hackers, or bloggers, or marketers, or bankers, or whatever else) and that those people are given transparency and tools for interacting with the running of the community. Lurkers run counter to at least the latter of those key characteristics.

                                          1. 11

                                            Yes, I’ve seen both HN and Reddit.

                                            I don’t think I’ve ever seen a forum that didn’t have a lot of lurkers. Do you know of any forums where “post or leave” is actual policy? Do you know of any research on this angle?

                                            I’m not making any recommendations here. I’m just seeing people saying “I think we should do X!” and the things I’m seeing don’t fit with my understanding of best practices. But I certainly don’t know everything, so I’m trying to share what I know concerning actual (pertinent) data and asking if anyone knows of any supporting research for their positions.

                                            To be clear, I’m absolutely not trying to tell anyone how lobsters should be run. I was given an invitation by a coder who wants to start a discussion board and he asked if I would consider taking on the role of lead moderator. I tentatively agreed.

                                            So I’m not actually a programmer, though I have some technical training and so on. I’m genuinely interested in learning if there is good data and research supporting the various proposals in this discussion because I’m looking for, among other things, stuff pertinent to the project I’m trying to collaborate on.

                                            I’m genuinely curious and open to seeing good information on such things. I’m aware these questions may be unwelcome here, both because I’m new and because people will tend to interpret my comments as intent to shape policy on lobsters the very day I joined.

                                            A best case outcome is that my comments and questions serve to be helpful and thought provoking for people here who are trying to shape lobsters while I get useful resources to support my project. But a less nice and more likely outcome is that people decide my questions are somehow bad behavior and I get told to gtfo of the discussion or something.

                                            1. 7

                                              I’ve never thought about how lurkers skew voting until this thread, but it seems commonsensical now. You end up with the posters performing for a silent audience, instead of interacting with each other.

                                              Maybe a half-measure we could try is giving people a pool of votes that’s replenished when you post, and you spend from that pool when you up or down a story or comment; one post (submission or comment) could earn you 10 votes or something. That way votes come from the people who are actually engaging with the site, but we’re not kicking anyone off for not being chatty.
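
                                            To make the mechanics concrete, here’s a rough Python sketch of the pool idea (the 10-votes-per-post number and all the names are just placeholders I made up, not anything the site actually implements):

                                                class VotePool:
                                                    VOTES_PER_POST = 10  # hypothetical replenishment rate

                                                    def __init__(self):
                                                        self.balance = 0

                                                    def on_post(self):
                                                        # Submitting a story or comment refills the pool.
                                                        self.balance += self.VOTES_PER_POST

                                                    def try_vote(self):
                                                        # Spend one vote (up or down); fails once the pool is empty.
                                                        if self.balance == 0:
                                                            return False
                                                        self.balance -= 1
                                                        return True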

                                              1. 10

                                                Maybe a half-measure we could try is giving people a pool of votes that’s replenished when you post

                                                No no no no no no no. That would result in users creating a large number of low-effort comments in order to refuel. It’s bad enough that internet users will do almost anything to make a number go up. It’s even worse when you attach actual incentives to that number.

                                                1. 3

                                                  We could do something like requiring a comment/post have at least +3 or something before it counts towards your vote pool; that might be enough to frustrate a lot of the system-gaming, no?

                                                  1. 1

                                                    The low-effort posts on popular topics get lots of votes. Probably won’t work.

                                                2. 2

                                                  I’ve never thought about how lurkers skew voting until this thread, but it seems commonsensical now. You end up with the posters performing for a silent audience, instead of interacting with each other.

                                                  This is an empirical question worth empirically validating before believing. There is also a plausible just-so story that older users feel more confident voting strategically to enforce their political opinions, etc. Form your hypothesis, write a query, decide how to interpret possible results, and then send it to me to run.

                                                  1. 1

                                                    That’s a neat idea and I’d be in favor of trying it. I don’t know to what extent that would affect the upvote/downvote dynamics of the site, but I’m interested in finding out, and I don’t think it’s an onerous requirement on people.

                                                    1. 1

                                                      a pool of votes that’s replenished when you post, and you spend from that pool when you up or down a story or comment; one post (submission or comment) could earn you 10 votes or something.

                                                      I think that this is great idea. Personally I would go with 1-2 votes per submission but whatever the number I think we should try it.

                                                      1. 1

                                                        Yeah; I originally said 10 because voting serves a real purpose, and I’d worry that only getting one vote per comment could reduce the quality of the front page, because people would hoard their precious votes. I’m no expert on this stuff, though.

                                                      2. 1

                                                    This idea sounds great. I’m not sure what the dynamics would look like, but I’d be interested in trying it out.

                                                      3. 6

                                                        but they can still screw up conversations by voting with a knee-jerk reaction

                                                        Yes, voting does screw up conversations. If I had my way, lobsters wouldn’t have votes on comments, exactly because I don’t think that meaningful conversations should be democratized like that. Lobsters isn’t a very good system for conversations in my very humble opinion (I keep linking to Discourse.org as the model to live up to for a reason). But I don’t think lurkers are necessarily any worse at knee-jerk voting than active commenters.

                                                        Lobsters is, however, pretty much the gold standard for link aggregation, for surfacing content from elsewhere. Voting, flagging, and submitting articles without ever commenting is something I think we should be encouraging, because that’s what the Lobsters software is actually good at. Less conversations, more stories.

                                                    2. 5

                                                      voting satisfies the “me too” impulse. absent that, I suspect you’d see a lot more actual me too comments.

                                                      1. 4

                                                        If you change the rules to ‘post or get out!’, I suspect you will see:

                                                        1. People who are slow to integrate into the community but will eventually post good stuff lose their connection to lobsters and go elsewhere instead of slowly ramping up from just looking to joining to voting to commenting/submitting.
                                                        2. Lots of comments along the lines of “I have nothing to say right now, I’m just trying to say something so I don’t get purged”
                                                        1. 4

                                                          Voting lurkers could indeed be problematic. Perhaps adding a min-karma-threshold for upvoting (similar to flagging), could be a useful experiment.

                                                          1. 1

                                                            Is the problem with vote lurking the up or down votes?

                                                          2. 3

                                                            Downvotes are inaccessible until a user reaches a certain karma threshold. Would it make sense to do the same thing for upvotes too, reducing the pool of users that can vote?

                                                            I don’t think outright purging users is very helpful, since reading for a while before posting is a common practice (and probably not something that should be discouraged). I agree having a silent voting majority is potentially quite harmful to a forum.

                                                            1. 1

                                                              reading for a while before posting is a common practice (and probably not something that should be discouraged)

                                                              You don’t need an account to read.

                                                              1. 2

                                                                You don’t, but there are features that are useful for people who are only reading (tag filtering, hiding stories).

                                                              2. 1

                                                                That inversion is worth thinking on more. The political folks currently do more upvoting of political stuff than submitting or commenting. It isn’t limited to them. We see the same thing in the technical threads for some people or types of comments.

                                                                1. 1

                                                                  I was under the impression votes were anonymous, is this not correct?

                                                                  1. 1

                                                                    The site won’t tell other users what your votes are, but it needs to know, both to prevent multiple votes and to show you what you’ve voted on. Obviously the site administrators, who have direct database access, can query that information.

                                                                    1. 1

                                                                      This is accurate, and I’ve written elsewhere in this thread about that access and the practices around it.

                                                                    2. 1

                                                                      They usually vote and comment together. So, you know who some of the likely voters are.

                                                                2. 2

                                                                  how about limiting the votes one has? dota2 does that for reports to keep the reporting system valuable. one of:

                                                                  • fixed number of votes per time-unit (easiest, but limited impact i think)
                                                                  • votes per time-unit limited by karma, eg. votes * karma / maxKarma (could become a lobsters ingame currency)
                                                                  • votes per time-unit limited by submission count (facilitates spamming)
                                                                  • votes per time-unit limited by combined submission count and karma (i don’t have an idea for a good function to do that ;)

                                                                  this should at least limit the lurker influence. i for one wouldn’t care if i’d have to manage my votes a bit.

                                                                  edit: haldean had posted this idea before me, i should have read this thread more thoroughly :)

                                                                  1. 3

                                                                    If the intent is to limit the effect of upvotes, and avoid knee-jerk voting, one could also make it mirror the current downvote choices and simply make a user think about why they are up-voting a comment. So an upvote arrow should offer choices such as [technical|meta|..].

                                                                    1. 1

                                                                      Or “MAS” for “mutual appreciation society” ;)

                                                                  2. 2

                                                                    Wouldn’t that just cause stupid posts like “not lurker” or “first” to trigger account “lock in” – possibly even on very old threads.

                                                                    1. 1

                                                                      My concern with taking a negative view of people like myself who don’t post much is that it suggests posting is mandatory regardless of quality or relevance. I am a lurker, but only because I don’t want to clutter up threads with poorly informed or nontechnical content. I wish I had the depth of experience that some more frequent posters have; should I be excluded for being more of a generalist?

                                                                1. 1

                                                                  does anyone know how it is related to Thompson’s idea (https://swtch.com/~rsc/regexp/regexp1.html ; the original paper seems to be behind a paywall) of how to match regular expressions?

                                                                  1. 11

                                                                    I’m not quite sure how to answer the question since it’s kind of broad. In the most simplistic sense, they both implement matching using finite state machines that execute in time proportional to the input being searched. Aho-Corasick is one of many different ways to search for multiple strings, and IIRC, Navarro and Raffinot classify it as a prefix-oriented approach that generalizes from the idea of KMP for single pattern search. (The other categories of string search being “suffix” and “factor based” approaches.)

                                                                    In the usual framing, an Aho-Corasick matcher is typically implemented as a simulation of a non-deterministic finite automaton, where the “non determinism” comes from following failure transitions. That is, for each byte of input, it is possible to follow N failure transitions before advancing to the next byte of input (where I believe N is proportional to the number of strings you’re trying to match). However, such a matcher can be converted to a deterministic finite automaton by pre-computing all of the failure transitions. Once you have that, each byte of input can be processed in a constant number of instructions, regardless of how many strings you’re searching. Of course, the downside of this approach is that it requires more memory. (In the literature, you usually see the NFA variant just called “Aho-Corasick” and the DFA variant called “Advanced Aho-Corasick.”)
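
                                                                    For anyone who wants to see the shape of the “NFA” variant, here’s a minimal Python sketch of my own (a toy, not taken from any real implementation): a trie plus failure links, where matching may chase several failure transitions per input character.

                                                                        from collections import deque

                                                                        def build(patterns):
                                                                            # Each trie node: outgoing edges, a failure link, and the patterns ending here.
                                                                            nodes = [{'next': {}, 'fail': 0, 'out': []}]
                                                                            for pat in patterns:
                                                                                cur = 0
                                                                                for ch in pat:
                                                                                    if ch not in nodes[cur]['next']:
                                                                                        nodes.append({'next': {}, 'fail': 0, 'out': []})
                                                                                        nodes[cur]['next'][ch] = len(nodes) - 1
                                                                                    cur = nodes[cur]['next'][ch]
                                                                                nodes[cur]['out'].append(pat)
                                                                            # BFS to fill in the failure links (depth-1 nodes keep fail = root).
                                                                            queue = deque(nodes[0]['next'].values())
                                                                            while queue:
                                                                                cur = queue.popleft()
                                                                                for ch, child in nodes[cur]['next'].items():
                                                                                    queue.append(child)
                                                                                    f = nodes[cur]['fail']
                                                                                    while f and ch not in nodes[f]['next']:
                                                                                        f = nodes[f]['fail']
                                                                                    nodes[child]['fail'] = nodes[f]['next'].get(ch, 0)
                                                                                    nodes[child]['out'] += nodes[nodes[child]['fail']]['out']
                                                                            return nodes

                                                                        def search(nodes, text):
                                                                            state, hits = 0, []
                                                                            for i, ch in enumerate(text):
                                                                                while state and ch not in nodes[state]['next']:
                                                                                    state = nodes[state]['fail']  # the failure transitions
                                                                                state = nodes[state]['next'].get(ch, 0)
                                                                                for pat in nodes[state]['out']:
                                                                                    hits.append((i - len(pat) + 1, pat))
                                                                            return hits

                                                                        # search(build(['he', 'she', 'his', 'hers']), 'ushers') -> [(1, 'she'), (2, 'he'), (2, 'hers')]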

                                                                    Thompson’s construction targets a strictly more general use case beyond just searching for literal strings, and is usually implemented by compiling a regex down into a small set of bytecodes for implementing alternations and repetitions. An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions. The actual time complexity here remains the same I think, but the processing of the various “split” bytecodes in Thompson’s construction in a VM of sorts is generally going to be quite slow.
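
                                                                    And for contrast, here’s a toy sketch of the bytecode-VM side of that comparison (my own made-up instruction set and a hand-compiled program, full-match only; not RE2’s or anyone else’s actual bytecode):

                                                                        def run(prog, text):
                                                                            def add(threads, pc):
                                                                                # Follow 'split'/'jmp' (epsilon) edges; this chasing is the extra
                                                                                # per-character work of the NFA simulation.
                                                                                if pc in threads:
                                                                                    return
                                                                                threads.add(pc)
                                                                                op = prog[pc]
                                                                                if op[0] == 'split':
                                                                                    add(threads, op[1]); add(threads, op[2])
                                                                                elif op[0] == 'jmp':
                                                                                    add(threads, op[1])

                                                                            clist = set(); add(clist, 0)  # CLIST: states alive now
                                                                            for ch in text:
                                                                                nlist = set()             # NLIST: states alive after ch
                                                                                for pc in clist:
                                                                                    op = prog[pc]
                                                                                    if op[0] == 'char' and op[1] == ch:
                                                                                        add(nlist, pc + 1)
                                                                                clist = nlist
                                                                            return any(prog[pc][0] == 'match' for pc in clist)

                                                                        # 'ab|cd' compiled by hand into the toy instruction set:
                                                                        prog = [
                                                                            ('split', 1, 4),  # try the 'ab' branch or the 'cd' branch
                                                                            ('char', 'a'), ('char', 'b'), ('jmp', 6),
                                                                            ('char', 'c'), ('char', 'd'),
                                                                            ('match',),
                                                                        ]
                                                                        assert run(prog, 'cd') and not run(prog, 'ce')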

                                                                    It would be interesting to try and draw a more concrete correspondence between Thompson’s “split” bytecode and Aho-Corasick’s failure transitions.

                                                                    I do also know that Paul Wankadia (current maintainer of RE2) has done some really cool work on removing the various split bytecodes from the normal Thompson construction, at the cost of doing a bit more work translating the regex to bytecodes. But I still don’t think it will approach the speed of Aho-Corasick. To a first approximation, I would somewhat expect Aho-Corasick to be about an order of magnitude faster than the typical Thompson construction with a VM. If you JIT the Thompson construction (which has been done), then all bets are off. :-)

                                                                    For completeness, note that there is also Glushkov’s construction algorithm, which also constructs an NFA for matching regexes, but with different trade-offs than Thompson’s construction. That is, Glushkov’s algorithm does epsilon removal, whereas Thompson’s doesn’t. But this means you get a bigger machine (more transitions) and spend more time building it. I’ve wanted to experiment with Glushkov’s construction for a while now, but I always get stuck when I try to figure out how to handle giant Unicode character classes (which also thwarts a lot of interesting bit-parallel implementations of NFAs).

                                                                    1. 2

                                                                      An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                                                                      I don’t think this follows because both algorithms essentially compile their problems down to DFAs. The compile time might be different (and mostly negligible), but if the DFA is optimized, then I don’t see any reason that the execution time should be different.

                                                                      The regex -> NFA -> DFA -> optimization transformation does feel like “magic” in that sense, in that you can solve a harder problem in the same amount of time, at least from the POV of computational complexity.

                                                                      In Oil, I use re2c to compile a huge number of regexes down to a series of switch/goto statements, with no backtracking, and it seems to be very fast. There are some links to code at the beginning of this post: http://www.oilshell.org/blog/2017/12/17.html

                                                                      As far as I understand, re2c is basically doing the “Thompson” or Dragon Book transformation. Perhaps with some DFA optimizations that I don’t understand. But the expressiveness of the problem is the same – it accepts alternations of arbitrary regexes.

                                                                      It doesn’t do any backtracking or “split” – it’s all just switch and goto.

                                                                      The actual time complexity here remains the same I think,

                                                                      As long as both algorithms produce a DFA, the search should use O(n) time and O(1) space.
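
                                                                      As a tiny illustration of that point, here is a hand-written table for the toy regex 'ab|cd' (full match only; not re2c’s actual output): once the DFA exists, matching is one table lookup per character, with no per-character thread lists.

                                                                          DEAD = -1
                                                                          delta = {
                                                                              (0, 'a'): 1, (0, 'c'): 2,  # start state branches on the first letter
                                                                              (1, 'b'): 3, (2, 'd'): 3,  # state 3 is the single accepting state
                                                                          }

                                                                          def dfa_match(text):
                                                                              state = 0
                                                                              for ch in text:
                                                                                  state = delta.get((state, ch), DEAD)  # one lookup per character
                                                                                  if state == DEAD:
                                                                                      return False
                                                                              return state == 3

                                                                          assert dfa_match('ab') and dfa_match('cd') and not dfa_match('ad')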

                                                                      I think you are saying that the “split” bytecodes from [2] necessary to simulate the NFA make things slower.

                                                                      But I am not sure this is true, based on [1]:

                                                                      In a sense, Thompson’s NFA simulation is executing the equivalent DFA: each List corresponds to some DFA state, and the step function is computing, given a list and a next character, the next DFA state to enter. Thompson’s algorithm simulates the DFA by reconstructing each DFA state as it is needed.


                                                                      All this is complicated by another fact I just read here:

                                                                      https://research.swtch.com/yaccalive

                                                                      The now-standard framing of Thompson’s paper as an NFA construction and the addition of caching states to produce a DFA were both introduced later, by Aho and Ullman in their various textbooks. Like many primary sources, the original is different from the textbook presentations of it. (I myself am also guilty of this particular misrepresentation.)

                                                                      (my emphasis)

                                                                      I think he’s saying that Thompson’s original method was actually based on regular expression derivatives (Brzozowski derivatives) and not the McNaughton-Yamada algorithm.

                                                                      The derivatives method produces an optimized DFA directly – it doesn’t require the NFA -> unoptimized DFA intermediate stages.
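
                                                                      Here’s roughly what the core of the derivatives approach looks like, as a toy Python sketch of my own (matching by repeatedly taking derivatives and checking nullability; turning this into an explicit DFA additionally requires canonicalizing the derivative expressions so that states can be compared for equality):

                                                                          # Regexes as tuples: ('empty',) matches nothing, ('eps',) the empty string,
                                                                          # ('chr', c), ('cat', r, s), ('alt', r, s), ('star', r).

                                                                          def nullable(r):
                                                                              tag = r[0]
                                                                              if tag in ('eps', 'star'):
                                                                                  return True
                                                                              if tag in ('empty', 'chr'):
                                                                                  return False
                                                                              if tag == 'cat':
                                                                                  return nullable(r[1]) and nullable(r[2])
                                                                              return nullable(r[1]) or nullable(r[2])  # 'alt'

                                                                          def deriv(r, c):
                                                                              tag = r[0]
                                                                              if tag in ('empty', 'eps'):
                                                                                  return ('empty',)
                                                                              if tag == 'chr':
                                                                                  return ('eps',) if r[1] == c else ('empty',)
                                                                              if tag == 'alt':
                                                                                  return ('alt', deriv(r[1], c), deriv(r[2], c))
                                                                              if tag == 'star':
                                                                                  return ('cat', deriv(r[1], c), r)
                                                                              # 'cat': d(rs) = d(r)s, plus d(s) when r is nullable
                                                                              left = ('cat', deriv(r[1], c), r[2])
                                                                              return ('alt', left, deriv(r[2], c)) if nullable(r[1]) else left

                                                                          def matches(r, text):
                                                                              for c in text:
                                                                                  r = deriv(r, c)
                                                                              return nullable(r)

                                                                          # ab|cd as a tiny example:
                                                                          pattern = ('alt', ('cat', ('chr', 'a'), ('chr', 'b')), ('cat', ('chr', 'c'), ('chr', 'd')))
                                                                          assert matches(pattern, 'ab') and not matches(pattern, 'ac')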


                                                                      I just watched a couple talks about derivatives recently, including one about redgrep (also by Paul Wankadia), which uses derivatives – not the McNaughton-Yamada algorithm used by RE2:

                                                                      https://github.com/google/redgrep

                                                                      redgrep: from regular expression derivatives to LLVM

                                                                      This one is a lot more friendly and shows the code on one slide basically:

                                                                      Regular Expression Derivatives in Python

                                                                      https://github.com/MichaelPaddon/epsilon

                                                                      I would like to replace Oil’s usage of re2c with this technique eventually. I’m not sure if it will make things faster, but it might make the code size smaller. The lexer is a single function that’s around 23K of native code right now.

                                                                      Anyway I still have a lot more to learn about this, and to actually try out the “new” algorithms in code :) But I thought it was funny to see that little footnote that Thompson’s algorithm as explained in those long articles isn’t really Thompson’s algorithm. This is not a minor point – derivatives are a REALLY different algorithm.

                                                                      (Note that parsing with derivatives is different than regex with derivatives. Cox is criticizing parsing with derivatives in that article, but he says that regex with derivatives is a “win-win” – the algorithm is smaller, and the output is more optimized (smaller DFA). I guess that was the point of the redgrep exploration.)


                                                                      [1] https://swtch.com/~rsc/regexp/regexp1.html

                                                                      [2] https://swtch.com/~rsc/regexp/regexp2.html

                                                                      1. 3

                                                                        Minor update: The original re2c [1] paper does cite the Dragon Book, so at least in 1994 it used the McNaughton-Yamada algorithm to generate a DFA (not an NFA).

                                                                        From there it says that generating assembly code from the DFA is “straightforward” (section 3.2).

                                                                        The code is quite large now but I’m guessing it still follows that basic algorithm, along with a bunch of optimizations for DFA size, and then the big complication of submatch extraction (which Russ Cox’s articles also deal heavily with).

                                                                        [1] http://re2c.org/1994_bumbulis_cowan_re2c_a_more_versatile_scanner_generator.pdf


                                                                        Also if anyone is still reading who wants to manipulate regexes for a practical purpose (i.e. portably implementing extended glob as in bash/ksh), check out my issues :) https://github.com/oilshell/oil/issues?q=is%3Aissue+is%3Aopen+label%3Acomputer-science

                                                                        1. 1

                                                                          Hiya! I’m a huge fan of the work you’re doing with Oil. I am really excited to see how it turns out. :-)

                                                                          I don’t think this follows because both algorithms essentially compile their problems down to DFAs. The compile time might be different (and mostly negligible), but if the DFA is optimized, then I don’t see any reason that the execution time should be different.

                                                                          The regex -> NFA -> DFA -> optimization transformation does feel like “magic” in that sense, in that you can solve a harder problem in the same amount of time, at least from the POV of computational complexity.

                                                                          In Oil, I use re2c to compile a huge number of regexes down to a series of switch/goto statements, with no backtracking, and it seems to be very fast. There are some links to code at the beginning of this post: http://www.oilshell.org/blog/2017/12/17.html

                                                                          As far as I understand, re2c is basically doing the “Thompson” or Dragon Book transformation. Perhaps with some DFA optimizations that I don’t understand. But the expressiveness of the problem is the same – it accepts alternations of arbitrary regexes.

                                                                          It doesn’t do any backtracking or “split” – it’s all just switch and goto.

                                                                          I’m actually not terribly familiar with re2c, but I think there might be some confusion here. Thompson’s construction is really about building an NFA and then simulating that. At least, that’s what his original paper describes. (scihub this: https://dl.acm.org/citation.cfm?id=363387) The original paper actually describes something closer to a JIT, since the output of his compiler is object code. But the object code is still executing a simulation of an NFA. This is very clear via the discussion around CLIST and NLIST handling at match time.

                                                                          So to be clear, I wouldn’t say the output of Thompson’s construction is itself a DFA, but rather, an NFA that can either 1) be interpreted as-is or 2) further transformed into an actual DFA. (2) can itself be split into a couple different strategies. One of them is the “lazy DFA,” which I believe was pioneered by Thompson, but in his implementation of grep, and definitely not in his original IBM 7094 paper. A lazy DFA is something that essentially caches the result of interpreting the NFA during match time. This tends to work well in practice and also permits setting a bound of the amount of memory you use, at the cost of essentially falling back to the performance characteristics of NFA interpretation if you reach that bound. The other strategy for (2) is your normal DFA building: transform the NFA into a DFA via powerset construction, and then possibly, DFA minimization. I’m guessing that’s what re2c is doing. They are still likely using Thompson’s construction, but it’s only one piece of the puzzle.

                                                                          But yeah, the expressiveness of everything here is all equivalent. It’s a weird framing, but at this point, we tend to couple theoretical terms (“NFA”, “DFA”, and so on) with specific performance characteristics. e.g., an NFA uses a small amount of memory to execute, but has very high overhead while a DFA can use lots of memory but has very small constant factors.

                                                                          As long as both algorithms produce a DFA, the search should use O(n) time and O(1) space.

                                                                          I think you are saying that the “split” bytecodes from [2] necessary to simulate the NFA make things slower.

                                                                          But I am not sure this is true, based on [1]:

                                                                          In a sense, Thompson’s NFA simulation is executing the equivalent DFA: each List corresponds to some DFA state, and the step function is computing, given a list and a next character, the next DFA state to enter. Thompson’s algorithm simulates the DFA by reconstructing each DFA state as it is needed.

                                                                          I think it’s pretty uncontroversial to state that the interpretation of an NFA is going to be, in practice, slower than an implementation that reflects the key characteristic of a DFA: a small constant number of instructions per processed byte, typically in the form of a transition lookup, advancing to the next state and checking for a match condition. An interpretation of the NFA is usually going to result in a number of instructions proportional to the size of your regex.

                                                                          There are some interesting exceptions, but I don’t think it really changes the above much. One exception is bit parallel NFAs, which can work quite well—perhaps even better than DFAs—but only for smaller machines. Another exception is JIT. If you JIT Thompson’s construction, then the performance characteristics change from the normal “interpretation” of an NFA, which usually resembles a VM of sorts.

                                                                          The now-standard framing of Thompson’s paper as an NFA construction and the addition of caching states to produce a DFA were both introduced later, by Aho and Ullman in their various textbooks. Like many primary sources, the original is different from the textbook presentations of it. (I myself am also guilty of this particular misrepresentation.)

                                                                          (my emphasis)

                                                                          I think he’s saying that Thompson’s original method was actually based on regular expression derivatives (Brzozowski derivatives) and not the McNaughton-Yamada algorithm.

                                                                          The derivatives method produces an optimized DFA directly – it doesn’t require the NFA -> unoptimized DFA intermediate stages.

                                                                          Thompson’s original paper isn’t long, so I think I’d actually recommend that you just read it. :-) It should be clear that Thompson isn’t talking about a DFA, but an NFA.

                                                                          I’m not sure what Cox means by his statement. Certainly, Thompson’s original paper does not talk about DFA state caching, but the construction and execution is undoubtedly an NFA simulation. The only thing I can think of here is that Thompson never actually mentions “NFA” or “DFA” in his paper, and therefore, framing Thompson’s contribution in those terms is something that others have added. But it kind of seems like a distinction without much of a difference, so I might be missing Cox’s point here.

                                                                          I would like to replace Oil’s usage of re2c with this technique eventually. I’m not sure if it will make things faster, but it might make the code size smaller. The lexer is a single function that’s around 23K of native code right now.

                                                                          Anyway I still have a lot more to learn about this, and to actually try out the “new” algorithms in code :) But I thought it was funny to see that little footnote that Thompson’s algorithm as explained in those long articles isn’t really Thompson’s algorithm. This is not a minor point – derivatives are a REALLY different algorithm.

                                                                          (Note that parsing with derivatives is different than regex with derivatives. Cox is criticizing parsing with derivatives in that article, but he says that regex with derivatives is a “win-win” – the algorithm is smaller, and the output is more optimized (smaller DFA). I guess that was the point of the redgrep exploration.)

                                                                          Yeah, Paul’s talk on redgrep is pretty freaking sweet.

                                                                          I wish I could remember why I haven’t done much with regex derivatives. I’ve researched them before, but it’s been a while now at this point. If I had to guess, I’d say that I probably get stuck with figuring out how to handle large Unicode character classes (which is also where I get stuck in Glushkov’s construction and in trying bit parallel NFAs).

                                                                          1. 2

                                                                            BTW Darius Bacon had the same question about Thompson’s paper and derivatives and addressed it here:

                                                                            https://news.ycombinator.com/item?id=18434225

                                                                            I guess there is a related idea called Antimirov Derivatives / Partial Derivatives, and that leads to the NFA? I have to understand it better. It looks like Antimirov derivatives didn’t have a name until 1995, but Thompson must have used that idea (?).

                                                                            The algorithm to generate a DFA in redgrep / epsilon based on derivatives seems to be completely different. It does NOT generate an NFA; it generates a DFA directly (and it has fewer states than some implementations of Thompson construction -> DFA minimization).

                                                                            The hard part of that algorithm seems to be canonicalizing the DFA states so that you can compare them with equality. Everybody seems to mention that issue. But it seems like it’s a short algorithm.

                                                                            Paul W. mentioned partial derivatives in his redgrep talk with respect to capturing groups. I guess that was an open problem in redgrep.

                                                                            1. 2

                                                                              Thanks for the kind words about Oil. I’m happy to have help from people familiar with regex implementation :)

                                                                              Two somewhat separate points:

                                                                              1. With regard to the original question: How does Aho-Corasick relate to Thompson’s construction? I agree the question isn’t super clear, but I don’t think “Thompson’s method uses an NFA and Aho-Corasick uses a DFA, and NFA has more overhead than DFA” (i.e. because there are split bytecodes) quite captures it.

                                                                              If I understand correctly, the difference between NFA and DFA really comes down to submatch extraction. You can extract submatches with the slower NFA method, but doing it with a DFA was impossible or very difficult.

                                                                              The authors of re2c published a new algorithm in 2017 for doing submatch extraction with compiled DFAs, but I haven’t gone over it:

                                                                              http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf

                                                                              Anyway, the point is that submatch extraction with DFAs is hard. That is borne out in both RE2 and re2c. IIRC, the whole point of the “multi-engine” backend in RE2 by Cox is to handle this. It uses the slower method when it needs to extract submatches, and the faster DFA method when it doesn’t.

                                                                              If you are NOT extracting submatches, which is the case for the Aho-Corasick problem, then you might as well just convert your NFA to a DFA and match more quickly.

                                                                              I basically think of NFAs and DFAs as equivalent, since they are mathematically. They recognize the exact same class of languages.

                                                                              This all boils down to terminology, but the way I think of it, there are two worlds. (And this is pretty much exactly the point that the Cox articles make.)

                                                                              a) Perl-style regexes, PCRE, Python – this is backtracking world. They’re not using NFAs or DFAs.

                                                                              b) awk, egrep, libc, RE2, re2c – This is NFA/DFA world or “Thompson’s construction” / Dragon Book / McNaughton-Yamada.

                                                                              In my mind I don’t think of “Thompson’s construction” as NFA. But yes I see that Cox’s articles use “Thompson NFA” a lot. However I think the issue is not NFA vs. DFA. It’s NFA vs. backtracking, or really NFA/DFA vs. backtracking. That’s how I think about it and I think it captures the salient difference.

                                                                              Anyway, this has probably hopelessly confused the original poster … long story short, I have never actually worked with Aho-Corasick, but my point is that if you create a regex like const_string_1|const_string_2|...|const_string_N, and you do the Thompson NFA -> DFA, you will end up with something that runs as fast as Aho-Corasick (since all DFAs can be interpreted in linear time and constant space).

                                                                              And if you optimize the DFA, maybe you will end up with the exact same one (just guessing, I don’t know that.)
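
                                                                              As a toy illustration of that point (hypothetical code, not what re2c or RE2 actually do): build the obvious NFA for an alternation of literals – one chain per string hanging off a shared start state – and run the subset construction over it. The DFA that falls out is trie-shaped, one state per distinct prefix.

                                                                                  # Toy illustration (not re2c/RE2 code): NFA for w1|w2|...|wN as one chain per
                                                                                  # word off a shared start state, then the textbook subset construction.
                                                                                  from collections import defaultdict

                                                                                  def literal_nfa(words):
                                                                                      nfa = defaultdict(set)                  # (state, char) -> set of next states
                                                                                      accept, next_state = set(), 1           # state 0 is the shared start state
                                                                                      for w in words:
                                                                                          cur = 0
                                                                                          for ch in w:
                                                                                              nfa[cur, ch].add(next_state)
                                                                                              cur, next_state = next_state, next_state + 1
                                                                                          accept.add(cur)
                                                                                      return nfa, accept

                                                                                  def subset_construction(nfa, nfa_accept):
                                                                                      start = frozenset({0})
                                                                                      states, work = {start: 0}, [start]
                                                                                      dfa, dfa_accept = {}, set()
                                                                                      while work:
                                                                                          s = work.pop()
                                                                                          if s & nfa_accept:
                                                                                              dfa_accept.add(states[s])
                                                                                          by_char = defaultdict(set)
                                                                                          for (p, ch), nxt in nfa.items():
                                                                                              if p in s:
                                                                                                  by_char[ch] |= nxt
                                                                                          for ch, nxt in by_char.items():
                                                                                              t = frozenset(nxt)
                                                                                              if t not in states:
                                                                                                  states[t] = len(states)
                                                                                                  work.append(t)
                                                                                              dfa[states[s], ch] = states[t]
                                                                                      return dfa, dfa_accept

                                                                                  keywords = 'for while continue break if fi then else elif case esac do done'.split()
                                                                                  dfa, dfa_accept = subset_construction(*literal_nfa(keywords))
                                                                                  print(len({s for s, _ in dfa} | set(dfa.values())), 'DFA states')   # one per distinct prefix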


                                                                              2. I went back and skimmed the Thompson paper, and I see the parts about CLIST and NLIST. It is not clear to me that this is about NFAs. As you say, it doesn’t mention those words. And it talks about Brzozowski derivatives in the first paragraph.

                                                                              http://www.oilshell.org/archive/Thompson-1968.pdf

                                                                              I don’t know 7094 assembly or ALGOL-60, so it’s hard for me to interpret. I tend to trust what Russ Cox says.

                                                                              I think it’s plausible that it is using NFAs, because the Dragon Book explains it that way, and Russ Cox explained it that way in the first paragraphs of his first article.

                                                                              But I think he went back and read it later and realized it was about derivatives! It’s also plausible to me that the technique is based on derivatives. In fact it is the most plausible to me because the paper mentions derivatives and does not mention NFAs or DFAs!!!

                                                                              I guess to really answer this question I’d have to go implement both algorithms and see. (Or maybe just e-mail Russ Cox.)

                                                                              (I actually implemented what were known as 2 separate parsing algorithms here and realized they are the same: http://www.oilshell.org/blog/2016/11/01.html . The author of precedence climbing acknowledged this to be true – links at end. Derivatives and McNaughton-Yamada seem very different to me, but maybe there is a deep connection.)

                                                                              1. 1

                                                                                With regard to the original question: How does Aho-Corasick relate to Thompson’s construction? I agree the question isn’t super clear, but I don’t think “Thompson’s method uses an NFA and Aho-Corasick uses a DFA, and NFA has more overhead than DFA” (i.e. because there are split bytecodes) quite captures it.

                                                                                Hmm I don’t think I said that. :-/ Sorry if that’s how it came across. The typical Aho-Corasick formulation uses failure transitions at match time, so it isn’t a DFA. It’s an NFA. “Advanced” Aho-Corasick is the classical DFA variant.

                                                                                It doesn’t really make sense to compare and contrast the theoretical properties of Aho-Corasick and Thompson’s construction, because they are both just alternative ways of framing the construction of an NFA, which can eventually be turned into equivalent DFAs through powerset construction and minimization. The interesting comparison is at the implementation level, and specifically, the construction algorithms themselves. Aho-Corasick is a fundamentally more constrained algorithm, and doesn’t need to handle the full generality of regular expressions.
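
                                                                                For concreteness, here’s a compact sketch of the classical failure-transition formulation (illustrative Python, not the actual aho-corasick crate): a goto trie, failure links computed by BFS, and a match loop that falls back along failure links.

                                                                                    # Illustrative sketch of classical Aho-Corasick (the failure-transition variant),
                                                                                    # not the actual aho-corasick crate.
                                                                                    from collections import deque

                                                                                    def build(patterns):
                                                                                        goto, out, fail = [{}], [[]], [0]            # state 0 is the root
                                                                                        for p in patterns:                           # 1) goto function: a trie
                                                                                            s = 0
                                                                                            for ch in p:
                                                                                                if ch not in goto[s]:
                                                                                                    goto.append({}); out.append([]); fail.append(0)
                                                                                                    goto[s][ch] = len(goto) - 1
                                                                                                s = goto[s][ch]
                                                                                            out[s].append(p)
                                                                                        q = deque(goto[0].values())                  # 2) failure links via BFS
                                                                                        while q:
                                                                                            s = q.popleft()
                                                                                            for ch, t in goto[s].items():
                                                                                                q.append(t)
                                                                                                f = fail[s]
                                                                                                while f and ch not in goto[f]:
                                                                                                    f = fail[f]
                                                                                                fail[t] = goto[f].get(ch, 0)
                                                                                                out[t] += out[fail[t]]               # inherit matches ending here
                                                                                        return goto, out, fail

                                                                                    def search(text, goto, out, fail):
                                                                                        s, hits = 0, []
                                                                                        for i, ch in enumerate(text):
                                                                                            while s and ch not in goto[s]:           # fall back along failure links
                                                                                                s = fail[s]
                                                                                            s = goto[s].get(ch, 0)
                                                                                            for p in out[s]:
                                                                                                hits.append((i - len(p) + 1, p))     # reports *all* matches
                                                                                        return hits

                                                                                    print(search("Samwise", *build(["Sam", "Samwise", "wise"])))
                                                                                    # [(0, 'Sam'), (0, 'Samwise'), (3, 'wise')]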

                                                                                If I understand correctly, the difference between NFA and DFA really comes down to submatch extraction. You can extract submatches with the slower NFA method, but doing it with a DFA was impossible or very difficult.

                                                                                Your second sentence is true, but the first sentence doesn’t really make sense to me honestly. If we’re talking about an implementation that reflects the structure of an NFA, then you’re almost always going to be looking at some interpretation of the NFA at match time (or, perhaps, a bit-parallel implementation if the machine is small enough or a JIT). This looks very very different from what a DFA implementation looks like.

                                                                                Thank you for the link about re2c. I hadn’t seen that. Currently, Rust’s regex library (like RE2) needs to use an NFA to do submatch capture, which stinks, because it’s much slower than the DFA.

                                                                                If you are NOT extracting submatches, which is the case for the Aho-Corasick problem, then you might as well just convert your NFA to a DFA and match more quickly.

                                                                                This is too much of a simplification. The DFA variant of Aho-Corasick requires a lot more memory than the NFA variant. In my Aho-Corasick library, this is exposed as an option for users, so they can decide the time vs space trade off themselves.

                                                                                I wonder if your Oil use case is making this a bit more difficult to grasp. In that context, where you have a tokenizer that rarely changes and a tool to convert your DFA into code, then an NFA seems rather silly. But in a more general context, where you’re building regexes at runtime and matching them, the specific trade offs between space and time become much more relevant. In that context, eagerly building a DFA is a pretty rare thing to do (as supported by numerous implementations such as RE2, Go’s regex library, GNU grep, Thompson’s original grep and of course my own regex library too).

                                                                                I basically think of NFAs and DFAs as equivalent, since they are mathematically. They recognize the exact same class of languages.

                                                                                Yes… But like I said in my comment above, we are well beyond theory here. The vernacular in this domain couples theoretical terms with implementation details. There are very real differentiating properties between “I implemented an NFA matcher” and “I implemented a DFA matcher.”

                                                                                Take a look at section 6 of the original paper that introduced Aho-Corasick. It explicitly introduces a DFA variant, and explicitly calls out its downside as increased memory usage but faster matching. This is in contrast to the initial and classical formulation with failure transitions that uses less memory but is slower at matching.

                                                                                In my mind I don’t think of “Thompson’s construction” as NFA. But yes I see that Cox’s articles use “Thompson NFA” a lot. However I think the issue is not NFA vs. DFA. It’s NFA vs. backtracking, or really NFA/DFA vs. backtracking. That’s how I think about it and I think it captures the salient difference.

                                                                                No, I don’t think so. I mean, yes, of course it’s a critical difference, but that’s really not what we’re talking about here. We’re talking about Aho-Corasick vs the Thompson construction (or, RE2 if you like). Bringing up backtracking engines is well outside the scope, and moreover, it is not the only distinction that matters. You yourself noted the variety of implementation strategies used by RE2. These are precisely a reflection of the various properties of typical NFA vs DFA implementations. Waving all of this away honestly makes this entire discussion rather moot.

                                                                                Anyway, this has probably hopelessly confused the original poster … long story short, I have never actually worked with Aho-Corasick, but my point is that if you create a regex like const_string_1|const_string_2|…|const_string_N, and you do the Thompson NFA -> DFA, you will end up with something that runs as fast as Aho-Corasick (since all DFAs can be interpreted in linear time and constant space).

                                                                                That sounds right, yes, but it’s a somewhat trivial theoretical result. e.g., Given two regular expressions, you can determine whether the languages they recognize are equivalent by converting both to a minimal DFA. If the resulting DFAs are equivalent up to state renaming, then you know they recognize the same languages. Framing it this way misses the forest for the trees; there are very real implementation distinctions that arise in practice between regex implementations and Aho-Corasick implementations. A lot of it is really about tweaking time vs space trade offs. For example, RE2 (and also Rust’s regex library) will never actually build the full DFA of a regex eagerly. Instead, it’s constructed lazily. That lazy construction is really useful because it gets you very close to the full speed of a DFA but allows more precise control over the space used. Aho-Corasick, on the other hand, is going to use far less memory. In practice, for small regexes of literal alternations, you might choose to use an Aho-Corasick DFA (“Advanced Aho-Corasick”) instead of using the full blown regex engine, because the Aho-Corasick DFA is likely faster than the fully general lazy DFA.
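
                                                                                To make “constructed lazily” concrete, here’s a toy sketch (nothing like RE2’s or the regex crate’s real lazy DFA): DFA states are sets of NFA states, and each transition is computed the first time it’s needed and memoized.

                                                                                    # Toy sketch of a lazily built DFA (nothing like RE2's or the regex crate's real
                                                                                    # implementation). Hard-coded epsilon-free NFA for (a|b)*abb:
                                                                                    NFA = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}, (2, 'b'): {3}}
                                                                                    NFA_START, NFA_ACCEPT = frozenset({0}), {3}

                                                                                    cache = {}                                        # (dfa_state, char) -> dfa_state

                                                                                    def step(dfa_state, ch):
                                                                                        """Compute one DFA transition from a set of NFA states, memoizing it."""
                                                                                        key = (dfa_state, ch)
                                                                                        if key not in cache:
                                                                                            cache[key] = frozenset(q for s in dfa_state for q in NFA.get((s, ch), ()))
                                                                                        return cache[key]

                                                                                    def matches(text):
                                                                                        s = NFA_START
                                                                                        for ch in text:
                                                                                            s = step(s, ch)                           # cache hit after the first time
                                                                                        return bool(s & NFA_ACCEPT)

                                                                                    print(matches("ababb"), matches("abab"))          # True False
                                                                                    print(len(cache), "cached transitions")           # only the states actually visited
                                                                                    # (a real engine also bounds the cache and evicts/clears it under memory pressure)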

                                                                                I went back and skimmed the Thompson paper, and I see the parts about CLIST and NLIST. It is not clear to me that this is about NFAs.

                                                                                It is. No DFA implementation is going to be explicitly manipulating states at match time. Derivatives are certainly used to motivate construction; but it’s still an NFA.

                                                                                I’m not quite sure how to fix your confusion on this topic. :-/ I think if I could say one thing, it would be that NFA vs DFA have very real distinctions in their typical implementations, even if theoretically they give the same power. This is why we can speak separately about “Thompson’s construction” and “DFA”, because they are not intricately coupled.

                                                                                I feel like a read of Navarro and Raffinot’s “Flexible Pattern Matching in Strings” would help clear up a lot of this for you, because they make the NFA vs DFA distinction very clear at the implementation level with lots of examples. Unfortunately, the book is a little pricey. If we were friends in real life, I’d lend you my copy. :-)

                                                                                1. 1

                                                                                  I did some work on this that clarifies things somewhat – at least I think so, let me know what you think.

                                                                                  https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks

                                                                                  I don’t think we are disagreeing that much about technical points. I was somewhat nitpicking the original answer because I didn’t think it illuminated the core issue. And because I want to understand the difference myself, for Oil! I think this will matter eventually and I’m not just arguing for its own sake :)

                                                                                  It doesn’t really make sense to compare and contrast the theoretical properties of Aho-Corasick and Thompson’s construction, because they are both just alternative ways of framing the construction of an NFA, which can eventually be turned into equivalent DFAs through powerset construction and minimization.

                                                                                  I think that totally makes sense! That is very important to understand!!! Does everyone reading this thread know that?

                                                                                  I don’t believe it’s common knowledge and it’s one of the first things I would say to someone who says “compare Thompson’s construction and Aho-Corasick”. If it’s true that they boil down to the same DFA, then that’s important to understand. (I think Aho-Corasick might behave differently with respect to overlapping matches. If that’s the main difference then that would clarify things.)

                                                                                  People generally care about execution time, not compilation time. Compilation time/memory depends on the size of the expressions, which is generally small (like bytes or kilobytes), while execution depends on the size of the input data, which could be gigabytes, terabytes, etc.

                                                                                  For one project I routinely used regexes over terabytes of data. I never thought about compilation time, or NFA vs. DFA. It wasn’t a choice you’re really able to make as a user of regular expressions. But you can make the choice of backtracking vs. automata-based (and not everyone is aware of that choice).

                                                                                  This is too much of a simplification. The DFA variant of Aho-Corasick requires a lot more memory than the NFA variant. In my Aho-Corasick library, this is exposed as an option for users, so they can decide the time vs space trade off themselves.

                                                                                  I don’t understand this… Look at the generated code I linked for re2c:

                                                                                  https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/_gen/fixed-strings.cc

                                                                                  https://raw.githubusercontent.com/oilshell/blog-code/master/fgrep-problem-benchmarks/_gen/code-size.txt

                                                                                  This code uses almost no memory, and the code size is tiny as well. It’s matching using a DFA. Why can’t Aho-Corasick do the same? Doesn’t it boil down to the same Trie-like DFA?

                                                                                  Framing it this way misses the forest for the trees; there are very real implementation distinctions that arise in practice between regex implementations and Aho-Corasick implementations. A lot of it is really about tweaking time vs space trade offs. For example, RE2 (and also Rust’s regex library) will never actually build the full DFA of a regex eagerly. Instead, it’s constructed lazily.

                                                                                  For the case that the input is an alternation of constant strings, why not do it? That seems to be exactly what re2c does. I get a trie-like DFA out of re2c and that appears to be what Aho-Corasick does.

                                                                                  Of course re2c has more time to compile things, since it’s offline, so that could be a difference. It does have a --dfa-minimization flag for the strategy. It could be that I am underestimating how hard that is.

                                                                                  But I’m only talking about the constant string case. I know that DFA minimization in general is hard, but what about minimizing an NFA with no repetition at all? (which is the Aho-Corasick problem)
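
                                                                                  To make the “trie-like DFA” point concrete, here’s a throwaway sketch (obviously not re2c output) that builds the trie for the same 13 keywords: determinizing an alternation of constant strings gives one live state per distinct prefix, which is exactly this trie (minimization could still merge common suffixes).

                                                                                      # Throwaway sketch (not re2c output): the trie of the 13 keywords. Each distinct
                                                                                      # prefix is a state, which is what the determinized literal alternation looks like.
                                                                                      keywords = 'for while continue break if fi then else elif case esac do done'.split()

                                                                                      states = [{}]                        # state -> {char: next state}; state 0 = root
                                                                                      for w in keywords:
                                                                                          s = 0
                                                                                          for ch in w:
                                                                                              if ch not in states[s]:
                                                                                                  states.append({})
                                                                                                  states[s][ch] = len(states) - 1
                                                                                              s = states[s][ch]

                                                                                      print(len(states), 'trie states for', sum(map(len, keywords)), 'pattern bytes')
                                                                                      # 45 trie states for 51 pattern bytes (shared prefixes like "f", "el", "do" are merged)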


                                                                                  I didn’t realize you had implemented the Rust aho-corasick library, but I linked it in my repo. (You should have mentioned that in the first reply! :) )

                                                                                  If I’m understanding correctly, I think it’s going to construct the same trie-like DFA as re2c did, and therefore it will run no faster than re2c. Or possibly slower since it’s interpreted and not compiled. I would love a PR for a benchmark :)

                                                                                  (Note that grep runs faster than re2c! Pretty amazing, and I note that I think it’s due to skipping input, but I didn’t verify that in any way. grep also runs faster than fgrep on this problem!)


                                                                                  About Thompson’s paper: I now agree it’s about NFAs. After looking at Figure 5, it’s obvious it is an NFA for a(b|c)d.

                                                                                  I’m still confused by the derivatives comment in the first paragraph. The 2009 paper that “reintroduced” regex derivatives actually makes a comment on this exact line at the end in the “Related Work” section:

                                                                                  Regular-expression derivatives re-examined

                                                                                  I’ll definitely be looking at derivatives more! Let me know if you do any work with them. Apparently they do support Unicode well and this paper specifically talks about large character classes. The “epsilon” video by Michael Paddon I linked earlier says that Unicode was one of his primary motivations for derivatives.

                                                                                  There is a big Unicode data file in the repo: https://github.com/MichaelPaddon/epsilon/blob/master/epsilon/ucd.py

                                                                                  Also take a look at the table in section 5.2 on DFA size. It compares the derivatives method vs. a Thompson construction and says it’s always better.

                                                                                  I guess Aho-Corasick is probably always better than Thompson too for the subset of the problem it solves, as far as DFA size, but I would love to see if that translates to execution time. I think it will be hard to construct a case where re2c’s algorithm isn’t good enough and you have to use Aho-Corasick.


                                                                                  Also I see the clist and nlist thing here. It wasn’t clear from your comment what those were supposed to be doing in the Thompson paper, but it’s a lot clearer in this explanation:

                                                                                  https://swtch.com/~rsc/regexp/regexp1.html

                                                                                  The simulation uses two lists: clist is the current set of states that the NFA is in, and nlist is the next set of states that the NFA will be in, after processing the current character.
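
                                                                                  Here’s a toy version of that simulation (a hand-built NFA for a(b|c)d rather than anything compiled from the paper; just to convince myself I understand the clist/nlist idea):

                                                                                      # Toy clist/nlist simulation in the style of Cox's article, over a hand-built
                                                                                      # Thompson NFA for a(b|c)d. States: ('char', c, next), ('split', n1, n2), ('match',).
                                                                                      MATCH = ('match',)
                                                                                      NFA = {
                                                                                          0: ('char', 'a', 1),
                                                                                          1: ('split', 2, 3),          # epsilon fork for (b|c)
                                                                                          2: ('char', 'b', 4),
                                                                                          3: ('char', 'c', 4),
                                                                                          4: ('char', 'd', 5),
                                                                                          5: MATCH,
                                                                                      }

                                                                                      def add_state(states, i):
                                                                                          """Add state i, following split (epsilon) edges transitively."""
                                                                                          if i in states:
                                                                                              return
                                                                                          if NFA[i][0] == 'split':
                                                                                              add_state(states, NFA[i][1])
                                                                                              add_state(states, NFA[i][2])
                                                                                          else:
                                                                                              states.add(i)

                                                                                      def matches(text):
                                                                                          clist = set()                # current set of NFA states
                                                                                          add_state(clist, 0)
                                                                                          for ch in text:
                                                                                              nlist = set()            # next set of NFA states
                                                                                              for i in clist:
                                                                                                  kind, *rest = NFA[i]
                                                                                                  if kind == 'char' and rest[0] == ch:
                                                                                                      add_state(nlist, rest[1])
                                                                                              clist = nlist
                                                                                          return any(NFA[i] == MATCH for i in clist)

                                                                                      print(matches("abd"), matches("acd"), matches("abcd"))   # True True False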


                                                                                  Anyway, thanks for this discussion! I’m definitely learning a lot. I want to eventually unify the mess of regexes in a typical shell script for Oil, so that’s why I’m so interested in this.

                                                                                  1. 1

                                                                                    I think that totally makes sense! That is very important to understand!!! Does everyone reading this thread know that?

                                                                                    Not sure. I jumped right over the theory part because in the theory, everything folds into equivalence with the same expressive power. The only real practical distinction there that I can think to make is that I remember NFAs being easier to construct than DFAs in my computability theory class (many moons ago), which in turn made it easier to use NFAs instead of DFAs to prove a particular language to be regular.

                                                                                    If it’s true that they boil down to the same DFA, then that’s important to understand. (I think Aho-Corasick might behave differently with respect to overlapping matches. If that’s the main difference then that would clarify things.)

                                                                                    Right… Good point. You probably can’t say for certain that they’ll boil down to the same DFA without being a bit more careful. Overlapping matches are probably part of it, but even in the non-overlapping case, the typical Aho-Corasick implementation will still probably have subtly different match semantics than the typical regex DFA implementation. That is, regexes typically implement “leftmost first” match semantics (the natural consequence of a backtracking implementation) or “leftmost longest” match semantics (found primarily in POSIX compatible regex implementations). For example, given the patterns Sam and Samwise and the haystack Samwise:

                                                                                    • Sam|Samwise matches Samwise in leftmost longest.
                                                                                    • Samwise|Sam matches Samwise in leftmost longest.
                                                                                    • Sam|Samwise matches Sam in leftmost first.
                                                                                    • Samwise|Sam matches Samwise in leftmost first.

                                                                                    So with leftmost longest semantics, the order of your alternation doesn’t matter. But with leftmost first semantics, the order does matter. RE2 (along with Go’s and Rust’s regex engines) both provide leftmost first semantics even though they are finite automata implementations. The reason for doing so is to match the semantics of popular regex implementations such as PCRE. Notably, RE2 (and Go) both permit the caller to switch to leftmost longest semantics if they want. Generally speaking, I’ve found leftmost first semantics to be more useful.
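
                                                                                    You can see leftmost first with any Perl-style engine; Python’s re module is a backtracking implementation, so the order of the alternation decides the match:

                                                                                        import re                                             # a leftmost-first (backtracking) engine

                                                                                        print(re.search("Sam|Samwise", "Samwise").group())    # Sam
                                                                                        print(re.search("Samwise|Sam", "Samwise").group())    # Samwise
                                                                                        # A POSIX leftmost-longest engine would report Samwise in both cases.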

                                                                                    Back to Aho-Corasick… In most typical implementations that I know of (including my own), it will report all matches. So for Sam|Samwise, it would match at both Sam and Samwise. However, it’s also easy to tweak the matching routine (without changing the automaton) to not report overlapping matches by simply moving back to the start state after a match is found. In this case, Aho-Corasick will report Sam as a match but not Samwise regardless of the order of the patterns. In this sense, Aho-Corasick has neither leftmost first nor leftmost longest semantics, but leftmost shortest. (Generally, I see this as a problem and it’s one of the things on my list to fix. In particular, I’d like for Aho-Corasick to have the same match semantics as the regex engine so that Aho-Corasick can be used to completely replace the use of the regex engine internally when you have an alternation of literals.)

                                                                                    People generally care about execution time, not compilation time. Compilation time/memory depends on the size of the expressions, which is generally small (like bytes or kilobytes), while execution depends on the size of the input data, which could be gigabytes, terabytes, etc.

                                                                                    For one project I routinely used regexes over terabytes of data. I never thought about compilation time, or NFA vs. DFA. It wasn’t a choice you’re really able to make as a user of regular expressions. But you can make the choice of backtracking vs. automata-based (and not everyone is aware of that choice).

                                                                                    Sure, but this is mostly a domain specific thing. I’d agree that if you’re working on a problem in which re2c is worth using in the first place, then you probably always want to spend the extra time in compilation to make matching faster. In that sense, choosing between NFA and DFA is an easy choice: pick the DFA. But this is because you’re paying for compilation when you compile your program itself. In a typical regex engine, the cost of compilation is paid during the execution of your program. While users have generally grown to expect that regex compilation is not exactly super cheap, you generally don’t have all day either. For re2c, it’s conceivable that you might be willing to spend minutes building the DFA. But for a normal regex engine, you probably don’t want to spend more than a millisecond building your machine for reasonably sized regexes.

                                                                                    The other thing to look at here is that, in terms of implementation choices, an implementation modeled after an NFA tends to have more power in that it can resolve capture groups. Now, you did point out a paper that purports to do this in a DFA, but this generally isn’t how it has been done in practice (RE2, Go, Rust). I haven’t read the paper yet, so I’m also not sure what the trade offs are. My point here is that, depending on implementation, choosing between NFA and DFA might depend on what question you’re asking.

                                                                                    I don’t understand this… Look at the generated code I linked for re2c:

                                                                                    https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/_gen/fixed-strings.cc

                                                                                    https://raw.githubusercontent.com/oilshell/blog-code/master/fgrep-problem-benchmarks/_gen/code-size.txt

                                                                                    This code uses almost no memory, and the code size is tiny as well. It’s matching using a DFA. Why can’t Aho-Corasick do the same? Doesn’t it boil down to the same Trie-like DFA?

                                                                                    I don’t think looking at generated code out of context can really do much. e.g., I don’t know what set of strings went into the production of that machine. I’d guess it was small though. So in that sense, it’s no surprise that the corresponding DFA is small. When it comes to Aho-Corasick, I suspect that the size of the DFA is indeed proportional to the total size of all the patterns you’ve compiled into it. (I’m not 100% on that.) But in the case of regexes, with its repetition operators, the size of the DFA can be exponential in the size of the original regex.

                                                                                    Try building an automaton that matches any word in /usr/share/dict/words.

                                                                                    For the case that the input is an alternation of constant strings, why not do it? That seems to be exactly what re2c does. I get a trie-like DFA out of re2c and that appears to be what Aho-Corasick does.

                                                                                    It really just depends on your tolerance for memory-vs-speed-vs-compile-time. If you’re using re2c, then you almost certainly fall into the “I don’t care about compile time, and I probably have a very high tolerance of memory usage since that’s merely reflected in code size” camp. There are some secondary effects here of course, e.g., a large code size might result in thrashing your instruction cache.

                                                                                    Of course re2c has more time to compile things, since it’s offline, so that could be a difference. It does have a --dfa-minimization flag for the strategy. It could be that I am underestimating how hard that is.

                                                                                    But I’m only talking about the constant string case. I know that DFA minimization in general is hard, but what about minimizing an NFA with no repetition at all? (which is the Aho-Corasick problem)

                                                                                    Yes, while the general case is exponential, I believe that at least building an Aho-Corasick DFA is linear with respect to the number of literals you’re trying to match. I don’t know whether minimization is also minimal. But remember, memory usage is another key component here. The DFA is going to use a lot more of it. Usually, if you’re building a DFA, you’re either doing it at source compile time (so it’s embedded in the code as a state machine) or you’re doing it at runtime, which means it’s probably a table based DFA. Naively, for a byte based automaton, that’s 256 transitions for each state, where each transition needs to be big enough to point to any arbitrary state. If you choose 32 bits, then you’re already at 1KB per state. Now, obviously, there are different ways to represent DFAs. You could pick a sparse representation of the transitions, but this almost always sacrifices execution time because a sparse lookup will probably be slower. Beyond that, there are other tricks you can do like condensing the number of transitions per state to the minimal number of transitions needed by separating the space of all 256 bytes into equivalence classes. This incurs an extra lookup per byte at match time, but it’s generally worth it for the space savings.
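
                                                                                    Some back-of-the-envelope arithmetic for the dense table representation (illustrative numbers, not the regex crate’s actual layout):

                                                                                        # Illustrative numbers for a dense, table-based DFA (not any library's real layout).
                                                                                        num_states = 10_000
                                                                                        bytes_per_transition = 4                          # 32-bit state ids
                                                                                        dense = num_states * 256 * bytes_per_transition   # 256 columns = 1 KiB per state
                                                                                        print(f"dense table: {dense / 2**20:.1f} MiB")    # ~9.8 MiB

                                                                                        # Byte equivalence classes: every byte that appears in some pattern gets its own
                                                                                        # class (a crude upper bound), and all remaining bytes share one class. The table
                                                                                        # shrinks a lot, at the cost of one extra class lookup per input byte.
                                                                                        patterns = ["for", "while", "if", "then", "else"]
                                                                                        num_classes = len({b for p in patterns for b in p.encode()}) + 1
                                                                                        compressed = num_states * num_classes * bytes_per_transition
                                                                                        print(f"with {num_classes} byte classes: {compressed / 2**20:.1f} MiB")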

                                                                                    I didn’t realize you had implemented the Rust aho-corasick library, but I linked it in my repo. (You should have mentioned that in the first reply! :) )

                                                                                    If I’m understanding correctly, I think it’s going to construct the same trie-like DFA as re2c did, and therefore it will run no faster than re2c. Or possibly slower since it’s interpreted and not compiled. I would love a PR for a benchmark :)

                                                                                    (Note that grep runs faster than re2c! Pretty amazing, and I note that I think it’s due to skipping input, but I didn’t verify that in any way. grep also runs faster than fgrep on this problem!)

                                                                                    Right, yeah, hah. I’ve been working on string search for a lot of years now. In terms of credentials, I also wrote Rust’s regex library, ripgrep (which bests grep in many cases) and one of the fastest CSV parsers in existence (which is particularly relevant to our conversation here, because I did it by compiling the CSV parser into a DFA on the stack, whereas most CSV parsers use the standard NFA simulation embedded in the code).

                                                                                    Most or all of my work is focused on building finite state machines at runtime. This biases toward a very different set of trade offs than the offline compilation found in the domain of re2c. So I don’t think I would necessarily claim anything about performance. In general, I’d probably almost always give the edge to re2c over any equivalent machine compiled at runtime.

                                                                                    That grep runs faster is probably what I’d expect. A finite state machine has its limits. Its key advantage is the ability to express complex match semantics in a general way. But there are all sorts of special cases that can be optimized to go quite a bit faster than a normal finite state machine. You hinted at it with the “skipping input” aspect of grep (you’ve probably read this). But in modern times, it’s less to do with skipping input and more to do with the fact that the skip loop itself is written to use optimized vector instructions via memchr (glibc’s implementation of that is in x86 Assembly, my own implementation uses normal Rust code with vendor intrinsics). To that end, it turns out that you don’t even need to skip input to go very fast, but it’s instead better to pick the byte you skip on more wisely. That’s what led to my own FreqyPacked algorithm that uses a static notion of how common certain bytes are, and chooses the skip loop based on that. This is distinct from Boyer-Moore, which is usually written with a skip loop on the very last byte. That’s what GNU grep does. This specific algorithmic difference is a key reason why ripgrep edges out GNU grep in a lot of cases.
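
                                                                                    A toy version of that idea (illustrative Python; nothing like the real memchr/FreqyPacked code, and the frequency ranking is a made-up stand-in): pick the rarest byte of the needle, scan for just that byte with a fast routine, and only verify the full needle around each candidate.

                                                                                        # Toy "rare byte" prefilter (nothing like the real memchr/FreqyPacked code).
                                                                                        # bytes.find plays the role of the fast vectorized scan; COMMON is a rough,
                                                                                        # made-up frequency ranking from most to least common.
                                                                                        COMMON = b"etaoin shrdlucmfwypvbgkjqxz"

                                                                                        def rank(b):
                                                                                            i = COMMON.find(b)
                                                                                            return i if i >= 0 else len(COMMON)           # unknown bytes count as rare

                                                                                        def find_all(haystack: bytes, needle: bytes):
                                                                                            offset = max(range(len(needle)), key=lambda i: rank(needle[i]))
                                                                                            rare = needle[offset]                          # scan on the rarest byte...
                                                                                            hits, i = [], offset
                                                                                            while True:
                                                                                                i = haystack.find(rare, i)                 # ...with the fast skip loop
                                                                                                if i == -1:
                                                                                                    return hits
                                                                                                start = i - offset
                                                                                                if start >= 0 and haystack[start:start + len(needle)] == needle:
                                                                                                    hits.append(start)                     # ...then verify the whole needle
                                                                                                i += 1

                                                                                        print(find_all(b"then do done then", b"done"))     # [8]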

                                                                                    I’ll definitely be looking at derivatives more! Let me know if you do any work with them. Apparently they do support Unicode well and this paper specifically talks about large character classes. The “epsilon” video by Michael Paddon I linked earlier says that Unicode was one of his primary motivations for derivatives.

                                                                                    Interesting! Thanks for the links. One of the tasks that’s nextish on my list (after building out a bit more infrastructure) is to rethink the optimization and compilation passes in Rust’s regex crate. No small part of that will be figuring out how to address big Unicode classes, because right now, they are a primary thing that causes Thompson’s NFA construction to go into the weeds.

                                                                                    Anyway, thanks for this discussion! I’m definitely learning a lot. I want to eventually unify the mess of regexes in a typical shell script for Oil, so that’s why I’m so interested in this.

                                                                                    Good luck!

                                                                                    1. 1

                                                                                      I will respond more later, but I just tested ripgrep and it beats grep (and fgrep)! Impressive :)

                                                                                      https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks

                                                                                      Is ripgrep basically a wrapper around the Rust regex crate, or is there more to it?

                                                                                      I’m curious about RE2 now. Also curious about your aho-corasick library. I’m going to be surprised if it can actually take advantage of anything to perform better than ripgrep :)

                                                                                      The fact that fgrep is slower than grep makes me doubt that the constant string alternation problem is really “easier” than the regex problem.

                                                                                      1. 1

                                                                                        Is ripgrep basically a wrapper around the Rust regex crate, or is there more to it?

                                                                                        Hah… You’ll want to read the blog post I linked you to. :-) There is quite a bit more to it, but most of the interesting algorithmy stuff is in the regex engine. ripgrep itself is mostly engineering. Note that ripgrep’s regex engine is actually pluggable. By default it uses Rust’s regex engine, but the -P flag enables PCRE2, assuming you build ripgrep with PCRE2 support. (See README.)

                                                                                        Your RE2 and re2c benchmarks aren’t quite measuring the same thing as your grep/fgrep/ripgrep benchmarks. The latter are counting lines, but the former are counting all matches. Consider this comparison, where the former corresponds to counting all matching lines while the latter corresponds to counting every individual match:

                                                                                        $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -c
                                                                                        2969020
                                                                                        real    0.673
                                                                                        user    0.655
                                                                                        sys     0.017
                                                                                        maxmem  343 MB
                                                                                        faults  0
                                                                                        
                                                                                        $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -c -o
                                                                                        3830430
                                                                                        real    1.497
                                                                                        user    1.458
                                                                                        sys     0.037
                                                                                        maxmem  343 MB
                                                                                        faults  0
                                                                                        

                                                                                        Since your query has a lot of matches, this actually winds up being quite significant. I don’t think GNU grep actually has a way of counting every individual match, short of using -o and piping the output to wc -l. But then you wind up measuring the time it takes to not just count every match, but print every match too. You can see the difference with ripgrep, which admittedly isn’t too much, but it’s not nothing:

                                                                                        $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -o -c
                                                                                        3830430
                                                                                        real    1.511
                                                                                        user    1.492
                                                                                        sys     0.017
                                                                                        maxmem  343 MB
                                                                                        faults  0
                                                                                        
                                                                                        $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt -o | wc -l
                                                                                        3830430
                                                                                        real    1.684
                                                                                        user    1.648
                                                                                        sys     0.033
                                                                                        maxmem  343 MB
                                                                                        faults  0
                                                                                        

                                                                                        I’m going to be surprised if it can actually take advantage of anything to perform better than ripgrep :)

                                                                                        I welcome all competitors! Beating ripgrep in all or most cases will be tricky. If you limit yourself to a specific subset of regexes and allow yourself to use offline compilation, then I’m sure there are wins to be had. There are lots of interesting optimizations at play here though, and generally finite state machines are your last line of defense because they are slower than optimized vector routines. But, vector optimizations don’t really scale to large inputs. They only really work for small regexes.

                                                                          2. 2

                                                                            wow, thanks for this long answer. seems like i have quite a bit of reading to do! :) so only a limited answer for some points:

                                                                            An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                                                                            I hadn’t really thought about it this way (regular expressions as the matched unit instead of words) until now. Maybe because most regexp examples / uses tend to match against a limited set of words, not infinite ones.

                                                                            For completeness, note that there is also Glushkov’s construction algorithm

                                                                            I didn’t know about Glushkov’s algorithm, interesting!

                                                                          3. 3

                                                                            BTW I put the original paper here, based on the long discussion below:

                                                                            http://www.oilshell.org/archive/Thompson-1968.pdf

                                                                            This got me interested and I published some benchmarks. Supposedly fgrep originally used Aho-Corasick. I’m not sure if GNU fgrep actually does now though! But regular grep is faster than fgrep on this problem, perhaps because it has better Boyer-Moore-like “input skipping”. That’s my conjecture anyway.

                                                                            https://github.com/oilshell/blog-code/tree/master/fgrep-problem-benchmarks


                                                                            EDIT: Yeah I see this in grep.c in the GNU source. It seems like it doesn’t try to treat the fgrep problem specially? For example, by using Aho-Corasick. Not sure why fgrep is slower than grep then. It should be the same speed if the same engine is used.

                                                                                /* In a unibyte locale, switch from fgrep to grep if
                                                                                   the pattern matches words (where grep is typically faster).
                                                                                   In a multibyte locale, switch from fgrep to grep if either
                                                                                   (1) case is ignored (where grep is typically faster), or
                                                                                   (2) the pattern has an encoding error (where fgrep might not work).  */
                                                                                if (compile == Fcompile
                                                                                    && (MB_CUR_MAX <= 1
                                                                                        ? match_words
                                                                                        : match_icase || contains_encoding_error (keys, keycc)))
                                                                                  {
                                                                                    size_t new_keycc;
                                                                                    char *new_keys;
                                                                                    fgrep_to_grep_pattern (keycc, keys, &new_keycc, &new_keys);
                                                                            
                                                                            1. 1

                                                                              Yeah, GNU fgrep and GNU grep should be the same speed on the same input, assuming both inputs are just normal literals. That’s because grep should be able to detect the case of a simple literal and run as if it were fgrep. For your benchmark, what input are you searching? I can try and re-run the benchmark for you and explain the discrepancy if one indeed exists.

                                                                              Also, if you’re interested in more details on how grep is optimized, you might want to read my ripgrep article (or perhaps skim it, since it’s long, but don’t miss the parts about Teddy and memchr :-)).

                                                                              1. 1

                                                                                Thanks, that would be useful. You should be able to clone the repo and run it with the instructions at the top of fixed-strings.sh. Instead of all-benchmarks you can run grep-fixed-benchmark.

                                                                                Basically this will just download a file, make it 10x bigger, and run grep on it.

                                                                                The input is an alternation of literal strings, which I deem the “fgrep problem”. For me, grep is reliably faster, but I don’t understand why since I think they should be using the same engine either way. There is some attempt to convert fgrep to grep, but I guess I don’t understand under what conditions that happens.

                                                                                But even if the conversion were failing, I would expect fgrep to do better than grep on the problem it’s designed for!!!

                                                                                Actually this is my beef with Aho-Corasick too. I cannot think of an example where it’s going to be measurably faster or use less memory in practice than Thompson construction (at runtime, I don’t think you can measure compile time in a benchmark like this). Generally I think the difference between fgrep and grep comes down to engineering and not algorithms, which is why the results are inverted (conjecture).

                                                                                There are 13 keywords, described in the README:

                                                                                Matching this file:
                                                                                -rw-rw-r-- 1 andy andy 338M Nov 10 11:09 _tmp/all-10.txt
                                                                                
                                                                                Against 13 keywords:
                                                                                   for while continue break if fi then else elif case esac do done
                                                                                
                                                                                1. 1

                                                                                  All right… So I played with your benchmark script. I hacked at it until it ran without error, and here’s my output: https://gist.github.com/BurntSushi/1203d5b56112a8c1bacff5b19273d956

                                                                                  There are definitely some issues with how you’re benchmarking this here. Namely, you can’t give regexes to fgrep, and the BRE syntax used by standard grep doesn’t support alternation. So in order to test with fgrep, you need to pass the alternation via a file using the -f flag (or, as you’ll see, use the -e flag). For testing standard grep, you need to use egrep. For GNU grep specifically, grep, fgrep and egrep are all just different front ends. They share the same backend essentially. What this means is that your benchmarks aren’t actually measuring the speed of alternations, but rather, a very long single literal search.

                                                                                  I’m going to diverge from your benchmark script and just type the actual commands we want to run. It will be easier this way to verify results. So I started with just a regex alternation. The first two are ripgrep, where the first searches with a memory map and the latter uses standard read calls (the latter is what GNU grep does):

                                                                                  $ time rg 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt| wc -l
                                                                                  2969020
                                                                                  real    0.838
                                                                                  user    0.803
                                                                                  sys     0.033
                                                                                  maxmem  343 MB
                                                                                  faults  0
                                                                                  
                                                                                  $ time rg --no-mmap 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt| wc -l
                                                                                  2969020
                                                                                  
                                                                                  real    0.891
                                                                                  user    0.823
                                                                                  sys     0.067
                                                                                  maxmem  6 MB
                                                                                  faults  0
                                                                                  
                                                                                  $ time egrep 'for|while|continue|break|if|fi|then|else|elif|case|esac|do|done' /tmp/oil.txt| wc -l
                                                                                  2969001
                                                                                  real    1.223
                                                                                  user    1.111
                                                                                  sys     0.110
                                                                                  maxmem  5 MB
                                                                                  faults  0
                                                                                  

                                                                                  For fgrep, we can’t use an alternation. Instead, we have to pass each pattern individually since fgrep doesn’t take regexes of any kind as input:

                                                                                  $ time fgrep -e for -e while -e continue -e break -e if -e fi -e then -e else -e elif -e case -e esac -e do -e done /tmp/oil.txt| wc -l
                                                                                  2969001
                                                                                  real    1.515
                                                                                  user    1.427
                                                                                  sys     0.087
                                                                                  maxmem  5 MB
                                                                                  faults  0
                                                                                  

                                                                                  Now this is interesting. There is a repeatable performance difference here between egrep and fgrep. However, this doesn’t appear to have anything to do with fgrep per se, but rather, the method of input. Specifically, if we give each pattern individually to egrep, then it has the same slow down (and similarly for grep):

                                                                                  $ time egrep -e for -e while -e continue -e break -e if -e fi -e then -e else -e elif -e case -e esac -e do -e done /tmp/oil.txt| wc -l
                                                                                  2969001
                                                                                  real    1.537
                                                                                  user    1.411
                                                                                  sys     0.123
                                                                                  maxmem  5 MB
                                                                                  faults  0
                                                                                  

                                                                                  One of my hypotheses for the difference is match overhead. In particular, you’ve chosen a query that has a lot of matches. This tends to be a less common case for tools like grep, which can be used interactively. Large result sets aren’t as useful because we humans can’t look through everything, so we tend to run searches that produce small result sets. As a result, cases with a large number of results might not receive as much optimization love (see, for example, the silver searcher). I tested my hypothesis by tweaking your query such that it doesn’t match anything:

                                                                                  $ time egrep 'faor|wbhile|ccontinue|bdreak|iXf|f#i|tghen|ehlse|eilif|cjase|eksac|d!o|dmone' /tmp/oil.txt
                                                                                  real    1.099
                                                                                  user    1.035
                                                                                  sys     0.063
                                                                                  maxmem  5 MB
                                                                                  faults  0
                                                                                  
                                                                                  $ time egrep -e faor -e wbhile -e ccontinue -e bdreak -e iXf -e 'f#i' -e tghen -e ehlse -e eilif -e cjase -e eksac -e 'd!o' -e dmone /tmp/oil.txt
                                                                                  real    1.482
                                                                                  user    1.433
                                                                                  sys     0.047
                                                                                  maxmem  5 MB
                                                                                  faults  0
                                                                                  

                                                                                  So my hypothesis is wrong. The -e variant is still slower, even though I believe they are semantically the same search. As a fun side note, ripgrep does really well here, and doesn’t suffer from the same (seeming) performance bug:

                                                                                  $ time rg 'faor|wbhile|ccontinue|bdreak|iXf|f#i|tghen|ehlse|eilif|cjase|eksac|d!o|dmone' /tmp/oil.txt
                                                                                  real    0.098
                                                                                  user    0.068
                                                                                  sys     0.029
                                                                                  maxmem  343 MB
                                                                                  faults  0
                                                                                  
                                                                                  $ time rg -e faor -e wbhile -e ccontinue -e bdreak -e iXf -e 'f#i' -e tghen -e ehlse -e eilif -e cjase -e eksac -e 'd!o' -e dmone /tmp/oil.txt
                                                                                  real    0.102
                                                                                  user    0.078
                                                                                  sys     0.024
                                                                                  maxmem  342 MB
                                                                                  faults  0
                                                                                  

                                                                                  The reason why ripgrep is fast, is that for both searches, it isn’t actually using Aho-Corasick at all! Instead, it’s using a SIMD algorithm called Teddy. Teddy is part of the regex engine and is described in detail here: https://github.com/rust-lang/regex/blob/master/src/literal/teddy_ssse3/imp.rs — Teddy doesn’t do too well when there are a lot of matches, but when there are very few matches, like in the latter search above, it absolutely screams.

                                                                                  Looking at grep’s source code (current master), this comment in GEAcompile sheds some light:

                                                                                    /* For GNU regex, pass the patterns separately to detect errors like
                                                                                       "[\nallo\n]\n", where the patterns are "[", "allo" and "]", and
                                                                                       this should be a syntax error.  The same for backref, where the
                                                                                       backref should be local to each pattern.  */
                                                                                  

                                                                                  ripgrep’s default regex engine doesn’t support backreferences, so it doesn’t have to deal with this. In particular, ripgrep will take -e pat1 -e pat2 ... -e patN and just translate that to (?:pat1)|(?:pat2)|...|(?:patN) such that it’s one giant regex. Since it doesn’t have to worry about backreferences, it can do this. At this point, I’m out of gas w.r.t. investigation, and it’s not a stretch to say that these two invocations go through different code paths. Since fgrep cannot invoke the pat1|pat2|...|patN code path, you can’t quite do a fair comparison unless you make egrep use the same code path as fgrep for multiple literals. And when you do that, they come out to be even.
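
                                                                                  To make that translation concrete, here is a minimal sketch in Python (purely illustrative; ripgrep itself is written in Rust and its real handling is more involved) of turning a list of -e patterns into one combined alternation:

                                                                                    import re

                                                                                    # Hypothetical pattern list standing in for `-e pat1 -e pat2 ...`.
                                                                                    patterns = ["for", "while", "continue", "break"]

                                                                                    # Wrap each pattern in a non-capturing group and join with `|`. This is only
                                                                                    # safe because backreferences are not supported, as described above.
                                                                                    combined = re.compile("|".join("(?:%s)" % p for p in patterns))

                                                                                    print(combined.findall("for x in xs: break"))   # ['for', 'break']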

                                                                                  But even if the conversion were failing, I would expect fgrep to do better than grep on the problem it’s designed for!!!

                                                                                  I don’t know the origins of fgrep, but in today’s world, it’s primarily a UX thing and not a performance thing. Namely, it’s nice to be able to search for literal strings like Foo(. But if that’s interpreted as a regex, it’s probably a syntax error (unless you’re using BREs). So fgrep is really nice for that.

                                                                                  In a suboptimal implementation, it’s possible that fgrep will be faster than the equivalent egrep command. If that is indeed the case, then that just means that the egrep implementation isn’t falling back to normal substring search when the pattern it has been given is just a literal. If egrep is just “parse and feed to the regex engine,” then there will be cases in which fgrep is faster. You need an explicit analysis phase that ducks out to normal substring search, bypassing the regex engine.
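
                                                                                  As a rough sketch of that analysis phase (Python, with a toy line-based searcher; the function name is made up and this is not how any of the real tools are structured):

                                                                                    import re

                                                                                    def search_lines(pattern, lines):
                                                                                        # If escaping the pattern changes nothing, it contains no regex
                                                                                        # metacharacters, so plain substring search suffices and the regex
                                                                                        # engine can be bypassed entirely.
                                                                                        if re.escape(pattern) == pattern:
                                                                                            return [line for line in lines if pattern in line]
                                                                                        compiled = re.compile(pattern)
                                                                                        return [line for line in lines if compiled.search(line)]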

                                                                                  Actually this is my beef with Aho-Corasick too. I cannot think of an example where it’s going to be measurably faster or use less memory in practice than Thompson construction (at runtime, I don’t think you can measure compile time in a benchmark like this).

                                                                                  The problem with this statement is that “Thompson construction” isn’t really enough to go on. Are you referring to the NFA simulation of the Thompson construction? A lazy DFA of the Thompson construction? An eagerly compiled DFA of the Thompson construction? All of these things actually exist in practice.

                                                                                  Generally I think the difference between fgrep and grep comes down to engineering and not algorithms, which is why the results are inverted (conjecture).

                                                                                  I think this is too much of a simplification. As someone who has spent a lot of time building these types of tools, the real answer is very solidly “both.” :-)

                                                                                  1. 1

                                                                                    What this means is that your benchmarks aren’t actually measuring the speed of alternations, but rather, a very long single literal search.

                                                                                    That’s not true. As I just wrote, grep, fgrep, and ripgrep all return 2969020 results. BRE supports alternation with \|. You don’t need a flag to get alternation with fgrep either.

                                                                                    $ { echo foo; echo bar; }  | grep 'foo\|bar' 
                                                                                    foo
                                                                                    bar
                                                                                    
                                                                                    $ { echo foo; echo bar; }  | fgrep $'foo\nbar' 
                                                                                    foo
                                                                                    bar
                                                                                    
                                                                                    $ grep --version
                                                                                    grep (GNU grep) 2.25
                                                                                    Copyright (C) 2016 Free Software Foundation, Inc.
                                                                                    
                                                                                    1. 1

                                                                                      Cute. I guess I was misled by this page then, which also indicates that BREs don’t have alternations: http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap09.html

                                                                                      I forgot that \n could be used to delimit patterns in fgrep. But it’s still taking the same code path AFAIK inside GNU grep when compared with -e foo -e bar. The code path is different from foo|bar.

                                                                                    2. 1

                                                                                      After reading this, I’m still confused by why fgrep is slower than egrep.

                                                                                      It should be either

                                                                                      • The exact same speed, because it’s just a front end for the same engine.
                                                                                      • Faster, because it’s solving a “simpler” problem and can use a faster algorithm. The whole thing that started this discussion was this:

                                                                                      An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                                                                                      I still don’t buy this! The fixed string alternation is not really easier. (I agree that there might be a small difference in theory, but I don’t think it matters in practice.)

                                                                                      If you run your aho-corasick algorithm on this data set and it’s faster than the RE2 benchmark, that might convince me! (because RE2 is also counting all matches and not all lines, and it’s interpreted not compiled).

                                                                                      I think this is too much of a simplification. As someone who has spent a lot of time building these types of tools, the real answer is very solidly “both.” :-)

                                                                                      Maybe? But your post didn’t show that?

                                                                                      The point that the re2c and RE2 benchmarks aren’t measuring the same thing is well taken, but that’s separate from the grep/fgrep issue.

                                                                                      Are you referring to the NFA simulation of the Thompson construction? A lazy DFA of the Thompson construction? An eagerly compiled DFA of the Thompson construction? All of these things actually exist in practice.

                                                                                      The eagerly compiled DFA, because I don’t think that’s hard to create in the alternation of constant strings case.

                                                                                      I think you always get a DFA that’s going to run as fast as Aho-Corasick in that case. The NFA is very simple, and the DFA is very simple. I guess the test I should do is to run the simple powerset/subset construction on that case. I would be surprised if it “blows up” in this particular case (i.e. there are no repetitions, so the graph is very simple).

                                                                                      Every alternation should look the same from the engine’s point of view, e.g. foo|bar or spam|eggs|ham are the same problem in terms of creating an NFA, converting to DFA, and optimizing (if it’s necessary in that case).
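
                                                                                      As a quick sanity check of that intuition, here is a small illustrative Python sketch: determinizing an alternation of plain literals (in the anchored case) just yields a trie, so the state count tracks the total number of pattern characters rather than exploding. The unanchored case adds the Aho-Corasick-style failure transitions, but still nothing exponential.

                                                                                        KEYWORDS = ["for", "while", "continue", "break", "if", "fi", "then",
                                                                                                    "else", "elif", "case", "esac", "do", "done"]

                                                                                        def build_trie_dfa(words):
                                                                                            states = [{}]          # list of transition tables; state 0 is the start state
                                                                                            accepting = set()
                                                                                            for word in words:
                                                                                                s = 0
                                                                                                for ch in word:
                                                                                                    if ch not in states[s]:
                                                                                                        states[s][ch] = len(states)
                                                                                                        states.append({})
                                                                                                    s = states[s][ch]
                                                                                                accepting.add(s)
                                                                                            return states, accepting

                                                                                        states, accepting = build_trie_dfa(KEYWORDS)
                                                                                        print(len(states), sum(len(w) for w in KEYWORDS))   # 45 states for 51 pattern characters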

                                                                                      So do you think Aho-Corasick is going to do anything on this benchmark? To a first approximation, it seems like a different way to build a DFA. As you point out, you don’t even use it for this case! Teddy is apparently better.

                                                                                      That’s essentially my point. I can’t think of any problems where Aho-Corasick will be better than an automata-based regex engine (“Thompson’s construction” in my terminology. They are all based on that.)

                                                                                      Can you think of any examples of such problems?

                                                                                      1. 1

                                                                                        After reading this, I’m still confused by why fgrep is slower than egrep.

                                                                                        Because that’s not the most precise statement we can make given the data. I carved the problem down to egrep 'pat1|pat2|...|patN' vs egrep -e pat1 -e pat2 ... -e patN in my previous comment, where the former is faster than the latter, and, crucially, the latter matches the speed of fgrep. So it’s not about “fgrep vs egrep,” but rather, a distinction in the code path. In other words, it looks like an unnecessary implementation quirk. But this is not a complete explanation. It looks like you’ll need to either ask someone who’s familiar with GNU grep automata construction, or dig into the code yourself. :-)

                                                                                        An alternation of literal strings compiled using Thompson’s construction is generally going to incur quite a bit more overhead than a standard Aho-Corasick implementation because Thompson’s construction supports not just an alternation of literals, but an alternation of arbitrary regular expressions.

                                                                                        I still don’t buy this! The fixed string alternation is not really easier. (I agree that there might be a small difference in theory, but I don’t think it matters in practice.)

                                                                                        If you run your aho-corasick algorithm on this data set and it’s faster than the re2c benchmark, that might convince me! (because re2c is also counting all matches and not all lines).

                                                                                        I think we are unfortunately approaching diminishing returns in this discussion. We’re getting hung up on terminology. In my comment above, I was using “Thompson’s construction” as shorthand for “the NFA simulation of Thompson’s construction,” which roughly matches the algorithm in Thompson’s original paper. That is going to be very obviously and very trivially slower than even the NFA formulation of Aho-Corasick, and certainly much slower than the DFA formulation of Aho-Corasick.

                                                                                        The above paragraph is 100% about implementation details. It has nothing to do with the theory.
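
                                                                                        To make the per-byte cost concrete, here are the two inner loops side by side (an illustrative Python sketch; the transition tables are assumed to be prebuilt): the NFA simulation drags a whole set of states across every input byte, while a DFA, Aho-Corasick or otherwise, does a single table lookup per byte.

                                                                                          def nfa_run(text, delta, start_states):
                                                                                              # delta maps (state, byte) -> set of next states; per-byte cost grows
                                                                                              # with the number of live states.
                                                                                              states = set(start_states)
                                                                                              for byte in text:
                                                                                                  nxt = set()
                                                                                                  for s in states:
                                                                                                      nxt |= delta.get((s, byte), set())
                                                                                                  states = nxt
                                                                                              return states

                                                                                          def dfa_run(text, table, start_state):
                                                                                              # table[state][byte] -> next state; constant work per byte.
                                                                                              s = start_state
                                                                                              for byte in text:
                                                                                                  s = table[s][byte]
                                                                                              return s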

                                                                                        I think this is too much of a simplification. As someone who has spent a lot of time building these types of tools, the real answer is very solidly “both.” :-)

                                                                                        Maybe? But your post didn’t show that?

                                                                                        I’m not in a debate with you here, and I’m not trying to convince you of anything. I’m just trying to offer up my knowledge and experience. Do with it as you will.

                                                                                        I think if you read my blog post about ripgrep, it should pretty solidly convince you that both algorithms and engineering are required. ripgrep, to a first approximation, uses the same algorithm as GNU grep. That is, both of them fundamentally use a lazy DFA as their work horse. But to a second approximation, the algorithms surrounding the lazy DFA, and in particular, the analysis performed on the regex, are very different. e.g., GNU grep does not use frequency based substring search and it does not have a vectorized multi-pattern matcher. Those are algorithmic differences. They aren’t huge, and someone could add them to GNU grep if they wanted to. But they are still differences. One other algorithmic difference has to do with how large Unicode character classes are handled. ripgrep is much faster than GNU grep in that space, and I suspect it largely has to do with how GNU grep implements POSIX locale support. But I’m not certain.
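
                                                                                        For what “frequency based substring search” means in practice, here is a tiny illustrative Python sketch (the frequency numbers and function names are invented): pick the rarest byte of the needle as the anchor for the scan loop, so it rarely stops on a false candidate.

                                                                                          # Toy background-frequency table; a real implementation derives one from a large corpus.
                                                                                          ASSUMED_FREQ = {" ": 255, "e": 240, "t": 230, "a": 220, "z": 5, "q": 4}

                                                                                          def rarest_offset(needle):
                                                                                              return min(range(len(needle)), key=lambda i: ASSUMED_FREQ.get(needle[i], 100))

                                                                                          def frequency_find(haystack, needle):
                                                                                              off = rarest_offset(needle)
                                                                                              start = 0
                                                                                              while True:
                                                                                                  i = haystack.find(needle[off], start)   # cheap single-character scan
                                                                                                  if i == -1:
                                                                                                      return -1
                                                                                                  candidate = i - off
                                                                                                  if candidate >= 0 and haystack[candidate:candidate + len(needle)] == needle:
                                                                                                      return candidate
                                                                                                  start = i + 1

                                                                                          print(frequency_find("the lazy dog", "lazy"))   # 4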

                                                                                        The engineering aspect of these tools is also quite large, but not for any particularly interesting reasons. You just need to make sure you’re doing as little work as possible per line (or per byte) of input. Naive implementations (e.g., “read a file line by line and apply a regex to each line”) will invariably do much more work than is necessary. Rising above naive implementations requires engineering and smarter choices about how you find your matches.

                                                                                        Are you referring to the NFA simulation of the Thompson construction? A lazy DFA of the Thompson construction? An eagerly compiled DFA of the Thompson construction? All of these things actually exist in practice.

                                                                                        The eagerly compiled DFA, because I don’t think that’s hard to create in the alternation of constant strings case.

                                                                                        OK, but you’ll wind up needing to pick a cutoff point where you don’t eagerly compile the DFA because it will use up too much memory. You don’t need to pick this cutoff point, probably, in the context of re2c.

                                                                                        I think you always get a DFA that’s going to run as fast as Aho-Corasick in that case.

                                                                                        I think I agree. At least, I can’t see any specific reason for a performance difference.

                                                                                        What algorithm does rust/regex use for DFA minimization?

                                                                                        It doesn’t do DFA minimization. Neither does RE2. It’s too expensive. Neither RE2 nor rust/regex ever eagerly computes a DFA for general regexes. It’s always done lazily. rust/regex has an optimization pass that RE2 doesn’t have, where, if it detects an alternation of literals, it will use Aho-Corasick specifically because Aho-Corasick is eagerly built and is therefore faster than the lazy DFA. The heuristics used for this currently are not optimal, but the code is there and works in a lot of cases.

                                                                                        The NFA is very simple, and the DFA is very simple. I have never implemented that algorithm but I would be surprised if it “blows up” in this particular case.

                                                                                        I would be too, in the sense that “blow up” means “exponentially blow up.” Other than that, I think I already characterized the memory usage of an Aho-Corasick DFA. In a typical table based DFA, every state uses 1KB of memory (256 possible byte transitions times a 4-byte state identifier). That’s a lot.

                                                                                        Every alternation should look the same from the engine’s point of view, e.g. foo|bar or spam|eggs|ham are the same problem in terms of creating an NFA, converting to DFA, and optimizing (if it’s necessary in that case).

                                                                                        So do you think Aho-Corasick is going to do anything on this benchmark? To a first approximation, it seems like a different way to build a DFA. As you point out, you don’t even use it for this case! Teddy is apparently better.

                                                                                        That’s essentially my point. I can’t think of any problems where Aho-Corasick will be better than an automata-based regex engine (“Thompson’s construction” in my terminology. They are all based on that.)

                                                                                        I wouldn’t suspect so either. But if my regex engine never bothers with the eager DFA construction or minimization after applying the Thompson construction, then it makes sense to just use Aho-Corasick when possible. In order for me to be convinced to do otherwise, I think I’d need to be convinced that building and minimizing a DFA by starting with Thompson’s construction won’t be noticeably slower than using the normal Aho-Corasick construction algorithm, and then converting that into a table based DFA directly. Intuitively speaking, I would be surprised if both construction algorithms ran at similar speeds. In your use cases, this distinction is probably wildly uninteresting, since this is all paid for when you compile your program, and therefore you have a very high tolerance. In that case, you probably don’t care about Aho-Corasick, and for implementation simplicity, you’d just stick with Thompson’s construction -> DFA -> minimization. That’s not true in my case.

                                                                                        1. 1

                                                                                          FWIW I constructed a benchmark to match 1000 - 6000 strings “in parallel” with re2c, RE2, and ripgrep:

                                                                                          https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/fixed-strings.sh#L328

                                                                                          ripgrep (and grep) are faster than re2c for n = 100 just like they are for n = 13 (which was surprising to me since they’re not compilers).

                                                                                          But in the range of 1000 - 6000 strings, re2c match time is essentially constant (< 3 seconds), while ripgrep increases linearly to 13+ seconds. RE2 also increases linearly but not as much.

                                                                                          The re2c compile time is linear as well, and the generated code size is linear, with a small constant factor. For 6000 strings, it only uses 213 KiB of code (representing the DFA).

                                                                                          I’m not pointing this out because I think it’s important for ripgrep to handle. GNU grep blows up somewhere around 400 strings and I gave up running it.

                                                                                          But it is interesting in the sense that this is precisely the Aho-Corasick problem – an alternation of constant strings. (modulo the overlapping matches issue)

                                                                                          My point was that “Thompson’s construction” can handle the Aho-Corasick problem, and do very well at it. It’s possible that I’m underestimating what re2c is doing after that, but considering that the compile time is low, I don’t think anything very advanced is going on.

                                                                                          If I have time in the future I’ll explore that further, although actually I’m more interested in derivatives, so I may never get to it. If I do experiment with derivatives, this would be a great test case.


                                                                                          In theory, I would expect DFAs to be able to match in constant time with respect to the number of strings, not linear time. But re2c is the only implementation that gets close to that!

                                                                                          I would have thought RE2 would too, so I suppose it’s harder than I thought. RE2 also doesn’t want to use more than 8 MB of memory for the DFA by default, although it’s easy to bump up.

                                                                                          I could say more about this, but I’ll probably blog about this in the future instead… I’ve been meaning to write a post/demo about re2c for a long time. If you have any comments on this feel free to mail me at andy@oilshell.org instead of this deep lobste.rs thread!

                                                                                          1. 1

                                                                                            Aye. Just some small quick notes:

                                                                                            • In RE2’s case, it will never use Aho-Corasick and it will never use an eagerly compiled DFA.
                                                                                            • In ripgrep’s case, it will use Aho-Corasick for very small alternations, but falls back to the same lazy DFA approach that RE2 uses beyond that. This is a failing of the current implementation. You can tweak the DFA size limit with the --dfa-size-limit flag.
                                                                                            • In both of the aforementioned cases, once the DFA exceeds the internal limit, it starts thrashing. If it thrashes enough, it falls back to the NFA simulation of Thompson’s construction.
                                                                                            • “Thompson’s construction” can only handle the Aho-Corasick problem well if you eagerly compile the DFA. 6,000 strings might not be enough to really test the differences between the traditional Aho-Corasick compilation algorithm and Thompson’s construction -> DFA -> minimization. /usr/share/dict/words for instance is over 100,000 entries on my system.

                                                                                            Just as a quick example, here is the time difference in compilation between NFA Aho-Corasick (traditional) and DFA Aho-Corasick (“advanced”) using an example program from my Aho-Corasick library:

                                                                                            $ cargo build --release --example dict-search
                                                                                            
                                                                                            $ time ./target/release/examples/dict-search -d /usr/share/dict/words /dev/null
                                                                                            pattern,start,end
                                                                                            
                                                                                            real    0.109
                                                                                            user    0.089
                                                                                            sys     0.020
                                                                                            maxmem  38 MB
                                                                                            faults  0
                                                                                            
                                                                                            $ time ./target/release/examples/dict-search --full -d /usr/share/dict/words /dev/null
                                                                                            pattern,start,end
                                                                                            
                                                                                            real    1.018
                                                                                            user    0.903
                                                                                            sys     0.113
                                                                                            maxmem  321 MB
                                                                                            faults  0
                                                                                            

                                                                                            Note also the difference in peak memory usage. :-)

                                                                                      2. 1

                                                                                        I think we might be getting hung up on “Thompson construction” terminology [1], so let me put it another way:

                                                                                        I don’t think Aho-Corasick is currently used to speed up string search problems.

                                                                                        GNU grep doesn’t appear to use it, and ripgrep doesn’t use it either.

                                                                                        The automata-based regex techniques seem to have “subsumed it” (at least for any problems I can think of).

                                                                                        In your article I see this, which seems to make sense, since you have to go beyond automata to get further speedup!

                                                                                        Handling the case of multiple literals (e.g., foo|bar) is just as important. GNU grep uses a little known algorithm similar to Commentz-Walter for searching multiple patterns. In short, Commentz-Walter is what you get when you merge Aho-Corasick with Boyer-Moore: a skip table with a reverse automaton.

                                                                                        (However I’m not seeing the results of this in practice either! But that is probably because I haven’t looked hard enough.)

                                                                                        [1] Which I think of from a user’s perspective as “automata-based regex engine”, but you’re thinking of from an implementer’s perspective.

                                                                                        1. 1

                                                                                          GNU grep doesn’t appear to use [Aho-Corasick], and ripgrep doesn’t use it either.

                                                                                          Both do, absolutely. For GNU grep, see: http://git.savannah.gnu.org/cgit/grep.git/tree/src/kwset.c

                                                                                          For ripgrep, given that I wrote every line of string searching code in it, I can tell you that it will use Aho-Corasick. Teddy is an optimization that, when applicable, will best Aho-Corasick. But it is not always applicable. For example, it doesn’t work well when you have a large number of literals. In those cases, ripgrep falls back to Aho-Corasick.

                                                                                          With respect to Commentz-Walter, I do not believe GNU grep uses it any more. I think it did at one time.

                                                                                          If GNU grep and/or ripgrep compiled regexes eagerly to a DFA and then minimized them, then there might not be any difference between using that machine to search and Aho-Corasick to search, other than, conjecturally, the time it takes to build the machine. But since both use lazy DFAs built directly from Thompson’s NFA construction at match time, there is no infrastructure for eagerly building out and minimizing the DFA. So we both just use Aho-Corasick instead.
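
                                                                                          For readers following along, here is a minimal illustrative Python sketch of the lazy DFA idea both tools rely on (the real engines cache into preallocated tables and evict states under memory pressure): determinized states are built on demand from sets of NFA states and memoized, so you only pay for the part of the powerset the input actually visits.

                                                                                            def make_lazy_dfa(nfa_delta, start_states):
                                                                                                # nfa_delta maps (nfa_state, byte) -> set of next NFA states.
                                                                                                cache = {}                              # DFA state (a frozenset) -> {byte: next DFA state}
                                                                                                def step(dfa_state, byte):
                                                                                                    row = cache.setdefault(dfa_state, {})
                                                                                                    if byte not in row:                 # determinize this transition on first use
                                                                                                        nxt = set()
                                                                                                        for s in dfa_state:
                                                                                                            nxt |= nfa_delta.get((s, byte), set())
                                                                                                        row[byte] = frozenset(nxt)
                                                                                                    return row[byte]
                                                                                                return frozenset(start_states), step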

                                                                                      3. 1

                                                                                        In my previous comment, one other interesting thing stood out to me: GNU grep’s match count is slightly lower than ripgrep’s. It looks like this is because GNU grep, for some reason, thinks that some of the matched lines contain invalid UTF-8. This is easier to see on a smaller example:

                                                                                        $ time egrep 's a very, very long f' oil.txt | wc -l
                                                                                        51
                                                                                        
                                                                                        $ time rg 's a very, very long f' oil.txt | wc -l
                                                                                        60
                                                                                        

                                                                                        Specifically, if you look at the output of egrep, it actually stops printing results and says Binary file oil.txt matches. In my case, it looks like GNU grep thinks it’s a binary file because it contains invalid UTF-8 in places and my locale is set to en_US.UTF-8. I can work around this by either setting my locale to C (via LC_ALL=C), or by enabling --text.

                                                                                        This doesn’t seem to impact performance though.

                                                                                        1. 2

                                                                                          Yes, LC_ALL=C is essential, otherwise on my machine GNU grep stops after 10% of the file!!!

                                                                                          I spent a while figuring that out! Honestly I think it is a bug I should file, but I haven’t looked into it further. The test file is a concatenation of the same file 10 times, and a specific character sequence in the last part of the file makes grep stop!

                                                                                          See ESSENTIAL BUG FIX comment here.

                                                                                          https://github.com/oilshell/blog-code/blob/master/fgrep-problem-benchmarks/fixed-strings.sh

                                                                                          I just re-ran the verification and grep, fgrep, and ripgrep are all returning 2969020 results, with the same timings as reported in the README.

                                                                                          1. 1

                                                                                            I don’t think it’s a bug. It’s intended behavior.

                                                                                            My output from the benchmark script is very different: https://gist.github.com/BurntSushi/1203d5b56112a8c1bacff5b19273d956

                                                                                            1. 1

                                                                                              I dunno, why would you ever want grep to stop in the middle of a file, no matter what other text is there? Suppose I have:

                                                                                              grep foo

                                                                                              on a file that looks like this:

                                                                                              foo
                                                                                              <unintelligible poorly encoded garbage>
                                                                                              foo
                                                                                              

                                                                                              Shouldn’t it find both foos? Or I guess it’s saying it doesn’t know that the second foo is even a foo after the encoding errors. That is plausible (for non-UTF8 encodings, since UTF-8 is designed to be able to resynchronize). But at least it should print an error rather than stopping silently!

                                                                                              1. 1

                                                                                                It does print an error. Or at least, it should. It says “Binary file foo matches”:

                                                                                                  if (!out_quiet && (encoding_error_output
                                                                                                                     || (0 <= nlines_first_null && nlines_first_null < nlines)))
                                                                                                    {
                                                                                                      printf_errno (_("Binary file %s matches\n"), input_filename ());
                                                                                                      if (line_buffered)
                                                                                                        fflush_errno ();
                                                                                                    }
                                                                                                

                                                                                                I did some digging and found this: http://lists.gnu.org/archive/html/bug-grep/2016-02/msg00037.html

                                                                                  2. 3

                                                                                    Complementary. Here’s the “original paper” you’re looking for: String Matching: An Aid to Bibliographic Search. It learns from Thompson’s method and improves its performance and utility in exchange for some features (no backtracking).

                                                                                  1. 9

                                                                                    So, uh, what’s better?

                                                                                    1. 15
                                                                                      • composable GUIs (like the alto & modern smalltalk environments)
                                                                                      • notebook interfaces (like jupyter & mathematica)
                                                                                      • literate programming interfaces (like swyft / the canon cat)
                                                                                      • plan9
                                                                                      • menuet
                                                                                      • NeWS
                                                                                      • language-based systems like interim
                                                                                      • modern unix shells like zsh
                                                                                      • borderline-obscure stuff like zigzag, sometimes

                                                                                      And, of course, I’ve been working on a prototype of the system I pitched last year.

                                                                                      The thing about interfaces is, if you put what you’re already accustomed to out of your mind, you start to think of alternatives pretty quickly – and many of them are better for certain tasks than what you already use.

                                                                                      For my next book I’m planning to write a survey of alternatives to the WIMP paradigm that can be easily run or emulated by readers. Unfortunately, I’m going to need to do plenty of research if I’m to seriously discuss the internals of these systems, since a lot of them are language-based systems built around languages I don’t know very well (like lisp or forth) or at all (like holy c or oberon).

                                                                                      1. 5

                                                                                          I’m interested in your research. Is there any place where I can keep track of it?

                                                                                        1. 4

                                                                                          I’ve barely started researching for the book in question, so I’m not sure to what extent chapters & other content will be made available before it’s finished.

                                                                                          The last book was mostly compiled from stuff I had already published on Medium. If you follow me there, you’ll probably get at least some of the material intended for the next one – maybe even rough drafts for chapters. Also, a lot of the chapters from the last book were inspired by or adapted from discussions I’ve had on mastodon or SSB, & this will probably be true of the next one: if you follow me on mastodon, no doubt you’ll get a preview of some of the ideas I’m playing with.

                                                                                          If there’s enough interest, I might make a point of posting about the systems I’m researching on a more regular basis. Those posts will probably end up on Medium too.

                                                                                          1. 2

                                                                                            Also, I’m going to be posting resources here as I find them during my research.

                                                                                            1. 2

                                                                                              Thank you for this. I’m going to follow your work on it.

                                                                                        2. 7

                                                                                          I’d assume the biggest problem is overlapping windows. I’ve been using tiling window managers since 2012 and I would not go back. If you look at all the newer mobile operating systems (iOS, Android, that failed Windows 8/10 UI thing), they’re all either single app at a time or, at most, split screen.

                                                                                          I guess a second thing is steering people away from mouse dependence. Hotkeys should be easily discoverable and actively encouraged. A higher learning curve at first can mean faster operation later on. Great example: Autozone. Watch staff look up a part today. They do a lot of clicking and switching back and forth. The old setup was all terminal based and had the same searching/information. I think the new GUI still has a lot of keyboard shortcuts, but very few people I’ve watched use them.

                                                                                          1. 5

                                                                                            Overlapping windows are pretty pointless when they can’t be made to work together in interesting ways. Drag & drop between unrelated applications is the minimum interoperability to justify even supporting overlapping windows in my eyes (and support for that is pretty rare), but I’d be all about overlapping windows if freeform composition with gestures was a standard part of the toolkit. Even then, tiling is preferable on large displays once we’ve settled on an arrangement & interconnections.

                                                                                            Support for tiling and pseudo-tiling (and quick-switching mechanisms) is something I don’t have a problem with in modern desktops, though. Even Microsoft and Apple have been pretty quick to jump on that bandwagon.

                                                                                            1. 5

                                                                                              Tiling windows seems like a funny point since we’re complaining about UIs treating users as stupid. The first version of Windows was tiling only because users were too stupid to handle overlap and might lose track of a hidden window, but eventually it was decided that users could be trusted with the responsibility of arranging things on their own. Round the circle we go.

                                                                                              1. 1

                                                                                                The first version of Windows was tiling not out of contempt for users, but to avoid a lawsuit from Apple (who did get overlapping windows working because they thought the Alto had it when it didn’t really). Also, a tiling window system is easier to write than an overlapping one.

                                                                                                1. 1

                                                                                                  Alas, I’m not sure of the reference, but they apparently had the feature, tested it, users were confused, and it was pulled.

                                                                                              2. 4

                                                                                                I think we’re slowly moving away from the mouse-oriented approach, for better or worse. I’d personally wish for a keyboard-oriented desktop UI, but judging by how much Microsoft and Apple are striving to unite all their systems software-wise (e.g. Windows Phone and Xbox One both running Windows variants, or audioOS etc. being iOS-based), we might expect a move towards touchscreen-oriented UIs on desktops instead. (Although I guess that goes as far back as GNOME 3.) On the other hand, there exist a minority of mouse-oriented UI advocates, such as Rob Pike and his followers. He argues that the mouse is more intuitive and faster, and that the problem lies in bad UI design instead.

                                                                                                1. 6

                                                                                                  On the other hand, there exist a minority of mouse-oriented UI advocates, such as Rob Pike and his followers. He argues that mouse is more intuitive and faster, and the problem lies in bad UI design instead.

                                                                                                  I still think that acme’s mouse interface for common editing operations is better than keyboard shortcuts (see http://acme.cat-v.org/mouse). The way the mouse is used in most other systems is bad, though. The Windows way is nearly unusable with the popup-menu thing, and X11 is only saved by the primary-selection insert with the middle mouse button ;)

                                                                                              3. 4

                                                                                                Presumably some form of programming-is-first-class system like the Alto, where everything you can click is also explorable (“view source” is built in) and extendable via SmallTalk. On the one hand I’m a bit sceptical and think not many users will really make use of this, on the other hand if you see how far some regular (i.e. non-programmer) users take, say, Excel and VBA scripting, having this programmability available pervasively by default in every application would definitely empower users much more than “closed” systems like the original Mac do.

                                                                                                I have no idea how many people use AppleScript, which ostensibly brings pervasive programmability to the Mac. It wasn’t part of the original Mac OS and is about programming scripts “on the outside” onto or “against” existing applications rather than full-fledged inspection and modification of internals “inside” those same applications.

                                                                                                1. 3

                                                                                                  The only “modern” OS I know that makes use of hypertext self-documentation is… TempleOS. It also blurs the line between “using” and “programming”, like the Alto and LispMs. It’s not entirely user-friendly, but I guess it fits the bill.

                                                                                                2. 8

                                                                                                  Ask not the polemic for what is better; it is merely the shallow well in which the author steeps their discontent.

                                                                                                1. 2

                                                                                                  On one hand: I agree that DNS-over-HTTPS is a silly and convoluted solution.

                                                                                                  On the other hand: DNS-over-TLS is a bad solution for the reason pointed out: it lives on its own port.

Question: Why do we need ports any more at all? It seems like if we didn’t have dedicated port numbers, but instead referred to resources by subdomain or subdirectory beneath the main hostname, then all traffic would be indistinguishable when secured by TLS.

                                                                                                  1. 4

Could it have been possible for DNS-over-TLS to use 443 and make the server able to route DNS and HTTP requests appropriately? I’m not very knowledgeable about TLS. From what I understand it’s just a transport layer, so a server could simply read the beginning of an incoming message and easily detect whether it is an HTTP or DNS header?

                                                                                                    1. 9

                                                                                                      Yes, much like http2 works. It complicates the TLS connection because now it passes a hint about the service it wants, but that bridge is already crossed.
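
To make the idea concrete, here is a minimal sketch (not how any particular resolver actually does it) of dispatching two services on one TLS port by ALPN: “h2” and “dot” are the ALPN IDs registered for HTTP/2 and DNS-over-TLS, while the certificate paths and the serveDNS/serveHTTP handlers are hypothetical placeholders.

```go
// Sketch only: dispatch connections arriving on one TLS port by the ALPN
// protocol negotiated in the handshake. "h2" (HTTP/2) and "dot" (DNS-over-TLS)
// are registered ALPN IDs; serveHTTP/serveDNS are hypothetical placeholders.
package main

import (
	"crypto/tls"
	"log"
	"net"
)

func main() {
	cert, err := tls.LoadX509KeyPair("cert.pem", "key.pem") // assumed cert files
	if err != nil {
		log.Fatal(err)
	}
	cfg := &tls.Config{
		Certificates: []tls.Certificate{cert},
		NextProtos:   []string{"h2", "dot"}, // advertise both services on :443
	}
	ln, err := tls.Listen("tcp", ":443", cfg)
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		go func(c net.Conn) {
			tc := c.(*tls.Conn)
			if err := tc.Handshake(); err != nil {
				c.Close()
				return
			}
			// The client's ClientHello carried the "hint about the service".
			switch tc.ConnectionState().NegotiatedProtocol {
			case "dot":
				serveDNS(tc) // hypothetical DNS-over-TLS handler
			default:
				serveHTTP(tc) // hypothetical HTTP handler
			}
		}(conn)
	}
}

func serveDNS(c net.Conn)  { c.Close() /* placeholder */ }
func serveHTTP(c net.Conn) { c.Close() /* placeholder */ }
```

The cost, as noted above, is that the client has to name the protocol it wants during the handshake, which is exactly the bridge HTTP/2 already crossed.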

                                                                                                    2. 4

IP addresses allow two arbitrary computers to exchange information [1], whereas ports allow two arbitrary programs (or processes) to exchange information. Also, it’s TCP and UDP that have ports. There are other protocols that ride on top of IP (not that anyone cares anymore).

                                                                                                      [1] Well, in theory anyway, NAT breaks that to some degree.

                                                                                                      1. 3

Ports are kinda central to packet routing as it has been deployed, if my understanding is correct.

                                                                                                        1. 5

                                                                                                          You need the concept of ports to route packets to the appropriate process, certainly. However, with DNS SRV records, you don’t need globally-agreed-upon port assignments (a la “HTTP goes to port 80”). You could assign arbitrary ports to services and direct clients accordingly with SRV.

                                                                                                          Support for this is very incomplete (e.g. browsers go to port 80/443 on the A/AAAA record for a domain rather than querying for SRVs), but the infrastructure is in place.
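
For illustration, a minimal client-side sketch using Go’s standard net.LookupSRV; the “imaps” service and “example.com” domain are placeholders, not a real deployment.

```go
// Sketch: discover a service's host and port from an SRV record instead of a
// globally agreed-upon well-known port. The service/domain names are placeholders.
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	// Queries _imaps._tcp.example.com and returns the SRV targets.
	_, addrs, err := net.LookupSRV("imaps", "tcp", "example.com")
	if err != nil {
		log.Fatal(err)
	}
	for _, srv := range addrs {
		// Each record tells the client where, and on which port, to connect.
		fmt.Printf("connect to %s:%d (priority %d, weight %d)\n",
			srv.Target, srv.Port, srv.Priority, srv.Weight)
	}
}
```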

                                                                                                          1. 5

                                                                                                            On what port do I send the DNS query for the SRV record of my DNS server?

                                                                                                            1. 1

                                                                                                              Obviously, you look up an SRV record to determine which port DNS is served over. ;)

                                                                                                              I don’t know if anyone has thought about the bootstrapping problem. In theory, you could deal with it the same way you already bootstrap your DNS (DHCP or including the port with the IP address in static configurations), but I don’t know if this is actually possible.

                                                                                                            2. 2

                                                                                                              You need the concept of ports to route packets to the appropriate process

                                                                                                              Unless we assign an IP address to every web facing process.

                                                                                                          2. 1

                                                                                                            Problem: both solutions to private DNS queries have downsides related to the DNS protocol fundamentally having failed to envision a need for privacy

                                                                                                            Solution: radically overhaul the transport layer by replacing both TCP and UDP with something portless?

In this case the suggested cure is worse than the disease, in terms of the sheer amount of work it would require and the hardware and software that would have to be completely replaced.

                                                                                                            1. 2

                                                                                                              I don’t think DNS is the right place to do privacy. If I’m on someone’s network, he can see what IP addresses I’m talking to. I can hide my DNS traffic, but he still gets to see the IP addresses I ultimately end up contacting.

                                                                                                              Trying to add privacy at the DNS stage is doing it at the wrong layer. If I want privacy, I need it at the IP layer.

                                                                                                              1. 4

                                                                                                                Assuming that looking up an A record and making a connection to that IP is the only thing DNS is used for.

                                                                                                                1. 3

Think of CDN or “big websites” traffic. If you hit Google, Amazon, or Cloudflare datacenters, nobody will be able to tell if you were reaching google.com, amazon.com, cloudflare.com, or any of their customers.

Currently, this is leaking through SNI and DNS. DoH and Encrypted SNI (ESNI) will improve on the status quo.

                                                                                                                  1. 2

                                                                                                                    And totally screws small sites. Or is the end game centralization of all web sites to a few hosts to “protect” the privacy of users?

                                                                                                                    1. 2

You can also self-host more than one domain on your own server. In fact, I do too. It’s just a smaller set :-)

                                                                                                                      1. 1

                                                                                                                        End game would be VPNs or Tor.

                                                                                                                      2. 2

Is that really true? I thought request/response metadata and timing analysis could tell them who we were connecting to.

                                                                                                                        1. 2

                                                                                                                          Depends who they are. I’m not going to do a full traffic dump, then try to correlate packet timings to discover whether you were loading gmail or facebook. But tcpdump port 53 is something I’ve actually done to discover what’s talking to where.

                                                                                                                          1. 1

True. Maybe ESNI and DoH are only increasing the required work. Needs more research?

                                                                                                                            1. 1

Probably, to be on the safe side. I’d run it by experts in correlation analysis of network traffic. They might already have something for it.

                                                                                                                          2. 2

nobody will be able to tell if you were reaching google.com, amazon.com, cloudflare.com, or any of their customers.

                                                                                                                            except for GOOGL, AMZN, et al. which will happily give away your data, without even flinching a bit.

                                                                                                                            1. 1

                                                                                                                              Yeah, depends on who you want to exclude from snooping on your traffic. The ISP, I assumed. The Googles and Amazons of the world have your data regardless of DNS/DoH.

                                                                                                                              I acknowledge that the circumstances are different in every country, but in the US, the major ISPs actually own ad networks and thus have a strong incentive not to ever encrypt DNS traffic.

                                                                                                                              1. 1

                                                                                                                                Yeah, depends on who you want to exclude from snooping on your traffic. The ISP, I assumed. The Googles and Amazons of the world have your data regardless of DNS/DoH.

                                                                                                                                so i’m supposed to just give them full access over the remaining part which isn’t served by them?

                                                                                                                                I acknowledge that the circumstances are different in every country, but in the US, the major ISPs actually own ad networks and thus have a strong incentive not to ever encrypt DNS traffic.

                                                                                                                                ISPs in the rest of the world aren’t better, but this still isn’t a reason to shoehorn DNS into HTTP.

                                                                                                                                1. 1

No, you’re misreading the first bit. You’re already giving it to them, most likely, because of all those cloud customers. This makes their main web property indistinguishable from their clients, once SNI and DNS is encrypted.

                                                                                                                                  No need to give more than before.

                                                                                                                                  1. 1

You’re already giving it to them, most likely, because of all those cloud customers.

                                                                                                                                    this is a faux reason. i try to not use these things when possible. just because many things are there, it doesn’t mean that i have to use even more stuff of them, quite the opposite. this may be an inconvenience for me, but it is one i’m willing to take.

                                                                                                                                    This makes their main web property indistinguishable from their clients, once SNI and DNS is encrypted.

                                                                                                                                    indistinguishable for everybody on the way, but not for the big ad companies on whose systems things are. those are what i’m worried about.

                                                                                                                                    1. 1

Hm, I feel we’re going in circles here.

                                                                                                                                      For those people who do use those services, there is an immediate gain in terms of hostname privacy (towards their ISP), once DoH and ESNI are shipped.

                                                                                                                                      That’s all I’m saying. I’m not implying you do or you should.

                                                                                                                                      1. 1

                                                                                                                                        I’m not implying you do or you should.

                                                                                                                                        no, but the implications of DoH are that i’ll end up using it, even if i don’t want to. it’ll be baked into the browsers, from there it’s only a small step to mandatory usage in systemd. regarding DoH in general: if you only have http, everything looks like a nail.

                                                                                                                        2. 1

                                                                                                                          Alternative solution: don’t use DNS anymore.

                                                                                                                          Still lots of work since we need to ditch HTTP, HTTPS, FTP, and a host of other host-oriented protocols. But, for many of these, we’ve got well-supported alternatives already. The question of how to slightly improve a horribly-flawed system stuck in a set of political deadlocks becomes totally obviated.

                                                                                                                          1. 3

                                                                                                                            That’s the biggest change of all of them. The whole reason for using DoH is to have a small change, that improves things, and that doesn’t require literally replacing the entire web.

                                                                                                                            1. 1

                                                                                                                              Sure, but it’s sort of a waste of time to try to preserve the web. The biggest problem with DNS is that most of the time the actual hostname is totally irrelevant to our purposes & we only care about it because the application-layer protocol we’re using was poorly designed.

                                                                                                                              We’re going to need to fix that eventually so why not do it now, ipv6-style (i.e., make a parallel set of protocols that actually do the right thing & hang out there for a couple decades while the people using the old ones slowly run out of incremental fixes and start to notice the dead end they’re heading toward).

Myopic folks aren’t going to adopt large-scale improvements until they have no other choice, but as soon as they have no other choice they’re quick to adopt an existing solution. We’re better off already having made one they can adopt, because if we let them design their own it’s not going to last any longer than the last one.

                                                                                                                              DNS is baked into everything, despite being a clearly bad idea, because it was well-established. Well, IPFS is well-established now, so we can start using it for new projects and treating DNS as legacy for everything that’s not basically ssh.

                                                                                                                              1. 8

                                                                                                                                Well, IPFS is well-established now

                                                                                                                                No it’s not. Even by computer standards, IPFS is still a baby.

                                                                                                                                Skype was probably the most well-established P2P application in the world before they switched to being a reskinned MSN Messenger, and the Skype P2P network had disasters just like centralized services have, caused by netsplits, client bugs, and introduction point issues. BitTorrent probably holds the crown for most well-established P2P network now, and since it’s shared-nothing (the DHT isn’t, but BitTorrent can operate without it), has never had network-wide disasters. IPFS relies on the DHT, so it’s more like Skype than BitTorrent for reliability.

                                                                                                                                1. 0

                                                                                                                                  It’s only ten years old, sure. I haven’t seen any reliability problems with it. Have you?

                                                                                                                                  DHT tech, on top of being an actually appropriate solution to the problem of addressing static chunks of data (one that eliminates whole classes of attacks by its very nature), is more reliable now than DNS is. And, we have plenty of implementations and protocols to choose from.

                                                                                                                                  Dropping IPFS or some other DHT into an existing system (like a browser) is straightforward. Opera did it years ago. Beaker does it now. There are pure-javascript implementations of DAT and IPFS for folks who can’t integrate it into their browser.

                                                                                                                                  Skype isn’t a good comparison to a DHT, because Skype connects a pair of dynamic streams together. In other words, it can’t take advantage of redundant caching, so being P2P doesn’t really do it any favors aside from eliminating a single point of failure from the initial negotiation steps.

For transferring documents (or scripts, or blobs, or whatever), dynamism is a bug – and one we eliminate with named data. Static data is the norm for most of what we use the web for, and should be the norm for substantially more of it. We can trivially eliminate hostnames from all asset fetches, replace database blobs with similar asset fetches, use one-time pads for keeping secret resources secret while allowing anyone to fetch them, & start to look at ways of making services portable between machines. (I hear DAT has a solution to this last one.) All of this is stuff any random front-end developer can figure out without much nudging, because the hard work has been done & open sourced already.

                                                                                                                                  1. 4

                                                                                                                                    IPFS is not ten years old. Its initial commit is five years ago, and that was the start of the paper, not the implementation.

                                                                                                                                    1. 1

                                                                                                                                      Huh. I could have sworn it was presented back in 2010. I must be getting it confused with another DHT system.

                                                                                                                                2. 7

                                                                                                                                  Sure, but it’s sort of a waste of time to try to preserve the web.

This is “let the perfect be the enemy of the good” thinking. We can incrementally improve (imperfectly, true) privacy now. Throwing out everything and starting over with a completely new set of protocols is a multi-decade effort before we start seeing the benefits. We should improve the situation we’re in, not ignore it while fantasizing about being in some other situation that won’t arrive for many years.

                                                                                                                                  The biggest problem with DNS is that most of the time the actual hostname is totally irrelevant to our purposes & we only care about it because the application-layer protocol we’re using was poorly designed.

                                                                                                                                  This hasn’t been true since Virtual Hosting and SNI became a thing. DNS contains (and leaks) information about exactly who we’re talking to that an IP address doesn’t.

                                                                                                                                  1. 2

This is “let the perfect be the enemy of the good” thinking. We can incrementally improve (imperfectly, true) privacy now.

                                                                                                                                    We can also take advantage of low-hanging fruit that circumvent the tarpit that is incremental improvements to DNS now.

                                                                                                                                    The perfect isn’t the enemy of the good here. This is merely a matter of what looks like a good idea on a six month timeline versus what looks like a good idea on a two year timeline. And, we can guarantee that folks will work on incremental improvements to DNS endlessly, even if we are not those folks.

                                                                                                                                    Throwing out everything and starting over with a completely new set of protocols is a multi-decade effort before we start seeing the benefits.

                                                                                                                                    Luckily, it’s an effort that started almost two decades ago, & we’re ready to reap the benefits of it.

                                                                                                                                    DNS contains (and leaks) information about exactly who we’re talking to that an IP address doesn’t.

                                                                                                                                    That’s not a reason to keep it.

                                                                                                                                    Permanently associating any kind of host information (be it hostname or DNS name or IP) with a chunk of data & exposing that association to the user is a mistake. It’s an entanglement of distinct concerns based on false assumptions about DNS permanence, and it makes the whole domain name & data center rent-seeking complex inevitable. The fact that DNS is insecure is among its lesser problems; it should not have been relied upon in the first place.

                                                                                                                                    The faster we make it irrelevant the better, & this can be done incrementally and from the application layer.

                                                                                                                                  2. 2

                                                                                                                                    But why would IPFS solve it?

                                                                                                                                    Replacing every hostname with a hash doesn’t seem very user-friendly to me and last I checked, you can trivially sniff out what content someone is loading by inspecting the requested hashes from the network.

                                                                                                                                    IPFS isn’t mature either, it’s not even a decade old and most middleboxes will start blocking it once people start using it for illegitimate purposes. There is no plan to circumvent blocking by middleboxes, not even after that stunt with putting wikipedia on IPFS.

                                                                                                                                    1. 1

IPFS doesn’t replace hostnames with hashes. It uses hashes as host-agnostic document addresses.

Identifying hosts is not directly relevant to grabbing documents, and so baking hostnames into document addresses mixes two levels of abstraction, with undesirable side effects (like dependence upon DNS and server farms to provide absurd uptime guarantees).

IPFS is one example of distributed permanent addressing. There are a lot of implementations – most relying upon hashes, since hashes provide a convenient mechanism for producing practically-unique addresses without coordination, but some using other mechanisms.

                                                                                                                                      The point is that once you have permanent addresses for static documents, all clients can become servers & you start getting situations where accidentally slashdotting a site is impossible because the more people try to access it the more redundancy there is in its hosting. You remove some of the hairiest problems with caching, because while you can flush things out of a cache, the copy in cache is never invalidated by changes, because the object at a particular permanent address is immutable.

                                                                                                                                      Problems (particularly with web-tech) that smart folks have been trying to solve with elaborate hacks for decades become trivial when we make addresses permanent, because complications like DNS become irrelevant.
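
A toy sketch of the idea (not IPFS code, just the principle): if the address of a blob is simply the hash of its bytes, then a copy fetched from any peer or cache can be verified locally, and the object at a given address can never be silently changed.

```go
// Toy content-addressed store (not IPFS): the address of a blob is the hex
// SHA-256 of its bytes, so any cached or peer-supplied copy can be verified,
// and nothing stored at an address can ever be "updated" in place.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

type Store struct{ blobs map[string][]byte }

func NewStore() *Store { return &Store{blobs: map[string][]byte{}} }

// Put stores a blob and returns its permanent, host-agnostic address.
func (s *Store) Put(data []byte) string {
	sum := sha256.Sum256(data)
	addr := hex.EncodeToString(sum[:])
	s.blobs[addr] = data
	return addr
}

// Get retrieves a blob and re-verifies that it matches its address.
func (s *Store) Get(addr string) ([]byte, error) {
	data, ok := s.blobs[addr]
	if !ok {
		return nil, errors.New("not found")
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != addr {
		return nil, errors.New("corrupt blob")
	}
	return data, nil
}

func main() {
	s := NewStore()
	addr := s.Put([]byte("hello, permanent web"))
	data, _ := s.Get(addr)
	fmt.Printf("%s -> %q\n", addr[:12], data)
}
```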

                                                                                                                                      1. 1

                                                                                                                                        And other problems become hard like “how do I have my content still online in 20 years?”.

                                                                                                                                        IPFS doesn’t address the issues it should be addressing, using hashes everywhere being one of them making it particularly user-unfriendly (possibly even user-hostile).

                                                                                                                                        IPFS doesn’t act like a proper cache either (unless their eviction strategy has significantly evolved to be more cooperative) in addition to leaking data everywhere.

                                                                                                                                        Torrent and dat:// solve the problem much better and don’t over-advertise their capabilities.

                                                                                                                                        Nobody really needs permanent addressing, what they really need is either a Tor onion address or actually cashing out for a proper webserver (where IPFS also won’t help if your content is dynamic, it’ll make things only more javascript heavy than they already are).

                                                                                                                                        1. 1

                                                                                                                                          how do I have my content still online in 20 years?

                                                                                                                                          If you want to guarantee persistence of content over long periods, you will need to continue to host it (or have it hosted on your behalf), just as you would with host-based addressing. The difference is that your host machine can be puny because it’s no longer a single point of failure under traffic load: as requests increase linearly, the likelihood of any request being directed to your host decreases geometrically (with a slow decay via cache eviction).

                                                                                                                                          IPFS doesn’t address the issues it should be addressing, using hashes everywhere being one of them making it particularly user-unfriendly (possibly even user-hostile).

                                                                                                                                          I would absolutely support a pet-name system on top of IPFS. Hashes are convenient for a number of reasons, but IPFS is only one example of a relatively-mature named-data-oriented solution to permanent addressing. It’s minimal & has good support for putting new policies on top of it, so integrating it into applications that have their own caching and name policies is convenient.

                                                                                                                                          IPFS doesn’t act like a proper cache either (unless their eviction strategy has significantly evolved to be more cooperative) in addition to leaking data everywhere.

                                                                                                                                          Most caches have forced eviction based on mutability. Mutability is not a feature of systems that use permanent addressing. That said, I would like to see IPFS clients outfitted with a replication system that forces peers to cache copies of a hash when it is being automatically flushed if an insufficient number of peers already have it (in order to address problem #1) as well as a store-and-forward mode (likewise).

                                                                                                                                          Torrent and dat:// solve the problem much better and don’t over-advertise their capabilities.

                                                                                                                                          Torrent has unfortunately already become a popular target for blocking. I would personally welcome sharing caches over DHT by default over heavy adoption of IPFS since it requires less additional work to solve certain technical problems (or, better yet, DHT sharing of IPFS pinned items – we get permanent addresses and seed/leech metrics), but for political reasons that ship has probably sailed. DAT seems not to solve the permanent address problem at all, although it at least decentralizes services; I haven’t looked too deeply into it, but it could be viable.

                                                                                                                                          Nobody really needs permanent addressing,

                                                                                                                                          Early web standards assume but do not enforce that addresses are permanent. Every 404 is a fundamental violation of the promise of hypertext. The fact that we can’t depend upon addresses to be truly permanent has made the absurd superstructure of web tech inevitable – and it’s unnecessary.

                                                                                                                                          what they really need is either a Tor onion address

                                                                                                                                          An onion address just hides traffic. It doesn’t address the single point of failure in terms of a single set of hosts.

                                                                                                                                          or actually cashing out for a proper webserver

A proper web server, though relatively cheap, is more expensive and requires more technical skill to run than is necessary or desirable. It also represents a chain of single points of failure: a domain can be seized (by a state or by anybody who can social-engineer GoDaddy or perform DNS poisoning attacks), while a host will go down under high load (or have its contents changed if somebody gets write access to the disk). Permanent addresses solve the availability problem in the case of load or active threat, while hash-based permanent addresses solve the correctness problem.

                                                                                                                                          where IPFS also won’t help if your content is dynamic,

                                                                                                                                          Truly dynamic content is relatively rare (hence the popularity of cloudflare and akamai), and even less dynamic content actually needs to be dynamic. We ought to minimize it for the same reasons we minimize mutability in functional-style code. Mutability creates all manner of complications that make certain kinds of desirable guarantees difficult or impossible.

                                                                                                                                          Signature chains provide a convenient way of adding simulated mutability to immutable objects (sort of like how monads do) in a distributed way. A more radical way of handling mutability – one that would require more infrastructure on top of IPFS but would probably be amenable to use with other protocols – is to support append-only streams & construct objects from slices of that append-only stream (what was called a ‘permascroll’ in Xanadu from 2006-2014). This stuff would need to be implemented, but it would not need to be invented – and inventing is the hard part.
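
A rough sketch of the signature-chain idea under simple assumptions (plain Ed25519 and SHA-256, not the actual IPFS/DAT machinery): each revision is an immutable blob plus a signed claim that it supersedes the hash of the previous one.

```go
// Sketch of a signature chain: mutability simulated over immutable blobs by
// signing "this content supersedes that hash". Illustrative only.
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

type Revision struct {
	Prev    [32]byte // hash of the previous revision's content (zero for the first)
	Content []byte
	Sig     []byte // signature over Prev || Content
}

func sign(priv ed25519.PrivateKey, prev [32]byte, content []byte) Revision {
	msg := append(prev[:], content...)
	return Revision{Prev: prev, Content: content, Sig: ed25519.Sign(priv, msg)}
}

func verify(pub ed25519.PublicKey, r Revision) bool {
	msg := append(r.Prev[:], r.Content...)
	return ed25519.Verify(pub, msg, r.Sig)
}

func main() {
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)

	v1 := sign(priv, [32]byte{}, []byte("version 1"))
	v2 := sign(priv, sha256.Sum256(v1.Content), []byte("version 2"))

	// Anyone holding the publisher's key can check that a revision is genuine
	// and that it really supersedes the blob they already have.
	fmt.Println("v1 valid:", verify(pub, v1))
	fmt.Println("v2 valid:", verify(pub, v2), "supersedes v1:", v2.Prev == sha256.Sum256(v1.Content))
}
```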

                                                                                                                                          it’ll make things only more javascript heavy than they already are

                                                                                                                                          Only if we stick to web tech, and then only if we don’t think carefully and clearly about how best to design these systems. (Unfortunately, endemic lack of forethought is really the underlying problem here, rather than any particular technology. It’s possible to use even complete trash in a sensible and productive way.)

                                                                                                                                          1. 1

                                                                                                                                            The difference is that your host machine can be puny because it’s no longer a single point of failure under traffic load: as requests increase linearly, the likelihood of any request being directed to your host decreases geometrically (with a slow decay via cache eviction).

                                                                                                                                            I don’t think this is a problem that needs addressing. Static content like the type that IPFS serves can be cheaply served to a lot of customers without needing a fancy CDN. An RPi on a home connection should be able to handle 4 million visitors a month easily with purely static content.

                                                                                                                                            Dynamic content, ie the content that needs bigger nodes, isn’t compatible with IPFS to begin with.

                                                                                                                                            Most caches have forced eviction based on mutability

                                                                                                                                            Caches also evict based on a number of different strategies that have nothing to do with mutability though, IPFS’ strategy for loading content (FIFO last I checked) behaves poorly with most internet browsing behaviour.

                                                                                                                                            DAT seems not to solve the permanent address problem at all, although it at least decentralizes services; I haven’t looked too deeply into it, but it could be viable.

The public key of a DAT share is essentially like an IPFS target, with the added bonus of having a tracked and replicated history and mutability, offering everything an IPNS or IPFS hash does. Additionally it’s more private and doesn’t try to sell itself as censorship-resistant (just look at the stunt with putting Wikipedia on IPFS).

                                                                                                                                            Every 404 is a fundamental violation of the promise of hypertext.

                                                                                                                                            I would disagree with that. It’s more important that we archive valuable content (ie, archive.org or via the ArchiveTeam, etc.) than having a permanent addressing method.

                                                                                                                                            Additionally the permanent addressing still does not solve content being offline. Once it’s lost, it’s lost and no amount of throwing blockchain, hashes and P2P at it will ever solve this.

                                                                                                                                            You cannot stop a 404 from happening.

                                                                                                                                            The hash might be the same but for 99.999% of content on the internet, it’ll be lost within the decade regardless.

                                                                                                                                            Truly dynamic content is relatively rare

I would also disagree with that; in the modern internet, mutable and dynamic content are becoming more common as people become more connected.

CF and Ak allow hosters to cache pages that are mostly static, like the reddit frontpage, as well as reducing the need for geo-replicated servers and reducing the load on existing servers.

                                                                                                                                            is to support append-only streams & construct objects from slices of that append-only stream

                                                                                                                                            See DAT, that’s what it does. It’s an append-only log of changes. You can go back and look at previous versions of the DAT URL provided that all the chunks are available in the P2P network.

                                                                                                                                            Only if we stick to web tech, and then only if we don’t think carefully and clearly about how best to design these systems.

IPFS in its current form is largely provided as a Node.js library, with bindings to some other languages. It’s being heavily marketed for browsers. The amount of JS in websites would only increase with IPFS and likely slow everything down even further until it scales up to global or, as it promises, interplanetary scale (though interplanetary is a pipedream; the protocol can’t even handle satellite internet properly).

Instead of looking to pipedreams of cryptography for the solution, we ought to improve the infrastructure and reduce the amount of CPU needed for dynamic content; this is an easier and more viable option than switching the entire internet to a protocol that forgets data if it doesn’t remember it often enough.

                                                                                                                          1. 4

                                                                                                                            here be rants:

everything gets shoehorned atop of webby stuff. it’s getting harder (socially and technically) to use things in ways your several big brothers don’t want you to. got the wrong ip? solve 50 captchas to help cars not run over too many people, affecting stock value. interoperability? well, we’ve got a “rest” api which is slower than parsing the website with beautiful soup. tables are represented as JSON arrays, that’s the only thing we know! hell, let’s put markup in javascript in html because we use a platform for rendering static documents as a remote-GUI toolkit. want to use a port different from 80/443? well, too bad, we don’t allow those filthy things here. want a real connection? well, we have web sockets in store! soon there will be a requirement for a tunnel to your next friendly cdn to do things. which effectively is like dialing into a remote mainframe, only that everything is http now. then we’ll finally have the love child of all the best parts: the blazing speed of browsers paired with the privacy of centralized data storage.

                                                                                                                            1. 1

I understand your point. But as hacky as DoH is: Unlike all the other crypto protocols, HTTPS actually works well at large scale and is somewhat battle-tested regarding protocol attacks (HTTP, TLS), cryptography, and fuzzing.

                                                                                                                              I’d much rather use Firefox’s HTTPS stack for DNS than, say, any other application using openssl.

                                                                                                                              As an aside, this happening on port 443 is a feature, not a bug: Your DNS traffic can blend in. It’s known to be allowed in most, if not all setups.
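
For a sense of how little machinery this takes, here is a minimal sketch of a DNS query riding on an ordinary HTTPS client, using the JSON flavour of DoH that public resolvers such as cloudflare-dns.com accept. The resolver URL and query name are just examples; wire-format DoH (application/dns-message) works the same way but needs a DNS message encoder.

```go
// Sketch of a DNS lookup carried over plain HTTPS. To anyone on the path this
// is just another HTTPS request to port 443.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	req, err := http.NewRequest("GET",
		"https://cloudflare-dns.com/dns-query?name=example.com&type=A", nil)
	if err != nil {
		log.Fatal(err)
	}
	// Ask for the JSON answer format rather than DNS wire format.
	req.Header.Set("Accept", "application/dns-json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```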

                                                                                                                              1. 2

I understand your point. But as hacky as DoH is: Unlike all the other crypto protocols, HTTPS actually works well at large scale and is somewhat battle-tested regarding protocol attacks (HTTP, TLS), cryptography, and fuzzing.

why is it better than dns over tls?

                                                                                                                                edit: the crypto used is the same for tls, fuzzing is a problem for individual implementations. there may be undefined behaviours in the protocol definition, but DNS should be well aged by now so that these are known.

                                                                                                                                I’d much rather use Firefox’s HTTPS stack for DNS than, say, any other application using openssl.

                                                                                                                                this is a problem of using openssl and not a feature inherent to DoH.

                                                                                                                                As an aside, this happening on port 443 is a feature, not a bug: Your DNS traffic can blend in. It’s known to be allowed in most, if not all setups.

                                                                                                                                encryption should happen some layers further down, not on the application layer.

                                                                                                                                even if unpopular: people should trust those who’ve created much of the infrastructure we use today if they say something is a bad idea.

                                                                                                                            1. 1

                                                                                                                              mark my words, this is the year of linux on the desktop!

                                                                                                                              1. 1

often if i read about something invented in linuxland, i have to think about plan9, and how well some concepts would fit today’s distributed computing.

                                                                                                                                1. 3

                                                                                                                                  How does that apply here? How did Plan9 do it differently?

                                                                                                                                  (FWIW, X is not a Linux invention.)

                                                                                                                                  1. 1

well, you can compose your environment: take the filesystem from one system, the processing power of another, and something else as the terminal handling input/output, all via a rather simple network protocol.

                                                                                                                                1. 6

Electricity usage is a huge concern even within the cryptocurrency community. There is a lot of work going towards more energy-efficient solutions. However, proof-of-work is still the de facto method. At Merit we still use PoW, but I chose a memory-bound algorithm called Cuckoo Cycle, which is more energy efficient since it’s memory-bandwidth-bound. I hope to move away from proof-of-work completely in the future, but it’s not easy to get the same properties. Since Merit is, in some ways, half PoW and half PoS (Proof-of-Stake) via our Proof-of-Growth (PoG) algorithm, we are already halfway there.

Proof-of-Work is fascinating because it’s philosophically the opposite of fiat money. Fiat money is one of the few things in the world where you can expend less effort and produce more of it. Cryptocurrencies with PoW are the opposite: producing more of them requires proportionally more effort.
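
For readers unfamiliar with proof-of-work, here is a generic hashcash-style sketch of the idea. It is not Merit’s Cuckoo Cycle (which is memory-bound rather than compute-bound), just the simplest illustration of “effort expended per unit produced”: grind nonces until a hash meets a difficulty target.

```go
// Generic hashcash-style proof-of-work sketch (not Merit's Cuckoo Cycle):
// search for a nonce such that SHA-256(data || nonce) has the requested
// number of leading zero bits. The cost of the search is the "work".
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts the zero bits at the start of a hash.
func leadingZeroBits(sum [32]byte) int {
	n := 0
	for _, b := range sum {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

// mine grinds nonces until the difficulty target is met.
func mine(data []byte, difficulty int) (uint64, [32]byte) {
	buf := make([]byte, len(data)+8)
	copy(buf, data)
	for nonce := uint64(0); ; nonce++ {
		binary.LittleEndian.PutUint64(buf[len(data):], nonce)
		sum := sha256.Sum256(buf)
		if leadingZeroBits(sum) >= difficulty {
			return nonce, sum
		}
	}
}

func main() {
	nonce, sum := mine([]byte("block header"), 20) // ~2^20 hashes on average
	fmt.Printf("nonce=%d hash=%x\n", nonce, sum)
}
```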

                                                                                                                                  1. 2

How much more energy efficient is Merit (on the scale of the top 100 countries’ electricity consumption)?

                                                                                                                                    The article points out that ASIC miners have found ways of solving algorithms that have previously been thought to be resistant to a bespoke hardware solution.

                                                                                                                                    Consuming the same amount of electricity as a large’ish country is certainly fascinating.

                                                                                                                                    1. 4

Warning! This will be a bummer reply; nothing I will say here will be uplifting…

Notice, of course, that the difference between the #1 country and the #2 country is large. It likely follows Zipf’s law. The issue with ASICs is that they are hard to acquire, and therefore insiders get access to them first and have a huge advantage. It’s anathema to the goal of having anyone download the software and mine.

                                                                                                                                      In the scheme of things, the amount of electricity used to mine cryptocurrencies pales in comparison to the amount of electricity wasted on countless other things. We should just acknowledge that there is something fundamentally wrong with the global economic system that allows for gross externalities that aren’t accounted for. And that there is such a gross disparity of wealth where some countries have such excess capacity for electricity while others struggle with brownouts and blackouts every day.

Global warming itself is an incredibly complex problem. Using a slow scripting language for your software? How much hardware are you wasting running at scale? Buying a Tesla? Too bad your electricity is likely dirty, and the production caused 5 years’ worth of the CO2 a normal car puts out. Switching to solar and wind? Too bad the air will be cleaner, causing more sunlight to hit the earth and heating it up faster; even stopping now, we have decades of warming built in, and a cleaner atmosphere accelerates that warming.

                                                                                                                                      Global warming is such an insanely difficult, complex, and urgent problem that we are missing the forest for the trees.

                                                                                                                        Cryptocurrencies are not tackling the problem of global warming, but neither are most technologies we create every day. I would love to hear how many people on Lobsters are tackling global warming head-on; I suspect almost zero. And isn’t that just the most depressing thing? It is for me; I think about this every day when I look at my children.

                                                                                                                        EDIT: Holy poop, I was right, it totally follows Zipf’s law: https://en.wikipedia.org/wiki/List_of_countries_by_electricity_consumption .
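
                                                                                                                        If you want to check that Zipf-like claim with numbers rather than by eyeballing the list, a few lines of Python suffice. This sketch is mine and embeds no figures; you would paste in the per-country consumption values from the linked Wikipedia page, sorted largest first, and look for a log-log slope near -1, the classic Zipf signature.

```python
# Sketch for eyeballing a Zipf fit: slope of log(consumption) vs log(rank).
# Supply your own list of per-country figures, sorted largest first.
import math

def zipf_slope(consumption_desc):
    """Least-squares slope of log(value) against log(rank); ~-1 is Zipf-like."""
    assert len(consumption_desc) >= 2, "need at least two data points"
    xs = [math.log(rank) for rank in range(1, len(consumption_desc) + 1)]
    ys = [math.log(value) for value in consumption_desc]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```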

                                                                                                                                      1. 9

                                                                                                                                        NB: this may be ranty ;)

                                                                                                                                        In the scheme of things, the amount of electricity used to mine cryptocurrencies pales in comparison to the amount of electricity wasted on countless other things.

                                                                                                                        how about not doing things which currently have no value for society except being an item for financial speculation and burning resources? that would be a start. i have yet to see a valid application of cryptocurrencies which really works. hard cash is still a good thing which works. it’s like voting machines: they may kinda work, but crosses made with a pen on paper are still the best solution.

                                                                                                                        the electricity wasted on other things is due to shitty standby mechanisms and laziness. these things can be fixed. the “currency” part of “cryptocurrency” depends on wasting resources, which can’t be fixed.

                                                                                                                                        Global warming itself is an incredibly complex problem.

                                                                                                                                        so-so.

                                                                                                                                        Using a slow scripting language for your software? How much hardware are you wasting running at scale?

                                                                                                                                        see the fixing part above. fortunately most technology tends to get more efficient the longer it exists.

                                                                                                                        Buying a Tesla? Too bad your electricity is likely dirty, and the production caused 5 years’ worth of the CO2 a normal car puts out.

                                                                                                                                        yeah, well, don’t buy cars from someone who shoots cars into orbit.

                                                                                                                        Switching to solar and wind? Too bad a cleaner atmosphere lets more sunlight hit the earth, heating it up faster; even stopping now, we have decades of warming built in, and a cleaner atmosphere accelerates that warming.

                                                                                                                        the dimming and warming are two separate effects, though both are caused by burning things. cooling is caused by particles, while warming is caused by gases (CO2, CH4, …). there are some special cases like soot on the (ant)arctic ice, speeding up the melting. (cf. https://en.wikipedia.org/wiki/Global_cooling#Physical_mechanisms , https://en.wikipedia.org/wiki/Global_warming#Initial_causes_of_temperature_changes_(external_forcings) )

                                                                                                                        Cryptocurrencies are not tackling the problem of global warming, but neither are most technologies we create every day. I would love to hear how many people on Lobsters are tackling global warming head-on; I suspect almost zero. And isn’t that just the most depressing thing? It is for me; I think about this every day when I look at my children.

                                                                                                                        as global warming doesn’t have a single cause, there isn’t much to tackle head-on. with everything there’s a spectrum here. some ideas which will help:

                                                                                                                                        • don’t fly (less CO2).
                                                                                                                                        • buy local food when possible, not fruit from around the globe in midwinter. don’t eat much meat (less CO2, CH4, N2O).
                                                                                                                                        • use electricity from renewable sources (less CO2).

                                                                                                                                        those things would really help if done on a larger scale, and aren’t too hard.

                                                                                                                                        1. 2

                                                                                                                        how about not doing things which currently have no value for society except being an item for financial speculation and burning resources? that would be a start. i have yet to see a valid application of cryptocurrencies which really works.

                                                                                                                        Buying illegal goods through the internet without the risk of being identified through the financial transaction (Monero, and probably Bitcoin with coin tumblers).

                                                                                                                                          1. 4

                                                                                                                        mind that i wrote society: one valid use case is drugs, which shouldn’t be illegal but should be sold by reliable, quality-controlled suppliers. i think other illegal things are illegal for a reason. additionally, i’d argue it’s risky to mail-order illegal things to your doorstep.

                                                                                                                                            1. 2

                                                                                                                        Cryptocurrencies solve a much harder problem than hard cash: they have lowered the cost of producing non-state money. Non-state money has existed for thousands of years, but this is the first time in history you can trade globally with it. While the US dollar may be accepted almost everywhere, this is not true for other forms of cash.

                                                                                                                                              1. 4

                                                                                                                                                but what is the real use case?

                                                                                                                        • if globalized trade continues to exist, so will the classic ways of payment. cryptocurrencies are only useful in this case if you want to do illegal things. there may be a use case in oppressed countries, but the people there tend to have other problems than buying things somewhere in the world.

                                                                                                                                                • if it ceases to exist, one doesn’t need a cryptocurrency to trade anywhere in the world, as there is no trade.

                                                                                                                        i’m not a huge fan of the current state of the banking system, but it is a rather deep local optimum. it bugs me that i have to pay transaction fees, but that’s the case with cryptocurrencies, too. i just think that while theoretically elegant, cryptocurrencies do more harm than good.

                                                                                                                        anecdote: years ago, i paid for a shell account by putting money in an envelope and sending it via mail ;)

                                                                                                                                                1. 2

                                                                                                                        Cryptocurrencies are a transvestment from centralized tech to decentralized tech. It’s not what they do, but how they do it that’s different. It’s a technology that allows the private sector to invest in decentralized tech, where in the past they had no incentive to do so. Since the governments of the world have failed so miserably to invest in decentralized technology in the last 20 years, this is the first time I can remember that the private sector can contribute to building it. Note that cryptocurrencies are behind investments in decentralized storage, processing, and other solutions that, before the blockchain, would have been charity cases.

                                                                                                                        The question you can ask is, why not just stick with centralized solutions? I think the argument is a moral one, about power to the people versus some unaccountable third party.

                                                                                                                                                  1. 1

                                                                                                                                                    It’s a technology that allows the private sector to invest in decentralized tech, where in the past they had no incentive to do so.

                                                                                                                        i still don’t see exactly where cryptocurrencies are required for investment in decentralized technology. we have many classic systems which are decentralized: internet (phone before that), electricity grid, water supply, roads, etc. why are cryptocurrencies required for “modern” decentralized systems? it just takes multiple parties who decide that it is a good solution to run a distributed service (like e-mail). how it is paid for is a different problem. one interesting aspect is that the functionality can be tightly coupled with payments in blockchainy systems. i’m not convinced that is reason enough to use it. furthermore, some things can’t be done well due to the CAP theorem, so centralization is the only solution in those cases.

                                                                                                                        Note that cryptocurrencies are behind investments in decentralized storage, processing, and other solutions that, before the blockchain, would have been charity cases.

                                                                                                                                                    I’d say that the internet needs more of the “i run it because i can, not because i can make money with it” spirit again.

                                                                                                                                                    1. 1

                                                                                                                        i still don’t see exactly where cryptocurrencies are required for investment in decentralized technology.

                                                                                                                                                      You are absolutely right! It isn’t a requirement. I love this subject by the way, so let me explain why you are right.

                                                                                                                                                      we have many classic systems which are decentralized: internet (phone before that), electricity grid, water supply, roads, etc. why are cryptocurrencies required for “modern” decentralized systems

                                                                                                                                                      You are absolutely right here. In the past, our decentralized systems were developed and paid for by the public sector. The private sector, until now, failed to create decentralized systems. The reason we need cryptocurrencies for modern decentralized systems is that we don’t have the political capital to create and fund them in the public sector anymore.

                                                                                                                        If we had a functioning global democracy, we could probably create many systems in the spirit of “i run it because i can, not because i can make money with it”.

                                                                                                                                                      That spirit died during the great privatization of computing in the mid 80s, and the privatization of the internet in the mid 90s.

                                                                                                                                          2. 2

                                                                                                                                            I love rants :-) Let’s go!

                                                                                                                        “currency” part of “cryptocurrency” depends on wasting resources, which can’t be fixed.

                                                                                                                        Some people value non-state, globally tradeable currencies. Google alone claims to have generated $238 billion in economic activity from their ads and search: https://economicimpact.google.com/ . The question is, how much CO2 did that economic activity create? Likely far more than all cryptocurrencies combined, but that’s just my guess. It’s not an excuse; I’m just pointing out we are missing the forest for the trees. People follow the money: just as Google engineers work for Google because the ad money is there, many people are working on cryptocurrencies because the money is there.

                                                                                                                                            see the fixing part above. fortunately most technology tends to get more efficient the longer it exists.

                                                                                                                                            While true, since our profession loves pop-culture, most technologies are replaced with more fashionable and inefficient ones the longer they exist. Remember when C people were claiming C++ was slow? I do.

                                                                                                                                            the dimming and warming are two separate effects, though both are caused by burning things.

                                                                                                                        They are separate effects that have a complex relationship in our models of the earth’s warming. Unfortunately, even most well-meaning climate advocates don’t acknowledge dimming, or that it’s not as simple as switching to renewable sources, since renewables do not cause dimming, and god knows we need the dimming.

                                                                                                                                            those things would really help if done on a larger scale and aren’t too hard.

                                                                                                                        Here is my honest opinion: we should have done this 30 years ago, when it wasn’t too late. I was a child 30 years ago; the previous generation handed me this predicament on a silver platter. I do my part: I don’t eat meat because of global warming, I rarely use cars, I use public transport as much as possible, I work from home as much as possible, etc.

                                                                                                                        But I do these things knowing it’s too late. Even if we stopped dumping CO2 into the atmosphere today, we have decades of warming built in that will likely irreparably change our habitat. Even the IPCC assumes we will geoengineer our way out with some magical unicorn technology that hasn’t been created yet.

                                                                                                                                            I do my part not because I think they will help, but because I want to be able to look at my children and at least say I tried.

                                                                                                                        I think one of my next software projects will be helping migrants travel safely, because one of the biggest tragedies and sources of human suffering from climate change has been the refugee crisis, and it is only going to grow.

                                                                                                                                            1. 2

                                                                                                                        Some people value non-state, globally tradeable currencies. Google alone claims to have generated $238 billion in economic activity from their ads and search: https://economicimpact.google.com/ . The question is, how much CO2 did that economic activity create? Likely far more than all cryptocurrencies combined, but that’s just my guess. It’s not an excuse; I’m just pointing out we are missing the forest for the trees. People follow the money: just as Google engineers work for Google because the ad money is there, many people are working on cryptocurrencies because the money is there.

                                                                                                                                              i won’t refute that ads are a waste of resources, i just don’t see why more resources need to be wasted on things which have no use except for speculation. i hope we can do better.

                                                                                                                                              While true, since our profession loves pop-culture, most technologies are replaced with more fashionable and inefficient ones the longer they exist. Remember when C people were claiming C++ was slow? I do.

                                                                                                                        JavaScript has gotten orders of magnitude more efficient. Hardware is still getting more efficient. There is always room for improvement. As you’ve written, people go where the money is (or can be saved).

                                                                                                                        They are separate effects that have a complex relationship in our models of the earth’s warming. Unfortunately, even most well-meaning climate advocates don’t acknowledge dimming, or that it’s not as simple as switching to renewable sources, since renewables do not cause dimming, and god knows we need the dimming.

                                                                                                                                              But I do these things knowing it’s too late. Even if we stopped dumping CO2 in the atmosphere today, we have decades of warming built in that will likely irreparably change our habitat.

                                                                                                                        Dimming has an effect. As a reason not to switch to renewable energy, it isn’t a good argument. To stop pumping out more greenhouse gases would be a good start; they tend to be consumed by plants.

                                                                                                                        […] we will geoengineer our way out with some magical unicorn technology that hasn’t been created yet.

                                                                                                                        let’s not do this, humans have a tendency to make things worse that way ;)

                                                                                                                                              1. 1

                                                                                                                                                i hope we can do better.

                                                                                                                        I don’t think our economic system is set up for that.

                                                                                                                                                Javascript has gotten more efficient in the order of magnitudes. Hardware is still getting more efficient. There is always room for improvement. As you’ve written, people go where the money is (or can be saved).

                                                                                                                        I think that because Moore’s law is now dead, things are starting to swing back towards efficiency. I hope this trend continues.

                                                                                                                        Dimming has an effect. As a reason not to switch to renewable energy, it isn’t a good argument. To stop pumping out more greenhouse gases would be a good start; they tend to be consumed by plants.

                                                                                                                        I didn’t offer dimming as a reason not to switch to renewables; I brought it up because JUST switching to renewables will doom us. As I’ve said, there are decades of warming baked in; there is a lag from the CO2 we have already put in. Yes, we need to stop putting more in, but it’s not enough to just stop. And in fact, stopping and not doing anything else will doom us faster.

                                                                                                                        let’s not do this, humans have a tendency to make things worse that way ;)

                                                                                                                        I totally agree. I don’t want countries to start launching nuclear weapons, for example. The only realistic thing that could possibly work is massive planting of trees; I mean billions of trees need to be planted. And time is running out, because photosynthesis stops working above a certain temperature, so many places are already impossible to fix (Iraq, for example, which used to be covered in thick forests thousands of years ago).

                                                                                                                                                1. 1

                                                                                                                        I don’t think our economic system is set up for that.

                                                                                                                        aren’t we the system? changes can begin small; it’s just that many attempts fail early, i suppose.

                                                                                                                                                  And in fact, stopping and not doing anything else will doom us faster.

                                                                                                                                                  do you have any sources for that?

                                                                                                                        The only realistic thing that could possibly work is massive planting of trees; I mean billions of trees need to be planted. And time is running out, because photosynthesis stops working above a certain temperature, so many places are already impossible to fix (Iraq, for example, which used to be covered in thick forests thousands of years ago).

                                                                                                                        well, if the trend continues, greenland will have some ice-free space for trees ;) just stopping deforestation would be a good start though.

                                                                                                                                                  1. 1

                                                                                                                                                    aren’t we the system?

                                                                                                                        We did not create the system; we were born into it. Most people see it as reality rather than as a designed system.

                                                                                                                                                    do you have any sources for that?

                                                                                                                                                    https://www.sciencedaily.com/releases/2017/07/170731114534.htm

                                                                                                                        well, if the trend continues, greenland will have some ice-free space for trees ;) just stopping deforestation would be a good start though.

                                                                                                                                                    Sorry if I’m wrong, but do I sense a bit of skepticism about the dangers we face ahead?

                                                                                                                                          3. 5

                                                                                                                        That was such a non-answer, full of red herrings. He wanted to know what your cryptocurrency’s electrical consumption is. Like Bitcoin, it’s positioned as an alternative to centralized methods. The centralized methods running on strongly-consistent DBs currently do an insane volume of transactions on cheap machines that can be clustered globally if necessary. My approach is a centralized setup with multiple parties checking each other, kind of similar to how multinational finance already works but with more specific, open protocols to improve on it. That just adds a few more computers for each party… individual, company, or country… that is involved in the process. I saw a diesel generator at Costco for $999 that could cover the energy requirements of a multinational setup of my system that outperforms all cryptocurrency setups.

                                                                                                                                            So, what’s the energy usage of your system, can I participate without exploding my electric bill at home (or generator), and, if not, what’s the justification of using that cryptosystem instead of improving on the centralized-with-checking methods multinationals are using right now that work despite malicious parties?
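
                                                                                                                        For readers unfamiliar with the “centralized with multiple parties checking each other” setup mentioned above, here is a minimal sketch of one way such checking can work. It is my own illustration under simple assumptions, not nickpsecurity’s actual protocol: the operator keeps an ordered transaction log, each checking party independently replays it into a hash chain, and the parties compare head hashes out of band, so silent edits or different histories shown to different parties surface as mismatched heads.

```python
# Minimal sketch of "centralized operator, multiple parties cross-checking".
# Illustrative only; not a production protocol.
import hashlib
import json

def chain_head(transactions) -> str:
    """Fold an ordered transaction log into a single head hash."""
    head = b"\x00" * 32                      # genesis value
    for tx in transactions:
        encoded = json.dumps(tx, sort_keys=True).encode()
        head = hashlib.sha256(head + encoded).digest()
    return head.hex()

def disagreeing_parties(heads_by_party) -> list:
    """Return parties whose head differs from the majority view,
    i.e. evidence that someone was shown a different history."""
    counts = {}
    for head in heads_by_party.values():
        counts[head] = counts.get(head, 0) + 1
    majority = max(counts, key=counts.get)
    return [party for party, head in heads_by_party.items() if head != majority]

if __name__ == "__main__":
    log = [{"from": "alice", "to": "bob", "amount": 10}]
    tampered = log + [{"from": "bob", "to": "eve", "amount": 10}]
    heads = {"party_a": chain_head(log),
             "party_b": chain_head(log),
             "party_c": chain_head(tampered)}
    print(disagreeing_parties(heads))        # -> ['party_c']
```

                                                                                                                        Nothing here needs mining; the cost is one hash per transaction per checker, which is why this style of system runs on cheap, clusterable machines.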

                                                                                                                                            1. 3

                                                                                                                        How much more memory efficient is Merit (on the scale of the top 100 countries’ electricity consumption)?

                                                                                                                        Sorry, that’s his question. I can answer that easily: it’s not on that scale. My interpretation of that question was that he was making a joke, which is why I didn’t answer it. If derek-jones was serious about that question, I apologize.

                                                                                                                        As I mentioned, the algorithm is memory-bandwidth bound; I’m seeing half the energy cost on my rig, but I need to do more stringent measurements.

                                                                                                                                              1. 1

                                                                                                                        More of a pointed remark than a joke. But your reply was full of red herrings, to quote nickpsecurity.

                                                                                                                        If I am sufficiently well financed that I can consume 10 MW of power, then I will always consume 10 MW. If somebody produces more efficient hashing hardware/software, I will use it to generate more profit, not reduce electricity consumption. Any system that contains a PoW component pushes people to consume as much electricity as they can afford.

                                                                                                                                                1. 1

                                                                                                                                                  If somebody produces more efficient hashing hardware/software, I will use it to generate more profit, not reduce electricity consumption.

                                                                                                                                                  This is true for any resource and any technology in our global economic system.

                                                                                                                        I wasn’t trying to reply with red herrings, but to expand the conversation. It’s really interesting that people attack cryptocurrencies for wasting electricity when there is a bigger elephant in the room nobody seems to want to talk about. Everyone knows who butters their bread. Keep in mind I’m not defending wasting electricity, but focusing only on electricity is like, to use a computer analogy, focusing only on memory and creating garbage collection to deal with it while ignoring other resources like sockets, pipes, etc. That’s why I like C++: it solves the problem for ALL resources, not just one. We need a C++ for the real world ;-)

                                                                                                                                          4. 2

                                                                                                                                            I answered your question more directly, see response to nickpsecurity.

                                                                                                                                        1. 6

                                                                                                                        i’ve had recent experience: ubuntu 18.04 feels sluggish in some parts vs. a slackware install, with more or less the same functionality:

                                                                                                                                          • xscreensaver unlocking: for some reason it takes half a second after i hit the return key until i see the desktop again on the ubuntu install.

                                                                                                                                          • there are weird lags everywhere (and no chance of debugging it with all the magick moving parts aka. “*kit”).

                                                                                                                                          • booting is slower, the initrd takes seconds until i can type in the passphrase to unlock the full disk encryption.

                                                                                                                        maybe all the shiny new tools aren’t so great after all. if one looks at the default install size of slackware, it isn’t even lightweight at around ~8G. it just doesn’t have all the stuff handed down to us by redhat, running in the background doing things.
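
                                                                                                                        Those lags are hard to argue about without numbers. As one crude, hypothetical proxy (my sketch, not a proper benchmark and not tied to any particular distro or scheduler), you can measure how late short timer sleeps actually wake up while you use the desktop; spiky lateness tends to correlate with that sluggish feel.

```python
# Crude responsiveness proxy: how late do 1 ms sleeps actually wake up?
# Run it while using the desktop and compare distros/schedulers.
import statistics
import time

def wakeup_lateness_us(samples: int = 2000, interval_s: float = 0.001):
    lateness = []
    for _ in range(samples):
        start = time.monotonic()
        time.sleep(interval_s)
        lateness.append((time.monotonic() - start - interval_s) * 1e6)
    return statistics.mean(lateness), max(lateness)

if __name__ == "__main__":
    mean_us, worst_us = wakeup_lateness_us()
    print(f"mean lateness: {mean_us:.0f} us, worst: {worst_us:.0f} us")
```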

                                                                                                                                          1. 3

                                                                                                                                            I would kill for Michael Larabel @ Phoronix to figure out some reliable user responsiveness tests for Linux distros.

                                                                                                                                            1. 1

                                                                                                                                              Because it is very hard and takes a lot of time I’d wager. Few have the time, money or drive to do such a thing.

                                                                                                                                              Update: the responsiveness issues I was having with Ubuntu 18.04 were resolved by not using a distro install based on Gnome. I was able to replicate responsiveness improvements w/ a Debian + LXDE install and a Lubuntu (Ubuntu + LXDE) install.

                                                                                                                        I use XMonad as my primary WM anyway, so I really just want a leaner bootstrap environment that “just works”.

                                                                                                                                          1. 13

                                                                                                                        I think I understand where the author’s coming from, but I think some of his concerns are probably a bit misplaced. For example, unless you’ve stripped all the Google off your Android phone (which some people can do), Google can muck with whatever on your phone regardless of how you install Signal. In all other cases, I completely get why Moxie would rather insist you install Signal via a mechanism that ensures updates are efficiently and quickly delivered. While he’s got a point on centralized trust (though a note on that in a second), swapping out Google Play for F-Droid doesn’t help there; you’ve simply switched who you trust. And in all cases of installation, you’re trusting Signal at some point. (Or whatever other encryption software you opt to use, for that matter, even if it’s something built pretty directly on top of libsodium at the end of the day.)

                                                                                                                                            That all gets back to centralized trust. Unless the author is reading through all the code they’re compiling, they’re trusting some centralized sources—likely whoever built their Android variant and the people who run the F-Droid repositories, at a bare minimum. In that context, I think that trusting Google not to want to muck with Signal is probably honestly a safe bet for most users. Yes, Google could replace your copy of Signal with a nefarious version for their own purposes, but that’d be amazingly dumb: it’d be quickly detected and cause irreparable harm to trust in Google from both users and developers. Chances are honestly higher that you’ll be hacked by some random other app you put on your phone than that Google will opt to go after Signal on their end. Moxie’s point is that you’re better off trusting Signal and Google than some random APK you find on the Internet. And for the overwhelming majority of users, I think he’s entirely correct.

                                                                                                                                            When I think about something like Signal, I usually focus on, who am I attempting to protect myself from? Maybe a skilled user with GPG is more secure than Signal (although that’s arguable; we’ve had quite a few CVEs this year, such as this one), but normal users struggle to get such a setup meaningfully secure. And if you’re just trying to defend against casual snooping and overexcited law enforcement, you’re honestly really well protected out-of-the-box by what Signal does today—and, as Mickens has noted, you’re not going to successfully protect yourself from a motivated nation-state otherwise.

                                                                                                                                            1. 20

                                                                                                                                              and cause irreparable harm to trust in Google from both users and developers

                                                                                                                        You have good points except this common refrain we should all stop saying. These big companies have been caught pulling all kinds of stuff on their users, and they usually keep their market share and riches. Google is no different. If this were detected, they’d issue an apologetic press release saying either that it was a mistake in their complex distribution system or that the feature was for police with a warrant, used accordingly or mistakenly. The situation shifts from “everyone ditch evil Google” to a more complicated one most users won’t take decisive action on. Many wouldn’t even want to think too hard about it, or would otherwise assume that mass spying at the government or Google level is going on. It’s something they tolerate.

                                                                                                                                              1. 11

                                                                                                                                                I think that trusting Google not to want to muck with Signal is probably honestly a safe bet for most users.

                                                                                                                                                The problem is that moxie could put things in the app if enough rubberhose (or money, or whatever) is applied. I don’t know why this point is frequently overlooked. These things are so complex that nobody could verify that the app in the store isn’t doing anything fishy. There are enough side-channels. Please stop trusting moxie, not because he has done something wrong, but because it is the right thing to do in this case.

                                                                                                                        Another problem: Signal’s servers could be compromised, leaking the communication metadata of everyone. That could be fixed with federation, but many people seem to be against federation here, for spurious reasons. That federation and encryption can work together is shown by Matrix, for example. I grant that it is rough around the edges, but at least they try, and for now it looks promising.

                                                                                                                                                Finally (imho): good crypto is hard, as the math behind it has hard constraints. Sure, the user interfaces could be better in most cases, but some things can’t be changed without weakening the crypto.

                                                                                                                                                1. 2

                                                                                                                                                  many people seem to be against federation here, for spurious reasons

                                                                                                                                                  Federation seems like a fast path to ossification. It is much harder to change things without disrupting people if there are tons of random servers and clients out there.

                                                                                                                                                  Also, remember how great federation worked out for xmpp/jabber when google embraced and then extinguished it? I sure do.

                                                                                                                                                  1. 2

                                                                                                                                                    Federation seems like a fast path to ossification.

                                                                                                                                                    I have been thinking about this. There are certainly many protocols that are unchangeable at this point but I don’t think it has to be this way.

                                                                                                                                                    Web standards like HTML/CSS/JS and HTTP are still constantly improving despite having thousands of implementations and different programs using them.

                                                                                                                        From what I can see, the key to stopping ossification of a protocol is to have a single authority and source of truth for it, one that is dedicated to making changes and makes them often.

                                                                                                                                                    1. 2

                                                                                                                                                      I think your HTTP example is a good one. I would also add SSL/TLS to that, as another potential useful example to analyze. Both (at some point) had concepts of versioning built into them, which has allowed the implementation to change over time, and cut off the “long tail” non-adopters. You may be on to something with your “single authority” concept too, as both also had (for the most part) relatively centralized committees responsible for their specification.

                                                                                                                                                      I think html/css/js are /perhaps/ a bit of a different case, because they are more documentation formats, and less “living” communication protocols. The fat clients for these have tended to grow in complexity over time, accreting support for nearly all versions. There are also lots of “frozen” documents that people still may want to view, but which are not going to be updated (archival pages, etc). These have also had a bit more of a “de facto” specification, as companies with dominant browser positions have added their own features (iframe, XMLHttpRequest, etc) which were later taken up by others.
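
                                                                                                                        To illustrate the versioning point with a toy example (my sketch, not how HTTP or TLS actually encode versions): when both sides advertise what they support and the highest mutual version above a deprecation floor wins, the protocol can keep evolving while the long tail is cut off explicitly instead of breaking silently.

```python
# Toy version negotiation: pick the highest mutually supported version,
# refuse anything below a deprecation floor. Not real HTTP/TLS encoding.
def negotiate(client_versions, server_versions, minimum):
    """Return the agreed version, or None if negotiation fails."""
    mutual = set(client_versions) & set(server_versions)
    acceptable = [v for v in mutual if v >= minimum]
    return max(acceptable) if acceptable else None

# An old client that only speaks version 1 is rejected once the floor moves.
print(negotiate([1, 2, 3], [2, 3, 4], minimum=2))  # -> 3
print(negotiate([1], [2, 3, 4], minimum=2))        # -> None
```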

                                                                                                                                                    2. 1

                                                                                                                                                      Federation seems like a fast path to ossification. It is much harder to change things without disrupting people if there are tons of random servers and clients out there. Also, remember how great federation worked out for xmpp/jabber when google embraced and then extinguished it? I sure do.

                                                                                                                                                      It may seem so, but that doesn’t mean it will happen. It has happened with xmpp, but xmpp had other problems, too:

                                                                                                                                                      • Not good for mobile use (some years back when messenger apps went big, but mobile connections were bad)
                                                                                                                                                      • A “kind-of-XML”, which was hard to parse (I may be wrong here)
                                                                                                                                                      • Reinventing of the wheel, I’m not sure how many crypto standards there are for xmpp

                                                                                                                                                      Matrix does some things better:

                                                                                                                                                      • Reference server and clients for multiple platforms (electron/web, but at least there is a client for many platforms)
                                                                                                                                                      • Reference crypto library in C (so bindings are easier and no one tries to re-implement it)
                                                                                                                        • Relatively simple client protocol (less prone to implementation errors than the streams of xmpp, imho; see the sketch below)

                                                                                                                        The Google problem you described isn’t inherent to federation. It’s more of a people problem: too many people were too lazy to set up their own instances and just used Google’s, essentially forming a centralized network again.
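
                                                                                                                        As a rough illustration of the “relatively simple client protocol” point in the list above, sending a message is essentially one authenticated HTTP request. The sketch below is mine; it assumes the third-party requests package, an access token you already obtained, and the r0 client-server API path as I remember it, so treat the details as approximate and check the spec before relying on them.

```python
# Rough sketch: sending a text message over the Matrix client-server API
# (r0-era path, from memory; check the spec before relying on it).
import uuid
import requests

def send_text(homeserver: str, access_token: str, room_id: str, body: str):
    txn_id = uuid.uuid4().hex  # client-generated transaction id
    url = (f"{homeserver}/_matrix/client/r0/rooms/{room_id}"
           f"/send/m.room.message/{txn_id}")
    resp = requests.put(url,
                        headers={"Authorization": f"Bearer {access_token}"},
                        json={"msgtype": "m.text", "body": body})
    resp.raise_for_status()
    return resp.json()["event_id"]

# send_text("https://matrix.org", "<token>", "!room:matrix.org", "hello")
```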

                                                                                                                                                  2. 10

                                                                                                                                                    Maybe a skilled user with GPG is more secure than Signal

                                                                                                                        Only if that skilled user communicates solely with other skilled users. It’s common for people to reply in plaintext, quoting the whole encrypted message…

                                                                                                                                                    1. 3

                                                                                                                                                      And in all cases of installation, you’re trusting Signal at some point.

                                                                                                                                                      Read: F-Droid is for open-source software. No trust necessary. Though to be fair, even then the point on centralization still stands.

                                                                                                                                                      Yes, Google could replace your copy of Signal with a nefarious version for their own purposes, but that’d be amazingly dumb: it’d be quickly detected and cause irreparable harm to trust in Google from both users and developers.

                                                                                                                                                      What makes you certain it would be detected so quickly?

                                                                                                                                                      1. 5

                                                                                                                                                        “Read: F-Droid is for open-source software. No trust necessary”

                                                                                                                        That’s nonsense. FOSS can conceal backdoors if nobody is reviewing it, which is often the case. Bug hunters also find piles of vulnerabilities in FOSS, just like in proprietary software. People who vet the software they use have limits on skill, tools, and time that might make them miss vulnerabilities. Therefore, you absolutely have to trust the people and/or their software even if it’s FOSS.

                                                                                                                        The field of high-assurance security was created partly to address being able to certify (trust) systems written by your worst enemy. They achieved many pieces of that goal, but new problems still show up. Almost no FOSS is built that way. So it sure as hell can’t be trusted if you don’t trust those making it. Same with proprietary software.

                                                                                                                                                        1. 3

                                                                                                                                                          It’s not nonsense, it’s just not an assurance. Nothing is. Open source, decentralization, and federation are the best we can get. However, I sense you think we can do better, and I’m curious as to what ideas you might have.

                                                                                                                                                          1. 4

                                                                                                                                                            There’s definitely a better method. I wrote it up with roryokane being nice enough to make a better-formatted copy here. Spoiler: none of that shit matters unless the stuff is thoroughly reviewed and proof sent to you by skilled people you can trust. Even if you do that stuff, the core of its security and trustworthiness will still fall on who reviewed it, how, how much, and if they can prove it to you. It comes down to trusting a review process by people you have to trust.

                                                                                                                                                            In a separate document, I described some specifics that were in high-assurance security certifications. They’d be in a future review process since all of them caught or prevented errors, often different ones. Far as assurance techniques, I summarized decades worth of them here. They were empirically proven to work addressing all kinds of problems.

                                                                                                                                                        2. 2

                                                                                                                                                          even then the point on centralization still stands.

                                                                                                                                                          fdroid actually lets you add custom repo sources.

                                                                                                                                                          1. 1

                                                                                                                                                            The argument in favour of F-Droid was twofold, and covered the point about “centralisation.” The author suggested Signal run an F-Droid repo themselves.

                                                                                                                                                        1. 1

        Is there any well-known PGP alternative other than this? Based on history, I cannot blindly trust code written by one human being that is not battle-tested.

                                                                                                                                                          In any case, props to them for trying to start something. PGP does need to die.

                                                                                                                                                          1. 7

        a while ago i found http://minilock.io/ which sounds interesting as a pgp alternative. i haven’t used it myself though.

                                                                                                                                                            1. 2

                                                                                                                                                              Its primitives and an executable model were also formally verified by Galois using their SAW tool. Quite interesting.

                                                                                                                                                            2. 6

                                                                                                                                                              This is mostly a remix, in that the primitives are copied from other software packages. It’s also designed to be run under very boring conditions: running locally on your laptop, encrypting files that you control, in a manual fashion (an attacker can’t submit 2^## plaintexts and observe the results), etc.

        Not saying you shouldn’t ever be skeptical about new crypto code, but there is a big difference between this and hobbyist TLS server implementations.

                                                                                                                                                              1. 5

                                                                                                                                                                I’m Enchive’s author. You’ve very accurately captured the situation. I didn’t write any of the crypto primitives. Those parts are mature, popular implementations taken from elsewhere. Enchive is mostly about gluing those libraries together with a user interface.

                                                                                                                                                                I was (and, to some extent, still am) nervous about Enchive’s message construction. Unlike the primitives, it doesn’t come from an external source, and it was the first time I’ve ever designed something like that. It’s easy to screw up. Having learned a lot since then, if I was designing it today, I’d do it differently.

                                                                                                                                                                As you pointed out, Enchive only runs in the most boring circumstances. This allows for a large margin of error. I’ve intentionally oriented Enchive around this boring, offline archive encryption.

                                                                                                                                                                I’d love if someone smarter and more knowledgeable than me had written a similar tool — e.g. a cleanly implemented, asymmetric archive encryption tool with passphrase-generated keys. I’d just use that instead. But, since that doesn’t exist (as far as I know), I had to do it myself. Plus I’ve become very dissatisfied with the direction GnuPG has taken, and my confidence in it has dropped.

                                                                                                                                                                1. 2

                                                                                                                                                                  I didn’t write any of the crypto primitives

                                                                                                                                                                  that’s not 100% true, I think you invented the KDF.

                                                                                                                                                                  1. 1

                                                                                                                                                                    I did invent the KDF, but it’s nothing more than SHA256 applied over and over on random positions of a large buffer, not really a new primitive.
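        Roughly, the shape of that construction in Go looks something like the sketch below. This is an illustration of the idea only, not Enchive’s actual code; the buffer size and round count are arbitrary made-up knobs.

            package main

            import (
            	"crypto/sha256"
            	"encoding/binary"
            	"fmt"
            )

            // deriveKey is a hypothetical sketch, not Enchive's real KDF: fill a large
            // buffer from the passphrase, then repeatedly hash pseudo-random positions
            // of that buffer with SHA-256, so every guess costs memory and time.
            func deriveKey(passphrase []byte, bufSize, rounds int) [32]byte {
            	buf := make([]byte, bufSize)

            	// Seed the buffer deterministically from the passphrase.
            	block := sha256.Sum256(passphrase)
            	for i := 0; i < bufSize; i += 32 {
            		block = sha256.Sum256(block[:])
            		copy(buf[i:], block[:])
            	}

            	// Hash a 32-byte window at a position chosen from the previous digest,
            	// then write the digest back into the buffer.
            	digest := sha256.Sum256(buf)
            	for i := 0; i < rounds; i++ {
            		offset := int(binary.LittleEndian.Uint64(digest[:8]) % uint64(bufSize-32))
            		digest = sha256.Sum256(append(digest[:], buf[offset:offset+32]...))
            		copy(buf[offset:], digest[:])
            	}
            	return sha256.Sum256(buf)
            }

            func main() {
            	key := deriveKey([]byte("correct horse battery staple"), 1<<20, 1<<16)
            	fmt.Printf("derived %d-byte key\n", len(key))
            }

        The point is just that every guess has to touch a large buffer many times over.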

                                                                                                                                                              2. 6

                                                                                                                                                                Keybase? Kinda?…

                                                                                                                                                                1. 4

                                                                                                                                                                  It always bothers me when I see the update say it needs over 80 megabytes for something doing crypto. Maybe no problems will show up that leak keys or cause a compromise. That’s a lot of binary, though. I wasn’t giving it my main keypair either. So, I still use GPG to encrypt/decrypt text or zip files I send over untrusted mediums. I use Keybase mostly for extra verification of other people and/or its chat feature.

                                                                                                                                                                2. 2

                                                                                                                                                                  Something based on nacl/libsodium, in a similar vein to signify, would be pretty nice. asignify does apparently use asymmetric encryption via cryptobox, but I believe it is also written/maintained by one person currently.
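        For a sense of scale, the asymmetric part on top of nacl really is tiny. A minimal Go sketch using golang.org/x/crypto/nacl/box (not signify’s or asignify’s actual code, just the crypto_box shape):

            package main

            import (
            	"crypto/rand"
            	"fmt"

            	"golang.org/x/crypto/nacl/box"
            )

            func main() {
            	// Each side has a Curve25519 keypair (a real tool would load these
            	// from key files, possibly derived from a passphrase).
            	senderPub, senderPriv, _ := box.GenerateKey(rand.Reader)
            	recipPub, recipPriv, _ := box.GenerateKey(rand.Reader)

            	// crypto_box needs a unique 24-byte nonce per message; we prepend it
            	// to the ciphertext so the result is self-contained.
            	var nonce [24]byte
            	rand.Read(nonce[:])
            	sealed := box.Seal(nonce[:], []byte("archive contents"), &nonce, recipPub, senderPriv)

            	// The recipient opens it with their secret key and the sender's public key.
            	plain, ok := box.Open(nil, sealed[24:], &nonce, senderPub, recipPriv)
            	fmt.Println(ok, string(plain))
            }

        The hard part of such a tool would be everything around this: key storage, file format, and passphrase handling.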

                                                                                                                                                                  1. 1

                                                                                                                                                                    https://github.com/stealth/opmsg is a possible alternative.

                                                                                                                                                                    Then there was Tedu’s reop experiment: https://www.tedunangst.com/flak/post/reop

                                                                                                                                                                  1. 11

        why do people have the need to use a framework for everything, like the BDD testing frameworks in this article? i really don’t see the value of it. it’s just another dependency to carry around, and i can’t just read and understand what is happening.

                                                                                                                                                                    what is gained by writing:

                                                                                                                                                                    Expect(resp.StatusCode).To(Equal(http.StatusOK))
                                                                                                                                                                    

                                                                                                                                                                    instead of

                                                                                                                                                                    if resp.StatusCode != http.StatusOK { 
                                                                                                                                                                        t.Fail() 
                                                                                                                                                                    }
                                                                                                                                                                    
                                                                                                                                                                    1. 11

                                                                                                                                                                      I don’t use that particular testing framework, but the thing I’d expect to gain by using it is better test failure messages. I use testify at work for almost precisely that reason. require.Equal(t, valueA, valueB) provides a lot of value, for example. I tried not to use any additional test helpers in the beginning, probably because we have similar sensibilities. But writing good tests that also have good messages when they fail got pretty old pretty fast.
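        To make that concrete, here’s roughly what it looks like with testify; the exact failure formatting varies, but you get both values and the call site rather than a bare failure:

            package handler_test

            import (
            	"net/http"
            	"testing"

            	"github.com/stretchr/testify/require"
            )

            func TestStatusOK(t *testing.T) {
            	// Hypothetical response; imagine it came back from the code under test.
            	resp := &http.Response{StatusCode: http.StatusInternalServerError}

            	// On failure this prints both values (roughly "expected: 200 / actual: 500")
            	// plus the line of the assertion, instead of a bare failure.
            	require.Equal(t, http.StatusOK, resp.StatusCode)
            }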

                                                                                                                                                                      1. 3

                                                                                                                                                                        ok, i can see that good messages may help, though i’d still rather use t.Fatal/Fatalf/Error/Errorf, maybe paired with a custom type implementing error (admitting that it’s a bit more to type) if a custom DSL is the alternative :)
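        e.g. something along these lines, spelling the message out by hand (stdlib only, with a made-up response value):

            package handler_test

            import (
            	"net/http"
            	"testing"
            )

            func TestStatusOK(t *testing.T) {
            	resp := &http.Response{StatusCode: http.StatusInternalServerError}

            	// Same check with only the stdlib: the price is writing the message yourself.
            	if resp.StatusCode != http.StatusOK {
            		t.Fatalf("unexpected status: got %d, want %d", resp.StatusCode, http.StatusOK)
            	}
            }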

                                                                                                                                                                        testify looks interesting though!

                                                                                                                                                                        1. 4

        testify is nice because it isn’t really a framework, unless maybe you start using its “suite” functionality, which is admittedly pretty lightweight. But the rest of the library drops right into the normal Go unit testing harness, which I like.

                                                                                                                                                                          I did try your methods for a while, but it was just untenable. I eventually just stopped writing good failure messages, which I just regretted later when trying to debug test failures. :-)

                                                                                                                                                                          testify is a nice middle ground that doesn’t force you to play by their rules, but adds a lot of nice conveniences.

                                                                                                                                                                      2. 6

                                                                                                                                                                        The former probably gives a much better failure message (e.g. something like “expected value ‘200’ but got value ‘500’”, rather than “assertion failed”).

                                                                                                                                                                        That’s obviously not inherent to the complicated testing DSL, though. In general, I’m a fan of more expressive assert statements that can give better indications of what went wrong; I’m not a big fan of heavyweight testing frameworks or assertion DSLs because, like you, I generally find they badly obfuscate what’s actually going on in the test code.

                                                                                                                                                                        1. 4

        yeah, with the caveats listed by others, I sort of think this is a particularly egregious example of strange library usage/design. In theory, anyone (read: not just engineers) is supposed to be able to write a BDD spec. However, for that to be possible, it should be written in natural language. Behat specs are a good example of this: http://behat.org/en/latest/. But this one is just a DSL, which misses the point I think…

                                                                                                                                                                          1. 3

                                                                                                                                                                            However, for that to be possible, it should be written in natural language. Behat specs are a good example of this: http://behat.org/en/latest/. But this one is just a DSL, which misses the point I think…

        I’d say that the thing behat does is a real DSL (like, with a parser and stuff). The library from the article just has fancy-named functions which are a bit of a black box to me.

                                                                                                                                                                            Just a thought: One could maybe write a compiler for a behat-like language which generates stdlib Go-tests, using type information found in the tested package, instead of using interface{} and reflect. That’d be a bit of work though ;)
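        To sketch the idea (hypothetical spec syntax and generated names; nothing like this exists as far as I know):

            // Hypothetical input spec, behat/Gherkin style:
            //
            //     Scenario: fetching the homepage
            //       When I GET "/"
            //       Then the response status should be 200
            //
            // A generator could turn it into a plain stdlib test with concrete types:
            package spec_test

            import (
            	"net/http"
            	"net/http/httptest"
            	"testing"
            )

            func TestFetchingTheHomepage(t *testing.T) {
            	// Stand-in server so the sketch is self-contained.
            	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            		w.WriteHeader(http.StatusOK)
            	}))
            	defer srv.Close()

            	// When I GET "/"
            	resp, err := http.Get(srv.URL + "/")
            	if err != nil {
            		t.Fatalf("GET /: %v", err)
            	}
            	defer resp.Body.Close()

            	// Then the response status should be 200
            	if resp.StatusCode != http.StatusOK {
            		t.Fatalf("status: got %d, want %d", resp.StatusCode, http.StatusOK)
            	}
            }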

                                                                                                                                                                        1. 5

        this full-throttle tinfoily panic mode of some people right now. “move to hosted gitlab!!1 that will show ‘em!!11”. i’m not anti-gitlab, but hosted gitlab has the same set of problems as github. like, for example, being bought by $EVILCOMPANY

                                                                                                                                                                          if microsoft now decides there will be no more free repos, it’s ok! they can do with their property however they please (just like before that acquisition github could’ve done). don’t bitch about the free lunch not tasting right. that is the deal if you use resources of others for free.

                                                                                                                                                                          1. 3

                                                                                                                                                                            I think for most people, if gitlab took a similar turn, a self-hosted (or pay someone else to host it) OSS version of GitLab would be fine.

                                                                                                                                                                            People use gitlab.com because it’s hands-off, not because it’s the commercial version for free.

                                                                                                                                                                            1. 3

                                                                                                                                                                              It’s not “that will show em” at all. No idea where that is being quoted from.
        I can say my statement was: IF the MS acquisition bothered you, and there is enough historical precedent that it may reasonably do so for reasonable people, then note that Gitlab does currently have 1-click repository migration from GitHub. In addition, it is also a possibility that Github may unilaterally sever that capability IF the migration becomes a flood. Ergo, if you are going to do it, then do so now and don’t wait.

                                                                                                                                                                              1. 1

                                                                                                                                                                                it was a purposely overstated made-up-quote (easily spotted by the liberal use of “!!111”).

        microsoft is an actor on the market and as a result does things to maximize profits. one only has to take that into account when choosing to use their services. i’m not overly happy with it either, but gitlab is an actor too and plays by the same rules, including the possibility of being acquired. just self-host, it’s not even hard, scaleway has prepared images for that, for example.

        regarding the importing functionality: if they break the mechanisms to do that, i guess many other things won’t work either, like bots acting on issues, etc. i don’t think they will break the whole ecosystem, as effectively that’s what they’ve paid for. maybe they’ll do that further down the road, like twitter breaking their api for clients.

                                                                                                                                                                              2. 2

        Imagine what would happen when MSFT, after buying GH, also gets travisCi, which i believe they will do :)

                                                                                                                                                                                1. 2

                                                                                                                                                                                  It should also be quite a bit cheaper, afaik they never took VC money.

                                                                                                                                                                              1. 12

                                                                                                                                                                                Output should be simple to parse and compose

                                                                                                                                                                                No JSON, please.

        Yes, every tool should have a custom format that needs a badly cobbled-together parser (in awk or whatever) that will break once the format is changed slightly or the output accidentally contains a space. No, jq doesn’t exist, can’t be fitted into Unix pipelines, and we will be stuck with sed and awk until the end of time, occasionally trying to solve the worst failures with find -print0 and xargs -0.

                                                                                                                                                                                1. 11

                                                                                                                                                                                  JSON replaces these problems with different ones. Different tools will use different constructs inside JSON (named lists, unnamed ones, different layouts and nesting strategies).

                                                                                                                                                                                  In a JSON shell tool world you will have to spend time parsing and re-arranging JSON data between tools; as well as constructing it manually as inputs. I think that would end up being just as hacky as the horrid stuff we do today (let’s not mention IFS and quoting abuse :D).


                                                                                                                                                                                  Sidestory: several months back I had a co-worker who wanted me to make some code that parsed his data stream and did something with it (I think it was plotting related IIRC).

                                                                                                                                                                                  Me: “Could I have these numbers in one-record-per-row plaintext format please?”

                                                                                                                                                                                  Co: “Can I send them to you in JSON instead?”

                                                                                                                                                                                  Me: “Sure. What will be the format inside the JSON?”

                                                                                                                                                                                  Co: “…. it’ll just be JSON.”

        Me: “But in what form? Will there be a list? Names of the elements inside it?”

                                                                                                                                                                                  Co: “…”

                                                                                                                                                                                  Me: “Can you write me an example JSON message and send it to me, that might be easier.”

                                                                                                                                                                                  Co: “Why do you need that, it’ll be in JSON?”

                                                                                                                                                                                  Grrr :P


                                                                                                                                                                                  Anyway, JSON is a format, but you still need a format inside this format. Element names, overall structures. Using JSON does not make every tool use the same format, that’s strictly impossible. One tool’s stage1.input-file is different to another tool’s output-file.[5].filename; especially if those tools are for different tasks.
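        A trivial illustration of that mismatch, with hypothetical shapes borrowed from those two names:

            package main

            import (
            	"encoding/json"
            	"fmt"
            )

            // Tool A might say {"stage1": {"input-file": "data.csv"}} ...
            type toolA struct {
            	Stage1 struct {
            		InputFile string `json:"input-file"`
            	} `json:"stage1"`
            }

            // ... while tool B wants {"output-file": [{"filename": "data.csv"}]}.
            type fileEntry struct {
            	Filename string `json:"filename"`
            }
            type toolB struct {
            	OutputFile []fileEntry `json:"output-file"`
            }

            func main() {
            	var a toolA
            	json.Unmarshal([]byte(`{"stage1": {"input-file": "data.csv"}}`), &a)

            	// Bridging them is still hand-written glue, it's just JSON-flavoured glue.
            	b := toolB{OutputFile: []fileEntry{{Filename: a.Stage1.InputFile}}}
            	out, _ := json.Marshal(b)
            	fmt.Println(string(out))
            }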

                                                                                                                                                                                  1. 3

                                                                                                                                                                                    I think that would end up being just as hacky as the horrid stuff we do today (let’s not mention IFS and quoting abuse :D).

        Except that standardized, popular formats like JSON come with tool ecosystems that solve most of the problems they can bring. Autogenerators, transformers, and so on come along with any such data format. We usually don’t get this when random people create formats for their own use. We have to fully customize the part handling the format rather than adapt an existing one.

                                                                                                                                                                                    1. 2

        Still, even XML, which had the best tooling I have used so far for a general-purpose format (XSLT and XSD first and foremost), was unable to handle partial results.

                                                                                                                                                                                      The issue is probably due to their history, as a representation of a complete document / data structure.

                                                                                                                                                                                      Even s-expressions (the simplest format of the family) have the same issue.

        Now we should also note that pipelines can be created on the fly, even from binary data manipulations. So a single dictated format would probably pose too many restrictions, if you want the system to actually enforce and validate it.

                                                                                                                                                                                      1. 2

                                                                                                                                                                                        “Still, even XML”

                                                                                                                                                                                        XML and its ecosystem were extremely complex. I used s-expressions with partial results in the past. You just have to structure the data to make it easy to get a piece at a time. I can’t recall the details right now. Another I used trying to balance efficiency, flexibility, and complexity was XDR. Too bad it didn’t get more attention.

                                                                                                                                                                                        “So a single dictated format would probably pose too restrictions, if you want the system to actually enforce and validate it.”

        The L4 family usually handles that by standardizing on an interface description language with all of it auto-generated. Works well enough for them. Camkes is an example.

                                                                                                                                                                                        1. 3

                                                                                                                                                                                          XML and its ecosystem were extremely complex.

                                                                                                                                                                                          It is coherent, powerful and flexible.

                                                                                                                                                                                          One might argue that it’s too flexible or too powerful, so that you can solve any of the problems it solves with simpler custom languages. And I would agree to a large extent.

        But, for example, XHTML was a perfect use case. Indeed, to do what I did back then with XSLT, people now use JavaScript, which is less coherent and way more powerful, and in no way simpler.

        The L4 family usually handles that by standardizing on an interface description language with all of it auto-generated.

                                                                                                                                                                                          Yes but they generate OS modules that are composed at build time.

                                                                                                                                                                                          Pipelines are integrated on the fly.

                                                                                                                                                                                          I really like strongly typed and standard formats but the tradeoff here is about composability.

                                                                                                                                                                                          UNIX turned every communication into byte streams.

        Bytes bite at times, but they are standard, after all! Their interpretation is not, but that’s what provides the flexibility.

                                                                                                                                                                                          1. 4

        Indeed, to do what I did back then with XSLT, people now use JavaScript, which is less coherent and way more powerful, and in no way simpler.

                                                                                                                                                                                            While I am definitely not a proponent of JavaScript, computations in XSLT are incredibly verbose and convoluted, mainly because XSLT for some reason needs to be XML and XML is just a poor syntax for actual programming.

        That, and the fact that my transformations worked fine with xsltproc but did just nothing in browsers, with no decent way to debug the problem, made me put XSLT away as an esolang — lots of fun for an afternoon, not what I would use to actually get things done.

                                                                                                                                                                                            That said, I’d take XML output from Unix tools and some kind of jq-like processor any day over manually parsing text out of byte streams.

                                                                                                                                                                                            1. 2

        I loved it back when I was doing HTML and wanted something more flexible that machines could handle. XHTML was my use case as well. Once I was a better programmer, I realized it was probably an overkill standard that could’ve been something simpler with a series of tools each doing their little job. Maybe even different formats for different kinds of things. W3C ended up creating a bunch of those anyway.

                                                                                                                                                                                              “Pipelines are integrated on the fly.”

        Maybe put it in the OS like a JIT. As far as byte streams go, that’s mostly what XDR did. They were just minimally structured byte streams. Just tie the data types, layouts, and so on to whatever language the OS or platform uses the most.

                                                                                                                                                                                      2. 3

                                                                                                                                                                                        JSON replaces these problems with different ones. Different tools will use different constructs inside JSON (named lists, unnamed ones, different layouts and nesting strategies).

        This is true, but it does not mean that having some kind of common interchange format does not improve things. So yes, it does not tell you what the data will contain (but “custom text format, possibly tab separated” is, again, not better). I know the problem, since I often work with JSON that contains or misses things. But the answer is not to avoid JSON but rather to have specifications. JSON has a number of possible schema formats, which puts it at a big advantage over most custom formats.

                                                                                                                                                                                        The other alternative is of course something like ProtoBuf, because it forces the use of proto files, which is at least some kind of specification. That throws away the human readability, which I didn’t want to suggest to a Unix crowd.

                                                                                                                                                                                        Thinking about it, an established binary interchange format with schemas and a transport is in some ways reminiscent of COM & CORBA in the nineties.

                                                                                                                                                                                      3. 7

        will break once the format is changed slightly

        Doesn’t this happen with JSON too?
        A slight change in the key names, or turning a string into a list of strings, and the recipient won’t be able to handle the input anyway.

                                                                                                                                                                                        the output accidentally contains a space.

        Or the output accidentally contains a comma: depending on the parser, the behaviour will change.

                                                                                                                                                                                        No, jq doesn’t exis…

                                                                                                                                                                                        Jq is great, but I would not say JSON should be the default output when you want composable programs.

        For example, the JSON root is always a whole object, and this won’t work for streams that get produced slowly.

                                                                                                                                                                                        1. 5

        will break once the format is changed slightly

        Doesn’t this happen with JSON too?

        Using a whitespace-separated table such as the one suggested in the article is somewhat vulnerable to continuing to appear to work after the format has changed while actually misinterpreting the data (e.g. if you inserted a new column at the beginning, your pipeline could happily continue, since all it needs is at least two columns with numbers in them). JSON is more likely to either continue working correctly and ignore the new column, or fail with an error. Arguably it is the key-value aspect that’s helpful here, not specifically JSON. As you point out, there are other issues with using JSON in a pipeline.

                                                                                                                                                                                        2. 3

                                                                                                                                                                                          On the other hand, most Unix tools use tabular format or key value format. I do agree though that the lack of guidelines makes it annoying to compose.

                                                                                                                                                                                          1. 2

                                                                                                                                                                                            Hands up everybody that has to write parsers for zpool status and its load-bearing whitespaces to do ZFS health monitoring.

                                                                                                                                                                                            1. 2

                                                                                                                                                                                              In my day-to-day work, there are times when I wish some tools would produce JSON and other times when I wish a JSON output was just textual (as recommended in the article). Ideally, tools should be able to produce different kinds of outputs, and I find libxo (mentioned by @apy) very interesting.

                                                                                                                                                                                              1. 2

        I spent very little time thinking about this after reading your comment and wonder how, for example, the coreutils would look if they accepted/returned JSON as well as plain text.

        A priori we have this awful problem of making everyone understand everyone else’s input and output schemas, but that might not be necessary. For any tool that expects a file as input, we make it accept any JSON object that contains the key-value pair "file": "something". For tools that expect multiple files, have them take an array of such objects. Tools that return files, like ls for example, can then return whatever they want in their JSON objects, as long as those objects contain "file": "something". Then we should get to keep chaining pipes of stuff together without having to write ungodly amounts of jq between them.
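        A rough sketch of what the consuming side could look like under that made-up convention (just illustrating the idea, nothing standardized):

            package main

            import (
            	"encoding/json"
            	"fmt"
            	"io"
            	"os"
            )

            func main() {
            	// Read JSON objects from stdin one at a time; a single object or a
            	// stream of concatenated objects both work with json.Decoder.
            	dec := json.NewDecoder(os.Stdin)
            	for {
            		var obj map[string]json.RawMessage
            		if err := dec.Decode(&obj); err == io.EOF {
            			break
            		} else if err != nil {
            			fmt.Fprintln(os.Stderr, "bad input:", err)
            			os.Exit(1)
            		}

            		// The only thing this tool cares about is the agreed-upon "file"
            		// key; everything else in the object is somebody else's business.
            		var file string
            		if raw, ok := obj["file"]; ok && json.Unmarshal(raw, &file) == nil {
            			fmt.Println("would operate on:", file)
            		}
            	}
            }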

                                                                                                                                                                                                I have no idea how much people have tried doing this or anything similar. Is there prior art?

                                                                                                                                                                                                1. 9

                                                                                                                                                                                                  In FreeBSD we have libxo which a lot of the CLI programs are getting support for. This lets the program print its output and it can be translated to JSON, HTML, or other output forms automatically. So that would allow people to experiment with various formats (although it doesn’t handle reading in the output).

                                                                                                                                                                                                  But as @Shamar points out, one problem with JSON is that you need to parse the whole thing before you can do much with it. One can hack around it but then they are kind of abusing JSON.
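        The usual workaround is to lean on a streaming decoder over one big array of records, which is pretty much the kind of abuse I mean. A Go sketch (input shape and field names here are made up):

            package main

            import (
            	"encoding/json"
            	"fmt"
            	"os"
            )

            func main() {
            	// Assumes input shaped like: [{"lines":238,"filename":"dmesg.log"}, ...]
            	dec := json.NewDecoder(os.Stdin)

            	// Consume the opening '[' of the possibly enormous array.
            	if _, err := dec.Token(); err != nil {
            		fmt.Fprintln(os.Stderr, err)
            		os.Exit(1)
            	}

            	// Decode one element at a time, so output can start before input ends.
            	for dec.More() {
            		var rec struct {
            			Lines    int    `json:"lines"`
            			Filename string `json:"filename"`
            		}
            		if err := dec.Decode(&rec); err != nil {
            			fmt.Fprintln(os.Stderr, err)
            			os.Exit(1)
            		}
            		fmt.Printf("%8d %s\n", rec.Lines, rec.Filename)
            	}
            }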

                                                                                                                                                                                                  1. 2

                                                                                                                                                                                                    That looks like a fantastic tool, thanks for writing about it. Is there a concerted effort in FreeBSD (or other communities) to use libxo more?

                                                                                                                                                                                                    1. 1

                                                                                                                                                                                                      FreeBSD definitely has a concerted effort to use it, I’m not sure about elsewhere. For a simple example, you can check out wc:

                                                                                                                                                                                                      apy@bsdell ~> wc -l --libxo=dtrt dmesg.log
                                                                                                                                                                                                           238 dmesg.log
                                                                                                                                                                                                      apy@bsdell ~> wc -l --libxo=json dmesg.log
                                                                                                                                                                                                      {"wc": {"file": [{"lines":238,"filename":"dmesg.log"}]}
                                                                                                                                                                                                      }
                                                                                                                                                                                                      
                                                                                                                                                                                                2. 1

        powershell uses objects for its pipelines, i think it even runs on linux nowadays.

                                                                                                                                                                                                  i like json, but for shell pipelining it’s not ideal:

        • the unstructured nature of the classic output is a core feature. you can easily mangle it in ways the program’s author never assumed, and that makes it powerful.

                                                                                                                                                                                                  • with line based records you can parse incomplete (as in the process is not finished) data more easily. you just have to split after a newline. with json, technically you can’t begin using the data until a (sub)object is completely parsed. using half-parsed objects seems not so wise.

        • if you output json, you probably have to keep the structure of the object tree which you generated in memory, like “currently i’m in a list in an object in a list”. that’s not ideal sometimes (one doesn’t have to use real serialization all the time, but it’s nicer than just printing the correct tokens at the right places).

        • json is “javascript object notation”. not everything is ideally represented as an object. that’s why relational databases are still in use.

                                                                                                                                                                                                  edit: be nicer ;)

                                                                                                                                                                                                1. 15

                                                                                                                                                                                                  world-open QA-less package ecosystems (NPM, go get)

                                                                                                                                                                                                  This is one I’m increasingly grumpy about. I wish more ecosystems would establish a gold set of packages that have complete test coverage, complete API documentation, and proper semantic versioning.

                                                                                                                                                                                                  1. 4

                                                                                                                                                                                                    world-open QA-less package ecosystems (NPM, go get)

        i’d argue that go get is no package ecosystem. it’s just a (historic) convenience tool, which was good enough for the initial use (inside an organization). furthermore, i like the approach better than the centralized language package systems. nobody checks all the packages in pypi or rubygems. using a known-good git repo isn’t worse, maybe it’s even better, as there is not another link in the chain which could break, since the original repository is used instead of a somehow-packaged copy.

                                                                                                                                                                                                    I wish more ecosystems would establish a gold set of packages that have complete test coverage, complete API documentation, and proper semantic versioning.

        python has had batteries included for ages, go’s standard library isn’t bad either. both are well-tested and have good documentation. in my opinion the problem is that often another 3rd-party dependency gets quickly pulled in, instead of giving a second thought to whether it is really required or could be done oneself, which may spare one trouble in the future (e.g. left-pad).

                                                                                                                                                                                                    in some cases there is even a bit of quality control for non standard packages: some database drivers for go are externally tested: https://github.com/golang/go/wiki/SQLDrivers

                                                                                                                                                                                                    1. 2

                                                                                                                                                                                                      Then you get the curation (and censorship) of Google Play or Apple’s Store.

                                                                                                                                                                                                      Maybe you want more of the Linux package repo model where you have the official repo (Debian, RedHat, Gentoo Portage), some optional non-oss or slightly less official repos (Fedora EPEL) and then always having the option to add 3rd party vendor repos with their own signing keys (PPA, opensuse build service, Gentoo Portage overlays).

                                                                                                                                                                                                      I really wish Google Play had the option of adding other package trees. I feel like Apple and Google took a great concept and totally fucked it up. Ubuntu Snap is going in the same (wrong) direction.

                                                                                                                                                                                                      1. 2

                                                                                                                                                                                                        On Android it’s certainly possible to install F-Droid, and get access to an alternate package management ecosystem. I think I had to sideload the F-Droid APK to get it to work though, which not every user would know how to do easily (I just checked, it doesn’t seem to be available in the play store).

                                                                                                                                                                                                    1. 1

                                                                                                                                                                                                      There is a typo in the title (decontructing -> deconstructing).

                                                                                                                                                                                                      1. 2

                                                                                                                                                                                                        it’s in the article too, so i’d keep it that way here