1. 3

    Nearly all the data comes from the GNU/Linux Distribution Timeline, which sadly has not been kept up to date.

    1. 3

      As an ex-compiler writer I have plenty of sympathy for those trying to make a living selling compilers, but times change.

      The problem used to be competition from many small compiler companies; then the amount of memory available on developer machines exploded, removing the critical factor that had been holding back open source compilers.

      1. 1

        I’ve updated the title to include the date, since it matters in the context of the paper.

      1. 7

        An extreme example of unportable C is the book Mastering C Pointers: Tools for Programming Power, which was castigated recently. To be fair, that book has flaws beyond merely being in a different camp, and I think those fuel some of the intensity of the passion against it.

        This… rather grossly undersells how much is wrong with that book. The author didn’t understand scope, for crying out loud, and never had a grasp of how C organized memory, even in the high-level handwavy “C Abstract Machine” sense the standard is written to.

        There are better examples of unportable C, such as pretty much any non-trivial C program written for MS-DOS, especially the ones which did things like manually writing to video memory to get the best graphics performance. Of course, pretty much all embedded C would fit here as well, but you’ll actually be able to get and read the source of some of those MS-DOS programs.

        In so doing, the committee had to converge on a computational model that would somehow encompass all targets. This turned out to be quite difficult, because there were a lot of targets out there that would be considered strange and exotic; arithmetic is not even guaranteed to be two’s complement (the alternative is one’s complement), word sizes might not be a power of 2, and so on.

        Another example would be saturation semantics for overflow, as opposed to wraparound. DSPs use saturation semantics, so going off the top end of the scale plateaus, instead of causing a weird jagged waveform.
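        The difference between wraparound and saturation can be sketched in a few lines of C. This is just an illustration (sat_add16 is a made-up name; DSPs usually get this behavior from a single instruction): widen so the sum cannot overflow, then clip at the rails.

        ```c
        #include <stdint.h>

        /* Saturating 16-bit add: widen so the sum is exact, then clip. */
        static int16_t sat_add16(int16_t a, int16_t b)
        {
            int32_t sum = (int32_t)a + (int32_t)b;  /* exact in 32 bits */
            if (sum > INT16_MAX) return INT16_MAX;  /* plateau at the top rail */
            if (sum < INT16_MIN) return INT16_MIN;  /* plateau at the bottom rail */
            return (int16_t)sum;
        }
        ```

        With wraparound, 30000 + 10000 stuffed into an int16_t comes out negative (the jagged waveform); with saturation it plateaus at 32767.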

        As for the rest, it’s a hard problem. Selectively turning off optimization for specific functions would be useful for some codebases, but aggressive optimization isn’t the only problem here: Optimization doesn’t cause your long type to suddenly be the wrong size to hold a pointer on some machines but not others. Annotating the code with machine-checked assumptions about type size, overflow behavior, and maybe other things would allow intelligent warnings about stupid code, but… well… try to get anyone to do it.
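        For what it’s worth, C11 already gives one cheap way to machine-check the pointer-size assumption mentioned above. A sketch (the as_integer helper name is my own illustration):

        ```c
        #include <assert.h>   /* static_assert (C11) */
        #include <stdint.h>

        /* The assumption becomes machine-checked: the build fails on
           targets where long cannot hold a pointer (e.g. 64-bit Windows,
           where long is 32 bits). */
        static_assert(sizeof(long) >= sizeof(void *),
                      "this code assumes long can hold a pointer");

        /* The portable spelling sidesteps the assumption entirely. */
        static uintptr_t as_integer(void *p)
        {
            return (uintptr_t)p;
        }
        ```

        Overflow and type-width assumptions could be annotated the same way, but as noted, getting anyone to actually write the annotations is the hard part.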

        1. 6

          Re “Mastering C Pointers,” that’s fair. I included it because it’s one of the things that got me thinking about the unportable camp, but I can see how its (agreed, very serious) flaws might detract from the overall argument I’m making and that there might be a better example.

          Re saturating arithmetic, well, Rust has it :)

          1. 2

            My interpretation is that the point of C is that simple C code should lead to simple assembly code. Needing to write SaturatedArithmetic::addWithSaturation(a, b) instead of just a + b in all arithmetic DSP code would be quite annoying, and would simply lead to people using another language.

            You could say ‘oh they should add operator overloading’, but then that contravenes the first point, that simple C code (like a + b) should not hide complex behaviour. The only construct in C that can hide complexity is the function call, which everyone recognises. But if you see some arithmetic, you know it’s just arithmetic.

            1. 1

              You could say ‘oh they should add operator overloading’, but then that contravenes the first point, that simple C code (like a + b) should not hide complex behavior. The only construct in C that can hide complexity is the function call, which everyone recognizes. But if you see some arithmetic, you know it’s just arithmetic.

              Not to mention that not everything can be overloaded, which causes inconsistencies, and some operations in mathematics have operators other than just “+-/*”: the vector dot product “·”, for example. Even if C++ (or any other language) were extended to support more operators, those operators can’t be typed without key composition (“shortcuts”), making them almost unusable. vec_dot() might require more typing, but it’s reachable by everyone, and operators don’t need to have hidden meanings.
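              As a minimal C sketch of the point (this exact signature is my own illustration, not from any library):

              ```c
              #include <stddef.h>

              /* Dot product spelled as an ordinary function call: more typing
                 than an overloaded “·”, but typeable on any keyboard, and it
                 hides nothing a reader would not expect a call to hide. */
              static double vec_dot(const double *a, const double *b, size_t n)
              {
                  double sum = 0.0;
                  for (size_t i = 0; i < n; i++)
                      sum += a[i] * b[i];
                  return sum;
              }
              ```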

              1. 2

                Eh, perl6 seems to do just fine with 60 bajillion operators.

                1. 2

                  Perl does have more operators than C, but all of them are operators that can be typed using simple key composition, such as [SHIFT+something]. String concatenation for example.

                  My point, together with what @milesrout said, is that some operators (math operators) aren’t easy to type with just [SHIFT+something]. As a result, operator overloading in languages that offer it will always stay in an unfinished state, because it will only ever cover those operators that are easily composed.

          2. 1

            Mastering C Pointers: Tools for Programming Power has several four-star reviews on Amazon UK.

            Herbert Schildt’s C: The Complete Reference is often touted as the worst C book ever.

            Perhaps Mastering C Pointers is the worst in its niche (i.e., pointers) and Schildt’s is a more general worst?

            1. 2

              Mastering C Pointers: Tools for Programming Power has several four-star reviews on Amazon UK.

              So? One of the dangers of picking the wrong textbook is thinking it’s great, and using it to evaluate subsequent works in the field, without knowing it’s shit. By hypothesis, if it’s your first book, you don’t know enough to question it, and if you think it’s teaching you things, those are the things you’ll go on to know, even if they’re the wrong things. It’s a very pernicious bootstrap problem.

              In this case, the book is objectively terrible. Other books being bad doesn’t make it better.

              I do agree that Schildt’s book is also terrible.

          1. 4

            It’s a terrible source. As the wiki says: “The C rules and recommendations in this wiki are a work in progress and reflect the current thinking of the secure coding community. Because this is a development website, many pages are incomplete or contain errors.”

            My review of the 2008 edition of the published guidelines pointed out that they were full of errors and omissions. So now they have started a wiki for people to fix their problems.

            1. 8

              Amusing that this type of person goes unnoticed when many of us tell people what we do. I fall into this category, as I was describing in another thread. What she misses is that many of us aren’t rich, nor is the work sustainable: many people in the lower to middle classes sacrificing money or status to do deep research and development in a field that interests them and/or they find necessary. They might also not like the priorities of commercial or government funding groups.

              For instance, I thought figuring out how to make computers that don’t fail or get hacked was a thing we desperately needed. I believed both livelihoods and lives were at stake. That we had access to them was a social good that neither the markets nor FOSS were really serving. It was also an interesting, deep, rabbit hole of a problem crossing many sub-fields of IT, economics, and psychology. That she misses people without money doing it altruistically surprises me all the more given that she wrote the report on FOSS developers working with little to no money or contributions on critical stuff that mattered to them. Same kind of thing, I think, with different work output.

              Still a good write-up that will draw attention to the concept. We might get more people doing it or publishing what they’re doing. I think most of us don’t publish enough. We should accept some of the troubles of that since the ideas get out there more. I also like this quote focusing on the obsessive nature of deep, independent research:

              “I understand, then, why researchers flock to the safety of institutions. Imagine studying something that nobody else is studying, for reasons you can’t really articulate, without knowing what the outcome of your work will be. For the truly obsessed person, the need for validation isn’t about ego; it’s about sanity. You want to know there’s some meaning behind the dizzying mental labyrinth that you simultaneously can’t escape and also never want to leave.”

              1. 3

                I’ve been kicking around the idea of creating a community for independent researchers. At first I thought it’d be mostly PL oriented but I’m starting to think that broadening the reach is better for both emotional support and cross-pollination of ideas. After all, it’s not like the world is teeming with independent researchers, right?

                Would you (and anyone else!) be interested in this?

                1. 3

                  Such a community could be great.

                  There are people doing research for their own private interest who are not setting out to discover anything that is not already known (but might do so accidentally). I would put people who invent new programming languages in this category.

                  There are people doing research to discover stuff that is not already known (or at least nothing seems to have been published anywhere).

                  My interest is in discovering stuff that is not yet known.

                  1. 1

                    It’s an interesting idea. I probably wouldn’t join one right now given I’m too overloaded. Maybe later on.

                    However, it reminds me of another idea I had for CompSci where there would be a similar site having researchers at lots of universities (or independent) in forums where they could talk about stuff. Also, the non-paywalled papers would be available on it. Any new people at conferences that seemed bright would be invited. My idea was to break the silos that are hiding good ideas to facilitate cross-pollination among institutions and sub-fields.

                  2. 3

                    What she misses is that many of us aren’t rich, nor is the work sustainable: many people in the lower to middle classes sacrificing money or status to do deep research and development in a field that interests them and/or they find necessary. They might also not like the priorities of commercial or government funding groups.

                    thank you for this. as someone who gave up the salaried lifestyle to pursue open source contribution, research, and my local community, it is refreshing to hear. even among very close peers and friends, there is a huge misconception that anyone at the upper end of their technical field is comfortably making ends meet, but in reality we often live a lifestyle that more closely resembles a starving, sleep-deprived graduate research student.

                  1. 4

                    Research in software engineering does not need lots of hardware; the funding needed to be a gentleman scientist in this field is money to live on and basic computing equipment.

                    The startup cost is high; after all, it is necessary to be reasonably expert in the field. But again, this is personal time spent reading a lot and gaining practical experience.

                    1. 2

                      Even lower barrier than that: many bright folks that were hackers or makers that I met in rural areas were on food stamps or living with someone unemployed without a car. They could usually get a Wifi-enabled phone or old laptop that they could use at nearby McDonalds or something. Many use data plans, too, just cuz cell service is a necessity to them. At one point, I went lower having no PC, phone, or car. Designed on paper or dirt depending on where we were at.

                      The reading and practicing like you said is what gave us skill. I think peer review and support is just as important, though. I had lots of it once I got online. There’s quite a few people out there probably stuck on some subjects, reinventing wheels, or chasing dead ends just because they can’t talk to experienced people.

                    1. 4

                      It feels like we are through the valley of releases that needed to restore Python 2 compatibility. Good to see releases with a few innovations that will make my life easier.

                      1. 2

                        Some interesting data on Python 2/3 usage and transition (or not):

                        1. 4

                          I admit I didn’t read through it but only skimmed over the pages. I think they measure open source libraries mostly, and I kind of expect those to maintain compatibility for a while. Some still do for Python 2.6. So while their data is valid, it doesn’t say much about actual Python applications / services.

                          My subjective take on Python 3 migrations

                          • Migration didn’t really happen before Python 3.4. With 3.0 and 3.2 there wasn’t really a “win”, and linux distributions still had Python 2.7 as their default.
                          • In the beginning of 2017 I spent a few days getting the CI infrastructure at my then employer (~100 Python devs) into shape. While the docker builds were busy, I spent some time applying python-modernize to a few of the company’s shared libraries, then fixing a few remaining issues by hand. It wasn’t much trouble.
                          • In 2018 I see more and more Python projects starting off on Python 3.x, whereas in 2015 developers would have chosen 2.7 if in doubt.

                          Oh, and I am still furious about them renaming .iteritems() to .items() instead of at least leaving it as an alias.

                      1. 8

                        I have always thought that Herbert Schildt’s C books contained more mistakes than any other C books out there.

                        Clive Feather’s ‘famous’ review, and a review of a later book by seebs (of obfuscated C fame): C: The Complete Nonsense

                        1. 3

                          Indeed, Schildt’s C books are quite widely maligned. Unfortunately I read them when I was at school, keen to learn C - I didn’t know any better at the time (this was in the days when the web had only just been invented and hadn’t left CERN yet). Herb Schildt, Ray Duncan, Charles Petzold (Windows logo tattoo and all) and Michael Abrash were the programming heroes of my youth.

                        1. -5

                          It is only a disaster if your business relies on making use of other people’s work, in which they own the copyright.

                          Not everybody can afford to create stuff and give it away for free, and there are plenty of people who want to earn money from their creative work.

                          Those who have made a living from stealing other people’s material are up in arms that their free lunch is not going to be free anymore.

                          1. 17

                            Or you run any kind of site where users can input anything that another visitor can see. Not just video and file sharing sites; Lobsters users could paste copyrighted content into a comment/PM and I’d be liable for not having a system implementing some kind of copyright controls.

                            (To say nothing of Article 11 wanting us to start paying the news sites we link to for the privilege of sending them traffic.)

                            1. -2

                              If somebody posted something here that I owned the copyright to, and I asked Lobsters admin to remove the material, then I imagine they would. If somebody kept posting this material they could be banned.

                              Or are you saying that the Lobsters’ site should be a place where anybody can post copyright material, without any recourse by the copyright holder?

                              1. 13

                                The new law changes this standard safe harbor behavior. Lobsters (me) is presumptively at fault for copyright infringement for not proactively checking for possibly-copyrighted material before posting. So yes, your scenario is the current, reasonable law and accurately describes why everyone is concerned about this change.

                                1. -2

                                  Lots of FUD being generated by those who will lose out. Copyright holders are not making much noise about the fact that they will probably make some money (or rather, lose less).

                                  Some good points about what is going on.

                                2. 4

                                  The law isn’t about that, though. The new law doesn’t say admins must take down on request (that’s already the case under existing law) but rather that they must have an AI system that prevents any infringing uploads from happening in the first place.

                                  The link tax is a much bigger problem, especially for Lobsters, but both articles are very bad.

                                  1. 1

                                    AI system that prevents any infringing uploads from happening in the first place.

                                    How is that any different from what @pushcx said? As the owner/operator of lobste.rs he would have to abide by this law and produce, or buy access to, some sort of copyrighted-work database in order to test all content created on Lobsters against it.

                                    That’s not going to make it easy for startups. That’s not going to make it easy for privately owned, independent side projects. That’s just going to hurt.

                                    1. 2

                                      ALSO, you’d better not quote any part of my message if you reply, because I could, apparently, legitimately sue lobsters for not enforcing my copyright. e.g. there’s no such thing as fair use anymore.

                                      (yes, that’s a stretch, but that seems to be the basic threat)

                                      1. 1

                                        I replied before @pushcx and yes, it seems we agree on how bad it is :)

                                        1. 2

                                          Blargh! I am sorry. I misread the thread and thought you were replying to pushcx.

                                3. 6

                                  Or Lobsters gets a fine when you submit a link to any European news site.

                                  1. 1

                                    What’s worse is that people will devise a way to signal what content is linkable freely and what only with a license. This will limit quality news dissemination and strengthen the position of fake news. This will help to kill the EU. Sad, right?

                                  2. 1

                                    most probably lobsters will not be able to post most of the links

                                  1. 2

                                    Ok, the summer project has just started and he has not read the standard yet.

                                    “undefined behavior: behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements”

                                    Let’s hope he reads the C standard before implementing any checks. There is a freely available PDF that is not the actual C standard you buy in the shops (which has a fancy ISO cover on it).

                                    1. 2

                                      “A minimalist knowledge approach to software engineering is cost effective because most code does not exist long enough to make it worthwhile investing in reducing future maintenance costs. Yes, it is more expensive for those that survive to become commonly used, but think of all the savings from not investing in those that did not survive.”

                                      This is something that I’m probably going to have to think more on. @derek-jones might even have data to support it in his collection. My data, though, indicated that most real-world projects from the 1960’s on up to present times run into problems late in the lifecycle that they have to fix. Those fixes usually cost more in money or reputation. Some groups spent a small, upfront investment preventing most problems like that. They claim it usually paid off in various ways. This was especially true if software was long-lasting. There were also times when the quality investment cost more overall than a thrown-together project would have.

                                      Another issue is that pervasively-buggy software conditioned users to expect that it’s normal. This reduces demand or competitive advantage of high-quality, mass-market software. Many firms, esp small or startups, can profitably supply buggy software so long as it meets a need and they fix the bugs. In the enterprise market, you can even sell software that barely works or doesn’t at all so long as it appears to meet a need making someone in the company look good. So, this needs to be factored into the decision of whether to engineer software vs throw it together.

                                      I still say lean toward well-documented, easy-to-change software just in case you get stuck with it. You can also charge more in many markets with better rep. Use the amount of QA practices that the market will pay for. If they pay nothing, use stuff that costs about nothing like interface checks, usage-based testing, and fuzzing. If they’ll pay significant amount, add more design/code review, analysis/testing, slices of specialist talent (eg UI or security), improvements to dependencies, and so on.

                                      1. 4

                                        On cost/benefit for applications, there is also a less rigorous analysis.

                                        Code containing a fault is more likely to be modified (removing the fault as a side-effect) than have the fault reported (of course it may be experienced and not reported); see Figure 10.75.

                                        Other kinds of related data currently being analysed.

                                        Microsoft/Intel is responsible for conditioning users to treat buggy software as normal. When companies paid lots of money for their hardware, they expected the software to work. Success with mass market software meant getting good enough software out the door, or being wiped out by the competition.

                                        1. 2

                                          I think IBM’s mainframe data might not fit your argument. IBM kept coming up with things like Fagan Inspections, Cleanroom, formal specs, and safe languages. They often experimented on mainframe software. A good deal of it was written in high-level languages like PL/I and PL/S that prevent many problems a shop using C might have. They have lifecycles that include design and review steps. In other words, IBM was regularly doing upfront investments to reduce maintenance costs down the line. The investments varied depending on which component we’re talking about. The fact they were trying stuff like that should disqualify them, though, as a baseline. A much better example would be Microsoft doing fast-moving, high-feature development in C or C++ before and after introducing SDL and other reliability tools. It made a huge difference.

                                          Other issues are backward compatibility and lock-in. The old behavior had to be preserved as new developments happened. The two companies also made the data formats and protocols closed, complicated ones to make moves difficult. The result is that both IBM and Microsoft eventually developed a customer base that couldn’t move. Their development practices on the maintenance side probably factor this in. So, we might need multiple baselines, with some allowing lock-in and some being companies that can lose customers at any time. I expect upfront vs fix-or-change-later decisions to be more interesting in the latter.

                                          1. 2

                                            The data is on customer application usage that ran on IBM mainframes (or at least plug compatibles).

                                            1. 1

                                              Oh Ok. So, mainframe apps rather than mainframe systems themselves. That would be fine.

                                      1. 20

                                        Having got one business prediction right, I will stick my neck out and make another one (plus the obvious one that the world will not end because of this purchase).

                                        1. -3

                                          Since we are doing wild predictions … here is one I’ll stick a neck out on … You’re probably young, early in your career, say <5 years of experience in the industry. Certainly not in the industry since the 90’s or early 2000s. It’s fine, there is nothing I can say on this topic that will change anything whatsoever. But save this thread. Queue it up for the day after your 30th birthday, or say in 7 years. You’ll be amused by the difference between what the you of today believes and what the you of tomorrow will have learned in this industry, and specifically about this purchase.

                                          1. 22

                                            Checking that one is too easy after he linked to a blog with posts dating back 10 years. And checking out posts from 2008… there it says he had not taught programming for 25 years.

                                            1. 19

                                              IIRC @derek-jones was on the C99 standardization committee.

                                              1. 3

                                                The beauty of predictions is their capability of being wrong. I was wrong, surprised to be so, but wrong.
                                                However, another prediction is still undecided: the impact of MS buying GitHub and how they will manipulate their influence over it, compared to the counterfactual. I’m seriously not a tin foil hat kinda guy, but MS is just never a good thing when they step into any area, whether the Internet, browsers, software development, OSs, you name it. It is always a net negative (not from a business standpoint of course, but in terms of the overall “good” of that respective area). Far more harm than good will result.

                                                1. 5

                                                  I still don’t get your reply to Derek. He never claimed that MS purchases are good for the community. In fact, he is predicting an EDG buyout solely because he thinks it will allow for vendor lock-in.

                                                  1. 4

                                                    I believe Derek triggered him with the line

                                                    (plus the obvious one that the world will not end because of this purchase)

                                                    Where he probably refers to his experience from earlier (pre-git, pre-internet) times and how there will be other ways for open source and development (back to mailing lists, GitLab, …).

                                                    But Grey, when hearing about GitHub not being changed too much (as Derek also stated in his posting: “sluggish integration”, but also “more data friendly”), remembered the history of Microsoft (they were anti-open-source and are working a lot on changing their image). GitHub being an “open-source community” is therefore in danger of getting swallowed by this “anti-open-source business”.

                                                    And I can understand getting emotional about such things. And emotion kills rationality. Which probably led to this misunderstanding.

                                            1. -2

                                              That domain name is the worst thing ever, so many hyphens

                                              1. 9

                                                No, that’s this one.

                                            1. 2

                                              I’d offer to buy the companies or groups controlling most of the mining power. It’s usually an oligopoly like in the for-profit, non-crypto markets. Even possible to pay off individual executives to cut deals to lower the price. The resulting buy would probably be way, way, way less than $100 billion for Bitcoin. Hell, it might be less than a billion. That’s owning it, too, rather than a bribe to sabotage it in some way that looks like an accident. That could be mere millions.

                                              1. 2

                                                Mining power is actually quite decentralized, because for various reasons cheap electricity is decentralized. Mining pools are an oligopoly, but mining pools don’t own mining power. Miners can and will leave pools if something happens.

                                                1. 2

                                                  No, mining power is very centralized.

                                                  https://lobste.rs/s/yawv5u/state_cryptocurrency_mining

                                                  1. 1

                                                    Your link agrees with me. To quote:

                                                    Mining farms are perhaps the one area where manufacturers and economies of scale are not dominant. Good electricity deals tend to come in smaller packages, tend to be distributed around the world, and tend to be difficult to find and each involve unique circumstances.

                                                    1. 1

                                                      The link agrees with one point, good electricity deals. But economies of scale apply to the building of devices that use less electricity.

                                                  2. 1

                                                    I appreciate the clarification.

                                                1. 1

                                                  Cutting free from the ball and chain makes it easier for committee members to add more complexity.

                                                  1. 2

                                                    There are PoW-less cryptocurrencies being developed. If these gain traction and turn out to be secure, then we can leave behind the first generation of cryptos based on PoW.

                                                    1. 2

                                                      Would you provide some more detail here/a URL perhaps?

                                                      1. 2

                                                        These are the two I know of, which also seem to have serious teams backing them up.

                                                        Nano: https://nano.org/en Iota: https://www.iota.org/

                                                        They’re based on new architectures that make it possible to dispense with the concept of a miner, providing the services miners gave in different manners. For example, Nano uses proof of stake. When there are two conflicting transactions, i.e. a double spend, the network votes to resolve the conflict, and each node has a voting stake proportional to the amount of currency it holds, or the number of accounts that delegate their vote to that particular node. Thus conflicts are resolved through voting. Iota uses a DAG architecture where the cost of making a transaction is doing PoW in the form of “confirmations”. The transactions that are more robust are those with the larger number of confirmations. Both currencies have a set supply, so no new coins will ever be produced; this means that all the coins that will exist were generated in the first block.

                                                        1. 1

                                                          The problem with proof of stake is that once an entity has 51% of the stake, they own the currency forever. With proof of work, it is a continual effort to own 51% (this is covered in the linked-to article).

                                                          From a quick look at IOTA (not knowing anything about it), it does not involve a blockchain, and it’s not on Wikipedia.

                                                          1. 1

                                                            I see both PoW and PoS as protocols that depend on a rationality assumption: those that hold the power will act rationally, and thus will want to preserve the value of their investment and, as a consequence, protect the network. Without the rationality assumption, we could have the top N miners combining their hashing power to destroy the network. What stops them from doing this?

                                                            Whether IOTA or Nano are blockchain or not isn’t important I think, what matters is that they satisfy (theoretically, and Nano somewhat practically) certain properties that allow them to function as decentralized cryptocurrencies.

                                                    1. 4

                                                      Ok, it’s time to stop developing software, get some Verilog books, and go after them coins. That’s what we need to do people. :)

                                                      On a serious note, one thing the author didn’t mention that might reduce the barrier to entry is structured ASICs. They use a bit more energy and they’re a bit slower, but they have a much lower cost to develop and deploy. The main company doing it is eASIC with their Nextremes. I remember their 90nm option at one point cost fifty to sixty grand in prototyping for a pile of chips. Get the design working on an FPGA suited for conversion, then see if they can put it on a Nextreme.

                                                      1. 2

                                                        It’s an interesting situation. Essentially, China is extracting wealth from the world with this lock on ASIC production (Bitmain) and cheap hydro.

                                                        1. 3

                                                          It’s not China, but people of influence in China who are extracting money from China (which has currency controls).

                                                          The electricity is cheap because the ‘right’ people are being paid off (ever wondered how electronics ordered on Amazon that come from China have zero postage?)

                                                      1. 2

                                                        pdf graphs to csv is the most interesting project :-)

                                                        1. 13

                                                          The author has a pretty explicit bias toward Earley parsing, and using LL(k) over LR(k) methods (which I mostly share). There’s a wealth of information here, but he’s missing a couple other significant strands of generalized parsing research outside of Marpa’s particular Earley variant (perhaps because they don’t fit within his narrative).

                                                          GLR (Generalized LR):

                                                          • Tomita, LR parsers for natural languages, 1984.
                                                          • Tomita, Efficient parsing for natural language, 1986.
                                                          • Scott and Johnstone, Generalised bottom up parsers with reduced stack activity, 2005. This segues into their later work with Earley and GLL.

                                                          GLL (Generalized LL):

                                                          • Scott and Johnstone, GLL Parsing, 2010.
                                                          • Afroozeh and Izmaylova, Faster, Practical GLL Parsing, 2015.

                                                          Outside of Earley, GLL seems like a very practical generalized parsing approach. Instaparse is one implementation.

                                                          Earley / Shared Packed Parse Forests:

                                                          • Scott, SPPF-style parsing from Earley recognisers, 2008 (and several related papers by Scott and Johnstone). Note: I’ve never been able to get the approach described in the paper implemented correctly, and not for lack of trying.

                                                          Earley Intersection Parsing (not sure if there’s a canonical name for this):

                                                          • Bar-Hillel, Perles, and Shamir, On formal properties of simple phrase structure grammars in Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 1961. Proves some important results about intersecting automata and context free grammars.
                                                          • Chapter 13 (“Parsing as Intersection”) of Grune and Jacobs (also cited elsewhere in his timeline), particularly 13.4 (pgs. 437-439): This describes Bar-Hillel’s Intersection Parsing approach using contemporary CS & parsing terminology, and then suggests combining it with Earley parsing. While the basic intersection parsing method produces an astonishing amount of dead-end nodes to garbage-collect later, Earley parsers limit searching to productions reachable from the start symbol. If the completer is modified to produce non-terminals in the intersection parsing format (which is easy), intersection parsing nicely handles producing the parse forest (with structural sharing, when ambiguity produces multiple derivations).

                                                          I’ve been working on a C library for Earley Intersection parsing, and an awk-like, pattern-match-directed language based on it, but working on testing them thoroughly tends to lead me to working on theft instead.
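For readers unfamiliar with Bar-Hillel’s construction, here is a minimal sketch of the basic (non-Earley) version. This is my own illustrative Python, not the C library mentioned above: nonterminals of the intersected grammar are (state, symbol, state) triples, and a productivity fixpoint then garbage-collects the dead-end nodes the comment refers to:

```python
from itertools import product

def intersect(cfg, start, dfa, states, q0, finals):
    """Bar-Hillel construction: a CFG whose language is L(cfg) & L(dfa).
    cfg maps a nonterminal to a list of right-hand sides (tuples); any
    symbol that is not a cfg key is a terminal. dfa maps
    (state, terminal) -> state. New nonterminals are (p, A, q) triples,
    meaning "A derives a string taking the DFA from p to q"."""
    rules = {}
    for A, rhss in cfg.items():
        for rhs in rhss:
            # Guess a DFA state before/after every symbol of the RHS.
            for path in product(states, repeat=len(rhs) + 1):
                body, ok = [], True
                for i, sym in enumerate(rhs):
                    p, q = path[i], path[i + 1]
                    if sym in cfg:
                        body.append((p, sym, q))
                    elif dfa.get((p, sym)) == q:
                        body.append(sym)
                    else:
                        ok = False
                        break
                if ok:
                    rules.setdefault((path[0], A, path[-1]), []).append(tuple(body))
    return rules, [(q0, start, f) for f in finals]

def productive(rules):
    """Fixpoint: which triple-nonterminals derive at least one string.
    This pass is the garbage collection of dead-end nodes."""
    prod, changed = set(), True
    while changed:
        changed = False
        for lhs, rhss in rules.items():
            if lhs not in prod and any(
                    all(isinstance(s, str) or s in prod for s in rhs)
                    for rhs in rhss):
                prod.add(lhs)
                changed = True
    return prod

# a^n b^n intersected with "even number of a's" = {a^n b^n : n even}
cfg = {"S": [("a", "S", "b"), ()]}
dfa = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
rules, starts = intersect(cfg, "S", dfa, [0, 1], 0, [0])
```

The blow-up is visible in the product over state paths; restricting attention to items reachable from the start symbol, as an Earley completer does, is what avoids generating most of that garbage in the first place.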

                                                          1. 3

                                                            Thanks for details about alternatives.

                                                            I have played around with Bison’s GLR option and have almost managed to create a C parser that does not require special handling of typedefs in the lexer (ambiguities still appear at runtime for a few constructs).

                                                            Grune and Jacobs’ Parsing Techniques - A Practical Guide is sitting in a pile waiting to be read (the first edition was manageable, the second looks daunting).

                                                            1. 1

                                                              I have a print copy of the second, and it’s never struck me as daunting – reading it cover to cover, perhaps, but it’s well suited to dipping into for specific techniques, with a well-curated bibliography for further details.

                                                            2. 2

                                                              packrattle is another example of a GLL parser.

                                                              The times I’ve read about Earley parsers, they seem like the same solution as GLL, but attacking the problem from a different angle. Hopefully someone in academia will eventually prove that.

                                                              1. 2

                                                                I’m particularly interested in Earley parsers because some use cases of mine assume ambiguity (reverse-engineering binaries, scraping damaged data), and the chart that Earley parsers build up supports several operations beyond just recognizing & building parse trees:

                                                                • what terminals/nonterminals would advance the current parse(s) in progress? (i.e., autocomplete)
                                                                • if we don’t assume where the starting position is, what overlapping instruction encodings are there?
                                                                • if we inserted a fake nonterminal here, would enough of the parse have completed to match any instances of (some structure) between here and there?
                                                                • incremental re-parsing of changing input, since earlier columns in the chart are constant once the parser has moved on.
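As a concrete illustration of the first bullet, a bare-bones Earley recognizer (my own sketch; no ε rules, which keeps the completer simple) can report the expected next terminals straight from the last chart column:

```python
def earley_chart(grammar, start, tokens):
    """Build the Earley chart. grammar maps a nonterminal to a list of
    right-hand-side tuples; anything that is not a grammar key is a
    terminal. Items are (lhs, rhs, dot, origin)."""
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                          # predict
                    for prod in grammar[sym]:
                        new = (sym, prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); worklist.append(new)
                elif i < len(tokens) and tokens[i] == sym:  # scan
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                           # complete
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new); worklist.append(new)
    return chart

def expected_terminals(grammar, chart):
    """Terminals that would advance any parse in progress (autocomplete)."""
    return {rhs[dot] for _, rhs, dot, _ in chart[-1]
            if dot < len(rhs) and rhs[dot] not in grammar}

grammar = {"S": [("S", "+", "n"), ("n",)]}
chart = earley_chart(grammar, "S", ["n", "+"])
# after "n +", only another "n" can extend the parse
```

And because earlier columns are frozen once the parser moves on, the same chart structure supports the incremental re-parsing use case: on an edit, only columns from the first changed position need rebuilding.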
                                                              2. 2

                                                                Thank you for adding this. I understand why everybody is excited about Earley parsing, though I personally don’t feel sufficiently grounded in it to share that excitement, but at the very least other lines of research deserve to be mentioned.

                                                              1. 6

                                                                Amazing. Well written, well researched and concise. Parsing is such an interesting problem - it arises naturally almost anywhere you find text.
                                                                I’ve read the phrase “parsing is a solved problem” so often in discussions about it but this article puts that in its place. Thoroughly.

                                                                1. 5

                                                                  Parsing is a solved problem in the sense that if you spend enough time working on a grammar it is often possible to produce something that Bison can handle. I once spent several months trying to get the grammar from the 1991(?) SQL standard into a form that Bison could handle without too many complaints.

                                                                  Bison’s GLR option helps a lot, but at the cost of sometimes getting runtime ambiguity errors, something that non-GLR guarantees will not happen.

                                                                  So the ‘newer’ parsing techniques don’t require so much manual intervention to get them into a form that the algorithm can handle.

                                                                  1. 3

                                                                    I think that usually means for most of the languages and problems people are applying it to. If so, then it probably is a solved problem, given we have parsers for all of them, plus generators for quite a lot of those kinds of parsers. Just look at what Semantic Designs does with GLR. There are FOSS tools that try to be similarly thorough on parsing, term-rewriting, metaprogramming, etc. At some point, it seems like we should be done with parsing-algorithm research for most scenarios and instead polish and optimize implementations of tools like that, or put time into things that aren’t parsing at all, much of which is nowhere near as solved.

                                                                    I still say it’s a solved problem for most practical uses if it’s something machine-readable like a data format or programming language.

                                                                    Note: It is as awesome an article as you all say, though. I loved it.

                                                                    1. 5

                                                                      A solved problem, but for a subset of constrained grammars and languages. Still people say it without any qualification as if there’s nothing more to be said or understood.

                                                                      I take your points though and there are definitely other areas that need work.

                                                                      I think my point is just that I’m glad people are still putting effort and research into solved problems like this. There’s no telling where breakthroughs in one area can lead, and we’re still restricted in how we construct and represent language grammars.

                                                                      1. 3

                                                                        A solved problem, but for a subset of constrained grammars and languages. Still people say it without any qualification as if there’s nothing more to be said or understood.

                                                                        Yeah, they should definitely be clear on what’s solved and isn’t. It’s inaccurate if they don’t.