1. 23

    The same article is posted on HN, and there is a flurry of comments saying that this kind of Lua usage is bad because the configuration file can take over the program, along with other security concerns. From reading a bunch of such comments, all spoken in the tone of “I’m an expert!”, I came to realize that these people don’t know Lua at all.

    It is a very common pattern for programs embedding Lua like this to create clamped-down environments that expose only a selected part of the language and its libraries to the running script. You can craft an environment with the exact functions and access you desire, and if the script tries to access something that isn’t there, well, it is an error because that function doesn’t exist.

    This is a built-in feature of the language; even though the API changed between Lua 5.1 and more recent versions, it is still there.

    Lua is not too powerful for this, Lua has the exact power you need.

    1. 10

      Not only is it a built-in feature of the language, it’s an important part of Lua’s origins – see the paper The Evolution of Lua, particularly section 3, about SOL and DEL.

      1. 7

        If anyone’s curious as to whether this is an exaggeration, here’s the actual code needed to sandbox the loading of a file like this so it only has access to a limited number of safe functions in LuaJIT:

        -- loadfile compiles the chunk without running it (returns nil, err on failure)
        local chunk, err = loadfile("config.lua")
        if not chunk then error(err) end
        -- run the chunk with only whitelisted functions in its environment
        setfenv(chunk, {add_bodies = sandbox.add_bodies, other_fn = sandbox.other_fn})
        chunk()
        

        In this case the worst you could do with malicious code is loop forever and chew up CPU. (There are ways to solve that too, but they’re not quite as trivial as the sandboxing.)

        1. 4

          That works for Lua 5.1. Lua 5.2 (or higher) changed how environments work, so there you do:

          local chunk, err = loadfile("config.lua", "t",
            {add_bodies = sandbox.add_bodies, other_fn = sandbox.other_fn})
          if not chunk then error(err) end
          chunk()
          

          The “t” parameter disallows pre-compiled Lua code (malicious bytecode is a problem in Lua).

          1. 2

            Also, if you are bullish about in-process sandboxing, WASM Lua exists: https://github.com/vvanders/wasm_lua

            WASM has good capabilities to restrict memory use and you can restrict the maximum runtime.

          2. 3

            posted on HN and there is a flurry of comments […] all spoken with the tone of “I’m an expert!”

            So on-brand for the orange site.

            This is a built-in feature of the language, even though it changed API between Lua 5.1 and the more recent versions, it is still there.

            Do you know what the status of LuaJIT is on that?

            1. 2

              Do you know what the status of LuaJIT is on that?

              The old API can be implemented in terms of the new API: https://leafo.net/guides/setfenv-in-lua52-and-above.html

              But I’m not sure about vice versa. LuaJIT does not implement the new _ENV system of 5.2, though it backports many other 5.2 features.

              1. 1

                Thanks!

              2. 1

                I’m not a LuaJIT user but I believe it works just like Lua 5.1 for that use case.

            1. 4

              Interesting read, but I don’t understand one detail of the argument: what makes Perl more secure than the other scripting languages mentioned?

              1. 13

                Taint checking comes to mind, and Perl has it. I think the OpenBSD folks prefer tech where it’s easier to do the right thing; doing the right thing in shell or PHP can require painstaking, error-prone effort to avoid pitfalls.

                1. 2

                  Ruby might be an interesting alternative, but I would assume it doesn’t support nearly as many platforms or architectures as Perl.

                  EDIT: huh. Apparently ruby removed taint checking in 2.7.

                  1. 10

                    Ruby code ages poorly compared to Perl, though. I’ve been on too many projects where Ruby code written at most a year or two earlier had already been broken by language or dependency changes.

                    1. 2

                      To be fair, OpenBSD controls base, so they could keep a check on the dependency changes. Language changes are rarely breaking with Ruby, when was the last one?

                      1. 5

                        Now, you’ve got to start auditing upstream for bug and security fixes, and backporting them, rather than just updating when needed.

                        Forking is a bunch of work – why do it when there’s a suitable alternative?

                        1. 1

                          We may be talking past each other here. I said that they could keep a check on the dependency changes, by which I meant that they would author code in such a way that it does not require external dependencies (or at least few enough that they could vendor them), which wouldn’t be any different from what they’re doing with Perl already. This means that this downside of the Ruby ecosystem could be mitigated. And language changes they’d just have to accept and roll with, but I hold that Ruby rarely introduces breaking changes.

                          OpenBSD will have to vendor $language_of_choice in any case because that’s how the BSDs’ whole-OS approach works.

                          1. 2

                            Yes. I thought you meant essentially forking the shifting dependencies instead of fully avoiding them.

                            In any case, perl is there and in use, so switching would be a bunch of work to solve a non-problem.

                      2. 1

                        Yeah, you’re not wrong. Excellent point.

                  2. 4

                    Maybe the “use warnings” and “use strict”?

                    1. 3

                      That doesn’t bring any security though: it may give you a bit of safety, catching bugs earlier than in some other scripting languages.

                      1. 6

                        What would bring any security then, as opposed to just helping catch bugs? Barring “legitimate” cases of correct algorithms outliving their usefulness (e.g. the math behind crypto algorithms getting “cracked” to the point where it’s feasible to mount attacks on reasonably-priced hardware) virtually all security issues are bugs. Things like shell injections are pretty easy to miss when writing shell code, no matter how careful you are.

                        1. 1

                          Probably the taint mode that the other commenter mentioned

                          1. 3

                            But that’s exactly what taint checking does: it helps you catch bugs that occur due to executing stuff that’s under the user’s control. Some of these can be exploited for security reasons, but security isn’t the only reason why you want this – it’s just as good at preventing a user-enumeration attack as it is at preventing an accidental “rm -rf /”.

                        2. 2

                          I thought the same. I figure the OpenBSD people know what they are talking about but I am still not really clear on what Perl has over Tcl, for example. Hopefully a Perl monk will show up and clarify.

                    1. 3

                      Is there a way to add new languages? There’s a couple idioms in k I’d like to add.

                      1. 2

                        There doesn’t seem to be :|

                        Supported languages are Ada, C, Caml, Clojure, Cobol, C++, C#, D, Dart, Elixir, Erlang, Fortran, Go, Haskell, JS, Java, Kotlin, Lisp, Lua, Obj-C, PHP, Pascal, Perl, Prolog, Python, Ruby, Rust, Scala, Scheme, VB

                        There are pull requests (https://github.com/Deleplace/programming-idioms/issues/93), but I’m not sure how quickly they’re being answered.

                      1. 2

                        While I don’t own them, I have access to servers with 40+ cores. They’re fantastic for running fuzzers. Since it’s ultimately a search problem, fuzzing benefits a lot from parallelism.

                        1. 7

                          I can really recommend UNIX Network Programming, Volume 1: The Sockets Networking API (3rd Edition)

                          It’s from 2003 but still relevant today. It’s very concise and complete in my opinion and more to the point than Advanced Programming in the Unix Environment when it comes to socket programming.

                          1. 1

                            It’s older than that, isn’t it? I have a copy on my shelf and I’m almost positive I bought it in 1998. I still go to it on an irregular basis… it’s no stretch to call it my longest-lived useful tech book right next to Cormen/Leiserson/Rivest’s algorithms text.

                            1. 1

                              It’s very concise

                              Amazon lists it at over 1000 pages.

                              1. 2

                                It is indeed a huge book, but has concise coverage of a zillion different Unix APIs.

                                1. 1

                                  Those statements are not mutually exclusive.

                                  I did this stuff in my first professional programming job (working on the DPOP mail server, mostly, in C) and it’s a deep and broad field.

                              1. 11

                                At the same time you learn about socket programming, learn to use wireshark – wireshark is a network protocol analyzer, it will capture traffic and let you see what messages are going back and forth (including protocol headers). Use netcat and/or your preferred language’s socket library to do various things on the network, and see what the traffic looks like. That should complement reading about the protocols, it’ll help to verify whether your socket library is doing what you want, and wireshark is a great tool for investigating all kinds of network issues.

                                I learned from Beej’s network guide (C-based), followed by Stevens’ Advanced Programming in the UNIX Environment and The Linux Programming Interface. There’s a lot of overlap between those two (the latter has more coverage of Linux-specific extensions), but you may find one or the other clicks with you better.
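                                To make the “do various things on the network” part concrete, here’s a minimal Ruby loopback exchange (Ruby standing in for “your preferred language’s socket library”); run it while capturing on the loopback interface in wireshark and you’ll see the TCP handshake followed by the two payloads:

                                ```ruby
                                require "socket"

                                # Tiny client/server exchange over loopback, visible in a wireshark capture.
                                server = TCPServer.new("127.0.0.1", 0)   # port 0 asks the OS for a free port
                                port = server.addr[1]

                                client = Thread.new do
                                  sock = TCPSocket.new("127.0.0.1", port)
                                  sock.write("hello\n")                  # one small payload out...
                                  reply = sock.gets                      # ...and one reply back
                                  sock.close
                                  reply
                                end

                                conn = server.accept
                                line = conn.gets                         # receives "hello\n"
                                conn.write(line.upcase)                  # replies with "HELLO\n"
                                conn.close
                                server.close

                                reply = client.value                     # join the thread, collect the reply
                                ```

                                Filtering on the chosen port in wireshark shows the SYN/SYN-ACK/ACK handshake and then the two one-line payloads.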

                                1. 1

                                  I remember using Wireshark during my studies… Thanks for the idea, I will use it!

                                1. 4

                                  What does the expression at the end mean?

                                  Looking at the manual it looks like | is Max/Or. I also saw that |/ is defined as “Max-Over”?

                                  1. 7

                                    It’s the K implementation of the algorithm above. Find the max of a list of numbers.

                                    1. 1

                                      Ah, for some reason I thought it was a pun to the effect of “hah, you noobs”.

                                    2. 6

                                      | is max. It’s also boolean or. If you wanted the minimum, it’d be &/, because & is min/boolean and.

                                      The APLs have teased apart lots of common operations into atomic parts that combine cleanly, sometimes unpacking them further than other languages go. The single-argument form of & (“where”) is a good example:

                                        &1 2 3
                                      0 1 1 2 2 2
                                      

                                      It counts up from zero, repeating each index as many times as the corresponding number in the argument.

                                        &5 5 5
                                      0 0 0 0 0 1 1 1 1 1 2 2 2 2 2
                                      

                                      Okay, so that makes the pattern clearer. But why is that useful?

                                        & 0 0 0 1 0 1 1 0 1 0
                                      3 5 6 8
                                      

                                      Ah ha – “what are the offsets of the 1s?”

                                        x:10?!1000    / draw 10 random numbers 0 to 999
                                        x             / print them
                                      379 998 594 106 191 686 123 845 495 700
                                        x < 500       / what values are less than 500
                                      1 0 0 1 1 0 1 0 1 0
                                        x[&x<500]     / slice x by indices where x is less than 500
                                      379 106 191 123 495
                                      

                                      So it combines with a conditional to become a sort of SELECT, but it also combines with other operators in a predictable way, and the implementation is straightforward.
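                                      If it helps to see the same idea outside of K, here’s a rough Ruby sketch of the monadic & (the name k_where is made up, and this ignores K’s other overloads):

                                      ```ruby
                                      # K's monadic & ("where"): index i appears xs[i] times in the result.
                                      def k_where(xs)
                                        xs.each_with_index.flat_map { |n, i| [i] * n }
                                      end

                                      k_where([1, 2, 3])                        # => [0, 1, 1, 2, 2, 2]
                                      k_where([0, 0, 0, 1, 0, 1, 1, 0, 1, 0])  # => [3, 5, 6, 8]

                                      # Combined with a comparison it becomes the x[&x<500] SELECT above:
                                      x = [379, 998, 594, 106, 191, 686, 123, 845, 495, 700]
                                      x.values_at(*k_where(x.map { |v| v < 500 ? 1 : 0 }))
                                      # => [379, 106, 191, 123, 495]
                                      ```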

                                      1. 1

                                        Thank you! I was stumped as to what the where usage of & is for. This is a great explanation.

                                    1. 3

                                      If you’re doing popcount over large, constant bit arrays, then one practical approach is to precompute the total in fairly large blocks (say, every 64 or 256 KB), and also individual counts in smaller blocks (perhaps 4 or 16 KB). Use the total from the last large block before the offset, add the small-block counts after it, and then only popcount the few remaining words – this can give you a count (rank) in constant time, and it’s especially useful when the words are compressed and would need decompressing before popcounting.

                                      Many succinct data structures are built on top of these operations, so making them fast is critical.
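                                      A Ruby sketch of the block idea, collapsed to a single level with made-up names and a toy block size (a real implementation adds the second, smaller level and uses a hardware popcount instruction rather than counting digits):

                                      ```ruby
                                      WORD_BITS = 64
                                      WORDS_PER_BLOCK = 4  # toy size; real structures use KB-sized blocks

                                      # Cumulative popcount at the start of each block, precomputed once.
                                      def build_rank_blocks(words)
                                        total = 0
                                        words.each_slice(WORDS_PER_BLOCK).map do |block|
                                          start = total
                                          total += block.sum { |w| w.digits(2).count(1) }
                                          start
                                        end
                                      end

                                      # rank(i): set bits among the first i bits (0 <= i < words.size * WORD_BITS).
                                      # Only the words after the last block boundary need popcounting.
                                      def rank(words, blocks, i)
                                        word_i, bit_i = i.divmod(WORD_BITS)
                                        count = blocks[word_i / WORDS_PER_BLOCK]
                                        (word_i / WORDS_PER_BLOCK * WORDS_PER_BLOCK...word_i).each do |w|
                                          count += words[w].digits(2).count(1)
                                        end
                                        count + (words[word_i] & ((1 << bit_i) - 1)).digits(2).count(1)
                                      end
                                      ```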

                                      1. 8

                                        Comments that have typos or are otherwise sloppy seem more likely to have bugs near them, either nearby in the code or in the same commits.

                                        Code pushed by people just before going on vacation should get extra scrutiny.

                                        If you’re converting/compressing and transferring something, multiple layers of checksums can help figure out which layer is screwing it up. It’s easier to add them and then disable them once everything is working properly.

                                        Sometimes starting from scratch and rewriting a prototype of something from memory is a good way to rethink the core design.
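                                        The checksum-layering tip can be sketched like this (Digest is in Ruby’s stdlib; convert and transfer are hypothetical stand-ins for real pipeline stages):

                                        ```ruby
                                        require "digest"

                                        # Record a digest of the payload after each layer, so a corrupted
                                        # result can be traced to the specific layer that broke it.
                                        def stamp(log, stage, data)
                                          log[stage] = Digest::SHA256.hexdigest(data)
                                          data
                                        end

                                        def convert(data)   # pretend conversion/compression step
                                          data.upcase
                                        end

                                        def transfer(data)  # pretend network hop
                                          data
                                        end

                                        log = {}
                                        received = transfer(stamp(log, :converted, convert(stamp(log, :raw, "payload"))))

                                        # The receiver recomputes digests and compares against the sender's log;
                                        # the first stage whose digest disagrees is the one that corrupted the data.
                                        ok = Digest::SHA256.hexdigest(received) == log[:converted]
                                        ```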

                                        1. 2

                                          Comments that have typos or are otherwise sloppy seem more likely to have bugs near them, either nearby in the code or in the same commits.

                                          I agree with this in principle, except that there’s too much noise in this signal from people who are just bad at spelling and/or grammar. The real signal is when someone is being uncharacteristically sloppy, which requires you to be familiar with that person’s language habits.

                                          Code pushed by people just before going on vacation should get extra scrutiny.

                                          Oh good grief yes, a thousand times. I’d go even further, and say that people should not be permitted to push code right before a vacation, at all.

                                          1. 1

                                            Yes, when somebody whose general writing voice you know is being uncharacteristically sloppy. That’s a great way of putting it.

                                            I’m thinking of someone at a previous job who usually wrote well, but typos in comments often correlated with bugs – using the wrong constant for some calibration value, for example – and I realized it could have been a sign he was mentally exhausted but hadn’t realized it yet.

                                          2. 1

                                            Comments that have typos or are otherwise sloppy seem more likely to have bugs near them, either nearby in the code or in the same commits.

                                            This is generally true, but not always. I have a colleague whom everybody recognizes as a “superproductive programmer” who is unable to write a sentence without a few spelling or grammar errors. Even worse, he uses inconsistent formatting and spacing in his code, and he does not care. Yet I have never seen anybody else able to write a continuous stream of useful, working, bug-free code that is so well organized overall. Some people seem to just think at a higher level (and not at a lower level at the same time). It’s not a matter of typing speed either; he types quite slowly.

                                          1. 1
                                            1. 3

                                              Keep in mind that maybe half the people using Lua are using it inside of some other system that may already have its own package manager (or strong opinions about packaging that don’t apply elsewhere). It’s designed for embedding as well as standalone use.

                                              1. 1

                                                so what’s a good way to deal with that problem? I need an HTTP client and I don’t think it’s built in – so I need to install lua-http or lua-curl or similar

                                                1. 2

                                                  Lua libraries fall into two categories: single-file and zero-dependency (in which case I just plop the file into the repo for the application I’m using), or libraries with really complex builds that interop with C code, in which case I use apt-get for them because they tend to be very mature and stable, and I never need the latest version. In your case luasocket falls into the latter camp, and it ships with an HTTP client.

                                                  1. 0

                                                    luasocket doesn’t appear to be useful for larger files

                                                    https://github.com/diegonehab/luasocket/issues/291

                                                    1. 0

                                                      I wonder if this is a platform-specific issue or something; I can’t reproduce any problem with the snippet you pasted using the 100MB file.

                                                      1. 1

                                                        It doesn’t have progress…

                                                  2. 1

                                                    You can run apt install lua-curl.

                                                    1. 1

                                                      Cygwin doesn’t offer a lua-curl package.

                                                2. 1

                                                  It looks like you are trying to use LuaRocks on Windows with some “non-classical” setting.

                                                  I haven’t used Windows for about a year, but last time I did https://github.com/Tieske/luawinmulti was the best option. LuaRocks 3 supposedly makes it better but there may still be issues because afaik none of the core devs really uses Windows…

                                                1. 3

                                                  There was also a Strange Loop talk about this in 2018.

                                                  It is (was?) built on top of lpeg, but trying to address some usability issues with PEGs.

                                                  1. 2

                                                    Thanks for sharing, this is a handy reference.

                                                    Aside: IMO logical expressions with == should be bool and not int.

                                                    1. 2

                                                      This depends on the C standard – bool isn’t in C89, so it can become a portability issue.

                                                      1. 1

                                                        Y’know – I was tempted to add a snarky “unless you need to be compatible with C89 <scoff>”. I will acknowledge that there are some (hopefully very very few) legit use cases where bool support is missing.

                                                        1. 2

                                                          Indeed, and I’m inclined to agree otherwise.

                                                          I’ve worked on a couple projects that needed C89 compatibility (mostly in embedded), and I begrudgingly keep greatest to -std=c89 just so that it’s always an option. There are some features that are only available when it’s built with >= C99 though.

                                                    1. 1

                                                      Anyone have recommendations for other types of testing for C? Like property testing and mutation testing, etc.

                                                      1. 3

                                                        For property-based testing, look at theft.

                                                      1. 8

                                                        The comment field there doesn’t permit editing and correcting typos…

                                                        So let me try again here…

                                                        In a galaxy far far away….

                                                        Larry Wall wondered why he needed to learn 3 pretty bad languages – sh, awk, sed… – and devised Perl as the Grand Unifying Language.

                                                        Perl sadly borrowed too much from its inspirations, and wasn’t much more readable.

                                                        Then Matz came along and resolved to borrow the best from perl and scheme and ….. and make something more powerful than them all, yet more readable.

                                                        It’s called Ruby.

                                                        And yes, everything you can do in bash, awk, sed, jq, perl… you can do in Ruby, in one line if you must, but in a more powerful and maintainable form.

                                                        All this has been available for decades, why are we (still) bashing (pun intended) our heads against the Lowest Common Denominator?

                                                        1. 8

                                                          serious question: what does “doing some awk in Ruby” look like? This might be a pretty big motivator for me to finally figure out Ruby for scripting (I’m more of a Python guy myself but awk works nicely for small scripts on line-oriented stuff when I want a one-liner)

                                                          1. 8

                                                            Compare:

                                                            # Official way of naming Go-related things:
                                                            $ grep -i ^go /usr/share/dict/* | cut -d: -f2 | sort -R | head -n1
                                                            goldfish
                                                            

                                                            Versus Ruby:

                                                            puts(Dir['/usr/share/dict/*-english'].map do |f|
                                                              File.open(f)
                                                                .readlines
                                                                .select { |l| l[0..1].downcase == 'go' }
                                                            end.flatten.sample.chomp)
                                                            

                                                            Simple example, but I think it demonstrates that various basic and common tasks are quite a bit more complex to do in Ruby than in the shell.

                                                            That doesn’t mean I’m always in favour of shell scripts – I got that example from an article I wrote saying you shouldn’t use shell scripts – but there are definitely reasons shell scripting persists, even though we have things like Perl and Ruby.

                                                            In that article I wrote “I regret writing most shell scripts [..] and my 2018 new year’s pledge will be to not write any more”. I’ve mostly failed at that new year’s pledge, and have happily continued shelling about. I have started rewriting shell-script prototypes in other languages at the first sign of them getting hairy though, and that seems like a middle ground that is working well for me (I should update/amend that article).

                                                            1. 5

                                                              To be fair, it looks like most of the additional complexity in the Ruby code comes from reading files: the first command in the pipeline, grep -i ^re glob, is what becomes

                                                              Dir[glob].map do |f|
                                                                File.open(f)
                                                                  .readlines
                                                                  .select { |l| l[0..1].downcase == re }
                                                              end.flatten
                                                              

                                                              The rest of the script contributes very little to the Ruby code.

                                                              I suspect this is a recurring theme when trying to replace shell pipelines with programs. Only Perl avoids some of this additional complexity for reading files, I think.

                                                              1. 5
                                                                puts Dir['/usr/share/dict/*-english'].
                                                                  flat_map { |f| File.readlines(f).grep(/^go/i) }.
                                                                  sample
                                                                
                                                                1. 6

                                                                  At least with Ruby I don’t have to constantly cross-reference the man pages and my cargo-culted knowledge of Unix’s multitude of text-manipulation DSLs, all unalike. It’s pretty obvious what it’s doing.

                                                                  1. 1

                                                                    Actually you used very little shell there in your first example.

                                                                    You also used grep, cut, sort and head.

                                                                    Why do you assume the backtick operator and the | operator for IO don’t exist in Ruby? In fact, why do people assume shell and jq don’t exist if you use Ruby?

                                                                    Personally I tend to reduce the number of tools involved to reduce the cognitive load of needing to understand each tool to understand the one liner.

                                                                    I balance that against considerations like the fact that IO.read("|sort -u fileName") can be a huge performance boost.

                                                                    Anyhoo… some examples of Ruby one-liners:

                                                                    http://reference.jumpingmonkey.org/programming_languages/ruby/ruby-one-liners.html
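                                                                    For instance, both of these are plain Ruby (the commands themselves are arbitrary examples):

                                                                    ```ruby
                                                                    # Backticks run a shell command and capture its stdout as a String.
                                                                    hello = `echo hello`.chomp

                                                                    # IO.popen streams a pipeline, which is handy for large outputs.
                                                                    uniq = IO.popen("printf 'b\\na\\nb\\n' | sort -u", &:read)  # => "a\nb\n"
                                                                    ```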

                                                                  2. 7

                                                                    Because code in sed or awk that worked a decade ago (or, hell, two years) still works. Ruby code seems to bit-rot faster than any other language I’ve used for nontrivial work.

                                                                    Also, with awk, I could put it down for a year, then use it again, and everything I’d need to be productive fits in a small man page. (The same seems to be true of sed, though I don’t use it much.) The Ruby ecosystem moves a lot faster, and if you haven’t been following it closely, catching up will add extra work. (Whether it’s actually going anywhere is neither here nor there.)

                                                                    Yes, awk is a more limited language, but that’s a trade-off – there are upsides, and I know which I’d prefer.

                                                                    1. 1

                                                                      Not true.

                                                                      The awk scripts I wrote decades ago were written in Solaris awk, which is not quite the same thing as GNU awk.

                                                                      Well thought out growth in a language is good.

                                                                      I find the maintenance burden of rolling forward with Ruby language versions is very low.

                                                                      Doubly so since rubocop will often autocorrect stuff.

                                                                    2. 6

                                                                      I don’t know Ruby. But for me these are the reasons why I am writing more and more bash programs:

                                                                      • Bash is my command line. So I am doing a lot of small steps, file modifications, comparing, searching, analysing. At some point I can see that some of the steps can be composed, so I pull them out of the history, try them out on the console, and at some point put them into a script. If Ruby had a REPL in which I could do all the operations that I do on the command line with less typing and more comfort, I would maybe give it a try.

                                                                      • Bash is on every Linux box. Ruby is not.

                                                                      1. 4

                                                                        Ruby does have a REPL. It’s called IRB and it comes with every Ruby installation. I use it exactly as you describe, for composing small programs iteratively.

                                                                        1. 1

                                                                          Are you using the Ruby REPL as your daily all-time console, or just when you have a program in mind? I’m asking honestly because I don’t know anything about Ruby or its REPL, and I am quite interested in how good it is as a replacement in daily life.

                                                                          My point is that shell scripts are a by-product of using the shell for manual tasks. I get better and better at my shell usage; even after 20 years I am still discovering new features or more efficient ways to do things. The shell language is really ugly, but its succinctness, plus the composition of Unix commands, the history, the prompt customization, the vi editing mode (and probably a lot of features I’m forgetting) – all this makes the shell such an efficient tool.

                                                                          1. 2

                                                                            Well, no, not as my daily shell. I dislike shell scripting enough that I switch to Ruby pretty quickly if I’m having to spend any amount of time or effort on a task, but it’s not meant to be a replacement for bash/zsh/fish.

                                                                        2. 3

                                                                          Bash is on every Linux box. Ruby is not.

                                                                          Let’s not limit ourselves here. For those not using Bash and/or Linux, how about this:

                                                                          • Bourne-compatible $SHELL is on every Unix box. Ruby is not.
                                                                          1. 2

                                                                            Bash is on every Linux box. Ruby is not.

                                                                            So is ed.

                                                                            However sudo apt install ruby solves that problem.

                                                                            And yes, ruby does have a REPL.

                                                                            1. 2

                                                                              apt: command not found.

                                                                              sudo: permission denied

                                                                              $

                                                                              1. 2

                                                                                Have fun with ed then, it’s the Standard!

                                                                                https://www.gnu.org/fun/jokes/ed-msg.html

                                                                                1. 1

                                                                                  I have written scripts in ed before to do some sufficiently tricky text manipulation. It’s a good tool.

                                                                          2. 5

                                                                            Mostly, because picking up enough jq, awk and sed to be useful is faster than learning the ins and outs of Ruby?

                                                                            I suppose you could make a similar argument about learning Ruby one-liners, but by the time I’m writing a very long bash script, I’m probably writing a larger program anyway, either in Go or Python. Ruby as a language doesn’t have much appeal to me, at least at the moment.

                                                                            Awk, at least, fits very nicely into a small space right next to regex. jq is a bit fiddlier to pick up, but very nice for basic stuff. Sed I still don’t have down very well, but it is also nicely regex-adjacent.

                                                                            1. 3

                                                                              I regularly write sed one liners to do refactorings on my Ruby code. Usually the sed call is fed by the result of grep or find. I could write a Ruby one liner to do the same, but it would be a much longer line and escaping would be much more difficult. Ruby is simply not a replacement for the convenience of sed.

                                                                              And maintainability is a red herring here: the whole point of something like sed is that you use it for one-off commands.

                                                                              1. 2

                                                                                I’m not that experienced with jq, but when it comes to awk (and sed), one of their benefits is that you can easily write a program in the shell, since they act as glue between pipe operations.

                                                                                For example, to filter out all lines that have less than 4 characters, all you have to write is

                                                                                ... | awk 'length >= 5' | ...
                                                                                

                                                                                no imports or types required. It was made for stuff like this, which makes it easy to use. I only read a book about Ruby a few years ago, but processing stdin/stdout this way in Ruby would require a bit more overhead, wouldn’t it?

                                                                                1. 1

                                                                                  One part of your history lesson is missing: Paul McCarthy and Steve Russell saw what was going to happen and pre-emptively invented Lisp. And yes, you can do everything in Lisp, in one line if you must, that you can do in bash, awk, sed, jq, perl… but in a more powerful and maintainable form.

                                                                                  ;)

                                                                                  1. 2

                                                                                    s/Paul/John/

                                                                                    This has gotta be one of my most common brainfarts…

                                                                                    1. 2

                                                                                      It was Yoko’s fault.

                                                                                    2. 1

                                                                                      Ruby equivalents of the basic awk and sed examples from the article, as examples of Ruby one-liner structure:

                                                                                      • AWK: awk '{print $1}' logs.txt
                                                                                        • Ruby: cat logs.txt | ruby -ne 'puts $_.split[0]'
                                                                                        • Ruby: cat logs.txt | ruby -ane 'puts $F[0]'
                                                                                      • sed: sed 's/^[^ ]*//' logs.txt | sed 's/"[^"]*"$//'
                                                                                        • Ruby: cat logs.txt | ruby -ne 'puts $_.gsub(/^[^ ]*/, "").gsub(/"[^"]*"$/, "")'
                                                                                    1. 6

                                                                                      This was a nice review. I admit I wasn’t able to finish the book, as I constantly got stuck on the confusing PlusCal syntax.

                                                                                      At the time of reading it, I was getting some weird edge cases in a distributed protocol at work, and trying to come up with a minimal example that could be reproduced was proving to be quite difficult. I got the book after reading a couple of Hillel’s posts and seeing the announcement here, with the idea of using TLA+ to solve our issue.

                                                                                      I already had a written specification (although in pseudocode form, and quite loose), and some intuition of where the problem was coming from, so I thought it would be a simple task. Nevertheless, I was never able to do it. Trying to reduce the real implementation to a model that TLA could understand, and figuring out the right level of detail, proved to be too time consuming, plus I had other, more urgent things to do, and in any case the problem was rare enough to be able to postpone finding a solution for a bit.

                                                                                      Fast forward a couple of months, and I finally decided to come back to the problem. This time, though, I implemented a simplified version of the system, in real code, and ran it through a property-based test (poor man’s model checker) that would execute actions against the system, specifying that the edge case would never happen. I didn’t get the prettiest execution on the first try, but I managed to refine it iteratively, armed with existing knowledge of the domain. The final execution was maybe 10 lines long and easily explainable on a whiteboard, which I consider a great success.

                                                                                      The whole thing took maybe a week of work (counting reading about model-checking, property tests and learning to use the testing library) and proved much simpler than trying to tackle a real TLA spec. I’m not saying my approach was better, TLA can express so much in terms of temporal properties that trying to model that some other way will be painful. But sometimes, the simplest approach works best.

                                                                                      1. 5

                                                                                        I’ve also had a lot of success using property-based testing as a sort of model checker, but there are trade-offs to keep in mind. I’ve been learning TLA+, off and on, and don’t feel it’s at all redundant with property-based testing.

                                                                                        TLA+ will check the model exhaustively, whereas every property testing library I’m aware of is probabilistic – because they’re semi-randomly exploring, it’s never certain when they’ve checked all meaningful cases, and some are better than others about avoiding redundant checks. Sometimes the partiality is fine, and just random stress testing is a good way to feel out an implementation, but total state-space coverage is compelling.

                                                                                        Where TLA+ checks a model of the code, a property testing library will check the implementation (or perhaps a subset). It’s often worth doing both! I mostly work in C these days (and use my theft library for property-based testing), so testing for mistakes in the implementation details is pretty important, but a non-code model checker could still be really valuable during system design.

                                                                                        1. 2

                                                                                          TLA+ will check the model exhaustively, whereas every property testing library I’m aware of is probabilistic.

                                                                                          For sure! A naïve property might not find the example you’re looking for. I know I didn’t get anything on my first try. I had to play a lot with the model generator and the shrinking procedure, to the point that I ended up defining explicitly certain parameters of the generator (number of concurrent processes, number of operations a process is able to execute, etc).

                                                                                          It’s often worth doing both!

                                                                                          Won’t disagree here. If I had already known TLA or any other specification language, I would’ve used that for sure (and if I ever get the time I plan to come back to the book and work through it, plus Lamport’s book), but when you’re facing time constraints, going for a simpler solution can get you far enough.

                                                                                      1. 4

                                                                                        Working on wrapping up and releasing some personal projects I’ve had mostly ready for a while.

                                                                                        I just pushed skiparray, an unrolled skip list library for C. It has roughly the same relation to a regular skip list as a B-Tree has to a basic balanced binary tree.

                                                                                        Aside from that, getting back to learning TLA+, working through hwayne’s book.

                                                                                        1. 6

                                                                                          I usually turn to this blog post when I need a reference for the Fisher-Yates shuffle.

                                                                                          It’s a good algorithm to have on hand, whether you’re shuffling a data structure or just indices into it. I’ve found a couple really elusive bugs by running tests shuffled multiple ways.
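For reference, here is a minimal sketch of the backward Fisher-Yates shuffle in C. The function name `shuffle` and the use of `rand()` are my own illustrative choices, not from the linked post; note that `rand() % (i + 1)` has a slight modulo bias, so a real implementation would want a better source of bounded random integers.

```c
#include <stdlib.h>

/* Fisher-Yates shuffle: walk the array backwards, swapping each
 * element with a uniformly chosen element at or before its position.
 * Every permutation is equally likely (up to the PRNG's quality). */
static void shuffle(int *a, size_t n)
{
    if (n < 2) {
        return;  /* nothing to do; also avoids size_t underflow below */
    }
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);  /* j in [0, i]; slight modulo bias */
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

The same loop works for shuffling an array of indices into a larger structure, which is handy for the shuffled-test-order trick mentioned above.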

                                                                                          1. 15

                                                                                            The author has a pretty explicit bias toward Earley parsing, and using LL(k) over LR(k) methods (which I mostly share). There’s a wealth of information here, but he’s missing a couple other significant strands of generalized parsing research outside of Marpa’s particular Earley variant (perhaps because they don’t fit within his narrative).

                                                                                            GLR (Generalized LR):

                                                                                            • Tomita, LR parsers for natural languages, 1984.
                                                                                            • Tomita, Efficient parsing for natural language, 1986.
                                                                                            • Scott and Johnstone, Generalised bottom up parsers with reduced stack activity, 2005. This segues into their later work with Earley and GLL.

                                                                                            GLL (Generalized LL):

                                                                                            • Scott and Johnstone, GLL Parsing, 2010.
                                                                                            • Afroozeh and Izmaylova, Faster, Practical GLL Parsing, 2015.

                                                                                            Outside of Earley, GLL seems like a very practical generalized parsing approach. Instaparse is one implementation.

                                                                                            Earley / Shared Packed Parse Forests:

                                                                                            • Scott, SPPF-style parsing from Earley recognisers, 2008 (and several related papers by Scott and Johnstone). Note: I’ve never been able to get the approach described in the paper implemented correctly, and not for lack of trying.

                                                                                            Earley Intersection Parsing (not sure if there’s a canonical name for this):

                                                                                            • Bar-Hillel, Perles, and Shamir, On formal properties of simple phrase structure grammars in Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 1961. Proves some important results about intersecting automata and context free grammars.
                                                                                            • Chapter 13 (“Parsing as Intersection”) of Grune and Jacobs (also cited elsewhere in his timeline), particularly 13.4 (pgs. 437-439): This describes Bar-Hillel’s Intersection Parsing approach using contemporary CS & parsing terminology, and then suggests combining it with Earley parsing. While the basic intersection parsing method produces an astonishing amount of dead-end nodes to garbage-collect later, Earley parsers limit searching to productions reachable from the start symbol. If the completer is modified to produce non-terminals in the intersection parsing format (which is easy), intersection parsing nicely handles producing the parse forest (with structural sharing, when ambiguity produces multiple derivations).

                                                                                            I’ve been working on a C library for Earley Intersection parsing, and an awk-like, pattern-match-directed language based on it, but working on testing them thoroughly tends to lead me to working on theft instead.

                                                                                            1. 3

                                                                                              Thanks for details about alternatives.

                                                                                              I have played around with Bison’s GLR option and have almost managed to create a C parser that does not require special handling of typedefs in the lexer (ambiguities still appearing at runtime for a few constructs).

                                                                                              Grune and Jacobs’ Parsing Techniques - A Practical Guide is sitting in a pile waiting to be read (the first edition was manageable, the second looks daunting).

                                                                                              1. 1

                                                                                                I have a print copy of the second, and it’s never struck me as daunting. Reading it cover to cover, perhaps, but it’s well suited to dipping into for specific techniques, with a well-curated bibliography for further details.

                                                                                              2. 3

                                                                                                Thank you for adding this. I understand why everybody is excited about Earley parsing, though I personally don’t feel sufficiently grounded in it to share that excitement, but at the very least other lines of research deserve to be mentioned.

                                                                                                1. 2

                                                                                                  I apologize for reviving this thread, but do you know if Joop Leo’s improvements for Right Recursion allows for optimizing indirect right recursion also?

                                                                                                  1. 1

                                                                                                    Not sure, sorry.

                                                                                                  2. 2

                                                                                                    packrattle is another example of a GLL parser.

                                                                                                    The times I’ve read about Earley parsers, they seem like the same solution as GLL, but attacking the problem from a different angle. Hopefully someone in academia will eventually prove that.

                                                                                                    1. 2

                                                                                                      I’m particularly interested in Earley parsers because some use cases of mine assume ambiguity (reverse-engineering binaries, scraping damaged data), and the chart that Earley parsers build up supports several operations beyond just recognizing & building parse trees:

                                                                                                      • what terminals/nonterminals would advance the current parse(s) in progress? (i.e., autocomplete)
                                                                                                      • if we don’t assume where the starting position is, what overlapping instruction encodings are there?
                                                                                                      • if we inserted a fake nonterminal here, would enough of the parse have completed to match any instances of (some structure) between here and there?
                                                                                                      • incremental re-parsing of changing input, since earlier columns in the chart are constant once the parser has moved on.
                                                                                                  1. 5

                                                                                                    I’ve done this on a couple projects. It was particularly helpful when I was in an embedded project and had time values from a couple clock domains – mixing up a 1 msec resolution time value with a 1 sec time value (or, worse, our hardware’s timer that had 1024 tick per second) could have led to some subtle bugs.

                                                                                                    Unfortunately, the lack of generics means that tracking multiple things (such as, what resolution is the timer AND is it an absolute timestamp or relative) can get verbose. Still, the struct value boxing should have zero runtime cost, so it can be a handy way to use C’s type system for extra dataflow constraints.
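As a sketch of the pattern being described (the type names `msec_t`/`sec_t` and helper functions are hypothetical, just to illustrate): wrapping each clock domain's value in its own single-field struct makes the compiler reject accidental mixing, while the generated code is the same as passing the bare integer.

```c
#include <stdint.h>

/* Distinct single-field struct types for each time resolution.
 * Structurally identical, but incompatible as far as the C type
 * checker is concerned. */
typedef struct { uint32_t v; } msec_t;
typedef struct { uint32_t v; } sec_t;

/* Conversions between domains must go through an explicit function. */
static inline msec_t msec_from_sec(sec_t s)
{
    return (msec_t){ s.v * 1000u };
}

/* Arithmetic is only defined within a domain... */
static inline msec_t msec_add(msec_t a, msec_t b)
{
    return (msec_t){ a.v + b.v };
}
/* ...so msec_add(some_msec, some_sec) is a compile-time error,
 * where adding two bare uint32_t values would silently succeed. */
```

The verbosity complaint above is visible here too: a second resolution (say, that 1024-ticks-per-second hardware timer) means another struct type and another set of conversion helpers.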

                                                                                                    1. 9

                                                                                                      Something that seems curiously absent is the recommendation that any code accepting and parsing untrusted input should get lots of fuzzing. That’s one of the ideal use cases for fuzzing, and coverage-guided fuzzers like afl-fuzz and libfuzzer are really good at searching for input that provokes unusual behavior in parsers.
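To show how little ceremony this takes, here is a minimal libFuzzer-style harness. `LLVMFuzzerTestOneInput` is libFuzzer's documented entry point; `toy_parse` is a stand-in for whatever parser is under test (a real harness would call the real parser and let the sanitizers catch crashes and undefined behavior).

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the parser under test: a trivial checker for balanced
 * parentheses. Returns 0 on a well-formed input, -1 otherwise. */
static int toy_parse(const uint8_t *data, size_t len)
{
    size_t depth = 0;
    for (size_t i = 0; i < len; i++) {
        if (data[i] == '(') {
            depth++;
        } else if (data[i] == ')') {
            if (depth == 0) {
                return -1;  /* unmatched close */
            }
            depth--;
        }
    }
    return depth == 0 ? 0 : -1;
}

/* libFuzzer calls this repeatedly with mutated, coverage-guided inputs.
 * Build with: clang -g -fsanitize=fuzzer,address harness.c */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    toy_parse(data, size);  /* rejections are fine; crashes/UB are bugs */
    return 0;               /* libFuzzer expects 0 */
}
```

The harness deliberately ignores the parser's return value: the fuzzer isn't checking that inputs parse, only that no input makes the parser misbehave.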