I would like to read @andyc’s thoughts on this.
Sure … This paper was published over 4 years ago, and I read it shortly after, when @nickpsecurity brought related work to my attention right here on lobste.rs.
Formally-Verified Interpreter for a Shell-like, Programming Language (2017) - 4 years ago, a bunch of my comments
Executable formal semantics for the POSIX shell - 2 years ago, more comments
It was definitely exciting to see someone else tread a lot of the same ground! The very first posts on the Oil blog (over 5 years ago!!!) were precisely about static parsing
I mention the Morbig and Smoosh work in the appendix of this post:
Oil actually started as an experiment in static parsing, which I frame here: http://www.oilshell.org/blog/2021/12/backlog-project.html#oil-is-a-bunch-of-experiments-that-succeeded
It’s even more rigid about this than POSIX, which specifies dynamic parsing for arithmetic. Dynamic parsing is technically OK in POSIX, but becomes extremely unsafe when combined with bash and ksh features – it leads to Log4Shell and ShellShock-like “hidden evals”.
In other words the shell can start parsing your data as code and executing it without you knowing! See the section below on that.
Here’s a more pedestrian example. Because arithmetic is dynamic in POSIX, all shells share this behavior:
$ dash -c 'op=+; echo $(( 3 $op 4 ))'
$ osh -c 'op=+; echo $(( 3 $op 4 ))'
op=+; echo $(( 3 $op 4 ))
[ -c flag ]:1: Unexpected token after arithmetic expression
If you want you can use an explicit eval in OSH:
$ osh -c 'op=+; eval "echo \$(( 3 $op 4 ))"'
I talk about POSIX in the FAQ and claim that OSH is an up-to-date POSIX: http://www.oilshell.org/blog/2021/01/why-a-new-shell.html
Note that bash is influencing a new de facto standard – busybox ash, OSH, and dash have been naturally converging on its behavior, while POSIX seems far pretty behind.
OSH is basically a “cleaned up bash” – it runs real bash scripts, but we don’t copy all the bugs. What was not clear at the outset is if the static subset could run most shell scripts – that turned out to be true!!!
Appendix: if you really understand static vs. dynamic parsing (and it’s become clear to me that this is a difficult concept for some, even shell authors), then these posts are related:
Again, dynamic parsing is bad because it leads to bugs like ShellShock and Log4Shell. Those bugs are both what you would call an “unintentional hidden eval of shell code”. That’s the same thing as dynamic parsing (of user supplied data) and then execution.
I thought the same thing. :)
Why a paywalled link? Both the source and the paper are available online.
And yes, Menhir is the best LR parser generator in the world. ;)
It seems this is an extended version. The scihub pdf link.
Full PDF here