It sounds like what you really need for this to be usable on Linux is a userspace implementation of pledge() (still using seccomp() for the actual filtering) which is provided by your libc and therefore has at least a snowball’s chance of correctly knowing which syscalls correspond to which operations.
That wouldn’t suffice for software that invokes syscalls directly, without libc. I imagine a more complicated mechanism might work though: each piece of code that invokes syscalls directly (e.g. libc, the Go stdlib, etc.) would also need to expose somewhere in its API a function that spits out BPF fragments. Those fragments could then be chained together so that the combined filter allows the union of what each one allows, and the result passed into seccomp().
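To make the "union of fragments" idea a bit more concrete, here is a rough sketch in Python. It only builds and simulates a classic BPF program; it does not call seccomp(). The fragment names (LIBC_STDIO, GO_RUNTIME), their contents, and the helpers build_filter and run_filter are made up for illustration. The syscall numbers are the x86_64 ones, and the opcode and return values are the ones from the kernel headers.

```python
# Classic BPF opcodes and seccomp constants, as defined in
# <linux/bpf_common.h>, <linux/seccomp.h> and <linux/audit.h>.
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06      # BPF_RET | BPF_K
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_KILL_PROCESS = 0x80000000
AUDIT_ARCH_X86_64 = 0xC000003E

# Hypothetical fragments that two components might publish
# (x86_64 syscall numbers; the sets themselves are invented).
LIBC_STDIO = frozenset({0, 1, 3, 231})  # read, write, close, exit_group
GO_RUNTIME = frozenset({9, 202, 231})   # mmap, futex, exit_group


def build_filter(*fragments):
    """Return a list of (code, jt, jf, k) allowing the union of all fragments."""
    allowed = sorted(frozenset().union(*fragments))
    prog = [
        (BPF_LD_W_ABS, 0, 0, 4),                   # load seccomp_data.arch
        (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),      # expected arch: skip the kill
        (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        (BPF_LD_W_ABS, 0, 0, 0),                   # load seccomp_data.nr
    ]
    for nr in allowed:
        prog.append((BPF_JEQ_K, 0, 1, nr))         # no match: skip the ALLOW
        prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS))  # default deny
    return prog


def run_filter(prog, arch, nr):
    """Tiny interpreter for the three opcodes used above (testing only)."""
    data = {0: nr, 4: arch}  # offsets of nr and arch in struct seccomp_data
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = data[k]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k


prog = build_filter(LIBC_STDIO, GO_RUNTIME)
futex_verdict = run_filter(prog, AUDIT_ARCH_X86_64, 202)   # futex, in the union
execve_verdict = run_filter(prog, AUDIT_ARCH_X86_64, 59)   # execve, in neither
```

A real version would also have to deal with argument checks, multiple architectures and the filter size limit, none of which this toy touches.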
The problem with a pledge implementation in Linux userspace is that seccomp(2) filters are inherited by child processes and stay in force across execve(2).
Isn’t that exactly the behaviour which I want? It’s not a very good sandbox if I can break out of it by execve()ing something. In all of the practical use cases I can think of for sandboxing, if you’re going to fork() or execve() then you do so at the same time as reading your config files and then you impose the sandbox on yourself afterward.
Edit: if inheritance isn’t what pledge() itself does then the sandbox would have to be given a different name for clarity’s sake, of course.
Yes, new processes executed with exec are not pledged; see the exec promise in pledge(2).
This makes it easier to pledge everything. With seccomp you would need very broad rules for programs like sudo, doas or sshd, because whatever they execute inherits the filter. pledge limits the impact of code execution vulnerabilities in the pledged executable itself; seccomp is more like a sandbox.
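A toy model of the difference, in Python, in case it helps. Nothing here talks to the kernel; Proc, seccomp_exec and pledge_exec are invented names, and the promise and syscall sets are only examples. The seccomp side models "the filter survives execve(2)"; the pledge side models "exec needs the exec promise, and the new image starts with only the execpromises (none by default)".

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Proc:
    """Toy process. `allowed` is None when nothing restricts it."""
    image: str
    allowed: Optional[FrozenSet[str]] = None
    exec_allowed: Optional[FrozenSet[str]] = None  # models pledge's execpromises


def seccomp_exec(proc: Proc, image: str) -> Proc:
    """seccomp(2) model: the filter is kept by the new program image."""
    if proc.allowed is not None and "execve" not in proc.allowed:
        raise PermissionError("execve is not in the filter's allowlist")
    return Proc(image, proc.allowed)


def pledge_exec(proc: Proc, image: str) -> Proc:
    """pledge(2) model: the new image gets only the execpromises, if any."""
    if proc.allowed is not None and "exec" not in proc.allowed:
        raise PermissionError("the exec promise was not pledged")
    return Proc(image, proc.exec_allowed)


# A doas-like program can pledge tightly and still run an unrestricted shell.
doas = Proc("doas", frozenset({"stdio", "rpath", "exec"}))
shell_under_pledge = pledge_exec(doas, "sh")

# Under the seccomp model the shell is stuck with sudo's allowlist, so that
# list has to be broad enough for anything the user might run.
sudo = Proc("sudo", frozenset({"read", "write", "execve"}))
shell_under_seccomp = seccomp_exec(sudo, "sh")
```

In the model, shell_under_pledge.allowed is None (unrestricted), while shell_under_seccomp.allowed is still sudo's set.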
This is just a simple restatement, but hopefully it is insightful. The reason here is “mechanism, not policy”. seccomp is almost purely a mechanism, but sandboxing needs a policy. pledge combines the two, whereas Linux has yet to get a widely accepted sandboxing policy (as opposed to a sandboxing mechanism).