I’ve just started to explore and use seccomp(2). It’s nice, but having worked with pledge(2) before, I would prefer syscall groups to individual platform-specific syscalls. In my opinion the right place for a pledge-like call would be the C library (as in Cosmopolitan by @jart). Also, the syscall groups defined by systemd don’t let you distinguish between read, write, and create in the filesystem.
Seccomp has a big limitation that composes very badly with another choice in Linux. New system calls in Linux take pointers to variable-sized structures so that they can be extended in backwards-compatible ways. The BPF filter in seccomp-bpf (classic BPF, not eBPF) sees only the raw argument registers and cannot dereference pointer arguments (even if it could, it would be subject to TOCTOU attacks, because the structures are copied inside the syscall implementation and not at the boundary), so there is a lot that you cannot express. For example, you cannot require openat2 calls to set RESOLVE_BENEATH, because the flag lives in a struct open_how that is passed by pointer. This makes implementing the Capsicum model (which is typically what you want if you care about least privilege and intentionality) impossible without an extra process. Newer kernels let a supervising process copy the arguments out (via seccomp user notifications or ptrace), but then it’s even slower than the overhead that you already have from a BPF filter running on every system call. In contrast, capability mode in Capsicum just toggles the syscall table and incurs basically no overhead.
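To make the pointer problem concrete, here is a toy model in Python, not real seccomp code. `SeccompData`, `user_memory`, and the addresses are made up for illustration; the syscall number and the `RESOLVE_BENEATH` value are the ones I believe Linux uses. The point is only that the filter's input contains an address, not the struct behind it.

```python
from dataclasses import dataclass

SYS_OPENAT2 = 437        # openat2 syscall number on Linux
RESOLVE_BENEATH = 0x08   # value from <linux/openat2.h>


@dataclass(frozen=True)
class SeccompData:
    """Toy stand-in for struct seccomp_data, the filter's only input."""
    nr: int      # syscall number
    args: tuple  # raw register values; pointers appear as plain integers


# Hypothetical user memory: address -> fields of a struct open_how.
user_memory = {
    0x7000: {"flags": 0, "mode": 0, "resolve": RESOLVE_BENEATH},
    0x8000: {"flags": 0, "mode": 0, "resolve": 0},
}


def seccomp_filter(data: SeccompData) -> str:
    """Decide from register values alone; user_memory is out of reach here."""
    return "ALLOW" if data.nr == SYS_OPENAT2 else "KILL"


def kernel_copy_open_how(data: SeccompData) -> dict:
    """Model the kernel copying the struct in, after the filter has run."""
    return dict(user_memory[data.args[2]])


# args: dirfd, pathname pointer, open_how pointer, struct size, unused, unused
confined = SeccompData(SYS_OPENAT2, (3, 0x6000, 0x7000, 24, 0, 0))
escaping = SeccompData(SYS_OPENAT2, (3, 0x6000, 0x8000, 24, 0, 0))

# Same verdict for both calls: the difference lives behind the pointer.
verdicts = (seccomp_filter(confined), seccomp_filter(escaping))

# TOCTOU: even a checker that could read memory can be raced.
checked = user_memory[0x7000]["resolve"] & RESOLVE_BENEATH  # check passes
user_memory[0x7000]["resolve"] = 0                          # another thread rewrites the struct
used = kernel_copy_open_how(confined)["resolve"]            # kernel copies the new value
```

Both calls get the same verdict, and the value that was checked is not the value the kernel ends up using, which is why the check has to happen after the copy, inside the kernel.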
This is definitely a well-known and well-understood problem. Where seccomp thrives is in processes that are inherently built to be constrained. The reason you don’t worry about this sort of breakage in the Chromium renderer is that it is designed to never need any system calls anyway.
Where seccomp is much more likely to break is when your program actually performs a small-to-medium-sized set of system calls - at any point, a library that implemented its functionality with {a, b, c} syscalls may suddenly choose to use {c, d, e}. This is doubly bad for filtering on syscall arguments (conditional filtering), which is quite fragile.
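The drift is easy to state as a set difference. A minimal sketch, using the placeholder names from above rather than real syscalls; `newly_denied` is a hypothetical helper, not part of any library:

```python
def newly_denied(policy: frozenset, used: frozenset) -> list:
    """Return the syscalls a program now makes that the policy does not allow."""
    return sorted(used - policy)


# Placeholder syscall names, not real syscalls.
policy = frozenset({"a", "b", "c"})        # allowlist written against the old library
old_library = frozenset({"a", "b", "c"})
new_library = frozenset({"c", "d", "e"})   # same functionality, different syscalls

before = newly_denied(policy, old_library)  # []
after = newly_denied(policy, new_library)   # ['d', 'e']
```

Nothing in your own code changed between `before` and `after`; a dependency upgrade alone is enough to turn a working policy into one that kills the process.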
At this point, I suggest either:
a) Properly architect your system so that the attack surface part requires no system calls, or system calls that are so simplistic that you are in complete control of them. This is what Chromium does.
b) Use a higher-level library that attempts to group system calls together. This isn’t a problem unique to seccomp at all - AppArmor, for example, has policies that can be imported and bundled together. I’ve been using extrasafe and I quite like how they do it. But really, do (a) if it is at all possible. You will end up with a stronger, more maintainable sandbox.
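For option (b), the grouping idea can be sketched roughly like this. The group names borrow from pledge(2) promises, but the group contents and both helper functions are my own illustrative assumptions, not how extrasafe, systemd, or pledge actually define things. Note that the read/write/create split on openat works here only because its flags are passed directly in a register.

```python
import os

# Hypothetical groups: names borrowed from pledge(2), contents illustrative only.
GROUPS = {
    "stdio": {"read", "write", "close", "fstat", "exit_group"},
    "rpath": {"openat", "newfstatat", "readlinkat"},
    "wpath": {"openat", "ftruncate"},
    "cpath": {"openat", "mkdirat", "unlinkat", "renameat2"},
}


def expand(promises: str) -> frozenset:
    """Turn a pledge-style promise string into a flat syscall allowlist."""
    names = promises.split()
    unknown = [name for name in names if name not in GROUPS]
    if unknown:
        raise ValueError(f"unknown promise(s): {unknown}")
    return frozenset().union(*(GROUPS[name] for name in names))


def openat_allowed(promises: str, flags: int) -> bool:
    """Check openat flags against the granted promises (assumed semantics)."""
    granted = set(promises.split())
    mode = flags & os.O_ACCMODE
    if flags & os.O_CREAT and "cpath" not in granted:
        return False
    if mode in (os.O_WRONLY, os.O_RDWR) and "wpath" not in granted:
        return False
    if mode in (os.O_RDONLY, os.O_RDWR) and "rpath" not in granted:
        return False
    return True


allowlist = expand("stdio rpath")
read_ok = openat_allowed("stdio rpath", os.O_RDONLY)
write_ok = openat_allowed("stdio rpath", os.O_WRONLY)
```

The caller states intent ("stdio rpath") and the library owns the mapping to syscalls, so a libc that switches to a different syscall for the same operation is fixed in one place instead of in every policy.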
In the future it would also be nice to see static analysis or symbolic execution drive seccomp policy. I started to build this in Rust but I am lazy.