I find it kinda weird that BPF gets involved in this …
Yeah. Linux developers love flexibility, I’d say way too much. Rather than just having a predefined very strict security model that makes sense, they allow you to load your own bytecode that would run as the syscall filter. I guess it’s cool in a way, but it makes auditing the sandbox very difficult — I’d have to somehow prove that my filter program doesn’t allow some subtle way to do something that shouldn’t be allowed.
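To make the auditing problem concrete, here is a minimal sketch in Python of a toy syscall filter and a brute-force audit of it. This is not the kernel's classic BPF instruction set: the `jeq`/`ret` tuple format and the names `FILTER`, `run_filter` and `allowed_syscalls` are made up for illustration. Only the syscall numbers are real, and they are the x86-64 ones.

```python
# Toy model of a seccomp-style syscall filter. The opcodes and tuple layout
# are hypothetical; real filters are classic BPF programs loaded into the kernel.

ALLOW, KILL = "ALLOW", "KILL"

# x86-64 syscall numbers (they differ on other architectures).
SYS_READ, SYS_WRITE, SYS_EXECVE, SYS_EXIT = 0, 1, 59, 60

# Each instruction is one of:
#   ("jeq", k, jt, jf): if nr == k skip jt instructions, otherwise skip jf
#   ("ret", action):    stop and return the action
FILTER = [
    ("jeq", SYS_READ, 2, 0),   # read  -> jump to the ALLOW return
    ("jeq", SYS_WRITE, 1, 0),  # write -> jump to the ALLOW return
    ("jeq", SYS_EXIT, 0, 1),   # exit  -> fall through to ALLOW, else KILL
    ("ret", ALLOW),
    ("ret", KILL),
]


def run_filter(program, nr):
    """Run the toy filter against syscall number nr and return its verdict."""
    pc = 0
    while True:
        insn = program[pc]
        if insn[0] == "ret":
            return insn[1]
        _, k, jt, jf = insn
        pc += 1 + (jt if nr == k else jf)


def allowed_syscalls(program, max_nr=512):
    """Audit by brute force: try every syscall number below max_nr."""
    return {nr for nr in range(max_nr) if run_filter(program, nr) == ALLOW}
```

Enumerating every input only works here because the toy filter looks at nothing but the syscall number. A real filter can also branch on the architecture and on syscall arguments, which is where the subtle holes tend to hide.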
(As for eBPF in particular, it’s the kernel’s main “bytecode in the kernel” engine now. It’s also used for Linux’s DTrace equivalent, for example. Seccomp filters themselves are still written in the older classic BPF.)
I hope they are not using BPF to sandbox BPF?
The kernel itself verifies BPF programs at load time. The verifier enforces several restrictions before a program is allowed to run: no unbounded loops, tightly restricted pointer arithmetic, etc. Here’s a good overview from 2017: https://lwn.net/Articles/740157/
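As a rough illustration of what a load-time check buys you, here is a minimal sketch in Python of a toy verifier. It is not the kernel's verifier: the `jeq`/`ret` tuple format and the `verify` function are hypothetical. It only shows the idea that if every jump goes forward and stays in range, and the program ends in a return, then the program cannot loop and always terminates.

```python
def verify(program):
    """Return None if the toy program passes, else a string describing the problem.

    Instructions (hypothetical format):
      ("jeq", k, jt, jf): compare and skip jt or jf instructions
      ("ret", action):    stop and return the action
    """
    if not program:
        return "empty program"
    if program[-1][0] != "ret":
        return "last instruction is not a return"
    for pc, insn in enumerate(program):
        if insn[0] == "ret":
            continue
        if insn[0] != "jeq":
            return f"unknown opcode at {pc}"
        _, _k, jt, jf = insn
        for off in (jt, jf):
            if off < 0:
                # A backward jump is the only way to build a loop.
                return f"backward jump at {pc}"
            if pc + 1 + off >= len(program):
                return f"jump out of range at {pc}"
    return None


good = [("jeq", 0, 0, 1), ("ret", "ALLOW"), ("ret", "KILL")]
loop = [("jeq", 0, -1, 0), ("ret", "KILL")]
out_of_range = [("jeq", 0, 5, 0), ("ret", "KILL")]
```

Note that a check like this only proves the filter is safe to run inside the kernel. It says nothing about whether the filter's policy is the one you intended.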
Somebody brought up mini-jail in the issues; it seems to be similar.