1. 10

  2. 2

    This looks neat, and I have a lot of questions that I don’t know the answers to, so I’m hoping someone else can clear stuff up:

    1. My guess is that running the Sentry on the ptrace platform (PTRACE_SYSEMU mode) would incur a ton of overhead for apps, and that it’d be more appropriate to use the KVM platform in production. Can anyone confirm this suspicion?
    2. Networking. The README suggests that it creates a TAP device (meaning, virtual Ethernet), and then there’s netstack involved. My understanding of how that’d work is that the application makes calls to the BSD sockets API, the syscalls are translated to netstack calls, and the gVisor process then sends raw Ethernet frames to the host kernel to send on its way. This would mean that gVisor’s netstack would have to implement full TCP/IP, with congestion control, etc., and it would seemingly make it impossible to monitor connections from the container host (unless, of course, gVisor has an API that provides this data, e.g. through 9p or something). But surely that’s not the case? Surely this doesn’t leave you blind, staring at nothing but Ethernet frames?
    3. Networking, part 2: there’s also the ability to use the host’s network stack, at the cost of reduced isolation. I guess in that case the syscall trapping basically proxies the BSD sockets API calls to the host, which would make it possible to use standard networking tools, but would open up more potential for isolation violations?
    1. 1

      The issues section makes it look like KVM mode isn’t that well tested. It would be nice if they actually explained how much it is used and how they test it internally. I have no idea how production-ready this tool is.

      1. 3

        After reading both the code and the various Linux patches Google published on LKML, I came to the conclusion that Google most likely uses a patched Linux kernel with some special eBPF technology to implement syscall interception, not ptrace or KVM.

        Speaking of ptrace, I don’t believe it is safe to use. Just like systrace(1), it is plagued by TOCTOU (time-of-check-to-time-of-use) races. Unlike systrace, it can protect one program from another, or protect the host kernel from a malicious program; however, it can’t protect a program from itself. ISTM a vulnerability in a program can be made much worse by exploiting TOCTOU, e.g. to exfiltrate data from the process. Yes, the vulnerability has to already be there, but once it’s there, it’s easier to exploit than in another environment.

        1. 2

          Very useful connection w/ the LKML patches, thanks. After reading your comment, some of the Google folks’ comments elsewhere definitely sound like they’re using some other interface. E.g. this one responds to a criticism of ptrace by saying: “The default platform is ptrace so that it works out of the box everywhere”. Which sure sounds a lot like “that’s just the default, we’re not actually using that configuration”. Which doesn’t necessarily make it bad, but my confidence in a sandbox is a lot lower if I’m trying to use it in a different configuration from the one that the main developer/tester uses themselves.
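
A minimal sketch of what the PTRACE_SYSEMU mode from the first question involves, assuming x86-64 Linux and glibc. This is not gVisor’s code; the function `sysemu_demo` and the fake PID value 4242 are hypothetical, chosen only for illustration. The tracer stops the child at every system call entry, the kernel skips the call, and the tracer writes its own return value into `rax`. Each emulated call therefore costs a tracee stop, a `waitpid` wakeup in the tracer, and several extra `ptrace(2)` calls, which is roughly where the overhead asked about comes from.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical value our toy "sandbox" returns for getpid(). */
#define FAKE_PID 4242L

/* Returns the argument the traced child passed to exit_group, or -1 on error. */
static long sysemu_demo(void)
{
#if defined(__x86_64__) && defined(__linux__)
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Child: ask to be traced, then stop so the parent can take over. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(1);
        raise(SIGSTOP);
        long pid = syscall(SYS_getpid); /* skipped by the kernel; tracer supplies rax */
        syscall(SYS_exit_group, pid);   /* also skipped; tracer reads the argument */
        _exit(0);                       /* not reached while under PTRACE_SYSEMU */
    }

    int status;
    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
        return -1; /* e.g. PTRACE_TRACEME failed and the child already exited */

    long result = -1;
    int alive = 1;
    for (int i = 0; i < 64; i++) {
        /* Resume until the next syscall entry; that syscall will not execute. */
        if (ptrace(PTRACE_SYSEMU, child, NULL, NULL) != 0)
            break;
        if (waitpid(child, &status, 0) != child)
            break;
        if (!WIFSTOPPED(status)) {
            alive = 0; /* child exited or was killed */
            break;
        }
        if (WSTOPSIG(status) != SIGTRAP)
            continue; /* a signal stop, not a syscall stop; resume and drop the signal */

        struct user_regs_struct regs;
        if (ptrace(PTRACE_GETREGS, child, NULL, &regs) != 0)
            break;

        if (regs.orig_rax == SYS_exit_group) {
            result = (long)regs.rdi; /* the value the child "saw" from getpid */
            break;
        }
        /* Emulate: getpid gets a fake answer, every other call "succeeds" with 0. */
        regs.rax = (regs.orig_rax == SYS_getpid) ? FAKE_PID : 0;
        if (ptrace(PTRACE_SETREGS, child, NULL, &regs) != 0)
            break;
    }

    if (alive) {
        kill(child, SIGKILL); /* the child is parked in a ptrace stop; end it */
        waitpid(child, &status, 0);
    }
    return result;
#else
    return -1; /* this sketch only targets x86-64 Linux */
#endif
}
```

Calling `sysemu_demo()` from `main` should return 4242 when ptrace is permitted; container seccomp profiles or a restrictive Yama `ptrace_scope` setting can block it, in which case it returns -1. The child never reaches the real kernel for `getpid` or `exit_group`, which is also why the tracer has to kill it explicitly at the end.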