This document is a high-level introduction to gVisor for security researchers. It explains how gVisor’s approach to isolation differs from that of other security products, and assumes a solid understanding of how kernels and operating systems work.
A look at the Security Model page is also recommended.
gVisor is an open-source workload isolation solution to safely run untrusted code, containers and applications. It fundamentally differs from other isolation solutions in that it is an application kernel, not a virtual machine hypervisor or a system call filter.
The two most common approaches to sandboxing workloads are to use virtualization (virtual machines or VMs), and/or Linux kernel security primitives such as seccomp-bpf, Linux namespaces, AppArmor, Landlock, etc. gVisor uses these technologies, but not in the standalone way they are typically used. They are used for defense-in-depth rather than as a primary layer of defense.
To explain the difference, it is useful to first contrast gVisor with these other approaches.
When using Linux kernel security primitives like seccomp-bpf, AppArmor, Landlock, namespaces, and such, the attack surface exposed to the sandboxed application is reduced, but enforcement is still done by the single monolithic Linux kernel that the sandboxed application can still talk to.
This means the workload is only one system call away from host compromise. The Linux kernel security primitives help reduce that surface, but any attack within the remaining surface (or one that undoes the Linux kernel security mechanism itself) can still be executed. Additionally, these security primitives need to be carefully configured for the particular workload being sandboxed in order to be meaningful. For example, system call filters need to be whittled down to the exact set of system calls the workload needs and no others. This also means that if an application depends on an “unsafe” or broad system call (like ioctl(2) or io_uring), it can be very difficult or even impossible to create a secure set of filters for that application. And creating a generic configuration that works for all or most workloads results in having to expose all or most of the kernel surface.
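To make the tailoring problem concrete, here is a minimal sketch of a per-workload allowlist built with the libseccomp-golang bindings (a generic library, not anything gVisor-specific). The system calls listed are purely illustrative; a real filter must be derived from the actual workload being sandboxed.

// Minimal seccomp-bpf allowlist sketch using libseccomp-golang.
// The syscall names below are illustrative only.
package main

import (
    "log"

    seccomp "github.com/seccomp/libseccomp-golang"
)

func main() {
    // Default action: fail with EPERM (errno 1) for anything not explicitly allowed.
    filter, err := seccomp.NewFilter(seccomp.ActErrno.SetReturnCode(1))
    if err != nil {
        log.Fatal(err)
    }
    for _, name := range []string{"read", "write", "exit_group", "futex"} {
        sc, err := seccomp.GetSyscallFromName(name)
        if err != nil {
            log.Fatal(err)
        }
        if err := filter.AddRule(sc, seccomp.ActAllow); err != nil {
            log.Fatal(err)
        }
    }
    // Install the filter; it applies to this process and its children.
    if err := filter.Load(); err != nil {
        log.Fatal(err)
    }
    // From here on, any system call outside the allowlist fails with EPERM.
}

Even this tiny example already has to allow futex(2), which most runtimes need; growing the list to cover a real application without over-allowing is where the difficulty lies.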
While gVisor leverages seccomp-bpf and namespaces to minimize its own surface to the host kernel, it does so only as a second layer of defense, and not in a way that requires workload-specific tailoring in order to be meaningful.
When using virtual machines, a hypervisor manages the coordination between two kernels: one running on the host as normal, and another running inside a hardware-enforced virtual machine where the sandboxed workload’s activity is contained. (The hypervisor can run in userspace, kernelspace, or both; in the diagram below it is shown in kernelspace, but the principle applies without loss of generality.)
The virtual machine acts as a strong security boundary that restricts the application from accessing any host resource. The only way out of the virtual machine is through a “VM exit”, an event triggered only in certain circumstances and handled by the hypervisor. While virtual machines are the gold standard of workload isolation, they come at a steep cost in terms of resource overhead and efficiency, due to the need to pre-allocate machine resources to each virtual machine on each host and to boot a full, separate Linux kernel.
While gVisor can use virtualization (specifically: KVM), it can also work without virtualization while maintaining a high level of security.
Now that we’ve seen how Linux kernel security primitives and virtual machines work, let’s turn to gVisor.
gVisor acts as an application kernel, but runs in userspace. It plays the role of the kernel from the perspective of the sandboxed workload, while behaving like a regular user application from the host kernel’s perspective. Like a kernel, gVisor intercepts and handles system calls and page faults from the sandboxed workload. This handling logic happens entirely within gVisor code, written in memory-safe Go. This kernel-like component is called the “gVisor Sentry”.
Like a user application, the gVisor Sentry may make limited system calls to the host Linux kernel. It does so when it determines that servicing the sandboxed workload’s request requires information from the host machine and that the sandboxed workload was initially configured to allow such access.
This means the gVisor Sentry needs to re-implement Linux in Go. The gVisor Sentry contains a Go-based, from-scratch reimplementation of the Linux system call interface, memory management, filesystems, a network stack, process management, signal handling, namespaces, etc. gVisor never passes through any system call to the host. Therefore, if a kernel feature isn’t reimplemented in gVisor, then the sandboxed workload cannot use it.
Let’s walk through an example. Say a sandboxed process calls getpid(2). gVisor intercepts this system call. gVisor keeps track of its own PID table representing the processes in the sandbox. These are not real host processes! Running top(1) on the host will not show them. gVisor uses its own PID table to find the PID of the sandboxed process, and returns that. From the sandboxed process’s perspective, it just ran getpid(2), yet no host system call was made.
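A conceptual sketch of this flow, written in Go but deliberately not gVisor’s actual code (the Task type, table layout, and handler signature are invented for illustration): intercepted system calls are dispatched to in-process handlers that answer from the userspace kernel’s own state, and anything unimplemented fails with ENOSYS rather than being forwarded to the host.

// Conceptual sketch, not gVisor's actual implementation: intercepted
// system calls are looked up in an in-process table and answered from the
// userspace kernel's own state (e.g. its internal PID table). Unimplemented
// calls fail with ENOSYS; nothing is ever forwarded to the host kernel.
package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// Task is a stand-in for the per-thread state a Sentry-like kernel tracks.
type Task struct {
    pid int // PID inside the sandbox, unrelated to any host PID
}

type handler func(t *Task, args [6]uintptr) (uintptr, error)

var syscallTable = map[uintptr]handler{
    unix.SYS_GETPID: func(t *Task, _ [6]uintptr) (uintptr, error) {
        return uintptr(t.pid), nil // answered from internal state, no host syscall
    },
    // read, write, mmap, etc. would have their own in-process handlers here.
}

func dispatch(t *Task, nr uintptr, args [6]uintptr) (uintptr, error) {
    h, ok := syscallTable[nr]
    if !ok {
        return 0, unix.ENOSYS // unimplemented: refused, never passed through
    }
    return h(t, args)
}

func main() {
    t := &Task{pid: 2} // a sandbox-internal PID
    ret, err := dispatch(t, unix.SYS_GETPID, [6]uintptr{})
    fmt.Println(ret, err) // 2 <nil>
}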
Some system calls made by a sandboxed process may result in one or more host system calls being made. As a second example, if a sandboxed process wishes to read(2) from a pipe(2) that another process in the sandbox is write(2)-ing to, the gVisor Sentry (and more specifically, the Go runtime it relies on) may call the host futex(2) system call to perform blocking and synchronization between these operations. Therefore, the Sentry does need to be able to perform real system calls, but they do not map 1-to-1 to the system calls made by the sandboxed processes.
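As a hedged illustration of that second example, the sketch below (not gVisor’s actual pipe code) shows how a pipe can be serviced entirely in userspace: the data lives in a Go buffer and blocking uses Go synchronization primitives, so the only thing the host kernel may see is the futex(2) calls the Go runtime issues to park and wake goroutines.

// Sketch of an in-process pipe: read(2)/write(2) semantics are implemented
// on a Go buffer, and blocking is done with sync.Cond. The host kernel never
// sees a read or write for this pipe; at most it sees the Go runtime's
// futex(2) calls. Illustrative only, not gVisor's implementation.
package main

import (
    "fmt"
    "sync"
)

type pipe struct {
    mu   sync.Mutex
    cond *sync.Cond
    buf  []byte
}

func newPipe() *pipe {
    p := &pipe{}
    p.cond = sync.NewCond(&p.mu)
    return p
}

// Write appends data and wakes any blocked readers.
func (p *pipe) Write(data []byte) {
    p.mu.Lock()
    defer p.mu.Unlock()
    p.buf = append(p.buf, data...)
    p.cond.Broadcast()
}

// Read blocks until data is available, like read(2) on an empty pipe.
func (p *pipe) Read(n int) []byte {
    p.mu.Lock()
    defer p.mu.Unlock()
    for len(p.buf) == 0 {
        p.cond.Wait() // goroutine parks; the runtime may use host futex(2)
    }
    if n > len(p.buf) {
        n = len(p.buf)
    }
    data := p.buf[:n]
    p.buf = p.buf[n:]
    return data
}

func main() {
    p := newPipe()
    go p.Write([]byte("hello from the sandbox"))
    fmt.Printf("%s\n", p.Read(64))
}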
The gVisor Sentry runs in a very restricted environment, leveraging all of the Linux kernel security primitives available (system call filtering, namespacing, cgroups, pivot_root(2), etc.). Its system call filter prohibits system calls like execve(2), connect(2), and their respective variants (with caveats depending on sandbox configuration). It has an isolated view of the host filesystem thanks to mount namespaces, and runs in an isolated user namespace with minimal capabilities. This does not mean that the sandboxed workload can’t use these system calls; it actually can! But their logic and implementation are handled entirely within the gVisor Sentry’s kernel logic, rather than delegating any of it to the host kernel.
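One way to observe these restrictions from the host is to inspect the Sentry process’s /proc/&lt;pid&gt;/status, which reports its seccomp mode, no_new_privs flag, and effective capabilities. The sketch below assumes you have separately located the Sentry’s host PID (for example with ps); the field meanings are standard Linux, nothing gVisor-specific.

// Minimal host-side probe: print the seccomp mode, no_new_privs flag and
// effective capability set of a process, e.g. a running gVisor Sentry.
// Finding the Sentry's PID is left to the reader (e.g. via ps).
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: probe <pid>")
        os.Exit(1)
    }
    f, err := os.Open("/proc/" + os.Args[1] + "/status")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        line := sc.Text()
        // "Seccomp: 2" means a seccomp-bpf filter is installed;
        // "CapEff: 0000000000000000" means no effective capabilities.
        if strings.HasPrefix(line, "Seccomp:") ||
            strings.HasPrefix(line, "NoNewPrivs:") ||
            strings.HasPrefix(line, "CapEff:") {
            fmt.Println(line)
        }
    }
}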
For requests that cannot be serviced from within this restricted environment, gVisor uses a companion process called the Gofer: a slightly-more-trusted helper that runs in a slightly-more-privileged context.
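The actual Sentry/Gofer protocol is more involved, but the primitive it ultimately relies on is ordinary Unix file descriptor passing: the more privileged process opens a host file and hands only the open descriptor to the restricted process over a Unix domain socket. A self-contained sketch of that primitive (the file path and message contents are arbitrary):

// Sketch of the mechanism a Gofer-like helper relies on: the more privileged
// side opens a host file and passes only the open file descriptor over a
// Unix domain socket (SCM_RIGHTS), so the restricted side never opens host
// paths itself. Illustrative only; the real Sentry/Gofer protocol differs.
package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    // socketpair stands in for the pre-established Sentry<->Gofer channel.
    fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM, 0)
    if err != nil {
        panic(err)
    }

    // "Gofer" side: open the file and send the descriptor.
    go func() {
        f, err := os.Open("/etc/hostname") // any host file the policy allows
        if err != nil {
            panic(err)
        }
        defer f.Close()
        rights := unix.UnixRights(int(f.Fd()))
        if err := unix.Sendmsg(fds[0], []byte("open-reply"), rights, nil, 0); err != nil {
            panic(err)
        }
    }()

    // "Sentry" side: receive the descriptor and use it, without ever having
    // permission to open host paths itself.
    buf := make([]byte, 64)
    oob := make([]byte, unix.CmsgSpace(4))
    _, oobn, _, _, err := unix.Recvmsg(fds[1], buf, oob, 0)
    if err != nil {
        panic(err)
    }
    msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
    if err != nil {
        panic(err)
    }
    passed, err := unix.ParseUnixRights(&msgs[0])
    if err != nil {
        panic(err)
    }
    received := os.NewFile(uintptr(passed[0]), "passed-fd")
    data := make([]byte, 128)
    n, _ := received.Read(data)
    fmt.Printf("read via passed fd: %s", data[:n])
}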
This security architecture is similar to virtual machines in that there are two separate kernels, with the innermost one being exclusive to the sandboxed workload, and with very restricted access to the host kernel. However, unlike virtual machines, gVisor sandboxes have the flexibility to allocate and release host resources (CPU, memory) at runtime, providing better efficiency and utilization without compromising on the security benefits of the VM-like dual-kernel security architecture.
Additionally, the gVisor components are all written in memory-safe Go, eliminating the largest class of security vulnerabilities that would otherwise be present in a typical VM setup (Linux as guest kernel). In order to break out of a gVisor sandbox, an attacker would need to simultaneously exploit the gVisor Sentry kernel and the host Linux kernel, which do not share any code.
gVisor contains multiple mechanisms by which it can intercept system calls and page faults from the sandboxed workload. These are called “gVisor platforms”. There are currently two supported platforms:
KVM: uses the Linux kernel’s KVM (Kernel-based Virtual Machine) subsystem, relying on hardware virtualization to intercept the sandboxed workload’s system calls and page faults. It requires virtualization support from the host and is therefore best suited to bare-metal machines.

Systrap: uses the Linux kernel’s seccomp-bpf subsystem for system call interception (as opposed to the typical use-case of seccomp-bpf being for system call filtering). It does not require virtualization support from the host and is therefore well-suited to run inside a virtual machine. Read our announcement post for more details on Systrap.

Platforms are meant to be transparently interchangeable from the system administrator’s perspective. However, they are still different from a security perspective, as the Linux kernel functionality they rely on to provide system call and page fault interception differs.
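As a purely hypothetical sketch of why that interchangeability works, one can imagine the rest of a Sentry-like kernel programming against a small interception interface, with KVM and Systrap as different implementations of it. The names below are invented for illustration and do not match gVisor’s internal API.

// Hypothetical sketch: the userspace kernel programs against a small
// interception interface, and KVM vs. Systrap are different implementations
// of it. Compile-only illustration; not gVisor's internal API.
package platformsketch

// TrapReason describes why control returned from the sandboxed code to the
// userspace kernel.
type TrapReason int

const (
    TrapSyscall TrapReason = iota
    TrapPageFault
    TrapSignal
)

// Context is one sandboxed thread of execution.
type Context interface {
    // Switch resumes the sandboxed thread and blocks until it traps back
    // into the userspace kernel, returning the reason for the trap.
    Switch() (TrapReason, error)
}

// Platform is the piece that differs between KVM (hardware virtualization)
// and Systrap (seccomp-bpf based interception); everything above it stays
// the same.
type Platform interface {
    NewContext() (Context, error)
}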
For more information on gVisor security, please see the Security Model page.
Generally speaking, gVisor protects against Linux kernel exploits by preventing the sandboxed workload from accessing the host kernel directly.
Where gVisor does not help:
gVisor is available as an OCI-compliant container runtime named runsc. It can be used with container ecosystem tools like Docker (gVisor guide) or Kubernetes (gVisor guide). It can also be used directly for one-off testing, like this:
$ sudo runsc do echo Hello world
Hello world
Note the use of sudo, which may give you pause. It’s a sandboxing tool, after all; shouldn’t it run as an unprivileged user? gVisor-sandboxed workloads do run with minimal capabilities in an isolated user namespace from the host kernel’s perspective. However, the sandbox setup process requires privileges, specifically for setting up the userspace network stack. Once sandbox setup is complete, gVisor re-executes itself and drops all privileges in the process. This takes place before any untrusted code runs. For sandboxes that don’t require networking, it is possible to run in rootless mode without sudo:
$ runsc --rootless --network=none do echo Hello world
Hello world
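As a rough illustration of what “drops all privileges” typically involves (this is not gVisor’s actual startup code), a sandboxing tool can forbid regaining privileges and then switch to an unprivileged user before any untrusted code runs:

// Illustrative sketch, not gVisor's startup code: forbid regaining
// privileges, then switch to an unprivileged UID/GID before running
// untrusted code.
package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    // Ensure no future execve(2) can grant new privileges (e.g. via setuid bits).
    if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
        panic(err)
    }
    // Drop to an unprivileged user and group. 65534 ("nobody"/"nogroup" on
    // many distributions) is purely illustrative here.
    if err := unix.Setresgid(65534, 65534, 65534); err != nil {
        fmt.Println("setresgid:", err) // fails if not started with privileges
    }
    if err := unix.Setresuid(65534, 65534, 65534); err != nil {
        fmt.Println("setresuid:", err)
    }
    fmt.Printf("now running as uid=%d gid=%d, no_new_privs set\n",
        unix.Getuid(), unix.Getgid())
}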
How can you tell that gVisor is working? Well, try to do something that involves the host kernel. For example, you can call dmesg(1), which reads the kernel logs:
# Without gVisor (unsandboxed):
$ dmesg
dmesg: read kernel buffer failed: Operation not permitted
# With gVisor (sandboxed):
$ runsc --rootless --network=none do dmesg
[ 0.000000] Starting gVisor...
[ 0.498943] Waiting for children...
[ 0.972223] Committing treasure map to memory...
[ 1.192981] Segmenting fault lines...
[ 1.591823] Verifying that no non-zero bytes made their way into /dev/zero...
[ 1.787191] Consulting tar man page...
[ 2.083245] Searching for needles in stacks...
[ 2.534575] Forking spaghetti code...
[ 2.742140] Digging up root...
[ 2.921313] Gathering forks...
[ 3.342436] Creating cloned children...
[ 3.511124] Setting up VFS...
[ 3.812459] Setting up FUSE...
[ 4.233037] Ready!
This demonstrates that the dmesg(1) binary is talking to the gVisor kernel instead of the host Linux kernel. The humorous messages displayed are part of gVisor’s kernel code when a sandboxed workload asks for kernel logs. These logs are fictitious and are generated by gVisor’s system call handler on demand, which is why re-running this command will yield different messages. Try to catch them all!
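Under the hood, dmesg(1) asks the kernel for its log buffer via the syslog(2) interface. A minimal Go equivalent, using the Klogctl wrapper from golang.org/x/sys/unix, makes it easy to see which kernel answers: run it on the host and inside the sandbox and compare. On the host it may fail with “operation not permitted”, just like dmesg above, unless you have CAP_SYSLOG.

// Minimal dmesg-like reader: asks "the kernel" for its log buffer via
// syslog(2) (klogctl). Inside a gVisor sandbox, the answer comes from the
// Sentry; on the host, it comes from Linux (and may require CAP_SYSLOG).
package main

import (
    "fmt"

    "golang.org/x/sys/unix"
)

func main() {
    // SYSLOG_ACTION_SIZE_BUFFER (10): size of the kernel log buffer.
    size, err := unix.Klogctl(10, nil)
    if err != nil {
        panic(err)
    }
    buf := make([]byte, size)
    // SYSLOG_ACTION_READ_ALL (3): non-destructively read the whole buffer.
    n, err := unix.Klogctl(3, buf)
    if err != nil {
        panic(err)
    }
    fmt.Print(string(buf[:n]))
}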
Note: runsc do gives the sandbox read-only access to the host’s entire filesystem by default, as runsc do is just a convenience feature to test out gVisor quickly. In real-world usage, when runsc is used as an OCI container runtime, host filesystem mappings are strictly defined by the OCI runtime configuration and gVisor will only expose the paths that the OCI configuration dictates should be exposed (and will pivot_root(2) away from being able to access any other host directory, for defense in depth). For this reason, when testing gVisor from a security standpoint, it’s better to install it as a Docker runtime, and then use it as follows:
$ sudo docker run --rm --runtime=runsc -it -v /tmp/vol:/vol ubuntu /bin/bash
This will spawn a Bash shell inside a gVisor sandbox, with access to the host directory /tmp/vol mapped to /vol inside the sandbox. You can then poke around within the sandbox and see if you can escape out to the host or glean information from it (other than the contents of /tmp/vol).