<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.0.0">Jekyll</generator><link href="/blog/index.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-03-04T22:36:44-06:00</updated><id>/blog/index.xml</id><entry><title type="html">Safe Ride into the Dangerzone: Reducing attack surface with gVisor</title><link href="/blog/2024/09/23/safe-ride-into-the-dangerzone/" rel="alternate" type="text/html" title=" Safe Ride into the Dangerzone: Reducing attack surface with gVisor" /><published>2024-09-23T00:00:00-05:00</published><updated>2024-09-23T00:00:00-05:00</updated><id>/blog/2024/09/23/dangerzone</id><content type="html" xml:base="/blog/2024/09/23/safe-ride-into-the-dangerzone/">&lt;p&gt;&lt;em&gt;This article was written in collaboration with the
&lt;a href=&quot;https://freedom.press&quot;&gt;Freedom of the Press Foundation&lt;/a&gt; and
&lt;a href=&quot;https://dangerzone.rocks/news/2024-09-23-gvisor&quot;&gt;cross-posted on the Dangerzone blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of the oft-repeated sound bites of computer security advice is: “Don’t open
random attachments from strangers.” If you are a journalist, however, opening
attachments and documents is part of your job description. Since journalists
already have a lot of security threats to worry about in dealing with sources,
the safe opening of documents should not be one of them.
&lt;a href=&quot;https://dangerzone.rocks&quot;&gt;Dangerzone&lt;/a&gt; was developed to solve this problem. It
lets you open suspicious documents with confidence and gets out of your way.&lt;/p&gt;

&lt;p&gt;For the past few months, members of the Dangerzone team and the
&lt;a href=&quot;https://gvisor.dev&quot;&gt;gVisor project&lt;/a&gt; collaborated on significantly improving the
security properties of Dangerzone. We’re excited to announce that &lt;strong&gt;as of
version 0.7.0, Dangerzone uses gVisor to secure its document conversion
process&lt;/strong&gt;. gVisor is already trusted by Google
&lt;a href=&quot;https://gvisor.dev/users&quot;&gt;and others&lt;/a&gt; to secure cloud products, scan Gmail
attachments for viruses, etc.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;If you’re an existing Dangerzone user on 0.7.0 scratching your head and thinking
“Well, I haven’t noticed anything different,” then first of all, “yay!” That was
the plan. And second, because the plan worked so deviously well, this change has
probably flown under the radar, so here are more than 3,000 words to remedy that.&lt;/p&gt;

&lt;p&gt;The rest of the article dives deep into Dangerzone’s security, describes how
gVisor works as a technology, and explains how Dangerzone’s security profile has
changed after this integration. Expect some technical terms and nerdery.&lt;/p&gt;

&lt;h2 id=&quot;how-dangerzone-works&quot;&gt;How Dangerzone works&lt;/h2&gt;

&lt;p&gt;Dangerzone’s purpose is to sanitize documents of any elements that can
compromise your computer or the source’s identity (think malware and document
metadata). To do this, it first renders the document into visual data (pixels)
and then turns this visual representation back into a readable document file.
The first part of this process (rendering the document into pixel data) is the
most security-critical part and, for the purpose of this article, we will zoom
in on just this.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;💡 For a broader understanding of how Dangerzone works, we encourage you to
read the &lt;a href=&quot;https://dangerzone.rocks/about/&quot;&gt;“About Dangerzone”&lt;/a&gt; section on the
Dangerzone website. Props to the &lt;a href=&quot;https://www.qubes-os.org/&quot;&gt;Qubes OS&lt;/a&gt; team,
who first popularized the concept that is now their
&lt;a href=&quot;https://blog.invisiblethings.org/2013/02/21/converting-untrusted-pdfs-into-trusted.html&quot;&gt;TrustedPDF feature&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In order to support a wide variety of document formats (PDF, office documents,
image formats, etc.), Dangerzone needs to open them with software that
potentially has security bugs. That may result in compromise of the user’s
device, personal files, and communication. This is the same risk you face when
you use your computer to open attachments from unknown sources. Dangerzone needs
to somehow isolate this process from the rest of your computer, so that anything
it does cannot “get out of the box”.&lt;/p&gt;

&lt;p&gt;Dangerzone’s isolation relies on &lt;strong&gt;Linux containers&lt;/strong&gt;. Containers are very handy
for two things: making software run the same way across operating systems and
separating that software from the rest of the machine.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-outline.svg&quot; alt=&quot;Diagram showing the Dangerzone UI sending a document to a document renderer, which converts it to pixels, and then receives the pixels back.&quot; /&gt;
&lt;figcaption&gt;Outline of how Dangerzone uses containers to render a document into pixels.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Dangerzone benefits from both of these aspects: Development and testing are made
easy by using containers’ cross-platform compatibility; and containers’
security, especially as Dangerzone configures them, offers strong isolation
guarantees. The
&lt;a href=&quot;https://freedom.press/news/dangerzone-receives-favorable-audit/&quot;&gt;security audit Dangerzone passed recently&lt;/a&gt;
is a testament to this.&lt;/p&gt;

&lt;p&gt;In computer security, the gold standard of isolation is &lt;strong&gt;virtual machines&lt;/strong&gt;.
VMs are what they sound like: a computer running within a computer. When running
a virtual machine, the “host” (outer) machine is protected from the actions of
the “guest” (inner) virtual machine. This is why the TrustedPDF feature of
Qubes OS uses disposable VMs as its isolation mechanism. Dangerzone also tried to
use VMs in the past, but implementing them in a multiplatform way proved
high-maintenance. Thus, Dangerzone switched back to containers, but the team
always wanted to improve Dangerzone’s security properties.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;💡 How does Dangerzone use Linux containers on Windows and macOS? It requires
&lt;a href=&quot;https://www.docker.com/products/docker-desktop/&quot;&gt;Docker Desktop&lt;/a&gt;, which runs
Linux inside a virtual machine and then runs Linux containers in it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;dangerzones-attack-surface&quot;&gt;Dangerzone’s attack surface&lt;/h2&gt;

&lt;p&gt;To understand how to protect Dangerzone users from exploits, it’s useful to
think like an attacker. When Dangerzone processes a malicious document within a
container, the first point of attack is the application that opens the
document. Dangerzone is designed with the assumption that determined attackers
will find a vulnerability in such applications and take control of them (check
out this &lt;a href=&quot;https://github.com/freedomofpress/dangerzone/blob/main/docs/advisories/2023-12-07.md&quot;&gt;security advisory from the Dangerzone team about a recent, critical
LibreOffice
vulnerability&lt;/a&gt;).
From there on, the next point of attack is to circumvent the Linux kernel
protections for the container or directly compromise the Linux kernel.&lt;/p&gt;

&lt;p&gt;The Linux kernel, even in Docker Desktop VMs, is a very privileged component. It
has access to sensitive data, such as other files on the user’s machine or the
user’s browser history, and to the user’s network.&lt;/p&gt;

&lt;p&gt;Processes in containers interface with the Linux kernel through
&lt;a href=&quot;https://en.wikipedia.org/wiki/System_call&quot;&gt;&lt;strong&gt;system calls&lt;/strong&gt;&lt;/a&gt; and
&lt;a href=&quot;https://opensource.com/article/19/3/virtual-filesystems-linux&quot;&gt;&lt;strong&gt;virtual filesystems&lt;/strong&gt;&lt;/a&gt;.
Attackers can try to take advantage of security bugs in these interfaces, so
it is critical to limit the container’s access to the Linux kernel. We call the
set of kernel interfaces that a container can reach the container’s
&lt;a href=&quot;https://en.wikipedia.org/wiki/Attack_surface&quot;&gt;&lt;strong&gt;attack surface&lt;/strong&gt;&lt;/a&gt;. The smaller
it is, the more secure a system is.&lt;/p&gt;

&lt;p&gt;Dangerzone tries to reduce its attack surface by multiple mechanisms available
to Linux containers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Removal of
&lt;a href=&quot;https://en.wikipedia.org/wiki/Capability-based_security&quot;&gt;process capabilities&lt;/a&gt;.
This reduces the set of permissions the container has in the kernel.&lt;/li&gt;
  &lt;li&gt;Removal of network access. This prevents the container from accessing the
internet to exfiltrate document data.&lt;/li&gt;
  &lt;li&gt;Filtering of allowed system calls through
&lt;a href=&quot;https://en.wikipedia.org/wiki/Seccomp&quot;&gt;seccomp&lt;/a&gt;. This reduces the set of
system calls (i.e., types of actions) that the container is allowed to make
to the kernel.&lt;/li&gt;
  &lt;li&gt;Minimal &lt;a href=&quot;https://en.wikipedia.org/wiki/User_identifier&quot;&gt;user ID&lt;/a&gt; mapping.
This reduces the risk that the container may access files belonging to users
other than the Dangerzone user on the same computer.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;💡 Check out the above protection measures in
&lt;a href=&quot;https://github.com/freedomofpress/dangerzone/blob/88a2d151ab4a3cb2f769998f27f251518d93bb45/dangerzone/isolation_provider/container.py#L188-L213&quot;&gt;Dangerzone’s codebase&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
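
&lt;p&gt;As a rough illustration of the mechanisms listed above, the sketch below shows
how similar restrictions can be requested on the Docker command line. It is not
Dangerzone’s actual invocation (the codebase link above is the authoritative
source): the image name &lt;code class=&quot;highlighter-rouge&quot;&gt;example/renderer&lt;/code&gt; and the profile path
&lt;code class=&quot;highlighter-rouge&quot;&gt;./seccomp.json&lt;/code&gt; are hypothetical placeholders. User ID mapping is
configured differently by each container manager, so it is left out here.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Illustrative only; image name and seccomp profile path are assumptions.
#   --cap-drop all           removes every process capability
#   --network none           leaves the container without network access
#   --security-opt seccomp=  applies a custom system call filter
#   --security-opt no-new-privileges  blocks privilege escalation via setuid
docker run --rm \
  --cap-drop all \
  --network none \
  --security-opt no-new-privileges \
  --security-opt seccomp=./seccomp.json \
  example/renderer
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;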

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-protections.svg&quot; alt=&quot;Diagram showing that the renderer and LibreOffice make system calls to the Linux kernel, to which several filters are applied.&quot; /&gt;
&lt;figcaption&gt;Container protections employed by Dangerzone prior to 0.7.0.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This provides the container with a fair degree of isolation from the Linux
kernel. However, some attack surface remains, since:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The computer’s user is still mapped in the container. This means that a
container escape would allow the attacker to access the user’s personal
files (browser data, documents, etc.); it would be more isolated if that
were not the case.&lt;/li&gt;
  &lt;li&gt;The system call filter is still relatively permissive. The specific system
calls that are blocked are dependent on the container manager and version in
use (see
&lt;a href=&quot;https://github.com/microsoft/docker/blob/master/docs/security/seccomp.md&quot;&gt;Docker’s filters, for example&lt;/a&gt;),
but in general, the system call filter only blocks obscure or
system-admin-only system calls (e.g., rebooting, modifying systemwide
settings). It does not block containers from opening arbitrary files or
interacting with the network stack, which can still be vectors for security
bugs.&lt;/li&gt;
  &lt;li&gt;The container’s root filesystem, while ephemeral, is still writable. This
allows attackers to exploit potential vulnerabilities in Linux’s filesystem
stack.&lt;/li&gt;
  &lt;li&gt;The Linux kernel is still exposed to the container. While it is possible to
reduce the attack surface available to the container to a minimum, this
architecture still requires that the container have direct access to Linux
via system calls. So if a Linux security bug can be triggered within the set
of filtered system calls, an attack may still be successful.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-protections-annotated.svg&quot; alt=&quot;Diagram highlighting how access to the Linux kernel and the relatively permissive system call filter may create exposure to bugs or vulnerabilities.&quot; /&gt;
&lt;figcaption&gt;Dangerzone's attack surface prior to 0.7.0, illustrated.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We’ve wanted to mitigate these risks for a while now, but we had to do so in a
cross-platform way and without burdening the user with administrative tasks.&lt;/p&gt;

&lt;p&gt;Enter gVisor.&lt;/p&gt;

&lt;h2 id=&quot;what-is-gvisor&quot;&gt;What is gVisor?&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://gvisor.dev&quot;&gt;&lt;strong&gt;gVisor&lt;/strong&gt;&lt;/a&gt; is a container security solution. In short, it
makes it much harder for malicious code to break out of the container boundary.
This was a great fit for Dangerzone’s security needs.&lt;/p&gt;

&lt;p&gt;An open source project written in Go, gVisor was released in May 2018 by Google
under the Apache 2.0 license. It runs on Linux and integrates with all popular
container management software, such as Docker, Podman, or Kubernetes. At its
core, gVisor is an &lt;strong&gt;application kernel&lt;/strong&gt; that implements a substantial portion
of the Linux system call interface. This means gVisor sits between a container
and the Linux kernel and plays both roles: from the container’s perspective,
gVisor acts as a &lt;strong&gt;kernel&lt;/strong&gt;, but from Linux’s perspective, gVisor is just a
regular &lt;strong&gt;application&lt;/strong&gt;. That means the container can no longer directly
interface with the Linux kernel. This is a massive reduction in attack surface.&lt;/p&gt;

&lt;p&gt;If you’re new to gVisor, the concept of not interfacing with the Linux kernel at
all may seem either quite vague or overly restrictive. That’s normal, so let’s
toy with this concept a bit for fun and illustrative purposes. Here’s a
perfectly normal sentence:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“A process opens a document on the filesystem”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And here’s how gVisor warps every single word in that sentence:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;“on the filesystem”: Nope, no such thing. The gVisor container runs in an
empty filesystem.&lt;/li&gt;
  &lt;li&gt;“opens a document”: Nuh-uh, the gVisor container does not even have the
permission to perform the &lt;code class=&quot;highlighter-rouge&quot;&gt;open&lt;/code&gt; system call. Also, there are no files to
open in the first place.&lt;/li&gt;
  &lt;li&gt;“A process”: Amusingly, the gVisor container does not even have the ability
to perform the &lt;code class=&quot;highlighter-rouge&quot;&gt;exec&lt;/code&gt; system calls. From the Linux kernel’s perspective, the
gVisor “process” looks like a typical multithreaded program, even while many
independent processes are running within the gVisor sandbox.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet, gVisor can containerize most applications without issue. For example,
the Dangerzone container image was not altered at all for the gVisor
integration.&lt;/p&gt;

&lt;p&gt;So what’s going on here?&lt;/p&gt;

&lt;p&gt;gVisor manages to pull the above trick with the help of two components:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Sentry&lt;/strong&gt; is the component that runs the containerized application. It
intercepts every system call that the application makes and reimplements it
in Go. As part of this, it may decide to make one or more system calls to the
host Linux kernel. However, it’s heavily restricted with a strict seccomp
filter (that’s why system calls like &lt;code class=&quot;highlighter-rouge&quot;&gt;open&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;socket&lt;/code&gt;, or &lt;code class=&quot;highlighter-rouge&quot;&gt;exec&lt;/code&gt; are not
allowed).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Gofer&lt;/strong&gt; is a component that runs outside the container and is responsible
for filesystem operations. The sentry may make I/O requests to the gofer.
The gofer will independently validate them, then perform these I/O
operations on the container’s behalf (that’s how the container can read
files from the host filesystem, even though &lt;code class=&quot;highlighter-rouge&quot;&gt;open&lt;/code&gt; is not allowed from the
sentry).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The above components are managed by a container runtime called &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt;, which
exposes the same interface as other container runtimes. This means it can be
integrated in other container management software like Podman, Docker, or
Kubernetes.&lt;/p&gt;
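
&lt;p&gt;For instance, on a Linux machine where &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; is already installed and
registered with Docker as a runtime named &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; (both are assumptions about
the local setup, not something Dangerzone requires of its users), selecting
gVisor for a container is a matter of one extra flag:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Assumes runsc is installed and configured as a Docker runtime.
# Run a container under gVisor instead of the default runtime.
docker run --runtime=runsc --rm hello-world

# Print the kernel log as seen from inside the sandbox; it comes from the
# gVisor kernel rather than from the host Linux kernel.
docker run --runtime=runsc --rm -it ubuntu dmesg
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;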

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-gvisor-outline.svg&quot; alt=&quot;Diagram showing a potentially vulnerable application running in the gVisor sandbox. gVisor Sentry implements the sandbox and intercepts all system calls. It services them either by making limited system calls of its own, or by asking gVisor Gofer to perform I/O system calls on its behalf. Both components are further restricted by a tailored kernel filter, along with other kernel protections.&quot; /&gt;
&lt;figcaption&gt;gVisor intercepting system calls from a sandboxed application&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;With the above architecture, gVisor blue-pills the application into thinking
that it interacts with a regular Linux kernel. In practice, gVisor reimplements
most basic features that Linux provides (memory management, scheduling, system
call interface, I/O, networking), and only issues system calls to the Linux
kernel when truly necessary, such as when it needs information from it (e.g.,
reading the document to be converted by Dangerzone).&lt;/p&gt;

&lt;p&gt;The gVisor kernel is designed to be difficult to break out of. Many of Linux’s
security woes stem from its use of C, which is a memory-unsafe language. By
contrast, gVisor is a regular Go application and inherits Go’s memory safety
features. This eliminates a large class of security vulnerabilities.&lt;/p&gt;

&lt;p&gt;The gVisor kernel also has a much smaller code footprint, because unlike a
traditional kernel like Linux, it does not have to deal with things like
hardware devices, and only implements a subset of the Linux kernel interface
that is sufficient for most applications to work in practice. Because of its
smaller implementation, there are fewer moving parts, and thus fewer
opportunities for bugs to exist.&lt;/p&gt;

&lt;p&gt;Beyond its kernel indirection, gVisor also hardens itself through a bunch of
security measures on startup, some of which are similar to regular containers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Running in its own set of namespaces (user namespace, process
namespace, network namespace, etc.) to further isolate it from the host.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;File access prevention&lt;/strong&gt;: Running in its own root with exactly zero host
files initially visible to it.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Privilege revocation&lt;/strong&gt;: Dropping all capabilities it has to ensure it runs
with the least privileges.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;System call filtering&lt;/strong&gt;: Setting a strict system call filter tuned for the
gVisor Sentry specifically.
    &lt;ul&gt;
      &lt;li&gt;As mentioned, unlike Docker or Podman’s default system call filter, this
is a &lt;em&gt;very restricted set&lt;/em&gt; of system calls. This filter blocks basic
operations like opening files, creating network connections, or
executing other processes. The presence of this filter does &lt;em&gt;not&lt;/em&gt;
prevent use of these system calls from within the gVisor sandbox;
instead, the gVisor kernel &lt;em&gt;intercepts and reimplements&lt;/em&gt; system calls
internally without needing to make a “real” system call out to the Linux
kernel.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The gofer also uses all of the above techniques to isolate itself as much as
possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gVisor kernel has been battle-tested by Google and other large companies
like Ant and Cloudflare. For example, searching for the text “GKE Sandbox”
(which uses gVisor) on the
&lt;a href=&quot;https://cloud.google.com/kubernetes-engine/security-bulletins&quot;&gt;GKE security bulletin&lt;/a&gt;
shows how often gVisor protects against Linux kernel vulnerabilities.
gVisor is also continuously &lt;a href=&quot;https://en.wikipedia.org/wiki/Fuzzing&quot;&gt;fuzz-tested&lt;/a&gt;
for bugs using &lt;a href=&quot;https://github.com/google/syzkaller/&quot;&gt;Syzkaller&lt;/a&gt;, an automated
kernel security testing tool.&lt;/p&gt;

&lt;p&gt;What’s the catch here? Applications that perform lots of system calls and heavy
I/O will see some performance degradation. Also, applications that rely on exotic
features of the Linux kernel may not work. In practice,
&lt;a href=&quot;https://gvisor.dev/docs/user_guide/compatibility&quot;&gt;the majority of applications do not suffer from this issue&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;integrating-gvisor-with-dangerzone&quot;&gt;Integrating gVisor with Dangerzone&lt;/h2&gt;

&lt;p&gt;So, gVisor looks like a strong candidate for Dangerzone, which is a relatively
simple application that does not perform a large number of system calls. Also,
gVisor conveniently offers a container runtime that is a drop-in replacement for
use with Docker/Podman. Therefore, integrating these two projects should be
really simple, right?&lt;/p&gt;

&lt;p&gt;Well, not so fast.&lt;/p&gt;

&lt;p&gt;Dangerzone is a &lt;em&gt;multiplatform&lt;/em&gt; application, and most of its users are on
Windows and macOS. Integrating gVisor just for Linux would not cut it. At the
same time, gVisor works only on Linux systems, so we were at an impasse.&lt;/p&gt;

&lt;p&gt;In what is, in retrospect, a classic case of
&lt;a href=&quot;https://en.wikipedia.org/wiki/Law_of_the_instrument&quot;&gt;Maslow’s hammer&lt;/a&gt;, we
decided to solve our container problems with yet another container. The idea is
simple: why not containerize gVisor and make it run on Docker Desktop? After
all, as we already pointed out, Docker Desktop runs Linux inside a virtual
machine.&lt;/p&gt;

&lt;p&gt;By doing so, Dangerzone now has two containers with different responsibilities:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;strong&gt;outer&lt;/strong&gt; Docker/Podman container acts as the &lt;strong&gt;portability&lt;/strong&gt; layer for
Dangerzone. Its main responsibility is to bundle the necessary config files,
scripts, and programs to run gVisor. It’s also responsible for bundling the
container image that gVisor will spawn a container from.&lt;/li&gt;
  &lt;li&gt;The &lt;strong&gt;inner&lt;/strong&gt; gVisor container acts as the &lt;strong&gt;isolation&lt;/strong&gt; layer for
Dangerzone. Its sole responsibility is to run the actual Dangerzone logic
for rendering documents to pixels.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-with-gvisor.svg&quot; alt=&quot;Diagram showing the Dangerzone UI sending a document to a document renderer within an inner container, which is protected by gVisor's Sentry. The Sentry intercepts system calls, allowing only limited system calls to pass to the Linux kernel with strict security settings. I/O system calls are handled by gVisor Gofer in an outer container, with less strict but controlled permissions&quot; /&gt;
&lt;figcaption&gt;Outline of how gVisor integrates with Dangerzone. There are now two nested containers, and each one brings its own protections. Usage of LibreOffice is implied.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Running gVisor inside a container came with its own set of challenges:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The Docker/Podman seccomp filter must allow the &lt;code class=&quot;highlighter-rouge&quot;&gt;ptrace&lt;/code&gt; system call. We
found that recent Docker Desktop versions and Podman version &amp;gt;= 4.0 have a
seccomp filter that allows this system call. For older versions, we
specified a custom seccomp filter that allowed it.&lt;/li&gt;
  &lt;li&gt;gVisor cannot run under SELinux in enforcing mode under default settings, so
we labeled the container with &lt;code class=&quot;highlighter-rouge&quot;&gt;container_engine_t&lt;/code&gt; (see GitHub issue
&lt;a href=&quot;https://github.com/freedomofpress/dangerzone/issues/880&quot;&gt;#880&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;The Docker/Podman container must run with the &lt;code class=&quot;highlighter-rouge&quot;&gt;SYS_CHROOT&lt;/code&gt; capability. This
is needed by gVisor to restrict its own access to the filesystem before it
starts document processing. Other than that, the &lt;strong&gt;outer&lt;/strong&gt; container drops
all other capabilities and privileges.&lt;/li&gt;
&lt;/ul&gt;
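
&lt;p&gt;Put together, the requirements above translate into a handful of flags on the
outer container. The following is a hedged sketch using Podman rather than a
copy of Dangerzone’s real command line: the profile path
&lt;code class=&quot;highlighter-rouge&quot;&gt;./seccomp.gvisor.json&lt;/code&gt; and the image name &lt;code class=&quot;highlighter-rouge&quot;&gt;example/outer&lt;/code&gt; are
hypothetical, and the custom seccomp profile is only needed on versions whose
default filter blocks &lt;code class=&quot;highlighter-rouge&quot;&gt;ptrace&lt;/code&gt;.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Illustrative only; file and image names are assumptions.
#   --cap-drop all / --cap-add SYS_CHROOT  keeps a single capability
#   label=type:container_engine_t          SELinux label that lets gVisor run
#   seccomp=...                            filter that allows ptrace
podman run --rm \
  --cap-drop all \
  --cap-add SYS_CHROOT \
  --security-opt label=type:container_engine_t \
  --security-opt seccomp=./seccomp.gvisor.json \
  --network none \
  example/outer
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;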

&lt;blockquote&gt;
  &lt;p&gt;💡 You can find more details about this integration in Dangerzone’s
&lt;a href=&quot;https://github.com/freedomofpress/dangerzone/blob/main/docs/developer/gvisor.md&quot;&gt;gVisor design doc&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;dangerzone-protections&quot;&gt;Dangerzone protections&lt;/h2&gt;

&lt;p&gt;We talked about Dangerzone’s original attack surface, and how we integrated
gVisor to reduce it. In practice though, in what ways is Dangerzone better off
than before? Well, if the Matryoshka containers are giving you a headache, or
you just skimmed to this section (no shade), here’s how the new Dangerzone
protections fare against the previous version, and the default protections of
Linux containers:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;🛡️ &lt;strong&gt;Protections&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Default&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Dangerzone (0.6.1)&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Dangerzone + gVisor (0.7.0)&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;🐧 &lt;strong&gt;Linux kernel&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Exposed&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 Exposed&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;🎉 Not exposed&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🛠️ &lt;strong&gt;System call filter&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Moderate&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 Moderate&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Strict&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🛠️ &lt;strong&gt;Capabilities&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Default&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 None&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 None&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;👤 &lt;strong&gt;Host user&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Mapped&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 Mapped&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Unmapped&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;📁 &lt;strong&gt;Filesystem&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Exposed&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 Writable&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Read-only&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🌐 &lt;strong&gt;Network&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Exposed&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Disabled&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;✌️ Disabled at two levels&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🔒 &lt;strong&gt;SELinux&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Yes (&lt;code class=&quot;highlighter-rouge&quot;&gt;container_t&lt;/code&gt;)&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Yes (&lt;code class=&quot;highlighter-rouge&quot;&gt;container_t&lt;/code&gt;)&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Yes (&lt;code class=&quot;highlighter-rouge&quot;&gt;container_engine_t&lt;/code&gt;)&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🖥️ &lt;strong&gt;Hardware Virtualization&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;None&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 None&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 None&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, the most important protection is that &lt;strong&gt;the document conversion
process no longer has access to the Linux kernel&lt;/strong&gt;. Instead, it only has access
to the gVisor kernel (in the Sentry), and must break out of it before it can
reach the Linux kernel that it had direct access to before the gVisor integration.&lt;/p&gt;

&lt;p&gt;Additionally, Dangerzone itself configures the two containers to be more secure
with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Privilege revocation: Removing all privileges and capabilities of the
document conversion process in the &lt;strong&gt;inner container&lt;/strong&gt;, and minimizing the
set of capabilities granted to the &lt;strong&gt;outer container&lt;/strong&gt; to just &lt;code class=&quot;highlighter-rouge&quot;&gt;SYS_CHROOT&lt;/code&gt;
and no other.&lt;/li&gt;
  &lt;li&gt;File modification prevention: Making the &lt;strong&gt;inner container&lt;/strong&gt;’s root
filesystem read-only.&lt;/li&gt;
  &lt;li&gt;User isolation: Running the &lt;strong&gt;outer container&lt;/strong&gt; in a user namespace that
does not include the Dangerzone UI user (available in Linux distributions
with Podman version 4.1 or greater).&lt;/li&gt;
  &lt;li&gt;Kernel security settings: Setting the &lt;strong&gt;outer container&lt;/strong&gt;’s system call
filter and SELinux label settings.&lt;/li&gt;
  &lt;li&gt;Host access prevention: Not using any mounts in either container.&lt;/li&gt;
  &lt;li&gt;Network access prevention: Disabling both containers’ ability to use
networking.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-with-gvisor-annotated.svg&quot; alt=&quot;Diagram highlighting how gVisor mitigates against bugs and vulnerabilities in the inner container, including exploits which escalate privileges to the outer container.&quot; /&gt;
&lt;figcaption&gt;Explanation of how Dangerzone's latest protections limit its attack surface.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Integrating the gVisor project with Dangerzone was very exciting: It’s a good
example of how gVisor can add another line of defense to a project without
requiring application-level changes.&lt;/p&gt;

&lt;p&gt;At the same time, the design complexity of the Dangerzone project increased a
bit, mostly to cater to its cross-platform nature, but honestly not that much.
Dangerzone is strongly security-focused, so we believe it’s worth the cost.&lt;/p&gt;

&lt;p&gt;We hope that this article demystifies some security aspects of containers, so
that you can use Dangerzone and gVisor with even more confidence. Feel free to
reach out to us with any questions or comments:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://notmyidea.org&quot;&gt;Alexis Métaireau&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://freedom.press/people/alex-p&quot;&gt;Alex Pyrgiotis&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://perot.me&quot;&gt;Etienne Perot&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://freedom.press/contact/&quot;&gt;Freedom of the Press Foundation (FPF)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://gvisor.dev/community&quot;&gt;gVisor community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name>almet</name></author><summary type="html">This article was written in collaboration with the Freedom of the Press Foundation and cross-posted on the Dangerzone blog. One of the oft-repeated sound bites of computer security advice is: “Don’t open random attachments from strangers.” If you are a journalist, however, opening attachments and documents is part of your job description. Since journalists already have a lot of security threats to worry about in dealing with sources, the safe opening of documents should not be one of them. Dangerzone was developed to solve this problem. It lets you open suspicious documents with confidence and gets out of your way. For the past few months, members of the Dangerzone team and the gVisor project collaborated on significantly improving the security properties of Dangerzone. We’re excited to announce that as of version 0.7.0, Dangerzone uses gVisor to secure its document conversion process. It is already trusted by Google and others to secure cloud products, scan Gmail attachments for viruses, etc.</summary></entry><entry><title type="html">Optimizing seccomp usage in gVisor</title><link href="/blog/2024/02/01/seccomp/" rel="alternate" type="text/html" title=" Optimizing seccomp usage in gVisor" /><published>2024-02-01T00:00:00-06:00</published><updated>2024-02-01T00:00:00-06:00</updated><id>/blog/2024/02/01/seccomp</id><content type="html" xml:base="/blog/2024/02/01/seccomp/">&lt;p&gt;gVisor is a multi-layered security sandbox. &lt;a href=&quot;https://www.kernel.org/doc/html/v4.19/userspace-api/seccomp_filter.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;&lt;/a&gt; is
gVisor’s second layer of defense against container escape attacks. gVisor uses
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; to have the host kernel filter gVisor’s own syscalls. This
significantly reduces the host attack surface that a compromised gVisor process
can access. However, this layer comes at a cost: every legitimate system call that
gVisor makes must be evaluated against this filter by the host kernel before it
is actually executed. &lt;strong&gt;This blog post contains more than you ever wanted to
know about &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;, and explores the past few months of work to optimize
gVisor’s use of it.&lt;/strong&gt;&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp.png&quot; alt=&quot;gVisor and seccomp&quot; title=&quot;gVisor and seccomp&quot; style=&quot;max-width:100%&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;A diagram showing gVisor’s two main layers of
security: gVisor itself, and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;. This blog post touches on the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; part.
&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Tux.svg&quot;&gt;Tux logo by Larry Ewing and The GIMP&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;h2 id=&quot;performance-considerations&quot;&gt;Understanding &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; performance in gVisor&lt;/h2&gt;

&lt;p&gt;One challenge with gVisor performance improvement ideas is that it is often very
difficult to estimate how much they will impact performance without first doing
most of the work necessary to actually implement them. Profiling tools help with
knowing where to look, but going from there to numbers is difficult.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; is one area which is actually much more straightforward to
estimate. Because it is a secondary layer of defense that lives outside of
gVisor, and it is merely a filter, we can simply yank it out of gVisor and
benchmark the performance we get. While running gVisor in this way is strictly
&lt;strong&gt;less secure&lt;/strong&gt; and not a mode that gVisor should support, the numbers we get in
this manner do provide an upper bound on the maximum &lt;em&gt;potential&lt;/em&gt; performance
gains we could see from optimizations within gVisor’s use of &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To visualize this, we can run a benchmark with the following variants:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Unsandboxed&lt;/strong&gt;: Unsandboxed performance without gVisor.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;gVisor&lt;/strong&gt;: gVisor from before any of the performance improvements described
later in this post.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;gVisor with empty filter&lt;/strong&gt;: Same as &lt;strong&gt;gVisor&lt;/strong&gt;, but with the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;
filter replaced with one that unconditionally approves every system call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From these three variants, we can separate the overhead that comes from gVisor
itself from the overhead that comes from &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering. The difference
between &lt;strong&gt;gVisor&lt;/strong&gt; and &lt;strong&gt;Unsandboxed&lt;/strong&gt; represents the total gVisor performance
overhead, and the difference between &lt;strong&gt;gVisor&lt;/strong&gt; and &lt;strong&gt;gVisor with empty filter&lt;/strong&gt;
represents the performance overhead of gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering rules.&lt;/p&gt;

&lt;p&gt;Let’s run these numbers for the ABSL build benchmark:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-absl-empty-filter.png&quot; alt=&quot;ABSL seccomp-bpf performance&quot; title=&quot;ABSL seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can now use these numbers to give a rough breakdown of where the overhead is
coming from:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-absl-breakdown.png&quot; alt=&quot;ABSL seccomp-bpf performance breakdown&quot; title=&quot;ABSL seccomp-bpf performance breakdown&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; overhead is small in absolute terms. The numbers suggest that
optimizing &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters can shave &lt;strong&gt;up to&lt;/strong&gt; 3.4 seconds off the
total ABSL build time, a reduction of ~3.6% in total runtime. However, relative
to gVisor’s overhead over unsandboxed time, this means that optimizing the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters may remove &lt;strong&gt;up to&lt;/strong&gt; ~15% of gVisor overhead, which is
significant. &lt;em&gt;(Not all benchmarks have this behavior; some benchmarks show
smaller &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;-related overhead. The overhead is also highly
platform-dependent.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Of course, this level of performance is what was reached with &lt;strong&gt;empty
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering rules&lt;/strong&gt;, so we cannot hope to reach this level of
performance gains. However, it is still useful as an upper bound. Let’s see how
much of it we can recoup without compromising security.&lt;/p&gt;

&lt;h2 id=&quot;a-primer-on-bpf-and-seccomp-bpf&quot;&gt;A primer on BPF and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;&lt;/h2&gt;

&lt;h3 id=&quot;bpf-cbpf-ebpf-oh-my&quot;&gt;BPF, cBPF, eBPF, oh my!&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Berkeley_Packet_Filter&quot;&gt;BPF (Berkeley Packet Filter)&lt;/a&gt; is a virtual machine and eponymous machine
language. Its name comes from its original purpose: filtering packets in a
kernel network stack. However, its use has expanded to other domains of the
kernel where programmability is desirable. Syscall filtering in the context of
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; is one such area.&lt;/p&gt;

&lt;p&gt;BPF itself comes in two dialects: “Classic BPF” (sometimes stylized as cBPF),
and the now-more-well-known &lt;a href=&quot;https://en.wikipedia.org/wiki/EBPF&quot;&gt;“Extended BPF” (commonly known as eBPF)&lt;/a&gt;.
eBPF is a superset of cBPF and is usable extensively throughout the kernel.
However, &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; is not one such area. While
&lt;a href=&quot;https://lwn.net/Articles/857228/&quot;&gt;the topic has been heavily debated&lt;/a&gt;, the
status quo remains that &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; filters may only use cBPF, so this post will
focus on cBPF alone.&lt;/p&gt;

&lt;h3 id=&quot;so-what-is-seccomp-bpf-exactly&quot;&gt;So what is &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; exactly?&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; is a part of the Linux kernel which allows a program to impose
syscall filters on itself. A &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter is a cBPF program that is
given syscall data as input, and outputs an “action” (a 32-bit integer) telling
the kernel how to handle this system call: allow it, reject it, crash the program, trap
execution, etc. The kernel evaluates the cBPF program on every system call the
application makes. The “input” of this cBPF program is the byte layout of the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct, which can be loaded into the registers of the cBPF
virtual machine for analysis.&lt;/p&gt;

&lt;p&gt;Here’s what the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct looks like in
&lt;a href=&quot;https://github.com/torvalds/linux/blob/master/include/uapi/linux/seccomp.h&quot;&gt;Linux’s &lt;code class=&quot;highlighter-rouge&quot;&gt;include/uapi/linux/seccomp.h&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seccomp_data&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;// 32 bits&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__u32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                 &lt;span class=&quot;c1&quot;&gt;// 32 bits&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__u64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;instruction_pointer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// 64 bits&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__u64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;              &lt;span class=&quot;c1&quot;&gt;// 64 bits × 6&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;                              &lt;span class=&quot;c1&quot;&gt;// Total 512 bits&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;sample-filter&quot;&gt;Sample &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter&lt;/h3&gt;

&lt;p&gt;Here is an example &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter, adapted from the
&lt;a href=&quot;https://www.kernel.org/doc/Documentation/networking/filter.txt&quot;&gt;Linux kernel documentation&lt;/a&gt;&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;!-- Markdown note: This uses &quot;javascript&quot; syntax highlighting because that
     happens to work pretty well with this pseudo-assembly-like language.
     It is not actually JavaScript. --&gt;

&lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load 32 bits at offsetof(struct seccomp_data, arch) (= 4)&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   of the seccomp_data input struct into register A.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xc000003e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == AUDIT_ARCH_X86_64, jump by 0 instructions [to 02]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 11 instructions [to 13].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load 32 bits at offsetof(struct seccomp_data, nr) (= 0)&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   of the seccomp_data input struct into register A.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// If A == __NR_rt_sigreturn, jump by 10 instructions [to 14]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 0 instructions [to 04].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;231&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// If A == __NR_exit_group, jump by 9 instructions [to 14]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 0 instructions [to 05].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// If A == __NR_exit, jump by 8 instructions [to 14]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 0 instructions [to 06].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;06&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_read.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_write.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_fstat.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;09&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_mmap.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_rt_sigprocmask.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_rt_sigaction.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;35&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// If A == __NR_nanosleep, jump by 1 instruction [to 14]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 0 instructions [to 13].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Return SECCOMP_RET_KILL_THREAD&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x7fff0000&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Return SECCOMP_RET_ALLOW&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This filter effectively allows only the following syscalls: &lt;code class=&quot;highlighter-rouge&quot;&gt;rt_sigreturn&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;exit_group&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;exit&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;read&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;write&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;fstat&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;mmap&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;rt_sigprocmask&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;rt_sigaction&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;nanosleep&lt;/code&gt;. All other syscalls result in the calling thread
being killed.&lt;/p&gt;
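&lt;p&gt;To make the jump arithmetic concrete, here is a toy interpreter for the sample filter above, written in Go. It is only a sketch: it supports just the three instruction kinds used in the sample, assumes a little-endian input layout (as on x86_64), and every type and function name in it is made up for illustration rather than taken from the kernel or from gVisor.&lt;/p&gt;

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type op int

const (
	opLoad32 op = iota // A = 32 bits of input at byte offset k
	opJeq              // if A == k, skip jt instructions, else skip jf
	opReturn           // stop and return k
)

// insn is a simplified stand-in for a cBPF instruction.
type insn struct {
	op     op
	k      uint32
	jt, jf uint8
}

const (
	auditArchX8664 = 0xc000003e
	retKillThread  = 0x00000000
	retAllow       = 0x7fff0000
)

// sampleFilter is the 15-instruction program from the listing above.
var sampleFilter = []insn{
	{opLoad32, 4, 0, 0},
	{opJeq, auditArchX8664, 0, 11},
	{opLoad32, 0, 0, 0},
	{opJeq, 15, 10, 0},
	{opJeq, 231, 9, 0},
	{opJeq, 60, 8, 0},
	{opJeq, 0, 7, 0},
	{opJeq, 1, 6, 0},
	{opJeq, 5, 5, 0},
	{opJeq, 9, 4, 0},
	{opJeq, 14, 3, 0},
	{opJeq, 13, 2, 0},
	{opJeq, 35, 1, 0},
	{opReturn, retKillThread, 0, 0},
	{opReturn, retAllow, 0, 0},
}

// run executes prog against the raw bytes of a seccomp_data struct.
// Jumps are relative to the instruction that follows the jump.
func run(prog []insn, data []byte) uint32 {
	var a uint32
	pc := 0
	for {
		in := prog[pc]
		switch in.op {
		case opLoad32:
			a = binary.LittleEndian.Uint32(data[in.k : in.k+4])
		case opJeq:
			if a == in.k {
				pc += int(in.jt)
			} else {
				pc += int(in.jf)
			}
		case opReturn:
			return in.k
		}
		pc++
	}
}

// evaluate builds a 64-byte seccomp_data with only nr and arch set.
func evaluate(arch uint32, nr int32) uint32 {
	data := make([]byte, 64)
	binary.LittleEndian.PutUint32(data[0:4], uint32(nr))
	binary.LittleEndian.PutUint32(data[4:8], arch)
	return run(sampleFilter, data)
}

func main() {
	fmt.Printf("read:   %#x\n", evaluate(auditArchX8664, 0))
	fmt.Printf("openat: %#x\n", evaluate(auditArchX8664, 257))
}
```

&lt;p&gt;Note that a syscall made under any architecture other than x86_64 also ends at instruction 13, regardless of its number.&lt;/p&gt;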

&lt;h3 id=&quot;cbpf-limitations&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; and cBPF limitations&lt;/h3&gt;

&lt;p&gt;cBPF is quite limited as a language. The following limitations all factor into
the optimizations described in this blog post:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The cBPF virtual machine only has two 32-bit registers, and a tertiary
pseudo-register for a 32-bit immediate value. (Note that syscall arguments
evaluated in the context of &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; are 64-bit values, so you can already
foresee that this leads to complications.)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs are limited to 4,096 instructions.&lt;/li&gt;
  &lt;li&gt;Jump instructions can only go forward (this ensures that programs must
halt).&lt;/li&gt;
  &lt;li&gt;Jump instructions may only jump by a fixed (“immediate”) number of
instructions. (You cannot say: “jump by whatever this register says”.)&lt;/li&gt;
  &lt;li&gt;Jump instructions come in two flavors:
    &lt;ul&gt;
      &lt;li&gt;“Unconditional” jump instructions, which jump by a fixed number of
instructions. This number must fit in 16 bits.&lt;/li&gt;
      &lt;li&gt;“Conditional” jump instructions, which include a condition expression
and two jump targets:
        &lt;ul&gt;
          &lt;li&gt;The number of instructions to jump by if the condition is true. This
number must fit in 8 bits, so this cannot jump by more than 255
instructions.&lt;/li&gt;
          &lt;li&gt;The number of instructions to jump by if the condition is false.
This number must fit in 8 bits, so this cannot jump by more than 255
instructions.&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;seccomp-bpf-caching-in-linux&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; caching in Linux&lt;/h3&gt;

&lt;p&gt;Since
&lt;a href=&quot;https://www.phoronix.com/news/Linux-5.11-SECCOMP-Performance&quot;&gt;Linux kernel version 5.11&lt;/a&gt;,
when a program uploads a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter into the kernel,
&lt;a href=&quot;https://github.com/torvalds/linux/commit/8e01b51a31a1e08e2c3e8fcc0ef6790441be2f61&quot;&gt;Linux runs a BPF emulator&lt;/a&gt;
that looks for system call numbers where the BPF program doesn’t do any fancy
operations nor load any bits from the &lt;code class=&quot;highlighter-rouge&quot;&gt;instruction_pointer&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;args&lt;/code&gt; fields of
the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; input struct, and still returns “allow”. When this is the
case, &lt;strong&gt;Linux will cache this information&lt;/strong&gt; in a per-syscall-number bitfield.&lt;/p&gt;

&lt;p&gt;Later, when a cacheable syscall number is executed, the BPF program is not
evaluated at all; since the kernel knows that the program is deterministic and
doesn’t depend on the syscall arguments, it can safely allow the syscall without
actually running the BPF program.&lt;/p&gt;

&lt;p&gt;This post uses the term “cacheable” to refer to syscalls that match these
criteria.&lt;/p&gt;
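&lt;p&gt;The cacheability check can be pictured as a dry run of the filter for each syscall number. The Go sketch below is a heavily simplified model of that idea, not the kernel’s actual emulator: it handles only three made-up instruction kinds, and the example program and all names are hypothetical.&lt;/p&gt;

```go
package main

import "fmt"

type op int

const (
	opLoad32 op = iota // A = 32 bits of input at byte offset k
	opJeq              // if A == k, skip jt instructions, else skip jf
	opReturn           // stop and return k
)

type insn struct {
	op     op
	k      uint32
	jt, jf uint8
}

const (
	archX8664 = 0xc000003e
	retAllow  = 0x7fff0000
)

// exampleFilter allows read (0) unconditionally, and write (1) only when
// args[0] == 1. Everything else is rejected.
var exampleFilter = []insn{
	{opLoad32, 4, 0, 0},      // 00: A = arch
	{opJeq, archX8664, 0, 7}, // 01: wrong arch: go to 09
	{opLoad32, 0, 0, 0},      // 02: A = nr
	{opJeq, 0, 6, 0},         // 03: read: go to 10
	{opJeq, 1, 0, 4},         // 04: not write: go to 09
	{opLoad32, 16, 0, 0},     // 05: A = low half of args[0]
	{opJeq, 1, 0, 2},         // 06: low half is not 1: go to 09
	{opLoad32, 20, 0, 0},     // 07: A = high half of args[0]
	{opJeq, 0, 1, 0},         // 08: high half is 0: go to 10
	{opReturn, 0, 0, 0},      // 09: reject
	{opReturn, retAllow, 0, 0}, // 10: allow
}

// cacheable dry-runs prog for one (arch, nr) pair. It reports true only if
// the program reaches "allow" after reading nothing but nr and arch.
func cacheable(prog []insn, arch, nr uint32) bool {
	var a uint32
	pc := 0
	for {
		in := prog[pc]
		switch in.op {
		case opLoad32:
			switch in.k {
			case 0:
				a = nr
			case 4:
				a = arch
			default:
				// The verdict depends on instruction_pointer or args.
				return false
			}
		case opJeq:
			if a == in.k {
				pc += int(in.jt)
			} else {
				pc += int(in.jf)
			}
		case opReturn:
			return in.k == retAllow
		}
		pc++
	}
}

// cacheableSet computes one flag per syscall number in [0, count).
func cacheableSet(prog []insn, arch uint32, count int) []bool {
	flags := make([]bool, count)
	for nr := range flags {
		flags[nr] = cacheable(prog, arch, uint32(nr))
	}
	return flags
}

func main() {
	fmt.Println(cacheableSet(exampleFilter, archX8664, 3))
}
```

&lt;p&gt;In this model, &lt;code class=&quot;highlighter-rouge&quot;&gt;read&lt;/code&gt; is cacheable, &lt;code class=&quot;highlighter-rouge&quot;&gt;write&lt;/code&gt; is not (its verdict depends on an argument), and a rejected syscall is never cacheable.&lt;/p&gt;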

&lt;h2 id=&quot;how-gvisor-builds-its-seccomp-bpf-filter&quot;&gt;How gVisor builds its &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter&lt;/h2&gt;

&lt;p&gt;gVisor imposes a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter on itself as part of Sentry start-up. This
process works as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;gVisor gathers bits of configuration that are relevant to the construction
of its &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter. This includes which platform is in use, whether
certain features that require looser filtering are enabled (e.g. host
networking, profiling, GPU proxying, etc.), and certain file descriptors
(FDs) which may be checked against syscall arguments that pass in FDs.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor generates a sequence of rulesets from this configuration. A ruleset
is a mapping from syscall number to a predicate that must be true for this
system call, along with an “action” (return code) that is taken should this
predicate be satisfied. For ease of human understanding, the predicate is
often written as a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Logical_disjunction&quot;&gt;disjunctive rule&lt;/a&gt;, for
which each sub-rule is a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Logical_conjunction&quot;&gt;conjunctive rule&lt;/a&gt; that
verifies each syscall argument. In other words, &lt;code class=&quot;highlighter-rouge&quot;&gt;(fA(args[0]) &amp;amp;&amp;amp; fB(args[1])
&amp;amp;&amp;amp; ...) || (fC(args[0]) &amp;amp;&amp;amp; fD(args[1]) &amp;amp;&amp;amp; ...) || ...&lt;/code&gt;. This is represented
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/runsc/boot/filter/config/config_main.go&quot;&gt;in gVisor code&lt;/a&gt;
as follows:&lt;/p&gt;

    &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;          &lt;span class=&quot;c&quot;&gt;// Disjunction rule&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Conjunction rule over each syscall argument&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Predicate for `seccomp_data.args[0]`&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fB&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Predicate for `seccomp_data.args[1]`&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;// ... More predicates can go here (up to 6 arguments per syscall)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Conjunction rule over each syscall argument&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Predicate for `seccomp_data.args[0]`&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Predicate for `seccomp_data.args[1]`&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;// ... More predicates can go here (up to 6 arguments per syscall)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor performs several optimizations on this data structure.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor then renders this list of rulesets into a linear program that looks
close to the final machine language, other than jump offsets which are
initially represented as symbolic named labels during the rendering process.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor then resolves all the labels to their actual instruction index, and
computes the actual jump targets of all jump instructions to obtain valid
cBPF machine code.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor runs further optimizations on this cBPF bytecode.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Finally, the cBPF bytecode is uploaded into the host kernel and the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter becomes effective.&lt;/li&gt;
&lt;/ul&gt;
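&lt;p&gt;To show what the ruleset step above expresses, here is a small Go model of a disjunction of per-argument conjunctions, evaluated directly instead of being compiled to cBPF. It is a sketch under our own assumptions: the types and helpers below are hypothetical stand-ins and not gVisor’s actual definitions, and the “action” is reduced to a plain allow/deny boolean.&lt;/p&gt;

```go
package main

import "fmt"

// argPredicate checks a single 64-bit syscall argument.
type argPredicate func(arg uint64) bool

// perArg is a conjunction: predicate i must hold for argument i.
// It must not list more than 6 predicates.
type perArg []argPredicate

// or is a disjunction: at least one perArg rule must hold.
type or []perArg

// ruleset maps a syscall number to the predicate that must be true.
type ruleset map[uint32]or

func equalTo(want uint64) argPredicate {
	return func(arg uint64) bool { return arg == want }
}

func anyValue(uint64) bool { return true }

func (p perArg) matches(args [6]uint64) bool {
	for i, pred := range p {
		if !pred(args[i]) {
			return false
		}
	}
	return true
}

func (o or) matches(args [6]uint64) bool {
	for _, rule := range o {
		if rule.matches(args) {
			return true
		}
	}
	return false
}

// allows reports whether the syscall passes; unknown syscalls are denied.
func (r ruleset) allows(nr uint32, args [6]uint64) bool {
	rule, ok := r[nr]
	if !ok {
		return false
	}
	return rule.matches(args)
}

func main() {
	// Hypothetical ruleset: allow write (syscall 1) only to FD 1 or FD 2.
	rules := ruleset{
		1: or{
			perArg{equalTo(1), anyValue, anyValue},
			perArg{equalTo(2), anyValue, anyValue},
		},
	}
	fmt.Println(rules.allows(1, [6]uint64{1}))
	fmt.Println(rules.allows(1, [6]uint64{7}))
	fmt.Println(rules.allows(0, [6]uint64{1}))
}
```

&lt;p&gt;The real pipeline never evaluates rules this way at runtime; it renders them into cBPF so that the host kernel can evaluate them.&lt;/p&gt;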

&lt;p&gt;Optimizing the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter to be more efficient allows the program to
be more compact (i.e. it’s possible to pack more complex filters in the 4,096
instruction limit), and to run faster. While &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; evaluation is
measured in nanoseconds, the impact of any optimization is magnified here,
because host syscalls are an important part of the synchronous “syscall hot
path” that must execute as part of handling certain performance-sensitive
syscalls from the sandboxed application. The relationship is not 1-to-1: a single
application syscall may result in several host syscalls, especially due to
&lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt; which the Sentry calls many times to synchronize its own operations.
Therefore, shaving a nanosecond here and there results in several shaved
nanoseconds in the syscall hot path.&lt;/p&gt;

&lt;h2 id=&quot;structure&quot;&gt;Structural optimizations&lt;/h2&gt;

&lt;p&gt;The first optimization done for gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; was to turn its linear
search over syscall numbers into a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Binary_search_tree&quot;&gt;binary search tree&lt;/a&gt;. This
turns the search for syscall numbers from &lt;code class=&quot;highlighter-rouge&quot;&gt;O(n)&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;O(log n)&lt;/code&gt; instructions.
This is a very common &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; optimization technique which is replicated
in other projects such as
&lt;a href=&quot;https://github.com/seccomp/libseccomp/issues/116&quot;&gt;libseccomp&lt;/a&gt; and Chromium.&lt;/p&gt;

&lt;p&gt;To do this, a cBPF program basically loads the 32-bit &lt;code class=&quot;highlighter-rouge&quot;&gt;nr&lt;/code&gt; (syscall number)
field of the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct, and does a binary tree traversal of the
&lt;a href=&quot;https://chromium.googlesource.com/chromiumos/docs/+/HEAD/constants/syscalls.md#tables&quot;&gt;syscall number space&lt;/a&gt;.
When it finds a match, it jumps to a set of instructions that check that
syscall’s arguments for validity, and then returns allow/reject.&lt;/p&gt;

&lt;p&gt;But why stop here? Let’s go further.&lt;/p&gt;

&lt;p&gt;The problem with the binary search tree approach is that it treats all syscall
numbers equally. This is a problem for three reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Good performance does not matter for disallowed syscalls, because
such syscalls should never happen during normal program execution.&lt;/li&gt;
  &lt;li&gt;Good performance does not matter for syscalls which can be cached
by the kernel, because the BPF program only has to run once for these
system calls.&lt;/li&gt;
  &lt;li&gt;For the system calls which are allowed but are not cacheable by the kernel,
there is a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Pareto_distribution&quot;&gt;Pareto distribution&lt;/a&gt; of
their relative frequency. To exploit this we should evaluate the most-often
used syscalls faster than the least-often used ones. The binary tree
structure does not exploit this distribution, and instead treats all
syscalls equally.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So gVisor splits syscall numbers into four sets:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;🅰: Non-cacheable 🅰llowed, called very frequently.&lt;/li&gt;
  &lt;li&gt;🅱: Non-cacheable allowed, called once in a 🅱lue moon.&lt;/li&gt;
  &lt;li&gt;🅲: 🅲acheable allowed (whether called frequently or not).&lt;/li&gt;
  &lt;li&gt;🅳: 🅳isallowed (which, by definition, is neither cacheable nor expected to
ever be called).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, the cBPF program is structured in the following layout:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Linear search over allowed frequently-called non-cacheable syscalls (🅰).
These syscalls are ordered most-frequently-called first (e.g. &lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt;
is the first one as it is by far the most-frequently-called system call).&lt;/li&gt;
  &lt;li&gt;Binary search over allowed infrequently-called non-cacheable syscalls (🅱).&lt;/li&gt;
  &lt;li&gt;Binary search over allowed cacheable syscalls (🅲).&lt;/li&gt;
  &lt;li&gt;Reject anything else (🅳).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure takes full advantage of the kernel caching functionality, and of
the Pareto distribution of syscalls.&lt;/p&gt;
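
&lt;p&gt;As a rough model of this lookup order (a sketch only, not gVisor’s actual
code), the Python below checks the sets in the same sequence. The syscall
numbers and their assignment to each set are placeholders chosen for
illustration, and &lt;code class=&quot;highlighter-rouge&quot;&gt;classify&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
from bisect import bisect_left

# Placeholder syscall numbers, chosen only for illustration.
HOT = [202, 1, 0]          # set A: ordered most-frequently-called first
RARE = [39, 60, 231]       # set B: sorted, searched by bisection
CACHEABLE = [3, 12, 257]   # set C: sorted, searched by bisection


def classify(nr):
    """Return which set a syscall number falls in, checking A, B, C, then D."""
    for candidate in HOT:  # linear search: the hottest entry costs one comparison
        if candidate == nr:
            return "A"
    for label, table in (("B", RARE), ("C", CACHEABLE)):
        pos = bisect_left(table, nr)  # binary search over a sorted list
        if pos != len(table) and table[pos] == nr:
            return label
    return "D"  # anything else is rejected
```

&lt;p&gt;In this model, the first entry of the hot list is matched after a single
comparison, while rarely-called and cacheable syscalls pay the logarithmic
cost of a binary search.&lt;/p&gt;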

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;binary-search-tree-optimizations&quot;&gt;Binary search tree optimizations&lt;/h3&gt;

    &lt;p&gt;Beyond classifying syscalls to see which binary search tree they should be a
part of, gVisor also optimizes the binary search process itself.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Each syscall number is a node in the tree. When traversing the tree, there are
three options at each point:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;The syscall number is an exact match&lt;/li&gt;
    &lt;li&gt;The syscall number is lower than the node’s value&lt;/li&gt;
    &lt;li&gt;The syscall number is higher than the node’s value&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;In order to render the BST as cBPF bytecode, gVisor used to render the following
(in pseudocode):&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;syscall&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;number&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;rules_for_this_syscall&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;syscall&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;number&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;left_node&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;right_node&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;rules_for_this_syscall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Render bytecode for this syscall's filters here...&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;left_node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Recursively render the bytecode for the left node value here...&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;right_node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Recursively render the bytecode for the right node value here...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Keep in mind the &lt;a href=&quot;#cbpf-limitations&quot;&gt;cBPF limitations&lt;/a&gt; here. Conditional
jumps are limited to 255 instructions, but the jump to &lt;code class=&quot;highlighter-rouge&quot;&gt;@left_node&lt;/code&gt; can be further
than 255 instructions away (especially for syscalls with complex filtering rules
like &lt;a href=&quot;https://man7.org/linux/man-pages/man2/ioctl.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt;&lt;/a&gt;). The jump
to &lt;code class=&quot;highlighter-rouge&quot;&gt;@right_node&lt;/code&gt; is almost certainly more than 255 instructions away. This means
that in actual cBPF bytecode, we would often need to use conditional jumps followed
by unconditional jumps in order to jump so far forward. Meanwhile, the jump to
&lt;code class=&quot;highlighter-rouge&quot;&gt;@rules_for_this_syscall&lt;/code&gt; would be a very short hop away, but this locality
would only be taken advantage of for a single node of the entire tree for each
traversal.&lt;/p&gt;

  &lt;p&gt;Consider this structure instead:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Traversal code:&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;syscall&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;number&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;left_node&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;syscall&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;number&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;right_node&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;rules_for_this_syscall&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;left_node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Recursively render only the traversal code for the left node here&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;right_node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Recursively render only the traversal code for the right node here&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Filtering code:&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;rules_for_this_syscall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Render bytecode for this syscall's filters here&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Recursively render only the filtering code for the left node here&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Recursively render only the filtering code for the right node here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This effectively separates the per-syscall rules from the traversal of the BST.
This ensures that the traversal can be done entirely using conditional jumps,
and that for any given execution of the cBPF program, there will be at most one
unconditional jump to the syscall-specific rules.&lt;/p&gt;

  &lt;p&gt;This structure can be improved further by taking advantage of the fact that
syscall numbers are a dense space, and so are syscall filter rules. This means
we can often avoid needless comparisons. For example, given the following tree:&lt;/p&gt;

  &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;      22
     /  \
    9    24
   /    /  \
  8   23    50
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Notice that the tree contains &lt;code class=&quot;highlighter-rouge&quot;&gt;22&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;23&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;24&lt;/code&gt;. This means that if we get to
node &lt;code class=&quot;highlighter-rouge&quot;&gt;23&lt;/code&gt;, we do not need to check for syscall number equality, because we’ve
already established from the traversal that the syscall number must be &lt;code class=&quot;highlighter-rouge&quot;&gt;23&lt;/code&gt;.&lt;/p&gt;
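
  &lt;p&gt;The reasoning can be sketched in a few lines of Python. This is an
illustration under our own assumptions rather than gVisor’s implementation:
the hypothetical &lt;code class=&quot;highlighter-rouge&quot;&gt;equality_check_needed&lt;/code&gt; helper takes the exclusive
lower and upper bounds that the traversal has already proven for the syscall
number (&lt;code class=&quot;highlighter-rouge&quot;&gt;None&lt;/code&gt; when a side is still unbounded).&lt;/p&gt;

```python
def equality_check_needed(lower, upper):
    """Decide whether a node still has to compare against the syscall number.

    lower / upper are exclusive bounds already established by the traversal,
    or None when that side has not been bounded yet.
    """
    if lower is None or upper is None:
        return True
    # Exactly one integer lies strictly between the bounds when they differ
    # by 2, so the syscall number is already known at that point.
    return upper - lower != 2
```

  &lt;p&gt;For the tree above, node &lt;code class=&quot;highlighter-rouge&quot;&gt;23&lt;/code&gt; is reached with bounds &lt;code class=&quot;highlighter-rouge&quot;&gt;(22, 24)&lt;/code&gt;, so no
further comparison is needed there, whereas node &lt;code class=&quot;highlighter-rouge&quot;&gt;8&lt;/code&gt; is reached with only an
upper bound of &lt;code class=&quot;highlighter-rouge&quot;&gt;9&lt;/code&gt; and must still be compared.&lt;/p&gt;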

&lt;/details&gt;

&lt;h2 id=&quot;cbpf-bytecode-optimizations&quot;&gt;cBPF bytecode optimizations&lt;/h2&gt;

&lt;p&gt;gVisor now implements a
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/bpf/optimizer.go&quot;&gt;bytecode-level cBPF optimizer&lt;/a&gt;
running a few lossless optimizations. These optimizations are run repeatedly
until the bytecode no longer changes. This is because each type of optimization
tends to feed on the fruits of the others, as we’ll see below.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-sentry-filter-size.png&quot; alt=&quot;gVisor sentry seccomp-bpf filter program size&quot; title=&quot;gVisor sentry seccomp-bpf filter program size&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program size shrinks by more than a factor of 4 using the
optimizations below.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;optimizing-cbpf-jumps&quot;&gt;Optimizing cBPF jumps&lt;/h3&gt;

    &lt;p&gt;The &lt;a href=&quot;#cbpf-limitations&quot;&gt;limitations of cBPF jump instructions described earlier&lt;/a&gt;
mean that typical BPF bytecode rendering code will usually favor unconditional
jumps even when they are not necessary. However, they can be optimized after the
fact.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;The bytecode for a simple condition is usually rendered
as follows:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If &amp;lt;condition&amp;gt; is true, continue,&lt;/span&gt;
                          &lt;span class=&quot;c1&quot;&gt;//   otherwise skip over 1 instruction.&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition_was_true&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Unconditional jump to label @condition_was_true.&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition_was_false&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Unconditional jump to label @condition_was_false.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… or as follows:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If &amp;lt;condition&amp;gt; is true, jump by 1 instruction,&lt;/span&gt;
                          &lt;span class=&quot;c1&quot;&gt;//   otherwise continue.&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition_was_false&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Unconditional jump to label @condition_was_false.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Flow through here if the condition was true.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… In other words, the generated code always uses unconditional jumps, and
conditional jump offsets are always either 0 or 1 instructions forward. This is
because conditional jumps are limited to 8 bits (255 instructions), and it is
not always possible to know at BPF bytecode rendering time that
the jump targets (&lt;code class=&quot;highlighter-rouge&quot;&gt;@condition_was_true&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;@condition_was_false&lt;/code&gt;) will resolve to
an instruction that is close enough ahead that the offset would fit in 8 bits.
The safe thing to do is to always use an unconditional jump. Since unconditional
jump offsets are stored in the instruction’s 32-bit &lt;code class=&quot;highlighter-rouge&quot;&gt;k&lt;/code&gt; field, and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs are limited
to 4,096 instructions, it is always possible to encode a jump using an
unconditional jump instruction.&lt;/p&gt;

  &lt;p&gt;But of course, the jump target often &lt;em&gt;does&lt;/em&gt; fit in 8 bits. So gVisor looks over
the bytecode for optimization opportunities:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Conditional jumps that jump to unconditional jumps&lt;/strong&gt; are rewritten to
their final destination, so long as this fits within the 255-instruction
conditional jump limit.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Unconditional jumps that jump to other unconditional jumps&lt;/strong&gt; are rewritten
to their final destination.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Conditional jumps where both branches jump to the same instruction&lt;/strong&gt; are
replaced by an unconditional jump to that instruction.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Unconditional jumps with a zero-instruction jump target&lt;/strong&gt; are removed.&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;The aim of these optimizations is to clean up after needless indirection that is
a byproduct of cBPF bytecode rendering code. Once they all have run, all jumps
are as tight as they can be.&lt;/p&gt;
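
  &lt;p&gt;To make the first two rewrites concrete, here is a minimal Python sketch over
a toy instruction model. The tuple encoding (&lt;code class=&quot;highlighter-rouge&quot;&gt;jmp&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;jif&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;ld&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;ret&lt;/code&gt;) and the
&lt;code class=&quot;highlighter-rouge&quot;&gt;thread_jumps&lt;/code&gt; name are assumptions made for this example and do not
mirror gVisor’s optimizer API. Offsets are relative to the next instruction,
as in cBPF.&lt;/p&gt;

```python
MAX_COND_OFFSET = 255  # conditional jump offsets are 8 bits wide


def final_target(prog, index):
    """Follow a chain of unconditional jumps starting at `index`."""
    while prog[index][0] == "jmp":
        index = index + 1 + prog[index][1]
    return index


def thread_jumps(prog):
    """Point every jump at its final destination when the offset still fits."""
    out = []
    for i, ins in enumerate(prog):
        if ins[0] == "jmp":
            out.append(("jmp", final_target(prog, i + 1 + ins[1]) - (i + 1)))
        elif ins[0] == "jif":
            offsets = []
            for off in ins[1:]:
                new_off = final_target(prog, i + 1 + off) - (i + 1)
                # Keep the old offset if the new one would not fit in 8 bits.
                offsets.append(new_off if MAX_COND_OFFSET >= new_off else off)
            out.append(("jif", offsets[0], offsets[1]))
        else:
            out.append(ins)
    return out


PROGRAM = [
    ("jif", 0, 1),      # 0: true falls through to 1, false skips to 2
    ("jmp", 2),         # 1: unconditional jump to 4
    ("jmp", 2),         # 2: unconditional jump to 5
    ("ld", 0),          # 3: never reached
    ("ret", "allow"),   # 4
    ("ret", "reject"),  # 5
]
```

  &lt;p&gt;On this toy program the conditional jump ends up pointing straight at the two
&lt;code class=&quot;highlighter-rouge&quot;&gt;ret&lt;/code&gt; instructions, which leaves the intermediate &lt;code class=&quot;highlighter-rouge&quot;&gt;jmp&lt;/code&gt; instructions
unreferenced.&lt;/p&gt;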

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;removing-dead-code&quot;&gt;Removing dead code&lt;/h3&gt;

    &lt;p&gt;Because cBPF is a very restricted language, it is possible to determine with
certainty that some instructions can never be reached.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;In cBPF, each instruction either:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Flows&lt;/strong&gt; forward (e.g. &lt;code class=&quot;highlighter-rouge&quot;&gt;load&lt;/code&gt; operations, math operations).&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Jumps&lt;/strong&gt; by a fixed (immediate) number of instructions.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Stops&lt;/strong&gt; the execution immediately (&lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instructions).&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;Therefore, gVisor runs a simple program traversal algorithm. It creates a
bitfield with one bit per instruction, then traverses the program and all its
possible branches. Then, all instructions that were never traversed are removed
from the program, and all jump targets are updated to account for these
removals.&lt;/p&gt;

  &lt;p&gt;In turn, this makes the program shorter, which makes more jump optimizations
possible.&lt;/p&gt;
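
  &lt;p&gt;A simplified version of this traversal might look like the Python below. It is
a sketch under assumptions: it reuses a toy tuple-based instruction model
rather than real cBPF encoding, uses a list of booleans in place of a bitfield,
and &lt;code class=&quot;highlighter-rouge&quot;&gt;remove_dead_code&lt;/code&gt; is a hypothetical name.&lt;/p&gt;

```python
def remove_dead_code(prog):
    """Drop unreachable instructions and re-point the remaining jumps."""
    # Pass 1: mark every instruction reachable from the entry point.
    reachable = [False] * len(prog)
    pending = [0]
    while pending:
        i = pending.pop()
        if reachable[i]:
            continue
        reachable[i] = True
        op = prog[i][0]
        if op == "jmp":
            pending.append(i + 1 + prog[i][1])
        elif op == "jif":
            pending.append(i + 1 + prog[i][1])
            pending.append(i + 1 + prog[i][2])
        elif op != "ret":
            pending.append(i + 1)  # ordinary instructions flow forward

    # Pass 2: compute the new index of every surviving instruction.
    new_index = {}
    for old, keep in enumerate(reachable):
        if keep:
            new_index[old] = len(new_index)

    def fix(old, off):
        # Translate an old relative offset into a new relative offset.
        return new_index[old + 1 + off] - (new_index[old] + 1)

    # Pass 3: emit the surviving instructions with updated jump offsets.
    out = []
    for old, ins in enumerate(prog):
        if not reachable[old]:
            continue
        if ins[0] == "jmp":
            out.append(("jmp", fix(old, ins[1])))
        elif ins[0] == "jif":
            out.append(("jif", fix(old, ins[1]), fix(old, ins[2])))
        else:
            out.append(ins)
    return out


PROGRAM = [
    ("jif", 3, 4),      # 0: jumps straight to 4 or 5
    ("jmp", 2),         # 1: unreachable
    ("jmp", 2),         # 2: unreachable
    ("ld", 0),          # 3: unreachable
    ("ret", "allow"),   # 4
    ("ret", "reject"),  # 5
]
```

  &lt;p&gt;Here the six-instruction program collapses to three, and the conditional jump
offsets shrink from &lt;code class=&quot;highlighter-rouge&quot;&gt;3, 4&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;0, 1&lt;/code&gt;.&lt;/p&gt;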

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;redundant-loads&quot;&gt;Removing redundant &lt;code class=&quot;highlighter-rouge&quot;&gt;load&lt;/code&gt; instructions&lt;/h3&gt;

    &lt;p&gt;cBPF programs filter system calls by inspecting their arguments. To do these
comparisons, this data must first be loaded into the cBPF VM registers. These
load operations can be optimized.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;cBPF’s conditional operations (e.g. “is equal to”, “is greater than”, etc.)
operate on a single 32-bit register called “A”. As such, a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program
typically consists of many load operations (&lt;code class=&quot;highlighter-rouge&quot;&gt;load32&lt;/code&gt;) that load a 32-bit value
from a given offset of the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct into register A, each followed by
a comparison operation on it to see if it matches the filter.&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition1_was_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition1_was_false&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition2_was_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition2_was_false&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;But when a syscall rule is of the form “this syscall argument must be one of the
following values”, we don’t need to reload the same value (from the same offset)
multiple times. So gVisor looks for redundant loads like this, and removes them.&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition1_was_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition1_was_false&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition2_was_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition2_was_false&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Note that syscall arguments are &lt;strong&gt;64-bit&lt;/strong&gt; values, whereas the A register is
only 32 bits wide. Therefore, asserting that a syscall argument matches a
predicate usually involves at least 2 &lt;code class=&quot;highlighter-rouge&quot;&gt;load32&lt;/code&gt; operations on different offsets,
thereby making this optimization useless for the “this syscall argument must be
one of the following values” case. We’ll get back to that.&lt;/p&gt;
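
  &lt;p&gt;One way to detect such loads safely is to track what register A is known to
hold on entry to each instruction. The Python below is a sketch under
assumptions, not gVisor’s code: it uses a toy tuple-based instruction model,
relies on cBPF jumps only going forward so that a single pass suffices, and
treats a load as redundant only when &lt;em&gt;every&lt;/em&gt; path reaching it has already
loaded the same offset. The &lt;code class=&quot;highlighter-rouge&quot;&gt;find_redundant_loads&lt;/code&gt; name is hypothetical.&lt;/p&gt;

```python
def find_redundant_loads(prog):
    """Return the indices of loads that re-load what A already holds."""
    # known[i] is the offset A holds on entry to instruction i,
    # "?" when it is unknown, or None when i has not been reached.
    known = [None] * len(prog)
    known[0] = "?"

    def merge(target, value):
        if known[target] is None:
            known[target] = value
        elif known[target] != value:
            known[target] = "?"  # paths disagree, so nothing is known

    redundant = []
    for i, ins in enumerate(prog):
        a = known[i]
        if a is None:
            continue  # unreachable instruction
        op = ins[0]
        if op == "ld":
            if a == ins[1]:
                redundant.append(i)
            merge(i + 1, ins[1])
        elif op == "jmp":
            merge(i + 1 + ins[1], a)
        elif op == "jif":
            merge(i + 1 + ins[1], a)
            merge(i + 1 + ins[2], a)
        elif op != "ret":
            merge(i + 1, "?")  # any other instruction may modify A
    return redundant


PROGRAM = [
    ("ld", 16),         # 0: load the value being checked
    ("jif", 3, 0),      # 1: first allowed value? then jump to 5
    ("ld", 16),         # 2: same offset again
    ("jif", 1, 0),      # 3: second allowed value? then jump to 5
    ("ret", "reject"),  # 4
    ("ret", "allow"),   # 5
]
```

  &lt;p&gt;In this example only the second load (index 2) is reported, because the sole
path that reaches it has just loaded the same offset.&lt;/p&gt;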

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;minimizing-the-number-of-return-instructions&quot;&gt;Minimizing the number of &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instructions&lt;/h3&gt;

    &lt;p&gt;A typical syscall filter program consists of many predicates which return either
“allowed” or “rejected”. These are encoded in the bytecode as either &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt;
instructions, or jumps to &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instructions. These instructions can show up
dozens or hundreds of times in the cBPF bytecode in quick succession, presenting
an optimization opportunity.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Since two &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instructions with the same immediate return code are exactly
equivalent to one another, it is possible to rewrite jumps to all &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt;
instructions that return “allowed” to go to a single &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instruction that
returns this code, and similarly for “rejected”, so long as the jump offsets fit
within the limits of conditional jumps (255 instructions). In turn, this makes
the program shorter, and therefore makes more jump optimizations possible.&lt;/p&gt;

  &lt;p&gt;To implement this optimization, gVisor first replaces all unconditional jump
instructions that go to &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; statements with a copy of that &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt;
statement. This removes needless indirection.&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nx&quot;&gt;Original&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;                      &lt;span class=&quot;nx&quot;&gt;New&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;                        &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;                    &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;                     &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;                        &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;                    &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;                     &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
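
  &lt;p&gt;This first step is simple enough to sketch in Python. As before, the tuple
encoding and the &lt;code class=&quot;highlighter-rouge&quot;&gt;inline_returns&lt;/code&gt; name are assumptions made for
illustration. Because each jump is replaced one-for-one, the program length and
all other jump offsets stay unchanged.&lt;/p&gt;

```python
def inline_returns(prog):
    """Replace unconditional jumps that land on a return with that return."""
    out = []
    for i, ins in enumerate(prog):
        if ins[0] == "jmp":
            target = prog[i + 1 + ins[1]]
            if target[0] == "ret":
                out.append(target)  # copy the return in place of the jump
                continue
        out.append(ins)
    return out


PROGRAM = [
    ("jif", 0, 1),        # 0
    ("jmp", 1),           # 1: jumps to 3 (allowed)
    ("jmp", 1),           # 2: jumps to 4 (rejected)
    ("ret", "allowed"),   # 3
    ("ret", "rejected"),  # 4
]
```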

  &lt;p&gt;gVisor then searches for &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; statements which can be entirely removed by
seeing if it is possible to rewrite the rest of the program to jump or flow
through to an equivalent &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; statement (without making the program longer
in the process). In the above example:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nx&quot;&gt;Original&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;                      &lt;span class=&quot;nx&quot;&gt;New&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;                  &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;99&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Targets updated&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;                     &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Now dead code&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;                    &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Now dead code&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;                  &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;89&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;90&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// Targets updated&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;                     &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Now dead code&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;                    &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Now dead code&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Finally, the dead code removal pass cleans up the dead &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; statements and
the program becomes shorter.&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nx&quot;&gt;Original&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;                      &lt;span class=&quot;nx&quot;&gt;New&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;99&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;               &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;95&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Targets updated&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;               &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;cm&quot;&gt;/* Removed */&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;              &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;cm&quot;&gt;/* Removed */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;89&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;90&lt;/span&gt;                &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;87&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;88&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Targets updated&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;               &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;cm&quot;&gt;/* Removed */&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;              &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;cm&quot;&gt;/* Removed */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;97&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;While this search is expensive to perform, in a program full of predicates —
which is exactly what &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs are — this approach massively
reduces program size.&lt;/p&gt;
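
  &lt;p&gt;The two steps above (pointing jumps at a shared &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt;, then removing dead code) can be sketched on a toy instruction model. This is a minimal illustration, not gVisor’s optimizer: the &lt;code class=&quot;highlighter-rouge&quot;&gt;Insn&lt;/code&gt; type and all function names are hypothetical, the program is assumed to be non-empty and well-formed, it contains only conditional jumps and returns, and each jump is simply pointed at the last matching return instead of being searched for the best rewrite.&lt;/p&gt;

```go
package main

import "fmt"

// Insn is a toy stand-in for a cBPF instruction (hypothetical, not gVisor's type).
// Only two shapes exist here: "return Action" and a conditional jump.
type Insn struct {
	Ret    bool   // true for "return Action", false for a conditional jump
	Action string // result of a return, e.g. "allowed"
	JT, JF int    // relative jump offsets when the condition is true / false
}

// retarget returns a new offset for a jump at pc. If the jump lands on a
// return, it is pointed at the last return with the same action instead.
func retarget(prog []Insn, last map[string]int, pc, off int) int {
	target := prog[pc+1+off]
	if !target.Ret {
		return off
	}
	newOff := last[target.Action] - pc - 1
	if newOff > 255 { // cBPF conditional jump offsets are only 8 bits wide
		return off
	}
	return newOff
}

// retargetReturns rewrites jumps so that earlier duplicate returns become dead.
func retargetReturns(prog []Insn) []Insn {
	last := map[string]int{} // action -> index of its last return
	for pc, in := range prog {
		if in.Ret {
			last[in.Action] = pc
		}
	}
	out := make([]Insn, len(prog))
	for pc, in := range prog {
		if !in.Ret {
			in.JT = retarget(prog, last, pc, in.JT)
			in.JF = retarget(prog, last, pc, in.JF)
		}
		out[pc] = in
	}
	return out
}

// removeDeadCode drops unreachable instructions and fixes up jump offsets.
func removeDeadCode(prog []Insn) []Insn {
	reachable := make([]bool, len(prog))
	reachable[0] = true
	newPC := make([]int, len(prog)) // old index -> index after removal
	kept := 0
	for pc, in := range prog { // jumps only go forward, so one pass suffices
		newPC[pc] = kept
		if !reachable[pc] {
			continue
		}
		kept++
		if !in.Ret {
			reachable[pc+1+in.JT] = true
			reachable[pc+1+in.JF] = true
		}
	}
	var out []Insn
	for pc, in := range prog {
		if !reachable[pc] {
			continue
		}
		if !in.Ret {
			in.JT = newPC[pc+1+in.JT] - newPC[pc] - 1
			in.JF = newPC[pc+1+in.JF] - newPC[pc] - 1
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []Insn{
		{JT: 0, JF: 1},                  // 0: if A goto 1 else goto 2
		{Ret: true, Action: "allowed"},  // 1
		{JT: 0, JF: 1},                  // 2: if B goto 3 else goto 4
		{Ret: true, Action: "allowed"},  // 3
		{Ret: true, Action: "rejected"}, // 4
	}
	for pc, in := range removeDeadCode(retargetReturns(prog)) {
		fmt.Printf("%02d: %+v\n", pc, in)
	}
}
```

  &lt;p&gt;On this five-instruction program, the first jump is pointed at the later &lt;code class=&quot;highlighter-rouge&quot;&gt;return allowed&lt;/code&gt;, which leaves the early copy unreachable, so the result has four instructions.&lt;/p&gt;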

&lt;/details&gt;

&lt;h2 id=&quot;optimize-rulesets&quot;&gt;Ruleset optimizations&lt;/h2&gt;

&lt;p&gt;Bytecode-level optimizations are cool, but why stop here? gVisor now also
performs
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/seccomp/seccomp_optimizer.go&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; ruleset optimizations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In gVisor, a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; &lt;code class=&quot;highlighter-rouge&quot;&gt;RuleSet&lt;/code&gt; is a mapping from syscall number to a logical
expression named &lt;code class=&quot;highlighter-rouge&quot;&gt;SyscallRule&lt;/code&gt;, along with the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; action (e.g. “allow”)
to take when a syscall with a given number matches its &lt;code class=&quot;highlighter-rouge&quot;&gt;SyscallRule&lt;/code&gt;.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;basic-ruleset-simplifications&quot;&gt;Basic ruleset simplifications&lt;/h3&gt;

    &lt;p&gt;A &lt;code class=&quot;highlighter-rouge&quot;&gt;SyscallRule&lt;/code&gt; is a predicate over the data contained in the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt;
struct (beyond its &lt;code class=&quot;highlighter-rouge&quot;&gt;nr&lt;/code&gt;). A trivial implementation is &lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt;, which simply
matches any &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt;. Other implementations include &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; (which
do what they sound like), and &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; which applies predicates to each specific
argument of a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt;, and forms the meat of actual syscall filtering
rules. Some basic simplifications are already possible with these building
blocks.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;gVisor implements the following basic optimizers, which look like they may be
useless on their own but end up simplifying the logic of the more complex
optimizer described in other sections quite a bit:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules with a single predicate within them are replaced with
just that predicate.&lt;/li&gt;
    &lt;li&gt;Duplicate predicates within &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules are removed.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rules within &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rules are flattened.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules within &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules are flattened.&lt;/li&gt;
    &lt;li&gt;An &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rule which contains a &lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt; predicate is replaced with
&lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt;.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt; predicates within &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules are removed.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; rules with &lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt; predicates for each argument are replaced
with a rule that matches anything.&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;As with the bytecode-level optimizations, gVisor runs these in a loop until the
structure of the rules no longer changes. With the basic optimizations above,
this silly-looking rule:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;MatchAll&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… is simplified down to just &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg{AnyValue, EqualTo(2), AnyValue}&lt;/code&gt;.&lt;/p&gt;
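
  &lt;p&gt;The basic simplifications listed above can be sketched as a small fixed-point rewriter. This is an illustration under stated assumptions rather than gVisor’s code: the &lt;code class=&quot;highlighter-rouge&quot;&gt;Rule&lt;/code&gt; type, its string-based argument matchers, and the function names are all hypothetical.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a toy predicate tree (hypothetical; gVisor's SyscallRule types differ).
type Rule struct {
	Kind string   // "MatchAll", "Or", "And" or "PerArg"
	Kids []Rule   // sub-rules of an Or / And
	Args []string // per-argument matchers of a PerArg, kept as plain text
}

func (r Rule) String() string {
	switch r.Kind {
	case "Or", "And":
		parts := make([]string, len(r.Kids))
		for i, k := range r.Kids {
			parts[i] = k.String()
		}
		return r.Kind + "{" + strings.Join(parts, ", ") + "}"
	case "PerArg":
		return "PerArg{" + strings.Join(r.Args, ", ") + "}"
	}
	return r.Kind
}

// simplifyOnce applies each basic simplification a single time, bottom-up.
func simplifyOnce(r Rule) Rule {
	switch r.Kind {
	case "PerArg":
		for _, a := range r.Args {
			if a != "AnyValue" {
				return r
			}
		}
		return Rule{Kind: "MatchAll"} // every argument accepts any value
	case "Or", "And":
		var flat []Rule
		for _, k := range r.Kids {
			k = simplifyOnce(k)
			if k.Kind == r.Kind { // flatten Or-in-Or and And-in-And
				flat = append(flat, k.Kids...)
			} else {
				flat = append(flat, k)
			}
		}
		seen := map[string]bool{}
		var kids []Rule
		for _, k := range flat {
			if k.Kind == "MatchAll" {
				if r.Kind == "Or" {
					return k // an Or containing MatchAll matches everything
				}
				continue // MatchAll adds nothing to an And
			}
			if seen[k.String()] {
				continue // duplicate predicate
			}
			seen[k.String()] = true
			kids = append(kids, k)
		}
		if len(kids) == 1 {
			return kids[0] // single-predicate Or / And
		}
		if len(kids) == 0 {
			if r.Kind == "And" {
				return Rule{Kind: "MatchAll"} // And of nothing but MatchAll
			}
		}
		return Rule{Kind: r.Kind, Kids: kids}
	}
	return r
}

// simplify repeats simplifyOnce until the rule's structure stops changing.
func simplify(r Rule) Rule {
	for {
		next := simplifyOnce(r)
		if next.String() == r.String() {
			return next
		}
		r = next
	}
}

func main() {
	p := Rule{Kind: "PerArg", Args: []string{"AnyValue", "EqualTo(2)", "AnyValue"}}
	all := Rule{Kind: "MatchAll"}
	nested := Rule{Kind: "Or", Kids: []Rule{
		{Kind: "Or", Kids: []Rule{{Kind: "And", Kids: []Rule{all, p}}}},
		p,
		p,
	}}
	fmt.Println(simplify(nested))
	fmt.Println(simplify(Rule{Kind: "Or", Kids: []Rule{p, all}}))
}
```

  &lt;p&gt;Comparing the printed form of the rule before and after each pass is a crude but simple way to detect that the structure has stopped changing.&lt;/p&gt;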

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;extracting-repeated-argument-matchers&quot;&gt;Extracting repeated argument matchers&lt;/h3&gt;

    &lt;p&gt;This is the main optimization that gVisor performs on rulesets. gVisor looks for
common argument matchers that are repeated across all combinations of &lt;em&gt;other&lt;/em&gt;
argument matchers in branches of an &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rule. It removes them from these
&lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; rules, and &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt;s the overall syscall rule with a single instance of
that argument matcher. Sound complicated? Let’s look at an example.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;In the
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/runsc/boot/filter/config/&quot;&gt;gVisor Sentry &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; configuration&lt;/a&gt;,
these are the rules for the
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/fcntl.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;fcntl(2)&lt;/code&gt; system call&lt;/a&gt;:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… This means that for the &lt;code class=&quot;highlighter-rouge&quot;&gt;fcntl(2)&lt;/code&gt; system call, &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[0]&lt;/code&gt; may
be any non-negative number, &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt; may be one of &lt;code class=&quot;highlighter-rouge&quot;&gt;F_GETFL&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;F_SETFL&lt;/code&gt;, or &lt;code class=&quot;highlighter-rouge&quot;&gt;F_GETFD&lt;/code&gt;, and all other &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; fields may be any value.&lt;/p&gt;

  &lt;p&gt;If rendered naively in BPF, this would iterate over each branch of the &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt;
expression, and re-check the &lt;code class=&quot;highlighter-rouge&quot;&gt;NonNegativeFD&lt;/code&gt; each time. Clearly wasteful.
Conceptually, the ideal expression is something like this:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyOf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… But going through all the syscall rules to look for this pattern would be
quite tedious, and some of them are actually &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt;’d from multiple
&lt;code class=&quot;highlighter-rouge&quot;&gt;map[uintptr]SyscallRule&lt;/code&gt; maps in different files (e.g. platform-dependent syscalls),
so they cannot all be specified in a single location with a single predicate on
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt;. So gVisor needs to detect this programmatically at
optimization time.&lt;/p&gt;
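
  &lt;p&gt;As a rough illustration of detecting this programmatically, the sketch below handles only the simplest case: an argument matcher that is identical in &lt;em&gt;every&lt;/em&gt; branch of an &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt;, as with &lt;code class=&quot;highlighter-rouge&quot;&gt;NonNegativeFD&lt;/code&gt; in the &lt;code class=&quot;highlighter-rouge&quot;&gt;fcntl(2)&lt;/code&gt; rules. It is not gVisor’s algorithm. The &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; string slice and the &lt;code class=&quot;highlighter-rouge&quot;&gt;factorCommon&lt;/code&gt; function are hypothetical, and the input is assumed to be non-empty with branches of equal length.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// PerArg lists one textual matcher per syscall argument (hypothetical toy type).
type PerArg []string

// factorCommon finds argument positions whose matcher is identical in every
// branch of an Or. It returns those shared matchers as a single PerArg, plus
// the branches with the shared positions relaxed to AnyValue and duplicates
// dropped. The original Or is equivalent to And{common, Or{rest...}}.
func factorCommon(branches []PerArg) (common PerArg, rest []PerArg) {
	n := len(branches[0])
	common = make(PerArg, n)
	for i := range common {
		common[i] = "AnyValue"
		shared := true
		for _, b := range branches {
			if b[i] != branches[0][i] {
				shared = false
			}
		}
		if shared {
			common[i] = branches[0][i]
		}
	}
	seen := map[string]bool{}
	for _, b := range branches {
		relaxed := make(PerArg, n)
		for i := range b {
			relaxed[i] = b[i]
			if common[i] != "AnyValue" {
				relaxed[i] = "AnyValue" // now checked once, by common
			}
		}
		key := strings.Join(relaxed, ", ")
		if !seen[key] {
			seen[key] = true
			rest = append(rest, relaxed)
		}
	}
	return common, rest
}

func main() {
	common, rest := factorCommon([]PerArg{
		{"NonNegativeFD", "EqualTo(F_GETFL)"},
		{"NonNegativeFD", "EqualTo(F_SETFL)"},
		{"NonNegativeFD", "EqualTo(F_GETFD)"},
	})
	fmt.Println("And{")
	fmt.Printf("    PerArg{%s},\n", strings.Join(common, ", "))
	fmt.Println("    Or{")
	for _, b := range rest {
		fmt.Printf("        PerArg{%s},\n", strings.Join(b, ", "))
	}
	fmt.Println("    },")
	fmt.Println("}")
}
```

  &lt;p&gt;The rewrite is safe because &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; distributes over a shared &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; term: checking the common matcher once and then any one of the relaxed branches accepts exactly the same calls as the original rule.&lt;/p&gt;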

  &lt;p&gt;Conceptually, gVisor goes from:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… to (after one pass):&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Then the &lt;a href=&quot;#basic-ruleset-simplifications&quot;&gt;basic optimizers&lt;/a&gt; will kick in and
detect duplicate &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; rules in &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; expressions, and delete them:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… Then, on the next pass, the second inner &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rule gets recursively
optimized:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… which, after other basic optimizers clean this all up, finally becomes:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This has turned what would be 24 comparisons into just 9:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[0]&lt;/code&gt; must match either predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;A1&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;A2&lt;/code&gt;.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[3]&lt;/code&gt; must match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;D&lt;/code&gt;.&lt;/li&gt;
    &lt;li&gt;At least one of the following must be true:
      &lt;ul&gt;
        &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[1]&lt;/code&gt; must match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;B1&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[2]&lt;/code&gt; must
match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;C1&lt;/code&gt;.&lt;/li&gt;
        &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[1]&lt;/code&gt; must match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;B2&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[2]&lt;/code&gt; must
match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;C2&lt;/code&gt;.&lt;/li&gt;
        &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[1]&lt;/code&gt; must match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;B3&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[2]&lt;/code&gt; must
match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;C3&lt;/code&gt;.&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
  &lt;/ul&gt;
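
  &lt;p&gt;The factoring can be sanity-checked by brute force. The sketch below is only an
illustration: it treats the nine predicates as independent booleans and assumes the
starting rule was a single &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; of six &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; entries (each of &lt;code class=&quot;highlighter-rouge&quot;&gt;A1&lt;/code&gt;/&lt;code class=&quot;highlighter-rouge&quot;&gt;A2&lt;/code&gt; combined
with each of the three B/C pairs, all requiring &lt;code class=&quot;highlighter-rouge&quot;&gt;D&lt;/code&gt;). The helper names are
hypothetical and are not part of gVisor.&lt;/p&gt;

```go
package main

import "fmt"

// preds holds one truth value per predicate from the example above.
type preds struct {
	A1, A2, B1, B2, B3, C1, C2, C3, D bool
}

// allOf reports whether every value is true (logical AND).
func allOf(bs ...bool) bool {
	for _, b := range bs {
		if !b {
			return false
		}
	}
	return true
}

// anyOf reports whether at least one value is true (logical OR).
func anyOf(bs ...bool) bool {
	for _, b := range bs {
		if b {
			return true
		}
	}
	return false
}

// naive is the assumed starting rule: six PerArg entries under one Or,
// for a total of 24 predicate checks.
func naive(p preds) bool {
	return anyOf(
		allOf(p.A1, p.B1, p.C1, p.D),
		allOf(p.A2, p.B1, p.C1, p.D),
		allOf(p.A1, p.B2, p.C2, p.D),
		allOf(p.A2, p.B2, p.C2, p.D),
		allOf(p.A1, p.B3, p.C3, p.D),
		allOf(p.A2, p.B3, p.C3, p.D),
	)
}

// factored is the optimized rule, which mentions each predicate once (9 checks).
func factored(p preds) bool {
	return allOf(
		anyOf(p.A1, p.A2),
		p.D,
		anyOf(allOf(p.B1, p.C1), allOf(p.B2, p.C2), allOf(p.B3, p.C3)),
	)
}

// mismatches tries all 2^9 truth assignments and counts disagreements.
func mismatches() int {
	count := 0
	for m := 0; m != 512; m++ {
		bit := func(i int) bool { return (m>>i)%2 == 1 }
		p := preds{bit(0), bit(1), bit(2), bit(3), bit(4), bit(5), bit(6), bit(7), bit(8)}
		if naive(p) != factored(p) {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println("mismatches:", mismatches())
}
```

  &lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;mismatches()&lt;/code&gt; returns 0, so under these assumptions both forms accept exactly the same
inputs.&lt;/p&gt;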

  &lt;p&gt;To go back to our &lt;code class=&quot;highlighter-rouge&quot;&gt;fcntl(2)&lt;/code&gt; example, the rules would therefore be rewritten to:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;// Check for args[0] exclusively:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;// Check for args[1] exclusively:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… thus we’ve turned 6 comparisons into 4. But we can do better still!&lt;/p&gt;
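
  &lt;p&gt;As a rough check of that count, the sketch below models rule trees with
hypothetical &lt;code class=&quot;highlighter-rouge&quot;&gt;Rule&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; types (simplified stand-ins, not gVisor’s
actual API) and counts every matcher other than &lt;code class=&quot;highlighter-rouge&quot;&gt;AnyValue&lt;/code&gt;. The unfactored rule is
assumed to be one &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; of three &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; entries that each repeat &lt;code class=&quot;highlighter-rouge&quot;&gt;NonNegativeFD&lt;/code&gt;.&lt;/p&gt;

```go
package main

import "fmt"

// Rule is a hypothetical, simplified stand-in for a syscall rule node.
// It only reports how many argument comparisons the rule contains.
type Rule interface {
	Comparisons() int
}

// PerArg holds one matcher name per syscall argument.
type PerArg []string

// Comparisons counts every matcher except AnyValue, which needs no check.
func (p PerArg) Comparisons() int {
	n := 0
	for _, matcher := range p {
		if matcher != "AnyValue" {
			n++
		}
	}
	return n
}

// And and Or both cost the sum of their children in this static count.
type And []Rule
type Or []Rule

func total(rules []Rule) int {
	n := 0
	for _, r := range rules {
		n += r.Comparisons()
	}
	return n
}

func (a And) Comparisons() int { return total(a) }
func (o Or) Comparisons() int  { return total(o) }

// naiveFcntl is the assumed unfactored rule: NonNegativeFD appears three times.
func naiveFcntl() Rule {
	return Or{
		PerArg{"NonNegativeFD", "EqualTo(F_GETFL)"},
		PerArg{"NonNegativeFD", "EqualTo(F_SETFL)"},
		PerArg{"NonNegativeFD", "EqualTo(F_GETFD)"},
	}
}

// factoredFcntl mirrors the rewritten rule shown above.
func factoredFcntl() Rule {
	return And{
		PerArg{"NonNegativeFD", "AnyValue"},
		Or{
			PerArg{"AnyValue", "EqualTo(F_GETFL)"},
			PerArg{"AnyValue", "EqualTo(F_SETFL)"},
			PerArg{"AnyValue", "EqualTo(F_GETFD)"},
		},
	}
}

func main() {
	fmt.Println("naive:", naiveFcntl().Comparisons())
	fmt.Println("factored:", factoredFcntl().Comparisons())
}
```

  &lt;p&gt;This counts 6 comparisons for the unfactored rule and 4 for the factored one.&lt;/p&gt;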

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;extracting-repeated-32-bit-match-logic-from-64-bit-argument-matchers&quot;&gt;Extracting repeated 32-bit match logic from 64-bit argument matchers&lt;/h3&gt;

    &lt;p&gt;We can apply the same optimization, but down to the 32-bit matching logic that
underlies the 64-bit syscall argument matching predicates.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;As you may recall,
&lt;a href=&quot;#cbpf-limitations&quot;&gt;cBPF instructions are limited to 32-bit math&lt;/a&gt;. This means
that when rendered, each of these argument comparisons is actually 2 operations:
one for the first 32-bit half of the argument, and one for the second
32-bit half of the argument.&lt;/p&gt;

  &lt;p&gt;Let’s look at the &lt;code class=&quot;highlighter-rouge&quot;&gt;F_GETFL&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;F_SETFL&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;F_GETFD&lt;/code&gt; constants:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x3&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x4&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
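
  &lt;p&gt;To make the two halves concrete, here is a minimal sketch that splits each constant
into its high and low 32 bits. The &lt;code class=&quot;highlighter-rouge&quot;&gt;split&lt;/code&gt; helper is hypothetical and only illustrates
the arithmetic; it is not how gVisor itself is written.&lt;/p&gt;

```go
package main

import "fmt"

// fcntl(2) command values as listed above.
const (
	F_GETFD = 0x1
	F_GETFL = 0x3
	F_SETFL = 0x4
)

// split returns the high and low 32-bit halves of a 64-bit syscall argument.
func split(arg uint64) (high, low uint32) {
	return uint32(arg >> 32), uint32(arg)
}

func main() {
	for _, cmd := range []uint64{F_GETFL, F_SETFL, F_GETFD} {
		high, low := split(cmd)
		fmt.Printf("cmd=%#x high=%#x low=%#x\n", cmd, high, low)
	}
}
```

  &lt;p&gt;All three constants have a zero high half, so only the low halves differ.&lt;/p&gt;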

  &lt;p&gt;The cBPF bytecode for checking the arguments of this syscall may therefore look
something like this:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[0]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jset&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// If A &amp;amp; 0x80000000 != 0, jump to @bad,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise continue.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[1]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;         &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @next1.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;06&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x3, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next1.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;09&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;         &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @next2.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x4, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next2.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If A == 0x1, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @bad.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Good/bad jump targets for the checks above to jump to:&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;ALLOW&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;REJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

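  &lt;p&gt;Clearly this could be better. The first 32 bits of &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt; must be zero in all possible
cases, yet the bytecode loads and checks them three times. So the syscall argument value-matching primitives (e.g. &lt;code class=&quot;highlighter-rouge&quot;&gt;EqualTo&lt;/code&gt;) may be
split into two 32-bit value matchers:&lt;/p&gt;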

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0xffffffff00000000&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x3 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0xffffffff00000000&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x4 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0xffffffff00000000&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x1 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;gVisor then applies the same optimization as earlier, but this time going into
each 32-bit half of each argument. This means it can extract the
&lt;code class=&quot;highlighter-rouge&quot;&gt;EqualTo32Bits(0)&lt;/code&gt; matcher from the &lt;code class=&quot;highlighter-rouge&quot;&gt;high32bits&lt;/code&gt; part of each &lt;code class=&quot;highlighter-rouge&quot;&gt;splitMatcher&lt;/code&gt; and
move it up to the &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; expression like so:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Any32BitsValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Any32BitsValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x3 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Any32BitsValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x4 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Any32BitsValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x1 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This looks bigger as a tree, but keep in mind that the &lt;code class=&quot;highlighter-rouge&quot;&gt;AnyValue&lt;/code&gt; and
&lt;code class=&quot;highlighter-rouge&quot;&gt;Any32BitsValue&lt;/code&gt; matchers do not produce any bytecode. So now let’s render that
tree to bytecode:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[0]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jset&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// If A &amp;amp; 0x80000000 != 0, jump to @bad,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise continue.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[1]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;06&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x3, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next1.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;09&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x4, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next2.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If A == 0x1, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @bad.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Good/bad jump targets for the checks above to jump to:&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;ALLOW&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;REJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This is where the bytecode-level optimization to remove redundant loads
&lt;a href=&quot;#redundant-loads&quot;&gt;described earlier&lt;/a&gt; finally becomes relevant. We don’t need to
load the second 32 bits of &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt; multiple times in a row:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[0]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jset&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// If A &amp;amp; 0x80000000 != 0, jump to @bad,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise continue.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[1]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;06&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x3, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next1.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x4, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next2.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;09&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If A == 0x1, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @bad.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Good/bad jump targets for the checks above to jump to:&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;ALLOW&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;REJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Of course, in practice the &lt;code class=&quot;highlighter-rouge&quot;&gt;@good&lt;/code&gt;/&lt;code class=&quot;highlighter-rouge&quot;&gt;@bad&lt;/code&gt; jump targets would also be unified
with rules from other system call filters in order to cut down on those too. And
by having reduced the number of instructions in each individual filtering rule,
the jumps to these targets can be deduplicated against that many more rules.&lt;/p&gt;

  &lt;p&gt;This example demonstrates how &lt;strong&gt;optimizations build on top of each other&lt;/strong&gt;,
making each optimization more likely to make &lt;em&gt;other&lt;/em&gt; optimizations useful in
turn.&lt;/p&gt;

&lt;/details&gt;

&lt;h2 id=&quot;other-optimizations&quot;&gt;Other optimizations&lt;/h2&gt;

&lt;p&gt;Beyond these, gVisor also has the following minor optimizations.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;making-futex2-rules-faster&quot;&gt;Making &lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt; rules faster&lt;/h3&gt;

    &lt;p&gt;&lt;a href=&quot;https://man7.org/linux/man-pages/man2/futex.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt;&lt;/a&gt; is by far the
system call that gVisor itself makes most often as part of its operation. It is
used for synchronization, so its filter needs to be very efficient.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Its rules used to look like this:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;SYS_FUTEX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAIT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FUTEX_PRIVATE_FLAG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAKE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FUTEX_PRIVATE_FLAG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAIT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAKE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This is essentially a 4-way &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; between the 4 values allowed for
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt;. This is all well and good, and the above optimizations
already reduce it to the minimum number of &lt;code class=&quot;highlighter-rouge&quot;&gt;jeq&lt;/code&gt; comparison operations.&lt;/p&gt;

  &lt;p&gt;But looking at the actual bit values of the &lt;code class=&quot;highlighter-rouge&quot;&gt;FUTEX_*&lt;/code&gt; constants above:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAIT&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FUTEX_WAKE&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x01&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FUTEX_PRIVATE_FLAG&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… we can see that the four allowed values are exactly those in which no bits
other than &lt;code class=&quot;highlighter-rouge&quot;&gt;0x01&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;0x80&lt;/code&gt; are set. It turns out that cBPF has an instruction
for this kind of bitwise test (&lt;code class=&quot;highlighter-rouge&quot;&gt;jset&lt;/code&gt;). The rule is now optimized down to two comparison operations:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                                  &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue,&lt;/span&gt;
                                  &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @bad.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                                  &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jset&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xffffff7e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// If A &amp;amp; ^(0x01 | 0x80) != 0, jump to @bad,&lt;/span&gt;
                                  &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @good.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;optimizing-non-negative-fd-checks&quot;&gt;Optimizing non-negative FD checks&lt;/h3&gt;

    &lt;p&gt;A lot of syscall arguments are file descriptors (FD numbers), which we need to
filter efficiently.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;A valid FD is a non-negative 32-bit integer, but it is passed as a 64-bit value, as all
syscall arguments are. Instead of doing a “less than” comparison, we can
turn this into a bitwise check: verify that the first half of the 64-bit
value is zero, and that the sign bit (bit 31) of the second half of the 64-bit value is
not set.&lt;/p&gt;
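
  &lt;p&gt;As a rough illustration, the same two-step check can be sketched in Go. The
function name &lt;code class=&quot;highlighter-rouge&quot;&gt;isNonNegativeFD&lt;/code&gt; is hypothetical and is not gVisor’s actual matcher
(which is emitted as cBPF, not Go); “upper” and “lower” here correspond to the
first and second halves described above.&lt;/p&gt;

```go
package main

import "fmt"

// isNonNegativeFD is a hypothetical helper (not part of gVisor) that mirrors
// the two-step bitwise check: the upper 32 bits of the 64-bit syscall
// argument must be zero, and the sign bit (bit 31) of the lower 32 bits must
// be clear.
func isNonNegativeFD(arg uint64) bool {
	upper := uint32(arg >> 32)
	lower := uint32(arg)
	if upper != 0 {
		return false
	}
	// Shifting right by 31 isolates the sign bit of the lower half.
	return lower>>31 == 0
}

func main() {
	fmt.Println(isNonNegativeFD(3))           // small, valid-looking FD
	fmt.Println(isNonNegativeFD(0x80000000))  // sign bit of the lower half set
	fmt.Println(isNonNegativeFD(0x100000000)) // upper half non-zero
}
```

  &lt;p&gt;No ordered comparison is needed: one equality test and one single-bit test
cover every negative or out-of-range value.&lt;/p&gt;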

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;enforcing-consistency-of-argument-wise-matchers&quot;&gt;Enforcing consistency of argument-wise matchers&lt;/h3&gt;

    &lt;p&gt;When one syscall argument is checked consistently across all branches of an
&lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt;, enforcing that this is the case ensures that the
&lt;a href=&quot;#optimize-rulesets&quot;&gt;optimization for such matchers&lt;/a&gt; remains effective.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; system call takes an FD as one of its arguments. Since it is a
“grab bag” of a system call, gVisor’s rules for &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; were similarly spread
across many files and rules, and not all of them checked that the FD argument
was non-negative; some of them simply accepted any value for the FD argument.&lt;/p&gt;

  &lt;p&gt;Before this optimization work, this meant that the BPF program did less work for
the rules which didn’t check the value of the FD argument. However, now that
gVisor &lt;a href=&quot;#optimize-rulesets&quot;&gt;optimizes repeated argument-wise matchers&lt;/a&gt;, it is
actually &lt;em&gt;cheaper&lt;/em&gt; if &lt;em&gt;all&lt;/em&gt; &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; rules verify the value of the FD
argument consistently, as that argument check can be performed exactly once for
all possible branches of the &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; rules. gVisor now has a test that
verifies that this is the case. This is a good example of how
&lt;strong&gt;optimization work can lead to improved security&lt;/strong&gt;, thanks to the efficiency gains
that come from applying security checks consistently.&lt;/p&gt;

&lt;/details&gt;

&lt;h2 id=&quot;secbench&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt;: Benchmarking &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs&lt;/h2&gt;

&lt;p&gt;Measuring the effectiveness of the above improvements through gVisor’s overall
performance would be very difficult, because each improvement is a rather
tiny part of the syscall hot path. At the scale of each of these optimizations,
we need to zoom in a bit more.&lt;/p&gt;

&lt;p&gt;So now gVisor has
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/secbench/&quot;&gt;tooling for benchmarking &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs&lt;/a&gt;.
It works by taking a
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/runsc/boot/filter/filter_bench_test.go&quot;&gt;cBPF program along with several possible syscalls&lt;/a&gt;
to try with it. It runs a subprocess that installs this program as a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;
filter for itself, replacing all actions (other than “approve syscall”) with
“return error” in order to avoid crashing. Then it measures the latency of each
syscall. This is then compared against the latency of the very same syscalls in
a subprocess that has an empty &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter (i.e. the only instruction within
it is &lt;code class=&quot;highlighter-rouge&quot;&gt;return ALLOW&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Let’s measure the effect of the above improvements on a gVisor-like workload.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;modeling-gvisor-seccomp-bpf-behavior-for-benchmarking&quot;&gt;Modeling gVisor &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; behavior for benchmarking&lt;/h3&gt;

    &lt;p&gt;This can be done by running gVisor under &lt;code class=&quot;highlighter-rouge&quot;&gt;ptrace&lt;/code&gt; to see what system calls the
gVisor process is doing.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Note that &lt;code class=&quot;highlighter-rouge&quot;&gt;ptrace&lt;/code&gt; here refers to the mechanism by which we can inspect the
system calls that the gVisor Sentry is making. These are distinct from the system
calls the &lt;em&gt;sandboxed&lt;/em&gt; application is making. It also has nothing to do with
gVisor’s former “ptrace” platform.&lt;/p&gt;

  &lt;p&gt;For example, after running a Postgres benchmark inside gVisor with Systrap, the
&lt;code class=&quot;highlighter-rouge&quot;&gt;ptrace&lt;/code&gt; tool generated the following summary table:&lt;/p&gt;

  &lt;div class=&quot;language-markdown highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% time     seconds  usecs/call     calls    errors syscall
&lt;span class=&quot;p&quot;&gt;------ ----------- ----------- --------- --------- ----------------&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt; 62.&lt;/span&gt;10  431.799048         496    870063     46227 futex
&lt;span class=&quot;p&quot;&gt;  4.&lt;/span&gt;23   29.399526         106    275649        38 nanosleep
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;87    6.032292          37    160201           sendmmsg
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;28    1.939492          16    115769           fstat
&lt;span class=&quot;p&quot;&gt; 27.&lt;/span&gt;96  194.415343        2787     69749       137 ppoll
&lt;span class=&quot;p&quot;&gt;  1.&lt;/span&gt;05    7.298717         315     23131           fsync
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;06    0.446930          31     14096           pwrite64
&lt;span class=&quot;p&quot;&gt;  3.&lt;/span&gt;37   23.398106        1907     12266         9 epoll_pwait
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.019711           9      1991         6 close
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;02    0.116739          82      1414           tgkill
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;01    0.068481          48      1414       201 rt_sigreturn
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;02    0.147048         104      1413           getpid
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;01    0.045338          41      1080           write
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;01    0.039876          37      1056           read
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.015637          18       836        24 openat
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;01    0.066699          81       814           madvise
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.029757         111       267           fallocate
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.006619          15       420           pread64
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.013334          35       375           sched_yield
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.008112         114        71           pwritev2
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.003005          57        52           munmap
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.000343          18        19         6 unlinkat
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.000249          15        16           shutdown
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.000100           8        12           getdents64
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.000045           4        10           newfstatat
...
&lt;span class=&quot;p&quot;&gt;------ ----------- ----------- --------- --------- ----------------&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;100.&lt;/span&gt;00  695.311111         447   1552214     46651 total
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;To mimic the syscall profile of this gVisor sandbox from the perspective of
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; overhead, the benchmark needs to call these system calls with the same
relative frequency. Therefore, the dimension that matters here isn’t &lt;code class=&quot;highlighter-rouge&quot;&gt;time&lt;/code&gt; or
&lt;code class=&quot;highlighter-rouge&quot;&gt;seconds&lt;/code&gt; or even &lt;code class=&quot;highlighter-rouge&quot;&gt;usecs/call&lt;/code&gt;; it is just the number of system calls
(&lt;code class=&quot;highlighter-rouge&quot;&gt;calls&lt;/code&gt;). In graph form:&lt;/p&gt;

  &lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-sentry-syscall-profile.png&quot; alt=&quot;Sentry syscall profile&quot; title=&quot;Sentry syscall profile&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

  &lt;p&gt;The Pareto distribution of system calls becomes immediately clear.&lt;/p&gt;
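
  &lt;p&gt;One way to reproduce such a profile is to pick each syscall with a probability
proportional to its call count. The sketch below is only an illustration of
that idea: &lt;code class=&quot;highlighter-rouge&quot;&gt;weightedPicker&lt;/code&gt; and its methods are hypothetical names, not the
&lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt; API, and only the five most frequent syscalls from the table above
are included.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// weightedPicker is a hypothetical helper (not secbench's API) that picks
// syscall names with probability proportional to their observed call counts.
type weightedPicker struct {
	names      []string
	cumulative []int // cumulative[i] is the sum of counts[0] through counts[i]
	total      int
}

func newWeightedPicker(names []string, counts []int) *weightedPicker {
	p := new(weightedPicker)
	p.names = names
	for _, c := range counts {
		p.total += c
		p.cumulative = append(p.cumulative, p.total)
	}
	return p
}

// pick maps r, which must be in [0, total), to the first syscall whose
// cumulative count exceeds r.
func (p *weightedPicker) pick(r int) string {
	i := sort.Search(len(p.cumulative), func(i int) bool { return p.cumulative[i] > r })
	return p.names[i]
}

func main() {
	// Call counts copied from the summary table above.
	names := []string{"futex", "nanosleep", "sendmmsg", "fstat", "ppoll"}
	counts := []int{870063, 275649, 160201, 115769, 69749}
	p := newWeightedPicker(names, counts)

	rng := rand.New(rand.NewSource(1)) // fixed seed, for repeatable runs
	for i := 0; i != 5; i++ {
		fmt.Println(p.pick(rng.Intn(p.total)))
	}
}
```

  &lt;p&gt;With these counts, &lt;code class=&quot;highlighter-rouge&quot;&gt;futex&lt;/code&gt; is selected for well over half of the draws, which
matches the skew visible in the table.&lt;/p&gt;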

&lt;/details&gt;

&lt;h3 id=&quot;seccomp-bpf-filtering-overhead-reduction&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead reduction&lt;/h3&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt; library lets us take the top 10 system calls and measure their
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead individually, as well as building a weighted
aggregate of their overall overhead. Here are the numbers from before and after
the filtering optimizations described in this post:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-systrap.png&quot; alt=&quot;Systrap seccomp-bpf performance&quot; title=&quot;Systrap seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;nanosleep(2)&lt;/code&gt; system call is a bit of an oddball here. Unlike the others,
this system call causes the current thread to be descheduled. To make the
results more legible, here is the same data with the duration normalized to the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead from before optimizations:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-systrap-normalized.png&quot; alt=&quot;Systrap seccomp-bpf performance&quot; title=&quot;Systrap seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This shows that most system calls have had their filtering overhead reduced, but
others haven’t significantly changed (10% or less change in either direction).
This is to be expected: those that have not changed are the ones that are
cacheable: &lt;code class=&quot;highlighter-rouge&quot;&gt;nanosleep(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;fstat(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;ppoll(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;fsync(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;pwrite64(2)&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;close(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;getpid(2)&lt;/code&gt;. The non-cacheable syscalls
&lt;a href=&quot;#structure&quot;&gt;which have dedicated checks&lt;/a&gt; before the main BST, &lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt; and
&lt;code class=&quot;highlighter-rouge&quot;&gt;sendmmsg(2)&lt;/code&gt;, experienced the biggest boost. Lastly, &lt;code class=&quot;highlighter-rouge&quot;&gt;epoll_pwait(2)&lt;/code&gt; is
non-cacheable but doesn’t have a dedicated check before the main BST, so while
it still sees a small performance gain, that gain is smaller than that of its
counterparts.&lt;/p&gt;

&lt;p&gt;The “Aggregate” number comes from the &lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt; library and represents the
total time difference spent in system calls after calling them using weighted
randomness. It represents the average system call overhead that a Sentry using
Systrap would incur. Therefore, per these numbers, these optimizations removed
~29% from gVisor’s overall &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead.&lt;/p&gt;

&lt;p&gt;Here is the same data for KVM, which has a slightly different syscall profile
with &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;rt_sigreturn(2)&lt;/code&gt; being critical for performance:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-kvm-normalized.png&quot; alt=&quot;KVM seccomp-bpf performance&quot; title=&quot;KVM seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Lastly, let’s look at GPU workload performance. This benchmark enables gVisor’s
&lt;a href=&quot;/blog/2023/06/20/gpu-pytorch-stable-diffusion/&quot;&gt;experimental &lt;code class=&quot;highlighter-rouge&quot;&gt;nvproxy&lt;/code&gt; feature for GPU support&lt;/a&gt;.
What matters for this workload is &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; performance, as this is the system
call used to issue commands to the GPU. Here is the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering
overhead of various CUDA control commands issued via &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-nvproxy-ioctl.png&quot; alt=&quot;nvproxy ioctl seccomp-bpf performance&quot; title=&quot;nvproxy ioctl seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As &lt;code class=&quot;highlighter-rouge&quot;&gt;nvproxy&lt;/code&gt; adds a lot of complexity to the &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; filtering rules, this is
where we see the most improvement from these optimizations.&lt;/p&gt;

&lt;h2 id=&quot;secfuzz&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;secfuzz&lt;/code&gt;: Fuzzing &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs&lt;/h2&gt;

&lt;p&gt;To ensure that the optimizations above don’t accidentally end up producing a
cBPF program that behaves differently from the unoptimized one,
gVisor also has
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/secfuzz/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; fuzz tests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because gVisor knows which high-level filters went into constructing the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program, it also
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/runsc/boot/filter/filter_fuzz_test.go&quot;&gt;automatically generates test cases&lt;/a&gt;
from these filters, and the fuzzer verifies that each line and every branch of
the optimized cBPF bytecode is executed, and that the result is the same as
giving the same input to the unoptimized program.&lt;/p&gt;

&lt;p&gt;(Line or branch coverage of the unoptimized program is not enforceable, because
without optimizations, the bytecode contains many redundant checks for which
later branches can never be reached.)&lt;/p&gt;

&lt;h2 id=&quot;optimizing-in-gvisor-seccomp-bpf-filtering&quot;&gt;Optimizing in-gVisor &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering&lt;/h2&gt;

&lt;p&gt;gVisor supports sandboxed applications adding &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters onto
themselves, and
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/bpf/interpreter.go&quot;&gt;implements its own cBPF interpreter&lt;/a&gt;
for this purpose.&lt;/p&gt;

&lt;p&gt;Because the cBPF bytecode-level optimizations are lossless and are generally
applicable to any cBPF program, they are also applied to programs uploaded by
sandboxed applications to make filter evaluation faster in gVisor itself.&lt;/p&gt;

&lt;p&gt;Additionally, gVisor removed the use of Go interfaces previously used for
loading data from the BPF “input” (i.e. the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct for
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;). This used to require an endianness-specific interface due to how
the BPF interpreter was used in two places in gVisor: network processing (which
uses network byte ordering), and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; (which uses native byte
ordering). This interface has now been replaced with
&lt;a href=&quot;https://go.dev/doc/tutorial/generics&quot;&gt;Go generics&lt;/a&gt;, yielding a 2x speedup
on &lt;a href=&quot;#sample-filter&quot;&gt;the reference simplistic &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter&lt;/a&gt;. The more
&lt;code class=&quot;highlighter-rouge&quot;&gt;load&lt;/code&gt; instructions there are in the filter, the larger the effect. &lt;em&gt;(Naturally, this
also benefits network filtering performance!)&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;gvisor-cbpf-interpreter-performance&quot;&gt;gVisor cBPF interpreter performance&lt;/h3&gt;

&lt;p&gt;The graph below shows the gVisor cBPF interpreter’s performance against three
sample filters: &lt;a href=&quot;#sample-filter&quot;&gt;the reference simplistic &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter&lt;/a&gt;,
and optimized vs unoptimized versions of gVisor’s own syscall filter (to
represent a more complex filter).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-interpreter.png&quot; alt=&quot;gVisor cBPF interpreter performance&quot; title=&quot;gVisor cBPF interpreter performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;seccomp-bpf-filter-result-caching-for-sandboxed-applications&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter result caching for sandboxed applications&lt;/h3&gt;

&lt;p&gt;Lastly, gVisor now also implements an in-sandbox caching mechanism for syscalls
which do not depend on the &lt;code class=&quot;highlighter-rouge&quot;&gt;instruction_pointer&lt;/code&gt; or syscall arguments. Unlike
Linux’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; cache, gVisor’s implementation also handles actions other
than “allow”, and supports the entire set of cBPF instructions rather than the
restricted emulator Linux uses for caching evaluation purposes. This removes the
interpreter from the syscall hot path entirely for cacheable syscalls, further
speeding up system calls from applications that use &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; within gVisor.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-cache.png&quot; alt=&quot;gVisor seccomp-bpf cache&quot; title=&quot;gVisor seccomp-bpf cache&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;faster-gvisor-startup-via-filter-precompilation&quot;&gt;Faster gVisor startup via filter precompilation&lt;/h2&gt;

&lt;p&gt;Due to these optimizations, the overall process of building the syscall
filtering rules, rendering them to cBPF bytecode, and running all the
optimizations, can take quite a while (~10ms). As one of gVisor’s strengths is
its startup latency being much faster than VMs, this is an unacceptable delay.&lt;/p&gt;

&lt;p&gt;Therefore, gVisor now
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/seccomp/precompiledseccomp/&quot;&gt;precompiles the rules&lt;/a&gt;
to optimized cBPF bytecode for most possible gVisor configurations. This means
the &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; binary contains cBPF bytecode embedded in it for some subset of
popular configurations, and it will use this bytecode rather than compiling the
cBPF program from scratch during startup. If &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; is invoked with a
configuration for which the cBPF bytecode isn’t embedded in the &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; binary,
it will fall back to compiling the program from scratch.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;dealing-with-dynamic-values-in-precompiled-rules&quot;&gt;Dealing with dynamic values in precompiled rules&lt;/h3&gt;

  &lt;/summary&gt;

  &lt;p&gt;One challenge with this approach is to support parts of the configuration that
are only known at &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; startup time. For example, many filters act on a
specific file descriptor used for interacting with the &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; process after
startup over a Unix Domain Socket (called the “controller FD”). This is an
integer that is only known at runtime, so its value cannot be embedded inside
the optimized cBPF bytecode prepared at &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; compilation time.&lt;/p&gt;

  &lt;p&gt;To address this, the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; precompilation tooling actually supports the
notion of 32-bit “variables”, and takes as input a function to render cBPF
bytecode for a given key-value mapping of variables to placeholder 32-bit
values. The precompiler calls this function &lt;em&gt;twice&lt;/em&gt; with different arbitrary
value mappings for each variable, and observes where these arbitrary values show
up in the generated cBPF bytecode. This takes advantage of the fact that
gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program generation is deterministic.&lt;/p&gt;

  &lt;p&gt;If the two cBPF programs are of the same byte length, and the placeholder values
show up at exactly the same byte offsets within the cBPF bytecode both times,
and the rest of the cBPF bytecode is byte-for-byte equivalent, the precompiler
has very high confidence that these offsets are where the 32-bit variables are
represented in the cBPF bytecode. It then stores these offsets as part of the
embedded data inside the &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; binary. Finally, at &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; execution time, the
bytes at these offsets are replaced with the now-known values of the variables.&lt;/p&gt;

&lt;/details&gt;

&lt;h2 id=&quot;performance&quot;&gt;OK that’s great and all, but is gVisor actually faster?&lt;/h2&gt;

&lt;p&gt;The short answer is: &lt;strong&gt;yes, but only slightly&lt;/strong&gt;. As we
&lt;a href=&quot;#performance-considerations&quot;&gt;established earlier&lt;/a&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; is only a
small portion of gVisor’s total overhead, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt; benchmark shows
that this work only removes a portion of that overhead, so we should not expect
large differences here.&lt;/p&gt;

&lt;p&gt;Let’s come back to the trusty ABSL build benchmark, with a new build of gVisor
with all of these optimizations turned on:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-absl-vs-unsandboxed.png&quot; alt=&quot;ABSL build performance&quot; title=&quot;ABSL build performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Let’s zoom the vertical axis in on the gVisor variants to see the difference
better:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-absl.png&quot; alt=&quot;ABSL build performance&quot; title=&quot;ABSL build performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is about in line with what the earlier benchmarks showed. The initial
benchmarks showed that &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead for this benchmark was
on the order of 3.6% of total runtime, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt; benchmarks showed
that the optimizations reduced &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter evaluation time by ~29% in
aggregate. The final absolute reduction in total runtime should then be
about 1% (3.6% × 29% ≈ 1.04%), which is just about what this result shows.&lt;/p&gt;

&lt;p&gt;Other benchmarks show a similar pattern. Here’s gRPC build, similar to ABSL:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-grpc-vs-unsandboxed.png&quot; alt=&quot;gRPC build performance&quot; title=&quot;gRPC build performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-grpc.png&quot; alt=&quot;gRPC build performance&quot; title=&quot;gRPC build performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here’s a benchmark running the
&lt;a href=&quot;https://github.com/fastlane/fastlane&quot;&gt;Ruby Fastlane&lt;/a&gt; test suite:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-rubydev-vs-unsandboxed.png&quot; alt=&quot;Ruby Fastlane performance&quot; title=&quot;Ruby Fastlane performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-rubydev.png&quot; alt=&quot;Ruby Fastlane performance&quot; title=&quot;Ruby Fastlane performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here’s the 50th percentile of nginx serving latency for an empty webpage.
&lt;a href=&quot;https://www.prnewswire.com/news-releases/akamai-online-retail-performance-report-milliseconds-are-critical-300441498.html&quot;&gt;Every microsecond counts when it comes to web serving&lt;/a&gt;,
and here we’ve shaved off 20 of them.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-nginx-vs-unsandboxed.png&quot; alt=&quot;nginx performance&quot; title=&quot;nginx performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-nginx.png&quot; alt=&quot;nginx performance&quot; title=&quot;nginx performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;CUDA workloads also get a boost from this work. Since their gVisor-related
overhead is already relatively small, &lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering makes up a
higher proportion of their overhead&lt;/strong&gt;. Additionally, as the performance
improvements described in this post disproportionately help the &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt;
system call, this cuts a larger portion of the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead
of these workloads, since CUDA uses the &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; system call to communicate
with the GPU.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-pytorch-vs-unsandboxed.png&quot; alt=&quot;PyTorch performance&quot; title=&quot;PyTorch performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-pytorch.png&quot; alt=&quot;PyTorch performance&quot; title=&quot;PyTorch performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While some of these results may not seem like much in absolute terms, it’s
important to remember:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;These improvements have resulted in gVisor being able to enforce &lt;strong&gt;more&lt;/strong&gt;
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters than it previously could; gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;
filter was nearly half the maximum &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program size, so it could
at most double in complexity. After optimizations, it is reduced to less
than a fourth of this size.&lt;/li&gt;
  &lt;li&gt;These improvements allow the gVisor filters to &lt;strong&gt;scale better&lt;/strong&gt;. This is
visible from the effects on &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; performance with &lt;code class=&quot;highlighter-rouge&quot;&gt;nvproxy&lt;/code&gt; enabled.&lt;/li&gt;
  &lt;li&gt;The resulting work has produced useful libraries for &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; tooling
which may be helpful for other projects: testing, fuzzing, and benchmarking
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters.&lt;/li&gt;
  &lt;li&gt;This overhead could not have been addressed in another way. Unlike other
areas of gVisor, such as network overhead or file I/O, overhead from the
host kernel evaluating &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters lives outside of gVisor itself
and therefore it can only be improved upon by this type of work.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;further-work&quot;&gt;Further work&lt;/h2&gt;

&lt;p&gt;One potential avenue for further work is to look into the performance gap between no
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter at all versus performance with an empty &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;
filter (equivalent to an all-cacheable filter). This points to a potential
inefficiency in the Linux kernel implementation of the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; cache.&lt;/p&gt;

&lt;p&gt;Another potential point of improvement is to apply the optimizations that
went into searching for a syscall number to the
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/ioctl.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; system call&lt;/a&gt;. &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; is a “grab-bag” kind of system call,
used by many drivers and other subsets of the Linux kernel to extend the syscall
interface without using up valuable syscall numbers. For example, the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine&quot;&gt;KVM&lt;/a&gt; subsystem is
almost entirely controlled through &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; system calls issued against
&lt;code class=&quot;highlighter-rouge&quot;&gt;/dev/kvm&lt;/code&gt; or against per-VM file descriptors.&lt;/p&gt;

&lt;p&gt;For this reason, the first non-file-descriptor argument of &lt;a href=&quot;https://man7.org/linux/man-pages/man2/ioctl.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt;&lt;/a&gt;
(“request”) typically plays the role that the syscall number plays for other
system calls: it identifies the type of request made to the kernel. Currently, gVisor
performs a linear scan through each possible value of this argument. This
is usually fine, but features like &lt;code class=&quot;highlighter-rouge&quot;&gt;nvproxy&lt;/code&gt; massively expand this
list of possible values, so the scan can take a long time. &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl&lt;/code&gt; performance is also
critical for gVisor’s KVM platform. A binary search tree would make sense here.&lt;/p&gt;
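
&lt;p&gt;As a rough illustration of why, the following Python sketch compares the
number of steps a linear scan and a binary search need to find a value in a
sorted allowlist. It is not gVisor code (gVisor’s filters are compiled to cBPF),
and the request values and function names are made up for the example.&lt;/p&gt;

```python
# Hypothetical allowlist of ioctl request values (made-up numbers, sorted).
ALLOWED_REQUESTS = [0x4600 + 3 * i for i in range(1000)]


def linear_scan(allowed, request):
    """Return (found, steps) by checking each allowed value in turn."""
    steps = 0
    for value in allowed:
        steps += 1
        if value == request:
            return True, steps
    return False, steps


def binary_search(allowed, request):
    """Return (found, steps) by repeatedly halving the sorted search range."""
    lo, hi = 0, len(allowed)  # Half-open range [lo, hi).
    steps = 0
    while hi > lo:
        mid = (lo + hi) // 2
        steps += 1
        if allowed[mid] == request:
            return True, steps
        if request > allowed[mid]:
            lo = mid + 1  # Continue in the upper half.
        else:
            hi = mid  # Continue in the lower half.
    return False, steps
```

&lt;p&gt;With 1,000 allowed values, the linear scan needs up to 1,000 steps, while the
binary search needs at most 10.&lt;/p&gt;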

&lt;p&gt;gVisor welcomes further contributions to its &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; machinery. Thanks for
reading!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;cBPF does not have a canonical assembly-style representation. The
assembly-like code in this blog post is close to
&lt;a href=&quot;https://man7.org/linux/man-pages/man8/bpfc.8.html&quot;&gt;the one used in &lt;code class=&quot;highlighter-rouge&quot;&gt;bpfc&lt;/code&gt;&lt;/a&gt;
but diverges in ways to make it hopefully clearer as to what’s happening,
and all code is annotated with &lt;code class=&quot;highlighter-rouge&quot;&gt;// comments&lt;/code&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>eperot</name></author><summary type="html">gVisor is a multi-layered security sandbox. seccomp-bpf is gVisor’s second layer of defense against container escape attacks. gVisor uses seccomp-bpf to filter its own syscalls by the host kernel. This significantly reduces the attack surface to the host that a compromised gVisor process can access. However, this layer comes at a cost: every legitimate system call that gVisor makes must be evaluated against this filter by the host kernel before it is actually executed. This blog post contains more than you ever wanted to know about seccomp-bpf, and explores the past few months of work to optimize gVisor’s use of it.</summary></entry><entry><title type="html">Faster filesystem access with Directfs</title><link href="/blog/2023/06/27/directfs/" rel="alternate" type="text/html" title=" Faster filesystem access with Directfs" /><published>2023-06-27T00:00:00-05:00</published><updated>2023-06-27T00:00:00-05:00</updated><id>/blog/2023/06/27/directfs</id><content type="html" xml:base="/blog/2023/06/27/directfs/">&lt;p&gt;Directfs is now the default in runsc. This feature gives gVisor’s application
kernel (the Sentry) secure direct access to the container filesystem, avoiding
expensive round trips to the filesystem gofer. Learn more about this feature in
the following blog that was
&lt;a href=&quot;https://opensource.googleblog.com/2023/06/optimizing-gvisor-filesystems-with-directfs.html&quot;&gt;originally posted&lt;/a&gt;
on &lt;a href=&quot;https://opensource.googleblog.com/&quot;&gt;Google Open Source Blog&lt;/a&gt;.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;h2 id=&quot;origins-of-the-gofer&quot;&gt;Origins of the Gofer&lt;/h2&gt;

&lt;p&gt;gVisor is used internally at Google to run a variety of services and workloads.
One of the challenges we faced while building gVisor was providing remote
filesystem access securely to the sandbox. gVisor’s strict
&lt;a href=&quot;https://gvisor.dev/docs/architecture_guide/security/&quot;&gt;security model&lt;/a&gt; and
defense in depth approach assumes that the sandbox may get compromised because
it shares the same execution context as the untrusted application. Hence the
sandbox cannot be given sensitive keys and credentials to access Google-internal
remote filesystems.&lt;/p&gt;

&lt;p&gt;To address this challenge, we added a trusted filesystem proxy called a “gofer”.
The gofer runs outside the sandbox, and provides a secure interface for
untrusted containers to access such remote filesystems. For architectural
simplicity, gofers were used to serve local filesystems as well as remote ones.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-27-gofer-proxy.svg&quot; alt=&quot;Figure 1&quot; title=&quot;Filesystem gofer proxy&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;isolating-the-container-filesystem-in-runsc&quot;&gt;Isolating the Container Filesystem in runsc&lt;/h2&gt;

&lt;p&gt;When gVisor was &lt;a href=&quot;https://github.com/google/gvisor&quot;&gt;open sourced&lt;/a&gt; as
&lt;a href=&quot;https://gvisor.dev/docs/&quot;&gt;runsc&lt;/a&gt;, the same gofer model was copied over to
maintain the same security guarantees. runsc was configured to start one gofer
process per container which serves the container filesystem to the sandbox over
a predetermined protocol (now
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/lisafs&quot;&gt;LISAFS&lt;/a&gt;). However, a gofer
adds a layer of indirection with significant overhead.&lt;/p&gt;

&lt;p&gt;This gofer model (built for remote filesystems) brings very few advantages for
the runsc use-case, where all the filesystems served by the gofer (like rootfs
and &lt;a href=&quot;https://docs.docker.com/storage/bind-mounts/&quot;&gt;bind mounts&lt;/a&gt;) are mounted
locally on the host. The gofer directly accesses them using filesystem syscalls.&lt;/p&gt;

&lt;p&gt;Linux provides some security primitives to effectively isolate local
filesystems. These include
&lt;a href=&quot;https://man7.org/linux/man-pages/man7/mount_namespaces.7.html&quot;&gt;mount namespaces&lt;/a&gt;,
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/pivot_root.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;pivot_root&lt;/code&gt;&lt;/a&gt; and
detached bind mounts&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. &lt;strong&gt;Directfs&lt;/strong&gt; is a new filesystem access mode that uses
these primitives to expose the container filesystem to the sandbox in a secure
manner. The sandbox’s view of the filesystem tree is limited to just the
container filesystem. The sandbox process is not given access to anything
mounted on the broader host filesystem. Even if the sandbox gets compromised,
these mechanisms provide additional barriers to prevent broader system
compromise.&lt;/p&gt;

&lt;h2 id=&quot;directfs&quot;&gt;Directfs&lt;/h2&gt;

&lt;p&gt;In directfs mode, the gofer still exists as a cooperative process outside the
sandbox. As usual, the gofer enters a new mount namespace, sets up appropriate
bind mounts to create the container filesystem in a new directory and then
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/pivot_root.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;pivot_root(2)&lt;/code&gt;&lt;/a&gt;s into
that directory. Similarly, the sandbox process enters new user and mount
namespaces and then
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/pivot_root.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;pivot_root(2)&lt;/code&gt;&lt;/a&gt;s into
an empty directory to ensure it cannot access anything via path traversal. But
instead of making RPCs to the gofer to access the container filesystem, the
sandbox requests the gofer to provide file descriptors to all the mount points
via &lt;a href=&quot;https://man7.org/linux/man-pages/man7/unix.7.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;SCM_RIGHTS&lt;/code&gt; messages&lt;/a&gt;.
The sandbox then directly makes file-descriptor-relative syscalls (e.g.
&lt;a href=&quot;https://linux.die.net/man/2/fstatat&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;fstatat(2)&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://linux.die.net/man/2/openat&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;openat(2)&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://linux.die.net/man/2/mkdirat&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;mkdirat(2)&lt;/code&gt;&lt;/a&gt;, etc.) to perform filesystem
operations.&lt;/p&gt;
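
&lt;p&gt;The descriptor hand-off can be sketched in Python as follows. This is a
simplified, single-process illustration rather than runsc’s implementation: the
function names are hypothetical, a temporary directory stands in for a container
mount point, and &lt;code class=&quot;highlighter-rouge&quot;&gt;socket.send_fds&lt;/code&gt; / &lt;code class=&quot;highlighter-rouge&quot;&gt;socket.recv_fds&lt;/code&gt; (Python 3.9+, Unix only)
wrap the underlying &lt;code class=&quot;highlighter-rouge&quot;&gt;SCM_RIGHTS&lt;/code&gt; control message.&lt;/p&gt;

```python
import os
import socket
import tempfile


def send_directory_fd(sock, path):
    """'Gofer' side: open a directory and pass its descriptor to the peer."""
    dir_fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # send_fds() wraps sendmsg() with an SCM_RIGHTS control message.
        socket.send_fds(sock, [b"mount"], [dir_fd])
    finally:
        os.close(dir_fd)  # The receiver holds its own duplicate.


def receive_directory_fd(sock):
    """'Sandbox' side: receive one descriptor sent with SCM_RIGHTS."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 16, 1)
    return fds[0]


def demo():
    gofer_sock, sandbox_sock = socket.socketpair()
    with tempfile.TemporaryDirectory() as root, gofer_sock, sandbox_sock:
        with open(os.path.join(root, "hello.txt"), "w") as f:
            f.write("hi")
        send_directory_fd(gofer_sock, root)
        mount_fd = receive_directory_fd(sandbox_sock)
        try:
            # Descriptor-relative calls: fstatat(2), mkdirat(2), openat(2).
            size = os.stat("hello.txt", dir_fd=mount_fd).st_size
            os.mkdir("subdir", dir_fd=mount_fd)
            file_fd = os.open("hello.txt", os.O_RDONLY, dir_fd=mount_fd)
            try:
                content = os.read(file_fd, 16)
            finally:
                os.close(file_fd)
            return size, content, sorted(os.listdir(root))
        finally:
            os.close(mount_fd)
```

&lt;p&gt;In this sketch the receiving side never opens the directory by path. It only
operates relative to the descriptor it was handed.&lt;/p&gt;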

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-27-directfs.svg&quot; alt=&quot;Figure 2&quot; title=&quot;Directfs configuration&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Previously, when the gofer performed all filesystem operations, we could deny all
these syscalls in the sandbox process using seccomp. But with directfs enabled,
the sandbox process’s seccomp filters need to allow the usage of these syscalls.
Most notably, the sandbox can now make
&lt;a href=&quot;https://linux.die.net/man/2/openat&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;openat(2)&lt;/code&gt;&lt;/a&gt; syscalls (which allow path
traversal), but with certain restrictions:
&lt;a href=&quot;https://github.com/google/gvisor/commit/114a033bd038519fa6e867c230dc4ad4e057e675&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;O_NOFOLLOW&lt;/code&gt; is required&lt;/a&gt;,
&lt;a href=&quot;https://github.com/google/gvisor/commit/fcbc289a7ac14b8d84d0c0b23c4b2a14fc626e79&quot;&gt;no access to procfs&lt;/a&gt;
and
&lt;a href=&quot;https://github.com/google/gvisor/commit/aa8abdfa9256cf057202ec8f4a81ba9f5d6a203f&quot;&gt;no directory FDs from the host&lt;/a&gt;.
We also had to give the sandbox the same privileges as the gofer (for example
&lt;code class=&quot;highlighter-rouge&quot;&gt;CAP_DAC_OVERRIDE&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;CAP_DAC_READ_SEARCH&lt;/code&gt;), so it can perform the same
filesystem operations.&lt;/p&gt;
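
&lt;p&gt;The effect of the &lt;code class=&quot;highlighter-rouge&quot;&gt;O_NOFOLLOW&lt;/code&gt; requirement can be seen in a small Python
sketch. It only demonstrates the behavior of the flag on Linux. In gVisor the
requirement is imposed by the seccomp filter, not by application code, and the
helper and file names here are hypothetical.&lt;/p&gt;

```python
import errno
import os
import tempfile


def open_nofollow(dir_fd, name):
    """Open name relative to dir_fd, refusing to follow a final symlink."""
    return os.open(name, os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)


def demo():
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "real.txt"), "w") as f:
            f.write("data")
        # A symlink pointing outside the directory, as a malicious
        # container filesystem might contain.
        os.symlink("/etc/passwd", os.path.join(root, "evil"))

        dir_fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
        try:
            fd = open_nofollow(dir_fd, "real.txt")  # Regular file: allowed.
            try:
                regular_ok = os.read(fd, 4) == b"data"
            finally:
                os.close(fd)
            try:
                os.close(open_nofollow(dir_fd, "evil"))
                symlink_errno = None  # The open unexpectedly succeeded.
            except OSError as e:
                symlink_errno = e.errno  # ELOOP on Linux.
            return regular_ok, symlink_errno
        finally:
            os.close(dir_fd)
```

&lt;p&gt;Note that &lt;code class=&quot;highlighter-rouge&quot;&gt;O_NOFOLLOW&lt;/code&gt; only applies to the final path component, which is
one reason it is combined with the other restrictions listed above.&lt;/p&gt;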

&lt;p&gt;It is noteworthy that only the trusted gofer provides FDs (of the container
filesystem) to the sandbox. The sandbox cannot walk backwards (using ‘..’) or
follow a malicious symlink to escape out of the container filesystem. In effect,
we’ve decreased our dependence on the syscall filters to catch bad behavior, but
correspondingly increased our dependence on Linux’s filesystem isolation
protections.&lt;/p&gt;

&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;

&lt;p&gt;Making RPCs to the gofer for every filesystem operation adds a lot of overhead
to runsc. Hence, avoiding gofer round trips significantly improves performance.
Let’s find out what this means for some of our benchmarks. We will run the
benchmarks using our newly released
&lt;a href=&quot;https://gvisor.dev/blog/2023/04/28/systrap-release/&quot;&gt;systrap platform&lt;/a&gt; on bind
mounts (as opposed to rootfs). This simulates more realistic use cases
because bind mounts are extensively used while configuring filesystems in
containers. Bind mounts also do not have an overlay
(&lt;a href=&quot;https://opensource.googleblog.com/2023/04/gvisor-improves-performance-with-root-filesystem-overlay.html&quot;&gt;like the rootfs mount&lt;/a&gt;),
so all operations go through the goferfs / directfs mount.&lt;/p&gt;

&lt;p&gt;Let’s first look at our
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/perf/linux/stat_benchmark.cc&quot;&gt;stat micro-benchmark&lt;/a&gt;,
which repeatedly calls
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/lstat.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;stat(2)&lt;/code&gt;&lt;/a&gt; on a file.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-27-stat-benchmark.svg&quot; alt=&quot;Figure 3&quot; title=&quot;Stat micro benchmark&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;stat(2)&lt;/code&gt; syscall is more than 2x faster! However, since this is not
representative of real-world applications, we should not extrapolate these
results. So let’s look at some
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/benchmarks/fs&quot;&gt;real-world benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-27-real-world-benchmarks.svg&quot; alt=&quot;Figure 4&quot; title=&quot;Real world benchmarks&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We see a 12% reduction in the absolute time to run these workloads and a 17%
reduction in Ruby load time!&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The gofer model in runsc was overly restrictive for accessing host files. We
were able to leverage existing filesystem isolation mechanisms in Linux to
bypass the gofer without compromising security. Directfs significantly improves
performance for certain workloads. This is part of our ongoing efforts to
improve gVisor performance. You can learn more about gVisor at
&lt;a href=&quot;http://www.gvisor.dev/&quot;&gt;gvisor.dev&lt;/a&gt;. You can also use gVisor in
&lt;a href=&quot;https://cloud.google.com/kubernetes-engine&quot;&gt;GKE&lt;/a&gt; with
&lt;a href=&quot;https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods&quot;&gt;GKE Sandbox&lt;/a&gt;.
Happy sandboxing!&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Detached bind mounts can be created by first creating a bind mount using
mount(MS_BIND) and then detaching it from the filesystem tree using
umount(MNT_DETACH). &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>ayushranjan</name></author><summary type="html">Directfs is now the default in runsc. This feature gives gVisor’s application kernel (the Sentry) secure direct access to the container filesystem, avoiding expensive round trips to the filesystem gofer. Learn more about this feature in the following blog that was originally posted on Google Open Source Blog.</summary></entry><entry><title type="html">Running Stable Diffusion on GPU with gVisor</title><link href="/blog/2023/06/20/gpu-pytorch-stable-diffusion/" rel="alternate" type="text/html" title=" Running Stable Diffusion on GPU with gVisor" /><published>2023-06-20T00:00:00-05:00</published><updated>2023-06-20T00:00:00-05:00</updated><id>/blog/2023/06/20/gpu-pytorch-stable-diffusion</id><content type="html" xml:base="/blog/2023/06/20/gpu-pytorch-stable-diffusion/">&lt;p&gt;gVisor is &lt;a href=&quot;https://github.com/google/gvisor/blob/master/g3doc/proposals/nvidia_driver_proxy.md&quot;&gt;starting to support GPU&lt;/a&gt; workloads. This post
showcases running the &lt;a href=&quot;https://stability.ai/blog/stable-diffusion-public-release&quot;&gt;Stable Diffusion&lt;/a&gt; generative model from &lt;a href=&quot;https://stability.ai/&quot;&gt;Stability AI&lt;/a&gt; to
generate images using a GPU from within gVisor. Both the
&lt;a href=&quot;https://github.com/AUTOMATIC1111/stable-diffusion-webui&quot;&gt;Automatic1111 Stable Diffusion web UI&lt;/a&gt;
and the &lt;a href=&quot;https://pytorch.org/&quot;&gt;PyTorch&lt;/a&gt; code used by Stable Diffusion were run entirely within gVisor
while being able to leverage the NVIDIA GPU.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-sandboxed-gpu.png&quot; alt=&quot;A sandboxed GPU&quot; title=&quot;A sandboxed GPU.&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;Sand&lt;/strong&gt;boxing a GPU. Generated with Stable Diffusion
v1.5.&lt;br /&gt;This picture gets a lot deeper once you realize that GPUs are made out
of sand.&lt;/span&gt;&lt;/p&gt;

&lt;h2 id=&quot;disclaimer&quot;&gt;Disclaimer&lt;/h2&gt;

&lt;p&gt;As of this writing (2023-06), &lt;a href=&quot;https://github.com/google/gvisor/blob/master/g3doc/proposals/nvidia_driver_proxy.md&quot;&gt;gVisor’s GPU support&lt;/a&gt; is not
generalized. Only some PyTorch workloads have been tested, on NVIDIA T4, L4,
A100, and H100 GPUs, and only with specific driver versions. To list the driver
versions that your runsc version supports, use one of the commands below.
Contributions are welcome to expand this set
to support other GPUs and driver versions!&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;$ make run TARGETS=runsc ARGS=&quot;nvproxy list-supported-drivers&quot;

$ runsc nvproxy list-supported-drivers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Additionally, while gVisor does its best to sandbox the workload, interacting
with the GPU inherently requires running code on GPU hardware, where isolation
is enforced by the GPU driver and hardware itself rather than gVisor. More to
come soon on the value of the protection gVisor provides for GPU workloads.&lt;/p&gt;

&lt;p&gt;In a few months, gVisor’s GPU support will have broadened and become
easier to use, such that it will not be constrained to the specific sets of
versions used here. In the meantime, this blog stands as an example of what’s
possible today with gVisor’s GPU support.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-spacesuit-helmets.png&quot; alt=&quot;Various space suit helmets&quot; title=&quot;Various space suit helmets.&quot; width=&quot;100%&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;A collection of astronaut helmets in various styles&lt;/strong&gt;.&lt;br /&gt;Other than the helmet in the center, each helmet was generated using Stable Diffusion v1.5.&lt;/span&gt;&lt;/p&gt;

&lt;h2 id=&quot;why-even-do-this&quot;&gt;Why even do this?&lt;/h2&gt;

&lt;p&gt;The recent explosion of machine learning models has led to a large number of new
open-source projects. Much like it is good practice to be careful about running
new software downloaded from the Internet, it is good practice to run new
open-source projects in a sandbox. For projects like the
&lt;a href=&quot;https://github.com/AUTOMATIC1111/stable-diffusion-webui&quot;&gt;Automatic1111 Stable Diffusion web UI&lt;/a&gt;,
which automatically download various models, components, and
&lt;a href=&quot;https://github.com/AUTOMATIC1111/stable-diffusion-webui-extensions/blob/master/index.json&quot;&gt;extensions&lt;/a&gt; from external repositories as
the user enables them in the web UI, this principle applies all the more.&lt;/p&gt;

&lt;p&gt;Additionally, within the machine learning space, tooling for packaging and
distributing models is still nascent. While some models (including Stable
Diffusion) are packaged using the more secure &lt;a href=&quot;https://github.com/huggingface/safetensors&quot;&gt;safetensors&lt;/a&gt; format, &lt;strong&gt;the
majority of models available online today are distributed using the
&lt;a href=&quot;https://www.splunk.com/en_us/blog/security/paws-in-the-pickle-jar-risk-vulnerability-in-the-model-sharing-ecosystem.html&quot;&gt;Pickle format&lt;/a&gt;, which can execute arbitrary Python code&lt;/strong&gt; upon deserialization.
As such, even when using trustworthy software, using Pickle-formatted models may
still be risky (&lt;strong&gt;Edited 2024-04-04:
&lt;a href=&quot;https://www.wiz.io/blog/wiz-and-hugging-face-address-risks-to-ai-infrastructure&quot;&gt;this exact vulnerability vector was found in Hugging Face’s Inference API&lt;/a&gt;&lt;/strong&gt;).
gVisor provides a layer of protection around this process which helps protect
the host machine.&lt;/p&gt;

&lt;p&gt;Third, &lt;strong&gt;machine learning applications are typically not I/O heavy&lt;/strong&gt;, which
means they tend not to experience a significant performance overhead when run
in gVisor. Uploading code to the GPU does not require a significant number of system
calls, and most communication to/from the GPU happens over shared memory, where
gVisor imposes no overhead. Therefore, the question is not so much “why should I
run this GPU workload in gVisor?” but rather “why not?”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-turbo.png&quot; alt=&quot;Cool astronauts don't look at explosions&quot; title=&quot;Cool astronauts don't look at explosions.&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;Cool astronauts don’t look at explosions&lt;/strong&gt;.
Generated using Stable Diffusion v1.5.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Lastly, running GPU workloads in gVisor is pretty cool.&lt;/p&gt;

&lt;h2 id=&quot;setup&quot;&gt;Setup&lt;/h2&gt;

&lt;p&gt;We use a Debian virtual machine on GCE. The machine needs to have a GPU and to
have sufficient RAM and disk space to handle Stable Diffusion and its large
model files. The following command creates a VM with 4 vCPUs, 15GiB of RAM, 64GB
of disk space, and an NVIDIA T4 GPU, running Debian 11 (bullseye). Since this is
just an experiment, the VM is set to self-destruct after 6 hours.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcloud compute instances create stable-diffusion-testing &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--zone&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;us-central1-a &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--machine-type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;n1-standard-4 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--max-run-duration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6h &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--instance-termination-action&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;DELETE &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--maintenance-policy&lt;/span&gt; TERMINATE &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--accelerator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1,type&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nvidia-tesla-t4 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--create-disk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;auto-delete&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;yes&lt;/span&gt;,boot&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;yes&lt;/span&gt;,device-name&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;stable-diffusion-testing,image&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;projects/debian-cloud/global/images/debian-11-bullseye-v20230509,mode&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;rw,size&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;64
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcloud compute ssh &lt;span class=&quot;nt&quot;&gt;--zone&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;us-central1-a stable-diffusion-testing
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All further commands in this post are performed while SSH’d into the VM. We
first need to install the specific NVIDIA driver version that gVisor is
currently compatible with.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; upgrade
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; build-essential linux-headers-&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;uname&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;runsc nvproxy list-supported-drivers
&lt;span class=&quot;nv&quot;&gt;$ DRIVER_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;some-driver-version &lt;span class=&quot;c&quot;&gt;# Get from your runsc binary.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-fSsl&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-O&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;https://us.download.nvidia.com/tesla/&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/NVIDIA-Linux-x86_64-&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;.run&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;sh NVIDIA-Linux-x86_64-&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;.run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;!--
The above in a single line, for convenience:
DRIVER_VERSION=some-driver-version; sudo apt-get update &amp;&amp; sudo apt-get -y upgrade &amp;&amp; sudo apt-get install -y build-essential linux-headers-$(uname -r) &amp;&amp; curl -fSsl -O &quot;https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run&quot; &amp;&amp; sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
--&gt;

&lt;p&gt;Next, we install Docker, per &lt;a href=&quot;https://docs.docker.com/engine/install/debian/&quot;&gt;its instructions&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; ca-certificates curl gnupg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; 0755 &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; /etc/apt/keyrings
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://download.docker.com/linux/debian/gpg | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--batch&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--yes&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /etc/apt/keyrings/docker.gpg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo chmod &lt;/span&gt;a+r /etc/apt/keyrings/docker.gpg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;deb [arch=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;dpkg &lt;span class=&quot;nt&quot;&gt;--print-architecture&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; /etc/os-release &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$VERSION_CODENAME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; stable&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/docker.list &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; docker-ce docker-ce-cli
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;!--
The above in a single line, for convenience:
sudo apt-get install -y ca-certificates curl gnupg &amp;&amp; sudo install -m 0755 -d /etc/apt/keyrings &amp;&amp; curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor --batch --yes -o /etc/apt/keyrings/docker.gpg &amp;&amp; sudo chmod a+r /etc/apt/keyrings/docker.gpg &amp;&amp; echo &quot;deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian $(. /etc/os-release &amp;&amp; echo &quot;$VERSION_CODENAME&quot;) stable&quot; | sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null &amp;&amp; sudo apt-get update &amp;&amp; sudo apt-get install -y docker-ce docker-ce-cli
--&gt;

&lt;p&gt;We will also need the &lt;a href=&quot;https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html&quot;&gt;NVIDIA container toolkit&lt;/a&gt;, which enables use of GPUs with
Docker. Per its
&lt;a href=&quot;https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html&quot;&gt;installation instructions&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ distribution&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; /etc/os-release&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$ID$VERSION_ID&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://nvidia.github.io/libnvidia-container/gpgkey | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt; https://nvidia.github.io/libnvidia-container/&lt;span class=&quot;nv&quot;&gt;$distribution&lt;/span&gt;/libnvidia-container.list | &lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/nvidia-container-toolkit.list
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; nvidia-container-toolkit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of course, we also need to &lt;a href=&quot;https://gvisor.dev/docs/user_guide/install/&quot;&gt;install gVisor&lt;/a&gt; itself.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; apt-transport-https ca-certificates curl gnupg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://gvisor.dev/archive.key | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /usr/share/keyrings/gvisor-archive-keyring.gpg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;deb [arch=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;dpkg &lt;span class=&quot;nt&quot;&gt;--print-architecture&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/gvisor.list &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; runsc

# As gVisor does not yet &lt;span class=&quot;nb&quot;&gt;enable &lt;/span&gt;GPU support by default, we need to &lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;the flags
# that will &lt;span class=&quot;nb&quot;&gt;enable &lt;/span&gt;it:
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;runsc &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--nvproxy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--nvproxy-docker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
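
&lt;p&gt;As a quick sanity check of the flags step above, the sketch below parses a Docker daemon configuration and returns the arguments registered for the runsc runtime. It assumes that &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc install&lt;/code&gt; records its flags as &lt;code class=&quot;highlighter-rouge&quot;&gt;runtimeArgs&lt;/code&gt; under &lt;code class=&quot;highlighter-rouge&quot;&gt;runtimes.runsc&lt;/code&gt; in &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/docker/daemon.json&lt;/code&gt;; the helper name, the sample text and the binary path in it are illustrative, not taken from a real machine.&lt;/p&gt;

```python
import json

def runsc_runtime_args(daemon_config_text, runtime_name="runsc"):
    """Return the runtimeArgs list registered for runtime_name, or None."""
    config = json.loads(daemon_config_text)
    runtime = config.get("runtimes", {}).get(runtime_name)
    if runtime is None:
        return None
    return runtime.get("runtimeArgs", [])

# Sample text standing in for /etc/docker/daemon.json.
# The layout and the binary path are assumptions for illustration.
sample = """
{
  "runtimes": {
    "runsc": {
      "path": "/usr/bin/runsc",
      "runtimeArgs": ["--nvproxy=true", "--nvproxy-docker=true"]
    }
  }
}
"""

args = runsc_runtime_args(sample)
print("--nvproxy=true" in args)
```

&lt;p&gt;On a real host, the same function could be fed the contents of the daemon configuration file instead of the sample string.&lt;/p&gt;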

&lt;p&gt;Now, let’s make sure everything works by running commands that involve more and
more of what we just set up.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Check that the NVIDIA drivers are installed, with the right version, and with
# a supported GPU attached
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;nvidia-smi &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;
GPU 0: Tesla T4 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo cat&lt;/span&gt; /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  DRIVER_VERSION  Wed Nov 30 06:39:21 UTC 2022

# Check that Docker works.
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker version
# &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]
Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
# &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]

# Check that gVisor works.
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc debian:latest dmesg | &lt;span class=&quot;nb&quot;&gt;head&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-1&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;    0.000000] Starting gVisor...

# Check that Docker GPU support &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;without gVisor&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; works.
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;
GPU 0: Tesla T4 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

# Check that gVisor works with the GPU.
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;
GPU 0: Tesla T4 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’re all set! Now we can actually get Stable Diffusion running.&lt;/p&gt;

&lt;p&gt;We used the following &lt;code class=&quot;highlighter-rouge&quot;&gt;Dockerfile&lt;/code&gt; to run Stable Diffusion and its web UI within
a GPU-enabled Docker container.&lt;/p&gt;

&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; python:3.10&lt;/span&gt;

# Set of dependencies that are needed to make this work.
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; git wget build-essential &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;        nghttp2 libnghttp2-dev libssl-dev ffmpeg libsm6 libxext6
# Clone the project at the revision used for this test.
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    &lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; /stable-diffusion-webui &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    git checkout baf6946e06249c5af9851c60171692c44ef633e0
# We don't want the build step to start the server.
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'/start()/d'&lt;/span&gt; /stable-diffusion-webui/launch.py
# Install some pip packages.
# Note that this command will run as part of the Docker build process,
# which is *not* sandboxed by gVisor.
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; /stable-diffusion-webui &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;COMMANDLINE_ARGS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;--skip-torch-cuda-test&lt;/span&gt; python launch.py
&lt;span class=&quot;k&quot;&gt;WORKDIR&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; /stable-diffusion-webui&lt;/span&gt;
# This causes the web UI to use the Gradio service to create a public URL.
# Do not use this if you plan on leaving the container running long-term.
&lt;span class=&quot;k&quot;&gt;ENV&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; COMMANDLINE_ARGS=--share&lt;/span&gt;
# Start the webui app.
&lt;span class=&quot;k&quot;&gt;CMD&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; [&quot;python&quot;, &quot;webui.py&quot;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We build the image and create a container with it using the &lt;code class=&quot;highlighter-rouge&quot;&gt;docker&lt;/code&gt;
command line.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; Dockerfile
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;... Paste the above contents...&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
^D
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker build &lt;span class=&quot;nt&quot;&gt;--tag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sdui &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, we can start the Stable Diffusion web UI. Note that it will take a long
time to start, as it has to download all the models from the Internet. To keep
this post simple, we didn’t set up any kind of volume that would enable data
persistence, so it will do this every time the container starts.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sdui &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; sdui

# Follow the logs:
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker logs &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; sdui
# &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]
Calculating sha256 &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; /stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors: Running on &lt;span class=&quot;nb&quot;&gt;local &lt;/span&gt;URL:  http://127.0.0.1:7860
Running on public URL: https://4446d982b4129a66d7.gradio.live

This share &lt;span class=&quot;nb&quot;&gt;link &lt;/span&gt;expires &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;72 hours.
# &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’re all set! Now we can browse to the Gradio URL shown in the logs and start
generating pictures, all within the secure confines of gVisor.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-stable-diffusion-web-ui.png&quot; alt=&quot;Stable Diffusion Web UI&quot; title=&quot;Stable Diffusion UI.&quot; width=&quot;100%&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;Stable Diffusion Web UI screenshot.&lt;/strong&gt; Inner image
generated with Stable Diffusion v1.5.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Happy sandboxing!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-astronaut-thumbs-up.png&quot; alt=&quot;Astronaut showing thumbs up&quot; title=&quot;Astronaut showing thumbs up.&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;Happy sandboxing!&lt;/strong&gt; Generated with Stable Diffusion
v1.5.&lt;/span&gt;&lt;/p&gt;</content><author><name>eperot</name></author><summary type="html">gVisor is starting to support GPU workloads. This post showcases running the Stable Diffusion generative model from Stability AI to generate images using a GPU from within gVisor. Both the Automatic1111 Stable Diffusion web UI and the PyTorch code used by Stable Diffusion were run entirely within gVisor while being able to leverage the NVIDIA GPU.</summary></entry><entry><title type="html">Rootfs Overlay</title><link href="/blog/2023/05/08/rootfs-overlay/" rel="alternate" type="text/html" title=" Rootfs Overlay" /><published>2023-05-08T00:00:00-05:00</published><updated>2023-05-08T00:00:00-05:00</updated><id>/blog/2023/05/08/rootfs-overlay</id><content type="html" xml:base="/blog/2023/05/08/rootfs-overlay/">&lt;p&gt;Root filesystem overlay is now the default in runsc. This improves performance
for filesystem-heavy workloads by overlaying the container root filesystem with
a tmpfs filesystem. Learn more about this feature in the following blog that was
&lt;a href=&quot;https://opensource.googleblog.com/2023/04/gvisor-improves-performance-with-root-filesystem-overlay.html&quot;&gt;originally posted&lt;/a&gt;
on &lt;a href=&quot;https://opensource.googleblog.com/&quot;&gt;Google Open Source Blog&lt;/a&gt;.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;h2 id=&quot;costly-filesystem-access&quot;&gt;Costly Filesystem Access&lt;/h2&gt;

&lt;p&gt;gVisor uses a trusted filesystem proxy process (“gofer”) to access the
filesystem on behalf of the sandbox. The sandbox process is considered untrusted
in gVisor’s
&lt;a href=&quot;https://gvisor.dev/docs/architecture_guide/security/&quot;&gt;security model&lt;/a&gt;. As a
result, it is not given direct access to the container filesystem and
&lt;a href=&quot;https://github.com/google/gvisor/tree/master/runsc/boot/filter&quot;&gt;its seccomp filters&lt;/a&gt;
do not allow filesystem syscalls.&lt;/p&gt;

&lt;p&gt;In gVisor, the container rootfs and
&lt;a href=&quot;https://docs.docker.com/storage/bind-mounts/#&quot;&gt;bind mounts&lt;/a&gt; are configured to
be served by a gofer.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-gofer-diagram.svg&quot; alt=&quot;Figure 1&quot; title=&quot;Gofer process diagram.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When the container needs to perform a filesystem operation, it makes an RPC to
the gofer, which makes host system calls to service the RPC. This is quite
expensive due to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;RPC cost: This is the cost of communicating with the gofer process,
including process scheduling, message serialization and
&lt;a href=&quot;https://en.wikipedia.org/wiki/Inter-process_communication&quot;&gt;IPC&lt;/a&gt; system
calls.
    &lt;ul&gt;
      &lt;li&gt;To ameliorate this, gVisor recently developed a purpose-built protocol
called &lt;a href=&quot;https://github.com/google/gvisor/tree/master/pkg/lisafs&quot;&gt;LISAFS&lt;/a&gt;
which is much more efficient than its predecessor.&lt;/li&gt;
      &lt;li&gt;gVisor is also
&lt;a href=&quot;https://groups.google.com/g/gvisor-users/c/v-ODHzCrIjE&quot;&gt;experimenting&lt;/a&gt;
with giving the sandbox direct access to the container filesystem in a
secure manner. This would essentially nullify RPC costs as it avoids the
gofer being in the critical path of filesystem operations.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Syscall cost: This is the cost of making the host syscall which actually
accesses/modifies the container filesystem. Syscalls are expensive, because
they perform context switches into the kernel and back into userspace.
    &lt;ul&gt;
      &lt;li&gt;To help with this, gVisor heavily caches the filesystem tree in memory.
So operations like
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/lstat.2.html&quot;&gt;stat(2)&lt;/a&gt; on cached
files are serviced quickly. But other operations like
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/mkdir.2.html&quot;&gt;mkdir(2)&lt;/a&gt; or
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/rename.2.html&quot;&gt;rename(2)&lt;/a&gt; still
need to make host syscalls.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;container-root-filesystem&quot;&gt;Container Root Filesystem&lt;/h2&gt;

&lt;p&gt;In Docker and Kubernetes, the container’s root filesystem (rootfs) is based on
the filesystem packaged with the image. The image’s filesystem is immutable. Any
change a container makes to the rootfs is stored separately and is destroyed
with the container. This way, the image’s filesystem can be shared efficiently
with all containers running the same image. This is different from bind mounts,
which allow containers to access the bound host filesystem tree. Changes to bind
mounts are always propagated to the host and persist after the container exits.&lt;/p&gt;

&lt;p&gt;Docker and Kubernetes both use the
&lt;a href=&quot;https://docs.kernel.org/filesystems/overlayfs.html&quot;&gt;overlay filesystem&lt;/a&gt; by
default to configure container rootfs. Overlayfs mounts are composed of one
upper layer and multiple lower layers. The overlay filesystem presents a merged
view of all these filesystem layers at its mount location and ensures that lower
layers are read-only while all changes are held in the upper layer. The lower
layer(s) constitute the “image layer” and the upper layer is the “container
layer”. When the container is destroyed, the upper layer mount is destroyed as
well, discarding the root filesystem changes the container may have made.
Docker’s
&lt;a href=&quot;https://docs.docker.com/storage/storagedriver/overlayfs-driver/#how-the-overlay2-driver-works&quot;&gt;overlayfs driver documentation&lt;/a&gt;
has a good explanation.&lt;/p&gt;

&lt;h2 id=&quot;rootfs-configuration-before&quot;&gt;Rootfs Configuration Before&lt;/h2&gt;

&lt;p&gt;Let’s consider an example where the image has files &lt;code class=&quot;highlighter-rouge&quot;&gt;foo&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;baz&lt;/code&gt;. The
container overwrites &lt;code class=&quot;highlighter-rouge&quot;&gt;foo&lt;/code&gt; and creates a new file &lt;code class=&quot;highlighter-rouge&quot;&gt;bar&lt;/code&gt;. The diagram below shows
how the root filesystem was previously configured in gVisor: every access or
mutation went through the gofer to the overlaid directory on the host. It also
shows the state of the host overlay filesystem.&lt;/p&gt;
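
&lt;p&gt;The merged view for this example can be modelled with a toy lookup in which the upper layer shadows the lower layers. This is only an illustration of the layering rule, using plain dictionaries and made-up file contents; it is not how overlayfs or gVisor is implemented, and the function name is hypothetical.&lt;/p&gt;

```python
def merged_view(upper, lowers):
    """Return the name -> content mapping visible at the overlay mount.

    lowers[0] is the topmost lower layer; lower layers are never modified.
    """
    view = {}
    # Apply lower layers bottom-up so that higher layers shadow deeper ones.
    for layer in reversed(lowers):
        view.update(layer)
    # The upper (container) layer shadows every lower (image) layer.
    view.update(upper)
    return view

# Image layer with foo and baz; the container overwrote foo and created bar.
image_layer = {"foo": "image foo", "baz": "image baz"}
container_layer = {"foo": "container foo", "bar": "container bar"}

view = merged_view(container_layer, [image_layer])
print(sorted(view.items()))
```

&lt;p&gt;Dropping &lt;code class=&quot;highlighter-rouge&quot;&gt;container_layer&lt;/code&gt; leaves &lt;code class=&quot;highlighter-rouge&quot;&gt;image_layer&lt;/code&gt; untouched, which mirrors how destroying the container discards only the upper layer.&lt;/p&gt;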

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-before.svg&quot; alt=&quot;Figure 2&quot; title=&quot;Rootfs state before.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;opportunity-sandbox-internal-overlay&quot;&gt;Opportunity! Sandbox Internal Overlay&lt;/h2&gt;

&lt;p&gt;Given that the upper layer is destroyed with the container and that it is
expensive to access/mutate a host filesystem from the sandbox, why keep the
upper layer on the host at all? Instead we can move the upper layer &lt;strong&gt;into the
sandbox&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea is to overlay the rootfs using a sandbox-internal overlay mount. We can
use a tmpfs upper (container) layer and a read-only lower layer served by the
gofer client. Any changes to rootfs would be held in tmpfs (in-memory).
Accessing/mutating the upper layer would not require any gofer RPCs or syscalls
to the host. This really speeds up filesystem operations on the upper layer,
which contains newly created or copied-up files and directories.&lt;/p&gt;

&lt;p&gt;Using the same example as above, the following diagram shows what the rootfs
configuration would look like using a sandbox-internal overlay.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-memory.svg&quot; alt=&quot;Figure 3&quot; title=&quot;Memory-backed rootfs overlay.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;host-backed-overlay&quot;&gt;Host-Backed Overlay&lt;/h2&gt;

&lt;p&gt;The tmpfs mount by default will use the sandbox process’s memory to back all the
file data in the mount. This can cause sandbox memory usage to blow up and
exhaust the container’s memory limits, so it’s important to store all file data
from the tmpfs upper layer on disk. We need to have a tmpfs-backing “filestore” on
the host filesystem. Using the example from above, this filestore on the host
will store file data for &lt;code class=&quot;highlighter-rouge&quot;&gt;foo&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;bar&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This would essentially flatten all regular files in tmpfs into one host file.
The sandbox can &lt;a href=&quot;https://man7.org/linux/man-pages/man2/mmap.2.html&quot;&gt;mmap(2)&lt;/a&gt; the
filestore into its address space. This allows it to access and mutate the
filestore very efficiently, without incurring gofer RPC or syscall overhead.&lt;/p&gt;
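
&lt;p&gt;The effect of mapping a backing file can be sketched with Python’s &lt;code class=&quot;highlighter-rouge&quot;&gt;mmap&lt;/code&gt; module. This is a minimal illustration and not gVisor code: the temporary file merely stands in for the filestore, and the function name and size are assumptions.&lt;/p&gt;

```python
import mmap
import tempfile

def write_through_mapping(payload, size=4096):
    """Write payload via a shared file mapping, then read it back from the file."""
    with tempfile.TemporaryFile() as filestore:
        # Reserve space in the backing file before mapping it.
        filestore.truncate(size)
        with mmap.mmap(filestore.fileno(), size) as mapping:
            # This store is a plain memory write; no write(2) call is issued.
            mapping[0:len(payload)] = payload
            # Flush so that the read-back below observes the data portably.
            mapping.flush()
        filestore.seek(0)
        return filestore.read(len(payload))

print(write_through_mapping(b"foo data"))
```

&lt;p&gt;Once the mapping exists, reads and writes are ordinary memory accesses, which is why this path can avoid a per-operation RPC or system call.&lt;/p&gt;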

&lt;h2 id=&quot;self-backed-overlay&quot;&gt;Self-Backed Overlay&lt;/h2&gt;

&lt;p&gt;In Kubernetes, you can set
&lt;a href=&quot;https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage&quot;&gt;local ephemeral storage limits&lt;/a&gt;.
The upper layer of the rootfs overlay (writeable container layer) on the host
&lt;a href=&quot;https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-emphemeralstorage-consumption&quot;&gt;contributes towards this limit&lt;/a&gt;.
The kubelet enforces this limit by
&lt;a href=&quot;https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L57-L58&quot;&gt;traversing&lt;/a&gt;
the entire
&lt;a href=&quot;https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/snapshots/overlay/overlay.go#L189-L190&quot;&gt;upper layer&lt;/a&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;stat(2)&lt;/code&gt;-ing all files and
&lt;a href=&quot;https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L69-L74&quot;&gt;summing up&lt;/a&gt;
their &lt;code class=&quot;highlighter-rouge&quot;&gt;stat.st_blocks*block_size&lt;/code&gt;. If we move the upper layer into the sandbox,
then the host upper layer is empty and the kubelet will not be able to enforce
these limits.&lt;/p&gt;
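
&lt;p&gt;The accounting described above can be approximated as follows. This is a rough sketch of a du-style scan and not the kubelet or containerd code; it assumes a Unix host where &lt;code class=&quot;highlighter-rouge&quot;&gt;st_blocks&lt;/code&gt; is reported in 512-byte units, and the function and file names are hypothetical.&lt;/p&gt;

```python
import os
import tempfile

BLOCK_SIZE = 512  # assumption: st_blocks counts 512-byte units, as on Linux

def disk_usage_bytes(root):
    """Sum allocated bytes under root, counting each inode only once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in paths:
            st = os.lstat(path)
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # hard link to an inode that was already counted
            seen.add(key)
            total += st.st_blocks * BLOCK_SIZE
    return total

# A temporary directory stands in for the host upper layer.
with tempfile.TemporaryDirectory() as upper_layer:
    with open(os.path.join(upper_layer, "filestore"), "wb") as f:
        f.write(b"x" * 8192)
    print(disk_usage_bytes(upper_layer))
```

&lt;p&gt;A scan like this only sees files that exist on the host, so file data held purely inside the sandbox would not be counted.&lt;/p&gt;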

&lt;p&gt;To address this issue, we
&lt;a href=&quot;https://github.com/google/gvisor/commit/a53b22ad5283b00b766178eff847c3193c1293b7&quot;&gt;introduced “self-backed” overlays&lt;/a&gt;,
which create the filestore in the host upper layer. This way, when the kubelet
scans the host upper layer, the filestore will be detected and its
&lt;code class=&quot;highlighter-rouge&quot;&gt;stat.st_blocks&lt;/code&gt; should be representative of the total file usage in the
sandbox-internal upper layer. It is also important to hide this filestore from
the containerized application to avoid confusing it. We do so by
&lt;a href=&quot;https://github.com/google/gvisor/commit/09459b203a532c24fbb76cc88484d533356b8b91&quot;&gt;creating a whiteout&lt;/a&gt;
in the sandbox-internal upper layer, which blocks this file from appearing in
the merged directory.&lt;/p&gt;

&lt;p&gt;The following diagram shows what the rootfs configuration looks like in gVisor
today.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-self.svg&quot; alt=&quot;Figure 4&quot; title=&quot;Self-backed rootfs overlay.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;performance-gains&quot;&gt;Performance Gains&lt;/h2&gt;

&lt;p&gt;Let’s look at some filesystem-intensive workloads to see how rootfs overlay
impacts performance. These benchmarks were run on a gLinux desktop with the
&lt;a href=&quot;https://gvisor.dev/docs/architecture_guide/platforms/#kvm&quot;&gt;KVM platform&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;micro-benchmark&quot;&gt;Micro Benchmark&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://linux-test-project.github.io/&quot;&gt;Linux Test Project&lt;/a&gt; provides a
&lt;a href=&quot;https://github.com/linux-test-project/ltp/tree/master/testcases/kernel/fs/fsstress&quot;&gt;fsstress binary&lt;/a&gt;.
This program performs a large number of filesystem operations concurrently,
creating and modifying a large filesystem tree of all sorts of files. We ran
this program on the container’s root filesystem. The exact usage was:&lt;/p&gt;

&lt;p&gt;    &lt;code class=&quot;highlighter-rouge&quot;&gt;sh -c &quot;mkdir /test &amp;amp;&amp;amp; time fsstress -d /test -n 500 -p
20 -s 1680153482 -X -l 10&quot;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can use the -v flag (verbose mode) to see what filesystem operations are
being performed.&lt;/p&gt;

&lt;p&gt;The results were astounding! Rootfs overlay reduced the time to run this
fsstress program &lt;strong&gt;from 262.79 seconds to 3.18 seconds&lt;/strong&gt;! However, note that
such microbenchmarks are not representative of real-world applications and we
should not extrapolate these results to real-world performance.&lt;/p&gt;

&lt;h3 id=&quot;real-world-benchmark&quot;&gt;Real-world Benchmark&lt;/h3&gt;

&lt;p&gt;Build jobs are very filesystem-intensive workloads. They read a lot of source
files, compile them, and write out binaries and object files. Let’s consider building
the &lt;a href=&quot;https://github.com/abseil/abseil-cpp&quot;&gt;abseil-cpp project&lt;/a&gt; with
&lt;a href=&quot;https://bazel.build/&quot;&gt;bazel&lt;/a&gt;. Bazel performs a lot of filesystem operations in
rootfs, specifically in its cache located at &lt;code class=&quot;highlighter-rouge&quot;&gt;~/.cache/bazel/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is representative of the real world because many other applications also
use the container root filesystem as scratch space due to the handy property
that it disappears on container exit. To make this more realistic, the
abseil-cpp repo was attached to the container using a bind mount, which does not
have an overlay.&lt;/p&gt;

&lt;p&gt;When measuring performance, we care about reducing the sandboxing overhead and
bringing gVisor performance as close as possible to unsandboxed performance.
Sandboxing overhead can be calculated using the formula &lt;em&gt;overhead = (s-n)/n&lt;/em&gt;
where &lt;code class=&quot;highlighter-rouge&quot;&gt;s&lt;/code&gt; is the amount of time taken to run a workload inside a gVisor sandbox
and &lt;code class=&quot;highlighter-rouge&quot;&gt;n&lt;/code&gt; is the time taken to run the same workload natively (unsandboxed). The
following graph shows that rootfs overlay &lt;strong&gt;halved the sandboxing overhead&lt;/strong&gt; for
the abseil build!&lt;/p&gt;
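
&lt;p&gt;The formula is easy to apply by hand. The timings below are made-up numbers chosen only to show what halving the overhead means; they are not the measurements behind the graph, and the function name is hypothetical.&lt;/p&gt;

```python
def sandboxing_overhead(sandboxed_seconds, native_seconds):
    """Return (s - n) / n, the relative cost of running inside the sandbox."""
    if not native_seconds > 0:
        raise ValueError("native_seconds must be positive")
    return (sandboxed_seconds - native_seconds) / native_seconds

# Hypothetical timings: a native build takes 100 s.
native = 100.0
without_overlay = sandboxing_overhead(150.0, native)  # 50 s slower than native
with_overlay = sandboxing_overhead(125.0, native)     # 25 s slower than native

print(without_overlay, with_overlay)
```

&lt;p&gt;With these illustrative numbers the overhead drops from 0.5 to 0.25, even though the total run time only falls from 150 to 125 seconds.&lt;/p&gt;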

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-benchmark-result.svg&quot; alt=&quot;Figure 5&quot; title=&quot;Sandbox Overhead: rootfs overlay vs no overlay.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Rootfs overlay in gVisor substantially improves performance for many
filesystem-intensive workloads, so that developers no longer have to make large
tradeoffs between performance and security. We recently made this optimization
&lt;a href=&quot;https://github.com/google/gvisor/commit/38750cdedcce19a3039da10e515f5852565d2c7e&quot;&gt;the default&lt;/a&gt;
in runsc. This is part of our ongoing efforts to improve gVisor performance. You
can use gVisor in GKE with GKE Sandbox. Happy sandboxing!&lt;/p&gt;</content><author><name>ayushranjan</name></author><summary type="html">Root filesystem overlay is now the default in runsc. This improves performance for filesystem-heavy workloads by overlaying the container root filesystem with a tmpfs filesystem. Learn more about this feature in the following blog that was originally posted on Google Open Source Blog.</summary></entry><entry><title type="html">Releasing Systrap - A high-performance gVisor platform</title><link href="/blog/2023/04/28/systrap-release/" rel="alternate" type="text/html" title=" Releasing Systrap - A high-performance gVisor platform" /><published>2023-04-28T00:00:00-05:00</published><updated>2023-04-28T00:00:00-05:00</updated><id>/blog/2023/04/28/systrap-release</id><content type="html" xml:base="/blog/2023/04/28/systrap-release/">&lt;p&gt;We are releasing a new gVisor platform: Systrap. Like the existing ptrace
platform, Systrap runs on most Linux machines out of the box without
virtualization. Unlike the ptrace platform, it’s fast 🚀. Go try it by adding
&lt;code class=&quot;highlighter-rouge&quot;&gt;--platform=systrap&lt;/code&gt; to the runsc flags. If you want to know more about it, read
on.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;gVisor is a security boundary for arbitrary Linux processes. Boundaries do not
come for free, and gVisor imposes some performance overhead on sandboxed
applications. One of the most fundamental performance challenges with the
security model implemented by gVisor is system call interception, which is the
focus of this post.&lt;/p&gt;

&lt;p&gt;To recap the
&lt;a href=&quot;https://gvisor.dev/docs/architecture_guide/security/#what-can-a-sandbox-do&quot;&gt;security model&lt;/a&gt;:
gVisor is an application kernel that implements the Linux ABI. This includes
system calls, signals, memory management, and more. For example, when a
sandboxed application calls
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/read.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;read(2)&lt;/code&gt;&lt;/a&gt;, it actually
transparently calls into
&lt;a href=&quot;https://github.com/google/gvisor/blob/44e2d0fcfeb641f3b8013c3f93cacdae447cc0f1/pkg/sentry/syscalls/linux/sys_read_write.go#L36&quot;&gt;gVisor’s implementation of this system call&lt;/a&gt;.
This minimizes the attack surface of the host kernel, because sandboxed programs
simply can’t make system calls directly to the host in the first place&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. This
interception happens through an internal layer called the Platform interface,
which we have written about in a previous
&lt;a href=&quot;https://gvisor.dev/blog/2020/10/22/platform-portability/&quot;&gt;blog post&lt;/a&gt;. To handle
these interceptions, this interface must also create new address spaces,
allocate memory, and create execution contexts to run the workload.&lt;/p&gt;

&lt;p&gt;gVisor had two platform implementations: KVM and ptrace. The KVM platform uses
the kernel’s KVM functionality to allow the Sentry to act as both guest OS and
VMM (virtual machine monitor). It does system call interception just like a
normal virtual machine would. This gives good performance when using bare-metal
virtualization, but performance degrades noticeably with nested virtualization. The
other obvious downside is that it requires support for nested virtualization in
the first place, which is not available on all hardware (such as ARM CPUs) or
in all cloud environments.&lt;/p&gt;

&lt;p&gt;The ptrace platform was the alternative wherever KVM was not available. It works
through the
&lt;a href=&quot;http://man7.org/linux/man-pages/man2/ptrace.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;PTRACE_SYSEMU&lt;/code&gt;&lt;/a&gt; action,
which makes the user process hand back execution to the Sentry whenever it
encounters a system call. This is a clean method to achieve system call
interception in any environment, virtualized or not, except that it’s quite
slow. To see how slow, an unrealistic but highly illustrative benchmark to use
is the
&lt;a href=&quot;https://github.com/google/gvisor/blob/108410638aa8480e82933870ba8279133f543d2b/test/perf/linux/getpid_benchmark.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;getpid&lt;/code&gt; benchmark&lt;/a&gt;&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.
This benchmark runs the
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/getpid.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;getpid(2)&lt;/code&gt;&lt;/a&gt; system call
in a tight &lt;code class=&quot;highlighter-rouge&quot;&gt;while&lt;/code&gt; loop. No useful application has this behavior, so it is not a
realistic benchmark, but it is well-suited to measure system call latency.&lt;/p&gt;
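
&lt;p&gt;To make the idea concrete, here is a minimal Go sketch of the same kind of
measurement. It is not the linked C++ benchmark; the function name
&lt;code class=&quot;highlighter-rouge&quot;&gt;getpidLatency&lt;/code&gt; and the iteration count are assumptions chosen for
illustration, and &lt;code class=&quot;highlighter-rouge&quot;&gt;n&lt;/code&gt; is assumed to be positive.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// getpidLatency calls getpid(2) n times back to back and returns the
// average wall-clock time per call. n must be positive.
func getpidLatency(n int) time.Duration {
	start := time.Now()
	for i := 0; i != n; i++ {
		syscall.Getpid() // result is discarded; only the round trip matters
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	fmt.Println("average getpid latency:", getpidLatency(1000000))
}
```

&lt;p&gt;Running the same binary natively and inside a sandbox gives a rough feel for
the per-system-call interception cost of each platform.&lt;/p&gt;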

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-04-28-getpid-ptrace-vs-native.svg&quot; alt=&quot;Figure 1&quot; title=&quot;Getpid benchmark: ptrace vs. native Linux.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;All &lt;code class=&quot;highlighter-rouge&quot;&gt;getpid&lt;/code&gt; runs have been performed on a GCE n2-standard-4 VM, with the
&lt;code class=&quot;highlighter-rouge&quot;&gt;debian-11-bullseye-v20230306&lt;/code&gt; image.&lt;/p&gt;

&lt;p&gt;While this benchmark is not applicable to most real-world workloads, just about
any workload will generally suffer from high overhead in system call
performance. Since running in a virtualized environment is the default state for
most cloud users these days, it’s important that gVisor performs well in this
context. Systrap is the new platform targeting this important use case.&lt;/p&gt;

&lt;p&gt;Systrap relies on multiple techniques to implement the Platform interface. Like
the ptrace platform, Systrap uses Linux’s ptrace subsystem to initialize
workload executor threads, which are started as child processes of the main
gVisor sentry process. Systrap additionally sets a very restrictive seccomp
filter, installs a custom signal handler, and allocates chunks of memory shared
between user threads and the runsc sentry. This shared memory is what serves as the
main form of communication between the sentry and sandboxed programs: whenever
the sandboxed process attempts to execute a system call, it triggers a &lt;code class=&quot;highlighter-rouge&quot;&gt;SIGSYS&lt;/code&gt;
signal which is handled by our signal handler. The signal handler in turn
populates shared memory regions, and requests the sentry to handle the requested
system call. This alone proved to be faster than using &lt;code class=&quot;highlighter-rouge&quot;&gt;PTRACE_SYSEMU&lt;/code&gt;, as
demonstrated by the &lt;code class=&quot;highlighter-rouge&quot;&gt;getpid&lt;/code&gt; benchmark:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-04-28-getpid-ptrace-vs-systrap-unoptimized.svg&quot; alt=&quot;Figure 2&quot; title=&quot;Getpid benchmark: ptrace vs. Systrap.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Can we make it even faster? Recall what the main purpose of our signal handler
is: to send a request to the sentry via shared memory. To do that, the sandboxed
process must first incur the overhead of executing the seccomp filter&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, and
then generating a full signal stack before being able to run the signal handler.
What if there were a way to simply have the sandboxed process jump to another
user-space function when it wanted to perform a system call? Well, turns out,
there is&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;! There is a popular x86 instruction pattern that’s used to perform
system calls, and it goes a little something like this: &lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;mov sysno, %eax;
syscall&lt;/code&gt;&lt;/strong&gt;. The size of the mov instruction is 5 bytes and the size of the
syscall instruction is 2 bytes. Luckily this is just enough space to fit in a
&lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;jmp *%gs:offset&lt;/code&gt;&lt;/strong&gt; instruction. When the signal handler sees this instruction
pattern, it signals to the sentry that the original instructions can be replaced
with a &lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;jmp&lt;/code&gt;&lt;/strong&gt; to trampoline code that performs the same function as the
regular &lt;code class=&quot;highlighter-rouge&quot;&gt;SIGSYS&lt;/code&gt; signal handler. The system call number is not lost, but rather
encoded in the offset. The results are even more impressive:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-04-28-getpid-ptrace-vs-systrap-opt.svg&quot; alt=&quot;Figure 3&quot; title=&quot;Getpid benchmark: ptrace vs. Optimized Systrap.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
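
&lt;p&gt;The pattern match described above can be sketched in a few lines of Go. This
is only an illustration under stated assumptions, not Systrap’s actual code:
&lt;code class=&quot;highlighter-rouge&quot;&gt;matchSyscallSite&lt;/code&gt; is a hypothetical helper, and it only recognizes the
5-byte &lt;code class=&quot;highlighter-rouge&quot;&gt;mov&lt;/code&gt; form with opcode &lt;code class=&quot;highlighter-rouge&quot;&gt;0xB8&lt;/code&gt; followed by the 2-byte
&lt;code class=&quot;highlighter-rouge&quot;&gt;syscall&lt;/code&gt; encoding &lt;code class=&quot;highlighter-rouge&quot;&gt;0x0F 0x05&lt;/code&gt;.&lt;/p&gt;

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// matchSyscallSite reports whether the 7 bytes at a call site are
// "mov $imm32, %eax" (opcode 0xB8 plus a 4-byte little-endian immediate)
// followed by "syscall" (0x0F 0x05). On a match it also returns the
// system call number taken from the immediate.
func matchSyscallSite(site [7]byte) (uint32, bool) {
	if site[0] != 0xB8 || site[5] != 0x0F || site[6] != 0x05 {
		return 0, false
	}
	return binary.LittleEndian.Uint32(site[1:5]), true
}

func main() {
	// mov $39, %eax; syscall (39 is getpid on x86_64 Linux).
	site := [7]byte{0xB8, 39, 0, 0, 0, 0x0F, 0x05}
	fmt.Println(matchSyscallSite(site))
}
```

&lt;p&gt;Because the system call number is recovered from the immediate, a patcher
built on this check could pick a jump target per system call, which is the
role the offset plays in the description above.&lt;/p&gt;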

&lt;p&gt;As mentioned, the &lt;code class=&quot;highlighter-rouge&quot;&gt;getpid&lt;/code&gt; benchmark is not representative of real-world
performance. To get a better picture of the magnitude of improvement, here are
some real-world workloads:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/benchmarks/fs/bazel_test.go&quot;&gt;Build ABSL benchmark&lt;/a&gt;
measures compilation performance by compiling
&lt;a href=&quot;https://abseil.io/&quot;&gt;abseil.io&lt;/a&gt;; this workload depends heavily on system
calls because it performs a lot of filesystem I/O (gVisor’s
file system overhead also depends on the file system isolation it
implements, which is something you can learn about
&lt;a href=&quot;https://gvisor.dev/docs/user_guide/filesystem/&quot;&gt;here&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;The
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/benchmarks/media/ffmpeg_test.go&quot;&gt;ffmpeg benchmark&lt;/a&gt;
runs a multimedia processing tool, to perform video stream encoding/decoding
for example; this workload does not require a significant number of system
calls and there are very few userspace to kernel mode switches.&lt;/li&gt;
  &lt;li&gt;The
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/benchmarks/ml/tensorflow_test.go&quot;&gt;Tensorflow benchmark&lt;/a&gt;
trains a variety of machine learning models on CPU; the system-call usage of
this workload is in between compilation and ffmpeg, due to needing to
retrieve training and validation data, but the majority of time is still
spent just running userspace computations.&lt;/li&gt;
  &lt;li&gt;Finally, the Redis benchmark performs SET RPC calls with 5 concurrent
clients, measures the latency that each call takes to execute, and reports
the median (scaled by 250,000 to fit the graph’s axis); this workload is
heavily bounded by system call performance due to high network stack usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-04-28-systrap-sample-workloads.svg&quot; alt=&quot;Figure 4&quot; title=&quot;Comparison of sample workloads running on ptrace, Systrap, and native Linux.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Systrap will replace the ptrace platform by September 2023 and become the
default. Until then, we are working really hard to make it production-ready,
which includes working on additional performance and stability improvements, and
making sure we maintain a high bar for security through targeted fuzz-testing
for Systrap specifically.&lt;/p&gt;

&lt;p&gt;In the meantime, we would like gVisor users to try it out, and give us feedback!
If you run gVisor using ptrace today (either by specifying &lt;code class=&quot;highlighter-rouge&quot;&gt;--platform ptrace&lt;/code&gt;
or not specifying the &lt;code class=&quot;highlighter-rouge&quot;&gt;--platform&lt;/code&gt; flag at all), or you use the KVM platform with
nested virtualization, switching to Systrap should be a drop-in performance
upgrade. All you have to do is specify &lt;code class=&quot;highlighter-rouge&quot;&gt;--platform systrap&lt;/code&gt; to runsc. If you
encounter any issues, please let us know at
&lt;a href=&quot;https://github.com/google/gvisor/issues&quot;&gt;gvisor.dev/issues&lt;/a&gt;.
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Even if the sandbox itself is compromised, it will still be bound by
several defense-in-depth layers, including a restricted set of seccomp
filters. You can find more details here:
&lt;a href=&quot;https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/&quot;&gt;https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Once the system call has been intercepted by gVisor (or in the case of
Linux, once the process has entered kernel-mode), actually executing the
getpid system call itself is very fast, so this benchmark effectively
measures single-thread syscall-interception overhead. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Seccomp filters are known to have a “not insubstantial” overhead:
&lt;a href=&quot;https://lwn.net/Articles/656307/&quot;&gt;https://lwn.net/Articles/656307/&lt;/a&gt;. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;On the x86_64 architecture. ARM does not have this optimization as of the
time of writing. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>bogomolov</name></author><summary type="html">We are releasing a new gVisor platform: Systrap. Like the existing ptrace platform, Systrap runs on most Linux machines out of the box without virtualization. Unlike the ptrace platform, it’s fast 🚀. Go try it by adding --platform=systrap to the runsc flags. If you want to know more about it, read on.</summary></entry><entry><title type="html">How we Eliminated 99% of gVisor Networking Memory Allocations with Enhanced Buffer Pooling</title><link href="/blog/2022/10/24/buffer-pooling/" rel="alternate" type="text/html" title=" How we Eliminated 99% of gVisor Networking Memory Allocations with Enhanced Buffer Pooling" /><published>2022-10-24T00:00:00-05:00</published><updated>2022-10-24T00:00:00-05:00</updated><id>/blog/2022/10/24/buffer-pooling</id><content type="html" xml:base="/blog/2022/10/24/buffer-pooling/">&lt;p&gt;In an
&lt;a href=&quot;https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/&quot;&gt;earlier blog post&lt;/a&gt;
about networking security, we described how and why gVisor implements its own
userspace network stack in the Sentry (gVisor kernel). In summary, we’ve
implemented our networking stack – aka Netstack – in Go to minimize exposure to
unsafe code and avoid using an unsafe Foreign Function Interface. With Netstack,
gVisor can do all packet processing internally and only has to enable a few host
I/O syscalls for near-complete networking capabilities. This keeps gVisor’s
exposure to host vulnerabilities as narrow as possible.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;Although writing Netstack in Go was important for runtime safety, up until now
it had an undeniable performance cost. iperf benchmarks showed Netstack was
spending 20-30% of its processing time allocating memory and pausing for
garbage collection, a slowdown that limited gVisor’s ability to efficiently
sandbox networking workloads. In this blog we will show how we crafted a cure
for Netstack’s allocation addiction, reducing allocations by 99% while also
increasing gVisor networking throughput by more than 30%.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2022-10-24-buffer-pooling-figure1.png&quot; alt=&quot;Figure 1&quot; title=&quot;Buffer pooling results.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;a-waste-management-problem&quot;&gt;A Waste Management Problem&lt;/h2&gt;

&lt;p&gt;Go guarantees a basic level of memory safety through the use of a garbage
collector (GC), which is described in great detail by the Go team
&lt;a href=&quot;https://tip.golang.org/doc/gc-guide&quot;&gt;here&lt;/a&gt;. The Go runtime automatically tracks
and frees objects allocated from the heap, relieving the programmer of the often
painful and error-prone process of manual memory management. Unfortunately,
tracking and freeing memory during runtime comes at a performance cost. Running
the GC adds scheduling overhead, consumes valuable CPU time, and occasionally
pauses the entire program’s progress to track down garbage.&lt;/p&gt;

&lt;p&gt;Go’s GC is highly optimized, tunable, and sufficient for a majority of
workloads. Most of the other parts of gVisor happily use Go’s GC with no
complaints. However, under high network stress, Netstack needed to aggressively
allocate buffers used for processing TCP/IP data and metadata. These buffers
often had short lifespans, and once the processing was done they were left to be
cleaned up by the GC. This meant Netstack was producing tons of garbage that
needed to be tracked and freed by GC workers.&lt;/p&gt;

&lt;h2 id=&quot;recycling-to-the-rescue&quot;&gt;Recycling to the Rescue&lt;/h2&gt;

&lt;p&gt;Luckily, we weren’t the only ones with this problem. This pattern of small,
frequently allocated and discarded objects was common enough that the Go team
introduced &lt;a href=&quot;https://pkg.go.dev/sync#Pool&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt;&lt;/a&gt; in Go 1.3. &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; is
designed to relieve pressure on the Go GC by maintaining a thread-safe cache of
previously allocated objects. &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; can retrieve an object from the cache
if it exists or allocate a new one according to a user specified allocation
function. Once the user is finished with an object they can safely return it to
the cache to be reused again.&lt;/p&gt;
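
&lt;p&gt;As a minimal sketch of that get, use, and put cycle (not Netstack code; the
names &lt;code class=&quot;highlighter-rouge&quot;&gt;bufPool&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;render&lt;/code&gt; are made up for this example):&lt;/p&gt;

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool caches *bytes.Buffer values. New is only called when the
// pool has nothing to hand out.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render formats a greeting using a pooled scratch buffer.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf) // hand the buffer back for reuse
	buf.Reset()            // a recycled buffer may still hold old data
	fmt.Fprintf(buf, "hello, %s", name)
	return buf.String() // copies the bytes out before the buffer is returned
}

func main() {
	fmt.Println(render("gVisor"))
}
```

&lt;p&gt;Note that the caller has to know exactly when it is done with the object,
which is the crux of the challenges described next.&lt;/p&gt;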

&lt;p&gt;While &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; was exactly what we needed to reduce allocations,
incorporating it into Netstack wasn’t going to be as easy as just replacing all
our &lt;code class=&quot;highlighter-rouge&quot;&gt;make()&lt;/code&gt;s with &lt;code class=&quot;highlighter-rouge&quot;&gt;pool.Get()&lt;/code&gt;s.&lt;/p&gt;

&lt;h2 id=&quot;netstack-challenges&quot;&gt;Netstack Challenges&lt;/h2&gt;

&lt;p&gt;Netstack uses a few different types of buffers under the hood. Some of these are
specific to protocols, like
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/tcpip/transport/tcp/segment.go&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;segment&lt;/code&gt;&lt;/a&gt;
for TCP, and others are more widely shared, like
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/tcpip/stack/packet_buffer.go&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;PacketBuffer&lt;/code&gt;&lt;/a&gt;,
which is used for IP, ICMP, UDP, etc. Although each of these buffer types is
slightly different, they generally share a few common traits that made it
difficult to use &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; out of the box:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The buffers were originally built with the assumption that a garbage
collector would clean them up automatically – there was little (if any)
effort put into tracking object lifetimes. This meant that we had no way to
know when it was safe to return buffers to a pool.&lt;/li&gt;
  &lt;li&gt;Buffers have dynamic sizes that are determined during creation, usually
depending on the size of the packet holding them. A &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; out of the
box can only accommodate buffers of a single size. One common solution to
this is to fill a pool with
&lt;a href=&quot;https://pkg.go.dev/bytes#Buffer&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;bytes.Buffer&lt;/code&gt;&lt;/a&gt;, but even a pooled
&lt;code class=&quot;highlighter-rouge&quot;&gt;bytes.Buffer&lt;/code&gt; could incur allocations if it were too small and had to be
grown to the requested size.&lt;/li&gt;
  &lt;li&gt;Netstack splits, merges, and clones buffers at various points during
processing (for example, breaking a large segment into smaller MTU-sized
packets). Modifying a buffer’s size during runtime could mean lots of
reallocating from the pool in a one-size-fits-all setup. This would limit
the theoretical effectiveness of a pooled solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed an efficient, low-level buffer abstraction that had answers for the
Netstack specific challenges and could be shared by the various intermediate
buffer types. By sharing a common buffer abstraction, we could maximize the
benefits of pooling and avoid introducing additional allocations while minimally
changing any intermediate buffer processing logic.&lt;/p&gt;

&lt;h2 id=&quot;introducing-bufferv2&quot;&gt;Introducing bufferv2&lt;/h2&gt;

&lt;p&gt;Our solution was
&lt;a href=&quot;https://github.com/google/gvisor/tree/1ceb81454444981448ad57612139adfc0def1b85/pkg/bufferv2&quot;&gt;bufferv2&lt;/a&gt;.
Bufferv2 is a non-contiguous, reference counted, pooled, copy-on-write,
buffer-like data structure.&lt;/p&gt;

&lt;p&gt;Internally, a bufferv2 &lt;code class=&quot;highlighter-rouge&quot;&gt;Buffer&lt;/code&gt; is a linked list of &lt;code class=&quot;highlighter-rouge&quot;&gt;View&lt;/code&gt;s. Each &lt;code class=&quot;highlighter-rouge&quot;&gt;View&lt;/code&gt; has
start/end indices and holds a pointer to a &lt;code class=&quot;highlighter-rouge&quot;&gt;Chunk&lt;/code&gt;. A &lt;code class=&quot;highlighter-rouge&quot;&gt;Chunk&lt;/code&gt; is a
reference-counted structure that’s allocated from a pool and holds data in a
byte slice. There are several &lt;code class=&quot;highlighter-rouge&quot;&gt;Chunk&lt;/code&gt; pools, each of which allocates chunks with
different sized byte slices. These sizes start at 64 bytes and double up to 64 KiB.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2022-10-24-buffer-pooling-figure2.png&quot; alt=&quot;Figure 2&quot; title=&quot;bufferv2 implementation diagram.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
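
&lt;p&gt;The size-class idea can be sketched roughly as follows. This is a simplified
illustration rather than the bufferv2 implementation: the &lt;code class=&quot;highlighter-rouge&quot;&gt;chunk&lt;/code&gt; type,
&lt;code class=&quot;highlighter-rouge&quot;&gt;getChunk&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;putChunk&lt;/code&gt; are hypothetical, reference counting is left out,
and the fallback for oversized requests is an assumption of this sketch.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// chunk is a hypothetical stand-in for a pooled holder of a byte slice.
type chunk struct {
	data []byte
}

const numClasses = 11 // 64, 128, 256, ..., 65536 bytes

var (
	classSizes []int        // ascending chunk sizes, one per pool
	classPools []*sync.Pool // classPools[i] hands out chunks of classSizes[i]
)

func init() {
	size := 64
	for i := 0; i != numClasses; i++ {
		n := size // per-iteration copy captured by the closure below
		p := new(sync.Pool)
		p.New = func() any {
			c := new(chunk)
			c.data = make([]byte, n)
			return c
		}
		classSizes = append(classSizes, n)
		classPools = append(classPools, p)
		size *= 2
	}
}

// getChunk returns a chunk from the smallest size class that fits size bytes.
func getChunk(size int) *chunk {
	// SearchInts returns the first index whose class size is at least size.
	i := sort.SearchInts(classSizes, size)
	if i == len(classSizes) {
		// Larger than the biggest class: fall back to a plain allocation.
		c := new(chunk)
		c.data = make([]byte, size)
		return c
	}
	return classPools[i].Get().(*chunk)
}

// putChunk returns a chunk to the pool that matches its size exactly.
// Chunks that do not belong to any class are left to the GC.
func putChunk(c *chunk) {
	i := sort.SearchInts(classSizes, len(c.data))
	if i == len(classSizes) || classSizes[i] != len(c.data) {
		return
	}
	classPools[i].Put(c)
}

func main() {
	c := getChunk(100)
	fmt.Println(len(c.data)) // rounded up to the 128-byte class
	putChunk(c)
}
```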

&lt;p&gt;The design of bufferv2 has a few key advantages over simpler object pooling:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Zero-cost copies and copy-on-write&lt;/strong&gt;: Cloning a Buffer only increments the
reference count of the underlying chunks instead of reallocating from the
pool. Since buffers are much more frequently read than modified, this saves
allocations. In the cases where a buffer is modified, only the chunk that’s
changed has to be cloned, not the whole buffer.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fast buffer transformations&lt;/strong&gt;: Truncating and merging buffers or appending
and prepending Views to Buffers are fast operations. Thanks to the
non-contiguous memory structure these operations are usually as quick as
adding a node to a linked list or changing the indices in a View.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tiered pools&lt;/strong&gt;: When growing a Buffer or appending data, the new chunks
come from different pools of previously allocated chunks. Using multiple
pools means we are flexible enough to efficiently accommodate packets of all
sizes with minimal overhead. Unlike a one-size-fits-all solution, we don’t
have to waste lots of space with a chunk size that is too big or loop
forever allocating small chunks.&lt;/li&gt;
&lt;/ul&gt;
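
&lt;p&gt;To illustrate the first point, here is a deliberately small copy-on-write
sketch. It assumes a single chunk per view and a plain, non-atomic reference
count, so it is not safe for concurrent use and is not how bufferv2 itself is
written; all type and method names are hypothetical.&lt;/p&gt;

```go
package main

import "fmt"

// chunk holds the bytes plus a count of the views that share them.
type chunk struct {
	refs int
	data []byte
}

// view is a handle onto a chunk.
type view struct {
	c *chunk
}

// newChunk copies data into a fresh chunk owned by one view.
func newChunk(data []byte) *chunk {
	c := new(chunk)
	c.refs = 1
	c.data = append([]byte(nil), data...)
	return c
}

func newView(data []byte) view { return view{c: newChunk(data)} }

// clone shares the underlying chunk; only the reference count changes.
func (v view) clone() view {
	v.c.refs++
	return v
}

// setByte writes one byte, copying the chunk first if it is shared.
func (v *view) setByte(i int, b byte) {
	if v.c.refs != 1 {
		v.c.refs--               // drop our reference to the shared chunk
		v.c = newChunk(v.c.data) // and take a private copy
	}
	v.c.data[i] = b
}

func main() {
	a := newView([]byte("abc"))
	b := a.clone()    // no bytes copied here
	b.setByte(0, 'X') // the copy happens on first write
	fmt.Println(string(a.c.data), string(b.c.data))
}
```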

&lt;h2 id=&quot;trade-offs&quot;&gt;Trade-offs&lt;/h2&gt;

&lt;p&gt;Shifting Netstack to bufferv2 came with some costs. To start, rewriting all
buffers to use bufferv2 was a sizable effort that took many months to fully roll
out. Any place in Netstack that allocated or used a byte slice needed to be
rewritten. Reference counting had to be introduced so all the aforementioned
intermediate buffer types (&lt;code class=&quot;highlighter-rouge&quot;&gt;PacketBuffer&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;segment&lt;/code&gt;, etc.) could accurately
track buffer lifetimes, and tests had to be modified to ensure reference
counting correctness.&lt;/p&gt;

&lt;p&gt;In addition to the upfront cost, the shift to bufferv2 also increased the
engineering complexity of future Netstack changes. Netstack contributors must
adhere to new rules to maintain memory safety and maximize the benefits of
pooling. These rules are strict – there needs to be strong justification to
break them. They are as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Never allocate a byte slice; always use &lt;code class=&quot;highlighter-rouge&quot;&gt;NewView()&lt;/code&gt; instead.&lt;/li&gt;
  &lt;li&gt;Use a &lt;code class=&quot;highlighter-rouge&quot;&gt;View&lt;/code&gt; for simple data operations (e.g. writing some data of a fixed
size) and a &lt;code class=&quot;highlighter-rouge&quot;&gt;Buffer&lt;/code&gt; for more complex I/O operations (e.g. appending data of
variable size, merging data, writing from an &lt;code class=&quot;highlighter-rouge&quot;&gt;io.Reader&lt;/code&gt;).&lt;/li&gt;
  &lt;li&gt;If you need access to the contents of a &lt;code class=&quot;highlighter-rouge&quot;&gt;View&lt;/code&gt; as a byte slice, use
&lt;code class=&quot;highlighter-rouge&quot;&gt;View.AsSlice()&lt;/code&gt;. If you need access to the contents of a &lt;code class=&quot;highlighter-rouge&quot;&gt;Buffer&lt;/code&gt; as a byte
slice, consider refactoring, as this will cause an allocation.&lt;/li&gt;
  &lt;li&gt;Never write to or modify the slices returned by &lt;code class=&quot;highlighter-rouge&quot;&gt;View.AsSlice()&lt;/code&gt;; they are
still owned by the view.&lt;/li&gt;
  &lt;li&gt;Release bufferv2 objects as close to where they’re created as possible. This
is usually most easily done with &lt;code class=&quot;highlighter-rouge&quot;&gt;defer&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Document function ownership of bufferv2 object parameters. If there is no
documentation, it is assumed that the function does not take ownership of
its parameters.&lt;/li&gt;
  &lt;li&gt;If a function takes ownership of its bufferv2 parameters, the bufferv2
objects must be cloned before passing them as arguments.&lt;/li&gt;
  &lt;li&gt;All new Netstack tests must enable the leak checker and run a final leak
check after the test is complete.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;give-it-a-try&quot;&gt;Give it a Try&lt;/h2&gt;

&lt;p&gt;Bufferv2 is enabled by default as of
&lt;a href=&quot;https://github.com/google/gvisor/releases/tag/release-20221017.0&quot;&gt;gVisor 20221017&lt;/a&gt;,
and will be rolling out to
&lt;a href=&quot;https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods&quot;&gt;GKE Sandbox&lt;/a&gt;
soon, so no action is required to see a performance boost. Network-bound
workloads, such as web servers or databases like Redis, are the most likely to
see benefits. All the code implementing bufferv2 is public
&lt;a href=&quot;https://github.com/google/gvisor/tree/master/pkg/bufferv2&quot;&gt;here&lt;/a&gt;, and
contributions are welcome! If you’d like to run the iperf benchmark for
yourself, you can run:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;make run-benchmark BENCHMARKS_TARGETS=//test/benchmarks/network:iperf_test \
  RUNTIME=your-runtime-here BENCHMARKS_OPTIONS=-test.benchtime=60s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;in the base gVisor directory. If you experience any issues, please feel free to
let us know at &lt;a href=&quot;https://github.com/google/gvisor/issues&quot;&gt;gvisor.dev/issues&lt;/a&gt;.&lt;/p&gt;</content><author><name>lucasmanning</name></author><summary type="html">In an earlier blog post about networking security, we described how and why gVisor implements its own userspace network stack in the Sentry (gVisor kernel). In summary, we’ve implemented our networking stack – aka Netstack – in Go to minimize exposure to unsafe code and avoid using an unsafe Foreign Function Interface. With Netstack, gVisor can do all packet processing internally and only has to enable a few host I/O syscalls for near-complete networking capabilities. This keeps gVisor’s exposure to host vulnerabilities as narrow as possible.</summary></entry><entry><title type="html">Threat Detection in gVisor</title><link href="/blog/2022/08/01/threat-detection/" rel="alternate" type="text/html" title=" Threat Detection in gVisor" /><published>2022-08-31T00:00:00-05:00</published><updated>2022-08-31T00:00:00-05:00</updated><id>/blog/2022/08/01/threat-detection</id><content type="html" xml:base="/blog/2022/08/01/threat-detection/">&lt;p&gt;gVisor helps users secure their infrastructure by running containers in a
dedicated kernel that is isolated from the host. But wouldn’t it be nice if you
could tell when someone attempts to break out? Or get an early warning that your
web server might have been compromised? Now you can do it with gVisor! We are
pleased to announce support for &lt;strong&gt;runtime monitoring&lt;/strong&gt;. Runtime monitoring
provides the ability for an external process to observe application behavior and
detect threats at runtime. Using this mechanism, gVisor users can watch actions
performed by the container and generate alerts when something unexpected occurs.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;A monitoring process can connect to the gVisor sandbox and receive a stream of
actions that the application is performing. The monitoring process decides what
actions are allowed and what steps to take based on policies for the given
application. gVisor communicates with the monitoring process via a simple
protocol based on
&lt;a href=&quot;https://developers.google.com/protocol-buffers&quot;&gt;Protocol Buffers&lt;/a&gt;, which is the
basis for &lt;a href=&quot;https://grpc.io/&quot;&gt;gRPC&lt;/a&gt; and is well supported in several languages.
The monitoring process runs isolated from the application inside the sandbox for
security reasons, and can be shared among all sandboxes running on the same
machine to save resources. Trace points can be individually configured when
creating a tracing session to capture only what’s needed.&lt;/p&gt;

&lt;p&gt;Let’s go over a simple example of a web server that gets compromised while being
monitored. The web server can execute files from &lt;code class=&quot;highlighter-rouge&quot;&gt;/bin&lt;/code&gt;, read files from &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc&lt;/code&gt;
and &lt;code class=&quot;highlighter-rouge&quot;&gt;/html&lt;/code&gt; directories, create files under &lt;code class=&quot;highlighter-rouge&quot;&gt;/tmp&lt;/code&gt;, etc. All these actions are
reported to a monitoring process which analyzes them and deems them normal
application behavior. Now suppose that an attacker takes control of the web
server and starts executing code inside the container. The attacker writes a
script under &lt;code class=&quot;highlighter-rouge&quot;&gt;/tmp&lt;/code&gt; and, in an attempt to make it executable, runs &lt;code class=&quot;highlighter-rouge&quot;&gt;chmod u+x
/tmp/exploit.sh&lt;/code&gt;. The monitoring process determines that making a file
executable is not expected in the normal web server execution and raises an
alert to the security team for investigation. Additionally, it can also decide
to kill the container and stop the attacker from making more progress.&lt;/p&gt;
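
&lt;p&gt;The decision logic of such a monitoring process could look roughly like the
sketch below. It is only an illustration of the policy idea: the &lt;code class=&quot;highlighter-rouge&quot;&gt;event&lt;/code&gt;
type, the action names, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;allowed&lt;/code&gt; table are all hypothetical and do
not reflect gVisor’s actual trace point messages.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// event is a hypothetical, simplified record of one container action.
type event struct {
	action string // for example "exec", "read", "create", or "chmod"
	path   string
}

// allowed maps each permitted action to the path prefixes it may touch.
var allowed = map[string][]string{
	"exec":   {"/bin/"},
	"read":   {"/etc/", "/html/"},
	"create": {"/tmp/"},
}

// isExpected reports whether ev matches the web server's normal behavior.
// Actions missing from the table, such as chmod, are never expected.
func isExpected(ev event) bool {
	for _, prefix := range allowed[ev.action] {
		if strings.HasPrefix(ev.path, prefix) {
			return true
		}
	}
	return false
}

func main() {
	events := []event{
		{"read", "/etc/hosts"},
		{"create", "/tmp/exploit.sh"},
		{"chmod", "/tmp/exploit.sh"},
	}
	for _, ev := range events {
		if !isExpected(ev) {
			fmt.Println("ALERT: unexpected", ev.action, "on", ev.path)
		}
	}
}
```

&lt;p&gt;In this sketch only the &lt;code class=&quot;highlighter-rouge&quot;&gt;chmod&lt;/code&gt; event raises an alert, mirroring the
scenario described above.&lt;/p&gt;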

&lt;h2 id=&quot;falco&quot;&gt;Falco&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://falco.org/&quot;&gt;Falco&lt;/a&gt; is an Open Source Cloud Native Security monitor that
detects threats at runtime by observing the behavior of your applications and
containers. Falco
&lt;a href=&quot;https://falco.org/blog/falco-0-32-1/&quot;&gt;supports monitoring applications running inside gVisor&lt;/a&gt;.
All the Falco rules and tooling work seamlessly with gVisor. You can use
&lt;a href=&quot;https://gvisor.dev/docs/tutorials/falco/&quot;&gt;this tutorial&lt;/a&gt; to learn how to
configure Falco and gVisor together. More information can be found on the
&lt;a href=&quot;https://falco.org/blog/intro-gvisor-falco/&quot;&gt;Falco blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next?&lt;/h2&gt;

&lt;p&gt;We’re looking for more projects to take advantage of the runtime monitoring
system and the visibility that it provides into the sandbox. There are a few
unique capabilities provided by the system that make it easy to monitor
applications inside gVisor, like resolving file descriptors to full paths,
providing container ID with traces, separating processes that were exec’ed into
the container, internal procfs state access, and many more.&lt;/p&gt;

&lt;p&gt;If you would like to explore it further, there is a
&lt;a href=&quot;https://docs.google.com/document/d/1RQQKzeFpO-zOoBHZLA-tr5Ed_bvAOLDqgGgKhqUff2A&quot;&gt;design document&lt;/a&gt;
and
&lt;a href=&quot;https://github.com/google/gvisor/tree/master/pkg/sentry/seccheck/README.md&quot;&gt;documentation&lt;/a&gt;
with more details about the configuration and communication protocol. In
addition, the &lt;a href=&quot;https://gvisor.dev/docs/tutorials/falco/&quot;&gt;tutorial using Falco&lt;/a&gt;
is a great way to see it in action.&lt;/p&gt;

&lt;p&gt;We would like to thank &lt;a href=&quot;https://github.com/LucaGuerra&quot;&gt;Luca Guerra&lt;/a&gt;,
&lt;a href=&quot;https://github.com/loresuso&quot;&gt;Lorenzo Susini&lt;/a&gt;, and the Falco team for their
support while building this feature.&lt;/p&gt;</content><author><name>fvoznika</name></author><summary type="html">gVisor helps users secure their infrastructure by running containers in a dedicated kernel that is isolated from the host. But wouldn’t it be nice if you could tell when someone attempts to break out? Or get an early warning that your web server might have been compromised? Now you can do it with gVisor! We are pleased to announce support for runtime monitoring. Runtime monitoring provides the ability for an external process to observe application behavior and detect threats at runtime. Using this mechanism, gVisor users can watch actions performed by the container and generate alerts when something unexpected occurs.</summary></entry><entry><title type="html">Running gVisor in Production at Scale in Ant</title><link href="/blog/2021/12/02/running-gvisor-in-production-at-scale-in-ant/" rel="alternate" type="text/html" title=" Running gVisor in Production at Scale in Ant" /><published>2021-12-02T00:00:00-06:00</published><updated>2021-12-02T00:00:00-06:00</updated><id>/blog/2021/12/02/running-gvisor-in-production-at-scale-in-ant</id><content type="html" xml:base="/blog/2021/12/02/running-gvisor-in-production-at-scale-in-ant/">&lt;blockquote&gt;
  &lt;p&gt;This post was contributed by &lt;a href=&quot;https://www.antgroup.com/&quot;&gt;Ant Group&lt;/a&gt;, a
large-scale digital payment platform. Jianfeng and Yong are engineers at Ant
Group working on infrastructure systems, and contributors to gVisor.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At Ant Group, we are committed to keeping online transactions safe and
efficient. Continuously improving security against potential system-level
attacks is one of the many measures we take. As a container runtime, gVisor
provides container-native security without sacrificing resource efficiency.
Therefore, it has been on our radar since it was released.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;However, there have been performance concerns raised by members of
&lt;a href=&quot;https://www.usenix.org/system/files/hotcloud19-paper-young.pdf&quot;&gt;academia&lt;/a&gt; and
&lt;a href=&quot;https://news.ycombinator.com/item?id=19924036&quot;&gt;industry&lt;/a&gt;. Users of gVisor tend
to accept the extra overhead as a tax paid for security. But we tend to agree that
&lt;a href=&quot;https://sel4.systems/About/seL4-whitepaper.pdf&quot;&gt;security is no excuse for poor performance (See Chapter 6!)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this article, we present how we identified bottlenecks in gVisor and
unblocked large-scale production adoption. Our main focus is the CPU
utilization and latency overhead that gVisor brings. A small memory footprint is
also a valued goal, but it is not discussed in this blog. As a result of these
efforts and community improvements, 70% of our applications running on runsc
have &amp;lt;1% overhead; another 25% have &amp;lt;3% overhead. Some of our most valued
applications were the focus of our optimization, and they perform even better
than they do on runc.&lt;/p&gt;

&lt;p&gt;The rest of this blog is organized as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;First, we analyze the cost of different syscall paths in gVisor.&lt;/li&gt;
  &lt;li&gt;Then, we propose a way to profile a whole running instance to find out
whether slow syscall paths are encountered, and discuss some invisible overhead
in the Go runtime.&lt;/li&gt;
  &lt;li&gt;Finally, we give a short summary of the performance optimization, along
with some other factors in production adoption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For convenience, the discussion targets KVM-based (hypervisor-based)
platforms unless explicitly stated otherwise.&lt;/p&gt;

&lt;h2 id=&quot;cost-of-different-syscall-paths&quot;&gt;Cost of different syscall paths&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;../../../../2019/11/18/gvisor-security-basics-part-1/#defense-in-depth&quot;&gt;Defense-in-depth&lt;/a&gt;
is the key design principle of gVisor. In gVisor, different syscalls take
different paths, whose costs in latency and CPU consumption differ by orders of
magnitude. Here are the syscall paths in gVisor.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2021-12-02-syscall-figure1.png&quot; alt=&quot;Figure 1&quot; title=&quot;Sentry syscall paths.&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;path-1-user-space-vdso&quot;&gt;Path 1: User-space vDSO&lt;/h3&gt;

&lt;p&gt;Sentry provides a
&lt;a href=&quot;https://github.com/google/gvisor/tree/master/vdso&quot;&gt;vDSO library&lt;/a&gt; for its
sandboxed processes. Several syscalls are short-circuited and implemented in
user space. These syscalls cost almost the same as on native Linux. But note
that the vDSO library is only partially implemented. We once noticed that some
&lt;a href=&quot;https://github.com/google/gvisor/issues/3101&quot;&gt;syscalls&lt;/a&gt; in our environment were
not properly terminated in user space. We added some implementations to the
vDSO, and aim to push these improvements upstream when possible.&lt;/p&gt;

&lt;h3 id=&quot;path-2-sentry-contained&quot;&gt;Path 2: Sentry contained&lt;/h3&gt;

&lt;p&gt;Most syscalls, e.g., &lt;code&gt;clone(2)&lt;/code&gt;, are implemented in Sentry. They
cover the basic abstractions of an operating system, such as process/thread
lifecycle, scheduling, IPC, memory management, etc. These syscalls, and those on
all the paths below, incur the structural cost of syscall interception. That
overhead is about 800ns, while a native syscall costs about 70ns. We’ll dig into
it further below. Syscalls of this kind take several microseconds, which is
competitive with the corresponding native Linux syscalls.&lt;/p&gt;

&lt;h3 id=&quot;path-3-host-kernel-involved&quot;&gt;Path 3: Host-kernel involved&lt;/h3&gt;

&lt;p&gt;Some resource-related syscalls, e.g., read/write, are redirected into the host
kernel. Note that gVisor never passes application syscalls directly through to
the host kernel, for functional and security reasons. So, compared to native
Linux, the time spent in Sentry is extra overhead. Another overhead is the way a
host kernel syscall is called. Let’s use the KVM platform on x86_64 as an
example. After Sentry issues the syscall instruction, if it is in GR0 (guest
ring 0), it first goes to the syscall entrypoint defined in LSTAR, then halts to
HR3 (host ring 3; a vmexit happens here), exits from a signal handler, and
executes the syscall instruction again. We can save the “Halt to HR3” by
introducing vmcall here, but there is still a syscall trampoline, and the
vmexit/vmentry overhead is not trivial. Nevertheless, this overhead is not that
significant.&lt;/p&gt;

&lt;p&gt;For some Sentry-contained syscalls in Path 2, although the syscall semantics
are terminated in Sentry, the syscall may introduce one or more unexpected exits
to the host kernel. It could be a page fault while Sentry runs or, more likely,
a scheduling event in the Go runtime, e.g., M idle/wakeup. An example at hand is
that &lt;code&gt;futex(FUTEX_WAIT)&lt;/code&gt; and &lt;code&gt;epoll_wait(2)&lt;/code&gt; could lead to
M idle and a further futex call into the host kernel if the M does not find any
runnable Gs. (See the comments in https://go.dev/src/runtime/proc.go for further
explanation about the Go scheduler).&lt;/p&gt;

&lt;h3 id=&quot;path-4-gofer-involved&quot;&gt;Path 4: Gofer involved&lt;/h3&gt;

&lt;p&gt;Other IO-related syscalls, especially security-sensitive ones, go through
another layer of protection: Gofer. Such a syscall usually involves one or more
Sentry/Gofer inter-process communications. Even with the recent optimization
that uses lisafs to supersede P9, it is still the slowest path, which we should
try our best to avoid.&lt;/p&gt;

&lt;p&gt;As shown above, some syscall paths are slow by design, and should be identified
and reduced as much as possible. Let’s defer that to the next section and first
dig into the details of the structural and implementation-specific costs of
syscalls, because the performance of some Sentry-contained syscalls is not good
enough.&lt;/p&gt;

&lt;h3 id=&quot;the-structural-cost&quot;&gt;The structural cost&lt;/h3&gt;

&lt;p&gt;The first kind of cost is comparatively stable and is introduced by syscall
interception. It is platform-specific, depending on how syscalls are
intercepted. Whether this cost matters also depends on the syscall rate of the
sandboxed applications.&lt;/p&gt;

&lt;p&gt;Here’s the benchmark result on the structural cost of a syscall. We got the
data on an Intel(R) Xeon(R) CPU E5-2650 v2 platform, using the
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/perf/linux/getpid_benchmark.cc&quot;&gt;getpid benchmark&lt;/a&gt;.
As we can see, on the KVM platform, syscall interception costs more than 10x as
much as a native Linux syscall.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Platform&lt;/th&gt;
      &lt;th&gt;getpid benchmark (ns)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Native&lt;/td&gt;
      &lt;td&gt;62&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Native-KPTI&lt;/td&gt;
      &lt;td&gt;236&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;runsc-KVM&lt;/td&gt;
      &lt;td&gt;830&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;runsc-ptrace&lt;/td&gt;
      &lt;td&gt;6249&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;* “Native” stands for using a vanilla Linux kernel.&lt;/p&gt;

&lt;p&gt;To understand the structural cost of syscall interception, we did a
&lt;a href=&quot;https://github.com/google/gvisor/issues/2354&quot;&gt;quantitative analysis&lt;/a&gt; on the KVM
platform. According to the analysis, the overhead mainly comes from:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;KPTI-like CR3 switches: to maintain the address equation of Sentry running
in HR3 and GR0, it has to switch the CR3 register twice on each user/kernel
switch;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Platform’s Switch(): Linux is very efficient here, as it just switches to
a per-thread kernel stack and calls the corresponding syscall entry
function. In Sentry, however, each task is represented by a goroutine; before
calling into syscall entry functions, it needs to pop the stack to recover
the big while loop, i.e., kernel.(*Task).run.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Can we save the structural cost of syscall interception? This cost is actually
by design. We can reduce it, for example by avoiding allocation and map
operations in the switch process, but it cannot be eliminated.&lt;/p&gt;

&lt;p&gt;Does the structural cost of syscall interception really matter? It depends on
the syscall rate. First, most applications in our case have a syscall rate &amp;lt;
200K/sec, and according to flame graphs (which will be described later in this
blog), we see 2~3% of samples in the switch. Second, most syscalls, except those
as simple as &lt;code&gt;getpid(2)&lt;/code&gt;, take several microseconds, so in proportion
it is not a significant overhead. However, if you have an elephant RPC (one that
involves many DB accesses), or a service served by a long, snake-like RPC chain,
this brings nontrivial overhead on latency.&lt;/p&gt;
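
&lt;p&gt;As a rough illustration of that last point, the sketch below takes the
per-syscall getpid numbers from the table above and multiplies the extra cost by
the number of syscalls on a request’s critical path. The per-request syscall
counts are hypothetical values chosen for illustration, not measurements.&lt;/p&gt;

```python
# Per-syscall getpid latency in nanoseconds, taken from the table above.
NATIVE_NS = 62
RUNSC_KVM_NS = 830
RUNSC_PTRACE_NS = 6249


def slowdown(platform_ns: int, baseline_ns: int = NATIVE_NS) -> float:
    """Return how many times slower a platform is than the baseline."""
    return platform_ns / baseline_ns


def added_latency_us(syscalls_per_request: int, platform_ns: int = RUNSC_KVM_NS) -> float:
    """Extra latency (microseconds) that interception adds to one request."""
    extra_ns = platform_ns - NATIVE_NS
    return syscalls_per_request * extra_ns / 1000


if __name__ == "__main__":
    print(f"KVM slowdown:    {slowdown(RUNSC_KVM_NS):.1f}x")
    print(f"ptrace slowdown: {slowdown(RUNSC_PTRACE_NS):.1f}x")
    # Hypothetical syscall counts on the critical path of a single request.
    for count in (100, 10_000):
        print(f"{count} syscalls add about {added_latency_us(count):.1f} us")
```

&lt;p&gt;Under these assumptions, a request that issues 10,000 syscalls picks up
several milliseconds of extra latency on the KVM platform, while one that issues
100 syscalls stays well below a tenth of a millisecond.&lt;/p&gt;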

&lt;h3 id=&quot;the-implementation-specific-cost&quot;&gt;The implementation-specific cost&lt;/h3&gt;

&lt;p&gt;The other kind of cost is implementation-specific. For example, a syscall path
may involve some heavy malloc operations, or defer may be used in some frequent
syscall paths (defer is optimized in Go 1.14). What’s worse, the application
process may trigger a long-path syscall with the host kernel or Gofer involved.&lt;/p&gt;

&lt;p&gt;When we try to optimize the gVisor runtime, we need information about the
sandboxed applications, POD configurations, and runsc internals. But most people
work either as a platform engineer or as an application engineer, not both. So
we need an easier way to understand the whole picture.&lt;/p&gt;

&lt;h2 id=&quot;performance-profile-of-a-running-instance&quot;&gt;Performance profile of a running instance&lt;/h2&gt;

&lt;p&gt;To quickly understand the whole picture of performance, we need some way to
profile a running gVisor instance. As the gVisor sandbox process is essentially
a Go process, the Go profiling tools are an existing option:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://golang.org/pkg/runtime/pprof/&quot;&gt;Go pprof&lt;/a&gt; - provides CPU and heap
profile through
&lt;a href=&quot;https://gvisor.dev/docs/user_guide/debugging/#profiling&quot;&gt;runsc debug subcommands&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://golang.org/pkg/runtime/trace/&quot;&gt;Go trace&lt;/a&gt; - provides more internal
profile types like synchronization blocking and scheduler latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unfortunately, the above tools only show hot spots in Sentry, instead of the
whole picture (how much time is spent in GR3, guest ring 3, and HR0, host ring
0). Also, the CPU profile relies on the
&lt;a href=&quot;https://golang.org/pkg/runtime/pprof/&quot;&gt;SIGPROF signal&lt;/a&gt;, which may not be
accurate enough.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.linux-kvm.org/page/Perf_events&quot;&gt;perf-kvm&lt;/a&gt; cannot provide what we
need either. It can top/record/stat some information in the guest with the help
of the &lt;code&gt;--guestkallsyms&lt;/code&gt; option, but it cannot analyze the call chain
(which is not supported in the host kernel; see Linux’s perf_callchain_kernel).&lt;/p&gt;

&lt;h3 id=&quot;perf-sandbox-process-like-a-normal-process&quot;&gt;Perf sandbox process like a normal process&lt;/h3&gt;

&lt;p&gt;We then turned to a nice virtual address equation in Sentry: [(GR0 VA) = (HR3
VA)]. This makes sure any pointer in HR3 can be used directly in GR0.&lt;/p&gt;

&lt;p&gt;The equation helps solve this problem: with a little hack on KVM, we can
profile Sentry just like a normal HR3 process.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;First, as said above, Linux does not support analyzing the call chain of a
guest. So we change [is_in_guest] to pretend that the vCPU runs in host mode
even when it is in guest mode. This can be done in
&lt;a href=&quot;https://github.com/torvalds/linux/blob/v4.19/arch/x86/kvm/x86.c#L6560&quot;&gt;kvm_is_in_guest&lt;/a&gt;:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;int kvm_is_in_guest(void)
 {
-       return __this_cpu_read(current_vcpu) != NULL;
+       return 0;
 }
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;Secondly, change how the guest is profiled. Previously, after a PMU counter
overflowed and triggered an NMI, the vCPU was forced to exit to the host and
called [int $2] immediately for later recording. Now, instead of calling [int
$2], we call &lt;strong&gt;do_nmi&lt;/strong&gt; directly with the correct registers (i.e.,
pt_regs):&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;+void (*fn_do_nmi)(struct pt_regs *, long);
+
+#define HIGHER_HALF_CANONICAL_ADDR 0xFFFF800000000000
+
+void make_pt_regs(struct kvm_vcpu *vcpu, struct pt_regs *regs)
+{
+       /* In Sentry GR0, we will use address among
+        *   [HIGHER_HALF_CANONICAL_ADDR, 2^64-1)
+        * when syscall just happens. To avoid conflicting with HR0,
+        * we correct these addresses into HR3 addresses.
+        */
+       regs-&amp;gt;bp = vcpu-&amp;gt;arch.regs[VCPU_REGS_RBP] &amp;amp; ~HIGHER_HALF_CANONICAL_ADDR;
+       regs-&amp;gt;ip = vmcs_readl(GUEST_RIP) &amp;amp; ~HIGHER_HALF_CANONICAL_ADDR;
+       regs-&amp;gt;sp = vmcs_readl(GUEST_RSP) &amp;amp; ~HIGHER_HALF_CANONICAL_ADDR;
+
+       regs-&amp;gt;flags = (vmcs_readl(GUEST_RFLAGS) &amp;amp; 0xFF) |
+                     X86_EFLAGS_IF | 0x2;
+       regs-&amp;gt;cs = __USER_CS;
+       regs-&amp;gt;ss = __USER_DS;
+}
+
 static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
 {
        u32 exit_intr_info;
@@ -8943,7 +8965,14 @@ static void vmx_complete_atomic_exit(struct vcpu_vmx *vmx)
        /* We need to handle NMIs before interrupts are enabled */
        if (is_nmi(exit_intr_info)) {
                kvm_before_handle_nmi(&amp;amp;vmx-&amp;gt;vcpu);
-               asm(&quot;int $2&quot;);
+               if (vmcs_readl(GUEST_RFLAGS) &amp;amp; X86_EFLAGS_IF)
+                       asm(&quot;int $2&quot;);
+               else {
+                       struct pt_regs regs;
+                       memset((void *)&amp;amp;regs, 0, sizeof(regs));
+                       make_pt_regs(&amp;amp;vmx-&amp;gt;vcpu, &amp;amp;regs);
+                       fn_do_nmi(&amp;amp;regs, 0);
+               }
                kvm_after_handle_nmi(&amp;amp;vmx-&amp;gt;vcpu);
        }
 }
@@ -11881,6 +11927,10 @@ static int __init vmx_init(void)
                }
        }

+       fn_do_nmi = (void *) kallsyms_lookup_name(&quot;do_nmi&quot;);
+       if (!fn_do_nmi)
+               printk(KERN_ERR &quot;kvm: lookup do_nmi fail\n&quot;);
+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As shown above, we properly handle samples in GR3 and in the GR0 trampoline.&lt;/p&gt;
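
&lt;p&gt;The address correction in &lt;code&gt;make_pt_regs&lt;/code&gt; can be checked with a few
lines of arithmetic. The sketch below is only an illustration in Python, not
kernel code: the sample addresses and the helper name are made up, and
&lt;code&gt;operator.and_&lt;/code&gt; is simply the function form of the bitwise AND used in
the C patch.&lt;/p&gt;

```python
from operator import and_

# Same constant as in the patch above: the upper 17 bits of a 64-bit address.
HIGHER_HALF_CANONICAL_ADDR = 0xFFFF800000000000


def to_hr3_address(guest_addr: int) -> int:
    """Clear the higher-half bits, mirroring `addr AND NOT mask` in make_pt_regs."""
    return and_(guest_addr, ~HIGHER_HALF_CANONICAL_ADDR)


if __name__ == "__main__":
    # Hypothetical GR0 instruction pointer in the higher half.
    gr0_ip = 0xFFFF800000401000
    print(hex(to_hr3_address(gr0_ip)))
    # An address that is already in the lower half is left unchanged.
    print(hex(to_hr3_address(0x401000)))
```

&lt;p&gt;Both calls yield the same lower-half address, which is what lets perf resolve
the sample against the symbols of the ordinary HR3 process.&lt;/p&gt;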

&lt;h3 id=&quot;an-example-of-profile&quot;&gt;An example of profile&lt;/h3&gt;

&lt;p&gt;First, make sure runsc is compiled with symbols not stripped: &lt;code class=&quot;highlighter-rouge&quot;&gt;bazel build
runsc --strip=never&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As an example, run the command below inside the gVisor container to keep it busy:
&lt;code class=&quot;highlighter-rouge&quot;&gt;stress -i 1 -c 1 -m 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Profile the instance with the command: &lt;code class=&quot;highlighter-rouge&quot;&gt;perf kvm --host --guest record -a -g -e cycles
-G &amp;lt;path/to/cgroup&amp;gt; -- sleep 10 &amp;gt;/dev/null&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Note that we still need to profile the instance with &lt;code class=&quot;highlighter-rouge&quot;&gt;perf kvm&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;--guest&lt;/code&gt;, because
kvm-intel requires this to keep the PMU hardware event enabled in guest mode.&lt;/p&gt;

&lt;p&gt;Then we generate a flame graph using
&lt;a href=&quot;https://github.com/brendangregg/FlameGraph&quot;&gt;Brendan’s tool&lt;/a&gt;, which gives us this
&lt;a href=&quot;https://raw.githubusercontent.com/zhuangel/gvisor/zhuangel_blog/website/blog/blog-kvm-stress.svg&quot;&gt;flame graph&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s roughly divide it to differentiate GR3 and GR0 like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2021-12-02-flamegraph-figure2.png&quot; alt=&quot;Figure 2&quot; title=&quot;Flamegraph of stress.&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;optimize-based-on-flame-graphs&quot;&gt;Optimize based on flame graphs&lt;/h3&gt;

&lt;p&gt;Now we can get clear information like:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The bottleneck syscall(s): the above flame graph shows that &lt;code&gt;sync(2)&lt;/code&gt;
accounts for a relatively large block of samples. If we cannot avoid such
syscalls in user space, they are worth the time to optimize. Some real cases
we found and optimized are: superseding CopyIn/CopyOut with
CopyInBytes/CopyOutBytes to avoid reflection; and avoiding defer in some
frequent syscalls, in which case you can see &lt;code&gt;deferreturn()&lt;/code&gt; in the
flame graph (not needed if you have already upgraded to a newer Go version).
Another optimization: after we found in the flame graph that append writes to
a shared volume spend a lot of time querying the gofer for the current file
length, we proposed adding
&lt;a href=&quot;https://github.com/google/gvisor/issues/1792&quot;&gt;a handle only for append writes&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Whether GC is a real problem: we can barely see any samples related to GC in
this case. But if we do, we can search for &lt;code&gt;mallocgc()&lt;/code&gt; to see
where heap allocation is frequent. We can perform a heap profile to see the
allocated objects. And we can consider adjusting the
&lt;a href=&quot;https://golang.org/pkg/runtime/debug/#SetGCPercent&quot;&gt;GC percent&lt;/a&gt;, 100% by
default, to sacrifice memory for less CPU utilization. We once found that
allocating an object &amp;gt; 32 KB also triggers GC; see
&lt;a href=&quot;https://github.com/google/gvisor/commit/f697d1a33e4e7cefb4164ec977c38ccc2a228099&quot;&gt;this commit&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Percentage of time spent in the GR3 app and Sentry: we can determine whether
it is worth continuing the optimization. If most of the samples are in GR3, then
we had better turn to optimizing the application code instead.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A rather large chunk of samples lies in EPT violations and
&lt;code&gt;fallocate(2)&lt;/code&gt; (into HR0). This is caused by frequent memory
allocation and free. We can either optimize the application to avoid this,
or add a memory buffer layer in memfile management to relieve it.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
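
&lt;p&gt;Questions like these can also be answered numerically from the folded stacks
that &lt;code&gt;stackcollapse-perf.pl&lt;/code&gt; in the FlameGraph repository produces (one
&lt;code&gt;frame;frame;... count&lt;/code&gt; line per unique stack). The sketch below is a
minimal illustration, not part of any gVisor tooling; the helper names, sample
stacks, and frame names in it are made up.&lt;/p&gt;

```python
def parse_folded(lines):
    """Parse folded stack lines into (frames, sample_count) pairs."""
    stacks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The sample count is the last whitespace-separated field.
        stack, count = line.rsplit(" ", 1)
        stacks.append((stack.split(";"), int(count)))
    return stacks


def share_of(stacks, frame_substring):
    """Fraction of all samples whose stack contains a matching frame."""
    total = sum(count for _, count in stacks)
    if total == 0:
        return 0.0
    hits = sum(
        count
        for frames, count in stacks
        if any(frame_substring in frame for frame in frames)
    )
    return hits / total


if __name__ == "__main__":
    # Made-up folded stacks; real input would come from stackcollapse-perf.pl.
    sample = [
        "runsc;kernel.(*Task).run;syscalls.Sync 30",
        "runsc;kernel.(*Task).run;runtime.mallocgc 5",
        "runsc;kernel.(*Task).run;runtime.deferreturn 5",
        "runsc;application_code 60",
    ]
    stacks = parse_folded(sample)
    for name in ("Sync", "mallocgc", "deferreturn"):
        print(f"{name}: {share_of(stacks, name):.0%}")
```

&lt;p&gt;Tracking such shares across builds gives a simple way to confirm that an
optimization actually shrank the block it was aimed at.&lt;/p&gt;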

&lt;p&gt;As a short summary, now we have a tool to get a visible graph of what’s going on
in a running gVisor instance. Unfortunately, we cannot get the details of the
application processes in the above flame graph because of the semantic gap. To
get a flame graph of the application processes, we have prototyped a way in
Sentry. Hopefully, we’ll discuss it in later blogs.&lt;/p&gt;

&lt;p&gt;Such a visual approach is very helpful when we try to optimize a new
application on gVisor. However, there is another kind of overhead that is
invisible, like “dark matter”.&lt;/p&gt;

&lt;h2 id=&quot;invisible-overhead-in-go-runtime&quot;&gt;Invisible overhead in Go runtime&lt;/h2&gt;

&lt;p&gt;Sentry inherits the timer, scheduler, channel, and heap allocator from the Go
runtime. While this saves a lot of code when building a kernel, it also
introduces some unpleasant overhead. The Go runtime, after all, is designed and
massively used for general-purpose Go applications. When it is used as a part of,
or the basis of, a kernel, we must be very careful with the implementation and
overhead of this syntactic sugar.&lt;/p&gt;

&lt;p&gt;Unfortunately, we did not find a universal method to identify this kind of
overhead. The only way seems to be to get your hands dirty with the Go runtime.
We’ll show some examples from our use case.&lt;/p&gt;

&lt;h3 id=&quot;timer&quot;&gt;Timer&lt;/h3&gt;

&lt;p&gt;It’s known that the Go timer (before 1.14) suffers from
&lt;a href=&quot;https://github.com/golang/go/issues/27707&quot;&gt;lock contention and context switches&lt;/a&gt;.
What’s worse, statistics of Sentry syscalls show that a lot of
&lt;code&gt;futex()&lt;/code&gt; calls are introduced by timers (64 timer buckets), and the
fact that Sentry syscalls walk a much longer path (redpill) makes it worse.&lt;/p&gt;

&lt;p&gt;We have two optimizations here: 1. decrease the number of timer buckets from 64
to 4; 2. decrease the timer precision from ns to ms. You may worry about the
decrease in timer precision, but as far as we see, most of the applications are
event-based and are not affected by a coarse-grained timer.&lt;/p&gt;

&lt;p&gt;However, Go changed the implementation of timers in v1.14; how to port this
optimization remains an open question.&lt;/p&gt;

&lt;h3 id=&quot;scheduler&quot;&gt;Scheduler&lt;/h3&gt;

&lt;p&gt;gVisor introduces an extra level of scheduling on top of the host Linux
scheduler (usually CFS). An L2 scheduler sometimes has a positive impact, as it
saves the heavy context switch in the L1 scheduler. There are many two-level
scheduler cases, for example, coroutines, virtual machines, etc.&lt;/p&gt;

&lt;p&gt;gVisor reuses Go’s work-stealing scheduler, which was originally designed for
coroutines, as the L2 scheduler. They share the same goal:&lt;/p&gt;

&lt;p&gt;“We need to balance between keeping enough running worker threads to utilize
available hardware parallelism and parking excessive running worker threads to
conserve CPU resources and power.” – From
&lt;a href=&quot;https://golang.org/src/runtime/proc.go&quot;&gt;Go scheduler code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If not properly tuned, the L2 scheduler may leak scheduling pressure to the L1
scheduler. According to the G-P-M model of Go, the parallelism is closely
related to the GOMAXPROCS limit. Upstream gVisor by default uses the number of
host cores, which leads to a lot of wasted M wake/stop(s). By properly
configuring GOMAXPROCS for a POD of 4/8/16 cores, we find it can save some CPU
cycles without worsening the workload latency.&lt;/p&gt;

&lt;p&gt;To further restrict extra M wake/stop(s), before wakep(), we calculate the
number of running Gs and the number of running Ps to decide whether it is
necessary to wake an M. And we find it is better to first steal from the longest
local run queue, compared to the previous random-sequential way. Another related
optimization: we find that most applications get back to Sentry very soon, so it
is not necessary to hand off the P when a task leaves for user space and to find
an idle P when it gets back.&lt;/p&gt;
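
&lt;p&gt;The first two ideas can be sketched in a few lines. This is a toy model in
Python, not the Go runtime code: the function names are hypothetical, and the
concrete rule used here (wake an M only when running Gs outnumber running Ps) is
an assumption made for illustration, since the exact condition is not spelled
out above.&lt;/p&gt;

```python
def should_wake_m(running_gs: int, running_ps: int) -> bool:
    """Assumed rule: only wake another M when Gs outnumber busy Ps."""
    return running_gs > running_ps


def pick_steal_victim(local_run_queues):
    """Return the index of the P with the longest local run queue, or None."""
    best_index = None
    best_length = 0
    for index, queue in enumerate(local_run_queues):
        if len(queue) > best_length:
            best_index, best_length = index, len(queue)
    return best_index


if __name__ == "__main__":
    # Hypothetical snapshot: each inner list is one P's local run queue of Gs.
    queues = [["g1"], ["g2", "g3", "g4"], [], ["g5", "g6"]]
    print(should_wake_m(running_gs=2, running_ps=4))
    print(pick_steal_victim(queues))
```

&lt;p&gt;Stealing from the longest queue first tends to rebalance work in fewer
attempts than probing victims in a random order, at the cost of scanning all
queue lengths.&lt;/p&gt;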

&lt;p&gt;Some optimizations in Go are put
&lt;a href=&quot;https://github.com/zhuangel/go/tree/go1.13.4.blog&quot;&gt;here&lt;/a&gt;. What we learned from
the optimization process of gVisor is that it pays to dig into the Go runtime to
understand what’s going on there. And it’s normal that some ideas work, but some
fail.&lt;/p&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;p&gt;We introduced how we profiled gVisor for production-ready performance. Using
this methodology, along with some other aggressive measures, we finally got to
run gVisor with an acceptable overhead, and even better than runc in some
workloads. We also absorbed a lot of optimization progress in the community,
e.g., VFS2.&lt;/p&gt;

&lt;p&gt;So far, we have deployed more than 100K gVisor instances in the production
environment, and they have supported the transactions of the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Singles%27_Day&quot;&gt;Singles Day Global Shopping Festivals&lt;/a&gt; very well.&lt;/p&gt;

&lt;p&gt;Along with performance, there are also some other important aspects for
production adoption. For example, generating a core dump after a Sentry panic is
helpful for debugging; a coverage tool is necessary to make sure new changes are
properly covered by test cases. We’ll leave these topics to later discussions.&lt;/p&gt;</content><author><name>jianfengt</name></author><summary type="html">This post was contributed by Ant Group, a large-scale digital payment platform. Jianfeng and Yong are engineers at Ant Group working on infrastructure systems, and contributors to gVisor. At Ant Group, we are committed to keep online transactions safe and efficient. Continuously improving security for potential system-level attacks is one of many measures. As a container runtime, gVisor provides container-native security without sacrificing resource efficiency. Therefore, it has been on our radar since it was released.</summary></entry><entry><title type="html">gVisor RACK</title><link href="/blog/2021/08/31/gvisor-rack/" rel="alternate" type="text/html" title=" gVisor RACK" /><published>2021-08-31T00:00:00-05:00</published><updated>2021-08-31T00:00:00-05:00</updated><id>/blog/2021/08/31/gvisor-rack</id><content type="html" xml:base="/blog/2021/08/31/gvisor-rack/">&lt;p&gt;gVisor has implemented the &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc8985&quot;&gt;RACK&lt;/a&gt;
(Recent ACKnowledgement) TCP loss-detection algorithm in our network stack,
which improves throughput in the presence of packet loss and reordering.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;TCP is a connection-oriented protocol that detects and recovers from loss by
retransmitting packets. &lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc8985&quot;&gt;RACK&lt;/a&gt; is
one of the recent loss-detection methods implemented in Linux and BSD, which
helps in identifying packet loss quickly and accurately in the presence of
packet reordering and tail losses.&lt;/p&gt;

&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;The TCP congestion window indicates the number of unacknowledged packets that
can be sent at any time. When packet loss is identified, the congestion window
is reduced depending on the type of loss. The sender will recover from the loss
after all the packets sent before reducing the congestion window are
acknowledged. If the loss is identified falsely by the connection, then the
connection enters loss recovery unnecessarily, resulting in sending fewer
packets.&lt;/p&gt;

&lt;p&gt;Packet loss is identified mainly in two ways:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Three duplicate acknowledgments, which will result in either
&lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc2001#section-4&quot;&gt;Fast&lt;/a&gt; or
&lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc6675&quot;&gt;SACK&lt;/a&gt; recovery. The
congestion window is reduced depending on the type of congestion control
algorithm. For example, in the
&lt;a href=&quot;https://en.wikipedia.org/wiki/TCP_congestion_control#TCP_Tahoe_and_Reno&quot;&gt;Reno&lt;/a&gt;
algorithm it is reduced to half.&lt;/li&gt;
  &lt;li&gt;RTO (Retransmission Timeout), which will result in Timeout recovery. The
congestion window is reduced to one
&lt;a href=&quot;https://en.wikipedia.org/wiki/Maximum_segment_size&quot;&gt;MSS&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
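
&lt;p&gt;As a simplified illustration of the two cases above, the sketch below applies
each reduction to a congestion window measured in bytes. It is not gVisor’s
netstack code: the function names, the MSS value, and the example window size
are assumptions, and real stacks track more state (for example, the slow-start
threshold).&lt;/p&gt;

```python
MSS = 1460  # Example maximum segment size in bytes (an assumption).


def on_three_dup_acks_reno(cwnd: int) -> int:
    """Reno-style reaction to three duplicate ACKs: halve the window."""
    return cwnd // 2


def on_rto(cwnd: int) -> int:
    """Reaction to a retransmission timeout: collapse to one MSS."""
    return MSS


if __name__ == "__main__":
    cwnd = 20 * MSS
    print(on_three_dup_acks_reno(cwnd) // MSS, "segments after fast recovery")
    print(on_rto(cwnd) // MSS, "segment after an RTO")
```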

&lt;p&gt;Both of these cases result in reducing the congestion window, with RTO being
more expensive. Most of the existing algorithms do not detect packet reordering,
which gets incorrectly identified as packet loss, resulting in an RTO.
Furthermore, the loss of an ACK at the end of a sequence (known as “tail loss”)
will also trigger RTO and slow down future transmissions unnecessarily. RACK
helps us to identify loss accurately in all these scenarios, and will avoid
entering RTO.&lt;/p&gt;

&lt;h2 id=&quot;implementation-of-rack&quot;&gt;Implementation of RACK&lt;/h2&gt;

&lt;p&gt;Implementation of RACK requires support for:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Per-packet transmission timestamps: RACK detects loss depending on the
transmission times of the packet and the timestamp at which ACK was
received.&lt;/li&gt;
  &lt;li&gt;SACK and ability to detect DSACK: Selective Acknowledgement and Duplicate
SACK are used to adjust the timer window after which a packet can be marked
as lost.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;packet-reordering&quot;&gt;Packet Reordering&lt;/h3&gt;

&lt;p&gt;Packet reordering commonly occurs when different packets take different paths
through a network. The diagram below shows the transmission of four packets
which get reordered in transmission, and the resulting TCP behavior with and
without RACK.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2021-08-31-rack-figure1.png&quot; alt=&quot;Figure 1&quot; title=&quot;Packet reordering.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the above example, the sender sees three duplicate acknowledgments. Without
RACK, this is identified falsely as packet loss, and the congestion window will
be reduced after entering Fast/SACK recovery.&lt;/p&gt;

&lt;p&gt;To detect packet reordering, RACK uses a reorder window, bounded between
[&lt;a href=&quot;https://en.wikipedia.org/wiki/Round-trip_delay&quot;&gt;RTT&lt;/a&gt;/4, RTT]. The reorder
timer is set to expire after &lt;em&gt;RTT+reorder_window&lt;/em&gt;. A packet is marked as lost
when the packets following it were acknowledged using SACK and the reorder timer
expires. The reorder window is increased when a DSACK is received (which
indicates that there is a higher degree of reordering).&lt;/p&gt;
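
&lt;p&gt;The reorder-window logic described above can be sketched as follows. This is
a simplified model rather than gVisor’s implementation: it uses a single RTT
value, the names are hypothetical, and the rule that the window grows by one
RTT/4 step per round with a DSACK is an assumption loosely based on RFC 8985.&lt;/p&gt;

```python
def reorder_window(rtt: float, dsack_rounds: int = 0) -> float:
    """Reorder window in seconds, kept within [rtt / 4, rtt]."""
    # Assumed growth rule: one extra rtt / 4 step per round with a DSACK.
    candidate = (rtt / 4) * (1 + dsack_rounds)
    return max(rtt / 4, min(candidate, rtt))


def is_lost(sent_at: float, now: float, rtt: float, later_packet_sacked: bool,
            dsack_rounds: int = 0) -> bool:
    """A packet is lost once a later packet was SACKed and its timer expired."""
    deadline = sent_at + rtt + reorder_window(rtt, dsack_rounds)
    return later_packet_sacked and now >= deadline


if __name__ == "__main__":
    rtt = 0.100  # Example round-trip time of 100 ms.
    print(reorder_window(rtt))                   # initial window: rtt / 4
    print(reorder_window(rtt, dsack_rounds=10))  # capped at rtt
    print(is_lost(sent_at=0.0, now=0.110, rtt=rtt, later_packet_sacked=True))
    print(is_lost(sent_at=0.0, now=0.130, rtt=rtt, later_packet_sacked=True))
```

&lt;p&gt;With a 100 ms RTT and no DSACKs, the packet in this example is only declared
lost once 125 ms have passed since it was sent, which gives a reordered packet
time to arrive before any recovery starts.&lt;/p&gt;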

&lt;h3 id=&quot;tail-loss&quot;&gt;Tail Loss&lt;/h3&gt;

&lt;p&gt;Tail loss occurs when the packets are lost at the end of data transmission. The
diagram below shows an example of tail loss when the last three packets are
lost, and how it is handled with and without RACK.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2021-08-31-rack-figure2.png&quot; alt=&quot;Figure 2&quot; title=&quot;Tail loss figure 2.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For tail losses, RACK uses a Tail Loss Probe (TLP), which relies on a timer for
the last packet sent. The TLP timer is set to &lt;em&gt;2 * RTT,&lt;/em&gt; after which a probe is
sent. The probe packet will allow the connection one more chance to detect a
loss by triggering ACK feedback to avoid entering RTO. In the above example, the
loss is recovered without entering the RTO.&lt;/p&gt;
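
&lt;p&gt;The timer choice can be written down as a one-line rule. The sketch below is
an illustration only: the function name is hypothetical, and capping the probe
timeout at the RTO is an assumption that keeps the probe from firing later than
the timeout it is meant to pre-empt.&lt;/p&gt;

```python
def probe_timeout(rtt: float, rto: float) -> float:
    """Time after the last transmission at which the TLP probe is sent."""
    # The rule described above is 2 * RTT; the cap at the RTO is an assumption.
    return min(2 * rtt, rto)


if __name__ == "__main__":
    # Example values: 50 ms RTT and a 1 s RTO.
    print(probe_timeout(rtt=0.050, rto=1.0))
```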

&lt;p&gt;TLP will also help in cases where the ACK was lost but all the packets were
received by the receiver. The below diagram shows that the ACK received for the
probe packet avoided the RTO.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2021-08-31-rack-figure3.png&quot; alt=&quot;Figure 3&quot; title=&quot;Tail loss figure 3.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If there was some loss, then the ACK for the probe packet will have the SACK
blocks, which will be used to detect and retransmit the lost packets.&lt;/p&gt;

&lt;p&gt;In gVisor, we have support for
&lt;a href=&quot;https://datatracker.ietf.org/doc/html/rfc6582&quot;&gt;NewReno&lt;/a&gt; and SACK loss recovery
methods. We
&lt;a href=&quot;https://github.com/google/gvisor/issues/5243&quot;&gt;added support for RACK&lt;/a&gt; recently,
and it is the default when SACK is enabled. After enabling RACK, our internal
benchmarks in the presence of reordering and tail losses and the data we took
from internal users inside Google have shown ~50% reduction in the number of
RTOs.&lt;/p&gt;

&lt;p&gt;While RACK has improved one aspect of TCP performance by reducing the timeouts
in the presence of reordering and tail losses, in gVisor we plan to implement
the undoing of congestion windows and
&lt;a href=&quot;https://datatracker.ietf.org/doc/html/draft-cardwell-iccrg-bbr-congestion-control&quot;&gt;BBRv2&lt;/a&gt;
(once there is an RFC available) to further improve TCP performance in less
ideal network conditions.&lt;/p&gt;

&lt;p&gt;If you haven’t already, try gVisor. The instructions to get started are in our
&lt;a href=&quot;https://gvisor.dev/docs/user_guide/quick_start/docker/&quot;&gt;Quick Start&lt;/a&gt;. You can
also get involved with the gVisor community via our
&lt;a href=&quot;https://gitter.im/gvisor/community&quot;&gt;Gitter channel&lt;/a&gt;,
&lt;a href=&quot;https://groups.google.com/forum/#!forum/gvisor-users&quot;&gt;email list&lt;/a&gt;,
&lt;a href=&quot;https://gvisor.dev/issue/new&quot;&gt;issue tracker&lt;/a&gt;, and
&lt;a href=&quot;https://github.com/google/gvisor&quot;&gt;Github repository&lt;/a&gt;.&lt;/p&gt;</content><author><name>nybidari</name></author><summary type="html">gVisor has implemented the RACK (Recent ACKnowledgement) TCP loss-detection algorithm in our network stack, which improves throughput in the presence of packet loss and reordering.</summary></entry></feed>