<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.0.0">Jekyll</generator><link href="/blog/index.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-05-15T19:33:08-05:00</updated><id>/blog/index.xml</id><entry><title type="html">Scaling Agentic-RL Sandboxes to the Millions with gVisor at Tencent</title><link href="/blog/2026/04/23/scaling-agentic-rl-sandboxes-to-the-millions-with-gvisor-at-tencent/" rel="alternate" type="text/html" title=" Scaling Agentic-RL Sandboxes to the Millions with gVisor at Tencent" /><published>2026-04-23T00:00:00-05:00</published><updated>2026-04-23T00:00:00-05:00</updated><id>/blog/2026/04/23/scaling-agentic-rl-sandboxes-to-the-millions-with-gvisor-at-tencent</id><content type="html" xml:base="/blog/2026/04/23/scaling-agentic-rl-sandboxes-to-the-millions-with-gvisor-at-tencent/">&lt;blockquote&gt;
  &lt;p&gt;&lt;em&gt;This article was contributed by &lt;a href=&quot;https://www.tencent.com/&quot;&gt;Tencent&lt;/a&gt;. Yifeng
Tan, Hua Liu, and Hui Chen are engineers at Tencent, responsible for the
internal container infrastructure.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As LLMs evolve from chat interfaces to autonomous agents, building a robust and
secure isolation environment becomes a necessity. We chose
&lt;a href=&quot;https://gvisor.dev&quot;&gt;gVisor&lt;/a&gt; as the default sandbox for our Agentic-RL
scenarios. Today, we run millions of gVisor sandboxes daily for Agentic-RL
training in production, and that scale continues to grow. After more than
&lt;strong&gt;74,000&lt;/strong&gt; side-by-side comparisons between &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; (gVisor) and &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt;
(unsandboxed/Linux), combined with targeted fixes driven by real-world
workloads, we have essentially closed the execution correctness gap with &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt;,
fully meeting our production-grade business requirements. During this process,
we successfully investigated and resolved gVisor compatibility issues that
accounted for approximately &lt;strong&gt;1.7%&lt;/strong&gt; of all test cases.&lt;/p&gt;

&lt;p&gt;This post focuses on CPU-centric code execution and testing workloads. We will
discuss gVisor compatibility verification and highlight representative issues,
skipping implementation details like GPU support, image distribution, or cluster
scheduling. We aim to answer three questions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Why choose gVisor?&lt;/li&gt;
  &lt;li&gt;Why doesn’t manual compatibility verification scale?&lt;/li&gt;
  &lt;li&gt;How can AI agents analyze compatibility issues, what do typical failures
look like, and what best practices have we established?&lt;/li&gt;
&lt;/ol&gt;

&lt;!--/excerpt--&gt;

&lt;h2 id=&quot;background-why-agentic-rl-needs-gvisor&quot;&gt;Background: Why Agentic-RL Needs gVisor&lt;/h2&gt;

&lt;p&gt;Over the past two years, benchmarks like SWE-bench have turned “Agents fixing
bugs in real code repositories” from a research concept into an engineering
reality. The agent behavioral model has evolved from &lt;strong&gt;static code generation&lt;/strong&gt;
to &lt;strong&gt;dynamic environmental interaction&lt;/strong&gt;, spanning the entire lifecycle of
dependency resolution, execution, test feedback, and iterative debugging. We
don’t just need “an environment that runs Docker,” but rather a sandbox that
strictly constrains the kernel attack surface while remaining lightweight and
easy to deploy at scale. &lt;a href=&quot;https://gvisor.dev&quot;&gt;gVisor&lt;/a&gt; is a great fit for this
scenario:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It implements an application-level kernel in user space, intercepting and
re-implementing system calls, significantly reducing the attack surface
where containers directly interact with the host kernel. Its isolation has
been well-recognized by the industry.&lt;/li&gt;
  &lt;li&gt;It integrates naturally with existing Docker/Kubernetes infrastructure,
avoiding the need for an entirely new guest kernel operation and maintenance
system.&lt;/li&gt;
  &lt;li&gt;Compared to microVM solutions, which generally need bare-metal hosts or
nested virtualization, gVisor can run inside regular VMs. This makes it
significantly cheaper and more flexible, with lower startup and resource
costs, and far better suited for large-scale deployments of sandbox containers.&lt;/li&gt;
  &lt;li&gt;It is also more friendly to GPU scenarios, facilitating integration with
existing heterogeneous computing environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, &lt;strong&gt;re-implementing the Linux ABI means its compatibility must be
rigorously validated.&lt;/strong&gt; In an Agentic-RL scenario where “any project can run and
any environment can appear,” compatibility can’t rely on intuition. It requires
large-scale verification against real workloads.&lt;/p&gt;

&lt;h2 id=&quot;challenge-verifying-tens-of-thousands-of-cases-cannot-rely-entirely-on-manual-effort&quot;&gt;Challenge: Verifying Tens of Thousands of Cases Cannot Rely Entirely on Manual Effort&lt;/h2&gt;

&lt;p&gt;Compatibility issues are rarely simple. Analyzing a typical SWE-related failure
usually requires answering several questions at once:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Is this failure unique to &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; (gVisor), or does it also fail under
&lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt;?&lt;/li&gt;
  &lt;li&gt;If it only fails under gVisor, is it a semantic inconsistency in the Linux
ABI, missing procfs / sysfs, file system behavioral differences, or a TOCTOU
(Time-of-Check to Time-of-Use) race condition amplified by system call
overhead?&lt;/li&gt;
  &lt;li&gt;What is the actual behavior of the Linux kernel? At which layer did gVisor
deviate?&lt;/li&gt;
  &lt;li&gt;Should this issue be addressed by patching gVisor, modifying the test case,
adjusting configurations, or simply avoiding a certain way of running?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Engineers can handle a handful of cases manually. But across these datasets, we
are dealing with tens of thousands of real-world project instances, over a
dozen programming languages, and numerous build systems (Gradle, Maven, CMake,
Cargo, pip, npm, sbt, SwiftPM). Manual triage simply doesn’t scale.&lt;/p&gt;

&lt;p&gt;To solve this, we brought AI coding agents into the verification pipeline to act
as &lt;strong&gt;compatibility analysts&lt;/strong&gt;. The process breaks down into four layers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Baseline Comparison Layer&lt;/strong&gt;: Run the same set of test cases in parallel
under &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt;, collecting complete execution logs and exit
statuses.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Difference Filtering Layer&lt;/strong&gt;: Filter out environmental noise and
non-deterministic outputs unrelated to the runtime, preserving samples that
only fail under gVisor.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;AI Diagnostic Layer&lt;/strong&gt;: LLMs output structured root cause analysis reports
by combining logs and relevant source code.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Decision Routing Layer&lt;/strong&gt;: Route the reports into gVisor bugs, user-space
race conditions, environmental differences, or test case issues, providing
suggestions for fixes or workarounds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a neat closed loop: &lt;strong&gt;AI analyzing its own runtime environment.&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-mermaid&quot;&gt;graph TD
    A[Baseline Comparison Layer] --&amp;gt;|Run under runc/runsc in parallel&amp;lt;br&amp;gt;Collect logs &amp;amp; exit status| B(Difference Filtering Layer)
    B --&amp;gt;|Filter environmental noise&amp;lt;br&amp;gt;Keep gVisor-specific failures| C{AI Diagnostic Layer}
    C --&amp;gt;|Combine logs &amp;amp; source code&amp;lt;br&amp;gt;Output structured root cause report| D[Decision Routing Layer]

    D --&amp;gt;|gVisor bug| E[Submit community fix]
    D --&amp;gt;|User-space race condition| F[Workaround strategy]
    D --&amp;gt;|Environmental difference| G[Adjust environment]
    D --&amp;gt;|Test case issue| H[Fix test case]

    subgraph AI-Driven Compatibility Verification Framework
    A
    B
    C
    D
    end
&lt;/code&gt;&lt;/pre&gt;
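
&lt;p&gt;The filtering and routing layers can be illustrated with a minimal Python sketch. This is not the production pipeline: &lt;code class=&quot;highlighter-rouge&quot;&gt;PairedResult&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;gvisor_only_failures&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;route&lt;/code&gt;, and the category labels are hypothetical names chosen for illustration, and the routing table simply mirrors the diagram above.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
```python
from dataclasses import dataclass


# Hypothetical record: one test case executed under both runtimes.
@dataclass(frozen=True)
class PairedResult:
    case_id: str
    runc_passed: bool
    runsc_passed: bool


# Routing table mirroring the diagram; the keys are illustrative labels.
ROUTES = {
    "gvisor_bug": "submit community fix",
    "userspace_race": "workaround strategy",
    "environment_difference": "adjust environment",
    "test_case_issue": "fix test case",
}


def gvisor_only_failures(results):
    """Difference filtering: keep cases that pass under runc but fail under runsc."""
    return [r.case_id for r in results if r.runc_passed and not r.runsc_passed]


def route(category):
    """Decision routing: map a diagnosed category to its follow-up action."""
    return ROUTES[category]


results = [
    PairedResult("case-a", runc_passed=True, runsc_passed=True),
    PairedResult("case-b", runc_passed=True, runsc_passed=False),
    PairedResult("case-c", runc_passed=False, runsc_passed=False),
]
print(gvisor_only_failures(results))
print(route("userspace_race"))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A single paired run is a noisy signal, so a real filter would also need retries to separate flaky tests from runtime-specific failures.&lt;/p&gt;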

&lt;p&gt;In our workflow, every deeply analyzed case produces a structured document,
typically containing:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Failure symptoms and minimal reproduction method&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt;/&lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; comparison results&lt;/li&gt;
  &lt;li&gt;Root cause classification: gVisor bug, missing feature, environmental
difference, test case issue, or race condition amplification&lt;/li&gt;
  &lt;li&gt;Linux kernel behavior comparison and source code evidence&lt;/li&gt;
  &lt;li&gt;Fixes or workaround suggestions&lt;/li&gt;
  &lt;li&gt;Regression verification results&lt;/li&gt;
&lt;/ul&gt;
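
&lt;p&gt;One possible shape for such a document is sketched below in Python. The class name &lt;code class=&quot;highlighter-rouge&quot;&gt;CaseReport&lt;/code&gt;, its field names, and the category labels are assumptions made for illustration; only the list of sections comes from the workflow described above. The sample values are taken from the syncthing case discussed later in this post.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
```python
import json
from dataclasses import asdict, dataclass

# Labels for the root cause classification listed above (names are illustrative).
ROOT_CAUSES = (
    "gvisor_bug",
    "missing_feature",
    "environmental_difference",
    "test_case_issue",
    "race_condition_amplification",
)


@dataclass
class CaseReport:
    case_id: str
    symptoms: str
    reproduction: str
    runc_passed: bool
    runsc_passed: bool
    root_cause: str
    linux_evidence: str
    suggestion: str
    regression_verified: bool

    def __post_init__(self):
        # Reject classifications outside the agreed vocabulary.
        if self.root_cause not in ROOT_CAUSES:
            raise ValueError("unknown root cause: " + self.root_cause)


report = CaseReport(
    case_id="syncthing__syncthing-7828",
    symptoms="TestCopyRange subtests report: function not implemented",
    reproduction="run the TestCopyRange subtests under runsc",
    runc_passed=True,
    runsc_passed=False,
    root_cause="missing_feature",
    linux_evidence="Linux implements copy_file_range (syscall 326 on amd64)",
    suggestion="implement copy_file_range in gVisor",
    regression_verified=True,
)
print(json.dumps(asdict(report), indent=2))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Keeping the classification as a closed vocabulary makes the reports easy to aggregate and route automatically.&lt;/p&gt;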

&lt;p&gt;To date, we have used AI to automatically analyze &lt;strong&gt;thousands of test cases
exhibiting behavioral differences&lt;/strong&gt;. From these, we extracted and deeply
reviewed &lt;strong&gt;100+ highly representative cases&lt;/strong&gt; across &lt;strong&gt;10+ programming
languages&lt;/strong&gt; and multiple build systems. These cases help us determine not only
“whether gVisor is usable,” but also “who is actually to blame for a given
failure.”&lt;/p&gt;

&lt;h2 id=&quot;compatibility-landscape-boundaries-defined-by-batch-comparisons&quot;&gt;Compatibility Landscape: Boundaries Defined by Batch Comparisons&lt;/h2&gt;

&lt;p&gt;Looking at a small sample of failures makes it easy to misjudge gVisor’s
compatibility. Reliable conclusions require large-scale A/B testing.&lt;/p&gt;

&lt;p&gt;Across 10 mainstream code execution datasets in our Agentic-RL infrastructure,
we’ve run &lt;strong&gt;74,379&lt;/strong&gt; side-by-side comparisons between &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The table below breaks the results down by dataset:&lt;/p&gt;

&lt;!-- mdformat off(no multiline table support in Kramdown) --&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Dataset&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Total cases&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Native &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; accuracy&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;gVisor pre-fix &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; accuracy&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;gVisor post-fix &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; accuracy&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;terminal-bench2&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;89&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;100.00%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;94.38%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;97.75%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;swe-public/Multi-SWE-bench&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1,632&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;70.16%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;72.49%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;73.16%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;swe-public/Multi-SWE-RL&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7,046&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;27.73%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;20.49%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;26.81%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;swe-public/SWE-bench_Multilingual&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;300&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;93.00%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;92.67%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;93.00%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;swe-public/SWE-bench_Not_Verified&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1,794&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;97.94%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;97.94%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;97.94%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;swe-public/SWE-bench_Pro&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;731&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;90.15%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;90.97%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;90.97%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;swe-public/SWE-bench_Verified&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;500&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;100.00%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;99.60%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;100.00%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;swe-public/SWE-Gym&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2,438&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;86.75%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88.27%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88.27%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;swe-public/SWE-rebench&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;21,336&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;83.33%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;83.33%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;83.77%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;swe-public/SWE-smith&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;38,513&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;99.37%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;97.42%&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;99.31%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;74,379&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;86.78%&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;85.18%&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&lt;strong&gt;86.91%&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;!-- mdformat on --&gt;

&lt;p&gt;Three key takeaways emerge from this data:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; (gVisor) and &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; (Linux native) are now effectively on par.&lt;/strong&gt;
Across 74,379 runs, the correctness gap between &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; is only
about &lt;strong&gt;0.13 percentage points&lt;/strong&gt; (86.91% vs 86.78%), with &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; marginally ahead. We also performed
retries and cross-validation on core datasets to rule out one-off flakiness.
Our fixes improved &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt;’s overall pass rate by approximately &lt;strong&gt;1.7
percentage points&lt;/strong&gt; (from 85.18% to 86.91%). This gain largely stemmed from highly
concentrated failures in a small number of repositories, such as trio,
cloud-custodian, asciidoctor, and syncthing. Once a root cause was
identified, a single fix could often resolve hundreds of failing cases at
once.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Most “compatibility issues” should not be attributed to gVisor.&lt;/strong&gt; The
table clearly demonstrates that even under the native &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; environment,
there is an inherent failure rate of about 13% (overall &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; accuracy is
86.78%). These failures largely stem from flaky test code, build
environment deficiencies, or limitations within the underlying datasets.
Evaluating gVisor without a &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; baseline could easily lead to
misattributing this 13% background failure rate as sandbox
incompatibilities.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;The overall pass rate for Multi-SWE-RL is relatively low (around 27% for
both runtimes).&lt;/strong&gt; This is because our internal evaluation framework and some
case-execution methods are still being adapted, so it is not a standalone
compatibility problem in gVisor itself. The same bias affects both &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt;
and &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt;, and therefore does not change the comparative conclusion.&lt;/li&gt;
&lt;/ul&gt;
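
&lt;p&gt;The headline figures can be re-derived from the per-dataset rows. The Python sketch below assumes that the Total row is a case-weighted mean of the per-dataset accuracies; the helper name &lt;code class=&quot;highlighter-rouge&quot;&gt;weighted_accuracy&lt;/code&gt; is hypothetical, and because the inputs are already rounded to two decimals, the results are approximations.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
```python
# (dataset, total cases, runc %, runsc pre-fix %, runsc post-fix %), copied from the table.
ROWS = [
    ("terminal-bench2", 89, 100.00, 94.38, 97.75),
    ("Multi-SWE-bench", 1632, 70.16, 72.49, 73.16),
    ("Multi-SWE-RL", 7046, 27.73, 20.49, 26.81),
    ("SWE-bench_Multilingual", 300, 93.00, 92.67, 93.00),
    ("SWE-bench_Not_Verified", 1794, 97.94, 97.94, 97.94),
    ("SWE-bench_Pro", 731, 90.15, 90.97, 90.97),
    ("SWE-bench_Verified", 500, 100.00, 99.60, 100.00),
    ("SWE-Gym", 2438, 86.75, 88.27, 88.27),
    ("SWE-rebench", 21336, 83.33, 83.33, 83.77),
    ("SWE-smith", 38513, 99.37, 97.42, 99.31),
]

TOTAL_CASES = sum(row[1] for row in ROWS)


def weighted_accuracy(column):
    """Case-weighted mean of one accuracy column (2 = runc, 3 = pre-fix, 4 = post-fix)."""
    return sum(row[1] * row[column] for row in ROWS) / TOTAL_CASES


runc = weighted_accuracy(2)
pre_fix = weighted_accuracy(3)
post_fix = weighted_accuracy(4)

print("total cases:", TOTAL_CASES)
print("runc / pre-fix / post-fix:", round(runc, 2), round(pre_fix, 2), round(post_fix, 2))
print("gap, runsc post-fix vs runc:", round(post_fix - runc, 2))
print("gain from fixes:", round(post_fix - pre_fix, 2))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The rounded results reproduce the Total row (86.78%, 85.18%, 86.91%), the 0.13-point gap, and the roughly 1.7-point gain. SWE-smith and SWE-rebench together account for about 80% of the cases, so they dominate the overall figure.&lt;/p&gt;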

&lt;p&gt;At the production scale we described earlier—&lt;strong&gt;millions of gVisor sandboxes
running every day&lt;/strong&gt;—this data answers the real question: how much correctness do
we lose by replacing &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; with &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt;? The answer is: &lt;strong&gt;almost none.&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;representative-cases-six-types-of-issues-and-corresponding-fix-paths&quot;&gt;Representative Cases: Six Types of Issues and Corresponding Fix Paths&lt;/h2&gt;

&lt;p&gt;After filtering out cases where both &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; failed simultaneously,
we conducted in-depth reviews of the remaining cases that exhibited behavioral
differences. Using these 100+ representative cases as a sample, their final
root-cause attribution can roughly be divided into the following categories:&lt;/p&gt;

&lt;!-- mdformat off(no multiline table support in Kramdown) --&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Root Cause Category&lt;/th&gt;
      &lt;th&gt;Requires gVisor Modification?&lt;/th&gt;
      &lt;th&gt;Typical Examples&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Genuine gVisor bugs&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;poll&lt;/code&gt; incorrectly modifying &lt;code class=&quot;highlighter-rouge&quot;&gt;events&lt;/code&gt;, inconsistent &lt;code class=&quot;highlighter-rouge&quot;&gt;execve&lt;/code&gt; &lt;code class=&quot;highlighter-rouge&quot;&gt;errno&lt;/code&gt; returns, &lt;code class=&quot;highlighter-rouge&quot;&gt;O_TRUNC&lt;/code&gt; missing &lt;code class=&quot;highlighter-rouge&quot;&gt;IN_MODIFY&lt;/code&gt; inotify events&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Missing syscalls and virtual FS entries&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Yes&lt;/td&gt;
      &lt;td&gt;Unimplemented &lt;code class=&quot;highlighter-rouge&quot;&gt;copy_file_range&lt;/code&gt; syscall, missing &lt;code class=&quot;highlighter-rouge&quot;&gt;/proc/sys/fs/pipe-max-size&lt;/code&gt; configuration file, and absence of &lt;code class=&quot;highlighter-rouge&quot;&gt;/sys/dev/block&lt;/code&gt; directory&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Clock and timer precision differences&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Partially&lt;/td&gt;
      &lt;td&gt;CPU clock measurement precision, monotonic clock start value differences, sleep duration jitter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Amplified race conditions&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;Gradle &lt;code class=&quot;highlighter-rouge&quot;&gt;clean test&lt;/code&gt; parallel execution concurrency race, CMake &lt;code class=&quot;highlighter-rouge&quot;&gt;copy_if_different&lt;/code&gt; TOCTOU race&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Environmental or config differences&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;External network access restrictions, JDK version mismatches, missing dynamic library paths&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Test case issues&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;No&lt;/td&gt;
      &lt;td&gt;Test execution order dependencies, underlying dataset defects, inherently flaky tests&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;!-- mdformat on --&gt;

&lt;p&gt;This shows that aside from genuine bugs or missing Linux ABI implementations in
gVisor, a significant portion of behavioral differences stems from
timing-sensitive tests, amplified user-space race conditions, or environmental
setup differences. This is especially crucial for Agentic-RL scenarios. Without
&lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; baselines and root cause analysis, these failures could easily be
misattributed as sandbox incompatibilities, leading to systematically
pessimistic conclusions.&lt;/p&gt;

&lt;p&gt;The cases below illustrate the different types of compatibility issues we see in
Agentic-RL: system call semantic deviations, Linux ABI gaps, VFS implementation
gaps, and user-space race conditions.&lt;/p&gt;

&lt;h3 id=&quot;case-1-poll-behavior-inconsistency-causes-tmux-busy-loop&quot;&gt;Case 1: &lt;code class=&quot;highlighter-rouge&quot;&gt;poll&lt;/code&gt; Behavior Inconsistency Causes &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt; Busy-Loop&lt;/h3&gt;

&lt;p&gt;The evaluation cluster’s CPU utilization was unusually high. Investigation
revealed that the &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt; server in each Agent container was pegging a CPU core:
under gVisor, CPU usage hovered at &lt;strong&gt;96.6%&lt;/strong&gt;, while under &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; it was
practically &lt;strong&gt;0%&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The root cause was &lt;code class=&quot;highlighter-rouge&quot;&gt;poll&lt;/code&gt; write-back semantics. gVisor internally appended
&lt;code class=&quot;highlighter-rouge&quot;&gt;POLLHUP|POLLERR&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;pollfd.events&lt;/code&gt; and wrote the entire &lt;code class=&quot;highlighter-rouge&quot;&gt;pollfd&lt;/code&gt; struct back
to user space. Linux, however, only writes to &lt;code class=&quot;highlighter-rouge&quot;&gt;revents&lt;/code&gt; and &lt;strong&gt;never modifies the
user’s original &lt;code class=&quot;highlighter-rouge&quot;&gt;events&lt;/code&gt;&lt;/strong&gt;. This discrepancy prevented libevent from properly
removing closed file descriptors. Subsequent &lt;code class=&quot;highlighter-rouge&quot;&gt;poll&lt;/code&gt; calls immediately returned
&lt;code class=&quot;highlighter-rouge&quot;&gt;POLLNVAL&lt;/code&gt;, triggering a busy-loop.&lt;/p&gt;

&lt;p&gt;After the fix, &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt; CPU usage dropped from 96.6% to 0%. The impact goes far
beyond &lt;code class=&quot;highlighter-rouge&quot;&gt;tmux&lt;/code&gt;: any program relying on the &lt;code class=&quot;highlighter-rouge&quot;&gt;libevent&lt;/code&gt; &lt;code class=&quot;highlighter-rouge&quot;&gt;poll&lt;/code&gt; backend benefits
from this.&lt;/p&gt;
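
&lt;p&gt;The write-back contract can be observed from user space. The Python sketch below is an illustration rather than the regression test used for the fix: it assumes a Linux-like libc reachable through &lt;code class=&quot;highlighter-rouge&quot;&gt;ctypes&lt;/code&gt;, and the helper name &lt;code class=&quot;highlighter-rouge&quot;&gt;poll_closed_pipe&lt;/code&gt; is hypothetical. It polls the read end of a pipe whose writer has gone away and reports the &lt;code class=&quot;highlighter-rouge&quot;&gt;events&lt;/code&gt; field before and after the call.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
```python
import ctypes
import os
import select


class PollFd(ctypes.Structure):
    """Mirrors struct pollfd from poll(2)."""

    _fields_ = [
        ("fd", ctypes.c_int),
        ("events", ctypes.c_short),
        ("revents", ctypes.c_short),
    ]


def poll_closed_pipe():
    """Poll the read end of a pipe whose write end is closed.

    Returns (events_before, events_after, revents).
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.poll.argtypes = [ctypes.POINTER(PollFd), ctypes.c_ulong, ctypes.c_int]
    libc.poll.restype = ctypes.c_int

    read_end, write_end = os.pipe()
    os.close(write_end)  # the peer hangs up
    try:
        pfd = PollFd(fd=read_end, events=select.POLLIN, revents=0)
        before = pfd.events
        ready = libc.poll(ctypes.byref(pfd), 1, 0)  # one descriptor, no timeout
        if ready == -1:
            raise OSError(ctypes.get_errno(), "poll failed")
        return before, pfd.events, pfd.revents
    finally:
        os.close(read_end)


before, after, revents = poll_closed_pipe()
print("events before:", before, "after:", after, "revents:", revents)
print("events preserved:", before == after)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On Linux the &lt;code class=&quot;highlighter-rouge&quot;&gt;events&lt;/code&gt; field is expected to come back unchanged, with the hang-up reported only in &lt;code class=&quot;highlighter-rouge&quot;&gt;revents&lt;/code&gt;. Running the same check under both runtimes is a cheap way to confirm that they agree.&lt;/p&gt;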

&lt;h3 id=&quot;case-2-syncthing-test-case-exposes-two-independent-linux-abi-gaps-unimplemented-syscalls-or-virtual-files&quot;&gt;Case 2: syncthing Test Case Exposes Two Independent Linux ABI Gaps (Unimplemented Syscalls or Virtual Files)&lt;/h3&gt;

&lt;p&gt;In real-world workloads, it’s not uncommon for a single test case to hit two
independent gVisor compatibility issues at once. The &lt;code class=&quot;highlighter-rouge&quot;&gt;syncthing__syncthing-7828&lt;/code&gt;
test case in the Multi-SWE-RL dataset passes normally under &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt;, but
consistently fails under &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt;: 16 &lt;code class=&quot;highlighter-rouge&quot;&gt;TestCopyRange/*&lt;/code&gt; subtests report &lt;code class=&quot;highlighter-rouge&quot;&gt;function
not implemented&lt;/code&gt;, and another &lt;code class=&quot;highlighter-rouge&quot;&gt;TestTruncateFileOnly&lt;/code&gt; times out waiting for an
inotify event.&lt;/p&gt;

&lt;p&gt;This was caused by two independent Linux ABI gaps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;copy_file_range&lt;/code&gt; (syscall 326) was unimplemented.&lt;/strong&gt; gVisor registered it
as &lt;code class=&quot;highlighter-rouge&quot;&gt;ErrorWithEvent(ENOSYS)&lt;/code&gt;, so any program using this syscall received
&lt;code class=&quot;highlighter-rouge&quot;&gt;function not implemented&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;open(O_TRUNC)&lt;/code&gt; was missing the &lt;code class=&quot;highlighter-rouge&quot;&gt;IN_MODIFY&lt;/code&gt; inotify event.&lt;/strong&gt; The Linux
kernel generates &lt;code class=&quot;highlighter-rouge&quot;&gt;IN_MODIFY&lt;/code&gt; along the &lt;code class=&quot;highlighter-rouge&quot;&gt;do_open()&lt;/code&gt; → &lt;code class=&quot;highlighter-rouge&quot;&gt;handle_truncate()&lt;/code&gt; →
&lt;code class=&quot;highlighter-rouge&quot;&gt;notify_change()&lt;/code&gt; path. However, gVisor VFS’s &lt;code class=&quot;highlighter-rouge&quot;&gt;OpenAt&lt;/code&gt; only generated
&lt;code class=&quot;highlighter-rouge&quot;&gt;IN_OPEN&lt;/code&gt;, causing programs listening for file modification events to be
“deaf” to the truncation action.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fix proceeded along two lines: implementing &lt;code class=&quot;highlighter-rouge&quot;&gt;copy_file_range&lt;/code&gt; for both amd64
(326) and arm64 (285), and issuing &lt;code class=&quot;highlighter-rouge&quot;&gt;IN_MODIFY&lt;/code&gt; at the VFS layer for &lt;code class=&quot;highlighter-rouge&quot;&gt;O_TRUNC&lt;/code&gt; on
non-newly created files (skipping it for newly created files via the
&lt;code class=&quot;highlighter-rouge&quot;&gt;FMODE_CREATED&lt;/code&gt; flag, consistent with Linux). After the fix, this test case
passed consistently under &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; just like under &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt;.&lt;/p&gt;
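
&lt;p&gt;Independently of the gVisor fix, user-space code can guard against a missing &lt;code class=&quot;highlighter-rouge&quot;&gt;copy_file_range&lt;/code&gt;. The Python sketch below shows one defensive pattern, not something taken from syncthing or gVisor: it tries &lt;code class=&quot;highlighter-rouge&quot;&gt;os.copy_file_range&lt;/code&gt; first and falls back to a plain read/write loop. The helper name &lt;code class=&quot;highlighter-rouge&quot;&gt;copy_file&lt;/code&gt; is hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
```python
import os
import tempfile


def copy_file(src_path, dst_path, chunk=1024 * 1024):
    """Copy src to dst; return the name of the code path that did the work."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            while True:
                copied = os.copy_file_range(src.fileno(), dst.fileno(), chunk)
                if copied == 0:  # zero bytes copied means end of file
                    return "copy_file_range"
        except AttributeError:
            pass  # this platform or Python build has no os.copy_file_range
        except OSError:
            # ENOSYS ("function not implemented") lands here. Other errors are
            # retried on the portable path, where genuine I/O errors resurface.
            pass
        # Portable fallback: start over in case the fast path copied partially.
        src.seek(0)
        dst.seek(0)
        dst.truncate()
        while True:
            data = src.read(chunk)
            if not data:
                return "read/write"
            dst.write(data)


with tempfile.TemporaryDirectory() as workdir:
    src_path = os.path.join(workdir, "src.bin")
    dst_path = os.path.join(workdir, "dst.bin")
    payload = b"gvisor" * 1000
    with open(src_path, "wb") as handle:
        handle.write(payload)
    method = copy_file(src_path, dst_path)
    with open(dst_path, "rb") as handle:
        copied_ok = handle.read() == payload
print(method, copied_ok)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A test suite that calls the syscall directly, as the &lt;code class=&quot;highlighter-rouge&quot;&gt;TestCopyRange&lt;/code&gt; subtests do, has no such fallback, which is why the gap had to be closed in the sandbox itself.&lt;/p&gt;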

&lt;h3 id=&quot;case-3-gradle-clean-test-concurrency-raceroot-cause-in-user-space-not-gvisor&quot;&gt;Case 3: Gradle clean test Concurrency Race—Root Cause in User Space, Not gVisor&lt;/h3&gt;

&lt;p&gt;Not all issues that “only reproduce under gVisor” are actually gVisor bugs.&lt;/p&gt;

&lt;p&gt;A Thunderbird Android test running &lt;code class=&quot;highlighter-rouge&quot;&gt;./gradlew clean test --max-workers 8
--continue&lt;/code&gt; under &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; frequently failed with &lt;code class=&quot;highlighter-rouge&quot;&gt;Unable to delete directory&lt;/code&gt;.
However, running it 7 times under &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; also yielded &lt;strong&gt;5 failures&lt;/strong&gt; (71%), so the
failure was not specific to gVisor. This pointed to a user-space TOCTOU race condition in Gradle’s parallel build: one
subproject was still writing to &lt;code class=&quot;highlighter-rouge&quot;&gt;build/&lt;/code&gt;, while another subproject’s clean task
was already trying to delete it.&lt;/p&gt;

&lt;p&gt;gVisor’s higher system call overhead amplified the probability of triggering
this race, but it did not introduce new semantic errors. Splitting the command
into &lt;code class=&quot;highlighter-rouge&quot;&gt;./gradlew clean&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;./gradlew test ...&lt;/code&gt; fixed it completely. &lt;strong&gt;This is
also a fundamental principle we follow in compatibility analysis: always use
&lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; as a baseline first, then determine whether the issue should be
attributed to the sandbox itself.&lt;/strong&gt;&lt;/p&gt;
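
&lt;p&gt;The workaround amounts to running the two Gradle tasks as separate, sequential invocations. A minimal Python sketch of that rewrite follows; the helper names &lt;code class=&quot;highlighter-rouge&quot;&gt;split_clean_and_test&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;run_sequentially&lt;/code&gt; and the project path are hypothetical, and the sketch assumes the Gradle wrapper is the first argument.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
```python
import subprocess


def split_clean_and_test(gradle_args):
    """Turn one 'clean test ...' invocation into two sequential invocations."""
    if "clean" not in gradle_args:
        return [list(gradle_args)]
    remaining = [arg for arg in gradle_args if arg != "clean"]
    wrapper = remaining[0]  # assumed to be the Gradle wrapper, e.g. "./gradlew"
    return [[wrapper, "clean"], remaining]


def run_sequentially(commands, cwd="."):
    """Run each command to completion before starting the next one."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


original = ["./gradlew", "clean", "test", "--max-workers", "8", "--continue"]
steps = split_clean_and_test(original)
print(steps)
# run_sequentially(steps, cwd="/path/to/project")  # hypothetical project path
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the clean step finishes before any test worker starts, no task can delete a &lt;code class=&quot;highlighter-rouge&quot;&gt;build/&lt;/code&gt; directory that another task is still writing to.&lt;/p&gt;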

&lt;h3 id=&quot;case-4-missing-procfs--sysfs-causes-real-applications-to-take-abnormal-paths&quot;&gt;Case 4: Missing procfs / sysfs Causes Real Applications to Take Abnormal Paths&lt;/h3&gt;

&lt;p&gt;Agentic-RL workloads are full of paths that are not usually tested in isolation
but are relied upon by real projects, such as &lt;code class=&quot;highlighter-rouge&quot;&gt;/proc/sys/fs/pipe-max-size&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;/proc/sys/kernel/randomize_va_space&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;/sys/dev/block&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;/proc/[pid]/fdinfo&lt;/code&gt;,
etc. When such a path is missing, the application typically sees &lt;code class=&quot;highlighter-rouge&quot;&gt;ENOENT&lt;/code&gt;, or an upper-layer
library takes an abnormal code path.&lt;/p&gt;

&lt;p&gt;These are usually cheap to fix by wiring up static files or directory
structures. They perfectly illustrate the value of real-world workloads: &lt;strong&gt;we
aren’t adding these paths to satisfy a benchmark, we’re adding them because real
applications actually read them.&lt;/strong&gt;&lt;/p&gt;
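
&lt;p&gt;A simple probe makes such gaps visible before a workload trips over them. The Python sketch below is an illustration, with hypothetical helper names (&lt;code class=&quot;highlighter-rouge&quot;&gt;missing_paths&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;diff_environments&lt;/code&gt;): it lists which of the paths mentioned above are absent, so that the output collected under &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; and under &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; can be compared.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;
```python
import os

# Paths named in this post. "/proc/self/fdinfo" stands in for /proc/[pid]/fdinfo.
PROBE_PATHS = [
    "/proc/sys/fs/pipe-max-size",
    "/proc/sys/kernel/randomize_va_space",
    "/sys/dev/block",
    "/proc/self/fdinfo",
]


def missing_paths(paths):
    """Return the subset of paths that does not exist in the current environment."""
    return [path for path in paths if not os.path.exists(path)]


def diff_environments(missing_in_baseline, missing_in_sandbox):
    """Paths absent only in the sandbox are the candidates worth investigating."""
    return sorted(set(missing_in_sandbox) - set(missing_in_baseline))


print(missing_paths(PROBE_PATHS))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Paths that are missing in both environments point to the image or host configuration rather than to the sandbox, which is the same baseline-first reasoning used throughout this post.&lt;/p&gt;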

&lt;h3 id=&quot;case-5-inconsistent-pty-implementation-causes-interactive-agents-to-error&quot;&gt;Case 5: Inconsistent PTY Implementation Causes Interactive Agents to Error&lt;/h3&gt;

&lt;p&gt;Interactive terminals are easily overlooked but heavily used in Agent systems
(tmux, screen, expect, REPLs, etc.). All rely on PTYs. We fixed several
inconsistencies here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;ISIG&lt;/code&gt; flag was not checked correctly, causing signals to still be
generated after &lt;code class=&quot;highlighter-rouge&quot;&gt;stty -isig&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;When the master closed, it did not send &lt;code class=&quot;highlighter-rouge&quot;&gt;SIGHUP&lt;/code&gt; to the foreground process
group as Linux does.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;TCSBRK&lt;/code&gt; / &lt;code class=&quot;highlighter-rouge&quot;&gt;TCFLSH&lt;/code&gt; and other ioctls were missing or had incorrect
directional semantics, affecting programs like pyserial.&lt;/li&gt;
&lt;/ul&gt;

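&lt;p&gt;Notably, &lt;code class=&quot;highlighter-rouge&quot;&gt;TCFLSH&lt;/code&gt; semantics must be evaluated from the &lt;strong&gt;caller’s perspective&lt;/strong&gt;
rather than by hardcoding internal queue names: the input queue is the data the
calling end has received but not yet read, and the output queue is the data it
has written that has not yet been transmitted. Otherwise, the flush directions
seen by the master and replica are reversed compared to Linux.&lt;/p&gt;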

&lt;h3 id=&quot;case-6-jekyll-test-order-dependency-causes-flaky-failuresa-pure-test-case-issue&quot;&gt;Case 6: Jekyll Test Order Dependency Causes Flaky Failures—A Pure Test Case Issue&lt;/h3&gt;

&lt;p&gt;Sometimes, a test failing under gVisor has nothing to do with the runtime
environment at all.&lt;/p&gt;

&lt;p&gt;During evaluation, a Jekyll test case (&lt;code class=&quot;highlighter-rouge&quot;&gt;jekyll-7637&lt;/code&gt;) failed under &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; but
coincidentally passed under &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt;. After a deep dive, we found that this test
actually had a roughly 33% chance of failing in &lt;em&gt;any&lt;/em&gt; environment.&lt;/p&gt;

&lt;p&gt;The root cause was rather dramatic: the test itself had a bug in which it
passed a configuration value as a Ruby &lt;code class=&quot;highlighter-rouge&quot;&gt;Symbol&lt;/code&gt;, while the underlying
source code compared that value against a &lt;code class=&quot;highlighter-rouge&quot;&gt;String&lt;/code&gt;, so the two never matched. As a result, this test could
&lt;strong&gt;never&lt;/strong&gt; load its required syntax highlighting plugin as intended. So why did
it sometimes pass? Because the testing framework (&lt;code class=&quot;highlighter-rouge&quot;&gt;minitest&lt;/code&gt;) executes tests in
a randomized order. If this buggy test happened to run &lt;strong&gt;after&lt;/strong&gt; another test
that correctly loaded the plugin into memory, it would “freeload” off that
global state and pass. But if the randomized order happened to put this test
first, it would genuinely fail. It just so happened that gVisor hit that 1-in-3
failure chance during our evaluation.&lt;/p&gt;

&lt;p&gt;This perfectly illustrates why we need large-scale A/B testing and deep
analysis: without them, sporadic test flakiness like this can easily be
misdiagnosed as “sandbox instability.”&lt;/p&gt;
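
&lt;p&gt;The failure mode is easy to reproduce in miniature. The sketch below is a toy
model in Python, not the actual Jekyll test suite: the three test functions and
the shared &lt;code class=&quot;highlighter-rouge&quot;&gt;LOADED&lt;/code&gt; set are hypothetical, and the assumption that exactly two
other tests load the plugin is chosen purely so that the arithmetic lands on one
in three.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import itertools
from fractions import Fraction

LOADED = set()  # global state shared by every test in the process


def loader_a():
    LOADED.add('highlighter')
    return True


def loader_b():
    LOADED.add('highlighter')
    return True


def buggy():
    # Never loads the plugin itself; passes only if an earlier test did.
    return 'highlighter' in LOADED


def failure_rate(tests, target):
    # Run every possible ordering and count how often the target fails.
    orders = list(itertools.permutations(tests))
    failures = 0
    for order in orders:
        LOADED.clear()
        results = {test: test() for test in order}
        if not results[target]:
            failures += 1
    return Fraction(failures, len(orders))


print(failure_rate([loader_a, loader_b, buggy], buggy))  # prints 1/3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this model the buggy test fails only in the two orderings (out of six) that
run it first, which is independent of the runtime underneath.&lt;/p&gt;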

&lt;h2 id=&quot;best-practices-suggestions-for-using-gvisor-in-agentic-rl-scenarios&quot;&gt;Best Practices: Suggestions for Using gVisor in Agentic-RL Scenarios&lt;/h2&gt;

&lt;p&gt;If you’re building an Agent execution environment with gVisor, here are some
practical tips.&lt;/p&gt;

&lt;h3 id=&quot;suggestions-for-different-build-systems&quot;&gt;Suggestions for Different Build Systems&lt;/h3&gt;

&lt;!-- mdformat off(no multiline table support in Kramdown) --&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Build System&lt;/th&gt;
      &lt;th&gt;Common Risks&lt;/th&gt;
      &lt;th&gt;Suggestions&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Gradle&lt;/strong&gt;&lt;/td&gt;
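      &lt;td&gt;Concurrency race when &lt;code class=&quot;highlighter-rouge&quot;&gt;clean&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;test&lt;/code&gt; run in one invocation&lt;/td&gt;
      &lt;td&gt;Run &lt;code class=&quot;highlighter-rouge&quot;&gt;clean&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;test&lt;/code&gt; as separate steps&lt;/td&gt;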
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Maven&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Remote dependency download timeout or 403&lt;/td&gt;
      &lt;td&gt;Pre-populate local repo cache, minimize online downloads&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;CMake&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;copy_if_different&lt;/code&gt; race conditions&lt;/td&gt;
      &lt;td&gt;Lower build parallelism; avoid depending on extremely short timing windows&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;sbt / Scala&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Deep stack, slow startup, test flakiness&lt;/td&gt;
      &lt;td&gt;Increase &lt;code class=&quot;highlighter-rouge&quot;&gt;-Xss&lt;/code&gt;, give the first compilation a more generous timeout&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;pip / pytest&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Mismatch between the visible CPU count and the cgroup CPU quota&lt;/td&gt;
      &lt;td&gt;Be aware that &lt;code class=&quot;highlighter-rouge&quot;&gt;os.cpu_count()&lt;/code&gt; may not reflect the actual cgroup quota&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;strong&gt;Cargo / npm / yarn&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;Generally good compatibility&lt;/td&gt;
      &lt;td&gt;Usually do not require special handling&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;!-- mdformat on --&gt;
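
&lt;p&gt;For the pip / pytest row, the mismatch can be made visible with a few lines of
Python. This is a sketch under stated assumptions rather than a recommended
implementation: it assumes a cgroup v2 layout with the quota in
&lt;code class=&quot;highlighter-rouge&quot;&gt;/sys/fs/cgroup/cpu.max&lt;/code&gt;, and the helper names &lt;code class=&quot;highlighter-rouge&quot;&gt;quota_cpus&lt;/code&gt; and
&lt;code class=&quot;highlighter-rouge&quot;&gt;effective_workers&lt;/code&gt; are hypothetical.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import math
import os

CPU_MAX = '/sys/fs/cgroup/cpu.max'  # cgroup v2 location (assumption)


def quota_cpus(cpu_max_text):
    # cpu.max holds 'QUOTA PERIOD'; QUOTA is 'max' when there is no limit.
    quota, period = cpu_max_text.split()
    if quota == 'max':
        return None
    return int(quota) / int(period)


def effective_workers(visible, quota):
    # Cap the worker count at the quota, but never go below one worker.
    if quota is None:
        return visible
    return max(1, min(visible, math.ceil(quota)))


if __name__ == '__main__':
    visible = os.cpu_count() or 1
    try:
        with open(CPU_MAX) as f:
            quota = quota_cpus(f.read())
    except (OSError, ValueError):
        quota = None  # file absent (for example cgroup v1) or unreadable
    print('visible CPUs:', visible)
    print('cgroup quota:', quota)
    print('suggested workers:', effective_workers(visible, quota))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The suggested value can then be passed explicitly to whatever sizes the worker
pool, instead of letting the tool default to the visible CPU count.&lt;/p&gt;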

&lt;h3 id=&quot;debugging-procedure-when-encountering-failures&quot;&gt;Debugging Procedure When Encountering Failures&lt;/h3&gt;

&lt;p&gt;When a test fails, we recommend this debugging flow:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;First reproduce the same command under &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; to confirm whether the failure is
specific to gVisor.&lt;/li&gt;
  &lt;li&gt;If &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; also fails, prioritize investigating test case issues,
environmental differences, or race conditions.&lt;/li&gt;
  &lt;li&gt;If it only fails under gVisor, check for obviously missing syscalls or
procfs/sysfs entries.&lt;/li&gt;
  &lt;li&gt;For issues with no obvious missing features, compare logs, strace, and
runtime behavior to distinguish between semantic inconsistencies, amplified
race conditions, or environmental configuration differences.&lt;/li&gt;
  &lt;li&gt;Only after confirming it is a gVisor semantic issue, proceed to locate the
code path, create a minimal reproduction, and add regression tests.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note: Many perceived “gVisor compatibility issues” are ultimately reclassified
as test case issues while working through this procedure.&lt;/p&gt;
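
&lt;p&gt;The first three steps of this flow are mechanical enough to script. The
following Python sketch assumes Docker with both &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; registered as
runtimes; &lt;code class=&quot;highlighter-rouge&quot;&gt;run_under&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;classify&lt;/code&gt; are hypothetical helpers, and the image and
command in the trailing comment are placeholders.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import subprocess


def run_under(runtime, image, command):
    # Run the command in a fresh container and return its exit code.
    proc = subprocess.run(
        ['docker', 'run', '--rm', '--runtime=' + runtime, image,
         'sh', '-c', command],
        capture_output=True, text=True)
    return proc.returncode


def classify(runc_code, runsc_code):
    # Steps 1 to 3: decide where to look before reading any gVisor code.
    if runc_code != 0:
        return 'fails under runc too: check the test, environment, or races'
    if runsc_code != 0:
        return 'gVisor-specific: look for missing syscalls, procfs, or sysfs'
    return 'passes under both runtimes'


# Example (placeholder image and command):
#   classify(run_under('runc', 'ubuntu:latest', 'make test'),
#            run_under('runsc', 'ubuntu:latest', 'make test'))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because some failures are flaky, a single run per runtime is only a first
signal; repeating the comparison several times before classifying is safer.&lt;/p&gt;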

&lt;h2 id=&quot;ai-driven-compatibility-analysis-why-this-path-is-feasible&quot;&gt;AI-Driven Compatibility Analysis: Why This Path Is Feasible&lt;/h2&gt;

&lt;p&gt;Large-scale compatibility analysis is well suited to AI assistance because it
involves a large amount of repetitive, context-heavy work:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Reading project source code and build scripts&lt;/li&gt;
  &lt;li&gt;Comparing behavioral differences between two runtimes&lt;/li&gt;
  &lt;li&gt;Comparing syscall, procfs, sysfs, PTY, network, and VFS semantics&lt;/li&gt;
  &lt;li&gt;Turning conclusions into executable patches, PRs, or workaround suggestions&lt;/li&gt;
  &lt;li&gt;Running regression validation and re-investigating the issue when validation
fails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Manual analysis does not scale, while hardcoded rules often break down on
complex cases. AI agents fit naturally in the middle: they can take on most of
the “read logs → categorize → locate → report” work, while human engineers still
review the proposed approach and code.&lt;/p&gt;

&lt;p&gt;The real value here is not just saving time; it is making our conclusions
&lt;strong&gt;scalable, traceable, and continuously improvable&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Every case has standardized analysis artifacts rather than scattered chat
logs.&lt;/li&gt;
  &lt;li&gt;Every fix can be validated again against the original real-world test case.&lt;/li&gt;
  &lt;li&gt;Every case that is “not a gVisor issue” can still be turned into a concrete
workaround playbook.&lt;/li&gt;
  &lt;li&gt;As new datasets, images, or build systems arrive, the same analysis
framework can be reused.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Through this method, we already have more than ten fixes merged into the gVisor
mainline, covering multiple areas such as file systems, networking, proc/sysfs,
PTY, and system call semantics. Some representative PRs are listed below:&lt;/p&gt;

&lt;!-- mdformat off(no multiline table support in Kramdown) --&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;PR&lt;/th&gt;
      &lt;th&gt;Fix Content&lt;/th&gt;
      &lt;th&gt;Typical Agentic-RL Scenario&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/google/gvisor/pull/12851&quot;&gt;#12851&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;poll: Only write back &lt;code class=&quot;highlighter-rouge&quot;&gt;revents&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;tmux, libevent poll backend&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/google/gvisor/pull/12911&quot;&gt;#12911&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;proc: Add &lt;code class=&quot;highlighter-rouge&quot;&gt;/proc/sys/fs/pipe-max-size&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Python libraries like wurlitzer&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/google/gvisor/pull/12915&quot;&gt;#12915&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;pty: Implement &lt;code class=&quot;highlighter-rouge&quot;&gt;TCSBRK&lt;/code&gt; / &lt;code class=&quot;highlighter-rouge&quot;&gt;TCFLSH&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;pyserial, interactive PTY programs&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/google/gvisor/pull/12814&quot;&gt;#12814&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;proc: Add &lt;code class=&quot;highlighter-rouge&quot;&gt;randomize_va_space&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;Performance and security inspection tools&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/google/gvisor/pull/12813&quot;&gt;#12813&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;sysfs: Add &lt;code class=&quot;highlighter-rouge&quot;&gt;/sys/dev/block&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;/sys/dev/char&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;lsblk, device-related tools&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/google/gvisor/pull/12819&quot;&gt;#12819&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;proc: Fill in &lt;code class=&quot;highlighter-rouge&quot;&gt;fdinfo&lt;/code&gt; fields&lt;/td&gt;
      &lt;td&gt;lsof, fuser, diagnostic tools&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/google/gvisor/pull/12786&quot;&gt;#12786&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;devpts: Fix &lt;code class=&quot;highlighter-rouge&quot;&gt;ISIG&lt;/code&gt; check&lt;/td&gt;
      &lt;td&gt;Interactive shells / terminal-based agents&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;a href=&quot;https://github.com/google/gvisor/pull/12853&quot;&gt;#12853&lt;/a&gt;&lt;/td&gt;
      &lt;td&gt;vfs: &lt;code class=&quot;highlighter-rouge&quot;&gt;FICLONE*&lt;/code&gt; returns &lt;code class=&quot;highlighter-rouge&quot;&gt;EOPNOTSUPP&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;file copying tools&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;!-- mdformat on --&gt;

&lt;p&gt;In this sense, Agentic-RL is not just a new use case for gVisor; it has also
pushed our compatibility engineering toward a more AI-driven workflow.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Agentic-RL is both a proving ground for gVisor and, in practice, a &lt;strong&gt;large-scale
regression suite&lt;/strong&gt;: it continuously drives real-world projects through the
sandbox and exposes compatibility boundaries that standard unit tests struggle
to cover. By bringing AI agents into this verification loop, we can evaluate
gVisor’s production readiness with data rather than intuition.&lt;/p&gt;

&lt;p&gt;Our conclusions are simple:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;gVisor’s compatibility has proven to be production-ready.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Most “compatibility issues” should not actually be attributed to gVisor.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Real-world workloads are better than handpicked tests at revealing
critical problems.&lt;/strong&gt;&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;AI-driven compatibility analysis is practical.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As AI agents take on heavier tasks, the code-execution sandbox will become an
indispensable security foundation. We will continue refining this AI-driven
verification system, applying it to new datasets and language stacks, and
upstreaming our findings to the gVisor community. For Agentic-RL, a good sandbox
is not just secure—it also needs to be &lt;strong&gt;highly compatible, debuggable, and able
to evolve alongside real-world workloads.&lt;/strong&gt;&lt;/p&gt;</content><author><name>yifengtan</name></author><summary type="html">This article was contributed by Tencent. Yifeng Tan, Hua Liu, and Hui Chen are engineers at Tencent, responsible for the internal container infrastructure. As LLMs evolve from chat interfaces to autonomous agents, building a robust and secure isolation environment becomes a necessity. We chose gVisor as the default sandbox for our Agentic-RL scenarios. Today, we run millions of gVisor sandboxes daily for Agentic-RL training in production, and that scale continues to grow. After more than 74,000 side-by-side comparisons between runsc (gVisor) and runc (unsandboxed/Linux), combined with targeted fixes driven by real-world workloads, we have essentially closed the execution correctness gap with runc, fully meeting our production-grade business requirements. During this process, we successfully investigated and resolved gVisor compatibility issues that accounted for approximately 1.7% of all test cases. This post focuses on CPU-centric code execution and testing workloads. We will discuss gVisor compatibility verification and highlight representative issues, skipping implementation details like GPU support, image distribution, or cluster scheduling. We aim to answer three questions: Why choose gVisor? Why doesn’t manual compatibility verification scale? 
How can AI agents analyze compatibility issues, what do typical failures look like, and what best practices have we established?</summary></entry><entry><title type="html">Multi-Agent gVisor Isolation (MAGI)</title><link href="/blog/2026/04/15/magi-multi-agent-gvisor-isolation/" rel="alternate" type="text/html" title=" Multi-Agent gVisor Isolation (MAGI)" /><published>2026-04-15T00:00:00-05:00</published><updated>2026-04-15T00:00:00-05:00</updated><id>/blog/2026/04/15/magi-multi-agent-gvisor-isolation</id><content type="html" xml:base="/blog/2026/04/15/magi-multi-agent-gvisor-isolation/">&lt;figure class=&quot;img-100pct&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/magi.png&quot; alt=&quot;Diagram showing the MAGI system: three agents running in gVisor, along with a lot of side-services in gVisor-sandboxed containers. Evangelion style.&quot; /&gt;
&lt;figcaption&gt;Get in the sandbox, Agents.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;&lt;strong&gt;Does gVisor work with OpenClaw?&lt;/strong&gt; This question has been asked a lot, so let’s
answer it here and now: &lt;strong&gt;Yes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this post, we will set up a triple-agent system combining
&lt;strong&gt;&lt;a href=&quot;https://openclaw.ai/&quot;&gt;OpenClaw&lt;/a&gt;&lt;/strong&gt;,
&lt;strong&gt;&lt;a href=&quot;https://github.com/sipeed/picoclaw&quot;&gt;PicoClaw&lt;/a&gt;&lt;/strong&gt;, and
&lt;strong&gt;&lt;a href=&quot;https://hermes-agent.nousresearch.com/&quot;&gt;Hermes Agent&lt;/a&gt;&lt;/strong&gt;, each in separate
gVisor sandboxes, all with local inference powered by
&lt;strong&gt;&lt;a href=&quot;https://ollama.com/&quot;&gt;Ollama&lt;/a&gt;&lt;/strong&gt; in a gVisor sandbox using three different
models, convening together in a self-hosted &lt;strong&gt;&lt;a href=&quot;https://matrix.org&quot;&gt;Matrix.org&lt;/a&gt;&lt;/strong&gt;
server (naturally, also running in a gVisor sandbox). Each agent will be given
its own set of capabilities, each of which will be sandboxed. At the end of the
day, you will have a fully self-sovereign triple-agent system that can answer
queries, browse the web, and cogitate with itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this particular setup make practical sense?&lt;/strong&gt; &lt;em&gt;No, but it is cool&lt;/em&gt;. More
importantly, it demonstrates the versatility of gVisor at sandboxing basically
any component that an agentic system may need. gVisor’s compatibility has grown
significantly over the last few years, and agent harnesses fit well within what
gVisor is capable of.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;Let’s go.&lt;/p&gt;

&lt;!--* pragma: { seclinter_this_is_fine: true } *--&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;basic-machine-setup-dockergvisornvidia-drivers&quot;&gt;Basic machine setup: Docker/gVisor/NVIDIA drivers&lt;/h3&gt;

    &lt;p&gt;We will use a &lt;code class=&quot;highlighter-rouge&quot;&gt;g2-standard-96&lt;/code&gt; GCE VM running stock Ubuntu for this, but any
Linux machine with similar GPUs would work. This section describes its basic
setup.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Getting a GCE VM:&lt;/p&gt;

  &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcloud compute instances create magi &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--project&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;eperot-gke-dev &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--zone&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;europe-west1-c &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--machine-type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;g2-standard-96 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--maintenance-policy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;TERMINATE &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--accelerator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8,type&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nvidia-l4 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--create-disk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;auto-delete&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;yes&lt;/span&gt;,boot&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;yes&lt;/span&gt;,device-name&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;magi,image&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;projects/ubuntu-os-cloud/global/images/ubuntu-2404-noble-amd64-v20260316,mode&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;rw,size&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2048,type&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;pd-ssd
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;We will be using the following ports:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;8008&lt;/code&gt;: Matrix.org server (Synapse)&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;8084&lt;/code&gt;: Cinny web UI (Matrix.org client)&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;11434&lt;/code&gt;: Ollama (inference API server)&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;18789&lt;/code&gt;: OpenClaw gateway web UI&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;18790&lt;/code&gt;: PicoClaw gateway&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;3002&lt;/code&gt;: Self-hosted Firecrawl&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;If SSHing into a VM, you can forward some of them for convenient access:&lt;/p&gt;

  &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;-L 8008:127.0.0.1:8008 -L 8084:127.0.0.1:8084 -L 11434:127.0.0.1:11434 -L 18789:127.0.0.1:18789
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Setting up the GCE VM (once SSH’d as &lt;code class=&quot;highlighter-rouge&quot;&gt;root&lt;/code&gt;):&lt;/p&gt;

  &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Basics&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; upgrade

&lt;span class=&quot;c&quot;&gt;# NVIDIA driver&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;DRIVER_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;590.48.01&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; build-essential linux-headers-&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;uname&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  curl &lt;span class=&quot;nt&quot;&gt;-fSsl&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-O&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;https://us.download.nvidia.com/tesla/&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/NVIDIA-Linux-x86_64-&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;.run&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;sh NVIDIA-Linux-x86_64-&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;.run &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;rm &lt;/span&gt;NVIDIA-Linux-x86_64-&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;.run

&lt;span class=&quot;c&quot;&gt;# Docker&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; ca-certificates curl &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; 0755 &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; /etc/apt/keyrings &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://download.docker.com/linux/ubuntu/gpg &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /etc/apt/keyrings/docker.asc &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo chmod &lt;/span&gt;a+r /etc/apt/keyrings/docker.asc
&lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/docker.sources &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; /etc/os-release &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;UBUNTU_CODENAME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;:-&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$VERSION_CODENAME&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF
&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

&lt;span class=&quot;c&quot;&gt;# NVIDIA container toolkit&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--no-install-recommends&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  ca-certificates &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  curl &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  gnupg2 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://nvidia.github.io/libnvidia-container/gpgkey | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  curl &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt; https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'&lt;/span&gt; | &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/nvidia-container-toolkit.list &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.19.0-1 &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      nvidia-container-toolkit&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      nvidia-container-toolkit-base&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      libnvidia-container-tools&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      libnvidia-container1&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;NVIDIA_CONTAINER_TOOLKIT_VERSION&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# gVisor&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    apt-transport-https &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    ca-certificates &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    curl &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    gnupg &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://gvisor.dev/archive.key | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /usr/share/keyrings/gvisor-archive-keyring.gpg &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;deb [arch=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;dpkg &lt;span class=&quot;nt&quot;&gt;--print-architecture&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/gvisor.list &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /dev/null &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; runsc &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;runsc &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--nvproxy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--nvproxy-allowed-driver-capabilities&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all &lt;span class=&quot;nt&quot;&gt;--net-raw&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--allow-packet-socket-write&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--host-uds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all &lt;span class=&quot;nt&quot;&gt;--debug-log&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/tmp/runsc/ &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Verify that everything works:&lt;/p&gt;

  &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;nvidia-smi
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; ubuntu:latest sh &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'ls -al /dev/nvidia*'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

&lt;/details&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h2 id=&quot;self-hosted-matrixorg-server--cinny-web-frontend-setup&quot;&gt;Self-hosted Matrix.org server + Cinny web frontend setup&lt;/h2&gt;

  &lt;div class=&quot;sticky-section-body&quot;&gt;

    &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/synapse.blink.gif&quot; alt=&quot;Diagram showing the MAGI system with the 'Synapse' and 'Cinny' containers blinking.&quot; /&gt;
&lt;figcaption&gt;Setting up Synapse and Cinny.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;div class=&quot;section-content&quot;&gt;

      &lt;p&gt;Let’s set up the &lt;strong&gt;Matrix.org server&lt;/strong&gt; for communication, and the &lt;strong&gt;Cinny&lt;/strong&gt; web
client that we humans can use to communicate with it.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Generate homeserver.yaml&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--mount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;volume,src&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;synapse-data,dst&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/data &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;SYNAPSE_SERVER_NAME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;magi &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;SYNAPSE_REPORT_STATS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;no &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    matrixdotorg/synapse:latest generate

&lt;span class=&quot;c&quot;&gt;# Run server&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;synapse &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--mount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;volume,src&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;synapse-data,dst&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/data &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; 8008:8008 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    matrixdotorg/synapse:latest

&lt;span class=&quot;c&quot;&gt;# Create admin user&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; synapse register_new_matrix_user &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; /data/homeserver.yaml &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--user&lt;/span&gt; gendo &lt;span class=&quot;nt&quot;&gt;--password&lt;/span&gt; yui &lt;span class=&quot;nt&quot;&gt;--admin&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Run cinny (Matrix client)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;cinny &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;synapse:synapse &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; 8084:80 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    ghcr.io/cinnyapp/cinny:latest

&lt;span class=&quot;c&quot;&gt;# Access Cinny web UI at http://localhost:8084&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Log in as:&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#   Homeserver: http://127.0.0.1:8008&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#   Username: gendo&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;#   Password: yui&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

    &lt;/div&gt;

  &lt;/div&gt;

&lt;/section&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h2 id=&quot;self-hosted-inference-server-ollama&quot;&gt;Self-hosted inference server: Ollama&lt;/h2&gt;

  &lt;div class=&quot;sticky-section-body&quot;&gt;

    &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/ollama.blink.gif&quot; alt=&quot;Diagram showing the MAGI system with the 'Ollama' and 'NVIDIA GPU' boxes blinking.&quot; /&gt;
&lt;figcaption&gt;Setting up Ollama for GPU inference.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;div class=&quot;section-content&quot;&gt;

      &lt;p&gt;Next, let’s set up &lt;strong&gt;Ollama&lt;/strong&gt;, the GPU-enabled inference server and the brain of it all.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ollama &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--mount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;volume,src&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ollama-data,dst&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/root &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; 11434:11434 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    ollama/ollama:0.20.0

&lt;span class=&quot;c&quot;&gt;# Pull and load some models.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; ollama sh &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'ollama pull qwen3.5:27b-q4_K_M   &amp;amp;&amp;amp; ollama run --keepalive=9001h qwen3.5:27b-q4_K_M     Say hello.'&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; ollama sh &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'ollama pull glm-4.7-flash:q4_K_M &amp;amp;&amp;amp; ollama run --keepalive=9001h glm-4.7-flash:q4_K_M   Say hello.'&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; ollama sh &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'ollama pull gemma4:26b-a4b-it-q8_0 &amp;amp;&amp;amp; ollama run --keepalive=9001h gemma4:26b-a4b-it-q8_0 Say hello.'&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; ollama sh &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'ollama pull nomic-embed-text:137m-v1.5-fp16 &amp;amp;&amp;amp; ollama run --keepalive=9001h nomic-embed-text:137m-v1.5-fp16 &quot;&quot;'&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Make sure they all fit together in VRAM, otherwise you'll get bad performance.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; ollama ollama ps
NAME                      ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:26b-a4b-it-q8_0    6bfaf9a8cb37    89 GB    100% GPU     262144     12 months from now
glm-4.7-flash:q4_K_M      d1a8a26252f1    40 GB    100% GPU     202752     12 months from now
qwen3.5:27b-q4_K_M        7653528ba5cb    44 GB    100% GPU     262144     12 months from now
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;
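
      &lt;p&gt;As a rough sanity check for the VRAM fit, the sizes reported by &lt;code class=&quot;highlighter-rouge&quot;&gt;ollama ps&lt;/code&gt; can be added up. The helper below is a minimal sketch, not part of Ollama: &lt;code class=&quot;highlighter-rouge&quot;&gt;sum_model_gb&lt;/code&gt; is a hypothetical name, and it assumes the column layout shown above, with sizes reported in GB.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper: read 'ollama ps' output on stdin and print the total
# of the SIZE column. Only rows whose unit field is exactly GB are counted,
# which also skips the header row.
sum_model_gb() {
  awk '$4 ~ /^GB$/ { total += $3 } END { print total + 0 }'
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;Running &lt;code class=&quot;highlighter-rouge&quot;&gt;docker exec ollama ollama ps | sum_model_gb&lt;/code&gt; prints the combined size in GB; for the listing above that is 173. Compare it with the per-GPU totals from &lt;code class=&quot;highlighter-rouge&quot;&gt;nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits&lt;/code&gt;, which are reported in MiB. Models small enough to be listed in MB are ignored by this sketch.&lt;/p&gt;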

    &lt;/div&gt;

  &lt;/div&gt;

&lt;/section&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h2 id=&quot;containerized-openclaw-setup-with-browser-use&quot;&gt;Containerized OpenClaw setup with Browser Use&lt;/h2&gt;

  &lt;div class=&quot;sticky-section-body&quot;&gt;

    &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/openclaw.blink.gif&quot; alt=&quot;Diagram showing the MAGI system with the 'OpenClaw' and 'Chrome' containers blinking.&quot; /&gt;
&lt;figcaption&gt;Setting up OpenClaw and Chrome browser.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;div class=&quot;section-content&quot;&gt;

      &lt;p&gt;Now let’s set up &lt;strong&gt;OpenClaw&lt;/strong&gt; and hook it up to a web browser for fully-local
Browser Use.&lt;/p&gt;

      &lt;p&gt;We will use the official &lt;code class=&quot;highlighter-rouge&quot;&gt;ghcr.io/openclaw/openclaw&lt;/code&gt; OpenClaw container image,
but we will extend it to install Google Chrome, as
&lt;a href=&quot;https://docs.openclaw.ai/tools/browser-linux-troubleshooting#solution-1-install-google-chrome-recommended&quot;&gt;recommended in the OpenClaw docs&lt;/a&gt;.
This will allow the agent to use a web browser, all running in gVisor.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;MELCHIOR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/agents/melchior-1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MELCHIOR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt; &amp;gt; &quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MELCHIOR&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;/Dockerfile&quot;
FROM ghcr.io/openclaw/openclaw:2026.4.2

USER 0:0
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    apt install -y wget chromium libvulkan1 &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    dpkg -i google-chrome-stable_current_amd64.deb &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    rm google-chrome-stable_current_amd64.deb &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    apt --fix-broken install -y
&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF

&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker build &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; openclaw:melchior-1 &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MELCHIOR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;Note that the resulting image runs as root. This is not a security risk; “root”
in a gVisor sandbox does not imply root-level access on the host.&lt;/p&gt;

      &lt;p&gt;Let’s create a Matrix account for it and seed its configuration:&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MELCHIOR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/config&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MELCHIOR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/home&quot;&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; synapse register_new_matrix_user &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; /data/homeserver.yaml &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--user&lt;/span&gt; melchior &lt;span class=&quot;nt&quot;&gt;--password&lt;/span&gt; akagi &lt;span class=&quot;nt&quot;&gt;--no-admin&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt; &amp;gt; &quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MELCHIOR&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;/config/openclaw.json&quot;
{
  &quot;auth&quot;: {
    &quot;profiles&quot;: {
      &quot;ollama:default&quot;: {
        &quot;provider&quot;: &quot;ollama&quot;,
        &quot;mode&quot;: &quot;api_key&quot;
      }
    }
  },
  &quot;agents&quot;: {
    &quot;defaults&quot;: {
      &quot;models&quot;: {
        &quot;ollama/gemma4:26b-a4b-it-q8_0&quot;: {}
      }
    }
  },
  &quot;models&quot;: {
    &quot;mode&quot;: &quot;merge&quot;,
    &quot;providers&quot;: {
      &quot;ollama&quot;: {
        &quot;baseUrl&quot;: &quot;http://ollama:11434&quot;,
        &quot;api&quot;: &quot;ollama&quot;,
        &quot;apiKey&quot;: &quot;OLLAMA_API_KEY&quot;,
        &quot;models&quot;: [
          {
            &quot;id&quot;: &quot;gemma4:26b-a4b-it-q8_0&quot;,
            &quot;name&quot;: &quot;gemma4:26b-a4b-it-q8_0&quot;,
            &quot;reasoning&quot;: true,
            &quot;input&quot;: [
              &quot;text&quot;
            ],
            &quot;cost&quot;: {
              &quot;input&quot;: 0,
              &quot;output&quot;: 0,
              &quot;cacheRead&quot;: 0,
              &quot;cacheWrite&quot;: 0
            },
            &quot;contextWindow&quot;: 262144,
            &quot;maxTokens&quot;: 8192
          }
        ]
      }
    }
  },
  &quot;channels&quot;: {
    &quot;matrix&quot;: {
      &quot;enabled&quot;: true,
      &quot;homeserver&quot;: &quot;http://synapse:8008&quot;,
      &quot;userId&quot;: &quot;@melchior:magi&quot;,
      &quot;password&quot;: &quot;akagi&quot;,
      &quot;deviceName&quot;: &quot;Melchior&quot;,
      &quot;allowPrivateNetwork&quot;: true,
      &quot;encryption&quot;: false,
      &quot;groupPolicy&quot;: &quot;open&quot;,
      &quot;autoJoin&quot;: &quot;always&quot;,
      &quot;dm&quot;: {
        &quot;policy&quot;: &quot;open&quot;,
        &quot;allowFrom&quot;: [
          &quot;*&quot;
        ]
      }
    }
  },
  &quot;gateway&quot;: {
    &quot;mode&quot;: &quot;local&quot;,
    &quot;controlUi&quot;: {
      &quot;dangerouslyDisableDeviceAuth&quot;: true,
      &quot;dangerouslyAllowHostHeaderOriginFallback&quot;: true
    }
  },
  &quot;skills&quot;: {
    &quot;install&quot;: {
      &quot;nodeManager&quot;: &quot;npm&quot;
    }
  },
  &quot;browser&quot;: {
    &quot;enabled&quot;: true,
    &quot;executablePath&quot;: &quot;/usr/bin/google-chrome-stable&quot;,
    &quot;headless&quot;: true,
    &quot;noSandbox&quot;: true
  },
  &quot;tools&quot;: {
    &quot;web&quot;: {
      &quot;search&quot;: {
        &quot;enabled&quot;: true,
        &quot;provider&quot;: &quot;duckduckgo&quot;
      },
      &quot;fetch&quot;: {
        &quot;enabled&quot;: true
      }
    }
  },
  &quot;plugins&quot;: {
    &quot;entries&quot;: {
      &quot;matrix&quot;: {
        &quot;enabled&quot;: true
      },
      &quot;browser&quot;: {
        &quot;enabled&quot;: true
      }
    }
  }
}
&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;Note: to keep the demo setup simple, the above configuration disables gateway
device authentication, lets the bot auto-join every Matrix room it is invited
to, and accepts direct messages from anyone. Do not use these settings in real
deployments.&lt;/p&gt;
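
      &lt;p&gt;Before reusing a configuration like this outside a demo, it can help to list the options that are explicitly marked as dangerous. The following is a minimal sketch that assumes such options keep the &lt;code class=&quot;highlighter-rouge&quot;&gt;dangerously&lt;/code&gt; name prefix seen in the configuration above; &lt;code class=&quot;highlighter-rouge&quot;&gt;list_dangerous_keys&lt;/code&gt; is a hypothetical helper name, not an OpenClaw command.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper: read a JSON config on stdin and print every distinct
# word that starts with 'dangerously', one per line. This is a plain text
# match, so it reports key names only and does not look at their values.
list_dangerous_keys() {
  grep -o 'dangerously[A-Za-z]*' | sort -u
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;For example, &lt;code class=&quot;highlighter-rouge&quot;&gt;list_dangerous_keys &amp;lt; &quot;$MELCHIOR/config/openclaw.json&quot;&lt;/code&gt; lists &lt;code class=&quot;highlighter-rouge&quot;&gt;dangerouslyAllowHostHeaderOriginFallback&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;dangerouslyDisableDeviceAuth&lt;/code&gt; for the configuration above. It does not flag other permissive settings, such as the open direct-message policy, so those still need a manual review.&lt;/p&gt;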

      &lt;p&gt;Let’s run it!&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;MELCHIOR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/agents/melchior-1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; docker run &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;melchior &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;OPENCLAW_GATEWAY_TOKEN&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dummy-token-for-sandbox&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;OPENCLAW_CONFIG_PATH&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/etc/openclaw/openclaw.json&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; 18789:18789 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/home/node &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;synapse:synapse &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ollama:ollama &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MELCHIOR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/home&quot;&lt;/span&gt;:/home/node/.openclaw &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$MELCHIOR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/config&quot;&lt;/span&gt;:/etc/openclaw &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    openclaw:melchior-1 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    node &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
        dist/index.js &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
        gateway &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
           &lt;span class=&quot;nt&quot;&gt;--bind&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;lan &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
           &lt;span class=&quot;nt&quot;&gt;--port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;18789 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
           &lt;span class=&quot;nt&quot;&gt;--allow-unconfigured&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
           &lt;span class=&quot;nt&quot;&gt;--verbose&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;Run &lt;code class=&quot;highlighter-rouge&quot;&gt;docker exec -it melchior openclaw configure&lt;/code&gt; for further interactive
configuration.&lt;/p&gt;

    &lt;/div&gt;

  &lt;/div&gt;

&lt;/section&gt;

&lt;p&gt;You can now go to &lt;code class=&quot;highlighter-rouge&quot;&gt;http://127.0.0.1:18789/?token=dummy-token-for-sandbox&lt;/code&gt; and
talk to your OpenClaw instance!&lt;/p&gt;

&lt;figure class=&quot;img-100pct&quot;&gt;
&lt;div class=&quot;double-border-glow&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/openclaw-ui.png&quot; alt=&quot;The OpenClaw gateway web UI displaying a chat with the dmesg output, confirming that it is running in gVisor.&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt;OpenClaw web UI running in gVisor. The &lt;code&gt;dmesg&lt;/code&gt; output is characteristic of gVisor.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;browser-use&quot;&gt;Browser Use&lt;/h3&gt;

&lt;p&gt;The image we built earlier from the &lt;code class=&quot;highlighter-rouge&quot;&gt;Dockerfile&lt;/code&gt; includes the Google Chrome web browser, which
&lt;a href=&quot;&quot;&gt;OpenClaw knows how to use&lt;/a&gt;. You can ask it to open websites and take
screenshots. Here is the gVisor website rendered in Chrome-in-gVisor by
OpenClaw:&lt;/p&gt;

&lt;figure&gt;
&lt;div class=&quot;double-border-glow&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/gvisor-website.png&quot; alt=&quot;gVisor website rendered by Chrome running in gVisor.&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt;gVisor website rendered by Chrome in gVisor, orchestrated by OpenClaw.&lt;br /&gt;&lt;em&gt;Funnily enough, the OpenClaw web interface didn't provide the means for OpenClaw to display this image directly.&lt;/em&gt;&lt;br /&gt;&lt;em&gt;OpenClaw autonomously solved this problem by uploading this picture to a temporary image hosting service and responding with the uploaded image URL.&lt;/em&gt;&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Now let’s bring the other two brains to life.&lt;/p&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h2 id=&quot;containerized-picoclaw-with-web-and-github-skills&quot;&gt;Containerized PicoClaw with web and GitHub skills&lt;/h2&gt;

  &lt;div class=&quot;sticky-section-body&quot;&gt;

    &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/picoclaw.blink.gif&quot; alt=&quot;Diagram showing the MAGI system with the 'PicoClaw' container blinking.&quot; /&gt;
&lt;figcaption&gt;Setting up PicoClaw.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;div class=&quot;section-content&quot;&gt;

      &lt;p&gt;Moving on to PicoClaw, the minimal agent.&lt;/p&gt;

      &lt;p&gt;We will use the
&lt;a href=&quot;https://hub.docker.com/r/sipeed/picoclaw&quot;&gt;PicoClaw Docker image&lt;/a&gt;, and enable a
few skills for GitHub interaction with the
&lt;a href=&quot;https://github.com/google/gvisor&quot;&gt;gVisor repository&lt;/a&gt;.&lt;/p&gt;

      &lt;p&gt;Note that while this demo ran on an x86-64 VM, PicoClaw has also been confirmed
to work in &lt;strong&gt;gVisor on arm64 on a Raspberry Pi 4 Model B&lt;/strong&gt;.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;BALTHASAR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/agents/balthasar-2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$BALTHASAR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/picoclaw&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; synapse register_new_matrix_user &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; /data/homeserver.yaml &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--user&lt;/span&gt; balthasar &lt;span class=&quot;nt&quot;&gt;--password&lt;/span&gt; ritsuko &lt;span class=&quot;nt&quot;&gt;--no-admin&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ matrix_token&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-X&lt;/span&gt; POST &lt;span class=&quot;nt&quot;&gt;-H&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Content-Type: application/json&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;s2&quot;&gt;&quot;http://127.0.0.1:8008/_matrix/client/v3/login&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;s1&quot;&gt;'{&quot;type&quot;: &quot;m.login.password&quot;, &quot;user&quot;: &quot;balthasar&quot;, &quot;password&quot;: &quot;ritsuko&quot;}'&lt;/span&gt; | &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    jq &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; .access_token&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt; &amp;gt; &quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$BALTHASAR&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;/picoclaw/config.json&quot;
{
  &quot;model_list&quot;: [
    {
      &quot;model_name&quot;: &quot;glm-4.7-flash&quot;,
      &quot;model&quot;: &quot;ollama/glm-4.7-flash:q4_K_M&quot;,
      &quot;api_base&quot;: &quot;http://ollama:11434/v1&quot;
    }
  ],
  &quot;agents&quot;: {
    &quot;defaults&quot;: {
      &quot;model_name&quot;: &quot;glm-4.7-flash&quot;
    }
  },
  &quot;gateway&quot;: {
    &quot;host&quot;: &quot;0.0.0.0&quot;,
    &quot;port&quot;: 18790
  },
  &quot;channels&quot;: {
    &quot;matrix&quot;: {
      &quot;enabled&quot;: true,
      &quot;homeserver&quot;: &quot;http://synapse:8008&quot;,
      &quot;user_id&quot;: &quot;@balthasar:magi&quot;,
      &quot;access_token&quot;: &quot;&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;matrix_token&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;,
      &quot;join_on_invite&quot;: true,
      &quot;allow_from&quot;: []
    }
  }
}
&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF
&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;balthasar &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$BALTHASAR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/picoclaw:/root/.picoclaw&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;synapse:synapse &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ollama:ollama &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--entrypoint&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/usr/local/bin/picoclaw &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    sipeed/picoclaw:latest gateway
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;PicoClaw should start, although it does not have a lot of functionality out of
the box. Let’s enable some skills:&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cp&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$BALTHASAR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/picoclaw/config.json&quot;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$BALTHASAR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/picoclaw/config.json.bak&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
  jq &lt;span class=&quot;s1&quot;&gt;'.tools.web.enabled = true |
      .tools.web.prefer_native = true |
      .tools.exec.enabled = true |
      .tools.exec.allow_remote = true |
      .tools.skills.enabled = true |
      .tools.skills.github = {
        &quot;enabled&quot;: true,
        &quot;token&quot;: &quot;YOUR_GITHUB_TOKEN_HERE&quot;,
        &quot;timeout&quot;: 30,
        &quot;max_results&quot;: 5
      } |
      .tools.skills.max_concurrent_searches = 5 |
      .tools.skills.search_cache = {
        &quot;max_size&quot;: 100,
        &quot;ttl_seconds&quot;: 300
      } |
      .tools.web_fetch.enabled = true'&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      &amp;lt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$BALTHASAR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/picoclaw/config.json.bak&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$BALTHASAR&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/picoclaw/config.json&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Restart PicoClaw to apply config changes.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker restart balthasar

&lt;span class=&quot;c&quot;&gt;# You can re-attach to an interactive CLI for PicoClaw with:&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; balthasar picoclaw agent
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;
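
      &lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;jq&lt;/code&gt; filter above writes the literal &lt;code class=&quot;highlighter-rouge&quot;&gt;YOUR_GITHUB_TOKEN_HERE&lt;/code&gt; placeholder into the
config. As a minimal sketch, a guard like the following could catch a forgotten
placeholder before restarting. The &lt;code class=&quot;highlighter-rouge&quot;&gt;has_placeholder_token&lt;/code&gt; helper is a hypothetical
name used for illustration and is not part of PicoClaw:&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper (not part of PicoClaw): succeeds if the given config
# file still contains the tutorial's placeholder GitHub token.
has_placeholder_token() {
    grep -q 'YOUR_GITHUB_TOKEN_HERE' "$1"
}

# Illustrative usage; assumes BALTHASAR is set as in the earlier steps.
config="${BALTHASAR:-}/picoclaw/config.json"
if [ -f "$config" ]; then
    if has_placeholder_token "$config"; then
        echo "Replace YOUR_GITHUB_TOKEN_HERE in $config before restarting."
    fi
fi
```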

      &lt;p&gt;Now we can ask it to interact with GitHub.&lt;/p&gt;

      &lt;figure class=&quot;img-100pct&quot;&gt;
&lt;div class=&quot;double-border-glow&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/picoclaw-1.png&quot; alt=&quot;PicoClaw starting up and being tasked with looking up the top trending GitHub repositories that day.&quot; /&gt;
&lt;/div&gt;

&lt;figcaption&gt;PicoClaw being tasked with looking up the current trending GitHub
repositories.&lt;/figcaption&gt; &lt;/figure&gt;

      &lt;p&gt;Funnily enough, the top GitHub repository today is Hermes Agent, which we will
install next. For now, let’s review a small gVisor PR:&lt;/p&gt;

      &lt;figure class=&quot;img-100pct&quot;&gt;
&lt;div class=&quot;double-border-glow&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/picoclaw-2.png&quot; alt=&quot;PicoClaw being tasked with explaining and reviewing a gVisor pull request.&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt;PicoClaw being tasked with explaining and reviewing &lt;a href=&quot;https://github.com/google/gvisor/pull/12911&quot;&gt;gVisor pull request #12911&lt;/a&gt;,&lt;br /&gt;which was later reviewed by a human as well.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;/div&gt;

  &lt;/div&gt;

&lt;/section&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h2 id=&quot;modularized--sandboxed-hermes-agent-setup&quot;&gt;Modularized &amp;amp; sandboxed Hermes Agent setup&lt;/h2&gt;

  &lt;div class=&quot;sticky-section-body&quot;&gt;

    &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/hermes-agent.blink.gif&quot; alt=&quot;Diagram showing the MAGI system with the 'Hermes Agent' container blinking.&quot; /&gt;
&lt;figcaption&gt;Setting up Hermes Agent.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;div class=&quot;section-content&quot;&gt;

      &lt;p&gt;Finally, let’s set up &lt;strong&gt;Hermes Agent&lt;/strong&gt;, and fully load it with sandboxed
&lt;strong&gt;Browser Use&lt;/strong&gt;, sandboxed &lt;strong&gt;web crawling&lt;/strong&gt;, and sandboxed &lt;strong&gt;code execution&lt;/strong&gt;.&lt;/p&gt;

      &lt;p&gt;We will use
&lt;a href=&quot;https://hermes-agent.nousresearch.com/docs/user-guide/docker&quot;&gt;Hermes Agent’s official Docker image&lt;/a&gt;:
&lt;code class=&quot;highlighter-rouge&quot;&gt;nousresearch/hermes-agent&lt;/code&gt;, expanded with the dependencies needed to perform
local text-to-speech and Matrix.org integration, all running in gVisor.
Additionally, for extra security, we will do the following:&lt;/p&gt;

      &lt;ul&gt;
        &lt;li&gt;Run &lt;a href=&quot;https://github.com/jo-inc/camofox-browser&quot;&gt;Camofox Browser&lt;/a&gt; in a
separate gVisor container, for browser use.&lt;/li&gt;
        &lt;li&gt;Run
&lt;a href=&quot;https://github.com/firecrawl/firecrawl/blob/main/SELF_HOST.md&quot;&gt;self-hosted Firecrawl&lt;/a&gt;
in a separate gVisor container, for agentic search.&lt;/li&gt;
        &lt;li&gt;Run &lt;a href=&quot;/docs/tutorials/docker-in-gvisor/&quot;&gt;Docker-in-gVisor&lt;/a&gt; in a separate
container, for Hermes Agent to execute arbitrary code safely.&lt;/li&gt;
      &lt;/ul&gt;

      &lt;p&gt;Note that the &lt;code class=&quot;highlighter-rouge&quot;&gt;--net-raw=true --allow-packet-socket-write=true&lt;/code&gt; runsc flags are
&lt;a href=&quot;/docs/tutorials/docker-in-gvisor/&quot;&gt;required for Docker to work in gVisor&lt;/a&gt;. For
this reason, we need to install a secondary runtime for the Docker-in-gVisor
container, and enable host Unix domain sockets (&lt;code class=&quot;highlighter-rouge&quot;&gt;--host-uds=all&lt;/code&gt;) so that the Docker daemon
socket file can be exported out of that sandbox into the Hermes Agent sandbox.&lt;/p&gt;
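
      &lt;p&gt;Once the secondary runtime has been installed (this happens in the next
section), it can be useful to confirm that the host &lt;code class=&quot;highlighter-rouge&quot;&gt;dockerd&lt;/code&gt; actually sees it.
The sketch below assumes that &lt;code class=&quot;highlighter-rouge&quot;&gt;docker info&lt;/code&gt; prints its runtimes as a compact JSON
object keyed by runtime name; &lt;code class=&quot;highlighter-rouge&quot;&gt;runtime_registered&lt;/code&gt; is a hypothetical helper name:&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper: reads the JSON printed by
#   docker info --format '{{json .Runtimes}}'
# on stdin and succeeds if a runtime with the given name is a key in it.
# Assumption: the JSON is compact, so each key appears as "name": with no
# whitespace before the colon.
runtime_registered() {
    grep -qF -- "\"$1\":"
}

# Illustrative usage on the host:
#   if docker info --format '{{json .Runtimes}}' | runtime_registered docker-in-gvisor; then
#       echo "docker-in-gvisor runtime is registered"
#   fi
```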

    &lt;/div&gt;

  &lt;/div&gt;

&lt;/section&gt;

&lt;figure class=&quot;img-100pct&quot;&gt;
&lt;div class=&quot;double-border-glow&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/hermes-agent-in-gvisor.png&quot; alt=&quot;Hermes Agent running in gVisor.&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt;Hermes Agent running in gVisor.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h3 id=&quot;setting-up-docker-in-gvisor-for-code-execution&quot;&gt;Setting up Docker-in-gVisor for code execution&lt;/h3&gt;

  &lt;div class=&quot;sticky-section-body&quot;&gt;

    &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/docker.blink.gif&quot; alt=&quot;Diagram showing the MAGI system with the 'Docker' box blinking.&quot; /&gt;
&lt;figcaption&gt;Setting up Docker-in-gVisor for code execution.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;div class=&quot;section-content&quot;&gt;

      &lt;p&gt;&lt;strong&gt;gVisor is capable of
&lt;a href=&quot;https://gvisor.dev/docs/tutorials/docker-in-gvisor/&quot;&gt;running Docker inside of itself&lt;/a&gt;&lt;/strong&gt;.
Since Hermes Agent has
&lt;a href=&quot;https://hermes-agent.nousresearch.com/docs/user-guide/configuration#docker-backend&quot;&gt;Docker as a code execution backend&lt;/a&gt;,
we will use this to spawn a separate Docker-in-gVisor container which Hermes
Agent can use to run code safely.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CASPER&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/agents/casper-3&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;runsc &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;docker-in-gvisor &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--net-raw&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--allow-packet-socket-write&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--host-uds&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all

&lt;span class=&quot;c&quot;&gt;# Reload *host* dockerd configuration to make it notice the new runtime we just added.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;kill&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-HUP&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;pidof dockerd&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Run Docker-in-gVisor container.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Note: The `--cap-add=all` flag does *not* grant the container any&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# capabilities on the host. It only enables the sandboxed workload to use&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# elevated privileges **within the sandbox**.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# This is necessary to be able to run `dockerd` inside a container.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/docker-run&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker run &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;hermes-exec &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;docker-in-gvisor &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--cap-add&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--mount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type=bind,src=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/docker-run,dst=/var/run&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    us-central1-docker.pkg.dev/gvisor-presubmit/gvisor-presubmit-images/basic/docker_x86_64

&lt;span class=&quot;c&quot;&gt;# Verify that we can talk to the `dockerd` server running in gVisor.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# We need --security-opt=seccomp=unconfined here, because otherwise&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Docker's default seccomp profile would block the `syslog(2)` syscall that&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# the `dmesg` process uses to read the kernel logs (which here is actually&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# reading the gVisor kernel logs). This is not a security problem, since we&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# are still all running in gVisor.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ DOCKER_HOST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;unix://&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/docker-run/docker.sock&quot;&lt;/span&gt; docker run &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--security-opt&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;seccomp&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;unconfined &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    debian:latest &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    dmesg
&lt;span class=&quot;c&quot;&gt;# [...]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;    0.000000] Starting gVisor...
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;    0.429798] DeFUSEing fork bombs...
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;    0.782957] Adversarially training Redcode AI...
&lt;span class=&quot;c&quot;&gt;# [...]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;
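
      &lt;p&gt;The in-sandbox &lt;code class=&quot;highlighter-rouge&quot;&gt;dockerd&lt;/code&gt; may need a moment to create its socket after the
container starts, so the verification step can fail if it runs too early. A
minimal sketch of a wait loop, assuming the socket path used above;
&lt;code class=&quot;highlighter-rouge&quot;&gt;wait_for_socket&lt;/code&gt; is a hypothetical helper name:&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper: polls once per second until the given path exists and
# is a Unix socket. Gives up (returns 1) after the given number of seconds.
wait_for_socket() {
    path="$1"
    remaining="$2"
    while [ ! -S "$path" ]; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Illustrative usage; assumes CASPER is set as in the steps above:
#   wait_for_socket "$CASPER/docker-run/docker.sock" 30
```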

    &lt;/div&gt;

  &lt;/div&gt;

&lt;/section&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h3 id=&quot;building-camofox-docker-image-in-docker-in-gvisor&quot;&gt;Building Camofox Docker image in Docker-in-gVisor&lt;/h3&gt;

  &lt;div class=&quot;sticky-section-body&quot;&gt;

    &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/camofox.blink.gif&quot; alt=&quot;Diagram showing the MAGI system with the 'Camofox' container blinking.&quot; /&gt;
&lt;figcaption&gt;Setting up Camofox Browser.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;div class=&quot;section-content&quot;&gt;

      &lt;p&gt;&lt;a href=&quot;https://github.com/jo-inc/camofox-browser&quot;&gt;Camofox&lt;/a&gt; is a Firefox-based web
browser for agentic browsing. Let’s run it in its own sandboxed container.&lt;/p&gt;

      &lt;p&gt;The Camofox image also contains &lt;code class=&quot;highlighter-rouge&quot;&gt;Xvfb&lt;/code&gt; to simulate an X11 display
server, and &lt;code class=&quot;highlighter-rouge&quot;&gt;yt-dlp&lt;/code&gt; for YouTube video extraction, all working in gVisor.&lt;/p&gt;

      &lt;p&gt;The Camofox project doesn’t provide pre-built Docker images, so we need to build
it ourselves. But wait! Camofox may or may not be a fishy project. What if it
contains malicious code?&lt;/p&gt;

      &lt;p&gt;&lt;strong&gt;Have no fear, gVisor is here!&lt;/strong&gt; We can simply build the image inside gVisor.
Let’s spin up an ephemeral Docker-in-gVisor container, run the Camofox Docker
image build process within, extract the image out, and import it into the host
&lt;code class=&quot;highlighter-rouge&quot;&gt;dockerd&lt;/code&gt;’s local image repository.&lt;/p&gt;

      &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/turtles.jpg&quot; alt=&quot;I heard you like containers so we put Docker build in Docker in gVisor in Docker.&quot; /&gt;
&lt;figcaption&gt;It's containers all the way down.&lt;/figcaption&gt;
&lt;/figure&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;# Start Docker-in-gVisor with large-enough /var/lib/docker tmpfs&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; /tmp/docker-tmp &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker run &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;docker-tmp &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;docker-in-gvisor &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--cap-add&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--mount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type=bind,src=/tmp/docker-tmp,dst=/tmp/docker-tmp&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;DOCKER_TMPFS_SIZE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;8G &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    us-central1-docker.pkg.dev/gvisor-presubmit/gvisor-presubmit-images/basic/docker_x86_64

&lt;span class=&quot;c&quot;&gt;# Build image within the in-gVisor Docker.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# The `make` command will run `docker build` in-sandbox.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec &lt;/span&gt;docker-tmp sh &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'true &amp;amp;&amp;amp; \
    apt update -y &amp;amp;&amp;amp; \
    apt install -y git build-essential &amp;amp;&amp;amp; \
    git clone https://github.com/jo-inc/camofox-browser.git &amp;amp;&amp;amp; \
    cd camofox-browser &amp;amp;&amp;amp; \
    make'&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Extract the image out of the container and import as host Docker image.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# The `docker save` command dumps the image to stdout, which gets piped&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# to the out-of-sandbox `docker load` command.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec &lt;/span&gt;docker-tmp docker save camofox-browser | docker load
Loaded image: camofox-browser:135.0.1-x86_64

&lt;span class=&quot;c&quot;&gt;# You now have the image on the host Docker:&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker images | &lt;span class=&quot;nb&quot;&gt;grep &lt;/span&gt;camofox
camofox-browser:135.0.1-x86_64      80c072259479      4.6GB      2.27GB

&lt;span class=&quot;c&quot;&gt;# Clean up.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; docker-tmp
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;
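
      &lt;p&gt;The image tag printed by &lt;code class=&quot;highlighter-rouge&quot;&gt;docker load&lt;/code&gt; will change as Camofox is updated.
Rather than hard-coding it, the tag could be captured from the
&lt;code class=&quot;highlighter-rouge&quot;&gt;Loaded image:&lt;/code&gt; line shown above. This is a sketch that assumes the output format
in that transcript; &lt;code class=&quot;highlighter-rouge&quot;&gt;loaded_image_tag&lt;/code&gt; is a hypothetical helper name:&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper: reads `docker load` output on stdin and prints the
# image reference from each "Loaded image: ..." line.
loaded_image_tag() {
    sed -n 's/^Loaded image: //p'
}

# Illustrative usage:
#   image="$(docker exec docker-tmp docker save camofox-browser | docker load | loaded_image_tag)"
#   echo "Imported: $image"
```

&lt;p&gt;If several tags are loaded at once, the helper prints one line per tag, so
a caller that needs a single reference would have to pick one.&lt;/p&gt;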

      &lt;p&gt;Now that we have our Camofox image, let’s run it:&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;camofox &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    camofox-browser:135.0.1-x86_64

&lt;span class=&quot;c&quot;&gt;# Camofox listens on port 3000 by default; we don't need to expose it&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# to the host though, as we will use inter-container networking.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# Nonetheless, let's make sure it works:&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;DEBIAN_FRONTEND&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;noninteractive camofox sh &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'true &amp;amp;&amp;amp; \
    apt update -y &amp;gt;/dev/null &amp;amp;&amp;amp; \
    apt install -y curl jq &amp;gt;/dev/null &amp;amp;&amp;amp; \
    tabId=&quot;$(curl -q -X POST http://127.0.0.1:3000/tabs -H &quot;Content-Type: application/json&quot; -d &quot;{\&quot;userId\&quot;: \&quot;me\&quot;, \&quot;sessionKey\&quot;: \&quot;task\&quot;, \&quot;url\&quot;: \&quot;https://gvisor.dev\&quot;}&quot; | jq -r .tabId)&quot; &amp;amp;&amp;amp; \
    curl -q --output - &quot;http://127.0.0.1:3000/tabs/${tabId}/screenshot?userId=me&quot;
  '&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /tmp/screenshot.png
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;file /tmp/screenshot.png
/tmp/screenshot.png: PNG image data, 1280 x 720, 8-bit/color RGBA, non-interlaced
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;
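
      &lt;p&gt;On hosts where the &lt;code class=&quot;highlighter-rouge&quot;&gt;file&lt;/code&gt; utility is not installed, the same sanity check
can be approximated by comparing the first eight bytes of the screenshot with
the PNG signature. A minimal sketch, with &lt;code class=&quot;highlighter-rouge&quot;&gt;is_png&lt;/code&gt; as a hypothetical helper name
and assuming &lt;code class=&quot;highlighter-rouge&quot;&gt;head -c&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;od&lt;/code&gt; are available:&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file starts with the 8-byte PNG
# signature (89 50 4e 47 0d 0a 1a 0a).
is_png() {
    signature="$(head -c 8 "$1" | od -An -tx1 | tr -d ' \n')"
    [ "$signature" = "89504e470d0a1a0a" ]
}

# Illustrative usage:
#   if is_png /tmp/screenshot.png; then echo "screenshot looks like a PNG"; fi
```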

    &lt;/div&gt;

  &lt;/div&gt;

&lt;/section&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h3 id=&quot;running-self-hosted-firecrawl-in-gvisor&quot;&gt;Running self-hosted Firecrawl in gVisor&lt;/h3&gt;

  &lt;div class=&quot;sticky-section-body&quot;&gt;

    &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/firecrawl.blink.gif&quot; alt=&quot;Diagram showing the MAGI system with the 'Firecrawl', 'Redis', 'RabbitMQ', 'Playwright', and 'PostgreSQL' containers blinking.&quot; /&gt;
&lt;figcaption&gt;Setting up the Firecrawl stack.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;div class=&quot;section-content&quot;&gt;

      &lt;p&gt;We will use the
&lt;a href=&quot;https://github.com/firecrawl/firecrawl/blob/main/docker-compose.yaml&quot;&gt;Firecrawl &lt;code class=&quot;highlighter-rouge&quot;&gt;docker-compose.yaml&lt;/code&gt; template&lt;/a&gt;,
simply modified to run all containers in gVisor. Because
&lt;a href=&quot;https://github.com/google/gvisor/issues/7469&quot;&gt;the way &lt;code class=&quot;highlighter-rouge&quot;&gt;docker-compose&lt;/code&gt; sets up DNS&lt;/a&gt;
is incompatible with gVisor’s per-container network stack, we need to use
pre-assigned IPs rather than container hostnames in the &lt;code class=&quot;highlighter-rouge&quot;&gt;docker-compose&lt;/code&gt; file.&lt;/p&gt;
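
      &lt;p&gt;To make the addressing scheme concrete, the static addresses that the
patch in the next code block assigns on the &lt;code class=&quot;highlighter-rouge&quot;&gt;172.16.0.0/16&lt;/code&gt; network can be
written as a small lookup. This is only an illustrative sketch;
&lt;code class=&quot;highlighter-rouge&quot;&gt;service_ip&lt;/code&gt; is a hypothetical helper name, and the addresses are simply the ones
chosen for this tutorial:&lt;/p&gt;

```shell
#!/bin/sh
# Hypothetical helper: maps each Firecrawl compose service to the static
# address this tutorial's patch assigns on the "backend" network.
# Returns 1 for unknown service names.
service_ip() {
    case "$1" in
        api)                echo 172.16.0.10 ;;
        playwright-service) echo 172.16.0.20 ;;
        redis)              echo 172.16.0.30 ;;
        rabbitmq)           echo 172.16.0.40 ;;
        nuq-postgres)       echo 172.16.0.50 ;;
        *)                  return 1 ;;
    esac
}

# Example: build the Redis URL the same way the patched defaults do.
echo "redis://$(service_ip redis):6379"
```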

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CASPER&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/agents/casper-3&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; git clone https://github.com/firecrawl/firecrawl.git &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/firecrawl&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt; &amp;gt; &quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;/firecrawl/.env&quot;
PORT=3002
HOST=0.0.0.0
OLLAMA_BASE_URL=http://172.17.0.1:11434/api
MODEL_NAME=qwen3.5:27b-q4_K_M
MODEL_EMBEDDING_NAME=nomic-embed-text:137m-v1.5-fp16
BULL_AUTH_KEY=CHANGEME
&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF
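&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;git &lt;span class=&quot;nt&quot;&gt;-C&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/firecrawl&quot;&lt;/span&gt; apply &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;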
diff --git a/docker-compose.yaml b/docker-compose.yaml
index 46829cafb..819f9cc87 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -10,8 +10,6 @@ x-common-service: &amp;amp;common-service
     nofile:
       soft: 65535
       hard: 65535
-  networks:
-    - backend
   extra_hosts:
     - &quot;host.docker.internal:host-gateway&quot;
   logging:
@@ -22,13 +20,13 @@ x-common-service: &amp;amp;common-service
       compress: &quot;true&quot;

 x-common-env: &amp;amp;common-env
-  REDIS_URL: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{REDIS_URL:-redis://redis:6379}
-  REDIS_RATE_LIMIT_URL: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{REDIS_URL:-redis://redis:6379}
-  PLAYWRIGHT_MICROSERVICE_URL: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000/scrape}
+  REDIS_URL: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{REDIS_URL:-redis://172.16.0.30:6379}
+  REDIS_RATE_LIMIT_URL: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{REDIS_URL:-redis://172.16.0.30:6379}
+  PLAYWRIGHT_MICROSERVICE_URL: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{PLAYWRIGHT_MICROSERVICE_URL:-http://172.16.0.20:3000/scrape}
   POSTGRES_USER: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{POSTGRES_USER:-postgres}
   POSTGRES_PASSWORD: &quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{POSTGRES_PASSWORD:-postgres}&quot;
   POSTGRES_DB: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{POSTGRES_DB:-postgres}
-  POSTGRES_HOST: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{POSTGRES_HOST:-nuq-postgres}
+  POSTGRES_HOST: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{POSTGRES_HOST:-172.16.0.50}
   POSTGRES_PORT: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{POSTGRES_PORT:-5432}
   USE_DB_AUTHENTICATION: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{USE_DB_AUTHENTICATION:-false}
   NUM_WORKERS_PER_QUEUE: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{NUM_WORKERS_PER_QUEUE:-8}
@@ -58,6 +56,10 @@ x-common-env: &amp;amp;common-env

 services:
   playwright-service:
+    runtime: &quot;runsc&quot;
+    networks:
+      backend:
+        ipv4_address: 172.16.0.20
     # NOTE: If you don't want to build the service locally,
     # comment out the build: statement and uncomment the image: statement
     # image: ghcr.io/firecrawl/playwright-service:latest
@@ -71,8 +73,6 @@ services:
       BLOCK_MEDIA: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{BLOCK_MEDIA}
       # Configure maximum concurrent pages for Playwright browser instances
       MAX_CONCURRENT_PAGES: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{CRAWL_CONCURRENT_REQUESTS:-10}
-    networks:
-      - backend
     # Resource limits for Docker Compose (not Swarm)
     cpus: 2.0
     mem_limit: 4G
@@ -88,13 +88,17 @@ services:

   api:
     &amp;lt;&amp;lt;: *common-service
+    runtime: &quot;runsc&quot;
+    networks:
+      backend:
+        ipv4_address: 172.16.0.10
     environment:
       &amp;lt;&amp;lt;: *common-env
       HOST: &quot;0.0.0.0&quot;
       PORT: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{INTERNAL_PORT:-3002}
       EXTRACT_WORKER_PORT: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{EXTRACT_WORKER_PORT:-3004}
       WORKER_PORT: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{WORKER_PORT:-3005}
-      NUQ_RABBITMQ_URL: amqp://rabbitmq:5672
+      NUQ_RABBITMQ_URL: amqp://172.16.0.40:5672
       ENV: local
     depends_on:
       redis:
@@ -113,6 +117,7 @@ services:
     memswap_limit: 8G

   redis:
+    runtime: &quot;runsc&quot;
     # NOTE: If you want to use Valkey (open source) instead of Redis (source available),
     # uncomment the Valkey statement and comment out the Redis statement.
     # Using Valkey with Firecrawl is untested and not guaranteed to work. Use with caution.
@@ -120,7 +125,8 @@ services:
     # image: valkey/valkey:alpine

     networks:
-      - backend
+      backend:
+        ipv4_address: 172.16.0.30
     command: redis-server --bind 0.0.0.0
     logging:
       driver: &quot;json-file&quot;
@@ -130,9 +136,11 @@ services:
         compress: &quot;true&quot;

   rabbitmq:
+    runtime: &quot;runsc&quot;
     image: rabbitmq:3-management
     networks:
-      - backend
+      backend:
+        ipv4_address: 172.16.0.40
     command: rabbitmq-server
     healthcheck:
       test: [&quot;CMD&quot;, &quot;rabbitmq-diagnostics&quot;, &quot;-q&quot;, &quot;check_running&quot;]
@@ -148,6 +156,7 @@ services:
         compress: &quot;true&quot;

   nuq-postgres:
+    runtime: &quot;runsc&quot;
     # NOTE: If you don't want to build the image locally,
     # comment out the build: statement and uncomment the image: statement
     # image: ghcr.io/firecrawl/nuq-postgres:latest
@@ -157,7 +166,8 @@ services:
       POSTGRES_PASSWORD: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{POSTGRES_PASSWORD:-postgres}
       POSTGRES_DB: &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\$&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;{POSTGRES_DB:-postgres}
     networks:
-      - backend
+      backend:
+        ipv4_address: 172.16.0.50
     logging:
       driver: &quot;json-file&quot;
       options:
@@ -168,3 +178,8 @@ services:
 networks:
   backend:
     driver: bridge
+    ipam:
+      config:
+        - gateway: 172.16.0.1
+          subnet: 172.16.0.0/16
+      driver: default
&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF

&lt;/span&gt;&lt;span class=&quot;c&quot;&gt;# Run.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/firecrawl&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker compose build &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker compose up &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Make sure it works:&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-X&lt;/span&gt; POST http://localhost:3002/v1/crawl &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-H&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'{
      &quot;url&quot;: &quot;https://firecrawl.dev&quot;
    }'&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;success&quot;&lt;/span&gt;:true,&lt;span class=&quot;s2&quot;&gt;&quot;id&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;019d7a78-e77a-70af-9f49-8e03421dad32&quot;&lt;/span&gt;,&lt;span class=&quot;s2&quot;&gt;&quot;url&quot;&lt;/span&gt;:&lt;span class=&quot;s2&quot;&gt;&quot;http://localhost:3002/v1/crawl/019d7a78-e77a-70af-9f49-8e03421dad32&quot;&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;This brings up all the following applications in separate gVisor containers on
their own inter-container network:&lt;/p&gt;

      &lt;ul&gt;
        &lt;li&gt;&lt;strong&gt;Redis&lt;/strong&gt; for key/value storage.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;RabbitMQ&lt;/strong&gt; for message queuing.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Playwright&lt;/strong&gt; for browser automation.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; for long-term storage.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt; as the main API endpoint for Hermes Agent to interact with.&lt;/li&gt;
      &lt;/ul&gt;
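
      &lt;p&gt;To confirm that these containers are actually sandboxed, one option is to ask
Docker which runtime each of them was started with. The following is a minimal
sketch, to be run from a second terminal while the stack is up. It assumes the
Compose project lives in &lt;code class=&quot;highlighter-rouge&quot;&gt;$CASPER/firecrawl&lt;/code&gt; as above, and relies only on
&lt;code class=&quot;highlighter-rouge&quot;&gt;docker compose ps -q&lt;/code&gt; and the &lt;code class=&quot;highlighter-rouge&quot;&gt;HostConfig.Runtime&lt;/code&gt; field reported by
&lt;code class=&quot;highlighter-rouge&quot;&gt;docker inspect&lt;/code&gt;:&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Print the name and runtime of every container in the Compose project.
$ ( cd &quot;$CASPER/firecrawl&quot;; docker inspect \
    --format '{{.Name}} {{.HostConfig.Runtime}}' $(docker compose ps -q) )
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;If the setup matches this post, each line should name &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; as the runtime.
Any container reporting &lt;code class=&quot;highlighter-rouge&quot;&gt;runc&lt;/code&gt; is running unsandboxed.&lt;/p&gt;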

    &lt;/div&gt;

  &lt;/div&gt;

&lt;/section&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h3 id=&quot;putting-it-all-together&quot;&gt;Putting it all together&lt;/h3&gt;

  &lt;div class=&quot;sticky-section-body&quot;&gt;

    &lt;figure class=&quot;follow-along&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/hermes-agent.blink.gif&quot; alt=&quot;Diagram showing the MAGI system with the 'Hermes Agent' container blinking.&quot; /&gt;
&lt;figcaption&gt;Setting up Hermes Agent and connecting it.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;div class=&quot;section-content&quot;&gt;

      &lt;p&gt;Let’s put the pieces together for the Hermes Agent container.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CASPER&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/agents/casper-3&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Register Matrix user.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; synapse register_new_matrix_user &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; /data/homeserver.yaml &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--user&lt;/span&gt; casper &lt;span class=&quot;nt&quot;&gt;--password&lt;/span&gt; naoko &lt;span class=&quot;nt&quot;&gt;--no-admin&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Hermes requires a non-root user for its home directory.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;groupadd &lt;span class=&quot;nt&quot;&gt;--gid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10337 hermes &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    useradd &lt;span class=&quot;nt&quot;&gt;--home-dir&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/null &lt;span class=&quot;nt&quot;&gt;--no-create-home&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--shell&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;which nologin&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;--uid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10337 &lt;span class=&quot;nt&quot;&gt;--gid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;10337 hermes

&lt;span class=&quot;c&quot;&gt;# Build Docker image with extra packages.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt; &amp;gt; &quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;/Dockerfile&quot;
FROM nousresearch/hermes-agent:v2026.4.13

# Install basic packages.
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    apt install -y sudo wget curl git build-essential python3-pip

# Install dependencies for Hermes Agent's Matrix.org support.
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    apt install -y libolm-dev &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    python3 -m pip config set global.break-system-packages true &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    pip install 'matrix-nio' 'mautrix[encryption]'

# Install espeak-ng and NeuTTS model for local text-to-speech capabilities.
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    apt install -y espeak-ng &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    pip install 'neutts[all]'

# Install Docker; not required for dockerd since that's running in a separate
# container, but Hermes Agent still needs the Docker **client** CLI.
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y &amp;amp;&amp;amp; &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;
    apt install -y docker.io
&lt;/span&gt;&lt;span class=&quot;no&quot;&gt;EOF

&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker build &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; hermes-agent:casper-3 &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;As Hermes Agent does not easily support non-interactive configuration, we need
to configure it manually. Let’s run it for interactive configuration purposes:&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;CASPER&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$HOME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/agents/casper-3&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nb&quot;&gt;mkdir&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/home&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;chown &lt;/span&gt;hermes:hermes &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/home&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;casper &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--shm-size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1g &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;synapse:synapse &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ollama:ollama &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;camofox:camofox &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--mount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type=bind,src=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/home,dst=/opt/data&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--mount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type=bind,src=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/docker-run,dst=/docker-run&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;HERMES_UID&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-u&lt;/span&gt; hermes&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;HERMES_GID&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;id&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-g&lt;/span&gt; hermes&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;DOCKER_HOST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;unix:///docker-run/docker.sock&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    hermes-agent:casper-3 setup
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;figure class=&quot;img-100pct&quot;&gt;
&lt;div class=&quot;double-border-glow&quot;&gt;
&lt;video src=&quot;/assets/images/2026-04-15-magi/hermes-agent-setup.webm&quot; autoplay=&quot;&quot; loop=&quot;&quot; muted=&quot;&quot; playsinline=&quot;&quot;&gt;&lt;/video&gt;
&lt;/div&gt;

&lt;figcaption&gt;Going through Hermes Agent's interactive setup process in
gVisor.&lt;/figcaption&gt; &lt;/figure&gt;

      &lt;details&gt;

        &lt;summary&gt;

          &lt;h4 id=&quot;interactive-setup-instructions&quot;&gt;Interactive setup instructions&lt;/h4&gt;

          &lt;p&gt;Expand this section for a text version of the screen recording above.&lt;/p&gt;

        &lt;/summary&gt;

        &lt;ul&gt;
          &lt;li&gt;Choose &lt;code class=&quot;highlighter-rouge&quot;&gt;Full setup&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Inference Provider: &lt;code class=&quot;highlighter-rouge&quot;&gt;More providers&lt;/code&gt; → &lt;code class=&quot;highlighter-rouge&quot;&gt;Custom endpoint&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;API base URL: &lt;code class=&quot;highlighter-rouge&quot;&gt;http://ollama:11434/v1&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;API key: (leave empty)&lt;/li&gt;
          &lt;li&gt;Select model: &lt;code class=&quot;highlighter-rouge&quot;&gt;qwen3.5:27b-q4_K_M&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Context length in tokens: &lt;code class=&quot;highlighter-rouge&quot;&gt;262144&lt;/code&gt; (per the
&lt;a href=&quot;https://huggingface.co/Qwen/Qwen3.5-27B&quot;&gt;Qwen3.5-27B model card&lt;/a&gt;)&lt;/li&gt;
          &lt;li&gt;Select TTS provider: &lt;code class=&quot;highlighter-rouge&quot;&gt;NeuTTS&lt;/code&gt; (local on-device)&lt;/li&gt;
          &lt;li&gt;Terminal Backend: &lt;code class=&quot;highlighter-rouge&quot;&gt;Docker&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Docker image: (leave default)&lt;/li&gt;
          &lt;li&gt;Container Resource Settings: Up to you&lt;/li&gt;
          &lt;li&gt;Max iterations / Tool progress mode / […] / Inactivity timeout: Up to you&lt;/li&gt;
          &lt;li&gt;Select platforms: &lt;code class=&quot;highlighter-rouge&quot;&gt;Matrix&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Homeserver URL: &lt;code class=&quot;highlighter-rouge&quot;&gt;http://synapse:8008&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Access token: (leave empty)&lt;/li&gt;
          &lt;li&gt;User ID: &lt;code class=&quot;highlighter-rouge&quot;&gt;@casper:magi&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Password: &lt;code class=&quot;highlighter-rouge&quot;&gt;naoko&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Enable end-to-end encryption (E2EE): Up to you&lt;/li&gt;
          &lt;li&gt;Allowed user IDs: &lt;code class=&quot;highlighter-rouge&quot;&gt;@gendo:magi&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Home room ID: (leave empty)&lt;/li&gt;
          &lt;li&gt;Install gateway as systemd service: No, as this isn’t relevant for a
containerized install.&lt;/li&gt;
          &lt;li&gt;Tools: Feel free to configure.&lt;/li&gt;
          &lt;li&gt;Browser provider: &lt;code class=&quot;highlighter-rouge&quot;&gt;Camofox&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Camofox server URL: &lt;code class=&quot;highlighter-rouge&quot;&gt;http://camofox:3000&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Image generation FAL API key: (leave empty unless you have one)&lt;/li&gt;
          &lt;li&gt;TTS provider: Skip&lt;/li&gt;
          &lt;li&gt;Search provider: &lt;code class=&quot;highlighter-rouge&quot;&gt;Self-hosted Firecrawl&lt;/code&gt;&lt;/li&gt;
          &lt;li&gt;Firecrawl instance URL: &lt;code class=&quot;highlighter-rouge&quot;&gt;http://172.17.0.1:3002&lt;/code&gt;&lt;/li&gt;
        &lt;/ul&gt;

      &lt;/details&gt;

      &lt;p&gt;You can verify that Hermes Agent’s “terminal” backend is the Docker-in-gVisor setup by
running &lt;code class=&quot;highlighter-rouge&quot;&gt;htop&lt;/code&gt; in the &lt;code class=&quot;highlighter-rouge&quot;&gt;hermes-exec&lt;/code&gt; container.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; hermes-exec sh &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'apt update -y &amp;amp;&amp;amp; apt install -y htop'&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Watch this command while asking Hermes Agent to run `curl https://gvisor.dev`:&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;exec&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-it&lt;/span&gt; hermes-exec htop
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;To make Hermes Agent actually join the Matrix room, you need to restart the
container in gateway mode.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;docker &lt;span class=&quot;nb&quot;&gt;rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; casper&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; docker run &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;casper &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--restart&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;always &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--shm-size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1g &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;synapse:synapse &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ollama:ollama &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;camofox:camofox &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--mount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type=bind,src=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/home,dst=/opt/data&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--mount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type=bind,src=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$CASPER&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/docker-run,dst=/docker-run&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;DOCKER_HOST&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;unix:///docker-run/docker.sock&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    hermes-agent:casper-3 gateway
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;

      &lt;p&gt;Now invite the bot to your Matrix room and send &lt;code class=&quot;highlighter-rouge&quot;&gt;/sethome&lt;/code&gt; on the main channel.&lt;/p&gt;
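
      &lt;p&gt;If the bot does not show up, or you want to double-check that the gateway
container is sandboxed, a few standard Docker commands can help. This is only a
sketch of one way to check, using the &lt;code class=&quot;highlighter-rouge&quot;&gt;casper&lt;/code&gt; container name from above; the
exact log output depends on the Hermes Agent version.&lt;/p&gt;

      &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Which runtime is the gateway container using? Expect runsc here.
$ docker inspect --format '{{.HostConfig.Runtime}}' casper

# Kernel version as seen from inside the container.
# A 4.4.0 kernel version is characteristic of gVisor.
$ docker exec casper uname -r

# Recent gateway logs, useful if the Matrix login did not succeed.
$ docker logs --tail 20 casper
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;      &lt;/div&gt;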

      &lt;p&gt;You now have Hermes Agent running in gVisor. To recap, this setup has:&lt;/p&gt;

      &lt;ul&gt;
        &lt;li&gt;&lt;strong&gt;Hermes Agent&lt;/strong&gt; running in its own gVisor container&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;dockerd&lt;/code&gt;&lt;/strong&gt; running in a separate gVisor container, for subcommand
execution&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Camofox Browser&lt;/strong&gt; running with a virtual display (&lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;Xvfb&lt;/code&gt;&lt;/strong&gt;) for browser
use, in its own gVisor container&lt;/li&gt;
        &lt;li&gt;Self-hosted &lt;strong&gt;Firecrawl&lt;/strong&gt; for agentic search, in its own set of gVisor
containers.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;NeuTTS&lt;/strong&gt; for text-to-speech capabilities in Hermes Agent, evaluated within
gVisor.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Ollama&lt;/strong&gt; for inference and &lt;strong&gt;Matrix.org&lt;/strong&gt; for communication, same as the
other agents.&lt;/li&gt;
      &lt;/ul&gt;

    &lt;/div&gt;

  &lt;/div&gt;

&lt;/section&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h3 id=&quot;putting-these-agents-in-a-room&quot;&gt;Putting these agents in a room&lt;/h3&gt;

  &lt;p&gt;You can now ask your three agents to do your bidding and get various perspectives.&lt;/p&gt;

  &lt;figure class=&quot;img-100pct&quot;&gt;
&lt;div class=&quot;double-border-glow&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/magi-three-way.png&quot; alt=&quot;All three agents together in a Matrix.org room displayed in the Cinny web UI, with each agent fetching the gVisor homepage and confirming that they are each running in gVisor.&quot; /&gt;
&lt;/div&gt;
&lt;figcaption&gt;The three agents fetching the gVisor homepage and verifying that they are running in gVisor.&lt;br /&gt;Note: Hermes Agent cannot call &lt;code&gt;dmesg&lt;/code&gt;, due to the default system call filter applied to the Docker container that its code execution tool runs in.&lt;br /&gt;However, the &lt;code&gt;4.4.0&lt;/code&gt; kernel version is characteristic of gVisor.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;/section&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h2 id=&quot;sandboxing-agents-what-actually-makes-sense&quot;&gt;Sandboxing agents: What actually makes sense?&lt;/h2&gt;

  &lt;p&gt;The setup described in this blog post is a contrived example of agent
sandboxing, where every part of the stack is mutually sandboxed from one
another. In closer-to-real-world settings, not all of these components are
untrusted, some of them will run remotely, others may be delegated to
off-machine APIs, etc. So what would a more practical setup look like?&lt;/p&gt;

  &lt;p&gt;At a high level, an autonomous agent stack looks like this:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;A &lt;strong&gt;core daemon&lt;/strong&gt; (written in good old regular code, e.g. TypeScript for
OpenClaw), typically listening on a TCP port. This daemon is responsible
for:
      &lt;ul&gt;
        &lt;li&gt;Receiving user requests via a communications plugin (e.g. Signal,
Mattermost…)&lt;/li&gt;
        &lt;li&gt;Running inference API calls&lt;/li&gt;
        &lt;li&gt;Dispatching tool calls&lt;/li&gt;
        &lt;li&gt;Running the control loop necessary to make forward progress on long-term
tasks, using inference and tool calls&lt;/li&gt;
        &lt;li&gt;Running cron-like tasks and
&lt;a href=&quot;https://docs.openclaw.ai/gateway/heartbeat&quot;&gt;heartbeats&lt;/a&gt; to keep the
agent autonomous&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
    &lt;li&gt;A pretty &lt;strong&gt;web interface&lt;/strong&gt; (sometimes part of the core daemon, sometimes
separate)&lt;/li&gt;
    &lt;li&gt;A &lt;strong&gt;plugin ecosystem&lt;/strong&gt;, adding new tools, communication channels, etc. to
the agent&lt;/li&gt;
    &lt;li&gt;A database of &lt;strong&gt;skills and general knowledge&lt;/strong&gt; (memory) that the agent can
evolve over time as it learns from its mistakes, or learns more about its
raison d’être and the user it is dealing with.&lt;/li&gt;
    &lt;li&gt;A &lt;strong&gt;policy engine&lt;/strong&gt; that can decide on the security policies needed for any
action the agent would like to take (tool call, API call, credential access,
etc.).&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;When you send a message to such an agent, it ends up running a control loop to
handle your query. This control loop will initially run inference, then very
likely follow this up by a sequence of tool calls and further inference
requests, until a satisfying conclusion is reached. These tool calls can
include:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Data lookups&lt;/strong&gt; on the web&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;API requests&lt;/strong&gt; to external services, often requiring sensitive credentials
to “act as” the user&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Browser use&lt;/strong&gt;, sometimes with similar credential needs&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Code snippet&lt;/strong&gt; executions&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt; reads and writes, database-like&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Introspection requests&lt;/strong&gt;, where the agent can modify its own configuration
or skill database, sometimes fixing its own setup/configuration issues
rather than requiring a human to get it unstuck.&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;Where does sandboxing fit in?&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Sandboxing individual tools&lt;/strong&gt;: Most tool calls don’t do anything fancy.
They just make web requests and are not expected to have side-effects. They
have no business reading local files or modifying the agent’s own
configuration. Sandboxing these tools allows for defense-in-depth.
      &lt;ul&gt;
        &lt;li&gt;Concrete example: One can craft malicious &lt;code class=&quot;highlighter-rouge&quot;&gt;.mov&lt;/code&gt; videos which can refer
to arbitrary file paths on the host. What if your agent gets tricked
into converting a video that tries to embed a subtitle file pointing to
&lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/shadow&lt;/code&gt;? Sandbox your tool calls and avoid this problem.&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Sandboxing subsystems&lt;/strong&gt;: Some agent functionality may depend on
long-running daemons which themselves don’t need system-wide access. This
can be important for network-exposed or network-accessing subsystems.
      &lt;ul&gt;
        &lt;li&gt;Concrete example: If using Signal as communications layer, the
&lt;a href=&quot;https://github.com/AsamK/signal-cli&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;signal-cli&lt;/code&gt; daemon&lt;/a&gt; can run in a
sandbox for defense-in-depth.&lt;/li&gt;
        &lt;li&gt;Similarly, in the examples above, we sandbox &lt;code class=&quot;highlighter-rouge&quot;&gt;dockerd&lt;/code&gt; and Camofox
browser in separate containers.&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Sandboxing the core daemon&lt;/strong&gt;: The ability of the agent to &lt;strong&gt;change
its own environment&lt;/strong&gt; to debug or update itself is a very powerful feature.
To do so, the agent requires effectively root control over its own core code
and configuration. Therefore, &lt;strong&gt;sandboxing the entire agent’s core daemon&lt;/strong&gt;
makes sense: the agent can leverage its own intelligence to make itself
better, while still being confined to a box. That box is useful because:
      &lt;ul&gt;
        &lt;li&gt;Destructive changes can be &lt;strong&gt;rolled back&lt;/strong&gt;.&lt;/li&gt;
        &lt;li&gt;The agent’s &lt;strong&gt;policy engine can live outside&lt;/strong&gt; the core sandbox. This
prevents the agent from changing the policy engine’s policies
maliciously.&lt;/li&gt;
        &lt;li&gt;Relatedly, sensitive &lt;strong&gt;credentials can live outside&lt;/strong&gt; the core sandbox.
This ensures that all credential use is mediated through components the
agent can’t modify. This includes API keys, crypto wallet keys for
agentic commerce, and user-authenticated browser sessions.&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
  &lt;/ul&gt;
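
  &lt;p&gt;As an illustration of the first point, a single tool call such as the video
conversion above could run in a throwaway gVisor container with no network
access, a read-only root filesystem, and read-only access to its input. The
following is a hedged sketch rather than something any of the agents in this
post does out of the box: the &lt;code class=&quot;highlighter-rouge&quot;&gt;ffmpeg-tool:latest&lt;/code&gt; image (assumed to have
&lt;code class=&quot;highlighter-rouge&quot;&gt;ffmpeg&lt;/code&gt; on its &lt;code class=&quot;highlighter-rouge&quot;&gt;PATH&lt;/code&gt; and no custom entrypoint), the &lt;code class=&quot;highlighter-rouge&quot;&gt;$WORK&lt;/code&gt; directory, and the
file names are all hypothetical placeholders.&lt;/p&gt;

  &lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical example: convert an untrusted video in a one-shot sandbox.
# The container only sees /in (read-only) and /out, and has no network.
$ docker run --rm \
    --runtime=runsc \
    --network=none \
    --read-only --tmpfs=/tmp \
    --mount=&quot;type=bind,src=$WORK/in,dst=/in,readonly&quot; \
    --mount=&quot;type=bind,src=$WORK/out,dst=/out&quot; \
    ffmpeg-tool:latest ffmpeg -i /in/input.mov /out/output.mp4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Because the host’s &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc/shadow&lt;/code&gt; is not mounted into this container, a crafted
input that refers to that path can only reach the container image’s own copy, not
the host’s.&lt;/p&gt;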

  &lt;p&gt;&lt;em&gt;Other parts of the stack typically run fully-trusted code with little to no
need for sandboxing. For example, the memory subsystem may be a local vector
lookup or similar database, with no internet connectivity and no need to run
arbitrary code. Thus, consistent with the advice in the
&lt;a href=&quot;/docs/user_guide/production/&quot;&gt;gVisor production guide&lt;/a&gt;, it does not need to be
sandboxed.&lt;/em&gt;&lt;/p&gt;

  &lt;p&gt;We see some of these ideas being implemented across the ecosystem:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;OpenClaw supports agent-level containerization via
&lt;a href=&quot;https://docs.openclaw.ai/install/docker&quot;&gt;Docker&lt;/a&gt; and
&lt;a href=&quot;https://docs.openclaw.ai/install/podman&quot;&gt;Podman&lt;/a&gt;.&lt;/li&gt;
    &lt;li&gt;NemoClaw uses &lt;a href=&quot;https://github.com/NVIDIA/OpenShell&quot;&gt;OpenShell&lt;/a&gt; to ensure
tool calls have initially-restricted access which can then be widened as
needed by the tool.&lt;/li&gt;
    &lt;li&gt;Hermes Agent implements
&lt;a href=&quot;https://hermes-agent.nousresearch.com/docs/user-guide/checkpoints-and-rollback&quot;&gt;checkpoints and rollbacks&lt;/a&gt;
to protect against destructive operations.&lt;/li&gt;
    &lt;li&gt;&lt;a href=&quot;https://www.ironclaw.com/&quot;&gt;IronClaw&lt;/a&gt; segregates API keys out of the agent’s
core sandbox and injects them at egress time.&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;Security practices for these tools are rapidly evolving, and gVisor has a role
to play.&lt;/p&gt;

&lt;/section&gt;

&lt;section class=&quot;sticky-section&quot;&gt;

  &lt;h2 id=&quot;should-i-use-gvisor-to-sandbox-my-agent&quot;&gt;Should I use gVisor to sandbox my agent?&lt;/h2&gt;

  &lt;p&gt;gVisor dramatically &lt;strong&gt;reduces the attack surface&lt;/strong&gt; for sandbox escapes. It does
so by reimplementing a large portion of Linux in userspace, preventing the
sandboxed application from attacking the host kernel. Read
&lt;a href=&quot;https://gvisor.dev/docs/architecture_guide/intro/&quot;&gt;more about gVisor’s security architecture&lt;/a&gt;.&lt;/p&gt;

  &lt;p&gt;For autonomous agents, you don’t just need a strong sandbox, you also need
&lt;strong&gt;strong policies around &lt;em&gt;when&lt;/em&gt; and &lt;em&gt;what&lt;/em&gt; to sandbox&lt;/strong&gt;. As a sandboxing
technology, gVisor does not help you with these decisions. gVisor only
&lt;strong&gt;enhances the level of security of the sandboxing capabilities that the agent
already has&lt;/strong&gt;. Thus, &lt;strong&gt;gVisor is &lt;em&gt;necessary&lt;/em&gt;, but not &lt;em&gt;sufficient&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

  &lt;p&gt;gVisor’s capabilities are also uniquely well-suited to agentic workloads:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;Sandboxes &lt;strong&gt;start and stop in milliseconds&lt;/strong&gt;, critical to keeping these
systems responsive and minimizing time between inference calls.&lt;/li&gt;
    &lt;li&gt;Thanks to its process-like model (not a virtual machine), gVisor can achieve
&lt;strong&gt;superior density&lt;/strong&gt;, i.e. more sandboxes running concurrently on the same
host.&lt;/li&gt;
    &lt;li&gt;gVisor supports &lt;strong&gt;checkpoint/restore&lt;/strong&gt;, making slow-to-initialize repetitive
actions quick to replay, and checkpoints/rollbacks can be done seamlessly
without sandboxed-workload-specific support.&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;One current drawback of gVisor is that it is relatively difficult to integrate into
existing applications that have such sandboxing needs. For example, this is one
reason why the above demo does not sandbox Hermes Agent tool calls in
&lt;strong&gt;separate&lt;/strong&gt; gVisor instances. This is being worked on. Watch this space!&lt;/p&gt;

&lt;/section&gt;

&lt;figure class=&quot;img-100pct&quot;&gt;
&lt;img src=&quot;/assets/images/2026-04-15-magi/magi.gif&quot; alt=&quot;Diagram showing the MAGI system: three agents running in gVisor, along with a lot of side-services in gVisor-sandboxed containers. Evangelion style.&quot; /&gt;
&lt;figcaption&gt;&lt;em&gt;*cogitation intensifies*&lt;/em&gt;&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;!--* pragma: { seclinter_this_is_fine: false } *--&gt;</content><author><name>eperot</name></author><summary type="html">Get in the sandbox, Agents. Does gVisor work with OpenClaw? This question has been asked a lot, so let’s answer it here and now: Yes. In this post, we will set up a triple-agent system combining OpenClaw, PicoClaw, and Hermes Agent, each in separate gVisor sandboxes, all with local inference powered by Ollama in a gVisor sandbox using three different models, convening together in a self-hosted Matrix.org server (naturally, also running in a gVisor sandbox). Each agent will be given its own set of capabilities, each of which will be sandboxed. At the end of the day, you will have a fully self-sovereign triple-agent system that can answer queries, browse the web, and cogitate with itself. Does this particular setup make practical sense? No, but it is cool. More importantly, it demonstrates the versatility of gVisor at sandboxing basically any component that an agentic system may need. gVisor’s compatibility has grown significantly over the last few years, and agent harnesses fit well within what gVisor is capable of.</summary></entry><entry><title type="html">Safe Ride into the Dangerzone: Reducing attack surface with gVisor</title><link href="/blog/2024/09/23/safe-ride-into-the-dangerzone/" rel="alternate" type="text/html" title=" Safe Ride into the Dangerzone: Reducing attack surface with gVisor" /><published>2024-09-23T00:00:00-05:00</published><updated>2024-09-23T00:00:00-05:00</updated><id>/blog/2024/09/23/dangerzone</id><content type="html" xml:base="/blog/2024/09/23/safe-ride-into-the-dangerzone/">&lt;p&gt;&lt;em&gt;This article was written in collaboration with the
&lt;a href=&quot;https://freedom.press&quot;&gt;Freedom of the Press Foundation&lt;/a&gt; and
&lt;a href=&quot;https://dangerzone.rocks/news/2024-09-23-gvisor&quot;&gt;cross-posted on the Dangerzone blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of the oft-repeated sound bites of computer security advice is: “Don’t open
random attachments from strangers.” If you are a journalist, however, opening
attachments and documents is part of your job description. Since journalists
already have a lot of security threats to worry about in dealing with sources,
the safe opening of documents should not be one of them.
&lt;a href=&quot;https://dangerzone.rocks&quot;&gt;Dangerzone&lt;/a&gt; was developed to solve this problem. It
lets you open suspicious documents with confidence and gets out of your way.&lt;/p&gt;

&lt;p&gt;For the past few months, members of the Dangerzone team and the
&lt;a href=&quot;https://gvisor.dev&quot;&gt;gVisor project&lt;/a&gt; collaborated on significantly improving the
security properties of Dangerzone. We’re excited to announce that &lt;strong&gt;as of
version 0.7.0, Dangerzone uses gVisor to secure its document conversion
process&lt;/strong&gt;. gVisor is already trusted by Google
&lt;a href=&quot;https://gvisor.dev/users&quot;&gt;and others&lt;/a&gt; to secure cloud products, scan Gmail
attachments for viruses, etc.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;If you’re an existing Dangerzone user on 0.7.0 scratching your head and thinking
“Well, I haven’t noticed anything different,” then first of all, “yay!” That was
the plan. And second, because the plan worked so deviously well, this change has
probably flown under the radar, so here are more than 3,000 words to remedy that.&lt;/p&gt;

&lt;p&gt;The rest of the article dives deep into Dangerzone’s security, describes how
gVisor works as a technology, and explains how Dangerzone’s security profile has
changed after this integration. Expect some technical terms and nerdery.&lt;/p&gt;

&lt;h2 id=&quot;how-dangerzone-works&quot;&gt;How Dangerzone works&lt;/h2&gt;

&lt;p&gt;Dangerzone’s purpose is to sanitize documents of any elements that can
compromise your computer or the source’s identity (think malware and document
metadata). To do this, it first renders the document into visual data (pixels)
and then turns this visual representation back into a readable document file.
The first part of this process (rendering the document into pixel data) is the
most security-critical part and, for the purpose of this article, we will zoom
in on just this.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;💡 For a broader understanding of how Dangerzone works, we encourage you to
read the &lt;a href=&quot;https://dangerzone.rocks/about/&quot;&gt;“About Dangerzone”&lt;/a&gt; section on the
Dangerzone website. Props to the &lt;a href=&quot;https://www.qubes-os.org/&quot;&gt;Qubes OS&lt;/a&gt; team,
who first popularized the concept that is now their
&lt;a href=&quot;https://blog.invisiblethings.org/2013/02/21/converting-untrusted-pdfs-into-trusted.html&quot;&gt;TrustedPDF feature&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In order to support a wide variety of document formats (PDF, office documents,
image formats, etc.), Dangerzone needs to open them with software that
potentially has security bugs. That may result in compromise of the user’s
device, personal files, and communication. This is the same risk you face when
you use your computer to open attachments from unknown sources. Dangerzone needs
to somehow isolate this process from the rest of your computer, so that anything
it does cannot “get out of the box”.&lt;/p&gt;

&lt;p&gt;Dangerzone’s isolation relies on &lt;strong&gt;Linux containers&lt;/strong&gt;. Containers are very handy
for two things: making software behave the same way across operating systems
and separating that software from the rest of the machine.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-outline.svg&quot; alt=&quot;Diagram showing the Dangerzone UI sending a document to a document renderer, which converts it to pixels, and then receives the pixels back.&quot; /&gt;
&lt;figcaption&gt;Outline of how Dangerzone uses containers to render a document into pixels.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Dangerzone benefits from both of these aspects: Development and testing are made
easy by using containers’ cross-platform compatibility; and containers’
security, especially as Dangerzone configures them, offers strong isolation
guarantees. The
&lt;a href=&quot;https://freedom.press/news/dangerzone-receives-favorable-audit/&quot;&gt;security audit Dangerzone passed recently&lt;/a&gt;
is a testament to this.&lt;/p&gt;

&lt;p&gt;In computer security, the gold standard of isolation is &lt;strong&gt;virtual machines&lt;/strong&gt;.
VMs are what they sound like: a computer running within a computer. When running
a virtual machine, the “host” (outer) machine is protected from the actions of
the “guest” (inner) virtual machine. This is why the TrustedPDF feature of
Qubes OS uses disposable VMs as its isolation mechanism. Dangerzone also tried to
use VMs in the past, but implementing them in a multiplatform way proved
high-maintenance. Thus, Dangerzone switched back to containers, but the team
always wanted to improve Dangerzone’s security properties.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;💡 How does Dangerzone use Linux containers on Windows and macOS? It requires
&lt;a href=&quot;https://www.docker.com/products/docker-desktop/&quot;&gt;Docker Desktop&lt;/a&gt;, which runs
Linux inside a virtual machine and then runs Linux containers in it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;dangerzones-attack-surface&quot;&gt;Dangerzone’s attack surface&lt;/h2&gt;

&lt;p&gt;To understand how to protect Dangerzone users from exploits, it’s useful to
think like an attacker. When Dangerzone processes a malicious document within a
container, the first point of the attack is the application that opens the
document. Dangerzone is designed with the assumption that determined attackers
will find a vulnerability in such applications and take control of them (check
out this &lt;a href=&quot;https://github.com/freedomofpress/dangerzone/blob/main/docs/advisories/2023-12-07.md&quot;&gt;security advisory from the Dangerzone team about a recent, critical
LibreOffice
vulnerability&lt;/a&gt;).
From there on, the next point of attack is to circumvent the Linux kernel
protections for the container or directly compromise the Linux kernel.&lt;/p&gt;

&lt;p&gt;The Linux kernel, even in Docker Desktop VMs, is a very privileged component. It
has access to sensitive data, such as other files on the user’s machine or the
user’s browser history, and to the computer’s network.&lt;/p&gt;

&lt;p&gt;Processes in containers interface with the Linux kernel through
&lt;a href=&quot;https://en.wikipedia.org/wiki/System_call&quot;&gt;&lt;strong&gt;system calls&lt;/strong&gt;&lt;/a&gt; and
&lt;a href=&quot;https://opensource.com/article/19/3/virtual-filesystems-linux&quot;&gt;&lt;strong&gt;virtual filesystems&lt;/strong&gt;&lt;/a&gt;.
Attackers can try to take advantage of security bugs in the above interfaces. So
it is critical to limit the container’s access to the Linux kernel. We call this
the container’s
&lt;a href=&quot;https://en.wikipedia.org/wiki/Attack_surface&quot;&gt;&lt;strong&gt;attack surface&lt;/strong&gt;&lt;/a&gt;. The smaller
it is, the more secure a system is.&lt;/p&gt;

&lt;p&gt;Dangerzone tries to reduce its attack surface by multiple mechanisms available
to Linux containers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Removal of
&lt;a href=&quot;https://en.wikipedia.org/wiki/Capability-based_security&quot;&gt;process capabilities&lt;/a&gt;.
This reduces the set of permissions the container has in the kernel.&lt;/li&gt;
  &lt;li&gt;Removal of network access. This prevents the container from accessing the
internet to exfiltrate document data.&lt;/li&gt;
  &lt;li&gt;Filtering of allowed system calls through
&lt;a href=&quot;https://en.wikipedia.org/wiki/Seccomp&quot;&gt;seccomp&lt;/a&gt;. This reduces the set of
system calls (i.e., types of actions) that the container is allowed to make
to the kernel.&lt;/li&gt;
  &lt;li&gt;Minimal &lt;a href=&quot;https://en.wikipedia.org/wiki/User_identifier&quot;&gt;user ID&lt;/a&gt; mapping.
This reduces the risk that the container may access files belonging to users
other than the Dangerzone user on the same computer.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
  &lt;p&gt;💡 Check out the above protection measures in
&lt;a href=&quot;https://github.com/freedomofpress/dangerzone/blob/88a2d151ab4a3cb2f769998f27f251518d93bb45/dangerzone/isolation_provider/container.py#L188-L213&quot;&gt;Dangerzone’s codebase&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
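
&lt;p&gt;As a rough illustration of how the four mechanisms above map to
container-engine flags, here is a minimal Python sketch that builds a hardened
&lt;code class=&quot;highlighter-rouge&quot;&gt;podman run&lt;/code&gt; command line. It is not Dangerzone’s actual code (see
the link above for that): the function name, image name, and seccomp profile
path are hypothetical, and &lt;code class=&quot;highlighter-rouge&quot;&gt;--userns keep-id&lt;/code&gt; is specific to Podman.&lt;/p&gt;

```python
def hardened_run_args(image, command, seccomp_profile=None):
    """Build a `podman run` argument list with a reduced attack surface.

    Illustrative only: `image`, `command`, and `seccomp_profile` are
    placeholders supplied by the caller, not values used by Dangerzone.
    """
    args = ["podman", "run", "--rm"]
    # Capabilities: drop every Linux capability the container would get.
    args += ["--cap-drop", "all"]
    # Network: no connectivity, so document data cannot be exfiltrated.
    args += ["--network", "none"]
    # User IDs: map only the invoking user into the container (Podman).
    args += ["--userns", "keep-id"]
    # Seccomp: the engine applies its default filter automatically; a
    # custom profile is passed only when the caller provides one.
    if seccomp_profile is not None:
        args += ["--security-opt", "seccomp=" + seccomp_profile]
    args.append(image)
    args += list(command)
    return args


# Hypothetical image and command, used purely for demonstration.
print(" ".join(hardened_run_args("example.org/renderer", ["render", "doc"])))
```

&lt;p&gt;Keeping the flags in one function makes it easy to audit exactly which
protections every conversion container receives.&lt;/p&gt;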

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-protections.svg&quot; alt=&quot;Diagram showing that the renderer and LibreOffice make system calls to the Linux kernel, to which several filters are applied.&quot; /&gt;
&lt;figcaption&gt;Container protections employed by Dangerzone prior to 0.7.0.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This provides the container with a fair degree of isolation from the Linux
kernel. However, some attack surface remains, since:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The computer’s user is still mapped in the container. This means that a
container escape would allow the attacker to access the user’s personal
files (browser data, documents, etc.); it would be more isolated if that
were not the case.&lt;/li&gt;
  &lt;li&gt;The system call filter is still relatively permissive. The specific system
calls that are blocked are dependent on the container manager and version in
use (see
&lt;a href=&quot;https://github.com/microsoft/docker/blob/master/docs/security/seccomp.md&quot;&gt;Docker’s filters, for example&lt;/a&gt;),
but in general, the system call filter only blocks obscure or
system-admin-only system calls (e.g., rebooting, modifying systemwide
settings). It does not block containers from opening arbitrary files or
interacting with the network stack, which can still be vectors for security
bugs.&lt;/li&gt;
  &lt;li&gt;The container’s root filesystem, while ephemeral, is still writable. This
allows attackers to exploit potential vulnerabilities in Linux’s filesystem
stack.&lt;/li&gt;
  &lt;li&gt;The Linux kernel is still exposed to the container. While it is possible to
reduce the attack surface available to the container to a minimum, this
architecture still requires that the container have direct access to Linux
via system calls. So if a Linux security bug can be triggered within the set
of filtered system calls, an attack may still be successful.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-protections-annotated.svg&quot; alt=&quot;Diagram highlighting how access to the Linux kernel and the relatively permissive system filter may create exposure to bugs or vulnerabilities.&quot; /&gt;
&lt;figcaption&gt;Dangerzone's attack surface prior to 0.7.0, illustrated.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We’ve wanted to mitigate these risks for a while now, but we had to do so in a
cross-platform way and without burdening the user with administrative tasks.&lt;/p&gt;

&lt;p&gt;Enter gVisor.&lt;/p&gt;

&lt;h2 id=&quot;what-is-gvisor&quot;&gt;What is gVisor?&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://gvisor.dev&quot;&gt;&lt;strong&gt;gVisor&lt;/strong&gt;&lt;/a&gt; is a container security solution. In short, it
makes it much harder for malicious code to break out of the container boundary.
This was a great fit for Dangerzone’s security needs.&lt;/p&gt;

&lt;p&gt;An open source project written in Go, gVisor was released in May 2018 by Google
under the Apache 2.0 license. It runs on Linux and integrates with all popular
container management software, such as Docker, Podman, or Kubernetes. At its
core, gVisor is an &lt;strong&gt;application kernel&lt;/strong&gt; that implements a substantial portion
of the Linux system call interface. This means gVisor sits between a container
and the Linux kernel and plays both roles: from the container’s perspective,
gVisor acts as a &lt;strong&gt;kernel&lt;/strong&gt;, but from Linux’s perspective, gVisor is just a
regular &lt;strong&gt;application&lt;/strong&gt;. That means the container can no longer directly
interface with the Linux kernel. This is a massive reduction in attack surface.&lt;/p&gt;

&lt;p&gt;If you’re new to gVisor, the concept of not interfacing with the Linux kernel at
all may seem either quite vague or overly restrictive. That’s normal, so let’s
toy with this concept a bit for fun and illustrative purposes. Here’s a
perfectly normal sentence:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“A process opens a document on the filesystem”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And here’s how gVisor warps every single word in that sentence:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;“on the filesystem”: Nope, no such thing. The gVisor container runs in an
empty filesystem.&lt;/li&gt;
  &lt;li&gt;“opens a document”: Nuh-uh, the gVisor container does not even have the
permission to perform the &lt;code class=&quot;highlighter-rouge&quot;&gt;open&lt;/code&gt; system call. Also, there are no files to
open in the first place.&lt;/li&gt;
  &lt;li&gt;“A process”: Amusingly, the gVisor container does not even have the ability
to perform the &lt;code class=&quot;highlighter-rouge&quot;&gt;exec&lt;/code&gt; system calls. From the Linux kernel’s perspective, the
gVisor “process” looks like a typical multithreaded program, even while many
independent processes are running within the gVisor sandbox.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet, gVisor can containerize most applications without issue. For example,
the Dangerzone container image was not altered at all for the gVisor
integration.&lt;/p&gt;

&lt;p&gt;So what’s going on here?&lt;/p&gt;

&lt;p&gt;gVisor manages to pull the above trick with the help of two components:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Sentry&lt;/strong&gt; is the component that runs the containerized application. It
intercepts every system call that the application makes and reimplements it
in Go. As part of this, it may decide to do one or more system calls to the
host Linux kernel. However, it’s heavily restricted with a strict seccomp
filter (that’s why system calls like &lt;code class=&quot;highlighter-rouge&quot;&gt;open&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;socket&lt;/code&gt;, or &lt;code class=&quot;highlighter-rouge&quot;&gt;exec&lt;/code&gt; are not
allowed).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Gofer&lt;/strong&gt; is a component that runs outside the container and is responsible
for filesystem operations. The sentry may make I/O requests to the gofer.
The gofer will independently validate them, then perform these I/O
operations on the container’s behalf (that’s how the container can read
files from the host filesystem, even though &lt;code class=&quot;highlighter-rouge&quot;&gt;open&lt;/code&gt; is not allowed from the
sentry).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
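
&lt;p&gt;The division of labor between the two components can be pictured with a toy
Python model. This is a conceptual sketch only, not gVisor’s implementation
(which is written in Go): the class names, the tiny host allowlist, and the
table of exported files are all made up for illustration.&lt;/p&gt;

```python
import posixpath


class Gofer:
    """Toy gofer: the only component allowed to touch host files."""

    def __init__(self, exported_files):
        # Hypothetical table of host files the container may read.
        self._files = dict(exported_files)

    def read_file(self, path):
        # Validate independently; never trust the path the sentry sends.
        clean = posixpath.normpath(path)
        if clean not in self._files:
            raise PermissionError("gofer refused: " + path)
        return self._files[clean]


class Sentry:
    """Toy sentry: services application system calls in user space."""

    # Illustrative stand-in for the strict host seccomp allowlist.
    HOST_ALLOWED = frozenset({"read", "write", "futex"})

    def __init__(self, gofer):
        self._gofer = gofer

    def host_syscall(self, name):
        # A real seccomp filter acts in the kernel; here we just raise.
        if name not in self.HOST_ALLOWED:
            raise PermissionError("blocked by host filter: " + name)
        return name

    def handle(self, syscall, *args):
        if syscall == "open":
            # No host open() here: file I/O is delegated to the gofer.
            return self._gofer.read_file(args[0])
        if syscall == "getpid":
            # Answered entirely inside the sentry, with no host call.
            return 1
        raise NotImplementedError(syscall)


gofer = Gofer({"/doc/input.pdf": b"%PDF-1.7"})
sentry = Sentry(gofer)
print(sentry.handle("open", "/doc/input.pdf"))
```

&lt;p&gt;In this model the application’s &lt;code class=&quot;highlighter-rouge&quot;&gt;open&lt;/code&gt; succeeds even though the sentry itself
could never issue a host &lt;code class=&quot;highlighter-rouge&quot;&gt;open&lt;/code&gt;, and a path that escapes the exported set is
rejected by the gofer rather than by the sentry.&lt;/p&gt;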

&lt;p&gt;The above components are managed by a container runtime called &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt;, which
exposes the same interface as other container runtimes. This means it can be
integrated in other container management software like Podman, Docker, or
Kubernetes.&lt;/p&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-gvisor-outline.svg&quot; alt=&quot;Diagram showing a potentially vulnerable application running in the gVisor sandbox. gVisor Sentry implements the sandbox and intercepts all system calls. It services them either by making limited system calls of its own, or by asking gVisor Gofer to perform I/O system calls on its behalf. Both components are further restricted by a tailored kernel filter, along with other kernel protections.&quot; /&gt;
&lt;figcaption&gt;gVisor intercepting system calls from a sandboxed application&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;With the above architecture, gVisor blue-pills the application into thinking
that it interacts with a regular Linux kernel. In practice, gVisor reimplements
most basic features that Linux provides (memory management, scheduling, system
call interface, I/O, networking), and only issues system calls to the Linux
kernel when truly necessary, such as when it needs information from it (e.g.,
reading the document to be converted by Dangerzone).&lt;/p&gt;

&lt;p&gt;The gVisor kernel is designed to be difficult to break out of. gVisor is written
in Go. Many of Linux’s security woes stem from its use of C, which is a
memory-unsafe language. By contrast, gVisor is a regular Go application and
inherits Go’s memory safety features. This eliminates a large class of security
vulnerabilities.&lt;/p&gt;

&lt;p&gt;The gVisor kernel also has a much smaller code footprint, because unlike a
traditional kernel like Linux, it does not have to deal with things like
hardware devices, and only implements a subset of the Linux kernel interface
that is sufficient for most applications to work in practice. Because of its
smaller implementation, there are fewer moving parts, and thus
fewer opportunities for bugs to exist.&lt;/p&gt;

&lt;p&gt;Beyond its kernel indirection, gVisor also hardens itself through a bunch of
security measures on startup, some of which are similar to regular containers:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Running in its own set of namespaces (user namespace, process
namespace, network namespace, etc.) to further isolate it from the host.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;File access prevention&lt;/strong&gt;: Running in its own root with exactly zero host
files initially visible to it.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Privilege revocation&lt;/strong&gt;: Dropping all capabilities it has to ensure it runs
with the least privileges.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;System call filtering&lt;/strong&gt;: Setting a strict system call filter tuned for the
gVisor Sentry specifically.
    &lt;ul&gt;
      &lt;li&gt;As mentioned, unlike Docker or Podman’s default system call filter, this
is a &lt;em&gt;very restricted set&lt;/em&gt; of system calls. This filter blocks basic
operations like opening files, creating network connections, or
executing other processes. The presence of this filter does &lt;em&gt;not&lt;/em&gt;
prevent use of these system calls from within the gVisor sandbox;
instead, the gVisor kernel &lt;em&gt;intercepts and reimplements&lt;/em&gt; system calls
internally without needing to make a “real” system call out to the Linux
kernel.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;The gofer also uses all of the above techniques to isolate itself as much as
possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The gVisor kernel has been battle-tested by Google and other large companies
like Ant and Cloudflare. For example, searching for the text “GKE Sandbox”
(which uses gVisor) on the
&lt;a href=&quot;https://cloud.google.com/kubernetes-engine/security-bulletins&quot;&gt;GKE security bulletin&lt;/a&gt;
shows how often Linux kernel vulnerabilities arise that gVisor mitigates.
gVisor is also continuously &lt;a href=&quot;https://en.wikipedia.org/wiki/Fuzzing&quot;&gt;fuzz-tested&lt;/a&gt;
for bugs using &lt;a href=&quot;https://github.com/google/syzkaller/&quot;&gt;Syzkaller&lt;/a&gt;, an automated
kernel security testing tool.&lt;/p&gt;

&lt;p&gt;What’s the catch here? Applications that perform lots of system calls and heavy
I/O will see some performance degradation. Also, applications that rely on exotic
features of the Linux kernel may not work. In practice,
&lt;a href=&quot;https://gvisor.dev/docs/user_guide/compatibility&quot;&gt;the majority of applications do not suffer from this issue&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;integrating-gvisor-with-dangerzone&quot;&gt;Integrating gVisor with Dangerzone&lt;/h2&gt;

&lt;p&gt;So, gVisor looks like a strong candidate for Dangerzone, which is a relatively
simple application that does not make a large number of system calls. Also,
gVisor conveniently offers a container runtime that is a drop-in replacement for
use with Docker/Podman. Therefore, integrating these two projects should be
really simple, right?&lt;/p&gt;

&lt;p&gt;Well, not so fast.&lt;/p&gt;

&lt;p&gt;Dangerzone is a &lt;em&gt;multiplatform&lt;/em&gt; application, and most of its users are on
Windows and macOS. Integrating gVisor just for Linux would not cut it. At the
same time, gVisor works strictly on Linux systems, so we are at an impasse.&lt;/p&gt;

&lt;p&gt;In what is, in retrospect, a classic case of
&lt;a href=&quot;https://en.wikipedia.org/wiki/Law_of_the_instrument&quot;&gt;Maslow’s hammer&lt;/a&gt;, we
decided to solve our container problems with yet another container. The idea is
simple; why not containerize gVisor and make it run on Docker Desktop? After
all, as we already pointed out, Docker Desktop runs Linux inside a virtual
machine.&lt;/p&gt;

&lt;p&gt;By doing so, Dangerzone now has two containers with different responsibilities:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;strong&gt;outer&lt;/strong&gt; Docker/Podman container acts as the &lt;strong&gt;portability&lt;/strong&gt; layer for
Dangerzone. Its main responsibility is to bundle the necessary config files,
scripts, and programs to run gVisor. It’s also responsible for bundling the
container image that gVisor will spawn a container from.&lt;/li&gt;
  &lt;li&gt;The &lt;strong&gt;inner&lt;/strong&gt; gVisor container acts as the &lt;strong&gt;isolation&lt;/strong&gt; layer for
Dangerzone. Its sole responsibility is to run the actual Dangerzone logic
for rendering documents to pixels.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-with-gvisor.svg&quot; alt=&quot;Diagram showing the Dangerzone UI sending a document to a document renderer within an inner container, which is protected by gVisor's Sentry. The Sentry intercepts system calls, allowing only limited system calls to pass to the Linux kernel with strict security settings. I/O system calls are handled by gVisor Gofer in an outer container, with less strict but controlled permissions&quot; /&gt;
&lt;figcaption&gt;Outline of how gVisor integrates with Dangerzone. There are now two nested containers, and each one brings its own protections. Usage of LibreOffice is implied.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Running gVisor inside a container came with its own set of challenges:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Docker/Podman’s seccomp filter must allow the &lt;code class=&quot;highlighter-rouge&quot;&gt;ptrace&lt;/code&gt; system call. We
found that recent Docker Desktop versions and Podman version &amp;gt;= 4.0 have a
seccomp filter that allows this system call. For older versions, we
specified a custom seccomp filter that allowed it.&lt;/li&gt;
  &lt;li&gt;gVisor cannot run under SELinux in enforcing mode under default settings, so
we labeled the container with &lt;code class=&quot;highlighter-rouge&quot;&gt;container_engine_t&lt;/code&gt; (see GitHub issue
&lt;a href=&quot;https://github.com/freedomofpress/dangerzone/issues/880&quot;&gt;#880&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;The Docker/Podman container must run with the &lt;code class=&quot;highlighter-rouge&quot;&gt;SYS_CHROOT&lt;/code&gt; capability. This
is needed by gVisor to restrict its own access to the filesystem before it
starts document processing. Other than that, the &lt;strong&gt;outer&lt;/strong&gt; container drops
all other capabilities and privileges.&lt;/li&gt;
&lt;/ul&gt;
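
&lt;p&gt;The three adjustments above can be sketched as engine flags for the outer
container. This is a hedged Python illustration rather than Dangerzone’s real
invocation: the function name, image name, and seccomp profile path are
hypothetical, and whether a custom profile is needed depends on the Docker
Desktop or Podman version in use.&lt;/p&gt;

```python
def outer_container_args(image, custom_seccomp_profile=None):
    """Build flags for an outer container that is able to host gVisor.

    Illustrative only: `image` and `custom_seccomp_profile` are
    placeholders chosen by the caller.
    """
    args = ["podman", "run", "--rm", "--network", "none"]
    # Drop everything, then add back the single capability gVisor needs
    # in order to restrict its own view of the filesystem.
    args += ["--cap-drop", "all", "--cap-add", "SYS_CHROOT"]
    # SELinux type under which gVisor can run in enforcing mode.
    args += ["--security-opt", "label=type:container_engine_t"]
    if custom_seccomp_profile is not None:
        # Only for engines whose default filter does not allow ptrace.
        args += ["--security-opt", "seccomp=" + custom_seccomp_profile]
    args.append(image)
    return args


# Hypothetical image name, used purely for demonstration.
print(" ".join(outer_container_args("example.org/dangerzone-outer")))
```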

&lt;blockquote&gt;
  &lt;p&gt;💡 You can find more details about this integration in Dangerzone’s
&lt;a href=&quot;https://github.com/freedomofpress/dangerzone/blob/main/docs/developer/gvisor.md&quot;&gt;gVisor design doc&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;dangerzone-protections&quot;&gt;Dangerzone protections&lt;/h2&gt;

&lt;p&gt;We talked about Dangerzone’s original attack surface, and how we integrated
gVisor to reduce it. In practice though, in what ways is Dangerzone better off
than before? Well, if the Matryoshka containers are giving you a headache, or
you just skimmed to this section (no shade), here’s how the new Dangerzone
protections fare against the previous version, and the default protections of
Linux containers:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;🛡️ &lt;strong&gt;Protections&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Default&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Dangerzone (0.6.1)&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Dangerzone + gVisor (0.7.0)&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;🐧 &lt;strong&gt;Linux kernel&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Exposed&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 Exposed&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;🎉 Not exposed&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🛠️ &lt;strong&gt;System call filter&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Moderate&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 Moderate&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Strict&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🛠️ &lt;strong&gt;Capabilities&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Default&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 None&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 None&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;👤 &lt;strong&gt;Host user&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Mapped&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 Mapped&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Unmapped&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;📁 &lt;strong&gt;Filesystem&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Exposed&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 Writable&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Read-only&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🌐 &lt;strong&gt;Network&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Exposed&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Disabled&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;✌️ Disabled at two levels&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🔒 &lt;strong&gt;SELinux&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;Yes (&lt;code class=&quot;highlighter-rouge&quot;&gt;container_t&lt;/code&gt;)&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Yes (&lt;code class=&quot;highlighter-rouge&quot;&gt;container_t&lt;/code&gt;)&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #38761d&quot;&gt;👍 Yes (&lt;code class=&quot;highlighter-rouge&quot;&gt;container_engine_t&lt;/code&gt;)&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;🖥️ &lt;strong&gt;Hardware Virtualization&lt;/strong&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #505050;&quot;&gt;None&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 None&lt;/span&gt;&lt;/td&gt;
      &lt;td&gt;&lt;span style=&quot;color: #990000;&quot;&gt;👎 None&lt;/span&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As you can see, the most important protection is that &lt;strong&gt;the document conversion
process no longer has access to the Linux kernel&lt;/strong&gt;. Instead, it only has access
to the gVisor kernel (in the Sentry), and must break out of it before it can
reach the Linux kernel that it could access directly before the gVisor integration.&lt;/p&gt;

&lt;p&gt;Additionally, Dangerzone itself configures the two containers to be more secure
with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Privilege revocation: Removing all privileges and capabilities of the
document conversion process in the &lt;strong&gt;inner container&lt;/strong&gt;, and minimizing the
set of capabilities granted to the &lt;strong&gt;outer container&lt;/strong&gt; to just &lt;code class=&quot;highlighter-rouge&quot;&gt;SYS_CHROOT&lt;/code&gt;
and no other.&lt;/li&gt;
  &lt;li&gt;File modification prevention: Making the &lt;strong&gt;inner container&lt;/strong&gt;’s root
filesystem read-only.&lt;/li&gt;
  &lt;li&gt;User isolation: Running the &lt;strong&gt;outer container&lt;/strong&gt; in a user namespace that
does not include the Dangerzone UI user (available in Linux distributions
with Podman version 4.1 or greater).&lt;/li&gt;
  &lt;li&gt;Kernel security settings: Setting the &lt;strong&gt;outer container&lt;/strong&gt;’s system call
filter and SELinux label settings.&lt;/li&gt;
  &lt;li&gt;Host access prevention: Not using any mounts in either container.&lt;/li&gt;
  &lt;li&gt;Network access prevention: Disabling both containers’ ability to use
networking.&lt;/li&gt;
&lt;/ul&gt;

&lt;figure&gt;
&lt;img src=&quot;/assets/images/2024-09-23-dangerzone-with-gvisor-annotated.svg&quot; alt=&quot;Diagram highlighting how gVisor mitigates against bugs and vulnerabilities in the inner container, including exploits which escalate privileges to the outer container.&quot; /&gt;
&lt;figcaption&gt;Explanation of how Dangerzone's latest protections limit its attack surface.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Integrating the gVisor project with Dangerzone was very exciting: It’s a good
example of how gVisor can add another line of defense to a project without
requiring application-level changes.&lt;/p&gt;

&lt;p&gt;At the same time, the design complexity of the Dangerzone project increased a
bit, mostly to cater to its cross-platform nature, but honestly not that much.
Dangerzone is strongly security-focused, so we believe it’s worth the cost.&lt;/p&gt;

&lt;p&gt;We hope that this article demystifies some security aspects of containers, so
that you can use Dangerzone and gVisor with even more confidence. Feel free to
reach out to us with any questions or comments:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://notmyidea.org&quot;&gt;Alexis Métaireau&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://freedom.press/people/alex-p&quot;&gt;Alex Pyrgiotis&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://perot.me&quot;&gt;Etienne Perot&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://freedom.press/contact/&quot;&gt;Freedom of the Press Foundation (FPF)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://gvisor.dev/community&quot;&gt;gVisor community&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</content><author><name>almet</name></author><summary type="html">This article was written in collaboration with the Freedom of the Press Foundation and cross-posted on the Dangerzone blog. One of the oft-repeated sound bites of computer security advice is: “Don’t open random attachments from strangers.” If you are a journalist, however, opening attachments and documents is part of your job description. Since journalists already have a lot of security threats to worry about in dealing with sources, the safe opening of documents should not be one of them. Dangerzone was developed to solve this problem. It lets you open suspicious documents with confidence and gets out of your way. For the past few months, members of the Dangerzone team and the gVisor project collaborated on significantly improving the security properties of Dangerzone. We’re excited to announce that as of version 0.7.0, Dangerzone uses gVisor to secure its document conversion process. It is already trusted by Google and others to secure cloud products, scan Gmail attachments for viruses, etc.</summary></entry><entry><title type="html">Optimizing seccomp usage in gVisor</title><link href="/blog/2024/02/01/seccomp/" rel="alternate" type="text/html" title=" Optimizing seccomp usage in gVisor" /><published>2024-02-01T00:00:00-06:00</published><updated>2024-02-01T00:00:00-06:00</updated><id>/blog/2024/02/01/seccomp</id><content type="html" xml:base="/blog/2024/02/01/seccomp/">&lt;p&gt;gVisor is a multi-layered security sandbox. &lt;a href=&quot;https://www.kernel.org/doc/html/v4.19/userspace-api/seccomp_filter.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;&lt;/a&gt; is
gVisor’s second layer of defense against container escape attacks. gVisor uses
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; to have the host kernel filter gVisor’s own syscalls. This significantly
reduces the host attack surface that a compromised gVisor process can
reach. However, this layer comes at a cost: every legitimate system call that
gVisor makes must be evaluated against this filter by the host kernel before it
is actually executed. &lt;strong&gt;This blog post contains more than you ever wanted to
know about &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;, and explores the past few months of work to optimize
gVisor’s use of it.&lt;/strong&gt;&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp.png&quot; alt=&quot;gVisor and seccomp&quot; title=&quot;gVisor and seccomp&quot; style=&quot;max-width:100%&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;A diagram showing gVisor’s two main layers of
security: gVisor itself, and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;. This blog post touches on the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; part.
&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Tux.svg&quot;&gt;Tux logo by Larry Ewing and The GIMP&lt;/a&gt;.&lt;/span&gt;&lt;/p&gt;

&lt;h2 id=&quot;performance-considerations&quot;&gt;Understanding &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; performance in gVisor&lt;/h2&gt;

&lt;p&gt;One challenge with gVisor performance improvement ideas is that it is often very
difficult to estimate how much they will impact performance without first doing
most of the work necessary to actually implement them. Profiling tools help with
knowing where to look, but going from there to numbers is difficult.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; is one area which is actually much more straightforward to
estimate. Because it is a secondary layer of defense that lives outside of
gVisor, and it is merely a filter, we can simply yank it out of gVisor and
benchmark the performance we get. While running gVisor in this way is strictly
&lt;strong&gt;less secure&lt;/strong&gt; and not a mode that gVisor should support, the numbers we get in
this manner do provide an upper bound on the maximum &lt;em&gt;potential&lt;/em&gt; performance
gains we could see from optimizations within gVisor’s use of &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To visualize this, we can run a benchmark with the following variants:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Unsandboxed&lt;/strong&gt;: Unsandboxed performance without gVisor.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;gVisor&lt;/strong&gt;: gVisor from before any of the performance improvements described
later in this post.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;gVisor with empty filter&lt;/strong&gt;: Same as &lt;strong&gt;gVisor&lt;/strong&gt;, but with the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;
filter replaced with one that unconditionally approves every system call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From these three variants, we can separate the overhead that comes from
gVisor itself from the overhead that comes from &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering. The difference
between &lt;strong&gt;gVisor&lt;/strong&gt; and &lt;strong&gt;Unsandboxed&lt;/strong&gt; represents the total gVisor performance
overhead, and the difference between &lt;strong&gt;gVisor&lt;/strong&gt; and &lt;strong&gt;gVisor with empty filter&lt;/strong&gt;
represents the performance overhead of gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering rules.&lt;/p&gt;
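
&lt;p&gt;The arithmetic behind this breakdown can be sketched in a few lines of Python.
This is only an illustration: the function name &lt;code class=&quot;highlighter-rouge&quot;&gt;overhead_breakdown&lt;/code&gt; is
hypothetical, and the timings passed to it are made-up values, not measurements
from this post.&lt;/p&gt;

```python
def overhead_breakdown(unsandboxed, gvisor, gvisor_empty_filter):
    """Split gVisor's overhead into its total and seccomp-bpf parts.

    All three inputs are wall-clock timings of the same benchmark, in the
    same unit.
    """
    total_overhead = gvisor - unsandboxed
    seccomp_overhead = gvisor - gvisor_empty_filter
    return {
        "total_overhead": total_overhead,
        "seccomp_overhead": seccomp_overhead,
        # Fraction of the whole gVisor run spent on seccomp-bpf filtering.
        "seccomp_share_of_runtime": seccomp_overhead / gvisor,
        # Fraction of gVisor's overhead attributable to seccomp-bpf filtering.
        "seccomp_share_of_overhead": seccomp_overhead / total_overhead,
    }


# Made-up timings in seconds, for illustration only.
breakdown = overhead_breakdown(unsandboxed=50.0, gvisor=80.0, gvisor_empty_filter=74.0)
print(breakdown)
```

&lt;p&gt;With these made-up inputs, filtering accounts for a small share of total
runtime but a much larger share of the sandboxing overhead, which is the
distinction that matters when deciding whether an optimization is worthwhile.&lt;/p&gt;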

&lt;p&gt;Let’s run these numbers for the ABSL build benchmark:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-absl-empty-filter.png&quot; alt=&quot;ABSL seccomp-bpf performance&quot; title=&quot;ABSL seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can now use these numbers to give a rough breakdown of where the overhead is
coming from:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-absl-breakdown.png&quot; alt=&quot;ABSL seccomp-bpf performance breakdown&quot; title=&quot;ABSL seccomp-bpf performance breakdown&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; overhead is small in absolute terms. The numbers suggest that
optimizing the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters can shave &lt;strong&gt;up to&lt;/strong&gt;
3.4 seconds off the total ABSL build time, which represents a reduction of
total runtime by ~3.6%. However, relative to
gVisor’s overhead over unsandboxed time, this means that optimizing the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters may remove &lt;strong&gt;up to&lt;/strong&gt; ~15% of gVisor overhead, which is
significant. &lt;em&gt;(Not all benchmarks have this behavior; some benchmarks show
smaller &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;-related overhead. The overhead is also highly
platform-dependent.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Of course, this level of performance was reached with &lt;strong&gt;empty
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering rules&lt;/strong&gt;, so we cannot hope to achieve the full
gain. However, it is still useful as an upper bound. Let’s see how
much of it we can recoup without compromising security.&lt;/p&gt;

&lt;h2 id=&quot;a-primer-on-bpf-and-seccomp-bpf&quot;&gt;A primer on BPF and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;&lt;/h2&gt;

&lt;h3 id=&quot;bpf-cbpf-ebpf-oh-my&quot;&gt;BPF, cBPF, eBPF, oh my!&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Berkeley_Packet_Filter&quot;&gt;BPF (Berkeley Packet Filter)&lt;/a&gt; is a virtual machine and eponymous machine
language. Its name comes from its original purpose: filtering packets in a
kernel network stack. However, its use has expanded to other domains of the
kernel where programmability is desirable. Syscall filtering in the context of
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; is one such area.&lt;/p&gt;

&lt;p&gt;BPF itself comes in two dialects: “Classic BPF” (sometimes stylized as cBPF),
and the now-more-well-known &lt;a href=&quot;https://en.wikipedia.org/wiki/EBPF&quot;&gt;“Extended BPF” (commonly known as eBPF)&lt;/a&gt;.
eBPF is a superset of cBPF and is usable extensively throughout the kernel.
However, &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; is not one such area. While
&lt;a href=&quot;https://lwn.net/Articles/857228/&quot;&gt;the topic has been heavily debated&lt;/a&gt;, the
status quo remains that &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; filters may only use cBPF, so this post will
focus on cBPF alone.&lt;/p&gt;

&lt;h3 id=&quot;so-what-is-seccomp-bpf-exactly&quot;&gt;So what is &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; exactly?&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; is a part of the Linux kernel which allows a program to impose
syscall filters on itself. A &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter is a cBPF program that is
given syscall data as input, and outputs an “action” (a 32-bit integer) that tells
the kernel how to handle this system call: allow it, reject it, crash the program, trap
execution, etc. The kernel evaluates the cBPF program on every system call the
application makes. The “input” of this cBPF program is the byte layout of the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct, which can be loaded into the registers of the cBPF
virtual machine for analysis.&lt;/p&gt;

&lt;p&gt;Here’s what the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct looks like in
&lt;a href=&quot;https://github.com/torvalds/linux/blob/master/include/uapi/linux/seccomp.h&quot;&gt;Linux’s &lt;code class=&quot;highlighter-rouge&quot;&gt;include/uapi/linux/seccomp.h&lt;/code&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;seccomp_data&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;// 32 bits&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__u32&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;                 &lt;span class=&quot;c1&quot;&gt;// 32 bits&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__u64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;instruction_pointer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// 64 bits&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;__u64&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;              &lt;span class=&quot;c1&quot;&gt;// 64 bits × 6&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;                              &lt;span class=&quot;c1&quot;&gt;// Total 512 bits&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
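
&lt;p&gt;Because a cBPF filter addresses this struct by byte offset, it helps to see
the offsets spelled out. The following sketch mirrors the struct with Python’s
&lt;code class=&quot;highlighter-rouge&quot;&gt;ctypes&lt;/code&gt; module purely to inspect its layout; the class name
&lt;code class=&quot;highlighter-rouge&quot;&gt;SeccompData&lt;/code&gt; is hypothetical, and it assumes a platform where C’s
&lt;code class=&quot;highlighter-rouge&quot;&gt;int&lt;/code&gt; is 32 bits wide.&lt;/p&gt;

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Mirror of struct seccomp_data, used here only to inspect its layout."""

    _fields_ = [
        ("nr", ctypes.c_int32),                    # syscall number
        ("arch", ctypes.c_uint32),                 # AUDIT_ARCH_* value
        ("instruction_pointer", ctypes.c_uint64),
        ("args", ctypes.c_uint64 * 6),             # six 64-bit syscall arguments
    ]


# Byte offset of each field, as a cBPF load instruction would address it.
offsets = {name: getattr(SeccompData, name).offset for name, _ in SeccompData._fields_}
total_bits = ctypes.sizeof(SeccompData) * 8
print(offsets, total_bits)
```

&lt;p&gt;The offsets of &lt;code class=&quot;highlighter-rouge&quot;&gt;nr&lt;/code&gt; (0) and &lt;code class=&quot;highlighter-rouge&quot;&gt;arch&lt;/code&gt; (4) are the ones used by the
sample filter in the next section.&lt;/p&gt;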

&lt;h3 id=&quot;sample-filter&quot;&gt;Sample &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter&lt;/h3&gt;

&lt;p&gt;Here is an example &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter, adapted from the
&lt;a href=&quot;https://www.kernel.org/doc/Documentation/networking/filter.txt&quot;&gt;Linux kernel documentation&lt;/a&gt;&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;!-- Markdown note: This uses &quot;javascript&quot; syntax highlighting because that
     happens to work pretty well with this pseudo-assembly-like language.
     It is not actually JavaScript. --&gt;

&lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load 32 bits at offsetof(struct seccomp_data, arch) (= 4)&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   of the seccomp_data input struct into register A.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xc000003e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == AUDIT_ARCH_X86_64, jump by 0 instructions [to 02]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 11 instructions [to 13].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load 32 bits at offsetof(struct seccomp_data, nr) (= 0)&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   of the seccomp_data input struct into register A.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// If A == __NR_rt_sigreturn, jump by 10 instructions [to 14]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 0 instructions [to 04].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;231&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// If A == __NR_exit_group, jump by 9 instructions [to 14]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 0 instructions [to 05].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// If A == __NR_exit, jump by 8 instructions [to 14]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 0 instructions [to 06].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;06&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_read.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_write.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_fstat.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;09&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_mmap.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_rt_sigprocmask.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Same thing for __NR_rt_sigaction.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt;  &lt;span class=&quot;mi&quot;&gt;35&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// If A == __NR_nanosleep, jump by 1 instruction [to 14]&lt;/span&gt;
                            &lt;span class=&quot;c1&quot;&gt;//   else jump by 0 instructions [to 13].&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Return SECCOMP_RET_KILL_THREAD&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x7fff0000&lt;/span&gt;       &lt;span class=&quot;c1&quot;&gt;// Return SECCOMP_RET_ALLOW&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This filter effectively allows only the following syscalls: &lt;code class=&quot;highlighter-rouge&quot;&gt;rt_sigreturn&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;exit_group&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;exit&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;read&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;write&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;fstat&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;mmap&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;rt_sigprocmask&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;rt_sigaction&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;nanosleep&lt;/code&gt;. All other syscalls result in the calling thread
being killed.&lt;/p&gt;
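
&lt;p&gt;To make the jump arithmetic concrete, here is a minimal Python sketch that
emulates the filter above. It is not how the kernel evaluates cBPF: the tuple
encoding and the &lt;code class=&quot;highlighter-rouge&quot;&gt;run&lt;/code&gt; helper are hypothetical, and only the three
instruction kinds used by this filter are supported.&lt;/p&gt;

```python
ALLOW = 0x7FFF0000              # SECCOMP_RET_ALLOW
KILL_THREAD = 0                 # SECCOMP_RET_KILL_THREAD
AUDIT_ARCH_X86_64 = 0xC000003E

# The sample filter, one tuple per instruction (15 instructions, 00 to 14).
ALLOWED_NRS = [15, 231, 60, 0, 1, 5, 9, 14, 13, 35]
FILTER = [("load32", 4), ("jeq", AUDIT_ARCH_X86_64, 0, 11), ("load32", 0)]
FILTER += [("jeq", nr, 10 - i, 0) for i, nr in enumerate(ALLOWED_NRS)]
FILTER += [("ret", KILL_THREAD), ("ret", ALLOW)]


def run(program, nr, arch=AUDIT_ARCH_X86_64):
    """Evaluate the program against a seccomp_data with the given nr and arch."""
    data = {0: nr, 4: arch}     # byte offset -> 32-bit field value
    a = 0                       # register A
    pc = 0                      # index of the current instruction
    while True:
        op, *operands = program[pc]
        if op == "load32":
            a = data[operands[0]]
            pc += 1
        elif op == "jeq":
            k, if_true, if_false = operands
            # Jump offsets are relative to the instruction after the jump.
            pc += 1 + (if_true if a == k else if_false)
        else:                   # "ret"
            return operands[0]


print(hex(run(FILTER, nr=1)))   # write: allowed
print(hex(run(FILTER, nr=2)))   # open: not in the list, so the thread is killed
```

&lt;p&gt;Note that every jump offset is counted from the instruction that follows the
jump, which is why &lt;code class=&quot;highlighter-rouge&quot;&gt;jeq 35, 1, 0&lt;/code&gt; at index 12 lands on index 14 when the
comparison succeeds.&lt;/p&gt;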

&lt;h3 id=&quot;cbpf-limitations&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; and cBPF limitations&lt;/h3&gt;

&lt;p&gt;cBPF is quite limited as a language. The following limitations all factor into
the optimizations described in this blog post:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The cBPF virtual machine has only two 32-bit registers, and a tertiary
pseudo-register for a 32-bit immediate value. (Note that syscall arguments
evaluated in the context of &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; are 64-bit values, so you can already
foresee that this leads to complications.)&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs are limited to 4,096 instructions.&lt;/li&gt;
  &lt;li&gt;Jump instructions can only go forward (this guarantees that programs
terminate).&lt;/li&gt;
  &lt;li&gt;Jump instructions may only jump by a fixed (“immediate”) number of
instructions. (You cannot say: “jump by whatever this register says”.)&lt;/li&gt;
  &lt;li&gt;Jump instructions come in two flavors:
    &lt;ul&gt;
      &lt;li&gt;“Unconditional” jump instructions, which jump by a fixed number of
instructions. This number is stored in the instruction’s 32-bit immediate
field, so in practice it is bounded only by the program size limit.&lt;/li&gt;
      &lt;li&gt;“Conditional” jump instructions, which include a condition expression
and two jump targets:
        &lt;ul&gt;
          &lt;li&gt;The number of instructions to jump by if the condition is true. This
number must fit in 8 bits, so this cannot jump by more than 255
instructions.&lt;/li&gt;
          &lt;li&gt;The number of instructions to jump by if the condition is false.
This number must fit in 8 bits, so this cannot jump by more than 255
instructions.&lt;/li&gt;
        &lt;/ul&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
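
&lt;p&gt;The register-width limitation is the one with the most visible consequences:
checking a single 64-bit syscall argument takes two 32-bit loads and two
comparisons. The sketch below illustrates the idea in Python. The helper names
are hypothetical, and the offsets assume a little-endian host, where the low
half of each argument comes first in memory.&lt;/p&gt;

```python
def halves(value):
    """Split a 64-bit value into its (high, low) 32-bit halves."""
    return divmod(value, 2**32)


def arg_offsets(index):
    """Byte offsets of the (high, low) halves of args[index] in seccomp_data.

    Assumes a little-endian host; args starts at byte offset 16.
    """
    low = 16 + 8 * index
    return low + 4, low


def arg_equals(args, index, expected):
    """Compare args[index] with expected using only 32-bit comparisons."""
    high, low = halves(args[index])
    want_high, want_low = halves(expected)
    return high == want_high and low == want_low


# The low halves match, but the high halves differ, so the check must fail.
print(arg_equals([0x100000002], 0, 2))
```

&lt;p&gt;A filter that compared only the low half would wrongly accept the value in
this example, which is why both halves have to be checked.&lt;/p&gt;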

&lt;h3 id=&quot;seccomp-bpf-caching-in-linux&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; caching in Linux&lt;/h3&gt;

&lt;p&gt;Since
&lt;a href=&quot;https://www.phoronix.com/news/Linux-5.11-SECCOMP-Performance&quot;&gt;Linux kernel version 5.11&lt;/a&gt;,
when a program uploads a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter into the kernel,
&lt;a href=&quot;https://github.com/torvalds/linux/commit/8e01b51a31a1e08e2c3e8fcc0ef6790441be2f61&quot;&gt;Linux runs a BPF emulator&lt;/a&gt;
that looks for system call numbers where the BPF program doesn’t do any fancy
operations nor load any bits from the &lt;code class=&quot;highlighter-rouge&quot;&gt;instruction_pointer&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;args&lt;/code&gt; fields of
the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; input struct, and still returns “allow”. When this is the
case, &lt;strong&gt;Linux will cache this information&lt;/strong&gt; in a per-syscall-number bitfield.&lt;/p&gt;

&lt;p&gt;Later, when a cacheable syscall number is executed, the BPF program is not
evaluated at all; since the kernel knows that the program is deterministic and
doesn’t depend on the syscall arguments, it can safely allow the syscall without
actually running the BPF program.&lt;/p&gt;

&lt;p&gt;This post uses the term “cacheable” to refer to syscalls that match these
criteria.&lt;/p&gt;
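
&lt;p&gt;The idea can be sketched as follows. This is a simplified illustration, not
the kernel’s actual emulator: the tuple encoding, the
&lt;code class=&quot;highlighter-rouge&quot;&gt;is_cacheable&lt;/code&gt; helper, and the small filter are all hypothetical, and only
three instruction kinds are handled.&lt;/p&gt;

```python
ALLOW = 0x7FFF0000              # SECCOMP_RET_ALLOW
KILL = 0                        # SECCOMP_RET_KILL_THREAD
AUDIT_ARCH_X86_64 = 0xC000003E

# Hypothetical filter: read (nr 0) is always allowed; write (nr 1) is allowed
# only when the low 32 bits of args[0] equal 1; everything else is killed.
FILTER = [
    ("load32", 0),      # 00: A = nr
    ("jeq", 0, 4, 0),   # 01: read -> 06, otherwise -> 02
    ("jeq", 1, 0, 2),   # 02: write -> 03, otherwise -> 05
    ("load32", 16),     # 03: A = low half of args[0]
    ("jeq", 1, 1, 0),   # 04: equal to 1 -> 06, otherwise -> 05
    ("ret", KILL),      # 05
    ("ret", ALLOW),     # 06
]


def is_cacheable(program, nr, arch=AUDIT_ARCH_X86_64):
    """True if the filter allows nr without reading instruction_pointer or args."""
    known = {0: nr, 4: arch}    # fields whose values are known ahead of time
    a, pc = 0, 0
    while True:
        op, *operands = program[pc]
        if op == "load32":
            if operands[0] not in known:
                return False    # the outcome depends on per-call data
            a = known[operands[0]]
            pc += 1
        elif op == "jeq":
            k, if_true, if_false = operands
            pc += 1 + (if_true if a == k else if_false)
        else:                   # "ret"
            return operands[0] == ALLOW


cacheable = [nr for nr in range(4) if is_cacheable(FILTER, nr)]
print(cacheable)                # -> [0]
```

&lt;p&gt;Here only &lt;code class=&quot;highlighter-rouge&quot;&gt;read&lt;/code&gt; is cacheable: &lt;code class=&quot;highlighter-rouge&quot;&gt;write&lt;/code&gt; is allowed for some arguments
but not others, so the filter has to run on every &lt;code class=&quot;highlighter-rouge&quot;&gt;write&lt;/code&gt; call.&lt;/p&gt;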

&lt;h2 id=&quot;how-gvisor-builds-its-seccomp-bpf-filter&quot;&gt;How gVisor builds its &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter&lt;/h2&gt;

&lt;p&gt;gVisor imposes a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter on itself as part of Sentry start-up. This
process works as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;gVisor gathers bits of configuration that are relevant to the construction
of its &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter. This includes which platform is in use, whether
certain features that require looser filtering are enabled (e.g. host
networking, profiling, GPU proxying, etc.), and certain file descriptors
(FDs) which may be checked against syscall arguments that pass in FDs.&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor generates a sequence of rulesets from this configuration. A ruleset
is a mapping from syscall number to a predicate that must be true for this
system call, along with an “action” (return code) that is taken should this
predicate be satisfied. For ease of human understanding, the predicate is
often written as a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Logical_disjunction&quot;&gt;disjunctive rule&lt;/a&gt;, for
which each sub-rule is a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Logical_conjunction&quot;&gt;conjunctive rule&lt;/a&gt; that
verifies each syscall argument. In other words, &lt;code class=&quot;highlighter-rouge&quot;&gt;(fA(args[0]) &amp;amp;&amp;amp; fB(args[1])
&amp;amp;&amp;amp; ...) || (fC(args[0]) &amp;amp;&amp;amp; fD(args[1]) &amp;amp;&amp;amp; ...) || ...&lt;/code&gt;. This is represented
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/runsc/boot/filter/config/config_main.go&quot;&gt;in gVisor code&lt;/a&gt;
as follows:&lt;/p&gt;

    &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;          &lt;span class=&quot;c&quot;&gt;// Disjunction rule&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Conjunction rule over each syscall argument&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Predicate for `seccomp_data.args[0]`&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fB&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Predicate for `seccomp_data.args[1]`&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;// ... More predicates can go here (up to 6 arguments per syscall)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Conjunction rule over each syscall argument&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Predicate for `seccomp_data.args[0]`&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;// Predicate for `seccomp_data.args[1]`&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;// ... More predicates can go here (up to 6 arguments per syscall)&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor performs several optimizations on this data structure.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor then renders this list of rulesets into a linear program that looks
close to the final machine language, other than jump offsets which are
initially represented as symbolic named labels during the rendering process.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor then resolves all the labels to their actual instruction index, and
computes the actual jump targets of all jump instructions to obtain valid
cBPF machine code.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;gVisor runs further optimizations on this cBPF bytecode.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;Finally, the cBPF bytecode is uploaded into the host kernel and the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter becomes effective.&lt;/li&gt;
&lt;/ul&gt;
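
&lt;p&gt;The disjunction-of-conjunctions shape of a rule can be illustrated with a
short Python sketch. The helpers below (&lt;code class=&quot;highlighter-rouge&quot;&gt;any_of&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;per_arg&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;equal_to&lt;/code&gt;) are hypothetical stand-ins that evaluate a rule directly; they
are not gVisor’s Go types, which are compiled to cBPF rather than interpreted.&lt;/p&gt;

```python
def per_arg(*predicates):
    """Conjunction: predicate i must hold for args[i]."""
    def check(args):
        return all(pred(arg) for pred, arg in zip(predicates, args))
    return check


def any_of(*rules):
    """Disjunction: at least one sub-rule must accept the arguments."""
    def check(args):
        return any(rule(args) for rule in rules)
    return check


def equal_to(value):
    """Predicate that accepts exactly one argument value."""
    return lambda arg: arg == value


# Hypothetical rule: allow when args[0] == 1, or when args[0] == 2 and args[1] == 0.
rule = any_of(
    per_arg(equal_to(1)),
    per_arg(equal_to(2), equal_to(0)),
)

print(rule([1, 99, 0, 0, 0, 0]))    # first sub-rule matches
print(rule([2, 1, 0, 0, 0, 0]))     # neither sub-rule matches
```

&lt;p&gt;Arguments without a predicate are left unconstrained in this sketch, which is
why the first sub-rule accepts any value for the second argument.&lt;/p&gt;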

&lt;p&gt;Optimizing the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter to be more efficient allows the program to
be more compact (i.e. it’s possible to pack more complex filters in the 4,096
instruction limit), and to run faster. While &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; evaluation is
measured in nanoseconds, the impact of any optimization is magnified here,
because host syscalls are an important part of the synchronous “syscall hot
path” that must execute as part of handling certain performance-sensitive
syscalls from the sandboxed application. The relationship is not 1-to-1: a single
application syscall may result in several host syscalls, especially due to
&lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt; which the Sentry calls many times to synchronize its own operations.
Therefore, shaving a nanosecond here and there results in several shaved
nanoseconds in the syscall hot path.&lt;/p&gt;

&lt;h2 id=&quot;structure&quot;&gt;Structural optimizations&lt;/h2&gt;

&lt;p&gt;The first optimization done for gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; was to turn its linear
search over syscall numbers into a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Binary_search_tree&quot;&gt;binary search tree&lt;/a&gt;. This
turns the search for syscall numbers from &lt;code class=&quot;highlighter-rouge&quot;&gt;O(n)&lt;/code&gt; to &lt;code class=&quot;highlighter-rouge&quot;&gt;O(log n)&lt;/code&gt; instructions.
This is a very common &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; optimization technique which is replicated
in other projects such as
&lt;a href=&quot;https://github.com/seccomp/libseccomp/issues/116&quot;&gt;libseccomp&lt;/a&gt; and Chromium.&lt;/p&gt;

&lt;p&gt;To do this, a cBPF program loads the 32-bit &lt;code class=&quot;highlighter-rouge&quot;&gt;nr&lt;/code&gt; (syscall number)
field of the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct, and does a binary tree traversal of the
&lt;a href=&quot;https://chromium.googlesource.com/chromiumos/docs/+/HEAD/constants/syscalls.md#tables&quot;&gt;syscall number space&lt;/a&gt;.
When it finds a match, it jumps to a set of instructions that check that
syscall’s arguments for validity, and then returns allow/reject.&lt;/p&gt;

&lt;p&gt;But why stop here? Let’s go further.&lt;/p&gt;

&lt;p&gt;The problem with the binary search tree approach is that it treats all syscall
numbers equally. This is a problem for three reasons:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Performance for disallowed syscalls does not matter, because such syscalls
should never happen during normal program execution.&lt;/li&gt;
  &lt;li&gt;Performance for syscalls which the kernel can cache does not matter either,
because the BPF program only has to run once for these system calls.&lt;/li&gt;
  &lt;li&gt;For the system calls which are allowed but are not cacheable by the kernel,
there is a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Pareto_distribution&quot;&gt;Pareto distribution&lt;/a&gt; of
their relative frequency. To exploit this we should evaluate the most-often
used syscalls faster than the least-often used ones. The binary tree
structure does not exploit this distribution, and instead treats all
syscalls equally.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So gVisor splits syscall numbers into four sets:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;🅰: Non-cacheable 🅰llowed, called very frequently.&lt;/li&gt;
  &lt;li&gt;🅱: Non-cacheable allowed, called once in a 🅱lue moon.&lt;/li&gt;
  &lt;li&gt;🅲: 🅲acheable allowed (whether called frequently or not).&lt;/li&gt;
  &lt;li&gt;🅳: 🅳isallowed (which, by definition, is neither cacheable nor expected to
ever be called).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, the cBPF program is structured in the following layout:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Linear search over allowed frequently-called non-cacheable syscalls (🅰).
These syscalls are ordered from most to least frequently called (e.g. &lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt;
is the first one as it is by far the most-frequently-called system call).&lt;/li&gt;
  &lt;li&gt;Binary search over allowed infrequently-called non-cacheable syscalls (🅱).&lt;/li&gt;
  &lt;li&gt;Binary search over allowed cacheable syscalls (🅲).&lt;/li&gt;
  &lt;li&gt;Reject anything else (🅳).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure takes full advantage of the kernel caching functionality, and of
the Pareto distribution of syscalls.&lt;/p&gt;
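
&lt;p&gt;As a rough illustration of why this layout helps, the sketch below estimates
how many syscall-number checks are needed before a given syscall is resolved.
The syscall grouping, the function names, and the cost model (one check per
linear entry, one per tree level) are assumptions made for illustration only.
They are not gVisor’s actual filter tables or measurements.&lt;/p&gt;

```python
# Hypothetical grouping, for illustration only; not gVisor's real tables.
HOT = ['futex', 'nanosleep', 'epoll_pwait']        # set A, most frequent first
COLD = ['ioctl', 'mmap', 'sendmsg', 'recvmsg']     # set B
CACHEABLE = ['getpid', 'gettid', 'clock_gettime']  # set C


def bst_depth(n):
    """Worst-case number of nodes visited in a balanced BST of n nodes."""
    return n.bit_length()


def checks(name):
    """Upper bound on syscall-number checks before `name` is resolved."""
    if name in HOT:
        # Linear search: the i-th entry is reached after i + 1 checks.
        return HOT.index(name) + 1
    cost = len(HOT)  # every hot entry was checked and missed
    if name in COLD:
        return cost + bst_depth(len(COLD))
    # Missed the first tree; cacheable and disallowed syscalls both
    # go through the second tree before being resolved.
    return cost + bst_depth(len(COLD)) + bst_depth(len(CACHEABLE))
```

&lt;p&gt;Under these assumptions, the hottest syscall is resolved after a single
check, while a disallowed syscall pays for every search in the program. That
is an acceptable trade, since disallowed syscalls are not expected to happen.&lt;/p&gt;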

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;binary-search-tree-optimizations&quot;&gt;Binary search tree optimizations&lt;/h3&gt;

    &lt;p&gt;Beyond classifying syscalls to see which binary search tree they should be a
part of, gVisor also optimizes the binary search process itself.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Each syscall number is a node in the tree. When traversing the tree, there are
three options at each point:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;The syscall number is an exact match&lt;/li&gt;
    &lt;li&gt;The syscall number is lower than the node’s value&lt;/li&gt;
    &lt;li&gt;The syscall number is higher than the node’s value&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;In order to render the BST as cBPF bytecode, gVisor used to render the following
(in pseudocode):&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;syscall&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;number&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;rules_for_this_syscall&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;syscall&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;number&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;
    &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;left_node&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;right_node&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;rules_for_this_syscall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Render bytecode for this syscall's filters here...&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;left_node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Recursively render the bytecode for the left node value here...&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;right_node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Recursively render the bytecode for the right node value here...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Keep in mind the &lt;a href=&quot;#cbpf-limitations&quot;&gt;cBPF limitations&lt;/a&gt; here. Conditional jumps
are limited to 255 instructions, but &lt;code class=&quot;highlighter-rouge&quot;&gt;@left_node&lt;/code&gt; can be more than 255
instructions away (especially for syscalls with complex filtering rules like
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/ioctl.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt;&lt;/a&gt;), and
&lt;code class=&quot;highlighter-rouge&quot;&gt;@right_node&lt;/code&gt; is almost certainly more than 255 instructions away. This
means that in actual cBPF bytecode, we would often need conditional jumps
followed by unconditional jumps in order to jump that far forward. Meanwhile,
the jump to &lt;code class=&quot;highlighter-rouge&quot;&gt;@rules_for_this_syscall&lt;/code&gt; would be a very short hop away, but
each traversal would benefit from this locality at only a single node of the
tree.&lt;/p&gt;

  &lt;p&gt;Consider this structure instead:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Traversal code:&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;syscall&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;number&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;left_node&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;syscall&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;number&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;current&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;value&lt;/span&gt;
      &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;right_node&lt;/span&gt;
  &lt;span class=&quot;nx&quot;&gt;jump&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;rules_for_this_syscall&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;left_node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Recursively render only the traversal code for the left node here&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;right_node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Recursively render only the traversal code for the right node here&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Filtering code:&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;rules_for_this_syscall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// Render bytecode for this syscall's filters here&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Recursively render only the filtering code for the left node here&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Recursively render only the filtering code for the right node here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This effectively separates the per-syscall rules from the traversal of the BST.
This ensures that the traversal can be done entirely using conditional jumps,
and that for any given execution of the cBPF program, there will be at most one
unconditional jump to the syscall-specific rules.&lt;/p&gt;

  &lt;p&gt;This structure can be improved further by taking advantage of the fact that
syscall numbers are a dense space, and so are syscall filter rules. This means
we can often avoid needless comparisons. For example, given the following tree:&lt;/p&gt;

  &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;      22
     /  \
    9    24
   /    /  \
  8   23    50
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Notice that the tree contains &lt;code class=&quot;highlighter-rouge&quot;&gt;22&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;23&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;24&lt;/code&gt;. This means that if we get to
node &lt;code class=&quot;highlighter-rouge&quot;&gt;23&lt;/code&gt;, we do not need to check for syscall number equality: reaching it
requires the syscall number to be greater than &lt;code class=&quot;highlighter-rouge&quot;&gt;22&lt;/code&gt; and lower than &lt;code class=&quot;highlighter-rouge&quot;&gt;24&lt;/code&gt;, so the
traversal has already established that the syscall number must be &lt;code class=&quot;highlighter-rouge&quot;&gt;23&lt;/code&gt;.&lt;/p&gt;
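
  &lt;p&gt;A minimal sketch of how such nodes could be detected follows. It tracks the
inclusive range of syscall numbers that can still reach each node and reports
the nodes where that range has shrunk to a single number. The &lt;code class=&quot;highlighter-rouge&quot;&gt;Node&lt;/code&gt; class and the
function name are hypothetical and are not taken from gVisor’s code.&lt;/p&gt;

```python
class Node:
    """One syscall number in the binary search tree."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def nodes_without_equality_check(node, lo, hi, out=None):
    """Collect node values that the traversal alone pins to one number.

    [lo, hi] is the inclusive range of syscall numbers that can still reach
    `node`, as implied by the comparisons made higher up in the tree.
    """
    if out is None:
        out = []
    if node is None:
        return out
    if lo == hi:
        # Only one number can get here, so it must equal node.value.
        out.append(node.value)
    # Going left means the number is lower than node.value, going right
    # means it is higher, which narrows the range on each side.
    nodes_without_equality_check(node.left, lo, node.value - 1, out)
    nodes_without_equality_check(node.right, node.value + 1, hi, out)
    return out


# The example tree from the text.
tree = Node(22, Node(9, Node(8)), Node(24, Node(23), Node(50)))
skippable = nodes_without_equality_check(tree, 0, 2**32 - 1)
```

  &lt;p&gt;For the example tree, only node &lt;code class=&quot;highlighter-rouge&quot;&gt;23&lt;/code&gt; qualifies: every other node can still be
reached by more than one syscall number.&lt;/p&gt;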

&lt;/details&gt;

&lt;h2 id=&quot;cbpf-bytecode-optimizations&quot;&gt;cBPF bytecode optimizations&lt;/h2&gt;

&lt;p&gt;gVisor now implements a
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/bpf/optimizer.go&quot;&gt;bytecode-level cBPF optimizer&lt;/a&gt;
running a few lossless optimizations. These optimizations are run repeatedly
until the bytecode no longer changes. This is because each type of optimization
tends to feed on the fruits of the others, as we’ll see below.&lt;/p&gt;
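
&lt;p&gt;The “run until nothing changes” loop can be sketched as follows. The function
name and the representation of a pass (a function that takes a program and
returns a new one without modifying its input) are assumptions made for
illustration, not the interface of gVisor’s optimizer.&lt;/p&gt;

```python
def optimize(prog, passes):
    """Apply every pass in turn until a full round changes nothing.

    Each pass must return a new program rather than mutate its argument,
    and must never make the program longer, otherwise the loop may not end.
    """
    while True:
        before = prog
        for run_pass in passes:
            prog = run_pass(prog)
        if prog == before:
            return prog
```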

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-sentry-filter-size.png&quot; alt=&quot;gVisor sentry seccomp-bpf filter program size&quot; title=&quot;gVisor sentry seccomp-bpf filter program size&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program size is reduced by over a factor of 4 using the
optimizations below.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;optimizing-cbpf-jumps&quot;&gt;Optimizing cBPF jumps&lt;/h3&gt;

    &lt;p&gt;The &lt;a href=&quot;#cbpf-limitations&quot;&gt;limitations of cBPF jump instructions described earlier&lt;/a&gt;
mean that typical BPF bytecode rendering code will usually favor unconditional
jumps even when they are not necessary. However, they can be optimized after the
fact.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Typical BPF bytecode rendering code usually renders a simple condition
as follows:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If &amp;lt;condition&amp;gt; is true, continue,&lt;/span&gt;
                          &lt;span class=&quot;c1&quot;&gt;//   otherwise skip over 1 instruction.&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition_was_true&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Unconditional jump to label @condition_was_true.&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition_was_false&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Unconditional jump to label @condition_was_false.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… or as follows:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If &amp;lt;condition&amp;gt; is true, jump by 1 instruction,&lt;/span&gt;
                          &lt;span class=&quot;c1&quot;&gt;//   otherwise continue.&lt;/span&gt;
&lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition_was_false&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Unconditional jump to label @condition_was_false.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Flow through here if the condition was true.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… In other words, the generated code always uses unconditional jumps, and
conditional jump offsets are always either 0 or 1 instructions forward. This is
because conditional jumps are limited to 8 bits (255 instructions), and it is
not always possible at BPF bytecode rendering time to know ahead of time that
the jump targets (&lt;code class=&quot;highlighter-rouge&quot;&gt;@condition_was_true&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;@condition_was_false&lt;/code&gt;) will resolve to
an instruction that is close enough ahead that the offset would fit in 8 bits.
The safe thing to do is to always use an unconditional jump. Since unconditional
jump offsets are stored in the instruction’s 32-bit &lt;code class=&quot;highlighter-rouge&quot;&gt;k&lt;/code&gt; field, and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs are limited
to 4,096 instructions, it is always possible to encode a jump using an
unconditional jump instruction.&lt;/p&gt;

  &lt;p&gt;But of course, the jump target often &lt;em&gt;does&lt;/em&gt; fit in 8 bits. So gVisor looks over
the bytecode for optimization opportunities:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Conditional jumps that jump to unconditional jumps&lt;/strong&gt; are rewritten to
their final destination, so long as this fits within the 255-instruction
conditional jump limit.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Unconditional jumps that jump to other unconditional jumps&lt;/strong&gt; are rewritten
to their final destination.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Conditional jumps where both branches jump to the same instruction&lt;/strong&gt; are
replaced by an unconditional jump to that instruction.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Unconditional jumps with a zero-instruction jump target&lt;/strong&gt; are removed.&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;The aim of these optimizations is to clean up after needless indirection that is
a byproduct of cBPF bytecode rendering code. Once they all have run, all jumps
are as tight as they can be.&lt;/p&gt;
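
  &lt;p&gt;The first two rewrites in the list above can be sketched as follows. The
tuple-based instruction model and the function names are hypothetical
stand-ins for real cBPF instructions. Offsets are relative to the next
instruction, as in cBPF, and the program is assumed to be valid (forward-only
jumps that land inside the program).&lt;/p&gt;

```python
MAX_COND_OFFSET = 255  # conditional jump offsets are 8 bits wide


def final_target(prog, pc):
    """Follow a chain of unconditional jumps starting at index pc."""
    while prog[pc][0] == 'jmp':
        pc = pc + 1 + prog[pc][1]
    return pc


def thread_jumps(prog):
    """Return a copy of prog whose jumps skip intermediate 'jmp' hops.

    Instructions are tuples: ('jmp', k), ('jif', cond, jt, jf), or anything
    else, which is treated as a non-jumping instruction.
    """
    out = []
    for pc, ins in enumerate(prog):
        if ins[0] == 'jmp':
            dest = final_target(prog, pc + 1 + ins[1])
            ins = ('jmp', dest - pc - 1)
        elif ins[0] == 'jif':
            offsets = []
            for off in ins[2:]:
                dest = final_target(prog, pc + 1 + off)
                new_off = dest - pc - 1
                # Keep the old offset if the new one does not fit in 8 bits.
                offsets.append(new_off if MAX_COND_OFFSET >= new_off else off)
            ins = ('jif', ins[1], offsets[0], offsets[1])
        out.append(ins)
    return out
```

  &lt;p&gt;After this pass, the unconditional jumps that were only used as stepping
stones are no longer reachable. They stay in the program until dead code
removal deletes them.&lt;/p&gt;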

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;removing-dead-code&quot;&gt;Removing dead code&lt;/h3&gt;

    &lt;p&gt;Because cBPF is a very restricted language, it is possible to determine with
certainty that some instructions can never be reached.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;In cBPF, each instruction either:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;strong&gt;Flows&lt;/strong&gt; forward (e.g. &lt;code class=&quot;highlighter-rouge&quot;&gt;load&lt;/code&gt; operations, math operations).&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Jumps&lt;/strong&gt; by a fixed (immediate) number of instructions.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Stops&lt;/strong&gt; the execution immediately (&lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instructions).&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;Therefore, gVisor runs a simple program traversal algorithm. It creates a
bitfield with one bit per instruction, then traverses the program and all its
possible branches. Then, all instructions that were never traversed are removed
from the program, and all jump targets are updated to account for these
removals.&lt;/p&gt;

  &lt;p&gt;In turn, this makes the program shorter, which makes more jump optimizations
possible.&lt;/p&gt;
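
  &lt;p&gt;A minimal sketch of this traversal follows. It uses a hypothetical
tuple-based instruction model, with a list of booleans standing in for the
bitfield, and assumes a valid program whose execution always ends on a
&lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt;. The function names are illustrative and are not gVisor’s.&lt;/p&gt;

```python
def successors(prog, pc):
    """Indices that may execute right after the instruction at pc."""
    ins = prog[pc]
    if ins[0] == 'ret':
        return []                                   # stops
    if ins[0] == 'jmp':
        return [pc + 1 + ins[1]]                    # jumps
    if ins[0] == 'jif':
        return [pc + 1 + ins[2], pc + 1 + ins[3]]   # jumps, two ways
    return [pc + 1]                                 # flows


def remove_dead_code(prog):
    """Drop unreachable instructions and fix up relative jump offsets."""
    # Mark every instruction reachable from the entry point.
    reachable = [False] * len(prog)
    stack = [0]
    while stack:
        pc = stack.pop()
        if reachable[pc]:
            continue
        reachable[pc] = True
        stack.extend(successors(prog, pc))

    # new_index[pc] is where instruction pc lands once dead ones are gone.
    new_index, kept = [], 0
    for live in reachable:
        new_index.append(kept)
        if live:
            kept += 1

    def retarget(pc, off):
        return new_index[pc + 1 + off] - new_index[pc] - 1

    out = []
    for pc, ins in enumerate(prog):
        if not reachable[pc]:
            continue
        if ins[0] == 'jmp':
            ins = ('jmp', retarget(pc, ins[1]))
        elif ins[0] == 'jif':
            ins = ('jif', ins[1], retarget(pc, ins[2]), retarget(pc, ins[3]))
        out.append(ins)
    return out
```

  &lt;p&gt;Because removing instructions can only shrink the distance between a jump and
its target, offsets that fit before the pass still fit after it.&lt;/p&gt;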

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;redundant-loads&quot;&gt;Removing redundant &lt;code class=&quot;highlighter-rouge&quot;&gt;load&lt;/code&gt; instructions&lt;/h3&gt;

    &lt;p&gt;cBPF programs filter system calls by inspecting their arguments. To do these
comparisons, this data must first be loaded into the cBPF VM registers. These
load operations can be optimized.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;cBPF’s conditional operations (e.g. “is equal to”, “is greater than”, etc.)
operate on a single 32-bit register called “A”. As such, a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program
typically consists of many load operations (&lt;code class=&quot;highlighter-rouge&quot;&gt;load32&lt;/code&gt;), each of which loads a 32-bit value
from a given offset of the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct into register A, followed by
a comparison to see if the value matches the filter.&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition1_was_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition1_was_false&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition2_was_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition2_was_false&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;But when a syscall rule is of the form “this syscall argument must be one of the
following values”, we don’t need to reload the same value (from the same offset)
multiple times. So gVisor looks for redundant loads like this, and removes them.&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition1_was_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition1_was_false&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jif&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;condition2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition2_was_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;condition2_was_false&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
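
  &lt;p&gt;One way to find such loads is a single forward pass that tracks which offset
is known to be in register A when each instruction is entered. The sketch
below uses a hypothetical tuple-based instruction model and conservatively
treats any unknown instruction as modifying A. It only reports redundant
loads: removing them and fixing up jump offsets is left out. This is an
illustration of the idea, not gVisor’s implementation.&lt;/p&gt;

```python
UNSEEN = object()  # no path into this instruction has been processed yet


def redundant_loads(prog):
    """Return indices of 'load32' instructions that reload what A holds.

    Instructions are tuples: ('load32', offset), ('jmp', k),
    ('jif', cond, jt, jf), ('ret', value). Jumps are forward-only, as in
    cBPF, so one in-order pass sees every predecessor before its successor.
    """
    entry = [UNSEEN] * len(prog)  # offset known to be in A, None if unknown
    entry[0] = None

    def flow(dest, state):
        # A's content is known only if all incoming paths agree on it.
        if entry[dest] is UNSEEN:
            entry[dest] = state
        elif entry[dest] != state:
            entry[dest] = None

    found = []
    for pc, ins in enumerate(prog):
        state = entry[pc]
        if state is UNSEEN:
            continue  # unreachable instruction
        op = ins[0]
        if op == 'load32':
            if state == ins[1]:
                found.append(pc)
            flow(pc + 1, ins[1])
        elif op == 'jmp':
            flow(pc + 1 + ins[1], state)
        elif op == 'jif':
            flow(pc + 1 + ins[2], state)
            flow(pc + 1 + ins[3], state)
        elif op != 'ret':
            flow(pc + 1, None)  # conservatively assume A was modified
    return found
```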

  &lt;p&gt;Note that syscall arguments are &lt;strong&gt;64-bit&lt;/strong&gt; values, whereas the A register is
only 32 bits wide. Therefore, asserting that a syscall argument matches a
predicate usually involves at least 2 &lt;code class=&quot;highlighter-rouge&quot;&gt;load32&lt;/code&gt; operations on different offsets,
thereby making this optimization useless for the “this syscall argument must be
one of the following values” case. We’ll get back to that.&lt;/p&gt;

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;minimizing-the-number-of-return-instructions&quot;&gt;Minimizing the number of &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instructions&lt;/h3&gt;

    &lt;p&gt;A typical syscall filter program consists of many predicates which return either
“allowed” or “rejected”. These are encoded in the bytecode as either &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt;
instructions, or jumps to &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instructions. These instructions can show up
dozens or hundreds of times in the cBPF bytecode in quick succession, presenting
an optimization opportunity.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Since two &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instructions with the same immediate return code are exactly
equivalent to one another, it is possible to rewrite jumps to all &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt;
instructions that return “allowed” to go to a single &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; instruction that
returns this code, and similarly for “rejected”, so long as the jump offsets fit
within the limits of conditional jumps (255 instructions). In turn, this makes
the program shorter, and therefore makes more jump optimizations possible.&lt;/p&gt;

  &lt;p&gt;To implement this optimization, gVisor first replaces all unconditional jump
instructions that go to &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; statements with a copy of that &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt;
statement. This removes needless indirection.&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nx&quot;&gt;Original&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;                      &lt;span class=&quot;nx&quot;&gt;New&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;                        &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;                    &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;                     &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;                        &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;                    &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;                     &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;gVisor then searches for &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; statements which can be entirely removed by
seeing if it is possible to rewrite the rest of the program to jump or flow
through to an equivalent &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; statement (without making the program longer
in the process). In the above example:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nx&quot;&gt;Original&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;                      &lt;span class=&quot;nx&quot;&gt;New&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;                  &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;99&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Targets updated&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;                     &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Now dead code&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;                    &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Now dead code&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;                  &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;89&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;90&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// Targets updated&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;                          &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// Now dead code&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jmp&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;                           &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Now dead code&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Finally, the dead code removal pass cleans up the dead &lt;code class=&quot;highlighter-rouge&quot;&gt;return&lt;/code&gt; statements and
the program becomes shorter.&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;    &lt;span class=&quot;nx&quot;&gt;Original&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;                      &lt;span class=&quot;nx&quot;&gt;New&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;bytecode&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;99&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;               &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;95&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Targets updated&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;               &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;cm&quot;&gt;/* Removed */&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;              &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;cm&quot;&gt;/* Removed */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;89&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;90&lt;/span&gt;                &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jge&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;87&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;88&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Targets updated&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;               &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;cm&quot;&gt;/* Removed */&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;              &lt;span class=&quot;o&quot;&gt;--&amp;gt;&lt;/span&gt;   &lt;span class=&quot;cm&quot;&gt;/* Removed */&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;                                    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;            &lt;span class=&quot;mi&quot;&gt;96&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;allowed&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;101&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;           &lt;span class=&quot;mi&quot;&gt;97&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;  &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;rejected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;While this search is expensive to perform, in a program full of predicates —
which is exactly what &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs are — this approach massively
reduces program size.&lt;/p&gt;
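
  &lt;p&gt;As a rough illustration of the dead code removal step, here is a minimal Go
sketch over a toy instruction set. The &lt;code class=&quot;highlighter-rouge&quot;&gt;Insn&lt;/code&gt; type and the &lt;code class=&quot;highlighter-rouge&quot;&gt;RemoveDead&lt;/code&gt; function
are hypothetical names invented for this example, not gVisor’s actual
implementation. The sketch assumes that every instruction is either a
conditional jump with relative offsets (as in cBPF) or a return, and that all
jump targets are in range.&lt;/p&gt;

```go
package main

import "fmt"

// Insn is a toy instruction (hypothetical, for illustration only): either a
// return (Ret is non-empty) or a conditional jump whose true/false offsets
// are relative to the next instruction.
type Insn struct {
	Ret    string
	JT, JF int
}

// RemoveDead drops instructions that cannot be reached from instruction 0
// and rewrites the remaining jump offsets so they still hit the same targets.
func RemoveDead(prog []Insn) []Insn {
	if len(prog) == 0 {
		return nil
	}
	// Mark everything reachable from the entry point.
	reachable := make([]bool, len(prog))
	work := []int{0}
	for len(work) > 0 {
		pc := work[len(work)-1]
		work = work[:len(work)-1]
		if reachable[pc] {
			continue
		}
		reachable[pc] = true
		if prog[pc].Ret == "" {
			work = append(work, pc+1+prog[pc].JT, pc+1+prog[pc].JF)
		}
	}
	// newIndex[pc] is the position of instruction pc in the shortened program.
	newIndex := make([]int, len(prog))
	kept := 0
	for pc := range prog {
		newIndex[pc] = kept
		if reachable[pc] {
			kept++
		}
	}
	// Emit the surviving instructions with recomputed relative offsets.
	out := make([]Insn, 0, kept)
	for pc, in := range prog {
		if !reachable[pc] {
			continue
		}
		if in.Ret == "" {
			in.JT = newIndex[pc+1+in.JT] - newIndex[pc] - 1
			in.JF = newIndex[pc+1+in.JF] - newIndex[pc] - 1
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []Insn{
		{JT: 2, JF: 5},    // 0: jumps to 3 or 6
		{Ret: "allowed"},  // 1: dead
		{Ret: "rejected"}, // 2: dead
		{JT: 1, JF: 2},    // 3: jumps to 5 or 6
		{Ret: "allowed"},  // 4: dead
		{Ret: "allowed"},  // 5: shared "good" return
		{Ret: "rejected"}, // 6: shared "bad" return
	}
	for i, in := range RemoveDead(prog) {
		fmt.Printf("%02d: %+v\n", i, in)
	}
}
```

  &lt;p&gt;In this toy program, three of the seven instructions are unreachable. The two
surviving jumps end up with offsets (0, 2) and (0, 1), which still lead to the
same shared returns in the shorter program.&lt;/p&gt;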

&lt;/details&gt;

&lt;h2 id=&quot;optimize-rulesets&quot;&gt;Ruleset optimizations&lt;/h2&gt;

&lt;p&gt;Bytecode-level optimizations are cool, but why stop here? gVisor now also
performs
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/seccomp/seccomp_optimizer.go&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; ruleset optimizations&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In gVisor, a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp&lt;/code&gt; &lt;code class=&quot;highlighter-rouge&quot;&gt;RuleSet&lt;/code&gt; maps each syscall number to a logical
expression named &lt;code class=&quot;highlighter-rouge&quot;&gt;SyscallRule&lt;/code&gt;, along with the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; action (e.g. “allow”)
to take when a syscall with that number matches its &lt;code class=&quot;highlighter-rouge&quot;&gt;SyscallRule&lt;/code&gt;.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;basic-ruleset-simplifications&quot;&gt;Basic ruleset simplifications&lt;/h3&gt;

    &lt;p&gt;A &lt;code class=&quot;highlighter-rouge&quot;&gt;SyscallRule&lt;/code&gt; is a predicate over the data contained in the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt;
struct (beyond its &lt;code class=&quot;highlighter-rouge&quot;&gt;nr&lt;/code&gt;). A trivial implementation is &lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt;, which simply
matches any &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt;. Other implementations include &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; (which
do what they sound like), and &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; which applies predicates to each specific
argument of a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt;, and forms the meat of actual syscall filtering
rules. Some basic simplifications are already possible with these building
blocks.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;gVisor implements the following basic optimizers. They may look useless on
their own, but they considerably simplify the logic of the more complex
optimizer described in other sections:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules with a single predicate within them are replaced with
just that predicate.&lt;/li&gt;
    &lt;li&gt;Duplicate predicates within &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules are removed.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rules within &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rules are flattened.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules within &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules are flattened.&lt;/li&gt;
    &lt;li&gt;An &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rule which contains a &lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt; predicate is replaced with
&lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt;.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt; predicates within &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; rules are removed.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; rules with &lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt; predicates for each argument are replaced
with a rule that matches anything.&lt;/li&gt;
  &lt;/ul&gt;

  &lt;p&gt;As with the bytecode-level optimizations, gVisor runs these in a loop until the
structure of the rules no longer changes. With the basic optimizations above,
this silly-looking rule:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;MatchAll&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;MatchAll&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… is simplified down to just &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg{AnyValue, EqualTo(2), AnyValue}&lt;/code&gt;.&lt;/p&gt;
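
  &lt;p&gt;The simplifications listed above can be sketched in a few dozen lines of Go.
This is only a minimal illustration under stated assumptions: the &lt;code class=&quot;highlighter-rouge&quot;&gt;Rule&lt;/code&gt;
interface, the string-based &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; type, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;Simplify&lt;/code&gt; function are
hypothetical stand-ins, not gVisor’s actual types, and structural equality is
approximated by comparing string renderings.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a toy stand-in for a syscall rule (hypothetical, for illustration).
type Rule interface{ String() string }

type (
	MatchAll struct{}
	PerArg   []string // one matcher name per argument; "AnyValue" is a wildcard
	Or       []Rule
	And      []Rule
)

func (MatchAll) String() string { return "MatchAll" }
func (p PerArg) String() string { return "PerArg{" + strings.Join(p, ", ") + "}" }
func (o Or) String() string     { return "Or{" + joinRules(o) + "}" }
func (a And) String() string    { return "And{" + joinRules(a) + "}" }

func joinRules(rs []Rule) string {
	parts := make([]string, 0, len(rs))
	for _, r := range rs {
		parts = append(parts, r.String())
	}
	return strings.Join(parts, ", ")
}

// simplifyOnce applies every basic simplification once, bottom-up.
func simplifyOnce(r Rule) Rule {
	switch v := r.(type) {
	case PerArg:
		for _, m := range v {
			if m != "AnyValue" {
				return v
			}
		}
		return MatchAll{} // every argument is unconstrained
	case Or:
		return combine(v, true)
	case And:
		return combine(v, false)
	}
	return r
}

// combine simplifies the children of an Or (isOr) or an And (!isOr).
func combine(children []Rule, isOr bool) Rule {
	var out []Rule
	seen := map[string]bool{}
	for _, c := range children {
		c = simplifyOnce(c)
		// Flatten Or-in-Or and And-in-And.
		nested := []Rule{c}
		if o, ok := c.(Or); ok {
			if isOr {
				nested = o
			}
		}
		if a, ok := c.(And); ok {
			if !isOr {
				nested = a
			}
		}
		for _, n := range nested {
			if _, all := n.(MatchAll); all {
				if isOr {
					return MatchAll{} // Or containing MatchAll matches anything
				}
				continue // MatchAll inside And is redundant
			}
			if seen[n.String()] {
				continue // duplicate predicate
			}
			seen[n.String()] = true
			out = append(out, n)
		}
	}
	switch {
	case len(out) == 1:
		return out[0] // single predicate replaces its wrapper
	case isOr:
		return Or(out)
	case len(out) == 0:
		return MatchAll{} // And of nothing but MatchAll
	}
	return And(out)
}

// Simplify repeats simplifyOnce until the rule's structure stops changing.
func Simplify(r Rule) Rule {
	for {
		next := simplifyOnce(r)
		if next.String() == r.String() {
			return next
		}
		r = next
	}
}

func main() {
	p := PerArg{"AnyValue", "EqualTo(2)", "AnyValue"}
	fmt.Println(Simplify(Or{And{MatchAll{}, And{p}}, Or{p, p}}))
	fmt.Println(Simplify(Or{p, PerArg{"AnyValue", "AnyValue"}}))
}
```

  &lt;p&gt;The first rule in &lt;code class=&quot;highlighter-rouge&quot;&gt;main&lt;/code&gt; collapses to a single &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt;, while the second
collapses to &lt;code class=&quot;highlighter-rouge&quot;&gt;MatchAll&lt;/code&gt; because one of its &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; branches matches anything.&lt;/p&gt;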

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;extracting-repeated-argument-matchers&quot;&gt;Extracting repeated argument matchers&lt;/h3&gt;

    &lt;p&gt;This is the main optimization that gVisor performs on rulesets. gVisor looks for
common argument matchers that are repeated across all combinations of &lt;em&gt;other&lt;/em&gt;
argument matchers in branches of an &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rule. It removes them from these
&lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; rules, and &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt;s the overall syscall rule with a single instance of
that argument matcher. Sound complicated? Let’s look at an example.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;In the
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/runsc/boot/filter/config/&quot;&gt;gVisor Sentry &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; configuration&lt;/a&gt;,
these are the rules for the
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/fcntl.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;fcntl(2)&lt;/code&gt; system call&lt;/a&gt;:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… This means that for the &lt;code class=&quot;highlighter-rouge&quot;&gt;fcntl(2)&lt;/code&gt; system call, &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[0]&lt;/code&gt; may
be any non-negative number, &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt; may be one of &lt;code class=&quot;highlighter-rouge&quot;&gt;F_GETFL&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;F_SETFL&lt;/code&gt;, or &lt;code class=&quot;highlighter-rouge&quot;&gt;F_GETFD&lt;/code&gt;, and all other &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; fields may be any value.&lt;/p&gt;
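
  &lt;p&gt;To make these semantics concrete, here is a small Go sketch that evaluates a
rule of this shape against concrete syscall arguments. The &lt;code class=&quot;highlighter-rouge&quot;&gt;Matcher&lt;/code&gt; type and
the &lt;code class=&quot;highlighter-rouge&quot;&gt;Matches&lt;/code&gt; methods are assumptions made for this example and do not mirror
gVisor’s API. In particular, this toy &lt;code class=&quot;highlighter-rouge&quot;&gt;NonNegativeFD&lt;/code&gt; only inspects the low 32
bits of the argument. The &lt;code class=&quot;highlighter-rouge&quot;&gt;F_*&lt;/code&gt; values are the Linux &lt;code class=&quot;highlighter-rouge&quot;&gt;fcntl(2)&lt;/code&gt; command numbers.&lt;/p&gt;

```go
package main

import "fmt"

// Linux fcntl(2) command numbers.
const (
	F_GETFD = 1
	F_GETFL = 3
	F_SETFL = 4
)

// Matcher is a predicate over one syscall argument (hypothetical type).
type Matcher func(arg uint64) bool

// EqualTo matches exactly one value.
func EqualTo(want uint64) Matcher {
	return func(arg uint64) bool { return arg == want }
}

// NonNegativeFD is a simplified check that treats the low 32 bits of the
// argument as a signed file descriptor.
func NonNegativeFD(arg uint64) bool { return int32(arg) >= 0 }

// PerArg holds one matcher per leading argument; arguments without a
// matcher may take any value.
type PerArg []Matcher

func (p PerArg) Matches(args [6]uint64) bool {
	for i, m := range p {
		if !m(args[i]) {
			return false
		}
	}
	return true
}

// Or matches if any of its branches matches.
type Or []PerArg

func (o Or) Matches(args [6]uint64) bool {
	for _, p := range o {
		if p.Matches(args) {
			return true
		}
	}
	return false
}

var fcntlRule = Or{
	PerArg{NonNegativeFD, EqualTo(F_GETFL)},
	PerArg{NonNegativeFD, EqualTo(F_SETFL)},
	PerArg{NonNegativeFD, EqualTo(F_GETFD)},
}

func main() {
	fmt.Println(fcntlRule.Matches([6]uint64{3, F_GETFL}))          // true
	fmt.Println(fcntlRule.Matches([6]uint64{3, 2}))                // false: F_SETFD is not listed
	fmt.Println(fcntlRule.Matches([6]uint64{0xffffffff, F_GETFL})) // false: FD is -1
}
```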

  &lt;p&gt;If rendered naively in BPF, this rule would be checked one branch of the &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt;
expression at a time, re-checking &lt;code class=&quot;highlighter-rouge&quot;&gt;NonNegativeFD&lt;/code&gt; in every branch. Clearly wasteful.
Conceptually, the ideal expression is something like this:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyOf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… But going through all the syscall rules to look for this pattern would be
quite tedious, and some of them are actually &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt;’d from multiple
&lt;code class=&quot;highlighter-rouge&quot;&gt;map[uintptr]SyscallRule&lt;/code&gt; maps in different files (e.g. platform-dependent syscalls),
so they cannot all be specified in a single location with a single predicate on
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt;. So gVisor needs to detect this programmatically at
optimization time.&lt;/p&gt;

  &lt;p&gt;Conceptually, gVisor goes from:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… to (after one pass):&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Then the &lt;a href=&quot;#basic-ruleset-simplifications&quot;&gt;basic optimizers&lt;/a&gt; will kick in and
detect duplicate &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; rules in &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; expressions, and delete them:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… Then, on the next pass, the second inner &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; rule gets recursively
optimized:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… which, after other basic optimizers clean this all up, finally becomes:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;A2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;B3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;C3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This has turned what would be 24 comparisons into just 9:&lt;/p&gt;

  &lt;ul&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[0]&lt;/code&gt; must either match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;A1&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;A2&lt;/code&gt;.&lt;/li&gt;
    &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[3]&lt;/code&gt; must match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;D&lt;/code&gt;.&lt;/li&gt;
    &lt;li&gt;At least one of the following must be true:
      &lt;ul&gt;
        &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[1]&lt;/code&gt; must match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;B1&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[2]&lt;/code&gt; must
match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;C1&lt;/code&gt;.&lt;/li&gt;
        &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[1]&lt;/code&gt; must match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;B2&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[2]&lt;/code&gt; must
match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;C2&lt;/code&gt;.&lt;/li&gt;
        &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[1]&lt;/code&gt; must match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;B3&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data[2]&lt;/code&gt; must
match predicate &lt;code class=&quot;highlighter-rouge&quot;&gt;C3&lt;/code&gt;.&lt;/li&gt;
      &lt;/ul&gt;
    &lt;/li&gt;
  &lt;/ul&gt;
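
  &lt;p&gt;The duplicate-removal and factoring passes walked through above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not gVisor’s actual optimizer: the &lt;code class=&quot;highlighter-rouge&quot;&gt;PerArg&lt;/code&gt; type here is a simplified stand-in that holds predicate names as strings, and &lt;code class=&quot;highlighter-rouge&quot;&gt;Dedupe&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;FactorCommon&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;Comparisons&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```go
package main

import "fmt"

// Any stands in for AnyValue: the argument is not checked at all.
const Any = ""

// PerArg is a simplified, hypothetical stand-in for a per-argument rule:
// one named predicate per syscall argument (only 4 arguments are modeled).
type PerArg [4]string

// Dedupe removes exact duplicate rules from an Or expression, keeping order.
func Dedupe(or []PerArg) []PerArg {
	seen := map[PerArg]bool{}
	var out []PerArg
	for _, r := range or {
		if !seen[r] {
			seen[r] = true
			out = append(out, r)
		}
	}
	return out
}

// FactorCommon splits Or{rules...} into And{common, Or{rest...}}: any
// argument whose predicate is identical in every rule moves into common
// and is replaced by Any in each remaining rule. The input is not modified.
func FactorCommon(or []PerArg) (common PerArg, rest []PerArg) {
	rest = append(rest, or...)
	if len(or) == 0 {
		return common, rest
	}
	for i := range common {
		same := true
		for _, r := range or {
			if r[i] != or[0][i] {
				same = false
			}
		}
		if same {
			common[i] = or[0][i]
			for j := range rest {
				rest[j][i] = Any
			}
		}
	}
	return common, rest
}

// Comparisons counts the predicates that actually need to be evaluated.
func Comparisons(rules ...PerArg) int {
	n := 0
	for _, r := range rules {
		for _, p := range r {
			if p != Any {
				n++
			}
		}
	}
	return n
}

func main() {
	// The second inner Or from the example, before duplicate removal.
	or := []PerArg{
		{Any, "B1", "C1", "D"}, {Any, "B1", "C1", "D"},
		{Any, "B2", "C2", "D"}, {Any, "B2", "C2", "D"},
		{Any, "B3", "C3", "D"}, {Any, "B3", "C3", "D"},
	}
	common, rest := FactorCommon(Dedupe(or))
	fmt.Printf("common=%q rest=%q\n", common, rest)
	fmt.Println(Comparisons(or...), "->", Comparisons(common)+Comparisons(rest...))
}
```

  &lt;p&gt;For the second inner &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; alone, the tally falls from 18 predicates to 7: one check for &lt;code class=&quot;highlighter-rouge&quot;&gt;D&lt;/code&gt; plus three pairs. Adding the two checks on the first argument (&lt;code class=&quot;highlighter-rouge&quot;&gt;A1&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;A2&lt;/code&gt;) gives the 9 comparisons listed above.&lt;/p&gt;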

  &lt;p&gt;To go back to our &lt;code class=&quot;highlighter-rouge&quot;&gt;fcntl(2)&lt;/code&gt; example, the rules would therefore be rewritten to:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;// Check for args[0] exclusively:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;// Check for args[1] exclusively:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… thus we’ve turned 6 comparisons into 4. But we can do better still!&lt;/p&gt;

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;extracting-repeated-32-bit-match-logic-from-64-bit-argument-matchers&quot;&gt;Extracting repeated 32-bit match logic from 64-bit argument matchers&lt;/h3&gt;

    &lt;p&gt;We can apply the same optimization, but down to the 32-bit matching logic that
underlies the 64-bit syscall argument matching predicates.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;As you may recall,
&lt;a href=&quot;#cbpf-limitations&quot;&gt;cBPF instructions are limited to 32-bit math&lt;/a&gt;. This means
that when rendered, each of these argument comparisons is actually two
operations: one for the first 32-bit half of the argument, and one for the
second 32-bit half of the argument.&lt;/p&gt;
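
  &lt;p&gt;As a rough illustration of why one 64-bit comparison becomes two 32-bit ones, here is a minimal Go sketch. It is not gVisor code; &lt;code class=&quot;highlighter-rouge&quot;&gt;Split64&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;Equal64&lt;/code&gt; are hypothetical helper names, and the sample value is arbitrary.&lt;/p&gt;

```go
package main

import "fmt"

// Split64 returns the high and low 32-bit halves of a 64-bit argument value.
func Split64(v uint64) (high, low uint32) {
	return uint32(v >> 32), uint32(v)
}

// Equal64 compares two 64-bit values the way a 32-bit-only machine has to:
// one comparison per half.
func Equal64(arg, want uint64) bool {
	argHigh, argLow := Split64(arg)
	wantHigh, wantLow := Split64(want)
	if argHigh != wantHigh {
		return false
	}
	return argLow == wantLow
}

func main() {
	high, low := Split64(0x0000000100000003)
	fmt.Printf("high=%#x low=%#x\n", high, low)
	fmt.Println(Equal64(0x3, 0x3), Equal64(0x0000000100000003, 0x3))
}
```

  &lt;p&gt;A value such as &lt;code class=&quot;highlighter-rouge&quot;&gt;0x0000000100000003&lt;/code&gt; shares its low half with &lt;code class=&quot;highlighter-rouge&quot;&gt;0x3&lt;/code&gt; but differs in the high half, which is why neither half can be skipped.&lt;/p&gt;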

  &lt;p&gt;Let’s look at the &lt;code class=&quot;highlighter-rouge&quot;&gt;F_GETFL&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;F_SETFL&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;F_GETFD&lt;/code&gt; constants:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x3&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x4&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;The cBPF bytecode for checking the arguments of this syscall may therefore look
something like this:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[0]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jset&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// If A &amp;amp; 0x80000000 != 0, jump to @bad,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise continue.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[1]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;         &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @next1.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;06&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x3, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next1.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;09&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;         &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @next2.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x4, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next2.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If A == 0x1, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @bad.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Good/bad jump targets for the checks above to jump to:&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;ALLOW&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;17&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;REJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

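  &lt;p&gt;Clearly this could be better: the load and zero-check of the first 32 bits of
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt; is repeated for every candidate value, even though the
first 32 bits must be zero in all possible cases. So the syscall argument
value-matching primitives (e.g. &lt;code class=&quot;highlighter-rouge&quot;&gt;EqualTo&lt;/code&gt;) may be split into two 32-bit value
matchers:&lt;/p&gt;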

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0xffffffff00000000&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x3 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0xffffffff00000000&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x4 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0xffffffff00000000&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x1 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
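
  &lt;p&gt;As a rough illustration of why this split is safe, the sketch below checks
that comparing the two 32-bit halves separately gives the same answer as a
single 64-bit comparison. The &lt;code class=&quot;highlighter-rouge&quot;&gt;splitEqual&lt;/code&gt; helper and the sample argument
values are assumptions made for this example, not the actual gVisor
implementation:&lt;/p&gt;

```go
package main

import "fmt"

// Linux fcntl(2) command values, matching the comments in the rules above.
const (
	F_GETFD uint64 = 0x1
	F_GETFL uint64 = 0x3
	F_SETFL uint64 = 0x4
)

// splitEqual is a hypothetical stand-in for a splitMatcher whose two halves
// are both EqualTo32Bits: it compares the high and the low 32 bits separately.
func splitEqual(arg, want uint64) bool {
	if uint32(arg>>32) != uint32(want>>32) {
		return false // High 32 bits differ.
	}
	return uint32(arg) == uint32(want) // Low 32 bits decide the result.
}

func main() {
	// Sample arguments, including two whose high 32 bits are not zero.
	args := []uint64{0x1, 0x3, 0x4, 0x100000003, 0xffffffff00000004}
	for _, arg := range args {
		for _, want := range []uint64{F_GETFD, F_GETFL, F_SETFL} {
			if splitEqual(arg, want) != (arg == want) {
				panic("split comparison disagrees with 64-bit comparison")
			}
		}
	}
	fmt.Println("split and 64-bit comparisons agree")
}
```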

  &lt;p&gt;gVisor then applies the same optimization as earlier, but this time going into
each 32-bit half of each argument. This means it can extract the
&lt;code class=&quot;highlighter-rouge&quot;&gt;EqualTo32Bits(0)&lt;/code&gt; matcher from the &lt;code class=&quot;highlighter-rouge&quot;&gt;high32bits&lt;/code&gt; part of each &lt;code class=&quot;highlighter-rouge&quot;&gt;splitMatcher&lt;/code&gt; and
move it up to the &lt;code class=&quot;highlighter-rouge&quot;&gt;And&lt;/code&gt; expression like so:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;rules&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uintptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SyscallRule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SYS_FCNTL&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;And&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NonNegativeFD&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;Any32BitsValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Any32BitsValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x3 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Any32BitsValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_SETFL&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x4 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;splitMatcher&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;high32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Any32BitsValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;low32bits&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;EqualTo32Bits&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;F_GETFD&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00000000ffffffff&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;/* = 0x1 */&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
                &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
            &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This looks bigger as a tree, but keep in mind that the &lt;code class=&quot;highlighter-rouge&quot;&gt;AnyValue&lt;/code&gt; and
&lt;code class=&quot;highlighter-rouge&quot;&gt;Any32BitsValue&lt;/code&gt; matchers do not produce any bytecode. So now let’s render that
tree to bytecode:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[0]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jset&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// If A &amp;amp; 0x80000000 != 0, jump to @bad,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise continue.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[1]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;06&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x3, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next1.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;09&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x4, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next2.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If A == 0x1, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @bad.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Good/bad jump targets for the checks above to jump to:&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;ALLOW&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;REJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This is where the bytecode-level optimization to remove redundant loads
&lt;a href=&quot;#redundant-loads&quot;&gt;described earlier&lt;/a&gt; finally becomes relevant. We don’t need to
load the second 32 bits of &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt; multiple times in a row:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[0]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[0]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jset&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x80000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// If A &amp;amp; 0x80000000 != 0, jump to @bad,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise continue.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Check for `seccomp_data.args[1]`:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the first 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;05&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;           &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue, otherwise jump to @bad.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;06&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// Load the second 32 bits of&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x3, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next1.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;08&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;   &lt;span class=&quot;c1&quot;&gt;// If A == 0x4, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @next2.&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;next2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;09&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;     &lt;span class=&quot;c1&quot;&gt;// If A == 0x1, jump to @good,&lt;/span&gt;
                               &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @bad.&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Good/bad jump targets for the checks above to jump to:&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;ALLOW&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;mi&quot;&gt;11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;REJECT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;Of course, in practice the &lt;code class=&quot;highlighter-rouge&quot;&gt;@good&lt;/code&gt;/&lt;code class=&quot;highlighter-rouge&quot;&gt;@bad&lt;/code&gt; jump targets would also be shared
with the rules for other system calls, so that those return instructions are
not duplicated either. And because each individual filtering rule now takes
fewer instructions, the jumps to these targets can be deduplicated across that
many more rules.&lt;/p&gt;

  &lt;p&gt;This example demonstrates how &lt;strong&gt;optimizations build on top of each other&lt;/strong&gt;,
making each optimization more likely to make &lt;em&gt;other&lt;/em&gt; optimizations useful in
turn.&lt;/p&gt;
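
  &lt;p&gt;The redundant-load removal applied in the last step can be sketched as a
small peephole pass. This is a simplified illustration, not gVisor’s actual
optimizer: the &lt;code class=&quot;highlighter-rouge&quot;&gt;insn&lt;/code&gt; type and the &lt;code class=&quot;highlighter-rouge&quot;&gt;dropRedundantLoads&lt;/code&gt; function are
hypothetical, jump targets are left out, and the sketch assumes that every
instruction is only ever reached with register A holding the most recently
loaded value. A real pass has to verify that for each jump target and recompute
jump offsets after removing instructions:&lt;/p&gt;

```go
package main

import "fmt"

// insn is a simplified, hypothetical cBPF instruction for this sketch.
// Jump targets are omitted on purpose.
type insn struct {
	op  string // "load32", "jeq", "jset", or "ret"
	arg uint32 // Offset for load32, comparison value or return code otherwise.
}

// dropRedundantLoads removes a load32 when register A is already known to
// hold the 32 bits at the same offset. Assumption: no instruction is reached
// by a jump from a point where A holds a different value.
func dropRedundantLoads(prog []insn) []insn {
	out := make([]insn, 0, len(prog))
	loaded := false   // Whether A holds the result of an earlier load32.
	var offset uint32 // Offset of the most recent load32.
	for _, in := range prog {
		if in.op == "load32" {
			if loaded {
				if in.arg == offset {
					continue // A already holds this value; skip the load.
				}
			}
			loaded = true
			offset = in.arg
		}
		out = append(out, in)
	}
	return out
}

func main() {
	// The 14-instruction program from above, without its jump targets.
	prog := []insn{
		{"load32", 16}, {"jeq", 0},
		{"load32", 20}, {"jset", 0x80000000},
		{"load32", 24}, {"jeq", 0},
		{"load32", 28}, {"jeq", 0x3},
		{"load32", 28}, {"jeq", 0x4},
		{"load32", 28}, {"jeq", 0x1},
		{"ret", 1}, // ALLOW (placeholder value)
		{"ret", 0}, // REJECT (placeholder value)
	}
	fmt.Println(len(prog), len(dropRedundantLoads(prog))) // prints "14 12"
}
```

  &lt;p&gt;Run on the 14-instruction program, the sketch drops the two repeated loads
of offset 28, which matches the 12 instructions in the final listing.&lt;/p&gt;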

&lt;/details&gt;

&lt;h2 id=&quot;other-optimizations&quot;&gt;Other optimizations&lt;/h2&gt;

&lt;p&gt;Beyond these, gVisor also has the following minor optimizations.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;making-futex2-rules-faster&quot;&gt;Making &lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt; rules faster&lt;/h3&gt;

    &lt;p&gt;&lt;a href=&quot;https://man7.org/linux/man-pages/man2/futex.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt;&lt;/a&gt; is by far the
system call that gVisor itself makes most often as part of its operation. It is
used for synchronization, so filtering it needs to be very efficient.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Its rules used to look like this:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;SYS_FUTEX&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAIT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FUTEX_PRIVATE_FLAG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAKE&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FUTEX_PRIVATE_FLAG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAIT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;PerArg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;AnyValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;EqualTo&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAKE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;This is essentially a 4-way &lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt; over the 4 values allowed for
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data.args[1]&lt;/code&gt;. This is all well and good, and the above optimizations
already reduce it to the minimum number of &lt;code class=&quot;highlighter-rouge&quot;&gt;jeq&lt;/code&gt; comparison operations.&lt;/p&gt;

  &lt;p&gt;But looking at the actual bit values of the &lt;code class=&quot;highlighter-rouge&quot;&gt;FUTEX_*&lt;/code&gt; constants above:&lt;/p&gt;

  &lt;div class=&quot;language-go highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;FUTEX_WAIT&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x00&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FUTEX_WAKE&lt;/span&gt;         &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x01&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FUTEX_PRIVATE_FLAG&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;0x80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;… we can see that the rule is equivalent to checking that no bits other than
&lt;code class=&quot;highlighter-rouge&quot;&gt;0x01&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;0x80&lt;/code&gt; are set. It turns out that cBPF has an instruction for that:
&lt;code class=&quot;highlighter-rouge&quot;&gt;jset&lt;/code&gt;. The rule is now optimized down to two comparison operations:&lt;/p&gt;

  &lt;div class=&quot;language-javascript highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;mi&quot;&gt;01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;28&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;// Load the high 32 bits of&lt;/span&gt;
                                  &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jeq&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;                &lt;span class=&quot;c1&quot;&gt;// If A == 0, continue,&lt;/span&gt;
                                  &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @bad.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;load32&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;// Load the low 32 bits of&lt;/span&gt;
                                  &lt;span class=&quot;c1&quot;&gt;//   `seccomp_data.args[1]` into register A.&lt;/span&gt;
&lt;span class=&quot;mi&quot;&gt;04&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;jset&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xffffff7e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;bad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;nd&quot;&gt;good&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// If A &amp;amp; ^(0x01 | 0x80) != 0, jump to @bad,&lt;/span&gt;
                                  &lt;span class=&quot;c1&quot;&gt;//   otherwise jump to @good.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
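
  &lt;p&gt;To see why the two forms are interchangeable, here is a minimal Go sketch
(not gVisor’s actual code; the function names are made up for illustration) that
compares the 4-way equality check with the bitmask check over a range of inputs.
It assumes the argument is split into high and low 32-bit halves, and it writes
the mask test in an OR-based form, which is equivalent to the
AND-with-complement test that &lt;code class=&quot;highlighter-rouge&quot;&gt;jset&lt;/code&gt; performs:&lt;/p&gt;

```go
package main

import "fmt"

// Futex operation constants, as listed above.
const (
	futexWait        = 0x00
	futexWake        = 0x01
	futexPrivateFlag = 0x80
)

// allowedByEquality mirrors the original 4-way Or of exact-value matches.
func allowedByEquality(op uint64) bool {
	switch op {
	case futexWait, futexWake, futexWait | futexPrivateFlag, futexWake | futexPrivateFlag:
		return true
	}
	return false
}

// allowedByMask mirrors the optimized program: the high 32 bits must be
// zero, and the low 32 bits may have no bit set other than 0x01 and 0x80.
func allowedByMask(op uint64) bool {
	const allowedBits = uint32(futexWake | futexPrivateFlag)
	high := uint32(op >> 32)
	low := uint32(op)
	if high != 0 {
		return false
	}
	// OR-ing in the allowed bits leaves them unchanged only when `low`
	// has no bit set outside of the allowed set.
	return (low | allowedBits) == allowedBits
}

func main() {
	mismatches := 0
	for op := uint64(0); op != 0x400; op++ {
		if allowedByEquality(op) != allowedByMask(op) {
			mismatches++
		}
	}
	fmt.Println("mismatches:", mismatches)
}
```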

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;optimizing-non-negative-fd-checks&quot;&gt;Optimizing non-negative FD checks&lt;/h3&gt;

    &lt;p&gt;A lot of syscall arguments are file descriptors (FD numbers), which we need to
filter efficiently.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;An FD is a non-negative 32-bit signed integer, but is passed as a 64-bit
value as all syscall arguments are. Instead of doing a “less than” operation, we
can turn this into a bitwise check: the high 32 bits of the 64-bit value must be
zero, and the sign bit (bit 31) of the low 32 bits must not be set.&lt;/p&gt;
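
  &lt;p&gt;As a rough illustration (a sketch with hypothetical helper names, not
gVisor’s implementation), the following Go program compares a signed comparison
with the purely bitwise check. In cBPF, the bitwise version would correspond to
a &lt;code class=&quot;highlighter-rouge&quot;&gt;jeq 0&lt;/code&gt; on the high half followed by a &lt;code class=&quot;highlighter-rouge&quot;&gt;jset 0x80000000&lt;/code&gt; on the low half:&lt;/p&gt;

```go
package main

import "fmt"

// fdOKBySignedCompare requires the high 32 bits to be zero and the low
// 32 bits, read as a signed integer, to be non-negative.
func fdOKBySignedCompare(arg uint64) bool {
	if arg>>32 != 0 {
		return false
	}
	return int32(uint32(arg)) >= 0
}

// fdOKByBits performs the same check using only bit tests: the high half
// must be zero and bit 31 of the low half must be clear.
func fdOKByBits(arg uint64) bool {
	high := uint32(arg >> 32)
	low := uint32(arg)
	if high != 0 {
		return false
	}
	return low>>31 == 0
}

func main() {
	samples := []uint64{0, 3, 0x7fffffff, 0x80000000, 0xffffffff, 0x100000000, 0xffffffffffffffff}
	for _, arg := range samples {
		fmt.Printf("%#x compare=%v bits=%v\n", arg, fdOKBySignedCompare(arg), fdOKByBits(arg))
	}
}
```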

&lt;/details&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;enforcing-consistency-of-argument-wise-matchers&quot;&gt;Enforcing consistency of argument-wise matchers&lt;/h3&gt;

    &lt;p&gt;When one syscall argument is checked consistently across all branches of an
&lt;code class=&quot;highlighter-rouge&quot;&gt;Or&lt;/code&gt;, enforcing that this is the case ensures that the
&lt;a href=&quot;#optimize-rulesets&quot;&gt;optimization for such matchers&lt;/a&gt; remains effective.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; system call takes an FD as one of its arguments. Since it is a
“grab bag” of a system call, gVisor’s rules for &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; were similarly spread
across many files and rules, and not all of them checked that the FD argument
was non-negative; some of them simply accepted any value for the FD argument.&lt;/p&gt;

  &lt;p&gt;Before this optimization work, this meant that the BPF program did less work for
the rules which didn’t check the value of the FD argument. However, now that
gVisor &lt;a href=&quot;#optimize-rulesets&quot;&gt;optimizes repeated argument-wise matchers&lt;/a&gt;, it is
now actually &lt;em&gt;cheaper&lt;/em&gt; if &lt;em&gt;all&lt;/em&gt; &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; rules verify the value of the FD
argument consistently, as that argument check can be performed exactly once for
all possible branches of the &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; rules. So now gVisor has a test that
verifies that this is the case. This is a good example that shows that
&lt;strong&gt;optimization work can lead to improved security&lt;/strong&gt; due to the efficiency gains
that come from applying security checks consistently.&lt;/p&gt;

&lt;/details&gt;

&lt;h2 id=&quot;secbench&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt;: Benchmarking &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs&lt;/h2&gt;

&lt;p&gt;Measuring gVisor’s overall performance would be a poor way to gauge the
effectiveness of the above improvements, because each improvement affects only a
tiny part of the syscall hot path. At the scale of each of these optimizations,
we need to zoom in a bit more.&lt;/p&gt;

&lt;p&gt;So now gVisor has
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/secbench/&quot;&gt;tooling for benchmarking &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs&lt;/a&gt;.
It works by taking a
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/runsc/boot/filter/filter_bench_test.go&quot;&gt;cBPF program along with several possible syscalls&lt;/a&gt;
to try with it. It runs a subprocess that installs this program as a &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;
filter for itself, replacing all actions (other than “approve syscall”) with
“return error” in order to avoid crashing. Then it measures the latency of each
syscall. This is then compared against the latency of the very same syscalls in
a subprocess that has an empty &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter (i.e. the only instruction within
it is &lt;code class=&quot;highlighter-rouge&quot;&gt;return ALLOW&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Let’s measure the effect of the above improvements on a gVisor-like workload.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;modeling-gvisor-seccomp-bpf-behavior-for-benchmarking&quot;&gt;Modeling gVisor &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; behavior for benchmarking&lt;/h3&gt;

    &lt;p&gt;This can be done by running gVisor under &lt;code class=&quot;highlighter-rouge&quot;&gt;ptrace&lt;/code&gt; to see what system calls the
gVisor process is doing.&lt;/p&gt;

  &lt;/summary&gt;

  &lt;p&gt;Note that &lt;code class=&quot;highlighter-rouge&quot;&gt;ptrace&lt;/code&gt; here refers to the mechanism by which we can inspect the
system calls that the gVisor Sentry is making. These are distinct from the system
calls the &lt;em&gt;sandboxed&lt;/em&gt; application is doing. It also has nothing to do with
gVisor’s former “ptrace” platform.&lt;/p&gt;

  &lt;p&gt;For example, after running a Postgres benchmark inside gVisor with Systrap, the
&lt;code class=&quot;highlighter-rouge&quot;&gt;ptrace&lt;/code&gt; tool generated the following summary table:&lt;/p&gt;

  &lt;div class=&quot;language-markdown highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;% time     seconds  usecs/call     calls    errors syscall
&lt;span class=&quot;p&quot;&gt;------ ----------- ----------- --------- --------- ----------------&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt; 62.&lt;/span&gt;10  431.799048         496    870063     46227 futex
&lt;span class=&quot;p&quot;&gt;  4.&lt;/span&gt;23   29.399526         106    275649        38 nanosleep
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;87    6.032292          37    160201           sendmmsg
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;28    1.939492          16    115769           fstat
&lt;span class=&quot;p&quot;&gt; 27.&lt;/span&gt;96  194.415343        2787     69749       137 ppoll
&lt;span class=&quot;p&quot;&gt;  1.&lt;/span&gt;05    7.298717         315     23131           fsync
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;06    0.446930          31     14096           pwrite64
&lt;span class=&quot;p&quot;&gt;  3.&lt;/span&gt;37   23.398106        1907     12266         9 epoll_pwait
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.019711           9      1991         6 close
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;02    0.116739          82      1414           tgkill
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;01    0.068481          48      1414       201 rt_sigreturn
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;02    0.147048         104      1413           getpid
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;01    0.045338          41      1080           write
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;01    0.039876          37      1056           read
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.015637          18       836        24 openat
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;01    0.066699          81       814           madvise
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.029757         111       267           fallocate
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.006619          15       420           pread64
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.013334          35       375           sched_yield
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.008112         114        71           pwritev2
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.003005          57        52           munmap
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.000343          18        19         6 unlinkat
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.000249          15        16           shutdown
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.000100           8        12           getdents64
&lt;span class=&quot;p&quot;&gt;  0.&lt;/span&gt;00    0.000045           4        10           newfstatat
...
&lt;span class=&quot;p&quot;&gt;------ ----------- ----------- --------- --------- ----------------&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;100.&lt;/span&gt;00  695.311111         447   1552214     46651 total
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;

  &lt;p&gt;To mimic the syscall profile of this gVisor sandbox from the perspective of
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; overhead, we need to have it call these system calls with the same
relative frequency. Therefore, the dimension that matters here isn’t &lt;code class=&quot;highlighter-rouge&quot;&gt;time&lt;/code&gt; or
&lt;code class=&quot;highlighter-rouge&quot;&gt;seconds&lt;/code&gt; or even &lt;code class=&quot;highlighter-rouge&quot;&gt;usecs/call&lt;/code&gt;; it is actually just the number of system calls
(&lt;code class=&quot;highlighter-rouge&quot;&gt;calls&lt;/code&gt;). In graph form:&lt;/p&gt;

  &lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-sentry-syscall-profile.png&quot; alt=&quot;Sentry syscall profile&quot; title=&quot;Sentry syscall profile&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

  &lt;p&gt;The Pareto distribution of system calls becomes immediately clear.&lt;/p&gt;
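
  &lt;p&gt;The relative weights can be recomputed directly from the &lt;code class=&quot;highlighter-rouge&quot;&gt;calls&lt;/code&gt; column.
The sketch below is only an illustration (the type and function names are
invented for this example); it uses the five most frequent syscalls and the
total from the table above to print each syscall’s share and the running
cumulative share:&lt;/p&gt;

```go
package main

import "fmt"

// syscallCount pairs a syscall name with its call count.
type syscallCount struct {
	name  string
	calls int
}

// Call counts copied from the summary table above (five most frequent).
var top = []syscallCount{
	{"futex", 870063},
	{"nanosleep", 275649},
	{"sendmmsg", 160201},
	{"fstat", 115769},
	{"ppoll", 69749},
}

// totalCalls is the "total" row of the same table.
const totalCalls = 1552214

// cumulativeShare returns the fraction of all calls made by the first n
// entries of counts.
func cumulativeShare(counts []syscallCount, n int) float64 {
	sum := 0
	for _, c := range counts[:n] {
		sum += c.calls
	}
	return float64(sum) / float64(totalCalls)
}

func main() {
	for i, c := range top {
		share := float64(c.calls) / float64(totalCalls)
		fmt.Printf("%-10s weight=%6.2f%%  cumulative=%6.2f%%\n",
			c.name, 100*share, 100*cumulativeShare(top, i+1))
	}
}
```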

&lt;/details&gt;

&lt;h3 id=&quot;seccomp-bpf-filtering-overhead-reduction&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead reduction&lt;/h3&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt; library lets us take the top 10 system calls and measure their
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead individually, as well as building a weighted
aggregate of their overall overhead. Here are the numbers from before and after
the filtering optimizations described in this post:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-systrap.png&quot; alt=&quot;Systrap seccomp-bpf performance&quot; title=&quot;Systrap seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;nanosleep(2)&lt;/code&gt; system call is a bit of an oddball here. Unlike the others,
this system call causes the current thread to be descheduled. To make the
results more legible, here is the same data with the duration normalized to the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead from before optimizations:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-systrap-normalized.png&quot; alt=&quot;Systrap seccomp-bpf performance&quot; title=&quot;Systrap seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This shows that some system calls have had their filtering overhead reduced, while
others haven’t significantly changed (10% or less change in either direction).
This is to be expected: those that have not changed are the ones that are
cacheable: &lt;code class=&quot;highlighter-rouge&quot;&gt;nanosleep(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;fstat(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;ppoll(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;fsync(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;pwrite64(2)&lt;/code&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;close(2)&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;getpid(2)&lt;/code&gt;. The non-cacheable syscalls
&lt;a href=&quot;#structure&quot;&gt;which have dedicated checks&lt;/a&gt; before the main BST, &lt;code class=&quot;highlighter-rouge&quot;&gt;futex(2)&lt;/code&gt; and
&lt;code class=&quot;highlighter-rouge&quot;&gt;sendmmsg(2)&lt;/code&gt;, experienced the biggest boost. Lastly, &lt;code class=&quot;highlighter-rouge&quot;&gt;epoll_pwait(2)&lt;/code&gt; is
non-cacheable but doesn’t have a dedicated check before the main BST, so while
it still sees a small performance gain, that gain is smaller than that of its
counterparts.&lt;/p&gt;

&lt;p&gt;The “Aggregate” number comes from the &lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt; library and represents the
difference in total time spent in system calls when they are called at random,
weighted by their relative frequency. It represents the average system call
overhead that a Sentry using Systrap would incur. Therefore, per these numbers,
these optimizations removed ~29% from gVisor’s overall &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead.&lt;/p&gt;

&lt;p&gt;Here is the same data for KVM, which has a slightly different syscall profile
with &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;rt_sigreturn(2)&lt;/code&gt; being critical for performance:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-kvm-normalized.png&quot; alt=&quot;KVM seccomp-bpf performance&quot; title=&quot;KVM seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Lastly, let’s look at GPU workload performance. This benchmark enables gVisor’s
&lt;a href=&quot;/blog/2023/06/20/gpu-pytorch-stable-diffusion/&quot;&gt;experimental &lt;code class=&quot;highlighter-rouge&quot;&gt;nvproxy&lt;/code&gt; feature for GPU support&lt;/a&gt;.
What matters for this workload is &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; performance, as this is the system
call used to issue commands to the GPU. Here is the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering
overhead of various CUDA control commands issued via &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-nvproxy-ioctl.png&quot; alt=&quot;nvproxy ioctl seccomp-bpf performance&quot; title=&quot;nvproxy ioctl seccomp-bpf performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As &lt;code class=&quot;highlighter-rouge&quot;&gt;nvproxy&lt;/code&gt; adds a lot of complexity to the &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; filtering rules, this is
where we see the most improvement from these optimizations.&lt;/p&gt;

&lt;h2 id=&quot;secfuzz&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;secfuzz&lt;/code&gt;: Fuzzing &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; programs&lt;/h2&gt;

&lt;p&gt;To ensure that the optimizations above don’t accidentally end up producing a
cBPF program that behaves differently from the unoptimized one,
gVisor also has
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/secfuzz/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; fuzz tests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Because gVisor knows which high-level filters went into constructing the
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program, it also
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/runsc/boot/filter/filter_fuzz_test.go&quot;&gt;automatically generates test cases&lt;/a&gt;
from these filters, and the fuzzer verifies that each line and every branch of
the optimized cBPF bytecode is executed, and that the result is the same as
giving the same input to the unoptimized program.&lt;/p&gt;

&lt;p&gt;(Line or branch coverage of the unoptimized program is not enforceable, because
without optimizations, the bytecode contains many redundant checks for which
later branches can never be reached.)&lt;/p&gt;

&lt;h2 id=&quot;optimizing-in-gvisor-seccomp-bpf-filtering&quot;&gt;Optimizing in-gVisor &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering&lt;/h2&gt;

&lt;p&gt;gVisor supports sandboxed applications adding &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters onto
themselves, and
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/bpf/interpreter.go&quot;&gt;implements its own cBPF interpreter&lt;/a&gt;
for this purpose.&lt;/p&gt;

&lt;p&gt;Because the cBPF bytecode-level optimizations are lossless and are generally
applicable to any cBPF program, they are applied onto programs uploaded by
sandboxed applications to make filter evaluation faster in gVisor itself.&lt;/p&gt;

&lt;p&gt;Additionally, gVisor removed the use of Go interfaces previously used for
loading data from the BPF “input” (i.e. the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp_data&lt;/code&gt; struct for
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;). This used to require an endianness-specific interface due to how
the BPF interpreter was used in two places in gVisor: network processing (which
uses network byte ordering), and &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; (which uses native byte
ordering). This interface has now been replaced with
&lt;a href=&quot;https://go.dev/doc/tutorial/generics&quot;&gt;Go generics&lt;/a&gt;, yielding a 2x speedup
on &lt;a href=&quot;#sample-filter&quot;&gt;the reference simplistic &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter&lt;/a&gt;. The more
&lt;code class=&quot;highlighter-rouge&quot;&gt;load&lt;/code&gt; instructions a filter contains, the larger the gain. &lt;em&gt;(Naturally, this
also benefits network filtering performance!)&lt;/em&gt;&lt;/p&gt;

&lt;h3 id=&quot;gvisor-cbpf-interpreter-performance&quot;&gt;gVisor cBPF interpreter performance&lt;/h3&gt;

&lt;p&gt;The graph below shows the gVisor cBPF interpreter’s performance against three
sample filters: &lt;a href=&quot;#sample-filter&quot;&gt;the reference simplistic &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter&lt;/a&gt;,
and optimized vs unoptimized versions of gVisor’s own syscall filter (to
represent a more complex filter).&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-interpreter.png&quot; alt=&quot;gVisor cBPF interpreter performance&quot; title=&quot;gVisor cBPF interpreter performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;seccomp-bpf-filter-result-caching-for-sandboxed-applications&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter result caching for sandboxed applications&lt;/h3&gt;

&lt;p&gt;Lastly, gVisor now also implements an in-sandbox caching mechanism for syscalls
whose filter result does not depend on the &lt;code class=&quot;highlighter-rouge&quot;&gt;instruction_pointer&lt;/code&gt; or syscall arguments. Unlike
Linux’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; cache, gVisor’s implementation also handles actions other
than “allow”, and supports the entire set of cBPF instructions rather than the
restricted emulator Linux uses for caching evaluation purposes. This removes the
interpreter from the syscall hot path entirely for cacheable syscalls, further
speeding up system calls from applications that use &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; within gVisor.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-cache.png&quot; alt=&quot;gVisor seccomp-bpf cache&quot; title=&quot;gVisor seccomp-bpf cache&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;faster-gvisor-startup-via-filter-precompilation&quot;&gt;Faster gVisor startup via filter precompilation&lt;/h2&gt;

&lt;p&gt;Due to these optimizations, the overall process of building the syscall
filtering rules, rendering them to cBPF bytecode, and running all the
optimizations can take quite a while (~10ms). Since one of gVisor’s strengths is
that it starts much faster than a VM, this delay is unacceptable.&lt;/p&gt;

&lt;p&gt;Therefore, gVisor now
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/seccomp/precompiledseccomp/&quot;&gt;precompiles the rules&lt;/a&gt;
to optimized cBPF bytecode for most possible gVisor configurations. This means
the &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; binary contains cBPF bytecode embedded in it for some subset of
popular configurations, and it will use this bytecode rather than compiling the
cBPF program from scratch during startup. If &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; is invoked with a
configuration for which the cBPF bytecode isn’t embedded in the &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; binary,
it will fall back to compiling the program from scratch.&lt;/p&gt;

&lt;details&gt;

  &lt;summary&gt;

    &lt;h3 id=&quot;dealing-with-dynamic-values-in-precompiled-rules&quot;&gt;Dealing with dynamic values in precompiled rules&lt;/h3&gt;

  &lt;/summary&gt;

  &lt;p&gt;One challenge with this approach is to support parts of the configuration that
are only known at &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; startup time. For example, many filters act on a
specific file descriptor used for interacting with the &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; process after
startup over a Unix Domain Socket (called the “controller FD”). This is an
integer that is only known at runtime, so its value cannot be embedded inside
the optimized cBPF bytecode prepared at &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; compilation time.&lt;/p&gt;

  &lt;p&gt;To address this, the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; precompilation tooling actually supports the
notion of 32-bit “variables”, and takes as input a function to render cBPF
bytecode for a given key-value mapping of variables to placeholder 32-bit
values. The precompiler calls this function &lt;em&gt;twice&lt;/em&gt; with different arbitrary
value mappings for each variable, and observes where these arbitrary values show
up in the generated cBPF bytecode. This takes advantage of the fact that
gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program generation is deterministic.&lt;/p&gt;

  &lt;p&gt;If the two cBPF programs are of the same byte length, and the placeholder values
show up at exactly the same byte offsets within the cBPF bytecode both times,
and the rest of the cBPF bytecode is byte-for-byte equivalent, the precompiler
has very high confidence that these offsets are where the 32-bit variables are
represented in the cBPF bytecode. It then stores these offsets as part of the
embedded data inside the &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; binary. Finally, at &lt;code class=&quot;highlighter-rouge&quot;&gt;runsc&lt;/code&gt; execution time, the
bytes at these offsets are replaced with the now-known values of the variables.&lt;/p&gt;

&lt;/details&gt;

&lt;h2 id=&quot;performance&quot;&gt;OK that’s great and all, but is gVisor actually faster?&lt;/h2&gt;

&lt;p&gt;The short answer is: &lt;strong&gt;yes, but only slightly&lt;/strong&gt;. As we
&lt;a href=&quot;#performance-considerations&quot;&gt;established earlier&lt;/a&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; is only a
small portion of gVisor’s total overhead, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt; benchmark shows
that this work only removes a portion of that overhead, so we should not expect
large differences here.&lt;/p&gt;

&lt;p&gt;Let’s come back to the trusty ABSL build benchmark, with a new build of gVisor
with all of these optimizations turned on:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-absl-vs-unsandboxed.png&quot; alt=&quot;ABSL build performance&quot; title=&quot;ABSL build performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Let’s zoom the vertical axis in on the gVisor variants to see the difference
better:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-absl.png&quot; alt=&quot;ABSL build performance&quot; title=&quot;ABSL build performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is about in line with what the earlier benchmarks showed. The initial
benchmarks showed that &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead for this benchmark was
~3.6% of total runtime, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;secbench&lt;/code&gt; benchmarks showed
that the optimizations reduced &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter evaluation time by ~29% in
aggregate. The expected absolute reduction in total runtime is therefore about
3.6% × 29% ≈ 1%, which is just about what this result shows.&lt;/p&gt;

&lt;p&gt;Other benchmarks show a similar pattern. Here’s gRPC build, similar to ABSL:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-grpc-vs-unsandboxed.png&quot; alt=&quot;gRPC build performance&quot; title=&quot;gRPC build performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-grpc.png&quot; alt=&quot;gRPC build performance&quot; title=&quot;gRPC build performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here’s a benchmark running the
&lt;a href=&quot;https://github.com/fastlane/fastlane&quot;&gt;Ruby Fastlane&lt;/a&gt; test suite:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-rubydev-vs-unsandboxed.png&quot; alt=&quot;Ruby Fastlane performance&quot; title=&quot;Ruby Fastlane performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-rubydev.png&quot; alt=&quot;Ruby Fastlane performance&quot; title=&quot;Ruby Fastlane performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here’s the 50th percentile of nginx serving latency for an empty webpage.
&lt;a href=&quot;https://www.prnewswire.com/news-releases/akamai-online-retail-performance-report-milliseconds-are-critical-300441498.html&quot;&gt;Every microsecond counts when it comes to web serving&lt;/a&gt;,
and here we’ve shaved off 20 of them.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-nginx-vs-unsandboxed.png&quot; alt=&quot;nginx performance&quot; title=&quot;nginx performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-nginx.png&quot; alt=&quot;nginx performance&quot; title=&quot;nginx performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;CUDA workloads also get a boost from this work. Since their gVisor-related
overhead is already relatively small, &lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering makes up a
higher proportion of their overhead&lt;/strong&gt;. Additionally, as the performance
improvements described in this post disproportionately help the &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt;
system call, this cuts a larger portion of the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filtering overhead
of these workloads, since CUDA uses the &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; system call to communicate
with the GPU.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-pytorch-vs-unsandboxed.png&quot; alt=&quot;PyTorch performance&quot; title=&quot;PyTorch performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2024-02-01-gvisor-seccomp-pytorch.png&quot; alt=&quot;PyTorch performance&quot; title=&quot;PyTorch performance&quot; style=&quot;max-width:100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While some of these results may not seem like much in absolute terms, it’s
important to remember:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;These improvements have resulted in gVisor being able to enforce &lt;strong&gt;more&lt;/strong&gt;
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters than it previously could; gVisor’s &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;
filter was nearly half the maximum &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; program size, so it could
at most double in complexity. After optimizations, it is reduced to less
than a fourth of this size.&lt;/li&gt;
  &lt;li&gt;These improvements allow the gVisor filters to &lt;strong&gt;scale better&lt;/strong&gt;. This is
visible from the effects on &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; performance with &lt;code class=&quot;highlighter-rouge&quot;&gt;nvproxy&lt;/code&gt; enabled.&lt;/li&gt;
  &lt;li&gt;The resulting work has produced useful libraries for &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; tooling
which may be helpful for other projects: testing, fuzzing, and benchmarking
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters.&lt;/li&gt;
  &lt;li&gt;This overhead could not have been addressed any other way. Unlike other
sources of overhead in gVisor, such as networking or file I/O, the cost of the
host kernel evaluating the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter is incurred outside of gVisor itself,
so it can only be reduced by this type of work.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;further-work&quot;&gt;Further work&lt;/h2&gt;

&lt;p&gt;One potential source of work is to look into the performance gap between no
&lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filter at all versus performance with an empty &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt;
filter (equivalent to an all-cacheable filter). This points to a potential
inefficiency in the Linux kernel implementation of the &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; cache.&lt;/p&gt;

&lt;p&gt;Another potential improvement is to apply the optimizations that went into
searching for a syscall number to the arguments of the
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/ioctl.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; system call&lt;/a&gt;. &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; is a “grab-bag” kind of system call,
used by many drivers and other subsystems of the Linux kernel to extend the syscall
interface without using up valuable syscall numbers. For example, the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine&quot;&gt;KVM&lt;/a&gt; subsystem is
almost entirely controlled through &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt; system calls issued against
&lt;code class=&quot;highlighter-rouge&quot;&gt;/dev/kvm&lt;/code&gt; or against per-VM file descriptors.&lt;/p&gt;
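
&lt;p&gt;To make the “grab-bag” nature concrete, here is a minimal Python sketch of how one request value packs a subsystem and an operation into a single integer. It assumes the request layout commonly used on x86-64 and arm64 (8 bits of command number, 8 bits of subsystem type, 14 bits of size, 2 bits of direction). The helper names &lt;code class=&quot;highlighter-rouge&quot;&gt;ioc&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;decode&lt;/code&gt; are made up for illustration and are not part of any real API.&lt;/p&gt;

```python
# Sketch of the Linux ioctl request layout (x86-64 / arm64 style).
# Helper names are illustrative, not part of any real API.

NR_BITS, TYPE_BITS, SIZE_BITS = 8, 8, 14


def ioc(direction, type_, nr, size):
    """Pack the four fields of an ioctl request into one integer."""
    value = direction
    value = value * 2**SIZE_BITS + size
    value = value * 2**TYPE_BITS + type_
    value = value * 2**NR_BITS + nr
    return value


def decode(request):
    """Split an ioctl request back into (direction, type, nr, size)."""
    rest, nr = divmod(request, 2**NR_BITS)
    rest, type_ = divmod(rest, 2**TYPE_BITS)
    direction, size = divmod(rest, 2**SIZE_BITS)
    return direction, type_, nr, size


KVMIO = 0xAE  # subsystem "type" byte used by KVM requests
KVM_GET_API_VERSION = ioc(0, KVMIO, 0x00, 0)
KVM_CREATE_VM = ioc(0, KVMIO, 0x01, 0)
KVM_RUN = ioc(0, KVMIO, 0x80, 0)

if __name__ == '__main__':
    for name, req in [('KVM_GET_API_VERSION', KVM_GET_API_VERSION),
                      ('KVM_CREATE_VM', KVM_CREATE_VM),
                      ('KVM_RUN', KVM_RUN)]:
        print(name, hex(req), decode(req))
```

&lt;p&gt;All three KVM operations go through the same syscall number and differ only in the request argument, which is why a filter that wants to distinguish them has to inspect that argument.&lt;/p&gt;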

&lt;p&gt;For this reason, the first non-file-descriptor argument of &lt;a href=&quot;https://man7.org/linux/man-pages/man2/ioctl.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl(2)&lt;/code&gt;&lt;/a&gt;
(“request”) usually encodes something analogous to what the syscall number
represents: the type of request made to the kernel. Currently, gVisor’s filter
performs a linear scan through all allowed values of this argument. This
is usually fine, but features like &lt;code class=&quot;highlighter-rouge&quot;&gt;nvproxy&lt;/code&gt; massively expand this
list of allowed values, which makes the scan slow. &lt;code class=&quot;highlighter-rouge&quot;&gt;ioctl&lt;/code&gt; performance is also
critical for gVisor’s KVM platform. A binary search tree would make sense here.&lt;/p&gt;
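
&lt;p&gt;As a rough illustration of the difference, the following Python sketch models the two lookup strategies over a list of allowed request values. It is not gVisor’s implementation: real &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; filters are cBPF programs in which each comparison becomes a jump instruction, and the request values and function names below are arbitrary samples chosen for this example. It only shows why a sorted structure needs roughly log2(n) comparisons where a linear scan needs up to n.&lt;/p&gt;

```python
import bisect

# Arbitrary sample request values; a feature like nvproxy would add
# many more entries to a list like this.
ALLOWED_REQUESTS = sorted([0x5401, 0x541B, 0xAE01, 0xAE41, 0xAE80, 0xC0104600])


def linear_allowed(request, allowed=ALLOWED_REQUESTS):
    """Linear scan: compare against every entry until one matches."""
    for value in allowed:
        if value == request:
            return True
    return False


def binary_allowed(request, allowed=ALLOWED_REQUESTS):
    """Binary search over the sorted list: about log2(n) comparisons."""
    index = bisect.bisect_left(allowed, request)
    return index != len(allowed) and allowed[index] == request


if __name__ == '__main__':
    for request in (0xAE80, 0xDEAD):
        print(hex(request), linear_allowed(request), binary_allowed(request))
```

&lt;p&gt;Both functions give the same answer for every input. With a few thousand allowed values, the linear version may perform thousands of comparisons per call, while the binary search needs about a dozen.&lt;/p&gt;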

&lt;p&gt;gVisor welcomes further contributions to its &lt;code class=&quot;highlighter-rouge&quot;&gt;seccomp-bpf&lt;/code&gt; machinery. Thanks for
reading!&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;cBPF does not have a canonical assembly-style representation. The
assembly-like code in this blog post is close to
&lt;a href=&quot;https://man7.org/linux/man-pages/man8/bpfc.8.html&quot;&gt;the one used in &lt;code class=&quot;highlighter-rouge&quot;&gt;bpfc&lt;/code&gt;&lt;/a&gt;
but diverges from it where that makes it clearer what is happening,
and all code is annotated with &lt;code class=&quot;highlighter-rouge&quot;&gt;// comments&lt;/code&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>eperot</name></author><summary type="html">gVisor is a multi-layered security sandbox. seccomp-bpf is gVisor’s second layer of defense against container escape attacks. gVisor uses seccomp-bpf to filter its own syscalls by the host kernel. This significantly reduces the attack surface to the host that a compromised gVisor process can access. However, this layer comes at a cost: every legitimate system call that gVisor makes must be evaluated against this filter by the host kernel before it is actually executed. This blog post contains more than you ever wanted to know about seccomp-bpf, and explores the past few months of work to optimize gVisor’s use of it.</summary></entry><entry><title type="html">Faster filesystem access with Directfs</title><link href="/blog/2023/06/27/directfs/" rel="alternate" type="text/html" title=" Faster filesystem access with Directfs" /><published>2023-06-27T00:00:00-05:00</published><updated>2023-06-27T00:00:00-05:00</updated><id>/blog/2023/06/27/directfs</id><content type="html" xml:base="/blog/2023/06/27/directfs/">&lt;p&gt;Directfs is now the default in runsc. This feature gives gVisor’s application
kernel (the Sentry) secure direct access to the container filesystem, avoiding
expensive round trips to the filesystem gofer. Learn more about this feature in
the following blog that was
&lt;a href=&quot;https://opensource.googleblog.com/2023/06/optimizing-gvisor-filesystems-with-directfs.html&quot;&gt;originally posted&lt;/a&gt;
on &lt;a href=&quot;https://opensource.googleblog.com/&quot;&gt;Google Open Source Blog&lt;/a&gt;.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;h2 id=&quot;origins-of-the-gofer&quot;&gt;Origins of the Gofer&lt;/h2&gt;

&lt;p&gt;gVisor is used internally at Google to run a variety of services and workloads.
One of the challenges we faced while building gVisor was providing remote
filesystem access securely to the sandbox. gVisor’s strict
&lt;a href=&quot;https://gvisor.dev/docs/architecture_guide/security/&quot;&gt;security model&lt;/a&gt; and
defense in depth approach assumes that the sandbox may get compromised because
it shares the same execution context as the untrusted application. Hence the
sandbox cannot be given sensitive keys and credentials to access Google-internal
remote filesystems.&lt;/p&gt;

&lt;p&gt;To address this challenge, we added a trusted filesystem proxy called a “gofer”.
The gofer runs outside the sandbox, and provides a secure interface for
untrusted containers to access such remote filesystems. For architectural
simplicity, gofers were used to serve local filesystems as well as remote ones.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-27-gofer-proxy.svg&quot; alt=&quot;Figure 1&quot; title=&quot;Filesystem gofer proxy&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;isolating-the-container-filesystem-in-runsc&quot;&gt;Isolating the Container Filesystem in runsc&lt;/h2&gt;

&lt;p&gt;When gVisor was &lt;a href=&quot;https://github.com/google/gvisor&quot;&gt;open sourced&lt;/a&gt; as
&lt;a href=&quot;https://gvisor.dev/docs/&quot;&gt;runsc&lt;/a&gt;, the same gofer model was copied over to
maintain the same security guarantees. runsc was configured to start one gofer
process per container which serves the container filesystem to the sandbox over
a predetermined protocol (now
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/lisafs&quot;&gt;LISAFS&lt;/a&gt;). However, a gofer
adds a layer of indirection with significant overhead.&lt;/p&gt;

&lt;p&gt;This gofer model (built for remote filesystems) brings very few advantages for
the runsc use-case, where all the filesystems served by the gofer (like rootfs
and &lt;a href=&quot;https://docs.docker.com/storage/bind-mounts/&quot;&gt;bind mounts&lt;/a&gt;) are mounted
locally on the host. The gofer directly accesses them using filesystem syscalls.&lt;/p&gt;

&lt;p&gt;Linux provides some security primitives to effectively isolate local
filesystems. These include
&lt;a href=&quot;https://man7.org/linux/man-pages/man7/mount_namespaces.7.html&quot;&gt;mount namespaces&lt;/a&gt;,
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/pivot_root.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;pivot_root&lt;/code&gt;&lt;/a&gt; and
detached bind mounts&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. &lt;strong&gt;Directfs&lt;/strong&gt; is a new filesystem access mode that uses
these primitives to expose the container filesystem to the sandbox in a secure
manner. The sandbox’s view of the filesystem tree is limited to just the
container filesystem. The sandbox process is not given access to anything
mounted on the broader host filesystem. Even if the sandbox gets compromised,
these mechanisms provide additional barriers to prevent broader system
compromise.&lt;/p&gt;

&lt;h2 id=&quot;directfs&quot;&gt;Directfs&lt;/h2&gt;

&lt;p&gt;In directfs mode, the gofer still exists as a cooperative process outside the
sandbox. As usual, the gofer enters a new mount namespace, sets up appropriate
bind mounts to create the container filesystem in a new directory and then
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/pivot_root.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;pivot_root(2)&lt;/code&gt;&lt;/a&gt;s into
that directory. Similarly, the sandbox process enters new user and mount
namespaces and then
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/pivot_root.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;pivot_root(2)&lt;/code&gt;&lt;/a&gt;s into
an empty directory to ensure it cannot access anything via path traversal. But
instead of making RPCs to the gofer to access the container filesystem, the
sandbox requests the gofer to provide file descriptors to all the mount points
via &lt;a href=&quot;https://man7.org/linux/man-pages/man7/unix.7.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;SCM_RIGHTS&lt;/code&gt; messages&lt;/a&gt;.
The sandbox then directly makes file-descriptor-relative syscalls (e.g.
&lt;a href=&quot;https://linux.die.net/man/2/fstatat&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;fstatat(2)&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://linux.die.net/man/2/openat&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;openat(2)&lt;/code&gt;&lt;/a&gt;,
&lt;a href=&quot;https://linux.die.net/man/2/mkdirat&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;mkdirat(2)&lt;/code&gt;&lt;/a&gt;, etc.) to perform filesystem
operations.&lt;/p&gt;
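
&lt;p&gt;The FD donation and the FD-relative syscalls can be sketched in a few lines of Python. This is only an illustration of the mechanism, not gVisor code: a single process plays both the gofer and the sandbox, the function names are invented for this example, and it assumes Linux with Python 3.9 or later (for &lt;code class=&quot;highlighter-rouge&quot;&gt;socket.send_fds&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;socket.recv_fds&lt;/code&gt;).&lt;/p&gt;

```python
import os
import socket
import tempfile


def gofer_send_mount(sock, path):
    """Gofer side: open the mount point and donate the FD over the socket."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        socket.send_fds(sock, [b'mount'], [fd])  # sent as an SCM_RIGHTS message
    finally:
        os.close(fd)  # the receiver gets its own reference to the directory


def sandbox_recv_mount(sock):
    """Sandbox side: receive one directory FD from the gofer."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return fds[0]


def demo():
    gofer_end, sandbox_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with gofer_end, sandbox_end, tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, 'hello.txt'), 'w') as f:
            f.write('hi')
        gofer_send_mount(gofer_end, root)
        mount_fd = sandbox_recv_mount(sandbox_end)
        try:
            # FD-relative syscalls: the "sandbox" never uses a host path.
            size = os.stat('hello.txt', dir_fd=mount_fd,
                           follow_symlinks=False).st_size   # fstatat(2)
            fd = os.open('hello.txt', os.O_RDONLY | os.O_NOFOLLOW,
                         dir_fd=mount_fd)                    # openat(2)
            try:
                data = os.read(fd, 16)
            finally:
                os.close(fd)
        finally:
            os.close(mount_fd)
    return size, data


if __name__ == '__main__':
    print(demo())
```

&lt;p&gt;Note that the receiving side only ever names files relative to the donated directory FD, which is what lets the real sandbox live in an empty root directory.&lt;/p&gt;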

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-27-directfs.svg&quot; alt=&quot;Figure 2&quot; title=&quot;Directfs configuration&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Previously, when the gofer performed all filesystem operations, we could deny all
these syscalls in the sandbox process using seccomp. But with directfs enabled,
the sandbox process’s seccomp filters need to allow the usage of these syscalls.
Most notably, the sandbox can now make
&lt;a href=&quot;https://linux.die.net/man/2/openat&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;openat(2)&lt;/code&gt;&lt;/a&gt; syscalls (which allow path
traversal), but with certain restrictions:
&lt;a href=&quot;https://github.com/google/gvisor/commit/114a033bd038519fa6e867c230dc4ad4e057e675&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;O_NOFOLLOW&lt;/code&gt; is required&lt;/a&gt;,
&lt;a href=&quot;https://github.com/google/gvisor/commit/fcbc289a7ac14b8d84d0c0b23c4b2a14fc626e79&quot;&gt;no access to procfs&lt;/a&gt;
and
&lt;a href=&quot;https://github.com/google/gvisor/commit/aa8abdfa9256cf057202ec8f4a81ba9f5d6a203f&quot;&gt;no directory FDs from the host&lt;/a&gt;.
We also had to give the sandbox the same privileges as the gofer (for example
&lt;code class=&quot;highlighter-rouge&quot;&gt;CAP_DAC_OVERRIDE&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;CAP_DAC_READ_SEARCH&lt;/code&gt;), so it can perform the same
filesystem operations.&lt;/p&gt;

&lt;p&gt;It is noteworthy that only the trusted gofer provides FDs (of the container
filesystem) to the sandbox. The sandbox cannot walk backwards (using ‘..’) or
follow a malicious symlink to escape out of the container filesystem. In effect,
we’ve decreased our dependence on the syscall filters to catch bad behavior, but
correspondingly increased our dependence on Linux’s filesystem isolation
protections.&lt;/p&gt;

&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;

&lt;p&gt;Making RPCs to the gofer for every filesystem operation adds a lot of overhead
to runsc. Hence, avoiding gofer round trips significantly improves performance.
Let’s find out what this means for some of our benchmarks. We will run the
benchmarks using our newly released
&lt;a href=&quot;https://gvisor.dev/blog/2023/04/28/systrap-release/&quot;&gt;systrap platform&lt;/a&gt; on bind
mounts (as opposed to rootfs). This simulates more realistic use cases
because bind mounts are extensively used while configuring filesystems in
containers. Bind mounts also do not have an overlay
(&lt;a href=&quot;https://opensource.googleblog.com/2023/04/gvisor-improves-performance-with-root-filesystem-overlay.html&quot;&gt;unlike the rootfs mount&lt;/a&gt;),
so all operations go through the goferfs / directfs mount.&lt;/p&gt;

&lt;p&gt;Let’s first look at our
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/perf/linux/stat_benchmark.cc&quot;&gt;stat micro-benchmark&lt;/a&gt;,
which repeatedly calls
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/lstat.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;stat(2)&lt;/code&gt;&lt;/a&gt; on a file.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-27-stat-benchmark.svg&quot; alt=&quot;Figure 3&quot; title=&quot;Stat micro benchmark&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;highlighter-rouge&quot;&gt;stat(2)&lt;/code&gt; syscall is more than 2x faster! However, since this is not
representative of real-world applications, we should not extrapolate these
results. So let’s look at some
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/test/benchmarks/fs&quot;&gt;real-world benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-27-real-world-benchmarks.svg&quot; alt=&quot;Figure 4&quot; title=&quot;Real world benchmarks&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We see a 12% reduction in the absolute time to run these workloads and a 17%
reduction in Ruby load time!&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The gofer model in runsc was overly restrictive for accessing host files. We
were able to leverage existing filesystem isolation mechanisms in Linux to
bypass the gofer without compromising security. Directfs significantly improves
performance for certain workloads. This is part of our ongoing efforts to
improve gVisor performance. You can learn more about gVisor at
&lt;a href=&quot;http://www.gvisor.dev/&quot;&gt;gvisor.dev&lt;/a&gt;. You can also use gVisor in
&lt;a href=&quot;https://cloud.google.com/kubernetes-engine&quot;&gt;GKE&lt;/a&gt; with
&lt;a href=&quot;https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods&quot;&gt;GKE Sandbox&lt;/a&gt;.
Happy sandboxing!&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Detached bind mounts can be created by first creating a bind mount using
&lt;code class=&quot;highlighter-rouge&quot;&gt;mount(MS_BIND)&lt;/code&gt; and then detaching it from the filesystem tree using
&lt;code class=&quot;highlighter-rouge&quot;&gt;umount2(MNT_DETACH)&lt;/code&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>ayushranjan</name></author><summary type="html">Directfs is now the default in runsc. This feature gives gVisor’s application kernel (the Sentry) secure direct access to the container filesystem, avoiding expensive round trips to the filesystem gofer. Learn more about this feature in the following blog that was originally posted on Google Open Source Blog.</summary></entry><entry><title type="html">Running Stable Diffusion on GPU with gVisor</title><link href="/blog/2023/06/20/gpu-pytorch-stable-diffusion/" rel="alternate" type="text/html" title=" Running Stable Diffusion on GPU with gVisor" /><published>2023-06-20T00:00:00-05:00</published><updated>2023-06-20T00:00:00-05:00</updated><id>/blog/2023/06/20/gpu-pytorch-stable-diffusion</id><content type="html" xml:base="/blog/2023/06/20/gpu-pytorch-stable-diffusion/">&lt;p&gt;gVisor is &lt;a href=&quot;https://github.com/google/gvisor/blob/master/g3doc/proposals/nvidia_driver_proxy.md&quot;&gt;starting to support GPU&lt;/a&gt; workloads. This post
showcases running the &lt;a href=&quot;https://stability.ai/blog/stable-diffusion-public-release&quot;&gt;Stable Diffusion&lt;/a&gt; generative model from &lt;a href=&quot;https://stability.ai/&quot;&gt;Stability AI&lt;/a&gt; to
generate images using a GPU from within gVisor. Both the
&lt;a href=&quot;https://github.com/AUTOMATIC1111/stable-diffusion-webui&quot;&gt;Automatic1111 Stable Diffusion web UI&lt;/a&gt;
and the &lt;a href=&quot;https://pytorch.org/&quot;&gt;PyTorch&lt;/a&gt; code used by Stable Diffusion were run entirely within gVisor
while being able to leverage the NVIDIA GPU.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-sandboxed-gpu.png&quot; alt=&quot;A sandboxed GPU&quot; title=&quot;A sandboxed GPU.&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;Sand&lt;/strong&gt;boxing a GPU. Generated with Stable Diffusion
v1.5.&lt;br /&gt;This picture gets a lot deeper once you realize that GPUs are made out
of sand.&lt;/span&gt;&lt;/p&gt;

&lt;h2 id=&quot;disclaimer&quot;&gt;Disclaimer&lt;/h2&gt;

&lt;p&gt;As of this writing (2023-06), &lt;a href=&quot;https://github.com/google/gvisor/blob/master/g3doc/proposals/nvidia_driver_proxy.md&quot;&gt;gVisor’s GPU support&lt;/a&gt; is not
generalized. Only some PyTorch workloads have been tested on NVIDIA T4, L4,
A100, and H100 GPUs, and only with specific driver versions. To list the driver
versions that your runsc version supports, use one of the commands below.
Contributions are welcome to expand this set to support other GPUs and driver versions!&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# From a cloned gVisor repository:
$ make run TARGETS=runsc ARGS=&quot;nvproxy list-supported-drivers&quot;

# From a runsc binary:
$ runsc nvproxy list-supported-drivers
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Additionally, while gVisor does its best to sandbox the workload, interacting
with the GPU inherently requires running code on GPU hardware, where isolation
is enforced by the GPU driver and hardware itself rather than gVisor. More to
come soon on the value of the protection gVisor provides for GPU workloads.&lt;/p&gt;

&lt;p&gt;In a few months, gVisor’s GPU support will have broadened and become
easier to use, such that it will not be constrained to the specific sets of
versions used here. In the meantime, this blog post stands as an example of what’s
possible today with gVisor’s GPU support.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-spacesuit-helmets.png&quot; alt=&quot;Various space suit helmets&quot; title=&quot;Various space suit helmets.&quot; width=&quot;100%&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;A collection of astronaut helmets in various styles&lt;/strong&gt;.&lt;br /&gt;Other than the helmet in the center, each helmet was generated using Stable Diffusion v1.5.&lt;/span&gt;&lt;/p&gt;

&lt;h2 id=&quot;why-even-do-this&quot;&gt;Why even do this?&lt;/h2&gt;

&lt;p&gt;The recent explosion of machine learning models has led to a large number of new
open-source projects. Much like it is good practice to be careful about running
new software downloaded from the Internet, it is good practice to run new
open-source projects in a sandbox. For projects like the
&lt;a href=&quot;https://github.com/AUTOMATIC1111/stable-diffusion-webui&quot;&gt;Automatic1111 Stable Diffusion web UI&lt;/a&gt;,
which automatically download various models, components, and
&lt;a href=&quot;https://github.com/AUTOMATIC1111/stable-diffusion-webui-extensions/blob/master/index.json&quot;&gt;extensions&lt;/a&gt; from external repositories as
the user enables them in the web UI, this principle applies all the more.&lt;/p&gt;

&lt;p&gt;Additionally, within the machine learning space, tooling for packaging and
distributing models is still nascent. While some models (including Stable
Diffusion) are packaged using the more secure &lt;a href=&quot;https://github.com/huggingface/safetensors&quot;&gt;safetensors&lt;/a&gt; format, &lt;strong&gt;the
majority of models available online today are distributed using the
&lt;a href=&quot;https://www.splunk.com/en_us/blog/security/paws-in-the-pickle-jar-risk-vulnerability-in-the-model-sharing-ecosystem.html&quot;&gt;Pickle format&lt;/a&gt;, which can execute arbitrary Python code&lt;/strong&gt; upon deserialization.
As such, even when using trustworthy software, using Pickle-formatted models may
still be risky (&lt;strong&gt;Edited 2024-04-04:
&lt;a href=&quot;https://www.wiz.io/blog/wiz-and-hugging-face-address-risks-to-ai-infrastructure&quot;&gt;this exact vulnerability vector was found in Hugging Face’s Inference API&lt;/a&gt;&lt;/strong&gt;).
gVisor provides a layer of protection around this process which helps protect
the host machine.&lt;/p&gt;
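
&lt;p&gt;The following minimal Python sketch shows why loading an untrusted Pickle file is dangerous. The &lt;code class=&quot;highlighter-rouge&quot;&gt;Payload&lt;/code&gt; class and &lt;code class=&quot;highlighter-rouge&quot;&gt;load_untrusted&lt;/code&gt; function are made up for illustration. The &lt;code class=&quot;highlighter-rouge&quot;&gt;__reduce__&lt;/code&gt; method tells pickle to call &lt;code class=&quot;highlighter-rouge&quot;&gt;eval&lt;/code&gt; on a string chosen by whoever produced the file. Here the string is harmless arithmetic, but it could be any Python expression.&lt;/p&gt;

```python
import pickle


class Payload:
    """Stand-in for a malicious object embedded in a model file."""

    def __reduce__(self):
        # pickle stores "call eval('6 * 7')" instead of the object's state.
        return (eval, ('6 * 7',))


def load_untrusted():
    blob = pickle.dumps(Payload())  # what an attacker would publish
    return pickle.loads(blob)       # what a victim runs to load the model


if __name__ == '__main__':
    print(load_untrusted())
```

&lt;p&gt;Loading the blob returns the result of the evaluated expression rather than a &lt;code class=&quot;highlighter-rouge&quot;&gt;Payload&lt;/code&gt; object: the code ran as a side effect of deserialization, before the application had any chance to inspect the data.&lt;/p&gt;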

&lt;p&gt;Third, &lt;strong&gt;machine learning applications are typically not I/O heavy&lt;/strong&gt;, which
means they tend not to experience significant performance overhead in gVisor.
Uploading code to the GPU does not take a significant number of system
calls, and most communication to/from the GPU happens over shared memory, where
gVisor imposes no overhead. Therefore, the question is not so much “why should I
run this GPU workload in gVisor?” but rather “why not?”.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-turbo.png&quot; alt=&quot;Cool astronauts don't look at explosions&quot; title=&quot;Cool astronauts don't look at explosions.&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;Cool astronauts don’t look at explosions&lt;/strong&gt;.
Generated using Stable Diffusion v1.5.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Lastly, running GPU workloads in gVisor is pretty cool.&lt;/p&gt;

&lt;h2 id=&quot;setup&quot;&gt;Setup&lt;/h2&gt;

&lt;p&gt;We use a Debian virtual machine on GCE. The machine needs to have a GPU and to
have sufficient RAM and disk space to handle Stable Diffusion and its large
model files. The following command creates a VM with 4 vCPUs, 15GiB of RAM, 64GB
of disk space, and an NVIDIA T4 GPU, running Debian 11 (bullseye). Since this is
just an experiment, the VM is set to self-destruct after 6 hours.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcloud compute instances create stable-diffusion-testing &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--zone&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;us-central1-a &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--machine-type&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;n1-standard-4 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--max-run-duration&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;6h &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--instance-termination-action&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;DELETE &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--maintenance-policy&lt;/span&gt; TERMINATE &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--accelerator&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1,type&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nvidia-tesla-t4 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--create-disk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;auto-delete&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;yes&lt;/span&gt;,boot&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;yes&lt;/span&gt;,device-name&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;stable-diffusion-testing,image&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;projects/debian-cloud/global/images/debian-11-bullseye-v20230509,mode&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;rw,size&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;64
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;gcloud compute ssh &lt;span class=&quot;nt&quot;&gt;--zone&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;us-central1-a stable-diffusion-testing
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All further commands in this post are performed while SSH’d into the VM. We
first need to install the specific NVIDIA driver version that gVisor is
currently compatible with.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; upgrade
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; build-essential linux-headers-&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;uname&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;runsc nvproxy list-supported-drivers
&lt;span class=&quot;nv&quot;&gt;$ DRIVER_VERSION&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;some-driver-version &lt;span class=&quot;c&quot;&gt;# Get from your runsc binary.&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-fSsl&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-O&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;https://us.download.nvidia.com/tesla/&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;/NVIDIA-Linux-x86_64-&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;.run&quot;&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;sh NVIDIA-Linux-x86_64-&lt;span class=&quot;nv&quot;&gt;$DRIVER_VERSION&lt;/span&gt;.run
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;!--
The above in a single line, for convenience:
DRIVER_VERSION=some-driver-version; sudo apt-get update &amp;&amp; sudo apt-get -y upgrade &amp;&amp; sudo apt-get install -y build-essential linux-headers-$(uname -r) &amp;&amp; curl -fSsl -O &quot;https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run&quot; &amp;&amp; sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run
--&gt;

&lt;p&gt;Next, we install Docker, per &lt;a href=&quot;https://docs.docker.com/engine/install/debian/&quot;&gt;its instructions&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; ca-certificates curl gnupg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; 0755 &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; /etc/apt/keyrings
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://download.docker.com/linux/debian/gpg | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--batch&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--yes&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /etc/apt/keyrings/docker.gpg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo chmod &lt;/span&gt;a+r /etc/apt/keyrings/docker.gpg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;deb [arch=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;dpkg &lt;span class=&quot;nt&quot;&gt;--print-architecture&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; /etc/os-release &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;$VERSION_CODENAME&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; stable&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/docker.list &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; docker-ce docker-ce-cli
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;!--
The above in a single line, for convenience:
sudo apt-get install -y ca-certificates curl gnupg &amp;&amp; sudo install -m 0755 -d /etc/apt/keyrings &amp;&amp; curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor --batch --yes -o /etc/apt/keyrings/docker.gpg &amp;&amp; sudo chmod a+r /etc/apt/keyrings/docker.gpg &amp;&amp; echo &quot;deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian $(. /etc/os-release &amp;&amp; echo &quot;$VERSION_CODENAME&quot;) stable&quot; | sudo tee /etc/apt/sources.list.d/docker.list &gt; /dev/null &amp;&amp; sudo apt-get update &amp;&amp; sudo apt-get install -y docker-ce docker-ce-cli
--&gt;

&lt;p&gt;We will also need the &lt;a href=&quot;https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html&quot;&gt;NVIDIA container toolkit&lt;/a&gt;, which enables use of GPUs with
Docker. Per its
&lt;a href=&quot;https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html&quot;&gt;installation instructions&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ distribution&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt; /etc/os-release&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;$ID$VERSION_ID&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://nvidia.github.io/libnvidia-container/gpgkey | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; curl &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt; https://nvidia.github.io/libnvidia-container/&lt;span class=&quot;nv&quot;&gt;$distribution&lt;/span&gt;/libnvidia-container.list | &lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/nvidia-container-toolkit.list
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; nvidia-container-toolkit
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Of course, we also need to &lt;a href=&quot;https://gvisor.dev/docs/user_guide/install/&quot;&gt;install gVisor&lt;/a&gt; itself.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; apt-transport-https ca-certificates curl gnupg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;curl &lt;span class=&quot;nt&quot;&gt;-fsSL&lt;/span&gt; https://gvisor.dev/archive.key | &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;gpg &lt;span class=&quot;nt&quot;&gt;--dearmor&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; /usr/share/keyrings/gvisor-archive-keyring.gpg
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;deb [arch=&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;$(&lt;/span&gt;dpkg &lt;span class=&quot;nt&quot;&gt;--print-architecture&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt; signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main&quot;&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;sudo tee&lt;/span&gt; /etc/apt/sources.list.d/gvisor.list &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; runsc

# As gVisor does not yet &lt;span class=&quot;nb&quot;&gt;enable &lt;/span&gt;GPU support by default, we need to &lt;span class=&quot;nb&quot;&gt;set &lt;/span&gt;the flags
# that will &lt;span class=&quot;nb&quot;&gt;enable &lt;/span&gt;it:
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;runsc &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--nvproxy&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--nvproxy-docker&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;

&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now, let’s make sure everything works by running commands that involve more and
more of what we just set up.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Check that the NVIDIA drivers are installed, with the right version, and with
# a supported GPU attached
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;nvidia-smi &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;
GPU 0: Tesla T4 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo cat&lt;/span&gt; /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  DRIVER_VERSION  Wed Nov 30 06:39:21 UTC 2022

# Check that Docker works.
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker version
# &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]
Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
# &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]

# Check that gVisor works.
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc debian:latest dmesg | &lt;span class=&quot;nb&quot;&gt;head&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-1&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;    0.000000] Starting gVisor...

# Check that Docker GPU support &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;without gVisor&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; works.
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;
GPU 0: Tesla T4 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;

# Check that gVisor works with the GPU.
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--rm&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;
GPU 0: Tesla T4 &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’re all set! Now we can actually get Stable Diffusion running.&lt;/p&gt;

&lt;p&gt;We used the following &lt;code class=&quot;highlighter-rouge&quot;&gt;Dockerfile&lt;/code&gt; to run Stable Diffusion and its web UI within
a GPU-enabled Docker container.&lt;/p&gt;

&lt;div class=&quot;language-dockerfile highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; python:3.10&lt;/span&gt;

# Set of dependencies that are needed to make this work.
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;apt-get update &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt-get &lt;span class=&quot;nb&quot;&gt;install&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-y&lt;/span&gt; git wget build-essential &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;        nghttp2 libnghttp2-dev libssl-dev ffmpeg libsm6 libxext6
# Clone the project at the revision used for this test.
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    &lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; /stable-diffusion-webui &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\
&lt;/span&gt;    git checkout baf6946e06249c5af9851c60171692c44ef633e0
# We don't want the build step to start the server.
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;'/start()/d'&lt;/span&gt; /stable-diffusion-webui/launch.py
# Install some pip packages.
# Note that this command will run as part of the Docker build process,
# which is *not* sandboxed by gVisor.
&lt;span class=&quot;k&quot;&gt;RUN &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cd&lt;/span&gt; /stable-diffusion-webui &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;COMMANDLINE_ARGS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;--skip-torch-cuda-test&lt;/span&gt; python launch.py
&lt;span class=&quot;k&quot;&gt;WORKDIR&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; /stable-diffusion-webui&lt;/span&gt;
# This causes the web UI to use the Gradio service to create a public URL.
# Do not use this if you plan on leaving the container running long-term.
&lt;span class=&quot;k&quot;&gt;ENV&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; COMMANDLINE_ARGS=--share&lt;/span&gt;
# Start the webui app.
&lt;span class=&quot;k&quot;&gt;CMD&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; [&quot;python&quot;, &quot;webui.py&quot;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We build the image and create a container with it using the &lt;code class=&quot;highlighter-rouge&quot;&gt;docker&lt;/code&gt;
command-line.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;cat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; Dockerfile
&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;... Paste the above contents...&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
^D
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker build &lt;span class=&quot;nt&quot;&gt;--tag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sdui &lt;span class=&quot;nb&quot;&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally, we can start the Stable Diffusion web UI. Note that it will take a long
time to start, as it has to download all the models from the Internet. To keep
this post simple, we didn’t set up any kind of volume that would enable data
persistence, so it will do this every time the container starts.&lt;/p&gt;

&lt;div class=&quot;language-shell highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker run &lt;span class=&quot;nt&quot;&gt;--runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;runsc &lt;span class=&quot;nt&quot;&gt;--gpus&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;all &lt;span class=&quot;nt&quot;&gt;--name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;sdui &lt;span class=&quot;nt&quot;&gt;--detach&lt;/span&gt; sdui

# Follow the logs:
&lt;span class=&quot;nv&quot;&gt;$ &lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;docker logs &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; sdui
# &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]
Calculating sha256 &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; /stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors: Running on &lt;span class=&quot;nb&quot;&gt;local &lt;/span&gt;URL:  http://127.0.0.1:7860
Running on public URL: https://4446d982b4129a66d7.gradio.live

This share &lt;span class=&quot;nb&quot;&gt;link &lt;/span&gt;expires &lt;span class=&quot;k&quot;&gt;in &lt;/span&gt;72 hours.
# &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;...]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We’re all set! Now we can browse to the Gradio URL shown in the logs and start
generating pictures, all within the secure confines of gVisor.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-stable-diffusion-web-ui.png&quot; alt=&quot;Stable Diffusion Web UI&quot; title=&quot;Stable Diffusion UI.&quot; width=&quot;100%&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;Stable Diffusion Web UI screenshot.&lt;/strong&gt; Inner image
generated with Stable Diffusion v1.5.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;Happy sandboxing!&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-06-20-astronaut-thumbs-up.png&quot; alt=&quot;Astronaut showing thumbs up&quot; title=&quot;Astronaut showing thumbs up.&quot; /&gt;
&lt;span class=&quot;attribution&quot;&gt;&lt;strong&gt;Happy sandboxing!&lt;/strong&gt; Generated with Stable Diffusion
v1.5.&lt;/span&gt;&lt;/p&gt;</content><author><name>eperot</name></author><summary type="html">gVisor is starting to support GPU workloads. This post showcases running the Stable Diffusion generative model from Stability AI to generate images using a GPU from within gVisor. Both the Automatic1111 Stable Diffusion web UI and the PyTorch code used by Stable Diffusion were run entirely within gVisor while being able to leverage the NVIDIA GPU.</summary></entry><entry><title type="html">Rootfs Overlay</title><link href="/blog/2023/05/08/rootfs-overlay/" rel="alternate" type="text/html" title=" Rootfs Overlay" /><published>2023-05-08T00:00:00-05:00</published><updated>2023-05-08T00:00:00-05:00</updated><id>/blog/2023/05/08/rootfs-overlay</id><content type="html" xml:base="/blog/2023/05/08/rootfs-overlay/">&lt;p&gt;Root filesystem overlay is now the default in runsc. This improves performance
for filesystem-heavy workloads by overlaying the container root filesystem with
a tmpfs filesystem. Learn more about this feature in the following blog that was
&lt;a href=&quot;https://opensource.googleblog.com/2023/04/gvisor-improves-performance-with-root-filesystem-overlay.html&quot;&gt;originally posted&lt;/a&gt;
on the &lt;a href=&quot;https://opensource.googleblog.com/&quot;&gt;Google Open Source Blog&lt;/a&gt;.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;h2 id=&quot;costly-filesystem-access&quot;&gt;Costly Filesystem Access&lt;/h2&gt;

&lt;p&gt;gVisor uses a trusted filesystem proxy process (“gofer”) to access the
filesystem on behalf of the sandbox. The sandbox process is considered untrusted
in gVisor’s
&lt;a href=&quot;https://gvisor.dev/docs/architecture_guide/security/&quot;&gt;security model&lt;/a&gt;. As a
result, it is not given direct access to the container filesystem and
&lt;a href=&quot;https://github.com/google/gvisor/tree/master/runsc/boot/filter&quot;&gt;its seccomp filters&lt;/a&gt;
do not allow filesystem syscalls.&lt;/p&gt;

&lt;p&gt;In gVisor, the container rootfs and
&lt;a href=&quot;https://docs.docker.com/storage/bind-mounts/#&quot;&gt;bind mounts&lt;/a&gt; are configured to
be served by a gofer.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-gofer-diagram.svg&quot; alt=&quot;Figure 1&quot; title=&quot;Gofer process diagram.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;When the container needs to perform a filesystem operation, it makes an RPC to
the gofer which makes host system calls and services the RPC. This is quite
expensive due to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;RPC cost: This is the cost of communicating with the gofer process,
including process scheduling, message serialization and
&lt;a href=&quot;https://en.wikipedia.org/wiki/Inter-process_communication&quot;&gt;IPC&lt;/a&gt; system
calls.
    &lt;ul&gt;
      &lt;li&gt;To ameliorate this, gVisor recently developed a purpose-built protocol
called &lt;a href=&quot;https://github.com/google/gvisor/tree/master/pkg/lisafs&quot;&gt;LISAFS&lt;/a&gt;
which is much more efficient than its predecessor.&lt;/li&gt;
      &lt;li&gt;gVisor is also
&lt;a href=&quot;https://groups.google.com/g/gvisor-users/c/v-ODHzCrIjE&quot;&gt;experimenting&lt;/a&gt;
with giving the sandbox direct access to the container filesystem in a
secure manner. This would essentially nullify RPC costs as it avoids the
gofer being in the critical path of filesystem operations.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Syscall cost: This is the cost of making the host syscall which actually
accesses/modifies the container filesystem. Syscalls are expensive, because
they perform context switches into the kernel and back into userspace.
    &lt;ul&gt;
      &lt;li&gt;To help with this, gVisor heavily caches the filesystem tree in memory.
So operations like
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/lstat.2.html&quot;&gt;stat(2)&lt;/a&gt; on cached
files are serviced quickly. But other operations like
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/mkdir.2.html&quot;&gt;mkdir(2)&lt;/a&gt; or
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/rename.2.html&quot;&gt;rename(2)&lt;/a&gt; still
need to make host syscalls.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;container-root-filesystem&quot;&gt;Container Root Filesystem&lt;/h2&gt;

&lt;p&gt;In Docker and Kubernetes, the container’s root filesystem (rootfs) is based on
the filesystem packaged with the image. The image’s filesystem is immutable. Any
change a container makes to the rootfs is stored separately and is destroyed
with the container. This way, the image’s filesystem can be shared efficiently
with all containers running the same image. This is different from bind mounts,
which allow containers to access the bound host filesystem tree. Changes to bind
mounts are always propagated to the host and persist after the container exits.&lt;/p&gt;

&lt;p&gt;Docker and Kubernetes both use the
&lt;a href=&quot;https://docs.kernel.org/filesystems/overlayfs.html&quot;&gt;overlay filesystem&lt;/a&gt; by
default to configure container rootfs. Overlayfs mounts are composed of one
upper layer and one or more lower layers. The overlay filesystem presents a merged
view of all these filesystem layers at its mount location and ensures that lower
layers are read-only while all changes are held in the upper layer. The lower
layer(s) constitute the “image layer” and the upper layer is the “container
layer”. When the container is destroyed, the upper layer mount is destroyed as
well, discarding the root filesystem changes the container may have made.
Docker’s
&lt;a href=&quot;https://docs.docker.com/storage/storagedriver/overlayfs-driver/#how-the-overlay2-driver-works&quot;&gt;overlayfs driver documentation&lt;/a&gt;
has a good explanation.&lt;/p&gt;
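
&lt;p&gt;The layering rules above can be modeled in a few lines of Python. This is a
toy sketch, not overlayfs itself: plain dictionaries stand in for directories,
and the &lt;code class=&quot;highlighter-rouge&quot;&gt;Overlay&lt;/code&gt; class and &lt;code class=&quot;highlighter-rouge&quot;&gt;WHITEOUT&lt;/code&gt; marker are hypothetical names chosen
for illustration.&lt;/p&gt;

```python
# Toy model of an overlay mount: one writable upper layer over read-only
# lower layers. Plain dicts stand in for directories; this is not overlayfs.
WHITEOUT = object()  # marker meaning 'deleted in the upper layer'


class Overlay:
    def __init__(self, *lowers):
        self.upper = {}             # container layer, discarded with the container
        self.lowers = list(lowers)  # image layers, topmost first, never modified

    def read(self, name):
        # The upper layer wins; a whiteout there hides any lower copy.
        if name in self.upper:
            if self.upper[name] is WHITEOUT:
                raise FileNotFoundError(name)
            return self.upper[name]
        for layer in self.lowers:
            if name in layer:
                return layer[name]
        raise FileNotFoundError(name)

    def write(self, name, data):
        self.upper[name] = data      # all changes land in the upper layer

    def remove(self, name):
        self.read(name)              # raises if the file is not visible
        self.upper[name] = WHITEOUT  # hide the lower copy without touching it


image = {'foo': 'foo from image', 'baz': 'baz from image'}
rootfs = Overlay(image)
rootfs.write('foo', 'foo from container')  # overwrite an image file
rootfs.write('bar', 'new file')            # create a new file
print(rootfs.read('foo'))     # foo from container
print(rootfs.read('baz'))     # baz from image
print(sorted(rootfs.upper))   # ['bar', 'foo']
```

&lt;p&gt;Dropping &lt;code class=&quot;highlighter-rouge&quot;&gt;rootfs.upper&lt;/code&gt; discards every change while the image dictionary stays
untouched, which is the property that lets one image be shared by many
containers.&lt;/p&gt;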

&lt;h2 id=&quot;rootfs-configuration-before&quot;&gt;Rootfs Configuration Before&lt;/h2&gt;

&lt;p&gt;Let’s consider an example where the image has files &lt;code class=&quot;highlighter-rouge&quot;&gt;foo&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;baz&lt;/code&gt;. The
container overwrites &lt;code class=&quot;highlighter-rouge&quot;&gt;foo&lt;/code&gt; and creates a new file &lt;code class=&quot;highlighter-rouge&quot;&gt;bar&lt;/code&gt;. The diagram below shows
how the root filesystem used to be configured in gVisor: every access or mutation
went through the gofer to the overlaid directory on the host. The diagram also
shows the state of the host overlay filesystem.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-before.svg&quot; alt=&quot;Figure 2&quot; title=&quot;Rootfs state before.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;opportunity-sandbox-internal-overlay&quot;&gt;Opportunity! Sandbox Internal Overlay&lt;/h2&gt;

&lt;p&gt;Given that the upper layer is destroyed with the container and that it is
expensive to access/mutate a host filesystem from the sandbox, why keep the
upper layer on the host at all? Instead we can move the upper layer &lt;strong&gt;into the
sandbox&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea is to overlay the rootfs using a sandbox-internal overlay mount. We can
use a tmpfs upper (container) layer and a read-only lower layer served by the
gofer client. Any changes to rootfs would be held in tmpfs (in-memory).
Accessing/mutating the upper layer would not require any gofer RPCs or syscalls
to the host. This really speeds up filesystem operations on the upper layer,
which contains newly created or copied-up files and directories.&lt;/p&gt;

&lt;p&gt;Using the same example as above, the following diagram shows what the rootfs
configuration would look like using a sandbox-internal overlay.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-memory.svg&quot; alt=&quot;Figure 3&quot; title=&quot;Memory-backed rootfs overlay.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;host-backed-overlay&quot;&gt;Host-Backed Overlay&lt;/h2&gt;

&lt;p&gt;The tmpfs mount by default will use the sandbox process’s memory to back all the
file data in the mount. This can cause sandbox memory usage to blow up and
exhaust the container’s memory limits, so it’s important to store all file data
from the tmpfs upper layer on disk. We need to have a tmpfs-backing “filestore” on
the host filesystem. Using the example from above, this filestore on the host
will store file data for &lt;code class=&quot;highlighter-rouge&quot;&gt;foo&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;bar&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This would essentially flatten all regular files in tmpfs into one host file.
The sandbox can &lt;a href=&quot;https://man7.org/linux/man-pages/man2/mmap.2.html&quot;&gt;mmap(2)&lt;/a&gt; the
filestore into its address space. This allows it to access and mutate the
filestore very efficiently, without incurring gofer RPC or syscall overhead.&lt;/p&gt;
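
&lt;p&gt;As a rough illustration of the idea, the sketch below maps a temporary host
file into memory and stores the data of two files in it. The one-page-per-file
layout is an assumption made for the example and is not how gVisor lays out its
filestore.&lt;/p&gt;

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE

# Hypothetical stand-in for the filestore: one host file holding file data.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, 2 * PAGE)  # reserve one page each for two files

with mmap.mmap(fd, 2 * PAGE) as store:
    # Writes are plain memory stores into the mapping, not write(2) calls.
    store[0:3] = b'foo'
    store[PAGE:PAGE + 3] = b'bar'
    store.flush()  # push dirty pages to the backing file
    print(store[0:3], store[PAGE:PAGE + 3])

# The same bytes are visible through the host file.
with open(path, 'rb') as f:
    data = f.read()
os.close(fd)
os.remove(path)
print(data[0:3], data[PAGE:PAGE + 3])
```

&lt;p&gt;Once the mapping exists, reads and writes touch memory only; the host kernel
writes dirty pages back to the file on its own schedule or on an explicit flush.&lt;/p&gt;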

&lt;h2 id=&quot;self-backed-overlay&quot;&gt;Self-Backed Overlay&lt;/h2&gt;

&lt;p&gt;In Kubernetes, you can set
&lt;a href=&quot;https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#setting-requests-and-limits-for-local-ephemeral-storage&quot;&gt;local ephemeral storage limits&lt;/a&gt;.
The upper layer of the rootfs overlay (writeable container layer) on the host
&lt;a href=&quot;https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#resource-emphemeralstorage-consumption&quot;&gt;contributes towards this limit&lt;/a&gt;.
The kubelet enforces this limit by
&lt;a href=&quot;https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L57-L58&quot;&gt;traversing&lt;/a&gt;
the entire
&lt;a href=&quot;https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/snapshots/overlay/overlay.go#L189-L190&quot;&gt;upper layer&lt;/a&gt;,
&lt;code class=&quot;highlighter-rouge&quot;&gt;stat(2)&lt;/code&gt;-ing all files and
&lt;a href=&quot;https://github.com/containerd/containerd/blob/bbcfbf2189f15c9e9e2ce0775c3caf2e8642274c/vendor/github.com/containerd/continuity/fs/du_unix.go#L69-L74&quot;&gt;summing up&lt;/a&gt;
their &lt;code class=&quot;highlighter-rouge&quot;&gt;stat.st_blocks*block_size&lt;/code&gt;. If we move the upper layer into the sandbox,
then the host upper layer is empty and the kubelet will not be able to enforce
these limits.&lt;/p&gt;
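
&lt;p&gt;The accounting described above amounts to a disk-usage walk. The sketch below
is an approximation written for this post, not the kubelet or containerd code;
&lt;code class=&quot;highlighter-rouge&quot;&gt;disk_usage&lt;/code&gt; is a hypothetical helper, and it assumes the Linux convention that
&lt;code class=&quot;highlighter-rouge&quot;&gt;st_blocks&lt;/code&gt; counts 512-byte units.&lt;/p&gt;

```python
import os
import tempfile

BLOCK_UNIT = 512  # stat(2) reports st_blocks in 512-byte units on Linux


def disk_usage(root):
    '''Return allocated bytes under root, counting each inode once.'''
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        entries = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for entry in entries:
            st = os.lstat(entry)  # lstat: do not follow symlinks
            inode = (st.st_dev, st.st_ino)
            if inode in seen:
                continue  # hard link to an inode that was already counted
            seen.add(inode)
            total += st.st_blocks * BLOCK_UNIT
    return total


# Example: measure a scratch directory standing in for the host upper layer.
with tempfile.TemporaryDirectory() as upper:
    with open(os.path.join(upper, 'bar'), 'wb') as f:
        f.write(b'x' * 4096)
    print(disk_usage(upper))
```

&lt;p&gt;A walk like this only sees what exists on the host, so an upper layer that
lives entirely inside the sandbox would be reported as nearly empty.&lt;/p&gt;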

&lt;p&gt;To address this issue, we
&lt;a href=&quot;https://github.com/google/gvisor/commit/a53b22ad5283b00b766178eff847c3193c1293b7&quot;&gt;introduced “self-backed” overlays&lt;/a&gt;,
which create the filestore in the host upper layer. This way, when the kubelet
scans the host upper layer, the filestore will be detected and its
&lt;code class=&quot;highlighter-rouge&quot;&gt;stat.st_blocks&lt;/code&gt; should be representative of the total file usage in the
sandbox-internal upper layer. It is also important to hide this filestore from
the containerized application to avoid confusing it. We do so by
&lt;a href=&quot;https://github.com/google/gvisor/commit/09459b203a532c24fbb76cc88484d533356b8b91&quot;&gt;creating a whiteout&lt;/a&gt;
in the sandbox-internal upper layer, which blocks this file from appearing in
the merged directory.&lt;/p&gt;

&lt;p&gt;The following diagram shows what the rootfs configuration looks like in
gVisor today.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-self.svg&quot; alt=&quot;Figure 4&quot; title=&quot;Self-backed rootfs overlay.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;performance-gains&quot;&gt;Performance Gains&lt;/h2&gt;

&lt;p&gt;Let’s look at some filesystem-intensive workloads to see how rootfs overlay
impacts performance. These benchmarks were run on a gLinux desktop with the
&lt;a href=&quot;https://gvisor.dev/docs/architecture_guide/platforms/#kvm&quot;&gt;KVM platform&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;micro-benchmark&quot;&gt;Micro Benchmark&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://linux-test-project.github.io/&quot;&gt;Linux Test Project&lt;/a&gt; provides a
&lt;a href=&quot;https://github.com/linux-test-project/ltp/tree/master/testcases/kernel/fs/fsstress&quot;&gt;fsstress binary&lt;/a&gt;.
This program performs a large number of filesystem operations concurrently,
creating and modifying a large filesystem tree of all sorts of files. We ran
this program on the container’s root filesystem. The exact usage was:&lt;/p&gt;

&lt;p&gt;    &lt;code class=&quot;highlighter-rouge&quot;&gt;sh -c &quot;mkdir /test &amp;amp;&amp;amp; time fsstress -d /test -n 500 -p
20 -s 1680153482 -X -l 10&quot;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You can use the -v flag (verbose mode) to see what filesystem operations are
being performed.&lt;/p&gt;

&lt;p&gt;The results were astounding! Rootfs overlay reduced the time to run this
fsstress program &lt;strong&gt;from 262.79 seconds to 3.18 seconds&lt;/strong&gt;! However, note that
such microbenchmarks are not representative of real-world applications and we
should not extrapolate these results to real-world performance.&lt;/p&gt;

&lt;h3 id=&quot;real-world-benchmark&quot;&gt;Real-world Benchmark&lt;/h3&gt;

&lt;p&gt;Build jobs are very filesystem-intensive workloads. They read a lot of source
files, compile them, and write out binaries and object files. Let’s consider building
the &lt;a href=&quot;https://github.com/abseil/abseil-cpp&quot;&gt;abseil-cpp project&lt;/a&gt; with
&lt;a href=&quot;https://bazel.build/&quot;&gt;bazel&lt;/a&gt;. Bazel performs a lot of filesystem operations in
rootfs, specifically in its cache located at &lt;code class=&quot;highlighter-rouge&quot;&gt;~/.cache/bazel/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is representative of the real world because many other applications also
use the container root filesystem as scratch space due to the handy property
that it disappears on container exit. To make this more realistic, the
abseil-cpp repo was attached to the container using a bind mount, which does not
have an overlay.&lt;/p&gt;

&lt;p&gt;When measuring performance, we care about reducing the sandboxing overhead and
bringing gVisor performance as close as possible to unsandboxed performance.
Sandboxing overhead can be calculated using the formula &lt;em&gt;overhead = (s-n)/n&lt;/em&gt;
where &lt;code class=&quot;highlighter-rouge&quot;&gt;s&lt;/code&gt; is the amount of time taken to run a workload inside a gVisor sandbox
and &lt;code class=&quot;highlighter-rouge&quot;&gt;n&lt;/code&gt; is the time taken to run the same workload natively (unsandboxed). The
following graph shows that rootfs overlay &lt;strong&gt;halved the sandboxing overhead&lt;/strong&gt; for
the abseil build!&lt;/p&gt;
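
&lt;p&gt;To make the formula concrete, here is a small worked example. The timings
are hypothetical values picked only to show the arithmetic; they are not the
measurements behind the graph.&lt;/p&gt;

```python
def overhead(s, n):
    '''Sandboxing overhead (s - n) / n for sandboxed time s and native time n.'''
    if n == 0:
        raise ValueError('native time must be nonzero')
    return (s - n) / n


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
native = 100.0
without_overlay = 160.0
with_overlay = 130.0

print(overhead(without_overlay, native))  # 0.6
print(overhead(with_overlay, native))     # 0.3
```

&lt;p&gt;With these made-up numbers the sandboxed run gets about 19% faster, yet the
overhead drops from 60% to 30%. Halving the overhead is therefore not the same
as halving the run time.&lt;/p&gt;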

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-05-08-rootfs-overlay-benchmark-result.svg&quot; alt=&quot;Figure 5&quot; title=&quot;Sandbox Overhead: rootfs overlay vs no overlay.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Rootfs overlay in gVisor substantially improves performance for many
filesystem-intensive workloads, so that developers no longer have to make large
tradeoffs between performance and security. We recently made this optimization
&lt;a href=&quot;https://github.com/google/gvisor/commit/38750cdedcce19a3039da10e515f5852565d2c7e&quot;&gt;the default&lt;/a&gt;
in runsc. This is part of our ongoing efforts to improve gVisor performance. You
can use gVisor in GKE with GKE Sandbox. Happy sandboxing!&lt;/p&gt;</content><author><name>ayushranjan</name></author><summary type="html">Root filesystem overlay is now the default in runsc. This improves performance for filesystem-heavy workloads by overlaying the container root filesystem with a tmpfs filesystem. Learn more about this feature in the following blog that was originally posted on Google Open Source Blog.</summary></entry><entry><title type="html">Releasing Systrap - A high-performance gVisor platform</title><link href="/blog/2023/04/28/systrap-release/" rel="alternate" type="text/html" title=" Releasing Systrap - A high-performance gVisor platform" /><published>2023-04-28T00:00:00-05:00</published><updated>2023-04-28T00:00:00-05:00</updated><id>/blog/2023/04/28/systrap-release</id><content type="html" xml:base="/blog/2023/04/28/systrap-release/">&lt;p&gt;We are releasing a new gVisor platform: Systrap. Like the existing ptrace
platform, Systrap runs on most Linux machines out of the box without
virtualization. Unlike the ptrace platform, it’s fast 🚀. Go try it by adding
&lt;code class=&quot;highlighter-rouge&quot;&gt;--platform=systrap&lt;/code&gt; to the runsc flags. If you want to know more about it, read
on.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;gVisor is a security boundary for arbitrary Linux processes. Boundaries do not
come for free, and gVisor imposes some performance overhead on sandboxed
applications. One of the most fundamental performance challenges with the
security model implemented by gVisor is system call interception, which is the
focus of this post.&lt;/p&gt;

&lt;p&gt;To recap the
&lt;a href=&quot;https://gvisor.dev/docs/architecture_guide/security/#what-can-a-sandbox-do&quot;&gt;security model&lt;/a&gt;:
gVisor is an application kernel that implements the Linux ABI. This includes
system calls, signals, memory management, and more. For example, when a
sandboxed application calls
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/read.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;read(2)&lt;/code&gt;&lt;/a&gt;, it actually
transparently calls into
&lt;a href=&quot;https://github.com/google/gvisor/blob/44e2d0fcfeb641f3b8013c3f93cacdae447cc0f1/pkg/sentry/syscalls/linux/sys_read_write.go#L36&quot;&gt;gVisor’s implementation of this system call&lt;/a&gt;.
This minimizes the attack surface of the host kernel, because sandboxed programs
simply can’t make system calls directly to the host in the first place&lt;sup id=&quot;fnref:1&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. This
interception happens through an internal layer called the Platform interface,
which we have written about in a previous
&lt;a href=&quot;https://gvisor.dev/blog/2020/10/22/platform-portability/&quot;&gt;blog post&lt;/a&gt;. To handle
these interceptions, this interface must also create new address spaces,
allocate memory, and create execution contexts to run the workload.&lt;/p&gt;

&lt;p&gt;gVisor had two platform implementations: KVM and ptrace. The KVM platform uses
the kernel’s KVM functionality to allow the Sentry to act as both guest OS and
VMM (Virtual machine monitor). It does system call interception just like a
normal virtual machine would. This gives good performance when using bare-metal
virtualization, but has a noticeable impact with nested virtualization. The
other obvious downside is that it requires support for nested virtualization in
the first place, which is not supported by all hardware (such as ARM CPUs) or
within some Cloud environments.&lt;/p&gt;

&lt;p&gt;The ptrace platform was the alternative wherever KVM was not available. It works
through the
&lt;a href=&quot;http://man7.org/linux/man-pages/man2/ptrace.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;PTRACE_SYSEMU&lt;/code&gt;&lt;/a&gt; action,
which makes the user process hand back execution to the sentry whenever it
encounters a system call. This is a clean method to achieve system call
interception in any environment, virtualized or not, except that it’s quite
slow. To see how slow, an unrealistic but highly illustrative benchmark to use
is the
&lt;a href=&quot;https://github.com/google/gvisor/blob/108410638aa8480e82933870ba8279133f543d2b/test/perf/linux/getpid_benchmark.cc&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;getpid&lt;/code&gt; benchmark&lt;/a&gt;&lt;sup id=&quot;fnref:2&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.
This benchmark runs the
&lt;a href=&quot;https://man7.org/linux/man-pages/man2/getpid.2.html&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;getpid(2)&lt;/code&gt;&lt;/a&gt; system call
in a tight &lt;code class=&quot;highlighter-rouge&quot;&gt;while&lt;/code&gt; loop. No useful application has this behavior, so it is not a
realistic benchmark, but it is well-suited to measure system call latency.&lt;/p&gt;
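
&lt;p&gt;To make the idea concrete, here is a minimal Go sketch of the same kind of measurement. It is not the linked C++ benchmark: the &lt;code class=&quot;highlighter-rouge&quot;&gt;timeGetpid&lt;/code&gt; helper and the iteration count are illustrative assumptions, and the sketch assumes that &lt;code class=&quot;highlighter-rouge&quot;&gt;syscall.Getpid&lt;/code&gt; issues a real system call on every invocation rather than returning a cached value.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// timeGetpid issues getpid(2) n times in a tight loop and returns the
// average wall-clock time per call. n must be positive.
func timeGetpid(n int) time.Duration {
	start := time.Now()
	for i := 0; i != n; i++ {
		syscall.Getpid() // result ignored; only the round trip matters
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	fmt.Println("average getpid latency:", timeGetpid(1000000))
}
```

&lt;p&gt;Because the loop does no other work, the number it reports is dominated by the cost of entering and leaving the kernel, or the sandbox that stands in for it.&lt;/p&gt;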

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-04-28-getpid-ptrace-vs-native.svg&quot; alt=&quot;Figure 1&quot; title=&quot;Getpid benchmark: ptrace vs. native Linux.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;All &lt;code class=&quot;highlighter-rouge&quot;&gt;getpid&lt;/code&gt; runs were performed on a GCE n2-standard-4 VM, with the
&lt;code class=&quot;highlighter-rouge&quot;&gt;debian-11-bullseye-v20230306&lt;/code&gt; image.&lt;/p&gt;

&lt;p&gt;While this benchmark is not representative of most real-world workloads, just about
any workload suffers when system calls carry high overhead.
Since running in a virtualized environment is the default state for
most cloud users these days, it’s important that gVisor performs well in this
context. Systrap is the new platform targeting this important use case.&lt;/p&gt;

&lt;p&gt;Systrap relies on multiple techniques to implement the Platform interface. Like
the ptrace platform, Systrap uses Linux’s ptrace subsystem to initialize
workload executor threads, which are started as child processes of the main
gVisor sentry process. Systrap additionally sets a very restrictive seccomp
filter, installs a custom signal handler, and allocates chunks of memory shared
between user threads and the runsc sentry. This shared memory serves as the
main form of communication between the sentry and sandboxed programs: whenever
the sandboxed process attempts to execute a system call, it triggers a &lt;code class=&quot;highlighter-rouge&quot;&gt;SIGSYS&lt;/code&gt;
signal which is handled by our signal handler. The signal handler in turn
populates shared memory regions, and asks the sentry to handle the requested
system call. This alone proved to be faster than using &lt;code class=&quot;highlighter-rouge&quot;&gt;PTRACE_SYSEMU&lt;/code&gt;, as
demonstrated by the &lt;code class=&quot;highlighter-rouge&quot;&gt;getpid&lt;/code&gt; benchmark:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-04-28-getpid-ptrace-vs-systrap-unoptimized.svg&quot; alt=&quot;Figure 2&quot; title=&quot;Getpid benchmark: ptrace vs. Systrap.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Can we make it even faster? Recall what the main purpose of our signal handler
is: to send a request to the sentry via shared memory. To do that, the sandboxed
process must first incur the overhead of executing the seccomp filter&lt;sup id=&quot;fnref:3&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, and
then generating a full signal stack before being able to run the signal handler.
What if there was a way to simply have the sandboxed process jump to another
user-space function when it wanted to perform a system call? Well, turns out,
there is&lt;sup id=&quot;fnref:4&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;! There is a popular x86 instruction pattern that’s used to perform
system calls, and it goes a little something like this: &lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;mov sysno, %eax;
syscall&lt;/code&gt;&lt;/strong&gt;. The size of the mov instruction is 5 bytes and the size of the
syscall instruction is 2 bytes. Luckily this is just enough space to fit in a
&lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;jmp *%gs:offset&lt;/code&gt;&lt;/strong&gt; instruction. When the signal handler sees this instruction
pattern, it signals to the sentry that the original instructions can be replaced
with a &lt;strong&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;jmp&lt;/code&gt;&lt;/strong&gt; to trampoline code that performs the same function as the
regular &lt;code class=&quot;highlighter-rouge&quot;&gt;SIGSYS&lt;/code&gt; signal handler. The system call number is not lost, but rather
encoded in the offset. The results are even more impressive:&lt;/p&gt;
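
&lt;p&gt;The pattern check described above can be sketched in Go as follows. This is only an illustration, not gVisor’s actual patching code: &lt;code class=&quot;highlighter-rouge&quot;&gt;matchSyscallSite&lt;/code&gt; and its constants are hypothetical names. It relies on the standard x86-64 encodings, where &lt;code class=&quot;highlighter-rouge&quot;&gt;mov $imm32, %eax&lt;/code&gt; is the opcode &lt;code class=&quot;highlighter-rouge&quot;&gt;0xb8&lt;/code&gt; followed by a 4-byte little-endian immediate and &lt;code class=&quot;highlighter-rouge&quot;&gt;syscall&lt;/code&gt; is the two bytes &lt;code class=&quot;highlighter-rouge&quot;&gt;0x0f 0x05&lt;/code&gt;.&lt;/p&gt;

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	movEAXImm32 = 0xb8 // "mov $imm32, %eax": opcode plus 4-byte immediate (5 bytes)
	syscallOp0  = 0x0f // "syscall" is the two bytes 0x0f 0x05
	syscallOp1  = 0x05
	patternLen  = 7 // 5-byte mov followed by 2-byte syscall
)

// matchSyscallSite reports whether code starts with "mov $sysno, %eax; syscall"
// and, if it does, returns the system call number held in the immediate.
func matchSyscallSite(code []byte) (uint32, bool) {
	if patternLen > len(code) {
		return 0, false
	}
	if code[0] != movEAXImm32 || code[5] != syscallOp0 || code[6] != syscallOp1 {
		return 0, false
	}
	return binary.LittleEndian.Uint32(code[1:5]), true
}

func main() {
	// mov $39, %eax; syscall (39 is getpid on x86-64 Linux)
	site := []byte{0xb8, 0x27, 0x00, 0x00, 0x00, 0x0f, 0x05}
	sysno, ok := matchSyscallSite(site)
	fmt.Println(sysno, ok) // prints "39 true"
}
```

&lt;p&gt;Only call sites that match this exact shape are candidates for replacement; anything else keeps going through the &lt;code class=&quot;highlighter-rouge&quot;&gt;SIGSYS&lt;/code&gt; path.&lt;/p&gt;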

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-04-28-getpid-ptrace-vs-systrap-opt.svg&quot; alt=&quot;Figure 3&quot; title=&quot;Getpid benchmark: ptrace vs. Optimized Systrap.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As mentioned, the &lt;code class=&quot;highlighter-rouge&quot;&gt;getpid&lt;/code&gt; benchmark is not representative of real-world
performance. To get a better picture of the magnitude of improvement, here are
some real-world workloads:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/blob/master/test/benchmarks/fs/bazel_test.go&quot;&gt;Build ABSL benchmark&lt;/a&gt;
measures compilation performance by compiling
&lt;a href=&quot;https://abseil.io/&quot;&gt;abseil.io&lt;/a&gt;; this is a highly system call dependent
workload because it performs a lot of filesystem I/O (gVisor’s
file system overhead also depends on the file system isolation it
implements, which you can learn about
&lt;a href=&quot;https://gvisor.dev/docs/user_guide/filesystem/&quot;&gt;here&lt;/a&gt;).&lt;/li&gt;
  &lt;li&gt;The
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/blob/master/test/benchmarks/media/ffmpeg_test.go&quot;&gt;ffmpeg benchmark&lt;/a&gt;
runs a multimedia processing tool, for example to perform video stream
encoding/decoding; this workload does not require a significant number of system
calls and there are very few userspace to kernel mode switches.&lt;/li&gt;
  &lt;li&gt;The
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/blob/master/test/benchmarks/ml/tensorflow_test.go&quot;&gt;Tensorflow benchmark&lt;/a&gt;
trains a variety of machine learning models on CPU; the system-call usage of
this workload is in between compilation and ffmpeg, due to needing to
retrieve training and validation data, but the majority of time is still
spent just running userspace computations.&lt;/li&gt;
  &lt;li&gt;Finally, the Redis benchmark performs SET RPC calls with 5 concurrent
clients, measures the latency that each call takes to execute, and reports
the median (scaled by 250,000 to fit the graph’s axis); this workload is
heavily bound by system call performance due to high network stack usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2023-04-28-systrap-sample-workloads.svg&quot; alt=&quot;Figure 4&quot; title=&quot;Comparison of sample workloads running on ptrace, Systrap, and native Linux.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Systrap will replace the ptrace platform by September 2023 and become the
default. Until then, we are working really hard to make it production-ready,
which includes working on additional performance and stability improvements, and
making sure we maintain a high bar for security through targeted fuzz-testing
for Systrap specifically.&lt;/p&gt;

&lt;p&gt;In the meantime, we would like gVisor users to try it out, and give us feedback!
If you run gVisor using ptrace today (either by specifying &lt;code class=&quot;highlighter-rouge&quot;&gt;--platform ptrace&lt;/code&gt;
or not specifying the &lt;code class=&quot;highlighter-rouge&quot;&gt;--platform&lt;/code&gt; flag at all), or you use the KVM platform with
nested virtualization, switching to Systrap should be a drop-in performance
upgrade. All you have to do is specify &lt;code class=&quot;highlighter-rouge&quot;&gt;--platform systrap&lt;/code&gt; to runsc. If you
encounter any issues, please let us know at
&lt;a href=&quot;https://github.com/google/gvisor/issues&quot;&gt;gvisor.dev/issues&lt;/a&gt;.
&lt;br /&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Even if the sandbox itself is compromised, it will still be bound by
several defense-in-depth layers, including a restricted set of seccomp
filters. You can find more details here:
&lt;a href=&quot;https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/&quot;&gt;https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Once the system call has been intercepted by gVisor (or in the case of
Linux, once the process has entered kernel-mode), actually executing the
getpid system call itself is very fast, so this benchmark effectively
measures single-thread syscall-interception overhead. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;Seccomp filters are known to have a “not insubstantial” overhead:
&lt;a href=&quot;https://lwn.net/Articles/656307/&quot;&gt;https://lwn.net/Articles/656307/&lt;/a&gt;. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot; role=&quot;doc-endnote&quot;&gt;
      &lt;p&gt;On the x86_64 architecture. ARM does not have this optimization as of the
time of writing. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;</content><author><name>bogomolov</name></author><summary type="html">We are releasing a new gVisor platform: Systrap. Like the existing ptrace platform, Systrap runs on most Linux machines out of the box without virtualization. Unlike the ptrace platform, it’s fast 🚀. Go try it by adding --platform=systrap to the runsc flags. If you want to know more about it, read on.</summary></entry><entry><title type="html">How we Eliminated 99% of gVisor Networking Memory Allocations with Enhanced Buffer Pooling</title><link href="/blog/2022/10/24/buffer-pooling/" rel="alternate" type="text/html" title=" How we Eliminated 99% of gVisor Networking Memory Allocations with Enhanced Buffer Pooling" /><published>2022-10-24T00:00:00-05:00</published><updated>2022-10-24T00:00:00-05:00</updated><id>/blog/2022/10/24/buffer-pooling</id><content type="html" xml:base="/blog/2022/10/24/buffer-pooling/">&lt;p&gt;In an
&lt;a href=&quot;https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/&quot;&gt;earlier blog post&lt;/a&gt;
about networking security, we described how and why gVisor implements its own
userspace network stack in the Sentry (gVisor kernel). In summary, we’ve
implemented our networking stack – aka Netstack – in Go to minimize exposure to
unsafe code and avoid using an unsafe Foreign Function Interface. With Netstack,
gVisor can do all packet processing internally and only has to enable a few host
I/O syscalls for near-complete networking capabilities. This keeps gVisor’s
exposure to host vulnerabilities as narrow as possible.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;Although writing Netstack in Go was important for runtime safety, up until now
it had an undeniable performance cost. iperf benchmarks showed Netstack was
spending between 20% and 30% of its processing time allocating memory and pausing for
garbage collection, a slowdown that limited gVisor’s ability to efficiently
sandbox networking workloads. In this blog post we will show how we crafted a cure
for Netstack’s allocation addiction, reducing allocations by 99% while also increasing
gVisor networking throughput by more than 30%.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2022-10-24-buffer-pooling-figure1.png&quot; alt=&quot;Figure 1&quot; title=&quot;Buffer pooling results.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;a-waste-management-problem&quot;&gt;A Waste Management Problem&lt;/h2&gt;

&lt;p&gt;Go guarantees a basic level of memory safety through the use of a garbage
collector (GC), which is described in great detail by the Go team
&lt;a href=&quot;https://tip.golang.org/doc/gc-guide&quot;&gt;here&lt;/a&gt;. The Go runtime automatically tracks
and frees objects allocated from the heap, relieving the programmer of the often
painful and error-prone process of manual memory management. Unfortunately,
tracking and freeing memory during runtime comes at a performance cost. Running
the GC adds scheduling overhead, consumes valuable CPU time, and occasionally
pauses the entire program’s progress to track down garbage.&lt;/p&gt;

&lt;p&gt;Go’s GC is highly optimized, tunable, and sufficient for a majority of
workloads. Most of the other parts of gVisor happily use Go’s GC with no
complaints. However, under high network stress, Netstack needed to aggressively
allocate buffers used for processing TCP/IP data and metadata. These buffers
often had short lifespans, and once the processing was done they were left to be
cleaned up by the GC. This meant Netstack was producing tons of garbage that
needed to be tracked and freed by GC workers.&lt;/p&gt;

&lt;h2 id=&quot;recycling-to-the-rescue&quot;&gt;Recycling to the Rescue&lt;/h2&gt;

&lt;p&gt;Luckily, we weren’t the only ones with this problem. This pattern of small,
frequently allocated and discarded objects was common enough that the Go team
introduced &lt;a href=&quot;https://pkg.go.dev/sync#Pool&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt;&lt;/a&gt; in Go 1.3. &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; is
designed to relieve pressure on the Go GC by maintaining a thread-safe cache of
previously allocated objects. &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; can retrieve an object from the cache
if it exists or allocate a new one according to a user-specified allocation
function. Once the user is finished with an object they can safely return it to
the cache to be reused again.&lt;/p&gt;
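
&lt;p&gt;For readers who have not used it, the basic get/put cycle looks roughly like this. The sketch is generic &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; usage, not Netstack code; &lt;code class=&quot;highlighter-rouge&quot;&gt;bufPool&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;render&lt;/code&gt; are names made up for the example.&lt;/p&gt;

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool caches *bytes.Buffer values. New runs only when the cache has
// nothing to hand out.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// render borrows a buffer from the pool, uses it, and gives it back.
func render(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf) // return the buffer for reuse once we are done
	buf.Reset()            // a recycled buffer may still hold old data
	fmt.Fprintf(buf, "hello, %s", name)
	return buf.String() // String copies the bytes, so the result outlives the buffer
}

func main() {
	fmt.Println(render("gVisor"))
}
```

&lt;p&gt;The caller is responsible for knowing when an object is no longer in use, which is exactly the information a garbage-collected design never had to track.&lt;/p&gt;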

&lt;p&gt;While &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; was exactly what we needed to reduce allocations,
incorporating it into Netstack wasn’t going to be as easy as just replacing all
our &lt;code class=&quot;highlighter-rouge&quot;&gt;make()&lt;/code&gt;s with &lt;code class=&quot;highlighter-rouge&quot;&gt;pool.Get()&lt;/code&gt;s.&lt;/p&gt;

&lt;h2 id=&quot;netstack-challenges&quot;&gt;Netstack Challenges&lt;/h2&gt;

&lt;p&gt;Netstack uses a few different types of buffers under the hood. Some of these are
specific to protocols, like
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/tcpip/transport/tcp/segment.go&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;segment&lt;/code&gt;&lt;/a&gt;
for TCP, and others are more widely shared, like
&lt;a href=&quot;https://github.com/google/gvisor/blob/master/pkg/tcpip/stack/packet_buffer.go&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;PacketBuffer&lt;/code&gt;&lt;/a&gt;,
which is used for IP, ICMP, UDP, etc. Although each of these buffer types is
slightly different, they generally share a few common traits that made it
difficult to use &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; out of the box:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The buffers were originally built with the assumption that a garbage
collector would clean them up automatically – there was little (if any)
effort put into tracking object lifetimes. This meant that we had no way to
know when it was safe to return buffers to a pool.&lt;/li&gt;
  &lt;li&gt;Buffers have dynamic sizes that are determined during creation, usually
depending on the size of the packet holding them. A &lt;code class=&quot;highlighter-rouge&quot;&gt;sync.Pool&lt;/code&gt; out of the
box can only accommodate buffers of a single size. One common solution to
this is to fill a pool with
&lt;a href=&quot;https://pkg.go.dev/bytes#Buffer&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;bytes.Buffer&lt;/code&gt;&lt;/a&gt;, but even a pooled
&lt;code class=&quot;highlighter-rouge&quot;&gt;bytes.Buffer&lt;/code&gt; could incur allocations if it were too small and had to be
grown to the requested size.&lt;/li&gt;
  &lt;li&gt;Netstack splits, merges, and clones buffers at various points during
processing (for example, breaking a large segment into smaller MTU-sized
packets). Modifying a buffer’s size during runtime could mean lots of
reallocating from the pool in a one-size-fits-all setup. This would limit
the theoretical effectiveness of a pooled solution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed an efficient, low-level buffer abstraction that had answers for the
Netstack-specific challenges and could be shared by the various intermediate
buffer types. By sharing a common buffer abstraction, we could maximize the
benefits of pooling and avoid introducing additional allocations while minimally
changing any intermediate buffer processing logic.&lt;/p&gt;

&lt;h2 id=&quot;introducing-bufferv2&quot;&gt;Introducing bufferv2&lt;/h2&gt;

&lt;p&gt;Our solution was
&lt;a href=&quot;https://github.com/google/gvisor/tree/1ceb81454444981448ad57612139adfc0def1b85/pkg/bufferv2&quot;&gt;bufferv2&lt;/a&gt;.
Bufferv2 is a non-contiguous, reference counted, pooled, copy-on-write,
buffer-like data structure.&lt;/p&gt;

&lt;p&gt;Internally, a bufferv2 &lt;code class=&quot;highlighter-rouge&quot;&gt;Buffer&lt;/code&gt; is a linked list of &lt;code class=&quot;highlighter-rouge&quot;&gt;View&lt;/code&gt;s. Each &lt;code class=&quot;highlighter-rouge&quot;&gt;View&lt;/code&gt; has
start/end indices and holds a pointer to a &lt;code class=&quot;highlighter-rouge&quot;&gt;Chunk&lt;/code&gt;. A &lt;code class=&quot;highlighter-rouge&quot;&gt;Chunk&lt;/code&gt; is a
reference-counted structure that’s allocated from a pool and holds data in a
byte slice. There are several &lt;code class=&quot;highlighter-rouge&quot;&gt;Chunk&lt;/code&gt; pools, each of which allocates chunks with
differently sized byte slices. These sizes start at 64 bytes and double until they reach 64k.&lt;/p&gt;
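
&lt;p&gt;A tiered set of pools along these lines can be sketched as follows. This is a simplified illustration rather than the bufferv2 source: &lt;code class=&quot;highlighter-rouge&quot;&gt;classFor&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;getChunk&lt;/code&gt;, and &lt;code class=&quot;highlighter-rouge&quot;&gt;putChunk&lt;/code&gt; are hypothetical names, and the handling of requests larger than the biggest size class (a plain, unpooled allocation) is an assumption made for the example.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

const (
	minChunkSize = 64        // smallest size class, in bytes
	maxChunkSize = 64 * 1024 // largest size class, in bytes
)

// chunk is a stand-in for pooled storage.
type chunk struct{ data []byte }

// chunkPools[i] hands out chunks whose data is minChunkSize doubled i times.
var chunkPools = newChunkPools()

func newChunkPools() []*sync.Pool {
	var pools []*sync.Pool
	for size := minChunkSize; maxChunkSize >= size; size *= 2 {
		size := size // give each closure its own copy
		p := new(sync.Pool)
		p.New = func() any {
			c := new(chunk)
			c.data = make([]byte, size)
			return c
		}
		pools = append(pools, p)
	}
	return pools
}

// classFor returns the index of the smallest size class that holds n bytes,
// or -1 when n is larger than the largest class.
func classFor(n int) int {
	idx := 0
	for size := minChunkSize; maxChunkSize >= size; size *= 2 {
		if size >= n {
			return idx
		}
		idx++
	}
	return -1
}

// getChunk returns a chunk with room for at least n bytes.
func getChunk(n int) *chunk {
	idx := classFor(n)
	if idx == -1 {
		c := new(chunk) // assumption: oversized requests bypass the pools
		c.data = make([]byte, n)
		return c
	}
	return chunkPools[idx].Get().(*chunk)
}

// putChunk returns a pooled chunk to the pool matching its size.
func putChunk(c *chunk) {
	if idx := classFor(len(c.data)); idx != -1 {
		chunkPools[idx].Put(c)
	}
}

func main() {
	c := getChunk(1500) // roughly one Ethernet-sized payload
	fmt.Println(classFor(1500), len(c.data)) // prints "5 2048"
	putChunk(c)
}
```

&lt;p&gt;With sizes doubling from 64 bytes to 64k there are eleven classes, so a request never wastes more than about half of the chunk it receives.&lt;/p&gt;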

&lt;p&gt;&lt;img src=&quot;/assets/images/2022-10-24-buffer-pooling-figure2.png&quot; alt=&quot;Figure 2&quot; title=&quot;bufferv2 implementation diagram.&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The design of bufferv2 has a few key advantages over simpler object pooling:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Zero-cost copies and copy-on-write&lt;/strong&gt;: Cloning a Buffer only increments the
reference count of the underlying chunks instead of reallocating from the
pool. Since buffers are much more frequently read than modified, this saves
allocations. In the cases where a buffer is modified, only the chunk that’s
changed has to be cloned, not the whole buffer.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Fast buffer transformations&lt;/strong&gt;: Truncating and merging buffers or appending
and prepending Views to Buffers are fast operations. Thanks to the
non-contiguous memory structure these operations are usually as quick as
adding a node to a linked list or changing the indices in a View.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tiered pools&lt;/strong&gt;: When growing a Buffer or appending data, the new chunks
come from different pools of previously allocated chunks. Using multiple
pools means we are flexible enough to efficiently accommodate packets of all
sizes with minimal overhead. Unlike a one-size-fits-all solution, we don’t
have to waste lots of space with a chunk size that is too big or loop
forever allocating small chunks.&lt;/li&gt;
&lt;/ul&gt;
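
&lt;p&gt;The reference-counting and copy-on-write behaviour from the first bullet can be illustrated with a deliberately small model. It is not the bufferv2 implementation: the lowercase &lt;code class=&quot;highlighter-rouge&quot;&gt;chunk&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;view&lt;/code&gt; types are hypothetical, there is no pooling, and the plain integer counter is only safe on a single goroutine.&lt;/p&gt;

```go
package main

import "fmt"

// chunk is reference-counted storage that several views may share.
type chunk struct {
	refs int
	data []byte
}

// view is a handle onto a chunk.
type view struct{ c *chunk }

// newView creates a view backed by a private copy of data.
func newView(data []byte) *view {
	c := new(chunk)
	c.refs = 1
	c.data = append([]byte(nil), data...)
	v := new(view)
	v.c = c
	return v
}

// clone shares the underlying chunk: no bytes are copied, only the
// reference count changes.
func (v *view) clone() *view {
	v.c.refs++
	nv := new(view)
	nv.c = v.c
	return nv
}

// write sets one byte, copying the chunk first if another view still
// references it (copy-on-write).
func (v *view) write(i int, b byte) {
	if v.c.refs != 1 {
		v.c.refs-- // drop our reference to the shared chunk
		c := new(chunk)
		c.refs = 1
		c.data = append([]byte(nil), v.c.data...)
		v.c = c
	}
	v.c.data[i] = b
}

func main() {
	a := newView([]byte("abc"))
	b := a.clone() // cheap: a and b share one chunk
	b.write(0, 'x') // b gets its own copy; a is untouched
	fmt.Println(string(a.c.data), string(b.c.data)) // prints "abc xbc"
}
```

&lt;p&gt;Reads after a clone cost nothing extra, and the copy is deferred until the first write, which matches the observation that buffers are read far more often than they are modified.&lt;/p&gt;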

&lt;h2 id=&quot;trade-offs&quot;&gt;Trade-offs&lt;/h2&gt;

&lt;p&gt;Shifting Netstack to bufferv2 came with some costs. To start, rewriting all
buffers to use bufferv2 was a sizable effort that took many months to fully roll
out. Any place in Netstack that allocated or used a byte slice needed to be
rewritten. Reference counting had to be introduced so all the aforementioned
intermediate buffer types (&lt;code class=&quot;highlighter-rouge&quot;&gt;PacketBuffer&lt;/code&gt;, &lt;code class=&quot;highlighter-rouge&quot;&gt;segment&lt;/code&gt;, etc.) could accurately
track buffer lifetimes, and tests had to be modified to ensure reference
counting correctness.&lt;/p&gt;

&lt;p&gt;In addition to the upfront cost, the shift to bufferv2 also increased the
engineering complexity of future Netstack changes. Netstack contributors must
adhere to new rules to maintain memory safety and maximize the benefits of
pooling. These rules are strict – there needs to be strong justification to
break them. They are as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Never allocate a byte slice; always use &lt;code class=&quot;highlighter-rouge&quot;&gt;NewView()&lt;/code&gt; instead.&lt;/li&gt;
  &lt;li&gt;Use a &lt;code class=&quot;highlighter-rouge&quot;&gt;View&lt;/code&gt; for simple data operations (e.g., writing some data of a fixed
size) and a &lt;code class=&quot;highlighter-rouge&quot;&gt;Buffer&lt;/code&gt; for more complex I/O operations (e.g., appending data of
variable size, merging data, writing from an &lt;code class=&quot;highlighter-rouge&quot;&gt;io.Reader&lt;/code&gt;).&lt;/li&gt;
  &lt;li&gt;If you need access to the contents of a &lt;code class=&quot;highlighter-rouge&quot;&gt;View&lt;/code&gt; as a byte slice, use
&lt;code class=&quot;highlighter-rouge&quot;&gt;View.AsSlice()&lt;/code&gt;. If you need access to the contents of a &lt;code class=&quot;highlighter-rouge&quot;&gt;Buffer&lt;/code&gt; as a byte
slice, consider refactoring, as this will cause an allocation.&lt;/li&gt;
  &lt;li&gt;Never write to or modify the slices returned by &lt;code class=&quot;highlighter-rouge&quot;&gt;View.AsSlice()&lt;/code&gt;; they are
still owned by the view.&lt;/li&gt;
  &lt;li&gt;Release bufferv2 objects as close to where they’re created as possible. This
is usually most easily done with &lt;code class=&quot;highlighter-rouge&quot;&gt;defer&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Document function ownership of bufferv2 object parameters. If there is no
documentation, it is assumed that the function does not take ownership of
its parameters.&lt;/li&gt;
  &lt;li&gt;If a function takes ownership of its bufferv2 parameters, the bufferv2
objects must be cloned before passing them as arguments.&lt;/li&gt;
  &lt;li&gt;All new Netstack tests must enable the leak checker and run a final leak
check after the test is complete.&lt;/li&gt;
&lt;/ul&gt;
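
&lt;p&gt;The ownership rules above are easier to see in a toy example. Everything here is hypothetical and stands in for the real types: the lowercase &lt;code class=&quot;highlighter-rouge&quot;&gt;view&lt;/code&gt; is not bufferv2’s &lt;code class=&quot;highlighter-rouge&quot;&gt;View&lt;/code&gt;, its &lt;code class=&quot;highlighter-rouge&quot;&gt;clone&lt;/code&gt; copies bytes instead of sharing a chunk, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;live&lt;/code&gt; counter is a crude substitute for the leak checker.&lt;/p&gt;

```go
package main

import "fmt"

// live counts views that were created but not yet released. It stands in
// for a leak checker and is not safe for concurrent use.
var live int

type view struct {
	data     []byte
	released bool
}

func newView(n int) *view {
	live++
	v := new(view)
	v.data = make([]byte, n)
	return v
}

// clone returns an independent view that the caller owns.
func (v *view) clone() *view {
	nv := newView(len(v.data))
	copy(nv.data, v.data)
	return nv
}

// release gives the view back; releasing twice is a no-op in this sketch.
func (v *view) release() {
	if !v.released {
		v.released = true
		live--
	}
}

// checksum takes ownership of v and must therefore release it.
func checksum(v *view) int {
	defer v.release()
	sum := 0
	for _, b := range v.data {
		sum += int(b)
	}
	return sum
}

// process keeps ownership of its own view and hands a clone to checksum.
func process() int {
	v := newView(4)
	defer v.release() // released close to where it was created
	copy(v.data, []byte{1, 2, 3, 4})
	return checksum(v.clone())
}

func main() {
	fmt.Println(process(), live) // prints "10 0"; a non-zero count means a leak
}
```

&lt;p&gt;A test written in this style would finish by asserting that the counter is back to zero, which is the role the final leak check plays in Netstack tests.&lt;/p&gt;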

&lt;h2 id=&quot;give-it-a-try&quot;&gt;Give it a Try&lt;/h2&gt;

&lt;p&gt;Bufferv2 is enabled by default as of
&lt;a href=&quot;https://github.com/google/gvisor/releases/tag/release-20221017.0&quot;&gt;gVisor 20221017&lt;/a&gt;,
and will be rolling out to
&lt;a href=&quot;https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods&quot;&gt;GKE Sandbox&lt;/a&gt;
soon, so no action is required to see a performance boost. Network-bound
workloads, such as web servers or databases like Redis, are the most likely to
see benefits. All the code implementing bufferv2 is public
&lt;a href=&quot;https://github.com/google/gvisor/tree/master/pkg/bufferv2&quot;&gt;here&lt;/a&gt;, and
contributions are welcome! If you’d like to run the iperf benchmark for
yourself, you can run:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;make run-benchmark BENCHMARKS_TARGETS=//test/benchmarks/network:iperf_test \
  RUNTIME=your-runtime-here BENCHMARKS_OPTIONS=-test.benchtime=60s
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;in the base gVisor directory. If you experience any issues, please feel free to
let us know at &lt;a href=&quot;https://github.com/google/gvisor/issues&quot;&gt;gvisor.dev/issues&lt;/a&gt;.&lt;/p&gt;</content><author><name>lucasmanning</name></author><summary type="html">In an earlier blog post about networking security, we described how and why gVisor implements its own userspace network stack in the Sentry (gVisor kernel). In summary, we’ve implemented our networking stack – aka Netstack – in Go to minimize exposure to unsafe code and avoid using an unsafe Foreign Function Interface. With Netstack, gVisor can do all packet processing internally and only has to enable a few host I/O syscalls for near-complete networking capabilities. This keeps gVisor’s exposure to host vulnerabilities as narrow as possible.</summary></entry><entry><title type="html">Threat Detection in gVisor</title><link href="/blog/2022/08/01/threat-detection/" rel="alternate" type="text/html" title=" Threat Detection in gVisor" /><published>2022-08-31T00:00:00-05:00</published><updated>2022-08-31T00:00:00-05:00</updated><id>/blog/2022/08/01/threat-detection</id><content type="html" xml:base="/blog/2022/08/01/threat-detection/">&lt;p&gt;gVisor helps users secure their infrastructure by running containers in a
dedicated kernel that is isolated from the host. But wouldn’t it be nice if you
could tell when someone attempts to break out? Or get an early warning that your
web server might have been compromised? Now you can do it with gVisor! We are
pleased to announce support for &lt;strong&gt;runtime monitoring&lt;/strong&gt;. Runtime monitoring
provides the ability for an external process to observe application behavior and
detect threats at runtime. Using this mechanism, gVisor users can watch actions
performed by the container and generate alerts when something unexpected occurs.&lt;/p&gt;

&lt;!--/excerpt--&gt;

&lt;p&gt;A monitoring process can connect to the gVisor sandbox and receive a stream of
actions that the application is performing. The monitoring process decides what
actions are allowed and what steps to take based on policies for the given
application. gVisor communicates with the monitoring process via a simple
protocol based on
&lt;a href=&quot;https://developers.google.com/protocol-buffers&quot;&gt;Protocol Buffers&lt;/a&gt;, which is the
basis for &lt;a href=&quot;https://grpc.io/&quot;&gt;gRPC&lt;/a&gt; and is well supported in several languages.
The monitoring process runs isolated from the application inside the sandbox for
security reasons, and can be shared among all sandboxes running on the same
machine to save resources. Trace points can be individually configured when
creating a tracing session to capture only what’s needed.&lt;/p&gt;

&lt;p&gt;Let’s go over a simple example of a web server that gets compromised while being
monitored. The web server can execute files from &lt;code class=&quot;highlighter-rouge&quot;&gt;/bin&lt;/code&gt;, read files from &lt;code class=&quot;highlighter-rouge&quot;&gt;/etc&lt;/code&gt;
and &lt;code class=&quot;highlighter-rouge&quot;&gt;/html&lt;/code&gt; directories, create files under &lt;code class=&quot;highlighter-rouge&quot;&gt;/tmp&lt;/code&gt;, etc. All these actions are
reported to a monitoring process which analyzes them and deems them normal
application behavior. Now suppose that an attacker takes control of the web
server and starts executing code inside the container. The attacker writes a
script under &lt;code class=&quot;highlighter-rouge&quot;&gt;/tmp&lt;/code&gt; and, in an attempt to make it executable, runs &lt;code class=&quot;highlighter-rouge&quot;&gt;chmod u+x
/tmp/exploit.sh&lt;/code&gt;. The monitoring process determines that making a file
executable is not expected in the normal web server execution and raises an
alert to the security team for investigation. It can also decide
to kill the container and stop the attacker from making more progress.&lt;/p&gt;
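
&lt;p&gt;The policy decision in this scenario might look something like the following sketch. The &lt;code class=&quot;highlighter-rouge&quot;&gt;event&lt;/code&gt; type, the action names, and &lt;code class=&quot;highlighter-rouge&quot;&gt;isExpected&lt;/code&gt; are all hypothetical simplifications; a real monitoring process would decode Protocol Buffers messages received from the sandbox rather than work with hand-built structs.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
)

// event is a simplified stand-in for one action reported by the sandbox.
type event struct {
	action string // for example "exec", "open", "create", "chmod"
	path   string
}

// allowedPrefixes lists, per action, the directories the web server is
// expected to touch. Actions missing from the map are never expected.
var allowedPrefixes = map[string][]string{
	"exec":   {"/bin/"},
	"open":   {"/etc/", "/html/"},
	"create": {"/tmp/"},
}

// isExpected reports whether the event matches normal web server behavior.
func isExpected(e event) bool {
	for _, prefix := range allowedPrefixes[e.action] {
		if strings.HasPrefix(e.path, prefix) {
			return true
		}
	}
	return false
}

func main() {
	events := []event{
		{"exec", "/bin/ls"},
		{"open", "/etc/passwd"},
		{"create", "/tmp/exploit.sh"},
		{"chmod", "/tmp/exploit.sh"},
	}
	for _, e := range events {
		if !isExpected(e) {
			fmt.Println("ALERT:", e.action, e.path) // only the chmod is flagged
		}
	}
}
```

&lt;p&gt;Creating the script under &lt;code class=&quot;highlighter-rouge&quot;&gt;/tmp&lt;/code&gt; passes the policy, but the attempt to make it executable does not, which is the point at which the alert fires.&lt;/p&gt;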

&lt;h2 id=&quot;falco&quot;&gt;Falco&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://falco.org/&quot;&gt;Falco&lt;/a&gt; is an Open Source Cloud Native Security monitor that
detects threats at runtime by observing the behavior of your applications and
containers. Falco
&lt;a href=&quot;https://falco.org/blog/falco-0-32-1/&quot;&gt;supports monitoring applications running inside gVisor&lt;/a&gt;.
All the Falco rules and tooling work seamlessly with gVisor. You can use
&lt;a href=&quot;https://gvisor.dev/docs/tutorials/falco/&quot;&gt;this tutorial&lt;/a&gt; to learn how to
configure Falco and gVisor together. More information can be found on the
&lt;a href=&quot;https://falco.org/blog/intro-gvisor-falco/&quot;&gt;Falco blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;whats-next&quot;&gt;What’s next?&lt;/h2&gt;

&lt;p&gt;We’re looking for more projects to take advantage of the runtime monitoring
system and the visibility that it provides into the sandbox. There are a few
unique capabilities provided by the system that make it easy to monitor
applications inside gVisor, like resolving file descriptors to full paths,
providing container ID with traces, separating processes that were exec’ed into
the container, internal procfs state access, and many more.&lt;/p&gt;

&lt;p&gt;If you would like to explore it further, there is a
&lt;a href=&quot;https://docs.google.com/document/d/1RQQKzeFpO-zOoBHZLA-tr5Ed_bvAOLDqgGgKhqUff2A&quot;&gt;design document&lt;/a&gt;
and
&lt;a href=&quot;https://github.com/google/gvisor/tree/master/pkg/sentry/seccheck/README.md&quot;&gt;documentation&lt;/a&gt;
with more details about the configuration and communication protocol. In
addition, the &lt;a href=&quot;https://gvisor.dev/docs/tutorials/falco/&quot;&gt;tutorial using Falco&lt;/a&gt;
is a great way to see it in action.&lt;/p&gt;

&lt;p&gt;We would like to thank &lt;a href=&quot;https://github.com/LucaGuerra&quot;&gt;Luca Guerra&lt;/a&gt;,
&lt;a href=&quot;https://github.com/loresuso&quot;&gt;Lorenzo Susini&lt;/a&gt;, and the Falco team for their
support while building this feature.&lt;/p&gt;</content><author><name>fvoznika</name></author><summary type="html">gVisor helps users secure their infrastructure by running containers in a dedicated kernel that is isolated from the host. But wouldn’t it be nice if you could tell when someone attempts to break out? Or get an early warning that your web server might have been compromised? Now you can do it with gVisor! We are pleased to announce support for runtime monitoring. Runtime monitoring provides the ability for an external process to observe application behavior and detect threats at runtime. Using this mechanism, gVisor users can watch actions performed by the container and generate alerts when something unexpected occurs.</summary></entry></feed>