Scaling Agentic-RL Sandboxes to the Millions with gVisor at Tencent

2026-04-23T00:00:00-05:00

This article was contributed by Tencent. Yifeng Tan, Hua Liu, and Hui Chen are engineers at Tencent, responsible for the internal container infrastructure.

As LLMs evolve from chat interfaces to autonomous agents, building a robust and secure isolation environment becomes a necessity. We chose gVisor as the default sandbox for our Agentic-RL scenarios. Today, we run millions of gVisor sandboxes daily for Agentic-RL training in production, and that scale continues to grow. After more than 74,000 side-by-side comparisons between runsc (gVisor) and runc (unsandboxed/Linux), combined with targeted fixes driven by real-world workloads, we have essentially closed the execution correctness gap with runc, fully meeting our production-grade business requirements. During this process, we successfully investigated and resolved gVisor compatibility issues that accounted for approximately 1.7% of all test cases.

This post focuses on CPU-centric code execution and testing workloads. We will discuss gVisor compatibility verification and highlight representative issues, skipping implementation details like GPU support, image distribution, or cluster scheduling. We aim to answer three questions:

Why choose gVisor?
Why doesn’t manual compatibility verification scale?
How can AI agents analyze compatibility issues, what do typical failures look like, and what best practices have we established?

Background: Why Agentic-RL Needs gVisor

Over the past two years, benchmarks like SWE-bench have turned “Agents fixing bugs in real code repositories” from a research concept into an engineering reality. The agent behavioral model has evolved from static code generation to dynamic environmental interaction, spanning the entire lifecycle of dependency resolution, execution, test feedback, and iterative debugging. We don’t just need “an environment that runs Docker,” but rather a sandbox that strictly constrains the kernel attack surface while remaining lightweight and easy to deploy at scale. gVisor is a great fit for this scenario:

It implements an application-level kernel in user space, intercepting and re-implementing system calls, significantly reducing the attack surface where containers directly interact with the host kernel. Its isolation has been well-recognized by the industry.
It integrates naturally with existing Docker/Kubernetes infrastructure, avoiding the need for an entirely new guest kernel operation and maintenance system.
Compared to microVM solutions, which need hardware virtualization and therefore typically run on bare-metal hosts or require nested virtualization, gVisor can run inside regular VMs. That makes it cheaper and more flexible, with lower startup and resource costs, and far better suited for large-scale deployments of sandbox containers.
It is also more friendly to GPU scenarios, facilitating integration with existing heterogeneous computing environments.

However, re-implementing the Linux ABI means its compatibility must be rigorously validated. In an Agentic-RL scenario where “any project can run and any environment can appear,” compatibility can’t rely on intuition. It requires large-scale verification against real workloads.

Challenge: Verifying Tens of Thousands of Cases Cannot Rely Entirely on Manual Effort

Compatibility issues are rarely simple. Analyzing a typical SWE-related failure usually requires answering several questions at once:

Is this failure unique to runsc (gVisor), or does it also fail under runc?
If it only fails under gVisor, is it a semantic inconsistency in the Linux ABI, missing procfs / sysfs, file system behavioral differences, or a TOCTOU (Time-of-Check to Time-of-Use) race condition amplified by system call overhead?
What is the actual behavior of the Linux kernel? At which layer did gVisor deviate?
Should this issue be addressed by patching gVisor, modifying the test case, adjusting configurations, or simply avoiding a certain way of running?

Engineers can handle a handful of cases manually. But across these datasets, we are dealing with hundreds of thousands of real-world project instances, over a dozen programming languages, and numerous build systems (Gradle, Maven, CMake, Cargo, pip, npm, sbt, SwiftPM). Manual triage simply doesn’t scale.

To solve this, we brought AI coding agents into the verification pipeline to act as compatibility analysts. The process breaks down into four layers:

Baseline Comparison Layer: Run the same set of test cases in parallel under runc and runsc, collecting complete execution logs and exit statuses.
Difference Filtering Layer: Filter out environmental noise and non-deterministic outputs unrelated to the runtime, preserving samples that only fail under gVisor.
AI Diagnostic Layer: LLMs output structured root cause analysis reports by combining logs and relevant source code.
Decision Routing Layer: Route the reports into gVisor bugs, user-space race conditions, environmental differences, or test case issues, providing suggestions for fixes or workarounds.

This creates a neat closed loop: AI analyzing its own runtime environment.

graph TD
    A[Baseline Comparison Layer] -->|Run under runc/runsc in parallel<br>Collect logs & exit status| B(Difference Filtering Layer)
    B -->|Filter environmental noise<br>Keep gVisor-specific failures| C{AI Diagnostic Layer}
    C -->|Combine logs & source code<br>Output structured root cause report| D[Decision Routing Layer]

    D -->|gVisor bug| E[Submit community fix]
    D -->|User-space race condition| F[Workaround strategy]
    D -->|Environmental difference| G[Adjust environment]
    D -->|Test case issue| H[Fix test case]

    subgraph AI-Driven Compatibility Verification Framework
    A
    B
    C
    D
    end
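
As a rough illustration of the Difference Filtering Layer, the sketch below classifies each case from its two exit statuses and keeps only the cases that fail under runsc alone. It is a simplification under stated assumptions: the function names and the `case_id runc_exit runsc_exit` input format are hypothetical, and the real layer also filters environmental noise and non-deterministic output, which exit codes cannot capture.

```shell
# Hypothetical helper: classify one case from its exit status under
# runc ($1) and under runsc ($2).
classify_case() {
  runc_exit=$1
  runsc_exit=$2
  if [ "$runc_exit" -eq 0 ] && [ "$runsc_exit" -eq 0 ]; then
    echo "pass-both"     # no difference to analyze
  elif [ "$runc_exit" -ne 0 ] && [ "$runsc_exit" -ne 0 ]; then
    echo "fail-both"     # background failure, not runtime-specific
  elif [ "$runc_exit" -eq 0 ]; then
    echo "gvisor-only"   # candidate for the AI Diagnostic Layer
  else
    echo "runc-only"     # fails natively but passes in the sandbox
  fi
}

# Read "case_id runc_exit runsc_exit" lines from stdin and print the
# IDs of the cases that only fail under runsc.
filter_gvisor_only() {
  while read -r case_id runc_exit runsc_exit; do
    if [ "$(classify_case "$runc_exit" "$runsc_exit")" = "gvisor-only" ]; then
      echo "$case_id"
    fi
  done
}

# Demo on made-up case IDs; prints: case-b
printf '%s\n' "case-a 0 0" "case-b 0 1" "case-c 2 2" "case-d 1 0" | filter_gvisor_only
```

Keeping the `fail-both` bucket separate is what prevents background failures from being counted against the sandbox.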

In our workflow, every deeply analyzed case produces a structured document, typically containing:

Failure symptoms and minimal reproduction method
runc/runsc comparison results
Root cause classification: gVisor bug, missing feature, environmental difference, test case issue, or race condition amplification
Linux kernel behavior comparison and source code evidence
Fixes or workaround suggestions
Regression verification results

To date, we have used AI to automatically analyze thousands of test cases exhibiting behavioral differences. From these, we extracted and deeply reviewed 100+ highly representative cases across 10+ programming languages and multiple build systems. These cases help us determine not only “whether gVisor is usable,” but also “who is actually to blame for a given failure.”

Compatibility Landscape: Boundaries Defined by Batch Comparisons

Looking at a small sample of failures makes it easy to misjudge gVisor’s compatibility. Reliable conclusions require large-scale A/B testing.

Across 10 mainstream code execution datasets in our Agentic-RL infrastructure, we’ve run 74,379 side-by-side comparisons between runc and runsc.

Please see the detailed data in the table below:

Dataset	Total cases	Native `runc` accuracy	gVisor pre-fix `runsc` accuracy	gVisor post-fix `runsc` accuracy
`terminal-bench2`	89	100.00%	94.38%	97.75%
`swe-public/Multi-SWE-bench`	1,632	70.16%	72.49%	73.16%
`swe-public/Multi-SWE-RL`	7,046	27.73%	20.49%	26.81%
`swe-public/SWE-bench_Multilingual`	300	93.00%	92.67%	93.00%
`swe-public/SWE-bench_Not_Verified`	1,794	97.94%	97.94%	97.94%
`swe-public/SWE-bench_Pro`	731	90.15%	90.97%	90.97%
`swe-public/SWE-bench_Verified`	500	100.00%	99.60%	100.00%
`swe-public/SWE-Gym`	2,438	86.75%	88.27%	88.27%
`swe-public/SWE-rebench`	21,336	83.33%	83.33%	83.77%
`swe-public/SWE-smith`	38,513	99.37%	97.42%	99.31%
Total	74,379	86.78%	85.18%	86.91%

Three key takeaways emerge from this data:

runsc (gVisor) and runc (Linux native) are now effectively on par. Across 74,379 runs, the post-fix correctness gap between runsc and runc is only about 0.13 percentage points (86.91% vs 86.78%), slightly in runsc’s favor. We also performed retries and cross-validation on core datasets to rule out one-off flakiness. Our fixes improved runsc’s overall pass rate by approximately 1.7 percentage points (from 85.18% to 86.91%). This gain largely stemmed from highly concentrated failures in a small number of repositories, such as trio, cloud-custodian, asciidoctor, and syncthing: once a root cause was identified, a single fix could often resolve hundreds of failing cases at once.
Most “compatibility issues” should not be attributed to gVisor. The table shows that even under native runc there is an inherent failure rate of about 13% (average correctness of 86.78%). These failures largely stem from flaky test code, build environment deficiencies, or limitations within the underlying datasets. Evaluating gVisor without a runc baseline could easily lead to misattributing this 13% background failure rate to sandbox incompatibility.
The overall pass rate for Multi-SWE-RL is relatively low (around 27% for both runtimes after fixes). This is because our internal evaluation framework and some case-execution methods are still being adapted; it is not a standalone compatibility problem in gVisor itself. The same bias affects both runc and runsc, and therefore does not change the comparative conclusion.
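
The Total row is a case-weighted average, so large datasets such as SWE-smith dominate it. The sketch below recomputes the three totals from the per-dataset rows of the table above. The function name `weighted_totals` is hypothetical, and because the inputs are already rounded to two decimals, the recomputed figures should land within rounding distance of the published totals rather than match them digit for digit.

```shell
# Read "cases runc pre_fix post_fix" rows from stdin and print the
# total case count followed by the three case-weighted pass rates.
weighted_totals() {
  awk '{
    n += $1
    runc += $1 * $2
    pre += $1 * $3
    post += $1 * $4
  } END {
    printf "%d %.2f %.2f %.2f\n", n, runc / n, pre / n, post / n
  }'
}

# Rows copied from the table above, with percent signs and thousands
# separators removed.
weighted_totals <<'EOF'
89 100.00 94.38 97.75
1632 70.16 72.49 73.16
7046 27.73 20.49 26.81
300 93.00 92.67 93.00
1794 97.94 97.94 97.94
731 90.15 90.97 90.97
500 100.00 99.60 100.00
2438 86.75 88.27 88.27
21336 83.33 83.33 83.77
38513 99.37 97.42 99.31
EOF
```

The same weighting explains why a fix that touches one large dataset moves the overall number more than a fix that clears every failure in a small one.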

At the production scale we described earlier—millions of gVisor sandboxes running every day—this data answers the real question: how much correctness do we lose by replacing runc with runsc? The answer is: almost none.

Representative Cases: Six Types of Issues and Corresponding Fix Paths

After filtering out cases where both runc and runsc failed simultaneously, we conducted in-depth reviews of the remaining cases that exhibited behavioral differences. Using these 100+ representative cases as a sample, their final root-cause attribution can roughly be divided into the following categories:

Root Cause Category	Requires gVisor Modification?	Typical Examples
Genuine gVisor bugs	Yes	`poll` incorrectly modifying `events`, inconsistent `execve` `errno` returns, `O_TRUNC` missing `IN_MODIFY` inotify events
Missing syscalls and virtual FS entries	Yes	Unimplemented `copy_file_range` syscall, missing `/proc/sys/fs/pipe-max-size` configuration file, and absence of `/sys/dev/block` directory
Clock and timer precision differences	Partially	CPU clock measurement precision, monotonic clock start value differences, sleep duration jitter
Amplified race conditions	No	Gradle `clean test` parallel execution concurrency race, CMake `copy_if_different` TOCTOU race
Environmental or config differences	No	External network access restrictions, JDK version mismatches, missing dynamic library paths
Test case issues	No	Test execution order dependencies, underlying dataset defects, inherently flaky tests

This shows that aside from genuine bugs or missing Linux ABI implementations in gVisor, a significant portion of behavioral differences stems from timing-sensitive tests, amplified user-space race conditions, or environmental setup differences. This matters especially for Agentic-RL scenarios: without runc baselines and root cause analysis, these failures could easily be misclassified as sandbox incompatibilities, leading to systematically pessimistic conclusions.

These cases highlight the different types of compatibility issues we see in Agentic-RL: system call semantic deviations, Linux ABI gaps, VFS implementation gaps, and user-space race conditions.

Case 1: `poll` Behavior Inconsistency Causes `tmux` Busy-Loop

The evaluation cluster’s CPU utilization was unusually high. Investigation revealed that the tmux server in each Agent container was pegging a CPU core: under gVisor, CPU usage hovered at 96.6%, while under runc it was practically 0%.

The root cause was poll write-back semantics. gVisor internally appended POLLHUP|POLLERR to pollfd.events and wrote the entire pollfd struct back to user space. Linux, however, only writes to revents and never modifies the user’s original events. This discrepancy prevented libevent from properly removing closed file descriptors. Subsequent poll calls immediately returned POLLNVAL, triggering a busy-loop.

After fixing this, the tmux CPU dropped from 96.6% to 0%. The impact goes far beyond tmux — any program relying on the libevent poll backend benefits from this.

Case 2: syncthing Test Case Exposes Two Independent Linux ABI Gaps (Unimplemented Syscalls or Virtual Files)

In real-world workloads, it’s not uncommon for a single test case to hit two independent gVisor compatibility issues at once. The syncthing__syncthing-7828 test case in the Multi-SWE-RL dataset passes normally under runc, but consistently fails under runsc: 16 TestCopyRange/* subtests report function not implemented, and another TestTruncateFileOnly times out waiting for an inotify event.

This was caused by two independent Linux ABI gaps:

copy_file_range (syscall 326) was unimplemented. gVisor registered it as ErrorWithEvent(ENOSYS), so any program using this syscall received function not implemented.
open(O_TRUNC) was missing the IN_MODIFY inotify event. The Linux kernel generates IN_MODIFY along the do_open() → handle_truncate() → notify_change() path. However, gVisor VFS’s OpenAt only generated IN_OPEN, causing programs listening for file modification events to be “deaf” to the truncation action.

The fix proceeded along two lines: implementing copy_file_range for both amd64 (326) and arm64 (285), and issuing IN_MODIFY at the VFS layer for O_TRUNC on non-newly created files (skipping it for newly created files via the FMODE_CREATED flag, consistent with Linux). After the fix, this test case passed consistently under runsc just like under runc.

Case 3: Gradle clean test Concurrency Race—Root Cause in User Space, Not gVisor

Not all issues that “only reproduce under gVisor” are actually gVisor bugs.

A Thunderbird Android test running ./gradlew clean test --max-workers 8 --continue under runsc frequently failed with Unable to delete directory. However, running it 7 times under runc yielded 5 failures (71%). This pointed to a user-space TOCTOU race condition in Gradle’s parallel build: one subproject was still writing to build/, while another subproject’s clean task was already trying to delete it.

gVisor’s higher system call overhead amplified the probability of triggering this race, but it did not introduce new semantic errors. Splitting the command into ./gradlew clean and ./gradlew test ... fixed it completely. This is also a fundamental principle we follow in compatibility analysis: always use runc as a baseline first, then determine whether the issue should be attributed to the sandbox itself.
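
The workaround can be wrapped in a small helper so that `clean` always finishes before any test task starts. This is a minimal sketch: the `GRADLEW` override and the function name are hypothetical conveniences, and the flags in the usage comment are simply the ones from the failing command.

```shell
# GRADLEW is a hypothetical override for the wrapper path.
GRADLEW=${GRADLEW:-./gradlew}

# Run `clean` to completion first, then run the tests in a second
# Gradle invocation, so no clean task can race with a subproject that
# is still writing to build/.
gradle_clean_then_test() {
  "$GRADLEW" clean && "$GRADLEW" test "$@"
}

# Usage:
#   gradle_clean_then_test --max-workers 8 --continue
```

The cost is one extra Gradle startup per run, which is usually small compared with a retried build.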

Case 4: Missing procfs / sysfs Causes Real Applications to Take Abnormal Paths

Agentic-RL workloads are full of paths that are not usually tested in isolation but are relied upon by real projects, such as /proc/sys/fs/pipe-max-size, /proc/sys/kernel/randomize_va_space, /sys/dev/block, /proc/[pid]/fdinfo, etc. Once missing, these typically manifest as ENOENT or cause upper-layer libraries to take abnormal code paths.

These are usually cheap to fix by wiring up static files or directory structures. They perfectly illustrate the value of real-world workloads: we aren’t adding these paths to satisfy a benchmark, we’re adding them because real applications actually read them.
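
A quick way to spot this class of gap is to probe the same paths under both runtimes and diff the output. The sketch below checks the paths named above. The function name `probe_paths` is hypothetical, and `/proc/self/fdinfo` stands in for `/proc/[pid]/fdinfo`.

```shell
# Print "ok <path>" for each path that exists, "MISSING <path>" otherwise.
probe_paths() {
  for p in "$@"; do
    if [ -e "$p" ]; then
      echo "ok $p"
    else
      echo "MISSING $p"
    fi
  done
}

probe_paths \
  /proc/sys/fs/pipe-max-size \
  /proc/sys/kernel/randomize_va_space \
  /sys/dev/block \
  /proc/self/fdinfo
```

Running the script once in a `--runtime=runc` container and once in a `--runtime=runsc` container, then comparing the two listings, shows which entries the sandbox lacks. Existence is only the first check; a file can be present and still report different contents.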

Case 5: Inconsistent PTY Implementation Causes Interactive Agents to Error

Interactive terminals are easily overlooked but heavily used in Agent systems (tmux, screen, expect, REPLs, etc.). All rely on PTYs. We fixed several inconsistencies here:

The ISIG flag was not checked correctly, causing signals to still be generated after stty -isig.
When the master closed, it did not send SIGHUP to the foreground process group as Linux does.
TCSBRK / TCFLSH and other ioctls were missing or had incorrect directional semantics, affecting programs like pyserial.

Notably, TCFLSH semantics must be evaluated from the caller’s perspective rather than hardcoding internal queue names. Otherwise, the flush directions seen by the master and replica are reversed compared to Linux.

Case 6: Jekyll Test Order Dependency Causes Flaky Failures—A Pure Test Case Issue

Sometimes, a test failing under gVisor has nothing to do with the runtime environment at all.

During evaluation, a Jekyll test case (jekyll-7637) failed under runsc but coincidentally passed under runc. After a deep dive, we found that this test actually had a roughly 33% chance of failing in any environment.

The root cause was rather dramatic: the test code itself had a bug where it passed a configuration value as a Ruby Symbol type, while the underlying source code incorrectly compared it as a String. As a result, this test could never load its required syntax highlighting plugin as intended. So why did it sometimes pass? Because the testing framework (minitest) executes tests in a randomized order. If this buggy test happened to run after another test that correctly loaded the plugin into memory, it would “freeload” off that global state and pass. But if the randomized order happened to put this test first, it would genuinely fail. It just so happened that gVisor hit that 1-in-3 failure chance during our evaluation.

This perfectly illustrates why we need large-scale A/B testing and deep analysis: without them, sporadic test flakiness like this can easily be misdiagnosed as “sandbox instability.”
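
Estimating a failure rate like this only requires running the same command repeatedly under one runtime. The sketch below is one way to do that. The function name `failure_rate` is hypothetical, the percentage is truncated to an integer, and the run count must be at least 1.

```shell
# Run a command N times and report how often it exited non-zero.
# Usage: failure_rate N command [args...]
failure_rate() {
  runs=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
  done
  echo "$failures/$runs failed ($((failures * 100 / runs))%)"
}

# Usage, with the test command substituted in:
#   failure_rate 30 sh -c 'your test command here'
```

A single pass or fail says little about a test with a one-in-three failure rate, so a few dozen runs per runtime give a much better basis for attribution.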

Best Practices: Suggestions for Using gVisor in Agentic-RL Scenarios

If you’re building an Agent execution environment with gVisor, here are some practical tips.

Suggestions for Different Build Systems

Build System	Common Risks	Suggestions
Gradle	clean test concurrency race	Split into clean and test steps
Maven	Remote dependency download timeout or 403	Pre-populate local repo cache, minimize online downloads
CMake	`copy_if_different` race conditions	Lower parallelism, avoid over-reliance on extremely short time windows
sbt / Scala	Deep stack, slow startup, test flakiness	Increase `-Xss`, give the first compilation a more generous timeout
pip / pytest	Differences in CPU count vs cgroup quota perception	Be aware of the relationship between `os.cpu_count()` and actual quotas
Cargo / npm / yarn	Generally good compatibility	Usually do not require special handling
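
For the CPU count row, it can help to compare what the container reports with what its cgroup quota actually allows. The sketch below converts a cgroup v2 `cpu.max` value into a whole number of CPUs. It assumes cgroup v2 mounted at `/sys/fs/cgroup`; the function name `effective_cpus` is hypothetical, and the result is only a hint for choosing test parallelism.

```shell
# Convert a cgroup v2 cpu.max pair ("<quota> <period>" or "max <period>")
# into a CPU count, rounding up. $3 is the fallback when no quota is set.
effective_cpus() {
  quota=$1
  period=$2
  host_cpus=$3
  if [ "$quota" = "max" ]; then
    echo "$host_cpus"
  else
    echo $(( (quota + period - 1) / period ))
  fi
}

# Compare against the live values when they are available.
if [ -r /sys/fs/cgroup/cpu.max ] && command -v nproc >/dev/null 2>&1; then
  read -r live_quota live_period < /sys/fs/cgroup/cpu.max
  reported=$(nproc)
  echo "nproc: $reported, quota-derived: $(effective_cpus "$live_quota" "$live_period" "$reported")"
fi
```

If the quota-derived number is much lower than the reported count, passing an explicit worker count to the test runner avoids oversubscribing the sandbox.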

Debugging Procedure When Encountering Failures

When a test fails, we recommend this debugging flow:

First reproduce the same command under runc to confirm if the failure is specific to gVisor.
If runc also fails, prioritize investigating test case issues, environmental differences, or race conditions.
If it only fails under gVisor, check for obvious missing syscalls, procfs, or sysfs.
For issues with no obvious missing features, compare logs, strace, and runtime behavior to distinguish between semantic inconsistencies, amplified race conditions, or environmental configuration differences.
Only after confirming it is a gVisor semantic issue, proceed to locate the code path, create a minimal reproduction, and add regression tests.

Note: Many perceived “gVisor compatibility issues” are ultimately reclassified as test case issues during this step.
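
The first step of this flow can be scripted so that both runtimes always see the same image and command. This is a minimal sketch that assumes Docker with the `runsc` runtime already installed; the `DOCKER` override and the function names are hypothetical.

```shell
# DOCKER is a hypothetical override for the docker binary.
DOCKER=${DOCKER:-docker}

# Run one command in a fresh container under the given runtime and
# print the exit status of that run.
# Usage: exit_under RUNTIME IMAGE COMMAND
exit_under() {
  status=0
  "$DOCKER" run --rm --runtime="$1" "$2" sh -c "$3" >/dev/null 2>&1 || status=$?
  echo "$status"
}

# Same image, same command, both runtimes.
# Usage: compare_runtimes IMAGE COMMAND
compare_runtimes() {
  runc_exit=$(exit_under runc "$1" "$2")
  runsc_exit=$(exit_under runsc "$1" "$2")
  echo "runc=$runc_exit runsc=$runsc_exit"
}

# Usage:
#   compare_runtimes ubuntu:latest 'cat /proc/sys/fs/pipe-max-size'
```

A result such as `runc=0 runsc=1` justifies moving on to the gVisor-specific steps; matching non-zero statuses point back at the test case or the environment. Exit statuses 125 to 127 come from Docker itself rather than from the command, so they deserve a separate look.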

AI-Driven Compatibility Analysis: Why This Path Is Feasible

Large-scale compatibility analysis is well suited to AI assistance because it involves a large amount of repetitive, context-heavy work:

Reading project source code and build scripts
Comparing behavioral differences between two runtimes
Comparing syscall, procfs, sysfs, PTY, network, and VFS semantics
Turning conclusions into executable patches, PRs, or workaround suggestions
Running regression validation and re-investigating the issue when validation fails

Manual analysis does not scale, while hardcoded rules often break down on complex cases. AI agents fit naturally in the middle: they can take on most of the “read logs → categorize → locate → report” work, while human engineers still review the proposed approach and code.

The real value here is not just saving time; it is making our conclusions scalable, traceable, and continuously improvable:

Every case has standardized analysis artifacts rather than scattered chat logs.
Every fix can be validated again against the original real-world test case.
Every case that is “not a gVisor issue” can still be turned into a concrete workaround playbook.
As new datasets, images, or build systems arrive, the same analysis framework can be reused.

Through this method, we already have more than ten fixes merged into the gVisor mainline, covering multiple areas such as file systems, networking, proc/sysfs, PTY, and system call semantics. Some representative PRs are listed below:

PR	Fix Content	Typical Agentic-RL Scenario
#12851	poll: Only write back `revents`	tmux, libevent poll backend
#12911	proc: Add `/proc/sys/fs/pipe-max-size`	Python libraries like wurlitzer
#12915	pty: Implement `TCSBRK` / `TCFLSH`	pyserial, interactive PTY programs
#12814	proc: Add `randomize_va_space`	Performance and security inspection tools
#12813	sysfs: Add `/sys/dev/block` and `/sys/dev/char`	lsblk, device-related tools
#12819	proc: Fill in `fdinfo` fields	lsof, fuser, diagnostic tools
#12786	devpts: Fix `ISIG` check	Interactive shells / terminal-based agents
#12853	vfs: `FICLONE*` returns `EOPNOTSUPP`	file copying tools

In this sense, Agentic-RL is not just a new use case for gVisor; it has also pushed our compatibility engineering toward a more AI-driven workflow.

Conclusion

Agentic-RL is both a proving ground for gVisor and, in practice, a large-scale regression suite: it continuously drives real-world projects through the sandbox and exposes compatibility boundaries that standard unit tests struggle to cover. By bringing AI agents into this verification loop, we can evaluate gVisor’s production readiness with data rather than intuition.

Our conclusions are simple:

gVisor’s compatibility has proven to be production-ready.
Most “compatibility issues” should not actually be attributed to gVisor.
Real-world workloads are better than handpicked tests at revealing critical problems.
AI-driven compatibility analysis is practical.

As AI agents take on heavier tasks, the code-execution sandbox will become an indispensable security foundation. We will continue refining this AI-driven verification system, applying it to new datasets and language stacks, and upstreaming our findings to the gVisor community. For Agentic-RL, a good sandbox is not just secure—it also needs to be highly compatible, debuggable, and able to evolve alongside real-world workloads.

Multi-Agent gVisor Isolation (MAGI)

2026-04-15T00:00:00-05:00

Get in the sandbox, Agents.

Does gVisor work with OpenClaw? This question has been asked a lot, so let’s answer it here and now: Yes.

In this post, we will set up a triple-agent system combining OpenClaw, PicoClaw, and Hermes Agent, each in its own gVisor sandbox. Local inference is served by Ollama, also in a gVisor sandbox, using three different models, and the agents convene in a self-hosted Matrix.org server (naturally, also running in a gVisor sandbox). Each agent will be given its own set of capabilities, each of which will be sandboxed. At the end of the day, you will have a fully self-sovereign triple-agent system that can answer queries, browse the web, and cogitate with itself.

Does this particular setup make practical sense? No, but it is cool. More importantly, it demonstrates gVisor’s versatility in sandboxing basically any component that an agentic system may need. gVisor’s compatibility has grown significantly over the last few years, and agent harnesses fit well within what gVisor is capable of.

Let’s go.

Basic machine setup: Docker/gVisor/NVIDIA drivers

We will use a g2-standard-96 GCE VM running stock Ubuntu for this, but any Linux machine with similar GPUs would work. This section describes its basic setup.

Getting a GCE VM:

$ gcloud compute instances create magi \
    --project=eperot-gke-dev \
    --zone=europe-west1-c \
    --machine-type=g2-standard-96 \
    --maintenance-policy=TERMINATE \
    --accelerator=count=8,type=nvidia-l4 \
    --create-disk=auto-delete=yes,boot=yes,device-name=magi,image=projects/ubuntu-os-cloud/global/images/ubuntu-2404-noble-amd64-v20260316,mode=rw,size=2048,type=pd-ssd

We will be using the following ports:

8008: Matrix.org server (Synapse)
8084: Cinny web UI (Matrix.org client)
11434: Ollama (inference API server)
18789: OpenClaw gateway web UI
18790: PicoClaw gateway
3002: Self-hosted Firecrawl

If SSHing into a VM, you can forward some of them for convenient access:

-L 8008:127.0.0.1:8008 -L 8084:127.0.0.1:8084 -L 11434:127.0.0.1:11434 -L 18789:127.0.0.1:18789
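
If the port list changes, the flags can be generated rather than typed by hand. The helper below is a small sketch; the function name `forward_flags` is hypothetical.

```shell
# Build one "-L port:127.0.0.1:port" flag per port given.
forward_flags() {
  flags=""
  for port in "$@"; do
    flags="$flags -L ${port}:127.0.0.1:${port}"
  done
  printf '%s\n' "${flags# }"
}

# Prints the flag list shown above.
forward_flags 8008 8084 11434 18789

# Usage with the VM created earlier (arguments after "--" are passed to ssh):
#   gcloud compute ssh magi --zone=europe-west1-c -- $(forward_flags 8008 8084 11434 18789)
```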

Setting up the GCE VM (once SSH’d as root):

# Basics
sudo apt-get update && sudo apt-get -y upgrade

# NVIDIA driver
DRIVER_VERSION=590.48.01; \
  sudo apt-get install -y build-essential linux-headers-$(uname -r) && \
  curl -fSsl -O "https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run" && \
  sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run && \
  rm NVIDIA-Linux-x86_64-$DRIVER_VERSION.run

# Docker
sudo apt update && \
  sudo apt install -y ca-certificates curl && \
  sudo install -m 0755 -d /etc/apt/keyrings && \
  sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc && \
  sudo chmod a+r /etc/apt/keyrings/docker.asc
sudo tee /etc/apt/sources.list.d/docker.sources <<EOF
Types: deb
URIs: https://download.docker.com/linux/ubuntu
Suites: $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}")
Components: stable
Signed-By: /etc/apt/keyrings/docker.asc
EOF
sudo apt update && \
  sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# NVIDIA container toolkit
sudo apt-get update && sudo apt-get install -y --no-install-recommends \
  ca-certificates \
  curl \
  gnupg2 && \
  curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
  curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list && \
    sudo apt-get update && \
    export NVIDIA_CONTAINER_TOOLKIT_VERSION=1.19.0-1 && \
    sudo apt-get install -y \
      nvidia-container-toolkit=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      nvidia-container-toolkit-base=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container-tools=${NVIDIA_CONTAINER_TOOLKIT_VERSION} \
      libnvidia-container1=${NVIDIA_CONTAINER_TOOLKIT_VERSION}

# gVisor
sudo apt-get update && \
  sudo apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg && \
  curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg && \
  echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" | sudo tee /etc/apt/sources.list.d/gvisor.list > /dev/null && \
  sudo apt-get update && sudo apt-get install -y runsc && \
  sudo runsc install -- --nvproxy=true --nvproxy-allowed-driver-capabilities=all --net-raw=true --allow-packet-socket-write=true --host-uds=all --debug-log=/tmp/runsc/ && \
  sudo systemctl restart docker

Verifying everything works:

$ nvidia-smi
$ docker run --runtime=runsc --gpus=all --rm ubuntu:latest sh -c 'ls -al /dev/nvidia*'

Self-hosted Matrix.org server + Cinny web frontend setup

Setting up Synapse and Cinny.

Let’s set up the Matrix.org server for communication, and the Cinny web client that we humans can use to communicate with it.

# Generate homeserver.yaml
$ docker run -it --runtime=runsc --rm \
    --mount=type=volume,src=synapse-data,dst=/data \
    -e SYNAPSE_SERVER_NAME=magi \
    -e SYNAPSE_REPORT_STATS=no \
    matrixdotorg/synapse:latest generate

# Run server
$ docker run --detach --runtime=runsc --restart=always --name=synapse \
    --mount=type=volume,src=synapse-data,dst=/data \
    -p 8008:8008 \
    matrixdotorg/synapse:latest

# Create admin user
$ docker exec -it synapse register_new_matrix_user \
    -c /data/homeserver.yaml \
    --user gendo --password yui --admin

# Run cinny (Matrix client)
$ docker run --detach --runtime=runsc --restart=always --name=cinny \
    --link=synapse:synapse \
    -p 8084:80 \
    ghcr.io/cinnyapp/cinny:latest

# Access Cinny web UI at http://localhost:8084
# Log in as:
#   Homeserver: http://127.0.0.1:8008
#   Username: gendo
#   Password: yui

Self-hosted inference server: Ollama

Setting up Ollama for GPU inference.

Setting up Ollama, the GPU-enabled inference server and the brain of it all.

$ docker run --detach --runtime=runsc --restart=always --name=ollama \
    --gpus=all \
    --mount=type=volume,src=ollama-data,dst=/root \
    -p 11434:11434 \
    ollama/ollama:0.20.0

# Pull and load some models.
$ docker exec -it ollama sh -c 'ollama pull qwen3.5:27b-q4_K_M   && ollama run --keepalive=9001h qwen3.5:27b-q4_K_M     Say hello.'
$ docker exec -it ollama sh -c 'ollama pull glm-4.7-flash:q4_K_M && ollama run --keepalive=9001h glm-4.7-flash:q4_K_M   Say hello.'
$ docker exec -it ollama sh -c 'ollama pull gemma4:26b-a4b-it-q8_0 && ollama run --keepalive=9001h gemma4:26b-a4b-it-q8_0 Say hello.'
$ docker exec -it ollama sh -c 'ollama pull nomic-embed-text:137m-v1.5-fp16 && ollama run --keepalive=9001h nomic-embed-text:137m-v1.5-fp16 ""'

# Make sure they all fit together in VRAM, otherwise you'll get bad performance.
$ docker exec -it ollama ollama ps
NAME                      ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:26b-a4b-it-q8_0    6bfaf9a8cb37    89 GB    100% GPU     262144     12 months from now
glm-4.7-flash:q4_K_M      d1a8a26252f1    40 GB    100% GPU     202752     12 months from now
qwen3.5:27b-q4_K_M        7653528ba5cb    44 GB    100% GPU     262144     12 months from now
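
To check the fit numerically, the SIZE column can be summed and compared with the total GPU memory. The sketch below assumes the column layout shown above and 24 GB per L4, which gives 192 GB across the eight GPUs requested for this VM. The function name `loaded_gb` is hypothetical.

```shell
# Sum the SIZE column of `ollama ps` output read from stdin.
# Only rows whose size is reported in GB are counted.
loaded_gb() {
  awk 'NR > 1 && $4 == "GB" { total += $3 } END { print total + 0 }'
}

# Sample rows copied from the output above; prints: 173
loaded_gb <<'EOF'
NAME                      ID              SIZE     PROCESSOR    CONTEXT    UNTIL
gemma4:26b-a4b-it-q8_0    6bfaf9a8cb37    89 GB    100% GPU     262144     12 months from now
glm-4.7-flash:q4_K_M      d1a8a26252f1    40 GB    100% GPU     202752     12 months from now
qwen3.5:27b-q4_K_M        7653528ba5cb    44 GB    100% GPU     262144     12 months from now
EOF

# Live usage (no -it, so the output pipes cleanly):
#   docker exec ollama ollama ps | loaded_gb
```

With these three models the total is 173 GB, which leaves some headroom under 192 GB. The PROCESSOR column remains the more direct signal: anything other than `100% GPU` means part of a model has spilled to the CPU.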

Containerized OpenClaw setup with Browser Use

Setting up OpenClaw and Chrome browser.

Now let’s set up OpenClaw and hook it up to a web browser for fully-local Browser Use.

We will use the official ghcr.io/openclaw/openclaw OpenClaw container image, but we will also modify it to install Google Chrome, as recommended in the OpenClaw docs. This will allow the agent to use a web browser, all running in gVisor.

$ export MELCHIOR="$HOME/agents/melchior-1"; mkdir -p "$MELCHIOR"
$ cat <<EOF > "$MELCHIOR/Dockerfile"
FROM ghcr.io/openclaw/openclaw:2026.4.2

USER 0:0
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y && \
    apt install -y wget chromium libvulkan1 && \
    wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb && \
    dpkg -i google-chrome-stable_current_amd64.deb && \
    rm google-chrome-stable_current_amd64.deb && \
    apt --fix-broken install -y
EOF

$ docker build -t openclaw:melchior-1 "$MELCHIOR"

Note that the resulting image runs as root. This is not a security risk; “root” in a gVisor sandbox doesn’t imply root-level access on the host.

Let’s create a Matrix account for it and seed its configuration:

$ mkdir -p "$MELCHIOR/config" "$MELCHIOR/home"

$ docker exec -it synapse register_new_matrix_user \
    -c /data/homeserver.yaml \
    --user melchior --password akagi --no-admin

$ cat <<EOF > "$MELCHIOR/config/openclaw.json"
{
  "auth": {
    "profiles": {
      "ollama:default": {
        "provider": "ollama",
        "mode": "api_key"
      }
    }
  },
  "agents": {
    "defaults": {
      "models": {
        "ollama/gemma4:26b-a4b-it-q8_0": {}
      }
    }
  },
  "models": {
    "mode": "merge",
    "providers": {
      "ollama": {
        "baseUrl": "http://ollama:11434",
        "api": "ollama",
        "apiKey": "OLLAMA_API_KEY",
        "models": [
          {
            "id": "gemma4:26b-a4b-it-q8_0",
            "name": "gemma4:26b-a4b-it-q8_0",
            "reasoning": true,
            "input": [
              "text"
            ],
            "cost": {
              "input": 0,
              "output": 0,
              "cacheRead": 0,
              "cacheWrite": 0
            },
            "contextWindow": 262144,
            "maxTokens": 8192
          }
        ]
      }
    }
  },
  "channels": {
    "matrix": {
      "enabled": true,
      "homeserver": "http://synapse:8008",
      "userId": "@melchior:magi",
      "password": "akagi",
      "deviceName": "Melchior",
      "allowPrivateNetwork": true,
      "encryption": false,
      "groupPolicy": "open",
      "autoJoin": "always",
      "dm": {
        "policy": "open",
        "allowFrom": [
          "*"
        ]
      }
    }
  },
  "gateway": {
    "mode": "local",
    "controlUi": {
      "dangerouslyDisableDeviceAuth": true,
      "dangerouslyAllowHostHeaderOriginFallback": true
    }
  },
  "skills": {
    "install": {
      "nodeManager": "npm"
    }
  },
  "browser": {
    "enabled": true,
    "executablePath": "/usr/bin/google-chrome-stable",
    "headless": true,
    "noSandbox": true
  },
  "tools": {
    "web": {
      "search": {
        "enabled": true,
        "provider": "duckduckgo"
      },
      "fetch": {
        "enabled": true
      }
    }
  },
  "plugins": {
    "entries": {
      "matrix": {
        "enabled": true
      },
      "browser": {
        "enabled": true
      }
    }
  }
}
EOF

Note: for the purpose of simplifying demo setup, the above configuration disables authentication, allows the bot to auto-join all Matrix channels it is invited to, etc. For real deployments, do not use these settings.
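Before reusing this file anywhere that matters, it can help to list which of the relaxed settings are still present. Here is a minimal sketch of such a check. The list of flagged keys is simply read off the demo configuration above, not taken from any official OpenClaw hardening guide, and the helper names (`lookup`, `find_demo_only_settings`) are made up for illustration.

```python
import json

# Settings that the demo configuration above relaxes. This list is an
# assumption derived from that configuration, not an official checklist.
DEMO_ONLY_SETTINGS = [
    (("gateway", "controlUi", "dangerouslyDisableDeviceAuth"), True),
    (("gateway", "controlUi", "dangerouslyAllowHostHeaderOriginFallback"), True),
    (("channels", "matrix", "autoJoin"), "always"),
    (("channels", "matrix", "groupPolicy"), "open"),
    (("channels", "matrix", "dm", "policy"), "open"),
    (("channels", "matrix", "encryption"), False),
]


def lookup(config, path):
    """Walk nested dicts; return None when any key along the path is missing."""
    node = config
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def find_demo_only_settings(config):
    """Return the dotted paths whose value matches a relaxed demo setting."""
    return [
        ".".join(path)
        for path, risky_value in DEMO_ONLY_SETTINGS
        if lookup(config, path) == risky_value
    ]


# A trimmed-down sample standing in for openclaw.json.
sample = json.loads("""
{
  "gateway": {"controlUi": {"dangerouslyDisableDeviceAuth": true}},
  "channels": {"matrix": {"autoJoin": "always", "encryption": true}}
}
""")
flagged = find_demo_only_settings(sample)
print(flagged)
```

To run it against the real file, load `"$MELCHIOR/config/openclaw.json"` with `json.load` instead of the inline sample.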

Let’s run it!

$ export MELCHIOR="$HOME/agents/melchior-1"; docker run --detach \
    --name=melchior \
    --runtime=runsc \
    --restart=always \
    --env=OPENCLAW_GATEWAY_TOKEN="dummy-token-for-sandbox" \
    --env=OPENCLAW_CONFIG_PATH="/etc/openclaw/openclaw.json" \
    -p 18789:18789 \
    --env=HOME=/home/node \
    --link=synapse:synapse \
    --link=ollama:ollama \
    -v "$MELCHIOR/home":/home/node/.openclaw \
    -v "$MELCHIOR/config":/etc/openclaw \
    openclaw:melchior-1 \
    node \
        dist/index.js \
        gateway \
           --bind=lan \
           --port=18789 \
           --allow-unconfigured \
           --verbose

Run docker exec -it melchior openclaw configure for further interactive configuration.

You can now go to http://127.0.0.1:18789/?token=dummy-token-for-sandbox and talk to your OpenClaw instance!

OpenClaw web UI running in gVisor. The dmesg output is characteristic of gVisor.

Browser Use

The Dockerfile we built earlier contains the Google Chrome web browser, which OpenClaw knows how to use. You can ask it to open websites and take screenshots. Here is the gVisor website rendered in Chrome-in-gVisor by OpenClaw:

gVisor website rendered by Chrome in gVisor, orchestrated by OpenClaw.
Funnily enough, the OpenClaw web interface didn't provide the means for OpenClaw to display this image directly.
OpenClaw autonomously solved this problem by uploading this picture to a temporary image hosting service and responding with the uploaded image URL.

Now let’s bring the other two brains to life.

Containerized PicoClaw with web and GitHub skills

Setting up PicoClaw.

Moving on to PicoClaw, the minimal agent.

We will use the PicoClaw Docker image, and enable a few skills for GitHub interaction with the gVisor repository.

Note that while this demo was run on an x86-64 VM, PicoClaw has also been confirmed to work in gVisor on arm64 on a Raspberry Pi 4 Model B.

$ export BALTHASAR="$HOME/agents/balthasar-2"; mkdir -p "$BALTHASAR/picoclaw"
$ docker exec -it synapse register_new_matrix_user \
    -c /data/homeserver.yaml \
    --user balthasar --password ritsuko --no-admin
$ matrix_token="$(curl -X POST -H "Content-Type: application/json" \
    "http://127.0.0.1:8008/_matrix/client/v3/login" \
    -d \
    '{"type": "m.login.password", "user": "balthasar", "password": "ritsuko"}' | \
    jq -r .access_token)"
$ cat <<EOF > "$BALTHASAR/picoclaw/config.json"
{
  "model_list": [
    {
      "model_name": "glm-4.7-flash",
      "model": "ollama/glm-4.7-flash:q4_K_M",
      "api_base": "http://ollama:11434/v1"
    }
  ],
  "agents": {
    "defaults": {
      "model_name": "glm-4.7-flash"
    }
  },
  "gateway": {
    "host": "0.0.0.0",
    "port": 18790
  },
  "channels": {
    "matrix": {
      "enabled": true,
      "homeserver": "http://synapse:8008",
      "user_id": "@balthasar:magi",
      "access_token": "${matrix_token}",
      "join_on_invite": true,
      "allow_from": []
    }
  }
}
EOF
$ docker run -it \
    --name=balthasar \
    --runtime=runsc \
    --restart=always \
    -v "$BALTHASAR/picoclaw:/root/.picoclaw" \
    --link=synapse:synapse \
    --link=ollama:ollama \
    --entrypoint=/usr/local/bin/picoclaw \
    sipeed/picoclaw:latest gateway

PicoClaw should start, although it does not have a lot of functionality out of the box. Let’s enable some skills:

$ cp "$BALTHASAR/picoclaw/config.json" "$BALTHASAR/picoclaw/config.json.bak" && \
  jq '.tools.web.enabled = true |
      .tools.web.prefer_native = true |
      .tools.exec.enabled = true |
      .tools.exec.allow_remote = true |
      .tools.skills.enabled = true |
      .tools.skills.github = {
        "enabled": true,
        "token": "YOUR_GITHUB_TOKEN_HERE",
        "timeout": 30,
        "max_results": 5
      } |
      .tools.skills.max_concurrent_searches = 5 |
      .tools.skills.search_cache = {
        "max_size": 100,
        "ttl_seconds": 300
      } |
      .tools.web_fetch.enabled = true' \
      < "$BALTHASAR/picoclaw/config.json.bak" \
      > "$BALTHASAR/picoclaw/config.json"

# Restart PicoClaw to apply config changes.
$ docker restart balthasar

# You can re-attach to an interactive CLI for PicoClaw with:
$ docker exec -it balthasar picoclaw agent

Now we can ask it to interact with GitHub.

PicoClaw being tasked with looking up the current trending GitHub repositories.

Funnily enough, the top GitHub repository today is Hermes Agent, which we will install next. For now, let’s review a small gVisor PR:

PicoClaw being tasked with explaining and reviewing [gVisor pull request #12911](https://github.com/google/gvisor/pull/12911).
Which was later reviewed by a human as well.

Modularized & sandboxed Hermes Agent setup

Setting up Hermes Agent.

Finally, let’s set up Hermes Agent, and fully load it with sandboxed Browser Use, sandboxed web crawling, and sandboxed code execution.

We will use Hermes Agent’s official Docker image: nousresearch/hermes-agent, expanded with the dependencies needed to perform local text-to-speech and Matrix.org integration, all running in gVisor. Additionally, for extra security, we will do the following:

Run Camofox Browser in a separate gVisor container, for browser use.
Run self-hosted Firecrawl in a separate gVisor container, for agentic search.
Run Docker-in-gVisor in a separate container, for Hermes Agent to execute arbitrary code safely.

Note that the --net-raw=true --allow-packet-socket-write=true runsc flags are required for Docker to work in gVisor. For this reason, we need to install a secondary runtime for the Docker-in-gVisor container, and enable host UDS (--host-uds=all) so that the Docker daemon socket file can be exported out of that sandbox into the Hermes Agent sandbox.

Hermes Agent running in gVisor.

Setting up Docker-in-gVisor for code execution

Setting up Docker-in-gVisor for code execution.

gVisor is capable of running Docker inside of itself. Since Hermes Agent has Docker as a code execution backend, we will use this to spawn a separate Docker-in-gVisor container which Hermes Agent can use to run code safely.

$ export CASPER="$HOME/agents/casper-3"
$ runsc install --runtime=docker-in-gvisor -- --net-raw=true --allow-packet-socket-write=true --host-uds=all

# Reload *host* dockerd configuration to make it notice the new runtime we just added.
$ kill -HUP "$(pidof dockerd)"

# Run Docker-in-gVisor container.
# Note: The `--cap-add=all` flag does *not* grant the container any
# capabilities on the host. It only enables the sandboxed workload to use
# elevated privileges **within the sandbox**.
# This is necessary to be able to run `dockerd` inside a container.
$ mkdir -p "$CASPER/docker-run"; docker run --detach \
    --name=hermes-exec \
    --runtime=docker-in-gvisor \
    --restart=always \
    --cap-add=all \
    --mount="type=bind,src=$CASPER/docker-run,dst=/var/run" \
    us-central1-docker.pkg.dev/gvisor-presubmit/gvisor-presubmit-images/basic/docker_x86_64

# Verify that we can talk to the `dockerd` server running in gVisor.
# We need --security-opt=seccomp=unconfined here, because otherwise
# Docker's default seccomp profile would block the `syslog(2)` syscall that
# the `dmesg` process uses to read the kernel logs (which here is actually
# reading the gVisor kernel logs). This is not a security problem, since we
# are still all running in gVisor.
$ DOCKER_HOST="unix://$CASPER/docker-run/docker.sock" docker run \
    --rm \
    --security-opt=seccomp=unconfined \
    debian:latest \
    dmesg
# [...]
[    0.000000] Starting gVisor...
[    0.429798] DeFUSEing fork bombs...
[    0.782957] Adversarially training Redcode AI...
# [...]
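If the Docker-in-gVisor container fails to start, one thing worth checking is whether the secondary runtime was registered with all three flags. Docker keeps custom runtimes under the `runtimes` key of its daemon configuration (typically `/etc/docker/daemon.json` on Linux), each with a `path` and optional `runtimeArgs`. The sketch below inspects such a configuration. The helper names and the `/usr/local/bin/runsc` path in the sample are assumptions for illustration.

```python
# Flags passed to `runsc install` for the docker-in-gvisor runtime above.
REQUIRED_FLAGS = [
    "--net-raw=true",
    "--allow-packet-socket-write=true",
    "--host-uds=all",
]


def runtime_args(daemon_config, runtime_name):
    """Return the runtimeArgs list of a registered runtime, or None if absent."""
    runtime = daemon_config.get("runtimes", {}).get(runtime_name)
    if runtime is None:
        return None
    return runtime.get("runtimeArgs", [])


def missing_flags(daemon_config, runtime_name="docker-in-gvisor"):
    """List the required flags that the named runtime does not carry."""
    args = runtime_args(daemon_config, runtime_name)
    if args is None:
        return list(REQUIRED_FLAGS)
    return [flag for flag in REQUIRED_FLAGS if flag not in args]


# Sample daemon configuration; the runsc path here is an assumption.
sample_config = {
    "runtimes": {
        "docker-in-gvisor": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": ["--net-raw=true", "--host-uds=all"],
        }
    }
}
print(missing_flags(sample_config))
```

On a real host, replace `sample_config` with the result of `json.load` on the daemon configuration file.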

Building Camofox Docker image in Docker-in-gVisor

Setting up Camofox Browser.

Camofox is a Firefox-based web browser for agentic browsing. Let’s run it in its own sandboxed container.

The Camofox image also bundles Xvfb to simulate an X11 display server, and yt-dlp for YouTube video extraction, all of which work in gVisor. Let’s build it.

The Camofox project doesn’t provide pre-built Docker images, so we need to build it ourselves. But wait! Camofox may or may not be a fishy project. What if it contains malicious code?

Have no fear, gVisor is here! We can simply build the image inside gVisor. Let’s spin up an ephemeral Docker-in-gVisor container, run the Camofox Docker image build process within, extract the image out, and import it into the host dockerd’s local image repository.

It's containers all the way down.

# Start Docker-in-gVisor with large-enough /var/lib/docker tmpfs
$ mkdir -p /tmp/docker-tmp && docker run --detach \
    --name=docker-tmp \
    --runtime=docker-in-gvisor \
    --restart=always \
    --cap-add=all \
    --mount="type=bind,src=/tmp/docker-tmp,dst=/tmp/docker-tmp" \
    -e DOCKER_TMPFS_SIZE=8G \
    us-central1-docker.pkg.dev/gvisor-presubmit/gvisor-presubmit-images/basic/docker_x86_64

# Build image within the in-gVisor Docker.
# The `make` command will run `docker build` in-sandbox.
$ docker exec docker-tmp sh -c 'true && \
    apt update -y && \
    apt install -y git build-essential && \
    git clone https://github.com/jo-inc/camofox-browser.git && \
    cd camofox-browser && \
    make'

# Extract the image out of the container and import as host Docker image.
# The `docker save` command dumps the image to stdout, which gets piped
# to the out-of-sandbox `docker load` command.
$ docker exec docker-tmp docker save camofox-browser | docker load
Loaded image: camofox-browser:135.0.1-x86_64

# You now have the image on the host Docker:
$ docker images | grep camofox
camofox-browser:135.0.1-x86_64      80c072259479      4.6GB      2.27GB

# Clean up.
$ docker rm -f docker-tmp

Now that we have our Camofox image, let’s run it:

$ docker run --detach \
    --name=camofox \
    --runtime=runsc \
    --restart=always \
    camofox-browser:135.0.1-x86_64

# Camofox binds on port 3000 by default; we don't need to expose it
# to the host though, as we will use inter-container networking.
# Nonetheless, let's make sure it works:
$ docker exec -e DEBIAN_FRONTEND=noninteractive camofox sh -c 'true && \
    apt update -y >/dev/null && \
    apt install -y curl jq >/dev/null && \
    tabId="$(curl -q -X POST http://127.0.0.1:3000/tabs -H "Content-Type: application/json" -d "{\"userId\": \"me\", \"sessionKey\": \"task\", \"url\": \"https://gvisor.dev\"}" | jq -r .tabId)" && \
    curl -q --output - "http://127.0.0.1:3000/tabs/${tabId}/screenshot?userId=me"
  ' > /tmp/screenshot.png
$ file /tmp/screenshot.png
/tmp/screenshot.png: PNG image data, 1280 x 720, 8-bit/color RGBA, non-interlaced

Running self-hosted Firecrawl in gVisor

Setting up the Firecrawl stack.

We will use the Firecrawl docker-compose.yaml template, simply modified to run all containers in gVisor. Because the way docker-compose sets up DNS is incompatible with gVisor’s per-container network stack, we need to use pre-assigned IPs rather than container hostnames in the docker-compose file.

$ export CASPER="$HOME/agents/casper-3"; git clone https://github.com/firecrawl/firecrawl.git "$HOME/agents/casper-3/firecrawl"
$ cat <<EOF > "$CASPER/firecrawl/.env"
PORT=3002
HOST=0.0.0.0
OLLAMA_BASE_URL=http://172.17.0.1:11434/api
MODEL_NAME=qwen3.5:27b-q4_K_M
MODEL_EMBEDDING_NAME=nomic-embed-text:137m-v1.5-fp16
BULL_AUTH_KEY=CHANGEME
EOF
$ git -C "$CASPER/firecrawl" apply <<EOF
diff --git a/docker-compose.yaml b/docker-compose.yaml
index 46829cafb..819f9cc87 100644
--- a/docker-compose.yaml
+++ b/docker-compose.yaml
@@ -10,8 +10,6 @@ x-common-service: &common-service
     nofile:
       soft: 65535
       hard: 65535
-  networks:
-    - backend
   extra_hosts:
     - "host.docker.internal:host-gateway"
   logging:
@@ -22,13 +20,13 @@ x-common-service: &common-service
       compress: "true"

 x-common-env: &common-env
-  REDIS_URL: \${REDIS_URL:-redis://redis:6379}
-  REDIS_RATE_LIMIT_URL: \${REDIS_URL:-redis://redis:6379}
-  PLAYWRIGHT_MICROSERVICE_URL: \${PLAYWRIGHT_MICROSERVICE_URL:-http://playwright-service:3000/scrape}
+  REDIS_URL: \${REDIS_URL:-redis://172.16.0.30:6379}
+  REDIS_RATE_LIMIT_URL: \${REDIS_URL:-redis://172.16.0.30:6379}
+  PLAYWRIGHT_MICROSERVICE_URL: \${PLAYWRIGHT_MICROSERVICE_URL:-http://172.16.0.20:3000/scrape}
   POSTGRES_USER: \${POSTGRES_USER:-postgres}
   POSTGRES_PASSWORD: "\${POSTGRES_PASSWORD:-postgres}"
   POSTGRES_DB: \${POSTGRES_DB:-postgres}
-  POSTGRES_HOST: \${POSTGRES_HOST:-nuq-postgres}
+  POSTGRES_HOST: \${POSTGRES_HOST:-172.16.0.50}
   POSTGRES_PORT: \${POSTGRES_PORT:-5432}
   USE_DB_AUTHENTICATION: \${USE_DB_AUTHENTICATION:-false}
   NUM_WORKERS_PER_QUEUE: \${NUM_WORKERS_PER_QUEUE:-8}
@@ -58,6 +56,10 @@ x-common-env: &common-env

 services:
   playwright-service:
+    runtime: "runsc"
+    networks:
+      backend:
+        ipv4_address: 172.16.0.20
     # NOTE: If you don't want to build the service locally,
     # comment out the build: statement and uncomment the image: statement
     # image: ghcr.io/firecrawl/playwright-service:latest
@@ -71,8 +73,6 @@ services:
       BLOCK_MEDIA: \${BLOCK_MEDIA}
       # Configure maximum concurrent pages for Playwright browser instances
       MAX_CONCURRENT_PAGES: \${CRAWL_CONCURRENT_REQUESTS:-10}
-    networks:
-      - backend
     # Resource limits for Docker Compose (not Swarm)
     cpus: 2.0
     mem_limit: 4G
@@ -88,13 +88,17 @@ services:

   api:
     <<: *common-service
+    runtime: "runsc"
+    networks:
+      backend:
+        ipv4_address: 172.16.0.10
     environment:
       <<: *common-env
       HOST: "0.0.0.0"
       PORT: \${INTERNAL_PORT:-3002}
       EXTRACT_WORKER_PORT: \${EXTRACT_WORKER_PORT:-3004}
       WORKER_PORT: \${WORKER_PORT:-3005}
-      NUQ_RABBITMQ_URL: amqp://rabbitmq:5672
+      NUQ_RABBITMQ_URL: amqp://172.16.0.40:5672
       ENV: local
     depends_on:
       redis:
@@ -113,6 +117,7 @@ services:
     memswap_limit: 8G

   redis:
+    runtime: "runsc"
     # NOTE: If you want to use Valkey (open source) instead of Redis (source available),
     # uncomment the Valkey statement and comment out the Redis statement.
     # Using Valkey with Firecrawl is untested and not guaranteed to work. Use with caution.
@@ -120,7 +125,8 @@ services:
     # image: valkey/valkey:alpine

     networks:
-      - backend
+      backend:
+        ipv4_address: 172.16.0.30
     command: redis-server --bind 0.0.0.0
     logging:
       driver: "json-file"
@@ -130,9 +136,11 @@ services:
         compress: "true"

   rabbitmq:
+    runtime: "runsc"
     image: rabbitmq:3-management
     networks:
-      - backend
+      backend:
+        ipv4_address: 172.16.0.40
     command: rabbitmq-server
     healthcheck:
       test: ["CMD", "rabbitmq-diagnostics", "-q", "check_running"]
@@ -148,6 +156,7 @@ services:
         compress: "true"

   nuq-postgres:
+    runtime: "runsc"
     # NOTE: If you don't want to build the image locally,
     # comment out the build: statement and uncomment the image: statement
     # image: ghcr.io/firecrawl/nuq-postgres:latest
@@ -157,7 +166,8 @@ services:
       POSTGRES_PASSWORD: \${POSTGRES_PASSWORD:-postgres}
       POSTGRES_DB: \${POSTGRES_DB:-postgres}
     networks:
-      - backend
+      backend:
+        ipv4_address: 172.16.0.50
     logging:
       driver: "json-file"
       options:
@@ -168,3 +178,8 @@ services:
 networks:
   backend:
     driver: bridge
+    ipam:
+      config:
+        - gateway: 172.16.0.1
+          subnet: 172.16.0.0/16
+      driver: default
EOF

# Run.
$ ( cd "$CASPER/firecrawl"; docker compose build && docker compose up )

# Make sure it works:
$ curl -X POST http://localhost:3002/v1/crawl \
    -H 'Content-Type: application/json' \
    -d '{
      "url": "https://firecrawl.dev"
    }'
{"success":true,"id":"019d7a78-e77a-70af-9f49-8e03421dad32","url":"http://localhost:3002/v1/crawl/019d7a78-e77a-70af-9f49-8e03421dad32"}

This brings up all the following applications in separate gVisor containers on their own inter-container network:

Redis for key/value storage.
RabbitMQ for message queuing.
Playwright for browser automation.
PostgreSQL for long-term storage.
Firecrawl as main API endpoint for Hermes Agent to interact with.
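Because the patch above hard-codes addresses, a typo in any of them silently breaks service discovery. Here is a small sanity check, sketched with Python's standard `ipaddress` module, that the pre-assigned IPs are unique, inside the `backend` subnet, and distinct from the gateway. The addresses are copied from the patch; the function name is made up for illustration.

```python
import ipaddress

# Values copied from the docker-compose.yaml patch above.
SUBNET = ipaddress.ip_network("172.16.0.0/16")
GATEWAY = ipaddress.ip_address("172.16.0.1")
SERVICE_IPS = {
    "api": "172.16.0.10",
    "playwright-service": "172.16.0.20",
    "redis": "172.16.0.30",
    "rabbitmq": "172.16.0.40",
    "nuq-postgres": "172.16.0.50",
}


def check_static_ips(service_ips, subnet, gateway):
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    seen = {}
    for name, raw in service_ips.items():
        addr = ipaddress.ip_address(raw)
        if addr not in subnet:
            problems.append(f"{name}: {addr} is outside {subnet}")
        if addr == gateway:
            problems.append(f"{name}: {addr} collides with the gateway")
        if addr in seen:
            problems.append(f"{name}: {addr} is already used by {seen[addr]}")
        seen.setdefault(addr, name)
    return problems


problems = check_static_ips(SERVICE_IPS, SUBNET, GATEWAY)
print(problems or "static IP assignments look consistent")
```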

Putting it all together

Setting up Hermes Agent and connecting it.

Let’s put the pieces together for the Hermes Agent container.

$ export CASPER="$HOME/agents/casper-3"; mkdir -p "$CASPER"

# Register Matrix user.
$ docker exec -it synapse register_new_matrix_user \
    -c /data/homeserver.yaml \
    --user casper --password naoko --no-admin

# Hermes requires a non-root user for its home directory.
$ groupadd --gid=10337 hermes && \
    useradd --home-dir=/dev/null --no-create-home --shell="$(which nologin)" \
      --uid=10337 --gid=10337 hermes

# Build Docker image with extra packages.
$ cat <<EOF > "$CASPER/Dockerfile"
FROM nousresearch/hermes-agent:v2026.4.13

# Install basic packages.
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y && \
    apt install -y sudo wget curl git build-essential python3-pip

# Install dependencies for Hermes Agent's Matrix.org support.
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y && \
    apt install -y libolm-dev && \
    python3 -m pip config set global.break-system-packages true && \
    pip install 'matrix-nio' 'mautrix[encryption]'

# Install espeak-ng and NeuTTS model for local text-to-speech capabilities.
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y && \
    apt install -y espeak-ng && \
    pip install 'neutts[all]'

# Install Docker; not required for dockerd since that's running in a separate
# container, but Hermes Agent still needs the Docker **client** CLI.
RUN export DEBIAN_FRONTEND=noninteractive; apt update -y && \
    apt install -y docker.io
EOF

$ docker build -t hermes-agent:casper-3 "$CASPER"

As Hermes Agent does not easily support non-interactive configuration, we need to configure it manually. Let’s run it for interactive configuration purposes:

$ export CASPER="$HOME/agents/casper-3"; \
    mkdir "$CASPER/home" && chown hermes:hermes "$CASPER/home"
$ docker run -it \
    --name=casper \
    --runtime=runsc \
    --restart=always \
    --shm-size=1g \
    --link=synapse:synapse \
    --link=ollama:ollama \
    --link=camofox:camofox \
    --mount="type=bind,src=$CASPER/home,dst=/opt/data" \
    --mount="type=bind,src=$CASPER/docker-run,dst=/docker-run" \
    -e HERMES_UID="$(id -u hermes)" \
    -e HERMES_GID="$(id -g hermes)" \
    -e DOCKER_HOST="unix:///docker-run/docker.sock" \
    hermes-agent:casper-3 setup

Going through Hermes Agent's interactive setup process in gVisor.

Interactive setup instructions

Expand this section for a text version of the screen recording above.

Choose Full setup
Inference Provider: More providers → Custom endpoint
API base URL: http://ollama:11434/v1
API key: (leave empty)
Select model: qwen3.5:27b-q4_K_M
Context length in tokens: 262144 (per the Qwen3.5-27B model card)
Select TTS provider: NeuTTS (local on-device)
Terminal Backend: Docker
Docker image: (leave default)
Container Resource Settings: Up to you
Max iterations / Tool progress mode / […] / Inactivity timeout: Up to you
Select platforms: Matrix
Homeserver URL: http://synapse:8008
Access token: (leave empty)
User ID: @casper:magi
Password: naoko
Enable end-to-end encryption (E2EE): Up to you
Allowed user IDs: @gendo:magi
Home room ID: (leave empty)
Install gateway as systemd service: No, as this isn’t relevant for a containerized install.
Tools: Feel free to configure.
Browser provider: Camofox
Camofox server URL: http://camofox:3000
Image generation FAL API key: (leave empty unless you have one)
TTS provider: Skip
Search provider: Self-hosted Firecrawl
Firecrawl instance URL: http://172.17.0.1:3002

You can verify that Hermes Agent’s “terminal” backend is the Docker-in-gVisor container by running htop in the hermes-exec container.

$ docker exec -it hermes-exec sh -c 'apt update -y && apt install -y htop'

# Watch this command while asking Hermes Agent to run `curl https://gvisor.dev`:
$ docker exec -it hermes-exec htop

To make Hermes Agent actually join the Matrix room, you need to restart the container in gateway mode.

$ docker rm -f casper; docker run --detach \
    --name=casper \
    --runtime=runsc \
    --restart=always \
    --shm-size=1g \
    --link=synapse:synapse \
    --link=ollama:ollama \
    --link=camofox:camofox \
    --mount="type=bind,src=$CASPER/home,dst=/opt/data" \
    --mount="type=bind,src=$CASPER/docker-run,dst=/docker-run" \
    -e DOCKER_HOST="unix:///docker-run/docker.sock" \
    hermes-agent:casper-3 gateway

Now invite the bot to your Matrix room and send /sethome on the main channel.

You now have Hermes Agent running in gVisor. To recap, this setup consists of:

Hermes Agent running in its own gVisor container
dockerd running in a separate gVisor container, for subcommand execution
Camofox Browser running with a virtual display (Xvfb) for browser use, in its own gVisor container
Self-hosted Firecrawl for agentic search, in its own set of gVisor containers.
NeuTTS for text-to-speech capabilities in Hermes Agent, evaluated within gVisor.
Ollama for inference and Matrix.org for communication, same as the other agents.

Putting these agents in a room

You can now ask your 3 agents to do your bidding and get various perspectives.

The three agents fetching the gVisor homepage and verifying that they are running in gVisor.
Note: Hermes Agent cannot call dmesg, due to the default system call filter applied to the Docker container that its code execution tool runs in.
However, the 4.4.0 kernel version is characteristic of gVisor.
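
When dmesg is unavailable, the kernel release string is the next best hint. The following sketch compares it against the 4.4.0 value mentioned above. Treat it as a heuristic rather than proof: it only looks at one string, and the function name is made up for illustration.

```python
import platform

# Kernel release reported inside the sandboxes in this post.
GVISOR_KERNEL_RELEASE = "4.4.0"


def looks_like_gvisor(release):
    """Heuristic: True when the release string is exactly the gVisor one."""
    return release.strip() == GVISOR_KERNEL_RELEASE


# platform.release() returns the same string as `uname -r`.
print(looks_like_gvisor(platform.release()))
```

Distribution kernels usually append a build suffix to the version, which is why the comparison is exact instead of a prefix match.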

Sandboxing agents: What actually makes sense?

The setup described in this blog post is a contrived example of agent sandboxing, where every part of the stack is mutually sandboxed from one another. In closer-to-real-world settings, not all of these components are untrusted, some of them will run remotely, others may be delegated to off-machine APIs, etc. So what would a more practical setup look like?

At a high level, an autonomous agent stack looks like this:

A core daemon (written in good old regular code, e.g. TypeScript for OpenClaw), typically listening on a TCP port. This daemon is responsible for:
- Receiving user requests via a communications plugin (e.g. Signal, Mattermost…)
- Running inference API calls
- Dispatching tool calls
- Running the control loop necessary to make forward progress on long-term tasks, using inference and tool calls
- Running cron-like tasks and heartbeats to keep the agent autonomous
A pretty web interface (sometimes part of the core daemon, sometimes separate)
A plugin ecosystem, adding new tools, communication channels, etc. to the agent
A database of skills and general knowledge (memory) that the agent can evolve over time as it learns from its mistakes, or learns more about its raison d’être and the user it is dealing with.
A policy engine that can decide on the security policies needed for any action the agent would like to take (tool call, API call, credential access, etc.).

When you send a message to such an agent, it ends up running a control loop to handle your query. This control loop will initially run inference, then very likely follow this up with a sequence of tool calls and further inference requests, until a satisfying conclusion is reached. These tool calls can include:

Data lookups on the web
API requests to external services, often requiring sensitive credentials to “act as” the user
Browser use, sometimes with similar credential needs
Code snippet executions
Memory reads and writes, database-like
Introspection requests, where the agent can modify its own configuration or skill database, sometimes fixing its own setup/configuration issues rather than requiring a human to get it unstuck.
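
To make the shape of that control loop concrete, here is a deliberately tiny sketch. It is not how OpenClaw, PicoClaw, or Hermes Agent are implemented; `ModelReply`, `run_control_loop`, the scripted stand-in for inference, and the `web_lookup` tool are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ModelReply:
    """One inference result: either a final answer or a tool request."""
    text: str
    tool_name: Optional[str] = None
    tool_input: str = ""


def run_control_loop(
    user_message: str,
    infer: Callable[[List[str]], ModelReply],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> str:
    """Alternate inference and tool calls until the model stops asking for tools."""
    transcript = [f"user: {user_message}"]
    for _ in range(max_steps):
        reply = infer(transcript)
        if reply.tool_name is None:
            return reply.text  # The model reached a conclusion.
        tool = tools.get(reply.tool_name)
        if tool is None:
            result = f"error: unknown tool {reply.tool_name!r}"
        else:
            result = tool(reply.tool_input)  # This is the call worth sandboxing.
        transcript.append(f"tool {reply.tool_name}: {result}")
    return "stopped: step budget exhausted"


def scripted_infer(transcript: List[str]) -> ModelReply:
    """Stand-in for a real inference call: one tool request, then an answer."""
    if not any(line.startswith("tool ") for line in transcript):
        return ModelReply(text="", tool_name="web_lookup", tool_input="gvisor.dev")
    return ModelReply(text=f"done after seeing: {transcript[-1]}")


tools = {"web_lookup": lambda query: f"fetched {query}"}
answer = run_control_loop("What is gVisor?", scripted_infer, tools)
print(answer)
```

The step budget is there because a loop driven by model output has no natural upper bound; real agents tend to pair it with timeouts.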

Where does sandboxing fit in?

Sandboxing individual tools: Most tool calls don’t do anything fancy. They just make web requests and are not expected to have side-effects. They have no business reading local files or modifying the agent’s own configuration. Sandboxing these tools allows for defense-in-depth.
- Concrete example: One can craft malicious .mov videos which can refer to arbitrary file paths on the host. What if your agent gets tricked into converting a video that tries to embed a subtitle file pointing to /etc/shadow? Sandbox your tool calls and avoid this problem.
Sandboxing subsystems: Some agent functionality may depend on long-running daemons which themselves don’t need system-wide access. This can be important for network-exposed or network-accessing subsystems.
- Concrete example: If using Signal as communications layer, the signal-cli daemon can run in a sandbox for defense-in-depth.
- Similarly, in the examples above, we sandbox dockerd and Camofox browser in separate containers.
Sandboxing the core daemon: The agent’s ability to change its own environment to debug or update itself is a very powerful feature. To do so, the agent effectively requires root control over its own core code and configuration. Therefore, sandboxing the agent’s entire core daemon makes sense: the agent can leverage its own intelligence to make itself better, while still being confined to a box. That box is useful because:
- Destructive changes can be rolled back.
- The agent’s policy engine can live outside the core sandbox. This prevents the agent from changing the policy engine’s policies maliciously.
- Relatedly, sensitive credentials can live outside the core sandbox. This ensures that all credential use is mediated through components the agent can’t modify. This includes API keys, crypto wallet keys for agentic commerce, and user-authenticated browser sessions.
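
One way to picture credentials living outside the core sandbox is an egress step that swaps placeholders for real secrets just before a request leaves the machine. The sketch below illustrates the idea only. The `secret://` placeholder scheme and the `inject_credentials` helper are invented for this example and do not describe any particular agent's implementation.

```python
# Hypothetical placeholder scheme: the agent only ever sees "secret://<name>".
PLACEHOLDER_PREFIX = "secret://"


def inject_credentials(headers, secrets):
    """Return a copy of headers with secret:// placeholders resolved.

    This is meant to run outside the agent's sandbox, so the real
    values never appear in anything the agent can read or modify.
    """
    resolved = {}
    for name, value in headers.items():
        parts = []
        for token in value.split(" "):
            if token.startswith(PLACEHOLDER_PREFIX):
                key = token[len(PLACEHOLDER_PREFIX):]
                if key not in secrets:
                    raise KeyError(f"no secret named {key!r}")
                token = secrets[key]
            parts.append(token)
        resolved[name] = " ".join(parts)
    return resolved


# What the sandboxed agent produces:
agent_headers = {
    "Authorization": "Bearer secret://github",
    "Accept": "application/json",
}
# What actually leaves the machine:
outgoing = inject_credentials(agent_headers, {"github": "real-token"})
print(outgoing["Authorization"])
```

A request naming an unknown secret fails loudly instead of being sent with the placeholder intact, which keeps typos from leaking secret names to external services.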

Other parts of the stack typically run fully-trusted code with little to no need for sandboxing. For example, the memory subsystem may be a local vector lookup or similar database, with no internet connectivity and no need to run arbitrary code. Thus, in line with the gVisor production guide, it does not need to be sandboxed.

We see some of these ideas being implemented across the ecosystem:

OpenClaw supports agent-level containerization via Docker and Podman.
NemoClaw uses OpenShell to ensure tool calls have initially-restricted access which can then be widened as needed by the tool.
Hermes Agent implements checkpoints and rollbacks to protect against destructive operations.
IronClaw segregates API keys out of the agent’s core sandbox and injects them at egress time.

Security practices for these tools are rapidly evolving, and gVisor has a role to play.

Should I use gVisor to sandbox my agent?

gVisor dramatically reduces the attack surface for sandbox escapes. It does so by reimplementing a large portion of Linux in userspace, preventing the sandboxed application from attacking the host kernel. Read more about gVisor’s security architecture.

For autonomous agents, you don’t just need a strong sandbox, you also need strong policies around when and what to sandbox. As a sandboxing technology, gVisor does not help you with these decisions. gVisor only enhances the level of security of the sandboxing capabilities that the agent already has. Thus, gVisor is necessary, but not sufficient.

gVisor’s capabilities are also uniquely well-suited to agentic workloads:

Sandboxes start and stop in milliseconds, critical to keeping these systems responsive and minimizing time between inference calls.
Thanks to its process-like model (not a virtual machine), gVisor can achieve superior density, i.e. more sandboxes running concurrently on the same host.
gVisor supports checkpoint/restore, making slow-to-initialize repetitive actions quick to replay, and checkpoints/rollbacks can be done seamlessly without sandboxed-workload-specific support.

One current drawback of gVisor is that it is relatively difficult to integrate into existing applications that have such sandboxing needs. For example, this is one reason why the above demo does not sandbox Hermes Agent tool calls in separate gVisor instances. This is being worked on. Watch this space!

*cogitation intensifies*

Safe Ride into the Dangerzone: Reducing attack surface with gVisor

2024-09-23T00:00:00-05:00

This article was written in collaboration with the Freedom of the Press Foundation and cross-posted on the Dangerzone blog.

One of the oft-repeated sound bites of computer security advice is: “Don’t open random attachments from strangers.” If you are a journalist, however, opening attachments and documents is part of your job description. Since journalists already have a lot of security threats to worry about in dealing with sources, the safe opening of documents should not be one of them. Dangerzone was developed to solve this problem. It lets you open suspicious documents with confidence and gets out of your way.

For the past few months, members of the Dangerzone team and the gVisor project collaborated on significantly improving the security properties of Dangerzone. We’re excited to announce that as of version 0.7.0, Dangerzone uses gVisor to secure its document conversion process. gVisor is already trusted by Google and others to secure cloud products, scan Gmail attachments for viruses, etc.

If you’re an existing Dangerzone user on 0.7.0 scratching your head and thinking “Well, I haven’t noticed anything different,” then first of all, “yay!” That was the plan. And second, because the plan worked so deviously well, this change has probably flown under the radar, so here are more than 3,000 words to amend this.

The rest of the article dives deep into Dangerzone’s security, describes how gVisor works as a technology, and explains how Dangerzone’s security profile has changed after this integration. Expect some technical terms and nerdery.

How Dangerzone works

Dangerzone’s purpose is to sanitize documents of any elements that can compromise your computer or the source’s identity (think malware and document metadata). To do this, it first renders the document into visual data (pixels) and then turns this visual representation back into a readable document file. The first part of this process (rendering the document into pixel data) is the most security-critical part and, for the purpose of this article, we will zoom in on just this.

💡 For a broader understanding of how Dangerzone works, we encourage you to read the “About Dangerzone” section on the Dangerzone website. Props to the Qubes OS team, who first popularized the concept that is now their TrustedPDF feature.

In order to support a wide variety of document formats (PDF, office documents, image formats, etc.), Dangerzone needs to open them with software that potentially has security bugs. That may result in compromise of the user’s device, personal files, and communication. This is the same risk you face when you use your computer to open attachments from unknown sources. Dangerzone needs to somehow isolate this process from the rest of your computer, so that anything it does cannot “get out of the box”.

Dangerzone’s isolation relies on Linux containers. Containers are very handy for two things: ensuring that software works the same way across operating systems, and separating that software from the rest of the machine.

Outline of how Dangerzone uses containers to render a document into pixels.

Dangerzone benefits from both of these aspects: Development and testing are made easy by using containers’ cross-platform compatibility; and containers’ security, especially how Dangerzone configured them, offers strong isolation guarantees. The security audit Dangerzone passed recently is a testament to this.

In computer security, the gold standard of isolation is virtual machines. VMs are what they sound like: a computer running within a computer. When running a virtual machine, the “host” (outer) machine is protected from the actions of the “guest” (inner) virtual machine. This is why the TrustedPDF feature of Qubes OS uses disposable VMs as its isolation mechanism. Dangerzone also tried to use VMs in the past, but implementing them in a multiplatform way proved high-maintenance. Thus, Dangerzone switched back to containers, but the team always wanted to improve Dangerzone’s security properties.

💡 How does Dangerzone use Linux containers on Windows and macOS? It requires Docker Desktop, which runs Linux inside a virtual machine and then runs Linux containers in it.

Dangerzone’s attack surface

To understand how to protect Dangerzone users from exploits, it’s useful to think like an attacker. When Dangerzone processes a malicious document within a container, the first point of the attack is the application that opens the document. Dangerzone is designed with the assumption that determined attackers will find a vulnerability in such applications and take control of them (check out this security advisory from the Dangerzone team about a recent, critical LibreOffice vulnerability). From there on, the next point of attack is to circumvent the Linux kernel protections for the container or directly compromise the Linux kernel.

The Linux kernel, even in Docker Desktop VMs, is a very privileged component. It has access to sensitive data, such as other files on the user’s machine or the user’s browser history, and to the computer’s network.

Processes in containers interface with the Linux kernel through system calls and virtual filesystems. Attackers can try to take advantage of security bugs in the above interfaces. So it is critical to limit the container’s access to the Linux kernel. We call this the container’s attack surface. The smaller it is, the more secure a system is.

Dangerzone tries to reduce its attack surface by multiple mechanisms available to Linux containers:

Removal of process capabilities. This reduces the set of permissions the container has in the kernel.
Removal of network access. This prevents the container from accessing the internet to exfiltrate document data.
Filtering of allowed system calls through seccomp. This reduces the set of system calls (i.e., types of actions) that the container is allowed to make to the kernel.
Minimal user ID mapping. This reduces the risk that the container may access files belonging to users other than the Dangerzone user on the same computer.

💡 Check out the above protection measures in Dangerzone’s codebase.

Container protections employed by Dangerzone prior to 0.7.0.

This provides the container with a fair degree of isolation from the Linux kernel. However, some attack surface remains, since:

The computer’s user is still mapped in the container. This means that a container escape would allow the attacker to access the user’s personal files (browser data, documents, etc.); it would be more isolated if that were not the case.
The system call filter is still relatively permissive. The specific system calls that are blocked are dependent on the container manager and version in use (see Docker’s filters, for example), but in general, the system call filter only blocks obscure or system-admin-only system calls (e.g., rebooting, modifying systemwide settings). It does not block containers from opening arbitrary files or interacting with the network stack, which can still be vectors for security bugs.
The container’s root filesystem, while ephemeral, is still writable. This allows attackers to exploit potential vulnerabilities in Linux’s filesystem stack.
The Linux kernel is still exposed to the container. While it is possible to reduce the attack surface available to the container to a minimum, this architecture still requires that the container have direct access to Linux via system calls. So if a Linux security bug can be triggered within the set of filtered system calls, an attack may still be successful.

Dangerzone's attack surface prior to 0.7.0, illustrated.

We’ve wanted to mitigate these risks for a while now, but we had to do so in a cross-platform way and without burdening the user with administrative tasks.

Enter gVisor.

What is gVisor?

gVisor is a container security solution. In short, it makes it much harder for malicious code to break out of the container boundary. This was a great fit for Dangerzone’s security needs.

An open source project written in Go, gVisor was released in May 2018 by Google under the Apache 2.0 license. It runs on Linux and integrates with all popular container management software, such as Docker, Podman, or Kubernetes. At its core, gVisor is an application kernel that implements a substantial portion of the Linux system call interface. This means gVisor sits between a container and the Linux kernel and plays both roles: from the container’s perspective, gVisor acts as a kernel, but from Linux’s perspective, gVisor is just a regular application. That means the container can no longer directly interface with the Linux kernel. This is a massive reduction in attack surface.

If you’re new to gVisor, the concept of not interfacing with the Linux kernel at all may seem either quite vague or overly restrictive. That’s normal, so let’s toy with this concept a bit for fun and illustrative purposes. Here’s a perfectly normal sentence:

“A process opens a document on the filesystem”

And here’s how gVisor warps every single word in that sentence:

“on the filesystem”: Nope, no such thing. The gVisor container runs in an empty filesystem.
“opens a document”: Nuh-uh, the gVisor container does not even have the permission to perform the open system call. Also, there are no files to open in the first place.
“A process”: Amusingly, the gVisor container does not even have the ability to perform the exec system calls. From the Linux kernel’s perspective, the gVisor “process” looks like a typical multithreaded program, even while many independent processes are running within the gVisor sandbox.

And yet, gVisor can containerize most applications without issue. For example, the Dangerzone container image was not altered at all for the gVisor integration.

So what’s going on here?

gVisor manages to pull the above trick with the help of two components:

Sentry is the component that runs the containerized application. It intercepts every system call that the application makes and reimplements it in Go. As part of this, it may decide to do one or more system calls to the host Linux kernel. However, it’s heavily restricted with a strict seccomp filter (that’s why system calls like open, socket, or exec are not allowed).
Gofer is a component that runs outside the container and is responsible for filesystem operations. The sentry may make I/O requests to the gofer. The gofer will independently validate them, then perform these I/O operations on the container’s behalf (that’s how the container can read files from the host filesystem, even though open is not allowed from the sentry).
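The division of labor between the two components can be pictured with a toy model. The sketch below is purely illustrative: `ToySentry`, `ToyGofer`, and the dictionary standing in for the host filesystem are hypothetical names invented for this example, and the real components communicate and validate requests very differently. It only shows the shape of the idea: the sentry never touches host files itself, and the gofer independently checks every request before serving it.

```python
import posixpath


class ToyGofer:
    """Stands in for the gofer: the only party allowed to touch 'host' files."""

    def __init__(self, root, host_files):
        self.root = root              # directory the container is allowed to see
        self.host_files = host_files  # dict of path -> bytes, a fake host filesystem

    def read(self, container_path):
        # Independently validate the request: resolve ".." and refuse escapes.
        joined = posixpath.join(self.root, container_path.lstrip("/"))
        resolved = posixpath.normpath(joined)
        if resolved != self.root and not resolved.startswith(self.root + "/"):
            raise PermissionError(f"path escapes container root: {container_path}")
        return self.host_files[resolved]


class ToySentry:
    """Stands in for the Sentry: handles application 'syscalls' itself."""

    def __init__(self, gofer):
        self.gofer = gofer

    def sys_read_file(self, path):
        # The toy Sentry cannot open host files; it delegates I/O to the gofer.
        return self.gofer.read(path)


# Hypothetical host state for the example.
HOST_FILES = {"/srv/box/doc.pdf": b"%PDF", "/etc/shadow": b"secret"}
sentry = ToySentry(ToyGofer("/srv/box", HOST_FILES))
```

Calling `sentry.sys_read_file("/doc.pdf")` returns the document bytes, while a request such as `"/../../etc/shadow"` is rejected by the toy gofer even if the toy sentry forwards it.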

The above components are managed by a container runtime called runsc, which exposes the same interface as other container runtimes. This means it can be integrated in other container management software like Podman, Docker, or Kubernetes.

gVisor intercepting system calls from a sandboxed application

With the above architecture, gVisor blue-pills the application into thinking that it interacts with a regular Linux kernel. In practice, gVisor reimplements most basic features that Linux provides (memory management, scheduling, system call interface, I/O, networking), and only issues system calls to the Linux kernel when truly necessary, such as when it needs information from it (e.g., reading the document to be converted by Dangerzone).

The gVisor kernel is designed to be difficult to break out of. gVisor is written in Go. Many of Linux’s security woes stem from its use of C, which is a memory-unsafe language. By contrast, gVisor is a regular Go application and inherits Go’s memory safety features. This eliminates a large class of security vulnerabilities.

The gVisor kernel also has a much smaller code footprint, because unlike a traditional kernel like Linux, it does not have to deal with things like hardware devices, and only implements a subset of the Linux kernel interface that is sufficient for most applications to work in practice. Because of its smaller implementation, there are fewer moving parts, and thus fewer opportunities for bugs to exist.

Beyond its kernel indirection, gVisor also hardens itself through a bunch of security measures on startup, some of which are similar to regular containers:

Isolation: Running in its own set of namespaces (user namespace, process namespace, network namespace, etc.) to further isolate it from the host.
File access prevention: Running in its own root with exactly zero host files initially visible to it.
Privilege revocation: Dropping all capabilities it has to ensure it runs with the least privileges.
System call filtering: Setting a strict system call filter tuned for the gVisor Sentry specifically.
- As mentioned, unlike Docker or Podman’s default system call filter, this is a very restricted set of system calls. This filter blocks basic operations like opening files, creating network connections, or executing other processes. The presence of this filter does not prevent use of these system calls from within the gVisor sandbox; instead, the gVisor kernel intercepts and reimplements system calls internally without needing to make a “real” system call out to the Linux kernel.
The gofer also uses all of the above techniques to isolate itself as much as possible.

The gVisor kernel has been battle-tested by Google and other large companies like Ant and Cloudflare. For example, searching for the text “GKE Sandbox” (which uses gVisor) on the GKE security bulletin shows how often Linux kernel vulnerabilities occur that gVisor protects against. gVisor is also continuously fuzz-tested for bugs using Syzkaller, an automated kernel security testing tool.

What’s the catch here? Applications that perform lots of system calls and heavy I/O will see some degraded performance. Also, applications that rely on exotic features of the Linux kernel may not work. In practice, the majority of applications do not suffer from this issue.

Integrating gVisor with Dangerzone

So, gVisor looks like a strong candidate for Dangerzone, which is a relatively simple application that does not perform a heavy amount of system calls. Also, gVisor conveniently offers a container runtime that is a drop-in replacement for use with Docker/Podman. Therefore, integrating these two projects should be really simple, right?

Well, not so fast.

Dangerzone is a multiplatform application, and most of its users are on Windows and macOS. Integrating gVisor just for Linux would not cut it. At the same time, gVisor works strictly on Linux systems, so we are at an impasse.

In what is, in retrospect, a classic case of Maslow’s hammer, we decided to solve our container problems with yet another container. The idea is simple; why not containerize gVisor and make it run on Docker Desktop? After all, as we already pointed out, Docker Desktop runs Linux inside a virtual machine.

By doing so, Dangerzone now has two containers with different responsibilities:

The outer Docker/Podman container acts as the portability layer for Dangerzone. Its main responsibility is to bundle the necessary config files, scripts, and programs to run gVisor. It’s also responsible for bundling the container image that gVisor will spawn a container from.
The inner gVisor container acts as the isolation layer for Dangerzone. Its sole responsibility is to run the actual Dangerzone logic for rendering documents to pixels.

Outline of how gVisor integrates with Dangerzone. There are now two nested containers, and each one brings its own protections. Usage of LibreOffice is implied.

Running gVisor inside a container came with its own set of challenges:

Docker/Podman’s seccomp filter must allow the ptrace system call. We found that recent Docker Desktop versions and Podman version >= 4.0 have a seccomp filter that allows this system call. For older versions, we specified a custom seccomp filter that allowed it.
gVisor cannot run under SELinux in enforcing mode under default settings, so we labeled the container with container_engine_t (see GitHub issue #880).
The Docker/Podman container must run with the SYS_CHROOT capability. This is needed by gVisor to restrict its own access to the filesystem before it starts document processing. Other than that, the outer container drops all other capabilities and privileges.
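To make the three requirements above more concrete, here is a minimal sketch of how an outer-container command line could be assembled. It is not Dangerzone's actual invocation: the function name, the image name, and the profile path are hypothetical, and only widely documented `podman run` options are used (`--cap-drop`, `--cap-add`, `--network`, `--security-opt`).

```python
def outer_container_args(image, seccomp_profile=None):
    """Build an illustrative `podman run` argument list for the outer container."""
    args = [
        "podman", "run", "--rm",
        "--network", "none",        # the outer container gets no networking
        "--cap-drop", "all",        # drop every capability...
        "--cap-add", "SYS_CHROOT",  # ...except the one gVisor needs
        "--security-opt", "label=type:container_engine_t",  # SELinux label
    ]
    if seccomp_profile is not None:
        # Older Docker/Podman versions need a custom profile that allows ptrace.
        args += ["--security-opt", f"seccomp={seccomp_profile}"]
    args.append(image)
    return args


# Hypothetical image name and profile path, for illustration only.
modern = outer_container_args("example.invalid/outer-image")
legacy = outer_container_args("example.invalid/outer-image", "/tmp/allow-ptrace.json")
```

Keeping the seccomp profile optional mirrors the text: newer runtimes already allow ptrace, so the custom profile is only added when it is needed.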

💡 You can find more details about this integration in Dangerzone’s gVisor design doc.

Dangerzone protections

We talked about Dangerzone’s original attack surface, and how we integrated gVisor to reduce it. In practice though, in what ways is Dangerzone better off than before? Well, if the Matryoshka containers are giving you a headache, or you just skimmed to this section (no shade), here’s how the new Dangerzone protections fare against the previous version, and the default protections of Linux containers:

🛡️ Protections	Default	Dangerzone (0.6.1)	Dangerzone + gVisor (0.7.0)
🐧 Linux kernel	Exposed	👎 Exposed	🎉 Not exposed
🛠️ System call filter	Moderate	👎 Moderate	👍 Strict
🛠️ Capabilities	Default	👍 None	👍 None
👤 Host user	Mapped	👎 Mapped	👍 Unmapped
📁 Filesystem	Exposed	👎 Writable	👍 Read-only
🌐 Network	Exposed	👍 Disabled	✌️ Disabled at two levels
🔒 SELinux	Yes (`container_t`)	👍 Yes (`container_t`)	👍 Yes (`container_engine_t`)
🖥️ Hardware Virtualization	None	👎 None	👎 None

As you can see, the most important protection is that the document conversion process no longer has access to the Linux kernel. Instead, it only has access to the gVisor kernel (in the Sentry), and must break out of it before it can access the Linux kernel that it (prior to gVisor integration) had access to.

Additionally, Dangerzone itself configures the two containers to be more secure with:

Privilege revocation: Removing all privileges and capabilities of the document conversion process in the inner container, and minimizing the set of capabilities granted to the outer container to just SYS_CHROOT and no other.
File modification prevention: Making the inner container’s root filesystem read-only.
User isolation: Running the outer container in a user namespace that does not include the Dangerzone UI user (available in Linux distributions with Podman version 4.1 or greater).
Kernel security settings: Setting the outer container’s system call filter and SELinux label settings.
Host access prevention: Not using any mounts in either container.
Network access prevention: Disabling both containers’ ability to use networking.

Explanation of how Dangerzone's latest protections limit its attack surface.

Conclusion

Integrating the gVisor project with Dangerzone was very exciting: It’s a good example of how gVisor can add another line of defense to a project without requiring application-level changes.

At the same time, the design complexity of the Dangerzone project increased a bit, mostly to cater to its cross-platform nature, but honestly not that much. Dangerzone is strongly security-focused, so we believe it’s worth the cost.

We hope that this article demystifies some security aspects of containers, so that you can use Dangerzone and gVisor with even more confidence. Feel free to reach out to us with any questions or comments.

Optimizing seccomp usage in gVisor

2024-02-01T00:00:00-06:00

gVisor is a multi-layered security sandbox. seccomp-bpf is gVisor’s second layer of defense against container escape attacks. gVisor uses seccomp-bpf to have the host kernel filter gVisor’s own syscalls. This significantly reduces the attack surface to the host that a compromised gVisor process can access. However, this layer comes at a cost: every legitimate system call that gVisor makes must be evaluated against this filter by the host kernel before it is actually executed. This blog post contains more than you ever wanted to know about seccomp-bpf, and explores the past few months of work to optimize gVisor’s use of it.

A diagram showing gVisor’s two main layers of security: gVisor itself, and seccomp-bpf. This blog post touches on the seccomp-bpf part. Tux logo by Larry Ewing and The GIMP.

Understanding `seccomp-bpf` performance in gVisor

One challenge with gVisor performance improvement ideas is that it is often very difficult to estimate how much they will impact performance without first doing most of the work necessary to actually implement them. Profiling tools help with knowing where to look, but going from there to numbers is difficult.

seccomp-bpf is one area which is actually much more straightforward to estimate. Because it is a secondary layer of defense that lives outside of gVisor, and it is merely a filter, we can simply yank it out of gVisor and benchmark the performance we get. While running gVisor in this way is strictly less secure and not a mode that gVisor should support, the numbers we get in this manner do provide an upper bound on the maximum potential performance gains we could see from optimizations within gVisor’s use of seccomp-bpf.

To visualize this, we can run a benchmark with the following variants:

Unsandboxed: Unsandboxed performance without gVisor.
gVisor: gVisor from before any of the performance improvements described later in this post.
gVisor with empty filter: Same as gVisor, but with the seccomp-bpf filter replaced with one that unconditionally approves every system call.

From these three variants, we can break down the gVisor overhead that comes from gVisor itself vs the one that comes from seccomp-bpf filtering. The difference between gVisor and unsandboxed represents the total gVisor performance overhead, and the difference between gVisor and gVisor with empty filter represents the performance overhead of gVisor’s seccomp-bpf filtering rules.

Let’s run these numbers for the ABSL build benchmark:

We can now use these numbers to give a rough breakdown of where the overhead is coming from:

The seccomp-bpf overhead is small in absolute terms. The numbers suggest that optimizing seccomp-bpf filters can shave at most 3.4 seconds off the total ABSL build time, a reduction of total runtime of ~3.6%. However, relative to gVisor’s overhead over unsandboxed time, this means that optimizing the seccomp-bpf filters may remove up to ~15% of gVisor overhead, which is significant. (Not all benchmarks have this behavior; some benchmarks show smaller seccomp-bpf-related overhead. The overhead is also highly platform-dependent.)
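The two percentages come from dividing the same seccomp-bpf cost by two different baselines. The sketch below spells out that arithmetic. The function name is made up for this example and the timings are hypothetical round numbers chosen to make the math easy to follow; they are not the measured ABSL results.

```python
def overhead_breakdown(unsandboxed_s, gvisor_s, gvisor_empty_filter_s):
    """Split gVisor overhead into its total and its seccomp-bpf portion."""
    total_overhead = gvisor_s - unsandboxed_s
    seccomp_overhead = gvisor_s - gvisor_empty_filter_s
    return {
        "total_overhead_s": total_overhead,
        "seccomp_overhead_s": seccomp_overhead,
        # Share of the whole gVisor run spent on seccomp-bpf filtering.
        "seccomp_share_of_runtime": seccomp_overhead / gvisor_s,
        # Share of gVisor's overhead (over unsandboxed) due to seccomp-bpf.
        "seccomp_share_of_overhead": seccomp_overhead / total_overhead,
    }


# Hypothetical timings in seconds, NOT the benchmark's real numbers.
result = overhead_breakdown(
    unsandboxed_s=80.0, gvisor_s=100.0, gvisor_empty_filter_s=96.0
)
```

With these made-up inputs the filter costs 4% of total runtime but 20% of the overhead, which is the same effect the post describes: a cost that looks small against the full runtime is a much larger slice of what gVisor adds.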

Of course, this level of performance is what was reached with empty seccomp-bpf filtering rules, so we cannot hope to reach this level of performance gains. However, it is still useful as an upper bound. Let’s see how much of it we can recoup without compromising security.

A primer on BPF and `seccomp-bpf`

BPF, cBPF, eBPF, oh my!

BPF (Berkeley Packet Filter) is a virtual machine and eponymous machine language. Its name comes from its original purpose: filtering packets in a kernel network stack. However, its use has expanded to other domains of the kernel where programmability is desirable. Syscall filtering in the context of seccomp is one such area.

BPF itself comes in two dialects: “Classic BPF” (sometimes stylized as cBPF), and the now-more-well-known “Extended BPF” (commonly known as eBPF). eBPF is a superset of cBPF and is usable extensively throughout the kernel. However, seccomp is not one such area. While the topic has been heavily debated, the status quo remains that seccomp filters may only use cBPF, so this post will focus on cBPF alone.

So what is `seccomp-bpf` exactly?

seccomp-bpf is a part of the Linux kernel which allows a program to impose syscall filters on itself. A seccomp-bpf filter is a cBPF program that is given syscall data as input, and outputs an “action” (a 32-bit integer) to do as a result of this system call: allow it, reject it, crash the program, trap execution, etc. The kernel evaluates the cBPF program on every system call the application makes. The “input” of this cBPF program is the byte layout of the seccomp_data struct, which can be loaded into the registers of the cBPF virtual machine for analysis.

Here’s what the seccomp_data struct looks like in Linux’s include/uapi/linux/seccomp.h:

struct seccomp_data {
    int nr;                     // 32 bits
    __u32 arch;                 // 32 bits
    __u64 instruction_pointer;  // 64 bits
    __u64 args[6];              // 64 bits × 6
};                              // Total 512 bits
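The field offsets of this struct matter later, because cBPF programs address their input by byte offset. As a quick way to check the layout, the struct can be mirrored with Python's `ctypes`; the class name `SeccompData` is just a label for this sketch.

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Python mirror of `struct seccomp_data`, used only to inspect its layout."""
    _fields_ = [
        ("nr", ctypes.c_int),                      # syscall number
        ("arch", ctypes.c_uint32),                 # AUDIT_ARCH_* value
        ("instruction_pointer", ctypes.c_uint64),
        ("args", ctypes.c_uint64 * 6),             # six 64-bit syscall arguments
    ]


# Byte offset of each field, keyed by field name.
layout = {name: getattr(SeccompData, name).offset for name, _ in SeccompData._fields_}
total_bits = ctypes.sizeof(SeccompData) * 8
```

On common Linux platforms this yields offsets 0, 4, 8, and 16 for `nr`, `arch`, `instruction_pointer`, and `args`, for a total of 512 bits.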

Sample `seccomp-bpf` filter

Here is an example seccomp-bpf filter, adapted from the Linux kernel documentation¹:

00: load32 4                // Load 32 bits at offsetof(struct seccomp_data, arch) (= 4)
                            //   of the seccomp_data input struct into register A.
01: jeq 0xc000003e, 0, 11   // If A == AUDIT_ARCH_X86_64, jump by 0 instructions [to 02]
                            //   else jump by 11 instructions [to 13].
02: load32 0                // Load 32 bits at offsetof(struct seccomp_data, nr) (= 0)
                            //   of the seccomp_data input struct into register A.
03: jeq  15,  10,   0       // If A == __NR_rt_sigreturn, jump by 10 instructions [to 14]
                            //   else jump by 0 instructions [to 04].
04: jeq 231,   9,   0       // If A == __NR_exit_group, jump by 9 instructions [to 14]
                            //   else jump by 0 instructions [to 05].
05: jeq  60,   8,   0       // If A == __NR_exit, jump by 8 instructions [to 14]
                            //   else jump by 0 instructions [to 06].
06: jeq   0,   7,   0       // Same thing for __NR_read.
07: jeq   1,   6,   0       // Same thing for __NR_write.
08: jeq   5,   5,   0       // Same thing for __NR_fstat.
09: jeq   9,   4,   0       // Same thing for __NR_mmap.
10: jeq  14,   3,   0       // Same thing for __NR_rt_sigprocmask.
11: jeq  13,   2,   0       // Same thing for __NR_rt_sigaction.
12: jeq  35,   1,   0       // If A == __NR_nanosleep, jump by 1 instruction [to 14]
                            //   else jump by 0 instructions [to 13].
13: return 0                // Return SECCOMP_RET_KILL_THREAD
14: return 0x7fff0000       // Return SECCOMP_RET_ALLOW

This filter effectively allows only the following syscalls: rt_sigreturn, exit_group, exit, read, write, fstat, mmap, rt_sigprocmask, rt_sigaction, and nanosleep. All other syscalls result in the calling thread being killed.
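To follow the control flow by hand, it can help to run the filter through a tiny interpreter. The sketch below is a simplified model written for this post, not the kernel's evaluator: it supports only the three pseudo-instructions used above, models only the `nr` and `arch` fields, and the tuple encoding and the `run_filter` name are assumptions of this example.

```python
AUDIT_ARCH_X86_64 = 0xC000003E
SECCOMP_RET_KILL_THREAD = 0x00000000
SECCOMP_RET_ALLOW = 0x7FFF0000

# (opcode, operand, jump_if_true, jump_if_false).
# Jump distances are relative to the instruction after the jump.
FILTER = [
    ("load32", 4, 0, 0),                 # 00: A = arch
    ("jeq", AUDIT_ARCH_X86_64, 0, 11),   # 01
    ("load32", 0, 0, 0),                 # 02: A = nr
    ("jeq", 15, 10, 0),                  # 03: rt_sigreturn
    ("jeq", 231, 9, 0),                  # 04: exit_group
    ("jeq", 60, 8, 0),                   # 05: exit
    ("jeq", 0, 7, 0),                    # 06: read
    ("jeq", 1, 6, 0),                    # 07: write
    ("jeq", 5, 5, 0),                    # 08: fstat
    ("jeq", 9, 4, 0),                    # 09: mmap
    ("jeq", 14, 3, 0),                   # 10: rt_sigprocmask
    ("jeq", 13, 2, 0),                   # 11: rt_sigaction
    ("jeq", 35, 1, 0),                   # 12: nanosleep
    ("return", SECCOMP_RET_KILL_THREAD, 0, 0),  # 13
    ("return", SECCOMP_RET_ALLOW, 0, 0),        # 14
]


def run_filter(program, nr, arch=AUDIT_ARCH_X86_64):
    """Evaluate the toy program for one syscall and return its action."""
    fields = {0: nr, 4: arch}  # byte offset -> 32-bit value
    a = 0                      # the accumulator register A
    pc = 0
    while True:
        op, k, jt, jf = program[pc]
        pc += 1
        if op == "load32":
            a = fields[k]
        elif op == "jeq":
            pc += jt if a == k else jf
        elif op == "return":
            return k
        else:
            raise ValueError(f"unsupported opcode: {op}")
```

For example, `run_filter(FILTER, 1)` (write) returns `SECCOMP_RET_ALLOW`, while syscall number 2 (open on x86-64) falls through every comparison and returns `SECCOMP_RET_KILL_THREAD`.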

`seccomp-bpf` and cBPF limitations

cBPF is quite limited as a language. The following limitations all factor into the optimizations described in this blog post:

The cBPF virtual machine only has 2 32-bit registers, and a tertiary pseudo-register for a 32-bit immediate value. (Note that syscall arguments evaluated in the context of seccomp are 64-bit values, so you can already foresee that this leads to complications.)
seccomp-bpf programs are limited to 4,096 instructions.
Jump instructions can only go forward (this ensures that programs must halt).
Jump instructions may only jump by a fixed (“immediate”) number of instructions. (You cannot say: “jump by whatever this register says”.)
Jump instructions come in two flavors:
- “Unconditional” jump instructions, which jump by a fixed number of instructions. This number must fit in 16 bits.
- “Conditional” jump instructions, which include a condition expression and two jump targets:
  - The number of instructions to jump by if the condition is true. This number must fit in 8 bits, so this cannot jump by more than 255 instructions.
  - The number of instructions to jump by if the condition is false. This number must fit in 8 bits, so this cannot jump by more than 255 instructions.
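Two of these limits fall directly out of how a cBPF instruction is laid out in memory. Linux's `struct sock_filter` is a 16-bit opcode, two 8-bit conditional jump offsets, and a 32-bit immediate. The sketch below packs that layout with Python's `struct` module to show where the 255-instruction limit comes from, and where the two 32-bit halves of a 64-bit syscall argument live. The helper names are invented for this example, and the argument offsets assume a little-endian machine.

```python
import struct

BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS: load a 32-bit word at an offset
BPF_JMP_JEQ_K = 0x15  # BPF_JMP | BPF_JEQ | BPF_K: compare A with an immediate
BPF_RET_K = 0x06      # BPF_RET | BPF_K: return an immediate


def encode(code, jt, jf, k):
    """Pack one instruction as struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k)


def fits_conditional(distance):
    """True if `distance` can be stored in an 8-bit conditional jump field."""
    try:
        encode(BPF_JMP_JEQ_K, distance, 0, 0)
        return True
    except struct.error:
        return False


def arg_word_offsets(index):
    """Offsets of the (low, high) 32-bit halves of args[index], little-endian."""
    base = 16 + 8 * index  # args[] starts at byte 16 of seccomp_data
    return base, base + 4
```

Because registers are 32 bits wide, checking a single 64-bit argument takes two loads and two comparisons, one per half, which is part of why argument-checking rules grow quickly.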

`seccomp-bpf` caching in Linux

Since Linux kernel version 5.11, when a program uploads a seccomp-bpf filter into the kernel, Linux runs a BPF emulator that looks for system call numbers where the BPF program doesn’t do any fancy operations nor load any bits from the instruction_pointer or args fields of the seccomp_data input struct, and still returns “allow”. When this is the case, Linux will cache this information in a per-syscall-number bitfield.

Later, when a cacheable syscall number is executed, the BPF program is not evaluated at all; since the kernel knows that the program is deterministic and doesn’t depend on the syscall arguments, it can safely allow the syscall without actually running the BPF program.

This post uses the term “cacheable” to refer to syscalls that match these criteria.
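The cacheability check can be approximated in a few lines. This is a loose sketch of the idea, not the kernel's implementation: it reuses a toy three-opcode instruction format invented for this post, fixes the architecture to x86-64, and treats any load at offset 8 or beyond (that is, `instruction_pointer` or `args`) as disqualifying. The sample program and its policy are hypothetical.

```python
SECCOMP_RET_ALLOW = 0x7FFF0000
ARCH = 0xC000003E  # AUDIT_ARCH_X86_64, held constant during emulation
ARGS_START = 8     # instruction_pointer and args live at offsets >= 8


def is_cacheable(program, nr):
    """Emulate `program` for syscall `nr`; True if it allows without reading args."""
    a, pc = 0, 0
    while True:
        op, k, jt, jf = program[pc]
        pc += 1
        if op == "load32":
            if k >= ARGS_START:
                return False  # the verdict may differ from call to call
            a = nr if k == 0 else ARCH
        elif op == "jeq":
            pc += jt if a == k else jf
        elif op == "return":
            return k == SECCOMP_RET_ALLOW


# Hypothetical policy: getpid (39) is always allowed; write (1) only to fd 1.
PROGRAM = [
    ("load32", 0, 0, 0),                 # 0: A = nr
    ("jeq", 39, 4, 0),                   # 1: getpid -> 6 (allow)
    ("jeq", 1, 0, 2),                    # 2: write -> 3, anything else -> 5 (kill)
    ("load32", 16, 0, 0),                # 3: A = low half of args[0]
    ("jeq", 1, 1, 0),                    # 4: fd == 1 -> 6 (allow), else 5 (kill)
    ("return", 0, 0, 0),                 # 5: SECCOMP_RET_KILL_THREAD
    ("return", SECCOMP_RET_ALLOW, 0, 0), # 6
]
```

Here `getpid` is cacheable, `write` is not (its verdict depends on an argument), and a disallowed syscall is not cacheable either because it never reaches "allow".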

How gVisor builds its `seccomp-bpf` filter

gVisor imposes a seccomp-bpf filter on itself as part of Sentry start-up. This process works as follows:

gVisor gathers bits of configuration that are relevant to the construction of its seccomp-bpf filter. This includes which platform is in use, whether certain features that require looser filtering are enabled (e.g. host networking, profiling, GPU proxying, etc.), and certain file descriptors (FDs) which may be checked against syscall arguments that pass in FDs.

gVisor generates a sequence of rulesets from this configuration. A ruleset is a mapping from syscall number to a predicate that must be true for this system call, along with an “action” (return code) that is taken should this predicate be satisfied. For ease of human understanding, the predicate is often written as a disjunctive rule, for which each sub-rule is a conjunctive rule that verifies each syscall argument. In other words, (fA(args[0]) && fB(args[1]) && ...) || (fC(args[0]) && fD(args[1]) && ...) || .... This is represented in gVisor code as follows:

Or{          // Disjunction rule
    PerArg{  // Conjunction rule over each syscall argument
        fA,  // Predicate for `seccomp_data.args[0]`
        fB,  // Predicate for `seccomp_data.args[1]`
        // ... More predicates can go here (up to 6 arguments per syscall)
    },
    PerArg{  // Conjunction rule over each syscall argument
        fC,  // Predicate for `seccomp_data.args[0]`
        fD,  // Predicate for `seccomp_data.args[1]`
        // ... More predicates can go here (up to 6 arguments per syscall)
    },
}

gVisor performs several optimizations on this data structure.
gVisor then renders this list of rulesets into a linear program that looks close to the final machine language, other than jump offsets which are initially represented as symbolic named labels during the rendering process.
gVisor then resolves all the labels to their actual instruction index, and computes the actual jump targets of all jump instructions to obtain valid cBPF machine code.
gVisor runs further optimizations on this cBPF bytecode.
Finally, the cBPF bytecode is uploaded into the host kernel and the seccomp-bpf filter becomes effective.
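The disjunction-of-conjunctions shape of a ruleset from the second step can be modeled outside of gVisor in a few lines. The sketch below is a Python analogy, not gVisor's Go API: `per_arg`, `any_of`, `equal_to`, `any_value`, and `check` are names made up for this example, and the ioctl policy is hypothetical.

```python
def per_arg(*predicates):
    """Conjunction: predicate i must accept syscall argument i."""
    def rule(args):
        return all(pred(arg) for pred, arg in zip(predicates, args))
    return rule


def any_of(*rules):
    """Disjunction: at least one sub-rule must accept the arguments."""
    def rule(args):
        return any(sub(args) for sub in rules)
    return rule


def equal_to(value):
    return lambda arg: arg == value


def any_value(arg):
    return True


# Hypothetical ruleset: ioctl (16 on x86-64) is allowed on fd 3 with
# request 0x5401 (TCGETS), or on fd 4 with any request.
RULESET = {
    16: any_of(
        per_arg(equal_to(3), equal_to(0x5401)),
        per_arg(equal_to(4), any_value),
    ),
}


def check(ruleset, nr, args):
    """True if syscall `nr` with `args` satisfies its rule; unknown nr is denied."""
    rule = ruleset.get(nr)
    return rule is not None and rule(args)
```

Arguments without a predicate are left unchecked, which matches the idea that a `PerArg` rule only needs to list as many predicates as it cares about.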

Optimizing the seccomp-bpf filter to be more efficient allows the program to be more compact (i.e. it’s possible to pack more complex filters in the 4,096 instruction limit), and to run faster. While seccomp-bpf evaluation is measured in nanoseconds, the impact of any optimization is magnified here, because host syscalls are an important part of the synchronous “syscall hot path” that must execute as part of handling certain performance-sensitive syscalls from the sandboxed application. The relationship is not 1-to-1: a single application syscall may result in several host syscalls, especially due to futex(2) which the Sentry calls many times to synchronize its own operations. Therefore, shaving a nanosecond here and there results in several shaved nanoseconds in the syscall hot path.

Structural optimizations

The first optimization done for gVisor’s seccomp-bpf was to turn its linear search over syscall numbers into a binary search tree. This turns the search for syscall numbers from O(n) to O(log n) instructions. This is a very common seccomp-bpf optimization technique which is replicated in other projects such as libseccomp and Chromium.

To do this, a cBPF program basically loads the 32-bit nr (syscall number) field of the seccomp_data struct, and does a binary tree traversal of the syscall number space. When it finds a match, it jumps to a set of instructions that check that syscall’s arguments for validity, and then returns allow/reject.
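The difference between the two search strategies is easy to quantify by counting how many syscall numbers get compared before a match is found. The sketch below does that over a hypothetical allowlist; each "step" stands for one node visit, which costs a small constant number of cBPF instructions in a real filter.

```python
def linear_steps(sorted_nrs, nr):
    """Number of candidates compared by a linear scan before finding `nr`."""
    steps = 0
    for candidate in sorted_nrs:
        steps += 1
        if candidate == nr:
            break
    return steps


def bst_steps(sorted_nrs, nr):
    """Number of nodes visited by a balanced binary search before finding `nr`."""
    lo, hi, steps = 0, len(sorted_nrs) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_nrs[mid] == nr:
            break
        if nr < sorted_nrs[mid]:
            hi = mid - 1
        else:
            lo = mid + 1
    return steps


ALLOWED = list(range(0, 128, 2))  # 64 hypothetical allowed syscall numbers
worst_linear = max(linear_steps(ALLOWED, nr) for nr in ALLOWED)
worst_bst = max(bst_steps(ALLOWED, nr) for nr in ALLOWED)
```

For this 64-entry list the worst case drops from 64 comparisons to 7.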

But why stop here? Let’s go further.

The problem with the binary search tree approach is that it treats all syscall numbers equally. This is a problem for three reasons:

Good performance does not matter for disallowed syscalls, because such syscalls should never happen during normal program execution.
Good performance does not matter for syscalls which can be cached by the kernel, because the BPF program will only have to run once for these system calls.
For the system calls which are allowed but are not cacheable by the kernel, there is a Pareto distribution of their relative frequency. To exploit this we should evaluate the most-often used syscalls faster than the least-often used ones. The binary tree structure does not exploit this distribution, and instead treats all syscalls equally.

So gVisor splits syscall numbers into four sets:

🅰: Non-cacheable 🅰llowed, called very frequently.
🅱: Non-cacheable allowed, called once in a 🅱lue moon.
🅲: 🅲acheable allowed (whether called frequently or not).
🅳: 🅳isallowed (which, by definition, is neither cacheable nor expected to ever be called).

Then, the cBPF program is structured in the following layout:

Linear search over allowed frequently-called non-cacheable syscalls (🅰). These syscalls are ordered in most-frequently-called first (e.g. futex(2) is the first one as it is by far the most-frequently-called system call).
Binary search over allowed infrequently-called non-cacheable syscalls (🅱).
Binary search over allowed cacheable syscalls (🅲).
Reject anything else (🅳).

This structure takes full advantage of the kernel caching functionality, and of the Pareto distribution of syscalls.
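As a rough illustration of that layout, the sketch below sorts syscall numbers into the first three sets and returns them in the order the filter would check them; anything not returned belongs to 🅳. The function name and the sample classification are hypothetical: which syscalls are actually hot or cacheable depends on gVisor's configuration and is not taken from its code.

```python
def plan_layout(allowed, cacheable, hot_order):
    """Return the order in which syscall numbers are checked, per set.

    allowed:   set of allowed syscall numbers
    cacheable: set of allowed syscall numbers the kernel can cache
    hot_order: frequently called syscall numbers, most frequent first
    """
    set_a = [nr for nr in hot_order if nr in allowed and nr not in cacheable]
    set_b = sorted(nr for nr in allowed if nr not in cacheable and nr not in set_a)
    set_c = sorted(nr for nr in allowed if nr in cacheable)
    return {"A_linear": set_a, "B_bst": set_b, "C_bst": set_c}


# Hypothetical classification using x86-64 syscall numbers:
# futex=202, ppoll=271, ioctl=16, getpid=39, gettid=186.
layout_plan = plan_layout(
    allowed={202, 271, 16, 39, 186},
    cacheable={39, 186},
    hot_order=[202, 271],
)
```

Set 🅰 keeps the caller's frequency order because it is scanned linearly, while sets 🅱 and 🅲 are sorted so that each can be rendered as a binary search tree.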

Binary search tree optimizations

Beyond classifying syscalls to see which binary search tree they should be a part of, gVisor also optimizes the binary search process itself.

Each syscall number is a node in the tree. When traversing the tree, there are three options at each point:

The syscall number is an exact match
The syscall number is lower than the node’s value
The syscall number is higher than the node’s value

In order to render the BST as cBPF bytecode, gVisor used to render the following (in pseudocode):

if syscall number == current node value
    jump @rules_for_this_syscall
if syscall number < current node value
    jump @left_node
jump @right_node

@rules_for_this_syscall:
  // Render bytecode for this syscall's filters here...

@left_node:
  // Recursively render the bytecode for the left node value here...

@right_node:
  // Recursively render the bytecode for the right node value here...

Keep in mind the cBPF limitations here. Conditional jumps are limited to 255 instructions, yet the jump to @left_node can be further than 255 instructions away (especially for syscalls with complex filtering rules like ioctl(2)), and the jump to @right_node is almost certainly more than 255 instructions away. This means that in actual cBPF bytecode, we would often need conditional jumps followed by unconditional jumps in order to jump that far forward. Meanwhile, the jump to @rules_for_this_syscall would be a very short hop, but this locality would only be taken advantage of for a single node of the entire tree on each traversal.

Consider this structure instead:

// Traversal code:
  if syscall number < current node value
      jump @left_node
  if syscall number > current node value
      jump @right_node
  jump @rules_for_this_syscall
  @left_node:
    // Recursively render only the traversal code for the left node here
  @right_node:
    // Recursively render only the traversal code for the right node here

// Filtering code:
  @rules_for_this_syscall:
    // Render bytecode for this syscall's filters here
  // Recursively render only the filtering code for the left node here
  // Recursively render only the filtering code for the right node here

This effectively separates the per-syscall rules from the traversal of the BST. This ensures that the traversal can be done entirely using conditional jumps, and that for any given execution of the cBPF program, there will be at most one unconditional jump to the syscall-specific rules.

This structure is further improvable by taking advantage of the fact that syscall numbers are a dense space, and so are syscall filter rules. This means we can often avoid needless comparisons. For example, given the following tree:

      22
     /  \
    9    24
   /    /  \
  8   23    50

Notice that the tree contains 22, 23, and 24. This means that if we get to node 23, we do not need to check for syscall number equality, because we’ve already established from the traversal that the syscall number must be 23.
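
A minimal sketch of this idea, assuming a toy `node` type rather than gVisor's real tree: the walk below tracks the exclusive bounds that the traversal has already established, and flags every node whose value is the only number left within those bounds. The names `skippable`, `noLower`, and `noUpper` are made up for the example.

```go
package main

import "fmt"

// node is one syscall number in the binary search tree.
type node struct {
	nr          int
	left, right *node
}

// Sentinel bounds meaning "nothing established yet" (illustrative values).
const (
	noLower = -1
	noUpper = 1 << 30
)

// skippable records in out every node that needs no equality check. lo and hi
// are exclusive bounds already proven by the comparisons made on the way down.
func skippable(n *node, lo, hi int, out map[int]bool) {
	if n == nil {
		return
	}
	if hi-lo == 2 { // only lo+1 is left in range, and that must be n.nr
		out[n.nr] = true
	}
	skippable(n.left, lo, n.nr, out)  // left subtree: number < n.nr
	skippable(n.right, n.nr, hi, out) // right subtree: number > n.nr
}

// exampleTree builds the tree drawn above.
func exampleTree() *node {
	return &node{nr: 22,
		left:  &node{nr: 9, left: &node{nr: 8}},
		right: &node{nr: 24, left: &node{nr: 23}, right: &node{nr: 50}},
	}
}

func main() {
	out := map[int]bool{}
	skippable(exampleTree(), noLower, noUpper, out)
	fmt.Println(out) // map[23:true]
}
```

Node 23 is reached only after establishing "greater than 22" and "less than 24", so it is the one node in this tree where the equality comparison can be dropped.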

cBPF bytecode optimizations

gVisor now implements a bytecode-level cBPF optimizer running a few lossless optimizations. These optimizations are run repeatedly until the bytecode no longer changes. This is because each type of optimization tends to feed on the fruits of the others, as we’ll see below.

gVisor’s seccomp-bpf program size is reduced by over a factor of 4 using the optimizations below.

Optimizing cBPF jumps

The limitations of cBPF jump instructions described earlier mean that typical BPF bytecode rendering code will usually favor unconditional jumps even when they are not necessary. However, they can be optimized after the fact.

Typical BPF bytecode rendering code for a simple condition is usually rendered as follows:

jif <condition>, 0, 1     // If <condition> is true, continue,
                          //   otherwise skip over 1 instruction.
jmp @condition_was_true   // Unconditional jump to label @condition_was_true.
jmp @condition_was_false  // Unconditional jump to label @condition_was_false.

… or as follows:

jif <condition>, 1, 0     // If <condition> is true, jump by 1 instruction,
                          //   otherwise continue.
jmp @condition_was_false  // Unconditional jump to label @condition_was_false.
// Flow through here if the condition was true.

… In other words, the generated code always uses unconditional jumps, and conditional jump offsets are always either 0 or 1 instructions forward. This is because conditional jump offsets are limited to 8 bits (255 instructions), and it is not always possible at BPF bytecode rendering time to know ahead of time that the jump targets (@condition_was_true, @condition_was_false) will resolve to an instruction that is close enough ahead that the offset would fit in 8 bits. The safe thing to do is to always use an unconditional jump. Since unconditional jumps store their offset in the instruction's 32-bit k field, and seccomp-bpf programs are limited to 4,096 instructions, it is always possible to encode a jump using an unconditional jump instruction.

But of course, the jump target often does fit in 8 bits. So gVisor looks over the bytecode for optimization opportunities:

Conditional jumps that jump to unconditional jumps are rewritten to their final destination, so long as this fits within the 255-instruction conditional jump limit.
Unconditional jumps that jump to other unconditional jumps are rewritten to their final destination.
Conditional jumps where both branches jump to the same instruction are replaced by an unconditional jump to that instruction.
Unconditional jumps with a zero-instruction jump target are removed.

The aim of these optimizations is to clean up after needless indirection that is a byproduct of cBPF bytecode rendering code. Once they all have run, all jumps are as tight as they can be.
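
The first two rewrites in the list above can be sketched on a simplified instruction model. This is an illustration under assumptions, not gVisor's optimizer: the `insn` type only models jump offsets, and `finalTarget` and `tightenJumps` are names invented here.

```go
package main

import "fmt"

type opcode int

const (
	opJmp opcode = iota // unconditional jump: pc += 1 + k
	opJif               // conditional jump: pc += 1 + jt if true, else 1 + jf
	opRet               // return a verdict
)

// insn is a simplified cBPF instruction; only jump offsets are modeled.
type insn struct {
	op     opcode
	k      int // forward offset for opJmp
	jt, jf int // forward offsets for opJif
}

// maxCondOffset is the largest offset that fits in an 8-bit jt/jf field.
const maxCondOffset = 255

// finalTarget follows a chain of unconditional jumps starting at index pc and
// returns the index of the first instruction that is not one. cBPF jumps only
// go forward, so the loop always terminates.
func finalTarget(prog []insn, pc int) int {
	for pc < len(prog) && prog[pc].op == opJmp {
		pc += 1 + prog[pc].k
	}
	return pc
}

// tightenJumps rewrites every jump to its final destination. Conditional
// branches are only rewritten when the new offset still fits in 8 bits.
func tightenJumps(prog []insn) {
	for i := range prog {
		next := i + 1
		switch prog[i].op {
		case opJmp:
			prog[i].k = finalTarget(prog, next+prog[i].k) - next
		case opJif:
			if off := finalTarget(prog, next+prog[i].jt) - next; off <= maxCondOffset {
				prog[i].jt = off
			}
			if off := finalTarget(prog, next+prog[i].jf) - next; off <= maxCondOffset {
				prog[i].jf = off
			}
		}
	}
}

func main() {
	prog := []insn{
		{op: opJif, jt: 0, jf: 1}, // 0: jif <condition>, 0, 1
		{op: opJmp, k: 1},         // 1: jmp @condition_was_true  (index 3)
		{op: opJmp, k: 1},         // 2: jmp @condition_was_false (index 4)
		{op: opRet},               // 3: @condition_was_true
		{op: opRet},               // 4: @condition_was_false
	}
	tightenJumps(prog)
	fmt.Println(prog[0].jt, prog[0].jf) // 2 3
}
```

After the rewrite, the conditional jump goes straight to both return instructions, and the two unconditional jumps at indices 1 and 2 can no longer be reached.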

Removing dead code

Because cBPF is a very restricted language, it is possible to determine with certainty that some instructions can never be reached.

In cBPF, each instruction either:

Flows forward (e.g. load operations, math operations).
Jumps by a fixed (immediate) number of instructions.
Stops the execution immediately (return instructions).

Therefore, gVisor runs a simple program traversal algorithm. It creates a bitfield with one bit per instruction, then traverses the program and all its possible branches. Then, all instructions that were never traversed are removed from the program, and all jump targets are updated to account for these removals.

In turn, this makes the program shorter, which makes more jump optimizations possible.
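
The traversal and the jump fix-ups can be sketched as follows. This is a simplified model under assumptions: the `insn` type only carries jump offsets, a `[]bool` stands in for the bitfield, and the program is assumed to be valid (every jump lands inside it). The name `removeDeadCode` is made up for the example.

```go
package main

import "fmt"

type opcode int

const (
	opLoad opcode = iota // flows forward to the next instruction
	opJmp                // unconditional jump: pc += 1 + k
	opJif                // conditional jump: pc += 1 + jt or pc += 1 + jf
	opRet                // stops execution
)

// insn is a simplified cBPF instruction; only jump offsets are modeled.
type insn struct {
	op     opcode
	k      int
	jt, jf int
}

// removeDeadCode drops unreachable instructions and re-targets the jumps.
func removeDeadCode(prog []insn) []insn {
	reachable := make([]bool, len(prog)) // one flag per instruction
	var visit func(pc int)
	visit = func(pc int) {
		if pc >= len(prog) || reachable[pc] {
			return
		}
		reachable[pc] = true
		switch in := prog[pc]; in.op {
		case opRet: // execution stops here
		case opJmp:
			visit(pc + 1 + in.k)
		case opJif:
			visit(pc + 1 + in.jt)
			visit(pc + 1 + in.jf)
		default:
			visit(pc + 1)
		}
	}
	visit(0)

	// newIndex[i] is where instruction i lands once dead code is dropped.
	newIndex := make([]int, len(prog))
	kept := 0
	for i := range prog {
		newIndex[i] = kept
		if reachable[i] {
			kept++
		}
	}

	out := make([]insn, 0, kept)
	for i, in := range prog {
		if !reachable[i] {
			continue
		}
		next := newIndex[i] + 1
		switch in.op {
		case opJmp:
			in.k = newIndex[i+1+in.k] - next
		case opJif:
			in.jt = newIndex[i+1+in.jt] - next
			in.jf = newIndex[i+1+in.jf] - next
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []insn{
		{op: opLoad},              // 0
		{op: opJif, jt: 2, jf: 3}, // 1: jumps to 4 or 5
		{op: opJmp, k: 1},         // 2: unreachable
		{op: opJmp, k: 1},         // 3: unreachable
		{op: opRet},               // 4
		{op: opRet},               // 5
	}
	out := removeDeadCode(prog)
	fmt.Println(len(out), out[1].jt, out[1].jf) // 4 0 1
}
```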

Removing redundant `load` instructions

cBPF programs filter system calls by inspecting their arguments. To do these comparisons, this data must first be loaded into the cBPF VM registers. These load operations can be optimized.

cBPF’s conditional operations (e.g. “is equal to”, “is greater than”, etc.) operate on a single 32-bit register called “A”. As such, a seccomp-bpf program typically consists of many load operations (load32), each of which loads a 32-bit value from a given offset of the seccomp_data struct into register A, followed by a comparison on it to see if it matches the filter.

load32 <offset>
jif <condition1>, @condition1_was_true, @condition1_was_false
load32 <offset>
jif <condition2>, @condition2_was_true, @condition2_was_false
// ...

But when a syscall rule is of the form “this syscall argument must be one of the following values”, we don’t need to reload the same value (from the same offset) multiple times. So gVisor looks for redundant loads like this, and removes them.

load32 <offset>
jif <condition1>, @condition1_was_true, @condition1_was_false
jif <condition2>, @condition2_was_true, @condition2_was_false
// ...

Note that syscall arguments are 64-bit values, whereas the A register is only 32 bits wide. Therefore, asserting that a syscall argument matches a predicate usually involves at least 2 load32 operations on different offsets, which makes this optimization useless for the “this syscall argument must be one of the following values” case. We’ll get back to that.

Minimizing the number of `return` instructions

A typical syscall filter program consists of many predicates which return either “allowed” or “rejected”. These are encoded in the bytecode as either return instructions, or jumps to return instructions. These instructions can show up dozens or hundreds of times in the cBPF bytecode in quick succession, presenting an optimization opportunity.

Since two return instructions with the same immediate return code are exactly equivalent to one another, it is possible to rewrite jumps to all return instructions that return “allowed” to go to a single return instruction that returns this code, and similarly for “rejected”, so long as the jump offsets fit within the limits of conditional jumps (255 instructions). In turn, this makes the program shorter, and therefore makes more jump optimizations possible.

To implement this optimization, gVisor first replaces all unconditional jump instructions that go to return statements with a copy of that return statement. This removes needless indirection.

    Original bytecode                      New bytecode
jeq 0, 0, 1                        00: jeq 0, 0, 1
jmp @good                    -->   01: return allowed
jmp @bad                     -->   02: return rejected
...                                    ...
jge 0, 0, 1                        10: jge 0, 0, 1
jmp @good                    -->   11: return allowed
jmp @bad                     -->   12: return rejected
...                                    ...
[@good]: return allowed            100 [@good]: return allowed
[@bad]:  return rejected           101 [@bad]:  return rejected

gVisor then searches for return statements which can be entirely removed by seeing if it is possible to rewrite the rest of the program to jump or flow through to an equivalent return statement (without making the program longer in the process). In the above example:

    Original bytecode                      New bytecode
jeq 0, 0, 1                  -->   00: jeq 0, 99, 100   // Targets updated
return allowed                     01: return allowed   // Now dead code
return rejected                    02: return rejected  // Now dead code
...                                    ...
jge 0, 0, 1                  -->   10: jge 0, 89, 90    // Targets updated
return allowed                     11: return allowed   // Now dead code
return rejected                    12: return rejected  // Now dead code
...                                    ...
[@good]: return allowed            100 [@good]: return allowed
[@bad]:  return rejected           101 [@bad]:  return rejected

Finally, the dead code removal pass cleans up the dead return statements and the program becomes shorter.

    Original bytecode                      New bytecode
jeq 0, 99, 100               -->   00: jeq 0, 95, 96  // Targets updated
return allowed               -->   /* Removed */
return rejected              -->   /* Removed */
...                                    ...
jge 0, 89, 90                -->   08: jge 0, 87, 88  // Targets updated
return allowed               -->   /* Removed */
return rejected              -->   /* Removed */
...                                    ...
[@good]: return allowed            96 [@good]: return allowed
[@bad]:  return rejected           97 [@bad]:  return rejected

While this search is expensive to perform, in a program full of predicates — which is exactly what seccomp-bpf programs are — this approach massively reduces program size.

Ruleset optimizations

Bytecode-level optimizations are cool, but why stop here? gVisor now also performs seccomp ruleset optimizations.

In gVisor, a seccomp RuleSet is a mapping from syscall number to a logical expression named SyscallRule, along with a seccomp-bpf action (e.g. “allow”) if a syscall with a given number matches its SyscallRule.

Basic ruleset simplifications

A SyscallRule is a predicate over the data contained in the seccomp_data struct (beyond its nr). A trivial implementation is MatchAll, which simply matches any seccomp_data. Other implementations include Or and And (which do what they sound like), and PerArg which applies predicates to each specific argument of a seccomp_data, and forms the meat of actual syscall filtering rules. Some basic simplifications are already possible with these building blocks.

gVisor implements the following basic optimizers, which look like they may be useless on their own but end up simplifying the logic of the more complex optimizer described in other sections quite a bit:

Or and And rules with a single predicate within them are replaced with just that predicate.
Duplicate predicates within Or and And rules are removed.
Or rules within Or rules are flattened.
And rules within And rules are flattened.
An Or rule which contains a MatchAll predicate is replaced with MatchAll.
MatchAll predicates within And rules are removed.
PerArg rules with MatchAll predicates for each argument are replaced with a rule that matches anything.

As with the bytecode-level optimizations, gVisor runs these in a loop until the structure of the rules no longer changes. With the basic optimizations above, this silly-looking rule:

Or{
    And{
        And{
            MatchAll,
            PerArg{AnyValue, EqualTo(2), AnyValue},
        },
        MatchAll,
    },
    PerArg{AnyValue, EqualTo(2), AnyValue},
    PerArg{AnyValue, EqualTo(2), AnyValue},
}

… is simplified down to just PerArg{AnyValue, EqualTo(2), AnyValue}.
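
To make the fixed-point loop concrete, here is a toy simplifier that applies the basic rules listed above to a small rule tree. The `rule` type, the constructor helpers, and the use of strings for argument matchers are assumptions made for this sketch; gVisor's real SyscallRule types are richer. The rule in `main` is a separate example built for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

type kind int

const (
	kindMatchAll kind = iota
	kindPerArg
	kindOr
	kindAnd
)

// rule is a toy stand-in for a SyscallRule; it is not gVisor's real type.
type rule struct {
	kind     kind
	args     []string // kindPerArg: one matcher name per argument
	children []rule   // kindOr and kindAnd
}

var MatchAll = rule{kind: kindMatchAll}

func PerArg(args ...string) rule { return rule{kind: kindPerArg, args: args} }
func Or(rs ...rule) rule         { return rule{kind: kindOr, children: rs} }
func And(rs ...rule) rule        { return rule{kind: kindAnd, children: rs} }

// String renders a canonical form, used for deduplication and comparison.
func (r rule) String() string {
	switch r.kind {
	case kindMatchAll:
		return "MatchAll"
	case kindPerArg:
		return "PerArg{" + strings.Join(r.args, ", ") + "}"
	}
	parts := make([]string, len(r.children))
	for i, c := range r.children {
		parts[i] = c.String()
	}
	name := map[kind]string{kindOr: "Or", kindAnd: "And"}[r.kind]
	return name + "{" + strings.Join(parts, ", ") + "}"
}

// simplifyOnce applies each basic rule once, bottom-up.
func simplifyOnce(r rule) rule {
	if r.kind == kindMatchAll {
		return r
	}
	if r.kind == kindPerArg {
		for _, a := range r.args {
			if a != "AnyValue" {
				return r
			}
		}
		return MatchAll // every argument accepts any value
	}
	var kids []rule
	seen := map[string]bool{}
	for _, c := range r.children {
		c = simplifyOnce(c)
		group := []rule{c}
		if c.kind == r.kind {
			group = c.children // flatten Or-in-Or and And-in-And
		}
		for _, g := range group {
			if g.kind == kindMatchAll {
				if r.kind == kindOr {
					return MatchAll // an Or containing MatchAll matches everything
				}
				continue // MatchAll inside an And adds nothing
			}
			if key := g.String(); !seen[key] { // drop duplicate predicates
				seen[key] = true
				kids = append(kids, g)
			}
		}
	}
	switch {
	case len(kids) == 1:
		return kids[0] // single-predicate Or/And collapses to that predicate
	case len(kids) == 0 && r.kind == kindAnd && len(r.children) > 0:
		return MatchAll // the And only contained MatchAll predicates
	}
	return rule{kind: r.kind, children: kids}
}

// simplify repeats simplifyOnce until the rule stops changing.
func simplify(r rule) rule {
	for {
		next := simplifyOnce(r)
		if next.String() == r.String() {
			return next
		}
		r = next
	}
}

func main() {
	p := PerArg("AnyValue", "EqualTo(2)", "AnyValue")
	fmt.Println(simplify(Or(And(MatchAll, p), Or(p, p)))) // PerArg{AnyValue, EqualTo(2), AnyValue}
	fmt.Println(simplify(Or(p, And(MatchAll, MatchAll)))) // MatchAll
}
```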

Extracting repeated argument matchers

This is the main optimization that gVisor performs on rulesets. gVisor looks for common argument matchers that are repeated across all combinations of other argument matchers in branches of an Or rule. It removes them from these PerArg rules, and combines the overall syscall rule (using And) with a single instance of that argument matcher. Sound complicated? Let’s look at an example.

In the gVisor Sentry seccomp-bpf configuration, these are the rules for the fcntl(2) system call:

rules = ...(map[uintptr]SyscallRule{
    SYS_FCNTL: Or{
        PerArg{
            NonNegativeFD,
            EqualTo(F_GETFL),
        },
        PerArg{
            NonNegativeFD,
            EqualTo(F_SETFL),
        },
        PerArg{
            NonNegativeFD,
            EqualTo(F_GETFD),
        },
    },
})

… This means that for the fcntl(2) system call, seccomp_data.args[0] may be any non-negative number, seccomp_data.args[1] may be either F_GETFL, F_SETFL, or F_GETFD, and all other seccomp_data fields may be any value.

If rendered naively in BPF, this would iterate over each branch of the Or expression, and re-check the NonNegativeFD each time. Clearly wasteful. Conceptually, the ideal expression is something like this:

rules = ...(map[uintptr]SyscallRule{
    SYS_FCNTL: PerArg{
        NonNegativeFD,
        AnyOf(F_GETFL, F_SETFL, F_GETFD),
    },
})

… But going through all the syscall rules to look for this pattern would be quite tedious, and some of them are actually Or’d from multiple map[uintptr]SyscallRule in different files (e.g. platform-dependent syscalls), so they cannot be all specified in a single location with a single predicate on seccomp_data.args[1]. So gVisor needs to detect this programmatically at optimization time.

Conceptually, gVisor goes from:

Or{
    PerArg{A1, B1, C1, D},
    PerArg{A2, B1, C1, D},
    PerArg{A1, B2, C2, D},
    PerArg{A2, B2, C2, D},
    PerArg{A1, B3, C3, D},
    PerArg{A2, B3, C3, D},
}

… to (after one pass):

And{
    Or{
        PerArg{A1, AnyValue, AnyValue, AnyValue},
        PerArg{A2, AnyValue, AnyValue, AnyValue},
        PerArg{A1, AnyValue, AnyValue, AnyValue},
        PerArg{A2, AnyValue, AnyValue, AnyValue},
        PerArg{A1, AnyValue, AnyValue, AnyValue},
        PerArg{A2, AnyValue, AnyValue, AnyValue},
    },
    Or{
        PerArg{AnyValue, B1, C1, D},
        PerArg{AnyValue, B1, C1, D},
        PerArg{AnyValue, B2, C2, D},
        PerArg{AnyValue, B2, C2, D},
        PerArg{AnyValue, B3, C3, D},
        PerArg{AnyValue, B3, C3, D},
    },
}

Then the basic optimizers will kick in and detect duplicate PerArg rules in Or expressions, and delete them:

And{
    Or{
        PerArg{A1, AnyValue, AnyValue, AnyValue},
        PerArg{A2, AnyValue, AnyValue, AnyValue},
    },
    Or{
        PerArg{AnyValue, B1, C1, D},
        PerArg{AnyValue, B2, C2, D},
        PerArg{AnyValue, B3, C3, D},
    },
}

… Then, on the next pass, the second inner Or rule gets recursively optimized:

And{
    Or{
        PerArg{A1, AnyValue, AnyValue, AnyValue},
        PerArg{A2, AnyValue, AnyValue, AnyValue},
    },
    And{
        Or{
            PerArg{AnyValue, AnyValue, AnyValue, D},
            PerArg{AnyValue, AnyValue, AnyValue, D},
            PerArg{AnyValue, AnyValue, AnyValue, D},
        },
        Or{
            PerArg{AnyValue, B1, C1, AnyValue},
            PerArg{AnyValue, B2, C2, AnyValue},
            PerArg{AnyValue, B3, C3, AnyValue},
        },
    },
}

… which, after other basic optimizers clean this all up, finally becomes:

And{
    Or{
        PerArg{A1, AnyValue, AnyValue, AnyValue},
        PerArg{A2, AnyValue, AnyValue, AnyValue},
    },
    PerArg{AnyValue, AnyValue, AnyValue, D},
    Or{
        PerArg{AnyValue, B1, C1, AnyValue},
        PerArg{AnyValue, B2, C2, AnyValue},
        PerArg{AnyValue, B3, C3, AnyValue},
    },
}

This has turned what would be 24 comparisons into just 9:

seccomp_data.args[0] must either match predicate A1 or A2.
seccomp_data.args[3] must match predicate D.
At least one of the following must be true:
- seccomp_data.args[1] must match predicate B1 and seccomp_data.args[2] must match predicate C1.
- seccomp_data.args[1] must match predicate B2 and seccomp_data.args[2] must match predicate C2.
- seccomp_data.args[1] must match predicate B3 and seccomp_data.args[2] must match predicate C3.
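
As a sanity check that the factored form accepts exactly the same inputs as the original six-branch Or, the sketch below treats each predicate as an independent boolean and compares both expressions over all 512 truth assignments. The `preds` type and function names are invented for this check.

```go
package main

import "fmt"

// preds holds the truth value of each predicate for one hypothetical syscall.
type preds struct{ a1, a2, b1, b2, b3, c1, c2, c3, d bool }

// original is the six-branch Or of four-argument PerArg rules (24 comparisons).
func original(p preds) bool {
	return (p.a1 && p.b1 && p.c1 && p.d) || (p.a2 && p.b1 && p.c1 && p.d) ||
		(p.a1 && p.b2 && p.c2 && p.d) || (p.a2 && p.b2 && p.c2 && p.d) ||
		(p.a1 && p.b3 && p.c3 && p.d) || (p.a2 && p.b3 && p.c3 && p.d)
}

// factored is the optimized form (9 comparisons).
func factored(p preds) bool {
	return (p.a1 || p.a2) && p.d &&
		((p.b1 && p.c1) || (p.b2 && p.c2) || (p.b3 && p.c3))
}

// equivalent compares both forms on every combination of predicate outcomes.
func equivalent() bool {
	for m := 0; m < 1<<9; m++ {
		bit := func(i uint) bool { return (m>>i)&1 == 1 }
		p := preds{bit(0), bit(1), bit(2), bit(3), bit(4), bit(5), bit(6), bit(7), bit(8)}
		if original(p) != factored(p) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(equivalent()) // true
}
```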

To go back to our fcntl(2) example, the rules would therefore be rewritten to:

rules = ...(map[uintptr]SyscallRule{
    SYS_FCNTL: And{
        // Check for args[0] exclusively:
        PerArg{NonNegativeFD, AnyValue},
        // Check for args[1] exclusively:
        Or{
            PerArg{AnyValue, EqualTo(F_GETFL)},
            PerArg{AnyValue, EqualTo(F_SETFL)},
            PerArg{AnyValue, EqualTo(F_GETFD)},
        },
    },
})

… thus we’ve turned 6 comparisons into 4. But we can do better still!

Extracting repeated 32-bit match logic from 64-bit argument matchers

We can apply the same optimization, but down to the 32-bit matching logic that underlies the 64-bit syscall argument matching predicates.

As you may recall, cBPF instructions are limited to 32-bit math. This means that when rendered, each of these argument comparisons is actually 2 operations: one for the first 32-bit half of the argument, and one for the second 32-bit half of the argument.

Let’s look at the F_GETFL, F_SETFL, and F_GETFD constants:

F_GETFL = 0x3
F_SETFL = 0x4
F_GETFD = 0x1

The cBPF bytecode for checking the arguments of this syscall may therefore look something like this:

// Check for `seccomp_data.args[0]`:
  00: load32 16                // Load the first 32 bits of
                               //   `seccomp_data.args[0]` into register A.
  01: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  02: load32 20                // Load the second 32 bits of
                               //   `seccomp_data.args[0]` into register A.
  03: jset 0x80000000, @bad, 0 // If A & 0x80000000 != 0, jump to @bad,
                               //   otherwise continue.

// Check for `seccomp_data.args[1]`:
  04: load32 24                // Load the first 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  05: jeq 0, 0, @next1         // If A == 0, continue, otherwise jump to @next1.
  06: load32 28                // Load the second 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  07: jeq 0x3, @good, @next1   // If A == 0x3, jump to @good,
                               //   otherwise jump to @next1.

@next1:
  08: load32 24                // Load the first 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  09: jeq 0, 0, @next2         // If A == 0, continue, otherwise jump to @next2.
  10: load32 28                // Load the second 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  11: jeq 0x4, @good, @next2   // If A == 0x4, jump to @good,
                               //   otherwise jump to @next2.

@next2:
  12: load32 24                // Load the first 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  13: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  14: load32 28                // Load the second 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  15: jeq 0x1, @good, @bad     // If A == 0x1, jump to @good,
                               //   otherwise jump to @bad.

// Good/bad jump targets for the checks above to jump to:
@good:
  16: return ALLOW
@bad:
  17: return REJECT

Clearly this could be better. The first 32 bits must be zero in all possible cases. So the syscall argument value-matching primitives (e.g. EqualTo) may be split into 2 32-bit value matchers:

rules = ...(map[uintptr]SyscallRule{
    SYS_FCNTL: And{
        PerArg{NonNegativeFD, AnyValue},
        Or{
            PerArg{
                AnyValue,
                splitMatcher{
                    high32bits: EqualTo32Bits(
                      F_GETFL & 0xffffffff00000000 /* = 0 */),
                    low32bits:  EqualTo32Bits(
                      F_GETFL & 0x00000000ffffffff /* = 0x3 */),
                },
            },
            PerArg{
                AnyValue,
                splitMatcher{
                    high32bits: EqualTo32Bits(
                      F_SETFL & 0xffffffff00000000 /* = 0 */),
                    low32bits:  EqualTo32Bits(
                      F_SETFL & 0x00000000ffffffff /* = 0x4 */),
                },
            },
            PerArg{
                AnyValue,
                splitMatcher{
                    high32bits: EqualTo32Bits(
                      F_GETFD & 0xffffffff00000000 /* = 0 */),
                    low32bits:  EqualTo32Bits(
                      F_GETFD & 0x00000000ffffffff /* = 0x1 */),
                },
            },
        },
    },
})
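
The split itself is plain bit arithmetic. A minimal sketch, with a hypothetical `split` helper and the three constants redeclared locally using the values quoted above:

```go
package main

import "fmt"

// Values quoted in the text; declared locally to keep the sketch standalone.
const (
	fGetFD uint64 = 0x1
	fGetFL uint64 = 0x3
	fSetFL uint64 = 0x4
)

// split returns the two 32-bit halves of a 64-bit syscall argument value.
func split(v uint64) (high32, low32 uint32) {
	return uint32(v >> 32), uint32(v)
}

func main() {
	for _, v := range []uint64{fGetFL, fSetFL, fGetFD} {
		high, low := split(v)
		fmt.Printf("high32bits=%#x low32bits=%#x\n", high, low)
	}
}
```

All three constants share the same high half (zero), which is what makes that half a candidate for extraction.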

gVisor then applies the same optimization as earlier, but this time going into each 32-bit half of each argument. This means it can extract the EqualTo32Bits(0) matcher from the high32bits part of each splitMatcher and move it up to the And expression like so:

rules = ...(map[uintptr]SyscallRule{
    SYS_FCNTL: And{
        PerArg{NonNegativeFD, AnyValue},
        PerArg{
            AnyValue,
            splitMatcher{
                high32bits: EqualTo32Bits(0),
                low32bits:  Any32BitsValue,
            },
        },
        Or{
            PerArg{
                AnyValue,
                splitMatcher{
                    high32bits: Any32BitsValue,
                    low32bits:  EqualTo32Bits(
                      F_GETFL & 0x00000000ffffffff /* = 0x3 */),
                },
            },
            PerArg{
                AnyValue,
                splitMatcher{
                    high32bits: Any32BitsValue,
                    low32bits:  EqualTo32Bits(
                      F_SETFL & 0x00000000ffffffff /* = 0x4 */),
                },
            },
            PerArg{
                AnyValue,
                splitMatcher{
                    high32bits: Any32BitsValue,
                    low32bits:  EqualTo32Bits(
                      F_GETFD & 0x00000000ffffffff /* = 0x1 */),
                },
            },
        },
    },
})

This looks bigger as a tree, but keep in mind that the AnyValue and Any32BitsValue matchers do not produce any bytecode. So now let’s render that tree to bytecode:

// Check for `seccomp_data.args[0]`:
  00: load32 16                // Load the first 32 bits of
                               //   `seccomp_data.args[0]` into register A.
  01: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  02: load32 20                // Load the second 32 bits of
                               //   `seccomp_data.args[0]` into register A.
  03: jset 0x80000000, @bad, 0 // If A & 0x80000000 != 0, jump to @bad,
                               //   otherwise continue.

// Check for `seccomp_data.args[1]`:
  04: load32 24                // Load the first 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  05: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  06: load32 28                // Load the second 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  07: jeq 0x3, @good, @next1   // If A == 0x3, jump to @good,
                               //   otherwise jump to @next1.

@next1:
  08: load32 28                // Load the second 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  09: jeq 0x4, @good, @next2   // If A == 0x4, jump to @good,
                               //   otherwise jump to @next2.

@next2:
  10: load32 28                // Load the second 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  11: jeq 0x1, @good, @bad     // If A == 0x1, jump to @good,
                               //   otherwise jump to @bad.

// Good/bad jump targets for the checks above to jump to:
@good:
  12: return ALLOW
@bad:
  13: return REJECT

This is where the bytecode-level optimization to remove redundant loads described earlier finally becomes relevant. We don’t need to load the second 32 bits of seccomp_data.args[1] multiple times in a row:

// Check for `seccomp_data.args[0]`:
  00: load32 16                // Load the first 32 bits of
                               //   `seccomp_data.args[0]` into register A.
  01: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  02: load32 20                // Load the second 32 bits of
                               //   `seccomp_data.args[0]` into register A.
  03: jset 0x80000000, @bad, 0 // If A & 0x80000000 != 0, jump to @bad,
                               //   otherwise continue.

// Check for `seccomp_data.args[1]`:
  04: load32 24                // Load the first 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  05: jeq 0, 0, @bad           // If A == 0, continue, otherwise jump to @bad.
  06: load32 28                // Load the second 32 bits of
                               //   `seccomp_data.args[1]` into register A.
  07: jeq 0x3, @good, @next1   // If A == 0x3, jump to @good,
                               //   otherwise jump to @next1.

@next1:
  08: jeq 0x4, @good, @next2   // If A == 0x4, jump to @good,
                               //   otherwise jump to @next2.

@next2:
  09: jeq 0x1, @good, @bad     // If A == 0x1, jump to @good,
                               //   otherwise jump to @bad.

// Good/bad jump targets for the checks above to jump to:
@good:
  10: return ALLOW
@bad:
  11: return REJECT

Of course, in practice the @good/@bad jump targets would also be unified with rules from other system call filters in order to cut down on those too. And by having reduced the number of instructions in each individual filtering rule, the jumps to these targets can be deduplicated against that many more rules.

This example demonstrates how optimizations build on top of each other, making each optimization more likely to make other optimizations useful in turn.

Other optimizations

Beyond these, gVisor also has the following minor optimizations.

Making `futex(2)` rules faster

futex(2) is by far the system call that gVisor makes most often as part of its operation. It is used for synchronization, so it needs to be very efficient.

Its rules used to look like this:

SYS_FUTEX: Or{
    PerArg{
        AnyValue,
        EqualTo(FUTEX_WAIT | FUTEX_PRIVATE_FLAG),
    },
    PerArg{
        AnyValue,
        EqualTo(FUTEX_WAKE | FUTEX_PRIVATE_FLAG),
    },
    PerArg{
        AnyValue,
        EqualTo(FUTEX_WAIT),
    },
    PerArg{
        AnyValue,
        EqualTo(FUTEX_WAKE),
    },
},

Essentially a 4-way Or between 4 different values allowed for seccomp_data.args[1]. This is all well and good, and the above optimizations already optimize this down to the minimum amount of jeq comparison operations.

But looking at the actual bit values of the FUTEX_* constants above:

FUTEX_WAIT         = 0x00
FUTEX_WAKE         = 0x01
FUTEX_PRIVATE_FLAG = 0x80

… We can see that this is equivalent to checking that no bits other than 0x01 and 0x80 may be set. Turns out that cBPF has an instruction for that. This is now optimized down to two comparison operations:

01: load32 24                     // Load the first 32 bits of
                                  //   `seccomp_data.args[1]` into register A.
02: jeq 0, 0, @bad                // If A == 0, continue,
                                  //   otherwise jump to @bad.
03: load32 28                     // Load the second 32 bits of
                                  //   `seccomp_data.args[1]` into register A.
04: jset 0xffffff7e, @bad, @good  // If A & ^(0x01 | 0x80) != 0, jump to @bad,
                                  //   otherwise jump to @good.
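
The equivalence between the four-value list and the single mask test can be checked exhaustively over a range of values. The sketch below only models the second 32-bit half of the argument (the half the jset instruction inspects); the function names are made up, and the constants are the values listed above.

```go
package main

import "fmt"

const (
	futexWait        uint32 = 0x00
	futexWake        uint32 = 0x01
	futexPrivateFlag uint32 = 0x80
)

// allowedByList mirrors the original 4-way Or.
func allowedByList(op uint32) bool {
	switch op {
	case futexWait, futexWake, futexWait | futexPrivateFlag, futexWake | futexPrivateFlag:
		return true
	}
	return false
}

// allowedByMask mirrors the jset check: reject if any bit outside
// (0x01 | 0x80) is set.
func allowedByMask(op uint32) bool {
	return op&0xffffff7e == 0
}

// agree reports whether both checks give the same answer for every op < limit.
func agree(limit uint32) bool {
	for op := uint32(0); op < limit; op++ {
		if allowedByList(op) != allowedByMask(op) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(agree(1 << 16)) // true
}
```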

Optimizing non-negative FD checks

A lot of syscall arguments are file descriptors (FD numbers), which we need to filter efficiently.

An FD is a non-negative 32-bit signed integer, but it is passed as a 64-bit value, as all syscall arguments are. Instead of doing a “less than” operation, we can turn the check into a bitwise one: we check that the first half of the 64-bit value is zero, and that the most significant bit (0x80000000) of the second half is not set.
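
In Go terms, the bitwise check looks roughly like the sketch below, where `high` and `low` stand for the halves the text calls first and second. The helper name `isNonNegativeFD` is hypothetical; the loop compares the bitwise form against the range check it replaces.

```go
package main

import (
	"fmt"
	"math"
)

// isNonNegativeFD reports whether a 64-bit syscall argument holds a
// non-negative 32-bit signed integer, using only 32-bit bitwise tests.
func isNonNegativeFD(arg uint64) bool {
	high, low := uint32(arg>>32), uint32(arg)
	return high == 0 && low&0x80000000 == 0
}

func main() {
	minusOne := int32(-1)
	args := []uint64{
		0, 3, math.MaxInt32, // valid FDs
		math.MaxInt32 + 1,         // sign bit of the low half set
		uint64(uint32(minusOne)),  // -1, zero-extended
		uint64(int64(minusOne)),   // -1, sign-extended
		1 << 32,                   // high half not zero
	}
	for _, arg := range args {
		// The second column is the bitwise check, the third the range check.
		fmt.Println(arg, isNonNegativeFD(arg), arg <= math.MaxInt32)
	}
}
```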

Enforcing consistency of argument-wise matchers

Enforcing that a syscall argument is checked the same way across all branches of an Or ensures that the optimization for repeated matchers remains effective.

The ioctl(2) system call takes an FD as one of its arguments. Since it is a “grab bag” of a system call, gVisor’s rules for ioctl(2) were similarly spread across many files and rules, and not all of them checked that the FD argument was non-negative; some of them simply accepted any value for the FD argument.

Before this optimization work, this meant that the BPF program did less work for the rules which didn’t check the value of the FD argument. However, now that gVisor optimizes repeated argument-wise matchers, it is actually cheaper if all ioctl(2) rules verify the value of the FD argument consistently, as that argument check can be performed exactly once for all possible branches of the ioctl(2) rules. So now gVisor has a test that verifies that this is the case. This is a good example of how optimization work can lead to improved security, thanks to the efficiency gains that come from applying security checks consistently.

`secbench`: Benchmarking `seccomp-bpf` programs

Measuring the effectiveness of the above improvements through gVisor's overall performance would be very difficult, because each improvement is a rather tiny part of the syscall hot path. At the scale of each of these optimizations, we need to zoom in a bit more.

So now gVisor has tooling for benchmarking seccomp-bpf programs. It works by taking a cBPF program along with several possible syscalls to try with it. It runs a subprocess that installs this program as seccomp-bpf filter for itself, replacing all actions (other than “approve syscall”) with “return error” in order to avoid crashing. Then it measures the latency of each syscall. This is then measured against the latency of the very same syscalls in a subprocess that has an empty seccomp-bpf (i.e. the only instruction within it is return ALLOW).

Let’s measure the effect of the above improvements on a gVisor-like workload.

Modeling gVisor `seccomp-bpf` behavior for benchmarking

This can be done by running gVisor under ptrace to see what system calls the gVisor process is doing.

Note that ptrace here refers to the mechanism by which we can inspect the system calls that the gVisor Sentry is making. This is distinct from the system calls the sandboxed application is making. It also has nothing to do with gVisor’s former “ptrace” platform.

For example, after running a Postgres benchmark inside gVisor with Systrap, the ptrace tool generated the following summary table:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.10  431.799048         496    870063     46227 futex
  4.23   29.399526         106    275649        38 nanosleep
  0.87    6.032292          37    160201           sendmmsg
  0.28    1.939492          16    115769           fstat
 27.96  194.415343        2787     69749       137 ppoll
  1.05    7.298717         315     23131           fsync
  0.06    0.446930          31     14096           pwrite64
  3.37   23.398106        1907     12266         9 epoll_pwait
  0.00    0.019711           9      1991         6 close
  0.02    0.116739          82      1414           tgkill
  0.01    0.068481          48      1414       201 rt_sigreturn
  0.02    0.147048         104      1413           getpid
  0.01    0.045338          41      1080           write
  0.01    0.039876          37      1056           read
  0.00    0.015637          18       836        24 openat
  0.01    0.066699          81       814           madvise
  0.00    0.029757         111       267           fallocate
  0.00    0.006619          15       420           pread64
  0.00    0.013334          35       375           sched_yield
  0.00    0.008112         114        71           pwritev2
  0.00    0.003005          57        52           munmap
  0.00    0.000343          18        19         6 unlinkat
  0.00    0.000249          15        16           shutdown
  0.00    0.000100           8        12           getdents64
  0.00    0.000045           4        10           newfstatat
...
------ ----------- ----------- --------- --------- ----------------
100.00  695.311111         447   1552214     46651 total

To mimic the syscall profile of this gVisor sandbox from the perspective of seccomp-bpf overhead, we need to have it call these system calls with the same relative frequency. Therefore, the dimension that matters here isn’t time or seconds or even usecs/call; it is actually just the number of system calls (calls). In graph form:

The Pareto distribution of system calls becomes immediately clear.

`seccomp-bpf` filtering overhead reduction

The secbench library lets us take the top 10 system calls and measure their seccomp-bpf filtering overhead individually, as well as building a weighted aggregate of their overall overhead. Here are the numbers from before and after the filtering optimizations described in this post:

The nanosleep(2) system call is a bit of an oddball here. Unlike the others, this system call causes the current thread to be descheduled. To make the results more legible, here is the same data with the duration normalized to the seccomp-bpf filtering overhead from before optimizations:

This shows that most system calls have had their filtering overhead reduced, but others haven’t significantly changed (10% or less change in either direction). This is to be expected: those that have not changed are the ones that are cacheable: nanosleep(2), fstat(2), ppoll(2), fsync(2), pwrite64(2), close(2), getpid(2). The non-cacheable syscalls which have dedicated checks before the main BST, futex(2) and sendmmsg(2), experienced the biggest boost. Lastly, epoll_pwait(2) is non-cacheable but doesn’t have a dedicated check before the main BST, so while it still sees a small performance gain, that gain is lower than its counterparts.

The “Aggregate” number comes from the secbench library and represents the total time difference spent in system calls after calling them using weighted randomness. It represents the average system call overhead that a Sentry using Systrap would incur. Therefore, per these numbers, these optimizations removed ~29% from gVisor’s overall seccomp-bpf filtering overhead.
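The weighting can be sketched as follows. This is a minimal illustration rather than the secbench implementation: the call counts are copied from the `calls` column of the summary table for the ten syscalls discussed above, while the function names and the idea of passing per-syscall overheads in as a dictionary are assumptions made for this example.

```python
import random

# Call counts for the ten syscalls discussed in this post, copied from the
# `calls` column of the summary table.
CALLS = {
    "futex": 870063,
    "nanosleep": 275649,
    "sendmmsg": 160201,
    "fstat": 115769,
    "ppoll": 69749,
    "fsync": 23131,
    "pwrite64": 14096,
    "epoll_pwait": 12266,
    "close": 1991,
    "getpid": 1413,
}


def relative_frequencies(calls):
    """Turn raw call counts into weights that sum to 1."""
    total = sum(calls.values())
    return {name: count / total for name, count in calls.items()}


def sample_syscalls(calls, n, seed=0):
    """Draw n syscall names with probability proportional to call count."""
    rng = random.Random(seed)
    names = list(calls)
    return rng.choices(names, weights=[calls[s] for s in names], k=n)


def weighted_aggregate(overhead_ns, calls):
    """Frequency-weighted mean of per-syscall filtering overheads.

    overhead_ns is a hypothetical mapping of syscall name to a measured
    overhead; no measured values are included in this sketch.
    """
    weights = relative_frequencies(calls)
    return sum(weights[name] * overhead_ns[name] for name in calls)


if __name__ == "__main__":
    for name, share in relative_frequencies(CALLS).items():
        print(f"{name:12s} {share:6.2%}")
```

Because the weights come straight from call counts, a change in futex(2) overhead moves the aggregate far more than the same change in getpid(2) would.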

Here is the same data for KVM, which has a slightly different syscall profile with ioctl(2) and rt_sigreturn(2) being critical for performance:

Lastly, let’s look at GPU workload performance. This benchmark enables gVisor’s experimental nvproxy feature for GPU support. What matters for this workload is ioctl(2) performance, as this is the system call used to issue commands to the GPU. Here is the seccomp-bpf filtering overhead of various CUDA control commands issued via ioctl(2):

As nvproxy adds a lot of complexity to the ioctl(2) filtering rules, this is where we see the most improvement from these optimizations.

`secfuzz`: Fuzzing `seccomp-bpf` programs

To ensure that the optimizations above don’t accidentally end up producing a cBPF program that behaves differently from the unoptimized one, gVisor also has seccomp-bpf fuzz tests.

Because gVisor knows which high-level filters went into constructing the seccomp-bpf program, it also automatically generates test cases from these filters, and the fuzzer verifies that each line and every branch of the optimized cBPF bytecode is executed, and that the result is the same as giving the same input to the unoptimized program.
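The differential part of this idea can be illustrated with a toy model. The sketch below does not use cBPF at all: the rule format, the two evaluators, and the input generator are all hypothetical stand-ins, and it checks only that results agree, not line or branch coverage.

```python
import random

ALLOW, ERRNO = "allow", "errno"

# Hypothetical high-level rules: (syscall number, required arg0 or None).
# None means the syscall is allowed regardless of its arguments.
RULES = [
    (0, None),
    (16, 3),
    (16, 4),
    (202, None),
]


def unoptimized(rules, nr, arg0):
    """Reference evaluation: scan every rule in order."""
    for rule_nr, rule_arg0 in rules:
        if nr == rule_nr and (rule_arg0 is None or arg0 == rule_arg0):
            return ALLOW
    return ERRNO


def build_optimized(rules):
    """'Optimized' evaluation: group rules by syscall number up front."""
    any_args = {nr for nr, a in rules if a is None}
    by_nr = {}
    for nr, a in rules:
        if a is not None:
            by_nr.setdefault(nr, set()).add(a)

    def evaluate(nr, arg0):
        if nr in any_args:
            return ALLOW
        return ALLOW if arg0 in by_nr.get(nr, ()) else ERRNO

    return evaluate


def seed_inputs(rules):
    """Derive inputs from the rules: one hit and two near misses per rule."""
    for nr, a in rules:
        hit = 0 if a is None else a
        yield nr, hit
        yield nr, hit + 1
        yield nr + 1, hit


def differential_fuzz(rules, iterations=10_000, seed=0):
    """Check that both evaluators agree on seeded and random inputs."""
    optimized = build_optimized(rules)
    rng = random.Random(seed)
    inputs = list(seed_inputs(rules))
    inputs += [(rng.randrange(0, 512), rng.randrange(0, 8))
               for _ in range(iterations)]
    for nr, arg0 in inputs:
        assert optimized(nr, arg0) == unoptimized(rules, nr, arg0), (nr, arg0)
    return len(inputs)
```

Seeding the inputs from the rules themselves matters: purely random syscall numbers and arguments would rarely land on the interesting boundaries.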

(Line or branch coverage of the unoptimized program is not enforceable, because without optimizations, the bytecode contains many redundant checks for which later branches can never be reached.)

Optimizing in-gVisor `seccomp-bpf` filtering

gVisor supports sandboxed applications adding seccomp-bpf filters onto themselves, and implements its own cBPF interpreter for this purpose.

Because the cBPF bytecode-level optimizations are lossless and are generally applicable to any cBPF program, they are applied onto programs uploaded by sandboxed applications to make filter evaluation faster in gVisor itself.

Additionally, gVisor removed the use of Go interfaces previously used for loading data from the BPF “input” (i.e. the seccomp_data struct for seccomp-bpf). This used to require an endianness-specific interface due to how the BPF interpreter was used in two places in gVisor: network processing (which uses network byte ordering), and seccomp-bpf (which uses native byte ordering). This interface has now been replaced with Go templates, yielding a 2x speedup on the reference simplistic seccomp-bpf filter. The more load instructions there are in the filter, the greater the effect. (Naturally, this also benefits network filtering performance!)

gVisor cBPF interpreter performance

The graph below shows the gVisor cBPF interpreter’s performance against three sample filters: the reference simplistic seccomp-bpf filter, and optimized vs unoptimized versions of gVisor’s own syscall filter (to represent a more complex filter).

`seccomp-bpf` filter result caching for sandboxed applications

Lastly, gVisor now also implements an in-sandbox caching mechanism for syscalls which do not depend on the instruction_pointer or syscall arguments. Unlike Linux’s seccomp-bpf cache, gVisor’s implementation also handles actions other than “allow”, and supports the entire set of cBPF instructions rather than the restricted emulator Linux uses for caching evaluation purposes. This removes the interpreter from the syscall hot path entirely for cacheable syscalls, further speeding up system calls from applications that use seccomp-bpf within gVisor.
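A minimal sketch of such a cache is shown below. It assumes the set of cacheable syscall numbers has already been determined by some earlier analysis of the filter; the class, its method names, and the example filter with its syscall numbers are hypothetical and are not gVisor’s implementation.

```python
class CachingFilter:
    """Caches filter verdicts for syscalls whose result cannot depend on
    arguments or the instruction pointer."""

    def __init__(self, evaluate, cacheable_syscalls):
        self._evaluate = evaluate                # full (slow) evaluation
        self._cacheable = frozenset(cacheable_syscalls)
        self._cache = {}                         # syscall number -> action
        self.slow_path_calls = 0

    def check(self, nr, args=(), ip=0):
        if nr in self._cache:
            return self._cache[nr]               # interpreter skipped
        self.slow_path_calls += 1
        action = self._evaluate(nr, args, ip)
        if nr in self._cacheable:
            self._cache[nr] = action             # any action, not just allow
        return action


def example_filter(nr, args, ip):
    """Hypothetical filter; the syscall numbers are arbitrary."""
    if nr == 39:
        return "allow"                           # independent of arguments
    if nr == 101:
        return "errno"                           # independent of arguments
    if nr == 16:
        return "allow" if args and args[0] == 3 else "errno"
    return "kill"
```

Here syscalls 39 and 101 would be evaluated once each and served from the cache afterwards, while syscall 16 always takes the slow path because its verdict depends on its first argument.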

Faster gVisor startup via filter precompilation

Due to these optimizations, the overall process of building the syscall filtering rules, rendering them to cBPF bytecode, and running all the optimizations can take quite a while (~10ms). As one of gVisor’s strengths is that it starts up much faster than a VM, this is an unacceptable delay.

Therefore, gVisor now precompiles the rules to optimized cBPF bytecode for most possible gVisor configurations. This means the runsc binary contains cBPF bytecode embedded in it for some subset of popular configurations, and it will use this bytecode rather than compiling the cBPF program from scratch during startup. If runsc is invoked with a configuration for which the cBPF bytecode isn’t embedded in the runsc binary, it will fall back to compiling the program from scratch.

Dealing with dynamic values in precompiled rules

One challenge with this approach is to support parts of the configuration that are only known at runsc startup time. For example, many filters act on a specific file descriptor used for interacting with the runsc process after startup over a Unix Domain Socket (called the “controller FD”). This is an integer that is only known at runtime, so its value cannot be embedded inside the optimized cBPF bytecode prepared at runsc compilation time.

To address this, the seccomp-bpf precompilation tooling supports the notion of 32-bit “variables”, and takes as input a function to render cBPF bytecode for a given key-value mapping of variables to placeholder 32-bit values. The precompiler calls this function twice with different arbitrary value mappings for each variable, and observes where these arbitrary values show up in the generated cBPF bytecode. This takes advantage of the fact that gVisor’s seccomp-bpf program generation is deterministic.

If the two cBPF programs are of the same byte length, and the placeholder values show up at exactly the same byte offsets within the cBPF bytecode both times, and the rest of the cBPF bytecode is byte-for-byte equivalent, the precompiler has very high confidence that these offsets are where the 32-bit variables are represented in the cBPF bytecode. It then stores these offsets as part of the embedded data inside the runsc binary. Finally, at runsc execution time, the bytes at these offsets are replaced with the now-known values of the variables.
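The procedure above can be sketched as follows. This is an illustration under stated assumptions, not the gVisor precompiler: `render` is a hypothetical deterministic renderer that emits made-up bytes, the placeholder constants are arbitrary, and values are assumed to be embedded as little-endian 32-bit integers.

```python
import struct


def render(variables):
    """Hypothetical deterministic renderer standing in for bytecode generation.

    Emits fixed bytes with the value of "controller_fd" embedded twice.
    """
    fd = struct.pack("<I", variables["controller_fd"])
    return (b"\x20\x00\x00\x00" + fd +
            b"\x15\x00\x01\x00" + fd +
            b"\x06\x00\x00\x00")


def find_all(haystack, needle):
    """Return every byte offset at which needle occurs in haystack."""
    offsets, start = [], haystack.find(needle)
    while start != -1:
        offsets.append(start)
        start = haystack.find(needle, start + 1)
    return offsets


def masked(program, offsets):
    """Zero out the 4 bytes at each offset so the remainder can be compared."""
    out = bytearray(program)
    for off in offsets:
        out[off:off + 4] = b"\x00\x00\x00\x00"
    return bytes(out)


def precompile(render_fn, names):
    """Render twice with different placeholders and locate each variable."""
    vals_a = {n: 0xA1B2C3D0 + i for i, n in enumerate(names)}
    vals_b = {n: 0x1A2B3C40 + i for i, n in enumerate(names)}
    prog_a, prog_b = render_fn(vals_a), render_fn(vals_b)
    if len(prog_a) != len(prog_b):
        raise ValueError("program length depends on variable values")
    offsets = {}
    for n in names:
        in_a = find_all(prog_a, struct.pack("<I", vals_a[n]))
        in_b = find_all(prog_b, struct.pack("<I", vals_b[n]))
        if not in_a or in_a != in_b:
            raise ValueError(f"cannot locate variable {n!r} reliably")
        offsets[n] = in_a
    all_offsets = [o for offs in offsets.values() for o in offs]
    if masked(prog_a, all_offsets) != masked(prog_b, all_offsets):
        raise ValueError("programs differ outside the variable offsets")
    return prog_a, offsets


def patch(template, offsets, values):
    """At startup, overwrite the placeholder bytes with the real values."""
    out = bytearray(template)
    for name, offs in offsets.items():
        packed = struct.pack("<I", values[name])
        for off in offs:
            out[off:off + 4] = packed
    return bytes(out)
```

With this toy renderer, `precompile(render, ["controller_fd"])` finds the variable at byte offsets 4 and 12, and patching those offsets with the real FD yields the same bytes as rendering from scratch with that FD.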

OK that’s great and all, but is gVisor actually faster?

The short answer is: yes, but only slightly. As we established earlier, seccomp-bpf is only a small portion of gVisor’s total overhead, and the secbench benchmark shows that this work only removes a portion of that overhead, so we should not expect large differences here.

Let’s come back to the trusty ABSL build benchmark, with a new build of gVisor with all of these optimizations turned on:

Let’s zoom the vertical axis in on the gVisor variants to see the difference better:

This is about in line with what the earlier benchmarks showed. The initial benchmarks showed that seccomp-bpf filtering overhead for this benchmark was on the order of ~3.6% of total runtime, and the secbench benchmarks showed that the optimizations reduced seccomp-bpf filter evaluation time by ~29% in aggregate. The final absolute reduction in total runtime should then be around ~1%, which is just about what this result shows.
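The estimate is just the product of the two figures quoted above, which a couple of lines can confirm (the variable names are chosen for this example):

```python
# Figures quoted in the text above.
filter_share_of_runtime = 0.036   # ~3.6% of total runtime spent on filtering
filter_time_reduction = 0.29      # ~29% less time spent evaluating filters

expected_runtime_reduction = filter_share_of_runtime * filter_time_reduction
print(f"{expected_runtime_reduction:.2%}")  # prints 1.04%
```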

Other benchmarks show a similar pattern. Here’s gRPC build, similar to ABSL:

Here’s a benchmark running the Ruby Fastlane test suite:

Here’s the 50th percentile of nginx serving latency for an empty webpage. Every microsecond counts when it comes to web serving, and here we’ve shaved off 20 of them.

CUDA workloads also get a boost from this work. Since their gVisor-related overhead is already relatively small, seccomp-bpf filtering makes up a higher proportion of their overhead. Additionally, as the performance improvements described in this post disproportionately help the ioctl(2) system call, this cuts a larger portion of the seccomp-bpf filtering overhead of these workloads, since CUDA uses the ioctl(2) system call to communicate with the GPU.

While some of these results may not seem like much in absolute terms, it’s important to remember:

These improvements have resulted in gVisor being able to enforce more seccomp-bpf filters than it previously could. Before optimizations, gVisor’s seccomp-bpf filter was nearly half the maximum seccomp-bpf program size, so it could at most double in complexity. After optimizations, it is reduced to less than a fourth of this size.
These improvements allow the gVisor filters to scale better. This is visible from the effects on ioctl(2) performance with nvproxy enabled.
The resulting work has produced useful libraries for seccomp-bpf tooling which may be helpful for other projects: testing, fuzzing, and benchmarking seccomp-bpf filters.
This overhead could not have been addressed in another way. Unlike other areas of gVisor, such as network overhead or file I/O, overhead from the host kernel evaluating seccomp-bpf filters lives outside of gVisor itself and therefore it can only be improved upon by this type of work.

Further work

One potential source of work is to look into the performance gap between no seccomp-bpf filter at all versus performance with an empty seccomp-bpf filter (equivalent to an all-cacheable filter). This points to a potential inefficiency in the Linux kernel implementation of the seccomp-bpf cache.

Another potential point of improvement is to port over the optimizations that went into searching for a syscall number into the ioctl(2) system call. ioctl(2) is a “grab-bag” kind of system call, used by many drivers and other subsets of the Linux kernel to extend the syscall interface without using up valuable syscall numbers. For example, the KVM subsystem is almost entirely controlled through ioctl(2) system calls issued against /dev/kvm or against per-VM file descriptors.

For this reason, the first non-file-descriptor argument of ioctl(2) (“request”) usually encodes something analogous to what the syscall number usually represents: the type of request made to the kernel. Currently, gVisor performs a linear scan through all possible enumerations of this argument. This is usually fine, but with features like nvproxy which massively expand this list of possible values, this can take a long time. ioctl performance is also critical for gVisor’s KVM platform. A binary search tree would make sense here.
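To give a feel for the difference, the sketch below counts comparison steps for a linear scan versus a binary search over a sorted list of request values. It is plain Python rather than cBPF, and the request values are arbitrary integers, not real ioctl(2) request numbers.

```python
def linear_match(requests, value):
    """Return (found, steps) when scanning every allowed request in order."""
    steps = 0
    for r in requests:
        steps += 1
        if r == value:
            return True, steps
    return False, steps


def binary_match(sorted_requests, value):
    """Return (found, steps) using binary search over sorted values."""
    lo, hi, steps = 0, len(sorted_requests), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_requests[mid] == value:
            return True, steps
        if sorted_requests[mid] < value:
            lo = mid + 1
        else:
            hi = mid
    return False, steps


# Hypothetical allow-list of 1024 request values.
REQUESTS = [0x1000 + 4 * i for i in range(1024)]
```

For the last entry of this 1024-element list, the linear scan takes 1024 steps while the binary search needs at most 11, which is why the gap widens as features like nvproxy grow the list.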

gVisor welcomes further contributions to its seccomp-bpf machinery. Thanks for reading!

cBPF does not have a canonical assembly-style representation. The assembly-like code in this blog post is close to the one used in bpfc, but diverges from it where doing so hopefully makes it clearer what’s happening, and all code is annotated with // comments.

Faster filesystem access with Directfs

2023-06-27T00:00:00-05:00

Directfs is now the default in runsc. This feature gives gVisor’s application kernel (the Sentry) secure direct access to the container filesystem, avoiding expensive round trips to the filesystem gofer. Learn more about this feature in the following blog that was originally posted on Google Open Source Blog.

Origins of the Gofer

gVisor is used internally at Google to run a variety of services and workloads. One of the challenges we faced while building gVisor was providing remote filesystem access securely to the sandbox. gVisor’s strict security model and defense in depth approach assumes that the sandbox may get compromised because it shares the same execution context as the untrusted application. Hence the sandbox cannot be given sensitive keys and credentials to access Google-internal remote filesystems.

To address this challenge, we added a trusted filesystem proxy called a “gofer”. The gofer runs outside the sandbox, and provides a secure interface for untrusted containers to access such remote filesystems. For architectural simplicity, gofers were also used to serve local filesystems as well as remote.

Isolating the Container Filesystem in runsc

When gVisor was open sourced as runsc, the same gofer model was copied over to maintain the same security guarantees. runsc was configured to start one gofer process per container which serves the container filesystem to the sandbox over a predetermined protocol (now LISAFS). However, a gofer adds a layer of indirection with significant overhead.

This gofer model (built for remote filesystems) brings very few advantages for the runsc use-case, where all the filesystems served by the gofer (like rootfs and bind mounts) are mounted locally on the host. The gofer directly accesses them using filesystem syscalls.

Linux provides some security primitives to effectively isolate local filesystems. These include mount namespaces, pivot_root and detached bind mounts¹. Directfs is a new filesystem access mode that uses these primitives to expose the container filesystem to the sandbox in a secure manner. The sandbox’s view of the filesystem tree is limited to just the container filesystem. The sandbox process is not given access to anything mounted on the broader host filesystem. Even if the sandbox gets compromised, these mechanisms provide additional barriers to prevent broader system compromise.

Directfs

In directfs mode, the gofer still exists as a cooperative process outside the sandbox. As usual, the gofer enters a new mount namespace, sets up appropriate bind mounts to create the container filesystem in a new directory and then pivot_root(2)s into that directory. Similarly, the sandbox process enters new user and mount namespaces and then pivot_root(2)s into an empty directory to ensure it cannot access anything via path traversal. But instead of making RPCs to the gofer to access the container filesystem, the sandbox asks the gofer to provide file descriptors to all the mount points via SCM_RIGHTS messages. The sandbox then directly makes file-descriptor-relative syscalls (e.g. fstatat(2), openat(2), mkdirat(2), etc) to perform filesystem operations.

Earlier when the gofer performed all filesystem operations, we could deny all these syscalls in the sandbox process using seccomp. But with directfs enabled, the sandbox process’s seccomp filters need to allow the usage of these syscalls. Most notably, the sandbox can now make openat(2) syscalls (which allow path traversal), but with certain restrictions: O_NOFOLLOW is required, no access to procfs and no directory FDs from the host. We also had to give the sandbox the same privileges as the gofer (for example CAP_DAC_OVERRIDE and CAP_DAC_READ_SEARCH), so it can perform the same filesystem operations.

It is noteworthy that only the trusted gofer provides FDs (of the container filesystem) to the sandbox. The sandbox cannot walk backwards (using ‘..’) or follow a malicious symlink to escape out of the container filesystem. In effect, we’ve decreased our dependence on the syscall filters to catch bad behavior, but correspondingly increased our dependence on Linux’s filesystem isolation protections.
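The FD donation and the FD-relative syscalls can be illustrated in a single process. This is a sketch, not runsc code: it assumes Linux and Python 3.9 or newer (for `socket.send_fds` and `socket.recv_fds`), the function names are made up, and it sets up none of the namespaces, pivot_root(2) calls or seccomp filters described above.

```python
import errno
import os
import socket
import tempfile


def gofer_donate_mount(sock, mount_path):
    """Trusted side: open the mount point and pass its FD over the socket."""
    fd = os.open(mount_path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        socket.send_fds(sock, [b"m"], [fd])      # sent as an SCM_RIGHTS message
    finally:
        os.close(fd)


def sandbox_receive_mount(sock):
    """Sandbox side: receive the directory FD donated by the gofer."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    return fds[0]


def demo():
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "hello.txt"), "wb") as f:
            f.write(b"hi")
        # A symlink pointing outside the "container filesystem".
        os.symlink("/etc/hostname", os.path.join(root, "evil"))

        gofer_sock, sandbox_sock = socket.socketpair(
            socket.AF_UNIX, socket.SOCK_STREAM)
        with gofer_sock, sandbox_sock:
            gofer_donate_mount(gofer_sock, root)
            mount_fd = sandbox_receive_mount(sandbox_sock)
        try:
            # fstatat(2)-style call relative to the donated FD.
            size = os.stat("hello.txt", dir_fd=mount_fd,
                           follow_symlinks=False).st_size
            # openat(2)-style call with O_NOFOLLOW.
            fd = os.open("hello.txt", os.O_RDONLY | os.O_NOFOLLOW,
                         dir_fd=mount_fd)
            try:
                content = os.read(fd, 16)
            finally:
                os.close(fd)
            # mkdirat(2)-style call.
            os.mkdir("subdir", dir_fd=mount_fd)
            # O_NOFOLLOW makes opening the symlink itself fail.
            try:
                os.close(os.open("evil", os.O_RDONLY | os.O_NOFOLLOW,
                                 dir_fd=mount_fd))
                symlink_refused = False
            except OSError as e:
                symlink_refused = e.errno == errno.ELOOP
        finally:
            os.close(mount_fd)
        return {
            "size": size,
            "content": content,
            "made_dir": os.path.isdir(os.path.join(root, "subdir")),
            "symlink_refused": symlink_refused,
        }
```

Note that O_NOFOLLOW only applies to the final path component, so in this sketch it is only one of the protections the text describes; the mount namespace and pivot_root(2) setup are what bound the sandbox’s view of the tree.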

Performance

Making RPCs to the gofer for every filesystem operation adds a lot of overhead to runsc. Hence, avoiding gofer round trips significantly improves performance. Let’s find out what this means for some of our benchmarks. We will run the benchmarks using our newly released systrap platform on bind mounts (as opposed to rootfs). This simulates more realistic use cases because bind mounts are extensively used while configuring filesystems in containers. Bind mounts also do not have an overlay (unlike the rootfs mount), so all operations go through the goferfs / directfs mount.

Let’s first look at our stat micro-benchmark, which repeatedly calls stat(2) on a file.

The stat(2) syscall is more than 2x faster! However, since this is not representative of real-world applications, we should not extrapolate these results. So let’s look at some real-world benchmarks.

We see a 12% reduction in the absolute time to run these workloads and a 17% reduction in Ruby load time!

Conclusion

The gofer model in runsc was overly restrictive for accessing host files. We were able to leverage existing filesystem isolation mechanisms in Linux to bypass the gofer without compromising security. Directfs significantly improves performance for certain workloads. This is part of our ongoing efforts to improve gVisor performance. You can learn more about gVisor at gvisor.dev. You can also use gVisor in GKE with GKE Sandbox. Happy sandboxing!

¹ Detached bind mounts can be created by first creating a bind mount using mount(MS_BIND) and then detaching it from the filesystem tree using umount(MNT_DETACH).

Running Stable Diffusion on GPU with gVisor

2023-06-20T00:00:00-05:00

gVisor is starting to support GPU workloads. This post showcases running the Stable Diffusion generative model from Stability AI to generate images using a GPU from within gVisor. Both the Automatic1111 Stable Diffusion web UI and the PyTorch code used by Stable Diffusion were run entirely within gVisor while being able to leverage the NVIDIA GPU.

Sandboxing a GPU. Generated with Stable Diffusion v1.5.
This picture gets a lot deeper once you realize that GPUs are made out of sand.

Disclaimer

As of this writing (2023-06), gVisor’s GPU support is not generalized. Only some PyTorch workloads have been tested on NVIDIA T4, L4, A100, and H100 GPUs, and only with the specific driver versions that your runsc version supports. You can list those versions using either of the commands below. Contributions are welcome to expand this set to support other GPUs and driver versions!

# From a cloned gVisor repository:
$ make run TARGETS=runsc ARGS="nvproxy list-supported-drivers"

# From a runsc binary:
$ runsc nvproxy list-supported-drivers

Additionally, while gVisor does its best to sandbox the workload, interacting with the GPU inherently requires running code on GPU hardware, where isolation is enforced by the GPU driver and hardware itself rather than gVisor. More to come soon on the value of the protection gVisor provides for GPU workloads.

In a few months, gVisor’s GPU support will have broadened and become easier to use, such that it will not be constrained to the specific sets of versions used here. In the meantime, this blog stands as an example of what’s possible today with gVisor’s GPU support.

A collection of astronaut helmets in various styles.
Other than the helmet in the center, each helmet was generated using Stable Diffusion v1.5.

Why even do this?

The recent explosion of machine learning models has led to a large number of new open-source projects. Much like it is good practice to be careful about running new software downloaded from the Internet, it is good practice to run new open-source projects in a sandbox. For projects like the Automatic1111 Stable Diffusion web UI, which automatically download various models, components, and extensions from external repositories as the user enables them in the web UI, this principle applies all the more.

Additionally, within the machine learning space, tooling for packaging and distributing models is still nascent. While some models (including Stable Diffusion) are packaged using the more secure safetensors format, the majority of models available online today are distributed using the Pickle format, which can execute arbitrary Python code upon deserialization. As such, even when using trustworthy software, using Pickle-formatted models may still be risky (Edited 2024-04-04: this exact vulnerability vector was found in Hugging Face’s Inference API). gVisor provides a layer of protection around this process which helps protect the host machine.

Third, machine learning applications are typically not I/O heavy, which means they tend not to experience a significant performance overhead. Uploading code to the GPU does not take a significant number of system calls, and most communication to/from the GPU happens over shared memory, where gVisor imposes no overhead. Therefore, the question is not so much “why should I run this GPU workload in gVisor?” but rather “why not?”.

Cool astronauts don’t look at explosions. Generated using Stable Diffusion v1.5.

Lastly, running GPU workloads in gVisor is pretty cool.

Setup

We use a Debian virtual machine on GCE. The machine needs to have a GPU and to have sufficient RAM and disk space to handle Stable Diffusion and its large model files. The following command creates a VM with 4 vCPUs, 15GiB of RAM, 64GB of disk space, and an NVIDIA T4 GPU, running Debian 11 (bullseye). Since this is just an experiment, the VM is set to self-destruct after 6 hours.

$ gcloud compute instances create stable-diffusion-testing \
    --zone=us-central1-a \
    --machine-type=n1-standard-4 \
    --max-run-duration=6h \
    --instance-termination-action=DELETE \
    --maintenance-policy TERMINATE \
    --accelerator=count=1,type=nvidia-tesla-t4 \
    --create-disk=auto-delete=yes,boot=yes,device-name=stable-diffusion-testing,image=projects/debian-cloud/global/images/debian-11-bullseye-v20230509,mode=rw,size=64
$ gcloud compute ssh --zone=us-central1-a stable-diffusion-testing

All further commands in this post are performed while SSH’d into the VM. We first need to install the specific NVIDIA driver version that gVisor is currently compatible with.

$ sudo apt-get update && sudo apt-get -y upgrade
$ sudo apt-get install -y build-essential linux-headers-$(uname -r)
$ runsc nvproxy list-supported-drivers
$ DRIVER_VERSION=some-driver-version # Get from your runsc binary.
$ curl -fSsl -O "https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-x86_64-$DRIVER_VERSION.run"
$ sudo sh NVIDIA-Linux-x86_64-$DRIVER_VERSION.run

Next, we install Docker, per its instructions.

$ sudo apt-get install -y ca-certificates curl gnupg
$ sudo install -m 0755 -d /etc/apt/keyrings
$ curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor --batch --yes -o /etc/apt/keyrings/docker.gpg
$ sudo chmod a+r /etc/apt/keyrings/docker.gpg
$ echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
$ sudo apt-get update && sudo apt-get install -y docker-ce docker-ce-cli

We will also need the NVIDIA container toolkit, which enables use of GPUs with Docker. Per its installation instructions:

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

Of course, we also need to install gVisor itself.

$ sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
$ curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg
$ echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" | sudo tee /etc/apt/sources.list.d/gvisor.list > /dev/null
$ sudo apt-get update && sudo apt-get install -y runsc

# As gVisor does not yet enable GPU support by default, we need to set the flags
# that will enable it:
$ sudo runsc install -- --nvproxy=true --nvproxy-docker=true

$ sudo systemctl restart docker

Now, let’s make sure everything works by running commands that involve more and more of what we just set up.

# Check that the NVIDIA drivers are installed, with the right version, and with
# a supported GPU attached
$ sudo nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa)
$ sudo cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  DRIVER_VERSION  Wed Nov 30 06:39:21 UTC 2022

# Check that Docker works.
$ sudo docker version
# [...]
Server: Docker Engine - Community
 Engine:
  Version:          24.0.2
# [...]

# Check that gVisor works.
$ sudo docker run --rm --runtime=runsc debian:latest dmesg | head -1
[    0.000000] Starting gVisor...

# Check that Docker GPU support (without gVisor) works.
$ sudo docker run --rm --gpus=all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa)

# Check that gVisor works with the GPU.
$ sudo docker run --rm --runtime=runsc --gpus=all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-6a96a2af-2271-5627-34c5-91dcb4f408aa)

We’re all set! Now we can actually get Stable Diffusion running.

We used the following Dockerfile to run Stable Diffusion and its web UI within a GPU-enabled Docker container.

FROM python:3.10

# Set of dependencies that are needed to make this work.
RUN apt-get update && apt-get install -y git wget build-essential \
        nghttp2 libnghttp2-dev libssl-dev ffmpeg libsm6 libxext6
# Clone the project at the revision used for this test.
RUN git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git && \
    cd /stable-diffusion-webui && \
    git checkout baf6946e06249c5af9851c60171692c44ef633e0
# We don't want the build step to start the server.
RUN sed -i '/start()/d' /stable-diffusion-webui/launch.py
# Install some pip packages.
# Note that this command will run as part of the Docker build process,
# which is *not* sandboxed by gVisor.
RUN cd /stable-diffusion-webui && COMMANDLINE_ARGS=--skip-torch-cuda-test python launch.py
WORKDIR /stable-diffusion-webui
# This causes the web UI to use the Gradio service to create a public URL.
# Do not use this if you plan on leaving the container running long-term.
ENV COMMANDLINE_ARGS=--share
# Start the webui app.
CMD ["python", "webui.py"]

We build the image and create a container with it using the docker command-line.

$ cat > Dockerfile
(... Paste the above contents...)
^D
$ sudo docker build --tag=sdui .

Finally, we can start the Stable Diffusion web UI. Note that it will take a long time to start, as it has to download all the models from the Internet. To keep this post simple, we didn’t set up any kind of volume that would enable data persistence, so it will do this every time the container starts.

$ sudo docker run --runtime=runsc --gpus=all --name=sdui --detach sdui

# Follow the logs:
$ sudo docker logs -f sdui
# [...]
Calculating sha256 for /stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors: Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://4446d982b4129a66d7.gradio.live

This share link expires in 72 hours.
# [...]

We’re all set! Now we can browse to the Gradio URL shown in the logs and start generating pictures, all within the secure confines of gVisor.

Stable Diffusion Web UI screenshot. Inner image generated with Stable Diffusion v1.5.

Happy sandboxing!

Happy sandboxing! Generated with Stable Diffusion v1.5.

Rootfs Overlay

2023-05-08T00:00:00-05:00

Root filesystem overlay is now the default in runsc. This improves performance for filesystem-heavy workloads by overlaying the container root filesystem with a tmpfs filesystem. Learn more about this feature in the following blog that was originally posted on Google Open Source Blog.

Costly Filesystem Access

gVisor uses a trusted filesystem proxy process (“gofer”) to access the filesystem on behalf of the sandbox. The sandbox process is considered untrusted in gVisor’s security model. As a result, it is not given direct access to the container filesystem and its seccomp filters do not allow filesystem syscalls.

In gVisor, the container rootfs and bind mounts are configured to be served by a gofer.

When the container needs to perform a filesystem operation, it makes an RPC to the gofer which makes host system calls and services the RPC. This is quite expensive due to:

RPC cost: This is the cost of communicating with the gofer process, including process scheduling, message serialization and IPC system calls.
- To ameliorate this, gVisor recently developed a purpose-built protocol called LISAFS which is much more efficient than its predecessor.
- gVisor is also experimenting with giving the sandbox direct access to the container filesystem in a secure manner. This would essentially nullify RPC costs as it avoids the gofer being in the critical path of filesystem operations.
Syscall cost: This is the cost of making the host syscall which actually accesses/modifies the container filesystem. Syscalls are expensive, because they perform context switches into the kernel and back into userspace.
- To help with this, gVisor heavily caches the filesystem tree in memory. So operations like stat(2) on cached files are serviced quickly. But other operations like mkdir(2) or rename(2) still need to make host syscalls.

Container Root Filesystem

In Docker and Kubernetes, the container’s root filesystem (rootfs) is based on the filesystem packaged with the image. The image’s filesystem is immutable. Any change a container makes to the rootfs is stored separately and is destroyed with the container. This way, the image’s filesystem can be shared efficiently with all containers running the same image. This is different from bind mounts, which allow containers to access the bound host filesystem tree. Changes to bind mounts are always propagated to the host and persist after the container exits.

Docker and Kubernetes both use the overlay filesystem by default to configure container rootfs. Overlayfs mounts are composed of one upper layer and multiple lower layers. The overlay filesystem presents a merged view of all these filesystem layers at its mount location and ensures that lower layers are read-only while all changes are held in the upper layer. The lower layer(s) constitute the “image layer” and the upper layer is the “container layer”. When the container is destroyed, the upper layer mount is destroyed as well, discarding the root filesystem changes the container may have made. Docker’s overlayfs driver documentation has a good explanation.

Rootfs Configuration Before

Let’s consider an example where the image has files foo and baz. The container overwrites foo and creates a new file bar. The diagram below shows how the root filesystem was previously configured in gVisor: we went through the gofer to access and mutate the overlaid directory on the host. It also shows the state of the host overlay filesystem.

Opportunity! Sandbox Internal Overlay

Given that the upper layer is destroyed with the container and that it is expensive to access/mutate a host filesystem from the sandbox, why keep the upper layer on the host at all? Instead we can move the upper layer into the sandbox.

The idea is to overlay the rootfs using a sandbox-internal overlay mount. We can use a tmpfs upper (container) layer and a read-only lower layer served by the gofer client. Any changes to rootfs would be held in tmpfs (in-memory). Accessing/mutating the upper layer would not require any gofer RPCs or syscalls to the host. This really speeds up filesystem operations on the upper layer, which contains newly created or copied-up files and directories.
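To make the overlay semantics concrete, here is a minimal in-memory sketch of a merged view with a read-only lower layer and a writable upper layer, using the foo/baz/bar example. The types and function names (layer, overlay, newOverlay) are hypothetical and are not gVisor code; the real implementation deals with directories, copy-up of partial writes, and much more.

```go
package main

import "fmt"

// layer maps a path to file contents; it stands in for one filesystem layer.
type layer map[string]string

// overlay is a toy model of an overlay mount: a read-only lower layer
// (the image) and a writable upper layer (the container layer).
type overlay struct {
	lower     layer           // never modified
	upper     layer           // receives every write
	whiteouts map[string]bool // paths hidden from the merged view
}

func newOverlay(lower layer) *overlay {
	return &overlay{lower: lower, upper: layer{}, whiteouts: map[string]bool{}}
}

// read returns the merged view: whiteouts hide a path, otherwise the
// upper layer wins over the lower layer.
func (o *overlay) read(path string) (string, bool) {
	if o.whiteouts[path] {
		return "", false
	}
	if data, ok := o.upper[path]; ok {
		return data, true
	}
	data, ok := o.lower[path]
	return data, ok
}

// write stores data in the upper layer only.
func (o *overlay) write(path, data string) {
	delete(o.whiteouts, path)
	o.upper[path] = data
}

// remove drops any upper copy and records a whiteout so that a lower
// file with the same name no longer appears in the merged view.
func (o *overlay) remove(path string) {
	delete(o.upper, path)
	o.whiteouts[path] = true
}

func main() {
	image := layer{"foo": "foo from image", "baz": "baz from image"}
	root := newOverlay(image)

	root.write("foo", "foo rewritten by container") // overwrite an image file
	root.write("bar", "bar created by container")   // create a new file

	for _, name := range []string{"foo", "bar", "baz"} {
		data, _ := root.read(name)
		fmt.Printf("%s: %s\n", name, data)
	}
	fmt.Println("image layer still holds:", image["foo"])
}
```

Every write lands in the upper map, so in this model nothing that the container changes requires touching the lower layer. That is the property that lets the upper layer live inside the sandbox.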

Using the same example as above, the following diagram shows what the rootfs configuration would look like using a sandbox-internal overlay.

Host-Backed Overlay

The tmpfs mount by default will use the sandbox process’s memory to back all the file data in the mount. This can cause sandbox memory usage to blow up and exhaust the container’s memory limits, so it’s important to store all file data from tmpfs upper layer on disk. We need to have a tmpfs-backing “filestore” on the host filesystem. Using the example from above, this filestore on the host will store file data for foo and bar.

This would essentially flatten all regular files in tmpfs into one host file. The sandbox can mmap(2) the filestore into its address space. This allows it to access and mutate the filestore very efficiently, without incurring gofer RPCs or syscalls overheads.

Self-Backed Overlay

In Kubernetes, you can set local ephemeral storage limits. The upper layer of the rootfs overlay (writeable container layer) on the host contributes towards this limit. The kubelet enforces this limit by traversing the entire upper layer, stat(2)-ing all files and summing up their stat.st_blocks*block_size. If we move the upper layer into the sandbox, then the host upper layer is empty and the kubelet will not be able to enforce these limits.

To address this issue, we introduced “self-backed” overlays, which create the filestore in the host upper layer. This way, when the kubelet scans the host upper layer, the filestore will be detected and its stat.st_blocks should be representative of the total file usage in the sandbox-internal upper layer. It is also important to hide this filestore from the containerized application to avoid confusing it. We do so by creating a whiteout in the sandbox-internal upper layer, which blocks this file from appearing in the merged directory.

The following diagram shows what rootfs configuration would finally look like today in gVisor.

Performance Gains

Let’s look at some filesystem-intensive workloads to see how rootfs overlay impacts performance. These benchmarks were run on a gLinux desktop with KVM platform.

Micro Benchmark

Linux Test Project provides a fsstress binary. This program performs a large number of filesystem operations concurrently, creating and modifying a large filesystem tree of all sorts of files. We ran this program on the container’s root filesystem. The exact usage was:

sh -c "mkdir /test && time fsstress -d /test -n 500 -p 20 -s 1680153482 -X -l 10"

You can use the -v flag (verbose mode) to see what filesystem operations are being performed.

The results were astounding! Rootfs overlay reduced the time to run this fsstress program from 262.79 seconds to 3.18 seconds! However, note that such microbenchmarks are not representative of real-world applications and we should not extrapolate these results to real-world performance.

Real-world Benchmark

Build jobs are very filesystem-intensive workloads. They read a lot of source files, compile them, and write out binaries and object files. Let’s consider building the abseil-cpp project with bazel. Bazel performs a lot of filesystem operations in the rootfs, specifically in its cache located at ~/.cache/bazel/.

This is representative of the real-world because many other applications also use the container root filesystem as scratch space due to the handy property that it disappears on container exit. To make this more realistic, the abseil-cpp repo was attached to the container using a bind mount, which does not have an overlay.

When measuring performance, we care about reducing the sandboxing overhead and bringing gVisor performance as close as possible to unsandboxed performance. Sandboxing overhead can be calculated using the formula overhead = (s-n)/n where s is the amount of time taken to run a workload inside gVisor sandbox and n is the time taken to run the same workload natively (unsandboxed). The following graph shows that rootfs overlay halved the sandboxing overhead for abseil build!
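As a worked example of the overhead formula, the sketch below computes the overhead and then inverts the formula to see what halving it means for run time. The timings are hypothetical values chosen for easy arithmetic; they are not measurements from the abseil build.

```go
package main

import "fmt"

// overhead returns the sandboxing overhead (s-n)/n, where s is the
// sandboxed run time and n is the native run time.
func overhead(s, n float64) float64 {
	return (s - n) / n
}

// sandboxedTime inverts the formula: s = n * (1 + o).
func sandboxedTime(n, o float64) float64 {
	return n * (1 + o)
}

func main() {
	const native = 100.0 // hypothetical native run time, in seconds
	const before = 160.0 // hypothetical sandboxed run time, in seconds

	o := overhead(before, native)
	after := sandboxedTime(native, o/2)

	fmt.Printf("overhead before: %.0f%%\n", o*100)
	fmt.Printf("overhead after halving: %.0f%% (run time %.0fs)\n", o/2*100, after)
}
```

Halving the overhead does not halve the run time. With these made-up numbers the overhead drops from 60% to 30%, and the sandboxed run goes from 160 seconds to 130 seconds.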

Conclusion

Rootfs overlay in gVisor substantially improves performance for many filesystem-intensive workloads, so that developers no longer have to make large tradeoffs between performance and security. We recently made this optimization the default in runsc. This is part of our ongoing efforts to improve gVisor performance. You can use gVisor in GKE with GKE Sandbox. Happy sandboxing!

Releasing Systrap - A high-performance gVisor platform

2023-04-28T00:00:00-05:00

We are releasing a new gVisor platform: Systrap. Like the existing ptrace platform, Systrap runs on most Linux machines out of the box without virtualization. Unlike the ptrace platform, it’s fast 🚀. Go try it by adding --platform=systrap to the runsc flags. If you want to know more about it, read on.

gVisor is a security boundary for arbitrary Linux processes. Boundaries do not come for free, and gVisor imposes some performance overhead on sandboxed applications. One of the most fundamental performance challenges with the security model implemented by gVisor is system call interception, which is the focus of this post.

To recap on the security model: gVisor is an application kernel that implements the Linux ABI. This includes system calls, signals, memory management, and more. For example, when a sandboxed application calls read(2), it actually transparently calls into gVisor’s implementation of this system call. This minimizes the attack surface of the host kernel, because sandboxed programs simply can’t make system calls directly to the host in the first place¹. This interception happens through an internal layer called the Platform interface, which we have written about in a previous blog post. To handle these interceptions, this interface must also create new address spaces, allocate memory, and create execution contexts to run the workload.

gVisor had two platform implementations: KVM and ptrace. The KVM platform uses the kernel’s KVM functionality to allow the Sentry to act as both guest OS and VMM (Virtual machine monitor). It does system call interception just like a normal virtual machine would. This gives good performance when using bare-metal virtualization, but has a noticeable impact with nested virtualization. The other obvious downside is that it requires support for nested virtualization in the first place, which is not supported by all hardware (such as ARM CPUs) or within some Cloud environments.

The ptrace platform was the alternative wherever KVM was not available. It works through the PTRACE_SYSEMU action, which makes the user process hand back execution to the sentry whenever it encounters a system call. This is a clean method to achieve system call interception in any environment, virtualized or not, except that it’s quite slow. To see how slow, an unrealistic but highly illustrative benchmark to use is the getpid benchmark². This benchmark runs the getpid(2) system call in a tight while loop. No useful application has this behavior, so it is not a realistic benchmark, but it is well-suited to measure system call latency.
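A getpid-style microbenchmark is easy to reproduce. The sketch below is not the benchmark used for the measurements in this post; it only illustrates the idea of timing a tight loop of getpid(2) calls. The function name getpidLatency and the iteration count are arbitrary choices.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// getpidLatency issues getpid(2) in a tight loop and returns the mean
// time per call. The system call does almost no work, so the result is
// dominated by the cost of entering and leaving the kernel (or, inside
// a sandbox, the cost of system call interception).
func getpidLatency(iterations int) time.Duration {
	if iterations <= 0 {
		return 0
	}
	start := time.Now()
	for i := 0; i < iterations; i++ {
		syscall.Getpid()
	}
	return time.Since(start) / time.Duration(iterations)
}

func main() {
	fmt.Println("mean getpid latency:", getpidLatency(1000000))
}
```

Running the same binary natively and under runsc with different --platform values gives a rough feel for the interception cost of each platform.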

All getpid runs have been performed on a GCE n2-standard-4 VM, with the debian-11-bullseye-v20230306 image.

While this benchmark is not applicable to most real-world workloads, just about any workload will generally suffer from high overhead in system call performance. Since running in a virtualized environment is the default state for most cloud users these days, it’s important that gVisor performs well in this context. Systrap is the new platform targeting this important use case.

Systrap relies on multiple techniques to implement the Platform interface. Like the ptrace platform, Systrap uses Linux’s ptrace subsystem to initialize workload executor threads, which are started as child processes of the main gVisor sentry process. Systrap additionally sets a very restrictive seccomp filter, installs a custom signal handler, and allocates chunks of memory shared between user threads and runsc sentry. This shared memory is what serves as the main form of communication between the sentry and sandboxed programs: whenever the sandboxed process attempts to execute a system call, it triggers a SIGSYS signal which is handled by our signal handler. The signal handler in turn populates shared memory regions, and requests the sentry to handle the requested system call. This alone proved to be faster than using PTRACE_SYSEMU, as demonstrated by the getpid benchmark:

Can we make it even faster? Recall what the main purpose of our signal handler is: to send a request to the sentry via shared memory. To do that, the sandboxed process must first incur the overhead of executing the seccomp filter³, and then generating a full signal stack before being able to run the signal handler. What if there was a way to simply have the sandboxed process jump to another user-space function when it wanted to perform a system call? Well, turns out, there is⁴! There is a popular x86 instruction pattern that’s used to perform system calls, and it goes a little something like this: mov sysno, %eax; syscall. The size of the mov instruction is 5 bytes and the size of the syscall instruction is 2 bytes. Luckily this is just enough space to fit in a jmp *%gs:offset instruction. When the signal handler sees this instruction pattern, it signals to the sentry that the original instructions can be replaced with a jmp to trampoline code that performs the same function as the regular SIGSYS signal handler. The system call number is not lost, but rather encoded in the offset. The results are even more impressive:
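To make the instruction pattern described above concrete: on x86-64, mov with a 32-bit immediate into %eax is encoded as the byte B8 followed by four little-endian immediate bytes, and syscall is encoded as 0F 05, for 7 bytes in total. The sketch below only recognizes that pattern in a byte slice and recovers the system call number. It does not patch anything, and a naive linear scan like this can match bytes in the middle of other instructions, which is why it is only an illustration and not how the platform locates call sites. The names site and findSyscallSites are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// site describes one occurrence of `mov $sysno, %eax; syscall`.
type site struct {
	offset int    // byte offset of the mov instruction
	sysno  uint32 // immediate loaded into %eax
}

// findSyscallSites scans machine code for the 7-byte sequence
// B8 imm32 0F 05.
func findSyscallSites(code []byte) []site {
	const patternLen = 7 // 5-byte mov + 2-byte syscall
	var sites []site
	for i := 0; i+patternLen <= len(code); i++ {
		if code[i] == 0xB8 && code[i+5] == 0x0F && code[i+6] == 0x05 {
			sysno := binary.LittleEndian.Uint32(code[i+1 : i+5])
			sites = append(sites, site{offset: i, sysno: sysno})
		}
	}
	return sites
}

func main() {
	code := []byte{
		0x90,                         // nop
		0xB8, 0x27, 0x00, 0x00, 0x00, // mov $39, %eax (getpid on x86-64)
		0x0F, 0x05, // syscall
		0xC3, // ret
	}
	for _, s := range findSyscallSites(code) {
		fmt.Printf("offset %d: syscall number %d\n", s.offset, s.sysno)
	}
}
```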

As mentioned, the getpid benchmark is not representative of real-world performance. To get a better picture of the magnitude of improvement, here are some real-world workloads:

The Build ABSL benchmark measures compilation performance by compiling abseil.io; this is a highly system call dependent workload due to needing to do a lot of I/O filesystem operations (gVisor’s file system overhead is also dependent upon file system isolation it implements, which is something you can learn about here).
The ffmpeg benchmark runs a multimedia processing tool, to perform video stream encoding/decoding for example; this workload does not require a significant number of system calls and there are very few userspace to kernel mode switches.
The Tensorflow benchmark trains a variety of machine learning models on CPU; the system-call usage of this workload is in between compilation and ffmpeg, due to needing to retrieve training and validation data, but the majority of time is still spent just running userspace computations.
Finally, the Redis benchmark performs SET RPC calls with 5 concurrent clients, measures the latency that each call takes to execute, and reports the median (scaled by 250,000 to fit the graph’s axis); this workload is heavily bounded by system call performance due to high network stack usage.

Systrap will replace the ptrace platform by September 2023 and become the default. Until then, we are working really hard to make it production-ready, which includes working on additional performance and stability improvements, and making sure we maintain a high bar for security through targeted fuzz-testing for Systrap specifically.

In the meantime, we would like gVisor users to try it out, and give us feedback! If you run gVisor using ptrace today (either by specifying --platform ptrace or not specifying the --platform flag at all), or you use the KVM platform with nested virtualization, switching to Systrap should be a drop-in performance upgrade. All you have to do is specify --platform systrap to runsc. If you encounter any issues, please let us know at gvisor.dev/issues.

¹ Even if the sandbox itself is compromised, it will still be bound by several defense-in-depth layers, including a restricted set of seccomp filters. You can find more details here: https://gvisor.dev/blog/2020/09/18/containing-a-real-vulnerability/.
² Once the system call has been intercepted by gVisor (or in the case of Linux, once the process has entered kernel-mode), actually executing the getpid system call itself is very fast, so this benchmark effectively measures single-thread syscall-interception overhead.
³ Seccomp filters are known to have a “not insubstantial” overhead: https://lwn.net/Articles/656307/.
⁴ On the x86_64 architecture. ARM does not have this optimization as of the time of writing.

How we Eliminated 99% of gVisor Networking Memory Allocations with Enhanced Buffer Pooling

2022-10-24T00:00:00-05:00

In an earlier blog post about networking security, we described how and why gVisor implements its own userspace network stack in the Sentry (gVisor kernel). In summary, we’ve implemented our networking stack – aka Netstack – in Go to minimize exposure to unsafe code and avoid using an unsafe Foreign Function Interface. With Netstack, gVisor can do all packet processing internally and only has to enable a few host I/O syscalls for near-complete networking capabilities. This keeps gVisor’s exposure to host vulnerabilities as narrow as possible.

Although writing Netstack in Go was important for runtime safety, up until now it had an undeniable performance cost. iperf benchmarks showed Netstack was spending between 20% and 30% of its processing time allocating memory and pausing for garbage collection, a slowdown that limited gVisor’s ability to efficiently sandbox networking workloads. In this blog we will show how we crafted a cure for Netstack’s allocation addiction, reducing allocations by 99%, while also increasing gVisor networking throughput by 30+%.

A Waste Management Problem

Go guarantees a basic level of memory safety through the use of a garbage collector (GC), which is described in great detail by the Go team here. The Go runtime automatically tracks and frees objects allocated from the heap, relieving the programmer of the often painful and error-prone process of manual memory management. Unfortunately, tracking and freeing memory during runtime comes at a performance cost. Running the GC adds scheduling overhead, consumes valuable CPU time, and occasionally pauses the entire program’s progress to track down garbage.

Go’s GC is highly optimized, tunable, and sufficient for a majority of workloads. Most of the other parts of gVisor happily use Go’s GC with no complaints. However, under high network stress, Netstack needed to aggressively allocate buffers used for processing TCP/IP data and metadata. These buffers often had short lifespans, and once the processing was done they were left to be cleaned up by the GC. This meant Netstack was producing tons of garbage that needed to be tracked and freed by GC workers.

Recycling to the Rescue

Luckily, we weren’t the only ones with this problem. This pattern of small, frequently allocated and discarded objects was common enough that the Go team introduced sync.Pool in Go 1.3. sync.Pool is designed to relieve pressure on the Go GC by maintaining a thread-safe cache of previously allocated objects. sync.Pool can retrieve an object from the cache if it exists or allocate a new one according to a user-specified allocation function. Once the user is finished with an object they can safely return it to the cache to be reused again.
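For readers who have not used sync.Pool, here is a minimal sketch of the Get/Put pattern with a pooled bytes.Buffer. The checksum function is a made-up workload for illustration and has nothing to do with Netstack's actual packet processing.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool caches *bytes.Buffer values. New is only called when the pool
// has nothing to hand out.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// checksum copies payload into a pooled scratch buffer and sums its bytes.
func checksum(payload []byte) uint32 {
	buf := bufPool.Get().(*bytes.Buffer)
	defer bufPool.Put(buf) // hand the buffer back once we are done
	buf.Reset()            // a pooled buffer may hold data from a previous use
	buf.Write(payload)

	var sum uint32
	for _, b := range buf.Bytes() {
		sum += uint32(b)
	}
	return sum
}

func main() {
	fmt.Println(checksum([]byte{1, 2, 3}))
	fmt.Println(checksum([]byte("gvisor")))
}
```

The pattern only works if the caller knows exactly when it is done with the object: nothing may hold on to the buffer after Put returns.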

While sync.Pool was exactly what we needed to reduce allocations, incorporating it into Netstack wasn’t going to be as easy as just replacing all our make()s with pool.Get()s.

Netstack Challenges

Netstack uses a few different types of buffers under the hood. Some of these are specific to protocols, like segment for TCP, and others are more widely shared, like PacketBuffer, which is used for IP, ICMP, UDP, etc. Although each of these buffer types is slightly different, they generally share a few common traits that made it difficult to use sync.Pool out of the box:

The buffers were originally built with the assumption that a garbage collector would clean them up automatically – there was little (if any) effort put into tracking object lifetimes. This meant that we had no way to know when it was safe to return buffers to a pool.
Buffers have dynamic sizes that are determined during creation, usually depending on the size of the packet holding them. A sync.Pool out of the box can only accommodate buffers of a single size. One common solution to this is to fill a pool with bytes.Buffer, but even a pooled bytes.Buffer could incur allocations if it were too small and had to be grown to the requested size.
Netstack splits, merges, and clones buffers at various points during processing (for example, breaking a large segment into smaller MTU-sized packets). Modifying a buffer’s size during runtime could mean lots of reallocating from the pool in a one-size-fits-all setup. This would limit the theoretical effectiveness of a pooled solution.

We needed an efficient, low-level buffer abstraction that had answers for the Netstack specific challenges and could be shared by the various intermediate buffer types. By sharing a common buffer abstraction, we could maximize the benefits of pooling and avoid introducing additional allocations while minimally changing any intermediate buffer processing logic.

Introducing bufferv2

Our solution was bufferv2. Bufferv2 is a non-contiguous, reference counted, pooled, copy-on-write, buffer-like data structure.

Internally, a bufferv2 Buffer is a linked list of Views. Each View has start/end indices and holds a pointer to a Chunk. A Chunk is a reference-counted structure that’s allocated from a pool and holds data in a byte slice. There are several Chunk pools, each of which allocates chunks with different sized byte slices. These sizes start at 64 and double until 64k.
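The tiered pools can be sketched as an array of sync.Pool values, one per size class from 64 bytes up to 64 KiB. The names below (classIndex, getChunk, putChunk) are hypothetical and do not mirror the bufferv2 API. Serving requests larger than 64 KiB from the regular allocator is an assumption made for this sketch.

```go
package main

import (
	"fmt"
	"sync"
)

const (
	minChunkSize = 64       // smallest size class
	maxChunkSize = 64 << 10 // largest size class (64 KiB)
	numClasses   = 11       // 64, 128, ..., 65536
)

// chunkPools holds one pool per size class.
var chunkPools [numClasses]sync.Pool

func init() {
	for i := range chunkPools {
		size := minChunkSize << i
		chunkPools[i].New = func() interface{} {
			b := make([]byte, size)
			return &b
		}
	}
}

// classIndex returns the index of the smallest size class that fits n
// bytes, or false if n is larger than the biggest class.
func classIndex(n int) (int, bool) {
	if n > maxChunkSize {
		return 0, false
	}
	idx, size := 0, minChunkSize
	for size < n {
		size <<= 1
		idx++
	}
	return idx, true
}

// getChunk returns a byte slice of length n (n must be non-negative).
func getChunk(n int) []byte {
	idx, ok := classIndex(n)
	if !ok {
		return make([]byte, n) // assumption: oversized requests are not pooled
	}
	bp := chunkPools[idx].Get().(*[]byte)
	return (*bp)[:n]
}

// putChunk returns a slice to its pool if its capacity is a class size.
func putChunk(b []byte) {
	idx, ok := classIndex(cap(b))
	if !ok || cap(b) != minChunkSize<<idx {
		return // did not come from a pool
	}
	b = b[:cap(b)]
	chunkPools[idx].Put(&b)
}

func main() {
	b := getChunk(1500)
	fmt.Println("len:", len(b), "cap:", cap(b))
	putChunk(b)
}
```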

The design of bufferv2 has a few key advantages over simpler object pooling:

Zero-cost copies and copy-on-write: Cloning a Buffer only increments the reference count of the underlying chunks instead of reallocating from the pool. Since buffers are much more frequently read than modified, this saves allocations. In the cases where a buffer is modified, only the chunk that’s changed has to be cloned, not the whole buffer.
Fast buffer transformations: Truncating and merging buffers or appending and prepending Views to Buffers are fast operations. Thanks to the non-contiguous memory structure these operations are usually as quick as adding a node to a linked list or changing the indices in a View.
Tiered pools: When growing a Buffer or appending data, the new chunks come from different pools of previously allocated chunks. Using multiple pools means we are flexible enough to efficiently accommodate packets of all sizes with minimal overhead. Unlike a one-size-fits-all solution, we don’t have to waste lots of space with a chunk size that is too big or loop forever allocating small chunks.
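The copy-on-write behavior from the first point can be sketched with a reference-counted chunk. The types below are hypothetical simplifications and not the bufferv2 types: the counter is a plain int, whereas a real implementation would need atomic operations, and release only drops the reference without returning the chunk to a pool.

```go
package main

import "fmt"

// chunk is a reference-counted block of bytes.
type chunk struct {
	refs int
	data []byte
}

// view is a handle onto a chunk that may be shared with other views.
type view struct {
	c *chunk
}

func newView(data []byte) *view {
	return &view{c: &chunk{refs: 1, data: append([]byte(nil), data...)}}
}

// clone shares the underlying chunk: no bytes are copied.
func (v *view) clone() *view {
	v.c.refs++
	return &view{c: v.c}
}

// setByte copies the chunk first if another view still references it.
func (v *view) setByte(i int, b byte) {
	if v.c.refs > 1 {
		v.c.refs--
		v.c = &chunk{refs: 1, data: append([]byte(nil), v.c.data...)}
	}
	v.c.data[i] = b
}

// release drops this view's reference to its chunk.
func (v *view) release() {
	v.c.refs--
	v.c = nil
}

func main() {
	a := newView([]byte("hello"))
	b := a.clone()
	fmt.Println("shared after clone:", a.c == b.c)

	b.setByte(0, 'j') // triggers the copy
	fmt.Println(string(a.c.data), string(b.c.data))
	fmt.Println("shared after write:", a.c == b.c)

	a.release()
	b.release()
}
```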

Trade-offs

Shifting Netstack to bufferv2 came with some costs. To start, rewriting all buffers to use bufferv2 was a sizable effort that took many months to fully roll out. Any place in Netstack that allocated or used a byte slice needed to be rewritten. Reference counting had to be introduced so all the aforementioned intermediate buffer types (PacketBuffer, segment, etc) could accurately track buffer lifetimes, and tests had to be modified to ensure reference counting correctness.

In addition to the upfront cost, the shift to bufferv2 also increased the engineering complexity of future Netstack changes. Netstack contributors must adhere to new rules to maintain memory safety and maximize the benefits of pooling. These rules are strict – there needs to be strong justification to break them. They are as follows:

Never allocate a byte slice; always use NewView() instead.
Use a View for simple data operations (e.g., writing some data of a fixed size) and a Buffer for more complex I/O operations (e.g., appending data of variable size, merging data, writing from an io.Reader).
If you need access to the contents of a View as a byte slice, use View.AsSlice(). If you need access to the contents of a Buffer as a byte slice, consider refactoring, as this will cause an allocation.
Never write or modify the slices returned by View.AsSlice(); they are still owned by the view.
Release bufferv2 objects as close to where they’re created as possible. This is usually most easily done with defer.
Document function ownership of bufferv2 object parameters. If there is no documentation, it is assumed that the function does not take ownership of its parameters.
If a function takes ownership of its bufferv2 parameters, the bufferv2 objects must be cloned before passing them as arguments.
All new Netstack tests must enable the leak checker and run a final leak check after the test is complete.

Give it a Try

Bufferv2 is enabled by default as of gVisor 20221017, and will be rolling out to GKE Sandbox soon, so no action is required to see a performance boost. Network-bound workloads, such as web servers or databases like Redis, are the most likely to see benefits. All the code implementing bufferv2 is public here, and contributions are welcome! If you’d like to run the iperf benchmark for yourself, you can run:

make run-benchmark BENCHMARKS_TARGETS=//test/benchmarks/network:iperf_test \
  RUNTIME=your-runtime-here BENCHMARKS_OPTIONS=-test.benchtime=60s

in the base gVisor directory. If you experience any issues, please feel free to let us know at gvisor.dev/issues.

Threat Detection in gVisor

2022-08-31T00:00:00-05:00

gVisor helps users secure their infrastructure by running containers in a dedicated kernel that is isolated from the host. But wouldn’t it be nice if you could tell when someone attempts to break out? Or get an early warning that your web server might have been compromised? Now you can do it with gVisor! We are pleased to announce support for runtime monitoring. Runtime monitoring provides the ability for an external process to observe application behavior and detect threats at runtime. Using this mechanism, gVisor users can watch actions performed by the container and generate alerts when something unexpected occurs.

A monitoring process can connect to the gVisor sandbox and receive a stream of actions that the application is performing. The monitoring process decides what actions are allowed and what steps to take based on policies for the given application. gVisor communicates with the monitoring process via a simple protocol based on Protocol Buffers, which is the basis for gRPC and is well supported in several languages. The monitoring process runs isolated from the application inside the sandbox for security reasons, and can be shared among all sandboxes running on the same machine to save resources. Trace points can be individually configured when creating a tracing session to capture only what’s needed.

Let’s go over a simple example of a web server that gets compromised while being monitored. The web server can execute files from /bin, read files from /etc and /html directories, create files under /tmp, etc. All these actions are reported to a monitoring process which analyzes them and deems them normal application behavior. Now suppose that an attacker takes control over the web server and starts executing code inside the container. The attacker writes a script under /tmp and, in an attempt to make it executable, runs chmod u+x /tmp/exploit.sh. The monitoring process determines that making a file executable is not expected in the normal web server execution and raises an alert to the security team for investigation. Additionally, it can also decide to kill the container and stop the attacker from making more progress.
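The policy in this example can be sketched as a small function that inspects one reported action at a time. The event struct and its fields are hypothetical stand-ins chosen for readability; they are not the Protocol Buffers messages that gVisor actually emits, and a real monitor would read events from the sandbox connection; this sketch uses hard-coded values.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a simplified, hypothetical view of one reported action.
type event struct {
	kind string // "execve", "open", "creat" or "chmod"
	path string
	mode uint32 // new permission bits, only meaningful for chmod
}

// check returns an alert message, or "" if the action matches the
// expected behavior of the web server in the example.
func check(e event) string {
	switch e.kind {
	case "execve":
		if strings.HasPrefix(e.path, "/bin/") {
			return ""
		}
	case "open":
		if strings.HasPrefix(e.path, "/etc/") || strings.HasPrefix(e.path, "/html/") {
			return ""
		}
	case "creat":
		if strings.HasPrefix(e.path, "/tmp/") {
			return ""
		}
	case "chmod":
		if e.mode&0o111 == 0 {
			return "" // no execute bit is being set
		}
		return fmt.Sprintf("chmod makes %s executable", e.path)
	}
	return fmt.Sprintf("unexpected %s on %s", e.kind, e.path)
}

func main() {
	events := []event{
		{kind: "open", path: "/html/index.html"},
		{kind: "creat", path: "/tmp/exploit.sh"},
		{kind: "chmod", path: "/tmp/exploit.sh", mode: 0o744},
	}
	for _, e := range events {
		if alert := check(e); alert != "" {
			fmt.Println("ALERT:", alert)
		}
	}
}
```

Writing the script to /tmp passes the policy because creating files there is normal for this server. It is the later chmod that stands out.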

Falco

Falco is an Open Source Cloud Native Security monitor that detects threats at runtime by observing the behavior of your applications and containers. Falco supports monitoring applications running inside gVisor. All the Falco rules and tooling work seamlessly with gVisor. You can use this tutorial to learn how to configure Falco and gVisor together. More information can be found on the Falco blog.

What’s next?

We’re looking for more projects to take advantage of the runtime monitoring system and the visibility that it provides into the sandbox. There are a few unique capabilities provided by the system that make it easy to monitor applications inside gVisor, like resolving file descriptors to full paths, providing container ID with traces, separating processes that were exec’ed into the container, internal procfs state access, and many more.

If you would like to explore it further, there is a design document and documentation with more details about the configuration and communication protocol. In addition, the tutorial using Falco is a great way to see it in action.

We would like to thank Luca Guerra, Lorenzo Susini, and the Falco team for their support while building this feature.

