This article was contributed by Tencent. Yifeng Tan, Hua Liu, and Hui Chen are engineers at Tencent, responsible for the internal container infrastructure.
As LLMs evolve from chat interfaces to autonomous agents, building a robust and
secure isolation environment becomes a necessity. We chose
gVisor as the default sandbox for our Agentic-RL
scenarios. Today, we run millions of gVisor sandboxes daily for Agentic-RL
training in production, and that scale continues to grow. After more than
74,000 side-by-side comparisons between runsc (gVisor) and runc
(unsandboxed/Linux), combined with targeted fixes driven by real-world
workloads, we have essentially closed the execution correctness gap with runc,
fully meeting our production-grade business requirements. Along the way, we
investigated and resolved gVisor compatibility issues affecting approximately
1.7% of all test cases.
This post focuses on CPU-centric code execution and testing workloads. We will discuss gVisor compatibility verification and highlight representative issues, skipping implementation details like GPU support, image distribution, or cluster scheduling. We aim to answer three questions:
- Why choose gVisor?
- Why doesn’t manual compatibility verification scale?
- How can AI agents analyze compatibility issues, what do typical failures look like, and what best practices have we established?
Background: Why Agentic-RL Needs gVisor
Over the past two years, benchmarks like SWE-bench have turned “Agents fixing bugs in real code repositories” from a research concept into an engineering reality. The agent behavioral model has evolved from static code generation to dynamic environmental interaction, spanning the entire lifecycle of dependency resolution, execution, test feedback, and iterative debugging. We don’t just need “an environment that runs Docker,” but rather a sandbox that strictly constrains the kernel attack surface while remaining lightweight and easy to deploy at scale. gVisor is a great fit for this scenario:
- It implements an application-level kernel in user space, intercepting and re-implementing system calls, significantly reducing the attack surface where containers directly interact with the host kernel. Its isolation has been well-recognized by the industry.
- It integrates naturally with existing Docker/Kubernetes infrastructure, avoiding the need for an entirely new guest kernel operation and maintenance system.
- Compared to microVM solutions—which must run on bare-metal hosts—gVisor can run inside regular VMs, making it significantly cheaper while remaining more flexible with lower startup and resource costs. This makes it far better suited for large-scale deployments of sandbox containers.
- It is also more friendly to GPU scenarios, facilitating integration with existing heterogeneous computing environments.
However, re-implementing the Linux ABI means its compatibility must be rigorously validated. In an Agentic-RL scenario where “any project can run and any environment can appear,” compatibility can’t rely on intuition. It requires large-scale verification against real workloads.
Challenge: Verifying Tens of Thousands of Cases Cannot Rely on Manual Effort Alone
Compatibility issues are rarely simple. Analyzing a typical SWE-related failure usually requires answering several questions at once:
- Is this failure unique to `runsc` (gVisor), or does it also fail under `runc`?
- If it only fails under gVisor, is it a semantic inconsistency in the Linux ABI, missing procfs / sysfs, file system behavioral differences, or a TOCTOU (Time-of-Check to Time-of-Use) race condition amplified by system call overhead?
- What is the actual behavior of the Linux kernel? At which layer did gVisor deviate?
- Should this issue be addressed by patching gVisor, modifying the test case, adjusting configurations, or simply avoiding a certain way of running?
Engineers can handle a handful of cases manually. But across these datasets, we are dealing with hundreds of thousands of real-world project instances, over a dozen programming languages, and numerous build systems (Gradle, Maven, CMake, Cargo, pip, npm, sbt, SwiftPM). Manual triage simply doesn’t scale.
To solve this, we brought AI coding agents into the verification pipeline to act as compatibility analysts. The process breaks down into four layers:
- Baseline Comparison Layer: Run the same set of test cases in parallel under `runc` and `runsc`, collecting complete execution logs and exit statuses.
- Difference Filtering Layer: Filter out environmental noise and non-deterministic outputs unrelated to the runtime, preserving samples that only fail under gVisor.
- AI Diagnostic Layer: LLMs output structured root cause analysis reports by combining logs and relevant source code.
- Decision Routing Layer: Route the reports into gVisor bugs, user-space race conditions, environmental differences, or test case issues, providing suggestions for fixes or workarounds.
This creates a neat closed loop: AI analyzing its own runtime environment.
```mermaid
graph TD
    A[Baseline Comparison Layer] -->|Run under runc/runsc in parallel<br>Collect logs & exit status| B(Difference Filtering Layer)
    B -->|Filter environmental noise<br>Keep gVisor-specific failures| C{AI Diagnostic Layer}
    C -->|Combine logs & source code<br>Output structured root cause report| D[Decision Routing Layer]
    D -->|gVisor bug| E[Submit community fix]
    D -->|User-space race condition| F[Workaround strategy]
    D -->|Environmental difference| G[Adjust environment]
    D -->|Test case issue| H[Fix test case]
    subgraph AI-Driven Compatibility Verification Framework
        A
        B
        C
        D
    end
```
In our workflow, every deeply analyzed case produces a structured document, typically containing:
- Failure symptoms and minimal reproduction method
- `runc` / `runsc` comparison results
- Root cause classification: gVisor bug, missing feature, environmental difference, test case issue, or race condition amplification
- Linux kernel behavior comparison and source code evidence
- Fixes or workaround suggestions
- Regression verification results
To date, we have used AI to automatically analyze thousands of test cases exhibiting behavioral differences. From these, we extracted and deeply reviewed 100+ highly representative cases across 10+ programming languages and multiple build systems. These cases help us determine not only “whether gVisor is usable,” but also “who is actually to blame for a given failure.”
Compatibility Landscape: Boundaries Defined by Batch Comparisons
Looking at a small sample of failures makes it easy to misjudge gVisor’s compatibility. Reliable conclusions require large-scale A/B testing.
Across 10 mainstream code execution datasets in our Agentic-RL infrastructure,
we’ve run 74,379 side-by-side comparisons between runc and runsc.
Please see the detailed data in the table below:
| Dataset | Total cases | Native `runc` accuracy | gVisor pre-fix `runsc` accuracy | gVisor post-fix `runsc` accuracy |
|---|---|---|---|---|
| terminal-bench2 | 89 | 100.00% | 94.38% | 97.75% |
| swe-public/Multi-SWE-bench | 1,632 | 70.16% | 72.49% | 73.16% |
| swe-public/Multi-SWE-RL | 7,046 | 27.73% | 20.49% | 26.81% |
| swe-public/SWE-bench_Multilingual | 300 | 93.00% | 92.67% | 93.00% |
| swe-public/SWE-bench_Not_Verified | 1,794 | 97.94% | 97.94% | 97.94% |
| swe-public/SWE-bench_Pro | 731 | 90.15% | 90.97% | 90.97% |
| swe-public/SWE-bench_Verified | 500 | 100.00% | 99.60% | 100.00% |
| swe-public/SWE-Gym | 2,438 | 86.75% | 88.27% | 88.27% |
| swe-public/SWE-rebench | 21,336 | 83.33% | 83.33% | 83.77% |
| swe-public/SWE-smith | 38,513 | 99.37% | 97.42% | 99.31% |
| Total | 74,379 | 86.78% | 85.18% | 86.91% |
Three key takeaways emerge from this data:
- `runsc` (gVisor) and `runc` (Linux native) are now effectively on par. Across 74,379 runs, the correctness gap between `runsc` and `runc` is only about 0.13 percentage points (86.91% vs 86.78%). We also performed retries and cross-validation on core datasets to rule out one-off flakiness. We have improved `runsc`'s overall pass rate by approximately 1.7 percentage points. This correctness gain largely stemmed from highly concentrated failures in a small number of repositories, such as trio, cloud-custodian, asciidoctor, and syncthing. Once a root cause was identified, a single fix could often resolve hundreds of failing cases at once.
- Most "compatibility issues" should not be attributed to gVisor. The table clearly demonstrates that even under the native `runc` environment, there is an inherent failure rate of about 13% (with an average correctness of 86.78%). These failures largely stem from flaky test code, build environment deficiencies, or limitations within the underlying datasets. Evaluating gVisor without a `runc` baseline could easily lead to misattributing this 13% background failure rate as sandbox incompatibilities.
- The overall pass rate for Multi-SWE-RL is relatively low (around 27% for both runtimes). This is because our internal evaluation framework and some case-execution methods are still being adapted, so it is not a standalone compatibility problem in gVisor itself. The same bias affects both `runc` and `runsc`, and therefore does not change the comparative conclusion.
At the production scale we described earlier—millions of gVisor sandboxes
running every day—this data answers the real question: how much correctness do
we lose by replacing runc with runsc? The answer is: almost none.
Representative Cases: Six Types of Issues and Corresponding Fix Paths
After filtering out cases where both runc and runsc failed simultaneously,
we conducted in-depth reviews of the remaining cases that exhibited behavioral
differences. Using these 100+ representative cases as a sample, their final
root-cause attribution can roughly be divided into the following categories:
| Root Cause Category | Requires gVisor Modification? | Typical Examples |
|---|---|---|
| Genuine gVisor bugs | Yes | poll incorrectly modifying events, inconsistent execve errno returns, O_TRUNC missing IN_MODIFY inotify events |
| Missing syscalls and virtual FS entries | Yes | Unimplemented copy_file_range syscall, missing /proc/sys/fs/pipe-max-size configuration file, and absence of /sys/dev/block directory |
| Clock and timer precision differences | Partially | CPU clock measurement precision, monotonic clock start value differences, sleep duration jitter |
| Amplified race conditions | No | Gradle clean test parallel execution concurrency race, CMake copy_if_different TOCTOU race |
| Environmental or config differences | No | External network access restrictions, JDK version mismatches, missing dynamic library paths |
| Test case issues | No | Test execution order dependencies, underlying dataset defects, inherently flaky tests |
This shows that aside from genuine bugs or missing Linux ABI implementations in
gVisor, a significant portion of behavioral differences stems from
timing-sensitive tests, amplified user-space race conditions, or environmental
setup differences. This is especially crucial for Agentic-RL scenarios. Without
runc baselines and root cause analysis, these failures could easily be
misattributed as sandbox incompatibilities, leading to systematically
pessimistic conclusions.
These cases highlight the different types of compatibility issues we see in Agentic-RL: system call semantic deviations, Linux ABI gaps, VFS implementation gaps, and user-space race conditions.
Case 1: poll Behavior Inconsistency Causes tmux Busy-Loop
The evaluation cluster’s CPU utilization was unusually high. Investigation
revealed that the tmux server in each Agent container was pegging a CPU core:
under gVisor, CPU usage hovered at 96.6%, while under runc it was
practically 0%.
The root cause was poll write-back semantics. gVisor internally appended
POLLHUP|POLLERR to pollfd.events and wrote the entire pollfd struct back
to user space. Linux, however, only writes to revents and never modifies the
user’s original events. This discrepancy prevented libevent from properly
removing closed file descriptors. Subsequent poll calls immediately returned
POLLNVAL, triggering a busy-loop.
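The Linux contract is easy to verify directly. The sketch below uses `ctypes` to call libc's `poll` on a pipe whose write end has been closed, then checks that the kernel reports the hang-up only via `revents` and leaves the caller's `events` field untouched (the field pre-fix gVisor was overwriting):

```python
import ctypes
import os
import select

class Pollfd(ctypes.Structure):
    """Mirror of struct pollfd from <poll.h>."""
    _fields_ = [("fd", ctypes.c_int),
                ("events", ctypes.c_short),
                ("revents", ctypes.c_short)]

libc = ctypes.CDLL(None, use_errno=True)
libc.poll.argtypes = [ctypes.POINTER(Pollfd), ctypes.c_ulong, ctypes.c_int]

r, w = os.pipe()
os.close(w)  # closed writer puts the read end in a hang-up state

pfd = Pollfd(fd=r, events=select.POLLIN, revents=0)
n = libc.poll(ctypes.byref(pfd), 1, 1000)

assert n == 1
assert pfd.events == select.POLLIN    # Linux never modifies events
assert pfd.revents & select.POLLHUP   # hang-up is reported only in revents
os.close(r)
```

libevent's poll backend relies on exactly this invariant: it keeps its own copy of `events` and only inspects `revents`, so a runtime that rewrites `events` silently desynchronizes its bookkeeping.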
After fixing this, the tmux CPU dropped from 96.6% to 0%. The impact goes far
beyond tmux — any program relying on the libevent poll backend benefits
from this.
Case 2: syncthing Test Case Exposes Two Independent Linux ABI Gaps (Unimplemented Syscalls or Virtual Files)
In real-world workloads, it’s not uncommon for a single test case to hit two
independent gVisor compatibility issues at once. The syncthing__syncthing-7828
test case in the Multi-SWE-RL dataset passes normally under runc, but
consistently fails under runsc: 16 TestCopyRange/* subtests report function
not implemented, and another TestTruncateFileOnly times out waiting for an
inotify event.
This was caused by two independent Linux ABI gaps:
- `copy_file_range` (syscall 326) was unimplemented. gVisor registered it as `ErrorWithEvent(ENOSYS)`, so any program using this syscall received `function not implemented`.
- `open(O_TRUNC)` was missing the `IN_MODIFY` inotify event. The Linux kernel generates `IN_MODIFY` along the `do_open()` → `handle_truncate()` → `notify_change()` path. However, gVisor VFS's `OpenAt` only generated `IN_OPEN`, causing programs listening for file modification events to be "deaf" to the truncation action.
The fix proceeded along two lines: implementing copy_file_range for both amd64
(326) and arm64 (285), and issuing IN_MODIFY at the VFS layer for O_TRUNC on
non-newly created files (skipping it for newly created files via the
FMODE_CREATED flag, consistent with Linux). After the fix, this test case
passed consistently under runsc just like under runc.
Case 3: Gradle clean test Concurrency Race—Root Cause in User Space, Not gVisor
Not all issues that “only reproduce under gVisor” are actually gVisor bugs.
A Thunderbird Android test running ./gradlew clean test --max-workers 8
--continue under runsc frequently failed with Unable to delete directory.
However, running it 7 times under runc yielded 5 failures (71%). This
pointed to a user-space TOCTOU race condition in Gradle’s parallel build: one
subproject was still writing to build/, while another subproject’s clean task
was already trying to delete it.
gVisor’s higher system call overhead amplified the probability of triggering
this race, but it did not introduce new semantic errors. Splitting the command
into ./gradlew clean and ./gradlew test ... fixed it completely. This is
also a fundamental principle we follow in compatibility analysis: always use
runc as a baseline first, then determine whether the issue should be
attributed to the sandbox itself.
Case 4: Missing procfs / sysfs Causes Real Applications to Take Abnormal Paths
Agentic-RL workloads are full of paths that are not usually tested in isolation
but are relied upon by real projects, such as /proc/sys/fs/pipe-max-size,
/proc/sys/kernel/randomize_va_space, /sys/dev/block, /proc/[pid]/fdinfo,
etc. Once missing, these typically manifest as ENOENT or cause upper-layer
libraries to take abnormal code paths.
These are usually cheap to fix by wiring up static files or directory structures. They perfectly illustrate the value of real-world workloads: we aren’t adding these paths to satisfy a benchmark, we’re adding them because real applications actually read them.
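To illustrate the failure mode, here is a hedged sketch of the read-with-fallback pattern an application might use for one of these entries (the 1 MiB default is our assumption, matching the usual kernel value, not any library's documented behavior):

```python
import os

def pipe_max_size(default=1 << 20):
    """Return the kernel's pipe-size ceiling, tolerating sandboxes where
    /proc/sys/fs/pipe-max-size does not exist (as in pre-fix gVisor)."""
    try:
        with open("/proc/sys/fs/pipe-max-size") as f:
            return int(f.read())
    except (FileNotFoundError, ValueError):
        return default  # missing procfs entry: fall back rather than crash

print(pipe_max_size())
```

Libraries that resize pipes (e.g. via `fcntl`'s `F_SETPIPE_SZ`) read this file first; when the entry is absent, the failure shows up as a bare `ENOENT` deep inside the library rather than anything obviously sandbox-related.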
Case 5: Inconsistent PTY Implementation Causes Interactive Agents to Error
Interactive terminals are easily overlooked but heavily used in Agent systems (tmux, screen, expect, REPLs, etc.). All rely on PTYs. We fixed several inconsistencies here:
- The `ISIG` flag was not checked correctly, causing signals to still be generated after `stty -isig`.
- When the master closed, it did not send `SIGHUP` to the foreground process group as Linux does.
- `TCSBRK` / `TCFLSH` and other ioctls were missing or had incorrect directional semantics, affecting programs like pyserial.
Notably, TCFLSH semantics must be evaluated from the caller’s perspective
rather than hardcoding internal queue names. Otherwise, the flush directions
seen by the master and replica are reversed compared to Linux.
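The `SIGHUP`-on-master-close contract can be exercised with a short script. This is a sketch of the Linux behavior the fix aligned gVisor with; the setup details (the 0.2 s settle delay, the exit code 42) are our own simplifications:

```python
import fcntl
import os
import signal
import termios
import time

# Linux delivers SIGHUP to the foreground process group of the replica's
# session when the last pty master fd is closed.
master, replica = os.openpty()
pid = os.fork()
if pid == 0:                                     # child
    os.close(master)                             # drop inherited master fd
    os.setsid()                                  # new session, no ctty yet
    fcntl.ioctl(replica, termios.TIOCSCTTY, 0)   # adopt replica as ctty
    signal.signal(signal.SIGHUP, lambda *_: os._exit(42))
    time.sleep(10)                               # wait for the hang-up
    os._exit(0)                                  # not reached on Linux
time.sleep(0.2)                                  # let the child finish setup
os.close(replica)
os.close(master)                                 # last master close -> SIGHUP
_, status = os.waitpid(pid, 0)
print("child exit:", os.WEXITSTATUS(status))
```

On Linux the child's `SIGHUP` handler fires and it exits with status 42; in a runtime that skips the hang-up signal, the child would instead sleep until its timeout, which is exactly how shells and terminal multiplexers end up orphaned.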
Case 6: Jekyll Test Order Dependency Causes Flaky Failures—A Pure Test Case Issue
Sometimes, a test failing under gVisor has nothing to do with the runtime environment at all.
During evaluation, a Jekyll test case (jekyll-7637) failed under runsc but
coincidentally passed under runc. After a deep dive, we found that this test
actually had a roughly 33% chance of failing in any environment.
The root cause was rather dramatic: the test code itself had a bug where it
passed a configuration value as a Ruby Symbol type, while the underlying
source code incorrectly compared it as a String. As a result, this test could
never load its required syntax highlighting plugin as intended. So why did
it sometimes pass? Because the testing framework (minitest) executes tests in
a randomized order. If this buggy test happened to run after another test
that correctly loaded the plugin into memory, it would “freeload” off that
global state and pass. But if the randomized order happened to put this test
first, it would genuinely fail. It just so happened that gVisor hit that 1-in-3
failure chance during our evaluation.
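The mechanism is easy to reproduce in miniature. All names below are hypothetical; this is not Jekyll's code, just the same shape of bug translated into Python:

```python
import random

loaded_plugins = set()   # process-wide state, like a Ruby plugin registry

def test_loads_plugin():
    loaded_plugins.add("rouge")      # this test loads the plugin correctly
    assert "rouge" in loaded_plugins

def test_buggy_lookup():
    # The buggy test never loads the plugin itself (the Symbol-vs-String
    # comparison always misses), so it can only pass by freeloading off
    # global state left behind by test_loads_plugin.
    assert "rouge" in loaded_plugins

tests = [test_loads_plugin, test_buggy_lookup]
random.shuffle(tests)                # minitest-style randomized order
for t in tests:
    try:
        t()
        print(t.__name__, "PASS")
    except AssertionError:
        print(t.__name__, "FAIL")    # only when the buggy test runs first
```

Shuffle the order enough times and the suite flakes at a fixed rate, in any runtime, on any kernel.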
This perfectly illustrates why we need large-scale A/B testing and deep analysis: without them, sporadic test flakiness like this can easily be misdiagnosed as “sandbox instability.”
Best Practices: Suggestions for Using gVisor in Agentic-RL Scenarios
If you’re building an Agent execution environment with gVisor, here are some practical tips.
Suggestions for Different Build Systems
| Build System | Common Risks | Suggestions |
|---|---|---|
| Gradle | clean test concurrency race | Split into clean and test steps |
| Maven | Remote dependency download timeout or 403 | Pre-populate local repo cache, minimize online downloads |
| CMake | copy_if_different race conditions | Lower parallelism, avoid over-reliance on extremely short time windows |
| sbt / Scala | Deep stack, slow startup, test flakiness | Increase -Xss, give the first compilation a more generous timeout |
| pip / pytest | Differences in CPU count vs cgroup quota perception | Be aware of the relationship between os.cpu_count() and actual quotas |
| Cargo / npm / yarn | Generally good compatibility | Usually do not require special handling |
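For the pip / pytest row, the usual pitfall is that `os.cpu_count()` reports host CPUs while the cgroup quota caps what the container may actually use. A hedged sketch for cgroup v2 (the mount path is the common default, not guaranteed in every environment):

```python
import os

def effective_cpus():
    """Best-effort CPU budget: os.cpu_count() sees host CPUs, while the
    cgroup v2 quota (cpu.max) caps what the container may actually use."""
    host = os.cpu_count() or 1
    try:
        quota, period = open("/sys/fs/cgroup/cpu.max").read().split()
        if quota != "max":
            return min(host, max(1, int(quota) // int(period)))
    except (FileNotFoundError, ValueError):
        pass  # cgroup v1 or no quota file: fall back to the host count
    return host

print(effective_cpus())
```

Test runners that size worker pools from `os.cpu_count()` can oversubscribe a heavily quota-limited sandbox, which amplifies exactly the timing-sensitive flakiness discussed above.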
Debugging Procedure When Encountering Failures
When a test fails, we recommend this debugging flow:
- First reproduce the same command under `runc` to confirm if the failure is specific to gVisor.
- If `runc` also fails, prioritize investigating test case issues, environmental differences, or race conditions.
- If it only fails under gVisor, check for obvious missing syscalls, procfs, or sysfs.
- For issues with no obvious missing features, compare logs, strace, and runtime behavior to distinguish between semantic inconsistencies, amplified race conditions, or environmental configuration differences.
- Only after confirming it is a gVisor semantic issue, proceed to locate the code path, create a minimal reproduction, and add regression tests.
Note: Many perceived “gVisor compatibility issues” are ultimately reclassified as test case issues during this step.
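The first two steps amount to replaying the same command under both runtimes. A minimal sketch, assuming Docker is configured with gVisor registered as the `runsc` runtime (the helper names are ours):

```python
import subprocess

def build_argv(runtime, image, cmd):
    """Construct the docker invocation for a given OCI runtime."""
    return ["docker", "run", "--rm", f"--runtime={runtime}", image, *cmd]

def replay(runtime, image, cmd, timeout=600):
    """Run the same command under 'runc' or 'runsc' and capture the
    exit status and logs needed for a baseline comparison."""
    proc = subprocess.run(build_argv(runtime, image, cmd),
                          capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout, proc.stderr

# Usage (requires Docker with gVisor installed):
#   for rt in ("runc", "runsc"):
#       code, out, err = replay(rt, "python:3.11", ["python", "-c", "print(1)"])
```

Diffing the two `(code, out, err)` triples is the whole Baseline Comparison Layer in miniature; everything after that is filtering and root-cause analysis.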
AI-Driven Compatibility Analysis: Why This Path Is Feasible
Large-scale compatibility analysis is well suited to AI assistance because it involves a large amount of repetitive, context-heavy work:
- Reading project source code and build scripts
- Comparing behavioral differences between two runtimes
- Comparing syscall, procfs, sysfs, PTY, network, and VFS semantics
- Turning conclusions into executable patches, PRs, or workaround suggestions
- Running regression validation and re-investigating the issue when validation fails
Manual analysis does not scale, while hardcoded rules often break down on complex cases. AI agents fit naturally in the middle: they can take on most of the “read logs → categorize → locate → report” work, while human engineers still review the proposed approach and code.
The real value here is not just saving time; it is making our conclusions scalable, traceable, and continuously improvable:
- Every case has standardized analysis artifacts rather than scattered chat logs.
- Every fix can be validated again against the original real-world test case.
- Every case that is “not a gVisor issue” can still be turned into a concrete workaround playbook.
- As new datasets, images, or build systems arrive, the same analysis framework can be reused.
Through this method, we already have more than ten fixes merged into the gVisor mainline, covering multiple areas such as file systems, networking, proc/sysfs, PTY, and system call semantics. Some representative PRs are listed below:
| PR | Fix Content | Typical Agentic-RL Scenario |
|---|---|---|
| #12851 | poll: Only write back revents | tmux, libevent poll backend |
| #12911 | proc: Add /proc/sys/fs/pipe-max-size | Python libraries like wurlitzer |
| #12915 | pty: Implement TCSBRK / TCFLSH | pyserial, interactive PTY programs |
| #12814 | proc: Add randomize_va_space | Performance and security inspection tools |
| #12813 | sysfs: Add /sys/dev/block and /sys/dev/char | lsblk, device-related tools |
| #12819 | proc: Fill in fdinfo fields | lsof, fuser, diagnostic tools |
| #12786 | devpts: Fix ISIG check | Interactive shells / terminal-based agents |
| #12853 | vfs: FICLONE* returns EOPNOTSUPP | file copying tools |
In this sense, Agentic-RL is not just a new use case for gVisor; it has also pushed our compatibility engineering toward a more AI-driven workflow.
Conclusion
Agentic-RL is both a proving ground for gVisor and, in practice, a large-scale regression suite: it continuously drives real-world projects through the sandbox and exposes compatibility boundaries that standard unit tests struggle to cover. By bringing AI agents into this verification loop, we can evaluate gVisor’s production readiness with data rather than intuition.
Our conclusions are simple:
- gVisor’s compatibility has proven to be production-ready.
- Most “compatibility issues” should not actually be attributed to gVisor.
- Real-world workloads are better than handpicked tests at revealing critical problems.
- AI-driven compatibility analysis is practical.
As AI agents take on heavier tasks, the code-execution sandbox will become an indispensable security foundation. We will continue refining this AI-driven verification system, applying it to new datasets and language stacks, and upstreaming our findings to the gVisor community. For Agentic-RL, a good sandbox is not just secure—it also needs to be highly compatible, debuggable, and able to evolve alongside real-world workloads.