diff --git a/.cargo/config.toml b/.cargo/config.toml index 830d89d12..0fb7b84f7 100644 --- a/.cargo/config.toml +++ b/.cargo/config.toml @@ -3,3 +3,10 @@ rustflags = ["-C", "target-feature=+crt-static", "-C", "link-arg=-Wl,-z,stack-si [target.x86_64-unknown-linux-musl] rustflags = ["-C", "target-feature=+crt-static", "-C", "link-arg=-Wl,-z,stack-size=2097152"] + +# Windows MSVC: allow duplicate symbols when linking libkrun staticlib into Rust binaries. +# libkrun is built as a staticlib (bundles Rust stdlib) for C consumers, but when linked +# into a Rust binary the stdlib symbols collide. /FORCE:MULTIPLE resolves this safely +# since both copies are identical. +[target.x86_64-pc-windows-msvc] +rustflags = ["-C", "link-arg=/FORCE:MULTIPLE"] diff --git a/.claude/commands/md-to-pdf.md b/.claude/commands/md-to-pdf.md new file mode 100644 index 000000000..3c0f677ca --- /dev/null +++ b/.claude/commands/md-to-pdf.md @@ -0,0 +1,64 @@ +Convert Markdown files in `./docs/` to PDF format using md-to-pdf. + +Arguments: $ARGUMENTS + +## Parameters + +- **File pattern** (required): A glob pattern to match files in `./docs/`, e.g. `in-depth-*`, `in-depth-cn-*`, `in-depth-01-*`. The `.md` extension is appended automatically if not included. + +Existing PDFs are always overwritten. + +## Prerequisites + +- **Node.js** must be installed +- **md-to-pdf** npm package: install globally if not available + +```bash +npm install -g md-to-pdf +``` + +## Output File Naming + +- **Input**: `{filename}.md` +- **Output**: `{filename}.pdf` + +Example: `cn.d-big_data.s-literary.big_data-cn-v6.md.md` -> `cn.d-big_data.s-literary.big_data-cn-v6.md.pdf` + +## Conversion Requirements + +1. **Preserve Formatting**: The PDF must faithfully render all Markdown formatting including headings, bold, italic, bullet points, nested lists, blockquotes, and inline code. + +2. **HTML Support**: Must correctly render embedded HTML tags (``, `
`, ` `, etc.) commonly used in the resume files. + +3. **Page Layout**: + - Paper size: A4 + - Margins: 8.5mm all sides (matches MPE preview padding of 2em/32px) + - Font: system default sans-serif + +4. **Styling**: Use different stylesheets for Chinese and English files: + - **Chinese files** (filename contains `-cn-`): Use `./docs/github-light.css` (original MPE stylesheet) + - **English files** (filename does NOT contain `-cn-`): Use `./docs/en.github-light.css` (optimized with reduced h2 and h3 spacing) + +5. **No External Dependencies Beyond md-to-pdf**: Do not require LaTeX, wkhtmltopdf, or other heavy toolchains. + +## Process + +1. Check if `md-to-pdf` is installed; if not, install it via `npm install -g md-to-pdf` +2. Parse arguments: extract the file pattern. If no file pattern is provided, report an error. +3. Use `Glob` to find matching `.md` files in `./docs/` using the pattern. If the pattern does not end with `.md`, append `.md` (e.g. `in-depth-*` becomes `in-depth-*.md`). +4. If no files match, report an error and list available `.md` files in `./docs/`. +5. For each matching file, determine the stylesheet based on filename: + - If filename contains `-cn-`, use `./docs/github-light.css` (Chinese) + - Otherwise, use `./docs/en.github-light.css` (English) + + Then run (use the stylesheet determined above): + ```bash + # Chinese files (containing "-cn-"): + md-to-pdf --stylesheet "./docs/github-light.css" --pdf-options '{"format":"A4","margin":{"top":"8.5mm","bottom":"8.5mm","left":"8.5mm","right":"8.5mm"}}' + + # English files (NOT containing "-cn-"): + md-to-pdf --stylesheet "./docs/en.github-light.css" --pdf-options '{"format":"A4","margin":{"top":"8.5mm","bottom":"8.5mm","left":"8.5mm","right":"8.5mm"}}' + ``` +6. 
Report results: + - List converted files + - Confirm total counts diff --git a/.claude/commands/win10-e2e.md b/.claude/commands/win10-e2e.md new file mode 100644 index 000000000..d9866124a --- /dev/null +++ b/.claude/commands/win10-e2e.md @@ -0,0 +1,63 @@ +# Win10 Full E2E Workflow + +Complete Win10 E2E testing workflow. Execute these commands in order. + +## Workflow + +### 1. Local Validation (before deploying) + +```bash +cargo test -p boxlite --no-default-features --lib +cargo clippy -p boxlite --no-default-features --lib -- -D warnings +``` + +Fix any failures before proceeding. + +### 2. `/win10-sync` — Pack + Deploy + Generate .bat + +Creates tarball of modified `src/` files, generates a rebuild+test .bat script, and SCPs both to Win10. + +### 3. Execute .bat on Win10 + +Two options: + +**Option A: Run the .bat (rebuild + test in one shot)** +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cd C:\\ws-boxlite && win10-e2eN.bat > e2e-testN.txt 2>&1 && echo DONE" +``` +Timeout: 300s (covers build + test). + +**Option B: Step by step (more control)** +- `/win10-rebuild` — Rebuild shim and/or SDK +- `/win10-test` — Run test + retrieve + analyze + +### 4. `/win10-test` — Retrieve + Analyze Results + +Fetches `e2e-testN.txt` from Win10, reads it, and checks for success. + +## Common Pitfalls + +| Pitfall | Symptom | Fix | +|---------|---------|-----| +| `set VAR=val &&` (space!) 
| pip "Failed to parse" proxy URL | `set VAR=val&&` — NO space before `&&` | +| Missing file in tarball | Same error as before fix | Verify with `findstr` in .bat | +| Didn't rebuild SDK | SDK uses old crate code | Rebuild BOTH shim + SDK | +| Disk cache stale | Permission/format bugs persist | Clear `disk-images/` (see `/win10-sync`) | +| SCP backslash paths | "No such file" on retrieve | Use `C:/path/` not `C:\\path\\` | +| Locked .exe | Build silently produces old binary | `taskkill` before build | +| Stale shim | "transport error" / broken pipe | `taskkill /F /IM boxlite-shim.exe` | +| Flaky ContainerInit | Passes on retry | Re-run once; if consistent, it's a real bug | + +## Success Criteria + +All 8 phases of vm-bench.py show ms times: +``` +1. import boxlite ~80 ms +2. runtime_init ~100 ms +3. box_create ~6 ms (cached) / ~250 ms (first) +4. first_exec (cold) ~1700 ms +5. second_exec (warm) ~55 ms +6. third_exec (warm) ~36 ms +7. stop ~155 ms +8. remove ~55 ms +``` diff --git a/.claude/commands/win10-rebuild.md b/.claude/commands/win10-rebuild.md new file mode 100644 index 000000000..6921abf39 --- /dev/null +++ b/.claude/commands/win10-rebuild.md @@ -0,0 +1,59 @@ +# Win10 Rebuild (shim + SDK) + +Rebuild boxlite-shim and/or Python SDK on Win10 after code has been deployed. + +**CRITICAL Windows `set` rule**: `set VAR=value&&` — NO space before `&&`. A trailing space becomes part of the value and breaks URL parsing in pip/cargo. 
+ +## Quick Rebuild (Both) + +**Shim:** +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"cd C:\\ws-boxlite\\boxlite&&set HTTP_PROXY=http://127.0.0.1:7897&&set HTTPS_PROXY=http://127.0.0.1:7897&&set PATH=C:\\ws-boxlite\\tools\\protoc\\bin;%PATH%&&taskkill /F /IM boxlite-shim.exe 2>nul&cargo build -p boxlite --bin boxlite-shim --no-default-features --features krun&&copy /Y target\\debug\\boxlite-shim.exe C:\\ws-boxlite\\runtime\\boxlite-shim.exe&&echo SHIM OK\"" 2>&1 +``` + +**SDK** (separate command, different working dir): +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"cd C:\\ws-boxlite\\boxlite\\sdks\\python&&set HTTP_PROXY=http://127.0.0.1:7897&&set HTTPS_PROXY=http://127.0.0.1:7897&&set BOXLITE_DEPS_STUB=1&&pip install -e .&&echo SDK OK\"" 2>&1 +``` + +## Preferred: Use .bat Script + +For reliability, write a `.bat` file instead of SSH one-liners (avoids quoting/set issues): + +```bat +@echo off +cd C:\ws-boxlite\boxlite +set HTTP_PROXY=http://127.0.0.1:7897 +set HTTPS_PROXY=http://127.0.0.1:7897 +set PATH=C:\ws-boxlite\tools\protoc\bin;%PATH% +taskkill /F /IM boxlite-shim.exe 2>nul +cargo build -p boxlite --bin boxlite-shim --no-default-features --features krun +if %ERRORLEVEL% NEQ 0 (echo SHIM FAILED & exit /b 1) +copy /Y target\debug\boxlite-shim.exe C:\ws-boxlite\runtime\boxlite-shim.exe +echo === Shim OK === +set BOXLITE_DEPS_STUB=1 +cd C:\ws-boxlite\boxlite\sdks\python +pip install -e . 
+if %ERRORLEVEL% NEQ 0 (echo SDK FAILED & exit /b 1) +echo === SDK OK === +``` + +## When to Rebuild What + +| Component | When to rebuild | +|-----------|----------------| +| **Shim only** | Changes in `src/boxlite/src/bin/shim/` only | +| **SDK only** | Changes in `src/boxlite/src/images/`, `src/boxlite/src/litebox/`, `src/boxlite/src/disk/` | +| **Both** | Changes in `src/boxlite/src/vmm/`, `src/boxlite/src/portal/`, `Cargo.toml`, or unsure | + +## Build Times (incremental) + +- Shim: ~10-50s (depends on what changed) +- SDK: ~20-80s (first build ~83s, incremental ~10-20s) + +## Prerequisite + +Always kill old shim before rebuild: +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "taskkill /F /IM boxlite-shim.exe 2>nul" +``` diff --git a/.claude/commands/win10-setup.md b/.claude/commands/win10-setup.md new file mode 100644 index 000000000..1d7a86a0d --- /dev/null +++ b/.claude/commands/win10-setup.md @@ -0,0 +1,215 @@ +# Win10 Environment Setup + +One-time setup for Win10 (MBP 2014) WHPX development/testing environment. + +## Machine Info + +- **SSH**: `ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143` (pw: `JtwmY8.15`) +- **Workspace**: `C:\ws-boxlite\` +- **Proxy**: `HTTP_PROXY=http://127.0.0.1:7897` + +## Prerequisites (manual install on Windows) + +### 1. Check/Install Rust + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "rustc --version && cargo --version" +``` + +If missing: download `rustup-init.exe` from https://rustup.rs, install stable toolchain. +Required: rustc 1.94+ (stable), MSVC target `x86_64-pc-windows-msvc`. + +### 2. Check/Install Python 3.12+ + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "python --version && pip --version" +``` + +If missing: download from https://www.python.org/downloads/. Install to default location. +Ensure `python` and `pip` are in PATH. + +### 3. 
Check/Install MSVC Build Tools + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "where cl.exe 2>nul && echo MSVC OK || echo MSVC MISSING" +``` + +If missing: install Visual Studio Build Tools (C++ workload). + +### 4. Check/Install protoc + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "C:\ws-boxlite\tools\protoc\bin\protoc.exe --version 2>nul && echo PROTOC OK || echo PROTOC MISSING" +``` + +If missing: download protoc from https://github.com/protocolbuffers/protobuf/releases (win64.zip), +extract to `C:\ws-boxlite\tools\protoc\`. + +## Automated Setup Steps + +### Step 1: Create Workspace Structure + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"mkdir C:\ws-boxlite\boxlite C:\ws-boxlite\runtime C:\ws-boxlite\tools 2>nul && echo DIRS OK\"" +``` + +### Step 2: Create Full Source Tarball (on macOS) + +```bash +cd /Users/lilongen/github/boxlite +tar czf /tmp/boxlite-full-src.tar.gz \ + --exclude='target' \ + --exclude='.git' \ + --exclude='src/deps/*/vendor/*/target' \ + src/ sdks/python/ Cargo.toml Cargo.lock +``` + +Verify size (~50MB): +```bash +ls -lh /tmp/boxlite-full-src.tar.gz +``` + +### Step 3: Deploy Vendor (libkrun submodule) + +The libkrun vendor directory is large and must be synced separately: + +```bash +tar czf /tmp/boxlite-vendor.tar.gz \ + --exclude='target' \ + src/deps/libkrun-sys/vendor/ +``` + +### Step 4: Deploy Source to Win10 + +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/boxlite-full-src.tar.gz lilongen@192.168.3.143:"C:/ws-boxlite/" +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/boxlite-vendor.tar.gz lilongen@192.168.3.143:"C:/ws-boxlite/" +``` + +Extract on Win10: +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"cd C:\ws-boxlite && tar xzf boxlite-full-src.tar.gz -C boxlite\ && tar xzf boxlite-vendor.tar.gz -C boxlite\ && echo EXTRACT OK\"" +``` + +### Step 5: Create .cargo/config.toml + 
+**CRITICAL**: Without this, linking fails with LNK1169 (duplicate `rust_eh_personality`). + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"mkdir C:\ws-boxlite\boxlite\.cargo 2>nul\"" +``` + +Create file locally and SCP: +```bash +cat > /tmp/cargo-config-win.toml << 'EOF' +[target.aarch64-unknown-linux-musl] +rustflags = ["-C", "target-feature=+crt-static", "-C", "link-arg=-Wl,-z,stack-size=2097152"] + +[target.x86_64-unknown-linux-musl] +rustflags = ["-C", "target-feature=+crt-static", "-C", "link-arg=-Wl,-z,stack-size=2097152"] + +# Windows MSVC: allow duplicate symbols when linking libkrun staticlib into Rust binaries. +# libkrun is built as a staticlib (bundles Rust stdlib) for C consumers, but when linked +# into a Rust binary the stdlib symbols collide. /FORCE:MULTIPLE resolves this safely +# since both copies are identical. +[target.x86_64-pc-windows-msvc] +rustflags = ["-C", "link-arg=/FORCE:MULTIPLE"] +EOF +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/cargo-config-win.toml lilongen@192.168.3.143:"C:/ws-boxlite/boxlite/.cargo/config.toml" +``` + +### Step 6: Deploy Runtime Files + +Runtime files are built on macOS/Lima and deployed to Windows. 
+ +**Build guest binary (on macOS):** +```bash +CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_LINKER=x86_64-linux-musl-gcc \ + cargo build -p boxlite-guest --release --target x86_64-unknown-linux-musl +``` + +**Collect runtime files:** +```bash +mkdir -p /tmp/win-runtime +cp target/x86_64-unknown-linux-musl/release/boxlite-guest /tmp/win-runtime/ +# vmlinuz and initrd.img are built separately (see build-kernel docs) +# mke2fs.exe and debugfs.exe are cross-compiled e2fsprogs +``` + +**Deploy to Win10:** +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/win-runtime/* lilongen@192.168.3.143:"C:/ws-boxlite/runtime/" +``` + +**Required runtime files:** + +| File | Source | Size | +|------|--------|------| +| `boxlite-guest` | Cross-compiled on macOS (musl) | ~12MB | +| `boxlite-shim.exe` | Built on Win10 (Step 7) | ~13MB | +| `vmlinuz` | libkrunfw kernel (with 9p built-in) | ~7MB | +| `initrd.img` | Built in Lima VM | ~1.5MB | +| `mke2fs.exe` | Cross-compiled e2fsprogs | ~529KB | +| `debugfs.exe` | Cross-compiled e2fsprogs | ~612KB | + +### Step 7: Build boxlite-shim + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"cd C:\ws-boxlite\boxlite && set HTTP_PROXY=http://127.0.0.1:7897&& set HTTPS_PROXY=http://127.0.0.1:7897&& set PATH=C:\ws-boxlite\tools\protoc\bin;%PATH%&& cargo build -p boxlite --bin boxlite-shim --no-default-features --features krun 2>&1 && copy /Y target\debug\boxlite-shim.exe C:\ws-boxlite\runtime\boxlite-shim.exe && echo SHIM OK\"" 2>&1 +``` + +First build takes ~2-5 minutes. Check output for `SHIM OK`. + +### Step 8: Install Python SDK + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"cd C:\ws-boxlite\boxlite\sdks\python && set HTTP_PROXY=http://127.0.0.1:7897&& set HTTPS_PROXY=http://127.0.0.1:7897&& set BOXLITE_DEPS_STUB=1&& set PATH=C:\ws-boxlite\tools\protoc\bin;%PATH%&& pip install -e . 
2>&1 && echo SDK OK\"" 2>&1 +``` + +### Step 9: Cache OCI Images + +Pull alpine and debian images (needed by vm-bench.py): + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"set HTTP_PROXY=http://127.0.0.1:7897&& set HTTPS_PROXY=http://127.0.0.1:7897&& set BOXLITE_RUNTIME_DIR=C:\ws-boxlite\runtime&& python -c \"import asyncio, boxlite; asyncio.run(boxlite.Boxlite.default().pull('alpine:latest'))\" 2>&1 && echo PULL OK\"" 2>&1 +``` + +**Alternative**: Copy image cache from another machine: +```bash +# On source machine: tar czf /tmp/boxlite-images.tar.gz -C $HOME/.boxlite images/ +# SCP to Win10 and extract to %USERPROFILE%\.boxlite\ +``` + +### Step 10: Deploy vm-bench.py + +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa scripts/test/vm-bench.py lilongen@192.168.3.143:"C:/ws-boxlite/vm-bench.py" +``` + +### Step 11: Verify Setup + +Run a single vm-bench test: +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"cd C:\ws-boxlite && set HTTP_PROXY=http://127.0.0.1:7897&& set HTTPS_PROXY=http://127.0.0.1:7897&& set BOXLITE_RUNTIME_DIR=C:\ws-boxlite\runtime&& set RUST_LOG=warn&& python vm-bench.py\"" 2>&1 +``` + +All 8 phases should show ms times. WHPX is flaky (~15-20% success on this machine), so retry if it fails with "transport error". 
+ +## Verification Checklist + +```bash +# All checks in one command +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"echo === Toolchain === && rustc --version && python --version && echo === Workspace === && dir /b C:\ws-boxlite\boxlite\Cargo.toml && echo === Cargo Config === && type C:\ws-boxlite\boxlite\.cargo\config.toml | findstr FORCE && echo === Runtime === && dir /b C:\ws-boxlite\runtime\ && echo === SDK === && python -c \"import boxlite; print(f'boxlite {boxlite.__version__}')\" && echo === ALL OK ===\"" 2>&1 +``` + +## Troubleshooting + +| Problem | Symptom | Fix | +|---------|---------|-----| +| LNK1169 duplicate symbol | `rust_eh_personality` already defined | Missing `.cargo/config.toml` — redo Step 5 | +| LNK1120 unresolved externals | `krun_*` symbols not found | Built with `BOXLITE_DEPS_STUB=1` — remove it for shim build | +| protoc not found | `boxlite-shared` build error | protoc not in PATH — check Step 4 | +| Image pull fails | `error sending request for url` | Proxy not set — add HTTP_PROXY/HTTPS_PROXY | +| Python not found | `python is not recognized` | Python not in PATH — reinstall with "Add to PATH" | +| GBK encoding error | `UnicodeEncodeError: 'gbk' codec` | Add `sys.stdout.reconfigure(encoding='utf-8')` to scripts | +| `cd` doesn't switch drives | Stays on C: after `cd D:\...` | Use drive letter first: `D:` then `cd D:\path` | diff --git a/.claude/commands/win10-sync.md b/.claude/commands/win10-sync.md new file mode 100644 index 000000000..5df608249 --- /dev/null +++ b/.claude/commands/win10-sync.md @@ -0,0 +1,204 @@ +# Win10 Sync + +Pack ALL modified source files (vs main branch), generate a rebuild+test .bat script, and deploy everything to Win10 (MBP 2014). 
+ +## Environment + +- **SSH**: `ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143` +- **Workspace**: `C:\ws-boxlite\` (working dir), `C:\ws-boxlite\boxlite\` (source) +- **Runtime**: `C:\ws-boxlite\runtime\` +- **Proxy**: `HTTP_PROXY=http://127.0.0.1:7897` +- **SCP path format**: Forward slashes only: `"lilongen@192.168.3.143:C:/ws-boxlite/file.txt"` + +## Steps + +### 1. Identify ALL Modified Files vs Main + +**CRITICAL**: Use `git diff main` (not `git diff`). This captures ALL branch changes, including committed changes from previous iterations, not just unstaged changes in the current session. + +```bash +# ALL src/ files changed on this branch vs main +git diff main --name-only -- src/ > /tmp/win10-sync-files.txt + +# Also include any unstaged changes not yet committed +git diff --name-only -- src/ >> /tmp/win10-sync-files.txt + +# Deduplicate +sort -u /tmp/win10-sync-files.txt -o /tmp/win10-sync-files.txt + +# Show count and list +echo "=== Files to sync: $(wc -l < /tmp/win10-sync-files.txt) ===" +cat /tmp/win10-sync-files.txt +``` + +Only include `src/` files. Skip docs, scripts, .claude, etc. + +### 2. Analyze What Changed (for cache/rebuild decisions) + +Run this ONCE and note the results; they drive Step 5's .bat generation: + +```bash +FILES=$(cat /tmp/win10-sync-files.txt) + +# Check: need disk-images cache clear? +NEED_DISK_CACHE_CLEAR=false +echo "$FILES" | grep -qE "(image_disk\.rs|disk/ext4\.rs|disk/constants\.rs)" && NEED_DISK_CACHE_CLEAR=true + +# Check: need cargo clean? (any Rust src changed = yes, since linker caches) +NEED_CARGO_CLEAN=false +echo "$FILES" | grep -qE "\.rs$" && NEED_CARGO_CLEAN=true + +# Check: need libgvproxy cross-compile? +NEED_GVPROXY=false +echo "$FILES" | grep -q "libgvproxy-sys/gvproxy-bridge/" && NEED_GVPROXY=true + +# Check: VMM files changed? 
(libkrun submodule) +VMM_CHANGED=false +echo "$FILES" | grep -q "libkrun-sys/vendor/libkrun/src/vmm/" && VMM_CHANGED=true + +echo "disk-images cache clear: $NEED_DISK_CACHE_CLEAR" +echo "cargo clean: $NEED_CARGO_CLEAN" +echo "libgvproxy cross-compile: $NEED_GVPROXY" +echo "VMM files changed: $VMM_CHANGED" +``` + +### 3. Cross-compile libgvproxy (only if gvproxy sources changed) + +If `NEED_GVPROXY=true`: + +```bash +bash scripts/build/cross-compile-gvproxy-windows.sh +``` + +Output: `target/kernel-windows-x86_64/libgvproxy.lib` (31MB). Skip if only Rust files changed. + +### 4. Create Sync Tarball + +Increment N from previous sync (check `/tmp/boxlite-sync*.tar.gz`): + +```bash +tar czf /tmp/boxlite-syncN.tar.gz -T /tmp/win10-sync-files.txt +echo "Tarball: $(ls -lh /tmp/boxlite-syncN.tar.gz)" +echo "File count: $(tar tzf /tmp/boxlite-syncN.tar.gz | wc -l)" +``` + +**Verification**: The file count must match the count from Step 1. If they differ, investigate. + +### 5. Generate .bat Script + +Write `/tmp/win10-e2eN.bat` with the sections below. The cache clearing and cargo clean lines are **deterministic** based on Step 2 analysis. 
+ +**CRITICAL rules**: +- One `set` per line (no `&&` after `set`) +- `RUST_LOG=info` (NEVER debug — debug kills WHPX networking) +- Always `cargo clean` when Rust source files changed +- `LIBGVPROXY_PREBUILT` must point to `gvproxy.lib` (7KB DLL import lib), NOT `libgvproxy.lib` (40MB static) + +```bat +@echo off +cd /d C:\ws-boxlite\boxlite +set HTTP_PROXY=http://127.0.0.1:7897 +set HTTPS_PROXY=http://127.0.0.1:7897 +set PATH=C:\ws-boxlite\tools\protoc\bin;%PATH% + +echo === Kill old processes === +taskkill /F /IM boxlite-shim.exe 2>nul + +echo === Extract updated files === +cd /d C:\ws-boxlite +tar xzf boxlite-syncN.tar.gz -C boxlite\ +echo Extract OK + +echo === Verify sync completeness === +echo Expected: files +REM Pick 2-3 key files from different directories to verify: +findstr /C:"UNIQUE_STRING_1" boxlite\path\to\file1 +findstr /C:"UNIQUE_STRING_2" boxlite\path\to\file2 +if %ERRORLEVEL% NEQ 0 ( + echo WARNING: Sync incomplete! + exit /b 1 +) + +echo === Clear caches === +if exist "%USERPROFILE%\.boxlite\boxes" (rmdir /S /Q "%USERPROFILE%\.boxlite\boxes") +REM --- ONLY if NEED_DISK_CACHE_CLEAR=true (image_disk.rs or ext4.rs changed): --- +if exist "%USERPROFILE%\.boxlite\images\disk-images" (rmdir /S /Q "%USERPROFILE%\.boxlite\images\disk-images") +REM --- Remove the line above if NEED_DISK_CACHE_CLEAR=false --- + +echo === Cargo clean === +cd /d C:\ws-boxlite\boxlite +set LIBGVPROXY_PREBUILT=C:\ws-boxlite\runtime\gvproxy.lib +cargo clean 2>&1 +echo === Clean done === + +echo === Rebuild shim === +cargo build -p boxlite --bin boxlite-shim --no-default-features --features krun,gvproxy 2>&1 +if %ERRORLEVEL% NEQ 0 (echo SHIM BUILD FAILED && exit /b %ERRORLEVEL%) +copy /Y target\debug\boxlite-shim.exe C:\ws-boxlite\runtime\boxlite-shim.exe +echo === Shim OK === + +echo === Rebuild SDK === +set BOXLITE_DEPS_STUB=1 +cd /d C:\ws-boxlite\boxlite\sdks\python +pip install -e . 
2>&1 +if %ERRORLEVEL% NEQ 0 (echo SDK BUILD FAILED && exit /b %ERRORLEVEL%) +set BOXLITE_DEPS_STUB= +echo === SDK OK === + +echo === Run vm-bench === +cd /d C:\ws-boxlite +set BOXLITE_RUNTIME_DIR=C:\ws-boxlite\runtime +set RUST_LOG=info +python vm-bench.py > e2e-testN.txt 2>&1 + +echo === vm-bench Summary === +findstr /C:"import" /C:"runtime_init" /C:"box_create" /C:"exec" /C:"stop" /C:"remove" /C:"Error" /C:"Grand" e2e-testN.txt + +echo === Run net-test === +python net-test.py > net-testN.txt 2>&1 + +echo === net-test Summary === +findstr /C:"PASS" /C:"FAIL" /C:"Error" /C:"Grand" net-testN.txt + +echo === DONE === +``` + +### 6. SCP Tarball + .bat + libgvproxy.lib to Win10 + +```bash +# Convert .bat to CRLF +perl -pe 's/\n/\r\n/' /tmp/win10-e2eN.bat > /tmp/win10-e2eN-crlf.bat +mv /tmp/win10-e2eN-crlf.bat /tmp/win10-e2eN.bat + +# SCP +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/boxlite-syncN.tar.gz /tmp/win10-e2eN.bat lilongen@192.168.3.143:"C:/ws-boxlite/" +``` + +If libgvproxy.lib was cross-compiled (step 3), also SCP it: + +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa target/kernel-windows-x86_64/libgvproxy.lib lilongen@192.168.3.143:"C:/ws-boxlite/runtime/" +``` + +### 7. Verify Deployment + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "dir C:\\ws-boxlite\\boxlite-syncN.tar.gz C:\\ws-boxlite\\win10-e2eN.bat" +``` + +Both files must exist with non-zero size. + +## Automatic Cache Clearing Rules + +These are DETERMINISTIC — apply them based on Step 2 analysis: + +| Condition | Action | +|-----------|--------| +| ANY `.rs` file changed | `cargo clean` (linker caches stale objects) | +| `image_disk.rs` or `disk/ext4.rs` or `disk/constants.rs` changed | Clear `disk-images/` cache | +| `image_disk.rs` changed | Also verify with `findstr /C:"has_non_ascii"` in .bat | +| Always | Clear `boxes/` cache (safe, forces clean box creation) | + +## Rebuild Rules + +Both shim and SDK are ALWAYS rebuilt after `cargo clean`. 
No selective rebuild logic needed. diff --git a/.claude/commands/win10-test.md b/.claude/commands/win10-test.md new file mode 100644 index 000000000..ae5c7d4ad --- /dev/null +++ b/.claude/commands/win10-test.md @@ -0,0 +1,77 @@ +# Win10 Run E2E Test + +Run vm-bench.py on Win10, retrieve results, and analyze. Assumes code is already deployed and rebuilt (via `/win10-sync` + `/win10-rebuild`, or via the .bat from `/win10-sync`). + +## Run Test + +### 1. Determine Next Test Number + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "dir C:\\ws-boxlite\\e2e-test*.txt" +``` + +### 2. Execute Test (replace N) + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"cd C:\\ws-boxlite&&taskkill /F /IM boxlite-shim.exe 2>nul&set BOXLITE_RUNTIME_DIR=C:\\ws-boxlite\\runtime&&set RUST_LOG=debug&&if exist \"%USERPROFILE%\\.boxlite\\boxes\" rmdir /S /Q \"%USERPROFILE%\\.boxlite\\boxes\"&&python vm-bench.py > e2e-testN.txt 2>&1&&echo TEST DONE\"" 2>&1 +``` + +- `taskkill` with `&` (not `&&`) — continues even if no process found +- **Timeout**: 60s. If SSH hangs, the test may still have completed on Win10. + +### 3. Retrieve Results + +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa "lilongen@192.168.3.143:C:/ws-boxlite/e2e-testN.txt" /tmp/e2e-testN.txt +``` + +**CRITICAL**: Forward slashes in SCP source path! Backslashes fail silently. + +### 4. Analyze + +Read `/tmp/e2e-testN.txt` with the Read tool. 
Check: + +- **Success**: All 8 phases show ms times in the summary table +- **Failure patterns**: + - `os error 2` = file not found (missing file, wrong path) + - `os error 5` = access denied (permission issue) + - `Broken pipe` = VM crashed or shutdown during gRPC + - `Box initialization failed` = init pipeline error (check preceding lines) +- **Flaky**: ContainerInit "transport error" / "broken pipe" on single run — re-run once + +### Quick Summary (without full read) + +```bash +grep -a "Phase\|exec\|stop\|remove\|Error\|Grand" /tmp/e2e-testN.txt +``` + +## If SSH Hangs + +WHPX VM occasionally hangs during init (~20% of runs on MBP 2014). When this happens: + +1. Stop/kill the SSH command +2. Check if output exists: `ssh ... "dir C:\\ws-boxlite\\e2e-testN.txt"` +3. If it exists with the summary table at the end, test completed — fetch it +4. If truncated, VM hung. Kill and retry: + ```bash + ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "taskkill /F /IM boxlite-shim.exe 2>nul" + ``` + Re-run with the next N. + +## Read Shim Stderr (for shim-side errors) + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"dir /s /b %USERPROFILE%\\.boxlite\\boxes\\*\\stderr\"" +``` + +Then fetch: +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa "lilongen@192.168.3.143:C:/Users/lilongen/.boxlite/boxes//stderr" /tmp/shim-stderr.txt +``` + +## Clear Disk Cache (if needed) + +Only when `image_disk.rs` or `disk/ext4.rs` changed: +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"if exist \"%USERPROFILE%\\.boxlite\\images\\disk-images\" rmdir /S /Q \"%USERPROFILE%\\.boxlite\\images\\disk-images\"&&echo Cleared\"" +``` diff --git a/.claude/commands/win11-e2e.md b/.claude/commands/win11-e2e.md new file mode 100644 index 000000000..6301c9064 --- /dev/null +++ b/.claude/commands/win11-e2e.md @@ -0,0 +1,63 @@ +# Win11 Full E2E Workflow + +Complete Win11 (T14) E2E testing workflow. 
Execute these commands in order. + +## Workflow + +### 1. Local Validation (before deploying) + +```bash +cargo test -p boxlite --no-default-features --lib +cargo clippy -p boxlite --no-default-features --lib -- -D warnings +``` + +Fix any failures before proceeding. + +### 2. `/win11-sync` - Pack + Deploy + Generate .bat + +Creates tarball of modified `src/` files, generates a rebuild+test .bat script, and SCPs both to Win11. + +### 3. Execute .bat on Win11 + +Two options: + +**Option A: Run the .bat (rebuild + test in one shot)** +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cd /d D:\\ws-boxlite && win11-e2eN.bat > e2e-testN.txt 2>&1 && echo DONE" +``` +Timeout: 300s (covers build + test). + +**Option B: Step by step (more control)** +- `/win11-rebuild` - Rebuild shim and/or SDK +- `/win11-test` - Run test + retrieve + analyze + +### 4. `/win11-test` - Retrieve + Analyze Results + +Fetches `e2e-testN.txt` from Win11, reads it, and checks for success. + +## Common Pitfalls + +| Pitfall | Symptom | Fix | +|---------|---------|-----| +| `set VAR=val &&` (space!) | pip "Failed to parse" proxy URL | `set VAR=val&&` with NO space before `&&` | +| Missing file in tarball | Same error as before fix | Verify with `findstr` in .bat | +| Didn't rebuild SDK | SDK uses old crate code | Rebuild BOTH shim + SDK | +| Disk cache stale | Permission/format bugs persist | Clear `disk-images/` (see `/win11-sync`) | +| SCP backslash paths | "No such file" on retrieve | Use `D:/path/` not `D:\\path\\` | +| Locked .exe | Build silently produces old binary | `taskkill` before build | +| Stale shim | "transport error" / broken pipe | `taskkill /F /IM boxlite-shim.exe` | +| Flaky ContainerInit | Passes on retry | Re-run once; if consistent, it's a real bug | + +## Success Criteria + +All 8 phases of vm-bench.py show ms times: +``` +1. import boxlite ~80 ms +2. runtime_init ~100 ms +3. box_create ~6 ms (cached) / ~250 ms (first) +4. first_exec (cold) ~1700 ms +5. 
second_exec (warm) ~55 ms +6. third_exec (warm) ~36 ms +7. stop ~155 ms +8. remove ~55 ms +``` diff --git a/.claude/commands/win11-rebuild.md b/.claude/commands/win11-rebuild.md new file mode 100644 index 000000000..d6902a7fc --- /dev/null +++ b/.claude/commands/win11-rebuild.md @@ -0,0 +1,59 @@ +# Win11 Rebuild (shim + SDK) + +Rebuild boxlite-shim and/or Python SDK on Win11 (T14) after code has been deployed. + +**CRITICAL Windows `set` rule**: `set VAR=value&&` with NO space before `&&`. A trailing space becomes part of the value and breaks URL parsing in pip/cargo. + +## Quick Rebuild (Both) + +**Shim:** +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"cd /d D:\\ws-boxlite\\boxlite&&set HTTP_PROXY=http://127.0.0.1:7897&&set HTTPS_PROXY=http://127.0.0.1:7897&&set PATH=D:\\ws-boxlite\\tools\\protoc\\bin;%PATH%&&taskkill /F /IM boxlite-shim.exe 2>nul&cargo build -p boxlite --bin boxlite-shim --no-default-features --features krun&&copy /Y target\\debug\\boxlite-shim.exe D:\\ws-boxlite\\runtime\\boxlite-shim.exe&&echo SHIM OK\"" 2>&1 +``` + +**SDK** (separate command, different working dir): +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"cd /d D:\\ws-boxlite\\boxlite\\sdks\\python&&set HTTP_PROXY=http://127.0.0.1:7897&&set HTTPS_PROXY=http://127.0.0.1:7897&&set BOXLITE_DEPS_STUB=1&&pip install -e .&&echo SDK OK\"" 2>&1 +``` + +## Preferred: Use .bat Script + +For reliability, write a `.bat` file instead of SSH one-liners (avoids quoting/set issues): + +```bat +@echo off +cd /d D:\ws-boxlite\boxlite +set HTTP_PROXY=http://127.0.0.1:7897 +set HTTPS_PROXY=http://127.0.0.1:7897 +set PATH=D:\ws-boxlite\tools\protoc\bin;%PATH% +taskkill /F /IM boxlite-shim.exe 2>nul +cargo build -p boxlite --bin boxlite-shim --no-default-features --features krun +if %ERRORLEVEL% NEQ 0 (echo SHIM FAILED & exit /b 1) +copy /Y target\debug\boxlite-shim.exe D:\ws-boxlite\runtime\boxlite-shim.exe +echo === Shim OK === +set BOXLITE_DEPS_STUB=1 
+cd D:\ws-boxlite\boxlite\sdks\python +pip install -e . +if %ERRORLEVEL% NEQ 0 (echo SDK FAILED & exit /b 1) +echo === SDK OK === +``` + +## When to Rebuild What + +| Component | When to rebuild | +|-----------|----------------| +| **Shim only** | Changes in `src/boxlite/src/bin/shim/` only | +| **SDK only** | Changes in `src/boxlite/src/images/`, `src/boxlite/src/litebox/`, `src/boxlite/src/disk/` | +| **Both** | Changes in `src/boxlite/src/vmm/`, `src/boxlite/src/portal/`, `Cargo.toml`, or unsure | + +## Build Times (incremental) + +- Shim: ~10-50s (depends on what changed) +- SDK: ~20-80s (first build ~83s, incremental ~10-20s) + +## Prerequisite + +Always kill old shim before rebuild: +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "taskkill /F /IM boxlite-shim.exe 2>nul" +``` diff --git a/.claude/commands/win11-setup.md b/.claude/commands/win11-setup.md new file mode 100644 index 000000000..2e0ffee25 --- /dev/null +++ b/.claude/commands/win11-setup.md @@ -0,0 +1,238 @@ +# Win11 Environment Setup + +One-time setup for Win11 (ThinkPad T14) WHPX development/testing environment. + +## Machine Info + +- **SSH**: `ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221` (pw: `121314`) +- **Workspace**: `D:\ws-boxlite\` +- **Proxy**: `HTTP_PROXY=http://127.0.0.1:7897` +- **Note**: Workspace is on D: drive — use `D:` before `cd D:\path` in .bat files + +## Prerequisites (manual install on Windows) + +### 1. Check/Install Rust + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "rustc --version && cargo --version" +``` + +If missing: download `rustup-init.exe` from https://rustup.rs, install stable toolchain. +Required: rustc 1.94+ (stable), MSVC target `x86_64-pc-windows-msvc`. + +### 2. Check/Install Python 3.12+ + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "python --version && pip --version" +``` + +If missing: download from https://www.python.org/downloads/. Install to default location. 
+Ensure `python` and `pip` are in PATH. + +### 3. Check/Install MSVC Build Tools + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "where cl.exe 2>nul && echo MSVC OK || echo MSVC MISSING" +``` + +If missing: install Visual Studio Build Tools (C++ workload). + +### 4. Check/Install protoc + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "D:\ws-boxlite\tools\protoc\bin\protoc.exe --version 2>nul && echo PROTOC OK || echo PROTOC MISSING" +``` + +If missing: download protoc from https://github.com/protocolbuffers/protobuf/releases (win64.zip), +extract to `D:\ws-boxlite\tools\protoc\`. + +## Automated Setup Steps + +### Step 1: Create Workspace Structure + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"mkdir D:\ws-boxlite\boxlite D:\ws-boxlite\runtime D:\ws-boxlite\tools 2>nul && echo DIRS OK\"" +``` + +### Step 2: Create Full Source Tarball (on macOS) + +```bash +cd /Users/lilongen/github/boxlite +tar czf /tmp/boxlite-full-src.tar.gz \ + --exclude='target' \ + --exclude='.git' \ + --exclude='src/deps/*/vendor/*/target' \ + src/ sdks/python/ Cargo.toml Cargo.lock +``` + +Verify size (~50MB): +```bash +ls -lh /tmp/boxlite-full-src.tar.gz +``` + +### Step 3: Deploy Vendor (libkrun submodule) + +The libkrun vendor directory is large and must be synced separately: + +```bash +tar czf /tmp/boxlite-vendor.tar.gz \ + --exclude='target' \ + src/deps/libkrun-sys/vendor/ +``` + +### Step 4: Deploy Source to Win11 + +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/boxlite-full-src.tar.gz t14@192.168.3.221:"D:/ws-boxlite/" +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/boxlite-vendor.tar.gz t14@192.168.3.221:"D:/ws-boxlite/" +``` + +Extract on Win11: +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"D: && cd D:\ws-boxlite && tar xzf boxlite-full-src.tar.gz -C boxlite\ && tar xzf boxlite-vendor.tar.gz -C boxlite\ && echo EXTRACT OK\"" +``` + +### Step 5: 
Create .cargo/config.toml + +**CRITICAL**: Without this, linking fails with LNK1169 (duplicate `rust_eh_personality`). + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"mkdir D:\ws-boxlite\boxlite\.cargo 2>nul\"" +``` + +Create file locally and SCP: +```bash +cat > /tmp/cargo-config-win.toml << 'EOF' +[target.aarch64-unknown-linux-musl] +rustflags = ["-C", "target-feature=+crt-static", "-C", "link-arg=-Wl,-z,stack-size=2097152"] + +[target.x86_64-unknown-linux-musl] +rustflags = ["-C", "target-feature=+crt-static", "-C", "link-arg=-Wl,-z,stack-size=2097152"] + +# Windows MSVC: allow duplicate symbols when linking libkrun staticlib into Rust binaries. +# libkrun is built as a staticlib (bundles Rust stdlib) for C consumers, but when linked +# into a Rust binary the stdlib symbols collide. /FORCE:MULTIPLE resolves this safely +# since both copies are identical. +[target.x86_64-pc-windows-msvc] +rustflags = ["-C", "link-arg=/FORCE:MULTIPLE"] +EOF +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/cargo-config-win.toml t14@192.168.3.221:"D:/ws-boxlite/boxlite/.cargo/config.toml" +``` + +### Step 6: Deploy Runtime Files + +Runtime files are built on macOS/Lima and deployed to Windows. 
+ +**Build guest binary (on macOS):** +```bash +CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_LINKER=x86_64-linux-musl-gcc \ + cargo build -p boxlite-guest --release --target x86_64-unknown-linux-musl +``` + +**Collect runtime files:** +```bash +mkdir -p /tmp/win-runtime +cp target/x86_64-unknown-linux-musl/release/boxlite-guest /tmp/win-runtime/ +# vmlinuz and initrd.img are built separately (see build-kernel docs) +# mke2fs.exe and debugfs.exe are cross-compiled e2fsprogs +``` + +**Deploy to Win11:** +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/win-runtime/* t14@192.168.3.221:"D:/ws-boxlite/runtime/" +``` + +**Alternative**: Copy runtime from Win10 (if already set up): +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143:"C:/ws-boxlite/runtime/boxlite-guest" /tmp/ +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143:"C:/ws-boxlite/runtime/vmlinuz" /tmp/ +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143:"C:/ws-boxlite/runtime/initrd.img" /tmp/ +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143:"C:/ws-boxlite/runtime/mke2fs.exe" /tmp/ +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143:"C:/ws-boxlite/runtime/debugfs.exe" /tmp/ +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/boxlite-guest /tmp/vmlinuz /tmp/initrd.img /tmp/mke2fs.exe /tmp/debugfs.exe t14@192.168.3.221:"D:/ws-boxlite/runtime/" +``` + +**Required runtime files:** + +| File | Source | Size | +|------|--------|------| +| `boxlite-guest` | Cross-compiled on macOS (musl) | ~12MB | +| `boxlite-shim.exe` | Built on Win11 (Step 7) | ~13MB | +| `vmlinuz` | libkrunfw kernel (with 9p built-in) | ~7MB | +| `initrd.img` | Built in Lima VM | ~1.5MB | +| `mke2fs.exe` | Cross-compiled e2fsprogs | ~529KB | +| `debugfs.exe` | Cross-compiled e2fsprogs | ~612KB | + +### Step 7: Build boxlite-shim + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"D: && cd D:\ws-boxlite\boxlite && set 
HTTP_PROXY=http://127.0.0.1:7897&& set HTTPS_PROXY=http://127.0.0.1:7897&& set PATH=D:\ws-boxlite\tools\protoc\bin;%PATH%&& cargo build -p boxlite --bin boxlite-shim --no-default-features --features krun 2>&1 && copy /Y target\debug\boxlite-shim.exe D:\ws-boxlite\runtime\boxlite-shim.exe && echo SHIM OK\"" 2>&1 +``` + +First build takes ~2-5 minutes. Check output for `SHIM OK`. + +### Step 8: Install Python SDK + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"D: && cd D:\ws-boxlite\boxlite\sdks\python && set HTTP_PROXY=http://127.0.0.1:7897&& set HTTPS_PROXY=http://127.0.0.1:7897&& set BOXLITE_DEPS_STUB=1&& set PATH=D:\ws-boxlite\tools\protoc\bin;%PATH%&& pip install -e . 2>&1 && echo SDK OK\"" 2>&1 +``` + +### Step 9: Cache OCI Images + +**Option A**: Pull directly (needs proxy): +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"set HTTP_PROXY=http://127.0.0.1:7897&& set HTTPS_PROXY=http://127.0.0.1:7897&& set BOXLITE_RUNTIME_DIR=D:\ws-boxlite\runtime&& python -c \"import asyncio, boxlite; asyncio.run(boxlite.Boxlite.default().pull('alpine:latest'))\" 2>&1 && echo PULL OK\"" 2>&1 +``` + +**Option B**: Copy image cache from Win10 (faster, no proxy needed): +```bash +# On Win10: pack image cache +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143 "cmd /c \"cd %USERPROFILE% && tar czf C:\ws-boxlite\boxlite-images.tar.gz .boxlite\images\"" +# Copy via macOS relay +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa lilongen@192.168.3.143:"C:/ws-boxlite/boxlite-images.tar.gz" /tmp/ +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/boxlite-images.tar.gz t14@192.168.3.221:"D:/ws-boxlite/" +# Extract on Win11 +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"cd %USERPROFILE% && tar xzf D:\ws-boxlite\boxlite-images.tar.gz && echo IMAGES OK\"" +``` + +### Step 10: Deploy vm-bench.py + +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa scripts/test/vm-bench.py 
t14@192.168.3.221:"D:/ws-boxlite/vm-bench.py" +``` + +### Step 11: Verify Setup + +Run a single vm-bench test: +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"D: && cd D:\ws-boxlite && set HTTP_PROXY=http://127.0.0.1:7897&& set HTTPS_PROXY=http://127.0.0.1:7897&& set BOXLITE_RUNTIME_DIR=D:\ws-boxlite\runtime&& set RUST_LOG=warn&& python vm-bench.py\"" 2>&1 +``` + +All 8 phases should show ms times. WHPX is flaky, so retry if it fails with "transport error". + +## Verification Checklist + +```bash +# All checks in one command +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"echo === Toolchain === && rustc --version && python --version && echo === Workspace === && dir /b D:\ws-boxlite\boxlite\Cargo.toml && echo === Cargo Config === && type D:\ws-boxlite\boxlite\.cargo\config.toml | findstr FORCE && echo === Runtime === && dir /b D:\ws-boxlite\runtime\ && echo === SDK === && python -c \"import boxlite; print(f'boxlite {boxlite.__version__}')\" && echo === ALL OK ===\"" 2>&1 +``` + +## Win11-Specific Notes + +- **D: drive**: Workspace is on D: — always use `D:` before `cd D:\path` in .bat files +- **Python PATH**: May need explicit PATH in .bat: `set PATH=C:\Users\T14\AppData\Local\Programs\Python\Python312;C:\Users\T14\AppData\Local\Programs\Python\Python312\Scripts;%PATH%` +- **No git**: Git is not installed on Win11 — use `findstr` instead of `git diff` for verification +- **WHPX stability**: Win11 should theoretically be more stable than Win10 (PIC-HLT fix), but actual results vary by hardware + +## Troubleshooting + +| Problem | Symptom | Fix | +|---------|---------|-----| +| LNK1169 duplicate symbol | `rust_eh_personality` already defined | Missing `.cargo/config.toml` — redo Step 5 | +| LNK1120 unresolved externals | `krun_*` symbols not found | Built with `BOXLITE_DEPS_STUB=1` — remove it for shim build | +| protoc not found | `boxlite-shared` build error | protoc not in PATH — check Step 4 | +| Image pull 
fails | `error sending request for url` | Proxy not set — add HTTP_PROXY/HTTPS_PROXY | +| Python not found | `python is not recognized` | Python not in PATH — add to .bat PATH line | +| GBK encoding error | `UnicodeEncodeError: 'gbk' codec` | Add `sys.stdout.reconfigure(encoding='utf-8')` to scripts | +| `cd` doesn't switch drives | Stays on C: after `cd D:\...` | Use `D:` before `cd D:\path` in .bat | +| SSH connection reset | Win11 drops SSH during long test | WHPX crash may destabilize system — reboot and retry | diff --git a/.claude/commands/win11-sync.md b/.claude/commands/win11-sync.md new file mode 100644 index 000000000..62c2ce002 --- /dev/null +++ b/.claude/commands/win11-sync.md @@ -0,0 +1,204 @@ +# Win11 Sync + +Pack ALL modified source files (vs main branch), generate a rebuild+test .bat script, and deploy everything to Win11 (T14). + +## Environment + +- **SSH**: `ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221` +- **Workspace**: `D:\ws-boxlite\` (working dir), `D:\ws-boxlite\boxlite\` (source) +- **Runtime**: `D:\ws-boxlite\runtime\` +- **Proxy**: `HTTP_PROXY=http://127.0.0.1:7897` +- **SCP path format**: Forward slashes only: `"t14@192.168.3.221:D:/ws-boxlite/file.txt"` + +## Steps + +### 1. Identify ALL Modified Files vs Main + +**CRITICAL**: Use `git diff main` (not `git diff`). This captures ALL branch changes including committed changes from previous iterations — not just unstaged changes in the current session. + +```bash +# ALL src/ files changed on this branch vs main +git diff main --name-only -- src/ > /tmp/win11-sync-files.txt + +# Also include any unstaged changes not yet committed +git diff --name-only -- src/ >> /tmp/win11-sync-files.txt + +# Deduplicate +sort -u /tmp/win11-sync-files.txt -o /tmp/win11-sync-files.txt + +# Show count and list +echo "=== Files to sync: $(wc -l < /tmp/win11-sync-files.txt) ===" +cat /tmp/win11-sync-files.txt +``` + +Only include `src/` files. Skip docs, scripts, .claude, etc. + +### 2. 
Analyze What Changed (for cache/rebuild decisions) + +Run this ONCE and note the results — they drive Step 5's .bat generation: + +```bash +FILES=$(cat /tmp/win11-sync-files.txt) + +# Check: need disk-images cache clear? +NEED_DISK_CACHE_CLEAR=false +echo "$FILES" | grep -qE "(image_disk\.rs|disk/ext4\.rs|disk/constants\.rs)" && NEED_DISK_CACHE_CLEAR=true + +# Check: need cargo clean? (any Rust src changed = yes, since linker caches) +NEED_CARGO_CLEAN=false +echo "$FILES" | grep -qE "\.rs$" && NEED_CARGO_CLEAN=true + +# Check: need libgvproxy cross-compile? +NEED_GVPROXY=false +echo "$FILES" | grep -q "libgvproxy-sys/gvproxy-bridge/" && NEED_GVPROXY=true + +# Check: VMM files changed? (libkrun submodule) +VMM_CHANGED=false +echo "$FILES" | grep -q "libkrun-sys/vendor/libkrun/src/vmm/" && VMM_CHANGED=true + +echo "disk-images cache clear: $NEED_DISK_CACHE_CLEAR" +echo "cargo clean: $NEED_CARGO_CLEAN" +echo "libgvproxy cross-compile: $NEED_GVPROXY" +echo "VMM files changed: $VMM_CHANGED" +``` + +### 3. Cross-compile libgvproxy (only if gvproxy sources changed) + +If `NEED_GVPROXY=true`: + +```bash +bash scripts/build/cross-compile-gvproxy-windows.sh +``` + +Output: `target/kernel-windows-x86_64/libgvproxy.lib` (31MB). Skip if only Rust files changed. + +### 4. Create Sync Tarball + +Increment N from previous sync (check `/tmp/boxlite-sync*.tar.gz`): + +```bash +tar czf /tmp/boxlite-syncN.tar.gz -T /tmp/win11-sync-files.txt +echo "Tarball: $(ls -lh /tmp/boxlite-syncN.tar.gz)" +echo "File count: $(tar tzf /tmp/boxlite-syncN.tar.gz | wc -l)" +``` + +**Verification**: The file count must match the count from Step 1. If they differ, investigate. + +### 5. Generate .bat Script + +Write `/tmp/win11-e2eN.bat` with the sections below. The cache clearing and cargo clean lines are **deterministic** based on Step 2 analysis. 
+ +**CRITICAL rules**: +- One `set` per line (no `&&` after `set`) +- Use `cd /d` for drive switching +- `RUST_LOG=info` (NEVER debug — debug kills WHPX networking) +- Always `cargo clean` when Rust source files changed + +```bat +@echo off +cd /d D:\ws-boxlite\boxlite +set HTTP_PROXY=http://127.0.0.1:7897 +set HTTPS_PROXY=http://127.0.0.1:7897 +set PATH=D:\ws-boxlite\tools\protoc\bin;%PATH% + +echo === Kill old processes === +taskkill /F /IM boxlite-shim.exe 2>nul + +echo === Extract updated files === +cd /d D:\ws-boxlite +tar xzf boxlite-syncN.tar.gz -C boxlite\ +echo Extract OK + +echo === Verify sync completeness === +echo Expected: files +REM Pick 2-3 key files from different directories to verify: +findstr /C:"UNIQUE_STRING_1" boxlite\path\to\file1 +findstr /C:"UNIQUE_STRING_2" boxlite\path\to\file2 +if %ERRORLEVEL% NEQ 0 ( + echo WARNING: Sync incomplete! + exit /b 1 +) + +echo === Clear caches === +if exist "%USERPROFILE%\.boxlite\boxes" (rmdir /S /Q "%USERPROFILE%\.boxlite\boxes") +REM --- ONLY if NEED_DISK_CACHE_CLEAR=true (image_disk.rs or ext4.rs changed): --- +if exist "%USERPROFILE%\.boxlite\images\disk-images" (rmdir /S /Q "%USERPROFILE%\.boxlite\images\disk-images") +REM --- Remove the line above if NEED_DISK_CACHE_CLEAR=false --- + +echo === Cargo clean === +cd /d D:\ws-boxlite\boxlite +set LIBGVPROXY_PREBUILT=D:\ws-boxlite\runtime\gvproxy.lib +cargo clean 2>&1 +echo === Clean done === + +echo === Rebuild shim === +cargo build -p boxlite --bin boxlite-shim --no-default-features --features krun,gvproxy 2>&1 +if %ERRORLEVEL% NEQ 0 (echo SHIM BUILD FAILED && exit /b %ERRORLEVEL%) +copy /Y target\debug\boxlite-shim.exe D:\ws-boxlite\runtime\boxlite-shim.exe +echo === Shim OK === + +echo === Rebuild SDK === +set BOXLITE_DEPS_STUB=1 +cd /d D:\ws-boxlite\boxlite\sdks\python +pip install -e . 
2>&1 +if %ERRORLEVEL% NEQ 0 (echo SDK BUILD FAILED && exit /b %ERRORLEVEL%) +set BOXLITE_DEPS_STUB= +echo === SDK OK === + +echo === Run vm-bench === +cd /d D:\ws-boxlite +set BOXLITE_RUNTIME_DIR=D:\ws-boxlite\runtime +set RUST_LOG=info +python vm-bench.py > e2e-testN.txt 2>&1 + +echo === vm-bench Summary === +findstr /C:"import" /C:"runtime_init" /C:"box_create" /C:"exec" /C:"stop" /C:"remove" /C:"Error" /C:"Grand" e2e-testN.txt + +echo === Run net-test === +python net-test.py > net-testN.txt 2>&1 + +echo === net-test Summary === +findstr /C:"PASS" /C:"FAIL" /C:"Error" /C:"Grand" net-testN.txt + +echo === DONE === +``` + +### 6. SCP Tarball + .bat + libgvproxy.lib to Win11 + +```bash +# Convert .bat to CRLF +perl -pe 's/\n/\r\n/' /tmp/win11-e2eN.bat > /tmp/win11-e2eN-crlf.bat +mv /tmp/win11-e2eN-crlf.bat /tmp/win11-e2eN.bat + +# SCP +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa /tmp/boxlite-syncN.tar.gz /tmp/win11-e2eN.bat t14@192.168.3.221:"D:/ws-boxlite/" +``` + +If libgvproxy.lib was cross-compiled (step 3), also SCP it: + +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa target/kernel-windows-x86_64/libgvproxy.lib t14@192.168.3.221:"D:/ws-boxlite/runtime/" +``` + +### 7. Verify Deployment + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "dir D:\\ws-boxlite\\boxlite-syncN.tar.gz D:\\ws-boxlite\\win11-e2eN.bat" +``` + +Both files must exist with non-zero size. + +## Automatic Cache Clearing Rules + +These are DETERMINISTIC — apply them based on Step 2 analysis: + +| Condition | Action | +|-----------|--------| +| ANY `.rs` file changed | `cargo clean` (linker caches stale objects) | +| `image_disk.rs` or `disk/ext4.rs` or `disk/constants.rs` changed | Clear `disk-images/` cache | +| `image_disk.rs` changed | Also verify with `findstr /C:"has_non_ascii"` in .bat | +| Always | Clear `boxes/` cache (safe, forces clean box creation) | + +## Rebuild Rules + +Both shim and SDK are ALWAYS rebuilt after `cargo clean`. 
No selective rebuild logic needed. diff --git a/.claude/commands/win11-test.md b/.claude/commands/win11-test.md new file mode 100644 index 000000000..408deee89 --- /dev/null +++ b/.claude/commands/win11-test.md @@ -0,0 +1,77 @@ +# Win11 Run E2E Test + +Run vm-bench.py on Win11 (T14), retrieve results, and analyze. Assumes code is already deployed and rebuilt (via `/win11-sync` + `/win11-rebuild`, or via the .bat from `/win11-sync`). + +## Run Test + +### 1. Determine Next Test Number + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "dir D:\\ws-boxlite\\e2e-test*.txt" +``` + +### 2. Execute Test (replace N) + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"cd /d D:\\ws-boxlite&&taskkill /F /IM boxlite-shim.exe 2>nul&set BOXLITE_RUNTIME_DIR=D:\\ws-boxlite\\runtime&&set RUST_LOG=info&&if exist \"%USERPROFILE%\\.boxlite\\boxes\" rmdir /S /Q \"%USERPROFILE%\\.boxlite\\boxes\"&&python vm-bench.py > e2e-testN.txt 2>&1&&echo TEST DONE\"" 2>&1 +``` + +- `taskkill` with `&` (not `&&`) — continues even if no process found +- **Timeout**: 60s. If SSH hangs, the test may still have completed on Win11. + +### 3. Retrieve Results + +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa "t14@192.168.3.221:D:/ws-boxlite/e2e-testN.txt" /tmp/e2e-testN.txt +``` + +**CRITICAL**: Forward slashes in SCP source path! Backslashes fail silently. + +### 4. Analyze + +Read `/tmp/e2e-testN.txt` with the Read tool. 
Check: + +- **Success**: All 8 phases show ms times in the summary table +- **Failure patterns**: + - `os error 2` = file not found (missing file, wrong path) + - `os error 5` = access denied (permission issue) + - `Broken pipe` = VM crashed or shutdown during gRPC + - `Box initialization failed` = init pipeline error (check preceding lines) +- **Flaky**: ContainerInit "transport error" / "broken pipe" on single run — re-run once + +### Quick Summary (without full read) + +```bash +grep -a "Phase\|exec\|stop\|remove\|Error\|Grand" /tmp/e2e-testN.txt +``` + +## If SSH Hangs + +Win11 T14 should be more stable than Win10 MBP, but if it hangs: + +1. Stop/kill the SSH command +2. Check if output exists: `ssh ... "dir D:\\ws-boxlite\\e2e-testN.txt"` +3. If it exists with the summary table at the end, test completed — fetch it +4. If truncated, VM hung. Kill and retry: + ```bash + ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "taskkill /F /IM boxlite-shim.exe 2>nul" + ``` + Re-run with the next N. + +## Read Shim Stderr (for shim-side errors) + +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"dir /s /b %USERPROFILE%\\.boxlite\\boxes\\*\\stderr\"" +``` + +Then fetch: +```bash +scp -o IdentitiesOnly=yes -i ~/.ssh/id_rsa "t14@192.168.3.221:C:/Users/t14/.boxlite/boxes//stderr" /tmp/shim-stderr.txt +``` + +## Clear Disk Cache (if needed) + +Only when `image_disk.rs` or `disk/ext4.rs` changed: +```bash +ssh -o IdentitiesOnly=yes -i ~/.ssh/id_rsa t14@192.168.3.221 "cmd /c \"if exist \"%USERPROFILE%\\.boxlite\\images\\disk-images\" rmdir /S /Q \"%USERPROFILE%\\.boxlite\\images\\disk-images\"&&echo Cleared\"" +``` diff --git a/.github/workflows/test-windows-e2e.yml b/.github/workflows/test-windows-e2e.yml new file mode 100644 index 000000000..c00a89b30 --- /dev/null +++ b/.github/workflows/test-windows-e2e.yml @@ -0,0 +1,93 @@ +# Windows E2E integration tests on self-hosted WHPX runners. 
+# +# GitHub-hosted runners lack Hyper-V/WHPX, so real VM tests require +# self-hosted machines with hardware virtualization enabled. +# +# Triggered manually via workflow_dispatch. Intended for pre-release +# validation and reliability regression testing. +name: Windows E2E (Manual) + +on: + workflow_dispatch: + inputs: + rounds: + description: 'Number of stability test rounds' + default: '5' + type: string + suite: + description: 'Test suite to run' + default: 'all' + type: choice + options: + - all + - stability + - functional + - performance + skip_perf: + description: 'Skip performance suite (faster)' + type: boolean + default: false + +env: + CARGO_TERM_COLOR: always + CARGO_INCREMENTAL: '0' + +jobs: + windows-e2e: + name: WHPX E2E (${{ matrix.machine }}) + runs-on: [self-hosted, windows, whpx, '${{ matrix.machine }}'] + timeout-minutes: 30 + strategy: + fail-fast: false + matrix: + machine: [win10, win11] + + steps: + - name: Checkout code + uses: actions/checkout@v5 + with: + submodules: recursive + + - name: Install Rust + uses: actions-rust-lang/setup-rust-toolchain@v1 + with: + toolchain: stable + + - name: Install protobuf + run: choco install protoc -y + + - name: Kill stale shim processes + run: | + taskkill /F /IM boxlite-shim.exe 2>$null + exit 0 + shell: pwsh + + - name: Build shim + run: cargo build -p boxlite --bin boxlite-shim --features krun,gvproxy + + - name: Install Python SDK + run: pip install -e sdks/python/ + + - name: Run E2E tests + run: | + $suite = "${{ inputs.suite }}" + $rounds = "${{ inputs.rounds }}" + $timestamp = Get-Date -Format "yyyyMMdd-HHmmss" + $outfile = "e2e-results-${{ matrix.machine }}-${timestamp}.json" + + $args = @("scripts/test/cross_platform_e2e.py", "--rounds", $rounds, "--json", $outfile) + if ($suite -ne "all") { + $args += @("--suite", $suite) + } + python @args + + Write-Host "Results saved to $outfile" + shell: pwsh + + - name: Upload results + if: always() + uses: actions/upload-artifact@v4 + with: + name: 
whpx-e2e-${{ matrix.machine }} + path: e2e-results-*.json + retention-days: 14 diff --git a/.github/workflows/test-windows.yml b/.github/workflows/test-windows.yml new file mode 100644 index 000000000..acf166226 --- /dev/null +++ b/.github/workflows/test-windows.yml @@ -0,0 +1,59 @@ +# Windows compile, lint, and unit test checks. +# +# GitHub runners do not have WHPX/Hyper-V, so we use BOXLITE_DEPS_STUB=1 +# to stub out native dependencies (libkrun, libgvproxy) and verify: +# - All cfg(windows) code compiles +# - Clippy passes on Windows target +# - Unit tests pass (633 tests, all platform-independent) +name: Windows + +on: + push: + branches: [main] + paths: + - 'src/**/*.rs' + - '**/Cargo.toml' + - 'Cargo.lock' + - '.github/workflows/test-windows.yml' + pull_request: + branches: [main] + paths: + - 'src/**/*.rs' + - '**/Cargo.toml' + - 'Cargo.lock' + - '.github/workflows/test-windows.yml' + +env: + CARGO_TERM_COLOR: always + CARGO_INCREMENTAL: '0' + BOXLITE_DEPS_STUB: '1' + +jobs: + windows-check: + name: Windows compile + clippy + tests + runs-on: windows-latest + steps: + - name: Checkout code + uses: actions/checkout@v5 + + - name: Install Rust + uses: actions-rust-lang/setup-rust-toolchain@v1 + with: + toolchain: stable + components: clippy + + - name: Install protobuf + run: choco install protoc -y + + - name: Cargo check (all crates) + # Exclude boxlite-guest (Linux-only, has compile_error! 
on non-Linux) + run: cargo check --workspace --all-targets --exclude boxlite-guest + + - name: Clippy + run: cargo clippy --workspace --all-targets --exclude boxlite-guest -- -D warnings + + - name: Unit tests (boxlite) + run: cargo test -p boxlite --no-default-features --lib + + - name: Unit tests (boxlite-shared) + run: cargo test -p boxlite-shared --lib diff --git a/Cargo.lock b/Cargo.lock index c29f78d3f..17441c687 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -432,10 +432,12 @@ dependencies = [ "tracing", "tracing-appender", "tracing-subscriber", + "uds_windows", "ulid", "urlencoding", "uuid", "walkdir", + "windows-sys 0.61.2", "xattr", "zstd", ] @@ -541,6 +543,7 @@ dependencies = [ "serde_json", "tokio", "tracing", + "tracing-subscriber", ] [[package]] @@ -4507,6 +4510,17 @@ version = "1.19.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "562d481066bde0658276a35467c4af00bdc6ee726305698a55b86e61d7ad82bb" +[[package]] +name = "uds_windows" +version = "1.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f2f6fb2847f6742cd76af783a2a2c49e9375d0a111c7bef6f71cd9e738c72d6e" +dependencies = [ + "memoffset 0.9.1", + "tempfile", + "windows-sys 0.61.2", +] + [[package]] name = "ulid" version = "1.2.1" diff --git a/docs/PR-boxlite-whpx.md b/docs/PR-boxlite-whpx.md new file mode 100644 index 000000000..738293169 --- /dev/null +++ b/docs/PR-boxlite-whpx.md @@ -0,0 +1,147 @@ +# PR: boxlite — Native Windows WHPX Support + +**Title:** `feat(windows): add native Windows WHPX hypervisor support` + +**Repo:** boxlite-labs/boxlite +**Branch:** feat/windows-whpx-support → main +**Stats:** 84 files changed, +6,133 / -523 (1 squashed commit) +**Depends on:** libkrun submodule PR (boxlite-ai/libkrun#TBD) + +--- + +## Summary + +Adds native Windows support to BoxLite using the Windows Hypervisor Platform (WHPX) API. 
This enables BoxLite to run lightweight Linux VMs directly on Windows without WSL2, providing the same SDK interface across all three platforms (macOS, Linux, Windows). + +**What works:** +- Full VM lifecycle: create, start, exec, stop +- OCI image pull and ext4 disk construction on Windows +- Network connectivity (gvproxy + AF_UNIX vsock) +- Multi-vCPU (up to 4 vCPUs) +- Volume mounts (virtiofs via 9p) +- Python SDK on Windows +- Process isolation via Windows Job Objects + +## Architecture Overview + +``` +┌──────────────────────────────────────────────────────────┐ +│ Python/Node SDK │ +├──────────────────────────────────────────────────────────┤ +│ boxlite (Rust core) │ +│ ┌────────────────┐ ┌──────────────┐ ┌─────────────┐ │ +│ │ image_disk │ │ krun/ │ │ jailer/ │ │ +│ │ (ext4 build) │ │ engine.rs │ │ job_object │ │ +│ └────────────────┘ └──────────────┘ └─────────────┘ │ +├──────────────────────────────────────────────────────────┤ +│ boxlite-shim (subprocess) │ +│ ┌────────────────┐ ┌──────────────┐ ┌─────────────┐ │ +│ │ watchdog │ │ libkrun │ │ gvproxy │ │ +│ │ (Event-based) │ │ (WHPX VMM) │ │ (DLL) │ │ +│ └────────────────┘ └──────────────┘ └─────────────┘ │ +└──────────────────────────────────────────────────────────┘ +``` + +## Key Changes + +### VM Engine (`src/boxlite/src/vmm/krun/`) + +- **engine.rs** — Windows-specific VM lifecycle: `krun_start` (non-blocking) + `krun_wait` (poll for exit) + `krun_stop` (graceful shutdown). Linux/macOS use `krun_start_enter` (blocking, process takeover) which isn't available on Windows. +- **context.rs** — Merged import list for new libkrun APIs (`krun_add_virtiofs3`, `krun_add_net_unixgram`, `krun_add_vsock`, etc.) + +### Process Lifecycle (`src/boxlite/src/vmm/controller/`) + +- **spawn.rs** — `CREATE_SUSPENDED` + Job Object assignment + `ResumeThread` pattern to eliminate TOCTOU between spawn and sandboxing. PID file written from parent (no `pre_exec` on Windows). 
+- **watchdog.rs** — Windows implementation using named Events + parent process handle monitoring (replaces Unix pipe-based POLLHUP detection). +- **shim.rs** — Windows graceful shutdown via `krun_stop` API instead of Unix signals. + +### Sandbox Isolation (`src/boxlite/src/jailer/`) + +- **job_object.rs** — New Windows sandbox using Job Objects: process count limits, memory limits, kill-on-close semantics, and network restrictions via Silos (when available). + +### Image & Disk (`src/boxlite/src/images/`) + +- **image_disk.rs** — Platform-aware ext4 disk construction. On Windows, uses `mkfs.ext4` from bundled e2fsprogs and raw file I/O instead of loop devices. Layer extraction uses tar with Windows path handling. + +### Networking + +- **port.rs** — New module for TCP port availability checking (used by gvproxy on Windows). +- **socket_path.rs** — Cross-platform Unix socket path handling. + +### Build System + +- **build.rs** (boxlite) — Windows-specific dependency bundling (kernel, initrd, e2fsprogs, gvproxy DLL). +- **build.rs** (libkrun-sys) — Windows static library linking with MSVC. +- **build.rs** (libgvproxy-sys) — DLL import lib generation for Windows. + +### Guest Agent (`src/guest/`) + +- **zygote.rs** — Timeout-based container readiness (Windows has no `pidfd` for container PID1 monitoring). +- **mounts.rs** — Conditional bind-mount logic (no `/dev/kvm` passthrough on Windows guests). +- **virtiofs.rs** — 9p mount fallback path for Windows host. 
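The timeout-based readiness mentioned for `zygote.rs` can be pictured as deadline-bounded polling. The sketch below is illustrative only and is not taken from the BoxLite sources: `wait_until_ready`, its parameters, and the idea of passing the readiness check as a closure are all assumptions.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the BoxLite code base): polls `probe`
/// until it reports readiness or `timeout` elapses.
/// Returns true if the probe succeeded before the deadline.
pub fn wait_until_ready<F>(mut probe: F, timeout: Duration, interval: Duration) -> bool
where
    F: FnMut() -> bool,
{
    let deadline = Instant::now() + timeout;
    loop {
        // Probe first so an already-ready container returns immediately.
        if probe() {
            return true;
        }
        let now = Instant::now();
        if now >= deadline {
            return false;
        }
        // Never sleep past the deadline.
        thread::sleep(interval.min(deadline - now));
    }
}
```

A caller would pass whatever check stands in for the missing `pidfd` signal, for example a closure that tests for a marker file or a responsive socket.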
+ +### CI & Scripts + +- `.github/workflows/test-windows.yml` — Windows build + unit test CI +- `.github/workflows/test-windows-e2e.yml` — Windows E2E test workflow +- `scripts/build/build-windows-runtime.sh` — Cross-compile all Windows runtime dependencies +- `scripts/build/cross-compile-*.sh` — Individual cross-compilation scripts (kernel, e2fsprogs, gvproxy) + +### Cross-Platform Test Report + +- `docs/cross-platform-test-report-20260503.md` — Full E2E test results across macOS ARM64, Win11, Win10 + +## Platform-Specific Behavior + +| Aspect | macOS/Linux | Windows | +|--------|-------------|---------| +| Hypervisor | Hypervisor.framework / KVM | WHPX | +| VM lifecycle | `krun_start_enter` (blocking) | `krun_start` + `krun_wait` (async poll) | +| Shutdown | SIGTERM → guest | `krun_stop` API call | +| Watchdog | Pipe POLLHUP | Named Event + parent handle | +| Sandbox | sandbox-exec / seccomp | Job Objects | +| Disk build | losetup + mkfs.ext4 | Bundled e2fsprogs (raw file) | +| Networking | gvproxy (static lib) | gvproxy (DLL) | +| Process spawn | fork + pre_exec | CREATE_SUSPENDED + Job Object | + +## Testing + +### Automated (this PR) +- macOS ARM64: `cargo test -p boxlite --no-default-features --lib` — **689/689 PASS** +- Linux (Lima): `cargo test` — **673 PASS**, 26 fail (pre-existing, need `/dev/kvm`) +- `cargo clippy` — PASS (macOS + Linux) +- `cargo fmt` — PASS + +### Manual E2E (Windows) +- **Win11** (ThinkPad T14, i5-1135G7, 16GB): + - vm-bench 8/8 PASS (create, exec, file I/O, env, networking, stop) + - net-test 8/8 PASS (DNS, HTTP, large transfer, concurrent connections) + - BrowserBox: 4/6 PASS (lifecycle works; playwright_endpoint has unrelated libcontainer issue) +- **Win10** (MBP 2014, i7-4770HQ, 16GB): + - vm-bench 8/8 PASS + - net-test 8/8 PASS + - BrowserBox: 6/6 PASS + +### Cross-Platform Matrix + +| Test Suite | macOS ARM64 | Win11 | Win10 | +|-----------|-------------|-------|-------| +| vm-bench (8 tests) | 8/8 PASS | 8/8 PASS | 8/8 PASS | 
+| net-test (8 tests) | 8/8 PASS | 8/8 PASS | 8/8 PASS | +| BrowserBox (6 tests) | 8/8 PASS | 4/6 PASS | 6/6 PASS | + +## Test Plan + +- [ ] CI: macOS + Linux builds unaffected (zero regression) +- [ ] CI: Windows build compiles successfully +- [ ] Manual: vm-bench passes on Windows (create/exec/stop lifecycle) +- [ ] Manual: net-test passes on Windows (guest networking) +- [ ] Code review: security of Job Object sandbox +- [ ] Code review: no secrets or credentials in committed files + +## Known Limitations + +1. **vCPU cap: 4** — Sufficient for the target use case (AI agent sandboxes). Can be raised later. +2. **No GPU passthrough** — WHPX doesn't support GPU virtualization. GPU workloads should use WSL2. +3. **First-boot image build is slow** — Large OCI images (>2GB) take several minutes for initial ext4 construction. Subsequent boots use cached disk. +4. **Win11 BrowserBox playwright_endpoint** — libcontainer sends unexpected `InitReady` message. Not WHPX-related; tracked separately. diff --git a/docs/PR-libkrun-whpx.md b/docs/PR-libkrun-whpx.md new file mode 100644 index 000000000..c14e02ae6 --- /dev/null +++ b/docs/PR-libkrun-whpx.md @@ -0,0 +1,89 @@ +# PR: libkrun — Windows WHPX Hypervisor Backend + +**Title:** `feat(windows): add native Windows WHPX hypervisor backend` + +**Repo:** boxlite-ai/libkrun +**Branch:** feat/windows-whpx-support → main +**Stats:** 51 files changed, +27,501 / -261 (30 commits) + +--- + +## Summary + +Adds a complete Windows Hyper-V Platform (WHPX) hypervisor backend to libkrun, enabling native VM execution on Windows without WSL2. The implementation provides feature parity with the existing KVM (Linux) and Hypervisor.framework (macOS) backends. 
+ +**Key capabilities:** +- Full x86-64 guest boot via WHPX API (`windows-sys` 0.61) +- Userspace device emulation: PIC, PIT, IOAPIC, LAPIC, serial, CMOS RTC +- virtio-mmio devices: blk (async worker), net, vsock, p9, rng, balloon +- Multi-vCPU support (up to 4 vCPUs) with INIT-SIPI-SIPI AP bootstrap +- Lock-free interrupt injection via `SharedApicState` + atomic pull_irr +- ACPI tables (RSDP, RSDT, XSDT, MADT, DSDT with S5 shutdown) +- Linux kernel boot with custom initrd and cmdline + +## Architecture + +``` +┌─────────────────────────────────────────────┐ +│ libkrun API (FFI) │ +├─────────────────────────────────────────────┤ +│ src/libkrun/src/windows_api.rs │ ← krun_* FFI entry points +├─────────────────────────────────────────────┤ +│ src/vmm/src/windows/ │ +│ ├── context.rs VM configuration │ +│ ├── runner.rs Main VMM event loop │ +│ ├── vcpu.rs Per-vCPU state │ +│ ├── whpx.rs WHPX API wrapper │ +│ ├── memory.rs Guest physical memory │ +│ ├── insn.rs x86 instruction decode │ +│ ├── boot/ Kernel loading + ACPI │ +│ ├── devices/ Userspace device models│ +│ │ ├── irq_chip PIC → APIC transition │ +│ │ ├── ioapic I/O APIC emulation │ +│ │ ├── lapic Local APIC + timer │ +│ │ ├── virtio/ Block, Net, Vsock... │ +│ │ └── ... PIT, Serial, RTC │ +│ └── cmdline.rs Kernel cmdline builder │ +└─────────────────────────────────────────────┘ +``` + +## Key Design Decisions + +1. **Userspace APIC emulation** — WHPX's in-kernel APIC emulation crashes on some hardware (Win10 MBP 2014). We implement full LAPIC/IOAPIC in userspace with atomic lock-free interrupt delivery. + +2. **Lock-free `SharedApicState`** — Device threads raise interrupts via `AtomicU64` IRR bitmask. vCPU threads pull pending interrupts without acquiring locks, avoiding contention in the hot path. + +3. **ICR broadcast shorthand** — Linux kernel uses "All Excluding Self" (shorthand 0b11) for IPI broadcast. Without handling this, only 2 vCPUs work (coincidence: single AP gets the targeted IPI). 
Fixed by parsing ICR bits 19:18 and dispatching to all APs. + +4. **Async virtio-blk worker** — Disk I/O runs on a dedicated thread with Windows overlapped I/O, preventing vCPU stalls during block operations. + +5. **AF_UNIX sockets** (not TCP) — Host-guest vsock traffic uses Unix domain sockets for security and performance, matching the macOS/Linux backends. + +6. **HLT tiered sleep** — Idle vCPUs use adaptive sleep (short spin → WaitForSingleObject) to balance latency vs CPU usage. LAPIC timer throttling prevents excessive wakeups. + +## Changes by Area + +### New Files (38 files under `src/vmm/src/windows/`) +- Boot: `acpi.rs`, `loader.rs`, `mp_table.rs`, `params.rs`, `setup.rs` +- Core: `context.rs`, `runner.rs`, `vcpu.rs`, `whpx.rs`, `memory.rs`, `insn.rs`, `cmdline.rs`, `types.rs`, `error.rs` +- Devices: `manager.rs`, `irq_chip.rs`, `ioapic.rs`, `lapic.rs`, `pic.rs`, `pit.rs`, `serial.rs` +- Virtio: `mmio.rs`, `queue.rs`, `block.rs`, `block_worker.rs`, `disk.rs`, `net.rs`, `vsock/mod.rs`, `vsock/connection.rs`, `vsock/packet.rs`, `p9/mod.rs`, `p9/filesystem.rs`, `p9/protocol.rs`, `rng.rs`, `balloon.rs` + +### Modified Files +- `src/libkrun/src/lib.rs` — cfg-gate Unix-only APIs, expose `krun_start`/`krun_stop`/`krun_wait` for Windows +- `src/libkrun/src/windows_api.rs` — New FFI bridge for Windows-specific lifecycle +- `src/vmm/Cargo.toml` — Add Windows dependencies (windows-sys, crossbeam, parking_lot) +- `Cargo.lock` — Updated dependency tree + +## Testing + +- **Win11** (ThinkPad T14, i5-1135G7): vm-bench 8/8 PASS, net-test 8/8 PASS (4 vCPUs) +- **Win10** (MBP 2014, i7-4770HQ): vm-bench 8/8 PASS, net-test 8/8 PASS (4 vCPUs) +- **macOS/Linux**: Zero regression (code is fully cfg-gated behind `#[cfg(target_os = "windows")]`) + +## Test Plan + +- [ ] CI passes on Linux (existing tests unaffected) +- [ ] Manual verification on Windows with `boot_kernel` example +- [ ] vm-bench: create/exec/stop lifecycle (1 vCPU) +- [ ] net-test: network connectivity via vsock 
(4 vCPUs) diff --git a/docs/PR-summary.html b/docs/PR-summary.html new file mode 100644 index 000000000..fb855679a --- /dev/null +++ b/docs/PR-summary.html @@ -0,0 +1,233 @@ + + PR-summary + + + + + + + + + + + + + +
+ +

BoxLite PR Summary

+
+

Submitted PRs

+

PR #406 — fix(jailer): Dynamic FD Closure

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Item | Detail |
|------|--------|
| URL | https://github.com/boxlite-ai/boxlite/pull/406 |
| Branch | fix/jailer-dynamic-fd-closure |
| Commit | 28e2ce4 |
| Category | Security fix |
| Files changed | 1 file, +257 / -15 |
+

Problem: FD cleanup in jailer pre_exec used hardcoded upper bounds (1024 on Linux, 4096 on macOS). On systems with raised ulimit -n, FDs above these limits leaked into jailed processes, potentially exposing credentials, database connections, or network sockets.

+

Solution: 3-strategy cascade (Linux):

+
  1. close_range(first_fd, ~0U, 0): O(1), Linux 5.9+
  2. /proc/self/fd enumeration via raw getdents64: no heap allocation
  3. Brute-force close with dynamic limit from getrlimit(RLIMIT_NOFILE)
+

macOS uses brute-force with dynamic getrlimit limit. All operations remain async-signal-safe for the pre_exec context.
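The fallback order described above can be sketched as plain control flow. This is a minimal illustration, not the PR's code: the `FdOps` trait, the `FakeKernel` test double, and `FALLBACK_LIMIT` are hypothetical names introduced here so the sketch runs without FFI, and a real `pre_exec` implementation would call the syscalls directly and avoid heap allocation.

```rust
use std::collections::BTreeSet;

/// Marker error: the strategy is not usable here (e.g. ENOSYS, /proc not mounted).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Unavailable;

/// Which strategy ended up doing the work.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Used {
    CloseRange,
    ProcSelfFd,
    BruteForce,
}

/// Hypothetical abstraction over the syscalls (assumption, not the PR's API).
pub trait FdOps {
    /// Stands in for `close_range(first_fd, ~0U, 0)`.
    fn close_range_from(&mut self, first_fd: i32) -> Result<(), Unavailable>;
    /// Stands in for closing every fd >= first_fd listed in /proc/self/fd.
    fn close_via_proc(&mut self, first_fd: i32) -> Result<(), Unavailable>;
    /// Stands in for `getrlimit(RLIMIT_NOFILE)`; `None` if the call failed.
    fn nofile_limit(&mut self) -> Option<i32>;
    /// Stands in for `close(fd)`; errors such as EBADF are ignored.
    fn close(&mut self, fd: i32);
}

/// Assumed last-resort bound if even `getrlimit` fails.
pub const FALLBACK_LIMIT: i32 = 4096;

/// Try each strategy in order and report which one succeeded.
pub fn close_fds_from<O: FdOps>(ops: &mut O, first_fd: i32) -> Used {
    if ops.close_range_from(first_fd).is_ok() {
        return Used::CloseRange;
    }
    if ops.close_via_proc(first_fd).is_ok() {
        return Used::ProcSelfFd;
    }
    // Dynamic upper bound instead of a hardcoded 1024 / 4096.
    let limit = ops.nofile_limit().unwrap_or(FALLBACK_LIMIT);
    for fd in first_fd..limit {
        ops.close(fd);
    }
    Used::BruteForce
}

/// In-memory stand-in for the kernel, used only to exercise the control flow.
pub struct FakeKernel {
    pub has_close_range: bool,
    pub has_proc: bool,
    pub limit: Option<i32>,
    pub open: BTreeSet<i32>,
}

impl FdOps for FakeKernel {
    fn close_range_from(&mut self, first_fd: i32) -> Result<(), Unavailable> {
        if !self.has_close_range {
            return Err(Unavailable);
        }
        self.open.retain(|fd| *fd < first_fd);
        Ok(())
    }
    fn close_via_proc(&mut self, first_fd: i32) -> Result<(), Unavailable> {
        if !self.has_proc {
            return Err(Unavailable);
        }
        self.open.retain(|fd| *fd < first_fd);
        Ok(())
    }
    fn nofile_limit(&mut self) -> Option<i32> {
        self.limit
    }
    fn close(&mut self, fd: i32) {
        self.open.remove(&fd);
    }
}
```

With a fake kernel that lacks both `close_range` and `/proc` and reports a raised limit of 65536, descriptors such as 5000 and 60000 are still closed, which is exactly the case the old hardcoded bounds missed.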

+
+

PR #407 — feat(vmm): pidfd/kqueue Event-Driven Process Monitor

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Item | Detail |
|------|--------|
| URL | https://github.com/boxlite-ai/boxlite/pull/407 |
| Branch | feat/pidfd-kqueue-process-monitor |
| Commit | 78d484e |
| Category | Performance |
| Files changed | 2 files, +467 / -18 |
+

Problem: ProcessMonitor::wait_for_exit() used a 500ms sleep-based polling loop (tokio::time::sleep + try_wait), violating Rule #15: "No Sleep for Events." This added up to 500ms latency to VM crash detection during startup.

+

Solution: Platform-native event-driven mechanisms:

+
  • Linux: pidfd_open() (kernel 5.3+) + tokio AsyncFd
  • macOS: kqueue + EVFILT_PROC + NOTE_EXIT + tokio AsyncFd
  • Fallback: 100ms polling for older kernels (< 5.3)
+

Key design: OwnedFd wraps raw FDs immediately (leak-free by construction), fcntl O_NONBLOCK with graceful fallback, best-effort race guard via is_alive() after FD setup.
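For contrast with the event-driven paths, the polling fallback can be sketched with the standard library alone. This is a blocking stand-in under stated assumptions: the PR's version is async on tokio, and the function name `wait_for_exit_polling` is hypothetical.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread;
use std::time::Duration;

/// Polling fallback: call `try_wait` every `interval` until the child exits.
/// Worst-case detection latency is roughly one `interval`, which is the cost
/// the pidfd and kqueue paths avoid.
pub fn wait_for_exit_polling(child: &mut Child, interval: Duration) -> io::Result<ExitStatus> {
    loop {
        // `try_wait` returns Ok(None) while the child is still running.
        if let Some(status) = child.try_wait()? {
            return Ok(status);
        }
        thread::sleep(interval);
    }
}
```

Passing `Duration::from_millis(100)` matches the fallback interval quoted above; the event-driven variants replace the sleep with readiness notification on a file descriptor.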

+
+

PR #408 — feat(python-sdk): EventListener + Typed Errors

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Item | Detail |
|------|--------|
| URL | https://github.com/boxlite-ai/boxlite/pull/408 |
| Branch | feat/python-event-listener-typed-errors |
| Commit | e5ad727 |
| Category | SDK feature |
| Files changed | 17 files, +1050 / -75 |
+

Problem: Python SDK had no way to receive push-based lifecycle callbacks. All errors were generic PyRuntimeError, making programmatic error handling impossible.

+

Solution:

+
  • PyEventListener bridge: duck-typing via PyO3, missing methods silently skipped
  • Typed exceptions: 15 exception classes inheriting from BoxliteError, exhaustive match on all 18 BoxliteError variants for compile-time completeness
  • event_listeners parameter on BoxliteOptions, propagated through RuntimeImpl
  • 165 Python tests covering exception hierarchy, isolation, and exports
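The "exhaustive match for compile-time completeness" idea can be illustrated with a small sketch. The variant names and exception class names below are hypothetical (the real enum has 18 variants that are not listed here), and the mapping returns a class name as a string rather than constructing a PyO3 exception.

```rust
/// Hypothetical subset of error variants, for illustration only.
#[derive(Debug)]
pub enum BoxliteError {
    NotFound(String),
    AlreadyExists(String),
    Timeout(String),
    Internal(String),
    Storage(String),
}

/// Name of the Python exception class a variant maps to (names are assumptions).
pub fn exception_class(err: &BoxliteError) -> &'static str {
    // There is deliberately no `_ =>` arm: adding a new variant to the enum
    // makes this function fail to compile until the variant is mapped.
    match err {
        BoxliteError::NotFound(_) => "BoxNotFoundError",
        BoxliteError::AlreadyExists(_) => "BoxAlreadyExistsError",
        BoxliteError::Timeout(_) => "BoxTimeoutError",
        // Several variants can share one class, so the number of classes
        // may be smaller than the number of variants.
        BoxliteError::Internal(_) | BoxliteError::Storage(_) => "BoxInternalError",
    }
}
```

The design choice is to trade a catch-all arm for a compiler error, so a forgotten mapping is caught at build time instead of surfacing as a generic runtime error in Python.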
+
+

PR #409 — feat(portal): Streaming File Upload

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Item | Detail |
|------|--------|
| URL | https://github.com/boxlite-ai/boxlite/pull/409 |
| Branch | feat/streaming-file-upload |
| Commit | 37ca16f |
| Category | Performance |
| Files changed | 1 file, +365 / -36 |
+

Problem: upload_tar buffered the entire file into a Vec, causing memory usage of O(file_size). Large file uploads could OOM the host process.

+

Solution: Bounded mpsc channel (capacity=4) with a spawned reader task, capping peak memory at ~5 MiB regardless of file size. Matches the streaming pattern already used in download_tar and guest-side upload handler.

+

Key design: stream_file_chunks helper accepts impl AsyncRead for testability, std::mem::take for zero-copy first chunk, always await reader JoinHandle before checking gRPC result (root-cause priority). 8 unit tests added.
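The bounded-channel pattern can be sketched with std threads and `sync_channel`. This swaps in blocking primitives for the PR's tokio `mpsc` and `AsyncRead`, so it is an analogy rather than the actual `stream_file_chunks`; the 1 MiB chunk size is an assumption chosen because capacity 4 plus one chunk in flight then gives roughly the quoted 5 MiB peak.

```rust
use std::io::{self, Read};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Assumed chunk size (not stated in the source).
pub const CHUNK_SIZE: usize = 1024 * 1024;
/// Channel capacity quoted in the PR summary.
pub const CHANNEL_CAPACITY: usize = 4;

/// Read `reader` in chunks on a separate thread and hand each chunk to `sink`,
/// which plays the role of the gRPC sender. Returns the total bytes read.
pub fn stream_chunks<R, F>(mut reader: R, chunk_size: usize, mut sink: F) -> io::Result<u64>
where
    R: Read + Send + 'static,
    F: FnMut(Vec<u8>),
{
    let (tx, rx) = sync_channel::<Vec<u8>>(CHANNEL_CAPACITY);

    let handle = thread::spawn(move || -> io::Result<u64> {
        let mut total = 0u64;
        loop {
            let mut buf = vec![0u8; chunk_size];
            let n = reader.read(&mut buf)?;
            if n == 0 {
                return Ok(total); // EOF: dropping `tx` ends the consumer loop
            }
            buf.truncate(n);
            total += n as u64;
            // `send` blocks while CHANNEL_CAPACITY chunks are already queued,
            // which is what bounds peak memory.
            if tx.send(buf).is_err() {
                return Ok(total); // consumer went away
            }
        }
    });

    for chunk in rx {
        sink(chunk);
    }

    // Join the reader before reporting success so a read error is surfaced
    // as the root cause.
    handle.join().expect("reader thread panicked")
}
```

Because `send` blocks once the queue is full, memory stays proportional to `CHANNEL_CAPACITY * chunk_size` no matter how large the input is.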

+
+

Summary Stats

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| PR | Category | Lines Changed |
|----|----------|---------------|
| #406 Dynamic FD Closure | Security | +257 / -15 |
| #407 pidfd/kqueue ProcessMonitor | Performance | +467 / -18 |
| #408 Python EventListener + Typed Errors | SDK | +1050 / -75 |
| #409 Streaming File Upload | Performance | +365 / -36 |
| Total | | +2139 / -144 |
+ +
+ + + + + + + + + + \ No newline at end of file diff --git a/docs/PR-summary.md b/docs/PR-summary.md new file mode 100644 index 000000000..fb97f5b1d --- /dev/null +++ b/docs/PR-summary.md @@ -0,0 +1,158 @@ +# BoxLite PR Summary + +--- + +## Submitted PRs + +### [PR #406](https://github.com/boxlite-ai/boxlite/pull/406) — fix(jailer): Dynamic FD Closure + +| Item | Detail | +|------|--------| +| URL | https://github.com/boxlite-ai/boxlite/pull/406 | +| Branch | `fix/jailer-dynamic-fd-closure` | +| Commit | `28e2ce4` | +| Category | Security fix | +| Files changed | 1 file, +257 / -15 | + +**Problem:** FD cleanup in jailer `pre_exec` used hardcoded upper bounds (1024 on Linux, 4096 on macOS). On systems with raised `ulimit -n`, FDs above these limits leaked into jailed processes, potentially exposing credentials, database connections, or network sockets. + +**Solution:** 3-strategy cascade (Linux): +1. `close_range(first_fd, ~0U, 0)` — O(1), Linux 5.9+ +2. `/proc/self/fd` enumeration via raw `getdents64` — no heap allocation +3. Brute-force close with dynamic limit from `getrlimit(RLIMIT_NOFILE)` + +macOS uses brute-force with dynamic `getrlimit` limit. All operations remain async-signal-safe for the `pre_exec` context. + +--- + +### [PR #407](https://github.com/boxlite-ai/boxlite/pull/407) — feat(vmm): pidfd/kqueue Event-Driven Process Monitor + +| Item | Detail | +|------|--------| +| URL | https://github.com/boxlite-ai/boxlite/pull/407 | +| Branch | `feat/pidfd-kqueue-process-monitor` | +| Commit | `78d484e` | +| Category | Performance | +| Files changed | 2 files, +467 / -18 | + +**Problem:** `ProcessMonitor::wait_for_exit()` used a 500ms sleep-based polling loop (`tokio::time::sleep` + `try_wait`), violating Rule #15: "No Sleep for Events." This added up to 500ms latency to VM crash detection during startup. 
+ +**Solution:** Platform-native event-driven mechanisms: +- **Linux**: `pidfd_open()` (kernel 5.3+) + tokio `AsyncFd` +- **macOS**: `kqueue` + `EVFILT_PROC` + `NOTE_EXIT` + tokio `AsyncFd` +- **Fallback**: 100ms polling for older kernels (< 5.3) + +Key design: `OwnedFd` wraps raw FDs immediately (leak-free by construction), `fcntl O_NONBLOCK` with graceful fallback, best-effort race guard via `is_alive()` after FD setup. + +--- + +### [PR #408](https://github.com/boxlite-ai/boxlite/pull/408) — feat(python-sdk): EventListener + Typed Errors + +| Item | Detail | +|------|--------| +| URL | https://github.com/boxlite-ai/boxlite/pull/408 | +| Branch | `feat/python-event-listener-typed-errors` | +| Commit | `e5ad727` | +| Category | SDK feature | +| Files changed | 17 files, +1050 / -75 | + +**Problem:** Python SDK had no way to receive push-based lifecycle callbacks. All errors were generic `PyRuntimeError`, making programmatic error handling impossible. + +**Solution:** +- **PyEventListener** bridge: duck-typing via PyO3, missing methods silently skipped +- **Typed exceptions**: 15 exception classes inheriting from `BoxliteError`, exhaustive match on all 18 `BoxliteError` variants for compile-time completeness +- **`event_listeners`** parameter on `BoxliteOptions`, propagated through `RuntimeImpl` +- **165 Python tests** covering exception hierarchy, isolation, and exports + +--- + +### [PR #409](https://github.com/boxlite-ai/boxlite/pull/409) — feat(portal): Streaming File Upload + +| Item | Detail | +|------|--------| +| URL | https://github.com/boxlite-ai/boxlite/pull/409 | +| Branch | `feat/streaming-file-upload` | +| Commit | `37ca16f` | +| Category | Performance | +| Files changed | 1 file, +365 / -36 | + +**Problem:** `upload_tar` buffered the entire file into a `Vec`, causing memory usage of O(file_size). Large file uploads could OOM the host process. 
+ +**Solution:** Bounded `mpsc` channel (capacity=4) with a spawned reader task, capping peak memory at ~5 MiB regardless of file size. Matches the streaming pattern already used in `download_tar` and guest-side upload handler. + +Key design: `stream_file_chunks` helper accepts `impl AsyncRead` for testability, `std::mem::take` for zero-copy first chunk, always await reader `JoinHandle` before checking gRPC result (root-cause priority). 8 unit tests added. + +--- + +### [PR #413](https://github.com/boxlite-ai/boxlite/pull/413) — feat(litebox): Pause/Resume API for Zero-CPU VM Freezing + +| Item | Detail | +|------|--------| +| URL | https://github.com/boxlite-ai/boxlite/pull/413 | +| Branch | `feat/pause-resume-api` | +| Commit | `ded35bf` | +| Category | Feature | +| Files changed | 16 files, +1430 / -48 | + +**Problem:** Running VMs consume CPU even when idle. For AI agent sandboxes that run intermittently, there was no way to suspend a VM and reclaim compute resources without destroying the box. 
+ +**Solution:** Full pause/resume lifecycle with `SIGSTOP`/`SIGCONT` signals: +- **`LiteBox::pause()`** / **`resume()`** with state machine enforcement (`Running` ↔ `Paused`) +- **Quiesced tracking**: Operations that observe the paused state are tracked; if a pause fails, the box is marked `QuiesceFailed` rather than silently reverting +- **ESRCH race handling**: Graceful handling of process-already-gone races during signal delivery +- **SDK bindings**: Python (`await box.pause()` / `await box.resume()`), Node.js (`box.pause()` / `box.resume()`) +- **Audit events**: `BoxPaused` / `BoxResumed` emitted through EventListener +- **REST API**: `POST /boxes/{id}/pause` / `POST /boxes/{id}/resume` +- **350-line integration test suite** + Python example script + +--- + +### [PR #415](https://github.com/boxlite-ai/boxlite/pull/415) — fix(box_impl): Offload Blocking handler.stop() and metrics() to spawn_blocking + +| Item | Detail | +|------|--------| +| URL | https://github.com/boxlite-ai/boxlite/pull/415 | +| Branch | `fix/spawn-blocking-handler` | +| Commit | `8d043d1` | +| Category | Performance fix | +| Files changed | 1 file, +22 / -11 | + +**Problem:** `ShimHandler::stop()` uses a `std::thread::sleep(50ms)` polling loop (up to 2 seconds total) and `metrics()` performs synchronous sysinfo I/O. Both are called from async `BoxImpl` methods, blocking Tokio worker threads and causing latency spikes for concurrent operations. 
+ +**Solution:** Wrap `handler` field in `Arc<Mutex<…>>` and offload both blocking calls via `tokio::task::spawn_blocking`: +- **`stop()`**: Swallows lock poison (shutdown must proceed regardless) +- **`metrics()`**: Propagates lock poison (monitoring should surface anomalies) +- **Double `??` pattern**: awaiting `spawn_blocking` yields `Result<Result<T, BoxliteError>, JoinError>`; the first `?` unwraps the `JoinError`, the second unwraps the inner `BoxliteError` + +--- + +### C2 fix (PR pending) - fix(exec): Remove UB in Python SDK by Relaxing Execution Methods to &self + +| Item | Detail | +|------|--------| +| URL | *PR not yet created* | +| Branch | `fix/execution-remove-unsafe` | +| Commit | `4e76d8c` | +| Category | Safety / UB fix | +| Files changed | 2 files, +10 / -15 | + +**Problem:** Python SDK `PyExecution` created `&mut Execution` from a shared `Arc<Execution>` via `unsafe { &mut *(Arc::as_ptr(&self.execution) as *mut Execution) }` (5 occurrences). This violates Rust's aliasing rules and is Undefined Behavior. + +**Solution:** Two-layer fix: +1. **Core library** (`src/boxlite/src/litebox/exec.rs`): Relax 5 `Execution` methods from `&mut self` to `&self`, which is safe because all mutation goes through the inner `Arc<Mutex<…>>` +2. 
**Python SDK** (`sdks/python/src/exec.rs`): Remove all 5 `unsafe` blocks, call methods directly via `Arc::Deref` + +--- + +## Summary Stats + +| PR | Category | Lines Changed | +|----|----------|--------------| +| [#406](https://github.com/boxlite-ai/boxlite/pull/406) Dynamic FD Closure | Security | +257 / -15 | +| [#407](https://github.com/boxlite-ai/boxlite/pull/407) pidfd/kqueue ProcessMonitor | Performance | +467 / -18 | +| [#408](https://github.com/boxlite-ai/boxlite/pull/408) Python EventListener + Typed Errors | SDK | +1050 / -75 | +| [#409](https://github.com/boxlite-ai/boxlite/pull/409) Streaming File Upload | Performance | +365 / -36 | +| [#413](https://github.com/boxlite-ai/boxlite/pull/413) Pause/Resume API | Feature | +1430 / -48 | +| [#415](https://github.com/boxlite-ai/boxlite/pull/415) spawn_blocking for handler | Performance fix | +22 / -11 | +| C2 (pending) Remove Execution UB | Safety / UB fix | +10 / -15 | +| **Total** | | **+3601 / -218** | diff --git a/docs/PR-summary.pdf b/docs/PR-summary.pdf new file mode 100644 index 000000000..cd9cbeac9 Binary files /dev/null and b/docs/PR-summary.pdf differ diff --git a/docs/[*]ai-agent-sandbox-runtime-market-research.md b/docs/[*]ai-agent-sandbox-runtime-market-research.md new file mode 100644 index 000000000..421387587 --- /dev/null +++ b/docs/[*]ai-agent-sandbox-runtime-market-research.md @@ -0,0 +1,649 @@ +# AI Agent Sandbox Runtime Service Market Research Report + +> Research date: 2026-05-12 +> Goal: Provide market insight for BoxLite's strategic direction as an AI Agent Sandbox Runtime PaaS provider + +--- + +## Summary of Key Findings + +### Market Status +- The Agentic AI market is projected at $10.8B in 2026 and $54.8B in 2032 (CAGR ~33%) +- AI sandboxes have become a must-have infrastructure layer, and microVM isolation is the industry consensus +- $6.42B flowed into Agentic AI in 2025, with funding visibly concentrated among the leaders + +### Landscape of 11 Providers +Covers **E2B, Modal, Fly.io Sprites, Daytona, Cloudflare, Vercel, Northflank, Blaxel, RunLoop, Koyeb, Docker Sandboxes**, compared in depth across isolation technology, pricing, features, funding, and other dimensions. + +### BoxLite's Unique Advantages +**Every competitor is a remote cloud service; none offers an embeddable VM-level sandbox library.** BoxLite's "SQLite for Sandboxing" positioning is entirely unoccupied in the market: +- **Embedded microVM with no daemon or root required** 
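(competitors cannot easily replicate this) +- **Cross-platform (Linux + macOS + Windows)**: competitors support Linux only +- **OCI container native**: integrates seamlessly with the container ecosystem + +### Recommended Positioning +**Hybrid Embedded + Cloud**: one SDK that either runs embedded locally or scales out transparently to the cloud. No competitor covers this market position. + +### Key Capabilities to Fill In (P0) +1. Snapshot/Checkpoint: an industry baseline; Blaxel's 25ms restore is the benchmark +2. Managed cloud service: the key leap from library to service +3. Billing system + multi-tenant orchestration + +--- + +## Table of Contents + +1. [Market Overview](#1-market-overview) +2. [In-Depth Analysis of Core Providers](#2-in-depth-analysis-of-core-providers) +3. [Isolation Technology Comparison](#3-isolation-technology-comparison) +4. [Pricing Model Comparison](#4-pricing-model-comparison) +5. [Feature Matrix Comparison](#5-feature-matrix-comparison) +6. [Market Landscape and Competitive Dynamics](#6-market-landscape-and-competitive-dynamics) +7. [BoxLite Differentiation Analysis](#7-boxlite-differentiation-analysis) +8. [Strategic Recommendations](#8-strategic-recommendations) + +--- + +## 1. Market Overview + +### 1.1 Market Size and Growth + +The Agentic AI market is growing explosively: + +- **2025 market size**: ~$7.6B +- **2026 forecast**: ~$10.8B (YoY +42%) +- **2032 forecast**: ~$54.8B (CAGR ~33%) +- **2034 forecast**: ~$105.6B + +As a key layer of Agentic AI infrastructure, AI Agent Sandbox Runtime benefits directly from this growth. + +### 1.2 Funding Activity + +- **2025**: $6.42B flowed into Agentic AI over the full year, more than a quarter of the sector's all-time funding +- **2025 Q4 to 2026 Q1**: the average round size across 15 Agentic AI startups reached $155M, nearly 2x the 2025 H1 figure ($82M) +- **Key funding events**: + - E2B: $21M Series A (2025.07, led by Insight Partners) + - Daytona: $24M Series A (2026.02, led by FirstMark Capital) + - The market shows a trend toward "fewer but larger" bets, with funding visibly concentrated among the leaders + +### 1.3 Demand Drivers + +| Driver | Description | +|---------|------| +| AI coding agent boom | OpenAI Codex has passed 2 million weekly active users; coding agents such as Claude Code, Cursor, and Windsurf are spreading quickly | +| RL training demand | Reinforcement learning training needs large numbers of parallel sandboxes (Modal customers have reached ~100K concurrent sandboxes) | +| Security and compliance requirements | Enterprises are increasingly strict about execution safety for AI-generated code (SOC 2, HIPAA, ISO 27001) | +| Multi-tenant isolation | SaaS platforms need a separate isolated environment per tenant/request | + +--- + +## 2. 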
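In-Depth Analysis of Core Providers + +### 2.1 E2B: "The Enterprise AI Agent Cloud" + +**Overview**: +- Headquarters: Prague, Czech Republic +- Employees: ~28 (2026.03) +- Funding: ~$43.8M total (including a $21M Series A) +- Revenue: $1.5M ARR (2025.06) +- Open source: [github.com/e2b-dev/E2B](https://github.com/e2b-dev/E2B) + +**Technical architecture**: +- **Isolation technology**: Firecracker microVM +- **Cold start**: ~150-200ms +- **Max session duration**: 24 hours (Pro plan) +- **Runtime**: any Linux runtime, with custom template support +- **SDK**: Python, JavaScript/TypeScript +- **Networking**: full internet access inside the sandbox by default; services can be exposed publicly + +**Core capabilities**: +- SDK-first design with an excellent developer experience +- Custom sandbox templates (defined Dockerfile-style) +- Filesystem read/write, process management, port exposure +- Partnership with Docker (joint Docker + E2B solution) + +**Deployment options**: +- Managed SaaS (default) +- Self-hosted (Terraform; GCP supported today, AWS in development) + +**Limitations**: +- Session duration capped at 24h +- No GPU support +- Self-hosting is still early +- No mature BYOC (Bring Your Own Cloud) offering + +--- + +### 2.2 Modal: "Run any code in the cloud" + +**Overview**: +- Headquarters: New York, USA +- Positioning: general-purpose cloud compute platform; sandboxes are one of its product lines + +**Technical architecture**: +- **Isolation technology**: gVisor containers +- **Cold start**: sub-second +- **Max session duration**: configurable +- **Runtime**: Python-first, with dynamic runtime definitions +- **SDK**: Python, JavaScript, Go +- **GPU**: full support (L4, A100, H100, H200) + +**Core capabilities**: +- Extreme elasticity: can scale to 50,000+ sandboxes almost instantly +- Creation throughput: tested at 1,000 sandboxes/second +- Strong GPU support and serverless GPU scheduling +- Code-first developer experience +- Snapshot/Volume primitives +- Built-in Tunnel mechanism + +**Strengths**: +- The leader for RL training workloads (customers already run ~100K concurrent sandboxes) +- Mixed GPU + CPU workloads +- Mature serverless infrastructure + +**Limitations**: +- Sandbox pricing carries a 3x premium over standard compute +- No BYOC +- Isolation strength: gVisor (not hardware-level VM isolation) + +--- + +### 2.3 Fly.io Sprites: "Persistent VMs for AI Agents" + +**Overview**: +- Product: [sprites.dev](https://sprites.dev) +- Launched: 2026.01 +- Philosophy: "Ephemeral sandboxes are obsolete", a stance against ephemeral sandboxes + +**Technical architecture**: +- **Isolation technology**: full VM (Firecracker) +- **Cold start**: 1-2 seconds to create +- **Persistence**: fully persistent; the filesystem is kept across sessions +- **Storage**: directly attached NVMe + persistence to object storage +- **Billing model**: no charge while idle, pay per use + +**Core capabilities**: +- **Checkpoint & Restore**: checkpoints complete in ~300ms, with rollback support +- Full Linux environment with Claude preinstalled by default +- Billed per written block (TRIM-friendly; deleting data lowers the bill) +- Persistent filesystem + +**Differentiation**: +- The only mainstream platform that explicitly argues "persistent > ephemeral" +- Emphasizes stateful, long-lived agent environments +- The checkpoint mechanism is highly valuable for development-oriented agents + +**Limitations**: +- Limited GPU support +- Relatively high cost at scale (200 concurrent sandboxes >$35K/month) +- Young ecosystem; enterprise features still maturing + +--- + +### 2.4 Daytona: "Secure 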
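Infrastructure for AI-Generated Code" + +**Overview**: +- Funding: $24M Series A (2026.02, FirstMark Capital) +- Pivot: moved from development environments to an AI code execution platform in early 2025 +- Open source: [github.com/daytonaio/daytona](https://github.com/daytonaio/daytona) + +**Technical architecture**: +- **Isolation technology**: Docker containers (Kata Containers optional) +- **Cold start**: <90ms +- **Persistence**: stateful workspaces +- **SDK/API**: RESTful API +- **Highlights**: Git integration, LSP support, filesystem operations, Computer Use (Linux/macOS/Windows desktops) + +**Deployment options**: +- Fully managed SaaS +- Open-source self-deployment +- Hybrid deployment (Daytona orchestrates, customer hardware executes) + +**Strengths**: +- Very fast cold start (<90ms) +- Computer Use capability (GUI desktop operation) +- Flexible deployment model +- GPU support + +**Limitations**: +- Default isolation is Docker only (not VM-level) +- Public pricing covers only the $200 free credit; anything beyond that goes through enterprise sales +- The pivot is recent, so product maturity is still unproven + +--- + +### 2.5 Cloudflare: Sandboxes + Dynamic Workers + +**Overview**: +- Products: Cloudflare Sandboxes (GA, 2026.04) + Dynamic Workers (Open Beta, 2026.04) +- Positioning: AI agent infrastructure on a global edge network + +**Technical architecture**: + +| Product | Isolation technology | Cold start | Best suited for | +|------|---------|--------|---------| +| **Sandboxes** | Containers (full Linux environment) | Seconds | Workloads that need a complete environment and persistent state | +| **Dynamic Workers** | V8 Isolate | Milliseconds | Lightweight, high-frequency JS/TS execution | + +**Core capabilities**: +- Dynamic Workers: 100x faster than containers, 100x more memory-efficient +- Stateful sandboxes addressable by name, with automatic sleep/wake +- Outbound HTTP request interception (credential injection; agent code never touches secrets) +- Global edge deployment + +**Strengths**: +- Globally distributed edge network +- Two isolation models (container + isolate) covering different scenarios +- Extremely low latency with Dynamic Workers +- Strong security features (credential injection, network isolation) + +**Limitations**: +- Container sandboxes are not VM-level isolation +- Dynamic Workers support only JS/TS (and Wasm) +- No GPU support +- Limited customization + +--- + +### 2.6 Vercel Sandbox + +**Overview**: +- Positioning: a code execution primitive within the Vercel ecosystem +- Technology: Firecracker microVM + +**Technical architecture**: +- **Isolation**: Firecracker microVM (separate filesystem and network) +- **Cold start**: milliseconds +- **Runtime**: Amazon Linux 2023, Node.js 24/22, Python 3.13 +- **Max duration**: Hobby 45 minutes, Pro/Enterprise 5 hours + +**Core capabilities**: +- Snapshotting (save/restore sandbox state) +- Persistent Sandboxes (Beta, automatic save/restore) +- Network firewall (allow-all / deny-all / custom rules) + +**Limitations**: +- Limited runtime choice (Node.js + Python only) +- Short session duration +- Tightly coupled to the Vercel ecosystem +- No GPU support + +--- + +### 2.7 Northflank + +**Overview**: +- Positioning: full-stack PaaS + AI sandbox platform +- Monthly volume: 2 million+ isolated workloads + +**Technical architecture**: +- **Isolation**: MicroVM (Kata Containers + Cloud Hypervisor) + gVisor +- **Cold start**: seconds 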
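+- **Session duration**: unlimited +- **Runtime**: any OCI image +- **GPU**: full support (L4, A100, H100, H200) + +**Core capabilities**: +- The only platform offering self-service BYOC with public pricing +- Unlimited session duration +- Standard OCI images, no modification required +- Multi-layer isolation (MicroVM + gVisor) + +**Strengths**: +- Lowest public PaaS CPU rate ($0.01667/vCPU-hr) +- BYOC sharply reduces cost at scale +- GPU pricing up to 62% cheaper than public clouds +- Full PaaS capabilities (not just sandboxes) + +--- + +### 2.8 Blaxel: "The Persistent Sandbox Platform" + +**Overview**: +- Positioning: persistent sandboxes built for production AI agents +- Target customers: AI-first companies from Series A to Series D + +**Technical architecture**: +- **Isolation**: microVM (technology similar to AWS Lambda) +- **Resume time**: ~25ms (resume from standby, including full memory state) +- **Standby cost**: zero compute charge, only snapshot storage is billed +- **Compliance**: SOC 2, HIPAA, ISO 27001 + +**Core capabilities**: +- Unlimited standby (zero compute charge) +- 25ms resume (including full filesystem + memory state) +- Agent co-located with the sandbox (very low latency) + +**Pricing**: +- Billed by memory tier (CPU included): + - XS (2GB): $0.0828/hr + - S (4GB): $0.1656/hr + - M (8GB): $0.3312/hr + - L (16GB): $0.6624/hr + - XL (32GB): $1.3248/hr +- Free credit: $200 + +--- + +### 2.9 RunLoop: "AI Agent Accelerator" + +**Overview**: +- Positioning: enterprise-grade infrastructure for AI coding agents +- Compliance: SOC 2 +- Concurrency: 10,000+ parallel instances + +**Technical architecture**: +- **Isolation**: two-layer isolation (VM + Container) +- **Performance**: custom bare-metal hypervisor, 2x faster vCPU, 100ms command execution +- **SDK**: Python, TypeScript, CLI, Dashboard + +**Core capabilities**: +- Blueprints (reusable templates) +- Snapshots (pause/resume) +- Built-in Benchmark & Eval framework +- Automatic inference of the build environment from a Git repository + +**Limitations**: +- Opaque pricing (requires contacting sales) +- Focused on coding agents, so less general-purpose + +--- + +### 2.10 Koyeb + +**Overview**: +- Positioning: high-performance serverless AI infrastructure + +**Core capabilities**: +- CPU + GPU sandboxes +- Multi-region deployment (low latency) +- Python + JavaScript SDK +- Claude Agent SDK integration examples + +--- + +### 2.11 Docker Sandboxes + +**Overview**: +- Released: 2026.03 (experimental feature) +- Positioning: AI agent sandboxes for local development environments + +**Technical architecture**: +- **Isolation**: microVM (separate Linux kernel) +- **Highlights**: each sandbox has its own Docker daemon, filesystem, and network + +**Supported agents**: +- Claude Code, Codex, Gemini CLI, GitHub Copilot, Kiro, Docker Agent, and others + +**Positioning analysis**: +- Aimed at local development, not a cloud service +- The only option that lets an agent build/run Docker containers inside the sandbox +- Does not compete directly with cloud sandbox services, but shapes developer mindshare + +--- + +## 3. 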
Isolation Technology Comparison + +### 3.1 Four Major Isolation Technologies + +| Technology | Security strength | Cold start | Memory overhead | Language restrictions | Representative products | +|------|---------|--------|---------|---------|---------| +| **MicroVM (Firecracker)** | ★★★★★ hardware-level | ~125-200ms | <5 MiB/VM | None | E2B, Vercel, Fly.io | +| **MicroVM (Kata/CLH)** | ★★★★★ hardware-level | Seconds | Higher | None | Northflank | +| **gVisor** | ★★★★ user-space kernel | Sub-second | Medium | None | Modal, Northflank | +| **Docker containers** | ★★★ shared kernel | <90ms | Lowest | None | Daytona | +| **V8 Isolate** | ★★★ language runtime | Milliseconds | ~MB range | JS/TS/Wasm | Cloudflare Dynamic Workers | + +### 3.2 Industry Trends + +> "In the span of 18 months, nearly every major platform converged on the same answer: untrusted code needs stronger isolation than a container, and most chose microVMs." + +- **Consensus**: microVMs have become the de facto standard for production-grade AI agent sandboxes +- **Divergence**: lightweight scenarios (JS/TS) lean toward V8 Isolates; scenarios that demand the fastest possible startup use containers + gVisor +- **BoxLite's technical fit**: libkrun (a microVM built on KVM/Hypervisor.framework) sits at the highest tier of isolation strength, closely aligned with the industry trend + +--- + +## 4. Pricing Model Comparison + +### 4.1 CPU Pricing ($/vCPU-hour) + +| Provider | Rate | Billing granularity | +|-------|------|---------| +| **Northflank** | $0.01667 | Per second | +| **E2B** | $0.0504 | Per second | +| **Daytona** | $0.0504 | Per second | +| **Fly.io Sprites** | $0.07 | Per second (idle is free) | +| **Modal** | ~$0.071 (includes the 3x sandbox premium) | Per second | +| **Cloudflare Sandbox** | $0.072 (active CPU only) | Per second | +| **RunLoop** | $0.108 | Per second | +| **Vercel Sandbox** | $0.128 (active CPU only) | Per second | + +### 4.2 Memory Pricing ($/GiB-hour) + +| Provider | Rate | +|-------|------| +| **Northflank** | $0.00833 | +| **E2B** | $0.0162 | +| **Daytona** | $0.0162 | +| **Modal** | $0.0242 | +| **RunLoop** | $0.0252 | +| **Vercel** | $0.0212 | +| **Fly.io Sprites** | $0.04375 | + +### 4.3 GPU Pricing ($/hour) + +| GPU model | Northflank | Modal | +|---------|-----------|-------| +| L4 | $0.80 | $0.80 | +| A100 40GB | $1.42 | $2.10 | +| A100 80GB | $1.76 | $2.50 | +| H100 | $2.74 | $3.95 | +| H200 | $3.14 | $4.54 | + +*Note: E2B, Daytona, Vercel, and Cloudflare do not support GPUs* + +### 4.4 Cost at Scale (200 concurrent sandboxes/month) + +| Provider | Model | Monthly cost | +|-------|------|-------| +| **Northflank BYOC** | BYOC | ~$2,060 | +| **Northflank PaaS** | 
PaaS | ~$7,200 | +| **E2B** | PaaS | ~$16,819 | +| **Daytona** | PaaS | ~$16,819 | +| **Modal** | PaaS | ~$24,491 | +| **Fly.io Sprites** | PaaS | >$35,000 | + +### 4.5 Free Credits + +| Provider | Free credit | +|-------|---------| +| E2B | $100 (one-time) | +| Daytona | $200 | +| Blaxel | $200 | +| RunLoop | $50 | +| Modal | $30/month | + +--- + +## 5. Feature Matrix Comparison + +| Feature | E2B | Modal | Fly.io Sprites | Daytona | Cloudflare | Vercel | Northflank | Blaxel | RunLoop | +|------|-----|-------|----------------|---------|------------|--------|------------|--------|---------| +| **Isolation level** | microVM | gVisor | VM | Docker | Container/Isolate | microVM | microVM+gVisor | microVM | VM+container | +| **Cold start** | ~150ms | <1s | 1-2s | <90ms | ms (Workers) | ms | Seconds | 25ms resume | 100ms | +| **Max session** | 24h | Configurable | Unlimited | Stateful | Configurable | 5h (Pro) | Unlimited | Unlimited standby | Configurable | +| **GPU** | ❌ | ✅ | Limited | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | +| **BYOC** | Experimental | ❌ | ❌ | Hybrid | ❌ | ❌ | ✅ | ❌ | ❌ | +| **OCI images** | Custom templates | Dynamic | ✅ | Docker | ❌ | Limited | ✅ | ❌ | Blueprints | +| **Snapshot** | ✅ | ✅ | Checkpoint | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | +| **SDK** | Py/JS | Py/JS/Go | CLI/API | REST | JS | JS | API | API | Py/TS | +| **Open source** | ✅ | ❌ | ❌ | ✅ | Partial | ❌ | ❌ | ❌ | ❌ | +| **SOC 2** | In progress | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | + +--- + +## 6. 
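Market Landscape and Competitive Dynamics + +### 6.1 Market Tiers + +``` +┌─────────────────────────────────────────────────────────┐ +│ Tier 1: AI Sandbox Specialists │ +│ E2B · Daytona · Blaxel · RunLoop │ +│ (SDK-first, AI-native, vertically focused) │ +├─────────────────────────────────────────────────────────┤ +│ Tier 2: Platforms (sandbox is one product line) │ +│ Modal · Fly.io · Cloudflare · Vercel · Koyeb │ +│ (broader portfolio; sandbox serves platform strategy) │ +├─────────────────────────────────────────────────────────┤ +│ Tier 3: Full-stack PaaS │ +│ Northflank │ +│ (full PaaS + sandbox, BYOC, cost leader) │ +├─────────────────────────────────────────────────────────┤ +│ Tier 4: Developer tools / local solutions │ +│ Docker Sandboxes │ +│ (local dev, not a cloud service; shapes mindshare) │ +└─────────────────────────────────────────────────────────┘ +``` + +### 6.2 Competitive Dimensions + +| Dimension | Leaders | Notes | +|------|-------|------| +| **Developer experience/SDK** | E2B, Modal | Well-designed SDKs, quick to get started | +| **Isolation security strength** | E2B, Vercel, Northflank | Hardware-level microVM isolation | +| **Cost at scale** | Northflank (BYOC) | 200 sandboxes for only ~$2K/month | +| **Fastest cold start** | Blaxel (25ms resume), Daytona (<90ms) | Millisecond-level startup | +| **GPU capability** | Modal, Northflank | Broad GPU model support | +| **Persistence/statefulness** | Fly.io Sprites, Blaxel | Sandboxes persist across sessions | +| **Enterprise compliance** | Blaxel, RunLoop, Modal | SOC 2, HIPAA, ISO | +| **Global deployment** | Cloudflare, Koyeb | Edge nodes distributed worldwide | +| **Open source** | E2B, Daytona | Community-driven, self-hostable | + +### 6.3 Key Trends + +1. **MicroVM as consensus**: within 18 months, nearly every major platform converged on microVMs +2. **From ephemeral to persistent**: Fly.io Sprites and Blaxel lead the "persistent sandbox" trend +3. **Snapshot/Checkpoint**: now a differentiating feature that cuts repeated environment setup +4. **Growing BYOC demand**: enterprise requirements for data sovereignty and cost control are driving BYOC +5. **Computer Use as a new track**: Daytona's GUI desktop operation opens up new scenarios +6. **SDK-first beats API-first**: developers prefer native-language SDKs over REST APIs + +--- + +## 7. 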
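BoxLite Differentiation Analysis + +### 7.1 BoxLite's Technical Advantages + +| Advantage | Market need addressed | Competitor situation | +|------|------------|------------| +| **libkrun microVM** (KVM/Hypervisor.framework) | Hardware-level isolation, the strongest isolation by industry consensus | E2B uses Firecracker; Modal uses gVisor; Daytona uses Docker | +| **No daemon/root required** | Embedded deployment, lower operational complexity | Most competitors need platform-level infrastructure | +| **Cross-platform** (Linux KVM + macOS HVF + Windows WHPX) | Covers all mainstream development/deployment platforms | Most competitors are Linux-only; Docker Sandboxes covers desktops | +| **OCI container native** | Seamless integration with the container ecosystem | Northflank also supports any OCI image; E2B uses custom templates | +| **SQLite persistence** | Lightweight embedded state management | Competitors mostly depend on external databases/object storage | +| **Async-first (Tokio)** | High-concurrency parallel sandboxes | Modal's Python async; E2B's async SDK | +| **gRPC vsock** | High-performance host-guest communication | Standard practice, but implementation details affect performance | +| **Multiple SDKs** (Python/C/Node.js) | Covers the main developer groups | E2B: Py/JS; Modal: Py/JS/Go | + +### 7.2 Differentiation Options + +#### Option A: "Embedded AI Sandbox Runtime" (the SQLite model) + +> "SQLite for Sandboxing": embedded directly in the application, with no external service required + +- **Goal**: let any application embed VM-level sandboxing the way it would embed SQLite +- **Difference**: every competitor is a remote cloud service; BoxLite can be an embedded library + optional cloud service +- **Market gap**: no competitor offers an embedded SDK (no network calls, VM started locally) +- **Use cases**: edge devices, private deployments, offline environments, highly latency-sensitive applications + +#### Option B: "Cross-Platform AI Sandbox Cloud" + +> The only sandbox cloud service with native Linux + macOS + Windows support + +- **Difference**: every competitor is Linux-only; BoxLite supports hypervisors across platforms +- **Market gap**: macOS developers can test locally without a Linux VM; native Windows support +- **Use cases**: cross-platform CI/CD, desktop application sandboxes, multi-platform agents + +#### Option C: "Hybrid Embedded + Cloud Sandbox" + +> Embedded local sandbox + elastic cloud scaling, behind one SDK + +- **Difference**: the same API/SDK runs locally or scales out transparently to the cloud +- **Market gap**: no competitor can switch transparently between local and cloud +- **Use cases**: fast local iteration during development, elastic cloud scaling in production + +### 7.3 Capabilities BoxLite Needs to Fill In + +| Capability | Priority | Notes | +|------|-------|------| +| **Snapshot/Checkpoint** | P0 | Industry baseline; Blaxel's 25ms resume is the benchmark | +| **Managed cloud service** | P0 | The key leap from library to service | +| **Billing system** | P0 | Per-second/per-resource billing | +| **Multi-tenant orchestration** | P0 | Concurrent sandbox management, resource scheduling | +| **SDK quality and documentation** | P1 | E2B's SDK experience is the benchmark | +| **GPU passthrough** | P1 | Required for RL training and inference scenarios | +| **Global multi-region deployment** | P1 | Lower latency, data compliance | +| **SOC 2 / ISO 27001** | P1 | Entry requirement for enterprise customers | +| **Network isolation/firewall** | P2 | Security and compliance requirements | +| **BYOC** | P2 | Lowers cost at scale for large customers | + +--- + +## 8. Strategic Recommendations + +### 8.1 Short Term (0-6 months): Establish Embedded Differentiation + +1. **Commit to the "Embeddable VM Sandbox" positioning**: this advantage is unique to BoxLite and hard for competitors to replicate +2. 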
**完善 Python SDK 到生产级** — AI agent 生态以 Python 为主 (LangChain, CrewAI, AutoGen) +3. **实现 Snapshot/Resume** — 冷启动优化和状态持久化 +4. **构建 "BoxLite Cloud" MVP** — 托管沙箱服务, 验证 PMF + +### 8.2 中期 (6-12 个月): 构建云服务 + +1. **发布 BoxLite Cloud** — 按秒计费的托管沙箱服务 +2. **GPU passthrough** — 进入 RL 训练市场 +3. **SOC 2 合规** — 企业客户准入 +4. **打造 Hybrid 模式** — 同一 SDK, 本地嵌入或云端执行 + +### 8.3 长期 (12+ 个月): 生态扩展 + +1. **BYOC 支持** — 降低大客户成本, 参考 Northflank 模式 +2. **全球多区域** — 边缘部署 +3. **Agent Framework 集成** — 成为 LangChain/CrewAI/Claude Agent SDK 的首选沙箱 runtime +4. **Marketplace** — 预置模板市场 + +### 8.4 定价策略建议 + +基于市场调研, 建议 BoxLite Cloud 定价策略: + +| 指标 | 建议值 | 参考 | +|------|-------|------| +| CPU | $0.03-0.04/vCPU-hr | 介于 Northflank ($0.017) 和 E2B ($0.05) 之间 | +| 内存 | $0.01-0.015/GiB-hr | 与 Northflank 对齐 | +| 免费额度 | $100-200 | 行业标准 | +| 计费粒度 | 按秒 | 行业标准 | +| 嵌入式 SDK | 开源免费 | 吸引开发者, 云服务变现 | + +--- + +## 附录: 信息来源 + +- [Northflank AI Sandbox Pricing Comparison 2026](https://northflank.com/blog/ai-sandbox-pricing) +- [Northflank Best Code Execution Sandbox for AI Agents](https://northflank.com/blog/best-code-execution-sandbox-for-ai-agents) +- [E2B Official](https://e2b.dev/) +- [Modal Sandboxes](https://modal.com/products/sandboxes) +- [Fly.io Sprites](https://sprites.dev/) +- [Daytona](https://www.daytona.io/) +- [Cloudflare Sandboxes](https://developers.cloudflare.com/sandbox/) +- [Cloudflare Dynamic Workers](https://blog.cloudflare.com/dynamic-workers/) +- [Vercel Sandbox](https://vercel.com/docs/vercel-sandbox) +- [Blaxel](https://blaxel.ai/) +- [RunLoop](https://runloop.ai/) +- [Koyeb Sandboxes](https://www.koyeb.com/blog/koyeb-sandboxes-fast-scalable-fully-isolated-environments-for-ai-agents) +- [Docker Sandboxes](https://docs.docker.com/ai/sandboxes/) +- [Firecrawl AI Agent Sandbox Guide](https://www.firecrawl.dev/blog/ai-agent-sandbox) +- [Better Stack Sandbox Runners Comparison](https://betterstack.com/community/comparisons/best-sandbox-runners/) +- [Agentic AI Funding 
Analysis](https://newmarketpitch.com/blogs/news/agentic-ai-funding-analysis) +- [AgentMarketCap Funding Velocity Report](https://agentmarketcap.ai/blog/2026/04/08/agentic-ai-funding-velocity-2026-sector-map-vertical-distribution) diff --git a/docs/[*]microvm-vs-qemu-technical-comparison.md b/docs/[*]microvm-vs-qemu-technical-comparison.md new file mode 100644 index 000000000..7f466c181 --- /dev/null +++ b/docs/[*]microvm-vs-qemu-technical-comparison.md @@ -0,0 +1,952 @@ +# MicroVM vs 传统 KVM+QEMU: 技术深度对比与 AI Agent Sandbox 优势分析 + +> 调研日期: 2026-05-12 +> 目标: 从技术架构层面深度对比 microVM 方案 (BoxLite/libkrun, Firecracker, Cloud Hypervisor 等) 与传统 KVM+QEMU 方案, 分析 microVM 在 AI Agent Sandbox 场景下的技术优势 + +--- + +## 核心结论 + +MicroVM 并非"缩小版的 QEMU", 而是一种**面向特定工作负载的根本性架构重设计**。两者共享同一个硬件虚拟化层 (KVM/HVF), 但在其上层的 VMM (Virtual Machine Monitor) 设计哲学截然不同。这些差异在 AI Agent Sandbox 场景下转化为决定性的产品优势。 + +| 维度 | 传统 KVM+QEMU | MicroVM (BoxLite/libkrun 等) | AI Sandbox 影响 | +|------|-------------|---------------------------|----------------| +| 设计目标 | 通用虚拟化 | 特定工作负载隔离 | 专注 = 极致优化 | +| 代码规模 | ~200 万行 C | ~5 万行 Rust | 攻击面缩小 97% | +| 启动时间 | 1-10 秒 | 125-200ms | 沙箱即开即用 | +| 内存开销 | 128-512 MB/VM | <5 MiB/VM | 单机万级并发 | +| 设备模型 | 数百设备 | 4-6 设备 | 安全面最小化 | +| 语言安全 | C (内存不安全) | Rust (内存安全) | 消除整类漏洞 | + +--- + +## 目录 + +1. [架构层次对比](#1-架构层次对比) +2. [设备模型: 核心分歧点](#2-设备模型-核心分歧点) +3. [启动流程对比](#3-启动流程对比) +4. [内存管理与密度](#4-内存管理与密度) +5. [安全架构对比](#5-安全架构对比) +6. [网络架构对比](#6-网络架构对比) +7. [快照与恢复机制](#7-快照与恢复机制) +8. [跨平台 Hypervisor 支持](#8-跨平台-hypervisor-支持) +9. [AI Agent Sandbox 场景优势映射](#9-ai-agent-sandbox-场景优势映射) +10. [BoxLite/libkrun 的独特技术优势](#10-boxlitelibkrun-的独特技术优势) +11. [总结: 为什么 AI Sandbox 需要 MicroVM](#11-总结-为什么-ai-sandbox-需要-microvm) + +--- + +## 1. 
架构层次对比 + +### 1.1 共同基础: 硬件虚拟化层 + +两种方案共享同一个底层: + +``` +┌─────────────────────────────────────────────┐ +│ Guest OS (Linux) │ +├─────────────────────────────────────────────┤ +│ VMM (用户态虚拟机监视器) │ ← 这一层是核心差异 +├─────────────────────────────────────────────┤ +│ KVM / Hypervisor.framework / WHPX │ ← 共享硬件虚拟化 +├─────────────────────────────────────────────┤ +│ Hardware (VT-x / ARM VHE) │ +└─────────────────────────────────────────────┘ +``` + +- **KVM** (Linux): 将 Linux 内核转化为 Type-1 hypervisor, 通过 ioctl 接口暴露 vCPU/内存管理 +- **Hypervisor.framework** (macOS): Apple 提供的用户态虚拟化框架 +- **WHPX** (Windows): Windows Hypervisor Platform API + +两种方案获得**完全相同的硬件级隔离强度** — CPU 特权级分离、内存地址空间隔离、中断虚拟化。差异完全在 VMM 用户态实现。 + +### 1.2 VMM 设计哲学分歧 + +``` +传统 QEMU: MicroVM (libkrun/Firecracker/CLH): +┌──────────────────────┐ ┌──────────────────────┐ +│ 通用型 VMM │ │ 专用型 VMM │ +│ │ │ │ +│ ┌────────────────┐ │ │ ┌────────────────┐ │ +│ │ 数百设备模拟 │ │ │ │ 4-6 virtio 设备 │ │ +│ │ IDE/SATA/NVMe │ │ │ │ block/net/vsock │ │ +│ │ VGA/QXL/virtio │ │ │ │ console/fs │ │ +│ │ USB/Audio/TPM │ │ │ └────────────────┘ │ +│ │ Floppy/Serial │ │ │ │ +│ │ PCI/PCIe/ACPI │ │ │ 无 PCI, 无 ACPI │ +│ └────────────────┘ │ │ 无 BIOS/UEFI 复杂链 │ +│ │ │ 仅 virtio-mmio 传输 │ +│ 支持 30+ CPU 架构 │ │ │ +│ 支持完整 BIOS/UEFI │ │ 直接内核加载 │ +│ 支持 PCI 设备直通 │ │ │ +│ 支持遗留系统 │ │ 仅支持现代 Linux │ +└──────────────────────┘ └──────────────────────┘ + ~200 万行 C 代码 ~5 万行 Rust 代码 + 通用、全能、庞大 专用、极简、高效 +``` + +**核心差异**: QEMU 问的是 "这台 VM 需要什么才能模拟一台完整计算机", MicroVM 问的是 "运行一个 Linux 进程最少需要什么"。 + +--- + +## 2. 
设备模型: 核心分歧点 + +设备模型是 MicroVM 与 QEMU 最根本的技术分歧, 也是所有性能和安全差异的源头。 + +### 2.1 QEMU 设备模型 + +QEMU 模拟完整的 PC 硬件平台: + +``` +QEMU 设备栈: +├── PCI/PCIe 总线 +│ ├── 存储控制器 +│ │ ├── IDE (ATA/ATAPI) +│ │ ├── AHCI (SATA) +│ │ ├── virtio-blk / virtio-scsi +│ │ ├── NVMe +│ │ └── USB Mass Storage +│ ├── 网络适配器 +│ │ ├── e1000 / e1000e +│ │ ├── rtl8139 +│ │ ├── virtio-net +│ │ └── vmxnet3 +│ ├── 显示适配器 +│ │ ├── VGA / Cirrus +│ │ ├── QXL (SPICE) +│ │ ├── virtio-gpu +│ │ └── bochs-display +│ ├── 音频设备 +│ │ ├── AC97 / Intel HDA +│ │ └── virtio-sound +│ ├── USB 控制器 +│ │ ├── UHCI / OHCI / EHCI / xHCI +│ │ └── USB 设备 (键盘/鼠标/存储/...) +│ └── 其他 PCI 设备 +│ ├── watchdog +│ ├── RNG (virtio-rng) +│ └── TPM +├── ISA 总线 +│ ├── i8259 PIC +│ ├── i8254 PIT +│ ├── MC146818 RTC +│ ├── 串口 (COM1-COM4) +│ ├── 并口 +│ └── PS/2 键盘/鼠标 +├── ACPI 子系统 +│ ├── 电源管理 +│ ├── 热插拔 +│ └── 设备枚举 +├── 固件 +│ ├── SeaBIOS +│ ├── OVMF (UEFI) +│ └── iPXE (网络启动) +└── 软盘控制器 (是的, 软盘) +``` + +每一个设备模拟都是一段复杂的 C 代码, 需要: +- 实现硬件寄存器的读写语义 +- 处理 DMA 传输 +- 管理中断路由 +- 维护设备状态机 + +### 2.2 MicroVM 设备模型 + +以 Firecracker 为例, 仅实现 **5 个设备**: + +``` +Firecracker 设备栈: +├── virtio-block (块存储) +├── virtio-net (网络) +├── virtio-vsock (host-guest 通信) +├── serial console (控制台 I/O) +└── i8042 keyboard (仅用于停止 VM) +``` + +libkrun (BoxLite 使用) 的设备集: + +``` +libkrun 设备栈: +├── virtio-block (块存储, 支持 raw/QCOW2/VMDK) +├── virtio-net (网络, 可选 passt/gvproxy 后端) +├── virtio-vsock (TSI 透明 socket 代理) +├── virtio-fs (目录共享, 宿主-客户文件系统映射) +├── serial console (控制台) +└── [可选] virtio-gpu / virtio-sound (feature flag 控制) +``` + +### 2.3 传输层差异 + +| 特性 | QEMU | MicroVM | +|------|------|---------| +| **设备发现** | PCI 总线枚举 + ACPI 表 | 内核命令行 (x86) / FDT (ARM) | +| **传输协议** | PCI (BAR 映射, MSI-X 中断) | virtio-mmio (内存映射 I/O) | +| **设备热插拔** | 支持 (PCI hotplug + ACPI) | 不支持 (启动时确定) | +| **初始化复杂度** | 高 (BIOS 枚举 → PCI 配置空间 → 驱动加载) | 低 (内核直接从命令行获取设备地址) | + +**virtio-mmio vs PCI 的性能影响**: +- PCI 需要配置空间读写、BAR 映射、MSI-X 中断路由 — 引入额外的 VMEXIT +- virtio-mmio 直接通过内存地址访问, 减少了 PCI 层的开销 +- 对于仅需 4-6 个设备的场景, PCI 
总线的通用性完全是多余的复杂度 + +### 2.4 对 AI Sandbox 的影响 + +| QEMU 设备模型的问题 | MicroVM 如何解决 | AI Sandbox 收益 | +|-------------------|----------------|----------------| +| 数百设备 = 数百潜在攻击面 | 仅 4-6 设备 = 攻击面缩小 98% | AI 生成的恶意代码利用面极小 | +| PCI 枚举增加启动时间 | 无 PCI, 直接 mmio | 沙箱秒开 | +| 每个设备占用内存 | 最小设备集 = 最小内存 | 单机更多并发沙箱 | +| 设备驱动 bug 导致 VM 逃逸 | 简单 virtio, 易审计 | 隔离可信度更高 | + +--- + +## 3. 启动流程对比 + +### 3.1 QEMU 传统启动流程 + +``` +QEMU 启动流程 (标准 PC): + +[0ms] QEMU 进程启动 + ├── 解析命令行参数 + ├── 初始化内存后端 + ├── 创建 KVM VM + └── 初始化设备模型 + +[~50ms] 固件加载 (SeaBIOS / UEFI) + ├── POST (Power-On Self-Test) + ├── 内存检测 + ├── PCI 总线扫描与配置 + ├── ACPI 表构建 + ├── 中断控制器初始化 (APIC/IOAPIC) + └── 引导设备选择 + +[~200ms] 引导加载器 (GRUB / syslinux) + ├── 读取配置文件 + ├── 加载内核映像 + ├── 加载 initramfs + └── 跳转到内核入口点 + +[~500ms] Linux 内核初始化 + ├── 解压并重定位 + ├── 建立页表 + ├── 初始化控制台 + ├── 检测 CPU 拓扑 + ├── PCI 驱动探测 (每个设备依次加载驱动) + ├── ACPI 子系统初始化 + ├── 磁盘/网络驱动加载 + └── 挂载根文件系统 + +[~800ms] Init 系统 (systemd / init) + ├── 解析 unit 文件 + ├── 启动系统服务 + ├── 网络配置 + └── 用户空间就绪 + +[~1200ms] ────── 应用就绪 ────── +``` + +**总耗时: 1-10 秒** (取决于配置复杂度) + +### 3.2 MicroVM 启动流程 + +``` +MicroVM 启动流程 (Firecracker / libkrun): + +[0ms] VMM 进程初始化 + ├── 解析配置 + ├── 分配 Guest 内存 (mmap) + ├── 创建 KVM/HVF VM + └── 注册 virtio-mmio 设备 (4-6 个) + +[~10ms] 直接内核加载 (无 BIOS/UEFI) + ├── 将内核映像复制到 Guest 内存 + ├── 设置引导参数 (boot_params) + ├── 将 initrd 复制到 Guest 内存 (可选) + ├── 设置内核命令行 (含设备地址) + └── 设置 vCPU 寄存器 → 内核入口点 + +[~20ms] 启动 vCPU 线程 + └── vcpu.run() → 进入 Guest 模式 + +[~30ms] Linux 内核初始化 (精简) + ├── 无 PCI 枚举 (没有 PCI 总线) + ├── 无 ACPI 解析 (没有 ACPI 表) + ├── 直接初始化 virtio-mmio 设备 + │ (内核命令行已提供设备地址) + ├── 挂载根文件系统 + └── 执行 /init + +[~125ms] 用户空间就绪 + ├── 执行目标工作负载 + └── 建立 vsock 通信 + +[~125ms] ────── 应用就绪 ────── +``` + +**总耗时: ~125ms** (Firecracker), ~150-200ms (E2B 生产环境) + +### 3.3 启动流程差异解析 + +| 阶段 | QEMU 耗时 | MicroVM 耗时 | 差异原因 | +|------|----------|-------------|---------| +| VMM 初始化 | ~50ms | ~10ms | 设备模型简单 10x | +| 固件/BIOS | ~150ms | **0ms** | 直接内核加载, 跳过 BIOS | +| 引导加载器 | ~300ms | **0ms** | 无 GRUB, 直接设置寄存器 | +| 
内核设备探测 | ~300ms | ~30ms | 无 PCI 枚举, 无 ACPI | +| Init 系统 | ~400ms | ~50ms | 最小 init, 直接 execvp | +| **总计** | **~1200ms** | **~125ms** | **~10x 差距** | + +关键优化: +1. **跳过固件层**: 直接将内核映像加载到 Guest 内存, 设置 CPU 寄存器指向入口点 +2. **跳过设备枚举**: 通过内核命令行或 FDT 告知设备地址, 无需运行时发现 +3. **最小化内核初始化路径**: 定制内核 (如 libkrunfw) 可裁剪不需要的子系统 + +### 3.4 对 AI Sandbox 的影响 + +| 场景 | QEMU 体验 | MicroVM 体验 | +|------|----------|-------------| +| 用户发送代码执行请求 | 等待 1-10 秒才能开始执行 | 125ms 后开始执行 (用户无感知延迟) | +| Agent 工具调用 (tool_use) | 每次调用产生秒级延迟 | 每次调用亚秒响应 | +| 批量 RL 训练 | 冷启动成为瓶颈 | 100K+ 并发沙箱可行 | +| 交互式编码助手 | "正在准备环境..." | 即时开始 | + +> "Conversational AI experiences depend on perceived responsiveness. Users tolerate 1-2 second delays for complex reasoning but not for sandbox initialization." +> +> — 行业观点: 沙箱冷启动需 <200ms 才能满足对话式 AI 体验 + +--- + +## 4. 内存管理与密度 + +### 4.1 内存开销对比 + +``` +QEMU 单 VM 内存构成: MicroVM 单 VM 内存构成: + +QEMU 进程本身: ~30-50 MB VMM 进程: ~1-3 MB + ├── 设备模型状态: ~10-20 MB ├── virtio 设备状态: ~0.1 MB + ├── PCI 配置空间: ~5 MB ├── mmio 映射: ~0.1 MB + ├── ACPI 表: ~2 MB └── vCPU 上下文: ~0.1 MB + ├── 固件映像: ~4 MB + ├── VGA/显示缓冲: ~8 MB Guest 内核: ~2-4 MB + └── 其他: ~10 MB (精简内核, 仅必要驱动) + +Guest 内核: ~30-80 MB ───────────────────────────── + (完整内核, 全量驱动) 总固定开销: ~3-5 MiB + +───────────────────────────── +总固定开销: ~128-512 MB +``` + +### 4.2 密度计算 + +以 256 GB 主机内存为例, 每个沙箱分配 512 MB Guest RAM: + +| 指标 | QEMU | MicroVM | 差距 | +|------|------|---------|------| +| 单 VM 固定开销 | ~200 MB | ~5 MB | 40x | +| 可分配给 Guest 的内存 | 256GB - (N × 200MB) | 256GB - (N × 5MB) | — | +| 最大 VM 数量 (512MB/VM) | ~365 | ~500 | 1.4x | +| 最大 VM 数量 (128MB/VM) | ~780 | ~1,900 | 2.4x | +| 最大 VM 数量 (64MB/VM) | ~970 | ~3,800 | 3.9x | + +**关键洞察**: Guest RAM 越小 (AI sandbox 通常不需要大内存), microVM 的密度优势越大。对于仅需执行代码片段的 AI agent, 64MB Guest RAM 通常足够, 此时 microVM 密度优势达 **~4x**。 + +### 4.3 大规模场景 + +Modal 客户实例: 单平台运行 100,000 并发沙箱用于 RL 训练。 +- 用 QEMU (200MB 开销/VM): 需要 ~20TB 仅固定开销 +- 用 MicroVM (5MB 开销/VM): 固定开销 ~500GB, 可控 + +Firecracker 测试: 150 microVM/秒/主机 的创建速率, 支持万级快速扩缩容。 + +--- + +## 5. 
安全架构对比 + +### 5.1 攻击面分析 + +``` +攻击面 = 恶意 Guest 可触达的 VMM 代码量 + +QEMU 攻击面: +┌────────────────────────────────────────────────┐ +│ ~200 万行 C 代码 │ +│ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ IDE 控制器│ │ VGA 模拟 │ │ USB 控制器│ ... │ +│ │ (CVE多发) │ │ (CVE多发) │ │ (CVE多发) │ │ +│ └──────────┘ └──────────┘ └──────────┘ │ +│ │ +│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ +│ │ SCSI 控制 │ │ 音频设备 │ │ 网络设备 │ ... │ +│ └──────────┘ └──────────┘ └──────────┘ │ +│ │ +│ 攻击路径: Guest → 设备寄存器写入 → 触发 │ +│ VMM 代码中的内存错误 → 宿主代码执行 │ +└────────────────────────────────────────────────┘ + +MicroVM 攻击面: +┌──────────────────────────────┐ +│ ~5 万行 Rust 代码 │ +│ │ +│ ┌──────────┐ ┌──────────┐ │ +│ │ virtio-blk│ │ virtio-net│ │ +│ └──────────┘ └──────────┘ │ +│ ┌──────────┐ ┌──────────┐ │ +│ │virtio-vsock│ │ serial │ │ +│ └──────────┘ └──────────┘ │ +│ │ +│ + Rust 内存安全保证 │ +│ + seccomp 系统调用白名单 │ +└──────────────────────────────┘ +``` + +### 5.2 QEMU 漏洞历史 + +QEMU 累计 CVE 数量: **数百个**, 其中不乏可由 Guest 触发的高危内存安全漏洞: + +| CVE | 组件 | 影响 | +|-----|------|------| +| CVE-2020-14364 | USB 模拟 (核心包处理) | 越界读写, 可导致 DoS 或宿主代码执行 | +| CVE-2021-3748 | virtio-net | Use-after-free, 可导致 DoS 或宿主代码执行 | +| CVE-2023-3180 | virtio-crypto | 堆溢出 | +| CVE-2020-25084 | USB xHCI | Use-after-free | +| CVE-2020-25624 | USB OHCI | 越界读取 | +| CVE-2021-20203 | vmxnet3 网络 | 整数溢出, 可致 QEMU 进程崩溃 | + +**根因**: QEMU 的设备模拟代码 (C 语言) 需要精确实现硬件寄存器语义, 包括 DMA 传输和中断处理, 这些是内存安全 bug 的高发区。 + +### 5.3 MicroVM 安全优势 + +**1. 语言级安全 (Rust)**: +- 安全 Rust 通过编译期所有权检查和运行时边界检查消除 buffer overflow, use-after-free, double-free, data race +- 这些正是 QEMU CVE 的主要类型 +- Firecracker 研究表明: "Rust 内存安全未对性能产生负面影响" + +**2. 最小设备集**: +- Firecracker: 仅 5 个设备, 其中 3 个基于 virtio (规范明确, 实现简单), 其余为极简的串口和 i8042 +- 对比 QEMU 的 USB/IDE/VGA 等遗留设备: 规范复杂, 实现中陷阱多 + +**3. 
多层防御 (Firecracker Jailer)**: + +``` +Firecracker 安全分层: + +Layer 1: KVM 硬件隔离 + └── CPU 特权级, EPT/NPT 内存隔离 + +Layer 2: VMM 最小化 + Rust + └── 5 万行 Rust 代码, 最小攻击面 + +Layer 3: Jailer (chroot + namespace + seccomp) + ├── chroot: 仅包含 Firecracker 二进制 + 必要文件 + ├── pid namespace: 进程隔离 + ├── net namespace: 网络隔离 + ├── 降权: 非 root 运行 + └── seccomp-bpf: 白名单 24 个系统调用 + 30 个 ioctl +``` + +**4. BoxLite 的安全分层**: + +``` +BoxLite 安全分层: + +Layer 1: KVM/HVF/WHPX 硬件隔离 + └── 与 QEMU 相同强度的 CPU/内存隔离 + +Layer 2: libkrun VMM (Rust, ~万行代码) + └── 最小设备集, 内存安全 + +Layer 3: boxlite-shim 进程隔离 + └── 每个 Box 独立进程 (libkrun process takeover) + +Layer 4: Jailer (seccomp / sandbox-exec / namespaces) + └── OS 级沙箱包裹 shim 进程, 纵深防御 +``` + +### 5.4 对 AI Sandbox 的安全意义 + +| 威胁 | QEMU 风险 | MicroVM 风险 | 说明 | +|------|----------|-------------|------| +| AI 生成的恶意代码利用设备漏洞逃逸 | **高** (数百设备, 历史 CVE) | **极低** (4-6 virtio 设备, Rust) | Agent 可能生成针对性的漏洞利用代码 | +| 内存安全漏洞 (buffer overflow) | **高** (C 语言, 复杂设备模拟) | **极低** (Rust 编译时保证) | 消除整类漏洞 | +| 系统调用逃逸 | 中 (可配 seccomp) | **低** (默认 seccomp + 24 调用白名单) | MicroVM 默认最小权限 | +| 跨 VM 侧信道攻击 | 中 | 中 | 两者类似 (共享 KVM) | + +--- + +## 6. 
网络架构对比 + +### 6.1 QEMU 网络 + +``` +QEMU 网络栈: + +Guest 应用 + ↓ +Guest 内核网络栈 + ↓ +虚拟 NIC 驱动 (e1000e / virtio-net) + ↓ +QEMU 设备模拟 (PCI BAR 映射, 中断注入) + ↓ +后端选择: + ├── TAP 设备 → Linux bridge/OVS → 物理网络 + ├── user mode (-net user) → SLIRP (用户态 NAT) + ├── vhost-net → 内核态 virtio 后端 + └── macvtap → 直接桥接 +``` + +特点: +- 完整的 TCP/IP 栈在 Guest 内核中运行 +- 需要配置虚拟网桥、TAP 设备、iptables 规则 +- 支持任意网络拓扑 +- 配置复杂度高 + +### 6.2 MicroVM 网络 + +**方案 A: virtio-net (Firecracker / Cloud Hypervisor)** + +``` +Guest 应用 → Guest 内核 TCP/IP → virtio-net → 后端: + ├── TAP + tc/iptables (Firecracker) + └── vhost-net / vhost-user (Cloud Hypervisor) +``` + +**方案 B: TSI — Transparent Socket Impersonation (libkrun/BoxLite 独特方案)** + +``` +TSI 架构 (libkrun): + +Guest 应用 + ↓ socket() / connect() / bind() / listen() +Guest 内核 (libkrunfw 定制内核) + ↓ 拦截 AF_INET/AF_INET6/AF_UNIX socket 系统调用 +virtio-vsock 通道 + ↓ 转发到 VMM +libkrun VMM (宿主进程) + ↓ 代理执行真实 socket 操作 +宿主网络栈 + ↓ +物理/虚拟网络 +``` + +### 6.3 TSI 的技术创新 + +TSI 是 libkrun (BoxLite 底层) 的独特技术, 在 AI Sandbox 场景下有显著优势: + +| 特性 | 传统 virtio-net | TSI (libkrun) | +|------|----------------|---------------| +| Guest 内需要虚拟 NIC | 是 | **否** | +| Guest 内需要完整网络栈配置 | 是 (IP 地址, 路由, DNS) | **否** (透明代理) | +| 出站连接 | 通过虚拟 NIC + NAT/桥接 | **直接代理** (使用宿主网络身份) | +| 入站连接 | 需要端口映射/桥接 | **支持** (VMM 代理 bind/listen) | +| Unix Domain Socket | 不支持 (跨 VM 边界) | **支持** (VMM 代理) | +| 网络配置复杂度 | 高 (TAP/bridge/iptables) | **零** (开箱即用) | +| 适用场景 | 需要完整网络栈的工作负载 | 进程级隔离, AI sandbox | + +**对 AI Sandbox 的意义**: +- AI agent 的代码通常需要 `pip install`, `npm install`, HTTP API 调用 — TSI 让这些操作无需任何网络配置即可工作 +- 无需配置虚拟 NIC, 无需 TAP/bridge 权限 — 支持非 root 运行 +- Unix socket 代理能力使 gRPC/IPC 通信更自然 + +--- + +## 7. 
快照与恢复机制 + +### 7.1 QEMU 快照 + +``` +QEMU 快照流程: + +保存: + ├── 暂停所有 vCPU + ├── 序列化 CPU 状态 (寄存器、MSR、FPU) + ├── 序列化所有设备状态 (数百设备各自的状态机) + ├── 保存 Guest 内存 (全量, 数百 MB-数 GB) + └── 写入 QCOW2 内部快照或外部文件 + +恢复: + ├── 加载 CPU 状态 + ├── 反序列化所有设备状态 + ├── 加载 Guest 内存 (全量) + └── 恢复 vCPU 执行 + +耗时: 秒级到分钟级 (取决于内存大小) +``` + +问题: +- 设备状态序列化复杂 (数百设备, 每个有独立状态机) +- 全量内存保存/恢复, I/O 密集 +- 快照文件大 (= Guest RAM 大小) +- 跨版本兼容性脆弱 (设备状态格式变化) + +### 7.2 MicroVM 快照 + +``` +Firecracker 快照流程: + +保存: + ├── 暂停所有 vCPU + ├── 序列化 CPU + 4-6 个 virtio 设备状态 → vmstate 文件 + ├── 保存 Guest 内存: + │ ├── 全量快照: 一次性写入 + │ └── 增量快照: 仅脏页 (通过 KVM dirty page tracking) + └── 完成 (vmstate ~KB, memory ~MB) + +恢复: + ├── 新建 Firecracker 进程 + ├── 加载 vmstate (反序列化 4-6 个设备, 微秒级) + ├── MAP_PRIVATE 映射内存文件 (不拷贝!) + │ └── 按需加载 (lazy page fault) + │ └── 写入时复制 (copy-on-write) + └── 恢复 vCPU 执行 + +恢复耗时: p50 = 4.1ms, p99 = 12ms +``` + +### 7.3 技术差异对比 + +| 特性 | QEMU 快照 | MicroVM 快照 | +|------|----------|-------------| +| 设备状态序列化 | 数百设备, 复杂且脆弱 | 4-6 设备, 简单可靠 | +| 内存保存 | 全量 (必须完整拷贝) | 支持增量 (仅脏页) | +| 内存恢复 | 全量加载到内存 | MAP_PRIVATE + lazy loading | +| 恢复延迟 | 秒级 ~ 分钟级 | **毫秒级** (p50: 4.1ms) | +| 内存写入 | 直接修改恢复的内存 | Copy-on-Write (不污染快照) | +| 多实例恢复 | 每个实例需独立加载全部内存 | **共享底层快照文件** (CoW 分离) | + +### 7.4 对 AI Sandbox 的影响 + +**"预热快照" 模式** — MicroVM 的快照能力使以下工作流成为可能: + +``` +AI Sandbox 预热快照工作流: + +1. 预构建阶段 (离线): + 创建 microVM → 安装 Python/Node/系统依赖 → 创建快照 + │ +2. 运行时 (在线): ▼ + 用户请求 → 从快照恢复 (4ms) → 执行用户代码 → 返回结果 → 销毁 + + 对比传统方式: + 用户请求 → 创建 VM (秒级) → 安装依赖 (十秒级) → 执行 → 返回 +``` + +- **冷启动 → 热启动**: 从秒级降到毫秒级 +- **克隆成本为零**: CoW 映射, 1000 个快照实例共享同一内存文件 +- **Blaxel 标杆**: 25ms 从待机恢复 (含完整文件系统 + 内存状态) +- **Fly.io Sprites**: 300ms checkpoint, 支持任意时间点回滚 + +--- + +## 8. 
跨平台 Hypervisor 支持 + +### 8.1 QEMU 的跨平台方式 + +``` +QEMU 跨平台策略: + +Linux: QEMU + KVM → 硬件加速虚拟化 +macOS: QEMU + HVF → 通过翻译层适配 (有限支持) + QEMU + TCG → 纯软件模拟 (极慢, 无实用价值) +Windows: QEMU + WHPX → 通过翻译层适配 (实验性) + QEMU + TCG → 纯软件模拟 + +问题: +- 非 Linux 平台为"二等公民" +- HVF/WHPX 后端成熟度远低于 KVM +- 设备模型相同 (不针对平台优化) +- 代码路径复杂 (条件编译 + 抽象层) +``` + +### 8.2 MicroVM 的跨平台方式 + +**libkrun/BoxLite 方案**: + +``` +libkrun 跨平台策略: + +Linux x86_64/aarch64/riscv64: + └── KVM (kvm-ioctls crate, 原生支持) + +macOS aarch64: + └── Hypervisor.framework (原生 Swift/C 绑定, src/hvf/) + +Windows x86_64 (BoxLite WHPX 扩展): + └── Windows Hypervisor Platform API + +统一抽象: + trait Vm { ... } + trait Vcpu { ... } + → KVM, HVF, WHPX 各自实现同一 trait + → VMM 层完全透明, 不感知具体 hypervisor +``` + +**Docker Sandboxes 方案 (印证同一趋势)**: + +``` +Docker 构建了全新 VMM: + macOS: Hypervisor.framework + Windows: Windows Hypervisor Platform + Linux: KVM + +"Zero translation layers = Zero abstraction tax" + — Docker 工程团队 +``` + +### 8.3 跨平台在 AI Sandbox 中的价值 + +| 场景 | 仅 Linux (Firecracker/E2B) | 跨平台 (BoxLite/Docker) | +|------|--------------------------|----------------------| +| 云端部署 | ✅ 覆盖 | ✅ 覆盖 | +| macOS 开发者本地测试 | ❌ 需要 Linux VM | ✅ 原生 HVF | +| Windows 开发者本地测试 | ❌ 需要 WSL2 | ✅ 原生 WHPX | +| 边缘设备 (ARM Mac) | ❌ | ✅ | +| CI/CD (GitHub Actions macOS runner) | ❌ | ✅ | +| 嵌入式 SDK (桌面应用集成) | ❌ | ✅ | + +> Docker 团队的观点: "Coding agents run on developer laptops, not in the cloud — requiring cross-platform support." +> +> 这同样适用于 BoxLite: AI sandbox 不仅是云服务, 也是开发者工具。 + +--- + +## 9. 
AI Agent Sandbox 场景优势映射 + +### 9.1 场景一: 交互式 AI Coding Agent + +用户与 Claude Code / Cursor / Windsurf 等工具交互, agent 需要实时执行代码。 + +``` +用户: "帮我写一个排序算法并测试" + +Agent 工作流: + ├── 生成代码 (~1s, LLM 推理) + ├── 创建/恢复沙箱 → 执行代码 → 返回结果 + │ ├── QEMU: +1-10s (冷启动) 或 +数秒 (快照恢复) + │ └── MicroVM: +125ms (冷启动) 或 +4ms (快照恢复) + └── 展示结果给用户 + +端到端延迟: + QEMU: 1s(LLM) + 5s(VM) = ~6s ← 沙箱成为瓶颈 + MicroVM: 1s(LLM) + 0.13s(VM) = ~1.1s ← LLM 是唯一瓶颈 +``` + +**MicroVM 优势**: 沙箱延迟从用户可感知 (秒级) 降到不可感知 (<200ms)。 + +### 9.2 场景二: 大规模 RL/Eval 训练 + +强化学习训练或 Agent 评估需要大量并行沙箱。 + +| 指标 | QEMU | MicroVM | +|------|------|---------| +| 单主机最大并发 (64MB/VM) | ~970 | ~3,800 | +| 创建速率 | ~10 VM/s | **150 VM/s** | +| 冷启动延迟 | 1-10s | 125ms | +| 快照克隆 | 每实例全量内存拷贝 | **CoW, 零拷贝** | +| 100K 并发的基础设施成本 | 极高 (内存浪费) | 可控 (高密度) | + +**MicroVM 优势**: 高密度 + 快速创建 + CoW 快照 = 万级并发经济可行。 + +### 9.3 场景三: 多租户 SaaS 沙箱 + +每个 API 请求/用户会话需要独立隔离环境。 + +``` +多租户请求隔离: + +QEMU 方案: + 请求 → VM 池 (预热, 固定数量) → 复用 VM → 返回 + 问题: 预热 = 资源浪费; 复用 = 残留数据泄漏风险 + +MicroVM 方案: + 请求 → 从快照创建新 VM (4ms) → 执行 → 销毁 → 返回 + 优势: 每请求独立 VM, 零残留, 按需伸缩 +``` + +**MicroVM 优势**: 快照恢复速度使"每请求独立 VM"成为可行方案, 不再需要 VM 池复用。 + +### 9.4 场景四: 嵌入式 AI SDK + +将沙箱能力作为库嵌入到应用中 (BoxLite 独特场景)。 + +``` +嵌入式场景: + +QEMU 嵌入问题: + ├── ~200 万行代码, 编译产物庞大 + ├── 复杂的依赖链 (glib, pixman, SDL, ...) 
+ ├── 需要 root 权限配置网络 (TAP/bridge) + ├── 进程模型复杂 (多进程/多线程混合) + └── 不适合作为库嵌入 + +BoxLite/libkrun 嵌入: + ├── 动态库 (libkrun.so / libkrun.dylib) + ├── 简单 C API (krun_create_ctx, krun_start_enter) + ├── TSI 网络 (无需 root, 无需 TAP) + ├── 嵌入宿主进程地址空间 + └── 应用程序直接 dlopen 即可获得 VM 隔离 +``` + +**MicroVM (libkrun) 优势**: 这是 QEMU 根本无法实现的场景 — 作为库嵌入应用, 无需 daemon、无需 root、无需复杂部署。 + +### 9.5 优势总结矩阵 + +| AI Sandbox 需求 | 传统 KVM+QEMU | MicroVM 方案 | 优势倍数 | +|----------------|-------------|------------|---------| +| 冷启动延迟 | 1-10s | 125ms | **8-80x** | +| 快照恢复延迟 | 秒级 | 4ms (p50) | **250x+** | +| 内存开销/VM | 128-512 MB | <5 MiB | **25-100x** | +| VMM 代码量 (攻击面) | ~200 万行 C | ~5 万行 Rust | **40x 更小** | +| 模拟设备数 (攻击面) | 数百 | 4-6 | **50x+ 更小** | +| 创建速率 | ~10 VM/s/host | 150 VM/s/host | **15x** | +| 嵌入式部署 | 不可行 | 原生支持 | **∞** | +| 非 root 运行 | 需要 root (网络) | 支持 (TSI) | 质的差异 | +| 跨平台原生支持 | Linux 优先 | Linux + macOS + Windows | 覆盖面 3x | + +--- + +## 10. BoxLite/libkrun 的独特技术优势 + +相对于其他 microVM 方案 (Firecracker, Cloud Hypervisor), BoxLite 基于的 libkrun 有以下独特之处: + +### 10.1 vs Firecracker + +| 维度 | Firecracker | libkrun (BoxLite) | +|------|------------|-------------------| +| 运行形态 | 独立进程 + REST API 控制 | **动态库** (嵌入宿主进程) | +| 网络模型 | virtio-net + TAP (需 root) | **TSI** (无需 root, 无需虚拟 NIC) | +| 跨平台 | **仅 Linux** | Linux + macOS (HVF) + Windows (WHPX) | +| 文件共享 | virtio-block (块设备) | **virtio-fs** (目录级共享) | +| TEE 支持 | 无 | SEV-SNP, TDX, AWS Nitro | +| GPU 支持 | 无 (无 PCI) | **可选 virtio-gpu** (feature flag) | +| 目标场景 | 云端 serverless (AWS Lambda) | **嵌入式进程隔离** | + +### 10.2 vs Cloud Hypervisor + +| 维度 | Cloud Hypervisor | libkrun (BoxLite) | +|------|-----------------|-------------------| +| 运行形态 | 独立进程 + API | **动态库** | +| 设备传输 | PCI + MMIO | **仅 MMIO** (更简单) | +| 热插拔 | 支持 CPU/内存/设备热插拔 | 不支持 (不需要) | +| 网络 | virtio-net | **TSI** (透明 socket 代理) | +| 复杂度 | 中等 (支持更多场景) | **最低** (专注进程隔离) | +| macOS 支持 | 无 | **原生 HVF** | + +### 10.3 BoxLite 的独特技术组合 + +``` +BoxLite 技术栈独特性: + + libkrun VMM (嵌入式, Rust) + │ + 
┌────────────────┼────────────────┐ + │ │ │ + KVM (Linux) HVF (macOS) WHPX (Windows) + │ + TSI 网络 + (无需 root) + │ + virtio-fs + (目录共享) + │ + OCI 容器运行时 + (libcontainer) + │ + gRPC over vsock + (高性能 host-guest 通信) + │ + ┌─────────┴─────────┐ + │ │ + 嵌入式 SDK 云端服务 + (本地, 无网络) (分布式, 弹性) +``` + +这个技术组合在 AI Sandbox 市场中独一无二: +- **E2B** 用 Firecracker → 仅 Linux, 仅云端, 仅远程 +- **Modal** 用 gVisor → 非 VM 级隔离 +- **Docker Sandbox** 自研 VMM → 类似路线, 但专注本地开发 +- **BoxLite** 用 libkrun → 嵌入式 + 跨平台 + TSI + VM 级隔离 + 可云端化 + +--- + +## 11. 总结: 为什么 AI Sandbox 需要 MicroVM + +### 11.1 MicroVM 不是"精简版 QEMU" + +MicroVM 与 QEMU 的关系, 类似于 SQLite 与 Oracle Database 的关系 — 不是同一事物的大小版本, 而是面向不同约束条件的不同设计: + +| 类比 | 通用方案 | 专用方案 | +|------|---------|---------| +| 数据库 | Oracle / PostgreSQL | SQLite | +| 虚拟化 | QEMU | Firecracker / libkrun | +| 设计目标 | 功能完备, 覆盖所有场景 | 极致精简, 最优化特定场景 | +| 取舍 | 牺牲效率换通用性 | 牺牲通用性换效率 | + +### 11.2 AI Sandbox 场景与 MicroVM 设计的天然契合 + +``` +AI Sandbox 的核心约束: + +1. 安全至上: 执行不可信代码, 必须硬件级隔离 → MicroVM ✓ (KVM/HVF) +2. 极速启动: 用户不等待, <200ms 可感知 → MicroVM ✓ (125ms) +3. 高密度: 万级并发, 成本可控 → MicroVM ✓ (<5MiB/VM) +4. 快速销毁: 用完即弃, 零残留 → MicroVM ✓ (进程退出) +5. 
简单运维: 无需复杂网络/存储配置 → MicroVM ✓ (TSI/mmio) + +AI Sandbox 不需要的: + +✗ 运行 Windows XP → QEMU 的 BIOS/UEFI/PCI 是多余的 +✗ 连接 USB 设备 → QEMU 的 xHCI/EHCI 是多余的 +✗ 显示图形界面 → QEMU 的 VGA/QXL 是多余的 +✗ 播放音频 → QEMU 的 HDA/AC97 是多余的 +✗ 使用软盘 → QEMU 的 FDC 是多余的 (显然) +``` + +### 11.3 技术差异 → 产品优势映射 + +``` +技术差异 产品优势 商业价值 +─────────── ──────── ──────── +125ms 冷启动 → 沙箱即开即用 → 用户体验领先 +4ms 快照恢复 → 预热环境零等待 → 开发者满意度 +<5MiB 内存开销 → 万级并发密度 → 基础设施成本降低 +5 万行 Rust 代码 → 最小攻击面 → 安全合规 (SOC2) +4-6 virtio 设备 → 漏洞风险极低 → 企业客户信任 +TSI 无需 root → 嵌入式/边缘部署 → 新市场 (SQLite 模式) +跨平台 KVM/HVF/WHPX → 全平台覆盖 → 开发者覆盖面最大 +CoW 快照克隆 → 零成本实例复制 → RL 训练成本降低 +``` + +### 11.4 一句话总结 + +> **QEMU 是为"模拟一台完整计算机"而设计的; MicroVM 是为"安全地运行一段代码"而设计的。AI Agent Sandbox 需要的恰恰是后者。** + +--- + +## 附录: 信息来源 + +- [libkrun Architecture Overview (DeepWiki)](https://deepwiki.com/containers/libkrun/3-architecture-overview) +- [libkrun GitHub](https://github.com/containers/libkrun) +- [Firecracker vs QEMU (E2B)](https://e2b.dev/blog/firecracker-vs-qemu) +- [Firecracker vs QEMU (Northflank)](https://northflank.com/blog/firecracker-vs-qemu) +- [Firecracker Official](https://firecracker-microvm.github.io/) +- [Firecracker: Lightweight Virtualization for Serverless Computing (NSDI'20)](https://www.usenix.org/system/files/nsdi20-paper-agache.pdf) +- [Firecracker Snapshot System](https://github.com/firecracker-microvm/firecracker/blob/main/docs/snapshotting/snapshot-support.md) +- [QEMU microvm Machine Type](https://www.qemu.org/docs/master/system/i386/microvm.html) +- [Cloud Hypervisor GitHub](https://github.com/cloud-hypervisor/cloud-hypervisor) +- [Cloud Hypervisor Guide (Northflank)](https://northflank.com/blog/guide-to-cloud-hypervisor) +- [Why MicroVMs: Architecture Behind Docker Sandboxes (Docker)](https://www.docker.com/blog/why-microvms-the-architecture-behind-docker-sandboxes/) +- [The State of MicroVM Isolation in 2026](https://emirb.github.io/blog/microvm-2026/) +- [How to Sandbox AI Agents in 2026 
(Northflank)](https://northflank.com/blog/how-to-sandbox-ai-agents) +- [Comparing Sandboxing Approaches for AI Agents (Docker)](https://www.docker.com/blog/comparing-sandboxing-approaches-ai-agents/) +- [QEMU Attack Surface and Security Internals (HITB)](https://gsec.hitb.org/sg2017/sessions/qemu-attack-surface-and-security-internals/) +- [QEMU CVE List](https://www.cvedetails.com/vulnerability-list/vendor_id-7506/Qemu.html) +- [Expeditious High-Concurrency MicroVM SnapStart (USENIX ATC'24)](https://www.usenix.org/system/files/atc24-pang.pdf) +- [QEMU vs Firecracker: Why We Replaced (Hocus)](https://hocus.dev/blog/qemu-vs-firecracker/) +- [Performance Analysis of KVM-based microVMs (Firebench)](https://dreadl0ck.net/papers/Firebench.pdf) +- [Differences Between QEMU and Cloud Hypervisor (Depot)](https://depot.dev/blog/differences-between-qemu-and-cloud-hypervisor) +- [AI Agent Sandbox: How to Safely Run Autonomous Agents (Firecrawl)](https://www.firecrawl.dev/blog/ai-agent-sandbox) diff --git a/docs/[*]virtio-protocol-guide.md b/docs/[*]virtio-protocol-guide.md new file mode 100644 index 000000000..2ce5d0076 --- /dev/null +++ b/docs/[*]virtio-protocol-guide.md @@ -0,0 +1,696 @@ +# Virtio 协议技术介绍 + +> 目标: 从协议规范、数据结构、数据流到 libkrun 实现, 全面介绍 virtio 半虚拟化 I/O 框架 + +--- + +## 核心结论 + +Virtio 是 OASIS 标准化的**半虚拟化 (paravirtualization) I/O 框架**, 由 Rusty Russell (Linux 内核开发者) 于 2008 年提出。当前最新规范为 [VIRTIO v1.3](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html)。 + +核心思想: Guest OS 内核**知道自己运行在虚拟环境中**, 不再假装访问真实硬件, 而是通过优化的共享内存协议与 VMM 直接通信。 + +**Virtio 让 VMM 从"模拟硬件"变成"共享内存通信", 这正是 microVM 能做到 125ms 启动、5MiB 开销的技术基础。** + +--- + +## 目录 + +1. [架构三层模型](#1-架构三层模型) +2. [设备组成四要素](#2-设备组成四要素) +3. [Device Status — 设备生命周期状态机](#3-device-status--设备生命周期状态机) +4. [Feature Bits — 特性协商](#4-feature-bits--特性协商) +5. [Virtqueue — 核心数据传输机制](#5-virtqueue--核心数据传输机制) +6. [数据流: 一个完整的 I/O 请求](#6-数据流-一个完整的-io-请求) +7. [Transport 层: MMIO vs PCI](#7-transport-层-mmio-vs-pci) +8. 
[通知优化](#8-通知优化) +9. [Packed Virtqueue (v1.1+)](#9-packed-virtqueue-v11) +10. [为什么 Virtio 适合 MicroVM / AI Sandbox](#10-为什么-virtio-适合-microvm--ai-sandbox) + +--- + +## 1. 架构三层模型 + +``` +┌──────────────────────────────────────────────────┐ +│ Guest OS │ +│ ┌──────────────────────────────────────────┐ │ +│ │ Virtio Driver (前端, FE) │ │ +│ │ (Linux: drivers/virtio/virtio_*.c) │ │ +│ └──────────────┬───────────────────────────┘ │ +│ │ virtqueue (共享内存) │ +├─────────────────┼────────────────────────────────┤ +│ Transport 层 │ PCI / MMIO / Channel I/O │ +├─────────────────┼────────────────────────────────┤ +│ ▼ │ +│ ┌──────────────────────────────────────────┐ │ +│ │ Virtio Device (后端, BE) │ │ +│ │ (VMM 侧: libkrun/Firecracker/QEMU) │ │ +│ └──────────────────────────────────────────┘ │ +│ Host / VMM │ +└──────────────────────────────────────────────────┘ +``` + +三层职责: + +| 层 | 职责 | 例子 | +|---|------|------| +| **Device (后端)** | VMM 中的设备实现, 处理 I/O 请求 | libkrun 的 `virtio/block/device.rs` | +| **Driver (前端)** | Guest 内核中的驱动, 提交 I/O 请求 | Linux `virtio_blk.c` | +| **Transport** | 连接前后端的通信机制 | virtio-mmio, virtio-pci | + +设计四原则 (来自规范): + +- **Straightforward**: 使用标准中断和 DMA 机制, 设备驱动作者无需学习新范式 +- **Efficient**: 描述符环形缓冲区经过优化, 避免 cache line 争用 +- **Standard**: 跨多种传输类型 (PCI, MMIO, Channel I/O) 通用 +- **Extensible**: Feature bits 机制实现前后向兼容 + +--- + +## 2. 设备组成四要素 + +每个 virtio 设备由 4 部分组成: + +``` +┌─────────────────────────────────────────────┐ +│ Virtio Device │ +│ │ +│ ① Device Status Field (设备状态字段) │ +│ 控制设备初始化生命周期 │ +│ │ +│ ② Feature Bits (特性协商位) │ +│ 前后端能力协商 │ +│ │ +│ ③ Configuration Space (设备配置空间) │ +│ 设备特定参数 (如 block 的磁盘大小) │ +│ │ +│ ④ Virtqueue(s) (数据传输队列) │ +│ 实际 I/O 数据传输通道 │ +└─────────────────────────────────────────────┘ +``` + +--- + +## 3. 
Device Status — 设备生命周期状态机 + +设备通过状态字段驱动初始化。状态位**只能递增设置, 不能清除** (除非写 0 重置): + +``` + 写 0 重置 + ┌─────────────────────────────────┐ + ▼ │ + ┌─────────┐ │ + │ INIT │ status = 0 │ + │ (0x0) │ │ + └────┬────┘ │ + │ Driver 发现设备 │ + ▼ │ + ┌──────────────┐ │ + │ ACKNOWLEDGE │ "我认出这是 virtio" │ + │ (0x01) │ │ + └────┬─────────┘ │ + │ Driver 可以驱动此设备 │ + ▼ │ + ┌──────────────┐ │ + │ DRIVER │ "我有这个设备的驱动" │ + │ (0x02) │ │ + └────┬─────────┘ │ + │ 特性协商完成 │ + ▼ │ + ┌──────────────┐ │ + │ FEATURES_OK │ "特性协商达成一致" │ + │ (0x08) │ │ + └────┬─────────┘ │ + │ 队列配置完成, 准备就绪 │ + ▼ │ + ┌──────────────┐ │ + │ DRIVER_OK │ "设备已激活, 可以工作" │ + │ (0x04) │ │ + └────┬─────────┘ │ + │ │ + ▼ │ + ┌──────────────────┐ │ + │ DEVICE_NEEDS_ │ 设备遇到错误 │ + │ RESET (0x40) │ 需要恢复 │ + └────┬─────────────┘ │ + │ 出错 │ + ▼ │ + ┌──────────────┐ │ + │ FAILED │ status |= 0x80 │ + │ (0x80) │───────────────────────┘ + └──────────────┘ +``` + +### 完整初始化序列 (规范 3.1.1) + +1. 重置设备 (写 status = 0) +2. 设置 `ACKNOWLEDGE` — 识别出 virtio 设备 +3. 设置 `DRIVER` — 知道如何驱动此设备 +4. 读取 device feature bits, 写入 driver 理解的子集 +5. 设置 `FEATURES_OK` — 特性协商完成 +6. **重新读取** status 确认 `FEATURES_OK` 仍然设置 (设备可能拒绝) +7. 配置 virtqueue (设置描述符表/可用环/已用环地址) +8. 
设置 `DRIVER_OK`: 设备激活, 可以开始 I/O + +### libkrun 中的实现 + +来自 `src/devices/src/virtio/mmio.rs`: + +```rust +fn set_device_status(&mut self, status: u32) { + match !self.device_status & status { + ACKNOWLEDGE if self.device_status == INIT => { + self.device_status = status; + } + DRIVER if self.device_status == ACKNOWLEDGE => { + self.device_status = status; + } + FEATURES_OK if self.device_status == (ACKNOWLEDGE | DRIVER) => { + self.device_status = status; + } + DRIVER_OK if self.device_status == (ACKNOWLEDGE|DRIVER|FEATURES_OK) => { + self.device_status = status; + if !device_activated { + self.activate(); // ← 激活设备, 将 queue 所有权转移给 device + } + } + _ if status == 0 => { + self.reset(); // ← 写 0 = 重置设备 + } + _ => { + warn!("invalid virtio driver status transition"); + } + } +} +``` + +关键规则: **设备在 DRIVER_OK 设置前, 不得消费缓冲区或发送中断。** + +--- + +## 4. Feature Bits — 特性协商 + +协商机制实现前后向兼容: + +``` +Device 广播: "我支持 features = 0b1111_0011" + │ +Driver 回应: "我理解 features = 0b0011_0001" (子集) + │ + ──── 取交集 ──── + │ + 生效特性 = 0b0011_0001 +``` + +### Feature Bit 分配 + +| 范围 | 用途 | 例子 | +|------|------|------| +| **0-23** | 设备特定特性 | VIRTIO_BLK_F_FLUSH, VIRTIO_NET_F_CSUM | +| **24-41** | 队列和协商扩展 | VIRTIO_RING_F_EVENT_IDX, VIRTIO_F_VERSION_1 | +| **42-49** | 保留/未来扩展 | - | +| **50-127** | 设备特定特性 (扩展) | - | +| **128+** | 未来扩展 | - | + +### 协商规则 + +- Driver **不得**接受 Device 未声明的特性 +- Driver **不得**接受依赖未被接受特性的特性 +- 重新协商的唯一方式是**重置设备** +- 如果设备曾成功协商某特性集, 重置后**不应拒绝**相同特性集的再次协商 + +### libkrun 中的实现 + +来自 `src/devices/src/virtio/device.rs`: + +```rust +fn ack_features_by_page(&mut self, page: u32, value: u32) { + let mut v = match page { + 0 => u64::from(value), + 1 => u64::from(value) << 32, + _ => { warn!("Cannot acknowledge unknown features page"); 0u64 } + }; + + // 检查 Guest 是否在确认我们未声明的特性 + let unrequested_features = v & !self.avail_features(); + if unrequested_features != 0 { + warn!("Received acknowledge request for unknown feature"); + v &= !unrequested_features; // 忽略未声明的特性 + } + 
self.set_acked_features(self.acked_features() | v); +} +``` + +--- + +## 5. Virtqueue: the core data transfer mechanism + +The virtqueue is the heart of virtio: a ring buffer in **shared guest physical memory** through which the driver and the device exchange I/O requests. + +### 5.1 Split Virtqueue layout (v1.0 format) + +``` +Three regions in guest physical memory: + +┌──────────────────────────────────────────────────────┐ +│ Descriptor Table │ +│ (address, length, and flags of every buffer) │ +│ │ +│ ┌──────┬──────┬───────┬──────┐ │ +│ │ desc │ desc │ desc │ ... │ 16 bytes per entry │ +│ │ #0 │ #1 │ #2 │ │ QueueSize entries │ +│ └──────┴──────┴───────┴──────┘ │ +│ Alignment: 16 bytes │ +│ Size: 16 × QueueSize bytes │ +├──────────────────────────────────────────────────────┤ +│ Available Ring │ +│ (Driver tells Device: "these buffers are ready") │ +│ │ +│ ┌───────┬─────┬──────────────────┬────────────┐ │ +│ │ flags │ idx │ ring[QueueSize] │ used_event │ │ +│ └───────┴─────┴──────────────────┴────────────┘ │ +│ Written by Driver, read-only for Device │ +│ Alignment: 2 bytes │ +│ Size: 6 + 2 × QueueSize bytes │ +├──────────────────────────────────────────────────────┤ +│ Used Ring │ +│ (Device tells Driver: "these buffers are done") │ +│ │ +│ ┌───────┬─────┬──────────────────┬─────────────┐ │ +│ │ flags │ idx │ ring[QueueSize] │ avail_event │ │ +│ └───────┴─────┴──────────────────┴─────────────┘ │ +│ Written by Device, read-only for Driver │ +│ Alignment: 4 bytes │ +│ Size: 6 + 8 × QueueSize bytes │ +└──────────────────────────────────────────────────────┘ + +QueueSize must be a power of 2, at most 32768 +``` + +### 5.2 Descriptor structure + +```c +struct virtq_desc { + le64 addr; // guest physical address of the buffer + le32 len; // buffer length in bytes + le16 flags; // flag bits + le16 next; // index of the next descriptor in the chain +}; +// each descriptor is 16 bytes +``` + +**Flag definitions**: + +| Flag | Value | Meaning | +|------|---|------| +| `VIRTQ_DESC_F_NEXT` | 0x1 | The chain continues; the `next` field is valid | +| `VIRTQ_DESC_F_WRITE` | 0x2 | The buffer is device-writable (otherwise device-readable) | +| `VIRTQ_DESC_F_INDIRECT` | 0x4 | The buffer contains an indirect descriptor table | + +**Descriptor chains**: one I/O request may span several non-contiguous memory blocks, linked through the `next` field: + +``` +desc[0] desc[3] desc[7] +┌──────────────┐ ┌──────────────┐ ┌──────────────┐ +│ addr: 0x1000 │ │ addr: 0x3000 │ │ addr: 0x5000 │ +│ len: 512 │ │ len: 1024 │ │ len: 256 │ +│ 
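flags: NEXT │────→│ flags: NEXT │────→│ flags: WRITE │ +│ next: 3 │ │ next: 7 │ │ next: - │ +└──────────────┘ └──────────────┘ └──────────────┘ + Device reads 512B Device reads 1024B Device writes 256B + +This chain = one I/O request: read the request header + data → process → write the result +``` + +**Rules**: a descriptor chain must not exceed 2^32 bytes in total, and must not contain a loop. + +### 5.3 Available Ring + +```c +struct virtq_avail { + le16 flags; // notification suppression flag (VIRTQ_AVAIL_F_NO_INTERRUPT) + le16 idx; // next slot to write (free-running 16-bit counter: only incremented, wraps at 65536) + le16 ring[QueueSize]; // head indexes of the available descriptor chains + le16 used_event; // EVENT_IDX feature: used idx threshold at which the driver wants a notification +}; +``` + +**What the driver does**: +1. Fill in the descriptor chain (in the Descriptor Table) +2. Write the index of the chain **head** to `ring[idx % QueueSize]` +3. Memory barrier (so the device sees the descriptors and the ring entry before the new `idx`) +4. `idx++` +5. Notify the device (kick) + +**Key point**: `idx` only moves forward (modulo 2^16); the driver **cannot** withdraw a buffer once it has been published. + +### 5.4 Used Ring + +```c +struct virtq_used { + le16 flags; // notification suppression flag (VIRTQ_USED_F_NO_NOTIFY) + le16 idx; // next slot to write (free-running 16-bit counter) + struct virtq_used_elem { + le32 id; // head index of the completed descriptor chain + le32 len; // number of bytes the device actually wrote + } ring[QueueSize]; + le16 avail_event; // EVENT_IDX: avail idx threshold at which the device wants a notification +}; +``` + +**What the device does**: +1. Take a descriptor chain head index from the Available Ring +2. Walk the descriptor chain and perform the I/O +3. Write `{id, len}` to `ring[idx % QueueSize]` (the spec requires `len` to be set **before `idx` is updated**) +4. Memory barrier +5. `idx++` +6. Send an interrupt to notify the driver + +### 5.5 Three primitives + +Every virtqueue interaction reduces to three operations: + +| Operation | Direction | Meaning | +|------|------|------| +| **add_buf** | Driver → Available Ring | Submit a new I/O request buffer | +| **get_buf** | Driver ← Used Ring | Collect a completed I/O result | +| **kick** | Driver → Device | Tell the device that new buffers are ready | + +Batching and deferred notification are the key to high-performance I/O, because a notification between driver and device usually involves an expensive VM exit. + +--- + +## 6. 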
Data flow: a complete I/O request + +Using a virtio-block read as the example: + +``` +Driver (guest kernel) Device (VMM / libkrun) + │ │ + ① │ Allocate descriptors and fill them in: │ + │ desc[0]: request header (read, sector N) │ + │ flags: NEXT, next: 1 │ + │ desc[1]: data buffer (512 bytes) │ + │ flags: WRITE|NEXT, next: 2 │ + │ desc[2]: status byte (1 byte) │ + │ flags: WRITE │ + │ │ + ② │ Write the index of desc[0] to the avail ring │ + │ avail.ring[avail.idx % size] = 0 │ + │ wmb() // write barrier │ + │ avail.idx++ │ + │ │ + ③ │ ─── kick (notify Device) ───────────────→ │ + │ (write MMIO offset 0x50 = QueueNotify) │ + │ → triggers the KVM ioeventfd │ + │ → EventFd wakes the VMM worker thread │ + │ │ + │ ④ │ Worker thread wakes up + │ │ Reads the avail ring + │ │ Takes the desc[0] index + │ │ Walks the chain: [0]→[1]→[2] + │ │ + │ ⑤ │ Performs the I/O: + │ │ Reads the request header from desc[0] + │ │ Reads sector N from disk + │ │ Writes it to the guest memory desc[1] points to + │ │ Writes status = OK to desc[2] + │ │ + │ ⑥ │ Writes {id=0, len=513} to the used ring + │ │ wmb() + │ │ used.idx++ + │ │ + │ ←───── send interrupt (irqfd) ──────────── ⑦ │ + │ InterruptStatus |= VRING │ + │ Signal irqfd → KVM injects a virtual interrupt │ + │ │ + ⑧ │ Interrupt handler: │ + │ Reads InterruptStatus, confirms VRING │ + │ Reads the used ring, takes {id=0, len=513} │ + │ Returns desc[0-2] to the free pool │ + │ Hands the data to the filesystem layer above │ +``` + +### Notification implementation in libkrun + +**Driver → Device (kick)**: the guest writes MMIO offset 0x50 + +```rust +// mmio.rs, BusDevice::write, offset 0x50 +0x50 => { + // The guest writes a queue index; signal the matching EventFd + if let Some(eventfd) = self.queue_evts.get(v as usize) { + eventfd.write(1).unwrap(); + } +} +``` + +The VMM registers this address as a KVM ioeventfd, so a guest write is completed inside KVM and **never exits to user space**: KVM signals the eventfd and the VMM thread wakes up. The MMIO write still traps into the hypervisor; what is saved is the round trip to the VMM process. + +**Device → Driver (interrupt)**: + +```rust +// mmio.rs (simplified excerpt; `?` only compiles if the enclosing function returns a Result) +pub fn signal_used_queue(&self) { + self.status.fetch_or(VIRTIO_MMIO_INT_VRING as usize, Ordering::SeqCst); + self.intc.lock().unwrap().set_irq(self.irq_line, Some(&self.event))?; +} +``` + +The VMM registers the interrupt EventFd as a KVM irqfd, so writing to the EventFd is enough for KVM to inject the virtual interrupt, with no separate ioctl. + +--- + +## 7. 
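Transport layer: MMIO vs PCI + +The transport is responsible for three things: device discovery, register access, and notification/interrupt delivery. + +### 7.1 Comparison + +| Aspect | virtio-pci | virtio-mmio | +|------|-----------|-------------| +| **Discovery** | PCI bus enumeration (Vendor 0x1AF4) | Platform specific (device tree / kernel command line) | +| **Register access** | PCI BAR + capability structures | MMIO registers at fixed offsets | +| **Interrupts** | MSI-X (independent interrupt per queue) | A single IRQ line | +| **Notification** | I/O port or MMIO write | Write to MMIO offset 0x50 | +| **Device count** | Thousands (multiple PCI buses) | Limited by address space (a few dozen) | +| **Hotplug** | Supported | Not supported | +| **Linux code size** | 161 files, 78,237 lines | **1 file, 538 lines** | +| **Typical use** | General-purpose VMs (QEMU) | **MicroVMs** (libkrun, Firecracker) | + +### 7.2 libkrun virtio-mmio register layout + +Extracted from the `src/devices/src/virtio/mmio.rs` source: + +``` +Offset Dir Register Function +──── ──── ────── ────── +0x00 R MagicValue Always 0x74726976 ("virt"), identifies a virtio device +0x04 R Version Always 2 (virtio modern, v1.0+) +0x08 R DeviceID Device type (block=2, net=1, vsock=19...) +0x0c R VendorID Vendor ID (always 0 in libkrun) +0x10 R DeviceFeatures Device feature bits, read one page at a time +0x14 W DeviceFeaturesSel Selects the feature page (0 = low 32 bits, 1 = high 32 bits) +0x20 W DriverFeatures Feature bits acknowledged by the driver +0x24 W DriverFeaturesSel Selects the driver feature page +0x30 W QueueSel Selects the queue that later accesses refer to +0x34 R QueueNumMax Maximum size of the selected queue +0x38 W QueueNum Sets the size of the selected queue +0x44 R/W QueueReady Queue ready flag +0x50 W QueueNotify Queue notification (kick) ← ioeventfd registration point +0x60 R InterruptStatus Interrupt status bitmap +0x64 W InterruptACK Interrupt acknowledge (clears status bits) +0x70 R/W Status Device status (driver initialization state machine) +0x80 W QueueDescLow Descriptor table address (low 32 bits) +0x84 W QueueDescHigh Descriptor table address (high 32 bits) +0x90 W QueueAvailLow Available Ring address (low 32 bits) +0x94 W QueueAvailHigh Available Ring address (high 32 bits) +0xa0 W QueueUsedLow Used Ring address (low 32 bits) +0xa4 W QueueUsedHigh Used Ring address (high 32 bits) +0xac W SHMRegionSel Shared memory region select +0xb0-bc R SHMRegion* Shared memory region length / base address +0xfc R ConfigGeneration Config space generation counter (for atomic reads) +0x100+ R/W Config Space Device-specific configuration space +``` + +Each device occupies **4KB (one page)** of MMIO address space. + +### 7.3 Why MicroVMs choose MMIO + +``` +Costs of virtio-pci: + ├── Requires PCI bus emulation (config space, BAR mapping, MSI-X tables) + ├── The guest kernel must enumerate PCI (with BIOS/ACPI assistance) + ├── High code complexity (78K lines vs 538 lines) + └── Longer boot time (PCI enumeration + ACPI parsing) + +Advantages of virtio-mmio: + ├── No PCI/ACPI infrastructure needed + ├── Device addresses are passed directly via the kernel command line (x86) or FDT 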
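(ARM) + ├── Minimal implementation, over 100x less code + └── More than enough for microVMs with fewer than 10 devices +``` + +--- + +## 8. Notification optimization + +Frequent notifications (kicks and interrupts) cause many VM exits and hurt performance badly. Virtio offers two suppression mechanisms: + +### 8.1 Flag-based suppression (simple mode) + +``` +Driver side: + avail.flags = VIRTQ_AVAIL_F_NO_INTERRUPT (0x1) + → tells the Device: "don't interrupt me after processing buffers, I will poll the used ring" + +Device side: + used.flags = VIRTQ_USED_F_NO_NOTIFY (0x1) + → tells the Driver: "don't notify me about new buffers, I will poll the avail ring" +``` + +### 8.2 EVENT_IDX (fine-grained mode, recommended) + +Requires negotiating the `VIRTIO_RING_F_EVENT_IDX` feature: + +``` +Driver writes avail.used_event = N: + → tells the Device: "interrupt me only once used.idx moves past N (from N to N+1)" + +Device writes used.avail_event = M: + → tells the Driver: "notify me only once avail.idx moves past M (from M to M+1)" +``` + +Effect: under heavy load, many I/O completions are coalesced into a single interrupt; under light load, each I/O is still signalled promptly. + +### 8.3 EVENT_IDX in libkrun + +```rust +// mmio.rs: EVENT_IDX is checked when the device is activated +// (simplified excerpt: `locked_device` and `device_queues` are assumed to be set up +// earlier, and `?` only compiles if the enclosing function returns a Result) +fn activate(&mut self) { + let event_idx_enabled = + (locked_device.acked_features() & (1 << VIRTIO_RING_F_EVENT_IDX)) != 0; + for dq in &mut device_queues { + dq.queue.set_event_idx(event_idx_enabled); + } + locked_device.activate(self.mem.clone(), self.interrupt.clone(), device_queues)?; +} +``` + +### 8.4 KVM acceleration: ioeventfd + irqfd + +Under KVM, notifications are optimized further: + +``` +Traditional notification path (no ioeventfd): + Guest writes MMIO 0x50 → VMEXIT → KVM → return to the VMM in user space → VMM handles it → VMENTER + Cost: ~1-2μs per notification (rough estimate) + +ioeventfd path: + Guest writes MMIO 0x50 → VMEXIT → KVM signals the eventfd in-kernel (no exit to user space) and resumes the guest → VMM wakes via epoll + Cost: ~0.1μs (rough estimate) + +Traditional interrupt path (no irqfd): + VMM calls ioctl → KVM injects the interrupt → Guest handles it + Cost: one ioctl system call + +irqfd path: + VMM writes the eventfd → KVM injects the interrupt automatically (no ioctl) + Cost: only the eventfd write +``` + +--- + +## 9. 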
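Packed Virtqueue (v1.1+) + +Virtio 1.1 introduced the Packed Virtqueue to improve cache locality: + +### Split vs Packed + +``` +Split Virtqueue (v1.0): + 3 separate memory regions: Descriptor Table + Available Ring + Used Ring + → the three regions are scattered, so cache misses are frequent + → Driver and Device write to different regions (cache bouncing) + +Packed Virtqueue (v1.1): + 1 unified memory region: descriptor, available, and used information merged + → all information sits near the same cache line + → fewer cache misses and less cache line bouncing +``` + +``` +Packed descriptor structure: +struct pvirtq_desc { + le64 addr; + le32 len; + le16 id; + le16 flags; // includes the AVAIL and USED flag bits +}; + +Driver and Device mark state by flipping the AVAIL/USED flag bits. +On the first pass through the ring (the polarity inverts each time the ring wraps): + AVAIL=1, USED=0 → buffer made available by the Driver (equivalent to being in the avail ring) + AVAIL=1, USED=1 → Device has finished with it (equivalent to being in the used ring) +``` + +Packed Virtqueue performs better under high throughput but is more complex to implement. libkrun currently uses the Split Virtqueue. + +--- + +## 10. Why Virtio suits MicroVMs / AI sandboxes + +### 10.1 Compared with full hardware emulation + +| Aspect | Full hardware emulation (QEMU e1000/IDE) | Virtio | +|------|---------------------------|--------| +| **Principle** | Emulates the register timing of real hardware | Shared-memory ring buffers | +| **Guest driver** | Real hardware driver (unaware of virtualization) | virtio driver (aware of virtualization) | +| **Per I/O** | Many register accesses = many VMEXITs | Batched descriptors + one notification | +| **DMA** | Emulated DMA engine | Direct shared-memory access | +| **Performance** | ~60-70% of native | ~95%+ of native | +| **Code complexity** | Very high (exact emulation of hardware state machines) | Low (simple ring buffers) | +| **Security risk** | High (complex code = more CVEs) | Low (simple protocol = fewer bugs) | + +### 10.2 Concrete benefits for AI sandboxes + +| Virtio property | AI sandbox benefit | +|------------|----------------| +| No hardware emulation, small code base | Very small VMM attack surface; untrusted code is hard to escape from | +| Shared-memory transfer | File and network I/O close to native performance | +| MMIO transport (no PCI) | Roughly 10x faster boot (no PCI enumeration), lower memory use | +| Standard drivers (built into the Linux kernel) | No custom guest kernel needed; any Linux distribution works out of the box | +| Notification optimization (ioeventfd/irqfd) | High-throughput I/O for workloads that execute code frequently | +| Simple implementation | Easy to audit, which raises confidence in its security | + +### 10.3 The virtio devices in libkrun + +| Device | Type ID | Virtqueues | Use (AI sandbox scenario) | +|------|---------|---------------|----------------------| +| **virtio-block** | 2 | 1 | Mounts the root filesystem, stores code and data | +| **virtio-net** | 1 | 2 (rx+tx) | pip install, API calls, network access | +| **virtio-console** | 3 | 2×N (per port) | Capturing stdout/stderr, logging | +| **virtio-vsock** | 19 | 2 (rx+tx) | Host-guest gRPC communication, TSI network proxy | +| **virtio-fs** | 26 | 1+ | 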
Sharing host directories with the guest (code mounts) | +| **virtio-balloon** | 5 | 3 | Dynamic memory reclaim (high-density deployments) | +| **virtio-rng** | 4 | 1 | Supplies high-quality randomness to the guest | +| virtio-gpu | 16 | 2 | Optional: GUI rendering | +| virtio-input | 18 | 2 | Optional: input device passthrough | +| virtio-sound | 25 | 4 | Optional: audio | + +**All devices use the virtio-mmio v2 transport**: no PCI, no ACPI. + +--- + +## Appendix: Sources + +- [OASIS VIRTIO Specification v1.2](https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html) +- [OASIS VIRTIO Specification v1.3](https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html) +- [Virtio on Linux - Kernel Documentation](https://docs.kernel.org/driver-api/virtio/virtio.html) +- [Virtio Devices High-Level Design - Project ACRN](https://projectacrn.github.io/latest/developer-guides/hld/hld-virtio-devices.html) +- [Rusty Russell: virtio - Towards a De-Facto Standard](https://ozlabs.org/~rusty/virtio-spec/virtio-paper.pdf) +- [Virtqueues and virtio ring: How the data travels - Red Hat](https://www.redhat.com/en/blog/virtqueues-and-virtio-ring-how-data-travels) +- [Packed virtqueue: How to reduce overhead - Red Hat](https://www.redhat.com/en/blog/packed-virtqueue-how-reduce-overhead-virtio) +- [Virtio devices and drivers overview - Red Hat](https://www.redhat.com/en/blog/virtio-devices-and-drivers-overview-headjack-and-phone) +- libkrun source: `src/devices/src/virtio/mmio.rs`, `device.rs`, `queue.rs` diff --git a/docs/architecture-guide.md b/docs/architecture-guide.md new file mode 100644 index 000000000..05373e9a6 --- /dev/null +++ b/docs/architecture-guide.md @@ -0,0 +1,1122 @@ +# BoxLite Architecture Guide + +> Cross-Platform Architecture & Design Reference for Windows Native Support Preparation + +--- + +## Table of Contents + +1. [High-Level Architecture](#1-high-level-architecture) +2. [Layered Architecture](#2-layered-architecture) +3. [Complete Call Chain](#3-complete-call-chain) +4. [Platform Abstraction Map](#4-platform-abstraction-map) +5. [Module Deep Dive](#5-module-deep-dive) +6. 
[External Dependencies & Libraries](#6-external-dependencies--libraries) +7. [Guest Agent Architecture](#7-guest-agent-architecture) +8. [Host-Guest Communication](#8-host-guest-communication) +9. [Windows Native Porting Analysis](#9-windows-native-porting-analysis) +10. [Initialization Pipeline](#10-initialization-pipeline) +11. [Snapshot & Clone Architecture](#11-snapshot--clone-architecture) +12. [Key Design Decisions](#12-key-design-decisions) + +--- + +## 1. High-Level Architecture + +BoxLite is an **embeddable VM runtime** — "SQLite for sandboxing." It runs OCI containers inside lightweight VMs with hardware-level isolation, without requiring a daemon or root privileges. + +```mermaid +graph TB + subgraph "User Applications" + PY[Python App] + JS[Node.js App] + C_APP[C/Go App] + CLI[CLI] + REST[REST Client] + end + + subgraph "SDK Layer" + PY_SDK["Python SDK
(PyO3)"] + JS_SDK["Node.js SDK
(napi-rs)"] + FFI_SDK["C FFI Layer"] + CLI_BIN["CLI Binary"] + REST_SERVER["REST/gRPC Server"] + end + + subgraph "Core Runtime (boxlite crate)" + RT["BoxliteRuntime"] + LB["LiteBox"] + VMM["VMM Engine"] + JAIL["Jailer"] + NET["Network Backend"] + IMG["Image Manager"] + DISK["Disk Manager"] + ROOTFS["Rootfs Builder"] + DB["SQLite DB"] + PORTAL["Portal (gRPC)"] + end + + subgraph "Shim Process (boxlite-shim)" + SHIM["Shim Controller"] + KRUN["libkrun Engine"] + end + + subgraph "Guest VM" + GUEST["Guest Agent
(boxlite-guest)"] + CONTAINER["OCI Container"] + end + + PY --> PY_SDK + JS --> JS_SDK + C_APP --> FFI_SDK + CLI --> CLI_BIN + REST --> REST_SERVER + + PY_SDK --> RT + JS_SDK --> RT + FFI_SDK --> RT + CLI_BIN --> RT + REST_SERVER --> RT + + RT --> LB + RT --> IMG + RT --> DB + + LB --> VMM + LB --> PORTAL + LB --> DISK + LB --> ROOTFS + + VMM --> JAIL + VMM --> NET + VMM --> SHIM + + SHIM --> KRUN + + KRUN -.->|"vsock/gRPC"| GUEST + PORTAL -.->|"vsock/gRPC"| GUEST + GUEST --> CONTAINER +``` + +--- + +## 2. Layered Architecture + +```mermaid +graph TB + subgraph "Layer 5: SDK / API" + direction LR + L5A["Python SDK
(PyO3 + pyo3-async-runtimes)"] + L5B["Node.js SDK
(napi-rs + napi-derive)"] + L5C["C FFI
(boxlite-ffi crate)"] + L5D["REST/gRPC Server
(axum + tonic)"] + end + + subgraph "Layer 4: Runtime Orchestration" + direction LR + L4A["BoxliteRuntime
(RuntimeBackend trait)"] + L4B["RuntimeImpl
(LocalRuntime)"] + L4C["BoxManager
(Box lifecycle)"] + L4D["ImageManager
(OCI pull/cache)"] + end + + subgraph "Layer 3: Box Lifecycle" + direction LR + L3A["LiteBox
(BoxBackend trait)"] + L3B["BoxImpl
(VM-backed)"] + L3C["BoxBuilder
(Init pipeline)"] + L3D["Execution
(Process handle)"] + end + + subgraph "Layer 2: VM Management" + direction LR + L2A["VmmController
(Spawn trait)"] + L2B["ShimController
(Subprocess spawn)"] + L2C["VmmHandler
(Runtime ops)"] + L2D["ProcessMonitor
(Exit detection)"] + end + + subgraph "Layer 1: Platform Services" + direction LR + L1A["Jailer
(Sandbox trait)"] + L1B["NetworkBackend
(trait)"] + L1C["Disk/Rootfs
(ext4/qcow2)"] + L1D["Portal
(gRPC channel)"] + end + + subgraph "Layer 0: Native / OS" + direction LR + L0A["libkrun
(KVM / Hvf)"] + L0B["bubblewrap / seatbelt"] + L0C["gvproxy
(Go, userspace net)"] + L0D["e2fsprogs
(mke2fs)"] + end + + L5A --> L4A + L5B --> L4A + L5C --> L4A + L5D --> L4A + L4A --> L4B + L4B --> L4C + L4B --> L4D + L4C --> L3A + L3A --> L3B + L3B --> L3C + L3B --> L3D + L3C --> L2A + L2A --> L2B + L2B --> L2C + L2B --> L2D + L2C --> L1A + L2B --> L1A + L3C --> L1B + L3C --> L1C + L3B --> L1D + L1A --> L0A + L1A --> L0B + L1B --> L0C + L1C --> L0D +``` + +--- + +## 3. Complete Call Chain + +### 3.1 Box Creation Flow + +```mermaid +sequenceDiagram + participant User as User Code + participant SDK as SDK (Python/Node/C) + participant RT as BoxliteRuntime + participant RI as RuntimeImpl + participant BB as BoxBuilder + participant IMG as ImageManager + participant DISK as DiskManager + participant ROOTFS as RootfsBuilder + participant SHIM as ShimSpawner + participant JAIL as Jailer + participant VMM as boxlite-shim + participant KRUN as libkrun + participant GUEST as Guest Agent + + User->>SDK: boxlite.run(image, cmd) + SDK->>RT: BoxliteRuntime::run() + RT->>RI: RuntimeImpl::create_box() + + Note over RI: Step 1: Prepare Box Config + RI->>IMG: ImageManager::pull(image) + IMG-->>RI: ImageHandle (layers, config) + RI->>DISK: Create container rootfs disk (ext4) + RI->>DISK: Create guest rootfs disk (qcow2 COW) + RI->>ROOTFS: RootfsBuilder::build() + ROOTFS-->>RI: Prepared rootfs + mounts + + Note over RI: Step 2: Build InstanceSpec + RI->>BB: BoxBuilder::new(config) + BB->>BB: Configure transport (Unix socket) + BB->>BB: Configure network (gvproxy) + BB->>BB: Build InstanceSpec + + Note over RI: Step 3: Spawn Shim + BB->>SHIM: ShimSpawner::spawn(config_json) + SHIM->>JAIL: JailerBuilder::build() + JAIL-->>SHIM: Jail (BwrapSandbox / SeatbeltSandbox) + SHIM->>JAIL: jail.prepare() [cgroups on Linux] + SHIM->>JAIL: jail.command(shim_binary, args) + Note over JAIL: Adds pre_exec hook:
FD cleanup, rlimits,
PID file, cgroup join + + SHIM->>VMM: Child::spawn() [boxlite-shim] + VMM->>VMM: Read config from stdin + VMM->>VMM: Start gvproxy (if network) + VMM->>KRUN: libkrun FFI setup + Note over KRUN: krun_create_ctx()
krun_set_vm_config()
krun_set_root()
krun_set_mapped_volumes()
krun_set_port_map()
krun_start_enter() + KRUN->>KRUN: Process takeover (never returns) + + Note over GUEST: Inside VM + GUEST->>GUEST: Mount overlayfs + GUEST->>GUEST: Start gRPC server (vsock) + GUEST->>VMM: Ready notification (vsock) + VMM-->>BB: Shim PID + transport + + Note over RI: Step 4: Establish Connection + BB->>BB: Wait for ready notification + BB->>BB: Create GuestSession (gRPC) + BB-->>RI: BoxImpl (LiveState) + RI-->>RT: LiteBox + RT-->>SDK: Box handle + SDK-->>User: box +``` + +### 3.2 Command Execution Flow + +```mermaid +sequenceDiagram + participant User as User Code + participant LB as LiteBox + participant BI as BoxImpl + participant GS as GuestSession + participant GUEST as Guest Agent + participant CTR as Container Runtime + + User->>LB: box.exec(cmd) + LB->>BI: BoxBackend::exec() + BI->>BI: Ensure VM running (lazy start) + BI->>GS: GuestSession::exec(command) + GS->>GUEST: gRPC ExecRequest (vsock) + GUEST->>CTR: Fork + exec in container + CTR-->>GUEST: Process spawned + GUEST-->>GS: ExecResponse (exec_id) + GS-->>BI: Execution handle + BI-->>LB: Execution + LB-->>User: Execution {stdin, stdout, stderr, wait()} +``` + +--- + +## 4. Platform Abstraction Map + +### 4.1 Platform Decision Tree + +```mermaid +graph TD + START["BoxLite Startup"] --> SYSCHECK["SystemCheck::run()"] + + SYSCHECK -->|"Linux"| LINUX_CHECK["Open /dev/kvm
+ KVM_CREATE_VM smoke test"] + SYSCHECK -->|"macOS"| MAC_CHECK["sysctl kern.hv_support == 1
(Hypervisor.framework)"] + SYSCHECK -->|"Other"| UNSUPPORTED["Err(Unsupported)"] + + LINUX_CHECK --> VMM_ENGINE + MAC_CHECK --> VMM_ENGINE + + VMM_ENGINE["VMM Engine: libkrun"] + + VMM_ENGINE -->|"Linux"| KVM["KVM backend
/dev/kvm ioctl"] + VMM_ENGINE -->|"macOS"| HVF["Hypervisor.framework
hv_vm_create()"] + + VMM_ENGINE --> JAIL_SELECT["Jailer Selection"] + + JAIL_SELECT -->|"Linux"| BWRAP["BwrapSandbox
(bubblewrap)"] + JAIL_SELECT -->|"macOS"| SEATBELT["SeatbeltSandbox
(sandbox-exec)"] + + BWRAP --> LINUX_EXTRAS["+ Seccomp
+ Landlock
+ AppArmor
+ Cgroups v2
+ Credentials (uid/gid)"] + + JAIL_SELECT --> NET_SELECT["Network Backend"] + NET_SELECT --> GVPROXY["gvproxy
(gvisor-tap-vsock)"] + + GVPROXY -->|"Linux"| GVPROXY_LINUX["UnixStream socket
+ virtio-net"] + GVPROXY -->|"macOS"| GVPROXY_MAC["UnixDgram socket
+ virtio-net"] + + JAIL_SELECT --> PROCESS_MON["ProcessMonitor"] + PROCESS_MON -->|"Linux 5.3+"| PIDFD["pidfd_open()
+ AsyncFd"] + PROCESS_MON -->|"macOS"| KQUEUE["kqueue
+ EVFILT_PROC"] + PROCESS_MON -->|"Fallback"| POLLING["100ms poll
(try_wait loop)"] +``` + +### 4.2 Platform-Specific Code Map + +| Module | Linux | macOS | Windows (TODO) | +|--------|-------|-------|----------------| +| **Hypervisor** | KVM (`/dev/kvm`) | Hypervisor.framework | WHPX / Hyper-V (MSHV) | +| **VMM Library** | `libkrun` (KVM backend) | `libkrun` (Hvf backend) | Cloud Hypervisor / custom | +| **Jailer Sandbox** | `bubblewrap` (namespaces, pivot_root) | `sandbox-exec` (Seatbelt/SBPL) | Job Objects + AppContainer | +| **Seccomp/Syscall** | Seccomp BPF filter | N/A (Seatbelt covers) | N/A | +| **Landlock** | Landlock LSM (kernel 5.13+) | N/A | N/A | +| **Cgroups** | cgroups v2 | N/A | Job Objects | +| **AppArmor** | AppArmor profiles | N/A | N/A | +| **Network Socket** | `UnixStream` | `UnixDgram` | Named Pipes / AF_HYPERV | +| **Process Monitor** | `pidfd_open()` | `kqueue` + `EVFILT_PROC` | `WaitForSingleObject()` | +| **FD Cleanup** | `close_range()` / `/proc/self/fd` | `getrlimit` brute-force | `NtQueryInformationProcess` | +| **Host-Guest Transport** | vsock (`AF_VSOCK`) | vsock (via libkrun) | Hyper-V sockets (`AF_HYPERV`) | +| **Filesystem Sharing** | virtiofs | virtiofs | Plan 9 / virtiofs | +| **Disk Creation** | `mke2fs` (e2fsprogs) | `mke2fs` (e2fsprogs) | Need ext4 tools or alt format | +| **Bind Mounts** | `mount --bind` | N/A (virtiofs share) | N/A | +| **User Namespaces** | Clone + unshare | N/A | N/A | +| **DNS Configuration** | Write `/etc/resolv.conf` in rootfs | Same | Same | + +--- + +## 5. 
Module Deep Dive + +### 5.1 Runtime Layer (`src/boxlite/src/runtime/`) + +```mermaid +classDiagram + class BoxliteRuntime { + +backend: Arc~dyn RuntimeBackend~ + +image_backend: Option~Arc~dyn ImageBackend~~ + +new(options: BoxliteOptions) BoxliteResult~Self~ + +default() BoxliteResult~Self~ + +run(image, cmd) BoxliteResult~LiteBox~ + +box_builder(options) BoxliteResult~LiteBox~ + +list() Vec~BoxInfo~ + +kill(id) BoxliteResult + +shutdown() + } + + class RuntimeBackend { + <> + +create_box(options, name) BoxliteResult~LiteBox~ + +list_boxes() Vec~BoxInfo~ + +get_box(id) Option~LiteBox~ + +kill_box(id) BoxliteResult + +shutdown_sync() + } + + class RuntimeImpl { + +layout: FilesystemLayout + +box_manager: BoxManager + +lock: LockGuard + +event_listeners: Vec~Arc~dyn EventListener~~ + +new(options) BoxliteResult~Self~ + } + + class LocalRuntime { + +RuntimeImpl + } + + BoxliteRuntime --> RuntimeBackend : delegates to + LocalRuntime ..|> RuntimeBackend : implements + LocalRuntime --> RuntimeImpl : wraps +``` + +**Key types:** +- `BoxliteRuntime` — Public API, cloneable (`Arc`), delegates to a `RuntimeBackend` +- `RuntimeImpl` — Local implementation: filesystem layout, box manager, SQLite DB, event listeners +- `BoxliteOptions` — Configuration: home dir, log level, event listeners, resource defaults +- `FilesystemLayout` — Typed paths: `~/.boxlite/{boxes,images,layers,bases,logs,db}` + +### 5.2 LiteBox Layer (`src/boxlite/src/litebox/`) + +```mermaid +classDiagram + class LiteBox { + +id: BoxID + +name: Option~String~ + +box_backend: Arc~dyn BoxBackend~ + +snapshot_backend: Arc~dyn SnapshotBackend~ + +start() + +exec(command) Execution + +stop() + +metrics() BoxMetrics + +copy_into(src, dst) + +copy_out(src, dst) + +clone_box(options) + +export(options, dest) + } + + class BoxImpl { + +config: BoxConfig + +state: RwLock~BoxState~ + +live: OnceCell~LiveState~ + +runtime: SharedRuntimeImpl + } + + class LiveState { + +handler: Mutex~Box~dyn VmmHandler~~ + +guest_session: 
GuestSession + +metrics: BoxMetricsStorage + +container_rootfs_disk: Disk + +bind_mount: Option~BindMountHandle~ [Linux] + } + + class BoxBuilder { + +build(config, options) BoxliteResult~BoxImpl~ + } + + class Execution { + +id() String + +stdin() Option~ExecStdin~ + +stdout() Option~ExecStdout~ + +stderr() Option~ExecStderr~ + +wait() ExecResult + +kill() + +resize_tty(rows, cols) + } + + LiteBox --> BoxImpl : delegates to + BoxImpl --> LiveState : lazy init + BoxBuilder --> BoxImpl : creates + BoxImpl --> Execution : creates via exec() +``` + +**Key design:** +- `LiteBox` is a thin wrapper over `BoxBackend` trait (enables REST and local backends) +- `BoxImpl` holds `BoxConfig` (persisted) and `LiveState` (lazy, via `OnceCell`) +- `BoxBuilder` is the init pipeline: disk creation, rootfs assembly, shim spawn, gRPC connect +- `Execution` wraps the gRPC exec stream: stdin/stdout/stderr via `Arc>` + +### 5.3 VMM Layer (`src/boxlite/src/vmm/`) + +```mermaid +classDiagram + class Vmm { + <> + +create(config: InstanceSpec) VmmInstance + } + + class VmmInstance { + +enter() BoxliteResult + } + + class VmmController { + <> + +start(bundle: InstanceSpec) Box~dyn VmmHandler~ + } + + class VmmHandler { + <> + +stop() + +metrics() VmmMetrics + +is_running() bool + +pid() u32 + } + + class ShimController { + +binary_path: PathBuf + +layout: BoxFilesystemLayout + } + + class ShimHandler { + +child: Child + +pid: u32 + +handler: Arc~Mutex~dyn VmmHandler~~ + } + + class InstanceSpec { + +engine: VmmKind + +box_id: String + +security: SecurityOptions + +cpus: Option~u8~ + +memory_mib: Option~u32~ + +fs_shares: FsShares + +block_devices: BlockDevices + +guest_entrypoint: Entrypoint + +transport: Transport + +network_config: NetworkBackendConfig + +guest_rootfs: GuestRootfs + } + + class Krun { + +options: VmmConfig + +create(config) VmmInstance + } + + ShimController ..|> VmmController + ShimHandler ..|> VmmHandler + Krun ..|> Vmm + Vmm --> VmmInstance : creates + VmmController --> 
VmmHandler : returns +``` + +**Architecture split:** +- **VmmController** = spawn operations (creates a VmmHandler) +- **VmmHandler** = runtime operations (stop, metrics, is_running) +- **Vmm trait** = engine-specific (libkrun): used inside the **shim process** +- **ShimController** = spawns `boxlite-shim` as subprocess (isolation from process takeover) + +### 5.4 Jailer Layer (`src/boxlite/src/jailer/`) + +```mermaid +classDiagram + class Jail { + <> + +prepare() BoxliteResult + +command(binary, args) Command + } + + class Sandbox { + <> + +name() str + +is_available() bool + +setup(ctx) BoxliteResult + +apply(ctx, cmd) + } + + class JailerS { + +sandbox: S + +security: SecurityOptions + +volumes: Vec~VolumeSpec~ + +box_id: String + +layout: BoxFilesystemLayout + +preserved_fds: Vec~RawFd_i32~ + } + + class BwrapSandbox { + <> + +Mount namespaces + +PID namespaces + +Network namespaces + +Chroot/pivot_root + } + + class SeatbeltSandbox { + <> + +SBPL policy generation + +sandbox-exec wrapping + +Per-path allow rules + } + + class NoopSandbox { + +No isolation + } + + class CompositeSandbox { + <> + +Bwrap + Landlock + } + + JailerS ..|> Jail + BwrapSandbox ..|> Sandbox + SeatbeltSandbox ..|> Sandbox + NoopSandbox ..|> Sandbox + CompositeSandbox ..|> Sandbox + JailerS --> Sandbox : delegates to + + note for BwrapSandbox "Linux only:\n+ Seccomp BPF\n+ Landlock LSM\n+ AppArmor\n+ Cgroups v2\n+ Credential drop" + note for SeatbeltSandbox "macOS only:\nSBPL deny-default policy\nPer-path granular access\nNetwork enable/disable" +``` + +**Pre-exec hook chain** (applied to `std::process::Command`): +1. FD preservation (dup2 watchdog pipe) +2. FD cleanup (`close_range` / `/proc/self/fd` / brute-force) +3. Resource limits (rlimits) +4. PID file write +5. Cgroup join (Linux, added by `BwrapSandbox::apply`) +6. 
Landlock enforcement (Linux, added by `CompositeSandbox::apply`) + +### 5.5 Network Layer (`src/boxlite/src/net/`) + +```mermaid +classDiagram + class NetworkBackend { + <> + +endpoint() NetworkBackendEndpoint + +name() str + +metrics() Option~NetworkMetrics~ + } + + class NetworkBackendFactory { + +create(config) Option~Box~dyn NetworkBackend~~ + } + + class GvisorTapBackend { + +gvproxy process (Go binary) + +UnixStream (Linux) + +UnixDgram (macOS) + +DNS sinkhole (allow_net) + +MITM proxy (secrets) + } + + class LibslirpBackend { + +libslirp library + +UnixStream + } + + class NetworkBackendEndpoint { + UnixSocket: path + ConnectionType + mac_address + } + + class NetworkBackendConfig { + +port_mappings: Vec~u16_u16~ + +socket_path: PathBuf + +allow_net: Vec~String~ + +secrets: Vec~Secret~ + +ca_cert_pem: Option~String~ + } + + GvisorTapBackend ..|> NetworkBackend + LibslirpBackend ..|> NetworkBackend + NetworkBackendFactory --> NetworkBackend : creates +``` + +**gvproxy (gvisor-tap-vsock):** +- Go binary, vendored in `src/deps/libgvproxy-sys` +- Provides userspace TCP/IP stack (no root, no TUN/TAP) +- DNS sinkhole for `allow_net` filtering +- MITM proxy for `secrets` injection into HTTPS +- Connection type differs: `UnixStream` on Linux, `UnixDgram` on macOS + +--- + +## 6. External Dependencies & Libraries + +### 6.1 Vendored Sys Crates (`src/deps/`) + +```mermaid +graph LR + subgraph "src/deps/ (vendored C/Go sys crates)" + LIBKRUN["libkrun-sys
━━━━━━━━━
VMM hypervisor
KVM (Linux)
Hvf (macOS)"] + BWRAP["bubblewrap-sys
━━━━━━━━━
Linux sandbox
Namespaces
pivot_root"] + E2FS["e2fsprogs-sys
━━━━━━━━━
ext4 creation
mke2fs binary"] + GVPROXY["libgvproxy-sys
━━━━━━━━━
Network backend
Go binary
gvisor-tap-vsock"] + end + + subgraph "Platform Availability" + direction TB + LINUX["Linux ✅"] + MACOS["macOS ✅"] + WIN["Windows ❌"] + end + + LIBKRUN -->|"✅"| LINUX + LIBKRUN -->|"✅"| MACOS + LIBKRUN -->|"❌ No WHPX/MSHV"| WIN + + BWRAP -->|"✅"| LINUX + BWRAP -->|"❌ Linux-only"| MACOS + BWRAP -->|"❌ Linux-only"| WIN + + E2FS -->|"✅"| LINUX + E2FS -->|"✅ brew install"| MACOS + E2FS -->|"⚠️ Cross-compile"| WIN + + GVPROXY -->|"✅"| LINUX + GVPROXY -->|"✅"| MACOS + GVPROXY -->|"⚠️ Needs Go build"| WIN +``` + +| Crate | Purpose | Linux | macOS | Windows | +|-------|---------|-------|-------|---------| +| `libkrun-sys` | VMM: KVM/Hvf hypervisor, virtio devices, process takeover | KVM | Hypervisor.framework | **Blocker**: No WHPX/MSHV backend | +| `bubblewrap-sys` | Sandbox: namespaces, pivot_root, seccomp | Full | Not used | Not applicable | +| `e2fsprogs-sys` | Disk: `mke2fs` for ext4 filesystem creation | Native | Homebrew | Needs cross-compile or alt | +| `libgvproxy-sys` | Network: Go-based userspace TCP/IP | Full | Full | Needs Go cross-compile | + +### 6.2 Rust Crate Dependencies + +| Category | Crate | Purpose | +|----------|-------|---------| +| **Async Runtime** | `tokio` | Event loop, tasks, timers, I/O | +| **gRPC** | `tonic` + `prost` | Host-guest communication protocol | +| **OCI Images** | `oci-client` | Container image pull/push | +| **Database** | `rusqlite` | Box metadata persistence | +| **HTTP Server** | `axum` | REST API server | +| **Serialization** | `serde` + `serde_json` | Config, IPC, persistence | +| **Logging** | `tracing` + `tracing-subscriber` | Structured logging | +| **Python FFI** | `pyo3` + `pyo3-async-runtimes` | Python SDK bindings | +| **Node.js FFI** | `napi` + `napi-derive` | Node.js SDK bindings | +| **Process** | `sysinfo` | Process CPU/memory metrics | +| **Crypto** | `rcgen` + `time` | MITM CA cert generation | +| **Concurrency** | `parking_lot` | Fast RwLock for BoxState | +| **Async Traits** | `async-trait` | Async trait 
methods | +| **TLS** | `rustls` | gRPC TLS support | + +### 6.3 Key Library Choices & Rationale + +```mermaid +mindmap + root((BoxLite
Library Choices)) + Hypervisor + libkrun + Process takeover model + KVM + Hvf backends + Built-in virtio devices + TSI networking + Sandboxing + bubblewrap (Linux) + Unprivileged namespaces + pivot_root isolation + Mature, well-tested + sandbox-exec (macOS) + Seatbelt/SBPL policy + deny-default + allow rules + No root required + Networking + gvproxy (gvisor-tap-vsock) + Userspace TCP/IP + No root/TUN/TAP + DNS sinkhole + MITM proxy + Communication + gRPC over vsock + Streaming exec I/O + Bidirectional + Proto-defined API + Storage + ext4 + qcow2 + COW snapshots + Thin clones + Standard formats +``` + +--- + +## 7. Guest Agent Architecture + +The guest agent (`src/guest/`) runs **inside the VM** and is always compiled for Linux. + +```mermaid +graph TB + subgraph "Guest VM (Linux)" + MAIN["main.rs
Entry point"] + SERVER["GuestServer<br/>(gRPC server)"] + SERVICE["GuestService<br/>(request handler)"] + CONTAINER["Container Runtime<br/>(libcontainer)"] + MOUNTS["Mounts Manager<br/>(overlayfs, volumes)"] + NETWORK["Network Setup<br/>(resolv.conf, routes)"] + STORAGE["Storage Manager<br/>(disks, filesystems)"] + CA["CA Trust<br/>(inject MITM certs)"] + end + + subgraph "gRPC Services" + EXEC["Exec Service<br/>fork+exec in container"] + FILE["File Service<br/>upload/download tar"] + HEALTH["Health Service<br/>readiness probe"] + RESIZE["Resize Service<br/>PTY terminal resize"] + end + + MAIN --> SERVER + SERVER --> SERVICE + SERVICE --> EXEC + SERVICE --> FILE + SERVICE --> HEALTH + SERVICE --> RESIZE + + EXEC --> CONTAINER + FILE --> MOUNTS + SERVICE --> NETWORK + SERVICE --> STORAGE + SERVICE --> CA + + HOST["Host (Portal)"] -.->|"vsock:2695<br/>gRPC"| SERVER + HOST -.->|"vsock:2696
Ready notify"| MAIN +``` + +**Guest startup sequence:** +1. **Start Zygote** (`clone3()` fork server) **before** Tokio — avoids musl `malloc` deadlock in forked async runtime +2. Mount essential tmpfs (`/tmp`, `/dev/shm`) +3. Parse args (`--listen vsock://2695 --notify vsock://2696`) +4. Initialize tracing +5. Prepare guest layout (`/boxlite/*`) +6. Start gRPC server on vsock +7. Send ready notification to host + +**On `Guest.Init()` gRPC call:** +1. Mount volumes (virtiofs + block devices) +2. Configure network (DNS, routes) +3. Inject CA certs (if MITM secrets configured) + +**On `Container.Init()` gRPC call:** +1. Assemble overlayfs (upper + lower layers) +2. Start OCI container via `libcontainer` + +**Zygote pattern:** Container processes are spawned via the pre-forked Zygote using `clone3()` syscall. This avoids the musl libc deadlock that occurs when `fork()` is called from a multi-threaded Tokio runtime — the Zygote is started **before** any threads exist. + +--- + +## 8. Host-Guest Communication + +```mermaid +graph LR + subgraph "Host Process" + PORTAL["Portal
(GuestSession)"] + TONIC_C["tonic gRPC Client"] + end + + subgraph "Transport Layer" + direction TB + VSOCK["vsock (AF_VSOCK)<br/>Port 2695: gRPC<br/>Port 2696: Ready"] + UNIX["Unix Socket<br/>(fallback)"] + end + + subgraph "Guest VM" + TONIC_S["tonic gRPC Server"] + GUEST_SVC["GuestService"] + end + + PORTAL --> TONIC_C + TONIC_C -->|"Host sends"| VSOCK + VSOCK -->|"Guest receives"| TONIC_S + TONIC_S --> GUEST_SVC + + TONIC_C -.->|"Fallback"| UNIX + UNIX -.->|"Fallback"| TONIC_S +``` + +**Transport abstraction:** +``` +Transport enum: + ├── Vsock { port: u32 } ← Primary (inside VM, no host setup) + ├── Unix { socket_path } ← Fallback / development + └── Tcp { port: u16 } ← Future / distributed +``` + +**Krun-specific transform:** The host configures a `Unix` transport (socket in the box directory), and the libkrun process bridges it to `vsock` inside the guest. The `Krun::transform_shell_arg_unix_to_vsock()` method rewrites the guest entrypoint args accordingly. + +--- + +## 9. Windows Native Porting Analysis + +### 9.1 Component Readiness + +```mermaid +graph TB + subgraph "Ready (No Changes)" + style Ready fill:#90EE90 + SDK["SDK Layer<br/>Python/Node.js/C FFI"] + RUNTIME["Runtime Orchestration<br/>BoxliteRuntime, RuntimeImpl"] + LITEBOX["LiteBox Layer<br/>BoxImpl, Execution"] + DB_W["SQLite DB"] + IMG_W["Image Manager<br/>(OCI pull/cache)"] + PROTO["gRPC Proto<br/>(protobuf definitions)"] + end + + subgraph "Moderate Effort" + style Moderate fill:#FFD700 + DISK_W["Disk Manager<br/>Need ext4 tools on Win"] + NET_W["Network Backend<br/>Named Pipes or AF_HYPERV"] + PORTAL_W["Portal Transport<br/>Hyper-V sockets"] + PROCESS_W["ProcessMonitor<br/>WaitForSingleObject"] + FD_W["FD Cleanup<br/>NtQueryInformationProcess"] + end + + subgraph "Major Effort (Blockers)" + style Major fill:#FF6347 + VMM_W["VMM Engine<br/>Replace libkrun entirely"] + JAIL_W["Jailer / Sandbox<br/>Job Objects + AppContainer"] + GUEST_W["Guest Agent<br/>Linux-only (runs in VM)"] + SHIM_W["Shim Process<br/>No process takeover on Win"] + end +``` + +### 9.2 Platform Abstraction Strategy + +```mermaid +graph TB + TRAIT["Platform Trait<br/>(new abstraction)"] --> LINUX_IMPL["LinuxPlatform"] + TRAIT --> MACOS_IMPL["MacOSPlatform"] + TRAIT --> WIN_IMPL["WindowsPlatform"] + + LINUX_IMPL --> L_VMM["libkrun (KVM)"] + LINUX_IMPL --> L_JAIL["bubblewrap + seccomp"] + LINUX_IMPL --> L_NET["gvproxy (UnixStream)"] + LINUX_IMPL --> L_MON["pidfd_open()"] + LINUX_IMPL --> L_TRANS["vsock (AF_VSOCK)"] + + MACOS_IMPL --> M_VMM["libkrun (Hvf)"] + MACOS_IMPL --> M_JAIL["sandbox-exec (Seatbelt)"] + MACOS_IMPL --> M_NET["gvproxy (UnixDgram)"] + MACOS_IMPL --> M_MON["kqueue + EVFILT_PROC"] + MACOS_IMPL --> M_TRANS["vsock (via libkrun)"] + + WIN_IMPL --> W_VMM["Cloud Hypervisor<br/>(MSHV backend)"] + WIN_IMPL --> W_JAIL["Job Objects +<br/>AppContainer +<br/>Restricted Tokens"] + WIN_IMPL --> W_NET["gvproxy<br/>(Named Pipes)"] + WIN_IMPL --> W_MON["WaitForSingleObject<br/>(HANDLE)"] + WIN_IMPL --> W_TRANS["Hyper-V sockets
(AF_HYPERV)"] +``` + +### 9.3 Recommended Windows Porting Phases + +| Phase | Component | Effort | Description | +|-------|-----------|--------|-------------| +| **Phase 0** | `SystemCheck` | Small | Add `target_os = "windows"` check for WHPX/Hyper-V | +| **Phase 1** | `ProcessMonitor` | Small | `WaitForSingleObject` implementation | +| **Phase 1** | `FD Cleanup` | Small | Replace with `NtQueryInformationProcess` or skip | +| **Phase 2** | `Transport` | Medium | Add `HyperVSocket { vm_id, service_id }` variant | +| **Phase 2** | `Portal` | Medium | Hyper-V socket gRPC transport | +| **Phase 3** | `VMM Engine` | **Large** | Cloud Hypervisor with MSHV backend (new engine impl) | +| **Phase 3** | `Shim` | Large | No process takeover — use subprocess model instead | +| **Phase 4** | `Jailer` | Medium | `WindowsSandbox` impl: Job Objects + AppContainer | +| **Phase 5** | `Network` | Medium | gvproxy on Windows (Named Pipes + Go cross-compile) | +| **Phase 6** | `Disk` | Medium | ext4 tools for Windows or alternative format | + +--- + +## 10. Initialization Pipeline + +BoxBuilder uses a **staged execution pipeline** (`src/boxlite/src/litebox/init/`) with parallel and sequential phases, adapting to the box's current status. + +### 10.1 First Start (Configured) + +```mermaid +graph LR + subgraph "Stage 1 (Sequential)" + FS["FilesystemTask
Create box directory structure"] + end + + subgraph "Stage 2 (Parallel)" + CR["ContainerRootfsTask<br/>Pull OCI image → COW disk"] + GR["GuestRootfsTask<br/>Prepare guest rootfs → COW disk"] + end + + subgraph "Stage 3 (Sequential)" + VMM["VmmSpawnTask<br/>Build InstanceSpec → spawn shim"] + end + + subgraph "Stage 4 (Sequential)" + GC["GuestConnectTask<br/>Wait ready signal → GuestSession"] + end + + subgraph "Stage 5 (Sequential)" + GI["GuestInitTask<br/>Guest.Init() → Container.Init()"] + end + + FS --> CR + FS --> GR + CR --> VMM + GR --> VMM + VMM --> GC + GC --> GI +``` + +### 10.2 Restart (Stopped) + +Same pipeline, but: +- **ContainerRootfsTask**: Reuses existing COW disk (preserves user modifications) +- **GuestRootfsTask**: Reuses existing COW disk +- **VmmSpawnTask**: Spawns **new** VM process +- **GuestInitTask**: Must run (new VM has fresh guest daemon) + +### 10.3 Reattach (Running) + +```mermaid +graph LR + ATT["VmmAttachTask<br/>Attach to existing PID"] --> GC["GuestConnectTask
Reconnect to gRPC server"] +``` + +### 10.4 RAII Cleanup Guarantees + +- **CleanupGuard** in BoxBuilder: Kills VM + removes directory on pipeline failure +- **Disk** RAII: Deletes file on drop (unless `persistent=true`) +- **BindMountHandle** RAII (Linux): Unmounts on drop +- **LockGuard**: Releases filesystem lock on drop + +--- + +## 11. Snapshot & Clone Architecture + +### 11.1 Snapshot Flow (Quiesce + Fork) + +```mermaid +sequenceDiagram + participant User as User Code + participant LB as LiteBox + participant SH as SnapshotHandle + participant BI as BoxImpl + participant GUEST as Guest Agent + participant DISK as Disk Manager + + User->>LB: box.snapshots().create("snap1") + LB->>SH: SnapshotHandle::create() + SH->>BI: with_quiesce_async() + + Note over BI,GUEST: Quiesce Phase + BI->>GUEST: guest.quiesce() [FIFREEZE ioctl] + GUEST-->>BI: Filesystems frozen + + BI->>BI: SIGSTOP shim process + + Note over BI,DISK: Fork Phase + BI->>DISK: fork_qcow2(disk.qcow2, bases/snap1/disk.qcow2) + DISK->>DISK: 1. Read virtual size + DISK->>DISK: 2. Rename disk.qcow2 → bases/snap1/disk.qcow2 + DISK->>DISK: 3. Create COW child at disk.qcow2 + DISK-->>BI: Immutable base + live overlay + + Note over BI,GUEST: Thaw Phase + BI->>BI: SIGCONT shim process + BI->>GUEST: guest.thaw() [FITHAW ioctl] + GUEST-->>BI: Filesystems unfrozen + + BI-->>SH: SnapshotInfo + SH-->>User: Snapshot created +``` + +### 11.2 Clone Flow (Thin Overlay) + +```mermaid +graph TB + subgraph "Source Box" + SNAP["bases/snap1/disk.qcow2
(immutable base)"] + LIVE["boxes/src/disk.qcow2<br/>(COW overlay, ~64KB)"] + end + + subgraph "Clone 1" + C1["boxes/clone1/disk.qcow2<br/>(COW overlay, ~64KB)"] + end + + subgraph "Clone 2" + C2["boxes/clone2/disk.qcow2<br/>(COW overlay, ~64KB)"] + end + + subgraph "Clone 3" + C3["boxes/clone3/disk.qcow2
(COW overlay, ~64KB)"] + end + + LIVE -->|"backing_file"| SNAP + C1 -->|"backing_file"| SNAP + C2 -->|"backing_file"| SNAP + C3 -->|"backing_file"| SNAP +``` + +**Batch clone** (`clone_boxes`): Source disks copied once into shared base, then each clone gets a thin qcow2 overlay (~64KB) — O(1) per clone instead of O(disk_size). + +--- + +## 12. Key Design Decisions + +### 12.1 Why Shim Process? + +```mermaid +graph LR + subgraph "Without Shim (Dangerous)" + HOST1["Host Process"] -->|"krun_start_enter()"| TAKEOVER["Process Takeover
Host process GONE"] + end + + subgraph "With Shim (Current Design)" + HOST2["Host Process"] -->|"spawn()"| SHIM2["boxlite-shim"] + SHIM2 -->|"krun_start_enter()"| VM2["VM Running
Host survives"] + end +``` + +`libkrun`'s `krun_start_enter()` **takes over the calling process** — it never returns. The shim subprocess isolates this behavior, letting the host application continue running and manage multiple VMs concurrently. + +### 12.2 Why vsock? + +- No host network configuration needed +- Works inside hardware-isolated VM +- Faster than TCP (no network stack overhead) +- Secure by design (no network exposure) +- Standard Linux/macOS kernel support + +### 12.3 Why gvproxy (not TUN/TAP)? + +- **No root required** — userspace TCP/IP stack +- **No TUN/TAP device** — works in unprivileged containers +- **Built-in features** — DNS sinkhole, MITM proxy, port mapping +- **Cross-platform** — Go binary works on Linux and macOS + +### 12.4 Trait-Based Extensibility + +``` +RuntimeBackend (trait) +├── LocalRuntime → VM-backed boxes +└── RestRuntime → HTTP-backed boxes (distributed) + +BoxBackend (trait) +├── BoxImpl → Local VM lifecycle +└── RestBox → Remote box via REST API + +Sandbox (trait) +├── BwrapSandbox → Linux namespaces +├── SeatbeltSandbox → macOS Seatbelt +├── CompositeSandbox → Bwrap + Landlock +├── NoopSandbox → Disabled +└── WindowsSandbox → TODO: Job Objects + AppContainer + +NetworkBackend (trait) +├── GvisorTapBackend → gvproxy (primary) +└── LibslirpBackend → libslirp (fallback) + +VmmController (trait) +├── ShimController → Subprocess shim +└── (future) → Direct VM management + +Vmm (trait) +├── Krun → libkrun engine +└── (future) → Cloud Hypervisor, Firecracker +``` + +The trait-based architecture is well-suited for Windows porting — new platform implementations can be added behind the existing trait boundaries without modifying the upper layers. 
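
As a rough illustration of how a new platform could slot in behind the `Sandbox` trait listed above, the sketch below uses a simplified, hypothetical trait. The method names (`name`, `wrap`), the `select_sandbox` and `launch_argv` helpers, and the argument shapes are assumptions made for this example; they are not BoxLite's actual API, and the macOS Seatbelt implementation is left out for brevity.

```rust
/// Hypothetical, simplified stand-in for the `Sandbox` trait named above.
/// The real trait's methods are not shown in this document.
pub trait Sandbox {
    /// Short identifier, e.g. for logging.
    fn name(&self) -> &'static str;
    /// Wrap the shim command line with sandbox-specific arguments.
    fn wrap(&self, shim: &[String]) -> Vec<String>;
}

pub struct BwrapSandbox;
pub struct NoopSandbox;
/// Placeholder for the `WindowsSandbox` marked TODO in the list above.
pub struct WindowsSandbox;

impl Sandbox for BwrapSandbox {
    fn name(&self) -> &'static str {
        "bwrap"
    }
    fn wrap(&self, shim: &[String]) -> Vec<String> {
        // Prefix the shim command with a small subset of bwrap flags.
        let mut argv: Vec<String> = ["bwrap", "--unshare-pid", "--die-with-parent", "--"]
            .iter()
            .map(|s| s.to_string())
            .collect();
        argv.extend_from_slice(shim);
        argv
    }
}

impl Sandbox for NoopSandbox {
    fn name(&self) -> &'static str {
        "noop"
    }
    fn wrap(&self, shim: &[String]) -> Vec<String> {
        // Sandboxing disabled: run the shim unchanged.
        shim.to_vec()
    }
}

impl Sandbox for WindowsSandbox {
    fn name(&self) -> &'static str {
        "windows"
    }
    fn wrap(&self, shim: &[String]) -> Vec<String> {
        // Assumption: Job Objects / AppContainer would be applied through
        // process-creation APIs rather than argv, so the command is unchanged.
        shim.to_vec()
    }
}

/// Pick an implementation for a target OS name (hypothetical helper).
pub fn select_sandbox(target_os: &str) -> Box<dyn Sandbox> {
    match target_os {
        "linux" => Box::new(BwrapSandbox),
        "windows" => Box::new(WindowsSandbox),
        _ => Box::new(NoopSandbox),
    }
}

/// Upper-layer code depends only on the trait, so it needs no changes
/// when a new platform implementation is added.
pub fn launch_argv(sandbox: &dyn Sandbox, shim: &[String]) -> Vec<String> {
    sandbox.wrap(shim)
}
```

A caller would typically pass `std::env::consts::OS` to `select_sandbox` and hand the result to `launch_argv`; adding a platform then means one new `impl Sandbox` plus one new match arm.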
diff --git a/docs/boxlite-deps.md b/docs/boxlite-deps.md new file mode 100644 index 000000000..c7fb915ee --- /dev/null +++ b/docs/boxlite-deps.md @@ -0,0 +1,1472 @@ +# BoxLite Dependencies: Four Core Native Crates + +BoxLite depends on four native build-wrapper crates (`*-sys`), each wrapping one upstream project; together they provide virtualization, networking, disk, and sandboxing capabilities. + +--- + +## Contents + +- [1. bubblewrap / bubblewrap-sys](#1-bubblewrap--bubblewrap-sys) +- [2. e2fsprogs / e2fsprogs-sys](#2-e2fsprogs--e2fsprogs-sys) +- [3. gvisor-tap-vsock / libgvproxy / libgvproxy-sys](#3-gvisor-tap-vsock--libgvproxy--libgvproxy-sys) +- [4. libkrun / libkrun-sys](#4-libkrun--libkrun-sys) +- [5. Landlock Filesystem ACL](#5-landlock-文件系统-acl) +- [6. Seatbelt and Seccomp: Process-Level Security Sandboxes](#6-seatbelt-与-seccomp--进程级安全沙箱) +- [Call Architecture Diagram](#调用架构图) +- [Summary Comparison](#总结对比) +- [Appendix](#附录) + - [A. What "Shim" Means](#a-shim-的含义) + - [B. How Rust Crates Map to Python/Java](#b-rust-crate-与-pythonjava-的对应关系) + - [C. The Rust `-sys` Crate Convention](#c-rust--sys-crate-惯例) + +--- + +## 1. bubblewrap / bubblewrap-sys + +**Upstream project**: [containers/bubblewrap](https://github.com/containers/bubblewrap) is a lightweight unprivileged sandboxing tool that uses Linux namespaces for process isolation. Flatpak and GNOME also use it. + +**Core capability**: Creates mount/pid/ipc/uts namespace isolation without requiring root. + +| Item | Description | +|------|------| +| **bubblewrap** | The `bwrap` binary, written in C; the isolation policy is declared through command-line arguments | +| **bubblewrap-sys** | Rust build wrapper that compiles the bwrap binary from vendored C sources | + +**Integration method**: Subprocess execution (not FFI linking) + +### Build Process + +`src/deps/bubblewrap-sys/build.rs`: + +``` +Meson setup → Ninja build → outputs the bwrap binary +cargo:bwrap_BOXLITE_DEP={path} # exports the path to boxlite +``` + +The build configuration disables SELinux, man pages, tests, shell completions, and other unneeded features to minimize dependencies. + +### Usage in BoxLite + +**Key files**: `src/boxlite/src/jailer/` + +- `bwrap.rs`: `BwrapCommand` builder that assembles the bwrap command-line arguments +- `sandbox/bwrap.rs`: implements the `Sandbox` trait and is combined with Landlock +- `apparmor.rs`: handles the AppArmor restrictions in Ubuntu 23.10+ + +**Features**: namespace isolation, read-only bind mounts, environment variable sanitization, seccomp filter injection + +**Linux only**; macOS uses sandbox-exec and Windows uses Job Objects. + +### Isolation Policy Example + +```bash +bwrap --unshare-user --unshare-pid --unshare-ipc --unshare-uts \ + 
--die-with-parent --new-session \ + --ro-bind /usr /usr --ro-bind /lib /lib \ + --dev /dev --dev-bind /dev/kvm /dev/kvm \ + --bind ~/.boxlite ~/.boxlite \ + --tmpfs /tmp --clearenv \ + --setenv PATH /usr/bin:/usr/sbin \ + -- boxlite-shim +``` + +### Sandbox Composition + +Linux uses layered isolation, BwrapSandbox + LandlockSandbox: + +```rust +// jailer/sandbox/composite.rs +pub fn platform_new() -> Self { + Self::new(vec![ + Box::new(super::BwrapSandbox::new()), // Namespace isolation + Box::new(super::LandlockSandbox::new()), // Filesystem ACL + ]) +} +``` + +### bwrap Discovery Priority + +1. System bwrap: the `bwrap` found in `PATH` (lets users override the bundled one) +2. Bundled bwrap: the binary compiled by bubblewrap-sys + +### Isolation Layers Provided by bwrap (Defense in Depth) + +**Inside bwrap**: +- Namespace isolation (mount, pid, ipc, uts) +- Filesystem isolation (pivot_root / chroot) +- Environment variable sanitization (--clearenv) +- Seccomp filter injection (BPF from fd) +- PR_SET_NO_NEW_PRIVS (disables setuid) +- Die-with-parent behavior + +**Added by BoxLite**: +- Cgroups v2 (resource limits) +- Seccomp BPF filter generation +- FD cleanup +- rlimits +- Landlock filesystem ACL (Linux 5.13+) + +--- + +## 2. e2fsprogs / e2fsprogs-sys + +**Upstream project**: [tytso/e2fsprogs](https://github.com/tytso/e2fsprogs) is the Linux ext2/ext3/ext4 filesystem toolset, maintained by Theodore Ts'o. + +**Core capability**: Creates and manipulates ext4 filesystem images. + +| Item | Description | +|------|------| +| **e2fsprogs** | C toolset: `mke2fs` (creates ext4), `debugfs` (modifies files inside an ext4 image) | +| **e2fsprogs-sys** | Rust build wrapper that compiles the mke2fs and debugfs binaries from vendored sources | + +**Integration method**: Subprocess execution (not FFI linking) + +### Build Process + +`src/deps/e2fsprogs-sys/build.rs`: + +``` +./configure --disable-nls --disable-threads ... 
→ make libs → make mke2fs debugfs +cargo:mke2fs_BOXLITE_DEP={path} +cargo:debugfs_BOXLITE_DEP={path} +``` + +The nls, threads, tdb, imager, resizer, defrag, fsck, and e2initrd-helper modules (among others) are disabled; only mke2fs and debugfs are built. + +### Usage in BoxLite + +**Key file**: `src/boxlite/src/disk/ext4.rs` + +| Function | Purpose | +|------|------| +| `create_ext4_from_dir()` | Creates an ext4 disk image after the OCI image layers are merged (`mke2fs -d`) | +| `fix_ownership_with_debugfs()` | Resets ownership of all files to root:root (`debugfs sif`) | +| `inject_file_into_ext4()` | Injects the boxlite-guest binary into the guest rootfs (`debugfs write`) | + +### Disk Size Calculation + +``` +file contents (aligned to 4KB blocks) + inode space (256B per file) + 10% metadata overhead + 64MB journal +minimum 256MB +``` + +Related constants (`disk/ext4.rs`): + +```rust +BLOCK_SIZE = 4096 +INODE_SIZE = 256 +SIZE_MULTIPLIER_NUM/DEN = 11/10 // 1.1x = 10% overhead +JOURNAL_OVERHEAD_BYTES = 64MB +MIN_DISK_SIZE_BYTES = 256MB +``` + +### Key Usage Scenarios + +1. **Container rootfs**: OCI image layers → merged directory → `mke2fs -d` → container.ext4 +2. **Guest rootfs**: bootstrap image → ext4 → `debugfs write` injects boxlite-guest → guest-rootfs.ext4 +3. **Windows-specific handling**: uses debugfs to fix Unicode filenames, create symlinks, and restore permissions + +### mke2fs Command Arguments + +```bash +mke2fs -t ext4 -b 4096 -d -m 0 -E root_owner=0:0 -F -q +``` + +| Argument | Meaning | +|------|------| +| `-t ext4` | ext4 filesystem type | +| `-b 4096` | 4KB block size | +| `-d ` | Populate the filesystem contents from a directory | +| `-m 0` | Reserve no blocks (not needed for containers) | +| `-E root_owner=0:0` | Set ownership of the root inode to root | +| `-F` | Force execution | + +### debugfs Operations + +```bash +# Fix ownership +debugfs -w -f