spike(wasm): edge-runtime portability — Cloudflare Workers / workerd / Deno #137

Draft
linyiru wants to merge 5 commits into master from spike/cf-worker-poc

linyiru (Owner) commented May 26, 2026

What this is

A spike, not a feature PR. Do not merge. Opening as a draft so the empirical findings + scaffolding are discoverable from the repo, not just sitting on a local branch.

The spike validates one claim and one disclaimer:

  • Claim: a single Wizer-pre-initialised rubyrs_worker.wasm (1.68 MB) runs unchanged across three V8-based edge runtimes — Cloudflare Workers (managed), workerd (self-host), Deno (self-host) — plus wasmtime as a non-V8 baseline.
  • Disclaimer: 18 ms cold-start (workerd local) is the V8-wasm floor for this build shape, confirmed by two negative experiments documented in docs/BENCHMARKS.md. To go lower would require either a function-count refactor inside rubyrs, a component-model + AOT move (wasmtime serve), or CF exposing a generic snapshot API.

Files

poc/cf-worker/
├── src/worker.js                  CF Workers / workerd entry (workers-wasi shim)
├── src/rubyrs_worker.wasm         build artifact (gitignored)
├── build.sh                       cargo → optional wasm-opt → wizer → src/
├── wrangler.toml                  CF Workers deploy config
├── package.json                   workerd + wrangler + workers-wasi + esbuild
├── workerd/
│   ├── config.capnp               self-host workerd config
│   ├── bundle.sh                  esbuild bundler for worker.mjs
│   └── bench-js/                  JS-only worker for baseline cold-start
├── deno/
│   ├── server.ts                  Deno.serve + browser_wasi_shim
│   └── deno.json                  scoped to keep parent npm tree out
└── README.md                      run instructions per target

crates/rubyrs/src/bin/wasm_worker.rs  stdin → Runtime::eval → stdout
docs/BENCHMARKS.md                    +"Edge runtimes" section, +cold-start floor notes

Headline numbers (n=5 each, Apple M-series)

puts 1+1 workload, wizer-only wasm (no wasm-opt — counter-intuitive finding):

| Runtime                      | Engine   | Self-host?   | Cold-start                     | Warm tiny                       |
| Deno 2.8 + browser_wasi_shim | V8       | yes          | 25 ms                          | 1.5 ms                          |
| workerd + workers-wasi       | V8       | yes          | 18 ms                          | 2.5 ms                          |
| CF Workers edge              | V8       | no (managed) | ~149 ms wall / ~80 ms cpu cold | 7 ms cpu warm (see edge bucket) |
| wasmtime 45 (CLI, no HTTP)   | wasmtime | yes          | 12.7 ms raw / ~7 ms AOT        | n/a                             |

Surprises worth flagging

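  1. wasm-opt is net-negative on V8 cold-start once Wizer is in the pipeline, at every level tested (-Oz, -O2): 27 ms and 23 ms versus 18 ms for Wizer alone. Without Wizer, -Oz shrank the wasm by 21 % but cut cold-start by only 7 % (57 → 53 ms). Smaller wasm barely helps; V8's wasm parser appears to bottleneck on IR construction, not byte count. Build default is now WASM_OPT_LEVEL=skip.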
  2. Wizer pre-init is the big win. −69 % cold-start on workerd local (57 → 18 ms). Same trick CF uses internally for Python Workers via make_snapshots.py.
  3. CF edge variance is pool-warming, not build choice. First-pass conclusion that "opt+wizer regresses edge perf" was debunked once requests were bucketed by per-isolate invocation count (see x-rubyrs-invocation header in src/worker.js).
  4. Deno beats workerd on warm by ~40 % despite trailing on cold. Plausible reasons: browser_wasi_shim's pure-JS stdin/stdout path avoids workers-wasi's bundled memfs.wasm, and Deno.serve's hyper-based Rust HTTP server cuts out workerd's JS-shim ↔ kj layer.
  5. @cloudflare/workers-wasi cannot be used on Deno (internal memfs.wasm confuses Deno's loader), and jsr:@std/wasi does not exist: std/wasi was removed in Nov 2023 and never published to JSR. The Deno path uses @bjorn3/browser_wasi_shim.
  6. A bug surfaced and was fixed during the spike: SyntaxError messages were leaking ruby_prism::Diagnostic Debug format (raw pointers + PhantomData markers). Shipped separately as fix(vm): SyntaxError message no longer leaks Prism Diagnostic Debug format #130.

What this is NOT

  • Not a proposal to ship anything inside the rubyrs crate. The only main-crate change is src/bin/wasm_worker.rs — a 60-line stdin → eval → stdout bin that's incidental to the PoC and self-contained.
  • Not a recommendation for production deployment on CF Workers Free. The 10 ms CPU cap is below our cold-start floor.
  • Not a Spin / wasmtime-serve / Fastly target — those require component-model adoption, which is a separable rubyrs-internal decision (likely tied to a future wasi-preview-2 migration).
  • Not benchmarked against AWS Lambda / Vercel / Cloud Run. The spike's thesis is portability across V8-based runtimes, not absolute fastest hosting.

How to reproduce

brew install deno binaryen
cargo install wizer-cli
rustup target add wasm32-wasip1
# Set WASI_SDK_PATH per docs/DEVELOPMENT.md.

cd poc/cf-worker
npm install
./build.sh                                # → src/rubyrs_worker.wasm (Wizer'd)

# CF Workers managed:
npx wrangler dev                          # local V8 (workerd via Miniflare)
# npx wrangler deploy                     # real edge (needs your CF account)

# workerd self-host:
./workerd/bundle.sh
npx workerd serve workerd/config.capnp    # listens on :8080

# Deno self-host:
cd deno && deno run --allow-net --allow-read server.ts  # listens on :8000

# Smoke against any of them:
curl -X POST --data-binary 'puts (1..5).sum' http://localhost:PORT

Branch retention

Keep this branch + draft PR open as a stable reference for the runtime-portability claim. If a future rubyrs feature needs to revisit this work (e.g. component model migration, lazy-load preamble follow-up), this is the starting point + measurement harness.

linyiru added 5 commits May 26, 2026 16:16

Minimum-viable end-to-end:

  HTTP POST body (Ruby source)
    → worker.js  pipes body as stdin via @cloudflare/workers-wasi
    → wasm_worker.wasm  reads stdin → Runtime::eval → stdout
    → HTTP 200  body=Ruby script's stdout
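
The stdin → eval → stdout flow above rests on two WASI Preview 1 imports, `fd_read` and `fd_write`. The sketch below is a hand-rolled illustration of that contract only; it is not the code of `@cloudflare/workers-wasi`, and `makeStdio` and its method names are hypothetical. It assumes the request body has already been read into a `Uint8Array`.

```javascript
// Illustrative subset of WASI Preview 1 stdio. Not taken from any shim library.
function makeStdio(memory, stdinBytes) {
  const stdoutChunks = [];
  let stdinPos = 0;

  // Each iovec is two little-endian u32s: buffer pointer, then buffer length.
  function eachIovec(iovsPtr, iovsLen, visit) {
    const view = new DataView(memory.buffer);
    for (let i = 0; i < iovsLen; i++) {
      const ptr = view.getUint32(iovsPtr + i * 8, true);
      const len = view.getUint32(iovsPtr + i * 8 + 4, true);
      visit(new Uint8Array(memory.buffer, ptr, len));
    }
  }

  return {
    // Guest reads stdin: copy request-body bytes into guest memory.
    fd_read(fd, iovsPtr, iovsLen, nreadPtr) {
      if (fd !== 0) return 8; // EBADF
      let total = 0;
      eachIovec(iovsPtr, iovsLen, (buf) => {
        const chunk = stdinBytes.subarray(stdinPos, stdinPos + buf.length);
        buf.set(chunk);
        stdinPos += chunk.length;
        total += chunk.length;
      });
      new DataView(memory.buffer).setUint32(nreadPtr, total, true);
      return 0; // success; a total of 0 signals end of input
    },

    // Guest writes stdout: copy bytes out of guest memory for the response.
    fd_write(fd, iovsPtr, iovsLen, nwrittenPtr) {
      if (fd !== 1) return 8; // EBADF
      let total = 0;
      eachIovec(iovsPtr, iovsLen, (buf) => {
        stdoutChunks.push(buf.slice());
        total += buf.length;
      });
      new DataView(memory.buffer).setUint32(nwrittenPtr, total, true);
      return 0;
    },

    // Join all captured chunks before decoding so multi-byte UTF-8 stays intact.
    stdoutText() {
      const size = stdoutChunks.reduce((n, c) => n + c.length, 0);
      const all = new Uint8Array(size);
      let at = 0;
      for (const c of stdoutChunks) {
        all.set(c, at);
        at += c.length;
      }
      return new TextDecoder().decode(all);
    },
  };
}
```

A real shim would pass functions like these under the `wasi_snapshot_preview1` import namespace and return `stdoutText()` as the HTTP response body.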

Local `wrangler dev` (Miniflare v3 → workerd, same runtime as
production) round-trips smoke.rb in ~12 ms wall and a single-
statement `puts (1..5).sum` cold request in ~38 ms. CPU time
specifically (which is what Workers Free's 10 ms cap measures)
needs a real edge deploy to know; Miniflare doesn't enforce
those caps locally.

Design choices, locked in the README:

- New `wasm_worker` bin reads Ruby from stdin rather than argv,
  because `@cloudflare/workers-wasi` has no public API to write
  files into its in-isolate littlefs before `wasi.start()`.
  Stdin is the documented input channel for command-shape wasm.
- No `[[rules]] CompiledWasm` in `wrangler.toml` — wrangler v3
  ships a default rule for `**/*.wasm` with `fallthrough = false`,
  so a re-declaration fails the build.
- `build.sh` prepends `~/.cargo/bin` to PATH to dodge Homebrew's
  rust (lacking the `wasm32-wasip1` rust-std component) shadowing
  rustup's. Same WASI_SDK_PATH requirement as `tests/wasm/smoke.sh`.
- wizer is best-effort — skipped when the bin lacks the
  `wizer.initialize` export (`wasm_worker` currently doesn't carry
  it, since the export only lives on the `rubyrs` CLI bin). A
  follow-up could add the same pattern to `wasm_worker` to skip
  the ~3-6 ms classes-and-preamble cold-start on each isolate.

Knobs / follow-ups documented in README: streaming response,
RUBYRS_DEADLINE_MS env wiring, real-edge cold-start measurement,
static-script (embedded `include_str!`) mode.

This is a spike, not a finished feature — the
`poc/cf-worker/` directory is meant to stay outside the workspace
build graph until/unless a real product target emerges.

Adds a self-host deployment path that complements the existing
wrangler/CF-edge target — runs the exact same rubyrs.wasm under
`workerd serve config.capnp` with no Cloudflare account, no CPU
caps, and no isolate-eviction noise. workerd is CF's own runtime
binary (Apache 2.0) so the wasm + JS surface is identical to the
managed edge, just hosted ourselves.

Measurement findings driving the build.sh changes:

- workerd local cold-start, n=5 each (`puts 1+1`, restart between):
    baseline raw (1.54 MB) ............ 57 ms median
    wasm-opt -Oz only (1.22 MB) ....... 53 ms  (size −21%, time −7%)
    wasm-opt -Oz + wizer (1.37 MB) .... 27 ms
    wasm-opt -O2 + wizer (1.42 MB) .... 23 ms
    wizer only (1.68 MB) .............. 18 ms  ← BEST
- wasm-opt is net-negative on V8 cold-start at every optimisation
  level tested once wizer is applied (27 / 23 ms vs 18 ms), and on
  its own it buys only 4 ms (57 → 53). Smaller wasm doesn't
  translate into faster instantiate; the V8 wasm parser appears to
  be bottlenecked on something other than raw byte count (likely IR
  construction / module setup).
- Default `WASM_OPT_LEVEL=skip` in build.sh; opt-in by env when
  benchmarking other levels.
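
The percentages in the ablation rows above are plain relative changes. The sketch below recomputes the -Oz row and picks the fastest build from the rounded medians as listed; the helper names `pctChange` and `fastest` are hypothetical and not part of the benchmark harness.

```javascript
// Medians and sizes copied from the workerd local cold-start list above.
const builds = [
  { name: "baseline raw", sizeMb: 1.54, coldMs: 57 },
  { name: "wasm-opt -Oz only", sizeMb: 1.22, coldMs: 53 },
  { name: "wasm-opt -Oz + wizer", sizeMb: 1.37, coldMs: 27 },
  { name: "wasm-opt -O2 + wizer", sizeMb: 1.42, coldMs: 23 },
  { name: "wizer only", sizeMb: 1.68, coldMs: 18 },
];

// Relative change from `from` to `to`, in percent (negative means smaller/faster).
function pctChange(from, to) {
  return ((to - from) / from) * 100;
}

// Build with the lowest cold-start median.
function fastest(list) {
  return list.reduce((best, b) => (b.coldMs < best.coldMs ? b : best));
}

const [raw, ozOnly] = builds;
console.log(Math.round(pctChange(raw.sizeMb, ozOnly.sizeMb))); // size delta, -Oz only
console.log(Math.round(pctChange(raw.coldMs, ozOnly.coldMs))); // time delta, -Oz only
console.log(fastest(builds).name);
```

Comparing each wizer build against "wizer only" the same way shows the cost of adding wasm-opt: `pctChange(18, 27)` is +50 % for -Oz and `pctChange(18, 23)` is roughly +28 % for -O2.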

CF edge cold/warm bucketing (new `x-rubyrs-invocation` header
in worker.js lets the harness partition requests by per-isolate
hit count; tail captures cpuTime). 60-request bursts:
    baseline raw   warm n=51  cpu median  10 ms  (p10 7, p90 16, max 86)
    wizer only     warm n=58  cpu median   7 ms  (p10 6, p90 12, max 13)
The wizer win on edge is smaller than local (−30% vs −69%) but
the cpu-max collapse (86 → 13 ms) is the load-bearing improvement:
it removes the 86 ms outlier, the kind of spike that pushes
individual requests far over the Free 10 ms CPU cap. The wizer
p90 (12 ms) and max (13 ms) still sit above that cap.

Earlier "edge appears to regress with opt+wizer" reading was
debunked once cold/warm were bucketed properly — that was post-deploy
pool warming, not a wasm-choice regression. The Pyodide-on-Workers
internal docs confirm CF's published mean cold-start is a blended
pool-hit + pool-miss number with the same noise profile.
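
A minimal sketch of the cold/warm bucketing described above, assuming each sample pairs the `x-rubyrs-invocation` value with the cpuTime captured for that request. The `{ invocation, cpuMs }` record shape and the function names are hypothetical, and the percentile uses the nearest-rank method, which may differ from what the actual harness does.

```javascript
// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Summary statistics for one bucket of cpu times (milliseconds).
function summarize(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return {
    n: sorted.length,
    p10: percentile(sorted, 10),
    median: percentile(sorted, 50),
    p90: percentile(sorted, 90),
    max: sorted.length ? sorted[sorted.length - 1] : NaN,
  };
}

// invocation === 1 means the first request seen by an isolate (cold);
// anything higher hit an already-initialised isolate (warm).
function bucketByInvocation(samples) {
  const cold = [];
  const warm = [];
  for (const s of samples) {
    (s.invocation === 1 ? cold : warm).push(s.cpuMs);
  }
  return { cold: summarize(cold), warm: summarize(warm) };
}
```

Without the split, a handful of cold hits in a 60-request burst inflate the blended max and mean, which is the kind of misreading the paragraph above describes.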

build.sh changes:
- WASM_OPT_LEVEL env knob (skip | -O2 | -O3 | -Oz)
- wasm-opt → wizer ordering (wizer snapshots memory after wasm-opt
  has fixed the module's shape; the reversed order would invalidate
  the snapshot's function-index refs)
- `grep -q` over a tempfile not a pipe — `set -o pipefail` was turning
  successful "wizer.initialize present" detection into pipe-failure
  because grep -q closes its stdin early and SIGPIPEs upstream
- `--wasm-bulk-memory true` (wizer's flag wants an explicit bool)

workerd self-host:
- workerd/config.capnp: HTTP socket on :8080, modules list aliases
  worker.js's `./rubyrs_worker.wasm` import to the build output
- workerd/bundle.sh: esbuild bundles src/worker.js + workers-wasi
  into a single .mjs (workerd's capnp modules need explicit deps;
  `--external:*.wasm` keeps wasm imports out of the bundle so
  capnp's `wasm = embed` resolves them)
- workerd/bench-js: minimal JS-only worker for cold-start baselining
  (workerd's own boot cost is <1 ms; the rest of our 18 ms is wasm
  + Ruby setup)

src/worker.js:
- Co-located wasm import (`./rubyrs_worker.wasm`) — works for both
  wrangler (default CompiledWasm rule) and workerd (workerd rejects
  `..` paths; bare specifiers don't resolve). Same artifact, two hosts.
- `x-rubyrs-invocation` header from a module-scope counter — lets
  the measurement harness bucket cold (invocation == 1) vs warm
  per V8 isolate.
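
A minimal sketch of the module-scope counter idea, assuming the module is evaluated once per V8 isolate so the counter restarts at zero in every fresh isolate. The names `invocationCount`, `nextInvocationHeaders`, and `respond` are hypothetical and not taken from src/worker.js.

```javascript
// Module scope: this variable lives as long as the isolate does.
let invocationCount = 0;

// Bump the counter and expose it as a response header value.
function nextInvocationHeaders() {
  invocationCount += 1;
  return { "x-rubyrs-invocation": String(invocationCount) };
}

// Wrap the Ruby script's stdout in an HTTP response tagged with the hit count.
function respond(stdoutText) {
  return new Response(stdoutText, {
    status: 200,
    headers: nextInvocationHeaders(),
  });
}
```

A harness reading the header can treat the value `1` as a cold request and any higher value as warm, without needing access to runtime internals.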

build artifact moved from `wasm/` to `src/` to satisfy workerd's
relative-import resolution; .gitignore updated.

Adds a Deno + browser_wasi_shim path that runs the EXACT SAME
`src/rubyrs_worker.wasm` bytes as the CF Workers / workerd
targets. Demonstrates the broader thesis of the spike — one wasm
artifact, multiple V8-host runtimes, no vendor lock-in. Deno
plays the same role for "self-host JS edge runtime" that workerd
plays for "self-host workers-compatible runtime"; Deno Deploy
is the managed counterpart, mirroring the workerd → CF Workers
duality.

Measurement on the wizer-only wasm (1.68 MB), n=5 each:

  | metric        | Deno     | workerd  | CF edge        |
  | cold-start    | 25 ms    | 18 ms    | ~149 ms wall   |
  | warm tiny     | 1.5 ms   | 2.5 ms   | 7 ms cpu       |
  | warm smoke.rb | 1.7 ms   | 4.0 ms   | 7 ms cpu       |
  | 1M iter each  | 124 ms   | 135 ms   | 173 ms cpu     |

Deno actually edges out workerd on warm (1.5 vs 2.5 ms tiny —
~40 % faster) despite being behind on cold (25 vs 18 ms). Two
plausible reasons: (1) browser_wasi_shim's stdin/stdout path is
a pure-JS callback fed by a single `Uint8Array` buffer, whereas
workers-wasi proxies through its own bundled `memfs.wasm`; (2)
`Deno.serve` is a hyper-based Rust HTTP server cutting out the
JS-shim ↔ kj layer that workerd's HTTP path goes through. Heavy
compute converges to within ~10 % because V8's wasm engine
dominates that regime.

Stack we landed on, after the dead-ends documented in the
server.ts header:

- @cloudflare/workers-wasi does NOT work in Deno: it loads its
  bundled memfs.wasm via `import "./memfs.wasm"`, which Deno's
  module loader eagerly walks looking for JS deps, hitting an
  unresolvable `wasi_snapshot_preview1` import.
- jsr:@std/wasi does NOT exist: Deno deprecated std/wasi in
  Oct 2023 (PR #3732) and removed it in Nov 2023 (PR #3733),
  before the JSR cutover, so it was never published to JSR.
- npm:@bjorn3/browser_wasi_shim — pure-JS Preview 1 shim with
  no internal wasm deps. File / OpenFile / ConsoleStdout fd
  adapters; buffer-shaped stdin from request body.

deno.json + lock are scoped to the deno/ subdirectory so the
parent package.json's npm tree (wrangler / workerd / esbuild
and their transitive vitest/rolldown/turbo footprints) does
NOT get pulled into Deno's node_modules. Running `deno run`
from deno/ keeps the dependency surface to just
browser_wasi_shim.

`x-rubyrs-invocation` header is reused identically across all
three host wrappers so the cold/warm bucketing harness works
unchanged regardless of which target is being measured.

New section sandwiched between the P2-A pivot (rubyrs.wasm vs
ruby.wasm size/speed thesis) and Throughput. Documents what the
spike/cf-worker-poc branch actually validated: same Wizer'd
`rubyrs_worker.wasm` (1.68 MB) running unchanged across three
V8-based edge runtimes (Deno self-host, workerd self-host, CF
Workers managed) plus wasmtime as the non-V8 baseline.

Two tables:

  1. Per-runtime cold-start / warm-tiny / warm-smoke / heavy.
     Deno wins warm (1.5 ms tiny — ~40% faster than workerd),
     workerd wins cold (18 ms vs Deno's 25 ms), CF edge cpu
     settles at 7 ms p50 warm but eats ~149 ms wall on cold-
     isolate hits. Each cell sourced from the n=5 PoC bench.

  2. wasm-opt vs Wizer build-pipeline ablation (workerd local
     cold-start). Counter-intuitive finding worth flagging in
     the public BENCHMARKS: every `wasm-opt` level we measured
     (-Oz, -O2) was net-negative on V8 cold-start; Wizer-only
     (1.68 MB, no opt) is the fastest at 18 ms vs 57 ms baseline
     (−69%). Smaller wasm doesn't translate into faster
     instantiate — V8's wasm parser appears to bottleneck on IR
     construction, not byte count.

Notes call out the methodology gotcha (CF edge variance is pool-
warming, not wasm-choice — bucket by `x-rubyrs-invocation`
header) and the Pyodide-on-Workers precedent (CF's
`make_snapshots.py` is the same trick we're applying via Wizer,
externally). Also notes wasmtime is HTTP-less by default so it
sits alongside the V8 hosts as a CLI-shape baseline rather than
in the apples-to-apples row.

Documents why the PoC stops optimising at 18 ms (workerd local
cold-start with Wizer-only build). Two independent angles were
attempted; both produced negative or null results, for the
same root cause:

1. Lazy-loading Tier 1 stdlib preambles (Random + SecureRandom):
   −40 KB wasm, n=5 cold-start MEDIAN 19.5 ms vs 18.2 ms baseline
   — variance, no improvement.

2. `opt-level = "z"` + LTO=fat + codegen-units=1: repo's own
   [profile.release-min] history note records 3–19 % SLOWER cold
   start despite 56 % smaller binary, measured across three hosts.

Both confirm: V8 wasm cold-start is dominated by the per-function
IR build (1798 functions in our wasm), NOT byte count. The 18 ms
floor is the V8 parser + module-instantiate fixed cost for this
build shape, and reducing further requires either a function-count
refactor inside rubyrs, a component-model + AOT move (wasmtime
serve), or CF exposing a generic snapshot API (currently
privileged-only for their bundled Python runtime).
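
One way to obtain a function count like the one quoted above is to read the function section of the wasm binary directly. The sketch below assumes the figure refers to functions defined in the module (imports excluded); `countDefinedFunctions` and `readU32Leb` are hypothetical helpers, not part of the rubyrs tooling.

```javascript
// Decode an unsigned LEB128 u32 starting at `offset`.
function readU32Leb(bytes, offset) {
  let result = 0;
  let shift = 0;
  let pos = offset;
  while (true) {
    const byte = bytes[pos++];
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) break;
    shift += 7;
  }
  return { value: result >>> 0, next: pos };
}

// Walk the top-level sections and return the entry count of the
// function section (id 3), or 0 when the module has none.
function countDefinedFunctions(bytes) {
  const magic = [0x00, 0x61, 0x73, 0x6d];
  if (!magic.every((b, i) => bytes[i] === b)) {
    throw new Error("not a wasm binary");
  }
  let pos = 8; // skip 4-byte magic and 4-byte version
  while (pos < bytes.length) {
    const id = bytes[pos++];
    const size = readU32Leb(bytes, pos);
    if (id === 3) {
      return readU32Leb(bytes, size.next).value;
    }
    pos = size.next + size.value; // jump over this section's payload
  }
  return 0;
}
```

In Deno or Node the input would typically come from reading `src/rubyrs_worker.wasm` into a `Uint8Array`; comparing the count before and after a refactor would show whether the function-count lever actually moved.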

Worth flagging publicly in BENCHMARKS because the natural
follow-up reading ("ok, why didn't you wasm-opt -Oz? Why didn't
you trim stdlib?") has an empirical answer documented inline.

@github-actions

gapscan PR diff

Both binaries produced identical histograms across the 10 canonical scan targets. (If the classifier changed for node classes that none of these targets exercise, this view won't catch it — the data-file diff would.)

See docs/gap-reports/ for the dataset and methodology.
