spike(wasm): edge-runtime portability — Cloudflare Workers / workerd / Deno#137
Draft
linyiru wants to merge 5 commits into
Minimum-viable end-to-end:
HTTP POST body (Ruby source)
→ worker.js pipes body as stdin via @cloudflare/workers-wasi
→ wasm_worker.wasm reads stdin → Runtime::eval → stdout
→ HTTP 200 body=Ruby script's stdout
Local `wrangler dev` (Miniflare v3 → workerd, same runtime as
production) round-trips smoke.rb in ~12 ms wall and a single-
statement `puts (1..5).sum` cold request in ~38 ms. CPU time
specifically (which is what Workers Free's 10 ms cap measures)
needs a real edge deploy to know; Miniflare doesn't enforce
those caps locally.
Design choices, locked in the README:
- New `wasm_worker` bin reads Ruby from stdin rather than argv,
because `@cloudflare/workers-wasi` has no public API to write
files into its in-isolate littlefs before `wasi.start()`.
Stdin is the documented input channel for command-shape wasm.
- No `[[rules]] CompiledWasm` in `wrangler.toml` — wrangler v3
ships a default rule for `**/*.wasm` with `fallthrough = false`,
so a re-declaration fails the build.
- `build.sh` prepends `~/.cargo/bin` to PATH to dodge Homebrew's
rust (lacking the `wasm32-wasip1` rust-std component) shadowing
rustup's. Same WASI_SDK_PATH requirement as `tests/wasm/smoke.sh`.
- wizer is best-effort — skipped when the bin lacks the
`wizer.initialize` export (`wasm_worker` currently doesn't carry
it, since the export only lives on the `rubyrs` CLI bin). A
follow-up could add the same pattern to `wasm_worker` to skip
the ~3-6 ms classes-and-preamble cold-start on each isolate.
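The "does this bin carry `wizer.initialize`?" check can be made concrete with a small sketch. It inspects a module's export list through the standard `WebAssembly.Module.exports` API. This is not what `build.sh` does (the script greps tool output); `hasWizerInit` and `tinyModuleWithExport` are hypothetical names, and the hand-assembled module exists only so the check can be exercised without the real artifact.

```javascript
// Hypothetical helper: builds a tiny valid wasm module that exports one
// no-op function under `name`. Assumes a short ASCII name (< 100 bytes),
// so every length fits in a single LEB128 byte.
function tinyModuleWithExport(name) {
  const n = Array.from(new TextEncoder().encode(name));
  return new Uint8Array([
    0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic + version
    0x01, 0x04, 0x01, 0x60, 0x00, 0x00,             // type section: () -> ()
    0x03, 0x02, 0x01, 0x00,                         // function section: one func, type 0
    0x07, n.length + 4, 0x01, n.length, ...n, 0x00, 0x00, // export section: func 0
    0x0a, 0x04, 0x01, 0x02, 0x00, 0x0b,             // code section: empty body
  ]);
}

// Hypothetical check: true when the module exports a function named
// `wizer.initialize`, i.e. when a wizer pass has something to call.
function hasWizerInit(wasmBytes) {
  const module = new WebAssembly.Module(wasmBytes);
  return WebAssembly.Module.exports(module).some(
    (e) => e.kind === "function" && e.name === "wizer.initialize",
  );
}
```

A build step could call `hasWizerInit` on the compiled bin and skip wizer when it returns `false`, which matches the best-effort behaviour described above.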
Knobs / follow-ups documented in README: streaming response,
RUBYRS_DEADLINE_MS env wiring, real-edge cold-start measurement,
static-script (embedded `include_str!`) mode.
This is a spike, not a finished feature — the
`poc/cf-worker/` directory is meant to stay outside the workspace
build graph until/unless a real product target emerges.
Adds a self-host deployment path that complements the existing
wrangler/CF-edge target — runs the exact same rubyrs.wasm under
`workerd serve config.capnp` with no Cloudflare account, no CPU
caps, and no isolate-eviction noise. workerd is CF's own runtime
binary (Apache 2.0) so the wasm + JS surface is identical to the
managed edge, just hosted ourselves.
Measurement findings driving the build.sh changes:
- workerd local cold-start, n=5 each (`puts 1+1`, restart between):
baseline raw (1.54 MB) ............ 57 ms median
wasm-opt -Oz only (1.22 MB) ....... 53 ms (size −21%, time −7%)
wasm-opt -Oz + wizer (1.37 MB) .... 27 ms
wasm-opt -O2 + wizer (1.42 MB) .... 23 ms
wizer only (1.68 MB) .............. 18 ms ← BEST
- wasm-opt is net-negative on V8 cold-start once wizer is in the
pipeline: both levels tested with wizer (-Oz, -O2) were slower than
wizer alone (27 and 23 ms vs 18 ms), and -Oz on its own bought only
7% for a 21% size cut. Smaller wasm doesn't translate into
faster instantiate; the V8 wasm parser appears to be bottlenecked
on something other than raw byte count (likely IR construction /
module setup).
- Default `WASM_OPT_LEVEL=skip` in build.sh; opt-in by env when
benchmarking other levels.
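The percentages in the table are plain relative changes, and the same arithmetic shows how far the wasm-opt + wizer builds sit behind wizer-only. A minimal sketch, using only the medians and sizes quoted above (`pctChange` is a hypothetical helper name):

```javascript
// Relative change from `before` to `after`, rounded to a whole percent.
function pctChange(before, after) {
  return Math.round(((after - before) / before) * 100);
}

// wasm-opt -Oz only vs baseline raw
const sizeDelta = pctChange(1.54, 1.22); // MB
const timeDelta = pctChange(57, 53);     // ms

// wasm-opt + wizer vs wizer only (18 ms)
const ozPlusWizer = pctChange(18, 27);
const o2PlusWizer = pctChange(18, 23);

console.log({ sizeDelta, timeDelta, ozPlusWizer, o2PlusWizer });
```

Read this way, the -Oz-only row is a 21% size cut for a 7% time gain, while adding either wasm-opt level on top of wizer costs cold-start time relative to wizer alone.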
CF edge cold/warm bucketing (new `x-rubyrs-invocation` header
in worker.js lets the harness partition requests by per-isolate
hit count; tail captures cpuTime). 60-request bursts:
baseline raw warm n=51 cpu median 10 ms (p10 7, p90 16, max 86)
wizer only warm n=58 cpu median 7 ms (p10 6, p90 12, max 13)
The wizer win on edge is smaller than local (−30% vs −69%) but
the cpu-max collapse (86 → 13 ms) is the load-bearing improvement:
it eliminates the cold-isolate spike that pushes individual requests
over the Free 10 ms CPU cap.
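The cold/warm partition can be sketched as follows. This is an illustration, not the actual harness: the sample shape `{ invocation, cpuMs }` is an assumption (invocation from the `x-rubyrs-invocation` header, cpuMs from the tail output), and the percentile uses the nearest-rank method, which may differ from whatever the real harness used.

```javascript
// Nearest-rank percentile over an ascending-sorted array (p in 0..100).
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Splits samples into cold (first hit on an isolate) and warm (every
// later hit), then summarises CPU time for each bucket.
function bucketByInvocation(samples) {
  const summarise = (rows) => {
    const cpu = rows.map((r) => r.cpuMs).sort((a, b) => a - b);
    return {
      n: cpu.length,
      p10: percentile(cpu, 10),
      median: percentile(cpu, 50),
      p90: percentile(cpu, 90),
      max: cpu[cpu.length - 1],
    };
  };
  return {
    cold: summarise(samples.filter((s) => s.invocation === 1)),
    warm: summarise(samples.filter((s) => s.invocation > 1)),
  };
}
```

Without the split, a single cold-isolate hit lands in the same distribution as the warm requests and shows up as an outlier max, which is the misreading described in the next paragraph.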
Earlier "edge appears to regress with opt+wizer" reading was
debunked once cold/warm were bucketed properly — that was post-deploy
pool warming, not a wasm-choice regression. The Pyodide-on-Workers
internal docs confirm CF's published mean cold-start is a blended
pool-hit + pool-miss number with the same noise profile.
build.sh changes:
- WASM_OPT_LEVEL env knob (skip | -O2 | -O3 | -Oz)
- wasm-opt → wizer ordering (wizer snapshots memory against the
post-opt module shape; running wasm-opt after wizer would invalidate
the snapshot's function-index refs)
- `grep -q` over a tempfile not a pipe — `set -o pipefail` was turning
successful "wizer.initialize present" detection into pipe-failure
because grep -q closes its stdin early and SIGPIPEs upstream
- `--wasm-bulk-memory true` (wizer's flag wants an explicit bool)
workerd self-host:
- workerd/config.capnp: HTTP socket on :8080, modules list aliases
worker.js's `./rubyrs_worker.wasm` import to the build output
- workerd/bundle.sh: esbuild bundles src/worker.js + workers-wasi
into a single .mjs (workerd's capnp modules need explicit deps;
`--external:*.wasm` keeps wasm imports out of the bundle so
capnp's `wasm = embed` resolves them)
- workerd/bench-js: minimal JS-only worker for cold-start baselining
(workerd's own boot cost is <1 ms; the rest of our 18 ms is wasm
+ Ruby setup)
src/worker.js:
- Co-located wasm import (`./rubyrs_worker.wasm`) — works for both
wrangler (default CompiledWasm rule) and workerd (workerd rejects
`..` paths; bare specifiers don't resolve). Same artifact, two hosts.
- `x-rubyrs-invocation` header from a module-scope counter — lets
the measurement harness bucket cold (invocation == 1) vs warm
per V8 isolate.
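A minimal sketch of the module-scope counter, assuming a handler shaped like the one below. `handle` and `runRuby` are hypothetical names; `runRuby` stands in for the wasm call (source text in, stdout text out) and is not the real `worker.js` code.

```javascript
// Module scope: this value lives as long as the isolate does, so the
// first request a fresh isolate serves sees invocation === 1.
let invocation = 0;

// `runRuby` is a hypothetical stand-in for the wasm invocation.
async function handle(request, runRuby) {
  invocation += 1;
  const source = await request.text();
  const stdout = await runRuby(source);
  return new Response(stdout, {
    status: 200,
    headers: { "x-rubyrs-invocation": String(invocation) },
  });
}
```

In a Worker this would be wired as the `fetch` handler of the default export. Because the counter is per isolate, two requests that both report `1` came from two different isolates, which is exactly what the cold bucket is meant to capture.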
build artifact moved from `wasm/` to `src/` to satisfy workerd's
relative-import resolution; .gitignore updated.
Adds a Deno + browser_wasi_shim path that runs the EXACT SAME
`src/rubyrs_worker.wasm` bytes as the CF Workers / workerd targets.
Demonstrates the broader thesis of the spike: one wasm artifact,
multiple V8-host runtimes, no vendor lock-in. Deno plays the same
role for "self-host JS edge runtime" that workerd plays for
"self-host workers-compatible runtime"; Deno Deploy is the managed
counterpart, mirroring the workerd → CF Workers duality.
Measurement on the wizer-only wasm (1.68 MB), n=5 each:
| metric        | Deno   | workerd | CF edge (warm) |
| cold-start    | 25 ms  | 18 ms   | ~149 ms wall   |
| warm tiny     | 1.5 ms | 2.5 ms  | 7 ms cpu       |
| warm smoke.rb | 1.7 ms | 4.0 ms  | 7 ms cpu       |
| 1M iter each  | 124 ms | 135 ms  | 173 ms cpu     |
Deno actually edges out workerd on warm (1.5 vs 2.5 ms tiny, ~40%
faster) despite being behind on cold (25 vs 18 ms). Two plausible
reasons:
(1) browser_wasi_shim's stdin/stdout path is a pure-JS callback fed
by a single `Uint8Array` buffer, whereas workers-wasi proxies
through its own bundled `memfs.wasm`;
(2) `Deno.serve` is a hyper-based Rust HTTP server cutting out the
JS-shim ↔ kj layer that workerd's HTTP path goes through.
Heavy compute converges to within ~10% because V8's wasm engine
dominates that regime.
Stack we landed on, after the dead-ends documented in the server.ts
header:
- @cloudflare/workers-wasi does NOT work in Deno: its bundled
memfs.wasm uses `import "./memfs.wasm"` which Deno's module loader
eagerly walks looking for JS deps, hitting an unresolvable
`wasi_snapshot_preview1` import.
- jsr:@std/wasi does NOT exist any more: Deno deprecated std/wasi
in Oct 2023 (PR #3732) and removed it in Nov 2023 (PR #3733) before
the JSR cutover. Never published to JSR.
- npm:@bjorn3/browser_wasi_shim: pure-JS Preview 1 shim with no
internal wasm deps. File / OpenFile / ConsoleStdout fd adapters;
buffer-shaped stdin from request body.
deno.json + lock are scoped to the deno/ subdirectory so the parent
package.json's npm tree (wrangler / workerd / esbuild and their
transitive vitest/rolldown/turbo footprints) does NOT get pulled
into Deno's node_modules. Running `deno run` from deno/ keeps the
dependency surface to just browser_wasi_shim.
`x-rubyrs-invocation` header is reused identically across all three
host wrappers so the cold/warm bucketing harness works unchanged
regardless of which target is being measured.
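To illustrate what "buffer-shaped stdin" means in practice, here is a minimal sketch of a reader that serves a fixed byte buffer in chunks until it is exhausted. It is not browser_wasi_shim's API; `makeBufferStdin` and its `read` method are hypothetical, and a real WASI `fd_read` would additionally copy into the module's linear memory via iovecs.

```javascript
// Hypothetical stdin source: the whole request body is already in memory,
// so each read is just a bounded copy plus an offset bump.
function makeBufferStdin(bytes) {
  let offset = 0;
  return {
    // Copies up to dest.length bytes into dest and returns the number
    // of bytes copied; 0 signals end of input.
    read(dest) {
      const n = Math.min(dest.length, bytes.length - offset);
      dest.set(bytes.subarray(offset, offset + n));
      offset += n;
      return n;
    },
  };
}

// Usage: feed a Ruby one-liner through a 4-byte read window.
const stdin = makeBufferStdin(new TextEncoder().encode("puts 1"));
const chunk = new Uint8Array(4);
const firstRead = stdin.read(chunk);
```

The point of the shape is that no second wasm module or stream plumbing sits between the request body and the guest's reads, which is one of the two plausible reasons given above for the warm-path difference.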
New section sandwiched between the P2-A pivot (rubyrs.wasm vs
ruby.wasm size/speed thesis) and Throughput. Documents what the
spike/cf-worker-poc branch actually validated: same Wizer'd
`rubyrs_worker.wasm` (1.68 MB) running unchanged across three
V8-based edge runtimes (Deno self-host, workerd self-host, CF
Workers managed) plus wasmtime as the non-V8 baseline.
Two tables:
1. Per-runtime cold-start / warm-tiny / warm-smoke / heavy.
Deno wins warm (1.5 ms tiny — ~40% faster than workerd),
workerd wins cold (18 ms vs Deno's 25 ms), CF edge cpu
settles at 7 ms p50 warm but eats ~149 ms wall on cold-
isolate hits. Each cell sourced from the n=5 PoC bench.
2. wasm-opt vs Wizer build-pipeline ablation (workerd local
cold-start). Counter-intuitive finding worth flagging in
the public BENCHMARKS: every `wasm-opt` level we measured
(-Oz, -O2) was net-negative on V8 cold-start; Wizer-only
(1.68 MB, no opt) is the fastest at 18 ms vs 57 ms baseline
(−69%). Smaller wasm doesn't translate into faster
instantiate — V8's wasm parser appears to bottleneck on IR
construction, not byte count.
Notes call out the methodology gotcha (CF edge variance is pool-
warming, not wasm-choice — bucket by `x-rubyrs-invocation`
header) and the Pyodide-on-Workers precedent (CF's
`make_snapshots.py` is the same trick we're applying via Wizer,
externally). Also notes wasmtime is HTTP-less by default so it
sits alongside the V8 hosts as a CLI-shape baseline rather than
in the apples-to-apples row.
Documents why the PoC stops optimising at 18 ms (workerd local
cold-start with Wizer-only build). Two independent angles
attempted, both produced negative or zero results, both for the
same root cause:
1. Lazy-loading Tier 1 stdlib preambles (Random + SecureRandom):
−40 KB wasm, n=5 cold-start MEDIAN 19.5 ms vs 18.2 ms baseline
— variance, no improvement.
2. `opt-level = "z"` + LTO=fat + codegen-units=1: repo's own
[profile.release-min] history note records 3–19 % SLOWER cold
start despite 56 % smaller binary, measured across three hosts.
Both confirm: V8 wasm cold-start is dominated by the per-function
IR build (1798 functions in our wasm), NOT byte count. The 18 ms
floor is the V8 parser + module-instantiate fixed cost for this
build shape, and reducing further requires either a function-count
refactor inside rubyrs, a component-model + AOT move (wasmtime
serve), or CF exposing a generic snapshot API (currently
privileged-only for their bundled Python runtime).
Worth flagging publicly in BENCHMARKS because the natural
follow-up reading ("ok, why didn't you wasm-opt -Oz? Why didn't
you trim stdlib?") has an empirical answer documented inline.
gapscan PR diff
Both binaries produced identical histograms across the 10 canonical scan targets. (If the classifier changed for node classes that none of these targets exercise, this view won't catch it; the data-file diff would.) See docs/gap-reports/ for the dataset and methodology.
What this is
A spike, not a feature PR. Do not merge. Opening as a draft so the empirical findings + scaffolding are discoverable from the repo, not just sitting on a local branch.
The spike validates one claim and one disclaimer:
- Claim: `rubyrs_worker.wasm` (1.68 MB) runs unchanged across three V8-based edge runtimes (Cloudflare Workers managed, workerd self-host, Deno self-host), plus wasmtime as a non-V8 baseline.
- Disclaimer: the 18 ms workerd cold-start is the floor for this build shape; the reasoning is in `docs/BENCHMARKS.md`. To go lower would require either a function-count refactor inside rubyrs, a component-model + AOT move (wasmtime serve), or CF exposing a generic snapshot API.
Files
Headline numbers (n=5 each, Apple M-series)
`puts 1+1` workload, wizer-only wasm (no wasm-opt, a counter-intuitive finding):
Surprises worth flagging
- `wasm-opt` is net-negative on V8 cold-start at every level tested (-Oz, -O2) once Wizer is in the pipeline. Smaller wasm doesn't help; the V8 wasm parser bottlenecks on IR construction, not byte count. Build default is now `WASM_OPT_LEVEL=skip`.
- Pyodide-on-Workers is the precedent: Wizer, applied externally, is the same trick as CF's `make_snapshots.py`.
- CF edge variance is pool warming, not wasm choice (bucket by the `x-rubyrs-invocation` header in `src/worker.js`).
- Deno wins warm: `Deno.serve`'s hyper-based Rust HTTP cuts out workerd's JS-shim ↔ kj layer.
- `@cloudflare/workers-wasi` cannot be used on Deno (internal `memfs.wasm` confuses Deno's loader); `jsr:@std/wasi` doesn't exist any more (removed Nov 2023). The Deno path uses `@bjorn3/browser_wasi_shim`.
- SyntaxError messages leaked the `ruby_prism::Diagnostic` Debug format (raw pointers + PhantomData markers). Shipped separately as fix(vm): SyntaxError message no longer leaks Prism Diagnostic Debug format #130.
What this is NOT
- … `rubyrs` crate. The only main-crate change is `src/bin/wasm_worker.rs`, a 60-line stdin → eval → stdout bin that's incidental to the PoC and self-contained.
How to reproduce
Branch retention
Keep this branch + draft PR open as a stable reference for the runtime-portability claim. If a future rubyrs feature wants to revisit (e.g. component model migration, lazy-load preamble follow-up), this is the starting point + measurement harness.