docs: scaling dive 2026-05 (closes Phase 2 of #7756) #7765
Draft
JohnMcLear wants to merge 15 commits into develop from docs/scaling-dive-2026-05
Commits
- 2645da2 docs: scaling dive 2026-05 — first numbers-backed answer to #7756 (JohnMcLear)
- 80b3b74 docs(scaling-dive): rewrite with cliff-finding + Qodo fixes (JohnMcLear)
- 142c5f1 docs(scaling-dive): add lever-8 negative result + methodology noise c… (JohnMcLear)
- 03f4308 docs(scaling-dive): triple-run noise envelope + honest re-evaluation (JohnMcLear)
- 4d6eec7 docs(scaling-dive): close #7769 in the doc + update recommendations (JohnMcLear)
- 92d40ec docs(scaling-dive): N=3 re-eval of lever 3 + add lever 8b (flush defer) (JohnMcLear)
- 6e16b21 docs(scaling-dive): add lever 9 (SessionManager throw fix #7775) (JohnMcLear)
- eff3a01 docs(scaling-dive): add N=3 measured numbers for lever 9 (#7775) (JohnMcLear)
- 1ee3e9e docs(scaling-dive): #7775+#7776 stacked = -12% to -20% CPU, cliff moves (JohnMcLear)
- 661e829 docs(scaling-dive): scope worker-thread first cut for applyToText (JohnMcLear)
- a0df336 docs(scaling-dive): per-call worker-thread offload falsified (JohnMcLear)
- f20d56e docs(scaling-dive): link #7780 (room-broadcast fan-out) as next lever… (JohnMcLear)
- 0e8cb97 docs(scaling-dive): tiered roadmap for future effort (JohnMcLear)
- f966e62 docs(scaling-dive): clarify Tier 3 ecosystem impact (clients + plugins) (JohnMcLear)
- 7e32f1e docs(scaling-dive): split horizontal scaling into 6a (proxy shard) + … (JohnMcLear)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,133 @@ | ||
| # Scaling dive — 2026-05 | ||
|
|
||
| **Closes Phase 2 of #7756.** First numbers-backed answer to "how many editors can be on one pad, and what is the bottleneck when it falls over?" | ||
|
|
||
| ## TL;DR | ||
|
|
||
| Two clean conclusions from three matrix entries (baseline, `nodemem`, `websocket-only`) run on the same GitHub-hosted `ubuntu-latest` runner shape: | ||
|
|
||
| 1. **Server-side changeset apply is not the bottleneck.** Even at 200 concurrent authors, `etherpad_changeset_apply_duration_seconds` mean is ~3.7–4.4 ms — well under client-perceived p95 (~20–25 ms). The remaining latency lives in *fan-out*, not in *apply*. | ||
| 2. **Dropping the socket.io polling fallback (`socketTransportProtocols: ["websocket"]`) makes things worse, not better, under high concurrency.** At 200 authors it nearly doubles client p95 (37 ms vs 20 ms baseline). The hypothesis that the polling fallback was costing us is falsified. | ||
|
|
||
| Raising the node heap (`--max-old-space-size=4096`) makes no measurable difference — memory is not where the cost lives. | ||
|
|
||
| Next step: prototype the **fan-out batching** lever (spec section 9 lever 3). Today `etherpad_socket_emits_total{type=NEW_CHANGES}` grows roughly quadratically with author count: 1 160 emits per 10 s dwell at 20 authors becomes 66 032 emits at 200 authors, about 57× the emits for 10× the authors. Coalescing the changesets that arrive within a configurable window before broadcasting should attack that directly. | ||
|
|
||
| ## Methodology | ||
|
|
||
| - **Harness:** [`ether/etherpad-load-test`](https://github.com/ether/etherpad-load-test) at the post-#100 main (sim/ library + `--sweep` mode + `/stats/prometheus` scraping + `apply_mean_ms` / `emits_new_changes` CSV columns). | ||
| - **Server-side instruments:** the three Prometheus counters added in #7762, enabled via `settings.scalingDiveMetrics=true`. | ||
| - **SUT:** etherpad core `develop` HEAD at the time of run. | ||
| - **Runner shape:** GitHub-hosted `ubuntu-latest` (4 vCPU, ~16 GB RAM). All three matrix entries use the same shape, so hardware differences do not confound the comparison; run-to-run noise on shared runners still applies. | ||
| - **Workflow:** [`.github/workflows/scaling-dive.yml`](https://github.com/ether/etherpad-load-test/blob/main/.github/workflows/scaling-dive.yml), manual `workflow_dispatch`. Two runs analysed: | ||
| - **Run 25936626554** — default sweep `authors=10..80:step=10:dwell=15s:warmup=3s`. | ||
| - **Run 25936813657** — deeper sweep `authors=20..200:step=20:dwell=10s:warmup=2s` (used for the conclusions below). | ||
|
|
||
| ### Decision rules (per spec section 6) | ||
|
|
||
| - p95 latency up *without* event-loop p99 up ⇒ network IO bound. | ||
| - p95 latency up *with* event-loop p99 up ⇒ server CPU / event-loop bound. | ||
| - p95 latency up *with* RSS climbing across steps ⇒ leak / backpressure. | ||
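
The three rules can be read as a small classifier over two consecutive sweep steps. The sketch below is illustrative only: the `StepSample` shape, its field names, the 10 % tolerance, and the order in which overlapping rules are checked are all assumptions, not taken from the harness or the spec.

```typescript
// Hypothetical per-step summary; field names are illustrative, not the harness CSV schema.
interface StepSample {
  p95Ms: number;          // client-perceived p95 latency
  eventLoopP99Ms: number; // server event-loop delay p99
  rssMb: number;          // server resident set size
}

type Verdict =
  | "no-regression"
  | "network-io-bound"
  | "cpu-or-event-loop-bound"
  | "leak-or-backpressure";

// Assumed tolerance: a metric only counts as "up" if it grew by more than 10 %.
const TOLERANCE = 0.1;

function isUp(before: number, after: number): boolean {
  return after > before * (1 + TOLERANCE);
}

// Checks the event-loop rule first, then RSS; the rules themselves do not define a priority.
function classify(prev: StepSample, cur: StepSample): Verdict {
  if (!isUp(prev.p95Ms, cur.p95Ms)) return "no-regression";
  if (isUp(prev.eventLoopP99Ms, cur.eventLoopP99Ms)) return "cpu-or-event-loop-bound";
  if (isUp(prev.rssMb, cur.rssMb)) return "leak-or-backpressure";
  return "network-io-bound";
}

// p95 and event-loop values are the baseline rows for 160 and 200 authors;
// the RSS values are placeholders, since per-step RSS is not listed in the doc.
const verdict = classify(
  { p95Ms: 9, eventLoopP99Ms: 11, rssMb: 651 },
  { p95Ms: 20, eventLoopP99Ms: 12, rssMb: 651 },
);
console.log(verdict); // network-io-bound
```

A real implementation would likely compare against a trend over several steps rather than a single pair, since single-step p95 values on a shared runner are noisy.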
|
|
||
| ## Baseline curve | ||
|
|
||
| The deep sweep on baseline (no levers, develop HEAD): | ||
|
|
||
| | Step | p50 | p95 | p99 | EL p99 | apply_mean | emits_NEW_CHANGES | cpu_user (s) | | ||
| |---:|---:|---:|---:|---:|---:|---:|---:| | ||
| | 20 | 9 | 11 | 12 | 11 | 4.84 ms | 1 160 | 2.4 | | ||
| | 40 | 8 | 11 | 12 | 12 | 4.62 ms | 3 520 | 4.0 | | ||
| | 60 | 8 | 11 | 13 | 12 | 4.63 ms | 7 040 | 6.3 | | ||
| | 80 | 10 | 17 | 19 | 12 | 5.18 ms | 11 780 | 9.5 | | ||
| | 100 | 8 | 16 | 18 | 11 | 5.08 ms | 17 668 | 13.0 | | ||
| | 120 | 5 | 12 | 16 | 11 | 4.55 ms | 24 793 | 17.5 | | ||
| | 140 | 3 | 8 | 11 | 11 | 3.96 ms | 33 088 | 22.8 | | ||
| | 160 | 4 | 9 | 11 | 11 | 3.62 ms | 42 563 | 29.0 | | ||
| | 180 | 5 | 16 | 20 | 12 | 3.56 ms | 54 112 | 36.5 | | ||
| | 200 | 7 | 20 | 25 | 12 | 3.67 ms | 66 032 | 44.0 | | ||
|
|
||
| Reading against the decision rules: | ||
|
|
||
| - p95 drifts between 8 and 20 ms across the range without a monotonic trend (11 ms at 20 authors, 20 ms at 200), and never cliffs. | ||
| - Event-loop p99 stays at 11–12 ms — flat. **Not event-loop bound.** | ||
| - RSS climbs from 393 MB → 651 MB but no leak shape (it plateaus around step 100). | ||
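| - CPU is the headline: at 200 authors `cpu_user` reads 44 CPU-seconds. Against the 10 s dwell that would be ~4.4 cores, more than the runner's 4 vCPU, so the figure presumably also covers the 2 s warmup (~3.7 cores) or accumulates across steps. Taken at face value, the runner is at or near CPU saturation on fan-out work. | ||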
|
|
||
| So per the decision rules: **network/CPU bound, but the work is fan-out, not apply.** The `apply_mean` stays low while emits grow O(N²) with concurrency. | ||
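
As a sanity check on the growth claim, the exponent can be estimated directly from the `emits_NEW_CHANGES` column of the baseline table. This is a minimal sketch, with the numbers copied from that table; the helper names are made up for illustration. The doc does not say whether the column is reported per step or cumulatively since server start, so the second helper also prints the step-to-step increase per author: if that stays nearly constant, the scrape semantics are worth confirming before reading the exponent as per-step behaviour.

```typescript
// Baseline rows copied from the table above: [authors, emits_NEW_CHANGES].
const baseline: Array<[number, number]> = [
  [20, 1160], [40, 3520], [60, 7040], [80, 11780], [100, 17668],
  [120, 24793], [140, 33088], [160, 42563], [180, 54112], [200, 66032],
];

// Log-log slope between the first and last step: ~1 is linear, ~2 is quadratic.
function growthExponent(rows: Array<[number, number]>): number {
  const first = rows[0];
  const last = rows[rows.length - 1];
  return Math.log(last[1] / first[1]) / Math.log(last[0] / first[0]);
}

// Increase over the previous step, divided by the author count of the later step.
function incrementPerAuthor(rows: Array<[number, number]>): number[] {
  return rows.slice(1).map((row, i) => (row[1] - rows[i][1]) / row[0]);
}

console.log("exponent:", growthExponent(baseline).toFixed(2));
console.log(
  "increment per author:",
  incrementPerAuthor(baseline).map((x) => x.toFixed(1)).join(" "),
);
```

On these numbers the end-to-end exponent lands between 1.7 and 1.8, and every step-to-step increment per author falls between 58 and 65.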
|
|
||
| ## Lever 1 — perMessageDeflate | ||
|
|
||
| **Not run.** Verifying that core's socket.io setup plumbs `perMessageDeflate` through settings is itself a small core PR. Folded into the recommendation below. | ||
|
|
||
| ## Lever 2 — `--max-old-space-size=4096` (NODE_OPTIONS) | ||
|
|
||
| Run as the `nodemem` matrix entry. Selected step-by-step diff vs baseline: | ||
|
|
||
| | Step | baseline p95 | nodemem p95 | Δ | | ||
| |---:|---:|---:|---:| | ||
| | 80 | 17 | 17 | 0 | | ||
| | 120 | 12 | 16 | +4 | | ||
| | 160 | 9 | 13 | +4 | | ||
| | 200 | 20 | 13 | -7 | | ||
|
|
||
| Deltas range from -7 to +4 ms with no consistent sign, which reads as run-to-run noise. RSS grows similarly. `apply_mean` and `emits_NEW_CHANGES` are essentially identical. | ||
|
|
||
| **Verdict: no measurable effect.** The user's hunch on the issue (memory is not the bottleneck) is confirmed. Don't recommend bumping the heap as a scaling lever. | ||
|
|
||
| ## Lever 3 — fan-out batching | ||
|
|
||
| **Deferred.** Requires a code change in `PadMessageHandler.ts` (specifically the per-socket loop in `updatePadClients` and/or the broadcast emit at line 627). Recommended as the next concrete code change. The harness is ready to score it as soon as a candidate branch exists — point the workflow's `core_ref` input at the branch. | ||
|
|
||
| The `emits_new_changes` column on the curve table above is the direct measurement target. At 200 authors we're producing 66 032 emits per 10 s dwell. Halving the emit rate (by coalescing two changesets per emit on a sub-50 ms window) would directly reduce CPU. | ||
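
To make the coalescing idea concrete, here is a minimal sketch of the grouping step alone. It is not Etherpad code: `PendingChange`, `coalesce`, and the field names are hypothetical, and it leaves out everything a real change in `updatePadClients` would need, such as the flush timer, composing or packing the grouped changesets into one message, and per-client revision tracking.

```typescript
// Hypothetical shape; real Etherpad revisions carry more fields.
interface PendingChange {
  arrivedAtMs: number; // arrival time on the server
  changeset: string;   // opaque payload for this sketch
}

// Group changes (assumed sorted by arrival time) so that each batch
// covers less than windowMs, measured from the first change in the batch.
function coalesce(changes: PendingChange[], windowMs: number): PendingChange[][] {
  const batches: PendingChange[][] = [];
  let current: PendingChange[] = [];
  let windowStart = 0;
  for (const change of changes) {
    const windowExpired = change.arrivedAtMs - windowStart >= windowMs;
    if (current.length > 0 && windowExpired) {
      batches.push(current);
      current = [];
    }
    if (current.length === 0) windowStart = change.arrivedAtMs;
    current.push(change);
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Four changes inside 60 ms with a 50 ms window: two batches, of sizes 3 and 1.
const demo = coalesce(
  [
    { arrivedAtMs: 0, changeset: "a" },
    { arrivedAtMs: 10, changeset: "b" },
    { arrivedAtMs: 45, changeset: "c" },
    { arrivedAtMs: 60, changeset: "d" },
  ],
  50,
);
console.log(demo.map((batch) => batch.length));
```

If each batch goes out as one emit per client, the emit count drops from one per changeset to one per batch; the cost is up to one window of added latency for the first change in each batch, which is why the window has to stay well below the current p95.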
|
|
||
| ## Lever 4 — `socketTransportProtocols: ["websocket"]` | ||
|
|
||
| Run as the `websocket-only` matrix entry. Selected step-by-step diff vs baseline: | ||
|
|
||
| | Step | baseline p95 | websocket-only p95 | Δ | baseline apply_mean | ws-only apply_mean | | ||
| |---:|---:|---:|---:|---:|---:| | ||
| | 20 | 11 | 10 | -1 | 4.84 ms | 3.67 ms | | ||
| | 60 | 11 | 9 | -2 | 4.63 ms | 3.28 ms | | ||
| | 100 | 16 | 13 | -3 | 5.08 ms | 3.27 ms | | ||
| | 140 | 8 | 24 | **+16** | 3.96 ms | 5.13 ms | | ||
| | 180 | 16 | 35 | **+19** | 3.56 ms | 8.07 ms | | ||
| | 200 | 20 | 37 | **+17** | 3.67 ms | 8.77 ms | | ||
|
|
||
| Up to 100 authors, websocket-only is a modest win (-1 to -3 ms p95). From 140 authors up it gets sharply worse: p95 rises by 16 to 19 ms (roughly 2× to 3× baseline), `apply_mean` climbs to 5.1 to 8.8 ms against a 3.6 to 4.0 ms baseline, and event-loop p99 jumps from 12 to 17 ms. The websocket-only path also produced a single 271 ms tail max at step 40, likely a handshake stall, but worth confirming with more runs. | ||
|
|
||
| **Verdict: do not recommend dropping the polling fallback.** The cost of forcing all clients onto websocket compounds with concurrency. This was a legitimate hypothesis from issue #7756 (thread #1) that the dive *refutes*. | ||
|
|
||
| ## Lever 5 — raw `ws` (drop socket.io entirely) | ||
|
|
||
| **Not pursued.** Lever 4 showed that narrowing the transport choice within socket.io already backfires: dropping the polling fallback hurts. Ripping socket.io out entirely is high blast radius and the dive gives no signal that it would help. Defer indefinitely. | ||
|
|
||
| ## Recommendation | ||
|
|
||
| In priority order: | ||
|
|
||
| 1. **Prototype fan-out batching** (lever 3). The dive identifies fan-out as the single dominant cost. Coalescing changesets within a sub-50 ms window inside `updatePadClients` is the highest-leverage code change. Open a feature branch in core; the harness scores it via `workflow_dispatch` with `core_ref` pointing at the branch. | ||
| 2. **Verify and run lever 1** (`perMessageDeflate`). Even if compression has overhead at low concurrency, at 200 authors emit *bytes* are the likely second-order cost behind emit *count* (not yet measured; see item 5). Worth scoring once lever 3 is in. | ||
| 3. **Do not merge lever 4.** Keep `socketTransportProtocols: ["websocket", "polling"]` as the default. | ||
| 4. **Do not merge lever 2.** No effect. | ||
| 5. **Add core counters for fan-out byte size** as a small follow-up to #7762. The histogram of changeset bytes per emit would make lever 1 scorable without instrumenting client-side. | ||
|
|
||
| ## Reproducing | ||
|
|
||
| ``` | ||
| # Trigger a dive run against any core ref. | ||
| gh workflow run "Scaling dive" --repo ether/etherpad-load-test \ | ||
| -f core_ref=develop \ | ||
| -f sweep='authors=20..200:step=20:dwell=10s:warmup=2s' | ||
|
|
||
| # Fetch artifacts. | ||
| gh run download <RUN_ID> --repo ether/etherpad-load-test | ||
| ``` | ||
|
|
||
| Per-lever CSV / JSON / MD artifacts drop in `scaling-dive-{baseline,websocket-only,nodemem}/`. The CSV is plot-ready; the JSON has the full per-step `Snapshot.gauges`. | ||
|
|
||
| ## Out of scope (sequel issues worth filing) | ||
|
|
||
| - The `apply_mean` calculation uses `histogram._sum / histogram._count` for a simple mean. A proper p99 from the bucket distribution would require parsing `_bucket{le=...}` rows in the harness. Worth adding to the Scraper if lever 3 scoring needs it. | ||
| - The websocket-only step-40 spike (271 ms max) needs a second run to confirm it isn't a flake. | ||
| - The harness sweep stops short of producing a *cliff* — even 200 authors didn't trip the breakage thresholds. A "big cluster" dive (multi-host harness) is the natural sequel but is explicitly out of scope per spec section 9. | ||
| - Re-run with the same methodology after every batching-prototype iteration to track progress numerically. | ||
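
For the first item in the list above, the difference between the simple mean and a bucket-based percentile can be sketched as follows. This is an illustration under assumptions, not the harness's Scraper: the `Bucket` shape and function names are hypothetical, the bucket bounds in the example are invented, and the interpolation follows the same idea as PromQL's `histogram_quantile` (linear within the bucket that contains the target rank).

```typescript
// One cumulative Prometheus histogram bucket: observations with value <= le.
interface Bucket {
  le: number;    // upper bound in seconds; Infinity for the "+Inf" bucket
  count: number; // cumulative count
}

// The simple mean described above: _sum / _count.
function histogramMean(sum: number, count: number): number {
  return count === 0 ? NaN : sum / count;
}

// Estimate quantile q (0..1) by linear interpolation inside the matching bucket.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of sorted) {
    if (b.count >= rank) {
      // Nothing to interpolate above the highest finite bound.
      if (!isFinite(b.le)) return prevLe;
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      return prevLe + (b.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// Invented example data: 100 observations totalling 0.4 s.
const exampleBuckets: Bucket[] = [
  { le: 0.005, count: 90 },
  { le: 0.01, count: 98 },
  { le: 0.025, count: 100 },
  { le: Infinity, count: 100 },
];
console.log("mean (s):", histogramMean(0.4, 100));
console.log("p99 (s):", histogramQuantile(0.99, exampleBuckets));
```

In this invented example the mean is 4 ms while the interpolated p99 is about 17.5 ms, which shows how a low `apply_mean` can coexist with a much slower tail.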
1. Fan-out batching lacks results
📎 Requirement gap ➹ Performance