WebGPURenderer: Reduce per-frame allocations in render & compute paths#33419
Open
RenaudRohlinger wants to merge 17 commits into mrdoob:dev from
Conversation
- beginCompute: cache pass descriptor and label; avoid per-dispatch map/join.
- clear: drop unused colorAttachments array literal.
- beginRender: cache occlusionQuerySet descriptor across frames.
- _getRenderPassDescriptor: reuse scratch view descriptors instead of fresh literals.
- _createDepthLayerDescriptors: reuse per-layer descriptors in place; truncate stale depth views and layer descriptors when cameras.length shrinks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
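The scratch-descriptor pattern behind the `_getRenderPassDescriptor` change can be sketched as follows. The names here are illustrative, not the exact three.js source; the key point is that `texture.createView()` reads its descriptor synchronously and does not retain it, so one module-scoped object can be mutated and handed out on every call instead of allocating a fresh literal:

```javascript
// Module-scoped scratch descriptor, reused across frames (hypothetical fields).
const _viewDescriptor = {
	dimension: '2d',
	baseMipLevel: 0,
	mipLevelCount: 1
};

// Mutate in place and return the same object every call — safe because the
// consumer (e.g. createView) reads it synchronously and never keeps a reference.
function getScratchViewDescriptor( mipLevel ) {

	_viewDescriptor.baseMipLevel = mipLevel;
	return _viewDescriptor;

}
```

Every call returns the same object identity, so the per-call allocation disappears entirely after module load.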
- Renderer.compute: cache a singleton array for single-node dispatches instead of allocating [ computeNode ] per call.
- WebGPUTimestampQueryPool._resolveQueries: reuse scratch submit array.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- test/e2e/perf-regression.js: run a WebGPU example and capture JS heap, GC, frame timing, and WebGPU resource deltas (buffers, textures, bind groups, pipelines, submits/render passes/compute passes, uncaptured errors).
- test/e2e/perf-regression-compare.js: diff two runs and print a clean A/B table.
- Self-contained — relies only on puppeteer plus the existing utils/server.js.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
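The A/B diffing step in a compare script like this can be approximated by the sketch below. `diffRuns` is a hypothetical name; the actual script formats the result as a printed table, but the core is just a per-metric numeric delta between two snapshot objects (e.g. from Puppeteer's `page.metrics()`):

```javascript
// Diff two metric snapshots, keeping only numeric fields present in both runs.
function diffRuns( before, after ) {

	const rows = [];

	for ( const key of Object.keys( before ) ) {

		if ( typeof before[ key ] !== 'number' || typeof after[ key ] !== 'number' ) continue;

		rows.push( {
			metric: key,
			before: before[ key ],
			after: after[ key ],
			delta: after[ key ] - before[ key ]
		} );

	}

	return rows;

}
```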
📦 Bundle size — Full ESM build, minified and gzipped.
🌳 Bundle size after tree-shaking — Minimal build including a renderer, camera, empty scene, and dependencies.
These changes belong to mrdoob#33418 and were bundled into this branch by mistake.

- Renderer.js: drop `onError` / `_onError` hook.
- WebGPUBackend.js: drop `device.onuncapturederror` → `renderer.onError` bridge.
- WebGPUPipelineUtils.js: revert to dev (removes pipeline-label error messages and `_reportShaderDiagnostics`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mugen87
reviewed
Apr 19, 2026
- RenderContexts: drop the attachment-state cache; rebuild the template literal every call. Benchmarked three variants (instance cache, WeakMap cache, no cache) and the cache provided no net benefit. Addresses review comments about monkey-patching render targets and avoiding comparison complexity for a short string.
- WebGPUBackend: extract `_setTexelCopyInfo` and `_submit` module helpers to collapse repeated field-by-field mutation patterns. `_submit` replaces the three-line submit-array scratch pattern at 5 call sites; `_setTexelCopyInfo` collapses 10 field assignments in `copyTextureToTexture` into 2 helper calls. Addresses the readability comment on line 2641.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
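The `_submit` helper described above can be sketched like this (a minimal sketch under the assumptions in the commit message — one module-scoped scratch array replaces the fill/submit/clear triple at each call site):

```javascript
// One reusable single-slot array for queue.submit(), shared by all call sites.
const _scratchCmdBuffers = [ null ];

function _submit( queue, commandBuffer ) {

	_scratchCmdBuffers[ 0 ] = commandBuffer;
	queue.submit( _scratchCmdBuffers );
	_scratchCmdBuffers[ 0 ] = null; // don't retain the command buffer after submit

}
```

Clearing the slot after submitting matters: otherwise the scratch array would keep the last GPUCommandBuffer alive between frames.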
Replaces the `_setTexelCopyInfo` helper function with a `TexelCopyTextureInfo` class that exposes chainable `setTexture`/`setMipLevel`/`setOrigin` setters, matching the reviewer's suggestion on PR mrdoob#33419 and mirroring three.js's Matrix4/Vector3 conventions. The four `copyTextureToTexture` / `copyFramebufferToTexture` scratch descriptors are now instances of this class, so call sites document themselves (`.setTexture(...).setOrigin(...)`) rather than opaque positional arguments to an ad-hoc helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
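A sketch of that chainable wrapper, with field names following the `GPUTexelCopyTextureInfo` IDL dictionary (the exact three.js implementation may differ in details):

```javascript
// Reusable, chainable descriptor for copyTextureToTexture-style calls.
class TexelCopyTextureInfo {

	constructor() {

		this.texture = null;
		this.mipLevel = 0;
		this.origin = { x: 0, y: 0, z: 0 };

	}

	setTexture( texture ) {

		this.texture = texture;
		return this;

	}

	setMipLevel( mipLevel ) {

		this.mipLevel = mipLevel;
		return this;

	}

	setOrigin( x, y, z ) {

		this.origin.x = x;
		this.origin.y = y;
		this.origin.z = z;
		return this;

	}

}
```

A call site then reads as `_copyDst.setTexture( dstGPU ).setMipLevel( level ).setOrigin( x, y, z )` instead of positional arguments to an ad-hoc helper, while the instance itself is still a long-lived scratch object.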
The wrapper class for `GPUTexelCopyTextureInfo` belongs with the other texture utilities. Export it from `WebGPUTextureUtils.js` and import it into `WebGPUBackend.js`. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wrapper class for `GPUTexelCopyTextureInfo` is a standalone helper — move it out of `WebGPUTextureUtils.js` (which is about texture management) into `src/renderers/webgpu/utils/TexelCopyTextureInfo.js`, matching three.js's "one class per file" convention. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The perf-regression scripts belong in a separate PR (or stay as personal tooling); this PR should be scoped to the renderer allocation fixes only. Drops `test/e2e/perf-regression.js`, `test/e2e/perf-regression-compare.js`, their README section, and the corresponding `.gitignore` entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dup.
`WebGPUPipelineUtils.setPipeline` tracked the active pipeline per pass
encoder via a WeakMap. Pass encoders are created fresh every frame, so
each pipeline change fired `_activePipelines.set( pass, pipeline )` —
allocating a new entry hundreds of times per second. Sampled heap
profile attributed ~64 KB / 10 s to this one line on webgpu_backdrop_water.
We already track the active pipeline in `currentSets.pipeline` (render)
and can do the same on the compute group's data entry. Inline the
dedup at both call sites, delete the WeakMap, delete the method.
Bonus: use an Array for `currentSets.attributes` instead of `{}` and
reset with `.length = 0` — avoids the `for…in / delete` loop in
`_resetCurrentSets` that caused hidden-class transitions.
Measured on webgpu_backdrop_water (Inspector stripped, 10 s, 1 KB
sampling): `update @ Animation.js:70` subtree 402 KB → 341 KB (−15%).
`animate` child subtree 135 KB → 61 KB (−55%).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
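The inline dedup that replaced the WeakMap can be sketched as below. Names approximate the commit message, not the exact three.js source; the point is that an equality check against state the render loop already owns replaces a per-change `WeakMap.set` allocation, and that resetting the attributes Array with `.length = 0` avoids the `for…in / delete` loop:

```javascript
// State the render loop already tracks per pass (Array instead of {} for attributes).
const currentSets = { pipeline: null, attributes: [] };

// Inline dedup: only forward the call when the pipeline actually changed.
function bindPipeline( passEncoder, pipeline ) {

	if ( currentSets.pipeline === pipeline ) return; // already bound on this pass

	passEncoder.setPipeline( pipeline );
	currentSets.pipeline = pipeline;

}

// Reset for the next pass: no per-key delete, hidden class stays stable.
function _resetCurrentSets() {

	currentSets.pipeline = null;
	currentSets.attributes.length = 0;

}
```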
`getLights` allocated a fresh `{ renderId, lightsData }` object every
time `renderId` changed (i.e. once per frame per lightsNode). Mutate
the cached entry's fields in place instead, so the cache-miss path
doesn't allocate after the first frame.
`getLightsData` also allocated a fresh `const lights = []` per call;
change the signature to accept the output array so it can be reused.
Measured on webgpu_backdrop_water (Inspector stripped, 1 KB sampling):
site no longer appears in the top 30 allocators.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
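The two changes in this commit can be sketched together (hypothetical helper names and a simplified `lights` shape — the real `NodeMaterialObserver` code carries more fields): the cache entry's fields are mutated in place, and `getLightsData` fills a caller-owned array instead of allocating one:

```javascript
// Fill a caller-owned target array in place instead of allocating a fresh one.
function getLightsData( lights, target ) {

	target.length = 0;

	for ( let i = 0; i < lights.length; i ++ ) {

		target.push( lights[ i ].id );

	}

	return target;

}

// Cache-miss path: mutate the existing entry rather than building a new
// { renderId, lightsData } object every frame.
function updateCacheEntry( entry, renderId, lights ) {

	entry.renderId = renderId;
	getLightsData( lights, entry.lightsData );
	return entry;

}
```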
`addUniformUpdateRange` allocated a fresh `{start, count}` object every
frame, for every uniform whose value changed. On an animated scene with
dozens of per-object uniforms this was hundreds of allocations per
frame — the single largest remaining source of sampled-heap traffic
after the earlier backend fixes.
Keep the range objects in `_updateRangeCache` across frames; mark each
as `added: false` on `clearUpdateRanges()` so the next frame pushes the
same cached object back onto `updateRanges` instead of allocating a
new one.
Measured on webgpu_backdrop_water (Inspector stripped, 10 s, 1 KB
sampling, 3-run average):
total sampled heap 97.0 KB → 66.5 KB (-31.5%)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
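The pooling scheme described above can be sketched as follows. Property names follow the commit message; the real `UniformsGroup` code may differ, and this sketch keys the pool on the uniform object itself:

```javascript
// Pool of { start, count, added } range objects, kept alive across frames.
const _updateRangeCache = new Map();
const updateRanges = [];

function addUniformUpdateRange( uniform, start, count ) {

	let range = _updateRangeCache.get( uniform );

	if ( range === undefined ) {

		// Allocated once per uniform, then reused every frame.
		range = { start: 0, count: 0, added: false };
		_updateRangeCache.set( uniform, range );

	}

	range.start = start;
	range.count = count;

	if ( range.added === false ) {

		updateRanges.push( range );
		range.added = true;

	}

}

function clearUpdateRanges() {

	// Mark entries re-addable; the pooled objects themselves survive.
	for ( const range of updateRanges ) range.added = false;

	updateRanges.length = 0;

}
```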
…eagerly.

Two small per-frame allocation reductions in the rAF bucket:

- `update()` now calls `performance.now()` once instead of 2-3 times. Each call returns a non-Smi double that V8 boxes into a `HeapNumber` when stored on a property; fewer calls means fewer boxes per frame.
- `lastTime` is initialized to `performance.now()` in the constructor, so the property starts as a double rather than transitioning from `undefined`. Keeps V8's hidden class stable and allows the property to stay in double-unboxed storage across frames.

Measured on webgpu_backdrop_water (Inspector stripped, 1 KB sampling, 3-run average): `update @ Animation.js:70` attribution 15.98 KB → 11.80 KB (−26%). Total sampled heap unchanged within noise — V8 re-attributes the remaining bytes to other sites — but the `update` row specifically is the frontier that V8 inlining fuzz lets us touch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
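Both points reduce to a small shape, sketched here (this is an illustration of the pattern, not the actual `Animation.js` source):

```javascript
// Sketch: one performance.now() read per update, and lastTime seeded with a
// double in the constructor so the property never transitions from undefined.
class Animation {

	constructor() {

		this.lastTime = performance.now(); // property starts life as a double

	}

	update() {

		const time = performance.now(); // single read per frame

		const delta = time - this.lastTime;
		this.lastTime = time;

		return delta;

	}

}
```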
Mugen87
reviewed
Apr 20, 2026
Moving UniformsGroup range pooling and NodeMaterialObserver cache reuse into separate PRs so each can be reviewed independently by the relevant subsystem owners. The NodeFrame change is dropped entirely.

Reverts:
- 598a225 NodeFrame: performance.now() consolidation — dropped
- 47bcec4 UniformsGroup: Pool per-uniform update-range — moved to own PR
- dbad1d4 NodeMaterialObserver: Reuse lightsData cache entry — moved to own PR

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
Everything should be ready for review. Overall a pretty big win, with a total sampled-heap reduction of about 14–17% on the same workload. Related PR:
…er starts.

Fixes 'No pipeline set' on compute dispatches in scenes with repeating compute calls (e.g. webgpu_compute_texture_pingpong).

The inline pipeline dedup introduced in 2e164b2 tracked the active pipeline on the compute group's data entry, but each `beginCompute` creates a new pass encoder — so the tracker would still match last frame's pipeline and skip the `setPipeline` call on the fresh encoder.

Reset `groupGPU.currentPipeline = null` whenever we recreate the command encoder in `beginCompute`, so the first `compute()` call after each `beginCompute` always calls `setPipeline` on the new encoder. The render path already does this via `_resetCurrentSets` in `beginRender`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
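The bug and fix can be reduced to this sketch (names follow the commit message, not the exact backend source): the dedup state lives on the group, but its validity is scoped to one encoder, so it must be cleared whenever the encoder is recreated:

```javascript
// The fix: a fresh encoder invalidates the tracked pipeline, forcing the
// first dispatch after beginCompute to call setPipeline again.
function beginCompute( groupGPU, device ) {

	groupGPU.cmdEncoderGPU = device.createCommandEncoder();
	groupGPU.currentPipeline = null;

}

// Inline dedup on the compute path, mirroring the render path's currentSets.
function compute( groupGPU, passEncoder, pipeline ) {

	if ( groupGPU.currentPipeline !== pipeline ) {

		passEncoder.setPipeline( pipeline );
		groupGPU.currentPipeline = pipeline;

	}

}
```

Without the `currentPipeline = null` reset, the second `beginCompute` would leave last frame's pipeline in the tracker and the fresh encoder would never receive a `setPipeline` call.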
This was referenced Apr 20, 2026
Mugen87
reviewed
Apr 20, 2026
Mugen87
reviewed
Apr 21, 2026
const _copyDst = new TexelCopyTextureInfo();
const _copySize = [ 0, 0 ];

const _texCopyEncoderOptions = { label: '' };
Collaborator
I think the code gets clearer if we introduce more descriptor classes. For that, I would suggest creating a new directory src/renderers/webgpu/descriptors and putting the class definitions (including TexelCopyTextureInfo) there.
Here is a list with possible types for the new module scope objects:
- _texCopyEncoderOptions: GPUCommandEncoderDescriptor
- _clearEncoderOptions: GPUCommandEncoderDescriptor
- _viewDescriptor: GPUTextureViewDescriptor
- _depthViewOptions: GPUTextureViewDescriptor
- _clearPassDescriptor: GPURenderPassDescriptor
- _timestampWrites: GPURenderPassTimestampWrites
Related:
- NodeMaterialObserver: Reuse `lightsData` cache entry per frame. #33425

Pre-allocates WebGPU descriptors, state-tracking objects, and scratch arrays on the render/compute hot paths so they're reused each frame instead of re-created per call. Also removes a `WeakMap.set` dedup pattern (`_activePipelines`) in favor of inline state the render loop already tracks.

Introduces a `TexelCopyTextureInfo` class wrapping the WebGPU IDL type, used as a reusable scratch for `copyTextureToTexture` / `copyFramebufferToTexture`.

Before / after — webgpu_backdrop_water, 10 s window, Inspector stripped, 1 KB sampling, 3-run average:

Total sampled heap

Net of sites this PR actually touches: ~−65 KB. The remaining ~6 KB of the total delta comes from V8's sampler re-attributing hits across the render chain after the descriptor / submit-array allocations disappear — not individually causal, only visible in aggregate.

WebGPU resource deltas (interceptor-counted)

Combined impact with follow-up PRs

Two smaller allocation-reduction changes were spun off into their own PRs:
- perf/uniforms-group-range-pooling — pool per-uniform `{ start, count }` objects in UniformsGroup
- perf/node-material-observer-lights-cache — reuse cached `lightsData` entry in NodeMaterialObserver

Applied together with this PR, total sampled-heap reduction is ~14-17% on the same workload (exact figure drifts ±3% between sessions due to sampler noise). Most of the win is from this PR's `_activePipelines` change alone; the follow-ups are structural correctness improvements that don't produce a measurable additional drop in isolation.

This contribution is funded by Spawn