RTC: fix stale delete propagation issue (symptom: delete causes temporary divergence between clients, which heals to incorrect result with deleted content getting re-added)#78320
Conversation
|
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message. To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
|
Warning: Type of PR label mismatch To merge this PR, it requires exactly 1 label indicating the type of PR. Other labels are optional and not being checked here.
Read more about Type labels in Gutenberg. Don't worry if you don't have the required permissions to add labels; the PR reviewer should be able to help with the task. |
b42f417 to
501f2c6
Compare
501f2c6 to
b67abf1
Compare
|
@danluu Thanks for this! A bit of initial feedback on the reproduction video. In short, a link to the e2e test code in the video would be helpful, because the steps do not reproduce for me. The reproduction video shows a very simple situation leading to data loss / duplication. Usually these PRs have come with an e2e test (or link to one in a branch) I can use to understand the nitty-gritty of the reproduction, but this one only has unit tests and no linked e2e test. Here are my attempts doing the same simple reproduction on With WebSockets enabled paragraph-delete-ws-non-repro.movBlocks converge as expected. I also tried on a non-WebSockets version: paragraph-delete-non-repro.movI also tried CLI-generating a post in a non-WebSockets version, in case it was like #77666 and reproed primarily on posts that were CLI-generated and had no prior user save. Still no repro: paragraph-delete-cli-non-repro.movI'm not saying the reproduction video is fabricated, but I think there's more going on behind the scenes to make the reproduction possible that isn't documented here. It's completely fine if it's specific circumstances or narrow timing, but I have no easy way to discern from the video alone. I'll continue looking into the lower-level unit testing to better understand the fix, but as it stands the reproduction video appears misleading. My sincere apologies if this failure to reproduce is due to something I missed. |
|
I made some progress trying to reproduce by introducing a 10-second delay between Yjs operation merging and the application of the record update: // In packages/sync/src/manager.ts:
const onRecordUpdate = async (
_events: Y.YEvent< any >[],
transaction: Y.Transaction
): Promise< void > => {
if (
transaction.local &&
! ( transaction.origin instanceof Y.UndoManager )
) {
return;
}
console.log( '[remote update applied]' );
await new Promise( ( resolve ) => setTimeout( resolve, 10000 ) ); // Delay 10s
console.log( '[applying updateEntityRecord]' );
void internal.updateEntityRecord( objectType, objectId );
};When I try this on trunk with the 10s delay, I can reproduce the issue in the description! trunk-delay-update-repro.movFor AI support, here's the reproduction steps: Using the WebSockets setup from #78363 with
I tried the same on this PR's branch. The first run worked the the problem appeared fixed: pr-delay-fix.movWeirdly, though, on a second attempt I saw the same behavior as on trunk in the very next test: pr-delay-fix2-notfixed.movIn this case the |
This is part of an AI fuzzing project, where an AI wrote a fuzzer and then triages bugs from the fuzzer and creates fixes. See #77716 for the tracking issue. As of this writing, there have been no known false positives from this project, but there have been some issues, which are documented in #77716. I expect we’ll see false positives at some point (and may even have one that’s been filed in a PR that hasn’t been inspected by a code owner yet). However, analyses, alleged root causes, etc., are often not correct and any statement from the AI about why something happened should be taken with a grain of salt.
What?
This is another staleness issue that is related to, but not caused by or fixed by, #77876. The class of failure discussed in this PR repros against trunk and also with #77876 merged. It's possible this fix should supersede that one or can be broadened to supersede that one (TBD, I haven't looked into that yet).
rtc-stale-delete-long-wait-wrong-convergence.mp4
This was the next PR in the fuzzing queue and, coincidentally, it relates to a class of issue @dmsnell, @alecgeatches, @maxschmeling, and I just discussed.
In the AI text below, it mentions that we can't distinguish between
We have blocks A, B, R. A' is a modified A. In the local snapshot, we see that B still exists, R is gone, and A has been modified. With the information we have, it seems like we can't tell the difference between:
Note this fix handles the in-memory path but (if codex isn't lying), this fix doesn't try to handle the save/reload path. I don't think I have the full context there, but my understanding is that there was some resistance to serializing a CRDT ID or anything equivalent that's metadata that's not visible content?
I'm not a distributed systems person, so I could be wrong about this, but I think that we'll have some version of this issue on save and reload in one way or another if we don't serialize some kind of provenance information.
AI TEXT
The headline representative is
890d98d04cda. The cleanest reduction for the same user-facing family is3ac375556552: one collaborator appends a normal top-level paragraph, both editors converge, and another collaborator deletes that visible paragraph through the normal block UI. The deleting peer removes it, but the other peer retains it, or the lower-level Yjs repro reintroduces it on both peers.Identifier note: short hexadecimal strings such as
890d98d04cda,3ac375556552, and007e79caf228are RTC fuzzing finding identifiers, notGit commit hashes. The fuzzing and reduction pipeline uses them to name
specific generated findings and their associated local repro artifacts. A
"representative" is one concrete finding chosen to stand for a broader
user-facing bug family after related findings have been clustered or reduced.
These identifiers are useful for matching this explanation to local fuzzing
metadata and videos, but they are not expected to resolve in GitHub.
Local realistic video evidence:
Evidence
890d98d04cda, "Top-level delete propagation leaves a stale block on a peer"; related clean representatives:3ac375556552,007e79caf228.3ac375556552/realistic-results/append-from-tail-enter-attempt-0.jsonrecordsstatesEqualAfterInsert: true,exactDeleteBug: true, andstatesEqualAfterDelete: false.End,Enter, and typing, then A deletes by selecting the inserted paragraph and using block optionsDelete. No injected transport fault, reload, revision restore, parser edge case, or artificial block is needed.3ac375556552/unit-repro-result.jsonrecordsremoteStillPresent: true; after the delete, the Yjs state still containsseed-954092-remote-paragraph.codex/rtc-stale-delete-890d98d04cda-20260514at2f59cd94b1ba41b004f4707ff25674506c81d796does not contain PR RTC: fix stale block snapshot overwriting newer state #77876 head6aad4e5801af23a5b42cd9fde4044895f06946ea, butrealistic-results-no-77876-current/append-from-tail-enter-attempt-0.jsonstill records
statesEqualAfterInsert: true,exactDeleteBug: true, andstatesEqualAfterDelete: false.RTC_3AC3_DELETE_CONVERGENCE_TIMEOUT_MS=60000, theappend-from-tail repros no longer diverged after delete, but they converged
to the wrong state. Both editors still contained the deleted paragraph:
realistic-results-long-wait-60000-current/append-from-tail-enter-attempt-0.jsonand
append-from-tail-enter-attempt-2.jsonrecordstatesEqualAfterDelete: true,semanticDeleteBug: true, andinsertedPresentAfterDelete: true.origin/trunkat2b5a7a9930490b13933a89c69f4455252072c14d, the exact3ac375556552unit delete probe converges cleanly, but a siblingstale-top-level probe reproduces both sides of the same full-snapshot
ambiguity: remote append is lost after a stale local edit, and remote delete
is resurrected after a stale local edit. Result:
unit-current-origin-trunk-2b5a7a99304/stale-top-level-family-probe-result.json.Gutenberg plugin actually loaded, the repo-local normal-UI stale-save loop
reproduced on repeat 1. The primary editor reloaded with
["Alpha local stale save loop 1", "Beta"], losing collaborator B'sGamma remote stale save loop 1paragraph. Artifacts:/Users/danluu/dev/fuzz/rtc-repros/current-trunk-stale-top-level-built-20260514/playwright-artifacts/test-results/editor-collaboration-colla-620e7-ith-overlapping-draft-saves-chromium/.How It Was Introduced
The original substrate is PR #72262 /
84019935998c16f877e976ad85e84748355d7282, "Improve CRDT merge logic for post entities". That commit createdpackages/core-data/src/utils/crdt-blocks.tsand the left/right full-array merge path used for block CRDT updates.The architectural gap is that Gutenberg feeds the CRDT merge layer full block
snapshots from the editor, not explicit user operations. A later local snapshot
can be stale relative to the current Yjs block array. Without reliable
per-snapshot base provenance, the merge layer cannot distinguish these cases:
That tuple can mean either:
R, soRshould be preserved;Rand deleted it, soRshould be deleted.Current
origin/trunkstill has the first side of this bug family. A stalelocal snapshot can erase an unseen remote top-level append, or resurrect a
remote top-level delete, because the left/right merge infers array structure
operations from a stale full snapshot.
Later stale-save/stale-snapshot fixes tried to repair that first side by
preserving remote work from stale snapshots. The local branch that reproduces
the clean browser delete bug does not contain PR #77876, but it does contain an
earlier local copy of that repair line, including
7bc178d07b781ce5c24a7597e8e5c412534806d0, cherry-picked fromcd6822b89c95050e56d39cd217a6bf9e036af315. PR #77876 later carries related stale-snapshot work, but the browser repro shows #77876 itself is not the original introducer.Those stale-snapshot repairs have the opposite ambiguity. A preservation rule
without exact outgoing-snapshot provenance can also preserve a block that the
local editor did observe and then delete:
If
Rwas visible in the editor and the outgoing snapshot is based on a viewthat contained
R, omission means delete. If the outgoing snapshot is olderthan
R, omission means stale no-op. Membership in current Y state,previousLocalBlocksCache, or provider receipt is not enough to tell thoseapart.
Nearby but not likely direct introducers:
c214929139f50337250efe2bb24ff82c3ff2b6aamade sync a side-concern over normal editor state. It made this class of stale full-snapshot issue easier to hit, but did not add the block-array merge policy.128a3c29b7f1db4f35faf9e326dd1e5e7ac11104expandedmergeCrdtBlocks()testing but missed observed remote insert then local delete.09a21c64b5b92c2626bd93065d4a0192eb4fac47and PR #77164 /a6bfd3e55432981c7c2cb09190ee77954530b1a5changed nested/table/array attribute stability, not the top-level observed-delete policy.Scope caveat: most direct proof is for clean representative
3ac375556552. The headline representative890d98d04cdais the same user-facing stale-delete family, but some audit passes flagged a possible additional stale editor-store refresh path throughgetPostChangesFromCRDTDoc().007e79caf228also deserves separate reduction before treating every related hash as the same exact internal mechanism.Fix Plan
The fix must address the shared full-snapshot ambiguity, not just one
stale-preservation helper or one PR branch. The plan is:
snapshots, not as a single-PR regression.
BlockSnapshotProvenanceas the exact immutable editor-visible baseblock tree that produced this outgoing
blockssnapshot, or as explicitoperations/tombstones carrying equivalent information.
path. This does not require serializing Gutenberg
clientIds into postHTML, but a complete fix needs some persisted identity/provenance in the
CRDT/persistence path; serialized block shape, text, and position alone
cannot distinguish observed deletes from stale no-op snapshots.
block array is produced. Pass it through
editEntityRecord()/SyncManager.update()/applyChangesToCRDTDoc()intomergeCrdtBlocks().Do not reconstruct it inside
mergeCrdtBlocks()from current Y state.on ordinary editor edits.
Rand the new local snapshot omitsR, delete wins;R, preserveRas unseen remote work;flat
clientIdset is only a narrow top-level optimization; it does notprove moves, reparenting, nested
innerBlocks, duplicate/missingclientIds, or delete-plus-reinsert semantics. If the first patch istop-level-only, state that scope and add a follow-up for nested block arrays.
array: base, incoming local snapshot, and current Y state. Current-only
blocks absent from the incoming local snapshot should be preserved only when
the snapshot base did not include them. Blocks absent from current Y state
should not be resurrected from a stale local snapshot unless provenance
proves explicit local reinsert intent.
queued stale snapshot can arrive after a remote update and must not become a
false delete or false reinsert.
previousLocalBlocksCache,yblocks.toJSON(), CRDT receipt,provider sync, latest entity record, or global store dispatch as proof of
user/editor observation. The relevant fact is what base produced the exact
outgoing local snapshot.
in the edit/save flow, REST
content, CRDTcontent, and_crdt_documentmust not be overwritten by stale serialized HTML, including save-time
content-only materialization and
prePersistPostType()paths.both current trunk's "stale snapshot destroys remote work" and the local
known-fix branch's "preservation hides observed delete".
preserves remote append and the other deletes observed block
R;remote update is applied;
innerBlocksappend/delete, or an explicit top-level-only scope test;contentis derived from mergedblocks, and RESTcontent, CRDTcontent,_crdt_document, and a freshthird editor agree;
890d98d04cdastale editor-store refresh hypothesis and for007e79caf228.Bad fixes to avoid:
position, or block count alone;
clientIdor a unique marker.Fix Plan Audit
The current audit incorporates the branch-without-#77876 check and the
current-trunk evidence. Raw Codex outputs are under:
The audit converged on these conclusions:
reconcileStaleLocalBlocks()heuristic would still permit stale snapshots toerase unseen remote work or mask observed deletes.
produced the outgoing
blockssnapshot, or explicit operations/tombstoneswith equivalent information.
in the RTC path. It does not need to write Gutenberg
clientIds intoserialized post HTML, but it cannot rely only on block text, type, position,
or count.
previousLocalBlocksCache, current Y state, provider receipt, latest storestate, and global dispatch are not valid observation signals.
cannot be normal for the editor path because it intentionally masks observed
deletes.
clientIdset is not a complete fix if recursive block arrays are inscope. Either make provenance path-aware or explicitly scope the first patch
to top-level arrays.
save-time stale serialized HTML bypasses the merged block tree.
then local block-options delete, and remote-delete resurrection prevention.
The audit reduced the action items to:
RTC persistence path.
Real Bug Or False Positive
Verdict: real bug, not a false positive.
Evidence hierarchy:
statesEqualAfterInsert: true,exactDeleteBug: true,statesEqualAfterDelete: false.clientIdremains present.provenance; later stale-snapshot preservation repairs can expose the
opposite-side observed-delete ambiguity even without PR RTC: fix stale block snapshot overwriting newer state #77876.
False-positive audits:
Meta-audit corrections:
previousBlocks,currentBlocks, andlocalBlocksToSyncat delete time.3ac375556552to the890d98d04cdastale-delete family and state the remaining uncertainty.What would invalidate the claim:
previousLocalBlocksCachealready contained the remote block before the delete merge;Follow-Up Instrumentation
To move the root-cause claim from high-confidence to proven for the browser path, instrument the delete merge to record:
previousLocalBlocksCacheclientIds;reconcileStaleLocalBlocks()spliced the deleted block back intoblocksToSync.That instrumentation should be temporary or test-only. The committed regression should remain a durable repo-local unit/browser test, not a triage artifact wrapper.
END AI TEXT