fix: sync through empty epochs in range sync#9417
Conversation
A peer's correct response for an epoch in which every slot was skipped is zero blocks. Range sync rejected that response, so the node could not sync past an empty epoch during periods of poor chain liveness. Two interacting defects, both introduced during the fulu/PeerDAS sync work: 1. validateBlockByRangeResponse threw MISSING_BLOCKS_RESPONSE on a 0-block response. In sendBatch this became a downloadingError; since every peer returns empty for a genuinely empty epoch, all retries failed and after MAX_BATCH_DOWNLOAD_ATTEMPTS the batch threw MAX_DOWNLOAD_ATTEMPTS and the whole SyncChain errored out. A second copy of the throw in validateResponses fired for post-Deneb epochs that also carry a blobs/columns request. 2. The MAX_LOOK_AHEAD_EPOCHS guard in includeNextBatch counted batches in the AwaitingValidation state. Empty batches reach AwaitingValidation but never call advanceChain (only non-empty batches do), so lastEpochWithProcessBlocks never advances. A run of empty epochs longer than MAX_LOOK_AHEAD_EPOCHS (2) filled the look-ahead window, includeNextBatch returned null, and the chain stalled with nothing left to download or process. This deadlock is why the naive "return [] instead of throw" fix had previously been reverted. Archeology: - downloadByRange.ts was introduced in #8200 (feat: refactor block input, 2025-09-18) -- the fulu block-input/PeerDAS sync rewrite that deleted the old Deneb-era beaconBlocksMaybeBlobsByRange.ts downloader, which had tolerated empty epochs. The empty-blocks throw was then cemented in #8482 (fix: refactor validateColumnsByRangeResponse, 2025-10-27, b992c32). - The MAX_LOOK_AHEAD_EPOCHS guard was added 2025-08-11 (38889e2) during PeerDAS sync work. The original 2021 BATCH_BUFFER_SIZE check directly above it (f46b63f, dapplion) deliberately excluded AwaitingValidation batches "to prevent stalling sync if the current processing window is contained in a long range of skip slots." The new guard did not replicate that carve-out, re-creating the exact stall the 2021 code was written to avoid. Fix: - validateBlockByRangeResponse returns an empty result with a MISSING_BLOCKS_RESPONSE warning instead of throwing. Whether an epoch is genuinely empty is enforced at the chain level: the empty batch is held in AwaitingValidation and only confirmed once a later batch imports a block, so a peer cannot stall sync by falsely claiming an epoch is empty. - validateResponses skips data-sidecar validation when there are no blocks in the data request's slot range (parent-by-root columns and envelopes still run). - includeNextBatch's look-ahead guard ignores AwaitingValidation batches, mirroring the BATCH_BUFFER_SIZE carve-out. Tests: - validateBlockByRangeResponse accepts an empty response during chain liveness issues. - validateResponses accepts an empty epoch that carries a data request. - chain.test.ts gains a "3 epochs of skipped slots" case -- a run longer than MAX_LOOK_AHEAD_EPOCHS -- which previously deadlocked. Refs: #8147
validateResponses had three separate "nothing to validate here, fall through to parent-by-root columns + envelope validation" paths, with that parent + envelope tail duplicated between the no-data-request branch and the main path. The empty-epoch fix made the duplication more pronounced by adding a third skip path. Collapse them: gate getBlocksForDataValidation on `dataRequest` (defaulting to an empty list when absent) and let the single shared tail handle parent-by-root columns and envelopes for every case. This deletes the no-data-request early return entirely, and flattens data-sidecar validation from an outer `if (blocksForDataValidation.length > 0)` wrapper into per-request guards. Behavior-preserving: an envelopes-only / parent-payload-only request and an empty epoch now flow through the same path that already validated them, minus the duplicated branch. Net -20 lines. Adds a characterization test pinning the no-data-request (envelopes-only) path, which previously had no direct unit coverage.
There was a problem hiding this comment.
Code Review
This pull request addresses sync deadlocks during periods of poor chain liveness by allowing the sync chain to process empty epochs (epochs with zero blocks) without stalling or throwing errors. It updates the look-ahead window check to only count pending batches, preventing empty batches in AwaitingValidation from filling the window. Additionally, it refactors the response validation to return warnings instead of throwing errors when zero blocks are returned for an epoch, and adds corresponding regression tests. A review comment suggests simplifying a redundant ternary operator in validateResponses when passing validated blocks to getBlocksForDataValidation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fe749a76a7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Performance Report✔️ no performance regression detected Full benchmark results
|
Motivation
A peer's correct response for an epoch in which every slot was skipped is zero
blocks. Range sync rejected that response, so the node could not sync past an
empty epoch during periods of poor chain liveness.
Two interacting defects, both introduced during the fulu/PeerDAS sync work:
validateBlockByRangeResponse threw MISSING_BLOCKS_RESPONSE on a 0-block
response. In sendBatch this became a downloadingError; since every peer
returns empty for a genuinely empty epoch, all retries failed and after
MAX_BATCH_DOWNLOAD_ATTEMPTS the batch threw MAX_DOWNLOAD_ATTEMPTS and the
whole SyncChain errored out. A second copy of the throw in validateResponses
fired for post-Deneb epochs that also carry a blobs/columns request.
The MAX_LOOK_AHEAD_EPOCHS guard in includeNextBatch counted batches in the
AwaitingValidation state. Empty batches reach AwaitingValidation but never
call advanceChain (only non-empty batches do), so lastEpochWithProcessBlocks
never advances. A run of empty epochs longer than MAX_LOOK_AHEAD_EPOCHS (2)
filled the look-ahead window, includeNextBatch returned null, and the chain
stalled with nothing left to download or process. This deadlock is why the
naive "return [] instead of throw" fix had previously been reverted.
Description
MISSING_BLOCKS_RESPONSE warning instead of throwing. Whether an epoch is
genuinely empty is enforced at the chain level: the empty batch is held in
AwaitingValidation and only confirmed once a later batch imports a block, so
a peer cannot stall sync by falsely claiming an epoch is empty.
the data request's slot range (parent-by-root columns and envelopes still
run).
mirroring the BATCH_BUFFER_SIZE carve-out.
Archeology
2025-09-18) -- the fulu block-input/PeerDAS sync rewrite that deleted the old
Deneb-era beaconBlocksMaybeBlobsByRange.ts downloader, which had tolerated
empty epochs. The empty-blocks throw was then cemented in fix: refactor validateColumnsByRangeResponse #8482 (fix:
refactor validateColumnsByRangeResponse, 2025-10-27, b992c32).
PeerDAS sync work. The original 2021 BATCH_BUFFER_SIZE check directly above it
(f46b63f, dapplion) deliberately excluded AwaitingValidation batches "to
prevent stalling sync if the current processing window is contained in a long
range of skip slots." The new guard did not replicate that carve-out,
re-creating the exact stall the 2021 code was written to avoid.
AI Assistance Disclosure