fix(index): give each FTS partition a distinct scheduler base priority #7449
Merged
Multi-partition inverted-index reads (FTS prewarm, corpus stats, df) fan out over partitions through one shared ScanScheduler. Every partition opened its files at base_priority 0, so all concurrent reads tied at the same priority.

The scheduler's backpressure deadlock-break admits the lowest-priority in-flight request without a byte-budget check, which requires a total order over in-flight priorities to guarantee one request can always advance and drain the byte budget. With every partition at priority 0 there is no unique lowest request, so when the shared 64-IOP scheduler saturates, freed IOP slots go to arbitrary tied tasks and no consumer's batch is guaranteed to complete and refund resources -> the I/O loop can wedge with work pending but none admissible.

Stamp each partition's IndexStore with its enumerate index as the base priority at load time, mirroring how a filtered read scan prioritizes each fragment. The stores share the same scheduler (Arc), so global backpressure is unchanged; only the priority differs, restoring the total order across partitions that the deadlock-break relies on. The single, non-concurrent metadata read stays at priority 0 (no tie).

Adds IndexStore::with_base_priority (default no-op for backends without a priority concept) and a test asserting partitions load with distinct, dense priorities.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
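The tie described in the commit message can be illustrated with a deliberately simplified model. This is only a sketch: `Task`, `min_in_flight`, `bypass_candidates`, and `partition_tasks` are hypothetical names, priorities are modelled as plain `u64` values, and byte budgets and IOP slots are left out. It only counts how many in-flight tasks satisfy the `task.priority <= min_in_flight` condition quoted in the PR description.

```rust
/// Simplified stand-in for a queued I/O request.
/// ASSUMPTION: this is not the scheduler's real task type.
#[derive(Debug, Clone, Copy)]
struct Task {
    priority: u64,
}

/// Numerically smallest priority among the in-flight tasks, if any.
fn min_in_flight(in_flight: &[Task]) -> Option<u64> {
    in_flight.iter().map(|t| t.priority).min()
}

/// Number of in-flight tasks that pass `task.priority <= min_in_flight`,
/// i.e. that would be eligible to skip the byte-budget check.
fn bypass_candidates(in_flight: &[Task]) -> usize {
    match min_in_flight(in_flight) {
        Some(min) => in_flight.iter().filter(|t| t.priority <= min).count(),
        None => 0,
    }
}

/// One task per partition: all at priority 0 (behaviour before the fix),
/// or stamped with the partition's position (behaviour after the fix).
fn partition_tasks(num_partitions: usize, distinct: bool) -> Vec<Task> {
    (0..num_partitions)
        .map(|i| Task {
            priority: if distinct { i as u64 } else { 0 },
        })
        .collect()
}
```

In this model, four partitions that all sit at priority 0 yield four candidates for the bypass, so nothing singles out one request to finish first. With priorities 0 through 3 there is exactly one candidate, which is the total order the deadlock-break depends on.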
Problem

FTS prewarm fans out over partitions through one shared ScanScheduler. Every partition opened its files at base_priority = 0, so all concurrent reads tied at the same priority.

The scheduler's deadlock-break admits the lowest-priority in-flight request without a byte-budget check (task.priority <= min_in_flight), which needs a total order to guarantee one request always advances. With all priorities tied at 0, freed IOP slots go to arbitrary tasks → no consumer drains → the I/O loop wedges (FTS index-cache prewarm hang on large multi-partition indexes).

Fix

Stamp each partition's IndexStore with its index as base_priority at load time (mirrors how a scan prioritizes each fragment). Stores share the same scheduler, so backpressure is unchanged; only the priority differs, restoring a total order across partitions.

Validation

Reproduced the hang on a 3-node cluster (600M-row multi-partition FTS index), verified the fix reaches complete=3/3. Adds a test for distinct, dense partition priorities.

🤖 Generated with Claude Code
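As a rough illustration of the fix and of the "distinct, dense" property the test checks, the sketch below stamps each partition's store with its enumerate index. Only the name `with_base_priority` comes from this PR. The trait shape, its signature, and the names `Store`, `SharedScheduler`, `ScheduledStore`, `InMemoryStore`, `load_partitions`, and `priorities_are_distinct_and_dense` are assumptions made for the example and may differ from the real code.

```rust
use std::sync::Arc;

/// Stand-in for the shared scan scheduler (ASSUMPTION: the real type holds I/O state).
struct SharedScheduler;

/// Hypothetical store trait. The default `with_base_priority` is a no-op,
/// for backends that have no priority concept.
trait Store: Sized {
    fn base_priority(&self) -> u64 {
        0
    }
    fn with_base_priority(self, _base_priority: u64) -> Self {
        self
    }
}

/// A store whose reads go through the shared scheduler.
struct ScheduledStore {
    scheduler: Arc<SharedScheduler>,
    base_priority: u64,
}

impl Store for ScheduledStore {
    fn base_priority(&self) -> u64 {
        self.base_priority
    }
    fn with_base_priority(mut self, base_priority: u64) -> Self {
        self.base_priority = base_priority;
        self
    }
}

/// A backend without priorities keeps the default no-op.
struct InMemoryStore;
impl Store for InMemoryStore {}

/// Open one store per partition, stamping each with its enumerate index.
/// Every store holds a clone of the same `Arc`, so the scheduler stays shared.
fn load_partitions(scheduler: &Arc<SharedScheduler>, partition_ids: &[u64]) -> Vec<ScheduledStore> {
    partition_ids
        .iter()
        .enumerate()
        .map(|(idx, _partition_id)| {
            ScheduledStore {
                scheduler: Arc::clone(scheduler),
                base_priority: 0,
            }
            .with_base_priority(idx as u64)
        })
        .collect()
}

/// True when the stores' priorities are exactly 0, 1, ..., n-1 in some order.
fn priorities_are_distinct_and_dense(stores: &[ScheduledStore]) -> bool {
    let mut seen: Vec<u64> = stores.iter().map(|s| s.base_priority()).collect();
    seen.sort_unstable();
    seen.iter().enumerate().all(|(i, p)| *p == i as u64)
}

/// True when every store points at the same scheduler instance.
fn all_share_scheduler(stores: &[ScheduledStore], scheduler: &Arc<SharedScheduler>) -> bool {
    stores.iter().all(|s| Arc::ptr_eq(&s.scheduler, scheduler))
}
```

Because the priority comes from the position in the partition list rather than from the partition id, it stays dense even when the ids are sparse, and sharing the one `Arc` is what leaves global backpressure unchanged.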