fix(index): give each FTS partition a distinct scheduler base priority #7449

Merged: LuQQiu merged 6 commits into lance-format:main from LuQQiu:lu/fix_prewarm_priority on Jun 24, 2026

@LuQQiu (Contributor) commented Jun 24, 2026

Problem

FTS prewarm fans out over partitions through one shared ScanScheduler. Every partition opened its files at base_priority = 0, so all concurrent reads tied at the same priority.

The scheduler's deadlock-break admits the lowest-priority in-flight request without a byte-budget check (task.priority <= min_in_flight). That rule needs a total order over priorities to guarantee that one request can always advance. With all priorities tied at 0, freed IOP slots go to arbitrary tasks, no consumer drains, and the I/O loop wedges. This is the FTS index-cache prewarm hang seen on large multi-partition indexes.
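
To make the tie problem concrete, here is a toy model of the bypass rule. It is not the Lance scheduler: the `Task` struct, the `bypass_candidates` function, and the convention that a smaller number is served first are assumptions made for illustration. Only the comparison `task.priority <= min_in_flight` comes from the description above.

```rust
/// Toy task record. Hypothetical; not a Lance type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Task {
    pub id: usize,
    pub priority: u64,
}

/// Returns the ids of queued tasks that would skip the byte-budget check,
/// i.e. tasks whose priority is <= the smallest priority currently in flight.
/// If nothing is in flight, every queued task qualifies in this toy model.
pub fn bypass_candidates(queued: &[Task], in_flight: &[Task]) -> Vec<usize> {
    let min_in_flight = in_flight
        .iter()
        .map(|t| t.priority)
        .min()
        .unwrap_or(u64::MAX);
    queued
        .iter()
        .filter(|t| t.priority <= min_in_flight)
        .map(|t| t.id)
        .collect()
}
```

In this model, three queued tasks at priority 0 with one task in flight at priority 0 all qualify for the bypass, so nothing singles out which one should advance. With queued priorities 0, 1, 2 and in-flight priorities 0 and 1, only the task at priority 0 qualifies.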

Fix

Stamp each partition's IndexStore with its partition index as base_priority at load time (this mirrors how a scan prioritizes each fragment). The stores share the same scheduler, so backpressure is unchanged. Only the priority differs, which restores a total order across partitions.
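
The stamping step can be sketched roughly as follows. `SharedScheduler`, `PartitionStore`, and `stamp_partitions` are hypothetical stand-ins invented for this sketch, not Lance's `ScanScheduler` or `IndexStore`; the only ideas taken from the PR are that each partition gets its index as the base priority and that all stores keep pointing at one shared scheduler.

```rust
use std::sync::Arc;

/// Stand-in for the shared scheduler. Hypothetical; not Lance's ScanScheduler.
#[derive(Debug)]
pub struct SharedScheduler {
    pub io_capacity: u32,
}

/// Stand-in for a per-partition store handle. Hypothetical; not Lance's IndexStore.
#[derive(Debug, Clone)]
pub struct PartitionStore {
    pub scheduler: Arc<SharedScheduler>,
    pub base_priority: u64,
}

impl PartitionStore {
    /// New handle at the default base priority of 0.
    pub fn new(scheduler: Arc<SharedScheduler>) -> Self {
        Self { scheduler, base_priority: 0 }
    }

    /// Returns a handle that shares the same scheduler but carries a new base priority.
    pub fn with_base_priority(&self, base_priority: u64) -> Self {
        Self {
            scheduler: Arc::clone(&self.scheduler),
            base_priority,
        }
    }
}

/// Gives partition `i` the base priority `i`, so no two partitions tie.
pub fn stamp_partitions(store: &PartitionStore, num_partitions: usize) -> Vec<PartitionStore> {
    (0..num_partitions)
        .map(|idx| store.with_base_priority(idx as u64))
        .collect()
}
```

Because every stamped handle clones the same `Arc`, any limit enforced by the scheduler stays global in this sketch; only the priority tag differs per partition.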

Validation

Reproduced the hang on a 3-node cluster (600M-row multi-partition FTS index) and verified that the fix reaches complete=3/3. Adds a test asserting distinct, dense partition priorities.
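
As a rough illustration of what a "distinct, dense" check could look like, the helper below treats dense as "exactly the values 0..n with no gaps or repeats". That reading, and the name `is_distinct_and_dense`, are assumptions; the actual test in the PR may assert this differently.

```rust
/// Returns true when `priorities` is exactly {0, 1, ..., n-1} in any order,
/// which implies both distinctness (no ties) and density (no gaps).
pub fn is_distinct_and_dense(priorities: &[u64]) -> bool {
    let mut sorted = priorities.to_vec();
    sorted.sort_unstable();
    sorted
        .iter()
        .enumerate()
        .all(|(i, &p)| p == i as u64)
}
```

Under this definition the pre-fix state (every partition at priority 0) fails as soon as there are two or more partitions, while priorities assigned from the partition index pass.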

🤖 Generated with Claude Code

LuQQiu and others added 2 commits June 24, 2026 13:51
Multi-partition inverted-index reads (FTS prewarm, corpus stats, df)
fan out over partitions through one shared ScanScheduler. Every
partition opened its files at base_priority 0, so all concurrent reads
tied at the same priority.

The scheduler's backpressure deadlock-break admits the lowest-priority
in-flight request without a byte-budget check, which requires a total
order over in-flight priorities to guarantee one request can always
advance and drain the byte budget. With every partition at priority 0
there is no unique lowest request, so when the shared 64-IOP scheduler
saturates, freed IOP slots go to arbitrary tied tasks and no consumer's
batch is guaranteed to complete and refund resources -> the I/O loop
can wedge with work pending but none admissible.

Stamp each partition's IndexStore with its enumerate index as the base
priority at load time, mirroring how a filtered read scan prioritizes
each fragment. The stores share the same scheduler (Arc), so global
backpressure is unchanged; only the priority differs, restoring the
total order across partitions that the deadlock-break relies on. The
single, non-concurrent metadata read stays at priority 0 (no tie).

Adds IndexStore::with_base_priority (default no-op for backends without
a priority concept) and a test asserting partitions load with distinct,
dense priorities.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
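
The "default no-op" shape mentioned in the commit message could look something like the sketch below. The `Store` trait, `MemoryStore`, and `ScheduledStore` are hypothetical and the signatures are guesses made for illustration; the real `IndexStore::with_base_priority` may take and return different types.

```rust
/// Hypothetical store trait; not Lance's IndexStore.
pub trait Store: Sized {
    /// Backends without a priority concept report 0.
    fn base_priority(&self) -> u64 {
        0
    }

    /// Default no-op: the requested priority is ignored and the store is returned unchanged.
    fn with_base_priority(self, _base_priority: u64) -> Self {
        self
    }
}

/// A backend with no notion of priority; it relies on both defaults.
pub struct MemoryStore;

impl Store for MemoryStore {}

/// A backend that forwards a base priority to its scheduler (not modeled here).
pub struct ScheduledStore {
    pub base_priority: u64,
}

impl Store for ScheduledStore {
    fn base_priority(&self) -> u64 {
        self.base_priority
    }

    fn with_base_priority(self, base_priority: u64) -> Self {
        Self { base_priority }
    }
}
```

With a default like this, callers can stamp every partition unconditionally, and only backends that override the method change behavior.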
github-actions bot added the A-index (Vector index, linalg, tokenizer) and bug (Something isn't working) labels on Jun 24, 2026
LuQQiu requested review from jackye1995 and westonpace on June 24, 2026 21:27

@westonpace (Member) left a comment

Looks good to me!

@jackye1995 (Contributor) left a comment

looks good to me!

codecov bot commented Jun 24, 2026

Codecov Report

❌ Patch coverage is 82.69231% with 9 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/inverted/builder.rs | 0.00% | 9 Missing ⚠️ |

@LuQQiu LuQQiu merged commit 2ac811f into lance-format:main Jun 24, 2026
30 checks passed