fix(index): use global BM25 scorer for WAND pruning #7467

Draft
BubbleCal wants to merge 1 commit into main from yang/oss-1345-use-global-bm25-scorer-for-partition-wand-pruning

@BubbleCal

Contributor

Bug Fix

What is the bug?

Partition WAND candidate generation used partition-local BM25 statistics while final indexed FTS scoring used the global BM25 scorer. Because WAND also shares a top-k floor across partitions, one partition could publish a threshold in a different score space and cause another partition to prune a candidate that should win under the global scorer.

What incorrect behavior does this cause?

Multi-partition FTS queries could drop the correct global top-k result when local term distributions differ from global term distributions.
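
To make the score-space mismatch concrete, here is a minimal sketch rather than Lance's actual code. It assumes the common Lucene-style BM25 IDF, `ln(1 + (N - df + 0.5) / (df + 0.5))`, with `k1 = 1.2` and `b = 0.75`. The two partitions, their statistics, and the names `idf`, `bm25`, and `scores` are all hypothetical.

```rust
const K1: f64 = 1.2;
const B: f64 = 0.75;

/// Lucene-style BM25 IDF (an assumption; the formula Lance uses may differ).
fn idf(num_docs: f64, doc_freq: f64) -> f64 {
    (1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5)).ln()
}

/// BM25 contribution of one term to one document.
fn bm25(term_idf: f64, tf: f64, doc_len: f64, avg_doc_len: f64) -> f64 {
    let norm = K1 * (1.0 - B + B * doc_len / avg_doc_len);
    term_idf * tf * (K1 + 1.0) / (tf + norm)
}

/// Scores of two hypothetical rows, (row_a, row_b), given the IDF applied to each.
/// Row A has tf = 1 and row B has tf = 5; both have average length.
fn scores(idf_a: f64, idf_b: f64) -> (f64, f64) {
    (
        bm25(idf_a, 1.0, 100.0, 100.0),
        bm25(idf_b, 5.0, 100.0, 100.0),
    )
}

fn main() {
    // Made-up statistics: the term is rare in partition A and common in partition B.
    let (n_a, df_a) = (1000.0, 2.0);
    let (n_b, df_b) = (1000.0, 900.0);

    // Partition-local score space: each row is scored with its own partition's IDF.
    let (local_a, local_b) = scores(idf(n_a, df_a), idf(n_b, df_b));

    // Global score space: both rows share one IDF built from combined statistics.
    let global_idf = idf(n_a + n_b, df_a + df_b);
    let (global_a, global_b) = scores(global_idf, global_idf);

    // With k = 1, partition A would publish local_a as the shared top-k floor.
    println!(
        "local:  a = {:.3}, b = {:.3}, b below floor = {}",
        local_a,
        local_b,
        local_b < local_a
    );
    println!(
        "global: a = {:.3}, b = {:.3}, b wins = {}",
        global_a,
        global_b,
        global_b > global_a
    );
}
```

In this toy setup, row B falls below the floor that partition A derives from local statistics, yet it outscores row A once both are scored with the same global IDF. That is the shape of failure described above.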

How does this PR fix the problem?

  • Materializes one active MemBM25Scorer for each indexed BM25 search and threads it through posting loading, WAND scoring, and final aggregation.
  • Computes posting query weights from the active scorer instead of partition-local IDF.
  • Recomputes compressed posting block upper bounds in the active scorer's score space before WAND pruning.
  • Uses a conservative BM25 ceiling for plain postings without usable scorer-space impacts.
  • Removes the obsolete partition-local IndexBM25Scorer path.
  • Adds a regression test where a shared threshold previously pruned the row that wins under global BM25.
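
The two upper-bound bullets above can be illustrated with a minimal sketch. It is not the PR's code: `GlobalScorer`, `block_upper_bound`, and `ceiling` are hypothetical names, the BM25 variant (Lucene-style IDF, `k1 = 1.2`, `b = 0.75`) is an assumption, and a block is modelled simply as a slice of `(tf, doc_len)` pairs. It shows a block bound recomputed under one corpus-wide scorer, plus a fallback ceiling of `idf * (k1 + 1)`, which bounds the BM25 term score for any term frequency and document length.

```rust
const K1: f32 = 1.2;
const B: f32 = 0.75;

/// Hypothetical stand-in for a scorer built from corpus-wide statistics.
struct GlobalScorer {
    num_docs: f32,
    avg_doc_len: f32,
}

impl GlobalScorer {
    /// Query weight (IDF) of a term, computed from its global document frequency.
    fn query_weight(&self, doc_freq: f32) -> f32 {
        (1.0 + (self.num_docs - doc_freq + 0.5) / (doc_freq + 0.5)).ln()
    }

    /// Term-frequency part of BM25: grows with tf, shrinks with doc_len.
    fn doc_weight(&self, tf: f32, doc_len: f32) -> f32 {
        let norm = K1 * (1.0 - B + B * doc_len / self.avg_doc_len);
        tf * (K1 + 1.0) / (tf + norm)
    }

    /// Full BM25 term score in the global score space.
    fn score(&self, doc_freq: f32, tf: f32, doc_len: f32) -> f32 {
        self.query_weight(doc_freq) * self.doc_weight(tf, doc_len)
    }

    /// Block bound: score of an imaginary posting with the block's largest tf
    /// and shortest document, which no real posting in the block can exceed.
    fn block_upper_bound(&self, doc_freq: f32, postings: &[(f32, f32)]) -> f32 {
        let max_tf = postings.iter().map(|p| p.0).fold(0.0, f32::max);
        let min_len = postings.iter().map(|p| p.1).fold(f32::INFINITY, f32::min);
        self.score(doc_freq, max_tf, min_len)
    }

    /// Conservative ceiling when no per-block information is usable:
    /// doc_weight stays below K1 + 1 for every tf and doc_len.
    fn ceiling(&self, doc_freq: f32) -> f32 {
        self.query_weight(doc_freq) * (K1 + 1.0)
    }
}

fn main() {
    let scorer = GlobalScorer { num_docs: 2000.0, avg_doc_len: 100.0 };
    let block: [(f32, f32); 3] = [(1.0, 120.0), (3.0, 80.0), (2.0, 200.0)];
    let doc_freq = 902.0;

    println!("block bound = {:.4}", scorer.block_upper_bound(doc_freq, &block));
    println!("ceiling     = {:.4}", scorer.ceiling(doc_freq));
}
```

Because the bound, the ceiling, and the final score all come from the same scorer, a threshold published by one partition is comparable with bounds computed in another. The ceiling is looser than a per-block bound, so it prunes less but never discards a posting that could still reach the top k.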

Linear: https://linear.app/lancedb/issue/OSS-1345/use-global-bm25-scorer-for-partition-wand-pruning

Validation

  • cargo fmt --all --check
  • CARGO_TARGET_DIR=/tmp/lance-target-oss-1345 cargo test -p lance-index test_bm25_search_shared_threshold_keeps_global_bm25_winner -- --nocapture
  • CARGO_TARGET_DIR=/tmp/lance-target-oss-1345 cargo test -p lance-index scalar::inverted::wand::tests -- --nocapture
  • CARGO_TARGET_DIR=/tmp/lance-target-oss-1345 cargo test -p lance io::exec::fts::tests::test_match_query_exec_with_base_scorer_matches_baseline -- --nocapture
  • CARGO_TARGET_DIR=/tmp/lance-target-oss-1345 cargo test -p lance-index fuzzy_and -- --nocapture
  • CARGO_TARGET_DIR=/tmp/lance-target-oss-1345 cargo clippy -p lance-index -p lance --tests -- -D warnings

@github-actions (bot) added the A-index (Vector index, linalg, tokenizer) and bug (Something isn't working) labels and removed the A-index label on Jun 25, 2026
@codecov

codecov Bot commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 94.79167% with 10 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/inverted/wand.rs | 86.44% | 7 Missing and 1 partial ⚠️ |
| rust/lance-index/src/scalar/inverted/index.rs | 98.42% | 2 Missing ⚠️ |


Labels

bug Something isn't working
