perf(index): use NGram posting cardinalities for regex conjunction search #7390

Open
everySympathy wants to merge 1 commit into lance-format:main from everySympathy:codex/fast-regex-search
@everySympathy everySympathy commented Jun 22, 2026

Summary

This PR improves regex conjunction search on the NGram index by making posting-list evaluation more cost-aware.

The existing regex NGram acceleration can derive required trigrams from regex patterns, but conjunction queries may still load and intersect posting lists without considering posting-list size. This PR adds posting-list cardinality metadata and uses rare-first ordering to reduce unnecessary work, especially when a required trigram is missing or when a regex contains many common required trigrams.

Changes

| Area | Change |
| --- | --- |
| Index format | Adds a cardinality column to NGram postings, storing the row-count of each trigram posting list. |
| Query planning | Sorts required regex conjunction trigrams by cardinality before lookup. |
| Missing token handling | Short-circuits pure conjunction regex queries immediately when any required trigram is absent. |
| Bitmap intersection | Sorts loaded posting lists by actual bitmap cardinality before intersection, preserving rare-first CPU behavior after unordered async reads. |
| Compatibility | Keeps older two-column NGram postings readable when cardinality metadata is absent. |
| Remap/update | Recomputes cardinality when remapping posting lists and writes the destination in the current format. |
| Benchmarking | Extends regex_ngram with skewed conjunction cases for cardinality-aware posting-list planning and allows scaling row count with LANCE_REGEX_NGRAM_TOTAL. |
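The Remap/update row can be illustrated with a minimal sketch. It is not the code in this PR: `PostingRow`, `remap_posting`, the `u32` token type, and the mapping semantics (`None` means the row was deleted, a missing key means the row id is unchanged) are assumptions for illustration, and `BTreeSet<u64>` stands in for a Roaring bitmap.

```rust
use std::collections::{BTreeSet, HashMap};

/// One posting-list row in the assumed three-column layout
/// (tokens, cardinality, posting_list). Hypothetical type.
#[derive(Debug, PartialEq)]
struct PostingRow {
    token: u32,
    cardinality: u64,
    /// Stand-in for a Roaring bitmap of row ids.
    posting_list: BTreeSet<u64>,
}

/// Remap the row ids of one posting list.
/// Assumed semantics: `Some(new)` moves a row, `None` deletes it,
/// and a row id that is not in the mapping is kept unchanged.
fn remap_posting(row: &PostingRow, mapping: &HashMap<u64, Option<u64>>) -> PostingRow {
    let posting_list: BTreeSet<u64> = row
        .posting_list
        .iter()
        .filter_map(|id| match mapping.get(id) {
            Some(new_id) => *new_id,
            None => Some(*id),
        })
        .collect();
    PostingRow {
        token: row.token,
        // Recomputed from the remapped list instead of copied from the source,
        // so deletions are reflected in the metadata.
        cardinality: posting_list.len() as u64,
        posting_list,
    }
}
```

Recomputing the count from the remapped list keeps the metadata consistent with the bitmap even when rows were deleted during the remap.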

Why

For a regex conjunction, every required trigram must be present. If one required trigram is absent, the candidate set is empty and there is no need to load/intersect other posting lists.
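A rough sketch of this short-circuit, combined with the cardinality ordering listed under Changes, might look like the following. The names `Plan` and `plan_conjunction` and the `HashMap<u32, u64>` metadata shape are hypothetical and are not taken from the PR.

```rust
use std::collections::HashMap;

/// Outcome of planning a pure conjunction over required trigrams.
/// Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum Plan {
    /// At least one required trigram has no postings: nothing can match.
    Empty,
    /// Tokens whose posting lists should be loaded, rarest first.
    Load(Vec<u32>),
}

/// Plan lookups from cardinality metadata alone, without touching posting lists.
fn plan_conjunction(required: &[u32], cardinalities: &HashMap<u32, u64>) -> Plan {
    let mut present: Vec<(u64, u32)> = Vec::with_capacity(required.len());
    for token in required {
        match cardinalities.get(token) {
            Some(card) if *card > 0 => present.push((*card, *token)),
            // Missing (or empty) posting list: the conjunction is unsatisfiable.
            _ => return Plan::Empty,
        }
    }
    // Sort by (cardinality, token) so ties are broken deterministically.
    present.sort();
    present.dedup();
    Plan::Load(present.into_iter().map(|(_, token)| token).collect())
}
```

In this sketch, the decision to return `Plan::Empty` needs only the metadata, which is why no large posting list has to be read in the negative case.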

When all required trigrams are present, intersecting smaller posting lists first avoids cloning and intersecting large Roaring bitmaps early. This is the same selectivity principle used by inverted-index query planning: rare terms are more valuable filters than common terms.
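The rare-first intersection can be sketched as below. This is an illustration under assumptions, not the PR's implementation: `intersect_rare_first` is a hypothetical name and `BTreeSet<u64>` stands in for a Roaring bitmap.

```rust
use std::collections::BTreeSet;

/// Intersect posting lists starting from the smallest one.
/// `BTreeSet<u64>` is a stand-in for a Roaring bitmap of row ids.
fn intersect_rare_first(mut lists: Vec<BTreeSet<u64>>) -> BTreeSet<u64> {
    // Reads may complete in any order, so sort by actual size before intersecting.
    lists.sort_by_key(|list| list.len());
    let mut iter = lists.into_iter();
    let mut acc = match iter.next() {
        Some(first) => first,
        // No lists at all: this sketch returns an empty set and leaves the
        // "no required trigrams" case to the caller.
        None => return BTreeSet::new(),
    };
    for list in iter {
        if acc.is_empty() {
            // Nothing left to filter; the remaining lists cannot add rows.
            break;
        }
        // The accumulator is never larger than the rarest list,
        // so each step probes at most that many ids.
        acc.retain(|id| list.contains(id));
    }
    acc
}
```

Starting from the smallest list means the accumulator is moved rather than cloned, and the large common lists are only probed, never copied.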

Benchmark Results

Measured with the final rust/lance/benches/regex_ngram.rs benchmark. The baseline is HEAD^ plus the benchmark-only patch; it is compared against this PR.

Both runs use the same negative conjunction query:

commonmarkerabcdefghijklmnopqrstuvwx.*missingmarker

Default 200k-row run:

cargo bench -p lance --bench regex_ngram -- many_common_missing_trigram

10M-row scale run:

LANCE_REGEX_NGRAM_TOTAL=10000000 cargo bench -p lance --bench regex_ngram -- many_common_missing_trigram

| Dataset | Baseline | This PR | Change |
| --- | --- | --- | --- |
| 200k rows | [783.43 us, 803.03 us, 829.14 us] | [417.02 us, 435.05 us, 454.90 us] | ~46% faster |
| 10M rows | [10.734 ms, 10.779 ms, 10.843 ms] | [465.45 us, 471.39 us, 482.43 us] | ~22.9x faster |

The biggest improvement is in negative conjunction cases where a regex contains many common required trigrams plus a missing required trigram. The new path can return an empty candidate set before loading large common posting lists.
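For readers comparing the two Change figures, the arithmetic can be reproduced from the middle value of each bracketed triple in the table. The helper names below are hypothetical; the snippet is only a worked calculation, assuming the middle value is the point estimate.

```rust
/// Speedup factor: baseline time divided by new time (both in the same unit).
fn speedup(baseline_us: f64, new_us: f64) -> f64 {
    baseline_us / new_us
}

/// Fraction of the baseline time that was saved.
fn time_reduction(baseline_us: f64, new_us: f64) -> f64 {
    1.0 - new_us / baseline_us
}
```

With the 200k-row values, `time_reduction(803.03, 435.05)` is about 0.458, which matches the "~46%" figure read as time saved, and corresponds to a speedup of about 1.85x. With the 10M-row values, `speedup(10779.0, 471.39)` is about 22.9.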

Compatibility

| Index format | Behavior |
| --- | --- |
| New three-column postings: tokens, cardinality, posting_list | Uses cardinality metadata for regex conjunction planning. |
| Old two-column postings: tokens, posting_list | Still readable; token_cardinalities is None, and the index falls back safely. |

The old-format path checks the index file schema before choosing the projection. This avoids an intentional failed read on old-format postings and keeps compatibility covered in the NGram tests.
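The schema-first check could be sketched as follows. The column names come from the compatibility table above; `choose_projection` and its plain string-slice schema are assumptions for illustration and do not reflect the actual reader API.

```rust
/// Column names taken from the compatibility table.
const TOKENS: &str = "tokens";
const CARDINALITY: &str = "cardinality";
const POSTING_LIST: &str = "posting_list";

/// Choose which columns to read based on the field names in the file schema.
/// The returned flag says whether cardinality metadata is available.
/// Hypothetical helper: a real reader would inspect an Arrow schema instead.
fn choose_projection(schema_fields: &[&str]) -> (Vec<&'static str>, bool) {
    let has_cardinality = schema_fields.iter().any(|field| *field == CARDINALITY);
    if has_cardinality {
        // Current format: read all three columns.
        (vec![TOKENS, CARDINALITY, POSTING_LIST], true)
    } else {
        // Old format: never request a column that is known to be absent.
        (vec![TOKENS, POSTING_LIST], false)
    }
}
```

Deciding from the schema up front means an old-format file is read once with the right projection, rather than attempted with three columns and retried after an error.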

Testing

  • cargo fmt --all
  • cargo test -p lance-index ngram
  • cargo test -p lance --bench regex_ngram
  • cargo bench -p lance --bench regex_ngram -- many_common_missing_trigram
  • LANCE_REGEX_NGRAM_TOTAL=10000000 cargo bench -p lance --bench regex_ngram -- many_common_missing_trigram
  • cargo clippy --all --tests --benches -- -D warnings

github-actions Bot added the A-index (Vector index, linalg, tokenizer) and performance labels Jun 22, 2026

codecov Bot commented Jun 22, 2026

Codecov Report

❌ Patch coverage is 98.48485% with 3 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/ngram.rs | 98.48% | 1 Missing and 2 partials ⚠️ |

@everySympathy everySympathy force-pushed the codex/fast-regex-search branch 4 times, most recently from 807c6d9 to 59187b6 Compare June 23, 2026 16:06
@everySympathy everySympathy marked this pull request as ready for review June 23, 2026 16:11
@everySympathy everySympathy force-pushed the codex/fast-regex-search branch from 59187b6 to 1e33ff5 Compare June 24, 2026 02:59
@everySympathy everySympathy force-pushed the codex/fast-regex-search branch from 1e33ff5 to b1ba3f9 Compare June 24, 2026 03:05