perf(index): add sparse ngram tokenization #7413
Draft
everySympathy wants to merge 1 commit into
Member
Looks like a really cool idea!
Contributor
Author
Thanks! The idea was inspired by Cursor’s fast regex search blog post. I’ll keep pushing this PR forward.
What
Adds an optional sparse n-gram tokenization mode to the scalar NGram index.
The default index remains the existing fixed trigram mode. Users can opt into sparse tokenization with NGram index params:
    {"tokenization": "sparse"}
Sparse tokenization stores selected longer n-gram tokens instead of every fixed trigram. It is intended for long literal regex / contains predicates where each individual trigram is common but the longer literal itself is rare.
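The opt-in described above can be sketched as a small mode selector. This is a minimal illustration, not the PR's code: the enum mirrors the `NGramTokenization::{Trigram, Sparse}` names from the description, but `resolve_mode`, the `FromStr` impl, and the `"trigram"` spelling of the default are assumptions made for the example. Only `"sparse"` is confirmed by the description.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the PR's `NGramTokenization` enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum NGramTokenization {
    /// Existing behavior: index every fixed trigram (the default).
    #[default]
    Trigram,
    /// Opt-in: index selected longer n-grams.
    Sparse,
}

impl FromStr for NGramTokenization {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            // Assumed spelling; the description only shows "sparse".
            "trigram" => Ok(Self::Trigram),
            "sparse" => Ok(Self::Sparse),
            other => Err(format!("unknown tokenization mode: {other}")),
        }
    }
}

/// Resolve the mode from an optional `tokenization` parameter value.
/// A missing parameter falls back to the fixed trigram default.
fn resolve_mode(param: Option<&str>) -> Result<NGramTokenization, String> {
    match param {
        None => Ok(NGramTokenization::default()),
        Some(value) => value.parse(),
    }
}

fn main() {
    println!("{:?}", resolve_mode(None));
    println!("{:?}", resolve_mode(Some("sparse")));
    println!("{:?}", resolve_mode(Some("bogus")));
}
```

Making the fixed trigram mode the `Default` variant keeps existing indexes and callers unchanged unless the parameter is set explicitly.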
How
- Adds NGramTokenization::{Trigram, Sparse} and persists the selected mode in the postings file metadata.
- Adds regex_ngram benchmark coverage to compare fixed trigram and sparse n-gram indexes on the same dataset.
Benchmark
cargo bench -p lance --bench regex_ngram, 200k rows, fixed trigram index vs sparse n-gram index:
- regexp_match(doc, 'zqxwvu.*needlexyz')
- regexp_match(doc, '(zqxwvu|qwerasdf|needlexyz)')
- regexp_match(doc, 'zqxwvu')
- regexp_match(doc, 'a.b')
- regexp_match(doc, 'sparsemarkerabcdefghijklmnopqrstuvwx')
The last query is the target case: every row contains all fixed trigrams from the long literal, but only a small fraction contains the full literal. Fixed trigrams therefore produce a broad candidate set, while sparse longer n-grams stay selective.
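The effect in the target case can be reproduced on a toy scale. The sketch below is an assumption-laden illustration rather than the PR's algorithm: it uses plain fixed-length windows of a longer size (`n = 8`) as a stand-in for sparse n-grams, ASCII-only rows, and hypothetical helper names (`grams`, `is_candidate`, `count_candidates`). It shows rows that contain every trigram of a literal without containing the literal itself.

```rust
use std::collections::HashSet;

/// All overlapping `n`-byte windows of `text` (ASCII assumed for simplicity).
fn grams(text: &str, n: usize) -> HashSet<&[u8]> {
    text.as_bytes().windows(n).collect()
}

/// A row is a candidate if it contains every `n`-gram of the literal.
/// If the literal is shorter than `n`, it has no grams and every row passes,
/// which models an index that cannot filter for that query.
fn is_candidate(row: &str, literal: &str, n: usize) -> bool {
    let row_grams = grams(row, n);
    let literal_grams = grams(literal, n);
    literal_grams.iter().all(|g| row_grams.contains(g))
}

/// Number of rows the index would hand to the regex engine for rechecking.
fn count_candidates(rows: &[&str], literal: &str, n: usize) -> usize {
    rows.iter().filter(|row| is_candidate(row, literal, n)).count()
}

fn main() {
    let literal = "needlexyz";
    let rows = [
        "needlex lexyz",   // has every trigram of the literal, not the literal
        "lexyz needlex",   // same trigrams, different order
        "xx needlexyz yy", // the only true match
        "unrelated text",
    ];

    let true_matches = rows.iter().filter(|r| r.contains(literal)).count();
    println!("true matches:       {true_matches}");
    println!("trigram candidates: {}", count_candidates(&rows, literal, 3));
    println!("8-gram candidates:  {}", count_candidates(&rows, literal, 8));
}
```

The first two rows pass the trigram filter because pieces of the literal overlap by at least two characters, so every trigram appears somewhere in the row. Longer grams span the gap between the pieces and rule those rows out before the regex is evaluated.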
The benchmark was run locally with:
Additional local runs used the same command shape for selective_and, alternation, plain_literal, and non_accelerable_a_dot_b.
Testing
- cargo fmt --all
- cargo test --target-dir /tmp/lance-target-4ca9 -p lance-index ngram
- cargo clippy --target-dir /tmp/lance-target-4ca9 -p lance-index --tests --benches -- -D warnings
- cargo clippy --target-dir /tmp/lance-target-4ca9 -p lance --bench regex_ngram -- -D warnings