perf(index): add sparse ngram tokenization#7413

Draft
everySympathy wants to merge 1 commit into lance-format:main from everySympathy:codex/sparse-ngram-regex-main

@everySympathy

Contributor

What

Adds an optional sparse n-gram tokenization mode to the scalar NGram index.

The default index remains the existing fixed trigram mode. Users can opt into sparse tokenization with NGram index params:

{"tokenization": "sparse"}

Sparse tokenization stores selected longer n-gram tokens instead of every fixed trigram. This is intended for long literal regex / contains predicates where each individual trigram is common, but the longer literal itself is rare.
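As a rough illustration of the opt-in behavior described above, the sketch below resolves an optional `tokenization` parameter value into a mode and falls back to trigrams when the parameter is absent. The enum name follows this PR, but `resolve_tokenization`, the `"trigram"` spelling of the default, and the error type are assumptions made for illustration, not the PR's actual parsing code.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the two tokenization modes named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum NGramTokenization {
    /// Existing behavior and the default: index every fixed 3-character window.
    #[default]
    Trigram,
    /// Opt-in mode: index selected longer n-grams.
    Sparse,
}

impl FromStr for NGramTokenization {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            // "trigram" as the explicit spelling of the default is an assumption.
            "trigram" => Ok(Self::Trigram),
            "sparse" => Ok(Self::Sparse),
            other => Err(format!("unknown ngram tokenization: {other}")),
        }
    }
}

/// Hypothetical helper: a missing parameter keeps the existing trigram mode.
pub fn resolve_tokenization(param: Option<&str>) -> Result<NGramTokenization, String> {
    match param {
        Some(value) => value.parse(),
        None => Ok(NGramTokenization::default()),
    }
}
```

Rejecting unknown values instead of silently falling back is a design choice of this sketch; it surfaces typos in index params at creation time.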

How

  • Adds NGramTokenization::{Trigram, Sparse} and persists the selected mode in the postings file metadata.
  • Keeps existing trigram behavior as the default for compatibility.
  • Builds sparse index tokens from normalized alphanumeric runs using deterministic span selection and stable hashing.
  • Reuses the existing regex-to-ngram planner, but lets it derive token requirements against the index tokenization mode.
  • Extends regex_ngram benchmark coverage to compare fixed trigram and sparse n-gram indexes on the same dataset.
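To make the third bullet more concrete, here is a minimal sketch of sparse token construction: split text into normalized alphanumeric runs, pick spans deterministically, and hash each span with a stable hash. Everything specific here is an assumption rather than the PR's code: the helper names are hypothetical, FNV-1a stands in for whatever stable hash the PR uses, and the selection rule (keep a span when both border bigram weights strictly exceed every interior bigram weight) is one publicly described sparse n-gram scheme. Note that this particular rule still emits every trigram alongside the longer spans, so the PR's selection may well be sparser.

```rust
/// 64-bit FNV-1a, used here only because its output is stable across runs.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Split text into lowercased ASCII-alphanumeric runs.
fn alnum_runs(text: &str) -> Vec<Vec<u8>> {
    text.split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|run| !run.is_empty())
        .map(|run| run.to_ascii_lowercase().into_bytes())
        .collect()
}

/// Assumed selection rule: bigram k covers bytes k and k+1 and is weighted by
/// its hash. A span from bigram `left` to bigram `right` is kept when both
/// border weights are strictly greater than every weight between them.
/// Returns half-open byte ranges [start, end) within the run.
fn sparse_spans(run: &[u8]) -> Vec<(usize, usize)> {
    let weights: Vec<u64> = run.windows(2).map(fnv1a).collect();
    let mut spans = Vec::new();
    for left in 0..weights.len().saturating_sub(1) {
        // Largest weight strictly between `left` and `right`, if any.
        let mut interior_max: Option<u64> = None;
        for right in (left + 1)..weights.len() {
            let border = weights[left].min(weights[right]);
            if interior_max.map_or(true, |m| border > m) {
                spans.push((left, right + 2));
            }
            let new_max = interior_max.map_or(weights[right], |m| m.max(weights[right]));
            // Once an interior weight reaches the left border, no longer span qualifies.
            if new_max >= weights[left] {
                break;
            }
            interior_max = Some(new_max);
        }
    }
    spans
}

/// Hash every selected span into a sorted, deduplicated token list.
fn sparse_tokens(text: &str) -> Vec<u64> {
    let mut tokens = Vec::new();
    for run in alnum_runs(text) {
        for (start, end) in sparse_spans(&run) {
            tokens.push(fnv1a(&run[start..end]));
        }
    }
    tokens.sort_unstable();
    tokens.dedup();
    tokens
}
```

Because the weights depend only on the bytes of the run, the same text always yields the same tokens, which is the property the index and the query planner both rely on.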

Benchmark

Results from cargo bench -p lance --bench regex_ngram on 200k rows, comparing the fixed trigram index with the sparse n-gram index:

Query                                                      Trigram    Sparse     Result
regexp_match(doc, 'zqxwvu.*needlexyz')                     13.34 ms   13.22 ms   roughly flat
regexp_match(doc, '(zqxwvu|qwerasdf|needlexyz)')           16.49 ms   16.46 ms   roughly flat
regexp_match(doc, 'zqxwvu')                                12.38 ms   12.29 ms   roughly flat
regexp_match(doc, 'a.b')                                   98.99 ms   97.85 ms   roughly flat
regexp_match(doc, 'sparsemarkerabcdefghijklmnopqrstuvwx')  53.97 ms   3.87 ms    ~13.9x faster

The last query is the target case: every row contains all fixed trigrams from the long literal, but only a small fraction contains the full literal. Fixed trigrams therefore produce a broad candidate set, while sparse longer n-grams stay selective.
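The effect can be reproduced on a toy scale. The sketch below is not the benchmark's dataset or the index's filtering code; `ngrams` and `is_candidate` are hypothetical helpers that treat a document as a candidate when it contains every fixed-length n-gram of the literal. A decoy document that contains all trigrams of a literal, but not the literal itself, survives trigram filtering and is rejected once longer grams are required.

```rust
use std::collections::HashSet;

/// All byte windows of length `n` in `text` (requires n > 0).
fn ngrams(text: &str, n: usize) -> HashSet<&[u8]> {
    text.as_bytes().windows(n).collect()
}

/// Hypothetical candidate check: the document must contain every
/// length-`n` gram of the literal. A literal shorter than `n` yields no
/// grams, so nothing can be pruned and every document stays a candidate.
fn is_candidate(doc: &str, literal: &str, n: usize) -> bool {
    let doc_grams = ngrams(doc, n);
    ngrams(literal, n).into_iter().all(|gram| doc_grams.contains(gram))
}

/// Toy demonstration: returns (trigram verdict, 5-gram verdict, true match)
/// for a decoy document that holds "abc", "bcd", and "cde" but never "abcde".
fn decoy_demo() -> (bool, bool, bool) {
    let literal = "abcde";
    let decoy = "abcd bcde";
    (
        is_candidate(decoy, literal, 3),
        is_candidate(decoy, literal, 5),
        decoy.contains(literal),
    )
}
```

Every candidate still has to be rechecked against the real predicate, so a broad candidate set costs time rather than correctness; that recheck cost is what the last benchmark row measures.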

The benchmark was run locally with:

CARGO_PROFILE_BENCH_LTO=false CARGO_PROFILE_BENCH_DEBUG=0 \
cargo bench --target-dir /tmp/lance-target-4ca9-nolto \
  -p lance --bench regex_ngram -- sparse_decoy_literal \
  --sample-size 10 --warm-up-time 1

Additional local runs used the same command shape for selective_and, alternation, plain_literal, and non_accelerable_a_dot_b.

Testing

  • cargo fmt --all
  • cargo test --target-dir /tmp/lance-target-4ca9 -p lance-index ngram
  • cargo clippy --target-dir /tmp/lance-target-4ca9 -p lance-index --tests --benches -- -D warnings
  • cargo clippy --target-dir /tmp/lance-target-4ca9 -p lance --bench regex_ngram -- -D warnings

@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer performance and removed A-index Vector index, linalg, tokenizer labels Jun 23, 2026
@everySympathy everySympathy force-pushed the codex/sparse-ngram-regex-main branch from e992a22 to 0139ba7 Compare June 23, 2026 06:21
@github-actions github-actions Bot added the A-index Vector index, linalg, tokenizer label Jun 23, 2026
@codecov

codecov Bot commented Jun 23, 2026


Codecov Report

❌ Patch coverage is 94.68504% with 27 lines in your changes missing coverage. Please review.

Files with missing lines                          Patch %  Lines
rust/lance-index/src/scalar/ngram.rs              94.34%   20 Missing and 6 partials ⚠️
rust/lance-index/src/scalar/ngram/ngram_regex.rs  97.91%   1 Missing ⚠️

@everySympathy everySympathy force-pushed the codex/sparse-ngram-regex-main branch from 0139ba7 to be8932a Compare June 23, 2026 14:11
@westonpace

Member

Looks like a really cool idea!

@everySympathy

Contributor Author

> Looks like a really cool idea!

Thanks! The idea was inspired by Cursor’s fast regex search blog post. I’ll keep pushing this PR forward.
