perf(index): add sparse ngram tokenization by everySympathy · Pull Request #7413 · lance-format/lance

everySympathy · 2026-06-23T06:05:52Z

What

Adds an optional sparse n-gram tokenization mode to the scalar NGram index.

The default index remains the existing fixed trigram mode. Users can opt into sparse tokenization with NGram index params:

{"tokenization": "sparse"}

Sparse tokenization stores selected longer n-gram tokens instead of every fixed trigram. This is intended for long literal regex / contains predicates where each individual trigram is common, but the longer literal itself is rare.
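To make "selected longer n-gram tokens" concrete, here is a minimal sketch of one way such tokens could be picked. It is not the code in this PR: the selection rule (keep a span when both edge bigrams outweigh every interior bigram), the FNV-1a hash, and the names `fnv1a`, `alnum_runs`, `sparse_spans`, `token_ids`, `min_len`, and `max_len` are all assumptions chosen for illustration.

```rust
/// FNV-1a: a small hash whose output is identical across runs and platforms.
/// (Assumption: the PR only says "stable hashing", not which hash it uses.)
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Split text into lowercased ASCII alphanumeric runs.
fn alnum_runs(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_ascii_alphanumeric())
        .filter(|run| !run.is_empty())
        .map(|run| run.to_ascii_lowercase())
        .collect()
}

/// Hypothetical span rule: keep `run[left..right + 2]` when the bigrams at both
/// edges weigh strictly more than every bigram strictly between them.
fn sparse_spans(run: &str, min_len: usize, max_len: usize) -> Vec<String> {
    let bytes = run.as_bytes();
    if !run.is_ascii() || bytes.len() < 2 {
        return Vec::new();
    }
    // weights[i] is the hash weight of the bigram starting at byte i.
    let weights: Vec<u64> = bytes.windows(2).map(fnv1a).collect();
    let mut spans = Vec::new();
    for left in 0..weights.len() {
        let mut inner_max: Option<u64> = None;
        for right in (left + 1)..weights.len() {
            let len = right + 2 - left; // bytes covered by bigrams left..=right
            if len > max_len {
                break;
            }
            let right_beats_interior = inner_max.map_or(true, |m| weights[right] > m);
            if right_beats_interior && len >= min_len {
                spans.push(run[left..right + 2].to_string());
            }
            // Once a bigram is at least as heavy as the left edge, no wider
            // span starting at `left` can satisfy the rule.
            if weights[right] >= weights[left] {
                break;
            }
            inner_max = Some(inner_max.map_or(weights[right], |m| m.max(weights[right])));
        }
    }
    spans
}

/// Hash each selected span into a stable 64-bit token id.
fn token_ids(text: &str, min_len: usize, max_len: usize) -> Vec<u64> {
    alnum_runs(text)
        .iter()
        .flat_map(|run| sparse_spans(run, min_len, max_len))
        .map(|span| fnv1a(span.as_bytes()))
        .collect()
}
```

Because the weights depend only on the bytes of the run, the same text always yields the same spans, which is the property an index build and a query planner would both need. With `min_len` set above 3 the sketch emits fewer, longer tokens than a fixed trigram pass over the same run.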

How

Adds NGramTokenization::{Trigram, Sparse} and persists the selected mode in the postings file metadata.
Keeps existing trigram behavior as the default for compatibility.
Builds sparse index tokens from normalized alphanumeric runs using deterministic span selection and stable hashing.
Reuses the existing regex-to-ngram planner, but lets it derive token requirements against the index tokenization mode.
Extends regex_ngram benchmark coverage to compare fixed trigram and sparse n-gram indexes on the same dataset.
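The mode selection and persistence described above can be sketched roughly as follows. Only the `NGramTokenization` name, its two variants, the `"sparse"` value, and the trigram default come from this PR; the `"trigram"` string, the `"tokenization"` metadata key, the `HashMap` stand-in for the postings file metadata, and the helper names are assumptions for illustration.

```rust
use std::collections::HashMap;

/// Tokenization mode of the NGram index; trigram stays the default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum NGramTokenization {
    #[default]
    Trigram,
    Sparse,
}

impl NGramTokenization {
    /// String form used in params and metadata ("trigram" is an assumed spelling).
    fn as_str(self) -> &'static str {
        match self {
            Self::Trigram => "trigram",
            Self::Sparse => "sparse",
        }
    }

    /// A missing value falls back to the default so older indexes keep working.
    fn parse(value: Option<&str>) -> Result<Self, String> {
        match value {
            None => Ok(Self::default()),
            Some("trigram") => Ok(Self::Trigram),
            Some("sparse") => Ok(Self::Sparse),
            Some(other) => Err(format!("unknown tokenization: {other}")),
        }
    }
}

/// Hypothetical metadata key; the real key name is not shown in the PR text.
const TOKENIZATION_KEY: &str = "tokenization";

/// Record the mode next to the postings so readers know how to tokenize queries.
fn write_metadata(mode: NGramTokenization, metadata: &mut HashMap<String, String>) {
    metadata.insert(TOKENIZATION_KEY.to_string(), mode.as_str().to_string());
}

/// Recover the mode; metadata written before this change has no key and reads as trigram.
fn read_metadata(metadata: &HashMap<String, String>) -> Result<NGramTokenization, String> {
    NGramTokenization::parse(metadata.get(TOKENIZATION_KEY).map(String::as_str))
}
```

Treating an absent key as `Trigram` is what would let existing indexes load unchanged, which matches the compatibility goal stated in the list.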

Benchmark

cargo bench -p lance --bench regex_ngram, 200k rows, fixed trigram index vs sparse n-gram index:

| Query | Trigram | Sparse | Result |
| --- | --- | --- | --- |
| `regexp_match(doc, 'zqxwvu.*needlexyz')` | 13.34 ms | 13.22 ms | roughly flat |
| `regexp_match(doc, '(zqxwvu\|qwerasdf\|needlexyz)')` | 16.49 ms | 16.46 ms | roughly flat |
| `regexp_match(doc, 'zqxwvu')` | 12.38 ms | 12.29 ms | roughly flat |
| `regexp_match(doc, 'a.b')` | 98.99 ms | 97.85 ms | roughly flat |
| `regexp_match(doc, 'sparsemarkerabcdefghijklmnopqrstuvwx')` | 53.97 ms | 3.87 ms | ~13.9x faster |

The last query is the target case: every row contains all fixed trigrams from the long literal, but only a small fraction contains the full literal. Fixed trigrams therefore produce a broad candidate set, while sparse longer n-grams stay selective.
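A toy version of this decoy effect, as a sketch only: it uses a fixed 6-gram as a stand-in for a sparse longer token, and the helper names `ngrams` and `candidates` plus the sample rows are made up for illustration rather than taken from the benchmark.

```rust
use std::collections::HashSet;

/// All overlapping n-character windows of `text`.
fn ngrams(text: &str, n: usize) -> HashSet<String> {
    if n == 0 {
        return HashSet::new();
    }
    let chars: Vec<char> = text.chars().collect();
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

/// Row ids whose n-gram set contains every n-gram of `literal`.
/// These are the rows an index lookup would pass on for regex rechecking.
fn candidates(docs: &[&str], literal: &str, n: usize) -> Vec<usize> {
    let needed = ngrams(literal, n);
    docs.iter()
        .enumerate()
        .filter(|(_, doc)| {
            let have = ngrams(doc, n);
            needed.iter().all(|gram| have.contains(gram))
        })
        .map(|(row, _)| row)
        .collect()
}
```

For the rows `["abcd cdef", "xx abcdef yy", "nothing"]` and the literal `abcdef`, `candidates(&docs, "abcdef", 3)` keeps rows 0 and 1, because the decoy row 0 holds all four trigrams without holding the literal, while `candidates(&docs, "abcdef", 6)` keeps only row 1.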

The benchmark was run locally with:

CARGO_PROFILE_BENCH_LTO=false CARGO_PROFILE_BENCH_DEBUG=0 \
cargo bench --target-dir /tmp/lance-target-4ca9-nolto \
  -p lance --bench regex_ngram -- sparse_decoy_literal \
  --sample-size 10 --warm-up-time 1

Additional local runs used the same command shape for selective_and, alternation, plain_literal, and non_accelerable_a_dot_b.

Testing

cargo fmt --all
cargo test --target-dir /tmp/lance-target-4ca9 -p lance-index ngram
cargo clippy --target-dir /tmp/lance-target-4ca9 -p lance-index --tests --benches -- -D warnings
cargo clippy --target-dir /tmp/lance-target-4ca9 -p lance --bench regex_ngram -- -D warnings

codecov · 2026-06-23T07:00:54Z

Codecov Report

❌ Patch coverage is 94.68504% with 27 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-index/src/scalar/ngram.rs | 94.34% | 20 Missing and 6 partials ⚠️ |
| rust/lance-index/src/scalar/ngram/ngram_regex.rs | 97.91% | 1 Missing ⚠️ |


westonpace · 2026-06-23T15:17:12Z

Looks like a really cool idea!

everySympathy · 2026-06-24T09:00:27Z

> Looks like a really cool idea!

Thanks! The idea was inspired by Cursor’s fast regex search blog post. I’ll keep pushing this PR forward.

github-actions Bot added the A-index (Vector index, linalg, tokenizer) and performance labels and removed the A-index (Vector index, linalg, tokenizer) label on Jun 23, 2026

everySympathy force-pushed the codex/sparse-ngram-regex-main branch from e992a22 to 0139ba7 on June 23, 2026 06:21

github-actions Bot added the A-index Vector index, linalg, tokenizer label Jun 23, 2026

perf(index): add sparse ngram tokenization

be8932a

everySympathy force-pushed the codex/sparse-ngram-regex-main branch from 0139ba7 to be8932a on June 23, 2026 14:11

everySympathy wants to merge 1 commit into lance-format:main from everySympathy:codex/sparse-ngram-regex-main