feat: add ICU split tokenizer variant #7474

Merged
Xuanwo merged 1 commit into main from xuanwo/oss-1274-icu-split-controls on Jun 25, 2026

@Xuanwo Xuanwo commented Jun 25, 2026

Collaborator

This adds an opt-in icu/split tokenizer variant for FTS so mixed-language text can keep ICU segmentation while also splitting identifier-like tokens on simple-style delimiters. The existing icu tokenizer remains unchanged, and the variant is configured through the existing base_tokenizer surface.

Closes #7280.

Validation note: uv run make lint currently fails at pyright in this local environment because optional tensorflow and torch imports cannot be resolved; ruff formatting and checks passed before that failure.
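
As a rough illustration of the split behavior described above (not the actual Lance implementation), the sketch below takes tokens that a word segmenter has already produced and further splits each one on non-alphanumeric characters. The function name `split_on_simple_delimiters`, the sample tokens, and the choice of "any non-alphanumeric character" as the delimiter rule are all assumptions made for this example; the real delimiter set used by icu/split may differ.

```python
from typing import Iterable, List


def split_on_simple_delimiters(tokens: Iterable[str]) -> List[str]:
    """Split already-segmented tokens on non-alphanumeric characters.

    Hypothetical helper: it models "simple-style" splitting as breaking
    on every character for which str.isalnum() is False.
    """
    out: List[str] = []
    for token in tokens:
        current: List[str] = []
        for ch in token:
            if ch.isalnum():
                current.append(ch)
            elif current:
                # A delimiter ends the current sub-token.
                out.append("".join(current))
                current = []
        if current:
            out.append("".join(current))
    return out


# Assumed output of a word segmenter that keeps identifier-like
# tokens such as "user_id" and "example.com" in one piece.
segmented = ["user_id", "日本語", "example.com"]
result = split_on_simple_delimiters(segmented)
print(result)
```

Under these assumptions, identifier-like tokens break into their parts while a token without delimiters, such as the CJK word, passes through unchanged.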

@github-actions

Contributor

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions (bot) added the labels A-python (Python bindings), A-index (Vector index, linalg, tokenizer), A-java (Java bindings + JNI), A-format (On-disk format: protos and format spec docs), and enhancement (New feature or request) on Jun 25, 2026
@Xuanwo Xuanwo marked this pull request as ready for review June 25, 2026 08:06

@BubbleCal BubbleCal left a comment

Contributor

Nice work!

@Xuanwo Xuanwo merged commit b492ddc into main Jun 25, 2026
30 checks passed
@Xuanwo Xuanwo deleted the xuanwo/oss-1274-icu-split-controls branch June 25, 2026 08:29

codecov Bot commented Jun 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Development

Successfully merging this pull request may close these issues.

Add simple-style split control for the ICU FTS tokenizer