fix(fts): use async send in FTS index builder to prevent thread-pool … by a-agmon · Pull Request #7423 · lance-format/lance

a-agmon · 2026-06-23T16:25:45Z

Fixes lancedb/lancedb#3568 (the issue arises in lancedb indexing)

Building a full-text-search index hangs permanently at 0% CPU on hosts
whose Lance CPU pool has a single thread.
The CPU compute pool is sized max(1, num_cpus - LANCE_IO_CORE_RESERVATION) (default reservation 2), so any machine with <= 3 visible CPUs (1-vCPU VMs, CI runners, CPU-limited Kubernetes pods) collapses to a 1-thread pool and deadlocks.
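As a quick check of the sizing rule above, here is a minimal sketch of the arithmetic. The function name `cpu_pool_size` and its parameters are hypothetical and are not taken from the Lance code base; only the formula `max(1, num_cpus - reservation)` comes from the description.

```rust
/// Hypothetical helper mirroring the sizing rule quoted above:
/// max(1, num_cpus - io_core_reservation).
fn cpu_pool_size(num_cpus: usize, io_core_reservation: usize) -> usize {
    // saturating_sub avoids unsigned underflow when the reservation
    // exceeds the number of visible CPUs; max(1) keeps at least one thread.
    num_cpus.saturating_sub(io_core_reservation).max(1)
}
```

With the default reservation of 2, hosts with 1, 2, or 3 visible CPUs all end up with a single pool thread, and a 4-CPU host is the first to get two.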

Root cause is in write_posting_lists. The posting-list producer runs on the CPU pool via spawn_cpu and pushes batches into a capacity-1 async_channel using the synchronous tx.send_blocking(). When the channel is full, send_blocking parks the OS thread it is running on. On a single-thread pool that is the only thread, and the async consumer's column encoder (write_record_batch -> spawn_cpu) needs that same pool to drain the channel. The parked producer and the starved consumer wait on each other forever: no timeout, no error, just a silent hang at 0% CPU.
The hang only triggers once the posting lists span a second output batch (so the producer reaches a second, blocking send), which is why it appears as a data-size "cliff".
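The wait cycle described above can be reproduced in a simplified model that uses only the standard library. This is an illustration under stated assumptions, not the Lance code: a one-worker job queue stands in for the CPU pool, `std::sync::mpsc::sync_channel(1)` stands in for the capacity-1 `async_channel`, a plain thread stands in for the async consumer, and the names `single_thread_pool` and `blocking_pipeline_completes` are hypothetical. The exact batch count at which the model stalls depends on its buffering and is not meant to match Lance's.

```rust
use std::sync::mpsc::{channel, sync_channel, Sender};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Starts a pool with exactly one worker thread and returns its job queue.
fn single_thread_pool() -> Sender<Job> {
    let (jobs_tx, jobs_rx) = channel::<Job>();
    thread::spawn(move || {
        for job in jobs_rx {
            job();
        }
    });
    jobs_tx
}

/// Returns true if every batch is encoded within `timeout`,
/// false if the pipeline stalls.
fn blocking_pipeline_completes(batches: usize, timeout: Duration) -> bool {
    let pool = single_thread_pool();
    let (tx, rx) = sync_channel::<usize>(1); // capacity-1 channel
    let (done_tx, done_rx) = channel::<()>();

    // Producer: runs on the only pool worker and uses a blocking send,
    // so a full channel parks the worker thread itself.
    pool.send(Box::new(move || {
        for batch in 0..batches {
            if tx.send(batch).is_err() {
                return;
            }
        }
    }))
    .expect("pool worker exited");

    // Consumer: for each batch it needs the same pool to "encode" it
    // before it can receive the next one.
    let consumer_pool = pool.clone();
    thread::spawn(move || {
        for batch in rx {
            let (enc_tx, enc_rx) = channel::<usize>();
            consumer_pool
                .send(Box::new(move || {
                    let _ = enc_tx.send(batch);
                }))
                .expect("pool worker exited");
            if enc_rx.recv().is_err() {
                return;
            }
        }
        let _ = done_tx.send(());
    });

    done_rx.recv_timeout(timeout).is_ok()
}
```

In this model a single batch completes, because the producer job finishes and frees the worker before the encode job needs it. With several batches (for example 8) the producer blocks on a full channel while the encode job that would let the consumer drain it sits queued behind the producer, so the call returns `false` once the timeout expires.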

The PR restructures the producer as an async task that builds each batch on the CPU pool via spawn_cpu and dispatches it with tx.send(batch).await. When the channel is full, send().await yields the task back to the runtime instead of parking a pool thread, so the consumer can always be scheduled to drain it. Between batches the producer holds no pool thread while waiting, making the pool size irrelevant. The builder and the remaining posting-list iterator are handed back out of each spawn_cpu call so the cross-batch cache-group accumulator is preserved.
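The shape of the restructured producer can be sketched with the standard library alone. This is a model under stated assumptions, not the code in this PR: a one-worker job queue stands in for the CPU pool, `run_on_pool` is a hypothetical stand-in for `spawn_cpu(..).await`, plain threads stand in for the async producer and consumer tasks, and a `Range` stands in for the remaining posting-list iterator. The point it illustrates is that each pool job builds one batch and hands the remaining state back, so waiting on a full channel never happens on the pool worker.

```rust
use std::sync::mpsc::{channel, sync_channel, Sender};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Starts a pool with exactly one worker thread and returns its job queue.
fn single_thread_pool() -> Sender<Job> {
    let (jobs_tx, jobs_rx) = channel::<Job>();
    thread::spawn(move || {
        for job in jobs_rx {
            job();
        }
    });
    jobs_tx
}

/// Runs `work` on the pool and blocks the calling thread (never a pool
/// worker) until the result is ready.
fn run_on_pool<T, F>(pool: &Sender<Job>, work: F) -> T
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = channel::<T>();
    pool.send(Box::new(move || {
        let _ = tx.send(work());
    }))
    .expect("pool worker exited");
    rx.recv().expect("pool job dropped its result")
}

/// Returns the encoded batches, or None if the pipeline stalls.
fn yielding_pipeline(batches: usize, timeout: Duration) -> Option<Vec<usize>> {
    let pool = single_thread_pool();
    let (tx, rx) = sync_channel::<usize>(1); // capacity-1 channel
    let (done_tx, done_rx) = channel::<Vec<usize>>();

    // Producer task: only the batch construction runs on the pool.
    let producer_pool = pool.clone();
    thread::spawn(move || {
        let mut remaining = 0..batches; // stand-in for the posting-list iterator
        loop {
            // Build one batch on the pool and hand the iterator back out.
            let (batch, rest) = run_on_pool(&producer_pool, move || {
                let batch = remaining.next();
                (batch, remaining)
            });
            remaining = rest;
            match batch {
                // A full channel makes this task wait; the pool worker stays free.
                Some(b) => {
                    if tx.send(b).is_err() {
                        return;
                    }
                }
                None => return,
            }
        }
    });

    // Consumer task: "encodes" each batch on the same one-thread pool.
    thread::spawn(move || {
        let mut encoded = Vec::new();
        for batch in rx {
            encoded.push(run_on_pool(&pool, move || batch * 10));
        }
        let _ = done_tx.send(encoded);
    });

    done_rx.recv_timeout(timeout).ok()
}
```

Because no pool job ever waits on the channel, the single worker simply alternates between build and encode jobs, and the model finishes for any number of batches.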

In addition, it adds a regression test that writes a partition whose posting lists span many output batches (exercising channel back-pressure) under a timeout and verifies every batch is searchable.

(The verbose comments in the code are intentional, for review purposes, and can be removed if inappropriate. I thought they might be helpful, as the issue is somewhat confusing.)

…deadlock

a-agmon · 2026-06-24T03:22:41Z

Hi @westonpace, I would be happy to have your review.
This issue causes a nasty bug on K8s pods with one core, and it took my team quite some time to pin down, especially as it occurs in native Rust code. I am submitting this PR to resolve it.
Thanks!

fix(fts): use async send in FTS index builder to prevent thread-pool …

710a9fa

…deadlock

github-actions bot added the bug (Something isn't working) and A-index (Vector index, linalg, tokenizer) labels and removed the bug (Something isn't working) label on Jun 23, 2026

github-actions bot added the bug (Something isn't working) label on Jun 24, 2026

a-agmon wants to merge 1 commit into lance-format:main from a-agmon:fix/fts-async-send
