
fix(fts): use async send in FTS index builder to prevent thread-pool …#7423

Open
a-agmon wants to merge 1 commit into lance-format:main from a-agmon:fix/fts-async-send



@a-agmon a-agmon commented Jun 23, 2026


Fixes lancedb/lancedb#3568 (the issue arises in lancedb indexing)

Building a full-text-search index hangs permanently at 0% CPU on hosts
whose Lance CPU pool has a single thread.
The CPU compute pool is sized as max(1, num_cpus - LANCE_IO_CORE_RESERVATION), with a default reservation of 2, so any machine with 3 or fewer visible CPUs (1-vCPU VMs, CI runners, CPU-limited Kubernetes pods) collapses to a 1-thread pool and deadlocks.
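To make the sizing rule concrete, here is a minimal sketch of the arithmetic. The function name cpu_pool_size is hypothetical and is not taken from the Lance code base; only the formula and the default reservation of 2 come from the description above. The saturating subtraction is my assumption about how the formula should behave with unsigned integers.

```rust
/// Hypothetical helper that mirrors the rule quoted above:
/// max(1, num_cpus - LANCE_IO_CORE_RESERVATION).
/// `saturating_sub` keeps the unsigned subtraction from underflowing
/// when fewer CPUs are visible than are reserved for I/O.
fn cpu_pool_size(num_cpus: usize, io_core_reservation: usize) -> usize {
    std::cmp::max(1, num_cpus.saturating_sub(io_core_reservation))
}

fn main() {
    // Default reservation of 2, as stated in the PR description.
    for cpus in 1..=5 {
        println!(
            "{cpus} visible CPU(s) -> {} pool thread(s)",
            cpu_pool_size(cpus, 2)
        );
    }
}
```

With the default reservation, 1, 2, and 3 visible CPUs all yield a single pool thread, and 4 CPUs is the first size that yields two.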

The root cause is in write_posting_lists. The posting-list producer runs on the CPU pool via spawn_cpu and pushes batches into a capacity-1 async_channel using the synchronous tx.send_blocking(). When the channel is full, send_blocking parks the OS thread it is running on. On a single-thread pool that is the only thread, and the async consumer's column encoder (write_record_batch -> spawn_cpu) needs that same pool to drain the channel. The parked producer and the starved consumer wait on each other forever: no timeout, no error, just a silent hang at 0% CPU.
The hang only triggers once the posting lists span a second output batch (so the producer reaches a second, blocking send), which is why it appears as a data-size "cliff".
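The interaction can be reproduced in miniature with the standard library alone. This is a toy model, not the Lance code: spawn_pool and run_pipeline are hypothetical names, std's sync_channel(1) stands in for the capacity-1 async_channel, its blocking send stands in for send_blocking, and the main thread plays the async consumer that must run an encode job on the same pool before it takes the next batch. The exact batch count at which this model stalls depends on its own buffering and is not meant to match the real threshold.

```rust
use std::sync::mpsc::{channel, sync_channel, RecvTimeoutError, Sender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Toy fixed-size pool: each worker pulls jobs from one shared queue.
fn spawn_pool(threads: usize) -> Sender<Job> {
    let (job_tx, job_rx) = channel::<Job>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    for _ in 0..threads {
        let job_rx = Arc::clone(&job_rx);
        thread::spawn(move || loop {
            // The lock is released at the end of this statement,
            // before the job itself runs.
            let next = job_rx.lock().unwrap().recv();
            match next {
                Ok(job) => job(),
                Err(_) => break, // every sender dropped: shut down
            }
        });
    }
    job_tx
}

/// Returns true if every batch was produced and encoded,
/// false if the pipeline stalled for longer than `timeout`.
fn run_pipeline(pool_threads: usize, batches: usize, timeout: Duration) -> bool {
    let pool = spawn_pool(pool_threads);
    let (batch_tx, batch_rx) = sync_channel::<usize>(1);

    // Producer: occupies a pool thread and uses a blocking send,
    // so a full channel parks that thread.
    pool.send(Box::new(move || {
        for batch in 0..batches {
            if batch_tx.send(batch).is_err() {
                break; // consumer gave up
            }
        }
    }))
    .unwrap();

    // Consumer: each received batch needs an encode job on the same pool.
    loop {
        match batch_rx.recv_timeout(timeout) {
            Ok(_batch) => {
                let (done_tx, done_rx) = channel::<()>();
                pool.send(Box::new(move || {
                    let _ = done_tx.send(());
                }))
                .unwrap();
                if done_rx.recv_timeout(timeout).is_err() {
                    return false; // encode job never got a thread
                }
            }
            Err(RecvTimeoutError::Disconnected) => return true,
            Err(RecvTimeoutError::Timeout) => return false,
        }
    }
}

fn main() {
    let timeout = Duration::from_millis(500);
    println!("1 pool thread:  completed = {}", run_pipeline(1, 8, timeout));
    println!("2 pool threads: completed = {}", run_pipeline(2, 8, timeout));
}
```

With one pool thread the run should report completed = false, because the producer is parked on a full channel while the encode job waits in the queue behind it. With two threads the same pipeline should finish. The timeout exists only so the model can report the stall; the real hang has none.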

The PR restructures the producer as an async task that builds each batch on the CPU pool via spawn_cpu and dispatches it with tx.send(batch).await. When the channel is full, send().await yields the task back to the runtime instead of parking a pool thread, so the consumer can always be scheduled to drain it. Between batches the producer holds no pool thread while waiting, making the pool size irrelevant. The builder and the remaining posting-list iterator are handed back out of each spawn_cpu call so the cross-batch cache-group accumulator is preserved.
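The idea of yielding instead of parking can also be modeled without an async runtime. The sketch below is an analogy and does not reproduce the PR's code: where the PR awaits tx.send(batch), this model uses a non-blocking try_send and, when the channel is full, re-queues the producer step on the pool so the thread is freed for other jobs. The names spawn_pool, produce_step, and run_pipeline are hypothetical, and the next counter stands in for the builder and iterator state that the PR hands back out of each spawn_cpu call.

```rust
use std::sync::mpsc::{
    channel, sync_channel, RecvTimeoutError, Sender, SyncSender, TrySendError,
};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Toy fixed-size pool: each worker pulls jobs from one shared queue.
fn spawn_pool(threads: usize) -> Sender<Job> {
    let (job_tx, job_rx) = channel::<Job>();
    let job_rx = Arc::new(Mutex::new(job_rx));
    for _ in 0..threads {
        let job_rx = Arc::clone(&job_rx);
        thread::spawn(move || loop {
            let next = job_rx.lock().unwrap().recv();
            match next {
                Ok(job) => job(),
                Err(_) => break,
            }
        });
    }
    job_tx
}

/// One producer step: send batches while there is room. On a full
/// channel, re-queue the rest of the work (carrying `next` forward)
/// and return, which frees the pool thread instead of parking it.
fn produce_step(pool: Sender<Job>, batch_tx: SyncSender<usize>, mut next: usize, total: usize) {
    while next < total {
        match batch_tx.try_send(next) {
            Ok(()) => next += 1,
            Err(TrySendError::Full(_)) => {
                thread::yield_now(); // toy model: this polls rather than waits
                let retry_pool = pool.clone();
                let retry: Job =
                    Box::new(move || produce_step(retry_pool, batch_tx, next, total));
                let _ = pool.send(retry);
                return;
            }
            Err(TrySendError::Disconnected(_)) => return,
        }
    }
}

/// Returns the encoded batches in order, or None if the pipeline stalled.
fn run_pipeline(pool_threads: usize, batches: usize, timeout: Duration) -> Option<Vec<usize>> {
    let pool = spawn_pool(pool_threads);
    let (batch_tx, batch_rx) = sync_channel::<usize>(1);
    let producer_pool = pool.clone();
    pool.send(Box::new(move || produce_step(producer_pool, batch_tx, 0, batches)))
        .unwrap();

    let mut encoded = Vec::new();
    loop {
        match batch_rx.recv_timeout(timeout) {
            Ok(batch) => {
                // The encode job shares the pool with the producer.
                let (done_tx, done_rx) = channel::<usize>();
                pool.send(Box::new(move || {
                    let _ = done_tx.send(batch);
                }))
                .unwrap();
                encoded.push(done_rx.recv_timeout(timeout).ok()?);
            }
            Err(RecvTimeoutError::Disconnected) => return Some(encoded),
            Err(RecvTimeoutError::Timeout) => return None,
        }
    }
}

fn main() {
    let result = run_pipeline(1, 8, Duration::from_millis(500));
    println!("1 pool thread: {:?}", result);
}
```

Because the producer gives the thread back whenever the channel is full, the encode job always gets a turn, and a single-thread pool should deliver all eight batches in order. Unlike a real async send, this model busy-polls while it waits, so it illustrates the scheduling argument only and is not a pattern to copy.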

In addition, it adds a regression test that writes a partition whose posting lists span many output batches (exercising channel back-pressure) under a timeout and verifies every batch is searchable.

(I intentionally added verbose comments to the code for review purposes; they can be removed if inappropriate. I thought they might be helpful, as the issue is somewhat confusing.)

@github-actions github-actions Bot added the bug (Something isn't working) and A-index (Vector index, linalg, tokenizer) labels and removed the bug (Something isn't working) label Jun 23, 2026

a-agmon commented Jun 24, 2026

Author

Hi @westonpace - I would appreciate your review.
This issue causes a nasty hang on single-core Kubernetes pods, and it took my team quite some time to pin down, especially as it occurs in native Rust code. I am submitting this PR to resolve it.
Thanks!

@github-actions github-actions Bot added the bug (Something isn't working) label Jun 24, 2026

Labels

A-index: Vector index, linalg, tokenizer
bug: Something isn't working


Development

Successfully merging this pull request may close these issues.

bug(python): create_fts_index deadlocks with the ngram tokenizer on CPU-limited hosts (1CPU K8S pod)

1 participant