perf(backend): replace per-file sleep(480) deletion timers with a single janitor thread by mdmohsin7 · Pull Request #7855 · BasedHardware/omi

mdmohsin7 · 2026-06-12T10:12:28Z

Problem

Four call sites (1 in routers/sync.py, 3 voice-message flows in utils/chat.py) delete temporal GCS blobs by parking a storage_executor thread in time.sleep(480) per file. At current sync volume (~20 jobs/min post-#7801) that keeps ~90 of the pool's 128 threads asleep as ad-hoc timers — confirmed in prod executor_pool_health logs (~70–80% "utilization" with queue_depth 0, i.e. occupancy without work).

This is the remaining root cause behind #7531's storage-pool saturation: the pool was bumped 32→64→96→128 in ten days and _PRECACHE_FILE_SEM was halved (#7526) largely to feed threads whose only job was waiting.
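To illustrate the occupancy problem described above, here is a scaled-down sketch, not the actual code from routers/sync.py or utils/chat.py. The pool size, delay, and function names are stand-ins chosen for the demo (4 workers and 0.3 s instead of 128 threads and 480 s). It shows that a worker parked in time.sleep is unavailable for real work even though it is doing nothing.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4          # stand-in for the 128-thread storage pool
DELAY_SECONDS = 0.3    # stand-in for the 480 s deletion delay


def delete_after_delay(path, deleted):
    # The worker only waits here, yet it cannot pick up any other task.
    time.sleep(DELAY_SECONDS)
    deleted.append(path)  # stand-in for the real blob delete


def measure_real_work_latency():
    """Fill the pool with sleeping timers, then time one trivial real task."""
    deleted = []
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # One sleeping timer per file occupies every worker.
        for i in range(POOL_SIZE):
            pool.submit(delete_after_delay, f"tmp/{i}.wav", deleted)
        started = time.monotonic()
        # The real task sits in the queue until a sleeper finishes.
        pool.submit(lambda: None).result()
        waited = time.monotonic() - started
    return waited, sorted(deleted)
```

In this sketch the trivial task waits for roughly the whole delay because every worker is asleep, which is the saturated case; the prod logs quoted above show the milder case where the sleepers hold threads but the queue is still empty.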

Change

New utils/other/deferred_delete.py: DeferredDeleter — a due-time min-heap plus one lazily-started daemon thread. schedule() is an O(log n) heap push; the janitor wakes when the next deletion is due. An earlier-due schedule arriving mid-wait re-notifies and re-peeks, so ordering holds. Delete failures are logged and skipped (the syncing bucket's lifecycle rule remains the backstop, exactly as it was for the sleeping threads).
storage.py: schedule_syncing_temporal_file_deletion(path, delay=480s) wraps a module-level janitor bound to delete_syncing_temporal_file. Same 480s semantics (under the 15-min signed-URL validity).
All four call sites switch to the scheduler; chat.py drops its now-unused storage_executor/time imports.
_PRECACHE_FILE_SEM 2 → 4, reverting the load-shed from #7526 (backend: lower precache concurrency cap 4 → 2 to free storage_executor for sync) now that the sleepers are gone; audio merge/precache get the freed headroom.

Deletion timing, crash semantics (pending deletions die with the process either way), and the lifecycle backstop are all unchanged. Hundreds of pending deletions now cost one thread instead of hundreds.
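A minimal sketch of the design described above, assuming the shape given in this PR (a due-time min-heap, a threading.Condition, one lazily-started daemon thread). It is not the contents of utils/other/deferred_delete.py: the constructor signature, the use of time.monotonic, and the log message are assumptions, and the is_alive() check reflects the follow-up fix commit rather than the first revision.

```python
import heapq
import itertools
import logging
import threading
import time

logger = logging.getLogger(__name__)


class DeferredDeleter:
    """Runs delete_fn(path) after a delay, using a single daemon thread."""

    def __init__(self, delete_fn, name="deferred-deleter"):
        self._delete_fn = delete_fn
        self._name = name
        self._heap = []                   # entries are (due_time, seq, path)
        self._seq = itertools.count()     # tie-breaker for equal due times
        self._cond = threading.Condition()
        self._thread = None

    def schedule(self, path, delay):
        with self._cond:
            due = time.monotonic() + delay
            heapq.heappush(self._heap, (due, next(self._seq), path))
            # Start lazily; also restart if an earlier janitor thread died.
            if self._thread is None or not self._thread.is_alive():
                self._thread = threading.Thread(
                    target=self._run, name=self._name, daemon=True
                )
                self._thread.start()
            # Wake the janitor so it re-peeks; this item may be due sooner.
            self._cond.notify()

    def _run(self):
        while True:
            with self._cond:
                while True:
                    if not self._heap:
                        self._cond.wait()          # nothing pending
                        continue
                    remaining = self._heap[0][0] - time.monotonic()
                    if remaining <= 0:
                        break                      # head of the heap is due
                    self._cond.wait(timeout=remaining)
                _, _, path = heapq.heappop(self._heap)
            # Delete outside the lock so schedule() never blocks on I/O.
            try:
                self._delete_fn(path)
            except Exception:
                # Best effort: log and move on to the next pending item.
                logger.exception("deferred delete failed for %s", path)
```

A module-level instance bound to the real delete function, with a thin wrapper that passes delay=480, would mirror the storage.py change; scheduling a later-due item followed by an earlier-due one should still delete the earlier one first, because the notify forces a re-peek of the heap.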

Expected impact (measurable immediately after deploy)

executor_pool_health storage active_count should drop from ~90–100 to real work only (~5–15). That single number is the before/after check.

Tests

New tests/unit/test_deferred_blob_janitor.py (11 tests): real-module behavioral coverage (due-order with out-of-order schedules, near-term schedule interrupting a long wait, failure doesn't kill the janitor, 200 pending = exactly 1 thread) + structural guards (no time.sleep(480) remains, all four sites use the scheduler, sem = 4). Registered in test.sh.
test_storage_fanout_limits.py: semaphore value assertion updated 2 → 4 with rationale.
test_sync_silent_failure.py: removed the executor-swap setup/teardown machinery that existed solely to neutralize the old sleeping deleters in tests; all 41 pass.
scan_async_blockers.py: no new findings (the one chat.py hit pre-exists on main).
test_sync_v2.py remains at the main baseline (5 pre-existing failures from #7819, "fix(backend): offline-sync chunks split into separate conversations — serialize assignment chronologically (#6551)", unrelated).
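The structural guards mentioned above ("no time.sleep(480) remains") are not shown in this PR page. One way such a guard could be written is an AST scan; the helper name find_sleep_calls and its parameters are hypothetical, and the real test may well use a plain text search instead.

```python
import ast


def find_sleep_calls(source, seconds=480):
    """Return the line numbers of time.sleep(<seconds>) calls in Python source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match the literal attribute call `time.sleep(...)` only.
        is_time_sleep = (
            isinstance(func, ast.Attribute)
            and func.attr == "sleep"
            and isinstance(func.value, ast.Name)
            and func.value.id == "time"
        )
        if is_time_sleep and node.args:
            arg = node.args[0]
            if isinstance(arg, ast.Constant) and arg.value == seconds:
                hits.append(node.lineno)
    return sorted(hits)
```

A guard test would read each migrated file and assert that the returned list is empty. The scan only catches the literal time.sleep(480) form, so an aliased import or a named constant would slip past it.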

🤖 Generated with Claude Code


greptile-apps · 2026-06-12T10:17:33Z

Greptile Summary

This PR replaces four per-file time.sleep(480) deletion timers (three in utils/chat.py, one in routers/sync.py) with a single DeferredDeleter janitor — a min-heap of due times plus one lazily-started daemon thread — freeing the ~90 storage_executor threads that were parked as idle timers and restoring _PRECACHE_FILE_SEM from 2 → 4.

utils/other/deferred_delete.py (new): DeferredDeleter uses a threading.Condition-guarded min-heap; schedule() is O(log n); the janitor wakes only when the next deletion is due and re-peeks on early-arrival notifications, so ordering is maintained across concurrent schedule() calls.
utils/other/storage.py: adds schedule_syncing_temporal_file_deletion wrapping a module-level DeferredDeleter singleton; _PRECACHE_FILE_SEM reverted 2 → 4.
Tests: 11 new behavioral + structural tests in test_deferred_blob_janitor.py; test_sync_silent_failure.py drops the executor-swap machinery that existed solely to neutralize the old sleeping deleters.

Confidence Score: 4/5

Safe to merge — the core algorithm is correct, all four call sites are migrated, and the lifecycle backstop remains unchanged.

The DeferredDeleter min-heap logic handles due-order, early interruption, and delete failures correctly. The one gap is that _thread is never checked for liveness, so a crashed janitor would silently stop processing the heap. Given the explicit best-effort design and the GCS lifecycle backstop this is low-impact, but it is an undetected failure mode.

backend/utils/other/deferred_delete.py — the schedule() method's thread-start guard should check is_alive() in addition to the None check.

Important Files Changed

Filename	Overview
backend/utils/other/deferred_delete.py	New janitor implementation — min-heap + one daemon thread; algorithm is correct for normal operation but dead-thread detection is absent in schedule()
backend/utils/other/storage.py	Adds schedule_syncing_temporal_file_deletion wrapping the module-level DeferredDeleter singleton and reverts _PRECACHE_FILE_SEM from 2 → 4
backend/utils/chat.py	Removes all three per-file sleep(480) closures and storage_executor/time imports, replaces with schedule_syncing_temporal_file_deletion
backend/routers/sync.py	Removes the one remaining per-file sleep(480) deletion closure in process_segment, replaces with schedule_syncing_temporal_file_deletion
backend/tests/unit/test_deferred_blob_janitor.py	New behavioral + structural tests for the janitor; covers due-order, interrupt-long-wait, failure resilience, single-thread invariant, and source-pattern guards
backend/tests/unit/test_storage_fanout_limits.py	Semaphore assertion updated from BoundedSemaphore(2) to BoundedSemaphore(4) with rationale comment
backend/tests/unit/test_sync_silent_failure.py	Removes executor-swap setup/teardown machinery that existed solely to neutralize sleeping deleters; adds schedule_syncing_temporal_file_deletion mock where needed
backend/test.sh	Registers the new test_deferred_blob_janitor.py in the test suite

Sequence Diagram

sequenceDiagram
    participant Caller as chat.py / sync.py
    participant Sched as schedule_syncing_temporal_file_deletion
    participant Deleter as DeferredDeleter (heap + cond)
    participant Janitor as syncing-blob-janitor thread
    participant GCS as GCS (syncing bucket)

    Caller->>Sched: schedule_syncing_temporal_file_deletion(path)
    Sched->>Deleter: schedule(path, 480s)
    Deleter->>Deleter: heappush((now+480, seq, path))
    Deleter-->>Janitor: cond.notify()
    Note over Janitor: waits until due time

    Note over Janitor: 480s later…
    Janitor->>Deleter: heappop() — due item
    Janitor->>GCS: delete_syncing_temporal_file(path)
    GCS-->>Janitor: OK (or BlobNotFound → ignored)

Reviews (1): Last reviewed commit: "test(backend): register test_deferred_bl..."

greptile-apps · 2026-06-12T10:17:37Z

+            if self._thread is None:
+                self._thread = threading.Thread(target=self._run, name=self._name, daemon=True)
+                self._thread.start()
+            self._cond.notify()


Dead janitor thread silently stops all future deletions

_thread is set once and never cleared. If _run exits unexpectedly — e.g., via a BaseException subclass like MemoryError or SystemExit that bypasses the except Exception catch — _thread will still point to a dead Thread object. Every subsequent schedule() call skips the if self._thread is None: branch, items pile up in the heap, and no deletion ever fires. The lifecycle rule is the backstop, but this is a silent failure rather than a logged one. Changing the guard to if self._thread is None or not self._thread.is_alive(): would restart the janitor in the rare case it dies.
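The failure mode and the suggested guard can be shown in isolation. This is a hypothetical minimal holder for a lazily-started thread, not the DeferredDeleter code; the class name LazyWorker, the starts counter, and the dies_immediately target exist only for the demonstration.

```python
import threading


class LazyWorker:
    """Hypothetical holder that starts its worker thread on first use."""

    def __init__(self, target):
        self._target = target
        self._lock = threading.Lock()
        self._thread = None
        self.starts = 0  # how many times a thread was actually started

    def ensure_running(self):
        with self._lock:
            # A bare `is None` check would never fire again after the first
            # start, even once the thread has exited.
            if self._thread is None or not self._thread.is_alive():
                self._thread = threading.Thread(target=self._target, daemon=True)
                self._thread.start()
                self.starts += 1


def dies_immediately():
    # SystemExit derives from BaseException, so `except Exception` does not
    # catch it, and threading ends the thread without printing a traceback.
    raise SystemExit
```

Calling ensure_running(), waiting for the thread to exit, and calling it again starts a second thread with this guard; with the `is None` check alone the second call would do nothing.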

kodjima33

Solid janitor-thread design w/ tests; perf rewrite stays maintainer-merge per policy


mdmohsin7 added 8 commits June 12, 2026 15:40

feat(backend): single-thread deferred-deletion scheduler

b5acc75

One daemon thread + a due-time heap replace the per-file time.sleep(480) pattern that parked a storage_executor thread per blob — ~70% of the pool idle as ad-hoc timers at sync volume (#7531).

feat(storage): schedule_syncing_temporal_file_deletion via janitor; restore precache sem to 4

9b8180c

The 4→2 cut (#7526) was load-shedding while the pool was full of sleeping deletion timers; with the janitor holding those, precache gets its concurrency back.

refactor(sync): use deferred-deletion janitor for segment wav cleanup

f27ffd7

refactor(chat): use deferred-deletion janitor at all three voice-message sites

dd0a6ab

Drops the now-unused storage_executor and time imports.

test(backend): behavioral + structural coverage for the deletion janitor

85cf7e1

test(storage): precache semaphore assertion 2 -> 4

3b58469

test(sync): drop executor-swap machinery obsoleted by the deletion janitor

91c70e6

test(backend): register test_deferred_blob_janitor in test.sh

7121d75

greptile-apps Bot reviewed Jun 12, 2026


kodjima33 approved these changes Jun 12, 2026


mdmohsin7 added 2 commits June 12, 2026 20:11

fix(backend): restart janitor thread if killed by a BaseException

d0de8a3

Greptile review: a MemoryError/SystemExit escaping the except-Exception catch would leave _thread pointing at a dead thread, silently piling up schedules for the process lifetime. is_alive() guard self-heals.

test(backend): janitor restarts after BaseException kills the thread

814428f

mdmohsin7 merged commit 7be6e13 into main Jun 12, 2026
1 check passed

mdmohsin7 deleted the feat/syncing-blob-janitor branch June 12, 2026 14:43

mdmohsin7 merged 10 commits into main from feat/syncing-blob-janitor
