perf(backend): replace per-file sleep(480) deletion timers with a single janitor thread #7855

Merged
mdmohsin7 merged 10 commits into
mainfrom
feat/syncing-blob-janitor
Jun 12, 2026

@mdmohsin7

Member

Problem

Four call sites (1 in routers/sync.py, 3 voice-message flows in utils/chat.py) delete temporal GCS blobs by parking a storage_executor thread in time.sleep(480) per file. At current sync volume (~20 jobs/min post-#7801) that keeps ~90 of the pool's 128 threads asleep as ad-hoc timers — confirmed in prod executor_pool_health logs (~70–80% "utilization" with queue_depth 0, i.e. occupancy without work).
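The cost of the old pattern can be reproduced at small scale. The sketch below is an illustration only, not the code from routers/sync.py or utils/chat.py: `delete_after_sleep` and `measure_queue_wait` are hypothetical names, and the pool size and delay are shrunk so it runs in under a second. It shows that once sleeping timers occupy every worker, a trivial task has to wait for one of them to wake up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def delete_after_sleep(path: str, delay: float, deleted: list) -> None:
    """Old pattern (hypothetical stand-in): park the worker for the whole delay, then delete."""
    time.sleep(delay)
    deleted.append(path)


def measure_queue_wait(pool_size: int, sleepers: int, delay: float) -> float:
    """Return how long a trivial task waits for a free worker behind the sleepers."""
    deleted: list = []
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        for i in range(sleepers):
            pool.submit(delete_after_sleep, f"blob-{i}", delay, deleted)
        submitted = time.monotonic()
        # The task records the moment it actually starts running on a worker.
        started = pool.submit(time.monotonic).result()
    return started - submitted
```

With `measure_queue_wait(2, 2, 0.3)` the task waits roughly the full delay, because both workers are asleep; with `measure_queue_wait(2, 1, 0.3)` one worker is free and the wait is close to zero.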

This is the remaining root cause behind #7531's storage-pool saturation: the pool was bumped 32→64→96→128 in ten days and _PRECACHE_FILE_SEM was halved (#7526) largely to feed threads whose only job was waiting.

Change

  • New utils/other/deferred_delete.py: DeferredDeleter — a due-time min-heap plus one lazily-started daemon thread. schedule() is an O(log n) heap push; the janitor wakes when the next deletion is due. An earlier-due schedule arriving mid-wait re-notifies and re-peeks, so ordering holds. Delete failures are logged and skipped (the syncing bucket's lifecycle rule remains the backstop, exactly as it was for the sleeping threads).
  • storage.py: schedule_syncing_temporal_file_deletion(path, delay=480s) wraps a module-level janitor bound to delete_syncing_temporal_file. Same 480s semantics (under the 15-min signed-URL validity).
  • All four call sites switch to the scheduler; chat.py drops its now-unused storage_executor/time imports.
  • _PRECACHE_FILE_SEM 2 → 4, reverting the load-shed from #7526 ("backend: lower precache concurrency cap 4 → 2 to free storage_executor for sync") now that the sleepers are gone; audio merge/precache get the freed headroom.

Deletion timing, crash semantics (pending deletions die with the process either way), and the lifecycle backstop are all unchanged. Hundreds of pending deletions now cost one thread instead of hundreds.
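The design described above can be sketched roughly as follows. This is a minimal reconstruction from the description, not the contents of utils/other/deferred_delete.py: the method bodies, the use of `time.monotonic()`, the `(due, seq, path)` tuple layout, and the stand-in body of `delete_syncing_temporal_file` are assumptions made for illustration.

```python
import heapq
import itertools
import logging
import threading
import time
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger(__name__)


class DeferredDeleter:
    """Illustrative sketch: a due-time min-heap drained by one lazily started daemon thread."""

    def __init__(self, delete_fn: Callable[[str], None], name: str = "blob-janitor"):
        self._delete_fn = delete_fn
        self._name = name
        self._heap: List[Tuple[float, int, str]] = []
        self._seq = itertools.count()  # tie-breaker so equal due times pop in FIFO order
        self._cond = threading.Condition()
        self._thread: Optional[threading.Thread] = None

    def schedule(self, path: str, delay: float) -> None:
        due = time.monotonic() + delay
        with self._cond:
            heapq.heappush(self._heap, (due, next(self._seq), path))  # O(log n)
            if self._thread is None or not self._thread.is_alive():
                self._thread = threading.Thread(target=self._run, name=self._name, daemon=True)
                self._thread.start()
            self._cond.notify()  # wake the janitor so it re-peeks the heap head

    def _run(self) -> None:
        while True:
            with self._cond:
                while True:
                    if not self._heap:
                        self._cond.wait()
                        continue
                    remaining = self._heap[0][0] - time.monotonic()
                    if remaining <= 0:
                        break
                    # An earlier-due schedule() notifies and shortens this wait.
                    self._cond.wait(timeout=remaining)
                _, _, path = heapq.heappop(self._heap)
            try:
                self._delete_fn(path)  # outside the lock, so schedule() never blocks on I/O
            except Exception:
                logger.exception("deferred delete failed for %s; skipping", path)


def delete_syncing_temporal_file(path: str) -> None:
    """Stand-in for the real GCS delete in storage.py (assumption: it takes a blob path)."""
    logger.info("would delete %s", path)


_janitor = DeferredDeleter(delete_syncing_temporal_file, name="syncing-blob-janitor")


def schedule_syncing_temporal_file_deletion(path: str, delay: float = 480.0) -> None:
    _janitor.schedule(path, delay)
```

Running the delete outside the condition's lock is a choice made in this sketch so that a slow delete cannot stall callers of `schedule()`; the PR text does not say whether the real module does the same.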

Expected impact (measurable immediately after deploy)

executor_pool_health storage active_count should drop from ~90–100 to real work only (~5–15). That single number is the before/after check.

Tests

  • New tests/unit/test_deferred_blob_janitor.py (11 tests): real-module behavioral coverage (due-order with out-of-order schedules, near-term schedule interrupting a long wait, failure doesn't kill the janitor, 200 pending = exactly 1 thread) + structural guards (no time.sleep(480) remains, all four sites use the scheduler, sem = 4). Registered in test.sh.
  • test_storage_fanout_limits.py: semaphore value assertion updated 2 → 4 with rationale.
  • test_sync_silent_failure.py: removed the executor-swap setup/teardown machinery that existed solely to neutralize the old sleeping deleters in tests; all 41 pass.
  • scan_async_blockers.py: no new findings (the one chat.py hit pre-exists on main).
  • test_sync_v2.py remains at the main baseline (5 pre-existing failures from #7819, the fix for offline-sync chunks splitting into separate conversations (#6551); unrelated).
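The structural guards mentioned in the first bullet could look something like the sketch below. The helper names `find_sleep_timers` and `uses_scheduler` are hypothetical and are not taken from test_deferred_blob_janitor.py; the idea is only that a test reads each call site's source text and asserts on these two checks.

```python
import re
from typing import List

# Matches time.sleep(480) with optional inner whitespace; 480 is the delay named in this PR.
_SLEEP_TIMER = re.compile(r"\btime\.sleep\(\s*480\s*\)")


def find_sleep_timers(source: str) -> List[int]:
    """Return 1-based line numbers where a time.sleep(480) call still appears."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if _SLEEP_TIMER.search(line)
    ]


def uses_scheduler(source: str) -> bool:
    """True if the source calls the scheduling wrapper at least once."""
    return "schedule_syncing_temporal_file_deletion(" in source
```

A guard test would then assert `find_sleep_timers(text) == []` and `uses_scheduler(text)` for the source of each migrated file.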

🤖 Generated with Claude Code

One daemon thread + a due-time heap replace the per-file
time.sleep(480) pattern that parked a storage_executor thread per
blob — ~70% of the pool idle as ad-hoc timers at sync volume (#7531).
…estore precache sem to 4

The 4→2 cut (#7526) was load-shedding while the pool was full of
sleeping deletion timers; with the janitor holding those, precache
gets its concurrency back.
…age sites

Drops the now-unused storage_executor and time imports.
@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Contributor

Greptile Summary

This PR replaces four per-file time.sleep(480) deletion timers (three in utils/chat.py, one in routers/sync.py) with a single DeferredDeleter janitor — a min-heap of due times plus one lazily-started daemon thread — freeing the ~90 storage_executor threads that were parked as idle timers and restoring _PRECACHE_FILE_SEM from 2 → 4.

  • utils/other/deferred_delete.py (new): DeferredDeleter uses a threading.Condition-guarded min-heap; schedule() is O(log n); the janitor wakes only when the next deletion is due and re-peeks on early-arrival notifications, so ordering is maintained across concurrent schedule() calls.
  • utils/other/storage.py: adds schedule_syncing_temporal_file_deletion wrapping a module-level DeferredDeleter singleton; _PRECACHE_FILE_SEM reverted 2 → 4.
  • Tests: 11 new behavioral + structural tests in test_deferred_blob_janitor.py; test_sync_silent_failure.py drops the executor-swap machinery that existed solely to neutralize the old sleeping deleters.

Confidence Score: 4/5

Safe to merge — the core algorithm is correct, all four call sites are migrated, and the lifecycle backstop remains unchanged.

The DeferredDeleter min-heap logic handles due-order, early interruption, and delete failures correctly. The one gap is that _thread is never checked for liveness, so a crashed janitor would silently stop processing the heap. Given the explicit best-effort design and the GCS lifecycle backstop this is low-impact, but it is an undetected failure mode.

backend/utils/other/deferred_delete.py — the schedule() method's thread-start guard should check is_alive() in addition to the None check.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/utils/other/deferred_delete.py | New janitor implementation: min-heap + one daemon thread; algorithm is correct for normal operation but dead-thread detection is absent in schedule() |
| backend/utils/other/storage.py | Adds schedule_syncing_temporal_file_deletion wrapping the module-level DeferredDeleter singleton and reverts _PRECACHE_FILE_SEM from 2 → 4 |
| backend/utils/chat.py | Removes all three per-file sleep(480) closures and storage_executor/time imports, replaces with schedule_syncing_temporal_file_deletion |
| backend/routers/sync.py | Removes the one remaining per-file sleep(480) deletion closure in process_segment, replaces with schedule_syncing_temporal_file_deletion |
| backend/tests/unit/test_deferred_blob_janitor.py | New behavioral + structural tests for the janitor; covers due-order, interrupt-long-wait, failure resilience, single-thread invariant, and source-pattern guards |
| backend/tests/unit/test_storage_fanout_limits.py | Semaphore assertion updated from BoundedSemaphore(2) to BoundedSemaphore(4) with rationale comment |
| backend/tests/unit/test_sync_silent_failure.py | Removes executor-swap setup/teardown machinery that existed solely to neutralize sleeping deleters; adds schedule_syncing_temporal_file_deletion mock where needed |
| backend/test.sh | Registers the new test_deferred_blob_janitor.py in the test suite |

Sequence Diagram

sequenceDiagram
    participant Caller as chat.py / sync.py
    participant Sched as schedule_syncing_temporal_file_deletion
    participant Deleter as DeferredDeleter (heap + cond)
    participant Janitor as syncing-blob-janitor thread
    participant GCS as GCS (syncing bucket)

    Caller->>Sched: schedule_syncing_temporal_file_deletion(path)
    Sched->>Deleter: schedule(path, 480s)
    Deleter->>Deleter: heappush((now+480, seq, path))
    Deleter-->>Janitor: cond.notify()
    Note over Janitor: waits until due time

    Note over Janitor: 480s later…
    Janitor->>Deleter: heappop() — due item
    Janitor->>GCS: delete_syncing_temporal_file(path)
    GCS-->>Janitor: OK (or BlobNotFound → ignored)

Reviews (1): Last reviewed commit: "test(backend): register test_deferred_bl..."

Comment thread backend/utils/other/deferred_delete.py Outdated
Comment on lines +37 to +40
            if self._thread is None:
                self._thread = threading.Thread(target=self._run, name=self._name, daemon=True)
                self._thread.start()
            self._cond.notify()

Contributor

P2 Dead janitor thread silently stops all future deletions

_thread is set once and never cleared. If _run exits unexpectedly — e.g., via a BaseException subclass like MemoryError or SystemExit that bypasses the except Exception catch — _thread will still point to a dead Thread object. Every subsequent schedule() call skips the if self._thread is None: branch, items pile up in the heap, and no deletion ever fires. The lifecycle rule is the backstop, but this is a silent failure rather than a logged one. Changing the guard to if self._thread is None or not self._thread.is_alive(): would restart the janitor in the rare case it dies.
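The suggested guard can be shown in isolation. The class below is a hypothetical minimal holder written for this illustration (it is not DeferredDeleter, and the `starts` counter exists only to make the behavior observable); it demonstrates that adding the `is_alive()` check restarts a worker whose thread has exited, while leaving a live worker untouched.

```python
import threading
from typing import Callable, Optional


class RestartableWorker:
    """Hypothetical holder showing only the thread-start guard."""

    def __init__(self, target: Callable[[], None], name: str = "janitor"):
        self._target = target
        self._name = name
        self._lock = threading.Lock()
        self._thread: Optional[threading.Thread] = None
        self.starts = 0  # counts how many threads were started, for observation only

    def ensure_running(self) -> None:
        with self._lock:
            # `is None` covers the lazy first start; `is_alive()` covers a thread that exited.
            if self._thread is None or not self._thread.is_alive():
                self._thread = threading.Thread(target=self._target, name=self._name, daemon=True)
                self._thread.start()
                self.starts += 1
```

With only the `is None` check, the second call after the thread had exited would do nothing, which is the silent failure described in this comment.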

@kodjima33 kodjima33 left a comment

Collaborator

Solid janitor-thread design w/ tests; perf rewrite stays maintainer-merge per policy

Greptile review: a MemoryError/SystemExit escaping the except-Exception
catch would leave _thread pointing at a dead thread, silently piling up
schedules for the process lifetime. is_alive() guard self-heals.
@mdmohsin7 mdmohsin7 merged commit 7be6e13 into main Jun 12, 2026
1 check passed
@mdmohsin7 mdmohsin7 deleted the feat/syncing-blob-janitor branch June 12, 2026 14:43