perf(backend): replace per-file sleep(480) deletion timers with a single janitor thread#7855
Conversation
One daemon thread + a due-time heap replace the per-file time.sleep(480) pattern that parked a storage_executor thread per blob — ~70% of the pool idle as ad-hoc timers at sync volume (#7531).
…estore precache sem to 4 The 4→2 cut (#7526) was load-shedding while the pool was full of sleeping deletion timers; with the janitor holding those, precache gets its concurrency back.
…age sites Drops the now-unused storage_executor and time imports.
Greptile SummaryThis PR replaces four per-file
Confidence Score: 4/5Safe to merge — the core algorithm is correct, all four call sites are migrated, and the lifecycle backstop remains unchanged. The DeferredDeleter min-heap logic handles due-order, early interruption, and delete failures correctly. The one gap is that _thread is never checked for liveness, so a crashed janitor would silently stop processing the heap. Given the explicit best-effort design and the GCS lifecycle backstop this is low-impact, but it is an undetected failure mode. backend/utils/other/deferred_delete.py — the schedule() method's thread-start guard should check is_alive() in addition to the None check. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller as chat.py / sync.py
participant Sched as schedule_syncing_temporal_file_deletion
participant Deleter as DeferredDeleter (heap + cond)
participant Janitor as syncing-blob-janitor thread
participant GCS as GCS (syncing bucket)
Caller->>Sched: schedule_syncing_temporal_file_deletion(path)
Sched->>Deleter: schedule(path, 480s)
Deleter->>Deleter: heappush((now+480, seq, path))
Deleter-->>Janitor: cond.notify()
Note over Janitor: waits until due time
Note over Janitor: 480s later…
Janitor->>Deleter: heappop() — due item
Janitor->>GCS: delete_syncing_temporal_file(path)
GCS-->>Janitor: OK (or BlobNotFound → ignored)
Reviews (1): Last reviewed commit: "test(backend): register test_deferred_bl..." | Re-trigger Greptile |
| if self._thread is None: | ||
| self._thread = threading.Thread(target=self._run, name=self._name, daemon=True) | ||
| self._thread.start() | ||
| self._cond.notify() |
There was a problem hiding this comment.
Dead janitor thread silently stops all future deletions
_thread is set once and never cleared. If _run exits unexpectedly — e.g., via a BaseException subclass like MemoryError or SystemExit that bypasses the except Exception catch — _thread will still point to a dead Thread object. Every subsequent schedule() call skips the if self._thread is None: branch, items pile up in the heap, and no deletion ever fires. The lifecycle rule is the backstop, but this is a silent failure rather than a logged one. Changing the guard to if self._thread is None or not self._thread.is_alive(): would restart the janitor in the rare case it dies.
kodjima33
left a comment
There was a problem hiding this comment.
Solid janitor-thread design w/ tests; perf rewrite stays maintainer-merge per policy
Greptile review: a MemoryError/SystemExit escaping the except-Exception catch would leave _thread pointing at a dead thread, silently piling up schedules for the process lifetime. is_alive() guard self-heals.
Problem
Four call sites (1 in
routers/sync.py, 3 voice-message flows inutils/chat.py) delete temporal GCS blobs by parking astorage_executorthread intime.sleep(480)per file. At current sync volume (~20 jobs/min post-#7801) that keeps ~90 of the pool's 128 threads asleep as ad-hoc timers — confirmed in prodexecutor_pool_healthlogs (~70–80% "utilization" with queue_depth 0, i.e. occupancy without work).This is the remaining root cause behind #7531's storage-pool saturation: the pool was bumped 32→64→96→128 in ten days and
_PRECACHE_FILE_SEMwas halved (#7526) largely to feed threads whose only job was waiting.Change
utils/other/deferred_delete.py:DeferredDeleter— a due-time min-heap plus one lazily-started daemon thread.schedule()is an O(log n) heap push; the janitor wakes when the next deletion is due. An earlier-due schedule arriving mid-wait re-notifies and re-peeks, so ordering holds. Delete failures are logged and skipped (the syncing bucket's lifecycle rule remains the backstop, exactly as it was for the sleeping threads).storage.py:schedule_syncing_temporal_file_deletion(path, delay=480s)wraps a module-level janitor bound todelete_syncing_temporal_file. Same 480s semantics (under the 15-min signed-URL validity).chat.pydrops its now-unusedstorage_executor/timeimports._PRECACHE_FILE_SEM2 → 4, reverting backend: lower precache concurrency cap 4 → 2 to free storage_executor for sync #7526's load-shed now that the sleepers are gone — audio merge/precache get the freed headroom.Deletion timing, crash semantics (pending deletions die with the process either way), and the lifecycle backstop are all unchanged. Hundreds of pending deletions now cost one thread instead of hundreds.
Expected impact (measurable immediately after deploy)
executor_pool_healthstorageactive_countshould drop from ~90–100 to real work only (~5–15). That single number is the before/after check.Tests
tests/unit/test_deferred_blob_janitor.py(11 tests): real-module behavioral coverage (due-order with out-of-order schedules, near-term schedule interrupting a long wait, failure doesn't kill the janitor, 200 pending = exactly 1 thread) + structural guards (notime.sleep(480)remains, all four sites use the scheduler, sem = 4). Registered intest.sh.test_storage_fanout_limits.py: semaphore value assertion updated 2 → 4 with rationale.test_sync_silent_failure.py: removed the executor-swap setup/teardown machinery that existed solely to neutralize the old sleeping deleters in tests; all 41 pass.scan_async_blockers.py: no new findings (the onechat.pyhit pre-exists on main).test_sync_v2.pyremains at the main baseline (5 pre-existing failures from fix(backend): offline-sync chunks split into separate conversations — serialize assignment chronologically (#6551) #7819, unrelated).🤖 Generated with Claude Code