Skip to content

fix(backend): bound first-file cache wait in /v1/sync/audio urls endpoint (#7325)#7815

Open
kodjima33 wants to merge 3 commits into
mainfrom
watchdog/issue-7325-urls-bounded-wait
Open

fix(backend): bound first-file cache wait in /v1/sync/audio urls endpoint (#7325)#7815
kodjima33 wants to merge 3 commits into
mainfrom
watchdog/issue-7325-urls-bounded-wait

Conversation

@kodjima33

Copy link
Copy Markdown
Collaborator

Bug

#7325 — audio download endpoints hang with 0 bytes (90–300s, then timeout). Confirmed live in prod LB logs: GET /v1/sync/audio/{id}/urls repeatedly took 60–115s on 2026-06-10 alone; anything slower hits the 120s TimeoutMiddleware / client timeouts and surfaces as the reported hang.

Root cause

get_audio_signed_urls_endpoint merges the first uncached audio file synchronously in the request thread ("cache synchronously for immediate playback", routers/sync.py). For large recordings (hundreds of chunks) or under GCS pushback, download_audio_chunks_and_merge takes minutes, so the request blocks with zero bytes sent.

(The 429s the reporter saw are produced at the edge — those requests never reach backend pods, no app-level rate limit exists on this path — so the limiter half of the issue is infra config, out of scope here.)

Fix

Bound the wait: submit _precache_audio_file to sync_executor (not storage_executor — its chunk-download children run there; deadlock rule 3) and wait at most FIRST_FILE_CACHE_WAIT_SECONDS = 15s. Healthy merges (0.1–5s per prod logs) behave exactly as before. On timeout the merge keeps running and lands in the GCS cache; the file is returned as pending with a warning log.

Clients already handle pending: the Flutter player falls back to the direct stream URL, and the web detail panel precaches + refetches. A fast pending beats a 60–115s block or an opaque 504.

Tests

tests/unit/test_sync_urls_bounded_wait.py (registered in test.sh): structural guards (bounded result(timeout=), no inline merge, coordinator not on storage_executor, budget < middleware timeout) + logic tests of the wait/timeout primitive (timed-out merge still completes in background). All 6 verified locally via stdlib harness (pytest unavailable locally; CI runs the suite). py_compile, lint_async_blockers, black clean. UNVERIFIED at runtime against a live large-conversation merge.

🤖 automated by hourly watchdog; opened for review, not merged.

Fixes #7325

@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a production hang on GET /v1/sync/audio/{id}/urls where large first-file audio merges (60–115s) exceeded both client timeouts and the 120s middleware limit. The fix submits _precache_audio_file to sync_executor (not storage_executor, to prevent deadlock since its chunk-download children run there), waits at most 15 s via Future.result(timeout=), and returns the file as pending on timeout while the merge finishes in the background.

  • backend/routers/sync.py: Replaces the unbounded inline _precache_audio_file call with a bounded sync_executor-backed future and a FuturesTimeoutError catch; introduces FIRST_FILE_CACHE_WAIT_SECONDS = 15.0.
  • backend/tests/unit/test_sync_urls_bounded_wait.py: New test file with structural guards (bounded wait, no inline call, correct pool, budget < 120s) and a ThreadPoolExecutor-based proof that Future.result(timeout=) leaves the background task running to completion after timing out.

Confidence Score: 4/5

The fix correctly bounds the first-file wait to 15 s and the executor choice avoids deadlock. Safe to merge, with two minor gaps worth addressing before heavy production use.

The core logic is sound: sync_executor is the right pool (its chunk-download children run on storage_executor), the 15s budget is well under the 120s middleware timeout, and the pending fallback is already handled by clients. Two issues reduce confidence slightly: the except FuturesTimeoutError block does not catch concurrent.futures.CancelledError, so a queued-but-cancelled future during shutdown produces a 500 instead of a graceful pending response; and the structural test only checks != storage_executor (negative), so any other wrong executor choice would go undetected.

backend/routers/sync.py (exception handling around lines 297-303) and backend/tests/unit/test_sync_urls_bounded_wait.py (pool assertion at lines 64-66)

Important Files Changed

Filename Overview
backend/routers/sync.py Core fix: replaces unbounded inline _precache_audio_file with a sync_executor-backed future and 15s timeout; logic is correct and executor choice avoids deadlock, but CancelledError from future cancellation is not handled alongside FuturesTimeoutError.
backend/tests/unit/test_sync_urls_bounded_wait.py Good structural and primitive tests; the pool assertion in test_coordinator_not_on_storage_executor only checks the negative case (!= storage_executor) and would pass for any other (incorrect) executor choice.
backend/test.sh Registers the new test file in the test suite; no issues.

Reviews (1): Last reviewed commit: "test(backend): register test_sync_urls_b..." | Re-trigger Greptile

Comment on lines +64 to +66
assert (
m.group(1) != 'storage_executor'
), "coordinator must not share storage_executor with its chunk-download children"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The assertion only guards the negative case. If a future refactor changes the pool to postprocess_executor, db_executor, or any other executor, this test still passes — including pools whose workers could deadlock or whose purpose clashes with audio-merge work. Asserting positively on sync_executor catches any such drift.

Suggested change
assert (
m.group(1) != 'storage_executor'
), "coordinator must not share storage_executor with its chunk-download children"
assert (
m.group(1) == 'sync_executor'
), "coordinator must run on sync_executor (not storage_executor, to avoid deadlock with its children)"

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment thread backend/routers/sync.py
Comment on lines +297 to +303
try:
first_file_future.result(timeout=FIRST_FILE_CACHE_WAIT_SECONDS)
except FuturesTimeoutError:
logger.warning(
f"sync_urls first-file cache exceeded {FIRST_FILE_CACHE_WAIT_SECONDS}s, returning pending "
f"uid={uid} convo={conversation_id} file={audio_file_id}"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 CancelledError not caught alongside FuturesTimeoutError

future.result() can also raise concurrent.futures.CancelledError — a separate exception class from FuturesTimeoutError — when the future is cancelled before it starts executing. This happens when sync_executor is at capacity and shutdown_executors(cancel_futures=True) fires during a graceful shutdown: the queued task is cancelled and result() raises rather than blocking. The exception propagates unhandled through the endpoint and becomes a 500 instead of the graceful pending response that would be correct here. Catching (FuturesTimeoutError, CancelledError) in the same handler would close the gap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Audio download endpoint hangs with 0 bytes after 429 rate-limit (other endpoints unaffected)

1 participant