Fix anyio 4.13 CPU hot-loop: narrow zombie-scope patch #152
Open
constkolesnyak wants to merge 1 commit into
Adds a monkey-patch of anyio.CancelScope._deliver_cancellation that fixes a 100% CPU spin observed in production when a done task lingers in CancelScope._tasks longer than upstream's task_done callback takes to prune it.

Bug: upstream sets should_retry=True for every task in _tasks before checking whether a cancel can actually be delivered. For a done task, task.cancel() is a no-op but should_retry stays True, so the scope re-arms call_soon(_deliver_cancellation) on every event-loop tick. Observed live (April 24, 2026): three simultaneous zombie-scopes in one process, ~55k epoll_pwait/sec combined, 100% CPU on one core, each scope holding a single done task.

Fix: insert a single `if task.done(): continue` at the top of the loop body, before should_retry is touched. Every other branch is byte-for-byte upstream anyio 4.13.0: deferred self-delivery (s.cancel(); await sleep(N)) and re-delivery after a swallowed CancelledError both behave exactly like upstream.

Tests (10/10 passing):
* test_patch_is_applied / test_apply_is_idempotent: wiring
* test_fail_after_still_works / test_move_on_after_still_works / test_task_group_cancellation_still_works: upstream behavior
* test_self_cancel_then_await_sleep_cancels_immediately: preserves anyio's deferred self-delivery contract
* test_task_cancels_own_taskgroup_scope_then_awaits: same shape via TaskGroup
* test_swallowed_cancelled_error_is_redelivered: preserves anyio's re-delivery contract
* test_no_hot_loop_when_only_task_is_done: the zombie-scope regression this patch is for
* test_live_task_still_reschedules_alongside_done_task: mixed done+live scope still reschedules for the live task

The patch is intentionally as narrow as possible. An earlier wider version (skipping current_task and _must_cancel without setting should_retry) was reverted because it broke deferred self-delivery and re-delivery; see the module docstring for the writeup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Problem

`anyio.CancelScope._deliver_cancellation` sets `should_retry = True` for every task in `self._tasks` before checking whether a cancel can actually be delivered. When a task with `done() is True` lingers in `_tasks` (the upstream `task_done` callback doesn't always prune it before the scope's cancel callback fires), `task.cancel()` is a no-op but `should_retry` stays True, so the scope re-arms `call_soon(_deliver_cancellation)` on every event-loop tick.

Result: one CPU core pinned at ~100% with tens of thousands of `epoll_pwait` syscalls per second and no forward progress.

Observed live in production on 2026-04-24: three simultaneous zombie-scopes in one nerve process (scope IDs `0x7ffec1774f50`, `0x7ffec17ae090`, `0x7ffec17ad6d0`), ~55k `epoll_pwait`/sec combined, 100% CPU on one core, load 1.6, cpu-thermal 60°C. Each scope held a single `done=True` task. Diagnosed via `py-spy dump` and a GC scan of `CancelScope` instances.

The existing `_safe_disconnect()` workaround in `nerve/agent/engine.py` only clears the scope during `client.disconnect()`, so spins triggered elsewhere (Telegram polling, cron, an active SDK request) aren't covered.

Fix
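As a rough model of the retry logic described under Problem, and of the effect of the one-line done-task skip, the sketch below uses plain asyncio and a hypothetical `ToyScope` class. It is not anyio's real `CancelScope` (whose `_deliver_cancellation` handles more cases); the `skip_done` flag and the `max_rearms` cap are assumptions added only so the buggy variant terminates.

```python
import asyncio


class ToyScope:
    """Hypothetical, heavily simplified stand-in for a cancel scope."""

    def __init__(self, tasks, skip_done, max_rearms):
        self.tasks = set(tasks)
        self.skip_done = skip_done    # True models the narrow fix
        self.max_rearms = max_rearms  # safety cap so the demo cannot spin forever
        self.rearms = 0

    def deliver(self):
        should_retry = False
        for task in self.tasks:
            if self.skip_done and task.done():
                continue  # the one-line skip: a done task never requests a retry
            # Upstream-style ordering: the retry flag is set before we know
            # whether the cancel can actually land.
            should_retry = True
            task.cancel()  # no effect on a task that is already done
        if should_retry and self.rearms < self.max_rearms:
            self.rearms += 1
            # Re-arm on the next event-loop tick.
            asyncio.get_running_loop().call_soon(self.deliver)


async def count_rearms(skip_done, max_rearms=1000):
    async def noop():
        return None

    zombie = asyncio.create_task(noop())
    await zombie  # the task is now done but still tracked by the scope

    scope = ToyScope([zombie], skip_done, max_rearms)
    scope.deliver()
    # Yield to the loop enough times for every scheduled callback to run.
    for _ in range(max_rearms + 5):
        await asyncio.sleep(0)
    return scope.rearms


if __name__ == "__main__":
    print("without skip:", asyncio.run(count_rearms(skip_done=False)))
    print("with skip:   ", asyncio.run(count_rearms(skip_done=True)))
```

In this toy model the variant without the skip re-arms on every tick until it hits the cap, while the variant with the skip never re-arms for a scope that holds only a done task.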
Monkey-patch `anyio.CancelScope._deliver_cancellation` (applied from `nerve/__init__.py` so any import path picks it up). The patch is as narrow as possible: it adds a single `if task.done(): continue` skip at the top of the loop body, before `should_retry = True` is touched. Every other branch is byte-for-byte identical to upstream anyio 4.13.0.

This intentional narrowness preserves anyio's contracts that a wider patch would break:

* Deferred self-delivery: `with anyio.CancelScope() as s: s.cancel(); await sleep(N)`. `s.cancel()` calls `_deliver_cancellation` synchronously while `current_task()` points at the host task; anyio relies on the `call_soon` reschedule firing on the next tick (when `current_task()` is `None`) to land the cancel. Skipping the current task without `should_retry=True` strands the cancel, and the sleep runs to completion.
* Re-delivery after swallowed CancelledError: anyio's contract is to keep redelivering until the scope exits. Skipping `_must_cancel` tasks without `should_retry=True` strands tasks that catch the first `CancelledError` and loop.

A previous wider version of this patch (skipping `current_task` and `_must_cancel` in addition to done tasks) was reverted because it broke both contracts. See the module docstring for the full writeup.

Tests
`tests/test_anyio_patch.py`: 10 tests, all passing.

Wiring (2):
* `test_patch_is_applied`: patch is installed at import time
* `test_apply_is_idempotent`: applying twice is a no-op

Upstream behavior still works (3):
* `test_fail_after_still_works`
* `test_move_on_after_still_works`
* `test_task_group_cancellation_still_works`

Anyio cancellation contracts preserved (3; these would fail on a wider patch):
* `test_self_cancel_then_await_sleep_cancels_immediately`: `s.cancel(); await sleep(5)` returns in <0.5s via deferred self-delivery
* `test_task_cancels_own_taskgroup_scope_then_awaits`: same shape via TaskGroup
* `test_swallowed_cancelled_error_is_redelivered`: task catches first `CancelledError`, loops, gets redelivered (≥3 deliveries within `fail_after(2.0)`)

The zombie-scope regression this patch is for (2):
* `test_no_hot_loop_when_only_task_is_done`: stub scope with a single done task does not reschedule
* `test_live_task_still_reschedules_alongside_done_task`: mixed done + live scope still reschedules for the live task (proves the done-skip doesn't short-circuit re-delivery for others)

Files
* `nerve/_anyio_patch.py`: the narrow patched `_deliver_cancellation` (new)
* `nerve/__init__.py`: import the patch at module load
* `tests/test_anyio_patch.py`: regression suite (new)

History
Supersedes #128 and addresses @pufit's review feedback there. The previous PR's monkeypatch also skipped `current_task` and `_must_cancel` without setting `should_retry`, which broke the two anyio contracts above. This version narrows the patch to the single deviation that the production observation actually requires and adds tests for both regression patterns.

🤖 Generated with Claude Code