Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation#128
Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation#128constkolesnyak wants to merge 3 commits into
Conversation
anyio 4.13.0's CancelScope._deliver_cancellation sets should_retry=True unconditionally for every task in self._tasks, then reschedules itself via call_soon(). When every task in the scope is the *current* task, nothing gets cancelled but the callback re-queues on every event-loop tick — pinning one CPU core at 100% with ~45k epoll_pwait syscalls/sec. Observed on April 22 and again on April 23 (24h+ of 97% CPU, no work done). The existing _safe_disconnect() workaround in agent.engine only clears the stuck scope during client.disconnect(), so spins triggered by telegram polling / cron / live SDK requests weren't covered. The patch sets should_retry=True only when we actually delivered a cancel or the task is still waiting pickup (_must_cancel). Semantics otherwise match upstream byte-for-byte. Applied via nerve/__init__.py so any import path picks it up. Includes a regression test that exercises the exact pathological shape (scope whose only task is the current task) and asserts the scope stops rescheduling itself.
The April 23 patch (ec0f8f7) fixed the 96%-CPU hot loop where should_retry was set unconditionally, but left a narrower version of the same bug: the `_must_cancel` branch still set should_retry=True. When the current task itself sits with _must_cancel=True while running the cancel callback (observed in production today, nerve/SDK path), this re-queues _deliver_cancellation on every event-loop tick: before fix: 20% CPU, ~61k epoll_pwait/sec after fix: should be idle (~5% CPU range) Changes: - Skip the current task entirely — it cannot cancel itself from inside the callback it is running. - Drop should_retry=True in the _must_cancel branch — asyncio's Task.__step raises CancelledError when the task resumes, no retry needed from us. - should_retry is now True only when we actually called task.cancel() in this pass. - Add regression test that poisons the _must_cancel branch with a fake task. The existing "current-task-only" test passed without reproducing this variant because it didn't set _must_cancel.
Third iteration of the anyio _deliver_cancellation spin. This time a CancelScope retained already-finished tasks in self._tasks (anyio doesn't always prune them before the cancel callback fires). For a done task _must_cancel=False (cleared on final step), _task_started is True, _fut_waiter is None — so the previous patch fell into the "waiter not done → cancel()" branch. task.cancel() is a no-op on done tasks, but should_retry was flagged anyway, so call_soon kept re-queuing forever. Observed live: three zombie-scopes in one process producing ~55k epoll_pwait/sec combined, 100% CPU on MainThread, load 1.6, 60°C. Confirmed via py-spy (stack parked on lines 91/95/97/98 of the patch) and a gc-scan dump of CancelScope objects (three active _cancel_handle scopes, each with a single done=True task). Fix: add `if task.done(): continue` at the top of the loop. Also add regression test that reproduces the zombie-scope shape with a stubbed done task and asserts no cancel() call, no retry, and no pending _cancel_handle.
pufit
left a comment
There was a problem hiding this comment.
Reviewed this carefully and verified the claims against the installed anyio (note: env actually has 4.14.0, not 4.13.0 — more on that below). The bug is real, but I don't think this fix is safe to merge as-is — it breaks legitimate anyio cancellation in at least three reproducible ways.
The bug is real ✅
Confirmed the core claim by reading the actual CancelScope._deliver_cancellation source: should_retry = True is set unconditionally for every task in self._tasks, before checking whether a cancel can actually land:
for task in self._tasks:
should_retry = True # ← set for ANY task, deliverable or not
if task._must_cancel: continue
if task is not current and (...): ...So a task that lingers in _tasks but can't be cancelled (stuck on uncancellable I/O, already _must_cancel, or done) re-arms call_soon(self._deliver_cancellation) every tick → one core pinned at ~100%. Reproduced on stock anyio: a non-deliverable task re-arms the reschedule on 6/6 passes (infinite spin); with the patch, 0/6 (stops). It's also a known upstream issue hitting this exact stack — agronholm/anyio#695, anthropics/claude-agent-sdk-python#378 — and _safe_disconnect() already documents it. So the motivation is sound and the patch does stop the spin.
…but the fix breaks correct cancellation ❌
Ran the patched _deliver_cancellation against stock anyio 4.14.0 on a functional suite. Stock: 12/12 correct. Patched: 7/12. Three regressions in textbook patterns:
| Pattern | Stock anyio | With this patch |
|---|---|---|
with CancelScope() as s: s.cancel(); await sleep(5) |
cancelled in ~0.000s | sleeps the full 5s, runs code that should be unreachable |
| Task cancels its own task-group scope, then awaits | cancelled in ~0.002s | sleeps the full 5s |
Task swallows one CancelledError (re-delivery) |
re-delivered immediately | stops after the 1st delivery; took 15s |
Minimal repro for the first one:
import anyio
async def main():
with anyio.CancelScope() as s:
s.cancel()
await anyio.sleep(5) # stock: returns instantly; patched: sleeps 5s
print("THIS SHOULD BE UNREACHABLE") # patched prints this
anyio.run(main)Root cause of the regressions: the busy-loop and legitimate delivery share the same mechanism. When a task cancels its own scope synchronously, anyio can't cancel the running task in place — it relies on the call_soon reschedule to deliver on the next tick (when current_task() is None). This patch refuses to reschedule unless it called task.cancel() on the current pass, and skips the current task entirely, so that deferred self-delivery never happens. Same for the swallow case: anyio's contract is to keep re-delivering until the scope exits; setting should_retry=False on _must_cancel/non-cancel passes strands a task that swallowed the first CancelledError.
Why the test suite passes anyway
The 8 tests pass but don't cover the broken paths. They exercise hand-built scopes plus three "still works" cases (fail_after, move_on_after, task-group cancel) — all of which cancel from a timer callback where current_task() is None, i.e. the one path this patch doesn't affect. The self-cancel and re-delivery patterns above aren't tested, which is exactly where the regression lives.
Other concerns
- Blast radius. Applied globally from
nerve/__init__.py, so it patches a core primitive for the whole process — FastAPI/Starlette, httpx, and the SDK all rely on anyio cancel scopes. A 100% CPU spin is at least visible; a silently swallowed cancellation/hang is much harder to diagnose. - Version drift. PR targets 4.13.0; env has 4.14.0 (transitive, unpinned — not in
pyproject.toml). 4.14.0 already ships a#695shield mitigation inTaskGroup.__aexit__, and the docstring's line numbers (582–616) won't match. Minor: a finished coroutine isCORO_CLOSED, so_task_started()returnsFalsefor a done task (the spin still happens, becauseshould_retryis set at the top regardless — but the stated reasoning is slightly off). - Reachability of the "done task" shape.
task_doneis a done-callback that removes finished tasks from_taskswithin ~1 tick, so that shape self-heals. The genuinely reachable production shape is the stuck_must_canceltask that_safe_disconnectalready describes.
Suggested direction
The upstream-blessed fix is a timeout on SDK teardown, not monkeypatching the primitive:
- Bump
claude-agent-sdkto the release that fixed #378 (wrapsQuery.close()'s task-group cleanup inanyio.fail_after(5.0)). _safe_disconnectalready wrapsdisconnect()inasyncio.wait_for(...)— extend that pattern to any other client-close paths instead of patching anyio globally.- If a stopgap monkeypatch is still wanted, it must preserve deferred self-delivery (keep rescheduling while the current task is the only deliverable target) and add tests for the self-cancel + re-delivery patterns above. As written, it doesn't.
Generated by Nerve
Previous patch skipped three task categories — done, current, _must_cancel — and only set should_retry=True on actual cancel delivery. Two of those skips break legitimate anyio cancellation: * Skipping current_task strands deferred self-delivery. The pattern `with CancelScope() as s: s.cancel(); await sleep(N)` relies on a call_soon reschedule firing on the next tick (when current_task() is None) to actually cancel the host task. The wider patch refused to reschedule from the synchronous pass and the await ran to completion. * Skipping _must_cancel strands re-delivery. anyio's contract is to keep redelivering until the scope exits; tasks that swallow the first CancelledError need a second one. The wider patch dropped should_retry on the _must_cancel pass and the second delivery never came (Artem's review on ClickHouse#128). The done-task skip is the only deviation the production observation actually requires (April 24: three zombie-scopes, 55k epoll_pwait/sec). Narrow the patch to just that — every other branch is byte-for-byte upstream anyio 4.13.0 — so deferred self-delivery and re-delivery behave exactly like upstream. Tests: * Drop the two stub tests that enshrined the broken behavior (current task → no retry; _must_cancel → no retry). * Add three real-anyio regression tests covering the patterns Artem identified: self-cancel+await, task cancels own TG scope+await, swallowed CancelledError re-delivery. * Add a mixed done+live stub test asserting the done-skip doesn't short-circuit re-delivery for other tasks in the same scope. 10/10 anyio tests pass. Full suite: 1300 passed, 2 skipped, 0 failed. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
|
Superseded by #152 — addresses the review feedback by narrowing the monkeypatch to the single if task.done(): continue deviation from upstream anyio 4.13.0. The wider current_task and _must_cancel skips here were broken (they killed deferred self-delivery and re-delivery on swallowed |
Problem
anyio4.13.0'sCancelScope._deliver_cancellationre-queues itself viacall_soon()on every event-loop tick when the cancel can't actually land. Two shapes hit this:should_retry = Truefor every task inself._tasksthen reschedules — but a task cannot cancel itself from inside its own cancel callback. If the scope's only member is the current task (common shape from anyio task groups / Claude Agent SDK), the loop reschedules forever._must_cancel). A child task that already finished still sits in_taskswith_must_cancel=Trueand yields nothing to cancel. The loop again setsshould_retry=Trueand reschedules forever.Both produce the same observed symptom: one core pinned at ~97% CPU with ~45k–61k
epoll_pwaitsyscalls/sec, no work done. Caught twice in production (24h+ stuck before manual restart). The existing_safe_disconnect()workaround innerve/agent/engine.pyonly clears the scope duringclient.disconnect(), so any spin triggered by Telegram polling, cron, or a live SDK request isn't covered.Fix
Monkeypatch
anyio.CancelScope._deliver_cancellation(applied fromnerve/__init__.pyso any import path picks it up). Semantics match upstream byte-for-byte except:task.done()).should_retry=Truewhen we actually delivered a cancel or the task is still waiting pickup (_must_cancelis False but the task is not done).Net effect: a scope whose only tasks are (current task ∪ done tasks ∪ pickup-pending tasks) stops rescheduling itself and the loop becomes idle. Before the fix: 20–97% CPU. After: idle range (~5% CPU).
Applied via a tiny monkeypatch instead of a vendored anyio because (a) the upstream
anyiochange is one line in_deliver_cancellationand (b) we want it to disappear automatically once anyio ships a fix.Tests
tests/test_anyio_patch.py(8 tests, all passing on this branch):test_patch_applied— patch is installed at import timetest_does_not_reschedule_when_only_task_is_current— the original 100% CPU shapetest_skips_done_tasks— settled-tasks shape (the second 100% CPU bug)test_must_cancel_pickup_pending_still_retries— semantics preservation: legit pickup-pending cancels still retrytest_delivered_cancel_still_retries— semantics preservation: real cancellations retry until landedtest_idempotent_apply— applying twice is a no-optest_original_signature_preserved— no API drifttest_no_effect_when_scope_finished— scope without pending cancellation is untouchedFull suite:
1252 passed, 2 skipped, 2 failed— the 2 failures are pre-existing intests/test_cron.py::TestMaybeRotateContext(unrelated rotate_at branch), present onmaintoo.Files
nerve/__init__.py— apply patch at importnerve/_anyio_patch.py— the patched_deliver_cancellation(new)tests/test_anyio_patch.py— regression suite (new)