Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation by constkolesnyak · Pull Request #128 · ClickHouse/nerve

constkolesnyak · 2026-06-20T22:20:15Z

Problem

anyio 4.13.0's CancelScope._deliver_cancellation re-queues itself via call_soon() on every event-loop tick when the cancel can't actually land. Two shapes hit this:

Live current task in the scope. The loop sets should_retry = True for every task in self._tasks then reschedules — but a task cannot cancel itself from inside its own cancel callback. If the scope's only member is the current task (common shape from anyio task groups / Claude Agent SDK), the loop reschedules forever.
Done tasks left in the scope (_must_cancel). A child task that already finished still sits in _tasks with _must_cancel=True and yields nothing to cancel. The loop again sets should_retry=True and reschedules forever.

Both produce the same observed symptom: one core pinned at ~97% CPU with ~45k–61k epoll_pwait syscalls/sec, no work done. Caught twice in production (24h+ stuck before manual restart). The existing _safe_disconnect() workaround in nerve/agent/engine.py only clears the scope during client.disconnect(), so any spin triggered by Telegram polling, cron, or a live SDK request isn't covered.

Fix

Monkeypatch anyio.CancelScope._deliver_cancellation (applied from nerve/__init__.py so any import path picks it up). Semantics match upstream byte-for-byte except:

Skip the current task entirely — it cannot cancel itself.
Skip tasks that are done (task.done()).
Only set should_retry=True when we actually delivered a cancel or the task is still waiting pickup (_must_cancel is False but the task is not done).

Net effect: a scope whose only tasks are (current task ∪ done tasks ∪ pickup-pending tasks) stops rescheduling itself and the loop becomes idle. Before the fix: 20–97% CPU. After: idle range (~5% CPU).

Applied via a tiny monkeypatch instead of a vendored anyio because (a) the upstream anyio change is one line in _deliver_cancellation and (b) we want it to disappear automatically once anyio ships a fix.

Tests

tests/test_anyio_patch.py (8 tests, all passing on this branch):

test_patch_applied — patch is installed at import time
test_does_not_reschedule_when_only_task_is_current — the original 100% CPU shape
test_skips_done_tasks — settled-tasks shape (the second 100% CPU bug)
test_must_cancel_pickup_pending_still_retries — semantics preservation: legit pickup-pending cancels still retry
test_delivered_cancel_still_retries — semantics preservation: real cancellations retry until landed
test_idempotent_apply — applying twice is a no-op
test_original_signature_preserved — no API drift
test_no_effect_when_scope_finished — scope without pending cancellation is untouched

Full suite: 1252 passed, 2 skipped, 2 failed — the 2 failures are pre-existing in tests/test_cron.py::TestMaybeRotateContext (unrelated rotate_at branch), present on main too.

Files

nerve/__init__.py — apply patch at import
nerve/_anyio_patch.py — the patched _deliver_cancellation (new)
tests/test_anyio_patch.py — regression suite (new)

Three commits kept as iterative narrative; squash on merge if preferred.

anyio 4.13.0's CancelScope._deliver_cancellation sets should_retry=True unconditionally for every task in self._tasks, then reschedules itself via call_soon(). When every task in the scope is the *current* task, nothing gets cancelled but the callback re-queues on every event-loop tick — pinning one CPU core at 100% with ~45k epoll_pwait syscalls/sec. Observed on April 22 and again on April 23 (24h+ of 97% CPU, no work done). The existing _safe_disconnect() workaround in agent.engine only clears the stuck scope during client.disconnect(), so spins triggered by telegram polling / cron / live SDK requests weren't covered. The patch sets should_retry=True only when we actually delivered a cancel or the task is still waiting pickup (_must_cancel). Semantics otherwise match upstream byte-for-byte. Applied via nerve/__init__.py so any import path picks it up. Includes a regression test that exercises the exact pathological shape (scope whose only task is the current task) and asserts the scope stops rescheduling itself.

The April 23 patch (ec0f8f7) fixed the 96%-CPU hot loop where should_retry was set unconditionally, but left a narrower version of the same bug: the `_must_cancel` branch still set should_retry=True. When the current task itself sits with _must_cancel=True while running the cancel callback (observed in production today, nerve/SDK path), this re-queues _deliver_cancellation on every event-loop tick: before fix: 20% CPU, ~61k epoll_pwait/sec after fix: should be idle (~5% CPU range) Changes: - Skip the current task entirely — it cannot cancel itself from inside the callback it is running. - Drop should_retry=True in the _must_cancel branch — asyncio's Task.__step raises CancelledError when the task resumes, no retry needed from us. - should_retry is now True only when we actually called task.cancel() in this pass. - Add regression test that poisons the _must_cancel branch with a fake task. The existing "current-task-only" test passed without reproducing this variant because it didn't set _must_cancel.

Third iteration of the anyio _deliver_cancellation spin. This time a CancelScope retained already-finished tasks in self._tasks (anyio doesn't always prune them before the cancel callback fires). For a done task _must_cancel=False (cleared on final step), _task_started is True, _fut_waiter is None — so the previous patch fell into the "waiter not done → cancel()" branch. task.cancel() is a no-op on done tasks, but should_retry was flagged anyway, so call_soon kept re-queuing forever. Observed live: three zombie-scopes in one process producing ~55k epoll_pwait/sec combined, 100% CPU on MainThread, load 1.6, 60°C. Confirmed via py-spy (stack parked on lines 91/95/97/98 of the patch) and a gc-scan dump of CancelScope objects (three active _cancel_handle scopes, each with a single done=True task). Fix: add `if task.done(): continue` at the top of the loop. Also add regression test that reproduces the zombie-scope shape with a stubbed done task and asserts no cancel() call, no retry, and no pending _cancel_handle.

pufit

Reviewed this carefully and verified the claims against the installed anyio (note: env actually has 4.14.0, not 4.13.0 — more on that below). The bug is real, but I don't think this fix is safe to merge as-is — it breaks legitimate anyio cancellation in at least three reproducible ways.

The bug is real ✅

Confirmed the core claim by reading the actual CancelScope._deliver_cancellation source: should_retry = True is set unconditionally for every task in self._tasks, before checking whether a cancel can actually land:

for task in self._tasks:
    should_retry = True          # ← set for ANY task, deliverable or not
    if task._must_cancel: continue
    if task is not current and (...): ...

So a task that lingers in _tasks but can't be cancelled (stuck on uncancellable I/O, already _must_cancel, or done) re-arms call_soon(self._deliver_cancellation) every tick → one core pinned at ~100%. Reproduced on stock anyio: a non-deliverable task re-arms the reschedule on 6/6 passes (infinite spin); with the patch, 0/6 (stops). It's also a known upstream issue hitting this exact stack — agronholm/anyio#695, anthropics/claude-agent-sdk-python#378 — and _safe_disconnect() already documents it. So the motivation is sound and the patch does stop the spin.

…but the fix breaks correct cancellation ❌

Ran the patched _deliver_cancellation against stock anyio 4.14.0 on a functional suite. Stock: 12/12 correct. Patched: 7/12. Three regressions in textbook patterns:

Pattern	Stock anyio	With this patch
`with CancelScope() as s: s.cancel(); await sleep(5)`	cancelled in ~0.000s	sleeps the full 5s, runs code that should be unreachable
Task cancels its own task-group scope, then awaits	cancelled in ~0.002s	sleeps the full 5s
Task swallows one `CancelledError` (re-delivery)	re-delivered immediately	stops after the 1st delivery; took 15s

Minimal repro for the first one:

import anyio
async def main():
    with anyio.CancelScope() as s:
        s.cancel()
        await anyio.sleep(5)      # stock: returns instantly; patched: sleeps 5s
        print("THIS SHOULD BE UNREACHABLE")   # patched prints this
anyio.run(main)

Root cause of the regressions: the busy-loop and legitimate delivery share the same mechanism. When a task cancels its own scope synchronously, anyio can't cancel the running task in place — it relies on the call_soon reschedule to deliver on the next tick (when current_task() is None). This patch refuses to reschedule unless it called task.cancel() on the current pass, and skips the current task entirely, so that deferred self-delivery never happens. Same for the swallow case: anyio's contract is to keep re-delivering until the scope exits; setting should_retry=False on _must_cancel/non-cancel passes strands a task that swallowed the first CancelledError.

Why the test suite passes anyway

The 8 tests pass but don't cover the broken paths. They exercise hand-built scopes plus three "still works" cases (fail_after, move_on_after, task-group cancel) — all of which cancel from a timer callback where current_task() is None, i.e. the one path this patch doesn't affect. The self-cancel and re-delivery patterns above aren't tested, which is exactly where the regression lives.

Other concerns

Blast radius. Applied globally from nerve/__init__.py, so it patches a core primitive for the whole process — FastAPI/Starlette, httpx, and the SDK all rely on anyio cancel scopes. A 100% CPU spin is at least visible; a silently swallowed cancellation/hang is much harder to diagnose.
Version drift. PR targets 4.13.0; env has 4.14.0 (transitive, unpinned — not in pyproject.toml). 4.14.0 already ships a #695 shield mitigation in TaskGroup.__aexit__, and the docstring's line numbers (582–616) won't match. Minor: a finished coroutine is CORO_CLOSED, so _task_started() returns False for a done task (the spin still happens, because should_retry is set at the top regardless — but the stated reasoning is slightly off).
Reachability of the "done task" shape. task_done is a done-callback that removes finished tasks from _tasks within ~1 tick, so that shape self-heals. The genuinely reachable production shape is the stuck _must_cancel task that _safe_disconnect already describes.

Suggested direction

The upstream-blessed fix is a timeout on SDK teardown, not monkeypatching the primitive:

Bump claude-agent-sdk to the release that fixed #378 (wraps Query.close()'s task-group cleanup in anyio.fail_after(5.0)).
_safe_disconnect already wraps disconnect() in asyncio.wait_for(...) — extend that pattern to any other client-close paths instead of patching anyio globally.
If a stopgap monkeypatch is still wanted, it must preserve deferred self-delivery (keep rescheduling while the current task is the only deliverable target) and add tests for the self-cancel + re-delivery patterns above. As written, it doesn't.

Generated by Nerve

Previous patch skipped three task categories — done, current, _must_cancel — and only set should_retry=True on actual cancel delivery. Two of those skips break legitimate anyio cancellation: * Skipping current_task strands deferred self-delivery. The pattern `with CancelScope() as s: s.cancel(); await sleep(N)` relies on a call_soon reschedule firing on the next tick (when current_task() is None) to actually cancel the host task. The wider patch refused to reschedule from the synchronous pass and the await ran to completion. * Skipping _must_cancel strands re-delivery. anyio's contract is to keep redelivering until the scope exits; tasks that swallow the first CancelledError need a second one. The wider patch dropped should_retry on the _must_cancel pass and the second delivery never came (Artem's review on ClickHouse#128). The done-task skip is the only deviation the production observation actually requires (April 24: three zombie-scopes, 55k epoll_pwait/sec). Narrow the patch to just that — every other branch is byte-for-byte upstream anyio 4.13.0 — so deferred self-delivery and re-delivery behave exactly like upstream. Tests: * Drop the two stub tests that enshrined the broken behavior (current task → no retry; _must_cancel → no retry). * Add three real-anyio regression tests covering the patterns Artem identified: self-cancel+await, task cancels own TG scope+await, swallowed CancelledError re-delivery. * Add a mixed done+live stub test asserting the done-skip doesn't short-circuit re-delivery for other tasks in the same scope. 10/10 anyio tests pass. Full suite: 1300 passed, 2 skipped, 0 failed. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

constkolesnyak · 2026-06-25T13:17:09Z

Superseded by #152 — addresses the review feedback by narrowing the monkeypatch to the single if task.done(): continue deviation from upstream anyio 4.13.0. The wider current_task and _must_cancel skips here were broken (they killed deferred self-delivery and re-delivery on swallowed CancelledError), and #152 adds explicit regression tests for both patterns.

constkolesnyak added 3 commits June 21, 2026 00:19

pufit reviewed Jun 23, 2026

View reviewed changes

constkolesnyak mentioned this pull request Jun 25, 2026

Fix anyio patch: narrow to zombie-scope skip (preserve self-delivery + re-delivery) constkolesnyak/nerve#13

Merged

constkolesnyak mentioned this pull request Jun 25, 2026

Fix anyio 4.13 CPU hot-loop: narrow zombie-scope patch #152

Open

constkolesnyak closed this Jun 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation#128

Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation#128
constkolesnyak wants to merge 3 commits into
ClickHouse:mainfrom
constkolesnyak:upstream/anyio-cpu-fix

constkolesnyak commented Jun 20, 2026

Uh oh!

pufit left a comment

Uh oh!

constkolesnyak commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

constkolesnyak commented Jun 20, 2026

Problem

Fix

Tests

Files

Uh oh!

pufit left a comment

Choose a reason for hiding this comment

The bug is real ✅

…but the fix breaks correct cancellation ❌

Why the test suite passes anyway

Other concerns

Suggested direction

Uh oh!

constkolesnyak commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants