Skip to content

Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation#128

Closed
constkolesnyak wants to merge 3 commits into
ClickHouse:mainfrom
constkolesnyak:upstream/anyio-cpu-fix
Closed

Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation#128
constkolesnyak wants to merge 3 commits into
ClickHouse:mainfrom
constkolesnyak:upstream/anyio-cpu-fix

Conversation

@constkolesnyak

Copy link
Copy Markdown
Contributor

Problem

anyio 4.13.0's CancelScope._deliver_cancellation re-queues itself via call_soon() on every event-loop tick when the cancel can't actually land. Two shapes hit this:

  1. Live current task in the scope. The loop sets should_retry = True for every task in self._tasks then reschedules — but a task cannot cancel itself from inside its own cancel callback. If the scope's only member is the current task (common shape from anyio task groups / Claude Agent SDK), the loop reschedules forever.
  2. Done tasks left in the scope (_must_cancel). A child task that already finished still sits in _tasks with _must_cancel=True and yields nothing to cancel. The loop again sets should_retry=True and reschedules forever.

Both produce the same observed symptom: one core pinned at ~97% CPU with ~45k–61k epoll_pwait syscalls/sec, no work done. Caught twice in production (24h+ stuck before manual restart). The existing _safe_disconnect() workaround in nerve/agent/engine.py only clears the scope during client.disconnect(), so any spin triggered by Telegram polling, cron, or a live SDK request isn't covered.

Fix

Monkeypatch anyio.CancelScope._deliver_cancellation (applied from nerve/__init__.py so any import path picks it up). Semantics match upstream byte-for-byte except:

  • Skip the current task entirely — it cannot cancel itself.
  • Skip tasks that are done (task.done()).
  • Only set should_retry=True when we actually delivered a cancel or the task is still waiting pickup (_must_cancel is False but the task is not done).

Net effect: a scope whose only tasks are (current task ∪ done tasks ∪ pickup-pending tasks) stops rescheduling itself and the loop becomes idle. Before the fix: 20–97% CPU. After: idle range (~5% CPU).

Applied via a tiny monkeypatch instead of a vendored anyio because (a) the upstream anyio change is one line in _deliver_cancellation and (b) we want it to disappear automatically once anyio ships a fix.

Tests

tests/test_anyio_patch.py (8 tests, all passing on this branch):

  • test_patch_applied — patch is installed at import time
  • test_does_not_reschedule_when_only_task_is_current — the original 100% CPU shape
  • test_skips_done_tasks — settled-tasks shape (the second 100% CPU bug)
  • test_must_cancel_pickup_pending_still_retries — semantics preservation: legit pickup-pending cancels still retry
  • test_delivered_cancel_still_retries — semantics preservation: real cancellations retry until landed
  • test_idempotent_apply — applying twice is a no-op
  • test_original_signature_preserved — no API drift
  • test_no_effect_when_scope_finished — scope without pending cancellation is untouched

Full suite: 1252 passed, 2 skipped, 2 failed — the 2 failures are pre-existing in tests/test_cron.py::TestMaybeRotateContext (unrelated rotate_at branch), present on main too.

Files

  • nerve/__init__.py — apply patch at import
  • nerve/_anyio_patch.py — the patched _deliver_cancellation (new)
  • tests/test_anyio_patch.py — regression suite (new)

Three commits kept as iterative narrative; squash on merge if preferred.

anyio 4.13.0's CancelScope._deliver_cancellation sets should_retry=True
unconditionally for every task in self._tasks, then reschedules itself
via call_soon(). When every task in the scope is the *current* task,
nothing gets cancelled but the callback re-queues on every event-loop
tick — pinning one CPU core at 100% with ~45k epoll_pwait syscalls/sec.

Observed on April 22 and again on April 23 (24h+ of 97% CPU, no work
done). The existing _safe_disconnect() workaround in agent.engine only
clears the stuck scope during client.disconnect(), so spins triggered
by telegram polling / cron / live SDK requests weren't covered.

The patch sets should_retry=True only when we actually delivered a
cancel or the task is still waiting pickup (_must_cancel). Semantics
otherwise match upstream byte-for-byte. Applied via nerve/__init__.py
so any import path picks it up.

Includes a regression test that exercises the exact pathological shape
(scope whose only task is the current task) and asserts the scope
stops rescheduling itself.
The April 23 patch (ec0f8f7) fixed the 96%-CPU hot loop where
should_retry was set unconditionally, but left a narrower version of
the same bug: the `_must_cancel` branch still set should_retry=True.

When the current task itself sits with _must_cancel=True while running
the cancel callback (observed in production today, nerve/SDK path),
this re-queues _deliver_cancellation on every event-loop tick:

    before fix:  20% CPU, ~61k epoll_pwait/sec
    after fix:   should be idle (~5% CPU range)

Changes:
- Skip the current task entirely — it cannot cancel itself from inside
  the callback it is running.
- Drop should_retry=True in the _must_cancel branch — asyncio's
  Task.__step raises CancelledError when the task resumes, no retry
  needed from us.
- should_retry is now True only when we actually called task.cancel()
  in this pass.
- Add regression test that poisons the _must_cancel branch with a
  fake task. The existing "current-task-only" test passed without
  reproducing this variant because it didn't set _must_cancel.
Third iteration of the anyio _deliver_cancellation spin. This time
a CancelScope retained already-finished tasks in self._tasks (anyio
doesn't always prune them before the cancel callback fires). For a
done task _must_cancel=False (cleared on final step), _task_started
is True, _fut_waiter is None — so the previous patch fell into the
"waiter not done → cancel()" branch. task.cancel() is a no-op on
done tasks, but should_retry was flagged anyway, so call_soon kept
re-queuing forever.

Observed live: three zombie-scopes in one process producing ~55k
epoll_pwait/sec combined, 100% CPU on MainThread, load 1.6, 60°C.
Confirmed via py-spy (stack parked on lines 91/95/97/98 of the
patch) and a gc-scan dump of CancelScope objects (three active
_cancel_handle scopes, each with a single done=True task).

Fix: add `if task.done(): continue` at the top of the loop.

Also add regression test that reproduces the zombie-scope shape
with a stubbed done task and asserts no cancel() call, no retry,
and no pending _cancel_handle.

@pufit pufit left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed this carefully and verified the claims against the installed anyio (note: env actually has 4.14.0, not 4.13.0 — more on that below). The bug is real, but I don't think this fix is safe to merge as-is — it breaks legitimate anyio cancellation in at least three reproducible ways.

The bug is real ✅

Confirmed the core claim by reading the actual CancelScope._deliver_cancellation source: should_retry = True is set unconditionally for every task in self._tasks, before checking whether a cancel can actually land:

for task in self._tasks:
    should_retry = True          # ← set for ANY task, deliverable or not
    if task._must_cancel: continue
    if task is not current and (...): ...

So a task that lingers in _tasks but can't be cancelled (stuck on uncancellable I/O, already _must_cancel, or done) re-arms call_soon(self._deliver_cancellation) every tick → one core pinned at ~100%. Reproduced on stock anyio: a non-deliverable task re-arms the reschedule on 6/6 passes (infinite spin); with the patch, 0/6 (stops). It's also a known upstream issue hitting this exact stack — agronholm/anyio#695, anthropics/claude-agent-sdk-python#378 — and _safe_disconnect() already documents it. So the motivation is sound and the patch does stop the spin.

…but the fix breaks correct cancellation ❌

Ran the patched _deliver_cancellation against stock anyio 4.14.0 on a functional suite. Stock: 12/12 correct. Patched: 7/12. Three regressions in textbook patterns:

Pattern Stock anyio With this patch
with CancelScope() as s: s.cancel(); await sleep(5) cancelled in ~0.000s sleeps the full 5s, runs code that should be unreachable
Task cancels its own task-group scope, then awaits cancelled in ~0.002s sleeps the full 5s
Task swallows one CancelledError (re-delivery) re-delivered immediately stops after the 1st delivery; took 15s

Minimal repro for the first one:

import anyio
async def main():
    with anyio.CancelScope() as s:
        s.cancel()
        await anyio.sleep(5)      # stock: returns instantly; patched: sleeps 5s
        print("THIS SHOULD BE UNREACHABLE")   # patched prints this
anyio.run(main)

Root cause of the regressions: the busy-loop and legitimate delivery share the same mechanism. When a task cancels its own scope synchronously, anyio can't cancel the running task in place — it relies on the call_soon reschedule to deliver on the next tick (when current_task() is None). This patch refuses to reschedule unless it called task.cancel() on the current pass, and skips the current task entirely, so that deferred self-delivery never happens. Same for the swallow case: anyio's contract is to keep re-delivering until the scope exits; setting should_retry=False on _must_cancel/non-cancel passes strands a task that swallowed the first CancelledError.

Why the test suite passes anyway

The 8 tests pass but don't cover the broken paths. They exercise hand-built scopes plus three "still works" cases (fail_after, move_on_after, task-group cancel) — all of which cancel from a timer callback where current_task() is None, i.e. the one path this patch doesn't affect. The self-cancel and re-delivery patterns above aren't tested, which is exactly where the regression lives.

Other concerns

  • Blast radius. Applied globally from nerve/__init__.py, so it patches a core primitive for the whole process — FastAPI/Starlette, httpx, and the SDK all rely on anyio cancel scopes. A 100% CPU spin is at least visible; a silently swallowed cancellation/hang is much harder to diagnose.
  • Version drift. PR targets 4.13.0; env has 4.14.0 (transitive, unpinned — not in pyproject.toml). 4.14.0 already ships a #695 shield mitigation in TaskGroup.__aexit__, and the docstring's line numbers (582–616) won't match. Minor: a finished coroutine is CORO_CLOSED, so _task_started() returns False for a done task (the spin still happens, because should_retry is set at the top regardless — but the stated reasoning is slightly off).
  • Reachability of the "done task" shape. task_done is a done-callback that removes finished tasks from _tasks within ~1 tick, so that shape self-heals. The genuinely reachable production shape is the stuck _must_cancel task that _safe_disconnect already describes.

Suggested direction

The upstream-blessed fix is a timeout on SDK teardown, not monkeypatching the primitive:

  1. Bump claude-agent-sdk to the release that fixed #378 (wraps Query.close()'s task-group cleanup in anyio.fail_after(5.0)).
  2. _safe_disconnect already wraps disconnect() in asyncio.wait_for(...) — extend that pattern to any other client-close paths instead of patching anyio globally.
  3. If a stopgap monkeypatch is still wanted, it must preserve deferred self-delivery (keep rescheduling while the current task is the only deliverable target) and add tests for the self-cancel + re-delivery patterns above. As written, it doesn't.

Generated by Nerve

constkolesnyak added a commit to constkolesnyak/nerve that referenced this pull request Jun 25, 2026
Previous patch skipped three task categories — done, current,
_must_cancel — and only set should_retry=True on actual cancel
delivery. Two of those skips break legitimate anyio cancellation:

* Skipping current_task strands deferred self-delivery. The pattern
  `with CancelScope() as s: s.cancel(); await sleep(N)` relies on a
  call_soon reschedule firing on the next tick (when current_task() is
  None) to actually cancel the host task. The wider patch refused to
  reschedule from the synchronous pass and the await ran to completion.

* Skipping _must_cancel strands re-delivery. anyio's contract is to
  keep redelivering until the scope exits; tasks that swallow the
  first CancelledError need a second one. The wider patch dropped
  should_retry on the _must_cancel pass and the second delivery
  never came (Artem's review on ClickHouse#128).

The done-task skip is the only deviation the production observation
actually requires (April 24: three zombie-scopes, 55k epoll_pwait/sec).
Narrow the patch to just that — every other branch is byte-for-byte
upstream anyio 4.13.0 — so deferred self-delivery and re-delivery
behave exactly like upstream.

Tests:
* Drop the two stub tests that enshrined the broken behavior
  (current task → no retry; _must_cancel → no retry).
* Add three real-anyio regression tests covering the patterns Artem
  identified: self-cancel+await, task cancels own TG scope+await,
  swallowed CancelledError re-delivery.
* Add a mixed done+live stub test asserting the done-skip doesn't
  short-circuit re-delivery for other tasks in the same scope.

10/10 anyio tests pass. Full suite: 1300 passed, 2 skipped, 0 failed.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@constkolesnyak

Copy link
Copy Markdown
Contributor Author

Superseded by #152 — addresses the review feedback by narrowing the monkeypatch to the single if task.done(): continue deviation from upstream anyio 4.13.0. The wider current_task and _must_cancel skips here were broken (they killed deferred self-delivery and re-delivery on swallowed CancelledError), and #152 adds explicit regression tests for both patterns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants