
Fix anyio 4.13 CPU hot-loop: narrow zombie-scope patch #152

Open
constkolesnyak wants to merge 1 commit into ClickHouse:main from constkolesnyak:upstream/anyio-narrow-fix-v2

Conversation

@constkolesnyak

Contributor

Problem

anyio.CancelScope._deliver_cancellation sets should_retry = True for every task in self._tasks before checking whether a cancel can actually be delivered. When a task whose done() returns True lingers in _tasks (the upstream task_done callback doesn't always prune it before the scope's cancel callback fires), task.cancel() is a no-op but should_retry stays True, so the scope re-arms call_soon(_deliver_cancellation) on every event-loop tick.

Result: one CPU core pinned at ~100% with tens of thousands of epoll_pwait syscalls per second and no forward progress.
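
The retry mechanics described above can be reproduced in a toy model. The sketch below is not anyio's code: `ToyScope`, `deliver`, and the `skip_done` flag are hypothetical names, used only to illustrate how a single finished task can keep a scope rescheduling itself on every tick, and how skipping done tasks lets it go quiet.

```python
import asyncio


class ToyScope:
    """Hypothetical stand-in for a cancel scope; not anyio's class."""

    def __init__(self, tasks, skip_done):
        self.tasks = set(tasks)
        self.skip_done = skip_done
        self.deliveries = 0
        self.handle = None

    def deliver(self):
        self.deliveries += 1
        should_retry = False
        for task in self.tasks:
            if self.skip_done and task.done():
                continue  # the narrow skip: a done task never asks for a retry
            should_retry = True
            task.cancel()  # no-op (returns False) when the task is already done
        if should_retry:
            # Re-arm for the next event-loop iteration.
            self.handle = asyncio.get_running_loop().call_soon(self.deliver)
        else:
            self.handle = None
        return should_retry


async def count_deliveries(skip_done, ticks=50):
    async def noop():
        return None

    task = asyncio.create_task(noop())
    await task  # the task is now done but still tracked by the scope

    scope = ToyScope([task], skip_done)
    scope.deliver()
    for _ in range(ticks):
        await asyncio.sleep(0)  # let the loop run one iteration
    if scope.handle is not None:
        scope.handle.cancel()  # stop the spin so the loop can shut down
    return scope.deliveries


if __name__ == "__main__":
    print("without skip:", asyncio.run(count_deliveries(skip_done=False)))
    print("with skip:   ", asyncio.run(count_deliveries(skip_done=True)))
```

Without the skip, the delivery count grows with the number of loop iterations; with it, the toy scope delivers once and stops.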

Observed live in production on 2026-04-24: three simultaneous zombie-scopes in one nerve process (scope IDs 0x7ffec1774f50, 0x7ffec17ae090, 0x7ffec17ad6d0), ~55k epoll_pwait/sec combined, 100% CPU on one core, load 1.6, cpu-thermal 60°C. Each scope held a single done=True task. Diagnosed via py-spy dump and a GC scan of CancelScope instances.
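
A GC scan of the kind mentioned above could look roughly like the following. This is a sketch under assumptions, not the script used in the incident: `ToyScope` and `find_zombie_scopes` are hypothetical, and in a real process one would pass the backend's cancel-scope class instead of the toy one. Only the `_tasks` attribute name comes from the description above.

```python
import asyncio
import gc


class ToyScope:
    """Hypothetical stand-in for a cancel scope; only `_tasks` matters here."""

    def __init__(self, tasks):
        self._tasks = set(tasks)


def find_zombie_scopes(scope_type):
    """Return live instances of scope_type whose tracked tasks are all done."""
    zombies = []
    for obj in gc.get_objects():
        if isinstance(obj, scope_type):
            tasks = getattr(obj, "_tasks", None)
            # A non-empty task set containing only finished tasks is suspicious.
            if tasks and all(t.done() for t in tasks):
                zombies.append(obj)
    return zombies


async def demo():
    async def noop():
        return None

    finished = asyncio.create_task(noop())
    await finished
    pending = asyncio.create_task(asyncio.sleep(60))

    zombie = ToyScope([finished])   # holds a single done task
    healthy = ToyScope([pending])   # holds a task that is still pending

    found = find_zombie_scopes(ToyScope)

    # Clean up the pending task before the loop closes.
    pending.cancel()
    try:
        await pending
    except asyncio.CancelledError:
        pass
    return [hex(id(s)) for s in found], found == [zombie]


if __name__ == "__main__":
    ids, only_zombie = asyncio.run(demo())
    print(ids, only_zombie)
```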

The existing _safe_disconnect() workaround in nerve/agent/engine.py only clears the scope during client.disconnect(), so spins triggered elsewhere (Telegram polling, cron, an active SDK request) aren't covered.

Fix

Monkey-patch anyio.CancelScope._deliver_cancellation (applied from nerve/__init__.py so any import path picks it up). The patch is as narrow as possible — it adds a single if task.done(): continue skip at the top of the loop body, before should_retry = True is touched. Every other branch is byte-for-byte identical to upstream anyio 4.13.0.
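
The wiring side (install once at import time, safe to apply twice) can be sketched as below. Everything here is illustrative: `Upstream`, `apply_patch`, and the `_narrow_patch_applied` marker are hypothetical names, and the sketch wraps the original method with a filter, which is a simplification. The actual patch re-implements the method with one added line.

```python
import functools
from concurrent.futures import Future


class Upstream:
    """Hypothetical stand-in for the third-party class being patched."""

    def deliver(self, tasks):
        # Toy "upstream" behavior: ask for a retry whenever any task is tracked.
        return len(tasks) > 0


_PATCH_MARKER = "_narrow_patch_applied"  # hypothetical sentinel attribute


def apply_patch(cls=Upstream):
    """Install the patched method once; later calls are no-ops."""
    original = cls.deliver
    if getattr(original, _PATCH_MARKER, False):
        return False  # already patched, nothing to do

    @functools.wraps(original)
    def patched(self, tasks):
        # Drop finished tasks before they can influence the retry decision.
        live = [t for t in tasks if not t.done()]
        return original(self, live)

    setattr(patched, _PATCH_MARKER, True)
    cls.deliver = patched
    return True


# Usage: the first call installs the patch, the second detects the marker.
done = Future()
done.set_result(None)
live = Future()

first = apply_patch()
second = apply_patch()
only_done_retries = Upstream().deliver([done])
mixed_retries = Upstream().deliver([done, live])
```

Marking the replacement function itself, rather than keeping a module-level flag, keeps the check correct even if the module is imported through more than one path.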

This intentional narrowness preserves anyio's contracts that a wider patch would break:

  • Deferred self-delivery: with anyio.CancelScope() as s: s.cancel(); await sleep(N). s.cancel() calls _deliver_cancellation synchronously while current_task() points at the host task; anyio relies on the call_soon reschedule firing on the next tick (when current_task() is None) to land the cancel. Skipping the current task without should_retry=True strands the cancel — the sleep runs to completion.

  • Re-delivery after swallowed CancelledError: anyio's contract is to keep redelivering until the scope exits. Skipping _must_cancel tasks without should_retry=True strands tasks that catch the first CancelledError and loop.
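
The re-delivery contract can be illustrated with plain asyncio. The sketch below is a toy, not anyio's mechanism: `stubborn` and `redeliver_until_done` are hypothetical helpers that show why a cancel has to be delivered repeatedly when a task catches CancelledError and keeps looping.

```python
import asyncio


async def stubborn(log):
    """Swallow the first two cancellations, honor the third."""
    while True:
        try:
            await asyncio.sleep(10)
        except asyncio.CancelledError:
            log.append("cancelled")
            if len(log) >= 3:
                raise  # finally let the cancellation propagate


async def redeliver_until_done(task, max_rounds=100):
    """Keep cancelling the task, one attempt per loop iteration, until it ends."""
    rounds = 0
    while not task.done() and rounds < max_rounds:
        task.cancel()
        rounds += 1
        await asyncio.sleep(0)  # give the task a chance to react
    return rounds


async def main():
    log = []
    task = asyncio.create_task(stubborn(log))
    await asyncio.sleep(0)  # let the task reach its first await
    rounds = await redeliver_until_done(task)
    return log, task.cancelled(), rounds


if __name__ == "__main__":
    print(asyncio.run(main()))
```

A single delivery would leave `stubborn` sleeping again after the first swallowed CancelledError, which is the stranding the narrow patch is meant to avoid.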

A previous wider version of this patch (skipping current_task and _must_cancel in addition to done tasks) was reverted because it broke both contracts. See the module docstring for the full writeup.

Tests

tests/test_anyio_patch.py — 10 tests, all passing:

Wiring (2):

  • test_patch_is_applied — patch is installed at import time
  • test_apply_is_idempotent — applying twice is a no-op

Upstream behavior still works (3):

  • test_fail_after_still_works
  • test_move_on_after_still_works
  • test_task_group_cancellation_still_works

Anyio cancellation contracts preserved (3 — these would fail on a wider patch):

  • test_self_cancel_then_await_sleep_cancels_immediately: s.cancel(); await sleep(5) returns in <0.5s via deferred self-delivery
  • test_task_cancels_own_taskgroup_scope_then_awaits — same shape via TaskGroup
  • test_swallowed_cancelled_error_is_redelivered — task catches first CancelledError, loops, gets redelivered (≥3 deliveries within fail_after(2.0))

The zombie-scope regression this patch is for (2):

  • test_no_hot_loop_when_only_task_is_done — stub scope with a single done task does not reschedule
  • test_live_task_still_reschedules_alongside_done_task — mixed done + live scope still reschedules for the live task (proves the done-skip doesn't short-circuit re-delivery for others)
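
The two regression checks can be pictured with stub tasks. This is a minimal sketch under assumptions, not the contents of tests/test_anyio_patch.py: `StubTask` and `deliver_once` are hypothetical and model only the retry decision, not anyio's full method.

```python
class StubTask:
    """Hypothetical task stub that records how often cancel() is called."""

    def __init__(self, done):
        self._done = done
        self.cancel_calls = 0

    def done(self):
        return self._done

    def cancel(self):
        self.cancel_calls += 1
        return not self._done


def deliver_once(tasks):
    """Toy retry decision with the done-skip placed before should_retry is set."""
    should_retry = False
    for task in tasks:
        if task.done():
            continue
        should_retry = True
        task.cancel()
    return should_retry


def check_only_done_task_does_not_reschedule():
    finished = StubTask(done=True)
    assert deliver_once([finished]) is False
    assert finished.cancel_calls == 0


def check_live_task_still_reschedules_alongside_done_task():
    finished = StubTask(done=True)
    live = StubTask(done=False)
    assert deliver_once([finished, live]) is True
    assert live.cancel_calls == 1
    assert finished.cancel_calls == 0


if __name__ == "__main__":
    check_only_done_task_does_not_reschedule()
    check_live_task_still_reschedules_alongside_done_task()
    print("ok")
```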

Files

  • nerve/_anyio_patch.py — the narrow patched _deliver_cancellation (new)
  • nerve/__init__.py — import the patch at module load
  • tests/test_anyio_patch.py — regression suite (new)

History

Supersedes #128 — addresses @pufit's review feedback there. The previous PR's monkeypatch also skipped current_task and _must_cancel without setting should_retry, which broke the two anyio contracts above. This version narrows the patch to the single deviation production observation actually requires and adds tests for both regression patterns.

🤖 Generated with Claude Code

Adds a monkey-patch of anyio.CancelScope._deliver_cancellation that
fixes a 100% CPU spin observed in production when a done task lingers
in CancelScope._tasks longer than upstream's task_done callback takes
to prune it.

Bug: upstream sets should_retry=True for every task in _tasks before
checking whether a cancel can actually be delivered. For a done task,
task.cancel() is a no-op but should_retry stays True → the scope
re-arms call_soon(_deliver_cancellation) on every event-loop tick.
Observed live (April 24, 2026): three simultaneous zombie-scopes in
one process, ~55k epoll_pwait/sec combined, 100% CPU on one core,
each scope holding a single done task.

Fix: insert a single `if task.done(): continue` at the top of the
loop body, before should_retry is touched. Every other branch is
byte-for-byte upstream anyio 4.13.0 — deferred self-delivery
(s.cancel(); await sleep(N)) and re-delivery after a swallowed
CancelledError both behave exactly like upstream.

Tests (10/10 passing):
* test_patch_is_applied / test_apply_is_idempotent — wiring
* test_fail_after_still_works / test_move_on_after_still_works /
  test_task_group_cancellation_still_works — upstream behavior
* test_self_cancel_then_await_sleep_cancels_immediately — preserves
  anyio's deferred self-delivery contract
* test_task_cancels_own_taskgroup_scope_then_awaits — same shape
  via TaskGroup
* test_swallowed_cancelled_error_is_redelivered — preserves anyio's
  re-delivery contract
* test_no_hot_loop_when_only_task_is_done — the zombie-scope
  regression this patch is for
* test_live_task_still_reschedules_alongside_done_task — mixed
  done+live scope still reschedules for the live task

The patch is intentionally as narrow as possible. An earlier wider
version (skipping current_task and _must_cancel without setting
should_retry) was reverted because it broke deferred self-delivery
and re-delivery — see the module docstring for the writeup.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>