Skip to content

Fix anyio patch: narrow to zombie-scope skip (preserve self-delivery + re-delivery)#13

Merged
constkolesnyak merged 1 commit into
mainfrom
fix/anyio-narrow-zombie-skip
Jun 25, 2026
Merged

Fix anyio patch: narrow to zombie-scope skip (preserve self-delivery + re-delivery)#13
constkolesnyak merged 1 commit into
mainfrom
fix/anyio-narrow-zombie-skip

Conversation

@constkolesnyak

Copy link
Copy Markdown
Owner

Summary

Addresses the review feedback on ClickHouse/nerve#128. The previous monkeypatch was too aggressive — it skipped three task categories (done, current, _must_cancel) and only set should_retry=True when it actually called task.cancel(). Two of those skips broke legitimate anyio cancellation:

  • Deferred self-delivery: with anyio.CancelScope() as s: s.cancel(); await sleep(N) ran the full sleep instead of cancelling immediately. s.cancel() calls _deliver_cancellation synchronously while current_task() points at the host task; anyio relies on a call_soon reschedule firing on the next tick (when current_task() is None) to land the cancel. Skipping the current task without should_retry=True killed that reschedule.

  • Re-delivery after swallowed CancelledError: anyio's contract is to keep redelivering until the scope exits. Skipping _must_cancel tasks without should_retry=True stranded any task that caught the first CancelledError and looped.

Fix

nerve/_anyio_patch.py is now byte-for-byte identical to upstream anyio 4.13.0 CancelScope._deliver_cancellation except for one if task.done(): continue at the top of the loop body, before should_retry = True is set. That's the only deviation the production observation actually requires — the April 24 zombie-scope shape (three simultaneous scopes, ~55k epoll_pwait/sec combined, 100% CPU on one core, each containing a single done task).

Module docstring rewritten to explain the narrower scope and why the wider skips were reverted.

Tests

tests/test_anyio_patch.py:

  • Dropped two stub tests (test_no_hot_loop_when_only_task_is_current, test_no_hot_loop_when_current_task_has_must_cancel) — they enshrined the now-removed broken behavior.
  • Added three real-anyio regression tests, all of which would fail on the previous patch:
    • test_self_cancel_then_await_sleep_cancels_immediatelywith CancelScope() as s: s.cancel(); await sleep(5) returns in <0.5s.
    • test_task_cancels_own_taskgroup_scope_then_awaits — same shape via TaskGroup; awaited sleep must not run to completion.
    • test_swallowed_cancelled_error_is_redelivered — task catches first CancelledError, loops, gets redelivered (verified ≥3 deliveries inside fail_after(2.0)).
  • Added test_live_task_still_reschedules_alongside_done_task — a mixed done + _must_cancel-flagged stub scope must still reschedule (proves the done-skip doesn't short-circuit re-delivery for other tasks).
  • Kept the zombie-scope stub test (test_no_hot_loop_when_only_task_is_done) and the three "upstream behavior still works" tests (fail_after, move_on_after, task-group cancel).

Verification

  • pytest tests/test_anyio_patch.py -v — 10/10 pass.
  • pytest -q — 1300 passed, 2 skipped, 0 failed.

Notes for the upstream PR

Phase 1: this PR opens against the fork (constkolesnyak/nerve:main) for review. Phase 2 (separate) will push the same change to upstream/anyio-cpu-fix and update ClickHouse/nerve#128.

🤖 Generated with Claude Code

Previous patch skipped three task categories — done, current,
_must_cancel — and only set should_retry=True on actual cancel
delivery. Two of those skips break legitimate anyio cancellation:

* Skipping current_task strands deferred self-delivery. The pattern
  `with CancelScope() as s: s.cancel(); await sleep(N)` relies on a
  call_soon reschedule firing on the next tick (when current_task() is
  None) to actually cancel the host task. The wider patch refused to
  reschedule from the synchronous pass and the await ran to completion.

* Skipping _must_cancel strands re-delivery. anyio's contract is to
  keep redelivering until the scope exits; tasks that swallow the
  first CancelledError need a second one. The wider patch dropped
  should_retry on the _must_cancel pass and the second delivery
  never came (Artem's review on ClickHouse#128).

The done-task skip is the only deviation the production observation
actually requires (April 24: three zombie-scopes, 55k epoll_pwait/sec).
Narrow the patch to just that — every other branch is byte-for-byte
upstream anyio 4.13.0 — so deferred self-delivery and re-delivery
behave exactly like upstream.

Tests:
* Drop the two stub tests that enshrined the broken behavior
  (current task → no retry; _must_cancel → no retry).
* Add three real-anyio regression tests covering the patterns Artem
  identified: self-cancel+await, task cancels own TG scope+await,
  swallowed CancelledError re-delivery.
* Add a mixed done+live stub test asserting the done-skip doesn't
  short-circuit re-delivery for other tasks in the same scope.

10/10 anyio tests pass. Full suite: 1300 passed, 2 skipped, 0 failed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@constkolesnyak constkolesnyak merged commit 5396361 into main Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant