fix(healthcheck): update targets incrementally instead of destroy-and-rebuild by AlinsRan · Pull Request #13582 · apache/apisix

AlinsRan · 2026-06-22T02:55:37Z

Description

Fixes a cluster of active-health-check problems that share an apisix-side root cause: the health-check manager destroys and rebuilds the checker on every upstream change, even when only the upstream nodes changed and the checks config is identical.

Root cause

fetch_checker() keys the working checker by a version derived from both modifiedIndex and the nodes version (upstream_utils.version). So a node-only change bumps the version, fetch_checker() returns nil, and the resource is queued for an asynchronous rebuild. During that rebuild window:

api_ctx.up_checker is nil, so the balancer's health filtering is bypassed and traffic flows to nodes already known to be unhealthy for roughly 1-2 s (#13282, "bug: Health check state lost and checker not working after upstream node changes").
The rebuild discards the checker's accumulated health state and re-probes every node from scratch.

timer_working_pool_check also destroyed the checker for any version mismatch, racing timer_create_checker and widening the nil window.
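The lookup behaviour described above can be modelled in a few lines. This is a minimal sketch in Python rather than the Lua of the actual module, and `checker_version`, `WorkingPool`, and the `"#"` key format are hypothetical stand-ins, not APISIX code. It only illustrates why a key that mixes modifiedIndex with the nodes version misses after a node-only change.

```python
from typing import Dict, Optional, Tuple


def checker_version(modified_index: int, nodes_version: int) -> str:
    # Hypothetical stand-in: any key that mixes both values behaves the same way.
    return f"{modified_index}#{nodes_version}"


class WorkingPool:
    """Toy model of the working pool: resource path -> (version, checker)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, object]] = {}

    def put(self, path: str, version: str, checker: object) -> None:
        self._entries[path] = (version, checker)

    def fetch_checker(self, path: str, version: str) -> Optional[object]:
        entry = self._entries.get(path)
        # A version mismatch is treated exactly like "no checker at all".
        if entry is None or entry[0] != version:
            return None
        return entry[1]


pool = WorkingPool()
pool.put("/upstreams/1", checker_version(10, 1), "checker-A")

# Same config, same nodes: the checker is found.
same = pool.fetch_checker("/upstreams/1", checker_version(10, 1))

# Only the nodes version moved, yet the lookup now misses.
after_node_change = pool.fetch_checker("/upstreams/1", checker_version(10, 2))
```

In this model the miss lasts until something re-inserts an entry under the new version, which corresponds to the asynchronous rebuild window.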

Fix

Added compute_targets() / sync_checker_targets(): when the checks config is unchanged (compared with core.table.deep_eq) and only the nodes changed, reconcile the existing checker's targets in place via add_target/remove_target against the authoritative shm target list, keeping the checker and its health state alive.
timer_working_pool_check no longer destroys a checker for a node-only version change.
When a rebuild is genuinely required (the checks config changed, or no checker exists), the new checker is created and inserted into the working pool before the old one is stopped, eliminating the up_checker == nil gap.
The working-pool entry now stores the checks config so changes can be detected.
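The in-place reconciliation can be sketched as a set difference between the desired nodes and the checker's current target list. The Python below is an illustration under stated assumptions, not the PR's Lua code: `FakeChecker` only mimics the shape of the `add_target` / `remove_target` / `get_target_list` calls named above, `compute_diff` and `sync_targets` are hypothetical names, and whether removals run before additions in the real implementation is an assumption.

```python
from typing import Dict, Iterable, List, Tuple

Target = Tuple[str, int, str]  # (ip, port, hostname)


def compute_diff(desired: List[Target], current: Iterable[Target]):
    """Return (to_add, to_remove); to_add keeps the order of `desired`."""
    current_set = set(current)
    desired_set = set(desired)
    # dict.fromkeys drops duplicates while preserving node order.
    to_add = [t for t in dict.fromkeys(desired) if t not in current_set]
    to_remove = sorted(t for t in current_set if t not in desired_set)
    return to_add, to_remove


class FakeChecker:
    """Hypothetical stand-in that stores one health flag per target."""

    def __init__(self, targets: Iterable[Target]) -> None:
        self.health: Dict[Target, bool] = {t: True for t in targets}

    def get_target_list(self) -> List[Target]:
        return list(self.health)

    def add_target(self, ip: str, port: int, hostname: str) -> None:
        self.health.setdefault((ip, port, hostname), True)

    def remove_target(self, ip: str, port: int, hostname: str) -> None:
        self.health.pop((ip, port, hostname), None)


def sync_targets(checker: FakeChecker, desired: List[Target]):
    to_add, to_remove = compute_diff(desired, checker.get_target_list())
    for t in to_remove:  # assumption: removals first, then additions
        checker.remove_target(*t)
    for t in to_add:
        checker.add_target(*t)
    return to_add, to_remove


a = ("10.0.0.1", 80, "a.test")
b = ("10.0.0.2", 80, "b.test")
c = ("10.0.0.3", 80, "c.test")

checker = FakeChecker([a, b])
checker.health[b] = False  # b was already marked unhealthy

added, removed = sync_targets(checker, [b, c])
```

The point of the sketch is that `b` is never touched, so its unhealthy flag survives the node change instead of being reset by a rebuild.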

Test

t/node/healthcheck-incremental-update.t:

TEST 1: a node-only change must not destroy/rebuild the checker. Asserts that "create new checker" appears once (the initial creation) and that "clear checker" (from delayed_clear) does not appear.
TEST 2: a checks-config change must still rebuild. Asserts that "clear checker" appears.

NOTE on local verification: I could not run the prove suite in my environment because the installed OpenResty (1.21.4.4) lacks the HTTP/3/QUIC support that the current t/APISIX.pm harness unconditionally requires (listen 1994 quic, http3 on), so nginx refuses to start for any test here. The diff/reconcile algorithm used by sync_checker_targets (add/remove against get_target_list) was validated separately against the real lua-resty-healthcheck library. Please let CI, or a maintainer with a QUIC-capable OpenResty, run the new .t file to confirm it.

Cross-PR dependency

This depends on the companion library PR api7/lua-resty-healthcheck#55 (clean every checker each cleanup window + release the periodic lock when idle), which fixes the related #13385 / #13141 / #13235 root cause inside the library. The rockspec dependency is bumped to lua-resty-healthcheck-api7 = 3.3.0-0 as a placeholder; the library fix is not yet released/tagged, so this PR should not be merged until that library version is published and the version here is confirmed.

Relates to #13282, #13385, #13141, #13235.

Checklist

I have explained the need for this PR and the problem it solves
I have explained the changes or the new features added to this PR
I have added tests corresponding to this change
I have updated the documentation to reflect this change (N/A)
I have verified that this change is backward compatible (incremental update falls back to full rebuild whenever checks changes)

When an upstream's nodes change but its `checks` config is unchanged, the health-check manager destroyed the existing checker and built a new one. Two problems followed:

1. fetch_checker() keys the working checker by a version derived from both modifiedIndex and the nodes version, so a node-only change makes it return nil until the timer rebuilds the checker. During that window api_ctx.up_checker is nil and the balancer routes traffic to nodes already known to be unhealthy (apache#13282).
2. The rebuild throws away the checker's accumulated health state and re-probes every node from scratch.

The manager now reconciles the existing checker's targets in place with add_target/remove_target when the `checks` config is unchanged (compared with core.table.deep_eq), keeping the checker and its state alive. timer_working_pool_check no longer destroys a checker for a node-only version change, and when a rebuild is genuinely required (the `checks` config changed) the new checker is created and inserted into the working pool before the old one is released, so fetch_checker never observes a nil gap.

Bumps the lua-resty-healthcheck-api7 rockspec dependency to 3.3.0-0, which contains the companion library fix (clean every checker each window + release the periodic lock when idle) required by this change.

Adds t/node/healthcheck-incremental-update.t: a node-only change must not destroy/rebuild the checker (no "clear checker"), while a checks-config change still rebuilds it.

When the checks config changes, install the freshly created checker into the working pool before stopping the previous one. This prevents a request from briefly fetching a stopped checker for the old version during the swap window.
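The ordering matters because anything that reads the pool while the old checker is being stopped should already see the replacement. A minimal Python sketch of that install-then-stop order follows; `Checker`, `replace_checker`, and the `on_stop` hook are hypothetical and exist only to make the ordering observable in a single-threaded example.

```python
from typing import Callable, Dict, List, Optional, Tuple


class Checker:
    """Hypothetical checker with an optional hook that fires during stop()."""

    def __init__(self, name: str,
                 on_stop: Optional[Callable[[], None]] = None) -> None:
        self.name = name
        self.stopped = False
        self._on_stop = on_stop

    def stop(self) -> None:
        if self._on_stop is not None:
            self._on_stop()
        self.stopped = True


Pool = Dict[str, Tuple[str, Checker]]


def replace_checker(pool: Pool, path: str, version: str,
                    new_checker: Checker) -> None:
    old = pool.get(path)
    # 1. Install the new checker first, so lookups never hit a stopped one.
    pool[path] = (version, new_checker)
    # 2. Only then stop the previous checker.
    if old is not None:
        old[1].stop()


pool: Pool = {}
seen_at_stop: List[str] = []

# Record which checker the pool holds at the moment the old one stops.
old = Checker(
    "old",
    on_stop=lambda: seen_at_stop.append(pool["/upstreams/1"][1].name),
)
pool["/upstreams/1"] = ("v1", old)

replace_checker(pool, "/upstreams/1", "v2", Checker("new"))
```

With the two steps reversed, the hook would record `"old"`, which is the stopped-checker window this commit closes.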

…rop to 0

- compute_targets now returns an ordered array (preserving node order) so targets are added deterministically, keeping ordered error-log assertions stable.
- Only keep/reuse a checker incrementally when the upstream still has nodes; when the node count drops to 0 the checker is destroyed as before.
- Update existing healthcheck tests to assert the new incremental-reuse behaviour (a checker is reused instead of recreated when only the nodes change, and is not cleared in that case).
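Taken together, the commits describe a small decision table for what happens on an upstream change. The Python sketch below is one reading of that table, not the PR's code: `decide` and its return strings are hypothetical, Python's `==` on nested dicts stands in for core.table.deep_eq, and `existing_checks=None` is used here to mean "no checker in the working pool".

```python
import copy
from typing import Optional


def decide(existing_checks: Optional[dict], new_checks: dict,
           node_count: int) -> str:
    """Pick an action for an upstream change (illustrative only)."""
    if node_count == 0:
        # No nodes left: the checker is destroyed, as before the change.
        return "destroy"
    if existing_checks is None:
        # Nothing to reuse, so a checker has to be created.
        return "create"
    if existing_checks == new_checks:
        # Same checks config: keep the checker and reconcile its targets.
        return "sync_targets"
    # The checks config changed: a full rebuild is still required.
    return "rebuild"


checks = {"active": {"type": "http", "healthy": {"interval": 1}}}
changed = copy.deepcopy(checks)
changed["active"]["healthy"]["interval"] = 5

results = [
    decide(checks, copy.deepcopy(checks), 3),
    decide(checks, changed, 3),
    decide(None, checks, 3),
    decide(checks, checks, 0),
]
```

Comparing by value rather than by identity is what lets a freshly fetched but identical `checks` table count as unchanged.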

AlinsRan force-pushed the fix/healthcheck-incremental-rebuild branch from 34c2560 to 9c71af9 Compare June 22, 2026 03:02

AlinsRan force-pushed the fix/healthcheck-incremental-rebuild branch from 9c71af9 to 7994168 Compare June 22, 2026 03:39

AlinsRan added 2 commits June 22, 2026 14:16

AlinsRan wants to merge 3 commits into apache:master from AlinsRan:fix/healthcheck-incremental-rebuild
