Skip to content

fix(healthcheck): update targets incrementally instead of destroy-and-rebuild#13582

Draft
AlinsRan wants to merge 3 commits into
apache:masterfrom
AlinsRan:fix/healthcheck-incremental-rebuild
Draft

fix(healthcheck): update targets incrementally instead of destroy-and-rebuild#13582
AlinsRan wants to merge 3 commits into
apache:masterfrom
AlinsRan:fix/healthcheck-incremental-rebuild

Conversation

@AlinsRan

Copy link
Copy Markdown
Contributor

Description

Fixes a cluster of active-health-check problems that share an apisix-side root cause: the health-check manager destroys and rebuilds the checker on every upstream change, even when only the upstream nodes changed and the checks config is identical.

Root cause

fetch_checker() keys the working checker by a version derived from both modifiedIndex and the nodes version (upstream_utils.version). So a node-only change bumps the version, fetch_checker() returns nil, and the resource is queued for an asynchronous rebuild. During that rebuild window:

  1. api_ctx.up_checker is nil, so the balancer's health filtering is bypassed and traffic flows to nodes already known to be unhealthy for ~1-2s (bug: Health check state lost and checker not working after upstream node changes #13282).
  2. The rebuild discards the checker's accumulated health state and re-probes every node from scratch.

timer_working_pool_check also destroyed the checker for any version mismatch, racing timer_create_checker and widening the nil window.

Fix

  • Added compute_targets() / sync_checker_targets(): when the checks config is unchanged (compared with core.table.deep_eq) and only the nodes changed, reconcile the existing checker's targets in place via add_target/remove_target against the authoritative shm target list, keeping the checker and its health state alive.
  • timer_working_pool_check no longer destroys a checker for a node-only version change.
  • When a rebuild is genuinely required (the checks config changed, or no checker exists), the new checker is created and inserted into the working pool before the old one is stopped, eliminating the up_checker == nil gap.
  • The working-pool entry now stores the checks config so changes can be detected.

Test

t/node/healthcheck-incremental-update.t:

  • TEST 1: a node-only change must not destroy/rebuild the checker — asserts create new checker appears once (initial) and clear checker (delayed_clear) does not.
  • TEST 2: a checks-config change must still rebuild — asserts clear checker appears.

NOTE on local verification: I could not run the prove suite in my environment because the installed OpenResty (1.21.4.4) lacks the HTTP/3/QUIC support that the current t/APISIX.pm harness unconditionally requires (listen 1994 quic, http3 on), so nginx refuses to start for any test here. The diff/reconcile algorithm used by sync_checker_targets (add/remove against get_target_list) was validated separately against the real lua-resty-healthcheck library. Please run CI / a maintainer with a QUIC-capable OpenResty to confirm the new .t.

Cross-PR dependency

This depends on the companion library PR api7/lua-resty-healthcheck#55 (clean every checker each cleanup window + release the periodic lock when idle), which fixes the related #13385 / #13141 / #13235 root cause inside the library. The rockspec dependency is bumped to lua-resty-healthcheck-api7 = 3.3.0-0 as a placeholder; the library fix is not yet released/tagged, so this PR should not be merged until that library version is published and the version here is confirmed.

Relates to #13282, #13385, #13141, #13235.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change (N/A)
  • I have verified that this change is backward compatible (incremental update falls back to full rebuild whenever checks changes)

@AlinsRan AlinsRan force-pushed the fix/healthcheck-incremental-rebuild branch from 34c2560 to 9c71af9 Compare June 22, 2026 03:02
When an upstream's nodes change but its `checks` config is unchanged,
the health-check manager destroyed the existing checker and built a new
one. Two problems followed:

1. fetch_checker() keys the working checker by a version derived from
   both modifiedIndex and the nodes version, so a node-only change makes
   it return nil until the timer rebuilds the checker. During that
   window api_ctx.up_checker is nil and the balancer routes traffic to
   nodes already known to be unhealthy (apache#13282).

2. The rebuild throws away the checker's accumulated health state and
   re-probes every node from scratch.

The manager now reconciles the existing checker's targets in place with
add_target/remove_target when the `checks` config is unchanged
(compared with core.table.deep_eq), keeping the checker and its state
alive. timer_working_pool_check no longer destroys a checker for a
node-only version change, and when a rebuild is genuinely required (the
`checks` config changed) the new checker is created and inserted into
the working pool before the old one is released, so fetch_checker never
observes a nil gap.

Bumps the lua-resty-healthcheck-api7 rockspec dependency to 3.3.0-0,
which contains the companion library fix (clean every checker each
window + release the periodic lock when idle) required by this change.

Adds t/node/healthcheck-incremental-update.t: a node-only change must
not destroy/rebuild the checker (no "clear checker"), while a
checks-config change still rebuilds it.
@AlinsRan AlinsRan force-pushed the fix/healthcheck-incremental-rebuild branch from 9c71af9 to 7994168 Compare June 22, 2026 03:39
AlinsRan added 2 commits June 22, 2026 14:16
When the checks config changes, install the freshly created checker into
the working pool before stopping the previous one. This prevents a request
from briefly fetching a stopped checker for the old version during the
swap window.
…rop to 0

- compute_targets now returns an ordered array (preserving node order) so
  targets are added deterministically, keeping ordered error-log assertions
  stable.
- Only keep/reuse a checker incrementally when the upstream still has nodes;
  when the node count drops to 0 the checker is destroyed as before.
- Update existing healthcheck tests to assert the new incremental-reuse
  behaviour (a checker is reused instead of recreated when only the nodes
  change, and is not cleared in that case).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant