fix(healthcheck): update targets incrementally instead of destroy-and-rebuild#13582
Draft
AlinsRan wants to merge 3 commits into
Draft
fix(healthcheck): update targets incrementally instead of destroy-and-rebuild#13582AlinsRan wants to merge 3 commits into
AlinsRan wants to merge 3 commits into
Conversation
34c2560 to
9c71af9
Compare
When an upstream's nodes change but its `checks` config is unchanged, the health-check manager destroyed the existing checker and built a new one. Two problems followed: 1. fetch_checker() keys the working checker by a version derived from both modifiedIndex and the nodes version, so a node-only change makes it return nil until the timer rebuilds the checker. During that window api_ctx.up_checker is nil and the balancer routes traffic to nodes already known to be unhealthy (apache#13282). 2. The rebuild throws away the checker's accumulated health state and re-probes every node from scratch. The manager now reconciles the existing checker's targets in place with add_target/remove_target when the `checks` config is unchanged (compared with core.table.deep_eq), keeping the checker and its state alive. timer_working_pool_check no longer destroys a checker for a node-only version change, and when a rebuild is genuinely required (the `checks` config changed) the new checker is created and inserted into the working pool before the old one is released, so fetch_checker never observes a nil gap. Bumps the lua-resty-healthcheck-api7 rockspec dependency to 3.3.0-0, which contains the companion library fix (clean every checker each window + release the periodic lock when idle) required by this change. Adds t/node/healthcheck-incremental-update.t: a node-only change must not destroy/rebuild the checker (no "clear checker"), while a checks-config change still rebuilds it.
9c71af9 to
7994168
Compare
When the checks config changes, install the freshly created checker into the working pool before stopping the previous one. This prevents a request from briefly fetching a stopped checker for the old version during the swap window.
…rop to 0 - compute_targets now returns an ordered array (preserving node order) so targets are added deterministically, keeping ordered error-log assertions stable. - Only keep/reuse a checker incrementally when the upstream still has nodes; when the node count drops to 0 the checker is destroyed as before. - Update existing healthcheck tests to assert the new incremental-reuse behaviour (a checker is reused instead of recreated when only the nodes change, and is not cleared in that case).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Fixes a cluster of active-health-check problems that share an apisix-side root cause: the health-check manager destroys and rebuilds the checker on every upstream change, even when only the upstream nodes changed and the
checksconfig is identical.Root cause
fetch_checker()keys the working checker by a version derived from bothmodifiedIndexand the nodes version (upstream_utils.version). So a node-only change bumps the version,fetch_checker()returnsnil, and the resource is queued for an asynchronous rebuild. During that rebuild window:api_ctx.up_checkerisnil, so the balancer's health filtering is bypassed and traffic flows to nodes already known to be unhealthy for ~1-2s (bug: Health check state lost and checker not working after upstream node changes #13282).timer_working_pool_checkalso destroyed the checker for any version mismatch, racingtimer_create_checkerand widening the nil window.Fix
compute_targets()/sync_checker_targets(): when thechecksconfig is unchanged (compared withcore.table.deep_eq) and only the nodes changed, reconcile the existing checker's targets in place viaadd_target/remove_targetagainst the authoritative shm target list, keeping the checker and its health state alive.timer_working_pool_checkno longer destroys a checker for a node-only version change.checksconfig changed, or no checker exists), the new checker is created and inserted into the working pool before the old one is stopped, eliminating theup_checker == nilgap.checksconfig so changes can be detected.Test
t/node/healthcheck-incremental-update.t:create new checkerappears once (initial) andclear checker(delayed_clear) does not.checks-config change must still rebuild — assertsclear checkerappears.Cross-PR dependency
This depends on the companion library PR api7/lua-resty-healthcheck#55 (clean every checker each cleanup window + release the periodic lock when idle), which fixes the related #13385 / #13141 / #13235 root cause inside the library. The rockspec dependency is bumped to
lua-resty-healthcheck-api7 = 3.3.0-0as a placeholder; the library fix is not yet released/tagged, so this PR should not be merged until that library version is published and the version here is confirmed.Relates to #13282, #13385, #13141, #13235.
Checklist
checkschanges)