Skip to content

feat: incremental compilation routers#12826

Closed
apaloleg wants to merge 3 commits into
apache:masterfrom
apaloleg:issue-9140
Closed

feat: incremental compilation routers#12826
apaloleg wants to merge 3 commits into
apache:masterfrom
apaloleg:issue-9140

Conversation

@apaloleg

@apaloleg apaloleg commented Dec 19, 2025

Copy link
Copy Markdown

Description

This commit introduces incremental route updates in the radixtree_host_uri, radixtree_uri, radixtree_uri_with_parameter routers implementation, eliminating the need to rebuild the entire radix tree on every configuration change via the Admin API.
Refer to #9692.

This feature depends on changes in lua-resty-radixtree lib: api7/lua-resty-radixtree#156

Which issue(s) this PR fixes:

Fixes (#9140)

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

This commit introduces incremental route updates in the radixtree_host_uri, radixtree_uri, radixtree_uri_with_parameter
routers implementation, eliminating the need to rebuild the entire radix tree on every configuration change via the Admin API.
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Dec 19, 2025
@apaloleg

apaloleg commented Dec 20, 2025

Copy link
Copy Markdown
Author

Hi!

Thank you @ranxuxin001 for the previous attempt to implement incremental updates for APISIX! The core idea was really solid and greatly inspired this solution here.

Unfortunately, when I tried to build on that solution for radixtree_host_uri, I couldn't quite get it working properly in my tests — it's very possible I didn't fully understand the original intent. So, I decided to take a slightly different path by directly passing and retrieving sub-routers in the host routes.

This allowed me to achieve stable incremental updates, but I'm well aware there might be a more elegant or cleaner way to do it. If you or anyone else has alternative ideas or suggestions, I'd genuinely love to hear them and am happy to adjust the implementation!

Additionally, in this PR was also:

  1. Added proper handling of the route status field (0/1) during create/update/delete operations.
  2. Removed storing changes in sync_tb in the privileged agent to prevent unnecessary data accumulation inside process.
  3. Some changes according prev review.

@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Dec 21, 2025
Comment thread apisix/http/router/radixtree_host_uri.lua Outdated
Comment thread apisix/router.lua Outdated
Comment thread apisix/http/router/radixtree_host_uri.lua Outdated
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:XL This PR changes 500-999 lines, ignoring generated files. labels Dec 22, 2025
@Baoyuantop Baoyuantop requested a review from membphis December 23, 2025 01:58
@Baoyuantop

Copy link
Copy Markdown
Contributor

Hi @apaloleg, thanks for your contribution. I would like to ask if you have encountered the problem described in this PR while actually using APISIX? When evaluating the implementation solutions for this PR, we found that introducing incremental updates might alleviate the problem to some extent, but incremental updates are unreliable, and their reliability is lower than that of full updates after long-term operation. Therefore, I would like to hear about the problems you actually encountered to determine a reasonable solution.

@Baoyuantop Baoyuantop added wait for update wait for the author's response in this issue/PR and removed awaiting review labels Dec 25, 2025
@apaloleg

Copy link
Copy Markdown
Author

Hi @apaloleg, thanks for your contribution. I would like to ask if you have encountered the problem described in this PR while actually using APISIX? When evaluating the implementation solutions for this PR, we found that introducing incremental updates might alleviate the problem to some extent, but incremental updates are unreliable, and their reliability is lower than that of full updates after long-term operation. Therefore, I would like to hear about the problems you actually encountered to determine a reasonable solution.

Hi!

In our environment, we currently have over 40,000 routes (and this number is steadily growing over time), and we're using the radixtree_host_uri router.

We have frequently routes updates (add, update, delete) — these can be small operations (1–5 routes) or large ones (50–100+ routes). A series of such operations can last from 3-5s to 10+ minutes. At the same time, we have constant traffic on the gateway: around 200–300 RPS on average, and up to 900–1000 RPS during peak hours.

During these update periods, the CPU usage of APISIX pods spikes to 100%, and the latency of incoming requests increased significantly because requests had to wait for the full rebuild of the radix tree. This was a critical issue for us in production.

The patch from this PR completely resolved our problem. We've been running it in our production environment for about a month now, and we no longer experience these CPU spikes or increased latency during this updates period.

@Baoyuantop Baoyuantop changed the title feat: incremental route updates for radixtree routers feat: incremental compilation routers Dec 26, 2025
@Baoyuantop

Copy link
Copy Markdown
Contributor

Hi @apaloleg, may I ask if this solution is still working well?

@apaloleg

apaloleg commented Mar 3, 2026

Copy link
Copy Markdown
Author

Hi @apaloleg, may I ask if this solution is still working well?

Hi! Yes, we're still using this solution since mid-November - no problems whatsoever so far.

The high CPU usage we were seeing during admin operations before this patch hasn't come back at all.

We also haven't noticed any issues with create/update/delete operations or when toggling route status from 0 to 1 (and vice versa).

@Baoyuantop

Copy link
Copy Markdown
Contributor

cc @membphis

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces incremental (in-place) route updates for the radixtree_uri, radixtree_uri_with_parameter, and radixtree_host_uri HTTP routers to avoid rebuilding the entire radix tree on every Admin API/etcd configuration change (targeting large route sets).

Changes:

  • Extend config_etcd filter invocation to pass (item, pre_val_or_size, obj) so watchers can distinguish full reloads vs incremental updates.
  • Track per-route create/update/delete operations in apisix.router (sync_tb) and apply them incrementally in router matchers.
  • Add id to radixtree route entries and adjust router build/event push behavior to support incremental operations.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
apisix/router.lua Adds sync table + route filter logic to record incremental ops and trigger rebuilds.
apisix/http/service.lua Detects service hosts changes and requests radixtree rebuilds.
apisix/http/router/radixtree_uri.lua Implements incremental add/update/delete for URI router based on sync_tb.
apisix/http/router/radixtree_uri_with_parameter.lua Uses rebuild vs incremental update based on need_create_radixtree.
apisix/http/router/radixtree_host_uri.lua Adds incremental updates for host+uri router, including sub-router manipulation.
apisix/http/route.lua Adds id to built routes and gates BUILD_ROUTER event push.
apisix/core/config_etcd.lua Passes prior value / load size into filters to enable incremental tracking.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread apisix/http/service.lua
Comment on lines +55 to +58
if type(pre_service_or_size) == "number" or not obj then
return
end

Copilot AI Mar 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filter() returns immediately for full reloads (type(pre_service_or_size) == "number"). With incremental routing, service changes can affect computed route hosts (routes using service_id inherit service.hosts during router build). If services are reloaded via need_reload (compaction/restart), service_version will change but need_create_radixtree will remain false, so the router may not be rebuilt and can keep stale service-host-derived matching. Consider marking apisix.router.need_create_radixtree = true on service full reload (and/or providing a reliable end-of-reload signal similar to routes).

Suggested change
if type(pre_service_or_size) == "number" or not obj then
return
end
if type(pre_service_or_size) == "number" then
-- full reload of services; router may need rebuild because routes
-- inheriting service.hosts can be affected
local ar = require("apisix.router")
ar.need_create_radixtree = true
core.log.info("full services reload, rebuild radixtree")
return
end
if not obj then
return
end

Copilot uses AI. Check for mistakes.

if route and route.value then
local status = core.table.try_read_attr(route, "value", "status")
if status and status == 0 then

Copilot AI Mar 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In incremental_operate_radixtree, when a route is disabled (status == 0) the code does goto CONTINUE, but the ::CONTINUE:: label is after ar.sync_tb[k] = nil. This leaves the sync entry in ar.sync_tb forever, so every subsequent request will keep re-processing (and skipping) the same entry. Clear the entry before jumping (or move the cleanup before the label).

Suggested change
if status and status == 0 then
if status and status == 0 then
-- clear sync entry for disabled route before skipping further processing
ar.sync_tb[k] = nil

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +74
id = route.value.id,
paths = route.value.uris or route.value.uri,
methods = route.value.methods,
priority = route.value.priority,
hosts = route.value.hosts or route.value.host,
remote_addrs = route.value.remote_addrs or route.value.remote_addr,

Copilot AI Mar 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incremental route building sets hosts = route.value.hosts or route.value.host, but the full router build (create_radixtree_uri_router) also derives hosts from the referenced service_id when the route itself has no host/hosts. Without matching that logic here, incremental create/update/delete can change host-matching semantics (and may fail to find the existing route to update/delete). Compute hosts the same way as the full build (including service_fetch(service_id) fallback).

Copilot uses AI. Check for mistakes.
if not filter_fun then
core.log.error("failed to load filter function: ", err,
" route id", route.value.id)
return

Copilot AI Mar 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On filter_func load failure, the code logs and returns from incremental_operate_radixtree, aborting all remaining pending sync operations and leaving ar.sync_tb uncleared. The full rebuild path skips only the bad route; incremental should similarly skip/mark-rebuild and still clear the current sync entry to avoid getting stuck on the same failure forever.

Suggested change
return
goto CONTINUE

Copilot uses AI. Check for mistakes.
Comment on lines +32 to +39
function _M.incremental_operate_radixtree(router, routes, with_parameter)
local op, route, last_route, err
local cur_tmp, last_tmp
local router_opts = {
no_param_match = with_parameter ~= true
}

event.push(event.CONST.BUILD_ROUTER, routes)

Copilot AI Mar 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The incremental router update logic in this module is a behavior change (create/update/delete without full rebuild) but there are no tests covering it. Please add Test::Nginx cases that exercise route create/update/delete via the Admin API and verify correct matching (including host matching via service_id, status enable/disable, and parameterized routes for the with_parameter variant).

Copilot uses AI. Check for mistakes.
Comment thread apisix/router.lua
Comment on lines +94 to +96
if pre_route_or_size == #obj.values then
_M.need_create_radixtree = true
end

Copilot AI Mar 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_full_data() passes a numeric second arg (#values) into this filter. The current pre_route_or_size == #obj.values check can fail when some items are invalid and skipped (so #obj.values < #values), meaning need_create_radixtree may never be set on a full reload. That can leave the worker using a stale router after need_reload reloads config. Consider setting need_create_radixtree unconditionally on full reload (or using a more reliable end-of-reload signal than comparing counts).

Suggested change
if pre_route_or_size == #obj.values then
_M.need_create_radixtree = true
end
-- full data reload: always rebuild radixtree after load_full_data,
-- since some routes may be invalid and skipped, making #obj.values
-- smaller than the original size passed in as pre_route_or_size
_M.need_create_radixtree = true

Copilot uses AI. Check for mistakes.
Comment thread apisix/router.lua
Comment on lines +111 to +120
local pre_status = pre_route_or_size.value.status
local status = route.value.status

-- sync according status
if pre_status == 0 and status == 1 then
sync_tb_create(sync_tb, route)
elseif pre_status == 1 and status == 0 then
sync_tb_delete(sync_tb, pre_route_or_size)
elseif pre_status == 1 and status == 1 then
sync_tb_update(sync_tb, pre_route_or_size, route)

Copilot AI Mar 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sync logic treats status as always 0/1, but APISIX historically treats missing status as enabled (many existing objects omit it, and schema validation does not apply defaults). Here, a nil status won't match any branch and may prevent create/update/delete from being recorded (and create uses if route.value.status == 1 only). Normalize missing status to 1 (for both current and previous route) before computing the operation.

Copilot uses AI. Check for mistakes.
Comment thread apisix/router.lua
Comment on lines +108 to +110
--update route
core.log.notice("update routes watched from etcd into radixtree. ",
core.json.delay_encode(route, true))

Copilot AI Mar 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These core.log.notice(... delay_encode(route, true)) statements will emit the full route object (including plugins) at NOTICE level on every watched change. For frequent route updates this can significantly increase log volume/CPU and may expose sensitive plugin config in logs. Consider lowering to debug or logging only the route id/op (and maybe key fields) instead of the full JSON.

Copilot uses AI. Check for mistakes.

@moonming moonming left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @apaloleg, thank you for the incremental router compilation work!

This is an ambitious and important optimization — rebuilding the entire radix tree on every route change is a known performance bottleneck, especially at scale.

Context: There's also #13073 which takes a simpler approach (throttling rebuilds with a minimum interval). The two approaches complement each other:

  • Your incremental compilation: Fundamentally reduces rebuild cost
  • Throttling (#13073): Quick win that reduces rebuild frequency

Questions:

  1. Does this require changes to lua-resty-radixtree upstream? If so, what's the status of those changes?
  2. What's the performance improvement in your benchmarks? (e.g., time to add/remove a route with 10K existing routes)
  3. How does it handle batch route updates (e.g., syncing 100 routes from etcd at once)?

This is a strategic improvement we'd love to land. Let's discuss the path forward — we might merge the throttling approach first as a quick win while this more fundamental optimization matures.

Thank you for the significant engineering effort!

@github-actions

Copy link
Copy Markdown

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label May 15, 2026
@github-actions

Copy link
Copy Markdown

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

discuss enhancement New feature or request size:L This PR changes 100-499 lines, ignoring generated files. stale user responded wait for update wait for the author's response in this issue/PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants