feat: incremental compilation routers#12826
Conversation
This commit introduces incremental route updates in the radixtree_host_uri, radixtree_uri, radixtree_uri_with_parameter routers implementation, eliminating the need to rebuild the entire radix tree on every configuration change via the Admin API.
|
Hi! Thank you @ranxuxin001 for the previous attempt to implement incremental updates for APISIX! The core idea was really solid and greatly inspired this solution here. Unfortunately, when I tried to build on that solution for This allowed me to achieve stable incremental updates, but I'm well aware there might be a more elegant or cleaner way to do it. If you or anyone else has alternative ideas or suggestions, I'd genuinely love to hear them and am happy to adjust the implementation! Additionally, in this PR was also:
|
|
Hi @apaloleg, thanks for your contribution. I would like to ask if you have encountered the problem described in this PR while actually using APISIX? When evaluating the implementation solutions for this PR, we found that introducing incremental updates might alleviate the problem to some extent, but incremental updates are unreliable, and their reliability is lower than that of full updates after long-term operation. Therefore, I would like to hear about the problems you actually encountered to determine a reasonable solution. |
Hi! In our environment, we currently have over 40,000 routes (and this number is steadily growing over time), and we're using the We have frequently routes updates (add, update, delete) — these can be small operations (1–5 routes) or large ones (50–100+ routes). A series of such operations can last from 3-5s to 10+ minutes. At the same time, we have constant traffic on the gateway: around 200–300 RPS on average, and up to 900–1000 RPS during peak hours. During these update periods, the CPU usage of APISIX pods spikes to 100%, and the latency of incoming requests increased significantly because requests had to wait for the full rebuild of the radix tree. This was a critical issue for us in production. The patch from this PR completely resolved our problem. We've been running it in our production environment for about a month now, and we no longer experience these CPU spikes or increased latency during this updates period. |
|
Hi @apaloleg, may I ask if this solution is still working well? |
Hi! Yes, we're still using this solution since mid-November - no problems whatsoever so far. The high CPU usage we were seeing during admin operations before this patch hasn't come back at all. We also haven't noticed any issues with create/update/delete operations or when toggling route status from 0 to 1 (and vice versa). |
|
cc @membphis |
There was a problem hiding this comment.
Pull request overview
This PR introduces incremental (in-place) route updates for the radixtree_uri, radixtree_uri_with_parameter, and radixtree_host_uri HTTP routers to avoid rebuilding the entire radix tree on every Admin API/etcd configuration change (targeting large route sets).
Changes:
- Extend
config_etcdfilter invocation to pass(item, pre_val_or_size, obj)so watchers can distinguish full reloads vs incremental updates. - Track per-route create/update/delete operations in
apisix.router(sync_tb) and apply them incrementally in router matchers. - Add
idto radixtree route entries and adjust router build/event push behavior to support incremental operations.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| apisix/router.lua | Adds sync table + route filter logic to record incremental ops and trigger rebuilds. |
| apisix/http/service.lua | Detects service hosts changes and requests radixtree rebuilds. |
| apisix/http/router/radixtree_uri.lua | Implements incremental add/update/delete for URI router based on sync_tb. |
| apisix/http/router/radixtree_uri_with_parameter.lua | Uses rebuild vs incremental update based on need_create_radixtree. |
| apisix/http/router/radixtree_host_uri.lua | Adds incremental updates for host+uri router, including sub-router manipulation. |
| apisix/http/route.lua | Adds id to built routes and gates BUILD_ROUTER event push. |
| apisix/core/config_etcd.lua | Passes prior value / load size into filters to enable incremental tracking. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| if type(pre_service_or_size) == "number" or not obj then | ||
| return | ||
| end | ||
|
|
There was a problem hiding this comment.
filter() returns immediately for full reloads (type(pre_service_or_size) == "number"). With incremental routing, service changes can affect computed route hosts (routes using service_id inherit service.hosts during router build). If services are reloaded via need_reload (compaction/restart), service_version will change but need_create_radixtree will remain false, so the router may not be rebuilt and can keep stale service-host-derived matching. Consider marking apisix.router.need_create_radixtree = true on service full reload (and/or providing a reliable end-of-reload signal similar to routes).
| if type(pre_service_or_size) == "number" or not obj then | |
| return | |
| end | |
| if type(pre_service_or_size) == "number" then | |
| -- full reload of services; router may need rebuild because routes | |
| -- inheriting service.hosts can be affected | |
| local ar = require("apisix.router") | |
| ar.need_create_radixtree = true | |
| core.log.info("full services reload, rebuild radixtree") | |
| return | |
| end | |
| if not obj then | |
| return | |
| end |
|
|
||
| if route and route.value then | ||
| local status = core.table.try_read_attr(route, "value", "status") | ||
| if status and status == 0 then |
There was a problem hiding this comment.
In incremental_operate_radixtree, when a route is disabled (status == 0) the code does goto CONTINUE, but the ::CONTINUE:: label is after ar.sync_tb[k] = nil. This leaves the sync entry in ar.sync_tb forever, so every subsequent request will keep re-processing (and skipping) the same entry. Clear the entry before jumping (or move the cleanup before the label).
| if status and status == 0 then | |
| if status and status == 0 then | |
| -- clear sync entry for disabled route before skipping further processing | |
| ar.sync_tb[k] = nil |
| id = route.value.id, | ||
| paths = route.value.uris or route.value.uri, | ||
| methods = route.value.methods, | ||
| priority = route.value.priority, | ||
| hosts = route.value.hosts or route.value.host, | ||
| remote_addrs = route.value.remote_addrs or route.value.remote_addr, |
There was a problem hiding this comment.
Incremental route building sets hosts = route.value.hosts or route.value.host, but the full router build (create_radixtree_uri_router) also derives hosts from the referenced service_id when the route itself has no host/hosts. Without matching that logic here, incremental create/update/delete can change host-matching semantics (and may fail to find the existing route to update/delete). Compute hosts the same way as the full build (including service_fetch(service_id) fallback).
| if not filter_fun then | ||
| core.log.error("failed to load filter function: ", err, | ||
| " route id", route.value.id) | ||
| return |
There was a problem hiding this comment.
On filter_func load failure, the code logs and returns from incremental_operate_radixtree, aborting all remaining pending sync operations and leaving ar.sync_tb uncleared. The full rebuild path skips only the bad route; incremental should similarly skip/mark-rebuild and still clear the current sync entry to avoid getting stuck on the same failure forever.
| return | |
| goto CONTINUE |
| function _M.incremental_operate_radixtree(router, routes, with_parameter) | ||
| local op, route, last_route, err | ||
| local cur_tmp, last_tmp | ||
| local router_opts = { | ||
| no_param_match = with_parameter ~= true | ||
| } | ||
|
|
||
| event.push(event.CONST.BUILD_ROUTER, routes) |
There was a problem hiding this comment.
The incremental router update logic in this module is a behavior change (create/update/delete without full rebuild) but there are no tests covering it. Please add Test::Nginx cases that exercise route create/update/delete via the Admin API and verify correct matching (including host matching via service_id, status enable/disable, and parameterized routes for the with_parameter variant).
| if pre_route_or_size == #obj.values then | ||
| _M.need_create_radixtree = true | ||
| end |
There was a problem hiding this comment.
load_full_data() passes a numeric second arg (#values) into this filter. The current pre_route_or_size == #obj.values check can fail when some items are invalid and skipped (so #obj.values < #values), meaning need_create_radixtree may never be set on a full reload. That can leave the worker using a stale router after need_reload reloads config. Consider setting need_create_radixtree unconditionally on full reload (or using a more reliable end-of-reload signal than comparing counts).
| if pre_route_or_size == #obj.values then | |
| _M.need_create_radixtree = true | |
| end | |
| -- full data reload: always rebuild radixtree after load_full_data, | |
| -- since some routes may be invalid and skipped, making #obj.values | |
| -- smaller than the original size passed in as pre_route_or_size | |
| _M.need_create_radixtree = true |
| local pre_status = pre_route_or_size.value.status | ||
| local status = route.value.status | ||
|
|
||
| -- sync according status | ||
| if pre_status == 0 and status == 1 then | ||
| sync_tb_create(sync_tb, route) | ||
| elseif pre_status == 1 and status == 0 then | ||
| sync_tb_delete(sync_tb, pre_route_or_size) | ||
| elseif pre_status == 1 and status == 1 then | ||
| sync_tb_update(sync_tb, pre_route_or_size, route) |
There was a problem hiding this comment.
The sync logic treats status as always 0/1, but APISIX historically treats missing status as enabled (many existing objects omit it, and schema validation does not apply defaults). Here, a nil status won't match any branch and may prevent create/update/delete from being recorded (and create uses if route.value.status == 1 only). Normalize missing status to 1 (for both current and previous route) before computing the operation.
| --update route | ||
| core.log.notice("update routes watched from etcd into radixtree. ", | ||
| core.json.delay_encode(route, true)) |
There was a problem hiding this comment.
These core.log.notice(... delay_encode(route, true)) statements will emit the full route object (including plugins) at NOTICE level on every watched change. For frequent route updates this can significantly increase log volume/CPU and may expose sensitive plugin config in logs. Consider lowering to debug or logging only the route id/op (and maybe key fields) instead of the full JSON.
moonming
left a comment
There was a problem hiding this comment.
Hi @apaloleg, thank you for the incremental router compilation work!
This is an ambitious and important optimization — rebuilding the entire radix tree on every route change is a known performance bottleneck, especially at scale.
Context: There's also #13073 which takes a simpler approach (throttling rebuilds with a minimum interval). The two approaches complement each other:
- Your incremental compilation: Fundamentally reduces rebuild cost
- Throttling (#13073): Quick win that reduces rebuild frequency
Questions:
- Does this require changes to
lua-resty-radixtreeupstream? If so, what's the status of those changes? - What's the performance improvement in your benchmarks? (e.g., time to add/remove a route with 10K existing routes)
- How does it handle batch route updates (e.g., syncing 100 routes from etcd at once)?
This is a strategic improvement we'd love to land. Let's discuss the path forward — we might merge the throttling approach first as a quick win while this more fundamental optimization matures.
Thank you for the significant engineering effort!
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions. |
|
This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Description
This commit introduces incremental route updates in the radixtree_host_uri, radixtree_uri, radixtree_uri_with_parameter routers implementation, eliminating the need to rebuild the entire radix tree on every configuration change via the Admin API.
Refer to #9692.
This feature depends on changes in lua-resty-radixtree lib: api7/lua-resty-radixtree#156
Which issue(s) this PR fixes:
Fixes (#9140)
Checklist