feat: add --kube-api-cache-sync-timeout flag for configurable cache sync timeout #6363
Conversation
feat: add --kube-api-cache-sync-timeout flag for configurable cache sync timeout

Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure the timeout for Kubernetes informer cache sync operations during startup. This applies to all informer-based sources. Values <= 0 fall back to the default (60s). The --request-timeout flag remains unchanged for HTTP client requests.

Signed-off-by: Andrew Hay <andrew.hay@benchmarkanalytics.com>
Here is data from one of the clusters: 60k pods synced in less than 2 seconds. So I think documenting the use case for this flag is needed, to make it clear when to use it and when it will not help.
Another open question is about the feature design. We are now passing two values, a context and a timeout, explicitly to every WaitForCacheSync (https://github.com/kubernetes-sigs/external-dns/blob/master/source/informers/informers.go#L58-L59), only to re-configure a context that we pass in as well. It works, but I'm not sure it is semantically correct. The proposed change is a simple solution, no doubts there. What we could/should consider (there may be more options):

Option 1 - Scoped Syncer Object: Extract the timeout out of the raw time.Duration and into a small CacheSyncer type owned by the informers package. Sources receive a syncer, not a duration. source.Config replaces InformerSyncTimeout time.Duration with Syncer *informers.CacheSyncer. Sources store it as a field and call s.syncer().WaitForCacheSync(...).

Option 2 - Startup Context Separation: The root problem is that sources receive one context that serves two very different lifetimes: startup (bounded, should time out) and operation (long-lived, should not time out). The current design on the master branch collapses these. For example, we could introduce an explicit startCtx / runCtx split at the BuildWithConfig or Config boundary: each source constructor accepts (startCtx, ctx context.Context, cfg *Config), and WaitForCacheSync loses the timeout parameter entirely; it just uses startCtx.
Thanks @ivankatliarchuk for running the benchmark and for the thoughtful design analysis; really appreciate it.

On the performance data: agreed that the default 60s is already generous for most clusters, and 60k pods in <2s is a strong data point. My motivation for the flag is the less common but real failure mode where startup appears to hang: stuck CRD aggregations, slow API servers behind in-cluster auth webhooks, or rate-limited API access.

On the design alternatives: I like the direction of Option 2 (Startup Context Separation); you're right that the current shape conflates startup and run lifetimes. My preference would be to keep this PR minimal (flag + plumbing, no API change to sources) so the behavior lands quickly, then do the Option 2 refactor as a follow-up.
That refactor touches every source constructor, and I'd rather not couple it to the flag change, both for reviewability and bisectability if anything regresses. Happy to open the follow-up PR once this one lands, or I can do them together if you'd prefer one unit of review; let me know which you'd rather see. Either way I'll get the docs/help-text update pushed to this PR shortly.
Coverage Report for CI Build 24611048434

Warning: build has drifted. This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.

Coverage increased (+0.4%) to 80.906%

Uncovered Changes: no uncovered changes found.
Coverage Regressions: no coverage regressions found.

💛 - Coveralls
Summary
Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure the timeout for Kubernetes informer cache sync operations during startup. This applies to all informer-based sources.

Supersedes #6104. Addresses all review feedback from @ivankatliarchuk.

Fixes #6091 #5636
Changes
- Add a CacheSyncTimeout field to both externaldns.Config and source.Config, plumbed through to all sources
- Apply the timeout in informers.WaitForCacheSync and WaitForDynamicCacheSync
- Add a DefaultCacheSyncTimeout constant for shared use
- --request-timeout remains unchanged for HTTP client requests

Review feedback addressed
- Renamed --informer-sync-timeout to --kube-api-cache-sync-timeout (follows the --kube-api-* convention)
- --request-timeout is no longer deprecated
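For illustration, raising the timeout on a cluster where startup sync is slow might look like this. The 2m value and the accompanying flags are examples only, not a recommended configuration.

```shell
# Allow informer caches up to 2 minutes to sync before startup fails.
external-dns \
  --kube-api-cache-sync-timeout=2m \
  --source=service --source=ingress \
  --provider=aws
```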