
feat: add --kube-api-cache-sync-timeout flag for configurable cache sync timeout #6363

Open
AndrewCharlesHay wants to merge 2 commits into kubernetes-sigs:master from AndrewCharlesHay:feat/configurable-timeout-v2

Conversation

@AndrewCharlesHay
Contributor

Summary

Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure the timeout for Kubernetes informer cache sync operations during startup. This applies to all informer-based sources.

Supersedes #6104. Addresses all review feedback from @ivankatliarchuk.

Fixes #6091 #5636

Changes

  • Add CacheSyncTimeout field to both externaldns.Config and source.Config, plumb through to all sources
  • Add timeout parameter to informers.WaitForCacheSync and WaitForDynamicCacheSync
  • Export DefaultCacheSyncTimeout constant for shared use
  • Values <= 0 fall back to the default (60s) — no infinite wait allowed
  • --request-timeout remains unchanged for HTTP client requests

Review feedback addressed

  • Renamed flag from --informer-sync-timeout to --kube-api-cache-sync-timeout (follows --kube-api-* convention)
  • Removed separate documentation page — flag is self-documenting
  • --request-timeout is no longer deprecated
  • Removed all unrelated changes (CI, providers, events) — 22 files, focused diff
  • Fail-fast behavior preserved: cache sync timeout returns an error immediately


Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure
the timeout for Kubernetes informer cache sync operations during startup.
This applies to all informer-based sources.

Values <= 0 fall back to the default (60s). The --request-timeout flag
remains unchanged for HTTP client requests.

Signed-off-by: Andrew Hay <andrew.hay@benchmarkanalytics.com>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign raffo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the apis Issues or PRs related to API change label Apr 10, 2026
@k8s-ci-robot k8s-ci-robot added source size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 10, 2026
@ivankatliarchuk
Member

Here is data from one of the cluster

  nodes                    [==============================] 100% (3020/3020)
  services                 [==============================] 100% (2606/2606)
  svc-pods                 [==============================] 100% (60430/60430)

=== Scenario: service-heavy — Production-scale Service source — mix of headless and NodePort services across a large cluster ===
--- service (sources: 2606, endpoints: 3909) ---
  --- Endpoints() call latency ---
  warmup  (iter 1)    855ms
  steady  (iter 2–10)  min=818ms  max=855ms  mean=833ms  p50=832ms  p99=855ms  1 calls/s
  wall                18s  (active: 8.348s + pauses: 10s)

  --- kube api requests ---
  total: 1   per call: 0.10  (1 kube requests / 10 Endpoints() calls)
  GET:       4 calls  mean=112ms  p50=112ms  p99=112ms
  WATCH:     8 calls  mean=83ms  p50=100ms  p99=100ms

  --- kube api breakdown ---
  source setup (before benchmark)  total=1.6s:
    GET:       4 calls  mean=112ms  p50=112ms  p99=112ms
    WATCH:     7 calls  mean=88ms  p50=108ms  p99=108ms
      WATCH /api/*/nodes                          2 calls  mean=91ms  p50=108ms  p99=108ms  (background reconnect)
      WATCH /api/*/pods                           2 calls  mean=81ms  p50=108ms  p99=108ms  (background reconnect)
      WATCH /api/*/services                       2 calls  mean=84ms  p50=108ms  p99=108ms  (background reconnect)
      GET /api/*/nodes                            1 call   mean=67ms
      GET /api/*/pods                             1 call   mean=236ms
      GET /api/*/services                         1 call   mean=60ms
      GET /apis/*/*/endpointslices                1 call   mean=87ms
      WATCH /apis/*/*/endpointslices              1 call   mean=108ms  (background reconnect)

60k pods synced in less than 2 seconds. So I think documenting the use case for this flag is needed, to make clear when to use it and when it will not help.

@ivankatliarchuk
Member

Another open question is about the feature design. We now explicitly pass two values, a context and a timeout, to every WaitForCacheSync https://github.com/kubernetes-sigs/external-dns/blob/master/source/informers/informers.go#L58-L59 only to re-configure the context we are already passing. It works, but I am not sure it is semantically correct.

The proposed change is a simple solution, without doubts.

Current

  externaldns.Config.InformerSyncTimeout
          │
          ▼
  source.Config.InformerSyncTimeout
          │
          ├─► NewServiceSource(ctx, cfg)  → WaitForCacheSync(ctx, factory, cfg.InformerSyncTimeout)
          ├─► NewIngressSource(ctx, cfg)  → WaitForCacheSync(ctx, factory, cfg.InformerSyncTimeout)
          └─► NewGatewaySource(ctx, cfg)  → WaitForCacheSync(ctx, factory, cfg.InformerSyncTimeout)

What we could or should consider; there may be more options:

Option 1 - Scoped Syncer Object

Extract the timeout out of the raw time.Duration and into a small CacheSyncer type owned by the informers package. Sources receive a syncer, not a duration.

  // informers package
  type CacheSyncer struct{ timeout time.Duration }

  func NewCacheSyncer(timeout time.Duration) *CacheSyncer { ... }

  func (s *CacheSyncer) WaitForCacheSync(ctx context.Context, f informerFactory) error        { ... }
  func (s *CacheSyncer) WaitForDynamicCacheSync(ctx context.Context, f dynamicInformerFactory) error { ... }

source.Config replaces InformerSyncTimeout time.Duration with Syncer *informers.CacheSyncer. Sources store it as a field and call s.syncer().WaitForCacheSync(...).

  NewSourceConfig() {
      Syncer: informers.NewCacheSyncer(cfg.InformerSyncTimeout),
  }
      │
      ├─► NewServiceSource → source.syncer().WaitForCacheSync(ctx, factory)
      └─► NewGatewaySource → source.syncer().WaitForCacheSync(ctx, factory)

Option 2 - Startup Context Separation

The root problem is that sources receive one context that serves two very different lifetimes: startup (bounded, should time out) and operation (long-lived, should not time out). The current design in master branch collapses these.

For example, we could introduce an explicit startCtx / runCtx split at the BuildWithConfig or Config boundary:

  // store.go - config construction
  startCtx, cancel := context.WithTimeout(ctx, cfg.InformerSyncTimeout)
  defer cancel()

Each source constructor accepts (startCtx, ctx context.Context, cfg *Config). WaitForCacheSync loses the timeout parameter entirely — it just uses startCtx:

  func WaitForCacheSync(ctx context.Context, factory informerFactory) error {
      // ctx already carries the startup deadline set by the caller
      for typ, done := range factory.WaitForCacheSync(ctx.Done()) { ... }
  }

    startCtx = context.WithTimeout(ctx, InformerSyncTimeout)
          │
          ├─► NewServiceSource(ctx, cfg)
          │       └─► WaitForCacheSync(cfg.startCtx(), factory)  ← no timeout param
          └─► NewGatewaySource(ctx, cfg)
                  └─► WaitForCacheSync(cfg.startCtx(), factory)  ← same

@AndrewCharlesHay
Contributor Author

Thanks @ivankatliarchuk for running the benchmark and for the thoughtful design analysis — really appreciate it.

On the performance data

Agreed that the default 60s is already generous for most clusters — 60k pods in <2s is a strong data point. My motivation for the flag is the less common but real failure mode where startup appears to hang: stuck CRD aggregations, slow API servers behind in-cluster auth webhooks, or rate-limited list calls in very large clusters that push past 60s on a cold start. In those cases, operators today either can't get external-dns up at all or have to patch/rebuild. I'll expand the --help text and add a docs paragraph to make the narrow intent clear: this flag is a diagnostic lever for unusually slow cache warms, not a general tuning knob, and the first step should still be to rule out API server / RBAC / network causes.

On the design alternatives

I like the direction of Option 2 (Startup Context Separation) — you're right that the current shape conflates startup and run lifetimes, and WaitForCacheSync(ctx, factory, timeout) is semantically awkward because the timeout contradicts the context's deadline. Option 1 (scoped CacheSyncer) is a nice middle ground that cleans the signature without touching source constructors.

My preference would be to keep this PR minimal (flag + plumbing, no API change to sources) so the behavior lands quickly, then do the Option 2 refactor as a follow-up where we can:

  1. Introduce startCtx, cancel := context.WithTimeout(ctx, cfg.InformerSyncTimeout) at the BuildWithConfig/store boundary
  2. Drop the timeout time.Duration parameter from WaitForCacheSync / WaitForDynamicCacheSync entirely
  3. Remove InformerSyncTimeout from source.Config since it's now carried in the context

That refactor touches every source constructor and I'd rather not couple it to the flag change, both for reviewability and bisect-ability if anything regresses.

Happy to open the follow-up PR once this one lands, or I can do them together if you'd prefer one unit of review — let me know which you'd rather see. Either way I'll get the docs/help-text update pushed to this PR shortly.

Signed-off-by: Andrew Hay <andrew.hay@benchmarkanalytics.com>
@coveralls

Coverage Report for CI Build 24611048434

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage increased (+0.4%) to 80.906%

Details

  • Coverage increased (+0.4%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 21357
Covered Lines: 17279
Line Coverage: 80.91%
Coverage Strength: 1468.12 hits per line

💛 - Coveralls


Labels

apis Issues or PRs related to API change cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. docs size/M Denotes a PR that changes 30-99 lines, ignoring generated files. source


Development

Successfully merging this pull request may close these issues.

Bug: Cloudflare Provider crashes with "Identical record already exists" (81058) when using Region Key

4 participants