
feat: add --kube-api-cache-sync-timeout flag for configurable cache sync timeout #6363

Open
AndrewCharlesHay wants to merge 2 commits into kubernetes-sigs:master from AndrewCharlesHay:feat/configurable-timeout-v2

Conversation

@AndrewCharlesHay
Contributor

Summary

Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure the timeout for Kubernetes informer cache sync operations during startup. This applies to all informer-based sources.

Supersedes #6104. Addresses all review feedback from @ivankatliarchuk.

Fixes #6091 #5636

Changes

  • Add CacheSyncTimeout field to both externaldns.Config and source.Config, plumb through to all sources
  • Add timeout parameter to informers.WaitForCacheSync and WaitForDynamicCacheSync
  • Export DefaultCacheSyncTimeout constant for shared use
  • Values <= 0 fall back to the default (60s) — no infinite wait allowed
  • --request-timeout remains unchanged for HTTP client requests

Review feedback addressed

  • Renamed flag from --informer-sync-timeout to --kube-api-cache-sync-timeout (follows --kube-api-* convention)
  • Removed separate documentation page — flag is self-documenting
  • --request-timeout is no longer deprecated
  • Removed all unrelated changes (CI, providers, events) — 22 files, focused diff
  • Fail-fast behavior preserved: cache sync timeout returns an error immediately


Add a new --kube-api-cache-sync-timeout flag (default: 60s) to configure
the timeout for Kubernetes informer cache sync operations during startup.
This applies to all informer-based sources.

Values <= 0 fall back to the default (60s). The --request-timeout flag
remains unchanged for HTTP client requests.

Signed-off-by: Andrew Hay <andrew.hay@benchmarkanalytics.com>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign raffo for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the apis Issues or PRs related to API change label Apr 10, 2026
@k8s-ci-robot k8s-ci-robot added source size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 10, 2026
@ivankatliarchuk
Member

Here is data from one of the cluster

  nodes                    [==============================] 100% (3020/3020)
  services                 [==============================] 100% (2606/2606)
  svc-pods                 [==============================] 100% (60430/60430)

=== Scenario: service-heavy — Production-scale Service source — mix of headless and NodePort services across a large cluster ===
--- service (sources: 2606, endpoints: 3909) ---
  --- Endpoints() call latency ---
  warmup  (iter 1)    855ms
  steady  (iter 2–10)  min=818ms  max=855ms  mean=833ms  p50=832ms  p99=855ms  1 calls/s
  wall                18s  (active: 8.348s + pauses: 10s)

  --- kube api requests ---
  total: 1   per call: 0.10  (1 kube requests / 10 Endpoints() calls)
  GET:       4 calls  mean=112ms  p50=112ms  p99=112ms
  WATCH:     8 calls  mean=83ms  p50=100ms  p99=100ms

  --- kube api breakdown ---
  source setup (before benchmark)  total=1.6s:
    GET:       4 calls  mean=112ms  p50=112ms  p99=112ms
    WATCH:     7 calls  mean=88ms  p50=108ms  p99=108ms
      WATCH /api/*/nodes                          2 calls  mean=91ms  p50=108ms  p99=108ms  (background reconnect)
      WATCH /api/*/pods                           2 calls  mean=81ms  p50=108ms  p99=108ms  (background reconnect)
      WATCH /api/*/services                       2 calls  mean=84ms  p50=108ms  p99=108ms  (background reconnect)
      GET /api/*/nodes                            1 call   mean=67ms
      GET /api/*/pods                             1 call   mean=236ms
      GET /api/*/services                         1 call   mean=60ms
      GET /apis/*/*/endpointslices                1 call   mean=87ms
      WATCH /apis/*/*/endpointslices              1 call   mean=108ms  (background reconnect)

60k pods synced in less than 2 seconds. So I think documenting the use case for this flag is needed, to make clear when to use it and when it will not help.

@ivankatliarchuk
Member

Another open question is about the feature design. We now explicitly pass two values, a context and a timeout, to every WaitForCacheSync https://github.com/kubernetes-sigs/external-dns/blob/master/source/informers/informers.go#L58-L59 only to re-configure the context we are already passing. It works, but I am not sure it is semantically correct.

The proposed change is a simple solution, without doubts.

Current

  externaldns.Config.InformerSyncTimeout
          │
          ▼
  source.Config.InformerSyncTimeout
          │
          ├─► NewServiceSource(ctx, cfg)  → WaitForCacheSync(ctx, factory, cfg.InformerSyncTimeout)
          ├─► NewIngressSource(ctx, cfg)  → WaitForCacheSync(ctx, factory, cfg.InformerSyncTimeout)
          └─► NewGatewaySource(ctx, cfg)  → WaitForCacheSync(ctx, factory, cfg.InformerSyncTimeout)

What we could or should consider; there may be more options:

Option 1 - Scoped Syncer Object

Extract the timeout out of the raw time.Duration and into a small CacheSyncer type owned by the informers package. Sources receive a syncer, not a duration.

  // informers package
  type CacheSyncer struct{ timeout time.Duration }

  func NewCacheSyncer(timeout time.Duration) *CacheSyncer { ... }

  func (s *CacheSyncer) WaitForCacheSync(ctx context.Context, f informerFactory) error        { ... }
  func (s *CacheSyncer) WaitForDynamicCacheSync(ctx context.Context, f dynamicInformerFactory) error { ... }

source.Config replaces InformerSyncTimeout time.Duration with Syncer *informers.CacheSyncer. Sources store it as a field and call s.syncer().WaitForCacheSync(...).

  NewSourceConfig() {
      Syncer: informers.NewCacheSyncer(cfg.InformerSyncTimeout),
  }
      │
      ├─► NewServiceSource → source.syncer().WaitForCacheSync(ctx, factory)
      └─► NewGatewaySource → source.syncer().WaitForCacheSync(ctx, factory)

Option 2 - Startup Context Separation

The root problem is that sources receive one context that serves two very different lifetimes: startup (bounded, should time out) and operation (long-lived, should not time out). The current design in master branch collapses these.

For example, we could introduce an explicit startCtx / runCtx split at the BuildWithConfig or Config boundary:

  // store.go - config construction
  startCtx, cancel := context.WithTimeout(ctx, cfg.InformerSyncTimeout)
  defer cancel()

Each source constructor accepts (startCtx, ctx context.Context, cfg *Config). WaitForCacheSync loses the timeout parameter entirely — it just uses startCtx:

  func WaitForCacheSync(ctx context.Context, factory informerFactory) error {
      // ctx already carries the startup deadline set by the caller
      for typ, done := range factory.WaitForCacheSync(ctx.Done()) { ... }
  }

    startCtx = context.WithTimeout(ctx, InformerSyncTimeout)
          │
          ├─► NewServiceSource(ctx, cfg)
          │       └─► WaitForCacheSync(cfg.startCtx(), factory)  ← no timeout param
          └─► NewGatewaySource(ctx, cfg)
                  └─► WaitForCacheSync(cfg.startCtx(), factory)  ← same

@AndrewCharlesHay
Contributor Author

Thanks @ivankatliarchuk for running the benchmark and for the thoughtful design analysis — really appreciate it.

On the performance data

Agreed that the default 60s is already generous for most clusters — 60k pods in <2s is a strong data point. My motivation for the flag is the less common but real failure mode where startup appears to hang: stuck CRD aggregations, slow API servers behind in-cluster auth webhooks, or rate-limited list calls in very large clusters that push past 60s on a cold start. In those cases, operators today either can't get external-dns up at all or have to patch/rebuild. I'll expand the --help text and add a docs paragraph to make the narrow intent clear: this flag is a diagnostic lever for unusually slow cache warms, not a general tuning knob, and the first step should still be to rule out API server / RBAC / network causes.

On the design alternatives

I like the direction of Option 2 (Startup Context Separation) — you're right that the current shape conflates startup and run lifetimes, and WaitForCacheSync(ctx, factory, timeout) is semantically awkward because the timeout contradicts the context's deadline. Option 1 (scoped CacheSyncer) is a nice middle ground that cleans the signature without touching source constructors.

My preference would be to keep this PR minimal (flag + plumbing, no API change to sources) so the behavior lands quickly, then do the Option 2 refactor as a follow-up where we can:

  1. Introduce startCtx, cancel := context.WithTimeout(ctx, cfg.InformerSyncTimeout) at the BuildWithConfig/store boundary
  2. Drop the timeout time.Duration parameter from WaitForCacheSync / WaitForDynamicCacheSync entirely
  3. Remove InformerSyncTimeout from source.Config since it's now carried in the context

That refactor touches every source constructor and I'd rather not couple it to the flag change, both for reviewability and bisect-ability if anything regresses.

Happy to open the follow-up PR once this one lands, or I can do them together if you'd prefer one unit of review — let me know which you'd rather see. Either way I'll get the docs/help-text update pushed to this PR shortly.

Signed-off-by: Andrew Hay <andrew.hay@benchmarkanalytics.com>
@coveralls

Coverage Report for CI Build 24611048434

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage increased (+0.4%) to 80.906%

Details

  • Coverage increased (+0.4%) from the base build.
  • Patch coverage: No coverable lines changed in this PR.
  • No coverage regressions found.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 21357
Covered Lines: 17279
Line Coverage: 80.91%
Coverage Strength: 1468.12 hits per line

💛 - Coveralls


Labels

apis Issues or PRs related to API change cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. docs size/M Denotes a PR that changes 30-99 lines, ignoring generated files. source


Development

Successfully merging this pull request may close these issues.

Bug: Cloudflare Provider crashes with "Identical record already exists" (81058) when using Region Key

4 participants