Skip to content

OCPBUGS-85677: avoid Progressing during node reboot pod recreation#3018

Open
jluhrsen wants to merge 1 commit into
openshift:masterfrom
jluhrsen:OCPBUGS-85677
Open

OCPBUGS-85677: avoid Progressing during node reboot pod recreation#3018
jluhrsen wants to merge 1 commit into
openshift:masterfrom
jluhrsen:OCPBUGS-85677

Conversation

@jluhrsen

@jluhrsen jluhrsen commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

The counter check (UpdatedNumberScheduled < CurrentNumberScheduled) correctly detects CNO-initiated rollouts but also triggers during node reboots when pods are recreated without any DaemonSet spec change.

Require generation > observedGeneration to confirm CNO recently updated the DaemonSet, filtering out transient pod churn from node lifecycle events.

Fixes false positives during MCO node reboots with ipsec machine configs.

Summary by CodeRabbit

  • Bug Fixes
    • Improved detection of active DaemonSet rollouts during deployment monitoring. Rollout status is now kept “active” only when there’s an actual pending generation or meaningful progress, reducing false “in-progress” reports in stalled or transitioning states.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 1, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-85677, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The counter check (UpdatedNumberScheduled < CurrentNumberScheduled) correctly detects CNO-initiated rollouts but also triggers during node reboots when pods are recreated without any DaemonSet spec change.

Require generation > observedGeneration to confirm CNO recently updated the DaemonSet, filtering out transient pod churn from node lifecycle events.

Fixes false positives during MCO node reboots with ipsec machine configs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5805e0aa-1924-4034-b48d-5cd9a75480b2

📥 Commits

Reviewing files that changed from the base of the PR and between 194f573 and 570b098.

📒 Files selected for processing (1)
  • pkg/controller/statusmanager/pod_status.go

Walkthrough

The SetFromPods function's DaemonSet rollout active detection now adds a generation-based guard condition: rollout activity requires Generation > ObservedGeneration alongside count-based progress stall indicators, preventing false active status when the controller has not yet observed the latest generation. Rollout state tracking is expanded to start whenever this pending condition holds, and hung-rollout detection is restructured within this broader tracking flow with tightened condition guards.

Changes

DaemonSet Rollout Status Detection

Layer / File(s) Summary
Generation-based rollout tracking and hung-rollout detection
pkg/controller/statusmanager/pod_status.go
The rollout-active check now requires dsRolloutPending (Generation > ObservedGeneration) alongside existing count-based stall checks. Rollout state tracking is predicated on either active progress or pending generation. Hung-rollout detection is moved into the tracking flow and tightened to require progressing state and prior tracked state.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Test Structure And Quality ❓ Inconclusive The custom check requests review of Ginkgo test code, but the PR adds/modifies tests using Go's standard testing package, not Ginkgo. The check instructions reference Ginkgo-specific features (Desc... Clarify whether the check applies to standard Go testing package tests or only Ginkgo-based tests, as the repository uses the standard testing package for unit tests, not Ginkgo.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title references avoiding 'Progressing' during node reboot pod recreation, which aligns with the main objective of preventing false positive rollout status reports during MCO node reboots by adding a generation/observedGeneration check.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR does not contain Ginkgo tests; it uses standard Go unit testing framework with stable, deterministic test names. Check is not applicable to this PR.
Microshift Test Compatibility ✅ Passed PR adds no new Ginkgo e2e tests. Only file changed is pkg/controller/statusmanager/pod_status.go (controller logic, not tests), which contains no Ginkgo test definitions. Check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests were added in this PR. The only changes are to pkg/controller/statusmanager/pod_status.go, which modifies existing non-test code logic for DaemonSet rollout detection.
Topology-Aware Scheduling Compatibility ✅ Passed This PR modifies runtime status management logic in pod_status.go, which reads and reports DaemonSet/Deployment/StatefulSet status. It does not modify deployment manifests, pod specs, or introduce...
Ote Binary Stdout Contract ✅ Passed This PR modifies pkg/controller/statusmanager/pod_status.go, which is a Kubernetes controller package in the Cluster Network Operator (not an OTE test binary). The OTE Binary Stdout Contract applie...
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This PR does not add any new Ginkgo e2e tests. It only modifies pkg/controller/statusmanager/pod_status.go, which is a unit test file for the cluster-network-operator (not an e2e test). The repos...
No-Weak-Crypto ✅ Passed The PR modifies DaemonSet rollout detection logic in pod_status.go. The changes add a generation/observedGeneration check to filter out false positives during node reboots. The code contains no cry...
Container-Privileges ✅ Passed The PR adds cluster-network-operator codebase to git tracking. Privileged containers found in K8s manifests (OVN-Kubernetes, kube-proxy, multus, frr-k8s) are infrastructure components that legitima...
No-Sensitive-Data-In-Logs ✅ Passed PR adds no new logging statements; only modifies DaemonSet rollout logic. Existing logs contain only resource names, namespaces, status numbers, and timestamps—no sensitive data.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="Running error: context loading failed: failed to load packages: failed to load packages: failed to load with go/packages: err: exit status 1: stderr: go: inconsistent vendoring in :\n\tgithub.com/Masterminds/semver@v1.5.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/Masterminds/sprig/v3@v3.2.3: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/containernetworking/cni@v0.8.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ghodss/yaml@v1.0.1-0.20190212211648-25d852aebe32: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/go-bindata/go-bindata@v3.1.2+incompatible: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/onsi/gomega@v1.39.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tgithub.com/ope

... [truncated 17357 characters] ...

red in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/gengo/v2@v2.0.0-20251215205346-5ee0d033ba5b: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kms@v0.35.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tk8s.io/kube-aggregator@v0.35.1: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/randfill@v1.0.0: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\tsigs.k8s.io/structured-merge-diff/v6@v6.3.2: is explicitly required in go.mod, but not marked as explicit in vendor/modules.txt\n\n\tTo ignore the vendor directory, use -mod=readonly or -mod=mod.\n\tTo sync the vendor directory, run:\n\t\tgo mod vendor\n"


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from martinkennelly and miheer June 1, 2026 20:50
@openshift-ci

openshift-ci Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jluhrsen
Once this PR has been reviewed and has the lgtm label, please assign kyrtapz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jluhrsen

jluhrsen commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec 10

@openshift-ci

openshift-ci Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9c86cb10-5dfb-11f1-81a6-9b75c8f028d7-0

@jluhrsen

jluhrsen commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jun 1, 2026
@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-85677, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/controller/statusmanager/pod_status.go (1)

97-105: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard is not applied to the main DaemonSet update-progress condition.

The new Generation > ObservedGeneration check is only folded into dsRolloutActive, but Line 104 still marks progressing on UpdatedNumberScheduled < CurrentNumberScheduled unconditionally. That preserves the node-reboot false-positive path described in this PR.

Suggested fix
-		} else if ds.Status.UpdatedNumberScheduled < ds.Status.CurrentNumberScheduled {
+		} else if ds.Status.UpdatedNumberScheduled < ds.Status.CurrentNumberScheduled && ds.Generation > ds.Status.ObservedGeneration {
 			progressing = append(progressing, fmt.Sprintf("DaemonSet %q update is rolling out (%d out of %d updated)", dsName.String(), ds.Status.UpdatedNumberScheduled, ds.Status.CurrentNumberScheduled))
 			dsProgressing = true
 		} else if ds.Status.NumberUnavailable > 0 {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/statusmanager/pod_status.go` around lines 97 - 105, The
progressing branch incorrectly uses ds.Status.UpdatedNumberScheduled <
ds.Status.CurrentNumberScheduled without respecting the rollout guard; instead
use the previously computed boolean dsRolloutActive (which already includes
generation > ObservedGeneration) when deciding whether to append the
rolling-update message and set dsProgressing. Update the else-if condition that
currently checks UpdatedNumberScheduled < CurrentNumberScheduled to require
dsRolloutActive (e.g., dsRolloutActive && ds.Status.UpdatedNumberScheduled <
ds.Status.CurrentNumberScheduled) so the generation/observedGeneration guard
prevents false-positive node-reboot progress for DaemonSet updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@pkg/controller/statusmanager/pod_status.go`:
- Around line 97-105: The progressing branch incorrectly uses
ds.Status.UpdatedNumberScheduled < ds.Status.CurrentNumberScheduled without
respecting the rollout guard; instead use the previously computed boolean
dsRolloutActive (which already includes generation > ObservedGeneration) when
deciding whether to append the rolling-update message and set dsProgressing.
Update the else-if condition that currently checks UpdatedNumberScheduled <
CurrentNumberScheduled to require dsRolloutActive (e.g., dsRolloutActive &&
ds.Status.UpdatedNumberScheduled < ds.Status.CurrentNumberScheduled) so the
generation/observedGeneration guard prevents false-positive node-reboot progress
for DaemonSet updates.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b3ae22f7-9d4a-4c06-a369-336eec3e1a73

📥 Commits

Reviewing files that changed from the base of the PR and between 7d4c17a and d4b833b.

📒 Files selected for processing (1)
  • pkg/controller/statusmanager/pod_status.go

@openshift-ci-robot

Copy link
Copy Markdown
Contributor

@jluhrsen: This pull request references Jira Issue OCPBUGS-85677, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

The counter check (UpdatedNumberScheduled < CurrentNumberScheduled) correctly detects CNO-initiated rollouts but also triggers during node reboots when pods are recreated without any DaemonSet spec change.

Require generation > observedGeneration to confirm CNO recently updated the DaemonSet, filtering out transient pod churn from node lifecycle events.

Fixes false positives during MCO node reboots with ipsec machine configs.

Summary by CodeRabbit

  • Bug Fixes
  • More accurately detects active DaemonSet rollouts to avoid false positives during deployment monitoring — rollout is now considered active only when installation is incomplete or genuine progress on a new revision is observed, reducing incorrect "in-progress" status reports.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jluhrsen

jluhrsen commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec 10

@openshift-ci

openshift-ci Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e577e9f0-5e03-11f1-9d4c-ab89f3f0ae9e-0

@jluhrsen

jluhrsen commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

/test 4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
/test 4.22-upgrade-from-stable-4.21-e2e-gcp-ovn-upgrade
/test e2e-aws-ovn-upgrade
/test e2e-azure-ovn-upgrade
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-ovn-ipsec-step-registry

The counter check (UpdatedNumberScheduled < CurrentNumberScheduled) correctly detects DaemonSet rollouts but also triggers during node reboots when pods are recreated without any DaemonSet spec change.

Use generation > observedGeneration only as the short-lived signal that a CNO-initiated DaemonSet rollout has started, then keep tracking that rollout via the existing last-seen state while updated pods are still behind current pods. This filters transient pod churn after install without losing legitimate rollout progress once the DaemonSet controller has observed the new generation.

Fixes false positives during MCO node reboots with ipsec machine configs.

Signed-off-by Jamo Luhrsen <jluhrsen@gmail.com>

Co-authored-by Claude Code <anthropic@noreply.com>
@jluhrsen

Copy link
Copy Markdown
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec 10

@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7eaa3620-6a78-11f1-99e4-2abd6021fc0c-0

@jluhrsen

Copy link
Copy Markdown
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec 10

@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a6a77cb0-6ac2-11f1-8d7d-8b5c17569682-0

@jluhrsen

Copy link
Copy Markdown
Contributor Author

/retest

@jluhrsen

Copy link
Copy Markdown
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec 10

@jluhrsen

Copy link
Copy Markdown
Contributor Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d633a3b0-6b20-11f1-8d77-2ae1fdeb1b0f-0

@jluhrsen

Copy link
Copy Markdown
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec 10

@openshift-ci

openshift-ci Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ee8fe410-6b4c-11f1-9df3-ffe47acd3131-0

@jluhrsen

Copy link
Copy Markdown
Contributor Author

/retest

1 similar comment
@jluhrsen

Copy link
Copy Markdown
Contributor Author

/retest

@jluhrsen

Copy link
Copy Markdown
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec 10

@openshift-ci

openshift-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-upgrade-ipsec

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a3ad17a0-6ba2-11f1-91ae-4d4287eb12d8-0

@openshift-ci

openshift-ci Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

@jluhrsen: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade 194f573 link false /test 4.22-upgrade-from-stable-4.21-e2e-aws-ovn-upgrade
ci/prow/e2e-metal-ipi-ovn-dualstack-bgp-local-gw 570b098 link true /test e2e-metal-ipi-ovn-dualstack-bgp-local-gw
ci/prow/e2e-aws-ovn-rhcos10-techpreview 570b098 link false /test e2e-aws-ovn-rhcos10-techpreview
ci/prow/e2e-aws-ovn-upgrade 570b098 link true /test e2e-aws-ovn-upgrade
ci/prow/e2e-ovn-ipsec-step-registry 570b098 link true /test e2e-ovn-ipsec-step-registry
ci/prow/e2e-metal-ipi-ovn-ipv6-ipsec 570b098 link true /test e2e-metal-ipi-ovn-ipv6-ipsec
ci/prow/e2e-aws-ovn-upgrade-ipsec 570b098 link true /test e2e-aws-ovn-upgrade-ipsec

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@miheer

miheer commented Jun 19, 2026

Copy link
Copy Markdown

/retest

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants