Reconcile container CPU masks during NRI Synchronize #115

Open
pravk03 wants to merge 2 commits into kubernetes-sigs:main from pravk03:sync

@pravk03
Contributor

@pravk03 pravk03 commented Apr 10, 2026

This is a solution for #109, although not a perfect one.

This ensures the cgroup is correctly updated if the plugin crashes or restarts and is unavailable during container creation. However, containers created while the NRI plugin is down will not have their CPU affinity restricted and will have access to all online CPUs on the node, potentially overlapping with exclusive CPUs assigned to other pods with claims.
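
The reconciliation idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the types `containerState` and `cpusetUpdate`, the function `reconcileCPUSets`, and the fallback to a shared CPU pool for containers without claims are all assumptions made for the example, and none of them are NRI API types. The sketch compares the cpuset each existing container currently has with the cpuset the driver wants, and emits corrections only for the ones that differ.

```go
package main

import (
	"fmt"
	"sort"
)

// containerState is a hypothetical, simplified view of an existing container
// as it might be seen during a synchronize pass. It is NOT an NRI API type.
type containerState struct {
	ID          string
	CurrentCPUs string // cpuset currently applied in the cgroup, e.g. "0-7"
}

// cpusetUpdate is a hypothetical record of one correction to apply.
type cpusetUpdate struct {
	ContainerID string
	CPUs        string
}

// reconcileCPUSets returns the corrections needed so that every container
// ends up with its desired cpuset. desired maps container ID to an exclusive
// cpuset; containers absent from the map fall back to sharedCPUs (an
// assumption of this sketch). Cpusets are compared as plain strings, so
// callers are assumed to pass them in one canonical form.
func reconcileCPUSets(containers []containerState, desired map[string]string, sharedCPUs string) []cpusetUpdate {
	var updates []cpusetUpdate
	for _, c := range containers {
		want, ok := desired[c.ID]
		if !ok {
			want = sharedCPUs
		}
		if c.CurrentCPUs != want {
			updates = append(updates, cpusetUpdate{ContainerID: c.ID, CPUs: want})
		}
	}
	// Sort for deterministic output.
	sort.Slice(updates, func(i, j int) bool {
		return updates[i].ContainerID < updates[j].ContainerID
	})
	return updates
}

func main() {
	containers := []containerState{
		{ID: "a", CurrentCPUs: "0-7"}, // created while the plugin was down
		{ID: "b", CurrentCPUs: "2-3"}, // already correct
		{ID: "c", CurrentCPUs: "0-7"}, // no claim, should use the shared pool
	}
	desired := map[string]string{"a": "4-5", "b": "2-3"}

	for _, u := range reconcileCPUSets(containers, desired, "0-1,6-7") {
		fmt.Printf("%s -> %s\n", u.ContainerID, u.CPUs)
	}
}
```

Note that a pass like this only repairs containers after the plugin is back; it does nothing about the window in which the plugin was down, which is the limitation called out above.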

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Apr 10, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pravk03

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved (indicates a PR has been approved by an approver from all required OWNERS files), size/L (denotes a PR that changes 100-499 lines, ignoring generated files), and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels Apr 10, 2026
@pravk03 pravk03 force-pushed the sync branch 3 times, most recently from 7e4ec59 to 73d8121 Compare April 10, 2026 17:23
@pravk03 pravk03 marked this pull request as ready for review April 10, 2026 17:40
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Apr 10, 2026
@k8s-ci-robot k8s-ci-robot requested a review from ffromani April 10, 2026 17:40
@pravk03
Contributor Author

pravk03 commented Apr 11, 2026

/retest

@pravk03
Contributor Author

pravk03 commented Apr 11, 2026

/retest

This ensures the cgroup is correctly updated if the plugin crashes or restarts
and is not available during container creation.
@pravk03
Contributor Author

pravk03 commented Apr 14, 2026

/retest

Comment thread pkg/driver/nri_hooks.go Outdated
Comment thread pkg/driver/nri_hooks.go Outdated
@ffromani
Contributor

ffromani commented Apr 15, 2026

arm64 failures are a known issue: #120

Contributor

@ffromani ffromani left a comment

Partial review; I still need to catch up with the e2e test. Everything else LGTM besides the cpumask/cpuset terminology. All else being equal, I'd prefer "cpuset" over "cpumask".

Comment thread pkg/driver/nri_hooks.go Outdated
Comment thread test/e2e/nri_reconciliation_test.go Outdated
Comment on lines +82 to +86
for _, arg := range orgDaemonSet.Spec.Template.Spec.Containers[0].Args {
	if val, ok := parseCPUDeviceModeArg(arg); ok {
		cpuDeviceMode = val
	}
}
Contributor

(Unrelated to this PR.) This is becoming a pattern, so we will need a helper soon enough.

Contributor Author

Good point. Added a helper.
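
For readers following along, a helper of the kind discussed in this thread might look roughly like the sketch below. The actual helper added in the PR is not shown in this conversation, so everything here is an assumption: the flag name `--cpu-device-mode`, the body of `parseCPUDeviceModeArg` (only its name and `(value, ok)` shape appear in the quoted snippet), the wrapper name `cpuDeviceModeFromArgs`, and the placeholder mode values. The wrapper keeps the behavior of the quoted loop, where the last matching argument wins.

```go
package main

import (
	"fmt"
	"strings"
)

// cpuDeviceModeFlag is an ASSUMED flag name, used only for illustration.
const cpuDeviceModeFlag = "--cpu-device-mode"

// parseCPUDeviceModeArg reports the value of a single "<flag>=<value>"
// argument. The name matches the quoted snippet; this body is a guess.
func parseCPUDeviceModeArg(arg string) (string, bool) {
	val, ok := strings.CutPrefix(arg, cpuDeviceModeFlag+"=")
	if !ok || val == "" {
		return "", false
	}
	return val, true
}

// cpuDeviceModeFromArgs is a hypothetical helper that scans a container's
// args and returns the CPU device mode. As in the quoted loop, a later
// matching argument overrides an earlier one.
func cpuDeviceModeFromArgs(args []string) (string, bool) {
	mode, found := "", false
	for _, arg := range args {
		if val, ok := parseCPUDeviceModeArg(arg); ok {
			mode, found = val, true
		}
	}
	return mode, found
}

func main() {
	// "example-a" is a placeholder value, not a real mode name.
	args := []string{"--v=4", "--cpu-device-mode=example-a"}
	if mode, ok := cpuDeviceModeFromArgs(args); ok {
		fmt.Println("cpu device mode:", mode)
	}
}
```

Returning a `found` flag alongside the value lets each test decide for itself what to do when the argument is absent, rather than baking a default into the helper.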

Contributor

@AutuSnow AutuSnow left a comment

Thank you for solving it. It looks like a great idea. I left some comments in the code for unit testing

Comment thread test/e2e/nri_reconciliation_test.go Outdated
Comment thread test/e2e/nri_reconciliation_test.go Outdated
Comment thread test/e2e/nri_reconciliation_test.go Outdated
// Restore original DaemonSet
ds, err := rootFxt.K8SClientset.AppsV1().DaemonSets(daemonSetNamespaceRule).Get(ctx, "dracpu", metav1.GetOptions{})
gomega.Expect(err).ToNot(gomega.HaveOccurred())
ds.Spec = orgDaemonSet.Spec
Contributor

nit: This inline restore duplicates the earlier DeferCleanup. Since DeferCleanup will run on both success and failure paths, the inline block is redundant. Suggest removing the inline restore and relying on the DeferCleanup for a single source of truth.

Contributor Author

I think we still need the restore here to reinstall the driver DaemonSet pod, since the cleanup runs only at the end.
