Reconcile container CPU masks during NRI Synchronize by pravk03 · Pull Request #115 · kubernetes-sigs/dra-driver-cpu

pravk03 · 2026-04-10T00:01:04Z

This addresses #109, although not perfectly.

This ensures the cgroup is correctly updated if the plugin crashes or restarts and is unavailable during container creation. However, while the NRI plugin is down, newly created containers will not have their CPU affinity restricted and will have access to all online CPUs on the node, potentially overlapping with exclusive CPUs assigned to other pods with claims.
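The reconciliation idea can be sketched roughly as follows. This is not the PR's code: the `container` type, the `reconcile` function, and the plain-string cpuset format are all hypothetical stand-ins for the real NRI and driver types, and the handling of a shared pool for containers without a claim is an assumption. The sketch only shows the shape of the logic: look at the containers that already exist when the plugin synchronizes, compare each one's current cpuset with the one the driver wants, and emit an update for every mismatch.

```go
package main

import (
	"fmt"
	"sort"
)

// container is a hypothetical, simplified view of a container that already
// exists when the plugin (re)connects to the runtime.
type container struct {
	ID   string
	CPUs string // cpuset currently applied to the cgroup, e.g. "0-7"
}

// reconcile returns the desired cpuset for every container whose current
// cpuset differs from what the driver expects.
//
// exclusive maps a container ID to the cpuset it holds through a claim.
// shared is the cpuset for containers without a claim (an assumption here).
// Cpusets are compared as plain strings, so callers must pass them in one
// canonical form; a real implementation would compare parsed sets.
func reconcile(existing []container, exclusive map[string]string, shared string) map[string]string {
	updates := make(map[string]string)
	for _, c := range existing {
		want, ok := exclusive[c.ID]
		if !ok {
			want = shared
		}
		if c.CPUs != want {
			updates[c.ID] = want
		}
	}
	return updates
}

func main() {
	existing := []container{
		{ID: "a", CPUs: "0-7"}, // created while the plugin was down: unrestricted
		{ID: "b", CPUs: "2-3"}, // already pinned correctly
		{ID: "c", CPUs: "0-7"}, // no claim, but overlaps exclusive CPUs
	}
	exclusive := map[string]string{"a": "4-5", "b": "2-3"}
	updates := reconcile(existing, exclusive, "0-1,6-7")

	// Sort IDs so the output order is stable.
	ids := make([]string, 0, len(updates))
	for id := range updates {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		fmt.Printf("update %s -> cpuset %s\n", id, updates[id])
	}
}
```

Note that this only repairs the state once the plugin is back; it does not close the window described above, during which unrestricted containers may run on exclusive CPUs.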

k8s-ci-robot · 2026-04-10T00:01:07Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2026-04-10T00:01:10Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pravk03

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [pravk03]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pravk03 · 2026-04-11T17:33:33Z

/retest

pravk03 · 2026-04-11T22:50:27Z

/retest

pravk03 · 2026-04-14T17:54:39Z

/retest

ffromani · 2026-04-15T11:48:44Z

arm64 failures are a known issue: #120

ffromani

Partial review; I still need to catch up with the e2e test. Everything else LGTM apart from the cpumask/cpuset terminology. All else being equal, I'd prefer "cpuset" over "cpumask".

ffromani · 2026-04-13T13:00:39Z

+		for _, arg := range orgDaemonSet.Spec.Template.Spec.Containers[0].Args {
+			if val, ok := parseCPUDeviceModeArg(arg); ok {
+				cpuDeviceMode = val
+			}
+		}


(unrelated to this PR) This is becoming a pattern, so we will need a helper soon enough.

pravk03 · 2026-04-17

Good point. Added a helper.
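For illustration, a helper of the kind discussed in this thread might look like the sketch below. Only the name `parseCPUDeviceModeArg` and its `(value, ok)` return shape come from the diff; its body, the flag name `--cpu-device-mode`, the mode values, and the `cpuDeviceModeFromArgs` wrapper are assumptions, not the code that was actually added.

```go
package main

import (
	"fmt"
	"strings"
)

// parseCPUDeviceModeArg is a stand-in for the function used in the diff.
// The flag name "--cpu-device-mode" is an assumption for this sketch.
func parseCPUDeviceModeArg(arg string) (string, bool) {
	val, ok := strings.CutPrefix(arg, "--cpu-device-mode=")
	if !ok || val == "" {
		return "", false
	}
	return val, true
}

// cpuDeviceModeFromArgs is a hypothetical helper that wraps the loop from
// the diff: it scans container args and returns the last mode found, or
// fallback when no matching arg is present.
func cpuDeviceModeFromArgs(args []string, fallback string) string {
	mode := fallback
	for _, arg := range args {
		if val, ok := parseCPUDeviceModeArg(arg); ok {
			mode = val
		}
	}
	return mode
}

func main() {
	// The mode values here are placeholders, not confirmed driver options.
	args := []string{"--v=4", "--cpu-device-mode=grouped"}
	fmt.Println(cpuDeviceModeFromArgs(args, "individual"))
}
```

A test would then call the helper with `orgDaemonSet.Spec.Template.Spec.Containers[0].Args` instead of repeating the loop inline.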

AutuSnow

Thank you for solving it. It looks like a great idea. I left some comments in the code for unit testing

AutuSnow · 2026-04-20T07:27:19Z

+		// Restore original DaemonSet
+		ds, err := rootFxt.K8SClientset.AppsV1().DaemonSets(daemonSetNamespaceRule).Get(ctx, "dracpu", metav1.GetOptions{})
+		gomega.Expect(err).ToNot(gomega.HaveOccurred())
+		ds.Spec = orgDaemonSet.Spec


nit: This inline restore duplicates the earlier DeferCleanup. Since DeferCleanup will run on both success and failure paths, the inline block is redundant. Suggest removing the inline restore and relying on the DeferCleanup for a single source of truth.

pravk03 · 2026-04-21

I think we still need the restore here to reinstall the driver DaemonSet pod, since the cleanup runs only at the end.

k8s-ci-robot added the do-not-merge/work-in-progress label Apr 10, 2026

k8s-ci-robot requested review from johnbelamaric and klueska April 10, 2026 00:01

k8s-ci-robot added the approved, size/L, and cncf-cla: yes labels Apr 10, 2026

pravk03 force-pushed the sync branch 3 times, most recently from 7e4ec59 to 73d8121 Compare April 10, 2026 17:23

pravk03 marked this pull request as ready for review April 10, 2026 17:40

k8s-ci-robot removed the do-not-merge/work-in-progress label Apr 10, 2026

k8s-ci-robot requested a review from ffromani April 10, 2026 17:40

pravk03 mentioned this pull request Apr 10, 2026

Fix race condition during runtime restarts #109

Open

pravk03 force-pushed the sync branch from 73d8121 to 73ff517 Compare April 11, 2026 22:26

pravk03 mentioned this pull request Apr 14, 2026

PLANNING: Release 0.2.0 #57

Open

Reconcile container CPU masks during NRI Synchronize

8892cc1

This ensures the cgroup is correctly updated if the plugin crashes or restarts and is unavailable during container creation.

pravk03 force-pushed the sync branch from 73ff517 to 937da67 Compare April 14, 2026 16:12

fmuyassarov reviewed Apr 15, 2026

Comment thread pkg/driver/nri_hooks.go Outdated

Comment thread pkg/driver/nri_hooks.go Outdated

pravk03 force-pushed the sync branch from 937da67 to 9a9196d Compare April 15, 2026 16:34

ffromani reviewed Apr 17, 2026

pravk03 force-pushed the sync branch from 9a9196d to 4c1a790 Compare April 17, 2026 22:03

AutuSnow reviewed Apr 20, 2026

Add E2E test for reconcile on NRI restart

8f1cadc

pravk03 force-pushed the sync branch from 4c1a790 to 8f1cadc Compare April 21, 2026 20:18
