Reconcile container CPU masks during NRI Synchronize #115
pravk03 wants to merge 2 commits into kubernetes-sigs:main from
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: pravk03. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
7e4ec59 to 73d8121
/retest
/retest
This ensures the cgroup is correctly updated if the plugin crashes or restarts and is not available during container creation.
/retest
arm64 failures are a known issue: #120
ffromani left a comment
Partial review, I still need to catch up with the e2e test. Everything else LGTM besides the cpumask/cpuset terminology; all else being equal, I'd prefer "cpuset" over "cpumask".
for _, arg := range orgDaemonSet.Spec.Template.Spec.Containers[0].Args {
	if val, ok := parseCPUDeviceModeArg(arg); ok {
		cpuDeviceMode = val
	}
}
(unrelated to this PR) this is becoming a pattern, so we will need a helper soon enough.
Good point. Added a helper.
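The helper mentioned above could plausibly take the shape below: a single pass over a container's args that extracts a flag value. This is a hedged sketch, not the PR's actual code; `findContainerArg` and the `--cpu-device-mode` flag spelling are assumptions for illustration (the PR's real parsing lives in `parseCPUDeviceModeArg`).

```go
package main

import (
	"fmt"
	"strings"
)

// findContainerArg scans a container's args for a "--<flag>=<value>" pair
// and returns the value. Hypothetical helper; one pass, last match wins,
// mirroring the loop quoted in the review comment above.
func findContainerArg(args []string, flag string) (string, bool) {
	val, found := "", false
	prefix := "--" + flag + "="
	for _, arg := range args {
		if strings.HasPrefix(arg, prefix) {
			val = strings.TrimPrefix(arg, prefix)
			found = true
		}
	}
	return val, found
}

func main() {
	args := []string{"--v=4", "--cpu-device-mode=exclusive"}
	if v, ok := findContainerArg(args, "cpu-device-mode"); ok {
		fmt.Println(v) // prints "exclusive"
	}
}
```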
AutuSnow left a comment
Thank you for solving it. It looks like a great idea. I left some comments in the code about unit testing.
// Restore original DaemonSet
ds, err := rootFxt.K8SClientset.AppsV1().DaemonSets(daemonSetNamespaceRule).Get(ctx, "dracpu", metav1.GetOptions{})
gomega.Expect(err).ToNot(gomega.HaveOccurred())
ds.Spec = orgDaemonSet.Spec
nit: This inline restore duplicates the earlier DeferCleanup. Since DeferCleanup will run on both success and failure paths, the inline block is redundant. Suggest removing the inline restore and relying on the DeferCleanup for a single source of truth.
I think we still need the restore here to reinstall the driver ds pod since we run cleanup only at the end.
This is a solution for #109, although not perfect.
This ensures the cgroup is correctly updated if the plugin crashes or restarts and is not available during container creation. However, while the NRI plugin is down, newly created containers will not have their CPU affinity restricted and will have access to all online CPUs on the node (potentially overlapping with exclusive CPUs assigned to other pods with claims).
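The reconciliation described above can be sketched as pure diff logic: on Synchronize, compare the cpuset each claimed container should have against what the runtime currently reports, and emit updates only where they differ. This is a sketch under assumptions; the real plugin builds NRI `ContainerUpdate` objects from the runtime's container list, and the `cpusetUpdate` type and `reconcileCPUSets` name here are hypothetical.

```go
package main

import "fmt"

// cpusetUpdate pairs a container ID with the cpuset it should be pinned to.
// Hypothetical type standing in for an NRI ContainerUpdate.
type cpusetUpdate struct {
	ContainerID string
	CPUSet      string
}

// reconcileCPUSets compares the desired cpuset for each claimed container
// against what the runtime currently reports and returns the updates needed
// to converge. Containers with no desired entry are left untouched.
func reconcileCPUSets(desired, current map[string]string) []cpusetUpdate {
	var updates []cpusetUpdate
	for id, want := range desired {
		if got, ok := current[id]; ok && got != want {
			updates = append(updates, cpusetUpdate{ContainerID: id, CPUSet: want})
		}
	}
	return updates
}

func main() {
	desired := map[string]string{"c1": "2-3", "c2": "4"}
	// c1 was created while the plugin was down, so it still has the full
	// online cpuset instead of its exclusive allocation.
	current := map[string]string{"c1": "0-7", "c2": "4"}
	fmt.Println(reconcileCPUSets(desired, current)) // prints "[{c1 2-3}]"
}
```

This also makes the residual gap visible: a container created while the plugin was down runs unconstrained until the next Synchronize fires, which is why the PR description calls the fix imperfect.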