
fix: Prevent MachineDeployment replica escalation on create failures #70

Open
Mmduh-483 wants to merge 1 commit into kubernetes-sigs:main from Mmduh-483:additional-vms

Conversation

@Mmduh-483
Contributor

Fixes #N/A

Description

Each failed Create retry was incrementing the MachineDeployment replica count without rolling it back, causing unbounded growth (1→2→3→4...).

  • Add rollback on all error paths after the replica increment
  • On NodeClaim annotation failure, remove NodePoolMemberLabel from the Machine first so future polls can still discover it, then roll back replicas
  • Use context.WithoutCancel to keep rollback operations independent of the reconciliation context lifetime
  • Use RetryOnConflict for all rollback updates to handle stale cache conflicts

How was this change tested?
Deployed a CAPI cluster with Kamaji as the control plane provider and KubeKey as the infrastructure provider; the KKCluster has no nodes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Mmduh-483
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 14, 2026
@Mmduh-483
Contributor Author

/cc @elmiko

@k8s-ci-robot k8s-ci-robot requested a review from elmiko March 14, 2026 08:12
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 16, 2026
@elmiko
Collaborator

elmiko commented Mar 16, 2026

thanks @Mmduh-483 , will take a look this week.

@elmiko
Collaborator

elmiko commented Mar 16, 2026

cc @maxcao13 looks like something weird happened with the ci, did we hit a pull limit or perhaps have an out of date version?

@maxcao13
Contributor

maxcao13 commented Mar 16, 2026

Weird, I found this issue: kubernetes-sigs/cluster-api-addon-provider-helm#318

I think the image was discontinued, but only recently? I will open a PR on the cluster-api-kubemark repo to replace it. I'm assuming the e2e tests run there will show the same behaviour.

EDIT: Actually, I think this was fixed in main, but the test clones the 0.10.0 tag, which doesn't have the changes: kubernetes-sigs/cluster-api-provider-kubemark#148

Should I backport to make a CAPK 0.10.1? Officially this repo is at Cluster API 1.9.3, but we clone the 0.10.0 CAPK and release-1.10 CAPI repos to start a cluster.

EDIT2: This PR shows that this is fixed on CAPK main: #71

@elmiko
Collaborator

elmiko commented Mar 18, 2026

thanks @maxcao13 , we should probably create a new release of CAPK. i was hoping to get it updated to CAPI 1.11 and kube 1.33.

i can create a 0.10.1 release, would that help?

@maxcao13
Contributor

Agree, a 0.10.1 release would fix this, 🙏 thanks!

I can follow that up with a simple PR in this repo to bump the version afterwards.

@elmiko
Collaborator

elmiko commented Apr 8, 2026

i've created a v0.10.1 release of capk, hopefully it works lol.

https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/releases/tag/v0.10.1

@maxcao13
Contributor

@Mmduh-483 can you rebase please? It should be fixed now.
