
fix: Prevent MachineDeployment replica escalation on create failures #70

Open
Mmduh-483 wants to merge 1 commit into kubernetes-sigs:main from Mmduh-483:additional-vms

Conversation

@Mmduh-483
Contributor

Fixes #N/A

Description

Each failed Create retry was incrementing the MachineDeployment replica count without rolling it back, causing unbounded growth (1→2→3→4...).

  • Add rollback on all error paths after the replica increment
  • On NodeClaim annotation failure, remove NodePoolMemberLabel from the Machine first so future polls can still discover it, then roll back replicas
  • Use context.WithoutCancel to keep rollback operations independent of the reconciliation context lifetime
  • Use RetryOnConflict for all rollback updates to handle stale cache conflicts

How was this change tested?
Deployed a CAPI cluster with Kamaji as the control plane provider and KubeKey as the infrastructure provider; the KKCluster has no nodes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Mmduh-483
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 14, 2026
@Mmduh-483
Contributor Author

/cc @elmiko

@k8s-ci-robot k8s-ci-robot requested a review from elmiko March 14, 2026 08:12
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 16, 2026
@elmiko
Collaborator

elmiko commented Mar 16, 2026

thanks @Mmduh-483 , will take a look this week.

@elmiko
Collaborator

elmiko commented Mar 16, 2026

cc @maxcao13 looks like something weird happened with the ci, did we hit a pull limit or perhaps have an out of date version?

@maxcao13
Contributor

maxcao13 commented Mar 16, 2026

Weird, I found this issue: kubernetes-sigs/cluster-api-addon-provider-helm#318

I think the image was discontinued, but only recently? I will open a PR on the cluster-api-kubemark repo to replace it. I'm assuming the e2e tests run there will show the same behaviour.

EDIT: Actually, I think this was fixed in main, but the test clones the 0.10.0 tag, which doesn't have the changes: kubernetes-sigs/cluster-api-provider-kubemark#148

Should I backport to make a CAPK 0.10.1? Officially this repo is at Cluster API 1.9.3, but we clone the 0.10.0 CAPK and release-1.10 CAPI repos to start a cluster.

EDIT2: This PR shows that this is fixed on CAPK main: #71

@elmiko
Collaborator

elmiko commented Mar 18, 2026

thanks @maxcao13 , we should probably create a new release of CAPK. i was hoping to get it updated to CAPI 1.11 and kube 1.33.

i can create a 0.10.1 release, would that help?

@maxcao13
Contributor

Agree, a 0.10.1 release would fix this, 🙏 thanks!

I can follow that up with a simple PR in this repo to bump the version afterwards.

@elmiko
Collaborator

elmiko commented Apr 8, 2026

i've created a v0.10.1 release of capk, hopefully it works lol.

https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/releases/tag/v0.10.1

@maxcao13
Contributor

@Mmduh-483 can you rebase please? It should be fixed now.
