fix: Prevent MachineDeployment replica escalation on create failures #70
Mmduh-483 wants to merge 1 commit into kubernetes-sigs:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: Mmduh-483. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
/cc @elmiko
Each failed Create retry was incrementing the MachineDeployment replica count without rolling it back, causing unbounded growth (1→2→3→4...).

- Add rollback on all error paths after the replica increment
- On NodeClaim annotation failure, remove NodePoolMemberLabel from the Machine first so future polls can still discover it, then roll back replicas
- Use context.WithoutCancel to keep rollback operations independent of the reconciliation context lifetime
- Use RetryOnConflict for all rollback updates to handle stale cache conflicts
Force-pushed from 0f34232 to 2dc0024
thanks @Mmduh-483, will take a look this week.
cc @maxcao13 looks like something weird happened with the CI, did we hit a pull limit or perhaps have an out-of-date version?
Weird, I found this issue: kubernetes-sigs/cluster-api-addon-provider-helm#318. I think the image was discontinued, but that only happened recently? I will open a PR on the cluster-api-kubemark repo to replace it. I'm assuming the e2e tests run there will show the same behaviour. EDIT: Actually I think this was fixed in main, but the test clones the 0.10.0 tag, which doesn't have the changes: kubernetes-sigs/cluster-api-provider-kubemark#148. Should I backport to make CAPK 0.10.1? Officially this repo is at Cluster API 1.9.3, but we are cloning the 0.10.0 CAPK and release-1.10 CAPI repos to start a cluster. EDIT2: This PR shows that this is fixed on CAPK main: #71
thanks @maxcao13, we should probably create a new release of CAPK. i was hoping to get it updated to CAPI 1.11 and kube 1.33. i can create a 0.10.1 release, would that help?
Agree, a 0.10.1 release would fix this, 🙏 thanks! I can follow that up with a simple PR in this repo to bump the version after.
i've created a v0.10.1 release of capk, hopefully it works lol. https://github.com/kubernetes-sigs/cluster-api-provider-kubemark/releases/tag/v0.10.1
@Mmduh-483 can you rebase please? It should be fixed now.
Fixes #N/A
Description
Each failed Create retry was incrementing the MachineDeployment replica count without rolling it back, causing unbounded growth (1→2→3→4...).
How was this change tested?
Deploying a CAPI cluster with Kamaji as the control plane and KubeKey as the infrastructure provider; the KKCluster has no nodes.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.