
Add maxNodes check for MachineDeployments #79

Open
parthyadav3105 wants to merge 1 commit into kubernetes-sigs:main from parthyadav3105:max-size-node-limit-annotation

Conversation

@parthyadav3105
Contributor

Enforce the cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size annotation to ensure that the MachineDeployment replica count never exceeds the max-size annotation.

This allows putting a cap on the maximum number of nodes in a MachineDeployment.

Closes #77
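
To make the annotation's semantics concrete, here is a minimal sketch of the check, assuming a hypothetical isBelowMaxSize helper over a plain annotation map. The real code in this PR operates on a *capiv1beta1.MachineDeployment, and how it handles a missing or unparseable annotation may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// maxSizeAnnotation is the Cluster API autoscaler annotation enforced by this PR.
const maxSizeAnnotation = "cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size"

// isBelowMaxSize reports whether replicas is below the max-size annotation.
// Hypothetical sketch: the PR's implementation takes a MachineDeployment,
// and its behavior for missing/invalid annotations may differ.
func isBelowMaxSize(annotations map[string]string, replicas int32) bool {
	raw, ok := annotations[maxSizeAnnotation]
	if !ok {
		// no annotation: no cap is enforced
		return true
	}
	max, err := strconv.ParseInt(raw, 10, 32)
	if err != nil {
		// unparseable annotation: fail open here (an assumption, not the PR's choice)
		return true
	}
	return int64(replicas) < max
}

func main() {
	ann := map[string]string{maxSizeAnnotation: "3"}
	fmt.Println(isBelowMaxSize(ann, 2)) // below the cap
	fmt.Println(isBelowMaxSize(ann, 3)) // at the cap: scale-up must be refused
}
```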

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: parthyadav3105
Once this PR has been reviewed and has the lgtm label, please assign neolit123 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 14, 2026
Collaborator

@elmiko elmiko left a comment


in general this is looking great, thank you @parthyadav3105. i have a question but it's not a blocker.

Comment thread on pkg/cloudprovider/cloudprovider.go (Outdated)
}
if !isBelowMaxSize(machineDeployment) {
return nil, nil, cloudprovider.NewInsufficientCapacityError(fmt.Errorf("MachineDeployment %q has reached its maximum size", machineDeployment.Name))
}
Collaborator


given that we might return multiple MachineDeployments and we pick the first one from the list, i wonder if we should have some sort of loop here to check if there are other MachineDeployments we could attempt? (similar to what the TODO on L328 is talking about)

i don't think we need this behavior for the first version of this, but i wonder if we should add it.

Contributor Author


Makes sense. I’ll look into where to add it. I think filterCompatibleInstanceTypes() might be the right place, while slices.SortFunc() can remain for additional customization on top of the core Karpenter cheapest sorted list.

Collaborator


if it seems too big, i'm ok with leaving the functionality as it is. but, i'm happy to hear your investigation as well.

Contributor Author

@parthyadav3105 parthyadav3105 Apr 18, 2026


I would be repeating what is already known, but to cover my understanding:

  • From what I understand (github.com/kubernetes-sigs/karpenter/pkg/controllers/nodeclaim/lifecycle/launch.go#L78-L88), when the cloud provider returns InsufficientCapacityError, core Karpenter treats the NodeClaim as failed and deletes it. In the next cycle, it fetches a fresh batch of pending pods, recalculates, and creates a new NodeClaim request. This also gives cloud providers a way to protect themselves when a quota limit or cloud capacity is hit.
  • Meanwhile, the cloud provider is expected to keep reporting the latest instance-type/offering metadata via GetInstanceTypes(), so that in the next cycle Karpenter filters out those instance types and only constructs NodeClaims with requirements that are available: true.

With current code:

// identify which fit requirements
compatibleInstanceTypes := filterCompatibleInstanceTypes(instanceTypes, nodeClaim)
if len(compatibleInstanceTypes) == 0 {
	return nil, nil, cloudprovider.NewInsufficientCapacityError(fmt.Errorf("cannot satisfy create, no compatible instance types found"))
}

// TODO (elmiko) if multiple instance types are found to be compatible we need to select one.
// for now, we sort by resource name and take the first in the list. In the future, this should
// be an option or something more useful like minimum size or cost.
slices.SortFunc(compatibleInstanceTypes, func(a, b *ClusterAPIInstanceType) int {
	return cmp.Compare(strings.ToLower(a.Name), strings.ToLower(b.Name))
})
selectedInstanceType := compatibleInstanceTypes[0]

The first layer is filterCompatibleInstanceTypes(), which eliminates every instance type that cannot be provisioned or used for these requirements; if no combination fits, we return InsufficientCapacityError.

This makes the first layer the most suitable place to add the check: here we can eliminate MachineDeployments that have maxed out and have no capacity. If all MachineDeployments are eliminated, we return InsufficientCapacityError.

Meanwhile, the SortFunc is the second layer, which says: from what is possible, choose the cheapest MachineDeployment (or any customised behaviour via annotations as well).
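
The two layers can be sketched with plain slices. The instanceType struct and its Available field below are simplified stand-ins for the provider's real types and requirement/offering checks; only the filter-then-sort shape is kept.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
	"strings"
)

// instanceType is a simplified stand-in for ClusterAPIInstanceType.
type instanceType struct {
	Name      string
	Available bool // stands in for requirement compatibility and offering availability
}

// pick sketches the two layers: first filter out instance types with no
// available capacity, then sort what remains and take the first. The real
// code filters on NodeClaim requirements and offerings, not a single bool.
func pick(types []instanceType) (instanceType, bool) {
	// layer 1: eliminate instance types that cannot be provisioned
	var compatible []instanceType
	for _, t := range types {
		if t.Available {
			compatible = append(compatible, t)
		}
	}
	if len(compatible) == 0 {
		return instanceType{}, false // caller returns InsufficientCapacityError
	}
	// layer 2: sort (by name here, as the current TODO does) and take the first
	slices.SortFunc(compatible, func(a, b instanceType) int {
		return cmp.Compare(strings.ToLower(a.Name), strings.ToLower(b.Name))
	})
	return compatible[0], true
}

func main() {
	got, ok := pick([]instanceType{
		{Name: "md-b", Available: true},
		{Name: "md-a", Available: false}, // maxed out: eliminated in layer 1
	})
	fmt.Println(got.Name, ok)
}
```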

Correction:

Makes sense. I’ll look into where to add it. I think filterCompatibleInstanceTypes() might be the right place, while slices.SortFunc() can remain for additional customization on top of the core Karpenter cheapest sorted list. (we get NodeClaim requirements in cloudprovider.Create())


Suggested Changes:

    ...

    func (c *CloudProvider) createMachine(ctx context.Context, nodeClaim *karpv1.NodeClaim) (*capiv1beta1.MachineDeployment, *capiv1beta1.Machine, error) {
        ...
        ...
   	    // identify which fit requirements
	    compatibleInstanceTypes := filterCompatibleInstanceTypes(instanceTypes, nodeClaim)
	    if len(compatibleInstanceTypes) == 0 {
	    	return nil, nil, cloudprovider.NewInsufficientCapacityError(fmt.Errorf("cannot satisfy create, no compatible instance types found"))
	    }
	    ...
	    // once scalable resource is identified, increase replicas
	    machineDeployment, err := c.machineDeploymentProvider.Get(ctx, selectedInstanceType.MachineDeploymentName, selectedInstanceType.MachineDeploymentNamespace)
	    if err != nil {
	    	return nil, nil, fmt.Errorf("cannot satisfy create, unable to find MachineDeployment %q for InstanceType %q: %w", selectedInstanceType.MachineDeploymentName, selectedInstanceType.Name, err)
	    }
-	    if !isBelowMaxSize(machineDeployment) {
-	    	return nil, nil, cloudprovider.NewInsufficientCapacityError(fmt.Errorf("MachineDeployment %q has reached its maximum size", machineDeployment.Name))
-	    }
	    originalReplicas := *machineDeployment.Spec.Replicas
	    ...
	    ...
	}



	func filterCompatibleInstanceTypes(instanceTypes []*ClusterAPIInstanceType, nodeClaim *karpv1.NodeClaim) []*ClusterAPIInstanceType {
		reqs := scheduling.NewNodeSelectorRequirementsWithMinValues(nodeClaim.Spec.Requirements...)
		filteredInstances := lo.Filter(instanceTypes, func(i *ClusterAPIInstanceType, _ int) bool {
			// TODO (elmiko) if/when we have offering availability, this is a good place to filter out unavailable instance types
			return reqs.Compatible(i.Requirements, scheduling.AllowUndefinedWellKnownLabels) == nil &&
				resources.Fits(nodeClaim.Spec.Resources.Requests, i.Allocatable()) &&
+				i.Offerings.Available().HasCompatible(reqs)
		})

		return filteredInstances
	}

Note: Offerings is []*Offering; as of today there is only one offering per MachineDeployment (set in machineDeploymentToInstanceType()), which might change in the future.
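
For illustration, here is a sketch of how that single offering's availability could be derived from the max-size cap so that the Available() filter above drops a maxed-out MachineDeployment. The types and the offeringsFor helper are hypothetical stand-ins; Karpenter's real Offering also carries requirements and price.

```go
package main

import "fmt"

// Offering is a simplified stand-in for Karpenter's cloudprovider Offering type.
type Offering struct {
	Available bool
}

type Offerings []Offering

// Available mirrors the shape of the upstream helper: keep only available offerings.
func (of Offerings) Available() Offerings {
	var out Offerings
	for _, o := range of {
		if o.Available {
			out = append(out, o)
		}
	}
	return out
}

// offeringsFor sketches how machineDeploymentToInstanceType could mark the
// single offering unavailable once replicas reach the max-size cap
// (hypothetical helper, not the provider's actual function).
func offeringsFor(replicas, maxSize int32) Offerings {
	return Offerings{{Available: replicas < maxSize}}
}

func main() {
	fmt.Println(len(offeringsFor(2, 3).Available())) // capacity remains
	fmt.Println(len(offeringsFor(3, 3).Available())) // maxed out, filtered away
}
```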

Contributor


That approach makes sense to me and aligns with upstream, which I like. +1 to filtering with i.Offerings.Available().HasCompatible(reqs).

Contributor Author


updated.

@elmiko
Collaborator

elmiko commented Apr 16, 2026

/assign @maxcao13

@maxcao13
Contributor

/ok-to-test

Trying to see if this works automatically.

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Apr 16, 2026
Enforce "cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size"
annotation to ensure that MachineDeployment replica count never
exceeds max-size annotation.

This allows putting a cap on MachineDeployment max nodes.

Signed-off-by: Parth Yadav <parth@coredge.io>
@parthyadav3105 parthyadav3105 force-pushed the max-size-node-limit-annotation branch from 33bc5a5 to 526ebcc Compare April 22, 2026 19:16
@maxcao13
Contributor

/lgtm

Thanks! Guess the automation for ok-to-test doesn't work; I can try to figure that out, though.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 23, 2026

Labels

cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
lgtm - "Looks good to me", indicates that a PR is ready to be merged.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support max-size annotation for per-MachineDeployment node limits

4 participants