KEP-10587: Configurable node label prefix filtering for TAS cache #10591
shshr wants to merge 2 commits into kubernetes-sigs:main
Add `Resources.ExcludeNodeLabelPrefixes` to the Kueue configuration API, allowing operators to specify label key prefixes that should be stripped from nodes when they are stored in the TAS node cache. This reduces memory usage in clusters where nodes carry many infrastructure labels (cloud provider, node-role, kubectl metadata, etc.) that are irrelevant to topology-aware scheduling decisions.

The filtering happens at cache insertion time (`nodesCache.sync`), so all labels needed for topology levels, flavor node selectors, workload node selectors, and node affinity matching remain available at scheduling time; only labels matching the configured exclude prefixes are dropped.

Defaults to a set of common infrastructure prefixes:
- `kubectl.kubernetes.io/`
- `node-role.kubernetes.io/`
- `cloud.google.com/`
- `eks.amazonaws.com/`
- `container.googleapis.com/`
- `topology.ebs.csi.aws.com/`
- `node.cluster.x-k8s.io/`
- `node.kubernetes.io/exclude-from-external-load-balancers`

Operators can override this list (including setting it to empty) via the Kueue Configuration `resources.excludeNodeLabelPrefixes` field.
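To illustrate the prefix filtering described above, here is a minimal sketch in Go. It is not the code from this PR: `filterLabels` and `hasAnyPrefix` are hypothetical helper names, and the sketch operates on a plain `map[string]string` rather than a real Node object.

```go
package main

import (
	"fmt"
	"strings"
)

// hasAnyPrefix reports whether key starts with at least one of the prefixes.
// Hypothetical helper, named here for illustration only.
func hasAnyPrefix(key string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}

// filterLabels returns a copy of labels without the keys that match any
// exclude prefix. The input map is left unmodified.
// Hypothetical helper; the PR's actual filtering code is not shown here.
func filterLabels(labels map[string]string, excludePrefixes []string) map[string]string {
	kept := make(map[string]string, len(labels))
	for key, value := range labels {
		if hasAnyPrefix(key, excludePrefixes) {
			continue
		}
		kept[key] = value
	}
	return kept
}

func main() {
	nodeLabels := map[string]string{
		"kubernetes.io/hostname":              "node-1",
		"node-role.kubernetes.io/worker":      "",
		"cloud.google.com/machine-family":     "n2",
		"cloud.google.com/gce-topology-block": "block-a",
	}
	excluded := []string{"node-role.kubernetes.io/", "cloud.google.com/"}

	filtered := filterLabels(nodeLabels, excluded)
	for key, value := range filtered {
		fmt.Printf("%s=%s\n", key, value)
	}
}
```

Note that in this sketch a prefix such as `cloud.google.com/` removes every key under it, including a key like `cloud.google.com/gce-topology-block`, so the choice of prefixes determines which labels survive in the cache.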
> In large clusters (hundreds to thousands of nodes), each node can carry 30–80+
> labels injected by cloud providers and infrastructure controllers. Examples
> include `kubectl.kubernetes.io/`, `cloud.google.com/`, `eks.amazonaws.com/`,
> and `node.cluster.x-k8s.io/` prefixes. The TAS node cache stores all labels on
Not sure about other cloud providers, but on GCP topological labels are prefixed with cloud.google.com (example: "cloud.google.com/gce-topology-block"). So with prefixes, it may be difficult to tell them apart from labels that are not important for TAS.
@mwielgus Good point. Maybe prefix-based exclusion is too coarse.
Two mitigations worth considering:
- Support exact label names in addition to prefixes, so operators can exclude `cloud.google.com/machine-family` without affecting `cloud.google.com/gce-topology-block`.
- Add an allowlist option (`includeNodeLabelPrefixes`), where operators declare which label prefixes TAS should keep rather than trying to enumerate what to exclude. Since operators already know which topology keys they configure in their Topology objects, an allowlist is more natural and less error-prone.
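A rough sketch of how the two mitigations could behave, in Go. Everything here is an assumption made for illustration: the `labelFilter` type, its field names, and the rule that a non-empty allowlist takes precedence over the denylists are not part of the KEP.

```go
package main

import (
	"fmt"
	"strings"
)

// labelFilter is a hypothetical configuration shape combining both ideas.
type labelFilter struct {
	IncludePrefixes []string // allowlist; assumed to take precedence when non-empty
	ExcludePrefixes []string // denylist of label key prefixes
	ExcludeKeys     []string // denylist of exact label keys
}

// matchesPrefix reports whether key starts with any of the prefixes.
func matchesPrefix(key string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}

// keep decides whether a label key stays in the cached node.
func (f labelFilter) keep(key string) bool {
	if len(f.IncludePrefixes) > 0 {
		// Allowlist mode: only explicitly included prefixes survive.
		return matchesPrefix(key, f.IncludePrefixes)
	}
	for _, k := range f.ExcludeKeys {
		if k == key {
			return false
		}
	}
	return !matchesPrefix(key, f.ExcludePrefixes)
}

func main() {
	exact := labelFilter{ExcludeKeys: []string{"cloud.google.com/machine-family"}}
	allow := labelFilter{IncludePrefixes: []string{"cloud.google.com/gce-topology-"}}

	keys := []string{
		"cloud.google.com/machine-family",
		"cloud.google.com/gce-topology-block",
	}
	for _, key := range keys {
		fmt.Printf("%s exact=%t allow=%t\n", key, exact.keep(key), allow.keep(key))
	}
}
```

In both configurations the topology label is kept and the machine-family label is dropped; the difference is whether the operator enumerates what to remove or what to retain.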
> pattern, a denylist of key prefixes with sensible defaults, is applied here to
> node labels.
>
> ### Goals
What will happen if a workload has anti-affinity on an excluded label? How can this be prevented, or at least explicitly reported?
Thinking of 2 approaches here:
- Admission-time validation: when a workload is submitted, check whether any affinity/anti-affinity `topologyKey` values match an excluded prefix. If so, reject with a clear error message like `topologyKey X matches excluded node label prefix Y`. This gives operators immediate, actionable feedback.
- An allowlist approach (`includeNodeLabelPrefixes`) should largely address this by design: operators explicitly declare which label prefixes to keep, and they'd include any prefixes used as topology keys. If an operator omits a prefix their workloads reference, admission-time validation can catch it: check whether any affinity/anti-affinity `topologyKey` values are absent from the included prefixes and reject with a clear error like `topologyKey X does not match any included node label prefix`.

This approach seems safer than the exclude-list approach where operators might not realize they've excluded something their workloads need.
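The admission-time check for the exclude-list case could look roughly like the Go sketch below. `validateTopologyKeys` is a hypothetical function name, and the sketch takes the `topologyKey` values as plain strings instead of walking a real pod affinity spec.

```go
package main

import (
	"fmt"
	"strings"
)

// validateTopologyKeys returns an error for the first topologyKey that
// matches an excluded node label prefix, or nil if none match.
// Hypothetical helper; it is not part of Kueue's webhook code.
func validateTopologyKeys(topologyKeys, excludedPrefixes []string) error {
	for _, key := range topologyKeys {
		for _, prefix := range excludedPrefixes {
			if strings.HasPrefix(key, prefix) {
				return fmt.Errorf("topologyKey %q matches excluded node label prefix %q", key, prefix)
			}
		}
	}
	return nil
}

func main() {
	excluded := []string{"cloud.google.com/", "node-role.kubernetes.io/"}

	// A workload whose anti-affinity term references an excluded label.
	err := validateTopologyKeys([]string{"cloud.google.com/gce-topology-block"}, excluded)
	fmt.Println("rejected:", err)

	// A workload that only references a label outside the excluded prefixes.
	err = validateTopologyKeys([]string{"kubernetes.io/hostname"}, excluded)
	fmt.Println("accepted:", err == nil)
}
```

The allowlist variant would invert the inner check: reject when a key matches none of the included prefixes.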
> Kubernetes API server or other Kueue caches).
> * Allowlist-based filtering (only keep certain prefixes). This could be a
>   future enhancement if needed.
> * Filtering annotations or taints from cached nodes.
Why not filter annotations as well?
Good question. They could be stripped unconditionally from cached nodes without any prefix configuration needed. I'll expand the scope of this KEP to include annotation stripping on cached nodes as part of the same feature.
|
And please sign the CLA. |
|
/ok-to-test |
|
|
For some reason, CLA signing is not reflecting here. I have signed it a couple times. |
|
Check your commits. You have two authors on your first one. |
What type of PR is this?
/kind feature
/kind kep
/area tas
What this PR does / why we need it:
This PR proposes KEP-10587 and includes its reference implementation for a new `Resources.ExcludeNodeLabelPrefixes` configuration field. The field controls which node label key prefixes are stripped from cached node objects in the Topology Aware Scheduling (TAS) node cache, reducing memory usage.

Nodes in large clusters carry 30–80+ labels injected by cloud providers and infrastructure controllers (e.g., `kubectl.kubernetes.io/`, `cloud.google.com/`, `eks.amazonaws.com/`). These labels are irrelevant for topology, flavor, or workload scheduling decisions but consume significant memory in the TAS node cache.

What's included:
- `keps/10587-exclude-node-label-prefixes/`: full design proposal with motivation, API design, defaults, graduation criteria (alpha → beta → GA), alternatives analysis
- `ExcludeNodeLabelPrefixes` added to `Resources` in `apis/config/v1beta2/configuration_types.go`, default prefix list in `defaults.go`, and filtering logic in `pkg/cache/scheduler/tas_flavor.go`

The companion non-API change (stripping non-scheduling PodTemplateSpec fields from Workload cache) is submitted separately in #10590.
Which issue(s) this PR fixes:
NONE
Special notes for your reviewer:
This is split from #10587 per reviewer feedback to decouple API changes (which need KEP design review) from non-API optimizations.
The KEP follows the pattern established by the existing `ExcludeResourcePrefixes` field and proposes alpha → beta → GA graduation with a feature gate (`ExcludeNodeLabelPrefixes`). At beta, startup validation will warn when excluded prefixes overlap with configured topology level labels.

Happy to iterate on the KEP design before merging the implementation.
Does this PR introduce a user-facing change?