
KEP-10587: Configurable node label prefix filtering for TAS cache#10591

Open
shshr wants to merge 2 commits into kubernetes-sigs:main from shshr:kep-exclude-node-label-prefixes

Conversation


@shshr shshr commented Apr 17, 2026

What type of PR is this?

/kind feature
/kind kep
/area tas

What this PR does / why we need it:

This PR proposes KEP-10587 and includes its reference implementation for a new Resources.ExcludeNodeLabelPrefixes configuration field. The field controls which node label key prefixes are stripped from cached node objects in the Topology Aware Scheduling (TAS) node cache, reducing memory usage.

Nodes in large clusters carry 30–80+ labels injected by cloud providers and infrastructure controllers (e.g., kubectl.kubernetes.io/, cloud.google.com/, eks.amazonaws.com/). These labels are irrelevant for topology, flavor, or workload scheduling decisions but consume significant memory in the TAS node cache.

What's included:

  1. KEP-10587 (keps/10587-exclude-node-label-prefixes/) — full design proposal with motivation, API design, defaults, graduation criteria (alpha → beta → GA), alternatives analysis
  2. Implementation — adds ExcludeNodeLabelPrefixes to Resources in apis/config/v1beta2/configuration_types.go, default prefix list in defaults.go, and filtering logic in pkg/cache/scheduler/tas_flavor.go

The companion non-API change (stripping non-scheduling PodTemplateSpec fields from Workload cache) is submitted separately in #10590.

Which issue(s) this PR fixes:

NONE

Special notes for your reviewer:

This is split from #10587 per reviewer feedback to decouple API changes (which need KEP design review) from non-API optimizations.

The KEP follows the pattern established by the existing ExcludeResourcePrefixes field and proposes alpha → beta → GA graduation with a feature gate (ExcludeNodeLabelPrefixes). At beta, startup validation will warn when excluded prefixes overlap with configured topology level labels.

Happy to iterate on the KEP design before merging the implementation.

Does this PR introduce a user-facing change?

KEP-10587: Proposed Resources.ExcludeNodeLabelPrefixes configuration field to strip infrastructure labels from the TAS node cache, reducing per-node memory usage. A default set of common infrastructure label prefixes (kubectl.kubernetes.io/, cloud.google.com/, eks.amazonaws.com/, etc.) is provided.

Add Resources.ExcludeNodeLabelPrefixes to the Kueue configuration API,
allowing operators to specify label key prefixes that should be stripped
from nodes when they are stored in the TAS node cache. This reduces
memory usage in clusters where nodes carry many infrastructure labels
(cloud provider, node-role, kubectl metadata, etc.) that are irrelevant
to topology-aware scheduling decisions.

The filtering happens at cache insertion time (nodesCache.sync), so all
labels needed for topology levels, flavor node selectors, workload node
selectors, and node affinity matching remain available at scheduling
time -- only labels matching the configured exclude prefixes are dropped.

Defaults to a set of common infrastructure prefixes:
  kubectl.kubernetes.io/, node-role.kubernetes.io/,
  cloud.google.com/, eks.amazonaws.com/, container.googleapis.com/,
  topology.ebs.csi.aws.com/, node.cluster.x-k8s.io/,
  node.kubernetes.io/exclude-from-external-load-balancers

Operators can override this list (including setting it to empty) via the
Kueue Configuration resources.excludeNodeLabelPrefixes field.
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. kind/kep Kueue Enhancement Proposal (Design) area/tas Topology-Aware Scheduling labels Apr 17, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented Apr 17, 2026

CLA Not Signed (missing CLA ID)

@netlify

netlify Bot commented Apr 17, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 6db401f
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69e26fd6d72b2300080c2c46

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Apr 17, 2026
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 17, 2026
@k8s-ci-robot
Contributor

Hi @shshr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shshr
Once this PR has been reviewed and has the lgtm label, please assign gabesaba for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 17, 2026
@shshr shshr force-pushed the kep-exclude-node-label-prefixes branch from adf93f4 to 106e59d on April 17, 2026 17:34
@shshr shshr force-pushed the kep-exclude-node-label-prefixes branch from 106e59d to 6db401f on April 17, 2026 17:37
In large clusters (hundreds to thousands of nodes), each node can carry 30–80+
labels injected by cloud providers and infrastructure controllers. Examples
include `kubectl.kubernetes.io/`, `cloud.google.com/`, `eks.amazonaws.com/`,
and `node.cluster.x-k8s.io/` prefixes. The TAS node cache stores all labels on
Contributor

Not sure about other cloud providers, but on GCP topological labels are prefixed with cloud.google.com (example: "cloud.google.com/gce-topology-block"). So using prefixes, it may be difficult to tell them apart from labels that are not important for TAS.

Author

@mwielgus Good point. Maybe prefix-based exclusion is too coarse.

Two mitigations worth considering:

  • Support exact label names in addition to prefixes - so operators can exclude cloud.google.com/machine-family without affecting cloud.google.com/gce-topology-block.

  • Add an allowlist option (includeNodeLabelPrefixes) - operators declare which label prefixes TAS should keep rather than trying to enumerate what to exclude. Since operators already know which topology keys they configure in their Topology objects, an allowlist is more natural and less error-prone.

pattern—a denylist of key prefixes with sensible defaults—is applied here to
node labels.

### Goals
Contributor

What will happen if a workload has anti-affinity on an excluded label? How can this be prevented, or at least explicitly reported?

Author

I'm thinking of two approaches here:

  • Admission-time validation: when a workload is submitted, check whether any affinity/anti-affinity topologyKey values match an excluded prefix. If so, reject with a clear error message like `topologyKey X matches excluded node label prefix Y`. This gives operators immediate, actionable feedback.

  • An allowlist approach (includeNodeLabelPrefixes) should largely address this by design - operators explicitly declare which label prefixes to keep, and they'd include any prefixes used as topology keys. If an operator omits a prefix their workloads reference, admission-time validation can catch it: check whether any affinity/anti-affinity topologyKey values are absent from the included prefixes and reject with a clear error like `topologyKey X does not match any included node label prefix`.

This approach seems safer than the exclude-list approach where operators might not realize they've excluded something their workloads need.

Kubernetes API server or other Kueue caches).
* Allowlist-based filtering (only keep certain prefixes). This could be a
future enhancement if needed.
* Filtering annotations or taints from cached nodes.
Contributor

Why not filter annotations as well?

Author

Good question. They could be stripped unconditionally from cached nodes without any prefix configuration needed. I'll expand the scope of this KEP to include annotation stripping on cached nodes as part of the same feature.

@mwielgus
Contributor

And please sign the CLA.

@tenzen-y
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 20, 2026
@k8s-ci-robot
Contributor

@shshr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kueue-test-unit-main 6db401f link true /test pull-kueue-test-unit-main
pull-kueue-verify-main 6db401f link true /test pull-kueue-verify-main
pull-kueue-test-integration-extended-main 6db401f link true /test pull-kueue-test-integration-extended-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@shshr
Author

shshr commented Apr 20, 2026

For some reason, the CLA signature is not reflecting here. I have signed it a couple of times.

@kannon92
Contributor

Check your commits. You have two authors on your first one.
