| 1 | +# KEP-10587: Configurable Node Label Prefix Filtering for TAS Cache |
| 2 | + |
| 3 | +<!-- toc --> |
| 4 | +- [Summary](#summary) |
| 5 | +- [Motivation](#motivation) |
| 6 | + - [Goals](#goals) |
| 7 | + - [Non-Goals](#non-goals) |
| 8 | +- [Proposal](#proposal) |
| 9 | + - [User Stories](#user-stories) |
| 10 | + - [Story 1 – Large-scale Azure/GCP clusters](#story-1--large-scale-azuregcp-clusters) |
| 11 | + - [Story 2 – Multi-cloud operator standardization](#story-2--multi-cloud-operator-standardization) |
| 12 | + - [Notes/Constraints/Caveats](#notesconstraintscaveats) |
| 13 | + - [Risks and Mitigations](#risks-and-mitigations) |
| 14 | +- [Design Details](#design-details) |
| 15 | + - [API Changes](#api-changes) |
| 16 | + - [Defaults](#defaults) |
| 17 | + - [Filtering Implementation](#filtering-implementation) |
| 18 | + - [Test Plan](#test-plan) |
| 19 | + - [Prerequisite testing updates](#prerequisite-testing-updates) |
| 20 | + - [Unit tests](#unit-tests) |
| 21 | + - [Integration tests](#integration-tests) |
| 22 | + - [e2e tests](#e2e-tests) |
| 23 | + - [Graduation Criteria](#graduation-criteria) |
| 24 | + - [Alpha](#alpha) |
| 25 | + - [Beta](#beta) |
| 26 | + - [GA](#ga) |
| 27 | +- [Implementation History](#implementation-history) |
| 28 | +- [Drawbacks](#drawbacks) |
| 29 | +- [Alternatives](#alternatives) |
| 30 | + - [Allowlist instead of denylist](#allowlist-instead-of-denylist) |
| 31 | + - [Regex-based filtering](#regex-based-filtering) |
| 32 | + - [Per-ClusterQueue configuration](#per-clusterqueue-configuration) |
| 33 | +<!-- /toc --> |
| 34 | + |
| 35 | +## Summary |
| 36 | + |
| 37 | +This KEP adds a `Resources.ExcludeNodeLabelPrefixes` configuration field to |
| 38 | +Kueue's `Configuration` API (`v1beta2`). When Topology Aware Scheduling (TAS) |
| 39 | +caches node objects, labels whose keys start with any of the configured prefixes |
| 40 | +are stripped before storage. This reduces per-node memory in the TAS cache without |
| 41 | +affecting scheduling correctness, provided the excluded labels are infrastructure |
| 42 | +metadata that Kueue does not use for topology levels, flavor node selectors, or |
| 43 | +workload scheduling decisions. |
| 44 | + |
| 45 | +## Motivation |
| 46 | + |
| 47 | +In large clusters (hundreds to thousands of nodes), each node can carry 30–80+ |
| 48 | +labels injected by cloud providers and infrastructure controllers. Examples |
| 49 | +include `kubectl.kubernetes.io/`, `cloud.google.com/`, `eks.amazonaws.com/`, |
| 50 | +and `node.cluster.x-k8s.io/` prefixes. The TAS node cache stores all labels on |
| 51 | +every node, even though Kueue only inspects a small subset for topology, |
| 52 | +flavor, and workload scheduling purposes. |
| 53 | + |
| 54 | +At scale, these unnecessary labels become a measurable memory overhead. In |
| 55 | +testing on clusters running ~3,600 Workloads and ~100 nodes, the Kueue |
| 56 | +controller-manager RSS was ~202 MB at baseline. Combined with a complementary |
| 57 | +Workload cache optimization (stripping non-scheduling PodTemplateSpec fields), |
| 58 | +excluding infrastructure node labels reduced RSS to ~179 MB, a reduction of roughly 11%. |
| 59 | + |
| 60 | +This is analogous to the existing `Resources.ExcludeResourcePrefixes` field, |
| 61 | +which strips irrelevant resource types from quota calculations. The same |
| 62 | +pattern—a denylist of key prefixes with sensible defaults—is applied here to |
| 63 | +node labels. |
| 64 | + |
| 65 | +### Goals |
| 66 | + |
| 67 | +* Provide a `Resources.ExcludeNodeLabelPrefixes` configuration field that |
| 68 | + controls which node label prefixes are stripped from the TAS node cache. |
| 69 | +* Ship a default set of common infrastructure label prefixes so that |
| 70 | + operators benefit out of the box without configuration. |
| 71 | +* Reduce per-node memory usage in the TAS cache proportionally to the |
| 72 | + number of excluded labels. |
| 73 | +* Maintain full backward compatibility: an unset (nil) value falls back |
| 74 | + to the default prefix list; an explicit empty list (`[]`) disables |
| 75 | + filtering entirely. |
| 76 | + |
| 77 | +### Non-Goals |
| 78 | + |
| 79 | +* Filtering labels from node objects outside the TAS cache (e.g., in the |
| 80 | + Kubernetes API server or other Kueue caches). |
| 81 | +* Allowlist-based filtering (only keep certain prefixes). This could be a |
| 82 | + future enhancement if needed. |
| 83 | +* Filtering annotations or taints from cached nodes. |
| 84 | +* Dynamically reloading the prefix list without restarting the controller. |
| 85 | + |
| 86 | +## Proposal |
| 87 | + |
| 88 | +Add a new field `ExcludeNodeLabelPrefixes` to the `Resources` struct in |
| 89 | +`apis/config/v1beta2/configuration_types.go`. When set (or defaulted), the |
| 90 | +TAS node cache strips matching labels at ingestion time—inside the |
| 91 | +`newNodeInfo()` constructor that converts a `*corev1.Node` to the internal |
| 92 | +`nodeInfo` representation. |
| 93 | + |
| 94 | +### User Stories |
| 95 | + |
| 96 | +#### Story 1 – Large-scale Azure/GCP clusters |
| 97 | + |
| 98 | +An operator runs Kueue with TAS enabled on a 500-node AKS cluster. Each node |
| 99 | +has ~60 labels, of which only 5–8 are used for topology levels and flavors. |
| 100 | +With the default `ExcludeNodeLabelPrefixes`, approximately 20 labels per node |
| 101 | +are stripped, saving ~40 KB of string data across the cluster in the TAS cache. |
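| 102 | +No configuration is required for prefixes in the default list, which covers common |
| 103 | +Kubernetes, GCP, and AWS infrastructure labels; Azure-specific prefixes are not in the default list and would have to be configured explicitly. |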
| 104 | + |
| 105 | +#### Story 2 – Multi-cloud operator standardization |
| 106 | + |
| 107 | +An operator manages Kueue across AWS EKS, GCP GKE, and on-prem clusters. Each |
| 108 | +environment injects different infrastructure labels. The operator customizes |
| 109 | +`ExcludeNodeLabelPrefixes` per environment to strip cloud-specific labels while |
| 110 | +retaining topology labels used for their scheduling policies. On the on-prem |
| 111 | +cluster where there are no cloud labels to strip, they set `[]` to disable |
| 112 | +filtering. |
| 113 | + |
| 114 | +### Notes/Constraints/Caveats |
| 115 | + |
| 116 | +* **Ordering**: Prefix matching uses `strings.HasPrefix` for each label key |
| 117 | + against each prefix in the list. List order does not affect the result; the |
| 118 | + scan stops at the first match. For typical configurations (<20 prefixes), a linear scan is cheap. |
| 119 | +* **Interaction with topology levels**: Operators must ensure they do not |
| 120 | + exclude label prefixes used in their `TopologySpec.levels[].nodeLabel` |
| 121 | + definitions or flavor `nodeLabels`. Kueue will not validate this |
| 122 | + cross-reference automatically in the alpha stage. |
| 123 | +* **Immutability during runtime**: The prefix list is read at controller |
| 124 | + startup. Changing it requires a restart. Existing cached nodes are not |
| 125 | + retroactively re-filtered. |
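
The topology-level caveat above can be checked mechanically. The following is a minimal sketch, not part of the proposed implementation: `conflictingLabels` is a hypothetical helper, and the label keys in `main` are illustrative values standing in for an operator's topology level `nodeLabel` entries and flavor `nodeLabels` keys.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// conflictingLabels is a hypothetical helper (not part of Kueue). It returns
// the required label keys that start with at least one excluded prefix,
// sorted for stable output.
func conflictingLabels(required, excludePrefixes []string) []string {
	var conflicts []string
	for _, key := range required {
		for _, p := range excludePrefixes {
			if strings.HasPrefix(key, p) {
				conflicts = append(conflicts, key)
				break // one matching prefix is enough
			}
		}
	}
	sort.Strings(conflicts)
	return conflicts
}

func main() {
	// Illustrative values only: label keys an operator might rely on for scheduling.
	required := []string{
		"cloud.google.com/gce-topology-block",
		"kubernetes.io/hostname",
	}
	excluded := []string{"cloud.google.com/", "eks.amazonaws.com/"}

	for _, key := range conflictingLabels(required, excluded) {
		fmt.Printf("warning: label %q is needed for scheduling but matches an excluded prefix\n", key)
	}
}
```

With these sample inputs the first key is reported, because it falls under the `cloud.google.com/` prefix. An operator in that situation would need to narrow or override the prefix list.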
| 126 | + |
| 127 | +### Risks and Mitigations |
| 128 | + |
| 129 | +| Risk | Mitigation | |
| 130 | +|------|------------| |
| 131 | +| Operator accidentally excludes a prefix used for topology levels, causing TAS scheduling failures | Document clearly that topology-relevant labels must not be excluded. In beta, add a startup validation warning that cross-references configured topology levels against excluded prefixes. | |
| 132 | +| Default prefix list is too aggressive for some environments | Defaults are limited to well-known infrastructure prefixes (`kubectl.kubernetes.io/`, cloud-provider-specific). Operators can override with `[]` to disable. | |
| 133 | +| Performance regression from prefix scanning on hot path | `newNodeInfo()` is called once per node watch event, not per scheduling cycle. Linear prefix scan on <20 prefixes is negligible. | |
| 134 | + |
| 135 | +## Design Details |
| 136 | + |
| 137 | +### API Changes |
| 138 | + |
| 139 | +Add the following field to the `Resources` struct in |
| 140 | +`apis/config/v1beta2/configuration_types.go`: |
| 141 | + |
| 142 | +```go |
| 143 | +// ExcludeNodeLabelPrefixes lists label key prefixes that should be |
| 144 | +// stripped from cached node objects in the Topology Aware Scheduling |
| 145 | +// (TAS) node cache to reduce memory usage. Any node label whose key |
| 146 | +// starts with one of these prefixes is dropped when the node is |
| 147 | +// stored in the TAS cache. Labels needed for topology levels, flavor |
| 148 | +// node selectors, workload node selectors, and node affinity should |
| 149 | +// NOT be listed here. |
| 150 | +// Defaults to a set of common infrastructure labels that are not |
| 151 | +// relevant for scheduling decisions (see defaults.go). |
| 152 | +// +optional |
| 153 | +ExcludeNodeLabelPrefixes []string `json:"excludeNodeLabelPrefixes,omitempty"` |
| 154 | +``` |
| 155 | + |
| 156 | +### Defaults |
| 157 | + |
| 158 | +In `apis/config/v1beta2/defaults.go`, define: |
| 159 | + |
| 160 | +```go |
| 161 | +var DefaultExcludeNodeLabelPrefixes = []string{ |
| 162 | + "kubectl.kubernetes.io/", |
| 163 | + "node.kubernetes.io/exclude-from-external-load-balancers", |
| 164 | + "node-role.kubernetes.io/", |
| 165 | + "cloud.google.com/", |
| 166 | + "eks.amazonaws.com/", |
| 167 | + "topology.ebs.csi.aws.com/", |
| 168 | + "node.cluster.x-k8s.io/", |
| 169 | + "container.googleapis.com/", |
| 170 | +} |
| 171 | +``` |
| 172 | + |
| 173 | +The defaulting logic in `SetDefaults_Configuration` applies this list when |
| 174 | +`cfg.Resources.ExcludeNodeLabelPrefixes` is `nil`. An explicit empty slice |
| 175 | +(`[]`) disables filtering. |
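
The nil-versus-empty distinction described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual defaulting code: `resources`, `setDefaults`, and the shortened `defaultPrefixes` list are hypothetical stand-ins for the `Resources` struct, `SetDefaults_Configuration`, and `DefaultExcludeNodeLabelPrefixes`.

```go
package main

import "fmt"

// defaultPrefixes is a shortened, hypothetical stand-in for
// DefaultExcludeNodeLabelPrefixes.
var defaultPrefixes = []string{"kubectl.kubernetes.io/", "cloud.google.com/"}

// resources mirrors only the field discussed here (hypothetical stand-in).
type resources struct {
	ExcludeNodeLabelPrefixes []string
}

// setDefaults applies the default list only when the field is nil (unset).
// A non-nil empty slice is treated as an explicit opt-out and left untouched.
func setDefaults(r *resources) {
	if r.ExcludeNodeLabelPrefixes == nil {
		// Copy so that later mutation of the config cannot alter the shared default.
		r.ExcludeNodeLabelPrefixes = append([]string(nil), defaultPrefixes...)
	}
}

func main() {
	unset := &resources{}
	optOut := &resources{ExcludeNodeLabelPrefixes: []string{}}
	custom := &resources{ExcludeNodeLabelPrefixes: []string{"example.com/"}}

	for _, r := range []*resources{unset, optOut, custom} {
		setDefaults(r)
	}
	fmt.Println(len(unset.ExcludeNodeLabelPrefixes))  // defaults applied
	fmt.Println(len(optOut.ExcludeNodeLabelPrefixes)) // stays empty: filtering disabled
	fmt.Println(custom.ExcludeNodeLabelPrefixes)      // custom list preserved
}
```

This behaviour depends on the configuration decoder keeping an explicit `[]` as a non-nil empty slice. Because the field is tagged `omitempty`, an empty slice is dropped when the configuration is marshalled again, so the opt-out may not survive a decode and re-encode round trip.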
| 176 | + |
| 177 | +### Filtering Implementation |
| 178 | + |
| 179 | +In `pkg/cache/scheduler/tas_flavor.go`, modify `newNodeInfo()` to accept |
| 180 | +the prefix list and filter labels before caching: |
| 181 | + |
| 182 | +```go |
| 183 | +func newNodeInfo(node *corev1.Node, excludePrefixes []string) *nodeInfo { |
| 184 | + labels := node.Labels |
| 185 | + if len(excludePrefixes) > 0 && len(labels) > 0 { |
| 186 | + labels = filterLabelsByPrefix(labels, excludePrefixes) |
| 187 | + } |
| 188 | + return &nodeInfo{ |
| 189 | + Name: node.Name, |
| 190 | + Labels: labels, |
| 191 | + Taints: node.Spec.Taints, |
| 192 | + Allocatable: node.Status.Allocatable, |
| 193 | + } |
| 194 | +} |
| 195 | + |
| 196 | +func filterLabelsByPrefix(src map[string]string, prefixes []string) map[string]string { |
| 197 | + result := make(map[string]string, len(src)) |
| 198 | + for k, v := range src { |
| 199 | + if !hasAnyPrefix(k, prefixes) { |
| 200 | + result[k] = v |
| 201 | + } |
| 202 | + } |
| 203 | + return result |
| 204 | +} |
| 205 | + |
| 206 | +func hasAnyPrefix(s string, prefixes []string) bool { |
| 207 | + for _, p := range prefixes { |
| 208 | + if strings.HasPrefix(s, p) { |
| 209 | + return true |
| 210 | + } |
| 211 | + } |
| 212 | + return false |
| 213 | +} |
| 214 | +``` |
| 215 | + |
| 216 | +The `excludePrefixes` value is threaded from the `Configuration` object through |
| 217 | +the TAS cache constructor at startup. |
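
To make the effect concrete, the sketch below applies the same prefix filter to a made-up label set. The two helper functions are repeated so the example runs standalone; the label keys and values are illustrative and not taken from a real cluster.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// hasAnyPrefix reports whether s starts with any of the given prefixes.
func hasAnyPrefix(s string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(s, p) {
			return true
		}
	}
	return false
}

// filterLabelsByPrefix returns a new map without the excluded keys.
// The source map is never modified.
func filterLabelsByPrefix(src map[string]string, prefixes []string) map[string]string {
	result := make(map[string]string, len(src))
	for k, v := range src {
		if !hasAnyPrefix(k, prefixes) {
			result[k] = v
		}
	}
	return result
}

func main() {
	// Illustrative node labels.
	labels := map[string]string{
		"kubernetes.io/hostname":         "node-a",
		"topology.kubernetes.io/zone":    "zone-1",
		"node-role.kubernetes.io/worker": "",
		"eks.amazonaws.com/nodegroup":    "ng-1",
	}
	prefixes := []string{"node-role.kubernetes.io/", "eks.amazonaws.com/"}

	kept := filterLabelsByPrefix(labels, prefixes)

	// Sort keys because Go map iteration order is not deterministic.
	keys := make([]string, 0, len(kept))
	for k := range kept {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys) // [kubernetes.io/hostname topology.kubernetes.io/zone]
}
```

Because the filter builds a new map, the labels on the original `*corev1.Node` object stay intact; only the copy held in the TAS cache is reduced.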
| 218 | + |
| 219 | +### Test Plan |
| 220 | + |
| 221 | +[x] I/we understand the owners of the involved components may require updates to |
| 222 | +existing tests to make this code solid enough prior to committing the changes |
| 223 | +necessary to implement this enhancement. |
| 224 | + |
| 225 | +#### Prerequisite testing updates |
| 226 | + |
| 227 | +Existing TAS cache unit tests cover `newNodeInfo()` and node label propagation. |
| 228 | +These must be verified to pass before and after the change. |
| 229 | + |
| 230 | +#### Unit tests |
| 231 | + |
| 232 | +* `pkg/cache/scheduler`: Test `filterLabelsByPrefix` with empty prefixes, |
| 233 | + matching prefixes, non-matching prefixes, empty labels map. |
| 234 | +* `pkg/cache/scheduler`: Test `newNodeInfo` with and without prefix filtering, |
| 235 | + verifying that topology-relevant labels are preserved. |
| 236 | +* `apis/config/v1beta2`: Test `SetDefaults_Configuration` applies defaults when |
| 237 | + `ExcludeNodeLabelPrefixes` is nil and preserves explicit empty slice. |
| 238 | + |
| 239 | +Core packages and current coverage: |
| 240 | +- `pkg/cache/scheduler`: 2026-04-17 - TBD |
| 241 | +- `apis/config/v1beta2`: 2026-04-17 - TBD |
| 242 | + |
| 243 | +#### Integration tests |
| 244 | + |
| 245 | +* TAS scheduling integration test with `ExcludeNodeLabelPrefixes` set to |
| 246 | + exclude labels that are *not* used for topology, verifying scheduling |
| 247 | + still works correctly. |
| 248 | +* TAS scheduling integration test with default prefixes on nodes carrying |
| 249 | + infrastructure labels, verifying those labels are absent from the internal |
| 250 | + cache but scheduling succeeds. |
| 251 | + |
| 252 | +#### e2e tests |
| 253 | + |
| 254 | +Not required at alpha. At beta, add an e2e test verifying that a TAS-enabled |
| 255 | +cluster with the default prefix list can schedule workloads to nodes carrying |
| 256 | +infrastructure labels. |
| 257 | + |
| 258 | +### Graduation Criteria |
| 259 | + |
| 260 | +#### Alpha |
| 261 | + |
| 262 | +* Feature gate `ExcludeNodeLabelPrefixes` defaults to disabled. |
| 263 | +* `Resources.ExcludeNodeLabelPrefixes` field added to `Configuration` API. |
| 264 | +* Default prefix list defined. |
| 265 | +* Filtering implemented in TAS node cache. |
| 266 | +* Unit tests covering filtering logic and defaulting. |
| 267 | + |
| 268 | +#### Beta |
| 269 | + |
| 270 | +* Feature gate defaults to enabled. |
| 271 | +* Startup validation warning when excluded prefixes overlap with configured |
| 272 | + topology level `nodeLabel` values. |
| 273 | +* Integration tests demonstrating correct TAS scheduling with filtering. |
| 274 | +* Documentation of the field in the Kueue configuration reference. |
| 275 | + |
| 276 | +#### GA |
| 277 | + |
| 278 | +* Feature gate locked to enabled and removed. |
| 279 | +* e2e test coverage. |
| 280 | +* At least two releases with no reported issues from beta users. |
| 281 | + |
| 282 | +## Implementation History |
| 283 | + |
| 284 | +* 2026-04-17: KEP drafted |
| 285 | +* 2026-04-17: PR [#10587](https://github.com/kubernetes-sigs/kueue/pull/10587) |
| 286 | + submitted with initial implementation (combined with Workload cache stripping) |
| 287 | + |
| 288 | +## Drawbacks |
| 289 | + |
| 290 | +* Adds a new configuration field, increasing the Configuration API surface. |
| 291 | + However, this follows the established pattern of `ExcludeResourcePrefixes` |
| 292 | + and is a natural extension of the same concept. |
| 293 | +* Operators could misconfigure the list and break TAS scheduling. The beta |
| 294 | + graduation criterion addresses this with cross-reference validation. |
| 295 | + |
| 296 | +## Alternatives |
| 297 | + |
| 298 | +### Allowlist instead of denylist |
| 299 | + |
| 300 | +Instead of excluding prefixes, an allowlist approach would only cache labels |
| 301 | +matching specified prefixes. This is more aggressive and safer against new |
| 302 | +unknown labels, but harder to configure correctly—operators would need to |
| 303 | +enumerate all topology, flavor, and workload-relevant label prefixes. The |
| 304 | +denylist approach is simpler for most users since the defaults cover common |
| 305 | +infrastructure prefixes. |
| 306 | + |
| 307 | +### Regex-based filtering |
| 308 | + |
| 309 | +Using regular expressions instead of prefix strings provides more flexibility |
| 310 | +but adds complexity and potential for misconfiguration. Prefix matching covers |
| 311 | +the vast majority of infrastructure labels (which share common prefixes by |
| 312 | +convention) and is simpler to reason about. |
| 313 | + |
| 314 | +### Per-ClusterQueue configuration |
| 315 | + |
| 316 | +Making the exclusion list per-ClusterQueue instead of global would allow |
| 317 | +different queues to have different caching behavior. However, the TAS node |
| 318 | +cache is shared, so per-queue filtering would complicate the cache |
| 319 | +architecture significantly for minimal practical benefit. |