Commit 106e59d

kep: add KEP-10587 for ExcludeNodeLabelPrefixes configuration
1 parent c098f8f commit 106e59d

2 files changed

Lines changed: 355 additions & 0 deletions

Lines changed: 319 additions & 0 deletions
@@ -0,0 +1,319 @@
1+
# KEP-10587: Configurable Node Label Prefix Filtering for TAS Cache
2+
3+
<!-- toc -->
4+
- [Summary](#summary)
5+
- [Motivation](#motivation)
6+
- [Goals](#goals)
7+
- [Non-Goals](#non-goals)
8+
- [Proposal](#proposal)
9+
- [User Stories](#user-stories)
10+
- [Story 1 – Large-scale Azure/GCP clusters](#story-1--large-scale-azuregcp-clusters)
11+
- [Story 2 – Multi-cloud operator standardization](#story-2--multi-cloud-operator-standardization)
12+
- [Notes/Constraints/Caveats](#notesconstraintscaveats)
13+
- [Risks and Mitigations](#risks-and-mitigations)
14+
- [Design Details](#design-details)
15+
- [API Changes](#api-changes)
16+
- [Defaults](#defaults)
17+
- [Filtering Implementation](#filtering-implementation)
18+
- [Test Plan](#test-plan)
19+
- [Prerequisite testing updates](#prerequisite-testing-updates)
20+
- [Unit tests](#unit-tests)
21+
- [Integration tests](#integration-tests)
22+
- [e2e tests](#e2e-tests)
23+
- [Graduation Criteria](#graduation-criteria)
24+
- [Alpha](#alpha)
25+
- [Beta](#beta)
26+
- [GA](#ga)
27+
- [Implementation History](#implementation-history)
28+
- [Drawbacks](#drawbacks)
29+
- [Alternatives](#alternatives)
30+
- [Allowlist instead of denylist](#allowlist-instead-of-denylist)
31+
- [Regex-based filtering](#regex-based-filtering)
32+
- [Per-ClusterQueue configuration](#per-clusterqueue-configuration)
33+
<!-- /toc -->
34+
35+
## Summary
36+
37+
This KEP adds a `Resources.ExcludeNodeLabelPrefixes` configuration field to
38+
Kueue's `Configuration` API (`v1beta2`). When Topology Aware Scheduling (TAS)
39+
caches node objects, labels whose keys match any of the configured prefixes are
40+
stripped before storage. This reduces per-node memory in the TAS cache without
41+
affecting scheduling correctness, because the excluded labels are infrastructure
42+
metadata that Kueue never uses for topology levels, flavor node selectors, or
43+
workload scheduling decisions.
44+
45+
## Motivation
46+
47+
In large clusters (hundreds to thousands of nodes), each node can carry 30–80+
48+
labels injected by cloud providers and infrastructure controllers. Examples
49+
include `kubectl.kubernetes.io/`, `cloud.google.com/`, `eks.amazonaws.com/`,
50+
and `node.cluster.x-k8s.io/` prefixes. The TAS node cache stores all labels on
51+
every node, even though Kueue only inspects a small subset for topology,
52+
flavor, and workload scheduling purposes.
53+
54+
At scale, these unnecessary labels become a measurable memory overhead. In
55+
testing on clusters running ~3,600 Workloads and ~100 nodes, the Kueue
56+
controller-manager RSS was ~202 MB at baseline. Combined with a complementary
57+
Workload cache optimization (stripping non-scheduling PodTemplateSpec fields),
58+
excluding infrastructure node labels reduced RSS to ~179 MB, a reduction of about 11%.
59+
60+
This is analogous to the existing `Resources.ExcludeResourcePrefixes` field,
61+
which strips irrelevant resource types from quota calculations. The same
62+
pattern—a denylist of key prefixes with sensible defaults—is applied here to
63+
node labels.
64+
65+
### Goals
66+
67+
* Provide a `Resources.ExcludeNodeLabelPrefixes` configuration field that
68+
controls which node label prefixes are stripped from the TAS node cache.
69+
* Ship a default set of common infrastructure label prefixes so that
70+
operators benefit out of the box without configuration.
71+
* Reduce per-node memory usage in the TAS cache proportionally to the
72+
number of excluded labels.
73+
* Maintain full backward compatibility: an unset (nil) value falls back
74+
to the default prefix list; an explicit empty list (`[]`) disables
75+
filtering entirely.
76+
77+
### Non-Goals
78+
79+
* Filtering labels from node objects outside the TAS cache (e.g., in the
80+
Kubernetes API server or other Kueue caches).
81+
* Allowlist-based filtering (only keep certain prefixes). This could be a
82+
future enhancement if needed.
83+
* Filtering annotations or taints from cached nodes.
84+
* Dynamically reloading the prefix list without restarting the controller.
85+
86+
## Proposal
87+
88+
Add a new field `ExcludeNodeLabelPrefixes` to the `Resources` struct in
89+
`apis/config/v1beta2/configuration_types.go`. When set (or defaulted), the
90+
TAS node cache strips matching labels at ingestion time—inside the
91+
`newNodeInfo()` constructor that converts a `*corev1.Node` to the internal
92+
`nodeInfo` representation.
93+
94+
### User Stories
95+
96+
#### Story 1 – Large-scale Azure/GCP clusters
97+
98+
An operator runs Kueue with TAS enabled on a 500-node AKS cluster. Each node
99+
has ~60 labels, of which only 5–8 are used for topology levels and flavors.
100+
With the default `ExcludeNodeLabelPrefixes`, approximately 20 labels per node
101+
are stripped (about 10,000 labels cluster-wide), saving on the order of hundreds of KB of string data in the TAS cache.
102+
No configuration is required—the defaults cover common Azure and GCP
103+
infrastructure labels.
104+
105+
#### Story 2 – Multi-cloud operator standardization
106+
107+
An operator manages Kueue across AWS EKS, GCP GKE, and on-prem clusters. Each
108+
environment injects different infrastructure labels. The operator customizes
109+
`ExcludeNodeLabelPrefixes` per environment to strip cloud-specific labels while
110+
retaining topology labels used for their scheduling policies. On the on-prem
111+
cluster where there are no cloud labels to strip, they set `[]` to disable
112+
filtering.
113+
114+
### Notes/Constraints/Caveats
115+
116+
* **Ordering**: Prefix matching uses `strings.HasPrefix` for each label key
117+
against each prefix in the list. Order within the list does not affect the result; the scan stops at the first match.
118+
For most practical configurations (<20 prefixes), linear scan is efficient.
119+
* **Interaction with topology levels**: Operators must ensure they do not
120+
exclude label prefixes used in their `TopologySpec.levels[].nodeLabel`
121+
definitions or flavor `nodeLabels`. Kueue will not validate this
122+
cross-reference automatically in the alpha stage.
123+
* **Immutability during runtime**: The prefix list is read at controller
124+
startup. Changing it requires a restart. Existing cached nodes are not
125+
retroactively re-filtered.
126+
127+
### Risks and Mitigations
128+
129+
| Risk | Mitigation |
130+
|------|------------|
131+
| Operator accidentally excludes a prefix used for topology levels, causing TAS scheduling failures | Document clearly that topology-relevant labels must not be excluded. In beta, add a startup validation warning that cross-references configured topology levels against excluded prefixes. |
132+
| Default prefix list is too aggressive for some environments | Defaults are limited to well-known infrastructure prefixes (`kubectl.kubernetes.io/`, cloud-provider-specific). Operators can override with `[]` to disable. |
133+
| Performance regression from prefix scanning on hot path | `newNodeInfo()` is called once per node watch event, not per scheduling cycle. Linear prefix scan on <20 prefixes is negligible. |
134+
135+
## Design Details
136+
137+
### API Changes
138+
139+
Add the following field to the `Resources` struct in
140+
`apis/config/v1beta2/configuration_types.go`:
141+
142+
```go
// ExcludeNodeLabelPrefixes lists label key prefixes that should be
// stripped from cached node objects in the Topology Aware Scheduling
// (TAS) node cache to reduce memory usage. Any node label whose key
// starts with one of these prefixes is dropped when the node is
// stored in the TAS cache. Labels needed for topology levels, flavor
// node selectors, workload node selectors, and node affinity should
// NOT be listed here.
// Defaults to a set of common infrastructure labels that are not
// relevant for scheduling decisions (see defaults.go).
// +optional
ExcludeNodeLabelPrefixes []string `json:"excludeNodeLabelPrefixes,omitempty"`
```
155+
156+
### Defaults
157+
158+
In `apis/config/v1beta2/defaults.go`, define:
159+
160+
```go
var DefaultExcludeNodeLabelPrefixes = []string{
	"kubectl.kubernetes.io/",
	"node.kubernetes.io/exclude-from-external-load-balancers",
	"node-role.kubernetes.io/",
	"cloud.google.com/",
	"eks.amazonaws.com/",
	"topology.ebs.csi.aws.com/",
	"node.cluster.x-k8s.io/",
	"container.googleapis.com/",
}
```
172+
173+
The defaulting logic in `SetDefaults_Configuration` applies this list when
174+
`cfg.Resources.ExcludeNodeLabelPrefixes` is `nil`. An explicit empty slice
175+
(`[]`) disables filtering.
176+
177+
### Filtering Implementation
178+
179+
In `pkg/cache/scheduler/tas_flavor.go`, modify `newNodeInfo()` to accept
180+
the prefix list and filter labels before caching:
181+
182+
```go
// Requires "strings" and corev1 "k8s.io/api/core/v1".

// newNodeInfo converts a Node into the cached representation, dropping
// labels whose keys match any of excludePrefixes.
func newNodeInfo(node *corev1.Node, excludePrefixes []string) *nodeInfo {
	labels := node.Labels
	// Skip the copy when there is nothing to filter.
	if len(excludePrefixes) > 0 && len(labels) > 0 {
		labels = filterLabelsByPrefix(labels, excludePrefixes)
	}
	return &nodeInfo{
		Name:        node.Name,
		Labels:      labels,
		Taints:      node.Spec.Taints,
		Allocatable: node.Status.Allocatable,
	}
}

// filterLabelsByPrefix returns a new map containing only the labels whose
// keys match none of the prefixes. The source map is not modified.
func filterLabelsByPrefix(src map[string]string, prefixes []string) map[string]string {
	result := make(map[string]string, len(src))
	for k, v := range src {
		if !hasAnyPrefix(k, prefixes) {
			result[k] = v
		}
	}
	return result
}

// hasAnyPrefix reports whether s starts with at least one of the prefixes.
func hasAnyPrefix(s string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(s, p) {
			return true
		}
	}
	return false
}
```
215+
216+
The `excludePrefixes` value is threaded from the `Configuration` object through
217+
the TAS cache constructor at startup.
218+
219+
### Test Plan
220+
221+
[x] I/we understand the owners of the involved components may require updates to
222+
existing tests to make this code solid enough prior to committing the changes
223+
necessary to implement this enhancement.
224+
225+
#### Prerequisite testing updates
226+
227+
Existing TAS cache unit tests cover `newNodeInfo()` and node label propagation.
228+
These must be verified to pass before and after the change.
229+
230+
#### Unit tests
231+
232+
* `pkg/cache/scheduler`: Test `filterLabelsByPrefix` with empty prefixes,
233+
matching prefixes, non-matching prefixes, empty labels map.
234+
* `pkg/cache/scheduler`: Test `newNodeInfo` with and without prefix filtering,
235+
verifying that topology-relevant labels are preserved.
236+
* `apis/config/v1beta2`: Test `SetDefaults_Configuration` applies defaults when
237+
`ExcludeNodeLabelPrefixes` is nil and preserves explicit empty slice.
238+
239+
Core packages and current coverage:
240+
- `pkg/cache/scheduler`: 2026-04-17 - TBD
241+
- `apis/config/v1beta2`: 2026-04-17 - TBD
242+
243+
#### Integration tests
244+
245+
* TAS scheduling integration test with `ExcludeNodeLabelPrefixes` set to
246+
exclude labels that are *not* used for topology, verifying scheduling
247+
still works correctly.
248+
* TAS scheduling integration test with default prefixes on nodes carrying
249+
infrastructure labels, verifying those labels are absent from the internal
250+
cache but scheduling succeeds.
251+
252+
#### e2e tests
253+
254+
Not required at alpha. At beta, add an e2e test verifying that a TAS-enabled
255+
cluster with the default prefix list can schedule workloads to nodes carrying
256+
infrastructure labels.
257+
258+
### Graduation Criteria
259+
260+
#### Alpha
261+
262+
* Feature gate `ExcludeNodeLabelPrefixes` defaults to disabled.
263+
* `Resources.ExcludeNodeLabelPrefixes` field added to `Configuration` API.
264+
* Default prefix list defined.
265+
* Filtering implemented in TAS node cache.
266+
* Unit tests covering filtering logic and defaulting.
267+
268+
#### Beta
269+
270+
* Feature gate defaults to enabled.
271+
* Startup validation warning when excluded prefixes overlap with configured
272+
topology level `nodeLabel` values.
273+
* Integration tests demonstrating correct TAS scheduling with filtering.
274+
* Documentation of the field in the Kueue configuration reference.
275+
276+
#### GA
277+
278+
* Feature gate locked to enabled and removed.
279+
* e2e test coverage.
280+
* At least two releases with no reported issues from beta users.
281+
282+
## Implementation History
283+
284+
* 2026-04-17: KEP drafted
285+
* 2026-04-17: PR [#10587](https://github.com/kubernetes-sigs/kueue/pull/10587)
286+
submitted with initial implementation (combined with Workload cache stripping)
287+
288+
## Drawbacks
289+
290+
* Adds a new configuration field, increasing the Configuration API surface.
291+
However, this follows the established pattern of `ExcludeResourcePrefixes`
292+
and is a natural extension of the same concept.
293+
* Operators could misconfigure the list and break TAS scheduling. The beta
294+
graduation criterion addresses this with cross-reference validation.
295+
296+
## Alternatives
297+
298+
### Allowlist instead of denylist
299+
300+
Instead of excluding prefixes, an allowlist approach would only cache labels
301+
matching specified prefixes. This is more aggressive and safer against new
302+
unknown labels, but harder to configure correctly—operators would need to
303+
enumerate all topology, flavor, and workload-relevant label prefixes. The
304+
denylist approach is simpler for most users since the defaults cover common
305+
infrastructure prefixes.
306+
307+
### Regex-based filtering
308+
309+
Using regular expressions instead of prefix strings provides more flexibility
310+
but adds complexity and potential for misconfiguration. Prefix matching covers
311+
the vast majority of infrastructure labels (which share common prefixes by
312+
convention) and is simpler to reason about.
313+
314+
### Per-ClusterQueue configuration
315+
316+
Making the exclusion list per-ClusterQueue instead of global would allow
317+
different queues to have different caching behavior. However, the TAS node
318+
cache is shared, so per-queue filtering would complicate the cache
319+
architecture significantly for minimal practical benefit.
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
title: Configurable Node Label Prefix Filtering for TAS Cache
2+
kep-number: 10587
3+
authors:
4+
- "@shshr"
5+
- "@rduser"
6+
status: provisional
7+
creation-date: 2026-04-17
8+
reviewers:
9+
- TBD
10+
approvers:
11+
- TBD
12+
13+
see-also:
14+
- "/keps/2724-topology-aware-scheduling"
15+
- "/keps/2937-resource-transformer"
16+
17+
# The target maturity stage in the current dev cycle for this KEP.
18+
stage: alpha
19+
20+
# The most recent milestone for which work toward delivery of this KEP has been
21+
# done.
22+
latest-milestone: "v0.18"
23+
24+
# The milestone at which this feature was, or is targeted to be, at each stage.
25+
milestone:
26+
alpha: "v0.18"
27+
beta: "v0.19"
28+
stable: "v0.20"
29+
30+
# The following PRR answers are required at alpha release
31+
feature-gates:
32+
- name: ExcludeNodeLabelPrefixes
33+
disable-supported: true
34+
35+
metrics:
36+
- kueue_tas_cached_node_labels_total
