Metrics on topological node assignment #10505

@amy

What would you like to be added:
I would like Kueue to emit a metric that reports which nodes workloads are assigned to, so that I can answer questions like "how many 64-GPU-sized slots are there?" from Kueue's perspective, per topology domain.

It could look something like:
kueue_tas_domain_usage{flavor="flavor-123", domain_id="host-123", resource="gpu"} 8
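
To make the proposed shape concrete, here is a minimal, standard-library-only sketch of how per-workload assignments might be summed into one value per (flavor, domain, resource) and printed in the style of the sample above. The `assignment` and `domainKey` types and the `aggregate` and `render` helpers are hypothetical and not part of Kueue; a real implementation would presumably register the metric through Kueue's existing metrics code rather than format strings by hand.

```go
package main

import (
	"fmt"
	"sort"
)

// domainKey identifies one time series of the proposed metric (hypothetical type).
type domainKey struct {
	Flavor, DomainID, Resource string
}

// assignment is a hypothetical record of how much of a resource one workload
// has been assigned in one topology domain.
type assignment struct {
	Key   domainKey
	Count int64
}

// aggregate sums the assigned quantities per (flavor, domain, resource).
func aggregate(as []assignment) map[domainKey]int64 {
	usage := make(map[domainKey]int64)
	for _, a := range as {
		usage[a.Key] += a.Count
	}
	return usage
}

// render formats each series like the sample line in this issue,
// sorted so the output is stable across runs.
func render(usage map[domainKey]int64) []string {
	lines := make([]string, 0, len(usage))
	for k, v := range usage {
		lines = append(lines, fmt.Sprintf(
			"kueue_tas_domain_usage{flavor=%q, domain_id=%q, resource=%q} %d",
			k.Flavor, k.DomainID, k.Resource, v))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	// Two workloads share host-123; a third sits on host-456.
	usage := aggregate([]assignment{
		{domainKey{"flavor-123", "host-123", "gpu"}, 4},
		{domainKey{"flavor-123", "host-123", "gpu"}, 4},
		{domainKey{"flavor-123", "host-456", "gpu"}, 2},
	})
	for _, line := range render(usage) {
		fmt.Println(line)
	}
}
```

With this input, the two workloads on `host-123` collapse into a single series with value 8, matching the sample line.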

Why is this needed:
There are two tiers to understanding topology fragmentation: (1) the k8s scheduler view and (2) the Kueue controller view.

From the k8s scheduler view, you can roughly derive what Kueue may be seeing by mapping created pods to nodes and constructing a query along the lines of "how many 4-GPU-sized slots are there?"
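
As a rough illustration of that kind of query, the sketch below counts slots of a given size per domain. It assumes per-domain capacity is known from some other source (the proposed metric only carries usage); the `domain` type and `countSlots` function are hypothetical names used for illustration only.

```go
package main

import "fmt"

// domain holds capacity and current usage for one topology domain,
// for example a single host (hypothetical type).
type domain struct {
	ID       string
	Capacity int64
	Used     int64
}

// countSlots returns how many slots of slotSize fit into the free capacity.
// Each domain is counted separately, so free units scattered across
// different domains are never combined into one slot.
func countSlots(domains []domain, slotSize int64) int64 {
	if slotSize <= 0 {
		return 0
	}
	var slots int64
	for _, d := range domains {
		if free := d.Capacity - d.Used; free > 0 {
			slots += free / slotSize
		}
	}
	return slots
}

func main() {
	hosts := []domain{
		{"host-1", 8, 0}, // 8 free: two 4-GPU slots
		{"host-2", 8, 5}, // 3 free: no 4-GPU slot
		{"host-3", 8, 5}, // 3 free: no 4-GPU slot
	}
	fmt.Println(countSlots(hosts, 4))
}
```

Here 14 GPUs are free in total, which naively suggests three 4-GPU slots, but only two actually exist because the remaining free GPUs are split across two hosts. That gap is the fragmentation this issue is trying to make visible.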

However, there is potentially a difference between what Kueue sees and what the k8s scheduler sees.

  • Kueue could move a workload to the QuotaReserved state without admitting it.
  • People may also be confused by the k8s view, which shows slots available while Kueue is still not admitting workloads because the quota has already been reserved (the pods just aren't created yet).
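
The gap between the two views can be sketched as a per-domain difference. This assumes a Kueue-side usage figure per domain (such as the metric proposed here) and a pod-derived figure per domain are both available; `pendingReservations` and both input maps are hypothetical.

```go
package main

import "fmt"

// pendingReservations returns, per domain, the units that Kueue counts as
// used but that are not yet backed by created pods. Domains where the two
// views agree are left out of the result.
func pendingReservations(kueueUsed, podUsed map[string]int64) map[string]int64 {
	pending := make(map[string]int64)
	for id, reserved := range kueueUsed {
		// A domain missing from podUsed reads as zero pod usage.
		if diff := reserved - podUsed[id]; diff > 0 {
			pending[id] = diff
		}
	}
	return pending
}

func main() {
	// Kueue view: usage per domain, including reservations without pods.
	kueueUsed := map[string]int64{"host-123": 8, "host-456": 4}
	// Scheduler view: usage derived from pods that already exist.
	podUsed := map[string]int64{"host-123": 8}

	fmt.Println(pendingReservations(kueueUsed, podUsed))
}
```

In this example `host-456` looks empty from the pod view, yet Kueue already counts 4 GPUs there as taken, which is exactly the situation described in the second bullet.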

Determining these slots could be complicated when workloads within a flavor may satisfy different hierarchy levels, but I welcome discussion and brainstorming on how to get this overall result.

Completion requirements:

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
