What would you like to be added:
I would like to emit a workload metric that tells me which nodes workloads are assigned to so that I can query something like "how many 64 GPU sized slots are there?" from Kueue's perspective per topology domain.
It could look something like:

```
kueue_tas_domain_usage{flavor="flavor-123", domain_id="host-123", resource="gpu"} 8
```
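As a sketch of how such a metric might be consumed: assuming a companion capacity metric (here called `kueue_tas_domain_capacity`, which is hypothetical and not part of this proposal's example), a query for "how many 8 GPU sized slots are there per flavor?" could look roughly like:

```promql
# Hypothetical: assumes a kueue_tas_domain_capacity metric with the same
# labels as kueue_tas_domain_usage. Counts whole 8-GPU slots per flavor
# by summing floor(free / 8) across topology domains.
sum by (flavor) (
  floor(
    (
      kueue_tas_domain_capacity{resource="gpu"}
      - kueue_tas_domain_usage{resource="gpu"}
    ) / 8
  )
)
```

This is only meant to illustrate the query shape; the exact metric names and labels are open for discussion.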
Why is this needed:
To understand topology fragmentation, there are two tiers: 1) the k8s scheduler view and 2) the Kueue controller view.
From the k8s scheduler view, you can roughly derive what Kueue may be seeing by mapping created pods to nodes and constructing a query like: "how many 4 GPU sized slots are there?"
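For concreteness, a sketch of that k8s-view derivation using kube-state-metrics (allocatable GPUs per node minus GPU requests of pods on that node, then counting whole 4-GPU slots):

```promql
# Sketch only. Assumes kube-state-metrics is scraped and GPUs are exposed
# as the "nvidia_com_gpu" resource. Nodes with no GPU pods are dropped by
# the vector match on the subtraction and would need an "or ... * 0"
# fallback in a production query.
sum(
  floor(
    (
      kube_node_status_allocatable{resource="nvidia_com_gpu"}
      - on (node)
        sum by (node) (
          kube_pod_container_resource_requests{resource="nvidia_com_gpu"}
        )
    ) / 4
  )
)
```

This approximates the scheduler's view, but as noted below it can diverge from what Kueue itself sees.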
However, there is potentially a difference between what Kueue sees and what the k8s scheduler sees:
- Kueue could reach the QuotaReserved state without admitting the workload.
- The k8s view can also be confusing: slots appear available, yet Kueue is still not admitting workloads because the quota has already been reserved (the pods just haven't been created yet).
Determining these slots could be complicated when workloads within a flavor satisfy different hierarchy levels, but I welcome discussion/brainstorming on how to achieve this overall result.
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.