
Kueue sometimes does not inject scheduling gates for elastic jobs #10167

@ns-sundar

Description


What happened:
Environment:
Kueue 0.15.4 is deployed with the ElasticJobsViaWorkloadSlices feature gate enabled. Ray Services are submitted to one or more cluster queues, all in the same namespace; every one of them carries the kueue.x-k8s.io/elastic-job annotation and the Redis-cleanup finalizer. The RayCluster (RC) webhook is enabled.

Problem:
Some RayClusters (RCs), but not all, are missing the pod scheduling gates that Kueue is supposed to inject into their head/worker pod templates. They are also missing the tolerations defined in the ResourceFlavor that Kueue picked for the RC. At the time of writing, 74 RCs in our environment are in this state.

This is a serious problem because of its downstream impact:

  • When the Ray Service is terminated and the Redis-cleanup job has run, KubeRay tries to remove the finalizer from the RC.
  • This fails because Kueue checks whether the scheduling gates are present before allowing the mutation. The check was introduced in the commit "Support in-tree RayAutoscaler for Elastic RayCluster objects (#6662)" and is still present today.
  • As a result, the RayCluster remains in a zombie state: the head/worker pods and redis-cleanup pods are gone, but the RayCluster object itself remains.
  • This means the Kueue Workload also remains, so its resource quota, including GPU quota, stays held by Kueue.
  • In addition, Kueue may select such a Ray Service for preemption when another job is submitted, but the preemption fails because the RayCluster and the Kueue Workload persist forever. The preempting job's Workload reports "Pending the preemption of 1 workload(s)" and remains stuck in that state.

We have observed all of these problems in production.
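
To make the failure mode concrete, here is a minimal, hypothetical Go sketch of the kind of check described above: an update (such as finalizer removal) is rejected whenever the expected scheduling gate is absent from a pod template. The gate name and function names are illustrative, not Kueue's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative gate name; the real name used by Kueue may differ.
const schedulingGate = "kueue.x-k8s.io/admission"

type PodTemplate struct {
	SchedulingGates []string
}

// validateUpdate mimics the check introduced in #6662: the mutation is
// allowed only if the scheduling gate is present on every pod template.
func validateUpdate(templates []PodTemplate) error {
	for _, tpl := range templates {
		found := false
		for _, gate := range tpl.SchedulingGates {
			if gate == schedulingGate {
				found = true
				break
			}
		}
		if !found {
			// A RayCluster that never received its gates is stuck here:
			// even finalizer removal is rejected.
			return errors.New("scheduling gate missing: update rejected")
		}
	}
	return nil
}

func main() {
	gated := []PodTemplate{{SchedulingGates: []string{schedulingGate}}}
	ungated := []PodTemplate{{}}
	fmt.Println(validateUpdate(gated))   // <nil>
	fmt.Println(validateUpdate(ungated)) // scheduling gate missing: update rejected
}
```

This is why the missing gates turn into permanently stuck objects: the same validation that protects admitted RCs also blocks the cleanup path for RCs that were never gated.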

What you expected to happen:
We expect the pod scheduling gates to be present in all RCs carrying the elastic-job annotation when Kueue is deployed with the elastic-jobs feature gate.

How to reproduce it (as minimally and precisely as possible):
Deploy many Ray Services with Redis cleanup enabled. The issue occurs frequently in our environment.

Anything else we need to know?:

My analysis: there is a code path that can lead to this issue.

In short, a Get call in the RayCluster webhook that fails transiently can cause the pod scheduling gates not to be added to the RC.
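
The suspected code path can be sketched as follows (hypothetical Go with invented names; the real webhook logic is more involved): a transient error from the Get-dependent suspend step triggers an early return, so the gate-injection step never runs.

```go
package main

import (
	"errors"
	"fmt"
)

type rayCluster struct {
	suspended bool
	gates     []string
}

var errTransient = errors.New("transient Get failure")

// suspend stands in for the step that performs a Get and suspends the
// cluster; the failure is simulated here with a flag.
func suspend(rc *rayCluster, getFails bool) error {
	if getFails {
		return errTransient
	}
	rc.suspended = true
	return nil
}

// injectGates stands in for the step that adds the pod scheduling gates.
func injectGates(rc *rayCluster) {
	rc.gates = append(rc.gates, "kueue.x-k8s.io/admission")
}

// admit models the suspected webhook flow: an early return on a suspend
// error means injectGates is never reached.
func admit(rc *rayCluster, getFails bool) error {
	if err := suspend(rc, getFails); err != nil {
		return err // early return: gates are never injected
	}
	injectGates(rc)
	return nil
}

func main() {
	rc := &rayCluster{}
	_ = admit(rc, true)
	fmt.Println(len(rc.gates)) // 0: the RC is left without its gates
}
```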

Potential solution: the RayCluster webhook should not return early when the suspend call fails. Instead, it could apply the gates anyway and return a combined error at the end of the function.

Environment:

  • Kubernetes version (use kubectl version): 1.29
  • Kueue version (use git describe --tags --dirty --always): 0.15.4 (but may apply to main branch too)
  • KubeRay

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
