
Kueue sometimes does not inject scheduling gates for elastic jobs #10167

@ns-sundar

Description


What happened:
Environment:
Kueue 0.15.4 is deployed with the ElasticJobsViaWorkloadSlices feature gate enabled. Ray Services are submitted to one or more cluster queues, all in the same namespace; every one of them carries the kueue.x-k8s.io/elastic-job annotation and the Redis-cleanup finalizer. The RayCluster (RC) webhook is enabled.

Problem:
Some RayClusters (RCs), but not all, are missing the pod scheduling gates that Kueue is supposed to inject into their head/worker pod templates. They are also missing the tolerations defined in the ResourceFlavor that Kueue picked for the RC. At the time of writing, 74 RCs in our environment are in this state.

This is a serious problem because of its downstream impact:

  • When the Ray Service is terminated and the Redis-cleanup job has run, KubeRay tries to remove the finalizer from the RC.
  • This fails because Kueue checks whether the scheduling gates are present before allowing the mutation. The check was introduced in the commit "Support in-tree RayAutoscaler for Elastic RayCluster objects (#6662)" and is still present today.
  • As a result, the RayCluster remains in a zombie state: the head/worker pods and redis-cleanup pods are gone, but the RayCluster object itself remains.
  • This means the Kueue Workload also remains, so its resource quota, including GPU quota, stays held by Kueue.
  • In addition, Kueue may select such a Ray Service for preemption when another job is submitted, but the preemption fails because the RayCluster and the Kueue Workload persist forever. The preempting job's Workload reports "Pending the preemption of 1 workload(s)" and remains stuck in that state.

We have observed all of these problems in production.
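
To make the failure mode concrete, here is a minimal, hypothetical Go sketch of the kind of check described above: an update (such as finalizer removal) is rejected whenever the expected scheduling gate is absent from a pod template. The gate name and function names are illustrative, not Kueue's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative gate name; the real name used by Kueue may differ.
const schedulingGate = "kueue.x-k8s.io/admission"

type PodTemplate struct {
	SchedulingGates []string
}

// validateUpdate mimics the check introduced in #6662: the mutation is
// allowed only if the scheduling gate is present on every pod template.
func validateUpdate(templates []PodTemplate) error {
	for _, tpl := range templates {
		found := false
		for _, gate := range tpl.SchedulingGates {
			if gate == schedulingGate {
				found = true
				break
			}
		}
		if !found {
			// A RayCluster that never received its gates is stuck here:
			// even finalizer removal is rejected.
			return errors.New("scheduling gate missing: update rejected")
		}
	}
	return nil
}

func main() {
	gated := []PodTemplate{{SchedulingGates: []string{schedulingGate}}}
	ungated := []PodTemplate{{}}
	fmt.Println(validateUpdate(gated))   // <nil>
	fmt.Println(validateUpdate(ungated)) // scheduling gate missing: update rejected
}
```

This is why the missing gates turn into permanently stuck objects: the same validation that protects admitted RCs also blocks the cleanup path for RCs that were never gated.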

What you expected to happen:
We expect the pod scheduling gates to be present in all RCs carrying the elastic-job annotation when Kueue is deployed with the elastic-jobs feature gate.

How to reproduce it (as minimally and precisely as possible):
Deploy many Ray Services with Redis cleanup enabled. The issue occurs frequently in our environment.

Anything else we need to know?:

My analysis: there is a code path that can lead to this issue.

In short, a Get call in the RayCluster webhook that fails transiently can cause the pod scheduling gates not to be added to the RC.
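
The suspected code path can be sketched as follows (hypothetical Go with invented names; the real webhook logic is more involved): a transient error from the Get-dependent suspend step triggers an early return, so the gate-injection step never runs.

```go
package main

import (
	"errors"
	"fmt"
)

type rayCluster struct {
	suspended bool
	gates     []string
}

var errTransient = errors.New("transient Get failure")

// suspend stands in for the step that performs a Get and suspends the
// cluster; the failure is simulated here with a flag.
func suspend(rc *rayCluster, getFails bool) error {
	if getFails {
		return errTransient
	}
	rc.suspended = true
	return nil
}

// injectGates stands in for the step that adds the pod scheduling gates.
func injectGates(rc *rayCluster) {
	rc.gates = append(rc.gates, "kueue.x-k8s.io/admission")
}

// admit models the suspected webhook flow: an early return on a suspend
// error means injectGates is never reached.
func admit(rc *rayCluster, getFails bool) error {
	if err := suspend(rc, getFails); err != nil {
		return err // early return: gates are never injected
	}
	injectGates(rc)
	return nil
}

func main() {
	rc := &rayCluster{}
	_ = admit(rc, true)
	fmt.Println(len(rc.gates)) // 0: the RC is left without its gates
}
```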

Potential solution: the RayCluster webhook should not return early when the suspend call fails. Instead, it could apply the gates anyway and return a combined error at the end of the function.

Environment:

  • Kubernetes version (use kubectl version): 1.29
  • Kueue version (use git describe --tags --dirty --always): 0.15.4 (but may apply to main branch too)
  • KubeRay

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
