| 1 | +## v0.17.1 |
| 2 | + |
| 3 | +Changes since `v0.17.0`: |
| 4 | + |
| 5 | +## Urgent Upgrade Notes |
| 6 | + |
| 7 | +### (No, really, you MUST read this before you upgrade) |
| 8 | + |
| 9 | +- AdmissionChecks: Add the alpha `RejectUpdatesToCQWithInvalidOnFlavors` feature gate (disabled by default) to reject updates to existing ClusterQueues with invalid `AdmissionCheckStrategy.OnFlavors` references. |
| 10 | + When enabling this feature gate, fix any existing invalid `OnFlavors` references before updating the affected ClusterQueues. (#10512, @tenzen-y) |
| 11 | + |
| 12 | +## Changes by Kind |
| 13 | + |
| 14 | +### Bug or Regression |
| 15 | + |
| 16 | +- AdmissionChecks: ClusterQueue validation now checks that the flavors specified in `AdmissionCheckStrategy.OnFlavors` are listed in quota. (#10369, @ShaanveerS) |
| 17 | +- AdmissionChecks: fix a bug where, on backoff, admission checks that span all ResourceFlavors, such as MultiKueue, could be missing from the Workload’s status. |
| 18 | + |
| 19 | + For MultiKueue, the bug manifested when another, non-MultiKueue admission check was configured alongside the MultiKueue admission check. If the workload was evicted on the management cluster while the manager had temporarily lost connection to a worker, the remote workload would keep running on the reconnected worker, even though the workload had no reservation on the manager cluster. (#9359, @Singularity23x0) |
| 20 | +- AdmissionFairSharing: Fixed a bug in entry penalties by reducing them when a workload is admitted and clearing them once every resource in the admission entry penalty is zero. (#10455, @MaysaMacedo) |
| 21 | +- ElasticJobs: Fix a bug where pods stay gated after scale-up by allowing finished workloads to ungate their own pods. (#10364, @sohankunkerkar) |
| 22 | +- FailureRecoveryPolicy: Fixed an issue where pods could remain stuck terminating if their node became unreachable only after the force-termination timeout had already elapsed. (#10500, @kshalot) |
| 23 | +- Fix a bug in HA mode that caused follower replicas to retain stale workload caches after deletion. (#10521, @Ladicle) |
| 24 | +- Fix a bug where the batch/v1 Job mutating webhook could still run even when the batch/job integration was disabled. (#10328, @Ladicle) |
| 25 | +- Fix handling of orphaned workloads which could result in the accumulation of stale workloads |
| 26 | + after PodsReady timeout eviction for Deployment-owned pods. (#10274, @sebest) |
| 27 | +- LeaderWorkerSet integration: fix a bug where the PodTemplate metadata wasn't propagated to the Workload's PodSets. (#10399, @pajakd) |
| 28 | +- MultiKueue: Fix a bug where a job that was dispatched to a worker and then evicted there would not sync correctly. This also caused its workload to be incorrectly labeled as admitted. |
| 29 | + |
| 30 | + Now the workload and the manager job instance correctly reflect the evicted state. After the remote workload is evicted from the worker it was previously admitted to, MultiKueue falls back and dispatches remote workloads to all eligible workers again. One example is the remote instance being preempted on the worker. (#10340, @Singularity23x0) |
| 31 | +- MultiKueue: fix a bug where, when custom admission checks other than the MultiKueue admission check |
| 32 | + are configured on the manager cluster, the Job could start running on the selected worker before the other admission |
| 33 | + checks are satisfied (Ready). The fix defers dispatching of the workload until all non-MultiKueue AdmissionChecks become Ready. (#10398, @mszadkow) |
| 34 | +- Observability: Fix a bug where the `kueue_cohort_subtree_admitted_workloads_total` and `kueue_cohort_subtree_admitted_active_workloads` metrics could include results for an implicit root Cohort after deletion of a child Cohort or ClusterQueue. (#10395, @mbobrovskyi) |
| 35 | +- Observability: Fix excessive memory overhead in hot code paths by reusing the named logger in `NewLogConstructor` and avoiding unnecessary logger cloning. (#10393, @MatteoFari) |
| 36 | +- Observability: avoid logging update failures as "error" when they are caused by concurrent object modifications, especially when multiple errors are present. |
| 37 | + |
| 38 | + Example log message: "failed to update MultiKueueCluster status: Operation cannot be fulfilled on multikueueclusters.kueue.x-k8s.io \"testing-cluster\": the object has been modified; please apply your changes to the latest version and try again after failing to load client config: open /tmp/kubeconfig no such file or directory" (#10348, @mbobrovskyi) |
| 39 | +- TAS: Fix an infinite scheduling loop caused by empty slices for podSets with count=0. (#10502, @jzhaojieh) |
| 40 | +- TAS: fix a bug where Pods whose only TAS annotation is `kueue.x-k8s.io/podset-slice-required-topology` or `kueue.x-k8s.io/podset-slice-required-topology-constraints` were not ungated. (#10442, @tg123) |
| 41 | +- TAS: reduce churn in the TAS-enabled controller NonTasUsageReconciler by skipping Reconcile |
| 42 | + triggers for Pod changes that are irrelevant to the controller. (#10508, @MatteoFari) |
| 43 | + |
1 | 44 | ## v0.17.0 |
2 | 45 |
3 | 46 | Changes since `v0.16.0`: |