Commit 25f07dc

Fix flaky K8s xcom tests on ARM runners hitting 120s pod-start timeout (#65598)
On ARM CI runners with a cold containerd cache, the first test in the K8s system suite that needs the xcom sidecar image (alpine) or the basic_pod template's image can exceed KubernetesPodOperator's 120s startup budget, producing a PodLaunchTimeoutException that surfaces as a generic "AirflowException: Pod ... returned a failure".

Two changes:

- basic_pod.yaml / test_full_pod_spec now use `ubuntu` instead of `perl`. The pod only runs `/bin/bash -c 'echo ... > /airflow/xcom/return.json'`, so any image with bash works; `ubuntu` is already warmed by earlier tests in the same suite, which means no extra image pull is needed.
- Every real-cluster test that sets `do_xcom_push=True` now passes `startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS` (300s). Since pytest ordering is not guaranteed, whichever xcom test runs first has to absorb the one-time alpine sidecar pull; bumping the budget on all of them keeps the suite order-independent.

Observed failure: apache/airflow actions run 24716106401, job 72301089157 (K8S System:CeleryExecutor-3.12-v1.32.8-false, ARM). Both failing tests took ~120s, matching the default startup_timeout_seconds; pod events showed "Pulling image 'alpine' ..." with no "Successfully pulled" inside the 120s window.

No production code change: the operator default of 120s is unchanged.
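The order-independence argument in the commit message can be illustrated with a small simulation. This is only a sketch: the timings `COLD_START_SECONDS` and `WARM_START_SECONDS` and the helper `failing_tests` are hypothetical, and the model assumes the image ends up cached after the first pod start regardless of whether that test timed out. Only the 120s default and the 300s constant come from the commit.

```python
from itertools import permutations

DEFAULT_STARTUP_TIMEOUT_SECONDS = 120  # operator default, per the commit message
XCOM_STARTUP_TIMEOUT_SECONDS = 300     # raised budget introduced by this commit

# Hypothetical timings, chosen only for illustration.
COLD_START_SECONDS = 150  # first pod start, including the one-time sidecar pull
WARM_START_SECONDS = 10   # later pod starts, image already cached


def failing_tests(order, budgets):
    """Return the tests in `order` whose simulated startup exceeds their budget."""
    failures = []
    image_cached = False
    for name in order:
        startup = WARM_START_SECONDS if image_cached else COLD_START_SECONDS
        if startup > budgets[name]:
            failures.append(name)
        # Simplifying assumption: the pull completes even if the test timed out.
        image_cached = True
    return failures


tests = ["test_xcom_push", "test_pod_template_file_system", "test_full_pod_spec"]

# Only one test gets the larger budget: the outcome depends on ordering.
partial = {name: DEFAULT_STARTUP_TIMEOUT_SECONDS for name in tests}
partial["test_xcom_push"] = XCOM_STARTUP_TIMEOUT_SECONDS
partial_outcomes = {o: failing_tests(o, partial) for o in permutations(tests)}

# Every test gets the larger budget: no ordering fails.
uniform = {name: XCOM_STARTUP_TIMEOUT_SECONDS for name in tests}
uniform_outcomes = {o: failing_tests(o, uniform) for o in permutations(tests)}

order_dependent = len({tuple(f) for f in partial_outcomes.values()}) > 1
print("partial budget is order-dependent:", order_dependent)
print("uniform budget failures:", sum(len(f) for f in uniform_outcomes.values()))
```

Under these assumed timings, raising the budget on a single test only helps when that test happens to run first, whereas raising it on every xcom test removes the dependence on ordering.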
1 parent dae48ba commit 25f07dc

2 files changed

Lines changed: 13 additions & 3 deletions

kubernetes-tests/tests/kubernetes_tests/basic_pod.yaml

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ metadata:
 spec:
   containers:
     - name: base
-      image: perl
+      image: ubuntu
       command: ["/bin/bash"]
       args: ["-c", 'echo {\"hello\" : \"world\"} | cat > /airflow/xcom/return.json']
   restartPolicy: Never

kubernetes-tests/tests/kubernetes_tests/test_kubernetes_pod_operator.py

Lines changed: 12 additions & 2 deletions
@@ -51,6 +51,12 @@
 HOOK_CLASS = "airflow.providers.cncf.kubernetes.operators.pod.KubernetesHook"
 POD_MANAGER_CLASS = "airflow.providers.cncf.kubernetes.utils.pod_manager.PodManager"
 
+# Longer than the operator's 120s default to absorb the first-pull latency of
+# the xcom sidecar image (alpine) on ARM runners with a cold containerd cache.
+# Whichever xcom test runs first has to pay that cost; giving all of them the
+# same budget keeps the tests order-independent.
+XCOM_STARTUP_TIMEOUT_SECONDS = 300
+
 
 def create_context(task) -> Context:
     from tests_common.test_utils.version_compat import AIRFLOW_V_3_0_PLUS
@@ -709,6 +715,7 @@ def test_xcom_push(self, test_label):
             task_id=str(uuid4()),
             in_cluster=False,
             do_xcom_push=True,
+            startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS,
         )
         context = create_context(k)
         assert k.execute(context) == expected
@@ -753,6 +760,7 @@ def test_pod_template_file_system(self, basic_pod_template):
             labels=self.labels,
             pod_template_file=basic_pod_template.as_posix(),
             do_xcom_push=True,
+            startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS,
         )
 
         context = create_context(k)
@@ -775,6 +783,7 @@ def test_pod_template_file_with_overrides_system(self, env_vars, test_label, bas
             in_cluster=False,
             pod_template_file=basic_pod_template.as_posix(),
             do_xcom_push=True,
+            startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS,
         )
 
         context = create_context(k)
@@ -814,6 +823,7 @@ def test_pod_template_file_with_full_pod_spec(self, test_label, basic_pod_templa
             pod_template_file=basic_pod_template.as_posix(),
             full_pod_spec=pod_spec,
             do_xcom_push=True,
+            startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS,
         )
 
         context = create_context(k)
@@ -842,7 +852,7 @@ def test_full_pod_spec(self, test_label):
                 containers=[
                     k8s.V1Container(
                         name="base",
-                        image="perl",
+                        image="ubuntu",
                         command=["/bin/bash"],
                         args=["-c", 'echo {\\"hello\\" : \\"world\\"} | cat > /airflow/xcom/return.json'],
                         env=[k8s.V1EnvVar(name="env_name", value="value")],
@@ -858,7 +868,7 @@ def test_full_pod_spec(self, test_label):
             full_pod_spec=pod_spec,
             do_xcom_push=True,
             on_finish_action=OnFinishAction.KEEP_POD,
-            startup_timeout_seconds=30,
+            startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS,
         )
 
         context = create_context(k)
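The hunks above apply the same keyword argument to each call that sets `do_xcom_push=True`. One way to keep that invariant from drifting is a small static check; the following is a rough sketch using only the standard-library `ast` module. The helper name `xcom_calls_missing_timeout` and the `SAMPLE` source are hypothetical and not part of this commit, and the check does not distinguish real-cluster tests from mocked ones, so it would likely need an allowlist in practice.

```python
import ast


def xcom_calls_missing_timeout(source: str) -> list[int]:
    """Return line numbers of calls that pass do_xcom_push=True
    without also passing startup_timeout_seconds."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        # Map keyword names to their value nodes (skip **kwargs entries).
        kwargs = {kw.arg: kw.value for kw in node.keywords if kw.arg is not None}
        push = kwargs.get("do_xcom_push")
        pushes_xcom = isinstance(push, ast.Constant) and push.value is True
        if pushes_xcom and "startup_timeout_seconds" not in kwargs:
            missing.append(node.lineno)
    return sorted(missing)


# Hypothetical input: the first call follows the pattern, the second does not.
SAMPLE = '''
k1 = KubernetesPodOperator(
    task_id="a",
    do_xcom_push=True,
    startup_timeout_seconds=XCOM_STARTUP_TIMEOUT_SECONDS,
)
k2 = KubernetesPodOperator(
    task_id="b",
    do_xcom_push=True,
)
'''

print(xcom_calls_missing_timeout(SAMPLE))
```

The sample is only parsed, never executed, so the names inside it do not need to be importable.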
