[CI] Parity: count a disagreement only when CUDA PASSED by ethanwee1 · Pull Request #3371 · ROCm/pytorch

ethanwee1 · 2026-06-24T19:47:29Z

Summary

Changes the parity DISAGREE/AGREE metric to count a ROCm SKIPPED/MISSED test as a disagreement only when CUDA actually PASSED it (status_cuda == PASSED), instead of "CUDA didn't SKIP it" (!= SKIPPED).

Why

The old definition counted as disagreements the large set of attention-backend parametrization variants (e.g. test_transformers Flash/CK/CUTLASS cases) that CUDA never even enumerates (CUDA = MISSED). Those aren't real ROCm-vs-CUDA coverage gaps — they just inflated DISAGREE to ~3% (AGREE ~97%). Counting only "CUDA passed, ROCm didn't" reflects the actionable gap and matches the triage spreadsheet's adjusted definition.
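The difference between the two definitions can be sketched as follows. This is a minimal illustration, not the code in `generate_summary.py`: the function names and the sample rows are hypothetical, and only the `status_cuda == PASSED` versus `!= SKIPPED` comparison comes from the description above.

```python
ROCM_NOT_RUN = {"SKIPPED", "MISSED"}


def is_disagreement_old(status_rocm: str, status_cuda: str) -> bool:
    # Old rule: ROCm skipped/missed the test and CUDA did not skip it.
    # A CUDA status of MISSED still counts here.
    return status_rocm in ROCM_NOT_RUN and status_cuda != "SKIPPED"


def is_disagreement_new(status_rocm: str, status_cuda: str) -> bool:
    # New rule: ROCm skipped/missed the test and CUDA actually passed it.
    return status_rocm in ROCM_NOT_RUN and status_cuda == "PASSED"


# Hypothetical (status_rocm, status_cuda) pairs for illustration only.
rows = [
    ("PASSED", "PASSED"),
    ("SKIPPED", "PASSED"),
    ("MISSED", "PASSED"),
    ("SKIPPED", "MISSED"),   # variant that CUDA never enumerates
    ("SKIPPED", "SKIPPED"),
]

old_disagree = sum(is_disagreement_old(r, c) for r, c in rows)
new_disagree = sum(is_disagreement_new(r, c) for r, c in rows)
print(old_disagree, new_disagree)
```

In this toy sample the `("SKIPPED", "MISSED")` row is the only one whose classification changes between the two rules.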

Effect (mi350, real data)

Overall AGREE% 97.3% → 99.07%
DEFAULT SKIPPED+MISSED 9,002 → 2,629, DISAGREE 3.0% → 0.88%
The two line items are relabeled SKIPPED (on rocm, PASSED on cuda) / MISSED (on rocm, PASSED on cuda) for accuracy.

Scope

compute_test_config_stats + compute_overall_stats in generate_summary.py. The dashboard's parity collector (ROCm/AI-Frameworks-Dashboard#49) is being updated with the same definition so the dashboard matches this report.

Test plan

py_compile clean.
Ran against a real mi350 status CSV → Overall AGREE 99.07%, DEFAULT DISAGREE 0.88%.

Made with Cursor

(cherry picked from commit a66eeda) Fixes #ISSUE_NUMBER Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

==========================================
Triton build conditionalized on ROCM_VERSION
Include the ROCm version in triton version
(cherry picked from commit 7d33910)
(cherry picked from commit 0412eb4)
Update triton-rocm.txt to triton.txt
(cherry picked from commit 0ce9f6e)
Use ROCm/triton for install_triton.sh
(cherry picked from commit 6e9714b)
update triton commit
Revert "Use ROCm/triton for install_triton.sh"
This reverts commit 81b0cbc8435122030044049c661f252ee8aa7ae5.
change triton repo
Update triton.txt to use release/internal/3.3.x branch
Use ROCm/triton
Use ROCm/triton for install_triton.sh
(cherry picked from commit 0036db5)

…on (#2482) Related to https://github.com/ROCm/builder/pull/90/files http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/305/ PyTorch wheel installs successfully when building torchvision/torchaudio (cherry picked from commit c1ee54d)

Fixes #ISSUE_NUMBER (cherry picked from commit 0ea0592)

…A helper functions ======================================================================================= Implementation of PyTorch ut parsing script - QA helper function (#1386) * Initial implementation of PyTorch ut parsing script * Extracted path variables * Use nested dict to save results * Fixes typo * Cleanup * Fixes several issues * Minor name change * Update run_pytorch_unit_tests.py * Added file banners * Supported running from API * Added more help info * Consistent naming * Format help text --------- Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Co-authored-by: Jithun Nair <jithun.nair@amd.com> Print consolidated log file for pytorch unit test automation scripts (#1433) * Print consolidated log file for pytorch uts * Update run_entire_tests subprocess call as well * lint * Add ERROR string [SWDEV-466849] Enhancements for PyTorch UT helper scripts (#1491) * Check that >1 GPUs are visible when running TEST_CONFIG=distributed * Add EXECUTION_TIME to file-level and aggregate statistics PyTorch unit test helper scripts enhancements (#1517) * Fail earlier for distributed-on-1-GPU scenario * print cmd in consolidated log with prettier formatting * python->python3 Fixes https://ontrack-internal.amd.com/browse/SWDEV-477264 --------- Co-authored-by: blorange-amd <bo.li2@amd.com> Several issues fix of QA helper script (#1564) Fixes SWDEV-475071: https://ontrack-internal.amd.com/browse/SWDEV-475071 Removed args inside function (#1595) Fixes SWDEV-475071 (cherry picked from commit 041aa1b47978154de63edc6b7ffcdea218a847a3) QA script - Added multi gpu check with priority_tests (#1604) Fixes SWDEV-487907. Verified throwing exception for distributed is working correctly on single gpu with command: python .automation_scripts/run_pytorch_unit_tests.py --priority_test (cherry picked from commit 57cc742271cbf4547f9213710e57f6444bbc983e) (cherry picked from commit 6d5c3dc) (cherry picked from commit 2ee3aa2)

* Use triton commit same as that used for release/2.6 branch since both are triton version 3.2.0, so assuming they're compatible. Relates to: https://github.com/ROCm/rocAutomation/pull/660/files https://github.com/ROCm/builder/pull/70/files Validation http://ml-ci-internal.amd.com:8080/job/pytorch/job/manylinux_rocm_wheels/568/ --------- Co-authored-by: Jithun Nair <jithun.nair@amd.com> Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> (cherry picked from commit 14c1417) (cherry picked from commit c20a8f8)

* Add trailing comma for consistency in gfx architecture list Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> * ROCm: Enable tf32 testing on test_nn Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> --------- Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> (cherry picked from commit c113e14)

…-deps flags (#2121) Cherry-pick of #2103 Co-authored-by: Ethan Wee <Ethan.Wee@amd.com> (cherry picked from commit 1dea6e8)

Relates to: ROCm/builder#82 Validation: http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/98/ Using `registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_IT_upgrade_numpy_452f3df6`: ``` root@d92befdbb2a6:/# pip list | egrep "numpy|pandas" numpy 2.1.2 pandas 2.2.3 root@d92befdbb2a6:/# python3 Python 3.12.3 (main, Feb 4 2025, 14:48:35) [GCC 13.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import pandas >>> import torch >>> import numpy >>> exit() root@d92befdbb2a6:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50 INFO: running forward and backward for warmup. INFO: running the benchmark.. OK: finished running benchmark.. --------------------SUMMARY-------------------------- Microbenchmark for network : resnet50 Num devices: 1 Dtype: FP32 Mini batch size [img] : 64 Time per mini-batch : 0.11369450092315674 Throughput [img/sec] : 562.9120096428937 ``` --------- Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> (cherry picked from commit cf32479)

…2269) Fixes SWDEV-536456 Fixes error post-#2256: ``` 00:12:44.248 #22 155.3 ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.61.0 Requires-Python >=3.10; 0.61.0rc1 Requires-Python >=3.10; 0.61.0rc2 Requires-Python >=3.10; 0.61.1rc1 Requires-Python >=3.10; 0.61.2 Requires-Python >=3.10; 3.3 Requires-Python >=3.10; 3.3rc0 Requires-Python >=3.10; 3.4 Requires-Python >=3.10; 3.4.1 Requires-Python >=3.10; 3.4.2 Requires-Python >=3.10; 3.4rc0 Requires-Python >=3.10; 3.5 Requires-Python >=3.11; 3.5rc0 Requires-Python >=3.11; 8.2.0 Requires-Python >=3.10; 8.2.1 Requires-Python >=3.10 00:12:44.248 #22 155.3 ERROR: Could not find a version that satisfies the requirement numba==0.61.2 (from versions: 0.1, 0.2, 0.3, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.7.2, 0.8.0, 0.8.1, 0.9.0, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.12.1, 0.12.2, 0.13.0, 0.13.2, 0.13.3, 0.13.4, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.18.1, 0.18.2, 0.19.1, 0.19.2, 0.20.0, 0.21.0, 0.22.0, 0.22.1, 0.23.0, 0.23.1, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.1, 0.29.0, 0.30.0, 0.30.1, 0.31.0, 0.32.0, 0.33.0, 0.34.0, 0.35.0, 0.36.1, 0.36.2, 0.37.0, 0.38.0, 0.38.1, 0.39.0, 0.40.0, 0.40.1, 0.41.0, 0.42.0, 0.42.1, 0.43.0, 0.43.1, 0.44.0, 0.44.1, 0.45.0, 0.45.1, 0.46.0, 0.47.0, 0.48.0, 0.49.0, 0.49.1rc1, 0.49.1, 0.50.0rc1, 0.50.0, 0.50.1, 0.51.0rc1, 0.51.0, 0.51.1, 0.51.2, 0.52.0rc2, 0.53.0rc1.post1, 0.53.0rc2, 0.53.0rc3, 0.53.0, 0.53.1, 0.54.0rc2, 0.54.0rc3, 0.54.0, 0.54.1rc1, 0.54.1, 0.55.0rc1, 0.55.0, 0.55.1, 0.55.2, 0.56.0rc1, 0.56.0, 0.56.2, 0.56.3, 0.56.4, 0.57.0rc1, 0.57.0, 0.57.1rc1, 0.57.1, 0.58.0rc1, 0.58.0rc2, 0.58.0, 0.58.1, 0.59.0rc1, 0.59.0, 0.59.1, 0.60.0rc1, 0.60.0) 00:12:44.248 #22 155.3 ERROR: No matching distribution found for numba==0.61.2 ``` Validation: * Docker image: http://rocm-ci.amd.com/job/mainline-framework-pytorch-internal-cs9-ci/132 * Wheels: 
http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/102/ From `registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu22.04_py3.9_pytorch_lw_rocm7.0_IT_py3.9_a11d94ad`: ``` root@f43861a0a856:/# pip list | egrep "numpy|pandas" numpy 2.0.2 pandas 2.2.3 root@f43861a0a856:/# python Python 3.9.23 (main, Jun 4 2025, 08:55:38) [GCC 11.4.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> import torch >>> import numpy >>> import pandas root@f43861a0a856:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50 INFO: running forward and backward for warmup. INFO: running the benchmark.. OK: finished running benchmark.. --------------------SUMMARY-------------------------- Microbenchmark for network : resnet50 Num devices: 1 Dtype: FP32 Mini batch size [img] : 64 Time per mini-batch : 0.11354223489761353 Throughput [img/sec] : 563.6669038416574 ``` (cherry picked from commit a0a9d81)

…cm7.0/7.1 (#2239) Revamped version of #2108 PR to: - enable complex data types for sparse matmul on ROCm - fix sparse addmm/baddbmm on ROCm - fix sparse hipification for ROCm - fix/enable sparse tests on ROCm (~50 tests total for non-fp16/bf16): - enable fp16/bf16 sparse path for rocm7.0 - enable fp16/bf16 sparse tests for rocm7.0/7.1 ``` test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_* test_sparse.py::TestSparseCUDA::test_sparse_matmul_cuda_* test_sparse_csr.py::TestSparseCSRCUDA::test_mm_cuda_float64 test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCS* test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_* test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_addmm_cuda_float16 ``` (cherry picked from commit cc2a69c)

#2326) Fixes https://ontrack-internal.amd.com/browse/SWDEV-541809 Upgrading tensorboard after numpy upgrade Ran in **registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16381_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_internal_testing_afe8b782** ``` 7 git checkout rocm7.0_IT_upgrade_tensorboard 8 pip install .ci/docker/requirements-ci.txt 9 pip install -r .ci/docker/requirements-ci.txt 10 PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler root@ubb4-rack-22:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler /opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC). _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0) . ---------------------------------------------------------------------- Ran 1 test in 0.327s OK root@ubb4-rack-22:/var/lib/jenkins/pytorch# ``` (cherry picked from commit c7f61f4)

Tested locally successfully ``` root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements.txt Ignoring numpy: markers 'python_version == "3.9"' don't match your environment Requirement already satisfied: setuptools<80.0,>=70.1.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 2)) (79.0.1) Requirement already satisfied: cmake>=3.31.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 3)) (4.0.0) Requirement already satisfied: ninja==1.11.1.3 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 4)) (1.11.1.3) Requirement already satisfied: numpy==2.1.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 5)) (2.1.2) Requirement already satisfied: packaging==25.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 6)) (25.0) Requirement already satisfied: pyyaml==6.0.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 7)) (6.0.2) Requirement already satisfied: requests==2.32.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.32.4) Requirement already satisfied: six==1.17.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 9)) (1.17.0) Requirement already satisfied: typing-extensions==4.14.1 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 10)) (4.14.1) Requirement already satisfied: expecttest==0.3.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 8)) (0.3.0) Requirement already satisfied: filelock==3.18.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 9)) (3.18.0) Requirement already satisfied: 
fsspec==2025.7.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 10)) (2025.7.0) Requirement already satisfied: hypothesis==5.35.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 11)) (5.35.1) Requirement already satisfied: jinja2==3.1.6 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 12)) (3.1.6) Requirement already satisfied: lintrunner==0.12.7 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 13)) (0.12.7) Requirement already satisfied: networkx==2.8.8 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 14)) (2.8.8) Requirement already satisfied: optree==0.13.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (0.13.0) Requirement already satisfied: psutil==7.0.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (7.0.0) Requirement already satisfied: sympy==1.13.3 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 20)) (1.13.3) Requirement already satisfied: wheel==0.45.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 22)) (0.45.1) Requirement already satisfied: build[uv] in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 7)) (1.3.0) Requirement already satisfied: charset_normalizer<4,>=2 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.4.3) Requirement already satisfied: idna<4,>=2.5 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.10) Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.5.0) Requirement already satisfied: certifi>=2017.4.17 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r 
/var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2025.8.3) Requirement already satisfied: attrs>=19.2.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (25.3.0) Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (2.4.0) Requirement already satisfied: MarkupSafe>=2.0 in /opt/venv/lib/python3.10/site-packages (from jinja2==3.1.6->-r requirements.txt (line 12)) (3.0.2) Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from sympy==1.13.3->-r requirements.txt (line 20)) (1.3.0) Requirement already satisfied: pyproject_hooks in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (1.2.0) Requirement already satisfied: tomli>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (2.2.1) Requirement already satisfied: uv>=0.1.18 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (0.8.10) root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements-build.txt ``` (cherry picked from commit 6e6e454)

This also fixes a problem in gesvd driver when UV is not needed. (cherry picked from commit 4ce57ec) (cherry picked from commit 167b4c1)

(cherry picked from commit d6879fa) (cherry picked from commit 123a164)

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> (cherry picked from commit 1ad5bb95d796283d5f56ac1edd16f1731d24a49d) (cherry picked from commit 519160d)

Fixes #ISSUE_NUMBER

- Need to use upstream/main for rocm/pytorch's develop branch. For release branches, `github.event.pull_request.base.ref` should work as is.
- Need to remove any trailing space in the PR title so the branch name can be formed correctly.

Fixes #ISSUE_NUMBER

# Conflicts: # .ci/docker/requirements-ci.txt

[AUTOGENERATED] develop_IFU_20251104

# Conflicts: # .ci/docker/ci_commit_pins/triton.txt # requirements.txt

To keep triton version consistent with what is in rocm/triton's release/internal/3.5.x branch, we need to keep triton_version.txt at 3.5.0 and move triton hash to ToT of that branch.

[AUTOGENERATED] develop_IFU_20251118

[AUTOGENERATED] develop_IFU_20251124

# Conflicts: # .ci/docker/ci_commit_pins/triton.txt # .ci/docker/requirements-ci.txt # .ci/docker/triton_version.txt # .circleci/scripts/binary_populate_env.sh # .github/scripts/build_triton_wheel.py # test/test_sparse_csr.py

…3159) Docker credentials were using the ones from my fork and not rocm/pytorch credentials: https://github.com/ROCm/pytorch/actions/runs/24479854145/job/71541505148 Latest build https://github.com/ROCm/pytorch/actions/runs/24480169722/job/71542549933

…sting on that arch

…umn (#3153)

## Summary
- Only display tests where ROCm status is FAILED in the summary (CUDA status shown as a context column alongside). Previously both ROCm and CUDA failures were shown.
- Add "Also Failing In" column that shows which other architectures have the same test tuple (test_file, test_class, test_name) failing, making it easy to distinguish all-ROCm issues from architecture-specific ones.
- Include the count of failed tests in the section header.
- Add job-level and test-level shard info to the "LOG-BASED FAILURES (not in XML)" and "FAILED TESTS" sections.
- Include flaky tests in the "LOG-BASED FAILURES (not in XML)" section for any tests that pass when run in a new process.

## Test plan
- [x] Cross-arch detection confirmed: tests failing on all 3 archs show the other 2 in "Also Failing In"; single-arch failures show empty
- [x] CSV and Markdown output both updated consistently

Latest run: https://github.com/ROCm/pytorch/actions/runs/24798004968
Run without this PR on the same commit: https://github.com/ROCm/pytorch/actions/runs/24796654604
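The "Also Failing In" lookup described above can be sketched as a grouping by test tuple. This is an assumed shape, not the code from the PR: the function name `also_failing_in` and the input layout (a list of `(arch, test_file, test_class, test_name)` tuples) are hypothetical.

```python
from collections import defaultdict


def also_failing_in(failures):
    """Map each (arch, test_file, test_class, test_name) failure to a
    comma-separated string of the other archs failing the same test."""
    archs_by_test = defaultdict(set)
    for arch, test_file, test_class, test_name in failures:
        archs_by_test[(test_file, test_class, test_name)].add(arch)

    result = {}
    for arch, test_file, test_class, test_name in failures:
        key = (test_file, test_class, test_name)
        # Every arch failing this tuple except the row's own arch.
        others = sorted(archs_by_test[key] - {arch})
        result[(arch, *key)] = ", ".join(others)
    return result


# Hypothetical failures for illustration only.
failures = [
    ("mi300", "test_foo", "TestBar", "test_baz"),
    ("mi355", "test_foo", "TestBar", "test_baz"),
    ("mi200", "test_x", "TestX", "test_only_here"),
]
column = also_failing_in(failures)
```

A single-arch failure yields an empty string, which matches the "single-arch failures show empty" behavior listed in the test plan.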

Repro job without this PR's change: https://github.com/ROCm/pytorch/actions/runs/25342470426/job/74303089638 Validation run with this PR's change: https://github.com/ROCm/pytorch/actions/runs/25342235984 Current issue: existing testing is not able to pick up the CUDA artifacts because the CUDA job and artifact names changed from `test` to `test-osdc` for default and distributed shards. Repro inputs: `sha=b1b5b61ddb689ea65aab0915ecfac5cc459b92fb`, `arch=mi355`, `skip_rocm=false`, `csv_name=pr3199-pre-change-repro`. CUDA job names now use `test-osdc` for default and distributed shards, for example: `linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (default, 1, 5, ...)` `linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (distributed, 1, 3, ...)` CUDA artifact names now look like: `test-reports-test-osdc-default-1-5` `test-reports-test-osdc-distributed-1-3`

## Summary
- Update MI355 parity report shard counts to match current CI artifacts.
- Change default shards from 6 to 10 and distributed shards from 3 to 4.

## Validation
* Combined parity workflow for `5b9a4786ea4b1a6170c6e5a4878269e7f591224b` on `mi300, mi355`: <https://github.com/ROCm/pytorch/actions/runs/25738157290>

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

## Motivation
Old IFU_GITHUB_TOKEN [seems to have expired](https://github.com/ROCm/pytorch/actions/runs/25856299592/job/75974982737)

## Technical Details
Replace with PARITY_GITHUB_TOKEN (meant specifically for this workflow)

## Test Plan
Run parity.yml with this PR branch and see if it still gives credential error.

## Test Result
"Download artifacts" step succeeded in https://github.com/ROCm/pytorch/actions/runs/25857211908/job/75978008711

## Submission Checklist
- [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

## Summary
- Select the CUDA test artifact kind from the jobs present for the target SHA.
- Detect whether the target SHA uses test-osdc or legacy test CUDA jobs, then use the detected kind when building log keys and artifact prefixes.
- Apply the same dynamic selection to CUDA inductor jobs.
- Treat missing per-arch summary buckets as zero so mixed ROCm/CUDA coverage does not crash report generation.

## Validation
- PR/ciflow case: dispatched `Parity Report` on this branch with `sha=386f38175e3aaee2dadb36b5c364deff0869664d` and `arch=mi355, mi300, mi200, navi31`. CUDA default/distributed and inductor selected `test`.
  - Run: https://github.com/ROCm/pytorch/actions/runs/25866762885
- Main branch case: dispatched `Parity Report` on this branch with `sha=f38b1ec280bafa2ad11f6e767558e73e9eb508a6`, `arch=mi300`, `skip_rocm=true`, and `exclude_distributed=true`. CUDA default and inductor selected `test-osdc`.
  - Run: https://github.com/ROCm/pytorch/actions/runs/25867046276
- Local syntax check: `python3 -m py_compile .automation_scripts/pytorch-unit-test-scripts/download_testlogs .automation_scripts/pytorch-unit-test-scripts/generate_summary.py`.
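The detection and the zero-default described above could look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the job-name and artifact-name shapes are taken from the examples quoted earlier in this log (`<build> / test-osdc (default, 1, 5, ...)` and `test-reports-test-osdc-default-1-5`), and the nested-dict summary layout is assumed.

```python
def detect_cuda_test_kind(job_names):
    """Return 'test-osdc' if any CUDA job uses the newer name, else 'test'."""
    for name in job_names:
        # Job names look like "<build> / test-osdc (default, 1, 5, ...)".
        _, _, tail = name.partition(" / ")
        if tail.startswith("test-osdc "):
            return "test-osdc"
    return "test"


def artifact_prefix(kind, config, shard, num_shards):
    """Build an artifact name prefix from the detected job kind."""
    return f"test-reports-{kind}-{config}-{shard}-{num_shards}"


def bucket_count(summary, arch, bucket):
    """Read a per-arch bucket, treating a missing arch or bucket as zero.

    The nested-dict layout of `summary` is an assumption.
    """
    return summary.get(arch, {}).get(bucket, 0)


kind = detect_cuda_test_kind(
    ["linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (default, 1, 5, ...)"]
)
prefix = artifact_prefix(kind, "default", 1, 5)
```

Detecting the kind once per SHA and passing it to every name builder keeps log keys and artifact prefixes consistent with each other.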

## Summary
- Prefer the arch-specific MI200 workflows in `download_testlogs`: `rocm-mi200`, `periodic-rocm-mi200`, and `inductor-rocm-mi200`.
- Match arch-specific MI200 test jobs with the `linux-jammy-rocm-py3.10-mi200` prefix for default, distributed, and inductor shards.
- Keep `trunk-rocm-sandbox` as the fallback workflow for older SHAs that do not have the MI200-specific workflows, using the legacy `linux-jammy-rocm-py3.10` prefix in that fallback path.

## Motivation
A parity run for `50d07a990e33f9822ae4d48bed2d7f06c96522d0` tried to collect MI200 distributed jobs with:

`linux-jammy-rocm-py3.10 / test (distributed, ...)`

The upstream jobs for this SHA are arch-specific and include `-mi200`, so the log lookup missed all three shards and XML artifact collection fell through to empty results. The script should look for the MI200-specific workflows first, then fall back to `trunk-rocm-sandbox` for older commits.

## Validation
- `python3 -m py_compile .automation_scripts/pytorch-unit-test-scripts/download_testlogs`
- Confirmed the fixed prefix matches upstream jobs for `50d07a990e33f9822ae4d48bed2d7f06c96522d0`:
  - `rocm-mi200`: 6 default shard matches
  - `periodic-rocm-mi200`: 3 distributed shard matches
  - `inductor-rocm-mi200`: 2 inductor shard matches
- Dispatched `Parity Report` on this branch with `sha=50d07a990e33f9822ae4d48bed2d7f06c96522d0`, `arch=mi200`, and `skip_cuda=true` to validate collection end-to-end.
  - Initial run before fallback commit: https://github.com/ROCm/pytorch/actions/runs/25920564353 (success)
  - Current branch run after fallback commit: https://github.com/ROCm/pytorch/actions/runs/25920808611 (queued)

Made with [Cursor](https://cursor.com)
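The prefer-then-fall-back selection could be sketched like this. The workflow names and job prefixes come from the description above; the function name, its argument (a set of workflow names found for the SHA), and the return shape are hypothetical.

```python
MI200_WORKFLOWS = ("rocm-mi200", "periodic-rocm-mi200", "inductor-rocm-mi200")
MI200_PREFIX = "linux-jammy-rocm-py3.10-mi200"
LEGACY_WORKFLOW = "trunk-rocm-sandbox"
LEGACY_PREFIX = "linux-jammy-rocm-py3.10"


def select_mi200_source(available_workflows):
    """Return (workflows, job_prefix) for MI200 log collection.

    Prefers the arch-specific workflows when any exist for the SHA and
    otherwise falls back to the legacy workflow and prefix.
    """
    preferred = [w for w in MI200_WORKFLOWS if w in available_workflows]
    if preferred:
        return preferred, MI200_PREFIX
    return [LEGACY_WORKFLOW], LEGACY_PREFIX


new_sha = select_mi200_source({"rocm-mi200", "periodic-rocm-mi200", "trunk"})
old_sha = select_mi200_source({"trunk-rocm-sandbox"})
```

Returning the prefix together with the workflow list means a legacy workflow is never paired with the `-mi200` prefix, which is the mismatch described in the motivation.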

## Summary
- Raise the Python CSV parser field limit in `generate_summary.py` so large parity CSV diagnostic fields can be read.
- Truncate oversized diagnostic text fields while loading rows so long failure/skip messages do not make summary generation or output unwieldy.
- Preserve test identity, status, timing, and shard fields used by the parity report tables.

## Root Cause
A parity run failed in the `summarize` job when Python's default CSV field limit rejected a generated-code assertion message larger than 131,072 bytes: https://github.com/ROCm/pytorch/actions/runs/26168276671/job/76979094769

The first offending row was `inductor.test_torchinductor_codegen_dynamic_shapes::DynamicShapesCodegenGPUTests::test_vmap_dot_decomposes_bmm_dynamic_shapes_cuda`, where `message_rocm` was 145,748 bytes.

## Test plan
- `python3 -m py_compile .automation_scripts/pytorch-unit-test-scripts/generate_summary.py`
- Re-ran `generate_summary.py` locally against the artifact from the failed run:
  - Input: `20260520_all_tests_status_mi355.csv` from run `26168276671`
  - Output: summary CSV and markdown generated successfully instead of failing with `_csv.Error: field larger than field limit (131072)`.
- Triggered `parity.yml` on this branch with the same upstream commit and arch as the failing run:
  - SHA: `27f2e80e30fb950bc455c777a5e8079e9657a157`
  - Arch: `mi355`
  - Validation run: https://github.com/ROCm/pytorch/actions/runs/26175417191
  - Result: `setup-matrix`, `generate-parity (mi355)`, and `summarize` all completed successfully.
  - The summarize log shows `CSV written to 27f2e80e30fb950bc455c777a5e8079e9657a157_summary.csv` and `Markdown written to 27f2e80e30fb950bc455c777a5e8079e9657a157_summary.md`.
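A minimal sketch of the two measures described above, using the standard-library `csv.field_size_limit`. It is not the code from `generate_summary.py`: the cap value, the truncation marker, the helper names, and the `message_cuda` column are assumptions; only `message_rocm` is named in the root-cause note.

```python
import csv
import sys

MAX_DIAGNOSTIC_CHARS = 2000                          # hypothetical cap
TRUNCATION_MARKER = "...[truncated]"                 # hypothetical marker
DIAGNOSTIC_FIELDS = ("message_rocm", "message_cuda")  # message_cuda is assumed


def raise_csv_field_limit():
    """Raise the csv module's field limit as high as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms cannot store sys.maxsize in a C long.
            limit //= 10


def load_rows(handle):
    """Read CSV rows, truncating only the oversized diagnostic fields."""
    raise_csv_field_limit()
    rows = []
    for row in csv.DictReader(handle):
        for field in DIAGNOSTIC_FIELDS:
            value = row.get(field)
            if value and len(value) > MAX_DIAGNOSTIC_CHARS:
                row[field] = value[:MAX_DIAGNOSTIC_CHARS] + TRUNCATION_MARKER
        rows.append(row)
    return rows
```

The limit has to be raised before the reader touches the oversized field, because the error is raised during parsing and truncation can only happen once the row has been read.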

## Summary Adds a single step to the `summarize` job in `parity.yml` that uploads the generated `*_summary.md` (the same content already appended to `$GITHUB_STEP_SUMMARY`) as a standalone artifact named `parity-summary-md`, with N-day retention. The existing per-arch result artifacts have a 1-day retention, which makes it impossible to recover the summary content (e.g. `### FAILED TESTS`, `### LOG-BASED FAILURES`) after that window. This change lets external tooling — for example the in-progress upstream CI failure tracking — fetch the exact UI summary via `gh run download` long after the CSVs are gone, with only a standard PAT. No behavior change for any existing job. `if-no-files-found: ignore` keeps the step a no-op on early-exit runs (no CSVs produced). ## Test plan - [ ] Re-run `parity.yml` (or an autoparity manual dispatch) and verify the `parity-summary-md` artifact appears alongside the per-arch results artifacts. - [ ] `gh run download <run_id> -R ROCm/pytorch -n parity-summary-md` returns the expected `*_summary.md`. - [ ] On a run with no CSVs (forced early exit), confirm the workflow still succeeds and no artifact is uploaded. Signed-off-by: Garay-Fernandez <pgarayfe@amd.com>

## Summary - Adds a clickable **Job ID** column at the end of both the `FAILED TESTS` and `LOG-BASED FAILURES (not in XML)` tables in the parity summary markdown. Each cell renders as `[<job_id>](https://github.com/pytorch/pytorch/actions/runs/<wf>/job/<job_id>)`, dropping the reviewer one click away from the stacktrace. - Threads the upstream `pytorch/pytorch` CI job url through the existing pipeline — `download_testlogs` was already fetching that info, it just wasn't being preserved. No new API calls; no schema migrations; just persistence through `download_testlogs` → `summarize_xml_testreports.py` / `detect_log_failures.py` → `generate_summary.py`. - Backwards-compatible: every consumer reads the new fields via `.get(..., '')` / `os.path.isfile`, so older artifacts and CSVs render the column as empty cells instead of breaking. ### Example resulting row (FAILED TESTS, set2-disabled case) ``` | Arch | Test Config | Test File | Test Class | Test Name | Job-Level Shard (rocm) | Test-Level Shard (rocm) | Status (rocm) | Also Failing In | Job ID (rocm) | | mi300 | default | test_foo | TestBar | test_baz | 3/6 | 5/15 | FAILED | mi355 | [76905282313](https://github.com/pytorch/pytorch/actions/runs/26146653222/job/76905282313) | ``` ### Data flow - **FAILED TESTS** (XML-based): `_shorten_unzipped_dirs` keeps the trailing `_<jobid>` of the artifact name on each `test-<cfg>-N-N/` dir → `download_xml_files` writes one `_wf_run_id` file at the parent → `parse_xml_reports_as_dict` builds the url and stamps it on each test case → per-arch CSV carries `job_url_{set_name}` → `collect_failed_tests` propagates → markdown renders. 
- **LOG-BASED FAILURES**: `write_test_log_to_file` writes a companion `<filename>.job_url` file (full url from the job's `html_url`) → `scan_logs` reads it and stamps `job_url` on every failure / flaky row → `log_failures_<arch>.csv` / `flaky_tests_<arch>.csv` carry it → `load_log_failures` / `load_flaky_tests_as_log_failures` propagate → markdown renders. ## Test plan - Trigger a `parity.yml` run and confirm: - Per-arch test-report shard dirs are named `test-<cfg>-N-N_<jobid>` after `_shorten_unzipped_dirs`. - `_wf_run_id` file exists alongside the shard dirs in `rocm_xml/` and `cuda_xml/`. - `<filename>.job_url` companion files exist next to each `rocm*.txt` / `cuda*.txt` log file. - Inspect the per-arch CSV emitted by `summarize_xml_testreports.py` and confirm `job_url_<set1_name>` / `job_url_<set2_name>` columns are populated for failing rows. - Inspect `log_failures_<arch>.csv` / `flaky_tests_<arch>.csv` and confirm `job_url` column is populated. - Inspect the parity summary markdown artifact and click a `Job ID` cell in both tables → lands on the failing pytorch/pytorch job page with the stacktrace. - Re-run against a historical commit whose artifacts predate this change and confirm cells render as empty (no crash, no broken table). --------- Signed-off-by: Garay-Fernandez <pgarayfe@amd.com>
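The link rendering and the backwards-compatible read can be sketched as below. The URL shape and the `.get(..., '')` pattern come from the description above; the function names and the `job_url_rocm` key used in the usage lines are illustrative assumptions.

```python
def build_job_url(wf_run_id, job_id):
    """Build the upstream job URL from a workflow run id and a job id."""
    return (
        "https://github.com/pytorch/pytorch/actions/runs/"
        f"{wf_run_id}/job/{job_id}"
    )


def job_id_cell(job_url):
    """Render a markdown cell `[<job_id>](<url>)`, or empty if no URL."""
    if not job_url:
        return ""
    job_id = job_url.rstrip("/").rsplit("/", 1)[-1]
    return f"[{job_id}]({job_url})"


# A row from a newer CSV carries the URL; an older row simply lacks the key.
new_row = {"job_url_rocm": build_job_url("26146653222", "76905282313")}
old_row = {}

new_cell = job_id_cell(new_row.get("job_url_rocm", ""))
old_cell = job_id_cell(old_row.get("job_url_rocm", ""))
```

Because a missing key is read as an empty string, rows from older artifacts produce an empty cell and the table keeps the same number of columns.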

## Summary
Add a clickable `HUD LINK` to the parity report summary so users can jump from the GitHub Actions run to the matching PyTorch HUD page. For PR runs, the link points to the PR HUD page with the exact SHA used for the report, e.g. `https://hud.pytorch.org/pytorch/pytorch/pull/<pr_id>?sha=<sha>`. For SHA-only runs, the link points to the commit HUD page.

## Validation
- PR-only example: triggered `parity.yml` with `pr_id=184377`, `arch=mi355`, and no SHA. The report resolved SHA `291ff45ffe10a301a88d1a83e98b9ba9987dbbfa`, so the HUD link is `https://hud.pytorch.org/pytorch/pytorch/pull/184377?sha=291ff45ffe10a301a88d1a83e98b9ba9987dbbfa`. Run passed: https://github.com/ROCm/pytorch/actions/runs/26476256833
- SHA-only example: triggered `parity.yml` with `sha=fe1b0a2ae93e0efcfa0defeee2ed879cf68eaac6`, `arch=mi355`, and no PR ID. Run passed: https://github.com/ROCm/pytorch/actions/runs/26475135922
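The link selection could be sketched as a small helper. The PR URL shape is quoted from the summary above; the commit URL path (`/commit/<sha>`) is an assumption, since the description only says the link "points to the commit HUD page", and the function name is hypothetical.

```python
HUD_BASE = "https://hud.pytorch.org/pytorch/pytorch"


def hud_link(sha, pr_id=None):
    """Return the HUD URL for a parity report run."""
    if pr_id:
        # PR run: pin the PR page to the exact SHA used for the report.
        return f"{HUD_BASE}/pull/{pr_id}?sha={sha}"
    # SHA-only run: this commit-page path is an assumption.
    return f"{HUD_BASE}/commit/{sha}"


pr_link = hud_link("291ff45ffe10a301a88d1a83e98b9ba9987dbbfa", pr_id=184377)
```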

## Summary
- Adds a `Run Time (s)` column to both parity summary tables (FAILED TESTS and LOG-BASED FAILURES), in the `.md` and `.csv` outputs.
- **FAILED** tests use the per-test JUnit XML `time` (already in the per-arch CSV as `running_time_<set>`).
- **LOG-BASED** failures (timeouts/crashes/kills, which produce no XML) use the failing **job's wall-clock**, computed in `detect_log_failures.py` from each log's first-to-last ISO timestamps and attached to every failure/flaky entry.

Implements ROCm/frameworks-internal#16856.

## Changes
- `detect_log_failures.py`: compute per-log job run time; add `run_time` to the failures and flaky CSV reports.
- `generate_summary.py`: `Run Time (s)` column in FAILED + LOG-BASED tables; thread run time through the flaky loader.

## Test plan
- [x] Offline end-to-end: synthetic timeout log -> `detect_log_failures` (1801s) -> `generate_summary` LOG-BASED table -> downstream parser
- [x] FAILED table run time verified with a synthetic per-arch CSV
- [x] Verify on a real parity run

---------

Signed-off-by: pablo-garay <pgarayfe@amd.com>
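The first-to-last timestamp computation might look like the sketch below. It assumes each log line starts with a UTC ISO 8601 timestamp such as `2026-05-20T10:00:00.0000000Z`; that line format, the regex, and the function name are assumptions, not details taken from `detect_log_failures.py`.

```python
import re
from datetime import datetime

# Leading ISO timestamp with optional fractional seconds and a trailing Z.
ISO_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z")


def job_run_time_seconds(log_lines):
    """Wall-clock seconds between the first and last timestamped lines.

    Returns None when no line carries a timestamp. Fractional seconds
    are dropped before the subtraction.
    """
    first = last = None
    for line in log_lines:
        match = ISO_TS.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if first is None:
            first = ts
        last = ts
    if first is None:
        return None
    return int((last - first).total_seconds())


# Synthetic log for illustration only.
log = [
    "2026-05-20T10:00:00.0000000Z starting tests",
    "a line without a timestamp",
    "2026-05-20T10:30:01.5000000Z job timed out",
]
elapsed = job_run_time_seconds(log)
```

Lines without a timestamp are skipped rather than treated as errors, so stack traces and wrapped output do not affect the result.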

…ruth) (#3311)

## What

Introduces `.automation_scripts/pytorch-unit-test-scripts/parity_job_config.json`, a single source of truth for `pytorch/pytorch` job-name matching used by the parity tooling. Per-arch upstream workflow names, job-name prefixes, shard counts, artifact substrings, fallbacks (plus the CUDA equivalents and the check-run / workflow gating regexes) were previously duplicated:

- hardcoded as large dicts inside `download_testlogs`, and
- independently re-encoded as regexes inside `parity-auto.yml`.

This PR lands just the config file so the matching rules live in exactly one place.

## Why split this out

Per review feedback on #3278 (separate concerns, and this file also being introduced/changed in #3231), the shared config is pulled into its own foundational PR:

- **#3278** (download_testlogs re-run fixes) stacks on this and *consumes* the config for S3 artifact / log matching.
- **#3231** (parity auto-trigger) stacks on this and reads the check-run / workflow regexes for upstream gating.

Neither downstream PR carries its own copy of the file anymore, so there is no more add/add duplication.

## Merge order

Land this first. #3278 and #3231 are rebased onto it and drop their copies of `parity_job_config.json`.

## Validation

This is a data-only file; it is exercised end-to-end by the downloader in #3278, which loads this exact config. A stacked #3278 parity run for `346976bc` (`mi350`) reads this file over the new path and produces a fully-populated report:

- https://github.com/ROCm/pytorch/actions/runs/27716114811 (passed)

Made with [Cursor](https://cursor.com)

@jithunnair-amd

…parity reports (#3278)

## Problem

Parity reports were intermittently missing or under-counting CUDA (and sometimes ROCm) numbers even when the underlying test reports clearly existed in S3 and were visible in HUD. The gaps showed up for SHAs whose upstream `pytorch/pytorch` CI run had been **re-run** or **partially re-triggered**. This PR fixes three distinct causes of those gaps in `download_testlogs`.

---

## Fix 1 - search all run attempts for S3 artifacts

When an upstream run is re-run, GitHub only re-executes failed jobs; succeeded jobs keep their artifacts under the attempt in which they originally ran. `download_xml_files()` queried only the current attempt, so carried-over reports (most visibly CUDA `test-osdc` shards, which a ROCm re-run never retries) were silently skipped.

Fix: for each shard, probe attempts from latest down to `1` and take the highest attempt that has the artifact. Single-attempt runs are unaffected.

## Fix 2 - gather CUDA job IDs across all attempts (+ check-runs fallback)

The CUDA job-ID lookup had the same single-attempt blind spot. CUDA job IDs are now collected across **all** attempts, with a fallback to the check-runs API when the jobs listing is incomplete.

## Fix 3 - source ROCm and CUDA from the same trunk run

A SHA can have more than one `trunk.yml` run. ROCm-default picked the newest-completed run while CUDA picked the run carrying the CUDA jobs, so they could diverge and most columns came back empty.

Fix: resolve the canonical trunk run once via `resolve_full_trunk_run()` (the run carrying the CUDA test jobs, falling back to newest-completed; push runs preferred over scheduled `rerun_disabled_tests` runs) and reuse it for ROCm default, the ROCm `distributed -> trunk` fallback, and CUDA.

Also includes: jobs-API retry with backoff for transient secondary-rate-limiting, non-fatal skip when ROCm inductor didn't run, mi350 inductor sourced from trunk, and the `github.token` fallback in `parity.yml`.

---

## Stacked on #3311 (config split out)

Per review feedback (separate concerns; the config was also being introduced/changed in #3231), the shared `parity_job_config.json` was pulled into its own foundational PR **#3311** and this PR now **consumes** it. The stack:

1. **#3311** - adds `parity_job_config.json` (single source of truth). Base: `develop`.
2. **#3278** (this PR) - `download_testlogs` + `parity.yml`. Base: `ethanwee/parity-job-config`.
3. **#3231** - `parity-auto.yml`. Base: `ethanwee/fix-parity-rerun-attempt-s3`.

Merge order: #3311, then #3278, then #3231. Each PR's diff is now a single concern with no overlap.

## Validation

Re-validated on this stacked branch (which loads `parity_job_config.json` from #3311). This confirms the config split didn't break config-path resolution and the downloader still populates the report end-to-end:

- https://github.com/ROCm/pytorch/actions/runs/27716114811 (`346976bc`, `mi350`), passed

The original re-run fix was validated on `f947794`, a SHA with both a partial and a full trunk run: ROCm coverage went from `shard_rocm` 50,153 -> 357,342 (`shard_cuda` 349,942 unchanged): https://github.com/ethanwee1/pytorch/actions/runs/27160536793 (the rebased branch tree is byte-identical to that validated head, so the fix carries over unchanged).

## Test plan

- Re-run parity for a SHA with a re-run / partial upstream run; confirm CUDA and ROCm numbers both populate at full scale (runs linked above).
- Confirm a normal single-attempt, single-run SHA still produces identical output.

---

## Update: rebased onto merged #3311 + latest review round (@jithunnair-amd)

#3311 is now merged into `develop`, so this PR has been **rebased onto `develop`** (its tip is `671fe7b`). Status of the latest review comments:

- **Rebase off merged #3311 / still works with the post-review JSON**: done. The `parity_job_config.json` diff that now appears here is the **intentional schema restructure** requested on #3311 (not the earlier drift): each arch's `default`/`distributed`/`inductor` is now an ordered `[{workflow, job_prefix}, …]` list (first entry = primary source, later entries = fallbacks). `artifact_substrings` is dropped: the S3 downloader always filters on the `rocm.gpu` runner tag. `workflow_regex` is dropped: it is derived from the union of each arch's `workflow` values. `download_testlogs` unpacks the new lists into the same flat dicts via a small `parity_config_views` adapter, so matching behaviour is unchanged (verified equal to the old `workflows`/`job_prefixes`/`fallbacks` for every arch).
- **`run-name` cleanup**: dropped the repeated `arch` default, `csv_name` is prepended when given, and `sha` is used directly.
- **`sha` input**: now `default: 'latest'` with the trimmed description; `'latest'` is treated as the latest-green sentinel wherever `sha` is consumed, so no `--sha1 latest` leaks (a no-SHA run behaves exactly like the old empty-SHA default).
- **Verbose "Forward run options" comment**: shortened.
- **Auto-parity concern moved out**: the `auto_triggered` input and the `autoparity-` run-name prefix now live in #3231 (where auto-parity is introduced), not here.
- **`github.token` unconditionally**: kept the `secrets.PARITY_GITHUB_TOKEN || github.token` fallback for now, since the PAT was added for higher GitHub API rate limits (this PR also adds jobs-API retry/backoff for secondary rate-limiting). Happy to drop the PAT if we're confident `github.token`'s limits suffice; flagging for your call.

### Validation (rebased branch, new config schema)

- `mi350`, real trunk SHA `5a5e50f0f8aa…`: https://github.com/ethanwee1/pytorch/actions/runs/27965062040 , which **passed** end-to-end. The config was loaded from the restructured JSON (`Using ROCm workflows: {'default': 'trunk', 'distributed': 'periodic-rocm-mi350', 'inductor': 'trunk'}`), the always-`rocm.gpu` filter pulled the mi350 artifacts, and a full ~20 MB mi350 report was produced.
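The attempt search from Fix 1 (probe attempts from the latest down to `1` and take the highest one that has the artifact) can be sketched as follows. This is a minimal illustration, not the code in `download_xml_files()`: `find_artifact_attempt` and the `has_artifact` callback are hypothetical names, and the callback stands in for whatever S3 lookup the downloader actually performs.

```python
from typing import Callable, Optional


def find_artifact_attempt(
    latest_attempt: int,
    has_artifact: Callable[[int], bool],
) -> Optional[int]:
    """Return the highest run attempt (<= latest_attempt) that has the artifact.

    `has_artifact(attempt)` is a hypothetical probe, e.g. an S3 existence
    check for one shard's report under that attempt. Returns None when no
    attempt carries the artifact.
    """
    for attempt in range(latest_attempt, 0, -1):
        if has_artifact(attempt):
            return attempt
    return None


# Simulated re-run: the run is on attempt 3, but this shard succeeded in
# attempt 1 and was never re-executed, so its report only exists there.
attempts_with_report = {1}
print(find_artifact_attempt(3, lambda a: a in attempts_with_report))  # 1
```

For a single-attempt run the loop probes only attempt `1`, which is why such runs behave as before.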

@jithunnair-amd

## Summary

- Add a scheduled parity auto-trigger that scans completed `pytorch/pytorch` main `trunk.yml` pushes and dispatches `parity.yml` once per ready upstream SHA.
- Gate dispatch on the ROCm arch workflows that actually ran for a SHA, plus the CUDA jobs consumed by parity, so partial reports are avoided.
- Add a `pull_request` dry-run path with a smaller scan window to validate the scanner without creating parity reports from PR CI.

## How it works

- The workflow runs every 10 minutes and queries recent completed `pytorch/pytorch` `trunk.yml` push runs on `main`. Those trunk runs provide the candidate upstream SHAs to evaluate.
- For each candidate SHA, it first checks recent `ROCm/pytorch` `parity.yml` run titles. If any existing parity run already contains that SHA, the SHA is skipped so we keep one report per upstream commit.
- The maximum number of `parity.yml` dispatches is 50, which is comfortably above the maximum number of [commits to the `main` branch of pytorch/pytorch in any 10-minute interval](https://pytorchci.grafana.net/public-dashboards/bcce3e849d73451c9106ad9733990b9d).
- It then lists all upstream workflow runs for that SHA and determines which ROCm arches actually ran. Missing periodic arch workflows are not treated as pending work; only arches with matching workflow files are expected in that report.
- For the arches that did run, it lists upstream check-runs and waits for the matching ROCm test shards to reach `status=completed`. It also waits for the CUDA default, distributed, and inductor check-runs consumed by parity.
- Auxiliary shards such as `mem_leak_check` and `rerun_disabled_tests` are ignored because the parity report does not consume them.
- Once all relevant ROCm and CUDA check-runs are complete, it dispatches `parity.yml` with the ready arch list and a CSV prefix containing the upstream SHA, for example `autoparity-YYYYMMDD-<sha>`.
- Pull request runs are forced to `dry_run=true`, so they exercise the scanner and log would-be dispatches without creating reports. Scheduled and manually dispatched runs can create real parity reports.

## Test plan

- Validated workflow YAML and embedded shell locally with `yaml.BaseLoader` and `bash -n`.
- PR dry-run workflow succeeded: https://github.com/ROCm/pytorch/actions/runs/26039732579
- Full non-dry-run workflow_dispatch succeeded: https://github.com/ROCm/pytorch/actions/runs/26041358738
- The full run used `dry_run=false`, scanned 20 recent upstream trunk runs, skipped SHAs with pending parity check-runs, dispatched 5 ready SHAs, and stopped at `max_dispatches=5`.
- Dispatched parity reports all completed successfully:
  - `d76e83ef` / `mi355`: https://github.com/ROCm/pytorch/actions/runs/26041518406
  - `457e1890` / `mi355`: https://github.com/ROCm/pytorch/actions/runs/26041528996
  - `60f38508` / `mi355`: https://github.com/ROCm/pytorch/actions/runs/26041541647
  - `d1d96569` / `mi355`: https://github.com/ROCm/pytorch/actions/runs/26041551854
  - `6e3cf2e4` / `mi355, mi300, mi200`: https://github.com/ROCm/pytorch/actions/runs/26041618237

> Note: the validation runs above predate the upstream `mi355` -> `mi350` arch rename; the workflow now reads the renamed regexes from `parity_job_config.json` (introduced in #3311). Fresh post-rename dry-runs on this re-stacked branch:
> - workflow_dispatch dry-run (`archs=mi350`): https://github.com/ROCm/pytorch/actions/runs/27644407327
> - pull_request dry-run (auto, from the re-stack push): https://github.com/ROCm/pytorch/actions/runs/27642728760

## Stacked on #3278 / #3311

This PR now contains only `parity-auto.yml`. The shared `parity_job_config.json` it reads comes from #3311, and `download_testlogs` / `parity.yml` from #3278. Stack/merge order: #3311 -> #3278 -> #3231.

## Dispatch cadence note

- The full validation run used `max_dispatches=5` only to avoid flooding ROCm/pytorch during manual testing.
- The production scheduled workflow runs every 10 minutes and defaults to `max_dispatches=50`, `max_commits=200`, and `max_age_hours=72` unless manually overridden.

---

## Update: rebased onto updated #3278 + latest review round (@jithunnair-amd)

#3311 is merged; this PR is rebased onto the updated #3278 (its tip is `ce39907`). Changes from the latest round:

- **Auto-parity now owns its `parity.yml` hook**: per the "separate concerns" note, the `auto_triggered` input and the `autoparity-` run-name prefix were moved out of #3278 and into this PR, since this is where auto-parity is introduced.
- **Dropped the `archs` workflow input**: auto-parity is trunk-scoped (mi350 is the only ROCm arch that rides along in `trunk.yml`), so the arch scope is fixed and is no longer a dispatch variable to reason about.
- **Workflow regex is derived, not stored**: matching the #3278 schema restructure, the per-arch upstream-workflow regex is built from the union of each arch's `workflow` values in `parity_job_config.json`; the `workflow_regex` field no longer exists. The `checkrun_regex` map is unchanged.

### Validation

- `pull_request`-style dry-run on the re-stacked branch: https://github.com/ethanwee1/pytorch/actions/runs/27964717929 , which **passed**. It loaded the config, built the derived `Arch->workflows` regex map, and ran the per-SHA readiness gating without dispatching (dry-run).
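To make the "derived, not stored" workflow regex concrete, here is a hedged sketch of how a regex could be built from the union of an arch's `workflow` values, and how the first entry of each list can be read as the primary source. The config fragment is hypothetical: only the `[{workflow, job_prefix}, …]` shape and the mi350 workflow names come from the PR text, the `job_prefix` values are placeholders, and the real `parity_job_config.json` may nest these keys differently. The function names are illustrative and are not the `parity_config_views` adapter.

```python
import re

# Hypothetical fragment in the described schema; job_prefix values are placeholders.
EXAMPLE_CONFIG = {
    "mi350": {
        "default": [{"workflow": "trunk", "job_prefix": "placeholder-default"}],
        "distributed": [
            {"workflow": "periodic-rocm-mi350", "job_prefix": "placeholder-dist"},
            {"workflow": "trunk", "job_prefix": "placeholder-dist-fallback"},
        ],
        "inductor": [{"workflow": "trunk", "job_prefix": "placeholder-inductor"}],
    }
}


def primary_workflows(arch_cfg):
    """Map each test config to its primary workflow (first list entry)."""
    return {name: entries[0]["workflow"] for name, entries in arch_cfg.items()}


def derived_workflow_regex(arch_cfg):
    """Compile a regex matching any workflow name used by this arch."""
    names = sorted(
        {entry["workflow"] for entries in arch_cfg.values() for entry in entries}
    )
    return re.compile("^(?:" + "|".join(re.escape(n) for n in names) + ")$")


mi350 = EXAMPLE_CONFIG["mi350"]
print(primary_workflows(mi350))
pattern = derived_workflow_regex(mi350)
print(bool(pattern.match("trunk")), bool(pattern.match("pull")))
```

Deriving the pattern at load time means a workflow added to any list is matched automatically, with no second field to keep in sync.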

Split out of #3210 (3 of 3) for easier review. Surfaces test failure messages inline in the parity summary (`generate_summary.py`): - Extract the XML failure message per failed test and add it as a per-row collapsible **Error Message** column in the FAILED TESTS table. - Cap individual messages and, crucially, **budget total message size against GitHub's 1 MiB step-summary limit** so a run with many large tracebacks doesn't drop the entire summary. Only the longest messages are clipped, and only when a run would otherwise exceed the budget; the full text always stays in the CSV artifact. Independent of the other two split PRs (touches only `generate_summary.py`). Made with [Cursor](https://cursor.com)
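One way to clip only the longest messages, and only when the total would exceed a budget, is to find the largest per-message cap whose clipped total still fits. The sketch below illustrates that idea under stated assumptions: it is not the code in `generate_summary.py`, `clip_to_budget` is a hypothetical name, and it counts characters rather than encoded bytes, so a real implementation targeting the 1 MiB step-summary limit would need to measure bytes and leave room for the rest of the summary.

```python
def clip_to_budget(messages, budget):
    """Clip the longest messages so the total length fits within `budget`.

    Messages are returned unchanged when they already fit. Otherwise the
    largest cap is chosen such that sum(min(len(m), cap)) <= budget, so
    short messages are never touched.
    """
    if sum(len(m) for m in messages) <= budget:
        return list(messages)

    lengths = sorted(len(m) for m in messages)
    remaining = budget
    cap = 0
    for i, length in enumerate(lengths):
        # Fair share of the remaining budget for each not-yet-placed message.
        share = remaining // (len(lengths) - i)
        if length <= share:
            remaining -= length  # fits whole; leftover goes to longer ones
        else:
            cap = share  # every remaining (longer) message is cut to this
            break
    return [m if len(m) <= cap else m[:cap] for m in messages]


messages = ["a" * 10, "b" * 100, "c" * 1000]
clipped = clip_to_budget(messages, budget=500)
print([len(m) for m in clipped])  # [10, 100, 390]
```

In this example only the 1000-character message is shortened; the full text would still be available from the CSV artifact, as the commit message notes.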

Split out of #3210 (2 of 3) for easier review. `inductor.test_cpu_repro` is blocklisted on ROCm upstream, so it never runs there and would otherwise show up as falsely **MISSED** against the CUDA baseline (~791 phantom MISSED rows on a recent mi350 run). Add it to `EXCLUDED_TEST_SUITES` so it is dropped from the ROCm-vs-CUDA comparison. Verified against a real mi350 run: removes all `inductor.test_cpu_repro` rows; MISSED gap drops ~1,294 -> ~503 and DEFAULT disagreement 3.03% -> 2.77% (SKIPPED unchanged). Made with [Cursor](https://cursor.com) --------- Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

The DISAGREE/AGREE metric counted a ROCm SKIPPED or MISSED test as a disagreement whenever CUDA merely did not SKIP it - which includes the large set of attention-backend parametrization variants CUDA never even enumerates (CUDA MISSED). Those are not real ROCm-vs-CUDA coverage gaps and inflated DISAGREE to ~3% (AGREE ~97%). Count a disagreement only when CUDA actually PASSED the test (s2 == PASSED) in both compute_test_config_stats and compute_overall_stats, and relabel the two line items accordingly ('PASSED on <set2>'). This matches the triage spreadsheet's adjusted definition; AGREE% becomes ~99%.

Count a ROCm SKIPPED/MISSED test as a disagreement only when CUDA actually PASSED it (s2 == PASSED), excluding parametrization variants CUDA never runs. AGREE% becomes ~99%, matching the triage spreadsheet. Mirrors ROCm#3371.
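The difference between the old and new disagreement definitions can be shown on a handful of rows. This is a minimal sketch, not the code in `compute_test_config_stats` or `compute_overall_stats`: the function name and the `(status_rocm, status_cuda)` tuple layout are assumptions made for illustration.

```python
def count_disagreements(rows):
    """Count disagreements under the old and the new definition.

    `rows` is an iterable of (status_rocm, status_cuda) pairs. Only rows
    where ROCm is SKIPPED or MISSED can be disagreements.
    Old rule: CUDA status is anything other than SKIPPED.
    New rule: CUDA status is exactly PASSED.
    """
    old = new = 0
    for status_rocm, status_cuda in rows:
        if status_rocm not in ("SKIPPED", "MISSED"):
            continue
        if status_cuda != "SKIPPED":
            old += 1
        if status_cuda == "PASSED":
            new += 1
    return old, new


rows = [
    ("PASSED", "PASSED"),    # agreement under both rules
    ("SKIPPED", "PASSED"),   # disagreement under both rules
    ("MISSED", "MISSED"),    # variant CUDA never enumerates: old rule only
    ("SKIPPED", "SKIPPED"),  # agreement under both rules
    ("MISSED", "PASSED"),    # disagreement under both rules
]
print(count_disagreements(rows))  # (3, 2)
```

The `("MISSED", "MISSED")` row is the case the commit message describes: it counted toward DISAGREE before, and no longer does.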

rocm-repo-management-api · 2026-06-24T19:57:09Z

Jenkins build for 7f5f023dd5f5dc4cfc6480a335e9392b33d762ec commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

#3375)

Two user-reported bugs in the parity report.

## 1. LOG-BASED FAILURES duplication

A test could be listed **twice** in the LOG-BASED FAILURES table, once as `FAILED` and once as `CONSISTENT_FAILURE`, because `detect_log_failures.py` emits both a per-test `FAILED` record and a `FAILED CONSISTENTLY` record for the same test, and `generate_summary.py` never deduped within `log_failures` (it only filtered against the XML FAILED TESTS table).

**Fix:** add a shared `_select_rocm_log_failures()` helper (used by both the CSV and markdown renderers) that filters out XML-failed tests and **dedupes per test, preferring `CONSISTENT_FAILURE`**, so a test is never shown as both. Example that regressed: `mi350 · default · test_sparse · TestSparseMaskedReductionsCUDA · test_future_empty_dim_masked_prod_cuda_complex128` showed as both; it now shows only `CONSISTENT_FAILURE`.

## 2. Empty Job ID columns in FAILED TESTS

`download_testlogs._shorten_unzipped_dirs` extracted the upstream CI job id with `re.search(r'_(\d{6,})\.zip$', ...)`, but the **unzipped** shard dirs end in `_<jobid>` with no `.zip` (e.g. `unzipped-...gfx950.2_83371682957`). So the `_<jobid>` suffix was dropped, dirs became `test-default-1-8`, and `summarize_xml`'s `parse_xml_reports_as_dict` (`_(\d+)$`) couldn't recover the id -> `job_url` was empty for every row -> the FAILED TESTS "Job ID" columns were always blank.

**Fix:** make `.zip` optional: `r'_(\d{6,})(?:\.zip)?$'` (still matches the older `.zip` form).

## Validation

- Both files `py_compile` / parse clean.
- Dedup: ran `_select_rocm_log_failures` on a real `log_failures` CSV. The test that had both `FAILED` and `CONSISTENT_FAILURE` collapses to a single `CONSISTENT_FAILURE` row (0 keys left with multiple categories).
- Job id: the new regex extracts `83371682957` from the real `unzipped-...` dir name, and the resulting `test-distributed-1-3_83371682957` parses correctly, so `job_url` populates. End-to-end `Job ID` population will be confirmed by the next parity run after merge.

Independent of #3370 / #3371 (those touch a different section of `generate_summary.py`).

Made with [Cursor](https://cursor.com)

## Validation

Validated on a run of **this PR branch** (develop + both fixes) for the exact SHA that exhibited both bugs:

- `206283a284889f3a4f62510f8b5a8068d41121c4 · mi350`: https://github.com/ROCm/pytorch/actions/runs/28190615761
- **Bug 1:** `test_sparse::TestSparseMaskedReductionsCUDA::test_future_empty_dim_masked_prod_cuda_complex128` is now listed **once** as `CONSISTENT_FAILURE` (was both FAILED + CONSISTENT_FAILURE); 0 keys with multiple categories across the LOG-BASED table.
- **Bug 2:** FAILED TESTS table has **24 populated Job ID links**; in the status CSV `job_url_rocm` went 0 -> 308,793 and `job_url_cuda` 0 -> 354,344 non-empty.

Also applied to `ethanwee1/pytorch@main` (`fe21d4e`); the fork run (https://github.com/ethanwee1/pytorch/actions/runs/28189128743) confirms Bug 1. Note: the fork's `summarize_xml_testreports.py` predates develop's `_wf_run_id`->`job_url` feature, so Job IDs there populate once the fork syncs that from `develop` (Bug 2's fix is the `download_testlogs` regex, which is on the fork).
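Both fixes are small enough to illustrate in a few lines. The regex below is the one quoted in the commit message; everything else is a simplified sketch rather than the repository code: `extract_job_id` and `dedupe_log_failures` are hypothetical names, the `example-shard_...` inputs are made-up directory names, and the dedupe keys on a single `test` field, whereas the real `_select_rocm_log_failures()` presumably keys on more columns and also filters out XML-failed tests.

```python
import re

# Regex from the fix: the ".zip" suffix is optional.
JOB_ID_RE = re.compile(r'_(\d{6,})(?:\.zip)?$')


def extract_job_id(name):
    """Return the trailing job id of a shard name, or None if absent."""
    match = JOB_ID_RE.search(name)
    return match.group(1) if match else None


def dedupe_log_failures(records):
    """Keep one record per test, preferring CONSISTENT_FAILURE.

    `records` is a list of dicts with (assumed) keys "test" and "category".
    The first-seen order of tests is preserved.
    """
    best = {}
    for record in records:
        key = record["test"]
        if key not in best or record["category"] == "CONSISTENT_FAILURE":
            best[key] = record
    return list(best.values())


print(extract_job_id("example-shard_83371682957"))      # unzipped form
print(extract_job_id("example-shard_83371682957.zip"))  # older .zip form
print(extract_job_id("test-default-1-8"))               # no job id

records = [
    {"test": "test_a", "category": "FAILED"},
    {"test": "test_a", "category": "CONSISTENT_FAILURE"},
    {"test": "test_b", "category": "FAILED"},
]
print(dedupe_log_failures(records))
```

The `\d{6,}` lower bound is what keeps a shard suffix such as `-1-8` from being mistaken for a job id.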

pragupta and others added 30 commits October 29, 2025 17:24

Add github workflows to automate IFU (#2688) (#2748)

b97cff1

(cherry picked from commit a66eeda) Fixes #ISSUE_NUMBER Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

[rocm7.1_internal_testing] Add triton_kernels wheel generation (#2566)

6b3a141

Fixes #ISSUE_NUMBER (cherry picked from commit 0ea0592)

[AUTOGENERATED] [rocm6.5_internal_testing] Remove --no-index and --no…

d14e5a9

…-deps flags (#2121) Cherry-pick of #2103 Co-authored-by: Ethan Wee <Ethan.Wee@amd.com> (cherry picked from commit 1dea6e8)

Enable gesvda for ROCM >= 6.1 (#1339)

11ca2d0

This also fixes a problem in gesvd driver when UV is not needed. (cherry picked from commit 4ce57ec) (cherry picked from commit 167b4c1)

Remove ROCmloops specific test

629e824

(cherry picked from commit d6879fa) (cherry picked from commit 123a164)

Bump triton to 3.5.x and update related_commits

ab4714d

Revert to prev sccache by ROCm

2536631

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> (cherry picked from commit 1ad5bb95d796283d5f56ac1edd16f1731d24a49d) (cherry picked from commit 519160d)

pytorch_ifu.yml: Change date format (#2776)

777e73c

Fixes #ISSUE_NUMBER

Merge remote-tracking branch 'upstream/main' into develop_IFU_20251104

223b9c5

# Conflicts: # .ci/docker/requirements-ci.txt

Fix merge conflict

b4c1e1e

Merge pull request #2784 from ROCm/develop_IFU_20251104

3d74218

[AUTOGENERATED] develop_IFU_20251104

Merge remote-tracking branch 'upstream/main' into develop_IFU_20251118

da5ac4a

# Conflicts: # .ci/docker/ci_commit_pins/triton.txt # requirements.txt

Fix conflicts and move triton ver to 3.5.0

a3c49a9

To keep triton version consistent with what is in rocm/triton's release/internal/3.5.x branch, we need to keep triton_version.txt at 3.5.0 and move triton hash to ToT of that branch.

Merge pull request #2812 from ROCm/develop_IFU_20251118

5ca076d

[AUTOGENERATED] develop_IFU_20251118

Merge remote-tracking branch 'upstream/main' into develop_IFU_20251124

ecdea86

Merge pull request #2827 from ROCm/develop_IFU_20251124

f742da3

[AUTOGENERATED] develop_IFU_20251124

Fix merge conflicts + bump triton to 3.6.x branch

7e17fb9

Remove stale opentelemetry-cpp submodule

4d67363

ethanwee1 and others added 21 commits April 15, 2026 16:52

Make gfx94x-dcgpu the default since theRock CI currently runs full te…

293ee53

…sting on that arch

Ensure one of pr_id/sha1 is provided to download_testlogs

f401954

Use default value (45 days for ROCm org) for artifact retention

7826f04

ethanwee1 mentioned this pull request Jun 25, 2026

[CI] Parity report: dedupe LOG-BASED FAILURES + fix empty Job ID links #3375

Merged

jithunnair-amd force-pushed the develop branch from ff872e3 to f27bbcf Compare June 29, 2026 20:57

jithunnair-amd requested review from jataylo, jeffdaily, jithunnair-amd and pruthvistony as code owners June 29, 2026 20:57

[CI] Parity: count a disagreement only when CUDA PASSED #3371
ethanwee1 wants to merge 67 commits into develop from ethanwee/parity-cuda-passed-disagree

11 participants
