[CI] Parity: count a disagreement only when CUDA PASSED #3371

Open
ethanwee1 wants to merge 67 commits into develop from ethanwee/parity-cuda-passed-disagree

@ethanwee1

Summary

Changes the parity DISAGREE/AGREE metric so that a ROCm SKIPPED/MISSED test counts as a disagreement only when CUDA actually PASSED it (status_cuda == PASSED). Previously it counted whenever CUDA had not SKIPPED it (status_cuda != SKIPPED).

Why

The old definition counted as disagreements the large set of attention-backend parametrization variants (e.g. test_transformers Flash/CK/CUTLASS cases) that CUDA never even enumerates (CUDA = MISSED). Those aren't real ROCm-vs-CUDA coverage gaps — they just inflated DISAGREE to ~3% (AGREE ~97%). Counting only "CUDA passed, ROCm didn't" reflects the actionable gap and matches the triage spreadsheet's adjusted definition.
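
A minimal sketch of the two predicates, to make the difference concrete. It assumes each CSV row exposes `status_rocm` and `status_cuda` values; only `status_cuda` is named in this description, so `status_rocm` and both helper names are hypothetical, and this is not the actual `generate_summary.py` code.

```python
# ROCm statuses that the parity metric treats as "not covered on ROCm".
ROCM_GAP_STATUSES = {"SKIPPED", "MISSED"}


def is_disagreement_old(status_rocm: str, status_cuda: str) -> bool:
    """Old rule: ROCm SKIPPED/MISSED and CUDA did anything except SKIP."""
    return status_rocm in ROCM_GAP_STATUSES and status_cuda != "SKIPPED"


def is_disagreement_new(status_rocm: str, status_cuda: str) -> bool:
    """New rule: ROCm SKIPPED/MISSED counts only when CUDA actually PASSED."""
    return status_rocm in ROCM_GAP_STATUSES and status_cuda == "PASSED"


# A parametrization variant that CUDA never enumerates (CUDA = MISSED).
print(is_disagreement_old("SKIPPED", "MISSED"))  # True
print(is_disagreement_new("SKIPPED", "MISSED"))  # False
```

Under the new rule the CUDA = MISSED variants drop out of DISAGREE, while a row with ROCm SKIPPED and CUDA PASSED is still counted.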

Effect (mi350, real data)

  • Overall AGREE% 97.3% → 99.07%
  • DEFAULT SKIPPED+MISSED 9,002 → 2,629, DISAGREE 3.0% → 0.88%
  • The two line items are relabeled SKIPPED (on rocm, PASSED on cuda) / MISSED (on rocm, PASSED on cuda) for accuracy.

Scope

Limited to compute_test_config_stats and compute_overall_stats in generate_summary.py. The dashboard's parity collector (ROCm/AI-Frameworks-Dashboard#49) is being updated with the same definition so the dashboard matches this report.

Test plan

  • py_compile clean.
  • Ran against a real mi350 status CSV → Overall AGREE 99.07%, DEFAULT DISAGREE 0.88%.

Made with Cursor

pragupta and others added 30 commits October 29, 2025 17:24
(cherry picked from commit a66eeda)

Fixes #ISSUE_NUMBER

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
==========================================

Triton build conditionalized on ROCM_VERSION

Include the ROCm version in triton version

(cherry picked from commit 7d33910)
(cherry picked from commit 0412eb4)

Update triton-rocm.txt to triton.txt

(cherry picked from commit 0ce9f6e)

Use ROCm/triton for install_triton.sh

(cherry picked from commit 6e9714b)

update triton commit

Revert "Use ROCm/triton for install_triton.sh"

This reverts commit 81b0cbc8435122030044049c661f252ee8aa7ae5.

change triton repo

Update triton.txt to use release/internal/3.3.x branch

Use ROCm/triton

Use ROCm/triton for install_triton.sh

(cherry picked from commit 0036db5)
…A helper functions

=======================================================================================

Implementation of PyTorch ut parsing script - QA helper function (#1386)

* Initial implementation of PyTorch ut parsing script

* Extracted path variables

* Use nested dict to save results

* Fixes typo

* Cleanup

* Fixes several issues

* Minor name change

* Update run_pytorch_unit_tests.py

* Added file banners

* Supported running from API

* Added more help info

* Consistent naming

* Format help text

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>

Print consolidated log file for pytorch unit test automation scripts (#1433)

* Print consolidated log file for pytorch uts

* Update run_entire_tests subprocess call as well

* lint

* Add ERROR string

[SWDEV-466849] Enhancements for PyTorch UT helper scripts (#1491)

* Check that >1 GPUs are visible when running TEST_CONFIG=distributed

* Add EXECUTION_TIME to file-level and aggregate statistics

PyTorch unit test helper scripts enhancements (#1517)

* Fail earlier for distributed-on-1-GPU scenario
* print cmd in consolidated log with prettier formatting
* python->python3

Fixes https://ontrack-internal.amd.com/browse/SWDEV-477264

---------

Co-authored-by: blorange-amd <bo.li2@amd.com>

Several issues fix of QA helper script (#1564)

Fixes SWDEV-475071: https://ontrack-internal.amd.com/browse/SWDEV-475071

Removed args inside function (#1595)

Fixes SWDEV-475071

(cherry picked from commit 041aa1b47978154de63edc6b7ffcdea218a847a3)

QA script - Added multi gpu check with priority_tests (#1604)

Fixes SWDEV-487907. Verified throwing exception for distributed is
working correctly on single gpu with command: python
.automation_scripts/run_pytorch_unit_tests.py --priority_test

(cherry picked from commit 57cc742271cbf4547f9213710e57f6444bbc983e)
(cherry picked from commit 6d5c3dc)
(cherry picked from commit 2ee3aa2)
* Use triton commit same as that used for release/2.6 branch since both
are triton version 3.2.0, so assuming they're compatible.

Relates to:
https://github.com/ROCm/rocAutomation/pull/660/files
https://github.com/ROCm/builder/pull/70/files

Validation

http://ml-ci-internal.amd.com:8080/job/pytorch/job/manylinux_rocm_wheels/568/

---------

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit 14c1417)
(cherry picked from commit c20a8f8)
* Add trailing comma for consistency in gfx architecture list

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

* ROCm: Enable tf32 testing on test_nn

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

---------

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
(cherry picked from commit c113e14)
…-deps flags (#2121)

Cherry-pick of #2103

Co-authored-by: Ethan Wee <Ethan.Wee@amd.com>
(cherry picked from commit 1dea6e8)
Relates to: ROCm/builder#82

Validation:
http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/98/

Using
`registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_IT_upgrade_numpy_452f3df6`:
```
root@d92befdbb2a6:/# pip list | egrep "numpy|pandas"
numpy                   2.1.2
pandas                  2.2.3
root@d92befdbb2a6:/# python3
Python 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import torch
>>> import numpy
>>> exit()
root@d92befdbb2a6:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : resnet50
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.11369450092315674
Throughput [img/sec] : 562.9120096428937
```

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
(cherry picked from commit cf32479)
…2269)

Fixes SWDEV-536456

Fixes error post-#2256:
```
00:12:44.248  #22 155.3 ERROR: Ignored the following versions that require a different python version: 0.52.0 Requires-Python >=3.6,<3.9; 0.52.0rc3 Requires-Python >=3.6,<3.9; 0.61.0 Requires-Python >=3.10; 0.61.0rc1 Requires-Python >=3.10; 0.61.0rc2 Requires-Python >=3.10; 0.61.1rc1 Requires-Python >=3.10; 0.61.2 Requires-Python >=3.10; 3.3 Requires-Python >=3.10; 3.3rc0 Requires-Python >=3.10; 3.4 Requires-Python >=3.10; 3.4.1 Requires-Python >=3.10; 3.4.2 Requires-Python >=3.10; 3.4rc0 Requires-Python >=3.10; 3.5 Requires-Python >=3.11; 3.5rc0 Requires-Python >=3.11; 8.2.0 Requires-Python >=3.10; 8.2.1 Requires-Python >=3.10
00:12:44.248  #22 155.3 ERROR: Could not find a version that satisfies the requirement numba==0.61.2 (from versions: 0.1, 0.2, 0.3, 0.5.0, 0.6.0, 0.7.0, 0.7.1, 0.7.2, 0.8.0, 0.8.1, 0.9.0, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.12.1, 0.12.2, 0.13.0, 0.13.2, 0.13.3, 0.13.4, 0.14.0, 0.15.1, 0.16.0, 0.17.0, 0.18.1, 0.18.2, 0.19.1, 0.19.2, 0.20.0, 0.21.0, 0.22.0, 0.22.1, 0.23.0, 0.23.1, 0.24.0, 0.25.0, 0.26.0, 0.27.0, 0.28.1, 0.29.0, 0.30.0, 0.30.1, 0.31.0, 0.32.0, 0.33.0, 0.34.0, 0.35.0, 0.36.1, 0.36.2, 0.37.0, 0.38.0, 0.38.1, 0.39.0, 0.40.0, 0.40.1, 0.41.0, 0.42.0, 0.42.1, 0.43.0, 0.43.1, 0.44.0, 0.44.1, 0.45.0, 0.45.1, 0.46.0, 0.47.0, 0.48.0, 0.49.0, 0.49.1rc1, 0.49.1, 0.50.0rc1, 0.50.0, 0.50.1, 0.51.0rc1, 0.51.0, 0.51.1, 0.51.2, 0.52.0rc2, 0.53.0rc1.post1, 0.53.0rc2, 0.53.0rc3, 0.53.0, 0.53.1, 0.54.0rc2, 0.54.0rc3, 0.54.0, 0.54.1rc1, 0.54.1, 0.55.0rc1, 0.55.0, 0.55.1, 0.55.2, 0.56.0rc1, 0.56.0, 0.56.2, 0.56.3, 0.56.4, 0.57.0rc1, 0.57.0, 0.57.1rc1, 0.57.1, 0.58.0rc1, 0.58.0rc2, 0.58.0, 0.58.1, 0.59.0rc1, 0.59.0, 0.59.1, 0.60.0rc1, 0.60.0)
00:12:44.248  #22 155.3 ERROR: No matching distribution found for numba==0.61.2
```

Validation:
* Docker image:
http://rocm-ci.amd.com/job/mainline-framework-pytorch-internal-cs9-ci/132
* Wheels:
http://rocm-ci.amd.com/job/mainline-pytorch_internal-manylinux-wheels/102/

From
`registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16180_ubuntu22.04_py3.9_pytorch_lw_rocm7.0_IT_py3.9_a11d94ad`:
```
root@f43861a0a856:/# pip list | egrep "numpy|pandas"
numpy                   2.0.2
pandas                  2.2.3
root@f43861a0a856:/# python
Python 3.9.23 (main, Jun  4 2025, 08:55:38)
[GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import numpy
>>> import pandas
root@f43861a0a856:/data/pytorch-micro-benchmarking# HIP_VISIBLE_DEVICES=1 python3 micro_benchmarking_pytorch.py --network resnet50
INFO: running forward and backward for warmup.
INFO: running the benchmark..
OK: finished running benchmark..
--------------------SUMMARY--------------------------
Microbenchmark for network : resnet50
Num devices: 1
Dtype: FP32
Mini batch size [img] : 64
Time per mini-batch : 0.11354223489761353
Throughput [img/sec] : 563.6669038416574
```

(cherry picked from commit a0a9d81)
…cm7.0/7.1 (#2239)

Revamped version of #2108

PR to:
- enable complex data types for sparse matmul on ROCm
- fix sparse addmm/baddbmm on ROCm
- fix sparse hipification for ROCm
- fix/enable sparse tests on ROCm (~50 tests total for non-fp16/bf16):
- enable fp16/bf16 sparse path for rocm7.0
- enable fp16/bf16 sparse tests for rocm7.0/7.1
```
test_sparse_csr.py::TestSparseCSRCUDA::test_bmm_cuda_*
test_sparse.py::TestSparseCUDA::test_sparse_matmul_cuda_*
test_sparse_csr.py::TestSparseCSRCUDA::test_mm_cuda_float64
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_all_sparse_csr_SparseCS*
test_sparse_csr.py::TestSparseCSRCUDA::test_addmm_sizes_all_sparse_csr_*
test_sparse_csr.py::TestSparseCSRCUDA::test_sparse_addmm_cuda_float16
```

(cherry picked from commit cc2a69c)
#2326)

Fixes https://ontrack-internal.amd.com/browse/SWDEV-541809

Upgrading tensorboard after numpy upgrade
Ran in
**registry-sc-harbor.amd.com/framework/compute-rocm-dkms-no-npi-hipclang:16381_ubuntu24.04_py3.12_pytorch_lw_rocm7.0_internal_testing_afe8b782**

```
    7  git checkout rocm7.0_IT_upgrade_tensorboard
    8  pip install .ci/docker/requirements-ci.txt
    9  pip install -r .ci/docker/requirements-ci.txt
   10  PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler

root@ubb4-rack-22:/var/lib/jenkins/pytorch# PYTORCH_TEST_WITH_ROCM=1 python test/test_monitor.py TestMonitorTensorboard.test_event_handler
/opt/venv/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
  _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0)
.
----------------------------------------------------------------------
Ran 1 test in 0.327s

OK
root@ubb4-rack-22:/var/lib/jenkins/pytorch#

```

(cherry picked from commit c7f61f4)
Tested locally successfully
```
root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements.txt
Ignoring numpy: markers 'python_version == "3.9"' don't match your environment
Requirement already satisfied: setuptools<80.0,>=70.1.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 2)) (79.0.1)
Requirement already satisfied: cmake>=3.31.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 3)) (4.0.0)
Requirement already satisfied: ninja==1.11.1.3 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 4)) (1.11.1.3)
Requirement already satisfied: numpy==2.1.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 5)) (2.1.2)
Requirement already satisfied: packaging==25.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 6)) (25.0)
Requirement already satisfied: pyyaml==6.0.2 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 7)) (6.0.2)
Requirement already satisfied: requests==2.32.4 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.32.4)
Requirement already satisfied: six==1.17.0 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 9)) (1.17.0)
Requirement already satisfied: typing-extensions==4.14.1 in /opt/venv/lib/python3.10/site-packages (from -r /var/lib/jenkins/pytorch/requirements-build.txt (line 10)) (4.14.1)
Requirement already satisfied: expecttest==0.3.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 8)) (0.3.0)
Requirement already satisfied: filelock==3.18.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 9)) (3.18.0)
Requirement already satisfied: fsspec==2025.7.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 10)) (2025.7.0)
Requirement already satisfied: hypothesis==5.35.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 11)) (5.35.1)
Requirement already satisfied: jinja2==3.1.6 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 12)) (3.1.6)
Requirement already satisfied: lintrunner==0.12.7 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 13)) (0.12.7)
Requirement already satisfied: networkx==2.8.8 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 14)) (2.8.8)
Requirement already satisfied: optree==0.13.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 18)) (0.13.0)
Requirement already satisfied: psutil==7.0.0 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 19)) (7.0.0)
Requirement already satisfied: sympy==1.13.3 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 20)) (1.13.3)
Requirement already satisfied: wheel==0.45.1 in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 22)) (0.45.1)
Requirement already satisfied: build[uv] in /opt/venv/lib/python3.10/site-packages (from -r requirements.txt (line 7)) (1.3.0)
Requirement already satisfied: charset_normalizer<4,>=2 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.4.3)
Requirement already satisfied: idna<4,>=2.5 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2.5.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/venv/lib/python3.10/site-packages (from requests==2.32.4->-r /var/lib/jenkins/pytorch/requirements-build.txt (line 8)) (2025.8.3)
Requirement already satisfied: attrs>=19.2.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (25.3.0)
Requirement already satisfied: sortedcontainers<3.0.0,>=2.1.0 in /opt/venv/lib/python3.10/site-packages (from hypothesis==5.35.1->-r requirements.txt (line 11)) (2.4.0)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/venv/lib/python3.10/site-packages (from jinja2==3.1.6->-r requirements.txt (line 12)) (3.0.2)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from sympy==1.13.3->-r requirements.txt (line 20)) (1.3.0)
Requirement already satisfied: pyproject_hooks in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (1.2.0)
Requirement already satisfied: tomli>=1.1.0 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (2.2.1)
Requirement already satisfied: uv>=0.1.18 in /opt/venv/lib/python3.10/site-packages (from build[uv]->-r requirements.txt (line 7)) (0.8.10)
root@rocm-framework-47:/var/lib/jenkins/pytorch# pip install -r requirements-build.txt

```

(cherry picked from commit 6e6e454)
This also fixes a problem in gesvd driver when UV is not needed.

(cherry picked from commit 4ce57ec)
(cherry picked from commit 167b4c1)
(cherry picked from commit d6879fa)
(cherry picked from commit 123a164)
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

(cherry picked from commit 1ad5bb95d796283d5f56ac1edd16f1731d24a49d)
(cherry picked from commit 519160d)
- Need to use upstream/main for rocm/pytorch's develop branch. For
  release branches, `github.event.pull_request.base.ref` should work as
  is.

- Need to remove any trailing space in the PR title so the branch name can be
  formed correctly

Fixes #ISSUE_NUMBER
# Conflicts:
#	.ci/docker/requirements-ci.txt
[AUTOGENERATED] develop_IFU_20251104
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	requirements.txt
To keep triton version consistent with what is in rocm/triton's
release/internal/3.5.x branch, we need to keep triton_version.txt at
3.5.0 and move triton hash to ToT of that branch.
[AUTOGENERATED] develop_IFU_20251118
[AUTOGENERATED] develop_IFU_20251124
# Conflicts:
#	.ci/docker/ci_commit_pins/triton.txt
#	.ci/docker/requirements-ci.txt
#	.ci/docker/triton_version.txt
#	.circleci/scripts/binary_populate_env.sh
#	.github/scripts/build_triton_wheel.py
#	test/test_sparse_csr.py
ethanwee1 and others added 21 commits April 15, 2026 16:52
…umn (#3153)

## Summary
- Only display tests where ROCm status is FAILED in the summary (CUDA
status shown as a context column alongside). Previously both ROCm and
CUDA failures were shown.
- Add "Also Failing In" column that shows which other architectures have
the same test tuple (test_file, test_class, test_name) failing, making
it easy to distinguish all-ROCm issues from architecture-specific ones.
- Includes count of failed tests in the section header.
- Add job-level and test-level shard info to "LOG-BASED FAILURES (not in
XML)" and "FAILED TESTS" section
- Includes flaky tests in "LOG-BASED FAILURES (not in XML)" section for
any tests that pass when run in new process
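
One way the "Also Failing In" column could be derived is sketched below. The row keys (`arch`, `test_file`, `test_class`, `test_name`) and the function name are assumptions for illustration; the real script may structure its rows differently.

```python
from collections import defaultdict


def also_failing_in(failed_rows):
    """Annotate each failed row with the other archs failing the same test.

    `failed_rows` is a list of dicts with hypothetical keys:
    arch, test_file, test_class, test_name.
    """
    # Group architectures by the (test_file, test_class, test_name) tuple.
    archs_by_test = defaultdict(set)
    for row in failed_rows:
        key = (row["test_file"], row["test_class"], row["test_name"])
        archs_by_test[key].add(row["arch"])

    annotated = []
    for row in failed_rows:
        key = (row["test_file"], row["test_class"], row["test_name"])
        # Exclude the row's own arch; an empty string means single-arch failure.
        others = sorted(archs_by_test[key] - {row["arch"]})
        annotated.append({**row, "also_failing_in": ", ".join(others)})
    return annotated
```

A test failing on every architecture then lists all the others in its cell, while an architecture-specific failure renders an empty cell.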

## Test plan

- [x] Cross-arch detection confirmed: tests failing on all 3 archs show
the other 2 in "Also Failing In"; single-arch failures show empty
- [x] CSV and Markdown output both updated consistently
Latest run https://github.com/ROCm/pytorch/actions/runs/24798004968
Run without this PR on the same commit:
https://github.com/ROCm/pytorch/actions/runs/24796654604
Repro job without this PR's change:
https://github.com/ROCm/pytorch/actions/runs/25342470426/job/74303089638

Validation run with this PR's change:
https://github.com/ROCm/pytorch/actions/runs/25342235984

Current issue: existing testing is not able to pick up the CUDA
artifacts because the CUDA job and artifact names changed from `test` to
`test-osdc` for default and distributed shards.

Repro inputs: `sha=b1b5b61ddb689ea65aab0915ecfac5cc459b92fb`,
`arch=mi355`, `skip_rocm=false`, `csv_name=pr3199-pre-change-repro`.

CUDA job names now use `test-osdc` for default and distributed shards,
for example:

`linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (default, 1, 5, ...)`
`linux-jammy-cuda13.0-py3.10-gcc11 / test-osdc (distributed, 1, 3, ...)`

CUDA artifact names now look like:

`test-reports-test-osdc-default-1-5`
`test-reports-test-osdc-distributed-1-3`
## Summary
- Update MI355 parity report shard counts to match current CI artifacts.
- Change default shards from 6 to 10 and distributed shards from 3 to 4.

## Validation
* Combined parity workflow for
`5b9a4786ea4b1a6170c6e5a4878269e7f591224b` on `mi300, mi355`:
<https://github.com/ROCm/pytorch/actions/runs/25738157290>

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
## Motivation

Old IFU_GITHUB_TOKEN [seems to have
expired](https://github.com/ROCm/pytorch/actions/runs/25856299592/job/75974982737)

## Technical Details

Replace with PARITY_GITHUB_TOKEN (meant specifically for this workflow)

## Test Plan

Run parity.yml with this PR branch and see if it still gives credential
error.

## Test Result

"Download artifacts" step succeeded in
https://github.com/ROCm/pytorch/actions/runs/25857211908/job/75978008711

## Submission Checklist

- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
## Summary
- Select the CUDA test artifact kind from the jobs present for the
target SHA.
- Detect whether the target SHA uses test-osdc or legacy test CUDA jobs,
then use the detected kind when building log keys and artifact prefixes.
- Apply the same dynamic selection to CUDA inductor jobs.
- Treat missing per-arch summary buckets as zero so mixed ROCm/CUDA
coverage does not crash report generation.
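
The detection step can be sketched roughly as follows, assuming the job names for the target SHA are already available as a list of strings. The function names and the regex are illustrative, not the code in `download_testlogs`; the job-name and artifact-name shapes follow the `test-osdc` examples quoted for the earlier naming change.

```python
import re

# Captures the job kind between " / " and the opening parenthesis.
_KIND_RE = re.compile(r" / (test-osdc|test) \(")


def detect_cuda_test_kind(job_names, default="test"):
    """Return "test-osdc" if any job uses it, otherwise legacy "test"."""
    kinds = {m.group(1) for name in job_names if (m := _KIND_RE.search(name))}
    if "test-osdc" in kinds:
        return "test-osdc"
    if "test" in kinds:
        return "test"
    return default


def artifact_prefix(kind, config, shard, num_shards):
    """Build the artifact name prefix for one shard of one test config."""
    return f"test-reports-{kind}-{config}-{shard}-{num_shards}"


def bucket_count(summary, arch, bucket):
    """Read a per-arch bucket, treating a missing arch or bucket as zero."""
    return summary.get(arch, {}).get(bucket, 0)
```

The last helper mirrors the "missing per-arch summary buckets as zero" point: a lookup on an architecture with no CUDA data returns 0 instead of raising `KeyError`.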

## Validation
- PR/ciflow case: dispatched `Parity Report` on this branch with
`sha=386f38175e3aaee2dadb36b5c364deff0869664d` and `arch=mi355, mi300,
mi200, navi31`. CUDA default/distributed and inductor selected `test`.
  - Run: https://github.com/ROCm/pytorch/actions/runs/25866762885
- Main branch case: dispatched `Parity Report` on this branch with
`sha=f38b1ec280bafa2ad11f6e767558e73e9eb508a6`, `arch=mi300`,
`skip_rocm=true`, and `exclude_distributed=true`. CUDA default and
inductor selected `test-osdc`.
  - Run: https://github.com/ROCm/pytorch/actions/runs/25867046276
- Local syntax check: `python3 -m py_compile
.automation_scripts/pytorch-unit-test-scripts/download_testlogs
.automation_scripts/pytorch-unit-test-scripts/generate_summary.py`.
## Summary
- Prefer the arch-specific MI200 workflows in `download_testlogs`:
`rocm-mi200`, `periodic-rocm-mi200`, and `inductor-rocm-mi200`.
- Match arch-specific MI200 test jobs with the
`linux-jammy-rocm-py3.10-mi200` prefix for default, distributed, and
inductor shards.
- Keep `trunk-rocm-sandbox` as the fallback workflow for older SHAs that
do not have the MI200-specific workflows, using the legacy
`linux-jammy-rocm-py3.10` prefix in that fallback path.

## Motivation
A parity run for `50d07a990e33f9822ae4d48bed2d7f06c96522d0` tried to
collect MI200 distributed jobs with:

`linux-jammy-rocm-py3.10 / test (distributed, ...)`

The upstream jobs for this SHA are arch-specific and include `-mi200`,
so the log lookup missed all three shards and XML artifact collection
fell through to empty results. The script should look for the
MI200-specific workflows first, then fall back to `trunk-rocm-sandbox`
for older commits.
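
A rough sketch of the prefer-then-fall-back selection, assuming the set of workflow names that ran for the SHA is already known. `select_mi200_sources` and its input are hypothetical; how `download_testlogs` discovers the workflows is not shown here.

```python
ARCH_WORKFLOWS = ("rocm-mi200", "periodic-rocm-mi200", "inductor-rocm-mi200")
FALLBACK_WORKFLOW = "trunk-rocm-sandbox"
ARCH_PREFIX = "linux-jammy-rocm-py3.10-mi200"
LEGACY_PREFIX = "linux-jammy-rocm-py3.10"


def select_mi200_sources(workflows_for_sha):
    """Return (workflows, job_prefix) to use for one SHA.

    `workflows_for_sha` is a collection of workflow names that ran for the
    SHA (a hypothetical input used only for this illustration).
    """
    present = [w for w in ARCH_WORKFLOWS if w in workflows_for_sha]
    if present:
        # Newer SHAs: arch-specific workflows with the -mi200 job prefix.
        return present, ARCH_PREFIX
    # Older SHAs: fall back to the sandbox workflow and the legacy prefix.
    return [FALLBACK_WORKFLOW], LEGACY_PREFIX
```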

## Validation
- `python3 -m py_compile
.automation_scripts/pytorch-unit-test-scripts/download_testlogs`
- Confirmed the fixed prefix matches upstream jobs for
`50d07a990e33f9822ae4d48bed2d7f06c96522d0`:
  - `rocm-mi200`: 6 default shard matches
  - `periodic-rocm-mi200`: 3 distributed shard matches
  - `inductor-rocm-mi200`: 2 inductor shard matches
- Dispatched `Parity Report` on this branch with
`sha=50d07a990e33f9822ae4d48bed2d7f06c96522d0`, `arch=mi200`, and
`skip_cuda=true` to validate collection end-to-end.
- Initial run before fallback commit:
https://github.com/ROCm/pytorch/actions/runs/25920564353 (success)
- Current branch run after fallback commit:
https://github.com/ROCm/pytorch/actions/runs/25920808611 (queued)

Made with [Cursor](https://cursor.com)
## Summary
- Raise the Python CSV parser field limit in `generate_summary.py` so
large parity CSV diagnostic fields can be read.
- Truncate oversized diagnostic text fields while loading rows so long
failure/skip messages do not make summary generation or output unwieldy.
- Preserve test identity, status, timing, and shard fields used by the
parity report tables.

## Root Cause
A parity run failed in the `summarize` job when Python's default CSV
field limit rejected a generated-code assertion message larger than
131,072 bytes:
https://github.com/ROCm/pytorch/actions/runs/26168276671/job/76979094769

The first offending row was
`inductor.test_torchinductor_codegen_dynamic_shapes::DynamicShapesCodegenGPUTests::test_vmap_dot_decomposes_bmm_dynamic_shapes_cuda`,
where `message_rocm` was 145,748 bytes.
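
For illustration, the fix can be approximated with the standard `csv` module as below. The truncation cap, the `message_cuda` column name, and the helper names are assumptions; only `message_rocm` appears in the failing run described above.

```python
import csv
import io
import sys

MAX_DIAGNOSTIC_CHARS = 2000  # hypothetical cap, not the value used in the PR
DIAGNOSTIC_FIELDS = ("message_rocm", "message_cuda")  # message_cuda is assumed


def raise_csv_field_limit():
    """Raise the csv field limit as high as the platform's C long allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms use a 32-bit C long; back off and retry.
            limit //= 10


def load_rows(stream):
    """Read CSV rows, truncating oversized diagnostic text fields."""
    raise_csv_field_limit()
    rows = []
    for row in csv.DictReader(stream):
        for field in DIAGNOSTIC_FIELDS:
            value = row.get(field) or ""
            if len(value) > MAX_DIAGNOSTIC_CHARS:
                row[field] = value[:MAX_DIAGNOSTIC_CHARS] + "...[truncated]"
        rows.append(row)
    return rows


# A message larger than the default 131072 limit now loads and is shortened.
text = "test_name,status_rocm,message_rocm\nt,FAILED," + "x" * 145748 + "\n"
rows = load_rows(io.StringIO(text))
```

Identity and status columns are left untouched; only the fields listed in `DIAGNOSTIC_FIELDS` are shortened.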

## Test plan
- `python3 -m py_compile
.automation_scripts/pytorch-unit-test-scripts/generate_summary.py`
- Re-ran `generate_summary.py` locally against the artifact from the
failed run:
  - Input: `20260520_all_tests_status_mi355.csv` from run `26168276671`
  - Output: summary CSV and markdown generated successfully instead of
    failing with `_csv.Error: field larger than field limit (131072)`.
- Triggered `parity.yml` on this branch with the same upstream commit
and arch as the failing run:
  - SHA: `27f2e80e30fb950bc455c777a5e8079e9657a157`
  - Arch: `mi355`
- Validation run:
https://github.com/ROCm/pytorch/actions/runs/26175417191
- Result: `setup-matrix`, `generate-parity (mi355)`, and `summarize` all
completed successfully.
- The summarize log shows `CSV written to
27f2e80e30fb950bc455c777a5e8079e9657a157_summary.csv` and `Markdown
written to 27f2e80e30fb950bc455c777a5e8079e9657a157_summary.md`.
## Summary

Adds a single step to the `summarize` job in `parity.yml` that uploads
the generated `*_summary.md` (the same content already appended to
`$GITHUB_STEP_SUMMARY`) as a standalone artifact named
`parity-summary-md`, with N-day retention.

The existing per-arch result artifacts have a 1-day retention, which
makes it impossible to recover the summary content (e.g. `### FAILED
TESTS`, `### LOG-BASED FAILURES`) after that window. This change lets
external tooling — for example the in-progress upstream CI failure
tracking — fetch the exact UI summary via `gh run download` long after
the CSVs are gone, with only a standard PAT.

No behavior change for any existing job. `if-no-files-found: ignore`
keeps the step a no-op on early-exit runs (no CSVs produced).

## Test plan

- [ ] Re-run `parity.yml` (or an autoparity manual dispatch) and verify
      the `parity-summary-md` artifact appears alongside the per-arch
      results artifacts.
- [ ] `gh run download <run_id> -R ROCm/pytorch -n parity-summary-md`
      returns the expected `*_summary.md`.
- [ ] On a run with no CSVs (forced early exit), confirm the workflow
      still succeeds and no artifact is uploaded.

Signed-off-by: Garay-Fernandez <pgarayfe@amd.com>
## Summary

- Adds a clickable **Job ID** column at the end of both the `FAILED
TESTS` and `LOG-BASED FAILURES (not in XML)` tables in the parity
summary markdown. Each cell renders as
`[<job_id>](https://github.com/pytorch/pytorch/actions/runs/<wf>/job/<job_id>)`,
putting the reviewer one click away from the stack trace.
- Threads the upstream `pytorch/pytorch` CI job url through the existing
pipeline — `download_testlogs` was already fetching that info, it just
wasn't being preserved. No new API calls; no schema migrations; just
persistence through `download_testlogs` → `summarize_xml_testreports.py`
/ `detect_log_failures.py` → `generate_summary.py`.
- Backwards-compatible: every consumer reads the new fields via
`.get(..., '')` / `os.path.isfile`, so older artifacts and CSVs render
the column as empty cells instead of breaking.
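
A minimal sketch of how such a cell might be rendered in a backwards-compatible way. The function name and the default `job_url` key are illustrative; the per-arch CSV uses `job_url_{set_name}` columns, so the key would be passed in by the caller.

```python
import re


def job_id_cell(row, url_key="job_url"):
    """Render a Markdown link cell, or an empty cell if no URL is recorded."""
    # .get(..., "") keeps older CSVs (without the column) from raising.
    url = row.get(url_key, "") or ""
    match = re.search(r"/job/(\d+)/?$", url)
    if not match:
        return ""
    return f"[{match.group(1)}]({url})"
```

Rows from artifacts that predate the change simply produce an empty string, so the table keeps its shape.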

### Example resulting row (FAILED TESTS, set2-disabled case)

```
| Arch | Test Config | Test File | Test Class | Test Name | Job-Level Shard (rocm) | Test-Level Shard (rocm) | Status (rocm) | Also Failing In | Job ID (rocm) |
| mi300 | default | test_foo | TestBar | test_baz | 3/6 | 5/15 | FAILED | mi355 | [76905282313](https://github.com/pytorch/pytorch/actions/runs/26146653222/job/76905282313) |
```

### Data flow

- **FAILED TESTS** (XML-based): `_shorten_unzipped_dirs` keeps the
trailing `_<jobid>` of the artifact name on each `test-<cfg>-N-N/` dir →
`download_xml_files` writes one `_wf_run_id` file at the parent →
`parse_xml_reports_as_dict` builds the url and stamps it on each test
case → per-arch CSV carries `job_url_{set_name}` →
`collect_failed_tests` propagates → markdown renders.
- **LOG-BASED FAILURES**: `write_test_log_to_file` writes a companion
`<filename>.job_url` file (full url from the job's `html_url`) →
`scan_logs` reads it and stamps `job_url` on every failure / flaky row →
`log_failures_<arch>.csv` / `flaky_tests_<arch>.csv` carry it →
`load_log_failures` / `load_flaky_tests_as_log_failures` propagate →
markdown renders.

## Test plan

- Trigger a `parity.yml` run and confirm:
  - Per-arch test-report shard dirs are named `test-<cfg>-N-N_<jobid>`
    after `_shorten_unzipped_dirs`.
  - `_wf_run_id` file exists alongside the shard dirs in `rocm_xml/` and
    `cuda_xml/`.
  - `<filename>.job_url` companion files exist next to each `rocm*.txt` /
    `cuda*.txt` log file.
- Inspect the per-arch CSV emitted by `summarize_xml_testreports.py` and
confirm `job_url_<set1_name>` / `job_url_<set2_name>` columns are
populated for failing rows.
- Inspect `log_failures_<arch>.csv` / `flaky_tests_<arch>.csv` and
confirm `job_url` column is populated.
- Inspect the parity summary markdown artifact and click a `Job ID` cell
in both tables → lands on the failing pytorch/pytorch job page with the
stacktrace.
- Re-run against a historical commit whose artifacts predate this change
and confirm cells render as empty (no crash, no broken table).

---------

Signed-off-by: Garay-Fernandez <pgarayfe@amd.com>
## Summary

Add a clickable `HUD LINK` to the parity report summary so users can
jump from the GitHub Actions run to the matching PyTorch HUD page.

For PR runs, the link points to the PR HUD page with the exact SHA used
for the report, e.g.
`https://hud.pytorch.org/pytorch/pytorch/pull/<pr_id>?sha=<sha>`. For
SHA-only runs, the link points to the commit HUD page.
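
The link selection can be sketched as below. The PR URL shape is the one quoted above; the commit URL shape is an assumption, since this description does not spell it out, and `hud_link` is a hypothetical helper name.

```python
def hud_link(sha, pr_id=None):
    """Return the PyTorch HUD URL for a parity report."""
    if pr_id:
        # PR runs: PR page pinned to the exact SHA used for the report.
        return f"https://hud.pytorch.org/pytorch/pytorch/pull/{pr_id}?sha={sha}"
    # SHA-only runs: commit page (assumed URL shape).
    return f"https://hud.pytorch.org/pytorch/pytorch/commit/{sha}"
```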

## Validation

- PR-only example: triggered `parity.yml` with `pr_id=184377`,
`arch=mi355`, and no SHA. The report resolved SHA
`291ff45ffe10a301a88d1a83e98b9ba9987dbbfa`, so the HUD link is
`https://hud.pytorch.org/pytorch/pytorch/pull/184377?sha=291ff45ffe10a301a88d1a83e98b9ba9987dbbfa`.
Run passed: https://github.com/ROCm/pytorch/actions/runs/26476256833
- SHA-only example: triggered `parity.yml` with
`sha=fe1b0a2ae93e0efcfa0defeee2ed879cf68eaac6`, `arch=mi355`, and no PR
ID. Run passed: https://github.com/ROCm/pytorch/actions/runs/26475135922
## Summary
- Adds a `Run Time (s)` column to both parity summary tables (FAILED
TESTS and LOG-BASED FAILURES), in the `.md` and `.csv` outputs.
- **FAILED** tests use the per-test JUnit XML `time` (already in the
per-arch CSV as `running_time_<set>`).
- **LOG-BASED** failures (timeouts/crashes/kills, which produce no XML)
use the failing **job's wall-clock**, computed in
`detect_log_failures.py` from each log's first-to-last ISO timestamps
and attached to every failure/flaky entry.
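
The job wall-clock computation might look like the sketch below. It assumes each timestamped log line starts with an ISO-8601 UTC timestamp (for example `2026-05-20T10:00:00.0000000Z`); the regex and function name are illustrative rather than the code in `detect_log_failures.py`.

```python
import re
from datetime import datetime

# Leading ISO-8601 timestamp; fractional seconds and the trailing Z are
# matched but ignored, so the result has whole-second resolution.
_TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?Z?")


def job_run_time_seconds(log_lines):
    """Return seconds between the first and last timestamped lines, or None."""
    first = last = None
    for line in log_lines:
        match = _TS_RE.match(line)
        if not match:
            continue  # lines without a timestamp do not affect the result
        stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        if first is None:
            first = stamp
        last = stamp
    if first is None:
        return None
    return int((last - first).total_seconds())
```

A log with no timestamps yields `None`, which a caller could render as an empty `Run Time (s)` cell.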

Implements ROCm/frameworks-internal#16856.

## Changes
- `detect_log_failures.py`: compute per-log job run time; add `run_time`
to the failures and flaky CSV reports.
- `generate_summary.py`: `Run Time (s)` column in FAILED + LOG-BASED
tables; thread run time through the flaky loader.

## Test plan
- [x] Offline end-to-end: synthetic timeout log -> `detect_log_failures`
(1801s) -> `generate_summary` LOG-BASED table -> downstream parser
- [x] FAILED table run time verified with a synthetic per-arch CSV
- [x] Verify on a real parity run

---------

Signed-off-by: pablo-garay <pgarayfe@amd.com>
…ruth) (#3311)

## What

Introduces
`.automation_scripts/pytorch-unit-test-scripts/parity_job_config.json` —
a single source of truth for `pytorch/pytorch` job-name matching used by
the parity tooling.

Per-arch upstream workflow names, job-name prefixes, shard counts,
artifact substrings, fallbacks (plus the CUDA equivalents and the
check-run / workflow gating regexes) were previously duplicated:
- hardcoded as large dicts inside `download_testlogs`, and
- independently re-encoded as regexes inside `parity-auto.yml`.

This PR lands just the config file so the matching rules live in exactly
one place.

## Why split this out

Per review feedback on #3278 (separate concerns, and this file also
being introduced/changed in #3231), the shared config is pulled into its
own foundational PR:

- **#3278** (download_testlogs re-run fixes) stacks on this and
*consumes* the config for S3 artifact / log matching.
- **#3231** (parity auto-trigger) stacks on this and reads the check-run
/ workflow regexes for upstream gating.

Neither downstream PR carries its own copy of the file anymore — no more
add/add duplication.

## Merge order

Land this first. #3278 and #3231 are rebased onto it and drop their
copies of `parity_job_config.json`.

## Validation

This is a data-only file; it is exercised end-to-end by the downloader
in #3278, which loads this exact config. A stacked #3278 parity run for
`346976bc` (`mi350`) reads this file over the new path and produces a
fully-populated report:

- https://github.com/ROCm/pytorch/actions/runs/27716114811 — passed

Made with [Cursor](https://cursor.com)
…parity reports (#3278)

## Problem

Parity reports were intermittently missing or under-counting CUDA (and
sometimes ROCm) numbers even when the underlying test reports clearly
existed in S3 and were visible in HUD. The gaps showed up for SHAs whose
upstream `pytorch/pytorch` CI run had been **re-run** or **partially
re-triggered**.

This PR fixes three distinct causes of those gaps in
`download_testlogs`.

---

## Fix 1 - search all run attempts for S3 artifacts

When an upstream run is re-run, GitHub only re-executes failed jobs;
succeeded jobs keep their artifacts under the attempt in which they
originally ran. `download_xml_files()` queried only the current attempt,
so carried-over reports (most visibly CUDA `test-osdc` shards, which a
ROCm re-run never retries) were silently skipped. Fix: for each shard,
probe attempts from latest down to `1` and take the highest attempt that
has the artifact. Single-attempt runs are unaffected.
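
The probing order described above can be sketched like this. `has_artifact` stands in for whatever S3 lookup the downloader performs; both names are hypothetical.

```python
from typing import Callable, Optional


def highest_attempt_with_artifact(
    latest_attempt: int, has_artifact: Callable[[int], bool]
) -> Optional[int]:
    """Probe attempts from the latest down to 1; return the first hit."""
    for attempt in range(latest_attempt, 0, -1):
        if has_artifact(attempt):
            return attempt
    return None  # no attempt carries the artifact for this shard
```

For a shard that succeeded in attempt 1 and was not retried in attempt 2, the probe falls through to attempt 1 instead of reporting the shard as missing.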

## Fix 2 - gather CUDA job IDs across all attempts (+ check-runs fallback)

The CUDA job-ID lookup had the same single-attempt blind spot. CUDA job
IDs are now collected across **all** attempts, with a fallback to the
check-runs API when the jobs listing is incomplete.

## Fix 3 - source ROCm and CUDA from the same trunk run

A SHA can have more than one `trunk.yml` run. ROCm-default picked the
newest-completed run while CUDA picked the run carrying the CUDA jobs,
so they could diverge and most columns came back empty. Fix: resolve the
canonical trunk run once via `resolve_full_trunk_run()` (the run
carrying the CUDA test jobs, falling back to newest-completed; push runs
preferred over scheduled `rerun_disabled_tests` runs) and reuse it for
ROCm default, the ROCm `distributed -> trunk` fallback, and CUDA.
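
One possible reading of that selection rule, as a sketch. The run records and their keys (`event`, `status`, `created_at`, `has_cuda_jobs`) are assumptions for illustration, and `pick_trunk_run` is not the real `resolve_full_trunk_run()`.

```python
from typing import Dict, List, Optional


def pick_trunk_run(runs: List[Dict]) -> Optional[Dict]:
    """Choose one canonical run to reuse for ROCm and CUDA."""

    def best(candidates: List[Dict]) -> Optional[Dict]:
        # Prefer push runs over scheduled ones, then the newest run.
        # ISO-8601 strings in one format compare correctly as text.
        return max(
            candidates,
            key=lambda r: (r["event"] == "push", r["created_at"]),
            default=None,
        )

    with_cuda = [r for r in runs if r["has_cuda_jobs"]]
    if with_cuda:
        return best(with_cuda)
    # Fallback: newest completed run when no run carries the CUDA jobs.
    return best([r for r in runs if r["status"] == "completed"])
```

Resolving the run once and passing it to every consumer is what keeps the ROCm and CUDA columns from diverging.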

Also includes: jobs-API retry with backoff for transient
secondary-rate-limiting, non-fatal skip when ROCm inductor didn't run,
mi350 inductor sourced from trunk, and the `github.token` fallback in
`parity.yml`.

---

## Stacked on #3311 (config split out)

Per review feedback (separate concerns; the config was also being
introduced/changed in #3231), the shared `parity_job_config.json` was
pulled into its own foundational PR **#3311** and this PR now
**consumes** it. The stack:

1. **#3311** - adds `parity_job_config.json` (single source of truth).
Base: `develop`.
2. **#3278** (this PR) - `download_testlogs` + `parity.yml`. Base:
`ethanwee/parity-job-config`.
3. **#3231** - `parity-auto.yml`. Base:
`ethanwee/fix-parity-rerun-attempt-s3`.

Merge order: #3311, then #3278, then #3231. Each PR's diff is now a
single concern with no overlap.

## Validation

Re-validated on this stacked branch (which loads
`parity_job_config.json` from #3311) — confirms the config split didn't
break config-path resolution and the downloader still populates the
report end-to-end:

- https://github.com/ROCm/pytorch/actions/runs/27716114811 (`346976bc`,
`mi350`) — passed

The original re-run fix was validated on `f947794`, a SHA with both a
partial and a full trunk run: ROCm coverage went from `shard_rocm`
50,153 -> 357,342 (`shard_cuda` 349,942 unchanged):
https://github.com/ethanwee1/pytorch/actions/runs/27160536793 (the
rebased branch tree is byte-identical to that validated head, so the fix
carries over unchanged).

## Test plan

- Re-run parity for a SHA with a re-run / partial upstream run; confirm
CUDA and ROCm numbers both populate at full scale (runs linked above).
- Confirm a normal single-attempt, single-run SHA still produces
identical output.




---

## Update: rebased onto merged #3311 + latest review round (@jithunnair-amd)

#3311 is now merged into `develop`, so this PR has been **rebased onto
`develop`** (its tip is `671fe7b`). Status of the latest review
comments:

- **Rebase off merged #3311 / still works with the post-review JSON** —
done. The `parity_job_config.json` diff that now appears here is the
**intentional schema restructure** requested on #3311 (not the earlier
drift): each arch's `default`/`distributed`/`inductor` is now an ordered
`[{workflow, job_prefix}, …]` list (first entry = primary source, later
entries = fallbacks). `artifact_substrings` is dropped — the S3
downloader always filters on the `rocm.gpu` runner tag. `workflow_regex`
is dropped — it is derived from the union of each arch's `workflow`
values. `download_testlogs` unpacks the new lists into the same flat
dicts via a small `parity_config_views` adapter, so matching behaviour
is unchanged (verified equal to the old
`workflows`/`job_prefixes`/`fallbacks` for every arch).
- **`run-name` cleanup** — dropped the repeated `arch` default,
`csv_name` is prepended when given, and `sha` is used directly.
- **`sha` input** — now `default: 'latest'` with the trimmed
description; `'latest'` is treated as the latest-green sentinel wherever
`sha` is consumed, so no `--sha1 latest` leaks (a no-SHA run behaves
exactly like the old empty-SHA default).
- **Verbose "Forward run options" comment** — shortened.
- **Auto-parity concern moved out** — the `auto_triggered` input and the
`autoparity-` run-name prefix now live in #3231 (where auto-parity is
introduced), not here.
- **`github.token` unconditionally** — kept the
`secrets.PARITY_GITHUB_TOKEN || github.token` fallback for now, since
the PAT was added for higher GitHub API rate limits (this PR also adds
jobs-API retry/backoff for secondary rate-limiting). Happy to drop the
PAT if we're confident `github.token`'s limits suffice — flagging for
your call.
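
To illustrate the restructured schema from the first bullet, here is a sketch of unpacking the ordered lists into flat views. The workflow names match the validation log in this update; the `job_prefix` values, the shape of the fallback view, and the function name `flat_views_sketch` are assumptions and do not reproduce `parity_config_views`.

```python
from typing import Dict, List, Tuple

# Hypothetical per-arch entry in the new list-based schema.
EXAMPLE_ARCH = {
    "default": [{"workflow": "trunk", "job_prefix": "example-default"}],
    "distributed": [
        {"workflow": "periodic-rocm-mi350", "job_prefix": "example-dist"},
        {"workflow": "trunk", "job_prefix": "example-dist"},  # fallback
    ],
    "inductor": [{"workflow": "trunk", "job_prefix": "example-inductor"}],
}


def flat_views_sketch(
    arch_cfg: Dict[str, List[Dict[str, str]]]
) -> Tuple[Dict[str, str], Dict[str, str], Dict[str, List[Dict[str, str]]]]:
    """Split ordered source lists into primary and fallback views."""
    workflows: Dict[str, str] = {}
    job_prefixes: Dict[str, str] = {}
    fallbacks: Dict[str, List[Dict[str, str]]] = {}
    for test_set, sources in arch_cfg.items():
        if not sources:
            continue
        primary, *rest = sources  # first entry is the primary source
        workflows[test_set] = primary["workflow"]
        job_prefixes[test_set] = primary["job_prefix"]
        if rest:
            fallbacks[test_set] = rest  # later entries are fallbacks
    return workflows, job_prefixes, fallbacks


workflows, job_prefixes, fallbacks = flat_views_sketch(EXAMPLE_ARCH)
```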

### Validation (rebased branch, new config schema)
- `mi350`, real trunk SHA `5a5e50f0f8aa…`:
https://github.com/ethanwee1/pytorch/actions/runs/27965062040 —
**passed** end-to-end. The config was loaded from the restructured JSON
(`Using ROCm workflows: {'default': 'trunk', 'distributed':
'periodic-rocm-mi350', 'inductor': 'trunk'}`), the always-`rocm.gpu`
filter pulled the mi350 artifacts, and a full ~20 MB mi350 report was
produced.
## Summary
- Add a scheduled parity auto-trigger that scans completed
`pytorch/pytorch` main `trunk.yml` pushes and dispatches `parity.yml`
once per ready upstream SHA.
- Gate dispatch on the ROCm arch workflows that actually ran for a SHA,
plus the CUDA jobs consumed by parity, so partial reports are avoided.
- Add a `pull_request` dry-run path with a smaller scan window to
validate the scanner without creating parity reports from PR CI.

## How it works
- The workflow runs every 10 minutes and queries recent completed
`pytorch/pytorch` `trunk.yml` push runs on `main`. Those trunk runs
provide the candidate upstream SHAs to evaluate.
- For each candidate SHA, it first checks recent `ROCm/pytorch`
`parity.yml` run titles. If any existing parity run already contains
that SHA, the SHA is skipped so we keep one report per upstream commit.
- Each run dispatches `parity.yml` at most 50 times, which is
comfortably above the maximum number of [commits to the `main` branch of
pytorch/pytorch in any 10-minute
interval](https://pytorchci.grafana.net/public-dashboards/bcce3e849d73451c9106ad9733990b9d).
- It then lists all upstream workflow runs for that SHA and determines
which ROCm arches actually ran. Missing periodic arch workflows are not
treated as pending work; only arches with matching workflow files are
expected in that report.
- For the arches that did run, it lists upstream check-runs and waits
for the matching ROCm test shards to reach `status=completed`. It also
waits for the CUDA default, distributed, and inductor check-runs
consumed by parity.
- Auxiliary shards such as `mem_leak_check` and `rerun_disabled_tests`
are ignored because the parity report does not consume them.
- Once all relevant ROCm and CUDA check-runs are complete, it dispatches
`parity.yml` with the ready arch list and a CSV prefix containing the
upstream SHA, for example `autoparity-YYYYMMDD-<sha>`.
- Pull request runs are forced to `dry_run=true`, so they exercise the
scanner and log would-be dispatches without creating reports. Scheduled
and manually dispatched runs can create real parity reports.
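
The per-SHA readiness check can be sketched as below. The check-run records use the `name` and `status` fields of the GitHub check-runs API; the regex patterns, the rule that a pattern with no matching shards counts as not ready, and the name `sha_is_ready` are assumptions. The real workflow implements its gating in embedded shell.

```python
import re
from typing import Dict, Iterable, List

# Auxiliary shards that the parity report does not consume.
IGNORED_TAGS = ("mem_leak_check", "rerun_disabled_tests")


def sha_is_ready(check_runs: List[Dict[str, str]], patterns: Iterable[str]) -> bool:
    """True when every pattern has matching shards and all are completed."""
    for pattern in patterns:
        matcher = re.compile(pattern)
        relevant = [
            run
            for run in check_runs
            if matcher.search(run["name"])
            and not any(tag in run["name"] for tag in IGNORED_TAGS)
        ]
        # Assumption: no matching shards yet means the SHA is still pending.
        if not relevant:
            return False
        if any(run["status"] != "completed" for run in relevant):
            return False
    return True
```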

## Test plan
- Validated workflow YAML and embedded shell locally with
`yaml.BaseLoader` and `bash -n`.
- PR dry-run workflow succeeded:
https://github.com/ROCm/pytorch/actions/runs/26039732579
- Full non-dry-run workflow_dispatch succeeded:
https://github.com/ROCm/pytorch/actions/runs/26041358738
- The full run used `dry_run=false`, scanned 20 recent upstream trunk
runs, skipped SHAs with pending parity check-runs, dispatched 5 ready
SHAs, and stopped at `max_dispatches=5`.
- Dispatched parity reports all completed successfully:
  - `d76e83ef` / `mi355`:
    https://github.com/ROCm/pytorch/actions/runs/26041518406
  - `457e1890` / `mi355`:
    https://github.com/ROCm/pytorch/actions/runs/26041528996
  - `60f38508` / `mi355`:
    https://github.com/ROCm/pytorch/actions/runs/26041541647
  - `d1d96569` / `mi355`:
    https://github.com/ROCm/pytorch/actions/runs/26041551854
  - `6e3cf2e4` / `mi355, mi300, mi200`:
    https://github.com/ROCm/pytorch/actions/runs/26041618237

> Note: the validation runs above predate the upstream `mi355` ->
`mi350` arch rename; the workflow now reads the renamed regexes from
`parity_job_config.json` (introduced in #3311). Fresh post-rename
dry-runs on this re-stacked branch:
> - workflow_dispatch dry-run (`archs=mi350`):
https://github.com/ROCm/pytorch/actions/runs/27644407327
> - pull_request dry-run (auto, from the re-stack push):
https://github.com/ROCm/pytorch/actions/runs/27642728760

## Stacked on #3278 / #3311
This PR now contains only `parity-auto.yml`. The shared
`parity_job_config.json` it reads comes from #3311, and
`download_testlogs` / `parity.yml` from #3278. Stack/merge order: #3311
-> #3278 -> #3231.

## Dispatch cadence note
- The full validation run used `max_dispatches=5` only to avoid flooding
ROCm/pytorch during manual testing.
- The production scheduled workflow runs every 10 minutes and defaults
to `max_dispatches=50`, `max_commits=200`, and `max_age_hours=72` unless
manually overridden.



---

## Update: rebased onto updated #3278 + latest review round (@jithunnair-amd)

#3311 is merged; this PR is rebased onto the updated #3278 (its tip is
`ce39907`). Changes from the latest round:

- **Auto-parity now owns its `parity.yml` hook** — per the "separate
concerns" note, the `auto_triggered` input and the `autoparity-`
run-name prefix were moved out of #3278 and into this PR, since this is
where auto-parity is introduced.
- **Dropped the `archs` workflow input** — auto-parity is trunk-scoped
(mi350 is the only ROCm arch that rides along in `trunk.yml`), so the
arch scope is fixed and is no longer a dispatch variable to reason
about.
- **Workflow regex is derived, not stored** — matching the #3278 schema
restructure, the per-arch upstream-workflow regex is built from the
union of each arch's `workflow` values in `parity_job_config.json`; the
`workflow_regex` field no longer exists. The `checkrun_regex` map is
unchanged.
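
A sketch of deriving a workflow regex from the union of an arch's `workflow` values. Whether the real regex is anchored, and whether it matches workflow names or file paths, is not stated here; full-name matching and the function name are assumptions.

```python
import re
from typing import Dict, List, Pattern


def derived_workflow_regex(arch_cfg: Dict[str, List[Dict[str, str]]]) -> Pattern[str]:
    """Build one alternation from every distinct `workflow` value."""
    names = sorted(
        {source["workflow"] for sources in arch_cfg.values() for source in sources}
    )
    # Escape each name so characters such as "." or "-" match literally.
    return re.compile("^(?:" + "|".join(re.escape(name) for name in names) + ")$")
```

Deriving the pattern keeps it in step with the config: adding a fallback workflow to an arch's list extends the regex without a second edit.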

### Validation
- `pull_request`-style dry-run on the re-stacked branch:
https://github.com/ethanwee1/pytorch/actions/runs/27964717929 —
**passed**. It loaded the config, built the derived `Arch->workflows`
regex map, and ran the per-SHA readiness gating without dispatching
(dry-run).
Split out of #3210 (3 of 3) for easier review.

Surfaces test failure messages inline in the parity summary
(`generate_summary.py`):

- Extract the XML failure message per failed test and add it as a
per-row collapsible **Error Message** column in the FAILED TESTS table.
- Cap individual messages and, crucially, **budget total message size
against GitHub's 1 MiB step-summary limit** so a run with many large
tracebacks doesn't drop the entire summary. Only the longest messages
are clipped, and only when a run would otherwise exceed the budget; the
full text always stays in the CSV artifact.
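
One way to clip only the longest messages is to look for a single cap such that the clipped total fits the budget. The sketch below illustrates that idea; it counts characters rather than encoded bytes, and `clip_to_budget` is a hypothetical name, not the logic in `generate_summary.py`.

```python
from typing import List


def clip_to_budget(messages: List[str], budget: int) -> List[str]:
    """Truncate only the longest messages so the total fits `budget`."""
    lengths = sorted(len(m) for m in messages)
    if sum(lengths) <= budget:
        return list(messages)  # under budget: nothing is clipped

    remaining = budget
    cap = 0
    count = len(lengths)
    for index, length in enumerate(lengths):
        # Equal share of what is left for this and all longer messages.
        share = remaining // (count - index)
        if length <= share:
            remaining -= length  # short message kept whole
        else:
            cap = share  # every message from here on is clipped to this
            break
    return [m if len(m) <= cap else m[:cap] for m in messages]
```

With lengths 10, 10 and 100 and a budget of 60, the two short messages are untouched and the long one is cut to 40 characters.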

Independent of the other two split PRs (touches only
`generate_summary.py`).

Made with [Cursor](https://cursor.com)
Split out of #3210 (2 of 3) for easier review.

`inductor.test_cpu_repro` is blocklisted on ROCm upstream, so it never
runs there and would otherwise show up as falsely **MISSED** against the
CUDA baseline (~791 phantom MISSED rows on a recent mi350 run). Add it
to `EXCLUDED_TEST_SUITES` so it is dropped from the ROCm-vs-CUDA
comparison.

Verified against a real mi350 run: removes all `inductor.test_cpu_repro`
rows; MISSED gap drops ~1,294 -> ~503 and DEFAULT disagreement 3.03% ->
2.77% (SKIPPED unchanged).

Made with [Cursor](https://cursor.com)

---------

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
The DISAGREE/AGREE metric counted a ROCm SKIPPED or MISSED test as a
disagreement whenever CUDA merely did not SKIP it - which includes the
large set of attention-backend parametrization variants CUDA never even
enumerates (CUDA MISSED). Those are not real ROCm-vs-CUDA coverage gaps and
inflated DISAGREE to ~3% (AGREE ~97%).

Count a disagreement only when CUDA actually PASSED the test (s2 == PASSED)
in both compute_test_config_stats and compute_overall_stats, and relabel the
two line items accordingly ('PASSED on <set2>'). This matches the triage
spreadsheet's adjusted definition; AGREE% becomes ~99%.
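
The change in the predicate can be sketched as follows. Status strings are taken from the text above; the function names are hypothetical and the real code lives in `compute_test_config_stats` and `compute_overall_stats`.

```python
ROCM_GAP_STATUSES = ("SKIPPED", "MISSED")


def is_disagreement_old(status_rocm: str, status_cuda: str) -> bool:
    """Previous rule: CUDA merely did not SKIP the test."""
    return status_rocm in ROCM_GAP_STATUSES and status_cuda != "SKIPPED"


def is_disagreement_new(status_rocm: str, status_cuda: str) -> bool:
    """New rule: CUDA actually PASSED the test."""
    return status_rocm in ROCM_GAP_STATUSES and status_cuda == "PASSED"
```

A variant that CUDA never enumerates (`MISSED` on both sides) counted under the old rule and does not count under the new one.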
ethanwee1 added a commit to ethanwee1/pytorch that referenced this pull request Jun 24, 2026
Count a ROCm SKIPPED/MISSED test as a disagreement only when CUDA actually PASSED it (s2 == PASSED), excluding parametrization variants CUDA never runs. AGREE% becomes ~99%, matching the triage spreadsheet. Mirrors ROCm#3371.
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 24, 2026

Jenkins build for 7f5f023dd5f5dc4cfc6480a335e9392b33d762ec commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

jithunnair-amd pushed a commit that referenced this pull request Jun 26, 2026
#3375)

Two user-reported bugs in the parity report.

## 1. LOG-BASED FAILURES duplication
A test could be listed **twice** in the LOG-BASED FAILURES table — once
as `FAILED` and once as `CONSISTENT_FAILURE` — because
`detect_log_failures.py` emits both a per-test `FAILED` record and a
`FAILED CONSISTENTLY` record for the same test, and
`generate_summary.py` never deduped within `log_failures` (it only
filtered against the XML FAILED TESTS table).

**Fix:** add a shared `_select_rocm_log_failures()` helper (used by both
the CSV and markdown renderers) that filters out XML-failed tests and
**dedupes per test, preferring `CONSISTENT_FAILURE`** — so a test is
never shown as both.

Example that regressed: `mi350 · default · test_sparse ·
TestSparseMaskedReductionsCUDA ·
test_future_empty_dim_masked_prod_cuda_complex128` showed as both; it
now shows only `CONSISTENT_FAILURE`.
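
A sketch of the filter-then-dedupe step. The record keys (`test_file`, `test_class`, `test_name`, `category`) are assumed for illustration, and `dedupe_log_failures` is not the real `_select_rocm_log_failures()`.

```python
from typing import Dict, List, Set, Tuple

TestKey = Tuple[str, str, str]


def dedupe_log_failures(
    records: List[Dict[str, str]], xml_failed: Set[TestKey]
) -> List[Dict[str, str]]:
    """Drop XML-failed tests, then keep one record per test."""
    chosen: Dict[TestKey, Dict[str, str]] = {}
    for record in records:
        key = (record["test_file"], record["test_class"], record["test_name"])
        if key in xml_failed:
            continue  # already listed in the FAILED TESTS table
        previous = chosen.get(key)
        prefer_new = (
            record["category"] == "CONSISTENT_FAILURE"
            and previous is not None
            and previous["category"] != "CONSISTENT_FAILURE"
        )
        if previous is None or prefer_new:
            chosen[key] = record
    return list(chosen.values())
```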

## 2. Empty Job ID columns in FAILED TESTS
`download_testlogs._shorten_unzipped_dirs` extracted the upstream CI job
id with `re.search(r'_(\d{6,})\.zip$', ...)`, but the **unzipped** shard
dirs end in `_<jobid>` with no `.zip` (e.g.
`unzipped-...gfx950.2_83371682957`). So the `_<jobid>` suffix was
dropped, dirs became `test-default-1-8`, and `summarize_xml`'s
`parse_xml_reports_as_dict` (`_(\d+)$`) couldn't recover the id ->
`job_url` was empty for every row -> the FAILED TESTS "Job ID" columns
were always blank.

**Fix:** make `.zip` optional: `r'_(\d{6,})(?:\.zip)?$'` (still matches
the older `.zip` form).
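
The two patterns can be compared directly. The job id is the one quoted above; the directory prefix `unzipped-example-shard` is a made-up stand-in for the elided real name.

```python
import re

OLD_JOB_ID = re.compile(r'_(\d{6,})\.zip$')
NEW_JOB_ID = re.compile(r'_(\d{6,})(?:\.zip)?$')

# Hypothetical names; only the trailing `_<jobid>` part matters here.
unzipped_dir = "unzipped-example-shard.2_83371682957"
zip_name = "example-shard.2_83371682957.zip"

old_on_dir = OLD_JOB_ID.search(unzipped_dir)  # no ".zip" suffix: no match
new_on_dir = NEW_JOB_ID.search(unzipped_dir)  # matches the bare suffix
new_on_zip = NEW_JOB_ID.search(zip_name)      # older ".zip" form still matches
```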

## Validation
- Both files `py_compile` / parse clean.
- Dedup: ran `_select_rocm_log_failures` on a real `log_failures` CSV —
the test that had both `FAILED` and `CONSISTENT_FAILURE` collapses to a
single `CONSISTENT_FAILURE` row (0 keys left with multiple categories).
- Job id: the new regex extracts `83371682957` from the real
`unzipped-...` dir name, and the resulting
`test-distributed-1-3_83371682957` parses correctly, so `job_url`
populates. End-to-end `Job ID` population will be confirmed by the next
parity run after merge.

Independent of #3370 / #3371 (those touch a different section of
`generate_summary.py`).

Made with [Cursor](https://cursor.com)

## Validation

Validated on a run of **this PR branch** (develop + both fixes) for the
exact SHA that exhibited both bugs:

- `206283a284889f3a4f62510f8b5a8068d41121c4 · mi350`:
https://github.com/ROCm/pytorch/actions/runs/28190615761
- **Bug 1:**
`test_sparse::TestSparseMaskedReductionsCUDA::test_future_empty_dim_masked_prod_cuda_complex128`
is now listed **once** as `CONSISTENT_FAILURE` (was both FAILED +
CONSISTENT_FAILURE); 0 keys with multiple categories across the
LOG-BASED table.
- **Bug 2:** FAILED TESTS table has **24 populated Job ID links**; in
the status CSV `job_url_rocm` went 0 -> 308,793 and `job_url_cuda` 0 ->
354,344 non-empty.

Also applied to `ethanwee1/pytorch@main` (`fe21d4e`); the fork run
(https://github.com/ethanwee1/pytorch/actions/runs/28189128743) confirms
Bug 1. Note: the fork's `summarize_xml_testreports.py` predates
develop's `_wf_run_id`->`job_url` feature, so Job IDs there populate
once the fork syncs that from `develop` (Bug 2's fix is the
`download_testlogs` regex, which is on the fork).