
fix(ci): apply per-ref amdgpu exclusions in multi-arch PyTorch CI #6134

Closed
rahulc-gh wants to merge 3 commits into main from users/rahulc/pytorch-gfxarch-exclusion

Conversation

@rahulc-gh

Contributor

Summary

Enable gfx125x builds for PyTorch release/2.11 now that upstream support has landed, and fix multi-arch CI so per-ref AMDGPU family exclusions are applied the same way they already are in the release workflow.

Motivation

configure_pytorch_release_matrix.py maintains per-PyTorch-ref AMDGPU family exclusions. Some PyTorch release branches do not yet support certain GPU families (e.g. gfx125x), so those families are filtered out of the build matrix for those refs.

The multi-arch release workflow (multi_arch_release_linux_pytorch_wheels.yml) already uses this script and applies exclusions correctly. The multi-arch CI workflow (multi_arch_ci_linux.yml) did not — it used a hardcoded PyTorch ref matrix and passed the full dist_amdgpu_families list to every build.

This caused CI failures when gfx125x was added to the distribution families: release/2.10 was built with gfx125x even though that ref still excludes it. Example: https://github.com/ROCm/TheRock/actions/runs/28133800797/job/83339718163?pr=6122

Technical Details

1. Enable gfx125x for PyTorch release/2.11

Remove the exclude_amdgpu_families: {"gfx125x"} entry for release/2.11 in configure_pytorch_release_matrix.py, since gfx125x support is now upstreamed via ROCm/pytorch#3346.
Exclusions remain in place for refs that do not yet support gfx125x:

| PyTorch ref | gfx125x |
| --- | --- |
| release/2.9 | excluded |
| release/2.10 | excluded |
| release/2.11 | included |
| release/2.12 | excluded |
| nightly | excluded |
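
The per-ref filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code in configure_pytorch_release_matrix.py: the dictionary name, the function name, and the `gfx94x` family are assumptions made for the example; only the refs and the gfx125x status come from the table.

```python
# Hypothetical exclusion table mirroring the rows above. The real script's
# data structure and names may differ.
EXCLUDED_FAMILIES_BY_REF = {
    "release/2.9": {"gfx125x"},
    "release/2.10": {"gfx125x"},
    "release/2.11": set(),  # gfx125x is now included for this ref
    "release/2.12": {"gfx125x"},
    "nightly": {"gfx125x"},
}


def filter_families(pytorch_ref, families):
    """Return the families to build for one ref, preserving input order."""
    # Refs without an entry are assumed to have no exclusions.
    excluded = EXCLUDED_FAMILIES_BY_REF.get(pytorch_ref, set())
    return [family for family in families if family not in excluded]


if __name__ == "__main__":
    # "gfx94x" is only a placeholder for some other distribution family.
    dist_families = ["gfx94x", "gfx125x"]
    for ref in EXCLUDED_FAMILIES_BY_REF:
        print(ref, filter_families(ref, dist_families))
```

Keeping the exclusions in one table keyed by ref means a change like the one in this step is a one-line edit to the `release/2.11` entry.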

2. Apply per-ref exclusions in multi-arch CI

Update multi_arch_ci_linux.yml to match the release workflow pattern:

  • Add a setup_pytorch_matrix job that runs configure_pytorch_release_matrix.py
  • Drive build_pytorch_wheel_fat from the generated matrix (matrix.include) instead of a hardcoded ref list
  • Pass per-ref amdgpu_families from the matrix output rather than the unfiltered dist_amdgpu_families
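
As a rough sketch of what a matrix-generating step might produce for the job above: one row per (ref, Python version) with the already-filtered family list, serialized as a single `name=value` line of the kind a step appends to the file named by `GITHUB_OUTPUT`. The row keys (`pytorch_git_ref`, `python_version`, `amdgpu_families`), the semicolon-joined family string, and the helper names are assumptions for illustration; only the `pytorch_matrix` output name appears in this PR.

```python
import json


def build_matrix(refs, python_versions, families, exclusions):
    """Build one matrix row per (ref, python version) with filtered families."""
    rows = []
    for ref in refs:
        excluded = exclusions.get(ref, set())
        kept = [family for family in families if family not in excluded]
        for python_version in python_versions:
            rows.append(
                {
                    # Key names are hypothetical, chosen for this sketch.
                    "pytorch_git_ref": ref,
                    "python_version": python_version,
                    "amdgpu_families": ";".join(kept),
                }
            )
    return rows


def format_github_output(name, rows):
    """Format rows as a single-line `name=<json>` step output entry."""
    return f"{name}={json.dumps(rows)}"


if __name__ == "__main__":
    rows = build_matrix(
        refs=["release/2.10", "release/2.11"],
        python_versions=["3.12"],
        families=["gfx94x", "gfx125x"],  # "gfx94x" is a placeholder family
        exclusions={"release/2.10": {"gfx125x"}},
    )
    print(format_github_output("pytorch_matrix", rows))
```

A downstream job can then read such an output with `fromJSON(...)` and use it as `matrix.include`, so each build receives its own `amdgpu_families` value instead of the unfiltered list.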

3. Add --pytorch-refs CLI option

Extend configure_pytorch_release_matrix.py with an optional --pytorch-refs flag so CI can limit the matrix to release/2.10;release/2.11;release/2.12 (with py3.12) while still applying per-ref family filtering. The release workflow continues to use all configured refs by default.
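
One way such a flag could be wired up with `argparse` is sketched below. The parsing helper, the error handling for unknown refs, and the choice to keep the configured order are assumptions for illustration, not necessarily what the script in this PR does.

```python
import argparse


def parse_refs(value):
    """Split a semicolon-separated ref list, dropping empty entries."""
    return [ref.strip() for ref in value.split(";") if ref.strip()]


def make_parser():
    parser = argparse.ArgumentParser(description="Sketch of a matrix CLI")
    parser.add_argument(
        "--pytorch-refs",
        type=parse_refs,
        default=None,  # None means "use every configured ref"
        help="Semicolon-separated refs, e.g. release/2.10;release/2.11",
    )
    return parser


def select_refs(configured, requested):
    """Limit configured refs to the requested ones, keeping configured order."""
    if requested is None:
        return list(configured)
    unknown = [ref for ref in requested if ref not in configured]
    if unknown:
        raise ValueError(f"Unknown PyTorch refs: {unknown}")
    return [ref for ref in configured if ref in requested]


if __name__ == "__main__":
    args = make_parser().parse_args()
    configured = ["release/2.9", "release/2.10", "release/2.11", "release/2.12", "nightly"]
    print(select_refs(configured, args.pytorch_refs))
```

Leaving the default as `None` keeps the release workflow's behavior unchanged when the flag is omitted, while CI can pass a narrower list.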

4. Unit tests

Add configure_pytorch_release_matrix_test.py covering:

  • release/2.10 excludes gfx125x
  • release/2.11 includes gfx125x
  • --pytorch-refs correctly limits matrix rows
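
The three cases listed above might look roughly like this as `unittest` tests. To stay self-contained, the sketch tests a small stand-in function rather than the real script, so `matrix_rows`, its arguments, and the `gfx94x` placeholder family are all hypothetical.

```python
import unittest

# Stand-in exclusions for the sketch; the real values live in the script.
EXCLUSIONS = {"release/2.10": {"gfx125x"}, "release/2.11": set()}


def matrix_rows(families, refs=None):
    """Return {ref: filtered families} for the requested (or all) refs."""
    selected = list(EXCLUSIONS) if refs is None else refs
    return {
        ref: [f for f in families if f not in EXCLUSIONS.get(ref, set())]
        for ref in selected
    }


class ExclusionMatrixTest(unittest.TestCase):
    FAMILIES = ["gfx94x", "gfx125x"]  # "gfx94x" is a placeholder family

    def test_release_2_10_excludes_gfx125x(self):
        rows = matrix_rows(self.FAMILIES)
        self.assertNotIn("gfx125x", rows["release/2.10"])

    def test_release_2_11_includes_gfx125x(self):
        rows = matrix_rows(self.FAMILIES)
        self.assertIn("gfx125x", rows["release/2.11"])

    def test_pytorch_refs_limits_rows(self):
        rows = matrix_rows(self.FAMILIES, refs=["release/2.11"])
        self.assertEqual(list(rows), ["release/2.11"])
```

Saved as a module, a file like this can be run with `python -m unittest <module>`, the same way as the command in the test plan below.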

Test plan

  • python -m unittest build_tools.github_actions.tests.configure_pytorch_release_matrix_test
  • Multi-arch CI PyTorch build for release/2.10 completes without gfx125x
  • Multi-arch CI PyTorch build for release/2.11 includes gfx125x
  • Multi-arch release workflow behavior unchanged (still uses full matrix from configure_pytorch_release_matrix.py)

Related

rahulc-gh and others added 2 commits June 24, 2026 22:02
gfx125x support has been upstreamed via ROCm/pytorch#3346, so remove
the exclusion from the release/2.11 matrix entry to enable multiarch
builds for that branch.
Multi-arch CI was passing the full dist_amdgpu_families list to every
PyTorch release ref, unlike the release workflow which filters families
via configure_pytorch_release_matrix.py.

Co-authored-by: Cursor <cursoragent@cursor.com>
@rahulc-gh rahulc-gh added the gfx125x Issue/PR relates to gfx125x family label Jun 25, 2026
Fix pre-commit black failures in configure_pytorch_release_matrix.py and
its unit test.

Co-authored-by: Cursor <cursoragent@cursor.com>

@ScottTodd ScottTodd left a comment

Member

Please let's not duplicate work. #6030 is tracking this and I already assigned myself.

Comment on lines +291 to +314
  # Per-ref amdgpu family exclusions (e.g. gfx125x on release/2.10) are
  # applied via configure_pytorch_release_matrix.py, matching the release
  # workflow in multi_arch_release_linux_pytorch_wheels.yml.
  setup_pytorch_matrix:
    name: Setup PyTorch CI Matrix
    if: ${{ fromJSON(inputs.build_config).build_pytorch == true }}
    runs-on: ubuntu-24.04
    outputs:
      pytorch_matrix: ${{ steps.matrix.outputs.pytorch_matrix }}
    steps:
      - name: Checkout
        uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
        with:
          repository: ${{ inputs.repository || github.repository }}
          ref: ${{ inputs.ref }}

      - name: Generate PyTorch matrix
        id: matrix
        run: |
          python ./build_tools/github_actions/configure_pytorch_release_matrix.py \
            --python-versions="3.12" \
            --platform=linux \
            --pytorch-refs="release/2.10;release/2.11;release/2.12" \
            --amdgpu-families="${{ fromJSON(inputs.build_config).dist_amdgpu_families }}"

Member

I already have a draft PR for this which follows CI conventions: #6082 (this belongs up a level in the entry point CI configuration code)

Contributor Author

Will close this one then; need to land this one though before 5 pm - https://github.com/ROCm/TheRock/actions/runs/28133800797/job/83339717736?pr=6122

Base automatically changed from users/rahulc/enable-gfx125x-pytorch_rel211 to main June 25, 2026 18:09
@therock-pr-bot

❌ PR Check — Action Required

Check Status Details
🌿 Branch Name ✅ Pass
📝 PR Title/Description ❌ Fail Error: PR description must reference a JIRA ID or ISSUE ID.
Expected: include a JIRA ID or ISSUE ID line. The separator may be : or - (or omitted), and the value can be a JIRA key, a number (with or without #), or a link. Accepted examples:
JIRA ID : TESTAUTO-6039
JIRA ID - #330
JIRA ID #330
ISSUE ID : TESTUTO-3334
ISSUE ID #3334
ISSUE ID - TESTAUTO-3433
ISSUE ID : https://github.com/<org_name>/<repo_name>/issues/1234
Current: no valid JIRA/ISSUE reference found
Forbidden Files ✅ Pass
🧪 Unit Test ✅ Pass
🔎 pre-commit ⏳ Pending ⏳ Still running…
🚫 Draft PR 🔜 To Be Enabled
🚩 Feature Flag 🔜 To Be Enabled
📊 Code Coverage 🔜 To Be Enabled

⚠️ 1 policy check(s) failed. Please address the issues above before this PR can be Reviewed.

🚫 Please fix the failed policies

  • ❌ PR Title/Description

The Not ready to Review label was added to this PR. Once all policies pass, the label is removed automatically.

📖 Need help? See the Policy FAQ for details on every check and how to fix failures.

@therock-pr-bot therock-pr-bot Bot added the Not ready to Review PR has unresolved policy failures — reviews blocked label Jun 25, 2026
@therock-pr-bot

🚫 Please fix the failed policies before requesting reviews.

The following policy checks failed:

  • ❌ PR Title/Description

The Not ready to Review label has been added to this PR.
Once all policies pass, the label will be removed automatically.

@rahulc-gh

Contributor Author

Duplicate changes: #6082

@rahulc-gh rahulc-gh closed this Jun 25, 2026
@github-project-automation github-project-automation Bot moved this from TODO to Done in TheRock Triage Jun 25, 2026

Labels

gfx125x - Issue/PR relates to gfx125x family
Not ready to Review - PR has unresolved policy failures, reviews blocked

Projects

Status: Done

2 participants