fix(ci): apply per-ref amdgpu exclusions in multi-arch PyTorch CI #6134
rahulc-gh wants to merge 3 commits into
gfx125x support has been upstreamed via ROCm/pytorch#3346, so remove the exclusion from the release/2.11 matrix entry to enable multiarch builds for that branch.
Multi-arch CI was passing the full dist_amdgpu_families list to every PyTorch release ref, unlike the release workflow which filters families via configure_pytorch_release_matrix.py. Co-authored-by: Cursor <cursoragent@cursor.com>
Fix pre-commit black failures in configure_pytorch_release_matrix.py and its unit test. Co-authored-by: Cursor <cursoragent@cursor.com>
  # Per-ref amdgpu family exclusions (e.g. gfx125x on release/2.10) are
  # applied via configure_pytorch_release_matrix.py, matching the release
  # workflow in multi_arch_release_linux_pytorch_wheels.yml.
  setup_pytorch_matrix:
    name: Setup PyTorch CI Matrix
    if: ${{ fromJSON(inputs.build_config).build_pytorch == true }}
    runs-on: ubuntu-24.04
    outputs:
      pytorch_matrix: ${{ steps.matrix.outputs.pytorch_matrix }}
    steps:
      - name: Checkout
        uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6.0.3
        with:
          repository: ${{ inputs.repository || github.repository }}
          ref: ${{ inputs.ref }}

      - name: Generate PyTorch matrix
        id: matrix
        run: |
          python ./build_tools/github_actions/configure_pytorch_release_matrix.py \
            --python-versions="3.12" \
            --platform=linux \
            --pytorch-refs="release/2.10;release/2.11;release/2.12" \
            --amdgpu-families="${{ fromJSON(inputs.build_config).dist_amdgpu_families }}"
I already have a draft PR for this which follows CI conventions: #6082 (this belongs up a level in the entry point CI configuration code)
Will close this one then. Need to land this one though before 5 pm: https://github.com/ROCm/TheRock/actions/runs/28133800797/job/83339717736?pr=6122
❌ PR Check: Action Required
📖 Need help? See the Policy FAQ for details on every check and how to fix failures.
🚫 Please fix the failed policies before requesting reviews. The following policy checks failed:
Duplicate changes: #6082
Summary
Enable gfx125x builds for PyTorch release/2.11 now that upstream support has landed, and fix multi-arch CI so per-ref AMDGPU family exclusions are applied the same way they already are in the release workflow.
Motivation
configure_pytorch_release_matrix.py maintains per-PyTorch-ref AMDGPU family exclusions. Some PyTorch release branches do not yet support certain GPU families (e.g. gfx125x), so those families are filtered out of the build matrix for those refs.
The multi-arch release workflow (multi_arch_release_linux_pytorch_wheels.yml) already uses this script and applies exclusions correctly. The multi-arch CI workflow (multi_arch_ci_linux.yml) did not — it used a hardcoded PyTorch ref matrix and passed the full dist_amdgpu_families list to every build.
This caused CI failures when gfx125x was added to the distribution families: release/2.10 was built with gfx125x even though that ref still excludes it. Example: https://github.com/ROCm/TheRock/actions/runs/28133800797/job/83339718163?pr=6122
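The per-ref filtering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of configure_pytorch_release_matrix.py: the `EXCLUDED_FAMILIES` table, the `build_matrix` function, the row keys, and the `gfx94x` family name are all hypothetical, and the exclusion sets are chosen only to mirror the release/2.10 and release/2.11 cases discussed in this PR.

```python
from typing import Iterable, Optional

# Hypothetical per-ref exclusion table (illustrative values only; the real
# table lives in configure_pytorch_release_matrix.py and may differ).
EXCLUDED_FAMILIES: dict[str, set[str]] = {
    "release/2.10": {"gfx125x"},
    "release/2.11": set(),
}


def build_matrix(
    families: Iterable[str],
    refs: Optional[Iterable[str]] = None,
) -> list[dict[str, str]]:
    """Return one matrix row per ref, dropping that ref's excluded families.

    If `refs` is None, every ref in the (hypothetical) table is used.
    """
    family_list = list(families)
    selected = list(refs) if refs is not None else list(EXCLUDED_FAMILIES)
    rows = []
    for ref in selected:
        excluded = EXCLUDED_FAMILIES.get(ref, set())
        kept = [f for f in family_list if f not in excluded]
        if kept:  # skip refs that have no family left to build
            rows.append({"pytorch_git_ref": ref, "amdgpu_families": ";".join(kept)})
    return rows


# Restrict the matrix to two refs, as a semicolon-separated flag value might.
rows = build_matrix(["gfx94x", "gfx125x"], refs="release/2.10;release/2.11".split(";"))
for row in rows:
    print(row)
```

With a table like this, the release/2.10 row carries only the families that ref supports, while release/2.11 receives the full list, which is the behavior the multi-arch CI workflow was missing.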
Technical Details
1. Enable gfx125x for PyTorch release/2.11
Remove the exclude_amdgpu_families: {"gfx125x"} entry for release/2.11 in configure_pytorch_release_matrix.py, since gfx125x support is now upstreamed via ROCm/pytorch#3346.
Exclusions remain in place for refs that do not yet support gfx125x:
release/2.9, release/2.10, release/2.11, release/2.12, nightly
2. Apply per-ref exclusions in multi-arch CI
Update multi_arch_ci_linux.yml to match the release workflow pattern:
- setup_pytorch_matrix job that runs configure_pytorch_release_matrix.py
- build_pytorch_wheel_fat from the generated matrix (matrix.include) instead of a hardcoded ref list
- amdgpu_families from the matrix output rather than the unfiltered dist_amdgpu_families
3. Add --pytorch-refs CLI option
Extend configure_pytorch_release_matrix.py with an optional --pytorch-refs flag so CI can limit the matrix to release/2.10;release/2.11;release/2.12 (with py3.12) while still applying per-ref family filtering. The release workflow continues to use all configured refs by default.
4. Unit tests
Add configure_pytorch_release_matrix_test.py covering:
- release/2.10 excludes gfx125x
- release/2.11 includes gfx125x
- --pytorch-refs correctly limits matrix rows
Test plan
- python -m unittest build_tools.github_actions.tests.configure_pytorch_release_matrix_test
- release/2.10 completes without gfx125x
- release/2.11 includes gfx125x
- configure_pytorch_release_matrix.py)
Related