Skip to content

[release/2.11] Add support for gfx1250#3346

Merged
pragupta merged 21 commits into
release/2.11from
release/2.11_gfx1250
Jun 24, 2026
Merged

[release/2.11] Add support for gfx1250#3346
pragupta merged 21 commits into
release/2.11from
release/2.11_gfx1250

Conversation

@pragupta

@pragupta pragupta commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Refiled version of #3307

gfx942 Validation: https://github.com/ROCm/TheRock/actions/runs/28071362663

image image

Majority of the failures are either coming from missing g++ or symm mem issues. Rest all are accuracy issues, which shouldn't block this PR.

gfx1250 Validation: Core UTs ran on hardware
image

These 7 failures are being tracked via JIRAs (miopen issues). AOTriton one has been mitigated by an AOTriton bump in this PR.

@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 19, 2026

Copy link
Copy Markdown

Jenkins build for 3ba3ca4e6a059c829396a81b95d836ab5ae64329 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 19, 2026

Copy link
Copy Markdown

Jenkins build for 649c1a151fb0c711bb9598b41de41fe587a42382 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Comment thread aten/src/ATen/native/cuda/int4mm.cu Outdated
Comment thread aten/src/ATen/native/cuda/int4mm.cu Outdated
Comment thread aten/src/ATen/native/cuda/int4mm.cu Outdated
Comment thread aten/src/ATen/Context.cpp Outdated
Comment thread aten/src/ATen/native/sparse/cuda/cuSPARSELtOps.cpp
Comment thread aten/src/ATen/native/transformers/hip/flash_attn/ck/mha_fwd_ck.hip Outdated
Comment thread aten/src/ATen/native/transformers/hip/flash_attn/ck/mha_bwd_ck.hip Outdated
@glen-amd

Copy link
Copy Markdown

Missing Wave32 inductor heuristics in torch/_inductor/runtime/triton_heuristics.py.

The PR base (release/2.11) still has the hardcoded _num_warps Wave64 halving and no _device_warp_size_for_heuristics(). GFX1250 is Wave32 which this is the single most important codegen correctness/perf change for the part. Without it, Triton configs for gfx1250 are generated with the Wave64 assumption (half the warps), which is wrong.

Reference triton_heuristics.py at pytorch/pytorch@main...glen-amd:pytorch:initial_amd_gfx1250_support_2

@glen-amd

Copy link
Copy Markdown

Missing the 320 KB shared-memory plumbing (Ref. torch/cuda/_utils.py at pytorch/pytorch@main...glen-amd:pytorch:initial_amd_gfx1250_support_2).

The PR base still has 65536 if gcnArchName != "gfx950" else 160*1024, so on gfx1250 set_shared_memory_config caps at 64 KB, which is wrong for gfx1250.

Moreover, reference the :sramecc+:xnack- suffix-stripping as well.

@glen-amd

Copy link
Copy Markdown

Missing test gating (common_cuda.py, common_utils.py, common_distributed.py, test_linalg.py).

Reference test-support gates (evaluate_platform_supports_fp8, _mx_gemm, _fp8_grouped_gemm, GFX1250_ARCH, symm-mem exclusion, blaslt device list) from pytorch/pytorch@main...glen-amd:pytorch:initial_amd_gfx1250_support_2.

Without the gating, the very features this PR enables for gfx1250 (MX GEMM, FP8, hipSparseLt, scaled_mm) won't be exercised by the test suite on gfx1250, and some FA/symm-mem tests may try to run on a part where the kernels aren't built. At minimum port evaluate_platform_supports_mx_gemm/_fp8 and the symm-mem exclusion.

Comment thread aten/src/ATen/cuda/CublasHandlePool.cpp
Comment thread aten/src/ATen/native/cuda/GroupedBlas.cpp Outdated
Comment thread .ci/docker/build.sh Outdated
Comment thread aten/src/ATen/native/cuda/CUDALoops.cuh Outdated
Comment thread aten/src/ATen/native/cuda/MemoryAccess.cuh Outdated
Comment thread torch/testing/_internal/common_distributed.py Outdated
Comment thread test/test_cuda.py Outdated
Comment thread cmake/Dependencies.cmake Outdated
Comment thread aten/src/ATen/native/transformers/hip/flash_attn/ck/fav_v3/CMakeLists.txt Outdated
Comment thread aten/src/ATen/native/transformers/hip/flash_attn/ck/launch_kernel_pt.hpp Outdated
Comment thread aten/src/ATen/native/transformers/cuda/sdp_utils.cpp
Comment thread test/test_linalg.py Outdated
Comment thread torch/_inductor/utils.py
rraminen and others added 3 commits June 23, 2026 19:23
This change backs out gfx1250 specific additions that were unnecessary,
incorrect, or harmful for non–gfx1250 builds.

- **CI**: Drop `gfx1250` from the default multi-arch list in
`.ci/docker/build.sh` and `.ci/pytorch/build.sh` .
- **CK FMHA codegen**: Remove unconditional `--targets gfx1250` from
`generate.py` invocations for non–gfx1250 builds.
- **Allocator**: Remove the `__gfx1250__` branch in
`c10/core/AllocatorConfig.h` and keep the original code.
- **Kernels**: Drop gfx1250 from in `KernelUtils.cuh` as we have `#if
ROCM_VERSION < 60400` condition in place.
- **Scaled BLAS / hipSparseLt**: Only treat **gfx1250** as supported
where it matches the rest of the stack—**ROCm ≥ 7.2** .
- int4mm.cu: add the two missing semicolons after the gfx1250 TORCH_CHECK
  calls, and correct the CDNA2_OR_LATER comment (gfx90a/942/950 are Wave64,
  not "Warp-32").
- Context.cpp: restore gfx90a in ckSupported()'s arch list; MI200 CK support
  is unrelated to the gfx1250 expansion and should not have been dropped.
- common_distributed.py: revert evaluate_platform_supports_symm_mem() to
  gfx942/gfx950 only.
- test_cuda.py: use a 320KB shared-memory limit for gfx1250 (gfx950 stays at
  160KB), matching the values in torch/cuda/_utils.py.
- test_linalg.py: match gfx1250 exactly in the BLASLt check instead of the
  broader gfx125 prefix.
- .ci/docker/build.sh: add gfx1250 to PYTORCH_ROCM_ARCH for the ROCm CI image.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
composable_kernel has no gfx1250 support, so its CK GEMM and CK SDPA
kernels fail to compile when targeting that arch. Previously the only
lever was the global USE_ROCM_CK_GEMM / USE_ROCM_CK_SDPA options, so
disabling CK for gfx1250 also disabled it for every other arch in the
same (commonly multi-arch) build.

Instead, build the CK kernels for every requested arch except gfx1250 by
filtering --offload-arch on the CK targets, and only fall back to fully
disabling CK when gfx1250 is the sole arch.

1. aten/src/ATen/CMakeLists.txt
   - gfx1250-only guard: when gfx1250 is the only entry in
     PYTORCH_ROCM_ARCH there is no supported arch left to build CK for,
     so turn USE_ROCM_CK_GEMM and USE_ROCM_CK_SDPA off (via
     caffe2_update_option, which writes the cache and is therefore seen
     by caffe2/CMakeLists.txt). This runs before the CK blocks below.
   - ck_sdpa: filter --offload-arch to drop gfx1250.
   - CK GEMM: move the ck*.hip / bgemm_kernels sources out of the main
     torch_hip sources into a dedicated, --offload-arch-filtered ck_gemm
     library (mirrors the existing mslk/ck_sdpa pattern). The library
     gets USE_ROCM_CK_GEMM defined on the target since it is no longer a
     global flag.

2. cmake/Dependencies.cmake
   - Stop adding -DUSE_ROCM_CK_GEMM to the global HIP flags. Applied
     unconditionally here it ran before aten/src/ATen/CMakeLists.txt, so
     the gfx1250-only guard could not retract it.

3. caffe2/CMakeLists.txt
   - Apply USE_ROCM_CK_GEMM as a conditional target definition on
     torch_hip and link ck_gemm, mirroring how USE_ROCM_CK_SDPA is
     already handled. This makes the define honor the gfx1250-only guard.

Net behavior: a multi-arch build containing gfx1250 keeps CK GEMM/SDPA
for the other archs and skips only gfx1250; a gfx1250-only build cleanly
builds with CK GEMM/SDPA disabled (no empty --offload-arch compile and no
unresolved CK symbols); builds without gfx1250 are unchanged.

This change was co-authored with Claude.
@pragupta pragupta force-pushed the release/2.11_gfx1250 branch from 6172e6c to 62cced8 Compare June 23, 2026 19:24
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 23, 2026

Copy link
Copy Markdown

Jenkins build for 62cced8ff7bb3d5227723b59cd74dd07a473e914 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@pragupta

Copy link
Copy Markdown
Collaborator Author

@glen-amd - as per our offline conversation, please file a PR for wave32 changes. I am not sure where we missed a commit. But, happy to land it here if it is crucial for 450 bringup. For TDM changes, since those are performance enhancements and not blocking functionality, lets please address them separately.

naromero77amd and others added 2 commits June 23, 2026 15:03
Update the Triton version marker and point the ROCm release branch at the verified 3.7.1 ROCm Triton tip so release builds consume the local module fix.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ytorch#171129)

[Inductor] Fix constants handling for Triton constexpr (triton#8248) (pytorch#169782)

Up until recently (triton-lang/triton#8248), Triton did not explicitly interpret entries in `ASTSource.constants` , and Inductor historically treated  "non-tensor-like" arguments as `constexpr`, including runtime scalars. Triton’s updated `constexpr` handling now requires stricter semantics, so this change restricts `constants` to arguments marked `is_constexpr` rather than all “non-tensor-like” arguments.

- Fixes issue pytorch#170049.

Pull Request resolved: pytorch#169782
Approved by: https://github.com/oulgen

(cherry picked from commit 91cd44f)

Co-authored-by: Joshua James Venter <venter.joshua@gmail.com>
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 23, 2026

Copy link
Copy Markdown

Jenkins build for 8d836042d518a4e86695612a8bad795d6053c050 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

jansel and others added 3 commits June 23, 2026 15:18
User-defined Triton kernels functionalize mutations by cloning mutated arguments before launching the kernel. Allocate stride-compatible clone storage and copy logical elements so strided intermediate views do not read beyond compact Inductor temporaries.

Fixes pytorch#161115

Generated by my agent

Pull Request resolved: pytorch#184050
Approved by: https://github.com/karthickai

(cherry picked from commit 9a8a62c)
…pytorch#175021)

Triton's LLVM bump (76e2689) emits IR attributes (`nocreateundeforpoison`,`dwarfAddressSpace`) that ROCm's clang doesn't recognize, breaking multi-arch compilation. This sanitizes those two attributes from the IR before passing to clang. Other newer attributes (`memory()`, `captures()`, `range()`) are kept as ROCm 7.0 accepts them and stripping them breaks kernel ABI.

Also ensures the current device arch is always included in the multi-arch target list when `PYTORCH_ROCM_ARCH` doesn't list it.

Pull Request resolved: pytorch#175021
Approved by: https://github.com/jeffdaily

(cherry picked from commit 717898d)
Co-authored-by: Cursor <cursoragent@cursor.com>
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 23, 2026

Copy link
Copy Markdown

Jenkins build for 8d836042d518a4e86695612a8bad795d6053c050 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 23, 2026

Copy link
Copy Markdown

Jenkins build for 36c2d98f4a9e82c53581bd020eb6f4adac2da753 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 23, 2026

Copy link
Copy Markdown

Jenkins build for 75f45e135f0bd51f9743ecb9c5fd176f4b1ac1e9 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during base docker image building:

#40 357.8 6 out of 6 hunks ignored
#40 358.1 
#40 358.1 ERROR: Neither source directory 'builddir' nor build directory None contain a build file meson.build.
#40 358.1 WARNING: Running the setup command as `meson [options]` instead of `meson setup [options]` is ambiguous and deprecated.
#40 358.2 /builddir /
#40 358.2 ninja: error: loading 'build.ninja': No such file or directory
#40 358.2 /
#40 358.2 ./install_rocm_drm.sh: line 159: popd: directory stack empty
#40 ERROR: process "/bin/sh -c if [ \"${ROCM_VERSION}\" != \"nightly\" ]; then bash ./install_rocm_drm.sh /usr ; fi && rm install_rocm_drm.sh" did not complete successfully: exit code: 1
------
 > [stage-0 32/62] RUN if [ "7.2.4" != "nightly" ]; then bash ./install_rocm_drm.sh /usr ; fi && rm install_rocm_drm.sh:

Comment thread cmake/Dependencies.cmake Outdated
Comment thread torch/testing/_internal/common_cuda.py Outdated
…cy configs; Strict semantics subset relationships
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 24, 2026

Copy link
Copy Markdown

Jenkins build for 1e5a66bcedbd518a856a497d610d5b5418663647 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@pruthvistony pruthvistony left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test results are good.

@pragupta pragupta merged commit 712584b into release/2.11 Jun 24, 2026
75 of 77 checks passed
@pragupta pragupta deleted the release/2.11_gfx1250 branch June 24, 2026 18:54
pragupta added a commit that referenced this pull request Jun 24, 2026
- int4mm.cu: add the two missing semicolons after the gfx1250 TORCH_CHECK
  calls, and correct the CDNA2_OR_LATER comment (gfx90a/942/950 are Wave64,
  not "Warp-32").
- Context.cpp: restore gfx90a in ckSupported()'s arch list; MI200 CK support
  is unrelated to the gfx1250 expansion and should not have been dropped.
- common_distributed.py: revert evaluate_platform_supports_symm_mem() to
  gfx942/gfx950 only.
- test_cuda.py: use a 320KB shared-memory limit for gfx1250 (gfx950 stays at
  160KB), matching the values in torch/cuda/_utils.py.
- test_linalg.py: match gfx1250 exactly in the BLASLt check instead of the
  broader gfx125 prefix.
- .ci/docker/build.sh: add gfx1250 to PYTORCH_ROCM_ARCH for the ROCm CI image.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rahulc-gh added a commit to ROCm/TheRock that referenced this pull request Jun 25, 2026
gfx125x support has been upstreamed via ROCm/pytorch#3346, so remove the
exclusion from the release/2.11 matrix entry to enable multiarch builds
for that branch.

## Motivation

- Removes the `gfx125x` exclusion from the `release/2.11` PyTorch matrix
entry
  - gfx125x support has been upstreamed via ROCm/pytorch#3346
- This enables multiarch PyTorch builds (including gfx125x) for nightly
CI runs targeting the `release/2.11` branch

Exclusions for `release/2.9`, `release/2.10`, `release/2.12`, and
`nightly` remain in place as those branches do not have upstream
support.

ISSUE ID #5833 (partially — for
release/2.11).
  ## Unit Test
python ./build_tools/github_actions/configure_multi_arch_ci.py locally
with gfx125x label I can see -
https://github.com/ROCm/TheRock/actions/runs/28133800797/job/83316100363?pr=6122#step:4:57
   
Also There's no dedicated test file for
configure_pytorch_release_matrix.py —
the change is straightforward enough (removing a dict key) that it's
covered by the existing matrix infrastructure tests.
  ## Test plan
- [ ] Verify nightly CI triggers a `release/2.11` PyTorch build that
includes `gfx125x` in the matrix
  - [ ] Confirm other release branches still exclude `gfx125x`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants