[release/2.11] Add support for gfx1250 by pragupta · Pull Request #3346 · ROCm/pytorch

pragupta · 2026-06-19T17:16:49Z

Refiled version of #3307

gfx942 Validation: https://github.com/ROCm/TheRock/actions/runs/28071362663

Majority of the failures are either coming from missing g++ or symm mem issues. Rest all are accuracy issues, which shouldn't block this PR.

gfx1250 Validation: Core UTs ran on hardware

These 7 failures are being tracked via JIRAs (miopen issues). AOTriton one has been mitigated by an AOTriton bump in this PR.

rocm-repo-management-api · 2026-06-19T17:22:13Z

Jenkins build for 3ba3ca4e6a059c829396a81b95d836ab5ae64329 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

rocm-repo-management-api · 2026-06-19T22:25:33Z

Jenkins build for 649c1a151fb0c711bb9598b41de41fe587a42382 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

glen-amd · 2026-06-22T23:26:43Z

Missing Wave32 inductor heuristics in torch/_inductor/runtime/triton_heuristics.py.

The PR base (release/2.11) still has the hardcoded _num_warps Wave64 halving and no _device_warp_size_for_heuristics(). GFX1250 is Wave32 which this is the single most important codegen correctness/perf change for the part. Without it, Triton configs for gfx1250 are generated with the Wave64 assumption (half the warps), which is wrong.

Reference triton_heuristics.py at pytorch/pytorch@main...glen-amd:pytorch:initial_amd_gfx1250_support_2

glen-amd · 2026-06-22T23:31:00Z

Missing the 320 KB shared-memory plumbing (Ref. torch/cuda/_utils.py at pytorch/pytorch@main...glen-amd:pytorch:initial_amd_gfx1250_support_2).

The PR base still has 65536 if gcnArchName != "gfx950" else 160*1024, so on gfx1250 set_shared_memory_config caps at 64 KB, which is wrong for gfx1250.

Moreover, reference the :sramecc+:xnack- suffix-stripping as well.

glen-amd · 2026-06-22T23:36:37Z

Missing test gating (common_cuda.py, common_utils.py, common_distributed.py, test_linalg.py).

Reference test-support gates (evaluate_platform_supports_fp8, _mx_gemm, _fp8_grouped_gemm, GFX1250_ARCH, symm-mem exclusion, blaslt device list) from pytorch/pytorch@main...glen-amd:pytorch:initial_amd_gfx1250_support_2.

Without the gating, the very features this PR enables for gfx1250 (MX GEMM, FP8, hipSparseLt, scaled_mm) won't be exercised by the test suite on gfx1250, and some FA/symm-mem tests may try to run on a part where the kernels aren't built. At minimum port evaluate_platform_supports_mx_gemm/_fp8 and the symm-mem exclusion.

This change backs out gfx1250 specific additions that were unnecessary, incorrect, or harmful for non–gfx1250 builds. - **CI**: Drop `gfx1250` from the default multi-arch list in `.ci/docker/build.sh` and `.ci/pytorch/build.sh` . - **CK FMHA codegen**: Remove unconditional `--targets gfx1250` from `generate.py` invocations for non–gfx1250 builds. - **Allocator**: Remove the `__gfx1250__` branch in `c10/core/AllocatorConfig.h` and keep the original code. - **Kernels**: Drop gfx1250 from in `KernelUtils.cuh` as we have `#if ROCM_VERSION < 60400` condition in place. - **Scaled BLAS / hipSparseLt**: Only treat **gfx1250** as supported where it matches the rest of the stack—**ROCm ≥ 7.2** .

- int4mm.cu: add the two missing semicolons after the gfx1250 TORCH_CHECK calls, and correct the CDNA2_OR_LATER comment (gfx90a/942/950 are Wave64, not "Warp-32"). - Context.cpp: restore gfx90a in ckSupported()'s arch list; MI200 CK support is unrelated to the gfx1250 expansion and should not have been dropped. - common_distributed.py: revert evaluate_platform_supports_symm_mem() to gfx942/gfx950 only. - test_cuda.py: use a 320KB shared-memory limit for gfx1250 (gfx950 stays at 160KB), matching the values in torch/cuda/_utils.py. - test_linalg.py: match gfx1250 exactly in the BLASLt check instead of the broader gfx125 prefix. - .ci/docker/build.sh: add gfx1250 to PYTORCH_ROCM_ARCH for the ROCm CI image. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

composable_kernel has no gfx1250 support, so its CK GEMM and CK SDPA kernels fail to compile when targeting that arch. Previously the only lever was the global USE_ROCM_CK_GEMM / USE_ROCM_CK_SDPA options, so disabling CK for gfx1250 also disabled it for every other arch in the same (commonly multi-arch) build. Instead, build the CK kernels for every requested arch except gfx1250 by filtering --offload-arch on the CK targets, and only fall back to fully disabling CK when gfx1250 is the sole arch. 1. aten/src/ATen/CMakeLists.txt - gfx1250-only guard: when gfx1250 is the only entry in PYTORCH_ROCM_ARCH there is no supported arch left to build CK for, so turn USE_ROCM_CK_GEMM and USE_ROCM_CK_SDPA off (via caffe2_update_option, which writes the cache and is therefore seen by caffe2/CMakeLists.txt). This runs before the CK blocks below. - ck_sdpa: filter --offload-arch to drop gfx1250. - CK GEMM: move the ck*.hip / bgemm_kernels sources out of the main torch_hip sources into a dedicated, --offload-arch-filtered ck_gemm library (mirrors the existing mslk/ck_sdpa pattern). The library gets USE_ROCM_CK_GEMM defined on the target since it is no longer a global flag. 2. cmake/Dependencies.cmake - Stop adding -DUSE_ROCM_CK_GEMM to the global HIP flags. Applied unconditionally here it ran before aten/src/ATen/CMakeLists.txt, so the gfx1250-only guard could not retract it. 3. caffe2/CMakeLists.txt - Apply USE_ROCM_CK_GEMM as a conditional target definition on torch_hip and link ck_gemm, mirroring how USE_ROCM_CK_SDPA is already handled. This makes the define honor the gfx1250-only guard. Net behavior: a multi-arch build containing gfx1250 keeps CK GEMM/SDPA for the other archs and skips only gfx1250; a gfx1250-only build cleanly builds with CK GEMM/SDPA disabled (no empty --offload-arch compile and no unresolved CK symbols); builds without gfx1250 are unchanged. This change was co-authored with Claude.

rocm-repo-management-api · 2026-06-23T19:27:24Z

Jenkins build for 62cced8ff7bb3d5227723b59cd74dd07a473e914 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

pragupta · 2026-06-23T19:28:34Z

@glen-amd - as per our offline conversation, please file a PR for wave32 changes. I am not sure where we missed a commit. But, happy to land it here if it is crucial for 450 bringup. For TDM changes, since those are performance enhancements and not blocking functionality, lets please address them separately.

Update the Triton version marker and point the ROCm release branch at the verified 3.7.1 ROCm Triton tip so release builds consume the local module fix. Co-authored-by: Cursor <cursoragent@cursor.com>

…ytorch#171129) [Inductor] Fix constants handling for Triton constexpr (triton#8248) (pytorch#169782) Up until recently (triton-lang/triton#8248), Triton did not explicitly interpret entries in `ASTSource.constants` , and Inductor historically treated "non-tensor-like" arguments as `constexpr`, including runtime scalars. Triton’s updated `constexpr` handling now requires stricter semantics, so this change restricts `constants` to arguments marked `is_constexpr` rather than all “non-tensor-like” arguments. - Fixes issue pytorch#170049. Pull Request resolved: pytorch#169782 Approved by: https://github.com/oulgen (cherry picked from commit 91cd44f) Co-authored-by: Joshua James Venter <venter.joshua@gmail.com>

rocm-repo-management-api · 2026-06-23T20:12:47Z

Jenkins build for 8d836042d518a4e86695612a8bad795d6053c050 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

User-defined Triton kernels functionalize mutations by cloning mutated arguments before launching the kernel. Allocate stride-compatible clone storage and copy logical elements so strided intermediate views do not read beyond compact Inductor temporaries. Fixes pytorch#161115 Generated by my agent Pull Request resolved: pytorch#184050 Approved by: https://github.com/karthickai (cherry picked from commit 9a8a62c)

…pytorch#175021) Triton's LLVM bump (76e2689) emits IR attributes (`nocreateundeforpoison`,`dwarfAddressSpace`) that ROCm's clang doesn't recognize, breaking multi-arch compilation. This sanitizes those two attributes from the IR before passing to clang. Other newer attributes (`memory()`, `captures()`, `range()`) are kept as ROCm 7.0 accepts them and stripping them breaks kernel ABI. Also ensures the current device arch is always included in the multi-arch target list when `PYTORCH_ROCM_ARCH` doesn't list it. Pull Request resolved: pytorch#175021 Approved by: https://github.com/jeffdaily (cherry picked from commit 717898d) Co-authored-by: Cursor <cursoragent@cursor.com>

rocm-repo-management-api · 2026-06-23T20:29:48Z

Jenkins build for 8d836042d518a4e86695612a8bad795d6053c050 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

rocm-repo-management-api · 2026-06-23T22:42:10Z

Jenkins build for 36c2d98f4a9e82c53581bd020eb6f4adac2da753 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

rocm-repo-management-api · 2026-06-23T23:56:41Z

Jenkins build for 75f45e135f0bd51f9743ecb9c5fd176f4b1ac1e9 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during base docker image building:

#40 357.8 6 out of 6 hunks ignored
#40 358.1 
#40 358.1 ERROR: Neither source directory 'builddir' nor build directory None contain a build file meson.build.
#40 358.1 WARNING: Running the setup command as `meson [options]` instead of `meson setup [options]` is ambiguous and deprecated.
#40 358.2 /builddir /
#40 358.2 ninja: error: loading 'build.ninja': No such file or directory
#40 358.2 /
#40 358.2 ./install_rocm_drm.sh: line 159: popd: directory stack empty
#40 ERROR: process "/bin/sh -c if [ \"${ROCM_VERSION}\" != \"nightly\" ]; then bash ./install_rocm_drm.sh /usr ; fi && rm install_rocm_drm.sh" did not complete successfully: exit code: 1
------
 > [stage-0 32/62] RUN if [ "7.2.4" != "nightly" ]; then bash ./install_rocm_drm.sh /usr ; fi && rm install_rocm_drm.sh:

…cy configs; Strict semantics subset relationships

rocm-repo-management-api · 2026-06-24T15:27:14Z

Jenkins build for 1e5a66bcedbd518a856a497d610d5b5418663647 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

pruthvistony

Test results are good.

- int4mm.cu: add the two missing semicolons after the gfx1250 TORCH_CHECK calls, and correct the CDNA2_OR_LATER comment (gfx90a/942/950 are Wave64, not "Warp-32"). - Context.cpp: restore gfx90a in ckSupported()'s arch list; MI200 CK support is unrelated to the gfx1250 expansion and should not have been dropped. - common_distributed.py: revert evaluate_platform_supports_symm_mem() to gfx942/gfx950 only. - test_cuda.py: use a 320KB shared-memory limit for gfx1250 (gfx950 stays at 160KB), matching the values in torch/cuda/_utils.py. - test_linalg.py: match gfx1250 exactly in the BLASLt check instead of the broader gfx125 prefix. - .ci/docker/build.sh: add gfx1250 to PYTORCH_ROCM_ARCH for the ROCm CI image. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gfx125x support has been upstreamed via ROCm/pytorch#3346, so remove the exclusion from the release/2.11 matrix entry to enable multiarch builds for that branch. ## Motivation - Removes the `gfx125x` exclusion from the `release/2.11` PyTorch matrix entry - gfx125x support has been upstreamed via ROCm/pytorch#3346 - This enables multiarch PyTorch builds (including gfx125x) for nightly CI runs targeting the `release/2.11` branch Exclusions for `release/2.9`, `release/2.10`, `release/2.12`, and `nightly` remain in place as those branches do not have upstream support. ISSUE ID #5833 (partially — for release/2.11). ## Unit Test python ./build_tools/github_actions/configure_multi_arch_ci.py locally with gfx125x label I can see - https://github.com/ROCm/TheRock/actions/runs/28133800797/job/83316100363?pr=6122#step:4:57 Also There's no dedicated test file for configure_pytorch_release_matrix.py — the change is straightforward enough (removing a dict key) that it's covered by the existing matrix infrastructure tests. ## Test plan - [ ] Verify nightly CI triggers a `release/2.11` PyTorch build that includes `gfx125x` in the matrix - [ ] Confirm other release branches still exclude `gfx125x`

pragupta requested review from jataylo, jeffdaily, jithunnair-amd and pruthvistony as code owners June 19, 2026 17:16

pragupta mentioned this pull request Jun 19, 2026

[release/2.11] Add support for gfx1250 #3307

Closed