[release/2.11] Add support for gfx1250#3346
Conversation
|
Jenkins build for 3ba3ca4e6a059c829396a81b95d836ab5ae64329 commit finished as FAILURE |
|
Jenkins build for 649c1a151fb0c711bb9598b41de41fe587a42382 commit finished as FAILURE |
|
Missing Wave32 inductor heuristics in torch/_inductor/runtime/triton_heuristics.py. The PR base (release/2.11) still has the hardcoded Reference triton_heuristics.py at pytorch/pytorch@main...glen-amd:pytorch:initial_amd_gfx1250_support_2 |
|
Missing the 320 KB shared-memory plumbing (Ref. torch/cuda/_utils.py at pytorch/pytorch@main...glen-amd:pytorch:initial_amd_gfx1250_support_2). The PR base still has Moreover, reference the |
|
Missing test gating (common_cuda.py, common_utils.py, common_distributed.py, test_linalg.py). Reference test-support gates (evaluate_platform_supports_fp8, _mx_gemm, _fp8_grouped_gemm, GFX1250_ARCH, symm-mem exclusion, blaslt device list) from pytorch/pytorch@main...glen-amd:pytorch:initial_amd_gfx1250_support_2. Without the gating, the very features this PR enables for gfx1250 (MX GEMM, FP8, hipSparseLt, scaled_mm) won't be exercised by the test suite on gfx1250, and some FA/symm-mem tests may try to run on a part where the kernels aren't built. At minimum port evaluate_platform_supports_mx_gemm/_fp8 and the symm-mem exclusion. |
This change backs out gfx1250 specific additions that were unnecessary, incorrect, or harmful for non–gfx1250 builds. - **CI**: Drop `gfx1250` from the default multi-arch list in `.ci/docker/build.sh` and `.ci/pytorch/build.sh` . - **CK FMHA codegen**: Remove unconditional `--targets gfx1250` from `generate.py` invocations for non–gfx1250 builds. - **Allocator**: Remove the `__gfx1250__` branch in `c10/core/AllocatorConfig.h` and keep the original code. - **Kernels**: Drop gfx1250 from in `KernelUtils.cuh` as we have `#if ROCM_VERSION < 60400` condition in place. - **Scaled BLAS / hipSparseLt**: Only treat **gfx1250** as supported where it matches the rest of the stack—**ROCm ≥ 7.2** .
- int4mm.cu: add the two missing semicolons after the gfx1250 TORCH_CHECK calls, and correct the CDNA2_OR_LATER comment (gfx90a/942/950 are Wave64, not "Warp-32"). - Context.cpp: restore gfx90a in ckSupported()'s arch list; MI200 CK support is unrelated to the gfx1250 expansion and should not have been dropped. - common_distributed.py: revert evaluate_platform_supports_symm_mem() to gfx942/gfx950 only. - test_cuda.py: use a 320KB shared-memory limit for gfx1250 (gfx950 stays at 160KB), matching the values in torch/cuda/_utils.py. - test_linalg.py: match gfx1250 exactly in the BLASLt check instead of the broader gfx125 prefix. - .ci/docker/build.sh: add gfx1250 to PYTORCH_ROCM_ARCH for the ROCm CI image. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
composable_kernel has no gfx1250 support, so its CK GEMM and CK SDPA
kernels fail to compile when targeting that arch. Previously the only
lever was the global USE_ROCM_CK_GEMM / USE_ROCM_CK_SDPA options, so
disabling CK for gfx1250 also disabled it for every other arch in the
same (commonly multi-arch) build.
Instead, build the CK kernels for every requested arch except gfx1250 by
filtering --offload-arch on the CK targets, and only fall back to fully
disabling CK when gfx1250 is the sole arch.
1. aten/src/ATen/CMakeLists.txt
- gfx1250-only guard: when gfx1250 is the only entry in
PYTORCH_ROCM_ARCH there is no supported arch left to build CK for,
so turn USE_ROCM_CK_GEMM and USE_ROCM_CK_SDPA off (via
caffe2_update_option, which writes the cache and is therefore seen
by caffe2/CMakeLists.txt). This runs before the CK blocks below.
- ck_sdpa: filter --offload-arch to drop gfx1250.
- CK GEMM: move the ck*.hip / bgemm_kernels sources out of the main
torch_hip sources into a dedicated, --offload-arch-filtered ck_gemm
library (mirrors the existing mslk/ck_sdpa pattern). The library
gets USE_ROCM_CK_GEMM defined on the target since it is no longer a
global flag.
2. cmake/Dependencies.cmake
- Stop adding -DUSE_ROCM_CK_GEMM to the global HIP flags. Applied
unconditionally here it ran before aten/src/ATen/CMakeLists.txt, so
the gfx1250-only guard could not retract it.
3. caffe2/CMakeLists.txt
- Apply USE_ROCM_CK_GEMM as a conditional target definition on
torch_hip and link ck_gemm, mirroring how USE_ROCM_CK_SDPA is
already handled. This makes the define honor the gfx1250-only guard.
Net behavior: a multi-arch build containing gfx1250 keeps CK GEMM/SDPA
for the other archs and skips only gfx1250; a gfx1250-only build cleanly
builds with CK GEMM/SDPA disabled (no empty --offload-arch compile and no
unresolved CK symbols); builds without gfx1250 are unchanged.
This change was co-authored with Claude.
6172e6c to
62cced8
Compare
|
Jenkins build for 62cced8ff7bb3d5227723b59cd74dd07a473e914 commit finished as FAILURE |
|
@glen-amd - as per our offline conversation, please file a PR for wave32 changes. I am not sure where we missed a commit. But, happy to land it here if it is crucial for 450 bringup. For TDM changes, since those are performance enhancements and not blocking functionality, lets please address them separately. |
Update the Triton version marker and point the ROCm release branch at the verified 3.7.1 ROCm Triton tip so release builds consume the local module fix. Co-authored-by: Cursor <cursoragent@cursor.com>
…ytorch#171129) [Inductor] Fix constants handling for Triton constexpr (triton#8248) (pytorch#169782) Up until recently (triton-lang/triton#8248), Triton did not explicitly interpret entries in `ASTSource.constants` , and Inductor historically treated "non-tensor-like" arguments as `constexpr`, including runtime scalars. Triton’s updated `constexpr` handling now requires stricter semantics, so this change restricts `constants` to arguments marked `is_constexpr` rather than all “non-tensor-like” arguments. - Fixes issue pytorch#170049. Pull Request resolved: pytorch#169782 Approved by: https://github.com/oulgen (cherry picked from commit 91cd44f) Co-authored-by: Joshua James Venter <venter.joshua@gmail.com>
|
Jenkins build for 8d836042d518a4e86695612a8bad795d6053c050 commit finished as NOT_BUILT |
User-defined Triton kernels functionalize mutations by cloning mutated arguments before launching the kernel. Allocate stride-compatible clone storage and copy logical elements so strided intermediate views do not read beyond compact Inductor temporaries. Fixes pytorch#161115 Generated by my agent Pull Request resolved: pytorch#184050 Approved by: https://github.com/karthickai (cherry picked from commit 9a8a62c)
…pytorch#175021) Triton's LLVM bump (76e2689) emits IR attributes (`nocreateundeforpoison`,`dwarfAddressSpace`) that ROCm's clang doesn't recognize, breaking multi-arch compilation. This sanitizes those two attributes from the IR before passing to clang. Other newer attributes (`memory()`, `captures()`, `range()`) are kept as ROCm 7.0 accepts them and stripping them breaks kernel ABI. Also ensures the current device arch is always included in the multi-arch target list when `PYTORCH_ROCM_ARCH` doesn't list it. Pull Request resolved: pytorch#175021 Approved by: https://github.com/jeffdaily (cherry picked from commit 717898d) Co-authored-by: Cursor <cursoragent@cursor.com>
|
Jenkins build for 8d836042d518a4e86695612a8bad795d6053c050 commit finished as FAILURE |
|
Jenkins build for 36c2d98f4a9e82c53581bd020eb6f4adac2da753 commit finished as FAILURE |
|
Jenkins build for 75f45e135f0bd51f9743ecb9c5fd176f4b1ac1e9 commit finished as FAILURE Detected error during base docker image building: |
…cy configs; Strict semantics subset relationships
|
Jenkins build for 1e5a66bcedbd518a856a497d610d5b5418663647 commit finished as FAILURE |
pruthvistony
left a comment
There was a problem hiding this comment.
Test results are good.
- int4mm.cu: add the two missing semicolons after the gfx1250 TORCH_CHECK calls, and correct the CDNA2_OR_LATER comment (gfx90a/942/950 are Wave64, not "Warp-32"). - Context.cpp: restore gfx90a in ckSupported()'s arch list; MI200 CK support is unrelated to the gfx1250 expansion and should not have been dropped. - common_distributed.py: revert evaluate_platform_supports_symm_mem() to gfx942/gfx950 only. - test_cuda.py: use a 320KB shared-memory limit for gfx1250 (gfx950 stays at 160KB), matching the values in torch/cuda/_utils.py. - test_linalg.py: match gfx1250 exactly in the BLASLt check instead of the broader gfx125 prefix. - .ci/docker/build.sh: add gfx1250 to PYTORCH_ROCM_ARCH for the ROCm CI image. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
gfx125x support has been upstreamed via ROCm/pytorch#3346, so remove the exclusion from the release/2.11 matrix entry to enable multiarch builds for that branch. ## Motivation - Removes the `gfx125x` exclusion from the `release/2.11` PyTorch matrix entry - gfx125x support has been upstreamed via ROCm/pytorch#3346 - This enables multiarch PyTorch builds (including gfx125x) for nightly CI runs targeting the `release/2.11` branch Exclusions for `release/2.9`, `release/2.10`, `release/2.12`, and `nightly` remain in place as those branches do not have upstream support. ISSUE ID #5833 (partially — for release/2.11). ## Unit Test python ./build_tools/github_actions/configure_multi_arch_ci.py locally with gfx125x label I can see - https://github.com/ROCm/TheRock/actions/runs/28133800797/job/83316100363?pr=6122#step:4:57 Also There's no dedicated test file for configure_pytorch_release_matrix.py — the change is straightforward enough (removing a dict key) that it's covered by the existing matrix infrastructure tests. ## Test plan - [ ] Verify nightly CI triggers a `release/2.11` PyTorch build that includes `gfx125x` in the matrix - [ ] Confirm other release branches still exclude `gfx125x`
Refiled version of #3307
gfx942 Validation: https://github.com/ROCm/TheRock/actions/runs/28071362663
Majority of the failures are either coming from missing g++ or symm mem issues. Rest all are accuracy issues, which shouldn't block this PR.
gfx1250 Validation: Core UTs ran on hardware

These 7 failures are being tracked via JIRAs (miopen issues). AOTriton one has been mitigated by an AOTriton bump in this PR.