
[release/2.12] Add support for gfx1250 #3327

Open
rraminen wants to merge 11 commits into release/2.12 from release/2.12_gfx1250


@rraminen rraminen commented Jun 17, 2026


Add support for gfx1250

TheRock Validation: https://github.com/ROCm/TheRock/actions/runs/27717422954
Build is passing. Testing is in progress.

rraminen and others added 7 commits June 17, 2026 09:16
* CK - gfx1250 support (#5)

* Enable ROCM_CK_SDPA build

* [submodule] composable_kernel and aiter update (pytorch#172592)

Summary:
update ck to commit ROCm/composable_kernel@fcc9372

update aiter to commit ROCm/aiter@9a469a6

changes of caffe2/aten/src/ATen/CMakeLists.txt and caffe2/caffe2/CMakeLists.txt are adopted from pytorch#161759

updated caffe2/aten/src/ATen/native/transformers/hip/flash_attn/ck/launch_kernel_pt.hpp to match the ck version in https://github.com/ROCm/composable_kernel/blob/292df2719f28cd01464d5d059820684790c101da/include/ck_tile/host/kernel_launch.hpp

update aiter fav3 bwd codegen according to changes in ROCm/aiter#1573

update caffe2/aten/src/ATen/native/transformers/hip/flash_attn/ck mha fwd/bwd kernels according to the interfaces in https://github.com/ROCm/composable_kernel/tree/292df2719f28cd01464d5d059820684790c101da/example/ck_tile/01_fmha

Differential Revision: D88991877

Pull Request resolved: pytorch#172592
Approved by: https://github.com/alugorey, https://github.com/izaitsevfb

* Added MI450 support and packages

* Fix misaligned ck api

* Replace aiter with ck for bwd

* [ROCm] Bump AOTriton to 0.11.2b (pytorch#174105)

Notable new features:

* AOTriton 0.11.2b adds gfx1151/1152/1153 support.
* Add precompiled AOTriton runtime for ROCM 7.2
* Match the sliding window attention behavior of `_flash_attention_forward/backward` with CUTLASS backend.

Bug fixes:

* Fixes pytorch#173204. Now all tests in `test/test_varlen_attention.py` are enabled on ROCm

Notes:

This replaces PR pytorch#173820 and pytorch#173469

Pull Request resolved: pytorch#174105
Approved by: https://github.com/jeffdaily

* Fix philox data types for this version of ck

* Update CK to use new gfx1250_pytorch branch

* Add new gfx1250 compile flags for CK

* add --targets to generate and a couple new compile flags

* Remove default USE_ROCM_CK_SDPA

---------

Co-authored-by: blorange-amd <bo.li2@amd.com>
Co-authored-by: Yu Guo <yuguo@meta.com>
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>

* Updated aiter module

* Fixed merged error

* Fixed additional merged error

* Reset USE_ROCM_CK_SDPA config

---------

Co-authored-by: LugoReyes, Andy <Andy.LugoReyes@amd.com>
Co-authored-by: Yu Guo <yuguo@meta.com>
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Fix `torch.arange` (and the other range factories sharing this kernel) for very large outputs on ROCm.

`torch.arange(N)` with `N >= 2^32` fails on ROCm because `hipLaunchKernel` does not support `gridDim.x * blockDim.x >= 2^32`, which the per-thread kernel previously used in `aten/src/ATen/native/cuda/RangeFactories.cu` needs at that size. Depending on the ROCm version, the launch either returns `hipErrorInvalidConfiguration` or is accepted silently while the kernel never executes, leaving zero-initialized output. Concrete repro: `torch.arange(2 ** 32 + 1, device="cuda", dtype=torch.int32)`.

The fix replaces the per-thread launch on the ROCm path with a grid-stride loop that fixes the grid at `sm_count * 4` blocks, so the launch limit is no longer load-bearing for correctness regardless of `N`. The non-ROCm path is untouched.
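
The difference between the two launch shapes can be sketched in plain Python. This is a host-side simulation for illustration only, not the kernel in `RangeFactories.cu`; the function names and the block size used in the usage note are assumptions, not values taken from the patch.

```python
LAUNCH_LIMIT = 2 ** 32  # total-thread limit described in the text above


def per_thread_launch_threads(n, block_size):
    """Total threads a one-element-per-thread launch needs (whole blocks)."""
    num_blocks = (n + block_size - 1) // block_size
    return num_blocks * block_size


def grid_stride_fill(n, num_blocks, block_size, start=0, step=1):
    """Simulate a grid-stride kernel computing out[i] = start + i * step."""
    out = [None] * n
    total_threads = num_blocks * block_size  # fixed, independent of n
    for tid in range(total_threads):
        # Each simulated thread starts at its global index and then
        # advances by the size of the whole grid until it passes n.
        for i in range(tid, n, total_threads):
            out[i] = start + i * step
    return out
```

With an assumed block size of 256, `per_thread_launch_threads(2 ** 32 + 1, 256)` exceeds `LAUNCH_LIMIT`, while a grid fixed at `sm_count * 4` blocks keeps the thread count small for any `N` and `grid_stride_fill` still writes every element.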

On MI250X the grid-stride kernel matches the per-thread kernel within noise at `N=1024` and is 24-60% faster from `N=1M` up across `int32`, `int64`, and `float32`.

On MI300X the grid-stride kernel matches within noise at `N=1024` and `N=1M`, and is 2-5x faster from `N=64M` up across `int32`, `int64`, and `float32`.

The 64-bit-indexing test is extended to also cover `N = 2^32 + 1` and `N = 2^33 + 1` on ROCm when memory permits.

Pull Request resolved: pytorch#182657
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
* TDM on release/2.11 for bring-up based on careful selection

* Triton commit: Upstream fe0c38b5262c0447fed6df0d37e02cb8ea75deb4 -> AMD-ROCm-Internal Triton 250bb5d5b821377f49dc2d83d87ded75b952f0f7. Consequence: Triton TDM support may be missing.

* Refinement according to reviewers' comments

* Added/modified UT cases; NUM_STAGES issue of ineffectiveness

* A couple of changes to related UTs

* Got rid of configs like `waves_per_cu=2`
- Need to turn MSLK on for mi300 and mi350
- Need to turn CK off for gfx1250

# Bump to AOTriton 0.12.50tp

Notable new features:

* Enable gfx1250

## Features from AOTriton 0.12b

Notable new features:

* **BREAKING** Varlen LSE tensor shape changes to (H, Total_seqlen)
* Support head_dim != head_dim_v
* Support `use_deterministic_algorithms`
* Support seqused_k in test/test_varlen_attention.py
* gfx1100 and gfx1151 promoted out of experimental
* Partial FAv3 support on gfx950

Bug Fixes:

* GQA kernel failed to read bias tensor with the right offset.

Known Issues:

* gfx950's Triton kernel has problems handling the forward pass for hdim=16, as well as the backward pass for hdim=48/80.
* Disables gfx90a's CK SDPA support due to a GPU segfault.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Prachi Gupta <pracgupt@amd.com>
@rraminen rraminen requested a review from jeffdaily as a code owner June 17, 2026 17:43
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 17, 2026


Jenkins build for aeb64a7497d08b2da400801c9340834bd6bde3f1 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during Pytorch building:

[5809/8035] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/torch/csrc/jit/frontend/error_report.cpp.o
[5810/8035] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/DeviceAccelerator.cpp.o
[5811/8035] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/core/DeprecatedTypeProperties.cpp.o
[5812/8035] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/LegacyVmapTransforms.cpp.o
[5813/8035] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Context.cpp.o
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Context.cpp.o 
/opt/cache/bin/sccache /opt/cache/bin/c++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DENABLE_IPC_FABRIC -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAS_ROCTRACER -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_POSIX_FALLOCATE=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DKINETO_NAMESPACE=libkineto -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_VERSION=70204 -DTORCH_HIP_VERSION=702 -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_LAYERNORM_FAST_RECIPROCAL -DUSE_ROCM -DUSE_RPC -DUSE_TENSORPIPE -DXNN_LOG_LEVEL=0 -D_FILE_OFFSET_BITS=64 -D__HIP_PLATFORM_AMD__ -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/var/lib/jenkins/pytorch/nlohmann -I/var/lib/jenkins/pytorch/moodycamel -I/var/lib/jenkins/pytorch/torch/csrc/api -I/var/lib/jenkins/pytorch/torch/csrc/api/include -I/var/lib/jenkins/pytorch/caffe2/aten/src/TH -I/var/lib/jenkins/pytorch/build/caffe2/aten/src/TH -I/var/lib/jenkins/pytorch/build/caffe2/aten/src -I/var/lib/jenkins/pytorch/build/caffe2/../aten/src -I/var/lib/jenkins/pytorch/torch/csrc -I/var/lib/jenkins/pytorch/torch/headeronly -I/var/lib/jenkins/pytorch/third_party/miniz-3.0.2 -I/var/lib/jenkins/pytorch/third_party/kineto/libkineto/include -I/var/lib/jenkins/pytorch/third_party/kineto/libkineto/src -I/var/lib/jenkins/pytorch/third_party/cpp-httplib -I/var/lib/jenkins/pytorch/aten/src/ATen/.. -I/var/lib/jenkins/pytorch/third_party/FXdiv/include -I/var/lib/jenkins/pytorch/c10/.. 
-I/var/lib/jenkins/pytorch/third_party/pthreadpool/include -I/var/lib/jenkins/pytorch/third_party/cpuinfo/include -I/var/lib/jenkins/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/pytorch/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/pytorch/third_party/NNPACK/include -I/var/lib/jenkins/pytorch/third_party/FP16/include -I/var/lib/jenkins/pytorch/third_party/tensorpipe -I/var/lib/jenkins/pytorch/build/third_party/tensorpipe -I/var/lib/jenkins/pytorch/third_party/tensorpipe/third_party/libnop/include -I/var/lib/jenkins/pytorch/third_party/fmt/include -I/var/lib/jenkins/pytorch/build/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/src/../include -I/var/lib/jenkins/pytorch/third_party/onnx -I/var/lib/jenkins/pytorch/build/third_party/onnx -I/var/lib/jenkins/pytorch/third_party/flatbuffers/include -isystem /opt/rocm-7.2.4/include -isystem /var/lib/jenkins/pytorch/build/third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/pytorch/third_party/protobuf/src -isystem /opt/conda/envs/py_3.12/include -isystem /var/lib/jenkins/pytorch/third_party/XNNPACK/include -isystem /var/lib/jenkins/pytorch/third_party/ittapi/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/eigen -isystem /opt/rocm/include -isystem /var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/pytorch/third_party/ideep/include -isystem /var/lib/jenkins/pytorch/INTERFACE -isystem /var/lib/jenkins/pytorch/third_party/nlohmann/include -isystem 
/var/lib/jenkins/pytorch/third_party/concurrentqueue -isystem /var/lib/jenkins/pytorch/build/include -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOXPUPTI=ON -DUSE_MSLK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -std=gnu++20 -fPIC -fdiagnostics-color=always -DMKL_HAS_SBGEMM -DMKL_HAS_SHGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Wall -Wextra -Wdeprecated -Wunused -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wredundant-move -Wno-interference-size -Wno-maybe-uninitialized -fvisibility=hidden -pthread -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Context.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Context.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Context.cpp.o -c /var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp
/var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp: In static member function ‘static bool at::Context::ckSupported()’:
/var/lib/jenkins/pytorch/aten/src/ATen/Context.cpp:508:1: error: version control conflict marker in file
  508 | <<<<<<< xinyazhang/backport-aotriton-0.12b-2.12_gfx1250
      | ^~~~~~~

Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 17, 2026


Jenkins build for bbecc5657577e15f0c0aa057daf34b4e4be41c31 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 17, 2026


Jenkins build for 04b54055439beb8a156f244c3fd3cdb9e31a1d3b commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

This PR addresses the review comments on PR #3307
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 17, 2026


Jenkins build for 43a62ac6cc17a57487826c1b7b0f6e7cf96a43c1 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

Comment thread test/inductor/test_max_autotune.py Outdated
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 18, 2026


Jenkins build for 43a62ac6cc17a57487826c1b7b0f6e7cf96a43c1 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

# generate a list of kernels, but not actually emit files at config stage
execute_process(
COMMAND python3 ${CMAKE_SOURCE_DIR}/third_party/composable_kernel/example/ck_tile/01_fmha/generate.py
COMMAND python3 ${CMAKE_SOURCE_DIR}/third_party/composable_kernel/example/ck_tile/01_fmha/generate.py --targets gfx1250


Here, --targets gfx1250 is appended to every generate.py invocation, unconditionally.

This restricts CK FMHA blob generation to gfx1250 for all builds, including pure gfx942/gfx950 builds. A PYTORCH_ROCM_ARCH=gfx942 build will now emit only gfx1250 kernels and lose its own FMHA code objects.

Suggested fix: derive --targets from PYTORCH_ROCM_ARCH (filtered to CK-supported archs), or drop the flag entirely and keep the generator's default multi-target behavior.
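
The first option in the suggested fix could look roughly like the following Python sketch. The allowlist `CK_SUPPORTED_ARCHS`, the helper name `ck_generate_args`, and the comma-joined value passed to `--targets` are all assumptions made for illustration; the thread does not state which archs CK supports or how `generate.py` parses multiple targets.

```python
# Hypothetical allowlist for illustration only; the real set of
# CK-supported architectures is not stated in this thread.
CK_SUPPORTED_ARCHS = ("gfx90a", "gfx942", "gfx950")


def ck_generate_args(pytorch_rocm_arch, supported=CK_SUPPORTED_ARCHS):
    """Return extra generate.py arguments derived from PYTORCH_ROCM_ARCH.

    pytorch_rocm_arch is assumed to be a semicolon-separated arch list.
    """
    requested = [arch.strip() for arch in pytorch_rocm_arch.split(";")]
    targets = [arch for arch in requested if arch and arch in supported]
    if not targets:
        # Nothing CK can build for: add no flag and let the caller decide
        # whether to disable CK or keep the generator's default behavior.
        return []
    # Assumption: multiple targets are passed as one comma-joined value.
    return ["--targets", ",".join(targets)]
```

Under these assumptions a `gfx942` build requests only `gfx942` kernels, and a mixed `gfx942;gfx950;gfx1250` build requests `gfx942,gfx950` instead of being narrowed to gfx1250.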

Collaborator

CK is turned off for gfx1250 and is not a priority at the moment. We can probably just drop this.

constexpr size_t kSmallSize = 1048576;
// allocations between 1 and 10 MiB may use kLargeBuffer
constexpr size_t kMinLargeAlloc = 10485760;
#if defined(USE_ROCM) && defined(__gfx1250__)


This is host-side allocator code, and __gfx1250__ is a device-compilation predefine. As a result, would this #if block ever be compiled in?

auto persistent_counter = mk_atomictensor(is_causal ? atomic_counter.data_ptr<int32_t>() : nullptr);
hipError_t err; // TODO: Error handling
if constexpr (AOTRITON_ALWAYS_V3_API) { // Better readability than nesting ifdef
#if AOTRITON_V3_API // if constexpr does not stop errors from undefined functions


Removing the code block below makes AOTriton v3 a hard switch and drops the v2 fallback without a build guard. Would this introduce a portability risk?

@rraminen rraminen Jun 19, 2026

Author

Hi @xinyazhang, could you please help me address this review comment regarding the cherry-pick of aeb64a7?

Author

This change is already in upstream. pytorch@5e3cb3e


Ultimately, 0.12.50tp is 0.12b with gfx1250 support (and yes, it is ABI compatible). The related PR is also 0.12b's PR plus a version bump from 0.12b to 0.12.50tp.

#if ROCM_VERSION >= 70000
TORCH_CHECK_NOT_IMPLEMENTED(at::detail::getCUDAHooks().isGPUArch({"gfx950"}),
"Block-wise scaling for Float8_e8m0fnu is only supported on gfx950");
TORCH_CHECK_NOT_IMPLEMENTED(at::detail::getCUDAHooks().isGPUArch({"gfx950", "gfx1250"}),


Above, _scaled_mm_allowed_device() (line ~82) gates gfx1250 at >= 70200. So on ROCm 7.0/7.1 the device is rejected by _scaled_mm_allowed_device, yet these inner checks would have admitted it.

How about nesting #if ROCM_VERSION >= 70200 inside each isGPUArch({...})?

try {
if (at::cuda::device_count() > 0) {
g_hipSparseLtSupported = at::detail::getCUDAHooks().isGPUArch({"gfx950", "gfx942"}, 0);
g_hipSparseLtSupported = at::detail::getCUDAHooks().isGPUArch({"gfx950", "gfx942", "gfx1250"}, 0);


Can we confirm whether hipSparseLT requires ROCm 7.2+?
gfx1250 is advertised unconditionally here, which might fail deeper.

Author

Yes, hipsparselt actually requires ROCm >= 7.12. A PR is in progress: pytorch#178737

CK_USE_GFX94
#CK_USE_FNUZ_FP8
#CK_USE_GFX94
CK_USE_GFX1250


The changes here and below alter the CK SDPA compile definitions globally.
Are the CK/AITER artifacts for gfx1250 actually ready and validated?

Comment thread cmake/Dependencies.cmake

# composable_kernel has no gfx1250 support, so its CK GEMM/SDPA kernels fail
# to compile for that arch.
if("gfx1250" IN_LIST PYTORCH_ROCM_ARCH)


This disables both USE_ROCM_CK_GEMM and USE_ROCM_CK_SDPA whenever PYTORCH_ROCM_ARCH contains gfx1250.
Would this break mixed-arch builds such as gfx942;gfx950;gfx1250?

Collaborator

Yes, this is highly problematic for multi-arch builds. Could we follow the same approach we use for other sub-components of the PyTorch build, with a temporary HIP_CLANG_FLAGS override?


This was solved on release/2.11 in #3346 (merged). That PR moves the logic out of Dependencies.cmake and into aten/src/ATen/CMakeLists.txt:

# composable_kernel lacks gfx1250 support. CK GEMM/SDPA are otherwise built for
# every arch except gfx1250 (the --offload-arch filtering below). If gfx1250 is
# the ONLY arch there is no supported arch left to build CK for, so disable both
# entirely here. caffe2_update_option writes the cache, so this is honored by the
# conditional CK GEMM/SDPA defines and links in caffe2/CMakeLists.txt.
if(USE_ROCM AND "gfx1250" IN_LIST PYTORCH_ROCM_ARCH)
set(_ck_supported_archs ${PYTORCH_ROCM_ARCH})
list(REMOVE_ITEM _ck_supported_archs gfx1250)
if("${_ck_supported_archs}" STREQUAL "")
message(WARNING "gfx1250 is the only arch in PYTORCH_ROCM_ARCH: disabling USE_ROCM_CK_GEMM and USE_ROCM_CK_SDPA (composable_kernel lacks gfx1250 support)")
caffe2_update_option(USE_ROCM_CK_GEMM OFF)
caffe2_update_option(USE_ROCM_CK_SDPA OFF)
endif()
endif()

// ifdef USE_ROCM_CK_GEMM is required since ROCm systems w/o CK should not call ck path.
#if defined(USE_ROCM_CK_GEMM)
if (at::globalContext().rocmAllowGroupGemmCk() && at::detail::getCUDAHooks().isGPUArch({"gfx942", "gfx950", "gfx90a"})) {
if (at::globalContext().rocmAllowGroupGemmCk() && at::detail::getCUDAHooks().isGPUArch({"gfx942", "gfx950", "gfx90a", "gfx1250"})) {


Is it true that the existing CK grouped GEMM path is the Wave64/MFMA/XDL path used by gfx90a/gfx942/gfx950? If so, because gfx1250 is Wave32 and WMMA/SWMMAC-oriented, it should probably not be routed into this path by arch-name allowlisting.

auto *from = reinterpret_cast<const vec_t *>(base_ptr);
#if defined(USE_ROCM) && defined(__gfx942__)
// Extend the non-temporal load optimization to GFX1250.
#if defined(USE_ROCM) && (defined(__gfx942__) || defined(__gfx1250__))


Would simply extending this to gfx1250 apply another Wave64-tuned path to Wave32 hardware?

Comment thread torch/cuda/_utils.py
# CDNA4 (gfx950) 160KB, and CDNA5 (gfx1250) 320KB.
if device_props.gcnArchName == "gfx950":
max_shared_mem = 160 * 1024
elif device_props.gcnArchName == "gfx1250":


gcnArchName can include feature suffixes such as gfx1250:sramecc+:xnack-. Would this exact comparison lead to unexpected fallback?
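
One way to make the lookup robust to such suffixes is to compare only the part before the first colon. The sketch below is a suggestion, not the code in `torch/cuda/_utils.py`; `base_arch`, `max_shared_mem_for`, and the caller-supplied `default` are hypothetical names, and the two sizes are the ones quoted in the comment under review.

```python
# Sizes quoted in the code comment under review (bytes).
SHARED_MEM_BYTES = {
    "gfx950": 160 * 1024,
    "gfx1250": 320 * 1024,
}


def base_arch(gcn_arch_name):
    """Strip feature suffixes such as ':sramecc+:xnack-' from an arch name."""
    return gcn_arch_name.split(":", 1)[0]


def max_shared_mem_for(gcn_arch_name, default):
    """Look up shared memory by base arch; fall back to the caller's default."""
    return SHARED_MEM_BYTES.get(base_arch(gcn_arch_name), default)
```

With this normalization, `gfx1250:sramecc+:xnack-` and plain `gfx1250` resolve to the same entry, and unknown archs keep whatever default the caller already uses.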

Comment thread .ci/pytorch/build.sh

Collaborator

These changes are unnecessary unless we know for certain that some build workflow would use them. TheRock build workflows don't.

Comment thread .ci/docker/build.sh

Collaborator

These changes are unnecessary unless we know for certain that some build workflow would use them. TheRock build workflows don't.

__device__ inline __hip_bfloat162 preview_unsafeAtomicAdd(__hip_bfloat162* address, __hip_bfloat162 value) {
#if (defined(__gfx942__)) && \
// `__gfx1250__`-specific `s_wait_loadcnt(0)` path for committed store already there
#if (defined(__gfx942__) || defined(__gfx1250__)) && \

Collaborator

Does this change matter now, if the outer condition is #if ROCM_VERSION < 60400?

Author

Addressed in #3347

@pragupta pragupta changed the title [release/2.12] Support for gfx1250 [release/2.12] Add support for gfx1250 Jun 19, 2026

8 participants