[release/2.11] Add support for gfx1250 by rraminen · Pull Request #3307 · ROCm/pytorch

rraminen · 2026-06-15T22:13:34Z

This PR is to add PyTorch support on gfx1250 hardware

…odule + triton pins)

* CK - gfx1250 support (pytorch#5) * Enable ROCM_CK_SDPA build * [submodule] composable_kernel and aiter update (pytorch#172592) Summary: update ck to commit ROCm/composable_kernel@fcc9372 update aiter to commit ROCm/aiter@9a469a6 changes of caffe2/aten/src/ATen/CMakeLists.txt and caffe2/caffe2/CMakeLists.txt are adopted from pytorch#161759 updated caffe2/aten/src/ATen/native/transformers/hip/flash_attn/ck/launch_kernel_pt.hpp to match the ck version in https://github.com/ROCm/composable_kernel/blob/292df2719f28cd01464d5d059820684790c101da/include/ck_tile/host/kernel_launch.hpp update aiter fav3 bwd codegen according to changes in ROCm/aiter#1573 update caffe2/aten/src/ATen/native/transformers/hip/flash_attn/ck mha fwd/bwd kernels according to the interfaces in https://github.com/ROCm/composable_kernel/tree/292df2719f28cd01464d5d059820684790c101da/example/ck_tile/01_fmha Differential Revision: D88991877 Pull Request resolved: pytorch#172592 Approved by: https://github.com/alugorey, https://github.com/izaitsevfb * Added MI450 supports and packages * Fix misalinged ck api * Replace aiter with ck for bwd * [ROCm] Bump AOTriton to 0.11.2b (pytorch#174105) Notable new features: * AOTriton 0.11.2b adds gfx1151/1152/1153 support. * Add precompiled AOTriton runtime for ROCM 7.2 * Match the sliding window attention behavior of `_flash_attention_forward/backward` with CUTLASS backend. Bug fixes: * Fixes pytorch#173204. Now all tests in `test/test_varlen_attention.py` are enabled on ROCm Notes: This replaces PR pytorch#173820 and pytorch#173469 Pull Request resolved: pytorch#174105 Approved by: https://github.com/jeffdaily * Fix philox data types for this version of ck * Update CK to use new gfx1250_pytorch branch * Add new gfx1250 compile flags for CK * add --targets to generate and a couple new compile flags * Remove default USE_ROCM_CK_SDPA --------- Co-authored-by: blorange-amd <bo.li2@amd.com> Co-authored-by: Yu Guo <yuguo@meta.com> Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com> * Updated aiter module * Fixed merged error * Fixed additional merged error * Reset USE_ROCM_CK_SDPA config --------- Co-authored-by: LugoReyes, Andy <Andy.LugoReyes@amd.com> Co-authored-by: Yu Guo <yuguo@meta.com> Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>

Fix `torch.arange` (and the other range factories sharing this kernel) for very large outputs on ROCm. `torch.arange(N)` with `N >= 2^32` fails on ROCm because `hipLaunchKernel` does not support `gridDim.x * blockDim.x >= 2^32` for the per-thread kernel `aten/src/ATen/native/cuda/RangeFactories.cu` previously used. Depending on the ROCm version the launch returns `hipErrorInvalidConfiguration` or is accepted silently with the kernel never executing, leaving zero-initialized output. Concrete repro: `torch.arange(2 ** 32 + 1, device="cuda", dtype=torch.int32)`. The fix replaces the per-thread launch on the ROCm path with a grid-stride loop that fixes the grid at `sm_count * 4` blocks, so the launch limit is no longer load-bearing for correctness regardless of `N`. The non-ROCm path is untouched. On MI250X the grid-stride kernel matches the per-thread kernel within noise at `N=1024` and is 24-60% faster from `N=1M` up across `int32`, `int64`, and `float32`. On MI300X the grid-stride kernel matches within noise at `N=1024` and `N=1M`, and is 2-5x faster from `N=64M` up across `int32`, `int64`, and `float32`. The 64-bit-indexing test is extended to also cover `N = 2^32 + 1` and `N = 2^33 + 1` on ROCm when memory permits. Pull Request resolved: pytorch#182657 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>

* TDM on release/2.11 for bring-up based on careful selection * Triton commit: Upstream fe0c38b5262c0447fed6df0d37e02cb8ea75deb4 -> AMD-ROCm-Internal Triton 250bb5d5b821377f49dc2d83d87ded75b952f0f7; Consequence: Triton TDM support may miss. * Refinement according to reviewers' comments * Added/modified UT cases; NUM_STAGES issue of ineffectiveness * A couple of changes to related UTs * Got rid of configs like `waves_per_cu=2`

rocm-repo-management-api · 2026-06-15T22:22:41Z

Jenkins build for 1bc755ac30d6a8b5c183eafc823b8fe307a77417 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

rocm-repo-management-api · 2026-06-15T23:38:05Z

Jenkins build for 9c8cbb3cecb9f73c443f28830a60226c2bf4e260 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

jithunnair-amd · 2026-06-16T20:35:27Z

@@ -1 +1 @@
-3.6.0
+3.7.0


@jataylo / @iupaikov-amd to signoff on this (and the corresponding triton.txt change), since this is moving to a newer major version of triton, so just need confidence on compatibility for non-gfx1250 archs.

Ugh.. I somehow mis-read yesterday morning. I thought we were already using 3.7.0 in release/2.11. 3.6 is too old to be compatible with gfx1250.

@jataylo / @iupaikov-amd - can you please guide on which UTs we can run as a sanity testing for this triton bump?

@pragupta I have in my notes that the minimum UT suite would be test/inductor/test_torchinductor.py and test/inductor/test_max_autotune.py.

@pragupta FWIW, we normally document PyTorch-Triton compability here:
https://amd.atlassian.net/wiki/spaces/MLSE/pages/1032521014/PyTorch+-+Triton+-+Team+responsibilities

but looks like we haven't kept the page up to date.

concerned about which triton code we are pulling here.
commit - 9c610c781cb810a11bfcc9accba094550b189a5e belongs to which repo? Is it public?

Ok, it is tip of https://github.com/ROCm/triton/commits/release/internal/3.7.x/ then if the UTs pass, then ok.

pragupta · 2026-06-16T20:32:46Z

+# TODO:
+# MSLK related parts are missing that already exists upstream.
+# gfx1250 for MSLK needs to be involved as well.
+


Can we please open an issue to track this? Because 2.11 branch uses MSLK by default, we are changing that here. We can probably just turn it off and use mslk until it's ready?

Yes, that is better to turn it off.
We need to have a ROCm fork branch of MSLK for 2.11.
FBGEMM (Li Li) was checking on this we should reach out to him.

jithunnair-amd · 2026-06-16T20:47:57Z

@BLOrange-AMD @glen-amd TheRock CI flows do not use this file, so these changes would be irrelevant for it. Do you know if this file was invoked by the PyTorch NPI build flows?

jithunnair-amd · 2026-06-16T21:00:39Z

@BLOrange-AMD @glen-amd TheRock CI flows do not use this file, so these changes would be irrelevant for it. Do you know if this file was invoked by the PyTorch NPI build flows?

The only change is setting PYTORCH_ROCM_ARCH, I believe we should still keep this change for any legacy build.
In the ROCk this is controlled in the environment variable.

Even for our rocm-ci legacy builds, we used to set PYTORCH_ROCM_ARCH env variable directly from the CI job parameters IIRC. So I'm not sure why/where the build-environment-based method was being used.

jithunnair-amd · 2026-06-16T21:13:32Z

 export TORCH_PACKAGE_NAME='torch'

-export USE_FBGEMM=1
+export USE_FBGEMM=0


@rraminen @pragupta I don't think this file is used by theRock either. Do we want to do this only for gfx1250? If yes, we probably need to do this elsewhere, and only for gfx1250, since USE_FBGEMM is set to ON for release/2.11 builds in theRock today.

Maybe we use a similar strategy to cb4e545?

@jithunnair-amd ,
should this setting of USE_FBGEMM=0 be in the pytorch build workflow (github wf) of ROCk.

jithunnair-amd · 2026-06-16T21:15:53Z

Not needed at all?

jithunnair-amd · 2026-06-16T21:16:22Z

Not needed either?

For those files (not only this one), if we are sure they are not used, feel free to disregard or handle them in whatever way you think is best.

jithunnair-amd · 2026-06-16T21:21:21Z

 # generate a list of kernels, but not actually emit files at config stage
 execute_process(
-  COMMAND python3 ${CMAKE_SOURCE_DIR}/third_party/composable_kernel/example/ck_tile/01_fmha/generate.py
+  COMMAND python3 ${CMAKE_SOURCE_DIR}/third_party/composable_kernel/example/ck_tile/01_fmha/generate.py --targets gfx1250


@rraminen @pragupta This is problematic if it's only generating for gfx1250?

cc @alugorey

jithunnair-amd · 2026-06-16T21:28:27Z

     "0a4ff324bffdac0c2fde87a8a7f70563d3c84a80ad4e8f31345f2b40a1384e95" # amd-gfx120x
+     # TODO: Update when AOTriton publishes gfx1250 images.
+     # Until then, may need to set AOTRITON_INSTALL_FROM_SOURCE=1 to build from source.
+     #"0000000000000000000000000000000000000000000000000000000000000000" # amd-gfx1250


@pragupta @rraminen Does this need to use https://github.com/ROCm/aotriton/releases/tag/0.12.50tp?

Yes, that was released today only. We will update that.

cc @xinyazhang

jithunnair-amd · 2026-06-16T21:29:16Z

 # USE_ROCM is guarded against in Dependencies.cmake because USE_ROCM is not properly defined here
 cmake_dependent_option(USE_CUFILE "Use cuFile" ON "USE_CUDA AND NOT WIN32" OFF)
-option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON)
+option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" OFF)


@pragupta @rraminen We probably don't want to turn this off wholesale for 2.11, and do it only for gfx1250 instead, if needed.

jithunnair-amd · 2026-06-16T21:37:08Z

 numpy==2.4.3 ; python_version >= "3.14"
 optree==0.13.0 ; python_version < "3.14"
 optree==0.17.0 ; python_version >= "3.14"
+pandas


@xinyazhang @alugorey I think this is coming from https://github.com/AMD-ROCm-Internal/pytorch/pull/14 ... but I'm not sure why pandas has to be added as a requirement for PyTorch. Until we clearly understand if it's necessary for proper functioning of the pytorch wheel built from this branch, I wouldn't add it to requirements.txt.
Also, if we do want to add it, we should pin it as per established convention in our release branches.

jithunnair-amd

Left comments

xinyazhang · 2026-06-16T22:13:55Z

@pragupta @jithunnair-amd I've filed aotriton 0.12.50tp as a separate PR #3312

naromero77amd

Only concern is the PyTorch 2.11 -Triton 3.7 forward compatibility. Running a test_torchinductor.py and test_max_autotune.py should pick-up big incompatibilities. Maybe flex attention UTs need to be run as well?

test_torhinductor_opinfo.py is thorough, but takes a long time to run.

@jataylo @iupaikov-amd will likely have more feedback.

This PR is to address the reviewed comments on PR #3307

pragupta · 2026-06-17T21:34:55Z

@jithunnair-amd / @pruthvistony / @naromero77amd -- We moved the efforts to make gfx1250 changes public from release/2.11 branch to release/2.12 branch since we'd have to do a major triton bump leading which will add a lot of risk to our delivery to meta due this week. We've addressed all review comments from here in the 2.12 PR. Can you guys please review the 2.12 PR? #3327

jithunnair-amd · 2026-06-18T22:30:48Z

Moving this to Draft for now

glen-amd · 2026-06-19T02:07:52Z

My comments on PR#3327 (#3327) apply to this PR as well.

This PR is to address the reviewed comments on PR #3307

pragupta · 2026-06-19T17:17:49Z

2.11 gfx1250 changes will come in via: #3346 We had to create a PR from a different branch.

jithunnair-amd · 2026-06-19T17:41:19Z

Closing in favor of #3346

This PR is to address the reviewed comments on PR #3307

rraminen and others added 4 commits June 15, 2026 19:03

Cherry-pick bdbcbea8dbf09fb95685d499cd6b1de1e04fe4b0 (exclude CK subm…

a400a90

…odule + triton pins)

rraminen requested a review from jeffdaily as a code owner June 15, 2026 22:13

Bump triton version to 3.7.0

9c8cbb3

rraminen requested review from jataylo, jithunnair-amd and pruthvistony as code owners June 15, 2026 23:20

pragupta changed the title ~~Support for gfx1250~~ [release/2.11] Add support for gfx1250 Jun 16, 2026

jithunnair-amd reviewed Jun 16, 2026

View reviewed changes

pragupta reviewed Jun 16, 2026

View reviewed changes

jithunnair-amd reviewed Jun 16, 2026

View reviewed changes

Comment thread .github/workflows/inductor-rocm-gfx1250.yml

jithunnair-amd Jun 16, 2026

Copy link
Copy Markdown

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed at all?

jithunnair-amd reviewed Jun 16, 2026

View reviewed changes

jithunnair-amd requested changes Jun 16, 2026

View reviewed changes

naromero77amd reviewed Jun 16, 2026

View reviewed changes

rraminen mentioned this pull request Jun 17, 2026

[release/2.12_gfx1250] Support for gfx1250 - bug fixes #3330

Merged

pragupta pushed a commit that referenced this pull request Jun 17, 2026

[release/2.12_gfx1250] Support for gfx1250 - bug fixes (#3330)

04b5405

This PR is to address the reviewed comments on PR #3307

jithunnair-amd marked this pull request as draft June 18, 2026 22:30

pragupta pushed a commit that referenced this pull request Jun 19, 2026

[release/2.11_gfx1250] Support for gfx1250 - bug fixes (#3330)

071fe3e

This PR is to address the reviewed comments on PR #3307

pragupta pushed a commit that referenced this pull request Jun 19, 2026

[release/2.11_gfx1250] Support for gfx1250 - bug fixes (#3330)

c0e0590

This PR is to address the reviewed comments on PR #3307

jithunnair-amd closed this Jun 19, 2026

jithunnair-amd mentioned this pull request Jun 19, 2026

[release/2.11] Add support for gfx1250 #3346

Merged

pragupta pushed a commit that referenced this pull request Jun 23, 2026

[release/2.11_gfx1250] Support for gfx1250 - bug fixes (#3330)

e0fb22e

This PR is to address the reviewed comments on PR #3307

pragupta pushed a commit that referenced this pull request Jun 23, 2026

[release/2.11_gfx1250] Support for gfx1250 - bug fixes (#3330)

dcea57c

This PR is to address the reviewed comments on PR #3307

pragupta pushed a commit that referenced this pull request Jun 24, 2026

[release/2.11_gfx1250] Support for gfx1250 - bug fixes (#3330)

d614a69

This PR is to address the reviewed comments on PR #3307

		@@ -1 +1 @@
		3.6.0
		3.7.0

Uh oh!

Conversation

rraminen commented Jun 15, 2026

Uh oh!

rocm-repo-management-api Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rocm-repo-management-api Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

naromero77amd Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jithunnair-amd left a comment

Choose a reason for hiding this comment

Uh oh!

xinyazhang commented Jun 16, 2026

Uh oh!

naromero77amd left a comment

Choose a reason for hiding this comment

Uh oh!

pragupta commented Jun 17, 2026

Uh oh!

jithunnair-amd commented Jun 18, 2026

Uh oh!

glen-amd commented Jun 19, 2026

Uh oh!

pragupta commented Jun 19, 2026

Uh oh!

jithunnair-amd commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

rocm-repo-management-api Bot commented Jun 15, 2026 •

edited

Loading

rocm-repo-management-api Bot commented Jun 15, 2026 •

edited

Loading

naromero77amd Jun 16, 2026 •

edited

Loading

jithunnair-amd commented Jun 19, 2026 •

edited

Loading