Skip to content

[release/2.11] Add support for gfx1250#3307

Closed
rraminen wants to merge 5 commits into
ROCm:release/2.11from
rraminen:cherry-pick-gfx1250-rel2.11
Closed

[release/2.11] Add support for gfx1250#3307
rraminen wants to merge 5 commits into
ROCm:release/2.11from
rraminen:cherry-pick-gfx1250-rel2.11

Conversation

@rraminen

Copy link
Copy Markdown

This PR is to add PyTorch support on gfx1250 hardware

rraminen and others added 4 commits June 15, 2026 19:03
* CK - gfx1250 support (pytorch#5)

* Enable ROCM_CK_SDPA build

* [submodule] composable_kernel and aiter update (pytorch#172592)

Summary:
update ck to commit ROCm/composable_kernel@fcc9372

update aiter to commit ROCm/aiter@9a469a6

changes of caffe2/aten/src/ATen/CMakeLists.txt and caffe2/caffe2/CMakeLists.txt are adopted from pytorch#161759

updated caffe2/aten/src/ATen/native/transformers/hip/flash_attn/ck/launch_kernel_pt.hpp to match the ck version in https://github.com/ROCm/composable_kernel/blob/292df2719f28cd01464d5d059820684790c101da/include/ck_tile/host/kernel_launch.hpp

update aiter fav3 bwd codegen according to changes in ROCm/aiter#1573

update caffe2/aten/src/ATen/native/transformers/hip/flash_attn/ck mha fwd/bwd kernels according to the interfaces in https://github.com/ROCm/composable_kernel/tree/292df2719f28cd01464d5d059820684790c101da/example/ck_tile/01_fmha

Differential Revision: D88991877

Pull Request resolved: pytorch#172592
Approved by: https://github.com/alugorey, https://github.com/izaitsevfb

* Added MI450 supports and packages

* Fix misalinged ck api

* Replace aiter with ck for bwd

* [ROCm] Bump AOTriton to 0.11.2b (pytorch#174105)

Notable new features:

* AOTriton 0.11.2b adds gfx1151/1152/1153 support.
* Add precompiled AOTriton runtime for ROCM 7.2
* Match the sliding window attention behavior of `_flash_attention_forward/backward` with CUTLASS backend.

Bug fixes:

* Fixes pytorch#173204. Now all tests in `test/test_varlen_attention.py` are enabled on ROCm

Notes:

This replaces PR pytorch#173820 and pytorch#173469

Pull Request resolved: pytorch#174105
Approved by: https://github.com/jeffdaily

* Fix philox data types for this version of ck

* Update CK to use new gfx1250_pytorch branch

* Add new gfx1250 compile flags for CK

* add --targets to generate and a couple new compile flags

* Remove default USE_ROCM_CK_SDPA

---------

Co-authored-by: blorange-amd <bo.li2@amd.com>
Co-authored-by: Yu Guo <yuguo@meta.com>
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>

* Updated aiter module

* Fixed merged error

* Fixed additional merged error

* Reset USE_ROCM_CK_SDPA config

---------

Co-authored-by: LugoReyes, Andy <Andy.LugoReyes@amd.com>
Co-authored-by: Yu Guo <yuguo@meta.com>
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Fix `torch.arange` (and the other range factories sharing this kernel) for very large outputs on ROCm.

`torch.arange(N)` with `N >= 2^32` fails on ROCm because `hipLaunchKernel` does not support `gridDim.x * blockDim.x >= 2^32` for the per-thread kernel `aten/src/ATen/native/cuda/RangeFactories.cu` previously used. Depending on the ROCm version the launch returns `hipErrorInvalidConfiguration` or is accepted silently with the kernel never executing, leaving zero-initialized output. Concrete repro: `torch.arange(2 ** 32 + 1, device="cuda", dtype=torch.int32)`.

The fix replaces the per-thread launch on the ROCm path with a grid-stride loop that fixes the grid at `sm_count * 4` blocks, so the launch limit is no longer load-bearing for correctness regardless of `N`. The non-ROCm path is untouched.

On MI250X the grid-stride kernel matches the per-thread kernel within noise at `N=1024` and is 24-60% faster from `N=1M` up across `int32`, `int64`, and `float32`.

On MI300X the grid-stride kernel matches within noise at `N=1024` and `N=1M`, and is 2-5x faster from `N=64M` up across `int32`, `int64`, and `float32`.

The 64-bit-indexing test is extended to also cover `N = 2^32 + 1` and `N = 2^33 + 1` on ROCm when memory permits.

Pull Request resolved: pytorch#182657
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
* TDM on release/2.11 for bring-up based on careful selection

* Triton commit: Upstream fe0c38b5262c0447fed6df0d37e02cb8ea75deb4 -> AMD-ROCm-Internal Triton 250bb5d5b821377f49dc2d83d87ded75b952f0f7; Consequence: Triton TDM support may miss.

* Refinement according to reviewers' comments

* Added/modified UT cases; NUM_STAGES issue of ineffectiveness

* A couple of changes to related UTs

* Got rid of configs like `waves_per_cu=2`
@rraminen rraminen requested a review from jeffdaily as a code owner June 15, 2026 22:13
@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 15, 2026

Copy link
Copy Markdown

Jenkins build for 1bc755ac30d6a8b5c183eafc823b8fe307a77417 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@rocm-repo-management-api

rocm-repo-management-api Bot commented Jun 15, 2026

Copy link
Copy Markdown

Jenkins build for 9c8cbb3cecb9f73c443f28830a60226c2bf4e260 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@pragupta pragupta changed the title Support for gfx1250 [release/2.11] Add support for gfx1250 Jun 16, 2026
@@ -1 +1 @@
3.6.0
3.7.0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jataylo / @iupaikov-amd to signoff on this (and the corresponding triton.txt change), since this is moving to a newer major version of triton, so just need confidence on compatibility for non-gfx1250 archs.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh.. I somehow mis-read yesterday morning. I thought we were already using 3.7.0 in release/2.11. 3.6 is too old to be compatible with gfx1250.

@jataylo / @iupaikov-amd - can you please guide on which UTs we can run as a sanity testing for this triton bump?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pragupta I have in my notes that the minimum UT suite would be test/inductor/test_torchinductor.py and test/inductor/test_max_autotune.py.

@naromero77amd naromero77amd Jun 16, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pragupta FWIW, we normally document PyTorch-Triton compability here:
https://amd.atlassian.net/wiki/spaces/MLSE/pages/1032521014/PyTorch+-+Triton+-+Team+responsibilities

but looks like we haven't kept the page up to date.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

concerned about which triton code we are pulling here.
commit - 9c610c781cb810a11bfcc9accba094550b189a5e belongs to which repo? Is it public?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it is tip of https://github.com/ROCm/triton/commits/release/internal/3.7.x/ then if the UTs pass, then ok.

Comment thread CMakeLists.txt
# TODO:
# MSLK related parts are missing that already exists upstream.
# gfx1250 for MSLK needs to be involved as well.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please open an issue to track this? Because 2.11 branch uses MSLK by default, we are changing that here. We can probably just turn it off and use mslk until it's ready?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is better to turn it off.
We need to have a ROCm fork branch of MSLK for 2.11.
FBGEMM (Li Li) was checking on this we should reach out to him.

Comment thread .ci/pytorch/build.sh

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BLOrange-AMD @glen-amd TheRock CI flows do not use this file, so these changes would be irrelevant for it. Do you know if this file was invoked by the PyTorch NPI build flows?

Comment thread .ci/docker/build.sh

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BLOrange-AMD @glen-amd TheRock CI flows do not use this file, so these changes would be irrelevant for it. Do you know if this file was invoked by the PyTorch NPI build flows?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only change is setting PYTORCH_ROCM_ARCH, I believe we should still keep this change for any legacy build.
In the ROCk this is controlled in the environment variable.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even for our rocm-ci legacy builds, we used to set PYTORCH_ROCM_ARCH env variable directly from the CI job parameters IIRC. So I'm not sure why/where the build-environment-based method was being used.

export TORCH_PACKAGE_NAME='torch'

export USE_FBGEMM=1
export USE_FBGEMM=0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rraminen @pragupta I don't think this file is used by theRock either. Do we want to do this only for gfx1250? If yes, we probably need to do this elsewhere, and only for gfx1250, since USE_FBGEMM is set to ON for release/2.11 builds in theRock today.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we use a similar strategy to cb4e545?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jithunnair-amd ,
should this setting of USE_FBGEMM=0 be in the pytorch build workflow (github wf) of ROCk.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed at all?

Comment thread .github/actionlint.yaml

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed either?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For those files (not only this one), if we are sure they are not used, feel free to disregard or handle them in whatever way you think is best.

# generate a list of kernels, but not actually emit files at config stage
execute_process(
COMMAND python3 ${CMAKE_SOURCE_DIR}/third_party/composable_kernel/example/ck_tile/01_fmha/generate.py
COMMAND python3 ${CMAKE_SOURCE_DIR}/third_party/composable_kernel/example/ck_tile/01_fmha/generate.py --targets gfx1250

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rraminen @pragupta This is problematic if it's only generating for gfx1250?

cc @alugorey

"0a4ff324bffdac0c2fde87a8a7f70563d3c84a80ad4e8f31345f2b40a1384e95" # amd-gfx120x
# TODO: Update when AOTriton publishes gfx1250 images.
# Until then, may need to set AOTRITON_INSTALL_FROM_SOURCE=1 to build from source.
#"0000000000000000000000000000000000000000000000000000000000000000" # amd-gfx1250

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was released today only. We will update that.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread CMakeLists.txt
# USE_ROCM is guarded against in Dependencies.cmake because USE_ROCM is not properly defined here
cmake_dependent_option(USE_CUFILE "Use cuFile" ON "USE_CUDA AND NOT WIN32" OFF)
option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON)
option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" OFF)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pragupta @rraminen We probably don't want to turn this off wholesale for 2.11, and do it only for gfx1250 instead, if needed.

Comment thread requirements.txt
numpy==2.4.3 ; python_version >= "3.14"
optree==0.13.0 ; python_version < "3.14"
optree==0.17.0 ; python_version >= "3.14"
pandas

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xinyazhang @alugorey I think this is coming from https://github.com/AMD-ROCm-Internal/pytorch/pull/14 ... but I'm not sure why pandas has to be added as a requirement for PyTorch. Until we clearly understand if it's necessary for proper functioning of the pytorch wheel built from this branch, I wouldn't add it to requirements.txt.
Also, if we do want to add it, we should pin it as per established convention in our release branches.

@jithunnair-amd jithunnair-amd left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments

@xinyazhang

Copy link
Copy Markdown

@pragupta @jithunnair-amd I've filed aotriton 0.12.50tp as a separate PR #3312

@naromero77amd naromero77amd left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only concern is the PyTorch 2.11 -Triton 3.7 forward compatibility. Running a test_torchinductor.py and test_max_autotune.py should pick-up big incompatibilities. Maybe flex attention UTs need to be run as well?

test_torhinductor_opinfo.py is thorough, but takes a long time to run.

@jataylo @iupaikov-amd will likely have more feedback.

pragupta pushed a commit that referenced this pull request Jun 17, 2026
This PR is to address the reviewed comments on PR
#3307
@pragupta

Copy link
Copy Markdown
Collaborator

@jithunnair-amd / @pruthvistony / @naromero77amd -- We moved the efforts to make gfx1250 changes public from release/2.11 branch to release/2.12 branch since we'd have to do a major triton bump leading which will add a lot of risk to our delivery to meta due this week. We've addressed all review comments from here in the 2.12 PR. Can you guys please review the 2.12 PR? #3327

@jithunnair-amd

Copy link
Copy Markdown
Collaborator

Moving this to Draft for now

@jithunnair-amd jithunnair-amd marked this pull request as draft June 18, 2026 22:30
@glen-amd

Copy link
Copy Markdown

My comments on PR#3327 (#3327) apply to this PR as well.

pragupta pushed a commit that referenced this pull request Jun 19, 2026
This PR is to address the reviewed comments on PR
#3307
pragupta pushed a commit that referenced this pull request Jun 19, 2026
This PR is to address the reviewed comments on PR
#3307
@pragupta

Copy link
Copy Markdown
Collaborator

2.11 gfx1250 changes will come in via: #3346 We had to create a PR from a different branch.

@jithunnair-amd

jithunnair-amd commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Closing in favor of #3346

pragupta pushed a commit that referenced this pull request Jun 23, 2026
This PR is to address the reviewed comments on PR
#3307
pragupta pushed a commit that referenced this pull request Jun 23, 2026
This PR is to address the reviewed comments on PR
#3307
pragupta pushed a commit that referenced this pull request Jun 24, 2026
This PR is to address the reviewed comments on PR
#3307
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants