[release/2.11] Add support for gfx1250#3307
Conversation
…odule + triton pins)
* CK - gfx1250 support (pytorch#5) * Enable ROCM_CK_SDPA build * [submodule] composable_kernel and aiter update (pytorch#172592) Summary: update ck to commit ROCm/composable_kernel@fcc9372 update aiter to commit ROCm/aiter@9a469a6 changes of caffe2/aten/src/ATen/CMakeLists.txt and caffe2/caffe2/CMakeLists.txt are adopted from pytorch#161759 updated caffe2/aten/src/ATen/native/transformers/hip/flash_attn/ck/launch_kernel_pt.hpp to match the ck version in https://github.com/ROCm/composable_kernel/blob/292df2719f28cd01464d5d059820684790c101da/include/ck_tile/host/kernel_launch.hpp update aiter fav3 bwd codegen according to changes in ROCm/aiter#1573 update caffe2/aten/src/ATen/native/transformers/hip/flash_attn/ck mha fwd/bwd kernels according to the interfaces in https://github.com/ROCm/composable_kernel/tree/292df2719f28cd01464d5d059820684790c101da/example/ck_tile/01_fmha Differential Revision: D88991877 Pull Request resolved: pytorch#172592 Approved by: https://github.com/alugorey, https://github.com/izaitsevfb * Added MI450 supports and packages * Fix misalinged ck api * Replace aiter with ck for bwd * [ROCm] Bump AOTriton to 0.11.2b (pytorch#174105) Notable new features: * AOTriton 0.11.2b adds gfx1151/1152/1153 support. * Add precompiled AOTriton runtime for ROCM 7.2 * Match the sliding window attention behavior of `_flash_attention_forward/backward` with CUTLASS backend. Bug fixes: * Fixes pytorch#173204. Now all tests in `test/test_varlen_attention.py` are enabled on ROCm Notes: This replaces PR pytorch#173820 and pytorch#173469 Pull Request resolved: pytorch#174105 Approved by: https://github.com/jeffdaily * Fix philox data types for this version of ck * Update CK to use new gfx1250_pytorch branch * Add new gfx1250 compile flags for CK * add --targets to generate and a couple new compile flags * Remove default USE_ROCM_CK_SDPA --------- Co-authored-by: blorange-amd <bo.li2@amd.com> Co-authored-by: Yu Guo <yuguo@meta.com> Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com> * Updated aiter module * Fixed merged error * Fixed additional merged error * Reset USE_ROCM_CK_SDPA config --------- Co-authored-by: LugoReyes, Andy <Andy.LugoReyes@amd.com> Co-authored-by: Yu Guo <yuguo@meta.com> Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Fix `torch.arange` (and the other range factories sharing this kernel) for very large outputs on ROCm. `torch.arange(N)` with `N >= 2^32` fails on ROCm because `hipLaunchKernel` does not support `gridDim.x * blockDim.x >= 2^32` for the per-thread kernel `aten/src/ATen/native/cuda/RangeFactories.cu` previously used. Depending on the ROCm version the launch returns `hipErrorInvalidConfiguration` or is accepted silently with the kernel never executing, leaving zero-initialized output. Concrete repro: `torch.arange(2 ** 32 + 1, device="cuda", dtype=torch.int32)`. The fix replaces the per-thread launch on the ROCm path with a grid-stride loop that fixes the grid at `sm_count * 4` blocks, so the launch limit is no longer load-bearing for correctness regardless of `N`. The non-ROCm path is untouched. On MI250X the grid-stride kernel matches the per-thread kernel within noise at `N=1024` and is 24-60% faster from `N=1M` up across `int32`, `int64`, and `float32`. On MI300X the grid-stride kernel matches within noise at `N=1024` and `N=1M`, and is 2-5x faster from `N=64M` up across `int32`, `int64`, and `float32`. The 64-bit-indexing test is extended to also cover `N = 2^32 + 1` and `N = 2^33 + 1` on ROCm when memory permits. Pull Request resolved: pytorch#182657 Approved by: https://github.com/jeffdaily Co-authored-by: Jeff Daily <jeff.daily@amd.com>
* TDM on release/2.11 for bring-up based on careful selection * Triton commit: Upstream fe0c38b5262c0447fed6df0d37e02cb8ea75deb4 -> AMD-ROCm-Internal Triton 250bb5d5b821377f49dc2d83d87ded75b952f0f7; Consequence: Triton TDM support may miss. * Refinement according to reviewers' comments * Added/modified UT cases; NUM_STAGES issue of ineffectiveness * A couple of changes to related UTs * Got rid of configs like `waves_per_cu=2`
|
Jenkins build for 1bc755ac30d6a8b5c183eafc823b8fe307a77417 commit finished as FAILURE |
|
Jenkins build for 9c8cbb3cecb9f73c443f28830a60226c2bf4e260 commit finished as FAILURE |
| @@ -1 +1 @@ | |||
| 3.6.0 | |||
| 3.7.0 | |||
There was a problem hiding this comment.
@jataylo / @iupaikov-amd to signoff on this (and the corresponding triton.txt change), since this is moving to a newer major version of triton, so just need confidence on compatibility for non-gfx1250 archs.
There was a problem hiding this comment.
Ugh.. I somehow mis-read yesterday morning. I thought we were already using 3.7.0 in release/2.11. 3.6 is too old to be compatible with gfx1250.
@jataylo / @iupaikov-amd - can you please guide on which UTs we can run as a sanity testing for this triton bump?
There was a problem hiding this comment.
@pragupta I have in my notes that the minimum UT suite would be test/inductor/test_torchinductor.py and test/inductor/test_max_autotune.py.
There was a problem hiding this comment.
@pragupta FWIW, we normally document PyTorch-Triton compability here:
https://amd.atlassian.net/wiki/spaces/MLSE/pages/1032521014/PyTorch+-+Triton+-+Team+responsibilities
but looks like we haven't kept the page up to date.
There was a problem hiding this comment.
concerned about which triton code we are pulling here.
commit - 9c610c781cb810a11bfcc9accba094550b189a5e belongs to which repo? Is it public?
There was a problem hiding this comment.
Ok, it is tip of https://github.com/ROCm/triton/commits/release/internal/3.7.x/ then if the UTs pass, then ok.
| # TODO: | ||
| # MSLK related parts are missing that already exists upstream. | ||
| # gfx1250 for MSLK needs to be involved as well. | ||
|
|
There was a problem hiding this comment.
Can we please open an issue to track this? Because 2.11 branch uses MSLK by default, we are changing that here. We can probably just turn it off and use mslk until it's ready?
There was a problem hiding this comment.
Yes, that is better to turn it off.
We need to have a ROCm fork branch of MSLK for 2.11.
FBGEMM (Li Li) was checking on this we should reach out to him.
There was a problem hiding this comment.
@BLOrange-AMD @glen-amd TheRock CI flows do not use this file, so these changes would be irrelevant for it. Do you know if this file was invoked by the PyTorch NPI build flows?
There was a problem hiding this comment.
@BLOrange-AMD @glen-amd TheRock CI flows do not use this file, so these changes would be irrelevant for it. Do you know if this file was invoked by the PyTorch NPI build flows?
There was a problem hiding this comment.
The only change is setting PYTORCH_ROCM_ARCH, I believe we should still keep this change for any legacy build.
In the ROCk this is controlled in the environment variable.
There was a problem hiding this comment.
Even for our rocm-ci legacy builds, we used to set PYTORCH_ROCM_ARCH env variable directly from the CI job parameters IIRC. So I'm not sure why/where the build-environment-based method was being used.
| export TORCH_PACKAGE_NAME='torch' | ||
|
|
||
| export USE_FBGEMM=1 | ||
| export USE_FBGEMM=0 |
There was a problem hiding this comment.
@jithunnair-amd ,
should this setting of USE_FBGEMM=0 be in the pytorch build workflow (github wf) of ROCk.
There was a problem hiding this comment.
Not needed at all?
There was a problem hiding this comment.
Not needed either?
There was a problem hiding this comment.
For those files (not only this one), if we are sure they are not used, feel free to disregard or handle them in whatever way you think is best.
| # generate a list of kernels, but not actually emit files at config stage | ||
| execute_process( | ||
| COMMAND python3 ${CMAKE_SOURCE_DIR}/third_party/composable_kernel/example/ck_tile/01_fmha/generate.py | ||
| COMMAND python3 ${CMAKE_SOURCE_DIR}/third_party/composable_kernel/example/ck_tile/01_fmha/generate.py --targets gfx1250 |
| "0a4ff324bffdac0c2fde87a8a7f70563d3c84a80ad4e8f31345f2b40a1384e95" # amd-gfx120x | ||
| # TODO: Update when AOTriton publishes gfx1250 images. | ||
| # Until then, may need to set AOTRITON_INSTALL_FROM_SOURCE=1 to build from source. | ||
| #"0000000000000000000000000000000000000000000000000000000000000000" # amd-gfx1250 |
There was a problem hiding this comment.
@pragupta @rraminen Does this need to use https://github.com/ROCm/aotriton/releases/tag/0.12.50tp?
There was a problem hiding this comment.
Yes, that was released today only. We will update that.
| # USE_ROCM is guarded against in Dependencies.cmake because USE_ROCM is not properly defined here | ||
| cmake_dependent_option(USE_CUFILE "Use cuFile" ON "USE_CUDA AND NOT WIN32" OFF) | ||
| option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON) | ||
| option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" OFF) |
| numpy==2.4.3 ; python_version >= "3.14" | ||
| optree==0.13.0 ; python_version < "3.14" | ||
| optree==0.17.0 ; python_version >= "3.14" | ||
| pandas |
There was a problem hiding this comment.
@xinyazhang @alugorey I think this is coming from https://github.com/AMD-ROCm-Internal/pytorch/pull/14 ... but I'm not sure why pandas has to be added as a requirement for PyTorch. Until we clearly understand if it's necessary for proper functioning of the pytorch wheel built from this branch, I wouldn't add it to requirements.txt.
Also, if we do want to add it, we should pin it as per established convention in our release branches.
|
@pragupta @jithunnair-amd I've filed aotriton 0.12.50tp as a separate PR #3312 |
naromero77amd
left a comment
There was a problem hiding this comment.
Only concern is the PyTorch 2.11 -Triton 3.7 forward compatibility. Running a test_torchinductor.py and test_max_autotune.py should pick-up big incompatibilities. Maybe flex attention UTs need to be run as well?
test_torhinductor_opinfo.py is thorough, but takes a long time to run.
@jataylo @iupaikov-amd will likely have more feedback.
This PR is to address the reviewed comments on PR #3307
|
@jithunnair-amd / @pruthvistony / @naromero77amd -- We moved the efforts to make gfx1250 changes public from release/2.11 branch to release/2.12 branch since we'd have to do a major triton bump leading which will add a lot of risk to our delivery to meta due this week. We've addressed all review comments from here in the 2.12 PR. Can you guys please review the 2.12 PR? #3327 |
|
Moving this to Draft for now |
|
My comments on PR#3327 (#3327) apply to this PR as well. |
This PR is to address the reviewed comments on PR #3307
This PR is to address the reviewed comments on PR #3307
|
2.11 gfx1250 changes will come in via: #3346 We had to create a PR from a different branch. |
|
Closing in favor of #3346 |
This PR is to address the reviewed comments on PR #3307
This PR is to address the reviewed comments on PR #3307
This PR is to address the reviewed comments on PR #3307
This PR is to add PyTorch support on gfx1250 hardware