-
Notifications
You must be signed in to change notification settings - Fork 81
[release/2.11] Add support for gfx1250 #3307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
a400a90
611da06
aae31f5
1bc755a
9c8cbb3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1 @@ | ||
| 4ed888920c5a0871957f1cf912e557bc79fbe56c | ||
| 9c610c781cb810a11bfcc9accba094550b189a5e |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1 @@ | ||
| 3.6.0 | ||
| 3.7.0 | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jataylo / @iupaikov-amd to signoff on this (and the corresponding triton.txt change), since this is moving to a newer major version of triton, so just need confidence on compatibility for non-gfx1250 archs.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ugh.. I somehow mis-read yesterday morning. I thought we were already using 3.7.0 in release/2.11. 3.6 is too old to be compatible with gfx1250. @jataylo / @iupaikov-amd - can you please guide on which UTs we can run as a sanity testing for this triton bump? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @pragupta I have in my notes that the minimum UT suite would be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @pragupta FWIW, we normally document PyTorch-Triton compability here: but looks like we haven't kept the page up to date.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. concerned about which triton code we are pulling here.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, it is tip of https://github.com/ROCm/triton/commits/release/internal/3.7.x/ then if the UTs pass, then ok. |
||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @BLOrange-AMD @glen-amd TheRock CI flows do not use this file, so these changes would be irrelevant for it. Do you know if this file was invoked by the PyTorch NPI build flows? |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -152,7 +152,7 @@ export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS: | |
| # TODO: We don't need this anymore IIUC | ||
| export TORCH_PACKAGE_NAME='torch' | ||
|
|
||
| export USE_FBGEMM=1 | ||
| export USE_FBGEMM=0 | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we use a similar strategy to cb4e545?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jithunnair-amd , |
||
| export PIP_UPLOAD_FOLDER="$PIP_UPLOAD_FOLDER" | ||
| export DOCKER_IMAGE="$DOCKER_IMAGE" | ||
|
|
||
|
|
||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not needed either? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For those files (not only this one), if we are sure they are not used, feel free to disregard or handle them in whatever way you think is best. |
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not needed at all? |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| # The name of this file is subject to change to stay consistent with other .yml files. | ||
| # | ||
| # The MI355 workflow (.github/workflows/inductor-rocm-mi355.yml) uses: | ||
| # - _linux-build.yml and _rocm-test.yml reusable workflows | ||
| # - Build environment linux-noble-rocm-py3.12-mi355 | ||
| # - Runner label linux.rocm.gpu.gfx950.1 | ||
| # - Docker image ci-image:pytorch-linux-noble-rocm-n-py3 | ||
| # - 2-shard test matrix for the inductor config | ||
| # | ||
| # The GFX1250 equivalent is following this exact pattern. | ||
|
|
||
| name: inductor-rocm-gfx1250 | ||
|
|
||
| on: | ||
| push: | ||
| branches: | ||
| - main | ||
| - release/* | ||
| tags: | ||
| - ciflow/inductor-rocm-gfx1250/* | ||
| workflow_dispatch: | ||
|
|
||
| concurrency: | ||
| group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }} | ||
| cancel-in-progress: true | ||
|
|
||
| permissions: | ||
| id-token: write | ||
| contents: read | ||
| actions: read | ||
|
|
||
| jobs: | ||
| target-determination: | ||
| if: github.repository_owner == 'pytorch' | ||
| name: before-test | ||
| uses: ./.github/workflows/target_determination.yml | ||
|
|
||
| get-label-type: | ||
| name: get-label-type | ||
| uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@release/2.11 | ||
| if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }} | ||
| with: | ||
| triggering_actor: ${{ github.triggering_actor }} | ||
| issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }} | ||
| curr_branch: ${{ github.head_ref || github.ref_name }} | ||
| curr_ref_type: ${{ github.ref_type }} | ||
| opt_out_experiments: lf | ||
|
|
||
| linux-noble-rocm-py3_12-inductor-build: | ||
| name: linux-noble-rocm-py3.12-gfx1250 | ||
| uses: ./.github/workflows/_linux-build.yml | ||
| needs: get-label-type | ||
| with: | ||
| runner_prefix: "${{ needs.get-label-type.outputs.label-type }}" | ||
| build-environment: linux-noble-rocm-py3.12-gfx1250 | ||
| # Docker image stays the same as MI355 because ROCm image supports multiple arches. | ||
| docker-image-name: ci-image:pytorch-linux-noble-rocm-n-py3 | ||
| # Set PYTORCH_ROCM_ARCH directly in the workflow YAML as an env variable, | ||
| # so build.sh never needs to parse BUILD_ENVIRONMENT. | ||
| #env-var-script: | | ||
| # export PYTORCH_ROCM_ARCH=gfx1250 | ||
| test-matrix: | | ||
| { include: [ | ||
| { config: "inductor", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx1250.1" }, # It requires provisioning hardware. | ||
| { config: "inductor", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx1250.1" }, | ||
| ]} | ||
| secrets: inherit | ||
|
|
||
| linux-noble-rocm-py3_12-inductor-test: | ||
| name: linux-noble-rocm-py3.12-gfx1250 | ||
| uses: ./.github/workflows/_rocm-test.yml | ||
| needs: linux-noble-rocm-py3_12-inductor-build | ||
| with: | ||
| build-environment: ${{ needs.linux-noble-rocm-py3_12-inductor-build.outputs.build-environment }} | ||
| docker-image: ${{ needs.linux-noble-rocm-py3_12-inductor-build.outputs.docker-image }} | ||
| test-matrix: ${{ needs.linux-noble-rocm-py3_12-inductor-build.outputs.test-matrix }} | ||
| secrets: inherit |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -263,7 +263,7 @@ cmake_dependent_option(USE_CUSPARSELT "Use cuSPARSELt" ON "USE_CUDA" OFF) | |
| cmake_dependent_option(USE_CUDSS "Use cuDSS" ON "USE_CUDA" OFF) | ||
| # USE_ROCM is guarded against in Dependencies.cmake because USE_ROCM is not properly defined here | ||
| cmake_dependent_option(USE_CUFILE "Use cuFile" ON "USE_CUDA AND NOT WIN32" OFF) | ||
| option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON) | ||
| option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" OFF) | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| option(USE_KINETO "Use Kineto profiling library" ON) | ||
| option(USE_CUPTI_SO "Use CUPTI as a shared library" ON) | ||
| option(USE_GFLAGS "Use GFLAGS" OFF) | ||
|
|
@@ -945,9 +945,13 @@ cmake_dependent_option( | |
| OFF) | ||
|
|
||
|
|
||
| # TODO: | ||
| # MSLK related parts are missing that already exists upstream. | ||
| # gfx1250 for MSLK needs to be involved as well. | ||
|
|
||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we please open an issue to track this? Because 2.11 branch uses MSLK by default, we are changing that here. We can probably just turn it off and use mslk until it's ready?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that is better to turn it off. |
||
| IF(USE_ROCM AND ("gfx942" IN_LIST PYTORCH_ROCM_ARCH OR "gfx950" IN_LIST PYTORCH_ROCM_ARCH)) | ||
| message(WARNING "Setting USE_MSLK for gfx942/gfx950 to ON by default, doing ROCM build") | ||
| set(USE_MSLK_DEFAULT ON) | ||
| message(WARNING "Setting USE_FBGEMM_GENAI for gfx942/gfx950 to ON by default, doing ROCM build") | ||
| set(USE_FBGEMM_GENAI_DEFAULT ON) | ||
| elseif(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8 AND NOT WIN32) | ||
| message(STATUS "Setting USE_MSLK to ON by default , doing CUDA build for SM100a") | ||
| set(USE_MSLK_DEFAULT ON) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BLOrange-AMD @glen-amd TheRock CI flows do not use this file, so these changes would be irrelevant for it. Do you know if this file was invoked by the PyTorch NPI build flows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only change is setting PYTORCH_ROCM_ARCH, I believe we should still keep this change for any legacy build.
In the ROCk this is controlled in the environment variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even for our rocm-ci legacy builds, we used to set PYTORCH_ROCM_ARCH env variable directly from the CI job parameters IIRC. So I'm not sure why/where the build-environment-based method was being used.