[CuTeDSL] Lower scalar Float16/BFloat16 load through Uint16+bitcast #3267
Open
cheshire wants to merge 1 commit into
Fixes NVIDIA#3266

`nvvm.load.ext` rejects both `bf16` and `f16` result types at MLIR verification with "Unsupported FP type for ExtLoadOp", even though the underlying PTX op is just `ld.b16`.

In `cute.arch.load`, route a scalar Float16/BFloat16 request through a `Uint16` load + `llvm.bitcast` back to the requested FP type. This is transparent to callers. The Float16 failure was found while writing the regression test; the same workaround handles it, so the patch covers both types.

Vector loads of f16/bf16 are not touched (they go through `ir.VectorType` and were not verified to hit the same issue).

Added test/python/CuTeDSL/test_arch_load.py, which exercises both the worked-around 16-bit FP path and, as a regression check, the dtypes that `nvvm.load.ext` accepts directly (Float32 / Uint16 / Uint32 / Int32).
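For reviewers who want to see why the round trip through `Uint16` is value-preserving, here is a minimal host-side sketch in plain Python. It is not the code in this patch and does not call CuTeDSL or MLIR; the helper names (`load_u16`, `bitcast_u16_to_f16`, `bitcast_u16_to_bf16`, `load_scalar`) and the string dtype tags are hypothetical and exist only for illustration. It models the idea that a 16-bit load fetches raw bits and that the bitcast only reinterprets them.

```python
import struct

def load_u16(buf: bytes, index: int) -> int:
    """Model of a raw 16-bit load: fetch the bits as an unsigned integer."""
    return struct.unpack_from("<H", buf, 2 * index)[0]

def bitcast_u16_to_f16(bits: int) -> float:
    """Reinterpret 16 bits as IEEE 754 binary16 (no numeric conversion)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def bitcast_u16_to_bf16(bits: int) -> float:
    """Reinterpret 16 bits as bfloat16.

    bfloat16 is the upper 16 bits of a float32, so placing the bits in the
    high half of a 32-bit word and reading it as float32 gives the exact value.
    """
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

def load_scalar(buf: bytes, index: int, dtype: str):
    """Hypothetical dispatcher: 16-bit FP types go through u16 + bitcast."""
    bits = load_u16(buf, index)
    if dtype == "f16":
        return bitcast_u16_to_f16(bits)
    if dtype == "bf16":
        return bitcast_u16_to_bf16(bits)
    if dtype == "u16":
        return bits  # directly supported, no bitcast needed
    raise ValueError(f"unsupported dtype in this sketch: {dtype}")

# Two f16 values followed by two bf16 bit patterns in one little-endian buffer.
buf = struct.pack("<eeHH", 1.5, -0.25, 0x3F80, 0xC020)
f16_values = [load_scalar(buf, i, "f16") for i in (0, 1)]
bf16_values = [load_scalar(buf, i, "bf16") for i in (2, 3)]
```

Because the bitcast never changes the bit pattern, the loaded value is identical to what a native `f16`/`bf16` load would produce, which is why the workaround can stay invisible to callers.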
Contributor
Author
@grypp WDYT?