[CuTeDSL] Add SM120 MXF4/NVFP4 native-TMA path#3273
Open
alecco wants to merge 5 commits into
Open
Conversation
added 5 commits
May 25, 2026 11:51
Add the SM120 MXF4/NVFP4 warp-level MMA op and teach cute.gemm's existing variadic operand path to lower explicit SM120 (operand, scale) bundles without adding a generic MmaOp bundle protocol. Cover direct helper execution, cute.gemm bundle parity, full-K scale-fragment mapping, nonzero distinct C accumulation across K64 halves, negative validation, and plain F16/BF16 cute.gemm compilation.
Add cute.as_position_independent_swizzle_tensor() to move a SMEM layout swizzle onto the pointer while exposing the non-swizzled layout shape to copy consumers. Cover rejection paths, the pointer-recast contract, a swizzled SMEM copy path, and an identity/no-swizzle SMEM copy path.
Add PipelineTmaAsync.producer_acquire_already_elected() for callers that are already inside an elected producer region and need the normal empty-barrier wait plus arrive.expect_tx without a nested election. Document that using the method outside an elect_one region is incorrect, and cover both the default token path and explicit producer_try_acquire token path in PTX compile tests.
Add a narrow cutlass.utils.gemm.sm120 helper package for the SM120 MXF4/NVFP4 native TMA path: CTA constants, config validation, logical A/B layouts, interleaved native-FP8 scale layouts, SMEM views, and A/B/SFA/SFB TMA atom construction. Keep scale TMA on the native FP8 tensor-map path, keep A/B tensors logical FP4, type the A/B SMEM format selector, make tile-coordinate defaults consistent, and validate exact interleaved scale layout shape/stride plus observable L-mode preservation through the public atom builder.
Add a minimal SM120 MXF4/NVFP4 smoke example under the CuTe Blackwell example namespace. The example builds native A/B and native-FP8 scale TMA atoms, issues the four TMA loads, executes two K64 MMA instructions, and stores one 16x8 BF16 output microtile. Keep the packed A/B SMEM format explicit because this microtile consumes A/B with the packed LDSM fragment path. Name the uniform-scale fragment loader and dynamic SMEM size explicitly so the example is clearly a fixed microtile integration test rather than a general scale partitioner or production GEMM tutorial. The test imports through the cute.blackwell example namespace, passes explicit DLPack alignment metadata, checks the fixed instruction counts intentionally, and verifies the output value: first K64 contributes 64, second K64 uses SFA scale 2 and contributes 128, for total 192.
This was referenced May 25, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Using this PR code, QuACK can match
examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cuat 95% consistently. Dao-AILab/quack#145Summary
This PR adds the CuTe DSL plumbing for the SM120 MXF4/NVFP4 native-TMA path:
cute.gemmto lower explicit(operand_fragment, scale_fragment)bundles for the narrow SM120 MXF4/NVFP4 MMA casecutlass.utils.gemm.sm120native-TMA layout helpers for MXF4/NVFP4The intent is to match the SM120 block-scaled structure rather than force SM120 through the SM100
tcgen05/ TMEM model. SM120 uses native A/B TMA, native FP8 scale TMA, register A/B fragments, explicit FP8 scale fragments, and warp-level MXF4/NVFP4 MMA.Details
Warp MMA lowering
Adds the SM120 MXF4/NVFP4 MMA path for:
Float4E2M1FNFloat8E4M3FNFloat32m16n8k6416The
cute.gemmintegration is intentionally narrow. It handles only the SM120 MXF4/NVFP4(fragment, scale)bundle case and otherwise leaves the existing variadic GEMM path unchanged.Native TMA helpers
Adds SM120 MXF4/NVFP4 helper utilities for:
The default A/B SMEM format is the packed path. The unpack path remains available explicitly and has compile coverage.
Pipeline / copy support
Adds:
as_position_independent_swizzle_tensor(...)so copy partitioning can use a non-swizzled logical layout while keeping the swizzle on the pointerproducer_acquire_already_elected(...)for TMA producer code that has already entered an elected-lane regionThese are needed by the SM120 native-TMA path and also keep the helper layering close to the existing Blackwell/CuTe DSL copy pipeline style.
Example
Adds a minimal SM120 MXF4/NVFP4 native-TMA microtile example.
The example intentionally computes one fixed 16x8 BF16 output microtile from a 128x128x128 native-TMA tile. It is not a full GEMM kernel. It demonstrates the native-TMA plumbing:
Tests / coverage
This PR adds SM120 tests covering:
mma_mxf4nvf4(...)executioncute.gemm(...)parityCaccumulation across K64 halvescute.gemmcompile coverage to ensure non-bundled GEMM dispatch is unaffectedThe microtile test checks for the expected instruction shape:
Notes
SM120 is intentionally handled differently from SM100. There is no
tcgen05/ TMEM path here. The implementation is centered on native TMA, register fragments, explicit FP8 scale fragments, and warp-level MXF4/NVFP4 MMA.Env