[CuTeDSL] Add SM120 MXF4/NVFP4 native-TMA path by alecco · Pull Request #3273 · NVIDIA/cutlass

alecco · 2026-05-25T11:15:07Z

With the code in this PR, QuACK consistently reaches about 95% of the performance of examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu. See Dao-AILab/quack#145.

Summary

This PR adds the CuTe DSL plumbing for the SM120 MXF4/NVFP4 native-TMA path:

add SM120 MXF4/NVFP4 warp MMA lowering
teach cute.gemm to lower explicit (operand_fragment, scale_fragment) bundles for the narrow SM120 MXF4/NVFP4 MMA case
add SM120 E4M3 FP8 scale-fragment helpers
add position-independent swizzle tensor support for SMEM copy partitioning
add an already-elected TMA producer acquire helper
add cutlass.utils.gemm.sm120 native-TMA layout helpers for MXF4/NVFP4
add a fixed native-TMA microtile example and focused SM120 tests

The intent is to match the SM120 block-scaled structure rather than force SM120 through the SM100 tcgen05 / TMEM model. SM120 uses native A/B TMA, native FP8 scale TMA, register A/B fragments, explicit FP8 scale fragments, and warp-level MXF4/NVFP4 MMA.

Details

Warp MMA lowering

Adds the SM120 MXF4/NVFP4 MMA path for:

A/B: Float4E2M1FN
scales: Float8E4M3FN
accumulator: Float32
instruction shape: m16n8k64
scale vector size: 16

The cute.gemm integration is intentionally narrow. It handles only the SM120 MXF4/NVFP4 (fragment, scale) bundle case and otherwise leaves the existing variadic GEMM path unchanged.
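
As a rough illustration of what one m16n8k64 block-scaled instruction computes for a single output element, here is a pure-Python reference sketch. It is not the CuTe DSL API: the function names are hypothetical, and it only emulates the arithmetic implied by the parameters above (E2M1 operands, E4M3 scales, one scale per 16 K-elements, floating-point accumulation).

```python
# Reference emulation only; all names here are hypothetical, not CuTe DSL API.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
SCALE_VEC_SIZE = 16  # one scale per 16 K-elements
MMA_K = 64           # K extent of one m16n8k64 instruction


def decode_e2m1(nibble: int) -> float:
    """Decode a 4-bit E2M1 code (1 sign, 2 exponent, 1 mantissa bit)."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def decode_e4m3fn(byte: int) -> float:
    """Decode an 8-bit E4M3FN code (bias 7, no infinities, one NaN pattern)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:  # subnormal
        return sign * (man / 8.0) * 2.0 ** -6
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def block_scaled_dot(a_codes, b_codes, sfa_codes, sfb_codes) -> float:
    """Dot product of one A row and one B column over K=64 with per-block scales."""
    assert len(a_codes) == len(b_codes) == MMA_K
    assert len(sfa_codes) == len(sfb_codes) == MMA_K // SCALE_VEC_SIZE
    acc = 0.0
    for k in range(MMA_K):
        block = k // SCALE_VEC_SIZE
        scale = decode_e4m3fn(sfa_codes[block]) * decode_e4m3fn(sfb_codes[block])
        acc += scale * decode_e2m1(a_codes[k]) * decode_e2m1(b_codes[k])
    return acc


ones = [0x2] * MMA_K  # E2M1 code 0b0010 decodes to 1.0
unit = [0x38] * 4     # E4M3 code 0x38 decodes to 1.0
print(block_scaled_dot(ones, ones, unit, unit))  # 64.0
```

With K=64 and a scale vector size of 16, each A row and each B column carries four scales per instruction, which is why the scale lists above have length 4.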

Native TMA helpers

Adds SM120 MXF4/NVFP4 helper utilities for:

logical A/B GMEM layouts
interleaved scale GMEM layouts
native A/B TMA atoms
native FP8 scale TMA atoms
packed and unpack A/B SMEM view construction
FP8 scale SMEM view construction
format-aware TMA transaction byte helpers
the 128x128x128 tiled MMA layout used by the SM120 path

The default A/B SMEM format is the packed path. The unpack path remains available explicitly and has compile coverage.
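
To make the idea of a format-aware byte helper concrete, here is a hedged sketch. The enum, the function name, and the bit widths are assumptions for illustration: it assumes the packed format stores two FP4 elements per byte and the unpack format gives each FP4 element an 8-bit container in SMEM. The PR text does not state these widths, so the real helpers may differ.

```python
from enum import Enum


class ABSmemFormat(Enum):  # hypothetical selector, not the PR's actual type
    PACKED = "packed"
    UNPACK = "unpack"


# Assumed SMEM storage width per FP4 element for each format.
SMEM_BITS_PER_ELEMENT = {ABSmemFormat.PACKED: 4, ABSmemFormat.UNPACK: 8}


def ab_tile_smem_bytes(rows: int, k: int, fmt: ABSmemFormat = ABSmemFormat.PACKED) -> int:
    """Bytes one rows x k FP4 tile occupies in SMEM under the assumed widths."""
    total_bits = rows * k * SMEM_BITS_PER_ELEMENT[fmt]
    assert total_bits % 8 == 0, "tile must cover a whole number of bytes"
    return total_bits // 8


print(ab_tile_smem_bytes(128, 128))                       # 8192
print(ab_tile_smem_bytes(128, 128, ABSmemFormat.UNPACK))  # 16384
```

Under these assumptions, a producer would pass the format-specific byte count to its expect-tx arrival so the barrier completes exactly when the TMA load has landed.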

Pipeline / copy support

Adds:

as_position_independent_swizzle_tensor(...) so copy partitioning can use a non-swizzled logical layout while keeping the swizzle on the pointer
producer_acquire_already_elected(...) for TMA producer code that has already entered an elected-lane region

These are needed by the SM120 native-TMA path and also keep the helper layering close to the existing Blackwell/CuTe DSL copy pipeline style.
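
One way to see why a swizzle can be moved from the layout onto the pointer is the following sketch. It models a CuTe-style Swizzle<B, M, S> as an XOR on integer offsets and checks that swizzling the absolute address equals the base plus the swizzled offset whenever the base is aligned to 2^(B+M+S) bytes. The swizzle parameters and base addresses are illustrative assumptions, not values taken from the PR.

```python
def swizzle(offset: int, bits: int, mbase: int, shift: int) -> int:
    """XOR bits [mbase+shift, mbase+shift+bits) into bits [mbase, mbase+bits).

    Models a CuTe-style Swizzle<bits, mbase, shift> for shift > 0.
    """
    mask = ((1 << bits) - 1) << (mbase + shift)
    return offset ^ ((offset & mask) >> shift)


B, M, S = 3, 4, 3          # illustrative parameters (an assumption)
ALIGN = 1 << (B + M + S)   # 1024: the swizzle only touches bits below this

# Aligned base: swizzling the full address matches swizzling the offset alone.
aligned_base = 3 * ALIGN
for off in range(0, 4 * ALIGN, 16):
    assert swizzle(aligned_base + off, B, M, S) == aligned_base + swizzle(off, B, M, S)

# Misaligned base: the two views disagree, so the alignment requirement matters.
misaligned_base = 128
on_pointer = swizzle(misaligned_base + 128, B, M, S)
on_layout = misaligned_base + swizzle(128, B, M, S)
print(on_pointer, on_layout)  # 288 272
```

This is why copy partitioning can work with the plain logical layout: the address-level XOR produces the same physical locations as long as the SMEM base satisfies the alignment.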

Example

Adds a minimal SM120 MXF4/NVFP4 native-TMA microtile example.

The example intentionally computes one fixed 16x8 BF16 output microtile from a 128x128x128 native-TMA tile. It is not a full GEMM kernel. It demonstrates the native-TMA plumbing:

build A/B native TMA atoms
build SFA/SFB native FP8 scale TMA atoms
issue one A and one B 3D TMA load, plus one SFA and one SFB 2D scale TMA load
execute two K64 MXF4/NVFP4 MMA instructions
store one BF16 output microtile

Tests / coverage

This PR adds SM120 tests covering:

direct mma_mxf4nvf4(...) execution
bundled cute.gemm(...) parity
full-K K64 accumulation
distinct nonzero C accumulation across K64 halves
SFA/SFB scale-fragment mapping
per-scale-column behavior
row-varying SFA behavior
negative dtype/layout validation
plain F16/BF16 cute.gemm compile coverage to ensure non-bundled GEMM dispatch is unaffected
native TMA layout helper validation
packed/unpack byte helper behavior
unpack TMA atom compile coverage
native-TMA microtile PTX and runtime result checks

The microtile test checks the generated PTX for the expected instruction counts and the runtime result:

two MXF4/NVFP4 MMA instructions
two A/B 3D TMA ops
two scale 2D TMA ops
expected BF16 output value for the fixed uniform-scale setup
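
A check of this kind can be sketched as a plain text scan over the generated PTX. The snippet below is a minimal sketch, not the PR's test code: the sample PTX is hand-written and abbreviated rather than real compiler output, and the regular expressions are assumptions about which mnemonic prefixes identify each instruction class.

```python
import re

# Hand-written, abbreviated stand-in for compiler output (illustrative only).
SAMPLE_PTX = """
cp.async.bulk.tensor.3d.shared::cluster.global.mbarrier::complete_tx::bytes [a_smem], [a_desc, {c0, c1, c2}], [bar];
cp.async.bulk.tensor.3d.shared::cluster.global.mbarrier::complete_tx::bytes [b_smem], [b_desc, {c0, c1, c2}], [bar];
cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [sfa_smem], [sfa_desc, {c0, c1}], [bar];
cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes [sfb_smem], [sfb_desc, {c0, c1}], [bar];
mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3 ...;
mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3 ...;
"""

# Assumed mnemonic prefixes for each instruction class.
PATTERNS = {
    "mma": r"mma\.sync\.aligned\.kind::mxf4nvf4",
    "tma_3d": r"cp\.async\.bulk\.tensor\.3d",
    "tma_2d": r"cp\.async\.bulk\.tensor\.2d",
}


def count_ops(ptx: str) -> dict:
    """Count occurrences of each instruction class in a PTX string."""
    return {name: len(re.findall(pattern, ptx)) for name, pattern in PATTERNS.items()}


counts = count_ops(SAMPLE_PTX)
print(counts)  # {'mma': 2, 'tma_3d': 2, 'tma_2d': 2}
assert counts == {"mma": 2, "tma_3d": 2, "tma_2d": 2}
```

Pinning exact counts keeps the test sensitive to accidental extra loads or a dropped MMA, at the cost of needing an update whenever the microtile shape changes.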

Notes

SM120 is intentionally handled differently from SM100. There is no tcgen05 / TMEM path here. The implementation is centered on native TMA, register fragments, explicit FP8 scale fragments, and warp-level MXF4/NVFP4 MMA.

Env

# Install the CUDA 13 CuTe DSL runtime wheel set matching this branch.
# This file currently contains: nvidia-cutlass-dsl[cu13]==4.5.1
python -m pip install -r python/CuTeDSL/requirements-cu13.txt

# Prepare the editable CuTe DSL source tree from the matching wheel.
# This copies generated _mlir Python/extension files and writes VERSION.EDITABLE.
python python/CuTeDSL/prep_editable_install.py

# Install CuTe DSL in editable/dev mode.
python -m pip install -e python/CuTeDSL

Expected verification:

python -m pip show nvidia-cutlass-dsl nvidia-cutlass-dsl-libs-base nvidia-cutlass-dsl-libs-cu13 cuda-python

Local working environment currently shows:

nvidia-cutlass-dsl            4.5.1.dev0, editable: ${HOME}/Soft/cutlass/python/CuTeDSL
nvidia-cutlass-dsl-libs-base  4.5.1
nvidia-cutlass-dsl-libs-cu13  4.5.1
cuda-python                   13.2.0
cuda-toolkit                  13.0.2
torch                         2.11.0
triton                        3.6.0

For running SM120 tests/benchmarks:

export CUTE_DSL_ARCH=sm_120a
export CUTE_DSL_CACHE_DIR=/data/agent/CuTeDSL/cache

If the runtime library is not found automatically, set:

export CUTE_DSL_LIBS=/path/to/site-packages/nvidia_cutlass_dsl/lib/libcute_dsl_runtime.so

On my local install that path is:

${HOME}/.local/lib/python3.14/site-packages/nvidia_cutlass_dsl/lib/libcute_dsl_runtime.so
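
The same verification can be done from Python. The following is a small convenience sketch using only the standard library; the helper names are hypothetical, while the package and environment variable names are the ones listed above.

```python
import os
from importlib import metadata

PACKAGES = (
    "nvidia-cutlass-dsl",
    "nvidia-cutlass-dsl-libs-base",
    "nvidia-cutlass-dsl-libs-cu13",
    "cuda-python",
)
ENV_VARS = ("CUTE_DSL_ARCH", "CUTE_DSL_CACHE_DIR", "CUTE_DSL_LIBS")


def installed_versions(names=PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def env_report(names=ENV_VARS, environ=None) -> dict:
    """Map each environment variable to its value, or None if unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in names}


for name, version in installed_versions().items():
    print(f"{name:32s} {version or 'NOT INSTALLED'}")
for name, value in env_report().items():
    print(f"{name:32s} {value or '(unset)'}")
```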

Add the SM120 MXF4/NVFP4 warp-level MMA op and teach cute.gemm's existing variadic operand path to lower explicit SM120 (operand, scale) bundles without adding a generic MmaOp bundle protocol. Cover direct helper execution, cute.gemm bundle parity, full-K scale-fragment mapping, nonzero distinct C accumulation across K64 halves, negative validation, and plain F16/BF16 cute.gemm compilation.

Add cute.as_position_independent_swizzle_tensor() to move a SMEM layout swizzle onto the pointer while exposing the non-swizzled layout shape to copy consumers. Cover rejection paths, the pointer-recast contract, a swizzled SMEM copy path, and an identity/no-swizzle SMEM copy path.

Add PipelineTmaAsync.producer_acquire_already_elected() for callers that are already inside an elected producer region and need the normal empty-barrier wait plus arrive.expect_tx without a nested election. Document that using the method outside an elect_one region is incorrect, and cover both the default token path and explicit producer_try_acquire token path in PTX compile tests.

Add a narrow cutlass.utils.gemm.sm120 helper package for the SM120 MXF4/NVFP4 native TMA path: CTA constants, config validation, logical A/B layouts, interleaved native-FP8 scale layouts, SMEM views, and A/B/SFA/SFB TMA atom construction. Keep scale TMA on the native FP8 tensor-map path, keep A/B tensors logical FP4, type the A/B SMEM format selector, make tile-coordinate defaults consistent, and validate exact interleaved scale layout shape/stride plus observable L-mode preservation through the public atom builder.

Add a minimal SM120 MXF4/NVFP4 smoke example under the CuTe Blackwell example namespace. The example builds native A/B and native-FP8 scale TMA atoms, issues the four TMA loads, executes two K64 MMA instructions, and stores one 16x8 BF16 output microtile. Keep the packed A/B SMEM format explicit because this microtile consumes A/B with the packed LDSM fragment path. Name the uniform-scale fragment loader and dynamic SMEM size explicitly so the example is clearly a fixed microtile integration test rather than a general scale partitioner or production GEMM tutorial. The test imports through the cute.blackwell example namespace, passes explicit DLPack alignment metadata, checks the fixed instruction counts intentionally, and verifies the output value: first K64 contributes 64, second K64 uses SFA scale 2 and contributes 128, for total 192.
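
The expected value of 192 can be reproduced with a few lines of arithmetic. This sketch assumes every A and B element decodes to 1.0 and every scale is 1.0 except the SFA scale of 2 in the second K64 half. The commit message does not spell out the element values, so that part is an assumption chosen to be consistent with the stated contributions of 64 and 128.

```python
import struct

MMA_K = 64    # K extent of one MXF4/NVFP4 MMA instruction
TILE_K = 128  # K extent of the native-TMA tile


def k64_contribution(a_value: float, b_value: float, sfa: float, sfb: float) -> float:
    """Contribution of one K64 step when all elements and scales are uniform."""
    return MMA_K * (sfa * a_value) * (sfb * b_value)


def is_exact_in_bf16(value: float) -> bool:
    """True if the float32 encoding of value has zero low 16 bits (exact in BF16)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    return (bits & 0xFFFF) == 0


assert TILE_K // MMA_K == 2  # two MMA instructions cover the tile's K extent

first = k64_contribution(1.0, 1.0, sfa=1.0, sfb=1.0)   # 64.0
second = k64_contribution(1.0, 1.0, sfa=2.0, sfb=1.0)  # 128.0
total = first + second
print(first, second, total)     # 64.0 128.0 192.0
print(is_exact_in_bf16(total))  # True
```

Because 192 is exactly representable in BF16, the test can compare the stored output for equality rather than using a tolerance.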

agent added 5 commits May 25, 2026 11:51

This was referenced May 25, 2026

[SM120] Add NVFP4 blockscaled GEMM path (~95% CUDA) Dao-AILab/quack#145 (Draft)

Add Blackwell GeForce blockscaled GEMM examples #3272 (Merged)

alecco wants to merge 5 commits into NVIDIA:main from alecco:sm120-nvfp4
