fix: missing PDL wait on main_sf_load in sm103 blockscaled GEMM by tianyuxbear · Pull Request #3279 · NVIDIA/cutlass

tianyuxbear · 2026-05-27T10:33:45Z

Summary

main_sf_load was the only load-capable warp branch in sm103_blockscaled_gemm_tma_warpspecialized.hpp without a preceding cutlass::arch::wait_on_dependent_grids(). With PDL enabled, collective_mainloop.load_sf(...) issues TMA reads of scale-factor (SF) data before the producer kernel's writes are guaranteed flushed to global memory.
Fix: hoist wait_on_dependent_grids() above the warp specialization region so it covers all warp branches (including main_sf_load), and remove the now-redundant inner waits in main_ab_load / sched / epi_load. This matches the cleaned-up pattern already in sm90_gemm_array_tma_warpspecialized_{pingpong,cooperative}.hpp (1 outer wait, 0 inner).
Same class of PDL-placement bug as [BUG] Sm90 & Sm100 Array gemm kernels read ahead of wait_on_dependent_grids() #2962, but in a kernel that wasn't covered by that fix.

Why this matters

Observed in TensorRT-LLM on GB300 + DeepSeek-R1 NVFP4 forward (kernel: DeviceGemmFp4GemmSm103_..._KernelTmaWarpSpecializedBlockScaledSm103). With a diagnostic probe (ld.global.cv.u8 reads of the SF buffer in main_sf_load):

Config	producer fence	`main_sf_load` wait	GSM8K	Race probe
Baseline	✗	✗	91.93	2.31% poison (0x7f)
Producer fence only	✓	✗	95.34	0 / 2.4M poison
This fix (no producer fence)	✗	✓	95.34	0 / 3.2M poison
Reproducer run	✗	✓	95.11	0 / 3.2M poison

Two independent runs, 0 poison events across 6.4M probe samples — the race is fully eliminated on the CUTLASS side without any modification to the producer kernel.

[BUG] Sm90 & Sm100 Array gemm kernels read ahead of wait_on_dependent_grids() #2962 — same class of bug fixed for sm90/sm100 array variants; this PR addresses the sm103 non-array blockscaled variant not covered by that fix.
Suggest a follow-up cleanup PR to fully remove leftover inner waits in sm100_*_array_tma_warpspecialized*.hpp and sm103_blockscaled_gemm_array_tma_warpspecialized.hpp (which still retain redundant inner waits compared to the cleaned sm90 array variants).

fix: missing PDL wait on main_sf_load in sm103 blockscaled GEMM

25230b5

tianyuxbear mentioned this pull request May 28, 2026

[https://nvbugs/5970614][fix] Sync CTA before PDL trigger in quantize_with_block_size NVIDIA/TensorRT-LLM#14668

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: missing PDL wait on main_sf_load in sm103 blockscaled GEMM#3279

fix: missing PDL wait on main_sf_load in sm103 blockscaled GEMM#3279
tianyuxbear wants to merge 1 commit into
NVIDIA:mainfrom
tianyuxbear:fix/sm103_pdl

tianyuxbear commented May 27, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tianyuxbear commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why this matters

Related

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tianyuxbear commented May 27, 2026 •

edited

Loading