Skip to content

fix: missing PDL wait on main_sf_load in sm103 blockscaled GEMM#3279

Open
tianyuxbear wants to merge 1 commit into
NVIDIA:mainfrom
tianyuxbear:fix/sm103_pdl
Open

fix: missing PDL wait on main_sf_load in sm103 blockscaled GEMM#3279
tianyuxbear wants to merge 1 commit into
NVIDIA:mainfrom
tianyuxbear:fix/sm103_pdl

Conversation

@tianyuxbear
Copy link
Copy Markdown

@tianyuxbear tianyuxbear commented May 27, 2026

Summary

  • main_sf_load was the only load-capable warp branch in sm103_blockscaled_gemm_tma_warpspecialized.hpp without a preceding cutlass::arch::wait_on_dependent_grids(). With PDL enabled, collective_mainloop.load_sf(...) issues TMA reads of scale-factor (SF) data before the producer kernel's writes are guaranteed flushed to global memory.
  • Fix: hoist wait_on_dependent_grids() above the warp specialization region so it covers all warp branches (including main_sf_load), and remove the now-redundant inner waits in main_ab_load / sched / epi_load. This matches the cleaned-up pattern already in sm90_gemm_array_tma_warpspecialized_{pingpong,cooperative}.hpp (1 outer wait, 0 inner).
  • Same class of PDL-placement bug as [BUG] Sm90 & Sm100 Array gemm kernels read ahead of wait_on_dependent_grids() #2962, but in a kernel that wasn't covered by that fix.

Why this matters

Observed in TensorRT-LLM on GB300 + DeepSeek-R1 NVFP4 forward (kernel: DeviceGemmFp4GemmSm103_..._KernelTmaWarpSpecializedBlockScaledSm103). With a diagnostic probe (ld.global.cv.u8 reads of the SF buffer in main_sf_load):

Config producer fence main_sf_load wait GSM8K Race probe
Baseline 91.93 2.31% poison (0x7f)
Producer fence only 95.34 0 / 2.4M poison
This fix (no producer fence) 95.34 0 / 3.2M poison
Reproducer run 95.11 0 / 3.2M poison

Two independent runs, 0 poison events across 6.4M probe samples — the race is fully eliminated on the CUTLASS side without any modification to the producer kernel.

Related

  • [BUG] Sm90 & Sm100 Array gemm kernels read ahead of wait_on_dependent_grids() #2962 — same class of bug fixed for sm90/sm100 array variants; this PR addresses the sm103 non-array blockscaled variant not covered by that fix.
  • Suggest a follow-up cleanup PR to fully remove leftover inner waits in sm100_*_array_tma_warpspecialized*.hpp and sm103_blockscaled_gemm_array_tma_warpspecialized.hpp (which still retain redundant inner waits compared to the cleaned sm90 array variants).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant