[https://nvbugs/5970614][fix] Sync CTA before PDL trigger in quantize_with_block_size#14668
[https://nvbugs/5970614][fix] Sync CTA before PDL trigger in quantize_with_block_size#14668tianyuxbear wants to merge 2 commits into
Conversation
… to 8 The throughput_pp4_mtp variant uses pp_size=4 + ep_size=1 + mtp_nextn=1 on a 4-GPU topology. With max_batch_size=32 (every other 4-GPU variant in this test uses bs<=16), the per-bs CUDA-graph snapshot pool plus per-step activation and NCCL collective buffers leave no headroom for the lazy cuBLAS Lt workspace allocation. Under sustained load this surfaces as mid-run CUBLAS_STATUS_EXECUTION_FAILED in the DeepseekV3 router / o_proj GEMMs, followed by an asynchronous illegal memory access. Lower max_batch_size from 32 to 8 to match the already-stable throughput_bs8_mtp configuration. With bs=8 the default 0.70 KV cache fraction provides sufficient cuBLAS workspace headroom; the test passes in repeated runs without any KV fraction reduction. Removes the matching waiver entry. Signed-off-by: Tianyu Xiong <117647511+tianyuxbear@users.noreply.github.com>
…_with_block_size Signed-off-by: Tianyu Xiong <117647511+tianyuxbear@users.noreply.github.com>
📝 WalkthroughWalkthroughThe PR addresses a CUDA graph execution issue in NVIDIA's TensorRT-LLM by adding memory-ordering synchronization to the quantization kernel and adjusting the DeepSeek-R1 NVFP4 test configuration, then removing the corresponding test waiver. ChangesNVFP4 DeepSeek-R1 Throughput Stabilization
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/quantization.cuh (1)
2-2:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate the copyright header year on this modified source file.
This file was modified, but the NVIDIA header still ends at 2023. Please update it to include the latest modification year.
Suggested fix
- * Copyright (c) 2019-2023, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2019-2026, NVIDIA CORPORATION. All rights reserved.As per coding guidelines:
**/*.{cpp,cc,h,hpp,py,cu,cuh}: Include NVIDIA copyright header on ALL new files; update year on modified files.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/kernels/quantization.cuh` at line 2, Update the NVIDIA copyright header at the top of the modified source file by changing the year range from "2019-2023" to include the current modification year (e.g. "2019-2026"); edit the top-of-file header comment in quantization.cuh (the initial copyright comment block) so the year range reflects the latest modification year.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@cpp/tensorrt_llm/kernels/quantization.cuh`:
- Line 2: Update the NVIDIA copyright header at the top of the modified source
file by changing the year range from "2019-2023" to include the current
modification year (e.g. "2019-2026"); edit the top-of-file header comment in
quantization.cuh (the initial copyright comment block) so the year range
reflects the latest modification year.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 04887fc1-1f9e-4ec2-8b75-a6af13d6921c
📒 Files selected for processing (3)
cpp/tensorrt_llm/kernels/quantization.cuhtests/integration/defs/accuracy/test_llm_api_pytorch.pytests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
Summary
Fixes a PDL (Programmatic Dependent Launch) race in
quantize_with_block_sizethat intermittently corrupts NVFP4 GEMM outputs and degrades GSM8K accuracy for DeepSeek-R1 NVFP4 on GB300 + PP=4 + MTP (nvbug 5970614).cudaTriggerProgrammaticLaunchCompletion()only signals that the CTA has reached the trigger point — it does not flush prior stores to global memory. Memory visibility for the secondary kernel must be provided either by the producer (a fence before the trigger) or by the consumer (wait_on_dependent_grids()before its first dependent load). In the current NVFP4 path neither side does so: the producer lacks a fence, and the sm103 blockscaled GEMM'smain_sf_loadwarp branch is missing the correspondingwait_on_dependent_grids()(tracked separately in NVIDIA/cutlass#3279).Compounding this, PDL completion is reported per-CTA at-least-once: a single warp reaching the trigger marks its whole CTA as "trigger reached", even if peer warps in the same CTA are still writing
sf_out/out. Once every CTA has been marked, the driver launches the secondary kernel, which TMA-loads partial data — NaNs then propagate through the DeepSeek-R1 forward and corrupt output tokens.Fix: insert
__syncthreads(); __threadfence();immediately before the trigger, so all warps in the CTA reach the same program point and all their stores are made globally visible before PDL completion is signaled.In-tree precedent
This mirrors an existing pattern in
cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm.cuh:865-867, which uses__syncthreads(); membar.gl;immediately before the samecudaTriggerProgrammaticLaunchCompletion()call.__threadfence()is the CUDA intrinsic formembar.gl, so the two forms are semantically equivalent.Evidence
The race was characterized on
release/1.2, where the timing window reproduces it deterministically enough to measure. A cache-bypassld.global.cv.u8probe on the SF buffer detects0x7fpoison fill (from thecudaMallocAsyncpool) as a direct race indicator:On
mainthe race is latent: the GSM8K test no longer fails even without this fix (10/10 PASS, mean 94.95 ± 0.25 with autotuner off; with the fix mean 95.14 ± 0.19). Code/scheduler changes sincerelease/1.2appear to have narrowed the producer-to-consumer window enough that the race no longer trips the GSM8K threshold — but the underlying defect is unchanged, and any future change that widens the window (cutlass bump, scheduler reordering, kernel fusion) can re-expose it.The probe code used to measure the race is not part of this PR; it was a one-off diagnostic inside the cutlass consumer.
Companion cutlass fix
The consumer-side fix at NVIDIA/cutlass#3279 addresses the missing
wait_on_dependent_grids()in the sm103 blockscaled GEMMmain_sf_loadwarp branch (the other half of the race described in Summary). Either fix alone closes the race; both are correct individually:quantize_with_block_size, regardless of which cutlass revision trtllm pulls in.Dependency
This PR currently includes the #14585 commit as a prerequisite. Once #14585 lands, this branch will be rebased on top of
main(the prerequisite commit will drop automatically via patch identity) and CI re-triggered.Without #14585, the
TestDeepSeekR1::test_nvfp4_multi_gpus[throughput_pp4_mtp]test case crashes onmain(bs=32 OOM, nvbug 6018046) before reaching the PDL code path, so CI cannot validate this change until #14585 lands.Risk
quantize_with_block_size), covering NVFP4 / FP8 / MXFP8 paths.__syncthreads()+ one__threadfence()at the very end of the kernel (cold path). No effect on PDL launch overlap — the trigger still happens; it just happens after the CTA has drained.