[https://nvbugs/5970614][fix] Sync CTA before PDL trigger in quantize_with_block_size by tianyuxbear · Pull Request #14668 · NVIDIA/TensorRT-LLM

tianyuxbear · 2026-05-28T04:52:37Z

Summary

Fixes a PDL (Programmatic Dependent Launch) race in quantize_with_block_size that intermittently corrupts NVFP4 GEMM outputs and degrades GSM8K accuracy for DeepSeek-R1 NVFP4 on GB300 + PP=4 + MTP (nvbug 5970614).

cudaTriggerProgrammaticLaunchCompletion() only signals that the CTA has reached the trigger point — it does not flush prior stores to global memory. Memory visibility for the secondary kernel must be provided either by the producer (a fence before the trigger) or by the consumer (wait_on_dependent_grids() before its first dependent load). In the current NVFP4 path neither side does so: the producer lacks a fence, and the sm103 blockscaled GEMM's main_sf_load warp branch is missing the corresponding wait_on_dependent_grids() (tracked separately in NVIDIA/cutlass#3279).

Compounding this, PDL completion is tracked per CTA with at-least-once semantics: a single warp reaching the trigger marks its whole CTA as "trigger reached", even if peer warps in the same CTA are still writing sf_out / out. Once every CTA has been marked, the driver launches the secondary kernel, which TMA-loads partially written data. The resulting NaNs propagate through the DeepSeek-R1 forward pass and corrupt output tokens.

Fix: insert __syncthreads(); __threadfence(); immediately before the trigger, so all warps in the CTA reach the same program point and all their stores are made globally visible before PDL completion is signaled.
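The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the actual quantize_with_block_size kernel: the kernel name producer_sketch, its parameters, the placeholder "quantization", and the launch_producer_sketch helper are all hypothetical. Only the three-step tail (barrier, fence, trigger) reflects the fix as described.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical stand-in for the producer kernel. Each thread writes one output
// byte and one scale-factor byte; the real kernel's layout is different.
__global__ void producer_sketch(const float* in, uint8_t* out, uint8_t* sf_out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        out[idx] = in[idx] > 0.0f ? 1 : 0; // placeholder for the real quantization
        sf_out[idx] = 1;                   // placeholder scale factor
    }

#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // 1) Barrier placed outside the bounds check so every thread in the CTA
    //    reaches it; no warp can run ahead to the trigger while peers still store.
    __syncthreads();
    // 2) Device-scope fence, intended (per the PR description) to make the
    //    CTA's stores globally visible before completion is signaled.
    __threadfence();
    // 3) Only now report this CTA as having reached the PDL trigger point.
    cudaTriggerProgrammaticLaunchCompletion();
#endif
}

// Hypothetical helper: an ordinary launch. The trigger has no dependent to
// release unless the next kernel in the stream is launched with PDL enabled.
cudaError_t launch_producer_sketch(const float* in, uint8_t* out, uint8_t* sf_out, int n,
                                   cudaStream_t stream)
{
    if (n <= 0)
    {
        return cudaSuccess; // nothing to do; avoids a zero-sized grid
    }
    producer_sketch<<<(n + 255) / 256, 256, 0, stream>>>(in, out, sf_out, n);
    return cudaGetLastError();
}
```

Keeping the barrier outside any divergent branch matters here: if some threads skipped __syncthreads(), the kernel would have undefined behavior rather than a fixed race.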

In-tree precedent

This mirrors an existing pattern in cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm.cuh:865-867, which uses __syncthreads(); membar.gl; immediately before the same cudaTriggerProgrammaticLaunchCompletion() call. __threadfence() is the CUDA intrinsic for membar.gl, so the two forms are semantically equivalent.

Evidence

The race was characterized on release/1.2, where the timing window reproduces it deterministically enough to measure. A cache-bypass ld.global.cv.u8 probe on the SF buffer detects 0x7f poison fill (from the cudaMallocAsync pool) as a direct race indicator:

Config	GSM8K accuracy	Poison probe hits
Baseline (no fix)	91.93 (FAIL, threshold 92.217)	2.31% (~74k / 3.2M samples)
With this fix	95.34 (PASS, ref 95.42)	0 / 2.4M

On main the race is latent: the GSM8K test no longer fails even without this fix (10/10 PASS, mean 94.95 ± 0.25 with autotuner off; with the fix mean 95.14 ± 0.19). Code/scheduler changes since release/1.2 appear to have narrowed the producer-to-consumer window enough that the race no longer trips the GSM8K threshold — but the underlying defect is unchanged, and any future change that widens the window (cutlass bump, scheduler reordering, kernel fusion) can re-expose it.

The probe code used to measure the race is not part of this PR; it was a one-off diagnostic inside the cutlass consumer.
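Since the probe itself is not included, the following is only a guess at its general shape: a cache-bypassing byte load compared against the 0x7f fill value. Every name here (load_u8_cv, poison_probe_sketch, count_poison_bytes) is hypothetical. Unlike the real diagnostic, which ran inside the cutlass consumer, this sketch is a standalone kernel, so it illustrates the load and the comparison but would not by itself catch the race in flight.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

// Byte load with the .cv cache operator, which treats the location as
// volatile and fetches it again instead of reusing a cached line.
__device__ __forceinline__ uint32_t load_u8_cv(const uint8_t* p)
{
    uint32_t v;
    asm volatile("ld.global.cv.u8 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
    return v; // zero-extended byte value
}

// Counts bytes in the scale-factor buffer that still hold the 0x7f fill.
// Caveat: a legitimately computed 0x7f byte would be counted as well.
__global__ void poison_probe_sketch(const uint8_t* sf, size_t n, unsigned long long* hits)
{
    size_t idx = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (idx < n && load_u8_cv(sf + idx) == 0x7fu)
    {
        atomicAdd(hits, 1ULL);
    }
}

// Hypothetical host wrapper: runs the probe over n device bytes and copies
// the hit count back to the host.
cudaError_t count_poison_bytes(const uint8_t* d_sf, size_t n, unsigned long long* h_hits)
{
    *h_hits = 0;
    if (n == 0)
    {
        return cudaSuccess; // avoid launching a zero-sized grid
    }
    unsigned long long* d_hits = nullptr;
    cudaError_t err = cudaMalloc(&d_hits, sizeof(*d_hits));
    if (err != cudaSuccess)
    {
        return err;
    }
    err = cudaMemset(d_hits, 0, sizeof(*d_hits));
    if (err == cudaSuccess)
    {
        unsigned int blocks = static_cast<unsigned int>((n + 255) / 256);
        poison_probe_sketch<<<blocks, 256>>>(d_sf, n, d_hits);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
    {
        // cudaMemcpy on the default stream waits for the probe kernel to finish.
        err = cudaMemcpy(h_hits, d_hits, sizeof(*d_hits), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_hits);
    return err;
}
```

Dividing the returned hit count by the number of sampled bytes gives a rate comparable in form to the percentage reported in the table above.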

Companion cutlass fix

The consumer-side fix at NVIDIA/cutlass#3279 addresses the missing wait_on_dependent_grids() in the sm103 blockscaled GEMM main_sf_load warp branch (the other half of the race described in Summary). Either fix alone closes the race; both are correct individually:

The cutlass fix protects every PDL producer routed through that GEMM.
This producer-side fix protects every PDL consumer downstream of quantize_with_block_size, regardless of which cutlass revision trtllm pulls in.
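For comparison, the consumer-side half can be sketched as below. This is not the cutlass code: consumer_sketch and launch_consumer_with_pdl are hypothetical names, and the sketch uses the CUDA runtime device call cudaGridDependencySynchronize(), which is assumed here to play the same role as the wait_on_dependent_grids() helper named above. The host function shows one way a secondary kernel is typically opted into PDL through a launch attribute.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical consumer standing in for the GEMM's scale-factor load path.
__global__ void consumer_sketch(const uint8_t* sf_in, uint8_t* dst, int n)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // Wait for the prerequisite grid before the first dependent load.
    // Work that does not read producer output could be placed above this call.
    cudaGridDependencySynchronize();
#endif
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        dst[idx] = sf_in[idx]; // dependent load of producer-written data
    }
}

// Launches the consumer with programmatic stream serialization allowed, so it
// may start before the preceding kernel in `stream` has fully finished.
cudaError_t launch_consumer_with_pdl(const uint8_t* sf_in, uint8_t* dst, int n,
                                     cudaStream_t stream)
{
    if (n <= 0)
    {
        return cudaSuccess; // avoid launching a zero-sized grid
    }

    cudaLaunchAttribute attr{};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;

    cudaLaunchConfig_t cfg{};
    cfg.gridDim = dim3((n + 255) / 256);
    cfg.blockDim = dim3(256);
    cfg.dynamicSmemBytes = 0;
    cfg.stream = stream;
    cfg.attrs = &attr;
    cfg.numAttrs = 1;

    return cudaLaunchKernelEx(&cfg, consumer_sketch, sf_in, dst, n);
}
```

In this sketch the wait is unconditional at kernel entry. The defect described for the sm103 blockscaled GEMM is the case where one warp-specialized branch reaches its first dependent load without any such wait.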

Dependency

This PR currently includes the #14585 commit as a prerequisite. Once #14585 lands, this branch will be rebased on top of main (the prerequisite commit will drop automatically via patch identity) and CI re-triggered.

Without #14585, the TestDeepSeekR1::test_nvfp4_multi_gpus[throughput_pp4_mtp] test case crashes on main (bs=32 OOM, nvbug 6018046) before reaching the PDL code path, so CI cannot validate this change until #14585 lands.

Risk

Surface: one producer kernel (quantize_with_block_size), covering NVFP4 / FP8 / MXFP8 paths.
Per-CTA cost: one __syncthreads() + one __threadfence() at the very end of the kernel (cold path). No effect on PDL launch overlap — the trigger still happens; it just happens after the CTA has drained.
No API or behavior change for callers.

… to 8

The throughput_pp4_mtp variant uses pp_size=4 + ep_size=1 + mtp_nextn=1 on a 4-GPU topology. With max_batch_size=32 (every other 4-GPU variant in this test uses bs<=16), the per-bs CUDA-graph snapshot pool plus per-step activation and NCCL collective buffers leave no headroom for the lazy cuBLAS Lt workspace allocation. Under sustained load this surfaces as mid-run CUBLAS_STATUS_EXECUTION_FAILED in the DeepseekV3 router / o_proj GEMMs, followed by an asynchronous illegal memory access.

Lower max_batch_size from 32 to 8 to match the already-stable throughput_bs8_mtp configuration. With bs=8 the default 0.70 KV cache fraction provides sufficient cuBLAS workspace headroom; the test passes in repeated runs without any KV fraction reduction. Removes the matching waiver entry.

Signed-off-by: Tianyu Xiong <117647511+tianyuxbear@users.noreply.github.com>

…_with_block_size Signed-off-by: Tianyu Xiong <117647511+tianyuxbear@users.noreply.github.com>

coderabbitai · 2026-05-28T04:55:54Z

📝 Walkthrough

The PR fixes a PDL memory-visibility race in NVIDIA's TensorRT-LLM by adding a CTA-wide barrier and memory fence to the quantization kernel. It also lowers the batch size in the DeepSeek-R1 NVFP4 test configuration and removes the corresponding test waiver.

Changes

NVFP4 DeepSeek-R1 Throughput Stabilization

Layer / File(s)	Summary
Kernel synchronization barrier and fence `cpp/tensorrt_llm/kernels/quantization.cuh`	CTA-wide barrier and threadfence are inserted in `quantize_with_block_size` before launch completion to drain stores and ensure visibility to downstream GEMM consumers, fixing nvbug 5970614.
Test configuration adjustment and waiver removal `tests/integration/defs/accuracy/test_llm_api_pytorch.py`, `tests/integration/test_lists/waives.txt`	Batch size reduced to 8 with inline documentation of the cuBLASLt workspace headroom issue; the corresponding SKIP waiver is removed so the test now executes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14504: Also modifies tests/integration/test_lists/waives.txt by changing SKIP/waiver entries for integration tests.

Suggested reviewers

mikeiovine
dongfengy
jieli-matrix
xinhe-nv

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title correctly identifies the primary fix: synchronizing the CTA before PDL trigger in quantize_with_block_size, with proper bug reference format.
Description check	✅ Passed	The PR description thoroughly covers the issue, root cause, fix, evidence, and risk assessment; however, it does not follow the template structure with clear section headings for Description and Test Coverage.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/tensorrt_llm/kernels/quantization.cuh (1)
2-2: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update the copyright header year on this modified source file.

This file was modified, but the NVIDIA header still ends at 2023. Please update it to include the latest modification year.

Suggested fix:
- * Copyright (c) 2019-2023, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2019-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines: **/*.{cpp,cc,h,hpp,py,cu,cuh}: Include NVIDIA copyright header on ALL new files; update year on modified files.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/quantization.cuh` at line 2, Update the NVIDIA
copyright header at the top of the modified source file by changing the year
range from "2019-2023" to include the current modification year (e.g.
"2019-2026"); edit the top-of-file header comment in quantization.cuh (the
initial copyright comment block) so the year range reflects the latest
modification year.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 04887fc1-1f9e-4ec2-8b75-a6af13d6921c

📥 Commits

Reviewing files that changed from the base of the PR and between 59d4369 and f8d3481.

📒 Files selected for processing (3)

cpp/tensorrt_llm/kernels/quantization.cuh
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tests/integration/test_lists/waives.txt

💤 Files with no reviewable changes (1)

tests/integration/test_lists/waives.txt

tianyuxbear added 2 commits May 26, 2026 09:15

[https://nvbugs/5970614][fix] Sync CTA before PDL trigger in quantize…

f8d3481

…_with_block_size Signed-off-by: Tianyu Xiong <117647511+tianyuxbear@users.noreply.github.com>

tianyuxbear requested a review from a team as a code owner May 28, 2026 04:52

github-actions Bot assigned tianyuxbear May 28, 2026

coderabbitai Bot reviewed May 28, 2026

xinhe-nv approved these changes May 29, 2026

tianyuxbear wants to merge 2 commits into NVIDIA:main from tianyuxbear:fix/5970614
