True-bfloat16 inference for cache aware pipelines by naymaraq · Pull Request #15763 · NVIDIA-NeMo/NeMo

naymaraq · 2026-06-07T19:59:39Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Fix true-bfloat16 (use_amp=false) inference for cache-aware streaming ASR. With the fix, we observed a significant RTFx improvement and a 2x reduction in cache memory.

Collection: [ASR]

Changelog

Problem: With use_amp: false, the cache-aware wrappers cast the model weights to bfloat16 but disabled autocast, while the input mel features and the encoder caches stayed float32. The encoder therefore received float32 inputs and caches against bfloat16 weights with no autocast to reconcile them, which raised a dtype-mismatch error.

Cast input features: In both CacheAwareRNNTInferenceWrapper.stream_step and CacheAwareCTCInferenceWrapper.stream_step, cast processed_signal to self.cast_dtype after the device move.
Align cache dtype: CacheAwareASRInferenceWrapper.get_initial_cache_state now passes dtype=self.cast_dtype, so the context manager's persistent cache storage matches the encoder's output caches.
Config defaults: Set use_amp: false in cache_aware_rnnt.yaml and cache_aware_ctc.yaml so the example configs run true bf16 by default.
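The failure mode and the two casts described above can be reproduced in isolation. The following is a minimal sketch, not the NeMo implementation: `ToyEncoder`, `stream_step`, and `get_initial_cache_state` are hypothetical stand-ins that only borrow names from the changelog, and the cache layout is invented for illustration. It assumes a PyTorch build with bfloat16 support on CPU.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a cache-aware encoder (not the NeMo class)."""

    def __init__(self, feat_dim: int = 8, cache_len: int = 4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)
        self.cache_len = cache_len

    def forward(self, chunk: torch.Tensor, cache: torch.Tensor):
        # Prepend cached frames along the time axis, then project.
        out = self.proj(torch.cat([cache, chunk], dim=1))
        # Return the frames for the new chunk and the most recent
        # `cache_len` frames as the next cache.
        return out[:, -chunk.shape[1]:], out[:, -self.cache_len:]


def get_initial_cache_state(batch, cache_len, feat_dim, dtype):
    # Allocate the persistent cache in the same dtype as the weights.
    return torch.zeros(batch, cache_len, feat_dim, dtype=dtype)


def stream_step(encoder, processed_signal, cache, cast_dtype):
    # Cast the float32 features to the weight dtype before the encoder call.
    processed_signal = processed_signal.to(cast_dtype)
    with torch.no_grad():
        return encoder(processed_signal, cache)


cast_dtype = torch.bfloat16
encoder = ToyEncoder().to(cast_dtype)  # weights in bf16, no autocast
features = torch.randn(2, 3, 8)        # float32, as a preprocessor would emit

# Without the casts: float32 inputs and cache against bf16 weights.
mismatch_raised = False
try:
    encoder(features, torch.zeros(2, 4, 8))
except RuntimeError:
    mismatch_raised = True

# With the casts: inputs, cache, and outputs all stay in bf16.
cache = get_initial_cache_state(2, 4, 8, dtype=cast_dtype)
out, cache = stream_step(encoder, features, cache, cast_dtype)
```

In this toy setup the uncast call fails with a dtype error, while the cast path keeps both the chunk output and the returned cache in `torch.bfloat16`, so the next step can reuse the cache without another conversion.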

Results

Setup:

num_slots=256 and batch_size=64
Experiments done on NVIDIA RTX 5000 Ada GPU

Key Findings:

True-bf16 inference halved cache memory versus fp32 (936 MB vs. 1872 MB) with essentially no WER degradation across all attention context sizes.
True-bf16 inference also improved RTFx on LS-OTHER by roughly 23% to 28% over bf16 with AMP across all attention context sizes (for example, 407 to 499 at [70, 13]).
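The 2x cache figure follows from element width: float32 stores 4 bytes per element and bfloat16 stores 2. A small sketch of that arithmetic, using only the sizes reported above. The element count is back-computed from the fp32 figure under the assumption that MB here means MiB; it is not taken from the PR.

```python
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2}
MIB = 1024 * 1024

fp32_cache_mb = 1872  # reported in the findings above

# Back-compute the number of cached elements (assumption: MB = MiB).
num_elements = fp32_cache_mb * MIB // BYTES_PER_ELEMENT["float32"]


def cache_size_mb(num_elements: int, dtype: str) -> float:
    """Size of a cache holding `num_elements` values of the given dtype."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / MIB


bf16_cache_mb = cache_size_mb(num_elements, "bfloat16")
compression_ratio = fp32_cache_mb / bf16_cache_mb
print(bf16_cache_mb, compression_ratio)  # 936.0 2.0
```

The ratio does not depend on the MB/MiB assumption, since the unit cancels out.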

| Method | AMP | Att. context size | Comp. ratio | LS-CLEAN WER | LS-OTHER WER | TED WER | VOX WER | Avg. WER | RTFx (LS-OTHER) |
|---|---|---|---|---|---|---|---|---|---|
| bf16 | TRUE | [70, 13] | x2 | 5.28% | 8.26% | 12.03% | 11.13% | 9.18% | 407 |
| bf16 | FALSE | [70, 13] | x2 | 5.29% | 8.26% | 12.01% | 11.15% | 9.18% | 499 |
| bf16 | TRUE | [70, 6] | x2 | 5.38% | 8.49% | 12.02% | 11.23% | 9.28% | 264 |
| bf16 | FALSE | [70, 6] | x2 | 5.37% | 8.48% | 12.00% | 11.21% | 9.26% | 338 |
| bf16 | TRUE | [70, 1] | x2 | 5.57% | 8.91% | 12.30% | 11.56% | 9.58% | 107 |
| bf16 | FALSE | [70, 1] | x2 | 5.56% | 8.91% | 12.30% | 11.51% | 9.57% | 136 |
| bf16 | TRUE | [70, 0] | x2 | 6.06% | 9.80% | 12.61% | 12.70% | 10.29% | 63 |
| bf16 | FALSE | [70, 0] | x2 | 6.08% | 9.79% | 12.66% | 12.67% | 10.30% | 80 |
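To make the RTFx comparison concrete, the per-context speedup of true bf16 (AMP FALSE) over bf16 with AMP can be computed from the RTFx column. This is a small sketch that copies the numbers from the table; the dictionary layout and key names are only illustrative.

```python
# RTFx on LS-OTHER from the table above, keyed by attention context size.
rtfx = {
    (70, 13): {"amp": 407, "true_bf16": 499},
    (70, 6): {"amp": 264, "true_bf16": 338},
    (70, 1): {"amp": 107, "true_bf16": 136},
    (70, 0): {"amp": 63, "true_bf16": 80},
}

# Speedup of true bf16 relative to bf16 with AMP, per context size.
speedup = {ctx: r["true_bf16"] / r["amp"] for ctx, r in rtfx.items()}

for ctx, s in speedup.items():
    print(f"att_context_size={list(ctx)}: {s:.2f}x")
# att_context_size=[70, 13]: 1.23x
# att_context_size=[70, 6]: 1.28x
# att_context_size=[70, 1]: 1.27x
# att_context_size=[70, 0]: 1.27x
```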

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

Signed-off-by: naymaraq <dkaramyan@nvidia.com>

copy-pr-bot · 2026-06-07T19:59:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

MahmoudAshraf97 · 2026-06-08T08:02:57Z

If we are looking for RTF improvements, maybe we should move the caching step to after the QKV projection. That computation is repeated at every step on cached inputs. The cost would be double the memory needed for the cache, but that is already affordable since the cache is capped at a certain size.

artbataev

Thanks a lot! Great improvement!

naymaraq · 2026-06-08T08:44:03Z

> If we are looking for RTF improvements, maybe we should move the caching step to after the QKV projection. That computation is repeated at every step on cached inputs. The cost would be double the memory needed for the cache, but that is already affordable since the cache is capped at a certain size.

Good point. We need to understand what kind of RTFx gain it will provide at the cost of doubling the memory usage.
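The trade-off discussed here can be illustrated with a toy example. This is a sketch under assumptions, not NeMo's attention code: the projection layers, shapes, and variable names below are hypothetical. Caching the layer input means the key and value projections are recomputed over the cached frames at every step; caching the projected keys and values avoids that, at the cost of storing two tensors instead of one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, cache_len, chunk_len = 16, 6, 2
k_proj = nn.Linear(d_model, d_model)  # hypothetical key projection
v_proj = nn.Linear(d_model, d_model)  # hypothetical value projection

cached_input = torch.randn(1, cache_len, d_model)
new_chunk = torch.randn(1, chunk_len, d_model)

with torch.no_grad():
    # Option 1: cache the layer input and re-project cached frames every step.
    full_input = torch.cat([cached_input, new_chunk], dim=1)
    k_recomputed = k_proj(full_input)
    v_recomputed = v_proj(full_input)

    # Option 2: cache projected K and V (computed in earlier steps)
    # and project only the new chunk.
    k_cache, v_cache = k_proj(cached_input), v_proj(cached_input)
    k_incremental = torch.cat([k_cache, k_proj(new_chunk)], dim=1)
    v_incremental = torch.cat([v_cache, v_proj(new_chunk)], dim=1)

# Both options produce the same keys and values up to float rounding.
same_keys = torch.allclose(k_recomputed, k_incremental, atol=1e-5)
same_values = torch.allclose(v_recomputed, v_incremental, atol=1e-5)

# Option 2 stores twice as many cached elements as option 1.
input_cache_elems = cached_input.numel()
kv_cache_elems = k_cache.numel() + v_cache.numel()
```

Whether the saved projections translate into a meaningful RTFx gain would still need to be measured on the real model.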

naymaraq · 2026-06-08T16:04:46Z

/claude review

claude · 2026-06-08T16:07:03Z

Test coverage: There are no unit tests covering the cache-aware inference wrappers (CacheAwareRNNTInferenceWrapper, CacheAwareCTCInferenceWrapper). A test that exercises stream_step with use_amp=False and a non-float32 compute_dtype would guard against dtype-mismatch regressions — which is exactly the bug this PR fixes.

artbataev · 2026-06-08T19:01:49Z

/ok to test 4d15f6e

fix for true bfloat16 inference

40e1dd9

Signed-off-by: naymaraq <dkaramyan@nvidia.com>

naymaraq requested a review from artbataev June 7, 2026 19:59

github-actions Bot added the ASR label Jun 7, 2026

artbataev approved these changes Jun 8, 2026

naymaraq added the CI label Jun 8, 2026

naymaraq enabled auto-merge (squash) June 8, 2026 10:14

naymaraq added Run CICD and removed CI Run CICD labels Jun 8, 2026

Merge branch 'main' into dkaramyan/cache-pipelines-bf16

4d15f6e

copy-pr-bot Bot temporarily deployed to public June 8, 2026 19:02 Inactive

copy-pr-bot Bot temporarily deployed to test June 8, 2026 19:03 Inactive

copy-pr-bot Bot temporarily deployed to public June 8, 2026 19:06 Inactive

naymaraq wants to merge 2 commits into main from dkaramyan/cache-pipelines-bf16

3 participants
