
Optimize x86 fp16s innerproduct gemm to eliminate loop-carried stalls#6682

Open
Edwardssss wants to merge 3 commits into Tencent:master from Edwardssss:opt-innerproduct-x86-fp16s-fma

Conversation

@Edwardssss

PR Description:

Overview
This PR significantly improves the performance of innerproduct_gemm_fp16s_sse on x86 architectures by mitigating severe loop-carried dependency stalls and replacing inefficient instructions in the microkernels.

Problem
The existing innerproduct_gemm_fp16s implementation had two major bottlenecks:

  1. Loop-carried stalls: the FMA loop reused the same destination registers (e.g. `_sum0` to `_sum3`) across accumulation iterations. Given the 4-5 cycle latency of typical FMA instructions, this serial dependency chain caused severe pipeline stalls and left execution ports underutilized.
  2. Instruction inefficiency: the FP16->FP32 conversion went through `_mm256_extractf128_si256`, which is slightly more expensive than feeding the conversion a direct load that the compiler can fuse into a memory operand.
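The stall in point 1 can be sketched in plain scalar C++ (an illustrative sketch, not the actual ncnn kernel): a single accumulator serializes every multiply-add behind the previous one's latency, while independent accumulators let the core keep several in flight.

```cpp
#include <cstddef>

// One accumulator: each iteration's add depends on the previous sum,
// so the chain advances only one fma per ~4-5 cycle latency window.
float dot_single_acc(const float* a, const float* b, size_t n)
{
    float sum = 0.f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i]; // next iteration must wait for this sum
    return sum;
}

// Independent accumulators break the false dependency; the partial
// sums are only combined once at the end.
float dot_split_acc(const float* a, const float* b, size_t n)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) // scalar tail
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```

The PR applies the same idea at vector width, with up to 8 `__m256` accumulators per microkernel.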

Solution

  • Loop Unrolling & Accumulator Expansion: Fully unrolled the loops in the elempack == 1 with num_output_elempack == 8 / 16 / 4 blocks. Up to 8 independent accumulator registers now break the false data dependency and hide FMA latency.
  • Memory Operand Fusion: Dropped `_mm256_extractf128_si256` in favor of direct 128-bit loads (`_mm_lddqu_si128`) feeding `_mm256_cvtph_ps`, which the compiler can fuse into a memory operand, reducing register traffic.
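The load rewrite can be sketched as follows (illustrative helper names, not ncnn's actual functions; the `target` attribute is a GCC/Clang extension used here so the snippet builds without a global `-mf16c` flag). Both helpers convert 16 packed fp16 values to fp32:

```cpp
#include <immintrin.h>
#include <cstdint>

// Before: one 256-bit load, then extract each 128-bit half before
// converting. The extract keeps a full ymm register live just to feed
// the two vcvtph2ps instructions.
__attribute__((target("avx,f16c")))
void cvt_fp16_to_fp32_extract(const uint16_t* p, float* out)
{
    __m256i raw = _mm256_lddqu_si256((const __m256i*)p);
    __m256 lo = _mm256_cvtph_ps(_mm256_extractf128_si256(raw, 0));
    __m256 hi = _mm256_cvtph_ps(_mm256_extractf128_si256(raw, 1));
    _mm256_storeu_ps(out, lo);
    _mm256_storeu_ps(out + 8, hi);
}

// After: two direct 128-bit loads feed the conversions. The compiler
// can fold each load into vcvtph2ps as a memory operand, so no extract
// (and no extra ymm register) is needed.
__attribute__((target("avx,f16c")))
void cvt_fp16_to_fp32_fused(const uint16_t* p, float* out)
{
    __m256 lo = _mm256_cvtph_ps(_mm_lddqu_si128((const __m128i*)p));
    __m256 hi = _mm256_cvtph_ps(_mm_lddqu_si128((const __m128i*)(p + 8)));
    _mm256_storeu_ps(out, lo);
    _mm256_storeu_ps(out + 8, hi);
}
```

Running either helper requires a CPU with AVX and F16C (any mainstream x86-64 part since roughly 2013).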

Benchmark Results (AMD Family 17h, 12 Cores, FP16 mode)
Tested via `benchncnn 4 6 2 -1`:

| Model | Original (min, ms) | PR (min, ms) | Improvement |
| --- | --- | --- | --- |
| vgg16 | 503.22 | 417.70 | ~17.0% |
| resnet50 | 326.63 | 284.35 | ~13.0% |
| resnet18 | 114.28 | 98.78 | ~13.5% |

AMDuProf Analysis:
Profiling the single-process CPU runtime confirms a significant drop in CPU_TIME for the innerproduct_gemm_fp16s_sse hotspot:

  • Before optimization: 67.87s
  • After optimization: 55.14s (a reduction of more than 12s in total execution time).

Commit Split:

  1. Optimize innerproduct x86 fp16s gemm using fused loads and fully unrolled FMA (addresses the 1x8 case alongside the load logic rewrite).
  2. Alleviate loop-carried stalls in innerproduct fp16s microkernels by unrolling (extends the same fixes to the 1x16 and 1x4 blocks).

@github-actions github-actions Bot added the x86 label Apr 16, 2026
@tencent-adm
Member

tencent-adm commented Apr 16, 2026

CLA assistant check
All committers have signed the CLA.

@codecov-commenter

codecov-commenter commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.99%. Comparing base (086dda4) to head (904955b).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6682      +/-   ##
==========================================
+ Coverage   93.65%   93.99%   +0.34%     
==========================================
  Files         930      930              
  Lines      296508   298257    +1749     
==========================================
+ Hits       277688   280341    +2653     
+ Misses      18820    17916     -904     

☔ View full report in Codecov by Sentry.
