
Optimize x86 fp16s innerproduct gemm to eliminate loop-carried stalls#6682

Open
Edwardssss wants to merge 3 commits into Tencent:master from Edwardssss:opt-innerproduct-x86-fp16s-fma

Conversation

@Edwardssss

PR Description:

Overview
This PR significantly improves the performance of innerproduct_gemm_fp16s_sse on x86 architectures by mitigating severe loop-carried dependency stalls and replacing inefficient instructions in the microkernels.

Problem
The existing innerproduct_gemm_fp16s implementation had two major bottlenecks:

  1. Loop-carried stalls: the FMA loop reused the same destination registers (e.g. `_sum0` to `_sum3`) across accumulation iterations. Given the 4-5 cycle latency of typical FMA instructions, this serial dependency chain caused severe pipeline stalls and left execution ports underutilized.
  2. Instruction inefficiency: the FP16->FP32 conversion went through `_mm256_extractf128_si256`, which is slightly more expensive than feeding the conversion a direct load that the compiler can fuse into a memory operand.
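The stall in point 1 can be sketched in plain scalar C++ (an illustrative sketch, not the actual ncnn kernel): a single accumulator serializes every multiply-add behind the previous one's latency, while independent accumulators let the core keep several in flight.

```cpp
#include <cstddef>

// One accumulator: each iteration's add depends on the previous sum,
// so the chain advances only one fma per ~4-5 cycle latency window.
float dot_single_acc(const float* a, const float* b, size_t n)
{
    float sum = 0.f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i]; // next iteration must wait for this sum
    return sum;
}

// Independent accumulators break the false dependency; the partial
// sums are only combined once at the end.
float dot_split_acc(const float* a, const float* b, size_t n)
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) // scalar tail
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}
```

The PR applies the same idea at vector width, with up to 8 `__m256` accumulators per microkernel.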

Solution

  • Loop Unrolling & Accumulator Expansion: Fully unrolled the loops in the elempack == 1 with num_output_elempack == 8 / 16 / 4 blocks. Up to 8 independent accumulator registers now break the false data dependency and hide FMA latency.
  • Memory Operand Fusion: Dropped `_mm256_extractf128_si256` in favor of direct 128-bit loads (`_mm_lddqu_si128`) feeding `_mm256_cvtph_ps`, which the compiler can fuse into a memory operand, reducing register traffic.
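The load rewrite can be sketched as follows (illustrative helper names, not ncnn's actual functions; the `target` attribute is a GCC/Clang extension used here so the snippet builds without a global `-mf16c` flag). Both helpers convert 16 packed fp16 values to fp32:

```cpp
#include <immintrin.h>
#include <cstdint>

// Before: one 256-bit load, then extract each 128-bit half before
// converting. The extract keeps a full ymm register live just to feed
// the two vcvtph2ps instructions.
__attribute__((target("avx,f16c")))
void cvt_fp16_to_fp32_extract(const uint16_t* p, float* out)
{
    __m256i raw = _mm256_lddqu_si256((const __m256i*)p);
    __m256 lo = _mm256_cvtph_ps(_mm256_extractf128_si256(raw, 0));
    __m256 hi = _mm256_cvtph_ps(_mm256_extractf128_si256(raw, 1));
    _mm256_storeu_ps(out, lo);
    _mm256_storeu_ps(out + 8, hi);
}

// After: two direct 128-bit loads feed the conversions. The compiler
// can fold each load into vcvtph2ps as a memory operand, so no extract
// (and no extra ymm register) is needed.
__attribute__((target("avx,f16c")))
void cvt_fp16_to_fp32_fused(const uint16_t* p, float* out)
{
    __m256 lo = _mm256_cvtph_ps(_mm_lddqu_si128((const __m128i*)p));
    __m256 hi = _mm256_cvtph_ps(_mm_lddqu_si128((const __m128i*)(p + 8)));
    _mm256_storeu_ps(out, lo);
    _mm256_storeu_ps(out + 8, hi);
}
```

Running either helper requires a CPU with AVX and F16C (any mainstream x86-64 part since roughly 2013).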

Benchmark Results (AMD Family 17h, 12 Cores, FP16 mode)
Tested via `benchncnn 4 6 2 -1`:

| Model | Original (min, ms) | PR (min, ms) | Improvement |
| --- | --- | --- | --- |
| vgg16 | 503.22 | 417.70 | ~17.0% |
| resnet50 | 326.63 | 284.35 | ~13.0% |
| resnet18 | 114.28 | 98.78 | ~13.5% |

AMDuProf Analysis:
Profiling the single-process CPU runtime confirms a significant drop in CPU_TIME for the innerproduct_gemm_fp16s_sse hotspot:

  • Before optimization: 67.87s
  • After optimization: 55.14s (a reduction of more than 12s in total execution time).

Commit Split:

  1. Optimize innerproduct x86 fp16s gemm using fused loads and fully unrolled FMA (addresses the 1x8 case alongside the load logic rewrite).
  2. Alleviate loop-carried stalls in innerproduct fp16s microkernels by unrolling (extends the same fixes to the 1x16 and 1x4 blocks).

@github-actions github-actions Bot added the x86 label Apr 16, 2026
@tencent-adm
Member

tencent-adm commented Apr 16, 2026

CLA assistant check
All committers have signed the CLA.

@codecov-commenter

codecov-commenter commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.99%. Comparing base (086dda4) to head (904955b).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6682      +/-   ##
==========================================
+ Coverage   93.65%   93.99%   +0.34%     
==========================================
  Files         930      930              
  Lines      296508   298257    +1749     
==========================================
+ Hits       277688   280341    +2653     
+ Misses      18820    17916     -904     

☔ View full report in Codecov by Sentry.
