
opt innerproduct and convolutiondepthwise x86 int8 sse4.1 #6687

Open

Edwardssss wants to merge 2 commits into Tencent:master from Edwardssss:opt-innerproduct-x86-int8-sse41

Conversation

@Edwardssss

Description

This resolves TODO items in src/layer/x86/convolutiondepthwise_x86.cpp and src/layer/x86/innerproduct_x86.cpp.

Use the SSE4.1 intrinsic `_mm_cvtepi8_epi16` (the `pmovsxbw` instruction) for int8 sign extension on x86, replacing the legacy SSE2 pseudo-sign-extension sequence.

This avoids the unpack-and-shift overhead of the SSE2 sequence and noticeably improves int8 inference performance on x86, particularly for models with heavy fully connected or depthwise separable convolution layers. The SSE2 fallback loops are kept behind `#ifndef __SSE4_1__` conditional guards (paired with the `-msse4.1` build flag) to preserve backward compatibility.

Benchmark

  • Environment: Debian 13, 4 threads (Command: ./benchmark/benchncnn 8 4 0)
  • CPU: 12 × AMD Ryzen 5 PRO 4650U with Radeon Graphics
  • Base: master (Clean build) vs PR Branch: opt-innerproduct-x86-int8-sse41
| Model | Master Avg (ms) | PR Avg (ms) | Speedup (%) |
|---|---|---|---|
| squeezenet_int8 | 6.83 | 6.56 | +3.95% |
| mobilenet_int8 | 7.01 | 5.44 | +22.40% |
| googlenet_int8 | 13.18 | 13.03 | +1.14% |
| resnet18_int8 | 9.68 | 9.74 | -0.62% |
| vgg16_int8 | 78.44 | 64.34 | +17.98% |
| resnet50_int8 | 32.63 | 27.72 | +15.05% |
| squeezenet_ssd_int8 | 17.06 | 15.57 | +8.73% |
| mobilenet_ssd_int8 | 11.35 | 11.09 | +2.29% |

I noticed that benchmark/README.md does not seem to have results for a device similar to mine. If useful, I can submit a follow-up PR to add them. :)

@github-actions github-actions Bot added the x86 label Apr 19, 2026
