
[WIP] massive mips and loongarch optimization#6662

Open
nihui wants to merge 76 commits into Tencent:master from nihui:mips-opt3

Conversation

@nihui
Member

@nihui nihui commented Apr 9, 2026

No description provided.

@tencent-adm
Member

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov-commenter

codecov-commenter commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 94.91028% with 156 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.20%. Comparing base (71b1a61) to head (e910cbb).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
src/layer/loongarch/convolution_loongarch.cpp 72.79% 117 Missing ⚠️
src/layer/loongarch/convolution_im2col_gemm_int8.h 96.44% 26 Missing ⚠️
src/layer/loongarch/convolution_packed_bf16s.h 98.88% 4 Missing ⚠️
src/layer/loongarch/convolution1d_loongarch.cpp 92.50% 3 Missing ⚠️
src/layer/loongarch/convolution_packed_int8.h 98.87% 3 Missing ⚠️
src/layer/loongarch/binaryop_loongarch.cpp 99.32% 2 Missing ⚠️
src/layer/loongarch/convolution_packed.h 99.72% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6662      +/-   ##
==========================================
- Coverage   93.96%   93.20%   -0.76%     
==========================================
  Files         932      932              
  Lines      299059   332717   +33658     
==========================================
+ Hits       280998   310099   +29101     
- Misses      18061    22618    +4557     

☔ View full report in Codecov by Sentry.


nihui and others added 11 commits April 10, 2026 07:10
Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile,
transpose_unpack_output_tile, and gemm_transB_packed_tile for all
ii sections (8, 4, 2, 1). MIPS MSA has 32 SIMD registers so
jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4).

Update get_optimal_tile_mnk to align TILE_N to multiples of 12
for better utilization of the new kernel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
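The TILE_N alignment described above can be sketched as a simple round-up to the next multiple of 12, so every jj+=12 inner-loop pass covers a whole tile. This is an illustrative helper, not the actual get_optimal_tile_mnk() implementation; the name align_tile_n is hypothetical.

```cpp
// Illustrative only: round a tile width up to a multiple of 12 so the
// jj+=12 unrolled loop processes whole tiles without a remainder pass.
// Not the real get_optimal_tile_mnk(); the function name is hypothetical.
static int align_tile_n(int tile_n)
{
    return (tile_n + 11) / 12 * 12;
}
```

With 32 MSA registers, a 12-wide column fits the register budget quoted in the commit message (24 registers for the ii+=8 section, 12 for ii+=4), which is why 12 rather than 8 or 16 is the alignment target.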
…ngArch

Integrate bf16 storage support into multiple operators:

MIPS: batchnorm, clip, dropout, selu, erf
LoongArch: batchnorm, clip, dropout

Each operator now declares forward_inplace_bf16s in its header,
sets support_bf16_storage=true in the constructor, dispatches bf16
inputs from forward_inplace, and implements the bf16s path using
the existing bf16s helper headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
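The bf16 storage path above relies on bfloat16 being the top 16 bits of an IEEE-754 float, so widening and narrowing are single shifts. A minimal sketch of that conversion, assuming simple truncation (no rounding), which is how the format is commonly handled:

```cpp
#include <cstdint>
#include <cstring>

// Sketch, assuming truncating conversion: bfloat16 keeps the sign,
// exponent, and top 7 mantissa bits of a float32.
static uint16_t float32_to_bfloat16(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return (uint16_t)(u >> 16); // drop the low 16 mantissa bits
}

static float bfloat16_to_float32(uint16_t b)
{
    uint32_t u = (uint32_t)b << 16; // zero-fill the dropped mantissa bits
    float f;
    memcpy(&f, &u, sizeof(f));
    return f;
}
```

A forward_inplace_bf16s path widens each element with the second helper, applies the fp32 math, then narrows with the first, halving activation storage at the cost of mantissa precision.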
- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch bf16 pack8 (128-bit LSX, since 8 bf16 lanes fit one LSX register)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
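The "int64_t copies (8 bytes)" trick in the crop functions above works because one pack4 bf16 element is 4 × 2 bytes = 8 bytes, exactly one 64-bit move. A hedged scalar sketch of that copy core (the function name and signature are illustrative, not the actual crop_pack4_bf16s_msa()):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative core of a pack4 bf16 row copy: each packed element
// (4 bf16 lanes) is 8 bytes, so it moves as a single 64-bit chunk.
// memcpy is used instead of a raw int64_t* cast to stay alignment-safe.
static void crop_row_pack4_bf16(const uint16_t* src, uint16_t* dst, int w)
{
    for (int i = 0; i < w; i++)
        memcpy(dst + i * 4, src + i * 4, sizeof(int64_t)); // one 8-byte move
}
```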
Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing
256-bit SIMD (8 floats) resize operations using LASX intrinsics.

Update interp_loongarch.cpp to:
- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
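For the bilinear pack8 path above, each output pixel blends four source pixels with row/column weights, applied identically across the 8 channel lanes that one 256-bit LASX register holds. A scalar sketch of that per-pixel computation (illustrative, not the actual kernel from interp_bilinear_pack8.h):

```cpp
// What the pack8 bilinear path computes per output pixel, written as a
// scalar loop over the 8 channel lanes that LASX processes in one
// 256-bit register. v00..v11 are the four neighbouring packed pixels;
// a0/a1 are horizontal weights, b0/b1 vertical weights (each pair sums to 1).
static void bilinear_blend_pack8(const float* v00, const float* v01,
                                 const float* v10, const float* v11,
                                 float a0, float a1, float b0, float b1,
                                 float* out)
{
    for (int k = 0; k < 8; k++)
        out[k] = b0 * (a0 * v00[k] + a1 * v01[k])
               + b1 * (a0 * v10[k] + a1 * v11[k]);
}
```

The vector version replaces the k loop with fused multiply-add intrinsics on whole registers, which is why elempack == 8 maps so directly onto LASX.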
… approach

- Replace hand-written kernel packing and convolution loops with
  convolution1d_transform_kernel_packed() and convolution1d_packed()
  from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16->fp32->conv->bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
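The cast-based bf16 approach above (bf16 → fp32 → conv → bf16) can be sketched end to end on a 1-D signal. Everything here is a simplified stand-in: the helper names are hypothetical, the convolution is plain valid-padding stride-1 fp32 code rather than convolution1d_packed(), and bf16 conversion is assumed to be truncating.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helpers, assuming truncating bf16 conversion.
static float bf16_to_f32(uint16_t b)
{
    uint32_t u = (uint32_t)b << 16;
    float f;
    memcpy(&f, &u, sizeof(f));
    return f;
}
static uint16_t f32_to_bf16(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof(u));
    return (uint16_t)(u >> 16);
}

// Sketch of the bf16->fp32->conv->bf16 pipeline on a 1-D signal.
static std::vector<uint16_t> conv1d_bf16s(const std::vector<uint16_t>& in,
                                          const std::vector<float>& kernel)
{
    // 1) widen bf16 input to fp32
    std::vector<float> x(in.size());
    for (size_t i = 0; i < in.size(); i++)
        x[i] = bf16_to_f32(in[i]);

    // 2) run the existing fp32 convolution (valid padding, stride 1)
    size_t outw = x.size() - kernel.size() + 1;
    std::vector<float> y(outw, 0.f);
    for (size_t i = 0; i < outw; i++)
        for (size_t k = 0; k < kernel.size(); k++)
            y[i] += x[i + k] * kernel[k];

    // 3) narrow the result back to bf16 storage
    std::vector<uint16_t> out(outw);
    for (size_t i = 0; i < outw; i++)
        out[i] = f32_to_bf16(y[i]);
    return out;
}
```

The design trade-off is that the arithmetic stays in the well-tested fp32 kernels and only the storage format changes, at the cost of two extra cast passes over the data.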


3 participants