[WIP] massive mips and loongarch optimization #6662
Open
nihui wants to merge 76 commits into Tencent:master
Conversation
Codecov Report

❌ Patch coverage is

@@            Coverage Diff             @@
##           master    #6662      +/-   ##
==========================================
- Coverage   93.96%   93.20%   -0.76%
==========================================
  Files         932      932
  Lines      299059   332717   +33658
==========================================
+ Hits       280998   310099   +29101
- Misses      18061    22618    +4557

☔ View full report in Codecov by Sentry.
Add jj+=12 loop unrolling to pack_B_tile, transpose_pack_B_tile, transpose_unpack_output_tile, and gemm_transB_packed_tile for all ii sections (8, 4, 2, 1). MIPS MSA has 32 SIMD registers, so jj+=12 fits well (24 registers for ii+=8, 12 for ii+=4). Update get_optimal_tile_mnk to align TILE_N to multiples of 12 for better utilization of the new kernel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ngArch

Integrate bf16 storage support into multiple operators:

- MIPS: batchnorm, clip, dropout, selu, erf
- LoongArch: batchnorm, clip, dropout

Each operator now declares forward_inplace_bf16s in its header, sets support_bf16_storage=true in the constructor, dispatches bf16 inputs from forward_inplace, and implements the bf16s path using the existing bf16s helper headers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add support_bf16_storage = true in constructors for both architectures
- Add crop_pack4_bf16s_msa() for MIPS MSA using int64_t copies (8 bytes)
- Add crop_pack4_bf16s_lsx() for LoongArch LSX using int64_t copies
- Add crop_pack8_lasx() for LoongArch LASX float pack8 (256-bit)
- Add crop_pack8_bf16s_lsx() for LoongArch LASX bf16 pack8 (128-bit)
- Dispatch to bf16 variants when elemsize matches bf16 packing
- Remove debug fprintf statements from MIPS deconvolution_packed_bf16s.h

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add interp_bilinear_pack8.h and interp_bicubic_pack8.h implementing 256-bit SIMD (8 floats) resize operations using LASX intrinsics. Update interp_loongarch.cpp to:

- Include lasxintrin.h and the new pack8 headers under __loongarch_asx
- Add elempack == 8 paths for dims 1, 2, and 3 (nearest, bilinear, bicubic)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… approach

- Replace hand-written kernel packing and convolution loops with convolution1d_transform_kernel_packed() and convolution1d_packed() from convolution1d_packed.h
- Rename weight_data_packed to weight_data_tm to match x86 pattern
- Add LASX (256-bit) support with pack8 out_elempack
- Add NCNN_BF16 support using cast-based approach (bf16->fp32->conv->bf16)
- Add bf16 weight/bias cast in dynamic weight forward path
- Include cpu.h, lasxintrin.h headers for new functionality

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
No description provided.