Add AVX2 optimizations for SAD, SadFour, and intra prediction (encoder) #92
Open
rbenaley wants to merge 2 commits into
Add AVX2 implementations for H.264 encoding performance improvement:

Build system:
- Define HAVE_AVX2 for NASM assembly compilation
- Define HAVE_AVX2 for C++ compilation

SAD (Sum of Absolute Differences) functions:
- WelsSampleSad16x16_avx2, WelsSampleSad16x8_avx2
- WelsSampleSad8x16_avx2, WelsSampleSad8x8_avx2
- WelsSampleSadFour16x16_avx2, WelsSampleSadFour16x8_avx2
- WelsSampleSadFour8x16_avx2, WelsSampleSadFour8x8_avx2

Intra Prediction functions:
- WelsI16x16LumaPredDc_avx2, WelsI16x16LumaPredPlane_avx2
- WelsI16x16LumaPredV_avx2, WelsI16x16LumaPredH_avx2

Performance: ~30-50% CPU reduction for video encoding on x86_64 with AVX2.

Made-with: Cursor
feat(encoder): AVX2 optimizations for SAD and Intra Prediction
Owner
Thanks for the PR. Having faster encoding would be great, however, this looks like it's modifying upstream openh264 files directly. Unfortunately we can't diverge from upstream for maintenance reasons. The proper way of landing these is getting them merged to upstream master first (you seem to have PRs in flight already), then bumping the pinned commit / SHA here. About your |
This PR enables AVX2 support in the OpenH264 encoder by fixing `build.rs` to pass `-DHAVE_AVX2` to NASM, and adds 13 new AVX2-optimized assembly functions for SAD and intra prediction.

Problem

The current `build.rs` in `openh264-sys2` does not pass the `-DHAVE_AVX2` define to NASM. This means that all AVX2 assembly code in upstream OpenH264 -- including existing implementations for SATD, DCT, quantization, motion compensation, and VAA -- is silently excluded from the build. Only SSE2 code paths are compiled.

Changes
1. Build system fix (`openh264-sys2/build.rs`)

Pass `-DHAVE_AVX2` to both NASM and the C++ compiler on x86-64 targets, matching Cisco's official `meson.build` behavior. This alone unlocks all existing upstream AVX2 code.

2. New AVX2 SAD functions (8 functions, `codec/common/x86/satd_sad.asm`)

The existing SATD functions already have AVX2 implementations in `satd_sad.asm`, but the corresponding SAD functions were missing. SAD is the primary block comparison metric for motion estimation, evaluated tens of thousands of times per frame.

- `WelsSampleSad{16x16,16x8,8x16,8x8}_avx2`
- `WelsSampleSadFour{16x16,16x8,8x16,8x8}_avx2`

The 16-wide functions use `vinserti128` to pack two rows into a 256-bit `ymm` register, processing them with a single `vpsadbw`. SadFour variants compute SAD against four reference positions simultaneously, avoiding redundant source loads during diamond search. All loops are fully unrolled with `%rep`.

Files: `satd_sad.asm` (+386 lines), `sad_common.h` (+12), `sample.cpp` (+11)

3. New AVX2 intra prediction functions (5 functions)
- `intra_pred_com.asm`: `vbroadcasti128` + 256-bit stores, 8 stores instead of 16
- `intra_pred_com.asm`: `vpbroadcastb xmm, [mem]` replaces multi-instruction broadcast
- `intra_pred.asm`: `vpbroadcastb ymm` for 32-byte fill after `vpsadbw` sum
- `intra_pred.asm`: processes all 16 pixels per row in one pass via `vpmullw` + `vpaddw` + `vpackuswb` + `vpermq`, versus two 8-pixel passes in SSE2
- `intra_pred.asm`: `vpbroadcastq` fills the entire 64-byte block with 2 stores

Files: `intra_pred_com.asm` (+44), `intra_pred.asm` (+215), `intra_pred_common.h` (+4), `get_intra_predictor.h` (+6), `get_intra_predictor.cpp` (+9)

Compatibility

- `%ifdef HAVE_AVX2` (NASM) / `#if defined(HAVE_AVX2)` (C++)
- `vzeroupper` before every `ret` to prevent SSE-AVX transition penalties

Measured performance
These optimizations were developed for Vauban, an open-source privileged access management bastion that uses OpenH264 for real-time H.264 encoding of RDP desktop sessions streamed to web browsers.
On a FreeBSD 15.0 server with an Intel Xeon E-2246G (Coffee Lake), encoding 1280x720 at 60 FPS with `ScreenContentRealTime`:

The gain comes from both the 13 new functions and unlocking the existing upstream AVX2 code (SATD, DCT, quantization) that was previously excluded.
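
To illustrate the "unlocking" half of that gain, here is a minimal sketch of the decision the build script has to make: emit `-DHAVE_AVX2` for x86-64 targets and nothing elsewhere. This is not the actual `build.rs` change. The helpers `simd_defines` and `as_flags` are hypothetical names invented for this example, and a real build script would pass the resulting flags on to its NASM and C++ compiler invocations.

```rust
/// Hypothetical helper (not from the PR): preprocessor defines to hand to
/// both NASM and the C++ compiler for a given target architecture.
fn simd_defines(target_arch: &str) -> Vec<&'static str> {
    match target_arch {
        // x86-64: enable the code paths guarded by HAVE_AVX2.
        "x86_64" => vec!["HAVE_AVX2"],
        // Other architectures: no x86 SIMD defines.
        _ => Vec::new(),
    }
}

/// Hypothetical helper: render defines as `-DNAME` command-line flags.
fn as_flags(defines: &[&str]) -> Vec<String> {
    defines.iter().map(|d| format!("-D{d}")).collect()
}

fn main() {
    // Cargo sets CARGO_CFG_TARGET_ARCH for build scripts; fall back to the
    // host architecture when this sketch is run as a standalone program.
    let arch = std::env::var("CARGO_CFG_TARGET_ARCH")
        .unwrap_or_else(|_| std::env::consts::ARCH.to_string());
    let flags = as_flags(&simd_defines(&arch));
    println!("{arch}: {flags:?}");
}
```

Keying on the target architecture rather than the host keeps cross-compilation correct. Whether AVX2 is actually used at run time is a separate question, decided by the encoder's own CPU feature detection.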
Upstream PRs
These same optimizations have been submitted to Cisco's upstream OpenH264:
Technical details
For a comprehensive writeup covering H.264 pipeline context, implementation details, AVX2 instruction usage, and testing methodology, see:
Vauban OpenH264 AVX2 Optimizations -- Technical Document
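
For readers who want a feel for the two-rows-per-register SAD idea without reading NASM, here is a rough sketch using Rust's `std::arch` intrinsics. It is not the code in this PR, which is hand-written assembly. The function names `sad16x16`, `sad16x16_avx2`, and `sad16x16_scalar` are hypothetical, and the loop is left rolled for readability instead of being unrolled as described above.

```rust
/// Scalar reference: sum of absolute differences over a 16x16 block.
fn sad16x16_scalar(src: &[u8], src_stride: usize, reference: &[u8], ref_stride: usize) -> u32 {
    let mut sum = 0u32;
    for y in 0..16 {
        for x in 0..16 {
            let a = src[y * src_stride + x];
            let b = reference[y * ref_stride + x];
            sum += u32::from(a.abs_diff(b));
        }
    }
    sum
}

/// AVX2 sketch: two 16-byte rows share one 256-bit register, so a single
/// `vpsadbw` (`_mm256_sad_epu8`) covers two rows at a time.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sad16x16_avx2(src: &[u8], src_stride: usize, reference: &[u8], ref_stride: usize) -> u32 {
    use std::arch::x86_64::*;
    let mut acc = _mm256_setzero_si256();
    for y in (0..16).step_by(2) {
        // Unaligned 128-bit loads of rows y and y + 1 from both blocks.
        let s0 = _mm_loadu_si128(src.as_ptr().add(y * src_stride) as *const __m128i);
        let s1 = _mm_loadu_si128(src.as_ptr().add((y + 1) * src_stride) as *const __m128i);
        let r0 = _mm_loadu_si128(reference.as_ptr().add(y * ref_stride) as *const __m128i);
        let r1 = _mm_loadu_si128(reference.as_ptr().add((y + 1) * ref_stride) as *const __m128i);
        // vinserti128: row y in the low lane, row y + 1 in the high lane.
        let s = _mm256_inserti128_si256::<1>(_mm256_castsi128_si256(s0), s1);
        let r = _mm256_inserti128_si256::<1>(_mm256_castsi128_si256(r0), r1);
        // vpsadbw yields four 64-bit partial sums; accumulate them.
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(s, r));
    }
    // Horizontal reduction of the four partial sums.
    let lo = _mm256_castsi256_si128(acc);
    let hi = _mm256_extracti128_si256::<1>(acc);
    let pair = _mm_add_epi64(lo, hi);
    let total = _mm_add_epi64(pair, _mm_unpackhi_epi64(pair, pair));
    _mm_cvtsi128_si32(total) as u32
}

/// Dispatcher: use AVX2 when the CPU reports it, otherwise the scalar path.
fn sad16x16(src: &[u8], src_stride: usize, reference: &[u8], ref_stride: usize) -> u32 {
    // Bounds checks that make the raw-pointer loads above sound.
    assert!(src.len() >= 15 * src_stride + 16);
    assert!(reference.len() >= 15 * ref_stride + 16);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was detected at run time and bounds were checked.
            return unsafe { sad16x16_avx2(src, src_stride, reference, ref_stride) };
        }
    }
    sad16x16_scalar(src, src_stride, reference, ref_stride)
}

fn main() {
    let stride: usize = 32;
    let src: Vec<u8> = (0..16 * stride).map(|i| (i * 7 % 256) as u8).collect();
    let reference: Vec<u8> = (0..16 * stride).map(|i| (i * 13 % 251) as u8).collect();
    let fast = sad16x16(&src, stride, &reference, stride);
    let slow = sad16x16_scalar(&src, stride, &reference, stride);
    assert_eq!(fast, slow);
    println!("SAD = {fast}");
}
```

The run-time check in `sad16x16` plays the role that CPU-feature dispatch plays in the encoder. The compile-time `HAVE_AVX2` guard only decides whether the AVX2 routine exists in the binary at all.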