Add AVX2 optimizations for SAD, SadFour, and intra prediction (encoder) by rbenaley · Pull Request #92 · ralfbiedert/openh264-rs

rbenaley · 2026-02-26T21:06:11Z

This PR enables AVX2 support in the OpenH264 encoder by fixing build.rs to pass -DHAVE_AVX2 to NASM and the C++ compiler, and adds 13 new AVX2-optimized assembly functions for SAD and intra prediction.

Problem

The current build.rs in openh264-sys2 does not pass the -DHAVE_AVX2 define to NASM. This means that all AVX2 assembly code in upstream OpenH264 -- including existing implementations for SATD, DCT, quantization, motion compensation, and VAA -- is silently excluded from the build. Only SSE2 code paths are compiled.

Changes

1. Build system fix (openh264-sys2/build.rs)

Pass -DHAVE_AVX2 to both NASM and the C++ compiler on x86-64 targets, matching Cisco's official meson.build behavior. This alone unlocks all existing upstream AVX2 code.
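
As a rough illustration of the arch gate described above, the logic could look like the following sketch. The helper names `avx2_defines` and `avx2_defines_from_env` are hypothetical, and the real build.rs would hand the define to its NASM and C++ builders rather than return a list. Only `CARGO_CFG_TARGET_ARCH` (set by Cargo for build scripts) and the `-DHAVE_AVX2` define come from Cargo and this PR.

```rust
use std::env;

/// Hypothetical helper: extra defines to pass to NASM and the C++ compiler
/// for a given target architecture.
fn avx2_defines(target_arch: &str) -> Vec<&'static str> {
    if target_arch == "x86_64" {
        vec!["-DHAVE_AVX2"]
    } else {
        // Other architectures get no AVX2 define at all.
        Vec::new()
    }
}

/// Hypothetical helper: reads the target arch from the environment Cargo
/// provides to build scripts. Use the target arch, not the host arch,
/// so that cross-compilation behaves correctly.
fn avx2_defines_from_env() -> Vec<&'static str> {
    let arch = env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
    avx2_defines(&arch)
}
```

Keeping the decision in a pure function that takes the arch as a string makes it easy to check outside a build script.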

2. New AVX2 SAD functions (8 functions, codec/common/x86/satd_sad.asm)

The existing SATD functions already have AVX2 implementations in satd_sad.asm, but the corresponding SAD functions were missing. SAD is the primary block comparison metric for motion estimation, evaluated tens of thousands of times per frame.

4 simple SAD: WelsSampleSad{16x16,16x8,8x16,8x8}_avx2
4 SadFour: WelsSampleSadFour{16x16,16x8,8x16,8x8}_avx2

The 16-wide functions use vinserti128 to pack two rows into a 256-bit ymm register, processing them with a single vpsadbw. SadFour variants compute SAD against four reference positions simultaneously, avoiding redundant source loads during diamond search. All loops are fully unrolled with %rep.

Files: satd_sad.asm (+386 lines), sad_common.h (+12), sample.cpp (+11)
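
For readers unfamiliar with the metric, here is a scalar reference sketch of what SAD and SadFour compute. It is not the assembly from this PR. The function names are hypothetical, and the neighbour order (up, down, left, right) is an assumption made for illustration.

```rust
/// Scalar reference: sum of absolute differences over a w x h block.
/// `src` and `refb` start at the top-left pixel of their blocks.
fn sad_block(src: &[u8], src_stride: usize, refb: &[u8], ref_stride: usize,
             w: usize, h: usize) -> u32 {
    let mut sum = 0u32;
    for y in 0..h {
        let s = &src[y * src_stride..y * src_stride + w];
        let r = &refb[y * ref_stride..y * ref_stride + w];
        for (a, b) in s.iter().zip(r) {
            sum += a.abs_diff(*b) as u32;
        }
    }
    sum
}

/// SadFour reference: SAD of one source block against the four neighbours
/// of position `pos` in the reference plane.
/// Assumed order: up, down, left, right. `pos` must leave room for all four.
fn sad_four(src: &[u8], src_stride: usize, ref_plane: &[u8], ref_stride: usize,
            pos: usize, w: usize, h: usize) -> [u32; 4] {
    let offsets = [pos - ref_stride, pos + ref_stride, pos - 1, pos + 1];
    // The source block is the same for all four candidates; a SIMD version
    // can load each source row once and reuse it four times.
    offsets.map(|o| sad_block(src, src_stride, &ref_plane[o..], ref_stride, w, h))
}
```

The scalar version reloads the source block for each candidate. Sharing those loads across the four candidates is the saving the SadFour variants aim for.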

3. New AVX2 intra prediction functions (5 functions)

16x16 Luma V (intra_pred_com.asm): vbroadcasti128 + 256-bit stores, 8 stores instead of 16
16x16 Luma H (intra_pred_com.asm): vpbroadcastb xmm, [mem] replaces multi-instruction broadcast
16x16 Luma DC (intra_pred.asm): vpbroadcastb ymm for 32-byte fill after vpsadbw sum
16x16 Luma Plane (intra_pred.asm): processes all 16 pixels per row in one pass via vpmullw + vpaddw + vpackuswb + vpermq, versus two 8-pixel passes in SSE2
8x8 Chroma V (intra_pred.asm): vpbroadcastq fills the entire 64-byte block with 2 stores

Files: intra_pred_com.asm (+44), intra_pred.asm (+215), intra_pred_common.h (+4), get_intra_predictor.h (+6), get_intra_predictor.cpp (+9)
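
To make the prediction modes concrete, the sketch below gives scalar reference versions of the 16x16 V, H, and DC modes, following the standard H.264 definitions for the case where both top and left neighbours are available. The function names are hypothetical and this is not the code added by the PR. The Plane mode is omitted for brevity.

```rust
/// Vertical: every row is a copy of the 16 pixels above the block.
fn pred16x16_v(top: &[u8; 16]) -> [[u8; 16]; 16] {
    [*top; 16]
}

/// Horizontal: row y is filled with the pixel to the left of that row.
fn pred16x16_h(left: &[u8; 16]) -> [[u8; 16]; 16] {
    (*left).map(|p| [p; 16])
}

/// DC: all 256 pixels take the rounded mean of the 32 neighbouring pixels.
fn pred16x16_dc(top: &[u8; 16], left: &[u8; 16]) -> [[u8; 16]; 16] {
    let sum: u32 = top.iter().chain(left.iter()).map(|&p| p as u32).sum();
    // (sum + 16) >> 5 is a rounded division by 32.
    let dc = ((sum + 16) >> 5) as u8;
    [[dc; 16]; 16]
}
```

Each mode writes the same few values many times, which is why broadcast instructions and wide stores fit these functions well.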

Compatibility

All new code is guarded by %ifdef HAVE_AVX2 (NASM) / #if defined(HAVE_AVX2) (C++)
Runtime CPU detection via CPUID selects AVX2 paths only when the CPU supports them
No change in behavior on processors without AVX2 (SSE2 fallback preserved)
vzeroupper before every ret to prevent SSE-AVX transition penalties
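
The runtime-selection pattern above can be sketched in Rust with `std::arch` intrinsics. This is an illustration under assumptions, not the PR's NASM code or OpenH264's actual dispatch mechanism. The function names are hypothetical, and the AVX2 path mirrors the described row-pair approach (`vinserti128` followed by `vpsadbw`) for a 16x16 block only.

```rust
/// Scalar fallback: 16x16 sum of absolute differences.
fn sad16x16_scalar(src: &[u8], src_stride: usize, refb: &[u8], ref_stride: usize) -> u32 {
    let mut sum = 0u32;
    for y in 0..16 {
        for x in 0..16 {
            sum += src[y * src_stride + x].abs_diff(refb[y * ref_stride + x]) as u32;
        }
    }
    sum
}

/// AVX2 path: two 16-byte rows per 256-bit register, one vpsadbw per row pair.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sad16x16_avx2(src: &[u8], src_stride: usize, refb: &[u8], ref_stride: usize) -> u32 {
    use std::arch::x86_64::*;
    // Bounds check up front so the raw loads below stay inside the slices.
    assert!(src.len() >= 15 * src_stride + 16 && refb.len() >= 15 * ref_stride + 16);
    unsafe {
        let mut acc = _mm256_setzero_si256();
        for y in (0..16).step_by(2) {
            let s0 = _mm_loadu_si128(src.as_ptr().add(y * src_stride) as *const __m128i);
            let s1 = _mm_loadu_si128(src.as_ptr().add((y + 1) * src_stride) as *const __m128i);
            let r0 = _mm_loadu_si128(refb.as_ptr().add(y * ref_stride) as *const __m128i);
            let r1 = _mm_loadu_si128(refb.as_ptr().add((y + 1) * ref_stride) as *const __m128i);
            // Pack two rows into one ymm register (vinserti128).
            let s = _mm256_inserti128_si256::<1>(_mm256_castsi128_si256(s0), s1);
            let r = _mm256_inserti128_si256::<1>(_mm256_castsi128_si256(r0), r1);
            // vpsadbw yields four 64-bit partial sums; accumulate them.
            acc = _mm256_add_epi64(acc, _mm256_sad_epu8(s, r));
        }
        // Reduce the four 64-bit lanes to a single value.
        let sum128 = _mm_add_epi64(_mm256_castsi256_si128(acc),
                                   _mm256_extracti128_si256::<1>(acc));
        let sum64 = _mm_add_epi64(sum128, _mm_unpackhi_epi64(sum128, sum128));
        _mm_cvtsi128_si32(sum64) as u32
    }
}

/// Dispatcher: takes the AVX2 path only when the running CPU reports support.
fn sad16x16(src: &[u8], src_stride: usize, refb: &[u8], ref_stride: usize) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed at runtime just above.
            return unsafe { sad16x16_avx2(src, src_stride, refb, ref_stride) };
        }
    }
    sad16x16_scalar(src, src_stride, refb, ref_stride)
}
```

On a CPU without AVX2, or on a non-x86 target, the dispatcher falls through to the scalar function, so results are identical either way.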

Measured performance

These optimizations were developed for Vauban, an open-source privileged access management bastion that uses OpenH264 for real-time H.264 encoding of RDP desktop sessions streamed to web browsers.

On a FreeBSD 15.0 server with an Intel Xeon E-2246G (Coffee Lake), encoding 1280x720 at 60 FPS with ScreenContentRealTime:

Metric	Before (SSE2 only)	After (SSE2 + AVX2)
CPU per session	~80-100% of one core	~40-50% of one core
Improvement	--	~50% reduction

The gain comes from both the 13 new functions and unlocking the existing upstream AVX2 code (SATD, DCT, quantization) that was previously excluded.

Upstream PRs

These same optimizations have been submitted to Cisco's upstream OpenH264:

cisco/openh264#3933 -- AVX2 SAD functions
cisco/openh264#3934 -- AVX2 intra prediction

Technical details

For a comprehensive writeup covering H.264 pipeline context, implementation details, AVX2 instruction usage, and testing methodology, see:
Vauban OpenH264 AVX2 Optimizations -- Technical Document

Add AVX2 implementations for H.264 encoding performance improvement:

Build system:
- Define HAVE_AVX2 for NASM assembly compilation
- Define HAVE_AVX2 for C++ compilation

SAD (Sum of Absolute Differences) functions:
- WelsSampleSad16x16_avx2, WelsSampleSad16x8_avx2
- WelsSampleSad8x16_avx2, WelsSampleSad8x8_avx2
- WelsSampleSadFour16x16_avx2, WelsSampleSadFour16x8_avx2
- WelsSampleSadFour8x16_avx2, WelsSampleSadFour8x8_avx2

Intra Prediction functions:
- WelsI16x16LumaPredDc_avx2, WelsI16x16LumaPredPlane_avx2
- WelsI16x16LumaPredV_avx2, WelsI16x16LumaPredH_avx2

Performance: ~30-50% CPU reduction for video encoding on x86_64 with AVX2.

Made-with: Cursor

feat(encoder): AVX2 optimizations for SAD and Intra Prediction

ralfbiedert · 2026-03-07T05:17:33Z

Thanks for the PR.

Having faster encoding would be great, however, this looks like it's modifying upstream openh264 files directly. Unfortunately we can't diverge from upstream for maintenance reasons.

The proper way of landing these is getting them merged to upstream master first (you seem to have PRs in flight already), then bumping the pinned commit / SHA here.

About your build.rs, it appears you are unconditionally enabling AVX2. Instead, you probably want to query CARGO_CFG_TARGET_FEATURE or so and set it conditionally.
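
A minimal sketch of such a conditional check, assuming it lives in build.rs. The helper names are hypothetical. `CARGO_CFG_TARGET_ARCH` and `CARGO_CFG_TARGET_FEATURE` are environment variables Cargo sets for build scripts, the latter as a comma-separated list.

```rust
use std::env;

/// Hypothetical helper: true if `name` appears in a comma-separated
/// feature list such as the value of CARGO_CFG_TARGET_FEATURE.
fn has_target_feature(features: &str, name: &str) -> bool {
    features.split(',').any(|f| f.trim() == name)
}

/// Hypothetical helper: decide whether build.rs should define HAVE_AVX2.
fn should_define_have_avx2() -> bool {
    let arch = env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
    let features = env::var("CARGO_CFG_TARGET_FEATURE").unwrap_or_default();
    arch == "x86_64" && has_target_feature(&features, "avx2")
}
```

With a default x86_64 target this list normally does not contain avx2, so the define would only be set when the crate is built with something like `-C target-feature=+avx2`.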

rbenaley added 2 commits February 26, 2026 18:34

Merge pull request #1 from rbenaley/feat/avx2-optimizations

351849b

feat(encoder): AVX2 optimizations for SAD and Intra Prediction

ralfbiedert added the enhancement and upstream labels Mar 7, 2026

rbenaley wants to merge 2 commits into ralfbiedert:master from rbenaley:master
