
Add AVX2 optimizations for SAD, SadFour, and intra prediction (encoder) #92

Open: rbenaley wants to merge 2 commits into ralfbiedert:master from rbenaley:master


@rbenaley

This PR enables AVX2 support in the OpenH264 encoder by fixing build.rs to pass -DHAVE_AVX2 to NASM, and adds 13 new AVX2-optimized assembly functions for SAD and intra prediction.

Problem

The current build.rs in openh264-sys2 does not pass the -DHAVE_AVX2 define to NASM. This means that all AVX2 assembly code in upstream OpenH264 -- including existing implementations for SATD, DCT, quantization, motion compensation, and VAA -- is silently excluded from the build. Only SSE2 code paths are compiled.

Changes

1. Build system fix (openh264-sys2/build.rs)

Pass -DHAVE_AVX2 to both NASM and the C++ compiler on x86-64 targets, matching Cisco's official meson.build behavior. This alone unlocks all existing upstream AVX2 code.

2. New AVX2 SAD functions (8 functions, codec/common/x86/satd_sad.asm)

The existing SATD functions already have AVX2 implementations in satd_sad.asm, but the corresponding SAD functions were missing. SAD is the primary block comparison metric for motion estimation, evaluated tens of thousands of times per frame.

  • 4 simple SAD: WelsSampleSad{16x16,16x8,8x16,8x8}_avx2
  • 4 SadFour: WelsSampleSadFour{16x16,16x8,8x16,8x8}_avx2

The 16-wide functions use vinserti128 to pack two rows into a 256-bit ymm register, processing them with a single vpsadbw. SadFour variants compute SAD against four reference positions simultaneously, avoiding redundant source loads during diamond search. All loops are fully unrolled with %rep.

Files: satd_sad.asm (+386 lines), sad_common.h (+12), sample.cpp (+11)
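As a rough illustration of the SAD and SadFour description above, here is a minimal Rust sketch rather than the PR's NASM code. All function names are hypothetical. The four SadFour positions are assumed to be one row above, one row below, one pixel left, and one pixel right of the current candidate; the description says only "four reference positions". The AVX2 variant mirrors the two-rows-per-register idea using the intrinsics behind `vinserti128` and `vpsadbw`, and falls back to the scalar loop when AVX2 is not detected.

```rust
/// Scalar reference: sum of absolute differences over a `w` x `h` block.
/// Each slice starts at the top-left pixel of its block.
pub fn sad_scalar(src: &[u8], src_stride: usize, reference: &[u8], ref_stride: usize,
                  w: usize, h: usize) -> u32 {
    let mut sum = 0u32;
    for y in 0..h {
        let s = &src[y * src_stride..y * src_stride + w];
        let r = &reference[y * ref_stride..y * ref_stride + w];
        for (a, b) in s.iter().zip(r) {
            sum += a.abs_diff(*b) as u32;
        }
    }
    sum
}

/// SAD against four neighbours of `pos` in the reference plane.
/// Assumed order: above, below, left, right.
pub fn sad_four_scalar(src: &[u8], src_stride: usize, plane: &[u8], ref_stride: usize,
                       pos: usize, w: usize, h: usize) -> [u32; 4] {
    let at = |p: usize| sad_scalar(src, src_stride, &plane[p..], ref_stride, w, h);
    [at(pos - ref_stride), at(pos + ref_stride), at(pos - 1), at(pos + 1)]
}

/// 16x16 SAD with two 16-byte rows packed into each 256-bit register.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sad16x16_avx2(src: &[u8], src_stride: usize,
                        reference: &[u8], ref_stride: usize) -> u32 {
    use std::arch::x86_64::*;
    let mut acc = _mm256_setzero_si256();
    for y in (0..16).step_by(2) {
        let s0 = _mm_loadu_si128(src.as_ptr().add(y * src_stride) as *const __m128i);
        let s1 = _mm_loadu_si128(src.as_ptr().add((y + 1) * src_stride) as *const __m128i);
        let r0 = _mm_loadu_si128(reference.as_ptr().add(y * ref_stride) as *const __m128i);
        let r1 = _mm_loadu_si128(reference.as_ptr().add((y + 1) * ref_stride) as *const __m128i);
        // Row y goes in the low lane, row y+1 in the high lane (vinserti128).
        let s = _mm256_inserti128_si256::<1>(_mm256_castsi128_si256(s0), s1);
        let r = _mm256_inserti128_si256::<1>(_mm256_castsi128_si256(r0), r1);
        // One vpsadbw yields four 64-bit partial sums covering both rows.
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(s, r));
    }
    // Horizontal reduction of the four partial sums.
    let sum = _mm_add_epi64(_mm256_castsi256_si128(acc), _mm256_extracti128_si256::<1>(acc));
    let sum = _mm_add_epi64(sum, _mm_unpackhi_epi64(sum, sum));
    _mm_cvtsi128_si32(sum) as u32
}

/// Safe entry point: checks bounds, then uses AVX2 only if the CPU has it.
pub fn sad16x16(src: &[u8], src_stride: usize, reference: &[u8], ref_stride: usize) -> u32 {
    assert!(src.len() >= 15 * src_stride + 16 && reference.len() >= 15 * ref_stride + 16);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Sound: bounds were checked above and AVX2 was just detected.
            return unsafe { sad16x16_avx2(src, src_stride, reference, ref_stride) };
        }
    }
    sad_scalar(src, src_stride, reference, ref_stride, 16, 16)
}
```

The scalar functions double as a correctness oracle: comparing `sad16x16` against `sad_scalar` on the same buffers is a simple way to check a vectorized path.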

3. New AVX2 intra prediction functions (5 functions)

  • 16x16 Luma V (intra_pred_com.asm): vbroadcasti128 + 256-bit stores, 8 stores instead of 16
  • 16x16 Luma H (intra_pred_com.asm): vpbroadcastb xmm, [mem] replaces multi-instruction broadcast
  • 16x16 Luma DC (intra_pred.asm): vpbroadcastb ymm for 32-byte fill after vpsadbw sum
  • 16x16 Luma Plane (intra_pred.asm): processes all 16 pixels per row in one pass via vpmullw + vpaddw + vpackuswb + vpermq, versus two 8-pixel passes in SSE2
  • 8x8 Chroma V (intra_pred.asm): vpbroadcastq fills the entire 64-byte block with 2 stores

Files: intra_pred_com.asm (+44), intra_pred.asm (+215), intra_pred_common.h (+4), get_intra_predictor.h (+6), get_intra_predictor.cpp (+9)
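For readers less familiar with the prediction modes listed above, the following scalar Rust sketch shows what the V, H, and DC modes compute for a 16x16 luma block. The function names and the array-based interface are hypothetical (the upstream functions operate on pointers and strides), and the DC formula shown covers only the case where both the top and left neighbours are available. The Plane mode is omitted for brevity.

```rust
pub type Block16 = [[u8; 16]; 16];

/// Vertical: every row is a copy of the 16 pixels above the block.
pub fn pred16x16_v(top: &[u8; 16]) -> Block16 {
    [*top; 16]
}

/// Horizontal: each row is filled with the pixel to its left.
pub fn pred16x16_h(left: &[u8; 16]) -> Block16 {
    let mut out = [[0u8; 16]; 16];
    for (row, &l) in out.iter_mut().zip(left) {
        *row = [l; 16];
    }
    out
}

/// DC: the whole block is filled with the rounded mean of the
/// 32 neighbouring pixels, (sum + 16) >> 5.
pub fn pred16x16_dc(top: &[u8; 16], left: &[u8; 16]) -> Block16 {
    let sum: u32 = top.iter().chain(left.iter()).map(|&p| p as u32).sum();
    let dc = ((sum + 16) >> 5) as u8;
    [[dc; 16]; 16]
}
```

Each mode is a broadcast followed by a fill, which is why the AVX2 versions reduce mostly to a `vpbroadcast*` or `vbroadcasti128` plus wide stores.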

Compatibility

  • All new code is guarded by %ifdef HAVE_AVX2 (NASM) / #if defined(HAVE_AVX2) (C++)
  • Runtime CPU detection via CPUID selects AVX2 paths only when the CPU supports them
  • No change in behavior on processors without AVX2 (SSE2 fallback preserved)
  • vzeroupper before every ret to prevent SSE-AVX transition penalties
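The runtime-selection pattern described above can be sketched in Rust as follows. This is an analogy, not the encoder's actual dispatch code: the names are hypothetical, and the "AVX2" kernel here is just the scalar loop compiled with AVX2 enabled, standing in for a hand-written routine. The point is that the choice is made once from a CPU feature check, and callers only ever see a plain function pointer.

```rust
/// Signature shared by every implementation of the kernel.
pub type SadFn = fn(&[u8], &[u8]) -> u32;

/// Portable fallback, always available.
fn sad_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| x.abs_diff(*y) as u32).sum()
}

/// Same loop, but the compiler may use AVX2 instructions in this function.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sad_avx2_impl(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| x.abs_diff(*y) as u32).sum()
}

/// Safe wrapper. Sound only because `select_sad` hands it out
/// after confirming that the CPU supports AVX2.
#[cfg(target_arch = "x86_64")]
fn sad_avx2(a: &[u8], b: &[u8]) -> u32 {
    unsafe { sad_avx2_impl(a, b) }
}

/// Pick the implementation once, e.g. at encoder initialisation.
pub fn select_sad() -> SadFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return sad_avx2;
        }
    }
    sad_scalar
}
```

On a CPU without AVX2, or on a non-x86 target, `select_sad` returns the scalar function, so results are identical either way and only speed differs.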

Measured performance

These optimizations were developed for Vauban, an open-source privileged access management bastion that uses OpenH264 for real-time H.264 encoding of RDP desktop sessions streamed to web browsers.

On a FreeBSD 15.0 server with an Intel Xeon E-2246G (Coffee Lake), encoding 1280x720 at 60 FPS with ScreenContentRealTime:

| Metric | Before (SSE2 only) | After (SSE2 + AVX2) |
|---|---|---|
| CPU per session | ~80-100% of one core | ~40-50% of one core |
| Improvement | -- | ~50% reduction |

The gain comes from both the 13 new functions and unlocking the existing upstream AVX2 code (SATD, DCT, quantization) that was previously excluded.

Upstream PRs

These same optimizations have been submitted to Cisco's upstream OpenH264:

Technical details

For a comprehensive writeup covering H.264 pipeline context, implementation details, AVX2 instruction usage, and testing methodology, see:
Vauban OpenH264 AVX2 Optimizations -- Technical Document

Add AVX2 implementations for H.264 encoding performance improvement:

Build system:
- Define HAVE_AVX2 for NASM assembly compilation
- Define HAVE_AVX2 for C++ compilation

SAD (Sum of Absolute Differences) functions:
- WelsSampleSad16x16_avx2, WelsSampleSad16x8_avx2
- WelsSampleSad8x16_avx2, WelsSampleSad8x8_avx2
- WelsSampleSadFour16x16_avx2, WelsSampleSadFour16x8_avx2
- WelsSampleSadFour8x16_avx2, WelsSampleSadFour8x8_avx2

Intra Prediction functions:
- WelsI16x16LumaPredDc_avx2, WelsI16x16LumaPredPlane_avx2
- WelsI16x16LumaPredV_avx2, WelsI16x16LumaPredH_avx2

Performance: ~30-50% CPU reduction for video encoding on x86_64 with AVX2.
Made-with: Cursor
feat(encoder): AVX2 optimizations for SAD and Intra Prediction
@ralfbiedert
Owner

Thanks for the PR.

Having faster encoding would be great. However, this looks like it modifies upstream openh264 files directly. Unfortunately, we can't diverge from upstream for maintenance reasons.

The proper way of landing these is getting them merged to upstream master first (you seem to have PRs in flight already), then bumping the pinned commit / SHA here.

About your build.rs, it appears you are enabling AVX2 unconditionally. Instead, you probably want to query CARGO_CFG_TARGET_FEATURE (or something similar) and set the define conditionally.
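One way to read that suggestion is sketched below, assuming the build script gates the define on the target features Cargo reports. `CARGO_CFG_TARGET_ARCH` and `CARGO_CFG_TARGET_FEATURE` are environment variables Cargo sets for build scripts, with the latter a comma-separated list. The helper names are hypothetical, and how the resulting flag is handed to NASM and the C++ compiler is left out because it depends on the existing build.rs.

```rust
use std::env;

/// True when the target is x86-64 and the comma-separated feature list
/// (the CARGO_CFG_TARGET_FEATURE format) contains exactly `avx2`.
pub fn has_avx2(target_arch: &str, target_features: &str) -> bool {
    target_arch == "x86_64" && target_features.split(',').any(|f| f.trim() == "avx2")
}

/// Defines to add for the given target: `-DHAVE_AVX2` or nothing.
pub fn avx2_defines(target_arch: &str, target_features: &str) -> Vec<&'static str> {
    if has_avx2(target_arch, target_features) {
        vec!["-DHAVE_AVX2"]
    } else {
        Vec::new()
    }
}

/// Same decision, read from the environment Cargo provides to build.rs.
/// Outside a build script both variables are unset and nothing is added.
pub fn avx2_defines_from_env() -> Vec<&'static str> {
    let arch = env::var("CARGO_CFG_TARGET_ARCH").unwrap_or_default();
    let features = env::var("CARGO_CFG_TARGET_FEATURE").unwrap_or_default();
    avx2_defines(&arch, &features)
}
```

With this gating, the AVX2 assembly is built only when the crate is compiled with AVX2 in its target features (for example via `-C target-feature=+avx2`). Whether that trade-off is preferable to always building the code and relying on runtime detection is a design decision for the maintainers.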

@ralfbiedert added the labels "enhancement" (New feature or request) and "upstream" (Issues related to upstream (OpenH264) code) on Mar 7, 2026.
