Add AVX2 intra prediction for 16x16 luma and 8x8 chroma by rbenaley · Pull Request #3934 · cisco/openh264

rbenaley · 2026-02-25T15:11:18Z

This adds AVX2-optimized intra prediction for the four 16x16 luma modes (Vertical, Horizontal, DC, Plane) and 8x8 chroma vertical prediction.
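For readers less familiar with these modes, here is a minimal scalar sketch of what the four 16x16 luma predictors compute, written from the H.264 definitions rather than taken from this PR. The function names, the contiguous stride-16 output layout, and the separate `top` / `left` / `tl` arguments are assumptions made for illustration; OpenH264's own signatures differ.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the four 16x16 luma intra modes.
 * pred: 16x16 output, row-major, stride 16
 * top:  the 16 reconstructed pixels directly above the block
 * left: the 16 reconstructed pixels directly to its left
 * tl:   the top-left corner pixel (only Plane uses it) */

static uint8_t clip_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Vertical: every row is a copy of the row above the block. */
void ref_pred16_v(uint8_t *pred, const uint8_t *top) {
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y * 16 + x] = top[x];
}

/* Horizontal: every row is filled with its left neighbour. */
void ref_pred16_h(uint8_t *pred, const uint8_t *left) {
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y * 16 + x] = left[y];
}

/* DC (both neighbours available): rounded mean of the 32 border pixels. */
void ref_pred16_dc(uint8_t *pred, const uint8_t *top, const uint8_t *left) {
    int sum = 16;                               /* rounding term */
    for (int i = 0; i < 16; i++)
        sum += top[i] + left[i];
    uint8_t dc = (uint8_t)(sum >> 5);
    for (int i = 0; i < 256; i++)
        pred[i] = dc;
}

/* Plane: fit a linear gradient to the border, then evaluate and clip it. */
void ref_pred16_plane(uint8_t *pred, const uint8_t *top,
                      const uint8_t *left, uint8_t tl) {
    int H = 0, V = 0;
    for (int i = 0; i < 8; i++) {
        /* index 6 - i runs off the border at i == 7; the corner is used there */
        int t = (i == 7) ? tl : top[6 - i];
        int l = (i == 7) ? tl : left[6 - i];
        H += (i + 1) * (top[8 + i] - t);
        V += (i + 1) * (left[8 + i] - l);
    }
    int a = 16 * (top[15] + left[15]);
    int b = (5 * H + 32) >> 6;      /* assumes arithmetic >> on negative values */
    int c = (5 * V + 32) >> 6;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y * 16 + x] =
                clip_u8((a + b * (x - 7) + c * (y - 7) + 16) >> 5);
}
```

Plane is the only mode with per-pixel arithmetic, which is why it gains the most from wider vectors; the other three reduce to broadcasting a row or a byte and storing it.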

Key improvements over the SSE2 versions:

V/DC: vbroadcasti128 / vpbroadcastb + 256-bit stores write two rows per operation (8 stores instead of 16)
H: vpbroadcastb xmm, [mem] replaces a multi-instruction broadcast sequence
Plane: processes all 16 pixels per row in a single pass using 256-bit arithmetic (vpmullw + vpaddw + vpackuswb + vpermq), versus two 8-pixel passes in SSE2
Chroma V: vpbroadcastq replicates the 8-byte top row across a 256-bit register, so 2 stores of 32 bytes fill the entire 64-byte block
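The store-count reduction for the V and Chroma V cases can be sketched with C intrinsics instead of assembly. This is an illustration under assumptions, not the code in the PR: the function names are hypothetical, the prediction buffers are assumed contiguous (stride 16 for luma, stride 8 for chroma), and unaligned stores are used so the sketch makes no alignment assumption. On non-x86 targets or compilers other than GCC/Clang it falls back to a plain copy.

```c
#include <stdint.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define SKETCH_X86_AVX2 1

/* 16x16 luma V: duplicate the 16-byte top row into both 128-bit lanes
 * (compiles to vbroadcasti128 or an equivalent), then write two rows
 * per 32-byte store: 8 stores instead of 16. */
__attribute__((target("avx2")))
static void sketch_pred16_v_avx2(uint8_t *pred, const uint8_t *top) {
    __m128i row = _mm_loadu_si128((const __m128i *)top);
    __m256i two_rows = _mm256_broadcastsi128_si256(row);
    for (int i = 0; i < 8; i++)
        _mm256_storeu_si256((__m256i *)(pred + 32 * i), two_rows);
}

/* 8x8 chroma V: replicate the 8-byte top row four times (vpbroadcastq),
 * so each 32-byte store covers four rows and two stores cover the block. */
__attribute__((target("avx2")))
static void sketch_pred8_chroma_v_avx2(uint8_t *pred, const uint8_t *top) {
    __m128i row = _mm_loadl_epi64((const __m128i *)top);
    __m256i four_rows = _mm256_broadcastq_epi64(row);
    _mm256_storeu_si256((__m256i *)pred, four_rows);
    _mm256_storeu_si256((__m256i *)(pred + 32), four_rows);
}
#endif

/* Public entry points: use AVX2 when the CPU reports it, else copy rows. */
void sketch_pred16_v(uint8_t *pred, const uint8_t *top) {
#ifdef SKETCH_X86_AVX2
    if (__builtin_cpu_supports("avx2")) {
        sketch_pred16_v_avx2(pred, top);
        return;
    }
#endif
    for (int y = 0; y < 16; y++)
        memcpy(pred + 16 * y, top, 16);
}

void sketch_pred8_chroma_v(uint8_t *pred, const uint8_t *top) {
#ifdef SKETCH_X86_AVX2
    if (__builtin_cpu_supports("avx2")) {
        sketch_pred8_chroma_v_avx2(pred, top);
        return;
    }
#endif
    for (int y = 0; y < 8; y++)
        memcpy(pred + 8 * y, top, 8);
}
```

DC follows the same store pattern with a single byte broadcast across the register in place of the row broadcast.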

Files changed:

codec/common/x86/intra_pred_com.asm — V, H functions (+44 lines)
codec/encoder/core/x86/intra_pred.asm — DC, Plane, Chroma V (+215 lines)
codec/common/inc/intra_pred_common.h — declarations (+4 lines)
codec/encoder/core/inc/get_intra_predictor.h — declarations (+6 lines)
codec/encoder/core/src/get_intra_predictor.cpp — registration (+9 lines)

These optimizations were developed for Vauban, an open-source privileged access management (PAM) bastion that uses OpenH264 for real-time H.264 encoding of RDP desktop sessions streamed to web browsers. Combined with AVX2 SAD functions, enabling full AVX2 support reduced CPU usage per session by approximately 50% on an Intel Xeon E-2246G running FreeBSD.

For a detailed technical writeup covering the encoding pipeline context, implementation choices, and performance measurements, see:
https://github.com/rbenaley/Vauban/blob/main/docs/technical/Vauban_OpenH264_AVX2_Optimizations_EN(1.0).md

Implement AVX2-optimized intra prediction for the four 16x16 luma modes (V, H, DC, Plane) and 8x8 chroma vertical prediction. V/DC use vbroadcasti128/vpbroadcastb with 256-bit stores (8 instead of 16). H uses vpbroadcastb for single-instruction byte broadcast. Plane processes all 16 pixels per row in one pass via vpmullw+vpaddw+vpackuswb+vpermq. Chroma V uses vpbroadcastq to fill the 64-byte block with 2 stores. All code is guarded by %ifdef HAVE_AVX2 / WELS_CPU_AVX2 and selected at runtime via CPUID detection.
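The two-level guard described in the last sentence (a compile-time macro plus a runtime CPU flag) can be sketched as follows. Everything named `sketch_*` or `SKETCH_*` is hypothetical, the flag bit merely stands in for WELS_CPU_AVX2, and the "AVX2" routine is a plain C stand-in so the example runs anywhere; the actual registration code in get_intra_predictor.cpp is not reproduced here.

```c
#include <stdint.h>
#include <string.h>

/* Normally supplied by the build system; defined here only so the
 * sketch compiles on its own. */
#define HAVE_AVX2 1

/* Hypothetical flag bit standing in for WELS_CPU_AVX2. */
#define SKETCH_CPU_AVX2 (1u << 0)

typedef void (*sketch_pred16_fn)(uint8_t *pred, const uint8_t *top);

typedef struct {
    sketch_pred16_fn pred16_v;
} sketch_pred_funcs;

/* Portable default. */
static void sketch_pred16_v_c(uint8_t *pred, const uint8_t *top) {
    for (int y = 0; y < 16; y++)
        memcpy(pred + 16 * y, top, 16);
}

#ifdef HAVE_AVX2
/* Stand-in for the assembly routine: same result, two rows per iteration. */
static void sketch_pred16_v_avx2(uint8_t *pred, const uint8_t *top) {
    for (int i = 0; i < 8; i++) {
        memcpy(pred + 32 * i, top, 16);
        memcpy(pred + 32 * i + 16, top, 16);
    }
}
#endif

/* Runtime detection: returns a flag word describing the running CPU. */
uint32_t sketch_detect_cpu(void) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2"))
        return SKETCH_CPU_AVX2;
#endif
    return 0;
}

/* Registration: start from the portable routine, then override it only
 * when the faster one was compiled in AND the CPU reports support. */
void sketch_init_pred_funcs(sketch_pred_funcs *f, uint32_t cpu_flags) {
    f->pred16_v = sketch_pred16_v_c;
#ifdef HAVE_AVX2
    if (cpu_flags & SKETCH_CPU_AVX2)
        f->pred16_v = sketch_pred16_v_avx2;
#endif
}
```

A caller would run `sketch_init_pred_funcs(&funcs, sketch_detect_cpu())` once at start-up and then call through `funcs.pred16_v`, so the hot path carries no per-call feature check.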

rbenaley mentioned this pull request on Feb 26, 2026: "Add AVX2 optimizations for SAD, SadFour, and intra prediction (encoder)", ralfbiedert/openh264-rs#92 (Open)

rbenaley wants to merge 1 commit into cisco:master from rbenaley:avx2-intra-prediction
