Add AVX2 intra prediction for 16x16 luma and 8x8 chroma #3934

Open: rbenaley wants to merge 1 commit into cisco:master from rbenaley:avx2-intra-prediction
@rbenaley

This adds AVX2-optimized intra prediction for the four 16x16 luma modes (Vertical, Horizontal, DC, Plane) and 8x8 chroma vertical prediction.

Key improvements over the SSE2 versions:

  • V/DC: vbroadcasti128 / vpbroadcastb + 256-bit stores write two rows per operation (8 stores instead of 16)
  • H: vpbroadcastb xmm, [mem] replaces a multi-instruction broadcast sequence
  • Plane: processes all 16 pixels per row in a single pass using 256-bit arithmetic (vpmullw + vpaddw + vpackuswb + vpermq), versus two 8-pixel passes in SSE2
  • Chroma V: vpbroadcastq fills the entire 64-byte block with just 2 stores
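
The store and pack patterns listed above can be sketched with C intrinsics. This is only an illustration, not the PR's code, which is written in NASM. The function names are hypothetical, the prediction buffer is assumed to be contiguous (16 bytes per luma row, 8 per chroma row), and the `base`/`b` parameters and the `>> 5` shift in the plane row follow the standard H.264 plane formula rather than anything stated in the PR. The per-function `target("avx2")` attribute is a GCC/Clang feature.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper macro: build only these functions with AVX2 enabled. */
#define AVX2_FN __attribute__((target("avx2")))

/* 16x16 vertical: replicate the 16 pixels above the block into all 16 rows.
 * Assumes pred is contiguous, so one 32-byte store covers two rows. */
AVX2_FN void pred_i16x16_v_avx2(uint8_t *pred, const uint8_t *ref, int stride) {
    __m128i top = _mm_loadu_si128((const __m128i *)(ref - stride));
    __m256i two_rows = _mm256_broadcastsi128_si256(top);      /* vbroadcasti128 */
    for (int i = 0; i < 8; i++)                               /* 8 stores, not 16 */
        _mm256_storeu_si256((__m256i *)(pred + 32 * i), two_rows);
}

/* 8x8 chroma vertical: the 8 pixels above fill a 64-byte block in 2 stores. */
AVX2_FN void pred_chroma8x8_v_avx2(uint8_t *pred, const uint8_t *ref, int stride) {
    __m128i top = _mm_loadl_epi64((const __m128i *)(ref - stride));
    __m256i four_rows = _mm256_broadcastq_epi64(top);         /* vpbroadcastq */
    _mm256_storeu_si256((__m256i *)pred, four_rows);
    _mm256_storeu_si256((__m256i *)(pred + 32), four_rows);
}

/* One plane row: dst[x] = clip8((base + b * x) >> 5) for x = 0..15.
 * base is assumed to already hold the row's value at x = 0, rounding included. */
AVX2_FN void plane_row_avx2(uint8_t *dst, int base, int b) {
    const __m256i x = _mm256_setr_epi16(0, 1, 2, 3, 4, 5, 6, 7,
                                        8, 9, 10, 11, 12, 13, 14, 15);
    __m256i v = _mm256_mullo_epi16(x, _mm256_set1_epi16((short)b));   /* vpmullw */
    v = _mm256_add_epi16(v, _mm256_set1_epi16((short)base));          /* vpaddw */
    v = _mm256_srai_epi16(v, 5);                                      /* arithmetic >> 5 */
    /* vpackuswb saturates to 0..255 but packs within each 128-bit lane,
     * leaving pixels 0..7 in qword 0 and pixels 8..15 in qword 2. */
    __m256i p = _mm256_packus_epi16(v, v);
    p = _mm256_permute4x64_epi64(p, 0x08);   /* vpermq: qwords 0 and 2 -> low 128 bits */
    _mm_storeu_si128((__m128i *)dst, _mm256_castsi256_si128(p));
}
```

The `vpermq` step is needed because the 256-bit pack is not a true 16-element pack: without it, the two halves of the row would sit in different 128-bit lanes and could not be written with a single 16-byte store.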

Files changed:

  • codec/common/x86/intra_pred_com.asm — V, H functions (+44 lines)
  • codec/encoder/core/x86/intra_pred.asm — DC, Plane, Chroma V (+215 lines)
  • codec/common/inc/intra_pred_common.h — declarations (+4 lines)
  • codec/encoder/core/inc/get_intra_predictor.h — declarations (+6 lines)
  • codec/encoder/core/src/get_intra_predictor.cpp — registration (+9 lines)

These optimizations were developed for Vauban, an open-source privileged access management (PAM) bastion that uses OpenH264 for real-time H.264 encoding of RDP desktop sessions streamed to web browsers. Combined with AVX2 SAD functions, enabling full AVX2 support reduced CPU usage per session by approximately 50% on an Intel Xeon E-2246G running FreeBSD.

For a detailed technical writeup covering the encoding pipeline context, implementation choices, and performance measurements, see:
https://github.com/rbenaley/Vauban/blob/main/docs/technical/Vauban_OpenH264_AVX2_Optimizations_EN(1.0).md

Implement AVX2-optimized intra prediction for the four 16x16 luma
modes (V, H, DC, Plane) and 8x8 chroma vertical prediction. V/DC use
vbroadcasti128/vpbroadcastb with 256-bit stores (8 instead of 16). H
uses vpbroadcastb for single-instruction byte broadcast. Plane processes
all 16 pixels per row in one pass via vpmullw+vpaddw+vpackuswb+vpermq.
Chroma V uses vpbroadcastq to fill the 64-byte block with 2 stores.

All code is guarded by %ifdef HAVE_AVX2 / WELS_CPU_AVX2 and selected
at runtime via CPUID detection.
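
As a rough illustration of that runtime selection, the sketch below picks between a portable C routine and an AVX2 routine through a function pointer. It is not OpenH264's registration code: the function and type names are hypothetical, and the GCC/Clang builtin `__builtin_cpu_supports` stands in for the library's own CPUID detection and `WELS_CPU_AVX2` flag.

```c
#include <stdint.h>

/* Hypothetical signature shared by all variants of one predictor. */
typedef void (*pred_fn)(uint8_t *pred, const uint8_t *ref, int stride);

/* Portable fallback: 16x16 vertical prediction into a contiguous buffer. */
static void pred_i16x16_v_c(uint8_t *pred, const uint8_t *ref, int stride) {
    const uint8_t *top = ref - stride;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[16 * y + x] = top[x];
}

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SKETCH_HAVE_AVX2 1   /* compile-time guard, analogous to HAVE_AVX2 */

/* AVX2 variant: one 32-byte store writes two 16-byte rows. */
__attribute__((target("avx2")))
static void pred_i16x16_v_avx2(uint8_t *pred, const uint8_t *ref, int stride) {
    __m128i top = _mm_loadu_si128((const __m128i *)(ref - stride));
    __m256i two_rows = _mm256_broadcastsi128_si256(top);
    for (int i = 0; i < 8; i++)
        _mm256_storeu_si256((__m256i *)(pred + 32 * i), two_rows);
}
#endif

/* Return the fastest variant that the running CPU can execute. */
pred_fn select_pred_i16x16_v(void) {
#ifdef SKETCH_HAVE_AVX2
    __builtin_cpu_init();                    /* harmless if already initialised */
    if (__builtin_cpu_supports("avx2"))
        return pred_i16x16_v_avx2;
#endif
    return pred_i16x16_v_c;
}
```

The selection happens once, at registration time, so the per-block prediction calls pay only for an indirect call and never re-check CPU features.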