feat: add ARM NEON optimization for startcode prefix search #3949

Open
llongint wants to merge 1 commit into cisco:master from llongint:nalOpt

@llongint

SIMD fast filter: load 64 bytes, use ext to line up each byte with its successor, and detect two consecutive 0x00 bytes via an orr + umin + uminv reduction. If the block contains no candidate pair, skip the entire 64-byte block (~97% of blocks are filtered out); if it does, fall back to the precise C scan for 0x000001.
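
The filter described above could be sketched in C roughly as follows. This is an illustrative sketch, not the assembly from this PR: the names `block_has_zero_pair`, `scan_exact`, and `find_start_code`, the return convention (byte index or -1), and the handling of the 65th byte are all assumptions. The NEON intrinsics path is compiled only on AArch64; elsewhere a scalar loop models the same filter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

#define BLOCK ((size_t)64)

/* Returns 1 if some adjacent pair (p[j], p[j+1]), 0 <= j < 64, is 00 00.
 * Reads p[0..64], so the caller must guarantee 65 readable bytes. */
static int block_has_zero_pair(const uint8_t *p) {
#if defined(__aarch64__)
    uint8x16_t v0 = vld1q_u8(p);
    uint8x16_t v1 = vld1q_u8(p + 16);
    uint8x16_t v2 = vld1q_u8(p + 32);
    uint8x16_t v3 = vld1q_u8(p + 48);
    uint8x16_t tail = vdupq_n_u8(p[64]); /* only lane 0 is consumed */
    /* ext yields bytes 1..15 of a vector followed by byte 0 of the next,
     * so lane k of the orr result is p[k] | p[k+1]. */
    uint8x16_t o0 = vorrq_u8(v0, vextq_u8(v0, v1, 1));
    uint8x16_t o1 = vorrq_u8(v1, vextq_u8(v1, v2, 1));
    uint8x16_t o2 = vorrq_u8(v2, vextq_u8(v2, v3, 1));
    uint8x16_t o3 = vorrq_u8(v3, vextq_u8(v3, tail, 1));
    /* a | b is zero only when both bytes are zero; min-reduce to a scalar. */
    uint8x16_t m = vminq_u8(vminq_u8(o0, o1), vminq_u8(o2, o3));
    return vminvq_u8(m) == 0;
#else
    /* Scalar model of the same filter for non-AArch64 builds. */
    for (size_t j = 0; j < BLOCK; ++j)
        if ((p[j] | p[j + 1]) == 0) return 1;
    return 0;
#endif
}

/* Precise scan: first p in [from, to) with buf[p..p+2] == 00 00 01, else -1. */
static ptrdiff_t scan_exact(const uint8_t *buf, size_t len, size_t from, size_t to) {
    for (size_t p = from; p < to && p + 2 < len; ++p)
        if (buf[p] == 0 && buf[p + 1] == 0 && buf[p + 2] == 1) return (ptrdiff_t)p;
    return -1;
}

/* Hypothetical entry point: index of the first 00 00 01, or -1 if absent. */
ptrdiff_t find_start_code(const uint8_t *buf, size_t len) {
    assert(buf != NULL || len == 0);
    size_t i = 0;
    while (i + BLOCK + 1 <= len) {           /* 65 readable bytes needed */
        if (block_has_zero_pair(buf + i)) {  /* candidate: confirm precisely */
            ptrdiff_t hit = scan_exact(buf, len, i, i + BLOCK);
            if (hit >= 0) return hit;
        }
        i += BLOCK;                          /* no 00 00 pair: skip the block */
    }
    return scan_exact(buf, len, i, len);     /* scalar tail */
}
```

Checking the pair that starts at byte 63 against byte 64 is what keeps a start code straddling two blocks from being skipped; how the actual assembly handles that boundary is not stated in the description.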

Performance: C 8268.15 → NEON 501.80 cycles/iter (~16.5x faster)

Refactor startcode detection behind a pfDetectStartCodePrefix function pointer, selecting the C or NEON implementation at runtime based on CPU capabilities.
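
A minimal sketch of this kind of function pointer dispatch, for illustration only. Only the name pfDetectStartCodePrefix comes from this PR; the signature, the `DecoderFuncs` struct, the `CPU_FLAG_NEON` bit, and `init_decoder_funcs` are hypothetical, and `detect_neon` is a stand-in that simply delegates to the C version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed signature: index of the first 00 00 01 in buf, or -1. */
typedef ptrdiff_t (*DetectFn)(const uint8_t *buf, size_t len);

#define CPU_FLAG_NEON 0x1u /* hypothetical capability bit */

/* Reference C implementation. */
static ptrdiff_t detect_c(const uint8_t *buf, size_t len) {
    for (size_t p = 0; p + 2 < len; ++p)
        if (buf[p] == 0 && buf[p + 1] == 0 && buf[p + 2] == 1) return (ptrdiff_t)p;
    return -1;
}

/* Stand-in for the NEON routine; a real build would link the assembly here. */
static ptrdiff_t detect_neon(const uint8_t *buf, size_t len) {
    return detect_c(buf, len);
}

/* Hypothetical per-decoder function table. */
typedef struct {
    DetectFn pfDetectStartCodePrefix;
} DecoderFuncs;

/* Pick the implementation once, at init time, from the detected CPU flags. */
void init_decoder_funcs(DecoderFuncs *f, uint32_t cpu_flags) {
    assert(f != NULL);
    f->pfDetectStartCodePrefix = detect_c;        /* safe default */
    if (cpu_flags & CPU_FLAG_NEON)
        f->pfDetectStartCodePrefix = detect_neon; /* override when available */
}
```

Callers then go through `f.pfDetectStartCodePrefix(buf, len)` without knowing which implementation was selected, which is also what lets unit tests run the same cases against both paths.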

Add comprehensive decoder unit tests for startcode detection.

The assembly is taken directly from compiler-generated output; see https://www.godbolt.org/z/9coKKfGd5
An alternative implementation contributed by Dougall Johnson was also tried (883.83 cycles/iter): https://www.godbolt.org/z/fKrfPEPW4
As was this implementation (744.94 cycles/iter): https://www.godbolt.org/z/jMhzzMWad

llongint commented May 20, 2026

cc @mstorsjo Could you help review this?
