lib/arm/cpu_features: prefer PMULL CRC-32 on Linux Neoverse V-class by tfenne · Pull Request #453 · ebiggers/libdeflate

tfenne · 2026-05-16T17:20:29Z

This is my first attempt at a real PR for libdeflate - I've done my best to follow established patterns in the repo and tailor this PR to fit in with previous work, but I am very open to feedback both on the PR content itself, and any further additional testing/benchmarking/etc. that you would like to see. I should also note that I used AI (Claude) extensively in researching and documenting this change.

Summary

Enables libdeflate's existing 12-way PMULL CRC-32 fold (crc32_arm_pmullx12_crc_eor3 /
crc32_arm_pmullx12_crc) on AWS Graviton 3, Graviton 4, and other Arm Neoverse V-class
server cores running Linux. Until now this code path was gated by
ARM_CPU_FEATURE_PREFER_PMULL, which was only set when compiled for Apple macOS. On
AWS Graviton 4 (Neoverse V2) this raises CRC-32 throughput from ~22 GB/s to ~40 GB/s
(1.80×).

The fast paths themselves are unchanged — this PR is purely a CPU feature detection / dispatch
decision.

Implementation

On Linux arm64, read MIDR_EL1 via the unprivileged sysfs entry
/sys/devices/system/cpu/cpu0/regs/identification/midr_el1 (exposed by the kernel
since Linux 4.7, July 2016) and check the Implementer (Arm Ltd. =
0x41) and PartNum fields. If the PartNum matches the conservative whitelist of
Arm Neoverse V-class cores, set ARM_CPU_FEATURE_PREFER_PMULL. The whitelist is:

PartNum	Core	Notable SKU
`0xd40`	Neoverse V1	AWS Graviton 3
`0xd4f`	Neoverse V2	AWS Graviton 4
`0xd83`	Neoverse V3AE	—
`0xd84`	Neoverse V3	—

All four PartNum values were cross-checked against
arch/arm64/include/asm/cputype.h in torvalds/linux.
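
For reference, the field check described above can be sketched as below. This is a minimal illustration, not the code in the PR: the macro names and midr_is_neoverse_v_class() are hypothetical. The bit positions (Implementer in bits [31:24], PartNum in bits [15:4]) follow the architectural MIDR_EL1 layout, and the part numbers are the four from the table.

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 layout: Implementer is bits [31:24], PartNum is bits [15:4]. */
#define MIDR_IMPLEMENTER(midr)  ((uint32_t)(((midr) >> 24) & 0xff))
#define MIDR_PARTNUM(midr)      ((uint32_t)(((midr) >> 4) & 0xfff))

#define MIDR_IMPLEMENTER_ARM    0x41    /* Arm Ltd. */

/*
 * Hypothetical helper: returns true only if the MIDR value names one of
 * the whitelisted Neoverse V-class parts.  Anything else returns false.
 */
static bool midr_is_neoverse_v_class(uint64_t midr)
{
    if (MIDR_IMPLEMENTER(midr) != MIDR_IMPLEMENTER_ARM)
        return false;

    switch (MIDR_PARTNUM(midr)) {
    case 0xd40: /* Neoverse V1 */
    case 0xd4f: /* Neoverse V2 */
    case 0xd83: /* Neoverse V3AE */
    case 0xd84: /* Neoverse V3 */
        return true;
    default:
        return false;
    }
}
```

With this layout, a MIDR value of 0x410fd4f0 (implementer 0x41, part 0xd4f) matches, while a Neoverse N1 value such as 0x413fd0c1 (part 0xd0c) does not.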

The whole policy is wrapped in a small arm_cpu_prefers_pmull() helper modeled
structurally on allow_512bit_vectors() in lib/x86/cpu_features.c:

static bool arm_cpu_prefers_pmull(void)
{
#if defined(__APPLE__) && TARGET_OS_OSX
    return true;
#elif defined(__linux__) && defined(ARCH_ARM64)
    if (arm64_cpu_is_neoverse_v_class())
        return true;
#endif
#ifdef TEST_SUPPORT__DO_NOT_USE
    return true;
#endif
    return false;
}

Performance

All measurements on AWS c8g.large (Neoverse V2, single dedicated vCPU), with
programs/checksum -t -s 1048576 on a 1 GiB random buffer, 5 runs each.

variant	dispatched function	MB/s	vs baseline
baseline (master)	`crc32_arm_crc_pmullcombine`	21,860	1.00×
this PR	`crc32_arm_pmullx12_crc_eor3`	39,420	1.80×
this PR, SHA3 disabled*	`crc32_arm_pmullx12_crc`	33,650	1.54×

* Forced via LIBDEFLATE_DISABLE_CPU_FEATURES=sha3 to verify the non-eor3 path is
still well above baseline. This is why we don't require SHA3 in the runtime gate
— the dispatcher in lib/arm/crc32_impl.h picks the eor3 variant when SHA3 is
also present and falls back to pmullx12_crc (still 1.54× over the old default)
otherwise.
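
The selection described in the footnote can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in lib/arm/crc32_impl.h: the FEAT_* bits, the enum, and select_crc32_impl() are hypothetical names, and the real dispatcher covers more implementations and compile-time conditions than shown here.

```c
#include <stdint.h>

/* Hypothetical feature bits; libdeflate's actual flag values may differ. */
#define FEAT_CRC32         (1u << 0)
#define FEAT_PMULL         (1u << 1)
#define FEAT_SHA3          (1u << 2)
#define FEAT_PREFER_PMULL  (1u << 3)

/* Hypothetical enum whose members mirror the function names in the table. */
enum crc32_impl {
    IMPL_OTHER,                 /* any path not covered by this sketch */
    IMPL_CRC_PMULLCOMBINE,      /* crc32_arm_crc_pmullcombine (old default) */
    IMPL_PMULLX12_CRC,          /* crc32_arm_pmullx12_crc */
    IMPL_PMULLX12_CRC_EOR3,     /* crc32_arm_pmullx12_crc_eor3 */
};

static enum crc32_impl select_crc32_impl(uint32_t features)
{
    /* All three paths in the table need both CRC32 and PMULL. */
    if (!(features & FEAT_CRC32) || !(features & FEAT_PMULL))
        return IMPL_OTHER;

    /* Without the preference bit, the previous default is kept. */
    if (!(features & FEAT_PREFER_PMULL))
        return IMPL_CRC_PMULLCOMBINE;

    /* SHA3 is optional: it only upgrades the 12-way fold to the eor3 variant. */
    if (features & FEAT_SHA3)
        return IMPL_PMULLX12_CRC_EOR3;
    return IMPL_PMULLX12_CRC;
}
```

The point of the structure is that SHA3 is consulted only after the preference bit, so disabling SHA3 changes which 12-way variant runs but never drops back to the old default.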

End-to-end (gzip-format)

On the same machine, running programs/benchmark -6 -g -s 65280 on a 247 MB input file, the end-to-end decompression gain is modest:

level	baseline dec MB/s	this PR dec MB/s	Δ
1	709	719	+1.4%
3	729	739	+1.4%
6	720	730	+1.4%
9	702	710	+1.1%

The input file for this test was a FASTQ file containing DNA sequencing reads from a short-read sequencer; the compression block size mirrors the 64 KiB limit used by the BGZF format, which is widely used in genomics (the field I work in).
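
The small end-to-end gain is roughly what a simple serial model predicts. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes inflate and CRC-32 run back to back over the same bytes, and that the standalone checksum throughput carries over to gzip decompression. The function name predict_total_mbps() is hypothetical.

```c
/*
 * Serial model: time per MB = inflate time + CRC-32 time.
 * Given the measured end-to-end throughput and the CRC-32 throughput
 * before the change, estimate end-to-end throughput after swapping in
 * a faster CRC-32.  All arguments and the result are in MB/s.
 */
static double predict_total_mbps(double base_total_mbps,
                                 double base_crc_mbps,
                                 double new_crc_mbps)
{
    /* Time spent in everything except CRC-32, per MB. */
    double inflate_time = 1.0 / base_total_mbps - 1.0 / base_crc_mbps;

    /* Add back the CRC-32 time at the new throughput. */
    return 1.0 / (inflate_time + 1.0 / new_crc_mbps);
}
```

Plugging in the level 6 row with the checksum numbers from the Performance table, predict_total_mbps(720, 21860, 39420) comes out at about 731 MB/s, in line with the measured 730 MB/s. Under this model CRC-32 accounts for only about 3% of baseline decompression time, which caps the achievable gain.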

Testing

scripts/run_tests.sh regular passes on the c8g instance, exercising all seven
LIBDEFLATE_DISABLE_CPU_FEATURES combinations from regular_test (including
prefer_pmull disabled and re-enabled) and four cross-compatibility gzip /
gunzip permutations against /bin/gzip.
All eight programs/test_* binaries pass on both c8g and on macOS arm64 (M2),
confirming the Apple compile-time path still works.
Existing CI (Debian bookworm aarch64 GCC + clang) will exercise the
pmullx12_crc[_eor3] path through the unconditional TEST_SUPPORT__DO_NOT_USE
branch in arm_cpu_prefers_pmull(), the same way it has since c1926a4.

Design notes / anticipated questions

Why not just check HWCAP for SHA3+PMULL+CRC32 instead of MIDR? HWCAP advertises
ISA presence, not microarchitecture identity. The PMULL-vs-CRC32 throughput
asymmetry is a microarchitectural property — Cortex-A78 and Neoverse N1 have
PMULL+CRC32 too, but their relative throughput doesn't justify switching dispatch.
This is the same reason lib/x86/cpu_features.c::allow_512bit_vectors() checks
family/model and not just AVX-512 presence.

Why a whitelist and not a calibration probe? No existing libdeflate code path
uses a runtime calibration probe; allow_512bit_vectors() is the closest
precedent and uses a hardcoded model whitelist. A probe would add startup latency
and a new test surface for negligible benefit on the four known parts.

Why only Neoverse V1/V2/V3/V3AE and not Cortex-X1 etc.? Conservative. Cortex-X
cores are mobile-tuned and the PMULL/CRC32 ratio hasn't been validated there;
Neoverse N1 (Graviton 2) and N2 have narrower PMULL pipes that
don't outperform their CRC32 instructions. The whitelist can be extended in
follow-ups as additional cores are measured.

What if the sysfs file isn't readable? The function returns false on every
error path (file missing, read error, parse failure, unrecognized implementer or
part). The dispatcher then stays on whatever path it would have selected before —
the existing crc32_arm_crc_pmullcombine on Linux/server arm64, or the
appropriate fallback elsewhere. There is no behavior change for any CPU outside
the whitelist or any system where sysfs isn't available.
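
The fail-closed behavior can be sketched as below. This is an illustration under stated assumptions rather than the PR's code: read_midr_from_file() is a hypothetical name, it uses stdio for brevity (the actual change may read the file differently), and it assumes the sysfs entry holds a single hexadecimal value with an optional 0x prefix.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one hexadecimal value from the file at 'path'.
 * Returns false, leaving *midr_ret untouched, on every error path.
 */
static bool read_midr_from_file(const char *path, uint64_t *midr_ret)
{
    char buf[32];
    char *end;
    unsigned long long value;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return false;               /* file missing or not readable */

    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return false;               /* read error or empty file */
    }
    fclose(f);

    errno = 0;
    value = strtoull(buf, &end, 16);    /* base 16 accepts a leading "0x" */
    if (end == buf || errno != 0)
        return false;               /* no digits parsed, or out of range */
    if (*end != '\0' && *end != '\n')
        return false;               /* unexpected trailing characters */

    *midr_ret = value;
    return true;
}
```

A caller would pass /sys/devices/system/cpu/cpu0/regs/identification/midr_el1 and treat a false return exactly like a CPU that is not on the whitelist.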

Why cpu0 only? The kernel ABI notes that MIDR_EL1 isn't necessarily uniform across
CPUs on heterogeneous systems, but no Neoverse V-class server SKU has ever shipped
as part of a big.LITTLE configuration, and the HWCAP-based feature checks are
safe system-wide by construction, since the kernel only advertises features that
every CPU supports. For current real-world hardware, cpu0 is representative.

ebiggers · 2026-05-16T20:21:31Z

Could you rebase onto the latest master branch? I've fixed several GitHub Actions errors that appeared recently due to compiler, glibc, and vcpkg updates. Thanks.

The crc32_arm_pmullx12_crc[_eor3]() paths substantially outperform the crc32-instruction path on CPUs whose pmull pipes have more aggregate throughput than their crc32 unit. This is currently enabled on Apple M-series cores; the Arm Neoverse V class (V1 / V2 / V3 / V3AE) has the same property. On Linux, identify Neoverse V-class cores at runtime by reading the PartNum field of MIDR_EL1 from sysfs. Sysfs read failure of any kind returns false, leaving the dispatcher on its previous default. On AWS Graviton 4 (Neoverse V2), CRC-32 throughput as measured by programs/checksum -t on a 1 GiB random buffer goes from ~22 GB/s to ~40 GB/s.

tfenne · 2026-05-17T00:23:25Z

Just pushed the rebase on master.

ebiggers · 2026-05-17T03:34:18Z

Merged, thanks!

tfenne force-pushed the tf/arm-prefer-pmull-neoverse branch from 21c56a9 to 91a0c9c on May 17, 2026 00:22

ebiggers merged commit 0f9a240 into ebiggers:master May 17, 2026
47 checks passed

ebiggers merged 1 commit into ebiggers:master from tfenne:tf/arm-prefer-pmull-neoverse
