lib/arm/cpu_features: prefer PMULL CRC-32 on Linux Neoverse V-class #453

Merged: ebiggers merged 1 commit into ebiggers:master from tfenne:tf/arm-prefer-pmull-neoverse on May 17, 2026

@tfenne tfenne commented May 16, 2026

This is my first attempt at a real PR for libdeflate - I've done my best to follow established patterns in the repo and tailor this PR to fit in with previous work, but I am very open to feedback both on the PR content itself, and any further additional testing/benchmarking/etc. that you would like to see. I should also note that I used AI (Claude) extensively in researching and documenting this change.

Summary

Enables libdeflate's existing 12-way PMULL CRC-32 fold (crc32_arm_pmullx12_crc_eor3 /
crc32_arm_pmullx12_crc) on AWS Graviton 3, Graviton 4, and other Arm Neoverse V-class
server cores running Linux. Until now this code path was gated by
ARM_CPU_FEATURE_PREFER_PMULL, which was only set when compiled for Apple macOS. On
AWS Graviton 4 (Neoverse V2), this raises CRC-32 throughput from ~22 GB/s to ~40 GB/s
(1.80×).

The fast paths themselves are unchanged — this PR is purely a CPU feature detection / dispatch
decision.

Implementation

On Linux arm64, read MIDR_EL1 via the unprivileged sysfs entry
/sys/devices/system/cpu/cpu0/regs/identification/midr_el1 (exposed by the kernel
since Linux 4.7, July 2016) and check the Implementer (Arm Ltd. =
0x41) and PartNum fields. If the PartNum matches the conservative whitelist of
Arm Neoverse V-class cores, set ARM_CPU_FEATURE_PREFER_PMULL. The whitelist is:

| PartNum | Core | Notable SKU |
|---|---|---|
| 0xd40 | Neoverse V1 | AWS Graviton 3 |
| 0xd4f | Neoverse V2 | AWS Graviton 4 |
| 0xd83 | Neoverse V3AE | |
| 0xd84 | Neoverse V3 | |

All four PartNum values were cross-checked against
arch/arm64/include/asm/cputype.h in torvalds/linux.
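The field check described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper name `midr_is_neoverse_v_class` and the macro names are hypothetical, while the bit positions (Implementer in bits [31:24], PartNum in bits [15:4]) follow the architectural MIDR_EL1 layout and the part numbers are the four from the whitelist.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MIDR_EL1 layout: Implementer = bits [31:24], PartNum = bits [15:4]. */
#define MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfffu)

#define IMPLEMENTER_ARM 0x41u /* Arm Ltd. */

/*
 * Hypothetical helper: returns true only if the MIDR value names an
 * Arm Ltd. core whose PartNum is on the Neoverse V-class whitelist.
 */
static bool midr_is_neoverse_v_class(uint64_t midr)
{
    static const uint32_t whitelist[] = {
        0xd40, /* Neoverse V1 */
        0xd4f, /* Neoverse V2 */
        0xd83, /* Neoverse V3AE */
        0xd84, /* Neoverse V3 */
    };
    size_t i;

    if (MIDR_IMPLEMENTER(midr) != IMPLEMENTER_ARM)
        return false;
    for (i = 0; i < sizeof(whitelist) / sizeof(whitelist[0]); i++) {
        if (MIDR_PARTNUM(midr) == whitelist[i])
            return true;
    }
    return false;
}
```

For example, a MIDR value of 0x410fd4f0 (Implementer 0x41, PartNum 0xd4f) would match, while the same PartNum under a different implementer code would not.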

The whole policy is wrapped in a small arm_cpu_prefers_pmull() helper modeled
structurally on allow_512bit_vectors() in lib/x86/cpu_features.c:

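static bool arm_cpu_prefers_pmull(void)
{
#if defined(__APPLE__) && TARGET_OS_OSX
    /* macOS on Apple silicon: decided at compile time. */
    return true;
#elif defined(__linux__) && defined(ARCH_ARM64)
    /* Linux arm64: decided at runtime from the MIDR_EL1 whitelist. */
    if (arm64_cpu_is_neoverse_v_class())
        return true;
#endif
#ifdef TEST_SUPPORT__DO_NOT_USE
    /* Test builds: always prefer PMULL so CI exercises the pmullx12 paths. */
    return true;
#endif
    return false;
}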

Performance

All measurements on AWS c8g.large (Neoverse V2, single dedicated vCPU), with
programs/checksum -t -s 1048576 on a 1 GiB random buffer, 5 runs each.

| Variant | Dispatched function | MB/s | vs. baseline |
|---|---|---|---|
| baseline (master) | crc32_arm_crc_pmullcombine | 21,860 | 1.00× |
| this PR | crc32_arm_pmullx12_crc_eor3 | 39,420 | 1.80× |
| this PR, SHA3 disabled* | crc32_arm_pmullx12_crc | 33,650 | 1.54× |

* Forced via LIBDEFLATE_DISABLE_CPU_FEATURES=sha3 to verify the non-eor3 path is
still well above baseline. This is why we don't require SHA3 in the runtime gate
— the dispatcher in lib/arm/crc32_impl.h picks the eor3 variant when SHA3 is
also present and falls back to pmullx12_crc (still 1.54× over the old default)
otherwise.
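The selection order described in this note can be modeled roughly as below. This is a simplified sketch, not the dispatcher in lib/arm/crc32_impl.h: the `FEAT_*` bit values and the function `choose_crc32_impl` are hypothetical, the real dispatcher has additional variants and build-time conditions, and anything outside the three functions named in the table is collapsed into "other".

```c
#include <stdint.h>

/* Hypothetical feature bits for illustration; not libdeflate's real values. */
#define FEAT_CRC32        (1u << 0)
#define FEAT_PMULL        (1u << 1)
#define FEAT_SHA3         (1u << 2)
#define FEAT_PREFER_PMULL (1u << 3)

/* Returns the name of the CRC-32 routine this simplified model would pick. */
static const char *choose_crc32_impl(uint32_t features)
{
    const uint32_t base = FEAT_CRC32 | FEAT_PMULL;

    if ((features & base) != base)
        return "other"; /* paths not covered by this sketch */

    if (features & FEAT_PREFER_PMULL) {
        /* SHA3 is optional: it only upgrades the fold to the eor3 variant. */
        if (features & FEAT_SHA3)
            return "crc32_arm_pmullx12_crc_eor3";
        return "crc32_arm_pmullx12_crc";
    }
    return "crc32_arm_crc_pmullcombine";
}
```

Under this model, clearing the SHA3 bit on a whitelisted core moves the choice from the eor3 variant to crc32_arm_pmullx12_crc rather than back to the baseline, which matches the third row of the table.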

End-to-end (gzip-format)

On the same machine, with programs/benchmark -6 -g -s 65280 on a 247 MB input file, the end-to-end decompression gain is modest:

| Level | Baseline dec. MB/s | This PR dec. MB/s | Δ |
|---|---|---|---|
| 1 | 709 | 719 | +1.4% |
| 3 | 729 | 739 | +1.4% |
| 6 | 720 | 730 | +1.4% |
| 9 | 702 | 710 | +1.1% |

The input file for this test was a FASTQ file containing DNA sequencing reads from a short-read sequencer; the compression block size mirrors the 64 KiB limit of the BGZF format, which is widely used in genomics (the field I work in).
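A back-of-the-envelope model suggests why the end-to-end gain is so much smaller than the 1.80× checksum speedup. The sketch below assumes that decompression time per byte is simply inflate time plus CRC time, and that the CRC stage runs at the rates from the checksum microbenchmark; both are simplifying assumptions, and `estimated_speedup` is a hypothetical helper, not part of libdeflate.

```c
/*
 * Estimated end-to-end speedup when only the checksum stage gets faster.
 * All rates are in MB/s; times are therefore per megabyte.
 */
static double estimated_speedup(double total_mbps,
                                double crc_old_mbps,
                                double crc_new_mbps)
{
    double total_time = 1.0 / total_mbps;  /* includes the old CRC cost */
    double saved = 1.0 / crc_old_mbps - 1.0 / crc_new_mbps;

    return total_time / (total_time - saved);
}
```

With the level 6 baseline of 720 MB/s and CRC rates of 21,860 and 39,420 MB/s, the CRC accounts for only about 3% of decompression time, and the model predicts a gain of roughly 1.5%, close to the measured +1.4%.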

Testing

  • scripts/run_tests.sh regular passes on the c8g instance, exercising all seven
    LIBDEFLATE_DISABLE_CPU_FEATURES combinations from regular_test (including
    prefer_pmull disabled and re-enabled) and four cross-compatibility gzip /
    gunzip permutations against /bin/gzip.
  • All eight programs/test_* binaries pass on both c8g and macOS arm64 (M2),
    confirming the Apple compile-time path still works.
  • Existing CI (Debian bookworm aarch64 GCC + clang) will exercise the
    pmullx12_crc[_eor3] path through the unconditional TEST_SUPPORT__DO_NOT_USE
    branch in arm_cpu_prefers_pmull(), the same way it has since c1926a4.

Design notes / anticipated questions

Why not just check HWCAP for SHA3+PMULL+CRC32 instead of MIDR? HWCAP advertises
ISA presence, not microarchitecture identity. The PMULL-vs-CRC32 throughput
asymmetry is a microarchitectural property — Cortex-A78 and Neoverse N1 have
PMULL+CRC32 too, but their relative throughput doesn't justify switching dispatch.
This is the same reason lib/x86/cpu_features.c::allow_512bit_vectors() checks
family/model and not just AVX-512 presence.
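To make the contrast concrete, here is a minimal sketch of what an HWCAP-only test can express. The bit positions are those the Linux arm64 UAPI header `<asm/hwcap.h>` assigns to PMULL and CRC32; they are redefined here under hypothetical `EX_` names so the sketch compiles on any target, and `hwcap_has_pmull_and_crc32` is an illustrative helper, not libdeflate code.

```c
#include <stdbool.h>

/* arm64 HWCAP bit positions, renamed to avoid clashing with <asm/hwcap.h>. */
#define EX_HWCAP_PMULL (1UL << 4)
#define EX_HWCAP_CRC32 (1UL << 7)

/* Answers only "are both instruction-set extensions present?" */
static bool hwcap_has_pmull_and_crc32(unsigned long hwcap)
{
    const unsigned long need = EX_HWCAP_PMULL | EX_HWCAP_CRC32;

    return (hwcap & need) == need;
}
```

On arm64 Linux the argument would come from `getauxval(AT_HWCAP)` in `<sys/auxv.h>`. The result is identical for every core that implements both extensions, so it cannot by itself tell a Neoverse V-class core from one where the crc32-instruction path is the better choice.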

Why a whitelist and not a calibration probe? No existing libdeflate code path
uses a runtime calibration probe; allow_512bit_vectors() is the closest
precedent and uses a hardcoded model whitelist. A probe would add startup latency
and a new test surface for negligible benefit on the four known parts.

Why only Neoverse V1/V2/V3/V3AE and not Cortex-X1 etc.? Conservative. Cortex-X
cores are mobile-tuned and the PMULL/CRC32 ratio hasn't been validated there;
Neoverse N1 (Graviton 2) and N2 have narrower PMULL pipes that
don't outperform their CRC32 instructions. The whitelist can be extended in
follow-ups as additional cores are measured.

What if the sysfs file isn't readable? The function returns false on every
error path (file missing, read error, parse failure, unrecognized implementer or
part). The dispatcher then stays on whatever path it would have selected before —
the existing crc32_arm_crc_pmullcombine on Linux/server arm64, or the
appropriate fallback elsewhere. There is no behavior change for any CPU outside
the whitelist or any system where sysfs isn't available.
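The fail-closed behavior described above might look like the following. This is a sketch under assumptions: it uses stdio for brevity (the PR's actual I/O calls are not shown here), assumes the sysfs file holds a single hexadecimal value such as 0x00000000410fd4f0 followed by a newline, and `read_midr_from_file` is a hypothetical name.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse one hex value from the file at 'path'.
 * Every error path returns false and leaves *midr_ret untouched, so a
 * caller that treats false as "not whitelisted" keeps its old default.
 */
static bool read_midr_from_file(const char *path, uint64_t *midr_ret)
{
    char buf[32];
    char *end;
    unsigned long long value;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return false;               /* file missing or unreadable */
    if (fgets(buf, sizeof(buf), f) == NULL) {
        fclose(f);
        return false;               /* read error or empty file */
    }
    fclose(f);

    errno = 0;
    value = strtoull(buf, &end, 16); /* base 16 accepts an optional 0x prefix */
    if (errno != 0 || end == buf)
        return false;               /* overflow or no digits parsed */
    if (*end != '\0' && *end != '\n')
        return false;               /* trailing garbage */

    *midr_ret = (uint64_t)value;
    return true;
}
```

A caller would pass /sys/devices/system/cpu/cpu0/regs/identification/midr_el1 and apply the Implementer and PartNum checks only when the function returns true.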

Why cpu0 only? The kernel ABI notes MIDR_EL1 isn't necessarily uniform across
CPUs on heterogeneous systems, but no Neoverse V-class server SKU has ever shipped
as part of a big.LITTLE configuration, and the HWCAP-based feature checks are
system-safe by construction. For current real-world hardware, cpu0 is
representative.

@ebiggers
Owner

Could you rebase onto the latest master branch? I've fixed several GitHub Actions errors that appeared recently due to compiler, glibc, and vcpkg updates. Thanks.

The crc32_arm_pmullx12_crc[_eor3]() paths substantially outperform the
crc32-instruction path on CPUs whose pmull pipes have more aggregate
throughput than their crc32 unit.  This is currently enabled on Apple
M-series cores; the Arm Neoverse V class (V1 / V2 / V3 / V3AE) has the
same property.

On Linux, identify Neoverse V-class cores at runtime by reading the
PartNum field of MIDR_EL1 from sysfs.  Sysfs read failure of any kind
returns false, leaving the dispatcher on its previous default.

On AWS Graviton 4 (Neoverse V2), CRC-32 throughput as measured by
programs/checksum -t on a 1 GiB random buffer goes from ~22 GB/s to
~40 GB/s.
@tfenne tfenne force-pushed the tf/arm-prefer-pmull-neoverse branch from 21c56a9 to 91a0c9c on May 17, 2026 00:22
@tfenne
Contributor Author

tfenne commented May 17, 2026

Just pushed the rebase on master.

@ebiggers ebiggers merged commit 0f9a240 into ebiggers:master May 17, 2026
47 checks passed
@ebiggers
Owner

Merged, thanks!
