lib/arm/cpu_features: prefer PMULL CRC-32 on Linux Neoverse V-class #453
Merged
Owner
|
Could you rebase onto the latest master branch? I've fixed several GitHub Actions errors that appeared recently due to compiler, glibc, and vcpkg updates. Thanks. |
The crc32_arm_pmullx12_crc[_eor3]() paths substantially outperform the crc32-instruction path on CPUs whose pmull pipes have more aggregate throughput than their crc32 unit. This is currently enabled on Apple M-series cores; the Arm Neoverse V class (V1 / V2 / V3 / V3AE) has the same property. On Linux, identify Neoverse V-class cores at runtime by reading the PartNum field of MIDR_EL1 from sysfs. Sysfs read failure of any kind returns false, leaving the dispatcher on its previous default. On AWS Graviton 4 (Neoverse V2), CRC-32 throughput as measured by programs/checksum -t on a 1 GiB random buffer goes from ~22 GB/s to ~40 GB/s.
21c56a9 to
91a0c9c
Compare
Contributor
Author
|
Just pushed the rebase on master. |
Owner
|
Merged, thanks! |
This is my first attempt at a real PR for libdeflate. I've done my best to follow established patterns in the repo and tailor this PR to fit in with previous work, but I am very open to feedback both on the PR content itself, and any further additional testing/benchmarking/etc. that you would like to see. I should also note that I used AI (Claude) extensively in researching and documenting this change.
Summary
Enables libdeflate's existing 12-way PMULL CRC-32 fold (crc32_arm_pmullx12_crc_eor3 / crc32_arm_pmullx12_crc) on AWS Graviton 3, Graviton 4, and other Arm Neoverse V-class server cores running Linux. Until now this code path was gated by ARM_CPU_FEATURE_PREFER_PMULL, which was only set when compiled for Apple macOS. On AWS Graviton 4 (Neoverse V2) this raises CRC-32 throughput from ~22 GB/s to ~40 GB/s (1.80×).
The fast paths themselves are unchanged — this PR is purely a CPU feature detection / dispatch
decision.
Implementation
On Linux arm64, read MIDR_EL1 via the unprivileged sysfs entry /sys/devices/system/cpu/cpu0/regs/identification/midr_el1 (exposed by the kernel since Linux 4.7, July 2016) and check the Implementer (Arm Ltd. = 0x41) and PartNum fields. If the PartNum matches the conservative whitelist of Arm Neoverse V-class cores, set ARM_CPU_FEATURE_PREFER_PMULL. The whitelist is:
0xd40
0xd4f
0xd83
0xd84
All four PartNum values were cross-checked against arch/arm64/include/asm/cputype.h in torvalds/linux.
The whole policy is wrapped in a small arm_cpu_prefers_pmull() helper modeled structurally on allow_512bit_vectors() in lib/x86/cpu_features.c.
Performance
All measurements on AWS c8g.large (Neoverse V2, single dedicated vCPU), with programs/checksum -t -s 1048576 on a 1 GiB random buffer, 5 runs each.
Implementations measured: crc32_arm_crc_pmullcombine, crc32_arm_pmullx12_crc_eor3, crc32_arm_pmullx12_crc*
* Forced via LIBDEFLATE_DISABLE_CPU_FEATURES=sha3 to verify the non-eor3 path is still well above baseline. This is why we don't require SHA3 in the runtime gate: the dispatcher in lib/arm/crc32_impl.h picks the eor3 variant when SHA3 is also present and falls back to pmullx12_crc (still 1.54× over the old default) otherwise.
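The selection order described in the footnote can be sketched roughly as follows. This is only an illustration: the SKETCH_FEATURE_* bits and sketch_choose_crc32_impl() are hypothetical names made up for this example, the "other" result stands in for whatever non-PMULL/CRC32 fallback would apply, and the real logic lives in lib/arm/crc32_impl.h and may differ in detail.

```c
#include <stdint.h>

/* Hypothetical feature bits for this sketch only; they are NOT
 * libdeflate's real ARM_CPU_FEATURE_* values. */
#define SKETCH_FEATURE_PMULL        (1u << 0)
#define SKETCH_FEATURE_CRC32        (1u << 1)
#define SKETCH_FEATURE_SHA3         (1u << 2)
#define SKETCH_FEATURE_PREFER_PMULL (1u << 3)

/* Returns the name of the CRC-32 implementation this sketch would pick. */
static const char *sketch_choose_crc32_impl(uint32_t features)
{
	const uint32_t base = SKETCH_FEATURE_PMULL | SKETCH_FEATURE_CRC32;

	/* Without both PMULL and CRC32, none of the paths discussed apply. */
	if ((features & base) != base)
		return "other";
	/* No preference bit: stay on the previous default. */
	if (!(features & SKETCH_FEATURE_PREFER_PMULL))
		return "crc32_arm_crc_pmullcombine";
	/* Preference bit set: use eor3 when SHA3 is present, else plain x12. */
	if (features & SKETCH_FEATURE_SHA3)
		return "crc32_arm_pmullx12_crc_eor3";
	return "crc32_arm_pmullx12_crc";
}
```

In this sketch SHA3 only refines the choice once the preference bit is set, which mirrors the point above that SHA3 is not part of the runtime gate.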
End-to-end (gzip-format)
On the same machine with programs/benchmark -6 -g -s 65280 on a 247 MB input file, the end-to-end decompress gain is modest.
The input file for this test was a FASTQ file containing DNA sequencing reads from a short-read sequencer; the block size for compression mirrors the 64 KiB limit of the BGZF format widely used in genomics (the field I work in).
Testing
scripts/run_tests.sh regular passes on the c8g instance, exercising all seven LIBDEFLATE_DISABLE_CPU_FEATURES combinations from regular_test (including prefer_pmull disabled and re-enabled) and four cross-compatibility gzip / gunzip permutations against /bin/gzip.
programs/test_* binaries pass on both c8g and on macOS arm64 (M2), confirming the Apple compile-time path still works.
pmullx12_crc[_eor3] path through the unconditional TEST_SUPPORT__DO_NOT_USE branch in arm_cpu_prefers_pmull(), the same way it has since c1926a4.
Design notes / anticipated questions
Why not just check HWCAP for SHA3+PMULL+CRC32 instead of MIDR? HWCAP advertises
ISA presence, not microarchitecture identity. The PMULL-vs-CRC32 throughput
asymmetry is a microarchitectural property — Cortex-A78 and Neoverse N1 have
PMULL+CRC32 too, but their relative throughput doesn't justify switching dispatch.
This is the same reason lib/x86/cpu_features.c::allow_512bit_vectors() checks
family/model and not just AVX-512 presence.
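To make the "identity, not ISA presence" distinction concrete, here is a minimal sketch of decoding the two MIDR_EL1 fields the PR checks. The field positions (Implementer in bits [31:24], PartNum in bits [15:4]) follow the architectural layout of MIDR_EL1; the SKETCH_* macros and sketch_is_whitelisted_part() are hypothetical names for illustration and are not the PR's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 layout: Implementer [31:24], Variant [23:20],
 * Architecture [19:16], PartNum [15:4], Revision [3:0]. */
#define SKETCH_MIDR_IMPLEMENTER(midr) (((midr) >> 24) & 0xff)
#define SKETCH_MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfff)

#define SKETCH_IMPLEMENTER_ARM 0x41 /* Arm Ltd. */

/* Hypothetical helper: true only for Arm Ltd. parts whose PartNum is one
 * of the four values listed in the PR's whitelist. */
static bool sketch_is_whitelisted_part(uint64_t midr)
{
	if (SKETCH_MIDR_IMPLEMENTER(midr) != SKETCH_IMPLEMENTER_ARM)
		return false;
	switch (SKETCH_MIDR_PARTNUM(midr)) {
	case 0xd40:
	case 0xd4f:
	case 0xd83:
	case 0xd84:
		return true;
	default:
		return false;
	}
}
```

Variant and Revision are deliberately ignored here, so any stepping of a whitelisted part matches.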
Why a whitelist and not a calibration probe? No existing libdeflate code path
uses a runtime calibration probe; allow_512bit_vectors() is the closest
precedent and uses a hardcoded model whitelist. A probe would add startup latency
and a new test surface for negligible benefit on the four known parts.
Why only Neoverse V1/V2/V3/V3AE and not Cortex-X1 etc.? Conservative. Cortex-X
cores are mobile-tuned and the PMULL/CRC32 ratio hasn't been validated there;
Neoverse N1 (Graviton 2) and N2 have narrower PMULL pipes that
don't outperform their CRC32 instructions. The whitelist can be extended in
follow-ups as additional cores are measured.
What if the sysfs file isn't readable? The function returns false on every
error path (file missing, read error, parse failure, unrecognized implementer or
part). The dispatcher then stays on whatever path it would have selected before:
the existing crc32_arm_crc_pmullcombine on Linux/server arm64, or the
appropriate fallback elsewhere. There is no behavior change for any CPU outside
the whitelist or any system where sysfs isn't available.
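A fail-closed read of this kind might look roughly like the sketch below. It assumes the sysfs file holds a single hexadecimal value on one line, and it uses stdio for brevity; sketch_read_hex_u64() is a hypothetical helper, and the PR's actual code may read and parse the file differently.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse one hexadecimal value from a sysfs-style
 * text file. Returns false on any failure and leaves *value_ret untouched. */
static bool sketch_read_hex_u64(const char *path, uint64_t *value_ret)
{
	char buf[32];
	char *end;
	unsigned long long v;
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return false;	/* file missing or not readable */
	if (fgets(buf, sizeof(buf), f) == NULL) {
		fclose(f);
		return false;	/* read error or empty file */
	}
	fclose(f);

	errno = 0;
	v = strtoull(buf, &end, 16);	/* base 16 accepts an optional "0x" prefix */
	if (errno != 0 || end == buf)
		return false;	/* out of range, or no digits parsed */
	if (*end != '\0' && *end != '\n')
		return false;	/* unexpected trailing characters */
	*value_ret = (uint64_t)v;
	return true;
}
```

A caller would treat a false return exactly like a non-matching part: leave the preference unset so that dispatch is unchanged.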
Why cpu0 only? The kernel ABI notes MIDR_EL1 isn't necessarily uniform across
CPUs on heterogeneous systems, but no Neoverse V-class server SKU has ever shipped
as part of a big.LITTLE configuration, and the HWCAP-based feature checks are
system-safe by construction. For current real-world hardware, cpu0 is
representative.