Big performance improvement of meta and datanode #274
Open
zhanglistar wants to merge 8 commits into
Summary
This PR optimizes QFS metadata and replicated write paths for small-file create/write workloads.
Local mstress results:
for 271452 paths/client, about 3.6x faster.
for 1000 files/client with 2 clients, about 1.65x faster.
1000 x 1 MB workload, about 2x faster.
with ~1.03 GB/s actual client network send due to 3 replicas.
Main changes:
allocate path.
No-forward/NF protocol support so replicas can process direct client fanout without chunkserver-to-chunkserver forwarding.
IOBufferData references.
WRITE_PREPARE, allowing chunkservers to reuse client checksums.
replica.
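The trade-off behind the no-forward fanout item above can be sketched with a toy cost model. This is an illustration only, not code from the PR: `TransferCost`, `chain_forwarding`, and `client_fanout` are hypothetical names, and the model ignores headers, acknowledgements, and pipelining.

```python
from dataclasses import dataclass


@dataclass
class TransferCost:
    client_bytes_sent: int     # bytes leaving the client NIC
    server_forward_bytes: int  # bytes relayed chunkserver-to-chunkserver
    sequential_hops: int       # data hops on the critical path


def chain_forwarding(payload_len: int, replicas: int) -> TransferCost:
    """Client sends one copy; each replica relays it to the next one."""
    return TransferCost(
        client_bytes_sent=payload_len,
        server_forward_bytes=payload_len * (replicas - 1),
        sequential_hops=replicas,
    )


def client_fanout(payload_len: int, replicas: int) -> TransferCost:
    """Client sends one copy to every replica in parallel; nothing is relayed."""
    return TransferCost(
        client_bytes_sent=payload_len * replicas,
        server_forward_bytes=0,
        sequential_hops=1,
    )


ONE_MIB = 1 << 20
chain = chain_forwarding(ONE_MIB, replicas=3)
fanout = client_fanout(ONE_MIB, replicas=3)
print("chain :", chain)
print("fanout:", fanout)
```

Under this model, fanout shortens the critical path to a single hop but triples the bytes the client sends, which is in line with the PR's note that actual client network send is about three times the logical write rate with 3 replicas.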
Motivation
The original small-file write path has several latency bottlenecks:
These changes reduce allocation latency, remove connection churn, and move the replicated data path
closer to an HDFS-style model.
Key Configs
New / relevant switches:
Default behavior should remain compatible unless these switches are enabled.
Implementation Details
NamespaceV2
HDFS-like allocation
With the new path enabled:
Append, striped files, and object-store paths remain on the original path.
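A minimal sketch of the routing rule stated above, assuming a single boolean switch. `CreateRequest`, its fields, and `use_new_allocate_path` are hypothetical names for illustration and do not come from the QFS source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreateRequest:
    # Hypothetical request flags, not QFS field names.
    is_append: bool = False
    is_striped: bool = False
    is_object_store: bool = False


def use_new_allocate_path(req: CreateRequest, new_path_enabled: bool) -> bool:
    """Return True only for plain replicated writes when the switch is on."""
    if not new_path_enabled:
        return False  # default behavior: everything stays on the original path
    return not (req.is_append or req.is_striped or req.is_object_store)


print(use_new_allocate_path(CreateRequest(), new_path_enabled=True))
print(use_new_allocate_path(CreateRequest(is_striped=True), new_path_enabled=True))
print(use_new_allocate_path(CreateRequest(), new_path_enabled=False))
```

Keeping the exclusions in one predicate makes it easy to audit which workloads can reach the new path at all.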
Parallel write fanout
Buffer sharing
Checksum optimization
in the write hot path.
This follows the HDFS-style tradeoff: trust client-provided checksums on write, then verify stored data
during reads / scrub.
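The trust-on-write, verify-on-read trade-off can be illustrated with a toy in-memory chunk. This is a sketch under stated assumptions, not the chunkserver implementation: `ToyChunk`, `block_checksums`, and `ChecksumError` are hypothetical names, and Adler-32 over 64 KiB blocks is assumed here purely for illustration.

```python
import zlib
from typing import List

CHECKSUM_BLOCK_SIZE = 64 * 1024  # assumed block size for this sketch


class ChecksumError(Exception):
    pass


def block_checksums(data: bytes) -> List[int]:
    """Compute one Adler-32 checksum per fixed-size block."""
    return [
        zlib.adler32(data[i:i + CHECKSUM_BLOCK_SIZE])
        for i in range(0, len(data), CHECKSUM_BLOCK_SIZE)
    ]


class ToyChunk:
    def __init__(self) -> None:
        self.data = b""
        self.checksums: List[int] = []

    def write(self, data: bytes, client_checksums: List[int],
              verify_on_write: bool = False) -> None:
        # Fast path: store the client's checksums without recomputing them.
        if verify_on_write and block_checksums(data) != client_checksums:
            raise ChecksumError("checksum mismatch on write")
        self.data = data
        self.checksums = list(client_checksums)

    def read(self) -> bytes:
        # Reads (and scrubbing) always recompute and compare.
        if block_checksums(self.data) != self.checksums:
            raise ChecksumError("checksum mismatch on read")
        return self.data


payload = bytes(range(256)) * 1024  # 256 KiB, i.e. four checksum blocks
chunk = ToyChunk()
chunk.write(payload, block_checksums(payload))
assert chunk.read() == payload

# Simulate on-disk corruption of the first byte; the read path detects it.
chunk.data = b"\xff" + payload[1:]
try:
    chunk.read()
    detected = False
except ChecksumError:
    detected = True
print("corruption detected on read:", detected)
```

The cost of skipping verification is that a bad checksum or corrupted payload is accepted at write time and only surfaces on a later read or scrub.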
Benchmark
Environment:
Single Client
1000 files, 1 MB each:
1000 files created in 3058 ms
open avg: 240 us
write avg: 134 us
close avg: 2681 us
Write.ChunkWriteUsec: 1735099 us
Write.CloseUsec: 2655539 us
Write.WriteIdAlloc: 225061 us
Write.ChunkClose: 114009 us
ChunkServer.Pool.BytesSent: 3147979791
ChunkServer.Pool.Connect: 3
ChunkServer.Pool.OpsQueued: 9000
Approximate throughput:
Logical write throughput: ~327 MB/s
Actual client network send: ~1.03 GB/s
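The two throughput figures can be rechecked from the numbers reported above. The sketch assumes that "1 MB" means 1 MiB (2^20 bytes), that the logical figure is in MiB/s, and that the network figure is in decimal GB/s; the variable names are illustrative.

```python
files = 1000
file_bytes = 1 << 20          # assumption: 1 MB per file means 1 MiB
elapsed_s = 3058 / 1000       # "1000 files created in 3058 ms"
bytes_sent = 3_147_979_791    # ChunkServer.Pool.BytesSent

logical_mib_s = files * file_bytes / elapsed_s / (1 << 20)
network_gb_s = bytes_sent / elapsed_s / 1e9
amplification = bytes_sent / (files * file_bytes)

print(f"logical write throughput: {logical_mib_s:.0f} MiB/s")
print(f"client network send:      {network_gb_s:.2f} GB/s")
print(f"bytes sent per logical byte: {amplification:.2f}")
```

The ratio comes out at roughly 3, consistent with the client sending each byte to three replicas plus a small protocol overhead.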
Two Clients
Before checksum-vector / skip-verify optimization:
proc_00: 4353 ms, Write.ChunkWriteUsec=2865775
proc_01: 4349 ms, Write.ChunkWriteUsec=2863192
After checksum-vector / skip-verify optimization:
proc_00: 4171 ms, Write.ChunkWriteUsec=2414194
proc_01: 4209 ms, Write.ChunkWriteUsec=2435554
Write.ChunkWriteUsec improved by about 15-16% (15.8% on proc_00, 14.9% on proc_01).
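The reduction can be recomputed per process from the Write.ChunkWriteUsec values listed above; `reduction_pct` is just a helper name for this sketch.

```python
before = {"proc_00": 2865775, "proc_01": 2863192}  # Write.ChunkWriteUsec, before
after = {"proc_00": 2414194, "proc_01": 2435554}   # Write.ChunkWriteUsec, after


def reduction_pct(old: int, new: int) -> float:
    """Relative reduction of `new` versus `old`, in percent."""
    return (old - new) / old * 100


results = {proc: reduction_pct(before[proc], after[proc]) for proc in before}
for proc, pct in results.items():
    print(f"{proc}: {pct:.1f}% less time in chunk writes")
```

Wall-clock time per process moves much less (4353 ms to 4171 ms and 4349 ms to 4209 ms), so the checksum change mainly shows up in the chunk write counter rather than in end-to-end time.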
Correctness / Recovery Notes
The HDFS-like lazy-create path needs careful recovery semantics for killed writers.
Observed case:
This PR includes initial recovery work, but this area should receive extra upstream review before
enabling lazy-create broadly.
Testing
Validated locally:
git diff --check
cmake --build bld --target metaserver chunkserver mstress_client namespacev2test -j8
./bld/output/bin/devtools/namespacev2test
Also validated with clean-cluster mstress write benchmarks.
Follow-up Work