Big performance improvement of meta and datanode by zhanglistar · Pull Request #274 · quantcast/qfs

zhanglistar · 2026-06-01T08:31:36Z

Summary

This PR optimizes QFS metadata and replicated write paths for small-file create/write workloads.

Local mstress results:

Metadata create workload improved from ~309s/client to ~86s/client for 271452 paths/client, about 3.6x faster.
1 MB replicated file create improved from ~6.9s/client to ~4.2s/client for 1000 files/client with 2 clients, about 1.65x faster.
Chunk write time improved from ~5.0s/client to ~2.4s/client in the same 1000 x 1 MB workload, about 2x faster.
Current single-client 1 MB write throughput is ~327 MB/s logical data, with ~1.03 GB/s actual client network send due to 3 replicas.

Main changes:

Add NamespaceV2 metadata layer with finer-grained locking and WAL replay tests.
Add optional HDFS-like write allocation path to avoid synchronous chunkserver pre-create on the hot allocate path.
Add chunkserver lazy-create-on-write support.
Reuse client-to-chunkserver connections in the write path.
Add client-side parallel replica fanout for write-id allocation, write prepare, and close.
Add No-forward / NF protocol support so replicas can process direct client fanout without chunkserver-to-chunkserver forwarding.
Avoid fanout payload copy by sharing IOBufferData references.
Send client-computed 64 KB checksum vectors in WRITE_PREPARE, allowing chunkservers to reuse client checksums.
Add optional chunkserver write-prepare checksum verify skip to avoid repeated checksum scans on every replica.

Motivation

The original small-file write path has several latency bottlenecks:

Coarse metadata locking limits create throughput.
Metaserver allocation waits for chunkserver pre-create before replying to the client.
Client creates short-lived chunkserver connections under write-heavy mstress workloads.
Replicated writes use chunkserver forwarding, adding serial network hops.
Each chunkserver replica can rescan write payload to compute checksums.
Fanout writes previously allocated avoidable temporary buffer objects for the payload.

These changes reduce allocation latency, remove connection churn, and move the replicated data path
closer to an HDFS-style model.

Key Configs

New / relevant switches:

metaServer.writeFlow.hdfsLikeAllocate = 1
chunkServer.writeFlow.lazyCreateOnWrite = 1
client.parallelReplicaWrite = 1
chunkServer.skipWritePrepareChecksumVerify = 1

Default behavior should remain compatible unless these switches are enabled.

Implementation Details

NamespaceV2

Adds native metadata structures and NamespaceV2 tests.
Adds WAL replay coverage.
Moves toward finer-grained namespace locking.
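
The description above does not spell out the NamespaceV2 locking scheme. As a rough illustration of what "finer-grained" can mean compared to one namespace-wide lock, here is a minimal lock-striping sketch. Everything in it (`StripedNamespace`, `StripeOf`, the stripe count of 16, hashing on the parent directory) is a hypothetical assumption for illustration, not code from this PR.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical toy namespace, NOT the NamespaceV2 implementation.
// Creates under different parent directories usually take different locks,
// so they do not serialize on a single global mutex.
class StripedNamespace {
public:
    static constexpr std::size_t kStripes = 16;  // assumed stripe count

    // Map a parent directory to the stripe that guards its entries.
    static std::size_t StripeOf(const std::string& parentDir) {
        return std::hash<std::string>{}(parentDir) % kStripes;
    }

    // Create parentDir/name. Returns false if the path already exists.
    bool Create(const std::string& parentDir, const std::string& name) {
        Stripe& stripe = stripes_[StripeOf(parentDir)];
        std::lock_guard<std::mutex> guard(stripe.lock);
        return stripe.paths.insert(parentDir + "/" + name).second;
    }

private:
    struct Stripe {
        std::mutex lock;
        std::unordered_set<std::string> paths;
    };
    std::array<Stripe, kStripes> stripes_;
};
```

A create-heavy workload such as mstress benefits from this kind of scheme only when the paths spread across many parent directories; creates under one hot directory still contend on one stripe.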

HDFS-like allocation

With the new path enabled:

Client requests chunk allocation from metaserver.
Metaserver selects replicas and returns allocation metadata and lease id.
Metaserver does not synchronously send ALLOCATE_CHUNK.
Client sends WRITE_ID_ALLOC to chunkservers.
Chunkserver lazily creates the chunk if missing and registers the lease.
Client writes data and closes the chunk.

Append, striped files, and object-store paths remain on the original path.
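
The chunkserver side of this flow can be sketched as a toy model: WRITE_ID_ALLOC creates the chunk on demand when lazy create is enabled, and fails for an unknown chunk otherwise. The types and method names below (`ToyChunkServer`, `WriteIdAlloc`, `AllocateChunk`) are hypothetical and only mirror the steps listed above; the real request handling, lease validation, and error codes are not shown in this description.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical toy model of the chunkserver side; names are illustrative only.
struct ToyChunk {
    int64_t leaseId;
    int64_t nextWriteId;
};

class ToyChunkServer {
public:
    explicit ToyChunkServer(bool lazyCreateOnWrite)
        : lazyCreateOnWrite_(lazyCreateOnWrite) {}

    // Original path: the metaserver pre-creates the chunk before the client writes.
    void AllocateChunk(int64_t chunkId, int64_t leaseId) {
        chunks_[chunkId] = ToyChunk{leaseId, 1};
    }

    // WRITE_ID_ALLOC. Returns a write id, or -1 if the chunk is unknown and
    // lazy create is disabled, or if the lease id does not match.
    int64_t WriteIdAlloc(int64_t chunkId, int64_t leaseId) {
        auto it = chunks_.find(chunkId);
        if (it == chunks_.end()) {
            if (!lazyCreateOnWrite_) {
                return -1;
            }
            // Lazy path: create the chunk and register the lease on first use.
            it = chunks_.emplace(chunkId, ToyChunk{leaseId, 1}).first;
        }
        if (it->second.leaseId != leaseId) {
            return -1;
        }
        return it->second.nextWriteId++;
    }

    bool HasChunk(int64_t chunkId) const {
        return chunks_.count(chunkId) != 0;
    }

private:
    bool lazyCreateOnWrite_;
    std::unordered_map<int64_t, ToyChunk> chunks_;
};
```

With the flag off, the model behaves like the original path: the write id is only handed out after an explicit `AllocateChunk`.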

Parallel write fanout

Adds client-side fanout to all replicas for:
- WRITE_ID_ALLOC
- WRITE_PREPARE
- CLOSE
Adds No-forward / NF protocol field.
Replicas still receive the full replica/write-id list so each chunkserver can derive its own position.
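
Two ideas from this section can be sketched in a few lines: a replica finding its own position in the replica/write-id list, and a simplified latency model of why direct fanout removes serial hops. The struct and function names are hypothetical, and the latency model is an assumption that ignores bandwidth: with fanout the client sends the payload once per replica, which is the 3x network send reported in the summary.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical entry of the replica/write-id list carried in each request.
struct ReplicaWriteId {
    std::string location;  // e.g. "host:port"
    int64_t writeId;
};

// Each chunkserver looks itself up in the list. Returns its index, or -1.
int FindOwnPosition(const std::vector<ReplicaWriteId>& replicas,
                    const std::string& self) {
    for (std::size_t i = 0; i < replicas.size(); ++i) {
        if (replicas[i].location == self) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Forwarding chain (client -> A -> B -> C): per-hop latencies add up.
int64_t ChainLatencyUsec(const std::vector<int64_t>& hopUsec) {
    return std::accumulate(hopUsec.begin(), hopUsec.end(), int64_t{0});
}

// Direct fanout (client -> A, client -> B, client -> C in parallel):
// the request completes when the slowest replica answers.
int64_t FanoutLatencyUsec(const std::vector<int64_t>& hopUsec) {
    return hopUsec.empty() ? 0 : *std::max_element(hopUsec.begin(), hopUsec.end());
}
```

For three hops of 100, 120, and 90 us this model gives 310 us for the chain and 120 us for fanout.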

Buffer sharing

Adds IOBuffer::AppendShared().
Fanout requests attach shared IOBufferData references.
Payload bytes are not copied per replica.
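
The sharing idea can be illustrated with reference-counted blocks. This is a minimal sketch using `std::shared_ptr` as a stand-in for IOBufferData references; `ToyRequest` and `FanOut` are hypothetical, and the real `IOBuffer::AppendShared()` signature is not shown in this description.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Stand-in for a reference-counted payload block (assumption, not IOBufferData).
using SharedBlock = std::shared_ptr<const std::string>;

// Hypothetical per-replica request that references payload blocks.
struct ToyRequest {
    std::vector<SharedBlock> payload;

    // Attach a reference to the block. The bytes themselves are not copied.
    void AppendShared(const SharedBlock& block) {
        payload.push_back(block);
    }

    std::size_t BytesQueued() const {
        std::size_t total = 0;
        for (const SharedBlock& block : payload) {
            total += block->size();
        }
        return total;
    }
};

// Build one request per replica, all pointing at the same payload block.
std::vector<ToyRequest> FanOut(const SharedBlock& block, std::size_t replicas) {
    std::vector<ToyRequest> requests(replicas);
    for (ToyRequest& request : requests) {
        request.AppendShared(block);
    }
    return requests;
}
```

Each request reports the full payload size, but memory holds a single copy of the bytes; the block is freed when the last request drops its reference.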

Checksum optimization

Client sends 64 KB checksum vector in WRITE_PREPARE.
Chunkserver can reuse the vector for chunk metadata.
With chunkServer.skipWritePrepareChecksumVerify=1, chunkserver skips duplicate payload checksum scans in the write hot path.

This follows the HDFS-style tradeoff: trust client-provided checksums on write, then verify stored data
during reads / scrub.
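
A minimal sketch of the idea: the client computes one checksum per 64 KB block, and the receiver either recomputes and compares or, with the skip switch on, only checks that the vector has the expected length. Adler-32 is used here as a stand-in because this description does not name the checksum function; `ChecksumVector` and `AcceptWritePrepare` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kChecksumBlockSize = 64 * 1024;

// Plain Adler-32, used as a stand-in checksum (assumption).
uint32_t Adler32(const unsigned char* data, std::size_t len) {
    uint32_t a = 1;
    uint32_t b = 0;
    for (std::size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % 65521;
        b = (b + a) % 65521;
    }
    return (b << 16) | a;
}

// Client side: one checksum per 64 KB block; the last block may be short.
std::vector<uint32_t> ChecksumVector(const unsigned char* data, std::size_t len) {
    std::vector<uint32_t> sums;
    for (std::size_t off = 0; off < len; off += kChecksumBlockSize) {
        const std::size_t n = std::min(kChecksumBlockSize, len - off);
        sums.push_back(Adler32(data + off, n));
    }
    return sums;
}

// Chunkserver side: with skipVerify the payload is not rescanned, and the
// client vector is trusted as long as its length matches the payload.
bool AcceptWritePrepare(const unsigned char* data, std::size_t len,
                        const std::vector<uint32_t>& clientSums, bool skipVerify) {
    const std::size_t blocks = (len + kChecksumBlockSize - 1) / kChecksumBlockSize;
    if (clientSums.size() != blocks) {
        return false;
    }
    if (skipVerify) {
        return true;
    }
    return ChecksumVector(data, len) == clientSums;
}
```

The last case is the tradeoff in concrete form: with verification skipped, a checksum vector that does not match the payload is accepted on write and would only be caught by a later read or scrub.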

Benchmark

Environment:

3 chunkservers
3 replicas
1 MB files
client/chunkservers use host IP instead of localhost to exercise network path
client.parallelReplicaWrite=1
chunkServer.skipWritePrepareChecksumVerify=1

Single Client

1000 files, 1 MB each:

1000 files created in 3058 ms

open avg: 240 us
write avg: 134 us
close avg: 2681 us

Write.ChunkWriteUsec: 1735099 us
Write.CloseUsec: 2655539 us
Write.WriteIdAlloc: 225061 us
Write.ChunkClose: 114009 us

ChunkServer.Pool.BytesSent: 3147979791
ChunkServer.Pool.Connect: 3
ChunkServer.Pool.OpsQueued: 9000
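
The Pool counters above (3 connects for 9000 queued ops) are the pattern a keyed connection pool produces. The sketch below is a hypothetical toy pool, not the client code from this PR. Reading 9000 as 1000 files x 3 replicas x 3 op types (WRITE_ID_ALLOC, WRITE_PREPARE, CLOSE) is an assumption; the description does not break the counter down.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical toy pool: one long-lived connection per chunkserver address.
class ToyConnectionPool {
public:
    // Queue one op for the given chunkserver, connecting only on first use.
    void Enqueue(const std::string& address) {
        auto it = opsByAddress_.find(address);
        if (it == opsByAddress_.end()) {
            ++connects_;  // new connection: counted once per address
            it = opsByAddress_.emplace(address, std::size_t{0}).first;
        }
        ++it->second;
        ++opsQueued_;
    }

    std::size_t Connects() const { return connects_; }
    std::size_t OpsQueued() const { return opsQueued_; }

private:
    std::map<std::string, std::size_t> opsByAddress_;
    std::size_t connects_ = 0;
    std::size_t opsQueued_ = 0;
};
```

Without reuse, the same workload would open a connection per file per replica instead of one per chunkserver.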

Approximate throughput:

Logical write throughput: ~327 MB/s
Actual client network send: ~1.03 GB/s
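
For reference, the two throughput figures follow directly from the numbers above. This is a small worked calculation; the function names are illustrative, and the amplification figure assumes each file is 1,048,576 bytes.

```cpp
#include <cstdint>

// Logical throughput: files per second times MB per file.
double LogicalMBPerSec(int64_t files, double mbPerFile, int64_t elapsedMs) {
    return files * mbPerFile / (elapsedMs / 1000.0);
}

// Network send rate in decimal GB/s from the BytesSent counter.
double NetworkGBPerSec(int64_t bytesSent, int64_t elapsedMs) {
    return (bytesSent / 1e9) / (elapsedMs / 1000.0);
}

// Bytes sent on the wire per logical byte written.
double SendAmplification(int64_t bytesSent, int64_t logicalBytes) {
    return static_cast<double>(bytesSent) / static_cast<double>(logicalBytes);
}
```

`LogicalMBPerSec(1000, 1.0, 3058)` gives about 327, and `NetworkGBPerSec(3147979791, 3058)` gives about 1.03. Under the file-size assumption, the amplification is just over 3, which matches three replicas plus a small amount of protocol overhead.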

Two Clients

Before checksum-vector / skip-verify optimization:

proc_00: 4353 ms, Write.ChunkWriteUsec=2865775
proc_01: 4349 ms, Write.ChunkWriteUsec=2863192

After checksum-vector / skip-verify optimization:

proc_00: 4171 ms, Write.ChunkWriteUsec=2414194
proc_01: 4209 ms, Write.ChunkWriteUsec=2435554

ChunkWriteUsec improved by about 15-16% (15.8% for proc_00, 14.9% for proc_01).

Correctness / Recovery Notes

The HDFS-like lazy-create path needs careful recovery semantics for killed writers.

Observed case:

Client is killed after lazy chunk creation and partial write.
Restart may leave namespace size/mapping beyond the last stable recoverable chunk.
Recovery direction is to truncate EOF or repair mappings to the last recoverable stable chunk.

This PR includes initial recovery work, but this area should receive extra upstream review before
enabling lazy-create broadly.
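
The "truncate EOF" direction can be sketched as follows: walk the file's chunks in offset order and stop at the first one that is not stable. This is a hypothetical illustration of the stated recovery direction only; `ToyChunkState` and `RecoverableEof` are invented names, and the PR's actual recovery code and its definition of "stable" are not shown here.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-chunk recovery state, in file offset order.
struct ToyChunkState {
    int64_t offset;  // byte offset of the chunk within the file
    int64_t size;    // bytes known to be stored in the chunk
    bool stable;     // true if the chunk is recoverable and stable
};

// New EOF after recovery: the end of the last chunk in the leading run of
// stable chunks. Everything from the first unstable chunk on is dropped.
int64_t RecoverableEof(const std::vector<ToyChunkState>& chunksInOffsetOrder) {
    int64_t eof = 0;
    for (const ToyChunkState& chunk : chunksInOffsetOrder) {
        if (!chunk.stable) {
            break;
        }
        eof = chunk.offset + chunk.size;
    }
    return eof;
}
```

For a killed writer whose last chunk was lazily created and only partially written, this rule moves EOF back to the end of the previous stable chunk, or to 0 if no chunk is stable.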

Testing

Validated locally:

git diff --check
cmake --build bld --target metaserver chunkserver mstress_client namespacev2test -j8
./bld/output/bin/devtools/namespacev2test

Also validated with clean-cluster mstress write benchmarks.

Follow-up Work

Split this work into smaller upstream-friendly PRs:
1. NamespaceV2 and tests.
2. HDFS-like allocation / lazy create.
3. Client chunkserver connection pooling.
4. Parallel replica fanout and No-forward.
5. Checksum-vector optimization.
6. IOBuffer::AppendShared().
Add chunkserver-side detailed timing counters:
- parse time
- checksum time
- disk queue submit time
- disk completion time
Finish killed-writer crash/restart recovery.
Run larger 100k-file benchmark after recovery semantics are finalized.

zhanglistar added 8 commits May 22, 2026 10:34

a361e27 Place build artifacts under output
b2fcfb5 Improve mstress local benchmark behavior
585fbb8 Merge branch 'ci' into lock-opt
da95513 opt lock
d11e34d add rfc
469b735 modify
c3b38a2 update doc
a4dfe42 Optimize QFS metadata and write path

zhanglistar wants to merge 8 commits into quantcast:master from bigo-sg:lock-opt
