Big performance improvement of meta and datanode #274

Open
zhanglistar wants to merge 8 commits into quantcast:master from bigo-sg:lock-opt

zhanglistar commented Jun 1, 2026

Summary

This PR optimizes QFS metadata and replicated write paths for small-file create/write workloads.

Local mstress results:

  • Metadata create workload improved from ~309s/client to ~86s/client
    for 271452 paths/client, about 3.6x faster.
  • 1 MB replicated file create improved from ~6.9s/client to ~4.2s/client
    for 1000 files/client with 2 clients, about 1.65x faster.
  • Chunk write time improved from ~5.0s/client to ~2.4s/client in the same
    1000 x 1 MB workload, about 2x faster.
  • Current single-client 1 MB write throughput is ~327 MB/s logical data,
    with ~1.03 GB/s actual client network send due to 3 replicas.

Main changes:

  • Add NamespaceV2 metadata layer with finer-grained locking and WAL replay tests.
  • Add optional HDFS-like write allocation path to avoid synchronous chunkserver pre-create on the hot
    allocate path.
  • Add chunkserver lazy-create-on-write support.
  • Reuse client-to-chunkserver connections in the write path.
  • Add client-side parallel replica fanout for write-id allocation, write prepare, and close.
  • Add No-forward / NF protocol support so replicas can process direct client fanout without
    chunkserver-to-chunkserver forwarding.
  • Avoid fanout payload copy by sharing IOBufferData references.
  • Send client-computed 64 KB checksum vectors in WRITE_PREPARE, allowing chunkservers to reuse client
    checksums.
  • Add optional chunkserver write-prepare checksum verify skip to avoid repeated checksum scans on every
    replica.

Motivation

The original small-file write path has several latency bottlenecks:

  1. Coarse metadata locking limits create throughput.
  2. Metaserver allocation waits for chunkserver pre-create before replying to the client.
  3. Client creates short-lived chunkserver connections under write-heavy mstress workloads.
  4. Replicated writes use chunkserver forwarding, adding serial network hops.
  5. Each chunkserver replica can rescan write payload to compute checksums.
  6. Fanout writes previously allocated avoidable temporary buffer objects and copied the payload.

These changes reduce allocation latency, remove connection churn, and move the replicated data path
closer to an HDFS-style model.

Key Configs

New / relevant switches:

metaServer.writeFlow.hdfsLikeAllocate = 1
chunkServer.writeFlow.lazyCreateOnWrite = 1
client.parallelReplicaWrite = 1
chunkServer.skipWritePrepareChecksumVerify = 1

Default behavior is intended to stay unchanged unless these switches are explicitly enabled.
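
As a rough illustration of the opt-in behavior, the sketch below parses `key = value` lines and treats any switch that is not set as off. The `DEFAULTS` table, the `parse_switches` helper, and the assumption that every switch defaults to 0 are illustrative only; they are not taken from the QFS sources.

```python
# Hypothetical helper, not part of QFS. Assumes each new switch defaults to 0 (off).
DEFAULTS = {
    "metaServer.writeFlow.hdfsLikeAllocate": 0,
    "chunkServer.writeFlow.lazyCreateOnWrite": 0,
    "client.parallelReplicaWrite": 0,
    "chunkServer.skipWritePrepareChecksumVerify": 0,
}


def parse_switches(text):
    """Parse 'key = value' lines; keys that are absent keep their default."""
    switches = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        key = key.strip()
        if key in switches:
            switches[key] = int(value.strip())
    return switches
```

With only `client.parallelReplicaWrite = 1` set, the other three switches stay at 0, which is the compatibility property the description asks for.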

Implementation Details

NamespaceV2

  • Adds native metadata structures and NamespaceV2 tests.
  • Adds WAL replay coverage.
  • Moves toward finer-grained namespace locking.

HDFS-like allocation

With the new path enabled:

  1. Client requests chunk allocation from metaserver.
  2. Metaserver selects replicas and returns allocation metadata and lease id.
  3. Metaserver does not synchronously send ALLOCATE_CHUNK.
  4. Client sends WRITE_ID_ALLOC to chunkservers.
  5. Chunkserver lazily creates the chunk if missing and registers the lease.
  6. Client writes data and closes the chunk.

Append, striped files, and object-store paths remain on the original path.
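
The six steps above can be modeled with a small in-memory toy. All class and method names here (`MetaServer.allocate`, `ChunkServer.write_id_alloc`, and so on) are hypothetical stand-ins chosen for this sketch, and the lease handling on close is an assumption; the real RPCs and state machines are in the PR's C++ code.

```python
import itertools


class ChunkServer:
    """Toy chunkserver that creates a chunk lazily on WRITE_ID_ALLOC."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}   # chunk_id -> bytearray of written data
        self.leases = {}   # chunk_id -> lease id registered by the writer
        self._write_ids = itertools.count(1)

    def write_id_alloc(self, chunk_id, lease_id):
        if chunk_id not in self.chunks:       # step 5: lazy create if missing
            self.chunks[chunk_id] = bytearray()
        self.leases[chunk_id] = lease_id      # step 5: register the lease
        return next(self._write_ids)

    def write(self, chunk_id, write_id, data):
        self.chunks[chunk_id] += data

    def close(self, chunk_id):
        self.leases.pop(chunk_id, None)       # assumption: close drops the lease


class MetaServer:
    """Toy metaserver that never contacts chunkservers during allocate."""

    def __init__(self, servers, replicas=3):
        self.servers = servers
        self.replicas = replicas
        self._ids = itertools.count(1)

    def allocate(self, path):
        # Steps 2-3: pick replicas, return metadata and a lease id, send nothing.
        chunk_id = next(self._ids)
        return {
            "chunk_id": chunk_id,
            "lease_id": "lease-%d" % chunk_id,
            "replicas": self.servers[: self.replicas],
        }


def write_file(meta, path, data):
    alloc = meta.allocate(path)                                   # steps 1-3
    cid, lease = alloc["chunk_id"], alloc["lease_id"]
    write_ids = [cs.write_id_alloc(cid, lease) for cs in alloc["replicas"]]  # 4-5
    for cs, wid in zip(alloc["replicas"], write_ids):             # step 6
        cs.write(cid, wid, data)
        cs.close(cid)
    return alloc
```

The point of the toy is that `allocate` returns without any chunkserver holding the chunk; the chunk only appears once the client's WRITE_ID_ALLOC arrives.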

Parallel write fanout

  • Adds client-side fanout to all replicas for:
    • WRITE_ID_ALLOC
    • WRITE_PREPARE
    • CLOSE
  • Adds No-forward / NF protocol field.
  • Replicas still receive the full replica/write-id list so each chunkserver can derive its own position.
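
A minimal sketch of the fanout idea, using Python threads in place of the client's network layer. The `handle_op` function stands in for a chunkserver processing a no-forward request, and the names and reply tuple are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def replica_position(replica_list, me):
    """Each replica gets the full list and derives its own index from it."""
    return replica_list.index(me)


def handle_op(op, me, replica_list):
    # Stand-in for a chunkserver handling one op without forwarding it on.
    return (op, me, replica_position(replica_list, me))


def fanout(op, replica_list):
    """Send one op to every replica concurrently; replies come back in replica order."""
    with ThreadPoolExecutor(max_workers=len(replica_list)) as pool:
        futures = [pool.submit(handle_op, op, r, replica_list) for r in replica_list]
        return [f.result() for f in futures]
```

Compared with a forwarding chain, where replica N waits on replica N-1, the latency of one fanned-out op is bounded by the slowest replica rather than the sum of the hops.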

Buffer sharing

  • Adds IOBuffer::AppendShared().
  • Fanout requests attach shared IOBufferData references.
  • Payload bytes are not copied per replica.
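
The following is a Python analogue of the sharing idea, not a model of `IOBuffer::AppendShared()` itself, whose signature is not shown in this description. It uses `memoryview` so that every per-replica request refers to the same underlying payload object; `build_fanout_requests` is a hypothetical helper.

```python
def build_fanout_requests(payload, replicas):
    """Build one request per replica, all referencing the same payload bytes."""
    shared = memoryview(payload)  # a view over the payload, no copy is made
    return [{"replica": r, "body": shared} for r in replicas]


def total_payload_objects(requests):
    """Count distinct underlying payload objects across all requests."""
    return len({id(req["body"].obj) for req in requests})
```

For three replicas and a 1 MB payload, `total_payload_objects` reports a single underlying object, whereas a copy-per-replica design would hold three.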

Checksum optimization

  • Client sends 64 KB checksum vector in WRITE_PREPARE.
  • Chunkserver can reuse the vector for chunk metadata.
  • With chunkServer.skipWritePrepareChecksumVerify=1, chunkserver skips duplicate payload checksum scans
    in the write hot path.

This follows the HDFS-style tradeoff: trust client-provided checksums on write, then verify stored data
during reads / scrub.
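
A small sketch of the client-side checksum vector and the optional verify skip. It uses `zlib.adler32` as a stand-in checksum function, which is an assumption of this example; the function names and the `skip_verify` parameter are likewise illustrative and only mirror the config switch above.

```python
import zlib

BLOCK = 64 * 1024  # 64 KB checksum block size, as stated in the description


def checksum_vector(payload):
    """Client side: one checksum per 64 KB block of the payload."""
    return [zlib.adler32(payload[i:i + BLOCK]) for i in range(0, len(payload), BLOCK)]


def write_prepare(payload, client_checksums, skip_verify):
    """Chunkserver side: return the checksums to store with the chunk.

    With skip_verify set, the client vector is trusted and the payload is not
    rescanned. Otherwise the payload is rescanned and compared first.
    """
    if not skip_verify and checksum_vector(payload) != client_checksums:
        raise ValueError("write-prepare checksum mismatch")
    return client_checksums
```

A 1 MB payload produces a 16-entry vector. With the skip enabled, a corrupted payload is accepted on write, which is why the description pairs the switch with verification on read and scrub.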

Benchmark

Environment:

  • 3 chunkservers
  • 3 replicas
  • 1 MB files
  • client and chunkservers use the host IP instead of localhost to exercise the network path
  • client.parallelReplicaWrite=1
  • chunkServer.skipWritePrepareChecksumVerify=1

Single Client

1000 files, 1 MB each:

1000 files created in 3058 ms

open avg: 240 us
write avg: 134 us
close avg: 2681 us

Write.ChunkWriteUsec: 1735099 us
Write.CloseUsec: 2655539 us
Write.WriteIdAlloc: 225061 us
Write.ChunkClose: 114009 us

ChunkServer.Pool.BytesSent: 3147979791
ChunkServer.Pool.Connect: 3
ChunkServer.Pool.OpsQueued: 9000

Approximate throughput:

Logical write throughput: ~327 MB/s
Actual client network send: ~1.03 GB/s
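
The two throughput figures follow directly from the counters above. The sketch below redoes the arithmetic; the function names are illustrative, and the logical figure simply counts 1 MB per file as the benchmark describes it.

```python
def logical_mb_per_s(files, mb_per_file, elapsed_ms):
    """Application-level throughput: payload written once, ignoring replication."""
    return files * mb_per_file / (elapsed_ms / 1000.0)


def wire_gb_per_s(bytes_sent, elapsed_ms):
    """Client network send rate, which includes one copy per replica."""
    return bytes_sent / (elapsed_ms / 1000.0) / 1e9


logical = logical_mb_per_s(files=1000, mb_per_file=1, elapsed_ms=3058)
wire = wire_gb_per_s(bytes_sent=3147979791, elapsed_ms=3058)
print(round(logical), round(wire, 2))  # 327 1.03
```

The wire rate is roughly three times the logical rate because the client sends the payload to all 3 replicas itself. `ChunkServer.Pool.OpsQueued: 9000` would also be consistent with 1000 files x 3 replicas x 3 fanned-out ops, though the description does not spell out that breakdown.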

Two Clients

Before checksum-vector / skip-verify optimization:

proc_00: 4353 ms, Write.ChunkWriteUsec=2865775
proc_01: 4349 ms, Write.ChunkWriteUsec=2863192

After checksum-vector / skip-verify optimization:

proc_00: 4171 ms, Write.ChunkWriteUsec=2414194
proc_01: 4209 ms, Write.ChunkWriteUsec=2435554

ChunkWriteUsec improved by about 15-16% (14.9% and 15.8% for the two clients), while wall-clock
time per client improved by about 3-4%.
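
The improvement can be recomputed from the four `Write.ChunkWriteUsec` values and the wall-clock times listed above; `improvement_pct` is just a helper for this check.

```python
def improvement_pct(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


# Write.ChunkWriteUsec, before and after, per client process.
chunk_write = {
    "proc_00": improvement_pct(2865775, 2414194),
    "proc_01": improvement_pct(2863192, 2435554),
}
# Wall-clock milliseconds, before and after, per client process.
wall_clock = {
    "proc_00": improvement_pct(4353, 4171),
    "proc_01": improvement_pct(4349, 4209),
}
```

The chunk-write time drops by roughly 15% while the end-to-end time drops by only a few percent, which suggests most of the remaining per-file time is outside the chunk write itself.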

Correctness / Recovery Notes

The HDFS-like lazy-create path needs careful recovery semantics for killed writers.

Observed case:

  • Client is killed after lazy chunk creation and partial write.
  • Restart may leave namespace size/mapping beyond the last stable recoverable chunk.
  • The intended recovery is to truncate EOF, or repair mappings, back to the last recoverable stable chunk.

This PR includes initial recovery work, but this area should receive extra upstream review before
enabling lazy-create broadly.
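
One way to picture the truncate-EOF direction is the sketch below. It is only an interpretation: the `(offset, size, stable)` tuple layout, the `recoverable_eof` name, and the choice to stop at the first non-stable chunk are all assumptions of this example, not the recovery logic in the PR.

```python
def recoverable_eof(chunks):
    """Return the file size to truncate to after a killed writer.

    chunks: list of (offset, size, stable) tuples sorted by offset.
    The result is the end of the last chunk in the leading run of stable
    chunks, so anything at or after the first non-stable chunk is dropped.
    """
    eof = 0
    for offset, size, stable in chunks:
        if not stable:
            break  # everything from here on is not recoverable in this model
        eof = offset + size
    return eof
```

For a file whose first two chunks are stable and whose third was only partially written, this yields the end offset of the second chunk.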

Testing

Validated locally:

git diff --check
cmake --build bld --target metaserver chunkserver mstress_client namespacev2test -j8
./bld/output/bin/devtools/namespacev2test

Also validated with clean-cluster mstress write benchmarks.

Follow-up Work

  • Split this work into smaller upstream-friendly PRs:
    1. NamespaceV2 and tests.
    2. HDFS-like allocation / lazy create.
    3. Client-to-chunkserver connection pooling.
    4. Parallel replica fanout and No-forward.
    5. Checksum-vector optimization.
    6. IOBuffer::AppendShared().
  • Add chunkserver-side detailed timing counters:
    • parse time
    • checksum time
    • disk queue submit time
    • disk completion time
  • Finish killed-writer crash/restart recovery.
  • Run larger 100k-file benchmark after recovery semantics are finalized.
