br: Adjust restore concurrency and tikv config #69590

Open
Leavrth wants to merge 4 commits into pingcap:master from Leavrth:adjust_restore_concurrency

@Leavrth Leavrth commented Jul 2, 2026

Contributor

What problem does this PR solve?

Issue Number: close #69589

Problem Summary:

  1. Too many concurrent tasks reduce the average bandwidth available to each task, so each task issues more I/O operations, and those are limited by disk IOPS.
  2. Each etcd client request may pick a PD node at random, and the chosen node may be one suffering from I/O delay.
  3. During log restore, the TiKV setting storage.flow-control.soft-pending-compaction-bytes-limit can make otherwise healthy TiKV nodes report busy, and rocksdb.max-background-jobs consumes disk resources.

What changed and how does it work?

  1. Reduce the default value of --tikv-max-restore-concurrency.
  2. Add retries on context timeout for etcd requests.
  3. Adjust the TiKV configuration items storage.flow-control.soft/hard-pending-compaction-bytes-limit and rocksdb.max-background-jobs before restore.
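
The retry-on-timeout idea in item 2 can be sketched as follows. This is a minimal illustration and not the PR's code: `retryWithAttemptTimeout`, its parameters, and the simulated slow endpoint are all hypothetical, and only the standard library is used. Each attempt gets its own deadline, so a request that lands on a slow PD member is abandoned and reissued instead of blocking the caller.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithAttemptTimeout is a hypothetical helper (not taken from the PR).
// It runs fn up to maxAttempts times, gives every attempt its own deadline,
// and retries only when the attempt failed because that deadline expired.
func retryWithAttemptTimeout[T any](
	ctx context.Context,
	maxAttempts int,
	perAttempt time.Duration,
	fn func(context.Context) (T, error),
) (T, error) {
	var zero T
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, perAttempt)
		v, err := fn(attemptCtx)
		cancel()
		if err == nil {
			return v, nil
		}
		lastErr = err
		// Stop when the caller's context is done or the error is not a timeout.
		if ctx.Err() != nil || !errors.Is(err, context.DeadlineExceeded) {
			return zero, err
		}
	}
	return zero, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

// demoRetry simulates an endpoint that stalls on the first two calls.
func demoRetry() (string, int, error) {
	calls := 0
	v, err := retryWithAttemptTimeout(context.Background(), 3, 20*time.Millisecond,
		func(ctx context.Context) (string, error) {
			calls++
			if calls < 3 {
				<-ctx.Done() // pretend this member is too slow to answer
				return "", ctx.Err()
			}
			return "ok", nil
		})
	return v, calls, err
}

func main() {
	v, calls, err := demoRetry()
	fmt.Println(v, calls, err)
}
```

Retrying only on `context.DeadlineExceeded` keeps non-transient failures visible to the caller; which errors the PR actually treats as retryable is decided by its own helper and is not reproduced here.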

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Summary by CodeRabbit

  • New Features

    • Log restore now tracks additional checkpoint information, including restore data size and RocksDB background-job settings.
    • Restore operations now automatically tune TiKV concurrency and flow-control settings for more stable large restores.
  • Bug Fixes

    • Improved handling of checkpoint reloads so restore metadata is preserved more reliably.
    • Added safeguards for restore-time configuration changes to avoid overly aggressive concurrency settings and reduce restore instability.

Leavrth added 4 commits July 1, 2026 15:46
Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>
Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>
Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>
Signed-off-by: Jianjun Liao <jianjun.liao@outlook.com>
@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/needs-tests-checked labels Jul 2, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jul 2, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign leavrth for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 2, 2026
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

📝 Walkthrough

This PR tunes BR/PITR restore concurrency defaults, extends log-restore checkpoint metadata with RocksDB and snapshot-size fields, adds a TiKV compacted-SST flow-control adjustment mechanism, temporarily lowers rocksdb.max-background-jobs during restore, and hardens streamhelper's etcd watch/checkpoint operations with retries and synchronized watcher resets.

Changes

Restore Concurrency, Checkpoint Metadata, and TiKV Flow Control

| Layer | File(s) | Summary |
| --- | --- | --- |
| Import/restore concurrency constants | br/pkg/conn/conn.go, br/pkg/conn/conn_test.go, br/pkg/task/restore.go, br/pkg/task/stream.go, br/pkg/task/stream_test.go, br/pkg/restore/snap_client/client.go | DefaultImportNumGoroutines lowered to 36 with a new margin constant; ProcessTiKVConfigs and worker-pool sizing formulas updated; restore flag defaults and adjustment helpers added. |
| Checkpoint metadata for log restore | br/pkg/checkpoint/log_restore.go, br/pkg/checkpoint/checkpoint_test.go, br/pkg/task/restore.go | RocksDBMaxBackgroundJobs and SnapshotRestoreDataSize added to checkpoint metadata; RestoreConfig.snapshotRestoreDataSize added and populated from archive size. |
| RocksDB max-background-jobs config helpers | br/pkg/utils/db.go, br/pkg/utils/db_test.go, br/pkg/task/stream.go | New getter/setter for TiKV rocksdb.max-background-jobs; KeepRocksDBMaxBackgroundJobsLow temporarily lowers and restores this setting during PITR restore. |
| Compacted SST flow-control module | br/pkg/restore/log_client/flow_control.go, export_test.go, client_test.go, BUILD.bazel | New module estimates pending compaction bytes and adjusts TiKV soft/hard flow-control limits via restricted SQL, with exported test wrappers and unit tests. |
| Log client store/replica sizing and checkpoint wiring | br/pkg/restore/log_client/client.go, br/pkg/task/stream.go | SstRestoreManager gains store/replica counts; SST worker-pool sizing uses fixed per-store constant; LoadOrCreateCheckpointMetadataForLogRestore signature extended; restore pipeline passes new size params into RestoreSSTFileSets. |

Estimated code review effort: 4 (Complex) | ~75 minutes
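
The "temporarily lower, then restore" behaviour described for KeepRocksDBMaxBackgroundJobsLow can be pictured with the sketch below. It is an assumption-laden illustration: `configStore` and `keepLow` are hypothetical names, the store is an in-memory map rather than TiKV, and the lowered value "2" is made up. The point is only the shape of the pattern: remember the original value, apply the low one, and hand back a function that undoes the change.

```go
package main

import "fmt"

// configStore is a hypothetical stand-in for whatever reads and writes TiKV
// configuration; the PR does this through helpers in br/pkg/utils/db.go.
type configStore map[string]string

// keepLow sets key to low and returns a function that puts back the value
// that was in place before the call. Both names are illustrative only.
func keepLow(store configStore, key, low string) (restore func()) {
	original, existed := store[key]
	store[key] = low
	return func() {
		if existed {
			store[key] = original
		} else {
			delete(store, key)
		}
	}
}

// demoKeepLow reports the value seen during and after the restore window.
func demoKeepLow() (during, after string) {
	store := configStore{"rocksdb.max-background-jobs": "8"}
	restore := keepLow(store, "rocksdb.max-background-jobs", "2")
	during = store["rocksdb.max-background-jobs"]
	restore()
	after = store["rocksdb.max-background-jobs"]
	return during, after
}

func main() {
	during, after := demoKeepLow()
	fmt.Println(during, after)
}
```

A process that crashes between lowering and restoring loses the in-memory original, which is presumably why the walkthrough lists RocksDBMaxBackgroundJobs among the fields saved in checkpoint metadata.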

Streamhelper etcd Watch and Checkpoint Robustness

| Layer | File(s) | Summary |
| --- | --- | --- |
| Synchronized watcher reset | br/pkg/streamhelper/client.go | New mutex-protected getWatcher/resetWatcher methods safely swap the underlying etcd watcher. |
| Retryable metadata request helper | br/pkg/streamhelper/advancer_cliext.go | New runMetadataRequestWithRetry and isRetryableMetadataRequestError retry gRPC requests based on error codes and per-attempt timeouts. |
| Global checkpoint get/upload with retry | br/pkg/streamhelper/advancer_cliext.go, export_test.go, integration_test.go | Checkpoint get/put now use the retry helper, revision derivation changed to response header revision, redacted logging added; new integration tests cover compaction survival and timeout retries. |
| Watch creation timeout state machine | br/pkg/streamhelper/advancer_cliext.go | waitCheckpointEvent reworked with an atomic state machine and retry loop for watch creation; startListen and watch-progress calls use the synchronized watcher. |

Estimated code review effort: 4 (Complex) | ~60 minutes
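
A mutex-protected watcher swap of the kind summarised above might look roughly like this. The sketch is not the PR's implementation: the `watcher` interface, `fakeWatcher`, and the `client` struct are hypothetical stand-ins, and the method bodies only show one plausible way to guard the field.

```go
package main

import (
	"fmt"
	"sync"
)

// watcher is a hypothetical minimal interface; the real code wraps an etcd watcher.
type watcher interface {
	Close() error
}

// fakeWatcher records whether it has been closed.
type fakeWatcher struct {
	id     int
	closed bool
}

func (w *fakeWatcher) Close() error { w.closed = true; return nil }

type client struct {
	mu sync.Mutex
	w  watcher
}

// getWatcher returns the current watcher under the lock.
func (c *client) getWatcher() watcher {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.w
}

// resetWatcher installs next and closes the previous watcher outside the lock,
// so a slow Close does not block readers.
func (c *client) resetWatcher(next watcher) error {
	c.mu.Lock()
	old := c.w
	c.w = next
	c.mu.Unlock()
	if old != nil {
		return old.Close()
	}
	return nil
}

// demoReset swaps the watcher while several goroutines read it.
func demoReset() (oldClosed bool, currentID int) {
	first := &fakeWatcher{id: 1}
	c := &client{w: first}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = c.getWatcher() // readers always see either the old or the new watcher
		}()
	}
	_ = c.resetWatcher(&fakeWatcher{id: 2})
	wg.Wait()
	return first.closed, c.getWatcher().(*fakeWatcher).id
}

func main() {
	oldClosed, id := demoReset()
	fmt.Println(oldClosed, id)
}
```

Callers that fetched the old watcher just before the swap may still hold it after it is closed, so code built on this pattern has to tolerate errors from a closed watcher and fetch the current one again.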

Possibly related PRs

  • pingcap/tidb#68020: Modifies the same br/pkg/restore/log_client/client.go compacted SST restore flow, including RestoreSSTFileSets and checkpoint-metadata initialization.
  • pingcap/tidb#69047: Modifies the same etcd metadata watch/progress handling in br/pkg/streamhelper/advancer_cliext.go.
  • pingcap/tidb#69498: Modifies the same stream-helper checkpoint/watch logic (startListen, getGlobalCheckpointWithRevision) and related export test helper.

Suggested labels: ok-to-test

Suggested reviewers: YuJuncen, RidRisR

Poem

A rabbit tunes the gears with care,
Fewer threads, but more to spare 🐇
Checkpoints hold what jobs once knew,
While watchers reset and retry anew.
Compaction bytes now kept in line—
Hop on, restore, everything's fine!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title is concise and accurately reflects the main restore concurrency and TiKV config changes. |
| Linked Issues check | ✅ Passed | The PR satisfies #69589 by reducing restore concurrency, improving etcd retry robustness, and adjusting TiKV restore-time configs. |
| Out of Scope Changes check | ✅ Passed | No clear unrelated changes stand out; the added tests, helpers, and Bazel update support the restore and TiKV config work. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)


@ti-chi-bot

ti-chi-bot Bot commented Jul 2, 2026

@Leavrth: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-error-log-review | 618cbfe | link | false | /test pull-error-log-review |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (4)
br/pkg/restore/log_client/client.go (1)

598-601: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Document the 7186 sizing rationale.

The comment explains the goal, but not why 7186 is the safe per-store pool size. Please add the derivation, benchmark basis, or issue reference so future tuning doesn’t regress restore behavior. As per coding guidelines, comments should explain non-obvious constraints and performance trade-offs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@br/pkg/restore/log_client/client.go` around lines 598 - 601, The
`sstRestoreWorkerPoolSizePerStore` constant in `client.go` uses the magic value
`7186` without explaining why it is safe, so add a comment that documents the
sizing rationale. Update the nearby restore pool sizing logic to include the
derivation, benchmark basis, or issue reference that justifies `7186`, so future
changes to the pool size can be tuned without regressing restore behavior.

Source: Coding guidelines

br/pkg/restore/log_client/flow_control.go (1)

234-241: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Annotate the config lookup failure.

Line 241 returns the raw SQL error, so restore failures won’t show which TiKV config lookup failed. Add the queried config name for actionable diagnostics. As per coding guidelines, “Go code: Keep error handling actionable and contextual; avoid silently swallowing errors.”

Suggested diff
 	if errSQL != nil {
-		return nil, errSQL
+		return nil, errors.Annotatef(errSQL, "failed to query TiKV config %q", name)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@br/pkg/restore/log_client/flow_control.go` around lines 234 - 241, The TiKV
config lookup in this ExecRestrictedSQL path returns the raw SQL error without
context, so restore failures can’t tell which config name failed. Update the
error handling in this flow_control.go lookup to wrap or annotate errSQL with
the queried name variable before returning it, so the failure from this config
query is actionable. Use the existing ExecRestrictedSQL call and the surrounding
restore/log client flow as the place to add the contextual message.

Source: Coding guidelines
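
The effect of annotating the lookup error can be shown with the standard library alone. The suggested diff uses errors.Annotatef; the sketch below swaps in `fmt.Errorf` with `%w` purely so that it runs without extra dependencies, and `queryConfig`, `lookupTiKVConfig`, and the simulated failure are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuery simulates the raw error a SQL layer might return.
var errQuery = errors.New("connection refused")

// queryConfig is a hypothetical lookup that always fails.
func queryConfig(name string) (string, error) {
	return "", errQuery
}

// lookupTiKVConfig wraps the raw error with the name of the queried config.
func lookupTiKVConfig(name string) (string, error) {
	v, err := queryConfig(name)
	if err != nil {
		return "", fmt.Errorf("failed to query TiKV config %q: %w", name, err)
	}
	return v, nil
}

// demoWrap returns the wrapped message and whether the cause is still matchable.
func demoWrap() (string, bool) {
	_, err := lookupTiKVConfig("rocksdb.max-background-jobs")
	return err.Error(), errors.Is(err, errQuery)
}

func main() {
	msg, same := demoWrap()
	fmt.Println(msg)
	fmt.Println(same)
}
```

The wrapped message names the config that failed, while `errors.Is` still matches the underlying cause.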

br/pkg/task/restore.go (1)

328-329: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Missing tags for consistency with sibling unexported fields.

Other unexported fields in this struct (e.g. tiflashRecorder, tableMappingManager) carry explicit json:"-" toml:"-" tags even though they're unexported. Consider adding the same tags to snapshotRestoreDataSize for consistency with the surrounding style.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@br/pkg/task/restore.go` around lines 328 - 329, The unexported field
snapshotRestoreDataSize should follow the same struct-tag convention as nearby
fields like tiflashRecorder and tableMappingManager. Update the restore task
struct to add explicit json:"-" toml:"-" tags to snapshotRestoreDataSize so it
is treated consistently with the other internal-only fields.
br/pkg/checkpoint/checkpoint_test.go (1)

109-129: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Missing round-trip test coverage for SnapshotRestoreDataSize.

This test was updated to cover the new RocksDBMaxBackgroundJobs field (set + assert save/load + assert via GetCheckpointTaskInfo), but the sibling new field SnapshotRestoreDataSize added in br/pkg/checkpoint/log_restore.go isn't set or asserted here.

♻️ Proposed test additions
 	checkpointMetaForLogRestore := &checkpoint.CheckpointMetadataForLogRestore{
 		UpstreamClusterID:        123,
 		RestoredTS:               222,
 		StartTS:                  111,
 		RewriteTS:                333,
 		GcRatio:                  "1.0",
 		RocksDBMaxBackgroundJobs: "8",
+		SnapshotRestoreDataSize:  456,
 		TiFlashItems:             map[int64]model.TiFlashReplicaInfo{1: {Count: 1}},
 	}
 	require.Equal(t, checkpointMetaForLogRestore.RocksDBMaxBackgroundJobs, checkpointMetaForLogRestore2.RocksDBMaxBackgroundJobs)
+	require.Equal(t, checkpointMetaForLogRestore.SnapshotRestoreDataSize, checkpointMetaForLogRestore2.SnapshotRestoreDataSize)
 	require.Equal(t, checkpointMetaForLogRestore.TiFlashItems, checkpointMetaForLogRestore2.TiFlashItems)

Also applies to: 142-150

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@br/pkg/checkpoint/checkpoint_test.go` around lines 109 - 129, Add round-trip
coverage for the new SnapshotRestoreDataSize field in the checkpoint metadata
test: in the checkpointMetaForLogRestore setup and the save/load assertions
around LogMetaManager.SaveCheckpointMetadata, set SnapshotRestoreDataSize on
CheckpointMetadataForLogRestore and verify it is preserved after
LoadCheckpointMetadata. Also update the related GetCheckpointTaskInfo assertions
in the same test block so the new field is validated alongside
RocksDBMaxBackgroundJobs and the other restored metadata fields.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6ddf7b31-f24b-4f91-88c4-3add797582bb

📥 Commits

Reviewing files that changed from the base of the PR and between 5663a07 and 618cbfe.

📒 Files selected for processing (20)
  • br/pkg/checkpoint/checkpoint_test.go
  • br/pkg/checkpoint/log_restore.go
  • br/pkg/conn/conn.go
  • br/pkg/conn/conn_test.go
  • br/pkg/restore/log_client/BUILD.bazel
  • br/pkg/restore/log_client/client.go
  • br/pkg/restore/log_client/client_test.go
  • br/pkg/restore/log_client/export_test.go
  • br/pkg/restore/log_client/flow_control.go
  • br/pkg/restore/snap_client/client.go
  • br/pkg/streamhelper/advancer_cliext.go
  • br/pkg/streamhelper/client.go
  • br/pkg/streamhelper/export_test.go
  • br/pkg/streamhelper/integration_test.go
  • br/pkg/task/restore.go
  • br/pkg/task/stream.go
  • br/pkg/task/stream_test.go
  • br/pkg/utils/BUILD.bazel
  • br/pkg/utils/db.go
  • br/pkg/utils/db_test.go

Comment on lines +756 to +762
	if meta.RocksDBMaxBackgroundJobs != "" {
		rocksDBMaxBackgroundJobs = meta.RocksDBMaxBackgroundJobs
	}
	log.Info("reuse TiKV config from checkpoint metadata",
		zap.String("gc-ratio", meta.GcRatio),
		zap.String("rocksdb-max-background-jobs", rocksDBMaxBackgroundJobs))
	return meta.GcRatio, rocksDBMaxBackgroundJobs, meta.SnapshotRestoreDataSize, nil

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Preserve the computed snapshot size for older checkpoints.

Line 762 returns meta.SnapshotRestoreDataSize unconditionally. Checkpoints created before this field existed deserialize it as 0, unlike RocksDBMaxBackgroundJobs where the caller value is kept when metadata is missing; that feeds a zero snapshot size into the compacted-SST flow-control estimate on resume.

Suggested diff
 		if meta.RocksDBMaxBackgroundJobs != "" {
 			rocksDBMaxBackgroundJobs = meta.RocksDBMaxBackgroundJobs
 		}
+		if meta.SnapshotRestoreDataSize != 0 {
+			snapshotRestoreDataSize = meta.SnapshotRestoreDataSize
+		}
 		log.Info("reuse TiKV config from checkpoint metadata",
 			zap.String("gc-ratio", meta.GcRatio),
 			zap.String("rocksdb-max-background-jobs", rocksDBMaxBackgroundJobs))
-		return meta.GcRatio, rocksDBMaxBackgroundJobs, meta.SnapshotRestoreDataSize, nil
+		return meta.GcRatio, rocksDBMaxBackgroundJobs, snapshotRestoreDataSize, nil
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@br/pkg/restore/log_client/client.go` around lines 756 - 762, The return path
in the checkpoint metadata reuse logic is unconditionally using
meta.SnapshotRestoreDataSize, which causes older checkpoints to resume with a
zero snapshot size. Update the restore config path in client.go’s metadata
handling to preserve the caller-computed snapshot size when the metadata field
is unset or zero, mirroring the fallback behavior already used for
RocksDBMaxBackgroundJobs, and keep the logic centered around the existing reuse
TiKV config block.
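
Why an older checkpoint yields a zero size can be reproduced in isolation. The sketch below is a simplified model, not the repository's types: `checkpointMeta` keeps only two fields, the JSON keys and the `uint64` type are assumptions, and `resolveSnapshotSize` is a hypothetical helper that applies the fallback the review asks for.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpointMeta is a trimmed, hypothetical version of the checkpoint metadata.
// Only the two fields discussed here are kept, and the JSON keys are made up.
type checkpointMeta struct {
	RocksDBMaxBackgroundJobs string `json:"rocksdb-max-background-jobs"`
	SnapshotRestoreDataSize  uint64 `json:"snapshot-restore-data-size"`
}

// resolveSnapshotSize prefers the stored size and falls back to the
// caller-computed one when the stored field is absent (decoded as 0).
func resolveSnapshotSize(raw []byte, computed uint64) (uint64, error) {
	var meta checkpointMeta
	if err := json.Unmarshal(raw, &meta); err != nil {
		return 0, err
	}
	if meta.SnapshotRestoreDataSize != 0 {
		return meta.SnapshotRestoreDataSize, nil
	}
	return computed, nil
}

func main() {
	// A checkpoint written before the field existed simply lacks the key.
	oldCheckpoint := []byte(`{"rocksdb-max-background-jobs":"8"}`)
	newCheckpoint := []byte(`{"rocksdb-max-background-jobs":"8","snapshot-restore-data-size":456}`)

	fromOld, _ := resolveSnapshotSize(oldCheckpoint, 1024)
	fromNew, _ := resolveSnapshotSize(newCheckpoint, 1024)
	fmt.Println(fromOld, fromNew)
}
```

One limitation of treating 0 as "unset" is that a genuinely empty snapshot cannot be told apart from a missing field; whether that matters depends on how the flow-control estimate uses the size.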

Comment on lines +570 to +589
	_, err = runMetadataRequestWithRetry(ctx,
		"failed to upload global checkpoint to metadata store",
		[]zap.Field{
			zap.String("key", redactedKey),
			zap.String("task", taskName),
			zap.Uint64("checkpoint", checkpoint),
		},
		func(requestCtx context.Context) (struct{}, error) {
			failpoint.Inject("advancer_upload_global_checkpoint_request_timeout", func() {
				failpoint.Return(struct{}{}, context.DeadlineExceeded)
			})
			_, err = t.KV.Put(requestCtx, key, value)
			if err == nil {
				failpoint.Inject("advancer_upload_global_checkpoint_commit_timeout", func() {
					err = context.DeadlineExceeded
				})
			}
			return struct{}{}, err
		},
	)

🗄️ Data Integrity & Integration | 🟠 Major | 🏗️ Heavy lift

Make the checkpoint upload retry idempotent.

Line 581 retries an unconditional KV.Put after the commit-timeout path at Lines 583-585, where the first Put may already have succeeded. If another advancer writes a higher checkpoint before a retry, this retry can overwrite it with the older checkpoint value despite the monotonic guard at Lines 559-567. Use an etcd transaction/CAS or re-read-on-conflict flow so retries never regress the stored checkpoint.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@br/pkg/streamhelper/advancer_cliext.go` around lines 570 - 589, The
checkpoint upload path is not idempotent because
advancer_upload_global_checkpoint retries an unconditional KV.Put in
runMetadataRequestWithRetry, which can overwrite a newer checkpoint with an
older one after a commit timeout. Update the upload flow in advancer_cliext.go
to use a compare-and-swap/etcd transaction or a re-read-and-retry-on-conflict
approach inside the request callback so the stored checkpoint only moves forward
and never regresses. Keep the monotonic guard in sync with the retry logic
around t.KV.Put, checkpoint, and redactedKey.
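
The compare-and-swap approach named in this comment can be sketched without etcd. Everything below is a stand-in: `casStore` is an in-memory type that only loosely imitates a revision-guarded transaction, and `advanceCheckpoint` is a hypothetical function, not the advancer's upload path. The sketch shows why a guarded write makes the retry harmless: a repeated upload of an older value becomes a no-op once a newer value is stored.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// casStore is an in-memory stand-in for a key-value store that supports
// compare-and-swap on a revision, loosely modeled on an etcd transaction.
type casStore struct {
	mu       sync.Mutex
	value    uint64
	revision int64
}

// get returns the stored value together with its revision.
func (s *casStore) get() (uint64, int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.value, s.revision
}

// putIfRevision writes value only if the revision is still the expected one.
func (s *casStore) putIfRevision(expected int64, value uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.revision != expected {
		return false
	}
	s.value = value
	s.revision++
	return true
}

var errTooManyConflicts = errors.New("too many conflicts")

// advanceCheckpoint moves the stored checkpoint forward to at least target.
// Calling it again after an ambiguous failure is safe: the repeat either finds
// the value already at or past target, or retries the guarded write.
func advanceCheckpoint(s *casStore, target uint64, maxTries int) error {
	for i := 0; i < maxTries; i++ {
		current, rev := s.get()
		if current >= target {
			return nil // never move the checkpoint backwards
		}
		if s.putIfRevision(rev, target) {
			return nil
		}
	}
	return errTooManyConflicts
}

// demoAdvance replays the scenario from the review comment.
func demoAdvance() uint64 {
	s := &casStore{}
	_ = advanceCheckpoint(s, 100, 3) // first upload; pretend the reply timed out
	_ = advanceCheckpoint(s, 200, 3) // another advancer stores a newer checkpoint
	_ = advanceCheckpoint(s, 100, 3) // the retry of the first upload changes nothing
	v, _ := s.get()
	return v
}

func main() {
	fmt.Println(demoAdvance())
}
```

With a real etcd client the same guard would be expressed as a transaction comparing the key's revision or value before the put; the exact form is left to the PR.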

@codecov

codecov Bot commented Jul 2, 2026

Codecov Report

❌ Patch coverage is 35.93005% with 403 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.6362%. Comparing base (693e52c) to head (618cbfe).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #69590        +/-   ##
================================================
- Coverage   76.3268%   75.6362%   -0.6907%     
================================================
  Files          2041       2066        +25     
  Lines        561003     575568     +14565     
================================================
+ Hits         428196     435338      +7142     
- Misses       131906     138797      +6891     
- Partials        901       1433       +532     
Flag Coverage Δ
integration 44.3915% <35.9300%> (+4.7635%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 60.4471% <ø> (ø)
parser ∅ <ø> (∅)
br 63.5331% <35.9300%> (+0.7820%) ⬆️

Labels

release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhance pitr recovery parameters and tikv configuration settings, as well as the robustness of advancer and pd etcd.

1 participant