Skip to content

puller(ticdc): cap resolve lock target ts by PD tso#12741

Merged
ti-chi-bot[bot] merged 2 commits into
pingcap:masterfrom
tenfyzhong:fix/12740-cap-resolve-lock-target-ts
Jun 24, 2026
Merged

puller(ticdc): cap resolve lock target ts by PD tso#12741
ti-chi-bot[bot] merged 2 commits into
pingcap:masterfrom
tenfyzhong:fix/12740-cap-resolve-lock-target-ts

Conversation

@tenfyzhong

Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: close #12740

TiCDC stale-lock resolving can derive the ScanLock target timestamp from local clock based resolved-time calculations. If the local CDC clock is ahead of PD time, the target timestamp can exceed the latest PD TSO, which may advance TiKV local MaxTS and make async-commit residual locks fail to resolve with commit_ts_expired.

This ports the same fix as pingcap/ticdc#5419 to TiFlow.

What is changed and how it works?

This PR changes the TiCDC multiplexing puller resolve-lock checker to fetch the current TSO from PD before scanning stale locks and cap the stale-lock target timestamp by that PD currentTs.

The freshness fence for resolvedTsUpdated still uses the local clock because resolvedTsUpdated is written with time.Now(). The PD-derived timestamp is used only as the upper bound for the ScanLock target timestamp.

It also adds unit coverage for clock skew, target timestamp capping, and the existing fence conditions.

Check List

Tests

  • Unit test
go test ./cdc/puller -run TestGetResolveLockTargetTs -count=1
go test -tags=intest ./cdc/puller -count=1
go test ./cdc/kv -run TestResolveLock -count=1

Questions

Will it cause performance regression or break compatibility?

No compatibility impact is expected. The resolve-lock checker now performs one PD GetTS call per resolve-lock tick before scheduling stale-lock scans.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

Fix a TiCDC stale-lock resolving issue where the ScanLock target timestamp could advance TiKV local MaxTS beyond the latest PD TSO and make async-commit residual locks fail to resolve.

Fetch the current PD TSO before scheduling stale-lock resolution and cap the ScanLock target timestamp by that value. This prevents local clock skew from producing a target timestamp ahead of PD time.

Add unit coverage for the cap and freshness fence conditions.

Close pingcap#12740

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
@ti-chi-bot ti-chi-bot Bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 24, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to cap the target timestamp for resolving locks using the current PD TSO, preventing potential issues with uncapped timestamps. It adds a GetTS method to SharedClient, updates resolveLock logic in MultiplexingPuller to use this timestamp, and includes corresponding unit tests. The reviewer suggests handling context cancellation gracefully in runResolveLockChecker to avoid spammy warning logs when the context is canceled, and adding namespace and changefeed fields to the warning log for better observability.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +499 to +503
physical, logical, err := p.client.GetTS(ctx)
if err != nil {
log.Warn("get ts from pd failed", zap.Error(err))
continue
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When the context is canceled (e.g., during changefeed shutdown or pause), p.client.GetTS(ctx) will fail with a context cancellation error. Logging this as a warning can be misleading and spammy. We should check ctx.Err() and return early if the context is canceled.

Additionally, to improve observability and make troubleshooting easier in production environments with multiple changefeeds, we should include the namespace and changefeed fields in the warning log.

		physical, logical, err := p.client.GetTS(ctx)
		if err != nil {
			if ctx.Err() != nil {
				return ctx.Err()
			}
			log.Warn("get ts from pd failed",
				zap.String("namespace", p.changefeed.Namespace),
				zap.String("changefeed", p.changefeed.ID),
				zap.Error(err))
			continue
		}

@tenfyzhong

Copy link
Copy Markdown
Contributor Author

close tikv/tikv#19755

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jun 24, 2026
@ti-chi-bot ti-chi-bot Bot added the lgtm label Jun 24, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asddongmen, hongyunyan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [asddongmen,hongyunyan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jun 24, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

[LGTM Timeline notifier]

Timeline:

  • 2026-06-24 09:13:37.105015484 +0000 UTC m=+111700.768240987: ☑️ agreed by asddongmen.
  • 2026-06-24 09:18:32.023976086 +0000 UTC m=+111995.687201599: ☑️ agreed by hongyunyan.

@tenfyzhong tenfyzhong added affects-7.5 This bug affects the 7.5.x(LTS) versions. and removed affects-7.5 This bug affects the 7.5.x(LTS) versions. labels Jun 24, 2026
…patibility

- TiDB classic rejects starter-only FULLTEXT indexes, so the index is commented out.
- This ensures the test DDL can be executed on TiDB without errors.

Signed-off-by: tenfyzhong <tenfy@tenfy.cn>
@tenfyzhong

Copy link
Copy Markdown
Contributor Author

/retest-required

@ti-chi-bot ti-chi-bot Bot merged commit 42ad8a6 into pingcap:master Jun 24, 2026
26 checks passed
@asddongmen asddongmen added needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. labels Jun 25, 2026
@ti-chi-bot

Copy link
Copy Markdown
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #12742.
But this PR has conflicts, please resolve them!

@ti-chi-bot

Copy link
Copy Markdown
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #12743.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved lgtm needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] TiCDC ScanLock can advance MaxTS and break async commit lock resolution

4 participants