Skip to content

feat(metadata): strengthen service-app mapping consistency, retry and…#3373

Merged
AlexStocks merged 6 commits into
apache:developfrom
NeverENG:feat/3354-mapping-consistency-cas
Jun 10, 2026
Merged

feat(metadata): strengthen service-app mapping consistency, retry and…#3373
AlexStocks merged 6 commits into
apache:developfrom
NeverENG:feat/3354-mapping-consistency-cas

Conversation

@NeverENG

@NeverENG NeverENG commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes #3354

What this PR does

Fixes #3354. Hardens application-level service-app mapping (interface -> app names)
registration so it is correct under concurrent providers, and gives it a proper retry policy.
完善应用级 service-app mapping 的写入一致性、重试与去重。

Background

The mapping value is a comma-separated set of application names stored under a single
interface key, shared by all providers of that interface. Registration is therefore a
read-modify-write, and the previous implementations had several reliability gaps.

Changes

1. Optimistic concurrency across all backends (no more lost updates)

Concurrent appends no longer clobber each other:

  • etcd: Get+PutGetValAndRev + UpdateWithRev (CAS on ModRevision), Create for first write.
  • zookeeper: keeps versioned SetContent, now surfaces version conflicts instead of swallowing them.
  • nacos: adds CasMd5 optimistic lock.

Each backend wraps its native conflict (ErrCompareFail / ErrBadVersion / ErrNodeExists /
nacos publish failure) into a shared report.ErrMappingCASConflict sentinel via %w.

2. Graded retry (was: fixed loop, no backoff)

registerWithRetry retries only CAS conflicts (errors.Is) with exponential backoff + jitter,
and returns permanent errors (network/auth) immediately instead of burning the whole retry budget.
原来任何错误都空转重试 10 次且无 sleep,现在按错误类型分级重试。

3. Extract shared logic + fix two hidden bugs

  • report.MergeServiceAppMapping: whole-element dedup. Fixes the strings.Contains substring
    false positive (registering order was wrongly treated as present when order-service existed)
    and the leading-comma bug ("" + "," + app",app").
  • report.DecodeServiceAppNames: parse into a set, skipping empty elements.

4. Listener cleanup

  • zookeeper: implemented removal via CacheListener.RemoveKeyListeners (was a silent return nil
    that leaked listeners).
  • etcd: documents the mapping listener as unsupported instead of silently succeeding.

5. Tests

  • Unit tests for the merge/decode helpers (incl. the substring and empty-value regressions).
  • A concurrency test that reproduces the lost-update bug with the naive read-modify-write and
    proves CAS preserves every writer (200 writers / 20 concurrent readers). Passes under -race.

Known limitation (documented in code)

Nacos CasMd5 is an optimistic UPDATE and cannot guard the first INSERT (Nacos has no
create-if-absent primitive), so the initial concurrent registration of a brand-new interface can
still race. etcd and zookeeper are not affected. Left as a documented limitation; can be revisited
if Nacos exposes a SETNX-style primitive.

Test

go test -race ./metadata/report/... ./metadata/mapping/...

Checklist

  • I confirm the target branch is develop
  • Code has passed local testing
  • I have added tests that prove my fix is effective or that my feature works

@NeverENG NeverENG force-pushed the feat/3354-mapping-consistency-cas branch from 467a8d8 to 1ad53bd Compare June 7, 2026 04:10
… dedup (apache#3354)

Make interface-to-app mapping registration safe under concurrent providers and
give it a proper retry policy.

- Optimistic concurrency across all backends so concurrent appends no longer
  clobber each other: etcd (GetValAndRev + UpdateWithRev), zookeeper (versioned
  SetContent), nacos (CasMd5). Each backend wraps its native conflict
  (ErrCompareFail / ErrBadVersion / ErrNodeExists / nacos publish failure) into
  the shared report.ErrMappingCASConflict sentinel via %w.
- Graded retry: registerWithRetry retries only CAS conflicts (errors.Is) with
  exponential backoff + jitter, and returns permanent errors immediately
  instead of burning the whole retry budget.
- Extract shared logic: report.MergeServiceAppMapping (whole-element dedup,
  fixing the strings.Contains substring false positive and the leading-comma
  bug on empty values) and report.DecodeServiceAppNames (skips empty elements).
- Listener cleanup: zookeeper removal via CacheListener.RemoveKeyListeners;
  etcd documents the listener as unsupported instead of silently succeeding.
- Tests: helper unit tests plus a concurrency test that reproduces the
  lost-update bug and proves CAS preserves every writer (200 writers /
  20 readers, passes under -race).

Known nacos-only limitation (documented in code): CasMd5 is an optimistic
UPDATE and cannot guard the first INSERT, so the initial concurrent
registration of a brand-new interface can still race. etcd and zookeeper are
not affected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Alanxtl

Alanxtl commented Jun 7, 2026

Copy link
Copy Markdown
Member

先看一下应该没和 #3371 重复吧

Comment thread metadata/report/nacos/report.go
Comment thread metadata/report/zookeeper/listener.go
NeverENG and others added 2 commits June 7, 2026 16:39
…#3354)

- nacos: stop swallowing the getConfig read error. On a failed read the old
  value was treated as empty, so registration would publish only the current
  app and overwrite an existing set (e.g. appA,appB -> appC). Return the error
  instead so an existing mapping is never clobbered. A genuinely absent config
  still returns ("", nil) and takes the first-write path.
- zookeeper: CacheListener.DataChange now builds the set via
  report.DecodeServiceAppNames, so mapping change events no longer surface
  empty app names from legacy/malformed comma-separated values (",app",
  "app,,other"). Added a listener test covering this.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…g registration

The previous commit returned any getConfig error from RegisterServiceAppMapping.
Nacos signals a never-written key with a "config data not exist" error (not an
empty value), so the first registration of a fresh interface failed and the
provider panicked on service export (broke the registry/nacos integration test).

Only treat genuine read failures (network/auth/server) as errors; the not-found
signal is handled as an empty old value so the first write can create the key.
Detection mirrors config_center/nacos's isConfigNotExistErr.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Alanxtl

Alanxtl commented Jun 8, 2026

Copy link
Copy Markdown
Member

pls update to latest develop branch to fix ci fail
然后修一下sonarcloud的ci fail

NeverENG and others added 3 commits June 8, 2026 20:35
SonarCloud flagged math/rand as an insecure PRNG. The jitter only spread
out contending writers and is not worth a crypto/rand dependency, so use
plain exponential backoff instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The MD5 in nacos report is the checksum mandated by the Nacos CAS wire
protocol (PublishConfig forwards CasMd5 for the server to compare), not a
security hash, so the algorithm is not ours to change. Mark it NOSONAR
with an explanation to clear the quality-gate security hotspot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sonarqubecloud

sonarqubecloud Bot commented Jun 8, 2026

Copy link
Copy Markdown

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 47.05882% with 45 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.55%. Comparing base (60d1c2a) to head (7d6dbf5).
⚠️ Report is 818 commits behind head on develop.

Files with missing lines Patch % Lines
metadata/report/etcd/report.go 0.00% 17 Missing ⚠️
metadata/report/nacos/report.go 21.05% 14 Missing and 1 partial ⚠️
metadata/report/zookeeper/report.go 13.33% 13 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3373      +/-   ##
===========================================
+ Coverage    46.76%   52.55%   +5.78%     
===========================================
  Files          295      493     +198     
  Lines        17172    37908   +20736     
===========================================
+ Hits          8031    19922   +11891     
- Misses        8287    16379    +8092     
- Partials       854     1607     +753     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Alanxtl

Alanxtl commented Jun 9, 2026

Copy link
Copy Markdown
Member

这个 PR 和 #3371 同时改 metadata/report/nacos、etcd、zookeeper,最终合入前一定要 rebase 到最新 develop,确认 MetadataReport 新接口和 mapping CAS 改动都还在。

@Alanxtl Alanxtl left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

#3360#3362 先合,低耦合。
#3367 作为并发安全底座。
#3369 合入,建立 registryId/report/cache 作用域。
#3370 rebase 到 #3367 + #3369 之后,特别检查 revision 计算不要绕开锁。
#3371 再合,接受 MetadataReport 接口扩展,并补确认 Snapshot() 与 #3367 锁语义一致。
#3373 最后 rebase,因为它和 #3371 同改 report backends;语义不重复,但文件冲突概率高。

@AlexStocks AlexStocks merged commit cdf6f2f into apache:develop Jun 10, 2026
8 checks passed

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens application-level service→app mapping (interface → comma-separated app-name set) across metadata backends by preventing lost updates under concurrent writers, improving dedup/decoding, and introducing a conflict-aware retry policy in the mapping registration flow.

Changes:

  • Introduces shared helpers (MergeServiceAppMapping, DecodeServiceAppNames) and a shared CAS-conflict sentinel (ErrMappingCASConflict) to unify dedup/parse behavior and enable conflict-only retries.
  • Upgrades etcd/zookeeper/nacos mapping writes to use optimistic concurrency (rev/version/CAS-md5) and surfaces conflicts consistently for retry.
  • Improves listener handling (ZK listener removal implemented; etcd explicitly warns listener unsupported) and adds targeted unit + concurrency tests.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
metadata/report/zookeeper/report.go Adds CAS-aware create/update for mapping writes; uses shared merge/decode; implements listener removal via cache listener.
metadata/report/zookeeper/report_test.go Updates merge tests to cover empty/substring regressions via shared helper.
metadata/report/zookeeper/listener.go Uses shared decoder and adds RemoveKeyListeners to stop dispatching events for a key.
metadata/report/zookeeper/listener_test.go Extends listener tests to assert empty app names are filtered in events.
metadata/report/nacos/report.go Adds merge/decode, handles “config not exist”, and adds CAS-md5 optimistic update semantics for mapping writes.
metadata/report/etcd/report.go Uses Get+rev and CAS update/create to avoid lost updates; documents listener unsupported via warning/no-op remove.
metadata/report/mapping.go New shared CAS-conflict sentinel plus merge/decode helpers for consistent behavior across backends.
metadata/report/mapping_test.go Unit tests for merge/decode helpers including regression cases.
metadata/mapping/metadata/service_name_mapping.go Replaces unconditional retry loop with conflict-only retry + exponential backoff.
metadata/mapping/metadata/service_name_mapping_test.go Updates tests to assert non-conflict errors don’t retry; conflicts retry up to budget.
metadata/mapping/metadata/service_name_mapping_concurrency_test.go Adds concurrency test reproducing lost-update and validating CAS + retry preserves all writers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

}
}
if err != nil {
if err := registerWithRetry(metadataReport, serviceInterface, DefaultGroup, appName); err != nil {
Comment on lines +20 to +24
import (
"fmt"
"sync"
"testing"
)
Comment on lines +194 to +200
default:
set, err := r.GetServiceAppMapping("Iface", DefaultGroup, nil)
assert.NoError(t, err)
assert.GreaterOrEqual(t, set.Size(), prev)
assert.False(t, set.Contains(""))
prev = set.Size()
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Improve service-app mapping consistency, retry, and deduplication / 加强 service-app mapping 的一致性、重试与去重

5 participants