feat(metadata): strengthen service-app mapping consistency, retry and…#3373
467a8d8 to 1ad53bd
… dedup (apache#3354)

Make interface-to-app mapping registration safe under concurrent providers and give it a proper retry policy.

- Optimistic concurrency across all backends so concurrent appends no longer clobber each other: etcd (GetValAndRev + UpdateWithRev), zookeeper (versioned SetContent), nacos (CasMd5). Each backend wraps its native conflict (ErrCompareFail / ErrBadVersion / ErrNodeExists / nacos publish failure) into the shared report.ErrMappingCASConflict sentinel via %w.
- Graded retry: registerWithRetry retries only CAS conflicts (errors.Is) with exponential backoff + jitter, and returns permanent errors immediately instead of burning the whole retry budget.
- Extract shared logic: report.MergeServiceAppMapping (whole-element dedup, fixing the strings.Contains substring false positive and the leading-comma bug on empty values) and report.DecodeServiceAppNames (skips empty elements).
- Listener cleanup: zookeeper removal via CacheListener.RemoveKeyListeners; etcd documents the listener as unsupported instead of silently succeeding.
- Tests: helper unit tests plus a concurrency test that reproduces the lost-update bug and proves CAS preserves every writer (200 writers / 20 readers, passes under -race).

Known nacos-only limitation (documented in code): CasMd5 is an optimistic UPDATE and cannot guard the first INSERT, so the initial concurrent registration of a brand-new interface can still race. etcd and zookeeper are not affected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1ad53bd to 56e0b44
Please check first that this does not duplicate #3371.
…#3354)

- nacos: stop swallowing the getConfig read error. On a failed read the old value was treated as empty, so registration would publish only the current app and overwrite an existing set (e.g. appA,appB -> appC). Return the error instead so an existing mapping is never clobbered. A genuinely absent config still returns ("", nil) and takes the first-write path.
- zookeeper: CacheListener.DataChange now builds the set via report.DecodeServiceAppNames, so mapping change events no longer surface empty app names from legacy/malformed comma-separated values (",app", "app,,other"). Added a listener test covering this.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…g registration

The previous commit returned any getConfig error from RegisterServiceAppMapping. Nacos signals a never-written key with a "config data not exist" error (not an empty value), so the first registration of a fresh interface failed and the provider panicked on service export (broke the registry/nacos integration test).

Only treat genuine read failures (network/auth/server) as errors; the not-found signal is handled as an empty old value so the first write can create the key. Detection mirrors config_center/nacos's isConfigNotExistErr.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Please update to the latest develop branch to fix the CI failure.
…g-consistency-cas
SonarCloud flagged math/rand as an insecure PRNG. The jitter only spread out contending writers and is not worth a crypto/rand dependency, so use plain exponential backoff instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The MD5 in nacos report is the checksum mandated by the Nacos CAS wire protocol (PublishConfig forwards CasMd5 for the server to compare), not a security hash, so the algorithm is not ours to change. Mark it NOSONAR with an explanation to clear the quality-gate security hotspot.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3373      +/-   ##
===========================================
+ Coverage    46.76%   52.55%   +5.78%
===========================================
  Files          295      493     +198
  Lines        17172    37908   +20736
===========================================
+ Hits          8031    19922   +11891
- Misses        8287    16379    +8092
- Partials       854     1607     +753
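This PR and #3371 both modify metadata/report/nacos, etcd, and zookeeper. Before the final merge, make sure to rebase onto the latest develop and confirm that both the new MetadataReport interface and the mapping CAS changes are still present.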
Pull request overview
This PR hardens application-level service→app mapping (interface → comma-separated app-name set) across metadata backends by preventing lost updates under concurrent writers, improving dedup/decoding, and introducing a conflict-aware retry policy in the mapping registration flow.
Changes:
- Introduces shared helpers (MergeServiceAppMapping, DecodeServiceAppNames) and a shared CAS-conflict sentinel (ErrMappingCASConflict) to unify dedup/parse behavior and enable conflict-only retries.
- Upgrades etcd/zookeeper/nacos mapping writes to use optimistic concurrency (rev/version/CAS-md5) and surfaces conflicts consistently for retry.
- Improves listener handling (ZK listener removal implemented; etcd explicitly warns listener unsupported) and adds targeted unit + concurrency tests.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| metadata/report/zookeeper/report.go | Adds CAS-aware create/update for mapping writes; uses shared merge/decode; implements listener removal via cache listener. |
| metadata/report/zookeeper/report_test.go | Updates merge tests to cover empty/substring regressions via shared helper. |
| metadata/report/zookeeper/listener.go | Uses shared decoder and adds RemoveKeyListeners to stop dispatching events for a key. |
| metadata/report/zookeeper/listener_test.go | Extends listener tests to assert empty app names are filtered in events. |
| metadata/report/nacos/report.go | Adds merge/decode, handles “config not exist”, and adds CAS-md5 optimistic update semantics for mapping writes. |
| metadata/report/etcd/report.go | Uses Get+rev and CAS update/create to avoid lost updates; documents listener unsupported via warning/no-op remove. |
| metadata/report/mapping.go | New shared CAS-conflict sentinel plus merge/decode helpers for consistent behavior across backends. |
| metadata/report/mapping_test.go | Unit tests for merge/decode helpers including regression cases. |
| metadata/mapping/metadata/service_name_mapping.go | Replaces unconditional retry loop with conflict-only retry + exponential backoff. |
| metadata/mapping/metadata/service_name_mapping_test.go | Updates tests to assert non-conflict errors don’t retry; conflicts retry up to budget. |
| metadata/mapping/metadata/service_name_mapping_concurrency_test.go | Adds concurrency test reproducing lost-update and validating CAS + retry preserves all writers. |
		}
	}
	if err != nil {
	if err := registerWithRetry(metadataReport, serviceInterface, DefaultGroup, appName); err != nil {

import (
	"fmt"
	"sync"
	"testing"
)

		default:
			set, err := r.GetServiceAppMapping("Iface", DefaultGroup, nil)
			assert.NoError(t, err)
			assert.GreaterOrEqual(t, set.Size(), prev)
			assert.False(t, set.Contains(""))
			prev = set.Size()
		}



Description
Fixes #3354
What this PR does
Fixes #3354. Hardens application-level service-app mapping (interface -> app names) registration so it is correct under concurrent providers, and gives it a proper retry policy.
Improves write consistency, retry, and deduplication for the application-level service-app mapping.
Background
The mapping value is a comma-separated set of application names stored under a single
interface key, shared by all providers of that interface. Registration is therefore a
read-modify-write, and the previous implementations had several reliability gaps.
Changes
1. Optimistic concurrency across all backends (no more lost updates)
Concurrent appends no longer clobber each other:
- etcd: Get + Put → GetValAndRev + UpdateWithRev (CAS on ModRevision), Create for first write.
- zookeeper: versioned SetContent, now surfaces version conflicts instead of swallowing them.
- nacos: CasMd5 optimistic lock.
Each backend wraps its native conflict (ErrCompareFail / ErrBadVersion / ErrNodeExists / nacos publish failure) into a shared report.ErrMappingCASConflict sentinel via %w.
2. Graded retry (was: fixed loop, no backoff)
registerWithRetry retries only CAS conflicts (errors.Is) with exponential backoff + jitter, and returns permanent errors (network/auth) immediately instead of burning the whole retry budget.
Previously, any error was retried 10 times in a tight loop with no sleep; retries are now graded by error type.
3. Extract shared logic + fix two hidden bugs
- report.MergeServiceAppMapping: whole-element dedup. Fixes the strings.Contains substring false positive (registering order was wrongly treated as present when order-service existed) and the leading-comma bug ("" + "," + app → ",app").
- report.DecodeServiceAppNames: parse into a set, skipping empty elements.
4. Listener cleanup
- zookeeper: listener removal now goes through CacheListener.RemoveKeyListeners (was a silent return nil that leaked listeners).
- etcd: the listener is documented as unsupported instead of silently succeeding.
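For the shared helpers in item 3, a whole-element merge and an empty-skipping decode might look roughly like this. `mergeApp` and `decodeAppNames` are illustrative names with assumed signatures; the real report.MergeServiceAppMapping and report.DecodeServiceAppNames may differ (for example, the latter is described as returning a set rather than a slice).

```go
package main

import (
	"fmt"
	"strings"
)

// decodeAppNames splits a comma-separated value into distinct names,
// dropping empty elements such as those in ",app" or "app,,other".
func decodeAppNames(value string) []string {
	var names []string
	seen := make(map[string]struct{})
	for _, name := range strings.Split(value, ",") {
		if name == "" {
			continue
		}
		if _, dup := seen[name]; dup {
			continue
		}
		seen[name] = struct{}{}
		names = append(names, name)
	}
	return names
}

// mergeApp compares whole elements, so "order" is not mistaken for a member
// just because "order-service" contains it as a substring. It returns the
// merged value and whether app had to be added.
func mergeApp(old, app string) (string, bool) {
	names := decodeAppNames(old)
	for _, n := range names {
		if n == app {
			return strings.Join(names, ","), false
		}
	}
	return strings.Join(append(names, app), ","), true
}

func main() {
	fmt.Println(mergeApp("order-service", "order")) // order-service,order true
	fmt.Println(mergeApp("", "app"))                // app true
	fmt.Println(mergeApp("app,,other", "app"))      // app,other false
}
```

Joining from the decoded slice rather than concatenating strings is what removes the leading comma on an empty old value.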
5. Tests
- Helper unit tests, plus a concurrency test that reproduces the lost-update bug and proves CAS preserves every writer (200 writers / 20 concurrent readers). Passes under -race.
Known limitation (documented in code)
Nacos CasMd5 is an optimistic UPDATE and cannot guard the first INSERT (Nacos has no create-if-absent primitive), so the initial concurrent registration of a brand-new interface can still race. etcd and zookeeper are not affected. Left as a documented limitation; can be revisited if Nacos exposes a SETNX-style primitive.
Test
go test -race ./metadata/report/... ./metadata/mapping/...
Checklist
develop