Skip to content

feat(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios#3369

Merged
AlexStocks merged 19 commits into
apache:developfrom
XnLemon:clarify-metadata-report-selection
Jun 10, 2026
Merged

feat(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios#3369
AlexStocks merged 19 commits into
apache:developfrom
XnLemon:clarify-metadata-report-selection

Conversation

@XnLemon

@XnLemon XnLemon commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Description

Fixes #3352


Background / 背景

The #2534 refactor centralized metadata logic and settled on an application-level metadata path where MetadataReport handles app-level MetadataInfo and service-app mapping. That refactor was correct in intent, but it left the selection semantics of MetadataReport under-specified for multi-registry and multi-instance deployments.

Concretely, when a process connects to more than one registry (e.g. Nacos + Zookeeper), each registry gets its own MetadataReport instance, keyed by registryId in the instances map. The pre-existing code had four independent correctness gaps in how it chose which report to use.

#2534 重构将 metadata 逻辑集中化,以应用级 MetadataInfo 和 service-app mapping 为核心路径。重构方向正确,但在多注册中心和多实例部署场景下,MetadataReport 的选择语义存在空白

具体来说,当进程同时连接多个注册中心(如 Nacos + Zookeeper)时,每个注册中心会在 instances map 中保有独立的 MetadataReport 实例。原有代码在选择使用哪个 report 时存在以下四个独立的正确性问题。


Problems Fixed / 修复的问题

1. GetMetadataReport() returned a non-deterministic report / 返回结果不确定

Go map iteration order is randomized. GetMetadataReport() returned whichever entry happened to come first — a different report on every call in some runs, making any downstream behavior non-reproducible.

Fix: prefer the "default" key; fall back to the lexicographically first registryId so the result is always stable.

Go map 迭代顺序随机,GetMetadataReport() 每次可能返回不同的 report。修复:优先返回 "default" 键;不存在时退回到字典序最小的 registryId,确保结果稳定。


2. GetMetadataReportByRegistry() silently fell back to the wrong registry's report / 静默返回错误注册中心的 report

When a caller passed a specific registryId that was not registered, the function silently fell back to GetMetadataReport() and returned an unrelated registry's report. In GetMetadataFromMetadataReport, this meant remote metadata could be fetched from the wrong backend with no error surfaced.

Fix: return nil + Warnf when a specific registryId is not found. The existing nil-guard in GetMetadataFromMetadataReport already surfaces this as an actionable error.

调用方传入未注册的 registryId 时,函数静默退回到其他注册中心的 report,且无任何报错。修复:找不到时返回 nil + Warnf,由调用方的 nil 检查给出明确错误。


3. ServiceNameMapping.Remove() only returned the last error / 只保留最后一个错误

Remove() fanned out to all reports but used a lastErr accumulator overwritten on each iteration. If the first report failed and the second succeeded, the error was silently discarded and the caller saw nil, leaving a stale mapping in the first registry.

Fix: collect all errors with errors.Join so the caller receives the full failure picture.

Remove() 向所有 report 扇出,但 lastErr 在每次循环中被覆盖。若第一个 report 失败、第二个成功,错误被静默丢弃。修复:使用 errors.Join 收集全部错误。


4. Revision calculation and remote metadata fetch used cross-registry data / Revision 计算和远端 metadata 拉取使用了跨注册中心数据

The revision customizers called the global GetMetadataService() which merges services across all registries, so two instances on different registries could compute the same revision even though their actual service sets differed — breaking consumer-side change detection.

The listener cache in GetMetadataInfo() was keyed by revision alone. Two registries producing the same revision string would collide, and the second registry's listener would silently receive the first registry's cached metadata.

Fix: thread registryId through createInstance(), both revision customizers, GetMetadataFromMetadataReport(), NewServiceInstancesChangedListener(), and GetMetadataInfo(). Change the cache key to registryId + ":" + revision.

Revision customizer 通过全局 GetMetadataService() 获取服务列表,合并了所有注册中心的数据,导致不同注册中心的实例可能计算出相同的 revision。监听器的 metadata 缓存仅以 revision 为键,多注册中心场景下会发生缓存污染。修复:将 registryId 贯穿传递到相关链路,缓存键改为 registryId + ":" + revision


Changes / 改动内容

File Change
metadata/report_instance.go GetMetadataReport() deterministic; GetMetadataReportByRegistry() returns nil for unknown id; add ClearMetadataReportInstances() for test isolation
metadata/client.go GetMetadataFromMetadataReport() takes registryId, routes to correct per-registry report
metadata/mapping/metadata/service_name_mapping.go Remove() collects all errors with errors.Join
registry/servicediscovery/service_instances_changed_listener_impl.go GetMetadataInfo() and NewServiceInstancesChangedListener() take registryId; cache key scoped to registryId + ":" + revision
registry/servicediscovery/service_discovery_registry.go createInstance() injects registryId into instance metadata; SubscribeURL passes registryId to listener
registry/servicediscovery/customizer/service_revision_customizer.go Read per-registry MetadataInfo; upgrade missing-registryId log to Errorf with accurate consequence description
registry/servicediscovery/customizer/metadata_service_url_params_customizer.go Restore TODO with explanation — per-registry MetadataService URL support is tracked separately

Tests Added / 新增测试

  • TestGetMetadataReportIsDeterministic — 20 iterations verify stability under Go's randomized map iteration
  • TestGetMetadataReportByRegistry / TestGetMetadataReportByRegistryFallsBackDeterministically — hit, miss, and empty-string paths
  • TestServiceNameMappingRemoveFansOutToAllReports — both reports called on success
  • TestServiceNameMappingRemoveCollectsAllErrors — both errors visible via errors.Is in joined return
  • TestServiceNameMappingRemoveContinuesAfterPartialFailure — loop does not short-circuit on first failure
  • TestExportedRevisionIsRegistryScoped / TestSubscribedRevisionIsRegistryScoped — different registries produce different revisions
  • TestExportedRevisionMissingRegistryIdYieldsZero / TestSubscribedRevisionMissingRegistryIdYieldsZero — missing RegistryIdKey yields revision "0", does not borrow another registry's data
  • TestGetMetadataFromMetadataReport (rewritten) — specific registryId routes to correct report; unknown registryId returns error, not wrong report
  • TestListenerUsesRegistryIdToFetchRemoteMetadata — end-to-end: registryId flows from NewServiceInstancesChangedListener through OnEventGetMetadataInfoGetMetadataFromMetadataReport → correct mock report called

#### TODO / 本 PR 还未做完

Multi-instance metadata service URL alignment (metadata_service_url_params_customizer.go): GetMetadataService() still returns a global singleton so all instances across registries receive the same metadata service URL.

多实例 metadata service URL 对齐:GetMetadataService() 仍返回全局单例,所有注册中心实例获得同一 metadata service URL。


Checklist

  • I confirm the target branch is develop
  • Code has passed local testing
  • I have added tests that prove my fix is effective or that my feature works

@XnLemon XnLemon changed the title Fixes apache/dubbo-go#3352: Clarify metadata report selection for multi-registry and multi-instance scenarios fix(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios Jun 4, 2026
@XnLemon XnLemon changed the title fix(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios feat(metadata): Clarify metadata report selection for multi-registry and multi-instance scenarios Jun 4, 2026
@XnLemon XnLemon marked this pull request as ready for review June 4, 2026 09:55
@codecov-commenter

codecov-commenter commented Jun 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.38554% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.96%. Comparing base (60d1c2a) to head (91b9136).
⚠️ Report is 825 commits behind head on develop.

Files with missing lines Patch % Lines
metadata/report_instance.go 87.50% 2 Missing ⚠️
...try/servicediscovery/service_discovery_registry.go 90.90% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3369      +/-   ##
===========================================
+ Coverage    46.76%   52.96%   +6.19%     
===========================================
  Files          295      493     +198     
  Lines        17172    38000   +20828     
===========================================
+ Hits          8031    20126   +12095     
- Misses        8287    16245    +7958     
- Partials       854     1629     +775     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Alanxtl

Alanxtl commented Jun 7, 2026

Copy link
Copy Markdown
Member

看下没和#3362重复吧

@Alanxtl Alanxtl left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [P1] Standalone metadata-report configs can no longer be used by named registries.
    In PR head, GetMetadataReportByRegistry returns nil for any non-empty unknown registryId. But standalone metadata-report config is still registered under default in config/metadata_config.go, while registry URLs get ids like nacos/zk. That means serviceDiscoveryRegistry.metadataReport becomes nil in service_discovery_registry.go, and remote metadata publish/fetch fails for a common config shape: named registry plus separate metadata-report. Need either preserve default as an explicit global fallback or add config/API support to bind standalone metadata-report to registry ids.

  • [P2] #3352’s mapping-read problem is still not fixed.
    ServiceNameMapping.Get still reads from GetMetadataReport() only, so in multi-report setups it now deterministically picks default or lexicographically first. That makes the old random behavior stable, but it still does not use the subscribing registry’s report, nor does it fan out consistently with Map/Remove. The issue explicitly calls out “mapping writes all reports, reads one report”; this PR only changes Remove.
    on is resolved.

Comment thread metadata/mapping/metadata/service_name_mapping.go
Comment thread metadata/report_instance.go
@XnLemon

XnLemon commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

Findings

第一个修了 目前可以正常fallback
第二个 模型统一 把get改为拿所有report然后遍历查求并集合并结果返回result 中间如果有一个挂了记录错误但不影响继续 只有全挂了才报错

@XnLemon

XnLemon commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

看下没和#3362重复吧

看过了 修改的ServiceNames.String()相关和本pr无冲突

@XnLemon XnLemon requested a review from Alanxtl June 8, 2026 02:35

@Alanxtl Alanxtl left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Alanxtl

Alanxtl commented Jun 8, 2026

Copy link
Copy Markdown
Member

pls update to latest develop branch to fix ci fail

XnLemon and others added 16 commits June 8, 2026 15:23
…gistryId instead of silently falling back to wrong registry

When a caller provides a specific non-empty registryId that has no
registered report, the previous implementation silently fell back to
GetMetadataReport() and returned an arbitrary report from a different
registry. This undermined the per-registry scoping added in the earlier
commits: metadata for registry A could be fetched from registry B's
report without any indication.

Fix: return nil (with a Warnf log) when a specific registryId is not
found. The existing nil-guard in GetMetadataFromMetadataReport (client.go)
already surfaces this as an error to the caller, which is the correct
behaviour.

The empty-string path is unchanged: it still delegates to
GetMetadataReport() because callers that pass "" have no registry
context and should receive the stable default.

Also expose ClearMetadataReportInstances() to allow cross-package test
isolation of the package-level instances map.
…of returning only the last one

The previous implementation used a lastErr variable that was overwritten
on each iteration. If the first report failed and the second succeeded,
the error was silently discarded and the caller saw no error despite a
partial failure — leaving a stale mapping in the first registry.

Fix: collect all errors with errors.Join so the caller receives the full
failure picture. The loop continues past individual failures (best-effort
fan-out) so a transient error in one registry does not prevent removal
from the others.

Add a comment explaining the intentional behavioural difference from
Map(), which is fail-fast: Map stops at the first report failure, while
Remove continues and collects all errors so every reachable registry gets
the removal attempt.
Three gaps identified in PR review, all addressed:

1. metadata/client_test.go — rewrite TestGetMetadataFromMetadataReport
   - Add sub-test: specific registryId routes to its own report (not default)
   - Add sub-test: unknown registryId returns error, not silent wrong-registry result
   - Each sub-test now resets instances independently to prevent state leakage

2. registry/servicediscovery/customizer/service_revision_customizer_test.go
   - Add TestExportedRevisionMissingRegistryIdYieldsZero: instance with no
     RegistryIdKey gets revision "0" and does not borrow another registry's
     service list
   - Add TestSubscribedRevisionMissingRegistryIdYieldsZero: same for subscribed
   - Add t.Cleanup(ClearMetadataReportInstances) + unique registry-id prefixes
     to all four tests to prevent cross-test global map pollution

3. registry/servicediscovery/service_instances_changed_listener_impl_test.go
   - Add TestListenerUsesRegistryIdToFetchRemoteMetadata: creates a listener
     with registryId="remote-reg-test", registers a mock MetadataReport under
     that id, triggers OnEvent with a RemoteMetadataStorageType instance, and
     asserts via mock.AssertExpectations that the correct per-registry report
     was called — end-to-end proof that registryId threads from
     NewServiceInstancesChangedListener through to GetMetadataFromMetadataReport
- metadata/report_instance_test.go: translate inline comments and
  testify assertion messages in TestGetMetadataReportIsDeterministic
  from Chinese to English, consistent with the rest of the file

- registry/servicediscovery/customizer/service_revision_customizer.go:
  translate Customize doc-comments for both exported and subscribed
  customizers from Chinese to English

- registry/servicediscovery/service_instances_changed_listener_impl_test.go:
  strip the UTF-8 BOM (EF BB BF) that was introduced at line 1;
  Go toolchain accepts BOMs but they are not idiomatic and cause
  spurious diffs in some editors and CI tools
… accurate consequence

The previous Warnf said 'revision will be empty', which is wrong on two
counts: the actual computed revision is "0" (not empty), and revision
"0" causes OnEvent to silently skip the instance entirely, making it
permanently invisible to all consumers.

Upgrade to Errorf and describe the actual outcome so operators can
diagnose misconfigurations immediately. Applies to both exported and
subscribed revision customizers.
…nt cross-registry metadata poisoning

The cache in GetMetadataInfo was keyed by revision string only. In a
multi-registry setup, if two registries happen to produce the same
revision value (e.g. identical service sets yielding the same CRC, or
test fixtures using literal revision strings), the second registry's
listener would get a cache hit and return MetadataInfo fetched via a
different registry's report — silently serving wrong endpoints with no
error surfaced.

Fix: use registryId+":"+revision as the cache key for both Get and Set,
so each registry's metadata is cached independently.

Update all test helpers and direct metaCache.Set/Delete calls to use
the same composite key (with constant.DefaultKey as the registryId for
all existing single-registry test listeners).
…alignment

The previous commit removed the TODO without fixing the underlying issue.
GetMetadataService() still returns a global singleton, so all instances
across registries receive the same metadata service URL regardless of
which registry they belong to.

Restore the TODO with a more detailed explanation so the gap is visible
to future contributors. The fix requires per-registry MetadataService
support and is out of scope for this PR.
…GetMetadataReportByRegistry

GetMetadataReportByRegistry intentionally falls back to the 'default'
report when a specific registryId is not found. This supports standalone
metadata-report configs registered under 'default' being used by named
registries (e.g. nacos, zk). Update tests and doc comment to reflect
this behavior instead of the old 'return nil for unknown id' contract.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@XnLemon XnLemon force-pushed the clarify-metadata-report-selection branch from bc54fc3 to 81160b7 Compare June 8, 2026 07:24
@XnLemon

XnLemon commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

pls update to latest develop branch to fix ci fail

Done! owo)b
Thank you for your guidance and review work!

@Alanxtl

Alanxtl commented Jun 9, 2026

Copy link
Copy Markdown
Member

看了一下,这个 PR 和 #3362 不重复。

合入时注意和 #3367 的 report instances lock 合并,别把并发保护覆盖掉。整体 LGTM。

@Alanxtl

Alanxtl commented Jun 9, 2026

Copy link
Copy Markdown
Member

合并顺序

#3360#3362 先合,低耦合。
#3367 作为并发安全底座。
#3369 合入,建立 registryId/report/cache 作用域。
#3370 rebase 到 #3367 + #3369 之后,特别检查 revision 计算不要绕开锁。
#3371 再合,接受 MetadataReport 接口扩展,并补确认 Snapshot() 与 #3367 锁语义一致。
#3373 最后 rebase,因为它和 #3371 同改 report backends;语义不重复,但文件冲突概率高。

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR clarifies and hardens application-level metadata behavior in multi-registry / multi-instance deployments by making metadata report selection deterministic, scoping revision/cache behavior by registry, and expanding test coverage for these scenarios.

Changes:

  • Make metadata report selection deterministic and add test-only reset for report instances.
  • Thread registryId through remote metadata fetch and Service Discovery listener cache keys to prevent cross-registry collisions.
  • Improve multi-report mapping fan-out behavior (including joined errors) and add/extend tests for multi-registry correctness.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
registry/servicediscovery/service_instances_changed_listener_impl.go Add registryId to listener and scope remote-metadata cache keys by registryId:revision.
registry/servicediscovery/service_instances_changed_listener_impl_test.go Update tests for new listener signature/cache keying; add end-to-end test to ensure correct per-registry report is used.
registry/servicediscovery/service_discovery_registry.go Pass registryId into instance creation and listener construction; propagate into instance metadata.
registry/servicediscovery/service_discovery_registry_test.go Update expected revision computation for new createInstance(..., registryId) signature.
registry/servicediscovery/customizer/service_revision_customizer.go Compute revisions using per-registry MetadataInfo rather than merged global service lists; improve logging on missing registryId.
registry/servicediscovery/customizer/service_revision_customizer_test.go New tests ensuring revisions are registry-scoped and missing-registryId yields "0".
registry/servicediscovery/customizer/metadata_service_url_params_customizer.go Adjust comment/TODO around metadata service URL params customization.
registry/nacos/service_discovery_test.go Update listener construction for new NewServiceInstancesChangedListener(app, registryId, services) signature.
metadata/report_instance.go Deterministic GetMetadataReport; adjusted GetMetadataReportByRegistry; add ClearMetadataReportInstances() for tests.
metadata/report_instance_test.go Add determinism tests and expand by-registry selection tests.
metadata/mapping/metadata/service_name_mapping.go Read mappings from all reports; remove mappings from all reports and join errors.
metadata/mapping/metadata/service_name_mapping_test.go Add tests for remove fan-out and joined error behavior across multiple reports.
metadata/client.go Make remote metadata fetch select report by registryId.
metadata/client_test.go Expand tests to validate routing to the correct report instance by registryId.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread registry/servicediscovery/service_discovery_registry.go Outdated
Comment thread metadata/mapping/metadata/service_name_mapping.go Outdated
Comment thread registry/servicediscovery/service_instances_changed_listener_impl.go Outdated
Comment thread metadata/report_instance.go
Comment thread metadata/client.go Outdated
@AlexStocks

Copy link
Copy Markdown
Contributor

@XnLemon 看下 copilot 的comment,合理的,就处理下

XnLemon added 2 commits June 10, 2026 10:22
- fix(registry): unify registryId default to constant.DefaultKey before
  GetMetadataInfo lookup in RegisterService, eliminating empty-string
  fallback mismatch and redundant GetParam calls
- fix(metadata): scope metaCache key to app:registryId:revision to
  prevent cross-registry cache collisions
- fix(metadata/mapping): bind MappingListener to deterministic primary
  report via GetMetadataReport() instead of non-deterministic i==0
- fix(metadata): include registryId in GetMetadataFromMetadataReport
  error message for easier misconfiguration diagnosis
- test: add t.Cleanup to remove AddService/AddSubscribeURL entries and
  prevent global state leakage between tests
- docs: restore TODO comment explaining GetMetadataService() singleton
  limitation in multi-registry metadata service URL alignment
- fix(test): update metaCache keys in service_instances_changed_listener_impl_test
  to match new app:registryId:revision format
- test: use t.Name()+UnixNano as regID in service_discovery_registry_test
  to prevent cross-test global state pollution without expanding production API
@sonarqubecloud

Copy link
Copy Markdown

@XnLemon

XnLemon commented Jun 10, 2026

Copy link
Copy Markdown
Contributor Author

@XnLemon 看下 copilot 的comment,合理的,就处理下

@AlexStocks Done! owo)b
Thank you for your guidance and review work!

@AlexStocks AlexStocks merged commit fc7f14f into apache:develop Jun 10, 2026
8 checks passed
jieguo-coder added a commit to jieguo-coder/dubbo-go that referenced this pull request Jun 11, 2026
…stic fallback while preserving fine-grained I/O locking

Signed-off-by: jieguo-coder <1193249232@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

6 participants