feat(metadata): add concurrency safety for application-level metadata state (#3353) #3367

Open
jieguo-coder wants to merge 17 commits into apache:develop from jieguo-coder:fix/issue-3353-metadata-concurrency

Conversation

@jieguo-coder

Description

Summary

Fixes concurrency safety issues in the application-level metadata path by adding proper synchronization protection for multiple global maps and shared states.

Background

After the #2534 metadata refactor, application-level metadata is maintained through shared state such as the local MetadataInfo, metadata report instances, and MetadataService. Several of these core states are plain maps with no synchronization, so data races, stale reads, or a fatal error: concurrent map writes crash can occur when service registration, unregistration, subscription, unsubscription, instance changes, and metadata service queries run concurrently.

Related Issue: Fixes #3353

Changes

  1. Internal Lock Protection for MetadataInfo (metadata/info/metadata_info.go)
  • Added a sync.RWMutex field to the MetadataInfo struct (with json:"-" and hessian:"-" tags to skip serialization) to protect the three internal maps: Services, exportedServiceURLs, and subscribedServiceURLs.
  • AddService / RemoveService / AddSubscribeURL / RemoveSubscribeURL now acquire the write lock.
  • GetExportedServiceURLs / GetSubscribedURLs now acquire the read lock.
  • Added a new GetServices() method that returns a snapshot of Services under a read lock for safe external iteration.
  2. Global registryMetadataInfo Lock Protection (metadata/metadata.go)
  • Added a sync.RWMutex for the global map[string]*MetadataInfo.
  • The get-or-create phase in AddService / AddSubscribeURL is now executed atomically under a write lock to prevent race conditions.
  • GetMetadataInfo is protected by a read lock.
  3. Global instances Lock Protection (metadata/report_instance.go)
  • Added a sync.RWMutex for the global map[string]MetadataReport.
  • Extracted an internal helper function getMetadataReportUnsafe to avoid deadlocks on the fallback path of GetMetadataReportByRegistry, since Go's RWMutex is not reentrant.
  4. Concurrency Safety for Listeners (registry/.../service_instances_changed_listener_impl.go)
  • Added missing mutex protection for AddListenerAndNotify and RemoveListener to prevent concurrent reads/writes on listeners and serviceUrls against OnEvent.
  5. Safe Access to Services Field
  • Replaced direct accesses to metadataInfo.Services in OnEvent and convertV2 with the new safe method GetServices().
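
The per-struct locking and snapshot getter from items 1 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: ServiceInfo and MetadataInfo here are reduced stand-ins with a single map, and the string key passed to AddService is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// ServiceInfo and MetadataInfo are simplified stand-ins, not the real dubbo-go types.
type ServiceInfo struct {
	Name string
}

type MetadataInfo struct {
	mu       sync.RWMutex // guards Services
	Services map[string]*ServiceInfo
}

// AddService mutates the map, so it takes the write lock.
func (info *MetadataInfo) AddService(key string, svc *ServiceInfo) {
	info.mu.Lock()
	defer info.mu.Unlock()
	info.Services[key] = svc
}

// GetServices returns a shallow copy taken under the read lock, so callers
// can range over the result without holding info.mu.
func (info *MetadataInfo) GetServices() map[string]*ServiceInfo {
	info.mu.RLock()
	defer info.mu.RUnlock()
	cp := make(map[string]*ServiceInfo, len(info.Services))
	for k, v := range info.Services {
		cp[k] = v
	}
	return cp
}

func main() {
	info := &MetadataInfo{Services: make(map[string]*ServiceInfo)}
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			info.AddService(fmt.Sprintf("svc-%d", i), &ServiceInfo{Name: "demo"})
			for range info.GetServices() { // iterate the snapshot, not the live map
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(len(info.GetServices())) // prints 50
}
```

The copy is shallow: the map itself is safe to iterate, but the *ServiceInfo values are still shared with the original, so callers should treat them as read-only.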

Test Plan

  • go vet ./metadata/... ./registry/... : Zero warnings.
  • golangci-lint run ./metadata/... ./registry/servicediscovery/... : Zero issues.
  • go test ./metadata/... ./registry/... : All 21 packages passed.
  • go build ./... : Full project compilation passed successfully.

…rialization (apache#3353)

Add sync.RWMutex to MetadataInfo struct with json:"-" / hessian:"-"
tags to skip serialization. All mutating methods (AddService, RemoveService,
AddSubscribeURL, RemoveSubscribeURL) acquire the write lock, and read
methods (GetExportedServiceURLs, GetSubscribedURLs, GetServices) acquire
the read lock. The new GetServices method returns a snapshot copy.

Signed-off-by: jieguo-coder <1193249232@qq.com>
…apache#3353)

Add sync.RWMutex to protect registryMetadataInfo in metadata.go and
instances in report_instance.go. Extract getMetadataReportUnsafe helper
to avoid reentrant RLock deadlock in GetMetadataReportByRegistry fallback.
Fix nacos report_test to use pointer to MetadataInfo for json.Marshal.

Signed-off-by: jieguo-coder <1193249232@qq.com>
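
The reason for an "unsafe" helper can be shown with a small sketch. sync.RWMutex does not support recursive read-locking: if a writer calls Lock between two RLock calls on the same goroutine, the second RLock blocks behind the writer and the writer blocks behind the first RLock. The names below (getReportUnsafe, GetReport, GetReportByRegistry) and the string-valued map are hypothetical simplifications, not the PR's real signatures.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins: the real map holds MetadataReport values.
var (
	instancesMu sync.RWMutex
	instances   = map[string]string{"default": "default-report"}
)

// getReportUnsafe reads the map without locking.
// Callers must already hold instancesMu (read or write).
func getReportUnsafe() string {
	for _, r := range instances {
		return r // return any entry
	}
	return ""
}

// GetReport is the public entry point; it locks once and delegates.
func GetReport() string {
	instancesMu.RLock()
	defer instancesMu.RUnlock()
	return getReportUnsafe()
}

// GetReportByRegistry falls back to "any report" when the id is unknown.
// Calling GetReport here would take RLock a second time on the same
// goroutine, which can deadlock if a writer is waiting in between.
func GetReportByRegistry(id string) string {
	instancesMu.RLock()
	defer instancesMu.RUnlock()
	if r, ok := instances[id]; ok {
		return r
	}
	return getReportUnsafe()
}

func main() {
	fmt.Println(GetReportByRegistry("missing")) // prints default-report
}
```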
…rnal calls (apache#3353)

Add mutex locking to AddListenerAndNotify and RemoveListener to protect
shared fields listeners and serviceUrls. Replace direct access to
MetadataInfo.Services with safe GetServices method in OnEvent and
convertV2 to prevent unprotected map reads.

Signed-off-by: jieguo-coder <1193249232@qq.com>
@Alanxtl

Alanxtl commented Jun 3, 2026

Member

Our project uses imports-formatter to format import blocks; that's why your CI fails. You should:

  1. run go install github.com/dubbogo/tools/cmd/imports-formatter@latest
  2. cd to the root dir of dubbo-go
  3. run imports-formatter

@Alanxtl Alanxtl added ✏️ Feature 3.3.2 version 3.3.2 labels Jun 3, 2026
Signed-off-by: jieguo-coder <1193249232@qq.com>
Signed-off-by: jieguo-coder <1193249232@qq.com>
@jieguo-coder

Author

Thanks for the guidance! @Alanxtl
I have formatted the import blocks using imports-formatter and pushed the updates. The CI should be happy now. 😊

@codecov-commenter

codecov-commenter commented Jun 3, 2026

Codecov Report

❌ Patch coverage is 96.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.50%. Comparing base (60d1c2a) to head (11cc932).
⚠️ Report is 837 commits behind head on develop.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| metadata/info/metadata_info.go | 92.30% | 3 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3367      +/-   ##
===========================================
+ Coverage    46.76%   53.50%   +6.73%     
===========================================
  Files          295      493     +198     
  Lines        17172    38370   +21198     
===========================================
+ Hits          8031    20530   +12499     
- Misses        8287    16201    +7914     
- Partials       854     1639     +785     

Comment thread metadata/metadata_service.go
Comment thread metadata/metadata.go
Signed-off-by: jieguo-coder <1193249232@qq.com>
Signed-off-by: jieguo-coder <1193249232@qq.com>
@jieguo-coder jieguo-coder force-pushed the fix/issue-3353-metadata-concurrency branch from bc20d95 to e65f694 Compare June 4, 2026 09:13
Signed-off-by: jieguo-coder <1193249232@qq.com>
Comment thread registry/servicediscovery/service_instances_changed_listener_impl.go Outdated

Copilot AI left a comment

Contributor

Pull request overview

This PR adds synchronization to the application-level metadata path to prevent data races and concurrent map write panics introduced after the metadata refactor (#2534). It primarily protects shared global maps and metadata state that are accessed concurrently during registration/subscription flows and service discovery events.

Changes:

  • Add sync.RWMutex protection to MetadataInfo internals and introduce GetServices() for safe external iteration.
  • Protect global metadata registries (registryMetadataInfo, metadata report instances) with sync.RWMutex.
  • Add missing listener-map locking in service discovery’s instance-changed listener and update call sites to use GetServices().

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| registry/servicediscovery/service_instances_changed_listener_impl.go | Locks listener mutation paths and switches service iteration to MetadataInfo.GetServices() to avoid unsafe map iteration. |
| registry/servicediscovery/service_instances_changed_listener_impl_test.go | Adds test coverage for listener removal behavior. |
| metadata/report/nacos/report_test.go | Updates test expectations for MetadataInfo pointer usage. |
| metadata/report_instance.go | Adds RWMutex protection around global metadata report instances map. |
| metadata/metadata.go | Adds RWMutex protection around global registryMetadataInfo map and makes get-or-create atomic. |
| metadata/metadata_test.go | Adds a concurrent access smoke test for Add/Read operations on global metadata. |
| metadata/metadata_service.go | Adds read-locking while iterating metadata map and uses GetServices() for V2 conversion. |
| metadata/metadata_service_test.go | Adds concurrent read-access test coverage for DefaultMetadataService. |
| metadata/info/metadata_info.go | Adds per-MetadataInfo RWMutex, locks map accessors/mutators, and introduces GetServices() snapshot method. |
| metadata/info/metadata_info_test.go | Adds tests validating GetServices() returns a snapshot copy. |

Comment thread metadata/info/metadata_info.go Outdated
Comment on lines +185 to +195
// GetServices returns a copy of the Services map for safe iteration by external callers.
func (info *MetadataInfo) GetServices() map[string]*ServiceInfo {
	info.mu.RLock()
	defer info.mu.RUnlock()

	cp := make(map[string]*ServiceInfo, len(info.Services))
	for k, v := range info.Services {
		cp[k] = v
	}
	return cp
}
Comment on lines +52 to +54
instancesMu.Lock()
instances[registryId] = &DelegateMetadataReport{instance: fac.CreateMetadataReport(url)}
instancesMu.Unlock()
…ce and reduce lock granularity in report creation

Signed-off-by: jieguo-coder <1193249232@qq.com>
@jieguo-coder jieguo-coder changed the base branch from main to develop June 7, 2026 15:00
Signed-off-by: jieguo-coder <1193249232@qq.com>
@Alanxtl

Alanxtl commented Jun 8, 2026

Member

Please update to the latest develop branch to fix the CI failure.

@Alanxtl Alanxtl left a comment

Member

lgtm

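Merge order

Merge #3360 and #3362 first; they are loosely coupled.
#3367 serves as the concurrency-safety foundation.
Merge #3369 next to establish the registryId/report/cache scoping.
Rebase #3370 onto #3367 + #3369, and check in particular that the revision calculation does not bypass the locks.
Then merge #3371, accepting the MetadataReport interface extension, and confirm that Snapshot() is consistent with the lock semantics of #3367.
Rebase #3373 last, because it and #3371 both modify the report backends; the semantics do not overlap, but file conflicts are likely.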

}

func (info *MetadataInfo) ReplaceExportedServices(urls []*common.URL) {
	info.Services = make(map[string]*ServiceInfo)

Contributor

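[P1] This still bypasses MetadataInfo.mu and resets Services and exportedServiceURLs directly. This PR already makes AddService, RemoveService, GetServices, and GetExportedServiceURLs guard these fields with the same mutex, but ReplaceExportedServices writes the maps without a lock when it is called from service_discovery_registry.go. If a metadata read or instance-change handler calls GetServices/GetExportedServiceURLs at the same time, this can still trigger a data race or expose a half-rebuilt state. The function needs to hold the write lock on entry and must not then call AddService, which would lock again; an internal no-lock helper can be split out to reuse the write logic.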

Author

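OK. I have refactored the logic to use an internal no-lock helper pattern and extracted addServiceWithoutLock to handle the core map updates. ReplaceExportedServices now acquires info.mu.Lock() once on entry and safely reuses the helper inside the loop.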
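
A rough sketch of that locked-wrapper plus no-lock-helper split is shown below. Only the names addServiceWithoutLock, AddService, ReplaceExportedServices, and GetServices come from this thread; the string-valued map and the parameter shapes are assumptions made to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// MetadataInfo is a simplified stand-in: the real type stores
// *ServiceInfo values and URL sets rather than plain strings.
type MetadataInfo struct {
	mu       sync.RWMutex
	Services map[string]string
}

// addServiceWithoutLock holds the shared write logic.
// Callers must hold info.mu for writing.
func (info *MetadataInfo) addServiceWithoutLock(key, svc string) {
	info.Services[key] = svc
}

// AddService is the locked public wrapper around the helper.
func (info *MetadataInfo) AddService(key, svc string) {
	info.mu.Lock()
	defer info.mu.Unlock()
	info.addServiceWithoutLock(key, svc)
}

// ReplaceExportedServices locks once, then resets and refills the map, so
// readers see either the old contents or the new ones, never a partial rebuild.
// Calling AddService inside the loop would try to Lock again and deadlock.
func (info *MetadataInfo) ReplaceExportedServices(svcs map[string]string) {
	info.mu.Lock()
	defer info.mu.Unlock()
	info.Services = make(map[string]string, len(svcs))
	for k, v := range svcs {
		info.addServiceWithoutLock(k, v)
	}
}

// GetServices returns a copy taken under the read lock.
func (info *MetadataInfo) GetServices() map[string]string {
	info.mu.RLock()
	defer info.mu.RUnlock()
	cp := make(map[string]string, len(info.Services))
	for k, v := range info.Services {
		cp[k] = v
	}
	return cp
}

func main() {
	info := &MetadataInfo{Services: map[string]string{}}
	info.AddService("old", "v1")
	info.ReplaceExportedServices(map[string]string{"a": "v2", "b": "v2"})
	fmt.Println(len(info.GetServices())) // prints 2
}
```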

…id deadlocks in ReplaceExportedServices

Signed-off-by: jieguo-coder <1193249232@qq.com>
…stic fallback while preserving fine-grained I/O locking

Signed-off-by: jieguo-coder <1193249232@qq.com>
…ing missing registryId argument

Signed-off-by: jieguo-coder <1193249232@qq.com>
@jieguo-coder

Author

@Alanxtl

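I have resolved the merge conflicts caused by the recent integration of #3369.
While merging metadata/report_instance.go, I kept all of the new business logic (deterministic ordering, the DefaultKey fallback, and ClearMetadataReportInstances) and also kept the fine-grained concurrency optimization (the I/O-heavy CreateMetadataReport stays outside the global lock).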
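
One common way to keep slow construction outside a global lock is to check under a read lock, build without any lock held, and re-check under the write lock before storing. The sketch below illustrates that idea only; the re-check step, the Report type, and the createReport and getOrCreateReport names are assumptions, and the PR's real code may differ.

```go
package main

import (
	"fmt"
	"sync"
)

// Report is a hypothetical stand-in for a metadata report instance.
type Report struct{ ID string }

var (
	instancesMu sync.RWMutex
	instances   = map[string]*Report{}
)

// createReport stands in for a slow, I/O-heavy factory call.
func createReport(id string) *Report { return &Report{ID: id} }

func getOrCreateReport(id string) *Report {
	// Fast path: most callers find an existing instance under the read lock.
	instancesMu.RLock()
	r, ok := instances[id]
	instancesMu.RUnlock()
	if ok {
		return r
	}

	created := createReport(id) // slow work, no lock held

	instancesMu.Lock()
	defer instancesMu.Unlock()
	if existing, ok := instances[id]; ok {
		return existing // another goroutine stored one first; drop ours
	}
	instances[id] = created
	return created
}

func main() {
	var wg sync.WaitGroup
	results := make([]*Report, 8)
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = getOrCreateReport("nacos")
		}(i)
	}
	wg.Wait()
	same := true
	for _, r := range results {
		if r != results[0] {
			same = false
		}
	}
	fmt.Println(same) // prints true: every caller sees the same instance
}
```

The trade-off is that two goroutines racing on the same id may both run the factory, and one result is discarded. That is acceptable only when creating a report has no lasting side effects.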

Comment thread metadata/metadata.go
}

func AddService(registryId string, url *common.URL) {
	registryMetadataLock.Lock()

Contributor

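[P1] This still reads registryMetadataInfo without a lock. This PR makes AddService/AddSubscribeURL/GetMetadataInfo guard the same global map with registryMetadataLock, but RemoveService/RemoveSubscribeURL still do the map lookup without an RLock. When service export/subscription and unregistration happen concurrently, the Add path may be writing registryMetadataInfo while the Remove path reads it, which can still trigger a concurrent map read/write. Fetch metadataInfo under registryMetadataLock first, release the global lock, and then call metadataInfo.RemoveService, keeping the lock granularity consistent with AddService.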

Author

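I have updated RemoveService and RemoveSubscribeURL in metadata/metadata.go. They now acquire registryMetadataLock.RLock(), perform the map lookup, and call RUnlock() immediately, before invoking the instance's own methods.

The global lock is held for as short a time as possible and is fully symmetric with AddService, which resolves the concurrent read/write risk without introducing any deadlock.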
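
The two-level locking described here can be sketched as below: the global lock only covers the map lookup (or get-or-create), and the per-instance lock covers the mutation. registryMetadataLock and registryMetadataInfo are named in this thread; the reduced MetadataInfo type, the string keys, and the serviceCount helper are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// MetadataInfo is a simplified stand-in with its own lock.
type MetadataInfo struct {
	mu       sync.RWMutex
	services map[string]struct{}
}

func (m *MetadataInfo) add(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.services[key] = struct{}{}
}

func (m *MetadataInfo) remove(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.services, key)
}

func (m *MetadataInfo) count() int {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return len(m.services)
}

var (
	registryMetadataLock sync.RWMutex
	registryMetadataInfo = map[string]*MetadataInfo{}
)

// AddService does get-or-create under the global write lock, then releases
// it before taking the per-instance lock.
func AddService(registryID, key string) {
	registryMetadataLock.Lock()
	info, ok := registryMetadataInfo[registryID]
	if !ok {
		info = &MetadataInfo{services: map[string]struct{}{}}
		registryMetadataInfo[registryID] = info
	}
	registryMetadataLock.Unlock()
	info.add(key)
}

// RemoveService only reads the global map, so a read lock is enough.
func RemoveService(registryID, key string) {
	registryMetadataLock.RLock()
	info, ok := registryMetadataInfo[registryID]
	registryMetadataLock.RUnlock()
	if ok {
		info.remove(key)
	}
}

// serviceCount is a hypothetical helper used only to observe the result.
func serviceCount(registryID string) int {
	registryMetadataLock.RLock()
	info, ok := registryMetadataInfo[registryID]
	registryMetadataLock.RUnlock()
	if !ok {
		return 0
	}
	return info.count()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			key := fmt.Sprintf("svc-%d", i)
			AddService("registry-a", key)
			RemoveService("registry-a", key)
		}(i)
	}
	wg.Wait()
	fmt.Println(serviceCount("registry-a")) // prints 0
}
```

Because the global lock is always released before the instance lock is taken, the two locks are never held together and cannot form a lock-ordering cycle.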

@Alanxtl Alanxtl self-assigned this Jun 13, 2026
…d RemoveSubscribeURL to prevent concurrent map read/write

Signed-off-by: jieguo-coder <1193249232@qq.com>
… concurrency safety and integrating new Tag field

Signed-off-by: jieguo-coder <1193249232@qq.com>
…key format tests

Signed-off-by: jieguo-coder <1193249232@qq.com>
Development

Successfully merging this pull request may close these issues.

[FEATURE] Add concurrency safety for application-level metadata state / 为应用级 metadata 全局状态补充并发安全

6 participants