feat(metadata): add concurrency safety for application-level metadata state (#3353) #3367
jieguo-coder wants to merge 17 commits into
…rialization (apache#3353) Add sync.RWMutex to MetadataInfo struct with json:"-" / hessian:"-" tags to skip serialization. All mutating methods (AddService, RemoveService, AddSubscribeURL, RemoveSubscribeURL) acquire the write lock, and read methods (GetExportedServiceURLs, GetSubscribedURLs, GetServices) acquire the read lock. The new GetServices method returns a snapshot copy. Signed-off-by: jieguo-coder <1193249232@qq.com>
…apache#3353) Add sync.RWMutex to protect registryMetadataInfo in metadata.go and instances in report_instance.go. Extract getMetadataReportUnsafe helper to avoid reentrant RLock deadlock in GetMetadataReportByRegistry fallback. Fix nacos report_test to use pointer to MetadataInfo for json.Marshal. Signed-off-by: jieguo-coder <1193249232@qq.com>
…rnal calls (apache#3353) Add mutex locking to AddListenerAndNotify and RemoveListener to protect shared fields listeners and serviceUrls. Replace direct access to MetadataInfo.Services with safe GetServices method in OnEvent and convertV2 to prevent unprotected map reads. Signed-off-by: jieguo-coder <1193249232@qq.com>
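The first two commit messages describe a lock embedded in `MetadataInfo` that is excluded from serialization, and a test fix that marshals through a pointer. A minimal sketch of that shape follows. The `App` field, the simplified `ServiceInfo`, and the `Encode` helper are assumptions for illustration, not the PR's code. Marshaling via the pointer avoids copying the struct together with its lock, which is presumably why the test was changed.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// ServiceInfo is a simplified stand-in for the real type.
type ServiceInfo struct {
	Name string `json:"name"`
}

// MetadataInfo carries its own lock; the "-" tags keep it out of encoders.
type MetadataInfo struct {
	App      string                  `json:"app"`
	Services map[string]*ServiceInfo `json:"services"`
	mu       sync.RWMutex            `json:"-" hessian:"-"`
}

// AddService mutates the map under the write lock.
func (info *MetadataInfo) AddService(s *ServiceInfo) {
	info.mu.Lock()
	defer info.mu.Unlock()
	info.Services[s.Name] = s
}

// Encode (hypothetical helper) marshals through the pointer under the read
// lock, so the struct and its mutex are never copied.
func Encode(info *MetadataInfo) (string, error) {
	info.mu.RLock()
	defer info.mu.RUnlock()
	b, err := json.Marshal(info)
	return string(b), err
}

func main() {
	info := &MetadataInfo{App: "demo", Services: map[string]*ServiceInfo{}}
	info.AddService(&ServiceInfo{Name: "greeter"})
	out, err := Encode(info)
	fmt.Println(out, err)
}
```

The encoded output contains only `app` and `services`; the mutex never appears in it.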
|
our project use
|
Signed-off-by: jieguo-coder <1193249232@qq.com>
Signed-off-by: jieguo-coder <1193249232@qq.com>
Thanks for the guidance! @Alanxtl
Codecov Report❌ Patch coverage is
Additional details and impacted files

| Metric | develop | #3367 | +/- |
|---|---|---|---|
| Coverage | 46.76% | 53.50% | +6.73% |
| Files | 295 | 493 | +198 |
| Lines | 17172 | 38370 | +21198 |
| Hits | 8031 | 20530 | +12499 |
| Misses | 8287 | 16201 | +7914 |
| Partials | 854 | 1639 | +785 |
Signed-off-by: jieguo-coder <1193249232@qq.com>
Signed-off-by: jieguo-coder <1193249232@qq.com>
bc20d95 to e65f694
Signed-off-by: jieguo-coder <1193249232@qq.com>
Pull request overview
This PR adds synchronization to the application-level metadata path to prevent data races and concurrent map write panics introduced after the metadata refactor (#2534). It primarily protects shared global maps and metadata state that are accessed concurrently during registration/subscription flows and service discovery events.
Changes:
- Add `sync.RWMutex` protection to `MetadataInfo` internals and introduce `GetServices()` for safe external iteration.
- Protect global metadata registries (`registryMetadataInfo`, metadata report `instances`) with `sync.RWMutex`.
- Add missing listener-map locking in service discovery's instance-changed listener and update call sites to use `GetServices()`.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| registry/servicediscovery/service_instances_changed_listener_impl.go | Locks listener mutation paths and switches service iteration to MetadataInfo.GetServices() to avoid unsafe map iteration. |
| registry/servicediscovery/service_instances_changed_listener_impl_test.go | Adds test coverage for listener removal behavior. |
| metadata/report/nacos/report_test.go | Updates test expectations for MetadataInfo pointer usage. |
| metadata/report_instance.go | Adds RWMutex protection around global metadata report instances map. |
| metadata/metadata.go | Adds RWMutex protection around global registryMetadataInfo map and makes get-or-create atomic. |
| metadata/metadata_test.go | Adds a concurrent access smoke test for Add/Read operations on global metadata. |
| metadata/metadata_service.go | Adds read-locking while iterating metadata map and uses GetServices() for V2 conversion. |
| metadata/metadata_service_test.go | Adds concurrent read-access test coverage for DefaultMetadataService. |
| metadata/info/metadata_info.go | Adds per-MetadataInfo RWMutex, locks map accessors/mutators, and introduces GetServices() snapshot method. |
| metadata/info/metadata_info_test.go | Adds tests validating GetServices() returns a snapshot copy. |
    // GetServices returns a copy of the Services map for safe iteration by external callers.
    func (info *MetadataInfo) GetServices() map[string]*ServiceInfo {
    	info.mu.RLock()
    	defer info.mu.RUnlock()

    	cp := make(map[string]*ServiceInfo, len(info.Services))
    	for k, v := range info.Services {
    		cp[k] = v
    	}
    	return cp
    }
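To illustrate what this snapshot does and does not protect, here is a small self-contained sketch. The `ServiceInfo` type is reduced to a single hypothetical `Name` field. The copy is shallow: adding or deleting keys in the returned map leaves the original untouched, but the `*ServiceInfo` values are still shared with it.

```go
package main

import (
	"fmt"
	"sync"
)

// ServiceInfo is a simplified stand-in with one illustrative field.
type ServiceInfo struct{ Name string }

type MetadataInfo struct {
	mu       sync.RWMutex
	Services map[string]*ServiceInfo
}

// GetServices returns a shallow copy of the map taken under the read lock.
func (info *MetadataInfo) GetServices() map[string]*ServiceInfo {
	info.mu.RLock()
	defer info.mu.RUnlock()
	cp := make(map[string]*ServiceInfo, len(info.Services))
	for k, v := range info.Services {
		cp[k] = v
	}
	return cp
}

func main() {
	info := &MetadataInfo{Services: map[string]*ServiceInfo{"a": {Name: "a"}}}

	// Structural changes to the snapshot do not reach the original map.
	snap := info.GetServices()
	delete(snap, "a")
	snap["b"] = &ServiceInfo{Name: "b"}
	fmt.Println(len(info.GetServices()))

	// The pointed-to values are shared, so field writes are still visible.
	info.GetServices()["a"].Name = "renamed"
	fmt.Println(info.GetServices()["a"].Name)
}
```

Callers that need to mutate a `ServiceInfo` would still have to coordinate that themselves; the snapshot only makes iteration over the map safe.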
    instancesMu.Lock()
    instances[registryId] = &DelegateMetadataReport{instance: fac.CreateMetadataReport(url)}
    instancesMu.Unlock()
…ce and reduce lock granularity in report creation Signed-off-by: jieguo-coder <1193249232@qq.com>
Signed-off-by: jieguo-coder <1193249232@qq.com>
Please update to the latest develop branch to fix the CI failure.
    }

    func (info *MetadataInfo) ReplaceExportedServices(urls []*common.URL) {
    	info.Services = make(map[string]*ServiceInfo)
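[P1] This still bypasses MetadataInfo.mu and resets Services and exportedServiceURLs directly. This PR already makes AddService, RemoveService, GetServices, and GetExportedServiceURLs protect these fields with the same mutex, but ReplaceExportedServices writes the maps without a lock when it is called from service_discovery_registry.go. If a metadata read or an instance-change handler calls GetServices/GetExportedServiceURLs at the same time, it can still trigger a data race or observe a half-rebuilt state. The function needs to take the write lock on entry, and while holding it must not call AddService, which would lock again. Extract an internal no-lock helper to reuse the write logic.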
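OK, I have refactored the logic using an internal no-lock helper pattern and extracted addServiceWithoutLock to handle the core map updates. ReplaceExportedServices now acquires info.mu.Lock() once on entry and safely reuses the helper inside the loop.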
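A minimal sketch of the no-lock helper pattern described in this exchange, assuming simplified signatures: service names as plain strings instead of `*common.URL`, and a hypothetical `Count` accessor. Go's `sync.RWMutex` is not reentrant, so calling the locking `AddService` from inside `ReplaceExportedServices` while the write lock is held would deadlock; the helper avoids that.

```go
package main

import (
	"fmt"
	"sync"
)

type ServiceInfo struct{ Name string }

type MetadataInfo struct {
	mu       sync.RWMutex
	Services map[string]*ServiceInfo
}

// addServiceWithoutLock must only be called with info.mu held for writing.
func (info *MetadataInfo) addServiceWithoutLock(name string) {
	info.Services[name] = &ServiceInfo{Name: name}
}

// AddService is the public entry point: lock, then delegate.
func (info *MetadataInfo) AddService(name string) {
	info.mu.Lock()
	defer info.mu.Unlock()
	info.addServiceWithoutLock(name)
}

// ReplaceExportedServices rebuilds the map under a single write lock, so
// readers never observe a half-rebuilt state.
func (info *MetadataInfo) ReplaceExportedServices(names []string) {
	info.mu.Lock()
	defer info.mu.Unlock()
	info.Services = make(map[string]*ServiceInfo, len(names))
	for _, n := range names {
		info.addServiceWithoutLock(n) // not AddService: that would lock again
	}
}

// Count (hypothetical) reads the map size under the read lock.
func (info *MetadataInfo) Count() int {
	info.mu.RLock()
	defer info.mu.RUnlock()
	return len(info.Services)
}

func main() {
	info := &MetadataInfo{Services: map[string]*ServiceInfo{}}
	info.AddService("old")
	info.ReplaceExportedServices([]string{"a", "b"})
	fmt.Println(info.Count())
}
```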
…id deadlocks in ReplaceExportedServices Signed-off-by: jieguo-coder <1193249232@qq.com>
…stic fallback while preserving fine-grained I/O locking Signed-off-by: jieguo-coder <1193249232@qq.com>
…ing missing registryId argument Signed-off-by: jieguo-coder <1193249232@qq.com>
    }

    func AddService(registryId string, url *common.URL) {
    	registryMetadataLock.Lock()
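[P1] This still reads registryMetadataInfo without any lock. This PR makes AddService/AddSubscribeURL/GetMetadataInfo protect the same global map with registryMetadataLock, but RemoveService/RemoveSubscribeURL still do the map lookup without an RLock. When service export or subscription happens concurrently with unregistration, the Add path may be writing to registryMetadataInfo while the Remove path reads it, which still triggers a concurrent map read/write. Fetch metadataInfo under registryMetadataLock first, then release the global lock before calling metadataInfo.RemoveService, so the lock granularity stays consistent with AddService.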
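I have updated RemoveService and RemoveSubscribeURL in metadata/metadata.go. They now acquire registryMetadataLock.RLock(), perform the map lookup, and call RUnlock() immediately, before invoking the instance's own methods.
This keeps the global lock granularity minimal and fully symmetric with AddService, and resolves the concurrent read/write risk without introducing any deadlock.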
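The lookup-then-release shape described in this reply can be sketched as follows. The types are simplified assumptions: services are a plain set of names, and `Has` is a hypothetical accessor added for the demonstration. The global read lock covers only the map lookup, and the per-instance lock covers the actual removal.

```go
package main

import (
	"fmt"
	"sync"
)

type MetadataInfo struct {
	mu       sync.RWMutex
	Services map[string]struct{}
}

// RemoveService deletes a service under the instance's own write lock.
func (info *MetadataInfo) RemoveService(name string) {
	info.mu.Lock()
	defer info.mu.Unlock()
	delete(info.Services, name)
}

// Has (hypothetical) reports whether a service is present.
func (info *MetadataInfo) Has(name string) bool {
	info.mu.RLock()
	defer info.mu.RUnlock()
	_, ok := info.Services[name]
	return ok
}

var (
	registryMetadataLock sync.RWMutex
	registryMetadataInfo = make(map[string]*MetadataInfo)
)

// RemoveService holds the global read lock only for the lookup, then
// releases it before touching the per-registry MetadataInfo.
func RemoveService(registryId, name string) {
	registryMetadataLock.RLock()
	info, ok := registryMetadataInfo[registryId]
	registryMetadataLock.RUnlock()
	if !ok {
		return // unknown registry: nothing to remove
	}
	info.RemoveService(name)
}

func main() {
	registryMetadataLock.Lock()
	registryMetadataInfo["nacos"] = &MetadataInfo{
		Services: map[string]struct{}{"greeter": {}},
	}
	registryMetadataLock.Unlock()

	RemoveService("nacos", "greeter")
	RemoveService("missing", "greeter") // no-op
	fmt.Println(GetHas("nacos", "greeter"))
}

// GetHas (hypothetical) looks up a registry and checks for a service.
func GetHas(registryId, name string) bool {
	registryMetadataLock.RLock()
	info, ok := registryMetadataInfo[registryId]
	registryMetadataLock.RUnlock()
	return ok && info.Has(name)
}
```

Because the two locks are never held at the same time, there is no lock-ordering concern between the global map and an individual `MetadataInfo`.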
…d RemoveSubscribeURL to prevent concurrent map read/write Signed-off-by: jieguo-coder <1193249232@qq.com>
… concurrency safety and integrating new Tag field Signed-off-by: jieguo-coder <1193249232@qq.com>
…key format tests Signed-off-by: jieguo-coder <1193249232@qq.com>



Description
Summary
Fixes concurrency safety issues in the application-level metadata path by adding proper synchronization protection for multiple global maps and shared states.
Background
After the #2534 metadata refactor, application-level metadata is maintained through shared state such as the local MetadataInfo, metadata report instances, and MetadataService. Several of these core states are plain maps with no synchronization protection, so data races, stale reads, or fatal error: concurrent map writes can occur when service registration, unregistration, subscription, unsubscription, instance changes, and metadata service queries run concurrently.
Related Issue: Fixes #3353
Changes
Added a sync.RWMutex for the global map[string]*MetadataInfo.
The get-or-create phase in AddService / AddSubscribeURL is now executed atomically under a write lock to prevent race conditions.
GetMetadataInfo is protected by a read lock.
Concurrency Safety for Listeners (registry/.../service_instances_changed_listener_impl.go)
Added missing mutex protection for AddListenerAndNotify and RemoveListener to prevent concurrent reads/writes on listeners and serviceUrls against OnEvent.
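A rough sketch of what guarding `listeners` and `serviceUrls` with one mutex can look like. The types are assumptions: URLs are plain strings, `NotifyFunc` replaces the real listener interface, and the method signatures are simplified. This sketch also chooses to invoke callbacks after releasing the lock, which is a design decision made here for illustration and not something the description confirms about the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// NotifyFunc is a hypothetical stand-in for the real notify listener type.
type NotifyFunc func(url string)

type changedListener struct {
	mu          sync.Mutex
	listeners   map[string]NotifyFunc
	serviceUrls map[string][]string
}

func newChangedListener() *changedListener {
	return &changedListener{
		listeners:   make(map[string]NotifyFunc),
		serviceUrls: make(map[string][]string),
	}
}

// AddListenerAndNotify registers the listener and replays known URLs.
func (l *changedListener) AddListenerAndNotify(key string, notify NotifyFunc) {
	l.mu.Lock()
	l.listeners[key] = notify
	urls := append([]string(nil), l.serviceUrls[key]...) // copy under lock
	l.mu.Unlock()
	for _, u := range urls {
		notify(u)
	}
}

// RemoveListener unregisters the listener for key.
func (l *changedListener) RemoveListener(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.listeners, key)
}

// OnEvent records the new URLs and notifies the current listener, if any.
func (l *changedListener) OnEvent(key string, urls []string) {
	l.mu.Lock()
	l.serviceUrls[key] = urls
	notify := l.listeners[key]
	l.mu.Unlock()
	if notify == nil {
		return
	}
	for _, u := range urls {
		notify(u)
	}
}

func main() {
	l := newChangedListener()
	l.OnEvent("svc", []string{"u1"}) // stored; no listener yet

	var got []string
	l.AddListenerAndNotify("svc", func(u string) { got = append(got, u) })
	l.OnEvent("svc", []string{"u2", "u3"})
	l.RemoveListener("svc")
	l.OnEvent("svc", []string{"u4"}) // listener removed; not delivered

	fmt.Println(got)
}
```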
Safe Access to Services Field
Replaced direct accesses to metadataInfo.Services in OnEvent and convertV2 with the new safe method GetServices().
Test Plan