
[CELEBORN-194][WIP] Adding client side metrics for celeborn#3740

Open
AmandeepSingh285 wants to merge 5 commits into
apache:mainfrom
AmandeepSingh285:adding-client-side-metrics

@AmandeepSingh285 AmandeepSingh285 commented Jun 15, 2026

Contributor

What changes were proposed in this pull request?

Adds client-side metrics for Celeborn, sent to the master via the application heartbeat.

Why are the changes needed?

These changes help increase observability for Celeborn clients.

Does this PR resolve a correctness bug?

  • Yes

Does this PR introduce any user-facing change?

  • Yes

How was this patch tested?

@AmandeepSingh285 AmandeepSingh285 changed the title [CELEBORN-194] WIP Adding client side metrics for celeborn [CELEBORN-194] [WIP] Adding client side metrics for celeborn Jun 15, 2026
@AmandeepSingh285

Contributor Author

Hi @SteNicholas, @RexXiong, could you please help with a high-level review of the implementation design for adding client-side metrics?
Thanks!

@codecov

codecov Bot commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 64.76190% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.55%. Comparing base (b4cb5a0) to head (e43721a).
⚠️ Report is 69 commits behind head on main.

Files with missing lines Patch % Lines
.../org/apache/celeborn/client/LifecycleManager.scala 29.42% 21 Missing and 3 partials ⚠️
...pache/celeborn/client/ChangePartitionManager.scala 0.00% 4 Missing ⚠️
...born/common/protocol/message/ControlMessages.scala 82.36% 0 Missing and 3 partials ⚠️
...n/client/commit/ReducePartitionCommitHandler.scala 0.00% 2 Missing ⚠️
.../apache/celeborn/common/metrics/ClientMetric.scala 50.00% 1 Missing and 1 partial ⚠️
...eleborn/common/metrics/source/AbstractSource.scala 33.34% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3740      +/-   ##
============================================
- Coverage     66.91%   58.55%   -8.35%     
- Complexity        0      311     +311     
============================================
  Files           358      397      +39     
  Lines         21986    27910    +5924     
  Branches       1946     2728     +782     
============================================
+ Hits          14710    16341    +1631     
- Misses         6262    10382    +4120     
- Partials       1014     1187     +173     


@AmandeepSingh285

Contributor Author

Gentle ping @SteNicholas, @RexXiong: could you please help with a high-level review of the approach? Thanks!

@AmandeepSingh285 AmandeepSingh285 changed the title [CELEBORN-194] [WIP] Adding client side metrics for celeborn [CELEBORN-194] Adding client side metrics for celeborn Jun 22, 2026
@AmandeepSingh285

Contributor Author

Gentle follow-up ping @SteNicholas @RexXiong . Would appreciate a high-level review of the proposed approach whenever you have some time. Thanks!

@AmandeepSingh285 AmandeepSingh285 changed the title [CELEBORN-194] Adding client side metrics for celeborn [CELEBORN-194][WIP] Adding client side metrics for celeborn Jun 22, 2026
@SteNicholas SteNicholas requested a review from Copilot June 22, 2026 10:13

Copilot AI left a comment


Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Adds client-side metrics collection in Celeborn clients and ships those metrics to the master via application heartbeats, where they are re-exposed on the master Prometheus endpoint labeled by applicationId.

Changes:

  • Extend HeartbeatFromApplication (and protobuf serde) to carry a clientMetrics map of {name -> (value, type)}.
  • Add client and master metric sources (CelebornClientSource, ApplicationMetricsSource) plus wiring in LifecycleManager/Master.
  • Introduce celeborn.client.metrics.enabled config and add unit-test coverage for serde + source behavior.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
master/src/test/scala/org/apache/celeborn/service/deploy/master/ApplicationMetricsSourceSuite.scala Adds unit tests for master-side application metrics source behavior.
master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala Registers the new application metrics source and plumbs heartbeat clientMetrics through.
master/src/main/scala/org/apache/celeborn/service/deploy/master/ApplicationMetricsSource.scala Implements master-side cache + Prometheus re-export of client metrics by applicationId.
docs/configuration/metrics.md Documents new celeborn.client.metrics.enabled config.
common/src/test/scala/org/apache/celeborn/common/util/UtilsSuite.scala Adds serde round-trip test for clientMetrics in heartbeats.
common/src/main/scala/org/apache/celeborn/common/protocol/message/ControlMessages.scala Extends heartbeat message, protobuf encoding/decoding for client metrics.
common/src/main/scala/org/apache/celeborn/common/metrics/source/AbstractSource.scala Adds Role.CLIENT label behavior and counterExists helper.
common/src/main/scala/org/apache/celeborn/common/metrics/ClientMetric.scala Introduces ClientMetric + MetricType shared representation.
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Adds celeborn.client.metrics.enabled config entry and accessor.
common/src/main/proto/TransportMessages.proto Adds clientMetrics field + metric type/message definitions to heartbeat protobuf.
client/src/test/scala/org/apache/celeborn/client/WorkerStatusTrackerSuite.scala Adds test coverage for excluded-worker metrics behavior.
client/src/test/scala/org/apache/celeborn/client/CelebornClientSourceSuite.scala Adds unit tests for client metric source counters/gauges + snapshot types.
client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala Increments client “shuffle data lost” metric on lost-file conditions when enabled.
client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala Creates client metrics source, registers gauges, increments counters, and supplies snapshots to heartbeats.
client/src/main/scala/org/apache/celeborn/client/ChangePartitionManager.scala Increments revive-failure metrics when change partition assignment fails.
client/src/main/scala/org/apache/celeborn/client/CelebornClientSource.scala Implements client-side metrics source + snapshot export for heartbeat payload.
client/src/main/scala/org/apache/celeborn/client/ApplicationHeartbeater.scala Adds callback to attach client metrics to each HeartbeatFromApplication.


Comment on lines +226 to +240
val clientSource = new CelebornClientSource(conf)
private[client] val clientMetricsEnabled = conf.metricsSystemEnable && conf.clientMetricsEnabled
val commitManager = new CommitManager(appUniqueId, conf, this)
val workerStatusTracker = new WorkerStatusTracker(conf, this)
if (clientMetricsEnabled) {
  clientSource.addGauge(CelebornClientSource.ACTIVE_SHUFFLE_COUNT) { () =>
    registeredShuffle.size
  }
  clientSource.addGauge(CelebornClientSource.EXCLUDED_WORKER_COUNT) { () =>
    workerStatusTracker.excludedWorkers.size
  }
  clientSource.addGauge(CelebornClientSource.SHUTTING_WORKER_COUNT) { () =>
    workerStatusTracker.shuttingWorkers.size
  }
}
Comment on lines +49 to +50
// start cleaner thread
startCleaner()
Comment on lines +50 to +53
def updateApplicationMetrics(appId: String, metrics: JMap[String, ClientMetric]): Unit = {
  if (metrics.isEmpty) return
  metrics.asScala.foreach { case (name, metric) =>
    val labels = Map(applicationLabel -> appId)
Comment on lines +78 to +93
private def updateCounter(
    appId: String,
    name: String,
    labels: Map[String, String],
    newValue: Long): Unit = {
  val prev = appCounterPrev.computeIfAbsent(appId, _ => JavaUtils.newConcurrentHashMap())
  if (!counterExists(name, labels)) {
    addCounter(name, labels)
  }
  val prevValue = prev.getOrDefault(name, 0L)
  val delta = newValue - prevValue
  if (delta > 0) {
    incCounter(name, delta, labels)
  }
  prev.put(name, newValue)
}
errors.get())
}

test("recordWorkerFailure increments client worker-excluded counter and gauge") {

class ApplicationMetricsSourceSuite extends CelebornFunSuite {

private def metricsOf(app: String, value: Long): JHashMap[String, ClientMetric] = {

@SteNicholas SteNicholas left a comment

Member

@AmandeepSingh285, thanks for working on client-side metrics — the overall shape (client AbstractSource snapshot → heartbeat → master re-expose) is reasonable and the serde/config/docs are wired up. Since it's marked [WIP], I'm leaving review comments rather than approving; there are a few correctness/lifecycle issues worth resolving first (inline). Summary, most-impactful first:

  1. Client metrics source + cleaner thread leak (per Spark driver). clientSource is created unconditionally and never destroyed — see inline on LifecycleManager.scala:226.
  2. Master re-registers dead apps' metrics → permanent leak. Master is a plain RpcEndpoint (not ThreadSafeRpcEndpoint), so its Inbox runs with enableConcurrent = true and heartbeats are processed concurrently. A heartbeat racing/after handleAppLost resurrects the app's per-app gauges/counters, which are then never cleaned. See inline on ApplicationMetricsSource.updateApplicationMetrics.
  3. Non-atomic counter delta + fragile absolute→delta conversion. Concurrent heartbeats for one app double-count; app-restart-with-same-id or heartbeat reordering corrupt the delta. See inline on updateCounter.
  4. Unbounded master cardinality + silent truncation. Per-applicationId labeling has no top-N cap and no master-side enable flag; AbstractSource.getMetrics truncates at metricsCapacity (4096) and emits counters last, so app counters drop first. The codebase already solved this for worker per-app metrics via celeborn.metrics.worker.app.topResourceConsumption.count (default 0/off). See inline on Master.scala.
  5. Gauge flaps to 0 on removal (cache cleared before the gauge is unregistered) — inline on removeApplicationMetrics.
  6. Metric semantics: ClientReviveFailCount is incremented by changePartitions.size in one place but by +1 in handleRevive, and ClientShuffleDataLostCount is bumped in both handleMapPartitionEnd and ReducePartitionCommitHandler.stageEnd — mixed units / possible double-count. Inline on ChangePartitionManager.
  7. fromPb silently maps unknown PbMetricType to Gauge (forward-compat trap) — inline on ControlMessages.scala.

Minor / cleanup (no inline needed): the if (clientMetricsEnabled) clientSource.incCounter(...) guard is copy-pasted ~12×, and ReducePartitionCommitHandler recomputes the gate inline instead of reusing the cached clientMetricsEnabled field — a single incClientMetric(name, n) helper would centralize the gate and avoid the semantic drift in (6). ClientMetric/MetricType also duplicate proto PbClientMetric/PbMetricType (4 spots to keep in lockstep). Test gap: the counter-delta path in ApplicationMetricsSource is untested (ApplicationMetricsSourceSuite only sends MetricType.Gauge).
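
A single gated helper along the lines suggested here could look roughly like the sketch below. It is only an illustration: `SketchClientMetrics`, its `enabled` flag and its `value` accessor are hypothetical stand-ins, not the PR's `CelebornClientSource` API, and the counters are plain `AtomicLong`s rather than a real metrics registry.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for the client metrics source; not the PR's class.
final class SketchClientMetrics(enabled: Boolean) {
  private val counters = new ConcurrentHashMap[String, AtomicLong]()

  // The only place that checks the enable flag, so call sites cannot drift.
  def incClientMetric(name: String, n: Long = 1L): Unit =
    if (enabled) {
      counters.computeIfAbsent(name, _ => new AtomicLong()).addAndGet(n)
    }

  // Current value of a counter, or 0 if it was never incremented.
  def value(name: String): Long =
    Option(counters.get(name)).map(_.get()).getOrElse(0L)
}
```

With one entry point, each call site only has to decide which unit it passes as `n` (per partition or per batch), which makes mixed units easier to spot in review.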

}

private val masterClient = new MasterClient(masterRpcEnvInUse, conf, false)
val clientSource = new CelebornClientSource(conf)

Member

clientSource is created unconditionally (before the clientMetricsEnabled guard), and AbstractSource's constructor spawns a daemon worker-metrics-cleaner scheduled executor; CelebornClientSource also calls startCleaner() and registers 8 counters in its ctor. But LifecycleManager.stop() only calls heartbeater.stop() + super.stop() — it never calls clientSource.destroy(). So every LifecycleManager (one per Spark app driver) leaks a daemon thread + a MetricRegistry, even when celeborn.client.metrics.enabled=false (the default). On multi-tenant/long-lived drivers (Spark Connect, Kyuubi, notebooks) and in WorkerStatusTrackerSuite's new test (which only calls stop()), these accumulate. Suggest gating creation on clientMetricsEnabled and calling clientSource.destroy() in stop(). (The cleaner is also a no-op here — clearOldValues only scans namedTimers, and this source has none.)
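
A minimal sketch of the suggested lifecycle, assuming the source can be modelled as an optional resource. `SketchSource` and `SketchLifecycleManager` are hypothetical stand-ins for `CelebornClientSource` and `LifecycleManager`; the real classes have more state and a different constructor.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService}

// Hypothetical stand-in for a metrics source that owns a background executor.
final class SketchSource {
  private val cleaner: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()
  @volatile private var destroyed = false

  def isDestroyed: Boolean = destroyed

  // Releases the executor; safe to call more than once.
  def destroy(): Unit = {
    cleaner.shutdownNow()
    destroyed = true
  }
}

// Hypothetical stand-in for LifecycleManager: the source exists only when enabled.
final class SketchLifecycleManager(metricsEnabled: Boolean) {
  val clientSource: Option[SketchSource] =
    if (metricsEnabled) Some(new SketchSource) else None

  // stop() tears down whatever was created, and is a no-op when disabled.
  def stop(): Unit = clientSource.foreach(_.destroy())
}
```

Modelling the source as an `Option` means a disabled client allocates nothing, and `stop()` has a single obvious place to release the executor.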


startCleaner()

def updateApplicationMetrics(appId: String, metrics: JMap[String, ClientMetric]): Unit = {

Member

Master extends plain RpcEndpoint (not ThreadSafeRpcEndpoint), so its Inbox sets enableConcurrent = true and app heartbeats are dispatched concurrently. If a heartbeat for appId is in-flight/queued when handleAppLost → removeApplicationMetrics(appId) runs (e.g. a false timeout from a GC pause, or the last heartbeat racing ApplicationLost), this method then re-computeIfAbsents the caches and re-addGauge/addCounters the app's metrics. Since removeApplicationMetrics only ever fires once per app, those applicationId=<dead app> series leak permanently and the counter re-emits its full cumulative value (prev reset to 0 → delta = full value). Needs a liveness check against the live-app set, or removal coordinated so a later heartbeat can't resurrect.
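
One possible way to coordinate removal with late heartbeats is sketched below. It assumes application IDs are never reused and uses a tombstone set, which is an illustration only; `SketchAppMetrics` is a hypothetical class, and a real version would need to expire tombstones or consult the master's live-app set instead.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical per-app metric store; not the PR's ApplicationMetricsSource.
final class SketchAppMetrics {
  private val values = new ConcurrentHashMap[String, Map[String, Long]]()
  // Apps that were already removed. Unbounded here; a real version must expire entries.
  private val removed = ConcurrentHashMap.newKeySet[String]()

  // Returns false when the update was dropped because the app is already gone.
  def update(appId: String, metrics: Map[String, Long]): Boolean = {
    var accepted = false
    // compute() serializes update and remove for the same appId.
    values.compute(appId, (_k, old) => {
      if (removed.contains(appId)) null // late heartbeat: do not resurrect
      else {
        accepted = true
        Option(old).getOrElse(Map.empty[String, Long]) ++ metrics
      }
    })
    accepted
  }

  // Marks the app as gone and drops its values in one atomic step per key.
  def remove(appId: String): Unit = {
    values.compute(appId, (_k, _old) => {
      removed.add(appId)
      null
    })
  }

  def contains(appId: String): Boolean = values.containsKey(appId)
}
```

Because both paths go through `compute` on the same key, a heartbeat that arrives after removal sees the tombstone and is dropped instead of re-creating the entry.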

addCounter(name, labels)
}
val prevValue = prev.getOrDefault(name, 0L)
val delta = newValue - prevValue

Member

Two issues here. (a) Non-atomic read-modify-write: getOrDefault → delta → incCounter → prev.put is not atomic, and with enableConcurrent=true two heartbeats for the same app over-count (both read the same prev) or lose updates. Use ConcurrentHashMap.merge/compute to make delta+store atomic. (b) Fragile absolute→delta: the client reports an absolute cumulative value; if (delta > 0) then prev.put(name, newValue) (unconditional, line 92) means a same-appId restart (counter resets) or a reordered/retried heartbeat rewinds prev and either stalls the counter or over-counts on the next tick. Consider exposing the client's absolute value directly as a gauge (Prometheus rate()/increase() already tolerate counter resets); that removes appCounterPrev and this whole bug class.
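
For point (a), an atomic delta-and-store with `ConcurrentHashMap.compute` might look like the sketch below. `SketchCounterBridge` and its `exported` field are hypothetical; `exported` stands in for the master-side counter. The sketch deliberately ignores any report that does not exceed the stored watermark, so it does not resolve point (b): a genuine client restart would stall until the new value passes the old one.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical bridge from client-reported absolute values to a master-side counter.
final class SketchCounterBridge {
  private val prev = new ConcurrentHashMap[String, java.lang.Long]()
  // Stand-in for the exported counter.
  val exported = new AtomicLong(0L)

  def report(name: String, absolute: Long): Unit = {
    // Read, delta and store happen inside one compute() call for this key,
    // so two concurrent reports cannot both see the same previous value.
    prev.compute(name, (_k, old) => {
      val before = if (old == null) 0L else old.longValue()
      if (absolute > before) {
        exported.addAndGet(absolute - before)
        java.lang.Long.valueOf(absolute)
      } else {
        old // stale, reordered or reset report: keep the watermark unchanged
      }
    })
  }
}
```

Exporting the absolute value as a gauge, as suggested above, would avoid keeping this watermark at all.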


def removeApplicationMetrics(appId: String): Unit = {
val labels = Map(applicationLabel -> appId)
val gaugeCache = appGaugeCache.remove(appId)

Member

removeApplicationMetrics drops the appGaugeCache entry (line 97) before unregistering the gauges (line 99). The registered gauge closure reads Option(appGaugeCache.get(appId))...getOrElse(0L), so a concurrent Prometheus scrape landing in that window reports 0 for the about-to-be-removed gauge — a transient flap-to-0. Unregister the gauges first, then drop the cache (or guard the closure).
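
A rough sketch of both suggestions together (unregister first, and guard the closure), using a hypothetical `SketchGaugeRegistry` rather than the real `AbstractSource` gauge API. The gauge closure returns an `Option`, so a scrape that races with removal skips the series instead of reporting 0.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical registry; real gauges are registered through the metrics library.
final class SketchGaugeRegistry {
  private val cache = new ConcurrentHashMap[String, java.lang.Long]()
  private val gauges = new ConcurrentHashMap[String, () => Option[Long]]()

  def register(appId: String, value: Long): Unit = {
    cache.put(appId, value)
    // Guarded closure: yields None once the backing value is gone.
    gauges.putIfAbsent(appId, () => Option(cache.get(appId)).map(_.longValue()))
  }

  def remove(appId: String): Unit = {
    gauges.remove(appId) // 1. stop exposing the series
    cache.remove(appId)  // 2. only then drop the backing value
  }

  // Emits only gauges that still have a value; never a placeholder 0.
  def scrape(): Map[String, Long] = {
    val out = Map.newBuilder[String, Long]
    gauges.forEach((appId, read) => read().foreach(v => out += (appId -> v)))
    out.result()
  }
}
```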


metricsSystem.registerSource(resourceConsumptionSource)
metricsSystem.registerSource(masterSource)
metricsSystem.registerSource(applicationMetricsSource)

Member

ApplicationMetricsSource re-exposes ~11 metrics per heartbeating app labeled by applicationId, with no top-N cap and gated only by the client-side flag (no master-side enable). All share one metricsCapacity (default 4096); AbstractSource.getMetrics silently truncates beyond that (logWarning only) and emits counters last, so app counters (e.g. ClientShuffleDataLostCount) are dropped first at ~370 concurrent apps. The codebase already handles this exact cardinality problem for worker per-app metrics via celeborn.metrics.worker.app.topResourceConsumption.count (default 0 = off, top-N capped, documented as high-cardinality). Worth following that precedent (master-side enable + top-N), and note the source/cleaner are constructed even when master metricsSystemEnable=false.
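
The "~370 concurrent apps" figure follows from dividing the shared capacity by the per-app series count. A small sketch of that arithmetic, where `CardinalityEstimate` and `maxAppsBeforeTruncation` are hypothetical names and the result is an upper bound, since other metrics drawing on the same capacity would lower it:

```scala
object CardinalityEstimate {
  // Upper bound on apps whose series all fit before truncation starts.
  def maxAppsBeforeTruncation(capacity: Int, metricsPerApp: Int): Int = {
    require(metricsPerApp > 0, "metricsPerApp must be positive")
    capacity / metricsPerApp
  }
}
```

With the numbers quoted above, `CardinalityEstimate.maxAppsBeforeTruncation(4096, 11)` evaluates to 372.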

Contributor Author

@SteNicholas I changed the approach so that users can pass the required labels along with the metrics, and those labels are used as tags instead of the application ID. This helps avoid the cardinality constraints. Could you please review this approach? Thanks!

}
if (lifecycleManager.clientMetricsEnabled) {
lifecycleManager.clientSource.incCounter(
CelebornClientSource.REVIVE_FAIL_COUNT,

Member

REVIVE_FAIL_COUNT is incremented here by changePartitions.size (per-partition), but in LifecycleManager.handleRevive it's incremented by +1 per batch on the unregistered/stage-ended paths, while REVIVE_REQUEST_COUNT is incremented by partitionIds.size. Mixing per-partition and per-batch units in one series makes a fail-rate uninterpretable. Separately, SHUFFLE_DATA_LOST_COUNT is incremented both in LifecycleManager.handleMapPartitionEnd and in ReducePartitionCommitHandler.stageEnd — please confirm those are disjoint events and not double-counting the same lost shuffle. Pick one unit per metric.

new util.HashMap[String, ClientMetric](
pbHeartbeatFromApplication.getClientMetricsMap.asScala.map { case (name, pbMetric) =>
val metricType = pbMetric.getType match {
case PbMetricType.COUNTER => MetricType.Counter

Member

fromPb uses case PbMetricType.COUNTER => Counter; case _ => Gauge, silently coercing UNRECOGNIZED/any future enum value to Gauge. With version skew, a newer client's counter would be decoded on an older master as a gauge → routed to updateGauge (last-value) instead of updateCounter (delta), i.e. silently wrong semantics. At minimum logWarning on the default branch; better, handle GAUGE/UNRECOGNIZED explicitly.
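
Explicit handling could be sketched as below. The `Wire*` objects are hypothetical stand-ins for the generated `PbMetricType` values, and returning `None` is just one way to let the caller log a warning and skip the metric rather than coerce it.

```scala
object MetricTypeDecoding {
  // Hypothetical mirror of the protobuf enum, including the unknown-value case.
  sealed trait WireType
  case object WireCounter extends WireType
  case object WireGauge extends WireType
  case object WireUnrecognized extends WireType

  sealed trait MetricType
  case object Counter extends MetricType
  case object Gauge extends MetricType

  // No wildcard branch: a new WireType case triggers a non-exhaustive match warning.
  def decode(wire: WireType): Option[MetricType] = wire match {
    case WireCounter => Some(Counter)
    case WireGauge => Some(Gauge)
    case WireUnrecognized => None // caller should log a warning and drop the metric
  }
}
```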

@AmandeepSingh285

Contributor Author

Thanks @SteNicholas for the review. I'm still working on improving this PR and will take into account all the updates you mentioned. Thanks!


3 participants