[CELEBORN-194] Introduce client side metrics for celeborn#3740

Open
AmandeepSingh285 wants to merge 6 commits into
apache:mainfrom
AmandeepSingh285:adding-client-side-metrics

Conversation

@AmandeepSingh285

@AmandeepSingh285 AmandeepSingh285 commented Jun 15, 2026

Contributor

What changes were proposed in this pull request?

Adds client-side metrics for Celeborn, reported to the master via application heartbeats.

Why are the changes needed?

These changes help increase observability for Celeborn clients.

Does this PR resolve a correctness bug?

  • Yes

Does this PR introduce any user-facing change?

  • Yes

How was this patch tested?

@AmandeepSingh285 AmandeepSingh285 changed the title [CELEBORN-194] WIP Adding client side metrics for celeborn [CELEBORN-194] [WIP] Adding client side metrics for celeborn Jun 15, 2026
@AmandeepSingh285

Contributor Author

Hi @SteNicholas, @RexXiong, could you please help with a high-level review of the implementation design for this change, which adds client-side metrics?
Thanks!

@codecov

codecov Bot commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 81.86813% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.77%. Comparing base (17159eb) to head (fe1aa14).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../org/apache/celeborn/client/LifecycleManager.scala | 41.38% | 15 Missing and 2 partials ⚠️ |
| ...born/common/protocol/message/ControlMessages.scala | 76.00% | 2 Missing and 4 partials ⚠️ |
| ...pache/celeborn/client/ChangePartitionManager.scala | 0.00% | 3 Missing ⚠️ |
| ...pache/celeborn/client/ApplicationHeartbeater.scala | 77.78% | 1 Missing and 1 partial ⚠️ |
| .../apache/celeborn/common/metrics/ClientMetric.scala | 50.00% | 1 Missing and 1 partial ⚠️ |
| ...eleborn/common/metrics/source/AbstractSource.scala | 95.66% | 1 Missing and 1 partial ⚠️ |
| ...n/client/commit/ReducePartitionCommitHandler.scala | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3740      +/-   ##
============================================
+ Coverage     57.73%   58.77%   +1.04%     
- Complexity      214      319     +105     
============================================
  Files           397      399       +2     
  Lines         27880    28056     +176     
  Branches       2714     2729      +15     
============================================
+ Hits          16095    16488     +393     
+ Misses        10635    10384     -251     
- Partials       1150     1184      +34     


@AmandeepSingh285

Contributor Author

Gentle ping @SteNicholas, @RexXiong: could you please help with a high-level review of the approach? Thanks!

@AmandeepSingh285 AmandeepSingh285 changed the title [CELEBORN-194] [WIP] Adding client side metrics for celeborn [CELEBORN-194] Adding client side metrics for celeborn Jun 22, 2026
@AmandeepSingh285

Contributor Author

Gentle follow-up ping @SteNicholas @RexXiong. I would appreciate a high-level review of the proposed approach whenever you have some time. Thanks!

@AmandeepSingh285 AmandeepSingh285 changed the title [CELEBORN-194] Adding client side metrics for celeborn [CELEBORN-194][WIP] Adding client side metrics for celeborn Jun 22, 2026
@SteNicholas SteNicholas requested a review from Copilot June 22, 2026 10:13

Copilot AI left a comment

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Adds client-side metrics collection in Celeborn clients and ships those metrics to the master via application heartbeats, where they are re-exposed on the master Prometheus endpoint labeled by applicationId.

Changes:

  • Extend HeartbeatFromApplication (and protobuf serde) to carry a clientMetrics map of {name -> (value, type)}.
  • Add client and master metric sources (CelebornClientSource, ApplicationMetricsSource) plus wiring in LifecycleManager/Master.
  • Introduce celeborn.client.metrics.enabled config and add unit-test coverage for serde + source behavior.
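
For readers who want a concrete picture of the {name -> (value, type)} payload described above, here is a minimal Scala sketch. It borrows the names ClientMetric and MetricType from the file summary, but the fields, the snapshot helper, and the metric name strings are assumptions made for illustration and may not match the PR's actual definitions.

```scala
// Hypothetical sketch: the type names come from the PR summary, but the
// fields and helper below are assumptions, not the PR's actual code.
sealed trait MetricType
object MetricType {
  case object Gauge extends MetricType
  case object Counter extends MetricType
}

// One entry in the heartbeat's clientMetrics map: a value plus its type.
final case class ClientMetric(value: Long, metricType: MetricType)

object ClientMetricsSnapshot {
  // Builds the {name -> (value, type)} map a heartbeat could carry.
  // If a name appears in both inputs, the counter entry wins.
  def snapshot(
      gauges: Map[String, Long],
      counters: Map[String, Long]): Map[String, ClientMetric] = {
    val gaugeEntries = gauges.map { case (name, v) =>
      name -> ClientMetric(v, MetricType.Gauge)
    }
    val counterEntries = counters.map { case (name, v) =>
      name -> ClientMetric(v, MetricType.Counter)
    }
    gaugeEntries ++ counterEntries
  }
}
```

Tagging each value with its type is what would let the master decide whether to overwrite a value (gauge) or accumulate it (counter) when re-exposing it per applicationId.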

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
|---|---|
| master/src/test/scala/org/apache/celeborn/service/deploy/master/ApplicationMetricsSourceSuite.scala | Adds unit tests for master-side application metrics source behavior. |
| master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala | Registers the new application metrics source and plumbs heartbeat clientMetrics through. |
| master/src/main/scala/org/apache/celeborn/service/deploy/master/ApplicationMetricsSource.scala | Implements master-side cache + Prometheus re-export of client metrics by applicationId. |
| docs/configuration/metrics.md | Documents new celeborn.client.metrics.enabled config. |
| common/src/test/scala/org/apache/celeborn/common/util/UtilsSuite.scala | Adds serde round-trip test for clientMetrics in heartbeats. |
| common/src/main/scala/org/apache/celeborn/common/protocol/message/ControlMessages.scala | Extends heartbeat message, protobuf encoding/decoding for client metrics. |
| common/src/main/scala/org/apache/celeborn/common/metrics/source/AbstractSource.scala | Adds Role.CLIENT label behavior and counterExists helper. |
| common/src/main/scala/org/apache/celeborn/common/metrics/ClientMetric.scala | Introduces ClientMetric + MetricType shared representation. |
| common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala | Adds celeborn.client.metrics.enabled config entry and accessor. |
| common/src/main/proto/TransportMessages.proto | Adds clientMetrics field + metric type/message definitions to heartbeat protobuf. |
| client/src/test/scala/org/apache/celeborn/client/WorkerStatusTrackerSuite.scala | Adds test coverage for excluded-worker metrics behavior. |
| client/src/test/scala/org/apache/celeborn/client/CelebornClientSourceSuite.scala | Adds unit tests for client metric source counters/gauges + snapshot types. |
| client/src/main/scala/org/apache/celeborn/client/commit/ReducePartitionCommitHandler.scala | Increments client “shuffle data lost” metric on lost-file conditions when enabled. |
| client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala | Creates client metrics source, registers gauges, increments counters, and supplies snapshots to heartbeats. |
| client/src/main/scala/org/apache/celeborn/client/ChangePartitionManager.scala | Increments revive-failure metrics when change partition assignment fails. |
| client/src/main/scala/org/apache/celeborn/client/CelebornClientSource.scala | Implements client-side metrics source + snapshot export for heartbeat payload. |
| client/src/main/scala/org/apache/celeborn/client/ApplicationHeartbeater.scala | Adds callback to attach client metrics to each HeartbeatFromApplication. |

Comment thread client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala Outdated
Comment thread client/src/main/scala/org/apache/celeborn/client/CelebornClientSource.scala Outdated

@SteNicholas SteNicholas left a comment

Member

@AmandeepSingh285, thanks for working on client-side metrics — the overall shape (client AbstractSource snapshot → heartbeat → master re-expose) is reasonable and the serde/config/docs are wired up. Since it's marked [WIP], I'm leaving review comments rather than approving; there are a few correctness/lifecycle issues worth resolving first (inline). Summary, most-impactful first:

  1. Client metrics source + cleaner thread leak (per Spark driver). clientSource is created unconditionally and never destroyed — see inline on LifecycleManager.scala:226.
  2. Master re-registers dead apps' metrics → permanent leak. Master is a plain RpcEndpoint (not ThreadSafeRpcEndpoint), so its Inbox runs with enableConcurrent = true and heartbeats are processed concurrently. A heartbeat racing/after handleAppLost resurrects the app's per-app gauges/counters, which are then never cleaned. See inline on ApplicationMetricsSource.updateApplicationMetrics.
  3. Non-atomic counter delta + fragile absolute→delta conversion. Concurrent heartbeats for one app double-count; app-restart-with-same-id or heartbeat reordering corrupt the delta. See inline on updateCounter.
  4. Unbounded master cardinality + silent truncation. Per-applicationId labeling has no top-N cap and no master-side enable flag; AbstractSource.getMetrics truncates at metricsCapacity (4096) and emits counters last, so app counters drop first. The codebase already solved this for worker per-app metrics via celeborn.metrics.worker.app.topResourceConsumption.count (default 0/off). See inline on Master.scala.
  5. Gauge flaps to 0 on removal (cache cleared before the gauge is unregistered) — inline on removeApplicationMetrics.
  6. Metric semantics: ClientReviveFailCount is incremented by changePartitions.size in one place but by +1 in handleRevive, and ClientShuffleDataLostCount is bumped in both handleMapPartitionEnd and ReducePartitionCommitHandler.stageEnd — mixed units / possible double-count. Inline on ChangePartitionManager.
  7. fromPb silently maps unknown PbMetricType to Gauge (forward-compat trap) — inline on ControlMessages.scala.
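
To illustrate item 3, one possible way to make the absolute-to-delta conversion atomic per (applicationId, metric) key is sketched below. CounterDeltaTracker is a hypothetical helper written for this comment, not the PR's ApplicationMetricsSource. It assumes a reported value lower than the last one means the client restarted; telling a restart apart from a reordered heartbeat would need something extra, such as a sequence number, which this sketch does not attempt.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.function.BiFunction

// Hypothetical helper for illustration; not part of the PR.
final class CounterDeltaTracker {
  // Last absolute counter value seen per (applicationId, metric name).
  private val lastSeen =
    new ConcurrentHashMap[(String, String), java.lang.Long]()

  /** Returns the increment to add to the master-side counter. */
  def delta(appId: String, name: String, absolute: Long): Long = {
    var result = 0L
    // compute() runs the function atomically for this key, so two
    // concurrent heartbeats for the same app cannot both read the old value.
    lastSeen.compute(
      (appId, name),
      new BiFunction[(String, String), java.lang.Long, java.lang.Long] {
        override def apply(
            key: (String, String),
            prev: java.lang.Long): java.lang.Long = {
          if (prev == null) {
            result = absolute // first report for this key
          } else if (absolute >= prev.longValue()) {
            result = absolute - prev.longValue() // normal monotonic growth
          } else {
            result = absolute // assumed restart: counter began again from 0
          }
          java.lang.Long.valueOf(absolute)
        }
      })
    result
  }
}
```

With this shape, reporting 5 and then 8 for the same key yields deltas of 5 and 3, and repeating 8 yields 0, so a duplicated heartbeat does not double-count.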

Minor / cleanup (no inline needed): the if (clientMetricsEnabled) clientSource.incCounter(...) guard is copy-pasted ~12×, and ReducePartitionCommitHandler recomputes the gate inline instead of reusing the cached clientMetricsEnabled field — a single incClientMetric(name, n) helper would centralize the gate and avoid the semantic drift in (6). ClientMetric/MetricType also duplicate proto PbClientMetric/PbMetricType (4 spots to keep in lockstep). Test gap: the counter-delta path in ApplicationMetricsSource is untested (ApplicationMetricsSourceSuite only sends MetricType.Gauge).
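
A rough sketch of the single-helper idea from the cleanup note, under stated assumptions: SimpleCounterSource is a hypothetical stand-in for the client metrics source (the real one extends AbstractSource and has a different API), and ClientMetricsGate is an invented wrapper name. Only the incClientMetric(name, n) shape comes from the note above.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for the client metrics source, for illustration only.
final class SimpleCounterSource {
  private val counters = new ConcurrentHashMap[String, AtomicLong]()

  def incCounter(name: String, n: Long): Unit = {
    // putIfAbsent keeps the first AtomicLong if two threads race here.
    counters.putIfAbsent(name, new AtomicLong(0L))
    counters.get(name).addAndGet(n)
  }

  def count(name: String): Long = {
    val c = counters.get(name)
    if (c == null) 0L else c.get()
  }
}

// Centralizes the "is client metrics enabled" check in one place so call
// sites no longer repeat the guard. The class name is an assumption.
final class ClientMetricsGate(enabled: Boolean, source: SimpleCounterSource) {
  def incClientMetric(name: String, n: Long = 1L): Unit =
    if (enabled) source.incCounter(name, n)
}
```

Call sites would then shrink to something like gate.incClientMetric("ClientReviveFailCount", failedCount), which also gives one place to settle whether the unit is per request or per partition.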

Comment thread client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala Outdated
Comment thread client/src/main/scala/org/apache/celeborn/client/ChangePartitionManager.scala Outdated
@AmandeepSingh285

Contributor Author

Thanks @SteNicholas for the review. I am still working on improving this PR and will take into account all the updates you mentioned. Thanks!

@AmandeepSingh285 AmandeepSingh285 changed the title [CELEBORN-194][WIP] Adding client side metrics for celeborn [CELEBORN-194] Adding client side metrics for celeborn Jul 2, 2026
@SteNicholas SteNicholas changed the title [CELEBORN-194] Adding client side metrics for celeborn [CELEBORN-194] Introduce client side metrics for celeborn Jul 2, 2026
@SteNicholas SteNicholas requested a review from Copilot July 2, 2026 08:30
@SteNicholas

Member

@AmandeepSingh285, please resolve the conflicts first.

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.

Comment on lines +234 to +245
clientSource.foreach { source =>
  source.addGauge(CelebornClientSource.ACTIVE_SHUFFLE_COUNT) { () =>
    registeredShuffle.size
  }
  source.addGauge(CelebornClientSource.EXCLUDED_WORKER_COUNT) { () =>
    workerStatusTracker.excludedWorkers.size
  }
  source.addGauge(CelebornClientSource.SHUTTING_WORKER_COUNT) { () =>
    workerStatusTracker.shuttingWorkers.size
  }
  source.start()
}
Comment on lines +61 to +63
def start(): Unit = startCleaner()

def stop(): Unit = metricsCleaner.shutdown()
errors.get())
}

test("recordWorkerFailure increments client worker-excluded counter and gauge") {
@AmandeepSingh285 AmandeepSingh285 force-pushed the adding-client-side-metrics branch from 36a9603 to 438f541 Compare July 2, 2026 09:11

3 participants