[CELEBORN-2375] Support DecommissionThenIdle and Recommission via wor…#3753

Open
gaoyajun02 wants to merge 1 commit into
apache:mainfrom
gaoyajun02:CELEBORN-2375

gaoyajun02 commented Jun 30, 2026

What changes were proposed in this pull request?

Introduces a new POST /api/v1/workers/events endpoint to the Worker HTTP API that allows operators to trigger worker state transitions directly on a specific worker node, without going through the Master:

POST /api/v1/workers/events
{"eventType": "DECOMMISSIONTHENIDLE"} # drain traffic, stay alive
{"eventType": "RECOMMISSION"} # bring worker back to Normal

The endpoint is backed by:

  • WorkerEventRequest — a new request model exposing only the two state-transition event types (DECOMMISSIONTHENIDLE, RECOMMISSION), deliberately excluding exit-bound types (DECOMMISSION, GRACEFUL, IMMEDIATELY) which belong to the existing /exit endpoint.
  • Worker#workerEvent() — routes the event to WorkerStatusManager#doTransition(), reusing the existing state-machine logic.
  • WorkerApi#workerEvent() — corresponding method in the generated Java client.
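
The routing described in the list above can be illustrated in isolation. What follows is a minimal, self-contained sketch and not Celeborn's actual code: `WorkerEventSketch`, its `State` enum, and its `Result` class are hypothetical names, and the two-state machine (Normal and Idle) is a simplifying assumption based on the comments in the request example. It shows a model that accepts only the two state-transition events and rejects a missing or exit-bound type before any transition happens.

```java
import java.util.Locale;

/** Hypothetical, simplified model of a worker-local event handler. */
public class WorkerEventSketch {

  /** Only the two state-transition events; exit-bound types are deliberately absent. */
  enum EventType { DECOMMISSIONTHENIDLE, RECOMMISSION }

  /** Assumed, simplified worker states; the real state machine has more. */
  enum State { NORMAL, IDLE }

  /** Hypothetical stand-in for a handle response: success flag plus message. */
  static final class Result {
    final boolean success;
    final String message;

    Result(boolean success, String message) {
      this.success = success;
      this.message = message;
    }
  }

  private State state = State.NORMAL;

  State state() {
    return state;
  }

  /** Validates the raw event type, then applies the matching transition. */
  Result handle(String rawEventType) {
    if (rawEventType == null) {
      return new Result(false, "eventType is required");
    }
    EventType type;
    try {
      type = EventType.valueOf(rawEventType.toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      // DECOMMISSION, GRACEFUL, IMMEDIATELY and anything else land here.
      return new Result(false, "unsupported eventType: " + rawEventType);
    }
    state = (type == EventType.DECOMMISSIONTHENIDLE) ? State.IDLE : State.NORMAL;
    return new Result(true, "transitioned to " + state);
  }
}
```

Keeping the exit-bound values out of the enum means an unsupported type fails at parse time, so the state is never touched for a request that belongs to the /exit endpoint.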

Why are the changes needed?

The existing POST /api/v1/workers/exit endpoint conflates two semantically different operations — traffic draining and process termination — by accepting DECOMMISSION, GRACEFUL, and IMMEDIATELY as exit types. DecommissionThenIdle is fundamentally different: it removes the worker from the serving path while keeping the process alive, making it unsuitable for /exit both semantically and operationally.

The Master already exposes POST /api/v1/workers/events for cluster-wide worker lifecycle management. However, that endpoint requires the caller to resolve the current Master leader and know the target worker's full identity before dispatching the request. In practice — especially in multi-cluster deployments where dozens of independent Celeborn clusters share the same operations platform — requiring maintenance scripts to first discover the active Master leader per cluster significantly increases operational complexity and the blast radius of mis-targeting. A worker-local endpoint removes this indirection entirely: the script targets the exact host being decommissioned, sends a single HTTP request, and the worker handles the rest. A subsequent RECOMMISSION call to the same host restores the worker to service without a restart, making the operation fully reversible.
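
The single-request flow described above might look like the following from a maintenance tool. This is a sketch under stated assumptions: the class name `WorkerEventClientSketch`, the host `worker-1.example.com`, and port `9096` are hypothetical placeholders, and the body uses the `eventType` field name from the request example in this description (the OpenAPI excerpt quoted in the review uses `event_type`, so the name should be checked against the merged spec). It only builds the requests with the JDK's `java.net.http` API and does not send them.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds worker-local event requests. */
public class WorkerEventClientSketch {

  /** JSON body using the field name shown in the PR description (an assumption). */
  static String body(String eventType) {
    return "{\"eventType\": \"" + eventType + "\"}";
  }

  /** Builds, but does not send, the POST request for one specific worker host. */
  static HttpRequest eventRequest(String host, int port, String eventType) {
    return HttpRequest.newBuilder()
        .uri(URI.create("http://" + host + ":" + port + "/api/v1/workers/events"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body(eventType), StandardCharsets.UTF_8))
        .build();
  }

  public static void main(String[] args) {
    // Drain the worker, then later restore it; both calls target the same host.
    HttpRequest drain = eventRequest("worker-1.example.com", 9096, "DECOMMISSIONTHENIDLE");
    HttpRequest restore = eventRequest("worker-1.example.com", 9096, "RECOMMISSION");
    System.out.println(drain.method() + " " + drain.uri());
    System.out.println(restore.method() + " " + restore.uri());
  }
}
```

To actually dispatch a request, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. No Master leader lookup is involved because the URI already names the target worker.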

Does this PR resolve a correctness bug?

  • Yes

Does this PR introduce any user-facing change?

  • Yes

A new REST endpoint POST /api/v1/workers/events is added to the Worker HTTP API, along with a corresponding workerEvent() method in the generated Java client (WorkerApi). Operators can now trigger DECOMMISSIONTHENIDLE and RECOMMISSION state transitions directly on a worker node without routing through the Master.

How was this patch tested?

Unit tests (UT).

…ker-local HTTP API

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov Bot commented Jun 30, 2026

Codecov Report

❌ Patch coverage is 32.07547% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.68%. Comparing base (3820244) to head (c4aba7a).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../org/apache/celeborn/rest/v1/worker/WorkerApi.java | 0.00% | 18 Missing ⚠️ |
| ...che/celeborn/rest/v1/model/WorkerEventRequest.java | 50.00% | 16 Missing and 1 partial ⚠️ |
| ...rg/apache/celeborn/server/common/HttpService.scala | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3753      +/-   ##
============================================
+ Coverage     57.53%   57.68%   +0.15%     
  Complexity      214      214              
============================================
  Files           396      397       +1     
  Lines         27857    27910      +53     
  Branches       2710     2713       +3     
============================================
+ Hits          16025    16096      +71     
+ Misses        10682    10663      -19     
- Partials       1150     1151       +1     


SteNicholas requested a review from Copilot June 30, 2026 16:03

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment on lines +99 to +102
def workerEvent(request: WorkerEventRequest): HandleResponse = {
  if (request.getEventType == null) {
    return new HandleResponse().success(false).message("eventType is required")
  }
Comment on lines +249 to +254
requestBody:
  content:
    application/json:
      schema:
        $ref: '#/components/schemas/WorkerEventRequest'
responses:
Comment on lines +784 to +797
WorkerEventRequest:
  type: object
  properties:
    event_type:
      type: string
      description: |
        The type of the worker event.
        Legal types are 'DECOMMISSIONTHENIDLE' and 'RECOMMISSION'.
      enum:
        - DECOMMISSIONTHENIDLE
        - RECOMMISSION
  required:
    - event_type
