[CELEBORN-2371] Bound Spark batch-open client creation retries and stop them on interruption #3746
    var streamCreatorPool: ThreadPoolExecutor = null

    // TransportClientFactory already retries each attempt; allow one extra pooled-client attempt.
    private val MAX_CLIENT_CREATION_ATTEMPTS_PER_HOST = 2
Could this max attempt be configurable?
@sunchao, thanks — the direction is right and the tests are focused. The bound on the outer loop, propagating interruption through bootstrap, and closing the in-progress channel on interrupt are all good improvements. A few points worth addressing before merge (none are hard blockers, but the first two are worth a response):
- The second per-host attempt is redundant. All locations grouped under a hostPort share the same host:fetchPort (grouping keys by hostAndFetchPort, and the createClient lambda uses only getHost/getFetchPort), so attempt #2 re-targets the same worker. See the inline note. MAX=1 would give the same result; if you keep 2, the "preserves same-worker fallback" comment is misleading.
- Interrupt handling is now inconsistent across createClient callers. findInterruptedException makes retryCreateClient set the interrupt flag and throw a bare InterruptedException on a wrapped interrupt, but only createClientsInParallel has matching case ex: InterruptedException handling. The other callers swallow it generically: CelebornInputStream.createReaderWithRetry (catches Exception, then excludeFailedFetchLocation + Uninterruptibles.sleepUninterruptibly, which re-asserts the flag), the non-parallel makeOpenStreamList, and the push/replicate paths. As a result, a cancellation gets recorded as a fetch/push failure (polluting shared cross-task exclusion state) and, because Netty's await() throws immediately when the flag is preset, cascades into fast-failing the remaining retries. Much of this cascade is pre-existing for direct interrupts; this PR widens it to wrapped ones. Worth either handling InterruptedException consistently at these call sites, or noting the trade-off.
- Minor: worker-thread interrupt surfaces as ExecutionException. When a worker run() throws InterruptedException (the new line-620 check, or a wrapped interrupt on the worker thread rethrown at line 628), futures.foreach(_.get()) raises ExecutionException, which is not matched by the case ex: InterruptedException at ~642, so it escapes without restoring the caller's interrupt status. Low impact, but the new code makes "InterruptedException out of run()" more reachable.
- Nit: reuse. findInterruptedException duplicates the cause-chain walk in MasterClient.findMasterNotLeaderException; Guava Throwables.getCausalChain() (already imported here) or commons-lang3 ExceptionUtils.throwableOfType / a shared ExceptionUtils helper would avoid a third hand-rolled copy (and could carry a cycle guard).
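The cause-chain walk from the last two points can be sketched with the JDK alone. This is a minimal illustration, not Celeborn code: the class name `CauseChains` and the method `findCause` are hypothetical, and the identity set is one possible form of the cycle guard mentioned above.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical helper (not part of Celeborn): typed search over a Throwable's cause chain. */
final class CauseChains {
  private CauseChains() {}

  /**
   * Returns the first throwable in the cause chain of {@code t} (including {@code t} itself)
   * that is an instance of {@code type}, or null if there is none.
   * The identity set stops the walk if the chain loops back on itself.
   */
  static <T extends Throwable> T findCause(Throwable t, Class<T> type) {
    Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
    for (Throwable cur = t; cur != null && seen.add(cur); cur = cur.getCause()) {
      if (type.isInstance(cur)) {
        return type.cast(cur);
      }
    }
    return null;
  }
}
```

A caller that catches ExecutionException from Future.get() could use `CauseChains.findCause(ex, InterruptedException.class)` to decide whether to restore its own interrupt status before rethrowing.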
    var streamCreatorPool: ThreadPoolExecutor = null

    // TransportClientFactory already retries each attempt; allow one extra pooled-client attempt.
    private val MAX_CLIENT_CREATION_ATTEMPTS_PER_HOST = 2
All locations grouped under one hostPort share the same host:fetchPort (groupOpenStreamLocations keys by hostAndFetchPort, and the createClient lambda at ~304 uses only getHost/getFetchPort). So the "second" attempt re-targets the same worker and re-runs TransportClientFactory's full maxIORetries connect budget against an already-dead host:port — there is no real same-worker fallback within a group (the replica lives on a different host = a separate group/task). MAX_CLIENT_CREATION_ATTEMPTS_PER_HOST = 1 would yield the same set of clients with half the connect-retry latency per dead worker; if you intentionally keep 2, the comment about "same-worker fallback" is misleading.
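The latency argument is plain multiplication, which a small sketch makes concrete. The helper below is hypothetical, and the inner attempt count and per-attempt cost used with it are made-up illustrative values, not Celeborn defaults or measured numbers.

```java
/** Hypothetical arithmetic helper; inputs are illustrative, not Celeborn defaults. */
final class RetryBudget {
  private RetryBudget() {}

  /** Worst-case connect attempts against one unreachable host:port. */
  static long worstCaseConnectAttempts(int outerAttempts, int innerAttemptsPerCall) {
    return (long) outerAttempts * innerAttemptsPerCall;
  }

  /** Worst-case time spent, assuming every inner attempt costs the same fixed time. */
  static long worstCaseMillis(
      int outerAttempts, int innerAttemptsPerCall, long millisPerInnerAttempt) {
    return worstCaseConnectAttempts(outerAttempts, innerAttemptsPerCall) * millisPerInnerAttempt;
  }
}
```

Assuming, purely for illustration, 3 inner connect attempts per createClient call: an unbounded loop over 100 locations allows 300 attempts against the dead worker, a bound of 2 allows 6, and a bound of 1 allows 3.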
    return createClient(remoteHost, remotePort, partitionId, supplier.get());
    } catch (Exception e) {
    if (e instanceof InterruptedException) {
    InterruptedException interruptedException = findInterruptedException(e);
Unwrapping a wrapped InterruptedException to stop retries is reasonable, but note this also calls Thread.currentThread().interrupt() and throws a bare InterruptedException from shared infra used by ~6 createClient callers, while only the parallel reader (createClientsInParallel) added a matching case ex: InterruptedException. The others — CelebornInputStream.createReaderWithRetry (catch (Exception) → excludeFailedFetchLocation + Uninterruptibles.sleepUninterruptibly, which re-asserts the flag), the non-parallel makeOpenStreamList, and the ShuffleClientImpl/PushDataHandler push paths — swallow it generically. Because Netty await() throws immediately when the interrupt flag is preset, the leftover flag makes each subsequent createClient fast-fail in a cascade, and the cancellation is recorded as a worker failure (shared, cross-task exclusion state on replicated clusters). Consider handling InterruptedException consistently across these call sites, or documenting the trade-off.
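One way to handle InterruptedException consistently at a call site is sketched below. It is a hypothetical, JDK-only retry loop rather than any of the Celeborn callers named above: `InterruptAwareRetry`, `callWithRetry`, and the `excludedHosts` set (standing in for the shared exclusion state) are all assumptions made for illustration.

```java
import java.util.Set;
import java.util.concurrent.Callable;

/** Hypothetical caller-side retry loop; none of these names are Celeborn APIs. */
final class InterruptAwareRetry {
  private InterruptAwareRetry() {}

  static <T> T callWithRetry(
      String host, int maxAttempts, Callable<T> attempt, Set<String> excludedHosts)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int i = 0; i < maxAttempts; i++) {
      // Do not start another attempt once cancellation has been observed.
      if (Thread.currentThread().isInterrupted()) {
        throw new InterruptedException("interrupted before attempt " + (i + 1));
      }
      try {
        return attempt.call();
      } catch (InterruptedException e) {
        // Cancellation is not a worker failure: restore the flag, skip exclusion, stop retrying.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        // Genuine failure: record it and let the loop try again.
        excludedHosts.add(host);
        last = e;
      }
    }
    throw last;
  }
}
```

The ordering of the catch clauses is the point of the sketch: the interrupt is separated from generic failures before anything is written to the exclusion state, so a cancelled task does not mark a healthy worker as failed.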
    },
    cf);
    assert cf.isDone();
    if (cf.isCancelled()) {
Minor / pre-existing but now inconsistent with the new cleanup paths: in this connectTimeoutMs <= 0 branch, the isCancelled() (339-341) and !isSuccess() (340-342) failures throw without closeChannel(cf), leaking the half-open channel — whereas the connectTimeoutMs > 0 branch below closes on both timeout and cause != null, and the new interrupt path closes too. Worth closing the channel here as well.
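The "close on every failure exit" shape can be sketched without Netty. The code below is a JDK-only stand-in under stated assumptions: `CompletableFuture` plays the role of the connect future, the hypothetical `FakeChannel.close()` plays the role of closeChannel(cf), and none of it reflects the actual TransportClientFactory code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

/** Hypothetical stand-in for the connect wait; uses JDK futures, not Netty. */
final class ConnectCleanup {
  private ConnectCleanup() {}

  /** Stand-in for a half-open channel; records whether close() was called. */
  static final class FakeChannel implements AutoCloseable {
    volatile boolean closed;

    @Override
    public void close() {
      closed = true;
    }
  }

  /**
   * Waits for the connect future. On any non-success exit (cancelled, failed, or
   * interrupted while waiting) the channel is closed before the exception propagates.
   */
  static FakeChannel awaitConnected(CompletableFuture<Void> connectFuture, FakeChannel channel)
      throws InterruptedException, ExecutionException {
    boolean success = false;
    try {
      // May throw CancellationException, ExecutionException, or InterruptedException.
      connectFuture.get();
      success = true;
      return channel;
    } finally {
      if (!success) {
        channel.close();
      }
    }
  }
}
```

Routing all exits through one finally block means a newly added failure branch cannot forget the cleanup, which is the inconsistency this comment points at.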
Why are the changes needed?
CELEBORN-2371 follows up on the parallel Spark batch-open client creation added by #3692.
The reader now creates clients for different workers in parallel, but each worker task still walks every PartitionLocation until one createClient call succeeds. A createClient call already has TransportClientFactory's complete retry budget, so the outer location loop can multiply that budget.

For example, suppose one reducer has 100 partition locations on worker-a and that worker is unavailable. The task for worker-a can make 100 outer createClient calls, and every call can run the configured transport retries. Because the reader waits for every worker task, the unavailable worker can keep the whole batch-open setup pending long after healthy workers have finished.

Cancellation also needs to terminate client creation consistently. If cancellation interrupts a failure callback, that callback can restore the interrupt flag and return; without checking the flag, the worker task starts the next location attempt. At the transport layer, synchronous bootstrap can wrap an InterruptedException in another exception, so the retry loop currently treats it as retryable. An interrupt while waiting for TCP connect or TLS setup also leaves cleanup of the in-progress channel implicit.

What changes were proposed in this PR?
Bound the per-worker outer retry loop
Each worker task now tries at most two partition locations. The first attempt keeps the normal path, and the second preserves same-worker fallback. Each attempt still receives the existing internal TransportClientFactory retry budget; this change only prevents that complete budget from being repeated once per partition location.

Stop the Spark worker task after cancellation
Before starting another same-worker attempt, CelebornShuffleReader checks the thread's interrupt flag and exits with InterruptedException. This complements the existing future cancellation path and prevents a callback that observed cancellation from falling through to another client creation.

Preserve interruption through transport bootstrap
TransportClientFactory now searches an exception's cause chain for InterruptedException, restores the thread's interrupt flag, and stops retrying immediately. The TCP-connect and TLS-handshake waits also share an interrupt-aware helper that closes the in-progress channel before propagating the interruption.

The tests cover the two-attempt bound, cancellation during the failure callback, a wrapped InterruptedException, and channel cleanup during an interrupted wait.

How was this PR tested?
Formatting was applied with:
The focused transport tests passed (11 tests):
The Spark 3.5 reader suite passed (7 tests):
Spark 4.0 / Scala 2.13 production and test compilation also passed: