core: fix Stokesian Dynamics MPI deadlock on missing per-type radius (bug-sweep #25) #5359
Draft
RudolfWeeber wants to merge 1 commit into
Bug: StokesianDynamics::propagate_vel_pos evaluated radii.at(p.type) inside
the rank-0-only block, between the collective gather_buffer and the collective
scatter_buffer. For a particle whose type has no registered radius, at() threw
std::out_of_range on rank 0, which unwound out of the integration loop (no
parallel_try_catch on that call path) before reaching MPI_Scatterv. In
multi-rank runs the other ranks then blocked forever in the matching
MPI_Scatterv, giving an indefinite MPI deadlock; in serial runs the result was
an uncoordinated std::out_of_range abort.
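
The failure mode can be reproduced without MPI. The sketch below is not the ESPResSo code: `rank0_step_with_at` is a hypothetical stand-in for the rank-0-only section, and a boolean flag stands in for reaching the collective scatter. It only shows that `std::unordered_map::at()` throws on a missing key and that the statement after the loop is then skipped.

```cpp
#include <stdexcept>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the rank-0-only section. Returns true only if
// the step that models the collective scatter was reached.
bool rank0_step_with_at(std::unordered_map<int, double> const &radii,
                        std::vector<int> const &particle_types) {
  bool reached_scatter = false;
  try {
    for (int type : particle_types) {
      // at() throws std::out_of_range when the type has no radius
      double const radius = radii.at(type);
      (void)radius; // the real code would feed this to the solver
    }
    reached_scatter = true; // models the call to the collective scatter
  } catch (std::out_of_range const &) {
    // Per the description above, the real call path has no handler before
    // the scatter; this catch exists only so the sketch can report the
    // skipped step instead of terminating.
  }
  return reached_scatter;
}
```

With radii registered for type 0 only, a particle list of `{0}` reaches the scatter step and `{0, 1}` does not, which is the point at which the other ranks would be left waiting.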
Fix: replace radii.at() with radii.find() and, on a missing type, register a
coordinated runtime error via runtimeErrorMsg() instead of throwing. The rank-0
branch still falls through to the collective scatter_buffer (shipping zeroed
velocities), so every rank reaches MPI_Scatterv and no deadlock occurs. The
integration loop's collective check_runtime_errors(comm_cart) then turns the
registered message into a clean cross-rank ESPResSo runtime error
("Stokesian Dynamics: no radius defined for particle type N"). This mirrors the
existing SD precondition checks in integrate.cpp.
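
The shape of the fix can be sketched in the same MPI-free way. Everything here is an assumption made for illustration: `StepResult`, `rank0_step_with_find` and the `errors` vector are hypothetical, with the vector standing in for the message registered through runtimeErrorMsg() (whose actual interface is not shown in this description), and one velocity entry per particle is a simplification.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type: what would be handed to the scatter, plus a
// flag that models whether the collective call was reached.
struct StepResult {
  std::vector<double> velocities;
  bool reached_scatter = false;
};

// Hypothetical stand-in for the fixed rank-0 section. `errors` models the
// registered runtime error messages; it is not the ESPResSo API.
StepResult rank0_step_with_find(std::unordered_map<int, double> const &radii,
                                std::vector<int> const &particle_types,
                                std::vector<std::string> &errors) {
  StepResult result;
  // Start from zeroed velocities so there is always something to scatter.
  result.velocities.assign(particle_types.size(), 0.0);

  bool all_radii_known = true;
  for (int type : particle_types) {
    auto const it = radii.find(type); // find() never throws on a missing key
    if (it == radii.end()) {
      errors.push_back(
          "Stokesian Dynamics: no radius defined for particle type " +
          std::to_string(type));
      all_radii_known = false;
      break;
    }
  }

  if (all_radii_known) {
    // A real solver would overwrite result.velocities here.
  }

  // Reached on both the success and the error path.
  result.reached_scatter = true;
  return result;
}
```

The design point is that the error is recorded rather than thrown, so control flow on rank 0 stays identical to the success path up to and including the collective call; reporting is deferred to a later step that every rank takes part in.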
Test: testsuite/python/stokesian_missing_radius.py (NO_MPI) launches the
offending scenario as a child mpiexec -n 2 job under a 45 s timeout; on the
unfixed core it times out (deadlock), on the fixed core it exits cleanly with
the coordinated error.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
🤖 Generated with Claude Code