
[multicast,instance,test-flake] reorder instance_stop and send stop before tearing down multicast member state #10402

Open
zeeshanlakhani wants to merge 4 commits into main from zl/flake-test_join_by_ip_existing_group


@zeeshanlakhani
Collaborator

Handles (closes) #9711 (was fixed downstream in a PR'ed branch).

@zeeshanlakhani
Collaborator Author

zeeshanlakhani commented May 22, 2026

@jgallagher minor one up here, btw. Not so minor, but I think this is much cleaner now. I ran the multicast suite on this locally, since it is turned off on branches not affiliated with #9912.

Comment thread nexus/src/app/instance.rs Outdated
This addresses a review comment on the synchronous `multicast_group_members_detach_by_instance` call
introduced by the prior reorder: a transient DB failure would return a 500 to the caller for a stop that
had already succeeded at the sled, and would also short-circuit past the reconciler activation. The state
would self-heal via the reconciler, but a visible failure for a successful operation would be a regression.
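
To make the failure mode concrete, here is a minimal sketch of the ordering problem. It is a toy model, not the Nexus code: `StopError`, `send_stop_to_sled`, `detach_members`, and `stop_with_sync_detach` are all hypothetical names, and the boolean flags stand in for the real sled and DB calls.

```rust
// Hypothetical model of the ordering problem; none of these names come from Nexus.
#[derive(Debug, PartialEq)]
enum StopError {
    SledUnreachable,
    DetachFailed,
}

// Stand-in for asking the sled to stop the instance.
fn send_stop_to_sled(sled_ok: bool) -> Result<(), StopError> {
    if sled_ok { Ok(()) } else { Err(StopError::SledUnreachable) }
}

// Stand-in for the synchronous member detach, which can hit a transient DB error.
fn detach_members(db_ok: bool) -> Result<(), StopError> {
    if db_ok { Ok(()) } else { Err(StopError::DetachFailed) }
}

// The problematic shape: a detach error propagates via `?` after the stop has
// already succeeded, and the reconciler activation below is never reached.
fn stop_with_sync_detach(
    sled_ok: bool,
    db_ok: bool,
    reconciler_activated: &mut bool,
) -> Result<(), StopError> {
    send_stop_to_sled(sled_ok)?;
    detach_members(db_ok)?;
    *reconciler_activated = true;
    Ok(())
}
```

With `sled_ok = true` and `db_ok = false`, the caller gets an error even though the stop took effect, and the activation flag stays `false`.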

Instead, we move the detach into the `instance_update` saga's `siu_commit_instance_updates` action,
gated on `update.deprovision.is_some()`, which is how the saga signals that an instance has reached the
no-active-VMM state. That saga already orchestrates terminal-VMM cleanup, so the detach fits naturally there.
As a side effect, this also covers the guest-initiated shutdown and sled-agent-reported failure paths that
the `instance_stop` call site did not cover. The reconciler remains in place as a backstop.
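
The gating can be sketched roughly as below. This is a simplified stand-in rather than the saga action itself: only the `deprovision` field name and the `is_some()` check come from the description above, while `UpdateSketch`, `Deprovision`, `Action`, and `commit_actions` are hypothetical.

```rust
// Hypothetical stand-ins; only `deprovision` and the `is_some()` gate are from the PR text.
struct Deprovision;

struct UpdateSketch {
    // `Some` is assumed to mean the instance has no active VMM left.
    deprovision: Option<Deprovision>,
}

#[derive(Debug, PartialEq)]
enum Action {
    WriteInstanceState,
    DetachMulticastMembers,
}

// Lists what the commit step would do for a given update.
fn commit_actions(update: &UpdateSketch) -> Vec<Action> {
    let mut actions = vec![Action::WriteInstanceState];
    // The detach rides along only when the update deprovisions the instance,
    // whichever path (API stop, guest shutdown, reported failure) led here.
    if update.deprovision.is_some() {
        actions.push(Action::DetachMulticastMembers);
    }
    actions
}
```

Because the gate keys off the update rather than the caller, any path that produces a deprovisioning update picks up the detach.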

This change also includes a reconciler nudge in `instance_stop` for the case where the saga does not fire
because there was no terminal VMM transition to drive it in the first place, instead of waiting for the
full reconciler pass on the next tick.
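
As a rough illustration of how the two cleanup paths combine, the sketch below assumes the nudge is issued on every stop and that the saga runs only after a terminal VMM transition. Both are assumptions made for the sketch; `CleanupPath` and `cleanup_paths_after_stop` are hypothetical names.

```rust
// Hypothetical model; assumes the nudge is unconditional, which the PR text does not confirm.
#[derive(Debug, PartialEq)]
enum CleanupPath {
    UpdateSagaDetach,
    ReconcilerNudge,
}

// Returns which cleanup paths would run after a stop request.
fn cleanup_paths_after_stop(terminal_vmm_transition: bool) -> Vec<CleanupPath> {
    let mut paths = Vec::new();
    // The update saga is driven by a terminal VMM transition, so it is absent otherwise.
    if terminal_vmm_transition {
        paths.push(CleanupPath::UpdateSagaDetach);
    }
    // The nudge covers the no-transition case without waiting for the next periodic tick.
    paths.push(CleanupPath::ReconcilerNudge);
    paths
}
```

In this model a stop with no terminal VMM transition still gets prompt cleanup through the nudge alone.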
@zeeshanlakhani
Collaborator Author

This included a merge of main over the top of the fixes.


2 participants