fix(agent,agent-installer): fail the install if the agent tunnel can't reach the gateway#1837
Conversation
…t reach the gateway Enrollment proves only the HTTPS/TCP path; a firewall blocking the QUIC/UDP tunnel (UDP 4433) could let enrollment succeed yet leave the tunnel dead, while the installer still reported success. `agent up` now performs a one-shot QUIC + mTLS connectivity probe to the gateway right after enrolling, and exits non-zero on failure — which the enrollment custom action already turns into a failed (and rolled-back) install. - agent: `probe_connectivity` reuses the live connect path (one handshake + bounded drain); no standalone subcommand or heartbeat round-trip. - agent-installer: on a failed `up`, roll back a freshly-persisted enrollment only when `up` actually wrote new certs (guarded, fails safe), never the prior install's. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Let maintainers know that an action is required on their side
|
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nnect Inline route_advertisements back into run_single_connection (matching the original) and keep run_single_connection at its prior position, so only the connect logic is extracted into the shared connect_to_gateway the probe reuses. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Ensures the Windows agent installer doesn’t report success when enrollment succeeds over HTTPS/TCP but the actual agent tunnel (QUIC/UDP) can’t reach the gateway. The agent now performs a one-shot QUIC+mTLS handshake after enrollment and fails the install (with guarded rollback) if the UDP path is blocked.
Changes:
- Windows MSI custom action now rolls back a freshly persisted enrollment when
agent.exe upfails or times out before the rollback marker is written. - Agent tunnel code factors QUIC connect logic into
connect_to_gateway()and addsprobe_connectivity()(QUIC+mTLS handshake + close/drain), including unit tests. agent.exe upnow probes QUIC connectivity after enrolling and exits non-zero on failure so the installer can fail/roll back.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| package/AgentWindowsManaged/Actions/CustomActions.cs | Adds guarded rollback on up timeout/non-zero exit before the marker exists, preventing “successful” installs when QUIC connectivity is broken. |
| devolutions-agent/src/tunnel.rs | Adds a reusable QUIC connect helper and a connectivity probe API (with tests) used to validate QUIC/UDP reachability. |
| devolutions-agent/src/main.rs | Updates up command to run the connectivity probe immediately after enrollment so failures propagate to the installer. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Confirm the QUIC/UDP path to the gateway is open by completing one mTLS+QUIC handshake, then | ||
| /// draining the connection, bounded by `timeout`. |
| // Throwaway PEMs so the pre-connect file reads succeed; nothing listens on the target | ||
| // port, so the handshake never completes and the probe must hit its own timeout. |
- Rename RollBackFailedEnrollment -> RollbackFailedEnrollment to match the existing Rollback* helpers (RollbackConfig, RollbackEnrollAgentTunnel). - Clarify probe_connectivity doc: `timeout` bounds the handshake; the best-effort drain adds up to ~3s on top. - Soften the unreachable-gateway test comment: the connect may fail before the timeout (e.g. immediate ICMP error); the test only asserts it's bounded. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| // Throwaway PEMs so the pre-connect file reads succeed; nothing listens on the target | ||
| // port, so the probe fails quickly — via its own timeout or an immediate connect error. | ||
| let cert_key = | ||
| rcgen::generate_simple_self_signed(vec!["localhost".to_owned()]).expect("generate self-signed cert"); |
There was a problem hiding this comment.
issue: Use a pre-generated certificate / key pair instead of generating a new one on each run. It slows down the testsuite.
| let result = probe_connectivity(&conf, Duration::from_secs(2)).await; | ||
|
|
||
| assert!(result.is_err(), "probe must fail when the gateway is unreachable"); | ||
| assert!(started.elapsed() < Duration::from_secs(15), "probe must fail fast"); |
There was a problem hiding this comment.
15 seconds is overly generous. For a unit test it should not exceed 1 second at most.
| let mut conf = tunnel_conf_template(); | ||
| conf.enabled = false; | ||
|
|
||
| let error = probe_connectivity(&conf, Duration::from_secs(5)) |
There was a problem hiding this comment.
Same, a unit test should not run more than 1 sec.
| // `up` enrolls then probes, so a non-zero exit can leave a freshly-persisted | ||
| // enrollment on disk with no marker yet; undo it (guarded against early failures). | ||
| RollbackFailedEnrollment(session, agentJsonPath, originalTunnel, originalGatewayCaB64, originalStateCaptured); |
There was a problem hiding this comment.
suggestion: Add a separate command specifically for probing. The probing should be used as a diagnostic, and the enrollment action doing "enrollment + diagnostic probe" is probably not granular enough.
Enrollment only proves the HTTPS/TCP path, but the tunnel runs over QUIC/UDP (4433). When UDP is blocked, enrollment still succeeds, so the installer reported success while the agent never connected.
Fix: after enrolling,
agent upperforms a one-shot QUIC + mTLS handshake to the gateway and exits non-zero on failure — whichEnrollAgentTunnelalready turns into a failed, rolled-back install. The standaloneprobe-tunnelsubcommand and the heartbeat round-trip are gone. The installer-side cleanup rolls back only a freshly-persisted enrollment, never a prior install's certs (guarded, fails safe).Verified on a lab agent VM: UDP 4433 open → install succeeds; blocked → install fails and rolls back, no orphaned artifacts.
Out-of-scope follow-ups: the kill-mid-enroll window (pre-existing) and a gateway-side connection-identity check on
unregister.