Skip to content

fix(agent,agent-installer): fail the install if the agent tunnel can't reach the gateway#1837

Open
irvingouj@Devolutions (irvingoujAtDevolution) wants to merge 4 commits into
masterfrom
fix/agent-tunnel-connectivity-probe
Open

fix(agent,agent-installer): fail the install if the agent tunnel can't reach the gateway#1837
irvingouj@Devolutions (irvingoujAtDevolution) wants to merge 4 commits into
masterfrom
fix/agent-tunnel-connectivity-probe

Conversation

@irvingoujAtDevolution

@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Enrollment only proves the HTTPS/TCP path, but the tunnel runs over QUIC/UDP (4433). When UDP is blocked, enrollment still succeeds, so the installer reported success while the agent never connected.

Fix: after enrolling, agent up performs a one-shot QUIC + mTLS handshake to the gateway and exits non-zero on failure — which EnrollAgentTunnel already turns into a failed, rolled-back install. The standalone probe-tunnel subcommand and the heartbeat round-trip are gone. The installer-side cleanup rolls back only a freshly-persisted enrollment, never a prior install's certs (guarded, fails safe).

Verified on a lab agent VM: UDP 4433 open → install succeeds; blocked → install fails and rolls back, no orphaned artifacts.

Out-of-scope follow-ups: the kill-mid-enroll window (pre-existing) and a gateway-side connection-identity check on unregister.

…t reach the gateway

Enrollment proves only the HTTPS/TCP path; a firewall blocking the QUIC/UDP
tunnel (UDP 4433) could let enrollment succeed yet leave the tunnel dead, while
the installer still reported success.

`agent up` now performs a one-shot QUIC + mTLS connectivity probe to the gateway
right after enrolling, and exits non-zero on failure — which the enrollment
custom action already turns into a failed (and rolled-back) install.

- agent: `probe_connectivity` reuses the live connect path (one handshake +
  bounded drain); no standalone subcommand or heartbeat round-trip.
- agent-installer: on a failed `up`, roll back a freshly-persisted enrollment
  only when `up` actually wrote new certs (guarded, fails safe), never the
  prior install's.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

Let maintainers know that an action is required on their side

  • Add the label release-required Please cut a new release (Devolutions Gateway, Devolutions Agent, Jetsocat, PowerShell module) when you request a maintainer to cut a new release (Devolutions Gateway, Devolutions Agent, Jetsocat, PowerShell module)

  • Add the label release-blocker Follow-up is required before cutting a new release if a follow-up is required before cutting a new release

  • Add the label publish-required Please publish libraries (`Devolutions.Gateway.Utils`, OpenAPI clients, etc) when you request a maintainer to publish libraries (Devolutions.Gateway.Utils, OpenAPI clients, etc.)

  • Add the label publish-blocker Follow-up is required before publishing libraries if a follow-up is required before publishing libraries

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nnect

Inline route_advertisements back into run_single_connection (matching the
original) and keep run_single_connection at its prior position, so only the
connect logic is extracted into the shared connect_to_gateway the probe reuses.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Ensures the Windows agent installer doesn’t report success when enrollment succeeds over HTTPS/TCP but the actual agent tunnel (QUIC/UDP) can’t reach the gateway. The agent now performs a one-shot QUIC+mTLS handshake after enrollment and fails the install (with guarded rollback) if the UDP path is blocked.

Changes:

  • Windows MSI custom action now rolls back a freshly persisted enrollment when agent.exe up fails or times out before the rollback marker is written.
  • Agent tunnel code factors QUIC connect logic into connect_to_gateway() and adds probe_connectivity() (QUIC+mTLS handshake + close/drain), including unit tests.
  • agent.exe up now probes QUIC connectivity after enrolling and exits non-zero on failure so the installer can fail/roll back.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
package/AgentWindowsManaged/Actions/CustomActions.cs Adds guarded rollback on up timeout/non-zero exit before the marker exists, preventing “successful” installs when QUIC connectivity is broken.
devolutions-agent/src/tunnel.rs Adds a reusable QUIC connect helper and a connectivity probe API (with tests) used to validate QUIC/UDP reachability.
devolutions-agent/src/main.rs Updates up command to run the connectivity probe immediately after enrollment so failures propagate to the installer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread package/AgentWindowsManaged/Actions/CustomActions.cs Outdated
Comment thread package/AgentWindowsManaged/Actions/CustomActions.cs Outdated
Comment thread package/AgentWindowsManaged/Actions/CustomActions.cs Outdated
Comment thread devolutions-agent/src/tunnel.rs Outdated
Comment on lines +454 to +455
/// Confirm the QUIC/UDP path to the gateway is open by completing one mTLS+QUIC handshake, then
/// draining the connection, bounded by `timeout`.
Comment thread devolutions-agent/src/tunnel.rs Outdated
Comment on lines +709 to +710
// Throwaway PEMs so the pre-connect file reads succeed; nothing listens on the target
// port, so the handshake never completes and the probe must hit its own timeout.
- Rename RollBackFailedEnrollment -> RollbackFailedEnrollment to match the
  existing Rollback* helpers (RollbackConfig, RollbackEnrollAgentTunnel).
- Clarify probe_connectivity doc: `timeout` bounds the handshake; the
  best-effort drain adds up to ~3s on top.
- Soften the unreachable-gateway test comment: the connect may fail before the
  timeout (e.g. immediate ICMP error); the test only asserts it's bounded.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
// Throwaway PEMs so the pre-connect file reads succeed; nothing listens on the target
// port, so the probe fails quickly — via its own timeout or an immediate connect error.
let cert_key =
rcgen::generate_simple_self_signed(vec!["localhost".to_owned()]).expect("generate self-signed cert");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Use a pre-generated certificate / key pair instead of generating a new one on each run. It slows down the testsuite.

let result = probe_connectivity(&conf, Duration::from_secs(2)).await;

assert!(result.is_err(), "probe must fail when the gateway is unreachable");
assert!(started.elapsed() < Duration::from_secs(15), "probe must fail fast");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

15 seconds is overly generous. For a unit test it should not exceed 1 second at most.

let mut conf = tunnel_conf_template();
conf.enabled = false;

let error = probe_connectivity(&conf, Duration::from_secs(5))

@CBenoit Benoît Cortier (CBenoit) Jun 26, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, a unit test should not run more than 1 sec.

Comment on lines +529 to +531
// `up` enrolls then probes, so a non-zero exit can leave a freshly-persisted
// enrollment on disk with no marker yet; undo it (guarded against early failures).
RollbackFailedEnrollment(session, agentJsonPath, originalTunnel, originalGatewayCaB64, originalStateCaptured);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Add a separate command specifically for probing. The probing should be used as a diagnostic, and the enrollment action doing "enrollment + diagnostic probe" is probably not granular enough.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants