
fix(agent,agent-installer): fail the install if the agent tunnel can't reach the gateway #1835

Closed
irvingouj@Devolutions (irvingoujAtDevolution) wants to merge 1 commit into master from agent-tunnel-enroll-connectivity-probe

@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) commented Jun 23, 2026

Contributor

Split out of #1831.

Problem: the installer reports success on enrollment, not on tunnel connectivity

Symptom: the MSI install shows the Agent Tunnel step as success, but the agent never appears online in the Gateway / DVLS agent list.

Root cause: enrollment and the tunnel use two different network paths. EnrollAgentTunnel runs devolutions-agent up → enroll_agent(), whose success criteria were only:

  1. POST https://<gw>:7171/jet/tunnel/enroll (HTTPS management port, TCP) returns 2xx and issues the client cert, and
  2. the cert/key/CA + the Tunnel section are persisted to agent.json.

The actual data path — the QUIC tunnel over UDP (4433) — is established later by the agent service and was never probed at install time. Enrollment is TCP/7171; the tunnel is UDP/4433. So when UDP 4433 is blocked (firewall/NAT), the install is green while the agent silently fails to connect (Tunnel connection lost error=QUIC handshake: timed out).

Fix

After enrolling, agent up now performs a one-shot QUIC + mTLS connectivity probe to the gateway tunnel endpoint and exits non-zero on failure. EnrollAgentTunnel already checks up's exit code, so a blocked UDP path now fails the install and rolls back — giving the operator actionable feedback (verify UDP 4433 / firewall) while they're still at the machine.
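The "bounded one-shot probe, then exit code" shape described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: it uses only std threads and a channel in place of the real QUIC + mTLS handshake, and `ProbeOutcome`, `probe_with_timeout`, `exit_code`, and the `connect` closure are all hypothetical names introduced here.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of a single connectivity probe (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum ProbeOutcome {
    Connected,
    Failed(String),
    TimedOut,
}

/// Runs `connect` once on a worker thread and waits at most `timeout` for its result.
/// `connect` is a stand-in for the real handshake; it is an assumption of this sketch.
pub fn probe_with_timeout<F>(connect: F, timeout: Duration) -> ProbeOutcome
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already timed out, the receiver is gone; ignore the send error.
        let _ = tx.send(connect());
    });

    match rx.recv_timeout(timeout) {
        Ok(Ok(())) => ProbeOutcome::Connected,
        Ok(Err(reason)) => ProbeOutcome::Failed(reason),
        Err(mpsc::RecvTimeoutError::Timeout) => ProbeOutcome::TimedOut,
        // The worker ended without sending anything (for example, it panicked).
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            ProbeOutcome::Failed("probe ended without a result".to_string())
        }
    }
}

/// Maps the outcome to a process exit code: zero only for a completed handshake.
pub fn exit_code(outcome: &ProbeOutcome) -> i32 {
    match outcome {
        ProbeOutcome::Connected => 0,
        ProbeOutcome::Failed(_) | ProbeOutcome::TimedOut => 1,
    }
}
```

A caller would pass the outcome's exit code to `std::process::exit`, which is the signal the installer's existing exit-code check relies on. Treating a timeout the same as an explicit failure matters here, because a blocked UDP path typically shows up as silence rather than as an error.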

  • agent (tunnel.rs, main.rs): probe_connectivity reuses the live connect path (the same connect_to_gateway the running service uses) for a single mTLS + QUIC handshake, bounded by a timeout, then drains the connection (close + bounded wait_idle) so the gateway unregisters the probe promptly. A completed handshake is sufficient proof the UDP path is open, so the standalone probe-tunnel subcommand and the heavier heartbeat round-trip probe were removed.
  • agent-installer (CustomActions.cs): removed the in-CA subprocess probe (it lives in up now). On a failed up, both the timeout and non-zero-exit paths roll back a freshly-persisted enrollment via a guarded helper that cleans up only when up actually wrote new certs (uuid-named client cert path changed from the pre-up snapshot) — never the prior install's certs, and never when the pre-up snapshot couldn't be captured. Fails safe.
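The rollback guard in the second bullet boils down to a small decision rule. The sketch below models only that rule, in Rust for consistency with the agent side; the real helper lives in CustomActions.cs and is not shown in this PR description, so `Snapshot` and `cert_to_roll_back` are hypothetical names and the exact inputs are assumptions.

```rust
use std::path::{Path, PathBuf};

/// What the installer knew about the client cert before running `up`
/// (hypothetical type for illustration).
pub enum Snapshot {
    /// The pre-`up` configuration could not be read.
    Unavailable,
    /// The pre-`up` configuration was read; holds the client cert path, if one was set.
    Captured(Option<PathBuf>),
}

/// Returns the cert path that is safe to clean up after a failed `up`, if any.
/// `after` is the client cert path found in the configuration once `up` has exited.
pub fn cert_to_roll_back(before: &Snapshot, after: Option<&Path>) -> Option<PathBuf> {
    let previous = match before {
        // Fail safe: without a snapshot we cannot tell new certs from old ones.
        Snapshot::Unavailable => return None,
        Snapshot::Captured(path) => path.as_deref(),
    };

    match after {
        // The path changed, so `up` wrote a new cert: that one may be removed.
        Some(current) if Some(current) != previous => Some(current.to_path_buf()),
        // Unchanged or missing: the cert belongs to the prior install, leave it alone.
        _ => None,
    }
}
```

Under these assumptions, the only case that triggers cleanup is a captured snapshot plus a changed cert path, which matches the "never the prior install's certs" guarantee stated above.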

Testing

Validated end-to-end on a lab agent VM against the live gateway:

  • UDP 4433 open → probe succeeds, install completes.
  • UDP 4433 blocked (firewall rule) → up exits non-zero → install fails and rolls back, no orphaned artifacts left behind.

Unit tests: probe_fails_fast_when_tunnel_disabled, probe_times_out_when_gateway_unreachable.

Reviewed via an iterative Claude + Codex review loop (converged clean).

Known follow-ups (out of scope here)

  • Kill-mid-enroll window: if up is hard-killed (the installer's 60s timeout) after writing the cert files but before agent.json is updated, the new cert files can be orphaned and the fixed-name gateway-ca.pem may be left overwritten. This issue is pre-existing; the clean fix is a transactional up or an MSI cert-directory snapshot.
  • Gateway-side registry unregisters by agent_id without a connection-identity check; the probe's bounded drain compensates on this path (it runs before the service starts), but adding an identity/generation check gateway-side would close the race for good.
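To make the second follow-up concrete, here is one way an identity/generation check could look. It is a sketch under assumptions, not the gateway's actual registry: `Registry`, `register`, `unregister`, and `is_registered` are hypothetical names, and a plain counter stands in for whatever connection identity the gateway would really use.

```rust
use std::collections::HashMap;

/// Toy agent registry keyed by agent id (hypothetical, for illustration only).
#[derive(Default)]
pub struct Registry {
    next_generation: u64,
    /// Maps each agent id to the generation of its most recent connection.
    agents: HashMap<String, u64>,
}

impl Registry {
    /// Registers a connection for `agent_id` and returns its generation token.
    /// A later connection with the same id replaces the earlier entry.
    pub fn register(&mut self, agent_id: &str) -> u64 {
        self.next_generation += 1;
        self.agents.insert(agent_id.to_string(), self.next_generation);
        self.next_generation
    }

    /// Removes the entry only if it still belongs to the connection holding
    /// `generation`. Returns whether anything was removed.
    pub fn unregister(&mut self, agent_id: &str, generation: u64) -> bool {
        if self.agents.get(agent_id) == Some(&generation) {
            self.agents.remove(agent_id);
            true
        } else {
            // A newer connection owns the entry; a stale close must not evict it.
            false
        }
    }

    pub fn is_registered(&self, agent_id: &str) -> bool {
        self.agents.contains_key(agent_id)
    }
}
```

With a check like this, a late close from the probe connection would be ignored if the service had already reconnected under the same agent_id, so the bounded drain would no longer be the only thing preventing the race.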

@github-actions


Let maintainers know that an action is required on their side

  • Add the label release-required when you request a maintainer to cut a new release (Devolutions Gateway, Devolutions Agent, Jetsocat, PowerShell module)

  • Add the label release-blocker if a follow-up is required before cutting a new release

  • Add the label publish-required when you request a maintainer to publish libraries (Devolutions.Gateway.Utils, OpenAPI clients, etc.)

  • Add the label publish-blocker if a follow-up is required before publishing libraries

@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) changed the title from "Agent Tunnel: installer reports success on enrollment, not on tunnel connectivity" to "fix(agent,agent-installer): report installer success on tunnel connectivity, not just enrollment" on Jun 24, 2026
@irvingoujAtDevolution irvingouj@Devolutions (irvingoujAtDevolution) changed the title from "fix(agent,agent-installer): report installer success on tunnel connectivity, not just enrollment" to "fix(agent,agent-installer): fail the install if the agent tunnel can't reach the gateway" on Jun 25, 2026
…t reach the gateway

Enrollment proves only the HTTPS/TCP path; a firewall blocking the QUIC/UDP
tunnel (UDP 4433) could let enrollment succeed yet leave the tunnel dead, while
the installer still reported success.

`agent up` now performs a one-shot QUIC + mTLS connectivity probe to the gateway
right after enrolling, and exits non-zero on failure — which the enrollment
custom action already turns into a failed (and rolled-back) install.

- agent: `probe_connectivity` reuses the live connect path (one handshake +
  bounded drain); no standalone subcommand or heartbeat round-trip.
- agent-installer: on a failed `up`, roll back a freshly-persisted enrollment
  only when `up` actually wrote new certs (guarded, fails safe), never the
  prior install's.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@irvingoujAtDevolution

Contributor Author

Closing in favor of #1837 — recreated on a fresh branch (fix/agent-tunnel-connectivity-probe) as a single clean commit, with no force-push in its history.
