Skip to content

fix(service): recover stalled WiFi/TCP handshakes by cycling active transport#5856

Open
jeremiah-k wants to merge 20 commits into
meshtastic:mainfrom
jeremiah-k:bugfix/wifi-handshake-stall-restart
Open

fix(service): recover stalled WiFi/TCP handshakes by cycling active transport#5856
jeremiah-k wants to merge 20 commits into
meshtastic:mainfrom
jeremiah-k:bugfix/wifi-handshake-stall-restart

Conversation

@jeremiah-k

@jeremiah-k jeremiah-k commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Overview

This PR improves WiFi/TCP handshake-stall recovery so the app and underlying transport do not get stuck in different connection states.

Previously, if the handshake stalled after its retry window, recovery could leave the app-level connection state and underlying transport out of sync. For TCP, the transport could remain active while the UI reported a disconnected state. In that situation, selecting the same node again could be ignored because the radio service still saw the same selected address and an already-running transport.

This PR adds a targeted transport restart path for stalled handshakes. When recovery is needed, the connection manager first moves app state to Disconnected, then asks the radio service to restart the active transport. The fresh transport Connected signal can then re-enter the normal handshake flow.

Follow-up WiFi testing also showed that some failed TCP/USB handshake attempts could sit in a loading state until the lower-level TCP timeout recovered them. This PR adds a TCP/USB-only fast handshake watchdog so those stalled handshakes recover sooner, while leaving BLE's longer handshake timing unchanged.

Additional BLE testing exposed a separate post-handshake stuck-connecting path. Once firmware sends NODE_INFO_NONCE, the handshake watchdog is intentionally cancelled before local NodeDB install work begins. If that local Room/NodeDB install fails, the app can otherwise remain in Connecting with no active watchdog. This PR now recovers from NodeDB install failure by routing through the same reconnecting transport-restart path.

The same testing also exposed a Room timeout/cancellation boundary issue in device hardware and device-link cache refresh. This PR now bounds only the remote API calls and lets local Room persistence complete outside those deadlines.

Addresses: #3727

Key Changes

Handshake Recovery

  • Adds RadioInterfaceService.restartTransport() for silent recovery of stalled mesh handshakes.
  • Updates MeshConnectionManagerImpl so retry-exceeded handshake stalls request a transport restart instead of only changing app-level state.
  • Adds MeshConnectionManager.recoverPostHandshakeFailure() so NodeDB install failures after firmware handshake completion use the same app-level disconnect-before-restart ordering.
  • Adds a 12s TCP/USB-only fast handshake watchdog for Stage 1 and Stage 2.
  • Resets the fast watchdog only on meaningful handshake progress.
  • Cancels the Stage 2 watchdog synchronously when NODE_INFO_NONCE arrives, before async NodeDB install work begins.
  • Recovers only from pre-Connected NodeDB install failure; post-Connected analytics / side-effect failures are logged without forcing recovery.
  • Orders recovery so the app-level Disconnected transition happens before transport restart emissions.
  • Runs recovery work in a sibling coroutine so the restart sequence is not lost when the existing handshake timeout job is cancelled.
  • Re-checks the current app state before recovery so a late-arriving successful handshake is not restarted unnecessarily.

Transport Restart Behavior

  • Implements SharedRadioInterfaceService.restartTransport() as an in-place stop/start of the currently active transport.
  • Preserves the selected device address and existing connection intent.
  • Emits a transient DeviceSleep transport transition before restarting so observers can see a real state change before the fresh Connected event.
  • Keeps recovery quiet from the user's perspective by avoiding a permanent transport Disconnected event and avoiding a user-facing error.
  • Skips restart when:
    • the user explicitly disconnected,
    • the selected device was cleared,
    • no valid selected address exists,
    • no transport is currently running,
    • another restart or liveness recovery is already in progress.

Handshake Progress

  • Tracks meaningful Stage 1 progress from MyNodeInfo, local metadata, file info, device config, module config, channels, DeviceUI config, and early NodeInfo.
  • Tracks Stage 2 progress from NodeInfo packets.
  • Buffers early NodeInfo received during Stage 1 so it is not lost before the node-list phase starts.
  • Keeps Stage 2 completion gated on NODE_INFO_NONCE.

Room / Cache Refresh Boundaries

  • Separates remote API timeouts from local Room persistence for device hardware refresh.
  • Separates device-link remote fetch timeout from local device-link cache writes.
  • Lets local hardware and device-link cache persistence complete outside the remote API deadline.
  • Removes the transactional device-link replacement path and keeps device-link refresh as upsert-then-prune guarded from cancellation.
  • Prevents timeout/cancellation during refresh from interrupting local Room writes during connection startup.

Connection UI and Network Scan

  • Adds a transient Reconnecting… state while silent handshake recovery is in progress.
  • Keeps transient TCP/USB handshake recovery from looking like a final user-facing disconnect.
  • Updates the network-scan connectivity warning so it checks all current system networks instead of only Android's default network.
  • Treats WiFi, Ethernet, or VPN as valid network-scan transports.
  • Keeps cellular-only treated as unavailable for network scan.
  • Shows the network-scan warning while the Network section is visible or network scan is actively running.

Test Coverage

  • Adds MeshConnectionManagerImplTest coverage for:

    • Stage 1 config handshake stall recovery,
    • Stage 2 node-info handshake stall recovery,
    • TCP/USB fast recovery timing,
    • meaningful progress resetting the fast watchdog,
    • no fast recovery for BLE,
    • no restart when the handshake completes normally,
    • recovery ordering so app disconnect is processed before restart emissions,
    • post-handshake failure recovery using the same disconnect-before-restart ordering.
  • Adds MeshConfigFlowManagerImplTest / MeshConfigHandlerImplTest coverage for:

    • meaningful handshake progress signals,
    • early NodeInfo buffering,
    • Stage 2 completion behavior,
    • watchdog cancellation when NODE_INFO_NONCE arrives,
    • NodeDB install failure triggering post-handshake recovery,
    • post-NodeDB analytics / side-effect failures not forcing recovery after Connected.
  • Adds SharedRadioInterfaceServiceLivenessTest coverage for:

    • active transport restart creates a fresh transport,
    • restart is a no-op after explicit disconnect,
    • restart is a no-op after device deselection,
    • selected address is preserved,
    • restart does not emit permanent Disconnected,
    • restart does not emit a user-facing connection error,
    • restart coordinates with in-flight liveness recovery,
    • restart does not bypass environmental recovery when the transport was already stopped.
  • Adds Connections UI/ViewModel and network transport coverage for:

    • transient Reconnecting… mapping,
    • WiFi/Ethernet/VPN availability,
    • cellular-only and no-network warning behavior,
    • scan-active banner behavior when the Network filter is hidden,
    • VPN loss leaving cellular-only unavailable.

Follow-up

This PR addresses the Android-side handshake-stall recovery path associated with the reported WiFi/TCP reconnect behavior.

During testing, this also uncovered a firmware-side crash triggered by the Stage 2 node-info request. That firmware bug (tbeam / v2.7.25.104df5f) is being tracked separately. I built patched firmware that is working but it needs more review. I will update this PR with the firmware PR number once it is ready.

If field logs still show periodic TCP reconnect or configuration loops after this lands, a follow-up investigation should look at TCP idle-timeout behavior and add privacy-safe disconnect-reason logging for EOF, IOException, write failure, and idle timeout.


I mostly run nRF52 devices over BLE, so I do not have a strong baseline for long-term WiFi behavior. I tested these changes with WiFi-connected ESP32 devices while reconnecting, scanning, disabling/re-enabling WiFi, and checking WiFi/VPN/cellular-only network-scan warning behavior. I also tested the post-handshake recovery and Room timeout-boundary changes on my BLE device after the T1000-E stuck-connecting behavior surfaced during review.

…-brain

When the mesh handshake stalls past its retry window, the app previously
flipped to Disconnected without cycling the underlying transport. For TCP
this left a split-brain: transport Connected + radioTransport non-null +
connectionRequested true, which caused setDeviceAddress to silently ignore
same-node reconnect attempts.

Add restartTransport() to RadioInterfaceService that mirrors the BLE
liveness silent-recovery path: gates on connectionRequested, bonded
address validity, active transport, and the isRestarting CAS (all checked
before touching the transport), then emits transient DeviceSleep, stops
silently without a polite disconnect frame, and re-establishes via the
normal start path.

The manager's stall-retry-exceeded branch launches a single sibling job
that performs the app-level Disconnected transition first, then calls
restartTransport(), so the fresh Connected emission is not ignored by the
redundant-Connecting guard.

Tests cover Stage 1 config stall, Stage 2 node-info stall, no-restart on
success, recovery ordering, restart gate behavior (no-op after disconnect,
deselect, environmental stop), address preservation, no permanent
Disconnected, no user-facing error, no polite-disconnect frame, and
liveness overlap coordination.
@github-actions github-actions Bot added the bugfix PR tag label Jun 18, 2026
@jeremiah-k jeremiah-k marked this pull request as draft June 18, 2026 20:11
@jamesarich jamesarich added skip-preview-check Dismiss the advisory preview-staleness check (no UI preview update needed) skip-docs-check Dismiss the advisory docs-staleness check (no user/developer doc update needed) labels Jun 18, 2026
…uration

Buffer NodeInfo packets received during Stage 1 to prevent loss
before the node-list phase. Enforce strict state validation
during connection transitions to mitigate race conditions.
WiFi/TCP and USB serial transports can stall mid-handshake without the link
ever going down, leaving the user stuck on 'Connecting' until the original
long BLE-oriented stall-guard fires. Introduce a transport-aware watchdog
that runs an aggressive 12s timeout on fast transports (TCP, USB serial)
while keeping BLE on its existing long-and-retry budget.

The watchdog exposes two hooks on MeshConnectionManager:
- onHandshakeProgress(): re-arms the timer on meaningful wire activity so a
  slow-but-progressing handshake does not false-trip.
- onHandshakeComplete(): synchronously cancels the timer when the firmware
  signals Stage 2 completion (NODE_INFO_NONCE), before the async NodeDB
  install begins, so a slow Room commit on a large mesh cannot trip the
  timeout after the firmware handshake already succeeded.

On a fast-transport stall, runSiblingHandshakeRecovery() tears the
transport down via restartTransport() and brings it back up silently.
restartTransport()'s coordination comment is updated to document that the
app-level Disconnected flip is now driven by the recovery caller, not by
the transport-level emission itself.
The transport-aware watchdog added in the previous commit only re-arms on
meaningful wire activity. Wire onHandshakeProgress() into every config
handler that processes a packet belonging to the in-flight Stage 1/Stage 2
exchange: device/module config, channels, DeviceUIConfig, NodeInfo
(including early-buffered and Stage 2 nodes), FileInfo, local metadata,
and the Stage 2 NodeInfo request trigger.

Also call onHandshakeComplete() synchronously in handleNodeInfoComplete()
the moment NODE_INFO_NONCE arrives, so the watchdog is cancelled before
the async Room NodeDB install begins. This prevents a slow DB commit on a
large mesh from false-tripping the 12s fast-recovery timeout after the
firmware handshake has already succeeded.
Without a distinct status, the fast-recovery watchdog tearing down and
rebuilding the transport read to the user as NOT_CONNECTED — a final
failure — even though recovery was still in progress.

Add a RECONNECTING value to ConnectionStatus and have ConnectionsViewModel
observe ServiceRepository.connectionProgress for the watchdog's
'Reconnecting\u2026' sentinel, emitted immediately before the recovery
sibling flips the transport to Disconnected. When the sentinel is present
at Disconnected, the UI now shows the in-progress recovery instead of a
final failure.

Also switches the ViewModel's constructor dependency from
ConnectionStateProvider to ServiceRepository so connectionProgress is
observable alongside connectionState in a 3-way combine. ConnectingDeviceInfo
renders the new status with a localized 'reconnecting' string.
isWifiUnavailable() only inspected ConnectivityManager's default
network, so the wifi-unavailable recovery banner stayed stuck whenever
Android kept cellular as the default route (or Wi-Fi was connected but
unvalidated). NSD/mDNS device discovery only needs a LAN, which any
Wi-Fi or Ethernet transport provides regardless of whether the system
has selected it as the default route.

Iterate allNetworks() and consider Wi-Fi/Ethernet present on any of them.
The platform-agnostic reduction (anyLocalNetworkAvailable) is extracted
into NetworkTransportInfo in commonMain so it is unit-testable without an
Android runtime.
…ake progress

- Add TRANSPORT_VPN to the network-scan availability predicate so
  ZeroTier/Tailscale overlays clear the banner; cellular-only remains
  insufficient for NSD/mDNS discovery
- Emit onHandshakeProgress from handleDeviceUIConfig so the 12s fast
  watchdog resets on every meaningful Stage 1 packet, not just config/
  module/channel
- Broaden restartTransport() KDoc to document BLE retry-exhausted usage
  alongside the TCP/USB fast-path
- Neutralize the per-transport NetworkRequest comment to avoid asserting
  AND/OR semantics on the Android request builder
- Hoist FAST_RECOVERY_TYPES to companion object to avoid per-packet
  Set allocation during large-mesh Stage 2 bursts
- Fix stale-progress clearing comment and soften the one-shot
  restartTransport leak rationale
- Expand NetworkTransportTest with 8 VPN-specific transport cases
@jeremiah-k jeremiah-k changed the title fix(service): Fix WiFi handshake stall recovery by cycling active transport fix(service): recover stalled WiFi/TCP handshakes by cycling active transport Jun 19, 2026
…omicfu for state visibility

Hoist the "Reconnecting..." string into ServiceRepository.Companion as
RECONNECTING_PROGRESS_TEXT so repository, view-model, and tests reference
one constant instead of maintaining a cross-module string contract.

Replace @volatile fields on MeshConnectionManagerImpl (handshakeTimeout)
and MeshConfigFlowManagerImpl (handshakeState) with
kotlinx.atomicfu delegates. atomicfu gives lock-free atomic reads that
are safe across KMP targets without resorting to platform-specific
synchronization, and the resulting memory-model semantics are stronger
than @volatile on the JVM alone.

Hoist FAST_RECOVERY_TYPES next to the related packet-type predicates in
MeshConnectionManagerImpl so the recovery decision lives with the
constant it depends on.

Test addresses in MeshConnectionManagerImplTest now use the DeviceType
prefix convention (e.g. '^4a...') so the address-format matches the
production matcher.
…ility during active scan

Add NetworkTransportInfo.isVpnActive() (true when hasVpn or, in the
absence of Wi-Fi/Ethernet, hasCellular) so VPN-hosted radios such as
ZeroTier or Tailscale are treated as a valid transport for the
network-scan path. Cellular-only connections still fall through and show
the warning banner, since the radio scanner needs Wi-Fi or VPN
reachability.

Add NetworkTransportInfo.shouldShowWifiUnavailableBanner(...) which
encapsulates the wifi-unavailable predicate, including the VPN escape
above, so ConnectionsScreen relies on one helper instead of re-deriving
the rule.

In ConnectionsScreen, widen the banner visibility gate so it also shows
while a network scan is in flight (isNetworkScanning), not just when the
filter-chip preference (showNetworkTransport) is enabled. Without this,
a user with the chip off sees no progress indicator during an active
scan and the UI reads as stuck.

Cover both changes with NetworkTransportTest cases for VPN, cellular
fallback, and banner visibility.
@jeremiah-k

Copy link
Copy Markdown
Contributor Author

@jamesarich One thing worth noting from my WiFi testing, this app side fix may expose a firmware crash on some affected builds.

On a tbeam running firmware 2.7.25.104df5f, the app side fix made the connection flow behave correctly and retry/reconnect cleanly. Once that happened, the node started rebooting consistently during the normal config/node-list load. Before this fix, the app could sometimes look connected or get stuck in a partial state, so it may not have hit the crashing firmware path as reliably.

It looks like the app is now reaching the proper handshake path consistently and that path exposes a firmware-side crash. I have a patch in progress around PhoneAPI::handleStartConfig() to avoid the abort and I'll open a PR to the firmware repo for that soon.

As far as this PR goes the debug build is behaving well with the patched firmware, so I think it's ready for review.

@jeremiah-k jeremiah-k marked this pull request as ready for review June 19, 2026 16:33
@jeremiah-k jeremiah-k marked this pull request as draft June 19, 2026 20:20
Bound only remote refresh calls with timeout deadlines while allowing local cache persistence to complete outside those deadlines.

- Allow staleWhileRevalidateFlow callers to manage their own timeout boundary.

- Move device hardware Room writes outside the hardware API timeout.

- Protect device-link cache persistence from cancellation while keeping the remote fetch bounded.

- Remove the transactional device-link replaceAll DAO path from refresh.

- Disconnect after post-handshake NodeDB install failure so the app cannot remain stuck in Connecting.

- Add a regression test for the post-handshake failure path.
Route post-handshake NodeDB install failures through MeshConnectionManager so recovery uses the same app-level disconnect before raw transport restart ordering as handshake stall recovery.

Also remove node identifiers from routine MyNodeInfo logs to avoid exposing stable device identifiers in shared logs.
Recover only when NodeDB installation fails after firmware handshake completion. Once NodeDB install succeeds, mark the app connected and treat analytics plus post-NodeDB side effects as best-effort so they cannot roll back app state or trigger transport recovery after Connected.
Make the onNodeDbReady failure log name the post-connected side-effect boundary explicitly, so it does not read like NodeDB install recovery.
@jeremiah-k jeremiah-k marked this pull request as ready for review June 19, 2026 22:16

@jamesarich jamesarich left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough work here, and especially for the field testing + the firmware-crash writeup — that investigation is genuinely valuable. The core fix is correct: the WiFi/TCP split-brain (transport Connected while the app is Disconnected, blocking same-node reconnect) is the right diagnosis, and cycling the transport in place is the right shape. Test coverage is strong.

Two things before this is ready to merge:

1. Scope — this PR has outgrown its title. It's now +2328/−120 across 26 files / 20 commits, several unrelated to handshake recovery. Please split into their own PR(s):

  • fix(data): make device link catalog refresh transactional
  • fix(data): separate refresh timeouts from Room persistence (DeviceHardwareRepository / DeviceLink / StaleWhileRevalidate)
  • the isWifiUnavailable / VPN-banner work (3 commits) — a distinct bug from handshake recovery

Also: squash the restart transport after post-handshake failurenarrow post-handshake recovery boundary pair, and a rebase would drop the 6 Merge branch 'main' commits. A core-connectivity change is far easier to review for regressions when it's just the recovery fix.

2. Recovery hardening (ties to your firmware finding). I dug into why the clean handshake trips the T-Beam reboot: firmware handleStartConfig() (build v2.7.25.104df5f) re-enters on every want_config with no in-flight guard and never resets its config_state sub-iterator. Android's stall-guard re-sends want_config on a timer and re-handshakes with no backoff/cap — that re-send is what re-enters the fragile firmware path mid node-list load. (The firmware write-dedup is single-slot memcmp vs the previous write only, so the interleaved heartbeats let the re-sent nonce slip past it — the "firmware will drop our retry" assumption doesn't hold.) Your firmware patch is the proper fix; a small app-side backoff + give-up cap is good defensive depth so a crashing node isn't re-driven every cycle. Details inline.

Everything else is minor (inline). Nice work overall — this is close.

* returned), so the cancellation [onConnectionChanged] would attempt is a no-op on an already-completed job — and
* because the sibling is parented to `scope`, not to handshakeTimeout, it survives independently.
*/
private fun runSiblingHandshakeRecovery() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This recovery path has no backoff and no give-up cap, and the BLE stall branch re-sends want_config mid-session via action() in startHandshakeStallGuard.

I traced the T-Beam reboot from the field report: firmware handleStartConfig() (build v2.7.25.104df5f) re-enters on every want_config with no in-flight guard and never resets its config_state sub-iterator, so a re-sent/repeated want_config re-enters it mid node-list load — a strong contributor to the crash. And with nothing bounding recovery, a node that keeps failing the handshake gets re-driven indefinitely.

Two asks:

  • Add exponential backoff + a consecutive-recovery cap (after N failed recoveries, surface a sticky error instead of silently looping).
  • Reconsider the mid-session want_config re-send. The comment assumes the firmware drops the duplicate, but the firmware write-dedup is single-slot (memcmp vs the previous write only), and the interleaved heartbeats mean the re-sent nonce isn't byte-identical to the prior write — so it slips past and re-enters. A single clean transport cycle is safer than re-sending on the same session.

Your firmware patch is the proper fix; this is complementary app-side hardening so we don't hammer a crashing node.

override fun handleFileInfo(info: FileInfo) {
Logger.d { "FileInfo received: ${info.file_name} (${info.size_bytes} bytes)" }
scope.handledLaunch { radioConfigRepository.addFileInfo(info) }
connectionManager.value.onHandshakeProgress()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onHandshakeProgress() is called here with no handshake-state guard (unlike handleNodeInfo, which is ignored once the state is Complete). Between NODE_INFO_NONCE arriving and the async DB install finishing, app state is still Connecting, so a late FileInfo/config packet here can re-arm the 12s fast watchdog that onHandshakeComplete() just cancelled — re-introducing the slow-DB false-trip on large meshes that onHandshakeComplete() was added to prevent.

Suggest having onHandshakeComplete() set a one-way latch that onHandshakeProgress() checks, rather than relying solely on the Connecting state guard. Low probability, but cheap to close.

val hasWifi: Boolean,
val hasEthernet: Boolean,
val hasVpn: Boolean,
val hasCellular: Boolean,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hasCellular is never read by any predicate — anyNetworkScanTransportAvailable only checks wifi/ethernet/vpn, and nothing else reads it. It's captured in the Android hasLocalNetwork() actual and set in every test fixture but has no effect, which misleadingly implies cellular is considered. Recommend dropping the field and its call sites.

} else {
store(remoteLinks)
lastRefreshMillis = nowMillis
withContext(NonCancellable + dispatchers.io) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This NonCancellable refresh/persistence fix (here, plus the matching changes in DeviceHardwareRepositoryImpl and StaleWhileRevalidateFlow) looks sound on its own — stopping the network deadline from cancelling the Room write mid-write. But it's unrelated to the WiFi handshake recovery this PR is about.

Could you split the device-catalog caching changes into their own PR? It keeps this one scoped to the connectivity fix and much easier to review for regressions.

private var locationRequestsJob: Job? = null
private var handshakeTimeout: Job? = null

private val handshakeTimeout = atomic<Job?>(null)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice — moving this to atomic<> fixes the cross-thread visibility issue. One residual nit: the re-arm sites still do cancel() then reassign as two ops, so a concurrent re-arm can orphan a job between them. handshakeTimeout.getAndSet(newJob)?.cancel() collapses it into one atomic swap if you want it fully tight. Low priority.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix PR tag skip-docs-check Dismiss the advisory docs-staleness check (no user/developer doc update needed) skip-preview-check Dismiss the advisory preview-staleness check (no UI preview update needed)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants