fix(service): recover stalled WiFi/TCP handshakes by cycling active transport#5856
fix(service): recover stalled WiFi/TCP handshakes by cycling active transport#5856jeremiah-k wants to merge 20 commits into
Conversation
…-brain When the mesh handshake stalls past its retry window, the app previously flipped to Disconnected without cycling the underlying transport. For TCP this left a split-brain: transport Connected + radioTransport non-null + connectionRequested true, which caused setDeviceAddress to silently ignore same-node reconnect attempts. Add restartTransport() to RadioInterfaceService that mirrors the BLE liveness silent-recovery path: gates on connectionRequested, bonded address validity, active transport, and the isRestarting CAS (all checked before touching the transport), then emits transient DeviceSleep, stops silently without a polite disconnect frame, and re-establishes via the normal start path. The manager's stall-retry-exceeded branch launches a single sibling job that performs the app-level Disconnected transition first, then calls restartTransport(), so the fresh Connected emission is not ignored by the redundant-Connecting guard. Tests cover Stage 1 config stall, Stage 2 node-info stall, no-restart on success, recovery ordering, restart gate behavior (no-op after disconnect, deselect, environmental stop), address preservation, no permanent Disconnected, no user-facing error, no polite-disconnect frame, and liveness overlap coordination.
…uration Buffer NodeInfo packets received during Stage 1 to prevent loss before the node-list phase. Enforce strict state validation during connection transitions to mitigate race conditions.
WiFi/TCP and USB serial transports can stall mid-handshake without the link ever going down, leaving the user stuck on 'Connecting' until the original long BLE-oriented stall-guard fires. Introduce a transport-aware watchdog that runs an aggressive 12s timeout on fast transports (TCP, USB serial) while keeping BLE on its existing long-and-retry budget. The watchdog exposes two hooks on MeshConnectionManager: - onHandshakeProgress(): re-arms the timer on meaningful wire activity so a slow-but-progressing handshake does not false-trip. - onHandshakeComplete(): synchronously cancels the timer when the firmware signals Stage 2 completion (NODE_INFO_NONCE), before the async NodeDB install begins, so a slow Room commit on a large mesh cannot trip the timeout after the firmware handshake already succeeded. On a fast-transport stall, runSiblingHandshakeRecovery() tears the transport down via restartTransport() and brings it back up silently. restartTransport()'s coordination comment is updated to document that the app-level Disconnected flip is now driven by the recovery caller, not by the transport-level emission itself.
The transport-aware watchdog added in the previous commit only re-arms on meaningful wire activity. Wire onHandshakeProgress() into every config handler that processes a packet belonging to the in-flight Stage 1/Stage 2 exchange: device/module config, channels, DeviceUIConfig, NodeInfo (including early-buffered and Stage 2 nodes), FileInfo, local metadata, and the Stage 2 NodeInfo request trigger. Also call onHandshakeComplete() synchronously in handleNodeInfoComplete() the moment NODE_INFO_NONCE arrives, so the watchdog is cancelled before the async Room NodeDB install begins. This prevents a slow DB commit on a large mesh from false-tripping the 12s fast-recovery timeout after the firmware handshake has already succeeded.
Without a distinct status, the fast-recovery watchdog tearing down and rebuilding the transport read to the user as NOT_CONNECTED — a final failure — even though recovery was still in progress. Add a RECONNECTING value to ConnectionStatus and have ConnectionsViewModel observe ServiceRepository.connectionProgress for the watchdog's 'Reconnecting\u2026' sentinel, emitted immediately before the recovery sibling flips the transport to Disconnected. When the sentinel is present at Disconnected, the UI now shows the in-progress recovery instead of a final failure. Also switches the ViewModel's constructor dependency from ConnectionStateProvider to ServiceRepository so connectionProgress is observable alongside connectionState in a 3-way combine. ConnectingDeviceInfo renders the new status with a localized 'reconnecting' string.
isWifiUnavailable() only inspected ConnectivityManager's default network, so the wifi-unavailable recovery banner stayed stuck whenever Android kept cellular as the default route (or Wi-Fi was connected but unvalidated). NSD/mDNS device discovery only needs a LAN, which any Wi-Fi or Ethernet transport provides regardless of whether the system has selected it as the default route. Iterate allNetworks() and consider Wi-Fi/Ethernet present on any of them. The platform-agnostic reduction (anyLocalNetworkAvailable) is extracted into NetworkTransportInfo in commonMain so it is unit-testable without an Android runtime.
…ake progress - Add TRANSPORT_VPN to the network-scan availability predicate so ZeroTier/Tailscale overlays clear the banner; cellular-only remains insufficient for NSD/mDNS discovery - Emit onHandshakeProgress from handleDeviceUIConfig so the 12s fast watchdog resets on every meaningful Stage 1 packet, not just config/ module/channel - Broaden restartTransport() KDoc to document BLE retry-exhausted usage alongside the TCP/USB fast-path - Neutralize the per-transport NetworkRequest comment to avoid asserting AND/OR semantics on the Android request builder - Hoist FAST_RECOVERY_TYPES to companion object to avoid per-packet Set allocation during large-mesh Stage 2 bursts - Fix stale-progress clearing comment and soften the one-shot restartTransport leak rationale - Expand NetworkTransportTest with 8 VPN-specific transport cases
…omicfu for state visibility Hoist the "Reconnecting..." string into ServiceRepository.Companion as RECONNECTING_PROGRESS_TEXT so repository, view-model, and tests reference one constant instead of maintaining a cross-module string contract. Replace @volatile fields on MeshConnectionManagerImpl (handshakeTimeout) and MeshConfigFlowManagerImpl (handshakeState) with kotlinx.atomicfu delegates. atomicfu gives lock-free atomic reads that are safe across KMP targets without resorting to platform-specific synchronization, and the resulting memory-model semantics are stronger than @volatile on the JVM alone. Hoist FAST_RECOVERY_TYPES next to the related packet-type predicates in MeshConnectionManagerImpl so the recovery decision lives with the constant it depends on. Test addresses in MeshConnectionManagerImplTest now use the DeviceType prefix convention (e.g. '^4a...') so the address-format matches the production matcher.
…ility during active scan Add NetworkTransportInfo.isVpnActive() (true when hasVpn or, in the absence of Wi-Fi/Ethernet, hasCellular) so VPN-hosted radios such as ZeroTier or Tailscale are treated as a valid transport for the network-scan path. Cellular-only connections still fall through and show the warning banner, since the radio scanner needs Wi-Fi or VPN reachability. Add NetworkTransportInfo.shouldShowWifiUnavailableBanner(...) which encapsulates the wifi-unavailable predicate, including the VPN escape above, so ConnectionsScreen relies on one helper instead of re-deriving the rule. In ConnectionsScreen, widen the banner visibility gate so it also shows while a network scan is in flight (isNetworkScanning), not just when the filter-chip preference (showNetworkTransport) is enabled. Without this, a user with the chip off sees no progress indicator during an active scan and the UI reads as stuck. Cover both changes with NetworkTransportTest cases for VPN, cellular fallback, and banner visibility.
|
@jamesarich One thing worth noting from my WiFi testing, this app side fix may expose a firmware crash on some affected builds. On a tbeam running firmware 2.7.25.104df5f, the app side fix made the connection flow behave correctly and retry/reconnect cleanly. Once that happened, the node started rebooting consistently during the normal config/node-list load. Before this fix, the app could sometimes look connected or get stuck in a partial state, so it may not have hit the crashing firmware path as reliably. It looks like the app is now reaching the proper handshake path consistently and that path exposes a firmware-side crash. I have a patch in progress around As far as this PR goes the debug build is behaving well with the patched firmware, so I think it's ready for review. |
Bound only remote refresh calls with timeout deadlines while allowing local cache persistence to complete outside those deadlines. - Allow staleWhileRevalidateFlow callers to manage their own timeout boundary. - Move device hardware Room writes outside the hardware API timeout. - Protect device-link cache persistence from cancellation while keeping the remote fetch bounded. - Remove the transactional device-link replaceAll DAO path from refresh. - Disconnect after post-handshake NodeDB install failure so the app cannot remain stuck in Connecting. - Add a regression test for the post-handshake failure path.
Route post-handshake NodeDB install failures through MeshConnectionManager so recovery uses the same app-level disconnect before raw transport restart ordering as handshake stall recovery. Also remove node identifiers from routine MyNodeInfo logs to avoid exposing stable device identifiers in shared logs.
Recover only when NodeDB installation fails after firmware handshake completion. Once NodeDB install succeeds, mark the app connected and treat analytics plus post-NodeDB side effects as best-effort so they cannot roll back app state or trigger transport recovery after Connected.
Make the onNodeDbReady failure log name the post-connected side-effect boundary explicitly, so it does not read like NodeDB install recovery.
jamesarich
left a comment
There was a problem hiding this comment.
Thanks for the thorough work here, and especially for the field testing + the firmware-crash writeup — that investigation is genuinely valuable. The core fix is correct: the WiFi/TCP split-brain (transport Connected while the app is Disconnected, blocking same-node reconnect) is the right diagnosis, and cycling the transport in place is the right shape. Test coverage is strong.
Two things before this is ready to merge:
1. Scope — this PR has outgrown its title. It's now +2328/−120 across 26 files / 20 commits, several unrelated to handshake recovery. Please split into their own PR(s):
fix(data): make device link catalog refresh transactionalfix(data): separate refresh timeouts from Room persistence(DeviceHardwareRepository / DeviceLink / StaleWhileRevalidate)- the
isWifiUnavailable/ VPN-banner work (3 commits) — a distinct bug from handshake recovery
Also: squash the restart transport after post-handshake failure → narrow post-handshake recovery boundary pair, and a rebase would drop the 6 Merge branch 'main' commits. A core-connectivity change is far easier to review for regressions when it's just the recovery fix.
2. Recovery hardening (ties to your firmware finding). I dug into why the clean handshake trips the T-Beam reboot: firmware handleStartConfig() (build v2.7.25.104df5f) re-enters on every want_config with no in-flight guard and never resets its config_state sub-iterator. Android's stall-guard re-sends want_config on a timer and re-handshakes with no backoff/cap — that re-send is what re-enters the fragile firmware path mid node-list load. (The firmware write-dedup is single-slot memcmp vs the previous write only, so the interleaved heartbeats let the re-sent nonce slip past it — the "firmware will drop our retry" assumption doesn't hold.) Your firmware patch is the proper fix; a small app-side backoff + give-up cap is good defensive depth so a crashing node isn't re-driven every cycle. Details inline.
Everything else is minor (inline). Nice work overall — this is close.
| * returned), so the cancellation [onConnectionChanged] would attempt is a no-op on an already-completed job — and | ||
| * because the sibling is parented to `scope`, not to handshakeTimeout, it survives independently. | ||
| */ | ||
| private fun runSiblingHandshakeRecovery() { |
There was a problem hiding this comment.
This recovery path has no backoff and no give-up cap, and the BLE stall branch re-sends want_config mid-session via action() in startHandshakeStallGuard.
I traced the T-Beam reboot from the field report: firmware handleStartConfig() (build v2.7.25.104df5f) re-enters on every want_config with no in-flight guard and never resets its config_state sub-iterator, so a re-sent/repeated want_config re-enters it mid node-list load — a strong contributor to the crash. And with nothing bounding recovery, a node that keeps failing the handshake gets re-driven indefinitely.
Two asks:
- Add exponential backoff + a consecutive-recovery cap (after N failed recoveries, surface a sticky error instead of silently looping).
- Reconsider the mid-session
want_configre-send. The comment assumes the firmware drops the duplicate, but the firmware write-dedup is single-slot (memcmpvs the previous write only), and the interleaved heartbeats mean the re-sent nonce isn't byte-identical to the prior write — so it slips past and re-enters. A single clean transport cycle is safer than re-sending on the same session.
Your firmware patch is the proper fix; this is complementary app-side hardening so we don't hammer a crashing node.
| override fun handleFileInfo(info: FileInfo) { | ||
| Logger.d { "FileInfo received: ${info.file_name} (${info.size_bytes} bytes)" } | ||
| scope.handledLaunch { radioConfigRepository.addFileInfo(info) } | ||
| connectionManager.value.onHandshakeProgress() |
There was a problem hiding this comment.
onHandshakeProgress() is called here with no handshake-state guard (unlike handleNodeInfo, which is ignored once the state is Complete). Between NODE_INFO_NONCE arriving and the async DB install finishing, app state is still Connecting, so a late FileInfo/config packet here can re-arm the 12s fast watchdog that onHandshakeComplete() just cancelled — re-introducing the slow-DB false-trip on large meshes that onHandshakeComplete() was added to prevent.
Suggest having onHandshakeComplete() set a one-way latch that onHandshakeProgress() checks, rather than relying solely on the Connecting state guard. Low probability, but cheap to close.
| val hasWifi: Boolean, | ||
| val hasEthernet: Boolean, | ||
| val hasVpn: Boolean, | ||
| val hasCellular: Boolean, |
There was a problem hiding this comment.
hasCellular is never read by any predicate — anyNetworkScanTransportAvailable only checks wifi/ethernet/vpn, and nothing else reads it. It's captured in the Android hasLocalNetwork() actual and set in every test fixture but has no effect, which misleadingly implies cellular is considered. Recommend dropping the field and its call sites.
| } else { | ||
| store(remoteLinks) | ||
| lastRefreshMillis = nowMillis | ||
| withContext(NonCancellable + dispatchers.io) { |
There was a problem hiding this comment.
This NonCancellable refresh/persistence fix (here, plus the matching changes in DeviceHardwareRepositoryImpl and StaleWhileRevalidateFlow) looks sound on its own — stopping the network deadline from cancelling the Room write mid-write. But it's unrelated to the WiFi handshake recovery this PR is about.
Could you split the device-catalog caching changes into their own PR? It keeps this one scoped to the connectivity fix and much easier to review for regressions.
| private var locationRequestsJob: Job? = null | ||
| private var handshakeTimeout: Job? = null | ||
|
|
||
| private val handshakeTimeout = atomic<Job?>(null) |
There was a problem hiding this comment.
Nice — moving this to atomic<> fixes the cross-thread visibility issue. One residual nit: the re-arm sites still do cancel() then reassign as two ops, so a concurrent re-arm can orphan a job between them. handshakeTimeout.getAndSet(newJob)?.cancel() collapses it into one atomic swap if you want it fully tight. Low priority.
Overview
This PR improves WiFi/TCP handshake-stall recovery so the app and underlying transport do not get stuck in different connection states.
Previously, if the handshake stalled after its retry window, recovery could leave the app-level connection state and underlying transport out of sync. For TCP, the transport could remain active while the UI reported a disconnected state. In that situation, selecting the same node again could be ignored because the radio service still saw the same selected address and an already-running transport.
This PR adds a targeted transport restart path for stalled handshakes. When recovery is needed, the connection manager first moves app state to
Disconnected, then asks the radio service to restart the active transport. The fresh transportConnectedsignal can then re-enter the normal handshake flow.Follow-up WiFi testing also showed that some failed TCP/USB handshake attempts could sit in a loading state until the lower-level TCP timeout recovered them. This PR adds a TCP/USB-only fast handshake watchdog so those stalled handshakes recover sooner, while leaving BLE's longer handshake timing unchanged.
Additional BLE testing exposed a separate post-handshake stuck-connecting path. Once firmware sends
NODE_INFO_NONCE, the handshake watchdog is intentionally cancelled before local NodeDB install work begins. If that local Room/NodeDB install fails, the app can otherwise remain inConnectingwith no active watchdog. This PR now recovers from NodeDB install failure by routing through the same reconnecting transport-restart path.The same testing also exposed a Room timeout/cancellation boundary issue in device hardware and device-link cache refresh. This PR now bounds only the remote API calls and lets local Room persistence complete outside those deadlines.
Addresses: #3727
Key Changes
Handshake Recovery
RadioInterfaceService.restartTransport()for silent recovery of stalled mesh handshakes.MeshConnectionManagerImplso retry-exceeded handshake stalls request a transport restart instead of only changing app-level state.MeshConnectionManager.recoverPostHandshakeFailure()so NodeDB install failures after firmware handshake completion use the same app-level disconnect-before-restart ordering.NODE_INFO_NONCEarrives, before async NodeDB install work begins.ConnectedNodeDB install failure; post-Connectedanalytics / side-effect failures are logged without forcing recovery.Disconnectedtransition happens before transport restart emissions.Transport Restart Behavior
SharedRadioInterfaceService.restartTransport()as an in-place stop/start of the currently active transport.DeviceSleeptransport transition before restarting so observers can see a real state change before the freshConnectedevent.Disconnectedevent and avoiding a user-facing error.Handshake Progress
NODE_INFO_NONCE.Room / Cache Refresh Boundaries
Connection UI and Network Scan
Reconnecting…state while silent handshake recovery is in progress.Test Coverage
Adds
MeshConnectionManagerImplTestcoverage for:Adds
MeshConfigFlowManagerImplTest/MeshConfigHandlerImplTestcoverage for:NODE_INFO_NONCEarrives,Connected.Adds
SharedRadioInterfaceServiceLivenessTestcoverage for:Disconnected,Adds Connections UI/ViewModel and network transport coverage for:
Reconnecting…mapping,Follow-up
This PR addresses the Android-side handshake-stall recovery path associated with the reported WiFi/TCP reconnect behavior.
During testing, this also uncovered a firmware-side crash triggered by the Stage 2 node-info request. That firmware bug (tbeam / v2.7.25.104df5f) is being tracked separately. I built patched firmware that is working but it needs more review. I will update this PR with the firmware PR number once it is ready.
If field logs still show periodic TCP reconnect or configuration loops after this lands, a follow-up investigation should look at TCP idle-timeout behavior and add privacy-safe disconnect-reason logging for EOF, IOException, write failure, and idle timeout.
I mostly run nRF52 devices over BLE, so I do not have a strong baseline for long-term WiFi behavior. I tested these changes with WiFi-connected ESP32 devices while reconnecting, scanning, disabling/re-enabling WiFi, and checking WiFi/VPN/cellular-only network-scan warning behavior. I also tested the post-handshake recovery and Room timeout-boundary changes on my BLE device after the T1000-E stuck-connecting behavior surfaced during review.