Fix race condition in WebSocketCoreImpl#6948
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a single, centralized close(...) helper to gracefully tear down the current WebSocket session and reset internal connection state so future connection attempts can proceed.
Changes:
- Introduced
suspend fun close(code, reason)that cancels any pending connection attempt and clears the current connection reference. - Updated the “no more subscriptions” path to use the new
close(...)helper instead of callingwebSocket.close(...)directly.
| suspend fun close(code: Int, reason: String?) = connectedMutex.withLock { | ||
| // Cancel this so new connection attempts will create a new deferred and not await this one | ||
| pendingConnectDeferred?.cancel() | ||
| pendingConnectDeferred = null | ||
|
|
||
| connectionHolder.get()?.webSocket?.close(code, reason) | ||
| connectionHolder.set(null) | ||
| } |
| * | ||
| * New connection attempts can be made after this call. | ||
| */ | ||
| suspend fun close(code: Int, reason: String?) = connectedMutex.withLock { |
There was a problem hiding this comment.
Could you try to make a unit test that replicate the issue (it should fail without your fix)?
There was a problem hiding this comment.
The following test fails unless you increase the delay in the last advanceTimeBy. Which means that connection attempts between WebSocket.close and WebSocketListener.onClosed are ignored. Attempting to send messages would immediately fail as the underlying WebSocket is marked as cloed, yet WebSocketCore still thinks it is connected, until onClosed is called to clean up the states. I'm not that familiar in Kotlin testing so this is the best I could do.
@Test
fun reconnectsAfterShutdown() = runTest {
setupServer(backgroundScope = backgroundScope)
every {
mockConnection.close(1001, "Session removed from app.")
} answers {
backgroundScope.launch {
// Simulate queue delay before onClosed is called after close
delay(100)
webSocketListener.onClosed(mockConnection, 1001, "Session removed from app.")
}
true
}
prepareAuthenticationAnswer()
assertTrue(webSocketCore.connect())
// Should connect for the first time
verify(exactly = 1) { mockOkHttpClient.newWebSocket(any(), webSocketListener) }
advanceTimeBy(100)
// This closes WS (directly calls close on the socket.
// Similar to what `createSubscriptionFlow` does when all the channels are closed.
webSocketCore.shutdown()
verify(exactly = 1) { mockConnection.close(1001, "Session removed from app.") }
advanceTimeBy(50) // Increase this to more than the queue delay, the test passes
assertTrue(webSocketCore.connect())
// Should connect again, a total of 2 connection attempts
verify(exactly = 2) { mockOkHttpClient.newWebSocket(any(), webSocketListener) }
}|
Hi, is there anything I need to do? |
We need a test that replicate the issue (failing if we remove your logic and pass when adding it). I didn't look at the test in the comment but if you think it does what we ask submit it and put the PR back in ready for review. |
|
Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍 |
…sure all resources are properly cleaned up
Summary
Fix #6947. Upon close inspection there is a race condition in
WebSocketCoreImpl.android/common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketCoreImpl.kt
Lines 764 to 769 in ba2371c
Here the WS connection is closed (no more message can be sent). However, it triggers
onClosingandonClosedasynchronously, which means that when it returns,WebSocketCoreImplstill considers the socket connected.android/common/src/main/kotlin/io/homeassistant/companion/android/common/data/websocket/impl/WebSocketCoreImpl.kt
Lines 305 to 328 in ba2371c
If a new subscription is made after
closeand beforeonClosing,connect()sees thatconnectionHolderstill contains something and return without making a new connection. Therefore, the new messages will be sent to the closing/closed connection, causing exceptions.Xiaomi's SystemUI implementation creates and cancels subscriptions in
HaControlsProviderServicein quick succession, therefore triggering this.Checklist
I have no idea how to create a unit test to cover such a race condition, but at least I no longer see exceptions (consistently) after these changes.
Any other notes
N/A