Fix race condition in WebSocketCoreImpl #6948

Open
kmod-midori wants to merge 10 commits into home-assistant:main from kmod-midori:bugfix/ws-race

Conversation

@kmod-midori


Summary

Fixes #6947. On close inspection, there is a race condition in WebSocketCoreImpl.

if (activeMessages.isEmpty()) {
    Timber.i("No more subscriptions, closing connection.")
    connectionHolder.get()?.webSocket?.close(1001, "Done listening to subscriptions.")
} else {
    Timber.i("Still ${activeMessages.size} messages in the queue, not closing connection.")
}

Here the WS connection is closed, so no more messages can be sent on it. However, close() only triggers onClosing and onClosed asynchronously, which means that when it returns, WebSocketCoreImpl still considers the socket connected.
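To illustrate the window described above, here is a minimal sketch that uses only the Kotlin standard library. FakeSocket, the pending callback queue, and demoAsyncClose are hypothetical stand-ins invented for this illustration, not the project's or OkHttp's types; the only assumption carried over is that close() returns before the onClosed callback runs.

```kotlin
// Hypothetical stand-in for a socket whose close callback is delivered later.
class FakeSocket(
    private val pendingCallbacks: ArrayDeque<() -> Unit>,
    private val onClosed: () -> Unit,
) {
    var closed = false
        private set

    // Sending on a closed socket is rejected.
    fun send(text: String): Boolean = !closed

    fun close(): Boolean {
        if (closed) return false
        closed = true
        // The callback is queued, not invoked inline.
        pendingCallbacks.addLast(onClosed)
        return true
    }
}

fun demoAsyncClose(): List<String> {
    val log = mutableListOf<String>()
    val pending = ArrayDeque<() -> Unit>()
    var holder: FakeSocket? = null
    // Only the queued onClosed callback clears the holder.
    val socket = FakeSocket(pending) { holder = null }
    holder = socket

    socket.close()
    // close() has returned, but the holder still points at the closed socket.
    log += "after close: holder set = ${holder != null}, send ok = ${socket.send("ping")}"

    // Deliver the queued callback, as the socket's callback thread eventually would.
    while (pending.isNotEmpty()) pending.removeFirst().invoke()
    log += "after onClosed: holder set = ${holder != null}"
    return log
}

fun main() {
    demoAsyncClose().forEach(::println)
}
```

Between the two log lines, any caller that only checks whether the holder is non-null would treat the closed socket as a live connection.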

suspend fun connect(): Boolean {
    val connectDeferred: CompletableDeferred<Boolean>
    // Track pending connection state locally within the urlObserverJob scope.
    // This is read across collectLatest invocations to cancel partial connections on URL change.
    var pendingWebSocket: WebSocket? = null
    var urlObserverJob: Job? = null
    connectedMutex.withLock {
        // Already connected?
        if (connectionHolder.get() != null && authCompleted.isCompleted) {
            if (authCompleted.isCancelled) Timber.w("Trying to connect but was cancelled")
            return !authCompleted.isCancelled
        }
        // Connection already in progress? Reuse its deferred
        pendingConnectDeferred?.takeIf { !it.isCompleted }?.let { existing ->
            Timber.d("Connection already in progress, reusing existing deferred and release lock")
            connectDeferred = existing
            return@withLock
        }
        // Start new connection attempt
        connectDeferred = CompletableDeferred()

If a new subscription is made after close() and before onClosing, connect() sees that connectionHolder still contains a connection and returns without making a new one. The new messages are therefore sent to the closing/closed connection, causing exceptions.
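The effect of that early return can be sketched with a toy model. ToyCore, ToyConnection, and the clearHolderOnClose flag are hypothetical names made up for this sketch; it assumes only that connect() reuses whatever the holder contains, and it compares leaving the holder untouched on close with clearing it synchronously.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// Hypothetical connection object; `closed` mirrors the socket's own state.
class ToyConnection(val id: Int) {
    var closed = false
}

// Toy model of a core that reuses the connection stored in its holder.
class ToyCore(private val clearHolderOnClose: Boolean) {
    private val holder = AtomicReference<ToyConnection?>(null)
    var connectionsOpened = 0
        private set

    // Returns the connection that new messages would be sent to.
    fun connect(): ToyConnection {
        holder.get()?.let { return it } // "Already connected?" check
        val fresh = ToyConnection(++connectionsOpened)
        holder.set(fresh)
        return fresh
    }

    fun close() {
        val current = holder.get() ?: return
        current.closed = true
        // Without the flag, the holder is cleared only by a later onClosed
        // callback, which this model never delivers before the next connect().
        if (clearHolderOnClose) holder.set(null)
    }
}

// True if a connect() issued right after close() hands back the closed connection.
fun reconnectTargetsClosedSocket(clearHolderOnClose: Boolean): Boolean {
    val core = ToyCore(clearHolderOnClose)
    core.connect()
    core.close()
    val target = core.connect() // new subscription arrives before onClosed
    return target.closed
}

fun main() {
    println(reconnectTargetsClosedSocket(clearHolderOnClose = false)) // prints "true"
    println(reconnectTargetsClosedSocket(clearHolderOnClose = true))  // prints "false"
}
```

Clearing the holder in the same step as the close call removes the window in this model. The real class also has to deal with the pending connect deferred and the mutex, which the sketch leaves out.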

Xiaomi's SystemUI implementation creates and cancels subscriptions in HaControlsProviderService in quick succession, therefore triggering this.

Checklist

  • New or updated tests have been added to cover the changes following the testing guidelines.
    I have no idea how to create a unit test to cover such a race condition, but at least I no longer see exceptions (consistently) after these changes.
  • The code follows the project's code style and best_practices.
  • The changes have been thoroughly tested, and edge cases have been considered.
  • Changes are backward compatible whenever feasible. Any breaking changes are documented in the changelog for users and/or in the code for developers depending on the relevance.

Any other notes

N/A

Copilot AI review requested due to automatic review settings June 1, 2026 11:35

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a single, centralized close(...) helper to gracefully tear down the current WebSocket session and reset internal connection state so future connection attempts can proceed.

Changes:

  • Introduced suspend fun close(code, reason) that cancels any pending connection attempt and clears the current connection reference.
  • Updated the “no more subscriptions” path to use the new close(...) helper instead of calling webSocket.close(...) directly.

Comment on lines +371 to +378
// New connection attempts can be made after this call.
suspend fun close(code: Int, reason: String?) = connectedMutex.withLock {
    // Cancel this so new connection attempts will create a new deferred and not await this one
    pendingConnectDeferred?.cancel()
    pendingConnectDeferred = null

    connectionHolder.get()?.webSocket?.close(code, reason)
    connectionHolder.set(null)
}

Member

Could you try to write a unit test that replicates the issue (it should fail without your fix)?

@kmod-midori kmod-midori Jun 1, 2026

Author

The following test fails unless you increase the delay in the last advanceTimeBy, which means that connection attempts made between WebSocket.close and WebSocketListener.onClosed are ignored. Attempting to send messages then fails immediately because the underlying WebSocket is already marked as closed, yet WebSocketCore still thinks it is connected until onClosed is called to clean up the state. I'm not that familiar with Kotlin testing, so this is the best I could do.

@Test
fun reconnectsAfterShutdown() = runTest {
    setupServer(backgroundScope = backgroundScope)
    every {
        mockConnection.close(1001, "Session removed from app.")
    } answers {
        backgroundScope.launch {
            // Simulate queue delay before onClosed is called after close
            delay(100)
            webSocketListener.onClosed(mockConnection, 1001, "Session removed from app.")
        }
        true
    }

    prepareAuthenticationAnswer()
    assertTrue(webSocketCore.connect())
    // Should connect for the first time
    verify(exactly = 1) { mockOkHttpClient.newWebSocket(any(), webSocketListener) }

    advanceTimeBy(100)

    // This closes the WS (it directly calls close on the socket),
    // similar to what `createSubscriptionFlow` does when all the channels are closed.
    webSocketCore.shutdown()
    verify(exactly = 1) { mockConnection.close(1001, "Session removed from app.") }

    advanceTimeBy(50) // If this is increased beyond the queue delay, the test passes

    assertTrue(webSocketCore.connect())
    // Should connect again, a total of 2 connection attempts
    verify(exactly = 2) { mockOkHttpClient.newWebSocket(any(), webSocketListener) }
}

@TimoPtr TimoPtr marked this pull request as draft June 1, 2026 14:52
@kmod-midori kmod-midori marked this pull request as ready for review June 17, 2026 11:20
@kmod-midori

Author

Hi, is there anything I need to do?

@TimoPtr

TimoPtr commented Jun 17, 2026

Member

Hi, is there anything I need to do?

We need a test that replicates the issue (failing if we remove your logic and passing when it is added). I didn't look at the test in the comment, but if you think it does what we ask, submit it and put the PR back in ready for review.

@TimoPtr TimoPtr marked this pull request as draft June 17, 2026 14:06
@kmod-midori kmod-midori marked this pull request as ready for review June 17, 2026 14:09
@kmod-midori

Author

@TimoPtr The tests are ready. They fail on ba2371c (before this PR) and pass now.

@home-assistant


Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

@home-assistant home-assistant Bot marked this pull request as draft June 18, 2026 10:04
@kmod-midori kmod-midori marked this pull request as ready for review June 18, 2026 15:19
@home-assistant home-assistant Bot requested a review from TimoPtr June 18, 2026 15:19
Development

Successfully merging this pull request may close these issues.

Device Controls show status error, but only sometimes

3 participants