Fix race conditions and state management in quiz handling by bitbacchus · Pull Request #124 · ecomod-code/recapp

bitbacchus · 2026-06-11T20:22:43Z

Resume supervision on session actors — page no longer freezes on slow connections

Branch was opened to chase a suspected race in SetQuiz ("question-subset race");
investigation revealed the actual user-visible bug was different. Renaming the
branch at this point would lose history — sticking with fix-question-subset-race
but the title and content of this PR describe what actually ships.

User-visible problem

Under slow/unstable connections (mobile networks, weak wifi, CPU-busy
tabs) the quiz page would freeze. Submit and Next buttons stayed
disabled; the only recovery was a hard reload, losing the session. In
Playwright with 1.5 Mbps + 200 ms RTT + 4× CPU throttle, this reproduced
in roughly 14/50 (28%) of runs.

Root cause

ts-actors defaults supervisor strategy to "Shutdown" — any unhandled
exception from a message handler kills the actor. Under client-side
load the actor inbox saturated past the 5 s ask timer; in-flight asks
rejected via DistributedActorSystem.js:41's Promise.reject(string)
contract; await this.ask(...) unwound that into a throw; the throw
escaped the handler; the supervisor shut the actor. With
CurrentQuizActor or LocalUserActor gone, React never received
another state update — frozen page.

Fix

CurrentQuizActor and LocalUserActor now declare
strategy = "Resume" as const. The supervisor still catches and
logs handler exceptions (with stack, via a small build-time patch on
node_modules/ts-actors/lib/src/ActorSystem.js), but the actor
continues processing the next message. Page stays responsive.
UserStore.GetNames asks at CurrentQuizActor.ts:1026 and
LocalUserActor.ts:117 wrapped in try/catch. Cosmetic data —
degrade to empty array instead of relying on Resume to swallow.
Counter guards at SetQuiz same-quiz branch and StartQuiz —
defense-in-depth against the original (now-disproven)
counter-regression hypothesis. Mirrors the existing GetRun guard;
zero observed firings in tests but cheap to keep.

Verification

--workers=4 --repeat-each=50 with 1.5 Mbps / 200 ms RTT / 4× CPU
throttle: 14/50 fail → 50/50 pass. Across the passing run, the
supervisor caught and resumed past 72 handler exceptions
(~1.44/session), predominantly timed-out asks to
QuizActor/QuizRun_<quizId>.GetForUser. Each would have been a frozen
page on the prior build.

Typecheck + lint clean.

Diagnostic / telemetry additions (kept)

New RUN_STATE_WRITE debugLog tag instruments every state.run
mutation in CurrentQuizActor + RunningQuizTab's useEffect
counter dep. Gated by VITE_DEBUG_RECAPP=1 / localStorage.DEBUG_RECAPP=1;
no-op when off.
Backend: QuizActor.GetUserRun log includes runUid + counter
(also fixed a pre-existing Maybe<T> truthiness bug in the
runPresent field). QuizRunActor.Clear() WARN now includes
quizId, deletedCount, run + collection subscriber counts, and
the caller actor URI.
Dockerfile build-time patch on ts-actors/.../ActorSystem.js
surfaces e.stack in the supervisor warning. Keeps production
log inspection useful without adding browser console noise.

Cleanup also included

Removed four dev console.log statements in RunningQuizTab.tsx
that serialized the entire quizState on every render — measurable
cost under CPU throttle.

Known follow-ups (not in this PR)

About 18 other ask sites in CurrentQuizActor / LocalUserActor
remain unwrapped. With Resume they no longer crash the actor —
they silently no-op the rest of the handler. Tracked separately;
user-action sites (Create / Delete / Export, etc.) want a visible
error rather than silent failure.

Related open issues

Partially addresses #111 — the actor-shutdown hypothesis matches exactly
what we found and fixed. The "stuck actor never reaches ready" mode is
gone. Note this PR does not close the issue: when LocalUser's GetOwn
ask times out, the user state silently stays empty (the actor survives
but the handler no-ops). The full close needs the follow-up tracked in
issues/actor-ask-resilience-followup.md — wrap LocalUserActor:76
with try/catch + meaningful retry / "session expired" message.

May also fix #106 — Resume eliminates the partial-state
fallout from actors crashing mid-transition between quizzes, which
matches one of the reporter's hypotheses. The deeper subscription /
unsub-resub race in SetQuiz different-quiz branch is unchanged, so
treating this as a confirmed fix is premature.

StartQuiz assembled the question list by filtering this.state.questions (the frontend WS cache), which is populated incrementally as QuestionUpdateMessage responses arrive. When triggered from handleRemoteUpdates or SetQuiz before the cache was fully hydrated, only the questions received so far passed the filter, and QuizRunActor persisted that truncated list to MongoDB. A subsequent reload returned the same incomplete run. GetRun reads question IDs directly from this.state.quiz.groups (already loaded synchronously in SetQuiz), so the question list is always complete at run-creation time. The approved filter is preserved with a safe default: include any question not yet in the WS cache, since GetForUser is get-or-create and ignores the questions param if a run already exists. Shuffle logic is also moved into GetRun so it is applied on the initial get-or-create call.

When a student clicked "Next question", the local answered/answers state was reset synchronously before the backend WebSocket round-trip completed. During that window run.counter was still the old value, so the old question reappeared with Submit re-enabled — allowing a second answer submission. Two complementary changes close the race: 1. CurrentQuizActor LogAnswer handler now performs an optimistic local updateState (counter, answers, correct, wrong) before sending the QuizRunActorMessages.Update to the backend. The actor's run state advances immediately, so React re-renders with the new counter and shows the next question without waiting for the WebSocket echo. The nextCounter value is captured before updateState to avoid a double-increment when the send() call reads this.state.run.counter. 2. RunningQuizTab ties the local state resets (answered, textAnswer, answers) to a useEffect on run?.counter rather than to the click handler. This means state cleanup is always driven by the actual question change, providing defence-in-depth for TEXT questions (which never set answered=true and previously had no Submit guard during the transition).

Both bugs traced to a single race: up to four GetRun messages fired concurrently during quiz startup (one from SetQuiz, up to three more from QuizUpdateMessage arrivals in handleRemoteUpdates). The handlers ran in parallel because ts-actors does not serialize message processing per actor (send dispatches via setTimeout + RxJS Subject, and the inbox's async handler yields at the first await without blocking the next dispatch). Each handler hit a non-atomic findOne + conditional insert in QuizRunActor.GetForUser, so each created a sibling run document with a distinct UID. The slowest GetRun completed 700-1300 ms later — after the student had begun answering — and its unconditional state replacement overwrote the optimistic run with a sibling at counter=0. The resulting counter regression triggered RunningQuizTab's useEffect, manifesting as: - the "Next question" button disappearing (Question Reset Bug), or - the previous question reappearing with Submit re-enabled (Question Repetition Glitch recurrence on this branch). Four complementary changes: 1. QuizRunActor.GetForUser: replace findOne + conditional insert with an atomic findOneAndUpdate({ $setOnInsert }, { upsert: true, returnDocument: "after" }). Only the actually-inserting call notifies collection subscribers. 2. QuizRunActor.beforeStart: create a unique index on (studentId, quizId) so concurrent upserts cannot race past document-level atomicity. Index creation is wrapped in try/catch so pre-existing duplicates from the legacy race don't block startup; the atomic upsert remains a strict improvement on its own. 3. CurrentQuizActor.GetRun: add a synchronous runFetchStarted flag set before the first await. JavaScript run-to-completion guarantees later concurrent invocations see the flag and bail out, deduplicating both the SetQuiz-dispatched and the handleRemoteUpdates-dispatched calls without a per-call-site guard. Error path resets the flag for retry. State replacement now uses incoming.counter >= s.run.counter so even a single GetRun cannot regress optimistic state. 4. CurrentQuizActor QuizRunUpdateMessage handler: defence-in-depth counter guard rejecting WS updates whose counter is below the current. With the optimistic increment in LogAnswer, the authoritative update arrives with an equal counter and is already a no-op via spread-merge; the guard only blocks stale regressions from out-of-order delivery. runFetchStarted (plus the previously-leaking runReady and questionsSubscribed) is reset in Reset, SetQuiz's new-quiz branch, and the QuizRunDeletedMessage handler. Deployment note: legacy duplicate (studentId, quizId) docs in MongoDB will block the unique index creation. Cleanup runbook is in coding/recapp/issues/Question reset Bug.md.

bearerValid awaited SessionStore.GetSessionForUserId but discarded the result. SessionStore.sessionValid returns an Error *value* (not a rejection) for expired sessions, so the await completed normally and bearerValid resolved with the userId regardless. The "Accessed session … that expired …" warning was emitted but never seen by the caller — the WS handshake then merged into the stale session via StoreSession({ uid, actorSystem }) and proceeded, which is how a persistent 30-day cookie could serve a previous week's quiz despite the server-side session being expired. Check the result and reject with "Session expired" so the upstream authenticationMiddleware fails the handshake and the client falls through to its normal refresh/login flow.

…state.run Mirror GetRun's `incoming.counter >= current.counter` guard in the SetQuiz same-quiz branch and StartQuiz handler. Closes a defense-in-depth gap against late-returning GetUserRun/GetForUser races. Adds RUN_STATE_WRITE debugLog tag and instruments every state.run mutation in CurrentQuizActor + RunningQuizTab's counter useEffect. Backend QuizActor.GetUserRun logs returned runUid+counter; Clear() WARN includes quizId, deletedCount, subscriber counts, and caller. The 2026-06-10 failures under --workers=4 are not explained by these paths (Clear didn't fire in the failure window). The telemetry is to identify the actual mechanism in the next test run.

Four console.log statements that fire on every render and on every radio click (one of them serializes the full quizState). Under CPU throttling these measurably extend the per-render cost and crowd the trace viewer. No behavior change.

DistributedActorSystem.js:41 rejects timed-out asks with a plain string Promise.reject(...). When the awaiting handler doesn't catch, the rejection unwinds out of the receive method, the supervisor catches it, and applies the Shutdown strategy — taking down the whole CurrentQuiz (or LocalUser) actor and leaving the page frozen. Observed via the getuserrun-counter-regression Playwright runs under CPU + network throttle: the client-side actor inbox saturates, the 5s ask timer fires before the message is even dispatched to NATS, and the session dies. Both GetNames sites fetch cosmetic data (teacher display names) and fire on every incoming QuizUpdateMessage broadcast, so they were hitting the timeout repeatedly. Now they degrade to an empty array on failure; the next QuizUpdateMessage retries. A new RUN_STATE_WRITE source "AskFailure" is added so we can count post-fix timeout occurrences in the trace data. Out of scope: - ~18 other ask() sites have the same vulnerability. To be assessed case-by-case after we confirm this surgical fix eliminates the observed crash. - The string-rejection contract in DistributedActorSystem.js itself is the root cause; fixing that upstream would let callers use a single `instanceof Error` check.

…er exceptions ts-actors' default supervisor strategy is "Shutdown" — any unhandled throw from a message handler kills the whole actor. For long-lived, user-facing session actors that own UI state, this is the wrong default: one timed-out ask (string-rejected via DistributedActorSystem.js:41) takes the page down. Switch both CurrentQuizActor and LocalUserActor to "Resume". The supervisor catch block now becomes a no-op after logging (ActorSystem.js:246) — the warning + console.error from the existing Dockerfile patch still fire, so we don't lose diagnostic visibility, but the actor keeps processing the next message. This subsumes the GetTeacherNames try/catch shipped in the previous commit for the ask-timeout case, but the try/catch is still kept for the useful side benefit (graceful empty-array fallback so the next QuizUpdateMessage retries).

codecov · 2026-06-11T20:24:36Z

Codecov Report

❌ Patch coverage is 4.41176% with 260 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
packages/frontend/src/actors/CurrentQuizActor.ts	0.00%	179 Missing ⚠️
packages/backend/src/actors/QuizRunActor.ts	19.04%	51 Missing ⚠️
...ontend/src/components/quiz-tabs/RunningQuizTab.tsx	0.00%	13 Missing ⚠️
packages/frontend/src/actors/LocalUserActor.ts	0.00%	11 Missing ⚠️
packages/backend/src/actors/QuizActor.ts	0.00%	4 Missing ⚠️
packages/frontend/src/pages/QuestionEdit.tsx	0.00%	1 Missing ⚠️
packages/frontend/src/utils/debugLog.ts	0.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

bitbacchus added 10 commits May 27, 2026 13:08

Adds diagnostics for debugging

75cd149

Cleanup

2ac8af0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix race conditions and state management in quiz handling#124

Fix race conditions and state management in quiz handling#124
bitbacchus wants to merge 10 commits into
mainfrom
fix-question-subset-race

bitbacchus commented Jun 11, 2026

Uh oh!

codecov Bot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bitbacchus commented Jun 11, 2026

Resume supervision on session actors — page no longer freezes on slow connections

User-visible problem

Root cause

Fix

Verification

Diagnostic / telemetry additions (kept)

Cleanup also included

Known follow-ups (not in this PR)

Related open issues

Uh oh!

codecov Bot commented Jun 11, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant