Fix race conditions and state management in quiz handling#124
Open
bitbacchus wants to merge 10 commits into
Open
Fix race conditions and state management in quiz handling#124bitbacchus wants to merge 10 commits into
bitbacchus wants to merge 10 commits into
Conversation
StartQuiz assembled the question list by filtering this.state.questions (the frontend WS cache), which is populated incrementally as QuestionUpdateMessage responses arrive. When triggered from handleRemoteUpdates or SetQuiz before the cache was fully hydrated, only the questions received so far passed the filter, and QuizRunActor persisted that truncated list to MongoDB. A subsequent reload returned the same incomplete run. GetRun reads question IDs directly from this.state.quiz.groups (already loaded synchronously in SetQuiz), so the question list is always complete at run-creation time. The approved filter is preserved with a safe default: include any question not yet in the WS cache, since GetForUser is get-or-create and ignores the questions param if a run already exists. Shuffle logic is also moved into GetRun so it is applied on the initial get-or-create call.
When a student clicked "Next question", the local answered/answers state was reset synchronously before the backend WebSocket round-trip completed. During that window run.counter was still the old value, so the old question reappeared with Submit re-enabled — allowing a second answer submission. Two complementary changes close the race: 1. CurrentQuizActor LogAnswer handler now performs an optimistic local updateState (counter, answers, correct, wrong) before sending the QuizRunActorMessages.Update to the backend. The actor's run state advances immediately, so React re-renders with the new counter and shows the next question without waiting for the WebSocket echo. The nextCounter value is captured before updateState to avoid a double-increment when the send() call reads this.state.run.counter. 2. RunningQuizTab ties the local state resets (answered, textAnswer, answers) to a useEffect on run?.counter rather than to the click handler. This means state cleanup is always driven by the actual question change, providing defence-in-depth for TEXT questions (which never set answered=true and previously had no Submit guard during the transition).
Both bugs traced to a single race: up to four GetRun messages fired
concurrently during quiz startup (one from SetQuiz, up to three more from
QuizUpdateMessage arrivals in handleRemoteUpdates). The handlers ran in
parallel because ts-actors does not serialize message processing per actor
(send dispatches via setTimeout + RxJS Subject, and the inbox's async
handler yields at the first await without blocking the next dispatch).
Each handler hit a non-atomic findOne + conditional insert in
QuizRunActor.GetForUser, so each created a sibling run document with a
distinct UID. The slowest GetRun completed 700-1300 ms later — after the
student had begun answering — and its unconditional state replacement
overwrote the optimistic run with a sibling at counter=0. The resulting
counter regression triggered RunningQuizTab's useEffect, manifesting as:
- the "Next question" button disappearing (Question Reset Bug), or
- the previous question reappearing with Submit re-enabled (Question
Repetition Glitch recurrence on this branch).
Four complementary changes:
1. QuizRunActor.GetForUser: replace findOne + conditional insert with an
atomic findOneAndUpdate({ $setOnInsert }, { upsert: true,
returnDocument: "after" }). Only the actually-inserting call notifies
collection subscribers.
2. QuizRunActor.beforeStart: create a unique index on (studentId, quizId)
so concurrent upserts cannot race past document-level atomicity. Index
creation is wrapped in try/catch so pre-existing duplicates from the
legacy race don't block startup; the atomic upsert remains a strict
improvement on its own.
3. CurrentQuizActor.GetRun: add a synchronous runFetchStarted flag set
before the first await. JavaScript run-to-completion guarantees later
concurrent invocations see the flag and bail out, deduplicating both
the SetQuiz-dispatched and the handleRemoteUpdates-dispatched calls
without a per-call-site guard. Error path resets the flag for retry.
State replacement now uses incoming.counter >= s.run.counter so even a
single GetRun cannot regress optimistic state.
4. CurrentQuizActor QuizRunUpdateMessage handler: defence-in-depth counter
guard rejecting WS updates whose counter is below the current. With the
optimistic increment in LogAnswer, the authoritative update arrives
with an equal counter and is already a no-op via spread-merge; the
guard only blocks stale regressions from out-of-order delivery.
runFetchStarted (plus the previously-leaking runReady and
questionsSubscribed) is reset in Reset, SetQuiz's new-quiz branch, and
the QuizRunDeletedMessage handler.
Deployment note: legacy duplicate (studentId, quizId) docs in MongoDB
will block the unique index creation. Cleanup runbook is in
coding/recapp/issues/Question reset Bug.md.
bearerValid awaited SessionStore.GetSessionForUserId but discarded the
result. SessionStore.sessionValid returns an Error *value* (not a
rejection) for expired sessions, so the await completed normally and
bearerValid resolved with the userId regardless. The "Accessed session
… that expired …" warning was emitted but never seen by the caller —
the WS handshake then merged into the stale session via
StoreSession({ uid, actorSystem }) and proceeded, which is how a
persistent 30-day cookie could serve a previous week's quiz despite
the server-side session being expired.
Check the result and reject with "Session expired" so the upstream
authenticationMiddleware fails the handshake and the client falls
through to its normal refresh/login flow.
…state.run Mirror GetRun's `incoming.counter >= current.counter` guard in the SetQuiz same-quiz branch and StartQuiz handler. Closes a defense-in-depth gap against late-returning GetUserRun/GetForUser races. Adds RUN_STATE_WRITE debugLog tag and instruments every state.run mutation in CurrentQuizActor + RunningQuizTab's counter useEffect. Backend QuizActor.GetUserRun logs returned runUid+counter; Clear() WARN includes quizId, deletedCount, subscriber counts, and caller. The 2026-06-10 failures under --workers=4 are not explained by these paths (Clear didn't fire in the failure window). The telemetry is to identify the actual mechanism in the next test run.
Four console.log statements that fire on every render and on every radio click (one of them serializes the full quizState). Under CPU throttling these measurably extend the per-render cost and crowd the trace viewer. No behavior change.
DistributedActorSystem.js:41 rejects timed-out asks with a plain string Promise.reject(...). When the awaiting handler doesn't catch, the rejection unwinds out of the receive method, the supervisor catches it, and applies the Shutdown strategy — taking down the whole CurrentQuiz (or LocalUser) actor and leaving the page frozen. Observed via the getuserrun-counter-regression Playwright runs under CPU + network throttle: the client-side actor inbox saturates, the 5s ask timer fires before the message is even dispatched to NATS, and the session dies. Both GetNames sites fetch cosmetic data (teacher display names) and fire on every incoming QuizUpdateMessage broadcast, so they were hitting the timeout repeatedly. Now they degrade to an empty array on failure; the next QuizUpdateMessage retries. A new RUN_STATE_WRITE source "AskFailure" is added so we can count post-fix timeout occurrences in the trace data. Out of scope: - ~18 other ask() sites have the same vulnerability. To be assessed case-by-case after we confirm this surgical fix eliminates the observed crash. - The string-rejection contract in DistributedActorSystem.js itself is the root cause; fixing that upstream would let callers use a single `instanceof Error` check.
…er exceptions ts-actors' default supervisor strategy is "Shutdown" — any unhandled throw from a message handler kills the whole actor. For long-lived, user-facing session actors that own UI state, this is the wrong default: one timed-out ask (string-rejected via DistributedActorSystem.js:41) takes the page down. Switch both CurrentQuizActor and LocalUserActor to "Resume". The supervisor catch block now becomes a no-op after logging (ActorSystem.js:246) — the warning + console.error from the existing Dockerfile patch still fire, so we don't lose diagnostic visibility, but the actor keeps processing the next message. This subsumes the GetTeacherNames try/catch shipped in the previous commit for the ask-timeout case, but the try/catch is still kept for the useful side benefit (graceful empty-array fallback so the next QuizUpdateMessage retries).
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Resume supervision on session actors — page no longer freezes on slow connections
Branch was opened to chase a suspected race in
SetQuiz("question-subset race");investigation revealed the actual user-visible bug was different. Renaming the
branch at this point would lose history — sticking with
fix-question-subset-racebut the title and content of this PR describe what actually ships.
User-visible problem
Under slow/unstable connections (mobile networks, weak wifi, CPU-busy
tabs) the quiz page would freeze. Submit and Next buttons stayed
disabled; the only recovery was a hard reload, losing the session. In
Playwright with 1.5 Mbps + 200 ms RTT + 4× CPU throttle, this reproduced
in roughly 14/50 (28%) of runs.
Root cause
ts-actorsdefaults supervisor strategy to"Shutdown"— any unhandledexception from a message handler kills the actor. Under client-side
load the actor inbox saturated past the 5 s ask timer; in-flight asks
rejected via
DistributedActorSystem.js:41'sPromise.reject(string)contract;
await this.ask(...)unwound that into a throw; the throwescaped the handler; the supervisor shut the actor. With
CurrentQuizActororLocalUserActorgone, React never receivedanother state update — frozen page.
Fix
CurrentQuizActorandLocalUserActornow declarestrategy = "Resume" as const. The supervisor still catches andlogs handler exceptions (with stack, via a small build-time patch on
node_modules/ts-actors/lib/src/ActorSystem.js), but the actorcontinues processing the next message. Page stays responsive.
UserStore.GetNamesasks atCurrentQuizActor.ts:1026andLocalUserActor.ts:117wrapped in try/catch. Cosmetic data —degrade to empty array instead of relying on Resume to swallow.
SetQuizsame-quiz branch andStartQuiz—defense-in-depth against the original (now-disproven)
counter-regression hypothesis. Mirrors the existing
GetRunguard;zero observed firings in tests but cheap to keep.
Verification
--workers=4 --repeat-each=50with 1.5 Mbps / 200 ms RTT / 4× CPUthrottle: 14/50 fail → 50/50 pass. Across the passing run, the
supervisor caught and resumed past 72 handler exceptions
(~1.44/session), predominantly timed-out asks to
QuizActor/QuizRun_<quizId>.GetForUser. Each would have been a frozenpage on the prior build.
Typecheck + lint clean.
Diagnostic / telemetry additions (kept)
RUN_STATE_WRITEdebugLog tag instruments everystate.runmutation in
CurrentQuizActor+RunningQuizTab'suseEffectcounter dep. Gated by
VITE_DEBUG_RECAPP=1/localStorage.DEBUG_RECAPP=1;no-op when off.
QuizActor.GetUserRunlog includesrunUid+counter(also fixed a pre-existing
Maybe<T>truthiness bug in therunPresentfield).QuizRunActor.Clear()WARN now includesquizId,deletedCount, run + collection subscriber counts, andthe caller actor URI.
ts-actors/.../ActorSystem.jssurfaces
e.stackin the supervisor warning. Keeps productionlog inspection useful without adding browser console noise.
Cleanup also included
console.logstatements inRunningQuizTab.tsxthat serialized the entire
quizStateon every render — measurablecost under CPU throttle.
Known follow-ups (not in this PR)
CurrentQuizActor/LocalUserActorremain unwrapped. With Resume they no longer crash the actor —
they silently no-op the rest of the handler. Tracked separately;
user-action sites (Create / Delete / Export, etc.) want a visible
error rather than silent failure.
Related open issues
Partially addresses #111 — the actor-shutdown hypothesis matches exactly
what we found and fixed. The "stuck actor never reaches ready" mode is
gone. Note this PR does not close the issue: when LocalUser's GetOwn
ask times out, the user state silently stays empty (the actor survives
but the handler no-ops). The full close needs the follow-up tracked in
issues/actor-ask-resilience-followup.md— wrapLocalUserActor:76with try/catch + meaningful retry / "session expired" message.
May also fix #106 — Resume eliminates the partial-state
fallout from actors crashing mid-transition between quizzes, which
matches one of the reporter's hypotheses. The deeper subscription /
unsub-resub race in
SetQuizdifferent-quiz branch is unchanged, sotreating this as a confirmed fix is premature.