Skip to content

Fix race conditions and state management in quiz handling#124

Open
bitbacchus wants to merge 10 commits into
mainfrom
fix-question-subset-race
Open

Fix race conditions and state management in quiz handling#124
bitbacchus wants to merge 10 commits into
mainfrom
fix-question-subset-race

Conversation

@bitbacchus

Copy link
Copy Markdown
Member

Resume supervision on session actors — page no longer freezes on slow connections

Branch was opened to chase a suspected race in SetQuiz ("question-subset race");
investigation revealed the actual user-visible bug was different. Renaming the
branch at this point would lose history — sticking with fix-question-subset-race
but the title and content of this PR describe what actually ships.

User-visible problem

Under slow/unstable connections (mobile networks, weak wifi, CPU-busy
tabs) the quiz page would freeze. Submit and Next buttons stayed
disabled; the only recovery was a hard reload, losing the session. In
Playwright with 1.5 Mbps + 200 ms RTT + 4× CPU throttle, this reproduced
in roughly 14/50 (28%) of runs.

Root cause

ts-actors defaults supervisor strategy to "Shutdown" — any unhandled
exception from a message handler kills the actor. Under client-side
load the actor inbox saturated past the 5 s ask timer; in-flight asks
rejected via DistributedActorSystem.js:41's Promise.reject(string)
contract; await this.ask(...) unwound that into a throw; the throw
escaped the handler; the supervisor shut the actor. With
CurrentQuizActor or LocalUserActor gone, React never received
another state update — frozen page.

Fix

  • CurrentQuizActor and LocalUserActor now declare
    strategy = "Resume" as const. The supervisor still catches and
    logs handler exceptions (with stack, via a small build-time patch on
    node_modules/ts-actors/lib/src/ActorSystem.js), but the actor
    continues processing the next message. Page stays responsive.
  • UserStore.GetNames asks at CurrentQuizActor.ts:1026 and
    LocalUserActor.ts:117 wrapped in try/catch. Cosmetic data —
    degrade to empty array instead of relying on Resume to swallow.
  • Counter guards at SetQuiz same-quiz branch and StartQuiz
    defense-in-depth against the original (now-disproven)
    counter-regression hypothesis. Mirrors the existing GetRun guard;
    zero observed firings in tests but cheap to keep.

Verification

--workers=4 --repeat-each=50 with 1.5 Mbps / 200 ms RTT / 4× CPU
throttle: 14/50 fail → 50/50 pass. Across the passing run, the
supervisor caught and resumed past 72 handler exceptions
(~1.44/session), predominantly timed-out asks to
QuizActor/QuizRun_<quizId>.GetForUser. Each would have been a frozen
page on the prior build.

Typecheck + lint clean.

Diagnostic / telemetry additions (kept)

  • New RUN_STATE_WRITE debugLog tag instruments every state.run
    mutation in CurrentQuizActor + RunningQuizTab's useEffect
    counter dep. Gated by VITE_DEBUG_RECAPP=1 / localStorage.DEBUG_RECAPP=1;
    no-op when off.
  • Backend: QuizActor.GetUserRun log includes runUid + counter
    (also fixed a pre-existing Maybe<T> truthiness bug in the
    runPresent field). QuizRunActor.Clear() WARN now includes
    quizId, deletedCount, run + collection subscriber counts, and
    the caller actor URI.
  • Dockerfile build-time patch on ts-actors/.../ActorSystem.js
    surfaces e.stack in the supervisor warning. Keeps production
    log inspection useful without adding browser console noise.

Cleanup also included

  • Removed four dev console.log statements in RunningQuizTab.tsx
    that serialized the entire quizState on every render — measurable
    cost under CPU throttle.

Known follow-ups (not in this PR)

  • About 18 other ask sites in CurrentQuizActor / LocalUserActor
    remain unwrapped. With Resume they no longer crash the actor —
    they silently no-op the rest of the handler. Tracked separately;
    user-action sites (Create / Delete / Export, etc.) want a visible
    error rather than silent failure.

Related open issues

Partially addresses #111 — the actor-shutdown hypothesis matches exactly
what we found and fixed. The "stuck actor never reaches ready" mode is
gone. Note this PR does not close the issue: when LocalUser's GetOwn
ask times out, the user state silently stays empty (the actor survives
but the handler no-ops). The full close needs the follow-up tracked in
issues/actor-ask-resilience-followup.md — wrap LocalUserActor:76
with try/catch + meaningful retry / "session expired" message.

May also fix #106 — Resume eliminates the partial-state
fallout from actors crashing mid-transition between quizzes, which
matches one of the reporter's hypotheses. The deeper subscription /
unsub-resub race in SetQuiz different-quiz branch is unchanged, so
treating this as a confirmed fix is premature.

bitbacchus added 10 commits May 27, 2026 13:08
StartQuiz assembled the question list by filtering this.state.questions
(the frontend WS cache), which is populated incrementally as
QuestionUpdateMessage responses arrive. When triggered from
handleRemoteUpdates or SetQuiz before the cache was fully hydrated,
only the questions received so far passed the filter, and QuizRunActor
persisted that truncated list to MongoDB. A subsequent reload returned
the same incomplete run.

GetRun reads question IDs directly from this.state.quiz.groups (already
loaded synchronously in SetQuiz), so the question list is always
complete at run-creation time. The approved filter is preserved with a
safe default: include any question not yet in the WS cache, since
GetForUser is get-or-create and ignores the questions param if a run
already exists. Shuffle logic is also moved into GetRun so it is
applied on the initial get-or-create call.
When a student clicked "Next question", the local answered/answers state
was reset synchronously before the backend WebSocket round-trip completed.
During that window run.counter was still the old value, so the old question
reappeared with Submit re-enabled — allowing a second answer submission.

Two complementary changes close the race:

1. CurrentQuizActor LogAnswer handler now performs an optimistic local
   updateState (counter, answers, correct, wrong) before sending the
   QuizRunActorMessages.Update to the backend.  The actor's run state
   advances immediately, so React re-renders with the new counter and
   shows the next question without waiting for the WebSocket echo.
   The nextCounter value is captured before updateState to avoid a
   double-increment when the send() call reads this.state.run.counter.

2. RunningQuizTab ties the local state resets (answered, textAnswer,
   answers) to a useEffect on run?.counter rather than to the click
   handler.  This means state cleanup is always driven by the actual
   question change, providing defence-in-depth for TEXT questions
   (which never set answered=true and previously had no Submit guard
   during the transition).
Both bugs traced to a single race: up to four GetRun messages fired
concurrently during quiz startup (one from SetQuiz, up to three more from
QuizUpdateMessage arrivals in handleRemoteUpdates). The handlers ran in
parallel because ts-actors does not serialize message processing per actor
(send dispatches via setTimeout + RxJS Subject, and the inbox's async
handler yields at the first await without blocking the next dispatch).

Each handler hit a non-atomic findOne + conditional insert in
QuizRunActor.GetForUser, so each created a sibling run document with a
distinct UID. The slowest GetRun completed 700-1300 ms later — after the
student had begun answering — and its unconditional state replacement
overwrote the optimistic run with a sibling at counter=0. The resulting
counter regression triggered RunningQuizTab's useEffect, manifesting as:
- the "Next question" button disappearing (Question Reset Bug), or
- the previous question reappearing with Submit re-enabled (Question
  Repetition Glitch recurrence on this branch).

Four complementary changes:

1. QuizRunActor.GetForUser: replace findOne + conditional insert with an
   atomic findOneAndUpdate({ $setOnInsert }, { upsert: true,
   returnDocument: "after" }). Only the actually-inserting call notifies
   collection subscribers.

2. QuizRunActor.beforeStart: create a unique index on (studentId, quizId)
   so concurrent upserts cannot race past document-level atomicity. Index
   creation is wrapped in try/catch so pre-existing duplicates from the
   legacy race don't block startup; the atomic upsert remains a strict
   improvement on its own.

3. CurrentQuizActor.GetRun: add a synchronous runFetchStarted flag set
   before the first await. JavaScript run-to-completion guarantees later
   concurrent invocations see the flag and bail out, deduplicating both
   the SetQuiz-dispatched and the handleRemoteUpdates-dispatched calls
   without a per-call-site guard. Error path resets the flag for retry.
   State replacement now uses incoming.counter >= s.run.counter so even a
   single GetRun cannot regress optimistic state.

4. CurrentQuizActor QuizRunUpdateMessage handler: defence-in-depth counter
   guard rejecting WS updates whose counter is below the current. With the
   optimistic increment in LogAnswer, the authoritative update arrives
   with an equal counter and is already a no-op via spread-merge; the
   guard only blocks stale regressions from out-of-order delivery.

runFetchStarted (plus the previously-leaking runReady and
questionsSubscribed) is reset in Reset, SetQuiz's new-quiz branch, and
the QuizRunDeletedMessage handler.

Deployment note: legacy duplicate (studentId, quizId) docs in MongoDB
will block the unique index creation. Cleanup runbook is in
coding/recapp/issues/Question reset Bug.md.
bearerValid awaited SessionStore.GetSessionForUserId but discarded the
result. SessionStore.sessionValid returns an Error *value* (not a
rejection) for expired sessions, so the await completed normally and
bearerValid resolved with the userId regardless. The "Accessed session
… that expired …" warning was emitted but never seen by the caller —
the WS handshake then merged into the stale session via
StoreSession({ uid, actorSystem }) and proceeded, which is how a
persistent 30-day cookie could serve a previous week's quiz despite
the server-side session being expired.

Check the result and reject with "Session expired" so the upstream
authenticationMiddleware fails the handshake and the client falls
through to its normal refresh/login flow.
…state.run

Mirror GetRun's `incoming.counter >= current.counter` guard in the
SetQuiz same-quiz branch and StartQuiz handler. Closes a defense-in-depth
gap against late-returning GetUserRun/GetForUser races.

Adds RUN_STATE_WRITE debugLog tag and instruments every state.run
mutation in CurrentQuizActor + RunningQuizTab's counter useEffect.
Backend QuizActor.GetUserRun logs returned runUid+counter; Clear() WARN
includes quizId, deletedCount, subscriber counts, and caller.

The 2026-06-10 failures under --workers=4 are not explained by these
paths (Clear didn't fire in the failure window). The telemetry is to
identify the actual mechanism in the next test run.
Four console.log statements that fire on every render and on every
radio click (one of them serializes the full quizState). Under CPU
throttling these measurably extend the per-render cost and crowd the
trace viewer. No behavior change.
DistributedActorSystem.js:41 rejects timed-out asks with a plain string
Promise.reject(...). When the awaiting handler doesn't catch, the
rejection unwinds out of the receive method, the supervisor catches it,
and applies the Shutdown strategy — taking down the whole CurrentQuiz
(or LocalUser) actor and leaving the page frozen.

Observed via the getuserrun-counter-regression Playwright runs under
CPU + network throttle: the client-side actor inbox saturates, the 5s
ask timer fires before the message is even dispatched to NATS, and the
session dies.

Both GetNames sites fetch cosmetic data (teacher display names) and
fire on every incoming QuizUpdateMessage broadcast, so they were
hitting the timeout repeatedly. Now they degrade to an empty array on
failure; the next QuizUpdateMessage retries.

A new RUN_STATE_WRITE source "AskFailure" is added so we can count
post-fix timeout occurrences in the trace data.

Out of scope:
- ~18 other ask() sites have the same vulnerability. To be assessed
  case-by-case after we confirm this surgical fix eliminates the
  observed crash.
- The string-rejection contract in DistributedActorSystem.js itself is
  the root cause; fixing that upstream would let callers use a single
  `instanceof Error` check.
…er exceptions

ts-actors' default supervisor strategy is "Shutdown" — any unhandled
throw from a message handler kills the whole actor. For long-lived,
user-facing session actors that own UI state, this is the wrong default:
one timed-out ask (string-rejected via DistributedActorSystem.js:41)
takes the page down.

Switch both CurrentQuizActor and LocalUserActor to "Resume". The
supervisor catch block now becomes a no-op after logging
(ActorSystem.js:246) — the warning + console.error from the existing
Dockerfile patch still fire, so we don't lose diagnostic visibility, but
the actor keeps processing the next message.

This subsumes the GetTeacherNames try/catch shipped in the previous
commit for the ask-timeout case, but the try/catch is still kept for the
useful side benefit (graceful empty-array fallback so the next
QuizUpdateMessage retries).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Some participants only see a subset of questions in the answering phase

1 participant