fix(discord): reconnect loop with exponential backoff on silent WS disconnect #919
feiyun968-agent wants to merge 2 commits into
…sconnect

When serenity's client.start() returns Ok(()) after an internal reconnect failure, the Discord adapter silently exits while the container stays 'healthy'. This wraps client.start() in a retry loop that self-heals.

Changes:
- Client::builder uses match instead of ?, build failures walk the same backoff/retry path as runtime errors
- Exponential backoff: 1s → 2s → ... → 30s max, resets after session ≥60s
- Shutdown branch: single 5s timeout wrapping shutdown_all + start.await
- Fatal errors (DisallowedGatewayIntents, InvalidAuthentication) set fatal_exit=true, break loop; anyhow::bail! after cleanup (Rust unwind)
- WARN-level log on every reconnect attempt

Matches existing src/slack.rs and src/gateway.rs reconnect patterns.

Closes openabdev#790
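The retry loop described in this commit message can be sketched as follows. This is a minimal, synchronous illustration and not the code in the PR: `SessionEnd`, `run_with_reconnect`, and the injected `run_session` / `sleep` closures are hypothetical names, and the real adapter is async. Only the numbers (1s initial delay, 30s cap, reset after a session of at least 60s) come from the commit message.

```rust
use std::time::Duration;

/// Hypothetical summary of how one session (client build + start) ended.
enum SessionEnd {
    /// The session ended cleanly or with a transient error after `uptime`.
    Retryable { uptime: Duration },
    /// An error that retrying cannot fix (e.g. bad token, disallowed intents).
    Fatal(String),
    /// Shutdown was requested, so the loop should stop without an error.
    Shutdown,
}

/// Runs sessions until shutdown or a fatal error.
/// `run_session` and `sleep` are injected so the pacing can be observed.
fn run_with_reconnect(
    mut run_session: impl FnMut() -> SessionEnd,
    mut sleep: impl FnMut(Duration),
) -> Result<(), String> {
    let initial = Duration::from_secs(1);
    let max = Duration::from_secs(30);
    let stable = Duration::from_secs(60);

    let mut backoff = initial;
    loop {
        match run_session() {
            SessionEnd::Shutdown => return Ok(()),
            SessionEnd::Fatal(reason) => return Err(reason),
            SessionEnd::Retryable { uptime } => {
                // A stable session resets the delay before it is used.
                if uptime >= stable {
                    backoff = initial;
                }
                // Sleep with the current delay, then double it for next time.
                sleep(backoff);
                backoff = (backoff * 2).min(max);
            }
        }
    }
}
```

Under these assumptions, a scripted run of two short sessions, one 120s session, and then a fatal error sleeps for 1s, 2s, and 1s before returning the fatal error.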
OpenAB PR Screening
This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Screening report
Screened #919, updated the marker comment, and moved project item `PVTI_lADOEFbZWM4BUUALzgtv0GM` from `Incoming` to `PR-Screening`.
GitHub comment: #919 (comment)

Intent
Fix Discord adapter liveness after a silent WebSocket disconnect path where Serenity's …

Feat
Fix work. The PR wraps Discord client construction and …

Who It Serves
Primary beneficiary: deployers and agent runtime operators running the Discord adapter in long-lived containers. Secondary beneficiary: Discord users, because the bot recovers without manual intervention after gateway instability.

Rewritten Prompt
Update the OpenAB Discord adapter so it does not permanently stop receiving events when Serenity exits its gateway loop cleanly or after transient reconnect failures. Rebuild the Discord client on each retry, apply exponential backoff capped at 30 seconds, reset backoff after a stable session, classify authentication and gateway-intent errors as fatal, preserve graceful shutdown semantics with a bounded timeout, and emit useful warning logs for each reconnect attempt. Keep the change scoped to the Discord runtime path, align with existing Slack/gateway retry patterns, and validate with …

Merge Pitch
This should move forward because it addresses a real production failure mode: false health with a dead Discord event stream. The risk profile is moderate but contained to …

Best-Practice Comparison
OpenClaw's scheduler patterns are not directly applicable at the Discord WebSocket gateway layer. Hermes Agent is more relevant: fresh session boundaries, outer reconnect loops, capped backoff, and reset after stable runtime all map well here. File locking, atomic persisted state, and scheduled prompts do not apply to this adapter reconnect path.

Implementation Options
- Conservative: keep the current single-file reconnect loop and merge after CI confirmation.
- Balanced: merge this reconnect fix, then follow with Discord gateway health/readiness telemetry.
- Ambitious: extract a shared reconnect supervisor across Discord, Slack, and gateway runtimes.

Comparison Table

Recommendation
Advance the balanced path. Treat this PR as the scoped reconnect fix, with close review on fatal error handling, shutdown behavior, and retry pacing. Follow up separately with Discord gateway health/readiness telemetry tied to #790.
… phase F2 fix for PR openabdev#919: invalid token / disallowed intents at Client::builder now triggers fatal exit instead of infinite retry, matching the existing client.start() fatal error path.
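The fatal-versus-retryable split that this commit applies to client construction can be illustrated with a small classifier. This is a sketch under assumptions: `StartError` is a hypothetical stand-in whose variant names follow the PR text (it is not serenity's own error type), and `NextStep` and `classify` are names invented for the example.

```rust
/// Hypothetical stand-in for the errors seen while building or starting the client.
#[derive(Debug)]
enum StartError {
    DisallowedGatewayIntents,
    InvalidAuthentication,
    Other(String),
}

/// What the outer reconnect loop should do after an attempt.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// The attempt succeeded; carry on with the session.
    Run,
    /// Transient failure; wait for the backoff delay and try again.
    RetryAfterBackoff,
    /// Misconfiguration that retrying cannot fix; leave the loop and report it.
    FatalExit,
}

/// Uses an explicit `match` (rather than `?`) so every outcome picks a path.
fn classify<T>(result: &Result<T, StartError>) -> NextStep {
    match result {
        Ok(_) => NextStep::Run,
        Err(StartError::DisallowedGatewayIntents | StartError::InvalidAuthentication) => {
            NextStep::FatalExit
        }
        Err(StartError::Other(reason)) => {
            // Log at warning level on every reconnect attempt.
            eprintln!("WARN: reconnecting after error: {reason}");
            NextStep::RetryAfterBackoff
        }
    }
}
```

Routing both the build step and the start step through one classifier of this kind keeps an invalid token from being retried forever while ordinary network failures still follow the backoff path.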
F2 resolved at …. F1 and F3 still need a fix or explicit reviewer-facing justification before merge.

For F1: per Tokio docs …

For F3: the first retry after a stable session waits 2s instead of 1s due to the reset-then-double ordering. Please cover this with a backoff-ordering test or an inline comment asserting the intended sequence, or restructure the doubling logic.

Optional (non-blocking): add a short comment on the defensive …

Health/readiness telemetry remains a separate …
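The ordering question raised in F3 comes down to whether the delay is doubled before or after it is used. The sketch below contrasts the two orderings; the function names and the 30s cap constant are illustrative assumptions, and neither function is taken from the PR.

```rust
use std::time::Duration;

/// Cap taken from the PR description; the constant name is hypothetical.
const MAX_BACKOFF: Duration = Duration::from_secs(30);

/// Sleep-then-double: returns the delay to sleep now, then doubles the stored value.
fn sleep_then_double(backoff: &mut Duration) -> Duration {
    let delay = *backoff;
    *backoff = (*backoff * 2).min(MAX_BACKOFF);
    delay
}

/// Double-then-sleep: doubles the stored value first and sleeps for the result.
fn double_then_sleep(backoff: &mut Duration) -> Duration {
    *backoff = (*backoff * 2).min(MAX_BACKOFF);
    *backoff
}
```

Starting from a freshly reset value of 1s, `sleep_then_double` yields a 1s first delay, while `double_then_sleep` yields 2s. A test asserting the first delay after a reset would distinguish the two.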
CHANGES REQUESTED

What This PR Does
Wraps the Discord adapter in an outer reconnect loop with exponential backoff to recover from silent WebSocket disconnections where serenity exits cleanly but the bot stops receiving events.

How It Works
Outer …

Findings

Finding Details

✅ F1: Watch Channel Race (Resolved)
Contributor correctly identified that …
No code change needed.

✅ F2: Fatal Errors in Builder (Fixed)

⏳ F3: Backoff Reset Ordering
After a stable session (>60s), …
Wait, re-reading the code: the sleep happens first, then the doubling. So the first retry does sleep 1s, then backoff becomes 2 for the next iteration. This is actually correct.

Correction: On closer inspection, the flow after a clean disconnect with >60s uptime is: …

The first retry sleeps 1s as intended. F3 is resolved: the ordering is correct.

Baseline Check
What's Good (🟢)
Verdict update: All three findings are now resolved. F1 was a false positive (borrow() semantics are correct), F2 is fixed, F3 ordering is actually correct on re-analysis.

Updating verdict to: LGTM ✅

All findings resolved. CI green. Code aligns with existing Slack/gateway reconnect patterns. Ready for maintainer decision.

Reviewed by: 覺渡法師 · Coordinated by: 超渡法師
chaodu-agent left a comment
All findings resolved. CI green. LGTM ✅
Invalid Discord URL; closing as not planned. See https://discord.com/channels/1491295327620169908/1491365150664560881/1509404764255817879
What problem does this solve?
Closes #790.
When serenity's `client.start()` returns `Ok(())` after an internal reconnect failure, the Discord adapter permanently stops receiving events while the container stays "healthy". Manual `docker restart` is the only recovery.

Discord Discussion URL: https://discord.com/channels/1491295327620169908/1503631877058334751/1508330641383624724
At a Glance
Prior Art & Industry Research
OpenClaw (openclaw/acpx):
Not applicable at the WS gateway layer: acpx handles ACP session reconnect (`src/runtime/engine/reconnect.ts`), not Discord WS lifecycle.

Hermes Agent (NousResearch/hermes-agent):
Uses a platform base class with a reconnect hook and exponential backoff loop. Same pattern: outer reconnect loop, per-iteration client rebuild, backoff on error, reset on clean disconnect.

OpenAB internal prior art:
`src/slack.rs` (line 861) and `src/gateway.rs` (lines 567–907) already use this pattern. This PR brings the Discord adapter in line with the existing codebase convention.

Proposed Solution
- `Client::builder` uses `match` instead of `?`; build failures walk the same backoff/retry path
- `DisallowedGatewayIntents` / `InvalidAuthentication` set `fatal_exit=true`, break loop; `anyhow::bail!` after cleanup (Rust destructors run normally)
- … `shutdown_all()` and `start.await`

Why this approach?
Matches existing `src/slack.rs` and `src/gateway.rs` patterns: no new abstractions, minimal diff (+109/-32 in one file, single commit).

Alternatives Considered
- … `Ok(())` on exhaustion with no outer recovery hook

Validation
- `cargo check` passes: verified on CI and by reviewer on clean worktree
- `cargo test` / `cargo clippy`: no C linker in agent environment