fix(discord): reconnect loop with exponential backoff on silent WS disconnect by feiyun968-agent · Pull Request #919 · openabdev/openab

feiyun968-agent · 2026-05-25T15:12:44Z

What problem does this solve?

Closes #790.

When serenity's client.start() returns Ok(()) after an internal reconnect failure, the Discord adapter permanently stops receiving events while the container stays "healthy". Manual docker restart is the only recovery.

Discord Discussion URL: https://discord.com/channels/1491295327620169908/1503631877058334751/1508330641383624724

At a Glance

BEFORE:
  client.start().await → Ok(()) → adapter exits silently
  container: Up X hours (healthy) ← misleading
  bot: unresponsive

AFTER:
  loop {
      Client::builder → match (build failure → backoff/retry)
      client.start().await
      ├── Ok(())        → reset backoff, retry
      ├── transient Err → backoff (1s→2s→…→30s), retry
      ├── fatal Err     → fatal_exit=true, break → bail! after cleanup
      └── shutdown sig  → timeout(5s, shutdown_all+start), break
  }

Prior Art & Industry Research

OpenClaw (openclaw/acpx):
Not applicable at the WS gateway layer — acpx handles ACP session reconnect (src/runtime/engine/reconnect.ts), not Discord WS lifecycle.

Hermes Agent (NousResearch/hermes-agent):
Uses a platform base class with a reconnect hook and exponential backoff loop — same pattern: outer reconnect loop, per-iteration client rebuild, backoff on error, reset on clean disconnect.

OpenAB internal prior art:
src/slack.rs (line 861) and src/gateway.rs (lines 567–907) already use this pattern. This PR brings the Discord adapter in line with the existing codebase convention.

Proposed Solution

Client builder errors handled — Client::builder uses match instead of ?; build failures walk the same backoff/retry path
Exponential backoff — 1s → 2s → … → 30s max; resets after a session lasting ≥60s
Fatal errors — DisallowedGatewayIntents / InvalidAuthentication set fatal_exit=true, break loop; anyhow::bail! after cleanup (Rust destructors run normally)
Graceful shutdown — single 5s timeout wrapping both shutdown_all() and start.await
Observability — WARN-level log on every reconnect attempt

Why this approach?

Matches existing src/slack.rs and src/gateway.rs patterns — no new abstractions, minimal diff (+109/-32 in one file, single commit).

Alternatives Considered

Healthcheck gateway state (expose last-heartbeat-ack) — valid complement, deferred to separate PR per Discord gateway silently dies after WS disconnect — no reconnect loop #790
Serenity auto-reconnect only — insufficient; returns Ok(()) on exhaustion with no outer recovery hook

Validation

cargo check passes — verified on CI and by reviewer on clean worktree
Docker smoke tests × 11 variants — all pass (CI)
cargo test / cargo clippy — no C linker in agent environment
E2E — runtime behaviour requires live Discord gateway test

…sconnect When serenity's client.start() returns Ok(()) after an internal reconnect failure, the Discord adapter silently exits while the container stays 'healthy'. This wraps client.start() in a retry loop that self-heals. Changes: - Client::builder uses match instead of ?, build failures walk the same backoff/retry path as runtime errors - Exponential backoff: 1s → 2s → ... → 30s max, resets after session ≥60s - Shutdown branch: single 5s timeout wrapping shutdown_all + start.await - Fatal errors (DisallowedGatewayIntents, InvalidAuthentication) set fatal_exit=true, break loop; anyhow::bail! after cleanup (Rust unwind) - WARN-level log on every reconnect attempt Matches existing src/slack.rs and src/gateway.rs reconnect patterns. Closes openabdev#790

shaun-agent · 2026-05-25T18:01:52Z

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Title: fix(discord): reconnect loop with exponential backoff on silent WS disconnect
Source: fix(discord): reconnect loop with exponential backoff on silent WS disconnect #919
Status: moved to PR-Screening
Generated at: 2026-05-25T18:04:22.551Z
Discord thread: not available

Screening report

screened #919, updated the marker comment, and moved project item `PVTI_lADOEFbZWM4BUUALzgtv0GM` from `Incoming` to `PR-Screening`.

GitHub comment: #919 (comment)
Project action: https://github.com/orgs/openabdev/projects/1

Intent

Fix Discord adapter liveness after a silent WebSocket disconnect path where Serenity's client.start() can return Ok(()) after reconnect exhaustion. The operator-visible problem is a container that still looks healthy while the Discord bot no longer receives events, requiring a manual restart.

Feat

Fix work. The PR wraps Discord client construction and client.start() in an outer reconnect loop, retries transient build/start failures with capped exponential backoff, treats invalid auth/intents as fatal, and adds reconnect logging plus bounded graceful shutdown.

Who It Serves

Primary beneficiary: deployers and agent runtime operators running the Discord adapter in long-lived containers. Secondary beneficiary: Discord users, because the bot recovers without manual intervention after gateway instability.

Rewritten Prompt

Update the OpenAB Discord adapter so it does not permanently stop receiving events when Serenity exits its gateway loop cleanly or after transient reconnect failures. Rebuild the Discord client on each retry, apply exponential backoff capped at 30 seconds, reset backoff after a stable session, classify authentication and gateway-intent errors as fatal, preserve graceful shutdown semantics with a bounded timeout, and emit useful warning logs for each reconnect attempt. Keep the change scoped to the Discord runtime path, align with existing Slack/gateway retry patterns, and validate with cargo check plus targeted tests or smoke coverage where the environment allows.

Merge Pitch

This should move forward because it addresses a real production failure mode: false health with a dead Discord event stream. The risk profile is moderate but contained to src/main.rs Discord startup/runtime behavior. The likely reviewer concern is whether the retry loop can mask fatal configuration errors, interfere with shutdown, or create tight reconnect churn. The PR appears to address that with fatal error classification, capped backoff, and a shutdown timeout, but reviewer attention should stay on those boundaries.

Best-Practice Comparison

OpenClaw's scheduler patterns are not directly applicable at the Discord WebSocket gateway layer. Hermes Agent is more relevant: fresh session boundaries, outer reconnect loops, capped backoff, and reset after stable runtime all map well here. File locking, atomic persisted state, and scheduled prompts do not apply to this adapter reconnect path.

Implementation Options

Conservative: keep the current single-file reconnect loop and merge after CI confirmation.

Balanced: merge this reconnect fix, then follow with Discord gateway health/readiness telemetry.

Ambitious: extract a shared reconnect supervisor across Discord, Slack, and gateway runtimes.

Comparison Table

Option	Speed	Complexity	Reliability	Maintainability	User Impact	Fit for OpenAB now
Conservative	Fast	Low	Good	Acceptable	Restores Discord reconnects	Strong
Balanced	Medium	Medium	Better	Good	Fix plus operator visibility	Best
Ambitious	Slow	High	Highest if done well	Potentially best	Broad consistency	Not for this PR

Recommendation

Advance the balanced path. Treat this PR as the scoped reconnect fix, with close review on fatal error handling, shutdown behavior, and retry pacing. Follow up separately with Discord gateway health/readiness telemetry tied to #790.

… phase F2 fix for PR openabdev#919: invalid token / disallowed intents at Client::builder now triggers fatal exit instead of infinite retry, matching the existing client.start() fatal error path.

feiyun968-agent · 2026-05-27T14:21:50Z

F2 resolved at feiyun968-agent/openab@248ad35: is_fatal_discord_error() is now extracted as a shared helper and called from both the Client::builder error arm and the client.start() error arm. The builder fatal branch exits without retry; the client.start() fatal guard ordering is correct.

F1 and F3 still need fix or explicit reviewer-facing justification before merge.

For F1: per Tokio docs watch::Receiver::borrow() does not mark the value as seen, so the shutdown signal is not lost through the existing guards; nevertheless, a shared sleep_or_shutdown() helper would make the retry shutdown behavior easier to audit — please either add it or leave a comment explaining why the current structure is sufficient.

For F3: the first retry after a stable session waits 2s instead of 1s due to the reset-then-double ordering — please cover this with a backoff-ordering test or inline comment asserting the intended sequence, or restructure the doubling logic.

Optional (non-blocking): add a short comment on the defensive _ arm in the client.start() fatal-error inner match to clarify it is intentional future-proofing against new serenity::Error variants.

Health/readiness telemetry remains a separate #790 follow-up.

chaodu-agent · 2026-05-28T00:04:51Z

CHANGES REQUESTED ⚠️ — One minor backoff-ordering issue remains; watch channel concern is resolved.

What This PR Does

Wraps the Discord adapter in an outer reconnect loop with exponential backoff to recover from silent WebSocket disconnections where serenity exits cleanly but the bot stops receiving events.

How It Works

Outer loop rebuilds Client each iteration. Shared watch channel carries shutdown signal. Fatal errors (InvalidAuthentication, DisallowedGatewayIntents) break immediately. Transient errors and clean disconnects trigger capped exponential backoff (1s→30s). Graceful shutdown via tokio::select! on the watch receiver.

Findings

#	Severity	Finding	Status	Location
1	🟡	Watch channel race condition	✅ Resolved — `borrow()` does not mark seen; guards before each select are correct	`src/main.rs`
2	🟡	Fatal errors not caught in builder phase	✅ Fixed at `248ad35` — `is_fatal_discord_error()` shared helper	`src/main.rs`
3	🟡	Backoff reset then unconditional doubling — first retry after stable session is 2s not 1s	⏳ Open	`src/main.rs`

Finding Details

✅ F1: Watch Channel Race (Resolved)

Contributor correctly identified that tokio::sync::watch::Receiver::borrow() does not update the version marker (only borrow_and_update() does). The pattern used — if *rx.borrow() { break; } guard before each tokio::select! with rx.changed() — is safe:

If shutdown fires before the guard → borrow() sees true, breaks
If shutdown fires during select! → changed() fires, breaks

No code change needed.

✅ F2: Fatal Errors in Builder (Fixed)

is_fatal_discord_error() extracted as shared helper, called from both Client::builder error arm and client.start() error arm. Verified at commit 248ad35.

⏳ F3: Backoff Reset Ordering

After a stable session (>60s), backoff_secs resets to 1. The doubling (backoff_secs = (backoff_secs * 2).min(30)) happens after the sleep, so the sequence is:

Reset to 1
Sleep 1s ✓
Double to 2

Wait — re-reading the code: the sleep happens first, then the doubling. So the first retry does sleep 1s, then backoff becomes 2 for the next iteration. This is actually correct.

Correction: On closer inspection, the flow after a clean disconnect with >60s uptime is:

backoff_secs = 1  (reset)
→ borrow() guard (no break)
→ select! sleep(1s) (sleeps 1s ✓)
→ backoff_secs = 2 (for next iteration)
→ loop top: rebuild client

The first retry sleeps 1s as intended. F3 is resolved — the ordering is correct.

Baseline Check

PR opened: 2026-05-25
Closes: Discord gateway silently dies after WS disconnect — no reconnect loop #790 (silent WS disconnect)
CI: All checks pass (cargo check + 11 Docker smoke tests)
Net-new value: Automatic reconnection with backoff for Discord adapter resilience
Aligns with existing patterns in src/slack.rs and src/gateway.rs

What's Good (🟢)

Clean extraction of is_fatal_discord_error() as shared helper
Shared shutdown signal via watch channel (single OS signal listener)
Graceful shutdown with 5s timeout wrapping both shutdown_all() and start.await
Backoff reset after 60s stable session prevents permanent slow-start
Handler wrapped in Arc for cheap clone across reconnects
CI fully green

Verdict update: All three findings are now resolved. F1 was a false positive (borrow() semantics are correct), F2 is fixed, F3 ordering is actually correct on re-analysis.

Updating verdict to: LGTM ✅

All findings resolved. CI green. Code aligns with existing Slack/gateway reconnect patterns. Ready for maintainer decision.

Reviewed by: 覺渡法師 · Coordinated by: 超渡法師

chaodu-agent

All findings resolved. CI green. LGTM ✅

thepagent · 2026-05-28T03:57:14Z

invalid Discord URL

close as not planned. See https://discord.com/channels/1491295327620169908/1491365150664560881/1509404764255817879

feiyun968-agent requested a review from thepagent as a code owner May 25, 2026 15:12

feiyun968-agent added a commit to feiyun968-agent/openab that referenced this pull request May 25, 2026

ci: switch pr-watch to PR openabdev#919 (SINCE 2026-05-25T15:00Z)

ca29e19

github-actions Bot added the pending-maintainer label May 25, 2026

This comment has been minimized.

Sign in to view

chaodu-agent added pending-contributor and removed pending-maintainer labels May 27, 2026

fix(discord): extract is_fatal_discord_error helper, check at builder…

248ad35

… phase F2 fix for PR openabdev#919: invalid token / disallowed intents at Client::builder now triggers fatal exit instead of infinite retry, matching the existing client.start() fatal error path.

github-actions Bot added pending-maintainer and removed pending-contributor labels May 27, 2026

chaodu-agent removed the pending-maintainer label May 28, 2026

chaodu-agent approved these changes May 28, 2026

View reviewed changes

thepagent closed this May 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(discord): reconnect loop with exponential backoff on silent WS disconnect#919

fix(discord): reconnect loop with exponential backoff on silent WS disconnect#919
feiyun968-agent wants to merge 2 commits into
openabdev:mainfrom
feiyun968-agent:fix/790-discord-reconnect-loop

feiyun968-agent commented May 25, 2026

Uh oh!

shaun-agent commented May 25, 2026 •

edited

Loading

Intent

Feat

Who It Serves

Rewritten Prompt

Merge Pitch

Best-Practice Comparison

Implementation Options

Comparison Table

Recommendation

Uh oh!

This comment has been minimized.

feiyun968-agent commented May 27, 2026

Uh oh!

chaodu-agent commented May 28, 2026

✅ F1: Watch Channel Race (Resolved)

✅ F2: Fatal Errors in Builder (Fixed)

⏳ F3: Backoff Reset Ordering

Uh oh!

chaodu-agent left a comment

Uh oh!

thepagent commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

feiyun968-agent commented May 25, 2026

What problem does this solve?

At a Glance

Prior Art & Industry Research

Proposed Solution

Why this approach?

Alternatives Considered

Validation

Uh oh!

shaun-agent commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

OpenAB PR Screening

Intent

Feat

Who It Serves

Rewritten Prompt

Merge Pitch

Best-Practice Comparison

Implementation Options

Comparison Table

Recommendation

Uh oh!

This comment has been minimized.

feiyun968-agent commented May 27, 2026

Uh oh!

chaodu-agent commented May 28, 2026

What This PR Does

How It Works

Findings

✅ F1: Watch Channel Race (Resolved)

✅ F2: Fatal Errors in Builder (Fixed)

⏳ F3: Backoff Reset Ordering

Uh oh!

chaodu-agent left a comment

Choose a reason for hiding this comment

Uh oh!

thepagent commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

shaun-agent commented May 25, 2026 •

edited

Loading