fix(kiloclaw): replace provision locks with registry admission #3611
pandemicsyn wants to merge 7 commits into
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Files Reviewed (19 files): 4 incremental (latest commit), 15 previously reviewed.
Reviewed by claude-4.6-sonnet-20260217 · 4,814,190 tokens · Review guidance: REVIEW.md from base branch
    if (freshReservationAdmitted && provisionRegistry) {
      try {
        if (freshProviderWorkStarted) {
          await provisionRegistry.stub.failFreshProvision(
Moderate, Functional: When fresh provider work has started and then throws, this sets the reservation to failed_requires_reconciliation. That status counts as unresolved in the partial unique index, so it blocks every subsequent beginFreshProvision for this user.
For an ordinary transient provider failure (a Fly 500 or timeout) no Postgres row and no subscription get created, so the conflict-repair path at the top of the route finds no subscribed active instance to repair and returns provision_in_progress. The user is then locked out of retrying. The only automatic recovery is the instance DO's stale-provision cleanup alarm eventually destroying the half-provisioned DO, which releases the reservation via finalizeDestroyedInstance.
Two things worth confirming: that the stale-cleanup alarm reliably releases these, and on what timescale. Consider whether a failure before any infrastructure is created should releaseFreshProvision (retryable) rather than failFreshProvision (fail closed). Fail closed is right once infra may exist, but the window before the first resource is allocated is safely retryable.
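To make the suggestion concrete, here is a minimal sketch of the decision being proposed. It is not what the PR does. The two method names come from the diff, while `FreshProvisionProgress`, `firstResourceRequested`, and `chooseFailureAction` are hypothetical names introduced only for illustration.

```typescript
// Which registry call should settle a failed fresh provision.
// The two method names appear in this PR; everything else is illustrative.
type FailureAction = "releaseFreshProvision" | "failFreshProvision";

interface FreshProvisionProgress {
  reservationAdmitted: boolean;
  // Hypothetical flag: set immediately before the first provider call that
  // could leave a resource behind, not merely when provider work begins.
  firstResourceRequested: boolean;
}

function chooseFailureAction(progress: FreshProvisionProgress): FailureAction | null {
  if (!progress.reservationAdmitted) {
    return null; // no reservation was taken, so there is nothing to settle
  }
  if (!progress.firstResourceRequested) {
    return "releaseFreshProvision"; // nothing can exist yet: safe to retry
  }
  return "failFreshProvision"; // infrastructure may exist: fail closed
}
```

The point of the extra flag is only to separate "provider work started" from "a resource may exist". If those two moments are effectively the same in the route, the current behavior and this sketch coincide.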
Reviewed: this is intentional fail-closed behavior once provider work has started because infrastructure may already exist. The provisioned DO alarm runs on the idle cadence (30m + jitter) and destroy finalization releases the reservation after cleanup; no code change.
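For a rough sense of the timescale, the idle cadence mentioned above can be sketched as follows. Only the 30-minute base comes from this thread. The jitter bound (`ASSUMED_MAX_JITTER_MS`) and the function name are assumptions, and the thread does not say whether cleanup always happens on the first alarm.

```typescript
// 30 minutes, as stated in the reply above.
const IDLE_ALARM_BASE_MS = 30 * 60 * 1000;
// Assumption for illustration only: the real jitter bound is not stated.
const ASSUMED_MAX_JITTER_MS = 5 * 60 * 1000;

// Returns the epoch-millisecond time of the next idle alarm.
// `random` is injectable so the result can be made deterministic in tests.
function nextIdleAlarmAt(nowMs: number, random: () => number = Math.random): number {
  const jitterMs = Math.floor(random() * ASSUMED_MAX_JITTER_MS);
  return nowMs + IDLE_ALARM_BASE_MS + jitterMs;
}
```

Under these assumptions, a user whose reservation is in failed_requires_reconciliation would wait at least 30 minutes before a retry can be admitted.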
Summary
Replace provision locks with KiloClawRegistry admission reservations so fresh creates are serialized before provider allocation.
"Failed to release provision context lock", then repeated same-context kiloclaw.provision retries hung for about 120 seconds without reaching the Worker. Axiom showed this pattern again on 2026-05-31, including eight timeout-shaped retries after one successful create.
Verification
Visual Changes
N/A
Reviewer Notes
KiloClawRegistry is now a fresh-provision admission boundary rather than best-effort routing metadata.
provision_reservations state with a partial unique unresolved-admission guard.
provision_in_progress, provision_completion_pending, and 120-second timeout signals after rollout.
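As an aid for reviewers, the unresolved-admission guard can be modeled in memory as below. This is a sketch of the semantics described in the review thread, not the registry's actual storage code. The status names `admitted` and `released`, the class name, and all signatures are assumptions. Only `failed_requires_reconciliation`, `provision_in_progress`, and the three method names appear in this PR. Returning `provision_in_progress` directly from `beginFreshProvision` is a simplification of the route's conflict handling.

```typescript
type ReservationStatus =
  | "admitted" // assumed name for a live, in-flight reservation
  | "failed_requires_reconciliation" // named in the review thread
  | "released"; // assumed name for a resolved reservation

// Statuses covered by the partial unique guard: at most one per user.
const UNRESOLVED: ReadonlySet<ReservationStatus> = new Set<ReservationStatus>([
  "admitted",
  "failed_requires_reconciliation",
]);

interface Reservation {
  userId: string;
  status: ReservationStatus;
}

type BeginResult =
  | { ok: true; reservation: Reservation }
  | { ok: false; code: "provision_in_progress" };

class InMemoryProvisionReservations {
  private rows: Reservation[] = [];

  // Admit a fresh provision only if the user has no unresolved reservation.
  beginFreshProvision(userId: string): BeginResult {
    const blocked = this.rows.some(
      (row) => row.userId === userId && UNRESOLVED.has(row.status),
    );
    if (blocked) {
      return { ok: false, code: "provision_in_progress" };
    }
    const reservation: Reservation = { userId, status: "admitted" };
    this.rows.push(reservation);
    return { ok: true, reservation };
  }

  // Fail closed: the reservation stays unresolved and keeps blocking.
  failFreshProvision(userId: string): void {
    this.transition(userId, "failed_requires_reconciliation");
  }

  // Resolve the reservation so the next begin can be admitted.
  releaseFreshProvision(userId: string): void {
    this.transition(userId, "released");
  }

  private transition(userId: string, next: ReservationStatus): void {
    for (const row of this.rows) {
      if (row.userId === userId && UNRESOLVED.has(row.status)) {
        row.status = next;
      }
    }
  }
}
```

In this model a failed reservation blocks the same user until something releases it, while other users are unaffected. That is the lockout window raised in the review comment.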