Skip to content

fix(kiloclaw): replace provision locks with registry admission#3611

Open
pandemicsyn wants to merge 7 commits into
mainfrom
fix/kiloclaw-registry-provision-admission
Open

fix(kiloclaw): replace provision locks with registry admission#3611
pandemicsyn wants to merge 7 commits into
mainfrom
fix/kiloclaw-registry-provision-admission

Conversation

@pandemicsyn
Copy link
Copy Markdown
Contributor

@pandemicsyn pandemicsyn commented May 31, 2026

Summary

  • Replace the unsafe Vercel/Postgres advisory-lock provisioning flow with durable KiloClawRegistry admission reservations so fresh creates are serialized before provider allocation.
  • The change addresses recurring production failures where a successful provision logged Failed to release provision context lock, then repeated same-context kiloclaw.provision retries hung for about 120 seconds without reaching the Worker. Axiom showed this pattern again on 2026-05-31, including eight timeout-shaped retries after one successful create.
  • Add atomic unresolved-reservation enforcement, canonical active/subscription reconciliation, completion repair, destroy fencing, and alarm-retry cleanup so partially successful or failed provisions cannot create duplicate infrastructure or revive destroyed routes.
  • Remove the web advisory-lock helper and map Worker admission/finalization conflicts into user-facing tRPC conflicts while preserving existing-instance update and subscription-recovery behavior.

Verification

Visual Changes

N/A

Reviewer Notes

  • This is a cross-service correctness change: KiloClawRegistry is now a fresh-provision admission boundary rather than best-effort routing metadata.
  • The Durable SQLite migration adds non-routable provision_reservations state with a partial unique unresolved-admission guard.
  • The Worker/web cutover is intentionally shipped together under the accepted assumption that Worker deployment completes before the slower Vercel deployment becomes routable; monitor Axiom for new provision_in_progress, provision_completion_pending, and 120-second timeout signals after rollout.

Comment thread services/kiloclaw/src/routes/platform.ts Outdated
Comment thread apps/web/src/routers/kiloclaw-router.ts Outdated
Comment thread services/kiloclaw/src/durable-objects/kiloclaw-instance/index.ts Outdated
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 31, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

Latest commit (76b4aef) completes the fix set: adds 404/instance_not_found error mapping to the provision error handler, fixes registry cleanup to target only org:{orgId} (not both user and org) for instance-keyed org instances, and adds explanatory comments to the registry repair-path guard. All previously flagged issues are fully resolved.

Resolved Issues
File Issue Status
services/kiloclaw/src/routes/platform.ts instanceMarkedDestroyed = false in failure path caused failFreshProvision on already-destroyed instance FIXED
services/kiloclaw/src/durable-objects/kiloclaw-instance/index.ts !releaseProvisionReservation path produced indefinite 60s alarm loop FIXED
apps/web/src/routers/kiloclaw-router.ts handleProvisionError duplicated verbatim in org router FIXED
services/kiloclaw/src/durable-objects/kiloclaw-registry.ts Early return in quarantine branch skipped this.migrated = true, causing unbounded Postgres retries on every cooldown cycle FIXED
apps/web/src/lib/kiloclaw/provision-error-handler.ts Only mapped 409 and 503 responses, leaving 404 instance_not_found unmapped (rethrown as unhandled) FIXED
Files Reviewed (19 files)

Incremental (latest commit — 4 files):

  • apps/web/src/lib/kiloclaw/provision-error-handler.ts — adds 404/instance_not_found mapping, no issues
  • apps/web/src/routers/kiloclaw-billing-router.test.ts — adds tests for 409 and 404 error mappings, no issues
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/index.ts — fixes registry key scope for org instances, no issues
  • services/kiloclaw/src/durable-objects/kiloclaw-registry.ts — adds explanatory comments to repair-path guard, no issues

Previously reviewed (15 files):

  • services/kiloclaw/AGENTS.md
  • apps/web/src/lib/kiloclaw/types.ts
  • services/kiloclaw/drizzle/0001_chubby_slipstream.sql
  • services/kiloclaw/drizzle/meta/0001_snapshot.json
  • services/kiloclaw/drizzle/meta/_journal.json
  • services/kiloclaw/drizzle/migrations.js
  • services/kiloclaw/src/db/index.ts
  • services/kiloclaw/src/db/sqlite-schema.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-instance.test.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-registry.test.ts
  • services/kiloclaw/src/durable-objects/kiloclaw-registry.ts
  • services/kiloclaw/src/routes/platform-provision-bootstrap.test.ts
  • services/kiloclaw/src/routes/platform.ts
  • apps/web/src/routers/kiloclaw-router.ts
  • apps/web/src/lib/kiloclaw/provision-error-handler.ts

Reviewed by claude-4.6-sonnet-20260217 · 4,814,190 tokens

Review guidance: REVIEW.md from base branch main

Comment thread services/kiloclaw/src/durable-objects/kiloclaw-registry.ts
if (freshReservationAdmitted && provisionRegistry) {
try {
if (freshProviderWorkStarted) {
await provisionRegistry.stub.failFreshProvision(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moderate, Functional: When fresh provider work has started and then throws, this sets the reservation to failed_requires_reconciliation. That status counts as unresolved in the partial unique index, so it blocks every subsequent beginFreshProvision for this user.

For an ordinary transient provider failure (a Fly 500 or timeout) no Postgres row and no subscription get created, so the conflict-repair path at the top of the route finds no subscribed active instance to repair and returns provision_in_progress. The user is then locked out of retrying. The only automatic recovery is the instance DO's stale-provision cleanup alarm eventually destroying the half-provisioned DO, which releases the reservation via finalizeDestroyedInstance.

Two things worth confirming: that the stale-cleanup alarm reliably releases these, and on what timescale. Consider whether a failure before any infrastructure is created should releaseFreshProvision (retryable) rather than failFreshProvision (fail closed). Fail closed is right once infra may exist, but the window before the first resource is allocated is safely retryable.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed: this is intentional fail-closed behavior once provider work has started because infrastructure may already exist. The provisioned DO alarm runs on the idle cadence (30m + jitter) and destroy finalization releases the reservation after cleanup; no code change.

Comment thread services/kiloclaw/src/durable-objects/kiloclaw-instance/index.ts
Comment thread services/kiloclaw/src/durable-objects/kiloclaw-registry.ts
Comment thread apps/web/src/lib/kiloclaw/provision-error-handler.ts Outdated
@pandemicsyn pandemicsyn requested a review from St0rmz1 June 1, 2026 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants