Problem Statement
With CAS optimistic concurrency merged (PR #1292), the persistence layer prevents lost updates when multiple writers mutate the same object. However, the gateway's reconciler loop — which drives sandbox lifecycle state transitions — still runs on every replica. In an HA deployment, this produces duplicate work, N-way CAS contention on every reconcile sweep, and wasted compute driver RPCs. A single reconciler lease ensures only one replica runs background coordination at a time.
Supervisor session ownership and inter-replica session forwarding are out of scope for this issue and will be addressed separately.
Changes from Original Spike
This consolidated design resolves all open questions from the original spike and incorporates review feedback. Key corrections from codebase validation:
- updated_at_ms is application-side: both the SQLite and Postgres adapters call openshell_core::time::now_ms() (SystemTime::now()), not a database-side function. Clock skew is acknowledged as acceptable with a 30s TTL.
- Shutdown uses a tokio::sync::watch channel, not CancellationToken (tokio_util is not a dependency).
- Store::is_single_replica() does not exist and must be added as a simple enum variant check.
- No per-replica identity exists: gateway_id is shared across replicas (JWT issuer). A unique replica ID is generated at startup.
- Background tasks are fire-and-forget, with no JoinSet or centralized tracking. The lease coordinator manages its own lifecycle.
- Lease payload is raw JSON bytes via put_if(&[u8]): no protobuf, no ObjectType trait.
- Health endpoints are static; lease holder status is Phase 3 observability work.
Technical Context
The gateway reconciler operates through two concurrent loops spawned at startup: a watch loop that consumes real-time events from the compute driver's WatchSandboxes stream, and a reconcile loop that runs a full store-vs-backend sweep every 60 seconds. Both loops acquire a process-local sync_lock Mutex before mutating sandbox state — a guard that is explicitly documented as not HA-safe (references issue #1255).
Both loops are spawned as fire-and-forget tokio::spawn tasks from spawn_watchers() (compute/mod.rs:558). They run until the process exits — there is no structured cancellation or graceful shutdown coordination for these tasks today. The lease coordinator must manage their lifecycle.
All sandbox store mutations go through update_message_cas with expected_version=0 (server-driven CAS), which means the database resolves concurrent writes correctly. But without lease-based ownership, every replica does redundant work: re-fetching sandbox state from the driver, computing phase transitions, and attempting CAS writes that only one replica can win.
Why a single reconciler lease is sufficient
The reconciler is a background consistency-repair mechanism, not the hot path. It covers the 60-second periodic sweep and the watch event processing loop. All replicas still serve gRPC requests (create, delete, update sandboxes), and supervisor sessions still land on whichever replica the TCP connection reaches.
A single reconciler lease breaks down when:
- Sweep duration exceeds the interval. Each reconciled sandbox costs roughly one GetSandbox driver RPC (~5-10ms). Assuming the calls run sequentially, a 60s interval is only exhausted at roughly 6,000 sandboxes (at 10ms each) to 12,000 (at 5ms each), well beyond initial HA deployments.
- Reconciler needs session locality. If reconciliation ever requires talking to the supervisor (not just the compute driver and store), it would benefit from running on the session-owning replica. Today it doesn't.
- Failover gap. If the lease holder dies, reconciliation pauses for the TTL duration (~30s). gRPC-initiated mutations continue working on all replicas via CAS. The reconciler catches up stale state — a 30s gap is acceptable.
Per-sandbox or shard-based leases are a future optimization if sandbox counts grow into the thousands. The single-lease model avoids O(N) lease records, lease rebalancing, and unnecessary complexity.
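The sandbox-count threshold above is the sweep interval divided by the per-sandbox RPC latency. A minimal sketch of that arithmetic follows; the function name is hypothetical, and it assumes strictly sequential GetSandbox calls with no other per-sandbox cost.

```rust
use std::time::Duration;

/// Upper bound on the sandboxes one sweep can refresh before the next sweep is due.
/// Assumption: GetSandbox calls run strictly one after another and dominate the cost.
fn max_sandboxes_per_sweep(interval: Duration, per_rpc: Duration) -> u128 {
    if per_rpc.is_zero() {
        // A zero-cost RPC never exhausts the interval.
        return u128::MAX;
    }
    interval.as_nanos() / per_rpc.as_nanos()
}
```

With the figures quoted above, the bound is 6,000 sandboxes at 10ms per call and 12,000 at 5ms per call.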
This approach is consistent with RFC 0001's intent. The RFC rejects a "singleton controller" where one replica handles all control-plane responsibilities (reconciliation, session ownership, relay coordination, and client requests). A single reconciler lease is narrower: it only scopes background sweeps, while gRPC serving and session handling remain distributed across all replicas.
Affected Components
| Component | Key Files | Role |
|---|---|---|
| Compute runtime | crates/openshell-server/src/compute/mod.rs | Reconciler loops, sandbox state machine, driver interaction |
| Persistence layer | crates/openshell-server/src/persistence/mod.rs, sqlite.rs, postgres.rs | CAS primitives, object storage, timestamp handling |
| Server startup | crates/openshell-server/src/lib.rs | Gateway initialization, shutdown signaling, replica identity |
| Time utilities | crates/openshell-core/src/time.rs | now_ms(): application-side wall clock for updated_at_ms |
Technical Investigation
Architecture Overview
The compute subsystem (ComputeRuntime) is the gateway's sandbox lifecycle engine. It owns:
- Watch loop (compute/mod.rs:715): Opens a streaming WatchSandboxes RPC to the compute driver. Events include sandbox status updates, deletions, and platform events. Each event triggers a CAS read-modify-write on the store record.
- Reconcile loop (compute/mod.rs:753): Runs every 60 seconds (RECONCILE_INTERVAL). Lists all sandboxes from both the driver (ListSandboxes) and the store, then reconciles discrepancies. Records not updated since the sweep started are refreshed via GetSandbox. Orphaned store records (no backend resource) are pruned after a 300-second grace period (ORPHAN_GRACE_PERIOD).
The sync_lock Mutex (compute/mod.rs:231,278-283) serializes all sandbox mutations within a single gateway process. Its comment explicitly notes this is insufficient for HA and references issue #1255. The CAS branch (#1292) added database-level concurrency control as the foundation for removing this process-local guard.
Code References
| Location | Description |
|---|---|
| compute/mod.rs:220 | ComputeRuntime struct: holds driver, store, session registry, sync_lock |
| compute/mod.rs:231,278-283 | sync_lock Mutex: documented as not HA-safe |
| compute/mod.rs:558 | spawn_watchers(): launches both background loops |
| compute/mod.rs:715 | watch_loop(): driver event stream consumer |
| compute/mod.rs:753 | reconcile_loop(): 60s periodic sweep |
| compute/mod.rs:762 | reconcile_store_with_backend(): core reconcile logic |
| compute/mod.rs:848 | apply_sandbox_update_locked(): read-modify-write with CAS |
| compute/mod.rs:1132 | reconcile_snapshot_sandbox(): per-sandbox reconcile with staleness guard |
| compute/mod.rs:1162 | prune_missing_sandbox(): orphan cleanup |
| persistence/mod.rs:96 | WriteCondition: MustCreate / MatchResourceVersion / Unconditional |
| persistence/mod.rs:181 | Store::put_if(): CAS write, accepts &[u8] payload |
| persistence/mod.rs:215 | Store::delete_if(): CAS delete |
| persistence/mod.rs:477 | Store::update_message_cas(): read-modify-write helper |
| persistence/mod.rs:80 | ObjectRecord: includes updated_at_ms, resource_version, payload |
| lib.rs:341 | state.compute.spawn_watchers(): startup call |
| lib.rs:419 | let (shutdown_tx, shutdown_rx) = watch::channel(false): shutdown signal |
| lib.rs:445 | state.compute.cleanup_on_shutdown(): driver cleanup on exit |
Current Behavior
Reconcile flow:
- reconcile_store_with_backend() calls ListSandboxes on the driver to get all backend sandbox IDs
- For each backend sandbox: acquire sync_lock, read store record, skip if recently updated, re-fetch from driver via GetSandbox, apply state merge via apply_sandbox_update_locked
- For each store record with no backend match: wait for 300s grace period, double-check via GetSandbox, prune if confirmed missing
- State merge (apply_sandbox_update_locked) derives phase from driver conditions, checks supervisor session presence (in-memory registry), and writes via update_message_cas with expected_version=0
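The first three steps of the flow amount to a set comparison between backend IDs and store IDs. The sketch below illustrates that split under stated assumptions: SweepPlan, plan_sweep, and grace_elapsed are hypothetical names, and measuring the grace period from a missing_since_ms timestamp is an assumption, since the flow above does not say which timestamp the real code uses.

```rust
use std::collections::BTreeSet;

/// Hypothetical result of comparing backend sandbox IDs with store record IDs.
#[derive(Debug, PartialEq)]
struct SweepPlan {
    /// Present in the backend: candidates for a GetSandbox refresh.
    refresh: Vec<String>,
    /// Present only in the store: orphan candidates, pruned only after the grace period.
    orphan_candidates: Vec<String>,
}

/// Split IDs into "refresh" and "orphan candidate" groups, sorted for stable output.
fn plan_sweep(backend_ids: &[&str], store_ids: &[&str]) -> SweepPlan {
    let backend: BTreeSet<&str> = backend_ids.iter().copied().collect();
    let store: BTreeSet<&str> = store_ids.iter().copied().collect();
    SweepPlan {
        refresh: backend.iter().map(|id| id.to_string()).collect(),
        orphan_candidates: store.difference(&backend).map(|id| id.to_string()).collect(),
    }
}

/// True once a record has been missing from the backend for the whole grace period.
/// Assumption: the clock starts at a hypothetical `missing_since_ms` timestamp.
fn grace_elapsed(now_ms: u64, missing_since_ms: u64, grace_ms: u64) -> bool {
    now_ms.saturating_sub(missing_since_ms) >= grace_ms
}
```

The staleness skip and the GetSandbox double-check are left out; they would apply per entry of each group.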
Phase transitions driven by the reconciler:
| Trigger | From | To | Path |
|---|---|---|---|
| Driver reports Ready=True | Provisioning | Ready | watch/reconcile loop |
| Driver reports terminal failure | Provisioning | Error | watch/reconcile loop |
| Driver reports deleting=true | Any | Deleting | watch/reconcile loop |
| Backend resource gone (after grace) | Any | Deleted | reconcile loop |
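The transition table can be read as a pure function from the current phase and a driver observation to the next phase. The sketch below is illustrative only: the Phase and Observation enums are hypothetical, and both the precedence between rows and the "otherwise unchanged" fallback are assumptions not stated in the table.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Phase {
    Provisioning,
    Ready,
    Error,
    Deleting,
    Deleted,
}

/// Hypothetical summary of what the driver reported for one sandbox.
#[derive(Debug, Clone, Copy)]
enum Observation {
    Ready,
    TerminalFailure,
    Deleting,
    /// Backend resource is missing and the orphan grace period has elapsed.
    GoneAfterGrace,
}

/// Apply one row of the transition table.
fn next_phase(current: Phase, obs: Observation) -> Phase {
    match (current, obs) {
        // "Any" rows apply regardless of the current phase.
        (_, Observation::GoneAfterGrace) => Phase::Deleted,
        (_, Observation::Deleting) => Phase::Deleting,
        // Rows that only apply while provisioning.
        (Phase::Provisioning, Observation::Ready) => Phase::Ready,
        (Phase::Provisioning, Observation::TerminalFailure) => Phase::Error,
        // Assumption: combinations the table does not list leave the phase unchanged.
        (phase, _) => phase,
    }
}
```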
Resolved Design Decisions
1. Watch loop placement: holder-only
Only the lease holder consumes the WatchSandboxes driver stream. Non-holder replicas do not watch.
Rationale: gRPC handlers read-through to the store (get_message, list_messages) — they never rely on an in-memory index populated by the watch loop. The watch loop exists solely to trigger reconciliation state transitions (phase changes, condition updates). Running it on non-holders would double the driver stream load with no benefit.
2. sync_lock: keep as defense-in-depth
The process-local sync_lock Mutex stays on all replicas. It serializes mutations within a single process. CAS is the cross-replica concurrency control.
Rationale: Removing the Mutex would require adding CAS-retry loops to every mutation site. The Mutex prevents intra-process races (e.g., a gRPC DeleteSandbox handler racing the reconcile loop on the same replica), while CAS prevents inter-replica races. The lease reduces CAS contention by ensuring only one replica runs background sweeps, but gRPC-initiated mutations can still race the holder's reconciler within the same process.
3. Replica identity: HOSTNAME with UUID fallback
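fn replica_id() -> String {
    // Explicit override first, then the platform-provided hostname,
    // then a random UUID as a last resort.
    std::env::var("OPENSHELL_REPLICA_ID")
        .or_else(|_| std::env::var("HOSTNAME"))
        .unwrap_or_else(|_| uuid::Uuid::new_v4().to_string())
}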
Rationale: Kubernetes sets HOSTNAME to the pod name, Docker sets it to the container ID, and systemd units inherit the machine hostname. This gives operators stable, debuggable lease holder identity in logs. OPENSHELL_REPLICA_ID allows explicit override. UUID fallback handles edge cases. The gateway_id field is intentionally not used — it's shared across replicas for JWT issuer identity and would not distinguish holders.
4. Lease record schema: JSON in objects table
The lease is a lightweight record in the existing objects table, not a protobuf message:
| Column | Value |
|---|---|
| object_type | "reconciler_lease" |
| id | "singleton" |
| name | "reconciler-lease" |
| payload | JSON: {"holder": "<replica_id>", "acquired_at_ms": <ms>} |
| resource_version | Used for CAS operations |
| updated_at_ms | Application-side timestamp (the TTL clock) |
No proto definition needed. put_if accepts &[u8] payload (persistence/mod.rs:186), so serde_json::to_vec output works directly. No ObjectType trait implementation — the lease module calls Store::put_if and Store::get with raw object_type strings.
5. Timestamps and clock skew: application-side, acknowledged
TTL expiry is computed from updated_at_ms (written by the holder's openshell_core::time::now_ms() at renewal time) compared against the stealer's now_ms() at read time. Both are application-side SystemTime::now() calls.
In an HA deployment, clock skew between replicas means a fast-clock replica could see a lease as expired before the holder considers it due for renewal. With a 30s TTL, 10s renewal interval, and NTP-synced hosts (typical skew <1-2ms), this is not a practical concern. Worst case: a healthy holder lets the record age at most one renewal interval (10s), so a stealer whose clock runs more than 20s ahead (30s TTL minus 10s) can see the lease as expired and steal it early. That produces a brief dual-holder window in which CAS still ensures exactly one writer wins each mutation.
A database-side timestamp function (Postgres NOW()) was considered but rejected: the current updated_at_ms implementation is application-side in both backends, and changing only the lease path would introduce inconsistency. A future improvement could move all timestamps to database-side, but that's orthogonal to this feature.
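The 20s skew threshold falls out of the expiry check itself. The following sketch, with hypothetical names and i64 millisecond timestamps assumed for illustration, shows the comparison a stealer would run and the skew at which a lease renewed on schedule can first look expired.

```rust
const LEASE_TTL_MS: i64 = 30_000;
const LEASE_RENEWAL_INTERVAL_MS: i64 = 10_000;

/// Expiry check as a stealer would run it: its own clock against the
/// timestamp the holder wrote at its last renewal.
fn is_expired(stealer_now_ms: i64, updated_at_ms: i64) -> bool {
    stealer_now_ms - updated_at_ms >= LEASE_TTL_MS
}

/// Skew (stealer clock ahead of holder clock) at which a lease renewed on
/// schedule can first appear expired. A healthy holder lets the record age
/// at most one renewal interval, so the remaining margin is TTL minus that.
fn min_skew_for_early_steal_ms() -> i64 {
    LEASE_TTL_MS - LEASE_RENEWAL_INTERVAL_MS
}
```

For a record written 9,999ms ago in true time, a stealer 20,000ms ahead computes an age of 29,999ms and leaves the lease alone; one more millisecond of skew tips it over the TTL.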
6. Renewal-loss semantics: CAS is the safety net
Invariant: The lease is an optimization, not a correctness mechanism. CAS is the correctness mechanism.
If a lease holder loses its lease (renewal fails) while a CAS mutation is in-flight:
- The in-flight write either succeeds (it was the only writer) or fails with Conflict (the new holder wrote first). Either outcome is correct.
- The old holder detects the renewal failure, stops its loops, and re-enters standby.
- There is a brief window where two replicas may attempt mutations. CAS ensures exactly one wins.
This is explicitly tested.
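The dual-holder window can be illustrated with a toy versioned record. This is not the real Store or update_message_cas; Record, CasError, and put_if_version are hypothetical stand-ins that only model the version check.

```rust
#[derive(Debug, PartialEq)]
enum CasError {
    Conflict,
}

/// Toy single-record store with a version counter (not the real `Store`).
struct Record {
    version: u64,
    payload: String,
}

impl Record {
    /// Write only if the caller's expected version still matches the stored one.
    fn put_if_version(&mut self, expected: u64, payload: &str) -> Result<u64, CasError> {
        if self.version != expected {
            return Err(CasError::Conflict);
        }
        self.version += 1;
        self.payload = payload.to_string();
        Ok(self.version)
    }
}
```

If the old and new holder both read version 7 and then write, the first write moves the record to version 8 and the second fails with Conflict, whichever replica gets there first.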
Proposed Approach
Lease Operations
All operations use existing Store primitives — no new persistence API required:
- Acquire: store.put_if("reconciler_lease", "singleton", "reconciler-lease", &payload, None, WriteCondition::MustCreate). Atomic insert; fails with UniqueViolation if held.
- Renew: store.put_if("reconciler_lease", "singleton", "reconciler-lease", &payload, None, WriteCondition::MatchResourceVersion(v)). CAS update with fresh timestamp.
- Release: store.delete_if("reconciler_lease", "singleton", resource_version). CAS delete on graceful shutdown.
- Steal expired: store.get("reconciler_lease", "singleton"), check now_ms() - record.updated_at_ms >= TTL, then put_if with MatchResourceVersion(record.resource_version).
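The four operations can be exercised end to end against an in-memory stand-in. The sketch below is a toy model, not the real Store API: MemStore holds only the singleton lease row (so the object_type, id, and name arguments are dropped), the clock is passed in explicitly, and the error names other than UniqueViolation and Conflict are assumptions.

```rust
#[derive(Debug, PartialEq)]
enum StoreError {
    UniqueViolation,
    Conflict,
    NotFound,
}

enum WriteCondition {
    MustCreate,
    MatchResourceVersion(u64),
}

struct LeaseRecord {
    holder: String,
    resource_version: u64,
    updated_at_ms: i64,
}

/// Toy store holding only the singleton lease row.
#[derive(Default)]
struct MemStore {
    lease: Option<LeaseRecord>,
}

impl MemStore {
    /// Acquire (MustCreate) or renew/steal (MatchResourceVersion).
    fn put_if(&mut self, holder: &str, now_ms: i64, cond: WriteCondition) -> Result<u64, StoreError> {
        match cond {
            WriteCondition::MustCreate => {
                if self.lease.is_some() {
                    return Err(StoreError::UniqueViolation);
                }
                self.lease = Some(LeaseRecord {
                    holder: holder.to_string(),
                    resource_version: 1,
                    updated_at_ms: now_ms,
                });
                Ok(1)
            }
            WriteCondition::MatchResourceVersion(expected) => {
                let rec = self.lease.as_mut().ok_or(StoreError::NotFound)?;
                if rec.resource_version != expected {
                    return Err(StoreError::Conflict);
                }
                rec.holder = holder.to_string();
                rec.resource_version += 1;
                rec.updated_at_ms = now_ms;
                Ok(rec.resource_version)
            }
        }
    }

    /// Release: delete only if the caller still holds the current version.
    fn delete_if(&mut self, expected: u64) -> Result<(), StoreError> {
        match self.lease.as_ref().map(|r| r.resource_version) {
            Some(v) if v == expected => {
                self.lease = None;
                Ok(())
            }
            Some(_) => Err(StoreError::Conflict),
            None => Err(StoreError::NotFound),
        }
    }
}

/// Steal expired: read, check age against the TTL, then CAS on the version read.
/// Returns Ok(None) when the lease is still fresh.
fn try_steal_expired(
    store: &mut MemStore,
    me: &str,
    now_ms: i64,
    ttl_ms: i64,
) -> Result<Option<u64>, StoreError> {
    let (version, updated_at_ms) = match store.lease.as_ref() {
        Some(r) => (r.resource_version, r.updated_at_ms),
        None => return Err(StoreError::NotFound),
    };
    if now_ms - updated_at_ms < ttl_ms {
        return Ok(None);
    }
    store
        .put_if(me, now_ms, WriteCondition::MatchResourceVersion(version))
        .map(Some)
}
```

Once a steal succeeds, the previous holder's next renewal carries a stale version and fails with Conflict, which is the signal to stop its loops.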
Lease Lifecycle
        +-------------+
        |   Standby   |<---- all replicas start here
        +------+------+
               | try acquire (MustCreate)
               v
    +----- succeeded? -----+
    | yes                  | no
    v                      v
+------------+      sleep(ACQUIRE_INTERVAL)
|   Holder   |             |
|            |             +---> back to Standby
| - watch    |
| - reconcile|
| - renew    |
+-----+------+
      | renew fails OR shutdown signal
      v
+--------------+
| Release/Stop |
|              |
|  delete_if() |
+------+-------+
       |
       +---> back to Standby (or exit on shutdown)
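The lifecycle can also be written as a small state machine. The sketch below uses hypothetical State and Event enums; the Standby-to-Exited transition on shutdown and the "ignore anything else" fallback are assumptions that the diagram does not spell out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Standby,
    Holder,
    Releasing,
    Exited,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    AcquireSucceeded,
    AcquireFailed,
    RenewFailed,
    Shutdown,
    /// The release step finished; `shutting_down` decides where to go next.
    Released { shutting_down: bool },
}

/// One transition of the lease lifecycle.
fn step(state: State, event: Event) -> State {
    match (state, event) {
        (State::Standby, Event::AcquireSucceeded) => State::Holder,
        // Failed attempt: sleep for the acquire interval, then try again.
        (State::Standby, Event::AcquireFailed) => State::Standby,
        // Assumption: a standby replica has nothing to release on shutdown.
        (State::Standby, Event::Shutdown) => State::Exited,
        (State::Holder, Event::RenewFailed | Event::Shutdown) => State::Releasing,
        (State::Releasing, Event::Released { shutting_down: true }) => State::Exited,
        (State::Releasing, Event::Released { shutting_down: false }) => State::Standby,
        // Assumption: events that do not apply in the current state are ignored.
        (s, _) => s,
    }
}
```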
Timing Parameters
| Parameter | Value | Rationale |
|---|---|---|
| LEASE_TTL | 30s | 3x renewal interval. Long enough to survive transient DB hiccups. Short enough that failover completes before the next reconcile sweep would have run. |
| LEASE_RENEWAL_INTERVAL | 10s | Renew 3 times per TTL. Missing one renewal is not fatal; missing three consecutive means the holder is likely dead. |
| LEASE_ACQUIRE_INTERVAL | 5s | Standby replicas poll every 5s. On holder death, worst-case failover is TTL (30s) + acquire interval (5s) = 35s. |
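The relationships between these parameters can be checked with a few lines of arithmetic. The constant names follow the table; RECONCILE_INTERVAL is the 60s sweep interval described earlier, and the two helper functions are hypothetical.

```rust
use std::time::Duration;

const LEASE_TTL: Duration = Duration::from_secs(30);
const LEASE_RENEWAL_INTERVAL: Duration = Duration::from_secs(10);
const LEASE_ACQUIRE_INTERVAL: Duration = Duration::from_secs(5);
const RECONCILE_INTERVAL: Duration = Duration::from_secs(60);

/// Worst case after the holder dies: the record must age past the TTL, and a
/// standby may have polled just before that, so it waits one more interval.
fn worst_case_failover() -> Duration {
    LEASE_TTL + LEASE_ACQUIRE_INTERVAL
}

/// Number of renewal attempts that fit in one TTL.
fn renewals_per_ttl() -> u64 {
    LEASE_TTL.as_secs() / LEASE_RENEWAL_INTERVAL.as_secs()
}
```

The 35s worst case stays below the 60s reconcile interval, which is the property the TTL rationale relies on.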
Shutdown Coordination
The lease coordinator hooks into the existing tokio::sync::watch shutdown channel (lib.rs:419). When the shutdown signal fires:
- The coordinator cancels the watch and reconcile loops (via JoinHandle::abort() or an internal watch channel).
- It calls lease.release() to delete the lease record, allowing immediate takeover by a standby replica.
- It returns, allowing the existing cleanup_on_shutdown() sequence to proceed.
This avoids adding tokio_util as a dependency. The existing shutdown pattern (watch::channel(false) -> shutdown_tx.send(true)) is reused.
Implementation Plan
Phase 1: Lease primitives
New file: crates/openshell-server/src/compute/lease.rs
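// API sketch: signatures only, bodies omitted.
pub struct ReconcilerLease {
    store: Arc<Store>,
    replica_id: String,
    ttl: Duration,
}

impl ReconcilerLease {
    /// Insert the lease record (MustCreate); fails if one already exists.
    pub async fn try_acquire(&self) -> Result<LeaseGuard, LeaseError>;
    /// Take over a record whose updated_at_ms is older than the TTL.
    pub async fn try_steal_expired(&self) -> Result<LeaseGuard, LeaseError>;
    /// Combines try_acquire and try_steal_expired for the standby loop.
    pub async fn acquire_or_steal(&self) -> Result<LeaseGuard, LeaseError>;
    /// CAS-update the record with a fresh timestamp.
    pub async fn renew(&self, guard: &mut LeaseGuard) -> Result<(), LeaseError>;
    /// CAS-delete the record on graceful shutdown.
    pub async fn release(&self, guard: LeaseGuard) -> Result<(), LeaseError>;
}

/// Carries the resource version needed for the next CAS operation.
pub struct LeaseGuard {
    pub resource_version: u64,
}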
Add to persistence: Store::is_single_replica() — returns true for Store::Sqlite, false for Store::Postgres.
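A minimal sketch of that variant check is shown below. The real Store variants presumably wrap backend handles; SqliteStore and PostgresStore here are empty placeholder types assumed for illustration.

```rust
/// Placeholder backend types; the real ones are assumed to carry connection state.
struct SqliteStore;
struct PostgresStore;

enum Store {
    Sqlite(SqliteStore),
    Postgres(PostgresStore),
}

impl Store {
    /// SQLite is file-local, so only one gateway process can be using it.
    fn is_single_replica(&self) -> bool {
        matches!(self, Store::Sqlite(_))
    }
}
```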
Phase 2: Lease-gated reconciler
Modify: crates/openshell-server/src/compute/mod.rs
Replace spawn_watchers():
pub fn spawn_watchers(&self, shutdown_rx: watch::Receiver<bool>) {
    if self.store.is_single_replica() {
        // SQLite: run unconditionally, no lease needed
        self.spawn_watch_loop();
        self.spawn_reconcile_loop();
        return;
    }
    // Postgres HA: lease-gated
    self.spawn_lease_coordinator(shutdown_rx);
}
The lease coordinator:
- Runs a standby acquisition loop.
- On acquisition, spawns the watch and reconcile loops as child tasks.
- Runs a renewal task alongside them.
- On renewal failure or shutdown signal, aborts child tasks, releases the lease, and re-enters standby (or exits on shutdown).
Modify: crates/openshell-server/src/lib.rs — pass shutdown_rx to spawn_watchers().
Phase 3: Observability
- Log lease acquisition, renewal, loss, and release at info level with replica_id and holder fields.
- Log standby acquisition attempts at debug level.
- Extend health endpoint to report lease holder status (optional, non-blocking).
Files Changed
| File | Change |
|---|---|
| crates/openshell-server/src/compute/lease.rs | New: lease primitives |
| crates/openshell-server/src/compute/mod.rs | Modify spawn_watchers() to accept shutdown_rx and use lease coordinator |
| crates/openshell-server/src/persistence/mod.rs | Add is_single_replica() method to Store |
| crates/openshell-server/src/lib.rs | Pass shutdown_rx to spawn_watchers(), generate replica ID |
Test Considerations
- Lease acquisition concurrency: Spawn N tasks attempting to acquire the singleton lease simultaneously. Assert exactly 1 succeeds (MustCreate) and N-1 get UniqueViolation. Follow the pattern in persistence/tests.rs CAS concurrency tests.
- Lease renewal and expiry: Test that renewal extends TTL, that expired leases can be stolen, and that active leases cannot be stolen.
- Renewal-loss during mutation: Simulate lease loss while a CAS mutation is in-flight. Assert exactly one write succeeds.
- Gated reconciler: Test that a replica with the lease runs reconcile/watch loops and a replica without the lease does not mutate sandbox state.
- Failover simulation: Test lease expiry -> standby acquisition -> reconciler resumes on new holder.
- Graceful shutdown: Test that lease release on shutdown allows immediate takeover by standby.
- Single-replica mode: Test that SQLite deployments skip lease acquisition and run the reconciler unconditionally.
- Test levels: Unit tests for lease primitives, integration tests for gated reconciler.
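The acquisition-concurrency test could take roughly the following shape. This sketch swaps in OS threads and a Mutex-guarded slot for the tokio tasks and real Store the actual test would use; LeaseSlot and race_acquire are hypothetical names.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy lease slot: `None` means no lease record exists (not the real `Store`).
#[derive(Default)]
struct LeaseSlot {
    holder: Mutex<Option<String>>,
}

impl LeaseSlot {
    /// MustCreate semantics: succeed only when no record exists yet.
    fn try_acquire(&self, replica_id: &str) -> bool {
        let mut slot = self.holder.lock().unwrap();
        if slot.is_some() {
            return false;
        }
        *slot = Some(replica_id.to_string());
        true
    }
}

/// Race `n` would-be holders and count how many acquired the lease.
fn race_acquire(n: usize) -> usize {
    let slot = Arc::new(LeaseSlot::default());
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let slot = Arc::clone(&slot);
            thread::spawn(move || slot.try_acquire(&format!("replica-{i}")))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|won| *won)
        .count()
}
```

Whatever the thread interleaving, exactly one contender wins; the remaining N-1 correspond to the UniqueViolation results in the real test.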
What This Does Not Change
- gRPC request handling. All replicas continue serving all RPCs. No request routing changes.
- Supervisor sessions. Sessions land on whichever replica the TCP connection reaches. Session ownership and inter-replica forwarding are separate work.
- sync_lock. Stays as-is. Defense-in-depth within a single process.
- CAS semantics. All writes still go through CAS. The lease reduces contention; CAS ensures correctness.
- SQLite deployments. Unaffected. The reconciler runs unconditionally as it does today.
Scope Assessment
- Complexity: Low-Medium
- Confidence: High — uses existing CAS primitives, all open questions resolved, no new inter-replica communication needed
- Estimated files to change: 4
- Issue type: feat
Remaining Risks
- Clock skew tolerance. Application-side timestamps mean clock skew between replicas could cause early lease steal. With NTP-synced hosts and a 30s TTL, this requires >20s skew to be problematic. Monitoring/alerting on lease churn would surface this.
- Watch/reconcile loop cancellation. These loops have no graceful cancellation today: they loop forever. The lease coordinator will use JoinHandle::abort() to stop them, which is safe (all mutations are atomic CAS writes) but not graceful. A future improvement could add cooperative cancellation via a watch channel.
Deferred Work
- Supervisor session ownership persistence — recording which replica owns a supervisor's gRPC stream so other replicas can discover it.
- Inter-replica session forwarding — forwarding exec, relay, and log streaming requests to the session-owning replica.
- Per-sandbox or shard-based lease evolution — if the single reconciler lease becomes a bottleneck at scale.
- Database-side timestamps: moving updated_at_ms to a DB-side function for clock-skew immunity.
Consolidated design from spike investigation and review feedback. Builds on PR #1292 (CAS optimistic concurrency). Use build-from-issue to plan and implement.