diggerhq · motatoes · May 19, 2026 · May 20, 2026 · May 20, 2026 · Jun 2, 2026
diff --git a/.agents/design/burst-worker-cold-ready.md b/.agents/design/burst-worker-cold-ready.md
@@ -0,0 +1,131 @@
+# Burst Worker Cold-Ready Startup Plan
+
+## Context
+
+The burst worker launch test on June 10, 2026 showed two different timing
+segments:
+
+- EC2 instance creation to worker service start was roughly 90 seconds.
+- Worker service start to control-plane registration took much longer, roughly
+  5 minutes, because startup blocked on `PrepareGoldenSnapshot`.
+
+The important observation is that the worker can be useful for cold boots
+before the golden snapshot is ready. The current startup path does not expose
+that intermediate state because the worker prepares the golden snapshot before
+starting its servers and heartbeat.
+
+## Goal
+
+Make a newly launched burst worker register as soon as it is cold-boot capable,
+while preparing the golden snapshot in the background.
+
+Target behavior:
+
+- Worker becomes schedulable for cold boots as soon as networking, env, shared
+  mounts, gRPC, HTTP, and Redis heartbeat are ready.
+- Golden snapshot preparation continues asynchronously.
+- Once the golden snapshot is ready, the worker heartbeat advertises the golden
+  version and the control plane can prefer it for fast creates.
+
+This does not remove EC2 launch latency. It removes golden snapshot creation
+from the critical path for worker registration.
+
+## Proposed Changes
+
+1. Move golden snapshot preparation out of the blocking worker startup path.
+
+   Today `cmd/worker/main.go` calls `PrepareGoldenSnapshot()` before starting
+   metadata, HTTP/gRPC, and Redis heartbeat. Move this after server startup and
+   heartbeat setup, running in a background goroutine.
+
+2. Register the worker as cold-ready first.
+
+   The heartbeat should be published with an empty `golden_version` until the
+   snapshot is ready. The control plane already treats an empty
+   `golden_version` as "no golden snapshot available"; keep that meaning.
+
+3. Update heartbeat when golden prep completes.
+
+   After the background `PrepareGoldenSnapshot()` call succeeds, call
+   `hb.SetGoldenVersion(qemuMgr.GoldenVersion())`. The next heartbeat should
+   then publish the golden version to the registry.
+
+4. Add explicit logs for readiness phases.
+
+   Suggested log points:
+
+   - `worker cold-ready: starting heartbeat before golden snapshot`
+   - `worker golden snapshot preparation started in background`
+   - `worker golden-ready: version=<hash>`
+   - `worker golden preparation failed: <err>; continuing cold-ready`
+
+5. Fix AMI/systemd ordering for burst workers.
+
+   The burst AMI currently enables `opensandbox-worker.service`, so systemd can
+   start it before user-data writes `/etc/opensandbox/worker.env`. That caused
+   repeated `Failed to load environment files` messages during boot.
+
+   Change the burst Packer file to install the worker unit but leave it
+   disabled. User-data should start the worker exactly once after:
+
+   - instance identity is known
+   - shared volumes are attached/mounted
+   - `/etc/opensandbox/worker.env` has been written and patched
+
+6. Keep user-data minimal.
+
+   User-data should only do runtime-specific work:
+
+   - fetch instance identity
+   - attach/mount shared volumes
+   - write env
+   - start worker
+
+   Dependency installation, binaries, OCFS2 tools, AWS CLI, QEMU, kernel
+   modules, and rootfs assets should stay baked into the AMI.
+
+## Non-Goals
+
+- Do not change Spot instance type fallback strategy yet.
+- Do not try to guarantee sub-10-second readiness from a brand-new EC2 launch.
+- Do not implement downloaded/prebuilt QEMU memory snapshots in this pass.
+- Do not change public API behavior.
+
+## Expected Impact
+
+Based on the June 10 test:
+
+- Current EC2-created-to-registered time was about 6 minutes 24 seconds.
+- Worker service started about 91 seconds after EC2 creation.
+- Moving golden prep to the background could make cold-ready registration close
+  to that worker-service-start time, likely around 90-100 seconds from EC2
+  creation before further AMI cleanup.
+
+With AMI/systemd cleanup, a realistic next target is roughly 45-70 seconds from
+EC2 creation to cold-ready in favorable cases.
+
+## Risks
+
+- Cold-ready workers may serve slower first sandboxes until golden prep
+  completes.
+- Some scheduling paths may implicitly assume a non-empty `golden_version`.
+  Those paths need review before allowing all workloads onto cold-ready workers.
+- Migration/checkpoint paths that require a known source golden version should
+  continue to require it.
+
+## Validation Plan
+
+1. Build and deploy a worker with background golden prep.
+2. Launch a fresh burst worker and capture timestamps:
+   - scaler launch decision
+   - EC2 instance created
+   - user-data start
+   - worker service start
+   - first Redis heartbeat / CP registration
+   - golden snapshot ready
+3. Confirm the CP sees the worker before golden snapshot readiness.
+4. Create a sandbox on the cold-ready worker and verify it succeeds via cold
+   boot.
+5. Wait for golden-ready heartbeat and verify subsequent creates use the golden
+   path.
+6. Terminate the extra worker after the test to avoid unnecessary cost.
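Review note on proposed changes 1 to 3 of the design doc above: a minimal Go sketch of the cold-ready ordering, not the actual `cmd/worker/main.go` code. The names `PrepareGoldenSnapshot`, `GoldenVersion`, `SetGoldenVersion`, and the log strings come from the doc. The `Heartbeat` struct, the `GoldenPreparer` interface, `StartGoldenPrep`, and `fakeManager` are hypothetical and exist only for illustration.

```go
package main

import (
	"errors"
	"log"
	"sync"
)

// Heartbeat is a hypothetical stand-in for the worker's Redis heartbeat.
// Only SetGoldenVersion is named in the design doc; the rest is assumed.
type Heartbeat struct {
	mu            sync.RWMutex
	goldenVersion string
}

// SetGoldenVersion records the version that later heartbeats advertise.
// It is called from the background goroutine, hence the lock.
func (h *Heartbeat) SetGoldenVersion(v string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.goldenVersion = v
}

// GoldenVersion returns the value the next heartbeat would publish.
// An empty string means "no golden snapshot available".
func (h *Heartbeat) GoldenVersion() string {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.goldenVersion
}

// GoldenPreparer is a hypothetical interface over the two qemuMgr methods
// that the design doc mentions.
type GoldenPreparer interface {
	PrepareGoldenSnapshot() error
	GoldenVersion() string
}

// StartGoldenPrep runs golden preparation off the startup path. The returned
// channel is closed when preparation finishes, whether or not it succeeded.
func StartGoldenPrep(hb *Heartbeat, mgr GoldenPreparer) <-chan struct{} {
	done := make(chan struct{})
	log.Printf("worker golden snapshot preparation started in background")
	go func() {
		defer close(done)
		if err := mgr.PrepareGoldenSnapshot(); err != nil {
			// Failure is not fatal: the worker keeps serving cold boots.
			log.Printf("worker golden preparation failed: %v; continuing cold-ready", err)
			return
		}
		v := mgr.GoldenVersion()
		hb.SetGoldenVersion(v)
		log.Printf("worker golden-ready: version=%s", v)
	}()
	return done
}

// errPrepFailed and fakeManager simulate qemuMgr for the demo below.
var errPrepFailed = errors.New("simulated preparation failure")

type fakeManager struct {
	version string
	err     error
}

func (f fakeManager) PrepareGoldenSnapshot() error { return f.err }
func (f fakeManager) GoldenVersion() string        { return f.version }

func main() {
	hb := &Heartbeat{}
	// In the real worker, servers and the heartbeat loop would already be
	// running here, so the worker registers before golden prep begins.
	log.Printf("worker cold-ready: starting heartbeat before golden snapshot")
	<-StartGoldenPrep(hb, fakeManager{version: "abc123"})
	log.Printf("heartbeat now advertises golden_version=%q", hb.GoldenVersion())

	// A failed preparation leaves golden_version empty (still cold-ready).
	cold := &Heartbeat{}
	<-StartGoldenPrep(cold, fakeManager{err: errPrepFailed})
	log.Printf("after failure golden_version=%q", cold.GoldenVersion())
}
```

The `done` channel is only there so the demo and tests can wait for the goroutine. A real worker would presumably not block on it, since blocking would put golden prep back on the registration path.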
diff --git a/.github/workflows/build-aws-worker-ami.yml b/.github/workflows/build-aws-worker-ami.yml
@@ -0,0 +1,197 @@
+name: Build AWS Worker AMI
+
+on:
+  push:
+    branches:
+      - main
+      - feat/aws-poc-worker-support-clean-rebased
+  workflow_dispatch:
+    inputs:
+      region:
+        description: AWS region to build the AMI in
+        required: true
+        default: us-east-2
+      builder_instance_type:
+        description: EC2 instance type used by Packer for the build
+        required: true
+        default: c5.4xlarge
+      cleanup_old_amis:
+        description: Deregister old OpenComputer AWS worker AMIs after a successful build
+        required: true
+        default: "false"
+        type: choice
+        options:
+          - "false"
+          - "true"
+      ssm_parameter_prefix:
+        description: SSM prefix to update after publishing the AMI
+        required: true
+        default: /opencomputer/aws-us-east-2-burst-prod
+
+env:
+  AWS_REGION: ${{ inputs.region || vars.AWS_REGION || 'us-east-2' }}
+  BUILDER_INSTANCE_TYPE: ${{ inputs.builder_instance_type || 'c5.4xlarge' }}
+  SSM_PARAMETER_PREFIX: ${{ inputs.ssm_parameter_prefix || '/opencomputer/aws-us-east-2-burst-prod' }}
+
+jobs:
+  build-ami:
+    name: Build AWS Worker AMI
+    runs-on: ubuntu-latest
+    environment: aws-prod
+    permissions:
+      id-token: write
+      contents: read
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - uses: actions/setup-go@v5
+        with:
+          go-version: "1.23"
+
+      - name: Configure AWS credentials
+        uses: aws-actions/configure-aws-credentials@v4
+        with:
+          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
+          aws-region: ${{ env.AWS_REGION }}
+
+      - name: Setup Packer
+        uses: hashicorp/setup-packer@main
+
+      - name: Build binaries (amd64)
+        run: |
+          VERSION=$(git rev-parse --short HEAD)
+          echo "VERSION=$VERSION" >> "$GITHUB_ENV"
+
+          AGENT_VERSION=$(git log -1 --pretty=format:%h -- cmd/agent internal/agent proto/agent)
+          if [ -z "$AGENT_VERSION" ]; then
+            AGENT_VERSION=$VERSION
+          fi
+          echo "AGENT_VERSION=$AGENT_VERSION" >> "$GITHUB_ENV"
+
+          CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
+            -ldflags "-X main.WorkerVersion=$VERSION -X main.AgentVersion=$AGENT_VERSION" \
+            -o bin/opensandbox-worker ./cmd/worker/
+
+          CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
+            -ldflags "-X main.Version=$AGENT_VERSION" \
+            -o bin/osb-agent ./cmd/agent/
+
+      - name: Package rootfs context
+        run: |
+          tar czf /tmp/packer-rootfs-ctx.tar.gz \
+            deploy/firecracker/rootfs/ \
+            deploy/ec2/build-rootfs-docker.sh \
+            scripts/claude-agent-wrapper/
+
+      - name: Package Vector configs
+        run: tar czf /tmp/packer-vector-ctx.tar.gz -C deploy vector
+
+      - name: Packer init
+        run: packer init deploy/packer/worker-ami-aws-burst.pkr.hcl
+
+      - name: Build and publish AMI
+        run: |
+          set -o pipefail  # without this, tee's exit status masks a failed packer build
+          packer build \
+            -var "worker_version=$VERSION" -var "agent_version=$AGENT_VERSION" \
+            -var "region=$AWS_REGION" \
+            -var "instance_type=$BUILDER_INSTANCE_TYPE" \
+            -var "tigris_endpoint=${{ secrets.TIGRIS_ENDPOINT }}" \
+            -var "tigris_access_key_id=${{ secrets.TIGRIS_ACCESS_KEY_ID }}" \
+            -var "tigris_secret_access_key=${{ secrets.TIGRIS_SECRET_ACCESS_KEY }}" \
+            -var "tigris_goldens_bucket=${{ secrets.TIGRIS_GOLDENS_BUCKET }}" \
+            deploy/packer/worker-ami-aws-burst.pkr.hcl | tee /tmp/packer-output.txt
+
+      - name: Read AMI manifest
+        id: ami
+        run: |
+          AMI_ID=$(jq -r '.builds[-1].artifact_id | split(":")[-1]' packer-manifest-aws.json)
+          if [ -z "$AMI_ID" ] || [ "$AMI_ID" = "null" ]; then
+            echo "Could not read AMI ID from packer-manifest-aws.json"
+            cat packer-manifest-aws.json
+            exit 1
+          fi
+
+          GOLDEN_VERSION=$(grep -a 'Golden version:' /tmp/packer-output.txt | tail -1 | awk '{print $NF}' | tr -d '\r')
+
+          echo "ami_id=$AMI_ID" >> "$GITHUB_OUTPUT"
+          echo "golden_version=$GOLDEN_VERSION" >> "$GITHUB_OUTPUT"
+          echo "AMI_ID=$AMI_ID" >> "$GITHUB_ENV"
+          echo "GOLDEN_VERSION=$GOLDEN_VERSION" >> "$GITHUB_ENV"
+
+      - name: Verify AMI tags
+        run: |
+          aws ec2 describe-images \
+            --region "$AWS_REGION" \
+            --image-ids "$AMI_ID" \
+            --query 'Images[0].{ImageId:ImageId,Name:Name,State:State,Role:Tags[?Key==`opensandbox-role`].Value|[0],Cloud:Tags[?Key==`opensandbox-cloud`].Value|[0],Version:Tags[?Key==`opensandbox-version`].Value|[0]}' \
+            --output table
+
+      - name: Update worker AMI pointer
+        run: |
+          aws ssm put-parameter \
+            --region "$AWS_REGION" \
+            --name "$SSM_PARAMETER_PREFIX/worker-ami-id" \
+            --type String \
+            --value "$AMI_ID" \
+            --overwrite
+
+          aws ssm put-parameter \
+            --region "$AWS_REGION" \
+            --name "$SSM_PARAMETER_PREFIX/worker-ami-version" \
+            --type String \
+            --value "$VERSION" \
+            --overwrite
+
+          echo "Updated SSM worker AMI pointer: $SSM_PARAMETER_PREFIX/worker-ami-id -> $AMI_ID"
+
+      - name: Cleanup old AMIs
+        if: ${{ inputs.cleanup_old_amis == 'true' }}
+        run: |
+          set -euo pipefail
+
+          mapfile -t OLD_AMIS < <(
+            aws ec2 describe-images \
+              --region "$AWS_REGION" \
+              --owners self \
+              --filters \
+                "Name=tag:opensandbox-role,Values=worker" \
+                "Name=tag:opensandbox-cloud,Values=aws" \
+                "Name=state,Values=available" \
+              --query 'reverse(sort_by(Images,&CreationDate))[10:].ImageId' \
+              --output text | tr '\t' '\n' | awk 'NF'
+          )
+
+          if [ "${#OLD_AMIS[@]}" -eq 0 ]; then
+            echo "No old AMIs to clean up"
+            exit 0
+          fi
+
+          for AMI in "${OLD_AMIS[@]}"; do
+            echo "Deregistering old AMI: $AMI"
+            SNAPSHOTS=$(aws ec2 describe-images \
+              --region "$AWS_REGION" \
+              --image-ids "$AMI" \
+              --query 'Images[0].BlockDeviceMappings[].Ebs.SnapshotId' \
+              --output text)
+            aws ec2 deregister-image --region "$AWS_REGION" --image-id "$AMI"
+            for SNAPSHOT in $SNAPSHOTS; do
+              echo "Deleting old AMI snapshot: $SNAPSHOT"
+              aws ec2 delete-snapshot --region "$AWS_REGION" --snapshot-id "$SNAPSHOT" || echo "Warning: could not delete snapshot $SNAPSHOT"
+            done
+          done
+
+      - name: Summary
+        run: |
+          echo "## AWS Worker AMI Build Complete" >> "$GITHUB_STEP_SUMMARY"
+          echo "- **AMI ID:** \`$AMI_ID\`" >> "$GITHUB_STEP_SUMMARY"
+          echo "- **Worker version:** \`$VERSION\`" >> "$GITHUB_STEP_SUMMARY"
+          echo "- **Agent version:** \`$AGENT_VERSION\`" >> "$GITHUB_STEP_SUMMARY"
+          echo "- **Golden version:** \`${GOLDEN_VERSION:-unknown}\`" >> "$GITHUB_STEP_SUMMARY"
+          echo "- **Golden cache:** \`${{ secrets.TIGRIS_GOLDENS_BUCKET != '' && 'Tigris enabled' || 'disabled' }}\`" >> "$GITHUB_STEP_SUMMARY"
+          echo "- **Region:** \`$AWS_REGION\`" >> "$GITHUB_STEP_SUMMARY"
+          echo "- **SSM pointer:** \`$SSM_PARAMETER_PREFIX/worker-ami-id\`" >> "$GITHUB_STEP_SUMMARY"
+          echo "" >> "$GITHUB_STEP_SUMMARY"
+          echo "The CP scaler reads this AMI through \`OPENSANDBOX_EC2_SSM_AMI_PARAM\`. Terraform creates the parameter but ignores value drift so later applies do not overwrite CI-published AMIs." >> "$GITHUB_STEP_SUMMARY"
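Review note on the workflow above: two of its expressions are terse, so here is a small Go sketch of what they are understood to compute. `amiIDFromArtifact` mirrors the jq expression `split(":")[-1]` applied to a Packer `artifact_id` of the form `region:ami-id`. `amisToDeregister` mirrors the JMESPath query `reverse(sort_by(Images,&CreationDate))[10:].ImageId`, which keeps the 10 newest AMIs and returns the rest. The `Image` type and both function names are hypothetical and are not part of the repository, and ordering among images with identical creation dates is not modeled.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Image is a hypothetical, trimmed-down view of an EC2 describe-images entry.
// CreationDate is an ISO 8601 UTC timestamp, so string order is time order.
type Image struct {
	ImageID      string
	CreationDate string
}

// amiIDFromArtifact returns the text after the last ":" in a Packer
// artifact_id such as "us-east-2:ami-0123456789abcdef0".
func amiIDFromArtifact(artifactID string) string {
	parts := strings.Split(artifactID, ":")
	return parts[len(parts)-1]
}

// amisToDeregister sorts images newest first and returns the IDs of
// everything after the first keep entries.
func amisToDeregister(images []Image, keep int) []string {
	if keep < 0 {
		keep = 0
	}
	// Copy so the caller's slice is not reordered.
	sorted := append([]Image(nil), images...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].CreationDate > sorted[j].CreationDate
	})
	if len(sorted) <= keep {
		return nil
	}
	ids := make([]string, 0, len(sorted)-keep)
	for _, img := range sorted[keep:] {
		ids = append(ids, img.ImageID)
	}
	return ids
}

func main() {
	images := []Image{
		{ImageID: "ami-old", CreationDate: "2026-05-01T00:00:00.000Z"},
		{ImageID: "ami-new", CreationDate: "2026-06-01T00:00:00.000Z"},
		{ImageID: "ami-mid", CreationDate: "2026-05-15T00:00:00.000Z"},
	}
	fmt.Println(amiIDFromArtifact("us-east-2:ami-0123456789abcdef0"))
	fmt.Println(amisToDeregister(images, 2))
}
```

With a retention of 10, fewer than 11 matching AMIs yields an empty list, which corresponds to the workflow's "No old AMIs to clean up" branch.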