32 commits
b7637bc
worker: multi-cloud secrets + spot preemption monitor + AWS Packer AMI
motatoes May 19, 2026
79c8836
config: pass region explicitly to AWS SDK config loader
motatoes May 20, 2026
0497f46
packer aws: upload vector dir as tarball to avoid scp -r failure
motatoes May 20, 2026
0a76519
fix rebase conflict resolutions
Jun 2, 2026
d42fffe
Improve worker migration readiness startup
Jun 2, 2026
aa9d98e
Add admin worker evacuation drill hook
Jun 2, 2026
5c80187
Revert "Improve worker migration readiness startup"
Jun 2, 2026
35c0e94
Fix scaler tests after readiness revert
Jun 2, 2026
6787fe7
fix spot drain retry behavior
Jun 3, 2026
57ac20d
Improve spot drain migration handling
Jun 3, 2026
25ac2e1
Add alpha spot sandbox family
Jun 3, 2026
732e40b
docs: add alpha spot sandboxes
Jun 3, 2026
ac863fd
Fix golden snapshot agent transport reset
Jun 4, 2026
6fdaafd
Update spot sandbox pricing docs
Jun 4, 2026
6cce58b
Add resumable sandbox recovery
Jun 4, 2026
e2f4854
Update resumable sandbox reliability docs
Jun 4, 2026
9bf21fd
Configure worker-local golden cache
Jun 5, 2026
31c816c
Add resumable sandbox recreate flow
Jun 5, 2026
56d0277
Handle resumable worker shutdown safely
Jun 5, 2026
8a60e7a
Add Burst worker pool support
Jun 5, 2026
0310f6e
Use EC2 AMI for AWS scaler launches
Jun 5, 2026
6c9c4eb
docs: simplify sandbox sizing references
Jun 5, 2026
fe282a6
Revert "docs: simplify sandbox sizing references"
Jun 5, 2026
94ffe65
Add AWS worker AMI build workflow
Jun 5, 2026
2270e73
Run AWS worker AMI build on branch push
Jun 6, 2026
fcb7759
Use generic AWS role secret for AMI workflow
Jun 6, 2026
344f7a6
Add AWS Burst CP deploy workflow
Jun 6, 2026
cb10a6a
Fix AWS burst worker bootstrap storage
Jun 8, 2026
f69aa39
Add burst compute billing meter
Jun 9, 2026
97057ff
Bake burst worker AMI dependencies
Jun 10, 2026
9d7553d
Use static OCFS2 slots for AWS workers
Jun 10, 2026
20d1619
feat: prepare burst workers cold-ready
Jun 15, 2026
131 changes: 131 additions & 0 deletions .agents/design/burst-worker-cold-ready.md
@@ -0,0 +1,131 @@
# Burst Worker Cold-Ready Startup Plan

## Context

The burst worker launch test on June 10, 2026 showed two different timing
segments:

- EC2 instance creation to worker service start was roughly 90 seconds.
- Worker service start to control-plane registration was much longer because
startup blocked on `PrepareGoldenSnapshot`.

The important observation is that the worker can be useful for cold boots
before the golden snapshot is ready. The current startup path does not expose
that intermediate state because the worker prepares the golden snapshot before
starting its servers and heartbeat.

## Goal

Make a newly launched burst worker register as soon as it is cold-boot capable,
while preparing the golden snapshot in the background.

Target behavior:

- Worker becomes schedulable for cold boots as soon as networking, env, shared
mounts, gRPC, HTTP, and Redis heartbeat are ready.
- Golden snapshot preparation continues asynchronously.
- Once the golden snapshot is ready, the worker heartbeat advertises the golden
version and the control plane can prefer it for fast creates.

This does not remove EC2 launch latency. It removes golden snapshot creation
from the critical path for worker registration.
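
The two heartbeat states described above can be sketched as follows. This is a minimal illustration, not the worker's actual heartbeat code: the `heartbeat` and `payload` types, the `worker_id` field, and the JSON encoding are assumptions made for the example. Only `SetGoldenVersion` and the `golden_version` key come from this plan.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// heartbeat is a hypothetical, simplified stand-in for the worker heartbeat.
type heartbeat struct {
	mu            sync.Mutex
	workerID      string
	goldenVersion string // empty until the golden snapshot is ready
}

// payload is the document published on each tick (assumed shape).
// omitempty drops golden_version while the worker is only cold-ready.
type payload struct {
	WorkerID      string `json:"worker_id"`
	GoldenVersion string `json:"golden_version,omitempty"`
}

// SetGoldenVersion records the version so the next tick advertises it.
func (h *heartbeat) SetGoldenVersion(v string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.goldenVersion = v
}

// Payload renders the current state as JSON.
func (h *heartbeat) Payload() ([]byte, error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	return json.Marshal(payload{WorkerID: h.workerID, GoldenVersion: h.goldenVersion})
}

func main() {
	hb := &heartbeat{workerID: "w-1"}

	cold, _ := hb.Payload()
	fmt.Println(string(cold)) // {"worker_id":"w-1"}

	hb.SetGoldenVersion("abc123")
	golden, _ := hb.Payload()
	fmt.Println(string(golden)) // {"worker_id":"w-1","golden_version":"abc123"}
}
```

The mutex matters because the version is set from the background preparation goroutine while the heartbeat loop reads it.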

## Proposed Changes

1. Move golden snapshot preparation out of the blocking worker startup path.

   Today `cmd/worker/main.go` calls `PrepareGoldenSnapshot()` before starting
   metadata, HTTP/gRPC, and Redis heartbeat. Move this after server startup and
   heartbeat setup, running in a background goroutine.

2. Register the worker as cold-ready first.

   Heartbeat should be published with no `golden_version` until the snapshot is
   ready. The control plane already treats empty `golden_version` as "no golden
   snapshot available"; keep that meaning.

3. Update heartbeat when golden prep completes.

   After background `PrepareGoldenSnapshot()` succeeds, call
   `hb.SetGoldenVersion(qemuMgr.GoldenVersion())`. The next heartbeat should
   update the registry.

4. Add explicit logs for readiness phases.

   Suggested log points:

   - `worker cold-ready: starting heartbeat before golden snapshot`
   - `worker golden snapshot preparation started in background`
   - `worker golden-ready: version=<hash>`
   - `worker golden preparation failed: <err>; continuing cold-ready`

5. Fix AMI/systemd ordering for burst workers.

   The burst AMI currently enables `opensandbox-worker.service`, so systemd can
   start it before user-data writes `/etc/opensandbox/worker.env`. That caused
   repeated `Failed to load environment files` messages during boot.

   Change the burst Packer file to install the worker unit but leave it
   disabled. User-data should start the worker exactly once after:

   - instance identity is known
   - shared volumes are attached/mounted
   - `/etc/opensandbox/worker.env` has been written and patched

6. Keep user-data minimal.

   User-data should only do runtime-specific work:

   - fetch instance identity
   - attach/mount shared volumes
   - write env
   - start worker

   Dependency installation, binaries, OCFS2 tools, AWS CLI, QEMU, kernel
   modules, and rootfs assets should stay baked into the AMI.

## Non-Goals

- Do not change Spot instance type fallback strategy yet.
- Do not try to guarantee sub-10-second readiness from a brand-new EC2 launch.
- Do not implement downloaded/prebuilt QEMU memory snapshots in this pass.
- Do not change public API behavior.

## Expected Impact

Based on the June 10 test:

- On the current startup path, EC2 creation to control-plane registration took
  about 6 minutes 24 seconds.
- Worker service started about 91 seconds after EC2 creation.
- Moving golden prep to the background could make cold-ready registration close
to that worker-service-start time, likely around 90-100 seconds from EC2
creation before further AMI cleanup.

With AMI/systemd cleanup, a realistic next target is roughly 45-70 seconds from
EC2 creation to cold-ready in favorable cases.

## Risks

- Cold-ready workers may serve slower first sandboxes until golden prep
completes.
- Some scheduling paths may implicitly assume a non-empty `golden_version`.
Those paths need review before allowing all workloads onto cold-ready workers.
- Migration/checkpoint paths that require a known source golden version should
continue to require it.
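
One way to keep these risks explicit in scheduling code is to separate "any schedulable worker" from "worker with a known golden version". The sketch below is hypothetical: the `worker` type and both functions are illustrative and do not reflect the control plane's actual scheduler. It assumes only that an empty `golden_version` means no golden snapshot is available, as stated in the Proposed Changes.

```go
package main

import "fmt"

// worker is a hypothetical view of a registry entry.
type worker struct {
	ID            string
	GoldenVersion string // empty means no golden snapshot available
}

// pickForCreate prefers a golden-ready worker for fast creates and falls back
// to the first cold-ready worker when none is golden-ready.
func pickForCreate(workers []worker) (worker, bool) {
	var fallback *worker
	for i := range workers {
		if workers[i].GoldenVersion != "" {
			return workers[i], true
		}
		if fallback == nil {
			fallback = &workers[i]
		}
	}
	if fallback != nil {
		return *fallback, true
	}
	return worker{}, false
}

// canMigrateFrom keeps the existing requirement that migration/checkpoint
// paths need a known source golden version.
func canMigrateFrom(source worker) bool {
	return source.GoldenVersion != ""
}

func main() {
	workers := []worker{
		{ID: "cold-1"},
		{ID: "golden-1", GoldenVersion: "abc123"},
	}

	w, _ := pickForCreate(workers)
	fmt.Println(w.ID)                       // golden-1
	fmt.Println(canMigrateFrom(workers[0])) // false
}
```

Making the empty-version check a named function gives the paths that "implicitly assume a non-empty `golden_version`" a single place to be reviewed.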

## Validation Plan

1. Build and deploy a worker with background golden prep.
2. Launch a fresh burst worker and capture timestamps:
   - scaler launch decision
   - EC2 instance created
   - user-data start
   - worker service start
   - first Redis heartbeat / CP registration
   - golden snapshot ready
3. Confirm the CP sees the worker before golden snapshot readiness.
4. Create a sandbox on the cold-ready worker and verify it succeeds via cold
   boot.
5. Wait for golden-ready heartbeat and verify subsequent creates use the golden
   path.
6. Terminate the extra worker after the test to avoid unnecessary cost.
197 changes: 197 additions & 0 deletions .github/workflows/build-aws-worker-ami.yml
@@ -0,0 +1,197 @@
name: Build AWS Worker AMI

on:
  push:
    branches:
      - main
      - feat/aws-poc-worker-support-clean-rebased
  workflow_dispatch:
    inputs:
      region:
        description: AWS region to build the AMI in
        required: true
        default: us-east-2
      builder_instance_type:
        description: EC2 instance type used by Packer for the build
        required: true
        default: c5.4xlarge
      cleanup_old_amis:
        description: Deregister old OpenComputer AWS worker AMIs after a successful build
        required: true
        default: "false"
        type: choice
        options:
          - "false"
          - "true"
      ssm_parameter_prefix:
        description: SSM prefix to update after publishing the AMI
        required: true
        default: /opencomputer/aws-us-east-2-burst-prod

env:
  AWS_REGION: ${{ inputs.region || vars.AWS_REGION || 'us-east-2' }}
  BUILDER_INSTANCE_TYPE: ${{ inputs.builder_instance_type || 'c5.4xlarge' }}
  SSM_PARAMETER_PREFIX: ${{ inputs.ssm_parameter_prefix || '/opencomputer/aws-us-east-2-burst-prod' }}

jobs:
  build-ami:
    name: Build AWS Worker AMI
    runs-on: ubuntu-latest
    environment: aws-prod
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - uses: actions/setup-go@v5
        with:
          go-version: "1.23"

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Setup Packer
        uses: hashicorp/setup-packer@main

      - name: Build binaries (amd64)
        run: |
          VERSION=$(git rev-parse --short HEAD)
          echo "VERSION=$VERSION" >> "$GITHUB_ENV"

          AGENT_VERSION=$(git log -1 --pretty=format:%h -- cmd/agent internal/agent proto/agent)
          if [ -z "$AGENT_VERSION" ]; then
            AGENT_VERSION=$VERSION
          fi
          echo "AGENT_VERSION=$AGENT_VERSION" >> "$GITHUB_ENV"

          CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
            -ldflags "-X main.WorkerVersion=$VERSION -X main.AgentVersion=$AGENT_VERSION" \
            -o bin/opensandbox-worker ./cmd/worker/

          CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build \
            -ldflags "-X main.Version=$AGENT_VERSION" \
            -o bin/osb-agent ./cmd/agent/

      - name: Package rootfs context
        run: |
          tar czf /tmp/packer-rootfs-ctx.tar.gz \
            deploy/firecracker/rootfs/ \
            deploy/ec2/build-rootfs-docker.sh \
            scripts/claude-agent-wrapper/

      - name: Package Vector configs
        run: tar czf /tmp/packer-vector-ctx.tar.gz -C deploy vector

      - name: Packer init
        run: packer init deploy/packer/worker-ami-aws-burst.pkr.hcl

      - name: Build and publish AMI
        run: |
          # The default run shell is `bash -e` without pipefail, so tee's exit
          # status would otherwise mask a failed packer build.
          set -o pipefail
          packer build \
            -var "worker_version=$VERSION" \
            -var "agent_version=$AGENT_VERSION" \
            -var "region=$AWS_REGION" \
            -var "instance_type=$BUILDER_INSTANCE_TYPE" \
            -var "tigris_endpoint=${{ secrets.TIGRIS_ENDPOINT }}" \
            -var "tigris_access_key_id=${{ secrets.TIGRIS_ACCESS_KEY_ID }}" \
            -var "tigris_secret_access_key=${{ secrets.TIGRIS_SECRET_ACCESS_KEY }}" \
            -var "tigris_goldens_bucket=${{ secrets.TIGRIS_GOLDENS_BUCKET }}" \
            deploy/packer/worker-ami-aws-burst.pkr.hcl | tee /tmp/packer-output.txt

      - name: Read AMI manifest
        id: ami
        run: |
          AMI_ID=$(jq -r '.builds[-1].artifact_id | split(":")[-1]' packer-manifest-aws.json)
          if [ -z "$AMI_ID" ] || [ "$AMI_ID" = "null" ]; then
            echo "Could not read AMI ID from packer-manifest-aws.json"
            cat packer-manifest-aws.json
            exit 1
          fi

          GOLDEN_VERSION=$(grep -a 'Golden version:' /tmp/packer-output.txt | tail -1 | awk '{print $NF}' | tr -d '\r')

          echo "ami_id=$AMI_ID" >> "$GITHUB_OUTPUT"
          echo "golden_version=$GOLDEN_VERSION" >> "$GITHUB_OUTPUT"
          echo "AMI_ID=$AMI_ID" >> "$GITHUB_ENV"
          echo "GOLDEN_VERSION=$GOLDEN_VERSION" >> "$GITHUB_ENV"

      - name: Verify AMI tags
        run: |
          aws ec2 describe-images \
            --region "$AWS_REGION" \
            --image-ids "$AMI_ID" \
            --query 'Images[0].{ImageId:ImageId,Name:Name,State:State,Role:Tags[?Key==`opensandbox-role`].Value|[0],Cloud:Tags[?Key==`opensandbox-cloud`].Value|[0],Version:Tags[?Key==`opensandbox-version`].Value|[0]}' \
            --output table

      - name: Update worker AMI pointer
        run: |
          aws ssm put-parameter \
            --region "$AWS_REGION" \
            --name "$SSM_PARAMETER_PREFIX/worker-ami-id" \
            --type String \
            --value "$AMI_ID" \
            --overwrite

          aws ssm put-parameter \
            --region "$AWS_REGION" \
            --name "$SSM_PARAMETER_PREFIX/worker-ami-version" \
            --type String \
            --value "$VERSION" \
            --overwrite

          echo "Updated SSM worker AMI pointer: $SSM_PARAMETER_PREFIX/worker-ami-id -> $AMI_ID"

      - name: Cleanup old AMIs
        if: ${{ inputs.cleanup_old_amis == 'true' }}
        run: |
          set -euo pipefail

          mapfile -t OLD_AMIS < <(
            aws ec2 describe-images \
              --region "$AWS_REGION" \
              --owners self \
              --filters \
                "Name=tag:opensandbox-role,Values=worker" \
                "Name=tag:opensandbox-cloud,Values=aws" \
                "Name=state,Values=available" \
              --query 'reverse(sort_by(Images,&CreationDate))[10:].ImageId' \
              --output text | tr '\t' '\n' | awk 'NF'
          )

          if [ "${#OLD_AMIS[@]}" -eq 0 ]; then
            echo "No old AMIs to clean up"
            exit 0
          fi

          for AMI in "${OLD_AMIS[@]}"; do
            echo "Deregistering old AMI: $AMI"
            SNAPSHOTS=$(aws ec2 describe-images \
              --region "$AWS_REGION" \
              --image-ids "$AMI" \
              --query 'Images[0].BlockDeviceMappings[].Ebs.SnapshotId' \
              --output text)
            aws ec2 deregister-image --region "$AWS_REGION" --image-id "$AMI"
            for SNAPSHOT in $SNAPSHOTS; do
              echo "Deleting old AMI snapshot: $SNAPSHOT"
              aws ec2 delete-snapshot --region "$AWS_REGION" --snapshot-id "$SNAPSHOT" || true
            done
          done

      - name: Summary
        run: |
          echo "## AWS Worker AMI Build Complete" >> "$GITHUB_STEP_SUMMARY"
          echo "- **AMI ID:** \`$AMI_ID\`" >> "$GITHUB_STEP_SUMMARY"
          echo "- **Worker version:** \`$VERSION\`" >> "$GITHUB_STEP_SUMMARY"
          echo "- **Agent version:** \`$AGENT_VERSION\`" >> "$GITHUB_STEP_SUMMARY"
          echo "- **Golden version:** \`${GOLDEN_VERSION:-unknown}\`" >> "$GITHUB_STEP_SUMMARY"
          echo "- **Golden cache:** \`${{ secrets.TIGRIS_GOLDENS_BUCKET != '' && 'Tigris enabled' || 'disabled' }}\`" >> "$GITHUB_STEP_SUMMARY"
          echo "- **Region:** \`$AWS_REGION\`" >> "$GITHUB_STEP_SUMMARY"
          echo "- **SSM pointer:** \`$SSM_PARAMETER_PREFIX/worker-ami-id\`" >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"
          echo "The CP scaler reads this AMI through \`OPENSANDBOX_EC2_SSM_AMI_PARAM\`. Terraform creates the parameter but ignores value drift so later applies do not overwrite CI-published AMIs." >> "$GITHUB_STEP_SUMMARY"