From a768cb29b57ee4ed93004bd0b040cf91b1fa1cce Mon Sep 17 00:00:00 2001
From: ziruihao <ziray.hao@gmail.com>
Date: Fri, 5 Jun 2026 14:31:01 -0700
Subject: [PATCH 1/2] refactor(codegen): delete codegen.mjs; outer agent owns
 script generation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PR #125 introduced scripts/codegen.mjs as a one-shot completion-API
pipeline that templates a framework prompt, calls the LLM, writes the
emitted message text to disk as the script, then verifies and rewrites
on failure. The sub-process boundary turned out to be the wrong contract:

  • Script content rides the model's natural-language output channel,
    so it competes with the model's conversational instincts. On the
    rewrite path the LLM keeps prepending self-narration ("The error is
    clear:", "Here is the corrected script:"), which breaks tsx parsing;
    see /tmp/skill/etsy.com/search-products/autobrowse/codegen-cache/
    6c78b599d4d5a9d4.txt from the 2026-06-04 preview run.
  • Multi-framework runs into a shared --out dir collide on package.json
    + node_modules (PR #125 fixed this with deep-merge + pkg-hash stamp;
    the bug only existed because of the sub-process split).
  • Runner timeouts and the parent verify timeout had to be hand-aligned
    so the parent didn't SIGTERM a healthy child mid-install.
  • Trace/strategy/script artifacts get reasoned about in two places
    (codegen.mjs writes scripts, the outer agent's bash uploads them).
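
To make the first failure mode concrete, here is a minimal sketch of
how prepended narration shows up in an emitted file. The helper name
`leadingNarration` is hypothetical (it is not part of this repo), and it
assumes, as the old prompts required, that a valid script starts at its
first `import` line.

```typescript
// Hypothetical helper, for illustration only: returns any non-blank lines
// that appear before the first `import` statement in an emitted script.
// This is a heuristic; a legitimate leading comment would also be flagged.
function leadingNarration(fileText: string): string[] {
  const lines = fileText.split("\n");
  const firstImport = lines.findIndex((l) => l.startsWith("import "));
  // No import at all: treat every non-blank line as suspect.
  const head = firstImport === -1 ? lines : lines.slice(0, firstImport);
  return head.filter((l) => l.trim() !== "");
}
```

A non-empty result on a rewritten script is the symptom described in the
first bullet above: prose ahead of the code, which tsx cannot parse.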

All of those classes of bug disappear when the outer agent owns codegen.
It already has the context, the tools (Read/Write/Bash), and the
judgment loop. The Write tool's structured `content` argument means
script bytes never ride the natural-language channel — no preamble bug.
A single agent process means no cross-process timeout coordination, no
deps merging across sub-process invocations, and no separate place to
reason about "this Stagehand script failed, drop it before upload".

Changes:
- Delete scripts/codegen.mjs (515 lines)
- Delete codegen/runners/ (lib/tsx-runner.mjs, playwright.mjs,
  stagehand.mjs)
- Delete codegen/scaffolds/ (inlined into the new reference docs)
- Move + reframe codegen/prompts/{playwright,stagehand}.md to
  references/codegen/{playwright,stagehand}.md. The technical content
  (CDP attach pattern, Stagehand v3 constructor shape, locator
  priorities, snap convention, JSON stdout contract) is preserved; what
  changed is framing — these are now reference docs an outer agent
  reads on demand, not completion-API system prompts.
- Update SKILL.md's "Generate a runnable script" section to describe
  the agent-driven loop (Read trace/refs → Write script → Bash verify
  → iterate or delete on persistent failure).
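
For reference, the verify step in that loop boils down to reading the
script's trailing JSON line. The sketch below mirrors the backward scan
the deleted lib/tsx-runner.mjs performed; the function and type names
are hypothetical, and the outer agent may just as well do this by
reading stdout directly.

```typescript
// Hypothetical names; the {"success":boolean,...} shape is the contract
// described in the reference docs.
type ScriptResult = { success: boolean; data?: unknown; error?: string };

// Return the last stdout line that parses as JSON and carries a boolean
// `success` field, or null if no such line exists.
function parseTrailingResult(stdout: string): ScriptResult | null {
  const lines = stdout.trim().split("\n").filter(Boolean);
  // Walk backward so earlier log output (even JSON-looking) is ignored.
  for (let i = lines.length - 1; i >= 0; i--) {
    try {
      const candidate = JSON.parse(lines[i]);
      if (typeof candidate?.success === "boolean") {
        return candidate as ScriptResult;
      }
    } catch {
      // Not JSON; keep scanning upward.
    }
  }
  return null;
}
```

A null result or `success: false` is treated as a failed attempt, at
which point the agent reads the stderr tail and iterates.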

Net diff: -626 lines.

The companion change in browse.sh's §4b system prompt — replacing the
`node codegen.mjs --frameworks ...` invocation with the inlined
Read/Write/Bash loop — lives in a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 skills/autobrowse/SKILL.md                    |  78 ++-
 .../autobrowse/codegen/prompts/playwright.md  |  55 --
 .../autobrowse/codegen/prompts/stagehand.md   |  80 ---
 .../codegen/runners/lib/tsx-runner.mjs        | 136 -----
 .../autobrowse/codegen/runners/playwright.mjs |  29 -
 .../autobrowse/codegen/runners/stagehand.mjs  |  24 -
 .../codegen/scaffolds/playwright/package.json |  15 -
 .../scaffolds/playwright/tsconfig.json        |  13 -
 .../codegen/scaffolds/stagehand/package.json  |  15 -
 .../codegen/scaffolds/stagehand/tsconfig.json |  13 -
 .../references/codegen/playwright.md          | 128 +++++
 .../references/codegen/stagehand.md           | 145 +++++
 skills/autobrowse/scripts/codegen.mjs         | 515 ------------------
 13 files changed, 310 insertions(+), 936 deletions(-)
 delete mode 100644 skills/autobrowse/codegen/prompts/playwright.md
 delete mode 100644 skills/autobrowse/codegen/prompts/stagehand.md
 delete mode 100644 skills/autobrowse/codegen/runners/lib/tsx-runner.mjs
 delete mode 100755 skills/autobrowse/codegen/runners/playwright.mjs
 delete mode 100755 skills/autobrowse/codegen/runners/stagehand.mjs
 delete mode 100644 skills/autobrowse/codegen/scaffolds/playwright/package.json
 delete mode 100644 skills/autobrowse/codegen/scaffolds/playwright/tsconfig.json
 delete mode 100644 skills/autobrowse/codegen/scaffolds/stagehand/package.json
 delete mode 100644 skills/autobrowse/codegen/scaffolds/stagehand/tsconfig.json
 create mode 100644 skills/autobrowse/references/codegen/playwright.md
 create mode 100644 skills/autobrowse/references/codegen/stagehand.md
 delete mode 100755 skills/autobrowse/scripts/codegen.mjs

diff --git a/skills/autobrowse/SKILL.md b/skills/autobrowse/SKILL.md
index ba7ca24..0a7b51b 100644
--- a/skills/autobrowse/SKILL.md
+++ b/skills/autobrowse/SKILL.md
@@ -225,47 +225,43 @@ Read the new summary. Did it pass? Make clear progress?
 
 ### Generate a runnable script (optional)
 
-Once the task has converged, you can produce a deterministic, runnable script
-in one or more frameworks via `scripts/codegen.mjs`. This is one shot of an
-LLM call per framework, cached by content hash, with optional verify-against-
-fresh-session and rewrite-on-failure.
-
-```bash
-node ${CLAUDE_SKILL_DIR}/scripts/codegen.mjs \
-  --task <name> \
-  --workspace ./autobrowse \
-  --frameworks playwright,stagehand \
-  --verify
-```
-
-Each framework gets its own subdirectory under `tasks/<name>/<framework>/`
-with the emitted script and a self-contained scaffold (`package.json`,
-`tsconfig.json`). The directory is runnable standalone with
-`cd tasks/<name>/playwright && npm install && npx tsx <name>.ts` — the only
-runtime requirement is `BROWSERBASE_API_KEY` (plus `ANTHROPIC_API_KEY` for
-the Stagehand target).
-
-Builtin frameworks: `playwright`, `stagehand`. Add a custom framework with
-`--prompt-template <path> --frameworks custom` (and provide your own runner
-or pass `--no-verify`).
-
-Common flags:
-
-| Flag | Purpose |
-|---|---|
-| `--frameworks a,b,...` | Comma-separated; default `playwright` |
-| `--verify` / `--no-verify` | Run the produced script against a fresh BB session; default `--verify` |
-| `--max-retries N` | Rewrite-on-verify-failure cap; default 2 |
-| `--cache-only` | Error if cache miss (CI-friendly) |
-| `--force` | Bust the cache |
-| `--dry-run` | Estimate prompt size + cost; don't call the LLM |
-| `--run <id>` | Force a specific `run-NNN` (default: latest passing) |
-
-Output is one JSON line per framework on stdout. Non-zero exit if any
-selected framework's final state is `passed: false`.
-
-See `references/playwright-cdp-bridge.md` for the canonical
-`connectOverCDP` patterns the emitted scripts follow.
+Once the task has converged, you can produce a runnable script in one or
+more frameworks (Playwright, Stagehand) directly using your own `Write` and
+`Bash` tools — autobrowse no longer ships a separate `codegen.mjs`
+sub-process. The framework-specific specs live as reference docs you read
+on demand:
+
+- `references/codegen/playwright.md` — script shape, scaffold, verify
+  contract, locator priorities, HTTP-only variant
+- `references/codegen/stagehand.md` — Stagehand v3 constructor, `act` /
+  `extract` patterns, when NOT to ship Stagehand
+- `references/playwright-cdp-bridge.md` — canonical `connectOverCDP`
+  create-session / release dance
+
+The loop is:
+
+1. `Read` the converged trace at
+   `./autobrowse/traces/<task>/latest/{trace.json,unified-events.jsonl}`,
+   the task's `strategy.md`, and the framework reference doc.
+2. `Write` `<framework>.ts` into the output directory (e.g.
+   `tasks/<task>/<framework>/<framework>.ts` or a flattened upload root).
+3. `Write` the scaffold's `package.json` + `tsconfig.json` per the
+   reference. When multiple frameworks share an output directory, merge
+   the `dependencies` across frameworks into a single `package.json`.
+4. `Bash` `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 npm install --silent --no-audit --no-fund`
+   then `npx tsx <framework>.ts` against a fresh Browserbase session.
+5. Parse the trailing `{"success":boolean,...}` JSON line from stdout. If
+   it failed, read the stderr tail and iterate — up to ~3 attempts is
+   reasonable. If still failing, delete the broken script so it isn't
+   uploaded (the upload glob ships whatever's on disk).
+
+The agent does this directly because it already has the context, the
+tools, and the judgment for "this stderr means …, try X". A sub-process
+LLM call (the old `codegen.mjs`) couldn't see why a script was failing
+beyond the stderr tail, and tended to bleed natural-language preamble
+into the `.ts` file via the completion API's message channel — both
+problems disappear when the outer agent writes the file through the
+`Write` tool's structured argument.
 
 ### After all iterations — publish if ready
 
diff --git a/skills/autobrowse/codegen/prompts/playwright.md b/skills/autobrowse/codegen/prompts/playwright.md
deleted file mode 100644
index 51c7de7..0000000
--- a/skills/autobrowse/codegen/prompts/playwright.md
+++ /dev/null
@@ -1,55 +0,0 @@
-# Playwright codegen — system prompt
-
-You are converting a converged autobrowse trace into a runnable Playwright
-script. Your output is the **complete contents of a `.ts` file**, nothing
-else: no preamble, no closing remarks, no markdown fences.
-
-## Constraints
-
-- **Self-contained.** The script must run with only `BROWSERBASE_API_KEY` in
-  the environment. No reliance on autobrowse state, no reading from
-  workspace files.
-- **CDP attach, never `chromium.launch()`.** Follow the
-  `Playwright ↔ Browserbase bridge` reference verbatim for the
-  create-session / connectOverCDP / release dance.
-- **No `browser.close()`.** Release the session via
-  `browse cloud sessions update <id> --status REQUEST_RELEASE` in `finally`.
-- **Final stdout line is JSON.** `{"success":true,"data":...}` on success
-  or `{"success":false,"error":"..."}` on failure. The runner parses this
-  line — don't emit any other JSON-looking lines after it.
-- **Snap on errors.** Wrap `main()` in `try { … } catch (err) { await snap(page, '99-error'); throw err; }`. Honor `process.env.SCREENSHOT_DIR` for snap output.
-- **Locator preferences in order:** `data-testid` attribute → role + name →
-  id → text → xpath. Prefer Playwright's auto-waiting (`locator.click()`,
-  `locator.fill()`) over explicit waits when possible.
-- **Use the descriptor data when available.** Each `descriptors.ndjson` entry
-  describes the actual DOM target the agent interacted with — pick locators
-  from those `attributes` / `role` / `accessibleName` fields rather than
-  inventing them.
-- **Use the trace's network signals.** Where the unified events show a slow
-  XHR after an action, insert `page.waitForResponse(...)` rather than
-  arbitrary sleeps.
-
-## Output schema
-
-The script must define a Zod schema that mirrors the `# Output` section of
-the task.md provided in context, and validate the extracted data through
-that schema before printing the final `success: true` line.
-
-## Imports / runtime
-
-```typescript
-import { chromium, type Browser, type Page } from "playwright";
-import { execFileSync } from "node:child_process";
-import { join } from "node:path";
-import { z } from "zod";
-import "dotenv/config";
-```
-
-`playwright` and `zod` are already in the scaffolded `package.json`. Do not
-add other dependencies.
-
-## What to emit
-
-Output the complete `.ts` file content. Start with imports, end with a call
-to `main()`. Nothing before the first import, nothing after the last
-closing brace. No markdown fences.
diff --git a/skills/autobrowse/codegen/prompts/stagehand.md b/skills/autobrowse/codegen/prompts/stagehand.md
deleted file mode 100644
index 085f7d3..0000000
--- a/skills/autobrowse/codegen/prompts/stagehand.md
+++ /dev/null
@@ -1,80 +0,0 @@
-# Stagehand codegen — system prompt
-
-You are converting a converged autobrowse trace into a runnable Stagehand
-script. Your output is the **complete contents of a `.ts` file**, nothing
-else: no preamble, no closing remarks, no markdown fences.
-
-This targets **Stagehand v3** (`@browserbasehq/stagehand` 3.x). The v3 API
-differs from older examples — follow the patterns below exactly.
-
-## Constraints
-
-- **Self-contained.** The script must run with `BROWSERBASE_API_KEY` and
-  `ANTHROPIC_API_KEY` in the environment.
-- **Stagehand owns its own Browserbase session.** Construct it with
-  `env: "BROWSERBASE"` and let it create the session — do NOT pre-create a
-  session via the `browse` CLI and do NOT pass `browserbaseSessionID`. The
-  constructor shape is:
-  ```typescript
-  const stagehand = new Stagehand({
-    env: "BROWSERBASE",
-    apiKey: process.env.BROWSERBASE_API_KEY,        // ← BROWSERBASE key (NOT the Anthropic key); project inferred from it
-    model: {                                        // ← LLM config lives here, not at top level
-      modelName: "anthropic/claude-sonnet-4-6",     // ← provider-prefixed; do not invent model names
-      apiKey: process.env.ANTHROPIC_API_KEY,
-    },
-  });
-  await stagehand.init();
-  ```
-  The top-level `apiKey` is the **Browserbase** API key (the project is
-  inferred from it — no `projectId` needed). There is no `browserbaseAPIKey`
-  field and no top-level `modelName` — using the Anthropic key as `apiKey`
-  makes session lookup fail with a 404.
-- **Get the page from the context, not `stagehand.page`.**
-  ```typescript
-  const page = stagehand.context.pages()[0] ?? (await stagehand.context.newPage());
-  await page.goto(url, { waitUntil: "domcontentloaded" });
-  ```
-  `page` supports `goto`, `waitForTimeout`, `waitForSelector`, `screenshot`.
-- **`act` and `extract` are methods on the `stagehand` instance, not the page.**
-  - Actions: `await stagehand.act("click the Continue button")`
-  - Data: `await stagehand.extract("<instruction>", zodSchema)` — pass the Zod
-    schema as the second argument; it returns the parsed object.
-  Prefer natural-language intent strings — the whole point of Stagehand is the
-  LLM picks the locator at runtime.
-- **One natural-language action per `act` call.** Don't compound
-  ("click X and fill Y"); chain individual `act` calls so each is retryable.
-- **Schema-backed extract.** Define Zod schemas mirroring the `# Output`
-  section of task.md and validate before emitting the final `success: true`
-  line.
-- **Use the descriptors as natural-language hints.** Where a descriptor shows
-  `accessibleName: "Continue"`, the corresponding `act` should say
-  `"click the Continue button"`. Specific locators aren't required.
-- **Snap on errors.** Wrap the body in
-  `try { … } catch (err) { await snap(page, '99-error'); … }`, honoring
-  `process.env.SCREENSHOT_DIR`. `snap` should be a no-op when the dir is unset.
-- **Final stdout line is JSON.** `{"success":true,"data":...}` on success,
-  `{"success":false,"error":"..."}` on failure. The runner parses this — emit
-  no other JSON-looking lines after it.
-- **Tear down with `await stagehand.close()` in `finally`.** Since Stagehand
-  created and owns the session, `close()` is the correct teardown — do NOT use
-  `browse cloud sessions update … REQUEST_RELEASE` (that's only for the
-  CDP-attach pattern where you created the session yourself).
-
-## Imports / runtime
-
-```typescript
-import { Stagehand } from "@browserbasehq/stagehand";
-import { join } from "node:path";
-import { z } from "zod";
-import "dotenv/config";
-```
-
-`@browserbasehq/stagehand` and `zod` are already in the scaffolded
-`package.json`. Do not add other dependencies.
-
-## What to emit
-
-Output the complete `.ts` file content. Start with imports, end with a call
-to `main()`. Nothing before the first import, nothing after the last
-closing brace. No markdown fences.
diff --git a/skills/autobrowse/codegen/runners/lib/tsx-runner.mjs b/skills/autobrowse/codegen/runners/lib/tsx-runner.mjs
deleted file mode 100644
index c90407c..0000000
--- a/skills/autobrowse/codegen/runners/lib/tsx-runner.mjs
+++ /dev/null
@@ -1,136 +0,0 @@
-// tsx-runner.mjs — shared logic for codegen target runners that boot a tsx
-// script in a scaffolded output dir and parse its trailing JSON line.
-//
-// Playwright and Stagehand runners (and any future TS target that follows the
-// same {"success":boolean,"data":...} contract) call runTsxTarget with their
-// per-framework tweaks: a label for stderr prefix, extra env (e.g.
-// PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1), and an optional preflight check (e.g.
-// "ANTHROPIC_API_KEY required for Stagehand").
-
-import * as fs from "node:fs";
-import * as path from "node:path";
-import * as crypto from "node:crypto";
-import { spawnSync } from "node:child_process";
-
-export function getArg(name) {
-  const i = process.argv.indexOf(`--${name}`);
-  return i !== -1 && process.argv[i + 1] ? process.argv[i + 1] : null;
-}
-
-// Emit a JSON result line on stdout and exit. Centralized so the contract
-// (single {passed:bool,...} JSON line, exit 0/2) is consistent across runners.
-function emitAndExit(result) {
-  console.log(JSON.stringify(result));
-  process.exit(result.passed ? 0 : 2);
-}
-
-/**
- * Run a tsx target script against a fresh BB session.
- *
- * @param {object} opts
- * @param {string} opts.label                 stderr prefix, e.g. "playwright"
- * @param {Record<string,string>} [opts.extraEnv]  merged into the run's env
- * @param {Record<string,string>} [opts.installEnv] merged into npm install's env
- * @param {() => string|null} [opts.preflight]  return error message to fail fast
- */
-export function runTsxTarget(opts) {
-  const { label, extraEnv = {}, installEnv = {}, preflight } = opts;
-  const outDir = getArg("out-dir");
-  const script = getArg("script");
-
-  if (!outDir || !script) {
-    emitAndExit({ passed: false, error: "runner missing --out-dir or --script" });
-  }
-
-  const scriptPath = path.join(outDir, script);
-  if (!fs.existsSync(scriptPath)) {
-    emitAndExit({ passed: false, error: `script not found at ${scriptPath}` });
-  }
-
-  if (preflight) {
-    const err = preflight();
-    if (err) emitAndExit({ passed: false, error: err });
-  }
-
-  // Install deps when package.json changes. Gating purely on node_modules
-  // existing is wrong when two frameworks share an --out dir: framework #2's
-  // dropScaffold merges its deps into the existing package.json, but the
-  // node_modules from framework #1's install is still missing them. We hash
-  // package.json and compare against a stamp under node_modules/ to detect
-  // that and re-install.
-  const pkgPath = path.join(outDir, "package.json");
-  const stampPath = path.join(outDir, "node_modules", ".codegen-pkg-hash");
-  const pkgHash = fs.existsSync(pkgPath)
-    ? crypto.createHash("sha256").update(fs.readFileSync(pkgPath)).digest("hex")
-    : null;
-  const stampedHash = fs.existsSync(stampPath)
-    ? fs.readFileSync(stampPath, "utf-8").trim()
-    : null;
-  if (pkgHash && pkgHash !== stampedHash) {
-    process.stderr.write(`[runner.${label}] installing deps in ${outDir}\n`);
-    // Always set PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 here, regardless of which
-    // runner we are. In shared --out mode, framework #2 (e.g. stagehand) gets
-    // playwright merged into its package.json by dropScaffold, so even runners
-    // that don't list playwright in installEnv would still trigger its
-    // postinstall and try to fetch hundreds of MB of chromium — exhausting
-    // the 3min install budget. We never need bundled browsers (always CDP).
-    const install = spawnSync("npm", ["install", "--silent", "--no-audit", "--no-fund"], {
-      cwd: outDir,
-      stdio: ["ignore", "inherit", "inherit"],
-      env: { ...process.env, PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD: "1", ...installEnv },
-      timeout: 3 * 60 * 1000,
-    });
-    if (install.status !== 0) {
-      emitAndExit({ passed: false, error: `npm install exited ${install.status}` });
-    }
-    try {
-      fs.mkdirSync(path.dirname(stampPath), { recursive: true });
-      fs.writeFileSync(stampPath, pkgHash);
-    } catch {}
-  }
-
-  // Per-run screenshot dir, exposed to the script via SCREENSHOT_DIR so its
-  // snap() helper can write progress / failure shots somewhere we can find.
-  const screenshotDir = path.join(outDir, "screenshots", `verify-${Date.now()}`);
-  fs.mkdirSync(screenshotDir, { recursive: true });
-
-  process.stderr.write(`[runner.${label}] running ${scriptPath}\n`);
-  const run = spawnSync("npx", ["tsx", script], {
-    cwd: outDir,
-    encoding: "utf-8",
-    stdio: ["ignore", "pipe", "pipe"],
-    env: { ...process.env, ...extraEnv, SCREENSHOT_DIR: screenshotDir },
-    timeout: 5 * 60 * 1000,
-  });
-
-  const stdout = run.stdout ?? "";
-  const stderr = run.stderr ?? "";
-
-  // Parse the script's trailing JSON line — walk backward through lines and
-  // take the last one that parses as JSON with a boolean `success` field.
-  let parsed = null;
-  const lines = stdout.trim().split("\n").filter(Boolean);
-  for (let i = lines.length - 1; i >= 0; i--) {
-    try {
-      const candidate = JSON.parse(lines[i]);
-      if (typeof candidate?.success === "boolean") {
-        parsed = candidate;
-        break;
-      }
-    } catch {}
-  }
-
-  const passed = run.status === 0 && parsed?.success === true;
-  const result = {
-    passed,
-    exit_code: run.status,
-    script_output: parsed,
-    screenshot_dir: screenshotDir,
-    stderr_tail: stderr.slice(-2000),
-  };
-  if (!passed) {
-    result.error = parsed?.error
-      || (run.status !== 0 ? `script exited ${run.status}` : "script did not emit success:true");
-  }
-  emitAndExit(result);
-}
diff --git a/skills/autobrowse/codegen/runners/playwright.mjs b/skills/autobrowse/codegen/runners/playwright.mjs
deleted file mode 100755
index dbc2f4e..0000000
--- a/skills/autobrowse/codegen/runners/playwright.mjs
+++ /dev/null
@@ -1,29 +0,0 @@
-#!/usr/bin/env node
-
-/**
- * playwright.mjs — Runner for the Playwright codegen target.
- *
- * Invoked by codegen.mjs's verify step. Installs the scaffolded deps if
- * needed, spawns `npx tsx <script>` against a fresh BB session, and emits
- * a single {"passed":boolean, ...} JSON line on stdout.
- *
- * Contract:
- *   --out-dir <path>      the scaffolded output dir
- *   --script <basename>   file inside --out-dir to run (e.g. acme.ts)
- *
- * Shared with stagehand.mjs via lib/tsx-runner.mjs — only differences are
- * the label and the PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD trick (so playwright's
- * postinstall doesn't try to fetch chromium; we use connectOverCDP).
- */
-
-import { runTsxTarget } from "./lib/tsx-runner.mjs";
-
-runTsxTarget({
-  label: "playwright",
-  // PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 is required at install time too,
-  // otherwise the playwright postinstall pulls hundreds of MB of browser
-  // binaries that we never use (we always connectOverCDP to a remote BB
-  // session). Set it for both install and run.
-  installEnv: { PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD: "1" },
-  extraEnv: { PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD: "1" },
-});
diff --git a/skills/autobrowse/codegen/runners/stagehand.mjs b/skills/autobrowse/codegen/runners/stagehand.mjs
deleted file mode 100755
index 5e2a3aa..0000000
--- a/skills/autobrowse/codegen/runners/stagehand.mjs
+++ /dev/null
@@ -1,24 +0,0 @@
-#!/usr/bin/env node
-
-/**
- * stagehand.mjs — Runner for the Stagehand codegen target.
- *
- * Same contract and shared logic as playwright.mjs (see lib/tsx-runner.mjs).
- * Differences:
- *   - No PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD trick (Stagehand uses
- *     connectOverCDP without bundling a local chromium).
- *   - Requires ANTHROPIC_API_KEY (or ANTHROPIC_AUTH_TOKEN) — Stagehand's
- *     act/extract are LLM-driven.
- */
-
-import { runTsxTarget } from "./lib/tsx-runner.mjs";
-
-runTsxTarget({
-  label: "stagehand",
-  preflight: () => {
-    if (!process.env.ANTHROPIC_API_KEY && !process.env.ANTHROPIC_AUTH_TOKEN) {
-      return "ANTHROPIC_API_KEY required for Stagehand verify";
-    }
-    return null;
-  },
-});
diff --git a/skills/autobrowse/codegen/scaffolds/playwright/package.json b/skills/autobrowse/codegen/scaffolds/playwright/package.json
deleted file mode 100644
index 081ed66..0000000
--- a/skills/autobrowse/codegen/scaffolds/playwright/package.json
+++ /dev/null
@@ -1,15 +0,0 @@
-{
-  "name": "{{TASK}}-playwright",
-  "version": "0.1.0",
-  "private": true,
-  "type": "module",
-  "scripts": {
-    "start": "tsx {{SCRIPT}}"
-  },
-  "dependencies": {
-    "dotenv": "{{DOTENV_VERSION}}",
-    "playwright": "{{PLAYWRIGHT_VERSION}}",
-    "tsx": "{{TSX_VERSION}}",
-    "zod": "{{ZOD_VERSION}}"
-  }
-}
diff --git a/skills/autobrowse/codegen/scaffolds/playwright/tsconfig.json b/skills/autobrowse/codegen/scaffolds/playwright/tsconfig.json
deleted file mode 100644
index b7b1f75..0000000
--- a/skills/autobrowse/codegen/scaffolds/playwright/tsconfig.json
+++ /dev/null
@@ -1,13 +0,0 @@
-{
-  "compilerOptions": {
-    "target": "ES2022",
-    "module": "ESNext",
-    "moduleResolution": "Bundler",
-    "strict": true,
-    "esModuleInterop": true,
-    "skipLibCheck": true,
-    "noEmit": true,
-    "resolveJsonModule": true
-  },
-  "include": ["*.ts"]
-}
diff --git a/skills/autobrowse/codegen/scaffolds/stagehand/package.json b/skills/autobrowse/codegen/scaffolds/stagehand/package.json
deleted file mode 100644
index bee503c..0000000
--- a/skills/autobrowse/codegen/scaffolds/stagehand/package.json
+++ /dev/null
@@ -1,15 +0,0 @@
-{
-  "name": "{{TASK}}-stagehand",
-  "version": "0.1.0",
-  "private": true,
-  "type": "module",
-  "scripts": {
-    "start": "tsx {{SCRIPT}}"
-  },
-  "dependencies": {
-    "@browserbasehq/stagehand": "{{STAGEHAND_VERSION}}",
-    "dotenv": "{{DOTENV_VERSION}}",
-    "tsx": "{{TSX_VERSION}}",
-    "zod": "{{ZOD_VERSION}}"
-  }
-}
diff --git a/skills/autobrowse/codegen/scaffolds/stagehand/tsconfig.json b/skills/autobrowse/codegen/scaffolds/stagehand/tsconfig.json
deleted file mode 100644
index b7b1f75..0000000
--- a/skills/autobrowse/codegen/scaffolds/stagehand/tsconfig.json
+++ /dev/null
@@ -1,13 +0,0 @@
-{
-  "compilerOptions": {
-    "target": "ES2022",
-    "module": "ESNext",
-    "moduleResolution": "Bundler",
-    "strict": true,
-    "esModuleInterop": true,
-    "skipLibCheck": true,
-    "noEmit": true,
-    "resolveJsonModule": true
-  },
-  "include": ["*.ts"]
-}
diff --git a/skills/autobrowse/references/codegen/playwright.md b/skills/autobrowse/references/codegen/playwright.md
new file mode 100644
index 0000000..8aa5782
--- /dev/null
+++ b/skills/autobrowse/references/codegen/playwright.md
@@ -0,0 +1,128 @@
+# Playwright codegen reference
+
+Spec for the `playwright.ts` file an outer agent writes when codegenning a
+runnable script from a converged autobrowse trace. The outer agent should
+read this file, draft the script with the `Write` tool, then verify it with
+`Bash` (`npm install && npx tsx playwright.ts`) against a fresh Browserbase
+session — iterating on failure using its own judgment.
+
+The companion file is `references/playwright-cdp-bridge.md`, which has the
+canonical create-session / connectOverCDP / release dance. Read that too.
+
+## Hard constraints
+
+- **Self-contained.** Runs with only `BROWSERBASE_API_KEY` in the env. No
+  reliance on autobrowse state, no reading from workspace files.
+- **CDP attach, never `chromium.launch()`.** Follow the cdp-bridge reference
+  verbatim for create-session / `connectOverCDP` / release.
+- **No `browser.close()`.** Release the session via
+  `browse cloud sessions update <id> --status REQUEST_RELEASE` in `finally`.
+  `browser.close()` on a `connectOverCDP` attachment tears down the remote
+  session prematurely.
+- **Final stdout line is JSON.** Emit `{"success":true,"data":...}` on
+  success or `{"success":false,"error":"..."}` on failure as the last line
+  on stdout. The verify command parses the trailing JSON line — don't emit
+  any other JSON-looking lines after it.
+- **Snap on errors.** Wrap `main()` in
+  `try { … } catch (err) { await snap(page, '99-error'); throw err; }`.
+  `snap` honors `process.env.SCREENSHOT_DIR` and is a no-op when unset.
+- **Locator priority:** `data-testid` → role + accessible name → id → text
+  → xpath. Prefer Playwright's auto-waiting (`locator.click()`,
+  `locator.fill()`) over explicit sleeps.
+- **Use the descriptor data when available.** Each `descriptors.ndjson`
+  entry from the trace describes the actual DOM target the agent interacted
+  with — pick locators from those `attributes` / `role` / `accessibleName`
+  fields rather than inventing them.
+- **Use the trace's network signals.** Where the unified events show a slow
+  XHR after an action, insert `page.waitForResponse(...)` rather than
+  arbitrary sleeps.
+
+## Output schema
+
+Define a Zod schema mirroring the `# Output` section of `task.md`, and
+validate the extracted data through it before printing the final
+`success: true` line.
+
+## Imports
+
+```typescript
+import { chromium, type Browser, type Page } from "playwright";
+import { execFileSync } from "node:child_process";
+import { join } from "node:path";
+import { z } from "zod";
+import "dotenv/config";
+```
+
+Only `playwright`, `zod`, `dotenv`, and `tsx` should appear in
+`package.json`. Don't add other runtime deps.
+
+## Scaffold
+
+Write `package.json` alongside `playwright.ts` (in the same directory):
+
+```json
+{
+  "name": "<task>-playwright",
+  "version": "0.1.0",
+  "private": true,
+  "type": "module",
+  "scripts": { "start": "tsx playwright.ts" },
+  "dependencies": {
+    "dotenv": "16.4.5",
+    "playwright": "1.50.0",
+    "tsx": "4.22.3",
+    "zod": "4.4.3"
+  }
+}
+```
+
+And `tsconfig.json`:
+
+```json
+{
+  "compilerOptions": {
+    "target": "ES2022",
+    "module": "ES2022",
+    "moduleResolution": "Bundler",
+    "strict": true,
+    "esModuleInterop": true,
+    "skipLibCheck": true
+  }
+}
+```
+
+**Install with `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1`** — we always connect
+over CDP to a remote Browserbase session, so the bundled chromium download
+is pure waste (and the sandbox's network allowlist blocks the CDN anyway).
+
+```bash
+PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 npm install --silent --no-audit --no-fund
+PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 npx tsx playwright.ts
+```
+
+## Verify contract
+
+Run the script against a fresh Browserbase session and read the trailing
+JSON line on stdout. Pass if `success === true`; fail otherwise. On
+failure, feed the stderr tail back into your next attempt and iterate.
+
+## When the workflow is HTTP-only
+
+If the trace shows the task can be accomplished via HTTP requests with no
+DOM interaction (`recommended_method` is `api`, `fetch`, or `url-param`),
+use Playwright's `request` API instead of opening a browser:
+
+```typescript
+import { request, type APIRequestContext } from "playwright";
+
+async function main() {
+  const ctx = await request.newContext({
+    extraHTTPHeaders: { "user-agent": "..." },
+  });
+  const res = await ctx.get("https://example.com/api/foo");
+  // ... parse, validate via Zod, emit success line ...
+}
+```
+
+You still emit the same trailing JSON success/failure line. No
+`connectOverCDP`, no session, no `snap`.
diff --git a/skills/autobrowse/references/codegen/stagehand.md b/skills/autobrowse/references/codegen/stagehand.md
new file mode 100644
index 0000000..0cc29e4
--- /dev/null
+++ b/skills/autobrowse/references/codegen/stagehand.md
@@ -0,0 +1,145 @@
+# Stagehand codegen reference
+
+Spec for the `stagehand.ts` file an outer agent writes when codegenning a
+runnable script from a converged autobrowse trace. The outer agent should
+read this file, draft the script with the `Write` tool, then verify it with
+`Bash` (`npm install && npx tsx stagehand.ts`) against a fresh Browserbase
+session — iterating on failure using its own judgment.
+
+This targets **Stagehand v3** (`@browserbasehq/stagehand` 3.x). The v3 API
+differs from older examples — follow the patterns below exactly.
+
+## When NOT to write a Stagehand script
+
+Stagehand fundamentally needs a browser session, so it doesn't fit
+HTTP-only workflows. If `recommended_method` in metadata.json is `api`,
+`mcp`, `fetch`, or `url-param`, skip Stagehand and ship only the Playwright
+variant. Same for `cli`.
+
+## Hard constraints
+
+- **Self-contained.** Runs with `BROWSERBASE_API_KEY` and `ANTHROPIC_API_KEY`
+  in the env.
+- **Stagehand owns its own Browserbase session.** Construct it with
+  `env: "BROWSERBASE"` and let it create the session — do NOT pre-create a
+  session via the `browse` CLI and do NOT pass `browserbaseSessionID`.
+- **Top-level `apiKey` is the Browserbase key, not the Anthropic key.** The
+  project is inferred from it. There is no `browserbaseAPIKey` field. Using
+  the Anthropic key as `apiKey` makes session lookup fail with a 404.
+- **Get the page from `stagehand.context`, not `stagehand.page`.**
+- **`act` and `extract` are methods on the `stagehand` instance, not the page.**
+- **One natural-language action per `act` call.** Don't compound
+  ("click X and fill Y"); chain individual `act` calls so each is retryable.
+- **Schema-backed extract.** Define Zod schemas mirroring the `# Output`
+  section of task.md and validate before emitting the final `success: true`
+  line.
+- **Tear down with `await stagehand.close()` in `finally`.** Since Stagehand
+  created and owns the session, `close()` is the correct teardown — do NOT
+  use `browse cloud sessions update … REQUEST_RELEASE` (that's only for the
+  CDP-attach pattern in `playwright.ts`).
+- **Snap on errors.** Wrap the body in
+  `try { … } catch (err) { await snap(page, '99-error'); throw err; }`,
+  honoring `process.env.SCREENSHOT_DIR`. `snap` is a no-op when the dir is
+  unset.
+- **Final stdout line is JSON.** Emit `{"success":true,"data":...}` on
+  success or `{"success":false,"error":"..."}` on failure as the last line
+  on stdout.
+
+## Constructor shape
+
+```typescript
+const stagehand = new Stagehand({
+  env: "BROWSERBASE",
+  apiKey: process.env.BROWSERBASE_API_KEY,        // ← BROWSERBASE key; project inferred from it
+  model: {                                        // ← LLM config lives here, not at top level
+    modelName: "anthropic/claude-sonnet-4-6",     // ← provider-prefixed; do not invent model names
+    apiKey: process.env.ANTHROPIC_API_KEY,
+  },
+});
+await stagehand.init();
+const page = stagehand.context.pages()[0] ?? (await stagehand.context.newPage());
+await page.goto(url, { waitUntil: "domcontentloaded" });
+
+// Actions:
+await stagehand.act("click the Continue button");
+
+// Data:
+const data = await stagehand.extract("<instruction>", zodSchema);
+```
+
+Use the descriptors from the trace as natural-language hints: where a
+descriptor shows `accessibleName: "Continue"`, the corresponding `act`
+should say `"click the Continue button"`. Specific locators aren't
+required — Stagehand picks them at runtime.
+
+## Imports
+
+```typescript
+import { Stagehand } from "@browserbasehq/stagehand";
+import { join } from "node:path";
+import { z } from "zod";
+import "dotenv/config";
+```
+
+Only `@browserbasehq/stagehand`, `zod`, `dotenv`, and `tsx` should appear
+in `package.json`. Don't add other runtime deps.
+
+## Scaffold
+
+Write `package.json` alongside `stagehand.ts` (in the same directory):
+
+```json
+{
+  "name": "<task>-stagehand",
+  "version": "0.1.0",
+  "private": true,
+  "type": "module",
+  "scripts": { "start": "tsx stagehand.ts" },
+  "dependencies": {
+    "@browserbasehq/stagehand": "3.4.0",
+    "dotenv": "16.4.5",
+    "tsx": "4.22.3",
+    "zod": "4.4.3"
+  }
+}
+```
+
+If `playwright.ts` is being written into the same directory, **merge** the
+dependencies rather than overwriting the existing `package.json`; otherwise
+whichever framework wrote its deps first loses them. Use a single combined
+`package.json` with both `playwright` and `@browserbasehq/stagehand` listed.
+
+And `tsconfig.json`:
+
+```json
+{
+  "compilerOptions": {
+    "target": "ES2022",
+    "module": "ES2022",
+    "moduleResolution": "Bundler",
+    "strict": true,
+    "esModuleInterop": true,
+    "skipLibCheck": true
+  }
+}
+```
+
+**Install with `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1`** if Playwright is also
+in the same `package.json` (its postinstall would otherwise fetch chromium
+binaries the sandbox can't reach).
+
+```bash
+PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 npm install --silent --no-audit --no-fund
+npx tsx stagehand.ts
+```
+
+## Verify contract
+
+Run the script against a fresh Browserbase session and read the trailing
+JSON line on stdout. Pass if `success === true`; fail otherwise. On
+failure, feed the stderr tail back into your next attempt and iterate.
+
+If verify still fails after 2-3 retries, **delete `stagehand.ts`** before
+the upload — the upload script globs for `playwright.ts stagehand.ts` and
+will ship whatever is on disk. Shipping a broken Stagehand variant is
+worse than shipping just Playwright.
diff --git a/skills/autobrowse/scripts/codegen.mjs b/skills/autobrowse/scripts/codegen.mjs
deleted file mode 100755
index e4970a3..0000000
--- a/skills/autobrowse/scripts/codegen.mjs
+++ /dev/null
@@ -1,515 +0,0 @@
-#!/usr/bin/env node
-
-/**
- * codegen.mjs — Convert a converged autobrowse trace into a runnable script in
- * one or more frameworks (Playwright, Stagehand, …).
- *
- * Pipeline per framework:
- *   1. Compose the context: task.md + unified-events.jsonl (when present) +
- *      descriptors.ndjson (when present) + strategy.md + the framework's
- *      cdp-bridge reference doc.
- *   2. Compute a cache key over (framework, prompt-template, task, trace,
- *      descriptors). Cache hit short-circuits with zero LLM cost.
- *   3. Single Anthropic completion against the framework's prompt template.
- *      Emit `<task>.<ext>` to the output dir.
- *   4. Drop the framework's scaffold files (package.json, tsconfig, …).
- *   5. If --verify: invoke the framework's runner against a fresh Browserbase
- *      session. On failure, feed the error back into a rewrite call up to
- *      --max-retries times.
- *
- * One JSON status line per framework on stdout. Non-zero exit if any selected
- * framework's final state is fail.
- *
- * Usage:
- *   node scripts/codegen.mjs --task <name> [options]
- *
- * Options:
- *   --task <name>                  task name under <workspace>/tasks/ (required)
- *   --workspace <dir>              default ./autobrowse
- *   --run <id>                     default: latest run-NNN with success: true
- *   --frameworks <a,b,...>         default: playwright
- *   --verify | --no-verify         default: --verify
- *   --max-retries <N>              rewrite-on-verify-failure cap (default: 2)
- *   --cache-dir <dir>              default <workspace>/codegen-cache
- *   --out <dir>                    default <workspace>/tasks/<name>/<framework>
- *   --prompt-template <path>       custom framework prompt (pair with --frameworks custom)
- *   --force                        bust cache
- *   --dry-run                      estimate cost without LLM call
- *   --cache-only                   error if cache miss (no LLM call)
- *   --model <name>                 override Claude model
- *   --help
- */
-
-import "dotenv/config";
-import Anthropic from "@anthropic-ai/sdk";
-import * as fs from "node:fs";
-import * as path from "node:path";
-import { execFileSync, spawnSync } from "node:child_process";
-import { fileURLToPath } from "node:url";
-import crypto from "node:crypto";
-
-const __dirname = path.dirname(fileURLToPath(import.meta.url));
-const SKILL_DIR = path.resolve(__dirname, "..");
-const PROMPT_TEMPLATE_VERSION = "2"; // bump to invalidate cache after prompt edits or scaffold/runner contract changes
-
-const DEFAULT_MODEL = "claude-sonnet-4-6";
-const DEFAULT_MAX_TOKENS = 8192;
-
-// ── CLI ────────────────────────────────────────────────────────────
-
-function getArg(name, fallback) {
-  const i = process.argv.indexOf(`--${name}`);
-  return i !== -1 && process.argv[i + 1] ? process.argv[i + 1] : fallback;
-}
-const hasFlag = (n) => process.argv.includes(`--${n}`);
-
-if (hasFlag("help") || hasFlag("h")) {
-  console.log(`autobrowse codegen — produce runnable scripts from a converged trace
-
-Usage: node scripts/codegen.mjs --task <name> [options]
-
-Options:
-  --task <name>                  task name under <workspace>/tasks/ (required)
-  --workspace <dir>              default: ./autobrowse
-  --run <id>                     specific run-NNN (default: newest passing)
-  --frameworks <a,b,...>         comma list; default: playwright
-                                 builtins: playwright, stagehand
-  --verify | --no-verify         run the script in a fresh BB session (default: --verify)
-  --max-retries <N>              cap rewrite-on-verify-fail loop (default: 2)
-  --cache-dir <dir>              default: <workspace>/codegen-cache
-  --out <dir>                    default: <workspace>/tasks/<name>/<framework>
-  --prompt-template <path>       custom prompt template (pair with --frameworks=custom)
-  --force                        ignore cache, regenerate
-  --dry-run                      estimate cost; don't call the LLM
-  --cache-only                   error if cache miss
-  --model <name>                 default: ${DEFAULT_MODEL}
-
-Env:
-  ANTHROPIC_API_KEY              required for LLM call
-  BROWSERBASE_API_KEY            required for --verify
-
-Exits 0 if all selected frameworks ended in pass (or --no-verify), 2 if any
-failed, 1 on harness error.`);
-  process.exit(0);
-}
-
-const TASK = getArg("task");
-if (!TASK) {
-  console.error("ERROR: --task <name> is required. Pass --help for usage.");
-  process.exit(1);
-}
-const WORKSPACE = path.resolve(getArg("workspace", "autobrowse"));
-const FORCED_RUN = getArg("run");
-const FRAMEWORKS = getArg("frameworks", "playwright").split(",").map((s) => s.trim()).filter(Boolean);
-const VERIFY = !hasFlag("no-verify");
-const MAX_RETRIES = parseInt(getArg("max-retries", "2"), 10);
-const CACHE_DIR = path.resolve(getArg("cache-dir", path.join(WORKSPACE, "codegen-cache")));
-const OUT_OVERRIDE = getArg("out");
-const PROMPT_TEMPLATE_OVERRIDE = getArg("prompt-template");
-const FORCE = hasFlag("force");
-const DRY_RUN = hasFlag("dry-run");
-const CACHE_ONLY = hasFlag("cache-only");
-const MODEL = getArg("model", DEFAULT_MODEL);
-
-// ── Inputs ─────────────────────────────────────────────────────────
-
-const taskDir = path.join(WORKSPACE, "tasks", TASK);
-const tracesDir = path.join(WORKSPACE, "traces", TASK);
-const taskFile = path.join(taskDir, "task.md");
-
-for (const [label, file] of [["task.md", taskFile]]) {
-  if (!fs.existsSync(file)) {
-    console.error(`ERROR: ${label} not found at ${file}. Run autobrowse first.`);
-    process.exit(1);
-  }
-}
-
-function pickRun() {
-  if (FORCED_RUN) {
-    // --run was passed; still confirm the directory exists. Without this we'd
-    // happily call codegen with empty trace/events/descriptors and the LLM
-    // would invent a script from just task.md + strategy.md, while logs
-    // still report the forced run id as if it were a real input.
-    const forcedDir = path.join(tracesDir, FORCED_RUN);
-    if (!fs.existsSync(forcedDir)) return null;
-    return FORCED_RUN;
-  }
-  if (!fs.existsSync(tracesDir)) return null;
-  const runs = fs.readdirSync(tracesDir)
-    .filter((d) => /^run-\d+$/.test(d))
-    .sort()
-    .reverse();
-  for (const r of runs) {
-    const summary = path.join(tracesDir, r, "summary.md");
-    if (!fs.existsSync(summary)) continue;
-    const text = fs.readFileSync(summary, "utf-8");
-    if (/success:\s*true/.test(text) || /"success"\s*:\s*true/.test(text)) return r;
-  }
-  return null;
-}
-
-const RUN_ID = pickRun();
-if (!RUN_ID) {
-  if (FORCED_RUN) {
-    console.error(`ERROR: --run ${FORCED_RUN} not found at ${path.join(tracesDir, FORCED_RUN)}.`);
-  } else {
-    console.error(`ERROR: no passing run found under ${tracesDir}. Pass --run <id> to force, or run autobrowse first.`);
-  }
-  process.exit(1);
-}
-const runDir = path.join(tracesDir, RUN_ID);
-
-// Try multiple candidate paths for each input — autobrowse layouts have
-// shifted over time and we want this to be robust to both modern and legacy.
-function readFirstExisting(...candidates) {
-  for (const p of candidates) {
-    if (p && fs.existsSync(p)) return { path: p, content: fs.readFileSync(p, "utf-8") };
-  }
-  return null;
-}
-
-const taskMd = fs.readFileSync(taskFile, "utf-8");
-const strategyMd = readFirstExisting(path.join(taskDir, "strategy.md"))?.content || "";
-const traceJson = readFirstExisting(path.join(runDir, "trace.json"))?.content || "";
-const unifiedEvents = readFirstExisting(path.join(runDir, "unified-events.jsonl"))?.content || "";
-const descriptors = readFirstExisting(
-  path.join(runDir, ".o11y", RUN_ID, "cdp", "descriptors.ndjson"),
-  path.join(runDir, "cdp", "descriptors.ndjson"),
-)?.content || "";
-
-// ── Framework registry ────────────────────────────────────────────
-
-const CODEGEN_DIR = path.join(SKILL_DIR, "codegen");
-const REFERENCES_DIR = path.join(SKILL_DIR, "references");
-
-function frameworkConfig(framework) {
-  const promptPath = PROMPT_TEMPLATE_OVERRIDE && framework === "custom"
-    ? path.resolve(PROMPT_TEMPLATE_OVERRIDE)
-    : path.join(CODEGEN_DIR, "prompts", `${framework}.md`);
-  const scaffoldDir = path.join(CODEGEN_DIR, "scaffolds", framework);
-  const runnerPath = path.join(CODEGEN_DIR, "runners", `${framework}.mjs`);
-  const extByFramework = { playwright: "ts", stagehand: "ts", puppeteer: "js", selenium: "py" };
-  const ext = extByFramework[framework] || "ts";
-  return { promptPath, scaffoldDir, runnerPath, ext };
-}
-
-// ── Context builder ───────────────────────────────────────────────
-
-// Trim a stringified blob to a budget while keeping head + tail.
-function clip(text, maxBytes) {
-  if (text.length <= maxBytes) return text;
-  const head = Math.floor(maxBytes * 0.7);
-  const tail = maxBytes - head - 64;
-  return text.slice(0, head) + `\n\n…[truncated ${text.length - head - tail} bytes]…\n\n` + text.slice(-tail);
-}
-
-function buildContext({ promptTemplate, cdpBridgeDoc, previousAttempt, verifyFailure }) {
-  const parts = [];
-  parts.push("# Task\n\n" + taskMd.trim());
-  if (strategyMd.trim()) parts.push("# Strategy notes\n\n" + strategyMd.trim());
-  if (cdpBridgeDoc) parts.push("# Reference: Playwright ↔ Browserbase bridge\n\n" + cdpBridgeDoc.trim());
-  if (unifiedEvents.trim()) {
-    parts.push("# Unified events (agent + browser, time-ordered)\n\n```\n" + clip(unifiedEvents, 32_000) + "\n```");
-  } else if (traceJson.trim()) {
-    parts.push("# Trace (agent turns)\n\n```json\n" + clip(traceJson, 32_000) + "\n```");
-  }
-  if (descriptors.trim()) {
-    parts.push("# Descriptors (per-command DOM target)\n\n```\n" + clip(descriptors, 16_000) + "\n```");
-  }
-  if (previousAttempt && verifyFailure) {
-    parts.push(
-      "# Previous attempt and the verify failure\n\nYour previous attempt was:\n\n```\n" +
-      clip(previousAttempt, 12_000) +
-      "\n```\n\nIt failed verification with:\n\n```\n" +
-      clip(verifyFailure, 4_000) +
-      "\n```\n\nFix the issue and emit a complete corrected script.",
-    );
-  }
-  return promptTemplate.trim() + "\n\n" + parts.join("\n\n");
-}
-
-// ── Cache ─────────────────────────────────────────────────────────
-
-function hashContent(s) {
-  return crypto.createHash("sha256").update(s).digest("hex").slice(0, 16);
-}
-function cacheKey(framework, promptTemplate) {
-  return hashContent([
-    "v" + PROMPT_TEMPLATE_VERSION,
-    framework,
-    hashContent(promptTemplate),
-    hashContent(taskMd),
-    hashContent(traceJson),
-    hashContent(unifiedEvents),
-    hashContent(descriptors),
-    hashContent(strategyMd),
-  ].join("|"));
-}
-function readCache(key) {
-  const p = path.join(CACHE_DIR, `${key}.txt`);
-  return fs.existsSync(p) ? fs.readFileSync(p, "utf-8") : null;
-}
-function writeCache(key, content) {
-  fs.mkdirSync(CACHE_DIR, { recursive: true });
-  fs.writeFileSync(path.join(CACHE_DIR, `${key}.txt`), content);
-}
-
-// ── LLM call ──────────────────────────────────────────────────────
-
-let _anthropic = null;
-function anthropic() {
-  if (!_anthropic) {
-    if (!process.env.ANTHROPIC_API_KEY && !process.env.ANTHROPIC_AUTH_TOKEN) {
-      throw new Error("ANTHROPIC_API_KEY (or ANTHROPIC_AUTH_TOKEN) is required for codegen.");
-    }
-    _anthropic = new Anthropic();
-  }
-  return _anthropic;
-}
-
-async function callLlm(systemPrompt, userMessage) {
-  const res = await anthropic().messages.create({
-    model: MODEL,
-    max_tokens: DEFAULT_MAX_TOKENS,
-    system: systemPrompt,
-    messages: [{ role: "user", content: userMessage }],
-  });
-  const text = res.content
-    .filter((b) => b.type === "text")
-    .map((b) => b.text)
-    .join("\n");
-  // The agent might emit fences anyway; strip a single outer code block.
-  const fenced = text.match(/^```[\w-]*\n([\s\S]*?)\n```\s*$/);
-  const code = fenced ? fenced[1] : text.trim();
-  const cost = (res.usage?.input_tokens ?? 0) * 3e-6 + (res.usage?.output_tokens ?? 0) * 15e-6;
-  return { code, cost, tokens: res.usage };
-}
-
-// ── Scaffold + write output ───────────────────────────────────────
-
-// Scaffold version pins. Each framework's scaffold/package.json references
-// these via {{PLAYWRIGHT_VERSION}} / {{STAGEHAND_VERSION}} / etc. so callers
-// can canary a new release without forking — set the corresponding env var.
-// Loose semver guard rejects shell-injection shapes before they hit npm.
-const VERSION_RE = /^\d+\.\d+\.\d+(?:-[A-Za-z0-9.-]+)?$/;
-function resolveVersion(envName, fallback) {
-  const raw = process.env[envName];
-  if (!raw) return fallback;
-  if (!VERSION_RE.test(raw)) {
-    throw new Error(`${envName}="${raw}" is not a valid X.Y.Z[-tag] version`);
-  }
-  return raw;
-}
-const SCAFFOLD_VERSIONS = {
-  PLAYWRIGHT_VERSION: resolveVersion("PLAYWRIGHT_VERSION", "1.50.0"),
-  STAGEHAND_VERSION: resolveVersion("STAGEHAND_VERSION", "3.4.0"),
-  TSX_VERSION: resolveVersion("TSX_VERSION", "4.22.3"),
-  ZOD_VERSION: resolveVersion("ZOD_VERSION", "4.4.3"),
-  DOTENV_VERSION: resolveVersion("DOTENV_VERSION", "16.4.5"),
-};
-
-function templateInterpolate(content, vars) {
-  return Object.entries(vars).reduce(
-    (acc, [k, v]) => acc.replaceAll(`{{${k}}}`, v),
-    content,
-  );
-}
-
-function dropScaffold(scaffoldDir, outDir, taskName, scriptBasename) {
-  if (!fs.existsSync(scaffoldDir)) return;
-  // Two distinct template vars: TASK is the slug (used in package name),
-  // SCRIPT is the actual filename (used in the start script). They diverge
-  // in --out mode where files are named <framework>.ts but TASK is the
-  // task slug — without SCRIPT, `npm start` would invoke a missing file.
-  const vars = { TASK: taskName, SCRIPT: scriptBasename, ...SCAFFOLD_VERSIONS };
-  for (const entry of fs.readdirSync(scaffoldDir)) {
-    const src = path.join(scaffoldDir, entry);
-    const dst = path.join(outDir, entry);
-    const content = templateInterpolate(fs.readFileSync(src, "utf-8"), vars);
-    // Special-case package.json: when --out is shared across frameworks (e.g.
-    // browse.sh passes one dir for playwright+stagehand), the first framework
-    // writes its package.json and the second must MERGE its dependencies in,
-    // not skip. Otherwise the second framework's `node_modules` lacks its own
-    // runtime deps (e.g. @browserbasehq/stagehand) and verify can never pass.
-    if (entry === "package.json" && fs.existsSync(dst)) {
-      try {
-        const existing = JSON.parse(fs.readFileSync(dst, "utf-8"));
-        const incoming = JSON.parse(content);
-        existing.dependencies = {
-          ...(existing.dependencies || {}),
-          ...(incoming.dependencies || {}),
-        };
-        existing.devDependencies = {
-          ...(existing.devDependencies || {}),
-          ...(incoming.devDependencies || {}),
-        };
-        fs.writeFileSync(dst, JSON.stringify(existing, null, 2) + "\n");
-        continue;
-      } catch {
-        // Fall through to never-overwrite policy if either side is malformed.
-      }
-    }
-    if (fs.existsSync(dst)) continue; // never overwrite a user's file
-    fs.writeFileSync(dst, content);
-  }
-}
-
-// ── Verify ────────────────────────────────────────────────────────
-
-function verify(framework, outDir, scriptBasename) {
-  const { runnerPath } = frameworkConfig(framework);
-  if (!fs.existsSync(runnerPath)) {
-    return { passed: false, error: `no runner for framework "${framework}" at ${runnerPath}`, runner_missing: true };
-  }
-  // The parent timeout must exceed the runner's worst case: tsx-runner allows
-  // up to 3min for npm install + 5min for the tsx run = 8min, plus slack for
-  // process startup and the trailing-JSON parse. 10min keeps us safely above
-  // that so a healthy slow run isn't killed mid-flight.
-  const res = spawnSync("node", [runnerPath, "--out-dir", outDir, "--script", scriptBasename], {
-    encoding: "utf-8",
-    stdio: ["ignore", "pipe", "pipe"],
-    env: process.env,
-    timeout: 10 * 60 * 1000,
-  });
-  const stdout = res.stdout || "";
-  const stderr = res.stderr || "";
-  // Runners must emit a final JSON line: {"passed":true,...} or {"passed":false,...}
-  const lastLine = stdout.trim().split("\n").pop() || "";
-  let parsed = null;
-  try { parsed = JSON.parse(lastLine); } catch {}
-  if (parsed && typeof parsed.passed === "boolean") {
-    return { ...parsed, stdout, stderr };
-  }
-  return { passed: false, error: `runner did not emit a {passed:boolean} JSON line; exit=${res.status}`, stdout, stderr };
-}
-
-// ── Per-framework pipeline ────────────────────────────────────────
-
-async function generateOne(framework) {
-  const cfg = frameworkConfig(framework);
-  if (!fs.existsSync(cfg.promptPath)) {
-    return { framework, passed: false, error: `no prompt template for "${framework}" at ${cfg.promptPath}` };
-  }
-  const promptTemplate = fs.readFileSync(cfg.promptPath, "utf-8");
-  const cdpBridgeDoc = fs.existsSync(path.join(REFERENCES_DIR, "playwright-cdp-bridge.md"))
-    ? fs.readFileSync(path.join(REFERENCES_DIR, "playwright-cdp-bridge.md"), "utf-8")
-    : "";
-
-  // Filename + outDir convention:
-  //  - default mode (--out unset): per-framework subdir, file named after the
-  //    task, so the dir feels like a standalone project — e.g.
-  //    tasks/<task>/playwright/<task>.ts  with its own package.json.
-  //  - --out mode: caller is flattening into someone else's tree (e.g.
-  //    browse.sh's /tmp/skill/{domain}/{task}/), so we use the framework
-  //    name as the filename — playwright.ts + stagehand.ts in the same dir,
-  //    no collision.
-  const outDir = OUT_OVERRIDE ? path.resolve(OUT_OVERRIDE) : path.join(taskDir, framework);
-  const scriptBasename = OUT_OVERRIDE ? `${framework}.${cfg.ext}` : `${TASK}.${cfg.ext}`;
-  fs.mkdirSync(outDir, { recursive: true });
-  const scriptPath = path.join(outDir, scriptBasename);
-
-  // Cache lookup
-  const key = cacheKey(framework, promptTemplate);
-  let cached = !FORCE ? readCache(key) : null;
-  if (CACHE_ONLY && !cached) {
-    return { framework, passed: false, error: `--cache-only set but no cached output for key ${key}` };
-  }
-
-  if (DRY_RUN) {
-    const ctx = buildContext({ promptTemplate, cdpBridgeDoc });
-    const bytes = ctx.length;
-    const estCost = (bytes / 4) * 3e-6; // ~4 chars/token, $3/M in
-    return { framework, dryRun: true, prompt_bytes: bytes, estimated_cost_usd: Number(estCost.toFixed(4)) };
-  }
-
-  // `attempts` counts emitted-script-versions. Cached and uncached both start
-  // at 1 (the script-on-disk is one version, whether the LLM just wrote it or
-  // we restored it from cache). The retry loop below then increments per
-  // rewrite, bounded by --max-retries. Initializing to 0 on a cache hit gave
-  // cached runs one extra rewrite vs uncached — caught by Bugbot.
-  let code, cost = 0, attempts = 1;
-  if (cached) {
-    code = cached;
-  } else {
-    const ctx = buildContext({ promptTemplate, cdpBridgeDoc });
-    const { code: c, cost: k } = await callLlm(
-      "You are an expert browser-automation engineer. Output ONLY the contents of the script file — no preamble, no explanation, no markdown fences. The script must be runnable as-is.",
-      ctx,
-    );
-    code = c;
-    cost += k;
-    writeCache(key, code);
-  }
-
-  fs.writeFileSync(scriptPath, code);
-  dropScaffold(cfg.scaffoldDir, outDir, TASK, scriptBasename);
-
-  if (!VERIFY) {
-    return { framework, passed: true, scriptPath, cached: !!cached, verify_skipped: true, cost_usd: cost };
-  }
-
-  // Verify loop with rewrite-on-failure
-  let lastVerify = verify(framework, outDir, scriptBasename);
-  while (!lastVerify.passed && attempts < MAX_RETRIES + 1) {
-    if (lastVerify.runner_missing) break;
-    // --cache-only forbids ANY LLM call, including the rewrite path. Without
-    // this guard a cached script that fails verify would still burn quota
-    // through the rewrite loop, contradicting the documented "no LLM call"
-    // CI behavior.
-    if (CACHE_ONLY) break;
-    attempts++;
-    const previousCode = code;
-    const failureContext =
-      (lastVerify.error || "") +
-      "\n\nstderr:\n" + (lastVerify.stderr || "").slice(-2000) +
-      "\nstdout:\n" + (lastVerify.stdout || "").slice(-2000);
-    const ctx = buildContext({
-      promptTemplate,
-      cdpBridgeDoc,
-      previousAttempt: previousCode,
-      verifyFailure: failureContext,
-    });
-    const { code: c, cost: k } = await callLlm(
-      "You are an expert browser-automation engineer. Output ONLY the corrected script file — no preamble, no explanation, no markdown fences.",
-      ctx,
-    );
-    code = c;
-    cost += k;
-    fs.writeFileSync(scriptPath, code);
-    writeCache(key, code); // overwrite cache with the latest attempt
-    lastVerify = verify(framework, outDir, scriptBasename);
-  }
-
-  return {
-    framework,
-    passed: lastVerify.passed,
-    scriptPath,
-    cached: !!cached && cost === 0,
-    verify_attempts: attempts,
-    last_error: lastVerify.passed ? null : (lastVerify.error || lastVerify.stderr?.slice(-200) || null),
-    cost_usd: Number(cost.toFixed(4)),
-  };
-}
-
-// ── Main ──────────────────────────────────────────────────────────
-
-async function main() {
-  console.error(`[codegen] task=${TASK} run=${RUN_ID} frameworks=[${FRAMEWORKS.join(",")}] verify=${VERIFY}`);
-  let anyFailed = false;
-  for (const framework of FRAMEWORKS) {
-    try {
-      const result = await generateOne(framework);
-      console.log(JSON.stringify(result));
-      if (result.passed === false) anyFailed = true;
-    } catch (err) {
-      console.log(JSON.stringify({ framework, passed: false, error: err.message }));
-      anyFailed = true;
-    }
-  }
-  process.exit(anyFailed ? 2 : 0);
-}
-
-main().catch((err) => {
-  console.error("FATAL:", err.stack || err.message);
-  process.exit(1);
-});

From 7dbe85496167f36a8c768867b45846e217124c18 Mon Sep 17 00:00:00 2001
From: ziruihao <ziray.hao@gmail.com>
Date: Fri, 5 Jun 2026 15:40:52 -0700
Subject: [PATCH 2/2] docs(autobrowse): clarify script filename across both
 output-dir shapes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous SKILL.md step 2 mixed two valid output-dir layouts in one
example — `tasks/<task>/<framework>/<task>.ts` (per-framework subdir,
script named after task) and "flattened upload root" (script named after
framework) — but step 4's verify command only mentioned
`<framework>.ts`. An agent following the per-framework-subdir example
would `Write` `<task>.ts` and then `tsx <framework>.ts` against a file
that doesn't exist.

Make the two shapes explicit, pick one per task, and key steps 4/5/7's
filenames off step 2's choice rather than baking one of the two
conventions into every step. Also fix the trace path to
`run-NNN/{trace.json,unified-events.jsonl}` (matching what
unify-trace.mjs actually writes) instead of just `latest/` — autobrowse
maintains the `latest` symlink but the explicit zero-padded form is
what the rest of the docs use.
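
For illustration only, a minimal sketch of the rule "key the filename off
the layout picked in step 2". The `Layout` type and `scriptBasename` helper
are hypothetical names and are not part of the skill; the naming rule
itself mirrors what the deleted codegen.mjs did with and without `--out`.

```typescript
// Hypothetical helper (not shipped in the skill): derive the script
// filename from the output-dir shape chosen in step 2.
type Layout = "per-framework-subdir" | "flattened-root";

function scriptBasename(layout: Layout, task: string, framework: string): string {
  // Per-framework subdir: the script is named after the task.
  // Flattened upload root: the script is named after the framework.
  return layout === "per-framework-subdir" ? `${task}.ts` : `${framework}.ts`;
}

// Compute the name once and reuse it for Write, verify, and cleanup.
const script = scriptBasename("flattened-root", "search-products", "playwright");
console.log(`npx tsx ${script}`); // prints "npx tsx playwright.ts"
```

Computing the name once and reusing it is what keeps the verify command
from pointing at a file that was never written.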

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 skills/autobrowse/SKILL.md | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff --git a/skills/autobrowse/SKILL.md b/skills/autobrowse/SKILL.md
index 0a7b51b..e2b91df 100644
--- a/skills/autobrowse/SKILL.md
+++ b/skills/autobrowse/SKILL.md
@@ -240,16 +240,38 @@ on demand:
 
 The loop is:
 
+Pick an **output directory** for each run and keep all of steps 2-4 inside
+it. The two common shapes:
+
+- **Per-framework subdir** (standalone autobrowse use, no host): one
+  directory per framework, scripts named after the task —
+  `tasks/<task>/playwright/<task>.ts`, `tasks/<task>/stagehand/<task>.ts`.
+  Each subdir gets its own `package.json` + `node_modules`.
+- **Flattened upload root** (what browse.sh's skill-generator uses): all
+  frameworks share one output dir at the upload root, scripts named after
+  the framework — `/tmp/skill/{domain}/{task}/playwright.ts`,
+  `.../stagehand.ts`. One merged `package.json` covers both.
+
+Step 4's verify command and step 7's "delete broken script" path must
+match step 2's filename. Pick one shape per task and stick with it.
+
+The loop is:
+
 1. `Read` the converged trace at
-   `./autobrowse/traces/<task>/latest/{trace.json,unified-events.jsonl}`,
-   the task's `strategy.md`, and the framework reference doc.
-2. `Write` `<framework>.ts` into the output directory (e.g.
-   `tasks/<task>/<framework>/<task>.ts` or a flattened upload root).
+   `./autobrowse/traces/<task>/run-NNN/{trace.json,unified-events.jsonl}`
+   (zero-padded run number — autobrowse also maintains a `latest` symlink
+   to the most recent run if you'd rather use that), the task's
+   `strategy.md` at `./autobrowse/tasks/<task>/strategy.md`, and the
+   framework reference doc at `references/codegen/<framework>.md`.
+2. `Write` the script into the output directory you picked:
+   `<output-dir>/<task>.ts` (per-framework subdir) or
+   `<output-dir>/<framework>.ts` (flattened root).
 3. `Write` the scaffold's `package.json` + `tsconfig.json` per the
    reference. When multiple frameworks share an output directory, merge
    the `dependencies` across frameworks into a single `package.json`.
 4. `Bash` `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 npm install --silent --no-audit --no-fund`
-   then `npx tsx <framework>.ts` against a fresh Browserbase session.
+   then `npx tsx <the-script-you-just-wrote>` against a fresh Browserbase
+   session. Use the same filename you passed to `Write` in step 2.
 5. Parse the trailing `{"success":boolean,...}` JSON line from stdout. If
    it failed, read the stderr tail and iterate — up to ~3 attempts is
    reasonable. If still failing, delete the broken script so it isn't