browserbase · ziruihao · Jun 5, 2026 · Jun 5, 2026
diff --git a/skills/autobrowse/SKILL.md b/skills/autobrowse/SKILL.md
@@ -225,47 +225,65 @@ Read the new summary. Did it pass? Make clear progress?
 
 ### Generate a runnable script (optional)
 
-Once the task has converged, you can produce a deterministic, runnable script
-in one or more frameworks via `scripts/codegen.mjs`. This is one shot of an
-LLM call per framework, cached by content hash, with optional verify-against-
-fresh-session and rewrite-on-failure.
-
-```bash
-node ${CLAUDE_SKILL_DIR}/scripts/codegen.mjs \
-  --task <name> \
-  --workspace ./autobrowse \
-  --frameworks playwright,stagehand \
-  --verify
-```
-
-Each framework gets its own subdirectory under `tasks/<name>/<framework>/`
-with the emitted script and a self-contained scaffold (`package.json`,
-`tsconfig.json`). The directory is runnable standalone with
-`cd tasks/<name>/playwright && npm install && npx tsx <name>.ts` — the only
-runtime requirement is `BROWSERBASE_API_KEY` (plus `ANTHROPIC_API_KEY` for
-the Stagehand target).
-
-Builtin frameworks: `playwright`, `stagehand`. Add a custom framework with
-`--prompt-template <path> --frameworks custom` (and provide your own runner
-or pass `--no-verify`).
-
-Common flags:
-
-| Flag | Purpose |
-|---|---|
-| `--frameworks a,b,...` | Comma-separated; default `playwright` |
-| `--verify` / `--no-verify` | Run the produced script against a fresh BB session; default `--verify` |
-| `--max-retries N` | Rewrite-on-verify-failure cap; default 2 |
-| `--cache-only` | Error if cache miss (CI-friendly) |
-| `--force` | Bust the cache |
-| `--dry-run` | Estimate prompt size + cost; don't call the LLM |
-| `--run <id>` | Force a specific `run-NNN` (default: latest passing) |
-
-Output is one JSON line per framework on stdout. Non-zero exit if any
-selected framework's final state is `passed: false`.
-
-See `references/playwright-cdp-bridge.md` for the canonical
-`connectOverCDP` patterns the emitted scripts follow.
+Once the task has converged, you can produce a runnable script in one or
+more frameworks (Playwright, Stagehand) directly using your own `Write` and
+`Bash` tools — autobrowse no longer ships a separate `codegen.mjs`
+sub-process. The framework-specific specs live as reference docs you read
+on demand:
+
+- `references/codegen/playwright.md` — script shape, scaffold, verify
+  contract, locator priorities, HTTP-only variant
+- `references/codegen/stagehand.md` — Stagehand v3 constructor, `act` /
+  `extract` patterns, when NOT to ship Stagehand
+- `references/playwright-cdp-bridge.md` — canonical `connectOverCDP`
+  create-session / release dance
+
+The loop is:
+
+Pick an **output directory** for each run and keep all of step 2-4 inside
+it. The two common shapes:
+
+- **Per-framework subdir** (standalone autobrowse use, no host): one
+  directory per framework, scripts named after the task —
+  `tasks/<task>/playwright/<task>.ts`, `tasks/<task>/stagehand/<task>.ts`.
+  Each subdir gets its own `package.json` + `node_modules`.
+- **Flattened upload root** (what browse.sh's skill-generator uses): all
+  frameworks share one output dir at the upload root, scripts named after
+  the framework — `/tmp/skill/{domain}/{task}/playwright.ts`,
+  `.../stagehand.ts`. One merged `package.json` covers both.
+
+Step 4's verify command and step 7's "delete broken script" path must
+match step 2's filename. Pick one shape per task and stick with it.
+
+The loop is:
+
+1. `Read` the converged trace at
+   `./autobrowse/traces/<task>/run-NNN/{trace.json,unified-events.jsonl}`
+   (zero-padded run number — autobrowse also maintains a `latest` symlink
+   to the most recent run if you'd rather use that), the task's
+   `strategy.md` at `./autobrowse/tasks/<task>/strategy.md`, and the
+   framework reference doc at `references/codegen/<framework>.md`.
+2. `Write` the script into the output directory you picked:
+   `<output-dir>/<task>.ts` (per-framework subdir) or
+   `<output-dir>/<framework>.ts` (flattened root).
+3. `Write` the scaffold's `package.json` + `tsconfig.json` per the
+   reference. When multiple frameworks share an output directory, merge
+   the `dependencies` across frameworks into a single `package.json`.
+4. `Bash` `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 npm install --silent --no-audit --no-fund`
+   then `npx tsx <the-script-you-just-wrote>` against a fresh Browserbase
+   session. Use the same filename you `Write`'d in step 2.
+5. Parse the trailing `{"success":boolean,...}` JSON line from stdout. If
+   it failed, read the stderr tail and iterate — up to ~3 attempts is
+   reasonable. If still failing, delete the broken script so it isn't
+   uploaded (the upload glob ships whatever's on disk).
+
+The agent does this directly because it already has the context, the
+tools, and the judgment for "this stderr means …, try X". A sub-process
+LLM call (the old `codegen.mjs`) couldn't see why a script was failing
+beyond the stderr tail, and tended to bleed natural-language preamble
+into the `.ts` file via the completion API's message channel — both
+problems disappear when the outer agent writes the file through the
+`Write` tool's structured argument.
 
 ### After all iterations — publish if ready
 

diff --git a/skills/autobrowse/codegen/prompts/playwright.md b/skills/autobrowse/codegen/prompts/playwright.md
diff --git a/skills/autobrowse/codegen/prompts/stagehand.md b/skills/autobrowse/codegen/prompts/stagehand.md
diff --git a/skills/autobrowse/codegen/runners/lib/tsx-runner.mjs b/skills/autobrowse/codegen/runners/lib/tsx-runner.mjs