Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
100 changes: 59 additions & 41 deletions skills/autobrowse/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -225,47 +225,65 @@ Read the new summary. Did it pass? Make clear progress?

### Generate a runnable script (optional)

Once the task has converged, you can produce a deterministic, runnable script
in one or more frameworks via `scripts/codegen.mjs`. This is one shot of an
LLM call per framework, cached by content hash, with optional verify-against-
fresh-session and rewrite-on-failure.

```bash
node ${CLAUDE_SKILL_DIR}/scripts/codegen.mjs \
--task <name> \
--workspace ./autobrowse \
--frameworks playwright,stagehand \
--verify
```

Each framework gets its own subdirectory under `tasks/<name>/<framework>/`
with the emitted script and a self-contained scaffold (`package.json`,
`tsconfig.json`). The directory is runnable standalone with
`cd tasks/<name>/playwright && npm install && npx tsx <name>.ts` — the only
runtime requirement is `BROWSERBASE_API_KEY` (plus `ANTHROPIC_API_KEY` for
the Stagehand target).

Builtin frameworks: `playwright`, `stagehand`. Add a custom framework with
`--prompt-template <path> --frameworks custom` (and provide your own runner
or pass `--no-verify`).

Common flags:

| Flag | Purpose |
|---|---|
| `--frameworks a,b,...` | Comma-separated; default `playwright` |
| `--verify` / `--no-verify` | Run the produced script against a fresh BB session; default `--verify` |
| `--max-retries N` | Rewrite-on-verify-failure cap; default 2 |
| `--cache-only` | Error if cache miss (CI-friendly) |
| `--force` | Bust the cache |
| `--dry-run` | Estimate prompt size + cost; don't call the LLM |
| `--run <id>` | Force a specific `run-NNN` (default: latest passing) |

Output is one JSON line per framework on stdout. Non-zero exit if any
selected framework's final state is `passed: false`.

See `references/playwright-cdp-bridge.md` for the canonical
`connectOverCDP` patterns the emitted scripts follow.
Once the task has converged, you can produce a runnable script in one or
more frameworks (Playwright, Stagehand) directly using your own `Write` and
`Bash` tools — autobrowse no longer ships a separate `codegen.mjs`
sub-process. The framework-specific specs live as reference docs you read
on demand:

- `references/codegen/playwright.md` — script shape, scaffold, verify
contract, locator priorities, HTTP-only variant
- `references/codegen/stagehand.md` — Stagehand v3 constructor, `act` /
`extract` patterns, when NOT to ship Stagehand
- `references/playwright-cdp-bridge.md` — canonical `connectOverCDP`
create-session / release dance

The loop is:

Pick an **output directory** for each run and keep all of step 2-4 inside
it. The two common shapes:

- **Per-framework subdir** (standalone autobrowse use, no host): one
directory per framework, scripts named after the task —
`tasks/<task>/playwright/<task>.ts`, `tasks/<task>/stagehand/<task>.ts`.
Each subdir gets its own `package.json` + `node_modules`.
- **Flattened upload root** (what browse.sh's skill-generator uses): all
frameworks share one output dir at the upload root, scripts named after
the framework — `/tmp/skill/{domain}/{task}/playwright.ts`,
`.../stagehand.ts`. One merged `package.json` covers both.

Step 4's verify command and step 7's "delete broken script" path must
match step 2's filename. Pick one shape per task and stick with it.

The loop is:

1. `Read` the converged trace at
`./autobrowse/traces/<task>/run-NNN/{trace.json,unified-events.jsonl}`
(zero-padded run number — autobrowse also maintains a `latest` symlink
to the most recent run if you'd rather use that), the task's
`strategy.md` at `./autobrowse/tasks/<task>/strategy.md`, and the
framework reference doc at `references/codegen/<framework>.md`.
2. `Write` the script into the output directory you picked:
`<output-dir>/<task>.ts` (per-framework subdir) or
`<output-dir>/<framework>.ts` (flattened root).
3. `Write` the scaffold's `package.json` + `tsconfig.json` per the
reference. When multiple frameworks share an output directory, merge
the `dependencies` across frameworks into a single `package.json`.
4. `Bash` `PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1 npm install --silent --no-audit --no-fund`
then `npx tsx <the-script-you-just-wrote>` against a fresh Browserbase
session. Use the same filename you `Write`'d in step 2.
5. Parse the trailing `{"success":boolean,...}` JSON line from stdout. If
it failed, read the stderr tail and iterate — up to ~3 attempts is
reasonable. If still failing, delete the broken script so it isn't
uploaded (the upload glob ships whatever's on disk).

The agent does this directly because it already has the context, the
tools, and the judgment for "this stderr means …, try X". A sub-process
LLM call (the old `codegen.mjs`) couldn't see why a script was failing
beyond the stderr tail, and tended to bleed natural-language preamble
into the `.ts` file via the completion API's message channel — both
problems disappear when the outer agent writes the file through the
`Write` tool's structured argument.

### After all iterations — publish if ready

Expand Down
55 changes: 0 additions & 55 deletions skills/autobrowse/codegen/prompts/playwright.md

This file was deleted.

80 changes: 0 additions & 80 deletions skills/autobrowse/codegen/prompts/stagehand.md

This file was deleted.

136 changes: 0 additions & 136 deletions skills/autobrowse/codegen/runners/lib/tsx-runner.mjs

This file was deleted.

Loading