browser-record: turn a recorded human browser flow into a reusable, intent-level task skill #141
shubh24 wants to merge 3 commits into
Record a human browser flow on a Browserbase cloud session and replay it deterministically through the browse CLI, with accessibility-snapshot selector healing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Shift from deterministic selector-replay to "capture wide, reason narrow":
- inject.js: capture each step's accessible name + role (ungated), so an
autocomplete suggestion ("New York") is recorded even when its only
selector is a dynamic id — this is the intent signal.
- record.mjs: per-step screenshots (intent evidence + replay oracle) and an
RR_CONNECT_URL attach mode so the recorder can join a browser-trace
keep-alive session and share the full CDP firehose.
- Distillation is now an agent, not a script (removed distill.mjs). The
teacher agent reads the interaction stream + screenshots + trace and
reconstructs intent — collapsing self-corrections, dropping abandoned
actions, parameterizing inputs — then authors a task skill. See
references/distill.md.
- SKILL.md rewritten around record -> trace -> distill -> task skill;
deterministic replay.mjs demoted to an optional CI fast path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
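The inject.js change described in this commit message (record each step's accessible name + role regardless of how good the selector is) can be sketched roughly as below. This is a minimal illustration, not the PR's code: describeTarget and IMPLICIT_ROLES are hypothetical names, and the name lookup is a crude approximation of the real accessible-name computation (aria-label first, then visible text).

```javascript
// Hypothetical helper, NOT the actual inject.js implementation.
// A tiny, simplified subset of implicit tag-to-role mappings.
const IMPLICIT_ROLES = { button: 'button', a: 'link', li: 'listitem' };

function describeTarget(el) {
  const tag = String(el.tagName || '').toLowerCase();
  // An explicit role attribute wins; otherwise fall back to the tag's implicit role.
  const role = el.getAttribute('role') || IMPLICIT_ROLES[tag] || '';
  // Approximate the accessible name: aria-label first, then visible text.
  const raw = el.getAttribute('aria-label') || el.textContent || '';
  const name = raw.replace(/\s+/g, ' ').trim().slice(0, 120);
  return { role, name };
}

// Stand-in for an autocomplete suggestion whose only selector is a dynamic id.
const suggestion = {
  tagName: 'LI',
  textContent: '  New\n York ',
  getAttribute: (attr) => ({ role: 'option', id: 'c307' })[attr] || null,
};

console.log(describeTarget(suggestion)); // { role: 'option', name: 'New York' }
```

The point of recording this ungated is that the name ("New York") survives even when the id (#c307) will not exist on the next page load.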
- Rename the skill to browser-record (it's a recorder that emits a task skill;
"replay" is now just invoking that skill).
- Remove replay.mjs: replay is agentic (invoke the generated skill), so the
deterministic engine is no longer part of the product.
- Generated task skills now bundle a curated screenshots/ folder (the visual
oracle) referenced per step, and each step names its recorded target
(accessible name/role) as a hint while granting the agent agency to use
whatever live element achieves the intent.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
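To make the "recorded target as a hint, screenshot as oracle" idea in this commit concrete, here is a small sketch of how one distilled step might be rendered into a task skill. Everything in it is assumed for illustration: renderStep, the step object's fields, the {{destination}} placeholder syntax, and the step-03.png file name are hypothetical, and the real task-skill shape is whatever SKILL.md in the PR defines.

```javascript
// Hypothetical step shape and output format, for illustration only.
function renderStep(index, step) {
  // The recorded role + accessible name is offered as a hint, not a selector.
  const hint = `${step.target.role} "${step.target.name}"`;
  return [
    `${index}. ${step.intent}`,
    `   Hint (recorded target, not a required selector): ${hint}`,
    `   Visual oracle: screenshots/${step.screenshot}`,
  ].join('\n');
}

const example = renderStep(1, {
  intent: 'Set destination to {{destination}}', // parameterized input
  target: { role: 'option', name: 'New York' },  // what the human clicked
  screenshot: 'step-03.png',                      // assumed file name
});

console.log(example);
```

Keeping the hint separate from the intent is what leaves the replaying agent free to pick a different live element when the page has changed.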
Played with this end-to-end: recorded a couple of real flows and distilled them into skills. Two learnings worth baking in:

1. Default the capture step to local headed Chrome, not a cloud session

I tried swapping the recorder to launch a local headed Chrome ...
Keeping ...

2. Security posture on captured credentials

Recording any flow that includes a login currently persists the password to disk in plaintext.

// scripts/inject.js
const value = ('value' in el) ? el.value : ''; // includes type="password"

I hit this for real recording a GitHub login: my email + password ended up in the output

Happy to send a PR with both changes if useful.
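The plaintext-credential concern raised in the comment above could be handled at capture time. The following is a minimal sketch of one possible mitigation, assuming that dropping the value and recording only a "redacted" flag is acceptable for distillation; captureValue and SENSITIVE_AUTOCOMPLETE are hypothetical names, and this is not the recorder's actual code or the reviewer's proposed patch.

```javascript
// Hypothetical replacement for the value capture quoted above.
// Standard autocomplete tokens that mark a field as sensitive.
const SENSITIVE_AUTOCOMPLETE = new Set([
  'current-password', 'new-password', 'one-time-code', 'cc-number', 'cc-csc',
]);

function captureValue(el) {
  // Elements without a value property contribute nothing, as before.
  if (!('value' in el)) return { value: '', redacted: false };
  const type = String(el.type || '').toLowerCase();
  const tokens = String(el.autocomplete || '').toLowerCase().split(/\s+/);
  const sensitive =
    type === 'password' || tokens.some((t) => SENSITIVE_AUTOCOMPLETE.has(t));
  // Never persist sensitive values; keep only the fact that something was typed.
  return sensitive
    ? { value: '', redacted: true }
    : { value: el.value, redacted: false };
}

console.log(captureValue({ type: 'password', value: 'hunter2' }));
// { value: '', redacted: true }
console.log(captureValue({ type: 'email', autocomplete: 'email', value: 'a@example.com' }));
// { value: 'a@example.com', redacted: false }
```

A redacted step can still be distilled into a parameter (for example a password input supplied at run time), so the skill stays usable without the secret ever touching disk.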
What we're building

Turn a one-time human browser demonstration into a durable, reusable, parameterized agent skill.

A human performs a flow once in a live cloud browser; out the other end comes a skills/<task>/SKILL.md that any agent can invoke: parameterized, self-verifying, and resilient to the page changing underneath it. "Show, don't prompt."

The core thesis: replay intent, not mechanics
A naive recording captures mechanics: "typed n-e-w-y-o, clicked #c307, clicked li[1]". Those rot instantly: dynamic ids regenerate per page load, deep DOM paths drift, and keystrokes aren't the point. We started with a deterministic selector-replay engine (with a healing ladder) and hit the wall you'd expect: it could be made to work, but it was the wrong abstraction.

What you actually want is intent: "destination = New York". And recovering intent is a judgment call:

None of that is collapsible by deterministic rules. So the distiller is an agent, not a script: the same shape as the autobrowse teacher loop, re-seeded on a human's trace instead of an agent's own run. (Intended to merge with autobrowse later.)

How it works: capture wide, reason narrow
- record.mjs injects a listener that records each click/type with the acted element's accessible name + role (ungated: this is the fix that captures an autocomplete suggestion's name even when its only selector is a dynamic id) plus a screenshot per step. RR_CONNECT_URL lets it attach to a browser-trace keep-alive session so the firehose and the interaction stream observe the same session.
- The teacher agent (references/distill.md) reads the stream + screenshots, queries the bisected trace on demand (progressive disclosure, not firehose-in-prompt), and writes the smallest set of intents that explains the session.
- ... browse, uses the per-step screenshots as the visual oracle, and verifies committed values.

The generated task skill bundles a curated screenshots/ folder (referenced per step) and names each step's recorded target as a hint while granting the agent agency to use whatever live element achieves the intent.

Why Browserbase (not local Chrome)
Capture works over plain CDP, so it runs locally, but the loop is materially better on Browserbase: the live-view URL makes the human demo remote and shareable; server-side observability (session recording, proxy network, downloads, logs via bb-finalize) gives the teacher agent far more to reason over; clean isolated sessions avoid recording your local profile/cookies/extensions; and recording + replaying in the same environment keeps record ≈ replay.

What's in this PR
- skills/browser-record/: capture scripts (record.mjs, inject.js), the teacher-agent distill procedure + prompt (references/distill.md), the task-skill shape (SKILL.md), evals, package.json, LICENSE.
- ... browser-trace skill for the firehose.

Status / validation
- node scripts/validate-skills.mjs --skill browser-record → passes (0 errors/warnings).
- ... google-flights-search skill, collapsing the nameless filter-UI noise. That generated skill is kept local for now, not in this PR.
- cookie-sync/pitch-prep are unrelated and untouched.

🤖 Generated with Claude Code