browser-record: turn a recorded human browser flow into a reusable, intent-level task skill #141

Draft
shubh24 wants to merge 3 commits into main from record-and-replay-skill

@shubh24 shubh24 commented Jun 27, 2026

Contributor

What we're building

Turn a one-time human browser demonstration into a durable, reusable, parameterized agent skill.

A human performs a flow once in a live cloud browser; out the other end comes a skills/<task>/SKILL.md that any agent can invoke — parameterized, self-verifying, and resilient to the page changing underneath it. "Show, don't prompt."

The core thesis: replay intent, not mechanics

A naive recording captures mechanics — "typed n-e-w-y-o, clicked #c307, clicked li[1]". Those rot instantly: dynamic ids regenerate per page load, deep DOM paths drift, and keystrokes aren't the point. We started with a deterministic selector-replay engine (with a healing ladder) and hit the wall you'd expect — it could be made to work, but it was the wrong abstraction.

What you actually want is intent — "destination = New York". And recovering intent is a judgment call:

  • The committed value lives in the outcome (the chosen suggestion's accessible name), not the input.
  • A human who typed San Francisco, erased it, and chose Los Angeles meant Los Angeles — the recorder must drop the correction.
  • A filter applied then removed is net-zero — drop it entirely.
  • The values the human supplied (cities, dates) are parameters, not constants.

None of that is collapsible by deterministic rules. So the distiller is an agent, not a script — the same shape as the autobrowse teacher loop, re-seeded on a human's trace instead of an agent's own run. (Intended to merge with autobrowse later.)
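To make the mechanics-versus-intent distinction concrete, here is a minimal sketch of what a distilled intent list might look like once the human's values are lifted into parameters. The field names (`intent`, `value`, `verify`) and the `{{name}}` placeholder syntax are illustrative assumptions, not the format this PR emits. Recovering the intents is still the agent's job; this only shows how its output could be bound to new inputs.

```javascript
// Hypothetical distilled intents: values the human supplied become placeholders.
const intents = [
  { intent: "set origin", value: "{{origin}}", verify: "origin field shows {{origin}}" },
  { intent: "set destination", value: "{{destination}}", verify: "destination field shows {{destination}}" },
  { intent: "set departure date", value: "{{date}}", verify: "date field shows {{date}}" },
];

// Replace every {{name}} placeholder in a string; fail loudly on a missing parameter.
function bind(template, params) {
  return template.replace(/\{\{(\w+)\}\}/g, (_, name) => {
    if (!(name in params)) throw new Error(`missing parameter: ${name}`);
    return String(params[name]);
  });
}

// Produce concrete steps for one invocation of the skill.
function bindIntents(intentList, params) {
  return intentList.map((step) => ({
    intent: step.intent,
    value: bind(step.value, params),
    verify: bind(step.verify, params),
  }));
}

const steps = bindIntents(intents, { origin: "San Diego", destination: "New York", date: "Jul 3" });
console.log(steps);
```

Note that nothing here mentions `#c307` or `li[1]`: the steps carry only what the human meant, plus a check that the value was committed.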

How it works: capture wide, reason narrow

record (interaction stream + per-step screenshots)        ← semantic spine
  + browser-trace (CDP firehose: network/console/DOM)     ← full observability
  → distill = teacher agent reconstructs INTENT           ← collapses corrections,
  → skills/<task>/ (SKILL.md + screenshots/ + recording)    drops abandoned actions
  • Capture: record.mjs injects a listener that records each click/type with the acted element's accessible name + role, plus a screenshot per step. The name/role capture is ungated, which is the fix that captures an autocomplete suggestion's name even when its only selector is a dynamic id. RR_CONNECT_URL lets it attach to a browser-trace keep-alive session so the firehose and the interaction stream observe the same session.
  • Distill: an agent (per references/distill.md) reads the stream + screenshots, queries the bisected trace on demand (progressive disclosure, not firehose-in-prompt), and writes the smallest set of intents that explains the session.
  • Replay: just invoke the generated skill. The agent realizes each intent via browse, uses the per-step screenshots as the visual oracle, and verifies committed values.

The generated task skill bundles a curated screenshots/ folder (referenced per step) and names each step's recorded target as a hint while granting the agent agency to use whatever live element achieves the intent.
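The capture step described above can be sketched roughly as follows. This is not the contents of inject.js: `describeTarget` and `toEvent` are hypothetical helpers, the implicit-role table is deliberately tiny, and the name lookup (aria-label, then text content) is a simplified stand-in for real accessible-name computation. The point it illustrates is that the event carries a name and role for any element, so a suggestion whose only selector is a dynamic id still records what the human chose.

```javascript
// Simplified stand-in for accessible-name/role computation. The real rules
// (aria-labelledby, <label> association, input types, ...) are more involved.
function describeTarget(el) {
  const attr = (name) => (el.getAttribute ? el.getAttribute(name) : null);
  const implicitRoles = { BUTTON: "button", A: "link", LI: "listitem", INPUT: "textbox" };
  const role = attr("role") || implicitRoles[el.tagName] || "generic";
  const name = (attr("aria-label") || el.textContent || "").trim().replace(/\s+/g, " ");
  // The id is kept only as a hint; it may be regenerated on the next page load.
  return { role, name, selectorHint: el.id ? `#${el.id}` : null };
}

// Build one event for the interaction stream. There is no check on tag type
// ("ungated"): every acted-on element gets its name and role recorded.
function toEvent(type, el, now = Date.now()) {
  return { type, at: now, ...describeTarget(el) };
}

// In a page this might be wired up along these lines (assumed, not verified):
//   document.addEventListener("click",
//     (e) => window.__rr_events.push(toEvent("click", e.target)), true);
```

At replay time the `name` and `role` serve as the hint for the step, while `selectorHint` is treated as disposable.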

Why Browserbase (not local Chrome)

Capture works over plain CDP, so it runs locally — but the loop is materially better on Browserbase: the live-view URL makes the human demo remote and shareable; server-side observability (session recording, proxy network, downloads, logs via bb-finalize) gives the teacher agent far more to reason over; clean isolated sessions avoid recording your local profile/cookies/extensions; and recording + replaying in the same environment keeps record ≈ replay.

What's in this PR

  • skills/browser-record/ — capture scripts (record.mjs, inject.js), the teacher-agent distill procedure + prompt (references/distill.md), the task-skill shape (SKILL.md), evals, package.json, LICENSE.
  • Pairs with the existing browser-trace skill for the firehose.
  • Deterministic replay engine removed — replay is invoking the generated skill.

Status / validation

  • Draft. node scripts/validate-skills.mjs --skill browser-record → passes (0 errors/warnings).
  • Demonstrated end-to-end on Google Flights: enriched capture recorded the suggestion name "San Francisco International Airport" (the old tag-gated capture dropped it); the teacher agent distilled a one-way SAN→SFO/Jul-3 flow into a parameterized google-flights-search skill, collapsing the nameless filter-UI noise. That generated skill is kept local for now, not in this PR.
  • Pre-existing validator failures in cookie-sync / pitch-prep are unrelated and untouched.

🤖 Generated with Claude Code

shubh24 and others added 3 commits June 26, 2026 23:54
Record a human browser flow on a Browserbase cloud session and replay it
deterministically through the browse CLI, with accessibility-snapshot
selector healing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Shift from deterministic selector-replay to "capture wide, reason narrow":

- inject.js: capture each step's accessible name + role (ungated), so an
  autocomplete suggestion ("New York") is recorded even when its only
  selector is a dynamic id — this is the intent signal.
- record.mjs: per-step screenshots (intent evidence + replay oracle) and an
  RR_CONNECT_URL attach mode so the recorder can join a browser-trace
  keep-alive session and share the full CDP firehose.
- Distillation is now an agent, not a script (removed distill.mjs). The
  teacher agent reads the interaction stream + screenshots + trace and
  reconstructs intent — collapsing self-corrections, dropping abandoned
  actions, parameterizing inputs — then authors a task skill. See
  references/distill.md.
- SKILL.md rewritten around record -> trace -> distill -> task skill;
  deterministic replay.mjs demoted to an optional CI fast path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Rename the skill to browser-record (it's a recorder that emits a task
  skill; "replay" is now just invoking that skill).
- Remove replay.mjs: replay is agentic (invoke the generated skill), so the
  deterministic engine is no longer part of the product.
- Generated task skills now bundle a curated screenshots/ folder (the visual
  oracle) referenced per step, and each step names its recorded target
  (accessible name/role) as a hint while granting the agent agency to use
  whatever live element achieves the intent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@shubh24 shubh24 changed the title Add record-and-replay skill Add browser-record skill (record a flow → distill into a task skill) Jun 27, 2026
@shubh24 shubh24 changed the title Add browser-record skill (record a flow → distill into a task skill) browser-record: turn a recorded human browser flow into a reusable, intent-level task skill Jun 27, 2026
@shrey150

Contributor

Played with this end-to-end — recorded a couple of real flows and distilled them into skills. Two learnings worth baking in:

1. Default the capture step to local headed Chrome, not a cloud session

I tried swapping the recorder to launch a local headed Chrome, chromium.launchPersistentContext(userDataDir, { headless: false, channel: 'chrome' }), instead of creating a Browserbase session + connectOverCDP. The capture engine (inject.js → window.__rr_events → drain + per-step screenshots) is browser-agnostic, so only the connection setup changes. Local won out as the better default for the record step because:

  • It uses the human's real persistent profile, so logins survive between recordings. Big for auth'd flows — with a warm profile the SSO/2FA step is often already satisfied and you land straight on the destination.
  • No API key, no session cost, no cloud round-trip latency.
  • You interact with a real window directly — no live-view URL hand-off needed.

Keeping RR_CONNECT_URL as an optional attach path preserves the browser-trace firehose pairing (and Browserbase itself, when you want the shareable live view / clean isolated session). Suggestion: make local headed the default, cloud/attach opt-in.
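A minimal sketch of that default, assuming the recorder uses Playwright: attach when RR_CONNECT_URL is set, otherwise launch local headed Chrome with a persistent profile. `planCapture`, `openContext`, the `RR_USER_DATA_DIR` variable, and its default path are hypothetical names for illustration. Only `RR_CONNECT_URL`, `launchPersistentContext`, and `connectOverCDP` come from the discussion above.

```javascript
// Decide how to connect. RR_CONNECT_URL comes from the PR; RR_USER_DATA_DIR
// and the "./.rr-profile" default are assumptions made for this sketch.
function planCapture(env) {
  if (env.RR_CONNECT_URL) {
    return { mode: "attach", url: env.RR_CONNECT_URL };
  }
  return { mode: "local", userDataDir: env.RR_USER_DATA_DIR || "./.rr-profile" };
}

// `chromium` is Playwright's chromium object, passed in so the planning
// logic above can be exercised without launching a browser.
async function openContext(chromium, plan) {
  if (plan.mode === "attach") {
    const browser = await chromium.connectOverCDP(plan.url);
    // Reuse the existing context so the recorder observes the same session as the trace.
    return browser.contexts()[0] ?? (await browser.newContext());
  }
  return chromium.launchPersistentContext(plan.userDataDir, {
    headless: false,
    channel: "chrome",
  });
}
```

Usage would be along the lines of `openContext(chromium, planCapture(process.env))`, after which the same inject-and-drain capture code runs against whichever context comes back.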

2. Security posture on captured credentials

Recording any flow that includes a login currently persists the password to disk in plaintext. inject.js logs every change event's .value with no special-casing:

// scripts/inject.js
const value = ('value' in el) ? el.value : '';   // includes type="password"

I hit this for real recording a GitHub login — my email + password ended up in the output recording.json in /tmp, and per-step screenshots can capture sensitive fields visually too. For a skill whose whole purpose is "record human flows and persist them as reusable files," that's a meaningful gap. Suggested fixes:

  • Redact type="password" / autocomplete="*-password" inputs — store a [REDACTED] sentinel instead of the value.
  • Consider the same for other obvious secrets (OTP, card number), and add a note that screenshots may capture sensitive UI.
  • A scrub pass in the distiller so any generated recording.json fallback never carries secrets.
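The first and third fixes could look something like the sketch below. None of this is in inject.js today: `isSensitive`, `captureValue`, `scrubEvents`, and the per-event `sensitive` flag are hypothetical, and the list of extra autocomplete tokens is an assumption about what "other obvious secrets" should cover.

```javascript
const REDACTED = "[REDACTED]";

// True for inputs whose value should never be written to disk.
// Covers type="password" and autocomplete tokens such as "current-password"
// and "new-password"; the extra tokens below are an assumed starting list.
function isSensitive(el) {
  const type = (el.type || "").toLowerCase();
  const tokens = (el.autocomplete || "").toLowerCase().split(/\s+/);
  const secretTokens = new Set(["one-time-code", "cc-number", "cc-csc"]);
  return type === "password" || tokens.some((t) => t.endsWith("-password") || secretTokens.has(t));
}

// Replacement for reading el.value directly at capture time.
function captureValue(el) {
  if (!("value" in el)) return "";
  return isSensitive(el) ? REDACTED : el.value;
}

// Distiller-side scrub: a defensive second pass over already-recorded events.
// Assumes each event carries an optional `sensitive` flag set at capture time.
function scrubEvents(events) {
  return events.map((ev) => (ev.sensitive ? { ...ev, value: REDACTED } : ev));
}
```

Redacting at capture time keeps the secret out of recording.json in the first place; the scrub pass is a backstop. Neither addresses screenshots, which would need a separate measure.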

Happy to send a PR with both changes if useful.
