feat(cua-driver-rs)(experimental): picture-in-picture agent preview (macOS native; Win/Linux stubs) #1730
New `pip-preview` crate carries the cross-platform PipConfig, PipFrame, and PipBackend trait + factory registry, mirroring the shape of `cua-driver-core::video`. A thin `pip_hook` module inside `cua-driver-core` exposes the per-tool-call push callback so the tool dispatcher can synthesise a frame label and forward the existing SCREENSHOT_FN bytes without taking a direct dependency on pip-preview. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
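The trait-plus-registry shape described above can be sketched in a few lines of std-only Rust. This is a minimal sketch, not the crate's code: the field names on `PipConfig` and `PipFrame`, the method signatures, and the helpers `register_factory` and `start_backend` are all hypothetical. Only the type names and the `OnceLock` registry come from the PR.

```rust
use std::sync::OnceLock;

/// Startup options (hypothetical fields).
#[derive(Clone, Debug, Default)]
pub struct PipConfig {
    pub enabled: bool,
    pub geometry: Option<String>,
}

/// One preview update: post-action PNG bytes plus a one-line label.
#[derive(Clone, Debug)]
pub struct PipFrame {
    pub png: Vec<u8>,
    pub label: String,
}

/// A live platform window that accepts frame pushes.
pub trait PipBackend: Send {
    fn push(&mut self, frame: PipFrame);
}

/// Registered once per process by the platform crate.
pub trait PipBackendFactory: Send + Sync {
    fn start(&self, config: &PipConfig) -> Result<Box<dyn PipBackend>, String>;
}

// Process-wide registry, same idea as the factory registry in cua_driver_core::video.
static PIP_FACTORY: OnceLock<Box<dyn PipBackendFactory>> = OnceLock::new();

/// Returns false if a factory was already registered (the first one wins).
pub fn register_factory(factory: Box<dyn PipBackendFactory>) -> bool {
    PIP_FACTORY.set(factory).is_ok()
}

/// Starts a backend from the registered factory, if any.
pub fn start_backend(config: &PipConfig) -> Result<Box<dyn PipBackend>, String> {
    PIP_FACTORY
        .get()
        .ok_or_else(|| "no PiP backend factory registered".to_string())?
        .start(config)
}
```

Keeping the registry in a `OnceLock` lets the core crate start whatever backend the platform crate registered without a compile-time dependency on any platform code.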
MacosPipBackend renders the post-action screenshot via NSImageView (scaled proportionally) with an NSTextField label strip at the bottom. The window is NSFloatingWindowLevel with CanJoinAllSpaces / FullScreenAuxiliary / Stationary / Transient / IgnoresCycle, so it stays visible across spaces and full-screen apps without becoming a Mission-Control affordance or stealing focus. Frame updates are dispatched with dispatch_async_f onto the main queue (AppKit must run on main); `run_appkit_main_loop()` is exposed so `cua-driver serve --experimental-pip` can park its main thread in NSApplication.run() while serve runs on a background thread. Without that, the dispatched blocks never execute and the window stays blank. `becomesKeyOnlyIfNeeded:` is NSPanel-only; we rely on `orderFrontRegardless` + the no-cycle/transient collection-behavior flags instead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
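For reference, the collection-behavior flags listed above are plain bits in AppKit's `NSWindowCollectionBehavior`, an `NSUInteger` option set, while `NSFloatingWindowLevel` is a signed `NSInteger`. The sketch below assumes the bit values from the macOS SDK's NSWindow.h and simply ORs them together; the constant and function names are illustrative, not objc2 identifiers. Keeping signed and unsigned apart matters because objc2 checks argument type codes at the message send.

```rust
// Bit values from NSWindowCollectionBehavior (NSWindow.h). The option set is an
// NSUInteger, i.e. u64 on 64-bit macOS, so these stay unsigned.
pub const CAN_JOIN_ALL_SPACES: u64 = 1 << 0;
pub const MOVE_TO_ACTIVE_SPACE: u64 = 1 << 1;
pub const TRANSIENT: u64 = 1 << 3;
pub const STATIONARY: u64 = 1 << 4;
pub const IGNORES_CYCLE: u64 = 1 << 6;
pub const FULL_SCREEN_AUXILIARY: u64 = 1 << 8;

// NSFloatingWindowLevel equals kCGFloatingWindowLevel (3). Window levels are
// NSInteger, so this one is signed.
pub const NS_FLOATING_WINDOW_LEVEL: i64 = 3;

/// Collection behavior for the passive PiP window described above.
pub fn pip_collection_behavior() -> u64 {
    CAN_JOIN_ALL_SPACES | FULL_SCREEN_AUXILIARY | STATIONARY | TRANSIENT | IGNORES_CYCLE
}
```

The combined mask (`345`) would be what gets handed to `setCollectionBehavior:` as an unsigned value.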
Both factories return a clear "not yet implemented" error so maybe_init_pip() can log "PiP unavailable" and the daemon keeps running without a window. Real implementations (WS_EX_NOACTIVATE HWND on Win, wlr-layer-shell / GTK4 on Linux) tracked as a follow-up issue. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
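A stub factory and a non-fatal `maybe_init_pip()` might look like the following. This is a hypothetical sketch: the real `maybe_init_pip()` takes its configuration from the CLI and config file rather than parameters, and the exact error text here is made up. What it illustrates is the contract that a failed `start()` is logged and the daemon carries on.

```rust
/// Minimal stand-ins for the crate types (hypothetical; the real trait has methods).
pub trait PipBackend: Send {}

pub trait PipBackendFactory {
    fn start(&self) -> Result<Box<dyn PipBackend>, String>;
}

/// Windows/Linux placeholder: compiles everywhere, never opens a window.
pub struct StubFactory {
    pub platform: &'static str,
}

impl PipBackendFactory for StubFactory {
    fn start(&self) -> Result<Box<dyn PipBackend>, String> {
        Err(format!("PiP preview not yet implemented on {}", self.platform))
    }
}

/// A failed start is logged and swallowed, so the caller keeps running.
pub fn maybe_init_pip(
    enabled: bool,
    factory: &dyn PipBackendFactory,
) -> Option<Box<dyn PipBackend>> {
    if !enabled {
        return None;
    }
    match factory.start() {
        Ok(backend) => Some(backend),
        Err(e) => {
            eprintln!("PiP unavailable: {e}");
            None
        }
    }
}
```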
maybe_init_pip() registers the platform's PipBackendFactory, calls start(), and bridges the live backend handle to the core push hook via a OnceLock<Mutex<Option<...>>>. Wired into Serve and Mcp on macOS plus Serve and async_main on non-macOS so every long-running entry point honors the flag. On macOS Serve, when --experimental-pip is on, the tokio runtime moves to a background thread and the main thread parks in NSApplication.run() (via platform_macos::pip::run_appkit_main_loop) so the dispatch_async_f frame-push path is actually pumped. Geometry parses as the X11 `WxH[+X+Y]` form; default is 480x360 in the top-right corner of the main display. Startup prints an "experimental" banner so users know the flag is opt-in. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
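The X11-style `WxH[+X+Y]` geometry string could be parsed as below. This is a minimal sketch assuming an `Option`-returning `PipGeometry::parse` (the PR names `PipGeometry::parse` but not its signature). It accepts only the `+X+Y` offset form the PR documents, not X11's `-X-Y` right/bottom-relative offsets, and leaves the default top-right placement to the caller.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PipGeometry {
    pub width: u32,
    pub height: u32,
    /// None means "use the default top-right placement".
    pub origin: Option<(i32, i32)>,
}

impl PipGeometry {
    pub fn parse(s: &str) -> Option<PipGeometry> {
        let s = s.trim();
        // Split "WxH" from an optional "+X+Y" tail at the first '+'.
        let (size, offset) = match s.find('+') {
            Some(i) => (&s[..i], Some(&s[i + 1..])),
            None => (s, None),
        };
        let (w, h) = size.split_once('x')?;
        let width = w.parse::<u32>().ok()?;
        let height = h.parse::<u32>().ok()?;
        if width == 0 || height == 0 {
            return None;
        }
        let origin = match offset {
            None => None,
            Some(rest) => {
                // Both X and Y are required once a '+' appears.
                let (x, y) = rest.split_once('+')?;
                Some((x.parse::<i32>().ok()?, y.parse::<i32>().ok()?))
            }
        };
        Some(PipGeometry { width, height, origin })
    }
}
```

For example, `320x200` yields a size with no origin, while `320x200+24` is rejected because the Y offset is missing.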
Documents the opt-in flag, geometry override, what gets pushed (same tools the recorder writes screenshot.png for), macOS window properties (level / collection behavior / no-key contract), and the current platform-support matrix (macOS working, Win/Linux stubs). Marked experimental everywhere — the warning callout, the banner emoji, the title suffix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`setImageScaling:` expects an `NSUInteger` (objc2 type code `'Q'` / u64),
not the signed `'q'`/i64 that was passed. macOS 26+ aborts the process
on the mismatch:
invalid message send to -[NSImageView setImageScaling:]:
expected argument at index 0 to have type code 'Q', but found 'q'
Verified live: `cua-driver serve --experimental-pip` now stays up; PiP
window appears top-right with placeholder and updates with screenshot +
label after `cua-driver call launch_app …`.
Verified live (post-
…verlay label
User feedback on the first PiP cut: window too big, label felt like a
separate UI strip instead of an overlay. Rework:
- Default geometry 480x360 → 320x200 (smaller, image-first)
- NSWindow: Borderless (no title bar / close button), transparent
background, default shadow. CALayer-backed content view supplies the
rounded-corner mask (radius 12) and the dark backing behind the
scaled image.
- NSImageView now fills the entire content rect (was reserved 28pt
for the label strip)
- New pill overlay: NSView at the bottom-center with a
fully-rounded CALayer (radius = height/2), semi-transparent black
background, NSTextField inside with white 11pt system font centered
- Window stays passive (no-activate via Transient/IgnoresCycle
collection behavior) and is now draggable from anywhere
(setMovableByWindowBackground)
Encoding shim added for CGColor — `[NSColor CGColor]` returns
`^{CGColor=}` and objc2's strict msg_send! enforcement rejects
`*mut c_void`. A tiny `#[repr(C)] struct CGColor` + RefEncode impl gives
the right encoding without pulling in a wider CGColor binding crate.
Verified live: `cua-driver serve --experimental-pip` →
`launch_app + get_window_state + click(AllClear)` shows the
expected dark rounded preview in the top-right with a centered pill
reading `click: element_index=2`.
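The labels seen in these live checks (`click: element_index=2`, `launch_app: com.apple.calculator`) suggest a simple `tool: args` format. The sketch below is a guess at that synthesis, not the `pip_hook` code: the read-only tool list, the `bundle_id` special case, and the character cap are assumptions chosen to reproduce those two labels.

```rust
/// Tools that only read state and therefore push no frame (hypothetical list).
const READ_ONLY_TOOLS: &[&str] = &["get_window_state", "get_config"];

pub fn should_push(tool_name: &str) -> bool {
    !READ_ONLY_TOOLS.iter().any(|t| *t == tool_name)
}

/// Build a one-line label from a tool name and its (key, value) args,
/// truncated to `max_chars` characters with a trailing ellipsis.
pub fn synthesize_label(tool_name: &str, args: &[(&str, &str)], max_chars: usize) -> String {
    let rendered: Vec<String> = args
        .iter()
        .map(|(k, v)| {
            // Assumed special case: show bundle ids bare, everything else as key=value.
            if *k == "bundle_id" { v.to_string() } else { format!("{k}={v}") }
        })
        .collect();
    let label = if rendered.is_empty() {
        tool_name.to_string()
    } else {
        format!("{tool_name}: {}", rendered.join(" "))
    };
    if label.chars().count() <= max_chars {
        label
    } else {
        // Cut on char boundaries so multi-byte text never splits mid-character.
        let mut cut: String = label.chars().take(max_chars.saturating_sub(1)).collect();
        cut.push('…');
        cut
    }
}
```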
The first PiP cut was CLI-only — to persist `--experimental-pip` across
daemon restarts, users had to bake the flag into every MCP-client
config (`claude mcp add cua-computer-use -- /path/to/cua-driver mcp
--experimental-pip`) and re-add the entry whenever they wanted to
toggle. That's friction for an opt-in experimental feature.
Wire the same `~/.cua-driver/config.json` file the existing
`set_config` MCP tool writes to into the PiP startup path:
{
"experimental_pip": true,
"experimental_pip_geometry": "320x200+24+24"
}
Edit the JSON once and restart the daemon (or MCP client); PiP comes up.
CLI flags still override — `--experimental-pip` forces it on regardless
of the config value, and `--experimental-pip-geometry WxH+X+Y` wins
over the file's geometry.
Implementation:
- `PipConfig::from_args_and_file(path: &Path)` reads the JSON, then
layers CLI args on top. Malformed / missing file falls back to
defaults silently.
- `default_config_path()` returns `$HOME/.cua-driver/config.json` so
callers don't have to recompute it.
- `main.rs`'s two PiP init sites switched from `from_args()` to the
new path-aware variant.
- `pip-preview` gains a `serde_json` dep (just for parsing the small
config file).
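The precedence rules above (file first, CLI on top, missing or malformed file means defaults) reduce to a couple of `Option` combinators. This sketch assumes the JSON has already been deserialized into a `FileConfig` struct (the real code uses `serde_json`); `FileConfig`, `CliArgs`, and `layer` are hypothetical names, and geometry stays an unparsed string.

```rust
/// Values read from ~/.cua-driver/config.json, already deserialized.
#[derive(Default)]
pub struct FileConfig {
    pub experimental_pip: Option<bool>,
    pub experimental_pip_geometry: Option<String>,
}

/// Values from the command line.
#[derive(Default)]
pub struct CliArgs {
    pub experimental_pip: bool,
    pub experimental_pip_geometry: Option<String>,
}

#[derive(Debug, PartialEq)]
pub struct PipConfig {
    pub enabled: bool,
    /// None means "use the built-in default geometry".
    pub geometry: Option<String>,
}

pub fn layer(file: Option<FileConfig>, cli: CliArgs) -> PipConfig {
    // A missing or malformed file is passed in as None and behaves like an empty one.
    let file = file.unwrap_or_default();
    PipConfig {
        // The CLI flag can force PiP on; it never turns a file-enabled PiP off.
        enabled: cli.experimental_pip || file.experimental_pip.unwrap_or(false),
        // CLI geometry wins over the file's geometry.
        geometry: cli.experimental_pip_geometry.or(file.experimental_pip_geometry),
    }
}
```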
The MCP `set_config` tool's schema is NOT updated in this commit —
that's a separate per-platform change touching 3 different
`tools/set_config` implementations. Users can still edit the JSON
directly; the schema update is a tracked follow-up.
Live-verified: set `experimental_pip: true` in JSON, started
`cua-driver serve` WITHOUT any CLI flag, banner printed and window
appeared at the JSON-configured 280x180+24+24 position.
… on all 3 platforms
Closes the follow-up flagged in the previous commit: the MCP set_config
tool now persists both PiP keys via the cross-platform
pip_preview::write_config_key helper, and get_config surfaces them in
the structured output (read fresh from ~/.cua-driver/config.json on
every call since they don't live in the in-memory DriverConfig).
Per platform:
- macOS (separate set_config.rs / get_config.rs) — schema gains both
keys + invoke() persists them; description notes they take effect
on next daemon restart (the PiP backend is initialised once at
startup).
- Linux (inline in impl_.rs) — same treatment; description calls
out that the Linux backend is still a stub (issue #1729) so the
config persists but no window appears until that lands.
- Windows (inline in impl_.rs) — same, with the additional twist that
Windows set_config exposes BOTH the Swift-compatible {key, value}
dotted-leaf shape AND a legacy per-field shape. Both shapes now
accept the two new keys.
Validation:
- geometry strings are passed through pip_preview::PipGeometry::parse
before persistence; malformed input returns an error from set_config
instead of corrupting the config file
- bool / string type checks in the Swift-shape match arm
- "restart cua-driver for X to take effect" hint baked into the
success message so callers know not to expect immediate window
appearance
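A validation step of this kind, run before anything touches the config file, might look like the following. The `Value` enum is a hypothetical stand-in for `serde_json::Value`, `check_pip_key` is an illustrative name, and the geometry check is a compact re-implementation rather than a call into `pip_preview::PipGeometry::parse`. The `junk` error and the `experimental_pip=true` hint mirror the transcript below; the geometry hint wording is assumed.

```rust
/// Hypothetical stand-in for a parsed JSON value.
#[derive(Debug)]
pub enum Value {
    Bool(bool),
    Str(String),
    Num(f64),
}

/// Accepts "WxH" or "WxH+X+Y" with positive W and H.
fn is_valid_geometry(s: &str) -> bool {
    let (size, offset) = match s.split_once('+') {
        Some((a, b)) => (a, Some(b)),
        None => (s, None),
    };
    let Some((w, h)) = size.split_once('x') else { return false };
    let dims_ok = w.parse::<u32>().map_or(false, |v| v > 0)
        && h.parse::<u32>().map_or(false, |v| v > 0);
    let offset_ok = match offset {
        None => true,
        Some(rest) => matches!(
            rest.split_once('+'),
            Some((x, y)) if x.parse::<i32>().is_ok() && y.parse::<i32>().is_ok()
        ),
    };
    dims_ok && offset_ok
}

/// Type-check and validate one PiP key; Ok carries the restart hint.
pub fn check_pip_key(key: &str, value: &Value) -> Result<String, String> {
    match (key, value) {
        ("experimental_pip", Value::Bool(on)) => {
            Ok(format!("restart cua-driver for experimental_pip={on} to take effect"))
        }
        ("experimental_pip_geometry", Value::Str(g)) if is_valid_geometry(g) => {
            Ok(format!("restart cua-driver for experimental_pip_geometry={g} to take effect"))
        }
        ("experimental_pip_geometry", Value::Str(g)) => {
            Err(format!("experimental_pip_geometry `{g}` is not a valid WxH or WxH+X+Y string"))
        }
        ("experimental_pip", _) => Err("experimental_pip must be a boolean".to_string()),
        ("experimental_pip_geometry", _) => Err("experimental_pip_geometry must be a string".to_string()),
        (other, _) => Err(format!("unknown key `{other}`")),
    }
}
```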
Live-verified on macOS:
$ cua-driver call set_config '{"experimental_pip":true,"experimental_pip_geometry":"320x200+24+24"}'
Config updated: capture_mode=som, max_image_dimension=1024
— restart cua-driver for experimental_pip=true to take effect
$ cua-driver call get_config '{}'
"experimental_pip": true,
"experimental_pip_geometry": "320x200+24+24",
$ cat ~/.cua-driver/config.json
{ "experimental_pip": true, "experimental_pip_geometry": "320x200+24+24", "max_image_dimension": 1024 }
$ cua-driver call set_config '{"experimental_pip_geometry":"junk"}'
experimental_pip_geometry `junk` is not a valid WxH or WxH+X+Y string
New cross-platform helpers in pip-preview:
- write_config_key(key, value) — merges into ~/.cua-driver/config.json
- read_pip_keys_from_file() — surfaces (enabled, geometry) for get_config
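The merge semantics of `write_config_key` (update one key, keep the rest) and the key order in the `cat` output above are both consistent with a sorted map. This sketch models the parsed file as a `BTreeMap` of already-serialized JSON values and leaves out the file I/O; the real helper parses and writes with `serde_json`, and `ConfigMap`, `render`, and `read_pip_keys` are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Stand-in for the parsed config file: key -> already-serialized JSON value.
pub type ConfigMap = BTreeMap<String, String>;

/// Merge one key into the existing map, leaving every other key untouched.
pub fn write_config_key(config: &mut ConfigMap, key: &str, json_value: &str) {
    config.insert(key.to_string(), json_value.to_string());
}

/// Render as a one-line JSON object; BTreeMap iterates keys in sorted order.
/// Assumes keys are plain identifiers that need no escaping.
pub fn render(config: &ConfigMap) -> String {
    let body: Vec<String> = config.iter().map(|(k, v)| format!("\"{k}\": {v}")).collect();
    format!("{{ {} }}", body.join(", "))
}

/// Surface (enabled, geometry) the way get_config reports them.
pub fn read_pip_keys(config: &ConfigMap) -> (Option<bool>, Option<String>) {
    let enabled = config.get("experimental_pip").and_then(|v| v.parse::<bool>().ok());
    let geometry = config
        .get("experimental_pip_geometry")
        .and_then(|v| v.strip_prefix('"')?.strip_suffix('"').map(str::to_string));
    (enabled, geometry)
}
```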
Summary
Adds an opt-in `--experimental-pip` flag that opens a small always-on-top window showing what the cua-driver agent is doing in real time: the post-action screenshot of the target window plus a one-line label describing the tool call (`click element_index=2`, `type_text "hello world"`, etc.).

Frames are pushed for every non-read-only tool call, the same set for which the recording pipeline writes a `turn-NNNNN/screenshot.png`. The PNG bytes come from the existing `SCREENSHOT_FN` callback, so the live view matches what a replay would show. No continuous capture: PiP follows tool calls, not a frame rate.

Experimental, default OFF. Cross-platform from day 1 via a trait, with macOS as the first working backend; Windows and Linux ship as compile-clean stubs whose `start()` returns a clear "not yet implemented" notice so the daemon keeps running without a window. Win + Linux native impls are tracked in #1729.

How to try it locally (macOS)
Geometry override (X11 `WxH[+X+Y]` form, default `480x360` top-right):

Architecture (mirror of cursor-overlay + video.rs)
- `libs/cua-driver/rust/crates/pip-preview/`: `PipConfig`, `PipFrame`, `PipBackend` trait, `PipBackendFactory`, `PIP_FACTORY: OnceLock` registry. Same shape as `cua_driver_core::video`.
- `cua-driver-core/src/pip_hook.rs`: per-process push callback the tool dispatcher (`tool.rs::invoke`) calls after a successful action tool lands. Synthesises the action label from `(tool_name, args)`; pulls screenshot bytes via the existing `recording::screenshot_for(window_id, pid)` shim.
- `platform-macos/src/pip/mod.rs`: `MacosPipBackend` using NSWindow + NSImageView + NSTextField. Frame push → `dispatch_async_f(main_queue, ...)` (AppKit must run on main).
- `platform-{windows,linux}/src/pip/mod.rs`: stubs returning `Err("not yet implemented")`. Tracked in PiP preview (experimental): native Windows + Linux backends #1729.
- `cua-driver/src/main.rs::maybe_init_pip`: registers the platform factory, starts the backend, bridges the live `Box<dyn PipBackend>` to `pip_hook::set_pip_push_fn`. Wired into `Serve` and `Mcp` on macOS plus `Serve` and `async_main` on non-macOS.

macOS Serve mode quirk (worth a closer review eye)
Blocks sent with `dispatch_async_f` to the main queue only run while the main NSRunLoop is pumping. The cursor overlay's `run_on_main_thread()` provides that loop in MCP mode, but Serve mode normally blocks its tokio runtime on main. When `--experimental-pip` is on, the macOS Serve arm now moves the tokio runtime onto a background thread and parks main in `NSApplication.run()` (via `platform_macos::pip::run_appkit_main_loop`). Without this, frames queue forever and the window stays blank.

The non-PiP Serve path keeps its original run-on-main semantics, so existing users are unaffected.
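The threading arrangement can be modelled with a plain channel: the serve work runs on a spawned thread and pushes updates, while the main thread is the only one that applies them, in the role `NSApplication.run()` plays for the main dispatch queue. This is an analogy in std-only Rust, not AppKit code; `run_with_main_pump` and `FrameUpdate` are hypothetical, and unlike `NSApplication.run()` it returns once the serve thread finishes.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a frame update that must be applied on the main thread.
#[derive(Debug)]
pub struct FrameUpdate {
    pub label: String,
}

/// Run `serve` on a background thread while the calling (main) thread drains
/// the queue. If nothing drained `rx`, sent updates would just sit in the
/// channel, which is the "window stays blank" failure mode.
pub fn run_with_main_pump<F>(serve: F) -> Vec<String>
where
    F: FnOnce(mpsc::Sender<FrameUpdate>) + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<FrameUpdate>();
    let worker = thread::spawn(move || serve(tx));
    let mut applied = Vec::new();
    // Ends once every Sender is dropped, i.e. after `serve` returns.
    for update in rx {
        applied.push(update.label); // real code would update the NSImageView / label here
    }
    worker.join().expect("serve thread panicked");
    applied
}
```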
Window properties (macOS)
- `NSFloatingWindowLevel`: above normal apps, below menus / accessibility overlays.
- `CanJoinAllSpaces | FullScreenAuxiliary | Stationary | Transient | IgnoresCycle`: visible across spaces and full-screen apps; never the main / key window.
- `orderFrontRegardless` instead of `makeKeyAndOrderFront`: never steals keyboard focus.
- `becomesKeyOnlyIfNeeded:` is NSPanel-only and crashes when sent to NSWindow; using collection-behavior flags + `orderFrontRegardless` achieves the same passive-window contract.

Test plan
- `cargo build --release -p cua-driver` on macOS: passes
- `cargo check --workspace`: passes (only pre-existing warnings)
- `--help` lists the new `--experimental-pip` / `--experimental-pip-geometry` flags
- `cua-driver serve --experimental-pip --no-permissions-gate` starts; PiP window appears top-right at 480x360 with placeholder "waiting for first action…"
- `cua-driver call launch_app '{"bundle_id":"com.apple.calculator"}'` → window updates with screenshot + label `launch_app: com.apple.calculator` (verified via `screencapture`)

Files touched
- `libs/cua-driver/rust/crates/pip-preview/{Cargo.toml,src/lib.rs}`
- `libs/cua-driver/rust/crates/cua-driver-core/src/pip_hook.rs`
- `libs/cua-driver/rust/crates/platform-{macos,windows,linux}/src/pip/mod.rs`
- `cua-driver/src/{main.rs,cli.rs}`, `cua-driver/Cargo.toml`, `platform-*/Cargo.toml`, `platform-*/src/lib.rs`, `cua-driver-core/src/{tool.rs,recording.rs,lib.rs}`, workspace `Cargo.toml`
- `docs/content/docs/cua-driver/guide/getting-started/pip-preview.mdx` + meta.json entry

Follow-ups
- `tools/call` response metadata so HTTP clients (not just the local PiP window) can show what the driver just did

🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
`--experimental-pip` flag or persistent configuration. Customize window size and position with the `--experimental-pip-geometry` override.

Documentation