feat(cua-driver-rs)(experimental): picture-in-picture agent preview (macOS native; Win/Linux stubs) #1730

Merged
f-trycua merged 10 commits into main from feat/experimental-pip-preview
May 27, 2026

f-trycua (Collaborator) commented May 27, 2026

Summary

Adds an opt-in --experimental-pip flag that opens a small always-on-top window showing what the cua-driver agent is doing in real time: the post-action screenshot of the target window plus a one-line label describing the tool call (click element_index=2, type_text "hello world", etc.).

Frames are pushed for every non-read-only tool call — the same set the recording pipeline writes a turn-NNNNN/screenshot.png for. The PNG bytes come from the existing SCREENSHOT_FN callback, so the live view matches what a replay would show. No continuous capture: PiP follows tool calls, not a frame rate.
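As a rough illustration of the "frame per action tool call, plus a one-line label" rule, the sketch below filters out read-only tools and builds a label from the tool name and its arguments. The function names, the read-only list, and the flat `(key, value)` argument shape are assumptions for illustration only; the real dispatcher works on the tool's JSON arguments and may format labels differently.

```rust
/// Illustrative list of read-only tools that would not produce a PiP frame.
/// Which tools the driver actually treats as read-only is an assumption here.
const READ_ONLY_TOOLS: &[&str] = &["get_window_state", "get_config"];

/// Returns true when a tool call should push a frame to the preview.
pub fn is_action_tool(tool_name: &str) -> bool {
    !READ_ONLY_TOOLS.contains(&tool_name)
}

/// Builds a one-line label such as `click: element_index=2`.
/// With no arguments, the label is just the tool name.
pub fn action_label(tool_name: &str, args: &[(&str, &str)]) -> String {
    if args.is_empty() {
        return tool_name.to_string();
    }
    let details: Vec<String> = args
        .iter()
        .map(|(key, value)| format!("{key}={value}"))
        .collect();
    format!("{tool_name}: {}", details.join(" "))
}
```

A dispatcher following this shape would call `is_action_tool` after a successful invocation and only then fetch the screenshot bytes, so read-only calls cost nothing extra.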

Experimental, default OFF. Cross-platform from day 1 via a trait, with macOS as the first working backend; Windows and Linux ship as compile-clean stubs whose start() returns a clear "not yet implemented" notice so the daemon keeps running without a window. Win + Linux native impls tracked in #1729.

How to try it locally (macOS)

~/.local/bin/cua-driver serve --experimental-pip &
~/.local/bin/cua-driver call launch_app '{"bundle_id":"com.apple.calculator"}'
# → PiP window appears top-right; label updates with each call

Geometry override (X11 WxH[+X+Y] form, default 480x360 top-right):

cua-driver serve --experimental-pip --experimental-pip-geometry 640x480+24+24
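For readers unfamiliar with the `WxH[+X+Y]` form, here is a minimal parsing sketch. `Geometry`, `parse_geometry`, and `invalid` are hypothetical names, not the crate's actual `PipGeometry::parse`; the sketch only handles `+` offsets because that is the only form the PR describes.

```rust
/// Parsed window geometry. `origin` is `None` when no `+X+Y` part was given,
/// which a backend could interpret as "use the default top-right placement".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Geometry {
    pub width: u32,
    pub height: u32,
    pub origin: Option<(i32, i32)>,
}

/// Builds the error message for a rejected geometry string.
fn invalid(s: &str) -> String {
    format!("`{s}` is not a valid WxH or WxH+X+Y string")
}

/// Parses `WxH` or `WxH+X+Y`; anything else is rejected.
pub fn parse_geometry(s: &str) -> Result<Geometry, String> {
    let mut parts = s.split('+');
    let size = parts.next().unwrap_or("");
    let (w, h) = size.split_once('x').ok_or_else(|| invalid(s))?;
    let width = w.parse::<u32>().map_err(|_| invalid(s))?;
    let height = h.parse::<u32>().map_err(|_| invalid(s))?;
    if width == 0 || height == 0 {
        return Err(invalid(s));
    }
    // Either no offset at all, or exactly two offset components.
    let origin = match (parts.next(), parts.next(), parts.next()) {
        (None, _, _) => None,
        (Some(x), Some(y), None) => {
            let x = x.parse::<i32>().map_err(|_| invalid(s))?;
            let y = y.parse::<i32>().map_err(|_| invalid(s))?;
            Some((x, y))
        }
        _ => return Err(invalid(s)),
    };
    Ok(Geometry { width, height, origin })
}
```

Under these assumptions, `640x480+24+24` yields a 640 by 480 window at offset (24, 24), while `480x360` leaves the placement to the backend.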

Architecture (mirror of cursor-overlay + video.rs)

  • libs/cua-driver/rust/crates/pip-preview/ — PipConfig, PipFrame, PipBackend trait, PipBackendFactory, PIP_FACTORY: OnceLock registry. Same shape as cua_driver_core::video.
  • cua-driver-core/src/pip_hook.rs — per-process push callback the tool dispatcher (tool.rs::invoke) calls after a successful action tool lands. Synthesises the action label from (tool_name, args); pulls screenshot bytes via the existing recording::screenshot_for(window_id, pid) shim.
  • platform-macos/src/pip/mod.rs — MacosPipBackend using NSWindow + NSImageView + NSTextField. Frame push → dispatch_async_f(main_queue, ...) (AppKit must run on main).
  • platform-{windows,linux}/src/pip/mod.rs — stubs returning Err("not yet implemented"). Tracked in #1729 (PiP preview (experimental): native Windows + Linux backends).
  • cua-driver/src/main.rs::maybe_init_pip — registers the platform factory, starts the backend, bridges the live Box<dyn PipBackend> to pip_hook::set_pip_push_fn. Wired into Serve and Mcp on macOS plus Serve and async_main on non-macOS.
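The trait-plus-registry shape described in the list above can be sketched with the standard library alone. The type names come from the PR, but every signature below is an assumption (the real `start()` presumably takes a `PipConfig`, and the error type may differ), and `register_factory`, `start_backend`, and `StubFactory` are hypothetical helpers.

```rust
use std::sync::OnceLock;

/// One preview update: encoded screenshot bytes plus the action label.
pub struct PipFrame {
    pub png: Vec<u8>,
    pub label: String,
}

/// A live preview window that accepts frames (signature assumed).
pub trait PipBackend: Send {
    fn push_frame(&mut self, frame: PipFrame);
}

/// Creates the platform backend (signature assumed).
pub trait PipBackendFactory: Send + Sync {
    fn start(&self) -> Result<Box<dyn PipBackend>, String>;
}

/// Process-wide registry, set once at startup by the platform crate.
static PIP_FACTORY: OnceLock<Box<dyn PipBackendFactory>> = OnceLock::new();

/// Registers the factory; returns false if one was already registered.
pub fn register_factory(factory: Box<dyn PipBackendFactory>) -> bool {
    PIP_FACTORY.set(factory).is_ok()
}

/// Starts the registered backend, or reports why no window is available.
pub fn start_backend() -> Result<Box<dyn PipBackend>, String> {
    PIP_FACTORY
        .get()
        .ok_or_else(|| "no PiP factory registered".to_string())?
        .start()
}

/// Stand-in for the Windows/Linux stubs: compiles cleanly, never opens a window.
pub struct StubFactory;

impl PipBackendFactory for StubFactory {
    fn start(&self) -> Result<Box<dyn PipBackend>, String> {
        Err("not yet implemented".to_string())
    }
}
```

With this layout the caller can treat an `Err` from `start_backend` as "PiP unavailable", log it, and keep the daemon running without a window.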

macOS Serve mode quirk (worth a closer review eye)

Blocks dispatched to the main queue with dispatch_async_f only run while the main thread's NSRunLoop is pumping. The cursor overlay's run_on_main_thread() provides that loop in MCP mode, but Serve mode normally blocks the main thread on its tokio runtime. When --experimental-pip is on, the macOS Serve arm now moves the tokio runtime onto a background thread and parks main in NSApplication.run() (via platform_macos::pip::run_appkit_main_loop). Without this, frames queue forever and the window stays blank.

The non-PiP Serve path keeps its original run-on-main semantics so existing users are unaffected.
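The threading arrangement can be modelled without AppKit: a worker thread stands in for the tokio runtime, and the calling thread stands in for the main thread parked in its run loop. This is only an analogy built on `std::sync::mpsc`; `run_with_main_pump` and `Job` are hypothetical names, and the real code uses `dispatch_async_f` and `NSApplication.run()` rather than a channel.

```rust
use std::sync::mpsc;
use std::thread;

/// A unit of UI work that must run on the pumping thread.
/// The `Vec<String>` stands in for window state (labels shown so far).
type Job = Box<dyn FnOnce(&mut Vec<String>) + Send>;

/// Runs `worker` on a background thread and pumps its jobs on the calling
/// thread. Returns the final "window state" once the worker is done.
pub fn run_with_main_pump<W>(worker: W) -> Vec<String>
where
    W: FnOnce(mpsc::Sender<Job>) + Send + 'static,
{
    let (tx, rx) = mpsc::channel::<Job>();
    let handle = thread::spawn(move || worker(tx));

    let mut window_state = Vec::new();
    // Blocks until every Sender is dropped. If nothing iterated `rx`,
    // the jobs would sit in the queue forever, which is the blank-window
    // symptom described above.
    for job in rx {
        job(&mut window_state);
    }

    handle.join().expect("worker thread panicked");
    window_state
}
```

The design point is that the thread owning the UI never runs the server; it only drains queued work, so the server side can push updates from any thread.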

Window properties (macOS)

  • NSFloatingWindowLevel — above normal apps, below menus / accessibility overlays.
  • CanJoinAllSpaces | FullScreenAuxiliary | Stationary | Transient | IgnoresCycle — visible across spaces and full-screen apps; never the main / key window.
  • orderFrontRegardless instead of makeKeyAndOrderFront — never steals keyboard focus.
  • Closeable via the red traffic-light button (decouples from the session per spec).

becomesKeyOnlyIfNeeded: is NSPanel-only and crashes when sent to NSWindow — using collection-behavior flags + orderFrontRegardless achieves the same passive-window contract.

Test plan

  • cargo build --release -p cua-driver on macOS — passes
  • cargo check --workspace — passes (only pre-existing warnings)
  • CLI --help lists the new --experimental-pip / --experimental-pip-geometry flags
  • cua-driver serve --experimental-pip --no-permissions-gate starts; PiP window appears top-right at 480x360 with placeholder "waiting for first action…"
  • cua-driver call launch_app '{"bundle_id":"com.apple.calculator"}' → window updates with screenshot + label launch_app: com.apple.calculator (verified via screencapture)
  • Manual: confirm window stays passive when clicked (doesn't steal focus from frontmost app)
  • Manual: confirm window survives space-switch + full-screen-app transitions
  • Windows / Linux smoke (deferred — stubs only today; tracked in #1729, PiP preview (experimental): native Windows + Linux backends)

Files touched

  • New crate: libs/cua-driver/rust/crates/pip-preview/{Cargo.toml,src/lib.rs}
  • New module: libs/cua-driver/rust/crates/cua-driver-core/src/pip_hook.rs
  • New module: libs/cua-driver/rust/crates/platform-{macos,windows,linux}/src/pip/mod.rs
  • Wire-up: cua-driver/src/{main.rs,cli.rs}, cua-driver/Cargo.toml, platform-*/Cargo.toml, platform-*/src/lib.rs, cua-driver-core/src/{tool.rs,recording.rs,lib.rs}, workspace Cargo.toml
  • Docs: docs/content/docs/cua-driver/guide/getting-started/pip-preview.mdx + meta.json entry

Follow-ups

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Added experimental Picture-in-Picture (PiP) preview window displaying agent per-action screenshots with action labels. Enable via --experimental-pip flag or persistent configuration. Customize window size and position with --experimental-pip-geometry override.
    • macOS support with floating overlay window; Windows and Linux support planned.
  • Documentation

    • Added Getting Started guide for experimental PiP Preview feature with activation instructions, supported tool types, platform-specific behaviors, and troubleshooting tips.

f-trycua and others added 5 commits May 27, 2026 10:22
New `pip-preview` crate carries the cross-platform PipConfig,
PipFrame, and PipBackend trait + factory registry, mirroring the
shape of `cua-driver-core::video`. A thin `pip_hook` module inside
`cua-driver-core` exposes the per-tool-call push callback so the
tool dispatcher can synthesise a frame label and forward the
existing SCREENSHOT_FN bytes without taking a direct dependency
on pip-preview.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MacosPipBackend renders the post-action screenshot via NSImageView
(scaled proportionally) with an NSTextField label strip at the bottom.
Window is NSFloatingWindowLevel with CanJoinAllSpaces /
FullScreenAuxiliary / Stationary / Transient / IgnoresCycle so it stays
visible across spaces and full-screen apps without becoming a
Mission-Control affordance or stealing focus.

Frame updates dispatch_async_f onto the main queue (AppKit must run
on main); `run_appkit_main_loop()` is exposed so `cua-driver serve
--experimental-pip` can park its main thread in NSApplication.run()
while serve runs on a background thread — without that, the
dispatched blocks never execute and the window stays blank.

`becomesKeyOnlyIfNeeded:` is NSPanel-only; we rely on
`orderFrontRegardless` + the no-cycle/transient collection-behavior
flags instead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both factories return a clear "not yet implemented" error so
maybe_init_pip() can log "PiP unavailable" and the daemon keeps
running without a window. Real implementations (WS_EX_NOACTIVATE
HWND on Win, wlr-layer-shell / GTK4 on Linux) tracked as a
follow-up issue.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
maybe_init_pip() registers the platform's PipBackendFactory, calls
start(), and bridges the live backend handle to the core push hook
via a OnceLock<Mutex<Option<...>>>. Wired into Serve and Mcp on
macOS plus Serve and async_main on non-macOS so every long-running
entry point honors the flag.
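A minimal sketch of the `OnceLock<Mutex<Option<...>>>` bridge mentioned in this commit message follows. `set_pip_push_fn` is named in the PR, but its signature here is assumed, and `PushFn`, `slot`, `clear_pip_push_fn`, and `push_frame` are hypothetical.

```rust
use std::sync::{Mutex, OnceLock};

/// Callback that forwards screenshot bytes and a label to the live backend.
type PushFn = Box<dyn Fn(&[u8], &str) + Send>;

/// Empty until startup wires a backend in; stays `None` when PiP is off.
static PIP_PUSH_FN: OnceLock<Mutex<Option<PushFn>>> = OnceLock::new();

fn slot() -> &'static Mutex<Option<PushFn>> {
    PIP_PUSH_FN.get_or_init(|| Mutex::new(None))
}

/// Installs (or replaces) the push callback.
pub fn set_pip_push_fn(f: PushFn) {
    *slot().lock().unwrap() = Some(f);
}

/// Removes the callback, for example after the backend shuts down.
pub fn clear_pip_push_fn() {
    *slot().lock().unwrap() = None;
}

/// Called by the tool dispatcher after an action tool succeeds.
/// Returns true if a backend received the frame, false if PiP is inactive.
pub fn push_frame(png: &[u8], label: &str) -> bool {
    match slot().lock().unwrap().as_ref() {
        Some(f) => {
            f(png, label);
            true
        }
        None => false,
    }
}
```

Keeping the callback behind a plain function pointer type is what lets the core crate push frames without depending on the preview crate.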

On macOS Serve, when --experimental-pip is on, the tokio runtime
moves to a background thread and the main thread parks in
NSApplication.run() (via platform_macos::pip::run_appkit_main_loop)
so the dispatch_async_f frame-push path is actually pumped.

Geometry parses as the X11 `WxH[+X+Y]` form; default is 480x360 in
the top-right corner of the main display. Startup prints an
"experimental" banner so users know the flag is opt-in.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents the opt-in flag, geometry override, what gets pushed
(same tools the recorder writes screenshot.png for), macOS window
properties (level / collection behavior / no-key contract), and
the current platform-support matrix (macOS working, Win/Linux
stubs). Marked experimental everywhere — the warning callout, the
banner emoji, the title suffix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vercel Bot (Contributor) commented May 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| docs | Ignored | Preview | May 27, 2026 10:25am |

coderabbitai Bot (Contributor) commented May 27, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR introduces an experimental always-on-top picture-in-picture preview window for cua-driver. The window displays agent action screenshots and short action labels. Activation is opt-in via --experimental-pip flag or persistent config. The implementation includes macOS full support, Windows/Linux compile-clean stubs, and cross-platform configuration persistence via JSON settings.

Changes

PiP Preview Feature

| Layer / File(s) | Summary |
|---|---|
| Shared configuration and runtime API: libs/cua-driver/rust/crates/pip-preview/Cargo.toml, libs/cua-driver/rust/crates/pip-preview/src/lib.rs | pip-preview crate introduces CLI/file config parsing, geometry validation, and platform abstraction traits (PipBackendFactory, PipBackend) for frame delivery and window lifecycle management. |
| Core frame hook and screenshot infrastructure: libs/cua-driver/rust/Cargo.toml, libs/cua-driver/rust/crates/cua-driver-core/src/lib.rs, libs/cua-driver/rust/crates/cua-driver-core/src/pip_hook.rs, libs/cua-driver/rust/crates/cua-driver-core/src/recording.rs | pip_hook module in cua-driver-core provides frame registration and push callbacks; screenshot_for helper centralizes platform screenshot invocation for use by recording and PiP hooks. |
| CLI parsing and main initialization: libs/cua-driver/rust/crates/cua-driver/src/cli.rs, libs/cua-driver/rust/crates/cua-driver/Cargo.toml, libs/cua-driver/rust/crates/cua-driver/src/main.rs | CLI parsing treats --experimental-pip-geometry WxH[+X+Y] as value-taking flag; maybe_init_pip() loads config, registers OS-specific backend, starts it, and wires frame delivery callback into core hook. |
| Main startup integration: libs/cua-driver/rust/crates/cua-driver/src/main.rs | PiP initialization integrated into macOS Serve (with AppKit main loop parking on main thread while serve runs in background), macOS MCP startup, and non-macOS Serve/async startup paths. |
| Per-action frame capture and label synthesis: libs/cua-driver/rust/crates/cua-driver-core/src/tool.rs | ToolRegistry::invoke captures screenshots, synthesizes human-readable action labels (click coordinates, typed text, keys, scroll deltas), and pushes frames to hook after recording when PiP enabled. |
| macOS tool configuration: libs/cua-driver/rust/crates/platform-macos/src/tools/get_config.rs, libs/cua-driver/rust/crates/platform-macos/src/tools/set_config.rs | Tools augment get/set config to read and persist experimental_pip/experimental_pip_geometry from file; set config validates geometry and notes "next restart" requirement. |
| Linux tool configuration: libs/cua-driver/rust/crates/platform-linux/src/tools/impl_.rs | Tools read/write experimental PiP keys via shared file helpers with geometry validation and restart messaging. |
| Windows tool configuration: libs/cua-driver/rust/crates/platform-windows/src/tools/impl_.rs | Tools handle experimental PiP in Swift-compatible and legacy paths, validate geometry, and extend response payloads to match macOS/Linux. |
| macOS PiP backend with AppKit: libs/cua-driver/rust/crates/platform-macos/Cargo.toml, libs/cua-driver/rust/crates/platform-macos/src/lib.rs, libs/cua-driver/rust/crates/platform-macos/src/pip/mod.rs | Complete macOS window implementation: creates floating borderless NSWindow with NSImageView and label, marshals frame updates from Tokio threads to AppKit main thread via libdispatch, and provides main loop runner for Serve mode. |
| Windows and Linux stubs: libs/cua-driver/rust/crates/platform-linux/Cargo.toml, libs/cua-driver/rust/crates/platform-linux/src/lib.rs, libs/cua-driver/rust/crates/platform-linux/src/pip/mod.rs, libs/cua-driver/rust/crates/platform-windows/Cargo.toml, libs/cua-driver/rust/crates/platform-windows/src/lib.rs, libs/cua-driver/rust/crates/platform-windows/src/pip/mod.rs | Platform stubs register factories that fail start() with descriptive errors, allowing graceful fallback and future implementation. |
| User documentation: docs/content/docs/cua-driver/guide/getting-started/meta.json, docs/content/docs/cua-driver/guide/getting-started/pip-preview.mdx | Getting Started guide documents opt-in activation, supported tools, macOS window behavior, platform status, and troubleshooting. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

Poem

🐰 A window floats, forever on top,
Screenshots whisper of clicks and text,
Each action labeled in pill-shaped pop,
Macintosh sings (the others—not yet!),
Picture-in-picture, hopping to the top! 📸

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the primary change: adding an experimental picture-in-picture agent preview feature with a macOS implementation and Windows/Linux stubs. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

`setImageScaling:` expects an `NSUInteger` (objc2 type code `'Q'` / u64),
not the signed `'q'`/i64 that was passed. macOS 26+ aborts the process
on the mismatch:

    invalid message send to -[NSImageView setImageScaling:]:
    expected argument at index 0 to have type code 'Q', but found 'q'

Verified live: `cua-driver serve --experimental-pip` now stays up; PiP
window appears top-right with placeholder and updates with screenshot +
label after `cua-driver call launch_app …`.
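To make the `'Q'` versus `'q'` mismatch concrete, here is a toy model of an encoding check. In Objective-C type encodings, `q` is a signed 64-bit `long long` and `Q` is the unsigned counterpart, which is what `NSUInteger` maps to on 64-bit platforms. The `TypeCode` trait and `check_arg` function are hypothetical and much simpler than objc2's own verification.

```rust
/// Maps a Rust integer type to its Objective-C type-encoding character.
pub trait TypeCode {
    const CODE: char;
}

impl TypeCode for i64 {
    const CODE: char = 'q'; // long long (signed)
}

impl TypeCode for u64 {
    const CODE: char = 'Q'; // unsigned long long, i.e. NSUInteger on 64-bit
}

/// Rejects an argument whose encoding differs from what the method declares.
pub fn check_arg<T: TypeCode>(expected: char, _arg: T) -> Result<(), String> {
    if T::CODE == expected {
        Ok(())
    } else {
        Err(format!(
            "expected argument to have type code '{expected}', but found '{}'",
            T::CODE
        ))
    }
}
```

In this model, passing a `u64` where the method declares `'Q'` succeeds, while an `i64` with the same bit pattern is rejected, mirroring the abort quoted in the commit message.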
f-trycua (Collaborator, Author) commented

Verified live (post-eb4bf5e6 + fixup)

Smoke test on this Mac:

$ ~/.local/bin/cua-driver serve --experimental-pip &
⚗️  PiP preview enabled (experimental — macOS only today; ...)
cua-driver daemon listening on /Users/.../cua-driver.sock

$ ~/.local/bin/cua-driver call launch_app '{"bundle_id":"com.apple.calculator"}'
{... pid: 11844, windows: [...]}

PiP window appeared in the top-right corner at 480×360 with placeholder text "Waiting for first action…". After the launch_app call landed, the window updated with the post-action screenshot and the label launch_app: com.apple.calculator. Frame-push hook works as designed.

Caveats from this test

PR remains DRAFT for review.

f-trycua added 4 commits May 27, 2026 12:11
…verlay label

User feedback on the first PiP cut: window too big, label felt like a
separate UI strip instead of an overlay. Rework:

- Default geometry 480x360 → 320x200 (smaller, image-first)
- NSWindow: Borderless (no title bar / close button), transparent
  background, default shadow. CALayer-backed content view supplies the
  rounded-corner mask (radius 12) and the dark backing behind the
  scaled image.
- NSImageView now fills the entire content rect (previously 28pt was
  reserved for the label strip)
- New pill overlay: NSView at the bottom-center with a
  fully-rounded CALayer (radius = height/2), semi-transparent black
  background, NSTextField inside with white 11pt system font centered
- Window stays passive (no-activate via Transient/IgnoresCycle
  collection behavior) and is now draggable from anywhere
  (setMovableByWindowBackground)

Encoding shim added for CGColor — `[NSColor CGColor]` returns
`^{CGColor=}` and objc2's strict msg_send! enforcement rejects
`*mut c_void`. A tiny `#[repr(C)] struct CGColor` + RefEncode impl gives
the right encoding without pulling in a wider CGColor binding crate.

Verified live: `cua-driver serve --experimental-pip` →
`launch_app + get_window_state + click(AllClear)` shows the
expected dark rounded preview in the top-right with a centered pill
reading `click: element_index=2`.
The first PiP cut was CLI-only — to persist `--experimental-pip` across
daemon restarts, users had to bake the flag into every MCP-client
config (`claude mcp add cua-computer-use -- /path/to/cua-driver mcp
--experimental-pip`) and re-add the entry whenever they wanted to
toggle. That's friction for an opt-in experimental feature.

Wire the same `~/.cua-driver/config.json` file the existing
`set_config` MCP tool writes to into the PiP startup path:

  {
    "experimental_pip": true,
    "experimental_pip_geometry": "320x200+24+24"
  }

Edit the JSON once, restart the daemon (or MCP client), PiP comes up.
CLI flags still override — `--experimental-pip` forces it on regardless
of the config value, and `--experimental-pip-geometry WxH+X+Y` wins
over the file's geometry.
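
The precedence rule can be sketched as a pure function over already-parsed inputs. The struct and function names below are hypothetical, JSON parsing is left out, and the `320x200` default is taken from the earlier rework commit; the real `PipConfig::from_args_and_file` may be organised differently.

```rust
/// Values read from ~/.cua-driver/config.json; a missing key is `None`.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct FileSettings {
    pub experimental_pip: Option<bool>,
    pub experimental_pip_geometry: Option<String>,
}

/// Values taken from the command line.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct CliArgs {
    pub experimental_pip: bool,
    pub geometry: Option<String>,
}

/// Effective settings after layering CLI over file over defaults.
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedPip {
    pub enabled: bool,
    pub geometry: String,
}

/// Assumed default, per the geometry change described in an earlier commit.
pub const DEFAULT_GEOMETRY: &str = "320x200";

pub fn resolve(file: &FileSettings, cli: &CliArgs) -> ResolvedPip {
    ResolvedPip {
        // The flag can only force PiP on; it cannot turn a file setting off.
        enabled: cli.experimental_pip || file.experimental_pip.unwrap_or(false),
        // CLI geometry wins, then the file, then the built-in default.
        geometry: cli
            .geometry
            .clone()
            .or_else(|| file.experimental_pip_geometry.clone())
            .unwrap_or_else(|| DEFAULT_GEOMETRY.to_string()),
    }
}
```

A missing or malformed file maps naturally onto `FileSettings::default()`, which gives the silent fallback to defaults that the implementation notes describe.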

Implementation:
- `PipConfig::from_args_and_file(path: &Path)` reads the JSON, then
  layers CLI args on top. Malformed / missing file falls back to
  defaults silently.
- `default_config_path()` returns `$HOME/.cua-driver/config.json` so
  callers don't have to recompute it.
- `main.rs`'s two PiP init sites switched from `from_args()` to the
  new path-aware variant.
- `pip-preview` gains a `serde_json` dep (just for parsing the small
  config file).

The MCP `set_config` tool's schema is NOT updated in this commit —
that's a separate per-platform change touching 3 different
`tools/set_config` implementations. Users can still edit the JSON
directly; the schema update is a tracked follow-up.

Live-verified: set `experimental_pip: true` in JSON, started
`cua-driver serve` WITHOUT any CLI flag, banner printed and window
appeared at the JSON-configured 280x180+24+24 position.
… on all 3 platforms

Closes the follow-up flagged in the previous commit: the MCP set_config
tool now persists both PiP keys via the cross-platform
pip_preview::write_config_key helper, and get_config surfaces them in
the structured output (read fresh from ~/.cua-driver/config.json on
every call since they don't live in the in-memory DriverConfig).

Per platform:

  - macOS  (separate set_config.rs / get_config.rs) — schema gains both
    keys + invoke() persists them; description notes they take effect
    on next daemon restart (the PiP backend is initialised once at
    startup).
  - Linux  (inline in impl_.rs)  — same treatment; description calls
    out that the Linux backend is still a stub (issue #1729) so the
    config persists but no window appears until that lands.
  - Windows (inline in impl_.rs) — same, with the additional twist that
    Windows set_config exposes BOTH the Swift-compatible {key, value}
    dotted-leaf shape AND a legacy per-field shape. Both shapes now
    accept the two new keys.

Validation:
  - geometry strings are passed through pip_preview::PipGeometry::parse
    before persistence; malformed input returns an error from set_config
    instead of corrupting the config file
  - bool / string type checks in the Swift-shape match arm
  - "restart cua-driver for X to take effect" hint baked into the
    success message so callers know not to expect immediate window
    appearance

Live-verified on macOS:
  $ cua-driver call set_config '{"experimental_pip":true,"experimental_pip_geometry":"320x200+24+24"}'
  Config updated: capture_mode=som, max_image_dimension=1024
    — restart cua-driver for experimental_pip=true to take effect
  $ cua-driver call get_config '{}'
    "experimental_pip": true,
    "experimental_pip_geometry": "320x200+24+24",
  $ cat ~/.cua-driver/config.json
    { "experimental_pip": true, "experimental_pip_geometry": "320x200+24+24", "max_image_dimension": 1024 }

  $ cua-driver call set_config '{"experimental_pip_geometry":"junk"}'
    experimental_pip_geometry `junk` is not a valid WxH or WxH+X+Y string

New cross-platform helpers in pip-preview:
  - write_config_key(key, value)  — merges into ~/.cua-driver/config.json
  - read_pip_keys_from_file()     — surfaces (enabled, geometry) for get_config
@f-trycua f-trycua marked this pull request as ready for review May 27, 2026 10:25
@f-trycua f-trycua merged commit 7594195 into main May 27, 2026
5 of 9 checks passed
@f-trycua f-trycua deleted the feat/experimental-pip-preview branch May 27, 2026 10:29