wta(prompts): tighten autofix + terminal-agent classification #154
yeelam-gordon wants to merge 1 commit into
Pull request overview
This PR tightens WTA prompt routing to reduce misclassification between “auto-fix” vs “explain” and “Chat” vs “Mode A”, and adds a PowerShell A/B harness (plus recorded CSVs) to reproduce and validate those prompt changes.
Changes:
- Update `auto-fix.md` to treat unambiguous language-level missing packages as `fix`, while keeping ambiguous system CLI installs as `explain`.
- Update `terminal-agent.md` to route follow-ups after a failed command (as seen in `buffer`) to Mode A, and add a tie-breaker discouraging prose+fix-command fences.
- Add prompt evaluation harness scripts + checked-in run summaries under `tools/wta/prompts/tests/`.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tools/wta/prompts/auto-fix.md | Refines fix vs explain decision criteria for missing packages vs system tools. |
| tools/wta/prompts/terminal-agent.md | Tightens Chat eligibility based on runtime buffer errors; adds Mode A follow-up and tie-breaker guidance. |
| tools/wta/prompts/tests/runner-autofix.ps1 | Adds A/B harness for auto-fix.md with scenario parsing/scoring. |
| tools/wta/prompts/tests/runner-terminal-agent.ps1 | Adds A/B harness for Chat vs Mode A classification using runtime buffer scenarios. |
| tools/wta/prompts/tests/runner-terminal-agent-copilot-cli.ps1 | Adds Copilot CLI-driven harness for a reduced scenario set. |
| tools/wta/prompts/tests/README.md | Documents harness layout, requirements, usage, and references to checked-in results. |
| tools/wta/prompts/tests/results/autofix-min-qwen.csv | Recorded Qwen track results for autofix A/B. |
| tools/wta/prompts/tests/results/autofix-min-copilot.csv | Recorded Copilot track results for autofix A/B. |
| tools/wta/prompts/tests/results/terminal-agent-min-qwen.csv | Recorded Qwen track results for terminal-agent A/B. |
| tools/wta/prompts/tests/results/terminal-agent-min-copilot.csv | Recorded Copilot track results for terminal-agent A/B. |
| tools/wta/prompts/tests/results/terminal-agent-min-copilot-cli.csv | Recorded Copilot CLI track results for terminal-agent A/B. |
    function Build-Variant {
        param([string]$variant)
        switch ($variant) {
            'baseline' { return $basePrompt }
            'MIN2'     { return $minPrompt }
            default    { throw "Unknown variant: $variant" }
        }
    }
    # Robust A/B harness for the Terminal Agent prompt against qwen-code default system prompt.
    # Variants:
    #   baseline = current prompt (line 9 unchanged)
    #   VC       = baseline + 15-word chat-mode qualifier on line 9
    #   PRE      = baseline + Step 0 binary gate inserted BEFORE the numbered modes
    #   PRE+VC   = both (belt and suspenders)
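    $ErrorActionPreference = 'Stop'
    $root = $PSScriptRoot
    # $PSScriptRoot is empty when this code runs outside a script file, and
    # Join-Path rejects an empty -Path, so fall back to the current directory.
    if (-not $root) { $root = (Get-Location).Path }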
    function Invoke-CopilotCli {
        param([string]$systemAndUser)
        # -p reads the prompt from its argument, so pass the raw content directly;
        # PowerShell hands it to copilot as a single argument. The temp file that
        # was written here was never read, so it is dropped.
        $out = & copilot -p $systemAndUser --allow-all-tools 2>&1 | Out-String
        return $out
    }
Each runner sends the same scenarios to the live LLM under two prompt
variants (`baseline` = pre-fix, `MIN` = post-fix) across multiple
trials, parses the response, and tallies pass/fail. This is the
evidence trail for the prompt edits; re-run any time the prompts
change to catch regressions.

## Layout

| File | Purpose |
|---|---|
| `runner-autofix.ps1` | A/B harness for `auto-fix.md` (Qwen + Copilot OpenAI-compatible API). 12 scenarios (F1–F8 expect `fix`, E1–E4 expect `explain`). |
| `runner-terminal-agent.ps1` | A/B harness for `terminal-agent.md` (Qwen + Copilot OpenAI-compatible API). Chat-vs-Mode-A classification scenarios. |
| `runner-terminal-agent-copilot-cli.ps1` | Same scenarios as `runner-terminal-agent.ps1` but driven through the real `copilot -p` CLI (closer to production wta path). |
| `results/` | CSV summaries from the runs that justified the prompt edits in this PR. |

## Requirements

- Windows PowerShell 5+ or PowerShell 7+
- A Qwen Code CLI config at `~/.qwen/settings.json` containing the
  OpenAI-compatible endpoint, API key env var, and model id
  (`modelProviders.openai[0].{baseUrl,envKey,id}`).
- For the Qwen track only: a `qwen-default-sys.txt` next to the runner
  containing the Qwen CLI's default system prompt (so the harness
  mirrors production exactly). If missing, pass `-Copilot` to skip the
  Qwen track.
- For `runner-terminal-agent-copilot-cli.ps1`: the `copilot` CLI on PATH.

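As a rough illustration of the config requirement above, the following Python sketch (not the harness's PowerShell) reads the three fields named in `modelProviders.openai[0].{baseUrl,envKey,id}` and resolves the API key from the environment. The function names `load_settings` and `load_provider` and the returned dictionary keys are hypothetical; only the settings path and the three field names come from the README.

```python
import json
import os
from pathlib import Path


def load_settings(path=Path.home() / ".qwen" / "settings.json"):
    """Read the Qwen Code CLI settings file as a dictionary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def load_provider(settings, env=os.environ):
    """Extract endpoint, model id, and API key for the first OpenAI-compatible provider.

    Assumes the layout the README describes: modelProviders.openai[0] holds
    baseUrl, envKey (the NAME of the env var with the key), and id.
    """
    provider = settings["modelProviders"]["openai"][0]
    api_key = env.get(provider["envKey"])
    if not api_key:
        raise KeyError(f"environment variable {provider['envKey']} is not set")
    return {
        "base_url": provider["baseUrl"],
        "model": provider["id"],
        "api_key": api_key,
    }
```

Passing `env` explicitly keeps the lookup testable without touching the real environment.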
## Usage

```powershell
cd tools/wta/prompts/tests

# 3-trial A/B on auto-fix.md, Qwen track:
.\runner-autofix.ps1 -Trials 3 -Variants @('baseline','MIN') -OutSuffix '-qwen'

# Same, Copilot track (no qwen-default-sys.txt needed):
.\runner-autofix.ps1 -Trials 3 -Variants @('baseline','MIN') -Copilot -OutSuffix '-copilot'

# terminal-agent.md, both variants:
.\runner-terminal-agent.ps1 -Trials 3 -Variants @('baseline','MIN') -OutSuffix '-qwen'
.\runner-terminal-agent.ps1 -Trials 3 -Variants @('baseline','MIN') -Copilot -OutSuffix '-copilot'
```

Outputs land in this folder (or `-OutFile` if you set it):
`{autofix,results}-summary{$OutSuffix}.csv` (per-trial pass/fail) and
`{autofix,results}-full{$OutSuffix}.json` (full prompts + raw model
responses for debugging).
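The A/B loop the README describes (same scenarios, two prompt variants, several trials, pass/fail tally) can be sketched as follows. This is a minimal Python stand-in, not the PowerShell runner: the three scenarios, the `stub_model` function, and its hard-coded misroute are all hypothetical and replace the live LLM call.

```python
from collections import Counter

# Hypothetical scenario list: (scenario id, expected action).
# The real harness reportedly uses F1-F8 (expect fix) and E1-E4 (expect explain).
SCENARIOS = [("F1", "fix"), ("F2", "fix"), ("E1", "explain")]


def stub_model(variant, scenario_id):
    """Stand-in for the live model; behaviour is invented for illustration."""
    if variant == "baseline" and scenario_id == "F2":
        return "explain"  # pretend the pre-fix prompt misroutes this scenario
    return "explain" if scenario_id.startswith("E") else "fix"


def run_ab(variants, trials, model=stub_model):
    """Return {variant: (passes, total_calls)} over all scenarios and trials."""
    passes, totals = Counter(), Counter()
    for variant in variants:
        for scenario_id, expected in SCENARIOS:
            for _ in range(trials):
                totals[variant] += 1
                if model(variant, scenario_id) == expected:
                    passes[variant] += 1
    return {v: (passes[v], totals[v]) for v in variants}
```

With 12 scenarios and `-Trials 3`, the same loop makes 12 × 3 = 36 calls per variant, which is the denominator quoted in the PR description.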
check-spelling found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Force-pushed from 5a38d83 to d916600.
  Read the runtime context (cwd, profile, activeTarget, buffer, supported delegate agents) and the user's input. Then walk this decision tree top-to-bottom and stop at the FIRST match:

- 1. **Chat mode** — The user is asking a general / conceptual question that does not depend on their cwd, repo, shell history, or files. Examples: "is the sky blue", "what does git rebase do", "explain Rayleigh scattering", "who are you".
+ 1. **Chat mode** — The user is asking a general / conceptual question that does not depend on their cwd, repo, shell history, or files, AND the runtime `buffer` shows no recent error / failed command. If the buffer shows an error, the request is never Chat — it inherits that error as context: route information-seeking words ("why?", "what does this mean", "explain") to **Mode A** (explain the error), and action-seeking words ("help", "fix it", "make it work") to **Mode B** (run a command). Chat examples (no buffer error): "is the sky blue", "what does git rebase do", "explain Rayleigh scattering", "who are you".
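To make the buffer-error rule in this revision of the prompt concrete, here is a small Python sketch of the routing it asks the model to perform. The prompt is natural-language guidance, not code, so the function name `route_followup`, the substring matching, and the choice to check action words first are all assumptions made for illustration.

```python
from typing import Optional

# Example phrases taken from the prompt text above.
INFO_WORDS = ("why", "what does this mean", "explain")
ACTION_WORDS = ("help", "fix it", "make it work")


def route_followup(user_input: str, buffer_has_error: bool) -> Optional[str]:
    """Apply only the buffer-error rule; None means 'rule does not apply'.

    When None is returned, the rest of the decision tree (including ordinary
    Chat eligibility) would still need to be walked.
    """
    if not buffer_has_error:
        return None
    text = user_input.strip().lower()
    # Assumption: action-seeking wording wins when both kinds appear.
    if any(word in text for word in ACTION_WORDS):
        return "Mode B"
    if any(word in text for word in INFO_WORDS):
        return "Mode A"
    return None
```

For example, `route_followup("why?", True)` yields `"Mode A"` and `route_followup("fix it", True)` yields `"Mode B"`, while the same inputs with a clean buffer fall through to the rest of the tree.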
  ### `fix` — one deterministic command resolves it

- Use when you can write a single shell command (including in-place file edits) that fixes the error with certainty: typos, wrong flags, made-up commands with obvious intent (`listdir` → shell-native equivalent), source edits the compiler pinpoints, single-file renames, missing imports.
+ **This is the strong default. Pick `fix` whenever a single shell command can plausibly resolve what the user was trying to do** — typos, wrong flags, made-up commands with obvious intent (`listdir` → shell-native equivalent), source edits the compiler pinpoints, single-file renames, missing imports, missing language-level packages where the package manager is unambiguous from the project (`ModuleNotFoundError` → `pip install`, `Cannot find module 'X'` → `npm install`, `unresolved import` in Rust → `cargo add`), bare words that look like a non-existent command but match an idiomatic one in this shell (`datetime` in PowerShell → `Get-Date`; `ll` on Windows PowerShell → `Get-ChildItem`).
+
+ If multiple shell commands are plausible interpretations, **commit to the single most likely one** for the current shell and mention the alternative in `rationale` ("Did you mean X? — Y is also possible.") rather than escalating to `explain`. The user can dismiss the suggestion if it's wrong; an unhelpful "intent is unclear" essay is worse than a best-guess fix.
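The missing-package part of this rule can be sketched as a lookup from error text to an install command. This is a hedged Python illustration only: the regular expressions and the function name `classify_missing_dependency` are assumptions, the error-to-command pairs and the `psql` / `docker` / `gh` examples come from the PR, and the real decision is made by the model reading the prompt.

```python
import re

# (pattern, install command template). The pairs mirror the prompt text;
# the regexes are illustrative guesses at typical error wording.
LANGUAGE_RULES = [
    (re.compile(r"ModuleNotFoundError: No module named '(\w+)"), "pip install {}"),
    (re.compile(r"Cannot find module '([^']+)'"), "npm install {}"),
    (re.compile(r"unresolved import `(\w+)"), "cargo add {}"),
]

# System CLIs the PR names as having an ambiguous install path.
SYSTEM_CLIS = {"psql", "docker", "gh"}
COMMAND_NOT_FOUND = re.compile(r"([\w.-]+): command not found")


def classify_missing_dependency(error_text):
    """Return a fix/explain decision, or None if this rule does not apply.

    Caveat: a module or crate name is not always the package name
    (import names and distribution names can differ).
    """
    for pattern, template in LANGUAGE_RULES:
        match = pattern.search(error_text)
        if match:
            return {"action": "fix", "command": template.format(match.group(1))}
    match = COMMAND_NOT_FOUND.search(error_text)
    if match and match.group(1) in SYSTEM_CLIS:
        return {"action": "explain"}
    return None
```

Returning `None` for everything else keeps the sketch scoped to the missing-dependency case instead of pretending to cover typos, wrong flags, or the other `fix` triggers.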
Force-pushed from d916600 to cafe581.
Force-pushed from cafe581 to 8808b96.
  ### `fix` — one deterministic command resolves it

- Use when you can write a single shell command (including in-place file edits) that fixes the error with certainty: typos, wrong flags, made-up commands with obvious intent (`listdir` → shell-native equivalent), source edits the compiler pinpoints, single-file renames, missing imports.
+ **The strong default.** Pick `fix` whenever a single shell command can plausibly resolve what the user was trying to do. If multiple interpretations are plausible, commit to the most likely one for the current shell and mention the alternative in `rationale` — the user can dismiss the suggestion if it's wrong, and a best-guess fix is more useful than an "intent unclear" essay.
Force-pushed from 8808b96 to ca20f9f.
Force-pushed from ca20f9f to d7c4cf5.
1. **Chat mode** — The user is seeking information or asking a question. They want an answer, not an action. Answer in prose. A recent error in the buffer is fine context to draw on when explaining; don't try to fix it unless the user explicitly asked for action.
   → Answer in prose. No tool calls. No JSON.
Should we distinguish between the user asking for an action and asking for information? A request for action should still trigger Mode A or Mode B explicitly.
Applied in 1d2f88d. Reframed both Chat and Mode A around the info-vs-action distinction: Chat = wants prose information and is not asking you to do anything, suggest a command, or address an error; Mode A = asking for an action — a recommended command, an operation on the system, or a fix for an error visible in the buffer. Tested: copilot 48/48, qwen 45/48 (only bare-word help still flips to Chat on qwen — defensible, that one is genuinely ambiguous).
```json
{"action": "fix", "title": "Use println! instead of printf!", "command": "(Get-Content src\\main.rs) -replace 'printf!', 'println!' | Set-Content src\\main.rs", "rationale": "Rust uses println!; compiler suggested the same."}
```
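The PR description mentions a miss caused by the model occasionally omitting the `json` fence around a response like the one above. One way a harness could tolerate that is sketched below in Python; the function name `parse_action` and the fallback strategy are assumptions, not the actual parser in the PowerShell runners.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)
KNOWN_ACTIONS = {"fix", "explain"}


def parse_action(response):
    """Return the decoded action object, or None if none can be recovered.

    Prefers the contents of a fenced block; otherwise falls back to the
    outermost {...} span anywhere in the raw response.
    """
    match = FENCED_BLOCK.search(response)
    payload = match.group(1) if match else response
    start, end = payload.find("{"), payload.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(payload[start:end + 1])
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and obj.get("action") in KNOWN_ACTIONS:
        return obj
    return None
```

A stricter harness might prefer to count an unfenced reply as a format failure; the lenient version here separates formatting flakes from genuine misclassification.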
I don't see that you restored this.
Restored in 1d2f88d — missed it when I reverted, sorry. Back to matching main.
`auto-fix.md`
- `fix` desc: add missing language-level packages where the package manager is unambiguous (`ModuleNotFoundError` -> `pip install`, `Cannot find module 'X'` -> `npm install`, Rust `unresolved import` -> `cargo add`).
- `explain` desc: narrow "tool not installed" to *system* CLIs where the install path is ambiguous (`psql` / `docker` / `gh`).

`terminal-agent.md`
- Chat-mode line: a non-empty buffer with an error disqualifies Chat. Even a bare "why?" / "explain" / "help" inherits that error as context and routes to Mode A or B.
- Mode A description: explicit "follow-up to a failed command in buffer always lands in Mode A — user wants the fix command, not prose."
- Tiebreaker: if you would emit prose followed by a code fence with a fix command, stop and emit a Mode A card instead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from d7c4cf5 to 1d2f88d.
@check-spelling-bot Report

| Dictionary | Entries | Covers | Uniquely |
|---|---|---|---|
| cspell:csharp/csharp.txt | 32 | 2 | 2 |
| cspell:aws/aws.txt | 232 | 2 | 2 |
| cspell:fonts/fonts.txt | 536 | 1 | 1 |

Consider adding to the `extra_dictionaries` array (in the `.github/actions/spelling/config.json` file):
"cspell:csharp/csharp.txt",
"cspell:aws/aws.txt",
"cspell:fonts/fonts.txt",

To stop checking additional dictionaries, put (in the `.github/actions/spelling/config.json` file):
"check_extra_dictionaries": []

Warnings ⚠️ (1)
See the 📂 files view, the 📜action log, 👼 SARIF report, or 📝 job summary for details.
| Count | |
|---|---|
| 54 |
What this fixes
Two prompt files used by WTA classify the user's intent before deciding how to respond. Two classifications were misrouting in ways that gave users a worse experience:

Bug 1 (Autofix): missing-package errors got an explanation instead of a fix
When a user ran a command and hit a "missing package" error, such as `ModuleNotFoundError` for `requests` or `Cannot find module 'express'`, autofix would respond with an explanation of the error rather than a one-line fix (`pip install requests`, `npm install express`). The fix is unambiguous in these cases, because the error itself tells you the package manager, so making the user read a paragraph is the wrong call.

Root cause: `tools/wta/prompts/auto-fix.md` told the model to route "tool not installed" cases to the `explain` action because system CLIs (psql, docker, gh, etc.) are genuinely ambiguous (apt vs brew vs winget vs scoop vs chocolatey). That blanket rule swept up language-level packages too.

Fix: Two sentence edits in `auto-fix.md`:
- The `fix` description now explicitly includes language-level packages when the package manager is unambiguous (`ModuleNotFoundError` → `pip install`, `Cannot find module 'X'` → `npm install`, Rust `unresolved import` → `cargo add`).
- The `explain` description narrows "tool not installed" to system CLIs where the install path is genuinely ambiguous.

Bug 2 (Terminal-agent chat): follow-up to a failed command got a generic chat reply
When the user ran a command that failed, then asked a short follow-up like "why?", "explain", or "help", the model often took the question as a generic chat prompt (Chat mode → prose answer) rather than a request to fix the command shown in the buffer (Mode A → run-this-command card).

Fix: Three edits in `tools/wta/prompts/terminal-agent.md`:
- Chat-mode line: a buffer that shows an error disqualifies Chat; even a bare "why?" / "explain" / "help" inherits that error as context and routes to Mode A or B.
- Mode A description: a follow-up to a failed command in the buffer always lands in Mode A.
- Tie-breaker: if the model would emit prose followed by a `powershell`/`bash` fix-command code fence, stop and emit a Mode A card instead.

Evidence
Both edits were validated by an offline A/B harness that sends a set of failing-terminal scenarios to the live LLM under two prompt variants, `baseline` (the prompt before this PR) and the edited prompt, with multiple trials each, then tallies how often the model's output matches the expected action (`fix` vs `explain` for autofix; Chat vs Mode A for terminal-agent). The harness itself is kept out of this PR because it relies on a live model and per-developer API config.

Autofix: 12 scenarios × 3 trials × 2 model backends (36 calls per variant per backend)
The aggregate understates the change: only two scenarios were broken, and they were the ones targeted:
- `ModuleNotFoundError: requests`
- `Cannot find module 'express'`

The remaining 1/36 miss on the "after" side is the model occasionally forgetting the `json` fence, which is a parser flake, not a classification regression.

Terminal-agent
Same harness pattern: scenarios where the user runs a typo-d command (`gti status`, `pythn --version`, `npm run buld`, etc.) and then asks "why?" / "help" / "?" / "fix?". Baseline frequently routed these to Chat (generic prose answer); after the edits they consistently route to Mode A (a card with the corrected command).

Files
- `tools/wta/prompts/auto-fix.md` (+2 / −2)
- `tools/wta/prompts/terminal-agent.md` (+3 / −1)

Why this is a separate PR from #123
These two prompt files travelled together with the custom-agent-save settings fix on the original branch by accident. They have nothing to do with custom-agent-save, so they're extracted here as a standalone change. PR #123 has been force-pushed to drop the three commits that landed here.