diff --git a/.github/workflows/ci-failure-scan.lock.yml b/.github/workflows/ci-failure-scan.lock.yml index bb1861e00f1fed..a95125a28233f0 100644 --- a/.github/workflows/ci-failure-scan.lock.yml +++ b/.github/workflows/ci-failure-scan.lock.yml @@ -69,7 +69,7 @@ name: "CI Outer-Loop Failure Scanner" # - maintainer # Roles processed as role check in pre-activation job # - write # Roles processed as role check in pre-activation job schedule: - - cron: "31 */12 * * *" + - cron: "34 */12 * * *" # Friendly format: every 12h (scattered) workflow_dispatch: inputs: @@ -1161,9 +1161,29 @@ jobs: await main(); detection: + # MANUAL PATCH (not produced by `gh aw compile`): pat_pool added to + # `needs:` so the case() expression in COPILOT_GITHUB_TOKEN below + # (which references `needs.pat_pool.outputs.pat_number`) actually + # resolves to a real PAT number instead of empty. + # + # Without this line, GitHub Actions silently resolves the reference + # to '', none of the case() arms match '0'..'9', the expression + # falls through to secrets.COPILOT_GITHUB_TOKEN (empty/stale in this + # repo), the api-proxy sidecar starts with a malformed Authorization + # header, and every Copilot API call from the detection pass 400s + # ("Authorization header is badly formatted"). Result: no + # THREAT_DETECTION_RESULT marker, parser fails, and gh-aw stamps a + # "Security scanning requires review" CAUTION banner onto every + # issue/PR the agent produced. + # + # gh-aw v0.71.5 fixed engine.env-based needs inference for the agent + # job only (#30232) — the threat-detection job still has the gap. + # This patch must be re-applied after every `gh aw compile` until + # gh-aw extends the same inference to the detection job. 
needs: - activation - agent + - pat_pool if: > always() && needs.agent.result != 'skipped' && (needs.agent.outputs.output_types != '' || needs.agent.outputs.has_patch == 'true') runs-on: ubuntu-latest diff --git a/.github/workflows/ci-failure-scan.md b/.github/workflows/ci-failure-scan.md index 1ca7541d977429..ac8faf5391488e 100644 --- a/.github/workflows/ci-failure-scan.md +++ b/.github/workflows/ci-failure-scan.md @@ -74,20 +74,64 @@ network: # CI Outer-Loop Failure Scanner -Platform-agnostic scan of `dnceng-public/public` outer-loop CI pipelines on `main`. Every actionable failure becomes either a draft PR (per-test fix) or a tracking issue (everything else). The intent is to keep outer-loop pipelines green without waiting on humans to file issues. +You are a CI triage agent. Each scheduled run, you scan a fixed list of `dnceng-public/public` outer-loop AzDO pipelines on `main`, classify failures, and emit gh-aw `safe-outputs` requests so every actionable failure converges on a Known Build Error issue (immediate effect on PR CI via Build Analysis) plus a follow-up test-disable PR (permanent effect after human merge). -## Pipelines to scan +The agent runs read-only. All writes go through `safe-outputs`. -Iterate over every pipeline in this list. For each, fetch builds on branch `main` filtered to `resultFilter=succeeded,failed,partiallySucceeded` (skip `canceled`). Pick the most recent such build as the "latest", then look back through ~10 prior completed builds to compute first-seen-in-window and occurrence counts. +## Hard rules — non-negotiable + +1. **All writes via `safe-outputs`.** No `issues: write`, no `contents: write`. Don't try to use `gh` to write. +2. **Caps per run: 5 `create_issue`, 10 `create_pull_request`.** On cap, record `-> skipped: cap reached` and move on. +3. **Labels: only `Known Build Error` and `blocking-clean-ci` on KBEs.** Every other label (`area-*`, `os-*`, `arch-*`, `disabled-test`, ...) is dropped by `allowed-labels`. 
Area triage is delegated to `dotnet/issue-labeler` (`.github/workflows/labeler-predict-issues.yml`); never propose area labels yourself. +4. **One area path per issue.** Title each KBE around a single failure shape (assertion text or test family), not a list of pipelines. If a root cause spans multiple area paths, file one KBE per area and cross-link with `Related: dotnet/runtime#`. +5. **No `Mute` / `Muting` in titles.** Use `Skip`, `Disable`, `Suppress`, or `Exclude`. +6. **Every issue and PR title starts with `[ci-scan] `.** +7. **Every actionable failure becomes a `Known Build Error` issue.** Test failures, hangs, AND build breaks all converge on the same KBE template; Build Analysis matches both via the JSON body. Skip emission entirely for: pre-existing issue/PR matches (Step 4.2-4.5), unstable signatures (< 2 occurrences in window with no current-run severity), or true infra noise (agent disconnect, pool offline) where no stable signature can be extracted. +8. **One signature = one outcome.** No duplicate KBEs. No comments on existing KBEs — Build Analysis already counts occurrences in the issue body. +9. **No same-run test-disable PR.** The KBE issue number is not visible at emit time (no `issues: write`), and the gap between runs is intentional — it forces a human-review window before disabling the test. +10. **All intermediate state under `/tmp/gh-aw/agent/`.** Each bash invocation is a fresh subshell; persist anything you want to keep. +11. **AzDO API: anonymous only.** Stay on `_apis/build/...`. Never call `_apis/test/...` or `vstmr.dev.azure.com` (both redirect to sign-in). +12. **Don't add `area-*` references to issue/PR titles.** Multi-area titles produce multi-label assignments from the labeler bot. + +## What this run must accomplish + +For every actionable failure, converge on these artifacts: + +| Artifact | Filed in | Same run? 
| +|---|---|---| +| Known Build Error issue | First run that sees the failure | Yes | +| Test-disable PR | First run that finds the KBE already exists | No — intentional next-run cadence | +| Fix PR (optional) | Same run as the test-disable PR, when the fix fits the small-fix bounds | Same run as test-disable PR | + +The `.NET Core Engineering Services: Known Build Errors` org project (`https://github.com/orgs/dotnet/projects/111`) is populated by `net-helix[bot]` automation that watches `dotnet/runtime` for the `Known Build Error` label and adds matching issues to the project within seconds. Build Analysis reads from the project. The only thing this workflow has to do for project linkage is apply the `Known Build Error` label on the KBE; do NOT try to mutate the project from this workflow. + +## Step-by-step + +Walk the steps in order. Do not skip. Stop at Step 6. + +### Step 1 — Orient + +Read once at start: + +- The skill matching the pipeline you are about to scan (routing table in Step 4.1). Skills live under `.github/skills/`. + +### Step 2 — Walk pipelines + +For each row in the pipeline table below, in order: + +1. Pre-bind the build-list URL to a shell variable on its own line, then `curl -s "$url" | tee /tmp/gh-aw/agent/builds_.json`. +2. Pick the most recent build with `result in {succeeded, failed, partiallySucceeded}`; skip `canceled`. +3. If no such build in the last 7 days -> record `pipeline-skipped: stale` in the tally and continue. +4. Pass failed timeline records to Step 3. 
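The Step 2 walk can be sketched in shell. This is a minimal sketch, assuming `jq` is available alongside `curl`: the embedded JSON stands in for a live AzDO list-builds response (build ids and timestamps are invented), the commented-out `curl` shows the real fetch, and `first` is correct only because AzDO returns builds newest-first by default.

```shell
#!/usr/bin/env bash
# Sketch of Step 2: pick the latest completed, non-canceled build.
# In the live run the JSON would come from (definition 108 as an example):
#   url="https://dev.azure.com/dnceng-public/public/_apis/build/builds?definitions=108&branchName=refs/heads/main&statusFilter=completed&resultFilter=succeeded,failed,partiallySucceeded&\$top=20&api-version=7.1"
#   curl -s "$url" | tee /tmp/gh-aw/agent/builds_108.json
mkdir -p /tmp/gh-aw/agent
cat > /tmp/gh-aw/agent/builds_108.json <<'EOF'
{"count":3,"value":[
  {"id":901,"result":"canceled","finishTime":"2024-05-03T01:00:00Z"},
  {"id":900,"result":"failed","finishTime":"2024-05-02T13:00:00Z"},
  {"id":899,"result":"succeeded","finishTime":"2024-05-02T01:00:00Z"}
]}
EOF
# Keep only succeeded/failed/partiallySucceeded (skip canceled); AzDO lists
# builds newest-first, so the first surviving entry is the "latest" build.
latest=$(jq -r '[.value[] | select(.result == "succeeded" or .result == "failed"
  or .result == "partiallySucceeded")] | first | .id' \
  /tmp/gh-aw/agent/builds_108.json)
echo "latest=$latest"
```

With the sample data above, the canceled build 901 is skipped and build 900 is selected.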
| Pipeline | Definition ID | Notes | -|----------|---------------|-------| +|---|---|---| | runtime-extra-platforms | 154 | Apple mobile, Android, browser, wasi, NativeAOT outer loop | | runtime-coreclr outerloop | 108 | | | runtime-coreclr jitstress | 109 | JIT stress modes | | runtime-coreclr jitstressregs | 110 | | | runtime-coreclr jitstress2-jitstressregs | 111 | | | runtime-coreclr gcstress0x3-gcstress0xc | 112 | | | runtime-coreclr gcstress-extra | 113 | | | runtime-coreclr r2r-extra | 114 | | | runtime-coreclr jitstress-isas-x86 | 115 | | @@ -116,239 +160,158 @@ Iterate over every pipeline in this list. For each, fetch builds on branch `main | runtime-interpreter | 316 | ADO name differs from display name | | runtime-libraries-interpreter | 330 | ADO name differs from display name | -If a pipeline has no completed build in the last 7 days, skip it silently. - -## Skills to consult per failure - -Read the relevant skill before classifying / fixing. Skills live under `.github/skills/`. +### Step 3 — Classify each failure (log-extraction only)
-- **NativeAOT outer loop** → check `eng/testing/tests.*aot*.targets` and the test `.csproj` for AOT-specific conditions before suggesting a fix. -- **Generic CI triage** → `ci-pipeline-monitor/SKILL.md` for known-failure-shape patterns and Build Analysis matching. +Classification here drives WHERE the agent reads the signature text from. It does NOT drive WHERE the issue gets filed — every actionable signature flows through Step 4 + Step 5 Branch A. The timeline graph is `Stage -> Phase -> Job -> Task`; walk it via `parentId`. Drill into one representative console log per signature to confirm the shape. -## Outcome (per actionable failure) +1. **Build break.** Failed task is `Build product` / `Build native components` / `Configure CMake` / any pre-test compile step, AND `Send to Helix` is `skipped`. Read the signature from the failing compile task log (CSxxxx / linker error / cmake error line). +2. **Phase/Stage-only failure with no failed Job underneath.** Compile breaks aggregated at phase level (e.g. `windows-arm64 checked` on JIT stress pipelines). Open the Phase log + the latest log of any non-succeeded child Task and treat as build break. +3. **Helix work-item failure.** `Send to Helix` succeeded but Job still failed. Extract Helix job IDs from the `Send to Helix` log (`Sent Helix Job: `), query Helix work items, fetch the failing console log, locate the `[FAIL]` line. +4. **Dead-lettered Helix work item.** Console URI contains `helix-workitem-deadletter`. Extract `[FAIL]` line if present; if not, treat as infra noise (no stable signature) and skip emission entirely — record `skipped: infra noise — no stable signature` in the tally. +5. **Infra-shaped Job failure with no Helix work items.** `Initialize job` failed / agent disconnect / `Pool is offline`. Skip emission entirely — record `skipped: infra noise — no stable signature` in the tally. -The primary purpose of this workflow is to keep PR CI green. 
**KBE** = Known Build Error: an issue tagged `Known Build Error` whose body contains a JSON `ErrorMessage`/`ErrorPattern` block that Arcade Build Analysis matches against future failure logs to mark them as already-tracked, so unrelated PRs aren't blocked. KBEs are immediately effective for PR CI; muting PRs are not effective until merged by a human (latency ≥ 12h, often days). The workflow runs every 12h and converges on **two artifacts per failure across two runs**: KBE in run N (immediate), muting PR in run N+1 (permanent after merge), with a small-fix PR added in run N+1 when scope allows. +For each (1)/(2)/(3) signature, compute the tuple `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL]-or-compile-error signature)`. Look back ~10 prior completed builds in the same definition for first-seen-in-window timestamp and occurrence count. -### Per-failure deliverables +#### Data sources -For each actionable failure, produce **up to three artifacts**: +- **AzDO REST.** `https://dev.azure.com/dnceng-public/public/_apis/build/...`. Anonymous, no auth. + - List builds: `?definitions={id}&branchName=refs/heads/main&statusFilter=completed&resultFilter=succeeded,failed,partiallySucceeded&%24top=20&api-version=7.1` + - Timeline: `/builds/{id}/timeline?api-version=7.1` returns flat `records[]`; reconstruct via `parentId`. A failed record with non-null log id is a leaf to inspect. +- **Helix REST.** `https://helix.dot.net/api/jobs/{jobId}/workitems?api-version=2019-06-17`. Each item has `Name`, `State`, `ExitCode`, `ConsoleOutputUri`. Failed: `ExitCode != 0` or `State == "Failed"`. +- **Build Analysis attachment (best-effort).** `https://dev.azure.com/dnceng-public/public/_apis/build/builds/{id}/attachments/Build_Analysis_KnownIssues_v1?api-version=7.1`. Use to dedupe. 404 = none attached; do not fail. -1. **KBE** — immediate Build Analysis signal so PR CI is unblocked right away. Always produced (or reused if one already exists) for stable-signature failures. 
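The failed-leaf rule over the flat `records[]` array can be sketched as below. The timeline JSON here is a hand-written stand-in (record ids, names, and log id 42 are invented) whose shape mirrors `/builds/{id}/timeline?api-version=7.1`; the `jq` filter keeps failed records that carry a log id and prints their `parentId` so the Stage/Phase/Job chain can be reconstructed.

```shell
# Sketch of the failed-leaf rule: a failed record with a non-null log id is a
# leaf to inspect; parentId links it back up the Stage -> Job tree.
mkdir -p /tmp/gh-aw/agent
cat > /tmp/gh-aw/agent/timeline_900.json <<'EOF'
{"records":[
 {"id":"s1","parentId":null,"type":"Stage","name":"Build","result":"failed","log":null},
 {"id":"j1","parentId":"s1","type":"Job","name":"linux-x64 checked","result":"failed","log":null},
 {"id":"t1","parentId":"j1","type":"Task","name":"Build product","result":"failed","log":{"id":42}},
 {"id":"t2","parentId":"j1","type":"Task","name":"Send to Helix","result":"skipped","log":null}
]}
EOF
# Failed leaves only: result == "failed" AND log.id non-null.
jq -r '.records[] | select(.result=="failed" and .log != null)
       | "\(.type)/\(.name) log=\(.log.id) parent=\(.parentId)"' \
  /tmp/gh-aw/agent/timeline_900.json
```

On the sample data this prints only `Task/Build product log=42 parent=j1`; the failed Stage and Job records are containers, not leaves, and the skipped `Send to Helix` confirms the build-break shape.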
-2. **Muting PR** — small, clean, mergeable PR that just adds `[ActiveIssue(...)]` / `` referencing the KBE. No diagnosis logic, no product code. Designed to be merge-without-thinking by any maintainer who agrees the failure should be silenced. Always produced when (1) is produced. -3. **Fix PR** — actual product/test code fix. Produced **only when** (a) the root cause is clear from the failure log, (b) the change fits the "small product fix opportunity" bounds (≤ 20 lines, single file, non-API, non-JIT-codegen, non-GC, non-threading, non-security), and (c) the failing test verifies the fix. Otherwise the deeper investigation is left to the area owner via the KBE — do NOT attempt a speculative fix PR. +### Step 4 — Per-signature walk -The muting PR and the fix PR are independent: a maintainer can merge the muting PR immediately (CI goes green) and then iterate on the fix PR at human pace. If the fix PR lands first, the muting PR becomes a no-op and can be closed; if the muting PR lands first, the fix PR removes the `[ActiveIssue]` annotation. +For each `(definition_id, phase, queue, stress_mode, signature)` produced by Step 3: -### Two-pass KBE → PR flow (across runs) +#### Step 4.1 — Load the matching skill -Same-run KBE + PR is not possible: gh-aw strict mode forbids `issues: write` on the agent job, so the agent cannot create issues at runtime — it can only emit safe-outputs `create_issue` directives that are processed by a separate post-agent job after the agent finishes. Issue numbers are therefore never visible to the agent during execution. Patches cannot reference an issue number that doesn't exist yet. +| Pipeline category | Skill | +|---|---| +| Mobile (`runtime-extra-platforms`; ios/tvos/maccatalyst/android/iossimulator/tvossimulator) | `mobile-platforms/SKILL.md` | +| JIT / GC / PGO stress (definitions 109–160, 230, 235, `runtime-jit-experimental`) | `jit-regression-test/SKILL.md` (repro extraction); `ci-pipeline-monitor/SKILL.md` (triage). 
JIT product fixes are out of scope for autofix — file an issue and `@`-mention JIT area owners. | +| Browser/WASM, WASI | `mobile-platforms/SKILL.md` (WASM sections); `extensions-review/SKILL.md` if failure is in `Microsoft.Extensions.*`; `system-net-review/SKILL.md` if in `System.Net.*`. | +| NativeAOT outer loop | Check `eng/testing/tests.*aot*.targets` and the test `.csproj` for AOT-specific conditions before suggesting a fix. | +| Generic | `ci-pipeline-monitor/SKILL.md` | -The agent must accept this constraint and produce KBEs in run N, then companion PRs in run N+1. The 12-hour cadence makes this acceptable: the KBE alone unblocks PR CI immediately (the moment the safe-outputs job processes it, ~1 min after the agent finishes), and the muting PR follows within 12h. +#### Step 4.2 — Search for an existing KBE -For each actionable failure, walk through all six checks below before deciding the action — multiple can fire at once, and any one is reason to stop. +`is:issue is:open label:"Known Build Error" in:body ""`. Try variations: full `[FAIL]` line; assertion text; exception class + test name. On hit, record `existing-kbe #` and continue (the walk does not end — a KBE hit changes the final action, not the inspection). -#### Step 1 — Look for an existing KBE. +#### Step 4.3 — Search for an area-team tracker (no KBE label) -Search `is:issue is:open label:"Known Build Error" in:body ""`. Try variations: the full `[FAIL]` line, the assertion text, the exception class plus the test name. On a hit, record the issue number as `existing-kbe` and continue — finding a KBE doesn't end the walk, it changes the final action. +`is:issue is:open in:title ""` AND `in:body ""`. On hit, record `linked-tracker #`. A plain tracker is NOT a KBE substitute (Build Analysis only matches `Known Build Error`-labeled issues with a valid JSON body). File a fresh KBE and cross-link the tracker as `Tracking: dotnet/runtime#` inside the KBE body and the test-disable PR body. 
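The Step 4.2/4.3 searches can be sketched as query construction. The test name and assertion text below are hypothetical examples, not taken from a real failure; in the live run the resulting strings are handed to the github search tool rather than echoed.

```shell
# Sketch of building the Step 4.2 (KBE) and Step 4.3 (tracker) search queries
# from one signature. Both the test name and assertion text are invented.
sig='Assert.Equal() Failure: Expected 0, Actual 1'
test_name='System.Net.Sockets.Tests.SendReceive'
kbe_query="is:issue is:open label:\"Known Build Error\" in:body \"$sig\""
tracker_query="is:issue is:open in:title \"$test_name\" in:body \"$sig\""
echo "$kbe_query"
echo "$tracker_query"
```

Variations (full `[FAIL]` line, assertion text only, exception class plus test name) are produced by swapping what goes into `$sig`.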
-#### Step 2 — Look for an area-team tracker without the KBE label. +#### Step 4.4 — Search for an existing test-disable PR -Some teams track recurring failures in plain issues. Search `is:issue is:open in:title ""` together with `in:body ""`. On a hit, record it as `linked-tracker` — but **do not** treat the tracker as a substitute for a KBE. Build Analysis only matches against issues that carry the `Known Build Error` label and a valid JSON body, so a plain tracker won't unblock PR CI on its own. File a new KBE for Build Analysis to match against, and cross-link the tracker (`Tracking: dotnet/runtime#`) inside the KBE body and the muting PR body. +`is:pr is:open in:title "" "[ci-scan]"` and `is:pr is:open "" ActiveIssue`. On hit, record `existing-PR #` (test-disable) and stop the walk for this signature. -#### Step 3 — Look for an existing muting PR. +#### Step 4.5 — Search for an in-flight fix PR by anyone -Search `is:pr is:open in:title "" "[ci-scan]"` and `is:pr is:open "" ActiveIssue`. On a hit, record `→ existing-PR #` (muting) and stop. +Broad search (NOT only `[ci-scan]` PRs): `is:pr is:open ""`, `is:pr is:open ""`, `is:pr is:open "" in:title`. Fetch each candidate body; if it claims to fix this failure or links the same KBE, record `existing-PR #` (in-flight fix) and stop. -#### Step 4 — Look for an in-flight fix PR by anyone. +#### Step 4.6 — Verify every embedded issue number exists -Search broadly — not only `[ci-scan]` PRs — by test name, file path, and assembly: `is:pr is:open ""`, `is:pr is:open ""`, `is:pr is:open "" in:title`. For each candidate, fetch the PR body; if it claims to fix this failure (or links the same KBE), stop and record `→ existing-PR #` (in-flight fix). +For every `` you plan to write into source (`[ActiveIssue("...issues/")]`, `Linked KBE: #`, inline ``) call `issue_read` with `get` and `{owner: "dotnet", repo: "runtime", issue_number: }`. Confirm it returns an open issue. If it does not -> stop. 
A dead-link annotation in source requires a follow-up PR to remove. -#### Step 5 — Verify every issue number you're about to write actually exists. +#### Step 4.7 — Confirm a test-disable is welcome on the candidate issue -For every `` you plan to embed in source (`[ActiveIssue("...issues/")]`, `Linked KBE: #`, the inline `` comment), call the github tool `issue_read` with method `get` (`{"method": "get", "owner": "dotnet", "repo": "runtime", "issue_number": }`) and confirm it returns an open issue. If it doesn't, stop — a dead-link annotation in source requires a follow-up PR to remove. +Read the candidate KBE / tracker body + its most recent area-owner comment. Skip the test-disable (record `-> skipped: do-not-disable on issue #`) if ANY of: -#### Step 6 — Confirm muting is welcome on this issue. +- Body or recent comment from area owner says `please don't disable`, `do not mute`, `keep failing`, `investigation in progress`. +- Issue carries a label semantically equivalent to "do not mute" (verify the label exists in `dotnet/runtime` before relying on it; do not invent labels). +- Most recent area-owner comment within the last 14 days opposes disabling the test on procedural grounds (fix-forward request, awaiting JIT/GC repro). -Read the candidate KBE / tracker's body and its most recent area-owner comment. Skip muting (record `→ skipped: do-not-mute on issue #` and stop) if any of the following holds: +When in doubt -> skip the test-disable and let the next run revisit. -- The body or a recent comment from an area owner explicitly says not to mute, disable, or skip — e.g. "please don't disable these tests", "do not mute", "keep failing", "investigation in progress". -- The issue is labeled with anything semantically equivalent (verify the label exists in `dotnet/runtime` before relying on it; do not invent labels). -- The most recent area-owner comment (within the last 14 days) actively opposes muting on procedural grounds — e.g. 
requesting a fix-forward, awaiting a JIT/GC repro. +#### Step 4.8 — Verify the candidate KBE actually matches (4-question check) -When in doubt, skip muting and let the next run revisit; over-muting against an active investigation is the failure mode this step exists to prevent. +Before writing `Linked KBE: #` or `[ActiveIssue("...issues/")]`, answer: -#### What action to take +1. Does the candidate KBE describe the same test (or test family) as the current `[FAIL]` line? +2. Does its `ErrorMessage` / quoted exception text describe the same failure signature (exception class, assertion message)? +3. Is the failing OS in the set the KBE says it impacts? +4. Is the failing architecture in the set the KBE says it impacts? -- **Step 1 found nothing** → file a new KBE via safe-outputs `create_issue` with the body template below, title prefix `[ci-scan] `, and only the labels `Known Build Error` and `blocking-clean-ci`. If Step 2 found a tracker, cross-link it as `Tracking: dotnet/runtime#` in the KBE body. The issue number isn't visible during this run, so the muting PR is deferred to the next run. -- **Step 1 found a valid KBE, AND steps 3–6 are clean** → open the muting PR via safe-outputs `create_pull_request`. Diff ≤ 5 lines, only test annotations or csproj flags. The body must include `Linked KBE: #` as a top-level line plus the four-question verification block below. If Step 2 also found a tracker, cite it as `Tracking: dotnet/runtime#` alongside `Linked KBE`. -- **Plus, if the failure satisfies the "small product fix opportunity" criteria above** → open a separate fix PR on its own branch. Body cites (a) the failing test as evidence, (b) the root cause, (c) why the fix is safe, (d) `Linked KBE: #`, and (e) "If this lands before #, that PR can be closed." Kept separate so a maintainer can take one without the other. +If any answer is no -> file a fresh KBE this run instead. Embed the four answers in the test-disable PR body's `Reasoning` section. 
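The Step 4.6 gate can be sketched as a small decision function. In the live run the check is the github `issue_read` tool; the `gh api` call shown in the comment is the shell equivalent, and it is stubbed out here (issue numbers 101/102 and their states are invented) so the sketch runs offline.

```shell
# Sketch of Step 4.6: only an OPEN issue may be embedded in source annotations.
issue_state() {
  # Live form (requires auth):
  #   gh api "repos/dotnet/runtime/issues/$1" --jq .state
  case "$1" in
    101) echo open ;;     # stand-in: KBE still open, safe to reference
    102) echo closed ;;   # stand-in: closed issue, would be a dead link
    *)   echo missing ;;  # stand-in: number does not resolve at all
  esac
}
for n in 101 102; do
  if [ "$(issue_state "$n")" = "open" ]; then
    echo "issue $n: ok to reference"
  else
    echo "issue $n: stop, dead-link annotation would need a follow-up PR"
  fi
done
```

Anything other than `open` stops the walk for that signature, matching the rule that a dead-link annotation in source costs a follow-up PR to remove.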
-#### Before you link a KBE, verify it actually matches +Optional fifth check when the candidate KBE is older than ~14 days: confirm Build Analysis is still matching it. `gh api graphql` over `userContentEdits` gives the edit timeline; a stale never-edited body hints the signature went bad. -Test-name overlap alone is not enough — common wrong-link patterns include reusing a KBE filed against a different architecture, or one about an exception class for a failure that's actually a work-item timeout. Answer all four before writing `Linked KBE: #` or `[ActiveIssue("...issues/")]`: +### Step 5 — Decide and emit -1. Does the candidate KBE describe the **same test (or test family)** as the current `[FAIL]` line? -2. Does its `ErrorMessage` / quoted exception text describe the **same failure signature** (exception class, assertion message)? -3. Is the failing **OS** in the set the KBE says it impacts? -4. Is the failing **architecture** in the set the KBE says it impacts? +Exactly one of Branch A / B fires per signature. Branch C is an additive refinement of Branch B (Branch B's outputs are still emitted, plus an additional small-fix PR). Signatures that do not match any branch get `skipped: ` in the tally and emit nothing. -If any answer is no, file a fresh KBE this run and defer the muting PR. Embed the four answers in the PR body's "Reasoning" section; PRs missing this, or with an unaddressed mismatch, will be closed. +**Branch A — No existing KBE; signature is stable.** -Optional fifth check when the candidate KBE is older than ~14 days: confirm Build Analysis is still actually matching it. The hit count appears in the issue body and is rewritten by Build Analysis on every match — a stale, never-edited body is a hint the signature went bad. `gh api graphql` over `userContentEdits` on the issue gives the edit timeline. 
+Stable means >= 2 occurrences in the ~10-build window, OR a build break that fails all legs of the current build (block-everyone severity that warrants filing on first sight). Emit one `create_issue` using the KBE template. Apply both `Known Build Error` and `blocking-clean-ci` labels so the org project auto-add rule picks it up; do NOT try to mutate the project from this workflow. -#### Caps and end-of-run check +If Step 4.3 found a tracker, cross-link as `Tracking: dotnet/runtime#` in the KBE body. Test-disable PR is deferred to the next run. -Per-run caps: `create_issue` max 5, `create_pull_request` max 10. On cap, record `→ skipped: cap reached` — the next run picks them up. +**Branch B — Existing KBE; no test-disable PR; test-disable is welcome (Step 4.7 clean).** -Before stopping, confirm each failure is handled: +Emit one `create_pull_request` using the Test-disable PR template. Diff <= 5 lines; only test annotations or csproj flags. Body MUST include `Linked KBE: #` as a top-level line plus the Step 4.8 four-question block. -- **No existing KBE** → KBE filed? -- **Existing KBE, no muting PR yet** → muting PR opened (and the optional fix PR if criteria are met)? -- **Existing KBE plus existing muting PR or in-flight fix** → `→ existing-PR #` recorded? +Build-break KBEs cannot be disabled — there is no test annotation that can skip a compile error. Skip Branch B for build-break signatures (record `skipped: build break — no test-disable path` in the tally) and rely on Branch C (small-fix PR) when the fix is mechanical, or on the area owner otherwise. -If the answer is no for any failure, the run is incomplete. +**Branch C — Refinement of Branch B when the failure satisfies the small-fix bounds.** -### Per-failure-class rules +Small-fix bounds: <= 20 lines, single file, non-API, non-JIT-codegen, non-GC, non-threading, non-security; the failing test (or compile error) verifies the fix. -The two-pass flow above applies to all classes below. 
"KBE + muting PR" means: KBE in the run that first encounters the failure, muting PR in the next run that finds the KBE already exists. +In addition to the Branch B test-disable PR (test failures) or directly against the existing KBE (build breaks), emit a separate `create_pull_request` for the fix on its own branch. Build-break fixes are limited to obvious mechanical changes (typo, missing `#if`, wrong cast, missing `using`). Body cites (a) failing test or compile error as evidence, (b) root cause, (c) why fix is safe, (d) `Linked KBE: #`, (e) "If this lands before #, that PR can be closed." (omit (e) for build-break fixes). -- **Recurring failure with a stable error signature** (≥ 2 occurrences on `main` in the scanned window) → KBE (run N) + muting PR (run N+1) + fix PR (optional, run N+1, only if criteria met). -- **Per-test platform / configuration incompatibility** (e.g., test fails only under `jitstress=2`, `gcstress=0xC`, on a single mobile arch, on browser, on NativeAOT) → KBE (run N) + muting PR (run N+1). The muting PR's skip condition MUST be **as narrow as the observed failure scope** — only the OS / arch / config combinations that actually fail. +After emitting, record the outcome per signature (Step 6). - | Observed failure scope | ❌ Bad (too broad) | ✅ Good (matches scope) | - |---|---|---| - | Only `linux-arm` fails | `[SkipOnPlatform(TestPlatforms.AnyUnix, ...)]` or muting on all NativeAOT | `true` | - | Only NativeAOT on a single arch | `true` (all arches) | `true` | - | Only one stress mode | `true` (all stress modes) | Add stress-mode predicate, e.g. gate via the existing `GCStressIncompatible` only for the failing variant | +### Step 6 — Per-pipeline tally + end-of-run summary - In the PR's "Reasoning" section, list the exact set of failing legs (definition + queue + stress mode) that justifies the chosen condition, so a reviewer can verify scope matches evidence. 
-- Allowed muting PR mechanisms: - - `[SkipOnPlatform(TestPlatforms., "")]` for platform-specific failures. - - `[ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.))]` narrowed via existing helpers. - - `[ActiveIssue("https://github.com/dotnet/runtime/issues/", TestPlatforms.)]` referencing the KBE. - - For JIT/GC stress: `[ActiveIssue("...", typeof(TestLibrary.PlatformDetection), nameof(TestLibrary.PlatformDetection.IsStressTest))]` or `true` at the csproj level. **Tradeoff**: stress-guarded skips remove the test signal from the stress pipelines, so the bug becomes invisible in those pipelines until the JIT fix lands. The KBE filed in run N is what keeps the JIT team aware; without that KBE, the muting PR alone would silently lose the signal. -- **Build break on a single leg** (`Build product` or similar failed; `Send to Helix` skipped) → if the compile error has a clear, mechanical root cause and the fix is **≤ 20 lines in a single file** (e.g., obvious typo, missing `#if`, wrong type cast, missing `using`), open a fix PR (no KBE — Build Analysis explicitly forbids KBEs for build breaks). If the fix is non-trivial, file a regular tracking issue and reference the failing source file and compile error. -- **Anything else** — multi-assembly cluster, infrastructure (queue exhaustion / dead-letter / device-lost) — file a tracking issue (not a KBE). Group all infra failures from one run into a single issue. Before filing, `search_issues` for an open issue whose title or body matches the same failure signature and skip silently if one already exists (do not duplicate, do not append a comment — the agent only has read permission on existing issues). +Per signature, append one outcome line to `/tmp/gh-aw/agent/coverage/.txt`: -For each failure compute a `(definition_id, work_item_or_phase, queue, stress_mode, [FAIL] or compile-error signature)` signature. 
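The Step 6 tally append can be sketched as follows. The exact outcome-line format is not pinned down in this prompt, so the `pipeline | signature -> outcome` shape, the `108.txt` file name, and issue `#101` are all assumptions for illustration.

```shell
# Sketch of the per-signature tally append under /tmp/gh-aw/agent/coverage/.
# Line format and file name are assumed, not specified by the prompt.
mkdir -p /tmp/gh-aw/agent/coverage
tally=/tmp/gh-aw/agent/coverage/108.txt
printf '%s | %s -> %s\n' \
  "runtime-coreclr outerloop" \
  "SendReceive [FAIL] Assert.Equal() Failure" \
  "existing-kbe #101, skipped: do-not-disable on issue #101" >> "$tally"
cat "$tally"
```

Because each bash invocation is a fresh subshell, appending to a file (rather than a shell variable) is what lets the end-of-run summary read back every outcome.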
Look back through ~10 completed builds in the same definition to build first-seen-in-window timestamp and occurrence count. - -**Convergence target**: across two consecutive runs, every actionable test/runtime failure ends up with both (a) a KBE filed (immediate effect on PR CI via Build Analysis) and (b) a clean muting PR open against that KBE (permanent effect after merge, low review cost). The fix PR is a bonus when the root cause is obviously small. A tracking-issue-only outcome is acceptable only for build breaks (which Build Analysis cannot match) and infra failures. - -Do not emit `noop`. Either a PR or an issue must come out of every actionable failure. - -Cap: **10 PRs and 5 issues per run.** Group failures that share one fix into a single PR. Group failures with the same root cause into a single issue. - -## Data sources - -- AzDO REST: `https://dev.azure.com/dnceng-public/public/_apis/build/...`. Anonymous access only — do **not** call `_apis/test/...` or `vstmr.dev.azure.com`; both redirect to sign-in. Stay on `builds`, `builds/{id}/timeline`, `builds/{id}/logs/{logId}`. - - List builds: `?definitions={id}&branchName=refs/heads/main&statusFilter=completed&resultFilter=succeeded,failed,partiallySucceeded&%24top=20&api-version=7.1`. - - Timeline: `/builds/{id}/timeline?api-version=7.1` returns a flat `records[]` array; reconstruct the tree via `parentId`. - - Failed-leaf rule: a record with `result == "failed"` whose log id is non-null is a leaf to inspect; failed Stage/Phase records without a failed child Job indicate a build break — open the parent Phase log and the most recent non-succeeded Task log. -- Helix REST: `https://helix.dot.net/api/jobs/{jobId}/workitems?api-version=2019-06-17`. Helix job IDs come from the `Send to Helix` Task log, which is a child of the failed Job. Each work item has `Name`, `State`, `ExitCode`, `ConsoleOutputUri`. Failed: `ExitCode != 0` or `State == "Failed"`. 
Console URIs containing `helix-workitem-deadletter` are dead-lettered (queue had no agent) — group as infra. -- Build Analysis attachment (best-effort, may 404): `https://dev.azure.com/dnceng-public/public/_apis/build/builds/{id}/attachments/Build_Analysis_KnownIssues_v1?api-version=7.1`. Use to dedupe against already-known issues. A 404 means none were attached; do not fail. - -## Failure classification - -Classify every failed timeline record before deciding whether to PR or file an issue. The timeline graph is `Stage → Phase → Job → Task`. Walk it as follows: - -1. List every record with `result == "failed"`. For each failed Job, list its child Tasks (records whose `parentId == job.id`). -2. **Build break (no test ever ran)**: among the Job's Tasks, the failed Task is `Build product`, `Build native components`, `Configure CMake`, or any pre-test compile step, **and** the `Send to Helix` Task is `skipped`. → tracking issue. Do **not** attempt a test-side fix. -3. **Phase/Stage-only failure with no failed Job underneath**: typical of compile-time breaks aggregated at the phase level (e.g. `windows-arm64 checked` on the JIT stress pipelines). Open the Phase log and the latest log of any non-succeeded child Task; classify as build break and file a tracking issue. -4. **Send to Helix succeeded but the Job still failed**: open the `Send to Helix` log, extract Helix job IDs (look for `Job on ` or `JobId: `; the Helix info-mart log entry that always appears is `Sent Helix Job: `), then query Helix for failed work items. This is the test-failure path. -5. **Helix work item failure**: confirm via `ConsoleOutputUri`. `helix-workitem-deadletter` URIs → infra (group into one issue). Otherwise fetch the console log, find the `[FAIL]` line, and proceed to PR vs issue selection. -6. **Infra-shaped Job failure** without Helix workitems (e.g., `Initialize job` failed, agent disconnect, "Pool is offline") → file a single grouped infra issue, do not retry per-leg. 
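The failed-leaf rule in the walk above can be exercised end-to-end with only the allowlisted tools (`tee` instead of `>`, no `$(...)`). This is a minimal sketch: the `records[]` sample is invented to show the shapes, and a real `/builds/{id}/timeline` response carries many more fields per record.

```bash
mkdir -p /tmp/gh-aw/agent
# Invented sample: a failed Stage (no log), a failed Job (leaf, log id 42),
# and a skipped "Send to Helix" Task under that Job.
printf '%s' '{"records":[
  {"id":"s1","parentId":null,"type":"Stage","name":"Build","result":"failed","log":null},
  {"id":"j1","parentId":"s1","type":"Job","name":"linux-x64 Release","result":"failed","log":{"id":42}},
  {"id":"t1","parentId":"j1","type":"Task","name":"Send to Helix","result":"skipped","log":null}
]}' | tee /tmp/gh-aw/agent/timeline.json | jq -r \
  '.records[] | select(.result=="failed" and .log != null)
   | "\(.type) \(.name) log=\(.log.id)"'
# Prints: Job linux-x64 Release log=42
# s1 fails but has no log, so it is not a leaf; t1 is skipped, which flags
# this Job as a build-break candidate (Send to Helix never ran).
```

The same `select` shape extends naturally to the parent walk: filter on `.parentId == "j1"` to list a failed Job's child Tasks.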
- -Drill into one representative console log per signature to confirm the shape before classifying. - -## PR body - -Five H2 sections, in this exact order: - -1. **Reasoning** — why the test fails on the affected platform/configuration; why the chosen attribute is the right fix; why this is a test-side fix and not a product bug. -2. **Impact on platforms** — bullet list of `(pipeline + platform/arch + Helix queue + stress mode + exit code)` per affected occurrence. -3. **Errors log** — sanitized excerpt from the Helix console log (the `[FAIL]` line, the assertion or exception, and the `Failed tests:` summary). Strip JWTs, bearer tokens, `ApplicationGatewayAffinity*=`, and per-user paths. -4. **First build it occurred** — first build in the scanned window where this signature appeared: build link, finish time, commit SHA, occurrences-in-window count. State explicitly that this is computed within the scanned window and may not be the true origin. -5. **Linked issue** (optional) — if an `ActiveIssue` reference is used, link the issue. - -Branch from `origin/main`. Stage only the files you intend to change with `git add `; never `git add -A`. Verify with `git diff --name-only --cached` before committing. Do not include any labels in the PR (see "Outputs: title and labels" below). - -## Issue body - -Use this when a PR is not the right tool — product regression, native crash, multi-assembly cluster, infra requiring an owner, JIT/GC product bug. Same four sections as a PR (Reasoning, Impact on platforms, Errors log, First build it occurred), plus a fifth: - -5. **Recommended action** — concrete next step: which area owner, which file likely needs the fix, or what investigation would localize the root cause. For JIT/GC issues include the exact stress mode env vars and the JIT method-name from the log. Reference any related PR or issue you found via `search_issues`. The issue must be actionable — a checkbox-ready task list, not just "FYI". 
+``` + +``` -Do not include any labels in the issue creation request (see "Outputs: title and labels" below). +`` is one of: `filed-issue #aw_`, `filed-PR #aw_`, `existing-issue #`, `existing-PR #`, `skipped: `. -### JIT pipeline issue template (definitions 109–160, 230, 235, 108, 137, 144–145, 150, 153) +A skipped signature MUST have a reason (e.g., `build canceled`, `< 2 occurrences and not blocking`, `do-not-disable on issue #`, `cap reached`, `infra noise — no stable signature`, `build break — no test-disable path`). -For tracking issues filed against a JIT, GC, PGO, or stress pipeline, use this body layout instead of the generic "five sections" above (matches the in-repo convention; see #125685 for the canonical example): +At end of run, print this table to the agent log: ``` -**Summary:** - - -**Failed in ():** -- [ ]() -- [ ]() -- ... - -**Console Log:** [Console Log]() - -**Failed tests:** -(use a fenced code block; per-pipeline, list the failing legs and tests) - -- - - - - - -- - - - -**Error Message:** -(fenced code block with the canonical error line) - -**Stack Trace:** -(fenced code block with the relevant stack trace; trim noise but keep the failing frame) +| pipeline | total-signatures | issues-filed | prs-filed | reused-existing | skipped-with-reason | ``` -This format makes the issue immediately actionable for JIT/GC owners (@JulieLeeMSFT, @BruceForstall, @jakobbotsch, @dotnet/jit-contrib) without further drilldown. Area triage (`area-CodeGen-coreclr` / `area-GC-coreclr` / `area-PGO-coreclr` / `area-Tools-ILVerification`) is added later by a human reviewer — do not propose any `area-*` label yourself. - -## Outputs: title and labels - -- **All issues and PRs MUST have title prefix `[ci-scan] `**, including tracking issues, Known Build Error issues, and muting PRs. Examples: - - `[ci-scan] Test failure: on ` - - `[ci-scan] Known Build Error: ` - - `[ci-scan] Skip under (refs #)` -- **Do not use the word "Mute" or "Muting"** in titles. 
Use "Skip", "Disable", "Suppress", or "Exclude" depending on the mechanism. Examples: "Skip … under GCStress", "Disable … on tvOS", "Suppress … in MiniFull AOT mode". -- **Labels (hard restriction).** You **MUST NOT** propose any labels in your output. The workflow auto-applies `agentic-workflows` to every issue and PR, and additionally permits **only** `Known Build Error` and `blocking-clean-ci` on Known Build Error issues. Any other label — `os-*`, `area-*`, `arch-*`, `disabled-test`, `jit-stress`, `gc-stress`, `pgo`, `nativeaot`, `untriaged`, etc. — is rejected by `safe-outputs.allowed-labels` and **will be dropped**. Do not invent new labels under any name. Area, OS, and arch triage is performed by a human reviewer after the issue/PR is filed; do not attempt to pre-apply or guess them. +## Templates -## Known Build Error issue +Emit each template verbatim except for `` slots. Match headings exactly — Build Analysis is strict about `## Error Message` and the JSON fence shape. -A Known Build Error is a tracking issue that Arcade Build Analysis (https://github.com/dotnet/arcade/blob/main/Documentation/Projects/Build%20Analysis/KnownIssueJsonStepByStep.md) automatically matches against future failures so PRs aren't blocked by an already-tracked flake. +### Template: KBE issue body — literal substring match (default) -File one when **all** of the following hold: -- The failure has occurred ≥ 2 times in the scanned window on `main`. -- The error has a stable substring or regex signature that uniquely identifies it. -- No fix PR is currently open (verify via `search_pull_requests`). -- The failure is **not** a build break or an infrastructure failure — only test failures or hangs are eligible for a KBE. Build breaks and infra failures (for example dead-letter, device-lost, or agent-disconnect issues) must use a regular tracking issue. 
+Title (pick the form matching the signature): +- `[ci-scan] Test failure: ` for test failures +- `[ci-scan] Hang: ` for hangs / timeouts +- `[ci-scan] Build break: ` for compile / link / cmake breaks (the body's `## Error Message` JSON still carries the canonical signature for Build Analysis) -Required structure: match the headings exactly. The literal body MUST look like one of the two templates below — pick exactly one (literal substring is the default; regex only if no single literal line is specific enough). Do not emit both blocks. The outer fence in this prompt uses `~~~` (tildes) only so the inner ` ``` ` fences stay literal; in the issue you emit, do **not** use tildes anywhere — emit only the inner content between (but not including) the `~~~` lines for the template you chose. Walk the "Verify the body before submitting" checks below before committing to the issue body. +Labels: `Known Build Error`, `blocking-clean-ci`. -**Template A — literal substring match (default).** Pick this when the failure log contains a stable, specific assertion or exception message line. 
- -~~~ +````markdown ## Build Information Build: Build error leg or test failing: - Pull request: +## Error Details + + + +``` + Sanitization> +``` + ## Error Message - + ```json { @@ -357,19 +320,31 @@ Pull request: +Build: Build error leg or test failing: - -Pull request: +Pull request: + +## Error Details + + + +``` + +``` ## Error Message - + ```json { @@ -378,117 +353,174 @@ Pull request: " failure.log # > 0 = matches the failure grep -F "" failure.log | grep -E '^\[(PASS|SKIP)\]' # MUST be empty ``` - For `ErrorPattern`, use `grep -E` — different regex flavor than .NET's `NonBacktracking`, but close enough to flag over-broad patterns: + For `ErrorPattern`: ```bash grep -Ec '' failure.log grep -E '' failure.log | grep -E '^\[(PASS|SKIP)\]' # MUST be empty ``` - If the second command in either pair prints anything, the signature also matches `[PASS]` / `[SKIP]` lines for this test and will mute future passing runs. Narrow it. Also mentally check whether the signature would match (a) other tests in the same assembly, or (b) build-time output (Crossgen2, ilasm, MSBuild). The canonical validator is [`Test-KnownIssuePattern.ps1`](https://github.com/dotnet/arcade-skills/blob/main/plugins/dotnet-dnceng/skills/ci-analysis/scripts/Test-KnownIssuePattern.ps1) (uses the exact regex flavor, emits a validated JSON block); pwsh isn't in this workflow's tool allowlist today, so the `grep -F` / `grep -E` smoke test above is the in-band substitute. -8. **Single-line, no escapes.** Build Analysis runs `String.Contains` (case-sensitive, ordinal) for `ErrorMessage` and `Regex` with `Singleline | IgnoreCase | NonBacktracking` and a 50ms-per-line timeout for `ErrorPattern`. Newlines, ANSI escapes (`\u001b[`), and time-prefixes (`[12:34:56.789]`) are not stripped from log lines before matching. Use the array form (below) for multi-line; use `[^\\n]*` instead of `.*` in regexes. -9. 
**JSON escaping is correct.** Inside the JSON string value: `"` → `\"`, `\` → `\\`, real newlines → `\n`. For regex patterns this means **double escape**: a literal dot is `\\.` in JSON (the JSON parser consumes one backslash, leaving `\.` for the regex engine). A `\d` you actually want regex to see has to be written `\\d` in JSON. GitHub's issue Preview tab will flag invalid JSON — use it. + If the second command in either pair prints anything, the signature matches `[PASS]` / `[SKIP]` lines too and will mute future passing runs. Narrow it. -##### Multi-line signatures (array form) +8. Single-line, no escapes. Build Analysis matchers do not strip newlines, ANSI escapes (`\u001b[`), or time-prefixes (`[12:34:56.789]`). Use array form for multi-line; use `[^\n]*` instead of `.*` in regexes. +9. JSON escaping is correct. Inside the JSON string value: `"` -> `\"`, `\` -> `\\`, real newlines -> `\n`. Regex patterns double-escape: literal dot = `\\.` in JSON. -Both `ErrorMessage` and `ErrorPattern` accept an **array of strings**: each element matches a separate log line, in order, and lines may appear between matched elements. Use this when no single line on its own is unique enough — e.g., the test name on one line and the assertion text two lines down. +### Template: KBE multi-line array form + +Both `ErrorMessage` and `ErrorPattern` accept arrays — each element matches a separate log line, in order, with arbitrary lines allowed between matched elements. ```json { "ErrorMessage": [ - "System.Net.Http.Tests.HttpClientHandlerTest.GetAsync_UnknownHost_Throws", - "System.Net.Http.HttpRequestException : Name or service not known" + "", + "" ] } ``` -Rules: each element matches one line (the elements are NOT concatenated and matched as a single multi-line string). All elements must match in order. Don't mix `ErrorMessage` and `ErrorPattern` in the same array. 
Don't pad the array with generic tokens like `exitcode: 139` or `Crash` — they add no specificity and risk false negatives if the log format changes. +Rules: one element = one line (NOT concatenated). All elements must match in order. Don't mix `ErrorMessage` + `ErrorPattern` in one array. Don't pad with generic tokens (`exitcode: 139`, `Crash`) — they add no specificity and risk false negatives if log format changes. + +### Template: KBE signature specificity + +The `ErrorMessage` / `ErrorPattern` MUST uniquely identify this specific failure mode, not an entire category of crashes. + +Reject signatures consisting only of: + +- A bare exit code or signal: `exitcode: 139`, `Segmentation fault`, `SIGSEGV`. +- A generic tool + verb: `Crossgen2 failed`, `ilasm failed`, `dotnet build failed`. +- A bare exception type without message: `BadImageFormatException`, `NullReferenceException`. +- A bare `[FAIL]` line with only the test class name. +- A bare fully-qualified test name (matches every future regression of that test). +- A truncated test-name prefix ending in `_`, `.`, `*` (literal, not glob — over-matches). +- Common infra strings: `Connection reset`, `Operation timed out`, `No space left on device`. -#### Signature examples — Bad → Good +Prefer signatures built from, in order: -`ErrorMessage` is matched as an exact literal substring — `...` in the value is matched as three literal dots, not "anything". Use the array form (above) when you need to span variable text between two anchors. The "Good" column below shows the form to use; values shown as plain strings go in `ErrorMessage`, values prefixed `ErrorPattern:` go in `ErrorPattern`, and values shown as a JSON array go in `ErrorMessage` array form. +1. Exact assertion text or exception **message** (not just the type), e.g. `Assertion failed 'comp->compHndBBtabCount == 0' in 'X' during 'Y'`. +2. Fully-qualified failing test name AND a specific exception message (use array form). +3. 
Unique native stack frame or symbol, e.g. `coreclr!Compiler::fgMorphCall + 0x`. +4. Specific JIT method-being-compiled marker + the specific stress mode. -| ❌ Bad | Why bad | ✅ Good | +If you cannot produce a signature meeting this bar -> skip emission entirely (record `skipped: weak signature` in the tally). Do NOT file a KBE with a weak signature — it will mismatch in Build Analysis and become noise. + +### Template: KBE signature — Bad vs Good + +| Bad | Why bad | Good | |---|---|---| -| `"Some.Test.Class.TestMethodName"` | bare test name; matches `[PASS]` lines for the same test | array: `["Some.Test.Class.TestMethodName", "System.Net.Sockets.SocketException : Try again"]` | -| `"SomeTests.Prefix_"` (trailing `_`) | truncated prefix; trailing `_`/`*`/`.` is literal not glob | `ErrorPattern: "^SomeTests\\.Prefix_[A-Za-z]+\\b[^\\n]*Xunit\\.Sdk\\."` | -| `"Some.Type.Method"` (bare type/method) | matches stack scans of unrelated tests | `ErrorPattern: "^System\\.NullReferenceException\\b[^\\n]*\\n\\s+at Some\\.Type\\.Method\\b"` | -| `"BadImageFormatException"` | bare exception type; matches infra hiccups too | `"System.BadImageFormatException: Could not load file or assembly 'System.Private.CoreLib'"` | -| `"Operation timed out"` | matches transient network failures everywhere | array: `["xharness exec android test", "Operation timed out after 3600s"]` paired with `BuildRetry: false` | +| `"Some.Test.Class.TestMethodName"` | bare test name; matches `[PASS]` lines | array: `["Some.Test.Class.TestMethodName", "System.Net.Sockets.SocketException : Try again"]` | +| `"SomeTests.Prefix_"` (trailing `_`) | trailing `_`/`*`/`.` is literal not glob | `ErrorPattern: "^SomeTests\\.Prefix_[A-Za-z]+\\b[^\\n]*Xunit\\.Sdk\\."` | +| `"Some.Type.Method"` | matches stack scans of unrelated tests | `ErrorPattern: "^System\\.NullReferenceException\\b[^\\n]*\\n\\s+at Some\\.Type\\.Method\\b"` | +| `"BadImageFormatException"` | bare exception type | `"System.BadImageFormatException: 
Could not load file or assembly 'System.Private.CoreLib'"` | +| `"Operation timed out"` | matches transient network everywhere | array: `["xharness exec android test", "Operation timed out after 3600s"]` paired with `BuildRetry: false` | -Choose `ErrorMessage` (literal substring) by default. Use `ErrorPattern` only when no single literal line is specific enough — and confirm the regex is anchored and has no catastrophic backtracking. **Populate exactly one of the two fields per JSON block; never both.** Pattern length doesn't matter; specificity does — don't shorten a unique multi-line signature into a pithy one-liner. Set `BuildRetry: true` **only** for confirmed infra/queue-side flakes (dead-letter, device-lost, agent disconnect) where retrying is safe. +### Template: Test-disable PR body -### Signature specificity (mandatory) +Title: `[ci-scan] Skip under (refs #)`. Use `Skip` / `Disable` / `Suppress` / `Exclude`. Never `Mute`. -The `ErrorMessage` / `ErrorPattern` MUST uniquely identify **this specific failure mode**, not an entire category of crashes or build errors. A signature that would match unrelated future regressions is wrong and will mute legitimate failures. +Branch handling: branch from `origin/main`. Stage only files you intend to change with `git add `; never `git add -A`. Verify with `git diff --name-only --cached` before committing. -**Reject** signatures that consist only of: +````markdown +## Reasoning + -- A bare exit code or signal: `exitcode: 139`, `exit code 1`, `Segmentation fault`, `Aborted`, `SIGSEGV`, `SIGABRT`. -- A generic tool name + failure verb: `Crossgen2 failed`, `ilasm failed`, `dotnet build failed`, `xharness exited`. -- A bare exception type with no message: `BadImageFormatException`, `NullReferenceException`, `Fatal error. Invalid Program`, `Assertion failed`. -- A bare `[FAIL]` line with only the test class name and no exception/assertion text. -- A bare fully-qualified test name (e.g. 
`"ErrorMessage": "Namespace.Class.TestName"`) without the assertion/exception text that follows it on the next line of the log. The test name alone matches every future regression of that test, including unrelated ones, and Build Analysis will mute legitimate new failures. -- A truncated test-name prefix ending in an underscore, dot, or wildcard glyph (e.g. `"SomeClass.SomeMethod_"`, `"Foo.Bar."`, `"Connect_*"`). `ErrorMessage` is a literal `String.Contains` match, not a glob — a trailing `_` or `*` is treated as a literal character and either over-matches every test whose name contains the prefix or never matches at all. If you need to cover multiple related test methods, instead set `ErrorPattern` to a properly anchored regex (e.g. `"SomeClass\\.SomeMethod_[A-Za-z]+ "`), or pick the exception/assertion message that is common to all of them. -- Common infra strings: `Connection reset`, `Operation timed out`, `Resource temporarily unavailable`, `No space left on device`. +Linked KBE: # +> -**Prefer** signatures built from the most specific stable token in the log. In order of preference: +Match verification (from Step 4.8): +1. Same test/family: +2. Same failure signature: +3. Same OS: +4. Same architecture: -1. The exact assertion text or exception **message** (not just the type), e.g. `Assertion failed 'comp->compHndBBtabCount == 0' in 'X' during 'Y'`. -2. The fully-qualified failing test name combined with a specific exception message, e.g. `System.Text.Json.Tests.Utf8JsonReaderTests.TestFoo … System.InvalidOperationException: Cannot read value of type X`. -3. A unique native stack frame or symbol from the crash dump excerpt, e.g. `coreclr!Compiler::fgMorphCall + 0x`. -4. A specific JIT method-being-compiled marker plus the specific stress mode, when the crash is JIT/GC stress only. 
+## Impact on platforms +- + +## Errors log +``` + +``` + +## First build it occurred +- Build: +- Finished: +- Commit: +- Occurrences in window: +- Computed within the scanned window; may not be the true origin. + +## Linked issue + +```` + +Allowed test-disable mechanisms: + +- `[SkipOnPlatform(TestPlatforms., "")]` — platform-specific failures. +- `[ConditionalFact(typeof(PlatformDetection), nameof(PlatformDetection.))]` — narrow via existing helpers. +- `[ActiveIssue("https://github.com/dotnet/runtime/issues/", TestPlatforms.)]` — reference the KBE. +- JIT/GC stress: `[ActiveIssue("...", typeof(TestLibrary.PlatformDetection), nameof(TestLibrary.PlatformDetection.IsStressTest))]` or `true` at the csproj level. + +Scope rule (mandatory): condition must be AS NARROW AS the observed failure scope. + +| Observed scope | Too broad | Matches scope | +|---|---|---| +| Only `linux-arm` fails | `[SkipOnPlatform(TestPlatforms.AnyUnix, ...)]` | `true` | +| Only NativeAOT on a single arch | `true` (all arches) | `true` | +| Only one stress mode | `true` (all stress modes) | Add stress-mode predicate via the failing variant | -**Combining signature parts** — a JSON array in `ErrorMessage` is AND-matched (all substrings must be present in the failure log). Do not pad an array with generic tokens like `exitcode: 139` or `Crash` alongside the specific message — those tokens add no specificity and only risk false negatives if the log format changes. Include at most one supplementary token, and only when it is itself non-generic (e.g. a specific assembly name or test name). +In the PR `Reasoning` section, list the exact set of failing legs (definition + queue + stress mode) that justifies the chosen condition. -If you cannot produce a signature that meets the bar above, **do not file a Known Build Error**. File a regular tracking issue instead and call out in "Recommended action" that the failure needs a stable signature before it can be muted. 
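The JSON-escaping rule for KBE blocks (regex backslashes written twice inside the JSON string) can be smoke-tested in-band before filing: `jq -r` shows exactly the string the matcher's regex engine will receive. The pattern below is a made-up example, not a real signature.

```bash
mkdir -p /tmp/gh-aw/agent
# Hypothetical ErrorPattern: \\n and \\. in the JSON source are the
# double-escaped forms of regex \n and \. (the JSON parser eats one layer).
printf '%s\n' '{"ErrorPattern": "Assertion failed [^\\n]* in .System\\.Tests\\.Foo."}' \
  | tee /tmp/gh-aw/agent/kbe.json
jq -r '.ErrorPattern' /tmp/gh-aw/agent/kbe.json
# Prints: Assertion failed [^\n]* in .System\.Tests\.Foo.
# Single backslashes survive -- exactly what the regex engine should see.
```

If `jq` errors out here, the JSON block would also fail Build Analysis parsing; fix the escaping before emitting the issue body.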
+### Template: Sanitization -Title: `[ci-scan] Test failure: ` for test failures, or `[ci-scan] Known Build Error: ` for non-test build errors. The `[ci-scan] ` prefix is mandatory on every issue and PR this workflow files (see "Outputs: title and labels" above). +When pasting log excerpts into issue/PR bodies, strip: -Labels: only `Known Build Error` and `blocking-clean-ci` are permitted on Known Build Error issues. Do not include any other label (no `area-*`, `os-*`, `arch-*`, etc.) — they will be rejected by `safe-outputs.allowed-labels`. Area and platform triage is added later by a human reviewer. +- JWTs, bearer tokens, `ApplicationGatewayAffinity*=`. +- Per-user paths (`/home//`, `C:\Users\\`). +- Machine names from Helix agent strings. +- Anything that uniquely identifies a contributor's environment. -Before filing, search for an existing Known Build Error issue with a matching `ErrorMessage` (`label:"Known Build Error" in:body ""`). If one exists and is open, **skip silently — do not duplicate, do not append a comment**. Build Analysis already counts the new occurrence in its hit-count summary on the issue body; piling on issue comments per occurrence creates noise on already-noisy KBEs (some have tens of hits per run). If `search_issues` returns no matches, proceed to file the new KBE. +## Environment constraints -## Hard environment constraints +These look like permission errors but are physical. -These look like permission errors but are physical: +- **Pre-bind every URL to a shell variable on its own line, then `curl -s "$url"`.** Inline URLs with `?` or `&` are rejected as "Permission denied" even single-quoted (the tool-approver treats query strings as interactive prompts). 
Working pattern: -- **Pre-bind every URL to a shell variable on a line of its own, then `curl -s "$url"`.** Inline URLs with `?` or `&` are rejected as "Permission denied and could not request permission from user" even when single-quoted, because the Copilot CLI tool-approver treats query strings as interactive prompts. The only working pattern is: ```bash url='https://dev.azure.com/dnceng-public/public/_apis/build/builds?definitions=154&branchName=refs/heads/main&statusFilter=completed&resultFilter=succeeded,failed,partiallySucceeded&%24top=25&api-version=7.1' curl -s "$url" | jq '.' | tee /tmp/gh-aw/agent/builds.json | jq -r '.value[0] | "\(.id) \(.result)"' ``` - Do **not** retry an inline URL hoping the rejection will clear — it won't. Switch to the variable pattern immediately. -- `>` and `-o` redirection at the agent's command line is blocked. Use `| tee /path/to/file`. -- `$(...)` and `${var@P}` are blocked at the command line. Compose values via `xargs -I{}` or by reading files inline. -- OData `$top` must be encoded as `%24top` in URLs. -- Bash allowlist: `dotnet`, `git`, `find`, `ls`, `cat`, `grep`, `head`, `tail`, `wc`, `curl`, `jq`, `tee`, `sed`, `awk`, `tr`, `cut`, `sort`, `uniq`, `xargs`, `echo`, `date`, `mkdir`, `test`, `env`, `basename`, `dirname`, `bash`, `sh`, `chmod`. No `gh`, no `pwsh`, no `python`. Each call runs in a fresh subshell — persist intermediate state to files under `/tmp/gh-aw/agent/`. -## Coverage discipline (avoid arbitrary selection) + Do NOT retry an inline URL hoping the rejection clears. Switch to the variable pattern immediately. -Process every failed signature in every pipeline — do not cherry-pick the obvious ones and skip the rest. Walk pipelines in the order listed in the "Pipelines to scan" table; finish all classifications for pipeline N before moving to pipeline N+1. +- **No `>` or `-o` redirection.** Use `| tee /path/to/file`. +- **No `$(...)` or `${var@P}`.** Compose via `xargs -I{}` or by reading files inline. 
+- **OData `$top` must be encoded as `%24top` in URLs.** +- **Bash allowlist** (per the frontmatter `tools.bash`): `dotnet`, `git`, `find`, `ls`, `cat`, `grep`, `head`, `tail`, `wc`, `curl`, `jq`, `tee`, `sed`, `awk`, `tr`, `cut`, `sort`, `uniq`, `xargs`, `echo`, `date`, `mkdir`, `test`, `env`, `basename`, `dirname`, `bash`, `sh`, `chmod`. No `gh`, no `pwsh`, no `python`. +- **Each bash call runs in a fresh subshell.** Persist state to `/tmp/gh-aw/agent/`. -For each pipeline: +## Output discipline -1. List every failed signature in the latest scanned build, sorted by occurrence count in the window (descending). -2. For each signature, run the six-step walk in "Two-pass KBE → PR flow" and record the outcome (`→ filed-issue #aw_`, `→ filed-PR #aw_`, `→ existing-issue #`, `→ existing-PR #`, or `→ skipped: `). A skipped signature MUST have a reason (e.g., "build canceled, not a test failure", "less than 2 occurrences and not blocking", "owned by area-Infrastructure rota and already triaged"). -3. Keep a per-pipeline tally on disk under `/tmp/gh-aw/agent/coverage/.txt`. At the end of the run, print a summary table to the agent log: `pipeline | total-signatures | issues-filed | prs-filed | reused-existing | skipped-with-reason`. +- Each pipeline gets exactly one walk-through. Do not revisit. +- Don't propose alternative workflow designs. The structure here is the workflow. +- Don't add `area-*` labels — the labeler owns area triage. +- Don't comment on existing KBEs (Build Analysis tracks occurrence counts in the issue body). +- Don't emit `noop`. Either a PR or an issue must come out of every actionable failure. +- One signature = one outcome line in `/tmp/gh-aw/agent/coverage/.txt`. +- The final agent log MUST include the Step 6 summary table.
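The per-signature outcome lines and the final summary table can be produced with the allowlisted tools alone. A sketch, assuming one coverage file per pipeline; the pipeline name and outcome lines below are invented.

```bash
mkdir -p /tmp/gh-aw/agent/coverage
# One outcome line per signature; tee stands in for the blocked > redirection.
printf '%s\n' \
  'sig-a -> filed-issue #aw_1' \
  'sig-b -> existing-issue #112233' \
  'sig-c -> skipped: cap reached' \
  | tee /tmp/gh-aw/agent/coverage/runtime-coreclr-outerloop.txt
# One summary row per coverage file: total signatures and skip count.
awk 'BEGIN{skip=0} /skipped:/{skip++} END{printf "| %s | %d | %d |\n", FILENAME, NR, skip}' \
  /tmp/gh-aw/agent/coverage/runtime-coreclr-outerloop.txt
# Prints: | /tmp/gh-aw/agent/coverage/runtime-coreclr-outerloop.txt | 3 | 1 |
```

Because each bash call runs in a fresh subshell, the on-disk coverage file is the only state that survives between the per-pipeline walk and the end-of-run table.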