Skip to content

fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness#814

Merged
Yiminnn merged 4 commits into
mainfrom
fix/integration-gate-calibration
Jun 20, 2026
Merged

fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness#814
Yiminnn merged 4 commits into
mainfrom
fix/integration-gate-calibration

Conversation

@Yiminnn

@Yiminnn Yiminnn commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Running the L3 final-review gate end-to-end on real PRs (#794/#803/#813) showed it returns not mergeable for every PR, for three reasons unrelated to the PR under review. This fixes all three so the gate reflects the actual change, not gate/matrix bugs.

1. Slot mis-attribution → spurious duplicate/missing blockers

The review-pack matcher mapped produced rollouts to planned cells by fuzzy dims (task, agent, model, sandbox, skill_mode) and ignored network_mode. So citation-check-…-openhands and its …-openhands-allowlist variant — identical in every compared dim — collided into one slot: the allowlist rollout matched the plain cell, leaving the plain slot duplicate and the allowlist slot missing. Separately, an agent that retried in place left several result.json under one cell-job dir, each counted as a separate rollout (duplicate).

Fix: attribute each rollout by its cell-id directory (the run-matrix writes jobs/integration-final/<cell.id>/<ts>/<task>__<hash>/), so cells differing only in an expected-only axis can't collide; and collapse same-job retries to the latest by started_at. A genuinely double-scheduled cell (two distinct job dirs) still surfaces as duplicate. (build_integration_review_pack.py)

2. V-TAMPER false-positive on OpenHands

The deterministic V-TAMPER scanner matched the whole native-ACP execute title. OpenHands writes titles as "<human description>: $ <command>", and prose like "Verify the output" / "Final verification…" collides with the verif token in the verifier-file regex — falsely flagging benign cleanup / read-only verification as verifier tampering (the LLM judge correctly called these benign). Other agents don't use a prose prefix, so only OpenHands tripped it — on most L3 runs.

Fix: strip the "<description>: $ " prefix and scan only the command; a real tamper command (echo x > grader.py, rm -rf tests) still appears after $ and is still caught. (agent_judge.py)

3. Codex "unavailable" flake → false fail-closed

The codex reviewer was handed only a path to the review pack and had to shell out to read it — which on Linux runs under a bubblewrap sandbox. A transient bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted killed every read, so codex produced no parseable verdict and fail-closed to not mergeable (codex unavailable) (this is exactly what differed between #794 and #813).

Fix: inline the review-pack file contents into the prompt (bounded) so codex never needs a sandboxed read, plus a bounded retry (CODEX_MAX_ATTEMPTS, default 2) on output carrying a transient marker (sandbox/network/rate-limit). A genuinely dead codex — or a non-transient empty output — still fails closed immediately. (codex_review.py, codex_review_prompt.md)

Tests

  • Slot collision (network-allowlist variant maps 1:1) + retry-collapse (3 attempts → 1, but two job dirs still flag duplicate).
  • V-TAMPER: the 3 exact OpenHands false-positives now pass; 2 real tampers in the same prefixed shape still flagged.
  • Codex: transient classification, prompt inlining, retry-recovers, persistent-transient-fail-closed, non-transient-not-retried.

Validation: affected suites + full repo sweep — 4433 passed; the only failures are 4 pre-existing env/credential tests (confirmed identical on pristine main with these changes stashed; none import the changed modules).

…robustness

Three defects made the L3 final-review gate systematically return "not mergeable"
for reasons unrelated to the PR under review (found running the gate end-to-end
on #794/#803/#813):

1. Slot mis-attribution (duplicate/missing slots). The review-pack matcher mapped
   rollouts to planned cells by fuzzy dims, ignoring network_mode, so a cell and
   its `-allowlist` network variant collided into one slot (one "duplicate", one
   "missing"); in-place agent retries also each counted as a separate rollout.
   Now attribute each rollout by its cell-id directory (the run-matrix writes
   <cell.id>/<ts>/<task>__<hash>), and collapse same-job retries to the latest
   attempt. Genuinely double-scheduled cells still surface as duplicates.

2. V-TAMPER false-positive on OpenHands. The native-ACP execute scanner matched
   the whole title; OpenHands writes "<description>: $ <command>", and prose like
   "Verify the output" collided with the `verif` token, falsely flagging benign
   cleanup/verification as verifier tampering. Scan only the command after "$ ";
   a real tamper command is still caught.

3. Codex "unavailable" flake. The reviewer had to shell out (bubblewrap sandbox)
   to read the review pack from disk; a transient bwrap loopback failure killed
   every read -> no verdict -> false fail-closed. Inline the review-pack files
   into the prompt so no sandboxed read is needed, plus a bounded retry on
   transient (sandbox/network/rate-limit) output. A dead codex still fails closed.

Adds regression tests for all three.
@greptile-apps

greptile-apps Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes three independent bugs that caused the L3 integration-review gate to return not mergeable on every real PR regardless of actual content. All three root causes were in the gate infrastructure itself, not in the PRs being reviewed.

  • Slot mis-attribution (build_integration_review_pack.py): rollouts are now attributed by their cell-ID directory path (scoped below the artifacts root) rather than by fuzzy dims, eliminating collisions between cells that differ only in network_mode; same-job retries are collapsed to the latest by started_at so a flaky agent no longer inflates the rollout count into a spurious duplicate.
  • V-TAMPER false-positive (agent_judge.py): the native ACP execute title scanner now strips the OpenHands \"<description>: $ \" prefix before pattern-matching, so prose like "Verify the output" in the description no longer triggers the verif token in _VERIFIER_FILE_RE; real tamper commands after $ are still caught.
  • Codex bwrap flake (codex_review.py): review-pack file contents are inlined into the codex prompt (eliminating sandboxed disk reads), and a bounded retry loop (CODEX_MAX_ATTEMPTS, default 2) recovers from transient sandbox/network/rate-limit failures instead of failing closed immediately.

Confidence Score: 5/5

All three fixes are narrowly scoped to the gate infrastructure, well-tested with regression cases, and fail in the cautious direction on any remaining edge case.

The changes fix deterministic, reproducible gate bugs with clean, targeted logic: cell-ID attribution is anchored to the artifact path rather than fuzzy dims; the execute-title prefix strip only ever narrows the scan surface (real tamper commands still appear verbatim after $ ); and the codex retry loop only fires when no verdict was parsed AND the output carries a known transient marker. Each change has dedicated regression tests that mirror the exact production failures. The only open observation is that "operation not permitted" in the transient-marker list is broader than strictly necessary, but its effect is bounded to a single extra codex attempt in non-transient edge cases.

No files require special attention. codex_review.py carries the most logic density (inlining + retry loop) but is thoroughly covered by the new test suite.

Important Files Changed

Filename Overview
.github/scripts/build_integration_review_pack.py Adds cell-ID-based rollout attribution (_attribute_rollout) and same-job retry collapse (_dedupe_retries), resolving the slot mis-attribution bug that caused spurious duplicate/missing blockers on network-mode variant cells.
.github/scripts/codex_review.py Inlines review-pack file contents into the codex prompt (eliminating the need for sandboxed disk reads), adds _looks_transient classification, and wraps run_codex_verdict in a bounded retry loop — three independent improvements that together fix the bwrap false-fail-closed.
tests/integration/agent_judge.py Adds _acp_execute_command to strip the OpenHands "<description>: $ " prefix before scanning execute-kind titles, preventing the prose description's "Verify"/"verification" tokens from triggering the _VERIFIER_FILE_RE tamper check.
tests/test_build_review_pack.py Adds two regression tests: allowlist-variant cell maps 1:1 to its slot (no collision), and in-place retries collapse to the latest rollout while distinct job dirs still surface as duplicate.
tests/test_codex_review.py Adds tests for transient classification, prompt inlining, retry-then-succeed, persistent-transient fail-closed, and non-transient-not-retried cases; thoroughly covers the new retry logic.
tests/test_judge_robustness.py Adds parameterized tests for the OpenHands description-prefix stripping: the three exact false-positive shapes from production runs now pass, while genuine tampers in the same prefixed shape are still caught.
.github/integration/codex_review_prompt.md Updates the instruction for item 2 to tell codex that review-pack file contents are inlined in the prompt, removing the implicit "go read files from disk" instruction.
uv.lock Bumps pydantic-settings from 2.14.1 → 2.14.2; no code changes, routine lock file update.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[artifacts.rglob result.json] --> B[_attribute_rollout]
    B --> C{cell-id dir in\nrelative path?}
    C -- yes --> D[Assign to matching Slot\nby cell ID]
    C -- no --> E[Fallback: dims-based match]
    D --> F[slot.rollouts.append]
    E --> F
    F --> G[_dedupe_retries per slot]
    G --> H{Multiple rollouts\nsame job dir?}
    H -- yes --> I[Keep latest by started_at]
    H -- no --> J[Keep as-is]
    I --> K[_classify_one: healthy / duplicate / missing / stale]
    J --> K

    subgraph V-TAMPER fix
        L[ACP execute event] --> M[_acp_execute_command\nstrip description prefix]
        M --> N{VERIFIER_FILE_RE\n+ TAMPER_OP_RE\nmatch command?}
        N -- yes --> O[Flag as tamper]
        N -- no --> P[Pass]
    end

    subgraph Codex retry fix
        Q[_assemble_codex_prompt\nInline review-pack files] --> R[run_codex_verdict]
        R --> S{verdict parsed?}
        S -- yes --> T[Use verdict]
        S -- no --> U{_looks_transient?}
        U -- yes, attempts left --> R
        U -- no OR exhausted --> V[Fail closed]
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[artifacts.rglob result.json] --> B[_attribute_rollout]
    B --> C{cell-id dir in\nrelative path?}
    C -- yes --> D[Assign to matching Slot\nby cell ID]
    C -- no --> E[Fallback: dims-based match]
    D --> F[slot.rollouts.append]
    E --> F
    F --> G[_dedupe_retries per slot]
    G --> H{Multiple rollouts\nsame job dir?}
    H -- yes --> I[Keep latest by started_at]
    H -- no --> J[Keep as-is]
    I --> K[_classify_one: healthy / duplicate / missing / stale]
    J --> K

    subgraph V-TAMPER fix
        L[ACP execute event] --> M[_acp_execute_command\nstrip description prefix]
        M --> N{VERIFIER_FILE_RE\n+ TAMPER_OP_RE\nmatch command?}
        N -- yes --> O[Flag as tamper]
        N -- no --> P[Pass]
    end

    subgraph Codex retry fix
        Q[_assemble_codex_prompt\nInline review-pack files] --> R[run_codex_verdict]
        R --> S{verdict parsed?}
        S -- yes --> T[Use verdict]
        S -- no --> U{_looks_transient?}
        U -- yes, attempts left --> R
        U -- no OR exhausted --> V[Fail closed]
    end
Loading

Reviews (4): Last reviewed commit: "fix(deps): bump pydantic-settings 2.14.1..." | Re-trigger Greptile

Comment thread .github/scripts/codex_review.py
Comment thread .github/scripts/build_integration_review_pack.py Outdated
Comment thread .github/scripts/build_integration_review_pack.py
symphony-bot added 3 commits June 19, 2026 17:46
- codex transient markers: drop the bare '429' (matched 'line 429'); use the
  'too many requests' reason phrase + 'http 429' instead.
- _attribute_rollout: only scan path parts BELOW the artifacts root, so OS-level
  segments can never coincidentally match a cell id.
- _started_at: fall back to finished_at so retry-collapse stays chronological
  when started_at is absent.
A new advisory (GHSA-4xgf-cpjx-pc3j, pydantic-settings 2.14.2, published
2026-06-19) started failing pip-audit repo-wide on every PR and main —
pydantic-settings is a transitive dep via litellm/mcp. Surgically bump the
locked version to the fixed 2.14.2, preserving the revision-1 lockfile format
(no full reformat). `uv sync --locked` accepts the lock; pip-audit is clean.
@Yiminnn Yiminnn merged commit 3459996 into main Jun 20, 2026
8 checks passed
@Yiminnn Yiminnn deleted the fix/integration-gate-calibration branch June 20, 2026 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant