fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness by Yiminnn · Pull Request #814 · benchflow-ai/benchflow

Yiminnn · 2026-06-19T06:18:09Z

Running the L3 final-review gate end-to-end on real PRs (#794/#803/#813) showed it returns not mergeable for every PR, for three reasons unrelated to the PR under review. This fixes all three so the gate reflects the actual change, not gate/matrix bugs.

1. Slot mis-attribution → spurious `duplicate`/`missing` blockers

The review-pack matcher mapped produced rollouts to planned cells by fuzzy dims (task, agent, model, sandbox, skill_mode) and ignored network_mode. So citation-check-…-openhands and its …-openhands-allowlist variant — identical in every compared dim — collided into one slot: the allowlist rollout matched the plain cell, leaving the plain slot duplicate and the allowlist slot missing. Separately, an agent that retried in place left several result.json under one cell-job dir, each counted as a separate rollout (duplicate).

Fix: attribute each rollout by its cell-id directory (the run-matrix writes jobs/integration-final/<cell.id>/<ts>/<task>__<hash>/), so cells differing only in an expected-only axis can't collide; and collapse same-job retries to the latest by started_at. A genuinely double-scheduled cell (two distinct job dirs) still surfaces as duplicate. (build_integration_review_pack.py)

2. V-TAMPER false-positive on OpenHands

The deterministic V-TAMPER scanner matched the whole native-ACP execute title. OpenHands writes titles as "<human description>: $ <command>", and prose like "Verify the output" / "Final verification…" collides with the verif token in the verifier-file regex — falsely flagging benign cleanup / read-only verification as verifier tampering (the LLM judge correctly called these benign). Other agents don't use a prose prefix, so only OpenHands tripped it — on most L3 runs.

Fix: strip the "<description>: $ " prefix and scan only the command; a real tamper command (echo x > grader.py, rm -rf tests) still appears after $ and is still caught. (agent_judge.py)

3. Codex "unavailable" flake → false fail-closed

The codex reviewer was handed only a path to the review pack and had to shell out to read it — which on Linux runs under a bubblewrap sandbox. A transient bwrap: loopback: Failed RTM_NEWADDR: Operation not permitted killed every read, so codex produced no parseable verdict and fail-closed to not mergeable (codex unavailable) (this is exactly what differed between #794 and #813).

Fix: inline the review-pack file contents into the prompt (bounded) so codex never needs a sandboxed read, plus a bounded retry (CODEX_MAX_ATTEMPTS, default 2) on output carrying a transient marker (sandbox/network/rate-limit). A genuinely dead codex — or a non-transient empty output — still fails closed immediately. (codex_review.py, codex_review_prompt.md)

Tests

Slot collision (network-allowlist variant maps 1:1) + retry-collapse (3 attempts → 1, but two job dirs still flag duplicate).
V-TAMPER: the 3 exact OpenHands false-positives now pass; 2 real tampers in the same prefixed shape still flagged.
Codex: transient classification, prompt inlining, retry-recovers, persistent-transient-fail-closed, non-transient-not-retried.

Validation: affected suites + full repo sweep — 4433 passed; the only failures are 4 pre-existing env/credential tests (confirmed identical on pristine main with these changes stashed; none import the changed modules).

…robustness Three defects made the L3 final-review gate systematically return "not mergeable" for reasons unrelated to the PR under review (found running the gate end-to-end on #794/#803/#813): 1. Slot mis-attribution (duplicate/missing slots). The review-pack matcher mapped rollouts to planned cells by fuzzy dims, ignoring network_mode, so a cell and its `-allowlist` network variant collided into one slot (one "duplicate", one "missing"); in-place agent retries also each counted as a separate rollout. Now attribute each rollout by its cell-id directory (the run-matrix writes <cell.id>/<ts>/<task>__<hash>), and collapse same-job retries to the latest attempt. Genuinely double-scheduled cells still surface as duplicates. 2. V-TAMPER false-positive on OpenHands. The native-ACP execute scanner matched the whole title; OpenHands writes "<description>: $ <command>", and prose like "Verify the output" collided with the `verif` token, falsely flagging benign cleanup/verification as verifier tampering. Scan only the command after "$ "; a real tamper command is still caught. 3. Codex "unavailable" flake. The reviewer had to shell out (bubblewrap sandbox) to read the review pack from disk; a transient bwrap loopback failure killed every read -> no verdict -> false fail-closed. Inline the review-pack files into the prompt so no sandboxed read is needed, plus a bounded retry on transient (sandbox/network/rate-limit) output. A dead codex still fails closed. Adds regression tests for all three.

greptile-apps · 2026-06-19T06:23:58Z

Greptile Summary

This PR fixes three independent bugs that caused the L3 integration-review gate to return not mergeable on every real PR regardless of actual content. All three root causes were in the gate infrastructure itself, not in the PRs being reviewed.

Slot mis-attribution (build_integration_review_pack.py): rollouts are now attributed by their cell-ID directory path (scoped below the artifacts root) rather than by fuzzy dims, eliminating collisions between cells that differ only in network_mode; same-job retries are collapsed to the latest by started_at so a flaky agent no longer inflates the rollout count into a spurious duplicate.
V-TAMPER false-positive (agent_judge.py): the native ACP execute title scanner now strips the OpenHands \"<description>: $ \" prefix before pattern-matching, so prose like "Verify the output" in the description no longer triggers the verif token in _VERIFIER_FILE_RE; real tamper commands after $ are still caught.
Codex bwrap flake (codex_review.py): review-pack file contents are inlined into the codex prompt (eliminating sandboxed disk reads), and a bounded retry loop (CODEX_MAX_ATTEMPTS, default 2) recovers from transient sandbox/network/rate-limit failures instead of failing closed immediately.

Confidence Score: 5/5

All three fixes are narrowly scoped to the gate infrastructure, well-tested with regression cases, and fail in the cautious direction on any remaining edge case.

The changes fix deterministic, reproducible gate bugs with clean, targeted logic: cell-ID attribution is anchored to the artifact path rather than fuzzy dims; the execute-title prefix strip only ever narrows the scan surface (real tamper commands still appear verbatim after $ ); and the codex retry loop only fires when no verdict was parsed AND the output carries a known transient marker. Each change has dedicated regression tests that mirror the exact production failures. The only open observation is that "operation not permitted" in the transient-marker list is broader than strictly necessary, but its effect is bounded to a single extra codex attempt in non-transient edge cases.

No files require special attention. codex_review.py carries the most logic density (inlining + retry loop) but is thoroughly covered by the new test suite.

Important Files Changed

Filename	Overview
.github/scripts/build_integration_review_pack.py	Adds cell-ID-based rollout attribution (`_attribute_rollout`) and same-job retry collapse (`_dedupe_retries`), resolving the slot mis-attribution bug that caused spurious duplicate/missing blockers on network-mode variant cells.
.github/scripts/codex_review.py	Inlines review-pack file contents into the codex prompt (eliminating the need for sandboxed disk reads), adds `_looks_transient` classification, and wraps `run_codex_verdict` in a bounded retry loop — three independent improvements that together fix the bwrap false-fail-closed.
tests/integration/agent_judge.py	Adds `_acp_execute_command` to strip the OpenHands `"<description>: $ "` prefix before scanning execute-kind titles, preventing the prose description's "Verify"/"verification" tokens from triggering the `_VERIFIER_FILE_RE` tamper check.
tests/test_build_review_pack.py	Adds two regression tests: allowlist-variant cell maps 1:1 to its slot (no collision), and in-place retries collapse to the latest rollout while distinct job dirs still surface as duplicate.
tests/test_codex_review.py	Adds tests for transient classification, prompt inlining, retry-then-succeed, persistent-transient fail-closed, and non-transient-not-retried cases; thoroughly covers the new retry logic.
tests/test_judge_robustness.py	Adds parameterized tests for the OpenHands description-prefix stripping: the three exact false-positive shapes from production runs now pass, while genuine tampers in the same prefixed shape are still caught.
.github/integration/codex_review_prompt.md	Updates the instruction for item 2 to tell codex that review-pack file contents are inlined in the prompt, removing the implicit "go read files from disk" instruction.
uv.lock	Bumps `pydantic-settings` from 2.14.1 → 2.14.2; no code changes, routine lock file update.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[artifacts.rglob result.json] --> B[_attribute_rollout]
    B --> C{cell-id dir in\nrelative path?}
    C -- yes --> D[Assign to matching Slot\nby cell ID]
    C -- no --> E[Fallback: dims-based match]
    D --> F[slot.rollouts.append]
    E --> F
    F --> G[_dedupe_retries per slot]
    G --> H{Multiple rollouts\nsame job dir?}
    H -- yes --> I[Keep latest by started_at]
    H -- no --> J[Keep as-is]
    I --> K[_classify_one: healthy / duplicate / missing / stale]
    J --> K

    subgraph V-TAMPER fix
        L[ACP execute event] --> M[_acp_execute_command\nstrip description prefix]
        M --> N{VERIFIER_FILE_RE\n+ TAMPER_OP_RE\nmatch command?}
        N -- yes --> O[Flag as tamper]
        N -- no --> P[Pass]
    end

    subgraph Codex retry fix
        Q[_assemble_codex_prompt\nInline review-pack files] --> R[run_codex_verdict]
        R --> S{verdict parsed?}
        S -- yes --> T[Use verdict]
        S -- no --> U{_looks_transient?}
        U -- yes, attempts left --> R
        U -- no OR exhausted --> V[Fail closed]
    end

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[artifacts.rglob result.json] --> B[_attribute_rollout]
    B --> C{cell-id dir in\nrelative path?}
    C -- yes --> D[Assign to matching Slot\nby cell ID]
    C -- no --> E[Fallback: dims-based match]
    D --> F[slot.rollouts.append]
    E --> F
    F --> G[_dedupe_retries per slot]
    G --> H{Multiple rollouts\nsame job dir?}
    H -- yes --> I[Keep latest by started_at]
    H -- no --> J[Keep as-is]
    I --> K[_classify_one: healthy / duplicate / missing / stale]
    J --> K

    subgraph V-TAMPER fix
        L[ACP execute event] --> M[_acp_execute_command\nstrip description prefix]
        M --> N{VERIFIER_FILE_RE\n+ TAMPER_OP_RE\nmatch command?}
        N -- yes --> O[Flag as tamper]
        N -- no --> P[Pass]
    end

    subgraph Codex retry fix
        Q[_assemble_codex_prompt\nInline review-pack files] --> R[run_codex_verdict]
        R --> S{verdict parsed?}
        S -- yes --> T[Use verdict]
        S -- no --> U{_looks_transient?}
        U -- yes, attempts left --> R
        U -- no OR exhausted --> V[Fail closed]
    end

_{Reviews (4): Last reviewed commit: "fix(deps): bump pydantic-settings 2.14.1..." | Re-trigger Greptile}

- codex transient markers: drop the bare '429' (matched 'line 429'); use the 'too many requests' reason phrase + 'http 429' instead. - _attribute_rollout: only scan path parts BELOW the artifacts root, so OS-level segments can never coincidentally match a cell id. - _started_at: fall back to finished_at so retry-collapse stays chronological when started_at is absent.

A new advisory (GHSA-4xgf-cpjx-pc3j, pydantic-settings 2.14.2, published 2026-06-19) started failing pip-audit repo-wide on every PR and main — pydantic-settings is a transitive dep via litellm/mcp. Surgically bump the locked version to the fixed 2.14.2, preserving the revision-1 lockfile format (no full reformat). `uv sync --locked` accepts the lock; pip-audit is clean.

greptile-apps Bot reviewed Jun 19, 2026

View reviewed changes

Comment thread .github/scripts/codex_review.py

Comment thread .github/scripts/build_integration_review_pack.py Outdated

Comment thread .github/scripts/build_integration_review_pack.py

symphony-bot added 3 commits June 19, 2026 17:46

style(integration): ruff format the new gate-calibration tests

fb4f707

Yiminnn merged commit 3459996 into main Jun 20, 2026
8 checks passed

Yiminnn deleted the fix/integration-gate-calibration branch June 20, 2026 00:11

This was referenced Jun 20, 2026

ci: verify L3 gate fixes e2e (throwaway — DO NOT MERGE) #815

Closed

fix(integration): clear residual greptile findings on the L3 gate #817

Merged

chore(deps)!: 7-day dependency cooldown (uv exclude-newer + CI assert) #788

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness#814

fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness#814
Yiminnn merged 4 commits into
mainfrom
fix/integration-gate-calibration

Yiminnn commented Jun 19, 2026

Uh oh!

greptile-apps Bot commented Jun 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Yiminnn commented Jun 19, 2026

1. Slot mis-attribution → spurious duplicate/missing blockers

2. V-TAMPER false-positive on OpenHands

3. Codex "unavailable" flake → false fail-closed

Tests

Uh oh!

greptile-apps Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Slot mis-attribution → spurious `duplicate`/`missing` blockers

greptile-apps Bot commented Jun 19, 2026 •

edited

Loading