fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness#814
Conversation
…robustness Three defects made the L3 final-review gate systematically return "not mergeable" for reasons unrelated to the PR under review (found running the gate end-to-end on #794/#803/#813): 1. Slot mis-attribution (duplicate/missing slots). The review-pack matcher mapped rollouts to planned cells by fuzzy dims, ignoring network_mode, so a cell and its `-allowlist` network variant collided into one slot (one "duplicate", one "missing"); in-place agent retries also each counted as a separate rollout. Now attribute each rollout by its cell-id directory (the run-matrix writes <cell.id>/<ts>/<task>__<hash>), and collapse same-job retries to the latest attempt. Genuinely double-scheduled cells still surface as duplicates. 2. V-TAMPER false-positive on OpenHands. The native-ACP execute scanner matched the whole title; OpenHands writes "<description>: $ <command>", and prose like "Verify the output" collided with the `verif` token, falsely flagging benign cleanup/verification as verifier tampering. Scan only the command after "$ "; a real tamper command is still caught. 3. Codex "unavailable" flake. The reviewer had to shell out (bubblewrap sandbox) to read the review pack from disk; a transient bwrap loopback failure killed every read -> no verdict -> false fail-closed. Inline the review-pack files into the prompt so no sandboxed read is needed, plus a bounded retry on transient (sandbox/network/rate-limit) output. A dead codex still fails closed. Adds regression tests for all three.
Greptile SummaryThis PR fixes three independent bugs that caused the L3 integration-review gate to return
Confidence Score: 5/5All three fixes are narrowly scoped to the gate infrastructure, well-tested with regression cases, and fail in the cautious direction on any remaining edge case. The changes fix deterministic, reproducible gate bugs with clean, targeted logic: cell-ID attribution is anchored to the artifact path rather than fuzzy dims; the execute-title prefix strip only ever narrows the scan surface (real tamper commands still appear verbatim after No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[artifacts.rglob result.json] --> B[_attribute_rollout]
B --> C{cell-id dir in\nrelative path?}
C -- yes --> D[Assign to matching Slot\nby cell ID]
C -- no --> E[Fallback: dims-based match]
D --> F[slot.rollouts.append]
E --> F
F --> G[_dedupe_retries per slot]
G --> H{Multiple rollouts\nsame job dir?}
H -- yes --> I[Keep latest by started_at]
H -- no --> J[Keep as-is]
I --> K[_classify_one: healthy / duplicate / missing / stale]
J --> K
subgraph V-TAMPER fix
L[ACP execute event] --> M[_acp_execute_command\nstrip description prefix]
M --> N{VERIFIER_FILE_RE\n+ TAMPER_OP_RE\nmatch command?}
N -- yes --> O[Flag as tamper]
N -- no --> P[Pass]
end
subgraph Codex retry fix
Q[_assemble_codex_prompt\nInline review-pack files] --> R[run_codex_verdict]
R --> S{verdict parsed?}
S -- yes --> T[Use verdict]
S -- no --> U{_looks_transient?}
U -- yes, attempts left --> R
U -- no OR exhausted --> V[Fail closed]
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[artifacts.rglob result.json] --> B[_attribute_rollout]
B --> C{cell-id dir in\nrelative path?}
C -- yes --> D[Assign to matching Slot\nby cell ID]
C -- no --> E[Fallback: dims-based match]
D --> F[slot.rollouts.append]
E --> F
F --> G[_dedupe_retries per slot]
G --> H{Multiple rollouts\nsame job dir?}
H -- yes --> I[Keep latest by started_at]
H -- no --> J[Keep as-is]
I --> K[_classify_one: healthy / duplicate / missing / stale]
J --> K
subgraph V-TAMPER fix
L[ACP execute event] --> M[_acp_execute_command\nstrip description prefix]
M --> N{VERIFIER_FILE_RE\n+ TAMPER_OP_RE\nmatch command?}
N -- yes --> O[Flag as tamper]
N -- no --> P[Pass]
end
subgraph Codex retry fix
Q[_assemble_codex_prompt\nInline review-pack files] --> R[run_codex_verdict]
R --> S{verdict parsed?}
S -- yes --> T[Use verdict]
S -- no --> U{_looks_transient?}
U -- yes, attempts left --> R
U -- no OR exhausted --> V[Fail closed]
end
Reviews (4): Last reviewed commit: "fix(deps): bump pydantic-settings 2.14.1..." | Re-trigger Greptile |
- codex transient markers: drop the bare '429' (matched 'line 429'); use the 'too many requests' reason phrase + 'http 429' instead. - _attribute_rollout: only scan path parts BELOW the artifacts root, so OS-level segments can never coincidentally match a cell id. - _started_at: fall back to finished_at so retry-collapse stays chronological when started_at is absent.
A new advisory (GHSA-4xgf-cpjx-pc3j, pydantic-settings 2.14.2, published 2026-06-19) started failing pip-audit repo-wide on every PR and main — pydantic-settings is a transitive dep via litellm/mcp. Surgically bump the locked version to the fixed 2.14.2, preserving the revision-1 lockfile format (no full reformat). `uv sync --locked` accepts the lock; pip-audit is clean.
Running the L3 final-review gate end-to-end on real PRs (#794/#803/#813) showed it returns
not mergeablefor every PR, for three reasons unrelated to the PR under review. This fixes all three so the gate reflects the actual change, not gate/matrix bugs.1. Slot mis-attribution → spurious
duplicate/missingblockersThe review-pack matcher mapped produced rollouts to planned cells by fuzzy dims (task, agent, model, sandbox, skill_mode) and ignored
network_mode. Socitation-check-…-openhandsand its…-openhands-allowlistvariant — identical in every compared dim — collided into one slot: the allowlist rollout matched the plain cell, leaving the plain slotduplicateand the allowlist slotmissing. Separately, an agent that retried in place left severalresult.jsonunder one cell-job dir, each counted as a separate rollout (duplicate).Fix: attribute each rollout by its cell-id directory (the run-matrix writes
jobs/integration-final/<cell.id>/<ts>/<task>__<hash>/), so cells differing only in an expected-only axis can't collide; and collapse same-job retries to the latest bystarted_at. A genuinely double-scheduled cell (two distinct job dirs) still surfaces asduplicate. (build_integration_review_pack.py)2. V-TAMPER false-positive on OpenHands
The deterministic V-TAMPER scanner matched the whole native-ACP execute title. OpenHands writes titles as
"<human description>: $ <command>", and prose like "Verify the output" / "Final verification…" collides with theveriftoken in the verifier-file regex — falsely flagging benign cleanup / read-only verification as verifier tampering (the LLM judge correctly called these benign). Other agents don't use a prose prefix, so only OpenHands tripped it — on most L3 runs.Fix: strip the
"<description>: $ "prefix and scan only the command; a real tamper command (echo x > grader.py,rm -rf tests) still appears after$and is still caught. (agent_judge.py)3. Codex "unavailable" flake → false fail-closed
The codex reviewer was handed only a path to the review pack and had to shell out to read it — which on Linux runs under a bubblewrap sandbox. A transient
bwrap: loopback: Failed RTM_NEWADDR: Operation not permittedkilled every read, so codex produced no parseable verdict and fail-closed tonot mergeable (codex unavailable)(this is exactly what differed between #794 and #813).Fix: inline the review-pack file contents into the prompt (bounded) so codex never needs a sandboxed read, plus a bounded retry (
CODEX_MAX_ATTEMPTS, default 2) on output carrying a transient marker (sandbox/network/rate-limit). A genuinely dead codex — or a non-transient empty output — still fails closed immediately. (codex_review.py,codex_review_prompt.md)Tests
Validation: affected suites + full repo sweep —
4433 passed; the only failures are 4 pre-existing env/credential tests (confirmed identical on pristinemainwith these changes stashed; none import the changed modules).