fix(integration): clear residual greptile findings on the L3 gate by Yiminnn · Pull Request #817 · benchflow-ai/benchflow

Yiminnn · 2026-06-20T19:32:36Z

Cross-checked every greptile comment left unaddressed on the earlier integration PRs (#802–#810). Most were already fixed by later work or stale; this PR closes the genuine ones.

Triage of the 9 unaddressed comments

PR file:line (finding)	Priority	Verdict
#806 integration-final-review.yml:195 (filter fail-open)	P1	FIXED — `filter_matrix` catches import + per-cell exceptions (fail-open)
#810 build_integration_review_pack.py (deterministic_reject left True)	P1	OPEN → fixed here
#810 codex_review.py (`endswith('/v1')`)	P2	STALE — heuristic gone (Moon Bridge)
#808 codex_review.py:281 (CODEX_API_KEY in env)	P2	cosmetic/intentional — kept (codex ignores it; Moon Bridge routes via config.toml)
#802 codex_review.py:191 (greedy `{.*}`)	P2	OPEN → fixed here
#802 integration_matrix.py:1121 (CalledProcessError)	P2	OPEN → fixed here
#802 codex_review.py:163 (`[:8000]` cut)	P2	cosmetic-accepted — fenced untrusted text the LLM reads as prose
#802 integration-scope.yml:265 (empty matrix)	P1	FIXED — `has_cells` green-no-op guard
#802 integration-final-review.yml:430 (DeepSeek creds bleed into codex)	P1	FIXED — `_codex_env` isolation

Fixes in this PR

P1: `deterministic_reject` left True after R-OUTCOME demotion. When an R-OUTCOME-only reject is demoted to healthy + quarantine, the grade still serialized `deterministic_reject: True` into agent_judge_summary.json. codex reads that file, and a "healthy slot with a deterministic reject" can spuriously push it to downgrade mergeable → not mergeable, directly undermining the L3 gate calibration in #814 (slot matching, V-TAMPER false-positive, codex robustness). The flag is now cleared on demotion, with a regression test.
P2: Greedy JSON regex in the deepseek per-rollout finding parser. Replaced with `json.JSONDecoder().raw_decode` of the first complete object, which tolerates trailing prose; the greedy `{.*}` merged two objects into one invalid blob, losing the finding.
P2: Uncaught CalledProcessError from `git diff` (run with `check=True`). Now re-raised as ScopeError, so a failed/absent head_sha fails closed cleanly instead of surfacing as an uncaught traceback.
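The parser change can be illustrated with a minimal sketch. The helper name `first_json_object` and the sample reply are hypothetical, not taken from codex_review.py; only `raw.find("{")` followed by `json.JSONDecoder().raw_decode` reflects what the PR describes.

```python
import json
import re


def first_json_object(raw: str):
    """Return the first complete JSON object found in raw, or None.

    Hypothetical helper: the real parser in codex_review.py may differ.
    """
    start = raw.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `start` and ignores
        # whatever follows it (trailing prose, a second object, ...).
        obj, _end = json.JSONDecoder().raw_decode(raw, start)
    except json.JSONDecodeError:
        return None
    return obj


# Made-up model reply containing two sibling objects and trailing prose.
reply = 'Finding: {"rule": "R-OUTCOME", "ok": false} then {"rule": "V-TAMPER"} end.'

# Old behaviour: a greedy DOTALL match spans from the first "{" to the last "}".
greedy = re.search(r"\{.*\}", reply, re.DOTALL).group(0)
try:
    json.loads(greedy)
    greedy_parsed = True
except json.JSONDecodeError:
    greedy_parsed = False

print(greedy_parsed)              # the merged blob is not valid JSON
print(first_json_object(reply))   # only the first object is returned
```

This sketch gives up if the text after the first `{` is not valid JSON; whether the real parser retries at a later brace is not stated in the PR.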

Validation: 112 tests pass (incl. the new regression); ruff clean.

Cross-checked the unaddressed greptile comments on the earlier integration PRs (#802-#810); most were already fixed or stale. Three genuine ones remained:

- P1 (build_integration_review_pack): an R-OUTCOME-only deterministic reject is demoted to a healthy+quarantine slot, but grade["deterministic_reject"] was left True and serialized into agent_judge_summary.json. codex reads that and a "healthy slot with a deterministic reject" can spuriously push it to downgrade mergeable -> not mergeable. Clear the flag on demotion. + regression test.
- codex_review deepseek-finding parser: replace the greedy `{.*}` DOTALL regex (which merges two JSON objects into one invalid blob) with raw_decode of the first complete object, tolerant of trailing prose.
- integration_matrix: `git diff` runs with check=True but CalledProcessError bypassed the ScopeError fail-closed handler (uncaught traceback). Re-raise it as ScopeError so a failed/absent head_sha fails closed cleanly.
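The fail-closed wrapping described for integration_matrix can be sketched as follows. The `ScopeError` class here is a stand-in, and `run_checked`, `changed_files`, and the `base_sha` argument are hypothetical names; the PR only confirms that `git diff --name-only` runs with `check=True` and that `CalledProcessError` is re-raised as `ScopeError`.

```python
import subprocess
import sys


class ScopeError(Exception):
    """Stand-in for the ScopeError in integration_matrix.py (definition assumed)."""


def run_checked(cmd: list[str]) -> str:
    """Run cmd and return its stdout; convert a non-zero exit into ScopeError."""
    try:
        proc = subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as exc:
        # Without this wrapper the exception would bypass any handler that
        # only catches ScopeError and end as an uncaught traceback.
        detail = (exc.stderr or "").strip()
        raise ScopeError(f"command failed (exit {exc.returncode}): {detail}") from exc
    return proc.stdout


def changed_files(base_sha: str, head_sha: str) -> list[str]:
    """Hypothetical caller: list files changed between two commits."""
    out = run_checked(["git", "diff", "--name-only", base_sha, head_sha])
    return [line for line in out.splitlines() if line]


if __name__ == "__main__":
    try:
        print(changed_files("HEAD~1", "HEAD"))
    except ScopeError as err:
        # Fail closed: report the problem and exit non-zero.
        print(f"scope error: {err}", file=sys.stderr)
        sys.exit(1)
```

Chaining with `from exc` keeps the original `CalledProcessError` available as `__cause__`, so the exit code and stderr are not lost when the handler logs the `ScopeError`.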

greptile-apps · 2026-06-20T19:35:37Z

Greptile Summary

This PR closes three genuine residual findings from earlier integration PRs: it clears deterministic_reject when an R-OUTCOME-only reject is demoted to healthy, replaces a greedy regex with json.JSONDecoder().raw_decode for per-rollout DeepSeek finding parsing, and wraps the bare subprocess.CalledProcessError from git diff inside a ScopeError for clean fail-closed handling.

build_integration_review_pack.py: slot.grade["deterministic_reject"] is now set to False immediately before the quarantine message is appended during R-OUTCOME demotion, eliminating the contradictory "healthy slot with a deterministic reject" state that could spuriously push the codex reviewer to downgrade a verdict. A regression test is included.
codex_review.py: raw_decode replaces the greedy {.*} regex; it stops at the first complete JSON object and tolerates trailing prose without merging sibling objects into an invalid blob.
integration_matrix.py: CalledProcessError from git diff --name-only is caught and re-raised as ScopeError, which the existing main() handler already catches cleanly.
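The demotion path summarized above can be sketched roughly as below. The `Slot` dataclass, the `rejects` list of rule ids, and the `notes` field are assumptions made for illustration; only the `deterministic_reject` key, the healthy/unhealthy statuses, and the order "clear the flag, then append the quarantine message" come from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    # Hypothetical shape; the real slot object in
    # build_integration_review_pack.py is not shown in the PR.
    grade: dict
    status: str = "unknown"
    notes: list = field(default_factory=list)


def classify(slot: Slot) -> Slot:
    """Sketch of the branch structure attributed to _classify_one."""
    if not slot.grade.get("deterministic_reject"):
        slot.status = "healthy"
        return slot

    # Assumed schema: grade["rejects"] lists the rule ids that rejected.
    non_outcome = [r for r in slot.grade.get("rejects", []) if r != "R-OUTCOME"]
    if non_outcome:
        slot.status = "unhealthy"
        return slot

    # R-OUTCOME-only reject: demote to healthy + quarantine.
    slot.status = "healthy"
    slot.grade["deterministic_reject"] = False  # the fix: clear the stale flag
    slot.notes.append("R-OUTCOME quarantined")
    return slot


demoted = classify(Slot(grade={"deterministic_reject": True, "rejects": ["R-OUTCOME"]}))
print(demoted.status, demoted.grade["deterministic_reject"])
```

With the flag cleared, whatever is later serialized from `slot.grade` no longer pairs a healthy status with a deterministic reject.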

Confidence Score: 5/5

All three changes are narrow, targeted fixes with no new surface area introduced; the deterministic_reject demotion fix is backed by a regression test.

Each change addresses a well-scoped bug: the deterministic_reject flag is cleared at exactly the right point in _classify_one, the raw_decode replacement handles every edge case the old regex missed, and the ScopeError wrapping is consistent with every other error site in integration_matrix.py. The regression test correctly exercises the demotion path by monkey-patching via the module reference. No pre-existing logic is altered beyond the minimal targeted lines.

No files require special attention — all four changed files are straightforward and self-contained.

Important Files Changed

Filename	Overview
.github/scripts/build_integration_review_pack.py	Adds `slot.grade["deterministic_reject"] = False` before the quarantine append during R-OUTCOME-only demotion; fix is minimal and correctly placed.
.github/scripts/codex_review.py	Replaces greedy `{.*}` regex with `raw.find("{")` + `json.JSONDecoder().raw_decode`; correctly handles trailing prose and multiple sibling JSON objects.
.github/scripts/integration_matrix.py	Wraps `subprocess.CalledProcessError` from `git diff` in `ScopeError`; consistent with all other error-handling sites in this file and caught cleanly by `main()`.
tests/test_build_review_pack.py	Adds a focused regression test for the `deterministic_reject` demotion fix; monkey-patch of `rubric_checks.grade_rollout` is correct because `build_integration_review_pack` accesses it via the module reference.
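The remark about patching via the module reference can be demonstrated with a small self-contained sketch. The module object below is a throwaway stand-in built with `types.ModuleType`, and `unittest.mock.patch.object` is used in place of whatever patching helper the real test uses; none of this is the repository's code.

```python
import types
from unittest import mock

# Throwaway stand-in for the rubric_checks module (contents assumed).
rubric_checks = types.ModuleType("rubric_checks")
rubric_checks.grade_rollout = lambda rollout: {"deterministic_reject": False}


def grade_via_module(rollout):
    # Attribute lookup happens at call time, so a patched attribute is seen.
    return rubric_checks.grade_rollout(rollout)


# Equivalent to `from rubric_checks import grade_rollout`: the name is bound
# once, so later patches to the module attribute are not seen.
grade_rollout = rubric_checks.grade_rollout


def grade_via_copied_name(rollout):
    return grade_rollout(rollout)


def fake_grade(rollout):
    return {"deterministic_reject": True}


with mock.patch.object(rubric_checks, "grade_rollout", fake_grade):
    seen_via_module = grade_via_module("rollout-1")
    seen_via_copy = grade_via_copied_name("rollout-1")

print(seen_via_module)  # the patched result
print(seen_via_copy)    # still the original result
```

A test that patches the module attribute therefore only takes effect when the code under test calls `rubric_checks.grade_rollout(...)` rather than a name imported from it, which is the condition the overview says holds here.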

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[_classify_one] --> B{deterministic_reject?}
    B -- No --> C[slot.status = healthy]
    B -- Yes --> D{non_outcome rejects?}
    D -- Yes --> E[slot.status = unhealthy]
    D -- No R-OUTCOME only --> F[slot.status = healthy]
    F --> G["slot.grade['deterministic_reject'] = False NEW"]
    G --> H[Append R-OUTCOME quarantine message]
    H --> I[Serialize to agent_judge_summary.json]
    I --> J[codex_review reads summary]
    J --> K{deterministic_reject?}
    K -- False --> L[No spurious downgrade]
    K -- True was bug --> M[Spurious not-mergeable downgrade]

Reviews (1): Last reviewed commit: "fix(integration): clear residual greptil..."

Yiminnn temporarily deployed to pypi-internal-preview June 20, 2026 19:32 — with GitHub Actions Inactive

Yiminnn merged commit 7a0f1ba into main Jun 20, 2026
8 checks passed

Yiminnn deleted the fix/integration-residual-greptile branch June 20, 2026 22:36

Yiminnn merged 1 commit into main from fix/integration-residual-greptile