fix(integration): codex reviewer on gpt-5.5 (xhigh) + evidence serialization + R-OUTCOME demote#810
Conversation
…ation, demote R-OUTCOME The first codex execution (never run before) surfaced three issues: 1. Pass-1 deepseek findings crashed: codex_review.py called evidence.to_dict(), but RolloutEvidence is a frozen dataclass with no to_dict -> use dataclasses.asdict. 2. Pass-2 codex exec failed: gpt-5.4-nano rejects codex's Responses-API `tool_search` tool. Move the codex composer to DeepSeek-v4-pro, routed through a custom chat-completions provider (DeepSeek serves chat, not Responses + tool_search); _deepseek_codex_config emits the codex -c overrides and reads DEEPSEEK_API_KEY/DEEPSEEK_BASE_URL from the job env. CODEX_MODEL=deepseek-v4-pro in both review-pack codex steps. 3. R-OUTCOME (the rollout produced no valid SCORED outcome — an errored / unscored-timeout run, often a hard task the agent didn't finish in budget) is an experiment-fidelity issue on one rollout, not a PR regression. Demote an R-OUTCOME-ONLY reject to a quarantine (visible, non-blocking); any other deterministic reject still hard-blocks. Regression tests added for the deepseek codex routing. The pinned-baseline parity demote (prior PR) is confirmed working; deterministic grader graded 9/10 healthy.
Greptile SummaryThis PR patches three bugs surfaced by the first live codex execution after #808: a crash caused by calling
Confidence Score: 4/5Safe to merge for the three targeted fixes; the only remaining concern is the pre-existing data inconsistency in agent_judge_summary.json where a quarantined slot still advertises deterministic_reject: true, which an already-open thread has flagged. The dataclasses.asdict crash fix and the reasoning-effort config helper are both correct and well-tested. The R-OUTCOME demotion logic routes the slot to healthy + quarantine correctly, and compute_verdict picks up those quarantines. The one outstanding inconsistency — slot.grade["deterministic_reject"] is never cleared to False when a slot is demoted — means the serialized agent_judge_summary.json will show status: healthy alongside deterministic_reject: true for those slots, which could prompt codex to re-escalate the quarantine during Pass 2. That issue was flagged in a prior thread and is not new to this PR, but it remains unresolved in the code. .github/scripts/build_integration_review_pack.py — the R-OUTCOME demotion branch (lines 306-315) leaves slot.grade["deterministic_reject"] uncorrected after changing slot.status to "healthy". Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant WF as GitHub Workflow
participant BP as build_review_pack.py
participant CR as codex_review.py
participant DS as DeepSeek (Pass 1)
participant CX as codex CLI / gpt-5.5 (Pass 2)
WF->>BP: run deterministic grader
BP->>BP: grade_rollout() per slot
BP->>BP: R-OUTCOME only? quarantine (healthy)
BP->>BP: Other deterministic fail? unhealthy
BP-->>WF: deterministic_verdict + agent_judge_summary.json
WF->>CR: run codex_review.py --deterministic-verdict
CR->>CR: load_rollout_evidence() + dataclasses.asdict()
CR->>DS: Pass 1 per-rollout findings
DS-->>CR: JSON findings per rollout
CR->>CR: _codex_env() swap DeepSeek key for real OpenAI key
CR->>CR: "_reasoning_config() model_reasoning_effort=xhigh"
CR->>CX: "codex exec --model gpt-5.5 -c model_reasoning_effort=xhigh"
CX-->>CR: advisory verdict
CR->>CR: worst(deterministic, codex_verdict)
CR-->>WF: final_verdict (codex can only tighten, never loosen)
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant WF as GitHub Workflow
participant BP as build_review_pack.py
participant CR as codex_review.py
participant DS as DeepSeek (Pass 1)
participant CX as codex CLI / gpt-5.5 (Pass 2)
WF->>BP: run deterministic grader
BP->>BP: grade_rollout() per slot
BP->>BP: R-OUTCOME only? quarantine (healthy)
BP->>BP: Other deterministic fail? unhealthy
BP-->>WF: deterministic_verdict + agent_judge_summary.json
WF->>CR: run codex_review.py --deterministic-verdict
CR->>CR: load_rollout_evidence() + dataclasses.asdict()
CR->>DS: Pass 1 per-rollout findings
DS-->>CR: JSON findings per rollout
CR->>CR: _codex_env() swap DeepSeek key for real OpenAI key
CR->>CR: "_reasoning_config() model_reasoning_effort=xhigh"
CR->>CX: "codex exec --model gpt-5.5 -c model_reasoning_effort=xhigh"
CX-->>CR: advisory verdict
CR->>CR: worst(deterministic, codex_verdict)
CR-->>WF: final_verdict (codex can only tighten, never loosen)
Reviews (2): Last reviewed commit: "fix(integration): run the codex reviewer..." | Re-trigger Greptile |
| slot.status = "healthy" | ||
| slot.grade["quarantines"] = [ | ||
| *slot.grade.get("quarantines", []), | ||
| "R-OUTCOME: rollout produced no valid scored outcome " | ||
| "(error / unscored timeout) — quarantined, not a PR regression", | ||
| ] | ||
| slot.detail = ( | ||
| f"healthy with {len(slot.grade['quarantines'])} quarantine(s) " | ||
| "(incl. R-OUTCOME)" | ||
| ) |
There was a problem hiding this comment.
grade["deterministic_reject"] left True after R-OUTCOME demotion
When the R-OUTCOME-only branch fires, slot.status is set to "healthy" but slot.grade["deterministic_reject"] is never cleared — it remains True. agent_judge_summary.json serializes this field directly, so codex (Pass 2) will read a slot where "status": "healthy" and "deterministic_reject": true at the same time. That direct contradiction could cause the codex reviewer to escalate a quarantine to "not mergeable", which is exactly the false-positive this demotion is trying to prevent. Adding slot.grade["deterministic_reject"] = False (alongside the quarantine append) would make the review pack data internally consistent.
| base = (env.get("DEEPSEEK_BASE_URL") or "https://api.deepseek.com").rstrip("/") | ||
| # DeepSeek's OpenAI-compatible chat endpoint lives under /v1. | ||
| if not base.endswith("/v1"): | ||
| base = f"{base}/v1" |
There was a problem hiding this comment.
endswith("/v1") heuristic can double-append on non-standard proxy URLs
base.endswith("/v1") strips no path components before the check, so a DEEPSEEK_BASE_URL like https://proxy.example.com/deepseek/v1/chat/completions would pass through rstrip("/") unchanged, fail the endswith("/v1") guard, and produce …/chat/completions/v1 — an invalid endpoint. The default (https://api.deepseek.com) and the typical …/v1 form work correctly; this only breaks with unusual custom proxy paths.
… impossible deepseek route Local end-to-end test proved codex CANNOT use DeepSeek: codex (0.137+) removed `wire_api = "chat"` and speaks ONLY the OpenAI Responses API, which DeepSeek does not serve (same wall that gates codex-acp); and deepseek-v4-pro is a reasoning model returning empty `content`. Remove the dead _deepseek_codex_config (it emitted the now-rejected wire_api=chat). Run the codex composer on an OpenAI model instead: gpt-5.5 (a full model that supports codex's tools, unlike gpt-5.4-nano which rejected `tool_search`) with CODEX_REASONING_EFFORT=xhigh (validated against codex's effort enum). _reasoning_config emits the model_reasoning_effort `-c` override from the env. Keeps the prior fixes: RolloutEvidence serialization via dataclasses.asdict, and the R-OUTCOME-only -> quarantine demote. Regression tests updated.
Cross-checked the unaddressed greptile comments on the earlier integration PRs (#802-#810); most were already fixed or stale. Three genuine ones remained: - P1 (build_integration_review_pack): an R-OUTCOME-only deterministic reject is demoted to a healthy+quarantine slot, but grade["deterministic_reject"] was left True and serialized into agent_judge_summary.json. codex reads that and a "healthy slot with a deterministic reject" can spuriously push it to downgrade mergeable -> not mergeable. Clear the flag on demotion. + regression test. - codex_review deepseek-finding parser: replace the greedy `{.*}` DOTALL regex (which merges two JSON objects into one invalid blob) with raw_decode of the first complete object, tolerant of trailing prose. - integration_matrix: `git diff` runs with check=True but CalledProcessError bypassed the ScopeError fail-closed handler (uncaught traceback). Re-raise it as ScopeError so a failed/absent head_sha fails closed cleanly. Co-authored-by: symphony-bot <symphony@benchflow.ai>
Follow-up. The first real codex execution (it had never run before #808) surfaced three issues; the rest of the L3 pipeline is confirmed working (plan ✓, 10/10 rollouts ✓, deterministic grader ✓, parity demote ✓).
codex_review.pycalledevidence.to_dict(), butRolloutEvidenceis a frozen dataclass with noto_dict→dataclasses.asdict.gpt-5.4-nanorejects codex's Responses-APItool_searchtool (400). Per your call, move the composer to DeepSeek-v4-pro, routed through a custom chat-completions provider (_deepseek_codex_config): DeepSeek serves chat, not Responses+tool_search. ReadsDEEPSEEK_API_KEY/DEEPSEEK_BASE_URLfrom the job env.Test plan