fix(integration): codex reviewer on gpt-5.5 (xhigh) + evidence serialization + R-OUTCOME demote by Yiminnn · Pull Request #810 · benchflow-ai/benchflow

Yiminnn · 2026-06-19T01:15:53Z

Follow-up. The first real codex execution (it had never run before #808) surfaced three issues; the rest of the L3 pipeline is confirmed working (plan ✓, 10/10 rollouts ✓, deterministic grader ✓, parity demote ✓).

Pass-1 crash — codex_review.py called evidence.to_dict(), but RolloutEvidence is a frozen dataclass with no to_dict → dataclasses.asdict.
Pass-2 codex model — gpt-5.4-nano rejects codex's Responses-API tool_search tool (400). Per your call, move the composer to DeepSeek-v4-pro, routed through a custom chat-completions provider (_deepseek_codex_config): DeepSeek serves chat, not Responses+tool_search. Reads DEEPSEEK_API_KEY/DEEPSEEK_BASE_URL from the job env.
R-OUTCOME demote — R-OUTCOME fails when a rollout has no valid scored outcome (error / unscored timeout — often a hard task that timed out), an experiment-fidelity issue on one rollout, not a PR regression. Demote an R-OUTCOME-only reject to a quarantine; any other deterministic reject still hard-blocks. (Note: R-OUTCOME is outcome-VALIDITY, not low reward — flagging in case you want it stricter.)

Test plan

ruff + YAML; build-review-pack + codex_review suites green incl. new deepseek-routing tests
Re-dispatch L3 on Preserve pi-acp model metadata through LiteLLM proxy #803 → codex runs on DeepSeek-v4-pro (chat wire, no tool_search) + produces a parseable verdict; R-OUTCOME cell quarantined → mergeable / mergeable-with-quarantines

…ation, demote R-OUTCOME The first codex execution (never run before) surfaced three issues: 1. Pass-1 deepseek findings crashed: codex_review.py called evidence.to_dict(), but RolloutEvidence is a frozen dataclass with no to_dict -> use dataclasses.asdict. 2. Pass-2 codex exec failed: gpt-5.4-nano rejects codex's Responses-API `tool_search` tool. Move the codex composer to DeepSeek-v4-pro, routed through a custom chat-completions provider (DeepSeek serves chat, not Responses + tool_search); _deepseek_codex_config emits the codex -c overrides and reads DEEPSEEK_API_KEY/DEEPSEEK_BASE_URL from the job env. CODEX_MODEL=deepseek-v4-pro in both review-pack codex steps. 3. R-OUTCOME (the rollout produced no valid SCORED outcome — an errored / unscored-timeout run, often a hard task the agent didn't finish in budget) is an experiment-fidelity issue on one rollout, not a PR regression. Demote an R-OUTCOME-ONLY reject to a quarantine (visible, non-blocking); any other deterministic reject still hard-blocks. Regression tests added for the deepseek codex routing. The pinned-baseline parity demote (prior PR) is confirmed working; deterministic grader graded 9/10 healthy.

greptile-apps · 2026-06-19T01:20:40Z

Greptile Summary

This PR patches three bugs surfaced by the first live codex execution after #808: a crash caused by calling .to_dict() on a frozen dataclass (RolloutEvidence), a 400 error from gpt-5.4-nano rejecting codex's tool_search tool, and an over-broad deterministic gate that blocked PRs when a rollout produced no scored outcome (timeout/error) rather than a genuine regression.

Crash fix: evidence.to_dict() → dataclasses.asdict(evidence) — the correct serialization path for a frozen dataclass with no custom method.
Model bump: codex composer moved from gpt-5.4-nano to gpt-5.5 with xhigh reasoning effort; a new _reasoning_config() helper injects the effort as a -c config override, tested with two new unit tests.
R-OUTCOME demotion: when the only failing deterministic gate is R-OUTCOME (rollout produced no valid scored outcome), the slot is now quarantined rather than hard-blocked; any other deterministic failure still hard-blocks as before.

Confidence Score: 4/5

Safe to merge for the three targeted fixes; the only remaining concern is the pre-existing data inconsistency in agent_judge_summary.json where a quarantined slot still advertises deterministic_reject: true, which an already-open thread has flagged.

The dataclasses.asdict crash fix and the reasoning-effort config helper are both correct and well-tested. The R-OUTCOME demotion logic routes the slot to healthy + quarantine correctly, and compute_verdict picks up those quarantines. The one outstanding inconsistency — slot.grade["deterministic_reject"] is never cleared to False when a slot is demoted — means the serialized agent_judge_summary.json will show status: healthy alongside deterministic_reject: true for those slots, which could prompt codex to re-escalate the quarantine during Pass 2. That issue was flagged in a prior thread and is not new to this PR, but it remains unresolved in the code.

.github/scripts/build_integration_review_pack.py — the R-OUTCOME demotion branch (lines 306-315) leaves slot.grade["deterministic_reject"] uncorrected after changing slot.status to "healthy".

Important Files Changed

Filename	Overview
.github/scripts/build_integration_review_pack.py	R-OUTCOME demotion logic added: R-OUTCOME-only rejects are quarantined instead of hard-blocked. The existing issue (slot.grade["deterministic_reject"] left True when status becomes "healthy") is still present, creating a data inconsistency in agent_judge_summary.json that codex reads in Pass 2.
.github/scripts/codex_review.py	Two clean fixes: dataclasses.asdict() replaces the missing .to_dict() call, and _reasoning_config() correctly reads CODEX_REASONING_EFFORT into a -c override injected before user-supplied overrides.
.github/workflows/integration-final-review.yml	CODEX_MODEL updated from gpt-5.4-nano to gpt-5.5 and CODEX_REASONING_EFFORT: xhigh added; explains the tool_search 400 fix.
.github/workflows/integration-scope.yml	Same model/effort change as integration-final-review.yml, keeping both workflow codex invocations consistent.
tests/test_codex_review.py	Two well-structured unit tests cover both the happy path and the empty/whitespace-only cases for _reasoning_config(); good coverage of the new helper.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant WF as GitHub Workflow
    participant BP as build_review_pack.py
    participant CR as codex_review.py
    participant DS as DeepSeek (Pass 1)
    participant CX as codex CLI / gpt-5.5 (Pass 2)

    WF->>BP: run deterministic grader
    BP->>BP: grade_rollout() per slot
    BP->>BP: R-OUTCOME only? quarantine (healthy)
    BP->>BP: Other deterministic fail? unhealthy
    BP-->>WF: deterministic_verdict + agent_judge_summary.json

    WF->>CR: run codex_review.py --deterministic-verdict
    CR->>CR: load_rollout_evidence() + dataclasses.asdict()
    CR->>DS: Pass 1 per-rollout findings
    DS-->>CR: JSON findings per rollout

    CR->>CR: _codex_env() swap DeepSeek key for real OpenAI key
    CR->>CR: "_reasoning_config() model_reasoning_effort=xhigh"
    CR->>CX: "codex exec --model gpt-5.5 -c model_reasoning_effort=xhigh"
    CX-->>CR: advisory verdict

    CR->>CR: worst(deterministic, codex_verdict)
    CR-->>WF: final_verdict (codex can only tighten, never loosen)

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant WF as GitHub Workflow
    participant BP as build_review_pack.py
    participant CR as codex_review.py
    participant DS as DeepSeek (Pass 1)
    participant CX as codex CLI / gpt-5.5 (Pass 2)

    WF->>BP: run deterministic grader
    BP->>BP: grade_rollout() per slot
    BP->>BP: R-OUTCOME only? quarantine (healthy)
    BP->>BP: Other deterministic fail? unhealthy
    BP-->>WF: deterministic_verdict + agent_judge_summary.json

    WF->>CR: run codex_review.py --deterministic-verdict
    CR->>CR: load_rollout_evidence() + dataclasses.asdict()
    CR->>DS: Pass 1 per-rollout findings
    DS-->>CR: JSON findings per rollout

    CR->>CR: _codex_env() swap DeepSeek key for real OpenAI key
    CR->>CR: "_reasoning_config() model_reasoning_effort=xhigh"
    CR->>CX: "codex exec --model gpt-5.5 -c model_reasoning_effort=xhigh"
    CX-->>CR: advisory verdict

    CR->>CR: worst(deterministic, codex_verdict)
    CR-->>WF: final_verdict (codex can only tighten, never loosen)

_{Reviews (2): Last reviewed commit: "fix(integration): run the codex reviewer..." | Re-trigger Greptile}

greptile-apps · 2026-06-19T01:20:44Z

+            slot.status = "healthy"
+            slot.grade["quarantines"] = [
+                *slot.grade.get("quarantines", []),
+                "R-OUTCOME: rollout produced no valid scored outcome "
+                "(error / unscored timeout) — quarantined, not a PR regression",
+            ]
+            slot.detail = (
+                f"healthy with {len(slot.grade['quarantines'])} quarantine(s) "
+                "(incl. R-OUTCOME)"
+            )


grade["deterministic_reject"] left True after R-OUTCOME demotion

When the R-OUTCOME-only branch fires, slot.status is set to "healthy" but slot.grade["deterministic_reject"] is never cleared — it remains True. agent_judge_summary.json serializes this field directly, so codex (Pass 2) will read a slot where "status": "healthy" and "deterministic_reject": true at the same time. That direct contradiction could cause the codex reviewer to escalate a quarantine to "not mergeable", which is exactly the false-positive this demotion is trying to prevent. Adding slot.grade["deterministic_reject"] = False (alongside the quarantine append) would make the review pack data internally consistent.

greptile-apps · 2026-06-19T01:20:45Z

+    base = (env.get("DEEPSEEK_BASE_URL") or "https://api.deepseek.com").rstrip("/")
+    # DeepSeek's OpenAI-compatible chat endpoint lives under /v1.
+    if not base.endswith("/v1"):
+        base = f"{base}/v1"


endswith("/v1") heuristic can double-append on non-standard proxy URLs

base.endswith("/v1") strips no path components before the check, so a DEEPSEEK_BASE_URL like https://proxy.example.com/deepseek/v1/chat/completions would pass through rstrip("/") unchanged, fail the endswith("/v1") guard, and produce …/chat/completions/v1 — an invalid endpoint. The default (https://api.deepseek.com) and the typical …/v1 form work correctly; this only breaks with unusual custom proxy paths.

… impossible deepseek route Local end-to-end test proved codex CANNOT use DeepSeek: codex (0.137+) removed `wire_api = "chat"` and speaks ONLY the OpenAI Responses API, which DeepSeek does not serve (same wall that gates codex-acp); and deepseek-v4-pro is a reasoning model returning empty `content`. Remove the dead _deepseek_codex_config (it emitted the now-rejected wire_api=chat). Run the codex composer on an OpenAI model instead: gpt-5.5 (a full model that supports codex's tools, unlike gpt-5.4-nano which rejected `tool_search`) with CODEX_REASONING_EFFORT=xhigh (validated against codex's effort enum). _reasoning_config emits the model_reasoning_effort `-c` override from the env. Keeps the prior fixes: RolloutEvidence serialization via dataclasses.asdict, and the R-OUTCOME-only -> quarantine demote. Regression tests updated.

Cross-checked the unaddressed greptile comments on the earlier integration PRs (#802-#810); most were already fixed or stale. Three genuine ones remained: - P1 (build_integration_review_pack): an R-OUTCOME-only deterministic reject is demoted to a healthy+quarantine slot, but grade["deterministic_reject"] was left True and serialized into agent_judge_summary.json. codex reads that and a "healthy slot with a deterministic reject" can spuriously push it to downgrade mergeable -> not mergeable. Clear the flag on demotion. + regression test. - codex_review deepseek-finding parser: replace the greedy `{.*}` DOTALL regex (which merges two JSON objects into one invalid blob) with raw_decode of the first complete object, tolerant of trailing prose. - integration_matrix: `git diff` runs with check=True but CalledProcessError bypassed the ScopeError fail-closed handler (uncaught traceback). Re-raise it as ScopeError so a failed/absent head_sha fails closed cleanly. Co-authored-by: symphony-bot <symphony@benchflow.ai>

Yiminnn had a problem deploying to pypi-internal-preview June 19, 2026 01:16 — with GitHub Actions Failure

greptile-apps Bot reviewed Jun 19, 2026

View reviewed changes

Yiminnn changed the title ~~fix(integration): codex on DeepSeek-v4-pro + evidence serialization + R-OUTCOME demote~~ fix(integration): codex reviewer on gpt-5.5 (xhigh) + evidence serialization + R-OUTCOME demote Jun 19, 2026

Yiminnn temporarily deployed to pypi-internal-preview June 19, 2026 01:35 — with GitHub Actions Inactive

Yiminnn merged commit 9db14f9 into main Jun 19, 2026
8 checks passed

Yiminnn deleted the fix/integration-codex-deepseek-r-outcome branch June 19, 2026 01:42

Yiminnn mentioned this pull request Jun 20, 2026

fix(integration): clear residual greptile findings on the L3 gate #817

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(integration): codex reviewer on gpt-5.5 (xhigh) + evidence serialization + R-OUTCOME demote#810

fix(integration): codex reviewer on gpt-5.5 (xhigh) + evidence serialization + R-OUTCOME demote#810
Yiminnn merged 2 commits into
mainfrom
fix/integration-codex-deepseek-r-outcome

Yiminnn commented Jun 19, 2026

Uh oh!

greptile-apps Bot commented Jun 19, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot Jun 19, 2026

Uh oh!

greptile-apps Bot Jun 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Yiminnn commented Jun 19, 2026

Test plan

Uh oh!

greptile-apps Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 19, 2026 •

edited

Loading