Skip to content

fix(integration): codex reviewer on gpt-5.5 (xhigh) + evidence serialization + R-OUTCOME demote#810

Merged
Yiminnn merged 2 commits into
mainfrom
fix/integration-codex-deepseek-r-outcome
Jun 19, 2026
Merged

fix(integration): codex reviewer on gpt-5.5 (xhigh) + evidence serialization + R-OUTCOME demote#810
Yiminnn merged 2 commits into
mainfrom
fix/integration-codex-deepseek-r-outcome

Conversation

@Yiminnn

@Yiminnn Yiminnn commented Jun 19, 2026

Copy link
Copy Markdown
Collaborator

Follow-up. The first real codex execution (it had never run before #808) surfaced three issues; the rest of the L3 pipeline is confirmed working (plan ✓, 10/10 rollouts ✓, deterministic grader ✓, parity demote ✓).

  1. Pass-1 crashcodex_review.py called evidence.to_dict(), but RolloutEvidence is a frozen dataclass with no to_dictdataclasses.asdict.
  2. Pass-2 codex modelgpt-5.4-nano rejects codex's Responses-API tool_search tool (400). Per your call, move the composer to DeepSeek-v4-pro, routed through a custom chat-completions provider (_deepseek_codex_config): DeepSeek serves chat, not Responses+tool_search. Reads DEEPSEEK_API_KEY/DEEPSEEK_BASE_URL from the job env.
  3. R-OUTCOME demote — R-OUTCOME fails when a rollout has no valid scored outcome (error / unscored timeout — often a hard task that timed out), an experiment-fidelity issue on one rollout, not a PR regression. Demote an R-OUTCOME-only reject to a quarantine; any other deterministic reject still hard-blocks. (Note: R-OUTCOME is outcome-VALIDITY, not low reward — flagging in case you want it stricter.)

Test plan

  • ruff + YAML; build-review-pack + codex_review suites green incl. new deepseek-routing tests
  • Re-dispatch L3 on Preserve pi-acp model metadata through LiteLLM proxy #803 → codex runs on DeepSeek-v4-pro (chat wire, no tool_search) + produces a parseable verdict; R-OUTCOME cell quarantined → mergeable / mergeable-with-quarantines

…ation, demote R-OUTCOME

The first codex execution (never run before) surfaced three issues:

1. Pass-1 deepseek findings crashed: codex_review.py called evidence.to_dict(),
   but RolloutEvidence is a frozen dataclass with no to_dict -> use
   dataclasses.asdict.
2. Pass-2 codex exec failed: gpt-5.4-nano rejects codex's Responses-API
   `tool_search` tool. Move the codex composer to DeepSeek-v4-pro, routed through
   a custom chat-completions provider (DeepSeek serves chat, not Responses +
   tool_search); _deepseek_codex_config emits the codex -c overrides and reads
   DEEPSEEK_API_KEY/DEEPSEEK_BASE_URL from the job env. CODEX_MODEL=deepseek-v4-pro
   in both review-pack codex steps.
3. R-OUTCOME (the rollout produced no valid SCORED outcome — an errored /
   unscored-timeout run, often a hard task the agent didn't finish in budget) is
   an experiment-fidelity issue on one rollout, not a PR regression. Demote an
   R-OUTCOME-ONLY reject to a quarantine (visible, non-blocking); any other
   deterministic reject still hard-blocks.

Regression tests added for the deepseek codex routing. The pinned-baseline parity
demote (prior PR) is confirmed working; deterministic grader graded 9/10 healthy.
@Yiminnn Yiminnn had a problem deploying to pypi-internal-preview June 19, 2026 01:16 — with GitHub Actions Failure
@greptile-apps

greptile-apps Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR patches three bugs surfaced by the first live codex execution after #808: a crash caused by calling .to_dict() on a frozen dataclass (RolloutEvidence), a 400 error from gpt-5.4-nano rejecting codex's tool_search tool, and an over-broad deterministic gate that blocked PRs when a rollout produced no scored outcome (timeout/error) rather than a genuine regression.

  • Crash fix: evidence.to_dict()dataclasses.asdict(evidence) — the correct serialization path for a frozen dataclass with no custom method.
  • Model bump: codex composer moved from gpt-5.4-nano to gpt-5.5 with xhigh reasoning effort; a new _reasoning_config() helper injects the effort as a -c config override, tested with two new unit tests.
  • R-OUTCOME demotion: when the only failing deterministic gate is R-OUTCOME (rollout produced no valid scored outcome), the slot is now quarantined rather than hard-blocked; any other deterministic failure still hard-blocks as before.

Confidence Score: 4/5

Safe to merge for the three targeted fixes; the only remaining concern is the pre-existing data inconsistency in agent_judge_summary.json where a quarantined slot still advertises deterministic_reject: true, which an already-open thread has flagged.

The dataclasses.asdict crash fix and the reasoning-effort config helper are both correct and well-tested. The R-OUTCOME demotion logic routes the slot to healthy + quarantine correctly, and compute_verdict picks up those quarantines. The one outstanding inconsistency — slot.grade["deterministic_reject"] is never cleared to False when a slot is demoted — means the serialized agent_judge_summary.json will show status: healthy alongside deterministic_reject: true for those slots, which could prompt codex to re-escalate the quarantine during Pass 2. That issue was flagged in a prior thread and is not new to this PR, but it remains unresolved in the code.

.github/scripts/build_integration_review_pack.py — the R-OUTCOME demotion branch (lines 306-315) leaves slot.grade["deterministic_reject"] uncorrected after changing slot.status to "healthy".

Important Files Changed

Filename Overview
.github/scripts/build_integration_review_pack.py R-OUTCOME demotion logic added: R-OUTCOME-only rejects are quarantined instead of hard-blocked. The existing issue (slot.grade["deterministic_reject"] left True when status becomes "healthy") is still present, creating a data inconsistency in agent_judge_summary.json that codex reads in Pass 2.
.github/scripts/codex_review.py Two clean fixes: dataclasses.asdict() replaces the missing .to_dict() call, and _reasoning_config() correctly reads CODEX_REASONING_EFFORT into a -c override injected before user-supplied overrides.
.github/workflows/integration-final-review.yml CODEX_MODEL updated from gpt-5.4-nano to gpt-5.5 and CODEX_REASONING_EFFORT: xhigh added; explains the tool_search 400 fix.
.github/workflows/integration-scope.yml Same model/effort change as integration-final-review.yml, keeping both workflow codex invocations consistent.
tests/test_codex_review.py Two well-structured unit tests cover both the happy path and the empty/whitespace-only cases for _reasoning_config(); good coverage of the new helper.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant WF as GitHub Workflow
    participant BP as build_review_pack.py
    participant CR as codex_review.py
    participant DS as DeepSeek (Pass 1)
    participant CX as codex CLI / gpt-5.5 (Pass 2)

    WF->>BP: run deterministic grader
    BP->>BP: grade_rollout() per slot
    BP->>BP: R-OUTCOME only? quarantine (healthy)
    BP->>BP: Other deterministic fail? unhealthy
    BP-->>WF: deterministic_verdict + agent_judge_summary.json

    WF->>CR: run codex_review.py --deterministic-verdict
    CR->>CR: load_rollout_evidence() + dataclasses.asdict()
    CR->>DS: Pass 1 per-rollout findings
    DS-->>CR: JSON findings per rollout

    CR->>CR: _codex_env() swap DeepSeek key for real OpenAI key
    CR->>CR: "_reasoning_config() model_reasoning_effort=xhigh"
    CR->>CX: "codex exec --model gpt-5.5 -c model_reasoning_effort=xhigh"
    CX-->>CR: advisory verdict

    CR->>CR: worst(deterministic, codex_verdict)
    CR-->>WF: final_verdict (codex can only tighten, never loosen)
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant WF as GitHub Workflow
    participant BP as build_review_pack.py
    participant CR as codex_review.py
    participant DS as DeepSeek (Pass 1)
    participant CX as codex CLI / gpt-5.5 (Pass 2)

    WF->>BP: run deterministic grader
    BP->>BP: grade_rollout() per slot
    BP->>BP: R-OUTCOME only? quarantine (healthy)
    BP->>BP: Other deterministic fail? unhealthy
    BP-->>WF: deterministic_verdict + agent_judge_summary.json

    WF->>CR: run codex_review.py --deterministic-verdict
    CR->>CR: load_rollout_evidence() + dataclasses.asdict()
    CR->>DS: Pass 1 per-rollout findings
    DS-->>CR: JSON findings per rollout

    CR->>CR: _codex_env() swap DeepSeek key for real OpenAI key
    CR->>CR: "_reasoning_config() model_reasoning_effort=xhigh"
    CR->>CX: "codex exec --model gpt-5.5 -c model_reasoning_effort=xhigh"
    CX-->>CR: advisory verdict

    CR->>CR: worst(deterministic, codex_verdict)
    CR-->>WF: final_verdict (codex can only tighten, never loosen)
Loading

Reviews (2): Last reviewed commit: "fix(integration): run the codex reviewer..." | Re-trigger Greptile

Comment on lines +306 to +315
slot.status = "healthy"
slot.grade["quarantines"] = [
*slot.grade.get("quarantines", []),
"R-OUTCOME: rollout produced no valid scored outcome "
"(error / unscored timeout) — quarantined, not a PR regression",
]
slot.detail = (
f"healthy with {len(slot.grade['quarantines'])} quarantine(s) "
"(incl. R-OUTCOME)"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 grade["deterministic_reject"] left True after R-OUTCOME demotion

When the R-OUTCOME-only branch fires, slot.status is set to "healthy" but slot.grade["deterministic_reject"] is never cleared — it remains True. agent_judge_summary.json serializes this field directly, so codex (Pass 2) will read a slot where "status": "healthy" and "deterministic_reject": true at the same time. That direct contradiction could cause the codex reviewer to escalate a quarantine to "not mergeable", which is exactly the false-positive this demotion is trying to prevent. Adding slot.grade["deterministic_reject"] = False (alongside the quarantine append) would make the review pack data internally consistent.

Comment thread .github/scripts/codex_review.py Outdated
Comment on lines +300 to +303
base = (env.get("DEEPSEEK_BASE_URL") or "https://api.deepseek.com").rstrip("/")
# DeepSeek's OpenAI-compatible chat endpoint lives under /v1.
if not base.endswith("/v1"):
base = f"{base}/v1"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 endswith("/v1") heuristic can double-append on non-standard proxy URLs

base.endswith("/v1") strips no path components before the check, so a DEEPSEEK_BASE_URL like https://proxy.example.com/deepseek/v1/chat/completions would pass through rstrip("/") unchanged, fail the endswith("/v1") guard, and produce …/chat/completions/v1 — an invalid endpoint. The default (https://api.deepseek.com) and the typical …/v1 form work correctly; this only breaks with unusual custom proxy paths.

… impossible deepseek route

Local end-to-end test proved codex CANNOT use DeepSeek: codex (0.137+) removed
`wire_api = "chat"` and speaks ONLY the OpenAI Responses API, which DeepSeek does
not serve (same wall that gates codex-acp); and deepseek-v4-pro is a reasoning
model returning empty `content`. Remove the dead _deepseek_codex_config (it emitted
the now-rejected wire_api=chat).

Run the codex composer on an OpenAI model instead: gpt-5.5 (a full model that
supports codex's tools, unlike gpt-5.4-nano which rejected `tool_search`) with
CODEX_REASONING_EFFORT=xhigh (validated against codex's effort enum). _reasoning_config
emits the model_reasoning_effort `-c` override from the env.

Keeps the prior fixes: RolloutEvidence serialization via dataclasses.asdict, and the
R-OUTCOME-only -> quarantine demote. Regression tests updated.
@Yiminnn Yiminnn changed the title fix(integration): codex on DeepSeek-v4-pro + evidence serialization + R-OUTCOME demote fix(integration): codex reviewer on gpt-5.5 (xhigh) + evidence serialization + R-OUTCOME demote Jun 19, 2026
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview June 19, 2026 01:35 — with GitHub Actions Inactive
@Yiminnn Yiminnn merged commit 9db14f9 into main Jun 19, 2026
8 checks passed
@Yiminnn Yiminnn deleted the fix/integration-codex-deepseek-r-outcome branch June 19, 2026 01:42
Yiminnn added a commit that referenced this pull request Jun 20, 2026
Cross-checked the unaddressed greptile comments on the earlier integration PRs
(#802-#810); most were already fixed or stale. Three genuine ones remained:

- P1 (build_integration_review_pack): an R-OUTCOME-only deterministic reject is
  demoted to a healthy+quarantine slot, but grade["deterministic_reject"] was
  left True and serialized into agent_judge_summary.json. codex reads that and a
  "healthy slot with a deterministic reject" can spuriously push it to downgrade
  mergeable -> not mergeable. Clear the flag on demotion. + regression test.
- codex_review deepseek-finding parser: replace the greedy `{.*}` DOTALL regex
  (which merges two JSON objects into one invalid blob) with raw_decode of the
  first complete object, tolerant of trailing prose.
- integration_matrix: `git diff` runs with check=True but CalledProcessError
  bypassed the ScopeError fail-closed handler (uncaught traceback). Re-raise it
  as ScopeError so a failed/absent head_sha fails closed cleanly.

Co-authored-by: symphony-bot <symphony@benchflow.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant