fix(integration): pin codex model + demote false pinned-baseline parity blocker by Yiminnn · Pull Request #809 · benchflow-ai/benchflow

Yiminnn · 2026-06-19T00:22:18Z

Follow-up to #806/#807/#808. The first full L3 run on #803 reached the verdict stage with plan ✓, 10/10 rollouts ✓, and deterministic grader ✓, but went red on two issues, both confirmed by a pre-flight audit before the run had finished:

1. codex default model not entitled → 'codex unavailable'

codex_review.py never sets --codex-model, so codex exec falls back to its built-in default (gpt-5.x-codex), which the repo OPENAI_API_KEY may not serve → model_not_found → fail-closed. Pin CODEX_MODEL=gpt-5.4-nano (the model the codex-acp cells prove the key serves) in both review-pack codex steps.
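The pin works only if the review script resolves the model before invoking codex. A minimal sketch of that resolution order, assuming an explicit flag value takes precedence over the CODEX_MODEL environment variable; the function name `resolve_codex_model` is hypothetical and not taken from codex_review.py:

```python
import os
from typing import Mapping, Optional


def resolve_codex_model(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the model to request, or None to leave codex on its built-in default.

    Hypothetical helper: the precedence (flag first, then CODEX_MODEL) is an
    assumption for illustration, not the confirmed behaviour of codex_review.py.
    """
    # An explicit flag value wins; otherwise fall back to the environment pin.
    candidate = cli_value if cli_value else env.get("CODEX_MODEL", "")
    candidate = candidate.strip()
    # An empty or whitespace-only value is treated as "not pinned".
    return candidate or None


if __name__ == "__main__":
    # With the workflow pin in place, the env var supplies the model.
    print(resolve_codex_model(None, {"CODEX_MODEL": "gpt-5.4-nano"}))
    # Without a pin, the caller gets None and codex would pick its default.
    print(resolve_codex_model(None, {}))
```

Returning None rather than a hardcoded fallback keeps the "not pinned" case visible to the caller, which is the case that produced model_not_found here.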

2. pinned-baseline parity gate structurally false-fails → hard not-mergeable

The workflow feeds a native BenchFlow HF leaderboard baseline to check_skillsbench_harbor_parity, which validates Harbor schema + a git pin → missing Harbor field(s) / pin mismatch → fail. That is not a real reward regression. Demote a pinned-baseline parity FAIL → quarantine (visible, non-blocking) in compute_verdict; within-PR docker/daytona parity still hard-blocks. Regression test added.
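The demotion can be sketched as a single kind-gated branch. This is a simplified stand-in, not the code in build_integration_review_pack.py: the real compute_verdict presumably takes more inputs than parity results, and the `ParityResult` fields and verdict string values shown here are assumptions based on the names used in this PR.

```python
from dataclasses import dataclass
from typing import Iterable

# Verdict values are assumed from the wording in this PR.
VERDICT_MERGEABLE = "mergeable"
VERDICT_QUARANTINES = "mergeable-with-quarantines"
VERDICT_NOT_MERGEABLE = "not-mergeable"


@dataclass(frozen=True)
class ParityResult:
    kind: str    # e.g. "pinned-baseline" or "within-pr" (assumed field)
    status: str  # "pass", "fail", "quarantine", or "na" (assumed field)
    detail: str = ""


def compute_verdict(parity_results: Iterable[ParityResult]) -> str:
    """Simplified sketch: classify parity results, then derive the verdict."""
    blockers: list[ParityResult] = []
    quarantines: list[ParityResult] = []
    for pr in parity_results:
        if pr.status == "fail":
            if pr.kind == "pinned-baseline":
                # Demoted: stays visible in the report but does not block.
                quarantines.append(pr)
            else:
                # Within-PR docker/daytona parity still hard-blocks.
                blockers.append(pr)
        elif pr.status == "quarantine":
            quarantines.append(pr)
        # "pass" and "na" contribute nothing to the verdict.
    if blockers:
        return VERDICT_NOT_MERGEABLE
    if quarantines:
        return VERDICT_QUARANTINES
    return VERDICT_MERGEABLE


if __name__ == "__main__":
    print(compute_verdict([ParityResult("pinned-baseline", "fail")]))
    print(compute_verdict([ParityResult("within-pr", "fail")]))
```

Because blockers are checked first, a demoted pinned-baseline failure cannot mask a genuine within-PR failure in the same run.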

Tracked follow-ups (not blocking)

- The real parity fix: a native-vs-native baseline mode for check_skillsbench_harbor_parity (currently the golden-truth gate can't compare a native HF baseline at all).
- The S-NOSKILL gate is silently NA: _load_production_evidence reads run_config.json, but production rollouts ship config.json → skill_mode=None. Add a config.json/result.json fallback.
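The second follow-up amounts to trying several file names in order. A rough sketch of such a fallback, assuming `skill_mode` is a top-level key in whichever JSON file carries it; the function name `load_skill_mode` and the lookup order are illustrative and not taken from _load_production_evidence:

```python
import json
from pathlib import Path
from typing import Optional

# Assumed lookup order: the file read today first, then the proposed fallbacks.
CANDIDATE_FILES = ("run_config.json", "config.json", "result.json")


def load_skill_mode(rollout_dir: Path) -> Optional[str]:
    """Return the first skill_mode found across the candidate files, else None."""
    for name in CANDIDATE_FILES:
        path = rollout_dir / name
        if not path.is_file():
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            # An unreadable or malformed file should not hide a later candidate.
            continue
        # Assumption: skill_mode lives at the top level of the JSON object.
        if isinstance(data, dict) and data.get("skill_mode") is not None:
            return str(data["skill_mode"])
    return None
```

Returning None only after every candidate has been tried keeps the gate's NA outcome for rollouts that genuinely carry no skill mode.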

Test plan

- ruff + YAML checks; build-review-pack suite green, including the new demote regression test
- Re-dispatch L3 on #803 (Preserve pi-acp model metadata through LiteLLM proxy) → codex runs on gpt-5.4-nano + pinned-baseline demoted → real verdict (mergeable / mergeable-with-quarantines)

…alse parity blocker

The first full L3 run on #803 (plan ✓, 10/10 rollouts ✓, deterministic grader ✓) went red for two reasons, both confirmed by a pre-flight audit:

1. codex used its built-in default model (no --codex-model set), which the repo OPENAI_API_KEY is not entitled to -> model_not_found -> 'codex unavailable'. Pin CODEX_MODEL=gpt-5.4-nano (the model the codex-acp cells prove the key serves) in both review-pack codex steps.

2. The pinned-baseline reward-band gate STRUCTURALLY false-fails: the workflow feeds a NATIVE BenchFlow HF leaderboard baseline to check_skillsbench_harbor_parity, which validates Harbor schema + a git pin -> 'missing Harbor field(s)' / pin mismatch -> fail -> hard not-mergeable. This is not a real reward regression. Demote a pinned-baseline parity FAIL to a QUARANTINE (visible, non-blocking); within-PR docker/daytona parity still hard-blocks.

Follow-ups (documented, not blocking): a native-vs-native baseline mode for check_skillsbench_harbor_parity (the real parity fix); the S-NOSKILL gate is silently NA because production rollouts ship config.json, not run_config.json.

Regression test added for the parity demote.

greptile-apps · 2026-06-19T00:24:40Z

Greptile Summary

This PR makes two targeted fixes to the L3 integration review pipeline: it pins the codex CLI model to gpt-5.4-nano in both workflow files to prevent model_not_found failures when the default model isn't entitled, and it demotes pinned-baseline parity failures from hard blockers to quarantines (visible but non-blocking) since the current gate structurally false-fails against native HF baselines.

- Codex model pin: Both integration-final-review.yml and integration-scope.yml now set CODEX_MODEL: gpt-5.4-nano in the env block; codex_review.py already reads this via os.environ.get("CODEX_MODEL"), so no script changes are needed.
- Parity demotion: compute_verdict in build_integration_review_pack.py now branches on pr.kind == "pinned-baseline" to route those failures to quarantines rather than blockers; within-PR docker/daytona parity failures still hard-block. A regression test verifies both behaviours.

Confidence Score: 5/5

Both fixes are narrow, well-scoped, and backed by a targeted regression test; no pre-existing gates are weakened.

The codex model pin is a one-liner that routes through an already-supported env-var path in codex_review.py. The parity demotion adds a single kind-gated branch in compute_verdict; within-PR docker/daytona parity failures still hard-block as before, so no real-regression coverage is lost. The new test directly exercises both the demoted case and the still-blocking case against the same function. The hardcoded gpt-5.4-nano string will need updating if the entitled model changes, but there is clear documentation of the intent and the PR description explains the evidence trail.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/scripts/build_integration_review_pack.py | Adds a `pr.kind == "pinned-baseline"` guard to `compute_verdict` so false-failing parity results route to quarantine rather than hard-blocking; the branch ordering and logic are correct. |
| .github/workflows/integration-final-review.yml | Adds `CODEX_MODEL: gpt-5.4-nano` env var to the codex step so `codex_review.py` picks it up via `os.environ.get("CODEX_MODEL")`; the existing env-isolation pattern is respected. |
| .github/workflows/integration-scope.yml | Same `CODEX_MODEL: gpt-5.4-nano` pin as in `integration-final-review.yml`; advisory codex step only. |
| tests/test_build_review_pack.py | Adds a regression test covering both the demotion case (pinned-baseline fail → quarantine) and the hard-block case (within-pr fail → not-mergeable); field ordering in ParityResult constructor is correct. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[compute_verdict receives ParityResult] --> B{pr.status == 'fail'?}
    B -- No --> C{pr.status == 'quarantine'?}
    C -- Yes --> D[quarantines.append]
    C -- No --> E[ignored / na]

    B -- Yes --> F{pr.kind == 'pinned-baseline'?}
    F -- Yes --> G["quarantines.append (advisory — non-blocking)"]
    F -- No --> H["blockers.append (hard-block)"]

    H --> I{blockers non-empty?}
    G --> I
    D --> I
    I -- Yes --> J["VERDICT_NOT_MERGEABLE"]
    I -- No --> K{quarantines non-empty?}
    K -- Yes --> L["VERDICT_QUARANTINES (mergeable with quarantines)"]
    K -- No --> M["VERDICT_MERGEABLE"]

Reviews (1): Last reviewed commit: "fix(integration): unblock the L3 verdict..."

Yiminnn temporarily deployed to pypi-internal-preview June 19, 2026 00:22 — with GitHub Actions Inactive

Yiminnn merged commit 4fe3f39 into main Jun 19, 2026
8 checks passed

Yiminnn deleted the fix/integration-parity-demote-codex-model branch June 19, 2026 00:28
