fix(integration): pin codex model + demote false pinned-baseline parity blocker #809

Merged: Yiminnn merged 1 commit into main from fix/integration-parity-demote-codex-model on Jun 19, 2026
Yiminnn (Collaborator) commented Jun 19, 2026

Follow-up to #806/#807/#808. The first full L3 run on #803 reached the verdict with plan ✓, 10/10 rollouts ✓, deterministic grader ✓ but went red on two confirmed issues (pre-flight audited before the run even finished):

1. codex default model not entitled → 'codex unavailable'

codex_review.py never sets --codex-model, so codex exec falls back to its built-in default (gpt-5.x-codex), which the repo OPENAI_API_KEY may not serve → model_not_found → fail-closed. Pin CODEX_MODEL=gpt-5.4-nano (the model the codex-acp cells prove the key serves) in both review-pack codex steps.
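A minimal sketch of the env-driven pin described above. The helper names `resolve_codex_model` and `build_codex_argv` are hypothetical and are not taken from `codex_review.py`; the sketch also assumes the codex CLI accepts a `--model <name>` option on `codex exec`. It only illustrates why an unset `CODEX_MODEL` leaves the CLI on its built-in default.

```python
import os
from typing import List, Mapping, Optional


def resolve_codex_model(env: Mapping[str, str]) -> Optional[str]:
    """Return the pinned model, or None when CODEX_MODEL is unset or blank."""
    model = env.get("CODEX_MODEL", "").strip()
    return model or None


def build_codex_argv(prompt: str, env: Mapping[str, str]) -> List[str]:
    """Build a `codex exec` command line (hypothetical helper).

    Assumption: the CLI accepts `--model <name>`. When no model is passed,
    codex falls back to its built-in default, which is the failure mode
    described in the PR.
    """
    argv = ["codex", "exec"]
    model = resolve_codex_model(env)
    if model is not None:
        argv += ["--model", model]
    argv.append(prompt)
    return argv


if __name__ == "__main__":
    # With the workflow env block setting CODEX_MODEL, the pin is applied.
    print(build_codex_argv("review this diff", {"CODEX_MODEL": "gpt-5.4-nano"}))
    # Without it, no model argument is passed at all.
    print(build_codex_argv("review this diff", {}))
    # In CI the real environment would be used instead of a literal dict.
    print(resolve_codex_model(os.environ))
```

Keeping the pin in the workflow `env` block rather than in the script means the entitled model can be changed without touching Python code.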

2. pinned-baseline parity gate structurally false-fails → hard not-mergeable

The workflow feeds a native BenchFlow HF leaderboard baseline to check_skillsbench_harbor_parity, which validates Harbor schema + a git pin → missing Harbor field(s) / pin mismatch → fail. That is not a real reward regression. Demote a pinned-baseline parity FAIL → quarantine (visible, non-blocking) in compute_verdict; within-PR docker/daytona parity still hard-blocks. Regression test added.
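The demotion can be sketched roughly as follows. This is not the code from `build_integration_review_pack.py`: the `ParityResult` field layout, the string values of the verdict constants, and the return shape are assumptions made for illustration. Only the `kind`/`status` branching follows the behaviour described here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Constant names mirror the verdicts named in the review; values are assumed.
VERDICT_MERGEABLE = "mergeable"
VERDICT_QUARANTINES = "mergeable-with-quarantines"
VERDICT_NOT_MERGEABLE = "not-mergeable"


@dataclass(frozen=True)
class ParityResult:
    kind: str    # e.g. "pinned-baseline" or "within-pr"
    status: str  # "pass", "fail", "quarantine", or "na"
    detail: str = ""


def compute_verdict(results: Iterable[ParityResult]) -> Tuple[str, List[str], List[str]]:
    """Return (verdict, blockers, quarantines) for a set of parity results."""
    blockers: List[str] = []
    quarantines: List[str] = []
    for pr in results:
        label = f"{pr.kind}: {pr.detail or pr.status}"
        if pr.status == "fail":
            if pr.kind == "pinned-baseline":
                # Structural false-fail: keep it visible but non-blocking.
                quarantines.append(label)
            else:
                # Within-PR docker/daytona parity still hard-blocks.
                blockers.append(label)
        elif pr.status == "quarantine":
            quarantines.append(label)
        # "pass" and "na" contribute nothing to either list.
    if blockers:
        return VERDICT_NOT_MERGEABLE, blockers, quarantines
    if quarantines:
        return VERDICT_QUARANTINES, blockers, quarantines
    return VERDICT_MERGEABLE, blockers, quarantines


if __name__ == "__main__":
    print(compute_verdict([ParityResult("pinned-baseline", "fail", "pin mismatch")])[0])
    print(compute_verdict([ParityResult("within-pr", "fail", "reward band")])[0])
```

A regression test in this style would assert both directions: a pinned-baseline fail yields the quarantine verdict with no blockers, and a within-PR fail still yields not-mergeable.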

Tracked follow-ups (not blocking)

  • The real parity fix: a native-vs-native baseline mode for check_skillsbench_harbor_parity (currently the golden-truth gate can't compare a native HF baseline at all).
  • S-NOSKILL gate is silently na: _load_production_evidence reads run_config.json, but production rollouts ship config.json instead, so skill_mode resolves to None. Add a config.json/result.json fallback.
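One possible shape for the fallback in the second follow-up, as a sketch only. The function `load_skill_mode` is hypothetical and is not `_load_production_evidence`; it assumes `skill_mode` is a top-level key in whichever JSON file carries it, which the PR does not confirm.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

# Lookup order is an assumption: prefer run_config.json, then fall back.
CANDIDATE_FILES = ("run_config.json", "config.json", "result.json")


def load_skill_mode(rollout_dir: Path) -> Optional[str]:
    """Return the first skill_mode found, trying each candidate file in order."""
    for name in CANDIDATE_FILES:
        path = rollout_dir / name
        if not path.is_file():
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # Unreadable file: try the next candidate.
        if isinstance(data, dict) and data.get("skill_mode") is not None:
            return str(data["skill_mode"])
    return None  # Nothing found: the gate would still report na.


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        rollout = Path(tmp)
        # Simulate a production rollout that ships only config.json.
        (rollout / "config.json").write_text(json.dumps({"skill_mode": "no-skill"}))
        print(load_skill_mode(rollout))
```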

Test plan

…alse parity blocker

The first full L3 run on #803 (plan ✓, 10/10 rollouts ✓, deterministic grader ✓)
went red for two reasons, both confirmed by a pre-flight audit:

1. codex used its built-in default model (no --codex-model set), which the repo
   OPENAI_API_KEY is not entitled to -> model_not_found -> 'codex unavailable'.
   Pin CODEX_MODEL=gpt-5.4-nano (the model the codex-acp cells prove the key
   serves) in both review-pack codex steps.
2. the pinned-baseline reward-band gate STRUCTURALLY false-fails: the workflow
   feeds a NATIVE BenchFlow HF leaderboard baseline to
   check_skillsbench_harbor_parity, which validates Harbor schema + a git pin
   -> 'missing Harbor field(s)' / pin mismatch -> fail -> hard not-mergeable.
   This is not a real reward regression. Demote a pinned-baseline parity FAIL
   to a QUARANTINE (visible, non-blocking); within-PR docker/daytona parity
   still hard-blocks.

Follow-ups (documented, not blocking): a native-vs-native baseline mode for
check_skillsbench_harbor_parity (the real parity fix); the S-NOSKILL gate is
silently NA because production rollouts ship config.json not run_config.json.
Regression test added for the parity demote.
@Yiminnn Yiminnn temporarily deployed to pypi-internal-preview June 19, 2026 00:22 — with GitHub Actions Inactive
greptile-apps Bot (Contributor) commented Jun 19, 2026

Greptile Summary

This PR makes two targeted fixes to the L3 integration review pipeline: it pins the codex CLI model to gpt-5.4-nano in both workflow files to prevent model_not_found failures when the default model isn't entitled, and it demotes pinned-baseline parity failures from hard blockers to quarantines (visible but non-blocking) since the current gate structurally false-fails against native HF baselines.

  • Codex model pin: Both integration-final-review.yml and integration-scope.yml now set CODEX_MODEL: gpt-5.4-nano in the env block; codex_review.py already reads this via os.environ.get("CODEX_MODEL"), so no script changes are needed.
  • Parity demotion: compute_verdict in build_integration_review_pack.py now branches on pr.kind == "pinned-baseline" to route those failures to quarantines rather than blockers; within-PR docker/daytona parity failures still hard-block. A regression test verifies both behaviours.

Confidence Score: 5/5

Both fixes are narrow, well-scoped, and backed by a targeted regression test; no pre-existing gates are weakened.

The codex model pin is a one-liner that routes through an already-supported env-var path in codex_review.py. The parity demotion adds a single kind-gated branch in compute_verdict; within-PR docker/daytona parity failures still hard-block as before, so no real-regression coverage is lost. The new test directly exercises both the demoted case and the still-blocking case against the same function. The hardcoded gpt-5.4-nano string will need updating if the entitled model changes, but there is clear documentation of the intent and the PR description explains the evidence trail.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/scripts/build_integration_review_pack.py | Adds a pr.kind == "pinned-baseline" guard to compute_verdict so false-failing parity results route to quarantine rather than hard-blocking; the branch ordering and logic are correct. |
| .github/workflows/integration-final-review.yml | Adds CODEX_MODEL: gpt-5.4-nano env var to the codex step so codex_review.py picks it up via os.environ.get("CODEX_MODEL"); the existing env-isolation pattern is respected. |
| .github/workflows/integration-scope.yml | Same CODEX_MODEL: gpt-5.4-nano pin as in integration-final-review.yml; advisory codex step only. |
| tests/test_build_review_pack.py | Adds a regression test covering both the demotion case (pinned-baseline fail → quarantine) and the hard-block case (within-pr fail → not-mergeable); field ordering in ParityResult constructor is correct. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[compute_verdict receives ParityResult] --> B{pr.status == 'fail'?}
    B -- No --> C{pr.status == 'quarantine'?}
    C -- Yes --> D[quarantines.append]
    C -- No --> E[ignored / na]

    B -- Yes --> F{pr.kind == 'pinned-baseline'?}
    F -- Yes --> G["quarantines.append (advisory — non-blocking)"]
    F -- No --> H["blockers.append (hard-block)"]

    H --> I{blockers non-empty?}
    G --> I
    D --> I
    I -- Yes --> J["VERDICT_NOT_MERGEABLE"]
    I -- No --> K{quarantines non-empty?}
    K -- Yes --> L["VERDICT_QUARANTINES (mergeable with quarantines)"]
    K -- No --> M["VERDICT_MERGEABLE"]

Reviews (1): Last reviewed commit: "fix(integration): unblock the L3 verdict..."

@Yiminnn Yiminnn merged commit 4fe3f39 into main Jun 19, 2026
8 checks passed
@Yiminnn Yiminnn deleted the fix/integration-parity-demote-codex-model branch June 19, 2026 00:28