fix(integration): pin codex model + demote false pinned-baseline parity blocker#809
Conversation
…alse parity blocker The first full L3 run on #803 (plan ✓, 10/10 rollouts ✓, deterministic grader ✓) went red for two reasons, both confirmed by a pre-flight audit: 1. codex used its built-in default model (no --codex-model set), which the repo OPENAI_API_KEY is not entitled to -> model_not_found -> 'codex unavailable'. Pin CODEX_MODEL=gpt-5.4-nano (the model the codex-acp cells prove the key serves) in both review-pack codex steps. 2. the pinned-baseline reward-band gate STRUCTURALLY false-fails: the workflow feeds a NATIVE BenchFlow HF leaderboard baseline to check_skillsbench_harbor_ parity, which validates Harbor schema + a git pin -> 'missing Harbor field(s)' / pin mismatch -> fail -> hard not-mergeable. This is not a real reward regression. Demote a pinned-baseline parity FAIL to a QUARANTINE (visible, non-blocking); within-PR docker/daytona parity still hard-blocks. Follow-ups (documented, not blocking): a native-vs-native baseline mode for check_skillsbench_harbor_parity (the real parity fix); the S-NOSKILL gate is silently NA because production rollouts ship config.json not run_config.json. Regression test added for the parity demote.
Greptile SummaryThis PR makes two targeted fixes to the L3 integration review pipeline: it pins the codex CLI model to
Confidence Score: 5/5Both fixes are narrow, well-scoped, and backed by a targeted regression test; no pre-existing gates are weakened. The codex model pin is a one-liner that routes through an already-supported env-var path in codex_review.py. The parity demotion adds a single kind-gated branch in compute_verdict; within-PR docker/daytona parity failures still hard-block as before, so no real-regression coverage is lost. The new test directly exercises both the demoted case and the still-blocking case against the same function. The hardcoded gpt-5.4-nano string will need updating if the entitled model changes, but there is clear documentation of the intent and the PR description explains the evidence trail. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[compute_verdict receives ParityResult] --> B{pr.status == 'fail'?}
B -- No --> C{pr.status == 'quarantine'?}
C -- Yes --> D[quarantines.append]
C -- No --> E[ignored / na]
B -- Yes --> F{pr.kind == 'pinned-baseline'?}
F -- Yes --> G["quarantines.append (advisory — non-blocking)"]
F -- No --> H["blockers.append (hard-block)"]
H --> I{blockers non-empty?}
G --> I
D --> I
I -- Yes --> J["VERDICT_NOT_MERGEABLE"]
I -- No --> K{quarantines non-empty?}
K -- Yes --> L["VERDICT_QUARANTINES (mergeable with quarantines)"]
K -- No --> M["VERDICT_MERGEABLE"]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[compute_verdict receives ParityResult] --> B{pr.status == 'fail'?}
B -- No --> C{pr.status == 'quarantine'?}
C -- Yes --> D[quarantines.append]
C -- No --> E[ignored / na]
B -- Yes --> F{pr.kind == 'pinned-baseline'?}
F -- Yes --> G["quarantines.append (advisory — non-blocking)"]
F -- No --> H["blockers.append (hard-block)"]
H --> I{blockers non-empty?}
G --> I
D --> I
I -- Yes --> J["VERDICT_NOT_MERGEABLE"]
I -- No --> K{quarantines non-empty?}
K -- Yes --> L["VERDICT_QUARANTINES (mergeable with quarantines)"]
K -- No --> M["VERDICT_MERGEABLE"]
Reviews (1): Last reviewed commit: "fix(integration): unblock the L3 verdict..." | Re-trigger Greptile |
Follow-up to #806/#807/#808. The first full L3 run on #803 reached the verdict with plan ✓, 10/10 rollouts ✓, deterministic grader ✓ but went red on two confirmed issues (pre-flight audited before the run even finished):
1. codex default model not entitled → 'codex unavailable'
codex_review.py never sets
--codex-model, socodex execfalls back to its built-in default (gpt-5.x-codex), which the repoOPENAI_API_KEYmay not serve →model_not_found→ fail-closed. PinCODEX_MODEL=gpt-5.4-nano(the model the codex-acp cells prove the key serves) in both review-pack codex steps.2. pinned-baseline parity gate structurally false-fails → hard not-mergeable
The workflow feeds a native BenchFlow HF leaderboard baseline to
check_skillsbench_harbor_parity, which validates Harbor schema + a git pin →missing Harbor field(s)/ pin mismatch →fail. That is not a real reward regression. Demote a pinned-baseline parity FAIL → quarantine (visible, non-blocking) incompute_verdict; within-PR docker/daytona parity still hard-blocks. Regression test added.Tracked follow-ups (not blocking)
check_skillsbench_harbor_parity(currently the golden-truth gate can't compare a native HF baseline at all).S-NOSKILLgate is silentlyna:_load_production_evidencereadsrun_config.json, but production rollouts shipconfig.json→skill_mode=None. Add a config.json/result.json fallback.Test plan