Preserve pi-acp model metadata through LiteLLM proxy#803
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eeec75bb7d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| @pytest.mark.asyncio | ||
| async def test_pi_acp_proxy_preserves_provider_model_metadata(monkeypatch): | ||
| """Pi sees the LiteLLM alias, so metadata must be copied to that id.""" |
There was a problem hiding this comment.
Name the regression guard in the test docstring
This new test is the regression coverage for the Pi/LiteLLM aliasing fix, but its docstring does not identify the PR or commit it guards. The root AGENTS.md explicitly requires regression tests to name the PR/commit they guard, so please include this commit/PR identifier in the docstring to preserve the maintenance context.
Useful? React with 👍 / 👎.
| alias_entry = dict(entry) | ||
| alias_entry["id"] = route.model_alias | ||
| alias_entry.setdefault("name", route.model_alias) |
There was a problem hiding this comment.
When the original entry already has a
"name" key — as in the test fixture where "name": "Qwen/Qwen3-4B" is present — setdefault is a no-op, so the alias entry ends up with id = "benchflow-vllm-Qwen-Qwen3-4B" but name = "Qwen/Qwen3-4B". If pi-acp uses name for display or any secondary lookup this creates a stale/inconsistent value. Using a direct assignment ensures the alias entry's name always reflects the alias ID.
| alias_entry = dict(entry) | |
| alias_entry["id"] = route.model_alias | |
| alias_entry.setdefault("name", route.model_alias) | |
| alias_entry = dict(entry) | |
| alias_entry["id"] = route.model_alias | |
| alias_entry["name"] = route.model_alias |
| route=route, | ||
| ) | ||
| if alias_models: | ||
| updated["BENCHFLOW_PROVIDER_MODELS"] = alias_models |
There was a problem hiding this comment.
Silent failure when no entry matches
If _provider_models_for_proxy_alias returns None (e.g., BENCHFLOW_PROVIDER_MODELS is set but none of its entries have an id matching any variant in wanted), updated["BENCHFLOW_PROVIDER_MODELS"] is left unchanged — keeping the original entry list that does not contain the alias ID. Pi-acp would then receive a model list without the proxied alias, causing the same lookup failure this PR aims to fix, with no log to indicate the miss. Adding a debug/warning log on the None return paths in _provider_models_for_proxy_alias would make this much easier to diagnose in production.
| assert provider_runtime is not None | ||
| assert updated["BENCHFLOW_PROVIDER_MODEL"] == "benchflow-vllm-Qwen-Qwen3-4B" | ||
| models = json.loads(updated["BENCHFLOW_PROVIDER_MODELS"]) | ||
| alias = next(m for m in models if m["id"] == "benchflow-vllm-Qwen-Qwen3-4B") | ||
| assert alias["maxTokens"] == 1024 | ||
| assert alias["contextWindow"] == 16384 |
There was a problem hiding this comment.
name field of alias entry is not asserted
The test confirms maxTokens and contextWindow are preserved on the alias entry, but does not assert alias["name"]. Because setdefault is a no-op when "name" already exists in the source entry, the alias entry currently carries name = "Qwen/Qwen3-4B" rather than the alias "benchflow-vllm-Qwen-Qwen3-4B". Adding assert alias["name"] == "benchflow-vllm-Qwen-Qwen3-4B" would lock in the expected behaviour and catch a regression if the name handling changes.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Integration final reviewFinal verdict: not mergeable Verdictnot mergeable Blockers
Coverage
Slots: healthy=8, unhealthy=2, missing=0, duplicate=0, stale=0 (planned=10) Evidence
Parity:
Residual risk
Required reruns
|
|
Automation update (2026-06-19): pushed What changed:
Local verification after the patch:
Current GitHub checks: normal |
|
Follow-up (2026-06-19): the pending Docker |
…ositive, codex robustness (#814) * fix(integration): calibrate L3 gate — slot matching, V-TAMPER, codex robustness Three defects made the L3 final-review gate systematically return "not mergeable" for reasons unrelated to the PR under review (found running the gate end-to-end on #794/#803/#813): 1. Slot mis-attribution (duplicate/missing slots). The review-pack matcher mapped rollouts to planned cells by fuzzy dims, ignoring network_mode, so a cell and its `-allowlist` network variant collided into one slot (one "duplicate", one "missing"); in-place agent retries also each counted as a separate rollout. Now attribute each rollout by its cell-id directory (the run-matrix writes <cell.id>/<ts>/<task>__<hash>), and collapse same-job retries to the latest attempt. Genuinely double-scheduled cells still surface as duplicates. 2. V-TAMPER false-positive on OpenHands. The native-ACP execute scanner matched the whole title; OpenHands writes "<description>: $ <command>", and prose like "Verify the output" collided with the `verif` token, falsely flagging benign cleanup/verification as verifier tampering. Scan only the command after "$ "; a real tamper command is still caught. 3. Codex "unavailable" flake. The reviewer had to shell out (bubblewrap sandbox) to read the review pack from disk; a transient bwrap loopback failure killed every read -> no verdict -> false fail-closed. Inline the review-pack files into the prompt so no sandboxed read is needed, plus a bounded retry on transient (sandbox/network/rate-limit) output. A dead codex still fails closed. Adds regression tests for all three. * style(integration): ruff format the new gate-calibration tests * fix(integration): tighten greptile P2s on the L3 gate fixes - codex transient markers: drop the bare '429' (matched 'line 429'); use the 'too many requests' reason phrase + 'http 429' instead. - _attribute_rollout: only scan path parts BELOW the artifacts root, so OS-level segments can never coincidentally match a cell id. - _started_at: fall back to finished_at so retry-collapse stays chronological when started_at is absent. * fix(deps): bump pydantic-settings 2.14.1 -> 2.14.2 (GHSA-4xgf-cpjx-pc3j) A new advisory (GHSA-4xgf-cpjx-pc3j, pydantic-settings 2.14.2, published 2026-06-19) started failing pip-audit repo-wide on every PR and main — pydantic-settings is a transitive dep via litellm/mcp. Surgically bump the locked version to the fixed 2.14.2, preserving the revision-1 lockfile format (no full reformat). `uv sync --locked` accepts the lock; pip-audit is clean. --------- Co-authored-by: symphony-bot <symphony@benchflow.ai>
Summary
Verification