Preserve pi-acp model metadata through LiteLLM proxy by bingran-you · Pull Request #803 · benchflow-ai/benchflow

bingran-you · 2026-06-18T07:14:09Z

Summary

copy BENCHFLOW_PROVIDER_MODELS metadata onto the LiteLLM proxy alias for pi-acp
preserve maxTokens/contextWindow when pi-acp is routed through usage-tracked LiteLLM
add a regression test for vLLM/Qwen metadata aliasing

Verification

uv run python -m py_compile src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py
uv run pytest tests/test_litellm_runtime.py -q (blocked locally: missing acp package in fresh uv env)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eeec75bb7d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-18T07:17:42Z


+@pytest.mark.asyncio
+async def test_pi_acp_proxy_preserves_provider_model_metadata(monkeypatch):
+    """Pi sees the LiteLLM alias, so metadata must be copied to that id."""


Name the regression guard in the test docstring

This new test is the regression coverage for the Pi/LiteLLM aliasing fix, but its docstring does not identify the PR or commit it guards. The root AGENTS.md explicitly requires regression tests to name the PR/commit they guard, so please include this commit/PR identifier in the docstring to preserve the maintenance context.

Useful? React with 👍 / 👎.

greptile-apps · 2026-06-18T07:18:26Z

Greptile Summary

This PR ensures that model metadata (maxTokens, contextWindow, etc.) carried in BENCHFLOW_PROVIDER_MODELS is mirrored onto the LiteLLM proxy alias that pi-acp sees, preventing a mismatch where the agent's model list doesn't contain an entry for the aliased model ID.

Adds _provider_models_for_proxy_alias which finds the source entry by matching several variants of the requested/upstream model ID, clones it, and appends a new alias entry with id and name both set to route.model_alias.
Calls the new function inside the pi-acp branch of _apply_litellm_agent_env, updating BENCHFLOW_PROVIDER_MODELS only when a match is found.
Adds a regression test that verifies maxTokens, contextWindow, and name are preserved on the alias entry for a vLLM/Qwen routing scenario.

Confidence Score: 5/5

Safe to merge — the change is additive (new helper + one branch update) and the test covers the aliasing path end-to-end.

The implementation correctly appends the alias entry without mutating the original list, both id and name are directly assigned (not using setdefault), and the deduplication guard prevents double-appending on repeated calls. The new test exercises the full pi-acp proxy env-rewrite path including the metadata preservation.

The test in tests/test_litellm_runtime.py doesn't assert the original model entry is retained alongside the alias, which is a minor gap in coverage.

Important Files Changed

Filename	Overview
src/benchflow/providers/litellm_runtime.py	Adds _provider_models_for_proxy_alias and _provider_model_id helpers, and calls the former in the pi-acp agent branch; logic is correct — direct assignment of id/name ensures alias entry is consistent, and deduplication check prevents double-appending the alias.
tests/test_litellm_runtime.py	New regression test confirms alias name, maxTokens, and contextWindow are preserved; does not assert that the original source entry is also retained in the merged list, leaving a potential regression undetected if the append logic changes to a replace.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ensure_litellm_runtime (pi-acp, usage_tracking=required)"] --> B["_apply_litellm_agent_env"]
    B --> C["Set BENCHFLOW_PROVIDER_NAME/BASE_URL/API_KEY/MODEL/PROTOCOL"]
    C --> D["_provider_models_for_proxy_alias(raw=BENCHFLOW_PROVIDER_MODELS, route)"]
    D --> E{raw present and valid JSON list?}
    E -- No --> F["return None"]
    E -- Yes --> G["Build wanted set: requested_model + upstream_model + strip_provider_prefix variants"]
    G --> H{Any entry.id in wanted?}
    H -- No --> I["return None"]
    H -- Yes --> J["alias_entry = copy of matched entry, alias_entry id = name = model_alias"]
    J --> K{model_alias already in list?}
    K -- Yes --> L["return original list as JSON"]
    K -- No --> M["append alias_entry, return merged list as JSON"]
    F --> N{alias_models is truthy?}
    I --> N
    L --> N
    M --> N
    N -- Yes --> O["updated[BENCHFLOW_PROVIDER_MODELS] = alias_models"]
    N -- No --> P["BENCHFLOW_PROVIDER_MODELS unchanged"]
    O --> Q["Return updated env to agent"]
    P --> Q

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A["ensure_litellm_runtime (pi-acp, usage_tracking=required)"] --> B["_apply_litellm_agent_env"]
    B --> C["Set BENCHFLOW_PROVIDER_NAME/BASE_URL/API_KEY/MODEL/PROTOCOL"]
    C --> D["_provider_models_for_proxy_alias(raw=BENCHFLOW_PROVIDER_MODELS, route)"]
    D --> E{raw present and valid JSON list?}
    E -- No --> F["return None"]
    E -- Yes --> G["Build wanted set: requested_model + upstream_model + strip_provider_prefix variants"]
    G --> H{Any entry.id in wanted?}
    H -- No --> I["return None"]
    H -- Yes --> J["alias_entry = copy of matched entry, alias_entry id = name = model_alias"]
    J --> K{model_alias already in list?}
    K -- Yes --> L["return original list as JSON"]
    K -- No --> M["append alias_entry, return merged list as JSON"]
    F --> N{alias_models is truthy?}
    I --> N
    L --> N
    M --> N
    N -- Yes --> O["updated[BENCHFLOW_PROVIDER_MODELS] = alias_models"]
    N -- No --> P["BENCHFLOW_PROVIDER_MODELS unchanged"]
    O --> Q["Return updated env to agent"]
    P --> Q

_{Reviews (2): Last reviewed commit: "Fix pi-acp proxy metadata typing" | Re-trigger Greptile}

greptile-apps · 2026-06-18T07:18:34Z

+        alias_entry = dict(entry)
+        alias_entry["id"] = route.model_alias
+        alias_entry.setdefault("name", route.model_alias)


When the original entry already has a "name" key — as in the test fixture where "name": "Qwen/Qwen3-4B" is present — setdefault is a no-op, so the alias entry ends up with id = "benchflow-vllm-Qwen-Qwen3-4B" but name = "Qwen/Qwen3-4B". If pi-acp uses name for display or any secondary lookup this creates a stale/inconsistent value. Using a direct assignment ensures the alias entry's name always reflects the alias ID.

Suggested change

alias_entry = dict(entry)

alias_entry["id"] = route.model_alias

alias_entry.setdefault("name", route.model_alias)

alias_entry = dict(entry)

alias_entry["id"] = route.model_alias

alias_entry["name"] = route.model_alias

greptile-apps · 2026-06-18T07:18:35Z

+            route=route,
+        )
+        if alias_models:
+            updated["BENCHFLOW_PROVIDER_MODELS"] = alias_models


Silent failure when no entry matches

If _provider_models_for_proxy_alias returns None (e.g., BENCHFLOW_PROVIDER_MODELS is set but none of its entries have an id matching any variant in wanted), updated["BENCHFLOW_PROVIDER_MODELS"] is left unchanged — keeping the original entry list that does not contain the alias ID. Pi-acp would then receive a model list without the proxied alias, causing the same lookup failure this PR aims to fix, with no log to indicate the miss. Adding a debug/warning log on the None return paths in _provider_models_for_proxy_alias would make this much easier to diagnose in production.

greptile-apps · 2026-06-18T07:18:36Z

+    assert provider_runtime is not None
+    assert updated["BENCHFLOW_PROVIDER_MODEL"] == "benchflow-vllm-Qwen-Qwen3-4B"
+    models = json.loads(updated["BENCHFLOW_PROVIDER_MODELS"])
+    alias = next(m for m in models if m["id"] == "benchflow-vllm-Qwen-Qwen3-4B")
+    assert alias["maxTokens"] == 1024
+    assert alias["contextWindow"] == 16384


name field of alias entry is not asserted

The test confirms maxTokens and contextWindow are preserved on the alias entry, but does not assert alias["name"]. Because setdefault is a no-op when "name" already exists in the source entry, the alias entry currently carries name = "Qwen/Qwen3-4B" rather than the alias "benchflow-vllm-Qwen-Qwen3-4B". Adding assert alias["name"] == "benchflow-vllm-Qwen-Qwen3-4B" would lock in the expected behaviour and catch a regression if the name handling changes.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

github-actions · 2026-06-18T23:45:32Z

Integration final review

Final verdict: not mergeable

Verdict

not mergeable

Blockers

unhealthy slot: data-to-d3-docker-no-skill-openhands (deterministic reject: ['V-TAMPER'])
unhealthy slot: weighted-gdp-calc-docker-no-skill-openhands (deterministic reject: ['V-TAMPER'])

Coverage

cell	task	agent	sandbox	skill_mode	status	detail
jax-computing-basics-docker-no-skill-openhands	jax-computing-basics	openhands	docker	no-skill	healthy	all gates green
jax-computing-basics-docker-no-skill-pi-acp	jax-computing-basics	pi-acp	docker	no-skill	healthy	all gates green
jax-computing-basics-docker-no-skill-opencode	jax-computing-basics	opencode	docker	no-skill	healthy	all gates green
data-to-d3-docker-no-skill-openhands	data-to-d3	openhands	docker	no-skill	unhealthy	deterministic reject: ['V-TAMPER']
data-to-d3-docker-no-skill-pi-acp	data-to-d3	pi-acp	docker	no-skill	healthy	all gates green
data-to-d3-docker-no-skill-opencode	data-to-d3	opencode	docker	no-skill	healthy	all gates green
weighted-gdp-calc-docker-no-skill-openhands	weighted-gdp-calc	openhands	docker	no-skill	unhealthy	deterministic reject: ['V-TAMPER']
weighted-gdp-calc-docker-no-skill-pi-acp	weighted-gdp-calc	pi-acp	docker	no-skill	healthy	all gates green
weighted-gdp-calc-docker-no-skill-opencode	weighted-gdp-calc	opencode	docker	no-skill	healthy	all gates green
citation-check-docker-no-skill-openhands-allowlist	citation-check	openhands	docker	no-skill	healthy	all gates green

Slots: healthy=8, unhealthy=2, missing=0, duplicate=0, stale=0 (planned=10)

Evidence

jax-computing-basics-docker-no-skill-openhands (healthy)
- root: jobs/integration-final/jax-computing-basics-docker-no-skill-openhands/2026-06-19__02-16-57/jax-computing-basics__69552840
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-openhands/2026-06-19__02-16-57/jax-computing-basics__69552840 --json
jax-computing-basics-docker-no-skill-pi-acp (healthy)
- root: jobs/integration-final/jax-computing-basics-docker-no-skill-pi-acp/2026-06-19__02-16-57/jax-computing-basics__74dd6805
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-pi-acp/2026-06-19__02-16-57/jax-computing-basics__74dd6805 --json
jax-computing-basics-docker-no-skill-opencode (healthy)
- root: jobs/integration-final/jax-computing-basics-docker-no-skill-opencode/2026-06-19__02-16-57/jax-computing-basics__ed6098c5
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-opencode/2026-06-19__02-16-57/jax-computing-basics__ed6098c5 --json
data-to-d3-docker-no-skill-openhands (unhealthy)
- root: jobs/integration-final/data-to-d3-docker-no-skill-openhands/2026-06-19__02-16-59/data-to-d3__73173ee7
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=fail, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-openhands/2026-06-19__02-16-59/data-to-d3__73173ee7 --json
data-to-d3-docker-no-skill-pi-acp (healthy)
- root: jobs/integration-final/data-to-d3-docker-no-skill-pi-acp/2026-06-19__02-17-44/data-to-d3__8cb1b2db
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-pi-acp/2026-06-19__02-17-44/data-to-d3__8cb1b2db --json
data-to-d3-docker-no-skill-opencode (healthy)
- root: jobs/integration-final/data-to-d3-docker-no-skill-opencode/2026-06-19__02-17-01/data-to-d3__6ff11a4d
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-opencode/2026-06-19__02-17-01/data-to-d3__6ff11a4d --json
weighted-gdp-calc-docker-no-skill-openhands (unhealthy)
- root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-openhands/2026-06-19__02-16-59/weighted-gdp-calc__e8121277
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=fail, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-openhands/2026-06-19__02-16-59/weighted-gdp-calc__e8121277 --json
weighted-gdp-calc-docker-no-skill-pi-acp (healthy)
- root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-pi-acp/2026-06-19__02-17-00/weighted-gdp-calc__18f9642a
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-pi-acp/2026-06-19__02-17-00/weighted-gdp-calc__18f9642a --json
weighted-gdp-calc-docker-no-skill-opencode (healthy)
- root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-opencode/2026-06-19__02-17-00/weighted-gdp-calc__72b8e555
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-opencode/2026-06-19__02-17-00/weighted-gdp-calc__72b8e555 --json
citation-check-docker-no-skill-openhands-allowlist (healthy)
- root: jobs/integration-final/citation-check-docker-no-skill-openhands-allowlist/2026-06-19__02-16-57/citation-check__f2b934ae
- gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
- rerun: python tests/integration/rubric_checks.py jobs/integration-final/citation-check-docker-no-skill-openhands-allowlist/2026-06-19__02-16-57/citation-check__f2b934ae --json

Parity:

pinned-baseline pinned-baseline: fail — pinned-baseline parity FAIL: harbor: baseline git HEAD 3daf698 does not match pinned ref 2d86fe82f6a06f7c7b3a22a3ae90d554d0e9655c; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-7a4c6d6d-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-9c44b8b1-0003/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-aeadd837-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-5fd3682f-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-b86fffc1-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-ba03ab11-0005/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-missing-capped5-dbf6c36392/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-6cee37920c/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-9737e2a3b8/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: no matching result.json files under /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash; harbor: missing baseline result for citation-check; harbor: missing baseline result for data-to-d3; harbor: missing baseline result for jax-computing-basics; harbor: missing baseline result for weighted-gdp-calc; no overlapping SkillsBench tasks to compare

Residual risk

QUARANTINE: parity pinned-baseline (advisory — gate needs native-baseline mode): pinned-baseline — pinned-baseline parity FAIL: harbor: baseline git HEAD 3daf698 does not match pinned ref 2d86fe82f6a06f7c7b3a22a3ae90d554d0e9655c; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-7a4c6d6d-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-9c44b8b1-0003/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-aeadd837-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-5fd3682f-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-b86fffc1-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-ba03ab11-0005/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-missing-capped5-dbf6c36392/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-6cee37920c/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-9737e2a3b8/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: no matching result.json files under /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash; harbor: missing baseline result for citation-check; harbor: missing baseline result for data-to-d3; harbor: missing baseline result for jax-computing-basics; harbor: missing baseline result for weighted-gdp-calc; no overlapping SkillsBench tasks to compare
residual (from plan): network lane carries network_mode as EXPECTED only; the allowlist variant is not passed to bench (no --network flag exists)
V-LIFECYCLE / V-ENVHARDEN / V-REWARDHACK: codex/residual review (never faked deterministically)

Required reruns

rerun cell: data-to-d3-docker-no-skill-openhands
rerun cell: weighted-gdp-calc-docker-no-skill-openhands

bingran-you · 2026-06-19T05:43:53Z

Automation update (2026-06-19): pushed 212d29b6 to fix the stale ty check failure on eeec75bb.

What changed:

Typed the provider-model metadata lookup so uv run ty check src passes again.
Set the copied alias entry's name to the LiteLLM proxy alias as well as id, and added a regression assertion.
Updated the regression-test docstring to name PR Preserve pi-acp model metadata through LiteLLM proxy #803 per repo convention.

Local verification after the patch:

uv sync --extra dev --extra sandbox-daytona --locked
uv run pytest tests/test_litellm_runtime.py -q -> 20 passed
uv run ruff check src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py -> pass
uv run ruff format --check src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py -> pass
uv run ty check src -> pass

Current GitHub checks: normal test and pip-audit are green on the new head; the Docker rollout-smoke is still pending, so I am not marking this merge-ready yet. The older final-review V-TAMPER report was on the previous head and should be treated as stale until current-head review completes.

bingran-you · 2026-06-19T05:45:44Z

Follow-up (2026-06-19): the pending Docker rollout-smoke on head 212d29b6 has now passed as well. Current-head checks are green for test, pip-audit, detect-scope, and rollout-smoke. Holding under review:pending for non-author human review before merge.

…ositive, codex robustness (#814) * fix(integration): calibrate L3 gate — slot matching, V-TAMPER, codex robustness Three defects made the L3 final-review gate systematically return "not mergeable" for reasons unrelated to the PR under review (found running the gate end-to-end on #794/#803/#813): 1. Slot mis-attribution (duplicate/missing slots). The review-pack matcher mapped rollouts to planned cells by fuzzy dims, ignoring network_mode, so a cell and its `-allowlist` network variant collided into one slot (one "duplicate", one "missing"); in-place agent retries also each counted as a separate rollout. Now attribute each rollout by its cell-id directory (the run-matrix writes <cell.id>/<ts>/<task>__<hash>), and collapse same-job retries to the latest attempt. Genuinely double-scheduled cells still surface as duplicates. 2. V-TAMPER false-positive on OpenHands. The native-ACP execute scanner matched the whole title; OpenHands writes "<description>: $ <command>", and prose like "Verify the output" collided with the `verif` token, falsely flagging benign cleanup/verification as verifier tampering. Scan only the command after "$ "; a real tamper command is still caught. 3. Codex "unavailable" flake. The reviewer had to shell out (bubblewrap sandbox) to read the review pack from disk; a transient bwrap loopback failure killed every read -> no verdict -> false fail-closed. Inline the review-pack files into the prompt so no sandboxed read is needed, plus a bounded retry on transient (sandbox/network/rate-limit) output. A dead codex still fails closed. Adds regression tests for all three. * style(integration): ruff format the new gate-calibration tests * fix(integration): tighten greptile P2s on the L3 gate fixes - codex transient markers: drop the bare '429' (matched 'line 429'); use the 'too many requests' reason phrase + 'http 429' instead. - _attribute_rollout: only scan path parts BELOW the artifacts root, so OS-level segments can never coincidentally match a cell id. - _started_at: fall back to finished_at so retry-collapse stays chronological when started_at is absent. * fix(deps): bump pydantic-settings 2.14.1 -> 2.14.2 (GHSA-4xgf-cpjx-pc3j) A new advisory (GHSA-4xgf-cpjx-pc3j, pydantic-settings 2.14.2, published 2026-06-19) started failing pip-audit repo-wide on every PR and main — pydantic-settings is a transitive dep via litellm/mcp. Surgically bump the locked version to the fixed 2.14.2, preserving the revision-1 lockfile format (no full reformat). `uv sync --locked` accepts the lock; pip-audit is clean. --------- Co-authored-by: symphony-bot <symphony@benchflow.ai>

Preserve pi-acp metadata through LiteLLM proxy

eeec75b

bingran-you temporarily deployed to pypi-internal-preview June 18, 2026 07:14 — with GitHub Actions Inactive

chatgpt-codex-connector Bot reviewed Jun 18, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 18, 2026

View reviewed changes

This was referenced Jun 18, 2026

fix(integration): repair the full L0–L3 workflow + ready-to-merge codex auto-trigger #806

Merged

fix(integration): add uv.lock to plan-job sparse-checkout (setup-uv cache) #807

Merged

Fix pi-acp proxy metadata typing

212d29b

bingran-you temporarily deployed to pypi-internal-preview June 19, 2026 05:37 — with GitHub Actions Inactive

Yiminnn mentioned this pull request Jun 19, 2026

fix(integration): calibrate L3 gate — slot matching, V-TAMPER false-positive, codex robustness #814

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preserve pi-acp model metadata through LiteLLM proxy#803

Preserve pi-acp model metadata through LiteLLM proxy#803
bingran-you wants to merge 2 commits into
mainfrom
bry/pi-acp-proxy-model-metadata

bingran-you commented Jun 18, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Uh oh!

greptile-apps Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot Jun 18, 2026

Uh oh!

greptile-apps Bot Jun 18, 2026

Uh oh!

greptile-apps Bot Jun 18, 2026

Uh oh!

github-actions Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

bingran-you commented Jun 19, 2026

Uh oh!

bingran-you commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bingran-you commented Jun 18, 2026

Summary

Verification

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Integration final review

Verdict

Blockers

Coverage

Evidence

Residual risk

Required reruns

Uh oh!

bingran-you commented Jun 19, 2026

Uh oh!

bingran-you commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Jun 18, 2026 •

edited

Loading

github-actions Bot commented Jun 18, 2026 •

edited

Loading