Skip to content

Preserve pi-acp model metadata through LiteLLM proxy#803

Open
bingran-you wants to merge 2 commits into
mainfrom
bry/pi-acp-proxy-model-metadata
Open

Preserve pi-acp model metadata through LiteLLM proxy#803
bingran-you wants to merge 2 commits into
mainfrom
bry/pi-acp-proxy-model-metadata

Conversation

@bingran-you

Copy link
Copy Markdown
Collaborator

Summary

  • copy BENCHFLOW_PROVIDER_MODELS metadata onto the LiteLLM proxy alias for pi-acp
  • preserve maxTokens/contextWindow when pi-acp is routed through usage-tracked LiteLLM
  • add a regression test for vLLM/Qwen metadata aliasing

Verification

  • uv run python -m py_compile src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py
  • uv run pytest tests/test_litellm_runtime.py -q (blocked locally: missing acp package in fresh uv env)

@bingran-you bingran-you temporarily deployed to pypi-internal-preview June 18, 2026 07:14 — with GitHub Actions Inactive

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eeec75bb7d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/test_litellm_runtime.py Outdated

@pytest.mark.asyncio
async def test_pi_acp_proxy_preserves_provider_model_metadata(monkeypatch):
"""Pi sees the LiteLLM alias, so metadata must be copied to that id."""

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Name the regression guard in the test docstring

This new test is the regression coverage for the Pi/LiteLLM aliasing fix, but its docstring does not identify the PR or commit it guards. The root AGENTS.md explicitly requires regression tests to name the PR/commit they guard, so please include this commit/PR identifier in the docstring to preserve the maintenance context.

Useful? React with 👍 / 👎.

@greptile-apps

greptile-apps Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR ensures that model metadata (maxTokens, contextWindow, etc.) carried in BENCHFLOW_PROVIDER_MODELS is mirrored onto the LiteLLM proxy alias that pi-acp sees, preventing a mismatch where the agent's model list doesn't contain an entry for the aliased model ID.

  • Adds _provider_models_for_proxy_alias which finds the source entry by matching several variants of the requested/upstream model ID, clones it, and appends a new alias entry with id and name both set to route.model_alias.
  • Calls the new function inside the pi-acp branch of _apply_litellm_agent_env, updating BENCHFLOW_PROVIDER_MODELS only when a match is found.
  • Adds a regression test that verifies maxTokens, contextWindow, and name are preserved on the alias entry for a vLLM/Qwen routing scenario.

Confidence Score: 5/5

Safe to merge — the change is additive (new helper + one branch update) and the test covers the aliasing path end-to-end.

The implementation correctly appends the alias entry without mutating the original list, both id and name are directly assigned (not using setdefault), and the deduplication guard prevents double-appending on repeated calls. The new test exercises the full pi-acp proxy env-rewrite path including the metadata preservation.

The test in tests/test_litellm_runtime.py doesn't assert the original model entry is retained alongside the alias, which is a minor gap in coverage.

Important Files Changed

Filename Overview
src/benchflow/providers/litellm_runtime.py Adds _provider_models_for_proxy_alias and _provider_model_id helpers, and calls the former in the pi-acp agent branch; logic is correct — direct assignment of id/name ensures alias entry is consistent, and deduplication check prevents double-appending the alias.
tests/test_litellm_runtime.py New regression test confirms alias name, maxTokens, and contextWindow are preserved; does not assert that the original source entry is also retained in the merged list, leaving a potential regression undetected if the append logic changes to a replace.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["ensure_litellm_runtime (pi-acp, usage_tracking=required)"] --> B["_apply_litellm_agent_env"]
    B --> C["Set BENCHFLOW_PROVIDER_NAME/BASE_URL/API_KEY/MODEL/PROTOCOL"]
    C --> D["_provider_models_for_proxy_alias(raw=BENCHFLOW_PROVIDER_MODELS, route)"]
    D --> E{raw present and valid JSON list?}
    E -- No --> F["return None"]
    E -- Yes --> G["Build wanted set: requested_model + upstream_model + strip_provider_prefix variants"]
    G --> H{Any entry.id in wanted?}
    H -- No --> I["return None"]
    H -- Yes --> J["alias_entry = copy of matched entry, alias_entry id = name = model_alias"]
    J --> K{model_alias already in list?}
    K -- Yes --> L["return original list as JSON"]
    K -- No --> M["append alias_entry, return merged list as JSON"]
    F --> N{alias_models is truthy?}
    I --> N
    L --> N
    M --> N
    N -- Yes --> O["updated[BENCHFLOW_PROVIDER_MODELS] = alias_models"]
    N -- No --> P["BENCHFLOW_PROVIDER_MODELS unchanged"]
    O --> Q["Return updated env to agent"]
    P --> Q
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A["ensure_litellm_runtime (pi-acp, usage_tracking=required)"] --> B["_apply_litellm_agent_env"]
    B --> C["Set BENCHFLOW_PROVIDER_NAME/BASE_URL/API_KEY/MODEL/PROTOCOL"]
    C --> D["_provider_models_for_proxy_alias(raw=BENCHFLOW_PROVIDER_MODELS, route)"]
    D --> E{raw present and valid JSON list?}
    E -- No --> F["return None"]
    E -- Yes --> G["Build wanted set: requested_model + upstream_model + strip_provider_prefix variants"]
    G --> H{Any entry.id in wanted?}
    H -- No --> I["return None"]
    H -- Yes --> J["alias_entry = copy of matched entry, alias_entry id = name = model_alias"]
    J --> K{model_alias already in list?}
    K -- Yes --> L["return original list as JSON"]
    K -- No --> M["append alias_entry, return merged list as JSON"]
    F --> N{alias_models is truthy?}
    I --> N
    L --> N
    M --> N
    N -- Yes --> O["updated[BENCHFLOW_PROVIDER_MODELS] = alias_models"]
    N -- No --> P["BENCHFLOW_PROVIDER_MODELS unchanged"]
    O --> Q["Return updated env to agent"]
    P --> Q
Loading

Reviews (2): Last reviewed commit: "Fix pi-acp proxy metadata typing" | Re-trigger Greptile

Comment on lines +893 to +895
alias_entry = dict(entry)
alias_entry["id"] = route.model_alias
alias_entry.setdefault("name", route.model_alias)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 When the original entry already has a "name" key — as in the test fixture where "name": "Qwen/Qwen3-4B" is present — setdefault is a no-op, so the alias entry ends up with id = "benchflow-vllm-Qwen-Qwen3-4B" but name = "Qwen/Qwen3-4B". If pi-acp uses name for display or any secondary lookup this creates a stale/inconsistent value. Using a direct assignment ensures the alias entry's name always reflects the alias ID.

Suggested change
alias_entry = dict(entry)
alias_entry["id"] = route.model_alias
alias_entry.setdefault("name", route.model_alias)
alias_entry = dict(entry)
alias_entry["id"] = route.model_alias
alias_entry["name"] = route.model_alias

Comment on lines +976 to +979
route=route,
)
if alias_models:
updated["BENCHFLOW_PROVIDER_MODELS"] = alias_models

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent failure when no entry matches

If _provider_models_for_proxy_alias returns None (e.g., BENCHFLOW_PROVIDER_MODELS is set but none of its entries have an id matching any variant in wanted), updated["BENCHFLOW_PROVIDER_MODELS"] is left unchanged — keeping the original entry list that does not contain the alias ID. Pi-acp would then receive a model list without the proxied alias, causing the same lookup failure this PR aims to fix, with no log to indicate the miss. Adding a debug/warning log on the None return paths in _provider_models_for_proxy_alias would make this much easier to diagnose in production.

Comment on lines +206 to +211
assert provider_runtime is not None
assert updated["BENCHFLOW_PROVIDER_MODEL"] == "benchflow-vllm-Qwen-Qwen3-4B"
models = json.loads(updated["BENCHFLOW_PROVIDER_MODELS"])
alias = next(m for m in models if m["id"] == "benchflow-vllm-Qwen-Qwen3-4B")
assert alias["maxTokens"] == 1024
assert alias["contextWindow"] == 16384

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 name field of alias entry is not asserted

The test confirms maxTokens and contextWindow are preserved on the alias entry, but does not assert alias["name"]. Because setdefault is a no-op when "name" already exists in the source entry, the alias entry currently carries name = "Qwen/Qwen3-4B" rather than the alias "benchflow-vllm-Qwen-Qwen3-4B". Adding assert alias["name"] == "benchflow-vllm-Qwen-Qwen3-4B" would lock in the expected behaviour and catch a regression if the name handling changes.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown

Integration final review

Final verdict: not mergeable

Verdict

not mergeable

Blockers

  • unhealthy slot: data-to-d3-docker-no-skill-openhands (deterministic reject: ['V-TAMPER'])
  • unhealthy slot: weighted-gdp-calc-docker-no-skill-openhands (deterministic reject: ['V-TAMPER'])

Coverage

cell task agent sandbox skill_mode status detail
jax-computing-basics-docker-no-skill-openhands jax-computing-basics openhands docker no-skill healthy all gates green
jax-computing-basics-docker-no-skill-pi-acp jax-computing-basics pi-acp docker no-skill healthy all gates green
jax-computing-basics-docker-no-skill-opencode jax-computing-basics opencode docker no-skill healthy all gates green
data-to-d3-docker-no-skill-openhands data-to-d3 openhands docker no-skill unhealthy deterministic reject: ['V-TAMPER']
data-to-d3-docker-no-skill-pi-acp data-to-d3 pi-acp docker no-skill healthy all gates green
data-to-d3-docker-no-skill-opencode data-to-d3 opencode docker no-skill healthy all gates green
weighted-gdp-calc-docker-no-skill-openhands weighted-gdp-calc openhands docker no-skill unhealthy deterministic reject: ['V-TAMPER']
weighted-gdp-calc-docker-no-skill-pi-acp weighted-gdp-calc pi-acp docker no-skill healthy all gates green
weighted-gdp-calc-docker-no-skill-opencode weighted-gdp-calc opencode docker no-skill healthy all gates green
citation-check-docker-no-skill-openhands-allowlist citation-check openhands docker no-skill healthy all gates green

Slots: healthy=8, unhealthy=2, missing=0, duplicate=0, stale=0 (planned=10)

Evidence

  • jax-computing-basics-docker-no-skill-openhands (healthy)
    • root: jobs/integration-final/jax-computing-basics-docker-no-skill-openhands/2026-06-19__02-16-57/jax-computing-basics__69552840
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-openhands/2026-06-19__02-16-57/jax-computing-basics__69552840 --json
  • jax-computing-basics-docker-no-skill-pi-acp (healthy)
    • root: jobs/integration-final/jax-computing-basics-docker-no-skill-pi-acp/2026-06-19__02-16-57/jax-computing-basics__74dd6805
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-pi-acp/2026-06-19__02-16-57/jax-computing-basics__74dd6805 --json
  • jax-computing-basics-docker-no-skill-opencode (healthy)
    • root: jobs/integration-final/jax-computing-basics-docker-no-skill-opencode/2026-06-19__02-16-57/jax-computing-basics__ed6098c5
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/jax-computing-basics-docker-no-skill-opencode/2026-06-19__02-16-57/jax-computing-basics__ed6098c5 --json
  • data-to-d3-docker-no-skill-openhands (unhealthy)
    • root: jobs/integration-final/data-to-d3-docker-no-skill-openhands/2026-06-19__02-16-59/data-to-d3__73173ee7
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=fail, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-openhands/2026-06-19__02-16-59/data-to-d3__73173ee7 --json
  • data-to-d3-docker-no-skill-pi-acp (healthy)
    • root: jobs/integration-final/data-to-d3-docker-no-skill-pi-acp/2026-06-19__02-17-44/data-to-d3__8cb1b2db
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-pi-acp/2026-06-19__02-17-44/data-to-d3__8cb1b2db --json
  • data-to-d3-docker-no-skill-opencode (healthy)
    • root: jobs/integration-final/data-to-d3-docker-no-skill-opencode/2026-06-19__02-17-01/data-to-d3__6ff11a4d
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/data-to-d3-docker-no-skill-opencode/2026-06-19__02-17-01/data-to-d3__6ff11a4d --json
  • weighted-gdp-calc-docker-no-skill-openhands (unhealthy)
    • root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-openhands/2026-06-19__02-16-59/weighted-gdp-calc__e8121277
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=fail, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-openhands/2026-06-19__02-16-59/weighted-gdp-calc__e8121277 --json
  • weighted-gdp-calc-docker-no-skill-pi-acp (healthy)
    • root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-pi-acp/2026-06-19__02-17-00/weighted-gdp-calc__18f9642a
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-pi-acp/2026-06-19__02-17-00/weighted-gdp-calc__18f9642a --json
  • weighted-gdp-calc-docker-no-skill-opencode (healthy)
    • root: jobs/integration-final/weighted-gdp-calc-docker-no-skill-opencode/2026-06-19__02-17-00/weighted-gdp-calc__72b8e555
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=pass, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/weighted-gdp-calc-docker-no-skill-opencode/2026-06-19__02-17-00/weighted-gdp-calc__72b8e555 --json
  • citation-check-docker-no-skill-openhands-allowlist (healthy)
    • root: jobs/integration-final/citation-check-docker-no-skill-openhands-allowlist/2026-06-19__02-16-57/citation-check__f2b934ae
    • gates: R-REAL=pass, R-OUTCOME=pass, R-ARTIFACT=pass, R-TELEMETRY=pass, S-NOSKILL=na, S-WITHSKILL=na, V-TAMPER=pass, C-ATTRIB=na, V-LIFECYCLE=na, V-ENVHARDEN=na, V-REWARDHACK=na
    • rerun: python tests/integration/rubric_checks.py jobs/integration-final/citation-check-docker-no-skill-openhands-allowlist/2026-06-19__02-16-57/citation-check__f2b934ae --json

Parity:

  • pinned-baseline pinned-baseline: fail — pinned-baseline parity FAIL: harbor: baseline git HEAD 3daf698 does not match pinned ref 2d86fe82f6a06f7c7b3a22a3ae90d554d0e9655c; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-7a4c6d6d-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-9c44b8b1-0003/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-aeadd837-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-5fd3682f-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-b86fffc1-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-ba03ab11-0005/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-missing-capped5-dbf6c36392/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-6cee37920c/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-9737e2a3b8/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: no matching result.json files under /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash; harbor: missing baseline result for citation-check; harbor: missing baseline result for data-to-d3; harbor: missing baseline result for jax-computing-basics; harbor: missing baseline result for weighted-gdp-calc; no overlapping SkillsBench tasks to compare

Residual risk

  • QUARANTINE: parity pinned-baseline (advisory — gate needs native-baseline mode): pinned-baseline — pinned-baseline parity FAIL: harbor: baseline git HEAD 3daf698 does not match pinned ref 2d86fe82f6a06f7c7b3a22a3ae90d554d0e9655c; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-7a4c6d6d-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-9c44b8b1-0003/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/citation-check__pr2-fill5-c10-noskills-aeadd837-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-5fd3682f-0002/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-b86fffc1-0004/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/data-to-d3__pr2-fill5-c10-noskills-ba03ab11-0005/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-missing-capped5-dbf6c36392/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-6cee37920c/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash/2026-06-08__pr2-pr3-selected-3trial/jax-computing-basics__pr2-vmfill-9737e2a3b8/result.json: missing Harbor field(s): ['agent_info', 'config', 'verifier_result']; harbor: no matching result.json files under /home/runner/work/benchflow/benchflow/baseline-root/submissions/skillsbench/v1.1/openhands-no-skills__deepseek-v4-flash; harbor: missing baseline result for citation-check; harbor: missing baseline result for data-to-d3; harbor: missing baseline result for jax-computing-basics; harbor: missing baseline result for weighted-gdp-calc; no overlapping SkillsBench tasks to compare
  • residual (from plan): network lane carries network_mode as EXPECTED only; the allowlist variant is not passed to bench (no --network flag exists)
  • V-LIFECYCLE / V-ENVHARDEN / V-REWARDHACK: codex/residual review (never faked deterministically)

Required reruns

  • rerun cell: data-to-d3-docker-no-skill-openhands
  • rerun cell: weighted-gdp-calc-docker-no-skill-openhands

@bingran-you bingran-you temporarily deployed to pypi-internal-preview June 19, 2026 05:37 — with GitHub Actions Inactive
@bingran-you bingran-you added bug Something isn't working P1 Important debt — must fix soon, but does not block the current release. review:pending PR is ready-for-review, no reviewer engagement yet. area:eval Issue / PR lives primarily in the "eval" subsystem. area:diagnostics Issue / PR lives primarily in the "diagnostics" subsystem. labels Jun 19, 2026
@bingran-you

Copy link
Copy Markdown
Collaborator Author

Automation update (2026-06-19): pushed 212d29b6 to fix the stale ty check failure on eeec75bb.

What changed:

  • Typed the provider-model metadata lookup so uv run ty check src passes again.
  • Set the copied alias entry's name to the LiteLLM proxy alias as well as id, and added a regression assertion.
  • Updated the regression-test docstring to name PR Preserve pi-acp model metadata through LiteLLM proxy #803 per repo convention.

Local verification after the patch:

  • uv sync --extra dev --extra sandbox-daytona --locked
  • uv run pytest tests/test_litellm_runtime.py -q -> 20 passed
  • uv run ruff check src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py -> pass
  • uv run ruff format --check src/benchflow/providers/litellm_runtime.py tests/test_litellm_runtime.py -> pass
  • uv run ty check src -> pass

Current GitHub checks: normal test and pip-audit are green on the new head; the Docker rollout-smoke is still pending, so I am not marking this merge-ready yet. The older final-review V-TAMPER report was on the previous head and should be treated as stale until current-head review completes.

@bingran-you

Copy link
Copy Markdown
Collaborator Author

Follow-up (2026-06-19): the pending Docker rollout-smoke on head 212d29b6 has now passed as well. Current-head checks are green for test, pip-audit, detect-scope, and rollout-smoke. Holding under review:pending for non-author human review before merge.

Yiminnn added a commit that referenced this pull request Jun 20, 2026
…ositive, codex robustness (#814)

* fix(integration): calibrate L3 gate — slot matching, V-TAMPER, codex robustness

Three defects made the L3 final-review gate systematically return "not mergeable"
for reasons unrelated to the PR under review (found running the gate end-to-end
on #794/#803/#813):

1. Slot mis-attribution (duplicate/missing slots). The review-pack matcher mapped
   rollouts to planned cells by fuzzy dims, ignoring network_mode, so a cell and
   its `-allowlist` network variant collided into one slot (one "duplicate", one
   "missing"); in-place agent retries also each counted as a separate rollout.
   Now attribute each rollout by its cell-id directory (the run-matrix writes
   <cell.id>/<ts>/<task>__<hash>), and collapse same-job retries to the latest
   attempt. Genuinely double-scheduled cells still surface as duplicates.

2. V-TAMPER false-positive on OpenHands. The native-ACP execute scanner matched
   the whole title; OpenHands writes "<description>: $ <command>", and prose like
   "Verify the output" collided with the `verif` token, falsely flagging benign
   cleanup/verification as verifier tampering. Scan only the command after "$ ";
   a real tamper command is still caught.

3. Codex "unavailable" flake. The reviewer had to shell out (bubblewrap sandbox)
   to read the review pack from disk; a transient bwrap loopback failure killed
   every read -> no verdict -> false fail-closed. Inline the review-pack files
   into the prompt so no sandboxed read is needed, plus a bounded retry on
   transient (sandbox/network/rate-limit) output. A dead codex still fails closed.

Adds regression tests for all three.

* style(integration): ruff format the new gate-calibration tests

* fix(integration): tighten greptile P2s on the L3 gate fixes

- codex transient markers: drop the bare '429' (matched 'line 429'); use the
  'too many requests' reason phrase + 'http 429' instead.
- _attribute_rollout: only scan path parts BELOW the artifacts root, so OS-level
  segments can never coincidentally match a cell id.
- _started_at: fall back to finished_at so retry-collapse stays chronological
  when started_at is absent.

* fix(deps): bump pydantic-settings 2.14.1 -> 2.14.2 (GHSA-4xgf-cpjx-pc3j)

A new advisory (GHSA-4xgf-cpjx-pc3j, pydantic-settings 2.14.2, published
2026-06-19) started failing pip-audit repo-wide on every PR and main —
pydantic-settings is a transitive dep via litellm/mcp. Surgically bump the
locked version to the fixed 2.14.2, preserving the revision-1 lockfile format
(no full reformat). `uv sync --locked` accepts the lock; pip-audit is clean.

---------

Co-authored-by: symphony-bot <symphony@benchflow.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:diagnostics Issue / PR lives primarily in the "diagnostics" subsystem. area:eval Issue / PR lives primarily in the "eval" subsystem. bug Something isn't working P1 Important debt — must fix soon, but does not block the current release. review:pending PR is ready-for-review, no reviewer engagement yet.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant