Expose generic Harbor eval inputs #3839
Python API breakage checks: ✅ PASSED
REST API breakage checks (OpenAPI): ✅ PASSED
Co-authored-by: openhands <openhands@all-hands.dev>
Force-pushed: d8ccc1d to 6cf8732
#3839) Use folded scalar style (>-) for long description to satisfy yamlfmt hook. Co-authored-by: openhands <openhands@all-hands.dev>
@github-copilot please review this PR
🟡 QA Report: PARTIAL
The SDK workflow now exposes benchmark=harbor and the generated evaluation-dispatch payload preserves a realistic Harbor config JSON string, but I did not trigger a live cross-repo evaluation run.
Does this PR achieve its stated goal?
Partially verified. For this repository's workflow handoff, yes: exercising the workflow's jq payload construction with benchmark=harbor and a realistic harbor_config_json produced a dispatch payload containing both inputs.benchmark = "harbor" and an intact, parseable Harbor config string. The remaining unverified piece is the live OpenHands/evaluation workflow execution after dispatch; I avoided starting an external eval job from QA.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Workflow YAML parsed successfully; jq, Ruby, and Python were available for local workflow execution. |
| CI Status | "PR Description Check / Validate PR description" was failing and "QA Changes by OpenHands" was still in progress when checked. |
| Functional Verification | 🟡 Local workflow-dispatch payload behavior verified before/after; live external eval run not triggered. |
Functional Verification
Test 1: Harbor workflow input and dispatch payload forwarding
Step 1 — Reproduce / establish baseline on origin/main:
Ran a local simulation of the workflow dispatch-payload jq step with benchmark=harbor and a realistic Harbor config, and checked the baseline workflow markers:
=== baseline main: Harbor choice markers ===
(no matches)
=== baseline main: dispatch payload Harbor fields ===
{
  "benchmark": "harbor",
  "has_harbor_config_json": false,
  "harbor_config_json": null
}
This shows the baseline workflow did not expose Harbor in the manual choices and would not include harbor_config_json in the dispatch payload.
Step 2 — Apply the PR's changes:
Used the checked-out PR branch at 02ff09dc0055a8633712568eefb2b502befa48b6.
Step 3 — Re-run with the PR in place:
Ran the same local workflow-payload simulation against the PR branch:
=== PR branch: Harbor choice markers ===
24: - harbor
81: harbor_config_json:
188: HARBOR_CONFIG_JSON: ${{ github.event.inputs.harbor_config_json || 'N/A' }}
214: echo "harbor_config_json: $HARBOR_CONFIG_JSON"
426: HARBOR_CONFIG_JSON: ${{ github.event.inputs.harbor_config_json || '' }}
452: --arg harbor_config_json "$HARBOR_CONFIG_JSON" \
454: '{ref: $ref, inputs: {... benchmark: $benchmark, harbor_config_json: $harbor_config_json, ...}}'
=== PR branch: dispatch payload Harbor fields ===
{
  "benchmark": "harbor",
  "has_harbor_config_json": true,
  "harbor_config_json_present": true,
  "parsed_harbor_config": {
    "target": "gh:OpenHands/example",
    "target_type": "github",
    "adapter_repo": "OpenHands/harbor-adapter",
    "adapter_ref": "main",
    "adapter_path": "adapter.py",
    "agent": "openhands",
    "agent_env": {"FOO": "bar"},
    "agent_kwargs": {"max_iterations": 3},
    "extra_args": ["--smoke"]
  }
}
This shows the PR adds the Harbor workflow choice and forwards the entire Harbor config as a string that remains valid JSON for the downstream workflow to parse.
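The check described above can be approximated outside the workflow. The following is a minimal Python sketch, not the workflow's actual jq step: `build_dispatch_payload` is a hypothetical helper that mimics how `jq --arg` binds each value as a string, and the `ref` value `"main"` is an assumption. The config values are the ones shown in the QA output.

```python
import json


def build_dispatch_payload(ref: str, benchmark: str, harbor_config_json: str) -> dict:
    """Hypothetical stand-in for the jq step: every value stays a plain string."""
    return {
        "ref": ref,
        "inputs": {
            "benchmark": benchmark,
            # The config is forwarded as an opaque string, not as a nested object.
            "harbor_config_json": harbor_config_json,
        },
    }


# Config values copied from the QA report's parsed_harbor_config output.
harbor_config = {
    "target": "gh:OpenHands/example",
    "target_type": "github",
    "adapter_repo": "OpenHands/harbor-adapter",
    "adapter_ref": "main",
    "adapter_path": "adapter.py",
    "agent": "openhands",
    "agent_env": {"FOO": "bar"},
    "agent_kwargs": {"max_iterations": 3},
    "extra_args": ["--smoke"],
}

config_str = json.dumps(harbor_config)
payload = build_dispatch_payload("main", "harbor", config_str)  # "main" is an assumed ref

# Serialize and re-read the payload, as the receiving workflow would.
received = json.loads(json.dumps(payload))["inputs"]
parsed = json.loads(received["harbor_config_json"])

print(received["benchmark"], isinstance(received["harbor_config_json"], str), parsed == harbor_config)
```

Keeping the config as a string means the nested quotes are escaped once when the payload is serialized, and the downstream side has to call a JSON parser a second time to recover the object.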
Test 2: Workflow syntax
Ran ruby -e 'require "yaml"; YAML.parse_file(".github/workflows/run-eval.yml"); puts "run-eval.yml parsed successfully"':
run-eval.yml parsed successfully
This confirms the edited workflow file is syntactically parseable YAML.
Unable to Verify
I did not trigger run-eval.yml through GitHub Actions or dispatch a real OpenHands/evaluation run, because that would start external CI/evaluation work and may consume repo secrets/resources. Future QA guidance in AGENTS.md could define a safe no-op Harbor evaluation target or dry-run workflow input so agents can verify this path end-to-end without launching a real benchmark.
Issues Found
None for the exercised workflow behavior. CI still reports a failing PR description check, which appears unrelated to the Harbor dispatch behavior and should be handled by a human because the PR description human-check fields are human-only.
This review was created by an AI agent (OpenHands) on behalf of the user.
# Conflicts: # .github/workflows/run-eval.yml
Closing as this is no longer necessary.
HUMAN:
Adding Harbor benchmark support to the SDK eval workflow so we can run any Harbor adapter.
AGENT:
Why
The SDK's run-eval.yml workflow supports multiple benchmarks (swebench, gaia, terminalbench, etc.) but lacks a generic Harbor mode. Harbor provides a standardized adapter interface for benchmarks, and OpenHands/evaluation#588 + OpenHands/benchmarks#755 add the underlying infrastructure. This PR exposes the Harbor inputs in the SDK workflow so users can trigger Harbor-based evals via the existing run-eval.yml dispatch.
Summary
• benchmark=harbor to the run-eval.yml workflow benchmark enum
• harbor_config_json input that accepts target/adapter/agent configuration
• HARBOR_CONFIG_JSON (base64-encoded) to the evaluation backend
Issue Number
Depends on OpenHands/benchmarks#755 and OpenHands/evaluation#588.
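The Summary mentions that the Harbor config reaches the evaluation backend base64-encoded. The actual encoding step is not shown in this PR text, so the following is only a sketch of such a round trip, assuming standard base64 over the UTF-8 JSON string. `encode_config` and `decode_config` are hypothetical helper names. The example config is the one given under How to Test.

```python
import base64
import json


def encode_config(config_json: str) -> str:
    """Encode a JSON string as base64 (assumed: standard alphabet, UTF-8 input)."""
    return base64.b64encode(config_json.encode("utf-8")).decode("ascii")


def decode_config(config_b64: str) -> dict:
    """Reverse encode_config and parse the JSON; raises on malformed input."""
    return json.loads(base64.b64decode(config_b64).decode("utf-8"))


config_json = '{"target": "aider-polyglot", "target_type": "dataset", "agent": "openhands-sdk"}'

config_b64 = encode_config(config_json)
restored = decode_config(config_b64)

print(restored["target"], restored["target_type"], restored["agent"])
```

One plausible reason for the encoded form is that a base64 string contains no quotes or braces, so it can pass through shell commands and environment variables without extra escaping.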
How to Test
• benchmark=harbor and a harbor_config_json like: {"target": "aider-polyglot", "target_type": "dataset", "agent": "openhands-sdk"}
• HARBOR_CONFIG_B64 and HARBOR_CONFIG_JSON env vars
• run_harbor.sh with the correct phase (inference/evaluation)
A smoke eval was run successfully using this infrastructure:
• aider-polyglot benchmark, 5 instances, Claude Sonnet 4.5, result 5/5 PASS.
Video/Screenshots
N/A — workflow configuration change, no UI.
Type
Notes
This PR was created by an AI agent (OpenHands) on behalf of Graham Neubig.
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.13-nodejs22-slim
• golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d7c90d1-python
Run
All tags pushed for this build
About Multi-Architecture Support
• Each variant tag (e.g. d7c90d1-python) is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (e.g. d7c90d1-python-amd64) are also available if needed