Expose generic Harbor eval inputs#3839

Closed
neubig wants to merge 4 commits into main from feat/generic-harbor-eval

@neubig neubig commented Jun 22, 2026

Member

HUMAN:
Adding Harbor benchmark support to the SDK eval workflow so we can run any Harbor adapter.


AGENT:

Why

The SDK's run-eval.yml workflow supports multiple benchmarks (swebench, gaia, terminalbench, etc.) but lacks a generic Harbor mode. Harbor provides a standardized adapter interface for benchmarks, and OpenHands/evaluation#588 + OpenHands/benchmarks#755 add the underlying infrastructure. This PR exposes the Harbor inputs in the SDK workflow so users can trigger Harbor-based evals via the existing run-eval.yml dispatch.

Summary

  • Add benchmark=harbor to the run-eval.yml workflow benchmark enum
  • Add harbor_config_json input that accepts target/adapter/agent configuration
  • Forward HARBOR_CONFIG_JSON (base64-encoded) to the evaluation backend
  • Document the Harbor config schema in the workflow input description
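
The base64 forwarding mentioned in the summary can be illustrated with a minimal sketch. The PR does not show the actual encoding step, so this assumes standard base64 over compact UTF-8 JSON; the function names `encode_harbor_config` and `decode_harbor_config` are hypothetical and are not part of the workflow or the evaluation backend.

```python
import base64
import json


def encode_harbor_config(config: dict) -> str:
    """Serialize a Harbor config to compact JSON, then base64-encode it.

    Assumption: plain base64 of UTF-8 JSON. The real backend may differ.
    """
    raw = json.dumps(config, separators=(",", ":"), sort_keys=True)
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


def decode_harbor_config(encoded: str) -> dict:
    """Reverse encode_harbor_config; raises ValueError on malformed input."""
    raw = base64.b64decode(encoded, validate=True).decode("utf-8")
    return json.loads(raw)


if __name__ == "__main__":
    config = {
        "target": "aider-polyglot",
        "target_type": "dataset",
        "agent": "openhands-sdk",
    }
    encoded = encode_harbor_config(config)
    # The round trip should give back exactly what went in.
    print(decode_harbor_config(encoded) == config)
```

Encoding the config this way keeps quotes and braces out of shell and Helm templating, which is presumably why a base64 variant is forwarded alongside the raw JSON.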

Issue Number

Depends on OpenHands/benchmarks#755 and OpenHands/evaluation#588.

How to Test

  1. Trigger the workflow manually with benchmark=harbor and a harbor_config_json like:
    {"target": "aider-polyglot", "target_type": "dataset", "agent": "openhands-sdk"}
  2. Verify the Helm deployment receives HARBOR_CONFIG_B64 and HARBOR_CONFIG_JSON env vars
  3. Confirm the eval job runs run_harbor.sh with the correct phase (inference/evaluation)
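
As a rough illustration of step 1, the sketch below builds the JSON body that GitHub's "create a workflow dispatch event" REST endpoint (`POST /repos/{owner}/{repo}/actions/workflows/run-eval.yml/dispatches`) expects. It only constructs and prints the body and does not send anything. The helper name `build_dispatch_request` is hypothetical, and the real workflow may require additional inputs that this PR description does not list.

```python
import json


def build_dispatch_request(ref: str, benchmark: str, harbor_config: dict) -> dict:
    """Return a workflow_dispatch request body (hypothetical helper).

    The workflow reads harbor_config_json through github.event.inputs as a
    string, so the config is serialized to JSON text instead of being
    nested as an object.
    """
    return {
        "ref": ref,
        "inputs": {
            "benchmark": benchmark,
            "harbor_config_json": json.dumps(harbor_config),
        },
    }


if __name__ == "__main__":
    body = build_dispatch_request(
        ref="main",
        benchmark="harbor",
        harbor_config={
            "target": "aider-polyglot",
            "target_type": "dataset",
            "agent": "openhands-sdk",
        },
    )
    print(json.dumps(body, indent=2))
```

The same inputs can be entered by hand in the Actions UI; the point of the sketch is only that `harbor_config_json` travels as a single string value.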

A smoke eval was run successfully using this infrastructure: aider-polyglot benchmark, 5 instances, Claude Sonnet 4.5, result 5/5 PASS.

Video/Screenshots

N/A — workflow configuration change, no UI.

Type

  • Feature

Notes

This PR was created by an AI agent (OpenHands) on behalf of Graham Neubig.


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| --- | --- | --- | --- |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.13-nodejs22-slim | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:d7c90d1-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-d7c90d1-python \
  ghcr.io/openhands/agent-server:d7c90d1-python

All tags pushed for this build

ghcr.io/openhands/agent-server:d7c90d1-golang-amd64
ghcr.io/openhands/agent-server:d7c90d1822e53a331ed69c4af4045edd0850b4d2-golang-amd64
ghcr.io/openhands/agent-server:feat-generic-harbor-eval-golang-amd64
ghcr.io/openhands/agent-server:d7c90d1-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:d7c90d1-golang-arm64
ghcr.io/openhands/agent-server:d7c90d1822e53a331ed69c4af4045edd0850b4d2-golang-arm64
ghcr.io/openhands/agent-server:feat-generic-harbor-eval-golang-arm64
ghcr.io/openhands/agent-server:d7c90d1-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:d7c90d1-java-amd64
ghcr.io/openhands/agent-server:d7c90d1822e53a331ed69c4af4045edd0850b4d2-java-amd64
ghcr.io/openhands/agent-server:feat-generic-harbor-eval-java-amd64
ghcr.io/openhands/agent-server:d7c90d1-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:d7c90d1-java-arm64
ghcr.io/openhands/agent-server:d7c90d1822e53a331ed69c4af4045edd0850b4d2-java-arm64
ghcr.io/openhands/agent-server:feat-generic-harbor-eval-java-arm64
ghcr.io/openhands/agent-server:d7c90d1-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:d7c90d1-python-amd64
ghcr.io/openhands/agent-server:d7c90d1822e53a331ed69c4af4045edd0850b4d2-python-amd64
ghcr.io/openhands/agent-server:feat-generic-harbor-eval-python-amd64
ghcr.io/openhands/agent-server:d7c90d1-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:d7c90d1-python-arm64
ghcr.io/openhands/agent-server:d7c90d1822e53a331ed69c4af4045edd0850b4d2-python-arm64
ghcr.io/openhands/agent-server:feat-generic-harbor-eval-python-arm64
ghcr.io/openhands/agent-server:d7c90d1-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:d7c90d1-golang
ghcr.io/openhands/agent-server:d7c90d1822e53a331ed69c4af4045edd0850b4d2-golang
ghcr.io/openhands/agent-server:feat-generic-harbor-eval-golang
ghcr.io/openhands/agent-server:d7c90d1-golang_tag_1.21-bookworm
ghcr.io/openhands/agent-server:d7c90d1-java
ghcr.io/openhands/agent-server:d7c90d1822e53a331ed69c4af4045edd0850b4d2-java
ghcr.io/openhands/agent-server:feat-generic-harbor-eval-java
ghcr.io/openhands/agent-server:d7c90d1-eclipse-temurin_tag_17-jdk
ghcr.io/openhands/agent-server:d7c90d1-python
ghcr.io/openhands/agent-server:d7c90d1822e53a331ed69c4af4045edd0850b4d2-python
ghcr.io/openhands/agent-server:feat-generic-harbor-eval-python
ghcr.io/openhands/agent-server:d7c90d1-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim

About Multi-Architecture Support

  • Each variant tag (e.g., d7c90d1-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., d7c90d1-python-amd64) are also available if needed

github-actions Bot commented Jun 22, 2026

Contributor

Python API breakage checks — ✅ PASSED

Result: PASSED

Action log

github-actions Bot commented Jun 22, 2026

Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result: PASSED

Action log

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig force-pushed the feat/generic-harbor-eval branch from d8ccc1d to 6cf8732 on June 22, 2026 14:16
#3839)

Use folded scalar style (>-) for long description to satisfy yamlfmt hook.

Co-authored-by: openhands <openhands@all-hands.dev>

neubig commented Jun 22, 2026

Member Author

@github-copilot please review this PR

@neubig neubig marked this pull request as ready for review June 22, 2026 22:22

@all-hands-bot all-hands-bot left a comment

Collaborator

🟡 QA Report: PARTIAL

The SDK workflow now exposes benchmark=harbor and the generated evaluation-dispatch payload preserves a realistic Harbor config JSON string, but I did not trigger a live cross-repo evaluation run.

Does this PR achieve its stated goal?

Partially verified. For this repository's workflow handoff, yes: exercising the workflow's jq payload construction with benchmark=harbor and a realistic harbor_config_json produced a dispatch payload containing both inputs.benchmark = "harbor" and an intact, parseable Harbor config string. The remaining unverified piece is the live OpenHands/evaluation workflow execution after dispatch; I avoided starting an external eval job from QA.

| Phase | Result |
| --- | --- |
| Environment Setup | ✅ Workflow YAML parsed successfully; jq, Ruby, and Python were available for local workflow execution. |
| CI Status | ⚠️ Most checks are green, but PR Description Check / Validate PR description is failing and QA Changes by OpenHands was still in progress when checked. |
| Functional Verification | 🟡 Local workflow-dispatch payload behavior verified before/after; live external eval run not triggered. |

Functional Verification

Test 1: Harbor workflow input and dispatch payload forwarding

Step 1 — Reproduce / establish baseline on origin/main:
Ran a local simulation of the workflow dispatch-payload jq step with benchmark=harbor and a realistic Harbor config, and checked the baseline workflow markers:

=== baseline main: Harbor choice markers ===
(no matches)

=== baseline main: dispatch payload Harbor fields ===
{
  "benchmark": "harbor",
  "has_harbor_config_json": false,
  "harbor_config_json": null
}

This shows the baseline workflow did not expose Harbor in the manual choices and would not include harbor_config_json in the dispatch payload.

Step 2 — Apply the PR's changes:
Used the checked-out PR branch at 02ff09dc0055a8633712568eefb2b502befa48b6.

Step 3 — Re-run with the PR in place:
Ran the same local workflow-payload simulation against the PR branch:

=== PR branch: Harbor choice markers ===
24:                    - harbor
81:            harbor_config_json:
188:                  HARBOR_CONFIG_JSON: ${{ github.event.inputs.harbor_config_json || 'N/A' }}
214:                  echo "harbor_config_json: $HARBOR_CONFIG_JSON"
426:                  HARBOR_CONFIG_JSON: ${{ github.event.inputs.harbor_config_json || '' }}
452:                    --arg harbor_config_json "$HARBOR_CONFIG_JSON" \
454:                    '{ref: $ref, inputs: {... benchmark: $benchmark, harbor_config_json: $harbor_config_json, ...}}'

=== PR branch: dispatch payload Harbor fields ===
{
  "benchmark": "harbor",
  "has_harbor_config_json": true,
  "harbor_config_json_present": true,
  "parsed_harbor_config": {
    "target": "gh:OpenHands/example",
    "target_type": "github",
    "adapter_repo": "OpenHands/harbor-adapter",
    "adapter_ref": "main",
    "adapter_path": "adapter.py",
    "agent": "openhands",
    "agent_env": {"FOO": "bar"},
    "agent_kwargs": {"max_iterations": 3},
    "extra_args": ["--smoke"]
  }
}

This shows the PR adds the Harbor workflow choice and forwards the entire Harbor config as a string that remains valid JSON for the downstream workflow to parse.
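
The kind of check described above can be approximated with a short sketch. This is not the reviewer's actual script, which used jq and is not shown in the review; `summarize_harbor_fields` is a hypothetical helper, and the payload shape (`ref` plus `inputs`) is taken from the jq template quoted in the grep output.

```python
import json


def summarize_harbor_fields(payload: dict) -> dict:
    """Report whether a dispatch payload carries a parseable Harbor config."""
    inputs = payload.get("inputs", {})
    raw = inputs.get("harbor_config_json")
    return {
        "benchmark": inputs.get("benchmark"),
        "has_harbor_config_json": "harbor_config_json" in inputs,
        # json.loads raises ValueError if the string was mangled in transit.
        "parsed_harbor_config": json.loads(raw) if raw else None,
    }


if __name__ == "__main__":
    config = {"target": "gh:OpenHands/example", "target_type": "github"}

    # Baseline: the payload has no harbor_config_json field at all.
    baseline = {"ref": "main", "inputs": {"benchmark": "harbor"}}
    # PR branch: the config is forwarded as a JSON string.
    with_pr = {
        "ref": "main",
        "inputs": {
            "benchmark": "harbor",
            "harbor_config_json": json.dumps(config),
        },
    }

    print(json.dumps(summarize_harbor_fields(baseline), indent=2))
    print(json.dumps(summarize_harbor_fields(with_pr), indent=2))
```

Parsing the string back into an object is what distinguishes "the field is present" from "the field survived quoting intact", which is the property the downstream workflow depends on.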

Test 2: Workflow syntax

Ran ruby -e 'require "yaml"; YAML.parse_file(".github/workflows/run-eval.yml"); puts "run-eval.yml parsed successfully"':

run-eval.yml parsed successfully

This confirms the edited workflow file is syntactically parseable YAML.

Unable to Verify

I did not trigger run-eval.yml through GitHub Actions or dispatch a real OpenHands/evaluation run, because that would start external CI/evaluation work and may consume repo secrets/resources. Future QA guidance in AGENTS.md could define a safe no-op Harbor evaluation target or dry-run workflow input so agents can verify this path end-to-end without launching a real benchmark.

Issues Found

None for the exercised workflow behavior. CI still reports a failing PR description check, which appears unrelated to the Harbor dispatch behavior and should be handled by a human because the PR description human-check fields are human-only.

This review was created by an AI agent (OpenHands) on behalf of the user.

# Conflicts:
#	.github/workflows/run-eval.yml

neubig commented Jun 24, 2026

Member Author

Closing as this is no longer necessary.

@neubig neubig closed this Jun 24, 2026