Add generic Harbor benchmark runner by neubig · Pull Request #755 · OpenHands/benchmarks

neubig · 2026-06-22T14:08:33Z

Adds a generic Harbor benchmark runner to OpenHands/benchmarks that can run any Harbor adapter (e.g. TerminalBench, SkillsBench, or any adapter from harbor-framework/harbor/adapters).

Depends on OpenHands/evaluation#588 and OpenHands/software-agent-sdk#3839.

Changes

benchmarks/harbor/run_infer.py: Generic Harbor inference runner with adapter checkout, target resolution, and secret-safe command logging
benchmarks/harbor/eval_infer.py: Evaluation phase that scores Harbor output and reports to Laminar
benchmarks/utils/harbor.py: Shared Harbor utilities (check_harbor_installed, convert_harbor_to_eval_output, _secret_value)
tests/test_harbor_run_infer.py: 24 unit tests covering target resolution, JSON parsing, and secret masking

Evidence

A smoke Harbor eval was run successfully using this runner:

Benchmark: aider-polyglot (5 instances)
Model: Claude Sonnet 4.5
Result: 5/5 PASS (reward=1 for all instances)
Command: python -m benchmarks.harbor.run_infer .llm_config/runtime.json --harbor-target aider-polyglot --n-limit 5

All 24 unit tests pass, ruff format/lint pass, pyright strict passes.

This PR was created by an AI agent (OpenHands) on behalf of Graham Neubig.

Co-authored-by: openhands <openhands@all-hands.dev>

…telemetry non-fatal Co-authored-by: openhands <openhands@all-hands.dev>

- Run ruff format on eval_infer.py and run_infer.py (line wrapping) - Fix LaminarService.update_evaluation_scores call to use singleton instance via get() with correct arguments (output_file, report_file) instead of calling unbound method with dict as self Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-06-22T22:08:04Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

- Use single-line LaminarService call to match ruff 0.13.0 formatting - Remove unused 'report' variable assignment Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

🔴 Needs improvement — two concrete bugs plus a missing test gap that must be addressed before merge.

Summary of issues:

_resolve_target is called twice in the main() → run_harbor() path, causing the adapter repo to be git-cloned twice when --harbor-adapter-repo is set (double network traffic, double disk use, neither directory cleaned up).
_secret_value is a verbatim copy of the private function already defined in benchmarks/utils/harbor.py; import it instead.
Temp directories created by _checkout_adapter are never removed — both the one from the main() metadata block and the one inside run_harbor() leak to /tmp indefinitely.
No tests for the new benchmarks/harbor/ module while the repo has comprehensive coverage for all comparable runners.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
New module, no tests, and a latent double-clone bug that wastes resources and leaks temp dirs. The blast radius is limited to the harbor benchmark path; no existing functionality is modified.

VERDICT: ❌ Needs rework — the double-_resolve_target call and missing tests must be addressed before merging.

KEY INSIGHT: _resolve_target is a side-effecting function (it clones a git repo into a temp dir) but is treated as if it were pure; calling it twice in the happy path is a latent resource waste that is trivially fixed by passing the already-resolved values into run_harbor.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.

Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Was this review helpful? React with 👍 or 👎 to give feedback.

- Remove duplicate _secret_value; import from benchmarks.utils.harbor - Fix double _resolve_target call: pass resolved target/target_type/ checkout_dir into run_harbor instead of re-resolving (avoids double git clone when --harbor-adapter-repo is set) - Clean up adapter checkout temp dir in finally block after run - Add comment explaining max_iterations=100 standard cap - Add tests for harbor run_infer helpers (17 tests) Co-authored-by: openhands <openhands@all-hands.dev>

Use .items() for dict update to satisfy pyright strict type checking. Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-06-22T22:34:03Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

🔴 Needs improvement — the generic Harbor runner still has two blockers: adapter checkouts are not reproducible by default, and arbitrary agent environment values can be logged in cleartext. I avoided duplicating the existing unresolved temp-dir/max-iterations review threads.

[CRITICAL ISSUES]

[benchmarks/harbor/run_infer.py, Line 75] Reproducibility: --harbor-adapter-ref is optional, so adapter repos can be cloned from a moving default branch. That makes the same benchmark command evaluate different task/evaluator code over time, and the metadata does not record the resolved commit SHA.
[benchmarks/harbor/run_infer.py, Line 173] Secret Handling: The new arbitrary --agent-env/--agent-env-json passthrough is logged after masking only keys ending in KEY, so common secrets such as GITHUB_TOKEN, HF_TOKEN, PASSWORD, or *_SECRET values can be written to logs.

[TESTING GAPS]

[PR description] No Evidence: This adds user-facing CLI/backend scripts. Add an Evidence section with the exact command(s) used to exercise the real runner/evaluator path and the resulting output. Unit tests alone do not prove the end-to-end Harbor integration path works.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
The blast radius is mostly isolated to the new generic Harbor benchmark path, but the change touches benchmark reproducibility, external repo checkout behavior, and credential handling/logging. Those are foundational for trustworthy benchmark results.

VERDICT:
❌ Needs rework — pin/record adapter provenance and harden command redaction before merging.

KEY INSIGHT: Generic benchmark runners have to make reproducibility the default; optional pinning is not enough when outputs may be compared later.

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.

Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Was this review helpful? React with 👍 or 👎 to give feedback.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

…ing (#755) Critical fixes from 2nd AI review: - Record resolved adapter commit SHA in metadata for reproducibility - Warn when cloning adapter repo without a pinned ref - Broaden secret masking to cover TOKEN, SECRET, PASSWORD, CREDENTIAL, PASSPHRASE patterns (not just *KEY suffix), preventing GITHUB_TOKEN, HF_TOKEN, etc. from being logged in cleartext - Add _is_sensitive_env_value helper with 7 tests - Update _resolve_target to return 4-tuple including adapter_sha - Update all tests for new signature (24 tests total) Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-06-22T22:46:37Z

🔍 Review in progress…

We are performing the review through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot · 2026-06-22T22:53:57Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

🔴 Needs improvement — the previous secret-redaction fix still leaves the same class of leak for arbitrary agent kwargs.

[CRITICAL ISSUES]

[benchmarks/harbor/run_infer.py, Line 208] Secret Handling: --agent-kwarg / --agent-kwarg-json values are appended as --ak KEY=VALUE, but the safe command logger only masks values following --ae. Since these kwargs are arbitrary agent inputs, callers can reasonably pass token/API-key-like values here and have them logged in cleartext.

[TESTING GAPS]

[tests/test_harbor_run_infer.py, Line 177] Regression Test: The current test suite explicitly asserts that --ak GITHUB_TOKEN=... is not sensitive, which locks in the leak. Update this expectation and add coverage for the logged command path so future redaction changes cannot regress.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
The new benchmark runner is isolated from existing benchmark paths and CI is green, but it introduces a user-facing CLI that forwards arbitrary values into subprocess arguments and logs a redacted command. Any remaining redaction hole is a real credential-exposure risk in benchmark logs.

VERDICT:
❌ Needs rework — redact sensitive-looking --ak values the same way as --ae values, or reject secrets in agent kwargs before logging.

KEY INSIGHT: Generic passthrough flags need one shared redaction rule; separating env and kwargs creates exactly the special case that let credentials leak back into logs.

Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.

Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Was this review helpful? React with 👍 or 👎 to give feedback.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

3rd AI review found that --agent-kwarg/--agent-kwarg-json values (--ak) were not masked, allowing GITHUB_TOKEN etc. to leak in logs. - Rename _is_sensitive_env_value → _is_sensitive_value - Check both --ae and --ak flags for sensitive key patterns - Update test that incorrectly asserted --ak was not sensitive - Add regression tests for --ak masking (27 tests total) Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot · 2026-06-22T23:01:10Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

🟢 Taste Rating: Good taste — the runner keeps the Harbor-specific concerns at the CLI boundary, records adapter provenance, cleans up transient checkouts, and extends secret redaction to both --ae and --ak values.

I re-reviewed the current head after the earlier bot feedback. I do not see remaining blocking code issues in the changed files. The PR is still marked draft and declares external dependencies, so those normal readiness gates should be resolved before merge.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
This adds a new generic benchmark runner/evaluator and touches secret forwarding plus runtime adapter checkout behavior, so the blast radius is higher than a narrow wrapper change. Risk is contained by using subprocess argument lists rather than shell execution, recording the resolved adapter SHA in metadata, cleaning temporary clones, adding focused helper/regression tests, and avoiding dependency changes.

VERDICT:
✅ Worth merging: Core logic is sound; proceed once the PR is taken out of draft and its declared external dependencies/CI gates are satisfied.

KEY INSIGHT:
The implementation is pragmatic: it exposes Harbor's generic target model without inventing a second abstraction layer, while preserving the key reproducibility metadata needed for benchmark runs.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Add generic Harbor benchmark runner

a926514

Co-authored-by: openhands <openhands@all-hands.dev>

neubig mentioned this pull request Jun 22, 2026

Expose generic Harbor eval inputs OpenHands/software-agent-sdk#3839

Open

1 task

Graham Neubig and others added 2 commits June 22, 2026 11:13

Fix LaminarService call in eval_infer to use correct method and make …

11d10ff

…telemetry non-fatal Co-authored-by: openhands <openhands@all-hands.dev>

neubig added the review-this label Jun 22, 2026

fix: match CI ruff formatting and remove unused variable (#755)

5349fde

- Use single-line LaminarService call to match ruff 0.13.0 formatting - Remove unused 'report' variable assignment Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot requested changes Jun 22, 2026

View reviewed changes

Comment thread benchmarks/harbor/run_infer.py Outdated

Comment thread benchmarks/harbor/run_infer.py Outdated

Comment thread benchmarks/harbor/run_infer.py

Comment thread benchmarks/harbor/run_infer.py

Graham Neubig and others added 2 commits June 22, 2026 18:20

fix: pyright type error in test helper (#755)

1fb976c

Use .items() for dict update to satisfy pyright strict type checking. Co-authored-by: openhands <openhands@all-hands.dev>

neubig added review-this and removed review-this labels Jun 22, 2026

all-hands-bot requested changes Jun 22, 2026

View reviewed changes

Comment thread benchmarks/harbor/run_infer.py Outdated

Comment thread benchmarks/harbor/run_infer.py Outdated

neubig added review-this and removed review-this labels Jun 22, 2026

all-hands-bot requested changes Jun 22, 2026

View reviewed changes

Comment thread benchmarks/harbor/run_infer.py

Comment thread tests/test_harbor_run_infer.py Outdated

neubig added review-this and removed review-this labels Jun 22, 2026

all-hands-bot approved these changes Jun 22, 2026

View reviewed changes

neubig marked this pull request as ready for review June 23, 2026 01:34

neubig merged commit 82687c8 into main Jun 23, 2026
11 checks passed

Conversation

neubig commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Evidence

Uh oh!

all-hands-bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

all-hands-bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

all-hands-bot commented Jun 22, 2026

Uh oh!

all-hands-bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

all-hands-bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

neubig commented Jun 22, 2026 •

edited

Loading

all-hands-bot commented Jun 22, 2026 •

edited

Loading

all-hands-bot commented Jun 22, 2026 •

edited

Loading

all-hands-bot commented Jun 22, 2026 •

edited

Loading

all-hands-bot commented Jun 22, 2026 •

edited

Loading