Skip to content

Add generic Harbor benchmark runner#755

Merged
neubig merged 8 commits into
mainfrom
feat/generic-harbor-eval
Jun 23, 2026
Merged

Add generic Harbor benchmark runner#755
neubig merged 8 commits into
mainfrom
feat/generic-harbor-eval

Conversation

@neubig

@neubig neubig commented Jun 22, 2026

Copy link
Copy Markdown
Member

Adds a generic Harbor benchmark runner to OpenHands/benchmarks that can run any Harbor adapter (e.g. TerminalBench, SkillsBench, or any adapter from harbor-framework/harbor/adapters).

Depends on OpenHands/evaluation#588 and OpenHands/software-agent-sdk#3839.

Changes

  • benchmarks/harbor/run_infer.py: Generic Harbor inference runner with adapter checkout, target resolution, and secret-safe command logging
  • benchmarks/harbor/eval_infer.py: Evaluation phase that scores Harbor output and reports to Laminar
  • benchmarks/utils/harbor.py: Shared Harbor utilities (check_harbor_installed, convert_harbor_to_eval_output, _secret_value)
  • tests/test_harbor_run_infer.py: 24 unit tests covering target resolution, JSON parsing, and secret masking

Evidence

A smoke Harbor eval was run successfully using this runner:

  • Benchmark: aider-polyglot (5 instances)
  • Model: Claude Sonnet 4.5
  • Result: 5/5 PASS (reward=1 for all instances)
  • Command: python -m benchmarks.harbor.run_infer .llm_config/runtime.json --harbor-target aider-polyglot --n-limit 5

All 24 unit tests pass, ruff format/lint pass, pyright strict passes.

This PR was created by an AI agent (OpenHands) on behalf of Graham Neubig.

Co-authored-by: openhands <openhands@all-hands.dev>
Graham Neubig and others added 2 commits June 22, 2026 11:13
…telemetry non-fatal

Co-authored-by: openhands <openhands@all-hands.dev>
- Run ruff format on eval_infer.py and run_infer.py (line wrapping)
- Fix LaminarService.update_evaluation_scores call to use singleton
  instance via get() with correct arguments (output_file, report_file)
  instead of calling unbound method with dict as self

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

- Use single-line LaminarService call to match ruff 0.13.0 formatting
- Remove unused 'report' variable assignment

Co-authored-by: openhands <openhands@all-hands.dev>

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Needs improvement — two concrete bugs plus a missing test gap that must be addressed before merge.

Summary of issues:

  1. _resolve_target is called twice in the main()run_harbor() path, causing the adapter repo to be git-cloned twice when --harbor-adapter-repo is set (double network traffic, double disk use, neither directory cleaned up).
  2. _secret_value is a verbatim copy of the private function already defined in benchmarks/utils/harbor.py; import it instead.
  3. Temp directories created by _checkout_adapter are never removed — both the one from the main() metadata block and the one inside run_harbor() leak to /tmp indefinitely.
  4. No tests for the new benchmarks/harbor/ module while the repo has comprehensive coverage for all comparable runners.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
    New module, no tests, and a latent double-clone bug that wastes resources and leaks temp dirs. The blast radius is limited to the harbor benchmark path; no existing functionality is modified.

VERDICT:Needs rework — the double-_resolve_target call and missing tests must be addressed before merging.

KEY INSIGHT: _resolve_target is a side-effecting function (it clones a git repo into a temp dir) but is treated as if it were pure; calling it twice in the happy path is a latent resource waste that is trivially fixed by passing the already-resolved values into run_harbor.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Was this review helpful? React with 👍 or 👎 to give feedback.

Comment thread benchmarks/harbor/run_infer.py Outdated
Comment thread benchmarks/harbor/run_infer.py Outdated
Comment thread benchmarks/harbor/run_infer.py
Comment thread benchmarks/harbor/run_infer.py
Graham Neubig and others added 2 commits June 22, 2026 18:20
- Remove duplicate _secret_value; import from benchmarks.utils.harbor
- Fix double _resolve_target call: pass resolved target/target_type/
  checkout_dir into run_harbor instead of re-resolving (avoids double
  git clone when --harbor-adapter-repo is set)
- Clean up adapter checkout temp dir in finally block after run
- Add comment explaining max_iterations=100 standard cap
- Add tests for harbor run_infer helpers (17 tests)

Co-authored-by: openhands <openhands@all-hands.dev>
Use .items() for dict update to satisfy pyright strict type checking.

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Needs improvement — the generic Harbor runner still has two blockers: adapter checkouts are not reproducible by default, and arbitrary agent environment values can be logged in cleartext. I avoided duplicating the existing unresolved temp-dir/max-iterations review threads.

[CRITICAL ISSUES]

  • [benchmarks/harbor/run_infer.py, Line 75] Reproducibility: --harbor-adapter-ref is optional, so adapter repos can be cloned from a moving default branch. That makes the same benchmark command evaluate different task/evaluator code over time, and the metadata does not record the resolved commit SHA.
  • [benchmarks/harbor/run_infer.py, Line 173] Secret Handling: The new arbitrary --agent-env/--agent-env-json passthrough is logged after masking only keys ending in KEY, so common secrets such as GITHUB_TOKEN, HF_TOKEN, PASSWORD, or *_SECRET values can be written to logs.

[TESTING GAPS]

  • [PR description] No Evidence: This adds user-facing CLI/backend scripts. Add an Evidence section with the exact command(s) used to exercise the real runner/evaluator path and the resulting output. Unit tests alone do not prove the end-to-end Harbor integration path works.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
    The blast radius is mostly isolated to the new generic Harbor benchmark path, but the change touches benchmark reproducibility, external repo checkout behavior, and credential handling/logging. Those are foundational for trustworthy benchmark results.

VERDICT:
Needs rework — pin/record adapter provenance and harden command redaction before merging.

KEY INSIGHT: Generic benchmark runners have to make reproducibility the default; optional pinning is not enough when outputs may be compared later.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Was this review helpful? React with 👍 or 👎 to give feedback.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread benchmarks/harbor/run_infer.py Outdated
Comment thread benchmarks/harbor/run_infer.py Outdated
…ing (#755)

Critical fixes from 2nd AI review:
- Record resolved adapter commit SHA in metadata for reproducibility
- Warn when cloning adapter repo without a pinned ref
- Broaden secret masking to cover TOKEN, SECRET, PASSWORD, CREDENTIAL,
  PASSPHRASE patterns (not just *KEY suffix), preventing GITHUB_TOKEN,
  HF_TOKEN, etc. from being logged in cleartext
- Add _is_sensitive_env_value helper with 7 tests
- Update _resolve_target to return 4-tuple including adapter_sha
- Update all tests for new signature (24 tests total)

Co-authored-by: openhands <openhands@all-hands.dev>

Copy link
Copy Markdown
Collaborator

🔍 Review in progress…

We are performing the review through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Needs improvement — the previous secret-redaction fix still leaves the same class of leak for arbitrary agent kwargs.

[CRITICAL ISSUES]

  • [benchmarks/harbor/run_infer.py, Line 208] Secret Handling: --agent-kwarg / --agent-kwarg-json values are appended as --ak KEY=VALUE, but the safe command logger only masks values following --ae. Since these kwargs are arbitrary agent inputs, callers can reasonably pass token/API-key-like values here and have them logged in cleartext.

[TESTING GAPS]

  • [tests/test_harbor_run_infer.py, Line 177] Regression Test: The current test suite explicitly asserts that --ak GITHUB_TOKEN=... is not sensitive, which locks in the leak. Update this expectation and add coverage for the logged command path so future redaction changes cannot regress.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
    The new benchmark runner is isolated from existing benchmark paths and CI is green, but it introduces a user-facing CLI that forwards arbitrary values into subprocess arguments and logs a redacted command. Any remaining redaction hole is a real credential-exposure risk in benchmark logs.

VERDICT:
Needs rework — redact sensitive-looking --ak values the same way as --ae values, or reject secrets in agent kwargs before logging.

KEY INSIGHT: Generic passthrough flags need one shared redaction rule; separating env and kwargs creates exactly the special case that let credentials leak back into logs.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Was this review helpful? React with 👍 or 👎 to give feedback.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread benchmarks/harbor/run_infer.py
Comment thread tests/test_harbor_run_infer.py Outdated
3rd AI review found that --agent-kwarg/--agent-kwarg-json values (--ak)
were not masked, allowing GITHUB_TOKEN etc. to leak in logs.

- Rename _is_sensitive_env_value → _is_sensitive_value
- Check both --ae and --ak flags for sensitive key patterns
- Update test that incorrectly asserted --ak was not sensitive
- Add regression tests for --ak masking (27 tests total)

Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Taste Rating: Good taste — the runner keeps the Harbor-specific concerns at the CLI boundary, records adapter provenance, cleans up transient checkouts, and extends secret redaction to both --ae and --ak values.

I re-reviewed the current head after the earlier bot feedback. I do not see remaining blocking code issues in the changed files. The PR is still marked draft and declares external dependencies, so those normal readiness gates should be resolved before merge.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
    This adds a new generic benchmark runner/evaluator and touches secret forwarding plus runtime adapter checkout behavior, so the blast radius is higher than a narrow wrapper change. Risk is contained by using subprocess argument lists rather than shell execution, recording the resolved adapter SHA in metadata, cleaning temporary clones, adding focused helper/regression tests, and avoiding dependency changes.

VERDICT:
Worth merging: Core logic is sound; proceed once the PR is taken out of draft and its declared external dependencies/CI gates are satisfied.

KEY INSIGHT:
The implementation is pragmatic: it exposes Harbor's generic target model without inventing a second abstraction layer, while preserving the key reproducibility metadata needed for benchmark runs.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

@neubig neubig marked this pull request as ready for review June 23, 2026 01:34
@neubig neubig merged commit 82687c8 into main Jun 23, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants