Add generic Harbor benchmark runner#755
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
…telemetry non-fatal Co-authored-by: openhands <openhands@all-hands.dev>
- Run ruff format on eval_infer.py and run_infer.py (line wrapping) - Fix LaminarService.update_evaluation_scores call to use singleton instance via get() with correct arguments (output_file, report_file) instead of calling unbound method with dict as self Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
- Use single-line LaminarService call to match ruff 0.13.0 formatting - Remove unused 'report' variable assignment Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
There was a problem hiding this comment.
🔴 Needs improvement — two concrete bugs plus a missing test gap that must be addressed before merge.
Summary of issues:
_resolve_targetis called twice in themain()→run_harbor()path, causing the adapter repo to be git-cloned twice when--harbor-adapter-repois set (double network traffic, double disk use, neither directory cleaned up)._secret_valueis a verbatim copy of the private function already defined inbenchmarks/utils/harbor.py; import it instead.- Temp directories created by
_checkout_adapterare never removed — both the one from themain()metadata block and the one insiderun_harbor()leak to/tmpindefinitely. - No tests for the new
benchmarks/harbor/module while the repo has comprehensive coverage for all comparable runners.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
New module, no tests, and a latent double-clone bug that wastes resources and leaks temp dirs. The blast radius is limited to the harbor benchmark path; no existing functionality is modified.
VERDICT: ❌ Needs rework — the double-_resolve_target call and missing tests must be addressed before merging.
KEY INSIGHT: _resolve_target is a side-effecting function (it clones a git repo into a temp dir) but is treated as if it were pure; calling it twice in the happy path is a latent resource waste that is trivially fixed by passing the already-resolved values into run_harbor.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.Was this review helpful? React with 👍 or 👎 to give feedback.
- Remove duplicate _secret_value; import from benchmarks.utils.harbor - Fix double _resolve_target call: pass resolved target/target_type/ checkout_dir into run_harbor instead of re-resolving (avoids double git clone when --harbor-adapter-repo is set) - Clean up adapter checkout temp dir in finally block after run - Add comment explaining max_iterations=100 standard cap - Add tests for harbor run_infer helpers (17 tests) Co-authored-by: openhands <openhands@all-hands.dev>
Use .items() for dict update to satisfy pyright strict type checking. Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
🔴 Needs improvement — the generic Harbor runner still has two blockers: adapter checkouts are not reproducible by default, and arbitrary agent environment values can be logged in cleartext. I avoided duplicating the existing unresolved temp-dir/max-iterations review threads.
[CRITICAL ISSUES]
- [benchmarks/harbor/run_infer.py, Line 75] Reproducibility:
--harbor-adapter-refis optional, so adapter repos can be cloned from a moving default branch. That makes the same benchmark command evaluate different task/evaluator code over time, and the metadata does not record the resolved commit SHA. - [benchmarks/harbor/run_infer.py, Line 173] Secret Handling: The new arbitrary
--agent-env/--agent-env-jsonpassthrough is logged after masking only keys ending inKEY, so common secrets such asGITHUB_TOKEN,HF_TOKEN,PASSWORD, or*_SECRETvalues can be written to logs.
[TESTING GAPS]
- [PR description] No Evidence: This adds user-facing CLI/backend scripts. Add an
Evidencesection with the exact command(s) used to exercise the real runner/evaluator path and the resulting output. Unit tests alone do not prove the end-to-end Harbor integration path works.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
The blast radius is mostly isolated to the new generic Harbor benchmark path, but the change touches benchmark reproducibility, external repo checkout behavior, and credential handling/logging. Those are foundational for trustworthy benchmark results.
VERDICT:
❌ Needs rework — pin/record adapter provenance and harden command redaction before merging.
KEY INSIGHT: Generic benchmark runners have to make reproducibility the default; optional pinning is not enough when outputs may be compared later.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.Was this review helpful? React with 👍 or 👎 to give feedback.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
…ing (#755) Critical fixes from 2nd AI review: - Record resolved adapter commit SHA in metadata for reproducibility - Warn when cloning adapter repo without a pinned ref - Broaden secret masking to cover TOKEN, SECRET, PASSWORD, CREDENTIAL, PASSPHRASE patterns (not just *KEY suffix), preventing GITHUB_TOKEN, HF_TOKEN, etc. from being logged in cleartext - Add _is_sensitive_env_value helper with 7 tests - Update _resolve_target to return 4-tuple including adapter_sha - Update all tests for new signature (24 tests total) Co-authored-by: openhands <openhands@all-hands.dev>
|
🔍 Review in progress… We are performing the review through OpenHands Cloud Automation. You can log in and view the conversation here. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
🔴 Needs improvement — the previous secret-redaction fix still leaves the same class of leak for arbitrary agent kwargs.
[CRITICAL ISSUES]
- [benchmarks/harbor/run_infer.py, Line 208] Secret Handling:
--agent-kwarg/--agent-kwarg-jsonvalues are appended as--ak KEY=VALUE, but the safe command logger only masks values following--ae. Since these kwargs are arbitrary agent inputs, callers can reasonably pass token/API-key-like values here and have them logged in cleartext.
[TESTING GAPS]
- [tests/test_harbor_run_infer.py, Line 177] Regression Test: The current test suite explicitly asserts that
--ak GITHUB_TOKEN=...is not sensitive, which locks in the leak. Update this expectation and add coverage for the logged command path so future redaction changes cannot regress.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
The new benchmark runner is isolated from existing benchmark paths and CI is green, but it introduces a user-facing CLI that forwards arbitrary values into subprocess arguments and logs a redacted command. Any remaining redaction hole is a real credential-exposure risk in benchmark logs.
VERDICT:
❌ Needs rework — redact sensitive-looking --ak values the same way as --ae values, or reject secrets in agent kwargs before logging.
KEY INSIGHT: Generic passthrough flags need one shared redaction rule; separating env and kwargs creates exactly the special case that let credentials leak back into logs.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.Was this review helpful? React with 👍 or 👎 to give feedback.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
3rd AI review found that --agent-kwarg/--agent-kwarg-json values (--ak) were not masked, allowing GITHUB_TOKEN etc. to leak in logs. - Rename _is_sensitive_env_value → _is_sensitive_value - Check both --ae and --ak flags for sensitive key patterns - Update test that incorrectly asserted --ak was not sensitive - Add regression tests for --ak masking (27 tests total) Co-authored-by: openhands <openhands@all-hands.dev>
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
🟢 Taste Rating: Good taste — the runner keeps the Harbor-specific concerns at the CLI boundary, records adapter provenance, cleans up transient checkouts, and extends secret redaction to both --ae and --ak values.
I re-reviewed the current head after the earlier bot feedback. I do not see remaining blocking code issues in the changed files. The PR is still marked draft and declares external dependencies, so those normal readiness gates should be resolved before merge.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
This adds a new generic benchmark runner/evaluator and touches secret forwarding plus runtime adapter checkout behavior, so the blast radius is higher than a narrow wrapper change. Risk is contained by using subprocess argument lists rather than shell execution, recording the resolved adapter SHA in metadata, cleaning temporary clones, adding focused helper/regression tests, and avoiding dependency changes.
VERDICT:
✅ Worth merging: Core logic is sound; proceed once the PR is taken out of draft and its declared external dependencies/CI gates are satisfied.
KEY INSIGHT:
The implementation is pragmatic: it exposes Harbor's generic target model without inventing a second abstraction layer, while preserving the key reproducibility metadata needed for benchmark runs.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Adds a generic Harbor benchmark runner to
OpenHands/benchmarksthat can run any Harbor adapter (e.g. TerminalBench, SkillsBench, or any adapter fromharbor-framework/harbor/adapters).Depends on
OpenHands/evaluation#588andOpenHands/software-agent-sdk#3839.Changes
benchmarks/harbor/run_infer.py: Generic Harbor inference runner with adapter checkout, target resolution, and secret-safe command loggingbenchmarks/harbor/eval_infer.py: Evaluation phase that scores Harbor output and reports to Laminarbenchmarks/utils/harbor.py: Shared Harbor utilities (check_harbor_installed, convert_harbor_to_eval_output, _secret_value)tests/test_harbor_run_infer.py: 24 unit tests covering target resolution, JSON parsing, and secret maskingEvidence
A smoke Harbor eval was run successfully using this runner:
aider-polyglot(5 instances)python -m benchmarks.harbor.run_infer .llm_config/runtime.json --harbor-target aider-polyglot --n-limit 5All 24 unit tests pass, ruff format/lint pass, pyright strict passes.
This PR was created by an AI agent (OpenHands) on behalf of Graham Neubig.