Add Auto-FL agent skill #4780
Conversation
|
@greptileai review |
Greptile Summary
This PR introduces the
Confidence Score: 5/5
Safe to merge. The workspace restore guards, YAML error handling, schema-before-apply ordering, and job_help recovery introduced by this PR are all correctly placed on the current head, and the previous critical issues are no longer present in the code. All findings are non-blocking suggestions.
The campaign lifecycle code has thorough rollback coverage (BaseException guards, staged snapshots, file-version capture), the importer never executes user code, and the 66-test suite exercises the primary paths, including partial-apply restore and schema failure isolation. The three flagged items (triple schema reads, unused parameter, no copytree cleanup) do not affect correctness in normal operation.
skills/nvflare-autofl/scripts/run_job_campaign.py, specifically the evaluate_candidate and prepare_candidate functions.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[initialize] --> B{campaign.json exists?}
B -- No --> C[import_job_config → autofl.yaml]
C --> D[create_best_snapshot]
D --> E{target_env == sim?}
E -- Yes --> F[execute_sim_baseline]
F --> G[refresh_campaign_artifacts]
E -- No --> G
B -- Yes --> H{scored baseline?}
H -- No --> I[retry baseline]
I --> G
H -- Yes --> G
G --> J[Agent: prepare]
J --> K[load_best_snapshot / validate workspace]
K --> L[copytree → candidate/source / write manifest]
L --> M[Agent edits candidate/source]
M --> N[evaluate]
N --> O[validate_candidate_for_evaluation]
O --> P[write patch + manifest]
P --> Q[load_mutation_schema / campaign_timeout]
Q --> R{try block}
R --> S[apply_candidate_source / import_job_config / budget hash check]
S -- error --> T[restore_best_source / re-raise]
S -- success --> U{target_env != sim?}
U -- Yes --> V[status=ready_for_external_execution]
V --> W[record external result]
W --> X[finalize_candidate_result]
U -- No --> Y[job_help / run_job]
Y -- INFRA_RETRY --> Z[restore_best_source / status=prepared]
Y -- done --> X
X --> AA{score better?}
AA -- keep --> AB[stage/activate new best snapshot]
AA -- discard --> AC[restore_best_source]
AB --> AD[write_results / write_state / write_progress]
AC --> AD
AD --> AE{campaign_guard decision}
AE -- continue --> J
AE -- plateau_literature --> AF[record literature event]
AF --> J
AE -- stop --> AG[final_report allowed]
Reviews (23): Last reviewed commit: "Merge remote-tracking branch 'upstream/m..."
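The keep/discard branch at the end of the flowchart, together with the staged-snapshot and BaseException rollback behaviour the summary describes, can be pictured with a minimal sketch. This is not the code in run_job_campaign.py: `finalize_candidate`, the `.staged`/`.backup` directory names, and the `higher_is_better` flag are hypothetical, introduced only for illustration.

```python
import shutil
from pathlib import Path
from typing import Tuple


def finalize_candidate(
    best_dir: Path,
    candidate_dir: Path,
    best_score: float,
    candidate_score: float,
    higher_is_better: bool = True,
) -> Tuple[str, float]:
    """Promote the candidate to best snapshot, or leave the best untouched.

    Hypothetical helper: names and directory layout are assumptions.
    """
    if higher_is_better:
        improved = candidate_score > best_score
    else:
        improved = candidate_score < best_score
    if not improved:
        # Discard: the current best snapshot stays exactly as it was.
        return "discard", best_score

    staged = best_dir.with_name(best_dir.name + ".staged")
    backup = best_dir.with_name(best_dir.name + ".backup")
    try:
        # Stage first so a failed copy never touches the current best.
        shutil.copytree(candidate_dir, staged)
        best_dir.rename(backup)
        staged.rename(best_dir)
    except BaseException:
        # Roll back on any failure (including KeyboardInterrupt), then re-raise.
        if backup.exists() and not best_dir.exists():
            backup.rename(best_dir)
        shutil.rmtree(staged, ignore_errors=True)
        raise
    shutil.rmtree(backup, ignore_errors=True)
    return "keep", candidate_score
```

Staging the copy before touching the current best means a failure during the copy leaves the previous best in place, and a failure between the two renames is undone by restoring the backup.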
|
H100 validation for this draft PR:
Passed:
Simulator smoke caveat:
Read: the Auto-FL skill/importer feature path passes on H100; the tiny SimEnv execution smoke is blocked by an upstream/main-compatible simulator/runtime issue rather than this PR’s Auto-FL changes. |
|
@greptileai review again |
|
@greptileai check again |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4780      +/-   ##
==========================================
+ Coverage   56.95%   57.15%   +0.19%
==========================================
  Files         969      971       +2
  Lines       92261    92772     +511
==========================================
+ Hits        52551    53024     +473
- Misses      39710    39748      +38
|
|
@greptileai review again |
|
Addressed the Greptile findings from #4780 (comment) in commit 46c4d5c:
Validation run locally:
|
|
@greptileai check the updated PR |
Signed-off-by: Holger Roth <hroth@nvidia.com>
|
/build |
Signed-off-by: Holger Roth <hroth@nvidia.com>
|
@greptileai review again |
|
/build |
Signed-off-by: Holger Roth <hroth@nvidia.com>
|
@greptileai review again |
|
/build |
Signed-off-by: Holger Roth <hroth@nvidia.com>
|
@greptileai review again |
|
/build |
Signed-off-by: Holger Roth <hroth@nvidia.com>
|
Addressed the remaining Greptile workspace-integrity concern from the current summary in
Regression coverage includes snapshot-stage failure, a failure injected after all campaign artifacts were rewritten, discard restoration failure, and
@greptileai review again |
|
/build |
Signed-off-by: Holger Roth <hroth@nvidia.com>
Signed-off-by: Holger Roth <hroth@nvidia.com>
|
Addressed the two remaining actionable findings from the current Greptile summary in
Regression coverage includes malformed
@greptileai review again |
|
Addressed the server-aggregation exploration gap in 9828fe4. Campaign initialization now deterministically merges existing workspace-local mutation_schema.yaml preferred_targets such as custom_aggregators.py into both generated autofl.yaml allowlists; missing, symlinked, reserved, or escaping paths stay unresolved. The skill and campaign state now explicitly allow new server aggregator modules and require at least one source-backed aggregation candidate after each literature-triggered plateau, unless incompatibility is recorded. Validation: 231 focused tests passed, skill admission reported 0 errors/0 warnings, Black/isort/flake8 and git diff --check passed, and the docs build succeeded. No research/auto-fl-research or H100-specific files were changed. |
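The rule that missing, symlinked, reserved, or escaping paths stay unresolved can be sketched as follows. This is an illustration only, not the skill's implementation: `classify_targets` and the contents of `RESERVED_NAMES` are hypothetical, and the symlink check covers only the final path component.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Hypothetical reserved names; the real list is defined by the skill, not here.
RESERVED_NAMES = {"autofl.yaml", "campaign.json", "mutation_schema.yaml"}


def classify_targets(
    workspace: Path, targets: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split preferred targets into (allowed, unresolved) lists.

    A target is allowed only if it is a relative path to an existing regular
    file inside the workspace, is not a symlink, and is not a reserved name.
    """
    root = workspace.resolve()
    allowed: List[str] = []
    unresolved: List[str] = []
    for raw in targets:
        candidate = root / raw
        ok = (
            not Path(raw).is_absolute()
            and Path(raw).name not in RESERVED_NAMES
            and not candidate.is_symlink()  # final component only
            and candidate.is_file()  # rejects missing paths and directories
            and candidate.resolve().is_relative_to(root)  # rejects "../" escapes
        )
        (allowed if ok else unresolved).append(raw)
    # Sort and de-duplicate so repeated runs produce identical allowlists.
    return sorted(set(allowed)), sorted(set(unresolved))
```

Sorting the output is one simple way to make the merge deterministic; `Path.is_relative_to` requires Python 3.9 or later.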
Summary
nvflare-autofl, a provider-neutral agent skill that optimizes an existing NVFlare job.py without introducing a new Auto-FL command tree.
autofl.yaml with extracted fields, unresolved values, allowed edit paths, metric provenance, and budget constraints without executing user code.
Typical user experience
Optimize ./job.py for accuracy in sim.
job.py into autofl.yaml, explains the trust contract, and gives the agent an isolated candidate draft based on the current best source.
candidate_manifest.json, enforces paths and fixed budgets, runs or materializes the candidate, and keeps or restores it from the metric result.
results.tsv and progress.png; POC and production candidates continue through standard nvflare job submission, wait, download, and inspection commands.
End-to-end validation
The skill was exercised from the minimal prompt above on an 8-client, 20-round, non-IID CIFAR-10 simulation (alpha=0.1) using held-out test_accuracy. This is an uncapped live campaign; the snapshot below was captured after 149 ledger rows, including 135 scored runs and 10 literature checkpoints.
Executive summary
Metric: test_accuracy in NVFlare simulation
Baseline: 0.6870
Best: 0.8218
Improvement: +0.1348 (+13.48 percentage points, +19.62% relative)
Best candidate: temp2_t040_left0775_ep7
Optimization trajectory
This live run validates the product workflow rather than establishing a final benchmark claim. Final benchmark validation should separate the search-validation metric from the held-out test metric and decide whether local epochs are fixed or an explicitly cost-aware search dimension.
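The improvement figures in the executive summary can be re-derived with a few lines of arithmetic. The sketch assumes that 0.6870 is the baseline test_accuracy and 0.8218 is the best score in the snapshot; the variable names are illustrative.

```python
# Assumption: 0.6870 is the baseline and 0.8218 the best scored run.
baseline = 0.6870
best = 0.8218

absolute_gain = best - baseline  # gain in accuracy units
relative_gain = absolute_gain / baseline  # gain relative to the baseline

print(f"{absolute_gain:+.4f}")  # +0.1348
print(f"{absolute_gain * 100:+.2f} percentage points")  # +13.48 percentage points
print(f"{relative_gain:+.2%} relative")  # +19.62% relative
```

The percentage-point figure is the absolute difference scaled by 100, while the relative figure divides that difference by the baseline, which is why the two numbers differ.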
Validation
0 errors, 0 warnings.
Design boundary
NVFlare owns deterministic import, source snapshots, candidate validation, execution truth, policy, artifacts, metrics, restoration, and campaign state. The coding agent owns hypotheses, source edits, new algorithm implementations, and result interpretation. Built-in tunable candidates are optional machine-readable suggestions, not the default search policy. POC and production continue to use the standard NVFlare job submission and policy model.
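One way to picture this boundary is a fixed-budget check: the agent may edit source freely, but the deterministic side refuses a candidate whose budget fields differ from the current best. The sketch below is an assumption about how such a check could look, not the PR's implementation; `FIXED_BUDGET_KEYS`, `budget_fingerprint`, and `budget_unchanged` are hypothetical names.

```python
import hashlib
import json
from typing import Any, Mapping

# Hypothetical set of budget fields the agent is not allowed to change.
FIXED_BUDGET_KEYS = ("num_rounds", "num_clients")


def budget_fingerprint(config: Mapping[str, Any]) -> str:
    """Hash only the fixed budget fields, in a canonical JSON form."""
    budget = {key: config.get(key) for key in FIXED_BUDGET_KEYS}
    canonical = json.dumps(budget, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def budget_unchanged(
    best_config: Mapping[str, Any], candidate_config: Mapping[str, Any]
) -> bool:
    """Return True if the candidate kept every fixed budget field intact."""
    return budget_fingerprint(best_config) == budget_fingerprint(candidate_config)


best_config = {"num_rounds": 20, "num_clients": 8, "lr": 0.01}
tuned = {"num_rounds": 20, "num_clients": 8, "lr": 0.05}  # changes a free field
inflated = {"num_rounds": 40, "num_clients": 8, "lr": 0.01}  # changes the budget

print(budget_unchanged(best_config, tuned))  # True
print(budget_unchanged(best_config, inflated))  # False
```

Hashing a canonical form of only the protected fields leaves other hyperparameters free to vary while making any budget change detectable with a single comparison.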