Add Auto-FL agent skill #4780

Open
holgerroth wants to merge 29 commits into NVIDIA:main from holgerroth:codex/autofl-skill-v1

@holgerroth

@holgerroth holgerroth commented Jun 8, 2026

Collaborator

Summary

  • Add nvflare-autofl, a provider-neutral agent skill that optimizes an existing NVFlare job.py without introducing a new Auto-FL command tree.
  • Add a deterministic AST importer that emits a reviewable autofl.yaml with extracted fields, unresolved values, allowed edit paths, metric provenance, and budget constraints without executing user code.
  • Add a skill-local candidate lifecycle that lets the coding agent author source and algorithm candidates while NVFlare snapshots, validates, executes, compares, restores, and records them deterministically.
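
The AST-only import described in the second bullet can be illustrated with a minimal sketch. This is not the PR's `job_importer.py`: the function name `extract_recipe_kwargs`, the `UNRESOLVED` marker, and the sample job source are hypothetical. The sketch parses source text with Python's `ast` module, records literal keyword arguments of a recipe call, and marks anything non-literal (such as a call expression) as unresolved instead of evaluating it.

```python
import ast

UNRESOLVED = "<unresolved>"  # hypothetical marker, not the PR's actual representation


def extract_recipe_kwargs(source: str, recipe_name: str) -> dict:
    """Collect keyword arguments of the first call to `recipe_name` without running the code."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != recipe_name:
            continue
        result = {}
        for kw in node.keywords:
            if kw.arg is None:  # a **kwargs expansion cannot be resolved statically
                continue
            try:
                # literal_eval accepts only literals; names and calls are rejected
                result[kw.arg] = ast.literal_eval(kw.value)
            except (ValueError, TypeError):
                result[kw.arg] = UNRESOLVED
        return result
    return {}


# Hypothetical job source; get_clients() is never executed, only parsed.
JOB_SOURCE = """
recipe = FedAvgRecipe(num_rounds=20, min_clients=get_clients(), key_metric="accuracy")
"""

fields = extract_recipe_kwargs(JOB_SOURCE, "FedAvgRecipe")
print(fields)
```

In this sketch, `min_clients` is reported as unresolved because its value comes from a call, while `num_rounds` and `key_metric` are extracted as concrete values.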

Typical user experience

  1. Select the NVFlare Auto-FL skill and prompt: Optimize ./job.py for accuracy in sim.
  2. The skill imports job.py into autofl.yaml, explains the trust contract, and gives the agent an isolated candidate draft based on the current best source.
  3. The agent edits tunables, source, or new Python algorithms; NVFlare computes candidate_manifest.json, enforces paths and fixed budgets, runs or materializes the candidate, and keeps or restores it from the metric result.
  4. The user monitors results.tsv and progress.png; POC and production candidates continue through standard nvflare job submission, wait, download, and inspection commands.

End-to-end validation

The skill was exercised from the minimal prompt above on an 8-client, 20-round, non-IID CIFAR-10 simulation (alpha=0.1) using held-out test_accuracy. This is an uncapped live campaign; the snapshot below was captured after 149 ledger rows, including 135 scored runs and 10 literature checkpoints.

Executive summary

| Item | Live campaign snapshot |
| --- | --- |
| Objective | Maximize test_accuracy in NVFlare simulation |
| Baseline | 0.6870 |
| Best observed | 0.8218 |
| Improvement | +0.1348 (+13.48 percentage points, +19.62% relative) |
| Decisions | 33 kept, 101 discarded, 4 crashed |
| Literature checkpoints | 10 |
| Recorded candidate runtime | 15.31 hours total, 6.6 minutes per non-literature row |
| Best candidate | temp2_t040_left0775_ep7 |
| Campaign status | Active and uncapped |
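
The derived figures in the summary can be rechecked with a few lines of arithmetic. The sketch below uses only numbers quoted in this description; it assumes that "non-literature rows" means the 149 ledger rows minus the 10 literature checkpoints, which is an interpretation rather than something the runner code confirms.

```python
# Values quoted in the PR description.
baseline = 0.6870
best = 0.8218
ledger_rows = 149
literature_rows = 10
runtime_hours = 15.31

absolute_gain = best - baseline
relative_gain_pct = absolute_gain / baseline * 100

# Assumption: per-row runtime is averaged over all non-literature ledger rows.
non_literature_rows = ledger_rows - literature_rows
minutes_per_row = runtime_hours * 60 / non_literature_rows

print(f"absolute gain: {absolute_gain:+.4f}")
print(f"relative gain: {relative_gain_pct:+.2f}%")
print(f"minutes per non-literature row: {minutes_per_row:.1f}")
```

Under that assumption the results agree with the table: +0.1348 absolute, +19.62% relative, and about 6.6 minutes per row.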

Auto-FL campaign progress

Optimization trajectory

  • Federated optimizer tuning: FedAvgM improved accuracy to 0.7019. Combining FedAvgM with FedProx reached 0.7454, then momentum refinement reached 0.7545.
  • Agent-authored evaluation code: Horizontal-flip TTA raised accuracy to 0.7739. Increasing local epochs and retaining TTA reached 0.8002.
  • Broader TTA and calibration: Directional shifts, probability averaging, scale/crop views, local-prior correction, and temperature controls progressively raised the best score to 0.8218.
  • Literature-driven exploration: Ten checkpoints considered FedBN, SCAFFOLD, FedDyn, MixUp, SAM, class-balanced loss, FedLC, EMA/SWA/Lookahead, FedNova, TTA aggregation, logarithmic pooling, and R-Drop. Some produced useful implementation directions; others were rejected by measured results.
  • Current plateau: Recent candidates cluster around 0.8208 to 0.8218. The campaign remains active and has started another literature-driven branch after fine TTA calibration plateaued.

This live run validates the product workflow rather than establishing a final benchmark claim. Final benchmark validation should separate the search-validation metric from the held-out test metric and decide whether local epochs are fixed or an explicitly cost-aware search dimension.

Validation

  • 66 focused Auto-FL importer, campaign guard, candidate lifecycle, subprocess end-to-end, and skill-packaging tests passed.
  • NVFlare skill admission checks completed with 0 errors, 0 warnings.
  • Direct Black, isort, and flake8 checks passed; the Sphinx HTML build completed successfully.
  • A subprocess end-to-end test initializes a baseline, edits a generated source draft, evaluates its manifest, retains the improved code, and verifies ledger and patch provenance without Git.
  • The latest head passed the full GitHub test matrix across Python 3.10-3.14 on Ubuntu 22.04 and 24.04, plus style, license, coverage, wheel, link, and CodeQL checks.

Design boundary

NVFlare owns deterministic import, source snapshots, candidate validation, execution truth, policy, artifacts, metrics, restoration, and campaign state. The coding agent owns hypotheses, source edits, new algorithm implementations, and result interpretation. Built-in tunable candidates are optional machine-readable suggestions, not the default search policy. POC and production continue to use the standard NVFlare job submission and policy model.

@holgerroth changed the title from "Add Auto-FL agent skill importer" to "Add Auto-FL agent skill" on Jun 8, 2026
@holgerroth

Collaborator Author

@greptileai review

@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR introduces the nvflare-autofl skill: a deterministic Auto-FL agent skill that imports an existing NVFlare job.py into a reviewable autofl.yaml trust contract without executing user code, and manages a full candidate lifecycle (prepare → evaluate → finalize/restore). Previous review rounds caught a series of critical issues (missing error handlers, importer fallback paths, shared list references, --mode forwarding, pre-apply schema failures, partial-copy recovery, YAML error propagation, and job_help placement), all of which have been fixed on the current head.

  • New importer module (nvflare/app_common/autofl/job_importer.py): AST-only parser that extracts tunables, budget constraints, metric provenance, and unresolved fields into autofl.yaml without importing or executing job.py.
  • Campaign runner (skills/nvflare-autofl/scripts/run_job_campaign.py, ~2 850 lines): manages snapshot lifecycle, candidate validation, workspace restore on failure, and transactional finalization with rollback guards.
  • Campaign guard and plotter (campaign_guard.py, plot_progress.py): stateless decision function and optional progress visualization loaded dynamically at runtime.
  • Test suite (66 tests across four files) covering the importer, guard, plotter, and runner end-to-end.

Confidence Score: 5/5

Safe to merge. The workspace restore guards, YAML error handling, schema-before-apply ordering, and job_help recovery introduced by this PR are all correctly placed on the current head, and the previous critical issues are no longer present in the code.

All findings are non-blocking suggestions. The campaign lifecycle code has thorough rollback coverage (BaseException guards, staged snapshots, file-version capture), the importer never executes user code, and the 66-test suite exercises the primary paths including partial-apply restore and schema failure isolation. The three flagged items (triple schema reads, unused parameter, no copytree cleanup) do not affect correctness in normal operation.

skills/nvflare-autofl/scripts/run_job_campaign.py — specifically the evaluate_candidate and prepare_candidate functions.

Important Files Changed

| Filename | Overview |
| --- | --- |
| skills/nvflare-autofl/scripts/run_job_campaign.py | Core campaign runner (2 848 lines). Candidate lifecycle logic is sound and previous issues around restore-guard placement, YAML errors, and job_help recovery have been addressed. Minor: load_mutation_schema is called three times per evaluate path and prepare_candidate leaves an orphaned directory on partial copytree failure. |
| nvflare/app_common/autofl/job_importer.py | Deterministic AST importer. Properly handles unresolved names, dynamic train_script, shared-list separation, and clean error reporting. No issues found on current head. |
| skills/nvflare-autofl/scripts/campaign_guard.py | Stateless continuation-decision function. parse_max_candidates now catches malformed env values, plateau/crash/stop-file logic is correct, and guard_state_for_rows is pure (reads from rows, not disk). |
| tests/unit_test/tool/autofl_skill_runner_test.py | 1 266-line test suite covering initialize, prepare, evaluate, abandon, record, and suggest actions; includes subprocess end-to-end, schema-failure isolation, partial-apply restore, and job_help failure recovery. |
| tests/unit_test/app_common/autofl/job_importer_test.py | 491-line importer test suite covering YAML round-trip, dynamic train_script, unresolved budget/metric constants, and clean error for missing job. |
| tests/unit_test/tool/autofl_skill_campaign_guard_test.py | 278-line guard tests; covers plateau, crash-blocker, stop-file, cap-exhausted, and malformed-env-cap cases. |
| tests/unit_test/tool/autofl_skill_plot_progress_test.py | 137-line plotter tests; covers both matplotlib and Pillow fallback paths. |
| nvflare/app_common/autofl/__init__.py | Package init; exports DeterministicJobImporter and JobImportError cleanly. |
| skills/nvflare-autofl/SKILL.md | Skill manifest describing the Auto-FL lifecycle, tool contract, and agent instructions. |
| skills/nvflare-autofl/evals/evals.json | Evaluation fixtures for skill admission checks; no issues. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[initialize] --> B{campaign.json exists?}
    B -- No --> C[import_job_config → autofl.yaml]
    C --> D[create_best_snapshot]
    D --> E{target_env == sim?}
    E -- Yes --> F[execute_sim_baseline]
    F --> G[refresh_campaign_artifacts]
    E -- No --> G
    B -- Yes --> H{scored baseline?}
    H -- No --> I[retry baseline]
    I --> G
    H -- Yes --> G
    G --> J[Agent: prepare]
    J --> K[load_best_snapshot / validate workspace]
    K --> L[copytree → candidate/source / write manifest]
    L --> M[Agent edits candidate/source]
    M --> N[evaluate]
    N --> O[validate_candidate_for_evaluation]
    O --> P[write patch + manifest]
    P --> Q[load_mutation_schema / campaign_timeout]
    Q --> R{try block}
    R --> S[apply_candidate_source / import_job_config / budget hash check]
    S -- error --> T[restore_best_source / re-raise]
    S -- success --> U{target_env != sim?}
    U -- Yes --> V[status=ready_for_external_execution]
    V --> W[record external result]
    W --> X[finalize_candidate_result]
    U -- No --> Y[job_help / run_job]
    Y -- INFRA_RETRY --> Z[restore_best_source / status=prepared]
    Y -- done --> X
    X --> AA{score better?}
    AA -- keep --> AB[stage/activate new best snapshot]
    AA -- discard --> AC[restore_best_source]
    AB --> AD[write_results / write_state / write_progress]
    AC --> AD
    AD --> AE{campaign_guard decision}
    AE -- continue --> J
    AE -- plateau_literature --> AF[record literature event]
    AF --> J
    AE -- stop --> AG[final_report allowed]
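
The final branch of the flowchart (continue, plateau_literature, stop) can be sketched as a pure function over ledger rows. This is an illustration only: the function name `guard_decision`, the row statuses, and the `plateau_window` threshold are hypothetical and are not taken from `campaign_guard.py`.

```python
from typing import Dict, List, Optional

CANDIDATE_STATUSES = {"keep", "discard", "crash"}  # hypothetical ledger statuses


def guard_decision(
    rows: List[Dict[str, str]],
    max_candidates: Optional[int] = None,
    plateau_window: int = 5,
    stop_requested: bool = False,
) -> str:
    """Decide the next campaign step from ledger rows only (no disk access)."""
    if stop_requested:
        return "stop"
    evaluated = sum(1 for row in rows if row["status"] in CANDIDATE_STATUSES)
    if max_candidates is not None and evaluated >= max_candidates:
        return "stop"
    # Count candidates since the last improvement or literature checkpoint.
    streak = 0
    for row in reversed(rows):
        if row["status"] in {"keep", "literature"}:
            break
        if row["status"] in {"discard", "crash"}:
            streak += 1
    if streak >= plateau_window:
        return "plateau_literature"
    return "continue"


rows = [
    {"status": "baseline"},
    {"status": "keep"},
    {"status": "discard"},
    {"status": "discard"},
]
print(guard_decision(rows, plateau_window=2))
print(guard_decision(rows + [{"status": "literature"}], plateau_window=2))
print(guard_decision(rows, max_candidates=3, plateau_window=2))
```

Because the function reads only its arguments, the same rows always yield the same decision, which is the property the review attributes to the guard.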

Reviews (23): Last reviewed commit: "Merge remote-tracking branch 'upstream/m..."

Comment thread nvflare/app_common/autofl/job_importer.py
Comment thread nvflare/app_common/autofl/job_importer.py Outdated
Comment thread nvflare/app_common/autofl/job_importer.py
Comment thread nvflare/tool/install_skills.py Outdated

Collaborator Author

H100 validation for this draft PR:

  • Validated head SHA: 9b86388b3387eb6677494e91a9fbcb223601cded
  • Node/runtime: r1u14, 4x NVIDIA H100 NVL, nvcr.io/nvidia/clara/bionemo-framework:2.5, Python 3.12.3
  • Feature E2E output root: /scratch/hroth/Code/nvflare/pr4780-autofl-output/autofl_skill_feature_e2e_20260608_170611

Passed:

  • Installed the PR checkout in the container.
  • Ran changed feature tests: tests/unit_test/app_common/autofl/job_importer_test.py tests/unit_test/tool/install_skills_test.py -q -> 8 passed in 0.04s.
  • Installed bundled skill with install_skills(target_dir=/host_out/skills) -> nvflare-autofl copied successfully, no errors.
  • Ran deterministic import for examples/hello-world/hello-pt/job.py -> generated hello_pt_autofl.yaml.
  • Verified trust contract fields: schema_version=nvflare.autofl.config.v1, support=recipe:FedAvgRecipe,env:SimEnv, allowed edit paths include job.py, client.py, model.py, requirements.txt, tunables are batch_size,epochs,num_workers, unresolved count is 1, and agent controls require allowed-path edits, fixed-budget preservation, and candidate diffs.

Simulator smoke caveat:

  • I also tried a tiny hello-pt SimEnv run on the PR branch and on a fresh upstream/main baseline checkout (42c44fa58900f439bdd21b9b59298a2d84e434c8) in the same H100 container with the same minimal install.
  • Both fail at server app startup with the same existing runtime error: AttributeError: 'NoneType' object has no attribute 'get' in nvflare/private/fed/server/fed_server.py:create_job_cell while reading server_config.
  • PR branch SimEnv log: /scratch/hroth/Code/nvflare/pr4780-autofl-output/autofl_skill_sim_minimal_20260608_170722/hello_pt_sim.log
  • Baseline SimEnv log: /scratch/hroth/Code/nvflare/pr4780-autofl-output/main_baseline_sim_minimal_20260608_170905/hello_pt_sim.log

Conclusion: the Auto-FL skill/importer feature path passes on H100. The tiny SimEnv execution smoke test is blocked by a simulator/runtime issue that also reproduces on upstream/main, not by this PR's Auto-FL changes.

@holgerroth

Collaborator Author

@greptileai review again

Comment thread nvflare/app_common/autofl/job_importer.py Outdated
@holgerroth

Collaborator Author

@greptileai check again

@codecov-commenter

codecov-commenter commented Jun 25, 2026


Codecov Report

❌ Patch coverage is 87.86693% with 62 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.15%. Comparing base (9fb7f0b) to head (1c5bbd7).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| nvflare/app_common/autofl/job_importer.py | 87.72% | 62 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4780      +/-   ##
==========================================
+ Coverage   56.95%   57.15%   +0.19%     
==========================================
  Files         969      971       +2     
  Lines       92261    92772     +511     
==========================================
+ Hits        52551    53024     +473     
- Misses      39710    39748      +38     
| Flag | Coverage Δ |
| --- | --- |
| unit-tests | 57.15% <87.86%> (+0.19%) ⬆️ |


@holgerroth

Collaborator Author

@greptileai review again

Comment thread research/auto-fl-research/scripts/campaign_guard.py Outdated
@holgerroth

Collaborator Author

Addressed the Greptile findings from #4780 (comment) in commit 46c4d5c:

  • Mark unsupported AST call expressions as unresolved in the deterministic Auto-FL importer so call-derived num_rounds, min_clients, and key_metric values cannot enter autofl.yaml as concrete trust-contract values.
  • Preserve common model=ModelClass() recipe constructor handling without lowering overall import confidence for otherwise static jobs.
  • Harden AUTOFL_MAX_CANDIDATES parsing in campaign_guard.py so malformed environment values leave the campaign uncapped instead of crashing, while explicit invalid CLI values still fail argument parsing.
  • Removed the duplicate infrastructure-retry write_report call in the Auto-FL skill runner.
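
The environment-variable hardening in the third bullet can be sketched as follows. The names `AUTOFL_MAX_CANDIDATES` and `parse_max_candidates` appear in this PR, but the body below is an assumed illustration, not the code in `campaign_guard.py`. In particular, treating non-positive values as "uncapped" is an assumption of this sketch.

```python
import os
from typing import Mapping, Optional


def parse_max_candidates(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the candidate cap from the environment, or None for an uncapped campaign."""
    raw = env.get("AUTOFL_MAX_CANDIDATES", "").strip()
    if not raw:
        return None
    try:
        value = int(raw)
    except ValueError:
        # Malformed values leave the campaign uncapped instead of crashing.
        return None
    # Assumption of this sketch: zero or negative caps are also treated as uncapped.
    return value if value > 0 else None


print(parse_max_candidates({"AUTOFL_MAX_CANDIDATES": "50"}))
print(parse_max_candidates({"AUTOFL_MAX_CANDIDATES": "fifty"}))
print(parse_max_candidates({}))
```

Passing the mapping explicitly keeps the function easy to test; a CLI flag would instead be validated by the argument parser, which is where the comment says invalid values still fail.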

Validation run locally:

  • /Library/Frameworks/Python.framework/Versions/3.12/bin/python3 -m pytest tests/unit_test/app_common/autofl/job_importer_test.py tests/unit_test/research/autofl_campaign_guard_test.py tests/unit_test/tool/autofl_skill_runner_test.py -q -> 19 passed
  • /Library/Frameworks/Python.framework/Versions/3.12/bin/python3 -m pytest tests/unit_test/app_common/autofl/job_importer_test.py tests/unit_test/tool/autofl_skill_runner_test.py tests/unit_test/research/autofl_campaign_guard_test.py tests/unit_test/tool/agent_skill_checks/seed_skills_test.py tests/unit_test/tool/agent_skill_checks/frontmatter_test.py tests/unit_test/tool/agent/seed_skill_packaging_test.py -q -> 47 passed
  • /Library/Frameworks/Python.framework/Versions/3.12/bin/python3 -m dev_tools.agent.skills.checks --skills-root skills --docs-root docs/design -> 0 errors, 0 warnings
  • git diff --check -> clean

@holgerroth

holgerroth commented Jun 25, 2026

Collaborator Author

@greptileai check the updated PR

@holgerroth

Collaborator Author

/build

Signed-off-by: Holger Roth <hroth@nvidia.com>
@holgerroth

Collaborator Author

@greptileai review again

@holgerroth

Collaborator Author

/build

Comment thread skills/nvflare-autofl/scripts/run_job_campaign.py Outdated
Signed-off-by: Holger Roth <hroth@nvidia.com>
@holgerroth

Collaborator Author

@greptileai review again

@holgerroth

Collaborator Author

/build

Comment thread skills/nvflare-autofl/scripts/run_job_campaign.py
Comment thread skills/nvflare-autofl/scripts/run_job_campaign.py Outdated
Signed-off-by: Holger Roth <hroth@nvidia.com>
@holgerroth

Collaborator Author

@greptileai review again

@holgerroth

Collaborator Author

/build

Comment thread skills/nvflare-autofl/scripts/run_job_campaign.py Outdated
Signed-off-by: Holger Roth <hroth@nvidia.com>
@holgerroth

Collaborator Author

Addressed the remaining Greptile workspace-integrity concern from the current summary in ebc0c05ee.

finalize_candidate_result now treats source and campaign-state finalization as a recoverable transaction: it stages a complete candidate snapshot before activation, retains the previous best snapshot until all manifest/metadata/ledger/state/plot/report writes complete, and rolls back the snapshot, workspace, autofl.yaml, manifest, ledger, state, progress plot, report, and patch if any step fails. Discard restoration failures are retried from the preserved best snapshot, and a rollback failure is surfaced explicitly instead of silently leaving an ambiguous campaign.
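
The stage-then-activate pattern described above can be sketched with the standard library. This is a simplified illustration, not the PR's `finalize_candidate_result`: the helper name `activate_snapshot` and the `.staged` / `.previous` directory names are hypothetical, and the real function also covers the ledger, state, plot, report, and patch files.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def activate_snapshot(candidate: Path, best: Path, write_artifacts: Callable[[], None]) -> None:
    """Replace `best` with a copy of `candidate`, rolling back if any later step fails."""
    staged = best.with_name(best.name + ".staged")
    previous = best.with_name(best.name + ".previous")
    try:
        shutil.copytree(candidate, staged)  # stage a complete copy before touching `best`
    except BaseException:
        shutil.rmtree(staged, ignore_errors=True)  # do not leave a partial copy behind
        raise
    best.rename(previous)  # retain the old best until every write has completed
    try:
        staged.rename(best)  # activate the staged snapshot
        write_artifacts()  # caller-supplied metadata/ledger/report writes
    except BaseException:
        # Roll back: drop whatever was activated or staged, then restore the old best.
        if best.exists():
            shutil.rmtree(best)
        if staged.exists():
            shutil.rmtree(staged)
        previous.rename(best)
        raise
    shutil.rmtree(previous)  # commit: the old snapshot is no longer needed


def failing_write() -> None:
    raise RuntimeError("ledger write failed")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    best, candidate = root / "best", root / "candidate"
    best.mkdir()
    candidate.mkdir()
    (best / "job.py").write_text("old")
    (candidate / "job.py").write_text("new")

    try:
        activate_snapshot(candidate, best, failing_write)
    except RuntimeError:
        pass
    after_failure = (best / "job.py").read_text()

    activate_snapshot(candidate, best, lambda: None)
    after_success = (best / "job.py").read_text()

print(after_failure, after_success)
```

The old snapshot is deleted only after the last write succeeds, so a failure at any earlier point leaves the previous best in place.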

Regression coverage includes snapshot-stage failure, a failure injected after all campaign artifacts were rewritten, discard restoration failure, and job_help() failure. Validation: 73 focused tests passed; black, isort, flake8, skill admission (0 errors/0 warnings), and git diff --check pass.

@greptileai review again

@holgerroth

Collaborator Author

/build

@holgerroth

Collaborator Author

Addressed the two remaining actionable findings from the current Greptile summary in fa82a4113:

  • campaign_timeout now validates both timeout fields as non-negative integers and converts list, mapping, boolean, fractional, negative, and otherwise malformed values into the runner's clean ValueError / exit-code-2 contract instead of leaking TypeError.
  • Text-artifact metric extraction now supports signed decimal and scientific-notation values while preserving the configured metric precedence across standard NVFlare JSON and text artifacts.
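
The timeout validation in the first bullet can be sketched like this. The helper name `parse_timeout_seconds` is hypothetical and the body is an assumed illustration, not the PR's `campaign_timeout`; in this sketch, numeric strings are also rejected, which the PR may or may not do.

```python
def parse_timeout_seconds(name: str, value: object) -> int:
    """Validate a timeout field, raising ValueError for anything but a non-negative int."""
    # bool is a subclass of int in Python, so it must be rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    if value < 0:
        raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    return value


rejected = []
for bad in (True, 1.5, -1, [60], {"seconds": 60}, None):
    try:
        parse_timeout_seconds("run_timeout_seconds", bad)
    except ValueError:
        rejected.append(bad)

print(parse_timeout_seconds("run_timeout_seconds", 600))
print(len(rejected))
```

Checking the type before comparing avoids the `TypeError` that `value < 0` would otherwise raise for lists and mappings.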

Regression coverage includes malformed run_timeout_seconds and simulator_no_progress_timeout_seconds values plus scientific-notation log metrics. Validation on the final head: 229 focused Auto-FL/importer/guard/plotter/admission/packaging tests passed; Black, isort, flake8, git diff --check, docs build, and a real reduced hello-pt SimEnv baseline all passed.
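
Metric extraction from text artifacts with signed decimal and scientific-notation support can be sketched with a regular expression. The log format (`name: value` or `name=value`), the helper name `extract_metric`, and the choice to return the last occurrence are assumptions of this sketch, not the runner's actual parsing rules.

```python
import re
from typing import Optional

# Signed decimal with optional exponent, e.g. -1.5, .25, 8.218e-01, +3E+2
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"


def extract_metric(text: str, metric: str) -> Optional[float]:
    """Return the last value logged for `metric`, or None if it never appears."""
    pattern = re.compile(rf"\b{re.escape(metric)}\b\s*[:=]\s*({NUMBER})")
    matches = pattern.findall(text)
    return float(matches[-1]) if matches else None


# Hypothetical log text for illustration.
log_text = """
round 19 test_accuracy=8.1e-01 loss: 0.52
round 20 test_accuracy=8.218e-01 loss: -1.5E+2
"""

print(extract_metric(log_text, "test_accuracy"))
print(extract_metric(log_text, "loss"))
print(extract_metric(log_text, "accuracy"))
```

The word boundary keeps a short metric name such as `accuracy` from matching inside `test_accuracy`.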

@greptileai review again

@holgerroth

Collaborator Author

Addressed the server-aggregation exploration gap in 9828fe4. Campaign initialization now deterministically merges existing workspace-local mutation_schema.yaml preferred_targets such as custom_aggregators.py into both generated autofl.yaml allowlists; missing, symlinked, reserved, or escaping paths stay unresolved. The skill and campaign state now explicitly allow new server aggregator modules and require at least one source-backed aggregation candidate after each literature-triggered plateau, unless incompatibility is recorded. Validation: 231 focused tests passed, skill admission reported 0 errors/0 warnings, Black/isort/flake8 and git diff --check passed, and the docs build succeeded. No research/auto-fl-research or H100-specific files were changed.
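
The rule that missing, symlinked, reserved, or escaping paths stay unresolved can be sketched as a path check. The helper name `resolve_preferred_target` and the `RESERVED_NAMES` set are hypothetical; the PR's actual reserved list and merge logic are not shown in this conversation.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical reserved names; the PR's actual reserved set is not shown here.
RESERVED_NAMES = {"autofl.yaml", "campaign.json"}


def resolve_preferred_target(workspace: Path, target: str) -> Optional[str]:
    """Return a workspace-relative POSIX path if `target` is safe to allowlist, else None."""
    if Path(target).is_absolute():
        return None
    candidate = workspace / target
    if candidate.is_symlink() or not candidate.is_file():
        return None  # symlinked or missing targets stay unresolved
    root = workspace.resolve()
    resolved = candidate.resolve()
    if root not in resolved.parents:
        return None  # the path escapes the workspace
    relative = resolved.relative_to(root).as_posix()
    if relative in RESERVED_NAMES:
        return None
    return relative


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp) / "workspace"
    workspace.mkdir()
    (workspace / "custom_aggregators.py").write_text("# aggregator\n")
    (workspace / "autofl.yaml").write_text("schema_version: x\n")
    (Path(tmp) / "outside.py").write_text("# outside\n")

    results = {
        name: resolve_preferred_target(workspace, name)
        for name in ["custom_aggregators.py", "missing.py", "../outside.py", "autofl.yaml"]
    }

print(results)
```

Only `custom_aggregators.py` is accepted in this example; the other three targets map to `None` and would remain unresolved.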



2 participants