Draft
26 commits
9b86388
Add Auto-FL agent skill importer
holgerroth Jun 8, 2026
5e87fa2
Clarify Auto-FL campaign config role
holgerroth Jun 9, 2026
a29f84f
Move Auto-FL skill to root skills layout
holgerroth Jun 12, 2026
2146a8e
Merge branch 'main' into codex/autofl-skill-v1
holgerroth Jun 12, 2026
989bf45
Address Auto-FL importer review feedback
holgerroth Jun 12, 2026
1d64aa4
Mark unresolved Auto-FL importer names
holgerroth Jun 12, 2026
ead5ff2
Keep Auto-FL skill campaigns running
holgerroth Jun 25, 2026
361e639
Merge upstream main into Auto-FL skill PR
holgerroth Jun 25, 2026
cb2bf31
Fix Auto-FL skill release packaging references
holgerroth Jun 25, 2026
46c4d5c
Address Auto-FL Greptile review findings
holgerroth Jun 25, 2026
2e287b4
Add Auto-FL skill campaign guard
holgerroth Jun 25, 2026
780fe21
Report CIFAR Auto-FL score as test accuracy
holgerroth Jun 26, 2026
84be7f9
Standardize Auto-FL optimization metric contract
holgerroth Jun 26, 2026
4c1706f
Lower Auto-FL literature watchdog default
holgerroth Jun 26, 2026
1480e1b
Fix uncapped Auto-FL campaign stopping
holgerroth Jun 29, 2026
abf6384
Merge remote-tracking branch 'upstream/main' into codex/autofl-skill-v1
holgerroth Jun 29, 2026
1782ba5
Add Auto-FL campaign progress visual
holgerroth Jun 29, 2026
1a93d94
Remove local Auto-FL validation scaffolding
holgerroth Jun 29, 2026
3012a2f
Make Auto-FL code candidates first-class
holgerroth Jun 29, 2026
28f1a9f
Restore Auto-FL workspace on schema errors
holgerroth Jun 29, 2026
0020b39
Harden Auto-FL candidate error recovery
holgerroth Jun 29, 2026
ebc0c05
Make Auto-FL candidate finalization transactional
holgerroth Jun 29, 2026
6a3cd27
Improve Auto-FL campaign progress plots
holgerroth Jun 30, 2026
08105a3
Add Auto-FL stopped campaign report skill
holgerroth Jun 30, 2026
43655ed
Merge branch 'main' into codex/autofl-report-skill
holgerroth Jul 1, 2026
23f6a6e
Harden Auto-FL final report semantics
holgerroth Jul 1, 2026
204 changes: 204 additions & 0 deletions docs/design/autofl_skill.md
@@ -0,0 +1,204 @@
# NVFlare Auto-FL Skill Design

## Summary

Auto-FL should enter NVFlare as a skill-first product experience. Users select
an official NVFlare Auto-FL skill in a coding agent, point it at an existing
`job.py`, and state the optimization objective, environment, and budget. NVFlare
owns deterministic import of campaign-relevant settings, execution truth, policy
boundaries, artifacts, and reproducibility. The agent owns candidate planning,
code edits within allowed paths, experiment execution through the existing
`job.py`, comparison, and narrative reporting.

This avoids introducing a new public Auto-FL command tree while still making
Auto-FL an NVFlare-owned feature.

## Product Boundary

The first production-oriented slice includes:

- A root `skills/nvflare-autofl` agent skill that follows the NVFlare skills
layout used by the general agent-skills work.
- A deterministic `job.py` importer that emits reviewable `autofl.yaml` for the
Auto-FL campaign.
- A trust contract in `autofl.yaml` showing editable campaign settings,
unresolved fields, fixed-budget constraints, and allowed edit paths.
- A skill-local candidate lifecycle that snapshots the current best source,
gives the agent an isolated draft, validates the resulting patch, and keeps or
restores source according to the campaign metric.
- A companion `skills/nvflare-autofl-report` skill that deterministically turns
a stopped campaign ledger, state, config, and manifests into human- and
machine-readable final report artifacts.
- Documentation for using the skill with simulation, POC, and production
environments through existing NVFlare surfaces.

The first version does not embed or vendor a coding agent, and it does not add a
public Auto-FL command family.

## Role of autofl.yaml

`autofl.yaml` is not a replacement for `job.py` and is not a second exported
job format. The original `job.py` remains the experiment entry point the agent
uses to run candidates, and exported job folders remain the NVFlare execution
and submission artifacts.

The purpose of `autofl.yaml` is to expose the human-reviewable Auto-FL campaign
layer:

- Objective metric, requested environment, and candidate budget.
- Editable search-space settings discovered from `job.py` and related train
scripts.
- Fixed-budget constraints that must remain comparable across candidates.
- Allowed edit paths and files that are out of scope for the agent.
- Allowed creation patterns for new Python modules under the job root.
- Artifact, ledger, and report locations for the campaign.
- Provenance and unresolved fields that need user review before safe execution.

By default, users should not need to edit `autofl.yaml`. They review or modify
it only when the importer surfaces unresolved settings or when they want to
override campaign knobs explicitly.

## Deterministic Import

The importer parses Python source with `ast`; it does not import or execute
user code. It supports known Recipe and FedJob-style patterns first and focuses
on campaign-relevant settings rather than duplicating the full exported job:

- Recipe/FedJob constructor and class import.
- `SimEnv`, `PocEnv`, and `ProdEnv` references.
- `train_script` resolution for literal and argparse-derived values.
- Objective metric from user request, `key_metric`, or explicit unresolved
default.
- Fixed-budget fields such as rounds, clients, and candidate budget.
- Common argparse tunables from `job.py` and the resolved train script.
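
As an illustration of the static approach, the following is a minimal sketch of
extracting argparse defaults with `ast` without importing or executing the
source. It is not the importer's actual code: the function name
`extract_argparse_defaults`, the `resolved`/`unresolved` status strings, and the
sample source are assumptions made for this example.

```python
import ast

# Sample source to inspect. It is parsed, never executed, so the undefined
# compute_root() call is harmless.
SOURCE = '''
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--num_rounds", type=int, default=5)
parser.add_argument("--data_root", default=compute_root())
'''


def extract_argparse_defaults(source: str) -> dict:
    """Map each add_argument flag to its literal default, or mark it unresolved."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        # Match calls of the form <anything>.add_argument(...).
        if not (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "add_argument"
        ):
            continue
        # Only handle a literal string flag as the first positional argument.
        if not node.args or not isinstance(node.args[0], ast.Constant):
            continue
        flag = node.args[0].value
        if not isinstance(flag, str):
            continue
        for kw in node.keywords:
            if kw.arg != "default":
                continue
            try:
                # literal_eval accepts only literal expressions.
                value = ast.literal_eval(kw.value)
                found[flag] = {"value": value, "status": "resolved"}
            except (ValueError, TypeError, SyntaxError):
                # Dynamic default: carry it forward instead of guessing.
                found[flag] = {"value": None, "status": "unresolved"}
    return found


if __name__ == "__main__":
    for flag, info in extract_argparse_defaults(SOURCE).items():
        print(flag, info)
```

Here the dynamic `--data_root` default is reported as unresolved rather than
evaluated, which mirrors the rule that unsupported or dynamic fields become
review items.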

The exported job folder remains useful as execution truth once the job is
materialized, because it contains resolved NVFlare app and component configs.
However, it does not reliably preserve all authoring intent needed for an
Auto-FL campaign, such as editable source files, train-argument construction,
tunable-versus-fixed intent, and source provenance. Therefore the importer uses
deterministic Python/static parsing for the campaign layer and may use exported
config inspection as a validation aid when available.

Unsupported or dynamic fields are carried forward as unresolved review items
instead of being guessed by the importer or the agent.

## Trust Contract

Every import result includes:

- `import`: importer version, source path, source hash, support status, and
confidence.
- `job`: surface, entrypoint, allowed edit paths, train script, and call
arguments with provenance.
- `objective`, `budget`, `environment`, and `search_space`.
- `trust_contract`: extracted facts, unresolved fields, allowed edit paths, and
agent controls.

The skill must present editable, unresolved, and allowed sections before it runs
candidates. This is the core product guardrail: NVFlare makes the campaign
reviewable and reproducible; the agent makes it interactive and exploratory.
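
A reviewer or tool could sanity-check an import result against the section list
above before any candidate runs. The sketch below assumes the loaded
`autofl.yaml` is available as a plain mapping; the helper name
`missing_sections` is hypothetical, and only the top-level keys named in this
section are checked.

```python
# Top-level sections listed in the trust contract description above.
REQUIRED_SECTIONS = (
    "import",
    "job",
    "objective",
    "budget",
    "environment",
    "search_space",
    "trust_contract",
)


def missing_sections(import_result: dict) -> list:
    """Return the required top-level sections absent from an import result."""
    return [name for name in REQUIRED_SECTIONS if name not in import_result]


if __name__ == "__main__":
    partial = {"import": {}, "job": {}, "objective": {}}
    print(missing_sections(partial))
```

An empty list means every section is present; it says nothing about whether the
contents of each section are complete or resolved.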

## Candidate Contract

The agent, rather than the deterministic runner, owns search policy. It may
change tunables, edit the imported job's allowed source files, or implement new
algorithms as Python modules. Each attempt starts from the retained best source
in `.nvflare/autofl/candidates/<id>/source` and has a generated
`candidate_manifest.json` containing its hypothesis, base candidate, run
arguments, changed files, source and budget hashes, patch hash, artifacts, and
result.

NVFlare computes the manifest's evidence fields; the agent does not assert them.
Before execution, the helper rejects stale candidates, path traversal, symlink
escapes, unauthorized existing-file edits, and detectable fixed-budget drift.
It applies the candidate transactionally to the real job workspace, retains a
new best, and restores the previous best after a discard or crash. This works
without requiring a Git repository and leaves the best source ready for the
standard NVFlare job lifecycle.
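
The path checks can be pictured with a short sketch. This is one plausible way
to reject path traversal and symlink escapes using `pathlib`, not the helper's
actual implementation; `is_inside` and `rejected_paths` are hypothetical names,
and the check covers containment under the job root only (not the per-file
allow list or fixed-budget drift).

```python
from pathlib import Path


def is_inside(root: Path, candidate: Path) -> bool:
    """True if candidate, with symlinks and '..' resolved, stays under root."""
    root_resolved = root.resolve()
    # Joining with an absolute candidate yields the candidate itself, so
    # absolute paths outside the root are rejected by the same test.
    target = (root / candidate).resolve()
    return target == root_resolved or root_resolved in target.parents


def rejected_paths(job_root: Path, changed_files: list) -> list:
    """Return the changed-file entries that escape the job root."""
    return [p for p in changed_files if not is_inside(job_root, Path(p))]


if __name__ == "__main__":
    root = Path(".")
    print(rejected_paths(root, ["train.py", "../outside.py"]))
```

Resolving both sides before comparing is what catches a symlink inside the job
root that points elsewhere; a purely textual prefix check would miss it.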

The built-in parameter candidates are suggestion seeds only. They are returned
as machine-readable hypotheses and arguments when requested, but are not the
default search loop and are never executed without agent selection.

## Execution Model

The skill uses existing NVFlare execution surfaces:

- Simulation: initialize a baseline, prepare an agent-authored candidate draft,
and evaluate it through the existing `job.py` and configured `SimEnv`.
- POC: use the existing job authoring/export flow, startup kits, and standard
`nvflare job` commands, then record the job ID, artifacts, and metric against
the candidate manifest.
- Production: use standard startup-kit authentication, site policy, job submit,
wait, download, and inspection commands with the same manifest and result
recording contract.

Production is a valid optimization environment. The best candidate may later be
submitted or reused through the standard NVFlare job lifecycle; no separate
promotion command is needed.

## Stopped-Campaign Reporting

Reporting is a separate skill boundary because its trigger and safety posture
differ from active optimization. `nvflare-autofl` must continue an active,
uncapped campaign while state has `final_response_allowed=false`.
`nvflare-autofl-report` operates only after a clean stop, explicit cap, hard
blocker, or independently confirmed interruption.

The report helper consumes `results.tsv`, `autofl.yaml`, campaign state, and
candidate manifests. It attempts to refresh the shared `progress.png` and
writes:

- `autofl_final_report.md`, a concise review artifact with executive summary,
trajectory, best-candidate lineage, exact commands, reliability, and
reproduction guidance;
- `autofl_report_summary.json`, a machine-readable
`nvflare.autofl.report.v1` summary for tools and future automation.

The helper does not edit source, ledger, manifests, or campaign state and does
not require Git. If an abrupt interruption leaves state active, the human must
confirm interruption after execution is independently checked; the report
records that assertion without rewriting history. This confirmation bypasses
only stale stop state. Pending state, `candidate` ledger rows, or manifests in
`prepared`/`ready_for_external_execution` status block finalization until the
active skill finalizes or abandons them.
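
The blocking rules in this paragraph can be summarized as a small decision
function. The sketch below is illustrative only: the function name, its boolean
inputs, and the blocker messages are assumptions, while the `candidate` ledger
status and the `prepared`/`ready_for_external_execution` manifest statuses come
from the text above.

```python
# Manifest statuses that block finalization, as named in the design text.
BLOCKING_MANIFEST_STATUSES = {"prepared", "ready_for_external_execution"}


def finalization_blockers(
    state_active: bool,
    has_pending_state: bool,
    ledger_statuses: list,
    manifest_statuses: list,
    confirm_interrupted: bool = False,
) -> list:
    """Return human-readable reasons why a final report must not be written."""
    blockers = []
    # Confirmation bypasses only a stale "still active" stop state.
    if state_active and not confirm_interrupted:
        blockers.append("campaign state is still active")
    # Nothing below is bypassed by the interruption confirmation.
    if has_pending_state:
        blockers.append("pending candidate state")
    if "candidate" in ledger_statuses:
        blockers.append("unfinalized candidate ledger row")
    if BLOCKING_MANIFEST_STATUSES & set(manifest_statuses):
        blockers.append("manifest awaiting execution or finalization")
    return blockers


if __name__ == "__main__":
    print(finalization_blockers(True, False, ["baseline", "keep"], ["kept"], True))
```

With these inputs the confirmation clears the only blocker; a `prepared`
manifest in the same call would still block the report.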

Plotting is optional report evidence. A missing plotting dependency or invalid
PNG does not suppress the Markdown and JSON artifacts: the helper preserves the
failed artifact, emits a warning, omits the Markdown image, and records
`artifacts.progress_plot_available=false`.
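
A minimal sketch of the degraded-plot behavior, assuming validity is judged by
the 8-byte PNG file signature alone. The real helper may validate more
thoroughly; `plot_status` is a hypothetical name, and only the
`artifacts.progress_plot_available` key is taken from the text above.

```python
from pathlib import Path

# Every PNG file begins with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def plot_status(path: Path) -> dict:
    """Report whether a usable progress plot exists, without raising."""
    try:
        with path.open("rb") as handle:
            available = handle.read(8) == PNG_SIGNATURE
    except OSError:
        # A missing or unreadable file degrades the report; it does not fail it.
        available = False
    return {"artifacts": {"progress_plot_available": available}}


if __name__ == "__main__":
    print(plot_status(Path("progress.png")))
```

Because the function never raises, the Markdown and JSON writers can proceed
regardless of the plot outcome and simply consult the recorded flag.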

Literature reporting follows measured evidence rather than agent narrative.
Each recorded literature checkpoint owns the comparable candidates until the
next checkpoint. Their best result is compared with the incumbent immediately
before the review and classified as helped, matched, not confirmed, failed, or
not evaluated. Recorded `[src: ...]` markers are preserved as campaign
provenance, not presented as independently verified citations.

The report distinguishes retained and observed evidence. `best` is limited to
scored baseline and `keep` rows, while `best_observed` may expose an unretained
scored `discard`. Pending candidates and crashes remain attempt/failure
evidence and cannot become milestones or literature improvements. The
objective also separates measurement provenance (`metric_source`) from the
importer's metric-contract provenance (`metric_contract_source`).
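
The retained-versus-observed split can be sketched as follows. The row layout
(`id`, `status`, `score` keys) and the function name are assumptions for
illustration, not the `results.tsv` schema; the status values `baseline`,
`keep`, `discard`, `crash`, and `candidate` follow the terms used in this
document.

```python
RETAINED = {"baseline", "keep"}
OBSERVED = RETAINED | {"discard"}


def summarize_best(rows: list, mode: str = "max") -> dict:
    """Pick `best` from retained rows and `best_observed` from all scored rows."""
    pick = max if mode == "max" else min
    # Crashes and pending candidates carry no usable score and are excluded.
    scored = [r for r in rows if r.get("score") is not None]
    retained = [r for r in scored if r["status"] in RETAINED]
    observed = [r for r in scored if r["status"] in OBSERVED]

    def by_score(row):
        return row["score"]

    return {
        "best": pick(retained, key=by_score) if retained else None,
        "best_observed": pick(observed, key=by_score) if observed else None,
    }


if __name__ == "__main__":
    ledger = [
        {"id": "c0", "status": "baseline", "score": 0.70},
        {"id": "c1", "status": "keep", "score": 0.74},
        {"id": "c2", "status": "discard", "score": 0.78},
        {"id": "c3", "status": "crash", "score": None},
        {"id": "c4", "status": "candidate", "score": None},
    ]
    summary = summarize_best(ledger)
    print(summary["best"]["id"], summary["best_observed"]["id"])
```

In this example the discarded `c2` scored highest but was not retained, so it
surfaces only as `best_observed` while `best` stays on the kept `c1`.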

Finally, the report compares the declarative/imported budget with exact
baseline and best-candidate commands. It highlights changed compute or data
arguments, incomplete lineage, and repeated selection on test-like metrics.
This makes the report a trust artifact rather than a polished restatement of
the agent's conclusions.

## Review Questions

- Are the supported `job.py` patterns sufficient for an initial prototype?
- Are the edit and creation permissions in `autofl.yaml` appropriate for
algorithm-level candidates while preserving candidate comparability?
- Which exported-job fields should be used as validation evidence versus static
`job.py` parsing for authoring intent?
- Does the Auto-FL skill pass the general NVFlare skill frontmatter, trigger,
and eval checks after it lands under `skills/nvflare-autofl`?
- Which candidate-manifest and metric/artifact fields should become stable
NVFlare APIs after the skill-local contract proves itself?
- Is `nvflare.autofl.report.v1` sufficient for downstream review and automation
while remaining explicitly skill-local in this follow-up?
180 changes: 180 additions & 0 deletions docs/user_guide/nvflare_cli/autofl_skill.rst
@@ -0,0 +1,180 @@
.. _autofl_skill:

#######################
NVFlare Auto-FL Skill
#######################

The NVFlare Auto-FL skill is an agent-assisted workflow for optimizing an
existing NVFlare ``job.py``. The user entry point is the coding agent skill:
select the NVFlare Auto-FL skill, point it at a job, and state the objective,
environment, and candidate budget.

The skill source lives in ``skills/nvflare-autofl`` with the other NVFlare-owned
agent skills. When the general agent skill CLI is available, install it through
the standard ``nvflare agent skills`` workflow for the target coding agent.

NVFlare does not add a separate public Auto-FL command family for this workflow.
Instead, NVFlare provides the deterministic import, reviewable
``autofl.yaml`` contract, execution substrate, policy boundaries, artifacts,
and reproducibility evidence. The agent chooses hypotheses, edits source,
implements algorithms, and runs candidates through existing NVFlare surfaces.

``autofl.yaml`` is the human-reviewable campaign configuration, not a replacement
for ``job.py`` or for exported NVFlare job folders. It exposes the editable
Auto-FL settings, fixed-budget constraints, allowed edit paths, objective,
candidate budget, provenance, and unresolved fields. The original ``job.py``
remains the experiment entry point the skill and agent use to run candidates.

Typical Prompt
==============

.. code-block:: text

    Use the NVFlare Auto-FL skill.
    Optimize ./job.py for validation accuracy in simulation with an
    8-candidate budget.

First Step: Deterministic Import
================================

The skill first imports the job without executing user code:

.. code-block:: shell

    python -m nvflare.app_common.autofl.job_importer ./job.py \
        --metric accuracy \
        --env sim \
        --max-candidates 8 \
        --output autofl.yaml

The importer parses supported Recipe and FedJob patterns with Python AST
inspection. It extracts campaign-relevant settings into ``autofl.yaml`` and
marks unknown or dynamic fields as unresolved instead of guessing.

Trust Contract
==============

Before editing or running candidates, the skill should show the user three
things from ``autofl.yaml``:

- **Editable**: metric, environment, candidate budget, tunables, artifact
  locations, source hash, and importer version.
- **Unresolved**: dynamic defaults, unsupported Python semantics, missing metric
  sources, unknown data paths, or low-confidence fields.
- **Allowed**: files the agent may edit, fixed-budget fields it must preserve,
  Python modules it may add under the job root, and environment or policy
  boundaries.

This keeps the workflow native to NVFlare and reproducible: NVFlare owns the
truth of the campaign settings and execution surfaces; the agent owns
exploration within explicit constraints.

Execution
=========

The bundled helper is an internal skill surface, not a public NVFlare command
family. It first initializes the campaign and baseline:

.. code-block:: shell

    python "$CODEX_HOME/skills/nvflare-autofl/scripts/run_job_campaign.py" \
        initialize ./job.py --metric accuracy --mode max --env sim

For each attempt, the agent supplies a hypothesis and receives an isolated
candidate source directory plus ``candidate_manifest.json``:

.. code-block:: shell

    python "$CODEX_HOME/skills/nvflare-autofl/scripts/run_job_campaign.py" \
        prepare ./job.py --name fedprox-variant \
        --hypothesis "stabilize heterogeneous client updates"

The agent edits that candidate source, including new Python algorithm modules
when useful, and asks the helper to evaluate it:

.. code-block:: shell

    python "$CODEX_HOME/skills/nvflare-autofl/scripts/run_job_campaign.py" \
        evaluate ./job.py --manifest <candidate_manifest.json>

NVFlare computes the source diff and hash, checks allowed paths and detectable
fixed-budget drift, executes the candidate, updates ``results.tsv`` and
``progress.png``, and either retains the new best source or restores the prior
best. Built-in tunable candidates are available through the helper's
``suggest`` action only as optional seeds; the agent remains free to implement
new algorithms.

The workflow then uses existing NVFlare execution surfaces:

- Simulation jobs run through the job's configured ``SimEnv``.
- POC and production jobs use the standard startup-kit and ``nvflare job``
  submission, wait, download, and inspection commands. The skill records the
  resulting job ID, artifacts, and metric against the candidate manifest.
- Production execution is allowed when the user requests it, but the skill must
  not bypass normal startup-kit authentication, site policy, or job submission.

Supported First Version
=======================

The first version is intentionally narrow:

- Supported job surfaces: NVFlare Recipe constructors and FedJob-style scripts.
- Supported import fields: objective metric, fixed budget fields, environment,
  train script, allowed edit paths, and common argparse tunables.
- Unsupported or ambiguous custom Python is preserved as unresolved review
  fields.

The default user experience should not require editing ``autofl.yaml``. Users
review it only when the importer reports unresolved fields or when they want to
override the campaign configuration.

Final Report After Stop
=======================

After a campaign is manually stopped, reaches its explicit cap, or ends at a
hard policy/runtime boundary, select the companion NVFlare Auto-FL Report skill.
It turns the recorded campaign evidence into a reviewable final report without
requiring Git or rerunning candidates:

.. code-block:: text

    Use the NVFlare Auto-FL Report skill.
    Generate the final report for the stopped campaign in ./job.

The skill verifies ``.nvflare/autofl/campaign_state.json``, ``results.tsv``, and
available candidate manifests before finalizing. A pending candidate must be
finalized or abandoned through the active Auto-FL skill first. The report
helper attempts to refresh ``progress.png`` and produces:

- ``autofl_final_report.md`` for human review;
- ``autofl_report_summary.json`` for tools and downstream agents;
- a synthesis of every literature checkpoint and the candidates evaluated
  after it;
- best-candidate lineage, inherited code changes, manifests, patch hashes,
  exact commands, artifacts, failures, and reproducibility warnings.

Plotting is optional evidence. If plotting dependencies are unavailable or
the existing artifact is not a valid PNG, Markdown and JSON are still written,
the plot is omitted from Markdown, and
``artifacts.progress_plot_available=false`` records the degraded state.

The deterministic helper can also be invoked directly by an agent:

.. code-block:: shell

    python "$CODEX_HOME/skills/nvflare-autofl-report/scripts/generate_report.py" \
        <job-dir>

If a process was abruptly interrupted and campaign state still appears active,
the agent must first independently confirm that execution has stopped. It may
then add ``--confirm-interrupted``. This records the reporting assertion but
does not mutate campaign state. It bypasses only stale stop state; it never
bypasses pending state, ``candidate`` ledger rows, or prepared candidate
manifests.

The report distinguishes the imported budget in ``autofl.yaml`` from the exact
arguments that ran. It warns when the selected candidate changed training
compute or when multiple candidates were selected against a test-like metric.
This keeps the final result useful without overstating the evidence.
The JSON ``best`` field always means a retained baseline or ``keep`` result;
an unretained scored ``discard`` is exposed separately as ``best_observed``.