diff --git a/docs/design/autofl_skill.md b/docs/design/autofl_skill.md new file mode 100644 index 0000000000..2312ebf152 --- /dev/null +++ b/docs/design/autofl_skill.md @@ -0,0 +1,204 @@ +# NVFlare Auto-FL Skill Design + +## Summary + +Auto-FL should enter NVFlare as a skill-first product experience. Users select +an official NVFlare Auto-FL skill in a coding agent, point it at an existing +`job.py`, and state the optimization objective, environment, and budget. NVFlare +owns deterministic import of campaign-relevant settings, execution truth, policy +boundaries, artifacts, and reproducibility. The agent owns candidate planning, +code edits within allowed paths, experiment execution through the existing +`job.py`, comparison, and narrative reporting. + +This avoids introducing a new public Auto-FL command tree while still making +Auto-FL an NVFlare-owned feature. + +## Product Boundary + +The first production-oriented slice includes: + +- A root `skills/nvflare-autofl` agent skill that follows the NVFLARE skills + layout used by the general agent-skills work. +- A deterministic `job.py` importer that emits reviewable `autofl.yaml` for the + Auto-FL campaign. +- A trust contract in `autofl.yaml` showing editable campaign settings, + unresolved fields, fixed-budget constraints, and allowed edit paths. +- A skill-local candidate lifecycle that snapshots the current best source, + gives the agent an isolated draft, validates the resulting patch, and keeps or + restores source according to the campaign metric. +- A companion `skills/nvflare-autofl-report` skill that deterministically turns + a stopped campaign ledger, state, config, and manifests into human- and + machine-readable final report artifacts. +- Documentation for using the skill with simulation, POC, and production + environments through existing NVFlare surfaces. + +The first version does not embed or vendor a coding agent, and it does not add a +public Auto-FL command family. 
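
Reviewer sketch (not part of the patch): the deterministic `job.py` importer listed above reads source statically with `ast` rather than importing it. The snippet below is a minimal illustration of that idea, assuming a Recipe-style constructor call with keyword arguments. The function name `literal_keywords`, the `my_recipes` import in the sample source, and the `"<unresolved>"` marker are hypothetical and do not reproduce the importer's actual resolution logic.

```python
import ast

# Hypothetical job source. The `raise` would fire if this text were imported or
# executed; static parsing never runs it.
JOB_SOURCE = '''
raise SystemExit("executed!")
from my_recipes import FedAvgRecipe
recipe = FedAvgRecipe(name="demo", num_rounds=5, min_clients=args.n_clients, train_script="client.py")
'''


def literal_keywords(source: str, class_suffix: str = "Recipe") -> dict:
    """Collect keyword arguments of the first call whose name ends with class_suffix."""
    tree = ast.parse(source)  # parse only; nothing in `source` is executed
    for node in ast.walk(tree):
        is_named_call = isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        if is_named_call and node.func.id.endswith(class_suffix):
            found = {}
            for kw in node.keywords:
                if kw.arg is None:  # **kwargs cannot be resolved statically
                    continue
                try:
                    found[kw.arg] = ast.literal_eval(kw.value)
                except (ValueError, TypeError):
                    # Dynamic expression: record it for review instead of guessing.
                    found[kw.arg] = "<unresolved>"
            return found
    return {}


result = literal_keywords(JOB_SOURCE)
```

Here the literal arguments are recovered directly, while `min_clients=args.n_clients` is carried forward as an unresolved item, mirroring the review-instead-of-guess posture the design describes.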
+ +## Role of autofl.yaml + +`autofl.yaml` is not a replacement for `job.py` and is not a second exported +job format. The original `job.py` remains the experiment entry point the agent +uses to run candidates, and exported job folders remain the NVFlare execution +and submission artifacts. + +The purpose of `autofl.yaml` is to expose the human-reviewable Auto-FL campaign +layer: + +- Objective metric, requested environment, and candidate budget. +- Editable search-space settings discovered from `job.py` and related train + scripts. +- Fixed-budget constraints that must remain comparable across candidates. +- Allowed edit paths and files that are out of scope for the agent. +- Allowed creation patterns for new Python modules under the job root. +- Artifact, ledger, and report locations for the campaign. +- Provenance and unresolved fields that need user review before safe execution. + +By default, users should not need to edit `autofl.yaml`. They review or modify +it only when the importer surfaces unresolved settings or when they want to +override campaign knobs explicitly. + +## Deterministic Import + +The importer parses Python source with `ast`; it does not import or execute +user code. It supports known Recipe and FedJob-style patterns first and focuses +on campaign-relevant settings rather than duplicating the full exported job: + +- Recipe/FedJob constructor and class import. +- `SimEnv`, `PocEnv`, and `ProdEnv` references. +- `train_script` resolution for literal and argparse-derived values. +- Objective metric from user request, `key_metric`, or explicit unresolved + default. +- Fixed-budget fields such as rounds, clients, and candidate budget. +- Common argparse tunables from `job.py` and the resolved train script. + +The exported job folder remains useful as execution truth once the job is +materialized, because it contains resolved NVFlare app and component configs. 
+However, it does not reliably preserve all authoring intent needed for an +Auto-FL campaign, such as editable source files, train-argument construction, +tunable-versus-fixed intent, and source provenance. Therefore the importer uses +deterministic Python/static parsing for the campaign layer and may use exported +config inspection as a validation aid when available. + +Unsupported or dynamic fields are carried forward as unresolved review items +instead of being guessed by the importer or the agent. + +## Trust Contract + +Every import result includes: + +- `import`: importer version, source path, source hash, support status, and + confidence. +- `job`: surface, entrypoint, allowed edit paths, train script, and call + arguments with provenance. +- `objective`, `budget`, `environment`, and `search_space`. +- `trust_contract`: extracted facts, unresolved fields, allowed edit paths, and + agent controls. + +The skill must present editable, unresolved, and allowed sections before it runs +candidates. This is the core product guardrail: NVFlare makes the campaign +reviewable and reproducible; the agent makes it interactive and exploratory. + +## Candidate Contract + +The agent, rather than the deterministic runner, owns search policy. It may +change tunables, edit the imported job's allowed source files, or implement new +algorithms as Python modules. Each attempt starts from the retained best source +in `.nvflare/autofl/candidates//source` and has a generated +`candidate_manifest.json` containing its hypothesis, base candidate, run +arguments, changed files, source and budget hashes, patch hash, artifacts, and +result. + +NVFlare computes the manifest's evidence fields; the agent does not assert them. +Before execution, the helper rejects stale candidates, path traversal, symlink +escapes, unauthorized existing-file edits, and detectable fixed-budget drift. 
+It applies the candidate transactionally to the real job workspace, retains a +new best, and restores the previous best after a discard or crash. This works +without requiring a Git repository and leaves the best source ready for the +standard NVFlare job lifecycle. + +The built-in parameter candidates are suggestion seeds only. They are returned +as machine-readable hypotheses and arguments when requested, but are not the +default search loop and are never executed without agent selection. + +## Execution Model + +The skill uses existing NVFlare execution surfaces: + +- Simulation: initialize a baseline, prepare an agent-authored candidate draft, + and evaluate it through the existing `job.py` and configured `SimEnv`. +- POC: use the existing job authoring/export flow, startup kits, and standard + `nvflare job` commands, then record the job ID, artifacts, and metric against + the candidate manifest. +- Production: use standard startup-kit authentication, site policy, job submit, + wait, download, and inspection commands with the same manifest and result + recording contract. + +Production is a valid optimization environment. The best candidate may later be +submitted or reused through the standard NVFlare job lifecycle; no separate +promotion command is needed. + +## Stopped-Campaign Reporting + +Reporting is a separate skill boundary because its trigger and safety posture +differ from active optimization. `nvflare-autofl` must continue an active, +uncapped campaign while state has `final_response_allowed=false`. +`nvflare-autofl-report` operates only after a clean stop, explicit cap, hard +blocker, or independently confirmed interruption. + +The report helper consumes `results.tsv`, `autofl.yaml`, campaign state, and +candidate manifests. 
It attempts to refresh the shared `progress.png` and +writes: + +- `autofl_final_report.md`, a concise review artifact with executive summary, + trajectory, best-candidate lineage, exact commands, reliability, and + reproduction guidance; +- `autofl_report_summary.json`, a machine-readable + `nvflare.autofl.report.v1` summary for tools and future automation. + +The helper does not edit source, ledger, manifests, or campaign state and does +not require Git. If an abrupt interruption leaves state active, the human must +confirm interruption after execution is independently checked; the report +records that assertion without rewriting history. This confirmation bypasses +only stale stop state. Pending state, `candidate` ledger rows, or manifests in +`prepared`/`ready_for_external_execution` status block finalization until the +active skill finalizes or abandons them. + +Plotting is optional report evidence. A missing plotting dependency or invalid +PNG does not suppress the Markdown and JSON artifacts: the helper preserves the +failed artifact, emits a warning, omits the Markdown image, and records +`artifacts.progress_plot_available=false`. + +Literature reporting follows measured evidence rather than agent narrative. +Each recorded literature checkpoint owns the comparable candidates until the +next checkpoint. Their best result is compared with the incumbent immediately +before the review and classified as helped, matched, not confirmed, failed, or +not evaluated. Recorded `[src: ...]` markers are preserved as campaign +provenance, not presented as independently verified citations. + +The report distinguishes retained and observed evidence. `best` is limited to +scored baseline and `keep` rows, while `best_observed` may expose an unretained +scored `discard`. Pending candidates and crashes remain attempt/failure +evidence and cannot become milestones or literature improvements. 
The +objective also separates measurement provenance (`metric_source`) from the +importer's metric-contract provenance (`metric_contract_source`). + +Finally, the report compares the declarative/imported budget with exact +baseline and best-candidate commands. It highlights changed compute or data +arguments, incomplete lineage, and repeated selection on test-like metrics. +This makes the report a trust artifact rather than a polished restatement of +the agent's conclusions. + +## Review Questions + +- Are the supported `job.py` patterns sufficient for an initial prototype? +- Are the edit and creation permissions in `autofl.yaml` appropriate for + algorithm-level candidates while preserving candidate comparability? +- Which exported-job fields should be used as validation evidence versus static + `job.py` parsing for authoring intent? +- Does the Auto-FL skill pass the general NVFLARE skill frontmatter, trigger, + and eval checks after it lands under `skills/nvflare-autofl`? +- Which candidate-manifest and metric/artifact fields should become stable + NVFlare APIs after the skill-local contract proves itself? +- Is `nvflare.autofl.report.v1` sufficient for downstream review and automation + while remaining explicitly skill-local in this follow-up? diff --git a/docs/user_guide/nvflare_cli/autofl_skill.rst b/docs/user_guide/nvflare_cli/autofl_skill.rst new file mode 100644 index 0000000000..3549f0a0f6 --- /dev/null +++ b/docs/user_guide/nvflare_cli/autofl_skill.rst @@ -0,0 +1,180 @@ +.. _autofl_skill: + +####################### +NVFlare Auto-FL Skill +####################### + +The NVFlare Auto-FL skill is an agent-assisted workflow for optimizing an +existing NVFlare ``job.py``. The user entry point is the coding agent skill: +select the NVFlare Auto-FL skill, point it at a job, and state the objective, +environment, and candidate budget. + +The skill source lives in ``skills/nvflare-autofl`` with the other NVFlare-owned +agent skills. 
When the general agent skill CLI is available, install it through +the standard ``nvflare agent skills`` workflow for the target coding agent. + +NVFlare does not add a separate public Auto-FL command family for this workflow. +Instead, NVFlare provides the deterministic import, reviewable +``autofl.yaml`` contract, execution substrate, policy boundaries, artifacts, +and reproducibility evidence. The agent chooses hypotheses, edits source, +implements algorithms, and runs candidates through existing NVFlare surfaces. + +``autofl.yaml`` is the human-reviewable campaign configuration, not a replacement +for ``job.py`` or for exported NVFlare job folders. It exposes the editable +Auto-FL settings, fixed-budget constraints, allowed edit paths, objective, +candidate budget, provenance, and unresolved fields. The original ``job.py`` +remains the experiment entry point the skill and agent use to run candidates. + +Typical Prompt +============== + +.. code-block:: text + + Use the NVFlare Auto-FL skill. + Optimize ./job.py for validation accuracy in simulation with an + 8-candidate budget. + +First Step: Deterministic Import +================================ + +The skill first imports the job without executing user code: + +.. code-block:: shell + + python -m nvflare.app_common.autofl.job_importer ./job.py \ + --metric accuracy \ + --env sim \ + --max-candidates 8 \ + --output autofl.yaml + +The importer parses supported Recipe and FedJob patterns with Python AST +inspection. It extracts campaign-relevant settings into ``autofl.yaml`` and +marks unknown or dynamic fields as unresolved instead of guessing. + +Trust Contract +============== + +Before editing or running candidates, the skill should show the user three +things from ``autofl.yaml``: + +- **Editable**: metric, environment, candidate budget, tunables, artifact + locations, source hash, and importer version. 
+- **Unresolved**: dynamic defaults, unsupported Python semantics, missing metric + sources, unknown data paths, or low-confidence fields. +- **Allowed**: files the agent may edit, fixed-budget fields it must preserve, + Python modules it may add under the job root, and environment or policy + boundaries. + +This makes the workflow feel native and reproducible: NVFlare owns the truth of +the campaign settings and execution surfaces; the agent owns exploration within +explicit constraints. + +Execution +========= + +The bundled helper is an internal skill surface, not a public NVFlare command +family. It first initializes the campaign and baseline: + +.. code-block:: shell + + python "$CODEX_HOME/skills/nvflare-autofl/scripts/run_job_campaign.py" \ + initialize ./job.py --metric accuracy --mode max --env sim + +For each attempt, the agent supplies a hypothesis and receives an isolated +candidate source directory plus ``candidate_manifest.json``: + +.. code-block:: shell + + python "$CODEX_HOME/skills/nvflare-autofl/scripts/run_job_campaign.py" \ + prepare ./job.py --name fedprox-variant \ + --hypothesis "stabilize heterogeneous client updates" + +The agent edits that candidate source, including new Python algorithm modules +when useful, and asks the helper to evaluate it: + +.. code-block:: shell + + python "$CODEX_HOME/skills/nvflare-autofl/scripts/run_job_campaign.py" \ + evaluate ./job.py --manifest + +NVFlare computes the source diff and hash, checks allowed paths and detectable +fixed-budget drift, executes the candidate, updates ``results.tsv`` and +``progress.png``, and either retains the new best source or restores the prior +best. Built-in tunable candidates are available through the helper's +``suggest`` action only as optional seeds; the agent remains free to implement +new algorithms. + +The workflow then uses existing NVFlare execution surfaces: + +- Simulation jobs run through the job's configured ``SimEnv``. 
+- POC and production jobs use the standard startup-kit and ``nvflare job`` + submission, wait, download, and inspection commands. The skill records the + resulting job ID, artifacts, and metric against the candidate manifest. +- Production execution is allowed when the user requests it, but the skill must + not bypass normal startup-kit authentication, site policy, or job submission. + +Supported First Version +======================= + +The first version is intentionally narrow: + +- Supported job surfaces: NVFlare Recipe constructors and FedJob-style scripts. +- Supported import fields: objective metric, fixed budget fields, environment, + train script, allowed edit paths, and common argparse tunables. +- Unsupported or ambiguous custom Python is preserved as unresolved review + fields. + +The default user experience should not require editing ``autofl.yaml``. Users +review it only when the importer reports unresolved fields or when they want to +override the campaign configuration. + +Final Report After Stop +======================= + +After a campaign is manually stopped, reaches its explicit cap, or ends at a +hard policy/runtime boundary, select the companion NVFlare Auto-FL Report skill. +It turns the recorded campaign evidence into a reviewable final report without +requiring Git or rerunning candidates: + +.. code-block:: text + + Use the NVFlare Auto-FL Report skill. + Generate the final report for the stopped campaign in ./job. + +The skill verifies ``.nvflare/autofl/campaign_state.json``, ``results.tsv``, and +available candidate manifests before finalizing. A pending candidate must be +finalized or abandoned through the active Auto-FL skill first. 
The report +helper attempts to refresh ``progress.png`` and produces: + +- ``autofl_final_report.md`` for human review; +- ``autofl_report_summary.json`` for tools and downstream agents; +- a synthesis of every literature checkpoint and the candidates evaluated + after it; +- best-candidate lineage, inherited code changes, manifests, patch hashes, + exact commands, artifacts, failures, and reproducibility warnings. + +Plotting is optional evidence. If plotting dependencies are unavailable or +the existing artifact is not a valid PNG, Markdown and JSON are still written, +the plot is omitted from Markdown, and +``artifacts.progress_plot_available=false`` records the degraded state. + +The deterministic helper can also be invoked directly by an agent: + +.. code-block:: shell + + python "$CODEX_HOME/skills/nvflare-autofl-report/scripts/generate_report.py" \ + + +If a process was abruptly interrupted and campaign state still appears active, +the agent must first independently confirm that execution has stopped. It may +then add ``--confirm-interrupted``. This records the reporting assertion but +does not mutate campaign state. It bypasses only stale stop state; it never +bypasses pending state, ``candidate`` ledger rows, or prepared candidate +manifests. + +The report distinguishes the imported budget in ``autofl.yaml`` from the exact +arguments that ran. It warns when the selected candidate changed training +compute or when multiple candidates were selected against a test-like metric. +This keeps the final result useful without overstating the evidence. +The JSON ``best`` field always means a retained baseline or ``keep`` result; +an unretained scored ``discard`` is exposed separately as ``best_observed``. 
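
Reviewer sketch (not part of the patch): the distinction between ``best`` and ``best_observed`` described above can be illustrated with a small selection routine. This assumes ledger rows reduced to a ``status`` and a numeric ``score``. Those field names, the ``crash`` status label, and the ``summarize`` helper are hypothetical; the real ``results.tsv`` columns and report helper are not shown in this guide.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]

RETAINED = {"baseline", "keep"}      # rows eligible for `best`
OBSERVED = RETAINED | {"discard"}    # rows eligible for `best_observed`


def pick(rows: List[Row], statuses: set, mode: str = "max") -> Optional[Row]:
    """Return the best scored row among the given statuses, or None."""
    scored = [
        r for r in rows
        if r["status"] in statuses and isinstance(r.get("score"), (int, float))
    ]
    if not scored:
        return None
    chooser = max if mode == "max" else min
    return chooser(scored, key=lambda r: r["score"])


def summarize(rows: List[Row], mode: str = "max") -> Dict[str, Optional[Row]]:
    """Separate retained evidence from merely observed evidence."""
    return {
        "best": pick(rows, RETAINED, mode),
        "best_observed": pick(rows, OBSERVED, mode),
    }


rows = [
    {"name": "baseline", "status": "baseline", "score": 0.80},
    {"name": "lr-sweep", "status": "keep", "score": 0.85},
    {"name": "more-rounds", "status": "discard", "score": 0.90},  # scored but not retained
    {"name": "bad-import", "status": "crash", "score": None},     # failure evidence only
    {"name": "in-flight", "status": "candidate", "score": None},  # pending, never a milestone
]
summary = summarize(rows)
```

In this example the discarded row has the highest score, so it appears only as ``best_observed``, while ``best`` stays with the retained ``keep`` row.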
diff --git a/docs/user_guide/nvflare_cli/nvflare_cli.rst b/docs/user_guide/nvflare_cli/nvflare_cli.rst index d1c035837a..205a9535f7 100644 --- a/docs/user_guide/nvflare_cli/nvflare_cli.rst +++ b/docs/user_guide/nvflare_cli/nvflare_cli.rst @@ -47,6 +47,14 @@ Deprecated commands still exposed in help, such as ``simulator`` and ``authz_preview``, are documented only briefly and should not be used for new workflows unless you are maintaining an older setup. +Agent-assisted workflows +======================== + +The :ref:`NVFlare Auto-FL skill <autofl_skill>` is documented with CLI +workflows because it uses NVFlare's existing job, POC, production, and +machine-readable output surfaces. It is not a separate ``nvflare`` command +group. + .. toctree:: :maxdepth: 1 @@ -62,5 +70,6 @@ workflows unless you are maintaining an older setup. cert_command package_command recipe_command + autofl_skill preflight_check dashboard_command diff --git a/nvflare/app_common/autofl/__init__.py b/nvflare/app_common/autofl/__init__.py new file mode 100644 index 0000000000..42fef0858f --- /dev/null +++ b/nvflare/app_common/autofl/__init__.py @@ -0,0 +1,31 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +"""Auto-FL utilities for agent-assisted NVFlare job optimization.""" + +__all__ = [ + "AUTOFL_CONFIG_SCHEMA_VERSION", + "DeterministicJobImporter", + "JobImportError", + "dump_autofl_yaml", + "import_job_to_autofl_config", +] + + +def __getattr__(name): + if name in __all__: + from nvflare.app_common.autofl import job_importer + + return getattr(job_importer, name) + raise AttributeError(f"module {__name__!r} has no attribute {name!r}") diff --git a/nvflare/app_common/autofl/job_importer.py b/nvflare/app_common/autofl/job_importer.py new file mode 100644 index 0000000000..343f495911 --- /dev/null +++ b/nvflare/app_common/autofl/job_importer.py @@ -0,0 +1,850 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Deterministically import supported NVFlare job scripts into ``autofl.yaml``. + +The Auto-FL skill uses this module as its trust layer. The importer parses +Python source with ``ast``; it never imports or executes the user's ``job.py``. +Supported Recipe/FedJob patterns are converted into a reviewable config, while +dynamic or unsupported fields are surfaced under ``unresolved``. 
+""" + +from __future__ import annotations + +import argparse +import ast +import hashlib +import sys +from dataclasses import dataclass +from pathlib import Path +from typing import Any, Dict, Iterable, List, Optional, Tuple + +import yaml + +AUTOFL_CONFIG_SCHEMA_VERSION = "nvflare.autofl.config.v1" +IMPORTER_VERSION = "nvflare-autofl-job-importer/v1" +ALLOWED_CREATE_PATTERNS = ["**/*.py"] + +SUPPORTED_ENV_NAMES = {"PocEnv", "ProdEnv", "SimEnv"} +TUNABLE_ARG_NAMES = { + "aggregation_epochs", + "alpha", + "batch_size", + "cosine_lr_eta_min_factor", + "eval_batch_size", + "epochs", + "fedopt_beta1", + "fedopt_beta2", + "fedopt_tau", + "fedproxloss_mu", + "local_epochs", + "local_train_steps", + "lr", + "max_model_params", + "model_arch", + "momentum", + "num_workers", + "server_lr", + "server_momentum", + "weight_decay", +} + + +@dataclass(frozen=True) +class ArgSpec: + """Static argparse field extracted from source.""" + + name: str + flags: Tuple[str, ...] + default: Any = None + default_source: str = "argparse_default" + default_unresolved: bool = False + value_type: Optional[str] = None + choices: Optional[List[Any]] = None + action: Optional[str] = None + + +@dataclass(frozen=True) +class ResolvedValue: + """Resolved expression value plus provenance and confidence.""" + + value: Any + source: str + confidence: str = "high" + unresolved: bool = False + + +@dataclass(frozen=True) +class CallInfo: + """Supported call found in ``job.py`` with source-local resolution context.""" + + name: str + full_name: str + keywords: Dict[str, ast.AST] + assignments: Dict[str, ast.AST] + source: str + function_name: Optional[str] = None + + +class JobImportError(ValueError): + """Raised when the importer cannot read or parse a job file.""" + + +class DeterministicJobImporter: + """Rule-based importer for supported NVFlare Recipe and FedJob scripts.""" + + def __init__(self, workspace_root: Optional[str] = None): + self.workspace_root = Path(workspace_root or ".").resolve() + + 
def import_job( + self, + job_path: str, + *, + metric: Optional[str] = None, + mode: str = "max", + target_env: Optional[str] = None, + max_candidates: Optional[int] = None, + ) -> Dict[str, Any]: + """Return an ``autofl.yaml``-shaped config for ``job_path``. + + Args: + job_path: Path to ``job.py`` or a directory containing ``job.py``. + metric: Optional optimization metric requested by the user. + mode: ``max`` or ``min`` objective direction. + target_env: Optional target environment, such as ``sim`` or ``prod``. + max_candidates: Optional fixed candidate budget. + + Returns: + A deterministic, YAML-serializable mapping. + """ + + if mode not in {"max", "min"}: + raise JobImportError("mode must be 'max' or 'min'") + + source_path = self._resolve_job_path(job_path) + source_text = source_path.read_text(encoding="utf-8") + try: + tree = ast.parse(source_text, filename=str(source_path)) + except SyntaxError as e: + raise JobImportError(f"failed to parse {source_path}: {e}") from e + + index = _ImportIndex.from_tree(tree, source_text) + job_call = index.first_job_call() + env_call = index.first_env_call() + train_script = self._resolve_train_script(source_path, job_call, index.parser_args, source_text) + train_args = _collect_argparse_args_from_file(train_script) if train_script else {} + unresolved: List[Dict[str, str]] = [] + + if not job_call: + unresolved.append(_unresolved("job.surface", "no supported Recipe or FedJob constructor was found")) + if not train_script: + unresolved.append(_unresolved("job.train_script", "no train_script was found or resolved")) + + metric_name, metric_source, metric_issue = self._resolve_metric(metric, job_call, index.parser_args) + objective = _objective_contract(metric_name, mode, metric_source) + if metric_issue: + unresolved.append(metric_issue) + + budget, budget_issues = self._resolve_budget(max_candidates, job_call, env_call, index.parser_args, source_text) + unresolved.extend(budget_issues) + + allowed_edit_paths = 
self._allowed_edit_paths(source_path, train_script) + search_space, search_issues = self._suggest_search_space(index.parser_args, train_args) + unresolved.extend(search_issues) + + job_payload = { + "source": self._display_path(source_path), + "surface": _surface_name(job_call), + "entrypoint": "main" if _has_main_entrypoint(tree) else "unresolved", + "allowed_edit_paths": allowed_edit_paths, + "allowed_create_patterns": list(ALLOWED_CREATE_PATTERNS), + } + if job_call: + call_args, call_issues = self._resolved_call_keywords(job_call, index.parser_args, source_text) + unresolved.extend(call_issues) + if _is_recipe_call(job_call): + job_payload.update( + { + "recipe": job_call.name, + "recipe_class": index.imports.get(job_call.name, job_call.full_name), + "recipe_args": call_args, + } + ) + else: + job_payload.update( + { + "fed_job": job_call.name, + "fed_job_class": index.imports.get(job_call.name, job_call.full_name), + "fed_job_args": call_args, + } + ) + if train_script: + job_payload["train_script"] = self._display_path(train_script) + + environment = self._environment_profile(target_env, env_call, index.parser_args, source_text) + if env_call: + env_args, env_issues = self._resolved_call_keywords(env_call, index.parser_args, source_text) + unresolved.extend(env_issues) + environment["discovered"] = {"name": env_call.name, "args": env_args} + + config = { + "schema_version": AUTOFL_CONFIG_SCHEMA_VERSION, + "import": { + "importer_version": IMPORTER_VERSION, + "source": self._display_path(source_path), + "source_sha256": _sha256_text(source_text), + "confidence": _overall_confidence(unresolved, job_call), + "support": { + "status": "supported" if job_call else "partial", + "patterns": _support_patterns(job_call, env_call), + }, + }, + "job": job_payload, + "objective": objective, + "budget": budget, + "environment": environment, + "search_space": {"suggested": search_space}, + "artifacts": { + "collect": ["logs", "metrics", "job_config", "candidate_diff", 
"candidate_manifest"], + "result_root": "autofl_runs", + }, + "trust_contract": { + "extracted": _trust_extracted( + job_call, + env_call, + train_script, + budget, + objective, + search_space, + ), + "unresolved": list(unresolved), + "allowed_edit_paths": allowed_edit_paths, + "allowed_create_patterns": list(ALLOWED_CREATE_PATTERNS), + "agent_controls": { + "must_not_edit_outside_allowed_paths": True, + "must_preserve_fixed_training_budget": bool(budget.get("fixed_training_budget")), + "must_report_candidate_diffs": True, + }, + }, + "unresolved": list(unresolved), + } + return config + + def dump_yaml(self, config: Dict[str, Any]) -> str: + """Return deterministic YAML for an imported Auto-FL config.""" + + return dump_autofl_yaml(config) + + def _resolve_job_path(self, job_path: str) -> Path: + path = Path(job_path) + if not path.is_absolute(): + path = self.workspace_root / path + if path.is_dir(): + path = path / "job.py" + if not path.exists(): + raise JobImportError(f"job.py not found: {job_path}") + if not path.is_file(): + raise JobImportError(f"job path must be a file or directory containing job.py: {job_path}") + return path.resolve() + + def _display_path(self, path: Path) -> str: + try: + return path.resolve().relative_to(self.workspace_root).as_posix() + except ValueError: + return path.resolve().as_posix() + + def _resolve_train_script( + self, + source_path: Path, + job_call: Optional[CallInfo], + parser_args: Dict[str, ArgSpec], + source_text: str, + ) -> Optional[Path]: + if not job_call: + return _existing_path(source_path.parent / "client.py") + + train_script_node = job_call.keywords.get("train_script") + if not train_script_node: + return _existing_path(source_path.parent / "client.py") + + resolved = _resolve_value(train_script_node, job_call.assignments, parser_args, source_text) + value = resolved.value + if isinstance(value, str) and _is_resolved_path_string(resolved): + return _existing_path((source_path.parent / value).resolve()) + 
return None + + def _resolve_metric( + self, + requested_metric: Optional[str], + job_call: Optional[CallInfo], + parser_args: Dict[str, ArgSpec], + ) -> Tuple[str, str, Optional[Dict[str, str]]]: + if requested_metric: + return requested_metric, "user_request", None + if job_call and "key_metric" in job_call.keywords: + resolved = _resolve_value(job_call.keywords["key_metric"], job_call.assignments, parser_args, "") + if isinstance(resolved.value, str) and not resolved.unresolved: + return resolved.value, resolved.source, None + return "accuracy", "default", _unresolved("objective.metric", resolved.source) + if "key_metric" in parser_args and isinstance(parser_args["key_metric"].default, str): + return parser_args["key_metric"].default, "arg:key_metric", None + return ( + "accuracy", + "default", + _unresolved("objective.metric", "metric defaulted to accuracy; validation metric source is unknown"), + ) + + def _resolve_budget( + self, + max_candidates: Optional[int], + job_call: Optional[CallInfo], + env_call: Optional[CallInfo], + parser_args: Dict[str, ArgSpec], + source_text: str, + ) -> Tuple[Dict[str, Any], List[Dict[str, str]]]: + budget: Dict[str, Any] = {} + fixed_training_budget: Dict[str, Any] = {} + unresolved = [] + if max_candidates is not None: + budget["max_candidates"] = max_candidates + + if job_call: + for output_key, job_key in (("num_rounds", "num_rounds"), ("min_clients", "min_clients")): + if job_key in job_call.keywords: + resolved = _resolve_value( + job_call.keywords[job_key], job_call.assignments, parser_args, source_text + ) + if resolved.unresolved: + unresolved.append(_unresolved(f"budget.fixed_training_budget.{output_key}", resolved.source)) + else: + fixed_training_budget[output_key] = resolved.value + + if env_call and env_call.name == "SimEnv" and "num_clients" in env_call.keywords: + resolved = _resolve_value(env_call.keywords["num_clients"], env_call.assignments, parser_args, source_text) + if resolved.unresolved: + 
unresolved.append(_unresolved("budget.fixed_training_budget.num_clients", resolved.source)) + else: + fixed_training_budget["num_clients"] = resolved.value + + if fixed_training_budget: + budget["fixed_training_budget"] = fixed_training_budget + else: + unresolved.append(_unresolved("budget.fixed_training_budget", "no fixed training budget was resolved")) + return budget, unresolved + + def _environment_profile( + self, + target_env: Optional[str], + env_call: Optional[CallInfo], + parser_args: Dict[str, ArgSpec], + source_text: str, + ) -> Dict[str, Any]: + requested = target_env or (_env_name_to_profile(env_call.name) if env_call else "sim") + environment: Dict[str, Any] = {"requested": requested, "profiles": {}} + if env_call and env_call.name == "SimEnv": + sim_profile: Dict[str, Any] = {} + if "num_clients" in env_call.keywords: + resolved = _resolve_value( + env_call.keywords["num_clients"], env_call.assignments, parser_args, source_text + ) + if not resolved.unresolved: + sim_profile["num_clients"] = resolved.value + environment["profiles"]["sim"] = sim_profile + return environment + + def _allowed_edit_paths(self, source_path: Path, train_script: Optional[Path]) -> List[str]: + candidates = [source_path] + if train_script: + candidates.append(train_script) + for filename in ("model.py", "mutation_schema.yaml", "requirements.txt"): + path = source_path.parent / filename + if path.exists(): + candidates.append(path.resolve()) + return list(dict.fromkeys(self._display_path(path) for path in candidates)) + + def _suggest_search_space( + self, + job_args: Dict[str, ArgSpec], + train_args: Dict[str, ArgSpec], + ) -> Tuple[Dict[str, Any], List[Dict[str, str]]]: + suggested = {} + unresolved = [] + for arg_name, arg_spec in sorted({**job_args, **train_args}.items()): + if arg_name not in TUNABLE_ARG_NAMES: + continue + item = { + "type": arg_spec.value_type or _type_name(arg_spec.default), + "default": arg_spec.default, + "source": f"argparse:{arg_name}", + 
"confidence": "low" if arg_spec.default_unresolved else "high", + } + if arg_spec.choices: + item["values"] = arg_spec.choices + if arg_spec.default_unresolved: + item["default_source"] = arg_spec.default_source + item["unresolved"] = True + unresolved.append( + _unresolved( + f"search_space.suggested.{arg_name}.default", + f"default is dynamic expression: {arg_spec.default}", + ) + ) + suggested[arg_name] = item + + if not suggested: + unresolved.append(_unresolved("search_space.suggested", "no supported tunable argparse fields were found")) + return suggested, unresolved + + def _resolved_call_keywords( + self, + call_info: CallInfo, + parser_args: Dict[str, ArgSpec], + source_text: str, + ) -> Tuple[Dict[str, Any], List[Dict[str, str]]]: + resolved = {} + unresolved = [] + for key, value_node in sorted(call_info.keywords.items()): + value = _resolve_value(value_node, call_info.assignments, parser_args, source_text) + resolved[key] = { + "value": value.value, + "source": value.source, + "confidence": value.confidence, + } + if value.unresolved and key != "model": + unresolved.append(_unresolved(f"job.{call_info.name}.{key}", value.source)) + return resolved, unresolved + + +def import_job_to_autofl_config( + job_path: str, + *, + workspace_root: Optional[str] = None, + metric: Optional[str] = None, + mode: str = "max", + target_env: Optional[str] = None, + max_candidates: Optional[int] = None, +) -> Dict[str, Any]: + """Convenience wrapper for deterministic job import.""" + + importer = DeterministicJobImporter(workspace_root=workspace_root) + return importer.import_job( + job_path, + metric=metric, + mode=mode, + target_env=target_env, + max_candidates=max_candidates, + ) + + +def dump_autofl_yaml(config: Dict[str, Any]) -> str: + """Return deterministic YAML for an imported Auto-FL config.""" + + return yaml.dump(config, Dumper=_NoAliasSafeDumper, sort_keys=False) + + +class _NoAliasSafeDumper(yaml.SafeDumper): + def ignore_aliases(self, data): + return True + 
+ +class _ImportIndex(ast.NodeVisitor): + def __init__(self, source_text: str): + self.source_text = source_text + self.imports: Dict[str, str] = {} + self.parser_args: Dict[str, ArgSpec] = {} + self.module_assignments: Dict[str, ast.AST] = {} + self._local_assignments_stack: List[Dict[str, ast.AST]] = [] + self._function_stack: List[str] = [] + self.job_calls: List[CallInfo] = [] + self.env_calls: List[CallInfo] = [] + + @classmethod + def from_tree(cls, tree: ast.AST, source_text: str) -> "_ImportIndex": + index = cls(source_text) + index.visit(tree) + return index + + def first_job_call(self) -> Optional[CallInfo]: + return self.job_calls[0] if self.job_calls else None + + def first_env_call(self) -> Optional[CallInfo]: + return self.env_calls[0] if self.env_calls else None + + def visit_Import(self, node: ast.Import) -> None: + for alias in node.names: + self.imports[alias.asname or alias.name] = alias.name + + def visit_ImportFrom(self, node: ast.ImportFrom) -> None: + if node.module: + for alias in node.names: + self.imports[alias.asname or alias.name] = f"{node.module}.{alias.name}" + + def visit_FunctionDef(self, node: ast.FunctionDef) -> None: + self._function_stack.append(node.name) + self._local_assignments_stack.append({}) + self.generic_visit(node) + self._local_assignments_stack.pop() + self._function_stack.pop() + + def visit_Assign(self, node: ast.Assign) -> None: + for target in node.targets: + if isinstance(target, ast.Name): + self._current_assignments()[target.id] = node.value + self.generic_visit(node) + + def visit_AnnAssign(self, node: ast.AnnAssign) -> None: + if isinstance(node.target, ast.Name) and node.value: + self._current_assignments()[node.target.id] = node.value + self.generic_visit(node) + + def visit_Call(self, node: ast.Call) -> None: + call_name = _call_name(node.func) + if _is_argparse_add_argument_call(call_name): + arg_spec = _arg_spec_from_call(node) + if arg_spec: + self.parser_args[arg_spec.name] = arg_spec + + short_name = 
call_name.split(".")[-1] + if _is_supported_job_call_name(short_name) or short_name in SUPPORTED_ENV_NAMES: + call_info = CallInfo( + name=short_name, + full_name=call_name, + keywords={keyword.arg: keyword.value for keyword in node.keywords if keyword.arg}, + assignments=self._resolution_assignments(), + source=_source_segment(self.source_text, node), + function_name=self._function_stack[-1] if self._function_stack else None, + ) + if short_name in SUPPORTED_ENV_NAMES: + self.env_calls.append(call_info) + else: + self.job_calls.append(call_info) + self.generic_visit(node) + + def _current_assignments(self) -> Dict[str, ast.AST]: + if self._local_assignments_stack: + return self._local_assignments_stack[-1] + return self.module_assignments + + def _resolution_assignments(self) -> Dict[str, ast.AST]: + assignments = dict(self.module_assignments) + if self._local_assignments_stack: + assignments.update(self._local_assignments_stack[-1]) + return assignments + + +def _collect_argparse_args_from_file(path: Optional[Path]) -> Dict[str, ArgSpec]: + if not path or not path.exists(): + return {} + try: + tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path)) + except SyntaxError: + return {} + return _ImportIndex.from_tree(tree, "").parser_args + + +def _arg_spec_from_call(node: ast.Call) -> Optional[ArgSpec]: + flags = [] + for arg in node.args: + is_literal, value = _literal_value(arg) + if is_literal and isinstance(value, str) and value.startswith("-"): + flags.append(value) + if not flags: + return None + + keywords = {keyword.arg: keyword.value for keyword in node.keywords if keyword.arg} + name = _literal_keyword_value(keywords.get("dest")) or _name_from_flags(flags) + if not name: + return None + + action = _literal_keyword_value(keywords.get("action")) + default, default_source, default_unresolved = _arg_default_from_keywords(keywords) + if not default_unresolved and default is None and action == "store_true": + default = False + default_source = 
"argparse_action" + elif not default_unresolved and default is None and action == "store_false": + default = True + default_source = "argparse_action" + + return ArgSpec( + name=name, + flags=tuple(flags), + default=default, + default_source=default_source, + default_unresolved=default_unresolved, + value_type=_call_name(keywords["type"]) if "type" in keywords else None, + choices=_literal_sequence(keywords.get("choices")), + action=action, + ) + + +def _arg_default_from_keywords(keywords: Dict[str, ast.AST]) -> Tuple[Any, str, bool]: + if "default" not in keywords: + return None, "argparse_default", False + + node = keywords["default"] + is_literal, literal = _literal_value(node) + if is_literal: + return literal, "literal", False + return _unparse(node), "expression", True + + +def _name_from_flags(flags: Iterable[str]) -> Optional[str]: + long_flags = [flag for flag in flags if flag.startswith("--")] + selected = long_flags[0] if long_flags else next(iter(flags), None) + if not selected: + return None + return selected.lstrip("-").replace("-", "_") + + +def _resolve_value( + node: ast.AST, + assignments: Dict[str, ast.AST], + parser_args: Dict[str, ArgSpec], + source_text: str, + seen: Optional[set[str]] = None, +) -> ResolvedValue: + seen = seen or set() + is_literal, literal = _literal_value(node) + if is_literal: + return ResolvedValue(literal, "literal") + + if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name) and node.value.id == "args": + arg_spec = parser_args.get(node.attr) + if arg_spec: + return _resolve_arg_default(node.attr, arg_spec) + return ResolvedValue(None, f"unresolved arg:{node.attr}", "low", True) + + if isinstance(node, ast.Name): + if node.id in seen: + return ResolvedValue(None, f"recursive reference:{node.id}", "low", True) + if node.id in parser_args: + return _resolve_arg_default(node.id, parser_args[node.id]) + if node.id in assignments: + return _resolve_value(assignments[node.id], assignments, parser_args, 
source_text, seen | {node.id}) + return ResolvedValue(node.id, f"name:{node.id}", "low", True) + + if isinstance(node, ast.Call): + call_name = _call_name(node.func) + if call_name in {"Path", "pathlib.Path", "os.path.join"}: + arg_value = _first_resolved_argparse_string(node, assignments, parser_args, source_text) + if arg_value is not None: + return arg_value + return ResolvedValue(_source_segment(source_text, node) or call_name, f"call:{call_name}", "low", True) + + return ResolvedValue(_source_segment(source_text, node) or type(node).__name__, "expression", "low", True) + + +def _resolve_arg_default(name: str, arg_spec: ArgSpec) -> ResolvedValue: + if arg_spec.default_unresolved: + return ResolvedValue(arg_spec.default, f"arg:{name}:{arg_spec.default_source}", "low", True) + return ResolvedValue(arg_spec.default, f"arg:{name}") + + +def _first_resolved_argparse_string( + node: ast.Call, + assignments: Dict[str, ast.AST], + parser_args: Dict[str, ArgSpec], + source_text: str, +) -> Optional[ResolvedValue]: + for arg in node.args: + resolved = _resolve_value(arg, assignments, parser_args, source_text) + if resolved.source.startswith("arg:") and isinstance(resolved.value, str): + return resolved + return None + + +def _call_name(node: Optional[ast.AST]) -> str: + if isinstance(node, ast.Name): + return node.id + if isinstance(node, ast.Attribute): + parent = _call_name(node.value) + return f"{parent}.{node.attr}" if parent else node.attr + return "" + + +def _literal_value(node: Optional[ast.AST]) -> Tuple[bool, Any]: + if isinstance(node, ast.Constant): + return True, node.value + if isinstance(node, (ast.List, ast.Tuple)): + values = [] + for item in node.elts: + is_literal, value = _literal_value(item) + if not is_literal: + return False, None + values.append(value) + return True, values + return False, None + + +def _literal_keyword_value(node: Optional[ast.AST]) -> Any: + is_literal, value = _literal_value(node) + return value if is_literal else None + + +def 
_literal_sequence(node: Optional[ast.AST]) -> Optional[List[Any]]: + is_literal, value = _literal_value(node) + return value if is_literal and isinstance(value, list) else None + + +def _source_segment(source_text: str, node: ast.AST) -> str: + if not source_text: + return "" + return ast.get_source_segment(source_text, node) or "" + + +def _unparse(node: ast.AST) -> str: + try: + return ast.unparse(node) + except Exception: + return type(node).__name__ + + +def _is_argparse_add_argument_call(call_name: str) -> bool: + return call_name.endswith(".add_argument") + + +def _is_supported_job_call_name(name: str) -> bool: + return name.endswith("Recipe") or name in {"BaseFedJob", "FedJob"} + + +def _is_recipe_call(call_info: CallInfo) -> bool: + return call_info.name.endswith("Recipe") + + +def _surface_name(call_info: Optional[CallInfo]) -> str: + if not call_info: + return "unknown" + return "recipe" if _is_recipe_call(call_info) else "fed_job" + + +def _env_name_to_profile(env_name: str) -> str: + return env_name.removesuffix("Env").lower() + + +def _support_patterns(job_call: Optional[CallInfo], env_call: Optional[CallInfo]) -> List[str]: + patterns = [] + if job_call: + patterns.append(f"{_surface_name(job_call)}:{job_call.name}") + if env_call: + patterns.append(f"env:{env_call.name}") + return patterns + + +def _trust_extracted( + job_call: Optional[CallInfo], + env_call: Optional[CallInfo], + train_script: Optional[Path], + budget: Dict[str, Any], + objective: Dict[str, Any], + search_space: Dict[str, Any], +) -> List[Dict[str, Any]]: + extracted = [] + if job_call: + extracted.append({"field": "job.surface", "value": _surface_name(job_call)}) + extracted.append({"field": f"job.{_surface_name(job_call)}", "value": job_call.name}) + if env_call: + extracted.append({"field": "environment.discovered", "value": env_call.name}) + if train_script: + extracted.append({"field": "job.train_script", "value": train_script.name}) + extracted.append({"field": 
"objective.metric", "value": objective["metric"]}) + extracted.append({"field": "objective.optimization_metric", "value": objective["optimization_metric"]}) + if "fixed_training_budget" in budget: + extracted.append({"field": "budget.fixed_training_budget", "value": budget["fixed_training_budget"]}) + if search_space: + extracted.append({"field": "search_space.suggested", "value": sorted(search_space)}) + return extracted + + +def _objective_contract(metric_name: str, mode: str, source: str) -> Dict[str, Any]: + return { + "metric": metric_name, + "requested_metric": metric_name, + "optimization_metric": metric_name, + "metric_extraction_order": [metric_name], + "mode": mode, + "source": source, + } + + +def _has_main_entrypoint(tree: ast.AST) -> bool: + return any(isinstance(node, ast.FunctionDef) and node.name == "main" for node in ast.walk(tree)) + + +def _existing_path(path: Path) -> Optional[Path]: + return path.resolve() if path.exists() else None + + +def _is_resolved_path_string(value: ResolvedValue) -> bool: + return not value.unresolved and (value.source == "literal" or value.source.startswith("arg:")) + + +def _sha256_text(text: str) -> str: + return hashlib.sha256(text.encode("utf-8")).hexdigest() + + +def _type_name(value: Any) -> str: + if isinstance(value, bool): + return "bool" + if isinstance(value, int): + return "int" + if isinstance(value, float): + return "float" + if isinstance(value, str): + return "str" + if value is None: + return "unknown" + return type(value).__name__ + + +def _overall_confidence(unresolved: List[Dict[str, str]], job_call: Optional[CallInfo]) -> str: + if not job_call: + return "low" + return "medium" if unresolved else "high" + + +def _unresolved(field: str, reason: str) -> Dict[str, str]: + return {"field": field, "reason": reason} + + +def main(argv: Optional[List[str]] = None) -> int: + parser = argparse.ArgumentParser(description="Deterministically import an NVFlare job.py into autofl.yaml") + 
parser.add_argument("job", help="NVFlare job.py file or directory containing job.py") + parser.add_argument("--output", default="autofl.yaml", help="output path for generated autofl.yaml") + parser.add_argument("--metric", help="requested optimization metric") + parser.add_argument("--mode", default="max", choices=["max", "min"]) + parser.add_argument("--env", dest="target_env", choices=["sim", "poc", "prod"], help="target environment") + parser.add_argument("--max-candidates", type=int, help="candidate budget") + args = parser.parse_args(argv) + + importer = DeterministicJobImporter() + try: + config = importer.import_job( + args.job, + metric=args.metric, + mode=args.mode, + target_env=args.target_env, + max_candidates=args.max_candidates, + ) + except JobImportError as e: + print(f"error: {e}", file=sys.stderr) + return 1 + output_path = Path(args.output) + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text(importer.dump_yaml(config), encoding="utf-8") + print(output_path) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/skills/nvflare-autofl-report/BENCHMARK.md b/skills/nvflare-autofl-report/BENCHMARK.md new file mode 100644 index 0000000000..a78689bc8e --- /dev/null +++ b/skills/nvflare-autofl-report/BENCHMARK.md @@ -0,0 +1,25 @@ +# Benchmark Summary + +Status: draft/internal pending runtime agent evaluation. + +Skill version: 0.1.0 +FLARE version: 2.8.0 minimum + +## Initial Checks + +| Check | Status | Notes | +| --- | --- | --- | +| Positive trigger | Draft | Final report request after a manually interrupted campaign. | +| Active-campaign boundary | Draft | Must refuse finalization while execution remains active. | +| Adjacent negative trigger | Draft | Optimization requests route to `nvflare-autofl`. | +| Global negative trigger | Draft | Non-FLARE reporting prompts route to no skill. | +| Deterministic artifacts | Draft | Markdown, JSON, and refreshed plot are generated from campaign evidence. 
| +| Literature synthesis | Draft | Checkpoints are linked to subsequent measured candidate outcomes. | +| Reproducibility | Draft | Commands, lineage, manifests, hashes, artifacts, and warnings are preserved. | + +## Known Gaps + +- Runtime agent-performance scoring has not been run yet. +- POC and production fixtures should be expanded after external-result campaigns are available. +- Campaign-recorded source markers are summarized but not independently resolved or verified. + diff --git a/skills/nvflare-autofl-report/SKILL.md b/skills/nvflare-autofl-report/SKILL.md new file mode 100644 index 0000000000..4e5f66779e --- /dev/null +++ b/skills/nvflare-autofl-report/SKILL.md @@ -0,0 +1,113 @@ +--- +name: nvflare-autofl-report +description: "Generate a reproducible final report, literature-outcome synthesis, JSON summary, and refreshed progress plot for a stopped or interrupted NVFLARE Auto-FL campaign." +min_flare_version: "2.8.0" +blast_radius: edits_files +skill_version: "0.1.0" +--- + +# NVFLARE Auto-FL Report + +## Use When + +Use this skill after an NVFLARE Auto-FL campaign has stopped, reached an +explicit cap, hit a hard blocker, or was manually interrupted. Use it when the +user asks for the final report, achieved improvement, literature findings, +failed ideas, reproduction details, or a refreshed progress plot. + +## Do Not Use When + +Do not use this skill to start or continue optimization, invent missing +results, or finalize a campaign that is still running. Use `nvflare-autofl` for +the active candidate loop. Do not stop an active campaign merely because the +user asks for a status snapshot. + +## Workflow + +1. Locate the job directory containing `results.tsv`, `autofl.yaml`, + `.nvflare/autofl/campaign_state.json`, and candidate manifests. +2. Confirm execution has stopped. Prefer campaign state with + `final_response_allowed=true`. 
When a process was abruptly interrupted and + state is stale, independently confirm no campaign or job process remains, + then use `--confirm-interrupted`. Before finalizing, confirm that campaign + state, `results.tsv`, and available candidate manifests contain no pending + candidate. Finalize or abandon pending work through `nvflare-autofl` first. +3. Generate the deterministic report artifacts: + + ```bash + python "$CODEX_HOME/skills/nvflare-autofl-report/scripts/generate_report.py" + ``` + +4. Read both `autofl_final_report.md` and `autofl_report_summary.json`. Check + warnings about metric use, executed budget changes, missing provenance, and + incomplete interruption state. +5. Give the user the baseline, best score, delta, strongest candidate lineage, + literature ideas that helped or failed, reliability caveats, and absolute + artifact paths. + +The helper attempts to refresh `progress.png` by reusing the product Auto-FL +plotter. Plotting is optional evidence: if plotting dependencies are missing or +the artifact is not a valid PNG, the helper preserves the artifact, records a +warning and `artifacts.progress_plot_available=false`, and still writes the +Markdown and JSON reports without embedding the broken image. It does not +modify source, candidate manifests, `results.tsv`, or campaign state. + +## Interrupted Campaigns + +The report helper must refuse state with `final_response_allowed=false` unless +the human has said the campaign was stopped/interrupted and the agent confirms +that execution is no longer active. Only then run: + +```bash +python "$CODEX_HOME/skills/nvflare-autofl-report/scripts/generate_report.py" --confirm-interrupted +``` + +This records a reporting-time interruption assertion; it does not rewrite the +campaign state or pretend the runner finalized cleanly. It bypasses only stale +stop state. 
A `candidate` ledger row, pending-candidate state, or a manifest in +`prepared` or `ready_for_external_execution` status always blocks finalization; +the agent must finalize or abandon that candidate first. + +## Report Contract + +The final report must include: + +- campaign termination reason, objective, metric source, direction, + environment, cap, and declared fixed budget; +- baseline, best retained result, score delta, runtime, failures, and status + counts; +- running-best trajectory and a refreshed `progress.png` when plotting is + available, with explicit plot availability in the JSON summary otherwise; +- best-candidate manifest, patch hash, base-candidate lineage, inherited code + changes, artifacts, and exact baseline/best commands; +- every recorded literature checkpoint, cited source markers, candidates + attempted afterward, and whether measured evidence helped, matched, failed, + or did not confirm the idea; +- discarded/crashed ideas and deterministic comparability warnings; +- optional agent model, reasoning effort, cost, or tooling notes when supplied; +- absolute paths to `autofl.yaml`, `results.tsv`, campaign state, + `progress.png`, `autofl_report_summary.json`, and + `autofl_final_report.md`. + +The report must distinguish imported/declared budget from executed command +arguments. It must warn when the best candidate changed training compute or +when repeated selection used a test-like metric. It must not add PR-specific +sections such as "Product Findings" unless the user explicitly requests them. +`best` means a scored retained baseline or `keep` row; an unretained scored +`discard` may appear only as `best_observed`. Candidate and crash rows never +become retained best results, milestones, or literature improvements. + +Read [report-contract.md](references/report-contract.md) when interpreting +lineage, literature outcomes, budget warnings, or interrupted state. 
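The `best` versus `best_observed` rule above can be illustrated with a minimal sketch. It assumes ledger rows are plain dictionaries with `status`, `name`, and `score` keys and that `mode` is `max` or `min`; `select_results` and `_score` are hypothetical names used only for illustration, not the report helper's actual functions. It also assumes `best_observed` is reported only when a scored `discard` beats the retained best.

```python
from typing import Dict, List, Optional

# Assumption: only these statuses may ever become the retained `best` row.
RETAINED_STATUSES = {"baseline", "keep"}


def _score(row: Dict[str, str]) -> Optional[float]:
    """Parse a score; blanks, non-numeric text, and NaN count as unscored."""
    try:
        value = float(row.get("score", ""))
    except (TypeError, ValueError):
        return None
    return None if value != value else value  # NaN is never equal to itself


def select_results(rows: List[Dict[str, str]], mode: str = "max") -> Dict[str, Optional[Dict[str, str]]]:
    """Pick the retained best row and, if any, a better unretained discard."""
    sign = 1.0 if mode == "max" else -1.0
    best: Optional[Dict[str, str]] = None
    top_discard: Optional[Dict[str, str]] = None
    for row in rows:
        score = _score(row)
        if score is None:
            continue
        status = row.get("status", "")
        if status in RETAINED_STATUSES:
            if best is None or sign * score > sign * _score(best):
                best = row
        elif status == "discard":
            if top_discard is None or sign * score > sign * _score(top_discard):
                top_discard = row
        # `candidate` and `crash` rows are ignored even when they carry a score.
    best_observed = None
    if top_discard is not None and (best is None or sign * _score(top_discard) > sign * _score(best)):
        best_observed = top_discard
    return {"best": best, "best_observed": best_observed}


rows = [
    {"status": "baseline", "name": "baseline", "score": "0.70"},
    {"status": "keep", "name": "c1", "score": "0.74"},
    {"status": "discard", "name": "c2", "score": "0.78"},
    {"status": "crash", "name": "c3", "score": "0.99"},
    {"status": "candidate", "name": "c4", "score": ""},
]
result = select_results(rows, mode="max")
print(result["best"]["name"], result["best_observed"]["name"])
```

In this sketch the malformed scored `crash` row never influences either result, which mirrors the contract's exclusion of candidate and crash rows from retained best selection.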
+ +## Requirements + +- Treat `results.tsv` as recorded evidence; never repair scores by guessing. +- Work without Git. Do not commit or push unless the user separately asks. +- Preserve the campaign and job sources exactly as found. +- Use candidate manifests when available, but still report partial provenance + when copied artifacts make old absolute manifest paths unavailable. +- Keep conclusions proportional to the evidence. A single run is a candidate, + not a robustness claim. +- For POC/production, report standard NVFLARE job IDs and downloaded artifacts + already present in the ledger; do not resubmit jobs during reporting. diff --git a/skills/nvflare-autofl-report/evals/evals.json b/skills/nvflare-autofl-report/evals/evals.json new file mode 100644 index 0000000000..6fb35fc8c5 --- /dev/null +++ b/skills/nvflare-autofl-report/evals/evals.json @@ -0,0 +1,118 @@ +{ + "skill_name": "nvflare-autofl-report", + "evals": [ + { + "id": "autofl-report-stopped-campaign", + "prompt": "The Auto-FL campaign in ./job has been manually stopped. 
Generate its final report and explain what the literature reviews taught us.", + "expected_output": "The agent selects nvflare-autofl-report, verifies stopped state and absence of pending candidates, generates the Markdown and JSON reports, refreshes the progress plot when available, then summarizes measured trajectory, candidate lineage, literature outcomes, failures, and comparability caveats.", + "files": [], + "assertions": [ + "The agent verifies that campaign execution has stopped and no candidate remains pending before finalizing.", + "The agent generates autofl_final_report.md and autofl_report_summary.json, and reports whether progress.png is available.", + "The agent reports baseline, best score, delta, exact command provenance, candidate lineage, and artifacts.", + "The agent connects each literature checkpoint to subsequent measured candidate outcomes.", + "The agent surfaces test-metric and changed-compute warnings when applicable." + ], + "nvflare": { + "expected_skill": "nvflare-autofl-report", + "mandatory_behavior": [ + { + "id": "verify-stopped-state", + "description": "requires finalized campaign state or explicit independently confirmed interruption, and always refuses unfinished candidate evidence" + }, + { + "id": "deterministic-report-artifacts", + "description": "generates Markdown and JSON from recorded Auto-FL artifacts and refreshes the plot when plotting is available" + }, + { + "id": "candidate-lineage", + "description": "reports base-candidate lineage, inherited changed files, manifest, patch hash, commands, and artifacts" + }, + { + "id": "literature-outcome-synthesis", + "description": "links literature checkpoints and source markers to subsequent measured candidate evidence" + }, + { + "id": "comparability-warnings", + "description": "distinguishes declared and executed budgets and warns about test-like selection metrics" + } + ], + "prohibited_behavior": [ + { + "id": "no-active-finalization", + "description": "does not finalize a 
campaign that is still running" + }, + { + "id": "no-invented-evidence", + "description": "does not invent scores, literature outcomes, manifests, or source citations" + }, + { + "id": "no-source-or-state-edits", + "description": "does not modify job source, ledger, candidate manifests, or campaign state" + }, + { + "id": "no-git-requirement", + "description": "does not require or automatically commit to a Git repository" + } + ], + "process_metrics": [ + { + "id": "artifact_completeness", + "description": "whether Markdown, JSON summary, and progress plot paths are reported" + }, + { + "id": "literature_evidence_coverage", + "description": "fraction of recorded literature checkpoints with synthesized candidate outcomes" + }, + { + "id": "provenance_completeness", + "description": "whether best-candidate lineage, command, manifest, hash, and artifacts are represented" + }, + { + "id": "unsupported_claim_count", + "description": "number of report conclusions not supported by ledger or campaign artifacts" + } + ] + } + }, + { + "id": "autofl-report-active-negative", + "prompt": "Show me a status snapshot of the Auto-FL campaign that is still running, but do not interrupt it.", + "expected_output": "The reporting skill does not finalize or stop the campaign; the active Auto-FL workflow provides status instead.", + "files": [], + "assertions": [ + "The agent does not generate a final stopped-campaign report or interrupt execution." + ], + "nvflare": { + "expected_skill": "nvflare-autofl", + "negative_for": "nvflare-autofl-report" + } + }, + { + "id": "autofl-report-optimize-negative", + "prompt": "Optimize ./job.py for validation accuracy in simulation.", + "expected_output": "The active optimization skill handles the campaign rather than the final-report skill.", + "files": [], + "assertions": [ + "The selected skill is nvflare-autofl, not nvflare-autofl-report." 
+ ], + "nvflare": { + "expected_skill": "nvflare-autofl", + "negative_for": "nvflare-autofl-report" + } + }, + { + "id": "autofl-report-global-negative", + "prompt": "Create a quarterly sales report from this spreadsheet.", + "expected_output": "No FLARE skill should trigger.", + "files": [], + "assertions": [ + "The selected skill is none." + ], + "nvflare": { + "expected_skill": null, + "negative_for": "*" + } + } + ] +} diff --git a/skills/nvflare-autofl-report/references/report-contract.md b/skills/nvflare-autofl-report/references/report-contract.md new file mode 100644 index 0000000000..1977c23b74 --- /dev/null +++ b/skills/nvflare-autofl-report/references/report-contract.md @@ -0,0 +1,98 @@ +# Auto-FL Final Report Contract + +## Inputs + +The report generator consumes product Auto-FL artifacts from the job directory: + +- `results.tsv`: append-only result evidence; +- `autofl.yaml`: requested and imported campaign contract; +- `.nvflare/autofl/campaign_state.json`: stop/finalization state; +- `.nvflare/autofl/candidates/*/candidate_manifest.json`: optional detailed + candidate provenance; +- `progress.png`: existing plot to refresh safely. + +The current ledger contract includes status, name, score, runtime, changed +files, hypothesis/diff summary, exact run command, artifacts, failure reason, +candidate manifest, base candidate, and patch hash. Older ledgers may omit +newer fields; the report should disclose missing provenance rather than fail +when the essential status and score columns remain readable. + +## Termination + +`final_response_allowed=true` is the preferred deterministic proof that a +campaign may be finalized. For abrupt process termination, the user and agent +may explicitly confirm interruption. This assertion is report provenance only: +it must not mutate campaign state. `--confirm-interrupted` bypasses only stale +stop state. 
Finalization is refused when campaign state reports pending work, +the ledger has a `candidate` row, or an available manifest remains `prepared` +or `ready_for_external_execution`. Finalize or abandon such candidates before +generating an authoritative report. + +## Candidate Lineage + +Candidates are based on the current best source snapshot. A retained candidate +can therefore record `changed_files=none` while inheriting algorithm code from +its `base_candidate`. Follow `base_candidate` links back to baseline and union +changed files across the chain. Mark lineage partial when an ancestor row or +manifest is unavailable. + +## Literature Evidence + +A `literature` row opens a checkpoint. Associate subsequent comparable +candidate rows with that checkpoint until the next literature row. Compare the +best finalized `keep` or `discard` score in that segment with the incumbent +immediately before the review. Pending candidates and crashes are attempts but +never scored improvements, even if a malformed crash row contains a numeric +score: + +- `helped`: a following candidate improved the incumbent; +- `matched`: the best following candidate tied the incumbent; +- `not_confirmed`: candidates ran but none matched or improved the incumbent; +- `failed`: attempts produced no score; +- `not_evaluated`: no candidate attempt followed the checkpoint. + +Preserve `[src: ...]` markers from the checkpoint. These are campaign-recorded +source identifiers, not independently verified citations. + +## Result Selection + +`best` is strictly the best scored retained result: a baseline or `keep` row. +`best_observed` may identify a better scored `discard` as unretained evidence. +`candidate` and `crash` rows are excluded from retained best selection, +running-best milestones, and literature improvements. Literature tables render +status first, so scored crashes remain `crash` and unscored discards remain +`n/a`. + +The objective contract records two distinct provenance fields. 
`metric_source` +describes where measurements came from and defaults to `NVFlare metric +artifacts`. `metric_contract_source` records how the importer selected the +metric, for example `user_request`, `arg:key_metric`, or `default`. + +## Comparability + +`autofl.yaml` describes the imported/declarative contract; exact commands in +`results.tsv` describe what executed. Report both. Compare baseline and best +command options and flag changes to clients, rounds, local epochs/steps, batch +sizes, data partitioning, seed, model architecture, or model-size limits. + +When a test-like metric guided multiple candidate decisions, state that the +selected candidate needs one final evaluation on an untouched holdout. Do not +silently present repeated test-set selection as an unbiased final estimate. + +## Outputs + +- `autofl_final_report.md`: human-readable review artifact; +- `autofl_report_summary.json`: machine-readable summary using schema + `nvflare.autofl.report.v1`; +- `progress.png`: refreshed using the `nvflare-autofl` product plotter when + plotting is available. + +The JSON summary remains `nvflare.autofl.report.v1` and includes +`artifacts.progress_plot_available` and +`objective.metric_contract_source`. Missing plotting dependencies or an +invalid existing PNG produce warnings and `progress_plot_available=false`, but +do not suppress the Markdown or JSON report. The invalid or failed plot +artifact is preserved and is not embedded in Markdown. + +Report generation must be independent of Git and must not edit campaign source, +the ledger, manifests, or state. diff --git a/skills/nvflare-autofl-report/scripts/generate_report.py b/skills/nvflare-autofl-report/scripts/generate_report.py new file mode 100644 index 0000000000..9acdc254dc --- /dev/null +++ b/skills/nvflare-autofl-report/scripts/generate_report.py @@ -0,0 +1,1047 @@ +#!/usr/bin/env python3 +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Generate a reproducible final report for a stopped Auto-FL campaign.""" + +from __future__ import annotations + +import argparse +import csv +import importlib.util +import json +import math +import os +import re +import shlex +import sys +import tempfile +import textwrap +from collections import Counter +from dataclasses import asdict, dataclass +from datetime import datetime, timezone +from pathlib import Path +from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple + +try: + import yaml +except ImportError: # pragma: no cover - NVFlare installs PyYAML + yaml = None + + +SUMMARY_SCHEMA_VERSION = "nvflare.autofl.report.v1" +PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n" +ATTEMPT_STATUSES = {"candidate", "keep", "discard", "crash"} +FINALIZED_SCORE_STATUSES = {"baseline", "keep", "discard"} +RETAINED_STATUSES = {"baseline", "keep"} +LITERATURE_STATUSES = {"literature", "checkpoint", "event"} +PENDING_MANIFEST_STATUSES = {"prepared", "ready_for_external_execution"} +TRAINING_BUDGET_ARGS = { + "aggregation_epochs", + "alpha", + "batch_size", + "eval_batch_size", + "local_train_steps", + "max_model_params", + "min_clients", + "model_arch", + "n_clients", + "num_clients", + "num_rounds", + "seed", +} +FIXED_BUDGET_TO_CLI = { + "min_clients": "min_clients", + "num_clients": "n_clients", + "num_rounds": "num_rounds", +} +SOURCE_RE = re.compile(r"\[src:\s*([^\]]+)\]", re.IGNORECASE) +ARXIV_RE = 
re.compile(r"\barxiv\s*:\s*(\d{4}\.\d{4,5})", re.IGNORECASE) + + +@dataclass(frozen=True) +class RunRecord: + index: int + status: str + name: str + score: Optional[float] + runtime_seconds: float + changed_files: str + diff_summary: str + run_command: str + artifacts: str + failure_reason: str + candidate_manifest: str + base_candidate: str + patch_sha256: str + + +def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("campaign_dir", nargs="?", default=".", help="job directory containing Auto-FL artifacts") + parser.add_argument("--results", default="results.tsv") + parser.add_argument("--state", default=".nvflare/autofl/campaign_state.json") + parser.add_argument("--autofl-yaml", default="autofl.yaml") + parser.add_argument("--progress", default="progress.png") + parser.add_argument("--output", default="autofl_final_report.md") + parser.add_argument("--summary-json", default="autofl_report_summary.json") + parser.add_argument("--plotter", help="override path to nvflare-autofl plot_progress.py") + parser.add_argument("--mode", choices=["max", "min"], help="override objective direction") + parser.add_argument("--metric", help="override optimization metric label") + parser.add_argument("--confirm-interrupted", action="store_true", help="confirm an abruptly interrupted campaign") + parser.add_argument("--agent-model") + parser.add_argument("--reasoning-effort") + parser.add_argument("--agent-cost") + parser.add_argument("--agent-context", help="optional JSON or text file with agent/tooling context") + parser.add_argument("--max-milestones", type=int, default=12) + parser.add_argument("--max-non-improvements", type=int, default=10) + return parser.parse_args(argv) + + +def resolve_path(root: Path, value: str) -> Path: + path = Path(value).expanduser() + return path if path.is_absolute() else root / path + + +def finite_float(value: Any) -> Optional[float]: + try: + 
result = float(value) + except (TypeError, ValueError): + return None + return result if math.isfinite(result) else None + + +def load_results(path: Path) -> List[RunRecord]: + if not path.is_file(): + raise ValueError(f"Auto-FL ledger not found: {path}") + records = [] + with path.open("r", encoding="utf-8", newline="") as f: + reader = csv.DictReader(f, delimiter="\t") + if not reader.fieldnames or not {"status", "score"}.issubset(reader.fieldnames): + raise ValueError(f"Auto-FL ledger is missing required status/score columns: {path}") + for index, row in enumerate(reader): + records.append( + RunRecord( + index=index, + status=(row.get("status") or "").strip().lower(), + name=(row.get("name") or row.get("candidate") or row.get("commit") or f"row_{index}").strip(), + score=finite_float(row.get("score")), + runtime_seconds=max(0.0, finite_float(row.get("runtime_seconds")) or 0.0), + changed_files=(row.get("changed_files") or "").strip(), + diff_summary=(row.get("diff_summary") or row.get("description") or "").strip(), + run_command=(row.get("run_command") or "").strip(), + artifacts=(row.get("artifacts") or "").strip(), + failure_reason=(row.get("failure_reason") or "").strip(), + candidate_manifest=(row.get("candidate_manifest") or "").strip(), + base_candidate=(row.get("base_candidate") or "").strip(), + patch_sha256=(row.get("patch_sha256") or "").strip(), + ) + ) + if not records: + raise ValueError(f"Auto-FL ledger has no rows: {path}") + return records + + +def load_json(path: Path) -> Dict[str, Any]: + try: + value = json.loads(path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError) as exc: + raise ValueError(f"cannot read JSON from {path}: {exc}") from exc + if not isinstance(value, dict): + raise ValueError(f"expected a JSON object in {path}") + return value + + +def load_config(path: Path) -> Dict[str, Any]: + if not path.is_file(): + return {} + if yaml is None: + raise ValueError("PyYAML is required to read autofl.yaml") + try: + 
value = yaml.safe_load(path.read_text(encoding="utf-8")) or {} + except (OSError, yaml.YAMLError) as exc: + raise ValueError(f"cannot read Auto-FL config from {path}: {exc}") from exc + if not isinstance(value, dict): + raise ValueError(f"expected a YAML mapping in {path}") + return value + + +def better(score: Optional[float], incumbent: Optional[float], mode: str) -> bool: + if score is None: + return False + if incumbent is None: + return True + return score > incumbent if mode == "max" else score < incumbent + + +def normalize_contract_sections(config: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]: + normalized = dict(config) + warnings = [] + for section in ("objective", "environment", "budget"): + if section not in normalized: + continue + value = normalized[section] + if not isinstance(value, dict): + kind = "null" if value is None else type(value).__name__ + warnings.append( + f"autofl.yaml section '{section}' is {kind}, not a mapping; it was treated as an empty section." + ) + normalized[section] = {} + return normalized, warnings + + +def metric_contract( + config: Dict[str, Any], state: Dict[str, Any], args: argparse.Namespace +) -> Tuple[str, str, str, str]: + objective = config.get("objective") if isinstance(config.get("objective"), dict) else {} + metric = ( + args.metric or objective.get("optimization_metric") or objective.get("metric") or state.get("metric") or "score" + ) + requested = objective.get("requested_metric") or objective.get("metric") or metric + measurement_source = objective.get("metric_source") or "NVFlare metric artifacts" + contract_source = objective.get("source") or "not declared" + return str(metric), str(requested), str(measurement_source), str(contract_source) + + +def infer_mode(config: Dict[str, Any], state: Dict[str, Any], requested: Optional[str]) -> str: + if requested: + return requested + objective = config.get("objective") if isinstance(config.get("objective"), dict) else {} + value = objective.get("mode") or 
objective.get("direction") or state.get("mode") or "max" + return str(value).lower() if str(value).lower() in {"max", "min"} else "max" + + +def verify_stopped(state_path: Path, confirm_interrupted: bool) -> Tuple[Dict[str, Any], str, List[str]]: + warnings = [] + if not state_path.is_file(): + if not confirm_interrupted: + raise ValueError( + f"campaign state not found at {state_path}; use --confirm-interrupted only after confirming execution stopped" + ) + warnings.append("Campaign state was unavailable; the user explicitly confirmed interruption.") + return {}, "user_confirmed_interruption", warnings + + state = load_json(state_path) + if state.get("final_response_allowed") is True: + return state, str(state.get("reason") or "campaign_stopped"), warnings + if not confirm_interrupted: + raise ValueError( + "campaign state still has final_response_allowed=false; stop the campaign or pass --confirm-interrupted " + "after independently confirming its processes are no longer running" + ) + warnings.append("Campaign state remained active; the user explicitly confirmed that execution was interrupted.") + return state, "user_confirmed_interruption", warnings + + +def is_baseline(record: RunRecord) -> bool: + name = record.name.strip().lower() + command = record.run_command.lower() + return ( + record.status == "baseline" + or name == "baseline" + or name.startswith("baseline_") + or "--name baseline" in command + ) + + +def is_finalized_scored_record(record: RunRecord) -> bool: + if record.score is None or record.status in {"candidate", "crash"}: + return False + return is_baseline(record) or record.status in FINALIZED_SCORE_STATUSES + + +def pending_manifest_paths(campaign_root: Path, records: Sequence[RunRecord]) -> List[Path]: + candidates_root = campaign_root / ".nvflare" / "autofl" / "candidates" + paths = set(candidates_root.glob("*/candidate_manifest.json")) if candidates_root.is_dir() else set() + paths.update( + resolve_path(campaign_root, 
record.candidate_manifest) for record in records if record.candidate_manifest + ) + pending = [] + for path in sorted(paths): + if not path.is_file(): + continue + try: + manifest = load_json(path) + except ValueError: + continue + if str(manifest.get("status") or "").strip().lower() in PENDING_MANIFEST_STATUSES: + pending.append(path) + return pending + + +def verify_no_pending_candidates(campaign_root: Path, state: Dict[str, Any], records: Sequence[RunRecord]) -> None: + evidence = [] + pending_count = finite_float(state.get("pending_candidates")) + if pending_count is not None and pending_count > 0: + evidence.append(f"campaign state reports pending_candidates={state.get('pending_candidates')}") + if state.get("pending_candidate_manifest"): + evidence.append(f"campaign state names {state['pending_candidate_manifest']}") + + ledger_rows = [record for record in records if record.status == "candidate"] + if ledger_rows: + rows = ", ".join(f"row {record.index + 1} ({record.name})" for record in ledger_rows[:5]) + if len(ledger_rows) > 5: + rows += f", and {len(ledger_rows) - 5} more" + evidence.append(f"results.tsv contains pending candidate rows: {rows}") + + manifests = pending_manifest_paths(campaign_root, records) + if manifests: + paths = ", ".join(str(path.resolve()) for path in manifests[:3]) + if len(manifests) > 3: + paths += f", and {len(manifests) - 3} more" + evidence.append(f"candidate manifests remain prepared for execution: {paths}") + + if evidence: + raise ValueError( + "cannot finalize while candidate evidence is unfinished; finalize or abandon pending candidates first. " + "--confirm-interrupted bypasses only stale campaign stop state, never unfinished candidate evidence. 
" + + " ".join(evidence) + ) + + +def scored_records(records: Iterable[RunRecord]) -> List[RunRecord]: + return [record for record in records if is_finalized_scored_record(record)] + + +def select_baseline(records: Sequence[RunRecord]) -> Optional[RunRecord]: + return next((record for record in records if is_baseline(record) and is_finalized_scored_record(record)), None) + + +def select_best(records: Sequence[RunRecord], mode: str, retained_only: bool = False) -> Optional[RunRecord]: + candidates = [ + record + for record in records + if is_finalized_scored_record(record) + and ( + (is_baseline(record) or record.status in RETAINED_STATUSES) + if retained_only + else (is_baseline(record) or record.status in FINALIZED_SCORE_STATUSES) + ) + ] + if not candidates: + return None + return (max if mode == "max" else min)(candidates, key=lambda record: record.score) + + +def running_best_milestones(records: Sequence[RunRecord], mode: str, limit: int) -> List[Dict[str, Any]]: + milestones = [] + incumbent = None + for record in records: + if not is_finalized_scored_record(record) or not better(record.score, incumbent, mode): + continue + previous = incumbent + incumbent = record.score + milestones.append( + { + "row": record.index + 1, + "name": record.name, + "status": record.status, + "score": record.score, + "delta_from_previous_best": None if previous is None else record.score - previous, + "hypothesis": record.diff_summary, + "changed_files": split_files(record.changed_files), + } + ) + if limit > 0 and len(milestones) > limit: + if limit == 1: + return [milestones[-1]] + indices = [round(index * (len(milestones) - 1) / (limit - 1)) for index in range(limit)] + return [milestones[index] for index in dict.fromkeys(indices)] + return milestones + + +def split_files(value: str) -> List[str]: + return [item.strip() for item in value.split(",") if item.strip() and item.strip().lower() != "none"] + + +def candidate_lineage(best: Optional[RunRecord], records: 
Sequence[RunRecord]) -> Dict[str, Any]: + if best is None: + return {"candidates": [], "changed_files": [], "complete": False} + by_name = {record.name: record for record in records if record.name} + chain = [] + changed_files = [] + seen = set() + current = best + complete = True + while current: + if current.name in seen: + complete = False + break + seen.add(current.name) + chain.append(current.name) + changed_files.extend(split_files(current.changed_files)) + if not current.base_candidate: + complete = current.status == "baseline" or current.name == "baseline" + break + base_candidate = current.base_candidate + current = by_name.get(base_candidate) + if current is None: + chain.append(base_candidate) + complete = False + break + return { + "candidates": list(reversed(chain)), + "changed_files": sorted(set(changed_files)), + "complete": complete, + } + + +def parse_sources(text: str) -> List[str]: + sources = [] + for match in SOURCE_RE.findall(text or ""): + for item in re.split(r"\s*;\s*", match): + if item and item not in sources: + sources.append(item) + for identifier in ARXIV_RE.findall(text or ""): + source = f"arXiv:{identifier}" + if not any(identifier in existing for existing in sources): + sources.append(source) + return sources + + +def manifest_summary(record: Optional[RunRecord], campaign_root: Path) -> Dict[str, Any]: + if record is None or not record.candidate_manifest: + return {} + path = resolve_path(campaign_root, record.candidate_manifest) + if not path.is_file(): + return {"path": str(path), "available": False} + try: + manifest = load_json(path) + except ValueError as exc: + return {"path": str(path), "available": False, "error": str(exc)} + return { + "path": str(path.resolve()), + "available": True, + "schema_version": manifest.get("schema_version"), + "candidate_id": manifest.get("candidate_id"), + "base_candidate": manifest.get("base_candidate"), + "hypothesis": manifest.get("hypothesis"), + "run_args": manifest.get("run_args") or [], 
+ "changed_files": manifest.get("changed_files") or [], + "created_files": manifest.get("created_files") or [], + "source_sha256": manifest.get("candidate_source_sha256") or manifest.get("base_source_sha256"), + "budget_sha256": manifest.get("fixed_budget_sha256") or manifest.get("budget_sha256"), + "patch_sha256": manifest.get("patch_sha256"), + "status": manifest.get("status"), + "artifacts": manifest.get("artifacts") or {}, + "result": manifest.get("result") or {}, + } + + +def literature_outcomes(records: Sequence[RunRecord], mode: str) -> List[Dict[str, Any]]: + event_indices = [index for index, record in enumerate(records) if record.status in LITERATURE_STATUSES] + outcomes = [] + for event_number, start in enumerate(event_indices): + event = records[start] + end = event_indices[event_number + 1] if event_number + 1 < len(event_indices) else len(records) + before = select_best(records[:start], mode) + attempts = [ + record + for record in records[start + 1 : end] + if record.status in ATTEMPT_STATUSES and not is_baseline(record) + ] + scored = [record for record in attempts if is_finalized_scored_record(record)] + segment_best = select_best(scored, mode) + if segment_best is None: + outcome = "failed" if attempts else "not_evaluated" + delta = None + elif before is None or better(segment_best.score, before.score, mode): + outcome = "helped" + delta = None if before is None else segment_best.score - before.score + elif segment_best.score == before.score: + outcome = "matched" + delta = 0.0 + else: + outcome = "not_confirmed" + delta = segment_best.score - before.score + outcomes.append( + { + "event": event.name, + "row": event.index + 1, + "hypothesis": event.diff_summary, + "sources": parse_sources(event.diff_summary), + "outcome": outcome, + "incumbent_score": None if before is None else before.score, + "best_candidate": None if segment_best is None else segment_best.name, + "best_score": None if segment_best is None else segment_best.score, + 
"delta_from_incumbent": delta, + "candidate_attempts": [record.name for record in attempts], + "candidate_results": [ + {"name": record.name, "status": record.status, "score": record.score} for record in attempts + ], + "failures": [record.name for record in attempts if record.status == "crash"], + } + ) + return outcomes + + +def command_options(command: str) -> Dict[str, Any]: + if not command: + return {} + try: + tokens = shlex.split(command) + except ValueError: + return {} + options: Dict[str, Any] = {} + index = 0 + while index < len(tokens): + token = tokens[index] + if not token.startswith("--"): + index += 1 + continue + key_value = token[2:].split("=", 1) + key = key_value[0].replace("-", "_") + if len(key_value) == 2: + options[key] = key_value[1] + elif index + 1 < len(tokens) and not tokens[index + 1].startswith("--"): + options[key] = tokens[index + 1] + index += 1 + else: + options[key] = True + index += 1 + return options + + +def command_changes(baseline: Optional[RunRecord], best: Optional[RunRecord]) -> Dict[str, Dict[str, Any]]: + if baseline is None or best is None: + return {} + base = command_options(baseline.run_command) + candidate = command_options(best.run_command) + changes = {} + for key in sorted(set(base) | set(candidate)): + if key == "name" or base.get(key) == candidate.get(key): + continue + changes[key] = {"baseline": base.get(key), "best": candidate.get(key)} + return changes + + +def values_equal(left: Any, right: Any) -> bool: + if str(left) == str(right): + return True + left_number = finite_float(left) + right_number = finite_float(right) + return left_number is not None and right_number is not None and math.isclose(left_number, right_number) + + +def comparability_warnings( + config: Dict[str, Any], + records: Sequence[RunRecord], + baseline: Optional[RunRecord], + best: Optional[RunRecord], + metric: str, + metric_source: str, +) -> Tuple[List[str], Dict[str, Dict[str, Any]]]: + warnings = [] + changes = 
command_changes(baseline, best) + changed_budget = sorted(set(changes).intersection(TRAINING_BUDGET_ARGS)) + if changed_budget: + warnings.append( + "The best run changed executed training/comparison arguments relative to baseline: " + + ", ".join(changed_budget) + + ". Treat the gain as non-equal-compute unless this was explicitly approved." + ) + + budget = config.get("budget") if isinstance(config.get("budget"), dict) else {} + fixed = budget.get("fixed_training_budget", {}) + baseline_options = command_options(baseline.run_command) if baseline else {} + fixed_mismatches = [] + if isinstance(fixed, dict): + for config_key, expected in fixed.items(): + cli_key = FIXED_BUDGET_TO_CLI.get(config_key, config_key) + actual = baseline_options.get(cli_key) + if actual is not None and not values_equal(actual, expected): + fixed_mismatches.append(f"{config_key}: autofl.yaml={expected}, executed={actual}") + if fixed_mismatches: + warnings.append( + "The imported fixed budget differs from the executed baseline command (" + + "; ".join(fixed_mismatches) + + ")." + ) + + candidate_count = len( + [record for record in records if record.status in ATTEMPT_STATUSES and not is_baseline(record)] + ) + metric_text = f"{metric} {metric_source}".lower() + if candidate_count > 1 and "test" in metric_text: + warnings.append( + "Multiple candidates were selected against a test-like metric. Re-evaluate the chosen candidate once on an " + "untouched holdout before making generalization claims." + ) + if not config: + warnings.append( + "autofl.yaml was unavailable, so the declared objective and fixed-budget contract could not be verified." 
+ ) + return warnings, changes + + +def default_plotter_path() -> Path: + return Path(__file__).resolve().parents[2] / "nvflare-autofl" / "scripts" / "plot_progress.py" + + +def refresh_plot(results: Path, output: Path, mode: str, metric: str, plotter_path: Path) -> Optional[str]: + if not plotter_path.is_file(): + return f"Auto-FL progress plotter not found at {plotter_path}; existing plot was preserved." + spec = importlib.util.spec_from_file_location("nvflare_autofl_report_plotter", plotter_path) + if spec is None or spec.loader is None: + return f"Could not load Auto-FL progress plotter from {plotter_path}; existing plot was preserved." + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + try: + spec.loader.exec_module(module) + module.plot_progress(module.load_results(results), output, mode, metric) + except Exception as exc: # plotting should not destroy an otherwise useful stopped-campaign report + return f"Could not refresh progress plot ({type(exc).__name__}: {exc}); existing plot was preserved." + return None + + +def is_png(path: Path) -> bool: + try: + with path.open("rb") as f: + return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE + except OSError: + return False + + +def compact_lineage(lineage: Sequence[str], edge_items: int = 4) -> str: + if len(lineage) <= edge_items * 2 + 1: + return " -> ".join(lineage) + hidden = len(lineage) - edge_items * 2 + return " -> ".join([*lineage[:edge_items], f"... 
({hidden} intermediate)", *lineage[-edge_items:]]) + + +def read_agent_context(path: Optional[Path], args: argparse.Namespace) -> Dict[str, Any]: + context = {} + if path: + if not path.is_file(): + raise ValueError(f"agent context file not found: {path}") + text = path.read_text(encoding="utf-8").strip() + try: + parsed = json.loads(text) + except json.JSONDecodeError: + context["notes"] = text + else: + context.update(parsed if isinstance(parsed, dict) else {"notes": parsed}) + if args.agent_model: + context["model"] = args.agent_model + if args.reasoning_effort: + context["reasoning_effort"] = args.reasoning_effort + if args.agent_cost: + context["cost"] = args.agent_cost + return context + + +def atomic_write_text(path: Path, text: str) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + temp_path = None + try: + with tempfile.NamedTemporaryFile( + "w", encoding="utf-8", dir=path.parent, prefix=f".{path.name}.", delete=False + ) as f: + f.write(text) + temp_path = Path(f.name) + os.replace(temp_path, path) + finally: + if temp_path and temp_path.exists(): + temp_path.unlink() + + +def md_cell(value: Any, limit: int = 180) -> str: + text = " ".join(str(value or "").split()).replace("|", "\\|") + return text if len(text) <= limit else text[: limit - 3] + "..." 
+ + +def format_score(value: Optional[float]) -> str: + return "n/a" if value is None else f"{value:.6f}" + + +def format_delta(value: Optional[float]) -> str: + return "n/a" if value is None else f"{value:+.6f}" + + +def format_runtime(seconds: float) -> str: + if seconds <= 0: + return "" + if seconds < 3600: + return f"{seconds / 60:.1f}m" + return f"{seconds / 3600:.2f}h" + + +def format_command(command: str) -> str: + try: + tokens = shlex.split(command) + except ValueError: + return command or "not recorded" + if not tokens: + return "not recorded" + chunks = [] + current = [] + index = 0 + while index < len(tokens): + token = tokens[index] + if token.startswith("--") and current: + chunks.append(current) + current = [] + current.append(token) + if token.startswith("--") and index + 1 < len(tokens) and not tokens[index + 1].startswith("--"): + index += 1 + current.append(tokens[index]) + index += 1 + if current: + chunks.append(current) + return " \\\n ".join(shlex.join(chunk) for chunk in chunks) + + +def wrap_markdown_bullet(prefix: str, text: str, width: int = 120) -> List[str]: + return textwrap.wrap( + f"{prefix}{' '.join(text.split())}", + width=width, + subsequent_indent=" ", + break_long_words=False, + break_on_hyphens=False, + ) + + +def report_markdown(summary: Dict[str, Any], records: Sequence[RunRecord], max_non_improvements: int) -> str: + baseline = summary["baseline"] + best = summary["best"] + counts = summary["status_counts"] + lines = [ + "# Auto-FL Final Report", + "", + "## Executive Summary", + "", + f"The stopped campaign optimized `{summary['objective']['optimization_metric']}` in " + f"`{summary['environment']}` mode. 
It evaluated `{summary['candidate_attempts']}` candidate attempts " + f"across `{len(records)}` ledger rows in `{format_runtime(summary['runtime_seconds'])}`.", + "", + ] + if best: + baseline_score = baseline["score"] if baseline else None + delta = None if baseline_score is None else best["score"] - baseline_score + lines.append( + f"Best retained candidate: `{best['name']}` at " + f"`{summary['objective']['optimization_metric']}={format_score(best['score'])}` " + f"(baseline `{format_score(baseline_score)}`, delta `{format_delta(delta)}`)." + ) + else: + observed = summary["best_observed"] + if observed: + lines.append( + "No scored result was retained. Best unretained observed result: " + f"`{observed['name']}` at " + f"`{summary['objective']['optimization_metric']}={format_score(observed['score'])}` " + f"with status `{observed['status']}`." + ) + else: + lines.append("No finalized scored result was available.") + progress_lines = ( + [f"![Auto-FL progress]({summary['artifacts']['progress_plot']})"] + if summary["artifacts"]["progress_plot_available"] + else ["Progress plot unavailable; see the validation and comparability warnings below."] + ) + lines.extend( + [ + "", + f"Termination: `{summary['termination']['reason']}`. 
" + f"Final campaign state allowed: `{summary['termination']['state_allowed_final_response']}`.", + "", + "## Campaign Contract", + "", + f"- Requested metric: `{summary['objective']['requested_metric']}`", + f"- Optimization metric: `{summary['objective']['optimization_metric']}`", + f"- Metric source: `{summary['objective']['metric_source']}`", + f"- Metric contract source: `{summary['objective']['metric_contract_source']}`", + f"- Direction: `{summary['objective']['mode']}`", + f"- Candidate cap: `{summary['candidate_cap']}`", + f"- Declared fixed budget: `{json.dumps(summary['declared_fixed_budget'], sort_keys=True)}`", + "", + "## Progress", + "", + *progress_lines, + "", + "## Optimization Trajectory", + "", + "| Row | Candidate | Status | Score | Delta from previous best | Hypothesis |", + "| ---: | --- | --- | ---: | ---: | --- |", + ] + ) + for item in summary["milestones"]: + lines.append( + f"| {item['row']} | `{md_cell(item['name'], 50)}` | {item['status']} | {format_score(item['score'])} | " + f"{format_delta(item['delta_from_previous_best'])} | {md_cell(item['hypothesis'])} |" + ) + if not summary["milestones"]: + lines.append("| | | | | | No scored milestones. 
|") + + lines.extend(["", "## Best Candidate Provenance", ""]) + if best: + lines.extend( + [ + f"- Candidate: `{best['name']}`", + f"- Status: `{best['status']}`", + f"- Base lineage: `{compact_lineage(summary['best_lineage']['candidates']) or 'unavailable'}`", + f"- Cumulative changed files: `{', '.join(summary['best_lineage']['changed_files']) or 'none recorded'}`", + f"- Candidate manifest: `{best['candidate_manifest'] or 'not recorded'}`", + f"- Manifest available: `{summary['best_manifest'].get('available', False)}`", + f"- Manifest budget SHA-256: `{summary['best_manifest'].get('budget_sha256') or 'not recorded'}`", + f"- Patch SHA-256: `{best['patch_sha256'] or 'not recorded'}`", + f"- Artifacts: `{best['artifacts'] or 'not recorded'}`", + "", + "Executed baseline command:", + "", + "```text", + format_command(baseline["run_command"]) if baseline else "not recorded", + "```", + "", + "Executed best-candidate command:", + "", + "```text", + format_command(best["run_command"]), + "```", + ] + ) + else: + lines.append("No retained scored candidate provenance was available.") + + lines.extend(["", "## Literature Review Outcomes", ""]) + if summary["literature_reviews"]: + lines.extend( + [ + "| Checkpoint | Sources | Outcome | Candidate evidence | Delta vs incumbent |", + "| --- | --- | --- | --- | ---: |", + ] + ) + for item in summary["literature_reviews"]: + evidence_items = [] + for candidate in item["candidate_results"]: + if candidate["status"] == "crash": + result = "crash" + elif candidate["status"] == "candidate": + result = "pending" + else: + result = format_score(candidate["score"]) + evidence_items.append(f"{candidate['name']}={result}") + evidence = "; ".join(evidence_items) or "no candidate recorded" + lines.append( + f"| `{md_cell(item['event'], 45)}` | {md_cell('; '.join(item['sources']) or 'not recorded', 100)} | " + f"{item['outcome']} | `{md_cell(evidence, 130)}` | {format_delta(item['delta_from_incumbent'])} |" + ) + lines.extend(["", 
"Checkpoint hypotheses and decisions:", ""]) + for item in summary["literature_reviews"]: + candidates = ", ".join(item["candidate_attempts"]) or "none recorded" + lines.extend( + wrap_markdown_bullet( + f"- **{item['event']} ({item['outcome']}):** ", + f"{item['hypothesis']} Measured candidates: `{md_cell(candidates, 300)}`.", + ) + ) + lines.extend( + [ + "", + "Source labels above are campaign-recorded identifiers and were not independently verified by the report helper.", + ] + ) + else: + lines.append("No literature checkpoints were recorded in the ledger.") + + lines.extend( + [ + "", + "## Runtime And Reliability", + "", + f"- Total recorded runtime: `{format_runtime(summary['runtime_seconds'])}`", + f"- Status counts: `{json.dumps(counts, sort_keys=True)}`", + f"- Scored comparable runs: `{summary['scored_runs']}`", + f"- Crashes: `{counts.get('crash', 0)}`", + "", + "## Null, Worse, Or Unstable Ideas", + "", + ] + ) + non_improvements = [record for record in records if record.status in {"discard", "crash"}] + for record in non_improvements[: max(0, max_non_improvements)]: + evidence = record.failure_reason or f"score={format_score(record.score)}" + lines.append(f"- `{record.name}`: {md_cell(record.diff_summary, 240)} Evidence: `{md_cell(evidence, 120)}`.") + if not non_improvements: + lines.append("No discarded or crashed candidates were recorded.") + + lines.extend(["", "## Validation And Comparability Notes", ""]) + if summary["warnings"]: + lines.extend(f"- {warning}" for warning in summary["warnings"]) + else: + lines.append("- No deterministic comparability warning was detected from the available artifacts.") + if summary["best_command_changes"]: + lines.extend(["", "Executed argument changes from baseline to best:", ""]) + for key, value in summary["best_command_changes"].items(): + lines.append(f"- `{key}`: `{value['baseline']}` -> `{value['best']}`") + + lines.extend(["", "## Agent And Tooling Context", ""]) + if summary["agent_context"]: + 
lines.append("```json") + lines.append(json.dumps(summary["agent_context"], indent=2, sort_keys=True)) + lines.append("```") + else: + lines.append("Agent model, reasoning effort, and cost were not supplied to the report generator.") + + lines.extend( + [ + "", + "## Reproduction Recommendations", + "", + "1. Re-run the baseline and selected candidate from the exact commands above in the same NVFLARE environment.", + "2. Confirm the candidate on additional seeds or sites before treating a single-run improvement as robust.", + "3. When a test-like metric guided selection, perform one final evaluation on an untouched holdout.", + "4. Preserve the candidate manifest, patch hash, ledger, campaign config, and downloaded NVFLARE artifacts.", + "", + "## Report Artifacts", + "", + f"- Auto-FL config: `{summary['artifacts']['autofl_yaml']}`", + f"- Results ledger: `{summary['artifacts']['results']}`", + f"- Campaign state: `{summary['artifacts']['campaign_state']}`", + f"- Progress plot: `{summary['artifacts']['progress_plot']}`", + f"- Progress plot available: `{summary['artifacts']['progress_plot_available']}`", + f"- Machine-readable summary: `{summary['artifacts']['summary_json']}`", + f"- Markdown report: `{summary['artifacts']['report']}`", + "", + "## Technical Appendix", + "", + f"Generated at `{summary['generated_at']}` with schema `{summary['schema_version']}`.", + ] + ) + return "\n".join(lines) + "\n" + + +def record_payload(record: Optional[RunRecord]) -> Optional[Dict[str, Any]]: + return asdict(record) if record else None + + +def generate(args: argparse.Namespace) -> Dict[str, Any]: + root = Path(args.campaign_dir).expanduser().resolve() + results_path = resolve_path(root, args.results) + state_path = resolve_path(root, args.state) + config_path = resolve_path(root, args.autofl_yaml) + progress_path = resolve_path(root, args.progress) + report_path = resolve_path(root, args.output) + summary_path = resolve_path(root, args.summary_json) + agent_context_path 
= resolve_path(root, args.agent_context) if args.agent_context else None + + records = load_results(results_path) + state, termination_reason, warnings = verify_stopped(state_path, args.confirm_interrupted) + verify_no_pending_candidates(root, state, records) + config = load_config(config_path) + config, config_warnings = normalize_contract_sections(config) + warnings.extend(config_warnings) + mode = infer_mode(config, state, args.mode) + metric, requested_metric, metric_source, metric_contract_source = metric_contract(config, state, args) + baseline = select_baseline(records) + best = select_best(records, mode, retained_only=True) + observed_best = select_best(records, mode) + if best is None and observed_best is not None: + warnings.append( + f"Best observed row {observed_best.name} was not retained; no scored baseline or kept candidate is available." + ) + elif best and observed_best and best.name != observed_best.name: + warnings.append( + f"Best observed row {observed_best.name} was not retained; the report identifies retained candidate {best.name}." + ) + + plot_warning = refresh_plot( + results_path, + progress_path, + mode, + metric, + resolve_path(root, args.plotter).resolve() if args.plotter else default_plotter_path(), + ) + if plot_warning: + warnings.append(plot_warning) + progress_plot_available = is_png(progress_path) + if not progress_plot_available: + warnings.append( + f"Progress plot at {progress_path} is missing or is not a valid PNG; the report was generated without " + "embedding it." 
+ ) + + comparison_warnings, changes = comparability_warnings(config, records, baseline, best, metric, metric_source) + warnings.extend(comparison_warnings) + status_counts = dict(sorted(Counter(record.status or "unknown" for record in records).items())) + scored = scored_records(records) + candidate_attempts = len( + [record for record in records if record.status in ATTEMPT_STATUSES and not is_baseline(record)] + ) + environment = config.get("environment") if isinstance(config.get("environment"), dict) else {} + budget = config.get("budget") if isinstance(config.get("budget"), dict) else {} + fixed_budget = budget.get("fixed_training_budget", {}) + cap = state.get("candidate_cap") + cap_label = "uncapped" if cap in {None, "", 0} else cap + summary = { + "schema_version": SUMMARY_SCHEMA_VERSION, + "generated_at": datetime.now(timezone.utc).isoformat(), + "termination": { + "reason": termination_reason, + "state_allowed_final_response": state.get("final_response_allowed") is True, + "user_confirmed_interruption": bool(args.confirm_interrupted), + }, + "objective": { + "requested_metric": requested_metric, + "optimization_metric": metric, + "metric_source": metric_source, + "metric_contract_source": metric_contract_source, + "mode": mode, + }, + "environment": str(environment.get("requested") or state.get("environment") or "not declared"), + "candidate_cap": cap_label, + "candidate_cap_source": state.get("candidate_cap_source") or "not recorded", + "declared_fixed_budget": fixed_budget if isinstance(fixed_budget, dict) else {}, + "candidate_attempts": candidate_attempts, + "scored_runs": len(scored), + "runtime_seconds": sum(record.runtime_seconds for record in records), + "status_counts": status_counts, + "baseline": record_payload(baseline), + "best": record_payload(best), + "best_observed": record_payload(observed_best), + "best_lineage": candidate_lineage(best, records), + "best_manifest": manifest_summary(best, root), + "best_command_changes": changes, + 
"milestones": running_best_milestones(records, mode, args.max_milestones), + "literature_reviews": literature_outcomes(records, mode), + "warnings": warnings, + "agent_context": read_agent_context(agent_context_path, args), + "artifacts": { + "autofl_yaml": str(config_path.resolve()), + "results": str(results_path.resolve()), + "campaign_state": str(state_path.resolve()), + "progress_plot": str(progress_path.resolve()), + "progress_plot_available": progress_plot_available, + "summary_json": str(summary_path.resolve()), + "report": str(report_path.resolve()), + }, + } + atomic_write_text(report_path, report_markdown(summary, records, args.max_non_improvements)) + atomic_write_text(summary_path, json.dumps(summary, indent=2, sort_keys=True) + "\n") + return summary + + +def main(argv: Optional[Sequence[str]] = None) -> int: + args = parse_args(argv) + try: + summary = generate(args) + except ValueError as exc: + print(f"error: {exc}", file=sys.stderr) + return 2 + print(json.dumps({"status": "ok", "artifacts": summary["artifacts"], "best": summary["best"]}, sort_keys=True)) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/skills/nvflare-autofl-report/tests/helper_scripts.md b/skills/nvflare-autofl-report/tests/helper_scripts.md new file mode 100644 index 0000000000..a10a33a0a3 --- /dev/null +++ b/skills/nvflare-autofl-report/tests/helper_scripts.md @@ -0,0 +1,17 @@ +# Helper Script Checks + +The report helper is covered by +`tests/unit_test/tool/autofl_skill_report_test.py`. 
+ +Run the focused checks from the repository root: + +```bash +pytest tests/unit_test/tool/autofl_skill_report_test.py -q +``` + +The tests exercise stopped-state admission, pending-candidate refusal, explicit +interruption confirmation, retained versus observed result selection, +Markdown/JSON artifacts with optional plot availability, max/min objectives, +literature outcome synthesis, candidate lineage, malformed contract sections, +metric provenance, budget and test-metric warnings, guard/plotter parity, +optional agent context, and operation outside a Git repository. diff --git a/skills/nvflare-autofl/BENCHMARK.md b/skills/nvflare-autofl/BENCHMARK.md new file mode 100644 index 0000000000..e8e9174f69 --- /dev/null +++ b/skills/nvflare-autofl/BENCHMARK.md @@ -0,0 +1,24 @@ +# Benchmark Summary + +Status: draft/internal pending runtime evaluation. + +Skill version: 0.1.0 +FLARE version: 2.8.0 minimum + +## Initial Checks + +| Check | Status | Notes | +| --- | --- | --- | +| Positive trigger | Draft | `autofl-optimize-existing-job` defines the primary Auto-FL prompt. | +| Adjacent negative trigger | Draft | PyTorch conversion routes to `nvflare-convert-pytorch`. | +| Diagnosis negative trigger | Draft | Failed-job diagnosis routes to `nvflare-diagnose-job`. | +| Global negative trigger | Draft | Non-FLARE prompts route to no skill. | +| Mandatory behavior | Draft | Behavior IDs cover deterministic import, agent-authored code candidates, candidate manifests, bounded edits, and existing FLARE execution. | +| Prohibited behavior | Draft | Behavior IDs prohibit bypassing policy, editing outside allowed paths, and treating `autofl.yaml` as an exported job. | +| Process evaluation | Draft | Metrics cover import-first behavior, contract preservation, score extraction, and unwanted production actions. | + +## Known Gaps + +- Runtime agent-performance scoring has not been run yet. +- Adjacent negatives should be expanded after more NVFLARE workflow skills land. 
+- Production execution behavior needs site-policy fixture coverage before the skill graduates from draft status. diff --git a/skills/nvflare-autofl/SKILL.md b/skills/nvflare-autofl/SKILL.md new file mode 100644 index 0000000000..8d248d9195 --- /dev/null +++ b/skills/nvflare-autofl/SKILL.md @@ -0,0 +1,197 @@ +--- +name: nvflare-autofl +description: "Optimize an existing NVFLARE job.py through an agent-assisted Auto-FL campaign that preserves FLARE execution, policy, artifacts, and reproducibility." +min_flare_version: "2.8.0" +blast_radius: submits_production +skill_version: "0.1.0" +--- + +# NVFLARE Auto-FL + +## Use When + +Use this skill when the user asks to optimize an existing NVFLARE `job.py` for +accuracy, AUC, loss, runtime, robustness, or another metric in simulation, POC, +or production. + +## Do Not Use When + +Do not use for converting non-FL training code into NVFLARE, diagnosing failed +jobs without an optimization goal, production deployment setup, or generic +hyperparameter tuning outside an NVFLARE job. + +## Workflow + +Use this skill to optimize an existing NVFLARE `job.py` without asking the user +to learn a new Auto-FL command tree. The user selects this skill, points to a +job, and states the objective, environment, and optional budget. NVFLARE +provides the deterministic campaign import, execution substrate, policy +boundaries, artifacts, and machine-readable contracts. The coding agent owns +hypotheses, source edits, new algorithm implementations, and candidate choice. 
+ +Initialize the campaign and baseline through the bundled helper: + +```bash +python "$CODEX_HOME/skills/nvflare-autofl/scripts/run_job_campaign.py" initialize ./job.py --metric --mode --env [--max-candidates ] +``` + +Read `autofl.yaml` and the JSON response, then prepare an agent-authored +candidate with a short hypothesis and optional candidate-only arguments: + +```bash +python "$CODEX_HOME/skills/nvflare-autofl/scripts/run_job_campaign.py" prepare ./job.py --name --hypothesis "" [--run-args ""] +``` + +Edit only the returned candidate source directory. Modify existing allowed +files or add Python modules under the job root; do not edit the live best source +directly. Then evaluate the manifest: + +```bash +python "$CODEX_HOME/skills/nvflare-autofl/scripts/run_job_campaign.py" evaluate ./job.py --manifest +``` + +Simulation evaluation runs the candidate immediately. POC and production +evaluation validates and materializes the candidate; submit it with standard +`nvflare job` commands, then call `record` with the manifest, job ID, artifacts, +and score. Use `abandon` to restore a pending candidate. Use `suggest` only when +you want deterministic tunable seeds; suggestions are never executed +automatically and do not limit agent-authored code candidates. + +If the job directory contains a task-local `mutation_schema.yaml`, treat its +`comparison_budget_args.default_candidate_budget` and mutation bounds as +authoritative. Invalid generated proposals are product friction, not campaign +blockers; preserve the same campaign and continue with another same-budget +candidate. + +Treat `job.py`, generated `autofl.yaml`, and optional job-local +`mutation_schema.yaml` as campaign inputs. Do not require example-specific +runbooks, branches, or initialization scripts. + +Request escalated execution for the runner command because NVFLARE simulator +runs create local sockets that fail inside the restricted Codex sandbox. 
If a +runner command reports a sandbox/socket permission failure, treat it as an +infrastructure retry, not as a candidate result, and rerun the same command with +escalated execution. + +The helper owns deterministic import, source snapshots, candidate validation, +execution, restoration, counting, ledger updates, campaign state, plotting, +and reports. After each lifecycle action, read +`.nvflare/autofl/campaign_state.json` and only finalize when +`final_response_allowed=true`. For long-running and simulator-stall handling, read +[continuous-campaigns.md](references/continuous-campaigns.md). + +Read `autofl.yaml` and show the user a concise campaign summary: + +- **Editable**: metric, environment, candidate budget, tunables, artifact + locations, `objective.optimization_metric`, metric source, source hash, and importer version. +- **Unresolved**: dynamic defaults, unsupported Python semantics, missing + metric sources, unknown data paths, or any low-confidence fields. +- **Allowed**: files the agent may edit, Python source it may create, + fixed-budget fields it must preserve, and policy boundaries for the requested + environment. + +Treat `autofl.yaml` as the human-reviewable Auto-FL campaign config, not as a +replacement for `job.py` or an exported NVFLARE job folder. Use the original +`job.py` as the runnable experiment entry point throughout the candidate loop. + +If `autofl.yaml` contains unresolved fields that affect execution safety, +candidate comparability, or production submission, ask the user to resolve those +specific fields before running candidates. + +## Requirements + +- Edit existing files only through candidate drafts and within + `job.allowed_edit_paths`. New Python modules may match + `job.allowed_create_patterns` under the job root. +- Preserve `budget.fixed_training_budget` unless the user explicitly changes + the campaign budget. 
+- If the environment provides `PYTHON`, `VIRTUAL_ENV`, or a venv on `PATH`, + treat that prepared runtime as authoritative. Verify it, then use it for + import, validation, execution, metric extraction, plotting, and reporting. + Do not search for alternate interpreters or install dependencies unless the + user explicitly asks you to prepare the environment. +- Treat generated `autofl.yaml`, task-local `mutation_schema.yaml`, and + existing NVFLARE job/runtime configuration as authoritative. In the default + simulation product flow, do not require task-local prose profiles, special + branch setup, or harness initialization before invoking the runner. +- Use NVFLARE's existing execution surfaces: + - For simulation, run the imported job with its configured `SimEnv`. + - For POC and production, use standard `nvflare job submit`, `job wait`, + `job download`, and related job/status commands. +- Record every candidate in a ledger such as `results.tsv` with a short name, + changed files, diff summary, run command, metric result, artifacts, and + failure reason when applicable. +- Prefer small, reviewable edits over broad rewrites. +- Treat production as an available execution environment, but never bypass + startup-kit authentication, site policy, or normal NVFLARE job submission. + +## Candidate Loop + +1. Inspect `autofl.yaml`, current best source, prior manifests, and results. +2. Form a concrete hypothesis. Use literature, framework knowledge, source + edits, new algorithms, or a fallback tunable suggestion as appropriate. +3. Prepare a candidate, edit its draft, and evaluate its manifest. +4. Let the helper validate paths and fixed-budget comparability, compute the + patch hash, execute or materialize it, extract metrics, and keep or restore. +5. Read campaign state and execute `next_action`. Run a source-backed literature + pass when requested, then implement its strongest compatible idea. 
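
Step 5 of the candidate loop can be sketched as a small state gate. This is a minimal illustration, not the bundled helper: the file path and the `final_response_allowed` and `next_action` keys come from the skill text, while the function name `read_next_step` and the choice to treat a missing or unreadable state file as "do not finalize" are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Any, Optional, Tuple

# Path quoted from the skill text, relative to the campaign directory.
STATE_RELATIVE_PATH = ".nvflare/autofl/campaign_state.json"


def read_next_step(campaign_dir: str) -> Tuple[bool, Optional[str]]:
    """Return (may_finalize, next_action) from the campaign state file.

    Hypothetical helper: a missing, unreadable, or malformed state file is
    treated as "not allowed to finalize" so absence of evidence never stops
    a campaign.
    """
    state_path = Path(campaign_dir) / STATE_RELATIVE_PATH
    try:
        state: Any = json.loads(state_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # ValueError also covers json.JSONDecodeError and decoding errors.
        return False, None
    if not isinstance(state, dict):
        return False, None
    # Only a literal JSON true permits a final response.
    may_finalize = state.get("final_response_allowed") is True
    raw_action = state.get("next_action")
    next_action = raw_action if isinstance(raw_action, str) else None
    return may_finalize, next_action
```

An agent-side wrapper would call this after every lifecycle action and keep executing the returned `next_action` until the first element is `True`.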
+ +## Continuous Campaign Rule + +For uncapped campaigns, continue proposing and evaluating same-budget candidates +after setup and baseline until manually interrupted. Do not ask whether to keep +going or finalize while campaign state says `final_response_allowed=false`. + +A kept improvement, refreshed plot, updated report, local commit, first plateau +check, or encoded `job.py` default is a checkpoint, not completion. Treat the +campaign state as authoritative: if `final_response_allowed=false`, execute +`next_action` and keep the same `job.py`, `autofl.yaml`, metric, environment, +ledger, and comparison budget. Use +[continuous-campaigns.md](references/continuous-campaigns.md) for simulator +watchdogs, campaign-state handling, and recovery rules. + +## Candidate Caps + +Campaigns are uncapped by default. If the user says "optimize this job" without +an explicit candidate budget, continue the campaign until manually interrupted +or blocked. Do not stop after the first baseline, first batch, first successful +candidate, first kept improvement, first local commit, or first plateau +checkpoint. Do not stop after a first sweep of tunables; broaden into +agent-authored code or literature-derived algorithm candidates. Progress +reports in uncapped mode must not be +phrased as "should I continue?" decisions; the answer is continue unless the +user explicitly interrupts or the code-owned state permits finalization. +Do not invent a replacement campaign or new objective after a recoverable +failure. Keep the current campaign identity and artifacts coherent unless the +human explicitly requests a new campaign. + +If the user provides an `N`-candidate budget, pass it only through the runner's +explicit `--max-candidates` argument and count up to `N` comparable attempts +after baseline. Never infer a cap from an inherited environment variable. Do +not count import, validation, smoke runs, plotting, reporting, baseline, or +infrastructure-only retries. 
Count a real candidate crash after execution +starts. State must report `candidate_cap_source=explicit` or `uncapped`. + +Treat plateau as a decision checkpoint, not an automatic stop. Summarize the +plateau in the running report, refresh `progress.png`, run the campaign guard or +read the runner's `.nvflare/autofl/campaign_state.json`, choose the returned +next mode, and continue unless the state reports `final_response_allowed=true`. +After a source-backed review, record it with the helper's `record --literature +--hypothesis ""` action before preparing its candidate. + +## Stop Handling + +Only produce a final answer for a campaign when the code-owned campaign state +reports `final_response_allowed=true`, for example because the user manually +stopped it, an explicit cap is exhausted, production policy blocks execution, or +a hard safety/runtime blocker prevents further comparable runs. Then hand off +to the `nvflare-autofl-report` skill. It deterministically refreshes +`progress.png` and generates `autofl_final_report.md` plus +`autofl_report_summary.json` from the ledger, campaign state, config, and +candidate manifests. The final answer must include baseline, best score, +metric source, literature outcomes, failures, comparability warnings, +commands, and absolute artifact paths. + +When execution was interrupted before state could be finalized, first confirm +that no campaign or job process remains. The report skill may then record that +human-confirmed interruption without rewriting campaign state. 
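
The counting rule from the Candidate Caps section can be sketched as follows. This is an illustrative approximation rather than the runner's actual accounting: the status names mirror the ledger statuses used by `campaign_guard.py` in this change, the function names `counted_attempts` and `cap_exhausted` are hypothetical, and the sketch assumes that infrastructure-only retries, smoke runs, and reporting steps never appear as ledger rows with a counted status.

```python
from typing import Dict, List, Optional

# Ledger statuses treated as comparable attempts (mirrors campaign_guard.py).
COUNTED_STATUSES = {"candidate", "keep", "discard", "crash"}


def counted_attempts(rows: List[Dict[str, str]]) -> int:
    """Count comparable candidate attempts, excluding baseline rows."""
    count = 0
    for row in rows:
        status = (row.get("status") or "").strip().lower()
        name = (row.get("name") or "").strip().lower()
        # Baseline rows never consume candidate budget.
        if status == "baseline" or name == "baseline" or name.startswith("baseline_"):
            continue
        # A real crash after execution starts still counts as an attempt.
        if status in COUNTED_STATUSES:
            count += 1
    return count


def cap_exhausted(rows: List[Dict[str, str]], cap: Optional[int]) -> bool:
    """Return True once an explicit cap is reached; no cap means uncapped."""
    if not cap:  # None or 0 is treated as uncapped
        return False
    return counted_attempts(rows) >= cap
```

Literature rows and other non-scored events fall outside `COUNTED_STATUSES`, so they do not consume budget in this sketch.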
diff --git a/skills/nvflare-autofl/evals/evals.json b/skills/nvflare-autofl/evals/evals.json new file mode 100644 index 0000000000..e7dfb55e1a --- /dev/null +++ b/skills/nvflare-autofl/evals/evals.json @@ -0,0 +1,131 @@ +{ + "skill_name": "nvflare-autofl", + "evals": [ + { + "id": "autofl-optimize-existing-job", + "prompt": "Use NVFLARE Auto-FL to optimize ./job.py for validation accuracy in simulation with an 8-candidate budget.", + "expected_output": "The agent invokes the NVFLARE Auto-FL skill, imports job.py into autofl.yaml, summarizes editable/unresolved/allowed campaign settings, then prepares and evaluates agent-authored source or tunable candidates through candidate manifests and existing FLARE execution surfaces.", + "files": [], + "assertions": [ + "The agent generates autofl.yaml before editing files.", + "The agent summarizes editable settings, unresolved fields, fixed-budget constraints, and allowed edit paths.", + "The agent forms hypotheses and edits isolated candidate source rather than limiting the campaign to built-in tunable sweeps.", + "The agent records changed files, patch hash, base candidate, and artifacts in candidate_manifest.json.", + "The agent runs candidates using the existing job.py rather than replacing it with autofl.yaml.", + "The agent records candidate results and reports the best reproducible candidate." 
+ ], + "nvflare": { + "expected_skill": "nvflare-autofl", + "mandatory_behavior": [ + { + "id": "deterministic-import-first", + "description": "runs deterministic job import before candidate edits" + }, + { + "id": "campaign-summary-before-runs", + "description": "shows editable, unresolved, allowed, and fixed-budget campaign fields before running candidates" + }, + { + "id": "bounded-edit-surface", + "description": "edits only files allowed by autofl.yaml" + }, + { + "id": "agent-authored-code-candidates", + "description": "lets the coding agent implement source and algorithm candidates in an isolated draft" + }, + { + "id": "candidate-manifest-provenance", + "description": "computes a manifest with base source, changed files, patch hash, budget hash, artifacts, and result" + }, + { + "id": "existing-flare-execution", + "description": "uses job.py and standard FLARE execution surfaces for candidate runs" + } + ], + "prohibited_behavior": [ + { + "id": "no-autofl-yaml-as-exported-job", + "description": "does not treat autofl.yaml as a replacement for job.py or an exported FLARE job" + }, + { + "id": "no-policy-bypass", + "description": "does not bypass startup-kit authentication, site policy, or normal job submission for production runs" + }, + { + "id": "no-out-of-scope-edits", + "description": "does not edit outside the allowed edit paths" + }, + { + "id": "no-default-generated-search-policy", + "description": "does not let built-in tunable suggestions replace agent candidate planning" + } + ], + "process_metrics": [ + { + "id": "import_before_edit", + "description": "whether deterministic import occurs before candidate code edits" + }, + { + "id": "fixed_budget_violation_count", + "description": "number of candidate runs that changed fixed-budget fields without user approval" + }, + { + "id": "metric_extraction_success", + "description": "whether the requested metric is extracted from NVFLARE artifacts or logs" + }, + { + "id": "candidate_manifest_completeness", 
+ "description": "whether code candidates include deterministic source, budget, patch, and artifact provenance" + }, + { + "id": "unwanted_production_action_count", + "description": "number of production submissions or policy-sensitive actions performed without explicit user context" + }, + { + "id": "report_reproducibility", + "description": "whether the final report includes commands, artifacts, changed files, and the best candidate" + } + ] + } + }, + { + "id": "autofl-negative-pytorch-conversion", + "prompt": "Convert this standalone PyTorch training script into an NVFLARE federated job.", + "expected_output": "The Auto-FL skill should not be the lead; a conversion skill should handle the request before Auto-FL is applicable.", + "files": [], + "assertions": [ + "The selected skill is nvflare-convert-pytorch, not nvflare-autofl." + ], + "nvflare": { + "expected_skill": "nvflare-convert-pytorch", + "negative_for": "nvflare-autofl" + } + }, + { + "id": "autofl-negative-diagnose-job", + "prompt": "My NVFLARE job failed with EXECUTION_EXCEPTION. Diagnose the client logs and tell me what went wrong.", + "expected_output": "The Auto-FL skill should not be the lead; failed-job diagnosis should route to nvflare-diagnose-job.", + "files": [], + "assertions": [ + "The selected skill is nvflare-diagnose-job, not nvflare-autofl." + ], + "nvflare": { + "expected_skill": "nvflare-diagnose-job", + "negative_for": "nvflare-autofl" + } + }, + { + "id": "autofl-global-negative-web-app", + "prompt": "Build a React landing page for a bakery.", + "expected_output": "No FLARE skill should trigger.", + "files": [], + "assertions": [ + "The selected skill is none." 
+ ], + "nvflare": { + "expected_skill": null, + "negative_for": "*" + } + } + ] +} diff --git a/skills/nvflare-autofl/references/continuous-campaigns.md b/skills/nvflare-autofl/references/continuous-campaigns.md new file mode 100644 index 0000000000..81e6c7b299 --- /dev/null +++ b/skills/nvflare-autofl/references/continuous-campaigns.md @@ -0,0 +1,82 @@ +# Continuous Campaigns + +Use this reference when an Auto-FL campaign is uncapped, long-running, or +recovering from a simulator stall. The top-level skill owns the interaction +contract; this file carries the operational detail that keeps the campaign from +prematurely stopping. + +## Lifecycle State + +Each `scripts/run_job_campaign.py` lifecycle action exits with a JSON envelope; +the campaign continues through `.nvflare/autofl/campaign_state.json`. In +uncapped mode, a completed action, current best, plot, report, or exhausted local +tunable sweep is a checkpoint only. Execute `next_action` while +`final_response_allowed=false`. + +A prepared manifest is pending work. Edit its candidate source and evaluate it, +or abandon it explicitly; do not silently start another candidate. Invalid +drafts are product friction to repair and reevaluate, not a reason to terminate +the campaign. + +During a long `evaluate` action, monitor the process plus +`autofl_runs//run.log`. A live process with no final ledger row is a +running candidate. If logs are temporarily quiet but CPU or GPU use and the +child process remain active, keep waiting. + +## Campaign Guards + +The product runner writes `.nvflare/autofl/campaign_state.json` through +`scripts/campaign_guard.py`; read that state before any final response. This +product state is authoritative. If it has `final_response_allowed=false`, +execute `next_action` immediately; the skill text is only the interaction layer. + +Common next actions: + +- `edit_candidate` or `evaluate_candidate`: finish the pending candidate draft. 
+- `propose_candidate`: form a hypothesis, prepare its manifest, and edit the + returned candidate source directory. +- `submit_baseline` or `submit_candidate`: use the standard POC/production job + lifecycle, then call `record` with its job ID and artifacts. +- `run_literature_loop`: run a short source-backed literature pass, record a + non-scored `literature` row when a ledger is available, then launch the next + compatible same-budget candidates. + +After every finalized batch, run the available plateau or progress watchdog when +the task provides one. If it recommends `continue`, refresh `progress.png` and +keep iterating locally. If it recommends a literature or exploration mode, record +that decision in the ledger/report, refresh `progress.png`, and launch the top +compatible candidate batch next. If no non-duplicate safe local axis remains, +switch mode rather than stopping: broaden the search within `autofl.yaml`, run a +literature-inspired proposal pass, implement a compatible algorithm change, or +request deterministic tunable suggestions as seeds. + +## Simulator Recovery + +For NVFLARE simulator runs, the server log can be quiet after it dispatches a +round while individual clients are still training. Before declaring a stall, +inspect the active simulator directory under `/tmp/nvflare/simulation/` and +check `site-*/log.txt` or `site-*/log_fl.txt` for epoch, finished-training, +download, or task-completion progress. If any client log or server aggregation +marker advances within the expected candidate runtime, continue the same +candidate; do not stop the runner, final-answer, or start a new campaign. 
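
The client-log check described above can be approximated with a short script. This is a rough sketch under stated assumptions: it uses file modification time as a proxy for progress instead of parsing epoch or task-completion markers, it expects the caller to pass the active simulator run directory, and the names `latest_client_log_mtime` and `clients_recently_active` are hypothetical rather than part of the product runner.

```python
import time
from pathlib import Path
from typing import Optional

# Client log locations named in the text, relative to the simulator run directory.
CLIENT_LOG_PATTERNS = ("site-*/log.txt", "site-*/log_fl.txt")


def latest_client_log_mtime(simulation_dir: str) -> Optional[float]:
    """Return the newest modification time across client logs, if any exist."""
    root = Path(simulation_dir)
    mtimes = [
        path.stat().st_mtime
        for pattern in CLIENT_LOG_PATTERNS
        for path in root.glob(pattern)
        if path.is_file()
    ]
    return max(mtimes) if mtimes else None


def clients_recently_active(
    simulation_dir: str, window_seconds: float, now: Optional[float] = None
) -> bool:
    """True when any client log was written within the given window.

    Assumption: a recent write is taken as evidence of progress; a stricter
    check would compare the actual progress markers between two reads.
    """
    latest = latest_client_log_mtime(simulation_dir)
    if latest is None:
        return False
    current = time.time() if now is None else now
    return (current - latest) <= window_seconds
```

A `False` result alone is not a stall verdict; the text above still requires that server aggregation markers are also checked against the expected candidate runtime.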
+ +If the active NVFLARE simulator logs show a hard child-process connection +failure such as +`Failed to create connection to the child process in SimulatorClientRunner`, or +if a dispatched simulator round has no advancing server/client progress markers +past the configured no-progress watchdog timeout, do not start a new campaign and +do not produce a final report. Treat only that active candidate as crashed, +terminate the stuck `job.py` child if the runner has not already done so, +preserve the same `job.py`, `autofl.yaml`, metric, environment, ledger, and +comparison budget, then continue the same campaign. + +The product runner includes these simulator-stall watchdogs; prefer letting it +record the crash row, refresh artifacts, and launch the next candidate. For +legitimately long quiet tasks, raise `--simulator-no-progress-timeout`, set +`AUTOFL_SIMULATOR_NO_PROGRESS_TIMEOUT_SECONDS`, or set +`simulator_no_progress_timeout_seconds` in the task profile. + +If the current process has already stopped but the user did not ask to stop, do +not leave the campaign in a terminal state. Inspect the campaign state and +ledger, fix the recoverable cause, and continue the same optimization with the +same `job.py`, `autofl.yaml`, metric, environment, and comparison budget. diff --git a/skills/nvflare-autofl/scripts/campaign_guard.py b/skills/nvflare-autofl/scripts/campaign_guard.py new file mode 100644 index 0000000000..b3ccc7b658 --- /dev/null +++ b/skills/nvflare-autofl/scripts/campaign_guard.py @@ -0,0 +1,369 @@ +#!/usr/bin/env python3 +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Own the product Auto-FL campaign continuation decision. + +The skill runner executes candidates and writes ``results.tsv``. This guard +turns that ledger into a machine-readable campaign state so a live coding agent +does not decide completion, plateau handling, or literature-mode transitions by +itself. +""" + +from __future__ import annotations + +import argparse +import csv +import json +import math +from datetime import datetime, timezone +from pathlib import Path +from typing import Any, Dict, List, Optional, Tuple + +DEFAULT_HARD_CRASH_THRESHOLD = 6 +DEFAULT_MIN_DELTA = 0.0005 +DEFAULT_PLATEAU_THRESHOLD = 8 +DEFAULT_STATE_PATH = ".nvflare/autofl/campaign_state.json" +DEFAULT_STOP_FILES = ("STOP_AUTOFL", ".nvflare/autofl/STOP") +COMPARABLE_STATUSES = {"candidate", "keep", "discard", "crash"} +LITERATURE_EVENT_STATUSES = {"event", "literature", "checkpoint"} + + +def utc_now() -> str: + return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z") + + +def parse_score(value: Any) -> Optional[float]: + try: + score = float(value) + except (TypeError, ValueError): + return None + if math.isnan(score) or math.isinf(score): + return None + return score + + +def load_rows(path: Path) -> List[Dict[str, str]]: + if not path.exists(): + return [] + with path.open("r", encoding="utf-8", newline="") as f: + return list(csv.DictReader(f, delimiter="\t")) + + +def normalize_status(row: Dict[str, str]) -> str: + return (row.get("status", "") or "").strip().lower() + + +def row_text(row: Dict[str, str]) -> str: + return " 
".join(str(value or "") for value in row.values()).lower() + + +def is_baseline(row: Dict[str, str]) -> bool: + status = normalize_status(row) + if status == "baseline": + return True + name = (row.get("name", "") or "").strip().lower() + command = (row.get("run_command", "") or "").lower() + return name == "baseline" or name.startswith("baseline_") or "--name baseline" in command + + +def is_literature_event(row: Dict[str, str]) -> bool: + status = normalize_status(row) + if status in LITERATURE_EVENT_STATUSES and "literature" in row_text(row): + return True + return "literature" in (row.get("name", "") or "").lower() + + +def comparable_attempts(rows: List[Dict[str, str]]) -> List[Dict[str, str]]: + return [row for row in rows if normalize_status(row) in COMPARABLE_STATUSES and not is_baseline(row)] + + +def pending_candidates(rows: List[Dict[str, str]]) -> List[Dict[str, str]]: + return [row for row in rows if normalize_status(row) == "candidate"] + + +def scored_attempts_with_index(rows: List[Dict[str, str]]) -> List[Tuple[int, Dict[str, str], float]]: + scored = [] + for idx, row in enumerate(rows): + if normalize_status(row) not in COMPARABLE_STATUSES or is_baseline(row): + continue + score = parse_score(row.get("score", "")) + if score is not None: + scored.append((idx, row, score)) + return scored + + +def better(new_score: float, old_score: Optional[float], mode: str, min_delta: float = 0.0) -> bool: + if old_score is None: + return True + if mode == "min": + return new_score < old_score - min_delta + return new_score > old_score + min_delta + + +def best_score(rows: List[Dict[str, str]], mode: str) -> Optional[float]: + best = None + for row in rows: + if normalize_status(row) not in COMPARABLE_STATUSES and not is_baseline(row): + continue + score = parse_score(row.get("score", "")) + if score is None: + continue + if better(score, best, mode): + best = score + return best + + +def plateau_status(rows: List[Dict[str, str]], threshold: int, min_delta: 
float, mode: str) -> Dict[str, Any]: + scored = [] + for idx, row in enumerate(rows): + if normalize_status(row) not in COMPARABLE_STATUSES and not is_baseline(row): + continue + score = parse_score(row.get("score", "")) + if score is not None: + scored.append((idx, row, score)) + if threshold <= 0 or not scored: + return { + "available": True, + "recommendation": "continue", + "scored_since_reset": 0, + "threshold": threshold, + "min_delta": min_delta, + } + + best = None + best_row_idx = -1 + best_scored_idx = -1 + best_name = "" + for scored_idx, (row_idx, row, score) in enumerate(scored): + if better(score, best, mode, min_delta): + best = score + best_row_idx = row_idx + best_scored_idx = scored_idx + best_name = row.get("name", "") + + last_literature_idx = max((idx for idx, row in enumerate(rows) if is_literature_event(row)), default=-1) + reset_row_idx = max(best_row_idx, last_literature_idx) + scored_since_reset = sum(1 for row_idx, _, _ in scored if row_idx > reset_row_idx) + recommendation = "literature" if scored_since_reset >= threshold else "continue" + return { + "available": True, + "recommendation": recommendation, + "best_name": best_name, + "best_score": best, + "best_scored_index": best_scored_idx, + "last_literature_event_index": last_literature_idx, + "min_delta": min_delta, + "reset_row_index": reset_row_idx, + "scored_since_reset": scored_since_reset, + "threshold": threshold, + } + + +def parse_max_candidates(value: Optional[str]) -> Optional[int]: + if value is None or str(value).strip() == "": + return None + try: + parsed = int(value) + except (TypeError, ValueError): + return None + if parsed <= 0: + return None + return parsed + + +def parse_max_candidates_arg(value: str) -> int: + parsed = parse_max_candidates(value) + if parsed is None: + raise argparse.ArgumentTypeError("must be a positive integer") + return parsed + + +def existing_stop_files(paths: List[str]) -> List[str]: + return [path for path in paths if Path(path).exists()] + 
+ +def repeated_crash_blocker(attempts: List[Dict[str, str]], threshold: int) -> bool: + if threshold <= 0 or len(attempts) < threshold: + return False + return all(normalize_status(row) == "crash" for row in attempts[-threshold:]) + + +def guard_state_for_rows( + rows: List[Dict[str, str]], + *, + results_path: str = "results.tsv", + max_candidates: Optional[int] = None, + stop_files: Optional[List[str]] = None, + plateau_threshold: int = DEFAULT_PLATEAU_THRESHOLD, + min_delta: float = DEFAULT_MIN_DELTA, + hard_crash_threshold: int = DEFAULT_HARD_CRASH_THRESHOLD, + mode: str = "max", +) -> Dict[str, Any]: + attempts = comparable_attempts(rows) + pending = pending_candidates(rows) + cap = max_candidates + cap_source = "explicit" if cap is not None else "uncapped" + stop_file_hits = existing_stop_files(stop_files or list(DEFAULT_STOP_FILES)) + plateau = plateau_status(rows, plateau_threshold, min_delta, mode) + + decision = "continue" + reason = "continue" + next_action = "propose_candidate" + final_response_allowed = False + + if pending: + reason = "pending_candidates" + next_action = "finalize_pending_candidates" + elif stop_file_hits: + decision = "stop" + reason = "manual_stop_file" + next_action = "final_report" + final_response_allowed = True + elif cap is not None and len(attempts) >= cap: + decision = "stop" + reason = "candidate_cap_exhausted" + next_action = "final_report" + final_response_allowed = True + elif repeated_crash_blocker(attempts, hard_crash_threshold): + decision = "stop" + reason = "hard_repeated_crash_blocker" + next_action = "final_report" + final_response_allowed = True + elif plateau.get("recommendation") == "literature": + reason = "plateau_literature" + next_action = "run_literature_loop" + + if final_response_allowed: + instruction = "Final report is allowed because the campaign guard reached a stop condition." + elif next_action == "finalize_pending_candidates": + instruction = ( + "Do not produce a final answer. 
Finalize reviewed candidate rows, refresh artifacts, then rerun the guard." + ) + elif next_action == "run_literature_loop": + instruction = ( + "Do not produce a final answer. Run the literature loop, record a literature event, " + "then launch source-backed candidates under the same comparison budget." + ) + else: + instruction = "Do not produce a final answer. Propose and prepare the next same-budget candidate now." + + return { + "schema_version": "nvflare.autofl.campaign_state.v1", + "updated_at": utc_now(), + "results": results_path, + "decision": decision, + "reason": reason, + "next_action": next_action, + "final_response_allowed": final_response_allowed, + "candidate_cap": cap, + "candidate_cap_source": cap_source, + "candidate_attempts": len(attempts), + "pending_candidates": len(pending), + "scored_attempts": len(scored_attempts_with_index(rows)), + "best_score": best_score(rows, mode), + "stop_files": stop_file_hits, + "plateau": plateau, + "agent_instruction": instruction, + } + + +def guard_state( + results_path: Path, + *, + max_candidates: Optional[int] = None, + stop_files: Optional[List[str]] = None, + plateau_threshold: int = DEFAULT_PLATEAU_THRESHOLD, + min_delta: float = DEFAULT_MIN_DELTA, + hard_crash_threshold: int = DEFAULT_HARD_CRASH_THRESHOLD, + mode: str = "max", +) -> Dict[str, Any]: + return guard_state_for_rows( + load_rows(results_path), + results_path=str(results_path), + max_candidates=max_candidates, + stop_files=stop_files, + plateau_threshold=plateau_threshold, + min_delta=min_delta, + hard_crash_threshold=hard_crash_threshold, + mode=mode, + ) + + +def write_state(path: Path, state: Dict[str, Any]) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + tmp_path = path.with_suffix(path.suffix + ".tmp") + tmp_path.write_text(json.dumps(state, indent=2, sort_keys=True) + "\n", encoding="utf-8") + tmp_path.replace(path) + + +def print_text(state: Dict[str, Any]) -> None: + for key in [ + "decision", + "reason", + "next_action", 
+ "final_response_allowed", + "candidate_cap", + "candidate_cap_source", + "candidate_attempts", + "pending_candidates", + "scored_attempts", + "best_score", + "agent_instruction", + ]: + value = state.get(key) + if isinstance(value, bool): + value = str(value).lower() + elif value is None: + value = "" + print(f"{key}={value}") + + +def main(argv: Optional[List[str]] = None) -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("results", nargs="?", default="results.tsv") + parser.add_argument("--state", default=DEFAULT_STATE_PATH) + parser.add_argument("--max-candidates", type=parse_max_candidates_arg) + parser.add_argument("--stop-file", action="append", default=list(DEFAULT_STOP_FILES)) + parser.add_argument("--plateau-threshold", type=int, default=DEFAULT_PLATEAU_THRESHOLD) + parser.add_argument("--min-delta", type=float, default=DEFAULT_MIN_DELTA) + parser.add_argument("--hard-crash-threshold", type=int, default=DEFAULT_HARD_CRASH_THRESHOLD) + parser.add_argument("--mode", choices=["max", "min"], default="max") + parser.add_argument("--format", choices=["text", "json"], default="text") + args = parser.parse_args(argv) + + if args.plateau_threshold <= 0: + raise ValueError("--plateau-threshold must be positive") + if args.min_delta < 0: + raise ValueError("--min-delta must be non-negative") + + state = guard_state( + Path(args.results), + max_candidates=args.max_candidates, + stop_files=args.stop_file, + plateau_threshold=args.plateau_threshold, + min_delta=args.min_delta, + hard_crash_threshold=args.hard_crash_threshold, + mode=args.mode, + ) + write_state(Path(args.state), state) + if args.format == "json": + print(json.dumps(state, sort_keys=True)) + else: + print_text(state) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/skills/nvflare-autofl/scripts/plot_progress.py b/skills/nvflare-autofl/scripts/plot_progress.py new file mode 100755 index 0000000000..764d7b1be2 --- /dev/null +++ 
b/skills/nvflare-autofl/scripts/plot_progress.py @@ -0,0 +1,537 @@ +#!/usr/bin/env python3 +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Generate a readable Auto-FL progress plot from the campaign ledger.""" + +from __future__ import annotations + +import argparse +import csv +import math +import os +import tempfile +from dataclasses import dataclass +from pathlib import Path +from typing import Any, Iterable, List, Optional, Sequence, Tuple + + +class PlotDependencyError(RuntimeError): + """Raised when the rich plotting dependency is unavailable.""" + + +class NoScoredResultsError(ValueError): + """Raised when a campaign has no scored rows to plot yet.""" + + +@dataclass +class ProgressRecord: + index: int + status: str + name: str + score: Optional[float] + runtime_seconds: float + description: str + + +def parse_score(value: Any) -> Optional[float]: + try: + score = float(value) + except (TypeError, ValueError): + return None + if math.isnan(score) or math.isinf(score): + return None + return score + + +def parse_runtime(value: Any) -> float: + try: + seconds = float(value) + except (TypeError, ValueError): + return 0.0 + if math.isnan(seconds) or math.isinf(seconds): + return 0.0 + return max(0.0, seconds) + + +def format_runtime(seconds: float) -> str: + if seconds <= 0: + return "" + if seconds < 3600: + return f"{seconds / 60:.1f}m" + return f"{seconds / 3600:.2f}h" + + +def truncate(text: 
Any, limit: int) -> str: + compact = " ".join(str(text or "").strip().split()) + if len(compact) <= limit: + return compact + return compact[: max(0, limit - 3)] + "..." + + +def compact_label(text: Any, limit: int) -> str: + compact = " ".join(str(text or "").strip().split()) + replacements = { + "server_momentum": "smom", + "server_lr": "slr", + "clip_grad_norm": "clip", + "eta_min_factor": "eta_min", + "label_smoothing": "ls", + } + for old, new in replacements.items(): + compact = compact.replace(old, new) + for marker in [" (", ";"]: + if marker in compact: + compact = compact.split(marker, 1)[0] + return truncate(compact, limit) + + +def record_label(record: ProgressRecord, limit: int) -> str: + return compact_label(record.name or record.description, limit) + + +def load_results(path: Path) -> List[ProgressRecord]: + records = [] + with path.open("r", encoding="utf-8", newline="") as f: + for index, row in enumerate(csv.DictReader(f, delimiter="\t")): + records.append( + ProgressRecord( + index=index, + status=(row.get("status", "") or "").strip().lower(), + name=row.get("name", "") or row.get("candidate", "") or row.get("commit", ""), + score=parse_score(row.get("score", "")), + runtime_seconds=parse_runtime(row.get("runtime_seconds", "")), + description=row.get("diff_summary", "") or row.get("description", ""), + ) + ) + return records + + +def normalize_records(records: Iterable[Any]) -> List[ProgressRecord]: + normalized = [] + for index, record in enumerate(records): + normalized.append( + ProgressRecord( + index=index, + status=str(getattr(record, "status", "") or "").strip().lower(), + name=str(getattr(record, "name", "") or ""), + score=parse_score(getattr(record, "score", None)), + runtime_seconds=parse_runtime(getattr(record, "runtime_seconds", 0.0)), + description=str(getattr(record, "diff_summary", "") or getattr(record, "description", "") or ""), + ) + ) + return normalized + + +def better(value: float, incumbent: Optional[float], mode: str) -> 
bool: + if incumbent is None: + return True + return value > incumbent if mode == "max" else value < incumbent + + +def cumulative_best(values: Sequence[float], mode: str) -> List[float]: + incumbent = None + result = [] + for value in values: + if better(value, incumbent, mode): + incumbent = value + result.append(incumbent) + return result + + +def percentile(values: Sequence[float], fraction: float) -> float: + if not values: + raise ValueError("percentile requires at least one value") + if len(values) == 1: + return values[0] + ordered = sorted(values) + position = min(max(fraction, 0.0), 1.0) * (len(ordered) - 1) + lower_index = math.floor(position) + upper_index = math.ceil(position) + if lower_index == upper_index: + return ordered[lower_index] + weight = position - lower_index + return ordered[lower_index] + (ordered[upper_index] - ordered[lower_index]) * weight + + +def default_y_limits( + scores: Sequence[float], baseline: float, mode: str, full_y_range: bool = False +) -> Tuple[float, float]: + score_min = min(scores) + score_max = max(scores) + if full_y_range: + span = max(score_max - score_min, 0.01) + return score_min - max(0.01, span * 0.08), score_max + max(0.01, span * 0.16) + + if mode == "max": + useful_min = min(baseline, percentile(scores, 0.20)) + useful_span = max(score_max - useful_min, 0.01) + lower = useful_min - max(0.01, useful_span * 0.20) + upper = score_max + max(0.015, useful_span * 0.35) + if score_min >= lower: + full_span = max(score_max - score_min, 0.01) + lower = score_min - max(0.01, full_span * 0.08) + return lower, upper + + useful_max = max(baseline, percentile(scores, 0.80)) + useful_span = max(useful_max - score_min, 0.01) + lower = score_min - max(0.015, useful_span * 0.35) + upper = useful_max + max(0.01, useful_span * 0.20) + if score_max <= upper: + full_span = max(score_max - score_min, 0.01) + upper = score_max + max(0.01, full_span * 0.08) + return lower, upper + + +def select_observed_milestones( + valid: 
Sequence[ProgressRecord], mode: str, max_labels: int +) -> List[Tuple[float, ProgressRecord]]: + if max_labels <= 0: + return [] + incumbent = None + milestones = [] + final_running_best = None + for record in valid: + if record.score is None or not better(record.score, incumbent, mode): + continue + delta = 0.0 if incumbent is None else abs(record.score - incumbent) + final_running_best = (delta, record) + if incumbent is not None: + milestones.append((delta, record)) + incumbent = record.score + + if final_running_best is None: + return [] + if not milestones: + return [final_running_best] + + final_index = final_running_best[1].index + non_final = [item for item in milestones if item[1].index != final_index] + selected = sorted(non_final, key=lambda item: item[0], reverse=True)[: max_labels - 1] + selected_indices = {record.index for _, record in selected} + selected_indices.add(final_index) + return [(delta, record) for delta, record in milestones if record.index in selected_indices] + + +def select_literature_labels(records: Sequence[ProgressRecord], max_labels: int) -> List[ProgressRecord]: + if max_labels <= 0 or not records: + return [] + if len(records) <= max_labels: + return list(records) + latest = records[-1] + selected = {latest.index} + for record in sorted(records, key=lambda item: item.runtime_seconds, reverse=True): + if len(selected) >= max_labels: + break + selected.add(record.index) + return [record for record in records if record.index in selected] + + +def label_placement( + label_number: int, + record: ProgressRecord, + x_limits: Tuple[float, float], + y_limits: Tuple[float, float], +) -> Tuple[Tuple[int, int], str, str]: + x_span = max(x_limits[1] - x_limits[0], 1.0) + y_span = max(y_limits[1] - y_limits[0], 1e-9) + x_fraction = (record.index - x_limits[0]) / x_span + y_fraction = ((record.score or y_limits[0]) - y_limits[0]) / y_span + near_right = x_fraction > 0.72 + near_top = y_fraction > 0.78 + x_offset = -10 if near_right else 10 + 
y_base = -12 if near_top else 12 + y_step = (label_number % 3) * 8 + return ( + (x_offset, y_base - y_step if near_top else y_base + y_step), + "right" if near_right else "left", + "top" if near_top else "bottom", + ) + + +def plot_progress( + records: Iterable[Any], + output: Path, + mode: str, + metric_label: str, + max_labels: int = 6, + max_literature_labels: int = 4, + full_y_range: bool = False, +) -> Tuple[float, float]: + try: + import matplotlib + + matplotlib.use("Agg") + import matplotlib.pyplot as plt + except ImportError as exc: + raise PlotDependencyError("matplotlib is required for the rich Auto-FL progress plot") from exc + + rows = normalize_records(records) + valid = [record for record in rows if record.score is not None and record.status != "crash"] + literature_rows = [record for record in rows if record.status == "literature"] + if not valid: + raise NoScoredResultsError("No non-crash rows with numeric scores found") + + baseline_row = next((record for record in valid if record.status == "baseline"), valid[0]) + baseline = baseline_row.score + best_row = valid[0] + for record in valid[1:]: + if better(record.score, best_row.score, mode): + best_row = record + best_score = best_row.score + + fig, ax = plt.subplots(figsize=(16, 8)) + fig.patch.set_facecolor("white") + ax.set_facecolor("white") + styles = { + "discard": {"color": "#cccccc", "size": 18, "alpha": 0.55, "label": "Discarded"}, + "candidate": {"color": "#3498db", "size": 32, "alpha": 0.75, "label": "Candidate"}, + "keep": {"color": "#2ecc71", "size": 72, "alpha": 0.95, "label": "Kept"}, + } + groups = { + "discard": [record for record in valid if record.status == "discard"], + "candidate": [record for record in valid if record.status == "candidate"], + "keep": [record for record in valid if record.status in {"baseline", "keep"}], + } + for status, group in groups.items(): + if not group: + continue + style = styles[status] + ax.scatter( + [record.index for record in group], + 
[record.score for record in group], + c=style["color"], + s=style["size"], + alpha=style["alpha"], + zorder=4 if status == "keep" else 2, + label=style["label"], + edgecolors="black" if status == "keep" else "none", + linewidths=0.5 if status == "keep" else 0, + ) + + known_statuses = {"baseline", "keep", "discard", "candidate", "crash", "literature"} + other = [record for record in valid if record.status not in known_statuses] + if other: + ax.scatter( + [record.index for record in other], + [record.score for record in other], + c="#9b59b6", + s=28, + alpha=0.65, + zorder=2, + label="Other", + ) + + observed_scores = [record.score for record in valid] + ax.step( + [record.index for record in valid], + cumulative_best(observed_scores, mode), + where="post", + color="#27ae60", + linewidth=2.2, + alpha=0.75, + zorder=3, + label="Running best observed", + ) + + runtime_rows = [record for record in rows if record.runtime_seconds > 0 and record.status != "literature"] + total_runtime = sum(record.runtime_seconds for record in runtime_rows) + average_runtime = total_runtime / len(runtime_rows) if runtime_rows else 0.0 + literature_runtime = sum(record.runtime_seconds for record in literature_rows) + runtime_title = f", {format_runtime(total_runtime)} total" if total_runtime else "" + if average_runtime: + runtime_title += f", {format_runtime(average_runtime)} avg/candidate" + if literature_rows: + runtime_title += f", {len(literature_rows)} lit" + if literature_runtime: + runtime_title += f" ({format_runtime(literature_runtime)})" + + direction = "higher" if mode == "max" else "lower" + ax.set_xlabel("Experiment #", fontsize=12) + ax.set_ylabel(f"{metric_label} ({direction} is better)", fontsize=12) + ax.set_title( + f"Auto-FL Progress ({metric_label}): {len(rows)} rows, {len(valid)} scored, " + f"{sum(record.status == 'keep' for record in rows)} kept, " + f"{sum(record.status == 'candidate' for record in rows)} candidate, " + f"{sum(record.status == 'discard' for record 
in rows)} discarded, " + f"{sum(record.status == 'crash' for record in rows)} crash{runtime_title}", + fontsize=14, + ) + ax.axhline( + baseline, + color="#7f8c8d", + linewidth=1.2, + alpha=0.5, + linestyle="--", + label="Baseline", + ) + ax.grid(True, alpha=0.2) + + y_limits = default_y_limits(observed_scores, baseline, mode, full_y_range=full_y_range) + ax.set_ylim(*y_limits) + ax.set_xlim(-0.5, max(len(rows) - 0.5, max(record.index for record in valid) + 0.5)) + + if literature_rows: + event_color = "#8e44ad" + event_y = y_limits[1] - max(y_limits[1] - y_limits[0], 1e-9) * 0.035 + for record in literature_rows: + ax.axvline(record.index, color=event_color, linestyle=":", linewidth=1.1, alpha=0.45, zorder=1) + ax.scatter( + [record.index for record in literature_rows], + [event_y for _ in literature_rows], + marker="v", + c=event_color, + s=50, + alpha=0.85, + zorder=5, + label="Literature review", + edgecolors="white", + linewidths=0.4, + ) + event_x_limits = ax.get_xlim() + event_x_span = max(event_x_limits[1] - event_x_limits[0], 1.0) + for record in select_literature_labels(literature_rows, max_literature_labels): + runtime = format_runtime(record.runtime_seconds) + near_right = (record.index - event_x_limits[0]) / event_x_span > 0.88 + annotation = ax.annotate( + f"lit #{record.index}{f' {runtime}' if runtime else ''}: {compact_label(record.description, 30)}", + (record.index, event_y), + textcoords="offset points", + xytext=(-4, -4) if near_right else (4, -4), + fontsize=7.0, + color=event_color, + alpha=0.9, + rotation=90, + ha="right" if near_right else "left", + va="top", + annotation_clip=True, + ) + annotation.set_clip_on(True) + + for label_number, (_, record) in enumerate(select_observed_milestones(valid, mode, max_labels)): + offset, horizontal_align, vertical_align = label_placement(label_number, record, ax.get_xlim(), ax.get_ylim()) + annotation = ax.annotate( + f"#{record.index} {record.score:.4f}: {record_label(record, 28)}", + (record.index, 
record.score), + textcoords="offset points", + xytext=offset, + fontsize=8.0, + color="#1a7a3a", + alpha=0.9, + ha=horizontal_align, + va=vertical_align, + annotation_clip=True, + arrowprops={ + "arrowstyle": "-", + "color": "#1a7a3a", + "alpha": 0.35, + "linewidth": 0.8, + "shrinkA": 0, + "shrinkB": 4, + }, + ) + annotation.set_clip_on(True) + + summary_lines = [ + f"Baseline: {baseline:.6f}", + f"Best: {best_score:.6f}", + f"Delta: {best_score - baseline:+.6f}", + ] + if total_runtime: + summary_lines.append(f"Runtime: {format_runtime(total_runtime)}") + if average_runtime: + summary_lines.append(f"Avg/candidate: {format_runtime(average_runtime)}") + if literature_rows: + literature_summary = f"Lit reviews: {len(literature_rows)}" + if literature_runtime: + literature_summary += f" ({format_runtime(literature_runtime)})" + summary_lines.append(literature_summary) + summary_lines.append(f"Best run: #{best_row.index} {record_label(best_row, 36)}") + ax.text( + 0.015, + 0.985, + "\n".join(summary_lines), + transform=ax.transAxes, + ha="left", + va="top", + fontsize=9, + bbox={ + "boxstyle": "round,pad=0.35", + "facecolor": "white", + "edgecolor": "#dddddd", + "alpha": 0.9, + }, + ) + + clipped = sum(score < y_limits[0] or score > y_limits[1] for score in observed_scores) + if clipped: + ax.text( + 0.99, + 0.015, + f"{clipped} outlier{'s' if clipped != 1 else ''} outside displayed range", + transform=ax.transAxes, + ha="right", + va="bottom", + fontsize=8, + color="#666666", + ) + + ax.legend(loc="best", fontsize=9) + output.parent.mkdir(parents=True, exist_ok=True) + temp_path = None + try: + with tempfile.NamedTemporaryFile( + dir=output.parent, prefix=f".{output.stem}.", suffix=".png", delete=False + ) as f: + temp_path = Path(f.name) + plt.tight_layout() + fig.savefig(temp_path, dpi=150, facecolor="white", transparent=False) + os.replace(temp_path, output) + finally: + plt.close(fig) + if temp_path and temp_path.exists(): + temp_path.unlink() + return baseline, 
best_score + + +def main(argv: Optional[Sequence[str]] = None) -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("path", nargs="?", default="results.tsv", help="path to the Auto-FL TSV ledger") + parser.add_argument("--output", default="progress.png", help="output PNG path") + parser.add_argument("--mode", choices=["max", "min"], default="max") + parser.add_argument("--metric", default="score", help="metric label shown in the plot") + parser.add_argument("--max-labels", type=int, default=6) + parser.add_argument("--max-literature-labels", type=int, default=4) + parser.add_argument("--full-y-range", action="store_true") + args = parser.parse_args(argv) + + records = load_results(Path(args.path)) + baseline, best = plot_progress( + records, + Path(args.output), + args.mode, + args.metric, + max_labels=args.max_labels, + max_literature_labels=args.max_literature_labels, + full_y_range=args.full_y_range, + ) + print(f"Saved {args.output}") + print(f"baseline={baseline:.6f}") + print(f"best={best:.6f}") + print(f"delta={best - baseline:+.6f}") + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/skills/nvflare-autofl/scripts/run_job_campaign.py b/skills/nvflare-autofl/scripts/run_job_campaign.py new file mode 100644 index 0000000000..dcc77f2e12 --- /dev/null +++ b/skills/nvflare-autofl/scripts/run_job_campaign.py @@ -0,0 +1,2589 @@ +#!/usr/bin/env python3 +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +"""Manage agent-authored Auto-FL candidates for an existing NVFlare job.py. + +The coding agent owns hypotheses and source edits. This helper snapshots the +current best source, validates and evaluates candidate manifests, restores +discarded candidates, and records reproducible campaign state and artifacts. +""" + +from __future__ import annotations + +import argparse +import csv +import difflib +import hashlib +import importlib.util +import json +import os +import re +import selectors +import shlex +import shutil +import signal +import subprocess +import sys +import time +import uuid +from dataclasses import dataclass, field +from datetime import datetime, timezone +from pathlib import Path +from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple + +try: + import yaml +except ImportError: # pragma: no cover - NVFlare installs PyYAML + yaml = None + + +RESULT_FIELDS = [ + "status", + "name", + "score", + "runtime_seconds", + "changed_files", + "diff_summary", + "run_command", + "artifacts", + "failure_reason", + "candidate_manifest", + "base_candidate", + "patch_sha256", +] + +CANDIDATE_MANIFEST_SCHEMA_VERSION = "nvflare.autofl.candidate.v1" +CAMPAIGN_METADATA_SCHEMA_VERSION = "nvflare.autofl.campaign.v1" +CAMPAIGN_METADATA_PATH = ".nvflare/autofl/campaign.json" +CANDIDATE_ROOT = ".nvflare/autofl/candidates" +BEST_SNAPSHOT_ROOT = ".nvflare/autofl/snapshots/best" +ALLOWED_CREATE_PATTERNS = ["**/*.py"] +RESERVED_CANDIDATE_PATH_PARTS = {".git", ".nvflare", "__pycache__", "autofl_runs"} + +INFRASTRUCTURE_RETRY = "infrastructure_retry" +SIMULATOR_STALL_EXIT_CODE = 125 +SIMULATOR_STALL_PATTERNS = ( + "Failed to create connection to the child process in SimulatorClientRunner", + "SimulatorClientRunner - ERROR - run_client_thread error", +) +SIMULATOR_STALL_LOG_LIMIT = 65536 +SIMULATOR_AGGREGATION_RE = 
re.compile(r"Aggregated\s+(\d+)/(\d+)\s+results") +SIMULATOR_PROGRESS_PATTERNS = ( + "Round ", + "Aggregated ", + "Beginning model validation", + "Saved validation result", + "Finished FedAvg", + "Finished ScatterAndGather", + "cross_site_val", + "round=", + "epoch", + "returning result", + "sent result", + "task completed", +) +SIMULATOR_PROGRESS_LOG_LIMIT = 131072 +DEFAULT_SIMULATOR_NO_PROGRESS_TIMEOUT = 240 +DEFAULT_HARD_CRASH_THRESHOLD = 6 +DEFAULT_PLATEAU_MIN_DELTA = 0.0005 +DEFAULT_PLATEAU_THRESHOLD = 8 + +FIXED_BUDGET_TO_CLI = { + "num_clients": "n_clients", + "min_clients": "min_clients", + "num_rounds": "num_rounds", +} + +COMPARISON_BUDGET_TO_CLI = { + "n_clients": "n_clients", + "num_rounds": "num_rounds", + "aggregation_epochs": "aggregation_epochs", + "local_train_steps": "local_train_steps", + "batch_size": "batch_size", + "eval_batch_size": "eval_batch_size", + "alpha": "alpha", + "seed": "seed", + "model_arch": "model_arch", + "max_model_params": "max_model_params", + "aggregator": "aggregator", + "final_eval_clients": "final_eval_clients", +} + + +@dataclass +class RunRecord: + status: str + name: str + score: Optional[float] + runtime_seconds: float + changed_files: str + diff_summary: str + run_command: str + artifacts: str + failure_reason: str = "" + candidate_manifest: str = "" + base_candidate: str = "" + patch_sha256: str = "" + + +@dataclass +class JobRun: + name: str + args: List[str] + description: str + status: str = "candidate" + score: Optional[float] = None + runtime_seconds: float = 0.0 + artifacts: str = "" + failure_reason: str = "" + command: List[str] = field(default_factory=list) + + +def env_int(name: str, default: int) -> int: + try: + value = int(os.environ.get(name, default)) + except (TypeError, ValueError): + return default + return value + + +def env_float(name: str, default: float) -> float: + try: + value = float(os.environ.get(name, default)) + except (TypeError, ValueError): + return default + return value + + +def 
parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "action", + choices=["initialize", "prepare", "evaluate", "abandon", "suggest", "record", "status"], + help="skill-internal campaign lifecycle action", + ) + parser.add_argument("job", help="NVFlare job.py to optimize") + parser.add_argument("--metric", default="accuracy") + parser.add_argument("--mode", choices=["max", "min"], default="max") + parser.add_argument("--env", dest="target_env", choices=["sim", "poc", "prod"], default="sim") + parser.add_argument( + "--max-candidates", + type=int, + help="optional candidate cap; omit to continue until interrupted or blocked", + ) + parser.add_argument("--autofl-yaml", default="autofl.yaml") + parser.add_argument("--results", default="results.tsv") + parser.add_argument("--state", default=".nvflare/autofl/campaign_state.json") + parser.add_argument("--progress", default="progress.png") + parser.add_argument("--report", default="autofl_report.md") + parser.add_argument("--output-root", default="autofl_runs") + parser.add_argument( + "--plateau-threshold", + type=int, + default=env_int("AUTOFL_PLATEAU_THRESHOLD", DEFAULT_PLATEAU_THRESHOLD), + help=( + "scored comparable candidate attempts after the last material improvement or literature event " + "before campaign state requests run_literature_loop" + ), + ) + parser.add_argument( + "--plateau-min-delta", + type=float, + default=env_float("AUTOFL_PLATEAU_MIN_DELTA", DEFAULT_PLATEAU_MIN_DELTA), + help="minimum metric delta required to reset the plateau clock", + ) + parser.add_argument( + "--hard-crash-threshold", + type=int, + default=env_int("AUTOFL_HARD_CRASH_THRESHOLD", DEFAULT_HARD_CRASH_THRESHOLD), + help="stop after this many consecutive real candidate crashes; set 0 to disable", + ) + parser.add_argument( + "--stop-file", + action="append", + help="manual stop-file path; defaults to STOP_AUTOFL and 
.nvflare/autofl/STOP under the job directory", + ) + parser.add_argument("--base-args", default="", help="extra job args applied to baseline and all candidates") + parser.add_argument("--timeout", type=int, default=900) + parser.add_argument( + "--simulator-no-progress-timeout", + type=int, + default=env_int("AUTOFL_SIMULATOR_NO_PROGRESS_TIMEOUT_SECONDS", DEFAULT_SIMULATOR_NO_PROGRESS_TIMEOUT), + help=( + "candidate-level simulator no-progress timeout in seconds; set 0 to disable. " + "This is separate from the full run timeout and only applies after progress markers appear." + ), + ) + parser.add_argument("--python", default=os.environ.get("PYTHON") or sys.executable) + parser.add_argument("--prefer-synthetic", action=argparse.BooleanOptionalAction, default=True) + parser.add_argument("--synthetic-train-size", type=int, default=2048) + parser.add_argument("--synthetic-test-size", type=int, default=256) + parser.add_argument("--name", help="candidate name for prepare") + parser.add_argument("--hypothesis", help="candidate hypothesis for prepare") + parser.add_argument("--manifest", help="candidate_manifest.json path") + parser.add_argument("--run-args", default="", help="candidate-only job.py arguments") + parser.add_argument("--score", type=float, help="externally measured metric for record") + parser.add_argument("--artifacts", dest="external_artifacts", help="external POC/production artifacts") + parser.add_argument("--job-id", help="standard NVFlare job ID for an external result") + parser.add_argument("--failure-reason", default="", help="external execution failure") + parser.add_argument("--baseline", action="store_true", help="record an externally executed baseline") + parser.add_argument("--literature", action="store_true", help="record a literature-review checkpoint") + parser.add_argument("--limit", type=int, default=10, help="maximum fallback suggestions") + return parser.parse_args(argv) + + +def terminate_process(process: subprocess.Popen) -> None: + 
if process.poll() is not None: + return + if os.name != "nt": + try: + os.killpg(process.pid, signal.SIGTERM) + except ProcessLookupError: + return + except Exception: + process.terminate() + else: + process.terminate() + + try: + process.wait(timeout=10) + return + except subprocess.TimeoutExpired: + pass + + if os.name != "nt": + try: + os.killpg(process.pid, signal.SIGKILL) + except ProcessLookupError: + return + except Exception: + process.kill() + else: + process.kill() + + +def recent_text(path: Path, limit: int = SIMULATOR_STALL_LOG_LIMIT) -> str: + try: + with path.open("rb") as f: + try: + f.seek(-limit, os.SEEK_END) + except OSError: + f.seek(0) + return f.read().decode("utf-8", errors="replace") + except OSError: + return "" + + +def detect_nvflare_simulator_stall(sim_root: Path) -> Optional[str]: + if not sim_root.exists(): + return None + + log_paths = [ + sim_root / "server" / "log_fl.txt", + sim_root / "server" / "log.txt", + sim_root / "server" / "error_log.txt", + ] + for path in log_paths: + text = recent_text(path) + if not text: + continue + for pattern in SIMULATOR_STALL_PATTERNS: + if pattern in text: + return f"{pattern} in {path}" + return None + + +def simulator_stall_message(simulator_stall_roots: Sequence[Path]) -> Optional[str]: + for root in simulator_stall_roots: + message = detect_nvflare_simulator_stall(root) + if message: + return message + return None + + +def simulator_progress_signature(sim_root: Path) -> str: + if not sim_root.exists(): + return "" + + log_paths = [ + sim_root / "server" / "log_fl.txt", + sim_root / "server" / "log.txt", + *sorted(sim_root.glob("site-*/log_fl.txt")), + *sorted(sim_root.glob("site-*/log.txt")), + ] + markers: List[str] = [] + for path in log_paths: + text = recent_text(path, limit=SIMULATOR_PROGRESS_LOG_LIMIT) + if not text: + continue + for line in text.splitlines(): + line_lower = line.lower() + if any(pattern.lower() in line_lower for pattern in SIMULATOR_PROGRESS_PATTERNS): + 
markers.append(f"{path.relative_to(sim_root)}:{line}") + return "\n".join(markers[-200:]) + + +def simulator_progress_signature_for_roots(simulator_stall_roots: Sequence[Path]) -> str: + markers = [] + for root in simulator_stall_roots: + signature = simulator_progress_signature(root) + if signature: + markers.append(f"{root}:\n{signature}") + return "\n".join(markers) + + +def simulator_partial_aggregation_signature(sim_root: Path) -> str: + signatures = [] + for relative_log_path in ("server/log_fl.txt", "server/simulate_job/log_fl.txt"): + path = sim_root / relative_log_path + text = recent_text(path, SIMULATOR_PROGRESS_LOG_LIMIT) + if not text: + continue + for line in reversed(text.splitlines()): + match = SIMULATOR_AGGREGATION_RE.search(line) + if not match: + continue + received = int(match.group(1)) + expected = int(match.group(2)) + if received < expected: + signatures.append(f"{path.relative_to(sim_root)}:{line}") + break + return "\n".join(signatures) + + +def simulator_partial_aggregation_signature_for_roots(simulator_stall_roots: Sequence[Path]) -> str: + markers = [] + for root in simulator_stall_roots: + signature = simulator_partial_aggregation_signature(root) + if signature: + markers.append(f"{root}:\n{signature}") + return "\n".join(markers) + + +def run( + argv: Sequence[str], + cwd: Path, + timeout: int, + log_path: Path, + simulator_stall_roots: Sequence[Path] = (), + stall_check_interval: float = 5.0, + simulator_no_progress_timeout: int = DEFAULT_SIMULATOR_NO_PROGRESS_TIMEOUT, +) -> Tuple[int, str, float]: + started = time.monotonic() + next_stall_check = started + last_progress_check = started + last_progress_seen = started + last_progress_signature = "" + last_partial_aggregation_seen = started + last_partial_aggregation_signature = "" + log_path.parent.mkdir(parents=True, exist_ok=True) + chunks: List[str] = [] + with log_path.open("w", encoding="utf-8") as log_file: + process = subprocess.Popen( + argv, + cwd=str(cwd), + 
stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + text=True, + bufsize=1, + start_new_session=os.name != "nt", + ) + assert process.stdout is not None + selector = selectors.DefaultSelector() + selector.register(process.stdout, selectors.EVENT_READ) + timed_out = False + stall_message = "" + while True: + now = time.monotonic() + if timeout and now - started > timeout: + timed_out = True + terminate_process(process) + if simulator_stall_roots and now >= next_stall_check: + stall_message = simulator_stall_message(simulator_stall_roots) or "" + next_stall_check = now + stall_check_interval + if stall_message: + terminate_process(process) + elif simulator_no_progress_timeout: + partial_aggregation_signature = simulator_partial_aggregation_signature_for_roots( + simulator_stall_roots + ) + if ( + partial_aggregation_signature + and partial_aggregation_signature != last_partial_aggregation_signature + ): + last_partial_aggregation_signature = partial_aggregation_signature + last_partial_aggregation_seen = now + elif ( + last_partial_aggregation_signature + and now - last_partial_aggregation_seen > simulator_no_progress_timeout + ): + stall_message = ( + "partial simulator aggregation made no server-side progress for " + f"{int(now - last_partial_aggregation_seen)}s: {last_partial_aggregation_signature}" + ) + terminate_process(process) + progress_signature = simulator_progress_signature_for_roots(simulator_stall_roots) + if stall_message: + pass + elif progress_signature and progress_signature != last_progress_signature: + last_progress_signature = progress_signature + last_progress_seen = now + elif ( + last_progress_signature + and now - last_progress_seen > simulator_no_progress_timeout + and now - last_progress_check >= stall_check_interval + ): + stall_message = ( + f"no simulator progress markers changed for {int(now - last_progress_seen)}s " + f"across {', '.join(str(root) for root in simulator_stall_roots)}" + ) + terminate_process(process) + 
last_progress_check = now + events = selector.select(timeout=0.2) + for key, _ in events: + chunk = key.fileobj.readline() + if chunk: + chunks.append(chunk) + log_file.write(chunk) + log_file.flush() + if process.poll() is not None: + remainder = process.stdout.read() + if remainder: + chunks.append(remainder) + log_file.write(remainder) + log_file.flush() + break + selector.close() + if timed_out: + timeout_msg = f"\nTIMEOUT after {timeout}s\n" + chunks.append(timeout_msg) + log_file.write(timeout_msg) + log_file.flush() + return 124, "".join(chunks), time.monotonic() - started + if stall_message: + stall_text = f"\nSIMULATOR_STALL: {stall_message}\n" + chunks.append(stall_text) + log_file.write(stall_text) + log_file.flush() + return SIMULATOR_STALL_EXIT_CODE, "".join(chunks), time.monotonic() - started + return process.returncode or 0, "".join(chunks), time.monotonic() - started + + +def run_allow_timeout( + argv: Sequence[str], + cwd: Path, + timeout: int, + log_path: Path, + simulator_stall_roots: Sequence[Path] = (), + simulator_no_progress_timeout: int = DEFAULT_SIMULATOR_NO_PROGRESS_TIMEOUT, +) -> Tuple[int, str, float]: + return run( + argv, + cwd, + timeout, + log_path, + simulator_stall_roots=simulator_stall_roots, + simulator_no_progress_timeout=simulator_no_progress_timeout, + ) + + +def read_yaml(path: Path) -> Dict[str, Any]: + if yaml is None: + raise RuntimeError("PyYAML is required to read YAML files") + try: + data = yaml.safe_load(path.read_text(encoding="utf-8")) or {} + except yaml.YAMLError as e: + raise ValueError(f"invalid YAML in {path}: {e}") from e + if not isinstance(data, dict): + raise ValueError(f"invalid YAML in {path}: expected a mapping") + return data + + +def write_yaml(path: Path, data: Dict[str, Any]) -> None: + if yaml is None: + raise RuntimeError("PyYAML is required to write autofl.yaml") + path.write_text(yaml.safe_dump(data, sort_keys=False), encoding="utf-8") + + +def write_json(path: Path, data: Dict[str, Any]) -> 
None: + path.parent.mkdir(parents=True, exist_ok=True) + path.write_text(json.dumps(data, indent=2, sort_keys=True) + "\n", encoding="utf-8") + + +def atomic_write_bytes(path: Path, data: bytes) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + temporary = path.with_name(f".{path.name}.tmp-{uuid.uuid4().hex}") + try: + temporary.write_bytes(data) + os.replace(temporary, path) + finally: + temporary.unlink(missing_ok=True) + + +def capture_file_versions(paths: Iterable[Path]) -> Dict[Path, Optional[bytes]]: + return {path: path.read_bytes() if path.is_file() else None for path in paths} + + +def restore_file_versions(versions: Dict[Path, Optional[bytes]]) -> None: + for path, data in versions.items(): + if data is None: + path.unlink(missing_ok=True) + else: + atomic_write_bytes(path, data) + + +def read_json(path: Path) -> Dict[str, Any]: + try: + payload = json.loads(path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError) as e: + raise ValueError(f"failed to read JSON from {path}: {e}") from e + if not isinstance(payload, dict): + raise ValueError(f"expected a JSON object in {path}") + return payload + + +def utc_now() -> str: + return datetime.now(timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z") + + +def sha256_bytes(data: bytes) -> str: + return hashlib.sha256(data).hexdigest() + + +def sha256_json(data: Any) -> str: + return sha256_bytes(json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")) + + +def safe_relative_path(workspace: Path, value: str) -> str: + path = Path(value) + resolved = path.resolve() if path.is_absolute() else (workspace / path).resolve() + try: + relative = resolved.relative_to(workspace.resolve()) + except ValueError as e: + raise ValueError(f"path escapes the Auto-FL job workspace: {value}") from e + if not relative.parts or any(part in RESERVED_CANDIDATE_PATH_PARTS for part in relative.parts): + raise ValueError(f"path is reserved for Auto-FL or repository metadata: 
{value}") + return relative.as_posix() + + +def allowed_edit_paths(config: Dict[str, Any], workspace: Path) -> List[str]: + values = config.get("job", {}).get("allowed_edit_paths", []) or [] + if not isinstance(values, list): + raise ValueError("autofl.yaml job.allowed_edit_paths must be a list") + return list(dict.fromkeys(safe_relative_path(workspace, str(value)) for value in values)) + + +def is_allowed_new_source(path: str) -> bool: + relative = Path(path) + return relative.suffix == ".py" and not any(part in RESERVED_CANDIDATE_PATH_PARTS for part in relative.parts) + + +def file_map(root: Path) -> Dict[str, str]: + if not root.exists(): + return {} + files: Dict[str, str] = {} + for path in sorted(root.rglob("*")): + if path.is_symlink(): + raise ValueError(f"candidate source contains a symlink: {path}") + if path.is_file(): + relative = path.relative_to(root).as_posix() + files[relative] = sha256_bytes(path.read_bytes()) + return files + + +def source_hash(files: Dict[str, str]) -> str: + return sha256_json(files) + + +def copy_relative_file(source_root: Path, destination_root: Path, relative: str) -> None: + source = source_root / relative + destination = destination_root / relative + destination.parent.mkdir(parents=True, exist_ok=True) + shutil.copy2(source, destination) + + +def stage_best_snapshot(workspace: Path, config: Dict[str, Any], snapshot_root: Path) -> Tuple[Path, Dict[str, str]]: + snapshot_root.parent.mkdir(parents=True, exist_ok=True) + staged_root = snapshot_root.parent / f".{snapshot_root.name}.staged-{uuid.uuid4().hex}" + source_root = staged_root / "source" + source_root.mkdir(parents=True, exist_ok=True) + try: + for relative in allowed_edit_paths(config, workspace): + source = workspace / relative + if source.is_symlink(): + raise ValueError(f"allowed edit path is a symlink: {source}") + if source.is_file(): + copy_relative_file(workspace, source_root, relative) + files = file_map(source_root) + write_json(staged_root / "snapshot.json", 
{"source_sha256": source_hash(files), "files": files}) + return staged_root, files + except BaseException: + shutil.rmtree(staged_root, ignore_errors=True) + raise + + +def activate_best_snapshot(snapshot_root: Path, staged_root: Path) -> Optional[Path]: + previous_root = None + if snapshot_root.exists(): + previous_root = snapshot_root.parent / f".{snapshot_root.name}.previous-{uuid.uuid4().hex}" + os.replace(snapshot_root, previous_root) + try: + os.replace(staged_root, snapshot_root) + except BaseException: + if previous_root is not None: + os.replace(previous_root, snapshot_root) + raise + return previous_root + + +def rollback_best_snapshot(snapshot_root: Path, previous_root: Optional[Path]) -> None: + if previous_root is None: + shutil.rmtree(snapshot_root, ignore_errors=True) + return + failed_root = snapshot_root.parent / f".{snapshot_root.name}.failed-{uuid.uuid4().hex}" + if snapshot_root.exists(): + os.replace(snapshot_root, failed_root) + try: + os.replace(previous_root, snapshot_root) + except BaseException: + if failed_root.exists(): + os.replace(failed_root, snapshot_root) + raise + shutil.rmtree(failed_root, ignore_errors=True) + + +def create_best_snapshot(workspace: Path, config: Dict[str, Any], snapshot_root: Path) -> Dict[str, str]: + staged_root, files = stage_best_snapshot(workspace, config, snapshot_root) + # Bind previous_root up front so cleanup never hits an unbound name if activation raises. + previous_root: Optional[Path] = None + try: + previous_root = activate_best_snapshot(snapshot_root, staged_root) + finally: + if staged_root.exists(): + shutil.rmtree(staged_root, ignore_errors=True) + if previous_root is not None: + shutil.rmtree(previous_root, ignore_errors=True) + return files + + +def load_best_snapshot(snapshot_root: Path) -> Tuple[Path, Dict[str, str]]: + metadata = read_json(snapshot_root / "snapshot.json") + files = metadata.get("files") + if not isinstance(files, dict) or not all(isinstance(k, str) and isinstance(v, str) for k, v in files.items()): + raise ValueError("best snapshot metadata has an invalid files mapping") + source_root = snapshot_root /
"source" + if source_hash(files) != metadata.get("source_sha256") or file_map(source_root) != files: + raise ValueError("best snapshot failed its integrity check") + return source_root, files + + +def workspace_matches_snapshot(workspace: Path, source_root: Path, files: Dict[str, str]) -> bool: + for relative, digest in files.items(): + path = workspace / relative + if path.is_symlink() or not path.is_file() or sha256_bytes(path.read_bytes()) != digest: + return False + return True + + +def validate_candidate_id(value: str) -> str: + if not re.fullmatch(r"[A-Za-z0-9][A-Za-z0-9._-]{0,63}", value or ""): + raise ValueError("candidate name must match [A-Za-z0-9][A-Za-z0-9._-]{0,63}") + return value + + +def candidate_manifest_path(workspace: Path, candidate_id: str) -> Path: + return workspace / CANDIDATE_ROOT / candidate_id / "candidate_manifest.json" + + +def load_candidate_manifest(path: Path) -> Dict[str, Any]: + manifest = read_json(path) + if manifest.get("schema_version") != CANDIDATE_MANIFEST_SCHEMA_VERSION: + raise ValueError(f"unsupported candidate manifest schema in {path}") + candidate_id = validate_candidate_id(str(manifest.get("candidate_id") or "")) + expected = candidate_manifest_path(Path(str(manifest.get("workspace_root") or "")), candidate_id).resolve() + if path.resolve() != expected: + raise ValueError("candidate manifest path does not match its workspace and candidate ID") + if manifest.get("status") not in {"prepared", "ready_for_external_execution"}: + raise ValueError(f"candidate {candidate_id} is not pending evaluation") + return manifest + + +def campaign_metadata_path(workspace: Path) -> Path: + return workspace / CAMPAIGN_METADATA_PATH + + +def load_campaign_metadata(workspace: Path, job: Path) -> Dict[str, Any]: + metadata = read_json(campaign_metadata_path(workspace)) + if metadata.get("schema_version") != CAMPAIGN_METADATA_SCHEMA_VERSION: + raise ValueError("unsupported Auto-FL campaign metadata schema") + if 
Path(str(metadata.get("job") or "")).resolve() != job.resolve(): + raise ValueError("campaign metadata belongs to a different job.py") + return metadata + + +def fixed_budget_hash(config: Dict[str, Any]) -> str: + return sha256_json(config.get("budget", {}).get("fixed_training_budget", {}) or {}) + + +def candidate_changes( + workspace: Path, + config: Dict[str, Any], + best_source: Path, + best_files: Dict[str, str], + draft_source: Path, +) -> Tuple[List[str], List[str]]: + draft_files = file_map(draft_source) + deleted = sorted(set(best_files) - set(draft_files)) + if deleted: + raise ValueError(f"candidate deletes managed source files: {', '.join(deleted)}") + + allowed = set(allowed_edit_paths(config, workspace)) + changed = sorted(path for path, digest in draft_files.items() if best_files.get(path) != digest) + created = [] + for relative in changed: + if relative in best_files: + if relative not in allowed: + raise ValueError(f"candidate modifies a path outside job.allowed_edit_paths: {relative}") + continue + if (workspace / relative).exists() or not is_allowed_new_source(relative): + raise ValueError(f"candidate creates an unauthorized source path: {relative}") + created.append(relative) + return changed, created + + +def text_for_diff(path: Path) -> List[str]: + data = path.read_bytes() + if b"\0" in data: + raise ValueError(f"candidate diff does not support binary file: {path}") + return data.decode("utf-8", errors="replace").splitlines(keepends=True) + + +def render_candidate_patch(best_source: Path, draft_source: Path, changed: Sequence[str]) -> str: + chunks: List[str] = [] + for relative in changed: + before = text_for_diff(best_source / relative) if (best_source / relative).is_file() else [] + after = text_for_diff(draft_source / relative) + chunks.extend( + difflib.unified_diff( + before, + after, + fromfile=f"a/{relative}", + tofile=f"b/{relative}", + ) + ) + return "".join(chunks) + + +def apply_candidate_source(workspace: Path, draft_source: 
Path, changed: Sequence[str]) -> None: + for relative in changed: + copy_relative_file(draft_source, workspace, relative) + + +def restore_best_source(workspace: Path, best_source: Path, best_files: Dict[str, str], changed: Sequence[str]) -> None: + for relative in changed: + destination = workspace / relative + if relative in best_files: + copy_relative_file(best_source, workspace, relative) + elif destination.exists() and not destination.is_dir(): + destination.unlink() + + +_CAMPAIGN_GUARD = None + + +def load_campaign_guard(): + global _CAMPAIGN_GUARD + if _CAMPAIGN_GUARD is not None: + return _CAMPAIGN_GUARD + + guard_path = Path(__file__).resolve().with_name("campaign_guard.py") + spec = importlib.util.spec_from_file_location("nvflare_autofl_skill_campaign_guard", guard_path) + if spec is None or spec.loader is None: + raise RuntimeError(f"Unable to load campaign guard from {guard_path}") + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + _CAMPAIGN_GUARD = module + return module + + +def resolve_output_path(cwd: Path, value: str) -> Path: + path = Path(value) + if path.is_absolute(): + return path + return cwd / path + + +def resolve_stop_files(cwd: Path, values: Optional[Sequence[str]]) -> List[str]: + guard = load_campaign_guard() + stop_files = values if values is not None else list(guard.DEFAULT_STOP_FILES) + return [str(resolve_output_path(cwd, value)) for value in stop_files] + + +def extract_result_dir(output: str) -> Optional[Path]: + patterns = [ + r"Result can be found in\s*:\s*(?P<path>\S+)", + r"Results:\s*(?P<path>\S+)", + r"result_dir=(?P<path>\S+)", + ] + for pattern in patterns: + match = re.search(pattern, output) + if match: + return Path(match.group("path")).expanduser() + return None + + +def objective_contract(config: Dict[str, Any], requested_metric: str) -> Dict[str, Any]: + objective = config.get("objective", {}) + if not isinstance(objective, dict): + objective = {} + metric =
str(objective.get("metric") or requested_metric) + requested = str(objective.get("requested_metric") or metric) + optimization = str(objective.get("optimization_metric") or metric) + extraction_order = objective.get("metric_extraction_order") + if not isinstance(extraction_order, list) or not extraction_order: + extraction_order = [optimization] + extraction_order = [str(item) for item in extraction_order if item] + if optimization not in extraction_order: + extraction_order.insert(0, optimization) + return { + **objective, + "metric": metric, + "requested_metric": requested, + "optimization_metric": optimization, + "metric_extraction_order": extraction_order, + } + + +def apply_metric_contract( + config: Dict[str, Any], requested_metric: str, schema: Optional[Dict[str, Any]] +) -> Dict[str, Any]: + objective = objective_contract(config, requested_metric) + schema_objective = (schema or {}).get("objective", {}) + if isinstance(schema_objective, dict): + schema_requested = schema_objective.get("requested_metric") or schema_objective.get("metric") + if not schema_requested or schema_requested == objective["requested_metric"]: + for key in ("optimization_metric", "metric_extraction_order", "metric_source"): + if key in schema_objective: + objective[key] = schema_objective[key] + config["objective"] = objective_contract({"objective": objective}, requested_metric) + return config + + +def metric_extraction_order(config: Dict[str, Any], requested_metric: str) -> List[str]: + return list(objective_contract(config, requested_metric)["metric_extraction_order"]) + + +def optimization_metric(config: Dict[str, Any], requested_metric: str) -> str: + return str(objective_contract(config, requested_metric)["optimization_metric"]) + + +def metric_source(config: Dict[str, Any]) -> str: + source = config.get("objective", {}).get("metric_source", "") + return str(source) if source else "NVFlare metric artifacts" + + +def normalize_metric_order(metrics: Sequence[str] | str) -> 
List[str]: + if isinstance(metrics, str): + return [metrics] + return [str(metric) for metric in metrics if metric] + + +def find_metric_value(payload: Any, metric_order: Sequence[str] | str) -> Optional[float]: + metric_keys = normalize_metric_order(metric_order) + if isinstance(payload, dict): + for key in ("final_aggregated_metrics", "best_metrics", "metrics"): + for metric_key in metric_keys: + value = metric_from_list(payload.get(key), metric_key) + if value is not None: + return value + for metric_key in metric_keys: + if metric_key in payload and isinstance(payload[metric_key], (int, float)): + return float(payload[metric_key]) + for value in payload.values(): + score = find_metric_value(value, metric_keys) + if score is not None: + return score + elif isinstance(payload, list): + for metric_key in metric_keys: + value = metric_from_list(payload, metric_key) + if value is not None: + return value + for item in payload: + score = find_metric_value(item, metric_keys) + if score is not None: + return score + return None + + +def metric_from_list(items: Any, metric: str) -> Optional[float]: + if not isinstance(items, list): + return None + for item in items: + if not isinstance(item, dict): + continue + if item.get("name") == metric and isinstance(item.get("value"), (int, float)): + return float(item["value"]) + return None + + +def extract_score(artifact_root: Path, metrics: Sequence[str] | str) -> Optional[float]: + metric_order = normalize_metric_order(metrics) + metric_files = list(artifact_root.glob("**/metrics_summary.json")) + list( + artifact_root.glob("**/cross_val_results.json") + ) + for path in metric_files: + try: + payload = json.loads(path.read_text(encoding="utf-8")) + except Exception: + continue + score = find_metric_value(payload, metric_order) + if score is not None: + return score + + number_patterns = [rf"{re.escape(metric)}[^0-9+\-.]+([+-]?[0-9]+(?:\.[0-9]+)?)" for metric in metric_order] + if "accuracy" in metric_order: + 
number_patterns.append(r"Accuracy of the network[^0-9+\-.]+([+-]?[0-9]+(?:\.[0-9]+)?)") + for path in artifact_root.glob("**/*.log"): + try: + text = path.read_text(encoding="utf-8", errors="replace") + except Exception: + continue + for pattern in number_patterns: + matches = re.findall(pattern, text, flags=re.IGNORECASE) + if matches: + return float(matches[-1]) + return None + + +def is_sandbox_socket_failure(output: str) -> bool: + text = output.lower() + return ( + "permissionerror" in text + and ("operation not permitted" in text or "[errno 1]" in text) + and ("socket" in text or "sock" in text) + ) + + +def is_nvflare_simulator_stall(output: str) -> bool: + return "SIMULATOR_STALL:" in output + + +def collect_artifacts(result_dir: Optional[Path], output_root: Path, name: str, log_path: Path) -> Path: + dest = output_root / name / "simulation" + run_log = output_root / name / "run.log" + if dest.exists(): + shutil.rmtree(dest) + if result_dir and result_dir.exists(): + shutil.copytree(result_dir, dest) + shutil.rmtree(result_dir, ignore_errors=True) + else: + dest.mkdir(parents=True, exist_ok=True) + if log_path.resolve() != run_log.resolve(): + shutil.copy2(log_path, run_log) + return dest + + +def job_help(python: str, job: Path, cwd: Path) -> str: + process = subprocess.run( + [python, str(job), "--help"], + cwd=str(cwd), + stdout=subprocess.PIPE, + stderr=subprocess.STDOUT, + text=True, + check=False, + ) + return process.stdout + + +def supports_flag(help_text: str, flag: str) -> bool: + return flag in help_text + + +def mutable_arg_specs(schema: Dict[str, Any]) -> Dict[str, Dict[str, Any]]: + specs = schema.get("mutable_args") + return specs if isinstance(specs, dict) else {} + + +def candidate_arg_values(candidate_args: Sequence[str]) -> Dict[str, Any]: + values: Dict[str, Any] = {} + idx = 0 + while idx < len(candidate_args): + raw = candidate_args[idx] + if not raw.startswith("--"): + idx += 1 + continue + option = raw[2:] + if "=" in option: + name, 
value = option.split("=", 1) + values[name.replace("-", "_")] = value + idx += 1 + continue + name = option.replace("-", "_") + if idx + 1 >= len(candidate_args) or candidate_args[idx + 1].startswith("--"): + values[name] = True + idx += 1 + else: + values[name] = candidate_args[idx + 1] + idx += 2 + return values + + +def coerce_schema_value(value: Any, spec: Dict[str, Any]) -> Any: + value_type = spec.get("type") + if value_type == "int": + return int(value) + if value_type == "float": + return float(value) + if value_type == "bool": + if isinstance(value, bool): + return value + return str(value).strip().lower() in {"1", "true", "yes", "on"} + return str(value) + + +def candidate_args_allowed(candidate_args: Sequence[str], schema: Dict[str, Any]) -> Tuple[bool, str]: + specs = mutable_arg_specs(schema) + if not specs: + return True, "" + + for name, value in candidate_arg_values(candidate_args).items(): + spec = specs.get(name) + if not isinstance(spec, dict): + continue + try: + coerced = coerce_schema_value(value, spec) + except (TypeError, ValueError) as e: + return False, f"{name}={value!r} cannot be parsed as {spec.get('type')}: {e}" + + choices = spec.get("choices") + if choices is not None and coerced not in choices: + return False, f"{name}={coerced!r} is not in allowed choices {choices!r}" + minimum = spec.get("min") + if minimum is not None and coerced < minimum: + return False, f"{name}={coerced!r} is below schema min {minimum!r}" + maximum = spec.get("max") + if maximum is not None and coerced > maximum: + return False, f"{name}={coerced!r} is above schema max {maximum!r}" + + return True, "" + + +def candidate_preserves_fixed_args( + candidate_args: Sequence[str], config: Dict[str, Any], schema: Dict[str, Any] +) -> Tuple[bool, str]: + values = candidate_arg_values(candidate_args) + fixed_names = set(fixed_within_campaign(schema)) + fixed_budget = config.get("budget", {}).get("fixed_training_budget", {}) or {} + 
fixed_names.update(FIXED_BUDGET_TO_CLI[field] for field in fixed_budget if field in FIXED_BUDGET_TO_CLI) + changed = sorted(fixed_names.intersection(values)) + if changed: + return False, f"candidate run arguments change fixed-budget fields: {', '.join(changed)}" + return True, "" + + +def load_mutation_schema(cwd: Path) -> Dict[str, Any]: + path = cwd / "mutation_schema.yaml" + if not path.exists(): + return {} + return read_yaml(path) + + +def comparison_budget(schema: Dict[str, Any]) -> Dict[str, Any]: + comparison = schema.get("comparison_budget_args") + if not isinstance(comparison, dict): + return {} + budget = comparison.get("default_candidate_budget") + return budget if isinstance(budget, dict) else {} + + +def fixed_within_campaign(schema: Dict[str, Any]) -> set: + values = [] + comparison = schema.get("comparison_budget_args") + if isinstance(comparison, dict): + values = comparison.get("fixed_within_campaign") or [] + return set(values) if isinstance(values, list) else set() + + +def build_comparison_budget_args(schema: Dict[str, Any], help_text: str) -> List[str]: + budget = comparison_budget(schema) + args: List[str] = [] + for budget_field, cli_name in COMPARISON_BUDGET_TO_CLI.items(): + value = budget.get(budget_field) + if value is not None and supports_flag(help_text, f"--{cli_name}"): + args.extend([f"--{cli_name}", str(value)]) + if budget.get("cross_site_eval") and supports_flag(help_text, "--cross_site_eval"): + args.append("--cross_site_eval") + return args + + +def build_fixed_args(config: Dict[str, Any], help_text: str, schema: Dict[str, Any]) -> List[str]: + fixed = config.get("budget", {}).get("fixed_training_budget", {}) or {} + budget = comparison_budget(schema) + budget_cli_names = { + cli_name for budget_field, cli_name in COMPARISON_BUDGET_TO_CLI.items() if budget.get(budget_field) is not None + } + args: List[str] = [] + for budget_field, cli_name in FIXED_BUDGET_TO_CLI.items(): + if cli_name in budget_cli_names: + continue + value = 
fixed.get(budget_field) + if value is not None and supports_flag(help_text, f"--{cli_name}"): + args.extend([f"--{cli_name}", str(value)]) + return args + + +def build_base_args(args: argparse.Namespace, help_text: str, schema: Dict[str, Any]) -> List[str]: + base = shlex.split(args.base_args) + budget_args = build_comparison_budget_args(schema, help_text) + if budget_args: + base.extend(budget_args) + if args.prefer_synthetic and supports_flag(help_text, "--synthetic_data"): + if "--synthetic_data" not in base: + base.append("--synthetic_data") + if supports_flag(help_text, "--train_size") and "--train_size" not in base: + base.extend(["--train_size", str(args.synthetic_train_size)]) + if supports_flag(help_text, "--test_size") and "--test_size" not in base: + base.extend(["--test_size", str(args.synthetic_test_size)]) + return base + + +def suggested_arg_defaults(config: Dict[str, Any]) -> Dict[str, Any]: + suggested = config.get("search_space", {}).get("suggested", {}) or {} + defaults = {} + for name, spec in suggested.items(): + if isinstance(spec, dict) and "default" in spec: + defaults[name] = spec["default"] + return defaults + + +def candidate_plan( + config: Dict[str, Any], + help_text: str, + max_candidates: Optional[int], + schema: Optional[Dict[str, Any]] = None, +) -> Iterable[JobRun]: + defaults = suggested_arg_defaults(config) + candidates: List[JobRun] = [] + seen_args = set() + fixed_fields = fixed_within_campaign(schema or {}) + + def can_mutate(name: str) -> bool: + return name not in fixed_fields + + def make_candidate(name: str, candidate_args: List[str], description: str) -> Optional[JobRun]: + allowed, _ = candidate_args_allowed(candidate_args, schema or {}) + if not allowed: + return None + key = tuple(candidate_args) + if key in seen_args: + return None + seen_args.add(key) + return JobRun(name=name, args=candidate_args, description=description) + + def add(name: str, candidate_args: List[str], description: str) -> None: + if 
max_candidates is not None and len(candidates) >= max_candidates: + return + candidate = make_candidate(name, candidate_args, description) + if candidate is not None: + candidates.append(candidate) + + if can_mutate("aggregator") and supports_flag(help_text, "--aggregator"): + for value in ["default", "fedavg", "fedavgm", "fedadam", "scaffold"]: + add(f"aggregator_{value}", ["--aggregator", value], f"aggregator={value}") + if can_mutate("fedproxloss_mu") and supports_flag(help_text, "--fedproxloss_mu"): + add( + "fedprox_mu_1e-5", + ["--aggregator", "weighted", "--fedproxloss_mu", "1e-5"], + "aggregator=weighted fedproxloss_mu=1e-5", + ) + add( + "fedprox_mu_1e-4", + ["--aggregator", "weighted", "--fedproxloss_mu", "1e-4"], + "aggregator=weighted fedproxloss_mu=1e-4", + ) + + if can_mutate("aggregation_epochs") and supports_flag(help_text, "--aggregation_epochs"): + default = int(defaults.get("aggregation_epochs") or 4) + for value in [1, 2, 3, 4, 6, 8]: + if value != default: + add(f"aggregation_epochs_{value}", ["--aggregation_epochs", str(value)], f"aggregation_epochs={value}") + + if can_mutate("local_train_steps") and supports_flag(help_text, "--local_train_steps"): + for value in [50, 100, 200, 400]: + add(f"local_train_steps_{value}", ["--local_train_steps", str(value)], f"local_train_steps={value}") + + if can_mutate("lr") and supports_flag(help_text, "--lr"): + default = float(defaults.get("lr") or 0.05) + for value in [default / 4, default / 2, default * 2, default * 4]: + value_text = f"{value:.6g}" + add(f"lr_{value_text.replace('.', 'p').replace('-', 'm')}", ["--lr", value_text], f"lr={value_text}") + + if can_mutate("momentum") and supports_flag(help_text, "--momentum"): + for value in [0.0, 0.5, 0.8, 0.95]: + add( + f"momentum_{str(value).replace('.', 'p')}", + ["--momentum", str(value)], + f"momentum={value}", + ) + + if can_mutate("weight_decay") and supports_flag(help_text, "--weight_decay"): + for value in ["1e-5", "1e-4", "5e-4"]: + 
add(f"weight_decay_{value.replace('-', 'm')}", ["--weight_decay", value], f"weight_decay={value}") + + if can_mutate("batch_size") and supports_flag(help_text, "--batch_size"): + default = int(defaults.get("batch_size") or 16) + values = [max(1, default // 2), default * 2, default * 4, max(1, default // 4), 24, 40, 64, 96, 128, 256] + for value in values: + if value != default: + add(f"batch_size_{value}", ["--batch_size", str(value)], f"batch_size={value}") + + if can_mutate("epochs") and supports_flag(help_text, "--epochs"): + for value in [1, 2, 3, 4, 5]: + add(f"epochs_{value}", ["--epochs", str(value)], f"epochs={value}") + + if can_mutate("num_workers") and supports_flag(help_text, "--num_workers"): + for value in [0, 1, 2, 4]: + add(f"num_workers_{value}", ["--num_workers", str(value)], f"num_workers={value}") + + if supports_flag(help_text, "--client_memory_gc_rounds"): + add("client_memory_gc_1", ["--client_memory_gc_rounds", "1"], "client_memory_gc_rounds=1") + + if not candidates: + add("rerun", [], "repeat baseline command to test campaign plumbing") + + if max_candidates is not None: + return candidates[:max_candidates] + + def uncapped() -> Iterable[JobRun]: + for template in candidates: + yield JobRun(name=template.name, args=list(template.args), description=template.description) + + idx = 1 + batch_default = int(defaults.get("batch_size") or 16) + while True: + generated = False + if can_mutate("batch_size") and supports_flag(help_text, "--batch_size"): + # Walk a broad conservative range before repeats so uncapped + # campaigns keep doing comparable, reviewable work for hours. 
+ # The stride of 7 is coprime with 512, so successive idx values visit every + # batch size in 1..512 once before the sequence starts to repeat.
+ value = 1 + ((batch_default + idx * 7) % 512) + candidate_args = ["--batch_size", str(value)] + candidate = make_candidate( + f"batch_size_auto_{value}", + candidate_args, + f"batch_size={value}", + ) + if value != batch_default and candidate is not None: + yield candidate + generated = True + + if not generated and can_mutate("aggregation_epochs") and supports_flag(help_text, "--aggregation_epochs"): + value = 1 + ((idx - 1) % 8) + candidate_args = ["--aggregation_epochs", str(value)] + candidate = make_candidate( + f"aggregation_epochs_auto_{value}", + candidate_args, + f"aggregation_epochs={value}", + ) + if candidate is not None: + yield candidate + generated = True + + if not generated and can_mutate("lr") and supports_flag(help_text, "--lr"): + value = 10 ** (-4 + ((idx - 1) % 25) / 10) + value_text = f"{value:.6g}" + candidate_args = ["--lr", value_text] + candidate = make_candidate( + f"lr_auto_{value_text.replace('.', 'p').replace('-', 'm')}", + candidate_args, + f"lr={value_text}", + ) + if candidate is not None: + yield candidate + generated = True + + if not generated and can_mutate("epochs") and supports_flag(help_text, "--epochs"): + value = 1 + ((idx - 1) % 20) + candidate_args = ["--epochs", str(value)] + candidate = make_candidate(f"epochs_auto_{value}", candidate_args, f"epochs={value}") + if candidate is not None: + yield candidate + generated = True + + if not generated and can_mutate("num_workers") and supports_flag(help_text, "--num_workers"): + value = (idx - 1) % 9 + candidate_args = ["--num_workers", str(value)] + candidate = make_candidate(f"num_workers_auto_{value}", candidate_args, f"num_workers={value}") + if candidate is not None: + yield candidate + generated = True + + if not generated: + template = candidates[(idx - 1) % len(candidates)] + cycle = (idx - 1) // len(candidates) + 2 + yield JobRun( + name=f"{template.name}_repeat_{cycle:04d}", + args=list(template.args), + description=f"{template.description}; repeat_cycle={cycle}", 
+ ) + idx += 1 + + return uncapped() + + +def remove_known_result_dir(config: Dict[str, Any]) -> None: + recipe_args = config.get("job", {}).get("recipe_args", {}) or {} + name = recipe_args.get("name", {}).get("value") if isinstance(recipe_args.get("name"), dict) else None + if isinstance(name, str) and name: + shutil.rmtree(Path("/tmp/nvflare/simulation") / name, ignore_errors=True) + + +def run_job( + run_def: JobRun, + *, + python: str, + job: Path, + cwd: Path, + help_text: str, + fixed_args: List[str], + base_args: List[str], + output_root: Path, + timeout: int, + simulator_no_progress_timeout: int, + metrics: Sequence[str], + config: Dict[str, Any], +) -> RunRecord: + log_path = output_root / run_def.name / "run.log" + run_name = f"autofl_{run_def.name}" + simulator_root = Path("/tmp/nvflare/simulation") / run_name + name_args = ["--name", run_name] if supports_flag(help_text, "--name") else [] + command = [python, str(job), *fixed_args, *base_args, *name_args, *run_def.args] + run_def.command = command + remove_known_result_dir(config) + if name_args: + shutil.rmtree(simulator_root, ignore_errors=True) + rc, stdout, runtime = run_allow_timeout( + command, + cwd, + timeout, + log_path, + simulator_stall_roots=[simulator_root], + simulator_no_progress_timeout=simulator_no_progress_timeout, + ) + run_def.runtime_seconds = runtime + result_dir = extract_result_dir(stdout) or (simulator_root if simulator_root.exists() else None) + artifact_dir = collect_artifacts(result_dir, output_root, run_def.name, log_path) + run_def.artifacts = str(artifact_dir) + + if rc != 0: + if is_sandbox_socket_failure(stdout): + run_def.status = INFRASTRUCTURE_RETRY + run_def.failure_reason = "sandbox/socket permission failure; rerun runner with escalated execution" + elif is_nvflare_simulator_stall(stdout): + run_def.status = "crash" + run_def.failure_reason = ( + "nvflare simulator watchdog detected a child connection/no-progress stall; " + "candidate killed and campaign continued" 
+ ) + else: + run_def.status = "crash" + run_def.failure_reason = f"exit_code={rc}" + else: + score = extract_score(artifact_dir, metrics) + if score is None: + run_def.status = "crash" + run_def.failure_reason = f"metrics '{', '.join(metrics)}' not found" + else: + run_def.score = score + + return RunRecord( + status=run_def.status, + name=run_def.name, + score=run_def.score, + runtime_seconds=run_def.runtime_seconds, + changed_files="none", + diff_summary=run_def.description, + run_command=shlex.join(command), + artifacts=run_def.artifacts, + failure_reason=run_def.failure_reason, + ) + + +def write_results(path: Path, records: List[RunRecord]) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + with path.open("w", encoding="utf-8", newline="") as f: + writer = csv.DictWriter(f, fieldnames=RESULT_FIELDS, delimiter="\t") + writer.writeheader() + for record in records: + writer.writerow( + { + "status": record.status, + "name": record.name, + "score": "" if record.score is None else f"{record.score:.6f}", + "runtime_seconds": f"{record.runtime_seconds:.3f}", + "changed_files": record.changed_files, + "diff_summary": record.diff_summary, + "run_command": record.run_command, + "artifacts": record.artifacts, + "failure_reason": record.failure_reason, + "candidate_manifest": record.candidate_manifest, + "base_candidate": record.base_candidate, + "patch_sha256": record.patch_sha256, + } + ) + + +def load_results(path: Path) -> List[RunRecord]: + if not path.exists(): + return [] + records = [] + with path.open("r", encoding="utf-8", newline="") as f: + for row in csv.DictReader(f, delimiter="\t"): + score_text = row.get("score", "") + records.append( + RunRecord( + status=row.get("status", ""), + name=row.get("name", ""), + score=float(score_text) if score_text else None, + runtime_seconds=float(row.get("runtime_seconds") or 0.0), + changed_files=row.get("changed_files", ""), + diff_summary=row.get("diff_summary", ""), + run_command=row.get("run_command", ""), + 
artifacts=row.get("artifacts", ""), + failure_reason=row.get("failure_reason", ""), + candidate_manifest=row.get("candidate_manifest", ""), + base_candidate=row.get("base_candidate", ""), + patch_sha256=row.get("patch_sha256", ""), + ) + ) + return records + + +def better(new_score: Optional[float], old_score: Optional[float], mode: str) -> bool: + if new_score is None: + return False + if old_score is None: + return True + return new_score > old_score if mode == "max" else new_score < old_score + + +def write_state( + path: Path, + results_path: Path, + records: List[RunRecord], + max_candidates: Optional[int], + *, + mode: str = "max", + stop_files: Optional[List[str]] = None, + plateau_threshold: int = DEFAULT_PLATEAU_THRESHOLD, + plateau_min_delta: float = DEFAULT_PLATEAU_MIN_DELTA, + hard_crash_threshold: int = DEFAULT_HARD_CRASH_THRESHOLD, + manual_stop: bool = False, +) -> Dict[str, Any]: + guard = load_campaign_guard() + attempts = len([r for r in records if r.status in {"keep", "discard", "crash"}]) + if manual_stop: + state = guard.guard_state( + results_path, + max_candidates=max_candidates, + stop_files=stop_files, + plateau_threshold=plateau_threshold, + min_delta=plateau_min_delta, + hard_crash_threshold=hard_crash_threshold, + mode=mode, + ) + state.update( + { + "candidate_attempts": attempts, + "decision": "stop", + "reason": "manual_interrupt", + "next_action": "final_report", + "final_response_allowed": True, + "agent_instruction": "Final report is allowed because the campaign was manually interrupted.", + } + ) + write_json(path, state) + return state + + if any(r.status == INFRASTRUCTURE_RETRY for r in records): + state = guard.guard_state( + results_path, + max_candidates=max_candidates, + stop_files=stop_files, + plateau_threshold=plateau_threshold, + min_delta=plateau_min_delta, + hard_crash_threshold=hard_crash_threshold, + mode=mode, + ) + state.update( + { + "candidate_attempts": attempts, + "decision": "retry_infrastructure", + "reason": 
"infrastructure_retry", + "next_action": "rerun_with_escalated_execution", + "final_response_allowed": False, + "agent_instruction": ( + "Do not produce a final answer. Rerun the same command with escalated execution or repaired " + "runtime permissions; infrastructure retries do not count against the candidate budget." + ), + } + ) + write_json(path, state) + return state + + state = guard.guard_state( + results_path, + max_candidates=max_candidates, + stop_files=stop_files, + plateau_threshold=plateau_threshold, + min_delta=plateau_min_delta, + hard_crash_threshold=hard_crash_threshold, + mode=mode, + ) + write_json(path, state) + return state + + +def load_progress_plotter(): + script_path = Path(__file__).with_name("plot_progress.py") + spec = importlib.util.spec_from_file_location("nvflare_autofl_plot_progress", script_path) + if spec is None or spec.loader is None: + raise RuntimeError(f"cannot load Auto-FL progress plotter from {script_path}") + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +def write_progress_fallback(path: Path, records: List[RunRecord], mode: str, metric_label: str) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + try: + from PIL import Image, ImageDraw, ImageFont + except ImportError: + path.write_text("Pillow is not installed; progress image unavailable.\n", encoding="utf-8") + return + + width, height = 1000, 560 + margin = 70 + scores = [r.score for r in records if r.score is not None] + image = Image.new("RGB", (width, height), "white") + draw = ImageDraw.Draw(image) + font = ImageFont.load_default() + draw.text( + (margin, 24), + f"Auto-FL Progress ({metric_label}): {len(records)} rows, {len(scores)} scored", + fill="black", + font=font, + ) + draw.line((margin, height - margin, width - margin, height - margin), fill=(80, 80, 80), width=2) + draw.line((margin, margin, margin, height - margin), fill=(80, 80, 80), width=2) + + if 
scores: + lo, hi = min(scores), max(scores) + if lo == hi: + lo -= 1.0 + hi += 1.0 + span = hi - lo + running_best: Optional[float] = None + last_point: Optional[Tuple[float, float]] = None + denom = max(1, len(records) - 1) + for idx, record in enumerate(records): + if record.score is None: + continue + x = margin + (width - 2 * margin) * idx / denom + y = height - margin - (height - 2 * margin) * (record.score - lo) / span + color = (40, 160, 90) if record.status in {"baseline", "keep"} else (150, 150, 150) + draw.ellipse((x - 5, y - 5, x + 5, y + 5), fill=color, outline="black") + draw.text((x + 6, y - 14), f"{record.name}: {record.score:.3f}", fill=color, font=font) + if better(record.score, running_best, mode): + running_best = record.score + if running_best == record.score: + if last_point: + draw.line((last_point[0], last_point[1], x, y), fill=(40, 160, 90), width=2) + last_point = (x, y) + image.save(path) + + +def write_progress(path: Path, records: List[RunRecord], mode: str, metric_label: str) -> None: + plotter = load_progress_plotter() + try: + plotter.plot_progress(records, path, mode, metric_label) + except (plotter.NoScoredResultsError, plotter.PlotDependencyError): + write_progress_fallback(path, records, mode, metric_label) + + +def write_report(path: Path, config: Dict[str, Any], records: List[RunRecord], args: argparse.Namespace) -> None: + path.parent.mkdir(parents=True, exist_ok=True) + best = None + for record in records: + if record.status in {"baseline", "keep"} and better(record.score, best.score if best else None, args.mode): + best = record + candidate_budget = ( + str(args.max_candidates) if args.max_candidates is not None else "uncapped; runs until manual interruption" + ) + objective = objective_contract(config, args.metric) + lines = [ + "# Auto-FL Report", + "", + f"Objective: optimize `{objective['optimization_metric']}` in `{args.target_env}`.", + f"Requested metric: `{objective['requested_metric']}`.", + f"Metric source: 
`{metric_source(config)}`.", + f"Metric extraction order: `{', '.join(objective['metric_extraction_order'])}`.", + f"Candidate budget: `{candidate_budget}`.", + f"Config: `{args.autofl_yaml}`.", + f"Fixed budget: `{json.dumps(config.get('budget', {}).get('fixed_training_budget', {}), sort_keys=True)}`.", + "", + "## Leaderboard", + "", + "| Status | Name | Score | Changed files | Manifest | Artifacts | Notes |", + "| --- | --- | ---: | --- | --- | --- | --- |", + ] + for record in records: + score = "" if record.score is None else f"{record.score:.6f}" + lines.append( + f"| {record.status} | {record.name} | {score} | `{record.changed_files}` | " + f"`{record.candidate_manifest}` | `{record.artifacts}` | {record.diff_summary} |" + ) + lines.extend( + [ + "", + "## Artifacts", + "", + f"- Ledger: `{args.results}`", + f"- Progress plot: `{args.progress}`", + f"- Campaign state: `{args.state}`", + "", + "## Outcome", + "", + ] + ) + if best: + lines.append(f"Best retained run: `{best.name}` with `{objective['optimization_metric']}={best.score:.6f}`.") + else: + lines.append("No scored run was retained.") + path.write_text("\n".join(lines) + "\n", encoding="utf-8") + + +def candidate_attempts(records: List[RunRecord]) -> int: + return len([r for r in records if r.status in {"keep", "discard", "crash"}]) + + +def campaign_summary( + autofl_yaml: Path, + results: Path, + state: Path, + progress: Path, + report: Path, + records: List[RunRecord], + state_payload: Optional[Dict[str, Any]] = None, +) -> Dict[str, Any]: + payload = { + "autofl_yaml": str(autofl_yaml.resolve()), + "results": str(results.resolve()), + "state": str(state.resolve()), + "progress": str(progress.resolve()), + "report": str(report.resolve()), + "candidate_attempts": candidate_attempts(records), + } + if state_payload: + for key in [ + "decision", + "reason", + "next_action", + "final_response_allowed", + "candidate_cap", + "candidate_cap_source", + "agent_instruction", + ]: + if key in state_payload: 
+ payload[key] = state_payload[key] + return payload + + +def campaign_paths(args: argparse.Namespace, job: Path) -> Dict[str, Path]: + workspace = job.parent + return { + "workspace": workspace, + "autofl_yaml": resolve_output_path(workspace, args.autofl_yaml), + "results": resolve_output_path(workspace, args.results), + "state": resolve_output_path(workspace, args.state), + "progress": resolve_output_path(workspace, args.progress), + "report": resolve_output_path(workspace, args.report), + "output_root": resolve_output_path(workspace, args.output_root), + "snapshot_root": workspace / BEST_SNAPSHOT_ROOT, + } + + +CAMPAIGN_SETTING_NAMES = ( + "metric", + "mode", + "target_env", + "max_candidates", + "autofl_yaml", + "results", + "state", + "progress", + "report", + "output_root", + "plateau_threshold", + "plateau_min_delta", + "hard_crash_threshold", + "base_args", + "timeout", + "simulator_no_progress_timeout", + "python", + "prefer_synthetic", + "synthetic_train_size", + "synthetic_test_size", +) + + +def campaign_settings(args: argparse.Namespace) -> Dict[str, Any]: + return {name: getattr(args, name) for name in CAMPAIGN_SETTING_NAMES} + + +def restore_campaign_settings(args: argparse.Namespace, metadata: Dict[str, Any]) -> None: + settings = metadata.get("settings") + if not isinstance(settings, dict): + raise ValueError("campaign metadata is missing settings") + for name in CAMPAIGN_SETTING_NAMES: + if name in settings: + setattr(args, name, settings[name]) + + +def campaign_timeout(args: argparse.Namespace, schema: Dict[str, Any]) -> Tuple[int, int]: + budget = comparison_budget(schema) + budget_timeout = budget.get("run_timeout_seconds") + timeout = max(args.timeout, int(budget_timeout)) if budget_timeout is not None else args.timeout + budget_no_progress_timeout = budget.get("simulator_no_progress_timeout_seconds") + no_progress_timeout = ( + int(budget_no_progress_timeout) + if budget_no_progress_timeout is not None + else 
args.simulator_no_progress_timeout + ) + return timeout, no_progress_timeout + + +def import_job_config( + args: argparse.Namespace, + job: Path, + output: Path, + log_path: Path, + timeout: int, +) -> Dict[str, Any]: + command = [ + args.python, + "-m", + "nvflare.app_common.autofl.job_importer", + str(job), + "--metric", + args.metric, + "--mode", + args.mode, + "--env", + args.target_env, + "--output", + str(output), + ] + if args.max_candidates is not None: + command.extend(["--max-candidates", str(args.max_candidates)]) + rc, text, _ = run(command, job.parent, timeout, log_path) + if rc != 0: + raise RuntimeError(text.strip() or f"deterministic import exited with status {rc}") + return read_yaml(output) + + +def best_retained_record(records: Sequence[RunRecord], mode: str) -> Optional[RunRecord]: + best = None + for record in records: + if record.status in {"baseline", "keep"} and better(record.score, best.score if best else None, mode): + best = record + return best + + +def refresh_campaign_artifacts( + args: argparse.Namespace, + paths: Dict[str, Path], + config: Dict[str, Any], + records: List[RunRecord], + metadata: Dict[str, Any], + *, + pending_manifest: Optional[Path] = None, + next_action: Optional[str] = None, + reason: Optional[str] = None, +) -> Dict[str, Any]: + write_results(paths["results"], records) + state = write_state( + paths["state"], + paths["results"], + records, + args.max_candidates, + mode=args.mode, + stop_files=resolve_stop_files(paths["workspace"], args.stop_file), + plateau_threshold=args.plateau_threshold, + plateau_min_delta=args.plateau_min_delta, + hard_crash_threshold=args.hard_crash_threshold, + ) + if not state.get("final_response_allowed") and next_action: + state["next_action"] = next_action + state["reason"] = reason or next_action + state["agent_instruction"] = f"Do not produce a final answer. Execute `{next_action}` for this campaign." 
+ # write_state() has already persisted the guard state; overlay the campaign + # provenance fields below and rewrite the state file so both stay in sync.
+ state.update( + { + "best_candidate": metadata.get("best_candidate"), + "best_source_sha256": metadata.get("best_source_sha256"), + "pending_candidate_manifest": str(pending_manifest.resolve()) if pending_manifest else None, + } + ) + write_json(paths["state"], state) + write_progress(paths["progress"], records, args.mode, optimization_metric(config, args.metric)) + write_report(paths["report"], config, records, args) + return state + + +def print_campaign_result( + paths: Dict[str, Path], records: List[RunRecord], state: Dict[str, Any], **extra: Any +) -> None: + payload = campaign_summary( + paths["autofl_yaml"], + paths["results"], + paths["state"], + paths["progress"], + paths["report"], + records, + state, + ) + payload.update(extra) + print(json.dumps(payload, indent=2, sort_keys=True)) + + +def execute_sim_baseline( + args: argparse.Namespace, + job: Path, + paths: Dict[str, Path], + config: Dict[str, Any], + schema: Dict[str, Any], +) -> RunRecord: + timeout, no_progress_timeout = campaign_timeout(args, schema) + help_text = job_help(args.python, job, job.parent) + return run_job( + JobRun(name="baseline", args=[], description="baseline", status="baseline"), + python=args.python, + job=job, + cwd=job.parent, + help_text=help_text, + fixed_args=build_fixed_args(config, help_text, schema), + base_args=build_base_args(args, help_text, schema), + output_root=paths["output_root"], + timeout=timeout, + simulator_no_progress_timeout=no_progress_timeout, + metrics=metric_extraction_order(config, args.metric), + config=config, + ) + + +def initialize_campaign(args: argparse.Namespace, job: Path) -> int: + workspace = job.parent + metadata_path = campaign_metadata_path(workspace) + if metadata_path.exists(): + metadata = load_campaign_metadata(workspace, job) + restore_campaign_settings(args, metadata) + paths = campaign_paths(args, job) + records = load_results(paths["results"]) + if args.target_env == "sim" and not any( + record.status == "baseline" and 
record.score is not None for record in records + ): + config = read_yaml(paths["autofl_yaml"]) + baseline = execute_sim_baseline(args, job, paths, config, load_mutation_schema(workspace)) + records = [baseline] + metadata["best_score"] = baseline.score + metadata["updated_at"] = utc_now() + write_json(metadata_path, metadata) + next_action = "propose_candidate" if baseline.score is not None else "repair_baseline" + state = refresh_campaign_artifacts( + args, paths, config, records, metadata, next_action=next_action, reason="baseline_retried" + ) + print_campaign_result(paths, records, state, initialized=False, baseline_retried=True) + if baseline.status == INFRASTRUCTURE_RETRY: + return 75 + return 0 if baseline.score is not None else 1 + print_campaign_result(paths, records, read_json(paths["state"]), initialized=False) + return 0 + + paths = campaign_paths(args, job) + paths["output_root"].mkdir(parents=True, exist_ok=True) + schema = load_mutation_schema(workspace) + timeout, _ = campaign_timeout(args, schema) + config = import_job_config(args, job, paths["autofl_yaml"], paths["output_root"] / "import.log", timeout) + config = apply_metric_contract(config, args.metric, schema) + config.setdefault("job", {})["allowed_create_patterns"] = list(ALLOWED_CREATE_PATTERNS) + config.setdefault("trust_contract", {})["allowed_create_patterns"] = list(ALLOWED_CREATE_PATTERNS) + write_yaml(paths["autofl_yaml"], config) + snapshot_files = create_best_snapshot(workspace, config, paths["snapshot_root"]) + metadata = { + "schema_version": CAMPAIGN_METADATA_SCHEMA_VERSION, + "created_at": utc_now(), + "updated_at": utc_now(), + "job": str(job.resolve()), + "workspace_root": str(workspace.resolve()), + "settings": campaign_settings(args), + "best_candidate": "baseline", + "best_score": None, + "best_source_sha256": source_hash(snapshot_files), + "fixed_budget_sha256": fixed_budget_hash(config), + } + write_json(metadata_path, metadata) + + records: List[RunRecord] = [] + if 
args.target_env == "sim": + baseline = execute_sim_baseline(args, job, paths, config, schema) + records.append(baseline) + metadata["best_score"] = baseline.score + metadata["updated_at"] = utc_now() + write_json(metadata_path, metadata) + next_action = "propose_candidate" if baseline.score is not None else "repair_baseline" + else: + next_action = "submit_baseline" + + state = refresh_campaign_artifacts( + args, paths, config, records, metadata, next_action=next_action, reason="campaign_initialized" + ) + print_campaign_result(paths, records, state, initialized=True) + if records and records[0].status == INFRASTRUCTURE_RETRY: + return 75 + if args.target_env == "sim" and (not records or records[0].score is None): + return 1 + return 0 + + +def pending_candidate_manifests(workspace: Path) -> List[Path]: + root = workspace / CANDIDATE_ROOT + pending = [] + if not root.exists(): + return pending + for path in sorted(root.glob("*/candidate_manifest.json")): + try: + status = read_json(path).get("status") + except ValueError: + continue + if status in {"prepared", "ready_for_external_execution"}: + pending.append(path) + return pending + + +def prepare_candidate(args: argparse.Namespace, job: Path) -> int: + workspace = job.parent + metadata = load_campaign_metadata(workspace, job) + restore_campaign_settings(args, metadata) + paths = campaign_paths(args, job) + records = load_results(paths["results"]) + if not any(record.status == "baseline" and record.score is not None for record in records): + raise ValueError("a scored baseline is required before preparing candidates") + state = read_json(paths["state"]) + if state.get("final_response_allowed"): + raise ValueError(f"campaign is already final: {state.get('reason')}") + pending = pending_candidate_manifests(workspace) + if pending: + raise ValueError(f"campaign already has a pending candidate: {pending[0]}") + candidate_id = validate_candidate_id(args.name or "") + if not args.hypothesis: + raise 
ValueError("--hypothesis is required for prepare")
+    manifest_path = candidate_manifest_path(workspace, candidate_id)
+    candidate_dir = manifest_path.parent
+    if candidate_dir.exists():
+        raise ValueError(f"candidate already exists: {candidate_id}")
+
+    best_source, best_files = load_best_snapshot(paths["snapshot_root"])
+    if source_hash(best_files) != metadata.get("best_source_sha256"):
+        raise ValueError("campaign best-source hash is stale")
+    if not workspace_matches_snapshot(workspace, best_source, best_files):
+        raise ValueError("job workspace differs from the recorded best candidate; reconcile edits before preparing")
+
+    draft_source = candidate_dir / "source"
+    shutil.copytree(best_source, draft_source)
+    manifest = {
+        "schema_version": CANDIDATE_MANIFEST_SCHEMA_VERSION,
+        "candidate_id": candidate_id,
+        "name": candidate_id,
+        "hypothesis": args.hypothesis,
+        "status": "prepared",
+        "created_at": utc_now(),
+        "updated_at": utc_now(),
+        "workspace_root": str(workspace.resolve()),
+        "base_candidate": metadata.get("best_candidate"),
+        "base_source_sha256": source_hash(best_files),
+        "fixed_budget_sha256": metadata.get("fixed_budget_sha256"),
+        "objective": {"metric": args.metric, "mode": args.mode},
+        "environment": args.target_env,
+        "run_args": shlex.split(args.run_args),
+        "changed_files": [],
+        "patch_sha256": "",
+        "candidate_source_sha256": source_hash(best_files),
+        "provenance": {
+            "job": str(job.resolve()),
+            "autofl_yaml": str(paths["autofl_yaml"].resolve()),
+            "import_source_sha256": read_yaml(paths["autofl_yaml"]).get("import", {}).get("source_sha256"),
+        },
+        "artifacts": {},
+        "result": {},
+    }
+    write_json(manifest_path, manifest)
+    config = read_yaml(paths["autofl_yaml"])
+    state = refresh_campaign_artifacts(
+        args,
+        paths,
+        config,
+        records,
+        metadata,
+        pending_manifest=manifest_path,
+        next_action="edit_candidate",
+        reason="candidate_prepared",
+    )
+    print_campaign_result(
+        paths,
+        records,
+        state,
+        candidate_manifest=str(manifest_path.resolve()),
+        candidate_source=str(draft_source.resolve()),
+    )
+    return 0
+
+
+def validate_candidate_for_evaluation(
+    args: argparse.Namespace,
+    job: Path,
+    metadata: Dict[str, Any],
+    paths: Dict[str, Path],
+    manifest_path: Path,
+) -> Tuple[Dict[str, Any], Dict[str, Any], Path, Dict[str, str], List[str], List[str], str]:
+    manifest = load_candidate_manifest(manifest_path)
+    if Path(str(manifest.get("workspace_root") or "")).resolve() != job.parent.resolve():
+        raise ValueError("candidate manifest belongs to a different job workspace")
+    config = read_yaml(paths["autofl_yaml"])
+    best_source, best_files = load_best_snapshot(paths["snapshot_root"])
+    best_hash = source_hash(best_files)
+    if manifest.get("base_source_sha256") != best_hash or metadata.get("best_source_sha256") != best_hash:
+        raise ValueError("candidate was prepared from a stale best candidate")
+    if manifest.get("fixed_budget_sha256") != metadata.get("fixed_budget_sha256"):
+        raise ValueError("candidate fixed-budget provenance is stale")
+    if not workspace_matches_snapshot(job.parent, best_source, best_files):
+        raise ValueError("job workspace differs from the recorded best candidate")
+    draft_source = manifest_path.parent / "source"
+    changed, created = candidate_changes(job.parent, config, best_source, best_files, draft_source)
+    run_args = manifest.get("run_args")
+    if not isinstance(run_args, list) or not all(isinstance(item, str) for item in run_args):
+        raise ValueError("candidate manifest run_args must be a list of strings")
+    allowed, reason = candidate_args_allowed(run_args, load_mutation_schema(job.parent))
+    if not allowed:
+        raise ValueError(reason)
+    allowed, reason = candidate_preserves_fixed_args(run_args, config, load_mutation_schema(job.parent))
+    if not allowed:
+        raise ValueError(reason)
+    if not changed and not run_args:
+        raise ValueError("candidate has no source changes or run arguments")
+    patch = render_candidate_patch(best_source, draft_source, changed)
+    return manifest, config, best_source, best_files, changed, created, patch
+
+
+def update_config_for_kept_sources(config: Dict[str, Any], created: Sequence[str]) -> None:
+    if not created:
+        return
+    job_paths = config.setdefault("job", {}).setdefault("allowed_edit_paths", [])
+    trust_paths = config.setdefault("trust_contract", {}).setdefault("allowed_edit_paths", [])
+    for relative in created:
+        if relative not in job_paths:
+            job_paths.append(relative)
+        if relative not in trust_paths:
+            trust_paths.append(relative)
+
+
+def candidate_campaign_config(
+    candidate_config: Dict[str, Any], current_config: Dict[str, Any], args: argparse.Namespace, schema: Dict[str, Any]
+) -> Dict[str, Any]:
+    candidate_config = apply_metric_contract(candidate_config, args.metric, schema)
+    current_paths = current_config.get("job", {}).get("allowed_edit_paths", []) or []
+    candidate_paths = candidate_config.setdefault("job", {}).setdefault("allowed_edit_paths", [])
+    trust_paths = candidate_config.setdefault("trust_contract", {}).setdefault("allowed_edit_paths", [])
+    for path in current_paths:
+        if path not in candidate_paths:
+            candidate_paths.append(path)
+        if path not in trust_paths:
+            trust_paths.append(path)
+    create_patterns = current_config.get("job", {}).get("allowed_create_patterns", ALLOWED_CREATE_PATTERNS)
+    candidate_config["job"]["allowed_create_patterns"] = list(create_patterns)
+    candidate_config["trust_contract"]["allowed_create_patterns"] = list(create_patterns)
+    return candidate_config
+
+
+def finalize_candidate_result(
+    args: argparse.Namespace,
+    job: Path,
+    metadata: Dict[str, Any],
+    paths: Dict[str, Path],
+    config: Dict[str, Any],
+    manifest_path: Path,
+    manifest: Dict[str, Any],
+    best_source: Path,
+    best_files: Dict[str, str],
+    changed: List[str],
+    created: List[str],
+    patch: str,
+    record: RunRecord,
+) -> Tuple[List[RunRecord], Dict[str, Any]]:
+    rollback_files: Dict[Path, Optional[bytes]] = {}
+    staged_snapshot = None
+    previous_snapshot = None
+    try:
+        records = load_results(paths["results"])
+        previous_best = best_retained_record(records, args.mode)
+        if record.status == "candidate":
+            record.status = (
+                "keep" if better(record.score, previous_best.score if previous_best else None, args.mode) else "discard"
+            )
+        patch_path = manifest_path.parent / "candidate.patch"
+        rollback_files = capture_file_versions(
+            [
+                paths["autofl_yaml"],
+                manifest_path,
+                campaign_metadata_path(job.parent),
+                paths["results"],
+                paths["state"],
+                paths["progress"],
+                paths["report"],
+                patch_path,
+            ]
+        )
+        patch_path.write_text(patch, encoding="utf-8")
+        patch_sha256 = sha256_bytes(patch.encode("utf-8"))
+        record.changed_files = ",".join(changed) if changed else "none"
+        record.diff_summary = str(manifest.get("hypothesis") or "candidate")
+        record.candidate_manifest = str(manifest_path.resolve())
+        record.base_candidate = str(manifest.get("base_candidate") or "")
+        record.patch_sha256 = patch_sha256
+
+        if record.status == "keep":
+            update_config_for_kept_sources(config, created)
+            staged_snapshot, snapshot_files = stage_best_snapshot(job.parent, config, paths["snapshot_root"])
+            write_yaml(paths["autofl_yaml"], config)
+            previous_snapshot = activate_best_snapshot(paths["snapshot_root"], staged_snapshot)
+            staged_snapshot = None
+            metadata.update(
+                {
+                    "best_candidate": record.name,
+                    "best_score": record.score,
+                    "best_source_sha256": source_hash(snapshot_files),
+                    "updated_at": utc_now(),
+                }
+            )
+        else:
+            restore_best_source(job.parent, best_source, best_files, changed)
+
+        manifest.update(
+            {
+                "status": record.status,
+                "updated_at": utc_now(),
+                "changed_files": changed,
+                "created_files": created,
+                "patch_sha256": patch_sha256,
+                "artifacts": {"patch": str(patch_path.resolve()), "run": record.artifacts},
+                "result": {
+                    "score": record.score,
+                    "runtime_seconds": record.runtime_seconds,
+                    "run_command": record.run_command,
+                    "failure_reason": record.failure_reason,
+                },
+            }
+        )
+        write_json(manifest_path, manifest)
+        write_json(campaign_metadata_path(job.parent), metadata)
+        records.append(record)
+        state = refresh_campaign_artifacts(args, paths, config, records, metadata)
+    except BaseException as error:
+        try:
+            if previous_snapshot is not None:
+                rollback_best_snapshot(paths["snapshot_root"], previous_snapshot)
+                previous_snapshot = None
+            restore_best_source(job.parent, paths["snapshot_root"] / "source", best_files, changed)
+            if rollback_files:
+                restore_file_versions(rollback_files)
+        except BaseException as rollback_error:
+            raise RuntimeError(
+                f"candidate finalization failed ({error}); automatic workspace rollback also failed ({rollback_error})"
+            ) from rollback_error
+        raise
+    finally:
+        if staged_snapshot is not None:
+            shutil.rmtree(staged_snapshot, ignore_errors=True)
+
+    if previous_snapshot is not None:
+        shutil.rmtree(previous_snapshot, ignore_errors=True)
+    return records, state
+
+
+def evaluate_candidate(args: argparse.Namespace, job: Path) -> int:
+    workspace = job.parent
+    metadata = load_campaign_metadata(workspace, job)
+    restore_campaign_settings(args, metadata)
+    paths = campaign_paths(args, job)
+    manifest_path = Path(args.manifest).resolve() if args.manifest else None
+    if manifest_path is None:
+        pending = pending_candidate_manifests(workspace)
+        if len(pending) != 1:
+            raise ValueError("--manifest is required when there is not exactly one pending candidate")
+        manifest_path = pending[0]
+    manifest, config, best_source, best_files, changed, created, patch = validate_candidate_for_evaluation(
+        args, job, metadata, paths, manifest_path
+    )
+    patch_path = manifest_path.parent / "candidate.patch"
+    patch_path.write_text(patch, encoding="utf-8")
+    manifest.update(
+        {
+            "updated_at": utc_now(),
+            "changed_files": changed,
+            "created_files": created,
+            "patch_sha256": sha256_bytes(patch.encode("utf-8")),
+            "candidate_source_sha256": source_hash(file_map(manifest_path.parent / "source")),
+        }
+    )
+    write_json(manifest_path, manifest)
+    schema = load_mutation_schema(workspace)
+    timeout, no_progress_timeout = campaign_timeout(args, schema)
+    candidate_config_path = manifest_path.parent / "candidate_autofl.yaml"
+    try:
+        apply_candidate_source(workspace, manifest_path.parent / "source", changed)
+        candidate_config = import_job_config(
+            args,
+            job,
+            candidate_config_path,
+            manifest_path.parent / "import.log",
+            timeout,
+        )
+        if fixed_budget_hash(candidate_config) != metadata.get("fixed_budget_sha256"):
+            raise ValueError("candidate changes budget.fixed_training_budget")
+        candidate_config = candidate_campaign_config(candidate_config, config, args, schema)
+    except Exception:
+        restore_best_source(workspace, best_source, best_files, changed)
+        raise
+
+    if args.target_env != "sim":
+        manifest["status"] = "ready_for_external_execution"
+        manifest["updated_at"] = utc_now()
+        write_json(manifest_path, manifest)
+        records = load_results(paths["results"])
+        state = refresh_campaign_artifacts(
+            args,
+            paths,
+            config,
+            records,
+            metadata,
+            pending_manifest=manifest_path,
+            next_action="submit_candidate",
+            reason="candidate_validated",
+        )
+        print_campaign_result(
+            paths,
+            records,
+            state,
+            candidate_manifest=str(manifest_path.resolve()),
+            job=str(job.resolve()),
+        )
+        return 0
+
+    try:
+        help_text = job_help(args.python, job, workspace)
+        run_record = run_job(
+            JobRun(
+                name=str(manifest["candidate_id"]),
+                args=list(manifest.get("run_args") or []),
+                description=str(manifest.get("hypothesis") or "candidate"),
+            ),
+            python=args.python,
+            job=job,
+            cwd=workspace,
+            help_text=help_text,
+            fixed_args=build_fixed_args(config, help_text, schema),
+            base_args=build_base_args(args, help_text, schema),
+            output_root=paths["output_root"],
+            timeout=timeout,
+            simulator_no_progress_timeout=no_progress_timeout,
+            metrics=metric_extraction_order(config, args.metric),
+            config=config,
+        )
+    except BaseException:
+        restore_best_source(workspace, best_source, best_files, changed)
+        raise
+    if run_record.status == INFRASTRUCTURE_RETRY:
+        restore_best_source(workspace, best_source, best_files, changed)
+        manifest["status"] = "prepared"
+        manifest["result"] = {"failure_reason": run_record.failure_reason}
+        manifest["updated_at"] = utc_now()
+        write_json(manifest_path, manifest)
+        records = load_results(paths["results"])
+        state = refresh_campaign_artifacts(
+            args,
+            paths,
+            config,
+            records,
+            metadata,
+            pending_manifest=manifest_path,
+            next_action="rerun_with_escalated_execution",
+            reason="infrastructure_retry",
+        )
+        print_campaign_result(paths, records, state, candidate_manifest=str(manifest_path.resolve()))
+        return 75
+
+    records, state = finalize_candidate_result(
+        args,
+        job,
+        metadata,
+        paths,
+        candidate_config,
+        manifest_path,
+        manifest,
+        best_source,
+        best_files,
+        changed,
+        created,
+        patch,
+        run_record,
+    )
+    print_campaign_result(paths, records, state, candidate_manifest=str(manifest_path.resolve()))
+    return 0
+
+
+def abandon_candidate(args: argparse.Namespace, job: Path) -> int:
+    workspace = job.parent
+    metadata = load_campaign_metadata(workspace, job)
+    restore_campaign_settings(args, metadata)
+    paths = campaign_paths(args, job)
+    manifest_path = Path(args.manifest).resolve() if args.manifest else None
+    if manifest_path is None:
+        pending = pending_candidate_manifests(workspace)
+        if len(pending) != 1:
+            raise ValueError("--manifest is required when there is not exactly one pending candidate")
+        manifest_path = pending[0]
+    manifest = load_candidate_manifest(manifest_path)
+    best_source, best_files = load_best_snapshot(paths["snapshot_root"])
+    changed = manifest.get("changed_files") or []
+    if not isinstance(changed, list):
+        raise ValueError("candidate manifest changed_files must be a list")
+    restore_best_source(workspace, best_source, best_files, changed)
+    manifest["status"] = "abandoned"
+    manifest["updated_at"] = utc_now()
+    write_json(manifest_path, manifest)
+    records = load_results(paths["results"])
+    config = read_yaml(paths["autofl_yaml"])
+    state = refresh_campaign_artifacts(
+        args,
+        paths,
+        config,
+        records,
+        metadata,
+        next_action="propose_candidate",
+        reason="candidate_abandoned",
+    )
+    print_campaign_result(paths, records, state, candidate_manifest=str(manifest_path.resolve()))
+    return 0
+
+
+def suggest_candidates(args: argparse.Namespace, job: Path) -> int:
+    metadata = load_campaign_metadata(job.parent, job)
+    restore_campaign_settings(args, metadata)
+    if args.limit < 1:
+        raise ValueError("--limit must be positive")
+    config = read_yaml(campaign_paths(args, job)["autofl_yaml"])
+    help_text = job_help(args.python, job, job.parent)
+    suggestions = [
+        {"name": candidate.name, "run_args": candidate.args, "hypothesis": candidate.description}
+        for candidate in candidate_plan(
+            config,
+            help_text,
+            args.limit,
+            load_mutation_schema(job.parent),
+        )
+    ]
+    print(json.dumps({"suggestions": suggestions}, indent=2, sort_keys=True))
+    return 0
+
+
+def record_external_result(args: argparse.Namespace, job: Path) -> int:
+    workspace = job.parent
+    metadata = load_campaign_metadata(workspace, job)
+    restore_campaign_settings(args, metadata)
+    paths = campaign_paths(args, job)
+    config = read_yaml(paths["autofl_yaml"])
+    artifact_path = Path(args.external_artifacts).resolve() if args.external_artifacts else None
+    score = args.score
+    if score is None and artifact_path:
+        score = extract_score(artifact_path, metric_extraction_order(config, args.metric))
+    if args.literature:
+        records = load_results(paths["results"])
+        name = f"literature_review_{sum(record.status == 'literature' for record in records) + 1}"
+        records.append(
+            RunRecord(
+                status="literature",
+                name=name,
+                score=None,
+                runtime_seconds=0.0,
+                changed_files="none",
+                diff_summary=args.hypothesis or "source-backed literature review",
+                run_command="agent literature review",
+                artifacts=str(artifact_path or ""),
+            )
+        )
+        state = refresh_campaign_artifacts(
+            args,
+            paths,
+            config,
+            records,
+            metadata,
+            next_action="propose_candidate",
+            reason="literature_review_recorded",
+        )
+        print_campaign_result(paths, records, state, literature_event=name)
+        return 0
+    if args.baseline:
+        records = load_results(paths["results"])
+        if records:
+            raise ValueError("external baseline can only be recorded before campaign candidates")
+        if score is None:
+            raise ValueError("a score or extractable --artifacts path is required for the baseline")
+        record = RunRecord(
+            status="baseline",
+            name="baseline",
+            score=score,
+            runtime_seconds=0.0,
+            changed_files="none",
+            diff_summary="external baseline",
+            run_command=f"nvflare job id={args.job_id or 'unreported'}",
+            artifacts=str(artifact_path or ""),
+        )
+        records.append(record)
+        metadata["best_score"] = score
+        metadata["updated_at"] = utc_now()
+        write_json(campaign_metadata_path(workspace), metadata)
+        state = refresh_campaign_artifacts(
+            args,
+            paths,
+            config,
+            records,
+            metadata,
+            next_action="propose_candidate",
+            reason="baseline_recorded",
+        )
+        print_campaign_result(paths, records, state, job_id=args.job_id)
+        return 0
+
+    manifest_path = Path(args.manifest).resolve() if args.manifest else None
+    if manifest_path is None:
+        pending = pending_candidate_manifests(workspace)
+        if len(pending) != 1:
+            raise ValueError("--manifest is required when there is not exactly one pending candidate")
+        manifest_path = pending[0]
+    manifest = load_candidate_manifest(manifest_path)
+    if manifest.get("status") != "ready_for_external_execution":
+        raise ValueError("external candidate must be validated before its result is recorded")
+    draft_source = manifest_path.parent / "source"
+    draft_files = file_map(draft_source)
+    if source_hash(draft_files) != manifest.get("candidate_source_sha256") or not workspace_matches_snapshot(
+        workspace, draft_source, draft_files
+    ):
+        raise ValueError("materialized candidate source changed after validation")
+    best_source, best_files = load_best_snapshot(paths["snapshot_root"])
+    changed = manifest.get("changed_files") or []
+    created = manifest.get("created_files") or []
+    if not isinstance(changed, list) or not isinstance(created, list):
+        raise ValueError("candidate manifest source lists must be arrays")
+    patch_path = manifest_path.parent / "candidate.patch"
+    patch = patch_path.read_text(encoding="utf-8") if patch_path.exists() else ""
+    candidate_config_path = manifest_path.parent / "candidate_autofl.yaml"
+    candidate_config = read_yaml(candidate_config_path) if candidate_config_path.exists() else config
+    candidate_config = candidate_campaign_config(candidate_config, config, args, load_mutation_schema(workspace))
+    status = "crash" if args.failure_reason or score is None else "candidate"
+    record = RunRecord(
+        status=status,
+        name=str(manifest["candidate_id"]),
+        score=score,
+        runtime_seconds=0.0,
+        changed_files="none",
+        diff_summary=str(manifest.get("hypothesis") or "candidate"),
+        run_command=f"nvflare job id={args.job_id or 'unreported'}",
+        artifacts=str(artifact_path or ""),
+        failure_reason=args.failure_reason or ("metric not found" if score is None else ""),
+    )
+    records, state = finalize_candidate_result(
+        args,
+        job,
+        metadata,
+        paths,
+        candidate_config,
+        manifest_path,
+        manifest,
+        best_source,
+        best_files,
+        changed,
+        created,
+        patch,
+        record,
+    )
+    updated_manifest = read_json(manifest_path)
+    updated_manifest.setdefault("artifacts", {})["job_id"] = args.job_id
+    write_json(manifest_path, updated_manifest)
+    print_campaign_result(paths, records, state, candidate_manifest=str(manifest_path.resolve()), job_id=args.job_id)
+    return 0
+
+
+def show_campaign_status(args: argparse.Namespace, job: Path) -> int:
+    metadata = load_campaign_metadata(job.parent, job)
+    restore_campaign_settings(args, metadata)
+    paths = campaign_paths(args, job)
+    records = load_results(paths["results"])
+    print_campaign_result(paths, records, read_json(paths["state"]), campaign=metadata)
+    return 0
+
+
+def validate_args(args: argparse.Namespace) -> None:
+    if args.max_candidates is not None and args.max_candidates < 1:
+        raise ValueError("--max-candidates must be positive when provided")
+    if args.plateau_threshold < 1:
+        raise ValueError("--plateau-threshold must be positive")
+    if args.plateau_min_delta < 0:
+        raise ValueError("--plateau-min-delta must be non-negative")
+    if args.hard_crash_threshold < 0:
+        raise ValueError("--hard-crash-threshold must be non-negative")
+
+
+def main(argv: Optional[Sequence[str]] = None) -> int:
+    args = parse_args(argv)
+    try:
+        validate_args(args)
+        job = Path(args.job).resolve()
+        if not job.is_file():
+            raise ValueError(f"job.py does not exist: {job}")
+        actions = {
+            "initialize": initialize_campaign,
+            "prepare": prepare_candidate,
+            "evaluate": evaluate_candidate,
+            "abandon": abandon_candidate,
+            "suggest": suggest_candidates,
+            "record": record_external_result,
+            "status": show_campaign_status,
+        }
+        return actions[args.action](args, job)
+    except (OSError, RuntimeError, ValueError) as e:
+        print(f"Auto-FL {args.action} failed: {e}", file=sys.stderr)
+        return 2
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/skills/nvflare-autofl/tests/helper_scripts.md b/skills/nvflare-autofl/tests/helper_scripts.md
new file mode 100644
index 0000000000..f7a33e6e37
--- /dev/null
+++ b/skills/nvflare-autofl/tests/helper_scripts.md
@@ -0,0 +1,8 @@
+# Helper Script Coverage
+
+The bundled campaign helpers are covered by focused repository tests:
+
+- `scripts/campaign_guard.py`:
+  `tests/unit_test/tool/autofl_skill_campaign_guard_test.py`
+- `scripts/run_job_campaign.py`:
+  `tests/unit_test/tool/autofl_skill_runner_test.py`
diff --git a/tests/unit_test/app_common/autofl/__init__.py b/tests/unit_test/app_common/autofl/__init__.py
new file mode 100644
index 0000000000..4fc25d0d3c
--- /dev/null
+++ b/tests/unit_test/app_common/autofl/__init__.py
@@ -0,0 +1,13 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/tests/unit_test/app_common/autofl/job_importer_test.py b/tests/unit_test/app_common/autofl/job_importer_test.py
new file mode 100644
index 0000000000..d61a179ee9
--- /dev/null
+++ b/tests/unit_test/app_common/autofl/job_importer_test.py
@@ -0,0 +1,376 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import yaml
+
+from nvflare.app_common.autofl import (
+    AUTOFL_CONFIG_SCHEMA_VERSION,
+    DeterministicJobImporter,
+    dump_autofl_yaml,
+    import_job_to_autofl_config,
+    job_importer,
+)
+
+
+def _objective(metric, source="user_request"):
+    return {
+        "metric": metric,
+        "requested_metric": metric,
+        "optimization_metric": metric,
+        "metric_extraction_order": [metric],
+        "mode": "max",
+        "source": source,
+    }
+
+
+def _write_recipe_job(root):
+    (root / "model.py").write_text(
+        """
+class SimpleNetwork:
+    pass
+""",
+        encoding="utf-8",
+    )
+    (root / "client.py").write_text(
+        """
+import argparse
+
+
+def build_parser():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--lr", type=float, default=0.01)
+    parser.add_argument("--batch_size", type=int, default=64)
+    parser.add_argument("--weight_decay", type=float, default=0.001)
+    return parser
+""",
+        encoding="utf-8",
+    )
+    (root / "job.py").write_text(
+        """
+import argparse
+
+from model import SimpleNetwork
+from nvflare.app_opt.pt.recipes.fedavg import FedAvgRecipe
+from nvflare.recipe import SimEnv
+
+
+def define_parser():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--n_clients", type=int, default=3)
+    parser.add_argument("--num_rounds", type=int, default=5)
+    parser.add_argument("--train_script", type=str, default="client.py")
+    parser.add_argument("--key_metric", type=str, default="accuracy")
+    return parser.parse_args()
+
+
+def main():
+    args = define_parser()
+    recipe = FedAvgRecipe(
+        name="demo",
+        min_clients=args.n_clients,
+        num_rounds=args.num_rounds,
+        model=SimpleNetwork(),
+        train_script=args.train_script,
+        key_metric=args.key_metric,
+    )
+    env = SimEnv(num_clients=args.n_clients)
+    recipe.execute(env)
+
+
+if __name__ == "__main__":
+    main()
+""",
+        encoding="utf-8",
+    )
+    return root / "job.py"
+
+
+def test_import_recipe_job_extracts_trust_contract_without_executing_code(tmp_path):
+    job_path = _write_recipe_job(tmp_path)
+
+    config = import_job_to_autofl_config(
+        str(job_path),
+        workspace_root=str(tmp_path),
+        metric="AUC",
+        target_env="prod",
+        max_candidates=8,
+    )
+
+    assert config["schema_version"] == AUTOFL_CONFIG_SCHEMA_VERSION
+    assert config["import"]["support"]["patterns"] == ["recipe:FedAvgRecipe", "env:SimEnv"]
+    assert config["import"]["confidence"] == "high"
+    assert config["job"]["surface"] == "recipe"
+    assert config["job"]["recipe"] == "FedAvgRecipe"
+    assert config["job"]["train_script"] == "client.py"
+    assert config["objective"] == _objective("AUC")
+    assert config["budget"]["max_candidates"] == 8
+    assert config["budget"]["fixed_training_budget"] == {
+        "num_rounds": 5,
+        "min_clients": 3,
+        "num_clients": 3,
+    }
+    assert config["environment"]["requested"] == "prod"
+    assert config["environment"]["profiles"]["sim"] == {"num_clients": 3}
+    assert config["search_space"]["suggested"]["lr"]["default"] == 0.01
+    assert config["search_space"]["suggested"]["batch_size"]["type"] == "int"
+    assert config["trust_contract"]["allowed_edit_paths"] == ["job.py", "client.py", "model.py"]
+    assert config["job"]["allowed_create_patterns"] == ["**/*.py"]
+    assert config["trust_contract"]["allowed_create_patterns"] == ["**/*.py"]
+    assert config["trust_contract"]["agent_controls"]["must_not_edit_outside_allowed_paths"] is True
+    assert config["unresolved"] == []
+
+
+def test_import_is_repeatable_and_yaml_round_trips(tmp_path):
+    job_path = _write_recipe_job(tmp_path)
+    importer = DeterministicJobImporter(workspace_root=str(tmp_path))
+
+    first = importer.import_job(str(job_path), max_candidates=4)
+    second = importer.import_job(str(job_path), max_candidates=4)
+    yaml_text = dump_autofl_yaml(first)
+
+    assert first == second
+    assert yaml.safe_load(yaml_text) == first
+    assert "&id" not in yaml_text
+    assert first["trust_contract"]["unresolved"] is not first["unresolved"]
+
+
+def test_import_marks_dynamic_argparse_defaults_unresolved(tmp_path):
+    (tmp_path / "client.py").write_text(
+        """
+import argparse
+
+
+def build_parser():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--model_arch", type=str, default=DEFAULT_MODEL_ARCH)
+    return parser
+""",
+        encoding="utf-8",
+    )
+    job_path = tmp_path / "job.py"
+    job_path.write_text(
+        """
+import argparse
+
+from nvflare.app_opt.pt.recipes.fedavg import FedAvgRecipe
+from nvflare.recipe import SimEnv
+
+
+def define_parser():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--n_clients", type=int, default=2)
+    parser.add_argument("--num_rounds", type=int, default=3)
+    parser.add_argument("--train_script", type=str, default="client.py")
+    return parser.parse_args()
+
+
+def main():
+    args = define_parser()
+    recipe = FedAvgRecipe(
+        name="demo",
+        min_clients=args.n_clients,
+        num_rounds=args.num_rounds,
+        train_script=args.train_script,
+    )
+    recipe.execute(SimEnv(num_clients=args.n_clients))
+""",
+        encoding="utf-8",
+    )
+
+    config = import_job_to_autofl_config(str(job_path), workspace_root=str(tmp_path))
+
+    model_arch = config["search_space"]["suggested"]["model_arch"]
+    assert model_arch["default"] == "DEFAULT_MODEL_ARCH"
+    assert model_arch["confidence"] == "low"
+    assert model_arch["unresolved"] is True
+    assert config["import"]["confidence"] == "medium"
+    assert {
+        "field": "search_space.suggested.model_arch.default",
+        "reason": "default is dynamic expression: DEFAULT_MODEL_ARCH",
+    } in config["unresolved"]
+
+
+def test_import_marks_dynamic_train_script_unresolved_without_client_fallback(tmp_path):
+    (tmp_path / "client.py").write_text(
+        """
+import argparse
+
+
+def build_parser():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--lr", type=float, default=0.01)
+    return parser
+""",
+        encoding="utf-8",
+    )
+    job_path = tmp_path / "job.py"
+    job_path.write_text(
+        """
+from nvflare.app_opt.pt.recipes.fedavg import FedAvgRecipe
+from nvflare.recipe import SimEnv
+
+
+def get_script():
+    return "client.py"
+
+
+def main():
+    recipe = FedAvgRecipe(
+        name="demo",
+        min_clients=2,
+        num_rounds=3,
+        train_script=get_script(),
+    )
+    recipe.execute(SimEnv(num_clients=2))
+""",
+        encoding="utf-8",
+    )
+
+    config = import_job_to_autofl_config(str(job_path), workspace_root=str(tmp_path))
+
+    assert "train_script" not in config["job"]
+    assert "client.py" not in config["trust_contract"]["allowed_edit_paths"]
+    assert {"field": "job.train_script", "reason": "no train_script was found or resolved"} in config["unresolved"]
+
+
+def test_import_marks_imported_budget_and_metric_constants_unresolved(tmp_path):
+    (tmp_path / "client.py").write_text(
+        """
+def train():
+    pass
+""",
+        encoding="utf-8",
+    )
+    job_path = tmp_path / "job.py"
+    job_path.write_text(
+        """
+from config import KEY_METRIC, NUM_ROUNDS
+from nvflare.app_opt.pt.recipes.fedavg import FedAvgRecipe
+from nvflare.recipe import SimEnv
+
+
+def main():
+    recipe = FedAvgRecipe(
+        name="demo",
+        min_clients=2,
+        num_rounds=NUM_ROUNDS,
+        train_script="client.py",
+        key_metric=KEY_METRIC,
+    )
+    recipe.execute(SimEnv(num_clients=2))
+""",
+        encoding="utf-8",
+    )
+
+    config = import_job_to_autofl_config(str(job_path), workspace_root=str(tmp_path))
+
+    assert config["objective"] == _objective("accuracy", source="default")
+    assert config["budget"]["fixed_training_budget"] == {"min_clients": 2, "num_clients": 2}
+    assert {
+        "field": "budget.fixed_training_budget.num_rounds",
+        "reason": "name:NUM_ROUNDS",
+    } in config["unresolved"]
+    assert {"field": "objective.metric", "reason": "name:KEY_METRIC"} in config["unresolved"]
+    assert {"field": "job.FedAvgRecipe.key_metric", "reason": "name:KEY_METRIC"} in config["unresolved"]
+    assert {"field": "job.FedAvgRecipe.num_rounds", "reason": "name:NUM_ROUNDS"} in config["unresolved"]
+
+
+def test_import_marks_call_expression_budget_and_metric_unresolved(tmp_path):
+    (tmp_path / "client.py").write_text(
+        """
+def train():
+    pass
+""",
+        encoding="utf-8",
+    )
+    job_path = tmp_path / "job.py"
+    job_path.write_text(
+        """
+from nvflare.app_opt.pt.recipes.fedavg import FedAvgRecipe
+from nvflare.recipe import SimEnv
+
+
+def get_metric():
+    return "accuracy"
+
+
+def get_rounds():
+    return 5
+
+
+def main():
+    recipe = FedAvgRecipe(
+        name="demo",
+        min_clients=2,
+        num_rounds=get_rounds(),
+        train_script="client.py",
+        key_metric=get_metric(),
+    )
+    recipe.execute(SimEnv(num_clients=2))
+""",
+        encoding="utf-8",
+    )
+
+    config = import_job_to_autofl_config(str(job_path), workspace_root=str(tmp_path))
+
+    assert config["objective"] == _objective("accuracy", source="default")
+    assert config["budget"]["fixed_training_budget"] == {"min_clients": 2, "num_clients": 2}
+    assert {
+        "field": "budget.fixed_training_budget.num_rounds",
+        "reason": "call:get_rounds",
+    } in config["unresolved"]
+    assert {"field": "objective.metric", "reason": "call:get_metric"} in config["unresolved"]
+    assert {"field": "job.FedAvgRecipe.key_metric", "reason": "call:get_metric"} in config["unresolved"]
+    assert {"field": "job.FedAvgRecipe.num_rounds", "reason": "call:get_rounds"} in config["unresolved"]
+    assert config["job"]["recipe_args"]["num_rounds"] == {
+        "value": "get_rounds()",
+        "source": "call:get_rounds",
+        "confidence": "low",
+    }
+
+
+def test_import_marks_unsupported_custom_job_as_partial(tmp_path):
+    job_path = tmp_path / "job.py"
+    job_path.write_text(
+        """
+def main():
+    run_custom_workflow()
+
+
+if __name__ == "__main__":
+    main()
+""",
+        encoding="utf-8",
+    )
+
+    config = import_job_to_autofl_config(str(job_path), workspace_root=str(tmp_path))
+
+    assert config["import"]["support"]["status"] == "partial"
+    assert config["job"]["surface"] == "unknown"
+    unresolved_fields = {item["field"] for item in config["unresolved"]}
+    assert "job.surface" in unresolved_fields
+    assert "job.train_script" in unresolved_fields
+    assert "budget.fixed_training_budget" in unresolved_fields
+
+
+def test_main_returns_clean_error_for_missing_job(tmp_path, capsys):
+    output_path = tmp_path / "autofl.yaml"
+
+    exit_code = job_importer.main([str(tmp_path / "missing.py"), "--output", str(output_path)])
+
+    captured = capsys.readouterr()
+    assert exit_code == 1
+    assert captured.out == ""
+    assert "error: job.py not found:" in captured.err
+    assert not output_path.exists()
diff --git a/tests/unit_test/tool/autofl_skill_campaign_guard_test.py b/tests/unit_test/tool/autofl_skill_campaign_guard_test.py
new file mode 100644
index 0000000000..d3be2da472
--- /dev/null
+++ b/tests/unit_test/tool/autofl_skill_campaign_guard_test.py
@@ -0,0 +1,188 @@
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import csv
+import importlib.util
+import json
+import subprocess
+import sys
+from pathlib import Path
+
+RESULT_FIELDS = [
+    "status",
+    "name",
+    "score",
+    "runtime_seconds",
+    "changed_files",
+    "diff_summary",
+    "run_command",
+    "artifacts",
+    "failure_reason",
+    "candidate_manifest",
+    "base_candidate",
+    "patch_sha256",
+]
+
+
+def _load_guard():
+    repo_root = Path(__file__).parents[3]
+    guard_path = repo_root / "skills" / "nvflare-autofl" / "scripts" / "campaign_guard.py"
+    spec = importlib.util.spec_from_file_location("nvflare_autofl_skill_campaign_guard_test", guard_path)
+    module = importlib.util.module_from_spec(spec)
+    sys.modules[spec.name] = module
+    spec.loader.exec_module(module)
+    return module
+
+
+def _write_results(path, rows):
+    with path.open("w", encoding="utf-8", newline="") as f:
+        writer = csv.DictWriter(f, fieldnames=RESULT_FIELDS, delimiter="\t")
+        writer.writeheader()
+        for row in rows:
+            writer.writerow(row)
+
+
+def _row(status, name, score="", diff_summary="candidate", run_command="python job.py"):
+    return {
+        "status": status,
+        "name": name,
+        "score": score,
+        "runtime_seconds": "10.0",
+        "changed_files": "none",
+        "diff_summary": diff_summary,
+        "run_command": run_command,
+        "artifacts": "/tmp/run",
+        "failure_reason": "",
+        "candidate_manifest": "",
+        "base_candidate": "",
+        "patch_sha256": "",
+    }
+
+
+def test_guard_continues_uncapped_before_plateau():
+    guard = _load_guard()
+    rows = [
+        _row("baseline", "baseline", "0.85"),
+        _row("discard", "batch_size_64", "0.851"),
+        _row("discard", "lr_0p01", "0.850"),
+    ]
+
+    state = guard.guard_state_for_rows(rows, plateau_threshold=4)
+
+    assert state["schema_version"] == "nvflare.autofl.campaign_state.v1"
+    assert state["decision"] == "continue"
+    assert state["next_action"] == "propose_candidate"
+    assert state["final_response_allowed"] is False
+    assert state["candidate_cap"] is None
+    assert state["candidate_cap_source"] == "uncapped"
+    assert state["candidate_attempts"] == 2
+    assert state["best_score"] == 0.851
+
+
+def test_guard_routes_plateau_to_literature_without_finalizing():
+    guard = _load_guard()
+    rows = [_row("baseline", "baseline", "0.85")]
+    rows.extend(_row("discard", f"candidate_{idx}", "0.840") for idx in range(4))
+
+    state = guard.guard_state_for_rows(rows, plateau_threshold=4)
+
+    assert state["decision"] == "continue"
+    assert state["reason"] == "plateau_literature"
+    assert state["next_action"] == "run_literature_loop"
+    assert state["final_response_allowed"] is False
+    assert state["best_score"] == 0.85
+    assert "Do not produce a final answer" in state["agent_instruction"]
+
+
+def test_guard_literature_event_resets_plateau_clock():
+    guard = _load_guard()
+    rows = [_row("baseline", "baseline", "0.85")]
+    rows.extend(_row("discard", f"before_lit_{idx}", "0.840") for idx in range(4))
+    rows.append(_row("literature", "literature_review", "", diff_summary="literature review"))
+    rows.extend(_row("discard", f"after_lit_{idx}", "0.841") for idx in range(2))
+
+    state = guard.guard_state_for_rows(rows, plateau_threshold=4)
+
+    assert state["reason"] == "continue"
+    assert state["next_action"] == "propose_candidate"
+    assert state["plateau"]["last_literature_event_index"] == 5
+    assert state["plateau"]["scored_since_reset"] == 2
+
+
+def test_guard_counts_candidate_with_baseline_in_description():
+    guard = _load_guard()
+    rows = [
+        _row("baseline", "baseline", "0.85"),
+        _row("discard", "weighted_rerun", "0.8503", diff_summary="weighted baseline escalated rerun"),
+    ]
+
+    state = guard.guard_state_for_rows(rows, max_candidates=1)
+
+    assert state["candidate_attempts"] == 1
+    assert state["decision"] == "stop"
+    assert state["reason"] == "candidate_cap_exhausted"
+    assert state["candidate_cap_source"] == "explicit"
+
+
+def test_guard_ignores_ambient_candidate_cap(monkeypatch):
+    guard = _load_guard()
+    monkeypatch.setenv("AUTOFL_MAX_CANDIDATES", "1")
+    rows = [
+        _row("baseline", "baseline", "0.85"),
+        _row("discard",
"candidate_1", "0.84"), + ] + + state = guard.guard_state_for_rows(rows) + + assert state["decision"] == "continue" + assert state["reason"] == "continue" + assert state["candidate_cap"] is None + assert state["candidate_cap_source"] == "uncapped" + assert state["final_response_allowed"] is False + + +def test_guard_cli_writes_campaign_state_json(tmp_path): + repo_root = Path(__file__).parents[3] + guard_path = repo_root / "skills" / "nvflare-autofl" / "scripts" / "campaign_guard.py" + results_path = tmp_path / "results.tsv" + state_path = tmp_path / "state.json" + _write_results( + results_path, + [ + _row("baseline", "baseline", "0.85"), + _row("discard", "candidate_1", "0.840"), + _row("discard", "candidate_2", "0.841"), + ], + ) + + proc = subprocess.run( + [ + sys.executable, + str(guard_path), + str(results_path), + "--state", + str(state_path), + "--plateau-threshold", + "2", + "--format", + "json", + ], + text=True, + capture_output=True, + check=True, + ) + + payload = json.loads(proc.stdout) + assert json.loads(state_path.read_text(encoding="utf-8")) == payload + assert payload["next_action"] == "run_literature_loop" diff --git a/tests/unit_test/tool/autofl_skill_plot_progress_test.py b/tests/unit_test/tool/autofl_skill_plot_progress_test.py new file mode 100644 index 0000000000..2cb59a2f0b --- /dev/null +++ b/tests/unit_test/tool/autofl_skill_plot_progress_test.py @@ -0,0 +1,127 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +import csv +import importlib.util +import sys +from pathlib import Path + +import pytest + + +def _load_plotter(): + repo_root = Path(__file__).parents[3] + script_path = repo_root / "skills" / "nvflare-autofl" / "scripts" / "plot_progress.py" + spec = importlib.util.spec_from_file_location("nvflare_autofl_skill_plot_progress", script_path) + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +def _record(plotter, index, score, status="discard", name=None, runtime=300.0): + return plotter.ProgressRecord( + index=index, + status=status, + name=name or f"candidate_{index}", + score=score, + runtime_seconds=runtime, + description=f"candidate {index}", + ) + + +def test_robust_y_limits_focus_on_improvement_region_for_maximize(): + plotter = _load_plotter() + scores = [0.50, 0.55, 0.687] + [0.70 + index * 0.002 for index in range(20)] + + lower, upper = plotter.default_y_limits(scores, baseline=0.687, mode="max") + + assert lower > 0.60 + assert lower < 0.687 + assert upper > max(scores) + + +def test_robust_y_limits_focus_on_improvement_region_for_minimize(): + plotter = _load_plotter() + scores = [2.0, 1.8, 0.90] + [0.80 - index * 0.01 for index in range(20)] + + lower, upper = plotter.default_y_limits(scores, baseline=0.90, mode="min") + + assert lower < min(scores) + assert upper > 0.90 + assert upper < 1.5 + + +@pytest.mark.parametrize( + "mode,scores,expected_best", + [ + ("max", [0.5, 0.6, 0.55, 0.7, 0.69, 0.8], 0.8), + ("min", [0.8, 0.7, 0.75, 0.6, 0.62, 0.5], 0.5), + ], +) +def test_milestone_selection_supports_both_objective_directions(mode, scores, expected_best): + plotter = _load_plotter() + records = [ + _record(plotter, index, score, status="baseline" if index == 0 else "keep") + for index, score in enumerate(scores) + ] + + milestones = 
plotter.select_observed_milestones(records, mode=mode, max_labels=3) + + assert len(milestones) <= 3 + assert milestones[-1][1].score == expected_best + + +def test_load_results_uses_productized_ledger_fields(tmp_path): + plotter = _load_plotter() + ledger = tmp_path / "results.tsv" + with ledger.open("w", encoding="utf-8", newline="") as f: + writer = csv.DictWriter( + f, + fieldnames=["status", "name", "score", "runtime_seconds", "diff_summary"], + delimiter="\t", + ) + writer.writeheader() + writer.writerow( + { + "status": "keep", + "name": "fedavgm", + "score": "0.75", + "runtime_seconds": "123.5", + "diff_summary": "server momentum", + } + ) + + records = plotter.load_results(ledger) + + assert records == [plotter.ProgressRecord(0, "keep", "fedavgm", 0.75, 123.5, "server momentum")] + assert plotter.normalize_records(records) == records + + +def test_rich_progress_plot_renders_png(tmp_path): + pytest.importorskip("matplotlib") + plotter = _load_plotter() + records = [_record(plotter, 0, 0.687, status="baseline", name="baseline")] + for index in range(1, 21): + status = "keep" if index in {3, 8, 15} else "discard" + records.append(_record(plotter, index, 0.69 + index * 0.002, status=status)) + records.insert(10, _record(plotter, 10, None, status="literature", name="literature_review_1", runtime=45.0)) + output = tmp_path / "progress.png" + + baseline, best = plotter.plot_progress(records, output, mode="max", metric_label="test_accuracy") + + assert baseline == pytest.approx(0.687) + assert best == pytest.approx(0.73) + assert output.read_bytes().startswith(b"\x89PNG\r\n\x1a\n") + assert output.stat().st_size > 20_000 diff --git a/tests/unit_test/tool/autofl_skill_report_test.py b/tests/unit_test/tool/autofl_skill_report_test.py new file mode 100644 index 0000000000..5cc082e817 --- /dev/null +++ b/tests/unit_test/tool/autofl_skill_report_test.py @@ -0,0 +1,551 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import csv +import importlib.util +import json +import sys +from pathlib import Path + +import pytest + +RESULT_FIELDS = [ + "status", + "name", + "score", + "runtime_seconds", + "changed_files", + "diff_summary", + "run_command", + "artifacts", + "failure_reason", + "candidate_manifest", + "base_candidate", + "patch_sha256", +] + + +def _load_script(name, script_path): + spec = importlib.util.spec_from_file_location(name, script_path) + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +def _load_reporter(): + repo_root = Path(__file__).parents[3] + script_path = repo_root / "skills" / "nvflare-autofl-report" / "scripts" / "generate_report.py" + return _load_script("nvflare_autofl_skill_report", script_path) + + +def _load_autofl_script(name): + repo_root = Path(__file__).parents[3] + script_path = repo_root / "skills" / "nvflare-autofl" / "scripts" / f"{name}.py" + return _load_script(f"nvflare_autofl_{name}", script_path) + + +def _row(status, name, score="", **kwargs): + row = {field: "" for field in RESULT_FIELDS} + row.update( + { + "status": status, + "name": name, + "score": score, + "runtime_seconds": kwargs.pop("runtime_seconds", "10"), + "changed_files": kwargs.pop("changed_files", "none"), + "diff_summary": kwargs.pop("diff_summary", name), + "run_command": kwargs.pop("run_command", "python job.py"), + "artifacts": 
kwargs.pop("artifacts", f"runs/{name}"), + } + ) + row.update(kwargs) + return row + + +def _write_rows(tmp_path, rows): + with tmp_path.joinpath("results.tsv").open("w", encoding="utf-8", newline="") as f: + writer = csv.DictWriter(f, fieldnames=RESULT_FIELDS, delimiter="\t") + writer.writeheader() + writer.writerows(rows) + + +def _write_campaign(tmp_path, *, active=False, mode="max"): + rows = [ + _row( + "baseline", + "baseline", + "0.500000", + run_command=("python job.py --n_clients 8 --num_rounds 10 --aggregation_epochs 1 --name autofl_baseline"), + ), + _row( + "literature", + "literature_review_1", + diff_summary="Review FedProx [src: Li18 arXiv:1812.06127] and SCAFFOLD [src: Karimireddy19].", + run_command="agent literature review", + ), + _row( + "keep" if mode == "min" else "discard", + "weak_prox", + "0.490000", + base_candidate="baseline", + diff_summary="Small FedProx term did not help.", + ), + _row( + "discard" if mode == "min" else "keep", + "algorithm_code", + "0.600000", + changed_files="client.py,job.py", + base_candidate="baseline", + candidate_manifest=".nvflare/autofl/candidates/algorithm_code/candidate_manifest.json", + patch_sha256="a" * 64, + diff_summary="Implement an agent-authored drift correction algorithm.", + run_command=( + "python job.py --n_clients 8 --num_rounds 10 --aggregation_epochs 2 --name autofl_algorithm_code" + ), + ), + _row( + "discard" if mode == "min" else "keep", + "inherited_tuning", + "0.650000", + changed_files="none", + base_candidate="algorithm_code", + candidate_manifest=".nvflare/autofl/candidates/inherited_tuning/candidate_manifest.json", + patch_sha256="b" * 64, + diff_summary="Tune the retained algorithm without another source patch.", + run_command=( + "python job.py --n_clients 8 --num_rounds 10 --aggregation_epochs 2 --lr 0.01 " + "--name autofl_inherited_tuning" + ), + ), + _row( + "crash", + "unstable_branch", + failure_reason="exit_code=1", + base_candidate="inherited_tuning", + diff_summary="A 
larger update was unstable.", + ), + ] + _write_rows(tmp_path, rows) + + tmp_path.joinpath("autofl.yaml").write_text( + "schema_version: nvflare.autofl.config.v1\n" + "objective:\n" + " requested_metric: accuracy\n" + " optimization_metric: test_accuracy\n" + " metric_source: held-out test set\n" + f" mode: {mode}\n" + "environment:\n" + " requested: sim\n" + "budget:\n" + " fixed_training_budget:\n" + " num_clients: 8\n" + " num_rounds: 20\n", + encoding="utf-8", + ) + state = { + "reason": "running" if active else "manual_interrupt", + "final_response_allowed": not active, + "candidate_cap": None, + "candidate_cap_source": "uncapped", + "mode": mode, + } + tmp_path.joinpath(".nvflare/autofl").mkdir(parents=True) + tmp_path.joinpath(".nvflare/autofl/campaign_state.json").write_text(json.dumps(state), encoding="utf-8") + tmp_path.joinpath("progress.png").write_bytes(b"\x89PNG\r\n\x1a\nexisting plot") + return rows + + +def _generate(reporter, tmp_path, monkeypatch, *extra): + monkeypatch.setattr(reporter, "refresh_plot", lambda *args, **kwargs: None) + args = reporter.parse_args([str(tmp_path), *extra]) + return reporter.generate(args) + + +def test_report_refuses_active_campaign_without_confirmation(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path, active=True) + monkeypatch.setattr(reporter, "refresh_plot", lambda *args, **kwargs: None) + + with pytest.raises(ValueError, match="final_response_allowed=false"): + reporter.generate(reporter.parse_args([str(tmp_path)])) + + assert not tmp_path.joinpath("autofl_final_report.md").exists() + + +def test_confirm_interrupted_reports_without_mutating_state(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path, active=True) + state_path = tmp_path / ".nvflare/autofl/campaign_state.json" + original_state = state_path.read_bytes() + + summary = _generate(reporter, tmp_path, monkeypatch, "--confirm-interrupted") + + assert summary["termination"] == { + "reason": 
"user_confirmed_interruption", + "state_allowed_final_response": False, + "user_confirmed_interruption": True, + } + assert state_path.read_bytes() == original_state + + +@pytest.mark.parametrize("score", ["", "0.900000"]) +def test_report_refuses_pending_ledger_rows_even_when_interruption_is_confirmed(tmp_path, monkeypatch, score): + reporter = _load_reporter() + rows = _write_campaign(tmp_path, active=True) + rows.append(_row("candidate", "pending_algo", score, base_candidate="baseline")) + _write_rows(tmp_path, rows) + monkeypatch.setattr(reporter, "refresh_plot", lambda *args, **kwargs: None) + + with pytest.raises(ValueError, match="finalize or abandon pending candidates"): + reporter.generate(reporter.parse_args([str(tmp_path), "--confirm-interrupted"])) + + assert not tmp_path.joinpath("autofl_final_report.md").exists() + + +def test_report_refuses_pending_campaign_state(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + state_path = tmp_path / ".nvflare/autofl/campaign_state.json" + state = json.loads(state_path.read_text(encoding="utf-8")) + state["pending_candidates"] = 1 + state["pending_candidate_manifest"] = ".nvflare/autofl/candidates/pending/candidate_manifest.json" + state_path.write_text(json.dumps(state), encoding="utf-8") + monkeypatch.setattr(reporter, "refresh_plot", lambda *args, **kwargs: None) + + with pytest.raises(ValueError, match="campaign state reports pending_candidates=1"): + reporter.generate(reporter.parse_args([str(tmp_path), "--confirm-interrupted"])) + + +@pytest.mark.parametrize("status", ["prepared", "ready_for_external_execution"]) +def test_report_refuses_pending_candidate_manifests(tmp_path, monkeypatch, status): + reporter = _load_reporter() + _write_campaign(tmp_path) + manifest_path = tmp_path / ".nvflare/autofl/candidates/pending/candidate_manifest.json" + manifest_path.parent.mkdir(parents=True) + manifest_path.write_text(json.dumps({"status": status}), encoding="utf-8") + 
monkeypatch.setattr(reporter, "refresh_plot", lambda *args, **kwargs: None) + + with pytest.raises(ValueError, match="candidate manifests remain prepared"): + reporter.generate(reporter.parse_args([str(tmp_path), "--confirm-interrupted"])) + + +def test_pending_candidate_returns_cli_exit_2(tmp_path, monkeypatch, capsys): + reporter = _load_reporter() + rows = _write_campaign(tmp_path) + rows.append(_row("candidate", "pending_algo")) + _write_rows(tmp_path, rows) + monkeypatch.setattr(reporter, "refresh_plot", lambda *args, **kwargs: None) + + assert reporter.main([str(tmp_path)]) == 2 + assert "finalize or abandon pending candidates" in capsys.readouterr().err + + +def test_report_generates_product_artifacts_and_candidate_lineage(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + + summary = _generate(reporter, tmp_path, monkeypatch) + report = tmp_path.joinpath("autofl_final_report.md").read_text(encoding="utf-8") + saved_summary = json.loads(tmp_path.joinpath("autofl_report_summary.json").read_text(encoding="utf-8")) + + assert summary["schema_version"] == "nvflare.autofl.report.v1" + assert summary["baseline"]["score"] == 0.5 + assert summary["best"]["name"] == "inherited_tuning" + assert summary["best"]["score"] == 0.65 + assert summary["best_lineage"] == { + "candidates": ["baseline", "algorithm_code", "inherited_tuning"], + "changed_files": ["client.py", "job.py"], + "complete": True, + } + assert saved_summary["best"]["name"] == "inherited_tuning" + assert summary["artifacts"]["progress_plot_available"] is True + assert "## Executive Summary" in report + assert "## Literature Review Outcomes" in report + assert "## Validation And Comparability Notes" in report + assert "Product Findings" not in report + assert str(tmp_path.joinpath("progress.png").resolve()) in report + + +def test_report_synthesizes_literature_against_checkpoint_incumbent(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + + 
summary = _generate(reporter, tmp_path, monkeypatch) + + literature = summary["literature_reviews"] + assert len(literature) == 1 + assert literature[0]["sources"] == ["Li18 arXiv:1812.06127", "Karimireddy19"] + assert literature[0]["outcome"] == "helped" + assert literature[0]["incumbent_score"] == 0.5 + assert literature[0]["best_candidate"] == "inherited_tuning" + assert literature[0]["delta_from_incumbent"] == pytest.approx(0.15) + + +def test_literature_evidence_preserves_crash_and_blank_discard_status(tmp_path, monkeypatch): + reporter = _load_reporter() + rows = _write_campaign(tmp_path) + rows[-1]["score"] = "0.990000" + rows.append(_row("discard", "blank_discard", "", diff_summary="No metric artifact was produced.")) + _write_rows(tmp_path, rows) + + summary = _generate(reporter, tmp_path, monkeypatch) + report = tmp_path.joinpath("autofl_final_report.md").read_text(encoding="utf-8") + + assert summary["best"]["name"] == "inherited_tuning" + assert summary["best_observed"]["name"] == "inherited_tuning" + assert all(item["name"] != "unstable_branch" for item in summary["milestones"]) + assert "unstable_branch=crash" in report + assert "blank_discard=n/a" in report + + +def test_discard_only_campaign_has_observed_but_no_retained_best(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + _write_rows(tmp_path, [_row("discard", "unretained_gain", "0.700000")]) + + summary = _generate(reporter, tmp_path, monkeypatch) + report = tmp_path.joinpath("autofl_final_report.md").read_text(encoding="utf-8") + + assert summary["best"] is None + assert summary["best_observed"]["name"] == "unretained_gain" + assert "No scored result was retained" in report + assert "Best retained candidate" not in report + + +def test_report_warns_about_executed_budget_and_test_metric(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + + summary = _generate(reporter, tmp_path, monkeypatch) + warnings = 
"\n".join(summary["warnings"]) + + assert "aggregation_epochs" in warnings + assert "autofl.yaml=20, executed=10" in warnings + assert "test-like metric" in warnings + assert summary["best_command_changes"]["aggregation_epochs"] == {"baseline": "1", "best": "2"} + + +def test_report_supports_minimization_and_agent_context_without_git(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path, mode="min") + context = tmp_path / "agent.json" + context.write_text(json.dumps({"provider": "codex", "notes": "stopped by user"}), encoding="utf-8") + + summary = _generate( + reporter, + tmp_path, + monkeypatch, + "--metric", + "loss", + "--agent-context", + str(context), + "--agent-model", + "gpt-test", + "--reasoning-effort", + "high", + ) + + assert not tmp_path.joinpath(".git").exists() + assert summary["objective"]["mode"] == "min" + assert summary["best"]["name"] == "weak_prox" + assert summary["agent_context"] == { + "model": "gpt-test", + "notes": "stopped by user", + "provider": "codex", + "reasoning_effort": "high", + } + + +def test_report_keeps_existing_plot_when_refresh_fails(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + monkeypatch.setattr(reporter, "default_plotter_path", lambda: tmp_path / "missing_plotter.py") + + summary = reporter.generate(reporter.parse_args([str(tmp_path)])) + + assert tmp_path.joinpath("progress.png").read_bytes() == b"\x89PNG\r\n\x1a\nexisting plot" + assert any("plotter not found" in warning for warning in summary["warnings"]) + assert summary["artifacts"]["progress_plot_available"] is True + + +def test_report_reads_best_candidate_manifest_when_available(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + manifest_path = tmp_path / ".nvflare/autofl/candidates/inherited_tuning/candidate_manifest.json" + manifest_path.parent.mkdir(parents=True) + manifest_path.write_text( + json.dumps( + { + "schema_version": "nvflare.autofl.candidate.v1", + 
"candidate_id": "inherited_tuning", + "base_candidate": "algorithm_code", + "hypothesis": "tune retained algorithm", + "run_args": ["--lr", "0.01"], + "changed_files": [], + "created_files": [], + "candidate_source_sha256": "c" * 64, + "fixed_budget_sha256": "d" * 64, + "patch_sha256": "b" * 64, + "status": "keep", + } + ), + encoding="utf-8", + ) + + summary = _generate(reporter, tmp_path, monkeypatch) + + assert summary["best_manifest"]["available"] is True + assert summary["best_manifest"]["candidate_id"] == "inherited_tuning" + assert summary["best_manifest"]["budget_sha256"] == "d" * 64 + + +@pytest.mark.parametrize("invalid_content", [None, b"not a png"]) +def test_report_degrades_when_progress_artifact_is_unavailable(tmp_path, monkeypatch, invalid_content): + reporter = _load_reporter() + _write_campaign(tmp_path) + progress_path = tmp_path.joinpath("progress.png") + if invalid_content is None: + progress_path.unlink() + else: + progress_path.write_bytes(invalid_content) + monkeypatch.setattr(reporter, "default_plotter_path", lambda: tmp_path / "missing_plotter.py") + + summary = reporter.generate(reporter.parse_args([str(tmp_path)])) + report = tmp_path.joinpath("autofl_final_report.md").read_text(encoding="utf-8") + + assert summary["artifacts"]["progress_plot_available"] is False + assert "Progress plot unavailable" in report + assert "![Auto-FL progress]" not in report + assert tmp_path.joinpath("autofl_report_summary.json").is_file() + if invalid_content is not None: + assert progress_path.read_bytes() == invalid_content + + +def test_report_normalizes_malformed_optional_contract_sections(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + tmp_path.joinpath("autofl.yaml").write_text( + "objective: []\nenvironment: sim\nbudget: null\n", + encoding="utf-8", + ) + + summary = _generate(reporter, tmp_path, monkeypatch) + warnings = "\n".join(summary["warnings"]) + + assert summary["objective"]["optimization_metric"] == 
"score" + assert summary["objective"]["metric_source"] == "NVFlare metric artifacts" + assert summary["environment"] == "not declared" + assert summary["declared_fixed_budget"] == {} + assert "section 'objective' is list" in warnings + assert "section 'environment' is str" in warnings + assert "section 'budget' is null" in warnings + + +def test_report_separates_metric_measurement_and_contract_sources(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + tmp_path.joinpath("autofl.yaml").write_text( + "objective:\n" + " requested_metric: accuracy\n" + " optimization_metric: test_accuracy\n" + " source: arg:key_metric\n", + encoding="utf-8", + ) + + summary = _generate(reporter, tmp_path, monkeypatch) + + assert summary["objective"]["metric_source"] == "NVFlare metric artifacts" + assert summary["objective"]["metric_contract_source"] == "arg:key_metric" + + +def test_relative_plotter_path_resolves_from_campaign_directory(tmp_path, monkeypatch): + reporter = _load_reporter() + _write_campaign(tmp_path) + plotter_path = tmp_path / "tools/plot_progress.py" + plotter_path.parent.mkdir() + plotter_path.write_text("# test plotter\n", encoding="utf-8") + captured = {} + + def _capture_plotter(*args): + captured["path"] = args[-1] + + monkeypatch.setattr(reporter, "refresh_plot", _capture_plotter) + + reporter.generate(reporter.parse_args([str(tmp_path), "--plotter", "tools/plot_progress.py"])) + + assert captured["path"] == plotter_path.resolve() + + +def test_candidate_lineage_marks_cycles_incomplete(): + reporter = _load_reporter() + fields = { + "score": 0.5, + "runtime_seconds": 1.0, + "changed_files": "none", + "diff_summary": "candidate", + "run_command": "python job.py", + "artifacts": "runs/candidate", + "failure_reason": "", + "candidate_manifest": "", + "patch_sha256": "", + } + first = reporter.RunRecord(index=0, status="keep", name="first", base_candidate="second", **fields) + second = reporter.RunRecord(index=1, status="keep", 
name="second", base_candidate="first", **fields) + + assert reporter.candidate_lineage(second, [first, second])["complete"] is False + + +def test_budget_comparison_accepts_numeric_equivalence(): + reporter = _load_reporter() + + assert reporter.values_equal("8.0", 8) + assert not reporter.values_equal("8.1", 8) + + +def test_report_attempt_and_baseline_rules_match_campaign_guard(): + reporter = _load_reporter() + guard = _load_autofl_script("campaign_guard") + cases = [ + {"status": "baseline", "name": "control", "run_command": "python job.py"}, + {"status": "keep", "name": "baseline", "run_command": "python job.py"}, + {"status": "discard", "name": "baseline_seed_1", "run_command": "python job.py"}, + {"status": "discard", "name": "control", "run_command": "python job.py --name baseline"}, + {"status": "keep", "name": "candidate_1", "run_command": "python job.py"}, + ] + + assert reporter.ATTEMPT_STATUSES == guard.COMPARABLE_STATUSES + for index, case in enumerate(cases): + record = reporter.RunRecord( + index=index, + score=0.5, + runtime_seconds=1.0, + changed_files="", + diff_summary="", + artifacts="", + failure_reason="", + candidate_manifest="", + base_candidate="", + patch_sha256="", + **case, + ) + assert reporter.is_baseline(record) == guard.is_baseline(case) + + +@pytest.mark.parametrize("seconds", [0.0, 45.0, 3599.0, 3600.0]) +def test_report_runtime_format_matches_progress_plotter(seconds): + reporter = _load_reporter() + plotter = _load_autofl_script("plot_progress") + + assert reporter.format_runtime(seconds) == plotter.format_runtime(seconds) diff --git a/tests/unit_test/tool/autofl_skill_runner_test.py b/tests/unit_test/tool/autofl_skill_runner_test.py new file mode 100644 index 0000000000..db0cf57685 --- /dev/null +++ b/tests/unit_test/tool/autofl_skill_runner_test.py @@ -0,0 +1,892 @@ +# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import importlib.util +import json +import os +import subprocess +import sys +from copy import deepcopy +from pathlib import Path +from types import SimpleNamespace + +import pytest + + +def _load_runner(): + repo_root = Path(__file__).parents[3] + runner_path = repo_root / "skills" / "nvflare-autofl" / "scripts" / "run_job_campaign.py" + spec = importlib.util.spec_from_file_location("nvflare_autofl_skill_runner", runner_path) + module = importlib.util.module_from_spec(spec) + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +def _campaign_config(): + return { + "schema_version": "nvflare.autofl.config.v1", + "import": {"source_sha256": "a" * 64}, + "job": {"allowed_edit_paths": ["job.py", "client.py"]}, + "objective": { + "metric": "accuracy", + "requested_metric": "accuracy", + "optimization_metric": "accuracy", + "metric_extraction_order": ["accuracy"], + }, + "budget": {"fixed_training_budget": {"num_clients": 2, "num_rounds": 1}}, + "environment": {"requested": "sim"}, + "search_space": {"suggested": {"lr": {"default": 0.1}}}, + "trust_contract": {"allowed_edit_paths": ["job.py", "client.py"], "unresolved": []}, + } + + +def _initialize_fake_campaign(runner, tmp_path, monkeypatch, *, target_env="sim", baseline_score=0.5): + job = tmp_path / "job.py" + client = tmp_path / "client.py" + job.write_text("print('job')\n", encoding="utf-8") + client.write_text("ALGORITHM = 
'baseline'\n", encoding="utf-8") + config = _campaign_config() + config["environment"]["requested"] = target_env + monkeypatch.setattr(runner, "import_job_config", lambda *args, **kwargs: deepcopy(config)) + monkeypatch.setattr(runner, "job_help", lambda *args, **kwargs: "") + monkeypatch.setattr(runner, "write_progress", lambda path, *args: path.write_bytes(b"progress")) + + def fake_run(run_def, **kwargs): + return runner.RunRecord( + run_def.status, + run_def.name, + baseline_score, + 1.0, + "none", + run_def.description, + "python job.py", + str(tmp_path / "artifacts" / run_def.name), + ) + + monkeypatch.setattr(runner, "run_job", fake_run) + argv = ["initialize", str(job), "--env", target_env, "--no-prefer-synthetic"] + assert runner.main(argv) == 0 + return job, client, config + + +def test_candidate_args_respect_mutation_schema_bounds(): + runner = _load_runner() + schema = {"mutable_args": {"lr": {"type": "float", "min": 0.0001, "max": 0.1}}} + + assert runner.candidate_args_allowed(["--lr", "0.1"], schema) == (True, "") + + allowed, reason = runner.candidate_args_allowed(["--lr", "0.2"], schema) + assert not allowed + assert "above schema max" in reason + + +def test_candidate_run_args_cannot_override_fixed_budget(): + runner = _load_runner() + config = {"budget": {"fixed_training_budget": {"num_clients": 8, "num_rounds": 20}}} + + assert runner.candidate_preserves_fixed_args(["--lr", "0.01"], config, {}) == (True, "") + assert runner.candidate_preserves_fixed_args(["--num_rounds=2"], config, {})[0] is False + assert runner.candidate_preserves_fixed_args(["--n-clients", "4"], config, {})[0] is False + + +def test_candidate_plan_skips_out_of_bounds_learning_rate(): + runner = _load_runner() + config = {"search_space": {"suggested": {"lr": {"default": 0.05}}}} + schema = {"mutable_args": {"lr": {"type": "float", "min": 0.0001, "max": 0.1}}} + help_text = "--lr" + + candidates = list(runner.candidate_plan(config, help_text, max_candidates=10, schema=schema)) 
+ candidate_commands = [" ".join(candidate.args) for candidate in candidates] + + assert "--lr 0.1" in candidate_commands + assert "--lr 0.2" not in candidate_commands + + +def test_runner_prefers_explicit_test_accuracy_alias(tmp_path): + runner = _load_runner() + result_path = tmp_path / "cross_val_results.json" + result_path.write_text( + json.dumps( + { + "site-1": { + "SRV_FL_global_model.pt": { + "accuracy": 0.5, + "test_accuracy": 0.8, + } + } + } + ), + encoding="utf-8", + ) + + assert runner.extract_score(tmp_path, ["test_accuracy", "accuracy"]) == 0.8 + + +def test_runner_applies_schema_metric_contract(): + runner = _load_runner() + config = { + "objective": { + "metric": "accuracy", + "requested_metric": "accuracy", + "optimization_metric": "accuracy", + "metric_extraction_order": ["accuracy"], + } + } + schema = { + "objective": { + "requested_metric": "accuracy", + "optimization_metric": "test_accuracy", + "metric_extraction_order": ["test_accuracy", "accuracy"], + "metric_source": "held-out CIFAR-10 test set", + } + } + + updated = runner.apply_metric_contract(config, "accuracy", schema) + + assert updated["objective"]["metric"] == "accuracy" + assert updated["objective"]["requested_metric"] == "accuracy" + assert updated["objective"]["optimization_metric"] == "test_accuracy" + assert updated["objective"]["metric_extraction_order"] == ["test_accuracy", "accuracy"] + assert updated["objective"]["metric_source"] == "held-out CIFAR-10 test set" + + +def test_comparison_budget_suppresses_duplicate_imported_fixed_budget_args(): + runner = _load_runner() + config = {"budget": {"fixed_training_budget": {"num_clients": 8, "num_rounds": 10}}} + schema = { + "comparison_budget_args": { + "default_candidate_budget": { + "n_clients": 8, + "num_rounds": 20, + } + } + } + help_text = "--n_clients --num_rounds" + + assert runner.build_fixed_args(config, help_text, schema) == [] + + args = SimpleNamespace(base_args="", prefer_synthetic=False, synthetic_train_size=1, 
synthetic_test_size=1) + assert runner.build_base_args(args, help_text, schema) == ["--n_clients", "8", "--num_rounds", "20"] + + +def test_run_streams_output_before_timeout(tmp_path): + runner = _load_runner() + log_path = tmp_path / "run.log" + + rc, output, _ = runner.run( + [ + sys.executable, + "-c", + "import time; print('started', flush=True); time.sleep(2); print('finished', flush=True)", + ], + tmp_path, + timeout=1, + log_path=log_path, + ) + + log_text = log_path.read_text(encoding="utf-8") + assert rc == 124 + assert "started" in output + assert "started" in log_text + assert "TIMEOUT after 1s" in log_text + + +def test_run_stops_on_nvflare_simulator_stall_log(tmp_path): + runner = _load_runner() + log_path = tmp_path / "run.log" + sim_root = tmp_path / "simulation" / "autofl_candidate" + server_log = sim_root / "server" / "log_fl.txt" + server_log.parent.mkdir(parents=True) + server_log.write_text( + "SimulatorClientRunner - ERROR - run_client_thread error: RuntimeError: " + "Failed to create connection to the child process in SimulatorClientRunner, timeout: 60.0\n", + encoding="utf-8", + ) + + rc, output, runtime = runner.run( + [ + sys.executable, + "-c", + "import time; print('started', flush=True); time.sleep(30)", + ], + tmp_path, + timeout=30, + log_path=log_path, + simulator_stall_roots=[sim_root], + stall_check_interval=0.01, + ) + + log_text = log_path.read_text(encoding="utf-8") + assert rc == runner.SIMULATOR_STALL_EXIT_CODE + assert runtime < 5 + assert "SIMULATOR_STALL:" in output + assert "SIMULATOR_STALL:" in log_text + + +def test_run_stops_on_nvflare_simulator_no_progress_log(tmp_path): + runner = _load_runner() + log_path = tmp_path / "run.log" + sim_root = tmp_path / "simulation" / "autofl_candidate" + server_log = sim_root / "server" / "log_fl.txt" + server_log.parent.mkdir(parents=True) + server_log.write_text("Round 0 started\n", encoding="utf-8") + + rc, output, runtime = runner.run( + [ + sys.executable, + "-c", + "import time; 
print('started', flush=True); time.sleep(30)", + ], + tmp_path, + timeout=30, + log_path=log_path, + simulator_stall_roots=[sim_root], + stall_check_interval=0.01, + simulator_no_progress_timeout=1, + ) + + log_text = log_path.read_text(encoding="utf-8") + assert rc == runner.SIMULATOR_STALL_EXIT_CODE + assert runtime < 5 + assert "SIMULATOR_STALL: no simulator progress markers changed" in output + assert "SIMULATOR_STALL: no simulator progress markers changed" in log_text + + +def test_run_stops_on_stale_partial_simulator_aggregation(tmp_path): + runner = _load_runner() + log_path = tmp_path / "run.log" + sim_root = tmp_path / "simulation" / "autofl_candidate" + server_log = sim_root / "server" / "log_fl.txt" + site_log = sim_root / "site-1" / "log_fl.txt" + server_log.parent.mkdir(parents=True) + site_log.parent.mkdir(parents=True) + server_log.write_text( + "Round 0 started\n" "2026-06-25 06:32:33 - FedAvg - INFO - Aggregated 1/8 results\n", + encoding="utf-8", + ) + site_log.write_text("[site=site-1] round=0\n", encoding="utf-8") + + rc, output, runtime = runner.run( + [ + sys.executable, + "-c", + ( + "import pathlib, time; " + f"path = pathlib.Path({str(site_log)!r}); " + "time.sleep(0.2); " + "path.write_text('[site=site-1] round=0\\n[site=site-2] round=0\\n'); " + "time.sleep(30)" + ), + ], + tmp_path, + timeout=30, + log_path=log_path, + simulator_stall_roots=[sim_root], + stall_check_interval=0.01, + simulator_no_progress_timeout=1, + ) + + log_text = log_path.read_text(encoding="utf-8") + assert rc == runner.SIMULATOR_STALL_EXIT_CODE + assert runtime < 5 + assert "SIMULATOR_STALL: partial simulator aggregation made no server-side progress" in output + assert "Aggregated 1/8 results" in log_text + + +def test_runner_state_routes_plateau_to_literature_checkpoint(tmp_path): + runner = _load_runner() + records = [ + runner.RunRecord("baseline", "baseline", 0.85, 1.0, "none", "baseline", "python job.py", "/tmp/baseline"), + runner.RunRecord("discard", 
"candidate_1", 0.84, 1.0, "none", "candidate", "python job.py", "/tmp/c1"), + runner.RunRecord("discard", "candidate_2", 0.84, 1.0, "none", "candidate", "python job.py", "/tmp/c2"), + ] + results_path = tmp_path / "results.tsv" + state_path = tmp_path / "state.json" + runner.write_results(results_path, records) + + state = runner.write_state( + state_path, + results_path, + records, + None, + plateau_threshold=2, + ) + + assert state["schema_version"] == "nvflare.autofl.campaign_state.v1" + assert state["reason"] == "plateau_literature" + assert state["next_action"] == "run_literature_loop" + assert state["final_response_allowed"] is False + assert state == json.loads(state_path.read_text(encoding="utf-8")) + + +def test_runner_state_finalizes_after_explicit_candidate_cap(tmp_path): + runner = _load_runner() + records = [ + runner.RunRecord("baseline", "baseline", 0.85, 1.0, "none", "baseline", "python job.py", "/tmp/baseline"), + runner.RunRecord("discard", "candidate_1", 0.84, 1.0, "none", "candidate", "python job.py", "/tmp/c1"), + ] + results_path = tmp_path / "results.tsv" + state_path = tmp_path / "state.json" + runner.write_results(results_path, records) + + state = runner.write_state(state_path, results_path, records, 1) + + assert state["decision"] == "stop" + assert state["reason"] == "candidate_cap_exhausted" + assert state["next_action"] == "final_report" + assert state["final_response_allowed"] is True + assert state["candidate_cap_source"] == "explicit" + + +def test_runner_state_ignores_ambient_candidate_cap(tmp_path, monkeypatch): + runner = _load_runner() + monkeypatch.setenv("AUTOFL_MAX_CANDIDATES", "1") + records = [ + runner.RunRecord("baseline", "baseline", 0.85, 1.0, "none", "baseline", "python job.py", "/tmp/baseline"), + runner.RunRecord("discard", "candidate_1", 0.84, 1.0, "none", "candidate", "python job.py", "/tmp/c1"), + ] + results_path = tmp_path / "results.tsv" + state_path = tmp_path / "state.json" + 
runner.write_results(results_path, records) + + state = runner.write_state(state_path, results_path, records, None) + + assert state["decision"] == "continue" + assert state["reason"] == "continue" + assert state["candidate_cap"] is None + assert state["candidate_cap_source"] == "uncapped" + assert state["final_response_allowed"] is False + + +def test_runner_state_marks_infrastructure_retry_non_final(tmp_path): + runner = _load_runner() + records = [ + runner.RunRecord( + runner.INFRASTRUCTURE_RETRY, + "baseline", + None, + 1.0, + "none", + "baseline", + "python job.py", + "/tmp/baseline", + ) + ] + results_path = tmp_path / "results.tsv" + state_path = tmp_path / "state.json" + runner.write_results(results_path, records) + + state = runner.write_state(state_path, results_path, records, None) + + assert state["decision"] == "retry_infrastructure" + assert state["reason"] == "infrastructure_retry" + assert state["next_action"] == "rerun_with_escalated_execution" + assert state["final_response_allowed"] is False + + +def test_code_candidate_keeps_improvement_and_restores_discard_without_git(tmp_path, monkeypatch): + runner = _load_runner() + job, client, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + + assert runner.main(["prepare", str(job), "--name", "new_algo", "--hypothesis", "add a new algorithm"]) == 0 + draft = tmp_path / ".nvflare" / "autofl" / "candidates" / "new_algo" / "source" + draft.joinpath("client.py").write_text("from new_algorithm import VALUE\n", encoding="utf-8") + draft.joinpath("new_algorithm.py").write_text("VALUE = 'improved'\n", encoding="utf-8") + + def improved_run(run_def, **kwargs): + return runner.RunRecord( + "candidate", run_def.name, 0.7, 2.0, "none", run_def.description, "python job.py", "/tmp/new_algo" + ) + + monkeypatch.setattr(runner, "run_job", improved_run) + assert runner.main(["evaluate", str(job)]) == 0 + assert client.read_text(encoding="utf-8") == "from new_algorithm import VALUE\n" + assert 
tmp_path.joinpath("new_algorithm.py").read_text(encoding="utf-8") == "VALUE = 'improved'\n" + + kept_manifest = json.loads( + tmp_path.joinpath(".nvflare/autofl/candidates/new_algo/candidate_manifest.json").read_text(encoding="utf-8") + ) + assert kept_manifest["status"] == "keep" + assert kept_manifest["changed_files"] == ["client.py", "new_algorithm.py"] + assert kept_manifest["patch_sha256"] + + assert runner.main(["prepare", str(job), "--name", "bad_algo", "--hypothesis", "try a regression"]) == 0 + bad_draft = tmp_path / ".nvflare" / "autofl" / "candidates" / "bad_algo" / "source" + bad_draft.joinpath("client.py").write_text("ALGORITHM = 'regression'\n", encoding="utf-8") + + def regressed_run(run_def, **kwargs): + return runner.RunRecord( + "candidate", run_def.name, 0.3, 2.0, "none", run_def.description, "python job.py", "/tmp/bad_algo" + ) + + monkeypatch.setattr(runner, "run_job", regressed_run) + assert runner.main(["evaluate", str(job)]) == 0 + assert client.read_text(encoding="utf-8") == "from new_algorithm import VALUE\n" + records = runner.load_results(tmp_path / "results.tsv") + assert [record.status for record in records] == ["baseline", "keep", "discard"] + assert records[1].changed_files == "client.py,new_algorithm.py" + assert records[1].candidate_manifest.endswith("candidate_manifest.json") + + +def test_candidate_rejects_unauthorized_existing_source_and_symlink(tmp_path, monkeypatch): + runner = _load_runner() + job, _, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + tmp_path.joinpath("secret.py").write_text("SECRET = True\n", encoding="utf-8") + assert runner.main(["prepare", str(job), "--name", "unsafe", "--hypothesis", "touch secret"]) == 0 + draft = tmp_path / ".nvflare" / "autofl" / "candidates" / "unsafe" / "source" + draft.joinpath("secret.py").write_text("SECRET = False\n", encoding="utf-8") + assert runner.main(["evaluate", str(job)]) == 2 + + with pytest.raises(ValueError, match="escapes"): + 
runner.safe_relative_path(tmp_path, "../outside.py") + assert tmp_path.joinpath("secret.py").read_text(encoding="utf-8") == "SECRET = True\n" + + draft.joinpath("secret.py").unlink() + link = draft / "linked.py" + try: + link.symlink_to(tmp_path / "secret.py") + except OSError: + pytest.skip("symlinks are unavailable on this platform") + assert runner.main(["evaluate", str(job)]) == 2 + + +def test_candidate_rejects_stale_manifest_and_budget_drift(tmp_path, monkeypatch): + runner = _load_runner() + job, client, config = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + assert runner.main(["prepare", str(job), "--name", "stale", "--hypothesis", "change code"]) == 0 + candidate_dir = tmp_path / ".nvflare" / "autofl" / "candidates" / "stale" + candidate_dir.joinpath("source/client.py").write_text("ALGORITHM = 'candidate'\n", encoding="utf-8") + manifest_path = candidate_dir / "candidate_manifest.json" + manifest = json.loads(manifest_path.read_text(encoding="utf-8")) + manifest["base_source_sha256"] = "0" * 64 + manifest_path.write_text(json.dumps(manifest), encoding="utf-8") + assert runner.main(["evaluate", str(job)]) == 2 + + manifest["base_source_sha256"] = json.loads( + tmp_path.joinpath(".nvflare/autofl/campaign.json").read_text(encoding="utf-8") + )["best_source_sha256"] + manifest_path.write_text(json.dumps(manifest), encoding="utf-8") + drifted = deepcopy(config) + drifted["budget"]["fixed_training_budget"]["num_rounds"] = 2 + monkeypatch.setattr(runner, "import_job_config", lambda *args, **kwargs: deepcopy(drifted)) + assert runner.main(["evaluate", str(job)]) == 2 + assert client.read_text(encoding="utf-8") == "ALGORITHM = 'baseline'\n" + + +def test_candidate_schema_failure_does_not_modify_workspace(tmp_path, monkeypatch): + runner = _load_runner() + job, client, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + assert runner.main(["prepare", str(job), "--name", "bad_schema", "--hypothesis", "change code"]) == 0 + candidate_dir = 
tmp_path / ".nvflare" / "autofl" / "candidates" / "bad_schema" + candidate_dir.joinpath("source/client.py").write_text("ALGORITHM = 'candidate'\n", encoding="utf-8") + tmp_path.joinpath("mutation_schema.yaml").write_text( + "comparison_budget_args:\n default_candidate_budget:\n run_timeout_seconds: fast\n", + encoding="utf-8", + ) + + assert runner.main(["evaluate", str(job)]) == 2 + assert client.read_text(encoding="utf-8") == "ALGORITHM = 'baseline'\n" + + +def test_candidate_partial_apply_failure_restores_workspace(tmp_path, monkeypatch): + runner = _load_runner() + job, client, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + baseline_job = job.read_text(encoding="utf-8") + baseline_client = client.read_text(encoding="utf-8") + assert runner.main(["prepare", str(job), "--name", "partial", "--hypothesis", "change two files"]) == 0 + draft = tmp_path / ".nvflare" / "autofl" / "candidates" / "partial" / "source" + draft.joinpath("client.py").write_text("ALGORITHM = 'candidate'\n", encoding="utf-8") + draft.joinpath("job.py").write_text("print('candidate')\n", encoding="utf-8") + original_copy = runner.copy_relative_file + + def fail_second_candidate_copy(source_root, destination_root, relative): + if source_root == draft and relative == "job.py": + raise OSError("simulated candidate copy failure") + original_copy(source_root, destination_root, relative) + + monkeypatch.setattr(runner, "copy_relative_file", fail_second_candidate_copy) + + assert runner.main(["evaluate", str(job)]) == 2 + assert job.read_text(encoding="utf-8") == baseline_job + assert client.read_text(encoding="utf-8") == baseline_client + + +def test_candidate_job_help_failure_restores_workspace(tmp_path, monkeypatch): + runner = _load_runner() + job, client, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + assert runner.main(["prepare", str(job), "--name", "help_failure", "--hypothesis", "change code"]) == 0 + draft = tmp_path / ".nvflare" / "autofl" / "candidates" / 
"help_failure" / "source" + draft.joinpath("client.py").write_text("ALGORITHM = 'candidate'\n", encoding="utf-8") + + def fail_job_help(*args, **kwargs): + raise OSError("simulated missing Python executable") + + monkeypatch.setattr(runner, "job_help", fail_job_help) + + assert runner.main(["evaluate", str(job)]) == 2 + assert client.read_text(encoding="utf-8") == "ALGORITHM = 'baseline'\n" + + +def test_candidate_finalization_failure_rolls_back_workspace_and_campaign_files(tmp_path, monkeypatch): + runner = _load_runner() + job, client, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + assert runner.main(["prepare", str(job), "--name", "late_failure", "--hypothesis", "improve code"]) == 0 + candidate_dir = tmp_path / ".nvflare" / "autofl" / "candidates" / "late_failure" + candidate_dir.joinpath("source/client.py").write_text("ALGORITHM = 'candidate'\n", encoding="utf-8") + original_autofl = tmp_path.joinpath("autofl.yaml").read_bytes() + original_results = tmp_path.joinpath("results.tsv").read_bytes() + original_state = tmp_path.joinpath(".nvflare/autofl/campaign_state.json").read_bytes() + original_progress = tmp_path.joinpath("progress.png").read_bytes() + original_report = tmp_path.joinpath("autofl_report.md").read_bytes() + + def improved_run(run_def, **kwargs): + return runner.RunRecord( + "candidate", run_def.name, 0.7, 2.0, "none", run_def.description, "python job.py", "/tmp/late_failure" + ) + + original_refresh = runner.refresh_campaign_artifacts + + def fail_artifact_refresh(*args, **kwargs): + original_refresh(*args, **kwargs) + raise OSError("simulated report write failure") + + monkeypatch.setattr(runner, "run_job", improved_run) + monkeypatch.setattr(runner, "refresh_campaign_artifacts", fail_artifact_refresh) + + assert runner.main(["evaluate", str(job)]) == 2 + assert client.read_text(encoding="utf-8") == "ALGORITHM = 'baseline'\n" + assert tmp_path.joinpath("autofl.yaml").read_bytes() == original_autofl + assert 
tmp_path.joinpath("results.tsv").read_bytes() == original_results + assert tmp_path.joinpath(".nvflare/autofl/campaign_state.json").read_bytes() == original_state + assert tmp_path.joinpath("progress.png").read_bytes() == original_progress + assert tmp_path.joinpath("autofl_report.md").read_bytes() == original_report + best_source, best_files = runner.load_best_snapshot(tmp_path / ".nvflare" / "autofl" / "snapshots" / "best") + assert runner.workspace_matches_snapshot(tmp_path, best_source, best_files) + metadata = json.loads(tmp_path.joinpath(".nvflare/autofl/campaign.json").read_text(encoding="utf-8")) + assert metadata["best_candidate"] == "baseline" + manifest = json.loads(candidate_dir.joinpath("candidate_manifest.json").read_text(encoding="utf-8")) + assert manifest["status"] == "prepared" + + +def test_candidate_snapshot_stage_failure_preserves_previous_best(tmp_path, monkeypatch): + runner = _load_runner() + job, client, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + assert runner.main(["prepare", str(job), "--name", "snapshot_failure", "--hypothesis", "improve code"]) == 0 + draft = tmp_path / ".nvflare" / "autofl" / "candidates" / "snapshot_failure" / "source" + draft.joinpath("client.py").write_text("ALGORITHM = 'candidate'\n", encoding="utf-8") + + def improved_run(run_def, **kwargs): + return runner.RunRecord( + "candidate", run_def.name, 0.7, 2.0, "none", run_def.description, "python job.py", "/tmp/snapshot" + ) + + def fail_snapshot_stage(*args, **kwargs): + raise OSError("simulated snapshot copy failure") + + monkeypatch.setattr(runner, "run_job", improved_run) + monkeypatch.setattr(runner, "stage_best_snapshot", fail_snapshot_stage) + + assert runner.main(["evaluate", str(job)]) == 2 + assert client.read_text(encoding="utf-8") == "ALGORITHM = 'baseline'\n" + best_source, best_files = runner.load_best_snapshot(tmp_path / ".nvflare" / "autofl" / "snapshots" / "best") + assert runner.workspace_matches_snapshot(tmp_path, best_source, 
best_files) + + +def test_candidate_discard_restore_failure_retries_rollback(tmp_path, monkeypatch): + runner = _load_runner() + job, client, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + assert runner.main(["prepare", str(job), "--name", "discard_failure", "--hypothesis", "regress code"]) == 0 + draft = tmp_path / ".nvflare" / "autofl" / "candidates" / "discard_failure" / "source" + draft.joinpath("client.py").write_text("ALGORITHM = 'candidate'\n", encoding="utf-8") + original_restore = runner.restore_best_source + restore_calls = 0 + + def fail_first_restore(*args, **kwargs): + nonlocal restore_calls + restore_calls += 1 + if restore_calls == 1: + raise OSError("simulated restore failure") + original_restore(*args, **kwargs) + + def regressed_run(run_def, **kwargs): + return runner.RunRecord( + "candidate", run_def.name, 0.3, 2.0, "none", run_def.description, "python job.py", "/tmp/discard_failure" + ) + + monkeypatch.setattr(runner, "restore_best_source", fail_first_restore) + monkeypatch.setattr(runner, "run_job", regressed_run) + + assert runner.main(["evaluate", str(job)]) == 2 + assert restore_calls == 2 + assert client.read_text(encoding="utf-8") == "ALGORITHM = 'baseline'\n" + + +def test_malformed_yaml_returns_clean_cli_errors(tmp_path, monkeypatch, capsys): + runner = _load_runner() + job, _, config = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + capsys.readouterr() + autofl_yaml = tmp_path / "autofl.yaml" + autofl_yaml.write_text("job: [\n", encoding="utf-8") + + assert runner.main(["suggest", str(job)]) == 2 + stderr = capsys.readouterr().err + assert f"Auto-FL suggest failed: invalid YAML in {autofl_yaml}" in stderr + assert "Traceback" not in stderr + + runner.write_yaml(autofl_yaml, config) + mutation_schema = tmp_path / "mutation_schema.yaml" + mutation_schema.write_text("comparison_budget_args: [\n", encoding="utf-8") + + assert runner.main(["suggest", str(job)]) == 2 + stderr = capsys.readouterr().err + assert 
f"Auto-FL suggest failed: invalid YAML in {mutation_schema}" in stderr + assert "Traceback" not in stderr + + +def test_abandon_candidate_clears_pending_draft_without_touching_best(tmp_path, monkeypatch): + runner = _load_runner() + job, client, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + assert runner.main(["prepare", str(job), "--name", "abandoned", "--hypothesis", "temporary idea"]) == 0 + draft = tmp_path / ".nvflare" / "autofl" / "candidates" / "abandoned" / "source" / "client.py" + draft.write_text("ALGORITHM = 'temporary'\n", encoding="utf-8") + + assert runner.main(["abandon", str(job)]) == 0 + assert client.read_text(encoding="utf-8") == "ALGORITHM = 'baseline'\n" + manifest = json.loads( + tmp_path.joinpath(".nvflare/autofl/candidates/abandoned/candidate_manifest.json").read_text(encoding="utf-8") + ) + assert manifest["status"] == "abandoned" + state = json.loads(tmp_path.joinpath(".nvflare/autofl/campaign_state.json").read_text(encoding="utf-8")) + assert state["next_action"] == "propose_candidate" + assert state["pending_candidate_manifest"] is None + + +def test_initialize_retries_an_unscored_baseline(tmp_path, monkeypatch): + runner = _load_runner() + job = tmp_path / "job.py" + job.write_text("print('job')\n", encoding="utf-8") + tmp_path.joinpath("client.py").write_text("ALGORITHM = 'baseline'\n", encoding="utf-8") + monkeypatch.setattr(runner, "import_job_config", lambda *args, **kwargs: deepcopy(_campaign_config())) + monkeypatch.setattr(runner, "job_help", lambda *args, **kwargs: "") + scores = iter([None, 0.5]) + + def fake_run(run_def, **kwargs): + return runner.RunRecord( + "baseline", + "baseline", + next(scores), + 1.0, + "none", + "baseline", + "python job.py", + "/tmp/baseline", + ) + + monkeypatch.setattr(runner, "run_job", fake_run) + command = ["initialize", str(job), "--no-prefer-synthetic"] + assert runner.main(command) == 1 + assert runner.main(command) == 0 + records = runner.load_results(tmp_path / 
"results.tsv") + assert [(record.status, record.score) for record in records] == [("baseline", 0.5)] + + +def test_record_literature_checkpoint_returns_to_agent_proposal(tmp_path, monkeypatch): + runner = _load_runner() + job, _, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + + assert ( + runner.main( + [ + "record", + str(job), + "--literature", + "--hypothesis", + "reviewed adaptive federated optimization", + ] + ) + == 0 + ) + records = runner.load_results(tmp_path / "results.tsv") + assert records[-1].status == "literature" + assert records[-1].diff_summary == "reviewed adaptive federated optimization" + state = json.loads(tmp_path.joinpath(".nvflare/autofl/campaign_state.json").read_text(encoding="utf-8")) + assert state["next_action"] == "propose_candidate" + + +def test_external_candidate_uses_standard_job_result_recording(tmp_path, monkeypatch): + runner = _load_runner() + job, client, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch, target_env="prod") + assert runner.main(["record", str(job), "--baseline", "--score", "0.5", "--job-id", "job-baseline"]) == 0 + assert runner.main(["prepare", str(job), "--name", "prod_algo", "--hypothesis", "production algorithm"]) == 0 + draft = tmp_path / ".nvflare" / "autofl" / "candidates" / "prod_algo" / "source" + draft.joinpath("client.py").write_text("ALGORITHM = 'production'\n", encoding="utf-8") + assert runner.main(["evaluate", str(job)]) == 0 + assert client.read_text(encoding="utf-8") == "ALGORITHM = 'production'\n" + + manifest_path = tmp_path / ".nvflare" / "autofl" / "candidates" / "prod_algo" / "candidate_manifest.json" + assert json.loads(manifest_path.read_text(encoding="utf-8"))["status"] == "ready_for_external_execution" + assert ( + runner.main( + [ + "record", + str(job), + "--manifest", + str(manifest_path), + "--score", + "0.8", + "--job-id", + "job-candidate", + ] + ) + == 0 + ) + manifest = json.loads(manifest_path.read_text(encoding="utf-8")) + assert manifest["status"] 
== "keep" + assert manifest["artifacts"]["job_id"] == "job-candidate" + + +def test_suggest_returns_fallbacks_without_executing_them(tmp_path, monkeypatch, capsys): + runner = _load_runner() + job, _, _ = _initialize_fake_campaign(runner, tmp_path, monkeypatch) + capsys.readouterr() + monkeypatch.setattr(runner, "job_help", lambda *args, **kwargs: "--lr") + monkeypatch.setattr( + runner, + "run_job", + lambda *args, **kwargs: pytest.fail("suggest must not execute a candidate"), + ) + + assert runner.main(["suggest", str(job), "--limit", "2"]) == 0 + payload = json.loads(capsys.readouterr().out) + assert len(payload["suggestions"]) == 2 + assert all(item["run_args"] for item in payload["suggestions"]) + + +def test_import_job_config_forwards_minimization_mode(tmp_path, monkeypatch): + runner = _load_runner() + job = tmp_path / "job.py" + job.write_text("print('job')\n", encoding="utf-8") + output = tmp_path / "autofl.yaml" + captured = {} + + def fake_run(command, cwd, timeout, log_path): + captured["command"] = command + runner.write_yaml(output, _campaign_config()) + return 0, "", 0.0 + + monkeypatch.setattr(runner, "run", fake_run) + args = runner.parse_args(["initialize", str(job), "--mode", "min"]) + runner.import_job_config(args, job, output, tmp_path / "import.log", 10) + + mode_index = captured["command"].index("--mode") + assert captured["command"][mode_index + 1] == "min" + + +def test_cli_lifecycle_runs_agent_code_candidate_end_to_end(tmp_path): + repo_root = Path(__file__).parents[3] + runner_path = repo_root / "skills" / "nvflare-autofl" / "scripts" / "run_job_campaign.py" + job = tmp_path / "job.py" + job.write_text( + """ +import argparse +import json +from pathlib import Path + +SCORE = 0.5 + +class FedAvgRecipe: + def __init__(self, **kwargs): + self.kwargs = kwargs + +class SimEnv: + def __init__(self, **kwargs): + self.kwargs = kwargs + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--name", default="run") + 
parser.add_argument("--num_rounds", type=int, default=1) + parser.add_argument("--n_clients", type=int, default=2) + args = parser.parse_args() + FedAvgRecipe(model=object(), num_rounds=args.num_rounds, min_clients=args.n_clients) + SimEnv(num_clients=args.n_clients) + result = Path(f"result-{args.name}") + result.mkdir(exist_ok=True) + result.joinpath("metrics_summary.json").write_text(json.dumps({"accuracy": SCORE})) + print(f"Result can be found in : {result.resolve()}") + +if __name__ == "__main__": + main() +""".lstrip(), + encoding="utf-8", + ) + env = os.environ.copy() + env["PYTHONPATH"] = os.pathsep.join(filter(None, [str(repo_root), env.get("PYTHONPATH")])) + + subprocess.run( + [ + sys.executable, + str(runner_path), + "initialize", + str(job), + "--metric", + "accuracy", + "--no-prefer-synthetic", + ], + cwd=tmp_path, + env=env, + check=True, + capture_output=True, + text=True, + ) + subprocess.run( + [ + sys.executable, + str(runner_path), + "prepare", + str(job), + "--name", + "code_candidate", + "--hypothesis", + "raise the reported score through a source change", + ], + cwd=tmp_path, + env=env, + check=True, + capture_output=True, + text=True, + ) + draft_job = tmp_path / ".nvflare" / "autofl" / "candidates" / "code_candidate" / "source" / "job.py" + draft_job.write_text(draft_job.read_text(encoding="utf-8").replace("SCORE = 0.5", "SCORE = 0.8"), encoding="utf-8") + subprocess.run( + [sys.executable, str(runner_path), "evaluate", str(job)], + cwd=tmp_path, + env=env, + check=True, + capture_output=True, + text=True, + ) + + runner = _load_runner() + records = runner.load_results(tmp_path / "results.tsv") + assert [(record.status, record.score) for record in records] == [("baseline", 0.5), ("keep", 0.8)] + assert "SCORE = 0.8" in job.read_text(encoding="utf-8") + manifest = json.loads( + tmp_path.joinpath(".nvflare/autofl/candidates/code_candidate/candidate_manifest.json").read_text( + encoding="utf-8" + ) + ) + assert manifest["status"] == "keep" + 
assert manifest["changed_files"] == ["job.py"]