Skip to content

NVFLARE agent skills: PyTorch/Lightning conversion, orient, diagnose — with lint, eval, packaging, and security hardening#4837

Open
chesterxgchen wants to merge 192 commits into
NVIDIA:mainfrom
chesterxgchen:milestone8-agent-skills
Open

NVFLARE agent skills: PyTorch/Lightning conversion, orient, diagnose — with lint, eval, packaging, and security hardening#4837
chesterxgchen wants to merge 192 commits into
NVIDIA:mainfrom
chesterxgchen:milestone8-agent-skills

Conversation

@chesterxgchen

@chesterxgchen chesterxgchen commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a set of NVFLARE agent skills — procedural playbooks a coding agent
(Codex/Claude) loads to convert ML training code to federated learning, orient
in an NVFLARE project, and diagnose failed jobs — together with the packaging,
install, lint, and evaluation infrastructure to author and ship them safely.

The skills follow the three-layer operating model: the skill owns procedure
(what to extract, which commands to run, the validation ladder, semantic stop
conditions, and reporting), the product API owns truth (recipe schema,
validation, execution, and export), and the agent host owns permissions and
execution security. The agent discovers product truth at runtime through
commands such as nvflare recipe show --format json; recipe list is reserved
for cases where framework and algorithm intent do not already identify the
recipe. This keeps skill prose from drifting into a "shadow API." Skills conform
to the agentskills.io specification and the company (NVCARPS) skills guideline.

Skills

  • nvflare-convert-pytorch — plain PyTorch → FLARE Client API conversion.
  • nvflare-convert-lightning — PyTorch Lightning → FLARE via
    flare.patch(trainer), Lightning-native evaluation, and DDP handling.
  • nvflare-diagnose-job — read-only diagnosis of failed/stalled/suspicious
    jobs by mapping bounded evidence to a failure-pattern catalog.
  • nvflare-orient — routes an open-ended or ambiguous NVFLARE request to
    the right workflow skill without editing files; explicit conversion requests
    select their converter directly after static framework inspection.
  • nvflare-shared — internal, non-triggered reference library
    (non-standard conversion guidance, validation ladder, dependency ordering,
    PyTorch-family recipe selection, model exchange, metrics/artifact reporting)
    plus the shared ModelAggregator template.

Conversion skills use explicit {"class_path", "args"} recipe model config (no
live model instances), paired training/evaluation templates, custom aggregation
via the recipe aggregator= extension point, enable_tensor_disk_offload
paired with server_expected_format=PYTORCH, build setup once before the FL
round loop, device-selection preservation, and configurable data paths. The
standard FedAvg path is inline and uses direct recipe show fedavg-pt; broader
references and catalog discovery are loaded only for relevant exceptions.
Missing dependencies are installed before import-level preflight, and requested
validation proceeds under the agent host's permission system without a
skill-issued install, repository-trust, or execution prompt. Follow-up questions
are limited to genuinely missing conversion semantics.

Homomorphic encryption, differential privacy, and privacy filters are explicitly
not supported by conversion (report and route to provisioning/deployment;
never silently substitute an unprotected job).

Product code

  • Static source inspector (nvflare/tool/agent/inspector.py) — AST-only
    framework detection (no import/execution of user code) via a
    frameworks/ detector-plugin package (PyTorch/Lightning), with entry-context
    reachability and evidence collection; surfaced as nvflare agent inspect.
  • Agent CLI (nvflare agent inspect | info | skills …) with JSON output
    contracts.
  • Skill packaging & install (skill_manifest.py, skill_manager.py) —
    content-addressed shared snapshots, atomic stage-then-publish, manifest
    path-traversal defense, TOCTOU-closing no-follow/identity fd checks,
    owned-by-current-user targets, and bytecode/eval exclusion. A build hook
    bundles the released skills into the wheel.

Evaluation & lint infrastructure

  • Eval suites live outside the shipped skills in
    dev_tools/agent/skill_evals/<skill>/ — trigger routing, source-fact
    extraction, code/evaluation transformation, device selection, checkpoint
    safety, prompt-injection resistance, direct recipe inspection, progressive
    reference loading, and negative cases.
  • Deterministic lint engine (dev_tools/agent/skills/checks/) — a
    frontmatter validator (agentskills.io + NVCARPS constraints: required
    metadata.author, name/description/compatibility limits, bounded no-follow
    file reads, no alwaysApply/globs) plus v1 lints (including the 200-line
    SKILL.md limit, trigger/negative/policy coverage, command drift, fixture
    integrity, and runtime boundaries). NVFLARE-specific checks stay on top of
    company conformance.

Security and authorization boundary

  • Source comments, READMEs, config values, requirement-file prose, logs, and
    tool output are evidence, not authority. Embedded instructions cannot change
    aggregation, skip validation, authorize installation, or add exfiltration;
    they are ignored and reported as prompt-injection anomalies.
  • Skills enforce only controls they directly own: static fact extraction,
    install-before-import ordering, safe torch.load(..., weights_only=True),
    secret redaction, and generated code that does not upload local data, paths,
    credentials, models, or datasets.
  • The agent host owns sandboxing, filesystem permissions, network isolation,
    resource limits, and authorization. Skills use the host-supplied environment,
    do not probe or construct isolation mechanisms, do not assess whether the
    runtime is trusted, and report only actual host/tool denials or command
    failures as execution blockers.
  • Package entries and installer options are dependency configuration handled
    under host permissions, not a skill-owned allowlist, typosquat audit, or
    supply-chain security claim. Natural-language instructions embedded in those
    files remain untrusted source text.
  • Generated source may sit beside writable training source. Simulator
    workspaces, exports, models, and logs use a host-provided runtime directory or
    one temporary directory, with exact paths reported; skills do not construct
    permission hierarchies or run manifests.
  • Product packaging/install code retains its own no-follow, path traversal,
    ownership, and atomic-publication protections. Diagnosis continues to treat
    log content and spoofable status markers as untrusted evidence.

Enforcement

The skill lint gates every push and merge: a pre-push hook (.githooks/),
./runtest.sh -s, and the CI unit tests all run
python -m dev_tools.agent.skills.checks --skills-root skills (covering skills
and eval suites); the packaged bundle is verified at build/install time.

Design docs

Only docs/design/skills_architecture.md ships; the operating-model,
evaluation, and check-report design docs are kept local (gitignored).

@chesterxgchen chesterxgchen force-pushed the milestone8-agent-skills branch from ed74191 to 2d32542 Compare June 28, 2026 10:55
@greptile-apps

greptile-apps Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds the nvflare-convert-lightning agent skill and refactors nvflare-convert-pytorch to align both conversion skills to an explicit operating model: skills are procedural guidance for coding agents, while product APIs own schema, validation, execution, and export. A new _shared/conversion-workflow.md replaces the lifecycle reference, the lint engine is decoupled from docs/design/ (replacing docs_root with evals_root), and the doctor command is scoped to conversion-only readiness checks.

  • New Lightning skill and shared framework detection: The inspector engine is refactored into a plugin architecture (frameworks/ package) with per-framework FrameworkDetector classes; the Lightning detector handles promotion-over-family disambiguation, import alias tracking, and flare.patch conversion-state detection. All four Lightning eval fixtures include the required empty-batch ValueError guards.
  • Lint engine decoupling: run_v1_lints no longer reads docs/design/*.md; evals are moved to dev_tools/agent/skill_evals/<skill>/; a new skill-runtime-boundary-lint prevents docs/design/ references and evaluator hooks from leaking into shipped skill content; agent-doc-crosslink-lint and skill-catalog-category-lint are removed.
  • Doctor scope reduction: startup_kits, poc, and online fields are removed from nvflare agent doctor output; the command is scoped to conversion/ML framework readiness only, with corresponding JSON contract tests added.

Confidence Score: 5/5

Safe to merge; the changes are well-scoped, well-tested, and introduce no regressions in the areas reviewed.

The inspector refactoring is sophisticated but thoroughly tested — every Lightning import alias form, the flare.patch conversion-state path, family promotion logic, entry-context reachability, and nemo-wrapper edge cases all have dedicated parameterized tests. The lint engine decoupling replaces docs-coupled checks with deterministic evals-root-based checks, and the new runtime-boundary lint adds surface coverage without removing any existing admission gates. The only observable behavioral changes are the removal of startup_kits/poc/online from doctor output (intentional scope reduction, covered by updated tests) and the consolidation of the DistributedDataParallel duplicate-append into a single tuple check (a minor correctness fix). Empty-batch guards previously flagged in review threads are present in all four Lightning eval fixtures.

No files require special attention. The nvflare/tool/agent/inspector.py rewrite is the highest-complexity change, but the _FamilyResolver adapter pattern and entry-context reachability helpers are each unit-tested in isolation.

Important Files Changed

Filename Overview
nvflare/tool/agent/inspector.py Major refactor: framework detection extracted to plugin architecture, entry-context reachability added for primary-framework selection, Lightning integration signals drive conversion state, evidence collection cap raised from 12 to 10000 to avoid skewed confidence scores. Logic is well-tested and design decisions are thoroughly documented.
nvflare/tool/agent/frameworks/lightning.py New Lightning framework detector: handles all import alias forms (pytorch_lightning, lightning, lightning.pytorch, module aliases), tracks flare.patch conversion signals, and implements promote_over_family with reachability-aware disambiguation logic.
nvflare/tool/agent/frameworks/registry.py New registry module: aggregates import roots and evidence weights from all detectors, implements framework_for_import/recommended_skill_for/resolve_primary_framework, and defines UTILITY_FRAMEWORKS to prevent numpy from hijacking primary-framework selection.
dev_tools/agent/skills/checks/lints.py Significant refactor: docs_root to evals_root, evals moved to dev_tools/agent/skill_evals/, new skill-runtime-boundary-lint replaces agent-doc-crosslink-lint and skill-catalog-category-lint, lint registry replaces the ad-hoc dict, SkillRecord gains cached_property validation and has_helper_tests. All changes are intentional and well-explained.
dev_tools/agent/skills/checks/frontmatter.py SKILL.md schema updated: NVFLARE custom fields now nested under metadata sub-map per agentskills.io spec; new validators for supported top-level fields, metadata mapping type, spec constraints, and SKILL.md length; skill_metadata() accessor added.
nvflare/tool/agent/doctor.py Scoped to conversion-only readiness: startup_kits, poc, and online fields removed; doctor_environment() now returns nvflare/commands/optional_dependencies/skills/findings/status only. Breaking change to JSON schema is intentional and covered by updated tests.
nvflare/tool/agent/agent_cli.py --online and startup-kit selection args removed from nvflare agent doctor; known flags list updated to match; hint message updated to conversion context.
tests/unit_test/tool/agent/agent_inspector_test.py Large new test suite for Lightning detection, flare.patch conversion state classification (all import alias forms), family disambiguation (pytorch vs lightning), entry-context reachability, and nemo-wrapper edge cases.
tests/unit_test/tool/agent/agent_json_contract_test.py New golden-field contract tests for recipe list/show and agent inspect JSON shapes; fails CI if a skill-consumed field is renamed or dropped.
dev_tools/agent/skill_evals/nvflare-convert-lightning/files/gpu-device-lightning/model.py New GPU-device Lightning eval fixture with empty-batch ValueError guard in training_step. No validation_step (intentional: fixture targets device-placement testing only).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[inspect_path] --> B[_inspect_dir / _inspect_file]
    B --> C[_PythonInspector AST walk per file]
    C --> D[visit_Import / visit_ImportFrom / visit_ClassDef / visit_Call]
    D --> E[FrameworkDetector plugins: PyTorchDetector, LightningDetector]
    E --> F[DetectContext: emit evidence, record flare_call, record integration_signal]
    F --> G[InspectState: framework_evidence, flare_calls, integration_signals, class_body_ranges]
    G --> H[_rank_frameworks count-based confidence]
    H --> I[_detect_primary_framework]
    I --> J{Entry context reachability check}
    J -->|Active evidence tied to entry point| K[Non-utility framework with active evidence wins]
    J -->|No entry context| L[resolve_primary_framework family promotion]
    L --> M{Lightning + PyTorch both present?}
    M -->|Yes| N[promote_over_family LightningDetector decides]
    N -->|Lightning reachable from entry| O[pytorch_lightning]
    N -->|PyTorch reachable from entry| P[pytorch]
    K --> Q[_conversion_state]
    O --> Q
    P --> Q
    Q --> R{_has_conversion_integration? integration_signals AND flare_imports}
    R -->|Yes - flare.patch detected| S[client_api_converted]
    R -->|No| T{flare.receive+send or FLModel in calls?}
    T -->|Yes| S
    T -->|No| U{any flare signals?}
    U -->|Yes| V[partial_client_api]
    U -->|No| W[not_converted - recommend skill]
    W --> X[nvflare-convert-pytorch OR nvflare-convert-lightning]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[inspect_path] --> B[_inspect_dir / _inspect_file]
    B --> C[_PythonInspector AST walk per file]
    C --> D[visit_Import / visit_ImportFrom / visit_ClassDef / visit_Call]
    D --> E[FrameworkDetector plugins: PyTorchDetector, LightningDetector]
    E --> F[DetectContext: emit evidence, record flare_call, record integration_signal]
    F --> G[InspectState: framework_evidence, flare_calls, integration_signals, class_body_ranges]
    G --> H[_rank_frameworks count-based confidence]
    H --> I[_detect_primary_framework]
    I --> J{Entry context reachability check}
    J -->|Active evidence tied to entry point| K[Non-utility framework with active evidence wins]
    J -->|No entry context| L[resolve_primary_framework family promotion]
    L --> M{Lightning + PyTorch both present?}
    M -->|Yes| N[promote_over_family LightningDetector decides]
    N -->|Lightning reachable from entry| O[pytorch_lightning]
    N -->|PyTorch reachable from entry| P[pytorch]
    K --> Q[_conversion_state]
    O --> Q
    P --> Q
    Q --> R{_has_conversion_integration? integration_signals AND flare_imports}
    R -->|Yes - flare.patch detected| S[client_api_converted]
    R -->|No| T{flare.receive+send or FLModel in calls?}
    T -->|Yes| S
    T -->|No| U{any flare signals?}
    U -->|Yes| V[partial_client_api]
    U -->|No| W[not_converted - recommend skill]
    W --> X[nvflare-convert-pytorch OR nvflare-convert-lightning]
Loading

Reviews (56): Last reviewed commit: "Add empty-batch guard to gpu-device-ligh..." | Re-trigger Greptile

Comment thread skills/nvflare-convert-lightning/evals/files/imbalanced-lightning/model.py Outdated
Comment thread nvflare/tool/agent/inspector.py Outdated
Comment thread nvflare/tool/agent/inspector.py
@codecov-commenter

codecov-commenter commented Jun 28, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 85.71429% with 196 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.34%. Comparing base (3c0458d) to head (f05de35).

Files with missing lines Patch % Lines
nvflare/tool/agent/skill_manager.py 77.97% 122 Missing ⚠️
nvflare/tool/agent/skill_manifest.py 81.31% 51 Missing ⚠️
nvflare/tool/agent/frameworks/pytorch.py 91.48% 8 Missing ⚠️
nvflare/tool/agent/inspector.py 97.23% 7 Missing ⚠️
nvflare/tool/agent/frameworks/lightning.py 96.42% 4 Missing ⚠️
nvflare/tool/agent/frameworks/base.py 89.65% 3 Missing ⚠️
nvflare/tool/agent/frameworks/registry.py 98.14% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4837      +/-   ##
==========================================
+ Coverage   56.98%   57.34%   +0.36%     
==========================================
  Files         969      973       +4     
  Lines       92310    93350    +1040     
==========================================
+ Hits        52603    53536     +933     
- Misses      39707    39814     +107     
Flag Coverage Δ
unit-tests 57.34% <85.71%> (+0.36%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread skills/nvflare-convert-lightning/evals/evals.json Outdated
Comment thread skills/nvflare-convert-lightning/evals/evals.json Outdated
Comment thread skills/nvflare-convert-lightning/references/lightning-ddp-and-tracking.md Outdated
@holgerroth

Copy link
Copy Markdown
Collaborator

@chesterxgchen One architectural suggestion before this evaluation surface grows further: move test-only evaluation artifacts out of the distributable skill directories while retaining per-skill namespacing.

A possible layout:

skills/
  nvflare-convert-lightning/
    SKILL.md
    references/

dev_tools/agent/skill_evals/
  nvflare-convert-lightning/
    evals.json
    files/

tests/unit_test/tool/agent_skill_evals/
  ...

The main reasons are:

  • evals.json contains expected outputs, assertions, mandatory behavior, and prohibited behavior. That is grading-oracle data rather than runtime skill guidance, and including it in dev/nightly-installed skills can expose benchmark answers.
  • The current colocated layout requires include_analysis_files, release exclusions, and mode-dependent skill hashes solely to keep eval content out of release bundles.
  • Cross-skill negative-routing cases and the evaluation runner naturally belong to a central suite.
  • Runtime skill directories would then contain only the instructions and references that an installed agent should load.

Small examples that are genuinely useful during conversion can remain under references/; test-only fixtures and grading metadata would live in the evaluation tree. The runner can discover suites by skill name, preserving clear ownership without shipping evaluation artifacts.

@chesterxgchen chesterxgchen changed the title Add NVFLARE Lightning conversion agent skill Add NVFLARE Lightning conversion skill and align conversion skills to the operating model Jul 2, 2026
@chesterxgchen chesterxgchen force-pushed the milestone8-agent-skills branch from 1b85f8e to 639af87 Compare July 2, 2026 13:13
chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Jul 2, 2026
Resolve the outstanding reviewer threads:

- validation_step empty-batch guard (greptile P1): training_step already raised
  ValueError on an empty batch, but validation_step still called F.cross_entropy
  unguarded (nan on an empty per-site partition). Add the same guard to
  validation_step in the hello-lightning and vocab-lightning eval fixtures.
- FedEval contract (Holger P1): the lightning-eval-only eval promised
  trainer.validate/test "in the round loop" with "per-round" metrics. FedEvalRecipe
  requires eval_ckpt and sends one validate task, so rewrite the expected output
  and assertions to require the supplied eval_ckpt, a single trainer.validate
  (no fit, no test), and one-shot per-site metrics.
- Shared token IDs (Holger P1): equal vocab_size guarantees tensor shape, not
  semantic alignment. The lightning-data-derived-required-arg eval now requires
  one identical tokenizer/token-ID mapping across sites (validated, e.g. via a
  stable hash), with a fixed-tokenizer/approval-blocker fallback.
- DDP validation metrics (Holger P1): under launch_external_process=True the
  ScriptRunner-built PTClientAPILauncherExecutor defaults train_with_evaluation
  =False and exposes no switch, so pre-fit trainer.validate metrics never reach
  the server. Document this recipe limitation in lightning-ddp-and-tracking.md
  instead of promising per-round server-side metrics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Jul 3, 2026
Resolve the outstanding reviewer threads:

- validation_step empty-batch guard (greptile P1): training_step already raised
  ValueError on an empty batch, but validation_step still called F.cross_entropy
  unguarded (nan on an empty per-site partition). Add the same guard to
  validation_step in the hello-lightning and vocab-lightning eval fixtures.
- FedEval contract (Holger P1): the lightning-eval-only eval promised
  trainer.validate/test "in the round loop" with "per-round" metrics. FedEvalRecipe
  requires eval_ckpt and sends one validate task, so rewrite the expected output
  and assertions to require the supplied eval_ckpt, a single trainer.validate
  (no fit, no test), and one-shot per-site metrics.
- Shared token IDs (Holger P1): equal vocab_size guarantees tensor shape, not
  semantic alignment. The lightning-data-derived-required-arg eval now requires
  one identical tokenizer/token-ID mapping across sites (validated, e.g. via a
  stable hash), with a fixed-tokenizer/approval-blocker fallback.
- DDP validation metrics (Holger P1): under launch_external_process=True the
  ScriptRunner-built PTClientAPILauncherExecutor defaults train_with_evaluation
  =False and exposes no switch, so pre-fit trainer.validate metrics never reach
  the server. Document this recipe limitation in lightning-ddp-and-tracking.md
  instead of promising per-round server-side metrics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@chesterxgchen chesterxgchen force-pushed the milestone8-agent-skills branch from fe35795 to e8f8010 Compare July 3, 2026 00:40
chesterxgchen added a commit to chesterxgchen/NVFlare that referenced this pull request Jul 3, 2026
Resolve the outstanding reviewer threads:

- validation_step empty-batch guard (greptile P1): training_step already raised
  ValueError on an empty batch, but validation_step still called F.cross_entropy
  unguarded (nan on an empty per-site partition). Add the same guard to
  validation_step in the hello-lightning and vocab-lightning eval fixtures.
- FedEval contract (Holger P1): the lightning-eval-only eval promised
  trainer.validate/test "in the round loop" with "per-round" metrics. FedEvalRecipe
  requires eval_ckpt and sends one validate task, so rewrite the expected output
  and assertions to require the supplied eval_ckpt, a single trainer.validate
  (no fit, no test), and one-shot per-site metrics.
- Shared token IDs (Holger P1): equal vocab_size guarantees tensor shape, not
  semantic alignment. The lightning-data-derived-required-arg eval now requires
  one identical tokenizer/token-ID mapping across sites (validated, e.g. via a
  stable hash), with a fixed-tokenizer/approval-blocker fallback.
- DDP validation metrics (Holger P1): under launch_external_process=True the
  ScriptRunner-built PTClientAPILauncherExecutor defaults train_with_evaluation
  =False and exposes no switch, so pre-fit trainer.validate metrics never reach
  the server. Document this recipe limitation in lightning-ddp-and-tracking.md
  instead of promising per-round server-side metrics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@chesterxgchen chesterxgchen force-pushed the milestone8-agent-skills branch from 8479b1f to f3100cd Compare July 3, 2026 04:07
@chesterxgchen

Copy link
Copy Markdown
Collaborator Author

@holgerroth Adopted this layout — thanks, it's a clear improvement. Test-only evaluation artifacts now live outside the distributable skill directories, with per-skill namespacing preserved:

  • skills/<skill>/ holds only runtime content (SKILL.md, references/, assets/); no evals/ remains under skills/.
  • Eval suites moved to dev_tools/agent/skill_evals/<skill>/{evals.json, files/}; a central runner discovers suites by skill name, keeping ownership clear.
  • Structure/runner tests are centralized under tests/unit_test/tool/agent_skill_checks/.

On the reasons you gave:

  • Grading-oracle data (expected outputs, assertions, mandatory/prohibited behavior) is no longer shipped in dev/nightly-installed skills.
  • The colocated-layout machinery that existed solely to keep eval content out of release bundles — include_analysis_files, mode-dependent skill hashes, and the eval-only release exclusions — is removed. As defense-in-depth, SKILL_PACKAGING_EXCLUDE_NAMES still fail-closes on a stray skills/<skill>/evals/ so eval content can never re-enter a bundle even if one is reintroduced by mistake.
  • Cross-skill negative-routing cases and the runner now live in the central eval tree.
  • Runtime skill directories contain only the instructions/references an installed agent should load; small conversion-time examples stay under references//assets/, only test-only fixtures + grading metadata moved to the eval tree.

Landed in 78d0cfb and follow-ups.

chesterxgchen and others added 9 commits July 3, 2026 12:28
The converted-state classifier only matched a literal flare.patch call or a
directly imported patch symbol, and additionally required a recognized
Lightning Trainer constructor. This misclassified valid conversions:

- aliased modules (import nvflare.client.lightning as nfl; nfl.patch(...))
- fully-qualified calls (nvflare.client.lightning.patch(...))
- trainers built via wrappers (e.g. nl.Trainer from nemo.lightning)

Track aliases of the nvflare.client.lightning module and recognize all
patch call forms, and treat any genuine patch(trainer) call plus an nvflare
import as the definitive conversion signal without gating on the trainer
constructor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
from nvflare.client import lightning as flare; flare.patch(trainer) is a
valid conversion form that was still classified as partial_client_api.
Record the aliased submodule as a Lightning patch module so flare.patch is
recognized as the conversion signal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements Milestone 8 Stage 4 (Lightning skill content): a PyTorch-family
conversion skill that integrates through flare.patch(trainer) rather than
manual FLModel exchange, reuses the PyTorch recipe family, and covers DDP and
experiment tracking. Includes SKILL.md, four references, evals, an unconverted
Lightning fixture, and BENCHMARK.md.

Enable the inspector framework->skill mapping for pytorch_lightning now that the
skill is implemented, packaged, and covered by admission tests, so agent inspect
recommends nvflare-convert-lightning for Lightning projects.

Lightning eval behavior IDs use the canonical contract IDs pinned in
agent_skill_evaluation.md (use-lightning-patched-trainer,
no-manual-flmodel-exchange-lightning, no-input-model-to-trainer) so benchmark
consumers see the Lightning-specific gates.

Update inspector/CLI tests to expect the Lightning recommendation and add the
skill to the seed packaging tests with a payload and release-filter assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two issues from review of the Lightning routing enablement:

- _prefer_lightning_over_pytorch promoted Lightning over PyTorch whenever both
  frameworks were present. Since Lightning always imports torch, a plain PyTorch
  training entry point with an incidental Lightning import elsewhere in the
  workspace was misrouted to nvflare-convert-lightning. Promote Lightning only
  when there is active Lightning use (a LightningModule/DataModule subclass or a
  Trainer call) and that evidence is at least as strong as the PyTorch evidence;
  otherwise keep PyTorch as the lead framework. Adds two mixed-workspace tests.

- lightning-ddp-and-tracking.md referenced the non-exported
  nvflare.app_opt.lightning.loggers.ClientLogger import path. Point agents to the
  canonical flare.logger() shortcut and the real
  nvflare.app_opt.lightning.loggers.client_logger.ClientLogger class.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous count-based gate still misrouted normal Lightning scripts: a
typical Lightning entry point imports several torch symbols (torch, nn,
DataLoader), so PyTorch import evidence outnumbers Lightning symbols and the
Lightning skill was not recommended.

Active Lightning use -- a LightningModule/DataModule subclass or a Trainer call
-- is definitive and now outweighs any number of plain torch imports. An
incidental Lightning import with no active use still keeps PyTorch as the lead
framework. Replaces the over-strict count test with a regression test for a
Lightning script carrying multiple torch imports.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…evidence

Promoting Lightning whenever any active Lightning evidence appeared anywhere in
a scanned directory misrouted mixed workspaces: a plain PyTorch train.py with a
LightningModule only in a secondary helper file was routed to Lightning.

Prefer Lightning only when active Lightning use (a LightningModule/DataModule
subclass or a Trainer call) is in the inspected file or a likely training entry
point, with a fallback to promote when Lightning evidence is strictly stronger
overall. A single inspected file always counts as its own entry point so normal
Lightning scripts (with many torch imports) still route to Lightning.

Adds a regression test for a PyTorch entry point plus active Lightning code in a
non-entry helper file; keeps the incidental-import and many-torch-imports tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
chesterxgchen and others added 3 commits July 3, 2026 12:29
…ecks on top

Company conformance is the floor; the NVFLARE-specific checks stay unchanged
on top of it (stronger, not weaker).

Frontmatter (all 5 skills): add the required metadata.author (name + NVIDIA
email) and the recommended discoverability fields (tags, languages, frameworks,
domain, team).

Lint engine (frontmatter validator):
- require metadata.author alongside the existing min_flare_version /
  blast_radius / category requirements;
- accept the optional top-level title field NVCARPS allows;
- enforce the NVCARPS constraints as new checks: name charset/length (1-64,
  lowercase alphanumeric + hyphens, no leading/trailing/consecutive hyphens;
  skill-frontmatter-name-invalid), description <= 1024 chars, compatibility
  <= 500 chars, and SKILL.md main file <= 500 lines (skill-md-too-long).
  Prohibited fields (alwaysApply, globs) were already rejected by the
  top-level allowlist check.

Tests: fixtures across both suites gain the author field; new unit tests cover
each added constraint plus the author requirement and title acceptance.

Not adopted (deliberate): the guideline's section-name template and
markdown-link reference style are recommendations, not validation gates; the
skills keep their eval-verified structure and backticked reference paths,
which cover the same substance (use-when triggers, stepwise workflow,
progressive detail in references).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
docs/design/agent_skill_checks_report.md is a local-only human reference (the
check-inventory report), kept untracked like the other local design docs so a
stray git add -A cannot re-track it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review finding: conversion-workflow.md already declared HE, differential
privacy, and Filter-based protection unsupported by conversion, but recipe
selection still said 'any other algorithm or privacy requirement' should pick
a matching catalog recipe, and both SKILL.md scope statements named only HE.
For a request like 'convert with differential privacy' the agent could read
recipe selection as 'find a privacy recipe' before hitting the shared boundary.

- pytorch-family-recipe-selection.md: selection rule now covers 'any other
  algorithm requirement' only, and a new bullet extends the HE rule to DP,
  privacy filters, and other Filter-based protection: report unsupported,
  route to provisioning/deployment, ask or fail closed, never generate an
  unprotected job with only a disclaimer.
- Both conversion SKILL.md scope statements generalize from HE to
  privacy-protection requests (HE/encrypted aggregation, DP, privacy filters).

validation-evidence.md redundancy left as-is per the review (not blocking,
not contradictory).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@chesterxgchen chesterxgchen force-pushed the milestone8-agent-skills branch from d37bc5d to bcb74c1 Compare July 3, 2026 19:29
chesterxgchen and others added 10 commits July 3, 2026 15:40
Re-land the non-benchmark file-safety hardening from the reverted commit
(39236ef) on top of the NVCARPS constraints already in HEAD:

- parse_skill_frontmatter and _validate_skill_md_length read through
  _read_regular_skill_file: one no-follow (O_NOFOLLOW where available),
  nonblocking (O_NONBLOCK), size-bounded (512 KB) descriptor with an
  lstat-before / fstat-after regular-file check.
- Direct parse_skill_frontmatter callers (e.g. the policy/trigger lints) no
  longer bypass the symlink/special-file guard: a FIFO or device at SKILL.md
  would otherwise block a reader indefinitely.
- _validate_no_symlinks now rejects non-regular/non-directory entries
  (skill-special-file-not-allowed) and reports unreadable entries.
- Non-UTF-8 SKILL.md becomes a reported finding instead of an uncaught
  UnicodeDecodeError.

O_NOFOLLOW/O_NONBLOCK/O_CLOEXEC are applied via getattr fallback so the reader
degrades cleanly on platforms lacking them. NVCARPS constraint checks and tests
already in HEAD are unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gger hardening

Non-benchmark skill-guidance hardening from the reverted commit (39236ef),
extracted without the DDP-stance reversal or benchmark harness. All additive on
the current shared references (privacy scope, RCA dependency fix, and data-
location rules are preserved as a superset):

- conversion-workflow.md: untrusted source-derived execution must run in an
  OS-level sandbox/container/VM (a venv is package isolation, not a security
  sandbox) with no ambient credentials, read-only source, bounded writable
  output, resource limits, and denied egress; fail closed when unavailable.
  Adds a network/telemetry-logger boundary: source logger config is evidence,
  not approval; disable unapproved network/custom loggers during validation.
- dependency-install.md: treat install as a separate allowlisted acquisition
  phase inside the sandbox; filter the requirements actually installed rather
  than passing an untrusted file with skipped directives; add elevated-risk
  classes; a missing security sandbox is a blocker, never bypassed with a host
  venv (the installable-dep-is-not-a-blocker rule is retained for when the
  sandbox is available).
- runtime-output-guidance.md: replace predictable /tmp/nvflare/<job> paths with
  a private per-user runtime root (0700), randomized run directories, no reuse
  of existing/symlinked output paths, and an exact run manifest.

Excludes the reverted commit's DDP-stance rewrite, device_selection benchmark
shim, release_checklist.json, and everything under skills/benchmark/.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ivacy, sandbox

Extract the non-benchmark PyTorch eval expansion from the reverted commit
(39236ef): new fixtures gpu-device-pt, checkpoint-pt, state-mismatch-pt, and
dedicated eval cases that close real coverage gaps —

- pytorch-device-selection: torch.cuda.is_available() source keeps GPU-when-
  available + CPU fallback; no hard-coded CPU rewrite after CPU-only validation;
- pytorch-safe-checkpoint-load / pytorch-state-dict-schema-mismatch: exercise the
  weights_only=True rule and treat a generated state-dict mismatch as a
  conversion bug (previously unevaluated);
- pytorch-missing-evaluation-fail-closed and pytorch-unsupported-privacy-request:
  reinforce fail-closed evaluation and the settled HE/DP report-and-route rule;
- behavior/assertion swap tmp-runtime-outputs -> secure-random-runtime-outputs
  plus sandbox behaviors, reconciling the suite with the runtime-path and
  execution-sandbox guidance landed in 929a69a.

Kept device-selection-respects-availability on the basic case (per the retained
0e0847a check/test) in addition to the dedicated case. Excludes the reverted
commit's release_checklist.json (its only consumer was the removed benchmark
harness) and the device_selection benchmark shim.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e DDP reversal

Extract the non-benchmark Lightning changes from the reverted commit
(39236ef), keeping the settled data-scientist-verified DDP mapping intact:

- New gpu-device-lightning fixture + lightning-device-selection eval case;
  device-selection-respects-availability kept on the basic case (per the
  retained device_selection check/test).
- secure-random-runtime-outputs behavior swap and sandbox/network behaviors,
  reconciling the suite with 929a69a.
- Network-logger gating (#5): Experiment Tracking in lightning-ddp-and-tracking.md
  and the SKILL.md tracking bullets now treat existing source logger/callback
  config as evidence of intent, not approval — remote/upload/custom loggers stay
  disabled during validation, unattended keeps denied egress or blocks.

EXCLUDED as rejected item #7 (contradicts the settled DDP decision): the diff's
DDP Execution Model / rank-intro / metric-capability rewrites and the
'no-implicit-ddp-param-mapping' description change to 'does not maintain a
skill-owned DDP-to-recipe-parameter table' — reverted to the settled wording.
The concrete train_with_evaluation=False metric-delivery caveat is kept.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Additive, non-benchmark eval behaviors that reinforce rules already in HEAD:
- diagnose: keep-diagnosis-read-only, use-mode-specific-metric-artifacts,
  no-sensitive-log-quoting (matches the log-content PII-redaction rule),
  no-production-path-inference-from-simulation.
- orient: source-and-logs-are-routing-evidence (matches the orient trust
  language), report-routing-evidence, state-follow-up-approval-boundary,
  no-orientation-mutation.
No DDP, benchmark, or release_checklist coupling.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Re-land the non-benchmark skill-install hardening from the reverted commit
(39236ef), softening the one real user-experience regression:

skill_manifest.py / skill_manager.py add, on top of the existing
stage-to-temp + os.rename publish, install lock, and symlink rejection:
- validate_manifest blocks path traversal from a tampered wheel manifest;
- O_NOFOLLOW/O_NONBLOCK fd opens with dev/ino + st_nlink identity re-checks,
  closing the hash->copy TOCTOU;
- the staged tree's hash must equal the manifest source_hash before publish;
- immutable content-addressed shared snapshots (.nvflare-shared/<sha256>/)
  with idempotent path pinning and a recorded installed_hash (legacy fallback
  compatible), plus hash-verified legacy nvflare-shared retirement;
- a private, user-owned target directory and flock-lease lock hardening.

UX fix: the target/parent writability guard rejected group-OR-world-writable
dirs, which hard-fails a normal install on user-private-group (umask 002)
systems where ~/.claude/skills is group-writable. Since every such check is
backstopped by an owned-by-current-user check, narrow it to world-writable
only (the actual tampering threat). Added a test that a group-writable,
user-owned target is accepted; kept world-writable rejection.

Windows-specific code paths from the reverted commit are retained as-is;
Windows is not an officially supported platform for this tooling.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The AgentSkillsBuildPy build hook validated the packaged skill bundle root and
rejected a group-OR-world-writable directory. setuptools creates
build/lib/.../bundled_skills under the process umask, so on user-private-group
(umask 002) systems that directory is group-writable and the wheel/sdist build
failed with 'skill bundle root must not be group/world writable'.

This is the build-time counterpart of the install-time check softened in
0527b07. The bundle root is already required to be owned by the current user,
so narrow the writability guard to world-writable only (the real tampering
threat). Verified: setup.py build_py under umask 002 now builds the bundle into
a group-writable build-lib and exits 0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@chesterxgchen

Copy link
Copy Markdown
Collaborator Author

@greptile-apps

The external-data-lightning fixture (added after the earlier empty-batch fixes)
called F.cross_entropy in training_step/validation_step without the
labels.numel() == 0 guard its sibling fixtures (hello-lightning, vocab-lightning)
carry. In a federated setting a site can receive an empty data partition, so
cross_entropy would silently return nan (0/0) instead of surfacing the
data-preparation problem. Add the same explicit ValueError guard to both hooks
(team rule 783565ac). Resolves the greptile P1 review threads.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Same fix as the sibling Lightning fixtures (hello-lightning, vocab-lightning,
external-data-lightning): guard training_step against an empty per-site
partition with an explicit ValueError instead of letting F.cross_entropy return
nan silently (team rule 783565ac). Resolves the greptile P1 thread. All four
Lightning fixtures now carry the guard.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@chesterxgchen

Copy link
Copy Markdown
Collaborator Author

@greptile-apps

@chesterxgchen chesterxgchen changed the title Add NVFLARE Lightning conversion skill and align conversion skills to the operating model NVFLARE agent skills: PyTorch/Lightning conversion, orient, diagnose — with lint, eval, packaging, and security hardening Jul 4, 2026
doctor was a read-only readiness snapshot of largely derivable/convenience
checks (nvflare import+version, the command surface, optional ML-dependency
availability, and the installed skill bundle). Nothing depends on it: inspect
only suggested it as a next command, and orient used it optionally. The real
routing signal (framework detection, FLARE integration, conversion state,
recommended skill) comes from 'nvflare agent inspect', and the conversion flow
installs missing dependencies on its own.

Remove the command and all references: agent_cli.py (constant, parser,
dispatch, handler), doctor.py + its test (deleted), command_registry.py entry,
inspector.py _recommended_next_commands prefix, lints.py _KNOWN_AGENT_COMMANDS
and flag table, nvflare-orient SKILL.md / orientation-routing.md (fold
readiness into 'nvflare agent inspect'), nvflare-orient evals.json, and the
agent_cli / agent_inspector tests.

Verified: 'nvflare agent --help' lists info/inspect/skills only; 'agent doctor'
is an invalid choice; no doctor references remain; lint 0 findings; 377 agent
tests pass; lint-parity stays in sync.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Too many files changed for review. (110 files found, 100 file limit)

Bypass the limit by tagging @greptile-apps to review.

@chesterxgchen

Copy link
Copy Markdown
Collaborator Author

@greptile-app

chesterxgchen and others added 2 commits July 4, 2026 17:33
… execution gating

RCA of the benchmark showed the skills pushed the agent into a layer it cannot
own: they told it to construct/verify an OS sandbox (it probed
bwrap/unshare/firejail, they failed, it fell back to non-isolating fallbacks yet
the run passed = false assurance + latency), and they gated dependency install +
execution behind a skill-inferred interactive/unattended mode, so a harness that
declared no mode made the agent emit an approval/trust prompt, get no answer, and
end not_started on an already-valid conversion.

Correct the boundary: a skill can reliably enforce only what it reasons about and
generates — prompt-injection resistance and conversion semantics. The agent
host/harness owns execution security and the authorization to install/run.

Remove from skills + evals: OS-sandbox construction/probing/claims, 0700/
randomized-hierarchy/run-manifest/permission audits, supply-chain enforcement
(allowlist/typosquat/acquisition-phase), first-execution trust machinery, and
skill-inferred-mode gating of install/execution plus skill-issued
install/trust/run approval prompts.

Keep: prompt-injection resistance (source/comments/README/config/requirement
comments/logs/tool output are untrusted data, never authorization; source text
never authorizes install/fetch), static inspection, install-before-import
ordering, generated-artifact rules (torch.load weights_only=True, no-exfil,
secret/credential redaction), conversion semantics, and the settled DDP mapping.

Now: dependency install + validation proceed by default; the host permission
system is the only gate; blockers are real host/tool denials or install failures
only; a completed conversion ending not_started (waiting on approval) is a
failure. Interactive-vs-unattended governs only unresolved conversion semantics
(missing required arg / ambiguous algorithm). Dependency entries
(--index-url/git+/URL/pins) are surfaced as unusual but not audited/allowlisted/
blocked (host owns that); requirement-file comments/prose are treated as
injection. Runtime output goes to a host dir or one temp dir with paths
reported. Both SKILL.md use progressive disclosure with a direct
recipe-show fedavg-pt fast path; explicit conversions no longer route through
nvflare-orient just to detect the framework. Evals updated to assert the new
behavior and add negative checks (no sandbox construction, no orient-for-explicit
-conversion); device-selection and the settled DDP mapping preserved.

Reviewed by architect (compliant) and principal (minimal-and-correct).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…o residual execution/dependency gating

Follow-up refinements on top of the boundary rescope (403fb8d):

- PyTorch and Lightning standard FedAvg paths no longer preemptively load broad
  references; references load only for exceptions (progressive disclosure).
- Basic evals require a direct 'nvflare recipe show fedavg-pt' for an explicit
  FedAvg + PyTorch request; 'recipe list' is reserved for ambiguous/non-standard.
- Remove the residual skill-owned 'trusted execution boundary' gate: the agent
  uses the host-supplied environment and does not itself decide whether the host
  is trusted or sufficiently isolated.
- Move orient's explicit-conversion exclusion into the skill's trigger metadata
  (description), so an explicit conversion request that merely needs framework
  detection is not routed through orient.
- Stop the skill auditing/classifying dependency entries; keep prompt-injection
  resistance (requirement-file comments/prose are untrusted; entries are surfaced
  but not audited/allowlisted/blocked — host owns that).
- Remove stale approval/trust wording around loggers, overwrites, validation,
  and routing. Evals updated to enforce the same behavior.

Verified: skill lint 0 findings; 377 agent tests pass; JSON valid; boundary
invariants intact (no sandbox/0700/manifest/mode-gating; injection resistance,
weights_only, and the settled DDP mapping preserved); no dangling references.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@chesterxgchen

Copy link
Copy Markdown
Collaborator Author

@greptile-app

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants