NVFLARE agent skills: PyTorch/Lightning conversion, orient, diagnose — with lint, eval, packaging, and security hardening by chesterxgchen · Pull Request #4837 · NVIDIA/NVFlare

chesterxgchen · 2026-06-28T10:53:33Z

Summary

Adds a set of NVFLARE agent skills — procedural playbooks a coding agent
(Codex/Claude) loads to convert ML training code to federated learning, orient
in an NVFLARE project, and diagnose failed jobs — together with the packaging,
install, lint, and evaluation infrastructure to author and ship them safely.

The skills follow the three-layer operating model: the skill owns procedure
(what to extract, which commands to run, the validation ladder, semantic stop
conditions, and reporting), the product API owns truth (recipe schema,
validation, execution, and export), and the agent host owns permissions and
execution security. The agent discovers product truth at runtime through
commands such as nvflare recipe show --format json; recipe list is reserved
for cases where framework and algorithm intent do not already identify the
recipe. This keeps skill prose from drifting into a "shadow API." Skills conform
to the agentskills.io specification and the company (NVCARPS) skills guideline.

Skills

nvflare-convert-pytorch — plain PyTorch → FLARE Client API conversion.
nvflare-convert-lightning — PyTorch Lightning → FLARE via
flare.patch(trainer), Lightning-native evaluation, and DDP handling.
nvflare-diagnose-job — read-only diagnosis of failed/stalled/suspicious
jobs by mapping bounded evidence to a failure-pattern catalog.
nvflare-orient — routes an open-ended or ambiguous NVFLARE request to
the right workflow skill without editing files; explicit conversion requests
select their converter directly after static framework inspection.
nvflare-shared — internal, non-triggered reference library
(non-standard conversion guidance, validation ladder, dependency ordering,
PyTorch-family recipe selection, model exchange, metrics/artifact reporting)
plus the shared ModelAggregator template.

Conversion skills use explicit {"class_path", "args"} recipe model config (no
live model instances), paired training/evaluation templates, custom aggregation
via the recipe aggregator= extension point, enable_tensor_disk_offload
paired with server_expected_format=PYTORCH, build setup once before the FL
round loop, device-selection preservation, and configurable data paths. The
standard FedAvg path is inline and uses direct recipe show fedavg-pt; broader
references and catalog discovery are loaded only for relevant exceptions.
Missing dependencies are installed before import-level preflight, and requested
validation proceeds under the agent host's permission system without a
skill-issued install, repository-trust, or execution prompt. Follow-up questions
are limited to genuinely missing conversion semantics.

Homomorphic encryption, differential privacy, and privacy filters are explicitly
not supported by conversion (report and route to provisioning/deployment;
never silently substitute an unprotected job).

Product code

Static source inspector (nvflare/tool/agent/inspector.py) — AST-only
framework detection (no import/execution of user code) via a
frameworks/ detector-plugin package (PyTorch/Lightning), with entry-context
reachability and evidence collection; surfaced as nvflare agent inspect.
Agent CLI (nvflare agent inspect | info | skills …) with JSON output
contracts.
Skill packaging & install (skill_manifest.py, skill_manager.py) —
content-addressed shared snapshots, atomic stage-then-publish, manifest
path-traversal defense, TOCTOU-closing no-follow/identity fd checks,
owned-by-current-user targets, and bytecode/eval exclusion. A build hook
bundles the released skills into the wheel.

Evaluation & lint infrastructure

Eval suites live outside the shipped skills in
dev_tools/agent/skill_evals/<skill>/ — trigger routing, source-fact
extraction, code/evaluation transformation, device selection, checkpoint
safety, prompt-injection resistance, direct recipe inspection, progressive
reference loading, and negative cases.
Deterministic lint engine (dev_tools/agent/skills/checks/) — a
frontmatter validator (agentskills.io + NVCARPS constraints: required
metadata.author, name/description/compatibility limits, bounded no-follow
file reads, no alwaysApply/globs) plus v1 lints (including the 200-line
SKILL.md limit, trigger/negative/policy coverage, command drift, fixture
integrity, and runtime boundaries). NVFLARE-specific checks stay on top of
company conformance.

Security and authorization boundary

Source comments, READMEs, config values, requirement-file prose, logs, and
tool output are evidence, not authority. Embedded instructions cannot change
aggregation, skip validation, authorize installation, or add exfiltration;
they are ignored and reported as prompt-injection anomalies.
Skills enforce only controls they directly own: static fact extraction,
install-before-import ordering, safe torch.load(..., weights_only=True),
secret redaction, and generated code that does not upload local data, paths,
credentials, models, or datasets.
The agent host owns sandboxing, filesystem permissions, network isolation,
resource limits, and authorization. Skills use the host-supplied environment,
do not probe or construct isolation mechanisms, do not assess whether the
runtime is trusted, and report only actual host/tool denials or command
failures as execution blockers.
Package entries and installer options are dependency configuration handled
under host permissions, not a skill-owned allowlist, typosquat audit, or
supply-chain security claim. Natural-language instructions embedded in those
files remain untrusted source text.
Generated source may sit beside writable training source. Simulator
workspaces, exports, models, and logs use a host-provided runtime directory or
one temporary directory, with exact paths reported; skills do not construct
permission hierarchies or run manifests.
Product packaging/install code retains its own no-follow, path traversal,
ownership, and atomic-publication protections. Diagnosis continues to treat
log content and spoofable status markers as untrusted evidence.

Enforcement

The skill lint gates every push and merge: a pre-push hook (.githooks/),
./runtest.sh -s, and the CI unit tests all run
python -m dev_tools.agent.skills.checks --skills-root skills (covering skills
and eval suites); the packaged bundle is verified at build/install time.

Design docs

Only docs/design/skills_architecture.md ships; the operating-model,
evaluation, and check-report design docs are kept local (gitignored).

greptile-apps · 2026-06-28T11:03:29Z

Greptile Summary

This PR adds the nvflare-convert-lightning agent skill and refactors nvflare-convert-pytorch to align both conversion skills to an explicit operating model: skills are procedural guidance for coding agents, while product APIs own schema, validation, execution, and export. A new _shared/conversion-workflow.md replaces the lifecycle reference, the lint engine is decoupled from docs/design/ (replacing docs_root with evals_root), and the doctor command is scoped to conversion-only readiness checks.

New Lightning skill and shared framework detection: The inspector engine is refactored into a plugin architecture (frameworks/ package) with per-framework FrameworkDetector classes; the Lightning detector handles promotion-over-family disambiguation, import alias tracking, and flare.patch conversion-state detection. All four Lightning eval fixtures include the required empty-batch ValueError guards.
Lint engine decoupling: run_v1_lints no longer reads docs/design/*.md; evals are moved to dev_tools/agent/skill_evals/<skill>/; a new skill-runtime-boundary-lint prevents docs/design/ references and evaluator hooks from leaking into shipped skill content; agent-doc-crosslink-lint and skill-catalog-category-lint are removed.
Doctor scope reduction: startup_kits, poc, and online fields are removed from nvflare agent doctor output; the command is scoped to conversion/ML framework readiness only, with corresponding JSON contract tests added.

Confidence Score: 5/5

Safe to merge; the changes are well-scoped, well-tested, and introduce no regressions in the areas reviewed.

The inspector refactoring is sophisticated but thoroughly tested — every Lightning import alias form, the flare.patch conversion-state path, family promotion logic, entry-context reachability, and nemo-wrapper edge cases all have dedicated parameterized tests. The lint engine decoupling replaces docs-coupled checks with deterministic evals-root-based checks, and the new runtime-boundary lint adds surface coverage without removing any existing admission gates. The only observable behavioral changes are the removal of startup_kits/poc/online from doctor output (intentional scope reduction, covered by updated tests) and the consolidation of the DistributedDataParallel duplicate-append into a single tuple check (a minor correctness fix). Empty-batch guards previously flagged in review threads are present in all four Lightning eval fixtures.

No files require special attention. The nvflare/tool/agent/inspector.py rewrite is the highest-complexity change, but the _FamilyResolver adapter pattern and entry-context reachability helpers are each unit-tested in isolation.

Important Files Changed

Filename	Overview
nvflare/tool/agent/inspector.py	Major refactor: framework detection extracted to plugin architecture, entry-context reachability added for primary-framework selection, Lightning integration signals drive conversion state, evidence collection cap raised from 12 to 10000 to avoid skewed confidence scores. Logic is well-tested and design decisions are thoroughly documented.
nvflare/tool/agent/frameworks/lightning.py	New Lightning framework detector: handles all import alias forms (pytorch_lightning, lightning, lightning.pytorch, module aliases), tracks flare.patch conversion signals, and implements promote_over_family with reachability-aware disambiguation logic.
nvflare/tool/agent/frameworks/registry.py	New registry module: aggregates import roots and evidence weights from all detectors, implements framework_for_import/recommended_skill_for/resolve_primary_framework, and defines UTILITY_FRAMEWORKS to prevent numpy from hijacking primary-framework selection.
dev_tools/agent/skills/checks/lints.py	Significant refactor: docs_root to evals_root, evals moved to dev_tools/agent/skill_evals/, new skill-runtime-boundary-lint replaces agent-doc-crosslink-lint and skill-catalog-category-lint, lint registry replaces the ad-hoc dict, SkillRecord gains cached_property validation and has_helper_tests. All changes are intentional and well-explained.
dev_tools/agent/skills/checks/frontmatter.py	SKILL.md schema updated: NVFLARE custom fields now nested under metadata sub-map per agentskills.io spec; new validators for supported top-level fields, metadata mapping type, spec constraints, and SKILL.md length; skill_metadata() accessor added.
nvflare/tool/agent/doctor.py	Scoped to conversion-only readiness: startup_kits, poc, and online fields removed; doctor_environment() now returns nvflare/commands/optional_dependencies/skills/findings/status only. Breaking change to JSON schema is intentional and covered by updated tests.
nvflare/tool/agent/agent_cli.py	--online and startup-kit selection args removed from nvflare agent doctor; known flags list updated to match; hint message updated to conversion context.
tests/unit_test/tool/agent/agent_inspector_test.py	Large new test suite for Lightning detection, flare.patch conversion state classification (all import alias forms), family disambiguation (pytorch vs lightning), entry-context reachability, and nemo-wrapper edge cases.
tests/unit_test/tool/agent/agent_json_contract_test.py	New golden-field contract tests for recipe list/show and agent inspect JSON shapes; fails CI if a skill-consumed field is renamed or dropped.
dev_tools/agent/skill_evals/nvflare-convert-lightning/files/gpu-device-lightning/model.py	New GPU-device Lightning eval fixture with empty-batch ValueError guard in training_step. No validation_step (intentional: fixture targets device-placement testing only).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[inspect_path] --> B[_inspect_dir / _inspect_file]
    B --> C[_PythonInspector AST walk per file]
    C --> D[visit_Import / visit_ImportFrom / visit_ClassDef / visit_Call]
    D --> E[FrameworkDetector plugins: PyTorchDetector, LightningDetector]
    E --> F[DetectContext: emit evidence, record flare_call, record integration_signal]
    F --> G[InspectState: framework_evidence, flare_calls, integration_signals, class_body_ranges]
    G --> H[_rank_frameworks count-based confidence]
    H --> I[_detect_primary_framework]
    I --> J{Entry context reachability check}
    J -->|Active evidence tied to entry point| K[Non-utility framework with active evidence wins]
    J -->|No entry context| L[resolve_primary_framework family promotion]
    L --> M{Lightning + PyTorch both present?}
    M -->|Yes| N[promote_over_family LightningDetector decides]
    N -->|Lightning reachable from entry| O[pytorch_lightning]
    N -->|PyTorch reachable from entry| P[pytorch]
    K --> Q[_conversion_state]
    O --> Q
    P --> Q
    Q --> R{_has_conversion_integration? integration_signals AND flare_imports}
    R -->|Yes - flare.patch detected| S[client_api_converted]
    R -->|No| T{flare.receive+send or FLModel in calls?}
    T -->|Yes| S
    T -->|No| U{any flare signals?}
    U -->|Yes| V[partial_client_api]
    U -->|No| W[not_converted - recommend skill]
    W --> X[nvflare-convert-pytorch OR nvflare-convert-lightning]

%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[inspect_path] --> B[_inspect_dir / _inspect_file]
    B --> C[_PythonInspector AST walk per file]
    C --> D[visit_Import / visit_ImportFrom / visit_ClassDef / visit_Call]
    D --> E[FrameworkDetector plugins: PyTorchDetector, LightningDetector]
    E --> F[DetectContext: emit evidence, record flare_call, record integration_signal]
    F --> G[InspectState: framework_evidence, flare_calls, integration_signals, class_body_ranges]
    G --> H[_rank_frameworks count-based confidence]
    H --> I[_detect_primary_framework]
    I --> J{Entry context reachability check}
    J -->|Active evidence tied to entry point| K[Non-utility framework with active evidence wins]
    J -->|No entry context| L[resolve_primary_framework family promotion]
    L --> M{Lightning + PyTorch both present?}
    M -->|Yes| N[promote_over_family LightningDetector decides]
    N -->|Lightning reachable from entry| O[pytorch_lightning]
    N -->|PyTorch reachable from entry| P[pytorch]
    K --> Q[_conversion_state]
    O --> Q
    P --> Q
    Q --> R{_has_conversion_integration? integration_signals AND flare_imports}
    R -->|Yes - flare.patch detected| S[client_api_converted]
    R -->|No| T{flare.receive+send or FLModel in calls?}
    T -->|Yes| S
    T -->|No| U{any flare signals?}
    U -->|Yes| V[partial_client_api]
    U -->|No| W[not_converted - recommend skill]
    W --> X[nvflare-convert-pytorch OR nvflare-convert-lightning]

_{Reviews (56): Last reviewed commit: "Add empty-batch guard to gpu-device-ligh..." | Re-trigger Greptile}

codecov-commenter · 2026-06-28T11:54:28Z

Codecov Report

❌ Patch coverage is 85.71429% with 196 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.34%. Comparing base (3c0458d) to head (f05de35).

Files with missing lines	Patch %	Lines
nvflare/tool/agent/skill_manager.py	77.97%	122 Missing ⚠️
nvflare/tool/agent/skill_manifest.py	81.31%	51 Missing ⚠️
nvflare/tool/agent/frameworks/pytorch.py	91.48%	8 Missing ⚠️
nvflare/tool/agent/inspector.py	97.23%	7 Missing ⚠️
nvflare/tool/agent/frameworks/lightning.py	96.42%	4 Missing ⚠️
nvflare/tool/agent/frameworks/base.py	89.65%	3 Missing ⚠️
nvflare/tool/agent/frameworks/registry.py	98.14%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4837      +/-   ##
==========================================
+ Coverage   56.98%   57.34%   +0.36%     
==========================================
  Files         969      973       +4     
  Lines       92310    93350    +1040     
==========================================
+ Hits        52603    53536     +933     
- Misses      39707    39814     +107

Flag	Coverage Δ
unit-tests	`57.34% <85.71%> (+0.36%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

holgerroth · 2026-06-30T21:42:18Z

@chesterxgchen One architectural suggestion before this evaluation surface grows further: move test-only evaluation artifacts out of the distributable skill directories while retaining per-skill namespacing.

A possible layout:

skills/
  nvflare-convert-lightning/
    SKILL.md
    references/

dev_tools/agent/skill_evals/
  nvflare-convert-lightning/
    evals.json
    files/

tests/unit_test/tool/agent_skill_evals/
  ...

The main reasons are:

evals.json contains expected outputs, assertions, mandatory behavior, and prohibited behavior. That is grading-oracle data rather than runtime skill guidance, and including it in dev/nightly-installed skills can expose benchmark answers.
The current colocated layout requires include_analysis_files, release exclusions, and mode-dependent skill hashes solely to keep eval content out of release bundles.
Cross-skill negative-routing cases and the evaluation runner naturally belong to a central suite.
Runtime skill directories would then contain only the instructions and references that an installed agent should load.

Small examples that are genuinely useful during conversion can remain under references/; test-only fixtures and grading metadata would live in the evaluation tree. The runner can discover suites by skill name, preserving clear ownership without shipping evaluation artifacts.

Resolve the outstanding reviewer threads: - validation_step empty-batch guard (greptile P1): training_step already raised ValueError on an empty batch, but validation_step still called F.cross_entropy unguarded (nan on an empty per-site partition). Add the same guard to validation_step in the hello-lightning and vocab-lightning eval fixtures. - FedEval contract (Holger P1): the lightning-eval-only eval promised trainer.validate/test "in the round loop" with "per-round" metrics. FedEvalRecipe requires eval_ckpt and sends one validate task, so rewrite the expected output and assertions to require the supplied eval_ckpt, a single trainer.validate (no fit, no test), and one-shot per-site metrics. - Shared token IDs (Holger P1): equal vocab_size guarantees tensor shape, not semantic alignment. The lightning-data-derived-required-arg eval now requires one identical tokenizer/token-ID mapping across sites (validated, e.g. via a stable hash), with a fixed-tokenizer/approval-blocker fallback. - DDP validation metrics (Holger P1): under launch_external_process=True the ScriptRunner-built PTClientAPILauncherExecutor defaults train_with_evaluation =False and exposes no switch, so pre-fit trainer.validate metrics never reach the server. Document this recipe limitation in lightning-ddp-and-tracking.md instead of promising per-round server-side metrics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chesterxgchen · 2026-07-03T14:05:45Z

@holgerroth Adopted this layout — thanks, it's a clear improvement. Test-only evaluation artifacts now live outside the distributable skill directories, with per-skill namespacing preserved:

skills/<skill>/ holds only runtime content (SKILL.md, references/, assets/); no evals/ remains under skills/.
Eval suites moved to dev_tools/agent/skill_evals/<skill>/{evals.json, files/}; a central runner discovers suites by skill name, keeping ownership clear.
Structure/runner tests are centralized under tests/unit_test/tool/agent_skill_checks/.

On the reasons you gave:

Grading-oracle data (expected outputs, assertions, mandatory/prohibited behavior) is no longer shipped in dev/nightly-installed skills.
The colocated-layout machinery that existed solely to keep eval content out of release bundles — include_analysis_files, mode-dependent skill hashes, and the eval-only release exclusions — is removed. As defense-in-depth, SKILL_PACKAGING_EXCLUDE_NAMES still fail-closes on a stray skills/<skill>/evals/ so eval content can never re-enter a bundle even if one is reintroduced by mistake.
Cross-skill negative-routing cases and the runner now live in the central eval tree.
Runtime skill directories contain only the instructions/references an installed agent should load; small conversion-time examples stay under references//assets/, only test-only fixtures + grading metadata moved to the eval tree.

Landed in 78d0cfb and follow-ups.

The converted-state classifier only matched a literal flare.patch call or a directly imported patch symbol, and additionally required a recognized Lightning Trainer constructor. This misclassified valid conversions: - aliased modules (import nvflare.client.lightning as nfl; nfl.patch(...)) - fully-qualified calls (nvflare.client.lightning.patch(...)) - trainers built via wrappers (e.g. nl.Trainer from nemo.lightning) Track aliases of the nvflare.client.lightning module and recognize all patch call forms, and treat any genuine patch(trainer) call plus an nvflare import as the definitive conversion signal without gating on the trainer constructor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

from nvflare.client import lightning as flare; flare.patch(trainer) is a valid conversion form that was still classified as partial_client_api. Record the aliased submodule as a Lightning patch module so flare.patch is recognized as the conversion signal. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Implements Milestone 8 Stage 4 (Lightning skill content): a PyTorch-family conversion skill that integrates through flare.patch(trainer) rather than manual FLModel exchange, reuses the PyTorch recipe family, and covers DDP and experiment tracking. Includes SKILL.md, four references, evals, an unconverted Lightning fixture, and BENCHMARK.md. Enable the inspector framework->skill mapping for pytorch_lightning now that the skill is implemented, packaged, and covered by admission tests, so agent inspect recommends nvflare-convert-lightning for Lightning projects. Lightning eval behavior IDs use the canonical contract IDs pinned in agent_skill_evaluation.md (use-lightning-patched-trainer, no-manual-flmodel-exchange-lightning, no-input-model-to-trainer) so benchmark consumers see the Lightning-specific gates. Update inspector/CLI tests to expect the Lightning recommendation and add the skill to the seed packaging tests with a payload and release-filter assertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two issues from review of the Lightning routing enablement: - _prefer_lightning_over_pytorch promoted Lightning over PyTorch whenever both frameworks were present. Since Lightning always imports torch, a plain PyTorch training entry point with an incidental Lightning import elsewhere in the workspace was misrouted to nvflare-convert-lightning. Promote Lightning only when there is active Lightning use (a LightningModule/DataModule subclass or a Trainer call) and that evidence is at least as strong as the PyTorch evidence; otherwise keep PyTorch as the lead framework. Adds two mixed-workspace tests. - lightning-ddp-and-tracking.md referenced the non-exported nvflare.app_opt.lightning.loggers.ClientLogger import path. Point agents to the canonical flare.logger() shortcut and the real nvflare.app_opt.lightning.loggers.client_logger.ClientLogger class. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The previous count-based gate still misrouted normal Lightning scripts: a typical Lightning entry point imports several torch symbols (torch, nn, DataLoader), so PyTorch import evidence outnumbers Lightning symbols and the Lightning skill was not recommended. Active Lightning use -- a LightningModule/DataModule subclass or a Trainer call -- is definitive and now outweighs any number of plain torch imports. An incidental Lightning import with no active use still keeps PyTorch as the lead framework. Replaces the over-strict count test with a regression test for a Lightning script carrying multiple torch imports. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…evidence Promoting Lightning whenever any active Lightning evidence appeared anywhere in a scanned directory misrouted mixed workspaces: a plain PyTorch train.py with a LightningModule only in a secondary helper file was routed to Lightning. Prefer Lightning only when active Lightning use (a LightningModule/DataModule subclass or a Trainer call) is in the inspected file or a likely training entry point, with a fallback to promote when Lightning evidence is strictly stronger overall. A single inspected file always counts as its own entry point so normal Lightning scripts (with many torch imports) still route to Lightning. Adds a regression test for a PyTorch entry point plus active Lightning code in a non-entry helper file; keeps the incidental-import and many-torch-imports tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ecks on top Company conformance is the floor; the NVFLARE-specific checks stay unchanged on top of it (stronger, not weaker). Frontmatter (all 5 skills): add the required metadata.author (name + NVIDIA email) and the recommended discoverability fields (tags, languages, frameworks, domain, team). Lint engine (frontmatter validator): - require metadata.author alongside the existing min_flare_version / blast_radius / category requirements; - accept the optional top-level title field NVCARPS allows; - enforce the NVCARPS constraints as new checks: name charset/length (1-64, lowercase alphanumeric + hyphens, no leading/trailing/consecutive hyphens; skill-frontmatter-name-invalid), description <= 1024 chars, compatibility <= 500 chars, and SKILL.md main file <= 500 lines (skill-md-too-long). Prohibited fields (alwaysApply, globs) were already rejected by the top-level allowlist check. Tests: fixtures across both suites gain the author field; new unit tests cover each added constraint plus the author requirement and title acceptance. Not adopted (deliberate): the guideline's section-name template and markdown-link reference style are recommendations, not validation gates; the skills keep their eval-verified structure and backticked reference paths, which cover the same substance (use-when triggers, stepwise workflow, progressive detail in references). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

docs/design/agent_skill_checks_report.md is a local-only human reference (the check-inventory report), kept untracked like the other local design docs so a stray git add -A cannot re-track it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Review finding: conversion-workflow.md already declared HE, differential privacy, and Filter-based protection unsupported by conversion, but recipe selection still said 'any other algorithm or privacy requirement' should pick a matching catalog recipe, and both SKILL.md scope statements named only HE. For a request like 'convert with differential privacy' the agent could read recipe selection as 'find a privacy recipe' before hitting the shared boundary. - pytorch-family-recipe-selection.md: selection rule now covers 'any other algorithm requirement' only, and a new bullet extends the HE rule to DP, privacy filters, and other Filter-based protection: report unsupported, route to provisioning/deployment, ask or fail closed, never generate an unprotected job with only a disclaimer. - Both conversion SKILL.md scope statements generalize from HE to privacy-protection requests (HE/encrypted aggregation, DP, privacy filters). validation-evidence.md redundancy left as-is per the review (not blocking, not contradictory). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

This reverts commit 39236ef.

Re-land the non-benchmark file-safety hardening from the reverted commit (39236ef) on top of the NVCARPS constraints already in HEAD: - parse_skill_frontmatter and _validate_skill_md_length read through _read_regular_skill_file: one no-follow (O_NOFOLLOW where available), nonblocking (O_NONBLOCK), size-bounded (512 KB) descriptor with an lstat-before / fstat-after regular-file check. - Direct parse_skill_frontmatter callers (e.g. the policy/trigger lints) no longer bypass the symlink/special-file guard: a FIFO or device at SKILL.md would otherwise block a reader indefinitely. - _validate_no_symlinks now rejects non-regular/non-directory entries (skill-special-file-not-allowed) and reports unreadable entries. - Non-UTF-8 SKILL.md becomes a reported finding instead of an uncaught UnicodeDecodeError. O_NOFOLLOW/O_NONBLOCK/O_CLOEXEC are applied via getattr fallback so the reader degrades cleanly on platforms lacking them. NVCARPS constraint checks and tests already in HEAD are unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…gger hardening Non-benchmark skill-guidance hardening from the reverted commit (39236ef), extracted without the DDP-stance reversal or benchmark harness. All additive on the current shared references (privacy scope, RCA dependency fix, and data- location rules are preserved as a superset): - conversion-workflow.md: untrusted source-derived execution must run in an OS-level sandbox/container/VM (a venv is package isolation, not a security sandbox) with no ambient credentials, read-only source, bounded writable output, resource limits, and denied egress; fail closed when unavailable. Adds a network/telemetry-logger boundary: source logger config is evidence, not approval; disable unapproved network/custom loggers during validation. - dependency-install.md: treat install as a separate allowlisted acquisition phase inside the sandbox; filter the requirements actually installed rather than passing an untrusted file with skipped directives; add elevated-risk classes; a missing security sandbox is a blocker, never bypassed with a host venv (the installable-dep-is-not-a-blocker rule is retained for when the sandbox is available). - runtime-output-guidance.md: replace predictable /tmp/nvflare/<job> paths with a private per-user runtime root (0700), randomized run directories, no reuse of existing/symlinked output paths, and an exact run manifest. Excludes the reverted commit's DDP-stance rewrite, device_selection benchmark shim, release_checklist.json, and everything under skills/benchmark/. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ivacy, sandbox Extract the non-benchmark PyTorch eval expansion from the reverted commit (39236ef): new fixtures gpu-device-pt, checkpoint-pt, state-mismatch-pt, and dedicated eval cases that close real coverage gaps — - pytorch-device-selection: torch.cuda.is_available() source keeps GPU-when- available + CPU fallback; no hard-coded CPU rewrite after CPU-only validation; - pytorch-safe-checkpoint-load / pytorch-state-dict-schema-mismatch: exercise the weights_only=True rule and treat a generated state-dict mismatch as a conversion bug (previously unevaluated); - pytorch-missing-evaluation-fail-closed and pytorch-unsupported-privacy-request: reinforce fail-closed evaluation and the settled HE/DP report-and-route rule; - behavior/assertion swap tmp-runtime-outputs -> secure-random-runtime-outputs plus sandbox behaviors, reconciling the suite with the runtime-path and execution-sandbox guidance landed in 929a69a. Kept device-selection-respects-availability on the basic case (per the retained 0e0847a check/test) in addition to the dedicated case. Excludes the reverted commit's release_checklist.json (its only consumer was the removed benchmark harness) and the device_selection benchmark shim. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…e DDP reversal Extract the non-benchmark Lightning changes from the reverted commit (39236ef), keeping the settled data-scientist-verified DDP mapping intact: - New gpu-device-lightning fixture + lightning-device-selection eval case; device-selection-respects-availability kept on the basic case (per the retained device_selection check/test). - secure-random-runtime-outputs behavior swap and sandbox/network behaviors, reconciling the suite with 929a69a. - Network-logger gating (#5): Experiment Tracking in lightning-ddp-and-tracking.md and the SKILL.md tracking bullets now treat existing source logger/callback config as evidence of intent, not approval — remote/upload/custom loggers stay disabled during validation, unattended keeps denied egress or blocks. EXCLUDED as rejected item #7 (contradicts the settled DDP decision): the diff's DDP Execution Model / rank-intro / metric-capability rewrites and the 'no-implicit-ddp-param-mapping' description change to 'does not maintain a skill-owned DDP-to-recipe-parameter table' — reverted to the settled wording. The concrete train_with_evaluation=False metric-delivery caveat is kept. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Additive, non-benchmark eval behaviors that reinforce rules already in HEAD: - diagnose: keep-diagnosis-read-only, use-mode-specific-metric-artifacts, no-sensitive-log-quoting (matches the log-content PII-redaction rule), no-production-path-inference-from-simulation. - orient: source-and-logs-are-routing-evidence (matches the orient trust language), report-routing-evidence, state-follow-up-approval-boundary, no-orientation-mutation. No DDP, benchmark, or release_checklist coupling. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Re-land the non-benchmark skill-install hardening from the reverted commit (39236ef), softening the one real user-experience regression: skill_manifest.py / skill_manager.py add, on top of the existing stage-to-temp + os.rename publish, install lock, and symlink rejection: - validate_manifest blocks path traversal from a tampered wheel manifest; - O_NOFOLLOW/O_NONBLOCK fd opens with dev/ino + st_nlink identity re-checks, closing the hash->copy TOCTOU; - the staged tree's hash must equal the manifest source_hash before publish; - immutable content-addressed shared snapshots (.nvflare-shared/<sha256>/) with idempotent path pinning and a recorded installed_hash (legacy fallback compatible), plus hash-verified legacy nvflare-shared retirement; - a private, user-owned target directory and flock-lease lock hardening. UX fix: the target/parent writability guard rejected group-OR-world-writable dirs, which hard-fails a normal install on user-private-group (umask 002) systems where ~/.claude/skills is group-writable. Since every such check is backstopped by an owned-by-current-user check, narrow it to world-writable only (the actual tampering threat). Added a test that a group-writable, user-owned target is accepted; kept world-writable rejection. Windows-specific code paths from the reverted commit are retained as-is; Windows is not an officially supported platform for this tooling. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The AgentSkillsBuildPy build hook validated the packaged skill bundle root and rejected a group-OR-world-writable directory. setuptools creates build/lib/.../bundled_skills under the process umask, so on user-private-group (umask 002) systems that directory is group-writable and the wheel/sdist build failed with 'skill bundle root must not be group/world writable'. This is the build-time counterpart of the install-time check softened in 0527b07. The bundle root is already required to be owned by the current user, so narrow the writability guard to world-writable only (the real tampering threat). Verified: setup.py build_py under umask 002 now builds the bundle into a group-writable build-lib and exits 0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chesterxgchen · 2026-07-04T18:23:41Z

@greptile-apps

The external-data-lightning fixture (added after the earlier empty-batch fixes) called F.cross_entropy in training_step/validation_step without the labels.numel() == 0 guard its sibling fixtures (hello-lightning, vocab-lightning) carry. In a federated setting a site can receive an empty data partition, so cross_entropy would silently return nan (0/0) instead of surfacing the data-preparation problem. Add the same explicit ValueError guard to both hooks (team rule 783565ac). Resolves the greptile P1 review threads. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Same fix as the sibling Lightning fixtures (hello-lightning, vocab-lightning, external-data-lightning): guard training_step against an empty per-site partition with an explicit ValueError instead of letting F.cross_entropy return nan silently (team rule 783565ac). Resolves the greptile P1 thread. All four Lightning fixtures now carry the guard. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chesterxgchen · 2026-07-04T19:09:43Z

@greptile-apps

doctor was a read-only readiness snapshot of largely derivable/convenience checks (nvflare import+version, the command surface, optional ML-dependency availability, and the installed skill bundle). Nothing depends on it: inspect only suggested it as a next command, and orient used it optionally. The real routing signal (framework detection, FLARE integration, conversion state, recommended skill) comes from 'nvflare agent inspect', and the conversion flow installs missing dependencies on its own. Remove the command and all references: agent_cli.py (constant, parser, dispatch, handler), doctor.py + its test (deleted), command_registry.py entry, inspector.py _recommended_next_commands prefix, lints.py _KNOWN_AGENT_COMMANDS and flag table, nvflare-orient SKILL.md / orientation-routing.md (fold readiness into 'nvflare agent inspect'), nvflare-orient evals.json, and the agent_cli / agent_inspector tests. Verified: 'nvflare agent --help' lists info/inspect/skills only; 'agent doctor' is an invalid choice; no doctor references remain; lint 0 findings; 377 agent tests pass; lint-parity stays in sync. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

greptile-apps · 2026-07-04T19:28:27Z

Too many files changed for review. (110 files found, 100 file limit)

Bypass the limit by tagging @greptile-apps to review.

chesterxgchen · 2026-07-04T22:37:24Z

@greptile-app

… execution gating RCA of the benchmark showed the skills pushed the agent into a layer it cannot own: they told it to construct/verify an OS sandbox (it probed bwrap/unshare/firejail, they failed, it fell back to non-isolating fallbacks yet the run passed = false assurance + latency), and they gated dependency install + execution behind a skill-inferred interactive/unattended mode, so a harness that declared no mode made the agent emit an approval/trust prompt, get no answer, and end not_started on an already-valid conversion. Correct the boundary: a skill can reliably enforce only what it reasons about and generates — prompt-injection resistance and conversion semantics. The agent host/harness owns execution security and the authorization to install/run. Remove from skills + evals: OS-sandbox construction/probing/claims, 0700/ randomized-hierarchy/run-manifest/permission audits, supply-chain enforcement (allowlist/typosquat/acquisition-phase), first-execution trust machinery, and skill-inferred-mode gating of install/execution plus skill-issued install/trust/run approval prompts. Keep: prompt-injection resistance (source/comments/README/config/requirement comments/logs/tool output are untrusted data, never authorization; source text never authorizes install/fetch), static inspection, install-before-import ordering, generated-artifact rules (torch.load weights_only=True, no-exfil, secret/credential redaction), conversion semantics, and the settled DDP mapping. Now: dependency install + validation proceed by default; the host permission system is the only gate; blockers are real host/tool denials or install failures only; a completed conversion ending not_started (waiting on approval) is a failure. Interactive-vs-unattended governs only unresolved conversion semantics (missing required arg / ambiguous algorithm). Dependency entries (--index-url/git+/URL/pins) are surfaced as unusual but not audited/allowlisted/ blocked (host owns that); requirement-file comments/prose are treated as injection. Runtime output goes to a host dir or one temp dir with paths reported. Both SKILL.md use progressive disclosure with a direct recipe-show fedavg-pt fast path; explicit conversions no longer route through nvflare-orient just to detect the framework. Evals updated to assert the new behavior and add negative checks (no sandbox construction, no orient-for-explicit -conversion); device-selection and the settled DDP mapping preserved. Reviewed by architect (compliant) and principal (minimal-and-correct). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…o residual execution/dependency gating Follow-up refinements on top of the boundary rescope (403fb8d): - PyTorch and Lightning standard FedAvg paths no longer preemptively load broad references; references load only for exceptions (progressive disclosure). - Basic evals require a direct 'nvflare recipe show fedavg-pt' for an explicit FedAvg + PyTorch request; 'recipe list' is reserved for ambiguous/non-standard. - Remove the residual skill-owned 'trusted execution boundary' gate: the agent uses the host-supplied environment and does not itself decide whether the host is trusted or sufficiently isolated. - Move orient's explicit-conversion exclusion into the skill's trigger metadata (description), so an explicit conversion request that merely needs framework detection is not routed through orient. - Stop the skill auditing/classifying dependency entries; keep prompt-injection resistance (requirement-file comments/prose are untrusted; entries are surfaced but not audited/allowlisted/blocked — host owns that). - Remove stale approval/trust wording around loggers, overwrites, validation, and routing. Evals updated to enforce the same behavior. Verified: skill lint 0 findings; 377 agent tests pass; JSON valid; boundary invariants intact (no sandbox/0700/manifest/mode-gating; injection resistance, weights_only, and the settled DDP mapping preserved); no dangling references. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

chesterxgchen · 2026-07-05T00:49:47Z

@greptile-app

chesterxgchen force-pushed the milestone8-agent-skills branch from ed74191 to 2d32542 Compare June 28, 2026 10:55

greptile-apps Bot reviewed Jun 28, 2026

View reviewed changes

Comment thread skills/nvflare-convert-lightning/evals/files/imbalanced-lightning/model.py Outdated

Comment thread nvflare/tool/agent/inspector.py Outdated

Comment thread nvflare/tool/agent/inspector.py

greptile-apps Bot reviewed Jun 28, 2026

View reviewed changes

Comment thread dev_tools/agent/skill_evals/nvflare-convert-lightning/files/hello-lightning/model.py

holgerroth reviewed Jun 29, 2026

View reviewed changes

Comment thread skills/nvflare-convert-lightning/evals/evals.json Outdated

holgerroth reviewed Jun 29, 2026

View reviewed changes

Comment thread skills/nvflare-convert-lightning/evals/evals.json Outdated

holgerroth reviewed Jun 29, 2026

View reviewed changes

Comment thread skills/nvflare-convert-lightning/references/lightning-ddp-and-tracking.md Outdated

greptile-apps Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread dev_tools/agent/skill_evals/nvflare-convert-lightning/files/hello-lightning/model.py

chesterxgchen changed the title ~~Add NVFLARE Lightning conversion agent skill~~ Add NVFLARE Lightning conversion skill and align conversion skills to the operating model Jul 2, 2026

chesterxgchen force-pushed the milestone8-agent-skills branch from 1b85f8e to 639af87 Compare July 2, 2026 13:13

chesterxgchen force-pushed the milestone8-agent-skills branch from fe35795 to e8f8010 Compare July 3, 2026 00:40

chesterxgchen force-pushed the milestone8-agent-skills branch from 8479b1f to f3100cd Compare July 3, 2026 04:07

greptile-apps Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread dev_tools/agent/skill_evals/nvflare-convert-lightning/files/external-data-lightning/model.py

greptile-apps Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread dev_tools/agent/skill_evals/nvflare-convert-lightning/files/external-data-lightning/model.py

greptile-apps Bot reviewed Jul 3, 2026

View reviewed changes

Comment thread dev_tools/agent/skill_evals/nvflare-convert-lightning/files/external-data-lightning/model.py

chesterxgchen and others added 9 commits July 3, 2026 12:28

Refactor shared skill guidance

147ed94

Add Lightning routing to agent inspect

6013255

Remove benchmark notes from skill artifacts

2ecb40c

chesterxgchen and others added 3 commits July 3, 2026 12:29

chesterxgchen force-pushed the milestone8-agent-skills branch from d37bc5d to bcb74c1 Compare July 3, 2026 19:29

chesterxgchen and others added 10 commits July 3, 2026 15:40

Add device selection eval check

0e0847a

Harden agent skills and benchmark RCA reporting

39236ef

Revert "Harden agent skills and benchmark RCA reporting"

3b0153c

This reverts commit 39236ef.

greptile-apps Bot reviewed Jul 4, 2026

View reviewed changes

Comment thread dev_tools/agent/skill_evals/nvflare-convert-lightning/files/gpu-device-lightning/model.py

chesterxgchen changed the title ~~Add NVFLARE Lightning conversion skill and align conversion skills to the operating model~~ NVFLARE agent skills: PyTorch/Lightning conversion, orient, diagnose — with lint, eval, packaging, and security hardening Jul 4, 2026

chesterxgchen added 2 commits July 4, 2026 13:55

Strengthen conversion skill validation ordering

429bb50

Avoid approval waits in unattended skill runs

c8fb50a

chesterxgchen and others added 2 commits July 4, 2026 17:33

chesterxgchen added 2 commits July 5, 2026 08:54

Clarify skill layout collision handling

f91b5f2

Generalize source model layout guidance

f05de35

Uh oh!

Conversation

chesterxgchen commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Skills

Product code

Evaluation & lint infrastructure

Security and authorization boundary

Enforcement

Design docs

Uh oh!

greptile-apps Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

holgerroth commented Jun 30, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chesterxgchen commented Jul 3, 2026

Uh oh!

chesterxgchen commented Jul 4, 2026

Uh oh!

Uh oh!

chesterxgchen commented Jul 4, 2026

Uh oh!

greptile-apps Bot commented Jul 4, 2026

Uh oh!

chesterxgchen commented Jul 4, 2026

Uh oh!

chesterxgchen commented Jul 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chesterxgchen commented Jun 28, 2026 •

edited

Loading

greptile-apps Bot commented Jun 28, 2026 •

edited

Loading

codecov-commenter commented Jun 28, 2026 •

edited

Loading