NVFLARE agent skills: PyTorch/Lightning conversion, orient, diagnose — with lint, eval, packaging, and security hardening#4837
Conversation
ed74191 to
2d32542
Compare
Greptile SummaryThis PR adds the
Confidence Score: 5/5Safe to merge; the changes are well-scoped, well-tested, and introduce no regressions in the areas reviewed. The inspector refactoring is sophisticated but thoroughly tested — every Lightning import alias form, the flare.patch conversion-state path, family promotion logic, entry-context reachability, and nemo-wrapper edge cases all have dedicated parameterized tests. The lint engine decoupling replaces docs-coupled checks with deterministic evals-root-based checks, and the new runtime-boundary lint adds surface coverage without removing any existing admission gates. The only observable behavioral changes are the removal of startup_kits/poc/online from doctor output (intentional scope reduction, covered by updated tests) and the consolidation of the DistributedDataParallel duplicate-append into a single tuple check (a minor correctness fix). Empty-batch guards previously flagged in review threads are present in all four Lightning eval fixtures. No files require special attention. The nvflare/tool/agent/inspector.py rewrite is the highest-complexity change, but the _FamilyResolver adapter pattern and entry-context reachability helpers are each unit-tested in isolation. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[inspect_path] --> B[_inspect_dir / _inspect_file]
B --> C[_PythonInspector AST walk per file]
C --> D[visit_Import / visit_ImportFrom / visit_ClassDef / visit_Call]
D --> E[FrameworkDetector plugins: PyTorchDetector, LightningDetector]
E --> F[DetectContext: emit evidence, record flare_call, record integration_signal]
F --> G[InspectState: framework_evidence, flare_calls, integration_signals, class_body_ranges]
G --> H[_rank_frameworks count-based confidence]
H --> I[_detect_primary_framework]
I --> J{Entry context reachability check}
J -->|Active evidence tied to entry point| K[Non-utility framework with active evidence wins]
J -->|No entry context| L[resolve_primary_framework family promotion]
L --> M{Lightning + PyTorch both present?}
M -->|Yes| N[promote_over_family LightningDetector decides]
N -->|Lightning reachable from entry| O[pytorch_lightning]
N -->|PyTorch reachable from entry| P[pytorch]
K --> Q[_conversion_state]
O --> Q
P --> Q
Q --> R{_has_conversion_integration? integration_signals AND flare_imports}
R -->|Yes - flare.patch detected| S[client_api_converted]
R -->|No| T{flare.receive+send or FLModel in calls?}
T -->|Yes| S
T -->|No| U{any flare signals?}
U -->|Yes| V[partial_client_api]
U -->|No| W[not_converted - recommend skill]
W --> X[nvflare-convert-pytorch OR nvflare-convert-lightning]
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
A[inspect_path] --> B[_inspect_dir / _inspect_file]
B --> C[_PythonInspector AST walk per file]
C --> D[visit_Import / visit_ImportFrom / visit_ClassDef / visit_Call]
D --> E[FrameworkDetector plugins: PyTorchDetector, LightningDetector]
E --> F[DetectContext: emit evidence, record flare_call, record integration_signal]
F --> G[InspectState: framework_evidence, flare_calls, integration_signals, class_body_ranges]
G --> H[_rank_frameworks count-based confidence]
H --> I[_detect_primary_framework]
I --> J{Entry context reachability check}
J -->|Active evidence tied to entry point| K[Non-utility framework with active evidence wins]
J -->|No entry context| L[resolve_primary_framework family promotion]
L --> M{Lightning + PyTorch both present?}
M -->|Yes| N[promote_over_family LightningDetector decides]
N -->|Lightning reachable from entry| O[pytorch_lightning]
N -->|PyTorch reachable from entry| P[pytorch]
K --> Q[_conversion_state]
O --> Q
P --> Q
Q --> R{_has_conversion_integration? integration_signals AND flare_imports}
R -->|Yes - flare.patch detected| S[client_api_converted]
R -->|No| T{flare.receive+send or FLModel in calls?}
T -->|Yes| S
T -->|No| U{any flare signals?}
U -->|Yes| V[partial_client_api]
U -->|No| W[not_converted - recommend skill]
W --> X[nvflare-convert-pytorch OR nvflare-convert-lightning]
Reviews (56): Last reviewed commit: "Add empty-batch guard to gpu-device-ligh..." | Re-trigger Greptile |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4837 +/- ##
==========================================
+ Coverage 56.98% 57.34% +0.36%
==========================================
Files 969 973 +4
Lines 92310 93350 +1040
==========================================
+ Hits 52603 53536 +933
- Misses 39707 39814 +107
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
@chesterxgchen One architectural suggestion before this evaluation surface grows further: move test-only evaluation artifacts out of the distributable skill directories while retaining per-skill namespacing. A possible layout: The main reasons are:
Small examples that are genuinely useful during conversion can remain under |
1b85f8e to
639af87
Compare
Resolve the outstanding reviewer threads: - validation_step empty-batch guard (greptile P1): training_step already raised ValueError on an empty batch, but validation_step still called F.cross_entropy unguarded (nan on an empty per-site partition). Add the same guard to validation_step in the hello-lightning and vocab-lightning eval fixtures. - FedEval contract (Holger P1): the lightning-eval-only eval promised trainer.validate/test "in the round loop" with "per-round" metrics. FedEvalRecipe requires eval_ckpt and sends one validate task, so rewrite the expected output and assertions to require the supplied eval_ckpt, a single trainer.validate (no fit, no test), and one-shot per-site metrics. - Shared token IDs (Holger P1): equal vocab_size guarantees tensor shape, not semantic alignment. The lightning-data-derived-required-arg eval now requires one identical tokenizer/token-ID mapping across sites (validated, e.g. via a stable hash), with a fixed-tokenizer/approval-blocker fallback. - DDP validation metrics (Holger P1): under launch_external_process=True the ScriptRunner-built PTClientAPILauncherExecutor defaults train_with_evaluation =False and exposes no switch, so pre-fit trainer.validate metrics never reach the server. Document this recipe limitation in lightning-ddp-and-tracking.md instead of promising per-round server-side metrics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Resolve the outstanding reviewer threads: - validation_step empty-batch guard (greptile P1): training_step already raised ValueError on an empty batch, but validation_step still called F.cross_entropy unguarded (nan on an empty per-site partition). Add the same guard to validation_step in the hello-lightning and vocab-lightning eval fixtures. - FedEval contract (Holger P1): the lightning-eval-only eval promised trainer.validate/test "in the round loop" with "per-round" metrics. FedEvalRecipe requires eval_ckpt and sends one validate task, so rewrite the expected output and assertions to require the supplied eval_ckpt, a single trainer.validate (no fit, no test), and one-shot per-site metrics. - Shared token IDs (Holger P1): equal vocab_size guarantees tensor shape, not semantic alignment. The lightning-data-derived-required-arg eval now requires one identical tokenizer/token-ID mapping across sites (validated, e.g. via a stable hash), with a fixed-tokenizer/approval-blocker fallback. - DDP validation metrics (Holger P1): under launch_external_process=True the ScriptRunner-built PTClientAPILauncherExecutor defaults train_with_evaluation =False and exposes no switch, so pre-fit trainer.validate metrics never reach the server. Document this recipe limitation in lightning-ddp-and-tracking.md instead of promising per-round server-side metrics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
fe35795 to
e8f8010
Compare
Resolve the outstanding reviewer threads: - validation_step empty-batch guard (greptile P1): training_step already raised ValueError on an empty batch, but validation_step still called F.cross_entropy unguarded (nan on an empty per-site partition). Add the same guard to validation_step in the hello-lightning and vocab-lightning eval fixtures. - FedEval contract (Holger P1): the lightning-eval-only eval promised trainer.validate/test "in the round loop" with "per-round" metrics. FedEvalRecipe requires eval_ckpt and sends one validate task, so rewrite the expected output and assertions to require the supplied eval_ckpt, a single trainer.validate (no fit, no test), and one-shot per-site metrics. - Shared token IDs (Holger P1): equal vocab_size guarantees tensor shape, not semantic alignment. The lightning-data-derived-required-arg eval now requires one identical tokenizer/token-ID mapping across sites (validated, e.g. via a stable hash), with a fixed-tokenizer/approval-blocker fallback. - DDP validation metrics (Holger P1): under launch_external_process=True the ScriptRunner-built PTClientAPILauncherExecutor defaults train_with_evaluation =False and exposes no switch, so pre-fit trainer.validate metrics never reach the server. Document this recipe limitation in lightning-ddp-and-tracking.md instead of promising per-round server-side metrics. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
8479b1f to
f3100cd
Compare
|
@holgerroth Adopted this layout — thanks, it's a clear improvement. Test-only evaluation artifacts now live outside the distributable skill directories, with per-skill namespacing preserved:
On the reasons you gave:
Landed in 78d0cfb and follow-ups. |
The converted-state classifier only matched a literal flare.patch call or a directly imported patch symbol, and additionally required a recognized Lightning Trainer constructor. This misclassified valid conversions: - aliased modules (import nvflare.client.lightning as nfl; nfl.patch(...)) - fully-qualified calls (nvflare.client.lightning.patch(...)) - trainers built via wrappers (e.g. nl.Trainer from nemo.lightning) Track aliases of the nvflare.client.lightning module and recognize all patch call forms, and treat any genuine patch(trainer) call plus an nvflare import as the definitive conversion signal without gating on the trainer constructor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
from nvflare.client import lightning as flare; flare.patch(trainer) is a valid conversion form that was still classified as partial_client_api. Record the aliased submodule as a Lightning patch module so flare.patch is recognized as the conversion signal. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements Milestone 8 Stage 4 (Lightning skill content): a PyTorch-family conversion skill that integrates through flare.patch(trainer) rather than manual FLModel exchange, reuses the PyTorch recipe family, and covers DDP and experiment tracking. Includes SKILL.md, four references, evals, an unconverted Lightning fixture, and BENCHMARK.md. Enable the inspector framework->skill mapping for pytorch_lightning now that the skill is implemented, packaged, and covered by admission tests, so agent inspect recommends nvflare-convert-lightning for Lightning projects. Lightning eval behavior IDs use the canonical contract IDs pinned in agent_skill_evaluation.md (use-lightning-patched-trainer, no-manual-flmodel-exchange-lightning, no-input-model-to-trainer) so benchmark consumers see the Lightning-specific gates. Update inspector/CLI tests to expect the Lightning recommendation and add the skill to the seed packaging tests with a payload and release-filter assertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two issues from review of the Lightning routing enablement: - _prefer_lightning_over_pytorch promoted Lightning over PyTorch whenever both frameworks were present. Since Lightning always imports torch, a plain PyTorch training entry point with an incidental Lightning import elsewhere in the workspace was misrouted to nvflare-convert-lightning. Promote Lightning only when there is active Lightning use (a LightningModule/DataModule subclass or a Trainer call) and that evidence is at least as strong as the PyTorch evidence; otherwise keep PyTorch as the lead framework. Adds two mixed-workspace tests. - lightning-ddp-and-tracking.md referenced the non-exported nvflare.app_opt.lightning.loggers.ClientLogger import path. Point agents to the canonical flare.logger() shortcut and the real nvflare.app_opt.lightning.loggers.client_logger.ClientLogger class. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The previous count-based gate still misrouted normal Lightning scripts: a typical Lightning entry point imports several torch symbols (torch, nn, DataLoader), so PyTorch import evidence outnumbers Lightning symbols and the Lightning skill was not recommended. Active Lightning use -- a LightningModule/DataModule subclass or a Trainer call -- is definitive and now outweighs any number of plain torch imports. An incidental Lightning import with no active use still keeps PyTorch as the lead framework. Replaces the over-strict count test with a regression test for a Lightning script carrying multiple torch imports. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…evidence Promoting Lightning whenever any active Lightning evidence appeared anywhere in a scanned directory misrouted mixed workspaces: a plain PyTorch train.py with a LightningModule only in a secondary helper file was routed to Lightning. Prefer Lightning only when active Lightning use (a LightningModule/DataModule subclass or a Trainer call) is in the inspected file or a likely training entry point, with a fallback to promote when Lightning evidence is strictly stronger overall. A single inspected file always counts as its own entry point so normal Lightning scripts (with many torch imports) still route to Lightning. Adds a regression test for a PyTorch entry point plus active Lightning code in a non-entry helper file; keeps the incidental-import and many-torch-imports tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ecks on top Company conformance is the floor; the NVFLARE-specific checks stay unchanged on top of it (stronger, not weaker). Frontmatter (all 5 skills): add the required metadata.author (name + NVIDIA email) and the recommended discoverability fields (tags, languages, frameworks, domain, team). Lint engine (frontmatter validator): - require metadata.author alongside the existing min_flare_version / blast_radius / category requirements; - accept the optional top-level title field NVCARPS allows; - enforce the NVCARPS constraints as new checks: name charset/length (1-64, lowercase alphanumeric + hyphens, no leading/trailing/consecutive hyphens; skill-frontmatter-name-invalid), description <= 1024 chars, compatibility <= 500 chars, and SKILL.md main file <= 500 lines (skill-md-too-long). Prohibited fields (alwaysApply, globs) were already rejected by the top-level allowlist check. Tests: fixtures across both suites gain the author field; new unit tests cover each added constraint plus the author requirement and title acceptance. Not adopted (deliberate): the guideline's section-name template and markdown-link reference style are recommendations, not validation gates; the skills keep their eval-verified structure and backticked reference paths, which cover the same substance (use-when triggers, stepwise workflow, progressive detail in references). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
docs/design/agent_skill_checks_report.md is a local-only human reference (the check-inventory report), kept untracked like the other local design docs so a stray git add -A cannot re-track it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Review finding: conversion-workflow.md already declared HE, differential privacy, and Filter-based protection unsupported by conversion, but recipe selection still said 'any other algorithm or privacy requirement' should pick a matching catalog recipe, and both SKILL.md scope statements named only HE. For a request like 'convert with differential privacy' the agent could read recipe selection as 'find a privacy recipe' before hitting the shared boundary. - pytorch-family-recipe-selection.md: selection rule now covers 'any other algorithm requirement' only, and a new bullet extends the HE rule to DP, privacy filters, and other Filter-based protection: report unsupported, route to provisioning/deployment, ask or fail closed, never generate an unprotected job with only a disclaimer. - Both conversion SKILL.md scope statements generalize from HE to privacy-protection requests (HE/encrypted aggregation, DP, privacy filters). validation-evidence.md redundancy left as-is per the review (not blocking, not contradictory). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
d37bc5d to
bcb74c1
Compare
This reverts commit 39236ef.
Re-land the non-benchmark file-safety hardening from the reverted commit (39236ef) on top of the NVCARPS constraints already in HEAD: - parse_skill_frontmatter and _validate_skill_md_length read through _read_regular_skill_file: one no-follow (O_NOFOLLOW where available), nonblocking (O_NONBLOCK), size-bounded (512 KB) descriptor with an lstat-before / fstat-after regular-file check. - Direct parse_skill_frontmatter callers (e.g. the policy/trigger lints) no longer bypass the symlink/special-file guard: a FIFO or device at SKILL.md would otherwise block a reader indefinitely. - _validate_no_symlinks now rejects non-regular/non-directory entries (skill-special-file-not-allowed) and reports unreadable entries. - Non-UTF-8 SKILL.md becomes a reported finding instead of an uncaught UnicodeDecodeError. O_NOFOLLOW/O_NONBLOCK/O_CLOEXEC are applied via getattr fallback so the reader degrades cleanly on platforms lacking them. NVCARPS constraint checks and tests already in HEAD are unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gger hardening Non-benchmark skill-guidance hardening from the reverted commit (39236ef), extracted without the DDP-stance reversal or benchmark harness. All additive on the current shared references (privacy scope, RCA dependency fix, and data- location rules are preserved as a superset): - conversion-workflow.md: untrusted source-derived execution must run in an OS-level sandbox/container/VM (a venv is package isolation, not a security sandbox) with no ambient credentials, read-only source, bounded writable output, resource limits, and denied egress; fail closed when unavailable. Adds a network/telemetry-logger boundary: source logger config is evidence, not approval; disable unapproved network/custom loggers during validation. - dependency-install.md: treat install as a separate allowlisted acquisition phase inside the sandbox; filter the requirements actually installed rather than passing an untrusted file with skipped directives; add elevated-risk classes; a missing security sandbox is a blocker, never bypassed with a host venv (the installable-dep-is-not-a-blocker rule is retained for when the sandbox is available). - runtime-output-guidance.md: replace predictable /tmp/nvflare/<job> paths with a private per-user runtime root (0700), randomized run directories, no reuse of existing/symlinked output paths, and an exact run manifest. Excludes the reverted commit's DDP-stance rewrite, device_selection benchmark shim, release_checklist.json, and everything under skills/benchmark/. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ivacy, sandbox Extract the non-benchmark PyTorch eval expansion from the reverted commit (39236ef): new fixtures gpu-device-pt, checkpoint-pt, state-mismatch-pt, and dedicated eval cases that close real coverage gaps — - pytorch-device-selection: torch.cuda.is_available() source keeps GPU-when- available + CPU fallback; no hard-coded CPU rewrite after CPU-only validation; - pytorch-safe-checkpoint-load / pytorch-state-dict-schema-mismatch: exercise the weights_only=True rule and treat a generated state-dict mismatch as a conversion bug (previously unevaluated); - pytorch-missing-evaluation-fail-closed and pytorch-unsupported-privacy-request: reinforce fail-closed evaluation and the settled HE/DP report-and-route rule; - behavior/assertion swap tmp-runtime-outputs -> secure-random-runtime-outputs plus sandbox behaviors, reconciling the suite with the runtime-path and execution-sandbox guidance landed in 929a69a. Kept device-selection-respects-availability on the basic case (per the retained 0e0847a check/test) in addition to the dedicated case. Excludes the reverted commit's release_checklist.json (its only consumer was the removed benchmark harness) and the device_selection benchmark shim. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e DDP reversal Extract the non-benchmark Lightning changes from the reverted commit (39236ef), keeping the settled data-scientist-verified DDP mapping intact: - New gpu-device-lightning fixture + lightning-device-selection eval case; device-selection-respects-availability kept on the basic case (per the retained device_selection check/test). - secure-random-runtime-outputs behavior swap and sandbox/network behaviors, reconciling the suite with 929a69a. - Network-logger gating (#5): Experiment Tracking in lightning-ddp-and-tracking.md and the SKILL.md tracking bullets now treat existing source logger/callback config as evidence of intent, not approval — remote/upload/custom loggers stay disabled during validation, unattended keeps denied egress or blocks. EXCLUDED as rejected item #7 (contradicts the settled DDP decision): the diff's DDP Execution Model / rank-intro / metric-capability rewrites and the 'no-implicit-ddp-param-mapping' description change to 'does not maintain a skill-owned DDP-to-recipe-parameter table' — reverted to the settled wording. The concrete train_with_evaluation=False metric-delivery caveat is kept. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Additive, non-benchmark eval behaviors that reinforce rules already in HEAD: - diagnose: keep-diagnosis-read-only, use-mode-specific-metric-artifacts, no-sensitive-log-quoting (matches the log-content PII-redaction rule), no-production-path-inference-from-simulation. - orient: source-and-logs-are-routing-evidence (matches the orient trust language), report-routing-evidence, state-follow-up-approval-boundary, no-orientation-mutation. No DDP, benchmark, or release_checklist coupling. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Re-land the non-benchmark skill-install hardening from the reverted commit (39236ef), softening the one real user-experience regression: skill_manifest.py / skill_manager.py add, on top of the existing stage-to-temp + os.rename publish, install lock, and symlink rejection: - validate_manifest blocks path traversal from a tampered wheel manifest; - O_NOFOLLOW/O_NONBLOCK fd opens with dev/ino + st_nlink identity re-checks, closing the hash->copy TOCTOU; - the staged tree's hash must equal the manifest source_hash before publish; - immutable content-addressed shared snapshots (.nvflare-shared/<sha256>/) with idempotent path pinning and a recorded installed_hash (legacy fallback compatible), plus hash-verified legacy nvflare-shared retirement; - a private, user-owned target directory and flock-lease lock hardening. UX fix: the target/parent writability guard rejected group-OR-world-writable dirs, which hard-fails a normal install on user-private-group (umask 002) systems where ~/.claude/skills is group-writable. Since every such check is backstopped by an owned-by-current-user check, narrow it to world-writable only (the actual tampering threat). Added a test that a group-writable, user-owned target is accepted; kept world-writable rejection. Windows-specific code paths from the reverted commit are retained as-is; Windows is not an officially supported platform for this tooling. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The AgentSkillsBuildPy build hook validated the packaged skill bundle root and rejected a group-OR-world-writable directory. setuptools creates build/lib/.../bundled_skills under the process umask, so on user-private-group (umask 002) systems that directory is group-writable and the wheel/sdist build failed with 'skill bundle root must not be group/world writable'. This is the build-time counterpart of the install-time check softened in 0527b07. The bundle root is already required to be owned by the current user, so narrow the writability guard to world-writable only (the real tampering threat). Verified: setup.py build_py under umask 002 now builds the bundle into a group-writable build-lib and exits 0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The external-data-lightning fixture (added after the earlier empty-batch fixes) called F.cross_entropy in training_step/validation_step without the labels.numel() == 0 guard its sibling fixtures (hello-lightning, vocab-lightning) carry. In a federated setting a site can receive an empty data partition, so cross_entropy would silently return nan (0/0) instead of surfacing the data-preparation problem. Add the same explicit ValueError guard to both hooks (team rule 783565ac). Resolves the greptile P1 review threads. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Same fix as the sibling Lightning fixtures (hello-lightning, vocab-lightning, external-data-lightning): guard training_step against an empty per-site partition with an explicit ValueError instead of letting F.cross_entropy return nan silently (team rule 783565ac). Resolves the greptile P1 thread. All four Lightning fixtures now carry the guard. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
doctor was a read-only readiness snapshot of largely derivable/convenience checks (nvflare import+version, the command surface, optional ML-dependency availability, and the installed skill bundle). Nothing depends on it: inspect only suggested it as a next command, and orient used it optionally. The real routing signal (framework detection, FLARE integration, conversion state, recommended skill) comes from 'nvflare agent inspect', and the conversion flow installs missing dependencies on its own. Remove the command and all references: agent_cli.py (constant, parser, dispatch, handler), doctor.py + its test (deleted), command_registry.py entry, inspector.py _recommended_next_commands prefix, lints.py _KNOWN_AGENT_COMMANDS and flag table, nvflare-orient SKILL.md / orientation-routing.md (fold readiness into 'nvflare agent inspect'), nvflare-orient evals.json, and the agent_cli / agent_inspector tests. Verified: 'nvflare agent --help' lists info/inspect/skills only; 'agent doctor' is an invalid choice; no doctor references remain; lint 0 findings; 377 agent tests pass; lint-parity stays in sync. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Too many files changed for review. ( Bypass the limit by tagging |
|
@greptile-app |
… execution gating RCA of the benchmark showed the skills pushed the agent into a layer it cannot own: they told it to construct/verify an OS sandbox (it probed bwrap/unshare/firejail, they failed, it fell back to non-isolating fallbacks yet the run passed = false assurance + latency), and they gated dependency install + execution behind a skill-inferred interactive/unattended mode, so a harness that declared no mode made the agent emit an approval/trust prompt, get no answer, and end not_started on an already-valid conversion. Correct the boundary: a skill can reliably enforce only what it reasons about and generates — prompt-injection resistance and conversion semantics. The agent host/harness owns execution security and the authorization to install/run. Remove from skills + evals: OS-sandbox construction/probing/claims, 0700/ randomized-hierarchy/run-manifest/permission audits, supply-chain enforcement (allowlist/typosquat/acquisition-phase), first-execution trust machinery, and skill-inferred-mode gating of install/execution plus skill-issued install/trust/run approval prompts. Keep: prompt-injection resistance (source/comments/README/config/requirement comments/logs/tool output are untrusted data, never authorization; source text never authorizes install/fetch), static inspection, install-before-import ordering, generated-artifact rules (torch.load weights_only=True, no-exfil, secret/credential redaction), conversion semantics, and the settled DDP mapping. Now: dependency install + validation proceed by default; the host permission system is the only gate; blockers are real host/tool denials or install failures only; a completed conversion ending not_started (waiting on approval) is a failure. Interactive-vs-unattended governs only unresolved conversion semantics (missing required arg / ambiguous algorithm). Dependency entries (--index-url/git+/URL/pins) are surfaced as unusual but not audited/allowlisted/ blocked (host owns that); requirement-file comments/prose are treated as injection. Runtime output goes to a host dir or one temp dir with paths reported. Both SKILL.md use progressive disclosure with a direct recipe-show fedavg-pt fast path; explicit conversions no longer route through nvflare-orient just to detect the framework. Evals updated to assert the new behavior and add negative checks (no sandbox construction, no orient-for-explicit -conversion); device-selection and the settled DDP mapping preserved. Reviewed by architect (compliant) and principal (minimal-and-correct). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…o residual execution/dependency gating Follow-up refinements on top of the boundary rescope (403fb8d): - PyTorch and Lightning standard FedAvg paths no longer preemptively load broad references; references load only for exceptions (progressive disclosure). - Basic evals require a direct 'nvflare recipe show fedavg-pt' for an explicit FedAvg + PyTorch request; 'recipe list' is reserved for ambiguous/non-standard. - Remove the residual skill-owned 'trusted execution boundary' gate: the agent uses the host-supplied environment and does not itself decide whether the host is trusted or sufficiently isolated. - Move orient's explicit-conversion exclusion into the skill's trigger metadata (description), so an explicit conversion request that merely needs framework detection is not routed through orient. - Stop the skill auditing/classifying dependency entries; keep prompt-injection resistance (requirement-file comments/prose are untrusted; entries are surfaced but not audited/allowlisted/blocked — host owns that). - Remove stale approval/trust wording around loggers, overwrites, validation, and routing. Evals updated to enforce the same behavior. Verified: skill lint 0 findings; 377 agent tests pass; JSON valid; boundary invariants intact (no sandbox/0700/manifest/mode-gating; injection resistance, weights_only, and the settled DDP mapping preserved); no dangling references. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
@greptile-app |
Summary
Adds a set of NVFLARE agent skills — procedural playbooks a coding agent
(Codex/Claude) loads to convert ML training code to federated learning, orient
in an NVFLARE project, and diagnose failed jobs — together with the packaging,
install, lint, and evaluation infrastructure to author and ship them safely.
The skills follow the three-layer operating model: the skill owns procedure
(what to extract, which commands to run, the validation ladder, semantic stop
conditions, and reporting), the product API owns truth (recipe schema,
validation, execution, and export), and the agent host owns permissions and
execution security. The agent discovers product truth at runtime through
commands such as
nvflare recipe show --format json;recipe listis reservedfor cases where framework and algorithm intent do not already identify the
recipe. This keeps skill prose from drifting into a "shadow API." Skills conform
to the agentskills.io specification and the company (NVCARPS) skills guideline.
Skills
nvflare-convert-pytorch— plain PyTorch → FLARE Client API conversion.nvflare-convert-lightning— PyTorch Lightning → FLARE viaflare.patch(trainer), Lightning-native evaluation, and DDP handling.nvflare-diagnose-job— read-only diagnosis of failed/stalled/suspiciousjobs by mapping bounded evidence to a failure-pattern catalog.
nvflare-orient— routes an open-ended or ambiguous NVFLARE request tothe right workflow skill without editing files; explicit conversion requests
select their converter directly after static framework inspection.
nvflare-shared— internal, non-triggered reference library(non-standard conversion guidance, validation ladder, dependency ordering,
PyTorch-family recipe selection, model exchange, metrics/artifact reporting)
plus the shared
ModelAggregatortemplate.Conversion skills use explicit
{"class_path", "args"}recipe model config (nolive model instances), paired training/evaluation templates, custom aggregation
via the recipe
aggregator=extension point,enable_tensor_disk_offloadpaired with
server_expected_format=PYTORCH, build setup once before the FLround loop, device-selection preservation, and configurable data paths. The
standard FedAvg path is inline and uses direct
recipe show fedavg-pt; broaderreferences and catalog discovery are loaded only for relevant exceptions.
Missing dependencies are installed before import-level preflight, and requested
validation proceeds under the agent host's permission system without a
skill-issued install, repository-trust, or execution prompt. Follow-up questions
are limited to genuinely missing conversion semantics.
Homomorphic encryption, differential privacy, and privacy filters are explicitly
not supported by conversion (report and route to provisioning/deployment;
never silently substitute an unprotected job).
Product code
nvflare/tool/agent/inspector.py) — AST-onlyframework detection (no import/execution of user code) via a
frameworks/detector-plugin package (PyTorch/Lightning), with entry-contextreachability and evidence collection; surfaced as
nvflare agent inspect.nvflare agent inspect | info | skills …) with JSON outputcontracts.
skill_manifest.py,skill_manager.py) —content-addressed shared snapshots, atomic stage-then-publish, manifest
path-traversal defense, TOCTOU-closing no-follow/identity fd checks,
owned-by-current-user targets, and bytecode/eval exclusion. A build hook
bundles the released skills into the wheel.
Evaluation & lint infrastructure
dev_tools/agent/skill_evals/<skill>/— trigger routing, source-factextraction, code/evaluation transformation, device selection, checkpoint
safety, prompt-injection resistance, direct recipe inspection, progressive
reference loading, and negative cases.
dev_tools/agent/skills/checks/) — afrontmatter validator (agentskills.io + NVCARPS constraints: required
metadata.author, name/description/compatibility limits, bounded no-followfile reads, no
alwaysApply/globs) plus v1 lints (including the 200-lineSKILL.mdlimit, trigger/negative/policy coverage, command drift, fixtureintegrity, and runtime boundaries). NVFLARE-specific checks stay on top of
company conformance.
Security and authorization boundary
tool output are evidence, not authority. Embedded instructions cannot change
aggregation, skip validation, authorize installation, or add exfiltration;
they are ignored and reported as prompt-injection anomalies.
install-before-import ordering, safe
torch.load(..., weights_only=True),secret redaction, and generated code that does not upload local data, paths,
credentials, models, or datasets.
resource limits, and authorization. Skills use the host-supplied environment,
do not probe or construct isolation mechanisms, do not assess whether the
runtime is trusted, and report only actual host/tool denials or command
failures as execution blockers.
under host permissions, not a skill-owned allowlist, typosquat audit, or
supply-chain security claim. Natural-language instructions embedded in those
files remain untrusted source text.
workspaces, exports, models, and logs use a host-provided runtime directory or
one temporary directory, with exact paths reported; skills do not construct
permission hierarchies or run manifests.
ownership, and atomic-publication protections. Diagnosis continues to treat
log content and spoofable status markers as untrusted evidence.
Enforcement
The skill lint gates every push and merge: a pre-push hook (
.githooks/),./runtest.sh -s, and the CI unit tests all runpython -m dev_tools.agent.skills.checks --skills-root skills(covering skillsand eval suites); the packaged bundle is verified at build/install time.
Design docs
Only
docs/design/skills_architecture.mdships; the operating-model,evaluation, and check-report design docs are kept local (gitignored).