NVIDIA · mnabian · Jun 22, 2026 · Jun 22, 2026 · Jun 23, 2026 · Jun 24, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,6 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 
+- Adds the `physicsnemo-model-builder` agent skill (`skills/physicsnemo-model-builder/`):
+  guides contributors through adding a new model or reusable layer to PhysicsNeMo,
+  or wrapping an existing PyTorch model.
 - Adds Point-Transformer local vector-attention blocks to `physicsnemo.nn`.
 - FSDP2 checkpoint support: full save/load round-trip for
   ``torch.distributed.fsdp`` v2 models, including DTensor edge cases,

diff --git a/skills/physicsnemo-model-builder/BENCHMARK.md b/skills/physicsnemo-model-builder/BENCHMARK.md
@@ -0,0 +1,87 @@
+# Evaluation Report
+
+NVSkills-Eval evaluation of the `physicsnemo-model-builder` skill, run before publication.
+
+This report summarizes the NVSkills-Eval 3-Tier Evaluation results for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use.
+
+> **Status: pending.** The results, Tier-1/Tier-2 findings, and verdict below are
+> populated by an NVSkills-Eval run prior to publication. The evaluation dataset
+> (`evals/evals.json`) and target agents are committed; run the harness and
+> refresh this file before publishing.
+
+## Evaluation Summary
+
+- Skill: `physicsnemo-model-builder`
+- Evaluation date: _pending_
+- NVSkills-Eval profile: `external`
+- Environment: `local`
+- Dataset: 4 evaluation tasks (`evals/evals.json`)
+- Attempts per task: 2
+- Pass threshold: 50%
+- Overall verdict: _pending_
+
+## Agents Used
+
+- `claude-code`
+- `codex`
+
+## Metrics Used
+
+Reported benchmark dimensions:
+
+- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
+- Correctness: checks whether the agent follows the expected workflow and produces the correct final output.
+- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
+- Effectiveness: checks whether the agent performs measurably better with the skill than without it.
+- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work.
+
+Underlying evaluation signals used in this run:
+
+- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
+- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
+- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
+- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
+- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
+- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
+- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.
+
+## Test Tasks
+
+The benchmark dataset contains 4 evaluation tasks:
+
+- Positive tasks: 2 tasks where the skill was expected to activate (add a new model from scratch; wrap an external PyTorch model).
+- Negative tasks: 2 tasks where the model-builder skill was not expected (a model-selection/discovery question that belongs to `physicsnemo-discover`; an out-of-scope datapipe request).
+- Unlabeled tasks: 0.
+
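+Entries whose `expected_skill` is `physicsnemo-model-builder` are treated as positive skill-activation cases; entries with `expected_skill: null`, or naming a different skill (here `physicsnemo-discover`), are treated as negative activation cases for this skill.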
+
+## Results
+
+_Pending NVSkills-Eval run._
+
+| Dimension | Num | `claude-code` | `codex` |
+|---|---:|---:|---:|
+| Security | — | — | — |
+| Correctness | — | — | — |
+| Discoverability | — | — | — |
+| Effectiveness | — | — | — |
+| Efficiency | — | — | — |
+
+Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.
+
+## Tier 1: Static Validation Summary
+
+_Pending NVSkills-Eval run._
+
+## Tier 2: Deduplication Summary
+
+_Pending NVSkills-Eval run._ Note: this skill is intentionally distinct from
+`physicsnemo-discover` (authoring/porting models vs. selecting existing ones);
+the negative eval tasks guard that routing boundary.
+
+## Publication Recommendation
+
+_Pending NVSkills-Eval run._ Refresh this file with the harness output (results
+table, Tier-1/Tier-2 findings, verdict) before publishing, and keep it with the
+skill; re-run when the evaluation dataset, skill behavior, or target agents
+materially change.
diff --git a/skills/physicsnemo-model-builder/SKILL.md b/skills/physicsnemo-model-builder/SKILL.md
@@ -0,0 +1,164 @@
+---
+name: physicsnemo-model-builder
+description: Official NVIDIA-authored workflow for adding a new model or reusable layer to PhysicsNeMo, or integrating an existing PyTorch model. Scaffolds a standards-compliant physicsnemo.Module (or a Module.from_torch wrapper for an external nn.Module), places it correctly, wires exports, writes tests against the house test helpers, and runs the local CI gates (ruff, interrogate, pytest). Use when a contributor wants to add or port a model or layer into the physicsnemo package. Do NOT use for datapipes, nn.functional ops/backends such as FunctionSpec, losses or metrics, training-recipe or example authoring, environment/installation setup, or merely deciding which existing model fits a task (use physicsnemo-discover for that).
+license: Apache-2.0
+metadata:
+  author: NVIDIA <agent-skills@nvidia.com>
+  tags:
+    - physicsnemo
+    - models
+    - contributing
+    - scaffolding
+    - integration
+---
+
+# PhysicsNeMo Model Builder
+
+Drive a contributor from "I have a model (or layer, or an existing PyTorch
+module)" to a standards-compliant, tested, CI-green addition to the
+`physicsnemo` package. **You do the mechanical work** — placement, the
+`physicsnemo.Module` shell, serialization wiring, docstrings, type
+annotations, validation, exports, tests, and gates. **The contributor brings
+the architecture** — the novel `forward` math. Keep that division explicit:
+never invent their model; scaffold everything around it.
+
+The audience is a researcher fluent in PyTorch but new to PhysicsNeMo, so
+**explain the "why"** at each step (name the rule, give the reason) rather
+than silently emitting files.
+
+## Core principles
+
+1. **The written standards are ground truth — read them, don't paraphrase from
+   memory.** The authoritative rules live in `CODING_STANDARDS/` at the repo
+   root: `MODELS_IMPLEMENTATION.md` (rules `MOD-***`) and
+   `EXTERNAL_IMPORTS.md` (rules `EXT-***`). Open the cited rule before relying
+   on it; reference it by ID when you justify a decision. They evolve — a rule
+   recalled from memory may be stale.
+2. **Reuse before you build — discover, don't reinvent.** Half of a clean
+   integration is *not* writing code that already exists. Before scaffolding a
+   `forward`, enumerate what `physicsnemo.nn` already provides and tell the
+   contributor what to import (`references/reuse_map.md`).
+3. **Verify every path before you cite it.** Glob/Read the live repo; a path
+   recalled from memory or pattern-matched from a neighbor is not proof that
+   it exists, so drop it.
+
+## Scope
+
+In scope: **complete models** (`physicsnemo/experimental/models/`), **reusable
+layers** (`physicsnemo/nn/module/`), and **wrapping an existing PyTorch
+`nn.Module`** via `Module.from_torch`.
+
+Out of scope — stop and redirect: datapipes (`physicsnemo/datapipes/`),
+functional ops / custom backends (`physicsnemo/nn/functional/`, `FunctionSpec`),
+losses & metrics (`physicsnemo/metrics/`), training recipes (`examples/`), and
+"which model should I use" (→ `physicsnemo-discover`).
+
+## Workflow
+
+Run in order. Confirm the consequential choices with the contributor
+(artifact type, placement, external-wrap vs from-scratch); scaffold the rest.
+
+### 1. Intake & classify
+
+Ask only what you can't infer (≤4 questions). Resolve:
+
+- **Artifact type:** complete *model*, reusable *layer*, or *wrap* an existing
+  PyTorch module? (Decision tree + rationale: `references/placement.md`.)
+- **Identity:** class name (PascalCase), one-line purpose, the forward
+  inputs/outputs and their tensor shapes, heavy deps.
+- **For wrap:** the import path of their `nn.Module`, and whether its
+  `__init__` args are JSON-serializable — this picks the serialization path
+  (`references/serialization.md`).
+
+### 2. Place it (and say why)
+
+- New **model** → `physicsnemo/experimental/models/<name>/` (`MOD-002a`: new
+  models start in `experimental`). Layout: `__init__.py` (re-exports) +
+  `<name>.py`. New **layer** → `physicsnemo/nn/module/<name>.py`, re-exported
+  from both `physicsnemo/nn/module/__init__.py` and `physicsnemo/nn/__init__.py`
+  (`MOD-000a`). Tests mirror the source path under `test/`.
+- State the rule ID and the reason (experimental = API may change; layers are
+  shared building blocks).
+
+### 3. Reuse audit
+
+Before writing `forward`, enumerate existing primitives the contributor would
+otherwise reinvent — attention bases, embeddings, `Mlp`, neighbor ops
+(`knn`, `radius_search`), the TE-aware `LayerNorm`. Use the live search
+patterns in `references/reuse_map.md`; verify each path before citing it. Say
+explicitly "import `X` from `physicsnemo.nn` instead of writing your own." Keep
+genuinely novel, model-specific pieces local to the model.
+
+### 4. Scaffold the shell
+
+Generate from the skeletons in `references/scaffolds.md`, adapting to the
+contributor's shapes; explain what each enforced piece is for.
+
+- **New model / layer:** subclass **`physicsnemo.Module`** (not
+  `torch.nn.Module` — `MOD-001`); a `ModelMetaData`; a **constructor taking
+  JSON-serializable config** (no splatted `**kwargs` — `MOD-010`; no
+  string-based class selection — `MOD-009`); a `forward` with jaxtyping on
+  every tensor arg (`MOD-006`), `if not torch.compiler.is_compiling():` shape
+  validation (`MOD-005`), and NumPy `r"""` docstrings with
+  `Parameters`/`Forward`/`Outputs` sections and `:math:` shapes (`MOD-003`).
+  Imports upward-only (`EXT-***`).
+- **Wrap external:** `Module.from_torch(TheirModule, meta=...)`. **The
+  serialization gotcha lives here** — `physicsnemo.Module` save/from_checkpoint
+  requires `__init__` args to be JSON-serializable; nested `nn.Module` args
+  must each be converted via `Module.from_torch`. Walk them through
+  `references/serialization.md`, then prove it with the round-trip test below.
+
+### 5. Tests
+
+Generate the test module from `references/scaffolds.md`: one test class per
+public class, the `device` fixture, parametrized constructor/attribute checks
+(≥2 configs; `MOD-008a`), `validate_forward_accuracy` for non-regression
+(`MOD-008b`), and `validate_checkpoint` for the save/load round-trip
+(`MOD-008c`). Both helpers come from `test.common`; don't hand-roll what it provides.
+
+### 6. Gates
+
+From the repo root, run and iterate to green (explain each):
+
+```
+make lint          # ruff format --check + ruff check
+make interrogate   # docstring coverage
+make pytest        # or: pytest test/<mirrored/path> -q
+```
+
+`physicsnemo/experimental/` is exempt from ruff/interrogate, but **not** from
+runtime contracts — the serialization round-trip test must still pass there.
+
+### 7. Finish & review
+
+- Add a one-line `CHANGELOG.md` entry and SPDX Apache-2.0 headers to new files;
+  remind the contributor commits need `-s` (sign-off).
+- Do an independent **code-review pass over the diff** before opening the PR —
+  re-check it against the standards (`MOD-***`/`EXT-***`), correctness, and the
+  reuse audit, ideally with fresh eyes (a separate review session/agent). If the
+  host agent offers a built-in code-review command (for example Claude Code's
+  `/code-review`), use it; otherwise review the diff directly. Then open the PR
+  — CODEOWNERS review + CI re-run the gates.
+
+## Common gotchas
+
+Surface the relevant traps inline as you scaffold (full catalogue:
+`references/lessons.md`):
+
+- **`Module` serialization** is a common external-integration failure: raw
+  `nn.Module` submodule args break `from_checkpoint` (`references/serialization.md`).
+- With Transformer Engine installed, the **TE-aware `LayerNorm`** runs only on
+  CUDA; tests must skip the CPU case in that environment.
+- **`experimental/` skips lint, not runtime contracts.**
+- Promote a model-specific layer to `physicsnemo.nn` only when a **second**
+  consumer appears — keep it local until then.
+
+## Related resources
+
+- `references/placement.md` — artifact decision tree and where each kind goes.
+- `references/reuse_map.md` — live search patterns for existing primitives.
+- `references/serialization.md` — `physicsnemo.Module`, JSON args, `from_torch`.
+- `references/scaffolds.md` — model / layer / external-wrap / test skeletons.
+- `references/lessons.md` — gotchas distilled from real integrations.
+- `CODING_STANDARDS/MODELS_IMPLEMENTATION.md`, `EXTERNAL_IMPORTS.md` — the
+  authoritative rules; read the cited rule before relying on it.
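The serialization contract in SKILL.md step 4 (constructor arguments must be JSON-serializable, and a raw `nn.Module` argument is not) can be illustrated with a minimal standard-library sketch. This is not how `physicsnemo.Module` performs its own checks, which this diff does not show. `non_serializable_args` and `FakeSubmodule` are hypothetical names introduced only for illustration, and the stand-in class merely plays the role of an unconverted submodule instance.

```python
import json


def non_serializable_args(init_kwargs):
    """Return the names of constructor kwargs that json.dumps cannot encode.

    Hypothetical helper: it only illustrates the contract, not the library's check.
    """
    bad = []
    for name, value in init_kwargs.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            # TypeError: unsupported object type; ValueError: circular reference.
            bad.append(name)
    return bad


class FakeSubmodule:
    """Stand-in for a raw nn.Module instance passed as a constructor argument."""


# Plain config values (ints, floats, lists) encode cleanly.
ok_args = {"in_channels": 3, "hidden_sizes": [64, 64], "dropout": 0.1}
# An object instance does not, which mirrors the nested-module gotcha.
bad_args = {"in_channels": 3, "encoder": FakeSubmodule()}

print(non_serializable_args(ok_args))   # []
print(non_serializable_args(bad_args))  # ['encoder']
```

A pre-check like this could flag problem arguments during intake, but the `validate_checkpoint` round-trip named in the skill remains the actual proof.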
diff --git a/skills/physicsnemo-model-builder/evals/evals.json b/skills/physicsnemo-model-builder/evals/evals.json
@@ -0,0 +1,56 @@
+[
+  {
+    "id": "add-new-model-from-scratch",
+    "question": "I have a new graph-transformer surrogate architecture for mesh data. How do I add it as a model in PhysicsNeMo so it follows the repository conventions?",
+    "expected_skill": "physicsnemo-model-builder",
+    "expected_script": null,
+    "ground_truth": "A new model is scaffolded under physicsnemo/experimental/models/<name>/ (MOD-002a: new models start in experimental), as a subclass of physicsnemo.Module (not torch.nn.Module; MOD-001) carrying a ModelMetaData. Its constructor takes JSON-serializable config and builds submodules internally (no splatted **kwargs / no string-based class selection; MOD-010/MOD-009), reusing existing physicsnemo.nn primitives where possible rather than reimplementing them. The forward has jaxtyping annotations on tensor args (MOD-006), is_compiling()-guarded shape validation (MOD-005), and NumPy r-docstrings with Parameters/Forward/Outputs and :math: shapes (MOD-003). Tests mirror the source path and use test.common helpers (validate_forward_accuracy, validate_checkpoint; MOD-008). The contributor supplies the novel forward math; the skill scaffolds everything around it and runs the gates.",
+    "expected_behavior": [
+      "Loads the physicsnemo-model-builder skill.",
+      "Recommends placing the new model under physicsnemo/experimental/models/ and explains why (MOD-002a).",
+      "Scaffolds a physicsnemo.Module subclass (not torch.nn.Module) with a ModelMetaData and a JSON-serializable constructor.",
+      "Performs a reuse audit of physicsnemo.nn before reimplementing primitives.",
+      "Proposes tests using test.common (validate_forward_accuracy / validate_checkpoint), not hand-rolled checkpoint comparisons.",
+      "Does not invent the model's architecture; defers the forward math to the contributor.",
+      "Every absolute path cited in the final message exists on disk."
+    ]
+  },
+  {
+    "id": "wrap-external-pytorch-model",
+    "question": "I already have a trained PyTorch nn.Module. How do I turn it into a PhysicsNeMo model so it can be saved and loaded with from_checkpoint?",
+    "expected_skill": "physicsnemo-model-builder",
+    "expected_script": null,
+    "ground_truth": "An external nn.Module is integrated via Module.from_torch(TheirNet, meta=ModelMetaData()), which yields a physicsnemo.Module supporting save/from_checkpoint/registry. The hard requirement is that the wrapped class's __init__ arguments are JSON-serializable; any nested nn.Module arguments must each be converted with Module.from_torch first (a raw nn.Module argument makes save() raise TypeError). The integration must be proven with a save/load round-trip using validate_checkpoint from test.common. The skill explains this serialization contract explicitly because it is a common external-integration failure.",
+    "expected_behavior": [
+      "Loads the physicsnemo-model-builder skill.",
+      "Recommends Module.from_torch as the external-integration path.",
+      "Explains the serialization contract: __init__ args must be JSON-serializable, and nested nn.Module args must be converted via Module.from_torch.",
+      "Recommends verifying with a validate_checkpoint round-trip.",
+      "Every absolute path cited in the final message exists on disk."
+    ]
+  },
+  {
+    "id": "discovery-defers-to-discover-skill",
+    "question": "Which existing PhysicsNeMo model family should I use for forecasting on a lat-lon grid on the sphere?",
+    "expected_skill": "physicsnemo-discover",
+    "expected_script": null,
+    "ground_truth": "This is a discovery / routing question about which EXISTING model to use, not a request to add or integrate a new model. The physicsnemo-model-builder skill should NOT activate; physicsnemo-discover is the correct skill. The model-builder skill is scoped to authoring/porting new models and layers, not selecting among existing ones.",
+    "expected_behavior": [
+      "Does NOT load the physicsnemo-model-builder skill.",
+      "Treats the request as model selection/discovery (physicsnemo-discover territory), not authoring.",
+      "Does not scaffold a new model or layer."
+    ]
+  },
+  {
+    "id": "out-of-scope-datapipe",
+    "question": "How do I add a new datapipe for my custom HDF5 dataset in PhysicsNeMo?",
+    "expected_skill": null,
+    "expected_script": null,
+    "ground_truth": "Datapipes are out of scope for the physicsnemo-model-builder skill, which covers complete models, reusable layers, and wrapping external PyTorch models. A datapipe belongs under physicsnemo/datapipes/. The skill should not activate and should redirect rather than scaffold a model/layer.",
+    "expected_behavior": [
+      "Does NOT load the physicsnemo-model-builder skill.",
+      "Recognizes datapipes as out of scope and redirects toward physicsnemo/datapipes/.",
+      "Does not scaffold a physicsnemo.Module model or layer."
+    ]
+  }
+]
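The positive/negative split reported in BENCHMARK.md (2 and 2) can be reproduced from the `expected_skill` fields with a short sketch. It assumes that a task naming a different skill counts as a negative case for this skill; how NVSkills-Eval itself labels tasks is not shown in this diff. `split_tasks` is a hypothetical helper, and the inline JSON copies only the `id` and `expected_skill` pairs from `evals/evals.json`.

```python
import json

SKILL = "physicsnemo-model-builder"

# Inline copy of the id / expected_skill pairs from evals/evals.json.
EVALS_JSON = """
[
  {"id": "add-new-model-from-scratch", "expected_skill": "physicsnemo-model-builder"},
  {"id": "wrap-external-pytorch-model", "expected_skill": "physicsnemo-model-builder"},
  {"id": "discovery-defers-to-discover-skill", "expected_skill": "physicsnemo-discover"},
  {"id": "out-of-scope-datapipe", "expected_skill": null}
]
"""


def split_tasks(tasks, skill=SKILL):
    """Split task ids into (positive, negative) activation cases for `skill`.

    Assumption: null or any other skill name means this skill should not activate.
    """
    positive = [t["id"] for t in tasks if t.get("expected_skill") == skill]
    negative = [t["id"] for t in tasks if t.get("expected_skill") != skill]
    return positive, negative


tasks = json.loads(EVALS_JSON)
positive, negative = split_tasks(tasks)
print(len(positive), len(negative))  # 2 2
```

Running a check like this against the committed dataset before refreshing BENCHMARK.md would catch a drift between the task counts in the report and the labels in the JSON.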