3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Adds the `physicsnemo-model-builder` agent skill (`skills/physicsnemo-model-builder/`):
guides contributors through adding a new model or reusable layer to PhysicsNeMo,
or wrapping an existing PyTorch model.
- Adds Point-Transformer local vector-attention blocks to `physicsnemo.nn`.
- FSDP2 checkpoint support: full save/load round-trip for
``torch.distributed.fsdp`` v2 models, including DTensor edge cases,
87 changes: 87 additions & 0 deletions skills/physicsnemo-model-builder/BENCHMARK.md
@@ -0,0 +1,87 @@
# Evaluation Report

Pre-publication evaluation of the `physicsnemo-model-builder` skill, run through NVSkills-Eval.

This benchmark summarizes the results of the NVSkills-Eval 3-Tier Evaluation for the skill. The goal is to document whether the skill is safe, discoverable, effective, and useful for agents before it is published for broader workflow use.

> **Status: pending.** The results, Tier-1/Tier-2 findings, and verdict below are
> populated by an NVSkills-Eval run prior to publication. The evaluation dataset
> (`evals/evals.json`) and target agents are committed; run the harness and
> refresh this file before publishing.

## Evaluation Summary

- Skill: `physicsnemo-model-builder`
- Evaluation date: _pending_
- NVSkills-Eval profile: `external`
- Environment: `local`
- Dataset: 4 evaluation tasks (`evals/evals.json`)
- Attempts per task: 2
- Pass threshold: 50%
- Overall verdict: _pending_

## Agents Used

- `claude-code`
- `codex`

## Metrics Used

Reported benchmark dimensions:

- Security: checks whether skill-assisted execution avoids unsafe behavior such as secret leakage, destructive commands, or unauthorized access.
- Correctness: checks whether the agent follows the expected workflow and produces the correct final output.
- Discoverability: checks whether the agent loads the skill when relevant and avoids using it when irrelevant.
- Effectiveness: checks whether the agent performs measurably better with the skill than without it.
- Efficiency: checks whether the agent uses fewer tokens and avoids redundant work.

Underlying evaluation signals used in this run:

- `security` (Security): checks for unsafe operations, secret leakage, and unauthorized access.
- `skill_execution` (Skill Execution): verifies that the agent loaded the expected skill and workflow.
- `skill_efficiency` (Efficiency): checks routing quality, decoy avoidance, and redundant tool usage.
- `accuracy` (Accuracy): grades final-answer correctness against the reference answer.
- `goal_accuracy` (Goal Accuracy): checks whether the overall user task completed successfully.
- `behavior_check` (Behavior Check): verifies expected behavior steps, including safety expectations.
- `token_efficiency` (Token Efficiency): compares token usage with and without the skill.

## Test Tasks

The benchmark dataset contained 4 evaluation tasks:

- Positive tasks: 2 tasks where the skill was expected to activate (add a new model from scratch; wrap an external PyTorch model).
- Negative tasks: 2 tasks where the model-builder skill was not expected (a model-selection/discovery question that belongs to `physicsnemo-discover`; an out-of-scope datapipe request).
- Unlabeled tasks: 0.

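Entries with `expected_skill` set to `physicsnemo-model-builder` are treated as positive skill-activation cases for this skill; entries with `expected_skill: null`, or with `expected_skill` naming a different skill such as `physicsnemo-discover`, are treated as negative activation cases for it.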
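
The 2 positive / 2 negative split counted above can be reproduced from the dataset with a few lines of Python. This is a minimal sketch, not the NVSkills-Eval implementation: the helper name `split_by_activation` is hypothetical, and treating an entry that names a different skill as a negative case for this skill is an assumption chosen to match the task counts in this report.

```python
import json

SKILL = "physicsnemo-model-builder"


def split_by_activation(entries, skill=SKILL):
    """Partition eval entry ids into (positive, negative) cases for `skill`."""
    positive, negative = [], []
    for entry in entries:
        # .get() tolerates entries that omit the key entirely.
        if entry.get("expected_skill") == skill:
            positive.append(entry["id"])
        else:
            # Assumption: null or a different skill both count as negative.
            negative.append(entry["id"])
    return positive, negative


# Trimmed-down copies of the four entries in evals/evals.json.
sample = json.loads("""
[
  {"id": "add-new-model-from-scratch", "expected_skill": "physicsnemo-model-builder"},
  {"id": "wrap-external-pytorch-model", "expected_skill": "physicsnemo-model-builder"},
  {"id": "discovery-defers-to-discover-skill", "expected_skill": "physicsnemo-discover"},
  {"id": "out-of-scope-datapipe", "expected_skill": null}
]
""")

positive, negative = split_by_activation(sample)
print(len(positive), len(negative))
```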

## Results

_Pending NVSkills-Eval run._

| Dimension | Num | `claude-code` | `codex` |
|---|---:|---:|---:|
| Security | — | — | — |
| Correctness | — | — | — |
| Discoverability | — | — | — |
| Effectiveness | — | — | — |
| Efficiency | — | — | — |

Score values show skill-assisted performance. Values in parentheses show uplift versus the no-skill baseline when baseline data is available.

## Tier 1: Static Validation Summary

_Pending NVSkills-Eval run._

## Tier 2: Deduplication Summary

_Pending NVSkills-Eval run._ Note: this skill is intentionally distinct from
`physicsnemo-discover` (authoring/porting models vs. selecting existing ones);
the negative eval tasks guard that routing boundary.

## Publication Recommendation

_Pending NVSkills-Eval run._ Refresh this file with the harness output (results
table, Tier-1/Tier-2 findings, verdict) before publishing, and keep it with the
skill; re-run when the evaluation dataset, skill behavior, or target agents
materially change.
164 changes: 164 additions & 0 deletions skills/physicsnemo-model-builder/SKILL.md
@@ -0,0 +1,164 @@
---
name: physicsnemo-model-builder
description: Official NVIDIA-authored workflow for adding a new model or reusable layer to PhysicsNeMo, or integrating an existing PyTorch model. Scaffolds a standards-compliant physicsnemo.Module (or a Module.from_torch wrapper for an external nn.Module), places it correctly, wires exports, writes tests against the house test helpers, and runs the local CI gates (ruff, interrogate, pytest). Use when a contributor wants to add or port a model or layer into the physicsnemo package. Do NOT use for datapipes, nn.functional ops/backends such as FunctionSpec, losses or metrics, training-recipe or example authoring, environment/installation setup, or merely deciding which existing model fits a task (use physicsnemo-discover for that).
license: Apache-2.0
metadata:
author: NVIDIA <agent-skills@nvidia.com>
tags:
- physicsnemo
- models
- contributing
- scaffolding
- integration
---

# PhysicsNeMo Model Builder

Drive a contributor from "I have a model (or layer, or an existing PyTorch
module)" to a standards-compliant, tested, CI-green addition to the
`physicsnemo` package. **You do the mechanical work** — placement, the
`physicsnemo.Module` shell, serialization wiring, docstrings, type
annotations, validation, exports, tests, and gates. **The contributor brings
the architecture** — the novel `forward` math. Keep that division explicit:
never invent their model; scaffold everything around it.

The audience is a researcher fluent in PyTorch but new to PhysicsNeMo, so
**explain the "why"** at each step (name the rule, give the reason) rather
than silently emitting files.

## Core principle

1. **The written standards are ground truth — read them, don't paraphrase from
memory.** The authoritative rules live in `CODING_STANDARDS/` at the repo
root: `MODELS_IMPLEMENTATION.md` (rules `MOD-***`) and
`EXTERNAL_IMPORTS.md` (rules `EXT-***`). Open the cited rule before relying
on it; reference it by ID when you justify a decision. They evolve — a rule
recalled from memory may be stale.
2. **Reuse before you build — discover, don't reinvent.** Half of a clean
integration is *not* writing code that already exists. Before scaffolding a
`forward`, enumerate what `physicsnemo.nn` already provides and tell the
contributor what to import (`references/reuse_map.md`).
3. **Verify every path before you cite it.** Glob/Read the live repo; a path
recalled from memory or pattern-matched from a neighbor is unverified, and one
that fails to resolve is disproved: drop it.
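
The path check in principle 3 can be sketched as a small filter that separates paths that resolve from paths that do not. This is an illustrative sketch only: `split_verified` is a hypothetical helper, and the temporary directory below is a stand-in for a real checkout of the repo.

```python
import tempfile
from pathlib import Path


def split_verified(repo_root, candidates):
    """Return (existing, missing) for candidate repo-relative paths."""
    root = Path(repo_root)
    existing, missing = [], []
    for rel in candidates:
        (existing if (root / rel).exists() else missing).append(rel)
    return existing, missing


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a repo checkout; a real run would pass the repo root.
    standards = Path(tmp) / "CODING_STANDARDS"
    standards.mkdir()
    (standards / "MODELS_IMPLEMENTATION.md").write_text("# rules\n")

    existing, missing = split_verified(
        tmp,
        [
            "CODING_STANDARDS/MODELS_IMPLEMENTATION.md",
            "CODING_STANDARDS/NOT_A_REAL_FILE.md",  # deliberately bogus
        ],
    )

print("cite:", existing)
print("drop:", missing)
```

Only the entries in `existing` are safe to mention to the contributor; anything in `missing` is dropped rather than corrected by guesswork.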

## Scope

In scope: **complete models** (`physicsnemo/experimental/models/`), **reusable
layers** (`physicsnemo/nn/module/`), and **wrapping an existing PyTorch
`nn.Module`** via `Module.from_torch`.

Out of scope — stop and redirect: datapipes (`physicsnemo/datapipes/`),
functional ops / custom backends (`physicsnemo/nn/functional/`, `FunctionSpec`),
losses & metrics (`physicsnemo/metrics/`), training recipes (`examples/`), and
"which model should I use" (→ `physicsnemo-discover`).

## Workflow

Run in order. Confirm the consequential choices with the contributor
(artifact type, placement, external-wrap vs from-scratch); scaffold the rest.

### 1. Intake & classify

Ask only what you can't infer (≤4 questions). Resolve:

- **Artifact type:** complete *model*, reusable *layer*, or *wrap* an existing
PyTorch module? (Decision tree + rationale: `references/placement.md`.)
- **Identity:** class name (PascalCase), one-line purpose, the forward
inputs/outputs and their tensor shapes, heavy deps.
- **For wrap:** the import path of their `nn.Module`, and whether its
`__init__` args are JSON-serializable — this picks the serialization path
(`references/serialization.md`).
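
For the wrap case, a quick first-pass check of whether constructor arguments are JSON-serializable might look like the sketch below. It uses only the standard-library `json` module; the helper `non_serializable_args` and the `FakeSubmodule` class are hypothetical stand-ins, and the check that PhysicsNeMo itself performs may be stricter or differ in detail.

```python
import json


def non_serializable_args(init_kwargs):
    """Return names of constructor arguments that json.dumps rejects."""
    bad = []
    for name, value in init_kwargs.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            bad.append(name)
    return bad


class FakeSubmodule:
    """Stand-in for a raw nn.Module instance passed as a constructor arg."""


kwargs = {
    "hidden_dim": 128,
    "activation": "gelu",
    "layers": [64, 64],
    "encoder": FakeSubmodule(),  # the kind of argument that needs conversion
}
flagged = non_serializable_args(kwargs)
print(flagged)  # ['encoder']
```

Any argument this flags is a candidate for the nested-module handling described in `references/serialization.md`.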

### 2. Place it (and say why)

- New **model** → `physicsnemo/experimental/models/<name>/` (`MOD-002a`: new
models start in `experimental`). Layout: `__init__.py` (re-exports) +
`<name>.py`. New **layer** → `physicsnemo/nn/module/<name>.py`, re-exported
from both `physicsnemo/nn/module/__init__.py` and `physicsnemo/nn/__init__.py`
(`MOD-000a`). Tests mirror the source path under `test/`.
- State the rule ID and the reason (experimental = API may change; layers are
shared building blocks).
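
The placement rules above can be written down as a lookup from artifact type to the files that get created or updated. This is a sketch under the layout stated in this section; `planned_files` and the example name `my_gnn` are hypothetical, and the wrap case is left out because its placement follows the decision tree in `references/placement.md`.

```python
from pathlib import PurePosixPath


def planned_files(artifact, name):
    """Map an artifact type and snake_case name to files to create/update."""
    if artifact == "model":
        pkg = PurePosixPath("physicsnemo/experimental/models") / name
        return {
            "create": [str(pkg / "__init__.py"), str(pkg / f"{name}.py")],
            "update": [],
        }
    if artifact == "layer":
        return {
            "create": [f"physicsnemo/nn/module/{name}.py"],
            # Layers are re-exported from both package __init__ files.
            "update": [
                "physicsnemo/nn/module/__init__.py",
                "physicsnemo/nn/__init__.py",
            ],
        }
    raise ValueError(f"no placement rule sketched for artifact type {artifact!r}")


model_plan = planned_files("model", "my_gnn")
layer_plan = planned_files("layer", "my_gnn")
print(model_plan["create"])
print(layer_plan["update"])
```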

### 3. Reuse audit

Before writing `forward`, enumerate existing primitives the contributor would
otherwise reinvent — attention bases, embeddings, `Mlp`, neighbor ops
(`knn`, `radius_search`), the TE-aware `LayerNorm`. Use the live search
patterns in `references/reuse_map.md`; verify each path before citing it. Say
explicitly "import `X` from `physicsnemo.nn` instead of writing your own." Keep
genuinely novel, model-specific pieces local to the model.
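
One way to run such an audit against the live tree is to parse the package and list class definitions whose names match a keyword. The sketch below is not one of the search patterns from `references/reuse_map.md`; `find_classes` is a hypothetical helper, and the two files it scans are fabricated in a temporary directory purely for the demonstration.

```python
import ast
import tempfile
from pathlib import Path


def find_classes(package_dir, needle):
    """Map class name -> relative file for classes whose name contains `needle`."""
    hits = {}
    for path in sorted(Path(package_dir).rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef) and needle.lower() in node.name.lower():
                hits[node.name] = path.relative_to(package_dir).as_posix()
    return hits


with tempfile.TemporaryDirectory() as tmp:
    # Fake package contents; a real audit would point at the checked-out package.
    (Path(tmp) / "mlp.py").write_text("class Mlp:\n    pass\n")
    (Path(tmp) / "norm.py").write_text("class LayerNorm:\n    pass\n")
    hits = find_classes(tmp, "mlp")

print(hits)  # {'Mlp': 'mlp.py'}
```

Parsing with `ast` rather than importing keeps the audit cheap and avoids pulling in heavy optional dependencies just to list names.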

### 4. Scaffold the shell

Generate from the skeletons in `references/scaffolds.md`, adapting to the
contributor's shapes; explain what each enforced piece is for.

- **New model / layer:** subclass **`physicsnemo.Module`** (not
`torch.nn.Module` — `MOD-001`); a `ModelMetaData`; a **constructor taking
JSON-serializable config** (no splatted `**kwargs` — `MOD-010`; no
string-based class selection — `MOD-009`); a `forward` with jaxtyping on
every tensor arg (`MOD-006`), `if not torch.compiler.is_compiling():` shape
validation (`MOD-005`), and NumPy `r"""` docstrings with
`Parameters`/`Forward`/`Outputs` sections and `:math:` shapes (`MOD-003`).
Imports upward-only (`EXT-***`).
- **Wrap external:** `Module.from_torch(TheirModule, meta=...)`. **The
serialization gotcha lives here** — `physicsnemo.Module` save/from_checkpoint
requires `__init__` args to be JSON-serializable; nested `nn.Module` args
must each be converted via `Module.from_torch`. Walk them through
`references/serialization.md`, then prove it with the round-trip test below.
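
The "no splatted `**kwargs`" constraint on constructors can be spot-checked mechanically with `inspect`. The sketch below reads `MOD-010` only as it is summarized in this skill; the rule text itself remains authoritative. `splatted_kwargs` is a hypothetical helper, and `Explicit` and `Splatted` are plain Python stand-ins rather than `physicsnemo.Module` subclasses.

```python
import inspect


def splatted_kwargs(cls):
    """Return the name of a **kwargs parameter on cls.__init__, or None."""
    for param in inspect.signature(cls.__init__).parameters.values():
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return param.name
    return None


class Explicit:
    """Constructor spells out every config value."""

    def __init__(self, in_dim: int, hidden_dim: int = 64):
        self.in_dim, self.hidden_dim = in_dim, hidden_dim


class Splatted:
    """Constructor swallows arbitrary keyword arguments."""

    def __init__(self, in_dim: int, **kwargs):
        self.in_dim, self.extra = in_dim, kwargs


print(splatted_kwargs(Explicit))  # None
print(splatted_kwargs(Splatted))  # kwargs
```

A class that defines no `__init__` of its own inherits `object.__init__`, whose signature also reports a variadic keyword parameter, so a hit from this helper is a prompt to look at the class rather than a verdict.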

### 5. Tests

Generate the test module from `references/scaffolds.md`: one test class per
public class, the `device` fixture, parametrized constructor/attribute checks
(at least 2 configs, `MOD-008a`), `validate_forward_accuracy` for
non-regression (`MOD-008b`), and `validate_checkpoint` for the save/load
round-trip (`MOD-008c`). Both validators come from `test.common`. Don't
hand-roll what `test.common` provides.

### 6. Gates

From the repo root, run and iterate to green (explain each):

```
make lint # ruff format --check + ruff check
make interrogate # docstring coverage
make pytest # or: pytest test/<mirrored/path> -q
```

`physicsnemo/experimental/` is exempt from ruff/interrogate, but **not** from
runtime contracts — the serialization round-trip test must still pass there.

### 7. Finish & review

- Add a one-line `CHANGELOG.md` entry and SPDX Apache-2.0 headers to new files;
remind the contributor commits need `-s` (sign-off).
- Do an independent **code-review pass over the diff** before opening the PR —
re-check it against the standards (`MOD-***`/`EXT-***`), correctness, and the
reuse audit, ideally with fresh eyes (a separate review session/agent). If the
host agent offers a built-in code-review command (for example Claude Code's
`/code-review`), use it; otherwise review the diff directly. Then open the PR;
CODEOWNERS review it and CI re-runs the gates.
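
A quick pre-PR check for the license header could scan the top of each new file for the SPDX identifier. The sketch below assumes the header contains the standard tag `SPDX-License-Identifier: Apache-2.0` within the first ten lines; the repo's full header template may include more than that, and `missing_spdx` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

TAG = "SPDX-License-Identifier: Apache-2.0"


def missing_spdx(paths, head_lines=10):
    """Return names of files whose first `head_lines` lines lack the SPDX tag."""
    missing = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            head = [next(fh, "") for _ in range(head_lines)]
        if not any(TAG in line for line in head):
            missing.append(Path(path).name)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    # Two throwaway files standing in for newly added sources.
    tagged = Path(tmp) / "tagged.py"
    bare = Path(tmp) / "bare.py"
    tagged.write_text(f"# {TAG}\n\nx = 1\n")
    bare.write_text("x = 1\n")
    unlicensed = missing_spdx([tagged, bare])

print(unlicensed)  # ['bare.py']
```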

## Common gotchas

Surface the relevant traps inline as you scaffold (full catalogue:
`references/lessons.md`):

- **`Module` serialization** is a common external-integration failure: raw
`nn.Module` submodule args break `from_checkpoint` (`references/serialization.md`).
- The **TE-aware `LayerNorm`** runs only on CUDA when Transformer Engine is
present; tests must skip the CPU case under TE.
- **`experimental/` skips lint, not runtime contracts.**
- Promote a model-specific layer to `physicsnemo.nn` only when a **second**
consumer appears — keep it local until then.

## Related resources

- `references/placement.md` — artifact decision tree and where each kind goes.
- `references/reuse_map.md` — live search patterns for existing primitives.
- `references/serialization.md` — `physicsnemo.Module`, JSON args, `from_torch`.
- `references/scaffolds.md` — model / layer / external-wrap / test skeletons.
- `references/lessons.md` — gotchas distilled from real integrations.
- `CODING_STANDARDS/MODELS_IMPLEMENTATION.md`, `EXTERNAL_IMPORTS.md` — the
authoritative rules; read the cited rule before relying on it.
56 changes: 56 additions & 0 deletions skills/physicsnemo-model-builder/evals/evals.json
@@ -0,0 +1,56 @@
[
{
"id": "add-new-model-from-scratch",
"question": "I have a new graph-transformer surrogate architecture for mesh data. How do I add it as a model in PhysicsNeMo so it follows the repository conventions?",
"expected_skill": "physicsnemo-model-builder",
"expected_script": null,
"ground_truth": "A new model is scaffolded under physicsnemo/experimental/models/<name>/ (MOD-002a: new models start in experimental), as a subclass of physicsnemo.Module (not torch.nn.Module; MOD-001) carrying a ModelMetaData. Its constructor takes JSON-serializable config and builds submodules internally (no splatted **kwargs / no string-based class selection; MOD-010/MOD-009), reusing existing physicsnemo.nn primitives where possible rather than reimplementing them. The forward has jaxtyping annotations on tensor args (MOD-006), is_compiling()-guarded shape validation (MOD-005), and NumPy r-docstrings with Parameters/Forward/Outputs and :math: shapes (MOD-003). Tests mirror the source path and use test.common helpers (validate_forward_accuracy, validate_checkpoint; MOD-008). The contributor supplies the novel forward math; the skill scaffolds everything around it and runs the gates.",
"expected_behavior": [
"Loads the physicsnemo-model-builder skill.",
"Recommends placing the new model under physicsnemo/experimental/models/ and explains why (MOD-002a).",
"Scaffolds a physicsnemo.Module subclass (not torch.nn.Module) with a ModelMetaData and a JSON-serializable constructor.",
"Performs a reuse audit of physicsnemo.nn before reimplementing primitives.",
"Proposes tests using test.common (validate_forward_accuracy / validate_checkpoint), not hand-rolled checkpoint comparisons.",
"Does not invent the model's architecture; defers the forward math to the contributor.",
"Every absolute path cited in the final message exists on disk."
]
},
{
"id": "wrap-external-pytorch-model",
"question": "I already have a trained PyTorch nn.Module. How do I turn it into a PhysicsNeMo model so it can be saved and loaded with from_checkpoint?",
"expected_skill": "physicsnemo-model-builder",
"expected_script": null,
"ground_truth": "An external nn.Module is integrated via Module.from_torch(TheirNet, meta=ModelMetaData()), which yields a physicsnemo.Module supporting save/from_checkpoint/registry. The hard requirement is that the wrapped class's __init__ arguments are JSON-serializable; any nested nn.Module arguments must each be converted with Module.from_torch first (a raw nn.Module argument makes save() raise TypeError). The integration must be proven with a save/load round-trip using validate_checkpoint from test.common. The skill explains this serialization contract explicitly because it is a common external-integration failure.",
"expected_behavior": [
"Loads the physicsnemo-model-builder skill.",
"Recommends Module.from_torch as the external-integration path.",
"Explains the serialization contract: __init__ args must be JSON-serializable, and nested nn.Module args must be converted via Module.from_torch.",
"Recommends verifying with a validate_checkpoint round-trip.",
"Every absolute path cited in the final message exists on disk."
]
},
{
"id": "discovery-defers-to-discover-skill",
"question": "Which existing PhysicsNeMo model family should I use for forecasting on a lat-lon grid on the sphere?",
"expected_skill": "physicsnemo-discover",
"expected_script": null,
"ground_truth": "This is a discovery / routing question about which EXISTING model to use, not a request to add or integrate a new model. The physicsnemo-model-builder skill should NOT activate; physicsnemo-discover is the correct skill. The model-builder skill is scoped to authoring/porting new models and layers, not selecting among existing ones.",
"expected_behavior": [
"Does NOT load the physicsnemo-model-builder skill.",
"Treats the request as model selection/discovery (physicsnemo-discover territory), not authoring.",
"Does not scaffold a new model or layer."
]
},
{
"id": "out-of-scope-datapipe",
"question": "How do I add a new datapipe for my custom HDF5 dataset in PhysicsNeMo?",
"expected_skill": null,
"expected_script": null,
"ground_truth": "Datapipes are out of scope for the physicsnemo-model-builder skill, which covers complete models, reusable layers, and wrapping external PyTorch models. A datapipe belongs under physicsnemo/datapipes/. The skill should not activate and should redirect rather than scaffold a model/layer.",
"expected_behavior": [
"Does NOT load the physicsnemo-model-builder skill.",
"Recognizes datapipes as out of scope and redirects toward physicsnemo/datapipes/.",
"Does not scaffold a physicsnemo.Module model or layer."
]
}
]