104 changes: 6 additions & 98 deletions .claude/CLAUDE.md
@@ -1,101 +1,9 @@
# APCalign — agent guide
# CLAUDE.md

R package that resolves and updates Australian plant taxon names against the
Australian Plant Census (APC) and Australian Plant Name Index (APNI). Exported
functions align messy input names to accepted names, report taxonomic updates,
and supply native/introduced status by state.
This repo keeps its agent & contributor guidance in **[`AGENTS.md`](../AGENTS.md)** so the same
content is tool-agnostic and shared across every agent.

## Architecture
**👉 Read [`AGENTS.md`](../AGENTS.md)** for repo-local orientation (architecture, build & test,
gotchas) and the AusTraits-family cross-package pointer.

The user-facing pipeline is **align → update**:

- `create_taxonomic_update_lookup()` — the main entry point; runs alignment then
taxonomy updating end to end.
- `align_taxa()` ([R/align_taxa.R](../R/align_taxa.R)) — standardises input names
and finds the best APC/APNI alignment. Builds a `taxa` list with `tocheck` and
`checked` tibbles, then delegates to `match_taxa()`.
- `match_taxa()` ([R/match_taxa.R](../R/match_taxa.R)) — **the core matcher, ~2150
lines.** It runs ~54 sequential match branches (`match_01a` … `match_12i`),
each: compute a logical index `i`, `match()` against a resource table, `mutate()`
the matched rows with `aligned_name`/`taxon_rank`/`taxonomic_dataset`/
`aligned_reason`/`alignment_code`, then `redistribute()` checked rows out of
`tocheck` and early-return when `tocheck` is empty. Branches are heavily
copy-pasted — see "Known issues".
- `update_taxonomy()` ([R/update_taxonomy.R](../R/update_taxonomy.R)) — maps aligned
names to currently accepted names, handling synonyms and taxonomic splits.
- `load_taxonomic_resources()` ([R/load_taxonomic_resources.R](../R/load_taxonomic_resources.R))
— downloads/loads APC+APNI parquet files for a dated `version`, derives the
filtered lookup tables (`APC_accepted`, `APC_synonyms`, `APNI_names`,
`genera_accepted`, `genera_synonym`, `genera_APNI`, `family_accepted`,
`family_synonym`, …). Session-cached in the package-private `.pkg_cache`
environment (NOT `.GlobalEnv` — CRAN compliance). Clear with
`clear_cached_resources()`.

Supporting helpers: `standardise_names()`/`strip_names()`/`strip_names_extra()`
(text normalisation), `fuzzy_match()` (Damerau–Levenshtein with first-letter-per-word
constraints), `word()` (fast `stringr::word` replacement), `extract_genus()`.

Diversity functions: `native_anywhere_in_australia()`,
`create_species_state_origin_matrix()`, `state_diversity_counts()`.

## Running tests

Tests need the taxonomic resources loaded. `tests/testthat/helper.R` reuses a
global `resources` if present, else loads **version `2024-10-11`** (the version the
benchmarks were built against — do not bump it casually; benchmark expected values
are tied to it). First load downloads parquet files from GitHub releases.

```r
# fast iteration in an R session — load once, reuse:
devtools::load_all()
resources <- load_taxonomic_resources(version = "2024-10-11", quiet = TRUE)
testthat::test_dir("tests/testthat", filter = "match_branches") # one file

# full suite
devtools::test()
```

Gotchas:
- **Snapshot tests need `NOT_CRAN=true`.** `expect_snapshot_value()` defaults to
skipping on CRAN, so `test_dir()` without `NOT_CRAN` set reports "Reason: On CRAN".
`devtools::test()` sets it for you.
- Network/data-dependent tests (`test-cache`, `test-versions`) `skip_on_cran()`.
- `R CMD check`: `devtools::check()`.

## Test design (important for changing `match_taxa`)

Regression coverage of the matcher lives in:
- `benchmarks/test_matches_alignments_updates.csv` — curated inputs, one per match
branch; covers **all 54 branches**. Asserted in
`test-operation_outputs.R` for `aligned_name`, `taxon_rank`,
`taxonomic_dataset`, **and `alignment_code`**.
- `test-match_branches.R` — dark-branch resolution, `aligned_reason`
well-formedness, and a **full-output snapshot** (`_snaps/match_branches.md`,
date-normalised) that pins the alignment contract across all branches.

This snapshot + assertions are the safety net for refactoring `match_taxa()`: a
behaviour-preserving change leaves them green. If you legitimately change matcher
output, regenerate with `testthat::snapshot_accept("match_branches")` and update
the CSV.

## Conventions

- Tidyverse style: `%>%`, `dplyr`/`stringr`/`purrr`, namespaced calls (`dplyr::`).
- roxygen2 with markdown; run `devtools::document()` after changing `@` docs — keep
`man/*.Rd` and `NAMESPACE` in sync (do not hand-edit them).
- User-facing output text changes (e.g. `aligned_reason`) warrant a `NEWS.md` entry.
- `Sys.Date()` is embedded in every `aligned_reason`, so raw output is not
reproducible day to day — normalise the date when snapshotting/diffing.

## Known issues / landmines

- **`match_taxa.R` duplication.** ~54 near-identical blocks. This has already bred
bugs (a missing ` (` before the date in three `aff.` fuzzy branches). A helper
(`apply_match(...)`) would collapse it dramatically — do it as its own PR backed
by the snapshot above.
- **`native_anywhere_in_australia()`**: the `is.null(resources)` guard runs *after*
`create_species_state_origin_matrix()` (so offline errors instead of failing
gracefully), and its `apply(..., grepl("native", x))` greps across all columns
including the species name.
- Tests assume resources are available; there is no `skip_if_offline()` guard in the
operation tests.
Don't duplicate that content here — edit `AGENTS.md` and this file stays correct by reference.
128 changes: 128 additions & 0 deletions AGENTS.md
@@ -0,0 +1,128 @@
# APCalign — agent & contributor guide

R package that resolves and updates Australian plant taxon names against the Australian Plant Census
(APC) and Australian Plant Name Index (APNI). Exported functions align messy input names to accepted
names, report taxonomic updates, and supply native/introduced status by state.

## Repo-local guidance

The user-facing pipeline is **align → update**:

- `create_taxonomic_update_lookup()` — the main entry point; runs alignment then
taxonomy updating end to end.
- `align_taxa()` ([R/align_taxa.R](R/align_taxa.R)) — standardises input names
and finds the best APC/APNI alignment. Builds a `taxa` list with `tocheck` and
`checked` tibbles, then delegates to `match_taxa()`.
- `match_taxa()` ([R/match_taxa.R](R/match_taxa.R)) — **the core matcher, ~2150
lines.** It runs ~54 sequential match branches (`match_01a` … `match_12i`),
each: compute a logical index `i`, `match()` against a resource table, `mutate()`
the matched rows with `aligned_name`/`taxon_rank`/`taxonomic_dataset`/
`aligned_reason`/`alignment_code`, then `redistribute()` checked rows out of
`tocheck` and early-return when `tocheck` is empty. Branches are heavily
copy-pasted — see "Known issues".
- `update_taxonomy()` ([R/update_taxonomy.R](R/update_taxonomy.R)) — maps aligned
names to currently accepted names, handling synonyms and taxonomic splits.
- `load_taxonomic_resources()` ([R/load_taxonomic_resources.R](R/load_taxonomic_resources.R))
— downloads/loads APC+APNI parquet files for a dated `version`, derives the
filtered lookup tables (`APC_accepted`, `APC_synonyms`, `APNI_names`,
`genera_accepted`, `genera_synonym`, `genera_APNI`, `family_accepted`,
`family_synonym`, …). Session-cached in the package-private `.pkg_cache`
environment (NOT `.GlobalEnv` — CRAN compliance). Clear with
`clear_cached_resources()`.
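
The per-branch pattern described for `match_taxa()` can be sketched roughly as follows. This is a minimal toy in base R, not the package's code: `toy_branch()`, the `toy_01a` code, the `checked` flag column, and both data frames are hypothetical, and the real branches use tibbles with `mutate()` and `redistribute()` rather than the direct assignments shown here.

```r
# Toy stand-in for one match branch. All names and data are illustrative,
# not APCalign's real tables or code.
toy_branch <- function(taxa, resource, code) {
  # 1. logical index of `tocheck` rows whose cleaned name is in the resource
  i <- taxa$tocheck$cleaned_name %in% resource$canonical_name
  # 2. positions of those names in the resource table
  ii <- match(taxa$tocheck$cleaned_name[i], resource$canonical_name)
  # 3. fill in the alignment columns for the matched rows
  taxa$tocheck$aligned_name[i] <- resource$canonical_name[ii]
  taxa$tocheck$taxon_rank[i] <- resource$taxon_rank[ii]
  taxa$tocheck$alignment_code[i] <- code
  taxa$tocheck$checked[i] <- TRUE
  # 4. move checked rows out of `tocheck` (the role redistribute() plays)
  taxa$checked <- rbind(taxa$checked, taxa$tocheck[taxa$tocheck$checked, ])
  taxa$tocheck <- taxa$tocheck[!taxa$tocheck$checked, ]
  taxa
}

resource <- data.frame(
  canonical_name = c("Acacia aneura", "Banksia serrata"),
  taxon_rank = c("species", "species"),
  stringsAsFactors = FALSE
)
tocheck <- data.frame(
  cleaned_name = c("Banksia serrata", "Not a plant"),
  aligned_name = NA_character_,
  taxon_rank = NA_character_,
  alignment_code = NA_character_,
  checked = FALSE,
  stringsAsFactors = FALSE
)
taxa <- list(tocheck = tocheck, checked = tocheck[0, ])
taxa <- toy_branch(taxa, resource, code = "toy_01a")

# A caller would stop running further branches once nothing is left to check.
done <- nrow(taxa$tocheck) == 0
```

After the call, the matched row sits in `taxa$checked` and the unmatched name stays in `taxa$tocheck` for a later branch to try.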

Supporting helpers: `standardise_names()`/`strip_names()`/`strip_names_extra()`
(text normalisation), `fuzzy_match()` (Damerau–Levenshtein with first-letter-per-word
constraints), `word()` (fast `stringr::word` replacement), `extract_genus()`.
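
To illustrate what a first-letter-per-word constraint does to fuzzy matching, here is a small sketch. It is not the package's `fuzzy_match()`: `toy_fuzzy_match()` and its `max_dist` argument are hypothetical, and it swaps in base R's `utils::adist()` (generalised Levenshtein distance) instead of Damerau-Levenshtein, so a transposition counts as two edits here.

```r
# Hypothetical helper, not APCalign's fuzzy_match(): plain Levenshtein via
# utils::adist() plus a first-letter-per-word constraint.
toy_fuzzy_match <- function(txt, candidates, max_dist = 2) {
  first_letters <- function(x) {
    vapply(
      strsplit(tolower(x), " +"),
      function(w) paste(substr(w, 1, 1), collapse = ""),
      character(1)
    )
  }
  # keep only candidates whose words start with the same letters as the input
  ok <- first_letters(candidates) == first_letters(txt)
  if (!any(ok)) return(NA_character_)
  d <- utils::adist(txt, candidates[ok], ignore.case = TRUE)[1, ]
  if (min(d) > max_dist) return(NA_character_)
  candidates[ok][which.min(d)]
}

candidates <- c("Banksia serrata", "Banksia integrifolia", "Acacia aneura")
hit <- toy_fuzzy_match("Banksia serata", candidates)    # one edit from "Banksia serrata"
miss <- toy_fuzzy_match("Zanksia serrata", candidates)  # first letter differs: no match
```

The constraint prunes candidates before any distance is computed, which is why the second call returns `NA` even though the strings are only one edit apart.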

Diversity functions: `native_anywhere_in_australia()`,
`create_species_state_origin_matrix()`, `state_diversity_counts()`.

### Running tests

Tests need the taxonomic resources loaded. `tests/testthat/helper.R` reuses a
global `resources` if present, else loads **version `2024-10-11`** (the version the
benchmarks were built against — do not bump it casually; benchmark expected values
are tied to it). First load downloads parquet files from GitHub releases.

```r
# fast iteration in an R session — load once, reuse:
devtools::load_all()
resources <- load_taxonomic_resources(version = "2024-10-11", quiet = TRUE)
testthat::test_dir("tests/testthat", filter = "match_branches") # one file

# full suite
devtools::test()
```

Gotchas:
- **Snapshot tests need `NOT_CRAN=true`.** `expect_snapshot_value()` defaults to
skipping on CRAN, so `test_dir()` without `NOT_CRAN` set reports "Reason: On CRAN".
`devtools::test()` sets it for you.
- Network/data-dependent tests (`test-cache`, `test-versions`) `skip_on_cran()`.
- `R CMD check`: `devtools::check()`.
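
If you want `test_dir()` to run the snapshot tests, one option is to set the environment variable yourself before calling it. A minimal sketch, assuming it runs from the package root after the `devtools::load_all()` and `resources` setup shown earlier (the `dir.exists()` guard is only there so the snippet is harmless elsewhere):

```r
# Mark this session as "not CRAN" so snapshot tests are not skipped.
Sys.setenv(NOT_CRAN = "true")

# Run one test file only when executed from the package root with testthat
# installed; otherwise this block does nothing.
if (dir.exists("tests/testthat") && requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_dir("tests/testthat", filter = "match_branches")
}
```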

### Test design (important for changing `match_taxa`)

Regression coverage of the matcher lives in:
- `benchmarks/test_matches_alignments_updates.csv` — curated inputs, one per match
branch; covers **all 54 branches**. Asserted in
`test-operation_outputs.R` for `aligned_name`, `taxon_rank`,
`taxonomic_dataset`, **and `alignment_code`**.
- `test-match_branches.R` — dark-branch resolution, `aligned_reason`
well-formedness, and a **full-output snapshot** (`_snaps/match_branches.md`,
date-normalised) that pins the alignment contract across all branches.

This snapshot + assertions are the safety net for refactoring `match_taxa()`: a
behaviour-preserving change leaves them green. If you legitimately change matcher
output, regenerate with `testthat::snapshot_accept("match_branches")` and update
the CSV.

### Conventions

- Tidyverse style: `%>%`, `dplyr`/`stringr`/`purrr`, namespaced calls (`dplyr::`).
- roxygen2 with markdown; run `devtools::document()` after changing `@` docs — keep
`man/*.Rd` and `NAMESPACE` in sync (do not hand-edit them).
- User-facing output text changes (e.g. `aligned_reason`) warrant a `NEWS.md` entry.
- `Sys.Date()` is embedded in every `aligned_reason`, so raw output is not
reproducible day to day — normalise the date when snapshotting/diffing.
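
Date normalisation can be as simple as replacing anything that looks like an ISO date with a placeholder. The sketch below is illustrative: `normalise_date()` is a hypothetical helper, the sample strings are made up, and it assumes the date appears in `YYYY-MM-DD` form (the default printing of `Sys.Date()`).

```r
# Replace any ISO date (YYYY-MM-DD) with a fixed placeholder so output from
# different days compares equal. The sample strings are illustrative only.
normalise_date <- function(x, placeholder = "<DATE>") {
  gsub("[0-9]{4}-[0-9]{2}-[0-9]{2}", placeholder, x)
}

day1 <- "Exact match of taxon name to an accepted name (2024-10-14)"
day2 <- "Exact match of taxon name to an accepted name (2024-10-15)"

raw_equal <- identical(day1, day2)                                   # FALSE
norm_equal <- identical(normalise_date(day1), normalise_date(day2))  # TRUE
```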

### Known issues / landmines

- **`match_taxa.R` duplication.** ~54 near-identical blocks. This has already bred
bugs (a missing ` (` before the date in three `aff.` fuzzy branches). A helper
(`apply_match(...)`) would collapse it dramatically — do it as its own PR backed
by the snapshot above.
- **`native_anywhere_in_australia()`**: the `is.null(resources)` guard runs *after*
`create_species_state_origin_matrix()` (so an offline call errors instead of failing
gracefully), and its `apply(..., grepl("native", x))` greps across all columns,
including the species name.
- Tests assume resources are available; there is no `skip_if_offline()` guard in the
operation tests.
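
The all-columns `grepl()` problem noted for `native_anywhere_in_australia()` can be reproduced on toy data. Everything in this sketch is an assumption made for illustration: the species names are invented, and the column names and status values are not the real output of `create_species_state_origin_matrix()`.

```r
# Illustrative data only: column names, species, and values are assumptions.
origin <- data.frame(
  species = c("Examplea nativea", "Examplea exotica"),
  NSW = c("naturalised", "naturalised"),
  Vic = c("not present", "naturalised"),
  stringsAsFactors = FALSE
)

# Grepping every column: the first row matches because of its *name*.
all_cols <- apply(origin, 1, function(x) any(grepl("native", x)))

# Grepping only the state columns avoids that false positive.
state_cols <- setdiff(names(origin), "species")
states_only <- apply(
  origin[, state_cols, drop = FALSE], 1,
  function(x) any(grepl("native", x))
)
```

Here `all_cols` flags the first species as native purely because the string "native" occurs in its name, while `states_only` flags neither row.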

---

## AusTraits family — cross-package context

`APCalign` is part of the **AusTraits family** (a subset of the
[`traitecoevo`](https://github.com/traitecoevo) org) — here, plant taxonomy alignment against the
Australian Plant Census (APC) / APNI, including native/introduced status. Family-wide concerns are
documented centrally in
**[austraits-meta](https://github.com/traitecoevo/austraits-meta)** — don't restate them here, read
them there:

- **Start with [`AGENTS.md`](https://github.com/traitecoevo/austraits-meta/blob/main/AGENTS.md)** —
pipeline order, who owns what, dependency direction, source-of-truth rules, cross-boundary
artifacts, gotchas.
- **[`dependencies.yml`](https://github.com/traitecoevo/austraits-meta/blob/main/dependencies.yml)** —
machine-readable package graph + cross-boundary artifacts.
- **[`governance/`](https://github.com/traitecoevo/austraits-meta/tree/main/governance)** —
label taxonomy, board #9 conventions, release playbooks, triage.

**Filing issues:** the whole family is tracked on one board,
[AusTraits #9](https://github.com/orgs/traitecoevo/projects/9) (new issues auto-add to it). Follow
the [issue & labelling guide](https://github.com/traitecoevo/austraits-meta/blob/main/governance/issue-guide.md):
pick one work-type label (`bug` / `task` / `epic`); Status and Priority are set on the board, not as
labels.

> austraits-meta is hand-maintained prose — a map, not ground truth. Verify specifics against the
> actual repos.
12 changes: 12 additions & 0 deletions README.Rmd
@@ -159,3 +159,15 @@ Did you come across an unexpected taxon name change? Elusive error you can't deb

We welcome any comments and contributions to the package. Start by [submitting an issue](https://github.com/traitecoevo/APCalign/issues) and we can take it from there!

## AusTraits family

`APCalign` is part of the **AusTraits family** of packages maintained by the
[AusTraits](https://austraits.org) team. See **[austraits.org](https://austraits.org)** for the
project, the data, and the people behind it.

Contributing? Issues across the family are tracked on one board,
[AusTraits #9](https://github.com/orgs/traitecoevo/projects/9), and new issues are auto-added. Please
read the [issue & labelling guide](https://github.com/traitecoevo/austraits-meta/blob/main/governance/issue-guide.md)
in [`austraits-meta`](https://github.com/traitecoevo/austraits-meta) — the family's cross-package
knowledge and governance hub — before filing.

12 changes: 12 additions & 0 deletions README.md
@@ -256,3 +256,15 @@ our best to help.
We welcome any comments and contributions to the package. Start by
[submitting an issue](https://github.com/traitecoevo/APCalign/issues) and we
can take it from there!

## AusTraits family

`APCalign` is part of the **AusTraits family** of packages maintained by the
[AusTraits](https://austraits.org) team. See **[austraits.org](https://austraits.org)** for the
project, the data, and the people behind it.

Contributing? Issues across the family are tracked on one board,
[AusTraits #9](https://github.com/orgs/traitecoevo/projects/9), and new issues are auto-added. Please
read the [issue & labelling guide](https://github.com/traitecoevo/austraits-meta/blob/main/governance/issue-guide.md)
in [`austraits-meta`](https://github.com/traitecoevo/austraits-meta) — the family's cross-package
knowledge and governance hub — before filing.