diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
index fa05307..c2c3f2a 100644
--- a/.claude/CLAUDE.md
+++ b/.claude/CLAUDE.md
@@ -1,101 +1,9 @@
-# APCalign — agent guide
+# CLAUDE.md
-R package that resolves and updates Australian plant taxon names against the
-Australian Plant Census (APC) and Australian Plant Name Index (APNI). Exported
-functions align messy input names to accepted names, report taxonomic updates,
-and supply native/introduced status by state.
+This repo keeps its agent & contributor guidance in **[`AGENTS.md`](../AGENTS.md)** so the same
+content is tool-agnostic and shared across every agent.
-## Architecture
+**👉 Read [`AGENTS.md`](../AGENTS.md)** for repo-local orientation (architecture, build & test,
+gotchas) and the AusTraits-family cross-package pointer.
-The user-facing pipeline is **align → update**:
-
-- `create_taxonomic_update_lookup()` — the main entry point; runs alignment then
-  taxonomy updating end to end.
-- `align_taxa()` ([R/align_taxa.R](../R/align_taxa.R)) — standardises input names
-  and finds the best APC/APNI alignment. Builds a `taxa` list with `tocheck` and
-  `checked` tibbles, then delegates to `match_taxa()`.
-- `match_taxa()` ([R/match_taxa.R](../R/match_taxa.R)) — **the core matcher, ~2150
-  lines.** It runs ~54 sequential match branches (`match_01a` … `match_12i`),
-  each: compute a logical index `i`, `match()` against a resource table, `mutate()`
-  the matched rows with `aligned_name`/`taxon_rank`/`taxonomic_dataset`/
-  `aligned_reason`/`alignment_code`, then `redistribute()` checked rows out of
-  `tocheck` and early-return when `tocheck` is empty. Branches are heavily
-  copy-pasted — see "Known issues".
-- `update_taxonomy()` ([R/update_taxonomy.R](../R/update_taxonomy.R)) — maps aligned
-  names to currently accepted names, handling synonyms and taxonomic splits.
-- `load_taxonomic_resources()` ([R/load_taxonomic_resources.R](../R/load_taxonomic_resources.R))
-  — downloads/loads APC+APNI parquet files for a dated `version`, derives the
-  filtered lookup tables (`APC_accepted`, `APC_synonyms`, `APNI_names`,
-  `genera_accepted`, `genera_synonym`, `genera_APNI`, `family_accepted`,
-  `family_synonym`, …). Session-cached in the package-private `.pkg_cache`
-  environment (NOT `.GlobalEnv` — CRAN compliance). Clear with
-  `clear_cached_resources()`.
-
-Supporting helpers: `standardise_names()`/`strip_names()`/`strip_names_extra()`
-(text normalisation), `fuzzy_match()` (Damerau–Levenshtein with first-letter-per-word
-constraints), `word()` (fast `stringr::word` replacement), `extract_genus()`.
-
-Diversity functions: `native_anywhere_in_australia()`,
-`create_species_state_origin_matrix()`, `state_diversity_counts()`.
-
-## Running tests
-
-Tests need the taxonomic resources loaded. `tests/testthat/helper.R` reuses a
-global `resources` if present, else loads **version `2024-10-11`** (the version the
-benchmarks were built against — do not bump it casually; benchmark expected values
-are tied to it). First load downloads parquet files from GitHub releases.
-
-```r
-# fast iteration in an R session — load once, reuse:
-devtools::load_all()
-resources <- load_taxonomic_resources(version = "2024-10-11", quiet = TRUE)
-testthat::test_dir("tests/testthat", filter = "match_branches") # one file
-
-# full suite
-devtools::test()
-```
-
-Gotchas:
-- **Snapshot tests need `NOT_CRAN=true`.** `expect_snapshot_value()` defaults to
-  skipping on CRAN, so `test_dir()` without `NOT_CRAN` set reports "Reason: On CRAN".
-  `devtools::test()` sets it for you.
-- Network/data-dependent tests (`test-cache`, `test-versions`) `skip_on_cran()`.
-- `R CMD check`: `devtools::check()`.
-
-## Test design (important for changing `match_taxa`)
-
-Regression coverage of the matcher lives in:
-- `benchmarks/test_matches_alignments_updates.csv` — curated inputs, one per match
-  branch; covers **all 54 branches**. Asserted in
-  `test-operation_outputs.R` for `aligned_name`, `taxon_rank`,
-  `taxonomic_dataset`, **and `alignment_code`**.
-- `test-match_branches.R` — dark-branch resolution, `aligned_reason`
-  well-formedness, and a **full-output snapshot** (`_snaps/match_branches.md`,
-  date-normalised) that pins the alignment contract across all branches.
-
-This snapshot + assertions are the safety net for refactoring `match_taxa()`: a
-behaviour-preserving change leaves them green. If you legitimately change matcher
-output, regenerate with `testthat::snapshot_accept("match_branches")` and update
-the CSV.
-
-## Conventions
-
-- Tidyverse style: `%>%`, `dplyr`/`stringr`/`purrr`, namespaced calls (`dplyr::`).
-- roxygen2 with markdown; run `devtools::document()` after changing `@` docs — keep
-  `man/*.Rd` and `NAMESPACE` in sync (do not hand-edit them).
-- User-facing output text changes (e.g. `aligned_reason`) warrant a `NEWS.md` entry.
-- `Sys.Date()` is embedded in every `aligned_reason`, so raw output is not
-  reproducible day to day — normalise the date when snapshotting/diffing.
-
-## Known issues / landmines
-
-- **`match_taxa.R` duplication.** ~54 near-identical blocks. This has already bred
-  bugs (a missing ` (` before the date in three `aff.` fuzzy branches). A helper
-  (`apply_match(...)`) would collapse it dramatically — do it as its own PR backed
-  by the snapshot above.
-- **`native_anywhere_in_australia()`**: the `is.null(resources)` guard runs *after*
-  `create_species_state_origin_matrix()` (so offline errors instead of failing
-  gracefully), and its `apply(..., grepl("native", x))` greps across all columns
-  including the species name.
-- Tests assume resources are available; there is no `skip_if_offline()` guard in the
-  operation tests.
+Don't duplicate that content here — edit `AGENTS.md` and this file stays correct by reference.
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..da9d477
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,128 @@
+# APCalign — agent & contributor guide
+
+R package that resolves and updates Australian plant taxon names against the Australian Plant Census
+(APC) and Australian Plant Name Index (APNI). Exported functions align messy input names to accepted
+names, report taxonomic updates, and supply native/introduced status by state.
+
+## Repo-local guidance
+
+The user-facing pipeline is **align → update**:
+
+- `create_taxonomic_update_lookup()` — the main entry point; runs alignment then
+  taxonomy updating end to end.
+- `align_taxa()` ([R/align_taxa.R](R/align_taxa.R)) — standardises input names
+  and finds the best APC/APNI alignment. Builds a `taxa` list with `tocheck` and
+  `checked` tibbles, then delegates to `match_taxa()`.
+- `match_taxa()` ([R/match_taxa.R](R/match_taxa.R)) — **the core matcher, ~2150
+  lines.** It runs ~54 sequential match branches (`match_01a` … `match_12i`),
+  each: compute a logical index `i`, `match()` against a resource table, `mutate()`
+  the matched rows with `aligned_name`/`taxon_rank`/`taxonomic_dataset`/
+  `aligned_reason`/`alignment_code`, then `redistribute()` checked rows out of
+  `tocheck` and early-return when `tocheck` is empty. Branches are heavily
+  copy-pasted — see "Known issues".
+- `update_taxonomy()` ([R/update_taxonomy.R](R/update_taxonomy.R)) — maps aligned
+  names to currently accepted names, handling synonyms and taxonomic splits.
+- `load_taxonomic_resources()` ([R/load_taxonomic_resources.R](R/load_taxonomic_resources.R))
+  — downloads/loads APC+APNI parquet files for a dated `version`, derives the
+  filtered lookup tables (`APC_accepted`, `APC_synonyms`, `APNI_names`,
+  `genera_accepted`, `genera_synonym`, `genera_APNI`, `family_accepted`,
+  `family_synonym`, …). Session-cached in the package-private `.pkg_cache`
+  environment (NOT `.GlobalEnv` — CRAN compliance). Clear with
+  `clear_cached_resources()`.
+
+Supporting helpers: `standardise_names()`/`strip_names()`/`strip_names_extra()`
+(text normalisation), `fuzzy_match()` (Damerau–Levenshtein with first-letter-per-word
+constraints), `word()` (fast `stringr::word` replacement), `extract_genus()`.
+
+Diversity functions: `native_anywhere_in_australia()`,
+`create_species_state_origin_matrix()`, `state_diversity_counts()`.
+
+### Running tests
+
+Tests need the taxonomic resources loaded. `tests/testthat/helper.R` reuses a
+global `resources` if present, else loads **version `2024-10-11`** (the version the
+benchmarks were built against — do not bump it casually; benchmark expected values
+are tied to it). First load downloads parquet files from GitHub releases.
+
+```r
+# fast iteration in an R session — load once, reuse:
+devtools::load_all()
+resources <- load_taxonomic_resources(version = "2024-10-11", quiet = TRUE)
+testthat::test_dir("tests/testthat", filter = "match_branches") # one file
+
+# full suite
+devtools::test()
+```
+
+Gotchas:
+- **Snapshot tests need `NOT_CRAN=true`.** `expect_snapshot_value()` defaults to
+  skipping on CRAN, so `test_dir()` without `NOT_CRAN` set reports "Reason: On CRAN".
+  `devtools::test()` sets it for you.
+- Network/data-dependent tests (`test-cache`, `test-versions`) `skip_on_cran()`.
+- `R CMD check`: `devtools::check()`.
+
+### Test design (important for changing `match_taxa`)
+
+Regression coverage of the matcher lives in:
+- `benchmarks/test_matches_alignments_updates.csv` — curated inputs, one per match
+  branch; covers **all 54 branches**. Asserted in
+  `test-operation_outputs.R` for `aligned_name`, `taxon_rank`,
+  `taxonomic_dataset`, **and `alignment_code`**.
+- `test-match_branches.R` — dark-branch resolution, `aligned_reason`
+  well-formedness, and a **full-output snapshot** (`_snaps/match_branches.md`,
+  date-normalised) that pins the alignment contract across all branches.
+
+This snapshot + assertions are the safety net for refactoring `match_taxa()`: a
+behaviour-preserving change leaves them green. If you legitimately change matcher
+output, regenerate with `testthat::snapshot_accept("match_branches")` and update
+the CSV.
+
+### Conventions
+
+- Tidyverse style: `%>%`, `dplyr`/`stringr`/`purrr`, namespaced calls (`dplyr::`).
+- roxygen2 with markdown; run `devtools::document()` after changing `@` docs — keep
+  `man/*.Rd` and `NAMESPACE` in sync (do not hand-edit them).
+- User-facing output text changes (e.g. `aligned_reason`) warrant a `NEWS.md` entry.
+- `Sys.Date()` is embedded in every `aligned_reason`, so raw output is not
+  reproducible day to day — normalise the date when snapshotting/diffing.
+
+### Known issues / landmines
+
+- **`match_taxa.R` duplication.** ~54 near-identical blocks. This has already bred
+  bugs (a missing ` (` before the date in three `aff.` fuzzy branches). A helper
+  (`apply_match(...)`) would collapse it dramatically — do it as its own PR backed
+  by the snapshot above.
+- **`native_anywhere_in_australia()`**: the `is.null(resources)` guard runs *after*
+  `create_species_state_origin_matrix()` (so offline errors instead of failing
+  gracefully), and its `apply(..., grepl("native", x))` greps across all columns
+  including the species name.
+- Tests assume resources are available; there is no `skip_if_offline()` guard in the
+  operation tests.
+
+---
+
+## AusTraits family — cross-package context
+
+`APCalign` is part of the **AusTraits family** (a subset of the
+[`traitecoevo`](https://github.com/traitecoevo) org) — here, plant taxonomy alignment against the
+Australian Plant Census (APC) / APNI, including native/introduced status. Family-wide concerns are
+documented centrally in
+**[austraits-meta](https://github.com/traitecoevo/austraits-meta)** — don't restate them here, read
+them there:
+
+- **Start with [`AGENTS.md`](https://github.com/traitecoevo/austraits-meta/blob/main/AGENTS.md)** —
+  pipeline order, who owns what, dependency direction, source-of-truth rules, cross-boundary
+  artifacts, gotchas.
+- **[`dependencies.yml`](https://github.com/traitecoevo/austraits-meta/blob/main/dependencies.yml)** —
+  machine-readable package graph + cross-boundary artifacts.
+- **[`governance/`](https://github.com/traitecoevo/austraits-meta/tree/main/governance)** —
+  label taxonomy, board #9 conventions, release playbooks, triage.
+
+**Filing issues:** the whole family is tracked on one board,
+[AusTraits #9](https://github.com/orgs/traitecoevo/projects/9) (new issues auto-add to it). Follow
+the [issue & labelling guide](https://github.com/traitecoevo/austraits-meta/blob/main/governance/issue-guide.md):
+pick one work-type label (`bug` / `task` / `epic`); Status and Priority are set on the board, not as
+labels.
+
+> austraits-meta is hand-maintained prose — a map, not ground truth. Verify specifics against the
+> actual repos.
diff --git a/README.Rmd b/README.Rmd
index 04530a2..3bf9c83 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -159,3 +159,15 @@ Did you come across an unexpected taxon name change? Elusive error you can't deb
 We welcome any comments and contributions to the package, start by [submit an issue](https://github.com/traitecoevo/APCalign/issues) and we can take it from there!
+## AusTraits family
+
+`APCalign` is part of the **AusTraits family** of packages maintained by the
+[AusTraits](https://austraits.org) team. See **[austraits.org](https://austraits.org)** for the
+project, the data, and the people behind it.
+
+Contributing? Issues across the family are tracked on one board,
+[AusTraits #9](https://github.com/orgs/traitecoevo/projects/9), and new issues are auto-added. Please
+read the [issue & labelling guide](https://github.com/traitecoevo/austraits-meta/blob/main/governance/issue-guide.md)
+in [`austraits-meta`](https://github.com/traitecoevo/austraits-meta) — the family's cross-package
+knowledge and governance hub — before filing.
+
diff --git a/README.md b/README.md
index 442023a..c75041d 100644
--- a/README.md
+++ b/README.md
@@ -256,3 +256,15 @@ our best to help.
 We welcome any comments and contributions to the package, start by [submit an issue](https://github.com/traitecoevo/APCalign/issues) and we can take it from there!
+
+## AusTraits family
+
+`APCalign` is part of the **AusTraits family** of packages maintained by the
+[AusTraits](https://austraits.org) team. See **[austraits.org](https://austraits.org)** for the
+project, the data, and the people behind it.
+
+Contributing? Issues across the family are tracked on one board,
+[AusTraits #9](https://github.com/orgs/traitecoevo/projects/9), and new issues are auto-added. Please
+read the [issue & labelling guide](https://github.com/traitecoevo/austraits-meta/blob/main/governance/issue-guide.md)
+in [`austraits-meta`](https://github.com/traitecoevo/austraits-meta) — the family's cross-package
+knowledge and governance hub — before filing.