
post-hackathon + template merge + modules/subworkflows update #98

Merged
vagkaratzas merged 105 commits into main from dev
May 7, 2026

Conversation

@vagkaratzas
Collaborator

@vagkaratzas vagkaratzas commented May 5, 2026

Added

  • #90 - Added the option to download and use the latest metagRoot HMM library (or use path to an existing one) for domain annotation. (by @angelphanth)
  • #87 - Added the option to download and use the latest NMPFams HMM library (or use path to an existing one) for domain annotation. (by @npechl)
  • #85 - Added zenodo doi in nextflow.config. (by @vagkaratzas)

Changed

  • #93 - nf-core tools template update to 4.0.2. (by @vagkaratzas)
  • #85 - test_full.config input samplesheet path is now set properly. (by @vagkaratzas)

Dependencies

| Tool    | Previous version | New version |
| ------- | ---------------- | ----------- |
| aria2   | 1.36.0           | 1.37.0      |
| multiqc | 1.33             | 1.34        |

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/proteinannotator branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (e.g. nf-test test */local --profile test,docker for all new local tests).
  • Check for unexpected warnings in debug mode (nf-test test */local --profile test,docker,debug).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@github-actions

github-actions Bot commented May 5, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 696ad48

+| ✅ 228 tests passed       |+
#| ❔   7 tests were ignored |#
!| ❗   1 tests had warnings |!
Details

❗ Test warnings:

❔ Tests ignored:

  • files_exist - File is ignored: .github/workflows/ci.yml
  • files_exist - File is ignored: conf/igenomes.config
  • files_exist - File is ignored: conf/igenomes_ignored.config
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: assets/nf-core-proteinannotator_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-proteinannotator_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-proteinannotator_logo_dark.png

✅ Tests passed:

Run details

  • nf-core/tools version 4.0.2
  • Run at 2026-05-07 13:20:00

@Aratz Aratz self-requested a review May 6, 2026 08:23

@Aratz Aratz left a comment


Looks good! Just had some minor suggestions for you to address in this version or the next one.

Comment thread subworkflows/local/domain_annotation/main.nf Outdated
Comment thread docs/usage.md Outdated
Comment thread docs/usage.md

You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).

## Functional Annotation Options


Here you could add a similar section for domain annotation tools

Collaborator Author


It is a bit more straightforward than Functional Annotation, but I agree that a couple of sentences with the database-enabling parameters would make sense; either now or in the future. Noted

Member

@pinin4fjords pinin4fjords left a comment


AI-assisted review (Claude, on behalf of @pinin4fjords).

This is a release PR (dev -> main) bundling a 4.0.2 template merge, modules/subworkflows update, and the new metagRoot domain annotation. Most issues below are doc/schema polish; one (H1) breaks the AWS full test on default params.

Findings

  • 1 high (broken full-test URL)
  • 6 medium inline + 2 medium below (schema typos / orphan section / unused module / hmmer description)
  • 1 low (functional-annotation snapshot wraps a boolean)
  • 3 informational (IPS binary/data version skew, full-test still uses test-sized data, empty PR description)

All claims grepped against commit 04b0928. Nothing here blocks merge on its own; H1 is the only one I'd want fixed before tagging v1.1.0.


Additional findings on lines outside this PR's diff

These touch lines the PR didn't change, so GitHub won't accept them as inline comments. Listing them here so the polish can land alongside the rest.

M5 (medium) - docs/output.md:117 - .gff should be .gff3
The InterProScan module emits *.gff3 (modules/nf-core/interproscan/main.nf:18) and the snapshot confirms .gff3 is what publishes under functional_annotation/interproscan/<sample>/. The line currently reads - `<samplename>.gff`: general feature format (GFF) file and should be <samplename>.gff3.

M7 (medium) - docs/output.md:392-402 - orphan "SeqKit stats" section
Documents a seqkit/<prefix>.tsv output that the pipeline does not produce. grep -rn SEQKIT_STATS only matches files inside modules/nf-core/seqkit/stats/ itself - the module is installed but never imported or invoked anywhere (see also M8 inline on modules.json). The QC TSVs that are produced come from SeqFu and are already documented in the SeqFu section above. Either remove this section, or wire SEQKIT_STATS into FAA_SEQFU_SEQKIT if it was meant to be added.

M1 (medium) - nextflow_schema.json:347 - typo in interproscan_enableprecalc help_text
---diasable-precalc should be --disable-precalc (three dashes, plus "diasable" misspelled). The actual InterProScan flag is --disable-precalc, which is what conf/modules.config:173 correctly passes - so this is purely cosmetic, but it does end up in --help output and the parameter docs site.

M4 (medium) - nextflow_schema.json:319-323 - skip_interproscan description is inverted
For a skip_* flag, "Run InterProScan" reads the wrong way around. Match the wording style of skip_pfam/skip_funfam/etc.: "description": "Skip the functional annotation with InterProScan.". Also the explicit "default": false is unique to this entry among the skip flags - either drop it or add it consistently.
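A consistent entry might look like the following sketch of the corrected `nextflow_schema.json` fragment (the surrounding keys and exact placement within the schema are assumptions, not the file's verified contents):

```json
"skip_interproscan": {
    "type": "boolean",
    "description": "Skip the functional annotation with InterProScan."
}
```

Dropping the explicit `"default": false` here matches the other `skip_*` entries; adding it to all of them would be equally consistent.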

I3 (informational) - empty PR description
The PR body is just the unfilled checklist. For a release-target PR (dev -> main), a 3-4 line summary of what's bundled (template 4.0.2, modules sync, metagRoot, NMPFams) helps reviewers and feeds release notes.

Comment thread conf/test_full.config Outdated
nmpfams_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/nmpfams/nmpfamsdb_test.hmm.gz'
metagroot_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/metagroot/metagroot_test.hmm.gz'
// Functional annotation
interproscan_db_url = params.pipelines_testdata_base_path + 'proteinannotator/testdata/interproscan_test.tar.gz'
Member


Severity: high - this URL 404s on the test-datasets repo, so -profile test_full (i.e. the AWS full test) will fail at the InterProScan database download step.

Verified:

  • …/proteinannotator/testdata/interproscan_test.tar.gz -> 404
  • …/proteinannotator/testdata/interproscan/interproscan_test.tar.gz -> 200 (the path conf/test.config:33 already uses)

The samplesheet path in this file was fixed in this PR (#85), but this URL was missed.

Suggested change
interproscan_db_url = params.pipelines_testdata_base_path + 'proteinannotator/testdata/interproscan_test.tar.gz'
interproscan_db_url = params.pipelines_testdata_base_path + 'proteinannotator/testdata/interproscan/interproscan_test.tar.gz'
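To make the difference concrete, the two URLs can be reconstructed side by side; this Python sketch mirrors the Groovy string concatenation in `conf/test_full.config` (the base-path value is a hypothetical stand-in for `params.pipelines_testdata_base_path`):

```python
# Hypothetical stand-in for params.pipelines_testdata_base_path
base = "https://raw.githubusercontent.com/nf-core/test-datasets/"

broken = base + "proteinannotator/testdata/interproscan_test.tar.gz"               # 404
fixed  = base + "proteinannotator/testdata/interproscan/interproscan_test.tar.gz"  # 200

# The fix only inserts the missing "interproscan/" subdirectory.
assert fixed == broken.replace("testdata/interproscan_test",
                               "testdata/interproscan/interproscan_test")
```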

Comment thread nextflow_schema.json Outdated
"type": "string",
"format": "file-path",
"description": "Path to an already installed NMPFams HMM database.",
"help_text": "If left null and skip_funfam is false, the pipeline will start downloading the latest FunFam HMM library."
Member


Severity: medium - copy-paste from the FunFam block. The help_text refers to skip_funfam and "FunFam HMM library" but this is the nmpfams_db parameter.

Suggested change
"help_text": "If left null and skip_funfam is false, the pipeline will start downloading the latest FunFam HMM library."
"help_text": "If left null and skip_nmpfams is false, the pipeline will start downloading the latest NMPFams HMM library."

Comment thread nextflow_schema.json Outdated
"nmpfams_latest_link": {
"type": "string",
"default": "https://pavlopoulos-lab.org/envofams/databases/hmmer/nmpfamsdb.hmm.gz",
"description": ""
Member


Severity: medium - description is empty for nmpfams_latest_link. The other *_latest_link entries (pfam, funfam, metagroot) all have one. Suggested fill (mirroring the metagroot wording):

Suggested change
"description": ""
"description": "NMPFams hosted link to the latest NMPFams HMM database file."

Comment thread docs/output.md Outdated
Each of the `domain_annotation/` subfolders (e.g., `pfam`, `funfam`) contain a `.domtbl.gz` annotation file per input sample, depending on which domain annotation databases were used in the pipeline execution.
Each of the `domain_annotation/` subfolders (e.g., `pfam`, `funfam`, `nmpfams`, `metagroot`) contain a `.domtbl.gz` annotation file per input sample, depending on which domain annotation databases were used in the pipeline execution.

[hmmer](https://github.com/EddyRivasLab/hmmer) is a fast and flexible alignment trimming tool that keeps phylogenetically informative sites and removes others.
Member


Severity: medium - this sentence describes trimAl, not hmmer. hmmer is a profile-HMM-based sequence search tool. Looks like leftover copy-paste from another pipeline's output docs.

Suggested change
[hmmer](https://github.com/EddyRivasLab/hmmer) is a fast and flexible alignment trimming tool that keeps phylogenetically informative sites and removes others.
[hmmer](https://github.com/EddyRivasLab/hmmer) (HMMER) is a sequence search tool that uses profile hidden Markov models (profile HMMs) to identify homologous sequences against curated databases such as Pfam, FunFam, NMPFams and metagRoot.

Comment thread modules.json Outdated
Comment on lines 53 to 57
"seqkit/stats": {
"branch": "master",
"git_sha": "28935b89b7e1f19e835f8c6e4c8322d4b505dded",
"git_sha": "6d46786420b4d7bc88eba026eb389c0c5535d120",
"installed_by": ["modules"]
},
Member


Severity: medium - seqkit/stats is installed (installed_by: ["modules"]) but never imported anywhere in the workflow. grep -rn SEQKIT_STATS only hits files inside modules/nf-core/seqkit/stats/ itself.

Either remove with nf-core modules remove seqkit/stats (which also lets you drop docs/output.md's orphan SeqKit-stats section, see M7 in the review body), or import and use it. Carrying an unused module bloats the snapshot footprint and the docs drift.

Comment on lines +30 to +33
{ assert snapshot(
path(workflow.out.interproscan_tsv[0][1]).readLines()[0]
.contains("GI|225038609|EFDID|719595|FULL 079fff43a0270e432d339ea71b6f0acf 350 SFLD SFLDS00057 Glutaminase/Asparaginase 17 347 0.0 T")
).match()}
Member


Severity: low - the snapshot value here is just true (see main.nf.test.snap:4), because snapshot() is being passed the result of a .contains() boolean check. That means the test only catches a regression where line 0 of the TSV stops containing that exact substring - any other change to the TSV body slides past silently. None of the rest of the channel content is snapshotted either.

Suggested split: assert the contains-check directly, and snapshot the actual TSV channel separately.

Suggested change
{ assert snapshot(
path(workflow.out.interproscan_tsv[0][1]).readLines()[0]
.contains("GI|225038609|EFDID|719595|FULL 079fff43a0270e432d339ea71b6f0acf 350 SFLD SFLDS00057 Glutaminase/Asparaginase 17 347 0.0 T")
).match()}
then {
assertAll(
{ assert workflow.success },
{ assert path(workflow.out.interproscan_tsv[0][1]).readLines()[0]
.contains("GI|225038609|EFDID|719595|FULL\t079fff43a0270e432d339ea71b6f0acf\t350\tSFLD\tSFLDS00057\tGlutaminase/Asparaginase\t17\t347\t0.0\tT") },
{ assert snapshot(workflow.out.interproscan_tsv).match() }
)
}
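The failure mode is easy to demonstrate outside nf-test. In this Python sketch, `snapshot` is a hypothetical stand-in that persists a value's representation, much as nf-test's `snapshot()` serialises whatever it is given:

```python
def snapshot(value):
    # Hypothetical stand-in: returns what would land in the .snap file.
    return repr(value)

# Abridged stand-in for line 0 of the real InterProScan TSV.
line0 = "GI|225038609|EFDID|719595|FULL\t350\tSFLD\tSFLDS00057"

# Style under review: only a boolean reaches the snapshot.
weak = snapshot("SFLD" in line0)
assert weak == "True"
# A drifted line that still contains the substring snapshots identically:
assert snapshot("SFLD" in (line0 + "\tDRIFT")) == weak

# Suggested split: assert the substring directly, snapshot the content itself.
assert "SFLD" in line0
assert snapshot(line0) != snapshot(line0 + "\tDRIFT")  # drift is now caught
```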

Collaborator Author


The tsv is not matching; I'll keep it as is for now

conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
container "${ workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/interproscan:5.59_91.0--hec16e2b_1' :
Member


Severity: informational (pre-existing, but ships with this v1.1.0 tag).

The module pins interproscan:5.59_91.0--hec16e2b_1, but the default interproscan_db_url in nextflow.config:38 is interproscan-5.72-103.0, and docs/usage.md:91-94 instructs users to download v5.72-103.0 manually. InterProScan data files are not forward/backward compatible across binary versions, so default-param users will hit a runtime mismatch. CI passes because the test fixture is version-matched to 5.59.

Not introduced by this PR, but the v1.1.0 tag here is a reasonable trigger to either (a) pin the default URL to a 5.59-91.0 tarball, or (b) bump the binary to a 5.72-compatible biocontainer when one becomes available, and update docs/usage.md to match.

Collaborator Author


InterProScan is a difficult case; the final version with a conda package (no more updates coming) is stuck on 5.59_91.0. However, the team has been developing newer versions (a whole Nextflow, non-nf-core, pipeline) and we aim to move the nf-core module there at some point, maybe. For now these two versions (module + database) seem compatible, so keeping it like this.

Comment thread conf/test_full.config
metagroot_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/metagroot/metagroot_test.hmm.gz'
// Functional annotation
interproscan_db_url = params.pipelines_testdata_base_path + 'proteinannotator/testdata/interproscan_test.tar.gz'
interproscan_applications = 'Hamap,TIGRFAM,sfld'
Member


Severity: informational - after fixing H1 above, test_full would still point at the same test-sized HMM and InterProScan archives as conf/test.config. The CHANGELOG entry "test_full.config input samplesheet path is now set properly" suggests this is intentional for now, but it means the AWS full test isn't actually exercising a full-size workload. If a real full-size dataset is planned, that's a follow-up - just flagging that test_full currently isn't distinct from test.config.

Collaborator Author


Nothing planned so far. We will need to carefully come up with a different full-size dataset when the pipeline is more mature.

Member

@pinin4fjords pinin4fjords left a comment


AI-assisted follow-up review (Claude, on behalf of @pinin4fjords). Snapshot-hygiene suggestion on the domain_annotation tests.

Comment on lines +108 to +112
{ assert snapshot(
path(workflow.out.nmpfams_domains[0][1]).linesGzip[0..7],
workflow.out.versions.collect { path(it).yaml }.unique()
).match()}
)
Member


Severity: low - the four linesGzip[0..7] assertions in this file (lines 36/73/109/145) inline raw rows into the snapshot. Same coverage, but a hash in the .snap instead of raw content - anchored here on the nmpfams block since the others are outside the diff:

Suggested change
{ assert snapshot(
path(workflow.out.nmpfams_domains[0][1]).linesGzip[0..7],
workflow.out.versions.collect { path(it).yaml }.unique()
).match()}
)
then {
assertAll(
{ assert workflow.success},
{ assert snapshot(
path(workflow.out.nmpfams_domains[0][1]).linesGzip[0..7].join('\n').md5(),
workflow.out.versions.collect { path(it).yaml }.unique()
).match()}
)
}

Same swap applies to the three other linesGzip[0..7] blocks.
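For reference, the effect of the suggested `.md5()` call can be mimicked in Python (the file here is throwaway demo data, not pipeline output):

```python
import gzip, hashlib, os, tempfile

def first_lines_md5(path, n=8):
    # Hash of the first n lines of a gzipped file, mirroring the
    # linesGzip[0..7].join('\n').md5() expression suggested above.
    with gzip.open(path, "rt") as fh:
        head = [next(fh).rstrip("\n") for _ in range(n)]
    return hashlib.md5("\n".join(head).encode()).hexdigest()

tmp = tempfile.mktemp(suffix=".domtbl.gz")
with gzip.open(tmp, "wt") as fh:
    fh.write("\n".join(f"row {i}" for i in range(10)) + "\n")

digest = first_lines_md5(tmp)
os.remove(tmp)
assert len(digest) == 32  # one fixed-size hex digest instead of 8 raw rows
```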

Collaborator Author


I am actually quite fond of seeing actual lines of characters in the snapshots rather than md5sums, especially if they are only a couple of lines like here. Could have both I guess, but for now I'll leave it as is. Interested in documentation links if there are new md5sum guidelines that I've missed!

Member

@pinin4fjords pinin4fjords left a comment


Nothing huge here. The AI found one critical issue and a load of things that I think would be worth fixing. I also think you could use .md5() to keep the snapshots tidier.

Trust you to resolve as appropriate!

@vagkaratzas
Collaborator Author

> Nothing huge here. The AI found one critical issue and a load of things that I think would be worth fixing. I also think you could use .md5() to keep the snapshots tidier.
>
> Trust you to resolve as appropriate!

Is .md5() a new thing in the latest version of nf-test? Are we also adopting it in nf-core/modules? Please link any nf-core documentation that points to that :D

@pinin4fjords pinin4fjords mentioned this pull request May 7, 2026
11 tasks
@vagkaratzas vagkaratzas merged commit cbf78d4 into main May 7, 2026
51 checks passed