post-hackathon + template merge + modules/subworkflows update #98
vagkaratzas merged 105 commits into main from dev
Conversation
post-release version bump
… to nextflow.config
…duction full test correct input samplesheet path, typos fix, zenodo doi added…
Important! Template update for nf-core/tools v4.0.2
update nf-core modules and subworkflows to latest
pre-release v bump
Aratz
left a comment
Looks good! Just had some minor suggestions for you to address in this version or the next one.
You can also generate such `YAML`/`JSON` files via [nf-core/launch](https://nf-co.re/launch).
`## Functional Annotation Options`
Here you could add a similar section for domain annotation tools
It is a bit more straightforward than Functional Annotation, but I agree that a couple of sentences with the database-enabling parameters would make sense; either now or in the future. Noted.
pinin4fjords
left a comment
AI-assisted review (Claude, on behalf of @pinin4fjords).
This is a release PR (dev -> main) bundling a 4.0.2 template merge, modules/subworkflows update, and the new metagRoot domain annotation. Most issues below are doc/schema polish; one (H1) breaks the AWS full test on default params.
Findings
- 1 high (broken full-test URL)
- 6 medium inline + 2 medium below (schema typos / orphan section / unused module / hmmer description)
- 1 low (functional-annotation snapshot wraps a boolean)
- 3 informational (IPS binary/data version skew, full-test still uses test-sized data, empty PR description)
All claims grepped against commit 04b0928. Nothing here blocks merge on its own; H1 is the only one I'd want fixed before tagging v1.1.0.
Additional findings on lines outside this PR's diff
These touch lines the PR didn't change, so GitHub won't accept them as inline comments. Listing them here so the polish can land alongside the rest.
M5 (medium) - docs/output.md:117 - .gff should be .gff3
The InterProScan module emits *.gff3 (modules/nf-core/interproscan/main.nf:18) and the snapshot confirms .gff3 is what publishes under functional_annotation/interproscan/<sample>/. The line currently reads `<samplename>.gff`: general feature format (GFF) file, and should read `<samplename>.gff3`.
M7 (medium) - docs/output.md:392-402 - orphan "SeqKit stats" section
Documents a seqkit/<prefix>.tsv output that the pipeline does not produce. grep -rn SEQKIT_STATS only matches files inside modules/nf-core/seqkit/stats/ itself - the module is installed but never imported or invoked anywhere (see also M8 inline on modules.json). The QC TSVs that are produced come from SeqFu and are already documented in the SeqFu section above. Either remove this section, or wire SEQKIT_STATS into FAA_SEQFU_SEQKIT if it was meant to be added.
M1 (medium) - nextflow_schema.json:347 - typo in interproscan_enableprecalc help_text
---diasable-precalc should be --disable-precalc (three dashes, plus "diasable" misspelled). The actual InterProScan flag is --disable-precalc, which is what conf/modules.config:173 correctly passes - so this is purely cosmetic, but it does end up in --help output and the parameter docs site.
M4 (medium) - nextflow_schema.json:319-323 - skip_interproscan description is inverted
For a skip_* flag, "Run InterProScan" reads the wrong way around. Match the wording style of skip_pfam/skip_funfam/etc.: "description": "Skip the functional annotation with InterProScan.". Also the explicit "default": false is unique to this entry among the skip flags - either drop it or add it consistently.
I3 (informational) - empty PR description
The PR body is just the unfilled checklist. For a release-target PR (dev -> main), a 3-4 line summary of what's bundled (template 4.0.2, modules sync, metagRoot, NMPFams) helps reviewers and feeds release notes.
nmpfams_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/nmpfams/nmpfamsdb_test.hmm.gz'
metagroot_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/metagroot/metagroot_test.hmm.gz'
// Functional annotation
interproscan_db_url = params.pipelines_testdata_base_path + 'proteinannotator/testdata/interproscan_test.tar.gz'
Severity: high - this URL 404s on the test-datasets repo, so -profile test_full (i.e. the AWS full test) will fail at the InterProScan database download step.
Verified:
…/proteinannotator/testdata/interproscan_test.tar.gz -> 404
…/proteinannotator/testdata/interproscan/interproscan_test.tar.gz -> 200 (the path conf/test.config:33 already uses)
The samplesheet path in this file was fixed in this PR (#85), but this URL was missed.
Suggested change:
- interproscan_db_url = params.pipelines_testdata_base_path + 'proteinannotator/testdata/interproscan_test.tar.gz'
+ interproscan_db_url = params.pipelines_testdata_base_path + 'proteinannotator/testdata/interproscan/interproscan_test.tar.gz'
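For future release checks, the 404/200 probe is easy to script. A sketch, assuming the default nf-core test-datasets raw URL as the base path (verify it against `params.pipelines_testdata_base_path` in nextflow.config); the curl probes are commented out since they need network access:

```shell
# Assumed base: default nf-core test-datasets raw URL.
base='https://raw.githubusercontent.com/nf-core/test-datasets'
broken="${base}/proteinannotator/testdata/interproscan_test.tar.gz"
fixed="${base}/proteinannotator/testdata/interproscan/interproscan_test.tar.gz"

# Uncomment to probe over the network:
# curl -s -o /dev/null -w '%{http_code}\n' "$broken"   # reported 404 at review time
# curl -s -o /dev/null -w '%{http_code}\n' "$fixed"    # reported 200 at review time
echo "$fixed"
```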
"type": "string",
"format": "file-path",
"description": "Path to an already installed NMPFams HMM database.",
"help_text": "If left null and skip_funfam is false, the pipeline will start downloading the latest FunFam HMM library."
Severity: medium - copy-paste from the FunFam block. The help_text refers to skip_funfam and "FunFam HMM library" but this is the nmpfams_db parameter.
Suggested change:
- "help_text": "If left null and skip_funfam is false, the pipeline will start downloading the latest FunFam HMM library."
+ "help_text": "If left null and skip_nmpfams is false, the pipeline will start downloading the latest NMPFams HMM library."
"nmpfams_latest_link": {
"type": "string",
"default": "https://pavlopoulos-lab.org/envofams/databases/hmmer/nmpfamsdb.hmm.gz",
"description": ""
Severity: medium - description is empty for nmpfams_latest_link. The other *_latest_link entries (pfam, funfam, metagroot) all have one. Suggested fill (mirroring the metagroot wording):
Suggested change:
- "description": ""
+ "description": "NMPFams hosted link to the latest NMPFams HMM database file."
- Each of the `domain_annotation/` subfolders (e.g., `pfam`, `funfam`) contain a `.domtbl.gz` annotation file per input sample, depending on which domain annotation databases were used in the pipeline execution.
+ Each of the `domain_annotation/` subfolders (e.g., `pfam`, `funfam`, `nmpfams`, `metagroot`) contain a `.domtbl.gz` annotation file per input sample, depending on which domain annotation databases were used in the pipeline execution.
[hmmer](https://github.com/EddyRivasLab/hmmer) is a fast and flexible alignment trimming tool that keeps phylogenetically informative sites and removes others.
Severity: medium - this sentence describes trimAl, not hmmer. hmmer is a profile-HMM-based sequence search tool. Looks like leftover copy-paste from another pipeline's output docs.
Suggested change:
- [hmmer](https://github.com/EddyRivasLab/hmmer) is a fast and flexible alignment trimming tool that keeps phylogenetically informative sites and removes others.
+ [hmmer](https://github.com/EddyRivasLab/hmmer) (HMMER) is a sequence search tool that uses profile hidden Markov models (profile HMMs) to identify homologous sequences against curated databases such as Pfam, FunFam, NMPFams and metagRoot.
"seqkit/stats": {
"branch": "master",
- "git_sha": "28935b89b7e1f19e835f8c6e4c8322d4b505dded",
+ "git_sha": "6d46786420b4d7bc88eba026eb389c0c5535d120",
"installed_by": ["modules"]
},
Severity: medium - seqkit/stats is installed (installed_by: ["modules"]) but never imported anywhere in the workflow. grep -rn SEQKIT_STATS only hits files inside modules/nf-core/seqkit/stats/ itself.
Either remove with nf-core modules remove seqkit/stats (which also lets you drop docs/output.md's orphan SeqKit-stats section, see M7 in the review body), or import and use it. Carrying an unused module bloats the snapshot footprint and the docs drift.
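The same installed-but-unused check generalizes to any module. A minimal sketch: the only input is the module path; the grep at the end is illustrative and assumes it is run from the pipeline root (only the name derivation is executed here):

```shell
# Map an nf-core module path to its default process name,
# e.g. seqkit/stats -> SEQKIT_STATS.
mod='seqkit/stats'
proc=$(printf '%s' "$mod" | tr '/a-z' '_A-Z')
echo "$proc"

# Illustrative usage (run from the pipeline root); zero hits outside
# modules/nf-core/ means the module is installed but never imported:
# grep -rn "include.*${proc}" --include='*.nf' . | grep -v 'modules/nf-core/'
```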
{ assert snapshot(
    path(workflow.out.interproscan_tsv[0][1]).readLines()[0]
        .contains("GI|225038609|EFDID|719595|FULL 079fff43a0270e432d339ea71b6f0acf 350 SFLD SFLDS00057 Glutaminase/Asparaginase 17 347 0.0 T")
).match()}
Severity: low - the snapshot value here is just true (see main.nf.test.snap:4), because snapshot() is being passed the result of a .contains() boolean check. That means the test only catches a regression where line 0 of the TSV stops containing that exact substring - any other change to the TSV body slides past silently. None of the rest of the channel content is snapshotted either.
Suggested split: assert the contains-check directly, and snapshot the actual TSV channel separately.
Suggested change:
- { assert snapshot(
-     path(workflow.out.interproscan_tsv[0][1]).readLines()[0]
-         .contains("GI|225038609|EFDID|719595|FULL 079fff43a0270e432d339ea71b6f0acf 350 SFLD SFLDS00057 Glutaminase/Asparaginase 17 347 0.0 T")
- ).match()}
+ then {
+     assertAll(
+         { assert workflow.success },
+         { assert path(workflow.out.interproscan_tsv[0][1]).readLines()[0]
+             .contains("GI|225038609|EFDID|719595|FULL\t079fff43a0270e432d339ea71b6f0acf\t350\tSFLD\tSFLDS00057\tGlutaminase/Asparaginase\t17\t347\t0.0\tT") },
+         { assert snapshot(workflow.out.interproscan_tsv).match() }
+     )
+ }
The TSV is not matching; I'll keep it as it is for now.
conda "${moduleDir}/environment.yml"
- container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+ container "${ workflow.containerEngine in ['singularity', 'apptainer'] && !task.ext.singularity_pull_docker_container ?
    'https://depot.galaxyproject.org/singularity/interproscan:5.59_91.0--hec16e2b_1' :
Severity: informational (pre-existing, but ships with this v1.1.0 tag).
The module pins interproscan:5.59_91.0--hec16e2b_1, but the default interproscan_db_url in nextflow.config:38 is interproscan-5.72-103.0, and docs/usage.md:91-94 instructs users to download v5.72-103.0 manually. InterProScan data files are not forward/backward compatible across binary versions, so default-param users will hit a runtime mismatch. CI passes because the test fixture is version-matched to 5.59.
Not introduced by this PR, but the v1.1.0 tag here is a reasonable trigger to either (a) pin the default URL to a 5.59-91.0 tarball, or (b) bump the binary to a 5.72-compatible biocontainer when one becomes available, and update docs/usage.md to match.
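This kind of skew is also easy to catch mechanically. A hedged sketch: the container tag is the one pinned in the module, while the tarball filename is only illustrative of the InterProScan release naming pattern, not copied from the config:

```shell
# Compare the data version embedded in the biocontainer tag
# (format interproscan:<major>_<minor>--<build>) with the one in
# the database tarball name.
container='interproscan:5.59_91.0--hec16e2b_1'
db_tarball='interproscan-5.72-103.0-64-bit.tar.gz'   # illustrative filename

bin_ver=$(printf '%s' "$container" | sed -E 's/.*:([0-9.]+)_([0-9.]+)--.*/\1-\2/')
db_ver=$(printf '%s' "$db_tarball" | grep -oE '[0-9]+\.[0-9]+-[0-9]+\.[0-9]+')

if [ "$bin_ver" = "$db_ver" ]; then
  echo "match: $bin_ver"
else
  echo "mismatch: binary $bin_ver vs database $db_ver"
fi
```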
InterProScan is a difficult case; the final version (no more updates coming) with a conda build is stuck on 5.59_91.0. However, the team has been developing newer versions (a whole Nextflow, non-nf-core, pipeline) and we aim at some point to move the nf-core module there, maybe. For now these two versions (module + database) seem compatible, so keeping it like this.
metagroot_latest_link = params.pipelines_testdata_base_path + 'proteinannotator/testdata/metagroot/metagroot_test.hmm.gz'
// Functional annotation
interproscan_db_url = params.pipelines_testdata_base_path + 'proteinannotator/testdata/interproscan_test.tar.gz'
interproscan_applications = 'Hamap,TIGRFAM,sfld'
Severity: informational - after fixing H1 above, test_full would still point at the same test-sized HMM and InterProScan archives as conf/test.config. The CHANGELOG entry "test_full.config input samplesheet path is now set properly" suggests this is intentional for now, but it means the AWS full test isn't actually exercising a full-size workload. If a real full-size dataset is planned, that's a follow-up - just flagging that test_full isn't currently distinct from test.config.
Nothing planned so far. Will need to carefully come up with a different full-size dataset when the pipeline is more mature.
pinin4fjords
left a comment
AI-assisted follow-up review (Claude, on behalf of @pinin4fjords). Snapshot-hygiene suggestion on the domain_annotation tests.
{ assert snapshot(
    path(workflow.out.nmpfams_domains[0][1]).linesGzip[0..7],
    workflow.out.versions.collect { path(it).yaml }.unique()
).match()}
)
Severity: low - the four linesGzip[0..7] assertions in this file (lines 36/73/109/145) inline raw rows into the snapshot. Same coverage, but a hash in the .snap instead of raw content - anchored here on the nmpfams block since the others are outside the diff:
Suggested change:
- { assert snapshot(
-     path(workflow.out.nmpfams_domains[0][1]).linesGzip[0..7],
-     workflow.out.versions.collect { path(it).yaml }.unique()
- ).match()}
- )
+ then {
+     assertAll(
+         { assert workflow.success},
+         { assert snapshot(
+             path(workflow.out.nmpfams_domains[0][1]).linesGzip[0..7].join('\n').md5(),
+             workflow.out.versions.collect { path(it).yaml }.unique()
+         ).match()}
+     )
+ }
Same swap applies to the three other linesGzip[0..7] blocks.
I actually don't mind seeing the actual lines of characters in the snapshots, especially if it is only a couple of lines like here, rather than md5sums. Could have both I guess, but for now I'll leave it as is. Interested in documentation links if there are new md5sum guidelines that I've missed!
pinin4fjords
left a comment
Nothing huge here. The AI found one critical issue and a load of things I think it would be worth you fixing. I also think you could use .md5() to keep the snapshots tidier.
Trust you to resolve as appropriate!
Added

- `metagRoot` HMM library (or use path to an existing one) for domain annotation. (by @angelphanth)
- `NMPFams` HMM library (or use path to an existing one) for domain annotation. (by @npechl)
- … `nextflow.config`. (by @vagkaratzas)

Changed

- `test_full.config` input samplesheet path is now set properly. (by @vagkaratzas)

Dependencies

PR checklist
- … `nf-core pipelines lint`).
- … `nf-test test */local --profile=~test,docker` for all new local tests).
- … `nf-test test */local --profile=~test,docker,debug`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).