Add module for TaxonKit name2taxid #4778

Merged 9 commits on Mar 26, 2024
9 changes: 9 additions & 0 deletions modules/nf-core/taxonkit/name2taxid/environment.yml
@@ -0,0 +1,9 @@
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json
name: "taxonkit_name2taxid"
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - "bioconda::taxonkit=0.15.1"
51 changes: 51 additions & 0 deletions modules/nf-core/taxonkit/name2taxid/main.nf
@@ -0,0 +1,51 @@
process TAXONKIT_NAME2TAXID {
    tag "$meta.id"
    label 'process_low'

    conda "${moduleDir}/environment.yml"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/taxonkit:0.15.1--h9ee0642_0':
        'biocontainers/taxonkit:0.15.1--h9ee0642_0' }"

    input:
    tuple val(meta), val(name), path(names_txt)
Contributor

I see why you would want the input like this, but I'm not sure if it follows the guidelines since the actual input is a file.

Contributor

The file could easily be made using the .collectFile() operator.

Member Author

I was just following the docs (https://bioinf.shenwei.me/taxonkit/usage/#name2taxid).

But what do you mean by using collectFile()? Its output is a file, not [ meta, file ].

Contributor

Yes true, but that's nothing some channel logic can't handle :).

@maintainers what are your opinions on this?

Member Author

It's not particularly easy for a new developer though:

    Channel.of( [ id: 'test', taxid: 1234 ] )
        .tap { ch_data }
        .collectFile( newLine: true ) { meta -> [ "${meta.id}.tsv", "$meta.taxid" ] }
        .map{ file -> [ file.baseName , file ] }
        .join( ch_data.map { meta -> [ meta.id, meta ] }, by: 0 )
        .map{ id, idfile, meta -> [ meta, idfile ] }

Member Author

The value is mandatory if there's no file (and I don't plan on generating that file).

Member

What is the difference between name and names_txt? Is it that you can supply either a string or a file?

Member Author

Yes, you can pipe in the name, or provide a list of names.

Member

You could have two separate optional input channels, one for the string and one for the file (mutually exclusive)?

It might look a little ugly, but it would be relatively easy for a pipeline dev to put an [] in the channel they don't want with a (multi)Map.

Member Author

I think having to use multiMap to build this input just adds unnecessary complexity.
Either way, the input will likely need to be formed using a map.
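
As a rough sketch of the `map`-based approach mentioned above, the two input shapes could be built like this. The channel names (`ch_by_name`, `ch_by_file`), the hypothetical `names.txt` file, and the way `meta.id` is derived are assumptions for illustration only; they are not part of this module.

```nextflow
workflow {
    // Case 1: one taxon name per task; names_txt is left empty with [].
    ch_by_name = Channel.of( 'SARS-CoV-2' )
        .map { name -> [ [ id: name ], name, [] ] }

    // Case 2: one file of names per task; name is left empty with ''.
    // 'names.txt' is a hypothetical file with one taxon name per line.
    ch_by_file = Channel.fromPath( 'names.txt', checkIfExists: true )
        .map { names_txt -> [ [ id: names_txt.baseName ], '', names_txt ] }

    // Both channels now emit [ meta, name, names_txt ] tuples.
    ch_by_name.mix( ch_by_file ).view()
}
```

Either channel matches the `tuple val(meta), val(name), path(names_txt)` input, and each tuple fills exactly one of `name` and `names_txt`, which is what the `assert` in the script block expects.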

    path taxdb

    output:
    tuple val(meta), path("*.tsv"), emit: tsv
    path "versions.yml"           , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    assert (!name && names_txt) || (name && !names_txt)
    """
    taxonkit \\
        name2taxid \\
        $args \\
        --data-dir $taxdb \\
        --threads $task.cpus \\
        --out-file ${prefix}.tsv \\
        ${name? "<<< '$name'": names_txt}

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        taxonkit: \$( taxonkit version | sed 's/.* v//' )
    END_VERSIONS
    """

    stub:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    touch ${prefix}.tsv

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        taxonkit: \$( taxonkit version | sed 's/.* v//' )
    END_VERSIONS
    """
}
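
A minimal sketch of how a pipeline might call this module, modelled on the test setup in this PR. The include paths assume the usual `modules/nf-core/` layout relative to the calling script, and the `id` values and channel names (`ch_taxdump`, `ch_taxdb`) are placeholders.

```nextflow
include { UNTAR               } from './modules/nf-core/untar/main'
include { TAXONKIT_NAME2TAXID } from './modules/nf-core/taxonkit/name2taxid/main'

workflow {
    // Unpack the NCBI taxonomy dump (same URL as in the module tests).
    ch_taxdump = Channel.of( [
        [ id:'taxdump' ],
        file( 'ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz', checkIfExists: true )
    ] )
    UNTAR( ch_taxdump )

    // Drop the meta map and turn the directory into a value channel,
    // so it can be reused for every lookup.
    ch_taxdb = UNTAR.out.untar.map { meta, dir -> dir }.first()

    // Look up a single name; names_txt is left empty with [].
    TAXONKIT_NAME2TAXID(
        Channel.of( [ [ id:'test' ], 'SARS-CoV-2', [] ] ),
        ch_taxdb
    )
    TAXONKIT_NAME2TAXID.out.tsv.view()
}
```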
53 changes: 53 additions & 0 deletions modules/nf-core/taxonkit/name2taxid/meta.yml
@@ -0,0 +1,53 @@
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/meta-schema.json
name: "taxonkit_name2taxid"
description: Convert taxon names to TaxIds
keywords:
  - taxonomy
  - taxids
  - taxon name
  - conversion
tools:
  - "taxonkit":
      description: "A Cross-platform and Efficient NCBI Taxonomy Toolkit"
      homepage: "https://bioinf.shenwei.me/taxonkit/"
      documentation: "https://bioinf.shenwei.me/taxonkit/usage/#name2taxid"
      tool_dev_url: "https://github.com/shenwei356/taxonkit"
      doi: "10.1016/j.jgg.2021.03.006"
      licence: ["MIT"]

input:
  - meta:
      type: map
      description: |
        Groovy Map containing sample information
        e.g. `[ id:'sample1', single_end:false ]`
  - name:
      type: string
      description: Taxon name to look up (provide either this or names.txt, not both)
  - names_txt:
      type: file
      description: File with taxon names to look up, each on their own line (provide either this or name, not both)
  - taxdb:
      type: file
      description: Taxonomy database unpacked from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Member

Maybe specify that it should be the directory that gets supplied (presumably, from my knowledge of taxdump.tar.gz)?

Member Author

In my case it's specifically that file that I use outside of this module.


output:
  - meta:
      type: map
      description: |
        Groovy Map containing sample information
        e.g. `[ id:'sample1', single_end:false ]`
  - versions:
      type: file
      description: File containing software versions
      pattern: "versions.yml"
  - tsv:
      type: file
      description: TSV file of Taxon names and their taxon ID
      pattern: "*.tsv"

authors:
  - "@mahesh-panchal"
maintainers:
  - "@mahesh-panchal"
100 changes: 100 additions & 0 deletions modules/nf-core/taxonkit/name2taxid/tests/main.nf.test
@@ -0,0 +1,100 @@
nextflow_process {

    name "Test Process TAXONKIT_NAME2TAXID"
    script "../main.nf"
    process "TAXONKIT_NAME2TAXID"

    tag "modules"
    tag "modules_nfcore"
    tag "untar"
    tag "taxonkit"
    tag "taxonkit/name2taxid"

    setup {
        run("UNTAR"){
            script "modules/nf-core/untar/main.nf"
            process {
                """
                input[0] = [
                    [ id:'test' ],
                    file("ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz", checkIfExists: true)
Member

I have a mini set of (working) taxdump files for createtaxdb:

https://github.com/nf-core/test-datasets/tree/createtaxdb/data/taxonomy

You're welcome to tar.gz them and add them to test datasets if the tests are slow and you want to speed them up.

Member Author

The tests don't seem slow, but let me see if there's a difference.

                ]
                """
            }
        }
    }

    test("sarscov2 - name") {

        when {
            process {
                """
                input[0] = [
                    [ id:'test', single_end:false ], // meta map
                    "SARS-CoV-2",
                    []
                ]
                input[1] = UNTAR.out.untar.map{ it[1] }
                """
            }
        }

        then {
            assertAll(
                { assert process.success },
                { assert snapshot(process.out).match() }
            )
        }

    }

    test("sarscov2 - list") {

        when {
            process {
                """
                input[0] = Channel.of( [
                    [ id:'test', single_end:false ], // meta map
                    ''
                ] ).combine( Channel.of("SARS-CoV-2").collectFile( name:'names.txt', newLine: true ) )
                input[1] = UNTAR.out.untar.map{ it[1] }
                """
            }
        }

        then {
            assertAll(
                { assert process.success },
                { assert snapshot(process.out).match() }
            )
        }

    }

    test("sarscov2 - name - stub") {

        options "-stub"

        when {
            process {
                """
                input[0] = [
                    [ id:'test', single_end:false ], // meta map
                    "SARS-CoV-2",
                    []
                ]
                input[1] = UNTAR.out.untar.map{ it[1] }
                """
            }
        }

        then {
            assertAll(
                { assert process.success },
                { assert snapshot(process.out).match() }
            )
        }

    }

}
95 changes: 95 additions & 0 deletions modules/nf-core/taxonkit/name2taxid/tests/main.nf.test.snap

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions modules/nf-core/taxonkit/name2taxid/tests/tags.yml
@@ -0,0 +1,2 @@
taxonkit/name2taxid:
  - "modules/nf-core/taxonkit/name2taxid/**"