Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add module for TaxonKit name2taxid #4778

Merged
merged 9 commits into from
Mar 26, 2024

Conversation

mahesh-panchal
Copy link
Member

PR checklist

Adds the module TaxonKit name2taxid

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • PROFILE=docker pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
    • PROFILE=singularity pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware
    • PROFILE=conda pytest --tag <MODULE> --symlink --keep-workflow-wd --git-aware

@mahesh-panchal mahesh-panchal requested a review from a team as a code owner January 19, 2024 16:50
@mahesh-panchal
Copy link
Member Author

@nf-core-bot fix linting

'biocontainers/taxonkit:0.15.1--h9ee0642_0' }"

input:
tuple val(meta), val(name), path(names_txt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why you would want the input like this, but I'm not sure if it follows the guidelines since the actual input is a file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file could be easily made using the .collectFile() operator

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just following the docs https://bioinf.shenwei.me/taxonkit/usage/#name2taxid.

Although how do you mean with collectFile(). The output of that is file and not [ meta, file ].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes true, but that's nothing some channel logic can't handle :).

@maintainers what are your opinions on this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not particularly easy for a new developer though:

    Channel.of( [ id: 'test', taxid: 1234 ] )
        .tap { ch_data }
        .collectFile( newLine: true ) { meta -> [ "${meta.id}.tsv", "$meta.taxid" ] }
        .map{ file -> [ file.baseName , file ] }
        .join( ch_data.map { meta -> [ meta.id, meta ] }, by: 0 )
        .map{ id, idfile, meta -> [ meta, idfile ] }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The value is mandatory if there's no file ( and I don't plan on generating that file )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between name and names_txt? As in you can supply a string versus a file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you can pipe in the name, or provide a list of names.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could have two separate optional input channels, one for string only and one for file (but are mutually exclusive)?

Might look alittle ugly, would be relatively easy for for a pipeline dev to put an [] in the channel they don't want with a (multi)Map?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's just adding unnecessary complexity to have to use multiMap to make this input.
Either way the input will likely need to be formed using a map

@mahesh-panchal mahesh-panchal requested a review from jfy133 March 26, 2024 14:21
Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor things, but otherwise no blockers from my end now given the input channel structure guidelines appear to be a bit in flux

modules/nf-core/taxonkit/name2taxid/meta.yml Outdated Show resolved Hide resolved
description: File with taxon names to look up, each on their own line (provide either this or name, not both)
- taxdb:
type: file
description: Taxonomy database unpacked from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe specify it should be the directory that get's supplied (presumably? from my knowledge of taxdump.tar.gz)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my case it's specifically that file that I use outside of this module.

"""
input[0] = [
[ id:'test' ],
file("ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz", checkIfExists: true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a mini set of (working) taxdump files for createtaxdb:

https://github.com/nf-core/test-datasets/tree/createtaxdb/data/taxonomy

You're welcome to tar.gz them and add to test datasets if you want to speed up tests if slow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests don't seem slow, but let me see if there's a difference.

@mahesh-panchal mahesh-panchal added this pull request to the merge queue Mar 26, 2024
Merged via the queue into nf-core:master with commit e41aa10 Mar 26, 2024
11 checks passed
jvfe added a commit that referenced this pull request Mar 26, 2024
* master: (47 commits)
  Update genomescope2 module and add nf-tests (#5399)
  PLINK/GWAS additional output (#5387)
  Plink linkage disequilibrium additional output (#5388)
  new module: Plink epistasis analysis  (#5386)
  Add module for TaxonKit name2taxid (#4778)
  Update-cellranger-modules (#5016)
  Update: GTDBTK/CLASSIFYWF (migrate to nf-test and only execute mv when dirs actually present) (#5390)
  Added MetaPlAn subworkflow (#5356)
  port `vcftools` to nf-test (#5375)
  Update modules: rgi (#5383)
  Update condaforge/mambaforge Docker tag to v24 (#5381)
  Update gecco + migrate to nf-test (#5372)
  Update learnmsa module to work with compressed files. (#5276)
  New module: kraken2/build  (#5212)
  Update kalign module to work with compressed files. (#5277)
  port `ensemblvep/download` to nf-test (#5373)
  port `manta-germline` to nf-test (#5362)
  port `tiddit-sv` to nf-test and add `--threads` (#5371)
  Decoupler module (#5317)
  fix optional cram support in modules and migrate to nf-test (#5329)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants