How to download all genomes that belong to a specific taxa? #68

johanneswerner · 2021-01-28T12:39:01Z

Hello,

cool package, thank you. :-)

Is there a way to download all genomes that belong to a certain genus or family? As far as I can see, I cannot use the taxid of the genus or family, because that taxid does not have a genome itself and carries no information about its descendant taxa.

I first tried the genus name. In the case of Enterococcus, however, this also returns many Enterococcus phage genomes, which I do not want to download here.

Thank you for your help.

HajkD · 2021-01-28T13:26:23Z

Hi Johannes,

since this is a taxonomic classification issue, did you by any chance have a look at the taxize package and see what happens there when you insert your particular case?

Since biomartr is solely relying on internal data provided by NCBI, could you also check there what kind of records they store for your example? This will help me find strategies to capture such particular cases.

Many thanks,
Hajk

johanneswerner · 2021-01-28T13:30:18Z

I have not worked with taxize yet. I got the idea from the ncbi-genome-download package.

They have a python script (which uses ete3) that gets the descending taxa based on a specific taxid, see here.

The script is available here: https://github.com/kblin/ncbi-genome-download/blob/master/contrib/gimme_taxa.py

In any case, I will look into taxize; maybe I will get an idea from there as well. Thank you for the suggestion.

johanneswerner · 2021-03-03T07:58:45Z

Possibly linked with #6 (comment)

johanneswerner · 2021-03-03T08:53:00Z

shenwei356/taxonkit#41

johanneswerner · 2021-03-03T15:01:01Z

From the referenced issue

I wrote Python bindings for TaxonKit recently, if you’d like to use that as a reference. I primarily used the Popen construct in Python’s subprocess library to call TaxonKit, and then loaded the output into dataframes (pandas) for handling in Python.

https://github.com/bioforensics/pytaxonkit/blob/9746225b1c0a9eff708790037e3b53e5d45ac235/pytaxonkit.py#L203-L211

It’s been a long time since I did any serious work in R, so I’m not sure what the best tools are for system calls. But I imagine R’s native dataframes would be suitable for storing most results.

I hope this helps.

I fear this is not going to be pretty. It is not hard to use system() in R to invoke shell commands, but I seriously doubt that this will work platform-independently (especially on Windows). @HajkD I would certainly be open to suggestions (unless it is okay to not support Windows ...). If the conda recipe is still going to be built, we can add the functionality and add taxonkit as a requirement.
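For reference, the subprocess approach described in the quoted comment can be sketched in Python roughly as follows. This is a minimal sketch, not the pytaxonkit implementation: it assumes the taxonkit executable is on the PATH with its taxonomy data already set up, and the helper names (build_command, parse_taxids, list_subtree_taxids) are made up for illustration. Passing the command as an argument list avoids going through a shell, which sidesteps some of the quoting differences between platforms, although it does not remove the need to have taxonkit installed.

```python
import shutil
import subprocess


def build_command(taxid):
    """Argument list for `taxonkit list`; an empty indent gives one bare taxid per line."""
    return ["taxonkit", "list", "--ids", str(taxid), "--indent", ""]


def parse_taxids(stdout):
    """Turn taxonkit's line-oriented output into a list of integer taxids."""
    return [int(line.strip()) for line in stdout.splitlines() if line.strip()]


def list_subtree_taxids(taxid):
    """Return the taxid and all of its descendants as reported by taxonkit."""
    # Fail early with a readable message if the external tool is missing.
    if shutil.which("taxonkit") is None:
        raise RuntimeError("taxonkit executable not found on PATH")
    result = subprocess.run(
        build_command(taxid),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return parse_taxids(result.stdout)
```

Calling list_subtree_taxids(1350) would then list the taxids below the genus Enterococcus, which could be passed one by one to a genome download function.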

mr-eyes · 2022-02-06T19:26:37Z

This might help! I am getting accessions for the provided organism name/taxon or the closest one in the lineage.

https://gist.github.com/mr-eyes/92d6172c7a5c7d5bd35fcff6f765d48d

zachary-foster · 2023-11-01T18:43:27Z

Hello All,

Saw this issue on Slack. Not sure if this is helpful or not, but I have done this with entrez eutils command line tools. I imagine the same functionality exists in rentrez package for R? Here is the script we used. For our purpose, it was in two steps: 1) get CSV of info for all genomes available for each taxon, 2) download a subset of them. This is from a nextflow module, so the bash might look a bit odd, but should give you the idea.

Make CSV with info for all genomes for a given taxon:

esearch -db taxonomy -query "${taxon} OR ${taxon}[subtree]" | \\
        elink -target assembly | \\
        efilter -query "latest[PROP] AND full-genome-representation[PROP] AND has-annotation[PROP] NOT excluded-from-refseq[PROP]" | \\
        efetch -format docsum | \\
        xtract -pattern DocumentSummary -def 'NA' -element \$COLS >> \\
        ${prefix}.tsv

Download genome for each row in CSV:

datasets download genome accession $id --include gff3,rna,cds,protein,genome,seq-report --filename ${prefix}.zip
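The second step can be driven from the table written in the first step. Below is a minimal Python sketch that only builds the datasets command lines and does not run them. The columns behind \$COLS are not shown above, so the sketch assumes a header-less, tab-separated file with the assembly accession in the first column; the function name download_commands and the sample accessions are placeholders, not real records.

```python
import csv
import io

# Same --include list as in the datasets command above.
INCLUDE = "gff3,rna,cds,protein,genome,seq-report"


def download_commands(tsv_handle, accession_column=0):
    """Build one `datasets download genome accession` argument list per table row.

    Assumes a header-less TSV; rows whose accession is empty or 'NA'
    (the xtract default used above) are skipped.
    """
    commands = []
    for row in csv.reader(tsv_handle, delimiter="\t"):
        if not row or row[accession_column] in ("", "NA"):
            continue
        accession = row[accession_column]
        commands.append([
            "datasets", "download", "genome", "accession", accession,
            "--include", INCLUDE,
            "--filename", f"{accession}.zip",
        ])
    return commands


# Placeholder rows, for illustration only.
sample = io.StringIO(
    "GCF_000000001.1\tExample organism A\n"
    "NA\trow without accession\n"
    "GCF_000000002.1\tExample organism B\n"
)
for command in download_commands(sample):
    print(" ".join(command))
```

Each returned list could be handed to subprocess.run once the datasets tool is installed.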

salix-d · 2023-11-02T23:17:30Z

The taxize package should be able to get the children, but since you mentioned:

Since biomartr is solely relying on internal data provided by NCBI

Do you get the taxdump?
If so with names.dmp & nodes.dmp you can make an sqlite db that makes it easy to get the parents/children of an accession/tax id.
The taxonomizr package does it.
Might also be how taxize does it, I don't remember.
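To illustrate the nodes.dmp idea, here is a minimal Python sketch using only the standard library. It is not how taxonomizr or taxize are implemented; the table layout and function names are assumptions made for this example. The sample rows are hand-written in the nodes.dmp field layout (tax_id, parent tax_id, rank, separated by tab-pipe-tab), and the lineage above the family is collapsed to the root for brevity. Because descendants are found by following parent links from the genus taxid, phage records that merely carry "Enterococcus" in their name are not picked up.

```python
import sqlite3

# Hand-written sample rows in the nodes.dmp layout; not a copy of the real dump.
SAMPLE_NODES = [
    "1\t|\t1\t|\tno rank\t|\n",
    "81852\t|\t1\t|\tfamily\t|\n",    # Enterococcaceae (parent shortened to root)
    "1350\t|\t81852\t|\tgenus\t|\n",  # Enterococcus
    "1351\t|\t1350\t|\tspecies\t|\n",
    "1352\t|\t1350\t|\tspecies\t|\n",
    "226185\t|\t1351\t|\tstrain\t|\n",
]


def parse_nodes(lines):
    """Yield (tax_id, parent_id, rank) from nodes.dmp-style lines."""
    for line in lines:
        fields = line.rstrip("\n").rstrip("\t|").split("\t|\t")
        yield int(fields[0]), int(fields[1]), fields[2]


def build_db(lines):
    """Load the parent/child table into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (tax_id INTEGER PRIMARY KEY, parent_id INTEGER, rank TEXT)"
    )
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", parse_nodes(lines))
    conn.execute("CREATE INDEX idx_parent ON nodes (parent_id)")
    return conn


def descendants(conn, taxid):
    """Return taxid plus every taxid below it, via a recursive query."""
    query = """
        WITH RECURSIVE sub(tax_id) AS (
            SELECT ?
            UNION  -- UNION (not UNION ALL) stops the self-parented root from looping
            SELECT n.tax_id FROM nodes n JOIN sub s ON n.parent_id = s.tax_id
        )
        SELECT tax_id FROM sub ORDER BY tax_id
    """
    return [row[0] for row in conn.execute(query, (taxid,))]


conn = build_db(SAMPLE_NODES)
print(descendants(conn, 1350))
```

With the real nodes.dmp, the same functions can be fed an open file handle instead of the sample list, and sqlite3.connect can point at a file so the database is built only once.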

HajkD added enhancement help wanted labels Sep 27, 2023
