NCBI Taxon ID included in the final_table.tsv file? #29

Open
cpavloud opened this issue Nov 26, 2021 · 12 comments
Labels: enhancement (New feature or request), FAIR improvement, LW mid priority (priority for the LW developers)

Comments

@cpavloud
Collaborator

cpavloud commented Nov 26, 2021

One thing that has been requested is to enhance the final_table.tsv file so that, in addition to the columns it already has, it includes the NCBI Taxon ID for each ASV/OTU and the accession number of the sequence that was its closest match in the database used. The NCBI Taxon ID could then be used as the taxonConceptID when submitting data to GBIF/OBIS using the DwC-A format (as discussed here).

For example, instead of the current final_table.tsv file, which looks like this
OTU_id,ERR0000008,ERR0000009,Classification
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis

Ideally, it could look something like this
OTU_id,ERR0000008,ERR0000009,Classification,Accession_number,NCBI_Taxon_ID
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora,JN200445,608846
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis,NC_023834,1473587

If it is not possible to retrieve the accession number and/or the NCBI Taxon ID, I think we can find some workarounds.
Perhaps it will be possible to retrieve the NCBI Taxon ID using Biopython's Bio.Entrez module.

@cpavloud cpavloud added the enhancement New feature or request label Nov 26, 2021
@hariszaf
Owner

@cpavloud I found out about the ncbi-taxonomist tool.

We could use it I think.

Would you like to have a look and share any thoughts?

@cpavloud
Collaborator Author

I am not sure how it would work exactly (the ncbi-taxonomist page does not provide very good examples/explanations), but we could give it a try.

@hariszaf
Owner

Think of a while loop that starts from the end of the taxonomy in each row of the finalTable.tsv file and runs ncbi-taxonomist for each level.
For each level, we'll query for an NCBI taxonomy ID, and when we get one we'll have something like this:

Assuming we are looking for Saprospiraceae

ncbi-taxonomist collect -n 'Saprospiraceae'

would return:

{"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":"cellular organisms"}
{"taxid":2,"rank":"superkingdom","names":{"Bacteria":"scientific_name"},"parentid":131567,"name":"Bacteria"}
{"taxid":1783270,"rank":"clade","names":{"FCB group":"scientific_name"},"parentid":2,"name":"FCB group"}
{"taxid":68336,"rank":"clade","names":{"Bacteroidetes/Chlorobi group":"scientific_name"},"parentid":1783270,"name":"Bacteroidetes/Chlorobi group"}
{"taxid":976,"rank":"phylum","names":{"Bacteroidetes":"scientific_name"},"parentid":68336,"name":"Bacteroidetes"}
{"taxid":1937959,"rank":"class","names":{"Saprospiria":"scientific_name"},"parentid":976,"name":"Saprospiria"}
{"taxid":1936988,"rank":"order","names":{"Saprospirales":"scientific_name"},"parentid":1937959,"name":"Saprospirales"}
{"taxid":89374,"rank":"family","names":{"Saprospiraceae":"scientific_name","Saprospira group":"Synonym"},"parentid":1936988,"name":"Saprospiraceae"}
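To pick the ID for the queried name out of output like the above, a minimal Python sketch could parse each JSON line and match on the name. The helper name `taxid_for_name` is hypothetical, and the sketch assumes the tool's output has been captured as text with one JSON record per line, as in the sample shown here.

```python
import json


def taxid_for_name(jsonl_text, query):
    """Return the taxid of the first record that carries `query`, or None.

    `jsonl_text` is assumed to hold one JSON record per line, shaped like
    the ncbi-taxonomist sample output quoted in this thread.
    """
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Match the primary name or any listed name (synonyms included).
        if record.get("name") == query or query in record.get("names", {}):
            return record["taxid"]
    return None


# Two records copied from the sample output above.
sample = """\
{"taxid":1936988,"rank":"order","names":{"Saprospirales":"scientific_name"},"parentid":1937959,"name":"Saprospirales"}
{"taxid":89374,"rank":"family","names":{"Saprospiraceae":"scientific_name","Saprospira group":"Synonym"},"parentid":1936988,"name":"Saprospiraceae"}
"""

print(taxid_for_name(sample, "Saprospiraceae"))  # 89374
```

Matching on the name rather than simply taking the last line keeps the lookup correct even if the record order ever differs from the sample.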

@cpavloud
Collaborator Author

So, for example, if you have this classification in the finalTable.tsv

Main genome;Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Dipodascaceae;Geotrichum

you would search for Geotrichum
and then for Dipodascaceae
and then for Saccharomycetales
etc etc.

and get the last line for each of your searches?

@hariszaf
Owner

I would search for Geotrichum. If that has a hit, I'd get either

  • only its NCBI taxonomy ID, or
  • the NCBI taxonomy IDs of its whole lineage;

we could think about which one.

If I did not get a hit, I would continue with Dipodascaceae, and so on.
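The fallback described here can be sketched in Python as a walk from the most specific name toward the root, stopping at the first name that resolves. This is only an illustration: `resolve_deepest_taxid` is a hypothetical name, the `lookup` callable stands in for whatever does the real query (ncbi-taxonomist, Bio.Entrez, or a local table), and the example lineage ends in a made-up genus so that the fallback is exercised.

```python
def resolve_deepest_taxid(classification, lookup):
    """Return (name, taxid) for the most specific name that `lookup` resolves.

    `classification` is a semicolon-separated lineage, root first.
    `lookup` takes a taxon name and returns a taxid, or None if unknown.
    Returns None when no name in the lineage resolves.
    """
    names = [n.strip() for n in classification.split(";") if n.strip()]
    for name in reversed(names):  # most specific rank first
        taxid = lookup(name)
        if taxid is not None:
            return name, taxid
    return None


# Stand-in lookup table; the IDs are taken from the ncbi-taxonomist
# sample output earlier in this thread.
known = {"Bacteria": 2, "Saprospirales": 1936988, "Saprospiraceae": 89374}

# Hypothetical lineage: "Unknown_genus" is invented for this example.
row = "Bacteria;Bacteroidetes;Saprospiria;Saprospirales;Saprospiraceae;Unknown_genus"

print(resolve_deepest_taxid(row, known.get))  # ('Saprospiraceae', 89374)
```

Keeping the query behind a plain callable means the same loop works whether only the single ID or the whole lineage is fetched later.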

@hariszaf
Owner

@cpavloud have a look. Would that be OK?

root@3bbfa77ef486:/mnt/analysis# more extenedFinalTable.tsv 
OTU	ERR0000001	Classification	TAXON:NCBI_TAX_ID
Otu4056	1	Main genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales	Patescibacteria:1783273

@cpavloud
Collaborator Author

@cpavloud have a look. would that be ok ?

root@3bbfa77ef486:/mnt/analysis# more extenedFinalTable.tsv 
OTU	ERR0000001	Classification	TAXON:NCBI_TAX_ID
Otu4056	1	Main genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales	Patescibacteria:1783273

If there were no NCBI taxonomy IDs for Saccharimonadia and Saccharimonadales, I think we are fine :)
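For reference, a table in that shape could be produced with a short Python sketch like the one below. It is not the PEMA implementation: `extend_table` is a hypothetical helper, the `lookup` callable stands in for the real NCBI query, and the only ID used is the Patescibacteria one from the output quoted above.

```python
import csv
import io


def extend_table(tsv_text, lookup):
    """Append a TAXON:NCBI_TAX_ID column to a tab-separated OTU table.

    For each row, the Classification lineage is walked from its last name
    backwards; the first name that `lookup` resolves is written as
    "name:taxid". Rows with no resolvable name get an empty cell.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")

    header = next(reader)
    class_idx = header.index("Classification")
    writer.writerow(header + ["TAXON:NCBI_TAX_ID"])

    for row in reader:
        annotation = ""
        for name in reversed(row[class_idx].split(";")):
            name = name.strip()
            taxid = lookup(name)
            if taxid is not None:
                annotation = f"{name}:{taxid}"
                break
        writer.writerow(row + [annotation])
    return out.getvalue()


table = (
    "OTU\tERR0000001\tClassification\n"
    "Otu4056\t1\tMain genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales\n"
)
known = {"Patescibacteria": 1783273}  # stand-in for a real NCBI lookup

print(extend_table(table, known.get), end="")
```

With this stand-in lookup, Saccharimonadales and Saccharimonadia do not resolve, so the row is annotated with Patescibacteria:1783273, matching the example.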

@hariszaf
Owner

Exactly!
The thing is that a name in a reference database does not always have an NCBI taxonomy ID.
So I thought we could take the taxonomy that was found and work through it one rank at a time, starting from the species level and moving up.
I'll add this asap.

@hariszaf
Owner

Just FYI, here is what you would get if you searched the NCBI Taxonomy database for Saccharimonadales

image

and Saccharimonadia

image

@hariszaf
Owner

hariszaf commented Dec 1, 2021

This feature is now ready and will be part of pema:v.2.1.4.

The issue is now resolved.

@hariszaf hariszaf closed this as completed Dec 1, 2021
@cpavloud
Collaborator Author

Re-opening the issue:
In case it is helpful, we can go from the sequence accession number to the NCBI Taxon ID: https://www.biostars.org/p/10959/

@cpavloud cpavloud reopened this May 29, 2023
@hariszaf
Owner

This is definitely useful for ITS #52
