MIDORI updates #56

Open
kmexter opened this issue Jun 8, 2023 · 2 comments
Labels
enhancement (New feature or request), LW mid priority (priority for the LW developers), update (update a resource used in pema to a more recent version)

Comments

@kmexter
Collaborator

kmexter commented Jun 8, 2023

According to recent emails with the MIDORI developers, it seems wise to update PEMA to use the MIDORI databases from where they are now published. Hopefully this will solve two issues that we have had: (1) the gaps in the taxonomic classification output when taxon nodes are missing, and (2) some errors and discrepancies in the classifications with respect to NCBI.

Copy of the emails (latest to first):

Sorry to say that we are no longer updating the databases in the "MIDORI server".
We are only updating the databases you can download from here: http://www.reference-midori.info/download.php#

Hi Christina,
Thank you for your email.
I think PEMA is using an old MIDORI database.
I fixed this problem quite a long time ago.
In all formats except the RAW files, we have inserted missing taxonomy by creating it from a lower taxonomic rank. In the following example, the class-level description was missing, so it was created from the order level:
>JF502242.1.7041.7724 root_1;Eukaryota_2759;Chordata_7711;class_Crocodylia_1294634;Crocodylia_1294634;Crocodylidae_8493;Crocodylus_8500;Crocodylus intermedius_184240
Would it be possible for you to download recent databases from our site and perform the taxonomic assignment locally?
We are using NCBI taxonomy for all MIDORI databases.
I think those inconsistencies are happening because PEMA is using an old database (NCBI taxonomy has been consistently revised).
If you have further questions, please write me back again.
Best regards, Ryuji

Dear Dr Machida,
My name is Christina Pavloudi and I am a Post Doctoral Researcher at the CNRS.
In my previous Post Doc position, I was working for the ARMS-MBON project (my colleagues are in CC), where we were sequencing ARMS samples for COI (among other genes) and we were using PEMA for the analysis of the results.
PEMA is using MIDORI for the taxonomic assignment of COI reads, hence I am contacting you regarding an issue we came across.
At the moment, the MIDORI output does not always have the same number of columns, i.e. the same number of taxonomic levels, for all the assignments.
You can see an example in the attached file ("Example_species_notall.tsv").
For some assignments, the output has all the 8 levels: root, superkingdom, phylum, class, order, family, genus, species (see attached file "Example_species_alllevels.tsv").
It would be extremely helpful, in terms of FAIRness for the ARMS-MBON project, if the MIDORI output was consistent and always contained the 8 levels, even if some columns were empty (see attached "Example_species_emptylevels.tsv"). Do you perhaps consider doing something like this for future versions of MIDORI?
Also, could I ask which taxonomy you are using in MIDORI?
I ask because, as you can see in "Example_species_emptylevels_completed.tsv", for some of the species in question the missing taxonomic levels do exist (if we check WoRMS, and also the NCBI Taxonomy). Also, some of them are different from the output that is produced by MIDORI.
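
The fixed eight-column layout requested in the email above could look roughly like the sketch below. It is only an illustration: the rank names are taken from the email, but the function names and the dict-based input are hypothetical and do not correspond to anything in PEMA or MIDORI.

```python
# Sketch only: pad a taxonomic assignment to the eight levels named in the
# email so that every output row has the same number of columns.
# `to_fixed_row` and `to_tsv_line` are hypothetical helper names.

RANKS = [
    "root", "superkingdom", "phylum", "class",
    "order", "family", "genus", "species",
]


def to_fixed_row(assignment):
    """Return one value per rank, using an empty string for missing ranks."""
    return [assignment.get(rank, "") for rank in RANKS]


def to_tsv_line(assignment):
    """Join the fixed-width row with tabs, as in a .tsv file."""
    return "\t".join(to_fixed_row(assignment))


# An assignment where the class level is missing (assumed input shape).
example = {
    "root": "root",
    "superkingdom": "Eukaryota",
    "phylum": "Chordata",
    "order": "Crocodylia",
    "family": "Crocodylidae",
    "genus": "Crocodylus",
    "species": "Crocodylus intermedius",
}

row = to_fixed_row(example)
```

Here `row` always has eight entries, and the class column is simply left empty instead of being dropped.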

@hariszaf
Owner

Steps

  1. Make sure we can use the MIDORI2_LONGEST_NUC_GB255_CO1_RDP.fasta file from MIDORI 2, which is based on GenBank release 255.
    This file has header lines starting with > that include the taxonomy:
    root_1;Eukaryota_2759;Discosea_555280;Flabellinia_1485085;order_Vannellidae_95227;Vannellidae_95227;Vannella_95228;Vannella danica_703018
    The number after the last _ in each field is the NCBI Taxonomy id of the corresponding taxonomic level.

  2. Once you are sure which file to use, you need to train the RDPClassifier. To do so, follow the instructions you'll find here.
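
As a rough illustration of the header format described in step 1, the sketch below splits one header line into its eight levels and NCBI Taxonomy ids. It assumes the accession and the lineage are separated by whitespace, that every header carries exactly eight fields in the order root to species, and that a field prefixed with its own rank name (such as `class_` or `order_`) is one that MIDORI filled in from a lower rank. The names `Taxon` and `parse_midori_header` are hypothetical and not part of PEMA, MIDORI, or the RDPClassifier.

```python
# Sketch only: parse a MIDORI RDP-style header into (rank, name, taxid) records.
from typing import NamedTuple

RANKS = (
    "root", "superkingdom", "phylum", "class",
    "order", "family", "genus", "species",
)


class Taxon(NamedTuple):
    rank: str
    name: str
    taxid: int
    inferred: bool  # True if the field carried a rank prefix such as "class_"


def parse_midori_header(header):
    """Split '>ACCESSION LINEAGE' into the accession and a list of Taxon."""
    # maxsplit=1 keeps the space inside species names intact.
    accession, lineage = header.lstrip(">").strip().split(None, 1)
    fields = lineage.split(";")
    if len(fields) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} levels, got {len(fields)}")
    taxa = []
    for rank, field in zip(RANKS, fields):
        # The taxid is whatever follows the last underscore in the field.
        name, taxid = field.rsplit("_", 1)
        inferred = name.startswith(rank + "_")
        if inferred:
            name = name[len(rank) + 1:]
        taxa.append(Taxon(rank, name, int(taxid), inferred))
    return accession, taxa


header = (
    ">JF502242.1.7041.7724 root_1;Eukaryota_2759;Chordata_7711;"
    "class_Crocodylia_1294634;Crocodylia_1294634;Crocodylidae_8493;"
    "Crocodylus_8500;Crocodylus intermedius_184240"
)
accession, taxa = parse_midori_header(header)
```

Raising on a field count other than eight is a deliberate choice here: it would surface any header that still has missing levels instead of silently shifting columns.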

@kmexter
Collaborator Author

kmexter commented Jun 13, 2023

Note that the output file formats (finalTable.tsv and extendedFinalTable) will change as a consequence. This will need to be looked at, since the same table is output when other reference DBs are used (UNITE and Silva), and we don't want a different output format just because some of the internal parameters change. Once this update has been done, therefore, @kmexter, @cpavloud, and @hariszaf can help look at the results and figure out how to create the best finalTable and extendedFinalTable (as well as perhaps a few other output files).

There will also be some interplay between this issue and #52 and #29, so these should all be considered together before any work starts.

kmexter added the enhancement, update, and LW mid priority labels Jun 14, 2023