MIDORI updates #56

kmexter · 2023-06-08T09:08:49Z

According to recent emails with the MIDORI developers, it seems wise to update PEMA to use the MIDORI database from where it is now published. Hopefully this will solve two issues that we have had: (1) the gaps in the taxonomic classification output when there are missing taxon nodes, and (2) some errors and discrepancies in the classifications with respect to NCBI.

Copy of the emails (latest to first):

Sorry to say that we are no longer updating the databases in "MIDORI server".
We are only updating the databases you can download from here: http://www.reference-midori.info/download.php#

Hi Christina,
Thank you for your email.
I think PEMA is using an old MIDORI database.
I fixed this problem quite a long time ago.
In all formats, except RAW files, we have inserted missing taxonomy by creating it from a lower taxonomic rank (e.g. the class-level description was missing, so it was created from the order level in the following example: >JF502242.1.7041.7724 root_1;Eukaryota_2759;Chordata_7711;class_Crocodylia_1294634;Crocodylia_1294634;Crocodylidae_8493;Crocodylus_8500;Crocodylus intermedius_184240).
Would it be possible for you to download recent databases from our site and perform the taxonomic assignment locally?
We are using NCBI taxonomy for all MIDORI databases.
I think those inconsistencies are happening because PEMA is using an old database (NCBI taxonomy has been consistently revised).
If you have further questions, please write me back again.
Best regards, Ryuji

Dear Dr Machida,
My name is Christina Pavloudi and I am a Post Doctoral Researcher at the CNRS.
In my previous Post Doc position, I was working for the ARMS-MBON project (my colleagues are in CC), where we were sequencing ARMS samples for COI (among other genes) and we were using PEMA for the analyses of the results.
PEMA is using MIDORI for the taxonomic assignment of COI reads, hence I am contacting you regarding an issue we came across.
At the moment, the MIDORI output does not always have the same number of columns, i.e. the same number of taxonomic levels, for all the assignments.
You can see an example in the attached file ("Example_species_notall.tsv").
For some assignments, the output has all the 8 levels: root, superkingdom, phylum, class, order, family, genus, species (see attached file "Example_species_alllevels.tsv").
It would be extremely helpful, in terms of FAIRness for the ARMS-MBON project, if the MIDORI output was consistent and always contained the 8 levels, even if some columns were empty (see attached "Example_species_emptylevels.tsv"). Do you perhaps consider doing something like this for future versions of MIDORI?
Also, could I ask which taxonomy you are using in MIDORI?
I ask because, as you can see in "Example_species_emptylevels_completed.tsv", for some of the species in question the missing taxonomic levels do exist (if we check in WoRMS, but also in the NCBI Taxonomy). Also, some of them are different from the output that is produced by MIDORI.

hariszaf · 2023-06-13T14:21:37Z

Steps

Make sure we can use the MIDORI2_LONGEST_NUC_GB255_CO1_RDP.fasta file from MIDORI 2, which is based on GenBank 255.
The header lines in this file start with > and include the taxonomy, for example:
root_1;Eukaryota_2759;Discosea_555280;Flabellinia_1485085;order_Vannellidae_95227;Vannellidae_95227;Vannella_95228;Vannella danica_703018
The number after the last _ in each field is the NCBI Taxonomy id of the corresponding taxonomic level.
Once you have confirmed which file to use, you need to train the RDPClassifier on it. To do so, follow the instructions you'll find here.
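
To make the header format above concrete, here is a minimal Python sketch that splits one header line into its eight levels and flags the levels that MIDORI created from a lower rank (for example `order_Vannellidae_95227`). It is not part of PEMA or MIDORI. The function names, the rank order (taken from the eight levels listed in the email above), and the rule that a filled-in level is prefixed with its rank name are assumptions based only on the two example headers in this issue.

```python
# Rank order assumed from the eight levels named in the email thread.
RANKS = ["root", "superkingdom", "phylum", "class",
         "order", "family", "genus", "species"]


def parse_lineage(lineage):
    """Split a 'name_taxid;name_taxid;...' lineage into one dict per rank."""
    fields = lineage.strip().split(";")
    if len(fields) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} levels, got {len(fields)}")
    parsed = []
    for rank, field in zip(RANKS, fields):
        # The NCBI Taxonomy id is whatever follows the last underscore.
        name, _, taxid = field.rpartition("_")
        if not name or not taxid.isdigit():
            raise ValueError(f"cannot parse level {field!r}")
        # Assumption: a level created from a lower rank is prefixed with
        # its own rank name, as in 'class_Crocodylia_1294634'.
        filled = name.startswith(rank + "_")
        if filled:
            name = name[len(rank) + 1:]
        parsed.append({"rank": rank, "name": name, "taxid": int(taxid),
                       "filled_from_lower_rank": filled})
    return parsed


def parse_header(line):
    """Split a FASTA header line into (sequence id, parsed lineage)."""
    seq_id, lineage = line.lstrip(">").split(None, 1)
    return seq_id, parse_lineage(lineage)


if __name__ == "__main__":
    header = (">JF502242.1.7041.7724 root_1;Eukaryota_2759;Chordata_7711;"
              "class_Crocodylia_1294634;Crocodylia_1294634;Crocodylidae_8493;"
              "Crocodylus_8500;Crocodylus intermedius_184240")
    seq_id, levels = parse_header(header)
    print(seq_id)
    for level in levels:
        print(level)
```

Flagging the filled-in levels separately, rather than silently treating them as real names, leaves it open whether the final tables should show them as empty cells or as the borrowed name.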

kmexter · 2023-06-13T14:52:05Z

Note that the output file format (finalTable.tsv and the extendedFinalTable) will change as a consequence. This will need to be looked at, since the same table is output when other reference DBs are used (UNITE and Silva), and we don't want a different output format just because some of the internal parameters change. Once this update has been done, therefore, @kmexter, @cpavloud, and @hariszaf can help look at the results and figure out how to create the best finalTable and extendedFinalTable (as well as perhaps a few other output files).

There will also be some interplay between this issue and #52 and #29, so these should all be considered together before any work starts.
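
When comparing output tables before and after the update, one quick check is whether every row has the same number of columns, which is the inconsistency described in the email above. The Python sketch below does only that. The function name and the sample rows are made up for illustration, and the real finalTable.tsv layout may differ, so the expected column count is passed in rather than assumed.

```python
def find_ragged_rows(lines, expected_columns):
    """Return (line number, column count) for rows with an unexpected width."""
    ragged = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        n_columns = len(line.split("\t"))
        if n_columns != expected_columns:
            ragged.append((line_number, n_columns))
    return ragged


if __name__ == "__main__":
    # Illustrative rows only: the first keeps an empty class column,
    # the second drops that column altogether.
    rows = [
        "root\tEukaryota\tChordata\t\tCrocodylia\tCrocodylidae\tCrocodylus\tCrocodylus intermedius\n",
        "root\tEukaryota\tChordata\tCrocodylia\tCrocodylidae\tCrocodylus\tCrocodylus intermedius\n",
    ]
    print(find_ragged_rows(rows, expected_columns=8))  # [(2, 7)]
```

To run it on a real file, pass the open file object as `lines`, for example `find_ragged_rows(open("finalTable.tsv"), 8)`, adjusting the expected count to the actual table.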

hariszaf mentioned this issue on Jun 13, 2023: provide pema main data product in a 7-level taxonomy format #52 (Open)

kmexter added the labels enhancement (New feature or request), update (update a resource used in pema to a more recent version), and LW mid priority (priority for the LW developers) on Jun 14, 2023
