What is the recommended way of dealing with multiple genetic codes? #13

Open
hjarnek opened this issue Sep 15, 2024 · 1 comment

hjarnek commented Sep 15, 2024

Hi,

When working with metabarcoding data where you're not interested in any specific taxonomic group but want to quantify general diversity, you almost always have a mix of different genetic codes present in the samples, the most common being vertebrates (code 2), invertebrates (code 5), and protozoans (code 4), not to mention echinoderms/flatworms (code 9) and tunicates (code 13) for those of us working with marine data. How do you best deal with this?

From your book chapter (page 7) you can find:

If the data set contains sequences with different genetic codes, as it could be the case, for instance, in mitochondrial COI barcoding data from multiple animal phyla, they could be specified in a separated text file using the -gc_file option. This file should indicate, on each line, the sequence names with their corresponding genetic code numbers.

How is one supposed to know which sequences correspond to which translation table? Do you mean that one should perform taxonomic classification before running MACSE and derive the genetic codes from the taxonomy? What about unassigned sequences in that case? Also, the format of the gc_file is not specified anywhere.

Cheers

Owner

ranwez commented Sep 17, 2024

Hi,

Indeed, you need to provide the genetic code of your sequences as input to MACSE, as MACSE cannot guess this information: it is not implemented, and I don’t think it is even possible to select the correct genetic code based solely on the sequences to be aligned. Therefore, you must first perform a rough taxonomic assignment of your sequences, using dedicated tools and databases such as BOLD (see, e.g., assignment review). This does not need to be very precise, since only the genetic code is needed, not the exact species or genus.
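To illustrate why a coarse assignment is enough, here is a minimal Python sketch that picks a genetic code from a rough lineage. It is not part of MACSE: the `GROUP_TO_CODE` table and the `code_for_lineage` helper are hypothetical names, the group list is only a small illustrative subset based on the codes mentioned in the question, and it should be checked against the NCBI translation table list before use.

```python
# Hypothetical, illustrative mapping from coarse taxonomic groups to
# NCBI mitochondrial translation table numbers (not part of MACSE).
GROUP_TO_CODE = {
    "Vertebrata": 2,       # vertebrate mitochondrial
    "Arthropoda": 5,       # invertebrate mitochondrial
    "Mollusca": 5,         # invertebrate mitochondrial
    "Cnidaria": 4,         # mold/protozoan/coelenterate mitochondrial
    "Echinodermata": 9,    # echinoderm and flatworm mitochondrial
    "Platyhelminthes": 9,  # echinoderm and flatworm mitochondrial
    "Ascidiacea": 13,      # ascidian mitochondrial
}


def code_for_lineage(lineage, default=None):
    """Return the code of the most specific known group in `lineage`.

    `lineage` is a list of taxon names ordered from broad to specific.
    Unassigned or unknown lineages return `default`.
    """
    for taxon in reversed(lineage):
        if taxon in GROUP_TO_CODE:
            return GROUP_TO_CODE[taxon]
    return default
```

Sequences for which this returns `None` would simply be left out of the per-sequence file, so that the default genetic code applies to them.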

This taxonomic assignment can also help build a better alignment. The strategy we advised in the book chapter you mentioned is to work separately for each major taxonomic group. For a specific group, you would build a curated alignment of some representative full-length sequences of your marker, and then use the enrichAlignment subprogram of MACSE to add your own sequences to these small, expert alignments. Some alignments built using this strategy are available here: barcoding-alignments.

Regarding the use of different genetic codes, see the documentation here. You can provide a default genetic code (gc_def) that will be used for sequences for which no specific genetic code is provided in the gc_file. The format of the gc_file is quite simple: each line contains a sequence name and the number of the corresponding genetic code (you can use any of the following field separators: space, tab, comma, or semicolon). Examples of gc_files are included in Examples_MACSE_Methods_In_Mol_Biol_2020.zip, provided here.
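As a rough sketch of that file format, the Python snippet below writes one "name, code" pair per line and reads such a file back, accepting any of the separators listed above. The function names `write_gc_file` and `read_gc_file` are hypothetical helpers, not MACSE functions, and the reader assumes sequence names contain no separator characters themselves.

```python
import re


def write_gc_file(path, seq_to_code):
    """Write one 'sequence_name<TAB>genetic_code' line per sequence.

    Sequences whose code is None are skipped, so the default genetic
    code (gc_def) would apply to them.
    """
    with open(path, "w") as handle:
        for name, code in seq_to_code.items():
            if code is not None:
                handle.write(f"{name}\t{code}\n")


def read_gc_file(path):
    """Parse a gc_file whose fields are separated by space, tab, comma or semicolon."""
    codes = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            name, code = re.split(r"[ \t,;]+", line)[:2]
            codes[name] = int(code)
    return codes
```

The resulting file would then be passed through the gc_file option, with gc_def set to whichever code should apply to the sequences that are not listed.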

Note that you can then use the output of these MACSE alignments to conduct a more precise taxonomic assignment.

I hope this will be helpful in analyzing your data.

Do not hesitate to reach out if you have further questions or feedback.
