Some gene alignments in aligned_gene_sequences/ have fragmented entries #312

Open
mudymudy opened this issue Oct 7, 2024 · 1 comment

mudymudy commented Oct 7, 2024

Hello there,

Thank you for developing this software. I was looking at the results in the folder aligned_gene_sequences/ and realised that some of those MSAs contain fragmented genes (I changed the file extensions from aln.fas to fasta):

$ cat nfrA2.fasta

>GCF_028335125.1_ASM2833512v1
atgaaacagattcctcaagattttcgtttgatagaagatttcttccgcacgcgcagatccgtacgcaagtttatcgatcgtcctgtggaggaagagaagttgatggccatcctcgaagccggacgcatagctccttcggcacataattaccagccgtggcatttcctcgtggtcagagaagaagagggccgcaaacgcttggctccctgttcccaacaaccttggttcccgggtgcccccatctatatcatcacgcttggcgatcatcaaagagcatggaagcgaggagcaggcgattccgtagacatcgatacctctatcgccatgacttatatgatgctggaagcacatagtctgggacttggatgtacgtgggtctgtgctttcgatcaagctctttgttcggagatcttcgacatcccttcgcacatgacacctgtttccatattggctctcggctatggcgatccgaccgtacctccgcgtgaggctttcaatcgcaaatccatcgaagaggtagtcagcttcgagaaattatga
>GCF_030144345.1_ASM3014434v1
atgaaacagattcctcaagattttcgtttgatagaagatttcttccgcacgcgcagatccgtacgcaagtttatcgatcgtcctgtggaggaagagaagttgatggccatcctcgaagccggacgcatagctccttcggcacataattaccagccgtggcatttcctcgtggtcagagaagaagagggccgcaaacgcttggctccctgttcccaacaaccttggttcccgggtgcccccatctatatcatcacgcttggcgatcatcaaagagcatggaagcgaggagcaggcgattccgtagacatcgatacctctatcgccatgacttatatgatgctggaagcacatagtctgggacttggatgtacgtgggtctgtgctttcgatcaagctctttgttcggagatcttcgacatcccttcgcacatgacacctgtttccatattggctctcggctatggcgatccgaccgtacctccgcgtgaggctttcaatcgcaaatccatcgaagaggtagtcagcttcgagaaattatga
>GCF_030252365.1_ASM3025236v1
atgaaacagattcctcaagattttcgtttgatagaagatttcttccgcacgcgcagatccgtacgcaagtttatcgatcgtcctgtggaggaagagaagttgatggccatcctcgaagccggacgcatagctccttcggcacataattaccagccgtggcatttcctcgtggtcagagaagaagagggccgcaaacgcttggctccctgttcccaacaaccttggttcccgggtgcccccatctatatcatcacgcttggcgatcatcaaagagcatggaagcgaggagcaggcgattccgtagacatcgatacctctatcgccatgacttatatgatgctggaagcacatagtctgggacttggatgtacgtgggtctgtgctttcgatcaagctctttgttcggagatcttcgacatcccttcgcacatgacacctgtttccatattggctctcggctatggcgatccgaccgtacctccgcgtgaggctttcaatcgcaaatccatcgaagaggtagtcagcttcgagaaattatga
>GCF_030440475.1_ASM3044047v1
atgaaacagattcctcaagattttcgtttgatagaagatttcttccgcacgcgcagatccgtacgcaagtttatcgatcgtcctgtggaggaagagaagttgatggccatcctcgaagccggacgcatagctccttcggcacataattaccagccgtga---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>GCF_030440475.1_ASM3044047v1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------gtggtcagagaagaagagggccgcaaacgcttggctccctgttcccaacaaccttggttcccgggtgcccccatctatatcatcacgcttggcgatcatcaaagagcatggaagcgaggagcgggcgattcggtagacatcgatacctctatcgccatgacttatatgatgctggaagcacatagtctgggacttggatgtacgtgggtctgtgctttcgatcaagctctttgttcggagatcttcgacatcccttcgcacatgacacctgtttccatattggctctcggctatggcgatccgaccgtacctccgcgtgaggctttcaatcgcaaatccatcgaagaggtagtcagcttcgagaaattatga
>GCF_030440495.1_ASM3044049v1
atgaaacagattcctcaagattttcgtttgatagaagatttcttccgcacgcgcagatccgtacgcaagtttatcgatcgtcctgtggaggaagagaagttgatggccatcctcgaagccggacgcatagctccttcggcacataattaccagccgtga---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>GCF_030440495.1_ASM3044049v1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------gtggtcagagaagaagagggccgcaaacgcttggctccctgttcccaacaaccttggttcccgggtgcccccatctatatcatcacgcttggcgatcatcaaagagcatggaagcgaggagcgggcgattcggtagacatcgatacctctatcgccatgacttatatgatgctggaagcacatagtctgggacttggatgtacgtgggtctgtgctttcgatcaagctctttgttcggagatcttcgacatcccttcgcacatgacacctgtttccatattggctctcggctatggcgatccgaccgtacctccgcgtgaggctttcaatcgcaaatccatcgaagaggtagtcagcttcgagaaattatga

In this example you can see that both GCF_030440475.1_ASM3044047v1 and GCF_030440495.1_ASM3044049v1 have two sequences for the same gene, probably because there is a gap between the two fragments. Does Panaroo offer a script to merge these two fragmented sequences into one? I want to build specific MSAs from a set of genes, and before writing a homemade script to handle this I wanted to check whether one already exists.

Thanks!

@gtonkinhill
Owner

Hi, I am afraid Panaroo doesn't currently have a script to do this. We are hoping to address this in the future.
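
Until then, a homemade merge of the kind described in the question could be sketched roughly as below. This is not part of Panaroo. It assumes that fragments of one gene share an identical FASTA header, that all entries in the file have the same aligned length, and that `-` is the gap character. The function names (`read_fasta`, `merge_columns`, `merge_fragments`) and the choice to write `n` where two fragments disagree at a column are illustrative assumptions, not an established convention.

```python
GAP = "-"


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples, keeping duplicates."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_columns(a, b, conflict="n"):
    """Merge two aligned fragments column by column.

    A gap is filled by the other fragment's character. Where both fragments
    carry different non-gap characters, `conflict` is written instead
    (an assumption of this sketch; other policies are possible).
    """
    if len(a) != len(b):
        raise ValueError("fragments must have the same aligned length")
    merged = []
    for x, y in zip(a, b):
        if x == GAP:
            merged.append(y)
        elif y == GAP or x == y:
            merged.append(x)
        else:
            merged.append(conflict)
    return "".join(merged)


def merge_fragments(records):
    """Collapse entries that share a header into a single aligned sequence.

    Returns a dict mapping header to merged sequence, in order of first appearance.
    """
    merged = {}
    for header, seq in records:
        if header in merged:
            merged[header] = merge_columns(merged[header], seq)
        else:
            merged[header] = seq
    return merged


if __name__ == "__main__":
    # Toy alignment: genome g2 is split into two fragments.
    toy = ">g1\natgc--\n>g2\natg---\n>g2\n---cat\n"
    for name, seq in merge_fragments(read_fasta(toy)).items():
        print(f">{name}\n{seq}")
```

Columns that are gaps in every fragment stay as gaps, so bases missing between two fragments are not recovered. The merged entry can also still contain a premature stop codon inherited from the first fragment, as with the `tga` at the end of the first fragment in the nfrA2 example above.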
