Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract mapped genome sequence from mappy API #126

Closed
marcus1487 opened this issue Feb 22, 2018 · 5 comments
Closed

Extract mapped genome sequence from mappy API #126

marcus1487 opened this issue Feb 22, 2018 · 5 comments

Comments

@marcus1487
Copy link
Contributor

I am using the mappy API with the sole end goal of extracting the genomic sequence for a read mapping. Currently, I am using pyfaidx to take the mapped coordinates from the mappy.Alignment object and extract the associated genomic sequence. This seems like a bit of extra work since the sequence in theory may be extract-able from the mappy.Aligner or mappy.Alignmentobjects (though there are likely reasons this may not be true). This would also allow my code not to load the same reference fasta file twice (once for mapping in mappy and the second time for randomly extracting genome sequence). Is there access to this right now via the python API (maybe undocumented)? Or would it be possible to add this feature either directly or indirectly to the mappy.Alignment object?

@lh3
Copy link
Owner

lh3 commented Feb 23, 2018

Not possible at the moment, but it should not be hard to add this functionality.

@lh3
Copy link
Owner

lh3 commented Feb 23, 2018

Added. You can now:

import mappy as mp

a = mp.Aligner("test/MT-human.fa")
seq = a.seq("MT_human", 100, 250)
for hit in a.map(mp.revcomp(seq)):
	print(hit)

I have only tested it on toy examples. Let me know if it has issues.

@marcus1487
Copy link
Contributor Author

I have tested this feature and it works perfectly for my test cases.

@marcus1487
Copy link
Contributor Author

After some more extensive testing it does not look as though this feature is working correctly. When working on some larger genomes the sequence extraction appears to be broken.

As a test case extracting the following genomic location from the linked zebrafish genome returns None, but this is a valid genomic location and I can confirm that it contains only standard bases.

Genomic location: 4:70000000-70000100
genome file source: ftp://ftp.ensembl.org/pub/release-89/fasta/danio_rerio/dna/Danio_rerio.GRCz10.dna.toplevel.fa.gz

Through a binary search I've found that sequence is only able to be extracted for each chromosome up to position 58871916. This is true for every chromosome in this particular fasta file.

@marcus1487 marcus1487 reopened this Mar 7, 2018
lh3 added a commit that referenced this issue Mar 9, 2018
@lh3
Copy link
Owner

lh3 commented Mar 9, 2018

It is a bug. It is now fixed via 96b132c.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants