Extract mapped genome sequence from mappy API #126

marcus1487 · 2018-02-22T23:11:36Z

I am using the mappy API with the sole end goal of extracting the genomic sequence for a read mapping. Currently, I use pyfaidx to take the mapped coordinates from the mappy.Alignment object and extract the associated genomic sequence. This seems like extra work, since in theory the sequence may be extractable from the mappy.Aligner or mappy.Alignment objects (though there may well be reasons why it is not). Direct access would also let my code avoid loading the same reference FASTA file twice (once for mapping in mappy and a second time for random-access extraction of genome sequence). Is this already accessible via the Python API (maybe undocumented)? If not, would it be possible to add this feature, either directly or indirectly, to the mappy.Alignment object?


lh3 · 2018-02-23T00:30:07Z

Not possible at the moment, but it should not be hard to add this functionality.

lh3 · 2018-02-23T18:39:52Z

Added. You can now:

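import mappy as mp

# Build (or load) the index for the reference
a = mp.Aligner("test/MT-human.fa")
# Extract reference sequence by contig name, start and end
seq = a.seq("MT_human", 100, 250)
# Map the reverse complement back to the reference and print each hit
for hit in a.map(mp.revcomp(seq)):
	print(hit)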

I have only tested it on toy examples. Let me know if it has issues.
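For the original use case (getting the reference sequence under each mapping), the new method could be combined with the coordinates on each hit. The following is a minimal sketch, not part of mappy: `reference_for_hit`, `revcomp` and the `read_orientation` flag are hypothetical names introduced here, and the code assumes that the hit's `ctg`, `r_st` and `r_en` attributes can be passed straight to `seq()`.

```python
# Sketch only: `aligner` is any object with a mappy-style seq(name, start, end)
# method, and `hit` is any object with ctg, r_st, r_en and strand attributes.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse-complement a DNA string (pure Python helper, hypothetical)."""
    return seq.translate(_COMPLEMENT)[::-1]


def reference_for_hit(aligner, hit, read_orientation=False):
    """Return the reference sequence spanned by one alignment.

    If read_orientation is True and the hit is on the reverse strand
    (strand == -1), the sequence is reverse-complemented so that it
    reads in the same direction as the query.
    """
    ref = aligner.seq(hit.ctg, hit.r_st, hit.r_en)
    if ref is None:
        # seq() signals failure with None, so surface that explicitly
        raise ValueError(
            "could not extract %s:%d-%d" % (hit.ctg, hit.r_st, hit.r_en)
        )
    if read_orientation and hit.strand == -1:
        ref = revcomp(ref)
    return ref
```

With a real index this would be called as `reference_for_hit(a, hit)` inside the `for hit in a.map(...)` loop, which avoids opening the FASTA file a second time with pyfaidx.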

marcus1487 · 2018-02-26T22:39:09Z

I have tested this feature and it works perfectly for my test cases.

marcus1487 · 2018-03-07T21:26:45Z

After more extensive testing, this feature does not appear to work correctly: on some larger genomes, sequence extraction appears to be broken.

As a test case, extracting the following genomic location from the zebrafish genome linked below returns None, even though it is a valid genomic location and I have confirmed that it contains only standard bases.

Genomic location: 4:70000000-70000100
genome file source: ftp://ftp.ensembl.org/pub/release-89/fasta/danio_rerio/dna/Danio_rerio.GRCz10.dna.toplevel.fa.gz

Through a binary search, I found that sequence can be extracted only up to position 58871916, and this limit is the same for every chromosome in this particular FASTA file.
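The binary search mentioned above can be sketched roughly as follows. This is an illustration rather than the script actually used: `last_extractable_start` and `fetch` are hypothetical names, and the search assumes that extraction succeeds at the lower bound and that once it fails at some position it fails at every later one.

```python
def last_extractable_start(fetch, lo, hi):
    """Return the largest position in [lo, hi] where fetch(pos) is not None.

    `fetch` is any callable that returns a sequence on success and None
    on failure. Success is assumed to be monotone: every position up to
    some limit works and every position after it fails.
    """
    if fetch(lo) is None:
        raise ValueError("extraction already fails at the lower bound")
    while lo < hi:
        # Round up so the loop always makes progress when hi == lo + 1
        mid = (lo + hi + 1) // 2
        if fetch(mid) is None:
            hi = mid - 1  # the limit lies strictly below mid
        else:
            lo = mid      # mid still works, so the limit is at least mid
    return lo
```

With mappy, `fetch` might be something like `lambda pos: a.seq("4", pos, pos + 100)`, where `a` is the `mp.Aligner` built from the genome file above; the 100-base window is an arbitrary choice for illustration.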

lh3 · 2018-03-09T00:12:47Z

It is a bug. It is now fixed via 96b132c.

lh3 added the feature-request label Feb 23, 2018

marcus1487 closed this as completed Feb 26, 2018

marcus1487 reopened this Mar 7, 2018

lh3 added a commit that referenced this issue Mar 9, 2018

fixed a bug/typo in Aligner.seq() (#126)

96b132c

marcus1487 mentioned this issue Mar 13, 2018

Huge memory demand in resquiggle. nanoporetech/tombo#36

Closed

marcus1487 mentioned this issue Mar 26, 2018

All Reads Are Failing nanoporetech/tombo#43

Closed

marcus1487 closed this as completed May 7, 2018
