-
Notifications
You must be signed in to change notification settings - Fork 424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
User specified splicing sites #197
Conversation
To me, it sounds too specific for a generic aligner. |
I tend to agree, but the heuristic for detecting splicing sites seems less generic than loading them from an external source. |
I like the idea of user-defined splice sites. However, I may need to reimplement some part of this PR. I will probably remove the support of custom TSV, too. I will come back to this issue later. Thank you. |
In hybrid mode (-ub -uintrons.bed) user supplied splicing sites are added to list of canonical sites detected on one strand for the first alignement and then another alignement is made with canonical sites on the opposite strand. The best alignement is conserved. When user sites for a transcript on one strand are added to the cannonical sites on the other strand this produces more splicing combinations and thus a better scoring but meaningless alignement. This commit add logic discarding user provided sites that are canonical on the opposite strand from which the other cannonical sites are detected. This should help restore the balance and the detection of the correct transcript strand.
I have implemented the basic idea at HEAD. New option Thanks for the suggestion. I am closing this PR. If you find bugs, please create a separate issue. |
Hi Heng, |
--junc-bed accepts a list of introns in BED. |
PS: see the manpage. |
This PR is mostly a feature request with a PoC implementation.
The bioinformatics team at the french sequencing center (Genoscope), working on MinION RNA-seq data, have expressed the need of specifying the splicing sites from an external source. In their de novo workflow, they infer splicing sites by aligning cDNA SR on genomics contigs. They'd like to leverage this information for improving the alignment of cDNA and directRNA ONT reads on their assembly.
With high error rate, false positives of inferred splicing sites may cause exons to be misaligned, yielding the wrong isoform. There is also the rare cases of non canonical sites.
This PR loads splicing sites from a BED containing introns intervals or a TSV file listing unpaired splicing sites. The file is indicated on the command line with the '-u' flag. The BED format is compatible with files from UCSC. The TSV format is more proprietary and use the minimap2's internal coordinates (-1bp compared to BED's introns intervals). The TSV file has a column indicating wether the site is gap opening of closing.
Sites with contig name not in the reference index are discarded. The splicing coordinates are stored and sorted in one vector per reference sequence. The MSB bit of the coordinate is set if the site is gap opening (donor).
Multi-part indexes, are untested. In verbose mode it will complain about sites on contigs not found in the current part of the index.
Before each alignment, splicing sites for the region are retrieved and the corresponding nucleotides of the extracted reference sequence are tagged in their high bits part. This allows
ksw_exts2_sse
to remain unaware of sequence offsets.ksw_exts2_sse
constructs it'sdonor
andacceptor
arrays by reading the tags and then untag the reference.I'm not quite happy with all these contortions. There is a 5/10% loss in performance even without using specified sites.
The fact that
ksw_exts2_sse
uses contiguous 16-bytes aligned pool for its arrays and is unaware of sequence offsets complicates the parameter passing without additional copying and allocating.It also support an "hybrid mode" where splicing sites are both inferred and specified. Each specified site score +noncan/2, while canonical inferred sites score's remains 0 and non-canonical -noncan.
I don't know if it's ok to use a positive score. My reasoning is that we add information by using externally supplied sites.
Initially in hybrid mode, the alignments with the sites inferred from the opposite strand of the transcript scored higher since they get more sites by adding the specified sites that may be on the opposite strand. The solution implemented is to remove the canonical sites on the opposite alignment strand from the list of specified sites.
I may add real world results when my colleagues are done with the evaluation.
Meanwhile, I'm open for comments and reviews.
Thanks !
Edit: Here is a pile-up of a small exon that got misaligned without knowledge of the gene model.
This is mouse brain sample direct RNA ONT read mapped on mm10. Top pile-up is generated with
-ax splice -u introns.bed
and bottom with-ax splice
(original behaviour, the exon is missed). The region shown is chr9:37,544,827-37,546,222, an excerpt of the neurogranin gene.