User specified splicing sites #197

Piezoid · 2018-07-06T15:18:47Z

This PR is mostly a feature request with a PoC implementation.

The bioinformatics team at the French sequencing center (Genoscope), working on MinION RNA-seq data, has expressed the need to specify splicing sites from an external source. In their de novo workflow, they infer splicing sites by aligning cDNA SR on genomic contigs. They'd like to leverage this information to improve the alignment of cDNA and direct RNA ONT reads on their assembly.
With a high error rate, false-positive inferred splicing sites may cause exons to be misaligned, yielding the wrong isoform. There are also rare cases of non-canonical sites.

This PR loads splicing sites from a BED file containing intron intervals or a TSV file listing unpaired splicing sites. The file is given on the command line with the '-u' flag. The BED format is compatible with files from UCSC. The TSV format is custom and uses minimap2's internal coordinates (-1bp compared to BED's intron intervals). The TSV file has a column indicating whether the site is gap-opening or gap-closing.

Sites whose contig name is not in the reference index are discarded. The splicing coordinates are stored and sorted in one vector per reference sequence. The most significant bit (MSB) of the coordinate is set if the site is gap-opening (donor).
Multi-part indexes are untested. In verbose mode, it will complain about sites on contigs not found in the current part of the index.
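The storage scheme described above can be sketched as follows. This is a minimal illustration and not the PR's C code: the 32-bit coordinate width, the function names and the choice to sort on the coordinate with the flag masked out are all assumptions.

```python
from bisect import bisect_left
from collections import defaultdict

DONOR_FLAG = 1 << 31          # assumption: 32-bit coordinates, MSB marks a donor
COORD_MASK = DONOR_FLAG - 1   # low 31 bits hold the position itself


def build_site_index(sites, known_contigs):
    """Pack (contig, position, is_donor) tuples into one sorted list per contig."""
    index = defaultdict(list)
    for contig, pos, is_donor in sites:
        if contig not in known_contigs:
            continue  # sites on contigs absent from the index are discarded
        index[contig].append(pos | DONOR_FLAG if is_donor else pos)
    for packed in index.values():
        # Sort on the coordinate only, ignoring the donor flag (assumption).
        packed.sort(key=lambda v: v & COORD_MASK)
    return dict(index)


def sites_in_region(index, contig, start, end):
    """Return (position, is_donor) pairs with start <= position < end."""
    packed = index.get(contig, [])
    # Unpacking every coordinate per query is wasteful; kept for clarity.
    coords = [v & COORD_MASK for v in packed]
    lo = bisect_left(coords, start)
    hi = bisect_left(coords, end)
    return [(v & COORD_MASK, bool(v & DONOR_FLAG)) for v in packed[lo:hi]]
```

Packing the flag into the coordinate keeps each site in a single integer, at the cost of having to mask before every comparison.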

Before each alignment, splicing sites for the region are retrieved and the corresponding nucleotides of the extracted reference sequence are tagged in their high bits. This allows ksw_exts2_sse to remain unaware of sequence offsets. ksw_exts2_sse constructs its donor and acceptor arrays by reading the tags and then untags the reference.
I'm not quite happy with all these contortions. There is a 5-10% loss in performance even without using specified sites.
The fact that ksw_exts2_sse uses a contiguous 16-byte-aligned pool for its arrays and is unaware of sequence offsets makes it hard to pass parameters without additional copying and allocating.
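To make the tag/untag round trip concrete, here is a small sketch. It assumes one byte per base with the nucleotide code in the low bits; the two tag bits (0x40 and 0x80) and the function names are hypothetical and are not taken from the PR.

```python
DONOR_TAG = 0x40      # hypothetical bit marking a donor site
ACCEPTOR_TAG = 0x80   # hypothetical bit marking an acceptor site
BASE_MASK = 0x0F      # assumption: nucleotide codes fit in the low four bits


def tag_reference(ref_codes, region_start, sites):
    """Return a copy of the region with site flags OR-ed into the high bits.

    sites holds (absolute_position, is_donor) pairs; positions outside the
    region are ignored.
    """
    tagged = bytearray(ref_codes)
    for pos, is_donor in sites:
        offset = pos - region_start
        if 0 <= offset < len(tagged):
            tagged[offset] |= DONOR_TAG if is_donor else ACCEPTOR_TAG
    return tagged


def extract_and_untag(tagged):
    """Read the flags into donor/acceptor arrays, then clear them in place."""
    donor = [bool(b & DONOR_TAG) for b in tagged]
    acceptor = [bool(b & ACCEPTOR_TAG) for b in tagged]
    for i in range(len(tagged)):
        tagged[i] &= BASE_MASK
    return donor, acceptor
```

The aligner side only sees a buffer and never needs the region's offset, which is the point of the trick; the extra pass over the buffer is one plausible source of overhead even when no sites are supplied.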

It also supports a "hybrid mode" where splicing sites are both inferred and specified. Each specified site scores +noncan/2, while canonical inferred sites still score 0 and non-canonical ones score -noncan.
I don't know if it's ok to use a positive score. My reasoning is that we add information by using externally supplied sites.
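The hybrid-mode scoring can be written down as a tiny function. This is only a sketch of the rule as stated: integer division for noncan/2 and the precedence of a specified site over a canonical one are assumptions, and the function name is made up.

```python
def site_score(is_specified, is_canonical, noncan):
    """Score contribution of one splice site in hybrid mode.

    noncan is the non-canonical penalty (a positive integer).
    """
    if is_specified:
        return noncan // 2   # bonus for an externally supplied site
    if is_canonical:
        return 0             # inferred canonical site: unchanged
    return -noncan           # inferred non-canonical site: penalised
```

With noncan = 9, for example, a specified site contributes 4, an inferred canonical site 0 and an inferred non-canonical site -9.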

Initially in hybrid mode, alignments using sites inferred from the opposite strand of the transcript scored higher, since they gained extra sites from the specified ones, which may lie on the other strand. The implemented solution is to remove from the list of specified sites those that are canonical on the strand opposite to the one being aligned.
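A simplified sketch of that filter is shown below. It works on whole intron intervals (0-based, half-open, as in BED) and on a plain string reference, whereas the PR handles unpaired sites on encoded sequences, so both the representation and the function names are assumptions. It relies on the GT-AG motif reading as CT-AC on the reference when the transcript is on the reverse strand.

```python
def canonical_strand(ref, start, end):
    """Return '+', '-' or None for the intron ref[start:end]."""
    left, right = ref[start:start + 2], ref[end - 2:end]
    if left == "GT" and right == "AG":
        return "+"   # GT...AG read directly on the reference
    if left == "CT" and right == "AC":
        return "-"   # reverse complement of GT...AG
    return None      # non-canonical on both strands


def introns_for_pass(ref, introns, pass_strand):
    """Keep user introns except those canonical on the opposite strand."""
    opposite = "-" if pass_strand == "+" else "+"
    return [(s, e) for s, e in introns
            if canonical_strand(ref, s, e) != opposite]
```

Non-canonical user introns are kept in both passes, since their motif gives no evidence about the strand.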

I may add real-world results when my colleagues are done with the evaluation.
Meanwhile, I'm open to comments and reviews.
Thanks!

Edit: Here is a pile-up of a small exon that got misaligned without knowledge of the gene model.
These are direct RNA ONT reads from a mouse brain sample, mapped on mm10. The top pile-up is generated with -ax splice -u introns.bed and the bottom one with -ax splice (original behaviour, the exon is missed). The region shown is chr9:37,544,827-37,546,222, an excerpt of the neurogranin gene.

armintoepfer · 2018-07-06T15:31:22Z

To me, it sounds too specific for a generic aligner.

Piezoid · 2018-07-08T13:43:29Z

I tend to agree, but the heuristic for detecting splicing sites seems less generic than loading them from an external source.

lh3 · 2018-07-16T14:10:04Z

I like the idea of user-defined splice sites. However, I may need to reimplement some parts of this PR. I will probably remove the support for the custom TSV, too. I will come back to this issue later. Thank you.

In hybrid mode (-ub -uintrons.bed), user-supplied splicing sites are added to the list of canonical sites detected on one strand for the first alignment, and then another alignment is made with canonical sites on the opposite strand. The best alignment is kept. When user sites for a transcript on one strand are added to the canonical sites on the other strand, this produces more splicing combinations and thus a better-scoring but meaningless alignment. This commit adds logic that discards user-provided sites that are canonical on the strand opposite to the one on which the other canonical sites are detected. This should help restore the balance and the detection of the correct transcript strand.

lh3 · 2019-04-28T21:11:02Z

I have implemented the basic idea at HEAD. New option --junc-bed specifies introns (i.e. junctions). This file can be generated with paftools.js gff2bed -j. Option --junc-bonus specifies the score bonus for a donor/acceptor site in --junc-bed.

Thanks for the suggestion. I am closing this PR. If you find bugs, please create a separate issue.

leleory · 2020-04-07T21:48:48Z

Hi Heng,
--junc-bed requires a BED12 file representing known annotation, but for de novo assemblies such a file may not be available. Nevertheless, if RNA-seq data is available for the species, it can also provide splice junction information. E.g. STAR would create an SJ.out.tab file with such information (or it would be possible to generate such a file from the mapped RNA-seq reads). Would it be possible to modify the --junc-bed parameter so it would accept a simple BED file representing possible introns as well?
Thank you,
Lel

lh3 · 2020-04-07T22:47:34Z

--junc-bed accepts a list of introns in BED.
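For the STAR case raised above, a conversion along these lines should be enough to produce such an intron list. It is a sketch, not part of minimap2: the function name and the read-count filter are made up, and it relies on the SJ.out.tab layout documented by STAR (1-based inclusive intron start and end in columns 2 and 3, strand as 0/1/2 in column 4, uniquely mapped read count in column 7). Which BED columns beyond the first three minimap2 actually reads is not stated in this thread.

```python
def sj_to_bed_lines(sj_lines, min_unique_reads=1):
    """Yield 6-column BED intron lines from STAR SJ.out.tab lines."""
    strand_map = {"0": ".", "1": "+", "2": "-"}
    for line in sj_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip malformed or truncated lines
        chrom = fields[0]
        first, last = int(fields[1]), int(fields[2])
        unique = int(fields[6])
        if unique < min_unique_reads:
            continue  # hypothetical filter on weakly supported junctions
        strand = strand_map.get(fields[3], ".")
        # 1-based inclusive -> 0-based half-open: only the start shifts.
        yield f"{chrom}\t{first - 1}\t{last}\t.\t0\t{strand}"
```

A typical use would be to read SJ.out.tab line by line, write the yielded lines to a file such as introns.bed (a placeholder name) and pass that file to --junc-bed.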

lh3 · 2020-04-08T03:27:14Z

PS: see the manpage.

lh3 added the feature-request label Jul 16, 2018

lh3 mentioned this pull request Oct 23, 2018

Alignment of nanopore cDNA reads to a tiny exon (6 nucleotides) #253

Closed

Piezoid and others added 8 commits November 9, 2018 16:23

Loads external splicing coordinates

6a89670

get sequences tagged with user supplied splicing sites

a1ecbb2

use user supplied splicing sites in ksw_exts2_sse

b32ff64

Add TSV support for unpaired splice annotations

c0c9391

C90 compatibility

4afdb7b

Warning for out of range coordinates

2f023c2

Fix splicing parameter (-u) kind detection

ddcc885

Piezoid force-pushed the lh3 branch from ea9c3be to ddcc885 Compare November 9, 2018 15:28

lh3 mentioned this pull request Mar 4, 2019

Give in known splice sites to MiniMap2 (as in STAR/HISAT2/TopHat2) #348

Closed

lh3 closed this Apr 28, 2019

lh3 added this to the 2.17 milestone Apr 29, 2019
