Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transcripts missing length information exception #3

Open
travc opened this issue Oct 28, 2015 · 3 comments
Open

transcripts missing length information exception #3

travc opened this issue Oct 28, 2015 · 3 comments

Comments

@travc
Copy link

travc commented Oct 28, 2015

I'm triggering RuntimeError('There exist transcripts missing length information.')

We are following the example: https://emase.readthedocs.org/en/latest/examples.html#estimating-allele-specific-expression-from-human-rna-seq-data
A probably important note is that we're working with a significantly rougher reference (A. gambiae) than human, mouse, ect. Samples are all the same species using the same ref and annotation. Anyway, my guess is that some wonkiness in the reference gtf is causing the problem.

Assuming my arithmetic is correct, I've traced it to the fact that there are some transcripts listed in emase.gene2transcripts.tsv which are not in emase.pooled.transcripts.info.

It appears that 37 transcripts are being lost someplace between generating emase.transcripts.info and emase.pooled.transcripts.info.

$ wc emase.transcripts.info
15656  31312 291991
$ wc emase.pooled.transcripts.info
31238  62476 738868

31238/2 = 15619 ... If I understand correctly, that should equal the total number of transcripts (15656). I could be wrong of course.

@narayananr
Copy link
Contributor

Hi Travis

Thanks for using EMASE and contacting. I think you right in guessing that the individualized pooled transcriptome is missing genes/isoforms/alleles that is present in the reference GTF file. This typically happens when insertions/deletions removes them. (Just curious to know how you created the pooled transcriptome adjusting for SNPs and Indels.)

A hack to get around the issue is to try manually adding the missing genes/isoforms/alleles and its "fake" lengths and try running EMASE. Since the pooled transcriptome does not have the sequences for these missing genes/isoforms/alleles, the quantitation would not really get affected.

Thank you
Narayanan

@travc
Copy link
Author

travc commented Oct 29, 2015

We are really just following the first "real use case" example at this point. I'm still a bit fuzzy on the details myself since I just stepped in to help a colleague who got stuck.
I assumed that adjust_annotations.py combined with prepare-emase adjusts for SNPs and indels when creating the pooled transcriptome.

If transcripts with a length of 0 won't effect the results, then maybe a work-around is just to make the exception a warning instead. I'll play around with that a bit.

@travc
Copy link
Author

travc commented Oct 29, 2015

So an apparent hackish fix is to just replace the raise on line 62 of EMfactory.py with a warning and setting those 0's to 1's.

self.target_lengths[self.target_lengths==0] = 1

I only understand what emase is doing at a very superficial level, but if giving missing transcripts 'fake' lengths won't effect the results, then I think this fix should be kosher. There are probably more proper/better ways of handling it (or avoiding the problem in the first place), but it seems like a rare bug so maybe this quick fix is good enough.

Since the bug is still outstanding in the code, I'll let the maintainers close

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants