Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

indels #26

Merged
merged 6 commits into from
Jun 15, 2020
Merged

indels #26

merged 6 commits into from
Jun 15, 2020

Conversation

rnmitchell
Copy link
Contributor

This PR is to address the possibility of indels within the flanking sequence of the UAS region when using STRait Razor output (full length sequences). The initial idea is to create a dictionary of known STR length alleles for each locus and if the calculated length allele is not in the list of known alleles, flag the sequence as containing a possible indel in the flanking report for the user to manually evaluate.

@rnmitchell
Copy link
Contributor Author

I think this is ready for review @standage. It's not much of a change (I realized I had accidentally included the indel_flag function in marker.py in the previous branch when it should have been here... oops! This is why I should only work on one branch at a time...). I also added a test to ensure the indel flag is created if the length allele is not present in the dictionary.

One note I want to make. I updated the test_flank_anno() file. If you look at this file, you can see that there are a decent number of sequences that are flagged as possible indel or partial sequence. You can also see that the majority of these sequences have 1 read and therefore are likely crap (and would be filtered out anyway), so I don't see this as a huge deal, but I wanted to see if you had any opinion on the number of sequences that are flagged (may also be a question for RebJ).

@rnmitchell rnmitchell marked this pull request as ready for review June 15, 2020 13:03
@rnmitchell rnmitchell requested a review from standage June 15, 2020 13:03
Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@standage
Copy link
Member

You can also see that the majority of these sequences have 1 read and therefore are likely crap...

Unless I'm missing something, the number of reads for each sequence is 100. So I'm not sure how to interpret that. But apart from the uncertainty there I don't see anything problematic with the output.

@rnmitchell
Copy link
Contributor Author

Ahh sorry, I was looking at a different file... I had made Reads all to be 100 in the test files.

@standage
Copy link
Member

When I first saw that a while back, I figured you had just set the number of reads to 100 for testing purposes. At what point do sequences get filtered out by number of reads?

@rnmitchell
Copy link
Contributor Author

It's up to the user to do beforehand, lusSTR don't do any filtering. That could be something we implement in the future, I suppose.

@standage
Copy link
Member

Yeah, I mean in that case I can't think of any concerns about the output as is.

@standage
Copy link
Member

Should I go ahead and merge this?

@rnmitchell
Copy link
Contributor Author

Oh, yeah go ahead. I got distracted.

@standage standage merged commit bf7e4b4 into master Jun 15, 2020
@standage standage deleted the indels branch June 15, 2020 20:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants