Fix format and annotation issues #27
Ok @standage this is ready for review and merge. I ran 4 STRait Razor datasets and fixed any errors that popped up. Given all the crap in those datasets, I actually didn't run into too many errors. There's still a bigger issue with the presence of partial sequences. This PR addresses sequences that are shorter than the number of bases in the flanking region and therefore throw an error when those bases are removed. However, there remains the issue of partial sequences that are long enough for the flanking bases to be removed but are clearly shorter than expected. That will be addressed in PR #26, which deals with potential indels. Checking the called length allele against a dictionary of expected length alleles should catch the majority of these partial sequences.
Description of the changes is very thorough, and the changes look straightforward. I haven't looked at the changes to the large test data files, but the tests are passing for me. 👍
I'm going to go ahead and approve this PR, but I'm curious whether the test suite tests any of the changes you just made to locus-specific handling. If so, feel free to merge. If not, I'd recommend adding a few small tests where you create a STRMarker object, feed it one of the sequence strings that previously caused it to fail, and do a string comparison to confirm it gives the right output this time. That'll make it easier to make sure that future changes don't cause any regressions.
Thanks for the feedback! I added a few tests with specific sequences for the loci I changed. I also created a new test that inputs a file containing only partial sequences and ensures an empty file is created. The tests are passing here so I'm going to go ahead and merge this.
After running some STRait Razor files through lusSTR, I noticed some errors occurring.

The format command does not add a Project ID when formatting STRait Razor files. This causes the annotate script to fail to correctly combine reads of identical sequences. The first commit fixes this by using the analysis ID as the project ID for STRait Razor files (the two are identical). The annotate script can correctly combine identical sequences after this fix.

Some sequences from the D19 locus were failing. The D19 locus function splits the sequence after the final 'CCTT' repeat and runs separate functions on each part. However, some sequences end in the 'CCTT' repeat, so a function is run on an empty string, resulting in an error. A second error occurs if the second part of the sequence string is less than 7bp. The second commit fixes both of these issues.
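To illustrate the D19 failure mode, here is a minimal sketch of splitting a sequence after its final 'CCTT' repeat while guarding against an empty or short second part. The function names and the toy sequences are hypothetical and are not taken from the lusSTR code; only the 'CCTT' repeat and the 7bp threshold come from the description above.

```python
def split_after_final_repeat(sequence, repeat="CCTT"):
    """Split `sequence` immediately after the last occurrence of `repeat`.

    Returns (first_part, second_part). The second part is an empty string
    when the sequence ends in the repeat or does not contain it at all.
    """
    cut = sequence.rfind(repeat)
    if cut == -1:
        return sequence, ""
    cut += len(repeat)
    return sequence[:cut], sequence[cut:]


def second_part_is_annotatable(second_part, min_length=7):
    """Hypothetical guard: only annotate the second part if it is long enough."""
    return len(second_part) >= min_length


# Toy sequences, not real D19 alleles.
ends_in_repeat = split_after_final_repeat("AAGGCCTTCCTT")
short_tail = split_after_final_repeat("AAGGCCTTCCTTTTTC")
```

With these toy inputs, the first call yields an empty second part and the second call yields a 4bp second part, which are the two cases that previously raised errors. Callers would check `second_part_is_annotatable` before running the downstream function.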
The STRait Razor output contained multiple sequences from the FGA locus that were 23bp in length. The 3' flanking region removed by the annotate command is also 23bp, so these partial sequences errored out. Because partial sequences are included in STRait Razor output, they need to be identified, flagged and removed before the annotation step.

D21S11 locus: a sequence with the forward strand bracketed annotation [TCTA]5 [TCTG]6 [TCTA]3 TA [TCTA]3 TCA [TCTA]2 TCCATA [TCTA]11 TA caused the current script to error out. The third commit fixes this by allowing for the 2 bases at the end after the [TCTA] repeat set.

An FGA sequence that did not contain 'GGAA' errored out. Commit number 4 fixes this error.
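For the 23bp FGA partial sequences described above, the check can be sketched as a length test applied before the flanks are trimmed. This is an illustration only: `is_partial` and `trim_flanks` are hypothetical names, not lusSTR functions, and the 5' flank length of 0 used in the example is a placeholder. Only the 23bp 3' flank comes from the description.

```python
def is_partial(sequence, flank_5p, flank_3p):
    """True if nothing would remain after removing both flanking regions."""
    return len(sequence) <= flank_5p + flank_3p


def trim_flanks(sequence, flank_5p, flank_3p):
    """Return the sequence without its flanks, or None for a partial sequence."""
    if is_partial(sequence, flank_5p, flank_3p):
        return None  # flag for removal instead of raising an error
    end = len(sequence) - flank_3p
    return sequence[flank_5p:end]


# Toy inputs: a 23bp read (same length as the 3' flank) and a longer read.
too_short = trim_flanks("A" * 23, flank_5p=0, flank_3p=23)
long_enough = trim_flanks("ACGT" + "T" * 23, flank_5p=0, flank_3p=23)
```

This only catches sequences no longer than the combined flanks. Partial sequences that survive trimming but are still shorter than expected would need a separate check, such as comparing against expected allele lengths.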