Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix format and annotation issues #27

Merged
merged 11 commits into from
Jun 15, 2020
Merged

Fix format and annotation issues #27

merged 11 commits into from
Jun 15, 2020

Conversation

rnmitchell
Copy link
Contributor

@rnmitchell rnmitchell commented Jun 9, 2020

After running some Strait Razor files through lusSTR, I noticed some errors occurring.

  1. Format command does not add a Project ID when formatting Strait Razor files. This causes the annotate script to fail to correctly combine reads of identical sequences. This was fixed in the first commit to add the analysis ID as the project ID (both analysis ID and project ID are identical) for Strait Razor files. The annotate script can correctly combine identical sequences after this fix.

  2. Some sequences from the D19 locus were failing. The D19 locus function splits the sequence after the final 'CCTT' repeat and runs separate functions on each sequence. However, some sequences end in the 'CCTT' repeat, thus a function is run on an empty string, resulting in an error. A second error is occurring if the second part of the sequence string is less than 7bp. The second commit fixes both of these issues.

  3. The Strait Razor output contained multiple sequences (from FGA locus) which was 23bp in length. The 3' flanking region which is removed in the annotate command is 23bp. This partial sequenced erred out. Given that partial sequences are included in STRait Razor output, they therefore need to be identified, flagged and removed before the annotation step.

  4. D21S11 locus: sequence with forward strand bracketed annotation as:
    [TCTA]5 [TCTG]6 [TCTA]3 TA [TCTA]3 TCA [TCTA]2 TCCATA [TCTA]11 TA, however the current script errs out with this. The third commit fixes this error by allowing for the 2 bases at the end after the [TCTA] repeat set.

  5. FGA sequence which did not contain 'GGAA' in the sequence erred out. Commit number 4 fixes this error.

@rnmitchell
Copy link
Contributor Author

Ok @standage this is ready for review and merge. I ran 4 STRait Razor datasets and fixed any errors that popped up. Given all the crap in those datasets, I actually didn't run into too many errors.

There's still a bigger issue of the presence of partial sequences. This PR addresses a sequence which are smaller than the # of bases in the flanking region and therefore throws an error whenever those bases are removed. However, there still remains the issue of partial sequences which are large enough to remove the # of flanking bases but are clearly smaller than expected. This will be addressed with PR #26 which addresses the issue of potential indels. Checking the called length allele against a dictionary of expected length alleles should catch the majority of these partial sequences.

@rnmitchell rnmitchell marked this pull request as ready for review June 10, 2020 17:16
@rnmitchell rnmitchell requested a review from standage June 10, 2020 17:16
Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Description of the changes is very thorough, and the changes look straightforward. I haven't looked at the changes to the large test data files, but the tests are passing for me. 👍

I'm going to go ahead an approve this PR, but I'm curious whether the test suite tests any of the changes you just made to locus-specific handling. If so, feel free to merge. If not, I'd recommend adding a few small tests where you create a STRMarker object, feed it one of the sequence string that previously caused it to fail, and do a string comparison to confirm it gives the right output this time. That'll make it easier to make sure that future changes don't cause any regressions.

@rnmitchell
Copy link
Contributor Author

rnmitchell commented Jun 15, 2020

Thanks for the feedback! I added a few tests with specific sequences for the loci I changed. I also created a new test that inputs a file with all partial sequences and ensures an empty file is created. The tests are passing here so I'm going to go ahead and merge this.

@rnmitchell rnmitchell merged commit 9ac308e into master Jun 15, 2020
@standage standage deleted the format_anno_issues branch June 19, 2020 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants