Fix format and annotation issues #27

rnmitchell · 2020-06-09T20:04:31Z

After running some STRait Razor files through lusSTR, I noticed several errors.

The format command does not add a Project ID when formatting STRait Razor files, which causes the annotate script to fail to combine reads of identical sequences. The first commit fixes this by using the analysis ID as the project ID for STRait Razor files (the two IDs are identical), so the annotate script now combines identical sequences correctly.
Some sequences from the D19 locus were failing. The D19 locus function splits the sequence after the final 'CCTT' repeat and runs a separate function on each part. However, some sequences end in the 'CCTT' repeat, so a function was run on an empty string, resulting in an error. A second error occurred when the second part of the sequence was shorter than 7bp. The second commit fixes both issues.
The STRait Razor output contained multiple sequences (from the FGA locus) that were only 23bp long. The 3' flanking region removed by the annotate command is also 23bp, so these partial sequences errored out. Because partial sequences are included in STRait Razor output, they need to be identified, flagged, and removed before the annotation step.
D21S11 locus: a sequence whose forward-strand bracketed annotation is [TCTA]5 [TCTG]6 [TCTA]3 TA [TCTA]3 TCA [TCTA]2 TCCATA [TCTA]11 TA caused the current script to error out. The third commit fixes this error by allowing for the 2 bases that follow the final [TCTA] repeat set.
An FGA sequence that did not contain 'GGAA' errored out. Commit number 4 fixes this error.
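
To illustrate the D19 failure mode described above, here is a minimal sketch of splitting a sequence after its final 'CCTT' repeat while guarding against an empty or very short second part. The function names `split_after_final_repeat` and `classify_tail` are hypothetical and are not taken from the lusSTR code; only the 'CCTT' repeat and the 7bp threshold come from the description.

```python
def split_after_final_repeat(sequence, repeat="CCTT"):
    """Split `sequence` just after the last occurrence of `repeat`."""
    idx = sequence.rfind(repeat)
    if idx == -1:
        # Repeat not present: treat the whole sequence as the first part.
        return sequence, ""
    cut = idx + len(repeat)
    return sequence[:cut], sequence[cut:]


def classify_tail(tail, min_len=7):
    """Decide how the second part should be handled before annotating it."""
    if not tail:
        return "empty"  # sequence ended in the repeat; nothing to annotate
    if len(tail) < min_len:
        return "short"  # fewer than 7bp; needs its own code path
    return "full"


# A sequence that ends in the repeat yields an empty second part.
head, tail = split_after_final_repeat("AAGGCCTTCCTT")
print(repr(head), repr(tail), classify_tail(tail))
```

Checking the second part before dispatching to a locus-specific function is what keeps the empty-string and short-string cases from raising.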

rnmitchell · 2020-06-10T17:16:08Z

Ok @standage this is ready for review and merge. I ran 4 STRait Razor datasets and fixed any errors that popped up. Given all the crap in those datasets, I actually didn't run into too many errors.

There's still a bigger issue with partial sequences. This PR addresses sequences that are shorter than the number of bases in the flanking region and therefore throw an error when those bases are removed. However, partial sequences that are long enough for the flanking bases to be removed but are clearly shorter than expected remain an issue. This will be addressed in PR #26, which deals with potential indels. Checking the called allele length against a dictionary of expected allele lengths should catch the majority of these partial sequences.
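
As a rough sketch of the first kind of check (sequences too short to survive flank removal), the snippet below separates partial sequences from usable ones. The names `is_partial` and `partition_sequences` and the flank lengths passed in are assumptions for illustration, not the lusSTR implementation; the comparison uses `<=` because a sequence exactly as long as the flanks leaves nothing to annotate.

```python
def is_partial(sequence, flank_5p, flank_3p):
    """True if removing both flanks would leave no bases to annotate."""
    return len(sequence) <= flank_5p + flank_3p


def partition_sequences(sequences, flank_5p, flank_3p):
    """Return (kept, flagged) lists, preserving input order."""
    kept, flagged = [], []
    for seq in sequences:
        if is_partial(seq, flank_5p, flank_3p):
            flagged.append(seq)
        else:
            kept.append(seq)
    return kept, flagged


# Hypothetical input: one 23bp fragment and one longer read,
# checked against a 23bp 3' flank and no 5' flank.
kept, flagged = partition_sequences(["A" * 23, "A" * 60], flank_5p=0, flank_3p=23)
print(len(kept), len(flagged))
```

This does not catch partial sequences that are longer than the flanks but shorter than any expected allele; that case needs the expected-length lookup mentioned above.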

standage

Description of the changes is very thorough, and the changes look straightforward. I haven't looked at the changes to the large test data files, but the tests are passing for me. 👍

I'm going to go ahead and approve this PR, but I'm curious whether the test suite tests any of the changes you just made to locus-specific handling. If so, feel free to merge. If not, I'd recommend adding a few small tests where you create a STRMarker object, feed it one of the sequence strings that previously caused it to fail, and do a string comparison to confirm it gives the right output this time. That'll make it easier to make sure that future changes don't cause any regressions.
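
The shape of such a regression test might look like the sketch below. It deliberately does not use the real STRMarker class, whose constructor and attributes are not shown in this thread; `bracket_motif` is a hypothetical stand-in annotator, and the cases are made-up inputs chosen to cover a trailing 2-base remainder and an empty string.

```python
import re


def bracket_motif(sequence, motif):
    """Stand-in annotator: collapse runs of `motif` into '[motif]n'."""
    pattern = re.compile(f"((?:{re.escape(motif)})+)")
    parts = []
    for piece in pattern.split(sequence):
        if not piece:
            continue  # re.split yields empty strings at the edges
        if pattern.fullmatch(piece):
            parts.append(f"[{motif}]{len(piece) // len(motif)}")
        else:
            parts.append(piece)
    return " ".join(parts)


# Each case: (input sequence, motif, expected annotation string).
REGRESSION_CASES = [
    ("TCTATCTATA", "TCTA", "[TCTA]2 TA"),  # trailing 2 bases after the repeat
    ("TCTATCTA", "TCTA", "[TCTA]2"),       # sequence ends in the repeat
    ("", "TCTA", ""),                      # empty input must not raise
]


def test_previously_failing_sequences():
    for sequence, motif, expected in REGRESSION_CASES:
        assert bracket_motif(sequence, motif) == expected


test_previously_failing_sequences()
```

In the real suite, the stand-in call would be replaced by constructing the marker object for the locus in question and comparing its annotation string to the expected value.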

rnmitchell · 2020-06-15T11:50:37Z

Thanks for the feedback! I added a few tests with specific sequences for the loci I changed. I also created a new test that inputs a file with all partial sequences and ensures an empty file is created. The tests are passing here so I'm going to go ahead and merge this.

Rebecca Mitchell added 9 commits June 9, 2020 14:23

fixed format issue

4fbe687

fixed D19 issue

275f30a

Fixed D21 error

145ed6f

fixed D21 typo

c871e5d

fixed minor error in FGA

cf91da9

updated strait razor format test data

094ffa2

fixed partial sequences error

ed2942f

updated flanking report test file

c0d12a1

Fixed D21 LUS error

b937577

rnmitchell marked this pull request as ready for review June 10, 2020 17:16

rnmitchell requested a review from standage June 10, 2020 17:16

standage approved these changes Jun 12, 2020

Rebecca Mitchell added 2 commits June 15, 2020 07:40

added tests for new formatting rules

a55ebd6

fixed style errors

430ab26

rnmitchell merged commit 9ac308e into master Jun 15, 2020

standage deleted the format_anno_issues branch June 19, 2020 17:28
