Add GeneMarker files as possible input #67

rnmitchell · 2023-12-19T10:46:53Z

This PR allows GeneMarker files to be used as input to lusSTR. The sequence lengths of all but two loci (D16 and D8) match those produced by STRait Razor; lusSTR ensures these two loci are handled appropriately.

rnmitchell · 2023-12-26T15:41:03Z

lusSTR/scripts/marker.py

            for m in re.finditer("GGGCTGCCTA", self.uas_sequence):
+                print(m)
                break_point = m.end()
+            try:
+                break_point
+            except NameError:
+                for m in re.finditer("TTTT", self.uas_sequence):
+                    break_point = m.end() + 10


While running some STRait Razor data through lusSTR, I hit an error: GGGCTGCCTA could not be found in the sequence (there was a SNP within it), so the break point could not be identified. I'm surprised I haven't run into this error before. The fallback identifies a run of Ts before that sequence instead, which allowed the entire STRait Razor file to run.

rnmitchell · 2023-12-26T16:57:54Z

lusSTR/scripts/marker.py

+            if self.locus == "D16S539" and self.software == "genemarker":
+                return self.data["Power_5"], (self.data["Power_3"] - 3)
+            elif self.locus == "D8S1179" and self.software == "genemarker":
+                return (self.data["Power_5"] - 5), (self.data["Power_3"] - 5)
+            else:
+                return self.data["Power_5"], self.data["Power_3"]


With the GeneMarker software, only two loci (D16 and D8) produce sequences that differ from STRait Razor's (the STRait Razor sequences contain a few additional bases). This change accounts for those differences.
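
A minimal standalone sketch of the adjustment described above. The `flank_offsets` helper and the offset values passed to it are hypothetical; in the diff the offsets come from `self.data["Power_5"]` and `self.data["Power_3"]`, whose real values are not shown in this thread.

```python
def flank_offsets(locus, software, power_5, power_3):
    """Return (5', 3') offsets, trimmed for the two loci where GeneMarker differs.

    Hypothetical helper: mirrors the branch in the diff above, but takes the
    offsets as arguments instead of reading them from self.data.
    """
    if software == "genemarker":
        if locus == "D16S539":
            # GeneMarker reports 3 fewer bases on the 3' side for D16S539.
            return power_5, power_3 - 3
        if locus == "D8S1179":
            # GeneMarker reports 5 fewer bases on both sides for D8S1179.
            return power_5 - 5, power_3 - 5
    # Every other locus/software combination is left untouched.
    return power_5, power_3


# Illustrative offsets only; not the real lusSTR configuration values.
d16 = flank_offsets("D16S539", "genemarker", 10, 20)
d8 = flank_offsets("D8S1179", "genemarker", 10, 20)
other = flank_offsets("D8S1179", "straitrazor", 10, 20)
print(d16, d8, other)
```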

rnmitchell · 2023-12-26T16:59:15Z

lusSTR/wrappers/convert.py

@@ -57,7 +57,7 @@ def format_table(input, uas=False, kit="forenseq"):
            locus = "PENTA D"
        if locus == "PENTAE" or locus == "PENTA_E":
            locus = "PENTA E"
-        if locus == "DYS385A-B" or locus == "DYS385":
+        if locus == "DYS385A/B" or locus == "DYS385":


GeneMarker reports as DYS385A/B

Does nothing else still use the A-B formatting?

The UAS does; lusSTR just converts the locus to DYS385A-B for non-UAS data.
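
One way to picture the renaming discussed in this thread is a lookup table of aliases. This is only a sketch: `LOCUS_ALIASES` and `normalize_locus` are hypothetical names, and the `PENTAD`/`PENTA_D` entries are assumed by analogy with the `PENTAE`/`PENTA_E` line in the diff. `format_table` itself uses a chain of `if` statements.

```python
# Hypothetical alias table for non-UAS input; not the actual lusSTR code.
LOCUS_ALIASES = {
    "PENTAD": "PENTA D",   # assumed by analogy with PENTA E
    "PENTA_D": "PENTA D",  # assumed by analogy with PENTA E
    "PENTAE": "PENTA E",
    "PENTA_E": "PENTA E",
    "DYS385A/B": "DYS385A-B",  # GeneMarker spelling
    "DYS385": "DYS385A-B",
}


def normalize_locus(locus):
    """Map a software-specific locus name to the name used downstream."""
    return LOCUS_ALIASES.get(locus, locus)


print(normalize_locus("DYS385A/B"))
print(normalize_locus("D8S1179"))
```

Names that are already in the downstream form, such as `DYS385A-B`, pass through unchanged.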

standage · 2023-12-29T15:58:05Z

lusSTR/scripts/marker.py

-    def __init__(self, locus, sequence, uas=False, kit="forenseq"):
+    def __init__(self, locus, sequence, software, kit="forenseq"):
        self.locus = locus
        self.sequence = sequence
        if locus not in str_marker_data:
            raise InvalidLocusError(locus)
        self.data = str_marker_data[locus]
-        self.uas = uas
+        self.software = software


Error handling: probably want to check that the software argument has an expected value here.
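
A rough sketch of what such a check could look like. The set of accepted values and the `UnsupportedSoftwareError` name are assumptions for illustration; only `"genemarker"` appears as a literal in the diffs on this PR.

```python
# Assumed set of accepted values; only "genemarker" is confirmed by the diffs.
SUPPORTED_SOFTWARE = ("uas", "straitrazor", "genemarker")


class UnsupportedSoftwareError(ValueError):
    """Hypothetical exception for an unrecognized software name."""


def check_software(software):
    """Return the lower-cased software name, or raise if it is not recognized."""
    normalized = str(software).lower()
    if normalized not in SUPPORTED_SOFTWARE:
        raise UnsupportedSoftwareError(
            f"unsupported software '{software}'; expected one of {SUPPORTED_SOFTWARE}"
        )
    return normalized


print(check_software("GeneMarker"))
try:
    check_software("excel")
except UnsupportedSoftwareError as err:
    print(err)
```

Calling something like this at the top of `__init__` would surface a typo in the argument immediately rather than letting it fall through to the default branch later.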

standage · 2023-12-29T16:15:56Z

lusSTR/scripts/marker.py

            for m in re.finditer("GGGCTGCCTA", self.uas_sequence):
                break_point = m.end()
+            try:
+                break_point
+            except NameError:
+                for m in re.finditer("TTTT", self.uas_sequence):
+                    break_point = m.end() + 10


This use of the try/except mechanism is pretty unconventional. A much more common (and IMHO, clearer) pattern would be like this.

    for ...:
        # found it
        break
    else:
        # handle the no break case

In concrete terms:

    for m in re.finditer("GGGCTGCCTA", self.uas_sequence):
        break_point = m.end()
        break
    else:
        for m in re.finditer("TTTT", self.uas_sequence):
            break_point = m.end() + 10

But I think we could probably improve this code even more for clarity of intent. You might consider something like this.

    if "GGGCTGCCTA" in self.uas_sequence:
        break_point = self.uas_sequence.index("GGGCTGCCTA") + 10
    else:
        break_point = self.uas_sequence.index("TTTT") + 14

I'm not 100% sure about those 10 and 14 offsets, but I hope you get the idea.

Makes sense!
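
The 10 and 14 offsets can be sanity-checked with a small standalone script. This is a sketch under assumptions: the function names and test sequences are made up, and the `for`/`else` variant here breaks on the first `TTTT` match so that it is directly comparable with `str.index`.

```python
import re

MOTIF = "GGGCTGCCTA"


def break_point_finditer(sequence):
    """for/else form: end of the first motif match, else first run of four Ts."""
    for m in re.finditer(MOTIF, sequence):
        break_point = m.end()
        break
    else:
        for m in re.finditer("TTTT", sequence):
            break_point = m.end() + 10
            break
        else:
            raise ValueError("neither anchor sequence was found")
    return break_point


def break_point_index(sequence):
    """str.index form: start of the match plus a fixed offset."""
    if MOTIF in sequence:
        return sequence.index(MOTIF) + len(MOTIF)  # len(MOTIF) == 10
    return sequence.index("TTTT") + 14  # 4 (length of TTTT) + 10


# Made-up sequences for illustration only.
intact = "ACGT" + MOTIF + "AAAA"
with_snp = "AC" + "TTTT" + "GGGCTGCGTA" + "AC"  # one base changed in the motif

for seq in (intact, with_snp):
    print(break_point_finditer(seq), break_point_index(seq))
```

Since `m.end()` equals the match start plus the match length, `index(...) + 10` and `index(...) + 14` agree with the `finditer` form for the first match. A loop without `break` keeps the last non-overlapping match instead, so the two forms can differ when a sequence contains more than one `TTTT` run; the thread does not say which behavior is intended.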

standage · 2023-12-29T16:18:37Z

lusSTR/wrappers/convert.py

@@ -57,7 +57,7 @@ def format_table(input, uas=False, kit="forenseq"):
            locus = "PENTA D"
        if locus == "PENTAE" or locus == "PENTA_E":
            locus = "PENTA E"
-        if locus == "DYS385A-B" or locus == "DYS385":
+        if locus == "DYS385A/B" or locus == "DYS385":


Does nothing else still use the A-B formatting?

standage · 2023-12-29T18:22:38Z

lusSTR/scripts/marker.py

+            # for m in re.finditer("GGGCTGCCTA", self.uas_sequence):
+            #    break_point = m.end()
+            # try:
+            #    break_point
+            # except NameError:
+            #    for m in re.finditer("TTTT", self.uas_sequence):
+            #        break_point = m.end() + 10


Sorry to be a stickler but this should be cleaned up before merging.

rnmitchell · 2023-12-31T19:20:51Z

lusSTR/tests/test_suite.py

+    for ext in [".csv", ".txt", "_flanks.txt", "_sexloci.csv", "_sexloci_flanks.txt"]:
+        exp_output = data_file(f"genemarker/genemarker_test{ext}")
+        print(exp_output)
+        obs_output = str(tmp_path / f"genemarker_test{ext}")
+        assert filecmp.cmp(exp_output, obs_output) is True


I know this isn't usually how we test multiple files, but I didn't want to use parametrize because then it'll run lusSTR over and over again... and it takes up extra time when I don't need it to run multiple times.

I think this is the right approach: the loop is clear, and parametrize wouldn't really be appropriate here for the reason you indicate.
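
The pattern under discussion (run the tool once, then compare every output file in a loop) can be sketched in isolation as below. `compare_outputs` is a hypothetical helper and the demo files are made up; the actual test asserts inside the loop instead of collecting mismatches.

```python
import filecmp
import tempfile
from pathlib import Path


def compare_outputs(expected_dir, observed_dir, prefix, extensions):
    """Return the extensions whose expected and observed files differ."""
    mismatches = []
    for ext in extensions:
        expected = str(Path(expected_dir) / f"{prefix}{ext}")
        observed = str(Path(observed_dir) / f"{prefix}{ext}")
        # shallow=False compares file contents rather than only os.stat() data.
        if not filecmp.cmp(expected, observed, shallow=False):
            mismatches.append(ext)
    return mismatches


# Demo with throwaway files: the .csv pair matches, the .txt pair does not.
with tempfile.TemporaryDirectory() as exp_dir, tempfile.TemporaryDirectory() as obs_dir:
    demo_files = [(".csv", "a,b\n", "a,b\n"), (".txt", "x\n", "y\n")]
    for ext, exp_text, obs_text in demo_files:
        (Path(exp_dir) / f"demo{ext}").write_text(exp_text)
        (Path(obs_dir) / f"demo{ext}").write_text(obs_text)
    mismatches = compare_outputs(exp_dir, obs_dir, "demo", [".csv", ".txt"])

print(mismatches)
```

Collecting the mismatches and asserting once at the end reports every differing file in a single run, at the cost of a slightly less direct assertion.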

rnmitchell · 2023-12-31T19:22:35Z

lusSTR/wrappers/filter.py

-        final_df = final_df.append(filtered_df)
-        flags_df = flags_df.append(flags(filtered_df, datatype))
+        final_df = pd.concat([final_df, filtered_df])
+        flags_df = pd.concat([flags_df, flags(filtered_df, datatype)])


Updated this to stop the annoying Pandas warning messages.
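
For context, `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so `pd.concat` is the replacement going forward. A small sketch of a closely related idiom, with made-up column names and values: collect the per-iteration frames in a list and concatenate once at the end.

```python
import pandas as pd

# Made-up frames standing in for the per-locus filtered_df values.
frames = [
    pd.DataFrame({"Locus": ["D8S1179"], "Reads": [120]}),
    pd.DataFrame({"Locus": ["D16S539"], "Reads": [95]}),
]

# One concat over the whole list; ignore_index renumbers the rows 0..n-1.
final_df = pd.concat(frames, ignore_index=True)
print(final_df)
```

Concatenating inside the loop, as in the diff, produces the same rows; building the list first simply avoids copying the accumulated frame on every iteration.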

rnmitchell · 2023-12-31T19:23:22Z

This is ready for review now @standage

standage · 2024-01-02T19:44:57Z

lusSTR/tests/test_suite.py

+    for ext in [".csv", ".txt", "_flanks.txt", "_sexloci.csv", "_sexloci_flanks.txt"]:
+        exp_output = data_file(f"genemarker/genemarker_test{ext}")
+        print(exp_output)
+        obs_output = str(tmp_path / f"genemarker_test{ext}")
+        assert filecmp.cmp(exp_output, obs_output) is True


I think this is the right approach: the loop is clear, and parametrize wouldn't really be appropriate here for the reason you indicate.

rnmitchell added 6 commits December 19, 2023 05:35

updated config [skip ci]

9548b12

added genemarker files to format wrapper script [skip ci]

926b7f8

began changing convert script [skip ci]

572033a

updated convert workflow with genemarker [skip ci]

c6e391b

update marker.py [skip ci]

a82a01b

remove print statement [skip ci]

cbcac35

rnmitchell commented Dec 26, 2023

rnmitchell added 3 commits December 26, 2023 10:51

updated snp workflows to new config [skip ci]

44c6253

updated tests [skip ci]

b82420e

fixed default config [skip ci]

09443cf

rnmitchell commented Dec 26, 2023

cleaning up debugging statements [skip ci]

ef47be6

rnmitchell commented Dec 27, 2023

standage reviewed Dec 29, 2023

standage mentioned this pull request Dec 29, 2023

Update STRait Razor STR loci config settings #68

Merged

rnmitchell added 2 commits December 29, 2023 11:32

merge master

77ea84b

added software check and updated DYS448 code

457b87a

standage reviewed Dec 29, 2023

standage marked this pull request as ready for review December 29, 2023 18:22

fixed marker.py

57c0647

standage approved these changes Dec 29, 2023

standage marked this pull request as draft December 29, 2023 18:37

rnmitchell added 2 commits December 29, 2023 16:18

changed append to concat

5127e82

added test for genemarker files

719d82c

rnmitchell commented Dec 31, 2023

rnmitchell marked this pull request as ready for review December 31, 2023 19:23

rnmitchell added 2 commits December 31, 2023 14:38

updated readme

5c40923

cleaned up test

a533aa2

standage approved these changes Jan 2, 2024

standage merged commit 998bc86 into master Jan 2, 2024
2 checks passed

standage deleted the genemarker branch January 2, 2024 19:46
