-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GeneMarker files as possible input #67
Conversation
lusSTR/scripts/marker.py
Outdated
for m in re.finditer("GGGCTGCCTA", self.uas_sequence): | ||
print(m) | ||
break_point = m.end() | ||
try: | ||
break_point | ||
except NameError: | ||
for m in re.finditer("TTTT", self.uas_sequence): | ||
break_point = m.end() + 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While running some STRait Razor data through lusSTR, I ran into an error that it couldn't find GGGCTGCCTA
within the sequence (there was a SNP within it) and therefore couldn't identify the break point. I'm surprised I haven't run into this error before. This identifies a run of Ts before that sequence. This allowed the entire STRait Razor file to run.
if self.locus == "D16S539" and self.software == "genemarker": | ||
return self.data["Power_5"], (self.data["Power_3"] - 3) | ||
elif self.locus == "D8S1179" and self.software == "genemarker": | ||
return (self.data["Power_5"] - 5), (self.data["Power_3"] - 5) | ||
else: | ||
return self.data["Power_5"], self.data["Power_3"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the GeneMarker software, only two loci (D16 and D8) produced different sequences (STRait Razor contains a few additional bases). This accounts for these differences.
@@ -57,7 +57,7 @@ def format_table(input, uas=False, kit="forenseq"): | |||
locus = "PENTA D" | |||
if locus == "PENTAE" or locus == "PENTA_E": | |||
locus = "PENTA E" | |||
if locus == "DYS385A-B" or locus == "DYS385": | |||
if locus == "DYS385A/B" or locus == "DYS385": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GeneMarker reports as DYS385A/B
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does nothing else still use the A-B
formatting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The UAS does- lusSTR just converts the locus to DYS385A-B
for non-UAS data.
lusSTR/scripts/marker.py
Outdated
def __init__(self, locus, sequence, uas=False, kit="forenseq"): | ||
def __init__(self, locus, sequence, software, kit="forenseq"): | ||
self.locus = locus | ||
self.sequence = sequence | ||
if locus not in str_marker_data: | ||
raise InvalidLocusError(locus) | ||
self.data = str_marker_data[locus] | ||
self.uas = uas | ||
self.software = software |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error handling: probably want to check that the software
argument has an expected value here.
lusSTR/scripts/marker.py
Outdated
for m in re.finditer("GGGCTGCCTA", self.uas_sequence): | ||
break_point = m.end() | ||
try: | ||
break_point | ||
except NameError: | ||
for m in re.finditer("TTTT", self.uas_sequence): | ||
break_point = m.end() + 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This use of the try/except mechanism is pretty unconventional. A much more common (and IMHO, clearer) pattern would be like this.
for ...:
# found it
break
else:
# handle the no break case
In concrete terms:
for m in re.finditer("GGGCTGCCTA", self.uas_sequence):
break_point = m.end()
break
else:
for m in re.finditer("TTTT", self.uas_sequence):
break_point = m.end() + 10
But I think we could probably improve this code even more for clarity of intent. You might consider something like this.
if "GGGCTGCCTA" in self.uas_sequence:
break_point = self.uas_sequence.index("GGGCTGCCTA") + 10
else:
break_point = self.uas_sequence.index("TTTT") + 14
I'm not 100% sure about those 10 and 14 offsets, but I hope you get the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense!
@@ -57,7 +57,7 @@ def format_table(input, uas=False, kit="forenseq"): | |||
locus = "PENTA D" | |||
if locus == "PENTAE" or locus == "PENTA_E": | |||
locus = "PENTA E" | |||
if locus == "DYS385A-B" or locus == "DYS385": | |||
if locus == "DYS385A/B" or locus == "DYS385": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does nothing else still use the A-B
formatting?
lusSTR/scripts/marker.py
Outdated
# for m in re.finditer("GGGCTGCCTA", self.uas_sequence): | ||
# break_point = m.end() | ||
# try: | ||
# break_point | ||
# except NameError: | ||
# for m in re.finditer("TTTT", self.uas_sequence): | ||
# break_point = m.end() + 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to be a stickler but this should be cleaned up before merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GAH yes
for ext in [".csv", ".txt", "_flanks.txt", "_sexloci.csv", "_sexloci_flanks.txt"]: | ||
exp_output = data_file(f"genemarker/genemarker_test{ext}") | ||
print(exp_output) | ||
obs_output = str(tmp_path / f"genemarker_test{ext}") | ||
assert filecmp.cmp(exp_output, obs_output) is True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this isn't usually how we test multiple files, but I didn't want to use parametrize
because then it'll run lusSTR over and over again... and it takes up extra time when I don't need it to run multiple times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the right approach: the loop is clear, and parametrize wouldn't really be appropriate here for the reason you indicate.
final_df = final_df.append(filtered_df) | ||
flags_df = flags_df.append(flags(filtered_df, datatype)) | ||
final_df = pd.concat([final_df, filtered_df]) | ||
flags_df = pd.concat([flags_df, flags(filtered_df, datatype)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this to stop the annoying Pandas warning messages.
This is ready for review now @standage |
for ext in [".csv", ".txt", "_flanks.txt", "_sexloci.csv", "_sexloci_flanks.txt"]: | ||
exp_output = data_file(f"genemarker/genemarker_test{ext}") | ||
print(exp_output) | ||
obs_output = str(tmp_path / f"genemarker_test{ext}") | ||
assert filecmp.cmp(exp_output, obs_output) is True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the right approach: the loop is clear, and parametrize wouldn't really be appropriate here for the reason you indicate.
This PR will allow for using GeneMarker files as input into lusSTR. All but two (D16 and D8) loci sequence lengths match those produced by STRait Razor; lusSTR will ensure these loci are treated appropriately.