Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GeneMarker files as possible input #67

Merged
merged 17 commits into from
Jan 2, 2024
Merged

Add GeneMarker files as possible input #67

merged 17 commits into from
Jan 2, 2024

Conversation

rnmitchell
Copy link
Contributor

This PR will allow for using GeneMarker files as input into lusSTR. All but two (D16 and D8) loci sequence lengths match those produced by STRait Razor; lusSTR will ensure these loci are treated appropriately.

Comment on lines 383 to 390
for m in re.finditer("GGGCTGCCTA", self.uas_sequence):
print(m)
break_point = m.end()
try:
break_point
except NameError:
for m in re.finditer("TTTT", self.uas_sequence):
break_point = m.end() + 10
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While running some STRait Razor data through lusSTR, I ran into an error that it couldn't find GGGCTGCCTA within the sequence (there was a SNP within it) and therefore couldn't identify the break point. I'm surprised I haven't run into this error before. This identifies a run of Ts before that sequence. This allowed the entire STRait Razor file to run.

Comment on lines +77 to +82
if self.locus == "D16S539" and self.software == "genemarker":
return self.data["Power_5"], (self.data["Power_3"] - 3)
elif self.locus == "D8S1179" and self.software == "genemarker":
return (self.data["Power_5"] - 5), (self.data["Power_3"] - 5)
else:
return self.data["Power_5"], self.data["Power_3"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the GeneMarker software, only two loci (D16 and D8) produced different sequences (STRait Razor contains a few additional bases). This accounts for these differences.

@@ -57,7 +57,7 @@ def format_table(input, uas=False, kit="forenseq"):
locus = "PENTA D"
if locus == "PENTAE" or locus == "PENTA_E":
locus = "PENTA E"
if locus == "DYS385A-B" or locus == "DYS385":
if locus == "DYS385A/B" or locus == "DYS385":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GeneMarker reports as DYS385A/B

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does nothing else still use the A-B formatting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UAS does- lusSTR just converts the locus to DYS385A-B for non-UAS data.

Comment on lines 44 to 50
def __init__(self, locus, sequence, uas=False, kit="forenseq"):
def __init__(self, locus, sequence, software, kit="forenseq"):
self.locus = locus
self.sequence = sequence
if locus not in str_marker_data:
raise InvalidLocusError(locus)
self.data = str_marker_data[locus]
self.uas = uas
self.software = software
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error handling: probably want to check that the software argument has an expected value here.

Comment on lines 383 to 389
for m in re.finditer("GGGCTGCCTA", self.uas_sequence):
break_point = m.end()
try:
break_point
except NameError:
for m in re.finditer("TTTT", self.uas_sequence):
break_point = m.end() + 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This use of the try/except mechanism is pretty unconventional. A much more common (and IMHO, clearer) pattern would be like this.

for ...:
    # found it
    break
else:
    # handle the no break case

In concrete terms:

for m in re.finditer("GGGCTGCCTA", self.uas_sequence):
    break_point = m.end()
    break
else:
    for m in re.finditer("TTTT", self.uas_sequence):
        break_point = m.end() + 10

But I think we could probably improve this code even more for clarity of intent. You might consider something like this.

if "GGGCTGCCTA" in self.uas_sequence:
    break_point = self.uas_sequence.index("GGGCTGCCTA") + 10
else:
    break_point = self.uas_sequence.index("TTTT") + 14

I'm not 100% sure about those 10 and 14 offsets, but I hope you get the idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense!

@@ -57,7 +57,7 @@ def format_table(input, uas=False, kit="forenseq"):
locus = "PENTA D"
if locus == "PENTAE" or locus == "PENTA_E":
locus = "PENTA E"
if locus == "DYS385A-B" or locus == "DYS385":
if locus == "DYS385A/B" or locus == "DYS385":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does nothing else still use the A-B formatting?

Comment on lines 393 to 399
# for m in re.finditer("GGGCTGCCTA", self.uas_sequence):
# break_point = m.end()
# try:
# break_point
# except NameError:
# for m in re.finditer("TTTT", self.uas_sequence):
# break_point = m.end() + 10
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to be a stickler but this should be cleaned up before merging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GAH yes

@standage standage marked this pull request as ready for review December 29, 2023 18:22
@standage standage marked this pull request as draft December 29, 2023 18:37
Comment on lines +366 to +370
for ext in [".csv", ".txt", "_flanks.txt", "_sexloci.csv", "_sexloci_flanks.txt"]:
exp_output = data_file(f"genemarker/genemarker_test{ext}")
print(exp_output)
obs_output = str(tmp_path / f"genemarker_test{ext}")
assert filecmp.cmp(exp_output, obs_output) is True
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this isn't usually how we test multiple files, but I didn't want to use parametrize because then it'll run lusSTR over and over again... and it takes up extra time when I don't need it to run multiple times.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right approach: the loop is clear, and parametrize wouldn't really be appropriate here for the reason you indicate.

Comment on lines -114 to +115
final_df = final_df.append(filtered_df)
flags_df = flags_df.append(flags(filtered_df, datatype))
final_df = pd.concat([final_df, filtered_df])
flags_df = pd.concat([flags_df, flags(filtered_df, datatype)])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this to stop the annoying Pandas warning messages.

@rnmitchell
Copy link
Contributor Author

This is ready for review now @standage

@rnmitchell rnmitchell marked this pull request as ready for review December 31, 2023 19:23
Comment on lines +366 to +370
for ext in [".csv", ".txt", "_flanks.txt", "_sexloci.csv", "_sexloci_flanks.txt"]:
exp_output = data_file(f"genemarker/genemarker_test{ext}")
print(exp_output)
obs_output = str(tmp_path / f"genemarker_test{ext}")
assert filecmp.cmp(exp_output, obs_output) is True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right approach: the loop is clear, and parametrize wouldn't really be appropriate here for the reason you indicate.

@standage standage merged commit 998bc86 into master Jan 2, 2024
2 checks passed
@standage standage deleted the genemarker branch January 2, 2024 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants