Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move marker plot generation to filter steb for STRs #69

Merged
merged 10 commits into from
Jan 12, 2024
Merged

Conversation

rnmitchell
Copy link
Contributor

@rnmitchell rnmitchell commented Jan 10, 2024

Moving the marker plot generation code to the filter step in order to create a third set of plots containing only the true alleles.

Marker plots containing all alleles will now be classified as either BelowAT, RealTyped, or Stutter by color.

Comment on lines -391 to +402
else:
bracketed_form = (
f"{collapse_repeats_by_length(self.uas_sequence[:break_point], 4)} "
f"{collapse_repeats_by_length(self.uas_sequence[break_point:], 4)}"
)
elif "TTTT" in self.uas_sequence:
break_point = self.uas_sequence.index("TTTT") + 14
bracketed_form = (
f"{collapse_repeats_by_length(self.uas_sequence[:break_point], 4)} "
f"{collapse_repeats_by_length(self.uas_sequence[break_point:], 4)}"
)
bracketed_form = (
f"{collapse_repeats_by_length(self.uas_sequence[:break_point], 4)} "
f"{collapse_repeats_by_length(self.uas_sequence[break_point:], 4)}"
)
else:
bracketed_form = ""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once again ran into an issue with this marker- this time, the sequence seemed to be off by 10s of bases on one side (e.g. the sequence started ~50 bases early on the 5' side thus not containing the entire repeat region and was missing the GGGCTGCCTA and the TTTT sequences the code originally searched for). Now the code just basically drops these sequences- they are garbage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebecca emphatically declares that these sequences cannot be redeemed.

Comment on lines 166 to 175
if bf != "":
for block in bf.split(" "):
if block == repeat:
if 1 > longest:
longest = 1
match = re.match(r"\[" + repeat + r"\](\d+)", block)
if match:
length = int(match.group(1))
if length > longest:
longest = length
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is dropping these garbage sequences.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a simple and straightforward change, but I have a suggestion for improvement: instead of indenting the code in a conditional block, short-circuit the inverse condition.

if bf == "":
    return 0
for block in bf.split(" "):
   ### blah blah blah

This has a few benefits. First of all, you see what happens right away if bf is empty: in the current code, you have to read all the way to the end of the block to find out that...nothing happens. Second, it keeps the nesting and indentation to a minimum: this is not only good for legibility, it also (at least in this case) makes for cleaner diffs in code review. My suggestion above would result in a simple two-line diff: the diff for the current code looks like this.

Screenshot 2024-01-11 at 9 39 33 PM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense! I'll update.

@rnmitchell
Copy link
Contributor Author

straitrazor_B9B_S9_L001_R1_001_straitrazor_marker_plots.pdf
Example of the new marker plots

@rnmitchell rnmitchell marked this pull request as ready for review January 11, 2024 14:57
@rnmitchell rnmitchell requested a review from standage January 11, 2024 14:57
Comment on lines -534 to -535
if datatype == "lusplus":
final_df = final_df.drop("CE_Allele", axis=1)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran into a problem with this code because the marker plots use the CE allele for plotting- removing it so it can actually create the plots.

@rnmitchell
Copy link
Contributor Author

Ready for your review @standage

Comment on lines -106 to 105
if locus == "D12S391" and kit == "powerseq":
if locus == "D12S391" and kit == "powerseq" and software == "straitrazor":
if "." in str(marker.canonical):
check_sr += 1
if check_sr > 10:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking through the code I realized if I was going to be running a check on strait razor data I should make sure that strait razor was actually used (and not gene marker).

@rnmitchell
Copy link
Contributor Author

Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall! I have one suggestion for improvement.

Comment on lines 166 to 175
if bf != "":
for block in bf.split(" "):
if block == repeat:
if 1 > longest:
longest = 1
match = re.match(r"\[" + repeat + r"\](\d+)", block)
if match:
length = int(match.group(1))
if length > longest:
longest = length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a simple and straightforward change, but I have a suggestion for improvement: instead of indenting the code in a conditional block, short-circuit the inverse condition.

if bf == "":
    return 0
for block in bf.split(" "):
   ### blah blah blah

This has a few benefits. First of all, you see what happens right away if bf is empty: in the current code, you have to read all the way to the end of the block to find out that...nothing happens. Second, it keeps the nesting and indentation to a minimum: this is not only good for legibility, it also (at least in this case) makes for cleaner diffs in code review. My suggestion above would result in a simple two-line diff: the diff for the current code looks like this.

Screenshot 2024-01-11 at 9 39 33 PM

Comment on lines -391 to +402
else:
bracketed_form = (
f"{collapse_repeats_by_length(self.uas_sequence[:break_point], 4)} "
f"{collapse_repeats_by_length(self.uas_sequence[break_point:], 4)}"
)
elif "TTTT" in self.uas_sequence:
break_point = self.uas_sequence.index("TTTT") + 14
bracketed_form = (
f"{collapse_repeats_by_length(self.uas_sequence[:break_point], 4)} "
f"{collapse_repeats_by_length(self.uas_sequence[break_point:], 4)}"
)
bracketed_form = (
f"{collapse_repeats_by_length(self.uas_sequence[:break_point], 4)} "
f"{collapse_repeats_by_length(self.uas_sequence[break_point:], 4)}"
)
else:
bracketed_form = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebecca emphatically declares that these sequences cannot be redeemed.

@standage standage merged commit 03aa9e2 into master Jan 12, 2024
2 checks passed
@standage standage deleted the markerplots branch January 12, 2024 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants