Reimplement ParseGenotypes in GenotypeComplexVariants #614
Conversation
A very belated thanks for this rewrite and its massive scaling benefits in such a short time frame. This will also be great to have in Python, although I must admit it's still pretty difficult to follow (and I imagine even more difficult to write). It would definitely benefit from more comments throughout.
I did leave a bunch of comments and questions, and some of them might prompt minor changes (or could inform places needing further documentation), but this has already been scale-tested and used, so I'm going to go ahead and approve.
    genotype_list_dict[vid] = list()
matching_variant_info = None
for variant_info in genotype_list_dict[vid]:
    if variant_info.chrom == chrom and variant_info.start == start and variant_info.end == end:
It took me a couple of reads to understand what the inputs look like, why you end up with multiple variants per VID here, and how to distinguish between the multiple different intervals/variants/genotypes from the different inputs that are all called variant or genotype in the code. I think that would be helpful to document.
Added a few inline comments here
def get_overlap(chrom1, start1, end1, chrom2, start2, end2):
    if chrom1 != chrom2:
        return 0
    if not (start1 <= end2 and start1 <= end1):
Suggested change:
- if not (start1 <= end2 and start1 <= end1):
+ if not (start1 <= end2 and start2 <= end1):
This is what I expected to see if you're just checking whether the intervals overlap?
I don't think it would impact the overall script results though. If the intervals don't overlap and interval 1 < interval 2 then the result here would just be negative
Good catch, luckily yeah I don't think it's of any consequence.
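For reference, a minimal sketch of the intended overlap check with the reviewer's correction applied (this is an illustrative rewrite based on the snippet above, not the merged code):

```python
def get_overlap(chrom1, start1, end1, chrom2, start2, end2):
    """Return the length of the overlap between two intervals, or 0 if none."""
    if chrom1 != chrom2:
        return 0
    # Two intervals overlap iff each one starts before the other ends
    if not (start1 <= end2 and start2 <= end1):
        return 0
    return min(end1, end2) - max(start1, start2)
```

With the original `start1 <= end1` condition, disjoint intervals where interval 1 precedes interval 2 would fall through to the subtraction and return a negative value, which is the "no consequence" case noted above as long as callers only compare the result against positive thresholds.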
control_cn = [var.genotype_copy_numbers[s] for s in control_samples for var in matching_variants
              if var.genotype_copy_numbers.get(s, None) is not None]
if len(control_cn) == 0:
    # If no valid carriers, assume default
Suggested change:
- # If no valid carriers, assume default
+ # If no valid controls, assume default
# Get valid matching variant genotype records
variant_info_list = genotype_dict[vid]
if end - start < 1000000:
I see that this matches the original script, but why do we require 95% RO for variants <1Mb but one-way 100% overlap for variants >1Mb?
I'm unsure about this
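For context on the distinction being discussed: reciprocal overlap (RO) requires the intersection to cover at least the given fraction of both intervals, whereas one-way overlap constrains only one of them, so a small interval fully contained in a much larger one passes the one-way test but fails RO. A hypothetical sketch (helper names are illustrative, not the script's):

```python
def overlap_len(start1, end1, start2, end2):
    # Length of the intersection of two intervals (0 if disjoint)
    return max(0, min(end1, end2) - max(start1, start2))

def has_reciprocal_overlap(start1, end1, start2, end2, frac):
    # Intersection must cover >= frac of BOTH intervals
    ov = overlap_len(start1, end1, start2, end2)
    return ov >= frac * (end1 - start1) and ov >= frac * (end2 - start2)

def has_one_way_overlap(start1, end1, start2, end2, frac):
    # Intersection must cover >= frac of interval 1 only
    return overlap_len(start1, end1, start2, end2) >= frac * (end1 - start1)
```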
else:
    intervals = ["UNK_" + s for s in intervals_list_dict[query_record.merged_vid]]
for interval_str in intervals:
    tokens = interval_str.replace(":", "\t").replace("_", "\t").replace("-", "\t").split("\t")
Could do this in one line with a regex like ([A-Z]*)_(chr[XY0-9]*):([0-9]*)-([0-9]*)
Yes, but I think the replace() method is clearer and shouldn't make much of a performance difference.
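Both approaches tokenize interval strings of the form shown above; a side-by-side sketch (the sample string and the regex pattern from the review suggestion are illustrative):

```python
import re

interval_str = "UNK_chr2:100-200"  # hypothetical example input

# replace() approach from the script: normalize all delimiters, then split
tokens = interval_str.replace(":", "\t").replace("_", "\t").replace("-", "\t").split("\t")

# Regex alternative suggested in review: one anchored pattern with capture groups
match = re.match(r"([A-Z]*)_(chr[XY0-9]*):([0-9]*)-([0-9]*)", interval_str)
```

One behavioral difference worth noting: the replace() form also splits on any extra underscores or dashes that happen to appear inside a token, while the anchored regex would simply fail to match such input.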
# Default to test for INS-iDEL
def _evaluate_ins_idel(r, records_list, default_sv_type, default_cpx_type):
    if r.sink_chrom is not None:
        # TODO shell script uses bedtools intersect with -v flag (=>"not") that probably isn't supposed to be used
yikes
m.cnv_assessment == "DEL" and not has_reciprocal_overlap(
    r.sink_chrom, r.sink_start, r.sink_end,
    m.chrom, m.start, m.end, 0.95)
for m in records_list
Could just iterate over sinks
Good catch, I rewrote this
new_sv_type="CPX",
new_cpx_type="INS_iDEL",
new_cpx_intervals=f"DEL_{r.sink_string()}",
new_svlen=None + r.source_size(),
Suggested change:
- new_svlen=None + r.source_size(),
+ new_svlen=r.sink_size() + r.source_size(),
?
Fixed. Interesting that this case must have never been hit because this should throw an error.
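A minimal demonstration of why this branch would have failed loudly rather than silently producing bad output (assuming the expression was evaluated eagerly, as in the snippet):

```python
# Python refuses to add None to an int, so new_svlen=None + r.source_size()
# would raise a TypeError the first time this code path was reached
try:
    svlen = None + 100  # mirrors new_svlen=None + r.source_size()
except TypeError:
    svlen = None  # unsupported operand type(s) for +: 'NoneType' and 'int'
```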
    for m in records_list
)
else:
    sink_is_del = False
Seems like both sink_is_del and sink_is_not_del could be true, in which case sink_is_del takes precedence... I know this is just matching the behavior of the old script but is that what we want...?
Yes the logic does seem inconsistent. I'll add a TODO note.
)
if dup_confirmed:
    # If dup confirms, resolve as palindromic inverted duplication
    new_cpx_type = "piDUP_FR" if default_cpx_type == "INVERSION_SINGLE_ENDER_" else "piDUP_RF"
Just a note that I've never seen a piDUP_FR or piDUP_RF in one of our VCFs, and downstream steps like manual review and annotation do not handle them. Might be worth revisiting to understand better
Added a TODO note
Force-pushed from c0509f4 to f63954f.
Commits: Update docker; Fix errors; Update dockers; Fix sort in wdl; Fix alt alleles; Perfect parity; Update dockers; Remove UNRESOLVED filter status; Delete unused scripts.
Force-pushed from f63954f to d7ea802.
A critical optimization to the ParseGenotypes task that reimplements process_posthoc_cpx_depth_regenotyping.py with greatly accelerated computations.

The previous version of the script used many repetitive quadratic (N^2) commands that caused it to grind to a halt on a 98,000-sample call set in problematic regions, with shards taking >1 week (possibly longer) to complete. A call-cached run of the same workflow on chr2 took <24 hr and cost $2.73, requiring <15 GB of memory per shard in ParseGenotypes.

In a few places, I've pointed out possible bugs with #TODOs. In the interest of preserving exact functionality, I did not attempt to address these (they mostly seemed minor), but they should be revisited in the future.

An issue where two UNRESOLVED filter statuses were being applied to some records was also corrected.

The new implementation was tested on the 1KGP reference panel and produced identical output through CleanVcf, with the exception of one record reflecting a rare edge case that previously resulted in a false-positive large CPX CNV (relating to an INS record with END << POS).