Reimplement ParseGenotypes in GenotypeComplexVariants #614
Conversation
A very belated thanks for this rewrite and its massive scaling benefits in such a short time frame. This will also be great to have in Python, although I must admit it's still pretty difficult to follow (and I imagine even more difficult to write). It would definitely benefit from more comments throughout.
I did leave a bunch of comments and questions, and some of them might prompt minor changes (or could inform places needing further documentation), but this has already been scale-tested and used, so I'm going to go ahead and approve.
    genotype_list_dict[vid] = list()
matching_variant_info = None
for variant_info in genotype_list_dict[vid]:
    if variant_info.chrom == chrom and variant_info.start == start and variant_info.end == end:
It took me a couple of reads to understand what the inputs look like, why you end up with multiple variants per VID here, and how to distinguish between the multiple different intervals/variants/genotypes from the different inputs that are all called variant or genotype in the code. I think that would be helpful to document.
Added a few inline comments here
def get_overlap(chrom1, start1, end1, chrom2, start2, end2):
    if chrom1 != chrom2:
        return 0
    if not (start1 <= end2 and start1 <= end1):
Suggested change:
- if not (start1 <= end2 and start1 <= end1):
+ if not (start1 <= end2 and start2 <= end1):
This is what I expected to see if you're just checking whether the intervals overlap?
I don't think it would impact the overall script results though. If the intervals don't overlap and interval 1 < interval 2 then the result here would just be negative
Good catch, luckily yeah I don't think it's of any consequence.
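For reference, a minimal sketch of the intended overlap check with the reviewer's correction applied (this is an illustrative rewrite based on the snippet above, not the merged code):

```python
def get_overlap(chrom1, start1, end1, chrom2, start2, end2):
    """Return the length of the overlap between two intervals, or 0 if none."""
    if chrom1 != chrom2:
        return 0
    # Two intervals overlap iff each one starts before the other ends
    if not (start1 <= end2 and start2 <= end1):
        return 0
    return min(end1, end2) - max(start1, start2)
```

With the original `start1 <= end1` condition, disjoint intervals where interval 1 precedes interval 2 would fall through to the subtraction and return a negative value, which is the "no consequence" case noted above as long as callers only compare the result against positive thresholds.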
control_cn = [var.genotype_copy_numbers[s] for s in control_samples for var in matching_variants
              if var.genotype_copy_numbers.get(s, None) is not None]
if len(control_cn) == 0:
    # If no valid carriers, assume default
Suggested change:
- # If no valid carriers, assume default
+ # If no valid controls, assume default
# Get valid matching variant genotype records
variant_info_list = genotype_dict[vid]
if end - start < 1000000:
I see that this matches the original script, but why do we require 95% RO for variants <1Mb but one-way 100% overlap for variants >1Mb?
I'm unsure about this
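For context on the distinction being discussed: reciprocal overlap (RO) requires the intersection to cover at least the given fraction of both intervals, whereas one-way overlap constrains only one of them, so a small interval fully contained in a much larger one passes the one-way test but fails RO. A hypothetical sketch (helper names are illustrative, not the script's):

```python
def overlap_len(start1, end1, start2, end2):
    # Length of the intersection of two intervals (0 if disjoint)
    return max(0, min(end1, end2) - max(start1, start2))

def has_reciprocal_overlap(start1, end1, start2, end2, frac):
    # Intersection must cover >= frac of BOTH intervals
    ov = overlap_len(start1, end1, start2, end2)
    return ov >= frac * (end1 - start1) and ov >= frac * (end2 - start2)

def has_one_way_overlap(start1, end1, start2, end2, frac):
    # Intersection must cover >= frac of interval 1 only
    return overlap_len(start1, end1, start2, end2) >= frac * (end1 - start1)
```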
else:
    intervals = ["UNK_" + s for s in intervals_list_dict[query_record.merged_vid]]
for interval_str in intervals:
    tokens = interval_str.replace(":", "\t").replace("_", "\t").replace("-", "\t").split("\t")
Could do this in one line with a regex like ([A-Z]*)_(chr[XY0-9]*):([0-9]*)-([0-9]*)
Yes, but I think the replace() method is clearer and shouldn't make much of a performance difference.
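Both approaches tokenize interval strings of the form shown above; a side-by-side sketch (the sample string and the regex pattern from the review suggestion are illustrative):

```python
import re

interval_str = "UNK_chr2:100-200"  # hypothetical example input

# replace() approach from the script: normalize all delimiters, then split
tokens = interval_str.replace(":", "\t").replace("_", "\t").replace("-", "\t").split("\t")

# Regex alternative suggested in review: one anchored pattern with capture groups
match = re.match(r"([A-Z]*)_(chr[XY0-9]*):([0-9]*)-([0-9]*)", interval_str)
```

One behavioral difference worth noting: the replace() form also splits on any extra underscores or dashes that happen to appear inside a token, while the anchored regex would simply fail to match such input.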
# Default to test for INS-iDEL
def _evaluate_ins_idel(r, records_list, default_sv_type, default_cpx_type):
    if r.sink_chrom is not None:
        # TODO shell script uses bedtools intersect with -v flag (=>"not") that probably isn't supposed to be used
yikes
m.cnv_assessment == "DEL" and not has_reciprocal_overlap(
    r.sink_chrom, r.sink_start, r.sink_end,
    m.chrom, m.start, m.end, 0.95)
for m in records_list
Could just iterate over sinks
Good catch, I rewrote this
new_sv_type="CPX",
new_cpx_type="INS_iDEL",
new_cpx_intervals=f"DEL_{r.sink_string()}",
new_svlen=None + r.source_size(),
Suggested change:
- new_svlen=None + r.source_size(),
+ new_svlen=r.sink_size() + r.source_size(),
?
Fixed. Interesting that this case must have never been hit because this should throw an error.
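A minimal demonstration of why this branch would have failed loudly rather than silently producing bad output (assuming the expression was evaluated eagerly, as in the snippet):

```python
# Python refuses to add None to an int, so new_svlen=None + r.source_size()
# would raise a TypeError the first time this code path was reached
try:
    svlen = None + 100  # mirrors new_svlen=None + r.source_size()
except TypeError:
    svlen = None  # unsupported operand type(s) for +: 'NoneType' and 'int'
```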
    for m in records_list
)
else:
    sink_is_del = False
Seems like both sink_is_del and sink_is_not_del could be true, in which case sink_is_del takes precedence... I know this is just matching the behavior of the old script but is that what we want...?
Yes the logic does seem inconsistent. I'll add a TODO note.
)
if dup_confirmed:
    # If dup confirms, resolve as palindromic inverted duplication
    new_cpx_type = "piDUP_FR" if default_cpx_type == "INVERSION_SINGLE_ENDER_" else "piDUP_RF"
Just a note that I've never seen a piDUP_FR or piDUP_RF in one of our VCFs, and downstream steps like manual review and annotation do not handle them. Might be worth revisiting to understand better
Added a TODO note
Force-pushed from c0509f4 to f63954f.
Commits: Update docker; Fix errors; Update dockers; Fix sort in wdl; Fix alt alleles; Perfect parity; Update dockers; Remove UNRESOLVED filter status; Delete unused scripts.
Force-pushed from f63954f to d7ea802.
A critical optimization to the ParseGenotypes task that reimplements process_posthoc_cpx_depth_regenotyping.py with greatly accelerated computations.

The previous version of the script used many repetitive quadratic (N^2) commands that caused it to grind to a halt on a 98,000-sample call set in problematic regions, with shards taking >1 week (possibly longer) to complete. A call-cached run of the same workflow on chr2 took <24 hr and cost $2.73, requiring <15 GB of memory per shard in ParseGenotypes.

In a few places, I've pointed out possible bugs with #TODOs. In the interest of preserving exact functionality, I did not attempt to address these (they mostly seemed minor), but they should be revisited in the future.

An issue where two UNRESOLVED filter statuses were being applied to some records was also corrected.

The new implementation was tested on the 1KGP reference panel and produced identical output through CleanVcf, with the exception of one record reflecting a rare edge case that previously resulted in a false-positive large CPX CNV (relating to an INS record with END << POS).