Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort output from SVClusterEngine and fix no-call genotype ploidy bug in JointGermlineCNVSegmentation #7779

Merged
merged 7 commits into from
May 16, 2022

Conversation

mwalker174
Copy link
Contributor

  • Moved OutputSortingBuffer class used by SVCluster into SVClusterEngine to unify clustering code across tools
  • Fixes an issue where no-call genotypes caused an error in JointGermlineCNVSegmentation

@mwalker174 mwalker174 requested a review from ldgauthier April 13, 2022 15:37
Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple questions and requests for comments, but otherwise LGTM!

.collect(Collectors.toList());
}

public List<T> forceFlush() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is for special cases, like the tests below, and the task end. Is that true? Regardless, please add some javadoc, esp. for differentiating use cases with the above method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok added a comment. This is to be used only by forceFlush() in the engine itself, which is only called when we're certain that none of the currently active clusters can change. This is yes usually when reaching the end of a contig (or file).

//variantContexts should have identical start, so choose 0th arbitrarily
final String variantContig = variantContexts.get(0).getContig();
if (currentContig != null && !variantContig.equals(currentContig)) {
// Since we need to check for variant overlap and reset genotypes, only flush clustering when we hit a new contig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this comment should go in processClusters instead since that's where the logic about contigs is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revisiting this, I noticed how we only call this at the end of contigs anyway so the force parameter is unnecessary (just always force).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also moved comment to a docstring in processClusters()

@@ -682,14 +680,18 @@ private static Genotype prepareGenotype(final Genotype g, final Allele refAllele

private static void correctGenotypePloidy(final GenotypeBuilder builder, final Genotype g, final int ploidy,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I dropped the ball, but can you add some comments here? This is for overlapping events, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. This just modifies the genotypes for input variants so the ploidies are consistent with the ped file and also irons out any no-call/null GTs.


// Check for one record
int expectedRecordsFound = 0;
for (final VariantContext variant : records) {
Assert.assertTrue(variant.hasAttribute(GATKSVVCFConstants.CLUSTER_MEMBER_IDS_KEY));
Assert.assertTrue(variant.hasAttribute(GATKSVVCFConstants.ALGORITHMS_ATTRIBUTE));
if (variant.getID().equals("SVx000001ad")) {
if (variant.getContig().equals("chr20") && variant.getStart() == 28654436) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is better.

@@ -56,7 +56,7 @@ public void testDefragmentation() {
Assert.assertEquals(header.getSampleNamesInOrder(), inputHeader.getSampleNamesInOrder());
Assert.assertEquals(header.getSequenceDictionary().size(), inputHeader.getSequenceDictionary().size());

Assert.assertEquals(records.size(), 338);
Assert.assertEquals(records.size(), 408);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What caused this change?

Copy link
Contributor Author

@mwalker174 mwalker174 May 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, this is a bug in master. Due to the use of TreeSet with a comparator on genomic position, unclustered variants with the same start position were getting thrown out.

@codecov
Copy link

codecov bot commented May 13, 2022

Codecov Report

Merging #7779 (e1c3e19) into master (c6eb337) will increase coverage by 68.302%.
The diff coverage is 83.266%.

@@               Coverage Diff                @@
##              master     #7779        +/-   ##
================================================
+ Coverage     18.644%   86.946%   +68.302%     
- Complexity      4635     36933     +32298     
================================================
  Files           1261      2219       +958     
  Lines          73745    173666     +99921     
  Branches       11768     18754      +6986     
================================================
+ Hits           13749    150995    +137246     
+ Misses         57944     16055     -41889     
- Partials        2052      6616      +4564     
Impacted Files Coverage Δ
...ine/GATKPlugin/GATKAnnotationPluginDescriptor.java 83.851% <ø> (+46.584%) ⬆️
...ine/GATKPlugin/GATKReadFilterPluginDescriptor.java 84.615% <ø> (+32.308%) ⬆️
...nder/cmdline/PicardCommandLineProgramExecutor.java 63.636% <0.000%> (+63.636%) ⬆️
...e/argumentcollections/DbsnpArgumentCollection.java 100.000% <ø> (ø)
...llections/MultiVariantInputArgumentCollection.java 100.000% <ø> (+100.000%) ⬆️
...broadinstitute/hellbender/engine/FeatureCache.java 92.157% <ø> (+19.608%) ⬆️
...oadinstitute/hellbender/engine/FeatureContext.java 81.818% <ø> (+72.727%) ⬆️
...adinstitute/hellbender/engine/ReadsDataSource.java 80.000% <ø> (+21.667%) ⬆️
...stitute/hellbender/engine/ReadsPathDataSource.java 89.855% <ø> (ø)
...dinstitute/hellbender/engine/ReferenceContext.java 85.714% <ø> (+53.530%) ⬆️
... and 2330 more

@mwalker174 mwalker174 merged commit 752ee3c into master May 16, 2022
@mwalker174 mwalker174 deleted the mw_joint_genotype_sort branch May 16, 2022 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants