Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Genotyping code for the Gnarly Pipeline (gnomAD v3) #4947

Merged
merged 13 commits into from
Aug 1, 2019
Merged

Conversation

ldgauthier
Copy link
Contributor

This along with the GVCF reblocking branch constitute the new code I'm using for gnomAD v3 on the Gnarly Pipeline.

Some of the GDB hacks are gross, but I can't clean it up until after the protobuf update.

@codecov-io
Copy link

codecov-io commented Jun 26, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@39a9d13). Click here to learn what that means.
The diff coverage is 6.58%.

@@            Coverage Diff             @@
##             master     #4947   +/-   ##
==========================================
  Coverage          ?   13.338%           
  Complexity        ?      6396           
==========================================
  Files             ?      2016           
  Lines             ?    151745           
  Branches          ?     16269           
==========================================
  Hits              ?     20240           
  Misses            ?    129234           
  Partials          ?      2271
Impacted Files Coverage Δ Complexity Δ
...ute/hellbender/utils/variant/GATKVCFConstants.java 50% <ø> (ø) 2 <0> (?)
...iantutils/PosteriorProbabilitiesUtilsUnitTest.java 4.386% <ø> (ø) 1 <0> (?)
...te/hellbender/utils/variant/writers/TLODBlock.java 0% <ø> (ø) 0 <0> (?)
...notyper/GenotypeCalculationArgumentCollection.java 100% <ø> (ø) 2 <0> (?)
...ender/utils/variant/writers/GVCFBlockCombiner.java 0% <0%> (ø) 0 <0> (?)
...titute/hellbender/tools/walkers/GenotypeGVCFs.java 0% <0%> (ø) 0 <0> (?)
...rs/annotator/allelespecific/AS_StrandBiasTest.java 5.085% <0%> (ø) 2 <0> (?)
...ender/utils/variant/writers/TLODBlockUnitTest.java 4.348% <0%> (ø) 2 <0> (?)
...lkers/annotator/allelespecific/AS_QualByDepth.java 1.351% <0%> (ø) 1 <0> (?)
...ls/variant/writers/GVCFBlockCombiningIterator.java 0% <0%> (ø) 0 <0> (?)
... and 39 more

@ldgauthier
Copy link
Contributor Author

TODO: refactor duplicated VC generation code from PosteriorProbabilitiesUtilsUnitTest in ReblockGVCFUnitTest by extracting to VariantContextTestUtils

@droazen
Copy link
Contributor

droazen commented Aug 24, 2018

@ldgauthier Should this branch be reviewed before the GVCF reblocking PR goes in?

@ldgauthier
Copy link
Contributor Author

They may actually be more independent than two PRs both modifying SNP/indel data usually are. I think there will be VariantContextTestUtils conflicts either way, but the VCF data will probably be the same.

@droazen droazen self-requested a review August 24, 2018 19:28
@droazen droazen self-assigned this Aug 24, 2018
@ldgauthier
Copy link
Contributor Author

@droazen can someone take a look soon? Eric tells me we need to start the CCDG 60K callset next week and I'd rather run that off a release.

@droazen
Copy link
Contributor

droazen commented Nov 5, 2018

@ldgauthier Looks like this branch is in conflict -- can you rebase before we review?

@ldgauthier ldgauthier force-pushed the ldg_getVQSRinput branch 3 times, most recently from b909de3 to 6b7d045 Compare November 8, 2018 16:31
@ldgauthier
Copy link
Contributor Author

@droazen After a gruesome rebase, conflicts are resolved.

@droazen
Copy link
Contributor

droazen commented Nov 14, 2018

@ldgauthier Since I'm about to leave for ~2 weeks, I've recruited @lbergelson to review this PR in my place.

@droazen droazen assigned lbergelson and unassigned droazen Nov 14, 2018
@droazen droazen requested review from lbergelson and removed request for droazen November 14, 2018 20:14
@ldgauthier
Copy link
Contributor Author

Conflicts resolved and David's comments addressed. @lbergelson can you take a look?

@ldgauthier
Copy link
Contributor Author

@lbergelson I fixed the PL wiggle in the failing test. Think you could review before 4.1? Or maybe @droazen will look again? (Getting it in would be nice, but not critical.)

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ldgauthier I'm not quite done looking through this. I just want to checkpoint what I've reviewed so far.

new GenomicsDBOptions());
}

public FeatureDataSource(final FeatureInput<T> featureInput, final int queryLookaheadBases, final Class<? extends Feature> targetFeatureType,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javadoc

@ldgauthier
Copy link
Contributor Author

ldgauthier commented Mar 4, 2019 via email

@droazen
Copy link
Contributor

droazen commented Mar 28, 2019

What's the status of this one @ldgauthier ? Do you need a new reviewer with @lbergelson out?

@ldgauthier
Copy link
Contributor Author

ldgauthier commented Mar 28, 2019 via email

@nalinigans
Copy link
Collaborator

nalinigans commented Mar 29, 2019

Yes please. Where we left it I think Louis was happy, but we wanted to ask Nalini if she had any suggestions to avoid threading the argument for genotypes all the way through the engine.

Not sure we can avoid threading the argument for genotypes, but would using GenomicsDBOptions instead work?

@droazen
Copy link
Contributor

droazen commented Jun 5, 2019

@ldgauthier Let me know once you've had the chance to do that GnarlyGenotyperEngine refactoring discussed above, and I'd be happy to give this a (hopefully) final look.

…rge WGS cohort "Gnarly Pipeline" to finalize annotation values for filtering

Super fast on Hail-derived sites-only input, needs to calculate ACs for GenomicsDB input
Does NOT subset alternate alleles, just drops genotypes for sites with more than 6 alts
Does NOT perform joint discovery, i.e. aggregating small amounts of evidence across samples
Fix ReblockGVCF NPE on no-calls
Some annotation cleanup
New scheme to preserve de novo calling -- more blocks, but drop PLs, MIN_DP, floor GQs for each block
Add AS combine operation and capability for GenomicsDB
@ldgauthier
Copy link
Contributor Author

@droazen a few minor tests I missed, but this should be good to go otherwise.

I did the refactor for Evoquer and I think everything else has been addressed.

@droazen
Copy link
Contributor

droazen commented Jul 26, 2019

@ldgauthier Can you move spanDel.exome.chr20.vcf (as well as any other test files over about 1-2 MB) into a directory under src/test/resources/large so that they'll be stored in git lfs rather than check in to the repo?

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to @ldgauthier with a final round of mostly-minor comments. We should be able to merge once they're addressed and tests pass.

@@ -205,7 +207,8 @@ private void initializeFeatureSources( final int featureQueryLookahead, final Co
// Only create a data source for Feature arguments that were actually specified
if ( featureInput != null ) {
final Class<? extends Feature> featureType = getFeatureTypeForFeatureInputField(featureArgument.getKey());
addToFeatureSources(featureQueryLookahead, featureInput, featureType, cloudPrefetchBuffer, cloudIndexPrefetchBuffer, reference);
addToFeatureSources(featureQueryLookahead, featureInput, featureType, cloudPrefetchBuffer, cloudIndexPrefetchBuffer,
toolInstance instanceof VariantWalker ? ((VariantWalker) toolInstance).getGenomicsDBOptions() : null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably avoid this ugly instanceof check by making GATKTool.getGenomicsDBOptions() return null (or new GenomicsDBOptions(referenceArguments.getReferencePath())) instead of throwing an exception, and then having GATKTool.initializeFeatures() (and its overrides) pass the GenomicsDB options in to the FeatureManager constructor, which can then propagate them down here.

Could be done in a separate PR if you don't want to do it here.

@@ -217,6 +220,13 @@ public void dumpAllFeatureCacheStats() {
}
}

void addToFeatureSources(final int featureQueryLookahead, final FeatureInput<? extends Feature> featureInput,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add javadoc for this overload (can copy from below)

* @return By default, not every GATK tool can read from a GenomicsDB -- child classes can override
*/
protected GenomicsDBOptions getGenomicsDBOptions() {
throw new IllegalArgumentException("This tool does not take a GenomicsDB as a feature input.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have this return either null or a new GenomicsDBOptions(referenceArguments.getReferencePath()) instead of throwing an exception, you could likely get rid of the VariantWalker special-casing in FeatureManager.

}

@SuppressWarnings({"unchecked", "rawtypes"})
public VariantContext finalizeGenotype(final VariantContext variant, final VariantContextBuilder annotationDBBuilder) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in person, modify the method to tolerate a null annotationDBBuilder, and add a one-arg overload of finalizeGenotype() that takes just a VariantContext arg.

@@ -78,7 +78,7 @@
boolean foundData = false;

for( final Genotype g : genotypes ) {
if( g.isNoCall() || ! g.hasAnyAttribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY) ) {
if( ! g.hasAnyAttribute(GATKVCFConstants.STRAND_BIAS_BY_SAMPLE_KEY) ) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't want to skip no-calls here?

*/
@Advanced
@Argument(fullName=HaplotypeCallerArgumentCollection.OUTPUT_BLOCK_LOWER_BOUNDS, doc = "Output the band lower bound for each GQ block regardless of the data it represents", optional = true)
public boolean floorBlocks = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test covering this new HC arg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in ReblockGVCF. Do you want one in HaplotypeCaller?

* @param annotation the annotation to be tested
* @return true if the annotation is expected to have values per-allele
*/
public static boolean isAlleleSpecific(final InfoFieldAnnotation annotation) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a danger that if someone adds a new allele-specific annotation one day, they will not realize that they have to update this method as well. Perhaps we could introduce an empty AlleleSpecificAnnotation marker interface, and have just these 4 classes implement it? That would be much less likely to be missed in the future...

* Output the band lower bound for each GQ block instead of the min GQ -- for better compression
*/
@Advanced
@Argument(fullName=HaplotypeCallerArgumentCollection.OUTPUT_BLOCK_LOWER_BOUNDS, doc = "Output the band lower bound for each GQ block regardless of the data it represents", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to cover this new HC arg with a quick/simple direct test, since the ReblockGVCF arg is separate.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest version looks good overall and tests pass -- I still have a lingering maintenance concern about AnnotationUtils.isAlleleSpecific() that could be resolved by adding an empty marker interface for AS annotations, and I'd still like to see a direct test covering that new HC arg.

@droazen
Copy link
Contributor

droazen commented Aug 1, 2019

After discussing with @ldgauthier, I'm going to approve this PR as-is, and Laura will address the remaining TODOs in a separate PR. For the record, the three remaining issues that need addressing are:

  • Get rid of the instanceof VariantWalker check in FeatureManager by making GATKTool.getGenomicsDBOptions() return null (or new GenomicsDBOptions(referenceArguments.getReferencePath())) instead of throwing an exception, and then having GATKTool.initializeFeatures() (and its overrides) pass the GenomicsDB options in to the FeatureManager constructor, which can then propagate them down here.

  • Add a simple direct integration test for the new --floor-blocks HaplotypeCaller arg

  • Address my maintenance concerns about AnnotationUtils.isAlleleSpecific() by adding an empty marker interface for AS annotations (open to other ideas here if you don't like that one)

@droazen droazen merged commit d0c8739 into master Aug 1, 2019
@droazen droazen deleted the ldg_getVQSRinput branch August 1, 2019 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants