Improve MQ calculation accuracy #4969

Merged
merged 7 commits into from
Oct 9, 2018

Conversation

ldgauthier
Contributor

After the switch to rawMQ for the reducible implementation, we no longer take the median of MQ values across all samples (yay!), but there were some accuracy problems at sites with a lot of uninformative reads (boo!).

I tried to calculate the depth over variant samples previously, but the approximations weren't perfect. Now we keep track of the depth used to find the RMS of MQ at the end of joint calling.
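
To make the bookkeeping concrete, here is a minimal sketch of the idea rather than the GATK implementation: the raw annotation carries the sum of squared MQs together with the depth that produced it, and the final MQ is the root mean square of the two. The class and method names (`RawMqSketch`, `finalizeMq`) are hypothetical, and the read counts are made up for illustration.

```java
final class RawMqSketch {
    /** Root mean square of mapping qualities from a (sum of squares, depth) pair. */
    static double finalizeMq(final long sumSquaredMqs, final long depth) {
        if (depth <= 0) {
            throw new IllegalArgumentException("depth must be positive");
        }
        return Math.sqrt((double) sumSquaredMqs / depth);
    }

    public static void main(final String[] args) {
        // Ten reads with MQ 60 contribute 10 * 60^2 = 36000 to the sum of squares.
        final long sumSquaredMqs = 10L * 60 * 60;
        // Dividing by the depth that produced the sum recovers the per-read MQ.
        System.out.println(finalizeMq(sumSquaredMqs, 10));
        // Dividing by an approximate depth that counts 5 reads not in the sum
        // gives a lower value than the true RMS.
        System.out.println(finalizeMq(sumSquaredMqs, 15));
    }
}
```

The point of the sketch is only that the numerator and denominator need to describe the same set of reads; any depth estimated separately can drift from the sum.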

@ldgauthier ldgauthier requested a review from jamesemery June 29, 2018 18:53
@ldgauthier
Contributor Author

@jamesemery Can you take a look? (This doesn't have to go into the next release.)

@codecov-io

codecov-io commented Jun 29, 2018

Codecov Report

Merging #4969 into master will increase coverage by 0.01%.
The diff coverage is 88.15%.

@@             Coverage Diff              @@
##             master    #4969      +/-   ##
============================================
+ Coverage     86.75%   86.77%   +0.01%     
- Complexity    29834    29883      +49     
============================================
  Files          1828     1831       +3     
  Lines        138115   138452     +337     
  Branches      15227    15249      +22     
============================================
+ Hits         119827   120138     +311     
- Misses        12741    12765      +24     
- Partials       5547     5549       +2
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| ...ute/hellbender/utils/variant/GATKVCFConstants.java | 80% <ø> (ø) | 4 <0> (ø) ⬇️ |
| ...institute/hellbender/engine/FeatureDataSource.java | 76.42% <ø> (-0.08%) | 42 <0> (-7) |
| ...broadinstitute/hellbender/engine/FeatureInput.java | 94.36% <100%> (ø) | 18 <3> (ø) ⬇️ |
| ...e/hellbender/utils/variant/GATKVCFHeaderLines.java | 95.62% <100%> (+0.02%) | 10 <0> (ø) ⬇️ |
| ...ools/walkers/variantutils/ReblockGVCFUnitTest.java | 92.53% <100%> (ø) | 9 <0> (ø) ⬇️ |
| ...lkers/variantutils/ReblockGVCFIntegrationTest.java | 97.82% <100%> (+0.26%) | 8 <1> (+1) ⬆️ |
| ...er/tools/walkers/GenotypeGVCFsIntegrationTest.java | 78.26% <100%> (+2.45%) | 25 <6> (+6) ⬆️ |
| ...bender/tools/walkers/variantutils/ReblockGVCF.java | 81.52% <100%> (+1.17%) | 46 <0> (+3) ⬆️ |
| ...der/tools/walkers/annotator/RMSMappingQuality.java | 83.6% <77.35%> (-10.21%) | 41 <25> (+1) |
| ...der/tools/walkers/CombineGVCFsIntegrationTest.java | 87.44% <83.33%> (+0.15%) | 24 <2> (ø) ⬇️ |
| ... and 19 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 868a32e...49c474d.

Collaborator

@jamesemery jamesemery left a comment

Mostly there are some places that could use clarifying comments. Otherwise the code itself looks reasonable and contained. Are there plans to update AS_RMSMappingQuality? The two annotations are now out of sync with each other with regard to what they do.


@Override
- public String getRawKeyName() { return GATKVCFConstants.RAW_RMS_MAPPING_QUALITY_KEY;}
+ public String getRawKeyName() { return GATKVCFConstants.RAW_MAPPING_QUALITY_WITH_DEPTH_KEY;} //new key for the two-value MQ data to prevent version mismatch catastrophes
Collaborator

Do we want to bump this into a new class? This would mean that we are no longer able to handle the old RMS mapping quality code.

Collaborator

Does that matter to us? I guess since the variant context won't have a key matching one of the discovered reducible annotations' keys, nothing will really happen and the key will be dropped...

Collaborator

Perhaps we could add a check and emit a warning to the user if there is a mismatch between the versions of the rmsMappingQualityKey in their files?

Contributor Author

I added a check in the finalize method in case GenotypeGVCFs gets run on old data. GenomicsDB behavior is largely outside of our control here. I don't have a great idea for CombineGVCFs though. I thought about adding a check against ReducibleAnnotation.getDeprecatedKeyName(), but I feel like having getDeprecatedKeyName() return null for the other annotations and querying the annotationMap for null is an invitation for trouble.
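
As an illustration of the kind of check discussed in this thread (not the code that was merged), the sketch below classifies a record by which raw MQ key it carries so that a caller can warn about old-format data. Everything here is hypothetical: the class, the enum, and the use of a plain `Map` in place of a `VariantContext`. The key string `RAW_MQandDP` is the tag name mentioned later in this review; treating `RAW_MQ` as the old key name is an assumption.

```java
import java.util.Map;

final class RawMqFormatCheckSketch {
    // Assumed key names; the real constants live in GATKVCFConstants.
    static final String OLD_KEY = "RAW_MQ";
    static final String NEW_KEY = "RAW_MQandDP";

    enum RawMqFormat { TUPLE, DEPRECATED_SINGLE_VALUE, ABSENT }

    /** Reports which raw MQ representation, if any, is present in the INFO attributes. */
    static RawMqFormat detect(final Map<String, String> infoAttributes) {
        if (infoAttributes.containsKey(NEW_KEY)) {
            return RawMqFormat.TUPLE;
        }
        if (infoAttributes.containsKey(OLD_KEY)) {
            return RawMqFormat.DEPRECATED_SINGLE_VALUE;
        }
        return RawMqFormat.ABSENT;
    }

    /** Returns a warning for old-format data, or null when no warning is needed. */
    static String warningFor(final RawMqFormat format) {
        return format == RawMqFormat.DEPRECATED_SINGLE_VALUE
                ? "Input uses the old single-value raw MQ annotation; MQ may be less accurate."
                : null;
    }
}
```

Keying the decision on the annotation actually present in the record avoids the `getDeprecatedKeyName()` returning null concern raised above, at the cost of a check that is specific to this one annotation.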

@@ -53,12 +56,12 @@

@Override
public List<VCFInfoHeaderLine> getDescriptions() {
return Arrays.asList(VCFStandardHeaderLines.getInfoLine(getKeyNames().get(0)), GATKVCFHeaderLines.getInfoLine(getRawKeyName()));
Collaborator

heh, yeah this should probably have been changed when I ported that AS_stuff. I remember this was here purely because it broke tests. It looks like you've updated them which is good.

/**
* Created by gauthier on 5/29/18.
*/
public class VariantDepth {
Collaborator

I think you may have included this accidentally

@@ -223,6 +226,13 @@ private static double parseRawDataString(String rawDataString) {
@VisibleForTesting
static int getNumOfReads(final VariantContext vc,
final ReadLikelihoods<Allele> likelihoods) {
if(vc.hasAttribute(GATKVCFConstants.RAW_MAPPING_QUALITY_WITH_DEPTH_KEY)) {
Collaborator

Can you update the comment to reflect that it searches for the count data first?

Contributor Author

I updated the GATK Javadoc for the annotation -- is that what you meant?

}


- public void combineAttributeMap(ReducibleAnnotationData<Double> toAdd, ReducibleAnnotationData<Double> combined) {
+ public void combineAttributeMap(ReducibleAnnotationData<List<Integer>> toAdd, ReducibleAnnotationData<List<Integer>> combined) {
Collaborator

This method could use a comment, as it's a little harder to read now (Allele.NO_CALL).

.mapToDouble(mq -> mq * mq).sum();

rawAnnotations.putAttribute(Allele.NO_CALL, squareSum);
//GATK3.5 had a double, but change this to an int for the tuple representation
Collaborator

Also just a quick comment explaining that this is counting both the squareSum AND the count.

}

@Test (dataProvider = "VCFdata")
public void assertMatchingGenotypesFromGenomicsDB_vidmapHack(File[] inputs, File expected, String interval, List<String> additionalArguments, String reference) throws IOException {
Collaborator

Could you add a comment explaining this test? It doesn't seem related to the rest of the branch.

Contributor Author

I goofed and copied the genotypes comparison when I meant to compare annotations. I fixed the name and made sure MQ is actually being compared. (FYI the vidmap hack is because I need to change the GDB json since I changed the MQ annotation name. I'll update this once the protobuf update goes in.)

@@ -42,9 +42,12 @@
@DocumentedFeature(groupName=HelpConstants.DOC_CAT_ANNOTATORS, groupSummary=HelpConstants.DOC_CAT_ANNOTATORS_SUMMARY, summary="Root mean square of the mapping quality of reads across all samples (MQ)")
public final class RMSMappingQuality extends InfoFieldAnnotation implements StandardAnnotation, ReducibleAnnotation {
private static final RMSMappingQuality instance = new RMSMappingQuality();
public static final int NUM_LIST_ENTRIES = 2;
Collaborator

Could you update the header comments to reflect the new annotation? Just add a line specifying what the count means

@@ -97,6 +100,8 @@
{getTestFile( "testAlleleSpecificAnnotations.CombineGVCF.output.g.vcf"), getTestFile( "testAlleleSpecificAnnotations.CombineGVCF.expected.g.vcf"), Arrays.asList( "-A", "ClippingRankSumTest", "-G", "AS_StandardAnnotation", "-G", "StandardAnnotation"), b37_reference_20_21},
//all sites not supported yet see https://github.com/broadinstitute/gatk-protected/issues/580 and https://github.com/broadinstitute/gatk/issues/2429
//{getTestFile(basePairGVCF), getTestFile( "gvcf.basepairResolution.includeNonVariantSites.gatk3.7_30_ga4f720357.expected.vcf"), Collections.singletonList("--"+GenotypeGVCFs.ALL_SITES_LONG_NAME) //allsites not supported yet
//Test for new RAW_MQandDP annotation format
Collaborator

This test doesn't actually test RAW_MQandDP output, as those attributes are added to the attributes to ignore. Maybe spin this off into its own test to make sure that the assertions made on the annotations are run?

Collaborator

@jamesemery jamesemery left a comment

This looks good. I still have the question of whether this should be ported to AS_MQ to include an AS_RAW_MQandDP tag. Perhaps we should open an issue to implement it at some point in the future?

@@ -118,7 +118,8 @@ private static void addFilterLine(final VCFFilterHeaderLine line) {
addInfoLine(new VCFInfoHeaderLine(LIKELIHOOD_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score from Wilcoxon rank sum test of Alt Vs. Ref haplotype likelihoods"));
addInfoLine(new VCFInfoHeaderLine(MAP_QUAL_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"));
addInfoLine(new VCFInfoHeaderLine(AS_MAP_QUAL_RANK_SUM_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "allele specific Z-score From Wilcoxon rank sum test of each Alt vs. Ref read mapping qualities"));
- addInfoLine(new VCFInfoHeaderLine(RAW_RMS_MAPPING_QUALITY_KEY, 1, VCFHeaderLineType.Float, "Raw data for RMS Mapping Quality"));
+ addInfoLine(new VCFInfoHeaderLine(RAW_RMS_MAPPING_QUALITY_KEY, 2, VCFHeaderLineType.Integer, "Raw data for RMS Mapping Quality"));
Collaborator

Can the vcf description change for the new key to reflect that it is functionally different and incompatible with the old format? Having two files with the same header lines except for a count seems potentially confusing.

@ldgauthier
Contributor Author

The AS_MQ never suffered from this issue because it uses AD for (allele-specific) depth instead of the INFO DP. The sum of the squared MQs there was allocated based on informative reads and the AD represents informative reads, so the data there was always in lock-step.

Collaborator

@jamesemery jamesemery left a comment

Looks good to me now. Unfortunately there are some merge conflicts. Could you rebase onto master? Then I think it is good to merge 👍

@ldgauthier
Contributor Author

I put this into a different branch because I upgraded GDB to fix the weird error. I don't want this feature to go into the 4.0.9.0 release so I'll do a PR of the new branch after.

@ldgauthier
Contributor Author

Provided Travis tests pass, does this still have your 👍 @jamesemery ? The ReblockGVCF tool had a less elegant MQ solution so now I'm having it output both versions so that we don't need to re-reprocess the gnomAD v3 GVCFs.

Collaborator

@jamesemery jamesemery left a comment

These changes look good to me, this still has my 👍

* https://developers.google.com/protocol-buffers/docs/javatutorial#the-protocol-buffer-api
* https://developers.google.com/protocol-buffers/docs/reference/java-generated
*/
public class GenomicsDBUtils {
Collaborator

I approve of this refactor

@ldgauthier ldgauthier merged commit 4dd7ba8 into master Oct 9, 2018
@ldgauthier ldgauthier deleted the ldg_fixMQcalc branch October 9, 2018 17:34
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
Change raw MQ to a tuple of (sumSquaredMQs, totalDepth) for better accuracy where there are lots of uninformative reads or called single-sample variants with homRef genotypes.  Note that incorporating this change into a pipeline will require a concomitant update to this version for GenomicsDBImport and GenotypeGVCFs.
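
A small sketch of how such tuples could be merged across samples, under the assumption that combining is an element-wise sum of (sumSquaredMQs, totalDepth); the class and method names are hypothetical and this is not the merged GATK code. A homRef sample still contributes its reads to both elements, so the final RMS stays consistent with the reads that were counted.

```java
final class RawMqCombineSketch {
    /** Element-wise sum of two (sumSquaredMQs, totalDepth) tuples. */
    static long[] combine(final long[] a, final long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    /** Final MQ: square root of the summed squares divided by the summed depth. */
    static double rms(final long[] tuple) {
        return Math.sqrt((double) tuple[0] / tuple[1]);
    }

    public static void main(final String[] args) {
        // Hypothetical variant sample: 10 reads at MQ 60.
        final long[] variantSample = {10L * 60 * 60, 10L};
        // Hypothetical homRef sample: 5 reads at MQ 20.
        final long[] homRefSample = {5L * 20 * 20, 5L};
        final long[] combined = combine(variantSample, homRefSample);
        System.out.println(combined[0] + "," + combined[1] + " -> MQ " + rms(combined));
    }
}
```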
@indraniel

Just as a note: this change at line 251 of gatk/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/RMSMappingQuality.java, in the function parseRawDataString:

final int squareSum = Integer.parseInt(parsed[SUM_OF_SQUARES_INDEX]);

results in the following type of GVCF/VCF parsing error:

A USER ERROR has occurred: Bad input: malformed RAW_MQ annotation: 3415207168,1749038

when the RAW_MQ first index is greater than Integer.MAX_VALUE.
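
One possible way around the overflow, sketched here as an assumption and not as the fix GATK adopted, is to parse both elements as `long`. The class and method names are hypothetical; the input string is the one from the error message above.

```java
final class RawMqParseSketch {
    /** Parses "sumOfSquares,depth" into a two-element long array. */
    static long[] parseRawMqAndDepth(final String raw) {
        final String[] parsed = raw.split(",");
        if (parsed.length != 2) {
            throw new IllegalArgumentException("malformed raw MQ annotation: " + raw);
        }
        try {
            // long holds values up to 2^63 - 1, so 3415207168 parses without error.
            return new long[] {
                Long.parseLong(parsed[0].trim()),
                Long.parseLong(parsed[1].trim())
            };
        } catch (final NumberFormatException e) {
            throw new IllegalArgumentException("malformed raw MQ annotation: " + raw, e);
        }
    }

    /** True when the text parses as a 32-bit int, as Integer.parseInt requires. */
    static boolean fitsInInt(final String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (final NumberFormatException e) {
            return false;
        }
    }
}
```

Since each read adds up to 60^2 = 3600 to the sum of squares at MQ 60, the sum passes `Integer.MAX_VALUE` after roughly 600,000 such reads, which deep or many-sample sites can reach.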
