Improve MQ calculation accuracy #4969

ldgauthier · 2018-06-29T18:53:36Z

After the switch to rawMQ for the reducible implementation, we no longer take the median of MQ values across all samples (yay!), but there were some accuracy problems at sites with a lot of uninformative reads (boo!).

I tried to calculate the depth over variant samples previously, but the approximations weren't perfect. Now we keep track of the depth used to find the RMS of MQ at the end of joint calling.
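As a rough illustration of the idea (not the GATK code itself), the raw annotation can be thought of as a pair of the sum of squared mapping qualities and the number of reads that contributed to that sum, with the RMS taken only at the end. The class and method names below are hypothetical.

```java
// Hypothetical sketch; names are illustrative and not taken from GATK.
final class RawMqSketch {
    /** Returns {sumOfSquares, depth} for the reads that contribute to MQ. */
    static long[] accumulate(int[] mappingQualities) {
        long sumOfSquares = 0;
        long depth = 0;
        for (int mq : mappingQualities) {
            sumOfSquares += (long) mq * mq; // widen before multiplying
            depth++;
        }
        return new long[] {sumOfSquares, depth};
    }

    /** RMS = sqrt(sumOfSquares / depth); NaN when no reads contributed. */
    static double finalizeRms(long sumOfSquares, long depth) {
        return depth == 0 ? Double.NaN : Math.sqrt((double) sumOfSquares / depth);
    }
}
```

For example, two reads with MQ 60 give the pair (7200, 2) and an RMS of 60. Dividing the same 7200 by a depth of 4 taken from a different read set would give about 42.4, which is the kind of mismatch that carrying the depth alongside the sum is meant to avoid.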

ldgauthier · 2018-06-29T18:54:20Z

@jamesemery Can you take a look? (This doesn't have to go into the next release.)

codecov-io · 2018-06-29T19:52:25Z

Codecov Report

Merging #4969 into master will increase coverage by 0.01%.
The diff coverage is 88.15%.

@@             Coverage Diff              @@
##             master    #4969      +/-   ##
============================================
+ Coverage     86.75%   86.77%   +0.01%     
- Complexity    29834    29883      +49     
============================================
  Files          1828     1831       +3     
  Lines        138115   138452     +337     
  Branches      15227    15249      +22     
============================================
+ Hits         119827   120138     +311     
- Misses        12741    12765      +24     
- Partials       5547     5549       +2

| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...ute/hellbender/utils/variant/GATKVCFConstants.java | `80% <ø> (ø)` | `4 <0> (ø)` | ⬇️ |
| ...institute/hellbender/engine/FeatureDataSource.java | `76.42% <ø> (-0.08%)` | `42 <0> (-7)` | |
| ...broadinstitute/hellbender/engine/FeatureInput.java | `94.36% <100%> (ø)` | `18 <3> (ø)` | ⬇️ |
| ...e/hellbender/utils/variant/GATKVCFHeaderLines.java | `95.62% <100%> (+0.02%)` | `10 <0> (ø)` | ⬇️ |
| ...ools/walkers/variantutils/ReblockGVCFUnitTest.java | `92.53% <100%> (ø)` | `9 <0> (ø)` | ⬇️ |
| ...lkers/variantutils/ReblockGVCFIntegrationTest.java | `97.82% <100%> (+0.26%)` | `8 <1> (+1)` | ⬆️ |
| ...er/tools/walkers/GenotypeGVCFsIntegrationTest.java | `78.26% <100%> (+2.45%)` | `25 <6> (+6)` | ⬆️ |
| ...bender/tools/walkers/variantutils/ReblockGVCF.java | `81.52% <100%> (+1.17%)` | `46 <0> (+3)` | ⬆️ |
| ...der/tools/walkers/annotator/RMSMappingQuality.java | `83.6% <77.35%> (-10.21%)` | `41 <25> (+1)` | |
| ...der/tools/walkers/CombineGVCFsIntegrationTest.java | `87.44% <83.33%> (+0.15%)` | `24 <2> (ø)` | ⬇️ |
| ... and 19 more | | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 868a32e...49c474d.

jamesemery

Mostly there are some places that could use clarifying comments. Otherwise the code itself looks reasonable and contained. Are there plans to update AS_RMSMappingQuality? The two annotations are now out of sync with each other with regard to what they do.

jamesemery · 2018-07-06T20:06:44Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/RMSMappingQuality.java


    @Override
-    public String getRawKeyName() { return GATKVCFConstants.RAW_RMS_MAPPING_QUALITY_KEY;}
+    public String getRawKeyName() { return GATKVCFConstants.RAW_MAPPING_QUALITY_WITH_DEPTH_KEY;}   //new key for the two-value MQ data to prevent version mismatch catastrophes


Do we want to bump this into a new class? This would mean that we are no longer able to handle the old RMS mapping quality code.

Does that matter to us? I guess since the variant context won't have a key matching one of the discovered reducible annotations' keys, nothing will really happen and the key will be dropped...

Perhaps we could add a check and emit a warning to the user if there is a mismatch between the versions of the rmsMappingQualityKey in their files?

I added a check in the finalize method in case GenotypeGVCFs gets run on old data. GenomicsDB behavior is largely outside of our control here. I don't have a great idea for CombineGVCFs though. I thought about adding a check against ReducibleAnnotation.getDeprecatedKeyName(), but I feel like having getDeprecatedKeyName() return null for the other annotations and querying the annotationMap for null is an invitation for trouble.
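To make the kind of check discussed here concrete, the following is a minimal sketch that inspects a plain attribute map rather than a real VariantContext. The key strings `RAW_MQ` and `RAW_MQandDP` are assumptions for illustration, and the class and method are hypothetical; this is not the check that was added to the finalize method.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; not the GATK check described in the thread.
final class RawMqKeyCheckSketch {
    static final String OLD_KEY = "RAW_MQ";      // assumed name of the old single-value key
    static final String NEW_KEY = "RAW_MQandDP"; // assumed name of the new two-value key

    /** Returns a warning message when only the old-format key is present. */
    static Optional<String> deprecatedKeyWarning(Map<String, ?> infoAttributes) {
        if (infoAttributes.containsKey(OLD_KEY) && !infoAttributes.containsKey(NEW_KEY)) {
            return Optional.of("Found " + OLD_KEY + " without " + NEW_KEY
                    + "; the input may have been produced by an older version.");
        }
        return Optional.empty();
    }
}
```

Returning an `Optional` leaves it to the caller to decide whether to log the message once or fail, which sidesteps the null-returning `getDeprecatedKeyName()` concern raised above.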

jamesemery · 2018-07-06T20:10:58Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/RMSMappingQuality.java

@@ -53,12 +56,12 @@

    @Override
    public List<VCFInfoHeaderLine> getDescriptions() {
-        return Arrays.asList(VCFStandardHeaderLines.getInfoLine(getKeyNames().get(0)), GATKVCFHeaderLines.getInfoLine(getRawKeyName()));


heh, yeah this should probably have been changed when I ported that AS_stuff. I remember this was here purely because it broke tests. It looks like you've updated them which is good.

jamesemery · 2018-07-06T20:31:39Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantDepth.java

+/**
+ * Created by gauthier on 5/29/18.
+ */
+public class VariantDepth {


I think you may have included this accidentally

jamesemery · 2018-07-09T14:28:01Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/RMSMappingQuality.java

@@ -223,6 +226,13 @@ private static double parseRawDataString(String rawDataString) {
    @VisibleForTesting
    static int getNumOfReads(final VariantContext vc,
                             final ReadLikelihoods<Allele> likelihoods) {
+        if(vc.hasAttribute(GATKVCFConstants.RAW_MAPPING_QUALITY_WITH_DEPTH_KEY)) {


Can you update the comment to reflect that it searches for the count data first?

I updated the GATK Javadoc for the annotation -- is that what you meant?

jamesemery · 2018-07-09T14:52:24Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/RMSMappingQuality.java

    }


-    public void combineAttributeMap(ReducibleAnnotationData<Double> toAdd, ReducibleAnnotationData<Double> combined) {
+    public void combineAttributeMap(ReducibleAnnotationData<List<Integer>> toAdd, ReducibleAnnotationData<List<Integer>> combined) {


This method could use a comment, as it's a little harder to read now (Allele.NO_CALL).
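For readers following the thread, combining two of the new raw values amounts to adding the pairs element by element. The sketch below mirrors the `List<Integer>` type and `NUM_LIST_ENTRIES` constant from the diff, but the class is hypothetical and the use of `Math.addExact` (so that an overflow fails loudly) is an addition of this sketch, not something taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an element-wise combine; not the GATK method body.
final class CombineRawMqSketch {
    static final int NUM_LIST_ENTRIES = 2; // (sumOfSquares, depth)

    static List<Integer> combine(List<Integer> toAdd, List<Integer> combined) {
        if (toAdd.size() != NUM_LIST_ENTRIES || combined.size() != NUM_LIST_ENTRIES) {
            throw new IllegalArgumentException("expected (sumOfSquares, depth) pairs");
        }
        List<Integer> result = new ArrayList<>(NUM_LIST_ENTRIES);
        for (int i = 0; i < NUM_LIST_ENTRIES; i++) {
            int a = toAdd.get(i);
            int b = combined.get(i);
            result.add(Math.addExact(a, b)); // throws ArithmeticException on int overflow
        }
        return result;
    }
}
```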

jamesemery · 2018-07-09T14:54:21Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/RMSMappingQuality.java

-                .mapToDouble(mq -> mq * mq).sum();
-
-        rawAnnotations.putAttribute(Allele.NO_CALL, squareSum);
+        //GATK3.5 had a double, but change this to an int for the tuple representation


Also just a quick comment explaining that this is counting both the squareSum AND the count.

jamesemery · 2018-07-09T15:01:07Z

src/test/java/org/broadinstitute/hellbender/tools/walkers/GenotypeGVCFsIntegrationTest.java

+    }
+
+    @Test (dataProvider = "VCFdata")
+    public void assertMatchingGenotypesFromGenomicsDB_vidmapHack(File[] inputs, File expected, String interval, List<String> additionalArguments, String reference) throws IOException {


Could you add a comment explaining this test? It doesn't seem related to the rest of the branch.

I goofed and copied the genotypes comparison when I meant to compare annotations. I fixed the name and made sure MQ is actually being compared. (FYI the vidmap hack is because I need to change the GDB json since I changed the MQ annotation name. I'll update this once the protobuf update goes in.)

jamesemery · 2018-07-09T15:06:01Z

src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/RMSMappingQuality.java

@@ -42,9 +42,12 @@
 @DocumentedFeature(groupName=HelpConstants.DOC_CAT_ANNOTATORS, groupSummary=HelpConstants.DOC_CAT_ANNOTATORS_SUMMARY, summary="Root mean square of the mapping quality of reads across all samples (MQ)")
 public final class RMSMappingQuality extends InfoFieldAnnotation implements StandardAnnotation, ReducibleAnnotation {
    private static final RMSMappingQuality instance = new RMSMappingQuality();
+    public static final int NUM_LIST_ENTRIES = 2;


Could you update the header comments to reflect the new annotation? Just add a line specifying what the count means

jamesemery · 2018-07-09T15:13:43Z

src/test/java/org/broadinstitute/hellbender/tools/walkers/GenotypeGVCFsIntegrationTest.java

@@ -97,6 +100,8 @@
                {getTestFile( "testAlleleSpecificAnnotations.CombineGVCF.output.g.vcf"), getTestFile( "testAlleleSpecificAnnotations.CombineGVCF.expected.g.vcf"), Arrays.asList( "-A", "ClippingRankSumTest", "-G", "AS_StandardAnnotation", "-G", "StandardAnnotation"), b37_reference_20_21},
                //all sites not supported yet see https://github.com/broadinstitute/gatk-protected/issues/580 and  https://github.com/broadinstitute/gatk/issues/2429
                //{getTestFile(basePairGVCF), getTestFile( "gvcf.basepairResolution.includeNonVariantSites.gatk3.7_30_ga4f720357.expected.vcf"), Collections.singletonList("--"+GenotypeGVCFs.ALL_SITES_LONG_NAME) //allsites not supported yet
+                //Test for new RAW_MQandDP annotation format


This test doesn't actually test the RAW_MQandDP output, because that annotation is added to the attributes to ignore. Maybe spin this off into its own test to make sure that the assertions made on the annotations are run?

jamesemery

This looks good. I still have the question of whether this should be ported to AS_MQ to include an AS_RAW_MQandDP tag. Perhaps we should open an issue to implement it at some point in the future?

jamesemery · 2018-08-20T16:12:37Z

src/main/java/org/broadinstitute/hellbender/utils/variant/GATKVCFHeaderLines.java

@@ -118,7 +118,8 @@ private static void addFilterLine(final VCFFilterHeaderLine line) {
        addInfoLine(new VCFInfoHeaderLine(LIKELIHOOD_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score from Wilcoxon rank sum test of Alt Vs. Ref haplotype likelihoods"));
        addInfoLine(new VCFInfoHeaderLine(MAP_QUAL_RANK_SUM_KEY, 1, VCFHeaderLineType.Float, "Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"));
        addInfoLine(new VCFInfoHeaderLine(AS_MAP_QUAL_RANK_SUM_KEY, VCFHeaderLineCount.A, VCFHeaderLineType.Float, "allele specific Z-score From Wilcoxon rank sum test of each Alt vs. Ref read mapping qualities"));
-        addInfoLine(new VCFInfoHeaderLine(RAW_RMS_MAPPING_QUALITY_KEY, 1, VCFHeaderLineType.Float, "Raw data for RMS Mapping Quality"));
+        addInfoLine(new VCFInfoHeaderLine(RAW_RMS_MAPPING_QUALITY_KEY, 2, VCFHeaderLineType.Integer, "Raw data for RMS Mapping Quality"));


Can the vcf description change for the new key to reflect that it is functionally different and incompatible with the old format? Having two files with the same header lines except for a count seems potentially confusing.

ldgauthier · 2018-08-22T15:17:16Z

The AS_MQ never suffered from this issue because it uses AD for (allele-specific) depth instead of the INFO DP. The sum of the squared MQs there was allocated based on informative reads and the AD represents informative reads, so the data there was always in lock-step.

jamesemery

Looks good to me now. Unfortunately there are some merge conflicts. Could you rebase onto master? Then I think it is good to merge 👍

ldgauthier · 2018-09-18T19:00:25Z

I put this into a different branch because I upgraded GDB to fix the weird error. I don't want this feature to go into the 4.0.9.0 release so I'll do a PR of the new branch after.

ldgauthier · 2018-10-05T17:41:31Z

Provided Travis tests pass, does this still have your 👍 @jamesemery ? The ReblockGVCF tool had a less elegant MQ solution so now I'm having it output both versions so that we don't need to re-reprocess the gnomAD v3 GVCFs.

jamesemery

These changes look good to me, this still has my 👍

jamesemery · 2018-10-09T14:31:54Z

src/main/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBUtils.java

+ * https://developers.google.com/protocol-buffers/docs/javatutorial#the-protocol-buffer-api
+ * https://developers.google.com/protocol-buffers/docs/reference/java-generated
+ */
+public class GenomicsDBUtils {


I approve of this refactor

Change raw MQ to a tuple of (sumSquaredMQs, totalDepth) for better accuracy where there are lots of uninformative reads or called single-sample variants with homRef genotypes. Note that incorporating this change into a pipeline will require a concomitant update to this version for GenomicsDBImport and GenotypeGVCFs.

indraniel · 2018-11-17T19:47:52Z

Just a note: this change at line 251 of gatk/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/RMSMappingQuality.java, in the function parseRawDataString:

final int squareSum = Integer.parseInt(parsed[SUM_OF_SQUARES_INDEX]);

results in the following type of GVCF/VCF parsing error:

A USER ERROR has occurred: Bad input: malformed RAW_MQ annotation: 3415207168,1749038

when the first RAW_MQ value (the sum of squared MQs) is greater than Integer.MAX_VALUE (2147483647).
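One way to avoid this failure mode, shown here only as a sketch, is to parse both fields as `long`. The class and method names are hypothetical, the error text simply echoes the message quoted above, and this is not necessarily how the issue was resolved upstream.

```java
// Hypothetical sketch; not the GATK parseRawDataString implementation.
final class RawMqParseSketch {
    /** Parses "sumOfSquares,depth" into {sumOfSquares, depth} using 64-bit values. */
    static long[] parseRawMq(String rawDataString) {
        String[] parsed = rawDataString.trim().split(",");
        if (parsed.length != 2) {
            throw new IllegalArgumentException("malformed RAW_MQ annotation: " + rawDataString);
        }
        try {
            long sumOfSquares = Long.parseLong(parsed[0].trim());
            long depth = Long.parseLong(parsed[1].trim());
            return new long[] {sumOfSquares, depth};
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("malformed RAW_MQ annotation: " + rawDataString, e);
        }
    }
}
```

With 64-bit parsing, the value from the error message above, `3415207168,1749038`, is read without an exception, whereas `Integer.parseInt("3415207168")` throws a `NumberFormatException`.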

ldgauthier requested a review from jamesemery June 29, 2018 18:53

jamesemery requested changes Jul 9, 2018

jamesemery reviewed Aug 20, 2018

jamesemery approved these changes Aug 22, 2018

ldgauthier mentioned this pull request Aug 23, 2018

Add new numReads annotation for better stability of new MQ calculation #2668

Closed

droazen assigned ldgauthier Aug 24, 2018

ldgauthier force-pushed the ldg_fixMQcalc branch 2 times, most recently from c3c0ed5 to e95075e Compare August 27, 2018 16:44

ldgauthier closed this Sep 18, 2018

jamesemery mentioned this pull request Oct 1, 2018

Change raw MQ to a tuple of (sumSquaredMQs, totalDepth) for better ac… #5237

Closed

ldgauthier reopened this Oct 3, 2018

ldgauthier force-pushed the ldg_fixMQcalc branch 2 times, most recently from 317b4fe to 475668c Compare October 3, 2018 17:31

ldgauthier added 2 commits October 4, 2018 11:17

Change raw MQ to a tuple of (sumSquaredMQs, totalDepth) for better ac…

13b47a5

…curacy where there are lots of uninformative reads

Cleanup new GDB code

840bb25

ldgauthier force-pushed the ldg_fixMQcalc branch from 79d9e59 to 840bb25 Compare October 4, 2018 15:17

ldgauthier added 4 commits October 4, 2018 14:38

Oops

7b8f9b3

Oops test utils

7349076

Last method call change?

09831f0

Keep gnomAD v3 format annotations so we don't have to reprocess again

dbef456

Actually add the files for the new test

49c474d

jamesemery approved these changes Oct 9, 2018

ldgauthier merged commit 4dd7ba8 into master Oct 9, 2018

ldgauthier deleted the ldg_fixMQcalc branch October 9, 2018 17:34

indraniel mentioned this pull request Nov 18, 2018

RAW_MQ/sumSquaredMQs parsing error when running GenotypeGVCFs for JointGenotyping #5433

Closed

lbergelson mentioned this pull request Jan 24, 2019

Genotyping code for the Gnarly Pipeline (gnomAD v3) #4947

Merged
