Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allele Frequency QC Metric for Arrays #6039

Merged
merged 11 commits into from
Sep 19, 2019
Merged

Allele Frequency QC Metric for Arrays #6039

merged 11 commits into from
Sep 19, 2019

Conversation

skwalker
Copy link
Contributor

This extends Variant Eval to compare AFs between variants in binned AF buckets based on Thousand Genomes VCF, between the expected AF from Thousand Genomes and the seen one in the actual VCF, to be used as a QC metric for our arrays pipeline.

@skwalker skwalker requested a review from takutosato July 16, 2019 14:51
@codecov
Copy link

codecov bot commented Jul 16, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@f499656). Click here to learn what that means.
The diff coverage is 81.513%.

@@            Coverage Diff             @@
##             master     #6039   +/-   ##
==========================================
  Coverage          ?   87.572%           
  Complexity        ?     36801           
==========================================
  Files             ?      2044           
  Lines             ?    166417           
  Branches          ?     19264           
==========================================
  Hits              ?    145735           
  Misses            ?     14365           
  Partials          ?      6317
Impacted Files Coverage Δ Complexity Δ
...itute/hellbender/utils/report/GATKReportTable.java 70.522% <100%> (ø) 68 <1> (?)
...lbender/tools/walkers/varianteval/VariantEval.java 90.769% <100%> (ø) 144 <3> (?)
.../varianteval/AlleleFrequencyQCIntegrationTest.java 100% <100%> (ø) 3 <3> (?)
...nder/metrics/analysis/AlleleFrequencyQCMetric.java 100% <100%> (ø) 1 <1> (?)
...s/varianteval/stratifications/AlleleFrequency.java 53.333% <60.87%> (ø) 6 <4> (?)
...kers/varianteval/evaluators/PVariantEvaluator.java 78.571% <78.571%> (ø) 9 <9> (?)
...r/tools/walkers/varianteval/AlleleFrequencyQC.java 86.538% <86.538%> (ø) 10 <10> (?)

@takutosato takutosato self-assigned this Jul 29, 2019
Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work Sarah. This is good stuff.


@Argument(shortName = "pvalue-threshold",
doc = "Threshold to cut off the pvalue.")
protected Double threshold = 0.05;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the primitive double unless you have a reason to use the object Double


super.onTraversalSuccess();

GATKReportTable table= new GATKReport(outFile).getTable(MODULES_TO_USE.get(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all these variables declared here should be final

doc = "File to save the results from variant eval for debugging", optional = true)
protected File debugFile;

private String R_SCRIPT = "plotAlleleFrequencyQC.R";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R_SCRIPT -> rScript.

Reserve the "all caps and underscore" case for static final constants

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should actually be a static final constant

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call

public int totalCalledSites;
@DataPoint(description = "Number of called heterozygous sites;", format = "%d")
public int totalHetSites;
@DataPoint(description = "Number of called heterozygous sites;", format = "%d")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heterozygous -> hom var

public int totalHetSites;
@DataPoint(description = "Number of called heterozygous sites;", format = "%d")
public int totalHomVarSites;
@DataPoint(description = "Number of called heterozygous sites;", format = "%d")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heterozygous -> hom ref


private int getLogitBucket(Double alleleFraction) {
// calculate logit value
Float score = (float)( -10 * Math.log10((alleleFraction/(1-alleleFraction))));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not too much work, I would remove the factor of -10 and rescale the bin width and limits. Logit by definition doesn't have the factor of -10, and you shouldn't take phred scale of something that's not a probability i.e. p/(1-p). It's an unnecessary layer that could potentially mislead the user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok so I talked about this with Yossi, and I think we actually want to keep it like this; first because it means a bin width of 1 works really well, and he thought that the idea of phred-scale was actually beneficial. So I think I will keep it like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. And this way seems to have binned the variants nicely so I guess there's no reason to change.

}
break;
case LOGARITHMIC:
for (int a = -logLimit; a <= logLimit; a += 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract a variable private static int LOGIT_BIN_WIDTH = 1;

import org.broadinstitute.hellbender.tools.walkers.varianteval.util.Analysis;
import org.broadinstitute.hellbender.tools.walkers.varianteval.util.DataPoint;

@Analysis(description = "Computes different estimates of theta based on variant sites and genotypes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's theta?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be good to document the method succinctly; i.e. divide sites by AF in 1000G, then within each bin add up the alt alleles and divide by 2*num_sites, and do the goodness of fit test to get the p-value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that documentation should be for AllleFrequencyQC; this is just the evaluator being used to calculate average var allele fraction (P --> why its PVariantEvaluator) and count the number of hets/hom vars/ hom refs; definitely updating description but not sure about a new name unless you have a better idea ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, docs should go wherever you think is appropriate, I just happened to think that when I was reviewing this part.

As for the name, maybe something like AggregateAltFraction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about VariantAFEvaluator ?
I think there is decent documentation in AlleleFrequencyQC.java -- let me know if you think it's unclear?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the first sentence in AlleleFreqeuncyQC.java docs "This tool uses Variant Eval..." could be improved. What do you think?


@Analysis(description = "Computes different estimates of theta based on variant sites and genotypes")

public class PVariantEvaluator extends VariantEvaluator {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we pick a more descriptive name than PVariantEvaluator?

return null;
}

// This creates a modified chi squared statistic that allows for a constant variance (1%) rather than scaling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to have to think about this.

}
}

public List<Object> getRelevantStates(ReferenceContext referenceContext, ReadsContext readsContext, FeatureContext featureContext, VariantContext comp, String compName, VariantContext eval, String evalName, String sampleName, String FamilyName) {
if (eval != null) {
try {
return Collections.singletonList((Object)String.format("%.3f", (5.0 * MathUtils.roundToNDecimalPlaces(eval.getAttributeAsDouble("AF", 0.0) / 5.0, 3))));
Double alleleFraction = Collections.max(eval.getAttributeAsDoubleList("AF", 0.0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alleleFraction -> alleleFrequency to minimize confusion (and also because the class is called AlleleFrequency)

Copy link
Contributor

@takutosato takutosato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrite the doc as you see fit and it's ready to go

@skwalker skwalker force-pushed the skwalker_af_arrays_qc branch from 00c8ee2 to a317675 Compare September 17, 2019 20:38
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skwalker Looks good. Retag the docker image and push it to docker hub, then update the pr and make sure tests pass with the new image. 👍 from me when that's green.

@@ -6,7 +6,7 @@ set -e

REPO=broadinstitute
PROJECT=gatk
VERSION=2.0.3
VERSION=2.1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update to 2.2.0 to avoid conflict with the existing wierd 2.1.0

Dockerfile Outdated
@@ -1,5 +1,5 @@
# Using OpenJDK 8
FROM broadinstitute/gatk:gatkbase-2.0.3
FROM skwalker/gatk:gatkbase-2.1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retag as broadinstitute/gatk:gatkbase-2.2.0 and push to dockerhub.

@lbergelson
Copy link
Member

Oh, be sure to mention that you updated the base image and included dplyr in the commit message.

@skwalker skwalker merged commit 895843f into master Sep 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants