-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allele Frequency QC Metric for Arrays #6039
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6039 +/- ##
==========================================
Coverage ? 87.572%
Complexity ? 36801
==========================================
Files ? 2044
Lines ? 166417
Branches ? 19264
==========================================
Hits ? 145735
Misses ? 14365
Partials ? 6317
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work Sarah. This is good stuff.
|
||
@Argument(shortName = "pvalue-threshold", | ||
doc = "Threshold to cut off the pvalue.") | ||
protected Double threshold = 0.05; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the primitive double
unless you have a reason to use the object Double
|
||
super.onTraversalSuccess(); | ||
|
||
GATKReportTable table= new GATKReport(outFile).getTable(MODULES_TO_USE.get(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all these variables declared here should be final
doc = "File to save the results from variant eval for debugging", optional = true) | ||
protected File debugFile; | ||
|
||
private String R_SCRIPT = "plotAlleleFrequencyQC.R"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R_SCRIPT
-> rScript
.
Reserve the "all caps and underscore" case for static final
constants
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should actually be a static final constant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call
public int totalCalledSites; | ||
@DataPoint(description = "Number of called heterozygous sites;", format = "%d") | ||
public int totalHetSites; | ||
@DataPoint(description = "Number of called heterozygous sites;", format = "%d") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heterozygous -> hom var
public int totalHetSites; | ||
@DataPoint(description = "Number of called heterozygous sites;", format = "%d") | ||
public int totalHomVarSites; | ||
@DataPoint(description = "Number of called heterozygous sites;", format = "%d") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
heterozygous -> hom ref
|
||
private int getLogitBucket(Double alleleFraction) { | ||
// calculate logit value | ||
Float score = (float)( -10 * Math.log10((alleleFraction/(1-alleleFraction)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not too much work, I would remove the factor of -10 and rescale the bin width and limits. Logit by definition doesn't have the factor of -10, and you shouldn't take phred scale of something that's not a probability i.e. p/(1-p). It's an unnecessary layer that could potentially mislead the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok so I talked about this with Yossi, and I think we actually want to keep it like this; first because it means a bin width of 1 works really well, and he thought that the idea of phred-scale was actually beneficial. So I think I will keep it like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. And this way seems to have binned the variants nicely so I guess there's no reason to change.
} | ||
break; | ||
case LOGARITHMIC: | ||
for (int a = -logLimit; a <= logLimit; a += 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract a variable private static int LOGIT_BIN_WIDTH = 1;
import org.broadinstitute.hellbender.tools.walkers.varianteval.util.Analysis; | ||
import org.broadinstitute.hellbender.tools.walkers.varianteval.util.DataPoint; | ||
|
||
@Analysis(description = "Computes different estimates of theta based on variant sites and genotypes") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's theta?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be good to document the method succinctly; i.e. divide sites by AF in 1000G, then within each bin add up the alt alleles and divide by 2*num_sites, and do the goodness of fit test to get the p-value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that documentation should be for AllleFrequencyQC
; this is just the evaluator being used to calculate average var allele fraction (P --> why its PVariantEvaluator) and count the number of hets/hom vars/ hom refs; definitely updating description but not sure about a new name unless you have a better idea ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, docs should go wherever you think is appropriate, I just happened to think that when I was reviewing this part.
As for the name, maybe something like AggregateAltFraction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about VariantAFEvaluator ?
I think there is decent documentation in AlleleFrequencyQC.java
-- let me know if you think it's unclear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the first sentence in AlleleFreqeuncyQC.java
docs "This tool uses Variant Eval..." could be improved. What do you think?
|
||
@Analysis(description = "Computes different estimates of theta based on variant sites and genotypes") | ||
|
||
public class PVariantEvaluator extends VariantEvaluator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we pick a more descriptive name than PVariantEvaluator
?
return null; | ||
} | ||
|
||
// This creates a modified chi squared statistic that allows for a constant variance (1%) rather than scaling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to have to think about this.
} | ||
} | ||
|
||
public List<Object> getRelevantStates(ReferenceContext referenceContext, ReadsContext readsContext, FeatureContext featureContext, VariantContext comp, String compName, VariantContext eval, String evalName, String sampleName, String FamilyName) { | ||
if (eval != null) { | ||
try { | ||
return Collections.singletonList((Object)String.format("%.3f", (5.0 * MathUtils.roundToNDecimalPlaces(eval.getAttributeAsDouble("AF", 0.0) / 5.0, 3)))); | ||
Double alleleFraction = Collections.max(eval.getAttributeAsDoubleList("AF", 0.0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alleleFraction
-> alleleFrequency
to minimize confusion (and also because the class is called AlleleFrequency)
...va/org/broadinstitute/hellbender/tools/walkers/varianteval/evaluators/PVariantEvaluator.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrite the doc as you see fit and it's ready to go
00c8ee2
to
a317675
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@skwalker Looks good. Retag the docker image and push it to docker hub, then update the pr and make sure tests pass with the new image. 👍 from me when that's green.
@@ -6,7 +6,7 @@ set -e | |||
|
|||
REPO=broadinstitute | |||
PROJECT=gatk | |||
VERSION=2.0.3 | |||
VERSION=2.1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update to 2.2.0 to avoid conflict with the existing wierd 2.1.0
Dockerfile
Outdated
@@ -1,5 +1,5 @@ | |||
# Using OpenJDK 8 | |||
FROM broadinstitute/gatk:gatkbase-2.0.3 | |||
FROM skwalker/gatk:gatkbase-2.1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Retag as broadinstitute/gatk:gatkbase-2.2.0 and push to dockerhub.
Oh, be sure to mention that you updated the base image and included dplyr in the commit message. |
This extends Variant Eval to compare AFs between variants in binned AF buckets based on Thousand Genomes VCF, between the expected AF from Thousand Genomes and the seen one in the actual VCF, to be used as a QC metric for our arrays pipeline.