Port of GATK3 Variant Annotator Tool #3803
Conversation
Branch force-pushed from 8769084 to b835691
Codecov Report
@@ Coverage Diff @@
## master #3803 +/- ##
===============================================
+ Coverage 79.116% 79.163% +0.047%
- Complexity 16472 16604 +132
===============================================
Files 1047 1049 +2
Lines 59199 59615 +416
Branches 9676 9776 +100
===============================================
+ Hits 46836 47193 +357
- Misses 8600 8635 +35
- Partials 3763 3787 +24
Branch force-pushed from b9805a2 to 962c95e
@ldgauthier Would you have time to do an initial review pass on this?
I had a couple questions, especially about indels and pileups. I haven't done all the tests, but I'll do them now.
return counts;
}

private int[] annotateWithLiklihoods(VariantContext vc, Genotype g, Set<Allele> alleles, ReadLikelihoods<Allele> likelihoods) {
Liklihoods -> Likelihoods
done
* @param read a read containing the variant
* @return number of non-hard clipped, aligned bases (excluding low quality bases at either end)
*/
//TODO: this is bizarre -- this code counts hard clips, but then subtracts them from the read length, which already doesn't count hard clips
This would not be the only ReadPosRankSum-related bug... Can you file an issue?
Right, so these comments are mostly pulled in from GATK3. I needed to pull in all of these classes in order to support annotating from the pileup, which doesn't affect the calculations we currently do based on the likelihood. My goal was to match GATK3 bugs and all. I will file a ticket for after this branch to evaluate what should be done for these classes.
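To make the suspected bug concrete, here is a small arithmetic sketch (the method and numbers are hypothetical, not the actual GATK code): a SAM record's stored base array already excludes hard-clipped bases, so subtracting the hard-clip count from it again undercounts the aligned bases.

```java
public class HardClipSketch {
    // For a read with CIGAR 5H95M, the record stores 95 bases; the 5 hard-clipped
    // bases are not present in the record at all.
    static int buggyAlignedBaseCount(int storedBaseCount, int hardClipCount) {
        // GATK3-style arithmetic: subtracts hard clips from a length that never included them.
        return storedBaseCount - hardClipCount;
    }

    public static void main(String[] args) {
        int storedBaseCount = 95; // bases actually present in the record (5H95M)
        int hardClipCount = 5;    // counted from the CIGAR
        // Prints 90, even though all 95 stored bases are aligned.
        System.out.println(buggyAlignedBaseCount(storedBaseCount, hardClipCount));
    }
}
```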
return Collections.singletonMap(getKeyNames().get(0), (Object)annotationString);
}

public String makeFinalizedAnnotationString(final VariantContext vc, final Map<Allele, Number> perAlleleData) {
int numOfReads = getNumOfReads(vc);
public String makeFinalizedAnnotationString(final int numOfReads, final Map<Allele, Number> perAlleleData) {
Thanks for the refactor.
}

// Use the pileup to stratify otherwise
} else if (!(this instanceof AS_RankSumTest)) {
I don't love that we're calling instanceof for every variant. My instinct says this is going to slow things down -- have you noticed?
I have not run extensive timing; I will overload apply() in AS_RankSumTest to make this cleaner, though.
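A minimal sketch of that cleanup (the class and method names below are hypothetical stand-ins, not the real GATK classes): virtual dispatch via an override replaces the per-variant `instanceof` check at the call site.

```java
// Hypothetical stand-in for RankSumTest.
class RankSumSketchBase {
    // Base behavior: fall back to pileup stratification.
    public boolean usePileupFallback() { return true; }
}

// Hypothetical stand-in for AS_RankSumTest.
class AlleleSpecificRankSumSketch extends RankSumSketchBase {
    // The subclass opts out, so the caller no longer needs
    // `!(this instanceof AS_RankSumTest)`.
    @Override
    public boolean usePileupFallback() { return false; }
}

public class DispatchSketch {
    public static void main(String[] args) {
        RankSumSketchBase plain = new RankSumSketchBase();
        RankSumSketchBase alleleSpecific = new AlleleSpecificRankSumSketch();
        System.out.println(plain.usePileupFallback());          // true
        System.out.println(alleleSpecific.usePileupFallback()); // false
    }
}
```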
for (final PileupElement p : likelihoods.getStratifiedPileups(vc).values().stream().flatMap(Collection::stream).collect(Collectors.toList())) {
final OptionalDouble value = getElementForPileupElement(p, refLoc);
if (value.isPresent() && value.getAsDouble() != INVALID_ELEMENT_FROM_READ && isUsableBase(p)) {
if (vc.getReference().equals(Allele.create(p.getBase(), true))) {
I haven't worked much with PileupElements -- what happens for deletions? It seems like for a deletion vc.getReference() will return multiple bases, but the pileup element will return one base or the deletion 'D'.
It will return the first base before the deletion site, which is not ideal, I imagine, but this seems to be a repeated bug from GATK3. It seems like an edge case because the only time this could make a difference is if you are at a site that is not marked as a deletion but you then have a read with a deletion on it that happens to match one of the two alleles of the SNP... so a minor effect. Should I open a new ticket for this?
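The shape of the mismatch can be sketched without htsjdk (hypothetical values and a stand-in method, not the actual code): a deletion's reference allele spans multiple bases, while a pileup element reports a single base, so the equality check in the quoted code cannot match for deletions.

```java
public class DeletionRefSketch {
    // Stand-in for vc.getReference().equals(Allele.create(p.getBase(), true)):
    // compare the reference allele's bases against the single pileup base.
    static boolean matchesReference(String referenceAlleleBases, char pileupBase) {
        return referenceAlleleBases.equals(String.valueOf(pileupBase));
    }

    public static void main(String[] args) {
        // SNP site: REF=T, pileup base T -> matches.
        System.out.println(matchesReference("T", 'T'));   // true
        // 2bp deletion site: REF=TAC, but a pileup element can only report one base.
        System.out.println(matchesReference("TAC", 'T')); // false
    }
}
```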

import static org.testng.Assert.*;

public class AnnotationUtilsUnitTest {
I can check these later if you'd like, but I figured I'd get comments on the rest back to you faster if I don't do it now.
@@ -95,5 +90,24 @@ public void testBlowUp(){
new DepthPerAlleleBySample().annotate(null, vc, gAC, gb, likelihoods);
}

@Test
public void testEmptyLiklihoodsFallback(){
search and replace
Spelling? What is that?
@@ -216,11 +216,12 @@ public void testUsingLikelihoods_Raw() {
Assert.assertNotNull(annotatedMapRaw, vc.toString());
final String actualStringRaw = (String) annotatedMapRaw.get(key);
If you're going to replace key below you might as well do it here too and get rid of key.
done
@@ -23,13 +23,6 @@
public static final String RAW_DELIM = ",";
public static final String REDUCED_DELIM = ",";

@Override
public Map<String, Object> annotate(final ReferenceContext ref,
So if you remove this, it's going to default to the non-allele-specific version, right? So if you run VariantAnnotator requesting an AS_RankSum annotation, you'll get the finalized form of RankSum classic? I think that at least deserves a warning.
That's true, BUT if you want raw annotations you simply need to instantiate the VariantAnnotator in rawAnnotations mode (set the boolean flag on construction). Then it will call annotateRawData() instead. Indeed, the reason for removing this method was to allow annotate() to work in the event that VariantAnnotator/HaplotypeCaller wants to output a VCF with an AS_Annotation.
For the record, I feel weird about this tool default behavior. If you ask for an annotation expecting key "AS_FS" from VariantAnnotator then A) it won't be there and B) there will be an unexpected "FS" key in the output INFO field.
##LoF=Loss-of-function annotation (HC = High Confidence; LC = Low Confidence)
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|ALLELE_NUM|DISTANCE|STRAND|SYMBOL|SYMBOL_SOURCE|HGNC_ID|BIOTYPE|CANONICAL|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|SIFT|PolyPhen|EXON|INTRON|DOMAINS|HGVSc|HGVSp|GMAF|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF|AA_MAF|EA_MAF|CLIN_SIG|SOMATIC|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|LoF_info|LoF_flags|LoF_filter|LoF"> | ||
#CHROM POS ID REF ALT QUAL FILTER INFO
1 26608873 rs1134583 T A,C 8647.13 VQSRTrancheSNP99.90to100.00 AAChange=uc001bly.3:c.A1120T:p.S374C;AC=35,5;AF=0.269,0.038;AN=130;AVSIFT=0;BaseQRankSum=-2.407;DB;DP=2659;Dels=0.00;ExonicFunc=nonsynonymousSNV;FS=27.067;Func=exonic;Gene=UBXN11;HaplotypeScore=11.0781;InbreedingCoeff=-0.3011;LJB_GERP++=0.0;LJB_LRT=0.825324;LJB_LRT_Pred=NA;LJB_PhyloP=0.887806;LJB_PhyloP_Pred=N;LJB_PolyPhen2=0.515975;LJB_PolyPhen2_Pred=NA;LJB_SIFT=1.0;LJB_SIFT_Pred=D;LRT_MutationTaster=7.8E-5;LRT_MutationTaster_Pred=N;MLEAC=35,3;MLEAF=0.269,0.023;MQ=48.46;MQ0=1;MQRankSum=-11.748;NEGATIVE_TRAIN_SITE;QD=5.85;ReadPosRankSum=-6.714;SNPEFF_AMINO_ACID_CHANGE=S251G;SNPEFF_CODON_CHANGE=Agt/Ggt;SNPEFF_EFFECT=NON_SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_26608773_26609060;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=UBXN11;SNPEFF_IMPACT=MODERATE;SNPEFF_TRANSCRIPT_ID=ENST00000374223;VQSLOD=-4.019e+00;culprit=MQRankSum;dbSNP135=rs1134583;CSQ=A|ENSG00000158062|ENST00000496466|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2262|||||rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|retained_intron|||||||||10/10|||ENST00000496466.1:n.2262A>T||||||||||0&1|||||||||,C|ENSG00000158062|ENST00000496466|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2262|||||rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|retained_intron|||||||||10/10|||ENST00000496466.1:n.2262A>G||||||||||0&1|||||||||,A|ENSG00000158062|ENST00000374217|Transcript|missense_variant|1748|1381|461|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding||CCDS41286.1|ENSP00000363334|UBX11_HUMAN|Q5T130_HUMAN|UPI00004700E0|deleterious(0.01)|unknown(0)|15/15||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374217.2:c.1381A>T|ENSP00000363334.2:p.Ser461Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000374217|Transcript|missense_variant|1748|1381|461|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding||CCDS41286.1|ENS
P00000363334|UBX11_HUMAN|Q5T130_HUMAN|UPI00004700E0|deleterious(0.03)|unknown(0)|15/15||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374217.2:c.1381A>G|ENSP00000363334.2:p.Ser461Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000374222|Transcript|missense_variant|1945|1480|494|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding|YES|CCDS41288.1|ENSP00000363339|UBX11_HUMAN||UPI00004700E1|deleterious(0.01)|unknown(0)|16/16||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374222.1:c.1480A>T|ENSP00000363339.1:p.Ser494Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000374222|Transcript|missense_variant|1945|1480|494|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding|YES|CCDS41288.1|ENSP00000363339|UBX11_HUMAN||UPI00004700E1|deleterious(0.03)|unknown(0)|16/16||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374222.1:c.1480A>G|ENSP00000363339.1:p.Ser494Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000357089|Transcript|missense_variant|1457|1381|461|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding||CCDS41286.1|ENSP00000349601|UBX11_HUMAN|Q5T130_HUMAN|UPI00004700E0|deleterious(0.01)|unknown(0)|14/14||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000357089.4:c.1381A>T|ENSP00000349601.4:p.Ser461Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000357089|Transcript|missense_variant|1457|1381|461|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding||CCDS41286.1|ENSP00000349601|UBX11_HUMAN|Q5T130_HUMAN|UPI00004700E0|deleterious(0.03)|unknown(0)|14/14||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000357089.4:c.1381A>G|ENSP00000349601.4:p.Ser461Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000314675|Transcript|missense_variant|1200|1120|374|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding||CCDS41287.1|ENSP00000324721|UBX11_HUMAN||UPI000013F8CA|deleterious(0.01)|unknown(0)|11/11||Low_complexity_(Seg):Seg&PROSITE
_profiles:PS50099|ENST00000314675.7:c.1120A>T|ENSP00000324721.7:p.Ser374Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000314675|Transcript|missense_variant|1200|1120|374|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding||CCDS41287.1|ENSP00000324721|UBX11_HUMAN||UPI000013F8CA|deleterious(0.03)|unknown(0)|11/11||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000314675.7:c.1120A>G|ENSP00000324721.7:p.Ser374Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000475591|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2442|||||rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|retained_intron|||||||||13/13|||ENST00000475591.1:n.2442A>T||||||||||0&1|||||||||,C|ENSG00000158062|ENST00000475591|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2442|||||rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|retained_intron|||||||||13/13|||ENST00000475591.1:n.2442A>G||||||||||0&1|||||||||,A|ENSG00000158062|ENST00000494942|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2227|||||rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|retained_intron|||||||||10/10|||ENST00000494942.1:n.2227A>T||||||||||0&1|||||||||,C|ENSG00000158062|ENST00000494942|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2227|||||rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|retained_intron|||||||||10/10|||ENST00000494942.1:n.2227A>G||||||||||0&1|||||||||,A|ENSG00000158062|ENST00000374221|Transcript|missense_variant|1694|1480|494|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding||CCDS41288.1|ENSP00000363338|UBX11_HUMAN||UPI00004700E1|deleterious(0.01)|unknown(0)|16/16||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374221.3:c.1480A>T|ENSP00000363338.3:p.Ser494Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000374221|Transcript|missense_variant|1694|1480|494|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding||CCDS41288.1|ENSP000003633
38|UBX11_HUMAN||UPI00004700E1|deleterious(0.03)|unknown(0)|16/16||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374221.3:c.1480A>G|ENSP00000363338.3:p.Ser494Gly|||||||||0&1|||||||||,A|ENSG00000158062|ENST00000472155|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2545|||||rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|retained_intron|||||||||14/14|||ENST00000472155.1:n.2545A>T||||||||||0&1|||||||||,C|ENSG00000158062|ENST00000472155|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2545|||||rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|retained_intron|||||||||14/14|||ENST00000472155.1:n.2545A>G||||||||||0&1|||||||||,A|ENSG00000158062|ENST00000374223|Transcript|missense_variant|1212|751|251|S/C|Agt/Tgt|rs1134583&COSM1602122|1||-1|UBXN11|HGNC|30600|protein_coding|||ENSP00000363340|||UPI00004700DA|deleterious(0.03)|unknown(0)|12/12||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374223.1:c.751A>T|ENSP00000363340.1:p.Ser251Cys|||||||||0&1|||||||||,C|ENSG00000158062|ENST00000374223|Transcript|missense_variant|1212|751|251|S/G|Agt/Ggt|rs1134583&COSM1602122|2||-1|UBXN11|HGNC|30600|protein_coding|||ENSP00000363340|||UPI00004700DA|tolerated(0.06)|unknown(0)|12/12||Low_complexity_(Seg):Seg&PROSITE_profiles:PS50099|ENST00000374223.1:c.751A>G|ENSP00000363340.1:p.Ser251Gly|||||||||0&1|||||||||,A||ENSR00000533083|RegulatoryFeature|regulatory_region_variant||||||rs1134583&COSM1602122|1||||||regulatory_region||||||||||||||||||||||0&1|||||||||,C||ENSR00000533083|RegulatoryFeature|regulatory_region_variant||||||rs1134583&COSM1602122|2||||||regulatory_region||||||||||||||||||||||0&1||||||||| |
Good lord, this is huge. Is this funcotator output? Do you really need all those transcripts to test what you want to test?
It was a ported GATK3 test. I'm guessing it was some pathological Oncotator output that caused an error in GATK3.
I looked at the tests. I feel weird about running b36 data with a b37 reference, but I would probably feel less weird if we just changed the reference in the header to b37. (There's no sequence dictionary so we can play fast and loose.) I'm also curious if all the annotations are covered with the reduced VariantAnnotatorIntegrationTest. The Standard annotations are definitely covered and maybe it's not worth extending the runtime of our integration test suite to cover things no one uses. I'll let @droazen weigh in.
import java.util.stream.Collectors;

public class VariantAnnotatorIntegrationTest extends CommandLineProgramTest {
final static String STANDARD_ANNOTATIONS = " -G StandardAnnotation "; //-G StandardUG ?
It appears we don't have StandardUG anymore. Good. Those were useless.
erased
public class VariantAnnotatorIntegrationTest extends CommandLineProgramTest {
final static String STANDARD_ANNOTATIONS = " -G StandardAnnotation "; //-G StandardUG ?
private static final List<String> ATTRIBUTES_TO_IGNORE = Arrays.asList(
"QD",//TODO QD has a cap value and anything that reaches that is randomized. It's difficult to reproduce the same random numbers across gatk3 -> 4
Ha, I read that as "crap value". Over the cap it's crap... I'm really not a stickler for matching values exactly when we know there's some wiggle room.
"FS");//TODO There's some bug in either gatk3 or gatk4 fisherstrand that's making them not agree still, I'm not sure which is correct
private static final List<String> HEADER_LINES_TO_IGNROE = Arrays.asList(
"reference",
"HaplotypeScore",
This is, incidentally, one of those StandardUG annotations.
I eventually gave up scrubbing it from all the headers; should I make a pass and remove it everywhere?
assertVariantContextsMatch(getTestFile("HCOutput.NoAnnotations.vcf"), new File(getToolTestDataDir() + "expected/integrationTest.vcf"), Arrays.asList("-G", "Standard", "-G", "AS_Standard", "-L", "20:10000000-10100000", "-I", NA12878_20_21_WGS_bam), b37_reference_20_21);
}

//TODO these tests must be modernized :-(
I love the names too. :-/
private static final List<String> ATTRIBUTES_TO_IGNORE = Arrays.asList(
"QD",//TODO QD has a cap value and anything that reaches that is randomized. It's difficult to reproduce the same random numbers across gatk3 -> 4
"FS");//TODO There's some bug in either gatk3 or gatk4 fisherstrand that's making them not agree still, I'm not sure which is correct
private static final List<String> HEADER_LINES_TO_IGNROE = Arrays.asList(
IGNROE -> IGNORE
done

//TODO these tests must be modernized :-(
@Test
public void testHasAnnotsNotAsking1() throws IOException {
I don't know why there were two of these in GATK3. I'm fine with dropping #2.
Basically, there was one run of tests comparing a standard single-sample VCF and another comparing the outputs for a complicated 20-30 sample VCF, but they were primarily just testing input combinations to verify the expected behavior with respect to recomputing annotations and keeping irrelevant annotations. They were all on old references, so I couldn't port them. I have condensed all of this down into one set of tests on one multisample VCF.

@Test
public void testNoAnnotsAsking1() throws IOException {
I especially think it's good that you dropped the second test of this style because the comments claimed that the input genotype annotations were weird. I hate to keep around test data that's not representative of our usual output unless it's testing a specific error mode.
👍
b37_reference_20_21);
}
@Test
public void testOverwritingHeader() throws IOException {
Does this use the withUpdatedContigs method? No, it must not because otherwise the results wouldn't match. I'm not sure this is necessary since the noAnnotsAsking tests will make sure the header gets updated. I can't help but notice that vcfexample4.vcf is allegedly on b36 according to its header. b37 happens to have the same reference base at that position, so the output is still valid, but honestly I feel weird with a test that overwrites the VCF reference.
It is using withUpdatedContigs actually; everything is, I believe. The reason I used it is that it was the best way to match the header to GATK3, since that is what GATK3 was using.
What is our sequence dictionary validation policy for VCFs? As things are now, there are a lot of opportunities for mayhem. For example, CalculateGenotypePosteriors doesn't need a reference, and there's no way to check that the input VCF and the supporting VCF (if specified) are on the same reference. Ditto VariantRecalibrator. I believe those tools both check alleles for overlapping variants, but that doesn't mean it is 100% secure against input mismatches.
Assert.assertFalse(lineIteratorAnn.hasNext());
}

}
There's nothing specific that jumps out at me as being missing. That being said, in GATK3, this was pretty much the only way we got test coverage on most of the annotations. In GATK4 there are a lot more unit tests, so maybe that's less of an issue. How hard would it be to run test coverage over the classes in the walkers.annotator package? I knew how to do it in IntelliJ once, but it's been a while.
I have run coverage; there are a few annotations that could use better coverage. Notable among them are StrandArtifact, RMSMappingQuality (though that seems to be a fluke because it does have tests), OxoGCounts, MappingQuality, FragmentLength, and DepthPerAlleleBySample. I have added a test for --use-all-annotations that asserts everything at least ran and is outputting results when I think it should, but it's by no means comprehensive (just testing for crashes).
Of which I have found a few in our seldom-used annotations, namely StrandArtifact and ReferenceBases. These have been resolved now.
OxoGCounts is probably going to be obsolete soon. Not sure if @davidbenjamin still cares about MappingQuality (median MQ per allele). DepthPerAlleleBySample is AD, so that should be tested in a lot of places -- maybe that's a fluke too. I think annotation coverage is good enough as is.
MappingQuality is a default annotation of Mutect2 and remains useful to us.
@ldgauthier @droazen Responded to first round of comments
A warning: in the last commit I ended up changing the behavior of LikelihoodRankSumTest, which used to crash when you ran it without likelihoods, to instead just not emit the annotation, in keeping with other annotations. There may be a reason to keep its previous behavior.
Branch force-pushed from c629e1c to 980e5cc
@jamesemery Can you rebase this branch onto latest master to resolve the conflicts? Recommend doing a local squash first given the number of commits here to make it less painful (by "local squash" I mean first
Branch force-pushed from 5033898 to a42e5c6
@droazen Rebased and resolved the conflicts
Hi guys, is this good to go?
Stylistic stuff.
final Cigar c = read.getCigar();
CigarElement last = c.getCigarElement(c.numCigarElements() - 1);

int numEndClippedBases = 0;
I would avoid the non-final primitive by:
final int terminalHardClipCount = last.getOperator() == CigarOperator.H ? last.getLength() : 0;
. . .
for loop to calculate terminalLowQualCount
return terminalHardClipCount + terminalLowQualCount;
Cleaned up slightly to be more readable.
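A runnable version of the reviewer's suggestion, with hypothetical stand-ins for the CIGAR element and quality array (the real code would read these from the GATKRead):

```java
public class ClipCountSketch {
    // Count trailing clipped bases using final locals instead of mutating one int.
    static int terminalClippedBases(boolean lastElementIsHardClip, int lastElementLength,
                                    int[] quals, int lowQualThreshold) {
        final int terminalHardClipCount = lastElementIsHardClip ? lastElementLength : 0;
        // Count the run of low-quality bases at the end of the read.
        int terminalLowQualCount = 0;
        for (int i = quals.length - 1; i >= 0 && quals[i] < lowQualThreshold; i--) {
            terminalLowQualCount++;
        }
        return terminalHardClipCount + terminalLowQualCount;
    }

    public static void main(String[] args) {
        // 5 hard-clipped bases plus a 2-base low-quality tail (quals 12 and 8 are below 20).
        System.out.println(terminalClippedBases(true, 5, new int[]{30, 30, 12, 8}, 20)); // 7
    }
}
```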
final Cigar c = read.getCigar();
final CigarElement first = c.getCigarElement(0);

int numStartClippedBases = 0;
See comment below
readPos = numAlignedBases - (initialReadPosition + 1);
}
return readPos;
extra whitespace
if (initialReadPosition > numAlignedBases / 2) {
readPos = numAlignedBases - (initialReadPosition + 1);
}
return readPos;
return Math.min(initialReadPosition, numAlignedBases - (initialReadPosition + 1))
or something like that ought to work.
This would work, but I'm trying to preserve bugged behavior, which means this is subtly different from what was there before.
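The subtle difference sits at the exact midpoint. A sketch comparing the branchy GATK3-style code with the Math.min one-liner (values are hypothetical, and both methods are stand-ins for illustration):

```java
public class ReadPosSketch {
    // GATK3-style: only fold the position when it is strictly past the midpoint.
    static int gatk3Style(int initialReadPosition, int numAlignedBases) {
        int readPos = initialReadPosition;
        if (initialReadPosition > numAlignedBases / 2) {
            readPos = numAlignedBases - (initialReadPosition + 1);
        }
        return readPos;
    }

    // Reviewer's suggestion: always take the distance to the nearer read end.
    static int minStyle(int initialReadPosition, int numAlignedBases) {
        return Math.min(initialReadPosition, numAlignedBases - (initialReadPosition + 1));
    }

    public static void main(String[] args) {
        // Agree away from the midpoint...
        System.out.println(gatk3Style(2, 10) + " " + minStyle(2, 10)); // 2 2
        // ...but differ at the midpoint: 5 > 10/2 is false, so the branch is skipped.
        System.out.println(gatk3Style(5, 10) + " " + minStyle(5, 10)); // 5 4
    }
}
```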
//TODO: this code may not even get used because HaplotypeCaller already hard clips low quality tails
for (int i = unclippedReadBases.length - numEndClippedBases - 1; i >= 0; i--) {
if (unclippedReadQuals[i] < 20) { //TODO the 20 hard value here is in lieu of PairHMM.BASE_QUALITY_SCORE_THRESHOLD in order to directly match GATK3 output
extra whitespace
if (refContext.getBases().length ==0 || BaseUtils.simpleBaseToBaseIndex(refContext.getBase()) != -1 ) {
//TODO remove this filter and update the tests, we limit reads starting within a spanning deletion to match GATK3 at the expense of matching haplotype caller
List<GATKRead> reads = new ArrayList<>();
readsContext.iterator().forEachRemaining(r -> {if (r.getStart()<=vc.getStart()) {reads.add(r);}});
The `if` statement could be replaced by a `filter` so you could have:
final List<GATKRead> reads = Utils.stream(readsContext).filter(r -> r.getStart() <= vc.getStart()).collect(Collectors.toList());
true, changed

VariantContext annotatedVC = annotatorEngine.annotateContext(vc, fc, refContext, likelihoods, a -> true);
vcfWriter.add(annotatedVC);
extra whitespace
for ( final VAExpression expression : expressions ) {
List<VariantContext> variantContexts = features.getValues(expression.binding, vc.getStart());

if (!variantContexts.isEmpty()) {
There is too much nesting here, but you can reduce it with `if (variantContexts.isEmpty()) { continue; }`
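The guard-clause suggestion above is a general de-nesting trick: handle the uninteresting case with an early `continue` so the real work stays at the top indent level. A small sketch under hypothetical names (`nested`/`guarded` are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class GuardClauseSketch {
    // Nested form: the useful work sits one indentation level deeper.
    static List<String> nested(List<List<String>> groups) {
        List<String> firsts = new ArrayList<>();
        for (List<String> group : groups) {
            if (!group.isEmpty()) {
                firsts.add(group.get(0));
            }
        }
        return firsts;
    }

    // Guard-clause form: skip the empty case up front, same behavior.
    static List<String> guarded(List<List<String>> groups) {
        List<String> firsts = new ArrayList<>();
        for (List<String> group : groups) {
            if (group.isEmpty()) {
                continue;
            }
            firsts.add(group.get(0));
        }
        return firsts;
    }

    public static void main(String[] args) {
        List<List<String>> groups = List.of(List.of("x", "y"), List.of(), List.of("z"));
        // Both forms collect the first element of each non-empty group: [x, z]
        System.out.println(nested(groups));
        System.out.println(guarded(groups));
    }
}
```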
} else if (expression.fieldName.equals("ALT")) {
    attributes.put(expression.fullName, expressionVC.getAlternateAllele(0).getDisplayString());
} else if (expression.fieldName.equals("FILTER")) {
    if (expressionVC.isFiltered()) {
final String filterString = expressionVC.isFiltered() ? expressionVC.getFilters().toString().replace("[", "").replace("]", "").replace(" ", "") : "PASS";
attributes.put(expression.fullName, filterString);
yea, like many places in this pr, I should have cleaned these things up
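The suggestion above collapses two branches that both end in the same `put` call into a single conditional expression. A minimal sketch of the before/after, with hypothetical names (`putBranchy`/`putCollapsed` and the `"FILTER"` key are illustrative):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class TernaryPutSketch {
    // Branchy form: both arms end in the same put call.
    static void putBranchy(Map<String, String> attributes, boolean filtered, Set<String> filters) {
        if (filtered) {
            attributes.put("FILTER", String.join(";", filters));
        } else {
            attributes.put("FILTER", "PASS");
        }
    }

    // Collapsed form: compute the value once, then a single put.
    static void putCollapsed(Map<String, String> attributes, boolean filtered, Set<String> filters) {
        final String filterString = filtered ? String.join(";", filters) : "PASS";
        attributes.put("FILTER", filterString);
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        Map<String, String> b = new HashMap<>();
        putBranchy(a, true, Set.of("LowQual"));
        putCollapsed(b, true, Set.of("LowQual"));
        System.out.println(a.equals(b)); // both maps hold FILTER=LowQual
    }
}
```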
// check that ref and alt alleles are the same
List<Allele> exprAlleles = biallelicVC.getAlleles();
boolean isAlleleConcordant = false;
int i = 0;
There has to be a better way. This variable can be incremented in two different places, it is related to the for loop without being syntactically tied to it, and you only need it for the check `i == 0`.
it's actually used in several places below to pull the ith element out of expressionValuesList
@jamesemery I am well aware that everything I complained about is old code for which you are guiltless.
@jamesemery A bunch of comments, some trivial, some major but maybe can be pushed back. I haven't gotten into the tests at all yet.
@@ -46,7 +46,7 @@
private ReadFilter wellFormedFilter = null;

// Command line parser requires a no-arg constructor
public WellformedReadFilter() {
public WellformedReadFilter() {
unnecessary space here
final int numAlignedBases = getNumAlignedBases(read);

int readPos = initialReadPosition;
//TODO: this doesn't work for the middle-right position if we index from zero
this todo worries me
this method worries me but it was necessary to mimic gatk3
/**
 *
 * @param read a read containing the variant
comment is confusing, what's "the variant" in this context?
done
 */
public static int getNumClippedBasesAtEnd(final GATKRead read) {
    // check for hard clips (never consider these bases):
    final Cigar c = read.getCigar();
`read.getCigar()` copies the cigar, it's more efficient to use `read.getCigarElement()`
done
// and may leave a string of Q2 bases still hanging off the reads.
//TODO: this code may not even get used because HaplotypeCaller already hard clips low quality tails
for (int i = unclippedReadBases.length - numEndClippedBases - 1; i >= 0; i--) {
    if (unclippedReadQuals[i] < 20) { //TODO the 20 hard value here is in lieu of PairHMM.BASE_QUALITY_SCORE_THRESHOLD in order to directly match GATK3 output
should this be updated?
it breaks tests because a handful of reads have an 18-20 quality score at the ends with this annotation.
}

/**
 * Is this container expected to have the per-allele likelihoods calculations filled in.
comment should reflect that this is always false for this class
 * @return
 */
public Map<String, List<PileupElement>> getStratifiedPileups(final Locatable loc) {
    if (stratifiedPileups != null) {
This seems like an awkward duplication of the functionality from ReadsPileup.splitBySample(), but this is probably more efficient. Maybe we can unify / speed that up? Save for future maybe.
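The "split a pileup by sample" operation discussed above is essentially a single-pass bucketing, which `Collectors.groupingBy` expresses directly. A dependency-free sketch using a hypothetical `Element` record in place of `PileupElement` (GATK's real classes are not used here):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StratifyBySampleSketch {
    // Hypothetical stand-in for a pileup element that knows its sample.
    record Element(String sample, char base) {}

    // One pass over the pileup, bucketing elements by sample name.
    static Map<String, List<Element>> stratify(List<Element> pileup) {
        return pileup.stream().collect(Collectors.groupingBy(Element::sample));
    }

    public static void main(String[] args) {
        List<Element> pileup = List.of(
                new Element("NA12878", 'A'),
                new Element("NA12891", 'C'),
                new Element("NA12878", 'A'));
        System.out.println(stratify(pileup).get("NA12878").size()); // 2
    }
}
```

Caching the resulting map (as the quoted `stratifiedPileups != null` check does) then avoids recomputing the split on every annotation query.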
/**
 * Is this container expected to have the per-allele likelihoods calculations filled in.
 */
public boolean hasFilledLikelihoods() {
missing `@Override`s here and elsewhere
import java.util.Map;

//TODO comment this off
public class UnfilledReadsLikelihoods<A extends Allele> extends ReadLikelihoods<A> {
I don't like this as a subclass, I feel like it's almost certainly better as a second implementation of an abstract class
I will do this in another pass, I have opened a ticket.
 * or {@code reads} is {@code null},
 * or if they contain null values.
 */
public UnfilledReadsLikelihoods(SampleList samples, AlleleList<A> alleles, Map<String, List<GATKRead>> reads) {
I suspect there are a lot of other methods on this class that should be overriden to throw. I feel like this whole thing is kind of problematic and doesn't result in clean code elsewhere. I don't have a better design in my head, since somewhere we essentially need a switch that does totally different things with totally different input data. Maybe the right thing is to have 2 different methods in annotate that accept different data? We have to get this in soon though so maybe we should revisit.
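The design being debated — a "filled" and an "unfilled" variant of the same likelihoods API — can be sketched as two implementations of an abstract base, where the unfilled one fails fast on per-allele queries instead of silently returning garbage. All class and method names below are hypothetical; this is a design sketch, not GATK's actual class hierarchy:

```java
public class LikelihoodsDesignSketch {
    // Hypothetical abstract base: shared API, subclasses declare what they support.
    abstract static class AlleleLikelihoods {
        abstract boolean hasFilledLikelihoods();
        abstract double likelihoodOf(String sample, String allele);
    }

    // "Filled" implementation: real per-allele values are available.
    static class FilledLikelihoods extends AlleleLikelihoods {
        @Override boolean hasFilledLikelihoods() { return true; }
        @Override double likelihoodOf(String sample, String allele) { return -0.1; } // placeholder value
    }

    // "Unfilled" implementation: only reads are available; per-allele queries fail fast.
    static class UnfilledLikelihoods extends AlleleLikelihoods {
        @Override boolean hasFilledLikelihoods() { return false; }
        @Override double likelihoodOf(String sample, String allele) {
            throw new UnsupportedOperationException("likelihoods were never computed");
        }
    }

    public static void main(String[] args) {
        AlleleLikelihoods unfilled = new UnfilledLikelihoods();
        System.out.println(unfilled.hasFilledLikelihoods()); // false
        try {
            unfilled.likelihoodOf("NA12878", "A");
        } catch (UnsupportedOperationException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Annotations that need filled likelihoods can then check `hasFilledLikelihoods()` up front, and any path that forgets gets an immediate exception rather than a wrong answer.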
…to the annotation engine and client annotations to support annotating variants independent of the haplotype caller.
a few changes requested of test files
);

//TODO because of differences between how GATK3 and GATK4 handle capturing reads for spanning deletions (Namely 3 only looks for reads overlapping the first site, 4 gets all reads over the span)
//TODO then we want ot ignore affected attributes for concordance tests |
ot -> to
also, this is unused
import java.util.stream.Collectors;

public class VariantAnnotatorIntegrationTest extends CommandLineProgramTest {
    final static String STANDARD_ANNOTATIONS = " -G StandardAnnotation ";
this is unused
final String testFile = getToolTestDataDir() + "vcfexample2.vcf";

public static String baseTestString() {
so is this method
final VCFHeader expectedHeader = getHeaderFromFile(expected);
final VCFHeader inputHeader = getHeaderFromFile(input);

Iterator<VCFHeaderLine> inputItr = inputHeader.getMetaDataInSortedOrder().iterator();
I think this could be simplified by like:
final Iterator<VCFHeaderLine> itr = expectedHeader.getMetaDataInSortedOrder().stream().filter(line -> !shouldFilterLine(line, linesToIgnore)).iterator();
final Iterator<VCFHeaderLine> itr2 = inputHeader.getMetaDataInSortedOrder().stream().filter(line -> !shouldFilterLine(line, linesToIgnore)).iterator();
Assert.assertEquals(itr, itr2);
    return header;
}

private static <T> void assertForEachElementInLists(final List<T> actual, final List<T> expected, final BiConsumer<T, T> assertion) {
this shows up in 3 places in the code now, is there somewhere common it could live?
there is an issue to extract a common interface for this, it's just nontrivial enough that I haven't done it yet in any of the branches. So there it languishes....
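Until a common home is found, the helper under discussion is small enough to sketch standalone — assert matching sizes, then apply a caller-supplied assertion pairwise. This version uses plain `AssertionError` instead of a test framework, so it is a sketch, not the project's actual utility:

```java
import java.util.List;
import java.util.function.BiConsumer;

public class PairwiseAssertSketch {
    // Shared helper: check equal sizes, then apply the assertion to each aligned pair.
    static <T> void assertForEachElementInLists(List<T> actual, List<T> expected, BiConsumer<T, T> assertion) {
        if (actual.size() != expected.size()) {
            throw new AssertionError("lists differ in size: " + actual.size() + " vs " + expected.size());
        }
        for (int i = 0; i < actual.size(); i++) {
            assertion.accept(actual.get(i), expected.get(i));
        }
    }

    public static void main(String[] args) {
        // Passes: every pair satisfies the supplied assertion.
        assertForEachElementInLists(List.of(1, 2, 3), List.of(1, 2, 3), (a, e) -> {
            if (!a.equals(e)) throw new AssertionError(a + " != " + e);
        });
        System.out.println("all pairs matched");
    }
}
```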
    return VCs;
}

private static VCFHeader getHeaderFromFile(final File vcfFile) throws IOException {
try(VCFFileReader reader = new VCFFileReader(vcfFile)){
return reader.getFileHeader();
}
is cleaner, or
try(FeatureDataSource<VariantContext> reader = new FeatureDataSource<>(vcfFile)){
return (VCFHeader)reader.getHeader();
}
if you want to be more gatk but more verbose
//==================================================================================================================

@Test
public void GATK3LargeConcoranceTest() throws IOException {
typo, not concordant with Concordance
final File expected = new File(getToolTestDataDir() + "expected/testHasAnnotsAsking1.vcf");
final VCFHeader header = getHeaderFromFile(expected);
runVariantAnnotatorAndAssertSomething(getTestFile("vcfexamplemultisample.vcf"), new File(getToolTestDataDir() + "expected/testHasAnnotsAsking1.vcf"), Arrays.asList("-G", "Standard", "-I", largeFileTestDir + "CEUTrio.multisample.b37.1M-1M50k.bam"),
        (a, e) -> {
factor out this method since it's repeated 3 times
also, IntelliJ suggested that instead of `!anyMatch` we might prefer `noneMatch`
actually, this might be a good candidate for a dataprovider since there are a bunch of test that call the same code but use different data
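The data-provider idea above — one shared test body driven by a table of cases — can be sketched without TestNG by modeling each row as a record; in real TestNG the table would come back as an `Object[][]` from a `@DataProvider` method. The `expected/testA.vcf`-style paths and extra arguments below are illustrative placeholders, not actual test resources:

```java
import java.util.List;

public class DataProviderSketch {
    // One test case: input VCF, expected VCF, extra tool arguments.
    record Case(String inputVcf, String expectedVcf, List<String> extraArgs) {}

    // The table a TestNG @DataProvider would expose, one row per concordance test.
    static List<Case> concordanceCases() {
        return List.of(
                new Case("vcfexample2.vcf", "expected/testA.vcf", List.of("-G", "Standard")),
                new Case("vcfexamplemultisample.vcf", "expected/testB.vcf",
                        List.of("-G", "Standard", "-I", "trio.bam")));
    }

    // The single shared test body, run once per row (here it just describes the case).
    static String runCase(Case c) {
        return "annotating " + c.inputVcf() + " with " + c.extraArgs();
    }

    public static void main(String[] args) {
        for (Case c : concordanceCases()) {
            System.out.println(runCase(c));
        }
    }
}
```

With this shape, adding a fourth concordance test is one new row rather than another copy of the assertion lambda.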
final LineIterator lineIterator = codec.makeSourceFromStream(new PositionalBufferedStream(s));
codec.readHeader(lineIterator);

final VCFCodec codecAnn = new VCFCodec();
Use `FeatureDataSource` or `VCFFileReader` here instead of raw codecs and input streams I think. Also, be sure to put closeables in try()
//
//
//    @Test(enabled = true)
//    public void testChromosomeCountsPed() {
these are waiting for the next PR?
@@ -41,4 +42,10 @@ public static OptionalDouble getReadBaseQuality(final GATKRead read, final int r
    Utils.nonNull(read);
    return OptionalDouble.of(read.getBaseQuality(ReadUtils.getReadCoordinateForReferenceCoordinateUpToEndOfRead(read, refLoc, ReadUtils.ClippingTail.RIGHT_TAIL)));
}

@Override
// When we have a pileupe element we only need its underlying read in order to com
double typo! plileupE in order to com... pute?
    this.binding = binding.get();
}

public void sethInfo(VCFInfoHeaderLine hInfo) {
ah, my good friend Seth Info
@@ -46,7 +46,8 @@

private final String printFormat = "%.2f";

private static final OneShotLogger logger = new OneShotLogger(AS_RMSMappingQuality.class);
private static final OneShotLogger allele_logger = new OneShotLogger(AS_RMSMappingQuality.class);
non-standard, maybe alleleNotFoundLogger? Same with below
 * A container object for storing the objects necessary for carrying over expression annotations.
 * It holds onto the source feature input as well as any relevant header lines in order to alter the vcfHeader.
 */
public static class VAExpression {
can this be private now?
updates were made, this needs to go in
Most things were addressed here. A bunch of follow-on tickets were created to address more complicated refactorings that we don't have time to hit now.
Resolves #2493
Resolves #51