-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GVCF "reblocking" code to compress homRef blocks #4940
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier I did a pass on this. Sorry for the deluge of stylistic comments. They're mostly very simple changes. Thanks for porting this to gakt4!
import java.util.*; | ||
|
||
/** | ||
* Perform joint genotyping on gVCF files produced by HaplotypeCaller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This copied comment isn't right. Needs up to date javadoc.
@DocumentedFeature | ||
public class ReblockGVCF extends VariantWalker { | ||
|
||
private static String GVCF_BLOCK = "GVCFBlock"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expose this as a constant in GVCFWriter and then use that constant here, and in GenotypeGVCFs as well as in rangeToVCFHeaderLine in GVCFWriter
public class ReblockGVCF extends VariantWalker { | ||
|
||
private static String GVCF_BLOCK = "GVCFBlock"; | ||
private static int PLOIDY_TWO = 2; //FIXME: hack to avoid extra arg collections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it final in any case
programGroup = OtherProgramGroup.class, | ||
omitFromCommandLine = true) | ||
@DocumentedFeature | ||
public class ReblockGVCF extends VariantWalker { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should probably be final
private static String GVCF_BLOCK = "GVCFBlock"; | ||
private static int PLOIDY_TWO = 2; //FIXME: hack to avoid extra arg collections | ||
|
||
private VariantContextWriter vcfWriter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would put the mutable fields after the argument definitions
|
||
final ArgumentsBuilder args = new ArgumentsBuilder(); | ||
args.addReference(new File(b37_reference_20_21)) | ||
.addArgument("V", getToolTestDataDir() + "NA12878.prod.chr20snippet.g.vcf") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The large g.vcf and expected result should be put into src/test/resources/large since they're pretty big. It would be good to bgzip them too.
import java.util.Collections; | ||
|
||
/** | ||
* Created by gauthier on 10/2/17. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we typically don't include these since git captures it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally found where to turn that off (Preferences > Editor > File and Code Templates > Includes).
return null; | ||
//there's a use case here related to ReblockGVCFs for overlapping deletions on different haplotypes | ||
if ( vc.getStart() <= nextAvailableStart && vc.getContig().equals(contigOfNextAvailableStart) ) { | ||
if (vc.getEnd() <= nextAvailableStart) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a unit case that triggers the first condition but not the second one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I have new tests in GVCFWriterUnitTests.
// Remove GCVFBlocks | ||
headerLines.removeIf(vcfHeaderLine -> vcfHeaderLine.getKey().startsWith(GVCF_BLOCK)); | ||
|
||
headerLines.remove(GATKVCFHeaderLines.getInfoLine(GATKVCFConstants.DOWNSAMPLED_KEY)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if someone adds these in annotationsToKeep?
* @param originalVC the combined genomic VC | ||
* @return a new VariantContext or null if the site turned monomorphic and we don't want such sites | ||
*/ | ||
protected VariantContext regenotypeVC(final ReferenceContext ref, final VariantContext originalVC) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method named / documented correctly? It seems more like reblocking them with some minimal stripping out of alleles rather than regenotyping.
@lbergelson I addressed all your comments. I had some difficulty reconciling the list of Strings for annotation names to remove with the List for the getDefaultVariantAnnotations update, so there may be a few tests failing because the RAW_MQ header disappeared. I'll deal with those when Travis is done, but could you take a look at the other changes? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ldgauthier A few more comments minor cleanup comments and then I think it's good to go. It is failing a bunch of tests due to the change to the RAW_MQ tag. See integration tests and Unit test reports.
It looks like there's 1 place where you meant to add a test and forgot.
//obvious cases | ||
if (vc == null || vc.getAlternateAlleles().isEmpty()) { | ||
return false; | ||
} else if (vc.isBiallelic()) { | ||
return !(isSpanningDeletion(vc.getAlternateAllele(0)) || vc.isSymbolic()); | ||
return !(GATKVCFConstants.isSpanningDeletion(vc.getAlternateAllele(0)) || vc.isSymbolic()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for doing this cleanup.
import java.util.*; | ||
|
||
/** | ||
* Perform joint genotyping on gVCF files produced by HaplotypeCaller |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this line is still out of place?
* <pre> | ||
* gatk ReblockGVCF \ | ||
* -R reference.fasta \ | ||
* --variant sample1.g.vcf \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just make this -V to match the others.
doc="File to which variants should be written") | ||
private File outputFile; | ||
|
||
@Argument(fullName=GenotypeGVCFs.ALL_SITES_LONG_NAME, shortName=GenotypeGVCFs.ALL_SITES_SHORT_NAME, doc="Include loci found to be non-variant after genotyping") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little weird to make a boolean argument default to true. Should the name be inverted so that you specify it to disable it?
public GenotypeCalculationArgumentCollection genotypeArgs = new GenotypeCalculationArgumentCollection(); | ||
|
||
@Advanced | ||
@Argument(fullName="GVCF-GQ-band", shortName="GQB", doc="Exclusive upper bounds for reference confidence GQ bands " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should use the new GQ_BAND_LONG_NAME constants
Assert.assertTrue(!newG.hasAD()); | ||
} | ||
|
||
//TODO: these are duplicated from PosteriorProbabilitiesUtilsUnitTest but PR #4947 modifies VariantContextTestUtils, so I'll do some refactoring before the second of the two is merged |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remember to resolve this when #4947 is completed.
} | ||
|
||
@Test | ||
public void testLowQualVariantToGQ0HomRef() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
forgotten test?
import java.util.Map; | ||
|
||
public class ReblockGVCFUnitTest { | ||
private final ReblockGVCF reblocker = new ReblockGVCF(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just create this in each test that uses it, otherwise we potentially get weird stateful issues across multiple tests.
import java.util.List; | ||
import java.util.Map; | ||
|
||
public class ReblockGVCFUnitTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding these tests. Complicated functions like these have a tendency to lose bits of logic later on if there aren't direct tests for them.
final Allele alt1 = Allele.create("T", false); | ||
final Allele ref2 = Allele.create("TACACACACACTACTA", true); | ||
final Allele ref3 = Allele.create("C", true); | ||
final VariantContext deletion1 = new VariantContextBuilder(null, "1", 10000, 10020, Arrays.asList(ref1, alt1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setting these things up is always such a pain, thank you for writing it
Codecov Report
@@ Coverage Diff @@
## master #4940 +/- ##
===============================================
- Coverage 86.753% 86.655% -0.098%
+ Complexity 29767 29110 -657
===============================================
Files 1825 1811 -14
Lines 137744 135032 -2712
Branches 15181 14982 -199
===============================================
- Hits 119497 117012 -2485
+ Misses 12729 12576 -153
+ Partials 5518 5444 -74
|
d7ed84e
to
0d1bb61
Compare
@lbergelson this is good to go, unless my other PR gets merged first, in which case I'll need to update the expected test data. |
That person might have been me... |
src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/ReblockGVCF.java
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/ReblockGVCF.java
Show resolved
Hide resolved
* | ||
* <h3>Caveats</h3> | ||
* <p>Only single-sample gVCF files produced by HaplotypeCaller can be used as input for this tool.</p> | ||
* <p>By default this tool only passes through annotations used by VQSR. A different set of annotations can be specified with the usual -A argument. Annotation groups are not supported by this tool.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think annotation groups are automatically supported and propagated to the annotation engine via makeVariantAnnotations
, unless the tool manually intervenes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree -- that was probably just a leftover from the GATK3 version.
src/main/java/org/broadinstitute/hellbender/tools/walkers/variantutils/ReblockGVCF.java
Show resolved
Hide resolved
551ae34
to
13db51e
Compare
@ldgauthier I think there's an newly introduce compile warning because of the method names that I deprecated in htsjdk. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me with a quick re-review. There's a compilation warning that's failing things due to that deprecated Indeces function, quick fix to that and I think you're good to merge 👍
…GVCFs for smaller filesize
13db51e
to
22c4be1
Compare
* adds ReblockGVCF a tool to merge reference blocks in single-sample GVCFs for smaller file size
This is a port of the GATK3 version I actually ran. Some differences in the genotyping engines and the fixed median calculation in MathUtils between GATK3 and GATK4 make the results slightly different in some cases, but still within tolerances.