Enable SelectVariants to drop specific annotation fields from output vcf. #5254

Merged
kachulis merged 11 commits into master from ck_5235_SelectVariants_drop_annotations
Oct 16, 2018

Contributor

@kachulis kachulis commented Oct 3, 2018

Requested in #5235. Specific info annotation fields can be dropped from output vcf by specifying them with -DA/--drop-annotation, and genotype annotations can be dropped with -DGA/--drop-genotype-annotation. (I would have used @vdauwera's initial suggestion of -DF, but it is already used for disable-read-filters). Annotations can be used for selection of variants while simultaneously being dropped from the output vcf.

@droazen droazen requested a review from cmnbroad October 3, 2018 18:55
@droazen droazen requested a review from ldgauthier October 3, 2018 18:55

codecov-io commented Oct 3, 2018

Codecov Report

Merging #5254 into master will increase coverage by 0.009%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #5254       +/-   ##
===============================================
+ Coverage     86.753%   86.762%   +0.009%     
- Complexity     29767     29849       +82     
===============================================
  Files           1825      1828        +3     
  Lines         137744    138150      +406     
  Branches       15181     15234       +53     
===============================================
+ Hits          119497    119862      +365     
- Misses         12729     12741       +12     
- Partials        5518      5547       +29
| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...der/tools/walkers/variantutils/SelectVariants.java | 81.199% <100%> (+1.553%) | 132 <0> (+13) | ⬆️ |
| ...rs/variantutils/SelectVariantsIntegrationTest.java | 100% <100%> (ø) | 71 <2> (+2) | ⬆️ |
| ...itute/hellbender/engine/filters/VariantFilter.java | 33.333% <0%> (-33.333%) | 1% <0%> (-1%) | |
| ...e/hellbender/utils/variant/GATKVCFHeaderLines.java | 95.597% <0%> (-3.736%) | 10% <0%> (ø) | |
| ...titute/hellbender/tools/walkers/GenotypeGVCFs.java | 89.831% <0%> (-0.085%) | 46% <0%> (-3%) | |
| ...nder/utils/variant/writers/GVCFWriterUnitTest.java | 96.789% <0%> (-0.07%) | 48% <0%> (+3%) | |
| ...institute/hellbender/engine/VariantWalkerBase.java | 100% <0%> (ø) | 14% <0%> (ø) | ⬇️ |
| ...org/broadinstitute/hellbender/utils/MathUtils.java | 78.17% <0%> (ø) | 209% <0%> (ø) | ⬇️ |
| ...ute/hellbender/utils/variant/GATKVCFConstants.java | 80% <0%> (ø) | 4% <0%> (ø) | ⬇️ |

... and 16 more

Collaborator

@cmnbroad cmnbroad left a comment

Looks pretty good overall to me. A few minor requests.

/**
* Info annotation fields to be dropped
*/
@Argument(fullName = "drop-annotation", shortName = "DA", optional = true, doc = "Set info fields to drop from output vcf")
Collaborator

Since drop-genotype-annotation is qualified with genotype, I think the other arg name should also be qualified, maybe drop-info-annotation, since otherwise it's ambiguous by itself. Also, can you align the argument name, the variable name, and the help string so that they consistently refer to either field or annotation? Nit: I would suggest dropping the word "Set" from the doc string.

@@ -508,6 +516,16 @@ public void onTraversalStart() {
actualLines = headerLines;
}
}
if (!infoFieldsToDrop.isEmpty()) {
for (String infoField : infoFieldsToDrop) {
logger.info("Will drop info annotation: " + infoField);
Collaborator

Another nit: we're not consistent about this, but prefer String.format to string concatenation.
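
As an illustration of the nit above, here is a minimal, self-contained sketch of building the same message with String.format. The class name DropLogMessages and the method messagesFor are hypothetical and not part of the PR; the real code logs through the tool's logger rather than returning strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, used only to illustrate the review comment.
final class DropLogMessages {
    // Builds one message per dropped field with String.format instead of "+" concatenation.
    static List<String> messagesFor(final List<String> infoFieldsToDrop) {
        return infoFieldsToDrop.stream()
                .map(field -> String.format("Will drop info annotation: %s", field))
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        // Prints one line per field name passed in.
        messagesFor(Arrays.asList("AF", "DP")).forEach(System.out::println);
    }
}
```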

@@ -508,6 +516,16 @@ public void onTraversalStart() {
actualLines = headerLines;
}
}
if (!infoFieldsToDrop.isEmpty()) {
for (String infoField : infoFieldsToDrop) {
Collaborator

final

//remove header lines for info field and genotype annotations being dropped
List<VCFHeaderLine> headerLinesToRemove = new ArrayList<>();
List<VCFInfoHeaderLine> infoHeaderLines = headerLines.stream().filter(l -> l instanceof VCFInfoHeaderLine).map(l -> (VCFInfoHeaderLine) l).collect(Collectors.toList());
for (VCFInfoHeaderLine infoHeaderLine : infoHeaderLines) {
Collaborator

I would suggest either replacing this loop with a single line, infoHeaderLines.removeIf(l -> infoFieldsToDrop.contains(l.getID())), or, since you've already created a stream above, just adding another .filter call to the stream.

Collaborator

also finals for all of these
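
The two alternatives suggested above (removeIf, or one more filter step on the stream) can be sketched as follows. This is a simplified illustration, not the PR's code: HeaderLine is a hypothetical stand-in for the real VCF header line classes, of which the sketch only assumes a getID() accessor. The locals are declared final, in line with the follow-up comment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class HeaderLineFiltering {
    // Hypothetical stand-in for a VCF header line; only the ID matters here.
    static final class HeaderLine {
        private final String id;
        HeaderLine(final String id) { this.id = id; }
        String getID() { return id; }
    }

    // Option 1: copy the list, then remove the dropped IDs in place with removeIf.
    static List<String> keepWithRemoveIf(final List<HeaderLine> lines, final Set<String> idsToDrop) {
        final List<HeaderLine> kept = new ArrayList<>(lines);
        kept.removeIf(l -> idsToDrop.contains(l.getID()));
        return kept.stream().map(HeaderLine::getID).collect(Collectors.toList());
    }

    // Option 2: add one more filter step to a stream pipeline.
    static List<String> keepWithFilter(final List<HeaderLine> lines, final Set<String> idsToDrop) {
        return lines.stream()
                .filter(l -> !idsToDrop.contains(l.getID()))
                .map(HeaderLine::getID)
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        final List<HeaderLine> lines =
                Arrays.asList(new HeaderLine("AF"), new HeaderLine("DP"), new HeaderLine("MQ"));
        final Set<String> drop = new HashSet<>(Arrays.asList("DP"));
        System.out.println(keepWithRemoveIf(lines, drop));
        System.out.println(keepWithFilter(lines, drop));
    }
}
```

Both variants leave the input list untouched; the removeIf version works on a copy because a list handed in by a caller may not be modifiable.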

}
}
List<VCFFormatHeaderLine> formatHeaderLines = headerLines.stream().filter(l -> l instanceof VCFFormatHeaderLine).map(l -> (VCFFormatHeaderLine) l).collect(Collectors.toList());
for (VCFFormatHeaderLine formatHeaderLine : formatHeaderLines) {
Collaborator

Same comment(s) here.

}
rmAnnotationsBuilder.genotypes(GenotypesContext.create(genotypesToWrite));
final VariantContext variantContextToWrite = rmAnnotationsBuilder.make();

Collaborator

It's a little awkward that most of the code from here down (except for the actual .add) has to continue to use filteredGenotypeToNocall rather than the current working variant, which is in variantContextToWrite. It would be good to add a comment explaining that this is deliberate and required so that the user can select on attributes that are also dropped.

Contributor Author

I agree, and I actually think the best option is moving the creation of variantContextToWrite to just before it is added to vcfWriter. This clears up the confusion you mentioned, and also means we aren't unnecessarily creating variantContextToWrite in cases where the variant will not be written.
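
The ordering discussed in this exchange (run selection against the full record, and build the stripped copy only immediately before writing) can be sketched in plain Java. This is an assumption-laden simplification: records are modeled as attribute maps, the "writer" is just a list, and the names SelectThenDrop and selectThenDrop are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

final class SelectThenDrop {
    // Selection runs on the untouched record, so a key may be used for
    // selection and still be absent from the written output.
    static List<Map<String, Object>> selectThenDrop(final List<Map<String, Object>> records,
                                                    final Predicate<Map<String, Object>> selector,
                                                    final Set<String> keysToDrop) {
        final List<Map<String, Object>> written = new ArrayList<>();
        for (final Map<String, Object> record : records) {
            if (!selector.test(record)) {
                continue; // rejected records never pay for a stripped copy
            }
            // Build the stripped copy only at the point of writing.
            final Map<String, Object> toWrite = new LinkedHashMap<>(record);
            toWrite.keySet().removeAll(keysToDrop);
            written.add(toWrite);
        }
        return written;
    }

    public static void main(final String[] args) {
        final Map<String, Object> a = new LinkedHashMap<>();
        a.put("AF", 0.5);
        a.put("DP", 10);
        final Map<String, Object> b = new LinkedHashMap<>();
        b.put("AF", 0.01);
        b.put("DP", 20);
        // Select on AF while also dropping AF from the output.
        System.out.println(selectThenDrop(Arrays.asList(a, b),
                r -> ((Double) r.get("AF")) > 0.1,
                Collections.singleton("AF")));
    }
}
```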

}

@Test(dataProvider = "dropAnnotationsDataProvider")
public void testDropAnnotations(String args, String expectedFile, String name) throws IOException {
Collaborator

I was going to suggest removing 'name', but it provides nice documentation. Maybe change the name to testName or something.

Contributor Author

kachulis commented Oct 4, 2018

back to you, @cmnbroad

Collaborator

@cmnbroad cmnbroad left a comment

One pretty small perf change requested, otherwise this looks good, unless @ldgauthier has any other comments.

}
rmAnnotationsBuilder.genotypes(GenotypesContext.create(genotypesToWrite));
final VariantContext variantContextToWrite = rmAnnotationsBuilder.make();
vcfWriter.add(variantContextToWrite);
Collaborator

Looks good, except I just realized that after this PR, we'll always recreate the entire VC and all the genotypes, even in the common case where there are no dropped fields. It might be worth moving this code out to a separate method and only making the changes that are actually necessary (i.e., if nothing is dropped, just return the input vc, and if only info fields are dropped, avoid recreating all the genotypes).
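
A rough sketch of the short-circuiting requested above, using a hypothetical and much simplified Variant class (an INFO map plus one attribute map per sample) rather than the real VariantContext and builder types. The method name dropAnnotations is also an assumption, not the name used in the merged code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DropOnlyWhenNeeded {
    // Hypothetical, simplified record: INFO attributes plus one attribute map per sample.
    static final class Variant {
        final Map<String, Object> info;
        final List<Map<String, Object>> genotypes;
        Variant(final Map<String, Object> info, final List<Map<String, Object>> genotypes) {
            this.info = info;
            this.genotypes = genotypes;
        }
    }

    static Variant dropAnnotations(final Variant vc, final Set<String> infoKeys, final Set<String> genotypeKeys) {
        if (infoKeys.isEmpty() && genotypeKeys.isEmpty()) {
            return vc; // common case: nothing to drop, return the input unchanged
        }
        final Map<String, Object> info = infoKeys.isEmpty() ? vc.info : without(vc.info, infoKeys);
        final List<Map<String, Object>> genotypes;
        if (genotypeKeys.isEmpty()) {
            genotypes = vc.genotypes; // only info fields dropped: reuse the genotypes as they are
        } else {
            genotypes = new ArrayList<>();
            for (final Map<String, Object> g : vc.genotypes) {
                genotypes.add(without(g, genotypeKeys));
            }
        }
        return new Variant(info, genotypes);
    }

    // Returns a copy of the attributes with the given keys removed.
    private static Map<String, Object> without(final Map<String, Object> attributes, final Set<String> keys) {
        final Map<String, Object> copy = new LinkedHashMap<>(attributes);
        copy.keySet().removeAll(keys);
        return copy;
    }

    public static void main(final String[] args) {
        final Map<String, Object> info = new LinkedHashMap<>();
        info.put("AF", 0.5);
        final Map<String, Object> g = new LinkedHashMap<>();
        g.put("GT", "0/1");
        final Variant vc = new Variant(info, Arrays.asList(g));
        final Set<String> none = Collections.emptySet();
        // Same instance comes back when there is nothing to drop.
        System.out.println(dropAnnotations(vc, none, none) == vc);
    }
}
```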

Collaborator

@cmnbroad cmnbroad left a comment

Looks good now. Let's wait a day to merge in case @ldgauthier wants to have a look since she is tagged as well.

Contributor

@ldgauthier ldgauthier left a comment

Looks great! Thanks for adding this helpful feature. Minor comments about docs.

@@ -404,6 +400,18 @@
@Argument(fullName="set-filtered-gt-to-nocall", optional=true, doc="Set filtered genotypes to no-call")
private boolean setFilteredGenotypesToNocall = false;

/**
* Info annotation fields to be dropped
Contributor

Can you note that these are specified based on their keys and not their class names? When we add annotations we do it by class name, so there's an unavoidable discrepancy there.

private List<String> infoAnnotationsToDrop = new ArrayList<>();

/**
* Genotype annotation fields to be dropped
Contributor

Similar to above

@@ -780,6 +824,10 @@ protected VariantFilter makeVariantFilter() {
headerLines.addAll(Arrays.asList(ChromosomeCounts.descriptions));
headerLines.add(VCFStandardHeaderLines.getInfoLine(VCFConstants.DEPTH_KEY));

//remove header lines for info field and genotype annotations being dropped
Contributor

Thanks for the cleanup!

@kachulis
Contributor Author

Thanks @cmnbroad and @ldgauthier, will merge as soon as that last commit (just the documentation improvements @ldgauthier asked for) passes Travis.

@kachulis kachulis merged commit 201360f into master Oct 16, 2018
@kachulis kachulis deleted the ck_5235_SelectVariants_drop_annotations branch October 16, 2018 18:28
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
Enable SelectVariants to drop specific annotation fields from output vcf. (broadinstitute#5254)
4 participants