Enable SelectVariants to drop specific annotation fields from output vcf. #5254
Codecov Report
@@ Coverage Diff @@
## master #5254 +/- ##
===============================================
+ Coverage 86.753% 86.762% +0.009%
- Complexity 29767 29849 +82
===============================================
Files 1825 1828 +3
Lines 137744 138150 +406
Branches 15181 15234 +53
===============================================
+ Hits 119497 119862 +365
- Misses 12729 12741 +12
- Partials 5518 5547 +29
Looks pretty good overall to me. A few minor requests.
/**
 * Info annotation fields to be dropped
 */
@Argument(fullName = "drop-annotation", shortName = "DA", optional = true, doc = "Set info fields to drop from output vcf")
Since `drop-genotype-annotation` is qualified with `genotype`, I think the other arg name should also be qualified, maybe `drop-info-annotation`, since otherwise it's ambiguous by itself. Also, can you align the argument name, the variable name, and the help string so they are all consistent as to whether they refer to `field` or `annotation`? Nit: I would suggest dropping the word "Set" from the doc string.
@@ -508,6 +516,16 @@ public void onTraversalStart() {
        actualLines = headerLines;
    }
}
if (!infoFieldsToDrop.isEmpty()) {
    for (String infoField : infoFieldsToDrop) {
        logger.info("Will drop info annotation: " + infoField);
Another nit: we're not consistent about this, but prefer `String.format` to string concatenation.
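As a minimal sketch of what this nit is asking for, the concatenated log message could be built with `String.format` instead. The class `DropLogMessages` and the helper `dropMessage` are hypothetical names used only for illustration, and the example prints to standard output rather than using the tool's `logger`.

```java
import java.util.Arrays;
import java.util.List;

final class DropLogMessages {
    // Hypothetical helper: builds the message that would be logged for one dropped field.
    static String dropMessage(final String infoField) {
        // String.format keeps the message template in one piece instead of splicing strings with '+'.
        return String.format("Will drop info annotation: %s", infoField);
    }

    public static void main(final String[] args) {
        // Example keys only; in the tool these would come from the command line.
        final List<String> infoFieldsToDrop = Arrays.asList("AC", "AF");
        for (final String infoField : infoFieldsToDrop) {
            System.out.println(dropMessage(infoField));
        }
    }
}
```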
@@ -508,6 +516,16 @@ public void onTraversalStart() {
        actualLines = headerLines;
    }
}
if (!infoFieldsToDrop.isEmpty()) {
    for (String infoField : infoFieldsToDrop) {
final
//remove header lines for info field and genotype annotations being dropped
List<VCFHeaderLine> headerLinesToRemove = new ArrayList<>();
List<VCFInfoHeaderLine> infoHeaderLines = headerLines.stream().filter(l -> l instanceof VCFInfoHeaderLine).map(l -> (VCFInfoHeaderLine) l).collect(Collectors.toList());
for (VCFInfoHeaderLine infoHeaderLine : infoHeaderLines) {
I would suggest either replacing this loop with a single line, `infoHeaderLines.removeIf(l -> infoFieldsToDrop.contains(l.getID()))`, or, since you've already created a stream above, just adding another `.filter` call to the stream.
also finals for all of these
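The two alternatives suggested in this thread might look roughly like the sketch below. To keep it self-contained, `HeaderLine` and `InfoHeaderLine` are hypothetical stand-ins for htsjdk's `VCFHeaderLine` and `VCFInfoHeaderLine`, and the method names are invented for illustration. This is not the code that was merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class HeaderLineFiltering {
    // Hypothetical stand-ins for the VCF header line hierarchy; only getID() matters here.
    static class HeaderLine {
        private final String id;
        HeaderLine(final String id) { this.id = id; }
        String getID() { return id; }
    }

    static final class InfoHeaderLine extends HeaderLine {
        InfoHeaderLine(final String id) { super(id); }
    }

    // Stream variant: pick out the INFO lines and discard dropped IDs in the same pipeline.
    static List<InfoHeaderLine> keptInfoLines(final List<HeaderLine> headerLines,
                                              final Set<String> infoFieldsToDrop) {
        return headerLines.stream()
                .filter(l -> l instanceof InfoHeaderLine)
                .map(l -> (InfoHeaderLine) l)
                .filter(l -> !infoFieldsToDrop.contains(l.getID()))
                .collect(Collectors.toList());
    }

    // removeIf variant: remove the dropped INFO lines from the collection in place.
    static void removeDroppedInfoLines(final List<HeaderLine> headerLines,
                                       final Set<String> infoFieldsToDrop) {
        headerLines.removeIf(l -> l instanceof InfoHeaderLine && infoFieldsToDrop.contains(l.getID()));
    }

    public static void main(final String[] args) {
        final List<HeaderLine> headerLines = new ArrayList<>(Arrays.asList(
                new InfoHeaderLine("AC"), new InfoHeaderLine("DP"), new HeaderLine("source")));
        final Set<String> infoFieldsToDrop = new HashSet<>(Arrays.asList("AC"));

        for (final InfoHeaderLine kept : keptInfoLines(headerLines, infoFieldsToDrop)) {
            System.out.println("kept INFO line: " + kept.getID());
        }
        removeDroppedInfoLines(headerLines, infoFieldsToDrop);
        System.out.println("header lines remaining: " + headerLines.size());
    }
}
```

Both variants replace the explicit loop; the stream version leaves the input untouched, while `removeIf` mutates it, so the choice depends on whether the original collection is still needed afterwards.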
    }
}
List<VCFFormatHeaderLine> formatHeaderLines = headerLines.stream().filter(l -> l instanceof VCFFormatHeaderLine).map(l -> (VCFFormatHeaderLine) l).collect(Collectors.toList());
for (VCFFormatHeaderLine formatHeaderLine : formatHeaderLines) {
Same comment(s) here.
}
rmAnnotationsBuilder.genotypes(GenotypesContext.create(genotypesToWrite));
final VariantContext variantContextToWrite = rmAnnotationsBuilder.make();

It's a little awkward that most of the code from here down (except for the actual `.add`) has to continue to use `filteredGenotypeToNocall` rather than the current working variant, which is in `variantContextToWrite`. It would be good to add a comment explaining that this is deliberate and required in order to ensure that the user can select on attributes that are also dropped.
I agree, and I actually think the best option is moving the creation of `variantContextToWrite` to just before it is added to `vcfWriter`. This clears up the confusion you mentioned, and also means we aren't unnecessarily creating `variantContextToWrite` in cases where the variant will not be written.
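The ordering discussed here (select on the original attributes, then build the stripped copy only right before writing) can be sketched as follows. This is a simplified illustration, not the tool's code: a variant is modeled as a plain map of INFO keys to values, and `SelectThenDrop`, `selectAndDrop`, and `variant` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

final class SelectThenDrop {
    // Assumption for illustration: a variant is just a map of INFO key -> value.
    static List<Map<String, Object>> selectAndDrop(final List<Map<String, Object>> variants,
                                                   final Predicate<Map<String, Object>> selector,
                                                   final Set<String> infoAnnotationsToDrop) {
        final List<Map<String, Object>> written = new ArrayList<>();
        for (final Map<String, Object> original : variants) {
            // Selection deliberately sees the original attributes, including those about to be dropped.
            if (!selector.test(original)) {
                continue; // no stripped copy is built for variants that are not written
            }
            // Build the stripped copy only now, immediately before "writing" it.
            final Map<String, Object> toWrite = new LinkedHashMap<>(original);
            toWrite.keySet().removeAll(infoAnnotationsToDrop);
            written.add(toWrite);
        }
        return written;
    }

    // Small helper to build an example variant with AC and DP attributes.
    static Map<String, Object> variant(final int ac, final int dp) {
        final Map<String, Object> info = new LinkedHashMap<>();
        info.put("AC", ac);
        info.put("DP", dp);
        return info;
    }

    public static void main(final String[] args) {
        final List<Map<String, Object>> variants = new ArrayList<>();
        variants.add(variant(1, 5));
        variants.add(variant(2, 50));
        // Select on DP while also dropping DP from the output.
        final List<Map<String, Object>> written =
                selectAndDrop(variants, v -> (Integer) v.get("DP") > 10, Collections.singleton("DP"));
        System.out.println(written);
    }
}
```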
}

@Test(dataProvider = "dropAnnotationsDataProvider")
public void testDropAnnotations(String args, String expectedFile, String name) throws IOException {
I was going to suggest removing `name`, but it provides nice documentation. Maybe change the name to `testName` or something.
Back to you, @cmnbroad.
One pretty small perf change requested, otherwise this looks good, unless @ldgauthier has any other comments.
}
rmAnnotationsBuilder.genotypes(GenotypesContext.create(genotypesToWrite));
final VariantContext variantContextToWrite = rmAnnotationsBuilder.make();
vcfWriter.add(variantContextToWrite);
Looks good, except I just realized that after this PR, we'll always recreate the entire VC and all the genotypes, even in the common case where there are no dropped fields. It might be worth moving this code out to a separate method, and only making the actual changes necessary, (i.e., if nothing is dropped, just return the input vc, and if only info fields are dropped, avoid recreating all the genotypes).
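One way to read this request is a separate method with fast paths, sketched below. The `Variant` class is a hypothetical, simplified stand-in for `VariantContext` (INFO attributes as one map, each genotype as a map of FORMAT fields), and `dropAnnotations` is an invented name. The sketch only shows the control flow: return the input unchanged when nothing is dropped, and reuse the genotypes when only INFO fields are dropped.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DropFastPath {
    // Hypothetical stand-in for VariantContext, for illustration only.
    static final class Variant {
        final Map<String, Object> info;
        final List<Map<String, Object>> genotypes;

        Variant(final Map<String, Object> info, final List<Map<String, Object>> genotypes) {
            this.info = info;
            this.genotypes = genotypes;
        }
    }

    static Variant dropAnnotations(final Variant vc,
                                   final Set<String> infoAnnotationsToDrop,
                                   final Set<String> genotypeAnnotationsToDrop) {
        // Common case: nothing to drop, so hand back the very same instance.
        if (infoAnnotationsToDrop.isEmpty() && genotypeAnnotationsToDrop.isEmpty()) {
            return vc;
        }
        Map<String, Object> info = vc.info;
        if (!infoAnnotationsToDrop.isEmpty()) {
            info = new LinkedHashMap<>(vc.info);
            info.keySet().removeAll(infoAnnotationsToDrop);
        }
        // Genotypes are reused as-is unless genotype fields are actually being dropped.
        List<Map<String, Object>> genotypes = vc.genotypes;
        if (!genotypeAnnotationsToDrop.isEmpty()) {
            genotypes = new ArrayList<>();
            for (final Map<String, Object> genotype : vc.genotypes) {
                final Map<String, Object> copy = new LinkedHashMap<>(genotype);
                copy.keySet().removeAll(genotypeAnnotationsToDrop);
                genotypes.add(copy);
            }
        }
        return new Variant(info, genotypes);
    }

    public static void main(final String[] args) {
        final Map<String, Object> info = new LinkedHashMap<>();
        info.put("AC", 2);
        info.put("DP", 30);
        final Map<String, Object> genotype = new LinkedHashMap<>();
        genotype.put("GQ", 99);
        final List<Map<String, Object>> genotypes = new ArrayList<>();
        genotypes.add(genotype);
        final Variant vc = new Variant(info, genotypes);

        final Variant unchanged = dropAnnotations(vc, Collections.emptySet(), Collections.emptySet());
        System.out.println("same instance when nothing is dropped: " + (unchanged == vc));
    }
}
```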
Looks good now. Let's wait a day to merge in case @ldgauthier wants to have a look since she is tagged as well.
Looks great! Thanks for adding this helpful feature. Minor comments about docs.
@@ -404,6 +400,18 @@
@Argument(fullName="set-filtered-gt-to-nocall", optional=true, doc="Set filtered genotypes to no-call")
private boolean setFilteredGenotypesToNocall = false;

/**
 * Info annotation fields to be dropped
Can you note that these are specified based on their keys and not their class names? When we add annotations we do it by class name, so there's an unavoidable discrepancy there.
private List<String> infoAnnotationsToDrop = new ArrayList<>();

/**
 * Genotype annotation fields to be dropped
Similar to above
@@ -780,6 +824,10 @@ protected VariantFilter makeVariantFilter() {
headerLines.addAll(Arrays.asList(ChromosomeCounts.descriptions));
headerLines.add(VCFStandardHeaderLines.getInfoLine(VCFConstants.DEPTH_KEY));

//remove header lines for info field and genotype annotations being dropped
Thanks for the cleanup!
Thanks @cmnbroad and @ldgauthier, will merge as soon as that last commit (just the documentation improvements @ldgauthier asked for) passes Travis.
Enable SelectVariants to drop specific annotation fields from output vcf. (broadinstitute#5254)
Requested in #5235. Specific info annotation fields can be dropped from the output vcf by specifying them with `-DA/--drop-annotation`, and genotype annotations can be dropped with `-DGA/--drop-genotype-annotation`. (I would have used @vdauwera's initial suggestion of `-DF`, but it is already used for `disable-read-filters`.) Annotations can be used for selection of variants while simultaneously being dropped from the output vcf.