-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SelectVariants JEXL filter fixes and refactor #8092
Conversation
// START Chris/Louis/David | ||
// Attempt 1, does not work | ||
// final ArgumentsBuilder args1 = new ArgumentsBuilder() | ||
// .add("V", testVcf.toString()) | ||
// .add("O", testOutput.getAbsolutePath()) | ||
// .add("--select", "'AC > 1'") | ||
// .add("--select", "'AF > 0'"); | ||
// // .add("select", "'GQ > 0'"); // tsato: this does not work for some reason... | ||
// runCommandLine(args1, SelectVariants.class.getSimpleName()); | ||
|
||
// Attempt 2 (remove single quotes), does not work | ||
// final ArgumentsBuilder args2 = new ArgumentsBuilder() | ||
// .add("V", testVcf.toString()) | ||
// .add("O", testOutput.getAbsolutePath()) | ||
// .add("--select", "AC > 1") | ||
// .add("--select", "AF > 0"); | ||
// // .add("select", "'GQ > 0'"); // tsato: this does not work for some reason... | ||
// runCommandLine(args2, SelectVariants.class.getSimpleName()); | ||
|
||
// Atempt 3 (remove spaces). commandline parser lets this through but not the JEXL parser | ||
// final ArgumentsBuilder args3 = new ArgumentsBuilder() | ||
// .add("V", testVcf.toString()) | ||
// .add("O", testOutput.getAbsolutePath()) | ||
// .addRaw("--select 'AC>1'") | ||
// .addRaw("--select 'AF>0'"); | ||
// // .add("select", "'GQ > 0'"); // tsato: this does not work for some reason... | ||
// runCommandLine(args3, SelectVariants.class.getSimpleName()); | ||
|
||
// Attempt 4 (give a list to runCommandLine), works | ||
// runCommandLine(Arrays.asList("-V", testVcf.toString(), | ||
// "-O", testOutput.getAbsolutePath(), | ||
// "--select","AC > 1"), | ||
// SelectVariants.class.getSimpleName()); | ||
|
||
// Attempt 5, works! | ||
// runCommandLine(Arrays.asList("-V", testVcf.toString(), | ||
// "-O", testOutput.getAbsolutePath(), | ||
// "--select", "ALGORITHMS == 'depth'"), | ||
// SelectVariants.class.getSimpleName()); | ||
|
||
// Attempt 6, works! | ||
runCommandLine(Arrays.asList("-V", testVcf.toString(), | ||
"-O", testOutput.getAbsolutePath(), | ||
"--select", "SVTYPE == 'DEL'"), | ||
SelectVariants.class.getSimpleName()); | ||
// END Chris/Louis/David |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droazen @lbergelson @cmnbroad this block of code has the command line parsing issue I mentioned. My work around is Attempt 5 and 6, where I feed runCommandLine
a list of arguments rather than an argument builder. Is that good enough? Otherwise I suspect we need to update ArgumentsBuilder
to support JEXL command lines.
Github actions tests reported job failures from actions build 3463372954
|
Github actions tests reported job failures from actions build 3505269846
|
Github actions tests reported job failures from actions build 3567484153
|
Github actions tests reported job failures from actions build 3567534442
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato Back to you with my comments. Changes look really good overall, and definitely an improvement over the existing tool! Go ahead and merge this after addressing my comments and tests pass. You'll need to rebase the branch to get rid of some unrelated test failures.
* specified samples are extracted and the INFO field annotations are updated. | ||
* See example commands above for detailed usage examples. The expressions given to this argument | ||
* should either refer to INFO fields, or access FORMAT field with the VariantContext object | ||
* e.g. --select "vc.getGenotype('NA12878').getGQ() > 35" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a good idea to include an example with vc.getGenotype().getGQ()
in the main tool docs above, along with a comment comparing that syntax to the new -select-genotype
approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, --select
should be -select
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@Argument(shortName="select", doc="One or more criteria to use when selecting the data", optional=true) | ||
public static final String SELECT_SHORT_NAME = "select"; | ||
public static final String SELECT_LONG_NAME = "select-expressions"; | ||
@Argument(fullName=SELECT_LONG_NAME, shortName=SELECT_SHORT_NAME, doc="", optional=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc string should not be empty for this argument
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit worried about breaking some users' script by not allowing --select, so I might revert this new long name change.
*/ | ||
public static final String GENOTYPE_SELECT_SHORT_NAME = "select-genotype"; | ||
public static final String GENOTYPE_SELECT_LONG_NAME = "select-genotype-expressions"; | ||
@Argument(fullName=GENOTYPE_SELECT_LONG_NAME, shortName=GENOTYPE_SELECT_SHORT_NAME, doc="", optional=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc string should not be empty
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* on INFO fields only, which may improve speed when working with a large cohort vcf that contains genotypes for | ||
* thousands of samples (format fields). | ||
*/ | ||
@Argument(fullName="jexl-first", shortName="jexl-first", doc="", optional=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fill in doc string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long argument name should perhaps be more descriptive here -- eg., --apply-jexl-filters-first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done x2
@@ -480,7 +523,7 @@ public void onTraversalStart() { | |||
|
|||
final List<String> genotypeField = getHeaderForVariants().getGenotypeSamples(); | |||
if(!ParsingUtils.isSorted(genotypeField)){ | |||
if(genotypeField.size() > 10) { | |||
if(!genotypeField.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that the previous threshold of 10 was deliberate, so that we don't warn unnecessarily about performance when there are only a few samples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, done.
final int threshold = 1; | ||
final Predicate<VariantContext> filter = vc -> vc.getGenotypes().stream().anyMatch(g -> g.getGQ() > threshold); | ||
final Pair<VCFHeader, List<VariantContext>> preFilter = VariantContextTestUtils.readEntireVCFIntoMemory(svHGDBMultiSampleVcf.toString()); | ||
final int expectedNumVCs = (int) preFilter.getRight().stream().filter(filter).count(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In these new tests you added, have you confirmed in each case that the number of expected records is > 0? Otherwise the tests will not be as meaningful as they could be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added extra assert statements to check this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
SelectVariants.class.getSimpleName()); | ||
final Pair<VCFHeader, List<VariantContext>> result2 = VariantContextTestUtils.readEntireVCFIntoMemory(testOutput2.getAbsolutePath()); | ||
final int actualNumVCs2 = result2.getRight().size(); | ||
Assert.assertEquals(expectedNumVCs2, actualNumVCs2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This second test case should probably be a separate test method rather than embedded within testCombineInfoAndGenotypeJexl()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
SelectVariants.class.getSimpleName()); | ||
final Pair<VCFHeader, List<VariantContext>> result3 = VariantContextTestUtils.readEntireVCFIntoMemory(testOutput3.getAbsolutePath()); | ||
final int actualNumVCs3 = result3.getRight().size(); | ||
Assert.assertEquals(0, actualNumVCs3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should also be pulled out into a separate named test method of its own.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
|
||
|
||
// Assert.assertEquals(expectedNumVCs2, actualNumVCs2);D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete commented-out code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// runCommandLine(Arrays.asList("-V", testVcf.toString(), | ||
// "-O", testOutput.getAbsolutePath(), | ||
// "--select", "ALGORITHMS == 'depth'"), | ||
// SelectVariants.class.getSimpleName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete commented-out code (but consult with @lbergelson on why ArgumentsBuilder
didn't work here)
final Predicate<VariantContext> filter = vc -> vc.getAttributeAsInt("AC", -1) > 0 || | ||
vc.getAttributeAsString("SVTYPE", "").equals("DEL"); | ||
|
||
final Pair<VCFHeader, List<VariantContext>> preFilter = VariantContextTestUtils.readEntireVCFIntoMemory(svHGDBMultiSampleVcf.toString()); | ||
final int expectedCount = (int) preFilter.getRight().stream().filter(filter).count(); | ||
|
||
runCommandLine(Arrays.asList("-V", svHGDBMultiSampleVcf.toString(), | ||
"-O", testOutput.getAbsolutePath(), | ||
"--select", "SVTYPE == 'DEL'", | ||
"--select", "AC > 0"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad what do you think? The issue here is that "--select" should no longer be a valid command line but runCommandLine does not complain about it.
@@ -2,13 +2,7 @@ | |||
|
|||
import htsjdk.samtools.SAMSequenceDictionary; | |||
import htsjdk.tribble.util.ParsingUtils; | |||
import htsjdk.variant.variantcontext.Allele; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Put the explicit imports back
@@ -603,6 +654,11 @@ public void apply(VariantContext vc, ReadsContext readsContext, ReferenceContext | |||
return; | |||
} | |||
|
|||
if (applyJexlFiltersBeforeFilteringGenotypes && !passesJexlFilters(vc)) { | |||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #8092 +/- ##
===============================================
+ Coverage 86.636% 86.659% +0.023%
- Complexity 38944 39021 +77
===============================================
Files 2335 2337 +2
Lines 182675 182958 +283
Branches 20053 20080 +27
===============================================
+ Hits 158262 158549 +287
+ Misses 17373 17362 -11
- Partials 7040 7047 +7
|
This PR is an attempt at improving SelectVariants by
The initial motivation for this code change (#7497) was to improve performance of SelectVariants by adding an option to do the "INFO-level filtering" before doing "genotype filtering." Our assumption was that this would improve performance because then we would avoid the expensive genotype "fully-decode" operation, which turns string format fields into appropriate object/types (int, array, etc.). This is (we think) done in
VariantContext.fullyDecode().
This turned out not to be possible for the following reasons. First, there are roughly four types of genotype subsetting you could do:
a) By the sample names (
--sample-name NA12878
)b) JEXL (
--select GQ > 0
)c) JEXL by accessing the variant context object (
--select vc.getGenotype('NA12878').getGQ() > 1
)d) Others (e.g.
--remove-fraction-genotype
)a) does not need "fully-decode." It turns out b) was never supported (GATK currently removes all variants and succeed.) And from my experiments, c) does not seem to ever trigger calling
VariantContext.fullyDecode().
In fact the only code path I can see that calls fullyDecode() is by setting thefully-decode
SelectVariants argument, which seems to just call fullyDecode at the beginning just for the sake of calling it (or so it appears to me. The utility of this command line argument is highly dubious.)It's possible that apache code does something similar to fully decoding that could affect performance.
All that is to say that we cannot achieve performance improvement with our original blueprint simply because this expensive "fullyDecode" operation seems to be a mythical operation that is never used in reality.
So while I could not speed up SelectVariants, I cleaned up the code and added the following new arguments:
--select-genotype
: with this new genotype-specific JEXL argument, we support filtering by genotype fields like 'GQ > 0', where the behavior in the multi-sample case is 'GQ > 0' in at least one sample. I have not added the ability to do 'GQ > 0 for all samples' but it should be a simple (but not easy…) exercise in boolean operations.applyJexlFiltersBeforeFilteringGenotypes
: if set to true, we do the JEXL checking before we subset by samples. In my tests, performance improvement from this option was very modest. Subsetting a ~3k 1kg SV vcf to a single sample was about 30 seconds faster (out of ~20 min total run time) than the default. I kept it in the PR because I thought some user might find it useful, but I wouldn't be opposed to removing it.Tests needed:
--applyJexlFiltersBeforeFilteringGenotypes.
Does this actually give us performance advantage?select-random-fraction