No need to sort variants in HaplotypeCallerSpark. #5909
Codecov Report
@@ Coverage Diff @@
## master #5909 +/- ##
===============================================
+ Coverage 80.117% 80.119% +0.002%
- Complexity 30673 30674 +1
===============================================
Files 1991 1991
Lines 149341 149342 +1
Branches 16481 16482 +1
===============================================
+ Hits 119647 119651 +4
+ Misses 23892 23890 -2
+ Partials 5802 5801 -1
This looks like a good change
writeVariants(ctx, outputFile, variants, header, writeTabixIndex, true);
}

public static void writeVariants(
Javadoc this method
@@ -184,11 +184,10 @@ private static void processAssemblyRegions(

final JavaRDD<VariantContext> variants = rdd.mapPartitions(assemblyFunction(header, referenceFileName, hcArgsBroadcast, annotatorEngineBroadcast));

variants.cache(); // without caching, computations are run twice as a side effect of finding partition boundaries for sorting
Huh, does this have to do with the change in sorting? What sort of difference in runtime does this branch show?
The cache call was added to avoid repeated work. I haven't tried re-running benchmarks yet, but I suspect it will give some speedup.
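
To make the point about repeated work concrete, here is a plain-Java analogy with no Spark dependency. It is only a sketch: `LazyRecomputeSketch`, `cached`, and `runsNeededForSort` are hypothetical names, and the two `get()` calls stand in for the two passes the code comment describes (one to find partition boundaries, one to sort). It is not how Spark implements caching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Plain-Java analogy: a lazily evaluated dataset is recomputed on every
// traversal unless its result is cached. All names here are hypothetical.
final class LazyRecomputeSketch {

    /** Wraps a supplier so the underlying computation runs at most once (not thread-safe). */
    static <T> Supplier<T> cached(Supplier<T> compute) {
        return new Supplier<T>() {
            private T value;
            private boolean done;

            @Override
            public T get() {
                if (!done) {
                    value = compute.get();
                    done = true;
                }
                return value;
            }
        };
    }

    /** Simulates a sort that traverses its input twice and reports how often the input was computed. */
    static int runsNeededForSort(boolean useCache) {
        AtomicInteger runs = new AtomicInteger();
        Supplier<List<Integer>> expensive = () -> {
            runs.incrementAndGet(); // counts each execution of the "expensive" step
            return new ArrayList<>(List.of(100, 250, 900));
        };
        Supplier<List<Integer>> data = useCache ? cached(expensive) : expensive;
        data.get(); // pass 1: sample keys to choose partition boundaries
        data.get(); // pass 2: the sort itself
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("without cache: " + runsNeededForSort(false) + " run(s)");
        System.out.println("with cache:    " + runsNeededForSort(true) + " run(s)");
    }
}
```

If the sort is removed entirely, as this PR does for HaplotypeCallerSpark, the second traversal disappears, which is presumably why the reviewer asks whether the cache call is related to the sorting change.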
VariantsSparkSink will always sort variants before writing them out. However, HaplotypeCallerSpark always processes reads in coordinate-sorted order, and produces variants in the same order, so there is no need for VariantsSparkSink to sort variants. (In fact, in GVCF mode the sort is prohibitive since the engine creates a variant for every locus over the interval of interest, all of which go through the sort step before being merged into GVCF bands.)

This PR removes the sort step for HaplotypeCallerSpark (and PrintVariantsSpark, which doesn't need it either). All of the concordance unit tests pass, and as an additional sanity check I compared the GVCF output from running regular HaplotypeCaller on a large input BAM to HaplotypeCallerSpark (with and without variant sorting). Removing variant sorting actually made the GVCF output more similar to regular HaplotypeCaller: it reduced the number of differences from three to one. (The one difference is a minor difference in QUAL due to a boundary artifact.) See VCFs in vcfs.zip.
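
Skipping the sort is only safe if the variants really do arrive in coordinate order. A minimal sketch of how that assumption could be checked is below. `CoordinateOrderCheck` and `Locus` are hypothetical stand-ins (a contig index plus a start position), not htsjdk's `VariantContext` or any GATK API, and this check is not part of the PR.

```java
import java.util.List;

// Hypothetical sanity check: is a list of loci already in coordinate order?
final class CoordinateOrderCheck {

    /** Minimal stand-in for a variant: contig index in the sequence dictionary and start position. */
    static final class Locus {
        final int contigIndex;
        final int start;

        Locus(int contigIndex, int start) {
            this.contigIndex = contigIndex;
            this.start = start;
        }
    }

    /** Returns true if no element sorts before its predecessor by (contigIndex, start). */
    static boolean isCoordinateSorted(List<Locus> loci) {
        for (int i = 1; i < loci.size(); i++) {
            Locus prev = loci.get(i - 1);
            Locus cur = loci.get(i);
            boolean earlierContig = cur.contigIndex < prev.contigIndex;
            boolean sameContigEarlierStart =
                    cur.contigIndex == prev.contigIndex && cur.start < prev.start;
            if (earlierContig || sameContigEarlierStart) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Locus> inOrder = List.of(new Locus(0, 10), new Locus(0, 10), new Locus(1, 5));
        List<Locus> outOfOrder = List.of(new Locus(1, 5), new Locus(0, 10));
        System.out.println("inOrder sorted?    " + isCoordinateSorted(inOrder));
        System.out.println("outOfOrder sorted? " + isCoordinateSorted(outOfOrder));
    }
}
```

Equal positions are treated as sorted here, since several records can share a start; a stricter comparison would also need a tie-breaking rule, which this sketch leaves out.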