No need to sort variants in HaplotypeCallerSpark. #5909

tomwhite · 2019-05-01T08:30:04Z

VariantsSparkSink always sorts variants before writing them out. However, HaplotypeCallerSpark processes reads in coordinate-sorted order and produces variants in the same order, so there is no need for VariantsSparkSink to sort them. (In fact, in GVCF mode the sort is prohibitively expensive, since the engine creates a variant for every locus in the interval of interest, and all of them go through the sort step before being merged into GVCF bands.)
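
To illustrate why the sort is redundant, here is a minimal plain-Java sketch (no Spark or htsjdk). It assumes that each partition covers a contiguous coordinate range and emits its variants in order, which is how I read the description above; `Locus`, `concatenate`, and `isCoordinateSorted` are hypothetical names for illustration, not GATK APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration only; every name here is hypothetical. */
class SortedPartitionsSketch {

    /** Stand-in for a variant: contig index in the sequence dictionary plus start position. */
    static final class Locus {
        final int contigIndex;
        final int start;

        Locus(int contigIndex, int start) {
            this.contigIndex = contigIndex;
            this.start = start;
        }
    }

    /** Concatenates partitions in partition order, without any sorting. */
    static List<Locus> concatenate(List<List<Locus>> partitions) {
        List<Locus> out = new ArrayList<>();
        for (List<Locus> partition : partitions) {
            out.addAll(partition);
        }
        return out;
    }

    /** Returns true if the loci are in non-decreasing (contig, start) order. */
    static boolean isCoordinateSorted(List<Locus> loci) {
        for (int i = 1; i < loci.size(); i++) {
            Locus prev = loci.get(i - 1);
            Locus cur = loci.get(i);
            boolean outOfOrder = prev.contigIndex > cur.contigIndex
                    || (prev.contigIndex == cur.contigIndex && prev.start > cur.start);
            if (outOfOrder) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Three partitions, each internally ordered and covering increasing ranges.
        List<List<Locus>> partitions = Arrays.asList(
                Arrays.asList(new Locus(0, 100), new Locus(0, 250)),
                Arrays.asList(new Locus(0, 900), new Locus(1, 10)),
                Arrays.asList(new Locus(1, 500)));
        System.out.println(isCoordinateSorted(concatenate(partitions)));
    }
}
```

Under that assumption, writing the partitions in order already yields coordinate-sorted output, so a global sort would only add cost.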

This PR removes the sort step for HaplotypeCallerSpark (and PrintVariantsSpark, which doesn't need it either). All of the concordance unit tests pass, and as an additional sanity check I compared the GVCF output from running regular HaplotypeCaller on a large input BAM to HaplotypeCallerSpark (with and without variant sorting). Removing variant sorting actually made the GVCF output more similar to regular HaplotypeCaller: it reduced the number of differences from three to one. (The remaining difference is a minor discrepancy in QUAL due to a boundary artifact.) See VCFs in vcfs.zip.

codecov · 2019-05-01T09:03:01Z

Codecov Report

Merging #5909 into master will increase coverage by 0.002%.
The diff coverage is 66.667%.

@@               Coverage Diff               @@
##              master     #5909       +/-   ##
===============================================
+ Coverage     80.117%   80.119%   +0.002%     
- Complexity     30673     30674        +1     
===============================================
  Files           1991      1991               
  Lines         149341    149342        +1     
  Branches       16481     16482        +1     
===============================================
+ Hits          119647    119651        +4     
+ Misses         23892     23890        -2     
+ Partials        5802      5801        -1

Impacted Files	Coverage Δ	Complexity Δ
...nder/tools/spark/pipelines/PrintVariantsSpark.java	`66.667% <ø> (ø)`	`2 <0> (ø)`	⬇️
...stitute/hellbender/tools/HaplotypeCallerSpark.java	`69.767% <ø> (-0.348%)`	`18 <0> (ø)`
...e/spark/datasources/VariantsSparkSinkUnitTest.java	`83.212% <100%> (ø)`	`28 <0> (ø)`	⬇️
...er/engine/spark/datasources/VariantsSparkSink.java	`76.471% <57.143%> (-1.654%)`	`9 <2> (+1)`
...lotypecaller/readthreading/ReadThreadingGraph.java	`88.971% <0%> (+0.245%)`	`159% <0%> (ø)`	⬇️
...nder/utils/runtime/StreamingProcessController.java	`67.773% <0%> (+0.474%)`	`33% <0%> (ø)`	⬇️
...e/hellbender/engine/spark/SparkContextFactory.java	`73.973% <0%> (+2.74%)`	`11% <0%> (ø)`	⬇️

jamesemery

This looks like a good change

jamesemery · 2019-05-01T14:24:46Z

src/main/java/org/broadinstitute/hellbender/engine/spark/datasources/VariantsSparkSink.java

+        writeVariants(ctx, outputFile, variants, header, writeTabixIndex, true);
+    }
+
+    public static void writeVariants(


Javadoc this method
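
The overload in the diff above could be documented along these lines. This is a simplified, self-contained sketch, not the actual `VariantsSparkSink` code: it drops the Spark arguments, models variants as integer start positions, and assumes that the trailing `true` in the diff is a sort flag, given the hypothetical name `sortBeforeWriting` here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Simplified stand-in for a variant sink; names and signatures are hypothetical. */
class VariantWriterSketch {

    /**
     * Writes the variants, sorting them first.
     * Existing callers keep the old behaviour by delegating with the flag set to true.
     *
     * @param variantStarts start positions standing in for variants
     * @return the positions in the order they would be written
     */
    static List<Integer> writeVariants(List<Integer> variantStarts) {
        return writeVariants(variantStarts, true);
    }

    /**
     * Writes the variants, optionally sorting them first.
     *
     * @param variantStarts     start positions standing in for variants
     * @param sortBeforeWriting if true, sort before writing; pass false only when
     *                          the input is already known to be coordinate sorted
     * @return the positions in the order they would be written
     */
    static List<Integer> writeVariants(List<Integer> variantStarts, boolean sortBeforeWriting) {
        List<Integer> out = new ArrayList<>(variantStarts);
        if (sortBeforeWriting) {
            out.sort(Comparator.naturalOrder());
        }
        return out; // stand-in for the actual write
    }
}
```

Keeping the shorter overload means callers that rely on sorting are unaffected, while a caller whose output is already ordered can opt out explicitly.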

jamesemery · 2019-05-01T14:25:47Z

src/main/java/org/broadinstitute/hellbender/tools/HaplotypeCallerSpark.java

@@ -184,11 +184,10 @@ private static void processAssemblyRegions(

        final JavaRDD<VariantContext> variants = rdd.mapPartitions(assemblyFunction(header, referenceFileName, hcArgsBroadcast, annotatorEngineBroadcast));

-        variants.cache(); // without caching, computations are run twice as a side effect of finding partition boundaries for sorting


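Huh, does this have to do with the change in sorting? What sort of difference in runtime does this branch have?

tomwhite · 2019-05-01

The cache call was added to avoid repeating work: without it, the computation ran twice, because finding the partition boundaries for the sort triggered an extra evaluation. I haven't re-run the benchmarks yet, but I suspect this branch will give some speedup.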
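
As a rough analogy for the double computation mentioned in the removed code comment, the plain-Java sketch below counts how often an expensive step runs when a lazy pipeline is consumed twice (once as a stand-in for finding partition boundaries, once for the sort) versus once. It uses `java.util.stream`, not Spark, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Analogy only: a Supplier of Streams plays the role of an uncached, lazily evaluated dataset. */
class RecomputeSketch {
    static final AtomicInteger EXPENSIVE_CALLS = new AtomicInteger();

    /** Stand-in for the costly per-element work (e.g. assembly and calling). */
    static int expensive(int x) {
        EXPENSIVE_CALLS.incrementAndGet();
        return x * 2;
    }

    /** Each get() rebuilds the stream, so expensive() is re-run for every element. */
    static Supplier<Stream<Integer>> lazyPipeline(List<Integer> input) {
        return () -> input.stream().map(RecomputeSketch::expensive);
    }

    /** Two passes over the uncached pipeline: boundary finding, then the sort itself. */
    static int callsWithSort(List<Integer> input) {
        EXPENSIVE_CALLS.set(0);
        Supplier<Stream<Integer>> pipeline = lazyPipeline(input);
        pipeline.get().max(Integer::compare);                  // pass 1: stand-in for finding boundaries
        pipeline.get().sorted().collect(Collectors.toList());  // pass 2: the sort
        return EXPENSIVE_CALLS.get();
    }

    /** One pass: with no sort, nothing forces a second evaluation. */
    static int callsWithoutSort(List<Integer> input) {
        EXPENSIVE_CALLS.set(0);
        lazyPipeline(input).get().collect(Collectors.toList());
        return EXPENSIVE_CALLS.get();
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(3, 1, 2, 4);
        System.out.println(callsWithSort(input) + " " + callsWithoutSort(input)); // prints "8 4"
    }
}
```

In this analogy, caching trades memory for avoiding the second pass; once the sort is gone there is only one pass, so there is nothing left to cache.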

No need to sort variants in HaplotypeCallerSpark.

359b1f5

tomwhite added Spark HaplotypeCallerSpark labels May 1, 2019

tomwhite requested a review from jamesemery May 1, 2019 08:30

tomwhite self-assigned this May 1, 2019

jamesemery approved these changes May 1, 2019

Javadoc

e4967fb

tomwhite merged commit 53d015e into master May 1, 2019

tomwhite deleted the tw_hc_spark_avoid_sorting_variants branch May 1, 2019 16:28
