Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No need to sort variants in HaplotypeCallerSpark. #5909

Merged
merged 2 commits into from
May 1, 2019

Conversation

tomwhite
Copy link
Contributor

@tomwhite tomwhite commented May 1, 2019

VariantsSparkSink will always sort variants before writing them out. However, HaplotypeCallerSpark always processes reads in coordinate-sorted order, and produces variants in the same order, so there is no need for VariantsSparkSink to sort variants. (In fact, in GVCF mode the sort is prohibitive since the engine creates a variant for every locus over the interval of interest, which go through the sort step before being merged into GVCF bands.)

This PR removes the sort step for HaplotypeCallerSpark (and PrintVariantsSpark, which doesn't need it either). All of the concordance unit tests pass, and as an additional sanity check I compared the GVCF output from running regular HaplotypeCaller on a large input BAM to HaplotypeCallerSpark (with and without variant sorting). Removing variant sorting actually made the GVCF output more similar to regular HaplotypeCaller - it reduced the number of differences from three to one. (The one difference is a minor difference in QUAL due to a boundary artifact.) See VCFs in vcfs.zip.

@codecov
Copy link

codecov bot commented May 1, 2019

Codecov Report

Merging #5909 into master will increase coverage by 0.002%.
The diff coverage is 66.667%.

@@               Coverage Diff               @@
##              master     #5909       +/-   ##
===============================================
+ Coverage     80.117%   80.119%   +0.002%     
- Complexity     30673     30674        +1     
===============================================
  Files           1991      1991               
  Lines         149341    149342        +1     
  Branches       16481     16482        +1     
===============================================
+ Hits          119647    119651        +4     
+ Misses         23892     23890        -2     
+ Partials        5802      5801        -1
Impacted Files Coverage Δ Complexity Δ
...nder/tools/spark/pipelines/PrintVariantsSpark.java 66.667% <ø> (ø) 2 <0> (ø) ⬇️
...stitute/hellbender/tools/HaplotypeCallerSpark.java 69.767% <ø> (-0.348%) 18 <0> (ø)
...e/spark/datasources/VariantsSparkSinkUnitTest.java 83.212% <100%> (ø) 28 <0> (ø) ⬇️
...er/engine/spark/datasources/VariantsSparkSink.java 76.471% <57.143%> (-1.654%) 9 <2> (+1)
...lotypecaller/readthreading/ReadThreadingGraph.java 88.971% <0%> (+0.245%) 159% <0%> (ø) ⬇️
...nder/utils/runtime/StreamingProcessController.java 67.773% <0%> (+0.474%) 33% <0%> (ø) ⬇️
...e/hellbender/engine/spark/SparkContextFactory.java 73.973% <0%> (+2.74%) 11% <0%> (ø) ⬇️

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good change

writeVariants(ctx, outputFile, variants, header, writeTabixIndex, true);
}

public static void writeVariants(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Javadoc this method

@@ -184,11 +184,10 @@ private static void processAssemblyRegions(

final JavaRDD<VariantContext> variants = rdd.mapPartitions(assemblyFunction(header, referenceFileName, hcArgsBroadcast, annotatorEngineBroadcast));

variants.cache(); // without caching, computations are run twice as a side effect of finding partition boundaries for sorting
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh, does this have to do with the change in sorting? what sort of difference in runtime does this branch have?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache call was added to avoid repeat work. I haven't tried re-running benchmarks yet but I suspect it will give some speed up.

@tomwhite tomwhite merged commit 53d015e into master May 1, 2019
@tomwhite tomwhite deleted the tw_hc_spark_avoid_sorting_variants branch May 1, 2019 16:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants