Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make HaplotypeCallerSpark extend AssemblyRegionWalkerSpark #5386

Merged
merged 2 commits into from
Nov 8, 2018

Conversation

tomwhite
Copy link
Contributor

@tomwhite tomwhite commented Nov 2, 2018

This brings HaplotypeCallerSpark into closer alignment with the walker version.

cc @jonn-smith @jamesemery @droazen

@codecov-io
Copy link

codecov-io commented Nov 2, 2018

Codecov Report

Merging #5386 into master will increase coverage by 0.04%.
The diff coverage is 85.047%.

@@              Coverage Diff               @@
##              master     #5386      +/-   ##
==============================================
+ Coverage     86.985%   87.025%   +0.04%     
- Complexity     30423     30439      +16     
==============================================
  Files           1850      1852       +2     
  Lines         140944    140936       -8     
  Branches       15505     15508       +3     
==============================================
+ Hits          122600    122650      +50     
+ Misses         12705     12647      -58     
  Partials        5639      5639
Impacted Files Coverage Δ Complexity Δ
...der/tools/HaplotypeCallerSparkIntegrationTest.java 58.73% <ø> (-3.035%) 12 <0> (-1)
...ols/examples/ExampleAssemblyRegionWalkerSpark.java 0% <0%> (ø) 0 <0> (ø) ⬇️
...engine/spark/AssemblyRegionArgumentCollection.java 100% <100%> (ø) 1 <1> (?)
...nder/tools/spark/pipelines/ReadsPipelineSpark.java 90.566% <100%> (+0.983%) 13 <1> (+1) ⬆️
...ark/AssemblyRegionReadShardArgumentCollection.java 100% <100%> (ø) 1 <1> (?)
...bender/engine/spark/AssemblyRegionWalkerSpark.java 77.083% <85.185%> (+77.083%) 15 <7> (+15) ⬆️
...stitute/hellbender/tools/HaplotypeCallerSpark.java 88.462% <92%> (+11.234%) 22 <13> (-4) ⬇️
...utils/smithwaterman/SmithWatermanIntelAligner.java 50% <0%> (-30%) 1% <0%> (-2%)
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 80.488% <0%> (-0.61%) 42% <0%> (ø)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java 60% <0%> (ø) 2% <0%> (ø) ⬇️
... and 7 more

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a whole lot to say about this branch. I think it is a good change to fold the haplotype caller into an assembly region walker, though we may end up doing some extensive work on the walker at some point .

new FeatureContext(features, assemblyRegion.getExtendedSpan()))).iterator();
AssemblyRegionEvaluator assemblyRegionEvaluator = supplierBroadcast.getValue().get(); // one AssemblyRegionEvaluator instance per Spark partition
final ReadsDownsampler readsDownsampler = assemblyRegionArgs.maxReadsPerAlignmentStart > 0 ?
new PositionalDownsampler(assemblyRegionArgs.maxReadsPerAlignmentStart, header) : null;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the downsampler want to be configurable at this stage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that's needed yet since the examples and HaplotypeCaller both use the PositionalDownsampler. There would be a bit of work to ensure the downsampler is serializable.

new DownsampleableSparkReadShard(
new ShardBoundary(shardedRead.getInterval(), shardedRead.getPaddedInterval()), shardedRead, readsDownsampler)))
.map(shardedRead -> {
final Iterator<AssemblyRegion> assemblyRegionIter = new AssemblyRegionIterator(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A potential optimization we may want to look at for the haplotype caller is being able to generate these assembly regions and reconstruct the constituent up the reads later to save on holding an entire bam in an rdd at a time. That doesn't seem easy to do given the way this class is structured.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got a branch based on this one that does exactly that :-)


@Override
protected AssemblyRegionArgumentCollection getAssemblyRegionArgumentCollection() {
return new ExampleAssemblyRegionWalkerSpark.ExampleAssemblyRegionArgumentCollection();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these arguments getting pulled in from the example tool? Why can't this be its own argument collection?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a mistake - I've fixed it now.

@tomwhite tomwhite force-pushed the tw_spark_files_optimization_hc_as_spark_walker branch from a1730f7 to a7caab2 Compare November 7, 2018 09:51
…ll allowing

ReadsPipelineSpark to share common code.
@tomwhite tomwhite force-pushed the tw_spark_files_optimization_hc_as_spark_walker branch from a7caab2 to 9d1c3b5 Compare November 7, 2018 14:05
@tomwhite
Copy link
Contributor Author

tomwhite commented Nov 7, 2018

Thanks for the review @jamesemery. I have tried these changes on a cluster with an exome-sized input, with no performance issues, so I'd be comfortable merging. As you say it will be a useful basis for the HaplotypeCallerSpark work we are doing.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch looks more or less good to me. Considering that these changes are temporary before your future changes to this machinery I would say this could be approved 👍

@tomwhite tomwhite merged commit 626c887 into master Nov 8, 2018
@tomwhite tomwhite deleted the tw_spark_files_optimization_hc_as_spark_walker branch November 8, 2018 08:22
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
…titute#5386)

* Factor out command line options for assembly region walkers (Spark)

* Make HaplotypeCallerSpark extend AssemblyRegionWalkerSpark, while still allowing
ReadsPipelineSpark to share common code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants