Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LocusWalkerSpark drops intervals with no reads #5222

Merged
merged 9 commits into from
Oct 8, 2018

Conversation

tomwhite
Copy link
Contributor

This PR fixes #3823 opened by @LeeTL1220

There are several series of commits applied here:

  1. The first set of commits rebase ll_CollectAllelicCountsSpark on master.
  2. Then there's a commit (21e1dcf) to fix up some changes to make CollectAllelicCountsSpark work on master.
  3. The next set of commits applies the changes from Use SparkFiles to localize reference and known sites #5127 and Use Spark files to distribute the reference for Spark Walker classes #5221 (since they have not yet been merged) to make passing the reference in Spark more efficient. These changes are needed for the fix below to work.
  4. The actual fix is in 3326b90.
  5. There's also a new test for ExampleLocusWalkerSpark in addition to the one for CollectAllelicCountsSpark.

The problem was that emitEmptyLoci() in LocusWalkerSpark was not working properly. Any intervals that didn't overlap with reads were being dropped, which meant that those loci were not being passed to the LocusWalkerSpark subclass.

Copy link
Contributor

@LeeTL1220 LeeTL1220 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite I have reviewed the CollectAllelicCountsSpark piece of this PR. The changes look reasonable. Thanks for rebasing, as well.

@LeeTL1220
Copy link
Contributor

@tomwhite The other changes should probably be reviewed by someone else on the engine team.

@droazen
Copy link
Contributor

droazen commented Sep 26, 2018

This should not be merged until #5127 and #5221 have gone in first -- treating as blocked until then

@tomwhite
Copy link
Contributor Author

tomwhite commented Oct 1, 2018

@LeeTL1220 thanks for the review! I'll work with the engine team to get the other parts reviewed.

@jamesemery
Copy link
Collaborator

@tomwhite Could this branch be disentangled from ll_CollectAllelicCountsSpark since the actual fix is in 3326b90 we should probably just pull that out into its own branch and push that through with ll_CollectAllelicCountsSpark which doesn't seem directly related.

@tomwhite
Copy link
Contributor Author

tomwhite commented Oct 2, 2018

@jamesemery, yes, I've opened #5248 with the isolated fix (and unit test). I'll leave this open for the CollectAllelicCountsSpark changes, which I'll update when the other changes it depends on are merged.

@tomwhite tomwhite force-pushed the tw_ll_CollectAllelicCountsSpark branch from 670223f to 94c5a73 Compare October 2, 2018 16:37
@tomwhite
Copy link
Contributor Author

tomwhite commented Oct 2, 2018

Now that #5127 and #5221 have gone in I've rebased this. Note that it will fail until the fix (which I moved to #5248 on @jamesemery's suggestion) has gone in.

@tomwhite tomwhite merged commit 158f7f7 into master Oct 8, 2018
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
* Spark version of collect allelic counts spark.

* Add ExampleLocusWalkerSpark.java and test
@davidbenjamin davidbenjamin deleted the tw_ll_CollectAllelicCountsSpark branch February 24, 2021 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LocusWalkerSpark not determining the correct number of RDDs to process
4 participants