Synchronize update of shared genotype likelihood tables. #5071
Conversation
Codecov Report
@@             Coverage Diff              @@
##              master     #5071      +/- ##
===============================================
- Coverage     86.761%   86.759%   -0.002%
+ Complexity     29765     29762        -3
===============================================
  Files           1825      1825
  Lines         137699    137698        -1
  Branches       15176     15175        -1
===============================================
- Hits          119469    119465        -4
- Misses         12716     12720        +4
+ Partials        5514      5513        -1
Re-assigning to @jamesemery
Add comments explaining why this is necessary. I agree that a test would be difficult to write in such a way that it would be informative across hardware. I'm still not sure, based on the HaplotypeCallerSpark code, what is causing its stream to become parallel within a region.
@@ -275,12 +275,11 @@ public GenotypeLikelihoodCalculator getInstance(final int ploidy, final int alle
     * @param requestedMaximumAllele the new requested maximum allele maximum.
     * @param requestedMaximumPloidy the new requested ploidy maximum.
     */
    private void ensureCapacity(final int requestedMaximumAllele, final int requestedMaximumPloidy) {
I would add a comment to the top of this method explaining that, since each region in Spark owns one HaplotypeCallerEngine (and therefore one genotyping engine) which gets reused, this method must be synchronized to avoid race conditions.
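As a hedged sketch only, the suggested comment-plus-synchronization might look like the following. The class shape, field names, and table sizing here are illustrative stand-ins, not GATK's actual GenotypeLikelihoodCalculators code (and the real ensureCapacity is private; it is widened here purely so the sketch is easy to exercise):

```java
// Hypothetical sketch of the suggested change; not the actual GATK class.
public class GenotypeTablesSketch {
    // Shared tables, reused across requests; only ever grown, never shrunk.
    private int[][] sharedTable = new int[3][3];

    /**
     * Grows the shared tables if needed.
     *
     * Note: in Spark, each region owns one HaplotypeCallerEngine (and
     * therefore one genotyping engine) that is reused across threads, so
     * this method must be synchronized to avoid race conditions while the
     * shared tables are being replaced.
     */
    synchronized void ensureCapacity(final int requestedMaximumAllele,
                                     final int requestedMaximumPloidy) {
        if (requestedMaximumPloidy < sharedTable.length
                && requestedMaximumAllele < sharedTable[0].length) {
            return; // already large enough
        }
        // Replace the table atomically with respect to other callers.
        sharedTable = new int[Math.max(requestedMaximumPloidy + 1, sharedTable.length)]
                             [Math.max(requestedMaximumAllele + 1, sharedTable[0].length)];
    }

    int ploidyCapacity() { return sharedTable.length; }
    int alleleCapacity() { return sharedTable[0].length; }

    public static void main(String[] args) {
        GenotypeTablesSketch s = new GenotypeTablesSketch();
        s.ensureCapacity(10, 6); // grow from 3x3
        System.out.println(s.ploidyCapacity() + "x" + s.alleleCapacity()); // prints 7x11
    }
}
```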
Added a comment.
I was able to trigger the ArrayIndexOutOfBoundsException (and subsequently fix it with your synchronization) with the following code, so perhaps a test similar to this would be useful to include:
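The snippet referenced above was not preserved in this transcript. As a hedged illustration only (the class names, table math, and thread counts below are invented, not GATK's), a stress test in that spirit might look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative stand-in for the real calculators cache: the point is the
// concurrent access pattern, not GATK's actual table contents.
public class GenotypeTablesStress {
    static class Cache {
        private int[][] table = new int[2][2];

        // Remove "synchronized" and the stress loop below can surface an
        // ArrayIndexOutOfBoundsException when the table is swapped mid-growth.
        synchronized int[][] getInstance(final int ploidy, final int alleleCount) {
            if (ploidy >= table.length || alleleCount >= table[0].length) {
                table = new int[Math.max(ploidy + 1, table.length)]
                               [Math.max(alleleCount + 1, table[0].length)];
            }
            return table;
        }
    }

    static String run() throws Exception {
        final Cache cache = new Cache();
        final ExecutorService pool = Executors.newFixedThreadPool(8);
        final List<Future<?>> results = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            final int ploidy = 1 + (i % 20);
            final int alleles = 1 + (i % 30);
            // Each worker writes to the cell it asked for.
            results.add(pool.submit(() -> {
                cache.getInstance(ploidy, alleles)[ploidy][alleles] = 1;
            }));
        }
        for (Future<?> f : results) {
            f.get(); // rethrows any ArrayIndexOutOfBoundsException from a worker
        }
        pool.shutdown();
        return "no race detected";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // prints "no race detected"
    }
}
```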
6e3b64b → fce66a5 (compare)
One pedantic comment to add, and then I think you would be good to merge.
@@ -270,17 +270,18 @@ public GenotypeLikelihoodCalculator getInstance(final int ploidy, final int alle
    }

    /**
     * Update of shared tables
Upon some thought I realized that the getInstance() method which calls this one should probably be synchronized, specifically because calculateGenotypeCountUsingTables makes a check against maximumPloidy and maximumAllele, which are both not thread safe. Now, since this method currently only ever increases either of those fields, the check won't cause problems despite being unsafe, but we should probably add a strong comment somewhere enforcing that maximumPloidy and maximumAllele can only be increased by this method, so we can feel safe checking against them where this is called.
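A minimal sketch of the monotonic-maxima reasoning above, under loud assumptions: only the names maximumPloidy, maximumAllele, getInstance, and ensureCapacity mirror the discussion, and every method body is an invented stand-in rather than GATK code:

```java
// Illustrative sketch of the monotonic-maxima invariant; not GATK code.
public class CalculatorsCacheSketch {
    // Invariant: only ever increased, and only by ensureCapacity().
    private volatile int maximumPloidy = 2;
    private volatile int maximumAllele = 1;

    // Returns the current ploidy maximum as a stand-in for returning a
    // calculator instance.
    public int getInstance(final int ploidy, final int alleleCount) {
        // This check is not synchronized. A stale read can only
        // *underestimate* the current maxima (they never shrink), so the
        // worst case is a redundant call into the synchronized method
        // below, never an undersized table.
        if (ploidy > maximumPloidy || alleleCount > maximumAllele) {
            ensureCapacity(alleleCount, ploidy);
        }
        return maximumPloidy;
    }

    // Must remain the ONLY writer of maximumPloidy/maximumAllele, and must
    // only increase them; otherwise the unsynchronized check above is unsafe.
    private synchronized void ensureCapacity(final int requestedMaximumAllele,
                                             final int requestedMaximumPloidy) {
        maximumAllele = Math.max(maximumAllele, requestedMaximumAllele);
        maximumPloidy = Math.max(maximumPloidy, requestedMaximumPloidy);
    }

    public static void main(String[] args) {
        CalculatorsCacheSketch c = new CalculatorsCacheSketch();
        System.out.println(c.getInstance(4, 3)); // grows maxima; prints 4
        System.out.println(c.getInstance(1, 1)); // maxima never shrink; prints 4
    }
}
```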
@tomwhite To clarify, I think that the caller of … It's not 100% clear to me whether a …
@jamesemery I agree - all access (read and write) to … @droazen, are you concerned about performance for the Spark case? For the walker version, presumably the access is single-threaded, and hence uncontended, which is very cheap. Another option would be to maintain a separate instance of …
@jamesemery @droazen I've updated this branch to ensure all read and write paths to shared state … R session (times are in millis):
The p-value is not less than 0.05, so we can't reject the null hypothesis (that the mean times are the same). So adding synchronization doesn't seem to make any difference in this test. BTW, I noticed that …
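The timing comparison described above could be roughed out with a harness like the sketch below. This is an illustrative micro-timing loop only (a real measurement should use a proper benchmarking tool such as JMH, since naive loops are distorted by JIT warm-up), and all names here are invented:

```java
// Rough illustration of timing uncontended synchronized vs. plain calls.
// Not a rigorous benchmark: JIT warm-up and dead-code elimination can skew
// results badly, which is why JMH is the right tool for a real comparison.
public class UncontendedSyncTiming {
    private long counter;

    synchronized void syncBump() { counter++; }
    void plainBump() { counter++; }
    long counter() { return counter; }

    public static void main(String[] args) {
        final UncontendedSyncTiming t = new UncontendedSyncTiming();
        final int n = 5_000_000;

        long start = System.nanoTime();
        for (int i = 0; i < n; i++) t.syncBump();
        final long syncMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        for (int i = 0; i < n; i++) t.plainBump();
        final long plainMs = (System.nanoTime() - start) / 1_000_000;

        // Uncontended monitor acquisition is cheap, so the two numbers are
        // typically close.
        System.out.println("synchronized: " + syncMs + " ms, plain: " + plainMs + " ms");
    }
}
```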
085b399 → 892ff97 (compare)
👍
…ute#5071)
* Synchronize update of shared genotype likelihood tables.
* Add comment explaining synchronization.
* Synchronize for reads as well as writes to shared state.
Fixes #4661.
I managed to run genome_reads-pipeline_hdfs.sh with this change, whereas before it was failing (see details in #4661). No unit test, since it is very difficult to write a test for the effect of adding synchronization.