
Synchronize update of shared genotype likelihood tables. #5071

Merged: 3 commits into master, Oct 3, 2018

Conversation

@tomwhite (Contributor)

Fixes #4661.

I managed to run genome_reads-pipeline_hdfs.sh with this change, whereas before it was failing (see details in #4661).

No unit test since it is very difficult to write a test for the effect of adding synchronization.

@tomwhite tomwhite requested a review from lbergelson July 31, 2018 08:22
@codecov-io

codecov-io commented Jul 31, 2018

Codecov Report

Merging #5071 into master will decrease coverage by 0.002%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #5071       +/-   ##
===============================================
- Coverage     86.761%   86.759%   -0.002%     
+ Complexity     29765     29762        -3     
===============================================
  Files           1825      1825               
  Lines         137699    137698        -1     
  Branches       15176     15175        -1     
===============================================
- Hits          119469    119465        -4     
- Misses         12716     12720        +4     
+ Partials        5514      5513        -1
Impacted Files | Coverage Δ | Complexity Δ
--- | --- | ---
...lkers/genotyper/GenotypeLikelihoodCalculators.java | 88.75% <100%> (+2.33%) | 24 <0> (-1) ⬇️
...utils/smithwaterman/SmithWatermanIntelAligner.java | 50% <0%> (-30%) | 1% <0%> (-2%)
...e/hellbender/engine/spark/SparkContextFactory.java | 71.233% <0%> (-2.74%) | 11% <0%> (ø)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java | 60% <0%> (ø) | 2% <0%> (ø) ⬇️

@droazen droazen requested review from jamesemery and removed request for lbergelson August 27, 2018 19:28
@droazen (Contributor)

droazen commented Aug 27, 2018

Re-assigning to @jamesemery

@jamesemery (Collaborator) left a comment

Add comments explaining why this is necessary. I agree a test would be difficult to write in a way that would be informative across hardware. I'm still not entirely sure, based on the HaplotypeCallerSpark code, what is causing its stream to become parallel within a region.

@@ -275,12 +275,11 @@ public GenotypeLikelihoodCalculator getInstance(final int ploidy, final int alle
* @param requestedMaximumAllele the new requested maximum allele maximum.
* @param requestedMaximumPloidy the new requested ploidy maximum.
*/
private void ensureCapacity(final int requestedMaximumAllele, final int requestedMaximumPloidy) {
@jamesemery (Collaborator)

I would add a comment to the top of this method explaining that, since each region in Spark owns one HaplotypeCallerEngine (and therefore one genotyping engine) that gets reused, this must be synchronized to avoid race conditions.

@tomwhite (Contributor, Author)

Added a comment.

@jamesemery (Collaborator)

jamesemery commented Aug 29, 2018

I was able to trigger the ArrayIndexOutOfBoundsException (and subsequently fix it with your synchronization) with the following code, so perhaps a test similar to this would be useful to include:

```java
@Test
public void testRaceCondition(){
    final GenotypeLikelihoodCalculators calculator = new GenotypeLikelihoodCalculators();

    final List<int[]> counts = new ArrayList<>();
    final Random random = new Random(10);
    for (int i = 1; i < 19; i++) {
        counts.add(new int[]{random.nextInt(i), random.nextInt(i)});
    }

    Utils.stream(counts).parallel().forEach(params -> calculator.getInstance(params[0] + 1, params[1] + 1));
}
```
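For readers following along, here is a condensed, self-contained sketch of the failure mode this test exercises, using toy class names (not the GATK ones): a shared, lazily-grown lookup table whose resize is synchronized, so a parallel stream can never observe a capacity that has grown ahead of the array backing it.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

// Toy stand-in for a shared, lazily-resized cache like the one in
// GenotypeLikelihoodCalculators. Not the real GATK class.
class SharedTable {
    private int[] table = new int[1];
    private int capacity = 1;

    // Without "synchronized", two threads can interleave the resize so that
    // one sees the enlarged capacity but the old, smaller array, producing
    // an ArrayIndexOutOfBoundsException like the one reported in #4661.
    synchronized void ensureCapacity(final int requested) {
        if (requested >= capacity) {
            final int[] newTable = new int[requested + 1];
            System.arraycopy(table, 0, newTable, 0, capacity);
            table = newTable;
            capacity = requested + 1;
        }
    }

    synchronized int get(final int index) {
        ensureCapacity(index);
        return table[index];
    }
}

public class RaceSketch {
    public static void main(String[] args) {
        final SharedTable shared = new SharedTable();
        // Hammer the shared table from a parallel stream, as the Spark code does.
        IntStream.range(0, 100_000).parallel()
                .forEach(i -> shared.get(ThreadLocalRandom.current().nextInt(64)));
        System.out.println("no race detected");
    }
}
```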

@tomwhite tomwhite force-pushed the tw_hc_spark_array_index_ex branch from 6e3b64b to fce66a5 Compare September 4, 2018 09:17
@jamesemery (Collaborator) left a comment

One pedantic comment to add and then I think you would be good to merge

@@ -270,17 +270,18 @@ public GenotypeLikelihoodCalculator getInstance(final int ploidy, final int alle
}

/**
* Update of shared tables
@jamesemery (Collaborator) commented Sep 10, 2018

Upon some thought I realized that the getInstance() method which calls this one should probably be synchronized as well, specifically because calculateGenotypeCountUsingTables checks against maximumPloidy and maximumAllele, neither of which is thread-safe. Since this method currently only ever increases those fields, the check won't cause problems despite being unsafe, but we should add a strong comment somewhere enforcing that maximumPloidy and maximumAllele can only be increased by this method, so that we can feel safe checking against them wherever this is called.
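The hazard described here can be sketched as follows (a simplified stand-in, not the actual GATK class; the field names mirror the real ones, but the genotype-count arithmetic is a placeholder):

```java
import java.util.stream.IntStream;

// Simplified stand-in for GenotypeLikelihoodCalculators; not the real class.
class CalculatorsSketch {
    private int maximumPloidy = 2;
    private int maximumAllele = 1;

    // Invariant: these fields may only ever increase. Because of that, a
    // stale unsynchronized read in the caller would be merely wasteful (one
    // redundant ensureCapacity call), not incorrect. Synchronizing the
    // caller as well removes the unsafe read entirely.
    private synchronized void ensureCapacity(final int allele, final int ploidy) {
        maximumAllele = Math.max(maximumAllele, allele);
        maximumPloidy = Math.max(maximumPloidy, ploidy);
    }

    synchronized int calculateGenotypeCount(final int allele, final int ploidy) {
        if (allele > maximumAllele || ploidy > maximumPloidy) {
            ensureCapacity(allele, ploidy);
        }
        return (allele + 1) * (ploidy + 1); // placeholder, not the real formula
    }
}

public class CallerSyncSketch {
    public static void main(String[] args) {
        final CalculatorsSketch calc = new CalculatorsSketch();
        IntStream.range(0, 10_000).parallel()
                .forEach(i -> calc.calculateGenotypeCount(i % 8, i % 4));
        System.out.println("caller and callee synchronized: ok");
    }
}
```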

@jamesemery (Collaborator)

@tomwhite We just sat down and had a look at this class. @droazen was suggesting that this might still be unsafe, and that the outer layer where we compute the genotype likelihood should either be synchronized as well, or the whole thing should be wrapped in a thread-local object.

@droazen (Contributor)

droazen commented Sep 17, 2018

@tomwhite To clarify, I think that the caller of ensureCapacity(), namely GenotypeLikelihoodCalculators.calculateGenotypeCountUsingTables(), also needs to be synchronized in order to avoid some unlikely but still-possible races. Given this, I think that we should consider whether ThreadLocal might be a better option here.

It's not 100% clear to me whether a ThreadLocal get() call is cheaper than a synchronized method call, but some casual googling suggests that it might be. If we're going to end up entering a synchronized method on every single call to GenotypeLikelihoodCalculators.getInstance(), we might want to do some research into whether ThreadLocal + no synchronization would be faster, since I believe that this is a performance-sensitive section of code.
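A minimal sketch of the ThreadLocal alternative under discussion (the `Calc` class is a hypothetical stand-in, not GenotypeLikelihoodCalculators): each thread gets its own instance via `ThreadLocal.withInitial`, so the hot path needs no locking at all, at the cost of one set of tables per thread.

```java
import java.util.stream.IntStream;

public class ThreadLocalSketch {
    // Hypothetical per-thread calculator, standing in for the real class.
    static class Calc {
        private int maximum = 0;
        int ensure(final int requested) {
            // No synchronization needed: this state is confined to one thread.
            maximum = Math.max(maximum, requested);
            return maximum;
        }
    }

    private static final ThreadLocal<Calc> CALC = ThreadLocal.withInitial(Calc::new);

    public static void main(String[] args) {
        // Each worker thread of the parallel stream only ever touches its own Calc.
        IntStream.range(0, 100_000).parallel().forEach(i -> CALC.get().ensure(i % 64));
        System.out.println("thread-local: ok");
    }
}
```

The trade-off: `ThreadLocal.get()` is typically a cheap lookup keyed off the current Thread, but memory scales with the number of worker threads, and in long-lived thread pools the per-thread instances stick around after they stop being useful.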

@tomwhite (Contributor, Author)

@jamesemery I agree - all access (read and write) to GenotypeLikelihoodCalculators instance variables needs to be synchronized to make it safe. I think it would be sufficient to make getInstance() and calculateGenotypeCountUsingTables() synchronized.

@droazen, are you concerned about performance for the Spark case? For the walker version, presumably the access is single-threaded, and hence uncontended, which is very cheap.

Another option would be to maintain a separate instance of GenotypeLikelihoodCalculators per genotyping engine. The size of the table is ploidy * alleles, so not too large?

@tomwhite (Contributor, Author)

tomwhite commented Oct 2, 2018

@jamesemery @droazen I've updated this branch to ensure all read and write paths to shared state in GenotypeLikelihoodCalculators are synchronized. I then wrote a little test using AlleleSubsettingUtils to access GenotypeLikelihoodCalculators 10^6 times to see the effect of adding synchronization.

R session (times are in millis):

> without_sync = c(10166, 10049, 10306, 10059, 10165)
> with_sync = c(10700, 10384, 9923, 10097, 10190)
> t.test(without_sync, with_sync, paired=TRUE)

	Paired t-test

data:  without_sync and with_sync
t = -0.70447, df = 4, p-value = 0.52
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -542.5421  322.9421
sample estimates:
mean of the differences 
                 -109.8 

The p-value is not less than 0.05, so we can't reject the null hypothesis that the mean times are equal. In other words, adding synchronization doesn't make a measurable difference in this test.

BTW, I noticed that GenotypeLikelihoods already uses synchronization, so there is some precedent for achieving thread-safety this way.

@tomwhite tomwhite force-pushed the tw_hc_spark_array_index_ex branch from 085b399 to 892ff97 Compare October 2, 2018 16:27
@tomwhite tomwhite requested a review from droazen October 3, 2018 08:02
@droazen (Contributor) left a comment

👍

@droazen droazen merged commit fad6eb8 into master Oct 3, 2018
@droazen droazen deleted the tw_hc_spark_array_index_ex branch October 3, 2018 18:52
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
…ute#5071)

* Synchronize update of shared genotype likelihood tables.

* Add comment explaining synchronization.

* Synchronize for reads as well as writes to shared state.
5 participants