
Synchronize update of shared genotype likelihood tables. #5071

Merged: 3 commits into master, Oct 3, 2018

Conversation

@tomwhite (Contributor)

Fixes #4661.

I managed to run genome_reads-pipeline_hdfs.sh with this change, whereas before it was failing (see details in #4661).

No unit test since it is very difficult to write a test for the effect of adding synchronization.

@tomwhite tomwhite requested a review from lbergelson July 31, 2018 08:22
@codecov-io

codecov-io commented Jul 31, 2018

Codecov Report

Merging #5071 into master will decrease coverage by 0.002%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #5071       +/-   ##
===============================================
- Coverage     86.761%   86.759%   -0.002%     
+ Complexity     29765     29762        -3     
===============================================
  Files           1825      1825               
  Lines         137699    137698        -1     
  Branches       15176     15175        -1     
===============================================
- Hits          119469    119465        -4     
- Misses         12716     12720        +4     
+ Partials        5514      5513        -1
Impacted Files | Coverage Δ | Complexity Δ
--- | --- | ---
...lkers/genotyper/GenotypeLikelihoodCalculators.java | 88.75% <100%> (+2.33%) | 24 <0> (-1) ⬇️
...utils/smithwaterman/SmithWatermanIntelAligner.java | 50% <0%> (-30%) | 1% <0%> (-2%)
...e/hellbender/engine/spark/SparkContextFactory.java | 71.233% <0%> (-2.74%) | 11% <0%> (ø)
...ithwaterman/SmithWatermanIntelAlignerUnitTest.java | 60% <0%> (ø) | 2% <0%> (ø) ⬇️

@droazen droazen requested review from jamesemery and removed request for lbergelson August 27, 2018 19:28
@droazen (Contributor)

droazen commented Aug 27, 2018

Re-assigning to @jamesemery

@jamesemery (Collaborator) left a comment

Add comments explaining why this is necessary. I agree a test would be difficult to write in a way that would be informative across hardware. I'm still not entirely sure, based on the HaplotypeCallerSpark code, what is causing its stream to become parallel within a region.

@@ -275,12 +275,11 @@ public GenotypeLikelihoodCalculator getInstance(final int ploidy, final int alle
* @param requestedMaximumAllele the new requested maximum allele maximum.
* @param requestedMaximumPloidy the new requested ploidy maximum.
*/
private void ensureCapacity(final int requestedMaximumAllele, final int requestedMaximumPloidy) {
@jamesemery (Collaborator)

I would add a comment to the top of this method explaining that, since each region in Spark owns one HaplotypeCallerEngine (and therefore one genotyping engine) that gets reused, this must be synchronized to avoid race conditions.

@tomwhite (Contributor, Author)

Added a comment.

@jamesemery (Collaborator)

jamesemery commented Aug 29, 2018

I was able to trigger the ArrayIndexOutOfBoundsException (and subsequently fix it with your synchronization) with the following code, so perhaps a test similar to this would be useful to include:

```java
@Test
public void testRaceCondition(){
    final GenotypeLikelihoodCalculators calculator = new GenotypeLikelihoodCalculators();

    final List<int[]> counts = new ArrayList<>();
    final Random random = new Random(10);
    for (int i = 1; i < 19; i++) {
        counts.add(new int[]{random.nextInt(i), random.nextInt(i)});
    }

    Utils.stream(counts).parallel().forEach(params -> calculator.getInstance(params[0] + 1, params[1] + 1));
}
```
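For readers following along, here is a condensed, self-contained sketch of the failure mode this test exercises, using toy class names (not the GATK ones): a shared, lazily-grown lookup table whose resize is synchronized, so a parallel stream can never observe a capacity that has grown ahead of the array backing it.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

// Toy stand-in for a shared, lazily-resized cache like the one in
// GenotypeLikelihoodCalculators. Not the real GATK class.
class SharedTable {
    private int[] table = new int[1];
    private int capacity = 1;

    // Without "synchronized", two threads can interleave the resize so that
    // one sees the enlarged capacity but the old, smaller array, producing
    // an ArrayIndexOutOfBoundsException like the one reported in #4661.
    synchronized void ensureCapacity(final int requested) {
        if (requested >= capacity) {
            final int[] newTable = new int[requested + 1];
            System.arraycopy(table, 0, newTable, 0, capacity);
            table = newTable;
            capacity = requested + 1;
        }
    }

    synchronized int get(final int index) {
        ensureCapacity(index);
        return table[index];
    }
}

public class RaceSketch {
    public static void main(String[] args) {
        final SharedTable shared = new SharedTable();
        // Hammer the shared table from a parallel stream, as the Spark code does.
        IntStream.range(0, 100_000).parallel()
                .forEach(i -> shared.get(ThreadLocalRandom.current().nextInt(64)));
        System.out.println("no race detected");
    }
}
```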

@tomwhite tomwhite force-pushed the tw_hc_spark_array_index_ex branch from 6e3b64b to fce66a5 Compare September 4, 2018 09:17
@jamesemery (Collaborator) left a comment

One pedantic comment to add and then I think you would be good to merge

@@ -270,17 +270,18 @@ public GenotypeLikelihoodCalculator getInstance(final int ploidy, final int alle
}

/**
* Update of shared tables
@jamesemery (Collaborator) commented Sep 10, 2018

Upon some thought I realized that the getInstance() method which calls this one should probably be synchronized as well, specifically because calculateGenotypeCountUsingTables checks against maximumPloidy and maximumAllele, neither of which is thread-safe. Since this method currently only ever increases those fields, the check won't cause problems despite being unsafe, but we should add a strong comment somewhere enforcing that maximumPloidy and maximumAllele can only be increased by this method, so that we can feel safe checking against them wherever this is called.
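The hazard described here can be sketched as follows (a simplified stand-in, not the actual GATK class; the field names mirror the real ones, but the genotype-count arithmetic is a placeholder):

```java
import java.util.stream.IntStream;

// Simplified stand-in for GenotypeLikelihoodCalculators; not the real class.
class CalculatorsSketch {
    private int maximumPloidy = 2;
    private int maximumAllele = 1;

    // Invariant: these fields may only ever increase. Because of that, a
    // stale unsynchronized read in the caller would be merely wasteful (one
    // redundant ensureCapacity call), not incorrect. Synchronizing the
    // caller as well removes the unsafe read entirely.
    private synchronized void ensureCapacity(final int allele, final int ploidy) {
        maximumAllele = Math.max(maximumAllele, allele);
        maximumPloidy = Math.max(maximumPloidy, ploidy);
    }

    synchronized int calculateGenotypeCount(final int allele, final int ploidy) {
        if (allele > maximumAllele || ploidy > maximumPloidy) {
            ensureCapacity(allele, ploidy);
        }
        return (allele + 1) * (ploidy + 1); // placeholder, not the real formula
    }
}

public class CallerSyncSketch {
    public static void main(String[] args) {
        final CalculatorsSketch calc = new CalculatorsSketch();
        IntStream.range(0, 10_000).parallel()
                .forEach(i -> calc.calculateGenotypeCount(i % 8, i % 4));
        System.out.println("caller and callee synchronized: ok");
    }
}
```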

@jamesemery (Collaborator)

@tomwhite We just sat down and had a look at this class. @droazen was suggesting that this might still be unsafe, and that the outer layer where we compute the genotype likelihood should either be synchronized as well, or the whole thing should be wrapped in a thread-local object.

@droazen (Contributor)

droazen commented Sep 17, 2018

@tomwhite To clarify, I think that the caller of ensureCapacity(), namely GenotypeLikelihoodCalculators.calculateGenotypeCountUsingTables(), also needs to be synchronized in order to avoid some unlikely but still-possible races. Given this, I think that we should consider whether ThreadLocal might be a better option here.

It's not 100% clear to me whether a ThreadLocal get() call is cheaper than a synchronized method call, but some casual googling suggests that it might be. If we're going to end up entering a synchronized method on every single call to GenotypeLikelihoodCalculators.getInstance(), we might want to do some research into whether ThreadLocal + no synchronization would be faster, since I believe that this is a performance-sensitive section of code.
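A minimal sketch of the ThreadLocal alternative under discussion (the `Calc` class is a hypothetical stand-in, not GenotypeLikelihoodCalculators): each thread gets its own instance via `ThreadLocal.withInitial`, so the hot path needs no locking at all, at the cost of one set of tables per thread.

```java
import java.util.stream.IntStream;

public class ThreadLocalSketch {
    // Hypothetical per-thread calculator, standing in for the real class.
    static class Calc {
        private int maximum = 0;
        int ensure(final int requested) {
            // No synchronization needed: this state is confined to one thread.
            maximum = Math.max(maximum, requested);
            return maximum;
        }
    }

    private static final ThreadLocal<Calc> CALC = ThreadLocal.withInitial(Calc::new);

    public static void main(String[] args) {
        // Each worker thread of the parallel stream only ever touches its own Calc.
        IntStream.range(0, 100_000).parallel().forEach(i -> CALC.get().ensure(i % 64));
        System.out.println("thread-local: ok");
    }
}
```

The trade-off: `ThreadLocal.get()` is typically a cheap lookup keyed off the current Thread, but memory scales with the number of worker threads, and in long-lived thread pools the per-thread instances stick around after they stop being useful.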

@tomwhite (Contributor, Author)

@jamesemery I agree - all access (read and write) to GenotypeLikelihoodCalculators instance variables needs to be synchronized to make it safe. I think it would be sufficient to make getInstance() and calculateGenotypeCountUsingTables() synchronized.

@droazen, are you concerned about performance for the Spark case? For the walker version, presumably the access is single-threaded, and hence uncontended, which is very cheap.

Another option would be to maintain a separate instance of GenotypeLikelihoodCalculators per genotyping engine. The size of the table is ploidy * alleles, so not too large?

@tomwhite (Contributor, Author)

tomwhite commented Oct 2, 2018

@jamesemery @droazen I've updated this branch to ensure all read and write paths to shared state in GenotypeLikelihoodCalculators are synchronized. I then wrote a little test using AlleleSubsettingUtils to access GenotypeLikelihoodCalculators 10^6 times to see the effect of adding synchronization.

R session (times are in millis):

> without_sync = c(10166, 10049, 10306, 10059, 10165)
> with_sync = c(10700, 10384, 9923, 10097, 10190)
> t.test(without_sync, with_sync, paired=TRUE)

	Paired t-test

data:  without_sync and with_sync
t = -0.70447, df = 4, p-value = 0.52
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -542.5421  322.9421
sample estimates:
mean of the differences 
                 -109.8 

The p-value is not less than 0.05, so we can't reject the null hypothesis that the mean times are equal. In other words, adding synchronization doesn't make a measurable difference in this test.

BTW, I noticed that GenotypeLikelihoods already uses synchronization, so there is some precedent for achieving thread-safety this way.

@tomwhite tomwhite force-pushed the tw_hc_spark_array_index_ex branch from 085b399 to 892ff97 Compare October 2, 2018 16:27
@tomwhite tomwhite requested a review from droazen October 3, 2018 08:02
@droazen (Contributor) left a comment

👍

@droazen droazen merged commit fad6eb8 into master Oct 3, 2018
@droazen droazen deleted the tw_hc_spark_array_index_ex branch October 3, 2018 18:52
EdwardDixon pushed a commit to EdwardDixon/gatk that referenced this pull request Nov 9, 2018
…ute#5071)

* Synchronize update of shared genotype likelihood tables.

* Add comment explaining synchronization.

* Synchronize for reads as well as writes to shared state.
5 participants