Use SparkFiles to localize reference and known sites #5127
Conversation
@lbergelson please take a look at this when you can. It reduces the exome reads pipeline on Spark from 24min to 12min, and the genome pipeline from 94min to 55min on the same clusters.
@jamesemery please review in @lbergelson's absence.
We should probably keep the BROADCAST option on BQSR, at least for profiling purposes (so we have something to make a comparison against). I know that will involve some work, but it is probably better to spin it out into a second branch until we are convinced. The other comment I have is to ask whether you have tested the new behavior on single many-core machine types at all. It seems there might be issues arising from 32 or 64 workers all attempting to open the same file for every assembly region in HaplotypeCaller, and the single-machine Spark case is one that we definitely care about.
* Joins an RDD of GATKReads to variant data by copying the variants files to every node, using Spark's file
* copying mechanism.
*/
public final class JoinReadsWithVariants { |
It's a little strange that you put this functionality in its own class like this. I would probably move these methods to SparkUtils, or possibly put them into a walker that lives on top of BaseRecalibratorSpark containing the logic for this traversal. At the very least, if you don't want to fold this into some other class, you should probably move it to the org.broadinstitute.hellbender.utils.spark package.
Moved to org.broadinstitute.hellbender.utils.spark. (BTW I put it where it was originally since the name and package echo the other *JoinReadsWithVariants classes; however, we are going to remove those soon, I hope.)
});
}
private static Tuple2<GATKRead, Iterable<GATKVariant>> getOverlapping(final GATKRead read, final List<FeatureDataSource<VariantContext>> variantSources) { |
Can you rename these getOverlapping methods? You use the name for several methods here, each doing a different thing, and since they are all private they aren't a meaningful overload of each other.
Done
@@ -195,11 +200,14 @@ public static void callVariantsWithHaplotypeCallerAndWriteOutput(
// Reads must be coordinate sorted to use the overlaps partitioner
final SAMFileHeader readsHeader = header.clone();
readsHeader.setSortOrder(SAMFileHeader.SortOrder.coordinate);
final JavaRDD<GATKRead> coordinateSortedReads = SparkUtils.sortReadsAccordingToHeader(reads, readsHeader, numReducers);
final JavaRDD<GATKRead> coordinateSortedReads = reads; |
Eliminating this sort is dangerous. If you are going to do it (and it is probably better to avoid sorting at all, especially if we were doing it needlessly), you should add a check that asserts the sort order of the input and no longer set the header sort order. You could also do what MarkDuplicatesSpark does and perform the sort only if the input is out of the expected order.
You're right. I've added a check as you suggested.
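For illustration, a minimal sketch of the kind of sort-order check being discussed, using a hypothetical helper method rather than the exact code added in the PR:

// Hypothetical helper: fail fast if the input reads are not coordinate sorted,
// instead of silently re-sorting them or mutating the header sort order.
private static void assertCoordinateSorted(final SAMFileHeader header) {
    if (header.getSortOrder() != SAMFileHeader.SortOrder.coordinate) {
        throw new IllegalArgumentException(
                "Reads must be coordinate sorted to use the overlaps partitioner, but the sort order was: "
                        + header.getSortOrder());
    }
}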
final ReferenceMultiSourceAdapter referenceSource = new ReferenceMultiSourceAdapter(referenceMultiSource);
final HaplotypeCallerEngine hcEngine = new HaplotypeCallerEngine(hcArgsBroadcast.value(), false, false, header, referenceSource, annotatorEngineBroadcast.getValue());
final String pathOnExecutor = SparkFiles.get(referenceFileName);
final ReferenceSequenceFile taskReferenceSequenceFile = CachingIndexedFastaSequenceFile.checkAndCreate(IOUtils.getPath(pathOnExecutor)); |
So this will end up opening a CachingIndexedFastaSequenceFile over the stored reference for each region? Should we expect this to have a performance impact where spark is being run on a single machine with many cores (as opposed to a cluster)?
Also, not sure if this would be strictly relevant or help performance, but if the caching distance were a parameter we could save ourselves some of the memory overhead by using the assembly region size as the caching size, since we aren't caching between assembly regions as far as I can tell.
Yes, it will open a CachingIndexedFastaSequenceFile per region. In terms of memory this is not an issue, as it will use only ~1MB per instance. In terms of file contention, I haven't tried it, but I suspect that the cache itself will help (since most accesses are from cache) and using an SSD will mitigate it too; this is something where we need to experiment with different cache sizes, numbers of executors, and threads on a single machine.
import java.util.List;
import java.util.stream.Collectors;

;
spurious ";"
Fixed
throw new IllegalArgumentException("Unrecognized known sites file extension. Must be .vcf or .vcf.gz");
}
// TODO: check existence of all files
ctx.addFile(vcfFileName); |
When these files are added, a copy gets created on each worker node? So if a single machine houses 4 worker nodes then we would expect it to house 4 copies of the file?
With the YARN scheduler (used on Dataproc for example), each Spark executor gets a copy of the file. So if a single machine had 4 YARN containers each running a Spark executor, then yes it would have 4 copies of the file.
In Spark local mode there is a single executor, so a single copy of the file.
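For illustration, a minimal sketch of the Spark file-distribution mechanism being described; the bucket path and file name below are hypothetical:

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

final class SparkFilesSketch {
    // Driver side: register the file once; Spark copies it to every executor.
    static void distribute(final JavaSparkContext ctx) {
        ctx.addFile("gs://my-bucket/known-sites.vcf.gz");
    }

    // Executor side (inside a task): resolve the absolute path of the local copy by name.
    static String localCopy() {
        return SparkFiles.get("known-sites.vcf.gz");
    }
}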
if (outputBam != null) { // only write output of BQSR if output BAM is specified
writeReads(ctx, outputBam, finalReads, header);
}

// Run Haplotype Caller
final ReadFilter hcReadFilter = ReadFilter.fromList(HaplotypeCallerEngine.makeStandardHCReadFilters(), header);
final JavaRDD<GATKRead> filteredReadsForHC = finalReads.filter(read -> hcReadFilter.test(read));
filteredReadsForHC.persist(StorageLevel.DISK_ONLY()); // without caching, computations are run twice as a side effect of finding partition boundaries for sorting
final JavaRDD<GATKRead> filteredReadsForHC = finalReads.filter(hcReadFilter::test); |
I assume this was an empirical observation that the caching didn't help? Out of curiosity how much of a slowdown was caused by the caching here?
This was removed because HaplotypeCallerSpark no longer sorts, so there is no need to cache any more. I didn't measure how much of a slowdown the caching caused.
@@ -48,4 +53,36 @@ public static RecalibrationReport apply( final JavaPairRDD<GATKRead, ReadContext
final StandardCovariateList covariates = new StandardCovariateList(recalArgs, header);
return RecalUtils.createRecalibrationReport(recalArgs.generateReportTable(covariates.covariateNames()), quantizationInfo.generateReportTable(), RecalUtils.generateReportTables(combinedTables, covariates));
}

/**
* Run the {@link BaseRecalibrationEngine} on reads and overlapping variants. |
Given that the tool and its execution in ReadsPipelineSpark have been updated to use this method, we should just delete the other apply() function.
Done
Thanks for the review James. Regarding keeping Broadcast for BQSR - I'm not sure we need to. The Spark files approach is superior for several reasons: it's a lot faster (~2x), it uses a lot less memory (MB vs GB), it works progressively (conversely, if the reference is read into memory it takes a few minutes and can't be parallelized, so can't be sped up as much due to Amdahl's law), and it doesn't require conversion of the reference to 2bit format. We don't know how this performs on single multi-core machines (in Spark local mode), but I think we can examine and optimize that later.
Codecov Report
@@ Coverage Diff @@
## master #5127 +/- ##
===============================================
- Coverage 86.796% 86.755% -0.041%
+ Complexity 29779 29766 -13
===============================================
Files 1822 1825 +3
Lines 137732 137726 -6
Branches 15184 15188 +4
===============================================
- Hits 119546 119484 -62
- Misses 12665 12721 +56
Partials 5521 5521
@tomwhite Like we discussed this morning, we can and should get rid of the broadcast code, but we should ideally first get some sort of plot we can point to in order to justify the change. This will also be useful for future presentations of our performance improvements. The plot would ideally compare the new distribution strategy against broadcasting across a variable number of cores, so the performance improvement can be better understood.
Thanks @jamesemery. I have run a few benchmarks with the old and new code for comparison. [Raw data table and bar graphs comparing the exome and genome runs attached.]
@tomwhite These graphs look good! To clarify, is the improvement that these graphs demonstrate the new method for distributing the reference data to the nodes, or does it include the other improvements? Unfortunately we still probably want to definitively show that the reference distribution is better than the current broadcast approach so we can justify removing it entirely in this PR.
@jamesemery you're right, the numbers include changes to distributing both the reference data and the known sites data (VCF). To get a better idea of the effect on reference data alone, I ran HaplotypeCallerSpark with the broadcast reference: 9.25 minutes. So it looks like the new code is ~30% faster, just for the reference.
One small comment and otherwise this looks good. We should however open another PR for removing the 2bit reference support code in the engine once this goes in.
*/
protected String addReferenceFilesForSpark(JavaSparkContext ctx, String referenceFile) {
Path referencePath = IOUtils.getPath(referenceFile);
Path indexPath = ReferenceSequenceFileFactory.getFastaIndexFileName(referencePath); |
I presume this behaves reasonably if the files are empty, correct? We aren't restricting our Spark files to only running in the case where there exists both a dict and a fai next door.
I added a file-existence check for the dict and fai. Also made the method static.
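A rough sketch of the kind of existence check described here (the dictionary-path derivation and error message are assumptions; the real code may use an htsjdk helper to locate the .dict file, and IOUtils/ReferenceSequenceFileFactory are the same helpers used in the excerpt above):

// Sketch only: verify the .fai and .dict companions exist before shipping the reference.
protected static String addReferenceFilesForSpark(final JavaSparkContext ctx, final String referenceFile) {
    final Path referencePath = IOUtils.getPath(referenceFile);
    final Path indexPath = ReferenceSequenceFileFactory.getFastaIndexFileName(referencePath);
    // assume the sequence dictionary sits next to the fasta with a .dict extension
    final Path dictPath = referencePath.resolveSibling(
            referencePath.getFileName().toString().replaceAll("\\.fa(sta)?$", "") + ".dict");
    if (!Files.exists(indexPath) || !Files.exists(dictPath)) {
        throw new IllegalArgumentException(
                "Reference is missing its .fai index or .dict dictionary: " + referenceFile);
    }
    ctx.addFile(referenceFile);
    ctx.addFile(indexPath.toUri().toString());
    ctx.addFile(dictPath.toUri().toString());
    return referencePath.getFileName().toString();
}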
* @param knownVariants the known variant (VCF) files, can be local files or remote paths
* @return the reference file name; the absolute path of the file can be found by a Spark task using {@code SparkFiles#get()}
*/
protected List<String> addKnownSitesForSpark(JavaSparkContext ctx, List<String> knownVariants) { |
Looking at every usage of this method, it has to be called immediately before the JoinReadsWithVariants methods are called. I would either give it a more generic name like addVCFsToSpark so it could be used for other purposes, or explicitly make it part of JoinReadsWithVariants to avoid confusion about passing around the wrong List objects when people need to distribute VCF files. Either way, it looks to me like this method could be static anyway.
Renamed and made static.
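For illustration, the more generic static shape suggested above might look roughly like this (the method name comes from the reviewer's suggestion; the actual signature in the PR may differ):

// Hypothetical generic helper: register each VCF with Spark and return the bare file
// names that tasks can later resolve locally via SparkFiles.get().
protected static List<String> addVCFsToSpark(final JavaSparkContext ctx, final List<String> vcfPaths) {
    final List<String> vcfFileNames = new ArrayList<>();
    for (final String vcf : vcfPaths) {
        ctx.addFile(vcf); // copy to every executor
        vcfFileNames.add(IOUtils.getPath(vcf).getFileName().toString());
    }
    return vcfFileNames;
}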
@tomwhite If you can address the one issue I had with the classes (it could also be accomplished by sterner commenting on the relevant methods and classes, just so long as we are avoiding confusion somehow), then I think we can try to get this in quickly for Wednesday's release.
Use SparkFiles to localize reference and known sites for BaseRecalibratorSpark, BQSRPipelineSpark, ReadsPipelineSpark, and to localise reference for HaplotypeCallerSpark. As a consequence of this change, the reference can be a standard fasta file for Spark tools (indeed 2bit is no longer supported, since htsjdk does not have support for it). Also:
* Add AutoCloseableCollection to automatically close all objects in a collection.
* Add CloseAtEndIterator to call close on an object at the end of the iteration.
* Don't materialize all reads in memory in ApplyBQSRSparkFn.
Adjust Spark Eval script settings.
Thanks for the review @jamesemery. I've made the changes you suggested. It should be OK to merge now.
This looks good, let's merge it.
* AddContextDataToReadSpark (and AddContextDataToReadSparkUnitTest) implemented the different JoinStrategy options for BQSR; it has been replaced with the Spark Files mechanism (see #5127)
* BroadcastJoinReadsWithRefBases and JoinReadsWithRefBasesSparkUnitTest were only used by AddContextDataToReadSpark
* BroadcastJoinReadsWithVariants and JoinReadsWithVariantsSparkUnitTest were only used by AddContextDataToReadSpark
* ShuffleJoinReadsWithRefBases and ShuffleJoinReadsWithVariants were only used by AddContextDataToReadSpark
* JoinStrategy was only used for BQSR (HC always uses the overlaps partitioner), but is no longer used since #5127
* KnownSitesCache was replaced with Spark Files
* ReferenceMultiSourceAdapter in HaplotypeCallerSpark was replaced with the regular ReferenceDataSource
* BaseRecalibratorEngineSparkWrapper was only used by BaseRecalibratorSparkSharded, which was removed in #5192
... for BaseRecalibratorSpark, BQSRPipelineSpark, ReadsPipelineSpark, and to localise reference for HaplotypeCallerSpark.
As a consequence of this change, the reference can be a standard fasta file for Spark
tools (indeed 2bit is no longer supported, since htsjdk does not have support for it).
See #5103