diff --git a/src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSpark.java b/src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSpark.java
index 6109f503464..d258effcb38 100644
--- a/src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSpark.java
+++ b/src/main/java/org/broadinstitute/hellbender/tools/spark/pipelines/SortSamSpark.java
@@ -1,6 +1,7 @@
 package org.broadinstitute.hellbender.tools.spark.pipelines;
 
 import htsjdk.samtools.SAMFileHeader;
+import htsjdk.samtools.SAMRecord;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;
 import org.broadinstitute.barclay.argparser.Argument;
@@ -18,6 +19,43 @@
 import java.util.Collections;
 import java.util.List;
 
+
+/**
+ * SortSam on Spark (works on SAM/BAM/CRAM)
+ *
+ * <p>A Spark implementation of Picard SortSam. The Spark version can run in parallel on multiple cores on a local machine or on multiple machines on a Spark cluster while still matching the output of the single-core Picard version. See Blog#23420 for performance benchmarks.</p>
+ *
+ * <p>The tool sorts reads by coordinate order by default, or alternatively by read name (the QNAME field) when given the '-SO queryname' option. The contig ordering of the sequence dictionary, taken from the @SQ header lines or from the optionally provided reference, defines coordinate order, and the tool uses it to sort reads by the RNAME field. For reads mapping to a contig, coordinate sorting further orders reads by the POS field of the SAM record, which contains the leftmost mapping position.</p>
+ *
+ * <p>To queryname-sort, the tool first groups reads by readname and then deterministically sorts within a readname set by orientation and by the secondary and supplementary SAM flags. For paired-end reads, both reads of a pair share the same queryname. Because aligners can generate secondary and supplementary alignments, a queryname group can consist of more than two records for a paired-end pair.</p>
+ *
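To make the two orderings described above concrete, here is a minimal htsjdk-based sketch of the sort keys, written for this review only; it is not the tool's actual Spark implementation, and htsjdk's own SAMRecordCoordinateComparator and SAMRecordQueryNameComparator carry the full tie-breaking rules.

    import htsjdk.samtools.SAMFileHeader;
    import htsjdk.samtools.SAMRecord;
    import java.util.Comparator;

    public final class SortKeySketch {

        // Coordinate order: contig index from the header's sequence dictionary (RNAME), then POS.
        // Unmapped-read handling is simplified here (getSequenceIndex returns -1 for "*").
        public static Comparator<SAMRecord> coordinateOrder(final SAMFileHeader header) {
            return Comparator
                    .comparingInt((SAMRecord r) -> header.getSequenceDictionary().getSequenceIndex(r.getReferenceName()))
                    .thenComparingInt(SAMRecord::getAlignmentStart);
        }

        // Queryname order: group by QNAME, then break ties deterministically using orientation
        // and the secondary/supplementary SAM flags, as the paragraph above describes.
        public static Comparator<SAMRecord> querynameOrder() {
            return Comparator
                    .comparing(SAMRecord::getReadName)
                    .thenComparing(SAMRecord::getReadNegativeStrandFlag)
                    .thenComparing(SAMRecord::isSecondaryAlignment)
                    .thenComparing(SAMRecord::getSupplementaryAlignmentFlag);
        }
    }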

+ * <h3>Usage examples</h3>
+ * Coordinate-sort aligned reads using all cores available locally
+ * <pre>
+ * gatk SortSamSpark \
+ *   -I aligned.bam \
+ *   -O coordinatesorted.bam
+ * </pre>
+ *
+ * Queryname-sort reads using four executor cores on a Spark cluster
+ * <pre>
+ * gatk SortSamSpark \
+ *   -I coordinatesorted.bam \
+ *   -SO queryname \
+ *   -O querygroupsorted.bam \
+ *   -- \
+ *   --spark-runner SPARK \
+ *   --spark-master MASTER_URL \
+ *   --num-executors 5 \
+ *   --executor-cores 4
+ * </pre>
+ *
+ * <h3>Notes</h3>
+ * <ol>
+ *   <li>This Spark tool performs a significant amount of disk I/O. Run with both the input data and the outputs on high-throughput SSDs when possible. When pipelining this tool on Google Compute Engine instances, requisition machines with LOCAL SSDs for best performance.</li>
+ *   <li>Furthermore, we recommend explicitly setting the Spark temp directory to an available SSD when running this in local mode by adding the argument --conf 'spark.local.dir=/PATH/TO/TEMP/DIR'. See the discussion at https://gatkforums.broadinstitute.org/gatk/discussion/comment/56337 for details.</li>
+ * </ol>
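If you build the Spark context yourself rather than going through the gatk launcher, the same scratch-directory setting can be applied on a SparkConf; this is a hedged sketch for this review with a placeholder SSD path, and GATK users would normally just pass the --conf argument shown in the note above.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public final class LocalDirConfSketch {
        public static void main(final String[] args) {
            final SparkConf conf = new SparkConf()
                    .setAppName("SortSamSparkLocal")
                    .setMaster("local[*]")                               // use all local cores
                    .set("spark.local.dir", "/mnt/local-ssd/spark-tmp"); // placeholder SSD scratch path
            try (final JavaSparkContext ctx = new JavaSparkContext(conf)) {
                // ... build and sort the reads RDD here ...
            }
        }
    }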
+ */
 @DocumentedFeature
 @CommandLineProgramProperties(summary = "Sorts the input SAM/BAM/CRAM",
         oneLineSummary = "SortSam on Spark (works on SAM/BAM/CRAM)",
diff --git a/src/main/java/org/broadinstitute/hellbender/tools/spark/transforms/markduplicates/MarkDuplicatesSpark.java b/src/main/java/org/broadinstitute/hellbender/tools/spark/transforms/markduplicates/MarkDuplicatesSpark.java
index a1b37d7e9ac..355f6d9695e 100644
--- a/src/main/java/org/broadinstitute/hellbender/tools/spark/transforms/markduplicates/MarkDuplicatesSpark.java
+++ b/src/main/java/org/broadinstitute/hellbender/tools/spark/transforms/markduplicates/MarkDuplicatesSpark.java
@@ -35,80 +35,85 @@
 import java.util.*;
 
 /**
- * <p>This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
- * parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
- * the output of the single-core Picard version. Since the tool requires holding all of the readnames in memory
- * while it groups the read information, it is recommended running this tool on a machine/configuration
- * with at least 8 GB of memory overall for a typical 30x bam.</p>
+ * MarkDuplicates on Spark
 *
- * <p>This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are
- * defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library
- * construction using PCR. See also EstimateLibraryComplexity
- * for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster,
- * incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are
- * referred to as optical duplicates.</p>
+ *
+ * <p>This is a Spark implementation of Picard MarkDuplicates that allows the tool to be run in parallel on multiple cores on a local machine or on multiple machines on a Spark cluster while still matching the output of the non-Spark Picard version of the tool. Since the tool requires holding all of the readnames in memory while it groups read information, machine configuration and starting sort-order impact tool performance.</p>
 *
- * <p>The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
- * After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
- * reads by the sums of their base-quality scores (default method).</p>
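As a rough illustration of the default scoring mentioned in the removed paragraph above, here is a sum-of-base-qualities sketch written for this review; the quality cutoff of 15 follows Picard's documented minimum base quality for this strategy and should be treated as an assumption, not something this patch specifies.

    import htsjdk.samtools.SAMRecord;

    public final class DuplicateScoreSketch {
        // Sum of base qualities at or above a minimum quality; the read (or read pair, by summing
        // both mates) with the highest score is kept as the representative and the rest are marked
        // as duplicates. MIN_BASE_QUAL = 15 is an assumption borrowed from Picard's default strategy.
        private static final int MIN_BASE_QUAL = 15;

        public static int sumOfBaseQualities(final SAMRecord read) {
            int score = 0;
            for (final byte qual : read.getBaseQualities()) {
                if (qual >= MIN_BASE_QUAL) {
                    score += qual;
                }
            }
            return score;
        }
    }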

+ * Here are some differences of note between MarkDuplicatesSpark and Picard MarkDuplicates.
 *
- * <p>The tool's main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for each
- * read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024.
- * If you are not familiar with this type of annotation, please see the following blog post for additional information.</p>
 *
- * <p>Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of
- * duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in
- * the 'optional field' section of a SAM/BAM file. Invoking the 'duplicate-tagging-policy' option,
- * you can instruct the program to mark all the duplicates (All), only the optical duplicates (OpticalOnly), or no
- * duplicates (DontTag). The records within the output of a SAM/BAM file will have values for the 'DT' tag (depending on the invoked
- * 'duplicate-tagging-policy'), as either library/PCR-generated duplicates (LB), or sequencing-platform artifact duplicates (SQ).
- * This tool uses the 'read-name-regex' and the 'optical-duplicate-pixel-distance' options as the primary methods to identify
- * and differentiate duplicate types. Set 'read-name-regex' to null to skip optical duplicate detection, e.g. for RNA-seq
- * or other data where duplicate sets are extremely large and estimating library complexity is not an aim.
- * Note that without optical duplicate counts, library size estimation will be inaccurate.</p>
+ *
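For reference during review, a small htsjdk sketch of how the duplicate flag (0x400) and the optional DT tag described above can be inspected in the tool's output; the file name and the tag handling here are illustrative only.

    import htsjdk.samtools.SAMRecord;
    import htsjdk.samtools.SamReader;
    import htsjdk.samtools.SamReaderFactory;
    import java.io.File;
    import java.io.IOException;

    public final class DuplicateFlagSketch {
        public static void main(final String[] args) throws IOException {
            long duplicates = 0;
            // "marked_duplicates.bam" is the example output name used in the usage sections below.
            try (final SamReader reader = SamReaderFactory.makeDefault().open(new File("marked_duplicates.bam"))) {
                for (final SAMRecord read : reader) {
                    if (read.getDuplicateReadFlag()) {          // SAM flag 0x400 (decimal 1024)
                        duplicates++;
                        // DT is only present when a --duplicate-tagging-policy other than DontTag is used:
                        // "LB" for library/PCR duplicates, "SQ" for sequencing-platform (optical) duplicates.
                        final String duplicateType = read.getStringAttribute("DT");
                    }
                }
            }
            System.out.println("duplicate-flagged reads: " + duplicates);
        }
    }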

+ * <p>For a typical 30x coverage WGS BAM, we recommend running on a machine with at least 16 GB of memory. Memory usage scales with library complexity, and the tool will need more memory for larger or more complex data. If the tool is running slowly, it is possible Spark is running out of memory and is spilling data to disk excessively. If this is the case, then increasing the memory available to the tool should yield a speedup, up to a threshold beyond which additional memory has no further effect.</p>
 *
- * <p>MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.</p>
+ *
+ * <p>Note that this tool does not support UMI-based duplicate marking.</p>
 *
- * <p>The program can take either coordinate-sorted or query-sorted inputs, however it is recommended that the input be
- * query-sorted or query-grouped as the tool will have to perform an extra sort operation on the data in order to associate
- * reads from the input bam with their mates.</p>
+ *
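A quick way to check what ordering an input actually declares, which is what the sort-order remark above hinges on, is to read its header with htsjdk; a small sketch written for this review, with a placeholder file name.

    import htsjdk.samtools.SAMFileHeader;
    import htsjdk.samtools.SamReader;
    import htsjdk.samtools.SamReaderFactory;
    import java.io.File;
    import java.io.IOException;

    public final class SortOrderCheckSketch {
        public static void main(final String[] args) throws IOException {
            try (final SamReader reader = SamReaderFactory.makeDefault().open(new File("input.bam"))) {
                final SAMFileHeader header = reader.getFileHeader();
                // Queryname-sorted or query-grouped inputs let the tool skip its extra mate-association
                // sort; coordinate-sorted inputs will be re-sorted internally, as described above.
                System.out.println("sort order : " + header.getSortOrder());
                System.out.println("group order: " + header.getGroupOrder());
            }
        }
    }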

+ * <p>See MarkDuplicates documentation for details on tool features and background information.</p>
 *
- * <p>If desired, duplicates can be removed using the 'remove-all-duplicates' and 'remove-sequencing-duplicates' options.</p>
+ *
+ * <h3>Usage examples</h3>
+ * Provide queryname-grouped reads to MarkDuplicatesSpark
+ * <pre>
+ * gatk MarkDuplicatesSpark \
+ *   -I input.bam \
+ *   -O marked_duplicates.bam
+ * </pre>
+ *
+ * Additionally produce estimated library complexity metrics
+ * <pre>
+ * gatk MarkDuplicatesSpark \
+ *   -I input.bam \
+ *   -O marked_duplicates.bam \
+ *   -M marked_dup_metrics.txt
+ * </pre>
 *
- * <h3>Usage example:</h3>
- * <pre>
- *     gatk MarkDuplicatesSpark \\
- *       -I input.bam \\
- *       -O marked_duplicates.bam \\
- *       -M marked_dup_metrics.txt
+ *
+ * MarkDuplicatesSpark run locally, specifying the removal of sequencing duplicates and the tagging of optical duplicates
+ * <pre>
+ * gatk MarkDuplicatesSpark \
+ *   -I input.bam \
+ *   -O marked_duplicates.bam \
+ *   --remove-sequencing-duplicates \
+ *   --duplicate-tagging-policy OpticalOnly
 * </pre>
 *
- * MarkDuplicates run locally specifying the core input (if 'spark.executor.cores' is unset spark will use all available cores on the machine)
+ * MarkDuplicatesSpark run locally, specifying the number of executor cores. Note that if 'spark.executor.cores' is unset, Spark will use all available cores on the machine.
 * <pre>
- *     gatk MarkDuplicatesSpark \\
- *       -I input.bam \\
- *       -O marked_duplicates.bam \\
- *       -M marked_dup_metrics.txt \\
+ * gatk MarkDuplicatesSpark \
+ *   -I input.bam \
+ *   -O marked_duplicates.bam \
+ *   -M marked_dup_metrics.txt \
 *   --conf 'spark.executor.cores=5'
 * </pre>
 *
- * MarkDuplicates run on a spark cluster 5 machines
+ * MarkDuplicatesSpark run on a Spark cluster with five executors, each with eight executor cores
 * <pre>
- *     gatk MarkDuplicatesSpark \\
- *       -I input.bam \\
- *       -O marked_duplicates.bam \\
- *       -M marked_dup_metrics.txt \\
- *       -- \\
- *       --spark-runner SPARK \\
- *       --spark-master \\
- *       --num-executors 5 \\
- *       --executor-cores 8
+ * gatk MarkDuplicatesSpark \
+ *   -I input.bam \
+ *   -O marked_duplicates.bam \
+ *   -M marked_dup_metrics.txt \
+ *   -- \
+ *   --spark-runner SPARK \
+ *   --spark-master MASTER_URL \
+ *   --num-executors 5 \
+ *   --executor-cores 8
 * </pre>
 *
 * Please see
- * MarkDuplicates
+ * Picard DuplicationMetrics
 * for detailed explanations of the output metrics.
 *
+ *
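Since both the old and the new text point at the duplication metrics output (-M marked_dup_metrics.txt above), here is a hedged htsjdk sketch of reading that file back; the generic MetricBase-based parsing is an assumption about how one might inspect it, not part of this patch.

    import htsjdk.samtools.metrics.MetricBase;
    import htsjdk.samtools.metrics.MetricsFile;
    import java.io.FileReader;
    import java.io.IOException;

    public final class DuplicationMetricsSketch {
        public static void main(final String[] args) throws IOException {
            // Picard-style metrics files name their bean class in the header, so a generic MetricsFile
            // can read them back as long as that class is on the classpath; each row is one library.
            final MetricsFile<MetricBase, Integer> metricsFile = new MetricsFile<>();
            try (final FileReader reader = new FileReader("marked_dup_metrics.txt")) {
                metricsFile.read(reader);
            }
            for (final MetricBase row : metricsFile.getMetrics()) {
                System.out.println(row);  // MetricBase.toString() prints field names and values
            }
        }
    }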

+ * <h3>Notes</h3>
+ * <ol>
+ *   <li>This Spark tool performs a significant amount of disk I/O. Run with both the input data and the outputs on high-throughput SSDs when possible. When pipelining this tool on Google Compute Engine instances, requisition machines with LOCAL SSDs for best performance.</li>
+ *   <li>Furthermore, we recommend explicitly setting the Spark temp directory to an available SSD when running this in local mode by adding the argument --conf 'spark.local.dir=/PATH/TO/TEMP/DIR'. See this forum discussion for details.</li>
+ * </ol>
 */
 @DocumentedFeature
 @CommandLineProgramProperties(