From bc6b637ff677268823a522ad8fcfbd4c53b41858 Mon Sep 17 00:00:00 2001
From: James
+ * This tool is a Spark implementation of the tool MarkDuplicates in Picard allowing for better utilization
+ * of available system resources to speed up duplicate marking. This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are
+ * defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library
+ * construction using PCR. See also "EstimateLibraryComplexity"
+ * for additional notes on PCR duplication artifacts. Duplicate reads can also result from a single amplification cluster,
+ * incorrectly detected as multiple clusters by the optical sensor of the sequencing instrument. These duplication artifacts are
+ * referred to as optical duplicates.
+ * The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
+ * After duplicate reads arecollected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
+ * reads by the sums of their base-quality scores (default method).
+ * The tool's main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for each
+ * read. Duplicates are marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024.
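As a quick illustration of the flag semantics described above, a read's duplicate status can be recovered by testing bit 0x400 (decimal 1024) of the SAM FLAG field. This is a minimal sketch; the helper name is ours, not part of GATK or Picard.

```python
# Sketch: testing the SAM duplicate bit (0x400 / 1024) described above.
# Only the FLAG bit semantics come from the SAM spec; the helper is illustrative.
DUPLICATE_FLAG = 0x400  # decimal 1024

def is_marked_duplicate(flag: int) -> bool:
    """Return True if the SAM FLAG has the PCR/optical duplicate bit set."""
    return bool(flag & DUPLICATE_FLAG)

print(is_marked_duplicate(1024))   # only the duplicate bit set -> True
print(is_marked_duplicate(99))     # typical properly-paired read -> False
print(is_marked_duplicate(1107))   # 1024 + 83: duplicate bit plus pairing bits -> True
```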
+ * If you are not familiar with this type of annotation, please see the following blog post for additional information.
+ * Although the bitwise flag annotation indicates whether a read was marked as a duplicate, it does not identify the type of
+ * duplicate. To do this, a new tag called the duplicate type (DT) tag was recently added as an optional output in
+ * the 'optional field' section of a SAM/BAM file. Invoking the 'duplicate-tagging-policy' option,
+ * you can instruct the program to mark all the duplicates (All), only the optical duplicates (OpticalOnly), or no
+ * duplicates (DontTag). The records within the output SAM/BAM file will have values for the 'DT' tag (depending on the invoked
+ * 'duplicate-tagging-policy'), as either library/PCR-generated duplicates (LB), or sequencing-platform artifact duplicates (SQ).
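A minimal sketch of pulling the 'DT' value out of a raw SAM text record, assuming the tab-separated TAG:TYPE:VALUE layout of optional fields from the SAM spec; the helper and the example record are illustrative, not GATK code.

```python
def get_dt_tag(sam_line: str):
    """Return the optional DT (duplicate type) value from a tab-separated SAM
    record: 'LB' (library/PCR duplicate), 'SQ' (sequencing/optical duplicate),
    or None if the tag is absent."""
    # Optional TAG:TYPE:VALUE fields start after the 11 mandatory columns.
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        tag, _typ, value = field.split(":", 2)
        if tag == "DT":
            return value
    return None

# A made-up minimal record: 11 mandatory columns followed by optional tags.
record = "\t".join(["read1", "1107", "chr1", "100", "60", "50M", "=", "200", "150",
                    "A" * 50, "I" * 50, "DT:Z:SQ", "NM:i:0"])
print(get_dt_tag(record))  # SQ
```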
+ * This tool uses the 'read-name-regex' and the 'optical-duplicate-pixel-distance' options as the primary methods to identify
+ * and differentiate duplicate types. Set 'read-name-regex' to null to skip optical duplicate detection, e.g. for RNA-seq
+ * or other data where duplicate sets are extremely large and estimating library complexity is not an aim.
+ * Note that without optical duplicate counts, library size estimation will be inaccurate.
+ * MarkDuplicates also produces a metrics file indicating the numbers of duplicates for both single- and paired-end reads.
+ * The program can take either coordinate-sorted or query-sorted inputs; however, it is recommended that the input be
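The optical-duplicate criterion described above can be sketched as follows, assuming Illumina-style read names that encode lane:tile:x:y. The regex, helper names, and the 100-pixel default are illustrative stand-ins for the tool's actual 'read-name-regex' and 'optical-duplicate-pixel-distance' handling.

```python
import re

# Hedged sketch: two duplicates are "optical" if their cluster coordinates,
# parsed from the read name, fall within the pixel distance on the same
# lane and tile. Read-name layout assumed: instrument:run:flowcell:lane:tile:x:y.
READ_NAME_RE = re.compile(r"^[^:]+:[^:]+:[^:]+:(\d+):(\d+):(\d+):(\d+)$")

def cluster_coords(read_name: str):
    """Parse (lane, tile, x, y) from an Illumina-style read name, or None."""
    m = READ_NAME_RE.match(read_name)
    return tuple(int(g) for g in m.groups()) if m else None

def is_optical_duplicate(name_a: str, name_b: str, pixel_distance: int = 100) -> bool:
    """True if both reads sit on the same lane and tile within pixel_distance
    in both x and y (reads with unparseable names are never optical)."""
    a, b = cluster_coords(name_a), cluster_coords(name_b)
    if a is None or b is None:
        return False
    return ((a[0], a[1]) == (b[0], b[1])
            and abs(a[2] - b[2]) <= pixel_distance
            and abs(a[3] - b[3]) <= pixel_distance)

print(is_optical_duplicate("M1:42:FC1:1:1101:5000:6000",
                           "M1:42:FC1:1:1101:5050:6020"))  # True: same tile, within 100px
print(is_optical_duplicate("M1:42:FC1:1:1101:5000:6000",
                           "M1:42:FC1:1:1102:5000:6000"))  # False: different tile
```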
+ * query-sorted or query-grouped as the tool will have to perform an extra sort operation on the data in order to associate
+ * reads from the input bam with their mates. If desired, duplicates can be removed using the 'remove-all-duplicates' and 'remove-sequencing-duplicates' options.
- * This tool is a Spark implementation of the tool MarkDuplicates in Picard allowing for better utilization
- * of available system resources to speed up duplicate marking.
+ * This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
+ * parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
+ * the output of the single-core Picard version. Since the tool requires holding all of the readnames in memory
+ * while it groups the paired-down read information, it is recommended running this tool on a machine/configuration
+ * with at least 8 GB of memory for a typical 30x bam.
 * This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are
* defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library
@@ -46,7 +49,7 @@
 * referred to as optical duplicates.
 * The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.
- * After duplicate reads arecollected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
+ * After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks
 * reads by the sums of their base-quality scores (default method).
 * The tool's main output is a new SAM or BAM file, in which duplicates have been identified in the SAM flags field for each
@@ -80,6 +83,19 @@
* -M marked_dup_metrics.txt
*
*
+ * This is a Spark implementation of the MarkDuplicates tool from Picard that allows the tool to be run in
* parallel on multiple cores on a local machine or multiple machines on a Spark cluster while still matching
* the output of the single-core Picard version. Since the tool requires holding all of the readnames in memory
- * while it groups the paired-down read information, it is recommended running this tool on a machine/configuration
- * with at least 8 GB of memory for a typical 30x bam.
 * Usage example:
+ *
+ * gatk MarkDuplicatesSpark \\
+ * -I input.bam \\
+ * -O marked_duplicates.bam \\
+ * -M marked_dup_metrics.txt
+ *
+ * Please see
+ * MarkDuplicates
+ * for detailed explanations of the output metrics.
+ *
+ */
@DocumentedFeature
@CommandLineProgramProperties(
summary = "Marks duplicates on Spark",
oneLineSummary = "MarkDuplicates on Spark",
programGroup = ReadDataManipulationProgramGroup.class)
-@BetaFeature
public final class MarkDuplicatesSpark extends GATKSparkTool {
private static final long serialVersionUID = 1L;
From 92c417105bd9e027511869ce80cc64fbef877c3e Mon Sep 17 00:00:00 2001
From: James
+ * MarkDuplicates run on a Spark cluster of 5 machines
+ *
+ * gatk MarkDuplicatesSpark \\
+ *
* Please see
* MarkDuplicates
* for detailed explanations of the output metrics.
From 0be10b484be25871ccbec989ec651fda13d0827f Mon Sep 17 00:00:00 2001
From: James
+ * -I input.bam \\
+ * -O marked_duplicates.bam \\
+ * -M marked_dup_metrics.txt \\
+ * -- \\
+ * --spark-runner SPARK \\
+ * --spark-master
+ * --num-executors 5 \\
+ * --executor-cores 8
+ *
 * This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are
 * defined as originating from a single fragment of DNA. Duplicates can arise during sample preparation e.g. library
@@ -83,6 +83,15 @@
 * -M marked_dup_metrics.txt
 *
 *
+ *
+ * gatk MarkDuplicatesSpark \\
+ * -I input.bam \\
+ * -O marked_duplicates.bam \\
+ * -M marked_dup_metrics.txt \\
+ * --conf 'spark.executor.cores=5'
+ *
* gatk MarkDuplicatesSpark \\