Port of RevertSam tool into Spark #5395

jamesemery · 2018-11-07T19:37:22Z

Blocked by #5320

Fixes #4461

lbergelson · 2018-11-20T16:33:54Z

@jamesemery This needs a rebase now that #5320 is merged.

lbergelson

@jamesemery Checkpointing where I am in the review. I have a bunch more to cover.

lbergelson · 2018-11-20T16:38:36Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+    }
+
+// ________________________________________________________________________________________________________________________
+// sum garbage


typo: sum -> some

I think you meant to remove this?

lbergelson · 2018-11-20T16:39:39Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+ * <h3>Usage Examples</h3>
+ * <h4>Output to a single file</h4>
+ * <pre>
+ * java -jar picard.jar RevertSam \\


this is out of date

lbergelson · 2018-11-20T16:40:33Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+ *      -O reverted.bam
+ * </pre>
+ * <p>
+ * <h4>Output by read group into multiple files with sample map</h4>


all the examples are picard

lbergelson · 2018-11-20T16:40:55Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+ * </pre>
+ * This will output a BAM (Can be overridden with outputByReadgroupFileFormat option.)
+ * <br/>
+ * Note: If the program fails due to a SAM validation error, consider setting the VALIDATION_STRINGENCY option to


Is this still relevant to gatk? We set validation to lenient always I thought.

Do we? I've definitely had to set it but I may simply be forgetting which tools are secretly picard tools...

lbergelson · 2018-11-20T16:41:06Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+
+@DocumentedFeature
+@CommandLineProgramProperties(
+        summary =RevertSamSpark.USAGE_DETAILS,


weird spacing here

lbergelson · 2018-11-20T20:28:04Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+                try {
+                    outputMap = createOutputMapFromFile(outputMapFile);
+                } catch (IOException e) {
+                    throw new GATKException("Encountered an error reading output map file", e);


UserException
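
For context, a minimal sketch of the distinction being asked for, assuming the usual convention that a user-facing exception signals bad input (here, an unreadable map file) rather than an internal bug. `UserFacingException` and `readOutputMapLines` are hypothetical stand-ins so the snippet compiles on its own; they are not GATK classes or the tool's real parsing code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class OutputMapErrorSketch {
    // Hypothetical stand-in for a user-facing exception type.
    static final class UserFacingException extends RuntimeException {
        private static final long serialVersionUID = 1L;

        UserFacingException(final String message, final Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical helper: reads the raw lines of an output map file.
    static List<String> readOutputMapLines(final Path outputMapFile) {
        try {
            return Files.readAllLines(outputMapFile);
        } catch (IOException e) {
            // An unreadable or missing file is a problem with the user's input,
            // so report the path and keep the original cause attached.
            throw new UserFacingException(
                    "Encountered an error reading output map file: " + outputMapFile, e);
        }
    }
}
```

The message names the offending file and the cause is preserved, so the user can tell what to fix without reading a stack trace.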

lbergelson · 2018-11-20T20:28:21Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+                    }
+                }
+                if (sampleAlias != null && !allSampleAliasesIdentical) {
+                    throw new GATKException("Read groups have multiple values for sample.  " +


UserException

lbergelson · 2018-11-20T20:28:28Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+                            "A value for sampleAlias cannot be supplied.");
+                }
+                if (libraryName != null && !allLibraryNamesIdentical) {
+                    throw new GATKException("Read groups have multiple values for library name.  " +


UserException

lbergelson · 2018-11-20T20:30:37Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+                                                                      final boolean restoreOriginalQualities) {
+        final Map<String, FastqQualityFormat> output = new HashMap<>();
+
+        inHeader.getValue().getReadGroups().stream().parallel().forEach(rg -> {


It's kind of weird to do spark operations in a parallel stream. Is this safe? Is it faster? Spark already tries to use all the resources, it seems like this would just create more contention.

Possibly? This just issues the partitioning requests to Spark in parallel; Spark is then free to schedule the actual work however it likes down the line.

Is spark threadsafe?

I would remove the parallel unless you've shown that it's a significant speedup.

lbergelson · 2018-11-20T20:37:40Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+            final String key = rg.getId();
+            JavaRDD<GATKRead> filtered = reads.filter(r -> r.getReadGroup().equals(key));
+
+            if (!filtered.isEmpty()) {


This method seems like it could be very expensive, particularly when reads are distributed very unevenly across read groups (worst case: many read groups). In that case I think you will end up iterating the whole file N-1 times. I don't know how isEmpty works, but when the RDD IS empty it seems like it would have to traverse the entire input and run the filter to prove that it is empty.

Have you noticed this taking a long time?

I'm not sure what a clean alternative is... I've often wished for a partition operation on an RDD. You can sort of do it with a groupByKey but it will have nasty out of memory issues...

As discussed, since there isn't a clear alternative that offers a good tradeoff between complexity and correctness, I think I will leave this unaddressed for now and wait to see whether it ends up being a significant performance bottleneck.

Maybe add a comment here about the worst case performance issue.
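
To make the worst case concrete, here is a plain-Java analogy of the two shapes discussed above. It is a sketch under stated assumptions, not Spark code: a `List` stands in for the RDD, and `Read`, `splitByFiltering`, and `splitByGrouping` are hypothetical names. The `visits` counter only exists to show how many elements each approach touches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ReadGroupSplitSketch {
    // Minimal stand-in for a read: just a name and a read group ID.
    static final class Read {
        final String name;
        final String readGroup;

        Read(final String name, final String readGroup) {
            this.name = name;
            this.readGroup = readGroup;
        }
    }

    // Shape used in the PR: one full filter pass over the input per read group.
    // Every pass visits every read, even for a read group that has no reads.
    static Map<String, List<Read>> splitByFiltering(final List<Read> reads,
                                                    final List<String> readGroupIds,
                                                    final long[] visits) {
        final Map<String, List<Read>> out = new HashMap<>();
        for (final String id : readGroupIds) {
            final List<Read> filtered = new ArrayList<>();
            for (final Read r : reads) {
                visits[0]++;
                if (id.equals(r.readGroup)) {
                    filtered.add(r);
                }
            }
            if (!filtered.isEmpty()) {
                out.put(id, filtered);
            }
        }
        return out;
    }

    // Single-pass alternative: visits each read once, but holds every group
    // in memory at the same time (the analogue of the groupByKey concern).
    static Map<String, List<Read>> splitByGrouping(final List<Read> reads) {
        return reads.stream().collect(Collectors.groupingBy(r -> r.readGroup));
    }
}
```

With R reads and G read groups, the filtering shape touches R × G elements while the grouping shape touches R, which is the tradeoff a code comment at this spot could call out.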

jamesemery · 2018-11-27T21:53:52Z

@lbergelson Responded to your comments

lbergelson

@jamesemery Sorry this took so long. A bunch more comments. Needs a rebase as well and make sure it compiles in gradle. I think it has or at least had a bunch of warnings that caused failure.

lbergelson · 2018-11-29T17:18:17Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+    @Argument( fullName = DONT_RESTORE_ORIGINAL_QUALITIES_LONG_NAME, doc = "Set to prevent the tool from setting the OQ field to the QUAL where available.", optional = true)
+    public boolean dontRestoreOriginalQualities = false;
+
+    public static final String DONT_REMOVE_DUPLICATE_INFORMATION_LONG_NAME = "remove-duplicate-information";


the argument name is inverted from what it says

I think we can change this to keepDuplicateInformation and have it be less confusing without having to invert the value.
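
A rough sketch of that shape, so the constant, the flag the user types, the field, and the doc string all point the same way. The flag string and doc text are assumptions for illustration, and the `@Argument` annotation is left out so the snippet compiles standalone:

```java
final class KeepDuplicateInfoSketch {
    // Assumed flag name: the constant, the field, and the doc all say "keep".
    static final String KEEP_DUPLICATE_INFORMATION_LONG_NAME = "keep-duplicate-information";
    static final String KEEP_DUPLICATE_INFORMATION_DOC =
            "Set to keep duplicate information on reads; by default it is removed.";

    // Defaults to false, so the default behavior is still to remove the information.
    boolean keepDuplicateInformation = false;

    // The only negation lives at the point where the behavior is decided.
    boolean shouldRemoveDuplicateInformation() {
        return !keepDuplicateInformation;
    }
}
```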

lbergelson · 2018-11-29T17:20:11Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+            " the output may have the unusual but sometimes desirable trait of having unmapped reads that are marked as duplicates.")
+    public boolean dontRemoveDuplicateInformation = false;
+
+    public static final String DONT_REMOVE_ALIGNMENT_INFORMATION_LONG_NAME = "remove-alignment-information";


Like above, this is called DONT REMOVE but the argument name is remove, and the doc

lbergelson · 2018-11-29T17:22:21Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+        final List<String> errors = new ArrayList<>();
+        validateOutputParams(outputByReadGroup, output, outputMap);
+
+        if (!sanitize && keepFirstDuplicate) errors.add("'keepFirstDuplicate' cannot be used without 'sanitize'");


lbergelson · 2018-11-29T17:23:49Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+            final String key = rg.getId();
+            JavaRDD<GATKRead> filtered = reads.filter(r -> r.getReadGroup().equals(key));
+
+            if (!filtered.isEmpty()) {


Maybe add a comment here about the worst case performance issue.

lbergelson · 2018-11-29T17:26:26Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

+     * If this is run, we want to be careful to remove duplicated reads from the bam.
+     *
+     * In order to do this we group each read by its readname and randomly select one read labeled as first in pair
+     * and one read labled as second in pair to treat as the representative reads, throwing away the rest.


typo

Suggested change

* and one read labled as second in pair to treat as the representative reads, throwing away the rest.

* and one read labeled as second in pair to treat as the representative reads, throwing away the rest.

lbergelson · 2018-12-05T17:46:50Z

src/test/java/org/broadinstitute/hellbender/tools/spark/RevertSamSparkUnitTest.java

+        if (so != null) {
+            args.addArgument("sort-order",so.name()); //TODO decide on sort order outputing
+        }
+//        args[index++] = "dontRemoveDuplicateInformation=" + removeDuplicates; //TODO this is unsuported


this is supported now isn't it?

Yes, this test was a stub I forgot about while working out the tests (it is duplicated from the other class). I have moved the unit-test-like tests from the other class here instead.

lbergelson · 2018-12-05T17:47:06Z

src/test/java/org/broadinstitute/hellbender/tools/spark/RevertSamSparkUnitTest.java

+
+        runCommandLine(args);
+
+//        if (outputByReadGroup) {


deleted code? reenable?

It's not testing anything right now

lbergelson · 2018-12-05T17:47:31Z

src/test/java/org/broadinstitute/hellbender/tools/spark/RevertSamSparkUnitTest.java

+//        } else {
+//            output = File.createTempFile("reverted", ".sam");
+//        }
+        output.deleteOnExit();


This is borked; use BaseTest methods instead.

lbergelson · 2018-12-05T17:47:40Z

src/test/java/org/broadinstitute/hellbender/tools/spark/RevertSamSparkUnitTest.java

+                                   final boolean restoreOriginalQualities, final boolean outputByReadGroup, final String sample, final String library,
+                                   final List<String> attributesToClear) throws Exception {
+
+        final File output = outputByReadGroup?Files.createTempDirectory("picardRevertSamTest").toFile():File.createTempFile("reverted", ".sam");


use base test methods

lbergelson · 2018-12-05T17:48:02Z

src/test/java/org/broadinstitute/hellbender/tools/spark/RevertSamSparkUnitTest.java

+        if (sample != null) {
+            args.addArgument("sample-alias",sample);
+        }
+        if (library != null) {


this seems like an integration test

jamesemery · 2018-12-07T15:32:25Z

@lbergelson responded to your comments to the best of my willingness

lbergelson

@jamesemery There were still a few mixed up things in the argument names. You have a compiler warning as well and a conflict. 👍 once those are resolved and tests are passing.

lbergelson · 2018-12-11T16:23:44Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

-    @Argument(fullName = DONT_REMOVE_ALIGNMENT_INFORMATION_LONG_NAME, doc = "Remove all alignment information from the file.")
-    public boolean dontRemoveAlignmentInformation = false;
+    public static final String KEEP_ALIGNMENT_INFORMATION = "keep-alignment-information";
+    @Argument(fullName = KEEP_ALIGNMENT_INFORMATION, doc = "Remove all alignment information from the file.")


this message is still wrong

lbergelson · 2018-12-11T16:25:19Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

@@ -167,9 +169,9 @@
            " the output may have the unusual but sometimes desirable trait of having unmapped reads that are marked as duplicates.")


Don't remove -> keep, and the name and actual argument are still mismatching

lbergelson · 2018-12-11T16:26:14Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

@@ -196,7 +198,8 @@
        return Collections.singletonList(ReadFilterLibrary.ALLOW_ALL_READS);
    }

-    public static List<String> DEFAULT_ATTRIBUTES_TO_CLEAR = new ArrayList<String>() {{
+    public static List<String> DEFAULT_ATTRIBUTES_TO_CLEAR = Collections.unmodifiableList(new ArrayList<String>(){


It still seems like it would make sense to make this with just Arrays.asList() but I guess the anonymous subclass is fine....

But this way i don't make an array, then another one in order to build an arraylist! think about all of the wasted time we can spend in confusion instead!
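
For anyone comparing the two, a small standalone sketch of both constructions. The tag values are illustrative only, not the tool's real default list. Note that `Arrays.asList` returns a fixed-size view backed by the varargs array rather than copying it into a second one, and the anonymous-subclass form defines an extra class just to run the initializer block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class AttributesToClearSketch {
    // Illustrative tags only; wrapped so callers cannot modify the list.
    static final List<String> VIA_AS_LIST =
            Collections.unmodifiableList(Arrays.asList("NM", "UQ", "MD"));

    // Same contents built with an anonymous ArrayList subclass and an
    // instance initializer block, as in the diff above.
    static final List<String> VIA_ANONYMOUS_SUBCLASS =
            Collections.unmodifiableList(new ArrayList<String>() {
                private static final long serialVersionUID = 1L;

                {
                    add("NM");
                    add("UQ");
                    add("MD");
                }
            });
}
```

Both lists compare equal and both reject `add`, so the choice is mostly about readability.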

lbergelson · 2018-12-11T16:27:06Z

src/main/java/org/broadinstitute/hellbender/tools/spark/RevertSamSpark.java

@@ -311,11 +316,13 @@ protected void runTool(JavaSparkContext ctx) {
                                                                      final boolean restoreOriginalQualities) {
        final Map<String, FastqQualityFormat> output = new HashMap<>();

-        inHeader.getValue().getReadGroups().stream().parallel().forEach(rg -> {
+        inHeader.getValue().getReadGroups().stream().forEach(rg -> {


thank you :)

lbergelson

I'm sorry!

lbergelson · 2019-01-03T19:44:42Z

src/main/java/org/broadinstitute/hellbender/utils/codecs/table/TableCodec.java

        super(TableFeature.class);
+        if ( "".equals(headerLineDelimiter) ) {


Isn't this just Utils.nonEmpty() ?

lbergelson · 2019-01-03T19:46:07Z

src/main/java/org/broadinstitute/hellbender/utils/read/GATKRead.java

+     * If the original base quality scores have been store in the "OQ" tag will return the numeric
+     * score as a byte[]
+     */
+    default byte[] getOriginalBaseQualities() {


It doesn't seem worth including this as a first class function. It already exists in ReadUtils.

lbergelson · 2019-01-03T19:46:42Z

src/main/java/org/broadinstitute/hellbender/utils/read/GATKRead.java

+     * insert size (difference btw 5' end of read & 5' end of mate), if possible, else 0.
+     * Negative if mate maps to lower position than read.
+     */
+    void setInferredInsertSize(int insertSize);


this already exists as setFragmentLength

lbergelson · 2019-01-03T19:46:52Z

src/main/java/org/broadinstitute/hellbender/utils/read/SAMRecordToGATKReadAdapter.java

@@ -447,6 +447,13 @@ public void setIsUnplaced() {
        samRecord.setMappingQuality(SAMRecord.NO_MAPPING_QUALITY);
    }

+    @Override
+    public void setInferredInsertSize(int insertSize) {


@jamesemery I missed this somehow before, but this exists already as setFragmentLength

lbergelson · 2019-01-03T19:48:10Z

src/test/java/org/broadinstitute/hellbender/tools/spark/RevertSamSparkIntegrationTest.java

+    }
+
+    @Test
+    public static void testGetDefaultExtension() {


This seems like a unit test.

lbergelson · 2019-01-03T19:48:29Z

src/test/java/org/broadinstitute/hellbender/tools/spark/RevertSamSparkIntegrationTest.java

+    @Test
+    public void testGetDefaultExtension() {
+        Assert.assertEquals(RevertSamSpark.getDefaultExtension("this.is.a.sam", RevertSamSpark.FileType.dynamic), ".sam");
+        //Assert.assertEquals(RevertSamSpark.getDefaultExtension("this.is.a.cram", RevertSamSpark.FileType.dynamic), ".cram");


Can you add a comment pointing to whatever issue is tracking this problem?

lbergelson · 2019-01-03T19:49:06Z

src/test/java/org/broadinstitute/hellbender/tools/spark/RevertSamSparkIntegrationTest.java

+    }
+
+    @Test
+    public void testAssertAllReadGroupsMappedSuccess() {


a lot of these seem like unit tests.

lbergelson · 2019-01-03T19:49:34Z

...te/hellbender/tools/walkers/markduplicates/AbstractMarkDuplicatesCommandLineProgramTest.java

@@ -713,7 +713,7 @@ public void testDuplicateDetectionDataProviderWithMetrics(final File sam, final
        final List<String> lines = FileUtils.readLines(metricsFile, StandardCharsets.UTF_8);
        Assert.assertTrue(lines.get(0).startsWith("##"), lines.get(0));
        Assert.assertTrue(lines.get(1).startsWith("#"), lines.get(1));
-        Assert.assertTrue(lines.get(1).toLowerCase().contains("--input"), lines.get(1));  //Note: lowercase because picard uses INPUT and GATK uses input for full name
+        Assert.assertTrue(lines.get(1).toLowerCase().contains("--input"), lines.get(1));  //Note: lowercase because picard uses input and GATK uses input for full name


this is still bad

codecov-io · 2019-01-03T20:25:51Z

Codecov Report

Merging #5395 into master will increase coverage by 80.152%.
The diff coverage is 90%.

@@               Coverage Diff               @@
##             master     #5395        +/-   ##
===============================================
+ Coverage     6.942%   87.094%   +80.152%     
- Complexity     2754     31379     +28625     
===============================================
  Files          1915      1918         +3     
  Lines        144130    144650       +520     
  Branches      15901     15992        +91     
===============================================
+ Hits          10005    125981    +115976     
+ Misses       133410     12859    -120551     
- Partials        715      5810      +5095

Impacted Files	Coverage Δ	Complexity Δ
...ellbender/cmdline/StandardArgumentDefinitions.java	`0% <ø> (ø)`	`0 <0> (ø)`	⬇️
...er/utils/haplotype/HaplotypeBAMWriterUnitTest.java	`89.076% <ø> (+86.555%)`	`23 <0> (+22)`	⬆️
...hellbender/tools/spark/pipelines/SortSamSpark.java	`100% <100%> (+100%)`	`6 <0> (+6)`	⬆️
.../AbstractMarkDuplicatesCommandLineProgramTest.java	`96.241% <100%> (+95.865%)`	`92 <0> (+90)`	⬆️
...lbender/utils/read/SAMRecordToGATKReadAdapter.java	`92.806% <100%> (+87.331%)`	`146 <1> (+135)`	⬆️
...rkduplicates/MarkDuplicatesSparkUtilsUnitTest.java	`91.954% <100%> (+90.805%)`	`15 <1> (+14)`	⬆️
...transforms/markduplicates/MarkDuplicatesSpark.java	`94.03% <100%> (+94.03%)`	`32 <0> (+32)`	⬆️
...forms/markduplicates/MarkDuplicatesSparkUtils.java	`89.573% <100%> (+89.573%)`	`68 <0> (+68)`	⬆️
...lbender/utils/codecs/table/TableCodecUnitTest.java	`95.575% <100%> (+94.69%)`	`21 <0> (+20)`	⬆️
...adinstitute/hellbender/utils/spark/SparkUtils.java	`88.073% <100%> (+88.073%)`	`25 <4> (+25)`	⬆️
... and 1765 more

lbergelson · 2019-01-03T20:57:47Z

src/main/java/org/broadinstitute/hellbender/utils/read/SAMRecordToGATKReadAdapter.java

@@ -447,6 +447,13 @@ public void setIsUnplaced() {
        samRecord.setMappingQuality(SAMRecord.NO_MAPPING_QUALITY);
    }

+    @Override
+    public void setInferredInsertSize(int insertSize) {


inferred -> infrared

lbergelson

Merge this before I find something else to comment on!

jamesemery assigned lbergelson Nov 7, 2018

jamesemery requested a review from lbergelson November 7, 2018 19:37

lbergelson reviewed Nov 20, 2018

lbergelson requested changes Dec 5, 2018

lbergelson assigned jamesemery and unassigned lbergelson Dec 5, 2018

jamesemery force-pushed the je_portRevertSam branch from e4b85bc to 9306fe7 Compare December 7, 2018 15:32

lbergelson requested changes Dec 11, 2018

jamesemery added 6 commits January 2, 2019 14:55

Added RevertSamSpark, a replacement for the RevertSam tool that allow…

6f98aa4

…s for removal of alignment information and reversion to the original base qualities for reads.

responding to the first round of comments

1fd3898

responding to another round of comments with nothing more than rowdy …

e732739

…backtalk and refusal to respond

fixing a warning

2b696b7

responding to final round of comments

2e48600

fixing the compiler warnings

a8d8338

jamesemery force-pushed the je_portRevertSam branch from 8e8a4bb to a8d8338 Compare January 2, 2019 20:06

jamesemery and others added 6 commits January 3, 2019 10:12

cleaning up a mistaken change

bd0c6f5

silly !, save it for snake

2b78ecd

solving the worlds problems

f55d6b3

getting rid of that grossness

de998df

readding the files i deleted by mistake

0c00b23

moving files because directories are hard

7238f2f

lbergelson requested changes Jan 3, 2019

jamesemery mentioned this pull request Jan 3, 2019

Add support for cram inputs for RevertSamSpark #5559

Closed

responded to yet another round of comments, when, pray tell, will thi…

01b2bc8

…s chirade end?

jamesemery force-pushed the je_portRevertSam branch from 7238f2f to 01b2bc8 Compare January 3, 2019 20:21

Merge branch 'je_portRevertSam' of github.com:broadinstitute/gatk int…

3c64144

…o je_portRevertSam

lbergelson reviewed Jan 3, 2019

jamesemery added 2 commits January 3, 2019 16:01

fixed a spurious override

f5b7062

No, that really did want to check for emptieness

2730a47

lbergelson approved these changes Jan 4, 2019

jamesemery merged commit b13d2c6 into master Jan 4, 2019

jamesemery deleted the je_portRevertSam branch January 4, 2019 16:42
