Port of RevertSam tool into Spark #5395

Merged
merged 16 commits into from
Jan 4, 2019

Conversation

jamesemery
Collaborator

Blocked by #5320

Fixes #4461

@lbergelson
Member

@jamesemery This needs a rebase now that #5320 is merged.

Member

@lbergelson lbergelson left a comment

@jamesemery Checkpointing where I am in the review. I have a bunch more to cover.

}

// ________________________________________________________________________________________________________________________
// sum garbage
Member

typo: sum -> some

I think you meant to remove this?

* <h3>Usage Examples</h3>
* <h4>Output to a single file</h4>
* <pre>
* java -jar picard.jar RevertSam \\
Member

this is out of date

* -O reverted.bam
* </pre>
* <p>
* <h4>Output by read group into multiple files with sample map</h4>
Member

all the examples are picard

* </pre>
* This will output a BAM (Can be overridden with outputByReadgroupFileFormat option.)
* <br/>
* Note: If the program fails due to a SAM validation error, consider setting the VALIDATION_STRINGENCY option to
Member

Is this still relevant to GATK? I thought we always set validation to lenient.

Collaborator Author

Do we? I've definitely had to set it, but I may simply be forgetting which tools are secretly Picard tools...


@DocumentedFeature
@CommandLineProgramProperties(
summary =RevertSamSpark.USAGE_DETAILS,
Member

weird spacing here

try {
outputMap = createOutputMapFromFile(outputMapFile);
} catch (IOException e) {
throw new GATKException("Encountered an error reading output map file", e);
Member

UserException

}
}
if (sampleAlias != null && !allSampleAliasesIdentical) {
throw new GATKException("Read groups have multiple values for sample. " +
Member

UserException

"A value for sampleAlias cannot be supplied.");
}
if (libraryName != null && !allLibraryNamesIdentical) {
throw new GATKException("Read groups have multiple values for library name. " +
Member

UserException

final boolean restoreOriginalQualities) {
final Map<String, FastqQualityFormat> output = new HashMap<>();

inHeader.getValue().getReadGroups().stream().parallel().forEach(rg -> {
Member

It's kind of weird to do spark operations in a parallel stream. Is this safe? Is it faster? Spark already tries to use all the resources, it seems like this would just create more contention.

Collaborator Author

Possibly? This simply issues the partitioning requests to Spark in parallel, so that Spark can schedule them as it sees fit for parallelization down the line.

Member

Is spark threadsafe?

Member

I would remove the parallel unless you've shown that it's a significant speedup.

final String key = rg.getId();
JavaRDD<GATKRead> filtered = reads.filter(r -> r.getReadGroup().equals(key));

if (!filtered.isEmpty()) {
Member

This method seems like it could be very expensive. Particularly in the case where you have very unbalanced distributions of reads. (Worst case, many readgroups). In that case I think you will end up iterating the whole file N-1 times. I don't know how isEmpty works, but it seems like in cases where the rdd IS empty it would have to traverse the entire input and run the filter to prove that it is empty.

Have you noticed this taking a long time?

Member

I'm not sure what a clean alternative is... I've often wished for a partition operation on an RDD. You can sort of do it with a groupByKey but it will have nasty out of memory issues...

Collaborator Author

As discussed, since there isn't a clear alternative that sounds like a good tradeoff between complexity and correctness I think I will leave this unanswered for now and wait and see if this ends up being a significant performance bottleneck.

Member

Maybe add a comment here about the worst case performance issue.
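
To make the worst case raised in this thread concrete, here is a minimal sketch that uses plain Java collections as a stand-in for an RDD (no Spark is involved). `ReadGroupSplitSketch`, `Read`, `filterPerGroup`, and `partitionOnce` are hypothetical names introduced for illustration, not GATK or Spark APIs. The sketch only counts how many reads each strategy examines. As far as I know, Spark's `isEmpty` is built on `take(1)`, so it can stop early on a non-empty RDD but still has to run the filter over every partition to confirm that a group is empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ReadGroupSplitSketch {
    /** Hypothetical stand-in for a read; only the read group matters here. */
    static final class Read {
        final String name;
        final String readGroup;

        Read(String name, String readGroup) {
            this.name = name;
            this.readGroup = readGroup;
        }
    }

    /** Number of reads examined since the counter was last reset. */
    static long visits = 0;

    /** One full scan per read group, like calling filter() once per group. */
    static Map<String, List<Read>> filterPerGroup(List<Read> reads, List<String> groups) {
        Map<String, List<Read>> out = new LinkedHashMap<>();
        for (String group : groups) {
            List<Read> filtered = new ArrayList<>();
            for (Read r : reads) {
                visits++;
                if (r.readGroup.equals(group)) {
                    filtered.add(r);
                }
            }
            // Proving that a group is empty still costs a full scan.
            if (!filtered.isEmpty()) {
                out.put(group, filtered);
            }
        }
        return out;
    }

    /** Single scan that buckets reads by group; every bucket is held in memory at once. */
    static Map<String, List<Read>> partitionOnce(List<Read> reads) {
        Map<String, List<Read>> out = new LinkedHashMap<>();
        for (Read r : reads) {
            visits++;
            out.computeIfAbsent(r.readGroup, k -> new ArrayList<>()).add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Read> reads = Arrays.asList(
                new Read("r1", "rg1"), new Read("r2", "rg1"), new Read("r3", "rg2"));
        List<String> groups = Arrays.asList("rg1", "rg2", "rg3");

        visits = 0;
        filterPerGroup(reads, groups);
        System.out.println("filter per group visits: " + visits);

        visits = 0;
        partitionOnce(reads);
        System.out.println("single pass visits: " + visits);
    }
}
```

The per-group filter examines every read once per read group, while the single pass examines each read once at the price of holding all groups in memory together, which is the same tradeoff as the `groupByKey` concern mentioned above.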

@jamesemery
Collaborator Author

@lbergelson Responded to your comments

Member

@lbergelson lbergelson left a comment

@jamesemery Sorry this took so long. A bunch more comments. Needs a rebase as well and make sure it compiles in gradle. I think it has or at least had a bunch of warnings that caused failure.

@Argument( fullName = DONT_RESTORE_ORIGINAL_QUALITIES_LONG_NAME, doc = "Set to prevent the tool from setting the OQ field to the QUAL where available.", optional = true)
public boolean dontRestoreOriginalQualities = false;

public static final String DONT_REMOVE_DUPLICATE_INFORMATION_LONG_NAME = "remove-duplicate-information";
Member

the argument name is inverted from what it says

Member

I think we can change this to keepDuplicateInformation and have it be less confusing without having to invert the value.
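
The rename suggested here can be sketched as follows. This is an illustration only: `FlagNamingSketch`, `Options`, and `outputDuplicateFlag` are hypothetical names, and the real tool declares its flag through an `@Argument` annotation rather than a bare field. The point is that a positively named flag with a `false` default reads the same way in the code, in the argument name, and in the doc string.

```java
final class FlagNamingSketch {
    /** Hypothetical options holder standing in for the tool's argument fields. */
    static final class Options {
        // Positive name: true keeps the duplicate flag, false (the default) clears it.
        boolean keepDuplicateInformation = false;
    }

    /** Hypothetical helper: the duplicate flag to write on the reverted read. */
    static boolean outputDuplicateFlag(boolean inputDuplicateFlag, Options options) {
        // No negation needed: the flag name matches the behavior it enables.
        return options.keepDuplicateInformation && inputDuplicateFlag;
    }

    public static void main(String[] args) {
        Options defaults = new Options();
        System.out.println("default keeps flag: " + outputDuplicateFlag(true, defaults));

        Options keep = new Options();
        keep.keepDuplicateInformation = true;
        System.out.println("keep option keeps flag: " + outputDuplicateFlag(true, keep));
    }
}
```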

" the output may have the unusual but sometimes desirable trait of having unmapped reads that are marked as duplicates.")
public boolean dontRemoveDuplicateInformation = false;

public static final String DONT_REMOVE_ALIGNMENT_INFORMATION_LONG_NAME = "remove-alignment-information";
Member

Like above, this is called DONT REMOVE but the argument name is remove, and the doc

final List<String> errors = new ArrayList<>();
validateOutputParams(outputByReadGroup, output, outputMap);

if (!sanitize && keepFirstDuplicate) errors.add("'keepFirstDuplicate' cannot be used without 'sanitize'");
Member

{}

* If this is run, we want to be careful to remove duplicated reads from the bam.
*
* In order to do this we group each read by its readname and randomly select one read labeled as first in pair
* and one read labled as second in pair to treat as the representative reads, throwing away the rest.
Member

typo

Suggested change
* and one read labled as second in pair to treat as the representative reads, throwing away the rest.
* and one read labeled as second in pair to treat as the representative reads, throwing away the rest.
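
The selection described in that Javadoc can be sketched without Spark. This is a simplified illustration under stated assumptions: `KeepOnePerPairSketch`, `Read`, and `keepRepresentatives` are hypothetical names, and the sketch keeps the first read it encounters for each slot instead of choosing randomly as the Javadoc describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class KeepOnePerPairSketch {
    /** Hypothetical stand-in for a paired read. */
    static final class Read {
        final String name;
        final boolean firstOfPair;
        final String label; // only used to tell duplicates apart in this sketch

        Read(String name, boolean firstOfPair, String label) {
            this.name = name;
            this.firstOfPair = firstOfPair;
            this.label = label;
        }
    }

    /** Keeps at most one first-of-pair and one second-of-pair read per read name. */
    static List<Read> keepRepresentatives(List<Read> reads) {
        // read name -> [first-of-pair representative, second-of-pair representative]
        Map<String, Read[]> byName = new LinkedHashMap<>();
        for (Read r : reads) {
            Read[] slots = byName.computeIfAbsent(r.name, k -> new Read[2]);
            int slot = r.firstOfPair ? 0 : 1;
            if (slots[slot] == null) {
                slots[slot] = r; // later reads for the same slot are thrown away
            }
        }
        List<Read> kept = new ArrayList<>();
        for (Read[] slots : byName.values()) {
            for (Read r : slots) {
                if (r != null) {
                    kept.add(r);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Read> reads = Arrays.asList(
                new Read("q1", true, "a"), new Read("q1", true, "b"),
                new Read("q1", false, "c"), new Read("q2", false, "d"));
        for (Read r : keepRepresentatives(reads)) {
            System.out.println(r.name + " " + (r.firstOfPair ? "first" : "second") + " " + r.label);
        }
    }
}
```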

if (so != null) {
args.addArgument("sort-order",so.name()); //TODO decide on sort order outputing
}
// args[index++] = "dontRemoveDuplicateInformation=" + removeDuplicates; //TODO this is unsuported
Member

this is supported now isn't it?

Collaborator Author

Yes, this test was a stub I forgot about while working out the tests (it is duplicated from the other class). I have moved the unit-test-like tests from the other class here instead.


runCommandLine(args);

// if (outputByReadGroup) {
Member

deleted code? reenable?

Member

It's not testing anything right now

// } else {
// output = File.createTempFile("reverted", ".sam");
// }
output.deleteOnExit();
Member

This is borked; use BaseTest methods instead.

final boolean restoreOriginalQualities, final boolean outputByReadGroup, final String sample, final String library,
final List<String> attributesToClear) throws Exception {

final File output = outputByReadGroup?Files.createTempDirectory("picardRevertSamTest").toFile():File.createTempFile("reverted", ".sam");
Member

use base test methods

if (sample != null) {
args.addArgument("sample-alias",sample);
}
if (library != null) {
Member

this seems like an integration test

@jamesemery
Collaborator Author

@lbergelson responded to your comments to the best of my willingness

Member

@lbergelson lbergelson left a comment

@jamesemery There were still a few mixed up things in the argument names. You have a compiler warning as well and a conflict. 👍 once those are resolved and tests are passing.

- @Argument(fullName = DONT_REMOVE_ALIGNMENT_INFORMATION_LONG_NAME, doc = "Remove all alignment information from the file.")
- public boolean dontRemoveAlignmentInformation = false;
+ public static final String KEEP_ALIGNMENT_INFORMATION = "keep-alignment-information";
+ @Argument(fullName = KEEP_ALIGNMENT_INFORMATION, doc = "Remove all alignment information from the file.")
Member

this message is still wrong

@@ -167,9 +169,9 @@
" the output may have the unusual but sometimes desirable trait of having unmapped reads that are marked as duplicates.")
Member

Don't remove -> keep, and the name and actual argument are still mismatching

@@ -196,7 +198,8 @@
return Collections.singletonList(ReadFilterLibrary.ALLOW_ALL_READS);
}

- public static List<String> DEFAULT_ATTRIBUTES_TO_CLEAR = new ArrayList<String>() {{
+ public static List<String> DEFAULT_ATTRIBUTES_TO_CLEAR = Collections.unmodifiableList(new ArrayList<String>(){
Member

final

Member

It still seems like it would make sense to make this with just Arrays.asList() but I guess the anonymous subclass is fine....

Collaborator Author

But this way i don't make an array, then another one in order to build an arraylist! think about all of the wasted time we can spend in confusion instead!
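
For comparison, the two forms debated in this thread look roughly like this. It is a sketch only: `ConstantListSketch` and both field names are hypothetical, and the tag names are placeholders rather than the tool's actual default list. Both fields are declared `final`, as requested in the review.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ConstantListSketch {
    // Arrays.asList form: a fixed-size list backed by the varargs array, wrapped read-only.
    static final List<String> WITH_AS_LIST =
            Collections.unmodifiableList(Arrays.asList("NM", "UQ", "MD"));

    // Double-brace form: an anonymous ArrayList subclass whose instance
    // initializer block performs the add() calls, then wrapped read-only.
    static final List<String> WITH_DOUBLE_BRACE =
            Collections.unmodifiableList(new ArrayList<String>() {{
                add("NM");
                add("UQ");
                add("MD");
            }});

    public static void main(String[] args) {
        System.out.println("same contents: " + WITH_AS_LIST.equals(WITH_DOUBLE_BRACE));
        try {
            WITH_AS_LIST.add("XX");
        } catch (UnsupportedOperationException e) {
            System.out.println("list is read-only");
        }
    }
}
```

Both lists have the same contents and reject modification. The double-brace form generates an extra anonymous class, and compilers may warn about a missing `serialVersionUID` on it when lint checks are enabled.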

@@ -311,11 +316,13 @@ protected void runTool(JavaSparkContext ctx) {
final boolean restoreOriginalQualities) {
final Map<String, FastqQualityFormat> output = new HashMap<>();

- inHeader.getValue().getReadGroups().stream().parallel().forEach(rg -> {
+ inHeader.getValue().getReadGroups().stream().forEach(rg -> {
Member

thank you :)

Collaborator Author

:)

Member

@lbergelson lbergelson left a comment

I'm sorry!

super(TableFeature.class);
if ( "".equals(headerLineDelimiter) ) {
Member

Isn't this just Utils.nonEmpty() ?

* If the original base quality scores have been store in the "OQ" tag will return the numeric
* score as a byte[]
*/
default byte[] getOriginalBaseQualities() {
Member

It doesn't seem worth including this as a first class function. It already exists in ReadUtils.

* insert size (difference btw 5' end of read & 5' end of mate), if possible, else 0.
* Negative if mate maps to lower position than read.
*/
void setInferredInsertSize(int insertSize);
Member

this already exists as setFragmentLength

@@ -447,6 +447,13 @@ public void setIsUnplaced() {
samRecord.setMappingQuality(SAMRecord.NO_MAPPING_QUALITY);
}

@Override
public void setInferredInsertSize(int insertSize) {
Member

@jamesemery I missed this somehow before, but this exists already as setFragmentLength

}

@Test
public static void testGetDefaultExtension() {
Member

This seems like a unit test.

@Test
public void testGetDefaultExtension() {
Assert.assertEquals(RevertSamSpark.getDefaultExtension("this.is.a.sam", RevertSamSpark.FileType.dynamic), ".sam");
//Assert.assertEquals(RevertSamSpark.getDefaultExtension("this.is.a.cram", RevertSamSpark.FileType.dynamic), ".cram");
Member

Can you add a comment pointing to whatever issue is tracking this problem?

}

@Test
public void testAssertAllReadGroupsMappedSuccess() {
Member

a lot of these seem like unit tests.

@@ -713,7 +713,7 @@ public void testDuplicateDetectionDataProviderWithMetrics(final File sam, final
final List<String> lines = FileUtils.readLines(metricsFile, StandardCharsets.UTF_8);
Assert.assertTrue(lines.get(0).startsWith("##"), lines.get(0));
Assert.assertTrue(lines.get(1).startsWith("#"), lines.get(1));
- Assert.assertTrue(lines.get(1).toLowerCase().contains("--input"), lines.get(1)); //Note: lowercase because picard uses INPUT and GATK uses input for full name
+ Assert.assertTrue(lines.get(1).toLowerCase().contains("--input"), lines.get(1)); //Note: lowercase because picard uses input and GATK uses input for full name
Member

this is still bad

@codecov-io

Codecov Report

Merging #5395 into master will increase coverage by 80.152%.
The diff coverage is 90%.

@@               Coverage Diff               @@
##             master     #5395        +/-   ##
===============================================
+ Coverage     6.942%   87.094%   +80.152%     
- Complexity     2754     31379     +28625     
===============================================
  Files          1915      1918         +3     
  Lines        144130    144650       +520     
  Branches      15901     15992        +91     
===============================================
+ Hits          10005    125981    +115976     
+ Misses       133410     12859    -120551     
- Partials        715      5810      +5095
| Impacted Files | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| ...ellbender/cmdline/StandardArgumentDefinitions.java | 0% <ø> (ø) | 0 <0> (ø) | ⬇️ |
| ...er/utils/haplotype/HaplotypeBAMWriterUnitTest.java | 89.076% <ø> (+86.555%) | 23 <0> (+22) | ⬆️ |
| ...hellbender/tools/spark/pipelines/SortSamSpark.java | 100% <100%> (+100%) | 6 <0> (+6) | ⬆️ |
| .../AbstractMarkDuplicatesCommandLineProgramTest.java | 96.241% <100%> (+95.865%) | 92 <0> (+90) | ⬆️ |
| ...lbender/utils/read/SAMRecordToGATKReadAdapter.java | 92.806% <100%> (+87.331%) | 146 <1> (+135) | ⬆️ |
| ...rkduplicates/MarkDuplicatesSparkUtilsUnitTest.java | 91.954% <100%> (+90.805%) | 15 <1> (+14) | ⬆️ |
| ...transforms/markduplicates/MarkDuplicatesSpark.java | 94.03% <100%> (+94.03%) | 32 <0> (+32) | ⬆️ |
| ...forms/markduplicates/MarkDuplicatesSparkUtils.java | 89.573% <100%> (+89.573%) | 68 <0> (+68) | ⬆️ |
| ...lbender/utils/codecs/table/TableCodecUnitTest.java | 95.575% <100%> (+94.69%) | 21 <0> (+20) | ⬆️ |
| ...adinstitute/hellbender/utils/spark/SparkUtils.java | 88.073% <100%> (+88.073%) | 25 <4> (+25) | ⬆️ |

... and 1765 more

@@ -447,6 +447,13 @@ public void setIsUnplaced() {
samRecord.setMappingQuality(SAMRecord.NO_MAPPING_QUALITY);
}

@Override
public void setInferredInsertSize(int insertSize) {
Member

inferred -> infrared

Member

image

Member

@lbergelson lbergelson left a comment

Merge this before I find something else to comment on!

@jamesemery jamesemery merged commit b13d2c6 into master Jan 4, 2019
@jamesemery jamesemery deleted the je_portRevertSam branch January 4, 2019 16:42
3 participants