Port of RevertSam tool into Spark #5395
@jamesemery This needs a rebase now that #5320 is merged.
@jamesemery Checkpointing where I am in the review. I have a bunch more to cover.
}

// ________________________________________________________________________________________________________________________
// sum garbage
typo: sum -> some
I think you meant to remove this?
* <h3>Usage Examples</h3>
* <h4>Output to a single file</h4>
* <pre>
* java -jar picard.jar RevertSam \\
this is out of date
* -O reverted.bam
* </pre>
* <p>
* <h4>Output by read group into multiple files with sample map</h4>
all the examples are picard
* </pre>
* This will output a BAM (Can be overridden with outputByReadgroupFileFormat option.)
* <br/>
* Note: If the program fails due to a SAM validation error, consider setting the VALIDATION_STRINGENCY option to
Is this still relevant to gatk? We set validation to lenient always I thought.
Do we? I've definitely had to set it but I may simply be forgetting which tools are secretly picard tools...
@DocumentedFeature
@CommandLineProgramProperties(
summary =RevertSamSpark.USAGE_DETAILS,
weird spacing here
try {
    outputMap = createOutputMapFromFile(outputMapFile);
} catch (IOException e) {
    throw new GATKException("Encountered an error reading output map file", e);
UserException
}
}
if (sampleAlias != null && !allSampleAliasesIdentical) {
    throw new GATKException("Read groups have multiple values for sample. " +
UserException
"A value for sampleAlias cannot be supplied."); | ||
}
if (libraryName != null && !allLibraryNamesIdentical) {
    throw new GATKException("Read groups have multiple values for library name. " +
UserException
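The three "UserException" comments above all make the same point: a problem caused by the user's input should be reported with a different exception type than an internal failure. A minimal sketch of that split follows. `UserInputProblem` and `InternalProblem` are hypothetical stand-ins defined here so the example is self-contained; they are not the GATK exception classes, and `validateSampleAlias` is an illustrative helper, not a method from this PR.

```java
import java.util.Set;

final class ExceptionChoiceSketch {
    /** Hypothetical stand-in: the user supplied arguments or files that cannot work together. */
    static final class UserInputProblem extends RuntimeException {
        UserInputProblem(final String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in: the program itself is in a state it should never reach. */
    static final class InternalProblem extends RuntimeException {
        InternalProblem(final String message) {
            super(message);
        }
    }

    /**
     * Illustrative check modelled on the sampleAlias validation quoted above.
     * A single alias cannot be applied when the header names more than one sample,
     * and that is something only the user can fix, so it is reported as a user problem.
     */
    static void validateSampleAlias(final String sampleAlias, final Set<String> samplesInHeader) {
        if (samplesInHeader == null) {
            // The caller broke an invariant; no command-line change would fix this.
            throw new InternalProblem("Sample set was never initialised");
        }
        if (sampleAlias != null && samplesInHeader.size() > 1) {
            throw new UserInputProblem("Read groups have multiple values for sample. " +
                    "A value for sampleAlias cannot be supplied.");
        }
    }
}
```

Keeping the two types separate lets a command-line front end print a short message for the first kind and a full stack trace for the second.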
final boolean restoreOriginalQualities) {
final Map<String, FastqQualityFormat> output = new HashMap<>();

inHeader.getValue().getReadGroups().stream().parallel().forEach(rg -> {
It's kind of weird to do spark operations in a parallel stream. Is this safe? Is it faster? Spark already tries to use all the resources, it seems like this would just create more contention.
Possibly? This simply submits the per-read-group partitioning requests to Spark in parallel; Spark can then schedule them as it sees fit for parallelization down the line.
Is Spark thread-safe?
I would remove the parallel unless you've shown that it's a significant speedup.
final String key = rg.getId();
JavaRDD<GATKRead> filtered = reads.filter(r -> r.getReadGroup().equals(key));

if (!filtered.isEmpty()) {
This method seems like it could be very expensive, particularly when the distribution of reads is very unbalanced (worst case: many read groups). In that case I think you will end up iterating over the whole file N-1 times. I don't know how isEmpty works, but when the RDD IS empty it seems like it would have to traverse the entire input and run the filter to prove that it is empty.
Have you noticed this taking a long time?
I'm not sure what a clean alternative is... I've often wished for a partition operation on an RDD. You can sort of do it with a groupByKey but it will have nasty out of memory issues...
As discussed, since there isn't a clear alternative that sounds like a good tradeoff between complexity and correctness I think I will leave this unanswered for now and wait and see if this ends up being a significant performance bottleneck.
Maybe add a comment here about the worst case performance issue.
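The tradeoff discussed in this thread can be sketched with plain Java collections standing in for the RDD. This is only an illustration under that assumption: `Read`, `splitByRepeatedFilter`, and `splitBySinglePass` are hypothetical names, and an in-memory list does not reproduce Spark's laziness or its behaviour for `isEmpty`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ReadGroupSplitSketch {
    /** Hypothetical stand-in for a read; only the fields this sketch needs. */
    static final class Read {
        final String name;
        final String readGroup;

        Read(final String name, final String readGroup) {
            this.name = name;
            this.readGroup = readGroup;
        }
    }

    /**
     * One filter per read group, as in the snippet under review.
     * Every filter scans the full input, so the work grows with (number of groups) x (number of reads),
     * including a full scan for each group that turns out to be empty.
     */
    static Map<String, List<Read>> splitByRepeatedFilter(final List<Read> reads, final List<String> readGroupIds) {
        final Map<String, List<Read>> out = new LinkedHashMap<>();
        for (final String id : readGroupIds) {
            final List<Read> filtered = reads.stream()
                    .filter(r -> r.readGroup.equals(id))
                    .collect(Collectors.toList());
            if (!filtered.isEmpty()) {
                out.put(id, filtered);
            }
        }
        return out;
    }

    /**
     * Single-pass grouping: each read is visited once, but every group is held in memory
     * at the same time, which is the memory concern raised about a groupByKey-style approach.
     */
    static Map<String, List<Read>> splitBySinglePass(final List<Read> reads) {
        return reads.stream()
                .collect(Collectors.groupingBy(r -> r.readGroup, LinkedHashMap::new, Collectors.toList()));
    }
}
```

Both methods return the same grouping; they differ only in how many times the input is traversed and how much is held in memory at once.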
@lbergelson Responded to your comments
@jamesemery Sorry this took so long. A bunch more comments. Needs a rebase as well and make sure it compiles in gradle. I think it has or at least had a bunch of warnings that caused failure.
@Argument( fullName = DONT_RESTORE_ORIGINAL_QUALITIES_LONG_NAME, doc = "Set to prevent the tool from setting the OQ field to the QUAL where available.", optional = true)
public boolean dontRestoreOriginalQualities = false;

public static final String DONT_REMOVE_DUPLICATE_INFORMATION_LONG_NAME = "remove-duplicate-information";
the argument name is inverted from what it says
I think we can change this to keepDuplicateInformation and have it be less confusing without having to invert the value.
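A small sketch of the rename suggested here: the constant, the flag string, and the boolean all say "keep", so nothing has to be mentally inverted. The flag string `keep-duplicate-information` and both helper methods are assumptions made for illustration; the hand-rolled argument scan stands in for the real argument parser.

```java
import java.util.Arrays;

final class KeepFlagSketch {
    // Assumed flag name: constant, flag text, and field meaning all agree on "keep".
    static final String KEEP_DUPLICATE_INFORMATION_LONG_NAME = "keep-duplicate-information";

    /** Returns true only when the user explicitly asked to keep duplicate information. */
    static boolean parseKeepDuplicateInformation(final String[] args) {
        return Arrays.asList(args).contains("--" + KEEP_DUPLICATE_INFORMATION_LONG_NAME);
    }

    /** The single place where the positive flag is turned into the action the tool takes. */
    static boolean shouldClearDuplicateFlag(final boolean keepDuplicateInformation) {
        return !keepDuplicateInformation;
    }
}
```

With the default of `false`, the tool still removes duplicate information unless the flag is passed, which matches the default behaviour of the original `dontRemoveDuplicateInformation = false` field.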
" the output may have the unusual but sometimes desirable trait of having unmapped reads that are marked as duplicates.") | ||
public boolean dontRemoveDuplicateInformation = false;

public static final String DONT_REMOVE_ALIGNMENT_INFORMATION_LONG_NAME = "remove-alignment-information";
Like above, this is called DONT REMOVE but the argument name is remove, and the doc
final List<String> errors = new ArrayList<>();
validateOutputParams(outputByReadGroup, output, outputMap);

if (!sanitize && keepFirstDuplicate) errors.add("'keepFirstDuplicate' cannot be used without 'sanitize'");
{}
* If this is run, we want to be careful to remove duplicated reads from the bam.
*
* In order to do this we group each read by its readname and randomly select one read labeled as first in pair
* and one read labled as second in pair to treat as the representative reads, throwing away the rest.
typo
- * and one read labled as second in pair to treat as the representative reads, throwing away the rest.
+ * and one read labeled as second in pair to treat as the representative reads, throwing away the rest.
if (so != null) {
    args.addArgument("sort-order",so.name()); //TODO decide on sort order outputing
}
// args[index++] = "dontRemoveDuplicateInformation=" + removeDuplicates; //TODO this is unsuported
this is supported now, isn't it?
Yes, this test was a stub I forgot about while working out the tests (it is duplicated from the other class). I have moved the unit-test-like tests from the other class here instead.
runCommandLine(args);

// if (outputByReadGroup) {
deleted code? reenable?
It's not testing anything right now
// } else {
// output = File.createTempFile("reverted", ".sam");
// }
output.deleteOnExit();
this is borked; use BaseTest methods instead
final boolean restoreOriginalQualities, final boolean outputByReadGroup, final String sample, final String library,
final List<String> attributesToClear) throws Exception {

final File output = outputByReadGroup?Files.createTempDirectory("picardRevertSamTest").toFile():File.createTempFile("reverted", ".sam");
use base test methods
if (sample != null) {
    args.addArgument("sample-alias",sample);
}
if (library != null) {
this seems like an integration test
e4b85bc to 9306fe7
@lbergelson responded to your comments to the best of my willingness
@jamesemery There were still a few mixed up things in the argument names. You have a compiler warning as well and a conflict. 👍 once those are resolved and tests are passing.
@Argument(fullName = DONT_REMOVE_ALIGNMENT_INFORMATION_LONG_NAME, doc = "Remove all alignment information from the file.")
public boolean dontRemoveAlignmentInformation = false;
public static final String KEEP_ALIGNMENT_INFORMATION = "keep-alignment-information";
@Argument(fullName = KEEP_ALIGNMENT_INFORMATION, doc = "Remove all alignment information from the file.")
this message is still wrong
@@ -167,9 +169,9 @@
" the output may have the unusual but sometimes desirable trait of having unmapped reads that are marked as duplicates.")
Don't remove -> keep, and the name and actual argument are still mismatching
@@ -196,7 +198,8 @@
  return Collections.singletonList(ReadFilterLibrary.ALLOW_ALL_READS);
  }

- public static List<String> DEFAULT_ATTRIBUTES_TO_CLEAR = new ArrayList<String>() {{
+ public static List<String> DEFAULT_ATTRIBUTES_TO_CLEAR = Collections.unmodifiableList(new ArrayList<String>(){
final
It still seems like it would make sense to make this with just Arrays.asList() but I guess the anonymous subclass is fine....
But this way I don't make an array, then another one in order to build an ArrayList! Think about all of the wasted time we can spend in confusion instead!
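For comparison, the alternative raised in this exchange (a `final` field built with `Arrays.asList` and wrapped as unmodifiable) might look like the sketch below. The tag names are placeholders chosen for illustration and are not claimed to be the tool's actual default list.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class DefaultAttributesSketch {
    // final: the reference cannot be reassigned.
    // unmodifiableList: callers cannot add to or remove from the list.
    // Placeholder tags only; the real defaults live in the tool under review.
    static final List<String> DEFAULT_ATTRIBUTES_TO_CLEAR =
            Collections.unmodifiableList(Arrays.asList("NM", "UQ", "PG", "MD"));
}
```

`Arrays.asList` returns a fixed-size list backed by the varargs array, so no second copy is made, and no anonymous `ArrayList` subclass is created.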
@@ -311,11 +316,13 @@ protected void runTool(JavaSparkContext ctx) {
  final boolean restoreOriginalQualities) {
  final Map<String, FastqQualityFormat> output = new HashMap<>();

- inHeader.getValue().getReadGroups().stream().parallel().forEach(rg -> {
+ inHeader.getValue().getReadGroups().stream().forEach(rg -> {
thank you :)
:)
…s for removal of alignment information and reversion to the original base qualities for reads.
…backtalk and refusal to respond
8e8a4bb to a8d8338
I'm sorry!
super(TableFeature.class);
if ( "".equals(headerLineDelimiter) ) {
Isn't this just Utils.nonEmpty()?
* If the original base quality scores have been store in the "OQ" tag will return the numeric
* score as a byte[]
*/
default byte[] getOriginalBaseQualities() {
It doesn't seem worth including this as a first class function. It already exists in ReadUtils.
* insert size (difference btw 5' end of read & 5' end of mate), if possible, else 0.
* Negative if mate maps to lower position than read.
*/
void setInferredInsertSize(int insertSize);
this already exists as setFragmentLength
@@ -447,6 +447,13 @@ public void setIsUnplaced() {
samRecord.setMappingQuality(SAMRecord.NO_MAPPING_QUALITY);
}

@Override
public void setInferredInsertSize(int insertSize) {
@jamesemery I missed this somehow before, but this exists already as setFragmentLength
}

@Test
public static void testGetDefaultExtension() {
This seems like a unit test.
@Test
public void testGetDefaultExtension() {
    Assert.assertEquals(RevertSamSpark.getDefaultExtension("this.is.a.sam", RevertSamSpark.FileType.dynamic), ".sam");
    //Assert.assertEquals(RevertSamSpark.getDefaultExtension("this.is.a.cram", RevertSamSpark.FileType.dynamic), ".cram");
Can you add a comment pointing to whatever issue is tracking this problem?
}

@Test
public void testAssertAllReadGroupsMappedSuccess() {
a lot of these seem like unit tests.
@@ -713,7 +713,7 @@ public void testDuplicateDetectionDataProviderWithMetrics(final File sam, final
  final List<String> lines = FileUtils.readLines(metricsFile, StandardCharsets.UTF_8);
  Assert.assertTrue(lines.get(0).startsWith("##"), lines.get(0));
  Assert.assertTrue(lines.get(1).startsWith("#"), lines.get(1));
- Assert.assertTrue(lines.get(1).toLowerCase().contains("--input"), lines.get(1)); //Note: lowercase because picard uses INPUT and GATK uses input for full name
+ Assert.assertTrue(lines.get(1).toLowerCase().contains("--input"), lines.get(1)); //Note: lowercase because picard uses input and GATK uses input for full name
this is still bad
7238f2f to 01b2bc8
…o je_portRevertSam
Codecov Report
@@ Coverage Diff @@
## master #5395 +/- ##
===============================================
+ Coverage 6.942% 87.094% +80.152%
- Complexity 2754 31379 +28625
===============================================
Files 1915 1918 +3
Lines 144130 144650 +520
Branches 15901 15992 +91
===============================================
+ Hits 10005 125981 +115976
+ Misses 133410 12859 -120551
- Partials 715 5810 +5095
@@ -447,6 +447,13 @@ public void setIsUnplaced() {
samRecord.setMappingQuality(SAMRecord.NO_MAPPING_QUALITY);
}

@Override
public void setInferredInsertSize(int insertSize) {
inferred -> infrared
Merge this before I find something else to comment on!
Blocked by #5320
Fixes #4461