-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added an option to opt out of the protections to sorting multiple bam inputs #5974
Added an option to opt out of the protections to sorting multiple bam inputs #5974
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5974 +/- ##
===============================================
- Coverage 86.931% 86.929% -0.002%
- Complexity 32715 32721 +6
===============================================
Files 2013 2013
Lines 151268 151306 +38
Branches 16604 16610 +6
===============================================
+ Hits 131499 131529 +30
- Misses 13716 13720 +4
- Partials 6053 6057 +4
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery I think this should work slightly differently. Good to go after comments are addressed or let me know why I'm wrong :)
@@ -140,6 +141,11 @@ | |||
fullName = StandardArgumentDefinitions.METRICS_FILE_LONG_NAME) | |||
protected String metricsFile; | |||
|
|||
@Advanced | |||
@Argument(doc = "Override to allow non-queryname sorted inputs for multiple input bams.", optional=true, | |||
fullName = "allow-multiple-sort-orders-in-input") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets change this to treat-unsorted-files-as-queryname-grouped
. Make it apply to both single and multiple files. It should only overrule the case where the files are listed as unsorted or don't have a given sort order. A file that says it's coordinate sorted should still be treated as such.
The comment needs to mention that it will silently give incorrect results if the file's are not actually queryname grouped, and that this is an unsafe workaround for legacy files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually doesn't do that. What this change actually does is override the hard failure that used to throw an exception if all the multiple input files didin't have query grouped sort order.
If that override fails, the header associated with the input gets set to "unsorted" with no GO tag, which means the tool treats the input as being non-queryname sorted. Thus it will just happily sort the entire input for you at the outset. What the above check was doing before was silently asserting that the input was GO:query if all the inputs were indeed at least GO:query ordered according to their original headers.
if (!allowMultipleSortOrders) { | ||
headerMap.entrySet().stream().forEach(h -> { | ||
if (!ReadUtils.isReadNameGroupedBam(h.getValue())) { | ||
throw new UserException("Multiple inputs to MarkDuplicatesSpark detected. MarkDuplicatesSpark requires all inputs to be queryname sorted or querygroup-sorted for multi-input processing but input " + h.getKey() + " was sorted in " + h.getValue().getSortOrder() + " order. Try running with '--allow-multiple-sort-orders-in-input' to run by sorting all the input."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a constant for the fullname of the argument and use it in the comment here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -482,4 +482,15 @@ public void testAssertCorrectSortOrderMultipleBams() { | |||
args.addInput(new File(TEST_DATA_DIR,"example.chr1.1-1K.unmarkedDups.noDups.bam")); | |||
runCommandLine(args); | |||
} | |||
|
|||
@Test | |||
public void testAssertCorrectSortOrderMultipleBamsOverriding() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a test case for a single unsorted bam, and one that rejects when there's a coordinate sorted bam along side an unsorted bam
@lbergelson I added a second argument per your request. I still disagree that this is the right argument to expose because it is very dangerous. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery I have a few last nitpicky dock changes.
@@ -127,6 +128,8 @@ | |||
programGroup = ReadDataManipulationProgramGroup.class) | |||
public final class MarkDuplicatesSpark extends GATKSparkTool { | |||
private static final long serialVersionUID = 1L; | |||
public static final String ALLOW_MULTIPLE_SORT_ORDERS_IN_INPUT_ARG = "allow-multiple-sort-orders-in-input"; | |||
public static final String TREAT_UNSORTED_AS_OREDED = "treat-unsorted-as-querygroup-ordered-for-multiple-inputs"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo OREDED -> ORDERED
@@ -140,6 +143,16 @@ | |||
fullName = StandardArgumentDefinitions.METRICS_FILE_LONG_NAME) | |||
protected String metricsFile; | |||
|
|||
@Advanced | |||
@Argument(doc = "Override to allow non-queryname sorted inputs for multiple input bams.", optional=true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Argument(doc = "Override to allow non-queryname sorted inputs for multiple input bams.", optional=true, | |
@Argument(doc = "Allow non-queryname sorted inputs when specifying multiple input bams.", optional=true, |
protected boolean allowMultipleSortOrders = false; | ||
|
||
@Advanced | ||
@Argument(doc = "Treat unsorted files as query-group orderd files. NOTE: this may result in mark duplicates crashing if the file is unordered", optional=true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Argument(doc = "Treat unsorted files as query-group orderd files. NOTE: this may result in mark duplicates crashing if the file is unordered", optional=true, | |
@Argument(doc = "Treat files that are marked as unsorted or have no defined sort order files as query-group ordered files. WARNING: This option disables a basic safety check and may result in crashes or incorrect output if the files are not actually queryname grouped", optional=true, |
I might make this warning a bit stronger.
mergedHeader.setGroupOrder(SAMFileHeader.GroupOrder.query); | ||
if (!allowMultipleSortOrders) { | ||
headerMap.entrySet().stream().forEach(h -> { | ||
if (!ReadUtils.isReadNameGroupedBam(h.getValue()) && (!treatUnsortedAsOrdered && h.getValue().getSortOrder().equals(SAMFileHeader.SortOrder.unsorted))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
treatUnsortedAsOrdered should work for either SortOrder.unknown
or SortOrder.unsorted
I would probably pull out a function here that does this check to make it a little clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Treat unsortedAsOrdered should also apply when using a single input bam.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
throw new UserException("Multiple inputs to MarkDuplicatesSpark detected. MarkDuplicatesSpark requires all inputs to be queryname sorted or querygroup-sorted for multi-input processing but input "+h.getKey()+" was sorted in "+h.getValue().getSortOrder()+" order"); | ||
}}); | ||
mergedHeader.setGroupOrder(SAMFileHeader.GroupOrder.query); | ||
if (!allowMultipleSortOrders) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if this is set then multiple bams will always be sorted even if they are all coordinate grouped already? It probably shouldn't do that.
Ack, I posted a bunch of outdated comments by accident, ignore the ones that make no sense now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery A few tiny doc changes and I think this is good to go.
@@ -313,7 +313,7 @@ protected void runTool(final JavaSparkContext ctx) { | |||
if (headerMap.size() > 1) { | |||
final Optional<Map.Entry<String, SAMFileHeader>> badlySorted = headerMap.entrySet() | |||
.stream() | |||
.filter(h -> treatAsReadGroupOrdered(h.getValue(), treatUnsortedAsOrdered)) | |||
.filter(h -> !treatAsReadGroupOrdered(h.getValue(), treatUnsortedAsOrdered)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woops, sorry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
whats worse is that I actually caught this yesterday but forgot to push the branch until today... sigh...
An advanced argument to skip multiple inputs. We still want to discourage this behavior for performance reasons so its not the default behavior.
Fixes #5973