Tie Out MarkDuplicatesSpark Compared to Picard Mark Duplicates Output #4675
Comments
I have compared the results of MarkDuplicates and MarkDuplicatesSpark (READ_PAIR_DUPLICATES). Here is the metric file:
@oldmikeyang Do you get the same result if the inputs and outputs to
@jamesemery Can you comment?
Hello @oldmikeyang, I'm in the middle of doing a tie-out for MarkDuplicatesSpark right now. I recently fixed some counting issues in the metrics collection (it was over-counting the number of duplicate pairs marked compared to Picard), and the fix will hopefully be released soon. I suspect the actual BAM output is correct. I will have a branch soon that I would ask you to try MarkDuplicatesSpark on again and tell me whether it still causes problems. Unfortunately, an unrelated fix requires a change to go into Picard (broadinstitute/picard#1230). I will let you know when the PR is open, as I would love to know if it fixes this mismatch.
I just tried MarkDuplicatesSpark from the local file system, without HDFS.
Here is the Spark information:
I would like to try it. By the way, I am running the whole Spark pipeline (MarkDuplicatesSpark + BQSRPipelineSpark + HaplotypeCallerSpark) to get the VCF file.
MarkDuplicatesSpark output needs to be tested against the version of Picard they use in production, to ensure that it produces identical output and is reasonably robust to pathological files. This requires that the following issues have been resolved:
#3705
#3706
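
For anyone who wants to check the metrics mismatch discussed in this thread on their own data, here is a rough sketch of a field-by-field comparison of two duplication metrics files. It is not part of GATK or Picard. The helper names `parse_metrics` and `diff_metrics` are made up for illustration, and the parser assumes the usual Picard metrics layout: a `## METRICS CLASS` line, then a tab-separated header row, then one row per library, ended by a blank line.

```python
def parse_metrics(text):
    """Parse the METRICS CLASS table of a Picard-style metrics file.

    Returns {library_name: {column_name: value_as_string}}.
    Assumes a tab-separated header row directly follows '## METRICS CLASS'.
    """
    rows = {}
    header = None
    in_metrics = False
    for line in text.splitlines():
        if line.startswith("## METRICS CLASS"):
            in_metrics = True
            continue
        if not in_metrics:
            continue
        if not line.strip() or line.startswith("#"):
            if header is not None:
                break  # blank line or next section ends the table
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
        else:
            row = dict(zip(header, fields))
            rows[row.get("LIBRARY", "")] = row
    return rows


def diff_metrics(a, b):
    """List (library, column, value_a, value_b) for every differing field."""
    diffs = []
    for lib in sorted(set(a) | set(b)):
        row_a, row_b = a.get(lib, {}), b.get(lib, {})
        for col in sorted(set(row_a) | set(row_b)):
            if row_a.get(col) != row_b.get(col):
                diffs.append((lib, col, row_a.get(col), row_b.get(col)))
    return diffs
```

To use it, read each tool's metrics file into a string, pass both through `parse_metrics`, and hand the two results to `diff_metrics`. Values are compared as strings, so a purely cosmetic difference in number formatting would also be reported.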