Tie Out MarkDuplicatesSpark Compared to Picard Mark Duplicates Output #4675

Closed
jamesemery opened this issue Apr 18, 2018 · 5 comments · Fixed by #5377
@jamesemery
Collaborator

MarkDuplicatesSpark output needs to be tested against the version of Picard used in production, to ensure that it produces identical output and is reasonably robust to pathological files. This requires that the following issues be resolved first:
#3705
#3706

@oldmikeyang

oldmikeyang commented Oct 5, 2018

I have compared the results of MarkDuplicates and MarkDuplicatesSpark.
With the same input SAM file and the default parameters, MarkDuplicatesSpark marks more reads as duplicates.
Can you give me any suggestions on how to debug why the Spark version marks more?

READ_PAIR_DUPLICATES
11933661 (MarkDuplicates)
11974162 (MarkDuplicatesSpark)

Here are the metrics files:


MarkDuplicatesSpark  --output hdfs://wolfpass-aep:9000/user/test/spark_412.MarkDuplicates.bam --metrics-file hdfs://wolfpass-aep:9000/user/test/spark_412.MarkDuplicates-metrics.txt --input hdfs://wolfpass-aep:9000/user/test/spark_412.bowtie2.bam --spark-master yarn  --duplicate-scoring-strategy SUM_OF_BASE_QUALITIES --do-not-mark-unmapped-mates false --read-name-regex <optimized capture of last three ':' separated fields as numeric values> --optical-duplicate-pixel-distance 100 --read-validation-stringency SILENT --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --bam-partition-size 0 --disable-sequence-dictionary-validation false --add-output-vcf-command-line true --sharded-output false --num-reducers 0 --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false

METRICS CLASS	org.broadinstitute.hellbender.utils.read.markduplicates.GATKDuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
lib1	173613	53799913	0	7610605	81003	11974162	585768	0.222961	05870713

MarkDuplicates  --INPUT /home/test/WGS_pipeline/TEST/output/orig_412.bowtie2.bam --OUTPUT /home/test/WGS_pipeline/TEST/output/orig_412.MarkDuplicates.bam --METRICS_FILE /home/test/WGS_pipeline/TEST/output/orig_412.MarkDuplicates-metrics.txt --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

METRICS CLASS	picard.sam.DuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
lib1	173613	53799913	0	7610605	81003	11933661	585768	0.22221	06317338
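
One way to see exactly which reads disagree, rather than just the aggregate counts, is to compare which records carry the duplicate flag (0x400) in each output BAM. A rough sketch, assuming samtools is installed and both output BAMs have been copied to the local filesystem (file names below match the commands above):

# Count duplicate-flagged primary reads (-f 1024 = duplicate; -F 2304 excludes secondary/supplementary)
samtools view -c -f 1024 -F 2304 orig_412.MarkDuplicates.bam
samtools view -c -f 1024 -F 2304 spark_412.MarkDuplicates.bam

# Extract duplicate-flagged read names and list those marked by only one of the two tools
samtools view -f 1024 -F 2304 orig_412.MarkDuplicates.bam | cut -f1 | sort -u > picard.dups
samtools view -f 1024 -F 2304 spark_412.MarkDuplicates.bam | cut -f1 | sort -u > spark.dups
comm -3 picard.dups spark.dups | head

Inspecting a few of the disagreeing templates (e.g. in IGV, or with samtools view piped through grep on the read name) usually makes the pattern apparent: unmapped mates, secondary/supplementary handling, or tie-breaking between equal-scoring pairs.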



@droazen
Contributor

droazen commented Oct 5, 2018

@oldmikeyang Do you get the same result if the inputs and outputs to MarkDuplicatesSpark are on the local filesystem rather than HDFS?

@jamesemery Can you comment?

@droazen droazen added this to the Engine-Q42018 milestone Oct 5, 2018
@jamesemery
Collaborator Author

Hello @oldmikeyang, I'm in the middle of a tie-out for MarkDuplicatesSpark right now. I recently fixed some counting issues in the metrics collection (it was over-counting the number of duplicate pairs marked relative to Picard), and the fix will hopefully be released soon. I suspect the actual BAM output is correct. I will soon have a branch that I would ask you to try MarkDuplicatesSpark on again, to tell me whether it still causes problems; unfortunately, an unrelated fix requires a change to go into Picard first (broadinstitute/picard#1230).

I will let you know when the PR is open, as I would love to know if it fixes this mismatch.

@oldmikeyang

oldmikeyang commented Oct 6, 2018

> @oldmikeyang Do you get the same result if the inputs and outputs to MarkDuplicatesSpark are on the local filesystem rather than HDFS?
>
> @jamesemery Can you comment?

I just tried MarkDuplicatesSpark on the local file system, without HDFS.
The application fails. It says it can't find the file, but the file is on the local file system:

ls -la /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
-rw-rw-r--. 1 test test 4668988887 Oct  4 11:27 /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam

Here is the Spark error output:

A USER ERROR has occurred: Failed to read bam header from
/home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
 Caused by:File does not exist: /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:62)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:692)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:381)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)


***********************************************************************
org.broadinstitute.hellbender.exceptions.UserException: Failed to read bam header from /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
 Caused by:File does not exist: /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:62)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:692)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:381)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:237)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:488)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:468)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:458)
	at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
Caused by: java.io.FileNotFoundException: File does not exist: /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:62)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:692)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:381)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1228)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
	at org.seqdoop.hadoop_bam.util.SAMHeaderReader.readSAMHeaderFrom(SAMHeaderReader.java:51)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:235)
	... 15 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:62)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:692)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:381)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

	at org.apache.hadoop.ipc.Client.call(Client.java:1475)
	at org.apache.hadoop.ipc.Client.call(Client.java:1412)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
	... 28 more
18/10/06 09:45:36 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
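
A note on this failure mode: the "File does not exist" is thrown by the HDFS NameNode (org.apache.hadoop.hdfs.server.namenode), which suggests the schemeless path /home/... is being resolved against the cluster's default filesystem (fs.defaultFS, here HDFS) rather than the local disk. Two things that may be worth trying, sketched below assuming the standard gatk launcher and the argument names shown in the command line above:

# Sketch: run with a local Spark master so /home/... is read from local disk
gatk MarkDuplicatesSpark \
    --input /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam \
    --output /home/test/WGS_pipeline/TEST/output/spark_412.MarkDuplicates.bam \
    --metrics-file /home/test/WGS_pipeline/TEST/output/spark_412.MarkDuplicates-metrics.txt \
    --spark-master 'local[*]'

# Or stay on YARN but make the scheme explicit; note that file:// paths only
# work on YARN if every executor node sees the same path (e.g. a shared mount)
--input file:///home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam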

@oldmikeyang

oldmikeyang commented Oct 6, 2018

> Hello @oldmikeyang, I'm in the middle of a tie-out for MarkDuplicatesSpark right now. I recently fixed some counting issues in the metrics collection (it was over-counting the number of duplicate pairs marked relative to Picard), and the fix will hopefully be released soon. I suspect the actual BAM output is correct. I will soon have a branch that I would ask you to try MarkDuplicatesSpark on again, to tell me whether it still causes problems; unfortunately, an unrelated fix requires a change to go into Picard first (broadinstitute/picard#1230).
>
> I will let you know when the PR is open, as I would love to know if it fixes this mismatch.

I would like to try it.

By the way, I am running the all-Spark pipeline, MarkDuplicatesSpark + BQSRPipelineSpark + HaplotypeCallerSpark, to produce the VCF.
I found that its output differs from that of the standard GATK pipeline, MarkDuplicates + BaseRecalibrator + ApplyBQSR + HaplotypeCaller:
the Spark version calls about 3% fewer variants.
So I am debugging where these differences come from.
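
To localize where the 3% difference arises, one possible approach (a sketch; the VCF file names here are placeholders) is to intersect the two call sets with bcftools isec:

# Compress and index both call sets, then split into private/shared records
bgzip spark.vcf && tabix -p vcf spark.vcf.gz
bgzip standard.vcf && tabix -p vcf standard.vcf.gz
bcftools isec -p isec_out spark.vcf.gz standard.vcf.gz

# isec_out/0000.vcf: records private to the Spark pipeline
# isec_out/0001.vcf: records private to the standard pipeline
grep -vc '^#' isec_out/0000.vcf isec_out/0001.vcf

Note also that HaplotypeCallerSpark was a beta tool at this time and, as far as I understand, was not expected to produce output identical to HaplotypeCaller, so some of the difference may come from that step rather than from duplicate marking.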
