Tie Out MarkDuplicatesSpark Compared to Picard Mark Duplicates Output #4675

Closed
jamesemery opened this issue Apr 18, 2018 · 5 comments · Fixed by #5377
@jamesemery
Collaborator

MarkDuplicatesSpark output needs to be tested against the version of Picard used in production, to ensure that it produces identical output and is reasonably robust to pathological files. This requires that the following issues be resolved first:
#3705
#3706

@oldmikeyang

oldmikeyang commented Oct 5, 2018

I have compared the results of MarkDuplicates and MarkDuplicatesSpark.
With the same input SAM file and the default parameters, MarkDuplicatesSpark marks more reads as duplicates.
Can you give me any suggestions on how to debug why the Spark version marks more?

READ_PAIR_DUPLICATES
11933661 (MarkDuplicates)
11974162 (MarkDuplicatesSpark)

Here are the metrics files:


MarkDuplicatesSpark  --output hdfs://wolfpass-aep:9000/user/test/spark_412.MarkDuplicates.bam --metrics-file hdfs://wolfpass-aep:9000/user/test/spark_412.MarkDuplicates-metrics.txt --input hdfs://wolfpass-aep:9000/user/test/spark_412.bowtie2.bam --spark-master yarn  --duplicate-scoring-strategy SUM_OF_BASE_QUALITIES --do-not-mark-unmapped-mates false --read-name-regex <optimized capture of last three ':' separated fields as numeric values> --optical-duplicate-pixel-distance 100 --read-validation-stringency SILENT --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --bam-partition-size 0 --disable-sequence-dictionary-validation false --add-output-vcf-command-line true --sharded-output false --num-reducers 0 --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays  --disable-tool-default-read-filters false

METRICS CLASS	org.broadinstitute.hellbender.utils.read.markduplicates.GATKDuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
lib1	173613	53799913	0	7610605	81003	11974162	585768	0.222961	05870713

MarkDuplicates  --INPUT /home/test/WGS_pipeline/TEST/output/orig_412.bowtie2.bam --OUTPUT /home/test/WGS_pipeline/TEST/output/orig_412.MarkDuplicates.bam --METRICS_FILE /home/test/WGS_pipeline/TEST/output/orig_412.MarkDuplicates-metrics.txt --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false

METRICS CLASS	picard.sam.DuplicationMetrics
LIBRARY	UNPAIRED_READS_EXAMINED	READ_PAIRS_EXAMINED	SECONDARY_OR_SUPPLEMENTARY_RDS	UNMAPPED_READS	UNPAIRED_READ_DUPLICATES	READ_PAIR_DUPLICATES	READ_PAIR_OPTICAL_DUPLICATES	PERCENT_DUPLICATION	ESTIMATED_LIBRARY_SIZE
lib1	173613	53799913	0	7610605	81003	11933661	585768	0.22221	06317338
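
One way to see exactly which reads disagree, rather than just the aggregate counts, is to compare which records carry the duplicate flag (0x400) in each output BAM. A rough sketch, assuming samtools is installed and both output BAMs have been copied to the local filesystem (file names below match the commands above):

# Count duplicate-flagged primary reads (-f 1024 = duplicate; -F 2304 excludes secondary/supplementary)
samtools view -c -f 1024 -F 2304 orig_412.MarkDuplicates.bam
samtools view -c -f 1024 -F 2304 spark_412.MarkDuplicates.bam

# Extract duplicate-flagged read names and list those marked by only one of the two tools
samtools view -f 1024 -F 2304 orig_412.MarkDuplicates.bam | cut -f1 | sort -u > picard.dups
samtools view -f 1024 -F 2304 spark_412.MarkDuplicates.bam | cut -f1 | sort -u > spark.dups
comm -3 picard.dups spark.dups | head

Inspecting a few of the disagreeing templates (e.g. in IGV, or with samtools view piped through grep on the read name) usually makes the pattern apparent: unmapped mates, secondary/supplementary handling, or tie-breaking between equal-scoring pairs.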



@droazen
Contributor

droazen commented Oct 5, 2018

@oldmikeyang Do you get the same result if the inputs and outputs to MarkDuplicatesSpark are on the local filesystem rather than HDFS?

@jamesemery Can you comment?

@droazen droazen added this to the Engine-Q42018 milestone Oct 5, 2018
@jamesemery
Collaborator Author

Hello @oldmikeyang, I'm in the middle of a tie-out for MarkDuplicatesSpark right now. I recently fixed some counting issues in the metrics collection (it was over-counting the number of duplicate pairs marked relative to Picard), and the fix will hopefully be released soon. I suspect the actual BAM output is correct. I will soon have a branch that I would ask you to try MarkDuplicatesSpark on again, to tell me whether it still causes problems; unfortunately, an unrelated fix requires a change to go into Picard first (broadinstitute/picard#1230).

I will let you know when the PR is open, as I would love to know if it fixes this mismatch.

@oldmikeyang

oldmikeyang commented Oct 6, 2018

> @oldmikeyang Do you get the same result if the inputs and outputs to MarkDuplicatesSpark are on the local filesystem rather than HDFS?
>
> @jamesemery Can you comment?

I just tried MarkDuplicatesSpark on the local file system, without HDFS.
The application fails. It says it can't find the file, but the file is on the local file system:

ls -la /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
-rw-rw-r--. 1 test test 4668988887 Oct  4 11:27 /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam

Here is the Spark error output:

A USER ERROR has occurred: Failed to read bam header from
/home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
 Caused by:File does not exist: /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:62)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:692)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:381)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)


***********************************************************************
org.broadinstitute.hellbender.exceptions.UserException: Failed to read bam header from /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
 Caused by:File does not exist: /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:62)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:692)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:381)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:237)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:488)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:468)
	at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:458)
	at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:30)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
Caused by: java.io.FileNotFoundException: File does not exist: /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:62)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:692)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:381)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
	at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1228)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1213)
	at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1201)
	at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:306)
	at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:272)
	at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:264)
	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1526)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:304)
	at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:312)
	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:769)
	at org.seqdoop.hadoop_bam.util.SAMHeaderReader.readSAMHeaderFrom(SAMHeaderReader.java:51)
	at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:235)
	... 15 more
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:72)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:62)
	at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1819)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:692)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:381)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)

	at org.apache.hadoop.ipc.Client.call(Client.java:1475)
	at org.apache.hadoop.ipc.Client.call(Client.java:1412)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:255)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy11.getBlockLocations(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1226)
	... 28 more
18/10/06 09:45:36 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
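
A note on this failure mode: the "File does not exist" is thrown by the HDFS NameNode (org.apache.hadoop.hdfs.server.namenode), which suggests the schemeless path /home/... is being resolved against the cluster's default filesystem (fs.defaultFS, here HDFS) rather than the local disk. Two things that may be worth trying, sketched below assuming the standard gatk launcher and the argument names shown in the command line above:

# Sketch: run with a local Spark master so /home/... is read from local disk
gatk MarkDuplicatesSpark \
    --input /home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam \
    --output /home/test/WGS_pipeline/TEST/output/spark_412.MarkDuplicates.bam \
    --metrics-file /home/test/WGS_pipeline/TEST/output/spark_412.MarkDuplicates-metrics.txt \
    --spark-master 'local[*]'

# Or stay on YARN but make the scheme explicit; note that file:// paths only
# work on YARN if every executor node sees the same path (e.g. a shared mount)
--input file:///home/test/WGS_pipeline/TEST/output/spark_412.bowtie2.bam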

@oldmikeyang

oldmikeyang commented Oct 6, 2018

> Hello @oldmikeyang, I'm in the middle of a tie-out for MarkDuplicatesSpark right now. I recently fixed some counting issues in the metrics collection (it was over-counting the number of duplicate pairs marked relative to Picard), and the fix will hopefully be released soon. I suspect the actual BAM output is correct. I will soon have a branch that I would ask you to try MarkDuplicatesSpark on again, to tell me whether it still causes problems; unfortunately, an unrelated fix requires a change to go into Picard first (broadinstitute/picard#1230).
>
> I will let you know when the PR is open, as I would love to know if it fixes this mismatch.

I would like to try it.

By the way, I am running the all-Spark pipeline, MarkDuplicatesSpark + BQSRPipelineSpark + HaplotypeCallerSpark, to produce the VCF.
I found that its output differs from that of the standard GATK pipeline, MarkDuplicates + BaseRecalibrator + ApplyBQSR + HaplotypeCaller:
the Spark version calls about 3% fewer variants.
So I am debugging where these differences come from.
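
To localize where the 3% difference arises, one possible approach (a sketch; the VCF file names here are placeholders) is to intersect the two call sets with bcftools isec:

# Compress and index both call sets, then split into private/shared records
bgzip spark.vcf && tabix -p vcf spark.vcf.gz
bgzip standard.vcf && tabix -p vcf standard.vcf.gz
bcftools isec -p isec_out spark.vcf.gz standard.vcf.gz

# isec_out/0000.vcf: records private to the Spark pipeline
# isec_out/0001.vcf: records private to the standard pipeline
grep -vc '^#' isec_out/0000.vcf isec_out/0001.vcf

Note also that HaplotypeCallerSpark was a beta tool at this time and, as far as I understand, was not expected to produce output identical to HaplotypeCaller, so some of the difference may come from that step rather than from duplicate marking.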
