Random bugs in germline copy number variant calling with GATK 4: PythonScriptExecutorException #6235
Comments
Hi, Last night I tried again, and when I submitted all 25 sub-projects, a similar exception occurred in GermlineCNVCaller. The problem seems to come from gcnvkernel when parallel projects are submitted at the same time.
.............................................................(BUG 003)..........................................................
00:50:20.554 DEBUG ScriptExecutor - /gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-28-1-Test-gCNV_23-40-33/2-Output/8-GATK-Temp/sample-07410307475890858352.tsv
@droazen Mark has been CNV tech lead for some time now, so I’ll let him take a first crack at this or delegate. However, I will point out #4782, which is tangentially related. Looks like handling the global compiler lock appropriately should also address the main issue. Finally, I’ll add that we should include such computing environments in our future testing infrastructure.
Dear samuelklee, Thank you for your concern. I have just finished testing with GATK/4.1.3.0, which suffers from the same exceptions. I hope to get good news from you and Mark soon. Best regards and thank you!
@xysj1989 We primarily run this workflow using the WDL on Terra. In this case, each GermlineCNVCaller shard is run on a separate VM using the GATK Docker. Hopefully, we can always at least guarantee that this default mode of running the workflow is functional and covered by tests. However, if you'd like to instead run multiple instances of GermlineCNVCaller locally, you may need to make sure certain environment variables are set appropriately. For example, I think you can address (2) above (the location of the temporary theano directory) by either setting environment variables or modifying your Theano configuration (see http://deeplearning.net/software/theano/library/config.html) appropriately. You may also want to check the GermlineCNVCaller task in the WDL to see how other variables are set there. Let me look into whether you can also address (1) in this way, or if this will require a GATK code change, and get back to you. (Of course, if you figure it out before me, please follow up!) Thanks again for bringing this to our attention.
OK, looks like you can get around the compiler lock issues by pointing each invocation of GermlineCNVCaller to a different compilation directory. For example, invoke
This uses the
Where again, This is a bit of a hack. We could probably avoid this by changing the GATK code to use a specified or temporary directory for the theano directory without too much effort. However, there is an upside to using a non-temporary directory to avoid recompilation of the model upon subsequent runs. In this case, we'd just want to let the user be able to specify the theano directory (rather than dump things in @mwalker174 opinions? @droazen or engine team, thoughts on what the policy should be for python/R scripts doing this sort of thing? Is it generally true that the GATK leaves no trace, other than producing the expected output? |
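As a rough illustration of the workaround discussed here (a separate compilation directory per invocation), the sketch below launches each command with its own `base_compiledir` passed through `THEANO_FLAGS`. The helper names `theano_env` and `run_isolated`, the `theano-<job>` directory naming, and the scratch root are all hypothetical; this is not the command from the thread, only one way the idea might be wired up from a Python driver script.

```python
import os
import subprocess
import sys
from pathlib import Path


def theano_env(job_name, scratch_root, base_env=None):
    """Return a copy of the environment with a job-specific Theano compile dir.

    The directory layout (<scratch_root>/theano-<job_name>) is an assumption
    made for this sketch.
    """
    env = dict(os.environ if base_env is None else base_env)
    compile_dir = Path(scratch_root) / f"theano-{job_name}"
    compile_dir.mkdir(parents=True, exist_ok=True)

    # THEANO_FLAGS is a comma-separated list of key=value pairs. Keep any
    # unrelated flags, drop a previous base_compiledir, and append ours.
    kept = [
        flag
        for flag in env.get("THEANO_FLAGS", "").split(",")
        if flag and not flag.startswith("base_compiledir=")
    ]
    kept.append(f"base_compiledir={compile_dir}")
    env["THEANO_FLAGS"] = ",".join(kept)
    return env


def run_isolated(command, job_name, scratch_root):
    """Run one command (a list of arguments) with its own compile directory."""
    env = theano_env(job_name, scratch_root)
    return subprocess.run(command, env=env, check=True)
```

Because the variable is set on the child process itself, it travels with the job to whichever compute node runs it, as long as the driver script is what the scheduler executes. A call might look like `run_isolated(gatk_command, "chr22", "/path/to/scratch")`, where `gatk_command` is the argument list you already use.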
Dear samuelklee, Thank you very much for your reply. I also found this problem last night. It seems that the problem originates in Theano and PyMC3 rather than in GATK 4.0. Similar problems have been reported at (1) pymc-devs/pymc#1463, (2) https://stackoverflow.com/questions/52270853/how-to-get-rid-of-theano-gof-compilelock and (3) https://groups.google.com/forum/#!topic/theano-users/eJ2vl2PUTk4
Last night I already tried to reset base_compiledir for Theano in two ways: (1) creating a ~/.theanorc file as you suggested, and (2) modifying ~/.bashrc on my login node by adding the line:
export THEANO_FLAGS="base_compiledir=/scratch/gatk-user1/z-Temp/z-Temp-Theano-$chr"
However, on our cluster, when I submit the 25 jobs (one per chromosome), they are assigned to different compute nodes at random. This means I would have to set the Theano environment variable on each of those nodes, which is difficult for me because the assignment is random. So now I am going to add lines like the ones below to ~/.theanorc on my login node to see what happens. Maybe it will work.
However, I would really appreciate it if someone on your team could add an option to specify a temporary directory for the Theano directory, which can be bound to the corresponding node shared by other GATK threads. Thank you and best regards.
@xysj1989 I would think that if you use the python In any case, I will try to issue a PR allowing you to directly set the directory or use a temporary one soon. Thanks again for raising the issue!
@samuelklee Thank you for your suggested solution, which sounds really fantastic. At present, I'm testing the pipelines by adding THEANORC=PATH/TO/THEANORC before the "GATK sub-functions". I will report the results when they are finished! Best regards.
@samuelklee The method you suggested successfully solved my problem. I have tested it 12 times and "theano-gof-compilelock" no longer occurs. Thank you very much!
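The `THEANORC=PATH/TO/THEANORC` approach reported as working above can be sketched as follows: write one small Theano config file per job, each with its own `base_compiledir`, and point `THEANORC` at it when launching that job. The function names, the file name `theanorc`, and the per-job directory layout are hypothetical choices for this sketch; only the `[global]` section, the `base_compiledir` option, and the `THEANORC` variable come from Theano's configuration mechanism.

```python
import configparser
import os
from pathlib import Path


def write_theanorc(job_name, scratch_root):
    """Write a per-job Theano config file and return its path.

    Layout assumed for this sketch:
      <scratch_root>/<job_name>/theanorc   (the config file)
      <scratch_root>/<job_name>/theano     (the compile directory it names)
    """
    job_dir = Path(scratch_root) / job_name
    job_dir.mkdir(parents=True, exist_ok=True)

    # interpolation=None so a literal '%' in a path is written unchanged.
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {"base_compiledir": str(job_dir / "theano")}

    rc_path = job_dir / "theanorc"
    with open(rc_path, "w") as handle:
        config.write(handle)
    return rc_path


def env_with_theanorc(rc_path, base_env=None):
    """Return a copy of the environment with THEANORC pointing at rc_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["THEANORC"] = str(rc_path)
    return env
```

The returned environment can be passed as `env=` to `subprocess.run` for the corresponding GATK command, which is equivalent to prefixing the command with `THEANORC=...` in a shell script.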
Hi,
I am trying to call common and rare germline copy number variants with GATK 4 for more than 100 human samples, using the hg19 human reference genome. For this project, I have 500 GB of memory, 10 TB of storage, and 300 CPU cores. The program version is as below:
I didn't use the WDL workflow. I followed the document Notebook#11684 and built a local pipeline. I split my project by chromosome (chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrMT).
After finishing the pipeline, I am testing it with 6 samples.
When I submit my script for each chromosome separately, every sub-project goes well: from my input BAM files, I get the corresponding VCF files (10 cores and 10 GB for each sub-project). That is to say, our GATK and Python environment for germline copy number variant calling should be OK.
However, when I submit all 25 sub-projects (12 cores and 12 GB each) at the same time, some sub-projects randomly fail with one of the following two PythonScriptExecutorExceptions:
.............................................................(BUG 001)..........................................................
Traceback (most recent call last):
File "/tmp/cohort_determine_ploidy_and_depth.3351404099122294482.py", line 8, in <module>
import gcnvkernel
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/gcnvkernel/__init__.py", line 1, in <module>
from pymc3 import __version__ as pymc3_version
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/__init__.py", line 5, in <module>
from .distributions import *
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/__init__.py", line 1, in <module>
from . import timeseries
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/timeseries.py", line 5, in <module>
from .continuous import get_tau_sd, Normal, Flat
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/continuous.py", line 16, in <module>
from pymc3.theanof import floatX
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/theanof.py", line 89, in <module>
empty_gradient = tt.zeros(0, dtype='float32')
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 2558, in zeros
return alloc(np.array(0, dtype=dtype), *shape)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 3091, in __call__
ret = super(Alloc, self).__call__(val, *shapes, **kwargs)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 670, in __call__
no_recycling=[])
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 955, in make_thunk
no_recycling)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 858, in make_c_thunk
output_storage=node_output_storage)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1217, in make_thunk
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1157, in compile
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1623, in cthunk_factory
module = get_module_cache().module_from_key(
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 48, in get_module_cache
return cmodule.get_module_cache(config.compiledir, init_args=init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1587, in get_module_cache
_module_cache = ModuleCache(dirname, **init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 703, in __init__
self.refresh()
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 794, in refresh
files = os.listdir(root)
FileNotFoundError: [Errno 2] No such file or directory: '/spin1/home/linux/gatk_users1/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.5.1804-Core-x86_64-3.6.2-64/tmpmy0w17z3'
00:34:39.396 DEBUG ScriptExecutor - Result: 1
00:34:39.397 INFO DetermineGermlineContigPloidy - Shutting down engine
[October 27, 2019 12:34:39 AM EDT] org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy done. Elapsed time: 0.66 minutes.
Runtime.totalMemory()=2151677952
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
Command Line: python /tmp/cohort_determine_ploidy_and_depth.3351404099122294482.py --sample_coverage_metadata=/tmp/samples-by-coverage-per-contig8898090777596224038.tsv --output_calls_path=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/2-Output/1-Contig-Ploidy/22.Contig_Ploidy_Dir/ploidy-calls --mapping_error_rate=1.000000e-02 --psi_s_scale=1.000000e-04 --mean_bias_sd=1.000000e-02 --psi_j_scale=1.000000e-03 --learning_rate=5.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.990000e-01 --log_emission_samples_per_round=2000 --log_emission_sampling_rounds=100 --log_emission_sampling_median_rel_error=5.000000e-04 --max_advi_iter_first_epoch=1000 --max_advi_iter_subsequent_epochs=1000 --min_training_epochs=20 --max_training_epochs=100 --initial_temperature=2.000000e+00 --num_thermal_advi_iters=5000 --convergence_snr_averaging_window=5000 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=1 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=7.500000e-01 --disable_caller=false --disable_sampler=false --disable_annealing=false --interval_list=/tmp/intervals8430607484736018931.tsv --contig_ploidy_prior_table=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/9-Ref_Interval/3-contig_ploidy_priors/22.contig_ploidy_priors.csv --output_model_path=/gpfs/gsfs7/users/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/2-Output/1-Contig-Ploidy/22.Contig_Ploidy_Dir/ploidy-model
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.executeDeterminePloidyAndDepthPythonScript(DetermineGermlineContigPloidy.java:411)
at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:288)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /usr/local/apps/GATK/4.1.2.0/gatk-package-4.1.2.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /usr/local/apps/GATK/4.1.2.0/gatk-package-4.1.2.0-local.jar DetermineGermlineContigPloidy -L /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/9-Ref_Interval/2-Filter_Interval/22.preprocessed.Filtered.interval_list --interval-merging-rule OVERLAPPING_ONLY -I /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/1-Input/3-BAM-ReadCount/22.SC349574.bam.csv -I /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/1-Input/3-BAM-ReadCount/22.SC349575.bam.csv -I /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/1-Input/3-BAM-ReadCount/22.SC349488.bam.csv -I /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/1-Input/3-BAM-ReadCount/22.SC349489.bam.csv --contig-ploidy-priors /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/9-Ref_Interval/3-contig_ploidy_priors/22.contig_ploidy_priors.csv --output /data/gatk_users1/0-Project/1-gCNV-Lung/z-bak/z-2019-10-26-1-Test-gCNV/2-Output/1-Contig-Ploidy/22.Contig_Ploidy_Dir --output-prefix ploidy --verbosity DEBUG
.............................................................(BUG 002)..........................................................
Stderr: Traceback (most recent call last):
File "/tmp/segment_gcnv_calls.3402406683372415608.py", line 9, in <module>
import gcnvkernel
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/gcnvkernel/__init__.py", line 1, in <module>
from pymc3 import __version__ as pymc3_version
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/__init__.py", line 5, in <module>
from .distributions import *
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/__init__.py", line 1, in <module>
from . import timeseries
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/timeseries.py", line 5, in <module>
from .continuous import get_tau_sd, Normal, Flat
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/distributions/continuous.py", line 16, in <module>
from pymc3.theanof import floatX
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/pymc3/theanof.py", line 89, in <module>
empty_gradient = tt.zeros(0, dtype='float32')
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 2558, in zeros
return alloc(np.array(0, dtype=dtype), *shape)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/tensor/basic.py", line 3091, in __call__
ret = super(Alloc, self).__call__(val, *shapes, **kwargs)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 670, in __call__
no_recycling=[])
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 955, in make_thunk
no_recycling)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/op.py", line 858, in make_c_thunk
output_storage=node_output_storage)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1217, in make_thunk
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1157, in compile
keep_lock=keep_lock)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 1623, in cthunk_factory
module = get_module_cache().module_from_key(
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cc.py", line 48, in get_module_cache
return cmodule.get_module_cache(config.compiledir, init_args=init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1587, in get_module_cache
_module_cache = ModuleCache(dirname, **init_args)
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 703, in __init__
self.refresh()
File "/usr/local/Anaconda/envs_app/gatk/4.1.2.0/lib/python3.6/site-packages/theano/gof/cmodule.py", line 794, in refresh
files = os.listdir(root)
FileNotFoundError: [Errno 2] No such file or directory: '/spin1/home/linux/gatk_users1/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.5.1804-Core-x86_64-3.6.2-64/tmp3mkfuhpw'
at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
at org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls.executeSegmentGermlineCNVCallsPythonScript(PostprocessGermlineCNVCalls.java:509)
at org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls.generateSegmentsVCFFileFromAllShards(PostprocessGermlineCNVCalls.java:447)
at org.broadinstitute.hellbender.tools.copynumber.PostprocessGermlineCNVCalls.onTraversalSuccess(PostprocessGermlineCNVCalls.java:304)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1041)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
................................................................................................................................................
................................................................................................................................................
................................................................................................................................................
These exceptions happen randomly in the following two tools:
(1) DetermineGermlineContigPloidy
(2) PostprocessGermlineCNVCalls
I have tried 6 times, and each time fewer than 6 random sub-projects (chromosomes) failed because of the above two PythonScriptExecutorExceptions, while the other sub-projects (chromosomes) completed fine. The failed chromosomes differed from run to run.
(1) Would you please help me solve this problem? Does it mean that the current version of the GATK germline CNV calling process does not support running parallel projects at the same time on a high-performance computing cluster, because of potential thread conflicts?
(2) I notice that several tmp directories and files are generated under "/spin1/home/linux/gatk_users1/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.5.1804-Core-x86_64-3.6.2-64/ ", which I did not specify and which are never deleted. Are these temporary files generated by Theano? How can I redirect them to a directory of my choice?
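To get an overview of the leftover directories mentioned in (2), a small helper along the following lines could list `tmp*` subdirectories of a compile directory that have not been modified recently. This is only a sketch: the function name `leftover_tmp_dirs` and the 24-hour default are hypothetical, the `tmp` prefix is inferred from the directory names in the tracebacks above, and the helper only reports directories rather than deleting anything.

```python
import time
from pathlib import Path


def leftover_tmp_dirs(compiledir, older_than_hours=24.0):
    """List tmp* subdirectories of compiledir not modified within the window.

    Returns a sorted list of Path objects; returns an empty list if
    compiledir does not exist. Nothing is deleted.
    """
    root = Path(compiledir)
    if not root.is_dir():
        return []

    cutoff = time.time() - older_than_hours * 3600
    return sorted(
        entry
        for entry in root.iterdir()
        if entry.is_dir()
        and entry.name.startswith("tmp")
        and entry.stat().st_mtime < cutoff
    )
```

Reviewing this list while no jobs are running is the safer option, since a directory that looks stale may still belong to a job that is waiting on the compile lock.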
Best regards.