Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gCNV Open file limits exceeded at GermlineCNVCallerCohortMode step #5714

Closed
drifty914 opened this issue Feb 25, 2019 · 10 comments
Closed

gCNV Open file limits exceeded at GermlineCNVCallerCohortMode step #5714

drifty914 opened this issue Feb 25, 2019 · 10 comments

Comments

@drifty914
Copy link

@samuelklee I'm running into an error for the cnv_germline_case_scattered_workflow WDL pipeline to create a Panel of Normals. It seems during the GermlineCNVCallerCohortMode step, the pipeline opens up tens-of-thousands of files that it doesn't close, causing the system to crash. This seems to happen for me both with the Docker image and Standalone GATK4.1.0.0 jar.

It reminds me of this issue mentioned on the forums from GATK3.8 but the error still occurs even if I limit that step to a single thread. I'm running on a Red Had HPC with 16 threads and 200GB of RAM available and using Cromwell v34. After checked with the manager for my cluster it seems the error occurred when over 60K files were opened simultaneously so this looks to me more like a memory leak than a ulimit issue.

Here's the output from a typical error file:

23:36:58.837 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
23:36:58.940 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/libgkl_compression7867300459324040837.so
23:37:00.969 INFO  GermlineCNVCaller - ------------------------------------------------------------
23:37:00.970 INFO  GermlineCNVCaller - The Genome Analysis Toolkit (GATK) v4.1.0.0
23:37:00.970 INFO  GermlineCNVCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
23:37:00.970 INFO  GermlineCNVCaller - Executing as user@e15b680c0241 on Linux v3.10.0-327.36.1.el7.x86_64 amd64
23:37:00.970 INFO  GermlineCNVCaller - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
23:37:00.971 INFO  GermlineCNVCaller - Start Date/Time: February 22, 2019 11:36:58 PM UTC
23:37:00.971 INFO  GermlineCNVCaller - ------------------------------------------------------------
23:37:00.971 INFO  GermlineCNVCaller - ------------------------------------------------------------
23:37:00.972 INFO  GermlineCNVCaller - HTSJDK Version: 2.18.2
23:37:00.972 INFO  GermlineCNVCaller - Picard Version: 2.18.25
23:37:00.974 INFO  GermlineCNVCaller - HTSJDK Defaults.BUFFER_SIZE : 131072
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.CREATE_INDEX : false
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.CREATE_MD5 : false
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.CUSTOM_READER_FACTORY : 
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.DISABLE_SNAPPY_COMPRESSOR : false
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.EBI_REFERENCE_SERVICE_URL_MASK : https://www.ebi.ac.uk/ena/cram/md5/%s
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.NON_ZERO_BUFFER_SIZE : 131072
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.REFERENCE_FASTA : null
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
23:37:00.975 INFO  GermlineCNVCaller - HTSJDK Defaults.USE_CRAM_REF_DOWNLOAD : false
23:37:00.976 DEBUG ConfigFactory - Configuration file values: 
23:37:00.982 DEBUG ConfigFactory -     gcsMaxRetries = 20
23:37:00.982 DEBUG ConfigFactory -     gcsProjectForRequesterPays = 
23:37:00.982 DEBUG ConfigFactory -     gatk_stacktrace_on_user_exception = false
23:37:00.982 DEBUG ConfigFactory -     samjdk.use_async_io_read_samtools = false
23:37:00.982 DEBUG ConfigFactory -     samjdk.use_async_io_write_samtools = true
23:37:00.982 DEBUG ConfigFactory -     samjdk.use_async_io_write_tribble = false
23:37:00.983 DEBUG ConfigFactory -     samjdk.compression_level = 2
23:37:00.983 DEBUG ConfigFactory -     spark.kryoserializer.buffer.max = 512m
23:37:00.983 DEBUG ConfigFactory -     spark.driver.maxResultSize = 0
23:37:00.983 DEBUG ConfigFactory -     spark.driver.userClassPathFirst = true
23:37:00.983 DEBUG ConfigFactory -     spark.io.compression.codec = lzf
23:37:00.983 DEBUG ConfigFactory -     spark.yarn.executor.memoryOverhead = 600
23:37:00.983 DEBUG ConfigFactory -     spark.driver.extraJavaOptions = 
23:37:00.983 DEBUG ConfigFactory -     spark.executor.extraJavaOptions = 
23:37:00.983 DEBUG ConfigFactory -     codec_packages = [htsjdk.variant, htsjdk.tribble, org.broadinstitute.hellbender.utils.codecs]
23:37:00.983 DEBUG ConfigFactory -     read_filter_packages = [org.broadinstitute.hellbender.engine.filters]
23:37:00.983 DEBUG ConfigFactory -     annotation_packages = [org.broadinstitute.hellbender.tools.walkers.annotator]
23:37:00.983 DEBUG ConfigFactory -     cloudPrefetchBuffer = 40
23:37:00.983 DEBUG ConfigFactory -     cloudIndexPrefetchBuffer = -1
23:37:00.983 DEBUG ConfigFactory -     createOutputBamIndex = true
23:37:00.984 INFO  GermlineCNVCaller - Deflater: IntelDeflater
23:37:00.984 INFO  GermlineCNVCaller - Inflater: IntelInflater
23:37:00.984 INFO  GermlineCNVCaller - GCS max retries/reopens: 20
23:37:00.984 INFO  GermlineCNVCaller - Requester pays: disabled
23:37:00.984 INFO  GermlineCNVCaller - Initializing engine
23:37:00.990 DEBUG ScriptExecutor - Executing:
23:37:00.991 DEBUG ScriptExecutor -   python
23:37:00.991 DEBUG ScriptExecutor -   -c
23:37:00.991 DEBUG ScriptExecutor -   import gcnvkernel

23:38:13.336 DEBUG ScriptExecutor - Result: 0
23:38:13.341 INFO  GermlineCNVCaller - Done initializing engine
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
23:38:14.588 INFO  GermlineCNVCaller - Intervals specified...
23:38:14.590 DEBUG GenomeLocParser - Prepared reference sequence contig dictionary
23:38:14.591 DEBUG GenomeLocParser -  1 (249250621 bp)
23:38:14.591 DEBUG GenomeLocParser -  2 (243199373 bp)
23:38:14.591 DEBUG GenomeLocParser -  3 (198022430 bp)
23:38:14.591 DEBUG GenomeLocParser -  4 (191154276 bp)
23:38:14.591 DEBUG GenomeLocParser -  5 (180915260 bp)
23:38:14.592 DEBUG GenomeLocParser -  6 (171115067 bp)
23:38:14.592 DEBUG GenomeLocParser -  7 (159138663 bp)
23:38:14.592 DEBUG GenomeLocParser -  8 (146364022 bp)
23:38:14.592 DEBUG GenomeLocParser -  9 (141213431 bp)
23:38:14.592 DEBUG GenomeLocParser -  10 (135534747 bp)
23:38:14.592 DEBUG GenomeLocParser -  11 (135006516 bp)
23:38:14.592 DEBUG GenomeLocParser -  12 (133851895 bp)
23:38:14.593 DEBUG GenomeLocParser -  13 (115169878 bp)
23:38:14.593 DEBUG GenomeLocParser -  14 (107349540 bp)
23:38:14.593 DEBUG GenomeLocParser -  15 (102531392 bp)
23:38:14.593 DEBUG GenomeLocParser -  16 (90354753 bp)
23:38:14.594 DEBUG GenomeLocParser -  17 (81195210 bp)
23:38:14.594 DEBUG GenomeLocParser -  18 (78077248 bp)
23:38:14.594 DEBUG GenomeLocParser -  19 (59128983 bp)
23:38:14.594 DEBUG GenomeLocParser -  20 (63025520 bp)
23:38:14.594 DEBUG GenomeLocParser -  21 (48129895 bp)
23:38:14.594 DEBUG GenomeLocParser -  22 (51304566 bp)
23:38:14.594 DEBUG GenomeLocParser -  X (155270560 bp)
23:38:14.595 DEBUG GenomeLocParser -  Y (59373566 bp)
23:38:14.595 DEBUG GenomeLocParser -  MT (16569 bp)
23:38:14.595 DEBUG GenomeLocParser -  GL000207.1 (4262 bp)
23:38:14.595 DEBUG GenomeLocParser -  GL000226.1 (15008 bp)
23:38:14.595 DEBUG GenomeLocParser -  GL000229.1 (19913 bp)
23:38:14.595 DEBUG GenomeLocParser -  GL000231.1 (27386 bp)
23:38:14.595 DEBUG GenomeLocParser -  GL000210.1 (27682 bp)
23:38:14.595 DEBUG GenomeLocParser -  GL000239.1 (33824 bp)
23:38:14.595 DEBUG GenomeLocParser -  GL000235.1 (34474 bp)
23:38:14.596 DEBUG GenomeLocParser -  GL000201.1 (36148 bp)
23:38:14.596 DEBUG GenomeLocParser -  GL000247.1 (36422 bp)
23:38:14.596 DEBUG GenomeLocParser -  GL000245.1 (36651 bp)
23:38:14.596 DEBUG GenomeLocParser -  GL000197.1 (37175 bp)
23:38:14.596 DEBUG GenomeLocParser -  GL000203.1 (37498 bp)
23:38:14.596 DEBUG GenomeLocParser -  GL000246.1 (38154 bp)
23:38:14.596 DEBUG GenomeLocParser -  GL000249.1 (38502 bp)
23:38:14.596 DEBUG GenomeLocParser -  GL000196.1 (38914 bp)
23:38:14.596 DEBUG GenomeLocParser -  GL000248.1 (39786 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000244.1 (39929 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000238.1 (39939 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000202.1 (40103 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000234.1 (40531 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000232.1 (40652 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000206.1 (41001 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000240.1 (41933 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000236.1 (41934 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000241.1 (42152 bp)
23:38:14.597 DEBUG GenomeLocParser -  GL000243.1 (43341 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000242.1 (43523 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000230.1 (43691 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000237.1 (45867 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000233.1 (45941 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000204.1 (81310 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000198.1 (90085 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000208.1 (92689 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000191.1 (106433 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000227.1 (128374 bp)
23:38:14.598 DEBUG GenomeLocParser -  GL000228.1 (129120 bp)
23:38:14.599 DEBUG GenomeLocParser -  GL000214.1 (137718 bp)
23:38:14.599 DEBUG GenomeLocParser -  GL000221.1 (155397 bp)
23:38:14.599 DEBUG GenomeLocParser -  GL000209.1 (159169 bp)
23:38:14.599 DEBUG GenomeLocParser -  GL000218.1 (161147 bp)
23:38:14.599 DEBUG GenomeLocParser -  GL000220.1 (161802 bp)
23:38:14.599 DEBUG GenomeLocParser -  GL000213.1 (164239 bp)
23:38:14.599 DEBUG GenomeLocParser -  GL000211.1 (166566 bp)
23:38:14.599 DEBUG GenomeLocParser -  GL000199.1 (169874 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000217.1 (172149 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000216.1 (172294 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000215.1 (172545 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000205.1 (174588 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000219.1 (179198 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000224.1 (179693 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000223.1 (180455 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000195.1 (182896 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000212.1 (186858 bp)
23:38:14.600 DEBUG GenomeLocParser -  GL000222.1 (186861 bp)
23:38:14.601 DEBUG GenomeLocParser -  GL000200.1 (187035 bp)
23:38:14.601 DEBUG GenomeLocParser -  GL000193.1 (189789 bp)
23:38:14.601 DEBUG GenomeLocParser -  GL000194.1 (191469 bp)
23:38:14.601 DEBUG GenomeLocParser -  GL000225.1 (211173 bp)
23:38:14.601 DEBUG GenomeLocParser -  GL000192.1 (547496 bp)
23:38:14.601 DEBUG GenomeLocParser -  NC_007605 (171823 bp)
23:38:14.601 DEBUG GenomeLocParser -  hs37d5 (35477943 bp)
23:38:15.218 INFO  IntervalArgumentCollection - Processing 13461 bp from intervals
23:38:15.222 INFO  GermlineCNVCaller - No annotated intervals were provided...
23:38:15.222 INFO  GermlineCNVCaller - No GC-content annotations for intervals found; explicit GC-bias correction will not be performed...
23:38:15.235 INFO  GermlineCNVCaller - Running the tool in the COHORT mode...
23:38:15.236 INFO  GermlineCNVCaller - Validating and aggregating data from input read-count files...
23:38:15.237 INFO  GermlineCNVCaller - Aggregating read-count file /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/371827342/P0000335.b37.counts.hdf5 (1 / 5)
23:38:15.653 INFO  GermlineCNVCaller - Aggregating read-count file /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/-1425124017/P0000480.b37.counts.hdf5 (2 / 5)
23:38:15.905 INFO  GermlineCNVCaller - Aggregating read-count file /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/1072891920/P0000481.b37.counts.hdf5 (3 / 5)
23:38:16.174 INFO  GermlineCNVCaller - Aggregating read-count file /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/-724059439/P0000992.b37.counts.hdf5 (4 / 5)
23:38:16.480 INFO  GermlineCNVCaller - Aggregating read-count file /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/1773956498/P0001010.b37.counts.hdf5 (5 / 5)
23:38:17.591 DEBUG ScriptExecutor - Executing:
23:38:17.592 DEBUG ScriptExecutor -   python
23:38:17.592 DEBUG ScriptExecutor -   /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/cohort_denoising_calling.1650827882847090378.py
23:38:17.592 DEBUG ScriptExecutor -   --ploidy_calls_path=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/execution/contig-ploidy-calls-dir
23:38:17.592 DEBUG ScriptExecutor -   --output_calls_path=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/execution/out/csi_batch1-4_wes_gcnv_pon-calls
23:38:17.592 DEBUG ScriptExecutor -   --output_tracking_path=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/execution/out/csi_batch1-4_wes_gcnv_pon-tracking
23:38:17.592 DEBUG ScriptExecutor -   --modeling_interval_list=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/intervals3567098336251103147.tsv
23:38:17.592 DEBUG ScriptExecutor -   --output_model_path=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/execution/out/csi_batch1-4_wes_gcnv_pon-model
23:38:17.592 DEBUG ScriptExecutor -   --enable_explicit_gc_bias_modeling=False
23:38:17.592 DEBUG ScriptExecutor -   --read_count_tsv_files
23:38:17.592 DEBUG ScriptExecutor -   /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-08634547492854629567.tsv
23:38:17.592 DEBUG ScriptExecutor -   /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-17754907257971163255.tsv
23:38:17.592 DEBUG ScriptExecutor -   /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-28683597966637268428.tsv
23:38:17.592 DEBUG ScriptExecutor -   /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-3663098894644979875.tsv
23:38:17.592 DEBUG ScriptExecutor -   /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-45174765072960207199.tsv
23:38:17.592 DEBUG ScriptExecutor -   --psi_s_scale=1.000000e-04
23:38:17.592 DEBUG ScriptExecutor -   --mapping_error_rate=1.000000e-02
23:38:17.592 DEBUG ScriptExecutor -   --depth_correction_tau=1.000000e+04
23:38:17.592 DEBUG ScriptExecutor -   --q_c_expectation_mode=hybrid
23:38:17.592 DEBUG ScriptExecutor -   --max_bias_factors=5
23:38:17.593 DEBUG ScriptExecutor -   --psi_t_scale=1.000000e-03
23:38:17.593 DEBUG ScriptExecutor -   --log_mean_bias_std=1.000000e-01
23:38:17.593 DEBUG ScriptExecutor -   --init_ard_rel_unexplained_variance=1.000000e-01
23:38:17.593 DEBUG ScriptExecutor -   --num_gc_bins=20
23:38:17.593 DEBUG ScriptExecutor -   --gc_curve_sd=1.000000e+00
23:38:17.593 DEBUG ScriptExecutor -   --active_class_padding_hybrid_mode=50000
23:38:17.593 DEBUG ScriptExecutor -   --enable_bias_factors=True
23:38:17.593 DEBUG ScriptExecutor -   --disable_bias_factors_in_active_class=False
23:38:17.593 DEBUG ScriptExecutor -   --p_alt=1.000000e-06
23:38:17.593 DEBUG ScriptExecutor -   --cnv_coherence_length=1.000000e+04
23:38:17.593 DEBUG ScriptExecutor -   --max_copy_number=5
23:38:17.593 DEBUG ScriptExecutor -   --p_active=0.010000
23:38:17.593 DEBUG ScriptExecutor -   --class_coherence_length=10000.000000
23:38:17.593 DEBUG ScriptExecutor -   --learning_rate=5.000000e-02
23:38:17.593 DEBUG ScriptExecutor -   --adamax_beta1=9.000000e-01
23:38:17.593 DEBUG ScriptExecutor -   --adamax_beta2=9.900000e-01
23:38:17.593 DEBUG ScriptExecutor -   --log_emission_samples_per_round=50
23:38:17.593 DEBUG ScriptExecutor -   --log_emission_sampling_rounds=10
23:38:17.594 DEBUG ScriptExecutor -   --log_emission_sampling_median_rel_error=5.000000e-03
23:38:17.594 DEBUG ScriptExecutor -   --max_advi_iter_first_epoch=5000
23:38:17.594 DEBUG ScriptExecutor -   --max_advi_iter_subsequent_epochs=100
23:38:17.594 DEBUG ScriptExecutor -   --min_training_epochs=10
23:38:17.594 DEBUG ScriptExecutor -   --max_training_epochs=100
23:38:17.594 DEBUG ScriptExecutor -   --initial_temperature=2.000000e+00
23:38:17.594 DEBUG ScriptExecutor -   --num_thermal_advi_iters=2500
23:38:17.594 DEBUG ScriptExecutor -   --convergence_snr_averaging_window=500
23:38:17.594 DEBUG ScriptExecutor -   --convergence_snr_trigger_threshold=1.000000e-01
23:38:17.594 DEBUG ScriptExecutor -   --convergence_snr_countdown_window=10
23:38:17.594 DEBUG ScriptExecutor -   --max_calling_iters=10
23:38:17.594 DEBUG ScriptExecutor -   --caller_update_convergence_threshold=1.000000e-03
23:38:17.594 DEBUG ScriptExecutor -   --caller_internal_admixing_rate=7.500000e-01
23:38:17.594 DEBUG ScriptExecutor -   --caller_external_admixing_rate=1.000000e+00
23:38:17.594 DEBUG ScriptExecutor -   --disable_caller=false
23:38:17.594 DEBUG ScriptExecutor -   --disable_sampler=false
23:38:17.594 DEBUG ScriptExecutor -   --disable_annealing=false
Traceback (most recent call last):
  File "/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/cohort_denoising_calling.1650827882847090378.py", line 143, in <module>
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/tasks/task_cohort_denoising_calling.py", line 140, in __init__
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/pymc3/model.py", line 197, in __call__
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/gcnvkernel/models/model_denoising_calling.py", line 754, in __init__
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 39, in __new__
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/pymc3/model.py", line 515, in Var
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/pymc3/model.py", line 869, in __init__
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/pymc3/distributions/continuous.py", line 250, in logp
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/tensor/var.py", line 155, in __mul__
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/op.py", line 670, in __call__
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/op.py", line 935, in make_thunk
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/op.py", line 839, in make_c_thunk
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cc.py", line 1190, in make_thunk
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cc.py", line 1131, in __compile__
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cc.py", line 1586, in cthunk_factory
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1118, in module_from_key
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1017, in _get_from_key
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cmodule.py", line 691, in _get_module
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cmodule.py", line 302, in dlimport
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 780, in get_code
  File "<frozen importlib._bootstrap_external>", line 832, in get_data
OSError: [Errno 23] Too many open files in system: '/gatk/local_mnt/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-stretch-sid-x86_64-3.6.2-64/tmp_yxu5we5/__init__.py'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1492, in _on_atexit
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cmodule.py", line 1295, in clear_old
  File "/opt/miniconda/envs/gatk/lib/python3.6/site-packages/theano/gof/cmodule.py", line 746, in refresh
OSError: [Errno 23] Too many open files in system: '/gatk/local_mnt/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-debian-stretch-sid-x86_64-3.6.2-64'
23:39:07.445 DEBUG ScriptExecutor - Result: 1
23:39:07.447 INFO  GermlineCNVCaller - Shutting down engine
[February 22, 2019 11:39:07 PM UTC] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 2.14 minutes.
Runtime.totalMemory()=2305818624
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException: 
python exited with 1
Command Line: python /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/cohort_denoising_calling.1650827882847090378.py --ploidy_calls_path=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/execution/contig-ploidy-calls-dir --output_calls_path=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/execution/out/csi_batch1-4_wes_gcnv_pon-calls --output_tracking_path=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/execution/out/csi_batch1-4_wes_gcnv_pon-tracking --modeling_interval_list=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/intervals3567098336251103147.tsv --output_model_path=/gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/execution/out/csi_batch1-4_wes_gcnv_pon-model --enable_explicit_gc_bias_modeling=False --read_count_tsv_files /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-08634547492854629567.tsv /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-17754907257971163255.tsv /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-28683597966637268428.tsv /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-3663098894644979875.tsv /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/tmp.cd408023/sample-45174765072960207199.tsv --psi_s_scale=1.000000e-04 --mapping_error_rate=1.000000e-02 --depth_correction_tau=1.000000e+04 --q_c_expectation_mode=hybrid --max_bias_factors=5 --psi_t_scale=1.000000e-03 --log_mean_bias_std=1.000000e-01 --init_ard_rel_unexplained_variance=1.000000e-01 --num_gc_bins=20 --gc_curve_sd=1.000000e+00 --active_class_padding_hybrid_mode=50000 --enable_bias_factors=True --disable_bias_factors_in_active_class=False --p_alt=1.000000e-06 --cnv_coherence_length=1.000000e+04 --max_copy_number=5 --p_active=0.010000 --class_coherence_length=10000.000000 --learning_rate=5.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.900000e-01 --log_emission_samples_per_round=50 --log_emission_sampling_rounds=10 --log_emission_sampling_median_rel_error=5.000000e-03 --max_advi_iter_first_epoch=5000 --max_advi_iter_subsequent_epochs=100 --min_training_epochs=10 --max_training_epochs=100 --initial_temperature=2.000000e+00 --num_thermal_advi_iters=2500 --convergence_snr_averaging_window=500 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=10 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=1.000000e+00 --disable_caller=false --disable_sampler=false --disable_annealing=false
    at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
    at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
    at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
    at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
    at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
    at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.executeGermlineCNVCallerPythonScript(GermlineCNVCaller.java:441)
    at org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller.doWork(GermlineCNVCaller.java:288)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar defined in environment variable GATK_LOCAL_JAR
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx10G -jar /gatk/gatk-package-4.1.0.0-local.jar GermlineCNVCaller --run-mode COHORT -L /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/-25145621/SureSelect_Human_All_Exon_V6_UTR.1based.preprocessed.filtered.scattered.8543.interval_list --input /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/371827342/P0000335.b37.counts.hdf5 --input /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/-1425124017/P0000480.b37.counts.hdf5 --input /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/1072891920/P0000481.b37.counts.hdf5 --input /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/-724059439/P0000992.b37.counts.hdf5 --input /gatk/local_mnt/cromwell-executions/CNVGermlineCohortWorkflow/098a389e-b298-4324-8a8c-9f46f05708b5/call-GermlineCNVCallerCohortMode/shard-12910/inputs/1773956498/P0001010.b37.counts.hdf5 --contig-ploidy-calls contig-ploidy-calls-dir --interval-merging-rule OVERLAPPING_ONLY --output out --output-prefix csi_batch1-4_wes_gcnv_pon --verbosity DEBUG --p-alt 1e-6 --p-active 1e-2 --cnv-coherence-length 10000.0 --class-coherence-length 10000.0 --max-copy-number 5 --max-bias-factors 5 --mapping-error-rate 0.01 --interval-psi-scale 0.001 --sample-psi-scale 0.0001 --depth-correction-tau 10000.0 --log-mean-bias-standard-deviation 0.1 --init-ard-rel-unexplained-variance 0.1 --num-gc-bins 20 --gc-curve-standard-deviation 1.0 --copy-number-posterior-expectation-mode HYBRID --enable-bias-factors true --active-class-padding-hybrid-mode 50000 --learning-rate 0.05 --adamax-beta-1 0.9 --adamax-beta-2 0.99 --log-emission-samples-per-round 50 --log-emission-sampling-median-rel-error 0.005 --log-emission-sampling-rounds 10 --max-advi-iter-first-epoch 5000 --max-advi-iter-subsequent-epochs 100 --min-training-epochs 10 --max-training-epochs 100 --initial-temperature 2.0 --num-thermal-advi-iters 2500 --convergence-snr-averaging-window 500 --convergence-snr-trigger-threshold 0.1 --convergence-snr-countdown-window 10 --max-calling-iters 10 --caller-update-convergence-threshold 0.001 --caller-internal-admixing-rate 0.75 --caller-external-admixing-rate 1.00 --disable-annealing false

[2019-02-22 23:49:20,42] [info] WorkflowManagerActor WorkflowActor-098a389e-b298-4324-8a8c-9f46f05708b5 is in a terminal state: WorkflowFailedState
[2019-02-22 23:50:01,65] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
[2019-02-22 23:50:02,38] [info] Workflow polling stopped
[2019-02-22 23:50:02,48] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2019-02-22 23:50:02,49] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2019-02-22 23:50:02,53] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2019-02-22 23:50:02,53] [info] Aborting all running workflows.
[2019-02-22 23:50:02,53] [info] JobExecutionTokenDispenser stopped
[2019-02-22 23:50:02,53] [info] WorkflowStoreActor stopped
[2019-02-22 23:50:02,61] [info] WorkflowLogCopyRouter stopped
[2019-02-22 23:50:02,61] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2019-02-22 23:50:02,61] [info] WorkflowManagerActor All workflows finished
[2019-02-22 23:50:02,61] [info] WorkflowManagerActor stopped
[2019-02-22 23:50:02,61] [info] Connection pools shut down
[2019-02-22 23:50:02,61] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2019-02-22 23:50:02,61] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2019-02-22 23:50:02,61] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2019-02-22 23:50:02,61] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2019-02-22 23:50:02,61] [info] SubWorkflowStoreActor stopped
[2019-02-22 23:50:02,61] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2019-02-22 23:50:02,61] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2019-02-22 23:50:02,61] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2019-02-22 23:50:02,61] [info] JobStoreActor stopped
[2019-02-22 23:50:02,61] [info] CallCacheWriteActor stopped
[2019-02-22 23:50:02,61] [info] KvWriteActor Shutting down: 0 queued messages to process
[2019-02-22 23:50:02,61] [info] WriteMetadataActor Shutting down: 0 queued messages to process
[2019-02-22 23:50:02,62] [info] DockerHashActor stopped
[2019-02-22 23:50:02,62] [info] IoProxy stopped
[2019-02-22 23:50:02,62] [info] ServiceRegistryActor stopped
[2019-02-22 23:50:02,65] [info] Database closed
[2019-02-22 23:50:02,65] [info] Stream materializer shut down
Workflow 098a389e-b298-4324-8a8c-9f46f05708b5 transitioned to state Failed
[2019-02-22 23:50:02,75] [info] Automatic shutdown of the async connection
[2019-02-22 23:50:02,75] [info] Gracefully shutdown sentry threads.
[2019-02-22 23:50:02,75] [info] Shutdown finished.```
@samuelklee
Copy link
Contributor

Hi @drifty914, thanks for bringing this to our attention. Can you clarify a few things? It looks like you are trying to run the cnv_germline_cohort_workflow.wdl (not the case-scattered WDL) to generate a panel using 5 samples, correct? How many bins are in your interval list, and how many bins are you running per sharded GermlineCNVCaller task? I'm surprised that you are running into these sorts of issues with so few samples.

Tagging @mwalker174 and @vruano. I think @vruano ran into a tangentially related issue (the GATK command line was unable to handle a large number of model files when the number of shards got too big) and has a branch with some changes to the WDL to address it, but I think he encountered this in case mode.

@drifty914
Copy link
Author

drifty914 commented Feb 25, 2019

Yes the cnv_germline_cohort_workflow.wdl sorry for the typo! I first encountered this using 39 samples and num_intervals_per_scatter set to 5000 (using the standalone jar). The output above using just 5 WES samples was trying the docker image and had num_interval_per_scatter set to 20 (as that's what I saw here). The interval_list contains 295,830 targets/bins.

If there's a different interval value or number of samples you'd like me to test out I'd be happy to do so.

@samuelklee
Copy link
Contributor

samuelklee commented Feb 25, 2019

OK, thanks @drifty914. Note that the file with num_intervals_per_scatter = 20 is a minimal test case that is run with our continuous integration tests. In real-world use, you want enough intervals in each shard to fit a denoising model---probably 5000 or more is safe.

I am wondering if your issue is related to #4782 and https://askubuntu.com/questions/162229/how-do-i-increase-the-open-files-limit-for-a-non-root-user. It may be that your user ulimit is not high enough for the theano compilation directory?

Let me try to put together a fix for that issue and see if it addresses yours as well.

@cmnbroad
Copy link
Collaborator

@jamesemery Any chance you could make this change while you're in there making the other changes to the docker ?

@cmnbroad
Copy link
Collaborator

Oh, I tagged the wrong ticket. I meant to tag James in #4782, which we should probably do whether or not its related to this.

@samuelklee
Copy link
Contributor

@drifty914 Do you encounter the ulimit message when you run a single shard (covering say 5000 intervals) of GermlineCNVCaller?

@drifty914
Copy link
Author

@samuelklee I was able to get it to complete a single shard after setting it to 5000 using the local jar (wasn't able to try out Docker on my current system).

@samuelklee
Copy link
Contributor

samuelklee commented Mar 1, 2019

Thanks @drifty914, it sounds like you may need to limit the number of concurrent jobs that Cromwell is allowed to scatter. We typically run gCNV in the cloud and scatter across multiple VMs, so we haven't encountered this issue before.

At the same time, you could also try to reduce the total number of shards (by increasing num_intervals_per_scatter), which should be fine if each shard has enough memory. We typically scatter 200 samples x 5000 intervals, which fits comfortably in VMs with 30GB of memory. We haven't gotten a chance to profile how much of this memory is being used in detail, so you might be able to get by with much less.

I don't think this is a matter of a memory leak or files being left open by the tool, as it looks like your job fails during the theano compilation step. I'll try to get an idea of how many files theano opens for each compilation, but I don't think this is something we have much control over. We have thought about whether it might be possible to reuse the same compiled theano model for identically sized shards, but haven't gotten a chance to investigate this yet either.

@drifty914
Copy link
Author

Thanks @samuelklee, I was able to get the pipeline to finish last night using the 5000 interval setting as long as I used a single thread to handle the larger memory footprint. I might be able to increase this slightly after some tuning tests. Likely the threads/memory issue is why my earlier attempts to use 5000 intervals with many more samples failed.

@samuelklee
Copy link
Contributor

OK, great to hear! I'll close this issue for now, but thanks for bringing it to our attention. We might try to release some documentation on how memory requirements, runtime, etc. scale with the size of the coverage matrix in each shard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants