gCNV Open file limits exceeded at GermlineCNVCallerCohortMode step #5714
Hi @drifty914, thanks for bringing this to our attention. Can you clarify a few things? It looks like you are trying to run the cnv_germline_cohort_workflow.wdl (not the case-scattered WDL) to generate a panel using 5 samples, correct? How many bins are in your interval list, and how many bins are you running per sharded GermlineCNVCaller task? I'm surprised that you are running into these sorts of issues with so few samples. Tagging @mwalker174 and @vruano. I think @vruano ran into a tangentially related issue (the GATK command line was unable to handle a large number of model files when the number of shards got too big) and has a branch with some changes to the WDL to address it, but I think he encountered this in case mode.
Yes the If there's a different interval value or number of samples you'd like me to test out, I'd be happy to do so.
OK, thanks @drifty914. Note that the file with num_intervals_per_scatter = 20 is a minimal test case that is run with our continuous integration tests. In real-world use, you want enough intervals in each shard to fit a denoising model; probably 5000 or more is safe. I am wondering if your issue is related to #4782 and https://askubuntu.com/questions/162229/how-do-i-increase-the-open-files-limit-for-a-non-root-user. It may be that your user ulimit is not high enough for the theano compilation directory? Let me try to put together a fix for that issue and see if it addresses yours as well.
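To make the ulimit question above concrete, here is a minimal sketch (not part of GATK or the WDL) that reads the per-process open-file limit with Python's standard `resource` module and, where permitted, raises the soft limit to the hard limit. The helper names are hypothetical, and a limit raised this way only applies to the calling process and its children, so it would not affect a Cromwell server that is already running.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file-descriptor limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit_to_hard():
    """Try to raise the soft open-file limit to the hard limit.

    Returns the (soft, hard) limits in effect afterwards. If the OS refuses
    the change, the limits are left as they were.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some platforms reject certain values; keep the current limits.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", open_file_limits())
    print("after: ", raise_soft_limit_to_hard())
```

On Unix-like systems this reports the same limits as `ulimit -Sn` and `ulimit -Hn` in the shell that launched the script.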
@jamesemery Any chance you could make this change while you're in there making the other changes to the Docker image?
Oh, I tagged the wrong ticket. I meant to tag James in #4782, which we should probably do whether or not it's related to this.
@drifty914 Do you encounter the ulimit message when you run a single shard (covering, say, 5000 intervals) of GermlineCNVCaller?
@samuelklee I was able to get it to complete a single shard after setting it to 5000 using the local jar (wasn't able to try out Docker on my current system).
Thanks @drifty914, it sounds like you may need to limit the number of concurrent jobs that Cromwell is allowed to scatter. We typically run gCNV in the cloud and scatter across multiple VMs, so we haven't encountered this issue before. At the same time, you could also try to reduce the total number of shards (by increasing num_intervals_per_scatter), which should be fine if each shard has enough memory. We typically scatter 200 samples x 5000 intervals, which fits comfortably in VMs with 30GB of memory. We haven't gotten a chance to profile how much of this memory is being used in detail, so you might be able to get by with much less. I don't think this is a matter of a memory leak or files being left open by the tool, as it looks like your job fails during the theano compilation step. I'll try to get an idea of how many files theano opens for each compilation, but I don't think this is something we have much control over. We have thought about whether it might be possible to reuse the same compiled theano model for identically sized shards, but haven't gotten a chance to investigate this yet either.
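As a rough illustration of the trade-off described above, the sketch below estimates how many shards a given num_intervals_per_scatter produces and how many shards fit in memory at once on a single machine. The function names and the 200,000-interval total are hypothetical; the 200 GB and 30 GB figures are the ones quoted in this thread, and the 30 GB per shard is an upper-end assumption rather than a measured requirement.

```python
def num_shards(num_intervals, num_intervals_per_scatter):
    """Number of scatter shards needed to cover all intervals (ceiling division)."""
    if num_intervals <= 0 or num_intervals_per_scatter <= 0:
        raise ValueError("both arguments must be positive")
    return -(-num_intervals // num_intervals_per_scatter)


def max_concurrent_shards(total_memory_gb, memory_per_shard_gb):
    """How many shards fit in memory at once, assuming a fixed cost per shard."""
    if total_memory_gb <= 0 or memory_per_shard_gb <= 0:
        raise ValueError("both arguments must be positive")
    return total_memory_gb // memory_per_shard_gb


if __name__ == "__main__":
    total_intervals = 200_000  # hypothetical interval-list size
    for per_scatter in (20, 5000):
        print(per_scatter, "intervals/shard ->",
              num_shards(total_intervals, per_scatter), "shards")
    # 200 GB node, assuming up to 30 GB per shard as quoted in the thread.
    print("concurrent shards:", max_concurrent_shards(200, 30))
```

With these hypothetical numbers, going from 20 to 5000 intervals per shard cuts the shard count from 10000 to 40, and the memory estimate suggests capping concurrency at about 6 jobs on a 200 GB node.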
Thanks @samuelklee, I was able to get the pipeline to finish last night using the 5000 interval setting as long as I used a single thread to handle the larger memory footprint. I might be able to increase this slightly after some tuning tests. Likely the threads/memory issue is why my earlier attempts to use 5000 intervals with many more samples failed.
OK, great to hear! I'll close this issue for now, but thanks for bringing it to our attention. We might try to release some documentation on how memory requirements, runtime, etc. scale with the size of the coverage matrix in each shard.
@samuelklee I'm running into an error for the `cnv_germline_case_scattered_workflow` WDL pipeline to create a Panel of Normals. It seems that during the GermlineCNVCallerCohortMode step, the pipeline opens tens of thousands of files that it doesn't close, causing the system to crash. This happens for me both with the Docker image and with the standalone GATK 4.1.0.0 jar. It reminds me of this issue mentioned on the forums from GATK 3.8, but the error still occurs even if I limit that step to a single thread. I'm running on a Red Hat HPC with 16 threads and 200GB of RAM available and using Cromwell v34. After checking with the manager of my cluster, it seems the error occurred when over 60K files were open simultaneously, so this looks to me more like a memory leak than a ulimit issue.
Here's the output from a typical error file: