freebayes: Dynamic memory allocation #881
Conversation
Most freebayes jobs have inputs smaller than 10GB and use less than 15GB of memory, rather than the 90GB that are currently being requested. Since UseGalaxy.eu resubmits jobs up to two times when they run out of memory, tripling the memory request each time, this change should cover about 99% of jobs within two resubmissions (see the graph in PR usegalaxy-eu#881).
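For concreteness, a minimal sketch of what such a rule could look like in a TPV-style tools.yml entry is shown below. This is not the actual diff of this PR: the tool ID pattern is illustrative, and it assumes that input_size is exposed in GB to the expression, as in the TPV documentation examples.

```yaml
tools:
  # Illustrative tool ID pattern; the real entry must match the installed freebayes tool.
  toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes/.*:
    # Request 9 GB plus 1 GB per GB of input instead of a flat 90 GB.
    # Assumes input_size is available in GB inside TPV expressions.
    mem: 9 + input_size * 1
```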
It would be nice if you could include a comment and link to those charts here. Maybe also explain what the …
I recall that freebayes had problems with high-coverage reads. @wm75 might remember. I don't think we can detect the coverage of the input files dynamically. I guess for those very extreme cases users should try …
What do you mean by "comment and link to those charts"? A recipe to generate them? A more in-depth explanation accompanying them? The data used to generate them?
Many times it feels quite hopeless to maintain the TPV database (and ours). To make proper use of the cluster we need functions that provide a good upper bound on the memory usage of each tool in terms of its parameters and inputs; a rough guess makes us run into problems like this one. If the memory usage of the tools is predictable but we just do not know how to express it, I think machine learning could be used to generate the rules (from the memory usage measurements in Galaxy, of which we have far more than we need), saving headaches in the long run. If, however, the tool is simply unpredictable because it has stochastic elements, then we're out of luck. A solution like the one I submitted here (and much of the information we submit to the TPV database and tools.yml) feels like just a patch. That said, I feel that for this tool in particular we'll have little success with ML anyway, because evaluating tool-specific details such as the coverage does not scale.
Set cores to 10 explicitly in tools.yml.
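A hypothetical sketch of that pin is shown below; the exact schema of UseGalaxy.eu's tools.yml is assumed here and may differ in practice.

```yaml
# Hypothetical tools.yml fragment; the actual schema used by UseGalaxy.eu may differ.
freebayes:
  # Pin the core count explicitly instead of relying on the default.
  cores: 10
```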
I tend to think that whatever we do here is better than nothing. And the reality is that all users of HPC systems invent those numbers on their own, with no sharing and no data-driven decisions ... so whatever we do is an improvement, even if it is not perfect.
That is true, thanks for reminding me :) I think I am just a silver-bullet junkie.
Ok, let's see what happens.
It seems to work as expected.
I have noticed that we have accumulated almost 200 freebayes jobs in the Condor queue (50% to 75% of the queue). I have looked a bit into the memory requirements of this tool, taking as a sample the successful job runs since mid-May.
Most jobs have inputs smaller than 10GB and use less than 15GB of memory, rather than the 90GB that are currently being requested.
I thus suggest the following simple solution: request
9 + input_size * 1
GB of memory. Since UseGalaxy.eu resubmits jobs up to two times when they run out of memory, tripling the memory request each time, this should cover about 99% of jobs within two resubmissions (see the graph below). Hopefully we can have the jobs queued faster with this change. I have tried to figure out why a sizeable number of jobs with small input sizes still require huge amounts of memory, but have been unable to find a quick answer.
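To make the resubmission behaviour concrete, with purely illustrative numbers: a job with a 10GB input would first be allocated 9 + 10 = 19GB; if it runs out of memory, the first resubmission triples that to 57GB and the second to 171GB, well above the flat 90GB requested today.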