Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

freebayes: Dynamic memory allocation #881

Merged
merged 3 commits into from
Sep 1, 2023
Merged

Conversation

kysrpex
Copy link
Contributor

@kysrpex kysrpex commented Aug 31, 2023

I have noticed that we have accumulated almost 200 freebayes jobs in the Condor queue (50% to 75% of the queue). I have looked a bit into the memory requirements of this tool by taking as sample the successful job runs since mid-May.

Most jobs have inputs smaller than 10GB and use less than 15GB of memory, rather than the 90GB that are currently being requested.

grafik

grafik

I suggest thus the following simple solution: request 9 + input_size * 1 GB of memory. Since UseGalaxy.eu resubmits jobs up to two times when they run out of memory, tripling the memory request each time, this should cover about 99% of jobs within two resubmissions (see the graph below).

grafik

Hopefully we can have the jobs queued faster with this change. I have tried to figure out why still a sizeable amount of jobs with small input sizes require huge amounts of memory, but have been unable to find a quick answer.

Most freebayes jobs have inputs smaller than 10GB and use less than 15GB of memory, rather than the 90GB that are currently being requested.

Since UseGalaxy.eu resubmits jobs up to two times when they run out of memory, tripling the memory request each time, this change should cover about 99% of jobs within two resubmissions (see the graph in PR usegalaxy-eu#881).
@kysrpex kysrpex marked this pull request as ready for review August 31, 2023 15:20
@bgruening
Copy link
Member

It would be nice if you could include a comment and link to those charts here. Maybe also explain what the * 1 is doing.

@bgruening
Copy link
Member

I recall that freebayes had problems with high-coverage reads. @wm75 might remember. I don't think we can detect the coverage of the input files dynamically. I guess for those very extreme cases users should try (--limit-coverage) and see if that works.

@kysrpex
Copy link
Contributor Author

kysrpex commented Sep 1, 2023

It would be nice if you could include a comment and link to those charts here. Maybe also explain what the * 1 is doing.

What do you mean with "comment and link to those charts"? A recipe to generate them? A more in-depth explanation accompanying them? The data used to generate them?

* 1 is the identity element of multiplication and, as expected, it does nothing at all.

@kysrpex
Copy link
Contributor Author

kysrpex commented Sep 1, 2023

I recall that freebayes had problems with high-coverage reads. @wm75 might remember. I don't think we can detect the coverage of the input files dynamically. I guess for those very extreme cases users should try (--limit-coverage) and see if that works.

Many times it feels quite hopeless to maintain the TPV database (and ours). To make proper use of the cluster we need functions that provide a good upper bound on the memory usage for each tool in terms of its parameters and inputs. A rough guess makes us run into problems like this one.

If the memory usage of the tools is predictable but we just do not know how to do it, I think machine learning can be used to generate the rules (taking the memory usage measurements from Galaxy, we have way more than we need), saving headaches in the long run. If however, the tool is simply unpredictable because it has stochastic elements, then we're out of luck.

A solution like this one I submitted (and much of the information we submit to the TPV database and tools.yml) feels just like a patch. Although I feel that for this tool in particular we'll have little success with ML anyways (because evaluating specific details of the tool like the coverage does not scale).

Set cores to 10 explicitly in tools.yml.
@bgruening
Copy link
Member

Many times it feels quite hopeless to maintain the TPV database (and ours).

I tend to think that whatever we do here is better than nothing. And the reality is that all users of HPC systems do invent those numbers for their own, not sharing, no data-driven decisions ... so whatever we do is an improvement. Even if not perfect.

@kysrpex
Copy link
Contributor Author

kysrpex commented Sep 1, 2023

Many times it feels quite hopeless to maintain the TPV database (and ours).

I tend to think that whatever we do here is better than nothing. And the reality is that all users of HPC systems do invent those numbers for their own, not sharing, no data-driven decisions ... so whatever we do is an improvement. Even if not perfect.

That is true, thanks for reminding it :) I think I am just a junkie of silver bullets.

@kysrpex
Copy link
Contributor Author

kysrpex commented Sep 1, 2023

Ok let's see what happens.

@kysrpex kysrpex merged commit b45f9e0 into usegalaxy-eu:master Sep 1, 2023
@kysrpex kysrpex deleted the freebayes branch September 1, 2023 13:29
@kysrpex
Copy link
Contributor Author

kysrpex commented Sep 4, 2023

It seems to work as expected.

galaxy=> select count(*) from job where tool_id ilike 'toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes%' and update_time >= '2023-09-01 13:55:38.000000'  and state='error';
 count 
-------
     4
(1 row)

galaxy=> select count(*) from job where tool_id ilike 'toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes%' and update_time >= '2023-09-01 13:55:38.000000'  and state='ok';
 count 
-------
   277
(1 row)

galaxy=> select count(*) from job where tool_id ilike 'toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes%' and update_time >= '2023-09-01 13:55:38.000000'  and state='queued';
 count 
-------
     0
(1 row)

galaxy=> select count(*) from job where tool_id ilike 'toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes%' and update_time >= '2023-09-01 13:55:38.000000'  and state='new';
 count 
-------
     0
(1 row)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants