freebayes: Dynamic memory allocation #881

kysrpex · 2023-08-31T15:17:50Z

I have noticed that we have accumulated almost 200 freebayes jobs in the Condor queue (50% to 75% of the queue). I have looked a bit into the memory requirements of this tool by taking as sample the successful job runs since mid-May.

Most jobs have inputs smaller than 10GB and use less than 15GB of memory, rather than the 90GB that are currently being requested.

I suggest thus the following simple solution: request 9 + input_size * 1 GB of memory. Since UseGalaxy.eu resubmits jobs up to two times when they run out of memory, tripling the memory request each time, this should cover about 99% of jobs within two resubmissions (see the graph below).

Hopefully we can have the jobs queued faster with this change. I have tried to figure out why still a sizeable amount of jobs with small input sizes require huge amounts of memory, but have been unable to find a quick answer.

Most freebayes jobs have inputs smaller than 10GB and use less than 15GB of memory, rather than the 90GB that are currently being requested. Since UseGalaxy.eu resubmits jobs up to two times when they run out of memory, tripling the memory request each time, this change should cover about 99% of jobs within two resubmissions (see the graph in PR usegalaxy-eu#881).

files/galaxy/tpv/tools.yml

bgruening · 2023-09-01T04:34:01Z

It would be nice if you could include a comment and link to those charts here. Maybe also explain what the * 1 is doing.

bgruening · 2023-09-01T04:35:29Z

I recall that freebayes had problems with high-coverage reads. @wm75 might remember. I don't think we can detect the coverage of the input files dynamically. I guess for those very extreme cases users should try (--limit-coverage) and see if that works.

kysrpex · 2023-09-01T06:50:17Z

It would be nice if you could include a comment and link to those charts here. Maybe also explain what the * 1 is doing.

What do you mean with "comment and link to those charts"? A recipe to generate them? A more in-depth explanation accompanying them? The data used to generate them?

* 1 is the identity element of multiplication and, as expected, it does nothing at all.

kysrpex · 2023-09-01T07:02:46Z

I recall that freebayes had problems with high-coverage reads. @wm75 might remember. I don't think we can detect the coverage of the input files dynamically. I guess for those very extreme cases users should try (--limit-coverage) and see if that works.

Many times it feels quite hopeless to maintain the TPV database (and ours). To make proper use of the cluster we need functions that provide a good upper bound on the memory usage for each tool in terms of its parameters and inputs. A rough guess makes us run into problems like this one.

If the memory usage of the tools is predictable but we just do not know how to do it, I think machine learning can be used to generate the rules (taking the memory usage measurements from Galaxy, we have way more than we need), saving headaches in the long run. If however, the tool is simply unpredictable because it has stochastic elements, then we're out of luck.

A solution like this one I submitted (and much of the information we submit to the TPV database and tools.yml) feels just like a patch. Although I feel that for this tool in particular we'll have little success with ML anyways (because evaluating specific details of the tool like the coverage does not scale).

Set cores to 10 explicitly in tools.yml.

bgruening · 2023-09-01T13:03:25Z

Many times it feels quite hopeless to maintain the TPV database (and ours).

I tend to think that whatever we do here is better than nothing. And the reality is that all users of HPC systems do invent those numbers for their own, not sharing, no data-driven decisions ... so whatever we do is an improvement. Even if not perfect.

kysrpex · 2023-09-01T13:28:28Z

Many times it feels quite hopeless to maintain the TPV database (and ours).

I tend to think that whatever we do here is better than nothing. And the reality is that all users of HPC systems do invent those numbers for their own, not sharing, no data-driven decisions ... so whatever we do is an improvement. Even if not perfect.

That is true, thanks for reminding it :) I think I am just a junkie of silver bullets.

kysrpex · 2023-09-01T13:29:36Z

Ok let's see what happens.

kysrpex · 2023-09-04T07:33:34Z

It seems to work as expected.

galaxy=> select count(*) from job where tool_id ilike 'toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes%' and update_time >= '2023-09-01 13:55:38.000000'  and state='error';
 count 
-------
     4
(1 row)

galaxy=> select count(*) from job where tool_id ilike 'toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes%' and update_time >= '2023-09-01 13:55:38.000000'  and state='ok';
 count 
-------
   277
(1 row)

galaxy=> select count(*) from job where tool_id ilike 'toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes%' and update_time >= '2023-09-01 13:55:38.000000'  and state='queued';
 count 
-------
     0
(1 row)

galaxy=> select count(*) from job where tool_id ilike 'toolshed.g2.bx.psu.edu/repos/devteam/freebayes/freebayes%' and update_time >= '2023-09-01 13:55:38.000000'  and state='new';
 count 
-------
     0
(1 row)

kysrpex added the enhancement label Aug 31, 2023

kysrpex requested review from bgruening and sanjaysrikakulam August 31, 2023 15:17

kysrpex self-assigned this Aug 31, 2023

kysrpex force-pushed the freebayes branch from cae039b to e9ca615 Compare August 31, 2023 15:19

kysrpex marked this pull request as ready for review August 31, 2023 15:20

bgruening reviewed Sep 1, 2023

View reviewed changes

files/galaxy/tpv/tools.yml Show resolved Hide resolved

freebayes: set cores to 10

8fda5f6

Set cores to 10 explicitly in tools.yml.

add comment

7cdcfd2

bgruening approved these changes Sep 1, 2023

View reviewed changes

kysrpex merged commit b45f9e0 into usegalaxy-eu:master Sep 1, 2023

kysrpex deleted the freebayes branch September 1, 2023 13:29

kysrpex mentioned this pull request Sep 7, 2023

usegalaxy-eu.htcondor_release: first job resubmission does not increase request_memory #896

Closed

paulzierep mentioned this pull request Jan 15, 2024

increase memory of quast #1073

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

freebayes: Dynamic memory allocation #881

freebayes: Dynamic memory allocation #881

kysrpex commented Aug 31, 2023 •

edited

Loading

bgruening commented Sep 1, 2023

bgruening commented Sep 1, 2023

kysrpex commented Sep 1, 2023 •

edited

Loading

kysrpex commented Sep 1, 2023 •

edited

Loading

bgruening commented Sep 1, 2023

kysrpex commented Sep 1, 2023 •

edited

Loading

kysrpex commented Sep 1, 2023

kysrpex commented Sep 4, 2023

freebayes: Dynamic memory allocation #881

freebayes: Dynamic memory allocation #881

Conversation

kysrpex commented Aug 31, 2023 • edited Loading

bgruening commented Sep 1, 2023

bgruening commented Sep 1, 2023

kysrpex commented Sep 1, 2023 • edited Loading

kysrpex commented Sep 1, 2023 • edited Loading

bgruening commented Sep 1, 2023

kysrpex commented Sep 1, 2023 • edited Loading

kysrpex commented Sep 1, 2023

kysrpex commented Sep 4, 2023

kysrpex commented Aug 31, 2023 •

edited

Loading

kysrpex commented Sep 1, 2023 •

edited

Loading

kysrpex commented Sep 1, 2023 •

edited

Loading

kysrpex commented Sep 1, 2023 •

edited

Loading