Increase memory limits for build projects (autoscale workers) #4403
Labels: Improvement (Minor improvement to code), Needed: design decision (A core team decision is required), Needed: documentation (Documentation is required)
We have several projects lately that need more memory than the default value (1g). However, since memory is a resource that can cause different problems than CPU time, we have been very careful about increasing this limit for some projects.

At this time we only have 6 projects with a limit different from the default (1500m), and only 2 of them with the maximum limit we have used (2g).
When a project needs more memory resources, I usually suggest that the owner:

- reduce the `formats` being built

These two points can be found in our docs: https://docs.readthedocs.io/en/latest/guides/build-using-too-many-resources.html
This issue, in particular, is to collect projects that are running out of memory when building, increase their limits, and track the results. It is also meant to discuss a long-term solution where increasing memory limits doesn't affect the builder servers.
Projects that are currently hitting the memory limit, whose limits I will start by increasing:

- 2g
- 2g
We also need to discuss what steps the core team should follow to increase these limits in a safe way (without creating other issues in the builders), and to propose a solution around it.
Ideas for a solution
Use Celery autoscale
We have talked about using Celery's autoscaling option (http://docs.celeryproject.org/en/latest/userguide/workers.html#autoscaling), but instead of letting Celery decide when and how to increase/decrease the number of workers, we may want to define our own `Autoscaler` and set it via the `worker_autoscaler` setting (http://docs.celeryproject.org/en/latest/userguide/configuration.html#std:setting-worker_autoscaler).

Example of autoscaling based on CPU and memory: https://gist.github.com/speedplane/224eb551c51a74068011f4d776237513
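As a starting point, here is a minimal sketch of what such a memory-aware autoscaler could look like, in the spirit of the gist above. It assumes Celery 4.x internals (`celery.worker.autoscale.Autoscaler`) and `psutil` on the builders; the class name and the 2g headroom value are illustrative, not existing Read the Docs code.

```python
# Minimal sketch of a memory-aware autoscaler (illustrative, not RTD code).
# Assumes Celery 4.x internals and psutil being available on the builders.
import psutil
from celery.worker.autoscale import Autoscaler


class MemoryAwareAutoscaler(Autoscaler):
    """Refuse to scale up when the host doesn't have enough free memory."""

    # Assumed headroom: the memory we expect a single build container to need (2g).
    reserved_per_worker = 2 * 1024 ** 3

    def _maybe_scale(self, req=None):
        if self.qty > self.processes:  # Celery wants to add worker processes
            available = psutil.virtual_memory().available
            if available < self.reserved_per_worker:
                # Not enough memory for another heavy build: don't scale up.
                return False
        return super()._maybe_scale(req=req)
```

It would then be enabled with something like `worker_autoscaler = 'readthedocs.worker.autoscale:MemoryAwareAutoscaler'` in the Celery settings (the module path is hypothetical).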
Scale workers manually
Another idea we had in mind was to scale the workers depending on values that we already know: `container_time_limit` and `container_mem_limit`. So, before the `trigger_build` function is called, we can decrease the workers if the task is going to consume too much memory.

Increasing the workers at that point is not possible because we don't have information about the kind of tasks the current builder is running. If we save the `task_id` into the `Build` object, we could ask for all the tasks the builder is running, map them to their build objects, and know which projects are being built and how many resources they need.

Another possibility, instead of saving the `task_id` in the `Build` object, could be to create a Celery `chain` that first decreases the workers to 1, then executes the build, and finally increases the workers back to the default value (a rough sketch follows).
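The chain idea could look roughly like this; the task names, pool sizes, and the `trigger_heavy_build` helper are assumptions for illustration, not existing Read the Docs code:

```python
# Illustrative sketch of the "shrink -> build -> grow" chain; task names,
# pool sizes and the Celery app are assumptions, not existing RTD code.
from celery import Celery, chain

app = Celery('builds_example')  # stand-in for the real Celery app


@app.task
def shrink_build_pool(n=2):
    # Ask the builder(s) to drop `n` pool processes before a heavy build starts.
    app.control.pool_shrink(n=n)


@app.task
def build_project(version_pk):
    # Placeholder for the real build task that trigger_build would dispatch.
    ...


@app.task
def grow_build_pool(n=2):
    # Restore the pool to its default size once the heavy build has finished.
    app.control.pool_grow(n=n)


def trigger_heavy_build(version_pk):
    # Each step runs only after the previous one completed successfully.
    workflow = chain(
        shrink_build_pool.si(),
        build_project.si(version_pk),
        grow_build_pool.si(),
    )
    workflow.apply_async()
```

One caveat with this approach: if the build task fails, the final step never runs, so the grow task would also need to be attached as an error callback (e.g. via `link_error`) to avoid leaving the pool permanently shrunk.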
Use a specific queue for heavy mem usage projects

To avoid all this logic, we could have a builder with only one worker. Before `trigger_build` is called, the web server checks for custom resource limits and forces the task to be added to this particular queue.
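A minimal sketch of that routing check, assuming a dedicated `build-large` queue; the queue names and the `parse_mem_limit` / `pick_build_queue` helpers are hypothetical, while `container_mem_limit` is the per-project override discussed above:

```python
# Sketch of routing memory-heavy builds to a dedicated single-worker queue.
# Queue names and helpers are assumptions; container_mem_limit is the
# per-project override discussed in this issue.
DEFAULT_MEM_LIMIT = '1g'


def parse_mem_limit(value):
    # Convert values like '1g' or '1500m' into megabytes so they can be compared.
    value = (value or DEFAULT_MEM_LIMIT).strip().lower()
    if value.endswith('g'):
        return int(float(value[:-1]) * 1024)
    if value.endswith('m'):
        return int(value[:-1])
    return int(value)


def pick_build_queue(project):
    # Projects with a raised memory limit go to the single-worker builder.
    if parse_mem_limit(project.container_mem_limit) > parse_mem_limit(DEFAULT_MEM_LIMIT):
        return 'build-large'
    return 'build-default'


# Usage, roughly where trigger_build dispatches the build task (hypothetical):
# update_docs.apply_async(args=[version.pk], queue=pick_build_queue(version.project))
```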