
Investigate high celery-worker memory #4864

Closed
lbeaufort opened this issue May 20, 2021 · 8 comments

Comments

@lbeaufort
Member

lbeaufort commented May 20, 2021

What we’re after

On 5/20/21, MUR 7284 still hadn't appeared more than an hour after it was published.

Celery-worker instances have been throwing Worker exited prematurely: signal 9 (SIGKILL) errors increasingly over the past month, which seems to correlate with celery-worker memory creeping up from 800MB/1GB to >=1GB/1GB.

We should do one or more of the following:

  • Increase celery-worker memory,
  • Determine the cause of the high memory usage and address it (see the config sketch after the log example below), or
  • Consider updating the celery/kombu versions.

Example:

   2021-05-20T12:10:00.10-0400 [APP/PROC/WEB/2] ERR [2021-05-20 16:10:00,101: INFO/ForkPoolWorker-10] Checking for modified cases
   2021-05-20T12:10:00.10-0400 [APP/PROC/WEB/2] ERR [2021-05-20 16:10:00,104: INFO/ForkPoolWorker-10] MUR 7284 found modified at 2021-05-20 11:00:11.016174
   2021-05-20T12:10:00.10-0400 [APP/PROC/WEB/2] ERR [2021-05-20 16:10:00,105: INFO/ForkPoolWorker-10] Loading MUR(s)
   2021-05-20T12:10:07.73-0400 [APP/PROC/WEB/2] ERR [2021-05-20 16:10:07,738: ERROR/MainProcess] Process 'ForkPoolWorker-10' pid:87 exited with 'signal 9 (SIGKILL)'
   2021-05-20T12:10:07.75-0400 [APP/PROC/WEB/2] ERR [2021-05-20 16:10:07,752: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL) Job: 5259.')
   2021-05-20T12:10:07.75-0400 [APP/PROC/WEB/2] ERR Traceback (most recent call last):
   2021-05-20T12:10:07.75-0400 [APP/PROC/WEB/2] ERR   File "/home/vcap/deps/0/python/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
   2021-05-20T12:10:07.75-0400 [APP/PROC/WEB/2] ERR     human_status(exitcode), job._job),
   2021-05-20T12:10:07.75-0400 [APP/PROC/WEB/2] ERR billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 5259.

Kibana app health tracking example.
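
A config-level option worth weighing against a straight memory bump: Celery can recycle its pool child processes before they grow into the container limit. A minimal sketch, assuming the app object lives in webservices.tasks and using illustrative (untuned) limits:

   # Sketch only: recycle prefork children before they reach the container limit.
   # The module name and both limits below are assumptions for illustration.
   from celery import Celery

   app = Celery("webservices.tasks")

   app.conf.update(
       # Restart a child after it has run this many tasks, so slow leaks
       # can't accumulate indefinitely in one process.
       worker_max_tasks_per_child=100,
       # Restart a child once its resident memory exceeds this many KiB
       # (~700MB here), comfortably under the 1GB instance limit that
       # currently ends in SIGKILL.
       worker_max_memory_per_child=700_000,
   )

The same limits can also be passed on the command line via --max-tasks-per-child and --max-memory-per-child if we'd rather not touch the app config.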

Related ticket(s)

(Include the tickets that either came before, after, or are happening in tandem with this new ticket)

  • [ ]

Action item(s)

(These are the smaller tasks that should happen in order to complete this work)

  • [ ]

Completion criteria

(What does the end state look like - as long as this task(s) is done, this work is complete)

  • [ ]

References/resources/technical considerations

(Is there sample code or a screenshot you can include to highlight a particular issue? Here is where you reinforce why this work is important)

@pkfec
Contributor

pkfec commented May 21, 2021

In the past, we ran into celery-worker memory issues when MUR #7594 was published in the prod environment. See #4592 for more details.

A follow-up issue was also submitted to consider increasing celery-worker memory: #4638

@pkfec pkfec modified the milestones: PI 14 Innovation, Sprint 15.1 Jun 8, 2021
@pkfec
Contributor

pkfec commented Jun 11, 2021

From the Kibana logs, it appears there are many recurring instances of the worker running out of memory over the past 90 days to 1 year. The out-of-memory errors in the worker are not confined to a specific case or document.

@pkfec
Contributor

pkfec commented Jun 11, 2021

Celery v5.0.1 comes with a few breaking changes:

  1. The signature for starting the celery app has changed:
  • celery -A webservices.tasks beat --loglevel INFO
  • celery -A webservices.tasks worker --loglevel INFO
  2. The --pool command-line argument is REQUIRED. The execution pool type has to be specified when the celery app is started (for example: prefork, gevent, eventlet, or the solo execution pool); see the sketch below.
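
A rough sketch of how the entry point and start commands could line up after the upgrade; the module layout, broker URL, and pool/concurrency values here are assumptions for illustration, not the contents of the feature branch:

   # Illustrative Celery 5-style entry point; broker URL and settings are assumed.
   from celery import Celery

   app = Celery("webservices.tasks", broker="redis://localhost:6379/0")

   # Under Celery 5 the app is named with -A before the subcommand, and the
   # execution pool is spelled out explicitly, e.g.:
   #   celery -A webservices.tasks worker --loglevel INFO --pool prefork --concurrency 2
   #   celery -A webservices.tasks beat --loglevel INFO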

Tried upgrading the celery, kombu, click, and vine packages to the most compatible versions in this feature branch: feature/update-celery-kombu-pkgs

@pkfec
Contributor

pkfec commented Jun 15, 2021

Schedule B/EFile downloads could be a symptom of the worker running out of memory. The changes are going live (deployed to prod) during the innovation release on 06/15. Will monitor the worker memory after the SB/EFile change gets deployed to production.

@pkfec
Contributor

pkfec commented Jun 16, 2021

After considering our current cloud.gov org memory limit, @lbeaufort and @fec-jli advised making a few adjustments to the dev and prod celery memory allocations. We decided to reduce the dev worker memory to 512M per instance and allocate more memory to the prod worker instances. Prod celery-worker memory is now increased to 1.5G per instance.

@pkfec
Contributor

pkfec commented Jun 16, 2021

Increased the worker memory to 1.5G per instance in production. See #4890

On 06/16/2021 at 4:35pm, ran cf scale celery-worker -m 1500M in production and scaled the worker memory to 1500M (1.5GB).
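
For a rough sense of what the new limit buys per child process, a back-of-the-envelope sketch (the concurrency and parent-overhead numbers are guesses, not measurements from the running worker):

   # Illustrative arithmetic only: per-child headroom under the new 1.5G limit.
   instance_limit_mb = 1500          # cf scale celery-worker -m 1500M
   assumed_concurrency = 10          # ForkPoolWorker-10 shows up in the logs above
   assumed_parent_overhead_mb = 200  # main process + shared imports (guess)

   per_child_mb = (instance_limit_mb - assumed_parent_overhead_mb) / assumed_concurrency
   print(f"~{per_child_mb:.0f}MB per child before the container limit is reached")
   # With these assumptions, each child gets roughly 130MB of headroom.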

Screenshot of worker after upgrading to 1.5G memory (Screen Shot 2021-06-16 at 4 32 18 PM).

@pkfec
Contributor

pkfec commented Jun 22, 2021

WIP PR with celery package upgrades: #4895

@pkfec
Contributor

pkfec commented Jun 22, 2021

Monitor celery worker memory in Production. No additional work needed.

@pkfec pkfec closed this as completed Jun 22, 2021