
Investigate Celery outage #2459

Closed
LindsayYoung opened this issue Jun 5, 2017 · 9 comments

Comments

@LindsayYoung
Contributor

We have seen Celery choke with a "no space left on device" error, but the general app stats look OK. This can break downloads, so we need to figure out what is causing the error and how to deal with it.

@LindsayYoung
Contributor Author

Initial thoughts:

  • Add some more space for now.
  • Maybe processes are not being completed?
  • Maybe there are some advanced ways of managing the queue and looking at space usage in Celery? (A quick inspection sketch is below.)
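
A minimal sketch of what "looking at the queue" could mean in practice, using Celery's built-in inspection API. The import path and app name below are assumptions, not necessarily the project's actual layout:

```python
# Hedged sketch: ask the running workers what they are doing, to see whether
# tasks are piling up instead of completing. `webservices.tasks.app` is an
# assumed import path for the Celery app instance.
from webservices.tasks import app as celery_app

inspector = celery_app.control.inspect()

print(inspector.active())     # tasks currently executing on each worker
print(inspector.reserved())   # tasks prefetched but not yet started
print(inspector.scheduled())  # tasks with an ETA/countdown still pending
```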

@LindsayYoung
Contributor Author

Also going to check:

  • maximum file length
  • space in the data structure (a quick disk-usage check is sketched below)
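
For the space check, something as small as this would confirm whether the device the export files land on is actually full; the path is a placeholder, not necessarily where our temp files go:

```python
# Rough disk-usage check for the directory the export step writes to.
# "/tmp" is a stand-in; the real download code may use a different path.
import shutil

usage = shutil.disk_usage("/tmp")
print("total: {:,} bytes, used: {:,}, free: {:,}".format(
    usage.total, usage.used, usage.free))
```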

@LindsayYoung
Contributor Author

Tracked down the line that failed: it is in the export query step, which is being handled by celery once.

@noahmanger modified the milestones: Sprint 2.7, Sprint 2.8 (Jun 12, 2017)
@LindsayYoung
Contributor Author

I am noting here that there is a 2G limit on disk space.

If this gives us problems again, we might want to check whether we can refactor our download code so it writes to S3 in a streaming manner. I have not investigated that yet.
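
For reference, a streaming write to S3 could look roughly like the sketch below, using boto3's multipart upload so that only one chunk is held at a time. The bucket, key, and row generator are placeholders; this is not the actual download code:

```python
# Hedged sketch: stream an export to S3 in ~5 MB parts instead of building
# the whole file on local disk first. All names here are illustrative.
import boto3

s3 = boto3.client("s3")
bucket, key = "example-bucket", "exports/example.csv"

def generate_rows():
    # Placeholder for rows coming out of the export query.
    for i in range(100000):
        yield "{},example\n".format(i).encode("utf-8")

upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts, part_number, chunk = [], 1, b""

for row in generate_rows():
    chunk += row
    if len(chunk) >= 5 * 1024 * 1024:  # S3 requires parts >= 5 MB (except the last)
        resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                              UploadId=upload["UploadId"], Body=chunk)
        parts.append({"ETag": resp["ETag"], "PartNumber": part_number})
        part_number += 1
        chunk = b""

if chunk:  # flush the final, possibly smaller, part
    resp = s3.upload_part(Bucket=bucket, Key=key, PartNumber=part_number,
                          UploadId=upload["UploadId"], Body=chunk)
    parts.append({"ETag": resp["ETag"], "PartNumber": part_number})

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                             MultipartUpload={"Parts": parts})
```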

@LindsayYoung
Contributor Author

Had a good meeting with Josh, Carlo, Prya and Rohan about Celery. It seems like we are going to take a two-pronged attack:

  1. get Celery scaled horizontally without errors (some relevant worker settings are sketched after this list),
  2. get the downloads streaming so they don't use disk.
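
For the first prong, the settings below are the kind of Celery knobs that usually matter when adding worker instances; the values and names here are assumptions for illustration, not our actual config:

```python
# Illustrative Celery config for running multiple worker instances safely.
# The app name and broker URL are placeholders.
from celery import Celery

app = Celery("downloads", broker="redis://localhost:6379/0")

app.conf.update(
    task_acks_late=True,           # a task is re-queued if its worker dies mid-run
    worker_prefetch_multiplier=1,  # stop one worker from hoarding queued downloads
    task_track_started=True,       # long-running exports show up as STARTED in monitoring
)
```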

@LindsayYoung
Contributor Author

We are still experiencing issues with celery-worker and celery-beat:

  • Queries are getting blocked by updates to the master db; we are going to address that by increasing the max streaming delay: https://github.com/18F/fec-infrastructure/pull/19 (see the note after this list).
  • The update emails have failed to send, but the other beat tasks are still operating.
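
The "max streaming delay" presumably refers to Postgres's max_standby_streaming_delay, which controls how long a hot standby waits before cancelling queries that conflict with incoming replication. A quick way to confirm the current value on the replica (connection details are placeholders):

```python
# Sketch only: check the standby's query-cancellation setting with psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=fec host=replica.example.internal")
with conn.cursor() as cur:
    cur.execute("SHOW max_standby_streaming_delay;")
    print(cur.fetchone())
```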

@LindsayYoung
Contributor Author

  • Investigate Flower (Celery's monitoring tool)

@LindsayYoung
Contributor Author

OK, we are making progress on this issue.

We are going to scale Celery horizontally

We have a script to test a bunch of downloads, and it is working on dev with Celery scaled horizontally (thanks @jontours and @pkfec).
We are not seeing 502 errors on dev with 4 instances and a high load, so we want to try scaling Celery horizontally on Monday.
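
For the record, that kind of download load test can be as simple as the sketch below; the real script from @jontours and @pkfec may look quite different, and the endpoint URL is a placeholder:

```python
# Rough stand-in for a "fire lots of downloads at dev" smoke test.
import concurrent.futures
import requests

DOWNLOAD_URL = "https://fec-dev-api.example.com/v1/download/schedules/schedule_a/"

def request_download(i):
    resp = requests.post(DOWNLOAD_URL, json={"sort": "-contribution_receipt_date"})
    return i, resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
    for i, status in pool.map(request_download, range(50)):
        print(i, status)  # watch for 502s under load
```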

Queries will no longer error when the source table is updated

@vrajmohan and @ccostino adjusted the psql vars so that we can get rid of the query cancellations that were happening because the master table was updated.

Working on streaming files

@vrajmohan is looking into streaming the writing of zipfiles so we don't use disk.
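
One possible shape for that, using the third-party python-zipstream package purely as an illustration (not necessarily the library that will be used): the archive is produced as an iterator of byte chunks, so it can be piped into an S3 multipart upload instead of a temp file.

```python
# Hedged sketch: build a zip archive as a stream of chunks, no local file.
import zipstream

def csv_rows():
    # Placeholder for the export query output.
    yield b"id,name\n"
    yield b"1,example\n"

archive = zipstream.ZipFile(mode="w", compression=zipstream.ZIP_DEFLATED)
archive.write_iter("export.csv", csv_rows())

for chunk in archive:
    pass  # hand each chunk to the S3 upload (see the multipart sketch above)
```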

Updating Celery

@ccostino has made a PR to update the Celery version, and that is on dev. We still need to confirm this gets the nightly update back in order.

@ccostino
Contributor

Thanks, @LindsayYoung! Related issue here for other adjustments/improvements/fixes: #2553
