Django ASGI #2843
Work is ongoing to implement issues in the linked milestone. All but the last issue can be worked on concurrently.
Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
We've progressed just under halfway through the milestone issues. Depending on capacity, I believe that by the end of next week we'll be on the final two issues, where we actually implement the asynchronous endpoints and deployment. The most recent significant success was deploying the API using the baked-in Gunicorn configuration, instead of duplicating configuration in the ECS task definition. This wasn't strictly necessary, but it will make it much easier to deploy the ASGI version of the service with zero downtime, without needing complex switchovers or multiple deployments to switch safely.
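For illustration, a baked-in configuration like this lives inside the application image as a `gunicorn.conf.py`, so the ECS task definition only needs to run `gunicorn`. This is a minimal sketch with illustrative values and env var names, not the actual Openverse configuration:

```python
# gunicorn.conf.py, shipped inside the application image. All values and
# env var names here are illustrative assumptions, not real Openverse settings.
import os

# Bind inside the container; the port can still be injected by the task definition.
bind = f"0.0.0.0:{os.getenv('PORT', '8000')}"

# Tunable per environment via env vars, so staging and production share one image.
workers = int(os.getenv("GUNICORN_WORKERS", "2"))

# Switching to ASGI later only means changing the worker class here
# (e.g. to uvicorn's Gunicorn worker), not editing infrastructure code.
worker_class = os.getenv("GUNICORN_WORKER_CLASS", "sync")

# Log to stdout/stderr so the container log driver picks everything up.
accesslog = "-"
errorlog = "-"
```

The benefit described above falls out of this arrangement: changing how the server runs becomes an application deploy rather than an infrastructure change.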
Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Our plan is now to convert the app to ASGI, then apply the changes for sharing the aiohttp client and the async image proxy route. Each of these has a PR open, but progress on the ASGI deployment depends on changes to the infrastructure and is also blocked by #3029, which is higher priority (hoping it solves the memory leak). Right now I'm working on getting an infrastructure PR ready for review that will allow changing the task count and resources for staging, followed by production, so we can progressively deploy the ASGI switch in each environment. Once that is up and reviewed, it will unblock actually doing the ASGI switch-over deployment. It might be possible to do that by the end of this week if reviews are quick, but I'm going on holiday next week, so if it doesn't happen this week it'll have to wait until next month, unless someone else takes it over (which I really wouldn't mind help on, cc @WordPress/openverse-api). Being part-time makes it difficult to make fast progress on these "one-maintainer" projects, and I'd like to get this wrapped up so I can move on to helping with the content moderation workflow work once those IPs are ready.
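For context, the shared-client change amounts to keeping one `aiohttp.ClientSession` per process instead of creating one per request. A minimal sketch, with hypothetical names, of how that might look in an async Django view:

```python
import aiohttp
from django.http import HttpResponse

# Module-level session shared across requests; created lazily so it is
# built on the running event loop under an ASGI deployment.
_session: aiohttp.ClientSession | None = None


def get_session() -> aiohttp.ClientSession:
    global _session
    if _session is None or _session.closed:
        _session = aiohttp.ClientSession()
    return _session


async def thumbnail_proxy(request, image_url: str) -> HttpResponse:
    # An async view: awaiting the upstream fetch frees the event loop to
    # serve other requests while the image downloads.
    async with get_session().get(image_url) as upstream:
        body = await upstream.read()
        content_type = upstream.headers.get("Content-Type", "image/jpeg")
    return HttpResponse(body, content_type=content_type)
```

Reusing one session lets all requests share its connection pool, which is where most of the benefit comes from.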
Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
We've deployed the API using the canary approach, so now we can plan the rollout of the ASGI application. I need to rebase the ASGI PR due to Dependabot updates and changes to uvloop in the existing application.
We also need to deploy the canary approach to production. We haven't done so yet because the canary service would be missing from our dashboards, and we've wanted as much information available as possible while we investigate the recent API response time issues.
Ah! I missed that among everything that happened while I was AFK. I'll ask for more information today during our 1:1 🙏
@AetherUnbound Do you think it would be a good idea to put a pause on this line of work for a week or two while we continue to work through the ES issues? I'm not personally comfortable making such dramatic changes to the way the application runs while we're still working to debug an obscure issue. |
I agree, I'll move this one to on-hold until we have a solid handle on that issue. |
Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
The project is no longer on hold. This is the plan of action, as I see it now, in order to finish this out:
That completes the initial ASGI deployment, but it does not finish everything: we still need to make the application changes that actually benefit from ASGI. So after all that deployment is done and confirmed to work as expected:
We should expect to cut multiple releases throughout this work, to ensure we test each step under production load and to reduce the number of changes related to this work going out with each release.
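For reference, the core of the ASGI conversion itself is small: Django gets served through its ASGI entrypoint instead of WSGI. A minimal sketch, where the settings module path is a placeholder rather than Openverse's actual module:

```python
# asgi.py — the standard Django ASGI entrypoint; "conf.settings" is a placeholder.
import os

from django.core.asgi import get_asgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "conf.settings")

application = get_asgi_application()
```

Under Gunicorn this would be served with an ASGI worker class, e.g. `gunicorn conf.asgi:application -k uvicorn.workers.UvicornWorker` (module path again a placeholder); the bulk of the project is the deployment work and the application changes around it, not this file.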
I'm going to rebase PRs today to get this ready to start rolling out next week. |
Edit: I'm updating this comment with a task list for preparing the deployment described above, and will update it as I get PRs ready to go.
@stacimc @AetherUnbound and I deployed the production canary today, and then Staci and I deployed the ASGI app to staging with resized workers. The deployment went well! We increased the number of tasks, then deployed the ASGI workers to staging, then decreased the task resources to a quarter of their previous amount.

The end result is slightly higher memory and CPU consumption by percent. I haven't looked at the raw numbers to see how they actually stack up, but this is what I expected: asynchronous runtimes are often memory and CPU hogs. The previous idle memory usage in staging was about 9%. Now it's at about 13%, so 1.44x the original. CPU was around 1% and is now ~5%, a 5x increase.

If we take those same ratios and apply them to production: current stable memory in production is around 30%, and 1.44x of 30% is around 43%. CPU rises to peaks of 10% during high traffic periods, with spikes to 16% during deployments. 5x of 10% is 50%, and of 16% it is 80%.

However, we also need to consider the thumbnails service. Its memory usage is lower than the main API service's, so no worries there, but its CPU reaches 34% during peak traffic when looking at 1-hour averages over the last 4 weeks. 5x of 34% is 170%, so the thumbnails service very well might not fit within a plain 1/4 resize. To be safe, when we resize the production tasks, I'll propose that we resize the main API service tasks to match the staging tasks (.25 vCPU and 1GB memory) but use .5 vCPU with 1GB memory for the thumbnails. That would put the thumbnails peaking around 85% CPU, which is high, but if it turns out to be stable, is a very nice level of resource consumption (remembering that we pay for everything we've reserved, regardless of whether we use it).

Because of the asynchronous nature of the thumbnails endpoint, and because we're prioritising getting that route fully async, I wonder if we'll be able to reduce the number of tasks for the thumbnails service, even if only temporarily. We could also see whether reducing the overall number of API tasks is feasible. Overall the changes should save us money compared to our current deployments, but I don't know how reserving .5 vCPU tasks for the thumbnails service will affect that calculus. I doubt it would put us over our current spend, but I'll check the pricing calculator tomorrow when I'm preparing the PR for the production task resize to confirm. We plan to deploy the changes to production on Monday.
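To make the extrapolation above easy to check, here it is as a small script. The percentages come from the dashboard figures quoted in this comment; everything else is arithmetic:

```python
# Scale production baselines by the before/after ratios observed in staging.
mem_ratio = 13 / 9   # staging idle memory: 9% -> 13%, ~1.44x
cpu_ratio = 5 / 1    # staging idle CPU: 1% -> 5%, 5x

print(f"prod memory, stable:     {30 * mem_ratio:.0f}%")   # ~43%
print(f"prod CPU, peak traffic:  {10 * cpu_ratio:.0f}%")   # 50%
print(f"prod CPU, deploy spikes: {16 * cpu_ratio:.0f}%")   # 80%

# Thumbnails: 34% CPU peak today, so a plain 1/4 resize would overshoot.
thumb_peak = 34 * cpu_ratio                                # 170% of a .25 vCPU task
print(f"thumbnails at .5 vCPU:   {thumb_peak / 2:.0f}%")   # ~85%
```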
We deployed the ASGI app to production. This involved increasing the task count to 20, deploying the ASGI app, then decreasing task resources to 1/4 of the original vCPU and 1/2 of the original memory allocation. Things appeared okay at first, but we're seeing what looks like a memory leak on the production thumbnails. #3353 is the issue tracking that work. Right now we're deploying GC debug logging to the production thumbnails to see if we can uncover anything.
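A minimal sketch of the kind of GC debug logging described here, using the standard-library `gc` module; the instrumentation actually deployed may differ:

```python
import gc

# Print collection statistics and report uncollectable objects on every
# collection pass; output goes to stderr, which the container log driver captures.
gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_UNCOLLECTABLE)
```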
@AetherUnbound The new PR (#3388) will close the last issue in this project thread. However, I'm unclear whether the project thread should also encompass getting rid of the thumbnails service. What do you think? I don't know exactly what would be involved beyond redirecting thumbnails traffic back to the main API service. Shall I add one final issue for that, and then we can look to do it sometime in the next two weeks? We could try reducing the number of thumbnail tasks first, to make sure the ASGI implementation can handle artificially increased production throughput. That would give me more confidence that a unified service would be able to handle every request at once. But the main question is: should this project close once we've "asgi-fied" the application and confirmed that works, or once we've actually unified the API service again? We could move this to "shipped" after #3388 and then only consider it closed once the service is unified?
I think that's an excellent plan! And yes, if removing the extra thumbnail service really only entails redirecting traffic (and the cleanup of anything else we spun up, like Sentry projects, for the thumbnails), then I think we should try to do that before EOY and before marking this as success 😄 |
Description
See #2566. This project thread is only a stand-in to track the effort put towards the issues recorded in the milestone (linked below).
Documents
Issues
https://github.com/WordPress/openverse/milestone/19