Django ASGI #2843
Work is ongoing to implement issues in the linked milestone. All but the last issue can be worked on concurrently.
Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
We've progressed just under halfway through the milestone issues. Depending on capacity, I believe that by the end of next week we'll be on the final two issues, where we actually implement the asynchronous endpoints and deployment. The most recent significant success was deploying the API using the baked-in Gunicorn configuration, instead of duplicating configuration in the ECS task definition. This wasn't strictly necessary, but it will make it much easier to deploy the ASGI version of the service with zero downtime, without needing complex switchovers or multiple deployments to switch safely.
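For illustration, a baked-in configuration like this lives inside the application image as a `gunicorn.conf.py`, so the ECS task definition only needs to run `gunicorn`. This is a minimal sketch with illustrative values and env var names, not the actual Openverse configuration:

```python
# gunicorn.conf.py, shipped inside the application image. All values and
# env var names here are illustrative assumptions, not real Openverse settings.
import os

# Bind inside the container; the port can still be injected by the task definition.
bind = f"0.0.0.0:{os.getenv('PORT', '8000')}"

# Tunable per environment via env vars, so staging and production share one image.
workers = int(os.getenv("GUNICORN_WORKERS", "2"))

# Switching to ASGI later only means changing the worker class here
# (e.g. to uvicorn's Gunicorn worker), not editing infrastructure code.
worker_class = os.getenv("GUNICORN_WORKER_CLASS", "sync")

# Log to stdout/stderr so the container log driver picks everything up.
accesslog = "-"
errorlog = "-"
```

The benefit described above falls out of this arrangement: changing how the server runs becomes an application deploy rather than an infrastructure change.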
Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Our plan is now to convert the app to ASGI, then apply the changes for sharing the aiohttp client and the async image proxy route. Each of these has a PR open, but progress on the ASGI deployment depends on changes to the infrastructure and is also blocked by #3029, which is higher priority (hoping it solves the memory leak). Right now I'm working on getting an infrastructure PR ready for review that will allow changing the task count and resources for staging, followed by production, so we can progressively deploy the ASGI switch in each environment. Once that is up and reviewed, it will unblock actually doing the ASGI switch-over deployment. It might be possible to do that by the end of this week if reviews are quick, but I'm going on holiday next week, so if it doesn't happen this week it'll have to wait until next month, unless someone else takes it over (which I really wouldn't mind help on, cc @WordPress/openverse-api). Being part-time makes it difficult to make fast progress on these "one-maintainer" projects, and I'd like to get this wrapped up so I can move on to helping with the content moderation workflow work once those IPs are ready.
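For context, the shared-client change amounts to keeping one `aiohttp.ClientSession` per process instead of creating one per request. A minimal sketch, with hypothetical names, of how that might look in an async Django view:

```python
import aiohttp
from django.http import HttpResponse

# Module-level session shared across requests; created lazily so it is
# built on the running event loop under an ASGI deployment.
_session: aiohttp.ClientSession | None = None


def get_session() -> aiohttp.ClientSession:
    global _session
    if _session is None or _session.closed:
        _session = aiohttp.ClientSession()
    return _session


async def thumbnail_proxy(request, image_url: str) -> HttpResponse:
    # An async view: awaiting the upstream fetch frees the event loop to
    # serve other requests while the image downloads.
    async with get_session().get(image_url) as upstream:
        body = await upstream.read()
        content_type = upstream.headers.get("Content-Type", "image/jpeg")
    return HttpResponse(body, content_type=content_type)
```

Reusing one session lets all requests share its connection pool, which is where most of the benefit comes from.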
Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
We've deployed the API using the canary approach, so now we can plan the rollout of the ASGI application. I need to rebase the ASGI PR due to Dependabot updates and changes to uvloop in the existing application.
We also need to deploy the canary approach to production. We haven't done so yet because the canary service would be missing from our dashboards, and we've wanted as much information available as possible while we investigate the recent API response time issues.
Ah! I missed that among everything that happened while I was AFK. I'll ask for more information today during our 1:1 🙏
@AetherUnbound Do you think it would be a good idea to put a pause on this line of work for a week or two while we continue to work through the ES issues? I'm not personally comfortable making such dramatic changes to the way the application runs while we're still working to debug an obscure issue. |
I agree, I'll move this one to on-hold until we have a solid handle on that issue. |
Hi @sarayourfriend, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
The project is no longer on hold. This is the plan of action, as I see it now, in order to finish this out:
That completes the initial ASGI deployment, but it does not finish everything: we still need to make the application changes that actually benefit from ASGI. So after all that deployment is done and confirmed to work as expected:
We should expect to cut multiple releases throughout this work, to ensure we test each step under production load and to reduce the number of changes related to this work going out with each release.
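For reference, the core of the ASGI conversion itself is small: Django gets served through its ASGI entrypoint instead of WSGI. A minimal sketch, where the settings module path is a placeholder rather than Openverse's actual module:

```python
# asgi.py — the standard Django ASGI entrypoint; "conf.settings" is a placeholder.
import os

from django.core.asgi import get_asgi_application

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "conf.settings")

application = get_asgi_application()
```

Under Gunicorn this would be served with an ASGI worker class, e.g. `gunicorn conf.asgi:application -k uvicorn.workers.UvicornWorker` (module path again a placeholder); the bulk of the project is the deployment work and the application changes around it, not this file.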
I'm going to rebase PRs today to get this ready to start rolling out next week. |
Edit: I'm updating this comment with a task list for preparing the deployment described above, and will update it as I get PRs ready to go.
@stacimc @AetherUnbound and I deployed the production canary today, and then Staci and I deployed the ASGI app to staging with resized workers. The deployment went well! We increased the number of tasks, then deployed the ASGI workers to staging, then decreased the task resources to a quarter of their previous amount.

The end result is slightly higher memory and CPU consumption by percent. I haven't looked at the raw numbers to see how they actually stack up, but this is what I expected: asynchronous runtimes are often memory and CPU hogs. The previous idle memory usage in staging was about 9%. Now it's at about 13%, so 1.44x the original. CPU was around 1% and is now ~5%, a 5x increase.

If we take those same ratios and apply them to production: current stable memory in production is around 30%, and 1.44x of 30% is around 43%. CPU rises to peaks of 10% during high traffic periods, with spikes to 16% during deployments. 5x of 10% is 50%, and of 16% it is 80%.

However, we also need to consider the thumbnails service. Its memory usage is lower than the main API service's, so no worries there, but its CPU reaches 34% during peak traffic when looking at 1-hour averages over the last 4 weeks. 5x of 34% is 170%, so the thumbnails service very well might not fit within a plain 1/4 resize. To be safe, when we resize the production tasks, I'll propose that we resize the main API service tasks to match the staging tasks (.25 vCPU and 1GB memory) but use .5 vCPU with 1GB memory for the thumbnails. That would put the thumbnails peaking around 85% CPU, which is high, but if it turns out to be stable, is a very nice level of resource consumption (remembering that we pay for everything we've reserved, regardless of whether we use it).

Because of the asynchronous nature of the thumbnails endpoint, and because we're prioritising getting that route fully async, I wonder if we'll be able to reduce the number of tasks for the thumbnails service, even if only temporarily. We could also see whether reducing the overall number of API tasks is feasible. Overall the changes should save us money compared to our current deployments, but I don't know how reserving .5 vCPU tasks for the thumbnails service will affect that calculus. I doubt it would put us over our current spend, but I'll check the pricing calculator tomorrow when I'm preparing the PR for the production task resize to confirm. We plan to deploy the changes to production on Monday.
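To make the extrapolation above easy to check, here it is as a small script. The percentages come from the dashboard figures quoted in this comment; everything else is arithmetic:

```python
# Scale production baselines by the before/after ratios observed in staging.
mem_ratio = 13 / 9   # staging idle memory: 9% -> 13%, ~1.44x
cpu_ratio = 5 / 1    # staging idle CPU: 1% -> 5%, 5x

print(f"prod memory, stable:     {30 * mem_ratio:.0f}%")   # ~43%
print(f"prod CPU, peak traffic:  {10 * cpu_ratio:.0f}%")   # 50%
print(f"prod CPU, deploy spikes: {16 * cpu_ratio:.0f}%")   # 80%

# Thumbnails: 34% CPU peak today, so a plain 1/4 resize would overshoot.
thumb_peak = 34 * cpu_ratio                                # 170% of a .25 vCPU task
print(f"thumbnails at .5 vCPU:   {thumb_peak / 2:.0f}%")   # ~85%
```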
We deployed the ASGI app to production. This involved increasing the task count to 20, deploying the ASGI app, then decreasing task resources to 1/4 of the original vCPU and 1/2 of the original memory allocation. Things appeared okay at first, but we're seeing what looks like a memory leak on the production thumbnails. #3353 is the issue tracking that work. Right now we're deploying GC debug logging to the production thumbnails to see if we can uncover anything.
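A minimal sketch of the kind of GC debug logging described here, using the standard-library `gc` module; the instrumentation actually deployed may differ:

```python
import gc

# Print collection statistics and report uncollectable objects on every
# collection pass; output goes to stderr, which the container log driver captures.
gc.set_debug(gc.DEBUG_STATS | gc.DEBUG_UNCOLLECTABLE)
```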
@AetherUnbound The new PR (#3388) will close the last issue in this project thread. However, I'm unclear whether the project thread should also encompass getting rid of the thumbnails service. What do you think? I don't know exactly what would be involved beyond redirecting thumbnails traffic back to the main API service. Shall I add one final issue for that, and then we can look to do it sometime in the next two weeks? We could try reducing the number of thumbnail tasks first, to make sure the ASGI implementation can handle artificially increased production throughput. That would give me more confidence that a unified service would be able to handle every request at once. But the main question is: should this project close once we've "asgi-fied" the application and confirmed that works, or once we've actually unified the API service again? We could move this to "shipped" after #3388 and then only consider it closed once the service is unified?
I think that's an excellent plan! And yes, if removing the extra thumbnail service really only entails redirecting traffic (and the cleanup of anything else we spun up, like Sentry projects, for the thumbnails), then I think we should try to do that before EOY and before marking this as success 😄 |
Description
See #2566. This project thread is only a stand-in to track the effort put towards the issues recorded in the milestone (linked below).
Documents
Issues
https://github.com/WordPress/openverse/milestone/19