Skip to content
This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Move off of grequests #486

Closed
1 task
AetherUnbound opened this issue Jan 25, 2022 · 23 comments · Fixed by #1027
Closed
1 task

Move off of grequests #486

AetherUnbound opened this issue Jan 25, 2022 · 23 comments · Fixed by #1027
Assignees
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature help wanted Open to participation from the community 🟨 priority: medium Not blocking but should be addressed soon 🐍 tech: python Requires familiarity with Python

Comments

@AetherUnbound
Copy link
Contributor

AetherUnbound commented Jan 25, 2022

Problem

The API currently uses the grequests library for handling some of the head requests it completes when returning results (specifically in validate_images.py).

Description

The README for the project says:

Note: You should probably use requests-threads or requests-futures instead.

The project has also not had a release since 2020. We might want to consider using one of those proposed alternatives.

Alternatives

Leave as is and hope for the best!

Additional context

Some more context in Sentry: https://sentry.io/share/issue/061ba99fc3df4c23bdb7643d337bbda0/

Implementation

  • 🙋 I would be interested in implementing this feature.
@AetherUnbound AetherUnbound added 🟩 priority: low Low priority and doesn't need to be rushed ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🐍 tech: python Requires familiarity with Python labels Jan 25, 2022
@sarayourfriend sarayourfriend added good first issue New-contributor friendly help wanted Open to participation from the community labels Jan 31, 2022
@sarayourfriend
Copy link
Contributor

This would be a great issue for a new contributor to pick up, if anyone is looking for something to get started with 😀

@samajain
Copy link

samajain commented Apr 3, 2022

Hi @sarayourfriend,

I came across your project and I am interested in contributing! I am new to opensource community contributions but I have 5 years of experience working as an engineer with python, C++ and django frameworks. Can I work on this issue?

From the description, what I concluded is that we want to migrate from grequests to some other multi-threaded handling of requests, possibly requests-threads? Please share any other information which you think might be needed to work on the issue.

Thanks!

@sarayourfriend
Copy link
Contributor

@samajain Hi! Welcome to Openverse 😁 It'd be great to have you work on this issue. Your understanding of the issue and one of the potential solutions is correct.

request-threads could be a good issue but we don't currently use Twisted, which requests-threads depends on 🤔 I wonder if we could also look into using Django's builtin asyncio support and using something like httpx or one of the other popular async HTTP libraries. requests-threads hasn't been updated in a few years (including not having technical support for the latest Python versions).

If you feel comfortable doing that kind of research for the trade-offs between the different options then let us know 🙂 Unfortunately this issue might not be as clear cut as it seems now that I've looked into it a bit more 😕 If you'd prefer working one something more straight forward then also let me know and I can help you find something else.

Thanks for your interest in the project!

@samajain
Copy link

samajain commented Apr 4, 2022

I can do the research if you want me to and share my findings.

If not this issue then I can take up something else too 🙂 Since you have been part of this project, I leave it to your call.

I'll dig up for something from the good first issues label.

@AetherUnbound
Copy link
Contributor Author

We'd be more than happy to let you take this one on! If you're comfortable gathering the research and posting it here, we can work together to decide which library/path forward might be best 🙂

@samajain
Copy link

samajain commented Apr 4, 2022

Sure! I can do that. What kind of tradeoff aspects are you looking for? Shall I compare them based on their performance metrics or are you looking for a comparison on anything specific other than this?

@AetherUnbound
Copy link
Contributor Author

I'll let @sarayourfriend chime in with any other thoughts, but I think my greatest concern would be 1) what is the maintenance burden we incur with whatever dependency we go with and 2) how much restructuring of the codebase do we need to do in order to properly integrate the library. For instance, if we do go with an async-based library, how much will we need to change to make the application support async calls. Performance metrics would be useful but secondary I think to these factors 😄

@sarayourfriend
Copy link
Contributor

sarayourfriend commented Apr 5, 2022

I'm less concerned with the maintenance burden initially than with the performance impact of using async in Django 4. The Django docs mention that using async views (which is fully supported and does not require any restructuring, apparently 🤔) has some performance trade offs. It doesn't explain what those are, so it'd be nice to understand them. We could move the Django app to ASGI and run it with uvicorn instead of gunicorn, but that my gut tells me that's probably a bigger move than we need to solve this particular problem. It would also need to involve larger infrastructure changes that we don't necessarily have capacity for at the moment. To quote the Django docs:

Async views will still work under WSGI, but with performance penalties, and without the ability to have efficient long-running requests.

The only place we're using grequests right now is basically to fire off a bunch of HTTP requests in parallel, so the long-running requests issue isn't important. However, the usage is in the main search routine, so it's extremely critical path. It's perhaps the view where performance matters the most for us.

My preference would be to use the built in Django async support as it's in line with mainline Python support (and I think is generally where things are headed, mostly away from Twisted, etc). But if the performance impact is bad, then we'll necessarily need to suffer the maintenance burden of introducing a "bolted on" asynchronous library for now.

All of that to say, I'd recommend looking into what the trade offs are with using Django's built in asyncio view support under WSGI first, then moving onto seeing what other options exist if the penalties for partial async in Django WSGI app are bad.

@AetherUnbound
Copy link
Contributor Author

However, the usage is in the main search routine, so it's extremely critical path. It's perhaps the view where performance matters the most for us.

That's a great point, thanks for mentioning it! And I'm glad that it doesn't seem like async support in Django will be too difficult from a code perspective.

@samajain
Copy link

samajain commented Apr 7, 2022

@AetherUnbound @sarayourfriend Thank you for your inputs. I will do my research and get back soon!

@sarayourfriend
Copy link
Contributor

@samajain Did you get a chance to look into this?

Given Django's recent release of 4.2 adding async to the ORM, I think it is a good time to start planning a switch over to the async Django API. I don't think this will be a trivial change to do correctly and I don't know if it can be done in stages. Would it be detrimental, for example, to delay switching to async Redis and Elasticsearch clients and Django ORM to happen after switching to async Django generally to be able to use an async http library to replace our usage of grequests?

@sarayourfriend
Copy link
Contributor

I was reading more about Django's async support. It looks like we might be able to relatively easily get off of grequests by using Django's in-depth async_to_sync function to run an async version of the validate_images function.

@WordPress/openverse-maintainers I think we could prioritize this issue higher if someone wanted to try out async_to_sync with a regular asyncio HTTP Python library like httpx or similar.

@AetherUnbound
Copy link
Contributor Author

I am all for this!

@Prakharkarsh1
Copy link

Hii @AetherUnbound I also want to contribute to fix this issue

@dhruvkb
Copy link
Member

dhruvkb commented Sep 10, 2022

@Prakharkarsh1 please go ahead! Feel free to comment here if you need any pointers getting started.

@krysal krysal added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟩 priority: low Low priority and doesn't need to be rushed good first issue New-contributor friendly labels Oct 25, 2022
@sarayourfriend
Copy link
Contributor

I took a peek at this today and moving to ASGI for our application would be mostly trivial (I think). The bulk of the work would be to remove grequests and replacing it with something like httpx or aiohttp (plus maybe AnyIO to make "grouping" the requests easier, if the libraries themselves do not natively support it).

Updating unit tests would probably take up most of the time spent on this. Due to grequests nature, the tests do a fair bit of unusual mocking to try to stay out of the way as much as possible, but it's very specific to grequests. Perhaps adding an abstraction layer here would help to simplify the tests in general.

@sarayourfriend
Copy link
Contributor

Here is a small diff that moves the app to ASGI:

diff --git a/api/catalog/asgi.py b/api/catalog/asgi.py
new file mode 100644
index 00000000..6b3169f4
--- /dev/null
+++ b/api/catalog/asgi.py
@@ -0,0 +1,7 @@
+import os
+
+from django.core.asgi import get_asgi_application
+
+os.environ.setdefault("DJANGO_SETTINGS_MODULE", "catalog.settings")
+
+application = get_asgi_application()
diff --git a/api/catalog/settings.py b/api/catalog/settings.py
index 268ebb62..192feca2 100644
--- a/api/catalog/settings.py
+++ b/api/catalog/settings.py
@@ -236,7 +236,7 @@ TEMPLATES = [
     },
 ]
 
-WSGI_APPLICATION = "catalog.wsgi.application"
+ASGI_APPLICATION = "catalog.aisgi.application"
 
 # Database
 # https://docs.djangoproject.com/en/2.0/ref/settings/#databases
diff --git a/api/catalog/wsgi.py b/api/catalog/wsgi.py
deleted file mode 100644
index f8458d06..00000000
--- a/api/catalog/wsgi.py
+++ /dev/null
@@ -1,23 +0,0 @@
-"""
-WSGI config for catalog project.
-
-It exposes the WSGI callable as a module-level variable named ``application``.
-
-For more information on this file, see
-https://docs.djangoproject.com/en/2.0/howto/deployment/wsgi/
-"""
-from gevent import monkey
-
-
-monkey.patch_all()
-import os
-
-from django.core.wsgi import get_wsgi_application
-
-from wsgi_basic_auth import BasicAuth
-
-
-os.environ.setdefault("DJANGO_SETTINGS_MODULE", "catalog.settings")
-
-application = get_wsgi_application()
-application = BasicAuth(application)

I've done no performance checking on this, but adding async to view routes and switching to the async base classes (if we need to do that?) seems like it would be the only bit left to make it work fully async. ORM methods might need to be changed as well...

It doesn't look like we're using any middleware that causes Django to have to adapt each request. I checked this by setting the log level to debug and checking for the django.logger logs about middleware adapting.

@dhruvkb
Copy link
Member

dhruvkb commented Nov 7, 2022

@sarayourfriend would switching to ASGI mean that we need to change the server from Gunicorn to something else?

@sarayourfriend
Copy link
Contributor

Yes, we would switch the production server to hypercorn, uvcorn, etc. I've used hypercorn in the past and liked it, it has almost the same api as gunicorn.

@sarayourfriend
Copy link
Contributor

It completely slipped by me that we would need DRF to also support async views for migrating to async Django to be possible. DRF still does not support async and it seems like it might not natively support it for a while (though a package could do the job).

The two alternatives listed by grequests (and mentioned in the issue description) are actually less recently maintained than grequests itself.

We could try just using Django's async_to_sync but I think we would pay a performance penalty from spinning up an event loop.

The Sentry issue continues to record events, so this problem does still exist today.

As I see it, there are two options:

  1. Try async_to_sync and measure whether there is a difference in performance in production. Revert if there is an unacceptable performance difference or a regression otherwise.
  2. Pull the async DRF implementation out of the PR and write our own async shim as recommended in the PR itself and dive head first into full async Django.

My preference is the second as (a) we'll want to move to async Django in the future anyway and (b) if we do the shimming right, it should be possible to swap it out for whatever more developed async DRF package comes out of the PR or we can just switch to async DRF when it is officially supported (if ever).

I'm going to tinker a bit with the second option as I am working on this mostly as a curiosity rather than it being a priority. I hope that is okay with folks. If anything I'm sure I'll learn something useful about our app and async generally in the process that will be applicable to a future async conversion (if one doesn't come out of this practice).

@sarayourfriend
Copy link
Contributor

I do not think it is worth the fuss to do number 2 right now. I've tried for a bit to get the second option working and it is going to take quite a bit of work from us to convert things to async with DRF.

Trying async_to_sync is probably the best approach here. We will want to implement some timings for the API first, however. Adding timeit around the validate_images function and logging the timings would be a good first step here.

@sarayourfriend
Copy link
Contributor

Drafted a PR for this long-standing issue today using aiohttp and async_to_sync. It was relatively painless (though the test rewrite I expect to be a struggle): #1027

@sarayourfriend
Copy link
Contributor

I was just re-reading my suggestion above and regarding the timeits, I don't think it's necessary... we can keep an eye on the existing request timing measurements we have and as long as those are stable this change should be safe, right? If we added more timing logs we wouldn't be in a much better place regarding identifying the root cause of a strange issue. We'll probably want to make sure that PR doesn't get released with another significant change though.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
💻 aspect: code Concerns the software code in the repository ✨ goal: improvement Improvement to an existing user-facing feature help wanted Open to participation from the community 🟨 priority: medium Not blocking but should be addressed soon 🐍 tech: python Requires familiarity with Python
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants