Move off of grequests #486

AetherUnbound · 2022-01-25T19:25:26Z

Problem

The API currently uses the grequests library for handling some of the head requests it completes when returning results (specifically in validate_images.py).

Description

The README for the project says:

Note: You should probably use requests-threads or requests-futures instead.

The project has also not had a release since 2020. We might want to consider using one of those proposed alternatives.

Alternatives

Leave as is and hope for the best!

Additional context

Some more context in Sentry: https://sentry.io/share/issue/061ba99fc3df4c23bdb7643d337bbda0/

Implementation

🙋 I would be interested in implementing this feature.

The text was updated successfully, but these errors were encountered:

sarayourfriend · 2022-01-31T11:27:07Z

This would be a great issue for a new contributor to pick up, if anyone is looking for something to get started with 😀

samajain · 2022-04-03T22:56:12Z

Hi @sarayourfriend,

I came across your project and I am interested in contributing! I am new to opensource community contributions but I have 5 years of experience working as an engineer with python, C++ and django frameworks. Can I work on this issue?

From the description, what I concluded is that we want to migrate from grequests to some other multi-threaded handling of requests, possibly requests-threads? Please share any other information which you think might be needed to work on the issue.

Thanks!

sarayourfriend · 2022-04-04T05:31:00Z

@samajain Hi! Welcome to Openverse 😁 It'd be great to have you work on this issue. Your understanding of the issue and one of the potential solutions is correct.

request-threads could be a good issue but we don't currently use Twisted, which requests-threads depends on 🤔 I wonder if we could also look into using Django's builtin asyncio support and using something like httpx or one of the other popular async HTTP libraries. requests-threads hasn't been updated in a few years (including not having technical support for the latest Python versions).

If you feel comfortable doing that kind of research for the trade-offs between the different options then let us know 🙂 Unfortunately this issue might not be as clear cut as it seems now that I've looked into it a bit more 😕 If you'd prefer working one something more straight forward then also let me know and I can help you find something else.

Thanks for your interest in the project!

samajain · 2022-04-04T13:53:15Z

I can do the research if you want me to and share my findings.

If not this issue then I can take up something else too 🙂 Since you have been part of this project, I leave it to your call.

I'll dig up for something from the good first issues label.

AetherUnbound · 2022-04-04T19:32:27Z

We'd be more than happy to let you take this one on! If you're comfortable gathering the research and posting it here, we can work together to decide which library/path forward might be best 🙂

samajain · 2022-04-04T19:56:26Z

Sure! I can do that. What kind of tradeoff aspects are you looking for? Shall I compare them based on their performance metrics or are you looking for a comparison on anything specific other than this?

AetherUnbound · 2022-04-04T20:57:56Z

I'll let @sarayourfriend chime in with any other thoughts, but I think my greatest concern would be 1) what is the maintenance burden we incur with whatever dependency we go with and 2) how much restructuring of the codebase do we need to do in order to properly integrate the library. For instance, if we do go with an async-based library, how much will we need to change to make the application support async calls. Performance metrics would be useful but secondary I think to these factors 😄

sarayourfriend · 2022-04-05T06:12:19Z

I'm less concerned with the maintenance burden initially than with the performance impact of using async in Django 4. The Django docs mention that using async views (which is fully supported and does not require any restructuring, apparently 🤔) has some performance trade offs. It doesn't explain what those are, so it'd be nice to understand them. We could move the Django app to ASGI and run it with uvicorn instead of gunicorn, but that my gut tells me that's probably a bigger move than we need to solve this particular problem. It would also need to involve larger infrastructure changes that we don't necessarily have capacity for at the moment. To quote the Django docs:

Async views will still work under WSGI, but with performance penalties, and without the ability to have efficient long-running requests.

The only place we're using grequests right now is basically to fire off a bunch of HTTP requests in parallel, so the long-running requests issue isn't important. However, the usage is in the main search routine, so it's extremely critical path. It's perhaps the view where performance matters the most for us.

My preference would be to use the built in Django async support as it's in line with mainline Python support (and I think is generally where things are headed, mostly away from Twisted, etc). But if the performance impact is bad, then we'll necessarily need to suffer the maintenance burden of introducing a "bolted on" asynchronous library for now.

All of that to say, I'd recommend looking into what the trade offs are with using Django's built in asyncio view support under WSGI first, then moving onto seeing what other options exist if the penalties for partial async in Django WSGI app are bad.

AetherUnbound · 2022-04-05T16:14:24Z

However, the usage is in the main search routine, so it's extremely critical path. It's perhaps the view where performance matters the most for us.

That's a great point, thanks for mentioning it! And I'm glad that it doesn't seem like async support in Django will be too difficult from a code perspective.

samajain · 2022-04-07T02:26:26Z

@AetherUnbound @sarayourfriend Thank you for your inputs. I will do my research and get back soon!

sarayourfriend · 2022-08-10T13:25:02Z

@samajain Did you get a chance to look into this?

Given Django's recent release of 4.2 adding async to the ORM, I think it is a good time to start planning a switch over to the async Django API. I don't think this will be a trivial change to do correctly and I don't know if it can be done in stages. Would it be detrimental, for example, to delay switching to async Redis and Elasticsearch clients and Django ORM to happen after switching to async Django generally to be able to use an async http library to replace our usage of grequests?

sarayourfriend · 2022-08-16T23:44:48Z

I was reading more about Django's async support. It looks like we might be able to relatively easily get off of grequests by using Django's in-depth async_to_sync function to run an async version of the validate_images function.

@WordPress/openverse-maintainers I think we could prioritize this issue higher if someone wanted to try out async_to_sync with a regular asyncio HTTP Python library like httpx or similar.

AetherUnbound · 2022-08-16T23:53:37Z

I am all for this!

Prakharkarsh1 · 2022-09-10T10:39:48Z

Hii @AetherUnbound I also want to contribute to fix this issue

dhruvkb · 2022-09-10T23:33:59Z

@Prakharkarsh1 please go ahead! Feel free to comment here if you need any pointers getting started.

sarayourfriend · 2022-11-03T00:17:27Z

I took a peek at this today and moving to ASGI for our application would be mostly trivial (I think). The bulk of the work would be to remove grequests and replacing it with something like httpx or aiohttp (plus maybe AnyIO to make "grouping" the requests easier, if the libraries themselves do not natively support it).

Updating unit tests would probably take up most of the time spent on this. Due to grequests nature, the tests do a fair bit of unusual mocking to try to stay out of the way as much as possible, but it's very specific to grequests. Perhaps adding an abstraction layer here would help to simplify the tests in general.

sarayourfriend · 2022-11-03T00:20:10Z

Here is a small diff that moves the app to ASGI:

diff --git a/api/catalog/asgi.py b/api/catalog/asgi.py
new file mode 100644
index 00000000..6b3169f4
--- /dev/null
+++ b/api/catalog/asgi.py
@@ -0,0 +1,7 @@
+import os
+
+from django.core.asgi import get_asgi_application
+
+os.environ.setdefault("DJANGO_SETTINGS_MODULE", "catalog.settings")
+
+application = get_asgi_application()
diff --git a/api/catalog/settings.py b/api/catalog/settings.py
index 268ebb62..192feca2 100644
--- a/api/catalog/settings.py
+++ b/api/catalog/settings.py
@@ -236,7 +236,7 @@ TEMPLATES = [
     },
 ]
 
-WSGI_APPLICATION = "catalog.wsgi.application"
+ASGI_APPLICATION = "catalog.aisgi.application"
 
 # Database
 # https://docs.djangoproject.com/en/2.0/ref/settings/#databases
diff --git a/api/catalog/wsgi.py b/api/catalog/wsgi.py
deleted file mode 100644
index f8458d06..00000000
--- a/api/catalog/wsgi.py
+++ /dev/null
@@ -1,23 +0,0 @@
-"""
-WSGI config for catalog project.
-
-It exposes the WSGI callable as a module-level variable named ``application``.
-
-For more information on this file, see
-https://docs.djangoproject.com/en/2.0/howto/deployment/wsgi/
-"""
-from gevent import monkey
-
-
-monkey.patch_all()
-import os
-
-from django.core.wsgi import get_wsgi_application
-
-from wsgi_basic_auth import BasicAuth
-
-
-os.environ.setdefault("DJANGO_SETTINGS_MODULE", "catalog.settings")
-
-application = get_wsgi_application()
-application = BasicAuth(application)

I've done no performance checking on this, but adding async to view routes and switching to the async base classes (if we need to do that?) seems like it would be the only bit left to make it work fully async. ORM methods might need to be changed as well...

It doesn't look like we're using any middleware that causes Django to have to adapt each request. I checked this by setting the log level to debug and checking for the django.logger logs about middleware adapting.

dhruvkb · 2022-11-07T14:24:51Z

@sarayourfriend would switching to ASGI mean that we need to change the server from Gunicorn to something else?

sarayourfriend · 2022-11-07T22:57:22Z

Yes, we would switch the production server to hypercorn, uvcorn, etc. I've used hypercorn in the past and liked it, it has almost the same api as gunicorn.

sarayourfriend · 2022-11-10T05:18:41Z

It completely slipped by me that we would need DRF to also support async views for migrating to async Django to be possible. DRF still does not support async and it seems like it might not natively support it for a while (though a package could do the job).

The two alternatives listed by grequests (and mentioned in the issue description) are actually less recently maintained than grequests itself.

We could try just using Django's async_to_sync but I think we would pay a performance penalty from spinning up an event loop.

The Sentry issue continues to record events, so this problem does still exist today.

As I see it, there are two options:

Try async_to_sync and measure whether there is a difference in performance in production. Revert if there is an unacceptable performance difference or a regression otherwise.
Pull the async DRF implementation out of the PR and write our own async shim as recommended in the PR itself and dive head first into full async Django.

My preference is the second as (a) we'll want to move to async Django in the future anyway and (b) if we do the shimming right, it should be possible to swap it out for whatever more developed async DRF package comes out of the PR or we can just switch to async DRF when it is officially supported (if ever).

I'm going to tinker a bit with the second option as I am working on this mostly as a curiosity rather than it being a priority. I hope that is okay with folks. If anything I'm sure I'll learn something useful about our app and async generally in the process that will be applicable to a future async conversion (if one doesn't come out of this practice).

sarayourfriend · 2022-11-15T02:37:53Z

I do not think it is worth the fuss to do number 2 right now. I've tried for a bit to get the second option working and it is going to take quite a bit of work from us to convert things to async with DRF.

Trying async_to_sync is probably the best approach here. We will want to implement some timings for the API first, however. Adding timeit around the validate_images function and logging the timings would be a good first step here.

sarayourfriend · 2022-12-02T04:23:48Z

Drafted a PR for this long-standing issue today using aiohttp and async_to_sync. It was relatively painless (though the test rewrite I expect to be a struggle): #1027

sarayourfriend · 2022-12-02T04:25:47Z

I was just re-reading my suggestion above and regarding the timeits, I don't think it's necessary... we can keep an eye on the existing request timing measurements we have and as long as those are stable this change should be safe, right? If we added more timing logs we wouldn't be in a much better place regarding identifying the root cause of a strange issue. We'll probably want to make sure that PR doesn't get released with another significant change though.

AetherUnbound added 🟩 priority: low Low priority and doesn't need to be rushed ✨ goal: improvement Improvement to an existing user-facing feature 💻 aspect: code Concerns the software code in the repository 🐍 tech: python Requires familiarity with Python labels Jan 25, 2022

AetherUnbound mentioned this issue Jan 28, 2022

Gunicorn workers unexpected quitting in production #489

Closed

1 task

sarayourfriend added good first issue New-contributor friendly help wanted Open to participation from the community labels Jan 31, 2022

sarayourfriend mentioned this issue Feb 6, 2022

PyCascades 2022 Sprinting WordPress/openverse#153

Closed

dhruvkb assigned Prakharkarsh1 Sep 10, 2022

krysal added 🟨 priority: medium Not blocking but should be addressed soon and removed 🟩 priority: low Low priority and doesn't need to be rushed good first issue New-contributor friendly labels Oct 25, 2022

krysal unassigned Prakharkarsh1 Oct 25, 2022

sarayourfriend self-assigned this Dec 2, 2022

sarayourfriend mentioned this issue Dec 2, 2022

Replace grequests with asyncio solution #1027

Merged

6 tasks

sarayourfriend closed this as completed in #1027 Dec 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move off of grequests #486

Move off of grequests #486

AetherUnbound commented Jan 25, 2022 •

edited by krysal

Loading

sarayourfriend commented Jan 31, 2022

samajain commented Apr 3, 2022 •

edited

Loading

sarayourfriend commented Apr 4, 2022

samajain commented Apr 4, 2022

AetherUnbound commented Apr 4, 2022

samajain commented Apr 4, 2022

AetherUnbound commented Apr 4, 2022

sarayourfriend commented Apr 5, 2022 •

edited

Loading

AetherUnbound commented Apr 5, 2022

samajain commented Apr 7, 2022

sarayourfriend commented Aug 10, 2022

sarayourfriend commented Aug 16, 2022

AetherUnbound commented Aug 16, 2022

Prakharkarsh1 commented Sep 10, 2022

dhruvkb commented Sep 10, 2022

sarayourfriend commented Nov 3, 2022

sarayourfriend commented Nov 3, 2022

dhruvkb commented Nov 7, 2022

sarayourfriend commented Nov 7, 2022

sarayourfriend commented Nov 10, 2022

sarayourfriend commented Nov 15, 2022

sarayourfriend commented Dec 2, 2022

sarayourfriend commented Dec 2, 2022

Move off of grequests #486

Move off of grequests #486

Comments

AetherUnbound commented Jan 25, 2022 • edited by krysal Loading

Problem

Description

Alternatives

Additional context

Implementation

sarayourfriend commented Jan 31, 2022

samajain commented Apr 3, 2022 • edited Loading

sarayourfriend commented Apr 4, 2022

samajain commented Apr 4, 2022

AetherUnbound commented Apr 4, 2022

samajain commented Apr 4, 2022

AetherUnbound commented Apr 4, 2022

sarayourfriend commented Apr 5, 2022 • edited Loading

AetherUnbound commented Apr 5, 2022

samajain commented Apr 7, 2022

sarayourfriend commented Aug 10, 2022

sarayourfriend commented Aug 16, 2022

AetherUnbound commented Aug 16, 2022

Prakharkarsh1 commented Sep 10, 2022

dhruvkb commented Sep 10, 2022

sarayourfriend commented Nov 3, 2022

sarayourfriend commented Nov 3, 2022

dhruvkb commented Nov 7, 2022

sarayourfriend commented Nov 7, 2022

sarayourfriend commented Nov 10, 2022

sarayourfriend commented Nov 15, 2022

sarayourfriend commented Dec 2, 2022

sarayourfriend commented Dec 2, 2022

AetherUnbound commented Jan 25, 2022 •

edited by krysal

Loading

samajain commented Apr 3, 2022 •

edited

Loading

sarayourfriend commented Apr 5, 2022 •

edited

Loading