Make usages of Redis resilient to absence of Redis #3505
Conversation
This is looking so cool. I wish there was a less verbose way to handle the connection errors, like some setting or something to just say "ignore any connection errors and don't make me have to handle them from this client". I guess we could do it with a wrapper or something but we wouldn't get as good logging in those cases.
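For illustration, a wrapper like the one being described might look something like the sketch below; the decorator name and fallback behavior are made up, and this is not what the PR implements, but it shows the shape of the idea using redis-py's `ConnectionError`:

```python
# Hypothetical sketch, not part of this PR: a decorator that swallows
# Redis connection errors and substitutes a fallback value.
import functools
import logging

from redis.exceptions import ConnectionError as RedisConnectionError

logger = logging.getLogger(__name__)


def ignore_connection_errors(fallback=None):
    """Run the wrapped Redis call; return ``fallback`` if Redis is down."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except RedisConnectionError:
                # Generic logging: we lose the call-site context that a
                # hand-written try/except could include.
                logger.warning("Redis unavailable, skipping %s", func.__name__)
                return fallback

        return wrapper

    return decorator


@ignore_connection_errors(fallback=0)
def get_view_count(redis, identifier):
    # Hypothetical usage: a cached counter read that degrades to 0.
    return redis.get(f"views:{identifier}") or 0
```

The trade-off is exactly the one noted above: a generic wrapper can only log generically, losing the per-call-site detail a manual try/except can record.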
I wonder how important logging the missed operations is, though 🤔 What's the thought behind that? None of the data is technically critical, and I don't think we could realistically back-fill it from logs if it was anyway, right?
I only wrote that info to the logs to be safe in case we did need it for some debugging or monitoring later. Backfilling it was not a goal, and none of the cached info seems serious enough to be considered critical, except for two, which are severely impacted by the lack of a cache:
Would it make sense to remove the logged data (but keep the error log line) for the non-critical cases? Those two you mentioned are definitely the most interesting ones.

Throttling is probably fine to skip without issue for a temporary Redis outage; it's not even something I would consider an error in itself. The error there is Redis's inaccessibility, not an issue with throttling. For a search request we'd get several essentially duplicate log lines, one from each failed throttle class, and then at least a couple more from the search itself (dead link mask and cache, and the tallies).

I agree some of those are critical, I'm just not sure why the right monitoring choice is error-level log lines with unusable data, rather than adding monitoring to check whether Redis is unavailable generally. The reality is that even if Redis were out for 5 seconds, our services would be fine: they've survived for a long time without throttling working at all, and after it came back, the dead link cache would start working again and we would see a big reduction. But if that isn't fine, if it's such a bad case that it really warrants an error-level log (meaning we cannot function without it), then we should have some redundancy in place, like deploying a new Redis instance, porting data over to it, then moving our API to use that one instead.

My concern is that error-level logging, with the data itself logged, for something we know will happen for short periods but should otherwise never happen, gives us error logs without any real visibility. For non-critical uses of Redis, where the request would just take longer or non-critical features can't work, it seems like something that is safe to ignore in the API code, assuming we have another way of monitoring an outage-level issue with Redis. Individual monitoring points with error-level logging of Redis connection errors in the API at most duplicate the Redis-specific monitoring without giving any new information. If these logs went off, we'd just know Redis was out for a while. If it was out for a long while, we'd have much bigger issues than trying to backfill the data, much of which we can technically operate without in a pinch.

Anyway, just wondering what the error-level logs get us in any of these cases, if a temporary Redis connection error is something we can deal with (it isn't catastrophic), and if a larger-scale Redis issue would be better monitored using a direct monitor of Redis rather than these indirect and disparate monitors. I don't have any strong answers here and my brain is quite foggy, just something worth thinking about and having a clear answer for, so we know what these error logs mean from an actionability/monitoring perspective.
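As a rough illustration of the "direct monitor" alternative mentioned above (purely a sketch; the URL, timeout, and function name are placeholder assumptions, not anything in this PR), a single PING-based check yields one actionable signal instead of scattered per-call-site error logs:

```python
# Hypothetical direct Redis monitor, for illustration only: one PING-based
# check that alerting can watch, instead of indirect error logs from every
# call site that happens to touch Redis.
import logging

import redis

logger = logging.getLogger(__name__)


def redis_is_healthy(url: str = "redis://localhost:6379/0") -> bool:
    client = redis.Redis.from_url(url, socket_connect_timeout=1)
    try:
        return bool(client.ping())
    except redis.exceptions.ConnectionError:
        logger.error("Redis is unreachable")  # the single, unambiguous signal
        return False
```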
LGTM. I left some requested changes just to add whitespace to help with parsing the functions. I had a hard time reviewing some of these changes because the code is too compact for me to understand easily.
But it LGTM. I wish there was a less verbose way to handle all of this, and for some reason it feels wrong to need to wrap every Redis call in a try/except like this (we don't do it for the ORM, but I guess that's a different level of "criticality" than Redis); I've never done it before in other applications I've worked on that made calls to Redis. But, like I've said elsewhere, I have no idea how to avoid it without writing some kind of naive shim that does the try/except for us.
Anyway, nothing blocking here.
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @stacimc

Excluding weekend days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s). @dhruvkb, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
Amazing work with this @dhruvkb! I have some questions and comments, but nothing to block a merge 😄 I'm similarly sad there's no easy way to wrap all of the operations differently, but I think trying to do so might cause even more trouble.
I disabled the cache container locally and was able to use the API just fine, well done!
cache_availability_params = pytest.mark.parametrize(
    "is_cache_reachable, cache_name",
    [(True, "redis"), (False, "unreachable_redis")],
)
Is there any way to reuse this, even if it means importing it from somewhere? That way the description I mentioned above only has to be made once, potentially.
They're all different in terms of the fixtures they use so defining them in one place would not be very helpful.
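For reference, the reuse suggested above could look something like the sketch below (the shared module path is hypothetical). Since the mark only carries fixture names as strings, each test module could, in principle, still resolve its own fixtures at runtime:

```python
# Hypothetical shared module, e.g. test/fixtures/cache.py. The mark only
# names the fixtures as strings; it does not import them.
import pytest

cache_availability_params = pytest.mark.parametrize(
    "is_cache_reachable, cache_name",
    [(True, "redis"), (False, "unreachable_redis")],
)


# In a test module: apply the shared mark and resolve the fixture by name.
@cache_availability_params
def test_search_without_cache(request, is_cache_reachable, cache_name):
    cache = request.getfixturevalue(cache_name)
    assert cache is not None
```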
Fixes

Fixes #3385 by @AetherUnbound

Description

General changes
- … the `ContentProvider` model.
- … `CACHES` to a separate settings file.

Throttling
- …

Search controller
- …

OAuth2
- … `None`.
- … `None` usage counts.

Tallies
- …

Image proxy
- … `incr` requests to a pipeline.
- … `None` and determined from the `Content-Type` header.
- An `itertools.count` counter is used to space out exceptions being sent to Sentry (see the sketch after this list). AFAIK, `itertools.count` is per worker process, so we will still likely be sending 20x the number of Sentry events we expect.
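To make the spacing mechanism concrete, here is a minimal sketch (the function name and sampling interval are assumptions, not the PR's actual values):

```python
# Sketch of spacing out Sentry reports with itertools.count. The counter
# lives in module scope, so each worker process gets its own copy, which
# is why roughly 20x the expected number of events may still be sent.
import itertools

import sentry_sdk

_error_counter = itertools.count()


def report_sparingly(exc: Exception, every: int = 100) -> None:
    # Forward only every ``every``-th exception to Sentry.
    if next(_error_counter) % every == 0:
        sentry_sdk.capture_exception(exc)
```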
Dead links
- … `[]` and recomputed (see the sketch below).
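A hedged sketch of this fallback (the function name, key format, and client are assumptions): when Redis is unreachable, the cached mask is treated as empty and recomputed downstream instead of raising:

```python
# Illustration only: fall back to an empty dead link mask ([]) when Redis
# cannot be reached, so the mask is simply recomputed downstream.
import logging

from redis.exceptions import ConnectionError as RedisConnectionError

logger = logging.getLogger(__name__)


def get_query_mask(redis_client, query_hash: str) -> list[int]:
    try:
        values = redis_client.lrange(f"{query_hash}:dead_link_mask", 0, -1)
        return [int(value) for value in values]
    except RedisConnectionError:
        logger.warning("Redis unavailable; dead link mask will be recomputed.")
        return []
```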
Testing Instructions

All changes have accompanying CI tests. Additionally, you can try to use the API with the Redis service stopped and see that things still work normally.
Checklist
- My pull request has a descriptive title (not a vague title like `Update index.md`).
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.
- …

Developer Certificate of Origin