
Make usages of Redis resilient to absence of Redis #3505

Merged: 23 commits merged into main on Dec 20, 2023

Conversation

@dhruvkb (Member) commented Dec 10, 2023

Fixes

Fixes #3385 by @AetherUnbound

Description

General changes

  • Refactored fixtures to remove duplicates and set up a separate fixtures package.
  • Added a factory for the ContentProvider model.
  • Moved CACHES to a separate settings file.
  • Replaced direct cache manipulations with fake Redis.

Throttling

  • Removed exempt IPs flow.
  • Added handling for Redis absence.
    • All incoming requests are allowed.
  • Updated tests to account for headers being absent when history cannot be determined from cache.
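The fail-open throttling behaviour can be sketched roughly as follows. The names (`allow_request`, the key format, the fake client) are illustrative, not the PR's actual code; the real changes live in api/api/utils/throttle.py.

```python
import logging

logger = logging.getLogger(__name__)


class UnreachableRedis:
    """Stand-in for a Redis client whose server is down."""

    def incr(self, key):
        raise ConnectionError("Redis is unreachable")


def allow_request(redis, key, limit):
    """Fail open: if Redis cannot be reached, allow the request and
    return no usage count (so rate-limit headers are omitted)."""
    try:
        count = redis.incr(key)
    except ConnectionError:
        logger.warning("Redis unreachable; request throttling is skipped")
        return True, None
    return count <= limit, count
```

With Redis down, every request is allowed and the headers that depend on the usage count are left out, which is what the updated tests check for.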

Search controller

  • Added handling for Redis absence in filtered providers.
    • Filtered providers are computed from the DB.
  • Added handling for Redis absence in stats.
    • Sources will be obtained from Elasticsearch.
    • Also fixed a bug in handling of outdated formats.
  • Added tests for filtered providers computed from DB and sources computed from ES.
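The cache-then-DB fallback described above might be sketched like this (hypothetical function names; the real logic is in api/api/controllers/search_controller.py):

```python
def get_filtered_providers(cache_get, compute_from_db):
    """Prefer the cached provider list; fall back to the database when
    Redis is unreachable or the key is missing."""
    try:
        cached = cache_get("filtered_providers")
    except ConnectionError:
        cached = None  # Redis down: recompute from the DB
    if cached is not None:
        return cached
    return compute_from_db()
```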

OAuth2

  • Added handling for Redis absence when checking usage.
    • Usage is shown as None.
    • Status code for the response is 424.
  • Added tests for 424 response and None usage counts.
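As a sketch, the usage check might degrade like this. The 424 status and the None usage are from the PR; the key format and function name here are assumptions.

```python
HTTP_424_FAILED_DEPENDENCY = 424


def check_usage(cache_get, client_id):
    """Return (usage, status). When Redis is unreachable, usage is None
    and the response carries a 424 Failed Dependency status."""
    try:
        raw = cache_get(f"usage:{client_id}")
    except ConnectionError:
        return None, HTTP_424_FAILED_DEPENDENCY
    return int(raw or 0), 200
```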

Tallies

  • Added handling for Redis absence when tallying providers.
    • Just log it!
  • Added tests for the added log lines.
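The "just log it" handling can be sketched as follows (illustrative names; the key format is an assumption):

```python
import logging

logger = logging.getLogger(__name__)


def tally_providers(cache_incr, providers):
    """Count search results per provider; on a Redis outage, log and
    move on rather than failing the request."""
    try:
        for provider in providers:
            cache_incr(f"provider_tally:{provider}")
    except ConnectionError:
        logger.warning("Redis unreachable; provider tallies were not recorded")
        return False
    return True
```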

Image proxy

  • Converted multiple incr requests to a pipeline.
  • Added handling for Redis absence when tallying response codes, HTTP errors, and other errors.
    • Just log it!
  • Added handling for Redis absence when determining image extension.
    • Extension is never cached; it is always treated as unknown, i.e. None, and determined from the Content-Type header.
  • An itertools.count counter is used to space out exceptions being sent to Sentry. Afaik, itertools.count is per worker process, so we will still likely be sending 20x the number of Sentry events we expect.
  • Added tests for the added log lines.
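The pipeline batching and the itertools.count spacing might look roughly like this. FakePipeline stands in for redis-py's pipeline; the key names and sampling rate are assumptions, and log levels stand in for Sentry reporting in this sketch.

```python
import itertools
import logging

logger = logging.getLogger(__name__)

# Per worker process: with N workers, roughly N times as many
# sampled events get reported as this rate suggests.
_error_count = itertools.count(start=1)


class FakePipeline:
    """Stand-in for redis.Redis().pipeline(); buffers INCRs locally."""

    def __init__(self, reachable=True):
        self.reachable = reachable
        self.commands = []

    def incr(self, key):
        self.commands.append(("INCR", key))
        return self

    def execute(self):
        if not self.reachable:
            raise ConnectionError("Redis is unreachable")
        return [1] * len(self.commands)


def tally_response(pipe, month, status_code, sample_every=20):
    """Batch the tally INCRs into one round trip; on a Redis outage,
    log (and only occasionally escalate) instead of failing the request."""
    pipe.incr(f"thumbnail_response_codes:{month}:{status_code}")
    pipe.incr(f"thumbnail_requests:{month}")
    try:
        pipe.execute()
        return True
    except ConnectionError:
        if next(_error_count) % sample_every == 0:
            logger.error("Redis unreachable while tallying thumbnails")
        else:
            logger.warning("Redis unreachable while tallying thumbnails")
        return False
```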

Dead links

  • Added handling for Redis absence when getting and saving query masks.
    • Query mask is never cached; it always starts as a blank list, i.e. [], and is recomputed.
  • Added handling for Redis absence when checking image liveness.
    • All URLs will always be revalidated by checking HTTP response status code.
  • Added tests for the recomputed query mask and log lines.
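The query-mask fallback might be sketched like this (the key format and function name are assumptions for illustration):

```python
def get_query_mask(cache_get, query_hash):
    """Return the cached dead-link validity mask, or an empty list
    (forcing full revalidation) when Redis is unreachable."""
    try:
        cached = cache_get(f"valid_mask:{query_hash}")
    except ConnectionError:
        return []  # every URL gets revalidated via its HTTP status
    if cached is None:
        return []
    return [int(bit) for bit in cached]
```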

Testing Instructions

All changes have accompanying CI tests. Additionally, you can try to use the API with the Redis service stopped and see that things still work normally.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@github-actions github-actions bot added the 🧱 stack: api Related to the Django API label Dec 10, 2023
@dhruvkb dhruvkb added 🟧 priority: high Stalls work on the project or its dependents 💻 aspect: code Concerns the software code in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users labels Dec 10, 2023
@dhruvkb dhruvkb mentioned this pull request Dec 12, 2023
8 tasks
@sarayourfriend (Collaborator) left a comment

This is looking so cool. I wish there was a less verbose way to handle the connection errors, like some setting or something to just say "ignore any connection errors and don't make me have to handle them from this client". I guess we could do it with a wrapper or something but we wouldn't get as good logging in those cases.

I wonder how important logging the missed operations is, though 🤔 What's the thought behind that? None of the data is technically critical, and I don't think we could realistically back-fill it from logs if it was anyway, right?

Review threads (resolved): api/api/utils/throttle.py, api/conf/settings/security.py
@dhruvkb (Member, Author) commented Dec 12, 2023

None of the data is technically critical, and I don't think we could realistically back-fill it from logs if it was anyway, right?

I only wrote that info to the logs to be safe in case we did need it for some debugging or monitoring later. Backfilling it was not a goal, and none of the cached info seems serious enough to be considered critical, except for two, which are severely impacted by the lack of a cache:

  • throttling which gets disabled
  • dead link masking which makes several times more API calls

@sarayourfriend (Collaborator)

Would it make sense to remove the logged data (but keep the error log line) for the non-critical cases?

Those two you mentioned are definitely the most interesting ones. Throttling is probably fine to skip without issue for a temporary Redis outage; like it's even something I wouldn't consider an error in itself. The error there is Redis's inaccessibility, not an issue with throttling. For a search request we'd get several essentially duplicate log lines, one from each failed throttle class, and then at least a couple more from the search itself (dead link mask and cache, and the tallies).

I agree some of those are critical, I'm just not sure why the right monitoring choice is error-level log lines with unusable data, rather than adding monitoring to check if Redis is unavailable generally.

The reality is that even if Redis were out for 5 seconds, our services would be fine: they've survived for a long time without throttling working at all, and after it came back, the dead link cache would start working again and we would see a big reduction.

But if that isn't fine, if it's such a bad case that it really warrants an error-level log (meaning we cannot function without it), then we should have some redundancy in place, like deploying a new Redis instance, porting data over to it, and moving our API to use that one instead. My concern is that error-level logging, with the data itself logged, for something we know will happen for short periods (but otherwise should never happen at all), produces error logs without giving us any real visibility.

For non-critical uses of Redis, uses where the request would just take longer, or non-critical features can't work, then it seems like something that is safe to ignore in the API code, assuming we have another way of monitoring an outage-level issue with Redis. Individual monitoring points with error-level logging of Redis connection errors in the API at most duplicates the Redis-specific monitoring without giving any new information. If these logs went off, we'd just know Redis is out for a while. If it was out for a long while, we'd have much bigger issues than trying to back fill the data, much of which we can technically operate without in a pinch.

Anyway, I'm just wondering what the error-level logs get us in any of these cases, if a temporary Redis connection error is something we can deal with (it isn't catastrophic), and if a larger-scale Redis issue would be better monitored with a direct monitor of Redis rather than these indirect and disparate monitors. I don't have any strong answers here and my brain is quite foggy; it's just something worth thinking about and having a clear answer for, so we know what these error logs mean from an actionability/monitoring perspective.

@dhruvkb dhruvkb marked this pull request as ready for review December 14, 2023 03:00
@dhruvkb dhruvkb requested a review from a team as a code owner December 14, 2023 03:00
@sarayourfriend (Collaborator) left a comment

LGTM. I left some requested changes just to add whitespace to help with parsing the functions. I had a hard time reviewing some of these changes because the code is too compact for me to understand easily.

But it LGTM. I wish there were a less verbose way to handle all of this; for some reason it feels wrong to need to wrap every Redis call in a try/except like this (we don't do it for the ORM, but I guess that's a different level of "criticality" than Redis), and I've never done it before in other applications I've worked on that made calls to Redis. But, like I've said elsewhere, I have no idea how to avoid it without writing some kind of naive shim that does the try/except for us.

Anyway, nothing blocking here.
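For reference, the kind of naive shim discussed here might be sketched as follows (hypothetical, not part of the PR):

```python
import functools
import logging

logger = logging.getLogger(__name__)


def ignore_redis_errors(default=None):
    """Swallow Redis connection errors and return a default value.
    The trade-off: one generic log line instead of context-specific
    handling and logging at each call site."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except ConnectionError:
                logger.warning("Redis unreachable in %s", func.__name__)
                return default

        return wrapper

    return decorator
```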

Review threads: api/api/controllers/search_controller.py (5), api/api/utils/image_proxy/__init__.py, api/api/utils/image_proxy/extension.py (3), api/test/conftest.py
@openverse-bot (Collaborator)

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend days [1], this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s) [2].

@dhruvkb, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@dhruvkb dhruvkb mentioned this pull request Dec 19, 2023
7 tasks
@AetherUnbound (Collaborator) left a comment

Amazing work with this @dhruvkb! I have some questions and comments, but nothing to block a merge 😄 I'm similarly sad there's no easy way to wrap all of the operations differently, but I think trying to do so might cause even more trouble.

I disabled the cache container locally and was able to use the API just fine, well done!

Review thread: api/test/test_auth.py
Comment on lines +52 to +55
cache_availability_params = pytest.mark.parametrize(
"is_cache_reachable, cache_name",
[(True, "redis"), (False, "unreachable_redis")],
)
A collaborator commented:

Is there any way to reuse this, even if it means importing it from somewhere? That way the description I mentioned above only has to be made once, potentially.

@dhruvkb (Member, Author) replied:

They're all different in terms of the fixtures they use so defining them in one place would not be very helpful.

@dhruvkb dhruvkb merged commit f56f731 into main Dec 20, 2023
44 checks passed
@dhruvkb dhruvkb deleted the redis_resilience branch December 20, 2023 06:32
@dhruvkb dhruvkb mentioned this pull request Jan 4, 2024
3 tasks
Labels
  • 💻 aspect: code (Concerns the software code in the repository)
  • 🧰 goal: internal improvement (Improvement that benefits maintainers, not users)
  • 🟧 priority: high (Stalls work on the project or its dependents)
  • 🧱 stack: api (Related to the Django API)
Development

Successfully merging this pull request may close these issues.

Make API resilient to a Redis outage
4 participants