Make usages of Redis resilient to absence of Redis #3505
Conversation
This is looking so cool. I wish there was a less verbose way to handle the connection errors, like some setting or something to just say "ignore any connection errors and don't make me have to handle them from this client". I guess we could do it with a wrapper or something but we wouldn't get as good logging in those cases.
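For illustration, a wrapper like the one being described might look something like the sketch below; the decorator name and fallback behavior are made up, and this is not what the PR implements, but it shows the shape of the idea using redis-py's `ConnectionError`:

```python
# Hypothetical sketch, not part of this PR: a decorator that swallows
# Redis connection errors and substitutes a fallback value.
import functools
import logging

from redis.exceptions import ConnectionError as RedisConnectionError

logger = logging.getLogger(__name__)


def ignore_connection_errors(fallback=None):
    """Run the wrapped Redis call; return ``fallback`` if Redis is down."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except RedisConnectionError:
                # Generic logging: we lose the call-site context that a
                # hand-written try/except could include.
                logger.warning("Redis unavailable, skipping %s", func.__name__)
                return fallback

        return wrapper

    return decorator


@ignore_connection_errors(fallback=0)
def get_view_count(redis, identifier):
    # Hypothetical usage: a cached counter read that degrades to 0.
    return redis.get(f"views:{identifier}") or 0
```

The trade-off is exactly the one noted above: a generic wrapper can only log generically, losing the per-call-site detail a manual try/except can record.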
I wonder how important logging the missed operations is, though 🤔 What's the thought behind that? None of the data is technically critical, and I don't think we could realistically back-fill it from logs if it was anyway, right?
I only wrote that info to the logs to be safe in case we did need it for some debugging or monitoring later. Backfilling it was not a goal, and none of the cached info seems serious enough to be considered critical, except for two, which are severely impacted by the lack of a cache:
Would it make sense to remove the logged data (but keep the error log line) for the non-critical cases? Those two you mentioned are definitely the most interesting ones.

Throttling is probably fine to skip without issue for a temporary Redis outage; it's not even something I would consider an error in itself. The error there is Redis's inaccessibility, not an issue with throttling. For a search request we'd get several essentially duplicate log lines, one from each failed throttle class, and then at least a couple more from the search itself (dead link mask and cache, and the tallies).

I agree some of those are critical, I'm just not sure why the right monitoring choice is error-level log lines with unusable data, rather than adding monitoring to check whether Redis is unavailable generally. The reality is that even if Redis were out for 5 seconds, our services would be fine: they've survived for a long time without throttling working at all, and after it came back, the dead link cache would start working again and we would see a big reduction. But if that isn't fine, if it's such a bad case that it really warrants an error-level log (meaning we cannot function without it), then we should have some redundancy in place, like deploying a new Redis instance, porting data over to it, then moving our API to use that one instead.

My concern is that error-level logging, with the data itself logged, for something we know will happen for short periods but should otherwise never happen, gives us error logs without any real visibility. For non-critical uses of Redis, where the request would just take longer or non-critical features can't work, it seems like something that is safe to ignore in the API code, assuming we have another way of monitoring an outage-level issue with Redis. Individual monitoring points with error-level logging of Redis connection errors in the API at most duplicate the Redis-specific monitoring without giving any new information. If these logs went off, we'd just know Redis was out for a while. If it was out for a long while, we'd have much bigger issues than trying to backfill the data, much of which we can technically operate without in a pinch.

Anyway, just wondering what the error-level logs get us in any of these cases, if a temporary Redis connection error is something we can deal with (it isn't catastrophic), and if a larger-scale Redis issue would be better monitored using a direct monitor of Redis rather than these indirect and disparate monitors. I don't have any strong answers here and my brain is quite foggy, just something worth thinking about and having a clear answer for, so we know what these error logs mean from an actionability/monitoring perspective.
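As a rough illustration of the "direct monitor" alternative mentioned above (purely a sketch; the URL, timeout, and function name are placeholder assumptions, not anything in this PR), a single PING-based check yields one actionable signal instead of scattered per-call-site error logs:

```python
# Hypothetical direct Redis monitor, for illustration only: one PING-based
# check that alerting can watch, instead of indirect error logs from every
# call site that happens to touch Redis.
import logging

import redis

logger = logging.getLogger(__name__)


def redis_is_healthy(url: str = "redis://localhost:6379/0") -> bool:
    client = redis.Redis.from_url(url, socket_connect_timeout=1)
    try:
        return bool(client.ping())
    except redis.exceptions.ConnectionError:
        logger.error("Redis is unreachable")  # the single, unambiguous signal
        return False
```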
LGTM. I left some requested changes just to add whitespace to help with parsing the functions. I had a hard time reviewing some of these changes because the code is too compact for me to understand easily.
But it LGTM. I wish there was a less verbose way to handle all of this, and for some reason it feels wrong to need to wrap every Redis call in a try/except like this (we don't do it for the ORM, but I guess that's a different level of "criticality" than Redis); I've never done it before in other applications I've worked on that made calls to Redis. But, like I've said elsewhere, I have no idea how to avoid it without writing some kind of naive shim that does the try/except for us.
Anyway, nothing blocking here.
Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR: @stacimc

Excluding weekend days, this PR was ready for review 2 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s). @dhruvkb, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.
Amazing work with this @dhruvkb! I have some questions and comments, but nothing to block a merge 😄 I'm similarly sad there's no easy way to wrap all of the operations differently, but I think trying to do so might cause even more trouble.
I disabled the cache container locally and was able to use the API just fine, well done!
cache_availability_params = pytest.mark.parametrize(
    "is_cache_reachable, cache_name",
    [(True, "redis"), (False, "unreachable_redis")],
)
Is there any way to reuse this, even if it means importing it from somewhere? That way the description I mentioned above only has to be made once, potentially.
They're all different in terms of the fixtures they use so defining them in one place would not be very helpful.
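For reference, the reuse suggested above could look something like the sketch below (the shared module path is hypothetical). Since the mark only carries fixture names as strings, each test module could, in principle, still resolve its own fixtures at runtime:

```python
# Hypothetical shared module, e.g. test/fixtures/cache.py. The mark only
# names the fixtures as strings; it does not import them.
import pytest

cache_availability_params = pytest.mark.parametrize(
    "is_cache_reachable, cache_name",
    [(True, "redis"), (False, "unreachable_redis")],
)


# In a test module: apply the shared mark and resolve the fixture by name.
@cache_availability_params
def test_search_without_cache(request, is_cache_reachable, cache_name):
    cache = request.getfixturevalue(cache_name)
    assert cache is not None
```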
Fixes

Fixes #3385 by @AetherUnbound

Description

General changes
- … the `ContentProvider` model.
- … `CACHES` to a separate settings file.

Throttling
- …

Search controller
- …

OAuth2
- … `None`.
- … `None` usage counts.

Tallies
- …

Image proxy
- … `incr` requests to a pipeline.
- … `None` and determined from the `Content-Type` header.
- An `itertools.count` counter is used to space out exceptions being sent to Sentry (see the sketch after this list). AFAIK, `itertools.count` is per worker process, so we will still likely be sending 20x the number of Sentry events we expect.
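To make the spacing mechanism concrete, here is a minimal sketch (the function name and sampling interval are assumptions, not the PR's actual values):

```python
# Sketch of spacing out Sentry reports with itertools.count. The counter
# lives in module scope, so each worker process gets its own copy, which
# is why roughly 20x the expected number of events may still be sent.
import itertools

import sentry_sdk

_error_counter = itertools.count()


def report_sparingly(exc: Exception, every: int = 100) -> None:
    # Forward only every ``every``-th exception to Sentry.
    if next(_error_counter) % every == 0:
        sentry_sdk.capture_exception(exc)
```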
Dead links
- … `[]` and recomputed (see the sketch below).
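A hedged sketch of this fallback (the function name, key format, and client are assumptions): when Redis is unreachable, the cached mask is treated as empty and recomputed downstream instead of raising:

```python
# Illustration only: fall back to an empty dead link mask ([]) when Redis
# cannot be reached, so the mask is simply recomputed downstream.
import logging

from redis.exceptions import ConnectionError as RedisConnectionError

logger = logging.getLogger(__name__)


def get_query_mask(redis_client, query_hash: str) -> list[int]:
    try:
        values = redis_client.lrange(f"{query_hash}:dead_link_mask", 0, -1)
        return [int(value) for value in values]
    except RedisConnectionError:
        logger.warning("Redis unavailable; dead link mask will be recomputed.")
        return []
```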
Testing Instructions

All changes have accompanying CI tests. Additionally, you can try to use the API with the Redis service stopped and see that things still work normally.
Checklist
- My pull request has a descriptive title (not a vague title like `Update index.md`).
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.
- …

Developer Certificate of Origin