2017-05-16: 8 minute outage #385
Comments
While the simple fix to allocate more memory is deployed and this issue should be addressed, I'd like to keep this issue open until we implement some of the more proper fixes.
Great incident report -- would love to see these kinds of details for all our production incidents, for all products.
This is related to 18F/api.data.gov#385. This should allow the default configuration to publish somewhere in the neighborhood of 750-1500 API backends (up from 150-300) without running into potential issues. We still need to address the more fundamental issue so the shared dict operates in a safer manner (e.g., by using `safe_set` as described in the GitHub issue).
See 18F/api.data.gov#385. This replaces `set` calls with `safe_set` for the `active_config` shared dict, to prevent a publish of new API configuration from removing the old API configuration when it exceeds the available memory allocated for the shared dict. We also handle the possibility of `safe_set` still unpublishing the old configuration when it exceeds the available memory, which should ensure that the full API configuration never gets unpublished. This also adds more explicit error and warning logging to all of the shared dict `set` calls.
While this has largely been mitigated by allocating more memory in our configuration, I finally got around to handling the underlying issue in a better way: NREL/api-umbrella@cb5e2c1 Now even if the configuration exceeds the available memory again, it will just result in the new configuration changes not being published; the existing configuration will remain live.
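The behavior described in that commit can be sketched with a toy Python stand-in for the shared dict (everything here -- the class, the byte-count memory model, and the restore-on-failure step -- is an illustrative assumption, not api-umbrella's actual code; the real calls are OpenResty's `ngx.shared.DICT:set()` and `safe_set()`):

```python
from collections import OrderedDict

class TinySharedDict:
    """Toy stand-in for an nginx lua_shared_dict (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity        # "bytes" of shared memory
        self.items = OrderedDict()      # key -> string value

    def _used(self):
        return sum(len(v) for v in self.items.values())

    def set(self, key, value):
        """Evicting set, like ngx.shared.DICT:set(): forcibly removes
        least-recently-set entries to make room, returning forcible=True
        so the caller can log that something was evicted."""
        self.items.pop(key, None)
        forcible = False
        while self.items and self._used() + len(value) > self.capacity:
            self.items.popitem(last=False)
            forcible = True
        self.items[key] = value
        return True, None, forcible

    def safe_set(self, key, value):
        """Non-evicting set, like ngx.shared.DICT:safe_set(): fails with
        "no memory" instead of evicting other entries. If the new value
        doesn't fit, put the old value back, so the last good config
        stays published (the extra guard the commit above describes)."""
        old = self.items.pop(key, None)
        if self._used() + len(value) > self.capacity:
            if old is not None:
                self.items[key] = old
            return False, "no memory", False
        self.items[key] = value
        return True, None, False

d = TinySharedDict(capacity=100)
d.safe_set("active_config", "v1" * 30)               # 60 "bytes": fits

# A new config that doesn't fit fails loudly instead of silently
# evicting the live config:
ok, err, _ = d.safe_set("active_config", "v2" * 60)  # 120 "bytes"
print(ok, err)                                       # False no memory
print(d.items["active_config"] == "v1" * 30)         # True: still live
```

The design point is the return contract: callers get an explicit "no memory" error to log, rather than discovering later that the live config vanished.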
We unfortunately had an outage of the api.data.gov service earlier today for 8 minutes, from 2:00 PM to 2:08 PM ET. It affected all agencies and APIs.
Timeline of what happened
What caused the outage
So, how did an agency admin performing a normal function in the admin tool manage to break the entire service? It certainly wasn't the admin's fault, since they were just using the admin tool as normal. At first I was concerned that some type of API configuration could trigger some obscure bug bringing things down, but it turns out the problem is a bit of an edge case that required certain conditions to line up for this outage to occur.
For brief background, here's how API backend changes go from the admin tool to becoming live in our proxy layer:
The basic problem we ran into is that this "active config" gets stored in a bucket of memory whose maximum size we have to define up front. The short version is that publishing this one new API backend caused us to run out of memory in the bucket that stores the active config. As a result, all configuration objects were purged from this bucket, so API Umbrella was left thinking there were no APIs to route to.
The slightly more nuanced version is that our active configuration object was only consuming around half of the maximum memory we had allocated to this bucket. So while it wouldn't seem like we were at any immediate risk of running out of memory, some combination of timing and circumstances caused this problem to crop up. I'm not 100% sure about the details, since this gets into the nitty-gritty of how OpenResty manages the memory of shared dicts inside nginx, but it would appear that OpenResty tried to temporarily store 2 versions of the active config in this bucket of memory. Since a single version of the config was consuming right around half of the available memory, temporarily storing 2 versions would explain why the memory got exceeded. That being said, I wasn't able to reproduce this precise behavior with the exact same configuration and memory settings locally, so it remains a bit strange; my best explanation is that some combination of events (system load, other memory usage, etc.) made this occur in production.
@sashax: I was finally able to reproduce the problem locally by adding a bunch of new API backends. While this doesn't quite reproduce the original problem with production's settings, after I added 100+ new API backends (so that a single version of the config would clearly exceed the available memory), things broke in exactly the same way production did. Why nginx decided to evict the memory on production is still a bit of a mystery (but it's probably related to load, reloads, concurrent access, etc.). Since the behavior is identical to what we saw during the outage, though, I'm more confident that the issue was definitely related to exceeding the size of this specific memory bucket.
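The failure mode described above -- two copies of the config briefly needing to fit in one fixed-size bucket -- can be illustrated with a toy model (pure Python, invented for illustration; nginx's real shared-dict allocator is page-based, so the byte accounting here is only a sketch):

```python
from collections import OrderedDict

class TinySharedDict:
    """Toy model of an nginx lua_shared_dict: a fixed-size store that,
    like ngx.shared.DICT:set(), evicts least-recently-set entries to
    make room for new ones. Illustration only, not the real allocator."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.items = OrderedDict()  # key -> string value

    def used(self):
        return sum(len(v) for v in self.items.values())

    def set(self, key, value):
        """Evicting set: forcibly removes old entries when full,
        returning the list of keys that were evicted."""
        self.items.pop(key, None)
        evicted = []
        while self.items and self.used() + len(value) > self.capacity:
            old_key, _ = self.items.popitem(last=False)  # evict oldest
            evicted.append(old_key)
        self.items[key] = value
        return evicted

# A single config version uses ~half the bucket, which looks safe...
d = TinySharedDict(capacity_bytes=100)
d.set("config_v1", "x" * 55)

# ...but briefly holding two versions at once exceeds the capacity,
# so the store evicts the old config to make room for the new one.
evicted = d.set("config_v2", "x" * 55)
print(evicted)  # ['config_v1'] -- the live config just vanished
```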
Quick fix
Since the real issue came down to something as stupidly simple as running out of memory, the easy fix was simply to allocate more memory to this bucket that stores the active config: https://github.com/18F/api.data.gov-ops/commit/1488dd5c3773177511149c89e54481e37e40f2be
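For context, the size of this bucket is set by nginx's `lua_shared_dict` directive. A hedged sketch of what such a change looks like (the dict name and sizes here are illustrative, not necessarily the exact values in the linked commit):

```nginx
# Illustrative only: the real name and sizes live in the
# api.data.gov-ops repository. The goal is enough headroom that two
# (or more) full copies of the API config can coexist during a publish.
lua_shared_dict active_config 3m;
```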
Better fixes...
While increasing the allocated memory does the trick, there are some underlying issues that would be good to address so this doesn't just get forgotten and bite us again if we ever reach 1,000 API backends.
Luckily, I think the necessary changes are pretty straightforward, so we can probably get these better fixes rolled out soon:
- Add more explicit error and warning logging around our `ngx.shared_dict:set()` calls. Currently our error logs are rather unhelpful in reporting any issues of this type (they didn't log any errors during today's outage). We should be sure to log any errors during `set()` calls, as well as log when items are added forcibly (which indicates older items are getting evicted from memory to make room for new items).
- Rethink how we store the extra metadata fields in `ngx.shared.active_config`. By setting some random small field like `db_version` after setting the big config object, I think that's probably what ended up evicting the bigger config object from memory to make room for these extra metadata fields.
- Replace `set()` with `safe_set()`. The safe version won't allow old items to be evicted when running low on memory, which I think is precisely what we want in this case (but we don't actually want this for all shared dicts in our code base, so this won't be a global change). That should ensure the last good config is always present (even if new items aren't getting stored due to the memory being exhausted).

`safe_set()` should address the fundamental issue, but if it doesn't work as expected, I had a couple of other fallback ideas as well.