
2017-05-16: 8 minute outage #385

Closed
GUI opened this issue May 17, 2017 · 3 comments

GUI commented May 17, 2017

Unfortunately, we had an 8-minute outage of the api.data.gov service earlier today, from 2:00 PM to 2:08 PM ET. It affected all agencies and APIs.

Timeline of what happened

  • An agency admin contacted me about publishing a new API to a new URL endpoint they needed permissions on. I added a new API scope for their admin group. (I don't think this part is actually relevant to the issue, but just noting it for completeness.)
  • The agency admin created a new API backend and published their changes.
  • At this point, all APIs and websites served by api.data.gov stopped routing correctly, with everything returning 404 Not Found responses.
  • After getting alerts, I tried reloading the API Umbrella services to no avail (I did not try fully restarting the services, which, after debugging, I now think may have fixed things; I should really know better).
  • Since I was aware a new API backend had just been published, I reverted the published configuration to the previous version.
  • At this point all services returned to normal.
  • With things stabilized @sashax and I debugged things. We identified what we believe was the root cause (although it seems to be an edge-case related to system memory usage, so it's difficult to reproduce). We deployed a fix, and re-published the new API backend that originally caused the outage without any issues.

What caused the outage

So, how did an agency admin performing a normal function in the admin tool manage to break the entire service? It certainly wasn't the admin's fault, since they were just using the admin tool as intended. At first I was concerned that some type of API configuration could trigger an obscure bug and bring things down, but it turns out the problem is an edge case that required certain conditions to line up for this outage to occur.

For brief background, here's how API backend changes go from the admin tool to becoming live in our proxy layer:

  • When an API configuration change gets published via the admin interface, we serialize a full snapshot of all the API backends as a big hunk of JSON. This gets stored in a MongoDB collection as a new configuration record (each new record is versioned with the timestamp so we have a full trail of how configuration changed over time).
  • Our Lua proxy layer inside nginx is constantly polling this MongoDB collection for any configuration changes. If any configuration changes are detected, the new version is loaded into memory inside the nginx processes. This becomes the "active config".
  • Any request served by API Umbrella expects this "active config" to be pre-loaded in memory (so we're not querying MongoDB on every request we serve). The "active config" object is then used to determine which URLs we serve and how they get routed (a rough sketch of this flow follows the list).
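For illustration only, here's a rough sketch of that poll-and-cache flow in OpenResty terms (this is not API Umbrella's actual code). It assumes a shared dict declared as active_config in the nginx configuration, and fetch_published_config is a hypothetical stand-in for the real MongoDB query:

```lua
-- Rough illustration of the poll-and-cache flow, not API Umbrella's actual
-- code. Meant to run from init_worker_by_lua so each nginx worker polls in
-- the background.
local cjson = require "cjson"

local dict = ngx.shared.active_config

local function fetch_published_config()
  -- Hypothetical placeholder: the real version queries the versioned MongoDB
  -- collection for the newest published snapshot of all API backends.
  return { version = "2017-05-16T18:00:00Z", apis = {} }
end

local function poll_for_changes()
  local snapshot = fetch_published_config()
  if not snapshot then return end

  -- Only reload when the published version differs from what's in memory.
  if dict:get("db_version") ~= snapshot.version then
    -- Store the serialized snapshot so request handling never has to query
    -- MongoDB directly; this becomes the "active config".
    dict:set("packed_config", cjson.encode(snapshot.apis))
    dict:set("db_version", snapshot.version)
  end
end

local ok, err = ngx.timer.every(1, poll_for_changes)
if not ok then
  ngx.log(ngx.ERR, "failed to start config polling timer: ", err)
end
```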

The basic problem we ran into is that this "active config" gets stored in a bucket of memory with a fixed maximum size that we have to define. The short version is that publishing this one new API backend caused us to run out of memory in the bucket that stores the active config. As a result, all configuration objects ended up getting purged from this bucket of memory, so API Umbrella was left thinking there were no APIs to route to.

The slightly more nuanced version is that our active configuration object was only consuming around half of the maximum memory we have allocated to this bucket. So while it wouldn't seem like we were at any immediate risk of running out of memory, it seems some unusual timing or circumstances caused this problem to crop up. I'm not 100% sure about this, since this gets into the nitty-gritty of how OpenResty manages the memory of shared dicts inside nginx, but it appears OpenResty tried to temporarily store 2 versions of the active config in this bucket of memory. Since a single version of the config was consuming right around 1/2 of the available memory, temporarily storing 2 versions would explain why the memory got exceeded. That being said, I wasn't able to reproduce this precise behavior with the exact same configuration and memory settings locally, so it's still a bit strange, but my only explanation is that some combination of events (system load, other memory usage, etc.) made this occur in production.

@sashax: I was finally able to reproduce the problem locally by adding a bunch of new API backends. So while this doesn't quite reproduce the original problem with production's settings, after I added 100+ new API backends (so a single version of the config would clearly exceed the available memory), things broke in exactly the same way production did. Why nginx decided to evict the memory on production is still a bit of a mystery (but probably related to load, reloads, concurrent access, etc.). But since the behavior is exactly the same as what we saw during the outage, I'm more confident that the issue was definitely related to exceeding the size of this specific memory bucket.

Quick fix

Since the real issue came down to something as stupidly simple as running out of memory, the easy fix was simply to allocate more memory to this bucket that stores the active config: https://github.com/18F/api.data.gov-ops/commit/1488dd5c3773177511149c89e54481e37e40f2be

  • The default maximum size API Umbrella ships with is 600KB (yes, this is small, and we could probably increase the default).
  • Our current config of around 150 API backends seems to consume around 300KB of memory.
  • We bumped the maximum size up to 10MB, which should give us plenty of breathing room to handle lots of additional API backends (a rough headroom-check sketch follows this list).
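As a sanity check on that headroom, here's a hedged sketch of logging how full the shared dict is at runtime. It assumes the same active_config dict, and the capacity()/free_space() methods it uses only exist in lua-nginx-module 0.10.11 and later, so this may not apply to the OpenResty version bundled with API Umbrella at the time:

```lua
-- Illustrative only: log roughly how much of the active_config shared dict
-- is in use. Requires lua-nginx-module >= 0.10.11 for capacity()/free_space().
local dict = ngx.shared.active_config

local capacity = dict:capacity()  -- total bytes allocated via lua_shared_dict
local free = dict:free_space()    -- bytes in completely free pages (a rough lower bound)

ngx.log(ngx.NOTICE, "active_config shared dict: roughly ",
  capacity - free, " of ", capacity, " bytes in use")
```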

Better fixes...

While increasing the allocated memory does the trick, there are some underlying issues that would be good to address so we don't just forget about this and have it bite us again if we ever reach 1,000 API backends.

Luckily, I think the necessary changes are pretty straightforward, so we can probably get these better fixes rolled out soon:

  • These few lines are what are responsible for setting the active config on the shared_dict: https://github.com/NREL/api-umbrella/blob/v0.14.1/src/api-umbrella/proxy/models/active_config.lua#L250-L253
  • We should add more error handling and logging to any ngx.shared_dict:set() calls. Currently our error logs are rather unhelpful in reporting issues of this type (it didn't log any errors during today's outage). We should be sure to log any errors during set() calls, as well as logging when items are added forcibly (which indicates older items are getting evicted from memory to make room for new items).
  • I think what probably bit us was the extra metadata we're storing inside ngx.shared.active_config. Setting some small field like db_version after setting the big config object is probably what ended up evicting the bigger config object from memory to make room for these extra metadata fields.
  • In the case of this shared_dict, we should probably replace set() with safe_set(). The safe version won't evict old items when running low on memory, which I think is precisely what we want in this case (but we don't actually want this for all shared dicts in our code base, so this won't be a global change). That should ensure the last good config is always present, even if new data isn't getting stored because the memory is exhausted (see the sketch after this list).
  • I think safe_set() should address the fundamental issue, but if it doesn't work as expected, here are a couple of other ideas I had:
    • We could also use a separate shared dict for the big config object and another shared dict for the extra metadata fields (so we shouldn't run the risk of ever having the big active config object evicted due to other "sets" for metadata going on).
    • Instead of just fetching the active config from memory at proxy time, we could fall back to re-fetching from MongoDB if it's missing. This would obviously degrade performance significantly if it didn't get re-inserted into memory, but it might be a way to keep things running even if the memory is exceeded.
    • Consider adding monitoring for how much memory is being used inside each nginx shared dict. This should be available via systemtap scripts, but there might be better ways to monitor this.
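To make the first two items concrete, here's a hedged sketch of what the extra logging and the safe_set() change could look like (this is not the actual patch; logged_set and store_packed_config are hypothetical helper names):

```lua
-- Illustrative sketch of the proposed fixes, not the actual API Umbrella patch.
local dict = ngx.shared.active_config

-- Wrap plain set() so errors and forcible evictions always get logged.
local function logged_set(key, value)
  local ok, err, forcible = dict:set(key, value)
  if not ok then
    ngx.log(ngx.ERR, "failed to set '", key, "' in active_config dict: ", err)
  elseif forcible then
    -- "forcible" means other valid items were evicted to make room, which is
    -- the silent failure mode behind this outage.
    ngx.log(ngx.WARN, "setting '", key, "' forcibly evicted other active_config items")
  end
  return ok, err
end

-- For the big config payload, prefer safe_set(): it refuses to evict other
-- unexpired items and instead fails with a "no memory" error.
local function store_packed_config(packed)
  local ok, err = dict:safe_set("packed_config", packed)
  if not ok then
    ngx.log(ngx.ERR, "failed to store packed_config (", err,
      "); the previously published configuration stays in place")
  end
  return ok, err
end
```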

GUI commented May 17, 2017

While the simple fix of allocating more memory is deployed and should address this issue, I'd like to keep this issue open until we implement some of the more proper fixes with safe_set() and better error logging.


pkarman commented May 17, 2017

Great incident report -- would love to see these kinds of details for all our production incidents, for all products.

GUI added a commit to NREL/api-umbrella that referenced this issue May 26, 2017
This is related to 18F/api.data.gov#385

This should allow the default configuration to publish somewhere in the
neighborhood of 750-1500 API backends (up from 150-300) without running
into potential issues.

We still need to address the more underlying issue so the shared dict
operates in a safer manner (e.g., by using safe_set as described in the
GitHub issue).
GUI closed this as completed Aug 10, 2018
GUI reopened this Aug 10, 2018
GUI added a commit to NREL/api-umbrella that referenced this issue Aug 11, 2018
See 18F/api.data.gov#385

This replaces "set" calls with "safe_set" for the "active_config" shared
dict, to prevent publishing a new API configuration from removing the old
API configuration when it exceeds the available memory allocated for the
shared dict. We also handle the possibility of "safe_set" still
unpublishing the old configuration when it exceeds the available memory,
which should ensure that the full API configuration never gets
unpublished.

This also adds more explicit error and warning logging to all the shared
dict "set" calls.

GUI commented Aug 11, 2018

While this has largely been mitigated by allocating more memory in our configuration, I finally got around to handling the underlying issue in a better way: NREL/api-umbrella@cb5e2c1. Now, even if the configuration exceeds the available memory again, it will just result in the new configuration changes not being published, while the existing configuration remains live.
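For anyone skimming, here's a rough sketch of that keep-the-last-good-config behavior, assuming the same active_config dict and a hypothetical publish_packed_config helper; the linked commit is the authoritative implementation and may do this differently:

```lua
-- Rough sketch of the "keep the last good config" behavior described above;
-- see NREL/api-umbrella@cb5e2c1 for the real implementation.
local dict = ngx.shared.active_config

local function publish_packed_config(new_packed, new_version)
  -- Remember what's currently published before touching the dict.
  local previous_packed = dict:get("packed_config")
  local previous_version = dict:get("db_version")

  local ok, err = dict:safe_set("packed_config", new_packed)
  if ok then
    dict:safe_set("db_version", new_version)
    return true
  end

  ngx.log(ngx.ERR, "new configuration does not fit in the active_config dict (",
    err, "); leaving the previous configuration published")

  -- Per the commit message above, safe_set() may still have unpublished the
  -- old value when memory was exceeded, so restore the previous snapshot if
  -- it's gone.
  if previous_packed and not dict:get("packed_config") then
    dict:safe_set("packed_config", previous_packed)
    dict:safe_set("db_version", previous_version)
  end
  return false
end
```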

GUI closed this as completed Aug 11, 2018