
2017-05-16: 8 minute outage #385

Closed
GUI opened this issue May 17, 2017 · 3 comments

GUI commented May 17, 2017

Unfortunately, we had an 8-minute outage of the api.data.gov service earlier today, from 2:00 PM to 2:08 PM ET. It affected all agencies and APIs.

Timeline of what happened

  • An agency admin contacted me about publishing a new API to a new URL endpoint they needed permissions on. I added a new API scope for their admin group. (I don't think this part is actually relevant to the issue, but just noting it for completeness.)
  • The agency admin created a new API backend and published their changes.
  • At this point, all APIs and websites served by api.data.gov stopped routing correctly, with everything returning 404 Not Found responses.
  • After getting alerts, I tried reloading the API Umbrella services to no avail (I did not try fully restarting the services, which, after debugging, I now think may have fixed things; I should really know better).
  • Since I was aware a new API backend had just been published, I reverted the published configuration to the previous version.
  • At this point all services returned to normal.
  • With things stabilized @sashax and I debugged things. We identified what we believe was the root cause (although it seems to be an edge-case related to system memory usage, so it's difficult to reproduce). We deployed a fix, and re-published the new API backend that originally caused the outage without any issues.

What caused the outage

So, how did an agency admin performing a normal function in the admin tool manage to break the entire service? It certainly wasn't the admin's fault, since they were just using the admin tool as intended. At first I was concerned that some type of API configuration could trigger an obscure bug and bring things down, but it turns out the problem is an edge case that required certain conditions to line up for this outage to occur.

For brief background, here's how API backend changes go from the admin tool to becoming live in our proxy layer:

  • When an API configuration change gets published via the admin interface, we serialize a full snapshot of all the API backends as a big hunk of JSON. This gets stored in a MongoDB collection as a new configuration record (each new record is versioned with the timestamp so we have a full trail of how configuration changed over time).
  • Our Lua proxy layer inside nginx is constantly polling this MongoDB collection for any configuration changes. If any configuration changes are detected, the new version is loaded into memory inside the nginx processes. This becomes the "active config".
  • Any request served by API Umbrella expects this "active config" to be pre-loaded in memory (so we're not querying MongoDB on every request we serve). The "active config" object is then used to determine which URLs we serve and how they get routed (a rough sketch of this flow follows the list).
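For illustration only, here's a rough sketch of that poll-and-cache flow in OpenResty terms (this is not API Umbrella's actual code). It assumes a shared dict declared as active_config in the nginx configuration, and fetch_published_config is a hypothetical stand-in for the real MongoDB query:

```lua
-- Rough illustration of the poll-and-cache flow, not API Umbrella's actual
-- code. Meant to run from init_worker_by_lua so each nginx worker polls in
-- the background.
local cjson = require "cjson"

local dict = ngx.shared.active_config

local function fetch_published_config()
  -- Hypothetical placeholder: the real version queries the versioned MongoDB
  -- collection for the newest published snapshot of all API backends.
  return { version = "2017-05-16T18:00:00Z", apis = {} }
end

local function poll_for_changes()
  local snapshot = fetch_published_config()
  if not snapshot then return end

  -- Only reload when the published version differs from what's in memory.
  if dict:get("db_version") ~= snapshot.version then
    -- Store the serialized snapshot so request handling never has to query
    -- MongoDB directly; this becomes the "active config".
    dict:set("packed_config", cjson.encode(snapshot.apis))
    dict:set("db_version", snapshot.version)
  end
end

local ok, err = ngx.timer.every(1, poll_for_changes)
if not ok then
  ngx.log(ngx.ERR, "failed to start config polling timer: ", err)
end
```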

The basic problem we ran into is that this "active config" gets stored in a bucket of memory with a fixed maximum size that we have to define. The short version is that publishing this one new API backend caused us to run out of memory in the bucket that stores the active config. As a result, all configuration objects ended up getting purged from this bucket of memory, so API Umbrella was left thinking there were no APIs to route to.

The slightly more nuanced version is that our active configuration object was only consuming around half of the maximum memory we have allocated to this bucket. So while it wouldn't seem like we were at any immediate risk of running out of memory, it seems some unusual timing or circumstances caused this problem to crop up. I'm not 100% sure about this, since this gets into the nitty-gritty of how OpenResty manages the memory of shared dicts inside nginx, but it appears OpenResty tried to temporarily store 2 versions of the active config in this bucket of memory. Since a single version of the config was consuming right around 1/2 of the available memory, temporarily storing 2 versions would explain why the memory got exceeded. That being said, I wasn't able to reproduce this precise behavior with the exact same configuration and memory settings locally, so it's still a bit strange, but my only explanation is that some combination of events (system load, other memory usage, etc.) made this occur in production.

@sashax: I was finally able to reproduce the problem locally by adding a bunch of new API backends. So while this doesn't quite reproduce the original problem with production's settings, after I added 100+ new API backends (so a single version of the config would clearly exceed the available memory), things broke in exactly the same way production did. Why nginx decided to evict the memory on production is still a bit of a mystery (but probably related to load, reloads, concurrent access, etc.). But since the behavior is exactly the same as what we saw during the outage, I'm more confident that the issue was definitely related to exceeding the size of this specific memory bucket.

Quick fix

Since the real issue came down to something as stupidly simple as running out of memory, the easy fix was simply to allocate more memory to this bucket that stores the active config: https://github.com/18F/api.data.gov-ops/commit/1488dd5c3773177511149c89e54481e37e40f2be

  • The default maximum size API Umbrella ships with is 600KB (yes, this is small, and we could probably increase the default).
  • Our current config of around 150 API backends seems to consume around 300KB of memory.
  • We bumped the maximum size up to 10MB, which should give us plenty of breathing room to handle lots of additional API backends (a rough headroom-check sketch follows this list).
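As a sanity check on that headroom, here's a hedged sketch of logging how full the shared dict is at runtime. It assumes the same active_config dict, and the capacity()/free_space() methods it uses only exist in lua-nginx-module 0.10.11 and later, so this may not apply to the OpenResty version bundled with API Umbrella at the time:

```lua
-- Illustrative only: log roughly how much of the active_config shared dict
-- is in use. Requires lua-nginx-module >= 0.10.11 for capacity()/free_space().
local dict = ngx.shared.active_config

local capacity = dict:capacity()  -- total bytes allocated via lua_shared_dict
local free = dict:free_space()    -- bytes in completely free pages (a rough lower bound)

ngx.log(ngx.NOTICE, "active_config shared dict: roughly ",
  capacity - free, " of ", capacity, " bytes in use")
```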

Better fixes...

While increasing the allocated memory does the trick, there are some underlying issues that would be good to address so we don't just forget about this and have it bite us again if we ever reach 1,000 API backends.

Luckily, I think the necessary changes are pretty straightforward, so we can probably get these better fixes rolled out soon:

  • These few lines are what are responsible for setting the active config on the shared_dict: https://github.com/NREL/api-umbrella/blob/v0.14.1/src/api-umbrella/proxy/models/active_config.lua#L250-L253
  • We should add more error handling and logging to any ngx.shared_dict:set() calls. Currently our error logs are rather unhelpful in reporting issues of this type (it didn't log any errors during today's outage). We should be sure to log any errors during set() calls, as well as logging when items are added forcibly (which indicates older items are getting evicted from memory to make room for new items).
  • I think what probably bit us was the extra metadata we're storing inside ngx.shared.active_config. Setting some small field like db_version after setting the big config object is probably what ended up evicting the bigger config object from memory to make room for these extra metadata fields.
  • In the case of this shared_dict, we should probably replace set() with safe_set(). The safe version won't evict old items when running low on memory, which I think is precisely what we want in this case (but we don't actually want this for all shared dicts in our code base, so this won't be a global change). That should ensure the last good config is always present, even if new data isn't getting stored because the memory is exhausted (see the sketch after this list).
  • I think safe_set() should address the fundamental issue, but if it doesn't work as expected, here are a couple of other ideas I had:
    • We could also use a separate shared dict for the big config object and another shared dict for the extra metadata fields (so we shouldn't run the risk of ever having the big active config object evicted due to other "sets" for metadata going on).
    • Instead of just fetching the active config from memory at proxy time, we could fall back to re-fetching from MongoDB if it's missing. This would obviously degrade performance significantly if it didn't get re-inserted into memory, but it might be a way to keep things running even if the memory is exceeded.
    • Consider adding monitoring for how much memory is being used inside each nginx shared dict. This should be available via systemtap scripts, but there might be better ways to monitor this.
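To make the first two items concrete, here's a hedged sketch of what the extra logging and the safe_set() change could look like (this is not the actual patch; logged_set and store_packed_config are hypothetical helper names):

```lua
-- Illustrative sketch of the proposed fixes, not the actual API Umbrella patch.
local dict = ngx.shared.active_config

-- Wrap plain set() so errors and forcible evictions always get logged.
local function logged_set(key, value)
  local ok, err, forcible = dict:set(key, value)
  if not ok then
    ngx.log(ngx.ERR, "failed to set '", key, "' in active_config dict: ", err)
  elseif forcible then
    -- "forcible" means other valid items were evicted to make room, which is
    -- the silent failure mode behind this outage.
    ngx.log(ngx.WARN, "setting '", key, "' forcibly evicted other active_config items")
  end
  return ok, err
end

-- For the big config payload, prefer safe_set(): it refuses to evict other
-- unexpired items and instead fails with a "no memory" error.
local function store_packed_config(packed)
  local ok, err = dict:safe_set("packed_config", packed)
  if not ok then
    ngx.log(ngx.ERR, "failed to store packed_config (", err,
      "); the previously published configuration stays in place")
  end
  return ok, err
end
```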

GUI commented May 17, 2017

While the simple fix of allocating more memory is deployed and should address this issue, I'd like to keep this issue open until we implement some of the more proper fixes with safe_set() and better error logging.


pkarman commented May 17, 2017

Great incident report -- would love to see these kinds of details for all our production incidents, for all products.

GUI added a commit to NREL/api-umbrella that referenced this issue May 26, 2017
This is related to 18F/api.data.gov#385

This should allow the default configuration to publish somewhere in the
neighborhood of 750-1500 API backends (up from 150-300) without running
into potential issues.

We still need to address the more underlying issue so the shared dict
operates in a safer manner (e.g., by using safe_set as described in the
GitHub issue).
GUI closed this as completed Aug 10, 2018
GUI reopened this Aug 10, 2018
GUI added a commit to NREL/api-umbrella that referenced this issue Aug 11, 2018
See 18F/api.data.gov#385

This replaces "set" calls with "safe_set" for the "active_config" shared
dict, to prevent publishing a new API configuration from removing the old
API configuration when it exceeds the available memory allocated for the
shared dict. We also handle the possibility of "safe_set" still
unpublishing the old configuration when it exceeds the available memory,
which should ensure that the full API configuration never gets
unpublished.

This also adds more explicit error and warning logging to all the shared
dict "set" calls.

GUI commented Aug 11, 2018

While this has largely been mitigated by allocating more memory in our configuration, I finally got around to handling the underlying issue in a better way: NREL/api-umbrella@cb5e2c1. Now, even if the configuration exceeds the available memory again, it will just result in the new configuration changes not being published, while the existing configuration remains live.
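For anyone skimming, here's a rough sketch of that keep-the-last-good-config behavior, assuming the same active_config dict and a hypothetical publish_packed_config helper; the linked commit is the authoritative implementation and may do this differently:

```lua
-- Rough sketch of the "keep the last good config" behavior described above;
-- see NREL/api-umbrella@cb5e2c1 for the real implementation.
local dict = ngx.shared.active_config

local function publish_packed_config(new_packed, new_version)
  -- Remember what's currently published before touching the dict.
  local previous_packed = dict:get("packed_config")
  local previous_version = dict:get("db_version")

  local ok, err = dict:safe_set("packed_config", new_packed)
  if ok then
    dict:safe_set("db_version", new_version)
    return true
  end

  ngx.log(ngx.ERR, "new configuration does not fit in the active_config dict (",
    err, "); leaving the previous configuration published")

  -- Per the commit message above, safe_set() may still have unpublished the
  -- old value when memory was exceeded, so restore the previous snapshot if
  -- it's gone.
  if previous_packed and not dict:get("packed_config") then
    dict:safe_set("packed_config", previous_packed)
    dict:safe_set("db_version", previous_version)
  end
  return false
end
```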

GUI closed this as completed Aug 11, 2018