During our last upgrade of some system-level stuff a few weeks ago, we unexpectedly experienced about 15 seconds of downtime across our service. It occurred when I forced the MongoDB cluster to switch the primary server to a different server. I've gone through this process in the past without incurring downtime, so I finally got around to investigating how the MongoDB replicaset change triggered a brief outage.
I discovered two things that could happen when MongoDB was electing a new primary server:
1. The API backends config stored in MongoDB could get wiped, which could result in all API backends disappearing (this happened if the config reloader ran before the new primary was elected). This effectively took down all the APIs, which is obviously a very bad thing. (A sketch of the kind of guard that prevents this follows the list.)
2. API key lookups on individual requests could fail. Since we verify API keys against the MongoDB database, the key lookups could fail for any API that required API keys. These requests were being retried, but possibly not for long enough.
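For the first failure mode, the fix conceptually comes down to never publishing an empty backend list just because MongoDB couldn't be read at that moment. Here's a minimal, hypothetical sketch of that kind of guard, written in Python with pymongo purely for illustration (the router itself isn't Python, and the database/collection names and the publish_config hook are made up):

```python
# Hypothetical sketch of a config-reloader guard: never publish an empty API
# backend list just because MongoDB has no primary during an election.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0",
                     serverSelectionTimeoutMS=2000)
db = client["api_umbrella"]          # database/collection names are illustrative

last_known_good = []                 # most recent non-empty config we published

def publish_config(backends):
    # Stand-in for pushing the config out to the proxy layer.
    print("publishing %d API backends" % len(backends))

def reload_backends():
    global last_known_good
    try:
        backends = list(db["apis"].find())
    except PyMongoError:
        # No primary yet (election in progress) or MongoDB is down:
        # keep serving the last-known-good config instead of wiping it.
        return last_known_good
    if not backends:
        # An empty read during an election looks the same as a genuinely empty
        # config here, so err on the side of keeping the existing APIs up.
        return last_known_good
    last_known_good = backends
    publish_config(backends)
    return backends
```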
This has been addressed in a few updates:
Our full integration test suite now uses a replicaset for MongoDB and performs real replicaset changes, so we can better see how the entire stack reacts to them (and ensure this doesn't crop up again): NREL/api-umbrella-router@634b1ae
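I don't have the actual test code inline here, but the core trick in that kind of integration test is just forcing a step-down on the primary and asserting the stack keeps answering within a bounded window. A rough Python/pymongo sketch of the idea (the endpoint URL, API key, and timing thresholds are invented for illustration and aren't the suite's real values):

```python
# Rough sketch of an integration-style check: force a replicaset election and
# make sure requests through the stack recover within a bounded window.
import time
import urllib.request
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

def force_election():
    try:
        # Ask the current primary to step down for 60 seconds, which
        # triggers an election among the remaining members.
        client.admin.command("replSetStepDown", 60)
    except ConnectionFailure:
        # Older MongoDB versions close all connections on step-down,
        # so this error is expected and means the step-down happened.
        pass

def request_succeeds():
    req = urllib.request.Request(
        "http://localhost:8080/api/example?api_key=TEST_KEY")  # illustrative URL
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

force_election()
start = time.time()
while not request_succeeds():
    assert time.time() - start < 15, "stack did not recover within 15s"
    time.sleep(0.5)
print("recovered after %.1fs" % (time.time() - start))
```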
So now when MongoDB is completely down or just going through a replicaset re-election, things should behave gracefully. The only downside of the current approach is that requests using API keys may pause until the new primary server is elected, so requests during that window may take 5-10 seconds to complete. But these primary changes aren't super common, so I don't think this is a huge issue. This could be improved by caching the API key data locally on the servers, which would eventually be good to roll out, but these fixes at least prevent outright failures. (Caching is hopefully on the horizon and is already implemented in the experimental Lua revamp: NREL/api-umbrella#111.)
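The behavior described above, pausing key lookups until a new primary exists rather than failing them outright, amounts to retrying the lookup against a deadline, and the local caching idea would simply sit in front of that. A hypothetical Python/pymongo sketch of both pieces (the collection name, TTL, and deadline are assumptions, not the project's actual values):

```python
# Hypothetical sketch: retry API key lookups across a primary election, with a
# small local TTL cache in front so most requests never hit MongoDB at all.
import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect, ConnectionFailure

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0",
                     serverSelectionTimeoutMS=2000)
users = client["api_umbrella"]["api_users"]   # collection name is illustrative

_cache = {}          # api_key -> (user_doc, expires_at)
CACHE_TTL = 60       # seconds; an assumption, not the project's value
DEADLINE = 10        # give up after ~10s, roughly one election's worth of time

def lookup_api_key(api_key):
    cached = _cache.get(api_key)
    if cached and cached[1] > time.time():
        return cached[0]                      # served locally, no MongoDB hit

    start = time.time()
    delay = 0.25
    while True:
        try:
            user = users.find_one({"api_key": api_key})
            _cache[api_key] = (user, time.time() + CACHE_TTL)
            return user
        except (AutoReconnect, ConnectionFailure):
            # No primary right now (election in progress). Retry with backoff
            # until the deadline instead of failing the request immediately.
            if time.time() - start > DEADLINE:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 2.0)
```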