During our last upgrade of some system-level stuff a few weeks ago, we unexpectedly experienced about 15 seconds of downtime across our service. It occurred when I forced the MongoDB cluster to switch the primary server to a different server. I've gone through this process in the past without incurring downtime, so I finally got around to investigating how the MongoDB replicaset change triggered a brief outage.
I discovered two things that could happen when MongoDB was electing a new primary server:
1. The API backends config stored in MongoDB could get wiped, which could result in all API backends disappearing (this happened if the config reloader ran before the new primary was elected). This effectively took down all the APIs, which is obviously a very bad thing. (A sketch of the kind of guard that prevents this follows the list.)
2. API key lookups on individual requests could fail. Since we verify API keys against the MongoDB database, the key lookups could fail for any API that required API keys. These requests were being retried, but possibly not for long enough.
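For the first failure mode, the fix conceptually comes down to never publishing an empty backend list just because MongoDB couldn't be read at that moment. Here's a minimal, hypothetical sketch of that kind of guard, written in Python with pymongo purely for illustration (the router itself isn't Python, and the database/collection names and the publish_config hook are made up):

```python
# Hypothetical sketch of a config-reloader guard: never publish an empty API
# backend list just because MongoDB has no primary during an election.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0",
                     serverSelectionTimeoutMS=2000)
db = client["api_umbrella"]          # database/collection names are illustrative

last_known_good = []                 # most recent non-empty config we published

def publish_config(backends):
    # Stand-in for pushing the config out to the proxy layer.
    print("publishing %d API backends" % len(backends))

def reload_backends():
    global last_known_good
    try:
        backends = list(db["apis"].find())
    except PyMongoError:
        # No primary yet (election in progress) or MongoDB is down:
        # keep serving the last-known-good config instead of wiping it.
        return last_known_good
    if not backends:
        # An empty read during an election looks the same as a genuinely empty
        # config here, so err on the side of keeping the existing APIs up.
        return last_known_good
    last_known_good = backends
    publish_config(backends)
    return backends
```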
This has been addressed in a few updates:
Our full integration test suite now uses a replicaset for MongoDB and performs real replicaset changes, so we can better see how the entire stack reacts to them (and ensure this doesn't crop up again): NREL/api-umbrella-router@634b1ae
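I don't have the actual test code inline here, but the core trick in that kind of integration test is just forcing a step-down on the primary and asserting the stack keeps answering within a bounded window. A rough Python/pymongo sketch of the idea (the endpoint URL, API key, and timing thresholds are invented for illustration and aren't the suite's real values):

```python
# Rough sketch of an integration-style check: force a replicaset election and
# make sure requests through the stack recover within a bounded window.
import time
import urllib.request
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

def force_election():
    try:
        # Ask the current primary to step down for 60 seconds, which
        # triggers an election among the remaining members.
        client.admin.command("replSetStepDown", 60)
    except ConnectionFailure:
        # Older MongoDB versions close all connections on step-down,
        # so this error is expected and means the step-down happened.
        pass

def request_succeeds():
    req = urllib.request.Request(
        "http://localhost:8080/api/example?api_key=TEST_KEY")  # illustrative URL
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

force_election()
start = time.time()
while not request_succeeds():
    assert time.time() - start < 15, "stack did not recover within 15s"
    time.sleep(0.5)
print("recovered after %.1fs" % (time.time() - start))
```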
So now when MongoDB is completely down or just going through a replicaset re-election, things should behave gracefully. The only downside of the current approach is that requests using API keys may pause until the new primary server is elected, so requests during that window may take 5-10 seconds to complete. But these primary changes aren't super common, so I don't think this is a huge issue. This could be improved by caching the API key data locally on the servers, which would eventually be good to roll out, but these fixes at least prevent outright failures. (Caching is hopefully on the horizon and is already implemented in the experimental Lua revamp: NREL/api-umbrella#111.)
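The behavior described above, pausing key lookups until a new primary exists rather than failing them outright, amounts to retrying the lookup against a deadline, and the local caching idea would simply sit in front of that. A hypothetical Python/pymongo sketch of both pieces (the collection name, TTL, and deadline are assumptions, not the project's actual values):

```python
# Hypothetical sketch: retry API key lookups across a primary election, with a
# small local TTL cache in front so most requests never hit MongoDB at all.
import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect, ConnectionFailure

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0",
                     serverSelectionTimeoutMS=2000)
users = client["api_umbrella"]["api_users"]   # collection name is illustrative

_cache = {}          # api_key -> (user_doc, expires_at)
CACHE_TTL = 60       # seconds; an assumption, not the project's value
DEADLINE = 10        # give up after ~10s, roughly one election's worth of time

def lookup_api_key(api_key):
    cached = _cache.get(api_key)
    if cached and cached[1] > time.time():
        return cached[0]                      # served locally, no MongoDB hit

    start = time.time()
    delay = 0.25
    while True:
        try:
            user = users.find_one({"api_key": api_key})
            _cache[api_key] = (user, time.time() + CACHE_TTL)
            return user
        except (AutoReconnect, ConnectionFailure):
            # No primary right now (election in progress). Retry with backoff
            # until the deadline instead of failing the request immediately.
            if time.time() - start > DEADLINE:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 2.0)
```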