Summary
U of Toronto reported 500 server errors on their JupyterHub after a recent deployment to production. An investigation uncovered that this was because an updated credential had been overwritten.

Here's a very short timeline (all happened within 8 hours):
1. We deployed a minor change to production: Merge #100 and #102 to prod (utoronto-2i2c/jupyterhub-deploy#103).
2. This caused the 500 errors to begin.
3. U. Toronto reported these errors in FreshDesk.
4. We discussed in the #hub-support channel, and identified that the credentials had not been updated in the repository after updating the UoT credentials (Incident report: U. Toronto credentials expired #637).
5. We rolled back the latest deploy to prod with these commands (a more general sketch of this rollback flow follows the timeline):

   helm list -n NAMESPACE
   # Get the deployment name and revision number from the output
   helm rollback -n NAMESPACE DEPLOYMENT_NAME REV_NUM-1  # That's "minus one", so one less than the output of the previous command

6. U. Toronto reported that this was no longer an issue in FreshDesk.
FreshDesk ticket: https://2i2c.freshdesk.com/a/tickets/22
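For reference, here is a minimal sketch of the same rollback flow in more general terms. NAMESPACE and RELEASE_NAME are placeholders, not the actual values used during this incident; note that helm rollback with no revision argument rolls back to the immediately previous release, which is what the REV_NUM-1 step above accomplishes:

helm history RELEASE_NAME -n NAMESPACE   # list revisions, their status, and chart versions
helm rollback RELEASE_NAME -n NAMESPACE  # no revision given, so helm rolls back to the previous release
helm status RELEASE_NAME -n NAMESPACE    # confirm the rolled-back release deployed cleanly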
After-action report
What went wrong
The biggest problem here (other than the underlying credentials issue) is that only a subset of the team felt capable of working on the U. Toronto Hub. That's because it is special-cased and not a part of our central deployment infrastructure.
Where we got lucky
We got lucky that this issue was resolvable with a helm rollback, and didn't require more extensive kubernetes debugging, which would have made progress difficult given the unequal access that team members had to the hub.
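For context, the more extensive kubernetes debugging we avoided would look roughly like the following. This is a sketch only: it assumes kubectl access to the hub's namespace and uses the deployment name hub, which is the zero-to-jupyterhub default and an assumption here rather than something confirmed in this report:

kubectl get pods -n NAMESPACE                              # are the hub/proxy pods running, pending, or crash-looping?
kubectl logs deploy/hub -n NAMESPACE --tail=200            # hub logs typically show the failing credential or auth calls behind 500s
kubectl get events -n NAMESPACE --sort-by=.lastTimestamp   # recent events: failed probes, image pulls, scheduling problems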
Action items
Process improvements
Actions