Summary
U of Toronto reported 500 server errors on their JupyterHub after a recent deployment to production. An investigation uncovered that this was because an updated credential had been overwritten.

Here's a very short timeline (all happened within 8 hours):
1. We deployed a minor change to production: Merge #100 and #102 to prod (utoronto-2i2c/jupyterhub-deploy#103).
2. This caused the 500 errors to begin.
3. U. Toronto reported these errors in FreshDesk.
4. We discussed in the #hub-support channel, and identified that the credentials had not been updated in the repository after updating the UoT credentials (Incident report: U. Toronto credentials expired #637).
5. We rolled back the latest deploy to prod with these commands (a more general sketch of this rollback flow follows the timeline):

   helm list -n NAMESPACE
   # Get the deployment name and revision number from the output
   helm rollback -n NAMESPACE DEPLOYMENT_NAME REV_NUM-1  # That's "minus one", so one less than the output of the previous command

6. U. Toronto reported that this was no longer an issue in FreshDesk.
FreshDesk ticket: https://2i2c.freshdesk.com/a/tickets/22
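For reference, here is a minimal sketch of the same rollback flow in more general terms. NAMESPACE and RELEASE_NAME are placeholders, not the actual values used during this incident; note that helm rollback with no revision argument rolls back to the immediately previous release, which is what the REV_NUM-1 step above accomplishes:

helm history RELEASE_NAME -n NAMESPACE   # list revisions, their status, and chart versions
helm rollback RELEASE_NAME -n NAMESPACE  # no revision given, so helm rolls back to the previous release
helm status RELEASE_NAME -n NAMESPACE    # confirm the rolled-back release deployed cleanly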
After-action report
What went wrong
The biggest problem here (other than the underlying credentials issue) is that only a subset of the team felt capable of working on the U. Toronto Hub. That's because it is special-cased and not a part of our central deployment infrastructure.
Where we got lucky
We got lucky that this issue was resolvable with a helm rollback, and didn't require more extensive kubernetes debugging, which would have made progress difficult given the unequal access that team members had to the hub.
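For context, the more extensive kubernetes debugging we avoided would look roughly like the following. This is a sketch only: it assumes kubectl access to the hub's namespace and uses the deployment name hub, which is the zero-to-jupyterhub default and an assumption here rather than something confirmed in this report:

kubectl get pods -n NAMESPACE                              # are the hub/proxy pods running, pending, or crash-looping?
kubectl logs deploy/hub -n NAMESPACE --tail=200            # hub logs typically show the failing credential or auth calls behind 500s
kubectl get events -n NAMESPACE --sort-by=.lastTimestamp   # recent events: failed probes, image pulls, scheduling problems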
Action items
Process improvements
Actions