
[Incident] UToronto cluster ran out of disk space #1081

Closed · 4 tasks done
sgibson91 opened this issue Mar 10, 2022 · 4 comments
Comments

@sgibson91 (Member) commented Mar 10, 2022

Summary

  • Users were not able to log in to the hub
  • From Grafana, it seems the trigger occurred roughly 14 hours before the report and went unnoticed until a large volume of people tried to log in
  • The cause was that the disk space for the Azure cluster had run out, due to the increase in user accounts on the hub infrastructure.
  • We increased the disk quota for this cluster, and the problem is now resolved.

Hub information

Timeline (if relevant)

All times in GMT

2022-03-10 16:50 GMT

Support ticket opened reporting an increase in reports of users not able to access the hub

17:10

Screenshots of Grafana shared in Slack - these are from the last 7 days. You can see a sharp increase in "non-running pods" on the right-hand side. (EDIT: this was a strong indicator of something going wrong)

[Three Grafana screenshots from the last 7 days, showing the spike in non-running pods]

17:25

Log message reported from hub pod logs

[W 2022-03-10 17:25:19.774 JupyterHub base:1349] Failing suspected API request to not-running server: /hub/user/f48cb1b0-2e65-4ca3-8fc5-0a195cd308c9/api/metrics/v1

Attempted to find log messages from the time period indicated by the Grafana charts, but couldn't retrieve anything earlier than [I 2022-03-10 17:28:33.835 JupyterHub log:189] using the command below:

k logs -c hub hub-59746db487-qb2db --since-time=2022-03-09T14:00:00.000000000Z

18:30

We identified that the hub disk quota had been reached, and the hub was no longer able to allocate more disk space. This was causing the outages.

We logged in to the Azure portal for this cluster and manually increased the available disk space, which resolved the issue (see this comment for more details).


After-action report

What went wrong

  • The disk space ran out for the Azure cluster, which caused the hub to be unable to create new user sessions.
  • There was no automatic reporting or graphs on disk space usage, so it wasn't clear that this problem was approaching until it was too late.
  • Only a subset of our team members had the ability to look at the cloud infrastructure for the cluster, because access currently requires a University of Toronto account, and this hasn't been set up for all team members.

Where we got lucky

  • One of the two team members that had access to Toronto's cloud infrastructure happened to be available at the time, and was able to quickly spot the problem when looking at the logs.

Action items

Process improvements

  1. Everyone in the engineering team should have access to the UToronto Azure subscription (Get all 2i2c engineers access to UToronto Azure portal, team-compass#386)
  2. We should establish an escalation process for dealing with outages. For example, pinging cell phones if the problem is big enough or if only a certain team member can resolve it. (Define escalation practices when there are hub outages, #1118)

Documentation improvements

  1. Document various kinds of log messages you may get from a JupyterHub pod, and how to interpret what they might mean (Add some docs on common log messages, jupyterhub/jupyterhub#3820)
  2. Document every Grafana graph we have, what it represents, and useful ways to read it (Document every grafana graph, what it represents, and useful ways to read it, #1117)

Technical improvements

  1. Jupyter server currently does not start at all when the disk is full, leading to somewhat mysterious failures. We should find a way to make sure that the user's server does start even if the disk is full, and we can get better error messages that way.
  2. We should have a Grafana tab where we are monitoring disk usage on the home directory (Add a Grafana plot to monitor disk usage on the home directory, #1119); see the sketch after this list
  3. We should have automated alerts when there are 5xx error responses from the user pods, not just from the hub process itself (updated Cloud usage monitoring and alerting infrastructure and process #328)
  4. Simplify the UToronto deployment to make it more accessible (Simplify UToronto hub deployment so it is easier to manage, #1088)
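
For item 2, here is a minimal sketch of what the underlying metric could look like, assuming a small Prometheus exporter running somewhere with the NFS home share mounted. The mount path, port, and metric names are illustrative assumptions, not the actual deployment:

```python
# Hypothetical sketch: publish home-directory disk usage as Prometheus gauges
# that a Grafana panel could plot and alert on. Mount path, port, and metric
# names are assumptions, not our actual setup.
import shutil
import time

from prometheus_client import Gauge, start_http_server

HOME_MOUNT = "/home/jovyan"  # assumed NFS mount path inside a user pod

used_bytes = Gauge("home_directory_used_bytes", "Bytes used on the home directory share")
free_bytes = Gauge("home_directory_free_bytes", "Bytes free on the home directory share")

def collect() -> None:
    usage = shutil.disk_usage(HOME_MOUNT)
    used_bytes.set(usage.used)
    free_bytes.set(usage.free)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on port 8000 for Prometheus to scrape
    while True:
        collect()
        time.sleep(60)
```

In practice an existing node or kubelet exporter may already expose an equivalent filesystem metric; the point is only that the proposed Grafana plot needs a gauge like this to read from.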

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • Incident title and after-action report are cleaned up
  • All actionable items above have linked GitHub Issues
@yuvipanda (Member) commented Mar 10, 2022

Here's the workflow I followed.

  1. Get access to the UToronto cluster, with python3 deployer use-cluster-credentials utoronto
  2. Look at the hub logs, but nothing suspicious there. Particularly, no python stack traces - watching for the 'shape' of a python stacktrace going by as logs whoosh past is a useful first step.
  3. Look at list of pods running, with k -n prod get pod (k is an alias for kubectl)
  4. Notice that a freshly launched user pod immediately goes into Error state! This means that the process starting in the user pod (the Jupyter single-user server) fails on startup. This should basically never happen under normal operating conditions. (A sketch of how this check could be scripted is at the end of this comment.)
  5. Look at the logs for a pod in error state, with k -n prod logs <pod-name>. Immediately detect a python stacktrace!
       Traceback (most recent call last):
    File "/opt/conda/bin/jupyterhub-singleuser", line 8, in <module>
      sys.exit(main())
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 264, in launch_instance
      return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 845, in launch_instance
      app.initialize(argv)
    File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 852, in initialize
      result = super().initialize(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 573, in initialize
      return super().initialize(argv)
    File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 88, in inner
      return method(app, *args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/notebook/notebookapp.py", line 2143, in initialize
      super().initialize(argv)
    File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 88, in inner
      return method(app, *args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 239, in initialize
      self.migrate_config()
    File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 399, in migrate_config
      super().migrate_config()
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 165, in migrate_config
      migrate()
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/migrate.py", line 244, in migrate
      ensure_dir_exists(env['jupyter_config'])
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/utils/__init__.py", line 11, in ensure_dir_exists
      os.makedirs(path, mode=mode)
    File "/opt/conda/lib/python3.8/os.py", line 223, in makedirs
      mkdir(name, mode)
    OSError: [Errno 28] No space left on device: '/home/jovyan/.jupyter'
    
  6. Aha, so no space on disk. I pop over to the Azure portal (https://portal.azure.com/#blade/Microsoft_Azure_FileStorage/FileShareMenuBlade/overview/storageAccountId/%2Fsubscriptions%2Fead3521a-d994-4a44-a68d-b16e35642d5b%2FresourceGroups%2F2i2c-utoronto-cluster%2Fproviders%2FMicrosoft.Storage%2FstorageAccounts%2F2i2cutorontohubstorage/path/homes/protocol/NFS) and increase size of disk from 1.5TB to 3TB.
  7. Launch a new user pod to see if it works. It does. I also look at newly starting pods to see if they are succeeding, with wk -n prod get pod -l component=singleuser-server (wk is an alias for watch kubectl, uses the very helpful watch command). Seems alright

Before doing this, I had also manually scaled the cluster up to 10 nodes (from the 3 that it had), because that could have also been the issue.
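
As an aside, the check in step 4 (user pods landing straight in Error state) is scriptable; below is a minimal sketch using the kubernetes Python client, assuming cluster credentials are already set up as in step 1. The namespace and label selector come from the commands above; everything else is illustrative.

```python
# Hypothetical sketch: list singleuser pods that are not Running, roughly
# automating step 4 above. Assumes kubeconfig credentials are already in place
# (e.g. via `deployer use-cluster-credentials utoronto`).
from kubernetes import client, config

def report_broken_user_pods(namespace: str = "prod") -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector="component=singleuser-server")
    for pod in pods.items:
        if pod.status.phase == "Running":
            continue
        # Collect waiting/terminated reasons (e.g. CrashLoopBackOff, Error)
        reasons = []
        for cs in pod.status.container_statuses or []:
            if cs.state.waiting:
                reasons.append(cs.state.waiting.reason)
            if cs.state.terminated:
                reasons.append(cs.state.terminated.reason)
        print(f"{pod.metadata.name}: phase={pod.status.phase}, reasons={reasons}")

if __name__ == "__main__":
    report_broken_user_pods()
```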

yuvipanda added a commit to yuvipanda/jupyterhub that referenced this issue Mar 10, 2022
When debugging errors and outages, looking at the logs emitted by
JupyterHub is very helpful. This document tries to document some common
log messages, and what they mean.

I currently added just one log message, but we can add more
over time.

Ref 2i2c-org/infrastructure#1081
where this would've been useful troubleshooting
choldgraf changed the title from "[Incident] UToronto Spawn Failure Error" to "[Incident] UToronto cluster ran out of disk space" on Mar 10, 2022
@choldgraf (Member) commented:

I've opened up follow-up issues for the major remaining things in our post-mortem list. There is one that I wasn't really sure how to capture in an issue:

Jupyter server currently does not start at all when the disk is full, leading to somewhat mysterious failures. We should find a way to make sure that the user's server does start even if the disk is full, and we can get better error messages that way.

So if somebody knows how to turn that into an issue, please do so!

I'll close this, since the incident is resolved and we have follow-up issues for the remaining pieces.

@damianavila (Contributor) commented:

So if somebody knows how to turn that into an issue, please do so!

Maybe some checking before launching the single-user instance?
https://github.com/jupyterhub/jupyterhub/blob/main/jupyterhub/singleuser/app.py#L68
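
For what it's worth, here is one possible shape such a check could take. This is a purely illustrative sketch, not existing JupyterHub API; the threshold, function name, and where it would hook in are all assumptions.

```python
# Purely illustrative sketch of the kind of check suggested above: warn with a
# clear message (rather than crash with a bare OSError) when the home directory
# is effectively full, before the single-user server tries to write to it.
# The threshold and function name are assumptions, not JupyterHub API.
import os
import shutil
import sys

def warn_if_disk_full(path: str = os.path.expanduser("~"),
                      min_free_bytes: int = 10 * 1024 * 1024) -> bool:
    """Return True if `path` has at least `min_free_bytes` free, else print a warning."""
    usage = shutil.disk_usage(path)
    if usage.free < min_free_bytes:
        print(
            f"WARNING: only {usage.free} bytes free on {path}; "
            "the notebook server may fail to start until space is freed.",
            file=sys.stderr,
        )
        return False
    return True
```

Where such a check should live (jupyterhub-singleuser, jupyter_core, or the image's startup script) is part of what would need deciding.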

@yuvipanda @consideRatio, thoughts?

@consideRatio (Member) commented:

This is a common issue, and I invested time fixing this kind of issue recently. I'm also not sure how to resolve it, or where it would make sense to track it.

I think raising this for technical deliberation in a discourse.jupyter.org post makes the most sense to me - describing the problem and asking for input on how to address it in a sensible way.
