
[Incident] UToronto cluster ran out of disk space #1081

Closed · 4 tasks done
sgibson91 opened this issue Mar 10, 2022 · 4 comments
Comments

@sgibson91 (Member) commented Mar 10, 2022

Summary

  • Users were not able to log in to the hub
  • From Grafana, it seems the trigger occurred roughly 14 hours before the report and went unnoticed until a large volume of people tried to log in
  • The cause was that the disk space for the Azure cluster had run out, due to the increase in user accounts on the hub infrastructure.
  • We increased the disk quota for this cluster, and the problem is now resolved.

Hub information

Timeline (if relevant)

All times in GMT

2022-03-10 16:50 GMT

Support ticket opened reporting an increase in reports of users not able to access the hub

17:10

Screenshots of Grafana shared in Slack - these are from the last 7 days. You can see a sharp increase in "non-running pods" on the right-hand side. (EDIT: this was a strong indicator of something going wrong)

[Three Grafana screenshots from the last 7 days, showing the spike in non-running pods]

17:25

Log message reported from hub pod logs

[W 2022-03-10 17:25:19.774 JupyterHub base:1349] Failing suspected API request to not-running server: /hub/user/f48cb1b0-2e65-4ca3-8fc5-0a195cd308c9/api/metrics/v1

Attempted to find log messages from the time period indicated by the Grafana charts, but couldn't retrieve anything earlier than [I 2022-03-10 17:28:33.835 JupyterHub log:189] using the command below:

k logs -c hub hub-59746db487-qb2db --since-time=2022-03-09T14:00:00.000000000Z

18:30

We identified that the hub disk quota had been reached, and the hub was no longer able to allocate more disk space. This was causing the outages.

We logged in to the Azure portal for this cluster and manually increased the available disk space, which resolved the issue (see this comment for more details).


After-action report

What went wrong

  • The disk space ran out for the Azure cluster, which caused the hub to be unable to create new user sessions.
  • There was no automatic reporting or graphs on disk space usage, so it wasn't clear that this problem was approaching until it was too late.
  • Only a subset of our team members had the ability to look at the cloud infrastructure for the cluster, because access currently requires a University of Toronto account, and this hasn't been set up for all team members.

Where we got lucky

  • One of the two team members that had access to Toronto's cloud infrastructure happened to be available at the time, and was able to quickly spot the problem when looking at the logs.

Action items

Process improvements

  1. Everyone in the engineering team should have access to the UToronto Azure subscription (Get all 2i2c engineers access to UToronto Azure portal, team-compass#386)
  2. We should establish an escalation process for dealing with outages. For example, pinging cell phones if the problem is big enough or if only a certain team member can resolve it. (Define escalation practices when there are hub outages, #1118)

Documentation improvements

  1. Document various kinds of log messages you may get from a JupyterHub pod, and how to interpret what they might mean (Add some docs on common log messages, jupyterhub/jupyterhub#3820)
  2. Document every Grafana graph we have, what it represents, and useful ways to read it (Document every grafana graph, what it represents, and useful ways to read it, #1117)

Technical improvements

  1. Jupyter server currently does not start at all when the disk is full, leading to somewhat mysterious failures. We should find a way to make sure that the user's server does start even if the disk is full, and we can get better error messages that way.
  2. We should have a Grafana tab where we are monitoring disk usage on the home directory (Add a Grafana plot to monitor disk usage on the home directory, #1119); see the sketch after this list
  3. We should have automated alerts when there are 5xx error responses from the user pods, not just from the hub process itself (updated Cloud usage monitoring and alerting infrastructure and process #328)
  4. Simplify the UToronto deployment to make it more accessible (Simplify UToronto hub deployment so it is easier to manage, #1088)
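
For item 2, here is a minimal sketch of what the underlying metric could look like, assuming a small Prometheus exporter running somewhere with the NFS home share mounted. The mount path, port, and metric names are illustrative assumptions, not the actual deployment:

```python
# Hypothetical sketch: publish home-directory disk usage as Prometheus gauges
# that a Grafana panel could plot and alert on. Mount path, port, and metric
# names are assumptions, not our actual setup.
import shutil
import time

from prometheus_client import Gauge, start_http_server

HOME_MOUNT = "/home/jovyan"  # assumed NFS mount path inside a user pod

used_bytes = Gauge("home_directory_used_bytes", "Bytes used on the home directory share")
free_bytes = Gauge("home_directory_free_bytes", "Bytes free on the home directory share")

def collect() -> None:
    usage = shutil.disk_usage(HOME_MOUNT)
    used_bytes.set(usage.used)
    free_bytes.set(usage.free)

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics on port 8000 for Prometheus to scrape
    while True:
        collect()
        time.sleep(60)
```

In practice an existing node or kubelet exporter may already expose an equivalent filesystem metric; the point is only that the proposed Grafana plot needs a gauge like this to read from.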

Actions

  • Incident has been dealt with or is over
  • Sections above are filled out
  • Incident title and after-action report are cleaned up
  • All actionable items above have linked GitHub Issues
@yuvipanda (Member) commented Mar 10, 2022

Here's the workflow I followed.

  1. Get access to the UToronto cluster, with python3 deployer use-cluster-credentials utoronto
  2. Look at the hub logs, but nothing suspicious there. Particularly, no python stack traces - watching for the 'shape' of a python stacktrace going by as logs whoosh past is a useful first step.
  3. Look at list of pods running, with k -n prod get pod (k is an alias for kubectl)
  4. Notice that a freshly launched user pod immediately goes into Error state! This means that the process starting in the user pod (the Jupyter single-user server) fails on startup. This should basically never happen under normal operating conditions. (A sketch of how this check could be scripted is at the end of this comment.)
  5. Look at the logs for a pod in error state, with k -n prod logs <pod-name>. Immediately detect a python stacktrace!
       Traceback (most recent call last):
    File "/opt/conda/bin/jupyterhub-singleuser", line 8, in <module>
      sys.exit(main())
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 264, in launch_instance
      return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 845, in launch_instance
      app.initialize(argv)
    File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 852, in initialize
      result = super().initialize(*args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 573, in initialize
      return super().initialize(argv)
    File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 88, in inner
      return method(app, *args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/notebook/notebookapp.py", line 2143, in initialize
      super().initialize(argv)
    File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 88, in inner
      return method(app, *args, **kwargs)
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 239, in initialize
      self.migrate_config()
    File "/opt/conda/lib/python3.8/site-packages/jupyterhub/singleuser/mixins.py", line 399, in migrate_config
      super().migrate_config()
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 165, in migrate_config
      migrate()
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/migrate.py", line 244, in migrate
      ensure_dir_exists(env['jupyter_config'])
    File "/opt/conda/lib/python3.8/site-packages/jupyter_core/utils/__init__.py", line 11, in ensure_dir_exists
      os.makedirs(path, mode=mode)
    File "/opt/conda/lib/python3.8/os.py", line 223, in makedirs
      mkdir(name, mode)
    OSError: [Errno 28] No space left on device: '/home/jovyan/.jupyter'
    
  6. Aha, so no space on disk. I pop over to the Azure portal (https://portal.azure.com/#blade/Microsoft_Azure_FileStorage/FileShareMenuBlade/overview/storageAccountId/%2Fsubscriptions%2Fead3521a-d994-4a44-a68d-b16e35642d5b%2FresourceGroups%2F2i2c-utoronto-cluster%2Fproviders%2FMicrosoft.Storage%2FstorageAccounts%2F2i2cutorontohubstorage/path/homes/protocol/NFS) and increase size of disk from 1.5TB to 3TB.
  7. Launch a new user pod to see if it works. It does. I also look at newly starting pods to see if they are succeeding, with wk -n prod get pod -l component=singleuser-server (wk is an alias for watch kubectl, uses the very helpful watch command). Seems alright

Before doing this, I had also manually scaled the cluster up to 10 nodes (from the 3 that it had), because that could have also been the issue.
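
As an aside, the check in step 4 (user pods landing straight in Error state) is scriptable; below is a minimal sketch using the kubernetes Python client, assuming cluster credentials are already set up as in step 1. The namespace and label selector come from the commands above; everything else is illustrative.

```python
# Hypothetical sketch: list singleuser pods that are not Running, roughly
# automating step 4 above. Assumes kubeconfig credentials are already in place
# (e.g. via `deployer use-cluster-credentials utoronto`).
from kubernetes import client, config

def report_broken_user_pods(namespace: str = "prod") -> None:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector="component=singleuser-server")
    for pod in pods.items:
        if pod.status.phase == "Running":
            continue
        # Collect waiting/terminated reasons (e.g. CrashLoopBackOff, Error)
        reasons = []
        for cs in pod.status.container_statuses or []:
            if cs.state.waiting:
                reasons.append(cs.state.waiting.reason)
            if cs.state.terminated:
                reasons.append(cs.state.terminated.reason)
        print(f"{pod.metadata.name}: phase={pod.status.phase}, reasons={reasons}")

if __name__ == "__main__":
    report_broken_user_pods()
```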

yuvipanda added a commit to yuvipanda/jupyterhub that referenced this issue Mar 10, 2022
When debugging errors and outages, looking at the logs emitted by
JupyterHub is very helpful. This document tries to document some common
log messages, and what they mean.

I currently added just one log message, but we can add more
over time.

Ref 2i2c-org/infrastructure#1081
where this would've been useful troubleshooting
choldgraf changed the title from "[Incident] UToronto Spawn Failure Error" to "[Incident] UToronto cluster ran out of disk space" on Mar 10, 2022
@choldgraf (Member) commented:

I've opened up follow-up issues for the major remaining things in our post-mortem list. There is one that I wasn't really sure how to capture in an issue:

Jupyter server currently does not start at all when the disk is full, leading to somewhat mysterious failures. We should find a way to make sure that the user's server does start even if the disk is full, and we can get better error messages that way.

So if somebody knows how to turn that into an issue, please do so!

I'll close this, since the incident is resolved and we have follow-up issues for the remaining pieces.

@damianavila (Contributor) commented:

So if somebody knows how to turn that into an issue, please do so!

Maybe some checking before launching the single-user instance?
https://github.com/jupyterhub/jupyterhub/blob/main/jupyterhub/singleuser/app.py#L68
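
For what it's worth, here is one possible shape such a check could take. This is a purely illustrative sketch, not existing JupyterHub API; the threshold, function name, and where it would hook in are all assumptions.

```python
# Purely illustrative sketch of the kind of check suggested above: warn with a
# clear message (rather than crash with a bare OSError) when the home directory
# is effectively full, before the single-user server tries to write to it.
# The threshold and function name are assumptions, not JupyterHub API.
import os
import shutil
import sys

def warn_if_disk_full(path: str = os.path.expanduser("~"),
                      min_free_bytes: int = 10 * 1024 * 1024) -> bool:
    """Return True if `path` has at least `min_free_bytes` free, else print a warning."""
    usage = shutil.disk_usage(path)
    if usage.free < min_free_bytes:
        print(
            f"WARNING: only {usage.free} bytes free on {path}; "
            "the notebook server may fail to start until space is freed.",
            file=sys.stderr,
        )
        return False
    return True
```

Where such a check should live (jupyterhub-singleuser, jupyter_core, or the image's startup script) is part of what would need deciding.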

@yuvipanda @consideRatio, thoughts?

@consideRatio (Member) commented:

This is a common issue, and I invested time fixing this kind of issue recently. I'm also not sure how to resolve it, or where it would make sense to track it.

I think raising this for technical deliberation in a discourse.jupyter.org post makes the most sense to me - describing the problem and asking for input on how to address it in a sensible way.
