[Incident] UToronto cluster ran out of disk space #1081
Comments
Here's the workflow I followed.
Before doing this, I had also manually scaled the cluster up to 10 nodes (from the 3 that it had), because that could have also been the issue.
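For reference, the manual scale-up can also be done from the command line. The sketch below is a hypothetical example using the Azure CLI; the resource group, cluster name, and node pool name are placeholders, not values from this incident.

```bash
# Hypothetical example: scale an AKS node pool to 10 nodes.
# <resource-group>, <cluster-name>, and <nodepool-name> are placeholders.
az aks nodepool scale \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name <nodepool-name> \
  --node-count 10
```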
When debugging errors and outages, looking at the logs emitted by JupyterHub is very helpful. This document tries to capture some common log messages and what they mean. I've currently added just one log message, but we can add more over time. Ref 2i2c-org/infrastructure#1081, where this would've been useful for troubleshooting.
I've opened up follow-up issues for the major remaining things in our post-mortem list. There is one that I wasn't really sure how to capture in an issue:
So if somebody knows how to turn that into an issue, please do so! I'll close this, since the incident is resolved and we have follow-up issues for the remaining pieces.
Maybe some checking before launching the single-user instance? @yuvipanda @consideRatio, thoughts?
This is a common issue, and I've invested time fixing this kind of issue recently. I'm also not sure how to resolve it, or where it makes sense to track it. I think raising this for technical deliberation in a discourse.jupyter.org post makes the most sense to me: describing the problem and asking for input on how to address it in a sensible way.
Summary
Hub information
Timeline (if relevant)
All times in GMT
2022-03-10 16:50 GMT
Support ticket opened reporting an increase in reports of users not being able to access the hub.
17:10
Screenshots of Grafana shared in Slack; these are from the last 7 days. You can see a sharp increase in "non-running pods" on the right side. (EDIT: this was a strong indicator of something going wrong.)
17:25
Log message reported from hub pod logs
Attempted to find log messages from the time period indicated by the Grafana charts, but couldn't retrieve anything earlier than
[I 2022-03-10 17:28:33.835 JupyterHub log:189]
using the below command:
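The exact command isn't captured above. A typical way to pull the hub pod's logs with kubectl looks roughly like the sketch below; the namespace and the component=hub label selector are assumptions based on a standard zero-to-jupyterhub deployment, not values from this incident.

```bash
# Hypothetical example: fetch recent logs from the JupyterHub hub pod.
# <hub-namespace> is a placeholder; component=hub is the usual z2jh hub label.
kubectl logs --namespace <hub-namespace> \
  --selector component=hub \
  --since 24h \
  --timestamps
```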
18:30
We identified that the hub disk quota had been reached, and the hub was no longer able to allocate more disk space. This was causing the outages.
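One way to confirm this kind of problem is to check free space on the hub's volume from inside the pod. A rough sketch, assuming a standard zero-to-jupyterhub layout where the hub database volume is mounted at /srv/jupyterhub; the namespace and pod name are placeholders.

```bash
# Hypothetical example: check disk usage on the hub's persistent volume.
# <hub-namespace> and <hub-pod-name> are placeholders.
kubectl exec --namespace <hub-namespace> <hub-pod-name> -- df -h /srv/jupyterhub
```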
We logged in to the Azure portal for this cluster and manually increased the available disk space, which resolved the issue (see this comment for more details).
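For reference, the equivalent change can be made with the Azure CLI rather than the portal. This is a hypothetical sketch; the resource group, disk name, and target size are placeholders, and depending on the storage class it may be preferable to resize via the Kubernetes PVC instead.

```bash
# Hypothetical example: grow the Azure managed disk backing the hub volume.
# <resource-group>, <disk-name>, and the size are placeholders.
az disk update \
  --resource-group <resource-group> \
  --name <disk-name> \
  --size-gb 100
```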
After-action report
What went wrong
Where we got lucky
Action items
Process improvements
Documentation improvements
Technical improvements
Actions