No space left on device creates hung EC2 instance #1006
Comments
Taking a look. 👀
On Fri, May 13, 2022, 14:57 Jackson Maxfield Brown wrote:
Hello!
First I just want to say thank you for this library, it is truly
incredible what I have been able to spin up in such a short timeframe. 🙇
🙇
Onto the error: I was attempting to train a model with a bit more data
than my original go and ran into a System.IO.IOException: No space left
on device. I should have expected this but did not. With my prior test
runs I saw that correctly after error or success the EC2 instance was
shutdown, but for this one it was not. The associated EC2 instance stayed
running until I manually went and terminated it.
My personal desire would be to have it terminate on *any* error,
including Sys but this one may be tricky to handle so I understand and it
may just be that some documentation should be added as to what all to
clean up manually.
Full log here:
https://github.com/JacksonMaxfield/phd-infrastructures/actions/runs/2321863887
Thank you again!
|
@JacksonMaxfield I have anecdotally noticed that AWS instances tend to behave unexpectedly when the Actions runner exhausts its memory, as happened in your linked example. Can you provide the output of the following, in a gist or Pastebin, from the crashed instance?
|
I unfortunately cannot. I already terminated the instance, its attached volumes, etc. |
That is the intention. I believe there may be a niceness issue that we can probably fix, and I will try to investigate. I suspect the OOM crash is also preventing our cleanup process from executing, which is why the instance remains. |
No worries, if it happens again those commands are helpful for us to diagnose the issue. |
@JacksonMaxfield, you may also want to use |
Yep! I am already using that. I just bumped it up. |
@JacksonMaxfield it's looking pretty successful! / If you wanted to, say, run it on a smaller instance and yank those logs? ❤️ But no worries otherwise. I'm pretty certain about why it failed to self-terminate. |
I can do that; I may just need some step-by-step instructions. If I am understanding correctly, you want me to add the option:
But then what do I do after that? Where does that key go? When and where should I run these commands?
Apologies for my naivety |
I am heading out for the weekend, can take a look next week! |
Correct, that will add your SSH keys to the default ubuntu user so you can connect to the instance with ssh. After the action fails from the server running out of memory, you can run:
then from your computer you can copy them:
There is a chance that the server is really broken if the ssh command hangs; if that happens, reboot it from the web console and try the commands again. |
Thanks for the step-by-step, @dacbd. Running a new training job with a storage size that should fail: https://github.com/evamaxfield/phd-infrastructures/actions/runs/2333920369 Sidenote: maybe I just didn't see it in the documentation, but it may be good to have a spot in the documentation listing all the things to clean up manually if the instance never terminates?
I can double-check this list after I clean up the resources from the planned failed instance that is currently setting itself up 😂 |
Sure, we always welcome contributions. If you found some of the documentation hard to understand as a new user, please do let us know / feedback is always welcome. For |
Finally got it! Here you go! |
Thanks, just the cml log was enough in this case. |
Hello!
First I just want to say thank you for this library, it is truly incredible what I have been able to spin up in such a short timeframe. 🙇 🙇
Onto the error: I was attempting to train a model with a bit more data than my original go and ran into a `System.IO.IOException: No space left on device`. I should have expected this but did not. With my prior test runs I saw that, correctly, after error or success the EC2 instance was shut down, but for this one it was not. The associated EC2 instance stayed running until I manually went and terminated it.
My personal desire would be to have it terminate on any error, including Sys but this one may be tricky to handle so I understand and it may just be that some documentation should be added as to what all to clean up manually.
Full log here: https://github.com/evamaxfield/phd-infrastructures/actions/runs/2321863887
Thank you again!
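For anyone hitting the same failure, one user-side stopgap is to have the training script check free disk space before it starts, so the job fails fast while the runner is still healthy enough to shut itself down. The following is a minimal sketch, not part of CML: the helper names (`has_enough_space`, `require_space`) and the 20 GiB threshold are hypothetical and would need adjusting to the actual dataset and volume size.

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def has_enough_space(required_bytes, path="."):
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


def require_space(required_bytes, path="."):
    """Exit with an error message if there is not enough free space under `path`."""
    if not has_enough_space(required_bytes, path):
        free_gib = shutil.disk_usage(path).free / GIB
        raise SystemExit(
            f"Only {free_gib:.1f} GiB free under {path!r}; "
            f"need {required_bytes / GIB:.1f} GiB. Aborting before training."
        )


# Hypothetical usage at the top of a training script:
# require_space(20 * GIB)  # assumed threshold, adjust to your data
```

This only guards against starting a job that cannot fit; it does not help if the disk fills up mid-run, so increasing the instance's storage size is still the main fix.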