
No space left on device creates hung EC2 instance #1006

Closed
evamaxfield opened this issue May 13, 2022 · 17 comments
Assignees
Labels
bug Something isn't working cml-runner Subcommand flaky Heisenbugs

Comments

@evamaxfield

evamaxfield commented May 13, 2022

Hello!

First I just want to say thank you for this library, it is truly incredible what I have been able to spin up in such a short timeframe. 🙇 🙇

Onto the error: I was attempting to train a model with a bit more data than my original go and ran into a System.IO.IOException: No space left on device. I should have expected this but did not. With my prior test runs I saw that correctly after error or success the EC2 instance was shutdown, but for this one it was not. The associated EC2 instance stayed running until I manually went and terminated it.

My personal desire would be to have it terminate on any error, including Sys but this one may be tricky to handle so I understand and it may just be that some documentation should be added as to what to all cleanup manually.

Full log here: https://github.com/evamaxfield/phd-infrastructures/actions/runs/2321863887

Thank you again!

@dacbd
Contributor

dacbd commented May 13, 2022 via email

@dacbd
Contributor

dacbd commented May 13, 2022

@JacksonMaxfield I have anecdotally noticed that AWS instances like to behave unexpectedly when the actions runner exhausts its memory, as has happened in your linked example.

Can you provide the output of the following in a gist or Pastebin from the crashed instance?

  • journalctl -n all -u cml.service --no-pager
  • sudo dmesg --ctime
  • sudo dmesg --ctime --userspace

@evamaxfield
Author

I unfortunately cannot. I terminated the instance, its attached volumes, etc. already.

@dacbd
Contributor

dacbd commented May 13, 2022

My personal desire would be to have it terminate on any error, including Sys but this one may be tricky to handle so I understand and it may just be that some documentation should be added as to what to all cleanup manually.

That is the intention, I believe that there may be a niceness issue that we can probably fix, and I will try to investigate.

I suspect the oom crash is also causing our clean process to not execute, and thus the instance remains.

@dacbd
Contributor

dacbd commented May 13, 2022

I unfortunately cannot. I terminated the instance, its attached volumes, etc. already.

No worries; if it happens again, those commands are helpful for us to diagnose the issue.
Consider including --cloud-startup-script=$(echo 'echo "$(curl https://github.com/'"$GITHUB_ACTOR"'.keys)" >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0) for easy access to the instance for debugging.
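In case it helps to see what that value actually decodes to, here is a minimal local sketch (assuming GNU coreutils `base64`; `octocat` stands in for the real `$GITHUB_ACTOR`, which GitHub Actions sets automatically):

```shell
# Illustrative only: build the same base64 payload the --cloud-startup-script flag takes.
# The single quotes keep $(curl ...) from expanding now; it runs on the instance at boot.
GITHUB_ACTOR="octocat"  # assumption: provided by GitHub Actions in a real workflow
STARTUP_SCRIPT=$(echo 'echo "$(curl https://github.com/'"$GITHUB_ACTOR"'.keys)" >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0)

# Decoding shows the one-liner the instance will run at boot:
echo "$STARTUP_SCRIPT" | base64 -d
# → echo "$(curl https://github.com/octocat.keys)" >> /home/ubuntu/.ssh/authorized_keys
```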

@0x2b3bfa0
Member

@JacksonMaxfield, you may also want to use cml runner --cloud-hdd-size=<number>, where <number> is a custom storage size in gigabytes.
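For context, a hedged sketch of how that flag might sit in a full invocation (the instance type, region, and label below are illustrative assumptions, not values from this thread):

```shell
# Sketch only: launch a self-hosted runner with a 100 GB disk.
cml runner \
  --cloud=aws \
  --cloud-region=us-west \
  --cloud-type=m5.2xlarge \
  --cloud-hdd-size=100 \
  --labels=cml-runner
```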

@evamaxfield
Author

Yep! I am already using that. I just bumped it up.

@dacbd
Contributor

dacbd commented May 14, 2022

@JacksonMaxfield it's looking pretty successful! If you wanted to, say, run it on a smaller instance and yank those logs? ❤️ But no worries otherwise; I'm pretty certain about why it failed to self-terminate.

@evamaxfield
Author

I can do that; I may just need a bit of step-by-step instructions.

If I am understanding correctly, you want me to add the option:

--cloud-startup-script=$(echo 'echo "$(curl https://github.com/'"$GITHUB_ACTOR"'.keys)" >> /home/ubuntu/.ssh/authorized_keys' | base64 -w 0)

here

But then what do I do after that? Where does that key go?

When and where should I run these commands?

  • journalctl -n all -u cml.service --no-pager
  • sudo dmesg --ctime
  • sudo dmesg --ctime --userspace

Apologies for my naivety

@evamaxfield
Author

I am heading out for the weekend, can take a look next week!

@dacbd
Contributor

dacbd commented May 14, 2022

Correct, that will add your ssh keys to the default ubuntu user so you can connect to the instance over ssh. After the action fails from the server running out of memory, you can run:

ssh ubuntu@instance_ip
sudo journalctl -n all -u cml.service --no-pager > cml.log
sudo dmesg --ctime > system.log
sudo dmesg --ctime --userspace > userspace.log

then from your computer you can copy them:

scp ubuntu@instance_ip:~/cml.log .
scp ubuntu@instance_ip:~/system.log .
scp ubuntu@instance_ip:~/userspace.log .

There is a chance that the server could be really broken; if the ssh command hangs, reboot it from the web console and try the commands again.

@evamaxfield
Author

evamaxfield commented May 16, 2022

Thanks for the step-by-step @dacbd, running a new training job with a storage size that should fail: https://github.com/evamaxfield/phd-infrastructures/actions/runs/2333920369

Sidenote: maybe I just didn't see it in the documentation, but it may be good to have a spot listing all the things to clean up manually if the instance never terminates?
From last time I did it I think the things I had to terminate / delete were:

  • The EC2 instance
    • I think the attached volume terminated itself when I terminated the instance but if not, then the volume should be terminated too
  • The created cml-iterative security group
  • The created key-pair?

I can double-check this list after I clean up the resources from the planned failed instance that is currently setting itself up 😂

@dacbd
Contributor

dacbd commented May 16, 2022

Sure, we always welcome contributions; if you found some of the documentation hard to understand as a new user, please do let us know. Feedback is always welcome.

For cml runner there are some limitations to what it can clean up after itself; IIRC the security group is one of them (it provides the VPC assignment). You can work around this by providing a premade one with --aws-security-group
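A hedged sketch of that workaround, assuming a configured AWS CLI (the group name is illustrative, and the exact flag spelling should be checked against `cml runner --help`):

```shell
# One-time: create a reusable security group that cml does not own and will not delete.
aws ec2 create-security-group \
  --group-name cml-runners \
  --description "Reusable security group for CML runners"

# Then reference it when launching the runner:
cml runner --cloud=aws --aws-security-group=cml-runners
```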

@dacbd dacbd self-assigned this May 16, 2022
@dacbd dacbd added bug Something isn't working cml-runner Subcommand awaiting-response Waiting for user feedback flaky Heisenbugs labels May 16, 2022
@evamaxfield
Author

Finally got it! Here you go!

cml.log
system.log
userspace.log

@dacbd
Contributor

dacbd commented May 17, 2022

Thanks, just the cml log was enough in this case.

@dacbd
Contributor

dacbd commented Aug 22, 2022

@dacbd
Contributor

dacbd commented Oct 17, 2022

#1225

@dacbd dacbd closed this as completed Oct 17, 2022