Instances intermittently fail to terminate #649
That's very weird, unless you have set up the name. If you never did that, I would say that's not doable 🤔
The instance name was either `-` or not set at all.
Do you mean that the instance was named just `-`?
Yep, precisely. It was just `-`. I've really no idea what happened to the name, but I'm certain it was produced by CML, because it was a spot instance of a specific type that matches those we use with CML. In any case, I have only seen this behaviour once, whereas today alone I have had two CML runners fail to terminate but with the expected `cml-*` name. I've been trying to identify the contributing factors, but so far all I can say is that it has happened with a couple of instances from the c5 family, of different sizes. I'll let you know if I manage to pin down anything more specific.
It's really intermittent. I can run 10 jobs and it doesn't happen, then I run one on its own and it does. Very strange. For example, I just found an instance that had failed to terminate:
The workflow on GitHub:
with config:
logs:
Sooo... something interesting happened after I terminated that instance ☝️ I terminated the instance manually, and a new one was launched to run the job again; is that expected? The reason for the new instance's failure, as you will see in the log, is that AWS has only permitted us 4 vCPUs. From the log:
Yes it is. It's resuming the failed job. Now you would be able to train really big models on spot instances transparently.
After a batch of 150 runs I can say that termination did not fail a single time. The only plausible explanation involves the synthetic chrono we keep based on the runner's logs: once a new job arrives we reset it and stop it until the job is completed. Maybe the completion log is not arriving... that would leave the chrono not counting and the idle timeout never taking effect.
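To make that failure mode concrete, here is a minimal sketch in Go of idle-timeout logic driven by job start/completion events (hypothetical names, not CML's actual implementation). If the completion event is lost, the watcher stays busy forever and the timeout never fires:

```go
// Minimal sketch of an idle-timeout "chrono" (illustrative only). The
// timer is paused when a job starts and reset when a completion event
// arrives. If the completion log never arrives, busy stays true and
// ShouldTerminate never returns true -- matching the observed behaviour.
package main

import (
	"fmt"
	"sync"
	"time"
)

type IdleWatcher struct {
	mu       sync.Mutex
	busy     bool
	lastIdle time.Time
}

func NewIdleWatcher() *IdleWatcher {
	return &IdleWatcher{lastIdle: time.Now()}
}

// JobStarted pauses the idle chrono.
func (w *IdleWatcher) JobStarted() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.busy = true
}

// JobCompleted resets and restarts the idle chrono. If this event is
// lost (e.g. the completion log never arrives), the watcher stays busy.
func (w *IdleWatcher) JobCompleted() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.busy = false
	w.lastIdle = time.Now()
}

// ShouldTerminate reports whether the runner has been idle longer than
// the configured timeout.
func (w *IdleWatcher) ShouldTerminate(timeout time.Duration) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return !w.busy && time.Since(w.lastIdle) > timeout
}

func main() {
	w := NewIdleWatcher()
	w.JobStarted()
	// If JobCompleted() were never called, ShouldTerminate would never fire.
	w.JobCompleted()
	time.Sleep(100 * time.Millisecond)
	fmt.Println(w.ShouldTerminate(50 * time.Millisecond)) // true
}
```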
💯💪
This is the only plausible error. To solve it we could check the status of the runner (active, idle) during a job execution, along the lines of the sketch below. @iterative/cml
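One hedged way to do that check: the GitHub REST API endpoint `GET /repos/{owner}/{repo}/actions/runners` reports each self-hosted runner's `status` and `busy` fields, which could serve as a second source of truth independent of the logs. A sketch with illustrative names (not CML's actual code):

```go
// Sketch: query self-hosted runner state via the GitHub REST API.
// A runner that is "online" but not busy for longer than the idle
// timeout is a candidate for termination, regardless of whether a
// completion log was ever seen.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type runner struct {
	Name   string `json:"name"`
	Status string `json:"status"` // "online" or "offline"
	Busy   bool   `json:"busy"`
}

type runnersResponse struct {
	Runners []runner `json:"runners"`
}

func listRunners(owner, repo, token string) ([]runner, error) {
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runners", owner, repo)
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "token "+token)
	req.Header.Set("Accept", "application/vnd.github.v3+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out runnersResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Runners, nil
}

func main() {
	runners, err := listRunners("iterative", "cml", os.Getenv("GITHUB_TOKEN"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, r := range runners {
		fmt.Printf("%s: status=%s busy=%v\n", r.Name, r.Status, r.Busy)
	}
}
```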
Thanks for the insights! @jamt9000 it seems to be related to single jobs then?!
In my case, I've never used `--reuse`.
@ivyleavedtoadflax the reason for not being destroyed is the missing name. The provider needs a name to be able to destroy the instance. That would explain everything, however:

[screenshot: EC2 console showing the instance and its tags]
Actually, even if the name is not set, the id will always be set (#525), so this is not the error either.
I have done a batch with that instance type and a 60-second idle timeout; 0 machines were left. Here is the provider config:

```hcl
resource "iterative_cml_runner" "runner-gh-34" {
  token         = ""
  repo          = "https://github.com/DavidGOrtega/"
  driver        = "github"
  labels        = "test1"
  idle_timeout  = 60
  cloud         = "aws"
  region        = "us-west"
  instance_type = "c5a.4xlarge"
  spot          = true
}
```
@ivyleavedtoadflax @jamt9000 if you can still access the machine in the console, the tags would help us. We destroy using the tag as opposed to the name, so I need to know which tags the instance has, as shown in the picture above.
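For context, here is a sketch of what a destroy-by-tag lookup looks like with the AWS SDK for Go v2. The tag key (`Name`) and value are assumptions for illustration, not the provider's actual tag scheme; the point is that if the tag was never written, the filter matches nothing and the instance is never terminated:

```go
// Sketch of a destroy-by-tag lookup (illustrative, not the provider's
// actual code). If the tag was never written (or is "-"), the filter
// below matches nothing and the instance is never found or terminated.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Find instances whose Name tag matches the runner name.
	out, err := client.DescribeInstances(context.TODO(), &ec2.DescribeInstancesInput{
		Filters: []types.Filter{
			{Name: aws.String("tag:Name"), Values: []string{"cml-runner-example"}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	for _, res := range out.Reservations {
		for _, inst := range res.Instances {
			fmt.Println("would terminate:", aws.ToString(inst.InstanceId))
			// client.TerminateInstances(...) would go here.
		}
	}
}
```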
We explicitly set the instance name and it wasn't an empty string 🤷
I already terminated this one unfortunately, but if it happens again I will capture more information.
@ivyleavedtoadflax thanks for the support. I was able to find the bug. It happens that sometimes the GH API takes a while to update the status of the job: it should be completed, but even though the job executed successfully and the UI shows green, the status via the API is still 'queued'. As opposed to GL, in GH we need to guess the job id, since the logs do not provide it.
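A sketch of the kind of polling this implies, using the GitHub REST API endpoint `GET /repos/{owner}/{repo}/actions/runs/{run_id}/jobs`. Because a single 'queued' reading can be stale, retrying against a deadline is safer than trusting the first response. The repo, run id, and retry policy below are illustrative assumptions:

```go
// Sketch: poll a workflow run's jobs until all report "completed",
// tolerating the status lag described above. A stale "queued" status
// on a finished job would otherwise block a naive check forever.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type job struct {
	ID     int64  `json:"id"`
	Status string `json:"status"` // "queued", "in_progress", "completed"
}

type jobsResponse struct {
	Jobs []job `json:"jobs"`
}

func fetchJobs(owner, repo string, runID int64, token string) ([]job, error) {
	url := fmt.Sprintf("https://api.github.com/repos/%s/%s/actions/runs/%d/jobs", owner, repo, runID)
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "token "+token)
	req.Header.Set("Accept", "application/vnd.github.v3+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var out jobsResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out.Jobs, nil
}

func main() {
	token := os.Getenv("GITHUB_TOKEN")
	deadline := time.Now().Add(2 * time.Minute)
	for time.Now().Before(deadline) {
		jobs, err := fetchJobs("iterative", "cml", 123456789, token)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		done := len(jobs) > 0
		for _, j := range jobs {
			fmt.Printf("job %d: %s\n", j.ID, j.Status)
			if j.Status != "completed" {
				done = false
			}
		}
		if done {
			return
		}
		time.Sleep(10 * time.Second)
	}
	fmt.Println("gave up waiting; statuses may be stale")
}
```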
Amazing, great work @DavidGOrtega 👏
I am also seeing that workflows that have already completed successfully get restarted, which is probably related to this and #583
@jamt9000 exactly
I've had a couple of instances recently that have failed to terminate. In the most recent case this was with the `--reuse` flag set, having run a series of 8 queued jobs. The instance is sitting idle, with a timeout of `60s` having passed ten minutes ago. I'll need to terminate the instance manually from the command line.

In the most serious case, I had an instance run for two weeks without terminating. It took so long for us to notice because the instance name did not get set to `cml-*` as usual.

Here's the yml we are using: