CML seemingly fails to restart job after AWS Spot instances have been shut down #650
Graceful shutdown issues are really fun to debug; see iterative/terraform-provider-iterative#90 for an example. 🙃 It would be awesome if you could reproduce this issue when spawning a single runner and follow the instructions below to see what's failing.
This is what I got from the logs: log.txt. Interestingly though, the GitHub Actions job didn't crash! It's still running as far as I can tell. Is this the expected behaviour?
As per the spot request status reference, the state transitions you describe seem to match the expected behavior. I don't know if those actions should take several minutes, though.
Is it just the status, or have you checked whether the job is alive? If you aren't sure, you can replace the job's main step with `while true; do date; sleep 1; done`. If you still see fresh date values in the workflow log after the spot instance is terminated, the job itself is still running.
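For example, a step along these lines could stand in for the real workload; the step name is just illustrative:

```yaml
# Illustrative stand-in for the job's real work: prints a timestamp every
# second so the workflow log shows whether the runner is still executing.
- name: liveness-probe
  run: while true; do date; sleep 1; done
```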
In this case, can you please retrieve a new log using the following step instead of the usual CML installation step?

```yaml
- run: |
    sudo npm config set user 0
    sudo npm install --global git+https://github.com/iterative/cml#debug-long-job
```
So actually that was my bad, I was running the jobs on the wrong runner :/ I managed to set up the job correctly afterwards and have been waiting for it to fail. Unfortunately, AWS Spot did not shut the instance down; it did, however, reach the maximum runtime of 6 hours. Here are the logs.

The workflow doesn't get restarted after this, I'm afraid...
🤔 Our code expects the spot instances to be terminated. After those 6 hours, is the spot instance still alive?
Eureka! As you say, the spot instance is not terminated. The culprit here is the job timeout in GitHub. You have to set it up like this:

```yaml
run_optimisation:
  timeout-minutes: 10000
  continue-on-error: false
  strategy:
    matrix: ${{fromJson(needs.setup_config.outputs.json_string).matrix}}
    fail-fast: true
  runs-on: [self-hosted, "cml-runner"]
  container:
    image: python:3.8.10-slim
    volumes:
      - /dev/shm:/dev/shm
```

As far as I can tell from my experience:
Our runner only restarts in two scenarios:
Ah ok, I thought the timeout referred to the 360 minutes that a job can run for. In any case, I managed to replicate the issue. This is what I see in the logs.

On the GitHub side the job failed because the runner shut down when Spot killed it. On the AWS side I can also confirm that the spot instance was terminated.
I do not fully understand this. Could you please elaborate?
Can you try to terminate the instance manually in the console and check whether it restarts? If so, it seems that AWS Spot might be killing instances with a method other than terminate.
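For reference, the same check can be run from the AWS CLI instead of the console; the instance ID below is a placeholder:

```bash
# Manually terminate the runner's EC2 instance (placeholder instance ID)
# to check whether CML restarts the workflow.
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

# Watch the instance state until it reports "terminated".
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].State.Name' \
  --output text
```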
And I can confirm that they also use terminate, and the workflow restarts 🤔
Terminating the instance manually gives me this in the logs:
But I can confirm that the GitHub workflow restarts!
@thatGreekGuy96 I have set up a spot termination notification handler in #653.
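For context, a spot interruption can be detected from inside the instance by polling the EC2 instance metadata endpoint; a minimal sketch of that idea (not necessarily how #653 implements it):

```bash
# Poll the EC2 instance metadata service for a spot interruption notice.
# The endpoint returns 404 until AWS schedules the instance for reclaim,
# at which point there is roughly a two-minute warning to clean up.
while true; do
  if curl -fs http://169.254.169.254/latest/meta-data/spot/instance-action > /dev/null; then
    echo "Spot interruption notice received; cleaning up the runner"
    break
  fi
  sleep 5
done
```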
Awesome, thank you!
Hey everyone,

So I noticed a couple of days ago that CML now has new functionality that allows it to restart workflows if one or more AWS Spot runners have been told to shut down. However, this doesn't seem to be happening for me.

A couple of details about our case:

- We use the `latest` version of CML to deploy a bunch of runners as shown below.
- `continue-on-error` is set to `False` (wondering whether that is interfering with CML?)
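For illustration, a deployment step along these lines is what is meant here; this is a minimal sketch assuming the `cml-runner` CLI, and the cloud region, instance type, labels, and secret names are placeholders rather than the actual configuration:

```yaml
deploy-runner:
  runs-on: ubuntu-latest
  steps:
    - uses: iterative/setup-cml@v1
    - name: Deploy a CML runner on an AWS spot instance
      env:
        REPO_TOKEN: ${{ secrets.PERSONAL_ACCESS_TOKEN }}  # placeholder secret name
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        cml-runner \
          --cloud=aws \
          --cloud-region=us-east-1 \
          --cloud-type=m \
          --cloud-spot \
          --labels=cml-runner
```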