
CML seemingly fails to restart job after AWS Spot instances have been shut down #650

Closed
thatGreekGuy96 opened this issue Jul 8, 2021 · 15 comments · Fixed by #653
Assignees: DavidGOrtega, 0x2b3bfa0
Labels: cml-runner (Subcommand), p0-critical (Max priority, ASAP)

@thatGreekGuy96

thatGreekGuy96 commented Jul 8, 2021

Hey everyone,
So I noticed a couple of days ago that CML now has new functionality that allows it to restart workflows if one or more AWS Spot runners have been told to shut down. However, this doesn't seem to be happening for me.

A couple of details about our case:

  • Our cloud is AWS
  • We're (as far as I can tell) using the latest version of CML to deploy a bunch of runners, as shown below.
  deploy_runners:
    name: Deploy Cloud Instances
    needs: [setup_config]

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
        with:
          version: latest

      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
          CASE_NAME: ${{ matrix.case_name }}
          N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}

        run: |
          for (( i=1; i<=N_RUNNERS; i++ ))
          do
            echo "Deploying runner ${i}"
            cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner &
          done
          wait
          echo "Deployed ${N_RUNNERS} runners."
  • The job each runner runs does not use the CML images provided by iterative
  • The job that each runner runs has continue-on-error set to false (wondering whether that is interfering with CML?)
  run_optimisation:
    continue-on-error: false
    strategy:
      matrix: ${{fromJson(needs.setup_config.outputs.json_string).matrix}}
      fail-fast: true

    runs-on: [self-hosted, "cml-runner"]
    container:
      image: python:3.8.10-slim
      volumes:
          - /dev/shm:/dev/shm

@DavidGOrtega added the cml-runner (Subcommand) and p0-critical (Max priority, ASAP) labels on Jul 8, 2021
@DavidGOrtega self-assigned this on Jul 8, 2021
@0x2b3bfa0
Member

0x2b3bfa0 commented Jul 8, 2021

Graceful shutdown issues are really fun to debug; see iterative/terraform-provider-iterative#90 for an example. 🙃 It would be awesome if you could reproduce this issue when spawning a single runner and follow the instructions below to see what's failing.

  1. Generate a new RSA PEM private key on your local system for debugging purposes:

    $ ssh-keygen -t rsa -m pem -b 4096 -f key.pem
  2. Store the contents of key.pem as a repository secret named INSTANCE_PRIVATE_SSH_KEY as per the workflow below.

  3. Run the following workflow and retrieve the instance IP address from the logs of the deploy step:

    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - uses: iterative/setup-cml@v1
          - run: >-
              cml-runner
              --labels=debug
              --cloud=aws
              --cloud-type=m
              --cloud-hdd-size=100
              --cloud-region=eu-west-2
              --cloud-ssh-private="$INSTANCE_PRIVATE_SSH_KEY"
              --cloud-spot
            env:
              REPO_TOKEN: ${{ secrets.ACCESS_TOKEN_CML }}
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
              INSTANCE_PRIVATE_SSH_KEY: ${{ secrets.INSTANCE_PRIVATE_SSH_KEY }}
      run:
        needs: deploy
        runs-on: [self-hosted, debug]
        steps:
          - run: cat
  4. Monitor the instance logs from your local system by using the generated key as an identity file:

    $ ssh -i key.pem ubuntu@IP_ADDRESS journalctl --follow --unit cml
  5. Once the spot instance gets evicted, take a look at the logs and attach them to this issue.

@thatGreekGuy96
Author

thatGreekGuy96 commented Jul 8, 2021

This is what I got from the logs: log.txt

Interestingly though, the GitHub Actions job didn't crash! It's still running as far as I can tell. Is this the expected behaviour?

@0x2b3bfa0
Member

0x2b3bfa0 commented Jul 8, 2021

> On the EC2 console, I can see that the instance in question has indeed been terminated.
>
> HOWEVER, the spot request that created it still has a "fulfilled" status AND the GitHub Actions job is still running...
>
> I hope this makes more sense to you than it does to me!
>
> Update: After about 5 minutes, the spot request was marked as terminated-by-user, [...]

As per the spot request status reference, the state transitions you describe seem to match the expected behavior. I don't know if those actions should take several minutes, though.
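
If you want to watch those state transitions yourself, one way, assuming the AWS CLI is configured for the same account and region, would be something like:

    # Illustration only: list spot request states and status codes in the region.
    aws ec2 describe-spot-instance-requests --region eu-west-2 \
      --query 'SpotInstanceRequests[].{Id:SpotInstanceRequestId,State:State,Status:Status.Code}'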

> [...] but the GitHub Actions job is still running...

Is it just the status, or have you checked if the job is alive? If you aren't sure, you can replace the cat command in the last line of #650 (comment) with the following:

while true; do date; sleep 1; done

If you still see fresh date values after the spot instance enters the terminated-by-user state, it would mean that either the old instance is still alive or everything worked as expected and CML restarted the job without you noticing. Please take note of the first date value; it may change after a job restart.

> As far as I can tell, no new spot requests have been made.

In this case, can you please retrieve a new log using the following step instead of the - uses: iterative/setup-cml@v1 one?

- run: |
    sudo npm config set user 0
    sudo npm install --global git+https://github.com/iterative/cml#debug-long-job

@0x2b3bfa0 self-assigned this on Jul 8, 2021
@thatGreekGuy96
Author

So actually that was my bad, I was running the jobs on the wrong runner :/ I managed to set the job up correctly afterwards and have been waiting for it to fail. Unfortunately, AWS Spot did not shut the instance down. It did, however, reach the maximum runtime of 6 hours. Here are the logs:

-- Logs begin at Fri 2021-07-02 11:29:27 UTC. --
Jul 08 15:54:01 ip-172-31-10-233 systemd[1]: Starting cml service...
Jul 08 15:54:48 ip-172-31-10-233 cml.sh[3089]: Preparing workdir /tmp/tmp.ulnFhn9cxz/.cml/cml-rlj3ki1iw7...
Jul 08 15:54:48 ip-172-31-10-233 cml.sh[3089]: Launching github runner
Jul 08 15:55:06 ip-172-31-10-233 cml.sh[3089]: {"level":"info","date":"2021-07-08T15:55:06.759Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 08 15:55:06 ip-172-31-10-233 cml.sh[3089]: {"level":"info","date":"2021-07-08T15:55:06.759Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 08 15:55:07 ip-172-31-10-233 cml.sh[3089]: {"level":"info","date":"2021-07-08T15:55:07.185Z","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Jul 08 15:55:23 ip-172-31-10-233 cml.sh[3089]: {"level":"info","date":"2021-07-08T15:55:22.988Z","repo":"https://github.com/continuum-industries/Pareto","job":3020940409,"status":"job_started"}
Jul 08 21:55:37 ip-172-31-10-233 cml.sh[3089]: {"level":"error","date":"2021-07-08T21:55:37.311Z","repo":"https://github.com/continuum-industries/Pareto","job":"","status":"job_ended","success":false}

The workflow doesn't get restarted after this, I'm afraid...
I'll run another test to see whether AWS will shut down the spot instance.

@DavidGOrtega
Contributor

> Unfortunately, AWS Spot did not shut the instance down. It did, however, reach the maximum runtime of 6 hours. Here are the logs

🤔 Our code expects the spot instances to be terminated. After those 6 hours, is the spot instance still alive?

@DavidGOrtega
Contributor

DavidGOrtega commented Jul 9, 2021

Eureka!
I think I've got it!

As you say, the spot instance is not terminated. The culprit here is the job timeout in GH.

You have to set it up as:

run_optimisation:
    timeout-minutes: 10000
    continue-on-error: false
    strategy:
      matrix: ${{fromJson(needs.setup_config.outputs.json_string).matrix}}
      fail-fast: true

    runs-on: [self-hosted, "cml-runner"]
    container:
      image: python:3.8.10-slim
      volumes:
          - /dev/shm:/dev/shm

As far as I can tell from my experience:

  • the workflow limit is meant to be at most 72 hours (our runner handles this ;) ); however, if you set the job limit above those 72 hours, that limit disappears (probably a GH bug)
  • never set timeout-minutes on a step; contrary to the GH docs, it WON'T work and the limit falls back to 6 hours again

@DavidGOrtega
Contributor

Our runner only restarts in two scenarios:

  • the machine is being destroyed and there are still jobs pending
  • the 72-hour GH workflow limit: we restart the workflow when it hits, even though specifying a timeout-minutes greater than 72 hours makes that limit disappear. We still handle this (despite my experience above) because I believe that behaviour is a bug that could be fixed at any point, which would break the runner again. The only drawback is that you might lose an epoch while we restart because of the 72-hour limit (a rough sketch of what such a restart amounts to follows below).
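
For context, "restarting the workflow" boils down to re-running the failed workflow run through the GitHub REST API. A minimal sketch of that call (OWNER, REPO and RUN_ID are placeholders; this is an illustration, not CML's actual code):

    # Illustration only: re-run a failed workflow run via the GitHub REST API.
    # OWNER, REPO and RUN_ID are placeholders; REPO_TOKEN needs repo scope.
    curl -X POST \
      -H "Authorization: token ${REPO_TOKEN}" \
      -H "Accept: application/vnd.github.v3+json" \
      "https://api.github.com/repos/${OWNER}/${REPO}/actions/runs/${RUN_ID}/rerun"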

@thatGreekGuy96
Author

Ah OK, I thought the timeout referred to the 360 minutes that a job can run for.

In any case, I managed to replicate the issue. This is what I see in the logs:

-- Logs begin at Fri 2021-07-02 11:29:27 UTC. --
Jul 09 08:21:43 ip-172-31-14-78 systemd[1]: Started cml.service.
Jul 09 08:22:36 ip-172-31-14-78 cml.sh[3252]: Preparing workdir /tmp/tmp.ePH7ZEdNZl/.cml/cml-8qy3d9e8w7...
Jul 09 08:22:36 ip-172-31-14-78 cml.sh[3252]: Launching github runner
Jul 09 08:22:58 ip-172-31-14-78 cml.sh[3252]: {"level":"info","date":"2021-07-09T08:22:58.295Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 09 08:22:58 ip-172-31-14-78 cml.sh[3252]: {"level":"info","date":"2021-07-09T08:22:58.296Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 09 08:22:58 ip-172-31-14-78 cml.sh[3252]: {"level":"info","date":"2021-07-09T08:22:58.693Z","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Jul 09 08:23:10 ip-172-31-14-78 cml.sh[3252]: {"level":"info","date":"2021-07-09T08:23:09.417Z","repo":"https://github.com/continuum-industries/Pareto","job":3023764478,"status":"job_started"}

client_loop: send disconnect: Broken pipe

On the GitHub side, the job failed because the runner shut down when Spot killed it. On the AWS side, I can also confirm that the spot instance was terminated with instance-terminated-no-capacity. However, the workflow did not restart.

@DavidGOrtega
Contributor

> Ah OK, I thought the timeout referred to the 360 minutes that a job can run for.

I do not fully understand this. Could you please elaborate?

> On the GitHub side, the job failed because the runner shut down when Spot killed it. On the AWS side, I can also confirm that the spot instance was terminated with instance-terminated-no-capacity. However, the workflow did not restart.

Can you try to terminate the instance manually in the console and check whether it restarts? If so, it seems Spot might be shutting instances down by some method other than terminate.
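
For reference, the same manual termination can also be done from the AWS CLI (the instance ID below is only a placeholder):

    # Placeholder instance ID; terminates the instance the same way the console does.
    aws ec2 terminate-instances --region eu-west-2 --instance-ids i-0123456789abcdef0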

@DavidGOrtega
Contributor

And I can confirm that they also use terminate, and the workflow restarts 🤔

@thatGreekGuy96
Author

Terminating the instance manually gives me this in the logs:

-- Logs begin at Fri 2021-07-02 11:29:27 UTC. --
Jul 12 08:15:19 ip-172-31-7-6 systemd[1]: Started cml.service.
Jul 12 08:16:01 ip-172-31-7-6 cml.sh[3301]: Preparing workdir /tmp/tmp.LCA5ec7MoW/.cml/cml-61sqgx299o...
Jul 12 08:16:01 ip-172-31-7-6 cml.sh[3301]: Launching github runner
Jul 12 08:16:27 ip-172-31-7-6 cml.sh[3301]: {"level":"info","date":"2021-07-12T08:16:27.131Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 12 08:16:27 ip-172-31-7-6 cml.sh[3301]: {"level":"info","date":"2021-07-12T08:16:27.132Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 12 08:16:28 ip-172-31-7-6 cml.sh[3301]: {"level":"info","date":"2021-07-12T08:16:28.324Z","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Jul 12 08:16:40 ip-172-31-7-6 cml.sh[3301]: {"level":"info","date":"2021-07-12T08:16:40.023Z","repo":"https://github.com/continuum-industries/Pareto","job":3044570052,"status":"job_started"}
Connection to 18.133.187.34 closed by remote host.

But I can confirm that the GitHub workflow restarts!

@DavidGOrtega
Contributor

@thatGreekGuy96 I have set up a spot termination notification handler in #653
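
Roughly speaking, a handler of this kind polls the EC2 instance metadata service for a Spot interruption notice. A minimal sketch of that mechanism (illustration only, not the actual #653 implementation):

    # Illustration only: poll IMDSv2 for a Spot interruption notice every 5 seconds.
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    while true; do
      if curl -s -f -H "X-aws-ec2-metadata-token: $TOKEN" \
        "http://169.254.169.254/latest/meta-data/spot/instance-action" > /dev/null; then
        echo "Spot interruption notice received; shutting down gracefully"
        break
      fi
      sleep 5
    done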

@thatGreekGuy96
Author

Awesome, thank you!
