
CML seemingly fails to restart job after AWS Spot instances have been shut down #650

Closed
thatGreekGuy96 opened this issue Jul 8, 2021 · 15 comments · Fixed by #653
Assignees: DavidGOrtega, 0x2b3bfa0
Labels: cml-runner (Subcommand), p0-critical (Max priority, ASAP)

@thatGreekGuy96

thatGreekGuy96 commented Jul 8, 2021

Hey everyone,
So I noticed a couple of days ago that CML now has new functionality that allows it to restart workflows if one or more AWS Spot runners have been told to shut down. However, this doesn't seem to be happening for me.

A couple of details about our case:

  • Our cloud is AWS
  • We're (as far as I can tell) using the latest version of CML to deploy a bunch of runners, as shown below.
  deploy_runners:
    name: Deploy Cloud Instances
    needs: [setup_config]

    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
        with:
          version: latest

      - name: "Deploy runner on EC2"
        shell: bash
        env:
          repo_token: ${{ secrets.ACCESS_TOKEN_CML }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
          CASE_NAME: ${{ matrix.case_name }}
          N_RUNNERS: ${{ fromJson(needs.setup_config.outputs.json_string).n_runners }}

        run: |
          for (( i=1; i<=N_RUNNERS; i++ ))
          do
            echo "Deploying runner ${i}"
            cml-runner \
            --cloud aws \
            --cloud-region eu-west-2 \
            --cloud-type=m \
            --cloud-hdd-size 100 \
            --cloud-spot \
            --labels=cml-runner &
          done
          wait
          echo "Deployed ${N_RUNNERS} runners."
  • The job each runner runs does not use the CML images provided by iterative
  • The job that each runner runs has continue-on-error set to false (wondering whether that is interfering with CML?)
  run_optimisation:
    continue-on-error: false
    strategy:
      matrix: ${{fromJson(needs.setup_config.outputs.json_string).matrix}}
      fail-fast: true

    runs-on: [self-hosted, "cml-runner"]
    container:
      image: python:3.8.10-slim
      volumes:
          - /dev/shm:/dev/shm

@DavidGOrtega added the cml-runner (Subcommand) and p0-critical (Max priority, ASAP) labels on Jul 8, 2021
@DavidGOrtega self-assigned this on Jul 8, 2021
@0x2b3bfa0
Member

0x2b3bfa0 commented Jul 8, 2021

Graceful shutdown issues are really fun to debug; see iterative/terraform-provider-iterative#90 for an example. 🙃 It would be awesome if you could reproduce this issue when spawning a single runner and follow the instructions below to see what's failing.

  1. Generate a new RSA PEM private key on your local system for debugging purposes:

    $ ssh-keygen -t rsa -m pem -b 4096 -f key.pem
  2. Store the contents of key.pem as a repository secret named INSTANCE_PRIVATE_SSH_KEY as per the workflow below.

  3. Run the following workflow and retrieve the instance IP address from the logs of the deploy step:

    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v2
          - uses: iterative/setup-cml@v1
          - run: >-
              cml-runner
              --labels=debug
              --cloud=aws
              --cloud-type=m
              --cloud-hdd-size=100
              --cloud-region=eu-west-2
              --cloud-ssh-private="$INSTANCE_PRIVATE_SSH_KEY"
              --cloud-spot
            env:
              REPO_TOKEN: ${{ secrets.ACCESS_TOKEN_CML }}
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_TESTING }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_TESTING }}
              INSTANCE_PRIVATE_SSH_KEY: ${{ secrets.INSTANCE_PRIVATE_SSH_KEY }}
      run:
        needs: deploy
        runs-on: [self-hosted, debug]
        steps:
          - run: cat
  4. Monitor the instance logs from your local system by using the generated key as an identity file:

    $ ssh -i key.pem ubuntu@IP_ADDRESS journalctl --follow --unit cml
  5. Once the spot instance gets evicted, take a look at the logs and attach them to this issue.

@thatGreekGuy96
Author

thatGreekGuy96 commented Jul 8, 2021

This is what I got from the logs: log.txt

Interestingly though, the GitHub Actions job didn't crash! It's still running as far as I can tell. Is this the expected behaviour?

@0x2b3bfa0
Member

0x2b3bfa0 commented Jul 8, 2021

> On the EC2 console, I can see that the instance in question has indeed been terminated.
>
> HOWEVER, the spot request that created it still has a "fulfilled" status AND the GitHub Actions job is still running...
>
> I hope this makes more sense to you than it does to me!
>
> Update: After about 5 minutes, the spot request was marked as terminated-by-user, [...]

As per the spot request status reference, the state transitions you describe seem to match the expected behavior. I don't know if those actions should take several minutes, though.
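
If you want to watch those state transitions yourself, one way, assuming the AWS CLI is configured for the same account and region, would be something like:

    # Illustration only: list spot request states and status codes in the region.
    aws ec2 describe-spot-instance-requests --region eu-west-2 \
      --query 'SpotInstanceRequests[].{Id:SpotInstanceRequestId,State:State,Status:Status.Code}'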

> [...] but the GitHub Actions job is still running...

Is it just the status, or have you checked if the job is alive? If you aren't sure, you can replace the cat command in the last line of #650 (comment) with the following:

while true; do date; sleep 1; done

If you still see fresh date values after the spot instance enters the terminated-by-user state, it would mean that either the old instance is still alive or everything worked as expected and CML restarted the job without you noticing. Please take note of the first date value; it may change after a job restart.

> As far as I can tell, no new spot requests have been made.

In this case, can you please retrieve a new log using the following step instead of the - uses: iterative/setup-cml@v1 one?

- run: |
    sudo npm config set user 0
    sudo npm install --global git+https://github.com/iterative/cml#debug-long-job

@0x2b3bfa0 self-assigned this on Jul 8, 2021
@thatGreekGuy96
Author

So actually that was my bad, I was running the jobs on the wrong runner :/ I managed to set the job up correctly afterwards and have been waiting for it to fail. Unfortunately, AWS Spot did not shut the instance down. It did, however, reach the maximum runtime of 6 hours. Here are the logs:

-- Logs begin at Fri 2021-07-02 11:29:27 UTC. --
Jul 08 15:54:01 ip-172-31-10-233 systemd[1]: Starting cml service...
Jul 08 15:54:48 ip-172-31-10-233 cml.sh[3089]: Preparing workdir /tmp/tmp.ulnFhn9cxz/.cml/cml-rlj3ki1iw7...
Jul 08 15:54:48 ip-172-31-10-233 cml.sh[3089]: Launching github runner
Jul 08 15:55:06 ip-172-31-10-233 cml.sh[3089]: {"level":"info","date":"2021-07-08T15:55:06.759Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 08 15:55:06 ip-172-31-10-233 cml.sh[3089]: {"level":"info","date":"2021-07-08T15:55:06.759Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 08 15:55:07 ip-172-31-10-233 cml.sh[3089]: {"level":"info","date":"2021-07-08T15:55:07.185Z","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Jul 08 15:55:23 ip-172-31-10-233 cml.sh[3089]: {"level":"info","date":"2021-07-08T15:55:22.988Z","repo":"https://github.com/continuum-industries/Pareto","job":3020940409,"status":"job_started"}
Jul 08 21:55:37 ip-172-31-10-233 cml.sh[3089]: {"level":"error","date":"2021-07-08T21:55:37.311Z","repo":"https://github.com/continuum-industries/Pareto","job":"","status":"job_ended","success":false}

The workflow doesn't get restarted after this, I'm afraid...
I'll run another test to see whether AWS will shut down the spot instance.

@DavidGOrtega
Contributor

> Unfortunately, AWS Spot did not shut the instance down. It did, however, reach the maximum runtime of 6 hours. Here are the logs

🤔 Our code expects the spot instances to be terminated. After those 6 hours, is the spot instance still alive?

@DavidGOrtega
Contributor

DavidGOrtega commented Jul 9, 2021

Eureka!
I think I've got it!

As you say, the spot instance is not terminated. The culprit here is the job timeout in GH.

You have to set it up as:

run_optimisation:
    timeout-minutes: 10000
    continue-on-error: false
    strategy:
      matrix: ${{fromJson(needs.setup_config.outputs.json_string).matrix}}
      fail-fast: true

    runs-on: [self-hosted, "cml-runner"]
    container:
      image: python:3.8.10-slim
      volumes:
          - /dev/shm:/dev/shm

As far as I can tell from my experience:

  • the workflow limit is meant to be at most 72 hours (our runner handles this ;) ); however, if you set the job limit above those 72 hours, that limit disappears (probably a GH bug)
  • never set timeout-minutes on a step; contrary to the GH docs, it WON'T work and the limit falls back to 6 hours again

@DavidGOrtega
Contributor

Our runner only restarts in two scenarios:

  • the machine is being destroyed and there are still jobs pending
  • the 72-hour GH workflow limit: we restart the workflow when it hits, even though specifying a timeout-minutes greater than 72 hours makes that limit disappear. We still handle this (despite my experience above) because I believe that behaviour is a bug that could be fixed at any point, which would break the runner again. The only drawback is that you might lose an epoch while we restart because of the 72-hour limit (a rough sketch of what such a restart amounts to follows below).
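
For context, "restarting the workflow" boils down to re-running the failed workflow run through the GitHub REST API. A minimal sketch of that call (OWNER, REPO and RUN_ID are placeholders; this is an illustration, not CML's actual code):

    # Illustration only: re-run a failed workflow run via the GitHub REST API.
    # OWNER, REPO and RUN_ID are placeholders; REPO_TOKEN needs repo scope.
    curl -X POST \
      -H "Authorization: token ${REPO_TOKEN}" \
      -H "Accept: application/vnd.github.v3+json" \
      "https://api.github.com/repos/${OWNER}/${REPO}/actions/runs/${RUN_ID}/rerun"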

@thatGreekGuy96
Author

Ah OK, I thought the timeout referred to the 360 minutes that a job can run for.

In any case, I managed to replicate the issue. This is what I see in the logs:

-- Logs begin at Fri 2021-07-02 11:29:27 UTC. --
Jul 09 08:21:43 ip-172-31-14-78 systemd[1]: Started cml.service.
Jul 09 08:22:36 ip-172-31-14-78 cml.sh[3252]: Preparing workdir /tmp/tmp.ePH7ZEdNZl/.cml/cml-8qy3d9e8w7...
Jul 09 08:22:36 ip-172-31-14-78 cml.sh[3252]: Launching github runner
Jul 09 08:22:58 ip-172-31-14-78 cml.sh[3252]: {"level":"info","date":"2021-07-09T08:22:58.295Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 09 08:22:58 ip-172-31-14-78 cml.sh[3252]: {"level":"info","date":"2021-07-09T08:22:58.296Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 09 08:22:58 ip-172-31-14-78 cml.sh[3252]: {"level":"info","date":"2021-07-09T08:22:58.693Z","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Jul 09 08:23:10 ip-172-31-14-78 cml.sh[3252]: {"level":"info","date":"2021-07-09T08:23:09.417Z","repo":"https://github.com/continuum-industries/Pareto","job":3023764478,"status":"job_started"}

client_loop: send disconnect: Broken pipe

On the GitHub side, the job failed because the runner shut down when Spot killed it. On the AWS side, I can also confirm that the spot instance was terminated with instance-terminated-no-capacity. However, the workflow did not restart.

@DavidGOrtega
Contributor

> Ah OK, I thought the timeout referred to the 360 minutes that a job can run for.

I do not fully understand this. Could you please elaborate?

> On the GitHub side, the job failed because the runner shut down when Spot killed it. On the AWS side, I can also confirm that the spot instance was terminated with instance-terminated-no-capacity. However, the workflow did not restart.

Can you try to terminate the instance manually in the console and check whether it restarts? If so, it seems Spot might be shutting instances down by some method other than terminate.
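
For reference, the same manual termination can also be done from the AWS CLI (the instance ID below is only a placeholder):

    # Placeholder instance ID; terminates the instance the same way the console does.
    aws ec2 terminate-instances --region eu-west-2 --instance-ids i-0123456789abcdef0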

@DavidGOrtega
Contributor

And I can confirm that they also use terminate, and the workflow restarts 🤔

@thatGreekGuy96
Author

Terminating the instance manually gives me this in the logs:

-- Logs begin at Fri 2021-07-02 11:29:27 UTC. --
Jul 12 08:15:19 ip-172-31-7-6 systemd[1]: Started cml.service.
Jul 12 08:16:01 ip-172-31-7-6 cml.sh[3301]: Preparing workdir /tmp/tmp.LCA5ec7MoW/.cml/cml-61sqgx299o...
Jul 12 08:16:01 ip-172-31-7-6 cml.sh[3301]: Launching github runner
Jul 12 08:16:27 ip-172-31-7-6 cml.sh[3301]: {"level":"info","date":"2021-07-12T08:16:27.131Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 12 08:16:27 ip-172-31-7-6 cml.sh[3301]: {"level":"info","date":"2021-07-12T08:16:27.132Z","repo":"https://github.com/continuum-industries/Pareto"}
Jul 12 08:16:28 ip-172-31-7-6 cml.sh[3301]: {"level":"info","date":"2021-07-12T08:16:28.324Z","repo":"https://github.com/continuum-industries/Pareto","status":"ready"}
Jul 12 08:16:40 ip-172-31-7-6 cml.sh[3301]: {"level":"info","date":"2021-07-12T08:16:40.023Z","repo":"https://github.com/continuum-industries/Pareto","job":3044570052,"status":"job_started"}
Connection to 18.133.187.34 closed by remote host.

But I can confirm that the GitHub workflow restarts!

@DavidGOrtega
Contributor

@thatGreekGuy96 I have set up a spot termination notification handler in #653
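
Roughly speaking, a handler of this kind polls the EC2 instance metadata service for a Spot interruption notice. A minimal sketch of that mechanism (illustration only, not the actual #653 implementation):

    # Illustration only: poll IMDSv2 for a Spot interruption notice every 5 seconds.
    TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    while true; do
      if curl -s -f -H "X-aws-ec2-metadata-token: $TOKEN" \
        "http://169.254.169.254/latest/meta-data/spot/instance-action" > /dev/null; then
        echo "Spot interruption notice received; shutting down gracefully"
        break
      fi
      sleep 5
    done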

@thatGreekGuy96
Author

Awesome, thank you!
