Unexpected stop of container build [CML, AWS, github] #676

Closed
sergeychuvakin opened this issue Jul 27, 2021 · 9 comments · Fixed by #679
Labels: bug (Something isn't working) · cml-image (Subcommand) · cml-runner (Subcommand) · p0-critical (Max priority, ASAP)

Comments

@sergeychuvakin

My scenario:
I'm trying to create appropriate pipeline for my ML project. I'm using the following CML yaml file:

on:
  # Trigger the workflow on push or pull request
  push:
    branches:
      - mybranch

jobs:
  deploy-runner:
    runs-on: [ubuntu-latest]
    steps:
      - uses: iterative/setup-cml@v1
      - uses: actions/checkout@v2
      - name: Deploy runner on EC2
        env:
          PERSONAL_ACCESS_TOKEN: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-1
        run: |
          cml-runner \
              --repo https://github.com/MyCompany/myrepo \
              --token=$PERSONAL_ACCESS_TOKEN \
              --cloud aws \
              --cloud-region us-west-1 \
              --cloud-type=g3.4xlarge \
              --labels=cml-runner \
              --idle-timeout 30
    
  model-training:
    needs: [deploy-runner]
    runs-on: [self-hosted, cml-runner]
    container:
      image: docker://dvcorg/cml:0-dvc1-base1-gpu
      options: --gpus all
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.6'
      - name: Train model
        env:
          repo_token: ${{ secrets.REPO_TOKEN }}
          AWS_ACCESS_KEY_ID: ${{ secrets.DVC_ACCESS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.DVC_SECRET_KEY }}
          REQUIREMENTS_FILE: 'training/training_req.txt'
        run: |
          export AWS_DEFAULT_REGION=us-east-1
          echo "Install reqs"
          sudo apt update
          sudo apt-get install default-jre scala
          pip install py4j
          pip install --no-cache-dir -e .
          export PYSPARK_PYTHON=python3
          echo "Start CML"
          python3 -m spacy download en_core_web_sm
          echo "Pull data"
          dvc repro
          echo "## Model metrics" > report.md
          cat prepare_data/metrics.txt >> report.md
          cml-send-comment report.md
          

As you can see, I used the image docker://dvcorg/cml:0-dvc1-base1-gpu, but I started receiving the following error message:

[Screenshot 2021-07-27 at 20:22:31]

[Screenshot 2021-07-27 at 20:24:36]

I can see that the container starts to build but then stops unexpectedly, and I cannot see the reason for this behavior. I did not change anything in my script; it just stopped working, even though it ran successfully before.

Thanks!

@sergeychuvakin
Author

Full logs archive.
logs_261.zip

@DavidGOrtega DavidGOrtega self-assigned this Jul 27, 2021
@DavidGOrtega DavidGOrtega added the bug, cml-image, and p0-critical labels Jul 27, 2021
@DavidGOrtega
Contributor

@sergeychuvakin thanks for reporting. I'm checking.

@DavidGOrtega
Contributor

From the logs:

Re-evaluate condition on job cancellation for step: 'Initialize containers'.
2021-07-27T17:34:49.8283399Z ##[error]The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
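
A quick way to see why the runner shut down is to check the runner's own diagnostics on the EC2 instance; the actions/runner writes per-session logs under its _diag directory. A minimal sketch, assuming SSH access and a default-style install path (the address, key, and path below are placeholders and may differ on the machine cml-runner provisions):

ssh -i mykey.pem ubuntu@<instance-ip>   # hypothetical SSH details
cd ~/actions-runner                     # assumed runner install path
ls -lt _diag | head                     # newest Runner_*.log / Worker_*.log first
tail -n 100 _diag/Runner_*.log          # look for shutdown/update messages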

@DavidGOrtega DavidGOrtega added the cml-runner label Jul 28, 2021
@DavidGOrtega
Contributor

DavidGOrtega commented Jul 28, 2021

This is happening with self-hosted runners too. The log says:

2021-07-28 07:15:35Z: Job run completed with result: Failed

An error occurred: Access denied. System:ServiceIdentity;DDDDDDDD-DDDD-DDDD-DDDD-DDDDDDDDDDDD needs View permissions to perform the action.

An error occurred: Access denied. System:ServiceIdentity;

Seems to be this

However, it is not clear why downloading the image is failing.
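
One way to rule out the image itself (a sketch only; it assumes Docker with the NVIDIA runtime on the test machine, and that nvidia-smi is present in the image) is to pull and run it by hand outside the workflow:

docker pull dvcorg/cml:0-dvc1-base1-gpu
docker run --rm --gpus all dvcorg/cml:0-dvc1-base1-gpu nvidia-smi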

@DavidGOrtega
Contributor

Ok, the GH runner seems to be restarting due to updates. This is new behaviour:

Runner will exit shortly for update, should back online within 10 seconds.
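
One possible mitigation, not necessarily what #679 does, is to register the runner with auto-updates disabled so a self-update cannot interrupt a running job. A sketch of the plain actions/runner configuration step (placeholder token and labels; the --disableupdate flag only exists in newer runner releases, and cml-runner normally drives this registration itself):

./config.sh \
    --url https://github.com/MyCompany/myrepo \
    --token <registration-token> \
    --labels cml-runner \
    --unattended \
    --disableupdate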

@sergeychuvakin
Author

Ok, GH runner seems to be restarting due to updates. This is a new behaviour

Runner will exit shortly for update, should back online within 10 seconds.

Ok, thank you for the investigation! I appreciate your efforts and look forward to updates.

@casperdcl
Contributor

doesn't seem fixed (yet)

@casperdcl casperdcl reopened this Jul 28, 2021
@DavidGOrtega
Contributor

Fixed! It seems the error that appeared later on was a hiccup. I have tried multiple times successfully.

@sergeychuvakin
Author

Thanks! I was able to run it as well.
