When should we update dvc.lock and dvc push in GitHub Actions? #393

Open
Tracked by #196
daavoo opened this issue Sep 10, 2021 · 5 comments
Labels
documentation (Markdown files), question (User requesting support)

Comments

@daavoo
Contributor

daavoo commented Sep 10, 2021

Discussed in iterative/dvc#6542

Originally posted by Hongbo-Miao September 6, 2021
Currently, my GitHub Actions workflow works like this: when I open a pull request (changing some model code / params), CML creates an AWS EC2 instance, and DVC pulls the data.

Here is my current GitHub Actions workflow:

cml-cloud-set-up-cloud:
  name: CML (Cloud) - Set up cloud
  runs-on: ubuntu-20.04
  steps:
    - name: Cancel previous runs
      uses: styfle/cancel-workflow-action@<version>
      with:
        access_token: ${{ github.token }}
    - name: Checkout
      uses: actions/checkout@v2
    - name: Set up CML
      uses: iterative/setup-cml@v1
    - name: Set up cloud
      shell: bash
      env:
        REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        cml-runner \
        --cloud=aws \
        --cloud-region=us-west-2 \
        --cloud-type=t2.small \
        --labels=cml-runner
cml-cloud-train:
  name: CML (Cloud) - Train
  needs: cml-cloud-set-up-cloud
  runs-on: [self-hosted, cml-runner]
  # container: docker://iterativeai/cml:0-dvc2-base1-gpu
  container: docker://iterativeai/cml:0-dvc2-base1
  steps:
    - name: Cancel previous runs
      uses: styfle/cancel-workflow-action@<version>
      with:
        access_token: ${{ github.token }}
    - name: Checkout
      uses: actions/checkout@v2
    - name: Set up Miniconda
      uses: conda-incubator/setup-miniconda@v2
      with:
        miniconda-version: "latest"
        activate-environment: hm-cnn
    - name: Install requirements
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      run: |
        conda install pytorch torchvision torchaudio --channel=pytorch
        conda install pandas
        conda install tabulate
        pip install -r requirements.txt
    - name: Pull Data
      working-directory: convolutional-neural-network
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        dvc pull
    - name: Train model
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
      run: |
        dvc repro
    - name: Write CML report
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }}
      run: |
        echo "# CML (Cloud) Report" >> report.md
        echo "## Params" >> report.md
        cat output/reports/params.txt >> report.md
        cml-send-comment report.md

My dvc.yaml looks like this:

stages:
  prepare:
    cmd: tar -xf data/raw/cifar-10-python.tar.gz --dir=data/processed
    deps:
    - data/raw/cifar-10-python.tar.gz
    outs:
    - data/processed/cifar-10-batches-py/
  main:
    cmd: python main.py
    deps:
    - data/processed/cifar-10-batches-py/
    - evaluate.py
    - main.py
    - model/
    - train.py
    params:
    - lr
    - train.epochs
    outs:
    - output/models/model.pt
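
For reference, the params section above implies a params.yaml along these lines (a hypothetical sketch; the values are made up):

lr: 0.001
train:
  epochs: 10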

After training, if I think the change is good because the performance is better based on the reports, then:

  • dvc.lock needs to be updated;
  • the new model model.pt needs to be uploaded (to AWS S3, in my case).

My question is: after dvc repro, am I supposed to add dvc push and then commit? Something like

    - name: Train model
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
      run: |
        dvc repro
        dvc push  # newly added
        git add .  # newly added
        git commit -m "update dvc.lock, etc."  # newly added
        git push origin current_pr  # newly added; need to somehow get the current pull request branch
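
For what it's worth, a minimal sketch of such a step, assuming a pull_request-triggered workflow: GitHub Actions exposes the PR's source branch as GITHUB_HEAD_REF (${{ github.head_ref }}), and actions/checkout would need ref: ${{ github.head_ref }} so the job sits on that branch rather than on a detached merge commit. The Git identity below is an arbitrary placeholder, and the push relies on the checkout step's persisted token having write access:

    - name: Train model and push results
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
      run: |
        dvc repro
        dvc push
        # Placeholder identity; commits in CI fail without one configured.
        git config user.name "github-actions"
        git config user.email "github-actions[bot]@users.noreply.github.com"
        git add dvc.lock
        git commit -m "Update dvc.lock after dvc repro"
        # On pull_request events, GITHUB_HEAD_REF holds the PR's source branch.
        git push origin "HEAD:${GITHUB_HEAD_REF}"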

The method above would run while the pull request is open.
However, I feel the best moment to add these steps would be when I decide to merge, because that is when I know the pull request actually improves the machine learning performance. But at that moment, the EC2 instance has already been destroyed.

Any suggestion? Thanks!

@daavoo added the question (User requesting support) label Sep 10, 2021
@daavoo
Contributor Author

daavoo commented Sep 10, 2021

See previous discussion comments:

iterative/dvc#6542 (comment)

@hongbo-miao

Just copying another comment here:
I am more curious about the recommended way for this demo,
https://github.com/iterative/cml_cloud_case/blob/master/.github/workflows/cml.yaml,
as it uses an AWS EC2 instance.
However, the current GitHub Actions workflow does not do something like

dvc push
cml-pr .

In the experiment pull request, it also does not update dvc.lock and related files, which is why I came up with this question. (If we always use an AWS EC2 instance with this approach, dvc.lock will never get updated.)
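
For illustration, a minimal sketch of what such a step could look like in that workflow, assuming the legacy cml-pr command from that CML version (the current CLI spells it cml pr) and the same secrets as above:

    - name: Push results and open a PR with the updated lock file
      env:
        REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        dvc repro        # updates dvc.lock if anything changed
        dvc push         # uploads new outputs (e.g. model.pt) to the DVC remote
        cml-pr dvc.lock  # commits the updated lock file to a new branch and opens a PR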

Would be great to have some recommendations! 😃

To be more specific, it would be great to cover these scenarios without messing up DVC:

  • Open a pull request; the experiment result looks good, then merge.
  • Open a pull request; the experiment result does not look good, so force-push an update. The second time, the experiment looks good, then merge.
  • Open a pull request; the experiment result does not look good, so add another commit to update. The second time, the experiment looks good, then squash-merge.

@mattlbeck

Another option I was taught is to use DVC's run-cache.

In the CML runner:

dvc repro             # run the pipeline, updating dvc.lock and the local cache
dvc push --run-cache  # upload data plus the run-cache entries to the remote

On the local machine:

dvc pull --run-cache  # fetch the run-cache produced in CI
dvc repro --pull      # restore stage outputs from the run-cache instead of re-running
# any further checks or analysis of results
git add .
git commit -m "commit experiment"

This is less automated than the new cml-pr but has the benefit that the developers make the commits themselves, if that is important.

@casperdcl added the documentation (Markdown files) label Aug 3, 2022
@sisp
Contributor

sisp commented Nov 3, 2022

I've been wondering about all these questions myself and haven't found a satisfying answer yet.

@mattlbeck Wouldn't dvc push --run-cache pollute the DVC remote storage? The amount of run-cache data will increase quickly over time. Perhaps some kind of garbage collection is needed.
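
For context, a hedged sketch of the kind of cleanup that might apply, assuming dvc gc semantics from DVC 2.x (whether run-cache entries on the remote are covered may depend on the DVC version):

dvc gc --all-commits --cloud --force  # drop local and remote cache objects not referenced by any commit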

@dacbd
Contributor

dacbd commented Nov 7, 2022

Relevant links for any future readers:

Why? I'm assuming the use case is rerunning or inspecting a dvc exp / dvc repro conducted from CI/CD (the user can perform the commit, or discard the changes, instead of CI doing it).

@casperdcl transferred this issue from iterative/cml Nov 18, 2022