When should we update dvc.lock and dvc push in GitHub Actions? #393

Open
Tracked by #196
daavoo opened this issue Sep 10, 2021 · 5 comments
Labels
documentation (Markdown files), question (User requesting support)

Comments

@daavoo
Contributor

daavoo commented Sep 10, 2021

Discussed in iterative/dvc#6542

Originally posted by Hongbo-Miao September 6, 2021
Currently, my GitHub Actions workflow works like this: when I open a pull request (changing some model code / params), CML creates an AWS EC2 instance, and DVC pulls the data.

Here is my current GitHub Actions workflow:

cml-cloud-set-up-cloud:
  name: CML (Cloud) - Set up cloud
  runs-on: ubuntu-20.04
  steps:
    - name: Cancel previous runs
      uses: styfle/cancel-workflow-action@<version>
      with:
        access_token: ${{ github.token }}
    - name: Checkout
      uses: actions/checkout@v2
    - name: Set up CML
      uses: iterative/setup-cml@v1
    - name: Set up cloud
      shell: bash
      env:
        REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        cml-runner \
        --cloud=aws \
        --cloud-region=us-west-2 \
        --cloud-type=t2.small \
        --labels=cml-runner
cml-cloud-train:
  name: CML (Cloud) - Train
  needs: cml-cloud-set-up-cloud
  runs-on: [self-hosted, cml-runner]
  # container: docker://iterativeai/cml:0-dvc2-base1-gpu
  container: docker://iterativeai/cml:0-dvc2-base1
  steps:
    - name: Cancel previous runs
      uses: styfle/cancel-workflow-action@<version>
      with:
        access_token: ${{ github.token }}
    - name: Checkout
      uses: actions/checkout@v2
    - name: Set up Miniconda
      uses: conda-incubator/setup-miniconda@v2
      with:
        miniconda-version: "latest"
        activate-environment: hm-cnn
    - name: Install requirements
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      run: |
        conda install pytorch torchvision torchaudio --channel=pytorch
        conda install pandas
        conda install tabulate
        pip install -r requirements.txt
    - name: Pull Data
      working-directory: convolutional-neural-network
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        dvc pull
    - name: Train model
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
      run: |
        dvc repro
    - name: Write CML report
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }}
      run: |
        echo "# CML (Cloud) Report" >> report.md
        echo "## Params" >> report.md
        cat output/reports/params.txt >> report.md
        cml-send-comment report.md

My dvc.yaml looks like this:

stages:
  prepare:
    cmd: tar -xf data/raw/cifar-10-python.tar.gz --dir=data/processed
    deps:
    - data/raw/cifar-10-python.tar.gz
    outs:
    - data/processed/cifar-10-batches-py/
  main:
    cmd: python main.py
    deps:
    - data/processed/cifar-10-batches-py/
    - evaluate.py
    - main.py
    - model/
    - train.py
    params:
    - lr
    - train.epochs
    outs:
    - output/models/model.pt
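
For reference, the params section above implies a params.yaml along these lines (a hypothetical sketch; the values are made up):

lr: 0.001
train:
  epochs: 10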

After training, if I think the change is good because the performance is better based on the reports, then:

  • dvc.lock needs to be updated;
  • the new model model.pt needs to be uploaded (to AWS S3, in my case).

My question is: after dvc repro, am I supposed to add dvc push and then commit? Something like

    - name: Train model
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
      run: |
        dvc repro
        dvc push  # newly added
        git add .  # newly added
        git commit -m "update dvc.lock, etc."  # newly added
        git push origin current_pr  # newly added; need to somehow get the current pull request branch
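
For what it's worth, a minimal sketch of such a step, assuming a pull_request-triggered workflow: GitHub Actions exposes the PR's source branch as GITHUB_HEAD_REF (${{ github.head_ref }}), and actions/checkout would need ref: ${{ github.head_ref }} so the job sits on that branch rather than on a detached merge commit. The Git identity below is an arbitrary placeholder, and the push relies on the checkout step's persisted token having write access:

    - name: Train model and push results
      working-directory: convolutional-neural-network
      shell: bash -l {0}
      env:
        WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
      run: |
        dvc repro
        dvc push
        # Placeholder identity; commits in CI fail without one configured.
        git config user.name "github-actions"
        git config user.email "github-actions[bot]@users.noreply.github.com"
        git add dvc.lock
        git commit -m "Update dvc.lock after dvc repro"
        # On pull_request events, GITHUB_HEAD_REF holds the PR's source branch.
        git push origin "HEAD:${GITHUB_HEAD_REF}"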

The method above would run while the pull request is open.
However, I feel the best moment to add these steps would be when I decide to merge, because that is when I know the pull request actually improves the machine learning performance. But at that moment, the EC2 instance has already been destroyed.

Any suggestion? Thanks!

@daavoo added the question (User requesting support) label Sep 10, 2021
@daavoo
Contributor Author

daavoo commented Sep 10, 2021

See previous discussion comments:

iterative/dvc#6542 (comment)

@hongbo-miao

Just copying another comment here:
I am more curious about the recommended way for this demo,
https://github.com/iterative/cml_cloud_case/blob/master/.github/workflows/cml.yaml,
as it uses an AWS EC2 instance.
However, the current GitHub Actions workflow does not do something like

dvc push
cml-pr .

In the experiment pull request, it also does not update dvc.lock and related files, which is why I came up with this question. (If we always use an AWS EC2 instance with this approach, dvc.lock will never get updated.)
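
For illustration, a minimal sketch of what such a step could look like in that workflow, assuming the legacy cml-pr command from that CML version (the current CLI spells it cml pr) and the same secrets as above:

    - name: Push results and open a PR with the updated lock file
      env:
        REPO_TOKEN: ${{ secrets.CML_ACCESS_TOKEN }}
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      run: |
        dvc repro        # updates dvc.lock if anything changed
        dvc push         # uploads new outputs (e.g. model.pt) to the DVC remote
        cml-pr dvc.lock  # commits the updated lock file to a new branch and opens a PR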

Would be great to have some recommendations! 😃

To be more specific, it would be great to cover these scenarios without messing up DVC:

  • Open a pull request; the experiment result looks good, then merge.
  • Open a pull request; the experiment result does not look good, so force-push an update. The second time, the experiment looks good, then merge.
  • Open a pull request; the experiment result does not look good, so add another commit to update. The second time, the experiment looks good, then squash-merge.

@mattlbeck

Another option I was taught is to use DVC's run-cache.

In the CML runner:

dvc repro             # run the pipeline, updating dvc.lock and the local cache
dvc push --run-cache  # upload data plus the run-cache entries to the remote

On the local machine:

dvc pull --run-cache  # fetch the run-cache produced in CI
dvc repro --pull      # restore stage outputs from the run-cache instead of re-running
# any further checks or analysis of results
git add .
git commit -m "commit experiment"

This is less automated than the new cml-pr but has the benefit that the developers make the commits themselves, if that is important.

@casperdcl added the documentation (Markdown files) label Aug 3, 2022
@sisp
Contributor

sisp commented Nov 3, 2022

I've been wondering about all these questions myself and haven't found a satisfying answer yet.

@mattlbeck Wouldn't dvc push --run-cache pollute the DVC remote storage? The amount of run-cache data will increase quickly over time. Perhaps some kind of garbage collection is needed.
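
For context, a hedged sketch of the kind of cleanup that might apply, assuming dvc gc semantics from DVC 2.x (whether run-cache entries on the remote are covered may depend on the DVC version):

dvc gc --all-commits --cloud --force  # drop local and remote cache objects not referenced by any commit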

@dacbd
Contributor

dacbd commented Nov 7, 2022

Relevant links for any future readers:

Why? I'm assuming the use case is rerunning or inspecting a dvc exp / dvc repro conducted from CI/CD (the user can perform the commit, or discard the changes, instead of CI doing it).

@casperdcl transferred this issue from iterative/cml Nov 18, 2022