
Continuous deployment with DVC #288

FrancescoCasalegno opened this issue Mar 12, 2021 · 8 comments

@FrancescoCasalegno
Contributor

Scope

We need to make sure that we know when changes in our source code influence our models / datasets. Without any manual procedures!

 

Current problems

  • We have multiple Dockerfiles that contain a version tag of bbsearch
    • Self-referential
    • One needs to build them, run them, and run dvc repro manually
    • The tag is bumped at the discretion of the developer

 

Proposed solution

GitHub Action triggered on each push (a rough sketch follows below):

  • connect to a container / build a new one on Blue Brain's ML server
  • git checkout the given commit
  • run dvc repro (or other)  
  • (dvc metrics diff)
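
A rough sketch of what the triggered job could run, assuming password-less SSH access to the ML server and a checkout at /opt/bbsearch (both the host variable and the path are placeholders, not an existing setup), and skipping the container step for simplicity:

#!/usr/bin/env bash
# Hypothetical job body for the proposed action; ML_SERVER and /opt/bbsearch are placeholders.
set -euo pipefail

# GITHUB_SHA is set by GitHub Actions to the commit that triggered the workflow.
ssh "$ML_SERVER" bash -s "$GITHUB_SHA" <<'REMOTE'
set -euo pipefail
cd /opt/bbsearch
git fetch origin && git checkout "$1"  # check out the given commit
pip install -r requirements.txt        # refresh the environment
dvc repro                              # re-run the pipelines
dvc metrics diff                       # report metric changes (informational)
REMOTE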

 

Notes

The most attainable/reasonable setup would be to use/replicate https://github.com/iterative/cml and just trigger some process on our server with pushes to a branch.

@jankrepl
Contributor

jankrepl commented Mar 16, 2021

So it turns out that using "self-hosted runners" is not recommended for public repositories.
https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners

We recommend that you only use self-hosted runners with private repositories. This is because forks of your repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

I am not sure if we want to use GitHub servers to automatically train or evaluate our models.

@jankrepl
Contributor

jankrepl commented Mar 26, 2021

See below a script that could be turned into a GitHub Action.

What is the goal?

Replace the manual process that we need to go through when reviewing PRs (heavily inspired by #265). Namely

  • Check whether all relevant assets (listed in dvc.lock files) are available on the remote
  • Check that running dvc repro does not introduce any differences (dvc diff is empty)

In a way, it is like a unit test that makes sure that all potential changes to our models and data have been correctly tracked.

What are the challenges

  • We would want this action to be triggered manually somehow (e.g. when a comment on a PR contains a specific substring)
  • All the dvc-related things would be run on GitHub servers, so we need to provide SSH login details for the remote via GitHub secrets
  • We need to be really careful about permissions (see the sketch after this list)
    • This action can only be launched by an authorized person
    • Make sure external people (e.g. who forked our repo) cannot trigger the action or see the SSH login details
  • It might be really slow (e.g. dvc pull will need to download multiple GBs of data and models)
  • Potential reproducibility + environment issues (we do not want to run this inside of a Docker container)
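
A sketch of how the permissions point could be enforced inside the action itself. The allow-list below is a placeholder, not the real team list; GITHUB_ACTOR is set by GitHub Actions to the username that triggered the workflow:

# Hypothetical first step of the action: abort unless the triggering user is on an allow-list.
ALLOWED_ACTORS="maintainer1 maintainer2"   # placeholder handles

case " $ALLOWED_ACTORS " in
  *" $GITHUB_ACTOR "*)
    echo "Actor $GITHUB_ACTOR is authorized."
    ;;
  *)
    echo "Actor $GITHUB_ACTOR is not authorized to run this workflow." >&2
    exit 1
    ;;
esac

Note that such a check complements, but does not replace, making sure the SSH secrets are never exposed to workflows triggered from forks.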

What are the benefits

  • Big time saver
  • We could drop the self-referential retagging process that we currently have: if this action passes, we know all model- and data-related changes have been correctly tracked
  • If the action fails, we can go to the logs and right away identify which pipeline introduced changes or which files are missing on the remote

Suggested script (WIP)

Before we run the script, a deterministic git revision will be checked out (e.g. the most recent commit of the branch from which we triggered the action).

set -e  # if any command exits with a nonzero code the entire script exits too
set -x  # print each command before executing it

pip install -r requirements.txt
dvc pull  # also checks that everything listed in dvc.lock is on remote 

# NER
pushd data_and_models/pipelines/ner/
dvc repro
test -z "$(dvc diff)"  # exits with nonzero code if there are any changes
popd

# Sentence embeddings
pushd data_and_models/pipelines/sentence_embedding/
dvc repro
test -z "$(dvc diff)"  # exits with nonzero code if there are any changes
popd

@pafonta
Contributor

pafonta commented Mar 26, 2021

This is a must-have!

One comment:

Potential reproducibility + environment issues (we do not want to run this inside of a docker container)

Why wouldn't we want this to run inside a Docker container?

Indeed, not running inside a Docker container is:

  1. the opposite of what we chose to do at the moment,
  2. the opposite of best practice for ensuring reproducibility of environments, as far as I know.

@jankrepl
Contributor

Why wouldn't we want this to run inside a Docker container?

In my opinion, GitHub Actions jobs already run inside a "container" of some sort, so IMO there is no need to introduce yet another level of nesting.

@Stannislav
Contributor

See below a script that could be turned into a GitHub Action. [...]

I also agree that this kind of test needs to be automated. Among all the points you mentioned above I'm worried about the following two:

  • Can we safely SSH to the DVC remote from GitHub? Is this compliant with the BBP policy?
  • Doing a 5GB pull is pretty heavy (a cheaper check is sketched below).
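
One way to soften the second point, as a sketch assuming the DVC remote is already configured: dvc status --cloud should report anything referenced by the dvc.lock / .dvc files that is missing from the remote, without downloading the data itself, so it could serve as a cheap "is everything on the remote" check before deciding whether a full dvc pull is needed.

# Cheap remote check: no data is transferred; the exact report format depends on the DVC
# version, so failing the CI job on its output would need a bit of parsing.
dvc status --cloud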

@Stannislav
Contributor

Stannislav commented Mar 26, 2021

While dealing with the latest DVC tests I had the following issues / annoyances (a possible caching setup for points 1-3 is sketched below):

  1. Re-building the docker containers takes a really long time, as we have to re-download and re-install all BBS dependencies every time
  2. Doing a 5GB DVC pull
  3. When doing repro on sentence embedding a model had to be downloaded multiple times (transformers?)
  4. I can't run the container with my own username (errors out)
  5. dvc pull doesn't work out of the box, one needs to reconfigure it manually
  6. I had a huge git diff output with files not related to DVC (tests, docs, notebooks, ...)

If what @jankrepl suggests above turns out to be infeasible, then we can think about writing something automated on our servers.
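
Some of points 1-3 might be softened with caching; a sketch, assuming a persistent directory such as /data/ci_cache exists on the machine running the jobs (both the path and the idea of a shared cache are assumptions, not an existing setup):

# Hypothetical shared caches reused between CI runs; /data/ci_cache is a placeholder path.
export CACHE_ROOT=/data/ci_cache

# (3) Reuse downloaded Hugging Face models instead of re-fetching them on every repro.
export TRANSFORMERS_CACHE="$CACHE_ROOT/transformers"

# (1) Reuse downloaded wheels when rebuilding the environment.
pip install --cache-dir "$CACHE_ROOT/pip" -r requirements.txt

# (2) Point DVC at a shared local cache so repeated pulls do not re-download the same data.
dvc cache dir "$CACHE_ROOT/dvc"
dvc pull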

@pafonta
Contributor

pafonta commented Mar 26, 2021

@jankrepl

In my opinion, GitHub Actions jobs already run inside a "container" of some sort, so IMO there is no need to introduce yet another level of nesting.

The GitHub container might also just change, and the reproduction could fail because of that change.
Maybe one could pin the GitHub container version or similar to ensure reproducibility, for example:
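
A sketch of wrapping the reproduction steps in a fixed image, if we did want to pin the environment (the image tag and the script name ci/reproduce.sh are illustrative placeholders):

# Hypothetical: run the reproduction script inside a pinned image instead of whatever
# environment the GitHub-hosted runner happens to provide that week.
docker run --rm \
    -v "$PWD":/workspace -w /workspace \
    python:3.8.8-slim \
    bash ci/reproduce.sh   # placeholder name for the script proposed above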

@FrancescoCasalegno
Contributor Author

Concerning the use of GitHub Actions: we cannot have GitHub servers (1) set up a VPN connection with BBP or (2) pull/push data from BBP servers.

But we can wait for GitLab CI to become available on BBP premises to do that there.
