Merge branch 'master' into guide/external-disclaimer
Showing 48 changed files with 1,889 additions and 280 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,161 +1,157 @@ | ||
# Self-hosted Runners | ||
|
||
GitHub Actions are run on GitHub-hosted runners by default. However, there are | ||
many great reasons to use your own runners: to take advantage of GPUs; to | ||
orchestrate your team's shared computing resources, or to train in the cloud. | ||
GitHub Actions and GitLab CI are run on GitHub- and GitLab-hosted runners by | ||
default. However, there are many great reasons to use your own runners: to take | ||
advantage of GPUs, to orchestrate your team's shared computing resources, or to | ||
train in the cloud. | ||
|
||
☝️ **Tip!** Check out the | ||
[official GitHub documentation](https://help.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners) | ||
to get started setting up your self-hosted runner. | ||
☝️ **Tip!** Check out the official documentation from | ||
[GitHub](https://help.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners) | ||
and [GitLab](https://docs.gitlab.com/runner/) to get started setting up your | ||
self-hosted runner. | ||
|
||
### Allocating cloud resources with CML | ||
## Allocating cloud resources with CML | ||
|
||
When a workflow requires computational resources (such as GPUs), CML can | ||
automatically allocate cloud instances. For example, the following workflow | ||
deploys a `t2.micro` instance on AWS EC2 and trains a model on the instance. | ||
After the instance is idle for 120 seconds, it automatically shuts down. | ||
automatically allocate cloud instances using `cml-runner`. You can spin up | ||
instances on your AWS or Azure account (GCP support is forthcoming!). | ||
|
||
For example, the following workflow deploys a `t2.micro` instance on AWS EC2 and | ||
trains a model on the instance. After the job runs, the instance automatically | ||
shuts down. You might notice that this workflow is quite similar to the | ||
[basic use case](#usage) highlighted in the beginning of the docs. That's | ||
because it is! What's new is that we've added `cml-runner`, plus a few | ||
environment variables for passing your cloud service credentials to the | ||
workflow. | ||
|
||
```yaml | ||
name: train-my-model | ||
name: "Train-in-the-cloud" | ||
on: [push] | ||
|
||
jobs: | ||
deploy-cloud-runner: | ||
deploy-runner: | ||
runs-on: [ubuntu-latest] | ||
container: docker://dvcorg/cml:latest | ||
steps: | ||
- name: deploy | ||
- uses: iterative/setup-cml@v1 | ||
- uses: actions/checkout@v2 | ||
- name: "Deploy runner on EC2" | ||
shell: bash | ||
env: | ||
repo_token: ${{ secrets.REPO_TOKEN }} | ||
repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }} | ||
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
run: | | ||
echo "Deploying..." | ||
MACHINE="cml$(date +%s)" | ||
docker-machine create \ | ||
--driver amazonec2 \ | ||
--amazonec2-instance-type t2.micro \ | ||
--amazonec2-region us-east-1 \ | ||
--amazonec2-zone f \ | ||
--amazonec2-vpc-id vpc-06bc773d85a0a04f7 \ | ||
--amazonec2-ssh-user ubuntu \ | ||
$MACHINE | ||
eval "$(docker-machine env --shell sh $MACHINE)" | ||
( | ||
docker-machine ssh \ | ||
$MACHINE "sudo mkdir -p /docker_machine && \ | ||
sudo chmod 777 /docker_machine" && \ | ||
docker-machine scp -r -q ~/.docker/machine \ | ||
$MACHINE:/docker_machine && \ | ||
docker run --name runner -d \ | ||
-v /docker_machine/machine:/root/.docker/machine \ | ||
-e RUNNER_IDLE_TIMEOUT=120 \ | ||
-e DOCKER_MACHINE=${MACHINE} \ | ||
-e RUNNER_LABELS=cml \ | ||
-e repo_token=$repo_token \ | ||
-e RUNNER_REPO="https://github.com/${GITHUB_REPOSITORY}" \ | ||
dvcorg/cml-py3 && \ | ||
sleep 20 && echo "Deployed $MACHINE" | ||
) || (echo y | docker-machine rm $MACHINE && exit 1) | ||
train: | ||
needs: deploy-cloud-runner | ||
runs-on: [self-hosted, cml] | ||
|
||
cml-runner \ | ||
--cloud aws \ | ||
--cloud-region us-west \ | ||
--cloud-type=t2.micro \ | ||
--labels=cml-runner | ||
name: model-training | ||
needs: deploy-runner | ||
runs-on: [self-hosted,cml-runner] | ||
container: docker://dvcorg/cml-py3:latest | ||
steps: | ||
- uses: actions/checkout@v2 | ||
- name: cml_run | ||
env: | ||
repo_token: ${{ secrets.GITHUB_TOKEN }} | ||
run: | | ||
pip install -r requirements.txt | ||
python train.py | ||
cat metrics.txt >> report.md | ||
cml-publish confusion_matrix.png --md >> report.md | ||
cml-send-comment report.md | ||
``` | ||
Please note that for GCP's Compute Engine, deploying the cloud runner involves | ||
different steps: | ||
```yaml | ||
deploy-gce: | ||
runs-on: [ubuntu-latest] | ||
container: docker://dvcorg/cml:latest | ||
|
||
steps: | ||
- name: deploy | ||
shell: bash | ||
- uses: actions/checkout@v2 | ||
- name: "Train my model" | ||
env: | ||
repo_token: ${{ secrets.REPO_TOKEN }} | ||
GOOGLE_APPLICATION_CREDENTIALS_DATA: | ||
${{ secrets.GOOGLE_APPLICATION_CREDENTIALS_DATA }} | ||
repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }} | ||
run: | | ||
echo "Deploying..." | ||
echo '${{ secrets.GOOGLE_APPLICATION_CREDENTIALS_DATA }}' \ | ||
> gce-credentials.json | ||
export GOOGLE_APPLICATION_CREDENTIALS='gce-credentials.json' | ||
RUNNER_LABELS="gce" | ||
RUNNER_REPO="https://github.com/${GITHUB_REPOSITORY}" | ||
MACHINE="cml$(date +%s)" | ||
docker-machine create --driver google \ | ||
--google-machine-type "n1-standard-4" \ | ||
--google-project "cml-project-279709" \ | ||
$MACHINE | ||
eval "$(docker-machine env --shell sh $MACHINE)" | ||
( | ||
docker-machine ssh \ | ||
$MACHINE "sudo mkdir -p /docker_machine && \ | ||
sudo chmod 777 /docker_machine" && \ | ||
docker-machine scp -r -q ~/.docker/machine/ \ | ||
$MACHINE:/docker_machine && \ | ||
docker-machine scp -q gce-credentials.json \ | ||
$MACHINE:/docker_machine/gce-credentials.json && \ | ||
eval "$(docker-machine env --shell sh $MACHINE)" && \ | ||
docker run --name runner -d \ | ||
-v /docker_machine/gce-credentials.json:/gce-credentials.json \ | ||
-e GOOGLE_APPLICATION_CREDENTIALS='/gce-credentials.json' \ | ||
-v /docker_machine/machine:/root/.docker/machine \ | ||
-e DOCKER_MACHINE=$MACHINE \ | ||
-e repo_token=$repo_token \ | ||
-e RUNNER_LABELS=$RUNNER_LABELS \ | ||
-e RUNNER_REPO=$RUNNER_REPO \ | ||
-e RUNNER_IDLE_TIMEOUT=120 \ | ||
dvcorg/cml-py3 && \ | ||
sleep 20 && echo "Deployed $MACHINE" | ||
) || (docker-machine rm -f $MACHINE && exit 1) | ||
pip install -r requirements.txt | ||
python train.py | ||
# Publish report with CML | ||
cat metrics.txt > report.md | ||
cml-send-comment report.md | ||
``` | ||
### Inputs | ||
In the above workflow, the job `deploy-runner` launches an EC2 `t2.micro` | ||
instance in the `us-west` region. The next job, `model-training`, runs on the | ||
newly launched instance. | ||
|
||
**Note that you can use any container with this workflow!** While you must have | ||
CML and its dependencies set up to use CML functions like `cml-send-comment` from | ||
your instance, you can create your favorite training environment in the cloud by | ||
pulling the Docker container of your choice. | ||
|
||
We like the | ||
[CML container](https://github.com/iterative/cml/blob/master/docker/Dockerfile) | ||
(`docker://dvcorg/cml-py3`) because it comes loaded with Python, CUDA, `git`, | ||
`node` and other essentials for full-stack data science. But we don't mind if | ||
you do it your way :) | ||
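
As a minimal sketch of swapping in your own container, the training job below runs in a generic Python image instead of the CML container. The image name `python:3.9` is a hypothetical choice made for illustration and is not taken from the CML docs. Because that image does not ship with CML, the sketch leaves out CML functions such as `cml-send-comment`.

```yaml
train:
  name: model-training
  needs: deploy-runner
  runs-on: [self-hosted, cml-runner]
  # Assumption: any image that contains your training dependencies could go here.
  container: docker://python:3.9
  steps:
    - uses: actions/checkout@v2
    - name: "Train my model"
      run: |
        pip install -r requirements.txt
        python train.py
```

To publish a report from such a job, CML and its dependencies would first need to be installed in the container, as noted above.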
|
||
## Arguments | ||
|
||
The function `cml-runner` accepts the following arguments: | ||
|
||
| Argument | Description | Values | | ||
| --------------------- | ------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------- | | ||
| `--version` | Show version number | Boolean | | ||
| `--labels` | Comma delimited runner labels | Default: `cml` | | ||
| `--idle-timeout` | Time in seconds that the runner waits for jobs before shutting down. 0 waits forever. | Default: `300` | | ||
| `--name` | Name displayed in the repo once registered | Default: `cml-` followed by a unique identifier, e.g. `cml-cfwj9rrari` | | ||
| `--driver` | If not specified, driver is inferred from environment variables | Choices: `github`,`gitlab` | | ||
| `--repo` | Specifies the Git repository to be used. If not specified, repo is inferred from the environment variables. | Example: `https://github.com/iterative/cml` | | ||
| `--token` | Personal access token to be used. If not specified, it will be inferred from the environment. | `token` should be a string | | ||
| `--cloud` | Cloud provider to deploy the runner | Choices: `aws`,`azure` | | ||
| `--cloud-region` | Region where the instance is deployed. | Choices: `us-east`,`us-west`, `eu-west`, `eu-north`. Also accepts native cloud regions. Defaults to `us-west`. | | ||
| `--cloud-type` | Instance type. | Choices: `m`, `l`, `xl`. Also supports native types such as `t2.micro` | | ||
| `--cloud-gpu` | GPU type. | Choices: `nogpu`,`k80`,`tesla` | | ||
| `--cloud-hdd-size` | HDD size in GB. | Accepts integer values | | ||
| `--cloud-ssh-private` | Your private RSA SSH key. If not provided, one will be generated by the Terraform-provider-Iterative. | Accepts string | | ||
| `--cloud-spot` | Request a spot instance | Boolean | | ||
| `--cloud-spot-price` | Spot max price. If not specified, the current spot bidding price is used. | Default: `-1` | | ||
| `-h` | Show help | Boolean | | ||
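
To illustrate how these arguments combine, here is a sketch of a deploy step that requests a GPU spot instance. Every flag comes from the table above, but the specific values (`m`, `k80`, `50`, `600`, and the `cml-gpu` label) are assumptions picked for illustration. Whether a given type and GPU combination is available depends on your cloud provider and region.

```yaml
# Hypothetical step; values are illustrative, flags are those listed in the table.
- name: "Deploy GPU spot runner on EC2"
  shell: bash
  env:
    repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
  run: |
    cml-runner \
      --cloud aws \
      --cloud-region us-west \
      --cloud-type m \
      --cloud-gpu k80 \
      --cloud-spot \
      --cloud-hdd-size 50 \
      --idle-timeout 600 \
      --labels cml-gpu
```

A job that should run on this instance would then list the same label, for example `runs-on: [self-hosted, cml-gpu]`.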
|
||
## Environment variables | ||
|
||
You will need to | ||
[create a new personal access token](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line), | ||
`REPO_TOKEN`, with repository read/write access. `REPO_TOKEN` must be added as a | ||
secret in your project repository. | ||
[create a personal access token](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line) | ||
with repository read/write access and workflow privileges. In the example | ||
workflow, this token is stored as `PERSONAL_ACCESS_TOKEN`. | ||
|
||
Note that you will also need to provide access credentials for your cloud | ||
compute resources as secrets. In the above example, `AWS_ACCESS_KEY_ID` and | ||
`AWS_SECRET_ACCESS_KEY` are required to deploy EC2 instances. | ||
|
||
### Provisioning cloud compute | ||
Click below to see credentials needed for supported cloud service providers. | ||
|
||
<details> | ||
|
||
In the above example, we use | ||
[Docker Machine](https://docs.docker.com/machine/concepts/) to provision | ||
instances. Please see their documentation for further details. | ||
### AWS | ||
|
||
Note several CML-specific arguments to `docker run`: | ||
```yaml | ||
env: | ||
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
AWS_SESSION_TOKEN: ${{ secrets.AWS_SESSION_TOKEN }} | ||
``` | ||
|
||
Note that `AWS_SESSION_TOKEN` is optional. | ||
|
||
</details> | ||
|
||
<details> | ||
|
||
### Azure | ||
|
||
```yaml | ||
env: | ||
AZURE_STORAGE_CONNECTION_STRING: | ||
${{ secrets.AZURE_STORAGE_CONNECTION_STRING }} | ||
AZURE_STORAGE_CONTAINER_NAME: ${{ secrets.AZURE_STORAGE_CONTAINER_NAME }} | ||
``` | ||
|
||
</details> | ||
|
||
### Using on-premise machines as self-hosted runners | ||
|
||
You can also use the new `cml-runner` function to set up a local self-hosted | ||
runner. On your local machine or on-premise GPU cluster, you'll install CML as a | ||
package and then run: | ||
|
||
```bash | ||
cml-runner \ | ||
  --repo $your_project_repository_url \ | ||
  --token=$personal_access_token \ | ||
  --labels tf \ | ||
  --idle-timeout 180 | ||
``` | ||
|
||
- `repo_token` should be set to your repository's personal access token | ||
- `RUNNER_REPO` should be set to the URL of your project repository | ||
- The docker container should be given as `dvcorg/cml`, `dvcorg/cml-py3`, | ||
`dvcorg/cml-gpu`, or `dvcorg/cml-gpu-py3` | ||
Now your machine will be listening for workflows from your project repository. |