From e98c91b9b9b0eefeac1fdbdfb33bea945eed2e30 Mon Sep 17 00:00:00 2001
From: Casper da Costa-Luis
Date: Thu, 29 Apr 2021 23:50:16 +0100
Subject: [PATCH] some README updates (#483)
* some README updates
- badges
- more emphasis on setup-cml than docker
* fix yaml -> bash
- also prevents pre-commit hook from incorrectly reflowing
* readme: update badge label
* em-dash
* readme: more copyedits
* docs: more readme clarifications
* better emoji support
---
README.md | 306 +++++++++++++++++++++++++++++-------------------------
1 file changed, 167 insertions(+), 139 deletions(-)
diff --git a/README.md b/README.md
index 8894e1ae2..c3e71c1a2 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,9 @@
+[![GHA](https://img.shields.io/github/v/tag/iterative/setup-cml?label=GitHub%20Actions&logo=GitHub)](https://github.com/iterative/setup-cml)
+[![npm](https://img.shields.io/npm/v/@dvcorg/cml?logo=npm)](https://www.npmjs.com/package/@dvcorg/cml)
+
**What is CML?** Continuous Machine Learning (CML) is an open-source library for
implementing continuous integration & delivery (CI/CD) in machine learning
projects. Use it to automate parts of your development workflow, including model
@@ -23,40 +26,39 @@ We built CML with these principles in mind:
plots in each Git Pull Request. Rigorous engineering practices help your team
make informed, data-driven decisions.
- **No additional services.** Build your own ML platform using just GitHub or
- GitLab and your favorite cloud services: AWS, Azure, GCP. No databases,
+ GitLab and your favourite cloud services: AWS, Azure, GCP. No databases,
services or complex setup needed.
-_⁉️ Need help? Just want to chat about continuous integration for ML?
-[Visit our Discord channel!](https://discord.gg/bzA6uY7)_
+:question: Need help? Just want to chat about continuous integration for ML?
+[Visit our Discord channel!](https://discord.gg/bzA6uY7)
-🌟🌟🌟 Check out our
+:play_or_pause_button: Check out our
[YouTube video series](https://www.youtube.com/playlist?list=PL7WG7YrwYcnDBDuCkFbcyjnZQrdskFsBz)
-for hands-on MLOps tutorials using CML! 🌟🌟🌟
+for hands-on MLOps tutorials using CML!
## Table of contents
1. [Usage](#usage)
-2. [Getting started](#getting-started)
+2. [Getting started (tutorial)](#getting-started)
3. [Using CML with DVC](#using-cml-with-dvc)
4. [Using self-hosted runners](#using-self-hosted-runners)
5. [Install CML as a package](#install-cml-as-a-package)
-6. [Examples](#a-library-of-cml-projects)
+6. [Example Projects](#see-also)
## Usage
You'll need a GitHub or GitLab account to begin. Users may wish to familiarize
themselves with [GitHub Actions](https://help.github.com/en/actions) or
-[GitLab CI/CD](https://about.gitlab.com/stages-devops-lifecycle/continuous-integration/).
+[GitLab CI/CD](https://about.gitlab.com/stages-devops-lifecycle/continuous-integration).
Here, we will discuss the GitHub use case.
-⚠️ **GitLab users!** Please see our
-[docs about configuring CML with GitLab](https://github.com/iterative/cml/wiki/CML-with-GitLab).
-
-🪣 **Bitbucket Cloud users** We support you, too-
-[see our docs here](https://github.com/iterative/cml/wiki/CML-with-Bitbucket-Cloud).🪣
-_Bitbucket Server support estimated to arrive by January 2021._
-
-The key file in any CML project is `.github/workflows/cml.yaml`.
+- **GitLab users**: Please see our
+ [docs about configuring CML with GitLab](https://github.com/iterative/cml/wiki/CML-with-GitLab).
+- **Bitbucket Cloud users**: Please see our
+ [docs on CML with Bitbucket Cloud](https://github.com/iterative/cml/wiki/CML-with-Bitbucket-Cloud).
+ _Bitbucket Server support estimated to arrive by May 2021._
+- **GitHub Actions users**: The key file in any CML project is
+ `.github/workflows/cml.yaml`:
```yaml
name: your-workflow-name
@@ -64,34 +66,47 @@ on: [push]
jobs:
run:
runs-on: [ubuntu-latest]
- container: docker://dvcorg/cml-py3:latest
+ # optionally use a convenient Ubuntu LTS + CUDA + DVC + CML image
+ # container: docker://dvcorg/cml-py3:latest
steps:
- uses: actions/checkout@v2
- - name: 'Train my model'
- env:
- repo_token: ${{ secrets.GITHUB_TOKEN }}
+ # may need to setup NodeJS & Python3 on e.g. self-hosted
+ # - uses: actions/setup-node@v2
+ # with:
+ # node-version: '12'
+ # - uses: actions/setup-python@v2
+ # with:
+ # python-version: '3.x'
+ - uses: iterative/setup-cml@v1
+ - name: Train model
run: |
-
# Your ML workflow goes here
pip install -r requirements.txt
python train.py
-
- # Write your CML report
+ - name: Write CML report
+ env:
+ repo_token: ${{ secrets.GITHUB_TOKEN }}
+ run: |
+ # Post reports as comments in GitHub PRs
cat results.txt >> report.md
cml-send-comment report.md
```
+We helpfully provide CML and other useful libraries pre-installed on our
+[custom Docker images](https://github.com/iterative/cml/blob/master/docker/Dockerfile).
+In the above example, uncommenting the field
+`container: docker://dvcorg/cml-py3:latest` will make the GitHub Actions runner
+pull the CML Docker image. The image already has NodeJS, Python 3, DVC and CML
+set up on an Ubuntu LTS base with CUDA libraries and
+[Terraform](https://www.terraform.io) installed for convenience.
+
### CML Functions
-CML provides a number of helper functions to help package outputs from ML
-workflows, such as numeric data and data vizualizations about model performance,
-into a CML report. The library comes pre-installed on our
-[custom Docker images](https://github.com/iterative/cml/blob/master/docker/Dockerfile).
-In the above example, note the field `container: docker://dvcorg/cml-py3:latest`
-specifies the CML Docker image with Python 3 will be pulled by the GitHub
-Actions runner.
+CML provides a number of helper functions to help package the outputs of ML
+workflows (including numeric data and visualizations about model performance)
+into a CML report.
-Below is a list of CML functions for writing markdown reports and delivering
+Below is a table of CML functions for writing markdown reports and delivering
those reports to your CI system (GitHub Actions or GitLab CI).
| Function | Description | Inputs |
@@ -105,38 +120,40 @@ those reports to your CI system (GitHub Actions or GitLab CI).
CML reports are written in
[GitHub Flavored Markdown](https://github.github.com/gfm/). That means they can
-contain images, tables, formatted text, HTML blocks, code snippets and more -
+contain images, tables, formatted text, HTML blocks, code snippets and more —
really, what you put in a CML report is up to you. Some examples:
-📝 **Text**. Write to your report using whatever method you prefer. For example,
-copy the contents of a text file containing the results of ML model training:
+:spiral_notepad: **Text** Write to your report using whatever method you prefer.
+For example, copy the contents of a text file containing the results of ML model
+training:
```bash
cat results.txt >> report.md
```
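
A slightly fuller sketch of the text case, for illustration only. The sample `results.txt` contents and the "Model metrics" heading are assumptions made up for this example; only the `cat results.txt >> report.md` step comes from the README.

```bash
# Hypothetical sample output standing in for a real training run
printf 'accuracy: 0.91\n' > results.txt

# Start the report with a heading, then append the results file
echo "## Model metrics" > report.md
echo "" >> report.md
cat results.txt >> report.md
```

The resulting `report.md` is ordinary markdown, so it can be passed to `cml-send-comment` as in the workflow shown earlier.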
-🖼️ **Images** Display images using the markdown or HTML. Note that if an image
-is an output of your ML workflow (i.e., it is produced by your workflow), you
-will need to use the `cml-publish` function to include it a CML report. For
-example, if `graph.png` is the output of my workflow `python train.py`, run:
+:framed_picture: **Images** Display images using markdown or HTML. Note that
+if an image is an output of your ML workflow (i.e., it is produced by your
+workflow), you will need to use the `cml-publish` function to include it in a
+CML report. For example, if `graph.png` is output by `python train.py`, run:
```bash
cml-publish graph.png --md >> report.md
```
-## Getting started
+## Getting Started
1. Fork our
- [example project repository](https://github.com/iterative/example_cml). ⚠️
- Note that if you are using GitLab,
- [you will need to create a Personal Access Token](https://github.com/iterative/cml/wiki/CML-with-GitLab#variables)
- for this example to work.
+ [example project repository](https://github.com/iterative/example_cml).
+
+> :warning: Note that if you are using GitLab,
+> [you will need to create a Personal Access Token](https://github.com/iterative/cml/wiki/CML-with-GitLab#variables)
+> for this example to work.
![](imgs/fork_project.png)
-The following steps can all be done in the GitHub browser interface. However, to
-follow along the commands, we recommend cloning your fork to your local
-workstation:
+> :warning: The following steps can all be done in the GitHub browser interface.
+> However, to follow along with the commands, we recommend cloning your fork to
+> your local workstation:
```bash
git clone https://github.com//example_cml
@@ -151,10 +168,11 @@ on: [push]
jobs:
run:
runs-on: [ubuntu-latest]
- container: docker://dvcorg/cml-py3:latest
steps:
- uses: actions/checkout@v2
- - name: 'Train my model'
+ - uses: actions/setup-python@v2
+ - uses: iterative/setup-cml@v1
+ - name: Train model
env:
repo_token: ${{ secrets.GITHUB_TOKEN }}
run: |
@@ -166,9 +184,9 @@ jobs:
cml-send-comment report.md
```
-4. In your text editor of choice, edit line 16 of `train.py` to `depth = 5`.
+3. In your text editor of choice, edit line 16 of `train.py` to `depth = 5`.
-5. Commit and push the changes:
+4. Commit and push the changes:
```bash
git checkout -b experiment
@@ -176,34 +194,38 @@ git add . && git commit -m "modify forest depth"
git push origin experiment
```
-6. In GitHub, open up a Pull Request to compare the `experiment` branch to
+5. In GitHub, open up a Pull Request to compare the `experiment` branch to
`master`.
![](imgs/make_pr.png)
Shortly, you should see a comment from `github-actions` appear in the Pull
-Request with your CML report. This is a result of the function
-`cml-send-comment` in your workflow.
+Request with your CML report. This is a result of the `cml-send-comment`
+function in your workflow.
![](imgs/cml_first_report.png)
-This is the gist of the CML workflow: when you push changes to your GitHub
-repository, the workflow in your `.github/workflows/cml.yaml` file gets run and
-a report generated. CML functions let you display relevant results from the
-workflow, like model performance metrics and vizualizations, in GitHub checks
-and comments. What kind of workflow you want to run, and want to put in your CML
-report, is up to you.
+This is the outline of the CML workflow:
+
+- you push changes to your GitHub repository,
+- the workflow in your `.github/workflows/cml.yaml` file gets run, and
+- a report is generated and posted to GitHub.
+
+CML functions let you display relevant results from the workflow — such as model
+performance metrics and visualizations — in GitHub checks and comments. What
+kind of workflow you want to run, and want to put in your CML report, is up to
+you.
## Using CML with DVC
-In many ML projects, data isn't stored in a Git repository and needs to be
+In many ML projects, data isn't stored in a Git repository, but needs to be
downloaded from external sources. [DVC](https://dvc.org) is a common way to
bring data to your CML runner. DVC also lets you visualize how metrics differ
between commits to make reports like this:
![](imgs/dvc_cml_long_report.png)
-The `.github/workflows/cml.yaml` file to create this report is:
+The `.github/workflows/cml.yaml` file used to create this report is:
```yaml
name: model-training
@@ -214,8 +236,7 @@ jobs:
container: docker://dvcorg/cml-py3:latest
steps:
- uses: actions/checkout@v2
- - name: 'Train my model'
- shell: bash
+ - name: Train model
env:
repo_token: ${{ secrets.GITHUB_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
@@ -234,26 +255,27 @@ jobs:
dvc metrics diff master --show-md >> report.md
# Publish confusion matrix diff
- echo -e "## Plots\n### Class confusions" >> report.md
+ echo "## Plots" >> report.md
+ echo "### Class confusions" >> report.md
dvc plots diff --target classes.csv --template confusion -x actual -y predicted --show-vega master > vega.json
vl2png vega.json -s 1.5 | cml-publish --md >> report.md
# Publish regularization function diff
- echo "### Effects of regularization\n" >> report.md
+ echo "### Effects of regularization" >> report.md
dvc plots diff --target estimators.csv -x Regularization --show-vega master > vega.json
vl2png vega.json -s 1.5 | cml-publish --md >> report.md
cml-send-comment report.md
```
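
Whether `echo` interprets `\n` depends on the shell and its options, which is presumably why the hunk above moves to one `echo` per heading. As an alternative sketch (not what the patch uses), `printf` handles the escape consistently across POSIX shells. The report is reset first here only so the example is self-contained.

```bash
# Start from an empty report for this demonstration
rm -f report.md

# printf always interprets \n in its format string, unlike echo
printf '## Plots\n### Class confusions\n' >> report.md
printf '### Effects of regularization\n' >> report.md
```

Either approach writes each heading on its own line; the one-`echo`-per-line form in the patch is the simpler choice inside a YAML `run:` block.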
-If you're using DVC with cloud storage, take note of environmental variables for
-your storage format.
+> :warning: If you're using DVC with cloud storage, take note of environment
+> variables for your storage format.
-### Environmental variables for supported cloud providers
+### Environment variables for supported cloud providers
- S3 and S3 compatible storage (Minio, DigitalOcean Spaces, IBM Cloud Object Storage...)
+ S3 and S3-compatible storage (Minio, DigitalOcean Spaces, IBM Cloud Object Storage...)
```yaml
@@ -264,7 +286,7 @@ env:
AWS_SESSION_TOKEN: ${{ secrets.AWS_SESSION_TOKEN }}
```
-> :point_right: AWS_SESSION_TOKEN is optional.
+> :point_right: `AWS_SESSION_TOKEN` is optional.
@@ -302,9 +324,10 @@ env:
Google Storage
-> :warning: Normally, GOOGLE_APPLICATION_CREDENTIALS points to the path of the
-> json file that contains the credentials. However in the action this variable
-> CONTAINS the content of the file. Copy that json and add it as a secret.
+> :warning: Normally, `GOOGLE_APPLICATION_CREDENTIALS` is the **path** of the
+> `json` file containing the credentials. However, in the action this secret
+> variable is the **contents** of the file. Copy the `json` contents and add
+> them as a secret.
```yaml
env:
@@ -320,9 +343,9 @@ env:
> :warning: After configuring your
> [Google Drive credentials](https://dvc.org/doc/command-reference/remote/add)
-> you will find a json file at
-> `your_project_path/.dvc/tmp/gdrive-user-credentials.json`. Copy that json and
-> add it as a secret.
+> you will find a `json` file at
+> `your_project_path/.dvc/tmp/gdrive-user-credentials.json`. Copy its contents
+> and add them as a secret variable.
```yaml
env:
@@ -335,81 +358,82 @@ env:
GitHub Actions are run on GitHub-hosted runners by default. However, there are
many great reasons to use your own runners: to take advantage of GPUs; to
-orchestrate your team's shared computing resources, or to train in the cloud.
+orchestrate your team's shared computing resources; or to access on-premise
+data.
-☝️ **Tip!** Check out the
-[official GitHub documentation](https://help.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners)
-to get started setting up your self-hosted runner.
+> :point_up: **Tip!** Check out the
+> [official GitHub documentation](https://help.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners)
+> to get started setting up your own self-hosted runner.
### Allocating cloud resources with CML
-When a workflow requires computational resources (such as GPUs) CML can
+When a workflow requires computational resources (such as GPUs), CML can
automatically allocate cloud instances using `cml-runner`. You can spin up
instances on your AWS or Azure account (GCP support is forthcoming!).
For example, the following workflow deploys a `t2.micro` instance on AWS EC2 and
trains a model on the instance. After the job runs, the instance automatically
-shuts down. You might notice that this workflow is quite similar to the
-[basic use case](#usage) highlighted in the beginning of the docs- that's
-because it is! What's new is that we've added `cml-runner`, plus a few
-environmental variables for passing your cloud service credentials to the
+shuts down.
+
+You might notice that this workflow is quite similar to the
+[basic use case](#usage) above. The only additions are `cml-runner` and a few
+environment variables for passing your cloud service credentials to the
workflow.
```yaml
-name: "Train-in-the-cloud"
+name: Train-in-the-cloud
on: [push]
-
jobs:
deploy-runner:
runs-on: [ubuntu-latest]
steps:
- uses: iterative/setup-cml@v1
- uses: actions/checkout@v2
- - name: "Deploy runner on EC2"
- shell: bash
+ - name: Deploy runner on EC2
env:
repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
run: |
cml-runner \
- --cloud aws \
- --cloud-region us-west \
- --cloud-type=t2.micro \
- --labels=cml-runner
- name: model-training
- needs: deploy-runner
- runs-on: [self-hosted,cml-runner]
+ --cloud aws \
+ --cloud-region us-west \
+ --cloud-type=t2.micro \
+ --labels=cml-runner
+ model-training:
+ needs: [deploy-runner]
+ runs-on: [self-hosted, cml-runner]
container: docker://dvcorg/cml-py3:latest
steps:
- - uses: actions/checkout@v2
- - name: "Train my model"
- env:
- repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
- run: |
- pip install -r requirements.txt
- python train.py
- # Publish report with CML
- cat metrics.txt > report.md
- cml-send-comment report.md
+ - uses: actions/checkout@v2
+ - name: Train model
+ env:
+ repo_token: ${{ secrets.PERSONAL_ACCESS_TOKEN }}
+ run: |
+ pip install -r requirements.txt
+ python train.py
+
+ cat metrics.txt > report.md
+ cml-send-comment report.md
```
-In the above workflow, the step `deploy-runner` launches an EC2 `t2-micro`
-instance in the `us-west` region. The next step, `model-training`, runs on the
-newly launched instance.
+In the workflow above, the `deploy-runner` job launches an EC2 `t2.micro`
+instance in the `us-west` region. The `model-training` job then runs on the
+newly launched instance.
-**Note that you can use any container with this workflow!** While you must have
-CML and its dependencies setup to use CML functions like `cml-send-comment` from
-your instance, you can create your favorite training environment in the cloud by
-pulling the Docker container of your choice.
+> :tada: **Note that you can use any container with this workflow!** While you
+> must [have CML and its dependencies set up](#install-cml-as-a-package) to use
+> functions such as `cml-send-comment` from your instance, you can create your
+> favourite training environment in the cloud by pulling the Docker container of
+> your choice.
We like the CML container (`docker://dvcorg/cml-py3`) because it comes loaded
with Python, CUDA, `git`, `node` and other essentials for full-stack data
-science. But we don't mind if you do it your way :)
+science.
### Arguments
-The function `cml-runner` accepts the following arguments:
+The `cml-runner` function accepts the following arguments:
```
Usage: cml-runner.js
@@ -462,30 +486,31 @@ Options:
-h Show help [boolean]
```
-### Environmental variables
+### Environment variables
-You will need to
-[create a personal access token](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line)
-with repository read/write access and workflow privileges. In the example
-workflow, this token is stored as `PERSONAL_ACCESS_TOKEN`.
+> :warning: You will need to
+> [create a personal access token](https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line)
+> with repository read/write access and workflow privileges. In the example
+> workflow, this token is stored as `PERSONAL_ACCESS_TOKEN`.
Note that you will also need to provide access credentials for your cloud
compute resources as secrets. In the above example, `AWS_ACCESS_KEY_ID` and
`AWS_SECRET_ACCESS_KEY` are required to deploy EC2 instances.
Please see our docs about
-[environmental variables needed to authenticate with supported cloud services](#environmental-variables-for-supported-cloud-providers).
+[environment variables needed to authenticate with supported cloud services](#environment-variables-for-supported-cloud-providers).
-### Using on-premise machines as self-hosted runners
+### On-premise (local) runners
-You can also use the new `cml-runner` function to set up a local self-hosted
-runner. On your local machine or on-premise GPU cluster, you'll install CML as a
-package and then run:
+This means using on-premise machines as self-hosted runners. The `cml-runner`
+function is used to set up a local self-hosted runner. On your local machine or
+on-premise GPU cluster, [install CML as a package](#install-cml-as-a-package)
+and then run:
-```yaml
+```bash
cml-runner \
--repo $your_project_repository_url \
- --token=$personal_access_token \
+ --token=$PERSONAL_ACCESS_TOKEN \
--labels tf \
--idle-timeout 180
```
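
Before launching a local runner, it can help to confirm that both variables used in the command above are set. The sketch below is a hypothetical helper, not part of CML: `check_runner_env` is a name invented for this example, and it only inspects the two shell variables referenced in the snippet.

```bash
# Hypothetical pre-flight check (not a CML command): returns non-zero
# if either variable used by the cml-runner invocation is empty or unset.
check_runner_env() {
  missing=0
  if [ -z "${your_project_repository_url:-}" ]; then
    echo "your_project_repository_url is not set" >&2
    missing=1
  fi
  if [ -z "${PERSONAL_ACCESS_TOKEN:-}" ]; then
    echo "PERSONAL_ACCESS_TOKEN is not set" >&2
    missing=1
  fi
  return "$missing"
}

# Intended use (left as a comment so the sketch runs without CML installed):
#   check_runner_env && cml-runner --repo "$your_project_repository_url" ...
```

Failing early like this avoids starting a runner that cannot register with the repository.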
@@ -494,8 +519,9 @@ Now your machine will be listening for workflows from your project repository.
## Install CML as a package
-In the above examples, CML is pre-installed in a custom Docker image, which is
-pulled by a CI runner. You can also install CML as a package:
+In the examples above, CML is installed by the `setup-cml` action, or comes
+pre-installed in a custom Docker image pulled by a CI runner. You can also
+install CML as a package:
```bash
npm i -g @dvcorg/cml
@@ -506,26 +532,28 @@ CLI commands:
```bash
sudo apt-get install -y libcairo2-dev libpango1.0-dev libjpeg-dev libgif-dev \
- librsvg2-dev libfontconfig-dev
+ librsvg2-dev libfontconfig-dev
npm install -g vega-cli vega-lite
```
-CML and Vega-Lite package installation require `npm` command from Node package.
-Below you can find how to install Node.
+CML and Vega-Lite package installation requires the NodeJS package manager
+(`npm`), which ships with NodeJS. Installation instructions are below.
-### Install Node in GitHub
+### Install NodeJS in GitHub
-In GitHub there is a special action for NPM installation:
+This is probably not necessary when using GitHub's default containers or one of
+CML's Docker containers. Self-hosted runners may need to use a setup action to
+install NodeJS:
```bash
-uses: actions/setup-node@v1
+uses: actions/setup-node@v2
with:
node-version: '12'
```
-### Install Node in GitLab
+### Install NodeJS in GitLab
-GitLab requires direct installation of the NMP package:
+GitLab requires direct installation of NodeJS:
```bash
curl -sL https://deb.nodesource.com/setup_12.x | bash
@@ -533,9 +561,9 @@ apt-get update
apt-get install -y nodejs
```
-## A library of CML projects
+## See Also
-Here are some example projects using CML.
+These are some example projects using CML.
- [Basic CML project](https://github.com/iterative/cml_base_case)
- [CML with DVC to pull data](https://github.com/iterative/cml_dvc_case)