CML blog post

iterative · Jul 6, 2020 · 8878a3f
1 parent 31e04db
commit 8878a3f

Showing 3 changed files with 197 additions and 0 deletions.
diff --git a/content/blog/2020-07-07-cml-release.md b/content/blog/2020-07-07-cml-release.md
@@ -0,0 +1,197 @@
+---
+title: 'New Release: Continuous Machine Learning (CML) is CI/CD for ML'
+date: 2020-07-07
+description: |
+  The DVC team is releasing a new open-source project called Continuous Machine
+  Learning. CML helps to organize MLOps infrastructure on top of the traditional
+  software engineering stack.
+
+descriptionLong: |
+  Today, the DVC team is releasing a new open-source project called Continuous
+  Machine Learning, or CML (https://cml.dev) to mainstream the best engineering
+  practices of CI/CD to AI and ML teams. CML helps to organize MLOps
+  infrastructure on top of the traditional software engineering stack.
+
+picture: 2020-06-22/release.png
+pictureComment: CML release
+author: dmitry_petrov
+commentsUrl: https://discuss.dvc.org/t/dvc-1-0-release/412
+tags:
+  - Release
+  - CI/CD for ML
+  - MLOps
+  - DataOps
+---
+
+## 1. CI/CD for machine learning problems
+
+Continuous integration and continuous delivery (CI/CD) is a widely used software
+engineering practice. It's a validated approach to increasing the agility of
+software development without sacrificing stability. **But why haven't CI/CD
+practices taken root in machine learning and data science so far?**
+
+We see three substantial technical barriers to using standard CI systems with
+machine learning projects:
+
+1. **Data dependencies.** In ML, data plays a role similar to code: ML results
+   critically depend on datasets, and changes in data need to trigger feedback
+   just like changes in source code. Furthermore, multi-GB datasets are
+   challenging to manage with Git-centric CI systems.
+2. **Metrics-driven.** The traditional software engineering idea of pass/fail
+   tests does not apply in ML. As an example, `+0.72% accuracy` and
+   `-0.35% precision` do not answer the question of whether the ML model is
+   good or not. Detailed reports with metrics and plots are needed to make a
+   good/bad model decision.
+3. **CPU/GPU resources.** ML training often requires more resources than
+   CI/CD runners typically have. CI/CD must be connected with cloud
+   computing instances or Kubernetes clusters for ML training.
+
+## 2. CI/CD for ML is the next step for the DVC team
+
+Since the beginning, our motivation has been helping ML teams benefit from
+DevOps. We started DVC because we knew that data management would be a crucial
+bottleneck, and sure enough, DVC was a big step towards making pipelines and
+experiments manageable and reproducible. But conversations with our community
+have brought us to one conclusion again and again: CI/CD for ML is the holy
+grail.
+
+Over the last 3 years, we've reached some big milestones:
+
+1. We built DVC to address the ML data management problem. Recently, we
+   [released DVC 1.0](https://dvc.org/blog/dvc-1-0-release), marking a new and
+   more stable era for our API.
+2. DVC has become a core part of many ML teams' daily operations. The latest
+   [ThoughtWorks Technology Radar](https://www.thoughtworks.com/radar/tools)
+   says:
+
+   _"... it [DVC] has become a favorite tool for managing experiments in machine
+   learning (ML) projects. Since it's based on Git, DVC is a familiar
+   environment for software developers to bring their engineering practices to
+   ML practice."_
+
+3. An extraordinary team and community have emerged around DVC:
+   - 15 employees in our organization https://iterative.ai
+   - 100+ open-source contributors to DVC https://github.com/iterative/dvc and
+     another 100+ open-source contributors to docs
+     https://github.com/iterative/dvc.org
+   - 2000+ community members in our Discord https://dvc.org/chat and GitHub
+     issue tracker https://github.com/iterative/dvc
+   - 4000+ regular users of DVC
+
+Now that DVC is maturing, we're ready to take the next step: we want to
+revolutionize ML development processes. We want ML experiments to have greater
+visibility to teammates, shorter feedback loops, and more reproducibility. We
+want teams to spend less time managing their computing resources and
+experiments, and more time building value. The goal is to extend the amazing
+results of DevOps from software development to ML and MLOps.
+
+## 3. Continuous Machine Learning release
+
+Today, we're releasing an open-source project https://cml.dev to close the gap
+between machine learning and software development practices.
+
+CML is a library of functions used inside CI/CD runners to make ML compatible
+with **GitHub Actions** and **GitLab CI**. We've created functions to:
+
+1. Generate informative reports on every Pull/Merge Request with metrics, plots,
+   and hyperparameter changes.
+2. Provision GPU/CPU resources from cloud service providers (**AWS, GCP, Azure,
+   Ali**) and deploy CI runners using **docker-machine**.
+3. Bring datasets from cloud storage to runners for model training (using
+   **DVC**), as well as save the resulting model in cloud storage.
+
+![Auto-generated metrics-driven report in GitLab Merge Request](/uploads/images/2020-07-07/cml-report-metrics.png)
+
+The workflow and visual reports are customizable by modifying the CI
+configuration file in your project: `.github/workflows/*.yaml` for GitHub or
+`.gitlab-ci.yml` for GitLab. Use CML functions in conjunction with your own ML
+model training and testing scripts to create your own automated workflow and
+reporting system.
+
+```yaml
+# GitLab workflow in '.gitlab-ci.yml' file
+
+stages:
+  - cml_run
+
+cml:
+  stage: cml_run
+  image: dvcorg/cml-py3:latest
+  script:
+    # Fetch the `data` target and the run-cache from DVC remote storage
+    - dvc pull data --run-cache
+
+    # Install dependencies and reproduce the pipeline
+    - pip install -r requirements.txt
+    - dvc repro
+
+    # Compare metrics to master
+    - git fetch --prune
+    - dvc metrics diff --show-md master >> report.md
+
+    # Visualize loss function diff
+    - dvc plots diff --target loss.csv --show-vega master > vega.json
+    - vl2png vega.json | cml-publish --md >> report.md
+
+    # Push results to remote storage and post the report as a comment
+    - dvc push data --run-cache
+    - cml-send-comment report.md
+```
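+
+For GitHub Actions, a comparable workflow might look like the sketch below. It
+is an assumption-based translation of the GitLab job above rather than a
+configuration taken from this release: the workflow file name, the job and step
+names, and the `repo_token` wiring should be checked against the CML repository,
+and DVC remote storage credentials are assumed to be configured separately as
+repository secrets.
+
+```yaml
+# Hypothetical GitHub workflow in '.github/workflows/cml.yaml' file
+
+name: cml_run
+on: [push]
+
+jobs:
+  cml:
+    runs-on: ubuntu-latest
+    container: docker://dvcorg/cml-py3:latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: cml_run
+        env:
+          # Assumed: token that lets CML post the report as a comment
+          repo_token: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          dvc pull data --run-cache
+          pip install -r requirements.txt
+          dvc repro
+
+          # Compare metrics to master
+          git fetch --prune
+          dvc metrics diff --show-md master >> report.md
+
+          cml-send-comment report.md
+```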
+
+![Hyperparameter change with a result image in GitHub Pull request report](/uploads/images/2020-07-07/cml-report-params.png)
+
+In this example, all the CML functions are defined in the **docker image** that
+is used in the workflow - `dvcorg/cml-py3`. Users can specify any docker image.
+The only restriction is that the CML library needs to be installed to enable
+all the CML commands for the reporting and graphs:
+
+```bash
+npm i @dvcorg/cml
+```
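+
+As a rough illustration, a GitLab job on a custom image could install CML at
+run time before using its commands. This is an untested sketch: the `node:12`
+image, the job name, and the global (`-g`) install are assumptions, and the job
+presumes that the `cml_run` stage and a `repo_token` CI variable are already
+configured for the project.
+
+```yaml
+# Hypothetical job using a generic Node.js image instead of 'dvcorg/cml-py3'
+cml_custom_image:
+  stage: cml_run
+  image: node:12
+  script:
+    # Install the CML commands globally so they are available on PATH
+    - npm i -g @dvcorg/cml
+    - echo "CML report from a custom image" > report.md
+    - cml-send-comment report.md
+```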
+
+Graphs and image commands might require additional packages like `vega-cli` or
+image file converters. This is how the original CML docker image installs all
+the important dependencies:
+
+```dockerfile
+# Install dvc and cml together with their system and npm dependencies
+ADD "./" "/cml"
+RUN wget https://dvc.org/deb/dvc.list -O /etc/apt/sources.list.d/dvc.list && \
+    apt update && \
+    apt -y install dvc && \
+    apt -y install build-essential libcairo2-dev libpango1.0-dev libjpeg-dev \
+      libgif-dev librsvg2-dev && \
+    npm config set user 0 && \
+    npm config set unsafe-perm true && \
+    npm install -g /cml && \
+    npm install -g vega-cli && \
+    npm install -g vega-lite && \
+    apt-get install -y libfontconfig-dev && \
+    apt-get clean && \
+    rm -rf /var/lib/apt/lists/*
+```
+
+Examples of docker images can be found in the
+[CML repository](https://github.com/iterative/cml/).
+
+CML is based on the assumption that MLOps can work with traditional engineering
+tools. It shouldn't require an entirely separate platform. We're excited about a
+world where DevOps practitioners can work fluently on both software and ML
+aspects of a project.
+
+## 4. The relationship between CML and DVC
+
+CML and DVC are related projects under the umbrella of the same team, but will
+have separate websites and independent development. The CML project is hosted on
+a new website: https://cml.dev. The source code and issue tracker are on GitHub:
+https://github.com/iterative/cml
+
+For support and communications, the DVC Discord server is still the place to go:
+https://dvc.org/chat. We've made a new `#cml` channel there to discuss CML,
+CI/CD for ML and other MLOps-related questions.
+
+## 5. Conclusion
+
+With the rise of AI/ML teams and ML platforms alongside the software
+engineering stack, we believe that the industry needs a single technology stack
+that serves software and AI projects alike. A thin layer of tooling is needed
+to close the gap between AI projects and software projects so that both fit
+into the existing stack, and CML is our way to provide it.
+
+Our philosophy is that ML projects, and MLOps practices, should be built on top
+of traditional engineering tools and not as a separate stack. A simple layer of
+tools will be required to close the gap, and CML is part of this ecosystem. We
+think this is the future of MLOps.
+
+As always, thanks for reading and for being part of the DVC community. We'd love
+to hear what you think about CML. Please get in touch on
+[Twitter](https://twitter.com/dvcorg) and [Discord](https://dvc.org/chat)!
diff --git a/static/uploads/images/2020-07-07/cml-report-metrics.png b/static/uploads/images/2020-07-07/cml-report-metrics.png
diff --git a/static/uploads/images/2020-07-07/cml-report-params.png b/static/uploads/images/2020-07-07/cml-report-params.png